@@ -264,41 +264,6 @@ def gunzip_file(gz_path, new_path):
         new_file.write(line)
 
 
-# TODO(aidangomez): en-fr tasks are significantly over-represented below
-_DATA_FILE_URLS = [
-    # German-English
-    [
-        "http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz",  # pylint: disable=line-too-long
-        [
-            "training-parallel-nc-v11/news-commentary-v11.de-en.en",
-            "training-parallel-nc-v11/news-commentary-v11.de-en.de"
-        ]
-    ],
-    # German-English & French-English
-    [
-        "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz", [
-            "commoncrawl.de-en.en", "commoncrawl.de-en.de",
-            "commoncrawl.fr-en.en", "commoncrawl.fr-en.fr"
-        ]
-    ],
-    [
-        "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz", [
-            "training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de",
-            "training/europarl-v7.fr-en.en", "training/europarl-v7.fr-en.fr"
-        ]
-    ],
-    # French-English
-    [
-        "http://www.statmt.org/wmt10/training-giga-fren.tar",
-        ["giga-fren.release2.fixed.en.gz", "giga-fren.release2.fixed.fr.gz"]
-    ],
-    [
-        "http://www.statmt.org/wmt13/training-parallel-un.tgz",
-        ["un/undoc.2000.fr-en.en", "un/undoc.2000.fr-en.fr"]
-    ],
-]
-
-
 def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
                                 generator):
   """Inner implementation for vocab generators.
@@ -337,13 +302,9 @@ def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
   return vocab
 
 
-def get_or_generate_vocab(data_dir,
-                          tmp_dir,
-                          vocab_filename,
-                          vocab_size,
-                          sources=None):
-  """Generate a vocabulary from the datasets in sources (_DATA_FILE_URLS)."""
-  sources = sources or _DATA_FILE_URLS
+def get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size,
+                          sources):
+  """Generate a vocabulary from the datasets in sources."""
 
   def generate():
     tf.logging.info("Generating vocab from: %s", str(sources))
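With this change callers must pass sources explicitly; each entry keeps the [url, [filenames]] shape the removed _DATA_FILE_URLS constant used. A minimal sketch of the new call, reusing one Europarl entry from the deleted constant (the directory paths, vocab filename, and vocab size here are hypothetical):

    sources = [
        ["http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz",
         ["training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de"]],
    ]
    vocab = get_or_generate_vocab("/tmp/t2t_data", "/tmp/t2t_tmp",
                                  "vocab.ende.32768", 32768, sources)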
@@ -375,13 +336,19 @@ def generate():
 
       # Use Tokenizer to count the word occurrences.
       with tf.gfile.GFile(filepath, mode="r") as source_file:
-        file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+        file_byte_budget = 1e6
+        counter = 0
+        countermax = int(source_file.size() / file_byte_budget / 2)
         for line in source_file:
-          if file_byte_budget <= 0:
-            break
-          line = line.strip()
-          file_byte_budget -= len(line)
-          yield line
+          if counter < countermax:
+            counter += 1
+          else:
+            if file_byte_budget <= 0:
+              break
+            line = line.strip()
+            file_byte_budget -= len(line)
+            counter = 0
+            yield line
 
   return get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
                                      generate())
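The old logic consumed only the first 350 KB (English files) or 700 KB (other languages) of each file, so the vocabulary only ever saw the head of each corpus. The new logic spreads a 1 MB budget across the whole file: countermax estimates how many lines to skip between samples, and every (countermax + 1)-th line is yielded until the budget is spent. A minimal standalone sketch of the same scheme, assuming plain Python file I/O in place of tf.gfile (sample_lines is an illustrative helper, not part of the patch):

    import os

    def sample_lines(filepath, file_byte_budget=1e6):
      """Yields lines spaced evenly through the file until the budget runs out."""
      countermax = int(os.path.getsize(filepath) / file_byte_budget / 2)
      counter = 0
      with open(filepath) as source_file:
        for line in source_file:
          if counter < countermax:
            counter += 1  # still skipping lines between samples
          else:
            if file_byte_budget <= 0:
              return
            line = line.strip()
            file_byte_budget -= len(line)
            counter = 0
            yield line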