This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Commit 90aa796
Author: Ryan Sepassi
Internal merge PR#370
PiperOrigin-RevId: 173611718
1 parent 86703a2 commit 90aa796

File tree: 16 files changed, +928 −799 lines

README.md
Lines changed: 1 addition & 1 deletion

@@ -286,7 +286,7 @@ registrations.
 To add a new dataset, subclass
 [`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
 and register it with `@registry.register_problem`. See
-[`TranslateEndeWmt8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+[`TranslateEndeWmt8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate_ende.py)
 for an example.

 Also see the [data generators

docs/new_problem.md
Lines changed: 1 addition & 1 deletion

@@ -105,7 +105,7 @@ We're almost done. `generator` generates the training and evaluation data and
 stores them in files like "word2def_train.lang1" in your DATA_DIR. Thankfully
 several commonly used methods like `character_generator`, and `token_generator`
 are already written in the file
-[`wmt.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py).
+[`translate.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate.py).
 We will import `character_generator` and
 [`text_encoder`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/text_encoder.py)
 to write:
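As a rough illustration of what such a generator produces, here is a self-contained sketch in the spirit of `character_generator`, operating on in-memory lines rather than file paths. The byte offset of 2 and EOS id of 1 mirror `text_encoder`'s reserved-token convention (PAD=0, EOS=1) but are restated here as assumptions, not imports:

```python
# Assumed reserved-token convention: PAD=0, EOS=1, so byte ids start at 2.
EOS = 1
NUM_RESERVED = 2

def byte_encode(text):
  # ByteTextEncoder-style encoding: raw UTF-8 bytes shifted past reserved ids.
  return [b + NUM_RESERVED for b in text.encode("utf-8")]

def character_generator(source_lines, target_lines, eos=EOS):
  # Yields the dict-of-int-lists examples that t2t data generators produce.
  for source, target in zip(source_lines, target_lines):
    yield {"inputs": byte_encode(source) + [eos],
           "targets": byte_encode(target) + [eos]}

pairs = list(character_generator(["word"], ["a definition"]))
```

Each yielded dict is one training example; downstream code serializes these into `tensorflow.Example` protos.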

docs/walkthrough.md
Lines changed: 1 addition & 1 deletion

@@ -286,7 +286,7 @@ registrations.
 To add a new dataset, subclass
 [`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
 and register it with `@registry.register_problem`. See
-[`TranslateEndeWmt8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+[`TranslateEndeWmt8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate_ende.py)
 for an example.

 Also see the [data generators

tensor2tensor/bin/t2t-datagen
Lines changed: 3 additions & 3 deletions

@@ -43,7 +43,7 @@ from tensor2tensor.data_generators import all_problems  # pylint: disable=unused
 from tensor2tensor.data_generators import audio
 from tensor2tensor.data_generators import generator_utils
 from tensor2tensor.data_generators import snli
-from tensor2tensor.data_generators import wmt
+from tensor2tensor.data_generators import translate
 from tensor2tensor.data_generators import wsj_parsing
 from tensor2tensor.utils import registry
 from tensor2tensor.utils import usr_dir
@@ -82,9 +82,9 @@ _SUPPORTED_PROBLEM_GENERATORS = {
         lambda: algorithmic_math.algebra_inverse(26, 0, 2, 100000),
         lambda: algorithmic_math.algebra_inverse(26, 3, 3, 10000)),
     "parsing_english_ptb8k": (
-        lambda: wmt.parsing_token_generator(
+        lambda: translate.parsing_token_generator(
             FLAGS.data_dir, FLAGS.tmp_dir, True, 2**13),
-        lambda: wmt.parsing_token_generator(
+        lambda: translate.parsing_token_generator(
             FLAGS.data_dir, FLAGS.tmp_dir, False, 2**13)),
     "parsing_english_ptb16k": (
         lambda: wsj_parsing.parsing_token_generator(

tensor2tensor/data_generators/README.md
Lines changed: 3 additions & 3 deletions

@@ -23,7 +23,7 @@ All tasks produce TFRecord files of `tensorflow.Example` protocol buffers.
 To add a new problem, subclass
 [`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
 and register it with `@registry.register_problem`. See
-[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+[`TranslateEndeWmt8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate_ende.py)
 for an example.

 `Problem`s support data generation, training, and decoding.
@@ -37,7 +37,7 @@ for training/decoding, e.g. a vocabulary file.
 A particularly easy way to implement `Problem.generate_data` for your dataset is
 to create 2 Python generators, one for the training data and another for the
 dev data, and pass them to `generator_utils.generate_dataset_and_shuffle`. See
-[`WMTEnDeTokens8k.generate_data`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+[`TranslateEndeWmt8k.generate_data`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate_ende.py)
 for an example of usage.

 The generators should yield dictionaries with string keys and values being lists
@@ -66,5 +66,5 @@ Some examples:

 * [Algorithmic problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
   and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic_test.py)
-* [WMT problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+* [WMT En-De problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/translate_ende.py)
   and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt_test.py)
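The hunk above describes implementing `Problem.generate_data` with two Python generators. A toy sketch of that pattern; `write_records` is a hypothetical stand-in for `generator_utils.generate_dataset_and_shuffle`, which in the real library writes shuffled TFRecord shards to disk:

```python
import random

def train_generator():
  # Examples are dicts: string keys -> lists of ints (token ids).
  for i in range(3):
    yield {"inputs": [i, i + 1], "targets": [i + 2]}

def dev_generator():
  yield {"inputs": [9], "targets": [10]}

def write_records(train_gen, dev_gen, seed=0):
  # Hypothetical stand-in for generate_dataset_and_shuffle: materialize
  # and shuffle the training examples in memory; dev data stays in order.
  train = list(train_gen)
  random.Random(seed).shuffle(train)
  return train, list(dev_gen)

train, dev = write_records(train_generator(), dev_generator())
```

Shuffling at data-generation time is what lets training read the records sequentially without biasing batches toward the original file order.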

tensor2tensor/data_generators/all_problems.py
Lines changed: 5 additions & 1 deletion

@@ -33,8 +33,12 @@
 from tensor2tensor.data_generators import problem_hparams
 from tensor2tensor.data_generators import ptb
 from tensor2tensor.data_generators import snli
+from tensor2tensor.data_generators import translate_encs
+from tensor2tensor.data_generators import translate_ende
+from tensor2tensor.data_generators import translate_enfr
+from tensor2tensor.data_generators import translate_enmk
+from tensor2tensor.data_generators import translate_enzh
 from tensor2tensor.data_generators import wiki
-from tensor2tensor.data_generators import wmt
 from tensor2tensor.data_generators import wsj_parsing

tensor2tensor/data_generators/generator_utils.py
Lines changed: 15 additions & 48 deletions

@@ -264,41 +264,6 @@ def gunzip_file(gz_path, new_path):
       new_file.write(line)


-# TODO(aidangomez): en-fr tasks are significantly over-represented below
-_DATA_FILE_URLS = [
-    # German-English
-    [
-        "http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz",  # pylint: disable=line-too-long
-        [
-            "training-parallel-nc-v11/news-commentary-v11.de-en.en",
-            "training-parallel-nc-v11/news-commentary-v11.de-en.de"
-        ]
-    ],
-    # German-English & French-English
-    [
-        "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz", [
-            "commoncrawl.de-en.en", "commoncrawl.de-en.de",
-            "commoncrawl.fr-en.en", "commoncrawl.fr-en.fr"
-        ]
-    ],
-    [
-        "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz", [
-            "training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de",
-            "training/europarl-v7.fr-en.en", "training/europarl-v7.fr-en.fr"
-        ]
-    ],
-    # French-English
-    [
-        "http://www.statmt.org/wmt10/training-giga-fren.tar",
-        ["giga-fren.release2.fixed.en.gz", "giga-fren.release2.fixed.fr.gz"]
-    ],
-    [
-        "http://www.statmt.org/wmt13/training-parallel-un.tgz",
-        ["un/undoc.2000.fr-en.en", "un/undoc.2000.fr-en.fr"]
-    ],
-]
-
-
 def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
                                 generator):
   """Inner implementation for vocab generators.
@@ -337,13 +302,9 @@ def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
   return vocab


-def get_or_generate_vocab(data_dir,
-                          tmp_dir,
-                          vocab_filename,
-                          vocab_size,
-                          sources=None):
-  """Generate a vocabulary from the datasets in sources (_DATA_FILE_URLS)."""
-  sources = sources or _DATA_FILE_URLS
+def get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size,
+                          sources):
+  """Generate a vocabulary from the datasets in sources."""

   def generate():
     tf.logging.info("Generating vocab from: %s", str(sources))
@@ -375,13 +336,19 @@ def generate():

       # Use Tokenizer to count the word occurrences.
       with tf.gfile.GFile(filepath, mode="r") as source_file:
-        file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+        file_byte_budget = 1e6
+        counter = 0
+        countermax = int(source_file.size() / file_byte_budget / 2)
         for line in source_file:
-          if file_byte_budget <= 0:
-            break
-          line = line.strip()
-          file_byte_budget -= len(line)
-          yield line
+          if counter < countermax:
+            counter += 1
+          else:
+            if file_byte_budget <= 0:
+              break
+            line = line.strip()
+            file_byte_budget -= len(line)
+            counter = 0
+            yield line

   return get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size,
                                      generate())
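The new loop in `generate()` spreads the byte budget evenly across each file: instead of consuming only the first `file_byte_budget` bytes, it skips `countermax` lines for every line it keeps. A self-contained sketch of that sampling logic, lifted from the hunk above (the helper name and the in-memory `lines` argument are illustrative; the real code iterates a `tf.gfile.GFile`):

```python
def sample_lines(lines, file_size, file_byte_budget=1e6):
  """Keep one line out of every (countermax + 1), so roughly
  file_byte_budget bytes are drawn evenly from across the file."""
  counter = 0
  countermax = int(file_size / file_byte_budget / 2)
  for line in lines:
    if counter < countermax:
      counter += 1
    else:
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      counter = 0
      yield line

# For a 4 MB "file", countermax = int(4e6 / 1e6 / 2) = 2, so every third
# line is kept until the budget runs out.
lines = ["line %d" % i for i in range(10)]
kept = list(sample_lines(lines, file_size=4e6))
```

Sampling from the whole file rather than its head gives the vocabulary a more representative word distribution for corpora that are sorted or topically clustered.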

tensor2tensor/data_generators/ice_parsing.py
Lines changed: 5 additions & 3 deletions

@@ -32,7 +32,7 @@
 from tensor2tensor.data_generators import generator_utils
 from tensor2tensor.data_generators import problem
 from tensor2tensor.data_generators import text_encoder
-from tensor2tensor.data_generators.wmt import tabbed_generator
+from tensor2tensor.data_generators import translate
 from tensor2tensor.utils import registry


@@ -51,15 +51,17 @@ def tabbed_parsing_token_generator(data_dir, tmp_dir, train, prefix,
       data_dir, tmp_dir, filename, 1,
       prefix + "_target.tokens.vocab.%d" % target_vocab_size, target_vocab_size)
   pair_filepath = os.path.join(tmp_dir, filename)
-  return tabbed_generator(pair_filepath, source_vocab, target_vocab, EOS)
+  return translate.tabbed_generator(pair_filepath, source_vocab, target_vocab,
+                                    EOS)


 def tabbed_parsing_character_generator(tmp_dir, train):
   """Generate source and target data from a single file."""
   character_vocab = text_encoder.ByteTextEncoder()
   filename = "parsing_{0}.pairs".format("train" if train else "dev")
   pair_filepath = os.path.join(tmp_dir, filename)
-  return tabbed_generator(pair_filepath, character_vocab, character_vocab, EOS)
+  return translate.tabbed_generator(pair_filepath, character_vocab,
+                                    character_vocab, EOS)


 @registry.register_problem
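For context, `translate.tabbed_generator` (as called in both functions above) consumes a file of tab-separated source/target pairs. A rough self-contained sketch of that behavior, operating on an in-memory line list and using a hypothetical byte-level stand-in for `text_encoder.ByteTextEncoder`:

```python
EOS = 1

class ByteEncoder(object):
  # Hypothetical stand-in for text_encoder.ByteTextEncoder: UTF-8 bytes
  # shifted past two assumed reserved ids (PAD=0, EOS=1).
  def encode(self, text):
    return [b + 2 for b in text.encode("utf-8")]

def tabbed_generator(lines, source_vocab, target_vocab, eos=EOS):
  # Sketch of translate.tabbed_generator over in-memory lines instead of a
  # file path: each line holds "source<TAB>target"; lines without a tab
  # are skipped.
  for line in lines:
    if "\t" not in line:
      continue
    source, target = line.strip().split("\t", 1)
    yield {"inputs": source_vocab.encode(source) + [eos],
           "targets": target_vocab.encode(target) + [eos]}

vocab = ByteEncoder()
examples = list(tabbed_generator(["hi\tthere"], vocab, vocab))
```

Passing the same vocab twice, as `tabbed_parsing_character_generator` does with `character_vocab`, simply encodes both sides of each pair with one shared encoder.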
