Tensor2Tensor supports running on Google Cloud Platform's TPUs, chips
specialized for ML training.

Models and hparams that are known to work on TPU:
* `transformer` with `transformer_tpu`
* `xception` with `xception_base`
* `resnet50` with `resnet_base`

To run on TPUs, you need to be part of the alpha program; if you're not, these
commands won't work for you currently, but access will expand soon, so get
excited for your future ML supercomputers in the cloud.

## Tutorial: Transformer En-De translation on TPU

Update `gcloud`: `gcloud components update`

Set your default zone to a TPU-enabled zone. TPU machines are only available in
certain zones for now.
```
gcloud config set compute/zone <tpu-enabled-zone>
```

Create a GCE VM (the `$USER-vm` used below) to run the Python trainer, and a
Cloud TPU instance to train on, with `gcloud compute instances create` and
`gcloud alpha compute tpus create` respectively.

To see all TPU instances running: `gcloud alpha compute tpus list`. The
`TPU_IP` should be unique amongst the list and follow the format `10.240.i.2`.

SSH in with port forwarding for TensorBoard.
```
# Everything after "--" is passed through to ssh; -L forwards local port 6006
# so the TensorBoard you start on the VM below is viewable from your browser.
gcloud compute ssh $USER-vm -- -L 6006:localhost:6006
```

Now that you're on the cloud instance, install T2T:
```
pip install tensor2tensor --user
# If your python bin dir isn't already in your path
export PATH=$HOME/.local/bin:$PATH
```

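A quick, optional way to confirm the install put the T2T binaries on your
`PATH` before continuing:
```
# Should print a path under $HOME/.local/bin
which t2t-trainer
```
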
Generate data to GCS.
If you already have the data, use `gsutil cp` to copy it to GCS (example
below).
```
GCS_BUCKET=gs://my-bucket
DATA_DIR=$GCS_BUCKET/t2t/data/
t2t-datagen --problem=translate_ende_wmt8k --data_dir=$DATA_DIR
```

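If you generated the data elsewhere already, a copy along these lines works
instead of rerunning `t2t-datagen` (the local path here is illustrative):
```
# Hypothetical local path; point this at wherever t2t-datagen wrote your files
gsutil cp /tmp/t2t_data/* $DATA_DIR
```
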
Set up some vars used below. `TPU_IP` and `DATA_DIR` should be the same as what
was used above. Note that the `DATA_DIR` and `OUT_DIR` must be GCS buckets.
```
TPU_IP=<IP of TPU machine>
DATA_DIR=$GCS_BUCKET/t2t/data/
OUT_DIR=$GCS_BUCKET/t2t/training/
# 8470 is the port the Cloud TPU serves gRPC on
TPU_MASTER=grpc://$TPU_IP:8470
```

Launch TensorBoard in the background:
```
tensorboard --logdir=$OUT_DIR > /tmp/tensorboard_logs.txt 2>&1 &
```

Train and evaluate.
```
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problems=translate_ende_wmt8k \
  --train_steps=10 \
  --eval_steps=10 \
  --local_eval_frequency=10 \
  --iterations_per_loop=10 \
  --master=$TPU_MASTER \
  --use_tpu=True \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```

The above command will train for 10 steps, then evaluate for 10 steps. You can
(and should) increase the number of total training steps with the
`--train_steps` flag. Evaluation will happen every `--local_eval_frequency`
steps, each time for `--eval_steps`. When you increase the number of training
steps, also increase `--iterations_per_loop`, which controls how frequently the
TPU machine returns control to the host machine (1000 seems like a fine number).

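For a real run, the same command scales up along these lines; the step counts
below are illustrative, not tuned:
```
t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problems=translate_ende_wmt8k \
  --train_steps=100000 \
  --eval_steps=10 \
  --local_eval_frequency=1000 \
  --iterations_per_loop=1000 \
  --master=$TPU_MASTER \
  --use_tpu=True \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```
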
Back on your local machine, open your browser and navigate to `localhost:6006`
for TensorBoard.