TF : tensor mismatch error in training with opus100 and t5-small #24693
Description
System Info
transformers==4.31.0.dev0
tensorflow-macos==2.10.0
Hello there! 👋
Thanks for creating examples for the Translation task!
Context
I'm going through the run_translation.py example, modified to use the opus100 dataset.
I launch the script with the flags listed below.
python train_model.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name opus100 \
--dataset_config_name en-ro \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--overwrite_output_dir
Error
All the dataset feature engineering output looks fine and training starts, but at some point training fails with a tensor shape mismatch error:
Shape of tensor args_0 [16,128] is not compatible with expected shape [16,64].
[[{{node EnsureShape_1}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]] [Op:__inference_train_function_17297]
Any hints on how I should reshape this? At first I thought the problem was in preprocessing, but training does start, so I am a little confused. I also explored wmt16 (the example dataset, tested and working) during #24579, and on the Hub it seems to have the same structure and partitions as opus100.
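To illustrate the kind of mismatch in the error message, here is a minimal sketch in plain Python. It assumes, without confirmation from the traceback, that batches are padded to a multiple of 64 tokens, so a batch width depends on its longest sequence. The helper `pad_batch` and its parameters are hypothetical and are not part of transformers or TensorFlow.

```python
# Hypothetical helper (not a transformers API): pad every sequence in a batch
# to the smallest multiple of `multiple` that fits the longest sequence.
def pad_batch(batch, multiple=64, pad_id=0):
    longest = max(len(seq) for seq in batch)
    # Round `longest` up to the next multiple (ceiling division).
    width = -(-longest // multiple) * multiple
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


# Assumed token lengths, chosen only to show both outcomes.
short_batch = pad_batch([[1] * 20, [1] * 50])  # longest sequence: 50 tokens
long_batch = pad_batch([[1] * 20, [1] * 70])   # longest sequence: 70 tokens

widths = (len(short_batch[0]), len(long_batch[0]))
print(widths)  # (64, 128)
```

If the expected shape were fixed from batches like the first one, a later batch like the second would no longer match it. Whether that is what happens here is only a guess.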
Thanks for the time dedicated to this!🙂 and for the help!
Looking forward to getting all this working and sharing it in a PyCon Spain keynote this year!
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Launch training with config
python train_model.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name opus100 \
--dataset_config_name en-ro \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--overwrite_output_dir
Expected behavior
Training is not interrupted.