TF : tensor mismatch error in training with opus100 and t5-small #24693

@SoyGema

System Info

transformers==4.31.0.dev0
tensorflow-macos==2.10.0

Hello there! 👋
Thanks for creating examples for the Translation task!

Context

I'm going through the run_translation.py example, modified to use the opus100 dataset.
I launch the script with the flags listed below.

python train_model.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
    --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name opus100 \
    --dataset_config_name en-ro \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size=16 \
    --per_device_eval_batch_size=16 \
    --overwrite_output_dir

Error

All dataset feature engineering seems to work well. Training starts, but at some point it fails with a tensor shape mismatch error.

Shape of tensor args_0 [16,128] is not compatible with expected shape [16,64].
         [[{{node EnsureShape_1}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]] [Op:__inference_train_function_17297]

Any hints on how I should reshape this? At some point I thought it was something in the preprocessing, but training does start, so I'm a little confused. I also explored wmt16 (example tested and working) during #24579, and when I look at it on the Hub, it seems to have the same structure and partitions as opus100.
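
To make the two shapes in the error concrete, here is a minimal sketch of one way a [16,64] versus [16,128] mismatch could arise. It assumes (not confirmed by the traceback above) that each batch is padded up to a multiple of 64 tokens, so the batch width depends on the longest sequence in that batch. The helpers `pad_to_multiple` and `pad_batch` are hypothetical names for illustration and are not part of transformers or the example script.

```python
def pad_to_multiple(length: int, multiple: int = 64) -> int:
    """Round a sequence length up to the nearest multiple (assumed to be 64 here)."""
    return ((length + multiple - 1) // multiple) * multiple


def pad_batch(batch, multiple=64, pad_id=0):
    """Pad every sequence in the batch to one shared width (hypothetical helper)."""
    width = pad_to_multiple(max(len(seq) for seq in batch), multiple)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


# Longest sequence has 40 tokens, so the batch is padded to width 64.
short_batch = pad_batch([[1] * 10, [1] * 40])

# A single 70-token sequence pushes the whole batch to width 128.
long_batch = pad_batch([[1] * 10, [1] * 70])

print(len(short_batch[0]), len(long_batch[0]))  # 64 128
```

If something like this is happening, the early batches would all have width 64 and training would start normally, and the error would only show up once a batch containing a longer sentence arrives. That would match the symptom of training starting and then failing, but it is only a guess at the cause.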

Thanks for the time dedicated to this!🙂 and for the help!
Looking forward to getting all this working and sharing it in a PyCon Spain keynote this year!

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Launch training with config
python train_model.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
    --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name opus100 \
    --dataset_config_name en-ro \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size=16 \
    --per_device_eval_batch_size=16 \
    --overwrite_output_dir

Expected behavior

Training is not interrupted.
