Datasets in run_translation.py #24579
Description
System Info
Hello there! 👋
I'm following along the run_translation.py example.
Thanks for making it! It expands greatly on the translation docs tutorial.
Context
I managed to configure the flags for training. When launching from the CLI:
python train_model.py --model_name_or_path '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/models/t5-small' --output_dir '/en-ru-model' --dataset_name '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/data/opus_books' --dataset_config_name en-ru --do_train --source_lang en --target_lang ru --num_train_epochs 1 --overwrite_output_dir
the following error appears
raise TypeError("Dataset argument should be a datasets.Dataset!")
TypeError: Dataset argument should be a datasets.Dataset!
I then read a forum recommendation, commented out the tf_eval_dataset creation, and launched training again. The model trained without the eval_dataset.
When I passed the --do_eval flag, it raised the error flagged here.
I downloaded the opus books dataset and saw in its README.md that it doesn't have a validation split:
- config_name: en-ru
  features:
  - name: id
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ru
  splits:
  - name: train
    num_bytes: 5190880
    num_examples: 15496
  download_size: 1613419
  dataset_size: 5190880
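Since this config only ships a train split, one possible workaround is to carve a validation split out of train before the script reaches its evaluation logic. The sketch below is an assumption on my side, not something the example script does: ensure_validation_split is a hypothetical helper, and the 10% size and seed are arbitrary. It relies only on the train_test_split method that datasets.Dataset provides.

```python
def ensure_validation_split(raw_datasets, test_size=0.1, seed=42):
    """Return a plain dict of splits that always contains a 'validation' key.

    `raw_datasets` is any mapping of split name -> object exposing
    `train_test_split(test_size=..., seed=...)`, e.g. a datasets.DatasetDict.
    Hypothetical helper: name, default size and seed are illustrative only.
    """
    splits = dict(raw_datasets)
    if "validation" in splits:
        # Nothing to do: the dataset already ships a validation split.
        return splits
    if "train" not in splits:
        raise ValueError("Expected a 'train' split to carve a validation set from.")
    # train_test_split returns a mapping with 'train' and 'test' keys.
    carved = splits["train"].train_test_split(test_size=test_size, seed=seed)
    splits["train"] = carved["train"]
    splits["validation"] = carved["test"]
    return splits


# Possible usage (requires the `datasets` package and access to the dataset):
#   from datasets import load_dataset
#   raw_datasets = load_dataset("opus_books", "en-ru")
#   raw_datasets = ensure_validation_split(raw_datasets)
#   print(list(raw_datasets))  # split names now include "validation"
```

With something like this in place, --do_eval would have a validation split to read from, at the cost of training on slightly fewer examples.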
Issue 1. Reproducibility coming from tutorial
- Can you please confirm that this example runs out of the box with WMT19, and that I would not hit this issue if I used that dataset instead of the opus books one?
- Would you be willing to accept a PR that adds a comment to the example, either pointing to the README table or making it more explicit that this example expects a specific dataset, with its link around here? Is there another way you think I could help users who follow the path from the docs tutorial to the script example?
Am I missing something? I think it's dataset related, but I'm not sure anymore...
Issue 2. Broken link
I found a broken link. If you are OK with it, I'll fix it with this.
Dependencies
transformers==4.31.0.dev0
tensorflow-macos==2.10.0
Tangential and mental model
I'm actually following this script, which is a copy that was recommended in #24254. Please let me know if something has changed: looking at the history, the last commit seems to be from Jun 7 and my copy is from Jun 13.
I grouped the broken link with dataset in one issue as it might impact 1 PR for Reproducibility, but let me know if you prefer them separately.
Thanks so so much for your help 🙏 & thanks for the library!
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- run the script
- download opus books dataset
- config flags
- run script with and without eval_dataset logic
Expected behavior
- A pointer to the expected dataset, either as a link in README.md or as a comment in the script?
- Correct link for
Tagging @sgugger