c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711
Description
Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of c4/multilingual. Since there is no easy way to download only the data for one language, we are processing the entire c4/multilingual corpus first.
Environment information
- Operating System: Debian GNU/Linux 10
- Python version: Python 3.8.5 (miniconda)
- `tensorflow-datasets`/`tfds-nightly` version: 4.1.0 / 4.1.0.dev202011080107
- `tensorflow`/`tf-nightly` version: 2.3.1 / 2.5.0.dev20201108 (tried with and without `tf-nightly`)
Does the issue still exist with the latest `tfds-nightly` package (`pip install --upgrade tfds-nightly`)?
Yes, it does. We read all issues related to C4 and incorporated the necessary changes: pinning the `dill` version and adding options for 450 workers and `experiments=shuffle_mode=service` in Apache Beam.
Reproduction instructions
On a clean VM with 8 vCPUs and 32 GB of RAM, we installed miniconda and ran the following commands:
DATASET_NAME=c4
DATASET_CONFIG=multilingual
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...
# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls
# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz | gunzip | pv --name $wetpath --bytes | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/web.paths" ; done
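For reference, the manual-download destination used by the loop above can be built programmatically. A minimal sketch (bucket and dump names are placeholders; the destination string mirrors the `gsutil cp` target in the shell loop):

```python
def manual_wet_paths(bucket, dumps):
    """Build the manual-download destination for each Common Crawl dump.

    Mirrors the gsutil destination used in the shell loop above;
    `bucket` and the dump names are placeholder examples.
    """
    return [
        f"{bucket}/tensorflow_datasets/downloads/manual/crawl-data/{dump}/web.paths"
        for dump in dumps
    ]

# Example with two of the 72 dump names from wet.paths.urls:
paths = manual_wet_paths("gs://my-bucket", ["CC-MAIN-2013-20", "CC-MAIN-2020-40"])
```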
# Prepare requirements
rm /tmp/beam_requirements.txt
echo "tensorflow_datasets[$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "tfds-nightly[gcp,$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "google-apitools" >> /tmp/beam_requirements.txt
# there's an error with avro-python3 and dill, dill version needs to be fixed
# https://github.com/tensorflow/datasets/issues/2636#issuecomment-722551597
echo "dill==0.3.1.1" >> /tmp/beam_requirements.txt
python -m pip install tensorflow tf-nightly
python -m pip install -r /tmp/beam_requirements.txt
# Run main command
python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=$DATASET_NAME/$DATASET_CONFIG \
--data_dir=$GCS_BUCKET/tensorflow_datasets \
--beam_pipeline_options=\
"region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"dataflow_job_file=gs://$GCS_BUCKET/job_file.json,"\
"requirements_file=/tmp/beam_requirements.txt,max_num_workers=450,experiments=shuffle_mode=service" 2>&1 | tee nb-mc4.log
Link to logs
We removed information about our project and bucket from the logs:
- Process log: nb-mc4.log
- JSON job file: job_file.zip
Expected behavior
We would have expected the script to successfully launch the pipeline in Dataflow, but the JSON job file seems to be too big (37.5 MB, whereas the maximum is 10 MB), so all we get is a `Your client issued a request that was too large` error message (formatted as an HTML page in the console output).
Sample of the output
<title>Error 413 (Request Entity Too Large)!!1</title>
<p><b>413.</b> <ins>That's an error.</ins>
<p>Your client issued a request that was too large.</p>
<ins>That's all we know.</ins>
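The failure mode can be confirmed locally after downloading the job file (e.g. with `gsutil cp gs://$GCS_BUCKET/job_file.json .`). A minimal sketch; the path is a placeholder, and the 10 MB figure is the request-size cap reported above:

```python
import os

# Dataflow rejects job-creation requests larger than ~10 MB.
DATAFLOW_JOB_LIMIT = 10 * 1024 * 1024


def job_file_too_big(path, limit=DATAFLOW_JOB_LIMIT):
    """Return True if the serialized Dataflow job file exceeds the request limit."""
    return os.path.getsize(path) > limit

# Usage (placeholder path): job_file_too_big("job_file.json")
```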
Additional context
If there is any other way to extract a language portion of c4/multilingual, we'd be eager to try it as well.
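For illustration, one post-hoc workaround is to filter examples by language after the corpus is built. This is only a sketch: it assumes each example carries a `language` tag, which in practice would come from a language detector such as CLD3 (the detector c4/multilingual itself uses to partition documents).

```python
def filter_language(examples, lang="no"):
    """Yield only examples whose language tag matches `lang`.

    Assumes each example is a dict with a 'language' key (hypothetical
    schema for this sketch; real mC4 partitions by language at build time).
    """
    for ex in examples:
        if ex.get("language") == lang:
            yield ex


# Toy usage with made-up examples:
sample = [{"text": "hei", "language": "no"}, {"text": "hi", "language": "en"}]
norwegian = list(filter_language(sample))
```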
Update (March 1st, 2021): Instructions to successfully run the pipeline using one dump are detailed in #2711 (comment).