
c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711

@versae

Description

Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of c4/multilingual. Since there is no easy way to download only the data for one language, we are processing the entire c4/multilingual corpus first.

Environment information

  • Operating System: Debian GNU/Linux 10

  • Python version: Python 3.8.5 (miniconda)

  • tensorflow-datasets/tfds-nightly version: 4.1.0 / 4.1.0.dev202011080107

  • tensorflow/tf-nightly version: 2.3.1 / 2.5.0.dev20201108 (tried with and without tf-nightly)

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes, it does. We read all the issues related to C4 and incorporated the necessary changes: pinning the dill version and passing max_num_workers=450 and experiments=shuffle_mode=service to Apache Beam.

Reproduction instructions
On a clean VM with 8 vCPUs and 32GB of RAM, we installed miniconda and ran the following commands:

DATASET_NAME=c4
DATASET_CONFIG=multilingual
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

# Add all 72 dumps
rm -f wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls
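As an aside, the 72 dump IDs can also be derived from Common Crawl's collection index instead of being echoed by hand. A sketch (assumes python3 is available and that the index at index.commoncrawl.org/collinfo.json keeps its current JSON shape; crawl_ids is a hypothetical helper, not part of tfds):

```shell
# Parse Common Crawl's collinfo.json and print one crawl ID per line.
# In practice, pipe the real index through it:
#   curl -s https://index.commoncrawl.org/collinfo.json | crawl_ids > wet.paths.urls
crawl_ids() {
  python3 -c 'import json, sys
for c in json.load(sys.stdin):
    if c["id"].startswith("CC-MAIN"):
        print(c["id"])'
}

# Demonstrated here on a two-entry sample of the index format:
echo '[{"id": "CC-MAIN-2020-40"}, {"id": "CC-MAIN-2013-20"}]' | crawl_ids
```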

# Put them in the bucket
for wetpath in $(cat wet.paths.urls) ; do
  curl -s "https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz" \
    | gunzip \
    | pv --name "$wetpath" --bytes \
    | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/web.paths"
done
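A quick sanity check on the list the loop consumes, since a missing or duplicated dump ID silently skips data (a sketch; count_dumps is a hypothetical helper):

```shell
# Count unique dump IDs in a path-list file; the loop above expects 72.
count_dumps() {
  sort -u "$1" | wc -l | tr -d ' '
}
```

Usage: `count_dumps wet.paths.urls` should print 72 before the upload loop runs.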

# Prepare requirements
rm -f /tmp/beam_requirements.txt
echo "tensorflow_datasets[$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "tfds-nightly[gcp,$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "google-apitools" >> /tmp/beam_requirements.txt
# there's an error with avro-python3 and dill, dill version needs to be fixed
# https://github.com/tensorflow/datasets/issues/2636#issuecomment-722551597
echo "dill==0.3.1.1" >> /tmp/beam_requirements.txt
python -m pip install tensorflow tf-nightly
python -m pip install -r /tmp/beam_requirements.txt
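Before launching, it may be worth confirming the dill pin actually landed in the requirements file, since the avro-python3/dill clash from #2636 otherwise resurfaces on the workers. A sketch (check_dill_pin is a hypothetical helper, not part of tfds or Beam):

```shell
# Report whether the exact dill pin is present in a requirements file.
check_dill_pin() {
  if grep -qx 'dill==0.3.1.1' "$1"; then
    echo "dill pinned"
  else
    echo "dill pin missing"
  fi
}
```

Usage: `check_dill_pin /tmp/beam_requirements.txt` should print "dill pinned".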

# Run main command
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"dataflow_job_file=gs://$GCS_BUCKET/job_file.json,"\
"requirements_file=/tmp/beam_requirements.txt,max_num_workers=450,experiments=shuffle_mode=service" 2>&1 | tee nb-mc4.log

Link to logs
We removed our project and bucket information from the logs:

Expected behavior
We expected the script to successfully launch the pipeline in Dataflow, but the JSON job file seems to be too big (37.5MB when the maximum is 10MB), so all we get is a Your client issued a request that was too large error message (formatted as an HTML page in the console output).
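Since dataflow_job_file writes out the serialized job graph, its size can be checked against Dataflow's 10MB request limit before resubmitting. A sketch (check_job_size is a hypothetical helper and assumes the JSON has been copied locally, e.g. with gsutil cp):

```shell
# Report whether a serialized Dataflow job graph fits under the 10MB
# request limit that triggers the 413 above.
check_job_size() {
  local limit=$((10 * 1024 * 1024))
  local size
  size=$(wc -c < "$1" | tr -d ' ')
  if [ "$size" -gt "$limit" ]; then
    echo "too large: $size bytes (limit $limit)"
  else
    echo "ok: $size bytes"
  fi
}
```

Usage: `gsutil cp "$GCS_BUCKET/job_file.json" . && check_job_size job_file.json`.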

Sample of the output

  <title>Error 413 (Request Entity Too Large)!!1</title>
  <p><b>413.</b> <ins>That’s an error.</ins>
  <p>Your client issued a request that was too large.</p>
  <ins>That’s all we know.</ins>

Additional context
If there is any other way to extract a language portion of c4/multilingual we'd be eager to try it as well.

Update (March 1st, 2021): Instructions to successfully run the pipeline using one dump are detailed in #2711 (comment).

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects: No projects
Milestone: No milestone
