
c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711

@versae

Description

Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of c4/multilingual. Since there is no easy way to download only the data for one language, we are processing the entire c4/multilingual corpus first.

Environment information

  • Operating System: Debian GNU/Linux 10

  • Python version: Python 3.8.5 (miniconda)

  • tensorflow-datasets/tfds-nightly version: 4.1.0 / 4.1.0.dev202011080107

  • tensorflow/tf-nightly version: 2.3.1 / 2.5.0.dev20201108 (tried with and without tf-nightly)

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes, it does. We read all the issues related to C4 and incorporated the necessary changes: pinning the dill version and passing max_num_workers=450 and experiments=shuffle_mode=service to Apache Beam.

Reproduction instructions
On a clean VM with 8 vCPUs and 32GB of RAM, we installed miniconda and ran the following commands:

DATASET_NAME=c4
DATASET_CONFIG=multilingual
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

# Add all 72 dumps
rm -f wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls
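As an aside, the 72 dump IDs can also be derived from Common Crawl's collection index instead of being echoed by hand. A sketch (assumes python3 is available and that the index at index.commoncrawl.org/collinfo.json keeps its current JSON shape; crawl_ids is a hypothetical helper, not part of tfds):

```shell
# Parse Common Crawl's collinfo.json and print one crawl ID per line.
# In practice, pipe the real index through it:
#   curl -s https://index.commoncrawl.org/collinfo.json | crawl_ids > wet.paths.urls
crawl_ids() {
  python3 -c 'import json, sys
for c in json.load(sys.stdin):
    if c["id"].startswith("CC-MAIN"):
        print(c["id"])'
}

# Demonstrated here on a two-entry sample of the index format:
echo '[{"id": "CC-MAIN-2020-40"}, {"id": "CC-MAIN-2013-20"}]' | crawl_ids
```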

# Put them in the bucket
for wetpath in $(cat wet.paths.urls) ; do
  curl -s "https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz" \
    | gunzip \
    | pv --name "$wetpath" --bytes \
    | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/web.paths"
done
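A quick sanity check on the list the loop consumes, since a missing or duplicated dump ID silently skips data (a sketch; count_dumps is a hypothetical helper):

```shell
# Count unique dump IDs in a path-list file; the loop above expects 72.
count_dumps() {
  sort -u "$1" | wc -l | tr -d ' '
}
```

Usage: `count_dumps wet.paths.urls` should print 72 before the upload loop runs.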

# Prepare requirements
rm -f /tmp/beam_requirements.txt
echo "tensorflow_datasets[$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "tfds-nightly[gcp,$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "google-apitools" >> /tmp/beam_requirements.txt
# there's an error with avro-python3 and dill, dill version needs to be fixed
# https://github.com/tensorflow/datasets/issues/2636#issuecomment-722551597
echo "dill==0.3.1.1" >> /tmp/beam_requirements.txt
python -m pip install tensorflow tf-nightly
python -m pip install -r /tmp/beam_requirements.txt
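Before launching, it may be worth confirming the dill pin actually landed in the requirements file, since the avro-python3/dill clash from #2636 otherwise resurfaces on the workers. A sketch (check_dill_pin is a hypothetical helper, not part of tfds or Beam):

```shell
# Report whether the exact dill pin is present in a requirements file.
check_dill_pin() {
  if grep -qx 'dill==0.3.1.1' "$1"; then
    echo "dill pinned"
  else
    echo "dill pin missing"
  fi
}
```

Usage: `check_dill_pin /tmp/beam_requirements.txt` should print "dill pinned".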

# Run main command
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"dataflow_job_file=gs://$GCS_BUCKET/job_file.json,"\
"requirements_file=/tmp/beam_requirements.txt,max_num_workers=450,experiments=shuffle_mode=service" 2>&1 | tee nb-mc4.log

Link to logs
We removed our project and bucket information from the logs:

Expected behavior
We expected the script to successfully launch the pipeline in Dataflow, but the JSON job file seems to be too big (37.5MB when the maximum is 10MB), so all we get is a Your client issued a request that was too large error message (formatted as an HTML page in the console output).
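Since dataflow_job_file writes out the serialized job graph, its size can be checked against Dataflow's 10MB request limit before resubmitting. A sketch (check_job_size is a hypothetical helper and assumes the JSON has been copied locally, e.g. with gsutil cp):

```shell
# Report whether a serialized Dataflow job graph fits under the 10MB
# request limit that triggers the 413 above.
check_job_size() {
  local limit=$((10 * 1024 * 1024))
  local size
  size=$(wc -c < "$1" | tr -d ' ')
  if [ "$size" -gt "$limit" ]; then
    echo "too large: $size bytes (limit $limit)"
  else
    echo "ok: $size bytes"
  fi
}
```

Usage: `gsutil cp "$GCS_BUCKET/job_file.json" . && check_job_size job_file.json`.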

Sample of the output

  <title>Error 413 (Request Entity Too Large)!!1</title>
  <p><b>413.</b> <ins>That’s an error.</ins>
  <p>Your client issued a request that was too large.</p>
  <ins>That’s all we know.</ins>

Additional context
If there is any other way to extract a language portion of c4/multilingual we'd be eager to try it as well.

Update (March 1st, 2021): Instructions to successfully run the pipeline using one dump are detailed in #2711 (comment).

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects: No projects
Milestone: No milestone
