DCLM-refinedweb

Description

[...] We begin by evaluating several well-known open-source datasets (C4 [48, 130], RefinedWeb [121], RedPajama [160], and Dolma-V1 [150]) in Table 2. While all four datasets use various heuristic filters and data cleaning steps, we find that RefinedWeb performs the best on our CORE and EXTENDED metrics at the 7B-1x scale. RefinedWeb applies the following filtering pipeline: Common Crawl text extraction, heuristic selection rules (e.g., to remove spam), and deduplication of repeated content. Interestingly, RefinedWeb is solely filtered from Common Crawl, unlike RedPajama and Dolma-V1, which additionally mix in curated, “high-quality” sources like Wikipedia. [...]

Read the full paper on arXiv.

Contents

The data is organized into global and local shards and stored as ZSTD-compressed JSONL files.

    $ aws s3 ls s3://commoncrawl/contrib/datacomp/DCLM-refinedweb/
                           PRE global-shard_01_of_10/
                           PRE global-shard_02_of_10/
                           PRE global-shard_03_of_10/
                           PRE global-shard_04_of_10/
                           PRE global-shard_05_of_10/
                           PRE global-shard_06_of_10/
                           PRE global-shard_07_of_10/
                           PRE global-shard_08_of_10/
                           PRE global-shard_09_of_10/
                           PRE global-shard_10_of_10/
2025-06-22 15:40:28    1015959 DCLM-refinedweb.paths.gz

The DCLM-refinedweb.paths.gz file lists the prefixes/paths of the data files.
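
As a minimal sketch of how the path listing might be used, the Python snippet below reads a locally downloaded copy of the listing and groups its entries by global shard. It assumes the file holds one path per line as UTF-8 text, which is not stated above; the function name `group_paths_by_global_shard` is hypothetical, and only the `global-shard_NN_of_10` naming is taken from the listing shown.

```python
import gzip
import re
from collections import defaultdict

# Matches the global-shard directory names shown in the listing,
# e.g. "global-shard_01_of_10".
GLOBAL_SHARD_RE = re.compile(r"global-shard_\d+_of_\d+")


def group_paths_by_global_shard(paths_gz):
    """Group the entries of a gzip-compressed path listing by global shard.

    Assumption: the listing contains one path per line, encoded as UTF-8.
    Entries without a global-shard component are collected under the key None.
    """
    groups = defaultdict(list)
    with gzip.open(paths_gz, "rt", encoding="utf-8") as fh:
        for line in fh:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            match = GLOBAL_SHARD_RE.search(path)
            key = match.group(0) if match else None
            groups[key].append(path)
    return dict(groups)
```

Calling `group_paths_by_global_shard("DCLM-refinedweb.paths.gz")` returns a dictionary that maps each global-shard name to its list of paths, which makes it straightforward to fetch or process a single global shard rather than the whole dataset.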