Ting-Yun Chang, Muru Zhang, Jesse Thomason, and Robin Jia
Paper: https://arxiv.org/abs/2506.12044
- We release $D_\text{large}$ and $D_\text{ctrl}$ under `data/`.
- The data are tokenized: each split has shape [1000, 512], and each row corresponds to a FineWeb example with sequence length = 512.
- To convert them back into texts, run:
python read_data.py --split large --quant_type awq3 --model_name Qwen/Qwen2.5-7B
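
For readers who want to see what the conversion amounts to, here is a minimal sketch of decoding rows of token IDs back into text. It is not the code in `read_data.py`; the function name `rows_to_texts`, the `decode` callable, and the toy vocabulary are hypothetical and used only for illustration. The on-disk format of the files under `data/` is not described here, so the sketch starts from token IDs that are already loaded in memory.

```python
from typing import Callable, List, Sequence

SEQ_LEN = 512  # sequence length stated above


def rows_to_texts(
    token_ids: Sequence[Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    seq_len: int = SEQ_LEN,
) -> List[str]:
    """Decode each row of token IDs into one text string.

    `decode` is any function mapping a sequence of token IDs to a string
    (hypothetical interface; a real tokenizer's decode method would fit).
    """
    texts = []
    for i, row in enumerate(token_ids):
        # Each row is expected to hold exactly `seq_len` token IDs.
        if len(row) != seq_len:
            raise ValueError(f"row {i} has {len(row)} tokens, expected {seq_len}")
        texts.append(decode(row))
    return texts


# Toy stand-in for a real tokenizer (assumption, for illustration only).
toy_vocab = {0: "hello", 1: " world", 2: "!"}


def toy_decode(ids: Sequence[int]) -> str:
    return "".join(toy_vocab[i] for i in ids)


# Two rows of length 3 instead of 1000 rows of length 512.
demo = rows_to_texts([[0, 1, 2], [0, 2, 2]], toy_decode, seq_len=3)
print(demo)
```

With Hugging Face `transformers` installed, the `decode` method of `AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")` could plausibly be passed as the `decode` argument, assuming the released token IDs come from that model's tokenizer.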