Repository for Data Swarms: Optimizable Generation of Synthetic Evaluation Data.
Create a conda environment for Data Swarms.
conda env create -f d_swarm.yml
conda activate d_swarm
Log into Hugging Face (for model access).
huggingface-cli login
Download initial experts.
cd initial_experts
python expert_init.py
cd ..
Set up Gemini access by providing your Google Cloud project ID at line 20 of utils.py.
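As a minimal sketch, assuming the project ID lives in a plain variable (the name below is a guess, not necessarily the actual identifier in utils.py), the edited line would look something like:
project_id = "your-gcp-project-id"  # replace with your Google Cloud project ID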
Let's run Data Swarms on the Alpaca instruction-following task with the difficult objective; data_swarm.sh provides the starter script for this.
Before running, check how many GPUs you have and what their ids are. Change -g on line 23 of data_swarm.sh: the default is the four GPUs with ids 0,1,2,3, but you can change this to 0, 0,1, 0,2,4,5, or any other combination you'd like. We recommend ~40GB of memory per GPU; the code will automatically double the batch size for ~80GB GPUs.
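As a hedged example, with two GPUs the edited invocation inside data_swarm.sh might look like the following (other flags on the line are omitted here; check the actual script):
python data_swarm.py -g 0,1 -t alpaca -o difficult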
Run it!
bash data_swarm.sh
You might be prompted to log into WandB, which is highly recommended. A directory starting with alpaca... will be created in search/; all logs, models, and results are stored there. Check search/alpaca_.../log.txt for current progress, and follow the runs on the WandB website as well. Yes, it might be slow.
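To follow progress from the terminal, something like the following works (the exact directory name depends on your run):
tail -f search/alpaca*/log.txt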
Choose one -t from alpaca, gsm8k, nlgraph, truthfulqa, wikidyk, and one -o from difficult, separate, novel, stable; the pair (alpaca, realistic) is additionally supported.
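For example, to target GSM8K with the novel objective, you would edit the flags in the starter script to something like (again, other flags omitted):
python data_swarm.py -t gsm8k -o novel -g 0,1,2,3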
For the adversarial setting, use the dual swarm starter script and keep -o as difficult:
bash dual_swarm.sh
Run python data_swarm.py -h to see a list of all hyperparameters and settings, and see the comments on the hyperparameters in data_swarm.py. We already include default settings in the four starter scripts, but feel free to play around with different settings.
If Data Swarms is helpful to you, please consider citing:
@article{feng2025data,
  title={Data Swarms: Optimizable Generation of Synthetic Evaluation Data},
  author={Feng, Shangbin and Wang, Yike and Shi, Weijia and Tsvetkov, Yulia},
  journal={arXiv preprint arXiv:2506.00741},
  year={2025}
}