
SynQuE: Estimating Synthetic Dataset Quality Without Annotations

This repository is the official implementation of SynQuE: Estimating Synthetic Dataset Quality Without Annotations. SynQuE is a general framework for ranking and selecting synthetic training datasets using only limited unannotated (label-free) data.

Figure: Overview of the SynQuE framework for synthetic data quality estimation across multiple domains.

Requirements

We use uv for package management; please follow the official instructions to install uv. Then install the dependencies by running:

uv sync

Training

The following instructions describe how to train models and obtain task performance metrics on real test data. All training results are provided in ./data/<task>/task_performance. The embeddings used for representation-based proxy calculations can be downloaded from Google Drive.

Sentiment Analysis

The training script is provided in notebooks/SynQuE_scoring.ipynb.

Text2SQL

We use the CodeS codebase for Text2SQL training. Training text-to-SQL pairs are provided in data/text2sql/data/. Please follow the instructions in the CodeS repository to train the model. We use Qwen2.5-1.5B-Coder-Instruct as the base model. For each synthetic dataset, we train the model on its training data for 6 epochs with a learning rate of 5e-6, a warmup ratio of 0.05, and an input block size of 4096. The remaining hyperparameters are set to the defaults in the CodeS repository.
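
These hyperparameters correspond to a standard supervised fine-tuning setup. As a rough sketch of an equivalent configuration with Hugging Face transformers (the actual training goes through the CodeS codebase; the Hub model identifier and the toy training example are illustrative assumptions):

# Sketch of the fine-tuning setup above using the Hugging Face Trainer.
# The real runs use the CodeS codebase; the toy dataset is a placeholder.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for the tokenized text-to-SQL training pairs.
texts = ["Question: How many users are there? SQL: SELECT COUNT(*) FROM users;"]
train_dataset = [{"input_ids": ids}
                 for ids in tokenizer(texts, truncation=True, max_length=4096)["input_ids"]]

args = TrainingArguments(
    output_dir="./text2sql_ckpts",
    num_train_epochs=6,   # 6 epochs, as above
    learning_rate=5e-6,   # learning rate 5e-6
    warmup_ratio=0.05,    # warmup ratio 0.05
)
Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()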

Web Navigation Agent

We use the original NNetNav-Live dataset as the synthetic data source. The original dataset is in conversation format, and we use open-instruct to finetune Qwen2.5-7B-Instruct models on each disjoint subset of the dataset. We use LoRA with a rank of 64, a batch size of 32, a learning rate of 2e-5, and a warmup ratio of 0.03 for 3 epochs. The remaining hyperparameters are the same as in the NNetNav repository.
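
For reference, the LoRA settings above map onto a configuration like the following sketch with the peft library (the actual runs use open-instruct; the alpha value and target modules are illustrative assumptions):

# Sketch of the LoRA configuration described above, using peft.
# Training itself is handled by open-instruct in the actual setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_config = LoraConfig(
    r=64,                   # LoRA rank 64, as above
    lora_alpha=16,          # illustrative; not specified above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# open-instruct then trains for 3 epochs with batch size 32,
# learning rate 2e-5, and warmup ratio 0.03.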

Image Classification

We use ResNet-50 as the base model and train it from scratch. We hold out 10% of the synthetic data as a validation set and early-stop on the validation loss. Models are trained for at least 10 epochs, and we select the best checkpoint with respect to validation loss to compute task performance on real test data. The ImageNet dataset can be downloaded from the official ImageNet website. Synthetic datasets can be downloaded from Google Drive.
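
A minimal sketch of this training loop with torchvision follows (the dataset path, batch size, optimizer settings, and patience value are illustrative assumptions):

# Sketch of the setup above: ResNet-50 from scratch, 10% validation split,
# early stopping on validation loss after a minimum of 10 epochs.
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
data = datasets.ImageFolder("path/to/synthetic_dataset", transform=tfm)  # assumed layout
n_val = int(0.1 * len(data))  # 10% of the synthetic data for validation
train_set, val_set = random_split(data, [len(data) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

model = models.resnet50(weights=None)  # no pretrained weights: train from scratch
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

best_val, bad_epochs, patience, epoch = float("inf"), 0, 3, 0
while epoch < 10 or bad_epochs < patience:  # at least 10 epochs
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # best w.r.t. validation loss
    else:
        bad_epochs += 1
    epoch += 1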

LENS Scoring

LENS consists of two stages: (1) rubric compilation and (2) scoring with SynQuE.

Rubric Compilation

All scripts are in compile_rubrics/. Please set the OPENAI_API_BASE environment variable to the base URL of your service provider. We use Hydra for configuration management; the configuration files live in compile_rubrics/conf/, and you can override the default values on the command line, as in the example below.

  • Example for sentiment analysis:
    python compile_rubrics/generate_rubrics_sentiment_analysis.py \
        sentiment_analysis.rubric_generation.real_seed=42 \
        sentiment_analysis.rubric_generation.num_points=10

Scoring LENS

We use the vLLM offline SDK for LENS scoring. The per-task entry points are listed below; a sketch of the underlying scoring pattern follows the list.

  • For sentiment analysis:
    python -m src.generate.score_sentiment_analysis --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
  • For web agent:
    python -m src.generate.score_web_agent --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
  • For image classification:
    python -m src.generate.score_image_classification --scoring_model_path <path> --split <which split to use> --synthetic_data <optional> --rubric_path <path> [other options]
  • For text2sql:
    python -m src.generate.score_text2sql --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
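
Internally, offline scoring with vLLM follows a pattern like this sketch (the model path, rubric text, prompt format, and examples are illustrative assumptions, not the scripts' actual prompts):

# Sketch of rubric-based scoring with the vLLM offline SDK.
# Rubric, prompt format, and examples are illustrative placeholders.
from vllm import LLM, SamplingParams

rubric = "Rate the example from 1 to 5 for fluency and label correctness."
synthetic_examples = ["The movie was great. Label: positive",
                      "Terrible acting. Label: negative"]

llm = LLM(model="path/to/scoring_model")  # corresponds to --scoring_model_path
params = SamplingParams(temperature=0.0, max_tokens=8)
prompts = [f"{rubric}\n\nExample:\n{ex}\nScore:" for ex in synthetic_examples]
outputs = llm.generate(prompts, params)
scores = [out.outputs[0].text.strip() for out in outputs]
print(scores)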

Computing Correlation Coefficients

LENS

To compute the correlation coefficients between LENS scores and task performance, you can run the following script:

python -m src.compute.lens --data_path <path_to_lens_scores> --task <task>
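
Under the hood, this amounts to correlating one proxy score per synthetic dataset with that dataset's performance on real test data, as in this sketch (the values are illustrative; the script may report additional statistics):

# Sketch of the correlation computation: one LENS score and one task
# performance value per synthetic dataset. Values are illustrative.
from scipy.stats import pearsonr, spearmanr

lens_scores = [0.71, 0.55, 0.82, 0.47]  # LENS score per synthetic dataset
task_perf = [0.64, 0.58, 0.77, 0.41]    # real-test performance per dataset

rho, _ = spearmanr(lens_scores, task_perf)  # rank correlation
r, _ = pearsonr(lens_scores, task_perf)     # linear correlation
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")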

Representation-based

To score and compute correlation coefficients for representation-based methods, you can run the following script:

python -m src.compute.representation_based --method <pad/mmd/mdm> --task <task> --embedding_path <path_to_the_embedding_folder>
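
To illustrate what these proxies measure, here is a sketch of one of them, maximum mean discrepancy (MMD) with an RBF kernel, computed between synthetic and real embedding sets (random placeholder embeddings and bandwidth; the actual implementation lives in src.compute.representation_based):

# Sketch of the MMD proxy: squared maximum mean discrepancy with an RBF
# kernel between synthetic and real embedding sets. Lower MMD means the
# synthetic embeddings are distributed more like the real ones.
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 with kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def kernel_mean(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists).mean()
    return kernel_mean(X, X) + kernel_mean(Y, Y) - 2 * kernel_mean(X, Y)

rng = np.random.default_rng(0)
synthetic = rng.normal(size=(100, 16))  # placeholder synthetic embeddings
real = rng.normal(size=(100, 16))       # placeholder real embeddings
print(rbf_mmd2(synthetic, real))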

Citation

If you find this work useful, please cite:

@article{chen2025synque,
  title={SynQuE: Estimating Synthetic Dataset Quality Without Annotations},
  author={Chen, Arthur and Zhong, Victor},
  journal={arXiv preprint arXiv:2511.03928},
  year={2025}
}

License

MIT
