This repository is the official implementation of SynQuE: Estimating Synthetic Dataset Quality Without Annotations.
*Overview of the SynQuE framework for synthetic data quality estimation across multiple domains.*
We use uv for package management. To install uv, please follow the instructions here.
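For reference, at the time of writing uv's standalone installer can be run with:

```bash
# Official uv standalone installer (macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh
```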
To install the dependencies, run:

```bash
uv sync
```

The following instructions describe how to train models to obtain task performance metrics on real test data.
All training results are provided in `./data/<task>/task_performance`. Embeddings used for the representation-based proxy calculations can be downloaded from Google Drive.
The training script is provided in `notebooks/SynQuE_scoring.ipynb`.
We use the CodeS codebase for Text2SQL training. The training text-to-SQL pairs are provided in `data/text2sql/data/`; please follow the instructions in the CodeS repository for training the model.

We use Qwen2.5-1.5B-Coder-Instruct as the base model. For each synthetic dataset, we train the model on its training data for 6 epochs with a learning rate of 5e-6, a warmup ratio of 0.05, and an input block size of 4096. The remaining hyperparameters are set to the default values in the CodeS repository.
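As a rough sketch of what this setup corresponds to (the script name and flag spellings below are hypothetical; the actual entry point comes from the CodeS repository):

```bash
# Hypothetical invocation -- consult the CodeS repository for the real
# training script and flag names; only the values mirror the text above.
python train.py \
    --model_name_or_path Qwen2.5-1.5B-Coder-Instruct \
    --train_file data/text2sql/data/<synthetic_dataset>.json \
    --num_train_epochs 6 \
    --learning_rate 5e-6 \
    --warmup_ratio 0.05 \
    --block_size 4096
```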
For the web agent task, we use the original NNetNav-Live dataset as the synthetic data source. The original dataset is in conversation format, and we use open-instruct to finetune Qwen2.5-7B-Instruct models on each disjoint subset of the dataset. We use LoRA with a rank of 64, a batch size of 32, a learning rate of 2e-5, and a warmup ratio of 0.03 for 3 epochs. The remaining hyperparameters are the same as in the NNetNav repository.
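A hypothetical sketch of the corresponding open-instruct launch (the flags below are ones open-instruct's finetuning script commonly exposes, but verify the script path and argument names against the NNetNav and open-instruct repositories):

```bash
# Hypothetical launch -- verify script path and flags against open-instruct.
accelerate launch open_instruct/finetune.py \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --train_file <nnetnav_live_subset>.jsonl \
    --use_lora \
    --lora_rank 64 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16  # effective batch size 32 on one GPU
```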
For image classification, we use ResNet-50 as the base model and train it from scratch. We hold out 10% of the synthetic data as a validation set and apply early stopping on the validation loss. Models are trained for at least 10 epochs, and we select the best checkpoint with respect to the validation loss to compute the task performance on real test data. The ImageNet dataset can be downloaded from here. Synthetic datasets can be downloaded from Google Drive.
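A purely hypothetical command mirroring this configuration (no such script is named in this README; it only restates the setup above):

```bash
# Hypothetical run -- the values restate the image classification setup:
# 10% validation split, >= 10 epochs, best checkpoint by validation loss.
python train_resnet50.py \
    --synthetic_data <path_to_synthetic_images> \
    --val_fraction 0.1 \
    --min_epochs 10 \
    --select_best_by val_loss
```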
LENS consists of two stages: (1) rubric compilation and (2) scoring with SynQuE.
All scripts are in `compile_rubrics/`. We use Hydra for configuration management; the configuration files are in `compile_rubrics/conf/`, and the default values can easily be overridden on the command line (as in the sentiment analysis example below). Before running, set the OPENAI_API_BASE environment variable to the base URL of your service provider.
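For example (the URL is just a placeholder for your provider's endpoint):

```bash
export OPENAI_API_BASE=https://api.openai.com/v1
```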
- Example for sentiment analysis:
  ```bash
  python compile_rubrics/generate_rubrics_sentiment_analysis.py \
      sentiment_analysis.rubric_generation.real_seed=42 \
      sentiment_analysis.rubric_generation.num_points=10
  ```
We use the vLLM offline SDK for LENS scoring. Each task has its own scoring entry point, listed below; a filled-in example follows the list.
- For sentiment analysis:
  ```bash
  python -m src.generate.score_sentiment_analysis --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
  ```
- For web agent:
  ```bash
  python -m src.generate.score_web_agent --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
  ```
- For image classification:
  ```bash
  python -m src.generate.score_image_classification --scoring_model_path <path> --split <which split to use> --synthetic_data <optional> --rubric_path <path> [other options]
  ```
- For text2sql:
  ```bash
  python -m src.generate.score_text2sql --scoring_model_path <path> --synthetic_data <optional> --seed <seed> [other options]
  ```
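As a concrete illustration, a filled-in text2sql scoring run might look like this (the model path, data path, and seed are placeholder values):

```bash
python -m src.generate.score_text2sql \
    --scoring_model_path ./models/lens_scorer \
    --synthetic_data ./data/text2sql/data/synthetic_train.json \
    --seed 42
```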
To compute the correlation coefficients between LENS scores and task performance, you can run the following script:
```bash
python -m src.compute.lens --data_path <path_to_lens_scores> --task <task>
```

To score and compute correlation coefficients for representation-based methods, you can run the following script:
```bash
python -m src.compute.representation_based --method <pad/mmd/mdm> --task <task> --embedding_path <path_to_the_embedding_folder>
```

If you find this work useful, please cite:
```bibtex
@article{chen2025synque,
  title={SynQuE: Estimating Synthetic Dataset Quality Without Annotations},
  author={Chen, Arthur and Zhong, Victor},
  journal={arXiv preprint arXiv:2511.03928},
  year={2025}
}
```