conda env create -f environment.yml
conda activate sql-icl
cd sciencebench_data/<dataset>
bash download.sh
Run the notebook sciencebench_data/<dataset>/extract_relevant_data.ipynb to transform the dataset into the target format.
To run our LLM-based agent for text-to-SQL generation:
python src/run_agent.py
The script reports Execution Accuracy (EX) on the dev set. Parameters, arguments, and datasets are configured in ./config/run_config.yaml and ./config/agent/baseline.yaml.
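For readers unfamiliar with the metric, the following is a minimal sketch of how Execution Accuracy is commonly computed: a prediction counts as correct when executing it returns the same rows as the gold query. This is not the repository's evaluation code. The helper names (`execute`, `execution_accuracy`), the toy `projects` table, and the use of an in-memory SQLite database are assumptions made only to keep the example self-contained.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query and return its rows as a multiset, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)  # order-insensitive comparison that keeps duplicates


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose results match."""
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold = execute(conn, gold_sql)
        pred = execute(conn, predicted_sql)
        if gold is not None and pred == gold:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Toy in-memory database, purely illustrative (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE projects (id INTEGER, title TEXT, year INTEGER);
    INSERT INTO projects VALUES (1, 'A', 2019), (2, 'B', 2020), (3, 'C', 2020);
    """
)

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT title FROM projects WHERE year >= 2020",
     "SELECT title FROM projects WHERE year = 2020"),
    # Wrong filter: counted as incorrect.
    ("SELECT title FROM projects WHERE year = 2019",
     "SELECT title FROM projects WHERE year = 2020"),
    # Invalid SQL: counted as incorrect.
    ("SELEC title FROM projects",
     "SELECT title FROM projects"),
]

ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.2f}")
```

Comparing rows as a multiset ignores row order; an evaluation that must respect `ORDER BY` would compare the row lists directly instead.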
Ablation of our pipeline components:
| Model | Cordis [EX] | OncoMx [EX] | SDSS [EX] |
|---|---|---|---|
| current SOTA on ScienceBenchmark | 35% | 56% | 21% |
| Llama 3.3 70B | 19% | 8% | 18% |
| + error correction | 23% | 7% | 17% |
| + error correction + schema | 38% | 35% | 19% |
| + error correction + schema flattened | 54% | 47% | 25% |
| + error correction + schema flattened + ICL (k=10) | 51% | 62% | 28% |
| + error correction + schema flattened + ICL (k=30) | 57% | 62% | 30% |
| + error correction + schema flattened + ICL (k=60) | 58% | 62% | 25% |
| + error correction + schema flattened + ICL (k=100) | 56% | 58% | 25% |
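The ICL rows add k question/SQL demonstrations to the prompt. How the pipeline selects and formats them is not described in this README, so the sketch below only illustrates what k controls. It assumes a simple token-overlap (Jaccard) ranking and a plain-text prompt layout; `select_examples`, `build_prompt`, and the example pool are hypothetical names and data, not the repository's API.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(question, pool, k):
    """Return the k pool entries whose questions overlap most with `question`."""
    q = tokenize(question)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(q, tokenize(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, schema, examples):
    """Assemble a few-shot prompt: schema, demonstrations, then the new question."""
    parts = [f"Schema:\n{schema}", ""]
    for ex in examples:
        parts += [f"Question: {ex['question']}", f"SQL: {ex['sql']}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


# Hypothetical demonstration pool (assumed data, for illustration only).
pool = [
    {"question": "How many projects started in 2020?",
     "sql": "SELECT COUNT(*) FROM projects WHERE year = 2020"},
    {"question": "List all project titles",
     "sql": "SELECT title FROM projects"},
    {"question": "Show the names of all institutions",
     "sql": "SELECT name FROM institutions"},
]

question = "How many projects started in 2019?"
examples = select_examples(question, pool, k=2)
prompt = build_prompt(question, "projects(id, title, year)", examples)
print(prompt)
```

A larger k places more demonstrations in the context window, which lengthens the prompt. In the table above the scores do not increase monotonically with k.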
Experiments with different LLM backbones, each using the best configuration reported above. Values in parentheses give the gain in percentage points (pp) over the previous SOTA:
| Model | Cordis [EX] | OncoMx [EX] | SDSS [EX] |
|---|---|---|---|
| current SOTA on ScienceBenchmark | 35% | 56% | 21% |
| Llama 3.3 70B (best config) | 58% | 62% | 30% |
| QWEN2.5 72B (best config) | 57% | 67% (+11pp) | 31% |
| QWEN2.5-Coder 32B (best config) | 52% | 62% | 37% (+16pp) |
| Mistral Nemo (best config) | 47% | 59% | 19% |
| DeepSeek-R1 70B (best config) | 62% (+27pp) | 58% | 25% |
| QWQ 32B (best config) | 57% | 63% | 35% |
| Phi4 (best config) | 55% | 55% | 26% |
| Starcoder2 15B (best config) | 7% | 8% | -- |
| DeepSeek-Coder-V2 16B (best config) | 53% | 52% | 21% |
| DeepSeek Coder 33B (best config) | 46% | 57% | 25% |
| SQLCoder 15B (best config) | 11% | 8% | -- |
| mannix/defog-llama3-sqlcoder 8B (best config) | 9% | 7% | -- |
| Gemma3 27B (best config) | 54% | 60% | 25% |
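The "(+Npp)" annotations in the table mark the best score per dataset and its difference in percentage points from the SOTA row. The short sketch below reproduces that arithmetic for a subset of rows copied from the table; the function name `best_per_dataset` is illustrative only.

```python
# EX scores in percent, copied from the table above (subset of rows).
SOTA = {"Cordis": 35, "OncoMx": 56, "SDSS": 21}

results = {
    "Llama 3.3 70B": {"Cordis": 58, "OncoMx": 62, "SDSS": 30},
    "QWEN2.5 72B": {"Cordis": 57, "OncoMx": 67, "SDSS": 31},
    "QWEN2.5-Coder 32B": {"Cordis": 52, "OncoMx": 62, "SDSS": 37},
    "DeepSeek-R1 70B": {"Cordis": 62, "OncoMx": 58, "SDSS": 25},
}


def best_per_dataset(results, baseline):
    """Map each dataset to (best model, gain over baseline in percentage points)."""
    best = {}
    for dataset, base in baseline.items():
        model = max(results, key=lambda m: results[m][dataset])
        best[dataset] = (model, results[model][dataset] - base)
    return best


gains = best_per_dataset(results, SOTA)
for dataset, (model, delta) in gains.items():
    print(f"{dataset}: {model} ({delta:+d}pp)")
```

Percentage points are absolute differences between two percentages, so 62% against 35% is +27pp, not a 27% relative improvement.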
@article{zhang2023sciencebenchmark,
title={ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems},
author={Zhang, Yi and Deriu, Jan and Katsogiannis-Meimarakis, George and Kosten, Catherine and Koutrika, Georgia and Stockinger, Kurt},
journal={arXiv preprint arXiv:2306.04743},
year={2023}
}