This version of TADV reproduces the experiments from the paper Towards Task-aware Data Validation, presented at the DEEM workshop, SIGMOD 2025. Please check the paper for more details.
TADV is a framework that leverages Language Models to generate data validation rules based on downstream tasks.
Here are quick links to the experiment sections in the paper:
| Section | Source Code |
|---|---|
| 4.1 | |
| 4.2 | |
| 4.3 | |
The project consists of the following modules:
- Error Injection – Provides APIs for injecting errors into datasets, enabling robustness testing for validation methods.
- Runtime Environments – Defines execution environments where datasets are evaluated in the context of downstream queries or machine learning pipelines.
- LLM – Contains classes for interacting with LLM APIs to generate data validation rules (a simplified, hypothetical sketch of this flow is shown after this list). The process follows three key steps:
- Column Access Detection – Identifying relevant columns based on downstream context.
- Assumption Generation – Inferring data assumptions from provided context and dataset properties.
- Rule Generation – Producing executable validation rules to ensure data quality.
- Inspector – Extracts dataset metadata, including schema and statistics, to aid LLMs in generating informed validation rules.
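For orientation, the sketch below shows how these pieces might fit together in code. It is a minimal, hypothetical illustration: all function names are invented for this example and do not match the repository's identifiers, and the LLM calls of steps 2 and 3 are replaced with trivial heuristics so that the snippet runs on its own.

```python
# Hypothetical end-to-end sketch of the module interplay described above.
# All names are illustrative and do NOT match the repository's identifiers;
# steps 2 and 3, which TADV delegates to an LLM, are stubbed with heuristics.
import pandas as pd

def inspect_dataset(df: pd.DataFrame) -> dict:
    """Inspector: collect schema and simple statistics per column."""
    return {
        col: {"dtype": str(df[col].dtype),
              "completeness": float(1.0 - df[col].isna().mean())}
        for col in df.columns
    }

def detect_column_access(script_source: str, metadata: dict) -> list:
    """Step 1 - column access detection (naive text match against the schema)."""
    return [col for col in metadata if col in script_source]

def generate_assumptions(columns: list, metadata: dict) -> dict:
    """Step 2 - assumption generation from dataset properties."""
    return {col: {"min_completeness": metadata[col]["completeness"]} for col in columns}

def generate_rules(assumptions: dict) -> list:
    """Step 3 - rule generation: emit executable-style validation checks."""
    return [f"check_completeness('{col}', threshold={a['min_completeness']:.2f})"
            for col, a in assumptions.items()]

if __name__ == "__main__":
    df = pd.DataFrame({"age": [25, 31, None], "income": [50_000, 64_000, 58_000]})
    downstream_script = "model.fit(df[['age', 'income']], y)"
    metadata = inspect_dataset(df)
    accessed = detect_column_access(downstream_script, metadata)
    print(generate_rules(generate_assumptions(accessed, metadata)))
```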
We provide the following workflow for evaluating the data validation capabilities of LLMs compared to non-LLM methods. You can find the detailed implementation in the workflow directory.
To run the experiments, you need to create a .env file in the root directory of the project. The .env file should contain the following environment variables:

HF_TOKEN=***
OPENAI_API_KEY=***
SPARK_VERSION=3.5

Please replace *** with your own API keys.
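If you want to double-check that the variables are picked up, a quick sanity check like the one below works. Note that reading the file via python-dotenv here is just one common approach and is not necessarily how TADV loads the file internally.

```python
# Sanity check for the .env file (uses python-dotenv; whether TADV itself
# loads the file this way is an assumption, this snippet is only a check).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("HF_TOKEN", "OPENAI_API_KEY", "SPARK_VERSION"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```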
We use poetry to manage the dependencies. The setup has been tested on macOS and Linux. If you are not familiar with poetry, we suggest installing it with pipx first by following the official documentation.
After installing poetry, you can install the dependencies by running the following command:

poetry install --with test

You can then test the installation by running the following command. It will run all the tests in the project.

poetry run pytest

To prepare the dataset for data validation, we need to preprocess the data in two steps:
- Error Injection: Inject errors into the dataset to simulate real-world data quality issues.
- Script Execution: Execute the downstream scripts to generate the ground truth for data validation.
We provide all the preprocessed data in the data_processed/ folder for paper reviewing. If you want to reproduce the results, you need to delete the existing preprocessed data first by running the following command:

rm -r data_processed/*

To inject errors into the dataset, run the following command:

poetry run python ./workflow/s1_preprocessing/error_injection/main.py \
--dataset-option "all" \
--downstream-task-option "all"

This command will inject errors into the dataset in the data/ folder and then save the corrupted dataset in the data_processed/ folder. The predefined error injection configurations can be found in data/<dataset>/errors/. You can also customize the error injection by modifying or adding error injection scripts in the same folder. Please make sure the name of the error injection script starts with <downstream-task>_, e.g., ml_inference_classification_1.yaml.
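For orientation, the layout below illustrates this naming convention (file names other than the example mentioned above are placeholders):

```text
data/<dataset>/errors/
├── ml_inference_classification_1.yaml   # example mentioned above
└── <downstream-task>_*.yaml             # custom configs follow the same prefix convention
```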
To execute the downstream scripts, run the following command:

poetry run python ./workflow/s1_preprocessing/scripts_execution/main.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

This command will execute the downstream scripts in data/<dataset>/scripts/ and then save the results in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/ folder.
To detect the accessed columns, run the following command:

poetry run python ./workflow/s2_experiments/t1_column_access_detection/run_pipeline.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

To generate data validation rules, run the following commands:
poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_deequ_dv.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_langchain_tadv.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

To evaluate the performance of the scripts in the downstream tasks, run the following command:
poetry run python ./workflow/s3_evaluation/evaluation/calculate_code_performance.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/output_validation/ folder.
To validate the generated constraints, run the following command:

poetry run python ./workflow/s3_evaluation/evaluation/validate_constraints.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/constraints_validation/ folder.
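For reference, after running all of the steps above, each data_processed/<dataset>/<downstream-task>/<processed-data-label>/ folder should contain (layout abbreviated):

```text
data_processed/<dataset>/<downstream-task>/<processed-data-label>/
├── ...                      # outputs of the downstream-script execution step
├── output_validation/       # results of calculate_code_performance.py
└── constraints_validation/  # results of validate_constraints.py
```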
Now, you can aggregate the evaluation results by running the following command:
poetry run python ./workflow/s3_evaluation/evaluation/main.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"