This version of TADV reproduces the experiments from the paper Towards Task-aware Data Validation, presented at the DEEM workshop, SIGMOD 2025. Please check the paper for more details.
TADV is a framework that leverages Language Models to generate data validation rules based on downstream tasks.
Here are quick links to the experiment sections in the paper:
| Section | Source Code |
|---|---|
| 4.1 | |
| 4.2 | |
| 4.3 | |
The project consists of the following modules:
- Error Injection – Provides APIs for injecting errors into datasets, enabling robustness testing for validation methods.
- Runtime Environments – Defines execution environments where datasets are evaluated in the context of downstream queries or machine learning pipelines.
- LLM – Contains classes for interacting with LLM APIs to generate data validation rules (a simplified, hypothetical sketch of this flow is shown after this list). The process follows three key steps:
- Column Access Detection – Identifying relevant columns based on downstream context.
- Assumption Generation – Inferring data assumptions from provided context and dataset properties.
- Rule Generation – Producing executable validation rules to ensure data quality.
- Inspector – Extracts dataset metadata, including schema and statistics, to aid LLMs in generating informed validation rules.
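For orientation, the sketch below shows how these pieces might fit together in code. It is a minimal, hypothetical illustration: all function names are invented for this example and do not match the repository's identifiers, and the LLM calls of steps 2 and 3 are replaced with trivial heuristics so that the snippet runs on its own.

```python
# Hypothetical end-to-end sketch of the module interplay described above.
# All names are illustrative and do NOT match the repository's identifiers;
# steps 2 and 3, which TADV delegates to an LLM, are stubbed with heuristics.
import pandas as pd

def inspect_dataset(df: pd.DataFrame) -> dict:
    """Inspector: collect schema and simple statistics per column."""
    return {
        col: {"dtype": str(df[col].dtype),
              "completeness": float(1.0 - df[col].isna().mean())}
        for col in df.columns
    }

def detect_column_access(script_source: str, metadata: dict) -> list:
    """Step 1 - column access detection (naive text match against the schema)."""
    return [col for col in metadata if col in script_source]

def generate_assumptions(columns: list, metadata: dict) -> dict:
    """Step 2 - assumption generation from dataset properties."""
    return {col: {"min_completeness": metadata[col]["completeness"]} for col in columns}

def generate_rules(assumptions: dict) -> list:
    """Step 3 - rule generation: emit executable-style validation checks."""
    return [f"check_completeness('{col}', threshold={a['min_completeness']:.2f})"
            for col, a in assumptions.items()]

if __name__ == "__main__":
    df = pd.DataFrame({"age": [25, 31, None], "income": [50_000, 64_000, 58_000]})
    downstream_script = "model.fit(df[['age', 'income']], y)"
    metadata = inspect_dataset(df)
    accessed = detect_column_access(downstream_script, metadata)
    print(generate_rules(generate_assumptions(accessed, metadata)))
```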
We provide the following workflow for evaluating the data validation capabilities of LLMs compared to non-LLM methods. You can find the detailed implementation in the workflow directory.
To run the experiments, you need to create a .env file in the root directory of the project. The .env file should contain the following environment variables:

HF_TOKEN=***
OPENAI_API_KEY=***
SPARK_VERSION=3.5

Please replace *** with your own API keys.
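If you want to double-check that the variables are picked up, a quick sanity check like the one below works. Note that reading the file via python-dotenv here is just one common approach and is not necessarily how TADV loads the file internally.

```python
# Sanity check for the .env file (uses python-dotenv; whether TADV itself
# loads the file this way is an assumption, this snippet is only a check).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("HF_TOKEN", "OPENAI_API_KEY", "SPARK_VERSION"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```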
We use poetry to manage the dependencies. The setup has been tested on macOS and Linux. If you are not familiar with poetry, we suggest installing it with pipx first by following the official documentation.
After installing poetry, you can install the dependencies by running the following command:

poetry install --with test

You can then test the installation by running the following command. It will run all the tests in the project.

poetry run pytest

To prepare the dataset for data validation, we need to preprocess the data in two steps:
- Error Injection: Inject errors into the dataset to simulate real-world data quality issues.
- Script Execution: Execute the downstream scripts to generate the ground truth for data validation.
We provide all the preprocessed data in the data_processed/ folder for paper reviewing. If you want to reproduce the results, you need to delete the existing preprocessed data first by running the following command:

rm -r data_processed/*

To inject errors into the dataset, run the following command:

poetry run python ./workflow/s1_preprocessing/error_injection/main.py \
--dataset-option "all" \
--downstream-task-option "all"

This command will inject errors into the dataset in the data/ folder and then save the corrupted dataset in the data_processed/ folder. The predefined error injection configurations can be found in data/<dataset>/errors/. You can also customize the error injection by modifying or adding error injection scripts in the same folder. Please make sure the name of the error injection script starts with <downstream-task>_, e.g., ml_inference_classification_1.yaml.
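For orientation, the layout below illustrates this naming convention (file names other than the example mentioned above are placeholders):

```text
data/<dataset>/errors/
├── ml_inference_classification_1.yaml   # example mentioned above
└── <downstream-task>_*.yaml             # custom configs follow the same prefix convention
```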
To execute the downstream scripts, run the following command:

poetry run python ./workflow/s1_preprocessing/scripts_execution/main.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

This command will execute the downstream scripts in data/<dataset>/scripts/ and then save the results in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/ folder.
To detect the accessed columns, run the following command:

poetry run python ./workflow/s2_experiments/t1_column_access_detection/run_pipeline.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

To generate data validation rules, run the following commands:
poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_deequ_dv.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_langchain_tadv.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

To evaluate the performance of the scripts in the downstream tasks, run the following command:
poetry run python ./workflow/s3_evaluation/evaluation/calculate_code_performance.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/output_validation/ folder.
To validate the generated constraints, run the following command:

poetry run python ./workflow/s3_evaluation/evaluation/validate_constraints.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/constraints_validation/ folder.
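For reference, after running all of the steps above, each data_processed/<dataset>/<downstream-task>/<processed-data-label>/ folder should contain (layout abbreviated):

```text
data_processed/<dataset>/<downstream-task>/<processed-data-label>/
├── ...                      # outputs of the downstream-script execution step
├── output_validation/       # results of calculate_code_performance.py
└── constraints_validation/  # results of validate_constraints.py
```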
Now, you can aggregate the evaluation results by running the following command:
poetry run python ./workflow/s3_evaluation/evaluation/main.py \
--dataset-option "all" \
--downstream-task-option "all" \
--processed-data-label "0"