
This version of TADV reproduces the experiments from the paper Towards Task-aware Data Validation, presented at the DEEM workshop, SIGMOD 2025. Please check the paper for more details.

Task-aware Data Validation (TADV)


TADV is a framework that leverages large language models (LLMs) to generate data validation rules based on downstream tasks.

Experiment Reproduction

Quick links to the source code for each experiment section in the paper:

  • Section 4.1
  • Section 4.2
  • Section 4.3

Project Structure

The project consists of the following modules:

  • Error Injection – Provides APIs for injecting errors into datasets, enabling robustness testing for validation methods.
  • Runtime Environments – Defines execution environments where datasets are evaluated in the context of downstream queries or machine learning pipelines.
  • LLM – Contains classes for interacting with LLM APIs to generate data validation rules. This process follows three key steps (see the sketch after this list):
    1. Column Access Detection – Identifying relevant columns based on downstream context.
    2. Assumption Generation – Inferring data assumptions from provided context and dataset properties.
    3. Rule Generation – Producing executable validation rules to ensure data quality.
  • Inspector – Extracts dataset metadata, including schema and statistics, to aid LLMs in generating informed validation rules.
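
To make the flow concrete, here is a minimal, hypothetical sketch of how the three LLM steps compose. Every name below is an illustrative placeholder rather than the actual TADV API, and the placeholder bodies stand in for what are LLM calls in TADV:

# Illustrative sketch only: these names are hypothetical placeholders,
# not the real TADV API, and each step is an LLM call in TADV rather
# than the simple stand-in logic below.
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    # Schema and statistics as extracted by the Inspector module,
    # e.g. {"age": {"dtype": "int", "min": 0, "max": 120}}.
    columns: dict


def detect_column_access(task_context: str, profile: DatasetProfile) -> list:
    # Step 1: decide which columns the downstream task actually reads.
    return [name for name in profile.columns if name in task_context]


def generate_assumptions(columns: list, profile: DatasetProfile) -> list:
    # Step 2: infer data assumptions for the accessed columns.
    return [f"column '{name}' is complete and stays within its observed range" for name in columns]


def generate_rules(assumptions: list) -> list:
    # Step 3: turn assumptions into executable validation rules.
    return [f"CHECK: {assumption}" for assumption in assumptions]


if __name__ == "__main__":
    profile = DatasetProfile(columns={"age": {"dtype": "int"}, "income": {"dtype": "float"}})
    accessed = detect_column_access("predict income from age", profile)
    print(generate_rules(generate_assumptions(accessed, profile)))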

Experiment Workflow

We provide the following workflow for evaluating the data validation capabilities of LLMs compared to non-LLM methods. You can find the detailed implementation in the workflow directory.

Step 0: Environment Setup

Create a .env file

To run the experiments, you need to create a .env file in the root directory of the project. The .env file should contain the following environment variables:

HF_TOKEN=***
OPENAI_API_KEY=***
SPARK_VERSION=3.5

Please replace *** with your own credentials (your Hugging Face token and OpenAI API key).
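
For reference, the experiments read these values from the environment at runtime. A minimal sketch of loading them with python-dotenv is shown below; whether TADV itself uses python-dotenv or another loader is an assumption here:

# Minimal sketch: load the .env file and read the variables listed above.
# Assumes the python-dotenv package; TADV may load the file differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

hf_token = os.getenv("HF_TOKEN")
openai_api_key = os.getenv("OPENAI_API_KEY")
spark_version = os.getenv("SPARK_VERSION", "3.5")

if not (hf_token and openai_api_key):
    raise RuntimeError("HF_TOKEN and OPENAI_API_KEY must be set in the .env file")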

Install the package

We use Poetry to manage the dependencies. The setup has been tested on macOS and Linux. If you are not familiar with Poetry, we suggest installing it with pipx first by following the official documentation.

After installing poetry, you can install the dependencies by running the following command:

poetry install --with test

You can then verify the installation by running the following command, which runs all the tests in the project:

poetry run pytest

Step 1: Preprocessing

To prepare the dataset for data validation, we need to preprocess the data in two steps:

  • Error Injection: Inject errors into the dataset to simulate real-world data quality issues.
  • Script Execution: Execute the downstream scripts to generate the ground truth for data validation.

1.1 Remove Existing Preprocessed Data

We provide all the preprocessed data in the data_processed/ folder for paper review. If you want to reproduce the results, first delete the existing preprocessed data by running the following command:

rm -r data_processed/*

1.2 Error Injection

To inject errors into the dataset, run the following command:

poetry run python ./workflow/s1_preprocessing/error_injection/main.py \
  --dataset-option "all" \
  --downstream-task-option "all"

This command will inject errors into the datasets in the data/ folder and save the corrupted datasets in the data_processed/ folder. The predefined error injection configurations can be found in data/<dataset>/errors/. You can also customize the error injection by modifying or adding configuration scripts in the same folder. Please make sure the name of each error injection script starts with <downstream-task>_, e.g., ml_inference_classification_1.yaml.
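
As a purely illustrative sketch of this naming convention (not the project's actual configuration loader), the snippet below lists and parses the error injection configurations that belong to one downstream task; the dataset name and any fields inside the YAML files are assumptions:

# Illustrative only: lists the error injection configurations for one
# downstream task, relying on the <downstream-task>_*.yaml naming
# convention. The dataset name is a hypothetical example, and this is
# not TADV's actual configuration loader.
from pathlib import Path

import yaml  # PyYAML

dataset = "my_dataset"                  # hypothetical dataset folder under data/
task = "ml_inference_classification"    # downstream task used as filename prefix

errors_dir = Path("data") / dataset / "errors"
for config_path in sorted(errors_dir.glob(f"{task}_*.yaml")):
    config = yaml.safe_load(config_path.read_text())
    print(config_path.name, "->", type(config).__name__)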

1.3 Script Execution

To execute the downstream scripts, run the following command:

poetry run python ./workflow/s1_preprocessing/scripts_execution/main.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"

This command will execute the downstream scripts in data/<dataset>/scripts/ and then save the results in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/ folder.

Step 2: Data Validation Rule Generation

2.1 Column Access Detection

To detect the accessed columns, run the following command:

poetry run python ./workflow/s2_experiments/t1_column_access_detection/run_pipeline.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"

2.2 End-to-End Data Validation Rule Generation

To generate data validation rules with Deequ and with TADV (via LangChain), run the following commands:

poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_deequ_dv.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"
poetry run python ./workflow/s2_experiments/t2_constraint_inference/run_langchain_tadv.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"

Step 3: Evaluation

3.1 Script Performance Evaluation

To evaluate the performance of the scripts on the downstream tasks, run the following command:

poetry run python ./workflow/s3_evaluation/evaluation/calculate_code_performance.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/output_validation/ folder.

To validate the generated data validation rules, run the following command:

poetry run python ./workflow/s3_evaluation/evaluation/validate_constraints.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"

The evaluation results will be saved in the data_processed/<dataset>/<downstream-task>/<processed-data-label>/constraints_validation/ folder.

Now, you can aggregate the evaluation results by running the following command:

poetry run python ./workflow/s3_evaluation/evaluation/main.py \
  --dataset-option "all" \
  --downstream-task-option "all" \
  --processed-data-label "0"
