This repository contains the implementation of the entity matching solution presented in "AnyMatch – Efficient Zero-Shot Entity Matching
with a Small Language Model".
Create a conda environment with the provided file, then activate it:
```bash
conda env create -f environment.yml
conda activate anymatch
```

Download the nine raw datasets from their respective sources and place them in the `data/raw` directory. For more detailed instructions, refer to `data/raw/readme.md`.
Follow the preparation steps in the data/preprocess.ipynb notebook to preprocess the raw data and generate the record-level and attribute-level datasets.
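For intuition, the sketch below is a hypothetical illustration of the two granularities, not the notebook's actual code; the attribute names and the way labels are carried over to attribute pairs are assumptions made for this example. A record-level pair compares two complete records, while attribute-level pairs compare individual attribute values.

```python
import pandas as pd

# Hypothetical record-level pair: two complete records plus a match label.
left = {"title": "iphone 12", "brand": "apple", "price": "699"}
right = {"title": "apple iphone 12 64gb", "brand": "apple", "price": "699.00"}
record_pair = {"left": left, "right": right, "label": 1}

# Hypothetical attribute-level pairs: one pair per shared attribute.
# (How the notebook actually assigns labels to attribute pairs may differ.)
attribute_pairs = pd.DataFrame([
    {"attribute": attr, "left": left[attr], "right": right[attr], "label": record_pair["label"]}
    for attr in left.keys() & right.keys()
])
print(attribute_pairs)
```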
The following script trains a matcher with the DATASET_NAME dataset left out (leave-one-out) and evaluates the predictive quality of the trained model:
```bash
python loo.py \
    --seed 42 \
    --base_model gpt2 \
    --leaved_dataset_name DATASET_NAME \
    --serialization_mode mode1 \
    --train_data attr+row \
    --patience_start 20
```

The inference throughput experiment can be run using the following script:
```bash
python throughput.py
```

The leave-one-out experiment can additionally be varied along three dimensions:

- the choice of base model
```bash
python loo.py --leaved_dataset_name DATASET_NAME --base_model t5-base
python loo.py --leaved_dataset_name DATASET_NAME --base_model bert-base
```

- the choice of serialization mode
```bash
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode4
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode2
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode3
```

- the choice of training data generation strategy
```bash
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func automl_filter --train_data attr+row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data attr+row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data attr-row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data row
```

The AutoML filter and the label imbalance handling are implemented in the `automl_filter` method in `utils/data_utils.py`. For simplicity in conducting the experiments, we directly load the results of a previously applied AutoML model. Details on using such a model can be found in `data/preprocess.ipynb`.
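To convey the idea behind this filter, here is a hypothetical sketch: the real logic lives in `automl_filter` in `utils/data_utils.py`, and the column names `label` and `automl_score`, the threshold, and the balancing strategy below are assumptions made for this example. The sketch discards candidate training pairs whose label disagrees with the precomputed AutoML match probability and then balances the remaining positives and negatives.

```python
import pandas as pd

def automl_filter_sketch(pairs: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical sketch of an AutoML-based filter with label balancing.

    Assumes `pairs` has a binary `label` column and an `automl_score` column
    holding the precomputed match probability of an AutoML model.
    """
    # Drop pairs where the AutoML prediction contradicts the label,
    # treating them as likely noise or uninformative examples.
    predicted = (pairs["automl_score"] >= threshold).astype(int)
    kept = pairs[predicted == pairs["label"]]

    # Simple label-imbalance handling: downsample the majority class.
    pos = kept[kept["label"] == 1]
    neg = kept[kept["label"] == 0]
    n = min(len(pos), len(neg))
    balanced = pd.concat([pos.sample(n, random_state=42),
                          neg.sample(n, random_state=42)])
    return balanced.reset_index(drop=True)
```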
The serialization step is implemented in the `df_serializer` method in `utils/data_utils.py`; different modes can be selected by specifying the `mode` argument.
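As a rough illustration of what serialization does, the sketch below flattens a record pair into a single text sequence that the language model can consume. It is a hypothetical example: the concrete templates and mode names used by `df_serializer` may differ.

```python
def serialize_pair_sketch(left: dict, right: dict, mode: str = "mode1") -> str:
    """Hypothetical serialization of one record pair into a single text sequence.

    The concrete templates and mode names of df_serializer in
    utils/data_utils.py may differ; this only illustrates how a mode
    argument can switch between formats.
    """
    if mode == "mode1":
        # Plain concatenation of the attribute values of both records.
        left_text = " ".join(str(v) for v in left.values())
        right_text = " ".join(str(v) for v in right.values())
    else:
        # Explicitly tagged attributes, in the style of COL/VAL serializations.
        left_text = " ".join(f"COL {k} VAL {v}" for k, v in left.items())
        right_text = " ".join(f"COL {k} VAL {v}" for k, v in right.items())
    return f"Record A: {left_text} Record B: {right_text}"


print(serialize_pair_sketch(
    {"title": "iphone 12", "price": "699"},
    {"title": "apple iphone 12 64gb", "price": "699.00"},
    mode="mode1",
))
```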
The training and inference routines can be found in `utils/train_eval.py`.
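As a rough, assumption-laden illustration (not the code in `utils/train_eval.py`), the snippet below shows how a small model such as GPT-2 can be wrapped with a binary classification head via Hugging Face `transformers` and queried on a serialized record pair. A model fine-tuned through `loo.py` would be used in place of the freshly initialized head shown here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical illustration of inference with a GPT-2-based matcher; the actual
# training/evaluation loop lives in utils/train_eval.py and may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

pair_text = "Record A: iphone 12 699 Record B: apple iphone 12 64gb 699.00"
inputs = tokenizer(pair_text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits
print("match" if logits.argmax(dim=-1).item() == 1 else "non-match")
```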
For the baseline implementation and data analysis, please check out the following repos:
- StringSim: found in `string_similarity.py`
- ZeroER: https://github.com/mohamedyd/rein-benchmark/tree/master/cleaners/zeroer
- Ditto: https://github.com/megagonlabs/ditto
- Jellyfish: https://huggingface.co/NECOUDBFM/Jellyfish-13B
- MatchGPT: https://github.com/wbsg-uni-mannheim/MatchGPT/tree/main/LLMForEM