This repository contains the implementation of the entity matching solution presented in "AnyMatch – Efficient Zero-Shot Entity Matching
with a Small Language Model".
Create a conda environment with the provided file, then activate it:
```bash
conda env create -f environment.yml
conda activate anymatch
```

Download the nine raw datasets from their respective sources and place them in the `data/raw` directory. For more detailed instructions, refer to `data/raw/readme.md`.
Follow the preparation steps in the data/preprocess.ipynb notebook to preprocess the raw data and generate the record-level and attribute-level datasets.
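For intuition, the sketch below is a hypothetical illustration of the two granularities, not the notebook's actual code; the attribute names and the way labels are carried over to attribute pairs are assumptions made for this example. A record-level pair compares two complete records, while attribute-level pairs compare individual attribute values.

```python
import pandas as pd

# Hypothetical record-level pair: two complete records plus a match label.
left = {"title": "iphone 12", "brand": "apple", "price": "699"}
right = {"title": "apple iphone 12 64gb", "brand": "apple", "price": "699.00"}
record_pair = {"left": left, "right": right, "label": 1}

# Hypothetical attribute-level pairs: one pair per shared attribute.
# (How the notebook actually assigns labels to attribute pairs may differ.)
attribute_pairs = pd.DataFrame([
    {"attribute": attr, "left": left[attr], "right": right[attr], "label": record_pair["label"]}
    for attr in left.keys() & right.keys()
])
print(attribute_pairs)
```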
The following script trains a matcher with the DATASET_NAME dataset left out (leave-one-out) and evaluates the predictive quality of the trained model:
```bash
python loo.py \
    --seed 42 \
    --base_model gpt2 \
    --leaved_dataset_name DATASET_NAME \
    --serialization_mode mode1 \
    --train_data attr+row \
    --patience_start 20
```

The inference throughput experiment can be run using the following script:
```bash
python throughput.py
```

The leave-one-out experiment can additionally be varied along three dimensions:

- the choice of base model
```bash
python loo.py --leaved_dataset_name DATASET_NAME --base_model t5-base
python loo.py --leaved_dataset_name DATASET_NAME --base_model bert-base
```

- the choice of serialization mode
```bash
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode4
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode2
python loo.py --leaved_dataset_name DATASET_NAME --serialization_mode mode3
```

- the choice of training data generation strategy
```bash
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func automl_filter --train_data attr+row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data attr+row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data attr-row
python loo.py --leaved_dataset_name DATASET_NAME --row_sample_func one_pos_two_neg --train_data row
```

The AutoML filter and the label imbalance handling are implemented in the `automl_filter` method in `utils/data_utils.py`. For simplicity in conducting the experiments, we directly load the results of a previously applied AutoML model. Details on using such a model can be found in `data/preprocess.ipynb`.
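To convey the idea behind this filter, here is a hypothetical sketch: the real logic lives in `automl_filter` in `utils/data_utils.py`, and the column names `label` and `automl_score`, the threshold, and the balancing strategy below are assumptions made for this example. The sketch discards candidate training pairs whose label disagrees with the precomputed AutoML match probability and then balances the remaining positives and negatives.

```python
import pandas as pd

def automl_filter_sketch(pairs: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical sketch of an AutoML-based filter with label balancing.

    Assumes `pairs` has a binary `label` column and an `automl_score` column
    holding the precomputed match probability of an AutoML model.
    """
    # Drop pairs where the AutoML prediction contradicts the label,
    # treating them as likely noise or uninformative examples.
    predicted = (pairs["automl_score"] >= threshold).astype(int)
    kept = pairs[predicted == pairs["label"]]

    # Simple label-imbalance handling: downsample the majority class.
    pos = kept[kept["label"] == 1]
    neg = kept[kept["label"] == 0]
    n = min(len(pos), len(neg))
    balanced = pd.concat([pos.sample(n, random_state=42),
                          neg.sample(n, random_state=42)])
    return balanced.reset_index(drop=True)
```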
The serialization step is implemented in the `df_serializer` method in `utils/data_utils.py`; different modes can be selected by specifying the `mode` argument.
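As a rough illustration of what serialization does, the sketch below flattens a record pair into a single text sequence that the language model can consume. It is a hypothetical example: the concrete templates and mode names used by `df_serializer` may differ.

```python
def serialize_pair_sketch(left: dict, right: dict, mode: str = "mode1") -> str:
    """Hypothetical serialization of one record pair into a single text sequence.

    The concrete templates and mode names of df_serializer in
    utils/data_utils.py may differ; this only illustrates how a mode
    argument can switch between formats.
    """
    if mode == "mode1":
        # Plain concatenation of the attribute values of both records.
        left_text = " ".join(str(v) for v in left.values())
        right_text = " ".join(str(v) for v in right.values())
    else:
        # Explicitly tagged attributes, in the style of COL/VAL serializations.
        left_text = " ".join(f"COL {k} VAL {v}" for k, v in left.items())
        right_text = " ".join(f"COL {k} VAL {v}" for k, v in right.items())
    return f"Record A: {left_text} Record B: {right_text}"


print(serialize_pair_sketch(
    {"title": "iphone 12", "price": "699"},
    {"title": "apple iphone 12 64gb", "price": "699.00"},
    mode="mode1",
))
```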
The training and inference routines can be found in `utils/train_eval.py`.
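As a rough, assumption-laden illustration (not the code in `utils/train_eval.py`), the snippet below shows how a small model such as GPT-2 can be wrapped with a binary classification head via Hugging Face `transformers` and queried on a serialized record pair. A model fine-tuned through `loo.py` would be used in place of the freshly initialized head shown here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical illustration of inference with a GPT-2-based matcher; the actual
# training/evaluation loop lives in utils/train_eval.py and may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

pair_text = "Record A: iphone 12 699 Record B: apple iphone 12 64gb 699.00"
inputs = tokenizer(pair_text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits
print("match" if logits.argmax(dim=-1).item() == 1 else "non-match")
```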
For the baseline implementation and data analysis, please check out the following repos:
- StringSim: found in `string_similarity.py`
- ZeroER: https://github.com/mohamedyd/rein-benchmark/tree/master/cleaners/zeroer
- Ditto: https://github.com/megagonlabs/ditto
- Jellyfish: https://huggingface.co/NECOUDBFM/Jellyfish-13B
- MatchGPT: https://github.com/wbsg-uni-mannheim/MatchGPT/tree/main/LLMForEM