This repository contains the source code for the paper: Adapting Language Models to Text Matching based Recommendation Systems.
TASTE$^+$ enhances sequential recommendation by adapting language models to text matching. It introduces two pretraining tasks, Masked Item Prediction and Next Item Prediction, which allow the model to capture richer matching signals from user–item sequences. By balancing attention between prompt tokens and item IDs, TASTE$^+$ builds more accurate user representations and improves recommendation performance on Yelp and Amazon datasets, demonstrating the effectiveness of language model pretraining for text matching-based recommendation.
conda create -n taste-plus python=3.8
conda activate taste-plus
pip install -r requirements.txt

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

This section provides a step-by-step guide to reproduce the TASTE$^+$ results.
We utilize the Amazon Product 2014 and Yelp 2020 datasets. Download the original data from:
The following example uses the Amazon Beauty dataset.
wget -c http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Beauty.csv
wget -c http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Beauty.json.gz
gzip -d meta_Beauty.json.gz
mkdir data
mv ratings_Beauty.csv data/
mv meta_Beauty.json data/
mkdir dataset
bash scripts/process_origin.sh
bash scripts/process_beauty.sh

Before proceeding, process all four original datasets as described above to obtain the atomic files. Then, construct the mixed pretraining data for TASTE$^+$ according to your desired proportions.
bash scripts/gen_dataset.sh
bash scripts/gen_pretrain_items.sh

To construct the TASTE$^+$ pretraining data, we sample the four datasets in a balanced way. The target size is the number of training samples in the largest dataset; datasets with fewer training samples are randomly supplemented until they reach it:
python src/sample_train.py

Similarly, the validation set is sized by the dataset with the fewest samples: each dataset contributes that many samples to the validation set.
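The balanced sampling described here can be sketched as follows. This is only an illustration under stated assumptions, not the logic of `src/sample_train.py` or `src/sample_valid.py`: the function names `balance_train` and `balance_valid` are hypothetical, each dataset is assumed to be a non-empty in-memory list, and "randomly supplemented" is assumed to mean drawing extra samples with replacement.

```python
import random


def balance_train(datasets, seed=0):
    """Bring every dataset up to the size of the largest one.

    Hypothetical helper: extra samples are drawn with replacement
    from the same dataset (an assumption about "supplementing").
    """
    rng = random.Random(seed)
    target = max(len(samples) for samples in datasets.values())
    balanced = {}
    for name, samples in datasets.items():
        # Keep all original samples, then pad with random re-draws.
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[name] = list(samples) + extra
    return balanced


def balance_valid(datasets, seed=0):
    """Cut every dataset down to the size of the smallest one.

    Hypothetical helper: samples are drawn without replacement.
    """
    rng = random.Random(seed)
    target = min(len(samples) for samples in datasets.values())
    return {
        name: rng.sample(list(samples), target)
        for name, samples in datasets.items()
    }
```

Fixing the seed keeps the sampled mixture reproducible across runs, which matters when comparing pretraining checkpoints.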
python src/sample_valid.py
bash scripts/build_pretrain.sh
python src/merge_json.py

Pretrain the T5 model using the next item prediction (NIP) and masked item prediction (MIP) tasks.
bash scripts/pretrain.sh

Adjust the training parameters to fit your GPU. Select the checkpoint with the lowest evaluation loss as the final pretrained checkpoint.
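Picking the checkpoint with the lowest evaluation loss can be automated. The sketch below assumes the pretraining script uses the Hugging Face `Trainer`, which writes a `trainer_state.json` whose `log_history` list holds entries with `eval_loss` and `step`. Whether `scripts/pretrain.sh` produces that file is an assumption, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def best_checkpoint(log_history):
    """Return (step, eval_loss) of the entry with the lowest eval loss.

    `log_history` is a list of dicts; entries without both an
    `eval_loss` and a `step` key (e.g. training-loss logs) are skipped.
    """
    evals = [e for e in log_history if "eval_loss" in e and "step" in e]
    if not evals:
        raise ValueError("no evaluation entries found in log history")
    best = min(evals, key=lambda e: e["eval_loss"])
    return best["step"], best["eval_loss"]


def best_checkpoint_from_file(state_path):
    """Read a trainer_state.json file (assumed layout) and pick the best step."""
    state = json.loads(Path(state_path).read_text())
    return best_checkpoint(state["log_history"])
```

With the `Trainer` defaults, the returned step corresponds to a directory named `checkpoint-<step>` under the output directory.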
bash scripts/gen_train_items.sh
bash scripts/build_train.sh
bash scripts/train_ft.sh
bash scripts/eval_ft.sh
bash scripts/test_ft.sh

For questions, suggestions, or bug reports, please contact:
