
DynamicER

This is the official repository for our EMNLP 2024 paper:
DynamicER: Resolving Emerging Mentions to Dynamic Entities for RAG

[Figure 1]

If you find our resources useful, please cite our work:

@inproceedings{kim-etal-2024-dynamicer,
  title="DynamicER: Resolving Emerging Mentions to Dynamic Entities for RAG",
  author="Kim, Jinyoung and Ko, Dayoon and Kim, Gunhee",
  booktitle="Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  year="2024"
}

Dataset

Note for Dataset Users

  • Usage of DynamicER is subject to Tumblr Terms of Service and User Guidelines. Users must agree to and comply with the Tumblr Application Developer and API License Agreement.

  • DynamicER is intended solely for non-commercial research purposes. Commercial or for-profit uses are restricted; DynamicER must not be used to train models for deployment in production systems associated with business or government agency products. For more details, please refer to the Ethics Statement in the paper.

Dataset File

You can download the dataset file from the link below:

Dataset Configuration

Please check DATA_CONFIG.md for configuration details of the dataset files.

TempCCA

Environment Setup

We provide a conda environment for setup. The following commands create an environment with all dependencies and activate it:

conda env create -f environment.yaml
conda activate tempcca

Training and Inference

For training (run once per timestep: 230820, then 231020), use the following command; a sketch of the full sequential loop follows the command:

PYTHONPATH='.' python blink/biencoder/proposed_train.py \
  --bert_model=bert-base-cased \
  --data_path=data/tempcca/processed/{timestep} \
  --output_path=models/trained/tempcca_mst/{timestep} \
  --pickle_src_path=models/trained/tempcca/{timestep} \
  --path_to_model=models/trained/tempcca_mst/{prev_timestep}/epoch_4/pytorch_model.bin \
  --num_train_epochs=5 \
  --train_batch_size=32 \
  --gradient_accumulation_steps=4 \
  --eval_interval=10000 \
  --pos_neg_loss \
  --force_exact_search \
  --embed_batch_size=1000 \
  --data_parallel \
  --path_to_prev_gold=models/trained/tempcca_mst/{prev_timestep} \
  --alpha=0.8
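Since each run warm-starts from the previous timestep's checkpoint (--path_to_model) and gold clusters (--path_to_prev_gold), the two training runs can be chained in a small driver script. A minimal bash sketch, assuming the paths above; the initial PREV value is a hypothetical placeholder for whatever checkpoint bootstraps 230820:

PREV={prev_timestep}   # hypothetical: the checkpoint directory that bootstraps 230820
for T in 230820 231020; do
  PYTHONPATH='.' python blink/biencoder/proposed_train.py \
    --bert_model=bert-base-cased \
    --data_path=data/tempcca/processed/${T} \
    --output_path=models/trained/tempcca_mst/${T} \
    --pickle_src_path=models/trained/tempcca/${T} \
    --path_to_model=models/trained/tempcca_mst/${PREV}/epoch_4/pytorch_model.bin \
    --num_train_epochs=5 --train_batch_size=32 --gradient_accumulation_steps=4 \
    --eval_interval=10000 --pos_neg_loss --force_exact_search \
    --embed_batch_size=1000 --data_parallel \
    --path_to_prev_gold=models/trained/tempcca_mst/${PREV} \
    --alpha=0.8
  PREV=${T}   # the next timestep warm-starts from this run's output
done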

For inference (run once per timestep: 231220, 240220, 240420), use the following command; a loop over all three timesteps is sketched after the command:

PYTHONPATH='.' python blink/biencoder/eval_cluster_linking.py \
  --bert_model=bert-base-cased \
  --data_path=data/tempcca/processed/{timestep} \
  --output_path=models/trained/tempcca_mst/eval/{timestep} \
  --pickle_src_path=models/trained/tempcca/eval/{timestep} \
  --path_to_model=models/trained/tempcca_mst/231020/epoch_4/pytorch_model.bin \
  --recall_k=64 \
  --embed_batch_size=1000 \
  --force_exact_search \
  --data_parallel \
  --path_to_prev_gold=models/trained/tempcca_mst/231020 \
  --alpha=0.8
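All three evaluation timesteps use the final training checkpoint (231020), so only {timestep} varies and the command loops directly. A minimal bash sketch under that assumption:

for T in 231220 240220 240420; do
  PYTHONPATH='.' python blink/biencoder/eval_cluster_linking.py \
    --bert_model=bert-base-cased \
    --data_path=data/tempcca/processed/${T} \
    --output_path=models/trained/tempcca_mst/eval/${T} \
    --pickle_src_path=models/trained/tempcca/eval/${T} \
    --path_to_model=models/trained/tempcca_mst/231020/epoch_4/pytorch_model.bin \
    --recall_k=64 --embed_batch_size=1000 --force_exact_search --data_parallel \
    --path_to_prev_gold=models/trained/tempcca_mst/231020 \
    --alpha=0.8
done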

For entity resolution in entity-centric QA (run once per timestep: 231220, 240220, 240420), use the following command; a sketch that resolves the {unix_timestamp} placeholder automatically follows the command:

PYTHONPATH='.' python blink/crossencoder/qa_inference.py \
  --data_path=data/tempcca/processed/{timestep} \
  --output_path=models/trained/tempcca/{timestep}/qa_prediction/pred \
  --pickle_src_path=models/trained/tempcca/{timestep}/qa_prediction \
  --path_to_biencoder_model=models/trained/tempcca_mst/231020/epoch_4/pytorch_model.bin \
  --bert_model=bert-base-cased \
  --data_parallel \
  --scoring_batch_size=64 \
  --save_topk_result \
  --path_to_prediction=models/trained/tempcca_mst/eval/231220 \
  --pred_json=eval_results_{unix_timestamp}-directed-0.json \
  --alpha=0.8
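The --pred_json filename embeds a Unix timestamp written by the inference step above, so it is easier to glob for the file than to type the timestamp by hand. A minimal bash sketch for the 231220 timestep, assuming the inference output landed under the --output_path shown above and kept the eval_results_*-directed-0.json naming:

T=231220
PRED_DIR=models/trained/tempcca_mst/eval/${T}
# pick the most recent prediction file; the pattern mirrors the
# eval_results_{unix_timestamp}-directed-0.json template above
PRED_JSON=$(basename "$(ls -t "${PRED_DIR}"/eval_results_*-directed-0.json | head -n 1)")

PYTHONPATH='.' python blink/crossencoder/qa_inference.py \
  --data_path=data/tempcca/processed/${T} \
  --output_path=models/trained/tempcca/${T}/qa_prediction/pred \
  --pickle_src_path=models/trained/tempcca/${T}/qa_prediction \
  --path_to_biencoder_model=models/trained/tempcca_mst/231020/epoch_4/pytorch_model.bin \
  --bert_model=bert-base-cased \
  --data_parallel --scoring_batch_size=64 --save_topk_result \
  --path_to_prediction=${PRED_DIR} \
  --pred_json=${PRED_JSON} \
  --alpha=0.8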

Acknowledgements

We acknowledge the use of skeleton code from BLINK and ArboEL, which provided the base infrastructure for this project.
