JRC1995/SocialMediaNER

Named Entity Recognition in Social Media

See project description here

Abstract:

We address the challenges posed by noise and emerging/rare entities in the Named Entity Recognition task for the social media domain. Following recent advances, we employ Contextualized Word Embeddings from Language Models pretrained on large corpora, along with normalization techniques to reduce noise. Our best model achieves state-of-the-art results (F1 52.47%) on the WNUT 2017 dataset. Additionally, we adopt a modular approach to systematically evaluate different contextual embeddings and downstream labeling mechanisms, using both a Sequence Labeling and a Question Answering framework.

Note: This is a project report for the CS512 Advanced Machine Learning course, done back in 2020 Q2. It's no longer SOTA.
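As a toy illustration of the Sequence Labeling formulation mentioned in the abstract (a hypothetical example, not code from this repo): each token gets a BIO label, and evaluation operates on the entity spans those labels encode. A minimal sketch of decoding BIO tags into spans:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, start, end) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new entity begins; close any open one first.
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            # Continuation of the currently open entity.
            continue
        else:
            # "O" tag (or a type mismatch) closes any open entity.
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

tokens = ["just", "landed", "in", "new", "york", "city"]
tags = ["O", "O", "O", "B-location", "I-location", "I-location"]
print(bio_to_spans(tokens, tags))  # [('location', 3, 6)]
```

In the QA/MRC formulation, the same example would instead be posed as a question (e.g. "Which tokens mention a location?") whose answer is the span "new york city".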

Credits

It was a group project. Team members:

  • Jishnu Ray Chowdhury
  • Usman Shahid
  • Tuhin Kundu
  • Zhimming Zou

Code credits:

@phdthesis{godin2019,
    title  = {Improving and Interpreting Neural Networks for Word-Level Prediction Tasks in Natural Language Processing},
    school = {Ghent University, Belgium},
    author = {Godin, Fr\'{e}deric},
    year   = {2019},
}

Requirements

Downloads

  • DOWNLOAD: tweet word2vec from here
  • Keep the above download in embeddings/word2vec/
  • DOWNLOAD: fasttext (crawl-300d-2M-subword.zip) from here
  • Keep the above download in embeddings/fasttext/
  • Run save_locally_BERT.py and save_locally_ELECTRA.py to save the pretrained contextualized transformer models locally
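Before training, it may help to confirm the downloads landed where the steps above expect them. A hedged sanity-check sketch (not part of the repo; the directory names are taken from the instructions above):

```python
import os


def missing_dirs(paths):
    """Return the subset of paths that do not exist as directories."""
    return [p for p in paths if not os.path.isdir(p)]


# Folders the download instructions above ask for.
required = ["embeddings/word2vec", "embeddings/fasttext"]
for path in missing_dirs(required):
    print(f"missing: {path}")
```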

Preprocessing

  • Run process_WNUT_phase1.py and process_WNUT_phase2.py (in that order) for preprocessing
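Since phase 2 depends on phase 1's output, a small wrapper that runs the scripts in order and stops at the first failure can be handy. A minimal sketch (not part of the repo; only the script names come from the step above):

```python
import subprocess
import sys


def run_in_order(scripts):
    """Run each script with the current interpreter, in order.

    Returns (failed_script, exit_code) for the first failure,
    or (None, 0) if all scripts succeed.
    """
    for script in scripts:
        code = subprocess.run([sys.executable, script]).returncode
        if code != 0:
            return script, code
    return None, 0


# Usage: run_in_order(["process_WNUT_phase1.py", "process_WNUT_phase2.py"])
```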

Training

python train.py --model=[INSERT MODEL NAME HERE]

Testing

python train.py --model=[INSERT MODEL NAME HERE] --test=True
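The `--model` and `--test` flags above could be parsed with argparse. A hedged sketch (not the repo's actual parser; note that argparse leaves `--test=True` as a string, so truthiness must be checked against the literal text):

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the command-line flags shown above."""
    parser = argparse.ArgumentParser(description="Train or test a NER model")
    parser.add_argument("--model", required=True, help="model name to run")
    parser.add_argument("--test", default="False",
                        help="'True' to run in test mode instead of training")
    return parser


args = build_parser().parse_args(["--model=BERT", "--test=True"])
print(args.model, args.test == "True")  # BERT True
```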

Evaluation

  • Generate the file for evaluation: python generate_eval_file.py --model=[INSERT MODEL NAME HERE], or python generate_eval_file_MRC.py --model=[INSERT MODEL NAME HERE] for MRC-based models
  • Run conlleval.py or conlleval.perl on the generated evaluation files.
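conlleval scores at the entity level: a prediction counts only if both the span boundaries and the type match exactly. A minimal re-implementation sketch of that metric for sanity-checking (assumes spans are given as (type, start, end) tuples; not code from this repo):

```python
def entity_f1(gold_spans, pred_spans):
    """Micro-averaged entity-level precision/recall/F1 over exact
    (type, start, end) matches -- the matching criterion conlleval uses."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [("location", 3, 6), ("person", 0, 1)]
pred = [("location", 3, 6), ("corporation", 2, 3)]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```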
