Implementation of the algorithm proposed in the paper Combining Local and Global Word Embeddings for Microblog Stemming (CIKM 2017) by Anurag Roy, Trishnendu Ghorai, Kripabandhu Ghosh, and Saptarshi Ghosh. The proposed unsupervised algorithm finds the stems of all words with the help of local and global word embeddings.
python version: python 2.7
packages:
gensim, nltk, scikit_learn
To install the dependencies, run the following command:
pip install -r requirements.txt
If you use this code, please cite the following paper:
@inproceedings{Roy-CIKM17,
author = {Roy, Anurag and Ghorai, Trishnendu and Ghosh, Kripabandhu and Ghosh, Saptarshi},
title = {Combining Local and Global Word Embeddings for Microblog Stemming},
year = {2017},
doi = {10.1145/3132847.3133103},
booktitle = {{Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM)}},
pages = {2267–2270},
location = {Singapore, Singapore},
}
Hyperparameters and options in unsupclean.py:
model_file: gensim Word2Vec model trained on the corpus
global_model_file: text file of the global word vectors in Google's word2vec format
alpha: the alpha value used in the algorithm, in [0, 1]
beta: the beta value used in the algorithm, in [0, 1]
prefix: the prefix length of words matched
m: the minimum length of strings considered
lambda_val: the lambda value used in the algorithm, in [0, 1]
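Since alpha, beta, and lambda_val are each restricted to [0, 1], it can help to check the values before a long run. The following is a minimal sketch of such a check. The function name validate_hyperparameters is hypothetical and is not part of unsupclean.py, and the requirement that prefix and m be positive integers is an assumption based on their descriptions as lengths.

```python
def validate_hyperparameters(alpha, beta, lambda_val, prefix, m):
    """Raise ValueError if a value falls outside its documented range.

    Hypothetical helper: not part of unsupclean.py.
    """
    # alpha, beta and lambda_val are documented as lying in [0, 1].
    for name, value in (("alpha", alpha), ("beta", beta), ("lambda_val", lambda_val)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must lie in [0, 1], got %r" % (name, value))
    # Assumption: prefix and m are lengths, so they should be positive integers.
    for name, value in (("prefix", prefix), ("m", m)):
        if not isinstance(value, int) or value < 1:
            raise ValueError("%s must be a positive integer, got %r" % (name, value))
    return True


# Arbitrary illustrative values, not the settings used in the paper.
validate_hyperparameters(alpha=0.5, beta=0.5, lambda_val=0.5, prefix=3, m=5)
```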
For the local word embeddings, we trained a word2vec model on tweets about the Nepal earthquake from the FIRE 2016 Microblog track collection. For the global word embeddings, we provide pretrained GloVe embeddings of words from a Twitter dataset, converted from the GloVe format to the word2vec format. The datasets can be accessed from the following sources:
The downloaded zipped data files should be extracted in the CIKM2017 folder.
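The provided global embeddings are already in word2vec format. For other GloVe files, the conversion can be sketched as below: the word2vec text format is the GloVe text format preceded by a header line holding the vocabulary size and the vector dimension. The function name glove_to_word2vec and the file paths are hypothetical, and this is one possible way to do the conversion, not necessarily the one used to prepare the provided files.

```python
import io


def glove_to_word2vec(glove_path, word2vec_path):
    """Write a word2vec-format text copy of a GloVe text file.

    Returns (vocab_size, dim). Hypothetical helper, not part of this repository.
    """
    vocab_size = 0
    dim = None
    # First pass: count the vectors and read the dimension from the first line.
    with io.open(glove_path, "r", encoding="utf-8") as src:
        for line in src:
            if dim is None:
                dim = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1
    if dim is None:
        raise ValueError("empty GloVe file: %s" % glove_path)
    # Second pass: write the "<vocab_size> <dim>" header, then copy the vectors.
    with io.open(glove_path, "r", encoding="utf-8") as src, \
            io.open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(u"%d %d\n" % (vocab_size, dim))
        for line in src:
            dst.write(line)
    return vocab_size, dim
```

A file written this way should be loadable with gensim's KeyedVectors.load_word2vec_format(path, binary=False), and its path can then be given as global_model_file.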
To generate the list of word stems, run the following command:
python2 driver.py
The word stem list will be stored in the word_stems_list.txt file. Each line in the file contains:
<stem> <list of words to be replaced with the stem>
An example entry in the file:
msghelpea [u'msghelpea', u'msghelpeart', u'msghelpearthqu']
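To apply the stems to a corpus, the file can be read back into a word-to-stem lookup. The sketch below assumes every line follows the example entry above, that is, the stem, a single space, and then a Python list literal. The helper names parse_stem_line and build_word_to_stem are hypothetical and not part of this repository.

```python
import ast


def parse_stem_line(line):
    """Split one line into (stem, list of words), assuming '<stem> <list literal>'."""
    stem, _, words_repr = line.strip().partition(" ")
    # ast.literal_eval safely parses the list literal, including u'...' strings.
    words = ast.literal_eval(words_repr)
    return stem, list(words)


def build_word_to_stem(lines):
    """Map every listed word to the stem that should replace it."""
    word_to_stem = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        stem, words = parse_stem_line(line)
        for word in words:
            word_to_stem[word] = stem
    return word_to_stem


example = ["msghelpea [u'msghelpea', u'msghelpeart', u'msghelpearthqu']"]
word_to_stem = build_word_to_stem(example)
# word_to_stem["msghelpeart"] is now "msghelpea"
```

For the real output, the lines of word_stems_list.txt can be passed in place of the example list.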