CIKM2017

Implementation of the algorithm proposed in the paper Combining Local and Global Word Embeddings for Microblog Stemming (CIKM 2017) by Anurag Roy, Trishnendu Ghorai, Kripabandhu Ghosh, and Saptarshi Ghosh. The proposed unsupervised algorithm finds the stems of all words with the help of local and global word embeddings.

Dependencies

python version: python 2.7

packages:

gensim
nltk
scikit_learn

To install the dependencies, run pip install -r requirements.txt

Citation

If you use this code, please cite the following paper:

@inproceedings{Roy-CIKM17,
author = {Roy, Anurag and Ghorai, Trishnendu and Ghosh, Kripabandhu and Ghosh, Saptarshi},
title = {Combining Local and Global Word Embeddings for Microblog Stemming},
year = {2017},
doi = {10.1145/3132847.3133103},
booktitle = {{Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM)}},
pages = {2267–2270},
location = {Singapore, Singapore},
}

Hyperparameters and Options

Hyperparameters and options in unsupclean.py.

model_file Path to the gensim Word2Vec model trained on the corpus (local embeddings)
global_model_file Path to the text file of global word vectors in Google's word2vec format
alpha The alpha value used in the algorithm, in the range [0, 1]
beta The beta value used in the algorithm, in the range [0, 1]
prefix The length of the prefix that words must share to be matched
m The minimum length of the strings considered
lambda_val The lambda value used in the algorithm, in the range [0, 1]
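
The ranges listed above can be checked before a run. The following is a minimal sketch only: check_hyperparameters is a hypothetical helper that is not part of this repository, and treating prefix and m as non-negative integers is an assumption based on their descriptions as lengths.

```python
def check_hyperparameters(alpha, beta, lambda_val, prefix, m):
    """Hypothetical helper (not in the repository): validate option values."""
    # alpha, beta and lambda_val are documented as lying in [0, 1].
    for name, value in (("alpha", alpha), ("beta", beta),
                        ("lambda_val", lambda_val)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must lie in [0, 1], got %r" % (name, value))
    # Assumption: prefix and m are lengths, so non-negative integers.
    for name, value in (("prefix", prefix), ("m", m)):
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("%s must be a non-negative integer, got %r"
                             % (name, value))
    return {"alpha": alpha, "beta": beta, "lambda_val": lambda_val,
            "prefix": prefix, "m": m}
```

Any values passed to this helper are the caller's choice; the sketch does not suggest defaults.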

Data for Demo

For the local word embeddings, we trained a word2vec model on tweets about the Nepal earthquake from the FIRE 2016 microblog track collection. For the global word embeddings, we provide pretrained GloVe embeddings trained on a Twitter dataset, converted from the GloVe format to word2vec format. The datasets can be accessed from the following sources:

The downloaded zipped data files should be extracted in the CIKM2017 folder.
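
For readers who want to prepare their own global embeddings, the format change mentioned above can be sketched as follows. The GloVe text format and the word2vec text format differ by a header line: word2vec text files start with a line giving the number of vectors and their dimensionality. The function name glove_to_word2vec and the file paths are illustrative assumptions; this is not necessarily how the provided files were produced.

```python
import io


def glove_to_word2vec(glove_path, output_path):
    """Write a word2vec-format copy of a GloVe text file.

    Returns (number of vectors, dimensionality).
    """
    num_vectors = 0
    dimensions = 0
    # First pass: count the vectors and read the dimensionality.
    with io.open(glove_path, encoding="utf-8") as source:
        for line in source:
            if not line.strip():
                continue  # skip blank lines
            if num_vectors == 0:
                # Each line is: word v1 v2 ... vn
                dimensions = len(line.rstrip("\n").split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with io.open(glove_path, encoding="utf-8") as source, \
            io.open(output_path, "w", encoding="utf-8") as target:
        target.write(u"%d %d\n" % (num_vectors, dimensions))
        for line in source:
            if line.strip():
                target.write(line.rstrip("\n") + u"\n")
    return num_vectors, dimensions
```

Two passes are used so that the header can be written first without holding the whole embedding file in memory.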

Run Demo

To generate the list of word stems, run the following command:

python2 driver.py

The word stem list will be stored in the word_stems_list.txt file. Each line in the file contains:

<stem> <list of words to be replaced with the stem>

An example entry in the file:

msghelpea [u'msghelpea', u'msghelpeart', u'msghelpearthqu']
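
To use the stem list downstream, each line can be turned into a word-to-stem lookup. The sketch below assumes the layout shown in the example entry: the stem, a single space, then a Python list literal of words. The function names parse_stem_line and build_word_to_stem are hypothetical and are not part of the repository.

```python
import ast


def parse_stem_line(line):
    """Split one line into (stem, list of words mapped to that stem)."""
    # Assumption: the stem contains no spaces, so the first space
    # separates it from the list literal.
    stem, _, members = line.strip().partition(" ")
    # literal_eval safely parses the list, including u'...' strings.
    return stem, list(ast.literal_eval(members))


def build_word_to_stem(lines):
    """Build a dict mapping every listed word to its stem."""
    word_to_stem = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        stem, words = parse_stem_line(line)
        for word in words:
            word_to_stem[word] = stem
    return word_to_stem
```

With such a mapping, a token can be replaced by word_to_stem.get(token, token), which leaves words without an entry unchanged.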

