ranarag/CIKM2017

CIKM2017

License: MIT Python

Summary

This repository implements the algorithm proposed in the paper Combining Local and Global Word Embeddings for Microblog Stemming (CIKM 2017) by Anurag Roy, Trishnendu Ghorai, Kripabandhu Ghosh, and Saptarshi Ghosh. The algorithm is unsupervised and finds the stems of all words with the help of local and global word embeddings.

Dependencies

Python version: Python 2.7

packages:

  • gensim
  • nltk
  • scikit_learn

To install the dependencies, run pip install -r requirements.txt

Citation

If you use this code, please cite the following paper:

@inproceedings{Roy-CIKM17,
author = {Roy, Anurag and Ghorai, Trishnendu and Ghosh, Kripabandhu and Ghosh, Saptarshi},
title = {Combining Local and Global Word Embeddings for Microblog Stemming},
year = {2017},
doi = {10.1145/3132847.3133103},
booktitle = {{Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM)}},
pages = {2267--2270},
location = {Singapore, Singapore},
}

Hyperparameters and Options

The following hyperparameters and options are set in unsupclean.py:

  • model_file: gensim Word2Vec model trained on the corpus (the local embeddings)
  • global_model_file: text file of the global word vectors in Google's word2vec format
  • alpha: the alpha value used in the algorithm, in the range [0, 1]
  • beta: the beta value used in the algorithm, in the range [0, 1]
  • prefix: the prefix length of words matched
  • m: the minimum length of strings considered
  • lambda_val: the lambda value used in the algorithm, in the range [0, 1]
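As a quick sanity check on the ranges listed above, the sketch below validates a set of hyperparameter values before a run. The helper `validate_hyperparameters` is hypothetical and not part of unsupclean.py, the example values are placeholders rather than the paper's settings, and treating prefix and m as positive integers is an assumption.

```python
# Hypothetical helper; it is NOT part of unsupclean.py.
UNIT_INTERVAL_PARAMS = ("alpha", "beta", "lambda_val")
LENGTH_PARAMS = ("prefix", "m")  # assumed to be positive integers


def validate_hyperparameters(params):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name in UNIT_INTERVAL_PARAMS:
        value = params.get(name)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append("%s must be a number in [0, 1], got %r" % (name, value))
    for name in LENGTH_PARAMS:
        value = params.get(name)
        if not isinstance(value, int) or value < 1:
            problems.append("%s must be a positive integer, got %r" % (name, value))
    return problems


# Placeholder values for illustration only; not the settings used in the paper.
example = {"alpha": 0.5, "beta": 0.5, "lambda_val": 0.5, "prefix": 2, "m": 3}
print(validate_hyperparameters(example))  # prints []
```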

Data for Demo

For the local word embeddings, we trained a word2vec model on tweets about the Nepal earthquake from the FIRE 2016 microblog track collection. For the global word embeddings, we provide pretrained GloVe embeddings of words from a Twitter dataset, converted from the GloVe format to the word2vec format. The datasets can be accessed from the following sources:

  1. Word2Vec Embeddings
  2. Global Embeddings

Extract the downloaded zip files into the CIKM2017 folder.
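The provided global embeddings are already in word2vec format, so no conversion is needed for the demo. For other GloVe files, the sketch below shows one way such a conversion could be done. It relies on the fact that the word2vec text format starts with a `<vocab_size> <dimension>` header line that GloVe text files lack. The function name `glove_to_word2vec` and the file paths are hypothetical, and this is not necessarily how the authors converted their files.

```python
import io


def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header."""
    vocab_size = 0
    dim = None
    # First pass: count the vectors and read the dimensionality.
    with io.open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if dim is None:
                dim = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    dim = dim or 0
    # Second pass: write the header, then the vectors unchanged.
    with io.open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(u"%d %d\n" % (vocab_size, dim))
        with io.open(glove_path, encoding="utf-8") as src:
            for line in src:
                if line.strip():
                    dst.write(line)
    return vocab_size, dim
```

A file written this way should be loadable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`.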

Run Demo

To generate the list of word stems, run the following command:

python2 driver.py

The word stems are written to the word_stems_list.txt file. Each line of the file has the form:

<stem> <list of words to be replaced with the stem>

An example entry from the file:

msghelpea [u'msghelpea', u'msghelpeart', u'msghelpearthqu']
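To apply the stems to other text, the output file can be turned into a word-to-stem lookup table. The sketch below parses lines in the format shown above. The helpers `parse_stem_line` and `build_word_to_stem` are hypothetical and not part of this repository, and the code assumes that a stem never contains a space.

```python
import ast


def parse_stem_line(line):
    """Split '<stem> <list literal>' into (stem, [words])."""
    stem, _, words_repr = line.strip().partition(" ")
    # literal_eval safely evaluates the printed Python list, including u'' strings.
    words = ast.literal_eval(words_repr)
    return stem, list(words)


def build_word_to_stem(lines):
    """Map every listed word to its stem, skipping blank lines."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue
        stem, words = parse_stem_line(line)
        for word in words:
            mapping[word] = stem
    return mapping


sample = ["msghelpea [u'msghelpea', u'msghelpeart', u'msghelpearthqu']"]
word_to_stem = build_word_to_stem(sample)
print(word_to_stem["msghelpearthqu"])  # prints msghelpea
```

For the real output, pass an open file object instead of `sample`, for example `build_word_to_stem(open("word_stems_list.txt"))`.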
