Repository for the second project in Machine Learning, EPFL 2016/2017.
Project: Text classification
Authors: Emanuele Bugliarello, Manik Garg, Zander Harteveld
Team: Memory Error
This repository contains the material shipped on December 22 and consists of the following folders:
- `data`: contains the Twitter data files from Kaggle.
- `code`: contains the Python files used to train the model and generate new predictions. Details about the files are available in the README inside the `code` folder.
- `report`: contains the submitted report and the files used to generate it.
The code is written in Python 3, which you can download from here (we recommend installing a virtual environment such as Anaconda, which already comes with many libraries).
The libraries required are:

- NumPy (>= 1.6.1): you can install it by typing `pip install -U numpy` on the terminal (it is included with Anaconda).
- NLTK (3.0): you can install it by typing `pip install -U nltk` on the terminal.
- NLTK packages: you can download all the packages of NLTK by typing `python` on the terminal and then running:

  ```python
  import nltk
  nltk.download('all')
  ```

  This will automatically install all the packages of NLTK. Note that downloading the `panlex_lite` package takes a long time, but you can stop the execution at that point because the packages needed by our scripts will already have been installed.
- SciPy (>= 0.9): you can install it by typing `pip install -U scipy` on the terminal (it is included with Anaconda).
- scikit-learn (0.18.1): you can install it by typing `pip install -U scikit-learn` on the terminal, or `conda install scikit-learn` if you use Anaconda.
The final model consists of a Logistic Regression classifier.
We apply the following pre-processing steps before feeding the data into the classifier:
- Remove the pound sign (#) in front of words
- Stem words (using `EnglishStemmer` from `nltk.stem.snowball`)
- Replace two or more consecutive repetitions of a letter with just two of the same (e.g. "coooool" becomes "cool")
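A minimal sketch of such a tokenizer (the function name `tokenize` and the exact regular expressions are assumptions; the actual implementation lives in the `code` folder):

```python
import re
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

def tokenize(text):
    """Pre-process a tweet and return its list of stemmed tokens."""
    # Drop the pound sign in front of words, keeping the word itself
    text = re.sub(r'#(\w+)', r'\1', text)
    # Collapse runs of three or more identical letters into two,
    # e.g. "coooool" -> "cool"
    text = re.sub(r'(\w)\1{2,}', r'\1\1', text)
    # Stem each whitespace-separated token
    return [stemmer.stem(token) for token in text.split()]
```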
We then convert the collection of text documents to a matrix of token counts. We do this with `CountVectorizer` from `sklearn.feature_extraction.text`, with the following hyperparameters (see the sketch after this list):
- analyzer = 'word'
- tokenizer = tokenize (function that tokenizes the text by applying the pre-processing steps described above)
- lowercase = True
- ngram_range = (1,3)
- max_df = 0.9261187281287935
- min_df = 4
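Constructing the vectorizer with those values (a minimal sketch, reusing the `tokenize` function above; `tweets` is an assumed list of raw tweet strings):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=tokenize,
                             lowercase=True,
                             ngram_range=(1, 3),
                             max_df=0.9261187281287935,
                             min_df=4)
# counts is a sparse document-term matrix of n-gram counts
counts = vectorizer.fit_transform(tweets)
```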
After that, we transform the count matrix to a normalized tf-idf representation with `TfidfTransformer` from `sklearn.feature_extraction.text`.
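Continuing from the `counts` matrix above (a sketch):

```python
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)
```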
Finally, we feed this representation into the `LogisticRegression` classifier from `sklearn.linear_model`, parameterized with the following value of the inverse of the regularization strength:
- C = 3.41
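Fitting the classifier on the tf-idf features (a sketch; `labels` is an assumed vector of sentiment labels for the training tweets):

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(C=3.41)
classifier.fit(tfidf, labels)
```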
In order to generate the top Kaggle submission, please ensure all Python requirements are installed and then run:
```
cd code
python run.py
```

This makes use of the pre-trained classifier available in the `code/models` folder to predict labels for new tweets and stores them in a `.csv` file in the `code/results` folder. The default test data file is `data/test_data.txt` (the one provided for the Kaggle competition), but it can be easily changed in `code/run.py`.
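A sketch of what this prediction step might look like (the model file name, the joblib-based persistence, and the `Id,Prediction` submission format are all assumptions; see `code/run.py` for the actual logic):

```python
import csv
from sklearn.externals import joblib  # bundled with scikit-learn 0.18

# Hypothetical file names; the real ones are set in code/run.py
pipeline = joblib.load('models/model.pkl')

with open('../data/test_data.txt', encoding='utf-8') as f:
    # Each line is assumed to be one raw tweet; the real file format may differ
    tweets = f.read().splitlines()

predictions = pipeline.predict(tweets)

with open('results/submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Prediction'])
    for i, label in enumerate(predictions, start=1):
        writer.writerow([i, label])
```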
You can also train the classifier that we use for the top Kaggle submission yourself. To do so:
- Ensure all Python requirements are installed
- Ensure the Twitter data files from Kaggle are in the `data/` folder.
- Run:

```
cd code
python train.py
```

This file makes use of `data/train_pos_full.txt` and `data/train_neg_full.txt` (data files from the Kaggle competition) as the training sets and creates a model in the `code/models` folder.
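Under the same assumptions as the prediction sketch above, the training flow could be summarized as follows (continuing from the `vectorizer`, `transformer`, and `classifier` sketches; file names and the label encoding are hypothetical):

```python
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline

def read_tweets(path):
    with open(path, encoding='utf-8') as f:
        return f.read().splitlines()

pos = read_tweets('../data/train_pos_full.txt')
neg = read_tweets('../data/train_neg_full.txt')
tweets = pos + neg
labels = [1] * len(pos) + [-1] * len(neg)  # assumed label encoding

# Chain the three stages described above into a single estimator
pipeline = Pipeline([('counts', vectorizer),
                     ('tfidf', transformer),
                     ('classifier', classifier)])
pipeline.fit(tweets, labels)

# Persist the fitted model for run.py to load
joblib.dump(pipeline, 'models/model.pkl')
```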
Running it takes between 50 and 60 minutes, depending on your machine: around 50 minutes for pre-processing and around 10 minutes for fitting the classifier.
You can then predict labels for new data as described in the previous section.