Reddit Flair Detector

A web application that detects the flairs of India subreddit posts using Machine Learning algorithms. The application can be found live at Reddit Flair Detector.

Directory Structure

The directory is a Flask web application set up for hosting on Heroku servers. The files and folders are described below:

app.py - The file used to start the Flask server.
requirements.txt - Lists all Python dependencies of the project.
nltk.txt - Lists the NLTK dependencies the application needs.
Procfile - Needed to set up Heroku.
Automated Testing - Webpage to predict the flairs of multiple posts at once using a .txt file.
templates - Folder containing HTML/CSS files.
flair-detector - Folder containing the main application which loads the Machine Learning models and renders the results on the web application.
data - Folder containing CSV and MongoDB instances of the collected data.
Models - Folder containing the saved model.
Jupyter Notebooks - Folder containing Jupyter Notebooks to collect Reddit India data and train Machine Learning models.
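
The Automated Testing page described above takes a .txt file and predicts the flair of several posts at once. The following is a minimal sketch of that idea, not the project's actual code. It assumes the file holds one post URL per line; the names read_post_urls, predict_batch and fake_predict_flair, as well as the flair labels, are hypothetical placeholders.

```python
import os
import tempfile


def read_post_urls(path):
    """Return the non-empty lines of a text file (assumed: one post URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def predict_batch(path, predict_flair):
    """Map every URL in the file to the flair returned by predict_flair."""
    return {url: predict_flair(url) for url in read_post_urls(path)}


def fake_predict_flair(url):
    """Placeholder predictor; a real one would call the saved model."""
    return "Politics" if "election" in url else "Unknown"


# Write a small sample file, including a blank line that should be skipped.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("https://www.reddit.com/r/india/comments/abc123/election_update/\n")
    tmp.write("\n")
    tmp.write("https://www.reddit.com/r/india/comments/def456/weekend_thread/\n")

predictions = predict_batch(tmp.name, fake_predict_flair)
os.remove(tmp.name)
print(predictions)
```

Passing the predictor in as an argument keeps the file handling separate from the model, so the same function can be exercised with a stub like the one shown here.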

Project Execution

Open the Terminal.
Clone the repository by executing git clone https://github.com/pranay-ar/Reddit-Flare-Detection.git.
Ensure that Python3 and pip are installed on the system.
Create a virtualenv by executing the following command: virtualenv -p python3 env.
Activate the env virtual environment by executing the following command: source env/bin/activate.
Enter the cloned repository directory and execute pip install -r requirements.txt.
Enter the Python shell and import nltk. Execute nltk.download('stopwords') and exit the shell.
Now, execute the following command: python app.py. The output shows the localhost address and port on which the server is running.
Open that address in a web browser and use the application.

Dependencies

The dependencies are listed in requirements.txt.

Approach

After going through the available literature on text processing and on machine learning algorithms suitable for text classification, I based my approach on [2], which describes machine learning models such as Naive-Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. I also tried other models such as Random Forest. The test accuracies obtained in the various scenarios can be found in the Results section.

The approach taken for the task is as follows:

Collect India subreddit data, 1800 entries for each of the 15 flairs, using the praw module [1].
The data includes title, comments, body, url, author, score, id, time-created and number of comments.
For comments, only top-level comments are included in the dataset; sub-comments are excluded.
The title, comments and body are cleaned by removing bad symbols and stopwords using nltk.
Five types of features are considered for the given task:

a) Title
b) Comments
c) Urls
d) Body
e) Combining Title, Comments, Body and Urls as one feature.
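
The cleaning step and the combined feature (e) can be sketched as follows. This is only an illustration under stated assumptions: the two regular expressions are one possible definition of "bad symbols", the small STOPWORDS set stands in for NLTK's English stopword list so that the snippet runs without downloads, and the function names are hypothetical.

```python
import re

# Placeholder stopword set; the project removes NLTK's English stopwords instead.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "on"}

# Assumed definition of "bad symbols": separators become spaces, the rest is dropped.
REPLACE_BY_SPACE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase the text, strip bad symbols and remove stopwords."""
    text = str(text).lower()
    text = REPLACE_BY_SPACE.sub(" ", text)
    text = BAD_SYMBOLS.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def combine_features(title, comments, body, url):
    """Join the cleaned title, comments and body with the raw URL into one string."""
    parts = [clean_text(title), clean_text(comments), clean_text(body), url]
    return " ".join(part for part in parts if part)


cleaned = clean_text("The Budget of 2020: What's in it for YOU?")
combined = combine_features(
    "Lockdown in Delhi!", "", "Stay home, stay safe.", "https://example.com/news"
)
print(cleaned)
print(combined)
```

The URL is appended without cleaning because the approach only mentions cleaning the title, comments and body.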

The dataset is split into 70% train and 30% test data using train_test_split from scikit-learn.
The text is then vectorized and converted into TF-IDF form.
Then, the following ML algorithms (using scikit-learn) are applied to the dataset:

a) Naive-Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest
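
The split, vectorization and training steps above could look roughly like the sketch below. It is not the project's notebook code: the hyperparameters, the random_state values and the stratified split are assumptions chosen so that the tiny example is reproducible, and the texts and flair labels are made-up placeholders rather than collected data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_models():
    """Return one count-vector + TF-IDF + classifier pipeline per algorithm."""
    classifiers = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    return {
        name: Pipeline(
            [("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf", clf)]
        )
        for name, clf in classifiers.items()
    }


def evaluate(texts, flairs, test_size=0.3, random_state=42):
    """Split 70/30, fit every pipeline and return its test accuracy."""
    # stratify is an assumption; it keeps every flair present in the toy train set.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, flairs, test_size=test_size, random_state=random_state, stratify=flairs
    )
    scores = {}
    for name, model in build_models().items():
        model.fit(x_train, y_train)
        scores[name] = model.score(x_test, y_test)
    return scores


# Placeholder data: five made-up posts for each of two made-up flairs.
texts = [
    "election results announced today", "parliament passes new bill",
    "minister addresses election rally", "new policy bill debate",
    "state election campaign begins", "cricket team wins match",
    "football league final tonight", "cricket world cup squad",
    "hockey match ends in draw", "team wins football trophy",
]
flairs = ["Politics"] * 5 + ["Sports"] * 5

scores = evaluate(texts, flairs)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.4f}")
```

Wrapping the vectorizer, the TF-IDF transform and the classifier in one Pipeline means the saved model can later be applied directly to raw text.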

Training and testing on the dataset showed that the Linear Support Vector Machine achieved the best testing accuracy of 77.97% when trained on the combined Title + Comments + Body + Url feature.
The best model is saved and used to predict the flair from the URL of a post.
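
Predicting a flair from a post URL, as in the last step, can be outlined as below. This is a hedged sketch rather than the application's code: it assumes the usual Reddit permalink shape /r/<subreddit>/comments/<id>/<slug>/, and submission_id_from_url, predict_flair_from_url, StubModel and stub_fetch_post are hypothetical names. In the real application the fetch step would go through praw and the model would be the saved classifier.

```python
from urllib.parse import urlparse


def submission_id_from_url(url):
    """Extract the post id from a URL shaped like /r/<sub>/comments/<id>/<slug>/."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "comments" in parts:
        index = parts.index("comments")
        if index + 1 < len(parts):
            return parts[index + 1]
    raise ValueError(f"not a Reddit post URL: {url!r}")


def predict_flair_from_url(url, fetch_post, model):
    """Fetch the post, build the combined feature and ask the model for a flair."""
    post = fetch_post(submission_id_from_url(url))
    combined = " ".join([post["title"], post["comments"], post["body"], post["url"]])
    return model.predict([combined])[0]


class StubModel:
    """Stand-in for the saved model; only mimics the predict() interface."""

    def predict(self, texts):
        return ["Sports" if "cricket" in text.lower() else "Other" for text in texts]


def stub_fetch_post(post_id):
    """Stand-in for a praw lookup; returns a made-up post."""
    return {
        "title": "Cricket world cup thread",
        "comments": "great match",
        "body": "",
        "url": f"https://www.reddit.com/r/india/comments/{post_id}/",
    }


flair = predict_flair_from_url(
    "https://www.reddit.com/r/india/comments/abc123/cricket_world_cup_thread/",
    stub_fetch_post,
    StubModel(),
)
print(flair)
```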

Results

Title as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.6792452830
Linear SVM	0.8113207547
Logistic Regression	0.8231132075
Random Forest	0.8042452830
MLP	0.8042452830

Body as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5636792452
Linear SVM	0.8278301886
Logistic Regression	0.8066037735
Random Forest	0.8207547169
MLP	0.7971698113

URL as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5754716981
Linear SVM	0.7523584905
Logistic Regression	0.7523584905
Random Forest	0.6886792452
MLP	0.7523584905

Comments as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.4622641509
Linear SVM	0.4056603773
Logistic Regression	0.4716981132
Random Forest	0.4646226415
MLP	0.4599056603

Title + Comments + URL + Body as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5589622641
Linear SVM	0.8325471698
Logistic Regression	0.8254716981
Random Forest	0.8089622641
MLP	0.8372641509

Intuition behind Combined Feature

Taken independently, the Title and Body features showed test accuracies near 82%, while the URL and Comments features gave lower accuracies, with Comments the lowest in the tables above.

