A Reddit Flair Detector web application to detect flairs of India subreddit posts using Machine Learning algorithms. The application can be found live at Reddit Flair Detector.

The directory is a Flask web application set up for hosting on Heroku servers. The description of files and folders can be found below:
- app.py - The file used to start the Flask server.
- requirements.txt - Contains all Python dependencies of the project.
- nltk.txt - Lists the NLTK dependencies needed by the project.
- Procfile - Needed to set up Heroku.
- Automated Testing - Webpage to predict the flairs of multiple posts at once using a .txt file.
- templates - Folder containing HTML/CSS files.
- flair-detector - Folder containing the main application which loads the Machine Learning models and renders the results on the web application.
- data - Folder containing CSV and MongoDB instances of the collected data.
- Models - Folder containing the saved model.
- Jupyter Notebooks - Folder containing Jupyter Notebooks to collect Reddit India data and train Machine Learning models.

To run the application locally:
- Open the `Terminal`.
- Clone the repository by entering `git clone https://github.com/pranay-ar/Reddit-Flare-Detection.git`.
- Ensure that `Python3` and `pip` are installed on the system.
- Create a `virtualenv` by executing the following command: `virtualenv -p python3 env`.
- Activate the `env` virtual environment by executing the following command: `source env/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Enter the `python` shell and `import nltk`. Execute `nltk.download('stopwords')` and exit the shell.
- Now, execute the following command: `python manage.py runserver`. It will point to the `localhost` address with the port.
- Open that `IP Address` in a web browser and use the application.

The project's dependencies are listed in requirements.txt.

After going through the available literature on text processing and on machine learning algorithms suited to text classification, I based my approach on [2], which describes models such as Naive Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. Along with these, I tried other models such as Random Forest. The test accuracies obtained in the various scenarios can be found in the next section.

The approach taken for the task is as follows:
- Collect 1800 India subreddit records for each of the 15 flairs using the `praw` module [1].
- The data includes title, comments, body, url, author, score, id, time-created and number of comments.
- For comments, only top-level comments are included in the dataset; sub-comments are not.
- The title, comments and body are cleaned by removing bad symbols and stopwords using `nltk`.
- Five types of features are considered for the given task:
  a) Title
  b) Comments
  c) Urls
  d) Body
  e) Combining Title, Comments, Body and Urls as one feature.
- The dataset is split into 70% train and 30% test data using `train_test_split` of `scikit-learn`.
- The dataset is then converted into a `Vector` and `TF-IDF` form.
- Then, the following ML algorithms (using `scikit-learn` libraries) are applied on the dataset:
  a) Naive-Bayes
  b) Linear Support Vector Machine
  c) Logistic Regression
  d) Random Forest
- Training and testing on the dataset showed that the Linear Support Vector Machine achieved the best test accuracy of 77.97% when trained on the combination of Title + Comments + Body + Url feature.
- The best model is saved and used to predict the flair from the URL of a post.
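
The collection and cleaning steps above could look roughly like the sketch below. It is not the project's actual code. `extract_record`, `clean_text`, `combined_feature`, the small `STOPWORDS` set (a stand-in for the `nltk` stopword list so the sketch runs offline) and the `FakeComments` stub are all hypothetical names introduced for illustration. The attribute names (`selftext`, `created_utc`, `link_flair_text`, `comments.replace_more`) follow the `praw` submission object, but a stub is used here instead of a live Reddit connection.

```python
import re
from types import SimpleNamespace

# Assumption: tiny stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "for", "on"}
# Assumption: anything outside this character set counts as a "bad symbol".
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase the text, replace bad symbols with spaces, drop stopwords."""
    text = BAD_SYMBOLS.sub(" ", text.lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def extract_record(submission):
    """Collect the listed fields from a praw-style submission object."""
    # Remove "load more comments" placeholders, then iterate the forest,
    # which yields top-level comments only (no sub-comments).
    submission.comments.replace_more(limit=0)
    top_level = " ".join(comment.body for comment in submission.comments)
    return {
        "id": submission.id,
        "title": submission.title,
        "body": submission.selftext,
        "url": submission.url,
        "author": str(submission.author),
        "score": submission.score,
        "time_created": submission.created_utc,
        "num_comments": submission.num_comments,
        "comments": top_level,
        "flair": submission.link_flair_text,
    }


def combined_feature(record):
    """Join cleaned title, comments, body and URL into one feature string."""
    keys = ("title", "comments", "body", "url")
    return " ".join(clean_text(record[key]) for key in keys)


class FakeComments(list):
    """Stub that mimics the two comment-forest behaviours used above."""

    def replace_more(self, limit=0):
        return []


# Made-up submission used only to exercise the functions.
fake_submission = SimpleNamespace(
    id="abc123",
    title="Election results are out!",
    selftext="Counting of the votes is over.",
    url="https://example.com/news",
    author="someone",
    score=10,
    created_utc=1586000000.0,
    num_comments=2,
    link_flair_text="Politics",
    comments=FakeComments(
        [SimpleNamespace(body="Great news"), SimpleNamespace(body="What a day")]
    ),
)

record = extract_record(fake_submission)
print(combined_feature(record))
```

With a real `praw` client, the same `extract_record` function would be called on each submission returned for a flair, and the resulting records written to the CSV or MongoDB data described earlier.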

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.6792452830 |
| Linear SVM | 0.8113207547 |
| Logistic Regression | 0.8231132075 |
| Random Forest | 0.8042452830 |
| MLP | 0.8042452830 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5636792452 |
| Linear SVM | 0.8278301886 |
| Logistic Regression | 0.8066037735 |
| Random Forest | 0.8207547169 |
| MLP | 0.7971698113 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5754716981 |
| Linear SVM | 0.7523584905 |
| Logistic Regression | 0.7523584905 |
| Random Forest | 0.6886792452 |
| MLP | 0.7523584905 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.4622641509 |
| Linear SVM | 0.4056603773 |
| Logistic Regression | 0.4716981132 |
| Random Forest | 0.4646226415 |
| MLP | 0.4599056603 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5589622641 |
| Linear SVM | 0.8325471698 |
| Logistic Regression | 0.8254716981 |
| Random Forest | 0.8089622641 |
| MLP | 0.8372641509 |

Taken independently, the features showed test accuracies close to 82%, with the URL feature giving the worst accuracies.
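
An accuracy table of this shape can be produced by fitting each classifier on the same 70/30 split and scoring it on the held-out part. The sketch below is an illustration under assumptions, not the code behind the numbers above. The toy corpus and its flair names are made up, the hyperparameters (`max_iter`, `n_estimators`, `hidden_layer_sizes`, `random_state`) are arbitrary choices, and `LinearSVC` is assumed as the Linear SVM implementation. The count vector followed by TF-IDF mirrors the "Vector and TF-IDF form" step, and the best pipeline is serialized with `pickle` in memory, where the project would write it to a file.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: three flairs, a few keywords each.
topics = {
    "Politics": ["election", "parliament", "minister", "vote"],
    "Sports": ["cricket", "match", "wicket", "tournament"],
    "Food": ["biryani", "recipe", "restaurant", "chai"],
}
texts, labels = [], []
for flair, words in topics.items():
    for word in words:
        for i in range(3):
            texts.append(f"{word} update number {i} about {word}")
            labels.append(flair)

# 70% train, 30% test, as described in the approach.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# Hyperparameters here are illustrative assumptions.
models = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=42),
}

results, fitted = {}, {}
for name, classifier in models.items():
    pipeline = Pipeline(
        [
            ("vect", CountVectorizer()),    # raw text -> token counts
            ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
            ("clf", classifier),
        ]
    )
    pipeline.fit(X_train, y_train)
    results[name] = pipeline.score(X_test, y_test)  # test accuracy
    fitted[name] = pipeline

print("| Machine Learning Algorithm | Test Accuracy |")
print("|---|---|")
for name, accuracy in results.items():
    print(f"| {name} | {accuracy:.4f} |")

# Keep the best-scoring pipeline and round-trip it through pickle.
best_name = max(results, key=results.get)
restored = pickle.loads(pickle.dumps(fitted[best_name]))
print(best_name, restored.predict(["cricket match tonight"]))
```

Because the vectorizer and classifier are saved together in one pipeline, the restored object can be applied directly to raw text fetched from a post URL.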