The purpose of this notebook is to showcase the use of Capsule Layers (CapsNet), particularly for an NLP problem where contextuality and superposition of data matter. In this case, we'll be classifying online comments as toxic or nontoxic. You can read more about CapsNets here: 1, 2, 3. Toxic comments also receive a further classification, delineating between one of a few different "styles" of toxicity mentioned below. We'll be using Datmo's open source CLI to help us set up our environment and to track our experiment results and repository states along the way.
The specific goal of this model, as stated in the original Kaggle competition prompt, is to create a "multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate". The training and test datasets consist of comments from Wikipedia’s talk page edits.
This notebook is a fork of the final CapsNet+GRU kernel submitted to the competition by chongjiujjin.
The example notebook (`capsnet-keras.ipynb`) will cover the following:
- Training the Model
  - Load Data
  - Define optimization/scoring function
  - Tokenize Data
  - Load GloVe embeddings
  - Define model architecture
  - Fit model
  - Save weights
  - Create snapshot (end of training)
- Predicting on the Kaggle test set
  - Create snapshot (post-prediction on test set)
- Predicting on new data
  - Instantiate model architecture
  - Load saved weights
  - Predict on manually defined strings
  - Predict on user-defined CSV
- Writing model predictions to CSV
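The "Load GloVe embeddings" step above can be sketched roughly as follows. Here `word_index` stands in for the word-to-index mapping produced by Keras's `Tokenizer`, and the vocabulary cap and function name are illustrative, not the notebook's exact values:

```python
import numpy as np

EMBED_SIZE = 300  # glove.840B.300d.txt provides 300-dimensional vectors


def load_glove_matrix(word_index, path, max_features=100_000, embed_size=EMBED_SIZE):
    """Build an embedding matrix aligned with a tokenizer's word_index.

    word_index: dict mapping word -> integer index (1-based, as Keras produces).
    path: location of the unzipped GloVe text file.
    """
    # Each line is "<word> <v1> ... <v300>"; rsplit guards against
    # tokens that themselves contain spaces.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().rsplit(" ", embed_size)
            if len(parts) == embed_size + 1:
                embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Row 0 is reserved for padding; out-of-vocabulary words stay all-zero.
    n_words = min(max_features, len(word_index) + 1)
    matrix = np.zeros((n_words, embed_size), dtype="float32")
    hits = 0
    for word, i in word_index.items():
        vec = embeddings.get(word)
        if i < n_words and vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits
```

The resulting matrix is what gets passed as the initial weights of the model's (frozen) embedding layer, so each token index in a padded comment looks up its pretrained vector.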
Setup:
- Install and launch Docker
- Install datmo with `$ pip install datmo`
- Clone the repository
- Download the GloVe embedding file from this link (Warning: large file, around 2GB)
- Unzip the GloVe file and move it into the `input/` directory. Also unzip the `test.csv.zip` and `train.csv.zip` files already in `input/`.
Before proceeding, your repository should look as follows:
```
.
├── README.md
├── best.hdf5
├── capsnet-keras.ipynb
├── input
│   ├── glove.840B.300d.txt
│   ├── test.csv
│   ├── test.csv.zip
│   ├── train.csv
│   └── train.csv.zip
└── output
```
Setting up the environment:
- Initialize your datmo repo with `$ datmo init`
- When asked about setting up an environment, type `y`
- Select the `cpu` environment type
- Select the `kaggle` environment
Running the notebook:
- Initialize a jupyter notebook with `$ datmo notebook` (setting up the environment will take a while the first time you do it)
- Select the notebook (`capsnet-keras.ipynb`), then follow the instructions and run the cells from top to bottom.
- For predicting on your own data, add your `.csv` dataset to the `input/` directory.
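Predicting on your own CSV can be sketched as below. The `comment_text` column name matches the Kaggle dataset, while `predict_fn` stands in for the loaded Keras model's predict call (the function and parameter names here are illustrative, not the notebook's exact code):

```python
import csv

# The six label columns from the Kaggle competition.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def predict_csv(in_path, out_path, predict_fn, text_column="comment_text"):
    """Read comments from in_path, score them, write probabilities to out_path.

    predict_fn: callable taking a list of strings and returning, per string,
    a sequence of six probabilities (one per label).
    """
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    texts = [row[text_column] for row in rows]
    scores = predict_fn(texts)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([text_column] + LABELS)
        for text, probs in zip(texts, scores):
            writer.writerow([text] + [f"{p:.4f}" for p in probs])
```

In the notebook, `predict_fn` would wrap tokenizing and padding the strings before calling `model.predict`, so the input text goes through the same preprocessing as the training data.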
Notes:
- The GloVe embeddings are loaded into memory before training, requiring around 6 GB of available RAM. If that much RAM is not available on your system, you will likely see a generic Python kernel crash after a long-running cell.
- If the kernel crashes while the embedding file is loading, your system is probably limiting the memory allotted to Docker. To fix this, see the following issue on Stack Overflow.