The goal of this notebook is to build and analyse a map of the 10,000 most popular subreddits on Reddit. To do this we need a means to measure the similarity of two subreddits. In a great article on FiveThirtyEight Trevor Martin did an analysis of subreddits by considering the overlaps of users commenting on two different subreddits. Their interest was in using vector algebra on representative vectors to look at, for example, what happens if you remove r/politics from r/The_Donald. Our interest is a little broader -- we want to map out and visualize the space of subreddits, and attempt to cluster subreddits into their natural groups. With that done we can then explore some of the clusters and find interesting stories to tell.
The first step in all of this is acquiring the relevant data on subreddits. Reddit user u/Stuck_in_the_Matrix provided a vast amount of reddit comment data on Google's BigQuery service, and the FiveThirtyEight authors posted their code to extract the relevant commenter overlap information via BigQuery queries. I have placed the (slightly modified) code for BigQuery on github. The result is a file with over 15 million counts of pairwise commenter overlap between subreddits available here. This will be the starting point for our analysis.
To build a map of subreddits we'll need to get relevant data, and massage it into a form that we can use creating a map. We begin by loading all the relevant Python modules we will require.
import pandas as pd
import scipy.sparse as ss
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.base import BaseEstimator
from sklearn.utils import check_array
from os.path import isfile
import subprocess
Next we need to read in the overlap data, which we can do easily with pandas. The result is a dataframe, each row providing a from subreddit, a to subreddit, and the number of unique commenters that the two subreddits have in common. A call to head shows is the first 5 entries of the table.
raw_data = pd.read_csv('subreddit-overlap')
raw_data.head()
This table has a total of over 50,000 unique subreddits. That's too many to visualise, and in practice many of the less popular subreddits with sparse overlaps will act as noise, muddying our later analysis. That means that later in our analysis we will want to narrow our focus on just the top 10,000 subreddits. To make this easy to do I will rank the subreddits so that they are indexed in order of popularity. For this purpose I will define popularity as the total number of unique commenters in a subreddit -- the more unique commenters, the more popular the subreddit. We can estimate this from our table by summing the total number of overlaps for each subreddit. Pandas makes this easy; first we group the data by subreddit, then select NumOverlaps and sum those for each group. The result is a Series indexed by subreddit of the sums of overlaps for that subreddit. Sorting this Series and extracting the index will give us a ranking of subreddits by approximate popularity.
subreddit_popularity = raw_data.groupby('t2_subreddit')['NumOverlaps'].sum()
subreddits = np.array(subreddit_popularity.sort_values(ascending=False).index)
Our next step is going to be to pivot the data into a matrix such that rows and columns are both indexed by subreddits, and the entry at position (i,j) is the number of overlaps bwteen the ith and jth subreddits. We could do this in pandas via the pivot method, but the result will be quite large and consume a lot of memory. Since most of the entries of such a matrix will be zero a better option is to use a sparse matrix data structure which can be operated on as if it were a matrix, but only stores the non-zero entries. Since sparse matrices are numerically indexed we'll need a mapping from (popularity ranked) subreddits to 0-up integer indices. We can quickly and easily build a Python dictionary to do this, providing us with the ability to look up a numeric index for any given subreddit.
index_map = dict(np.vstack([subreddits, np.arange(subreddits.shape[0])]).T)
Now we need to build the matrix. Scipy provides good support for sparse matrices in a number of formats. For matrix construction the most suitable format is called COO format (short for coordinate format). This format requires us to specify triples of row, column, and value for each non-zero entry in the matrix. The COO matrix constructor accepts this as a triple of arrays: the first array is the values, the second and third are arrays of row and column indices. This all gets zipped together to create a sparse matrix. Pulling the relevant arrays out of our data table is straight forward since each array corresponds to a column of the table -- the only work required is to use our subreddit-to-integer-index map to convert the subreddit names in the table into numeric row and column indexes.
count_matrix = ss.coo_matrix((raw_data.NumOverlaps,
(raw_data.t2_subreddit.map(index_map),
raw_data.t1_subreddit.map(index_map))),
shape=(subreddits.shape[0], subreddits.shape[0]),
dtype=np.float64)
I can now consider each row of the matrix as a vector of information describing a subreddit in terms of its relations to other subreddits. Comparisons between these vectors are a little more difficult since the counts are on very different scales: AskReddit has large overlaps with many subreddits, since it has so many comments; meanwhile niche subreddits have smaller overlaps simply because they have few commenters. In practice I want to compare the vectors relative overlaps because it is the overlaps that are large for the average overlap size of that subreddit that will tell us the most about what makes a given subreddit different from others. We can achieve this by row-normalizing, dividing each subreddit vector by the sum of entries in the vector. You can think of the result as empirical conditional probabilities: given we are looking at a particular subreddit (our row of choice), what is the probability that it shares a commenter with another subreddit (our column of choice)? That's what the given matrix entry should tell us.
Computationally this is easy thanks for scikit-learn, which provides a convenient normalize function (which can operate on sparse matrices). We need to specify that we are using the so called "l1-norm", and then scikit-learn will take care of the rest. The one caveat is that while COO format was great for matrix construction it is much less efficient for row-wise operations like this. To make it easier for scikit-learn we'll convert the matrix to CSR format (short for Compact Sparse Row) which has efficient row-wise computation.
conditional_prob_matrix = count_matrix.tocsr()
conditional_prob_matrix = normalize(conditional_prob_matrix, norm='l1', copy=False)
Now that we have comparable vectors, perhaps we should start comparing subreddits? Not quite. Each vector is 56187-dimensional, which is a little large to work with easily. In practice we don't expect that our data is intrinsically that high dimensional, which means there is a lot of redundancy built into those 56187-dimensional vectors. To make things easier to work with we're going to use a truncated Singular Value Decomposition to reduce down to 500-dimensional vectors. That sounds fancy, but really it amounts to finding a matrix of 500-dimensional vectors from which we can (almost) reconstruct the full 56187-dimensional vectors. This is just (slightly lossy) information compression. The Singular Value Decomposition part simply refer to the linear algebra technique that makes finding the optimal such compressed form computationally efficient.
Once again scikit-learn comes to our aid, providing a truncated SVD implementation that operates on sparse matrices. That means that getting our compressed representation is a single line of code.
The next step is a little sleight of hand. The correct measure of similarity between these vectors is cosine similarity. Very soon I'm going to want to work with distances rather than similarities. There is such a thing as cosine distance, but that is not really a distance metric which would break some of the algorithms I want to apply. The correct distance measure is something called angular distance (essentially the size of the angle between the vectors). Fortunately the euclidean distance between normalized (by euclidean norm) vectors is equivalent to angular distance. Therefore I'll normalize the vectors now, and thereafter we can use euclidean distance knowing that it is actually effective angular distance in practice. Again scikit-learn's normalize function does the job, this time with the "l2-norm" which is another name for the euclidean norm.
reduced_vectors = TruncatedSVD(n_components=500,
random_state=1).fit_transform(conditional_prob_matrix)
reduced_vectors = normalize(reduced_vectors, norm='l2', copy=False)
Now we need to convert the subreddit vectors into a map. What does this mean? It means we need to generate an (x,y) coordinate pair for each subreddit. This sounds like another dimension reduction step -- this time reducing down to only 2-dimensional vectors. Why did we not just use the SVD above to do that for us? The SVD approach to dimension reduction performs a linear dimension reduction, which means that if the data lives in some flat hyperplane in high dimensional space it will do an excellent job, If, however, the data lives on several warped curving hypersurfaces with disjoint components ... well a linear approach to that will not work so well. Given that, with 56,000 dimensions, there was a lot of redundant information in the vectors, we could afford a linear approximation to carve away a lot of the redundancy. To get down to just two dimensions, however, we are going to need to worry about the distorted curvy hypersurfaces if we want to get a good representation. That means we need to start considering non-linear dimension reduction techniques (sometimes called manifold learning techniques, since the goal is to learn the curvy hypersurface, called a manifold in math speak, that the data lies on). I've chosen to use LargeVis, a relatively recent technique, very closely related to the popular t-SNE algorithm. In fact, LargeVis essentially does the same thing mathematically as t-SNE, but uses some clever tricks to get a more efficient computational approach, which in turn helps it to find a better answer.
There aren't (currently) any LargeVis implementations for Python; the authors of the LargeVis paper have a C++ implementation (with Python bindings that work only with Python2), which is what we'll be using. You can get it directly from their LargeVis github repo. Since scikit-learn has the best machine learning API around I will wrap the code to farm data out and back using scikit-learn's BaseEstimator class so that I can just keep using a familiar API.
class LargeVis (BaseEstimator):
def __init__(self, n_components=2, perplexity=50.0, gamma=5,
layout_samples=None, n_neighbors=None, negative_samples=5,
alpha=1.0, n_cores=4, knn_prop=3, trees=50):
self.n_components = n_components
self.perplexity = perplexity
self.layout_samples = layout_samples
self.alpha = alpha
self.n_cores = n_cores
self.knn_prop = knn_prop
self.negative_samples = negative_samples
self.n_neighbors = n_neighbors
self.gamma = gamma
self.trees = trees
if self.n_neighbors is None:
self.n_neighbors = int(self.perplexity * 3)
def fit_transform(self, X, y=None):
if self.layout_samples is None:
layout_samples = X.shape[0] / 100.0
else:
layout_samples = self.layout_samples
X = check_array(X, dtype=np.float64)
np.savetxt('/tmp/largevis_input',
X, header='{} {}'.format(*X.shape),
comments='')
subprocess.check_call(['/home/leland/Source/LargeVis/Linux/LargeVis',
'-input', '/tmp/largevis_input',
'-output', '/tmp/largevis_output',
'-outdim', str(self.n_components),
'-perp', str(self.perplexity),
'-samples', str(layout_samples),
'-gamma', str(self.gamma),
'-prop', str(self.knn_prop),
'-trees', str(self.trees),
'-neigh', str(self.n_neighbors),
'-alpha', str(self.alpha),
'-neg', str(self.negative_samples),
'-threads', str(self.n_cores)])
self.embedding_ = np.loadtxt('/tmp/largevis_output', skiprows=1)
return self.embedding_
def fit(self, X, y=None):
self.fit_transform(X)
return self
Now that we have LargeVis in Python the next step is to run it. It is at this point that I'll prune down to just the top 10,000 subreddits, only passing that many vectors to LargeVis. Since LargeVis is a randomized algorithm the results are never exactly the same between any two runs. For the sake of my own sanity in discussing what the map looks like I have set code up to cache a result and load that if it is available so that I can get a consistent map when re-running the notebook.
if isfile('largevis_subreddit_map.npy'):
subreddit_map = np.load('largevis_subreddit_map.npy')
else:
subreddit_map = LargeVis().fit_transform(reduced_vectors[:10000])
np.save('largevis_subreddit_map.npy', subreddit_map)
Now that we have a set of coordinates for each subreddit we can put that data into a pandas dataframe and join it with the actual names of the subreddits so that we can associate the actual subreddit with each point in 2-dimensional space. We can look at the head of the resulting dataframe and see that we do indeed have x and y coordinates for each subreddit.
subreddit_map_df = pd.DataFrame(subreddit_map[:10000], columns=('x', 'y'))
subreddit_map_df['subreddit'] = subreddits[:10000]
subreddit_map_df.head()
While we now have a map it does have 10,000 data points in it and is going to be a little hard to make sense of. We can make that task a little easier by clustering the subreddits in the map, allowing us to look at a small(ish) number of groups of subreddits that each clump together in the map.
For clustering I've chosen to use the HDBSCAN* clustering algorithm. HDBSCAN* is a density based clustering algorithm -- that means it view clusters as dense areas that are separated from other clusters by less dense areas. It also means that it can find clusters that are arbitrary shapes (unlike K-Means), and supports a notion of "noise" (data points that are outliers, not necessarily in any cluster). A final advantage is that it does not require the specification of the number of clusters to be found -- I honestly don't know how many clumps of subreddits to expect, so having to state that up front for the benefit of the clustering algorithm is hard.
The Python implementation of HDBSCAN*, hdbscan, is fully scikit-learn compatible, so we get to use a familiar scikit-learn API, and thanks to spatial indexing acceleration it can whip through all 10,000 data points in a fraction of second.
import hdbscan
clusterer = hdbscan.HDBSCAN(min_samples=5,
min_cluster_size=20).fit(subreddit_map)
cluster_ids = clusterer.labels_
We'll save the clusterer object for later -- it has some other properties we will explore. The cluster labels are the assignments of points to clusters (with a label of -1 for points that are considered noise). We can simply add these to our existing data frame as a new column, allowing us to associate each subreddit with a cluster number.
subreddit_map_df['cluster'] = cluster_ids
Now that we have subreddits mapped and clustered, the next step is to actually visualise the results so that we can see what is going on. Since we have 10,000 points a basic scatterplot is going to be very busy and have a great deal of overplotting. Furthermore while we can plot a point for each subreddit we won't know which subreddits they represent, and filling the screen with overplotted text is certainly not the answer.
To get around these problems I decided to use Bokeh for my initial visualisation. Bokeh is a Python plotting library that supports interactive plots in the browser. This means we can zoom in on the plot and pan around as needed. Better still we can add tooltips showing which subreddit points represent as we hover over them.
Bokeh is a large and powerful library. There are good tutorials on Bokeh available, so I won't explain everything Bokeh can do. Instead I'll just load the parts of it that we'll need for our interactive scatterplot map of subreddits.
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models import HoverTool, ColumnDataSource, CustomJS, value
from bokeh.models.mappers import LinearColorMapper
from bokeh.palettes import plasma
from collections import OrderedDict
output_notebook()
# To output to a static html file, comment out the previous
# line, and uncomment the line below.
# output_file('subreddit_interactive_map.html')
Now that we have Bokeh loaded it is time to build the plot. There's a lot going on in the next notebook cell, but each piece is relatively straightforward; I'll try to explain what is going on in each chunk of code.
First we need some colour. We can use the clusters for that, but that means we need to map cluster numbers to colours and reserve gray for noise points. We can construct a palette and then use the LinearColorMapper for this. There is also some code to set a fill_alpha value which we will come back to later.
Next we create a ColumnDataSource. This just connects up our dataframe with bokeh so it can embed data (like the names of the subreddits!) directly in the html for the plot (and not have to round trip back to Python to get that information). We also have some custom javascript for handling alpha values -- again, we'll come back to this.
The next step is to create a figure to plot to, and then add a hover tool (which we wire up to display the subreddit and cluster). We then add circles to the figure, taking x, y and color values for the circles from the ColumnDataSource we created. Then there is some custom callbacks which we'll skip over for now, and the rest, prior to show(plot) is just customising the display style a little.
The result is a scatterplot, colored by cluster, which we can zoom and pan around in, complete with tooltips. The plot itself is pure html and javascript, and we can embed it in a webpage independent of any Python -- it is standalone at this point.
One catch is that we set a fairly low alpha channel level (the translucency of points) so that we can see dense areas when zoomed out due to all the overplotting of points. If we were to zoom in while keeping that same alpha channel all the points would be very pale and hard to see. I was expecting that fixing this would be a very difficult task -- how does one dynamically alter the alpha value as you zoom in? It turns out to be surprsingly straightforward in Bokeh. By adding a fill_alpha value to the dataframe (and hence the ColumnDataSource) we can set the alpha value from the ColumnDataSource. Bokeh also let's us add custom callbacks when the zoom level changes, so we simply update the alpha values based on the bounds of the zoomed area.
# Construct a color palette and map clusters to colors
palette = ['#777777'] + plasma(cluster_ids.max())
colormap = LinearColorMapper(palette=palette, low=-1, high=cluster_ids.max())
color_dict = {'field': 'cluster', 'transform': colormap}
# Set fill alpha globally
subreddit_map_df['fill_alpha'] = np.exp((subreddit_map.min() -
subreddit_map.max()) / 5.0) + 0.05
# Build a column data source
plot_data = ColumnDataSource(subreddit_map_df)
# Custom callback for alpha adjustment
jscode="""
var data = source.data;
var start = cb_obj.start;
var end = cb_obj.end;
alpha = data['fill_alpha']
for (i = 0; i < alpha.length; i++) {
alpha[i] = Math.exp((start - end) / 5.0) + 0.05;
}
source.trigger('change');
"""
# Create the figure and add tools
bokeh_figure = figure(title='A Map of Subreddits',
plot_width = 700,
plot_height = 700,
tools= ('pan, wheel_zoom, box_zoom,'
'box_select, resize, reset'),
active_scroll=u'wheel_zoom')
bokeh_figure.add_tools( HoverTool(tooltips = OrderedDict([('subreddit', '@subreddit'),
('cluster', '@cluster')])))
# draw the subreddits as circles on the plot
bokeh_figure.circle(u'x', u'y', source=plot_data,
fill_color=color_dict, line_color=None, fill_alpha='fill_alpha',
size=10, hover_line_color=u'black')
bokeh_figure.x_range.callback = CustomJS(args=dict(source=plot_data), code=jscode)
bokeh_figure.y_range.callback = CustomJS(args=dict(source=plot_data), code=jscode)
# configure visual elements of the plot
bokeh_figure.title.text_font_size = value('18pt')
bokeh_figure.title.align = 'center'
bokeh_figure.xaxis.visible = False
bokeh_figure.yaxis.visible = False
bokeh_figure.grid.grid_line_color = None
bokeh_figure.outline_line_color = '#222222'
# display the figure
show(bokeh_figure);