
Social-Network Analysis Using Topic Models

Youngchul Cha (UCLA Computer Science Dept, Los Angeles, CA 90095, [email protected])
Junghoo Cho (UCLA Computer Science Dept, Los Angeles, CA 90095, [email protected])

ABSTRACT

In this paper, we discuss how we can extend probabilistic topic models to analyze the relationship graph of popular social-network data, so that we can "group" or "label" the edges and nodes in the graph based on their topic similarity. In particular, we first apply the well-known Latent Dirichlet Allocation (LDA) model and its existing variants to the graph-labeling task and argue that the existing models do not handle popular nodes (nodes with many incoming edges) in the graph very well. We then propose possible extensions to this model to deal with popular nodes.

Our experiments show that the proposed extensions are very effective in labeling popular nodes, showing significant improvements over the existing methods. Our proposed methods can be used for providing, for instance, more relevant friend recommendations within a social network.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering; D.2.8 [Software Engineering]: Metrics—performance measures

General Terms

Experimentation, Algorithms

Keywords

social-network analysis, topic model, Latent Dirichlet Allocation, handling popular nodes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'12, August 12–16, 2012, Portland, Oregon, USA.
Copyright 2012 ACM 978-1-4503-1472-5/12/08 ...$15.00.

1. INTRODUCTION

Social network services are gaining popularity. A growing number of users use major social network services, such as Facebook and Twitter, to share their thoughts and whereabouts with their friends and followers. On a social network, a user indicates that she wants to get notified of another user's updates by "following" (or making friends with) that user. Then when the followed user "updates her status" (or shares a new thought), all users who follow her are immediately notified. This "follow relationship" among users often looks unorganized and chaotic, because follow relationships are created haphazardly by each individual user and not controlled by a central entity.

In this paper, we explore techniques to provide more structure to this follow relationship (1) by "grouping" the users based on their topic interests, and (2) by "labeling" each follow relationship with the identified topic group. More formally, we consider each user in a social network as a node in a graph and each follow relationship as a directed edge between two nodes. Then our goal is to "group" a set of nodes in the graph based on their topics and "label" each edge in the graph with a topic group number.

Inferring a structure within the social-network relationship graph can be useful for many reasons. For example, a novice user on a social network often encounters the bootstrapping problem: discovering relevant users to connect with. To mitigate this problem, social network services may recommend potentially interesting users to new users if they can group users of similar interests and infer why the new user has decided to follow a certain initial set of other users. Similarly, we can identify a small set of "influential" users on a certain topic (say, for marketing and advertising purposes) if we can identify the users' topic interests.

Roughly, we can consider our goal as a clustering (or classification) problem, where many popular solutions such as K-means [15] and DBSCAN [6] exist. These existing methods, however, are not appropriate for our task because they either (1) associate each node with a single group (hard clustering) or (2) can associate each node with multiple groups (soft clustering), but require a completely separate method to label edges as well as nodes (since a node may be associated with multiple groups). Given the diversity of interest a user may have, it is too restrictive to associate a user with a single topic group. For example, Kevin Rose, one of the most popular users on Twitter, may belong to the entrepreneur group as he is the founder of Digg.com, but may also belong to the Internet commentator group since he runs a popular Internet podcast. Since many users on social networks are active in more than one community, we believe it is too unrealistic to require that every user should belong to just one group.

In this paper, we apply a well-known probabilistic topic model, called Latent Dirichlet Allocation (LDA), to the follow-relationship graph of the social network, in order to label the nodes and the edges in the graph with (possibly) multiple topics.
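As a small illustration of the labeling goal just described, the sketch below shows how per-edge topic labels induce multi-topic node labels. All user names and topic labels here are invented for illustration and are not taken from the paper's dataset:

```python
# Hypothetical illustration of the graph-labeling task described above.
# Each follow relationship is a directed edge (follower, followed).
edges = [
    ("alice", "kevinrose"),
    ("bob", "kevinrose"),
    ("alice", "cnn"),
]

# The goal: label every edge with a topic, which in turn induces a
# (possibly multi-topic) label set for every node.
edge_topics = {
    ("alice", "kevinrose"): "entrepreneurship",
    ("bob", "kevinrose"): "podcasting",
    ("alice", "cnn"): "news",
}

# A node inherits the topics of its incoming edges, so one user can
# belong to several topic groups (soft clustering).
node_topics = {}
for (f, g), z in edge_topics.items():
    node_topics.setdefault(g, set()).add(z)

print(sorted(node_topics["kevinrose"]))  # ['entrepreneurship', 'podcasting']
```

Note that no clustering is happening yet; the point is only that edge labels, once inferred, give every node a set of topics rather than a single group.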
Unfortunately, LDA was developed for labeling documents and words with topics (not nodes and edges in a graph), so some of the assumptions on which LDA was built are not applicable to social-network graph data. In particular, the direct application of LDA to our task requires that every node in the graph should be of roughly equal popularity and that we should remove nodes of high popularity from the dataset. This is particularly problematic because these popular nodes are really the ones that we want to label accurately; many users are particularly interested in identifying the topic groups of these popular users. Earlier work on the application of the LDA model to social graphs [9, 26] has not addressed the handling of popular nodes.

To address the issues arising from popular nodes in the graph, we first explore existing variations of LDA. We then propose our extensions, two-step labeling and threshold noise filtering, to minimize the labeling noise introduced by popular nodes.

In summary, we make the following contributions in this paper:

• We explore the application of the well-known LDA model and its existing variations for this task and propose two extensions to make LDA suitable for the social-network relationship graph.

• We conduct extensive experiments using a real-world Twitter dataset. Through these experiments, we demonstrate that (1) the application of probabilistic topic models to social-network graphs leads to useful edge/node topic labeling and that (2) our proposed extensions significantly improve the labeling quality over existing methods.

The rest of the paper is organized as follows. In Section 2, we briefly review previous work related to our research. In Section 3, we describe LDA and justify why we use LDA to solve this labeling problem. Then we introduce four different approaches to handle the noise generated by popular users in Section 4. In Section 5, we analyze our approaches using the Twitter dataset. We summarize this paper with our conclusion in Section 6.

2. RELATED WORK

The problem that we are trying to solve can be viewed as clustering in a social network using a topic model. In this section, we briefly review related work in social-network cluster analysis and topic-model-based social-network analysis.

In his seminal work on social-network cluster analysis, Kleinberg [13] proposed Hyperlink-Induced Topic Search (HITS). In this work, he modeled the Web as a huge graph and extracted hubs and authorities in order to cluster communities on the Web. Recently, Mislove et al. [18] identified communities in Facebook based on the social graph structure and inferred unknown attributes through this community information. The methods they used to cluster communities are based on pruning edges from the whole social graph and adding edges from some seed nodes, both of which are very common and widely used approaches in social-network analysis. However, these approaches produce mutually exclusive groups and cannot support multiple memberships, which is important in our scenario where users have a variety of interests.

Topic models, the other class of related work, have also been extended to analyze small social-network data. Though not directly related to social-network analysis, the concept of an author/user was initially introduced in the Author-Topic (AT) model [22]. It was used to extract hidden research topics and trends from CiteSeer's abstract corpus. Zhou et al. [27] modified the AT model and proposed the Community User Topic (CUT) model to capture semantic communities. McCallum et al. [16] extended the AT model and proposed the Author-Recipient-Topic (ART) model and the Role-Author-Recipient-Topic (RART) model in order to analyze the Enron e-mail corpus and an academic e-mail network. Pathak et al. [19] modified the ART model and suggested the Community-Author-Recipient-Topic (CART) model, similar to the RART model. Besides these members of the AT model family, Wang et al. [24] introduced the Group Topic (GT) model and applied it to voting data from the US Senate and the General Assembly of the UN. Mei et al. [17] also introduced a regularized topic modeling framework incorporating a graph structure in the data. Other LDA extensions and probabilistic topic models were also proposed for annotation data analysis [12], chat data analysis [23], tagging data analysis [8], and pairwise data analysis [2]. While most of the LDA approaches introduced above attempted to utilize the authorship information of a given text by adding an author component to LDA's text generative model, our approach focuses on using only the social part of the data (i.e., the social graph) and is generally applicable to many large social networks.

Perhaps the work by Zhang et al. [26] and by Henderson et al. [9] is closest to our work. In both works, the authors applied LDA to academic social networks. However, their respective focus was quite different from ours. For example, [26] focused on the issue of how to convert the co-authorship information into a graph (e.g., direct co-authorship or indirect co-authorship, and an edge weighting scheme based on collaboration frequency). Henderson et al. [9] addressed the issue of a large number of topic clusters generated due to low-popularity nodes in the network, while our primary focus is the effective clustering of high-popularity nodes.

3. APPLYING LDA TO SOCIAL-NETWORK ANALYSIS

In this section, we briefly describe LDA. Because LDA evolved from Probabilistic Latent Semantic Indexing (PLSI) [11], we first describe the concept of PLSI and then explain what differentiates LDA from PLSI. We also justify the reason why we use LDA for social-graph mining and discuss some different aspects between the standard LDA and our model.

3.1 Topic Models

Topic models assume that there are latent (hidden) topics behind words in human language. Thus, even though an author uses the word automobile in a document and a searcher uses the word vehicle in a query, topic models assume that they might have the same concept (topic) car in mind. Based on this assumption, topic models provide methods to infer those latent topics from visible words.

PLSI introduced a probabilistic generative model to topic models.
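The generative view that PLSI formalizes can be sketched as a two-stage sampling procedure: pick a latent topic for the document, then pick a word from that topic. The distributions below are toy values invented purely for illustration:

```python
import random

random.seed(0)

# Toy distributions, invented for illustration: P(z|d) and P(w|z).
p_z_given_d = {"d1": {"car": 0.7, "politics": 0.3}}
p_w_given_z = {
    "car": {"vehicle": 0.6, "automobile": 0.4},
    "politics": {"senate": 1.0},
}

def sample(dist):
    """Draw one key from a {value: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate_word(d):
    # First select a latent topic for the document, then a word from
    # that topic -- the extra topic-selection step described above.
    z = sample(p_z_given_d[d])
    return sample(p_w_given_z[z])

doc = [generate_word("d1") for _ in range(5)]
print(doc)  # five words drawn from d1's mixture of topics
```

Iterating this selection many times generates a document, and iterating over documents generates a corpus; inference runs this process in reverse to recover the latent topics.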
Equation (1) represents its document generation process based on the probabilistic generative model:

    P(d, w) = P(d)P(w|d) = P(d) Σ_{z∈Z} P(w|z)P(z|d).    (1)

P(d, w) is the probability of observing a word w in a document d and can be decomposed into the multiplication of P(d), the probability distribution of documents, and P(w|d), the probability distribution of words given a document. This equation describes a word selection for a document, where we first select a document and then a word in that document. If we iterate this selection multiple times, we can generate a document and eventually a whole document corpus.

By assuming that there is a latent topic z, we can rewrite the equation above with the multiplication of P(w|z), the probability distribution of words given a topic, and P(z|d), the probability distribution of topics given a document. This equation describes adding an additional topic selection step between the document selection step and the word selection step. As there are multiple latent topics where a word may come from, we sum the multiplication over the set of all the independent topics Z.

PLSI and other probabilistic topic models support multiple memberships using the probabilities P(w|z) and P(z|d). For example, if P(w_vehicle|z_car) > P(w_automobile|z_car), the word vehicle is more closely related to the topic car than the word automobile, though both are related to the topic car. In this way, we can measure the strength of association between a word w and a topic z by the probability P(w|z). Similarly, P(z|d) measures the strength of association between a topic z and a document d.

Equation (2) represents the log-likelihood function of PLSI:

    L = log[ Π_{d∈D} Π_{w∈W} P(d, w)^{n(d,w)} ]
      = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w),    (2)

where D and W denote the set of all d and w, respectively, and n(d, w) denotes the term frequency in a document (i.e., the number of times w occurred in d).

By maximizing the log-likelihood function L, we can maximize the probability of observing the entire corpus and accordingly estimate the P(w|z) and P(z|d) that most likely satisfy Equation (1).

3.2 Latent Dirichlet Allocation

Though PLSI is equipped with a sound probabilistic generative model and a statistical inference method, it suffers from the overfitting problem and does not cope well with unobserved words. To solve this problem, Blei et al. [4] introduced Dirichlet priors α and β to PLSI, to constrain P(z|d) and P(w|z), respectively. α is a vector of dimension |Z|, the number of topics, and each element in α is a prior for a corresponding element in P(z|d). Thus, a higher α_i implies that the topic z_i appears more frequently than other topics in the corpus. Similarly, β is a vector of dimension |W|, the number of words, and each element in β is a prior for a corresponding element in P(w|z). Thus, a higher β_j implies that the word w_j appears more frequently than other words in the corpus. As a conjugate prior for the multinomial distribution, the Dirichlet distribution can also simplify the statistical inference. By placing Dirichlet priors α and β on the multinomial distributions P(z|d) and P(w|z), those multinomial distributions are smoothed by the amount of α and β and become safe from the overfitting problem of PLSI. It is also known that PLSI emerges as a specific instance of LDA under Dirichlet priors [7, 10].

Figure 1: Subscription graph representation of models. (a) Subscription graph representation of our model; (b) subscription graph representation of the standard LDA.

3.3 Applying LDA to the Relationship Graph in a Social Network

Before justifying our approach, we briefly explain Twitter and a few interesting aspects of its data, which we use in our performance evaluations later in this paper. In contrast to a mutual friendship in other social networks, Twitter's relationships are unidirectional (i.e., a Twitter user does not need an approval from a user with whom she wants to make friends). Thus, we use the term follow when a user adds another user as her friend. Formally, when a user f follows a user g, f generates a follow edge, or simply an edge, e(f, g) from a follower f to a followed user g. We also use e′(f, g) to denote an edge from g to f (indicating that g is followed by f), e(f) to denote the set of all outgoing edges from f, and e′(g) to denote the set of all incoming edges to g. To refer to the set of all followers, the set of all followed users, and the set of all edges in the dataset, we use F, G, and E, respectively.

Figure 1(a) depicts this notation using a graph that we refer to as a subscription graph. For example, we observe e(f1, cnn) = e′(cnn, f1) = e1, e(f1) = {e1, e2}, and e′(espn) = {e2, e3} in Figure 1(a). Given this subscription graph, our goal is to label each edge with a correct label (interest) and group (label) each followed user based on those labeled edges. For example, since e2 and e3 are labeled with broadcast and sports, respectively, espn is labeled with both. Now we can frame our problem as the graph-labeling problem of automatically associating each user g_i in G with a set of accurate interests z_k in Z based on its labeled incoming edges e′(g_i). (We also label each f_j in F as well.)

We can view the interest here as a topic in a document generative model. As briefly mentioned in Section 3.1, a document topic model assumes that an author has a topic in mind when selecting a word for a document. Likewise, when a user follows another user in Twitter (i.e., when a user generates a follow edge), the follower has an interest in the followed user. This interest may be caused by reasons such as sharing a common interest, having an off-line relationship, being popular, etc. Among these reasons for following someone on Twitter, the two most common reasons are sharing a common interest and being popular, since the unidirectional nature of Twitter relationships allows a user to follow another user without requiring that user to follow her in return, as in the case of a blog subscription graph.
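The notation above maps directly onto code. A minimal sketch, using the toy edges of Figure 1(a):

```python
from collections import defaultdict

# Follow edges e(f, g) from Figure 1(a)'s toy subscription graph.
E = [("f1", "cnn"), ("f1", "espn"), ("f2", "espn")]

# e(f): all outgoing edges of a follower f.
# e'(g): all incoming edges of a followed user g.
out_edges = defaultdict(list)   # e(f)
in_edges = defaultdict(list)    # e'(g)
for f, g in E:
    out_edges[f].append((f, g))
    in_edges[g].append((f, g))

F = set(out_edges)              # the set of all followers
G = set(in_edges)               # the set of all followed users

# e'(espn) = {e2, e3}, so espn has two incoming edges.
print(len(in_edges["espn"]))  # 2
```

Labeling then amounts to attaching a topic z to each element of E, after which each g in G is grouped by the labels on in_edges[g].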
Furthermore, we can consider a follower f to be a document d, a followed user g to be a word w, and the list of followed users for the follower to be the content of the document. Figure 1 illustrates this equivalence between our edge generative model and the standard LDA and justifies our approach of applying LDA to a relationship graph in a social network without changing LDA's generative model.

We use the same notation as LDA. z denotes a labeling of a followed user with a topic (interest), or simply a topic, P(z|f) denotes the multinomial distribution of topics given a follower, and P(g|z) denotes the multinomial distribution of followed users given a topic. α and β are Dirichlet priors constraining P(z|f) and P(g|z), respectively.

3.4 Differences between LDA and the Edge Generative Model

In the previous section, we described the equivalence between our edge generative model and the standard LDA. However, there is a subtle difference between the two generative processes. While words are sampled with replacement in the standard LDA, followed users are sampled without replacement in our model. For example, in a document generative model, a document may contain the word car multiple times. On the other hand, in our edge generative model, a user cannot follow the same user Barack Obama multiple times. As a result, the probability distribution in our model does not follow a multinomial distribution but follows a multivariate hypergeometric distribution. Fortunately, the multinomial distribution can still be used for our model, because it is known that a multivariate hypergeometric distribution converges to a multinomial distribution as the sample size grows large [1]. In our case, since sampling is done on millions of nodes, the two distributions become practically indistinguishable.

Also, when we represent E in matrix form by putting F in the rows and G in the columns as E ∈ B^{F×G}, where B = {0, 1}, some differing aspects are noticed:

1. The rows and the columns are from the same entity U, the set of all Twitter users (F, G ⊆ U). In a matrix formed from a document corpus, documents are located in the rows and words are located in the columns.

2. The matrix is very big and sparse. Because users follow each other, the size of the rows and columns is almost equal to that of all Twitter users (|F| ≈ |G| ≈ |U|). The size of the matrix becomes |F| × |G| and most of its values are 0. This aspect is different from a matrix formed from a document corpus, where the size of the columns (words) is orders of magnitude smaller than that of the rows (documents).

3. The matrix is a binary matrix, in which 1 indicates the existence of an edge and 0 indicates no edge. Each element in a matrix formed from a document corpus is the number of occurrences of each word in a document.

4. As in a matrix formed from a document corpus, the distribution of the column sums shows a power-law distribution, in which a small portion of the columns accounts for a majority of the total column sum. In a document corpus, these columns correspond to words such as the and is, which we call stop words, and are generally removed. The corresponding columns in our matrix, however, correspond to users such as Barack Obama and Britney Spears, whom we can call popular users, and should be taken into account.

Among the four major differences described above, the first and the third do not affect our effort to apply LDA to the follow edge dataset; only appropriate pre-processing is required. The second limits the size of the analysis but can be solved by adding more resources or by dividing the work across multiple machines [20].

However, the last difference has a significant effect on our analysis because it is related to the quality of our analysis results. In a document corpus analysis, the stop words are generally removed before analysis, since an analysis without removing these stop words produces a very noisy result where the frequent stop words are labeled with every topic [4]. However, in our analysis, popular users are very important to include, because most users are keenly interested in following famous and influential users whose topic interests are similar to theirs. Unfortunately, it is not sufficient to simply include the popular users in the LDA analysis, because the inclusion of popular users produces the same noisy result seen in the text analysis case: when stop words (or popular users) are included, they get included in every topic group, producing a very noisy result.

In the following section, we explore some LDA extensions to deal with the noise generated by popular users. Note that the earlier work on the application of LDA to an authorship graph dataset [9, 26] did not address how we can handle popular users. For example, [26] focused on the issue of how to transform co-authorship relations into a graph, and [9] focused on how to deal with the large number of topic groups that are produced from the nodes whose connectivity is very low.

Before moving to the next section, we summarize the symbols used in this paper in Table 1.

Table 1: Symbols used throughout this paper and their meanings

    Symbol     Meaning
    u          A Twitter user
    U          The set of all u
    f          A follower
    F          The set of all f
    g          A followed user
    G          The set of all g
    z          A topic (interest)
    Z          The set of all z
    e(f, g)    A follow edge from f to g
    e′(f, g)   A follow edge from g to f
    e(f)       The set of all outgoing edges from f
    e′(g)      The set of all incoming edges to g
    E          The set of all e(f, g) ∀f ∈ F, ∀g ∈ G

4. HANDLING POPULAR USERS

As discussed in Section 3.4, popular users generate noise if they are not dealt with carefully. This section describes how to handle this issue efficiently (i.e., how to label popular users with correct labels).
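The matrix view of Section 3.4 suggests a simple way to spot the popular users this section sets out to handle: compute the column sums (in-degrees |e′(g)|) and flag the heavy-tailed columns, analogous to stop words in a term-document matrix. A minimal sketch, with invented edges and an arbitrary cutoff:

```python
from collections import Counter

# Toy edge list; "obama" plays the role of a popular user (invented data).
E = [("f1", "obama"), ("f2", "obama"), ("f3", "obama"),
     ("f1", "g1"), ("f2", "g2")]

# Column sums of the binary matrix E in {0,1}^(F x G) are the
# in-degrees |e'(g)| of the followed users.
in_degree = Counter(g for _, g in E)

# Analogous to stop words, popular users are the columns whose sums sit
# in the head of the power-law distribution; the cutoff here is arbitrary.
threshold = 2
popular = {g for g, n in in_degree.items() if n > threshold}
print(sorted(popular))  # ['obama']
```

Unlike stop words, these columns must not simply be dropped; the approaches below keep them while containing the noise they introduce.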
Figure 2: Topic hierarchy and documents generated in the hierarchy

We first attempt to use the standard LDA with different settings (asymmetric priors). Then we explore the most appropriate LDA approach to this issue (Hierarchical LDA [3]) among a variety of LDA extensions we considered. Finally, we propose two new LDA extensions of our own (two-step labeling and threshold noise filtering). The two new extensions can be combined for better labeling quality.

4.1 Setting Asymmetric Priors

As mentioned in Section 3.2, LDA constrains the distributions of topics and words with Dirichlet priors α and β, respectively. Though each element of the vectors α and β may take different values in principle, in the standard LDA each element of α and β is assumed to have the same value (often referred to as the symmetric prior assumption). Intuitively, this assumption implies that every topic and word in a document corpus is equally likely.

Though the former sounds agreeable, the latter sounds unrealistic, since it is very well known that the probability distribution of words follows a power-law distribution by Zipf's law. It is also the reason why stop words are removed before applying LDA to a document corpus, since stop words correspond to the head of the power-law distribution.

The most intuitive approach to address this issue would be to set a different prior for each followed user. Between the two priors α and β, we are only interested in β, the prior over the distribution of words given a topic, because a followed user corresponds to a word in the standard LDA. As a higher prior value implies a higher likelihood of being observed in the corpus, we set each prior value proportional to the number of incoming edges of each followed user. This is expected to associate popular users with more accurate labels, as they are given adequate prior values.

We set β_{g_i}, the prior for the followed user g_i in the vector β, as in Equation (3):

    β_{g_i} = (0.98 |e′(g_i)| + 0.01 max(|e′(g)|) − 0.99 min(|e′(g)|)) / (max(|e′(g)|) − min(|e′(g)|))    (3)

Note that we set the lowest value for each element in β as 0.01 and the highest value as 0.99 to make the prior values skewed between 0 and 1.

4.2 Hierarchical LDA

Hierarchical LDA (HLDA) is also a good candidate for our problem because it generates a topic hierarchy and more frequent topics are located at higher levels. Figure 2 shows an example of a topic hierarchy and document paths in the topic tree, where z_k denotes a topic and d_i denotes a document. In HLDA, when a document is generated according to Equation (1), words are chosen from topics in a document path. Since the top-level topic is associated with all the documents, common words in every document (i.e., stop words) are expected to be labeled with the top-level topic. On the contrary, the bottom-level topics are expected to be more specific, as they are associated with a small number of documents. For example, if z2 is a topic about network, z4 and z5 would be topics about queueing and routing, respectively. As z1 is usually a topic consisting of the stop words, a document d1 from the document tree path z1-z2-z4 consists of words from topics z1, z2, and z4 and becomes a document about network queueing. Similarly, in our model, z1 is involved in every user's follow edge generation process and is expected to be associated with popular users.

This topic hierarchy is established because HLDA is based on the Nested Chinese Restaurant Process (NCRP), a tree extension to the Chinese Restaurant Process, which probabilistically generates a partition of a set {1, 2, . . . , n} at time n. In NCRP, a document is considered as a Chinese restaurant traveler who visits L restaurants along a restaurant tree path, where L refers to the level of the tree (i.e., the length of the path).

Figure 3: Two-step labeling

4.3 Two-Step Labeling

In the previous sections, we explored existing LDA approaches to handle the popular-user issue. Now we propose a new LDA extension: two-step labeling. We decompose the labeling process into two sub-processes: establishing topics and labeling users with the established topics. In the first, topic establishment step, we run LDA after removing popular users from the dataset, similar to how we remove stop words before applying LDA to a document corpus. This step generates clean topics, free from the noise generated by popular users. In the second, labeling step, we apply LDA only to the popular users in the dataset. As we use the collapsed Gibbs sampling algorithm [21], edges to popular users are labeled according to the pre-established topics, as represented in Equation (4):

    P(e(f_i, g_j) = z | ·) ∝ [(N_{g_j z} + β) / Σ_{k=1..|G|} (N_{g_k z} + β)] · [(N_{f_i z} + α) / Σ_{k=1..|Z|} (N_{f_i z_k} + α)],    (4)

where P(e(f, g) = z | ·) denotes the probability of labeling the edge from a follower f to the followed user g with a topic z given all conditions, N_{gz} denotes the number of times g is labeled with z, and N_{fz} denotes the number of times f is labeled with z.

This equation implies that an edge to a followed user is assigned to a topic according to how many times that user has been assigned to the topic, as well as how many times that topic has been previously assigned to the follower following that user. Thus, if some assignments are made in the first step, they affect assignments in the second step. For example, if a user f1 follows non-popular users g1 and g2 and a popular user g3, then g1 and g2 are sampled in the first step, and g3 is sampled in the second step with a higher likelihood of being labeled with the topic that g1 or g2 were labeled with in the first step.
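A minimal sketch of the second (labeling) step, sampling an edge's topic in proportion to Equation (4). The topics, counts, priors, and user names below are all invented for illustration, and the first-step counts are simply frozen in dictionaries rather than produced by a real first-step LDA run:

```python
import random

random.seed(1)

# Topics established in step one, and symmetric scalar priors.
Z = ["z1", "z2"]
alpha, beta = 0.1, 0.01

# N_{gz}: times followed user g was labeled z in step one (frozen here).
n_gz = {("g1", "z1"): 5, ("g1", "z2"): 0,
        ("g2", "z1"): 4, ("g2", "z2"): 1,
        ("g3", "z1"): 0, ("g3", "z2"): 0}   # g3 is the popular user
# N_{fz}: times follower f's edges were labeled z in step one.
n_fz = {("f1", "z1"): 2, ("f1", "z2"): 0}
G = ["g1", "g2", "g3"]

def label_edge(f, g):
    """Sample a topic for edge e(f, g) in proportion to Equation (4)."""
    weights = []
    for z in Z:
        term_g = (n_gz[(g, z)] + beta) / sum(n_gz[(gk, z)] + beta for gk in G)
        term_f = (n_fz[(f, z)] + alpha) / sum(n_fz[(f, zk)] + alpha for zk in Z)
        weights.append(term_g * term_f)
    r, acc = random.random() * sum(weights), 0.0
    for z, w in zip(Z, weights):
        acc += w
        if r < acc:
            return z
    return Z[-1]

# Because f1's edges to g1 and g2 were mostly labeled z1 in step one,
# the edge to the popular user g3 is more likely labeled z1 as well.
print(label_edge("f1", "g3"))  # 'z1' with this seed
```

Note how the N_{fz} term carries the first-step assignments into the second step, which is exactly the coupling the paragraph above describes.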
This approach is illustrated in Figure 3, where the E1 part of the dataset is sampled in the first step and the E2 part is sampled in the second step. Note that two-step labeling does not increase computational complexity, because it samples a different part of the dataset at each sampling step (O(|Z| · |E|) = O(|Z| · (|E1| + |E2|))).

There is related research in the literature. Online LDAs introduced in [5] are designed to deal with a growing corpus and sample topics for words in a document as it comes in. Though both two-step labeling and online LDAs use multi-step sampling, their goals are totally different. While online LDAs try to change topics as the corpus grows over time, two-step labeling fixes a set of topics and labels users with this fixed set of topics. In Figure 3, G1 and G2 (words) are mutually exclusive in two-step labeling, while F1 and F2 (documents) are mutually exclusive in online LDAs.

4.4 Threshold Noise Filtering

As briefly described in Section 3.1, the association between a user (a word) and a topic has varying association strengths, represented by P(g|z) (P(w|z)) in a probabilistic topic model. Thus, we can list users labeled with the same topic in descending order of P(g|z) and regard the top entries in the list (topic group) as more important than

As the threshold noise filtering process is done after sampling, it does not increase computational complexity, similar to two-step labeling. Figure 4 illustrates this process and shows the top three popular users' distributions over topic groups. (Other normal users also show similar non-linear distributions.) Alternatively, we may filter out less relevant topics by keeping only the top-K topics for each user, for a reasonably small K value. We tested both schemes (threshold noise filtering and top-K filtering), but we could not see any practical differences. Due to lack of space, we only report the results with the former scheme. Though threshold noise filtering can be used with any other approach, we combine it with the two most representative cases, two-step labeling and the standard LDA, in our experiments.

Figure 4: An example of the threshold noise filtering process

5. EXPERIMENTS

In this section, we experimentally compare our proposed extensions to LDA, two-step labeling and threshold noise filtering, against existing approaches. We ran experiments on a real-world dataset obtained from Twitter and compared the performance of various approaches under two evaluation metrics, perplexity and quality. As we will see from our results, our proposed extensions show performance equivalent to or significantly better than existing approaches in our experiments.

In Section 5.1, we describe our dataset and the overall experimental setup. In Sections 5.2 and 5.3, we report our results on perplexity and quality, respectively. We also provide a qualitative comparison of different approaches in Section 5.4.

5.1 Dataset and Experiment Settings

For our experiments, we use a Twitter dataset we collected between October 2009 and January 2010. The original downloaded dataset contained 273 million follow edges, but we sampled 10 million edges from this dataset to keep our experiment manageable. To ensure that all the follow edges were preserved in our sampled dataset, we first sampled followers randomly and included all follow edges from the sampled users, until we obtained 10 million follow edges. Figure 5 shows the distributions of incoming and outgoing edges in the sampled dataset. The horizontal axis shows the
the entries at the bottom because they are more strongly number of edges at a node and the vertical axis shows how
associated with the topic. Similarly, we can measure the many nodes have the given edge count. Both axes are shown
association strength from the user’s viewpoint using P (z|g). in a logarithmic scale. From the graph, it is clear that the
Though two-step labeling may help label popular users with number of incoming edges follows a power-law distribution,
the right topics, popular users may take top positions even which is often the case for this type of dataset. Interestingly,
in less-relevant topic groups because even the smallest num- we observe that the number of outgoing edges is quite uni-
ber of times a popular user is assigned to a topic group could form between edge counts 1 to 100, which is different from
outnumber the largest number of times a non-popular user the distribution reported in [14]. We do not believe this
is assigned to that topic group. difference is due to our sampling, because the graph from
To mitigate this problem, we propose a new approach, the complete dataset shows the same flat curve between the
threshold noise filtering, which sets a cut-off value to deter- edge counts 1 and 100 [25]. It could be an interesting future
mine whether to label a user with each topic. By ignoring work to investigate where this difference comes from. In our
assignments below the cut-off value, we can expect smooth- sampled dataset, barackobama has the most followers, 7410,
ing and noise reduction effects as in the anti-aliasing filter. and zappos follows the most users, 142,669. Table 2 shows
We set Ngi zk , the number of times a followed user gi is as- some basic statistics of the sampled dataset.
signed to a topic group zk , as 0 if it is below the cut-off value Since we are mainly interested in investigating how differ-
C: ent approaches handle popular users, we categorized the fol-
lowed users G into two distinct sub groups according to their
( Ngi zk incoming-edge counts. One group is the normal user group,
0 if P (zk |gi ) = <C
Ngi zk =
P|Z|
j=1 Ngi zj
where |e0 (g)| ≤ V , a boundary value, and the other group is
Ngi zk otherwise. the popular user group, where |e0 (g)| > V . We tested three

570
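As an illustration, the grouping above reduces to counting incoming edges and thresholding on V. The sketch below is hypothetical (the edge-list representation and all names are ours, not the paper's code):

```python
from collections import Counter

def split_followed_users(edges, V):
    """Split followed users into the normal group (in-degree <= V) and the
    popular group (in-degree > V) from (follower, followed) edge pairs."""
    in_degree = Counter(followed for _, followed in edges)
    normal = {g for g, d in in_degree.items() if d <= V}
    popular = {g for g, d in in_degree.items() if d > V}
    return normal, popular

# Toy follow edges: g3 is followed three times, g1 and g2 once each,
# so with V = 2 only g3 falls into the popular group.
edges = [("f1", "g1"), ("f1", "g3"), ("f2", "g3"), ("f3", "g3"), ("f2", "g2")]
normal, popular = split_followed_users(edges, V=2)
```

With V = 100 on the sampled dataset, this kind of split produces the 0.3% / 20.2% breakdown reported above.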
Since all the results from the three boundary values show consistent patterns, we only report the result for the case V = 100, where the popular user group consists of only 0.3% of all followed users but accounts for 20.2% of all follow edges.

Table 2: Statistics of our Twitter dataset

    Statistics     Value
    |E|            10,000,000
    |U|            2,430,237
    |F|            14,015
    |G|            2,427,373
    max(|e(f)|)    142,669 (f: zappos)
    max(|e′(g)|)   7,410 (g: barackobama)

Figure 5: Distributions of incoming and outgoing edges

Table 3 summarizes the eight representative experimental cases we report on in this section. base is the standard LDA experiment over the whole group dataset. non-popular is also the standard LDA experiment, but it is applied only to the normal user group, to remove the noise generated by popular users. For beta, we use asymmetric Dirichlet priors for the hyperparameter β, where we have tested proportional, inversely-proportional, and ladder-shape prior schemes. hlda-2lv and hlda-3lv are HLDA experiments with two and three levels, respectively. 2step is our two-step labeling experiment. For threshold noise filtering, where we have tested multiple threshold settings including C = 0.01, 0.05, 0.10 and K = 1, 2, 3, we generate filter-base and filter-2step by combining threshold noise filtering with base and two-step labeling, respectively. Due to space limits and to clarify our discussion, we only report the results for a subset of our experimental settings in this paper. In particular, we report the result from proportional β priors for beta and C = 0.05 for threshold noise filtering. The results that are not reported in this paper are very similar to what we report here.

Table 3: Experimental cases and descriptions

    Case          Experiment Description
    base          LDA over the whole group dataset
    non-popular   LDA over the normal user group dataset
    beta          LDA with asymmetric β priors
    hlda-2lv      HLDA with 2 levels
    hlda-3lv      HLDA with 3 levels
    2step         Two-step labeling
    filter-base   Threshold noise filtering after base
    filter-2step  Threshold noise filtering after 2step

In all of our experiments, we generated 100 topic groups (50 groups for hlda-2lv) and picked the top ten entries in each topic group according to their probabilities P(g|z) in each group.

5.2 Perplexity

In this section, we report our results on the perplexity metric, which was also used in earlier related work such as [4, 9, 10, 26]. Perplexity is a well-known standard metric used in IR. It tries to quantify the accuracy of a model by measuring how well the trained model deals with unobserved test data. More precisely, perplexity is defined to be [26]:

    Perplexity(E_test) = exp( − Σ_{e ∈ E_test} log P(e) / |E_test| ),

where E_test denotes all the edges in the test dataset. We calculated Perplexity values for the 10% held-out dataset after training the models on the remaining 90% of the dataset. In general, lower perplexity means better generalizability.

Note that the original LDA model is designed to derive an optimal model that minimizes perplexity. Therefore, it is unlikely that extensions to base LDA show lower perplexity. Our primary goal in comparing the results under the perplexity metric is to see whether our proposed extensions significantly degrade perplexity. Figure 6 compares the Perplexity values of the eight experimental cases. The last set of bars shows the Perplexity values on the overall test dataset. The first set of bars shows Perplexity only on the edges to normal users, and the second set shows Perplexity only on the edges to popular users.

Figure 6: Perplexity comparison

From the graphs, we first notice that HLDA (hlda-3lv and hlda-2lv) shows significantly worse Perplexity than the others. (The Perplexity values of hlda-2lv and hlda-3lv are calculated with the empirical likelihood values provided by Mallet, http://mallet.cs.umass.edu.) This is due to the fact that the standard LDA is designed to minimize Perplexity while HLDA is not necessarily designed for this task. We also note that the Perplexity value for non-popular is significantly lower than the others. This is because we eliminate all edges to popular users in non-popular and do not include these edges in computing the Perplexity value. Therefore, non-popular had an unfair advantage of "ignoring" all noise from popular users and not being evaluated on them. Overall, we find that our two-step labeling and threshold noise filtering show perplexity values similar to the standard LDA model.
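For reference, the held-out perplexity defined above reduces to a one-line computation over per-edge log-likelihoods. This is a minimal sketch with hypothetical names; the per-edge values P(e) would come from a trained model:

```python
import math

def perplexity(edge_log_probs):
    """Perplexity(E_test) = exp(-sum_e log P(e) / |E_test|)."""
    return math.exp(-sum(edge_log_probs) / len(edge_log_probs))

# Toy example: two held-out edges, each with P(e) = 0.25,
# give a perplexity of 4 (up to floating-point rounding).
p = perplexity([math.log(0.25), math.log(0.25)])
```

A uniform model over N outcomes would score a perplexity of N under this definition, which is the usual sanity check.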
That is, our extensions to LDA do not introduce noticeable Perplexity degradation to the standard LDA model. (In fact, the Perplexity values of our two approaches are lower than those of the standard LDA in our experiments, but we expect that this is simple experimental fluctuation that does not have a significant meaning.)

5.3 Quality

Ultimately, the effectiveness of the various approaches should be determined by how users perceive the quality of the topic groups identified by each approach. To measure human-perceived quality, we conducted a survey with a total of 14 participants. The participants in our survey were presented with a random group of ten Twitter users (identified by one of the eight approaches described in Section 5.1) and were asked to indicate whether each user in the topic group was relevant to the group or not. Overall, we collected 23 judged topic groups per approach, with a total of 161 judged topic groups.

Given the survey results, we computed the human-perceived Quality value of the topic groups Z identified by an approach as:

    Quality(Z) = Σ_{k=1..|Z|} Σ_{i=1..|z_k|} δ(g_i, z_k) · log(|e′(g_i)|),

where the δ function is defined as:

    δ(g_i, z_k) = {  1  if user g_i is related to topic group z_k
                  { −1  if user g_i is not related to topic group z_k.

Note that in the above Quality formula, the factor log(|e′(g_i)|) is added to assign higher weights to more popular users, because most people are interested in following popular users and pay more attention to the correct topic clustering of those users.

Figure 7 reports the results from this survey, where each bar shows the Quality value of one of the eight approaches. From this graph, we observe the following:

Figure 7: Quality comparison

1. Both of our extensions, two-step labeling and threshold noise filtering, are very effective in improving the human-perceived quality of the identified topic groups. For example, when compared to base, filter-2step achieves a 1.64 times higher Quality value.

2. As two-step labeling initially forms clean topics free from popular-user-generated noise, its gain is more significant than that of threshold noise filtering. However, threshold noise filtering can be easily combined with any other approach to improve Quality, as in filter-base.

3. If we apply the standard LDA to popular users as well, without any pre-filtering, the produced topic groups are quite noisy. For example, the Quality value of base is 1.34 times lower than that of non-popular, which filters out all popular users first.

4. Using an asymmetric Dirichlet prior for the hyperparameter β improves Quality, but the result is still noisier compared to our extensions.

5. The HLDA model shows only marginal improvements in the Quality metric when compared to base. This low performance is because most of the low-level groups in HLDA have users with very few followers (who are less likely to be interesting when recommended) and contribute only small weights to Quality.

In Figure 8, we also report the false positive rates for popular users, which measure the incorrect labeling of popular users. That is, whenever a user in the popular user group is placed in a topic group by our analysis, our survey participants evaluated whether the topic group is a good fit for that user. Figure 8 reports what fraction of them were considered to be a "bad fit" or an "incorrect" association. Different from Figure 7, the HLDA models seemingly perform as well as or even better than our two approaches. This is because HLDA tends to place popular users in top-level groups. When our survey participants looked at such a group, they simply considered it a "popular user group" and determined that the popular users in the group were a good fit, even though the group itself did not exhibit any coherent topic. Therefore, even though the false positive rate for HLDA is low, it does not necessarily improve the topic-based recommendation accuracy for popular users. We can observe that our 2step and filter-2step show comparably low false positive rates, similar to those of the HLDA approaches.

Figure 8: False positive rate of popular users

5.4 Qualitative Comparison

In this section, we examine the topic groups generated by each approach more closely to get a better sense of the quality of the produced topic groups.

Figure 9 shows the top ten users in a topic group from base, which we name the cycle group. Together with their Twitter usernames, we show their follower counts |e′(g)| and the bio of each user. By going over the users' bios, we can see that many of the users in the group have the same interest, cycle.
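For concreteness, the Quality formula of Section 5.3 can be sketched as follows; the survey-judgment callback and follower counts below are hypothetical stand-ins, not the paper's data:

```python
import math

def quality(groups, is_related, followers):
    """Quality(Z) = sum over groups z_k and members g_i of
    delta(g_i, z_k) * log(|e'(g_i)|), with delta = +1 if related else -1."""
    total = 0.0
    for zk, members in groups.items():
        for g in members:
            delta = 1 if is_related(g, zk) else -1
            total += delta * math.log(followers[g])
    return total

# Toy example: one group with one related and one unrelated user,
# both with the same follower count, so the two terms cancel to 0.
score = quality({"tech": ["g1", "g2"]},
                lambda g, zk: g == "g1",
                {"g1": 100, "g2": 100})
```

The log-follower weighting means a single mislabeled popular user can outweigh several correctly labeled niche users, which matches the metric's stated emphasis.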
However, we also observe that among the three users with the highest number of incoming edges, barackobama, stephenfry, and lancearmstrong, only lancearmstrong is related to cycle. The other two users, barackobama and stephenfry, are included in this group simply because they are famous and followed by many users. In particular, we note that barackobama appears in 15 of the 100 topic groups produced by base. Among the 15 groups, only one is related to his specialty, politics, which clearly shows that the standard LDA suffers from the noise from popular users if applied directly to our social-graph dataset.

Figure 9: Topic group cycle from base, showing the popular-user issue

Figure 10 shows an example topic group produced by non-popular. Note that in this group, all users have |e′(g)| values smaller than 100, because popular users are removed from the dataset. Therefore, none of the popular users will belong to a topic group under this approach. When we go over the users' bios, we see that every user in this group is somewhat related to mobile gadgets. That is, the topic group produced by non-popular is significantly cleaner (less noisy) than that from base, at the expense of not being able to group any popular users.

Figure 10: Topic group mobile gadget blog from non-popular

Figure 11 shows an example topic group from 2step. Note that many users in this group are very popular and they are mainly about the same topic, tech media, indicating that 2step is able to group popular users into the right topic group in general. However, we observe that 2step still suffers from the presence of a few popular, yet less relevant, users in the group (in this example, cnnbrk and breakingnews may be considered less relevant to the group than the others).

Figure 11: Topic group from 2step corresponding to Figure 10

With threshold noise filtering, we achieved less noisy results. Figure 12 shows a result topic group from filter-2step corresponding to the group in Figure 11 from 2step. We observe that cnnbrk and breakingnews, popular users on general media, are now removed from Figure 11, and more technology-centric media such as firefox, youtube, and engadget are added. Note that firefox and youtube are Twitter accounts publishing news related to FireFox and YouTube. As we pick only a few most probable topics for popular users in filter-2step, they have less chance to appear in less-relevant topic groups.

Figure 12: Topic group from filter-2step corresponding to Figure 10

6. CONCLUSION

In this paper, we applied LDA to analyze the relationship graph in a large social network. Different from the usual approaches that extract topics from textual data, such as bios and tweets, our approaches rely purely on the social-network graph consisting of follow edges.
Even with this relatively limited type of data compared to those of previous approaches, our approaches generated very well-organized results and showed the great potential of applying LDA to these kinds of clustering applications. Our approaches are especially useful when only linkage data is available.

We also dealt with popular-user-generated noise, which is inevitable in a large social network. Unlike the stop words in standard LDA applications, popular users have important meanings and should be dealt with carefully. We explored four extensions to the standard LDA and quantitatively and qualitatively analyzed their results using a Twitter dataset. Our proposed approaches, two-step labeling and threshold noise filtering, are very effective in handling this popular-user issue, showing a 1.64 times improvement in the quality of the produced topic groups compared to the standard LDA model.

7. ACKNOWLEDGEMENTS

We would like to thank Jong Han Park and Christopher Moghbel for their help and feedback throughout this research.

8. REFERENCES

[1] Multinomial distribution. http://en.wikipedia.org/wiki/Multinomial_distribution.
[2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008.
[3] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems, 2003.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] K. R. Canini, L. Shi, and T. L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Artificial Intelligence and Statistics, 2009.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.
[7] M. Girolami and A. Kaban. On an equivalence between PLSI and LDA. In SIGIR, 2003.
[8] M. Harvey, I. Ruthven, and M. J. Carman. Improving social bookmark search using personalised latent variable language models. In WSDM, 2011.
[9] K. Henderson and T. Eliassi-Rad. Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM Symposium on Applied Computing, 2009.
[10] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[11] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999.
[12] T. Iwata, T. Yamada, and N. Ueda. Modeling social annotation data with content relevance using a topic model. In NIPS, 2009.
[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
[14] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 591–600. ACM, 2010.
[15] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, pages 129–137, 1982.
[16] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249–272, 2007.
[17] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 101–110. ACM, 2008.
[18] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: Inferring user profiles in online social networks. In WSDM, 2010.
[19] N. Pathak, C. DeLong, A. Banerjee, and K. Erickson. Social topic models for community extraction. In The 2nd SNA-KDD Workshop, 2008.
[20] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010.
[21] M. Steyvers and T. L. Griffiths. Probabilistic topic models. In Handbook of Latent Semantic Analysis, 2007.
[22] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In SIGKDD, 2004.
[23] V. Tuulos and H. Tirri. Combining topic models and social networks for chat data mining. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[24] X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations and text. In Proceedings of the 3rd International Workshop on Link Discovery, pages 28–35. ACM, 2005.
[25] M. J. Welch, U. Schonfeld, D. He, and J. Cho. Topical semantics of Twitter links. In WSDM, 2011.
[26] H. Zhang, B. Qiu, C. L. Giles, H. C. Foley, and J. Yen. An LDA-based community structure discovery approach for large-scale social networks. In IEEE International Conference on Intelligence and Security Informatics, pages 200–207, 2007.
[27] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In World Wide Web Conference, 2006.
