Social-Network Analysis Using Topic Models

SIGIR'12, August 12–16, 2012, Portland, Oregon, USA.
Copyright 2012 ACM 978-1-4503-1472-5/12/08 ...$15.00.

ABSTRACT

In this paper, we discuss how we can extend probabilistic topic models to analyze the relationship graph of popular social-network data, so that we can "group" or "label" the edges and nodes in the graph based on their topic similarity. In particular, we first apply the well-known Latent Dirichlet Allocation (LDA) model and its existing variants to the graph-labeling task and argue that the existing models do not handle popular nodes (nodes with many incoming edges) in the graph very well. We then propose possible extensions to this model to deal with popular nodes.

Our experiments show that the proposed extensions are very effective in labeling popular nodes, showing significant improvements over the existing methods. Our proposed methods can be used for providing, for instance, more relevant friend recommendations within a social network.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering; D.2.8 [Software Engineering]: Metrics—performance measures

General Terms
Experimentation, Algorithms

Keywords
social-network analysis, topic model, Latent Dirichlet Allocation, handling popular nodes

1. INTRODUCTION

Social network services are gaining popularity. A growing number of users use major social network services, such as Facebook and Twitter, to share their thoughts and whereabouts with their friends and followers. On a social network, a user indicates that she wants to get notified of another user's updates by "following" (or making friends with) that user. Then when the followed user "updates her status" (or shares a new thought), all users who follow her are immediately notified. This "follow relationship" among users often looks unorganized and chaotic, because follow relationships are created haphazardly by each individual user and not controlled by a central entity.

In this paper, we explore techniques to provide more structure to this follow relationship (1) by "grouping" the users based on their topic interests, and (2) by "labeling" each follow relationship with the identified topic group. More formally, we consider each user in a social network as a node in a graph and each follow relationship as a directed edge between two nodes. Then our goal is to "group" a set of nodes in the graph based on their topics and "label" each edge in the graph with a topic group number.

Inferring a structure within the social-network relationship graph can be useful for many reasons. For example, a novice user on a social network often encounters the bootstrapping problem: discovering relevant users to connect with. To mitigate this problem, social network services may recommend potentially interesting users to new users if they can group users of similar interests and infer why the new user has decided to follow a certain initial set of other users. Similarly, we can identify a small set of "influential" users on a certain topic (say, for marketing and advertising purposes) if we can identify the users' topic interests.

Roughly, we can consider our goal as a clustering (or classification) problem, where many popular solutions such as K-means [15] and DBSCAN [6] exist. These existing methods, however, are not appropriate for our task because they either (1) associate each node with a single group (hard clustering) or (2) can associate each node with multiple groups (soft clustering), but require a completely separate method to label edges as well as nodes (since a node may be associated with multiple groups). Given the diversity of interests a user may have, it is too restrictive to associate a user with a single topic group. For example, Kevin Rose, one of the most popular users on Twitter, may belong to the entrepreneur group as he is the founder of Digg.com, but may also belong to the Internet commentator group since he runs a popular Internet podcast. Since many users on social networks are active in more than one community, we believe it is unrealistic to require that every user belong to just one group.

In this paper, we apply a well-known probabilistic topic model, called Latent Dirichlet Allocation (LDA), to the follow-relationship graph of the social network, in order to label the nodes and the edges in the graph with (possibly) multiple
topics. Unfortunately, LDA was developed for labeling documents and words with topics (not nodes and edges in a graph), so some of the assumptions on which LDA was built are not applicable to social-network graph data. In particular, the direct application of LDA to our task requires that every node in the graph be of roughly equal popularity and that we remove nodes of high popularity from the dataset. This is particularly problematic because these popular nodes are really the ones that we want to label accurately; many users are particularly interested in identifying the topic groups of these popular users. Earlier work on the application of the LDA model to social graphs [9, 26] has not addressed the handling of popular nodes.

To address the issues arising from popular nodes in the graph, we first explore existing variations of LDA. We then propose our extensions, two-step labeling and threshold noise filtering, to minimize the labeling noise introduced by popular nodes.

In summary, we make the following contributions in this paper:

• We explore the application of the well-known LDA model and its existing variations for this task and propose two extensions to make LDA suitable for the social-network relationship graph.

• We conduct extensive experiments using a real-world Twitter dataset. Through these experiments, we demonstrate that (1) the application of probabilistic topic models to social-network graphs leads to useful edge/node topic labeling and that (2) our proposed extensions significantly improve the labeling quality over existing methods.

The rest of the paper is organized as follows. In Section 2, we briefly review previous work related to our research. In Section 3, we describe LDA and justify why we use LDA to solve this labeling problem. Then we introduce four different approaches to handle the noise generated by popular users in Section 4. In Section 5, we analyze our approaches using the Twitter dataset. We summarize this paper with our conclusion in Section 6.

2. RELATED WORK

Topic models, the other class of related work, have also been extended to analyze small social-network data. Though not directly related to social-network analysis, the concept of author/user was initially introduced in the Author-Topic (AT) model [22]. It was used to extract hidden research topics and trends from CiteSeer's abstract corpus. Zhou et al. [27] modified the AT model and proposed the Community User Topic (CUT) model to capture semantic communities. McCallum et al. [16] extended the AT model and proposed the Author-Recipient-Topic (ART) model and the Role-Author-Recipient-Topic (RART) model in order to analyze the Enron e-mail corpus and an academic e-mail network. Pathak et al. [19] modified the ART model and suggested the Community-Author-Recipient-Topic (CART) model, similar to the RART model. Besides these members of the AT model family, Wang et al. [24] introduced the Group Topic (GT) model and applied it to voting data from the US Senate and the General Assembly of the UN. Mei et al. [17] also introduced a regularized topic modeling framework incorporating a graph structure in the data. Other LDA extensions and probabilistic topic models were also proposed for annotation data analysis [12], chat data analysis [23], tagging data analysis [8], and pairwise data analysis [2]. While most of the LDA approaches introduced above attempted to utilize the authorship information of a given text by adding the author component to LDA's text generative model, our approach focuses on using only the social part of the data (i.e., the social graph) and is generally applicable to many large social networks.

Perhaps the work by Zhang et al. [26] and by Henderson et al. [9] is closest to our work. In both works, the authors applied LDA to academic social networks. However, their respective focus was quite different from ours. For example, [26] focused on the issue of how to convert the co-authorship information into a graph (e.g., direct or indirect co-authorship, and an edge weighting scheme based on collaboration frequency). Henderson et al. [9] addressed the issue of a large number of topic clusters generated due to low popularity nodes in the network, while our primary focus is the effective clustering of high popularity nodes.

models. Equation (1) represents its document generation process based on the probabilistic generative model:

P(d, w) = P(d) P(w|d) = P(d) Σ_{z∈Z} P(w|z) P(z|d).    (1)
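As a concrete illustration of the decomposition in Equation (1), the sketch below builds small random conditional-probability tables and marginalizes over topics to obtain the joint P(d, w). All dimensions and values are made up for the example; this is not the paper's inference code, only the generative identity itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(m, axis):
    """Scale slices along `axis` so each sums to 1 (a probability table)."""
    return m / m.sum(axis=axis, keepdims=True)

# Hypothetical toy sizes: 4 words, 2 topics, 3 documents.
P_w_given_z = normalize(rng.random((4, 2)), axis=0)   # P(w|z): each column sums to 1
P_z_given_d = normalize(rng.random((2, 3)), axis=0)   # P(z|d): each column sums to 1
P_d = np.full(3, 1.0 / 3.0)                           # uniform document prior P(d)

# Equation (1): P(d, w) = P(d) * sum over z of P(w|z) P(z|d).
P_w_given_d = P_w_given_z @ P_z_given_d               # marginalize out the topic z
P_dw = P_w_given_d * P_d                              # joint P(d, w), shape (4, 3)

assert np.isclose(P_dw.sum(), 1.0)                    # a joint distribution sums to 1
```

Each document is thus a mixture over topics, and each topic a distribution over words; the matrix product performs the sum over z.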
unidirectional nature of Twitter relationships allows a user to follow another user without requiring that user to follow her in return, as in the case of a blog subscription graph.

Furthermore, we can consider a follower f to be a document d, a followed user g as a word w, and a list of followed users for the follower as the content of the document. Figure 1 illustrates this equivalence between our edge generative model and the standard LDA and justifies our approach of applying LDA to a relationship graph in a social network without changing LDA's generative model.

We use the same notation as LDA. z denotes a labeling of a followed user with a topic (interest), or simply a topic, P(z|f) denotes the multinomial distribution of topics given a follower, and P(g|z) denotes the multinomial distribution of followed users given a topic. α and β are Dirichlet priors constraining P(z|f) and P(g|z), respectively.

3.4 Differences between LDA and Edge Generative Model

In the previous section, we described the equivalence between our edge generative model and the standard LDA. However, there is a subtle difference between the two generative processes. While words are sampled with replacement in the standard LDA, followed users are sampled without replacement in our model. For example, in a document generative model, a document may contain the word car multiple times. On the other hand, in our edge generative model, a user cannot follow the same user Barack Obama multiple times. As a result, the probabilistic distribution in our model does not follow a multinomial distribution but follows a multivariate hypergeometric distribution. Fortunately, the multinomial distribution can also be used for our model because it is known that a multivariate hypergeometric distribution converges to a multinomial distribution as the sample size grows large [1]. In our case, since sampling is done on millions of nodes, the two distributions become practically indistinguishable.

Also, when we represent E in matrix form by putting F in the rows and G in the columns as E ∈ B^{F×G}, where B = {0, 1}, some differing aspects are noticed:

1. The rows and the columns are from the same entity U, a set of all Twitter users (F, G ⊆ U). In a matrix formed from a document corpus, documents are located in the rows and words are located in the columns.

2. The matrix is very big and sparse. Because users follow each other, the size of the rows and columns is almost equal to that of all Twitter users (|F| ≃ |G| ≃ |U|). The size of the matrix becomes |F| × |G| and most of its values are 0. This aspect is different from a matrix formed from a document corpus, where the size of the columns (words) is orders of magnitude smaller than that of the rows (documents).

3. The matrix is a binary matrix, in which 1 indicates the existence of an edge and 0 indicates no edge. Each element in a matrix formed from a document corpus is the number of occurrences of each word in a document.

4. Some columns in the matrix are much denser than the others. In a matrix formed from a document corpus, these columns correspond to words such as the and is, which we call stop words, and are generally removed. Those columns in our matrix, however, correspond to users such as Barack Obama and Britney Spears, which we can call popular users, and should be taken into account.

Among the four major differences described above, the first and the third do not affect our effort to apply LDA to the follow edge dataset. Only appropriate pre-processing is required. The second limits the size of the analysis but can be solved by adding more resources or by dividing the work into multiple machines [20].

However, the last difference has a significant effect on our analysis because it is related to the quality of our analysis results. In a document corpus analysis, the stop words are generally removed before analysis, since an analysis without removing these stop words produces a very noisy result where the frequent stop words are labeled with every topic [4]. However, in our analysis, popular users are very important to include in the analysis because most users are keenly interested in following famous and influential users whose topic interests are similar to theirs. Unfortunately, it is not sufficient to simply include the popular users for LDA analysis, because the inclusion of popular users produces the same noisy result seen in the text analysis case: when stop words (or popular users) are included, they get included in every topic group, producing a very noisy result.

In the following section, we explore some LDA extensions to deal with the noise generated by popular users. Note that the earlier work on the application of LDA to an authorship graph dataset [9, 26] did not address how we can handle popular users. For example, [26] focused on the issue of how to transform co-authorship relations to a graph, and [9] focused on how to deal with the large number of topic groups that are produced from the nodes whose connectivity is very low.

Before moving to the next section, we summarize the symbols used in this paper in Table 1.

Table 1: Symbols used throughout this paper and their meanings

Symbol     Meaning
u          A Twitter user
U          A set of all u
f          A follower
F          A set of all f
g          A followed user
G          A set of all g
z          A topic (interest)
Z          A set of all z
e(f, g)    A follow edge from f to g
e′(f, g)   A follow edge from g to f
e(f)       A set of all outgoing edges from f
e′(g)      A set of all incoming edges to g
E          A set of all e(f, g) ∀f ∈ F, ∀g ∈ G
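To make the matrix view concrete, the following sketch builds the binary matrix E and the incoming-edge counts |e′(g)| from a made-up toy edge list. A dense array is used only for readability; at the scale described above (millions of mostly-zero entries), a sparse representation would be required in practice.

```python
import numpy as np

# Hypothetical toy edge list of (follower f, followed user g) pairs.
edges = [("alice", "bob"), ("alice", "carol"), ("dave", "bob")]

# Rows and columns index the same entity U (F, G subsets of U), unlike a
# document-word matrix, whose rows and columns are different entities.
users = sorted({u for edge in edges for u in edge})
idx = {u: i for i, u in enumerate(users)}

# Binary matrix: E[f, g] = 1 iff f follows g (an edge exists at most once).
E = np.zeros((len(users), len(users)), dtype=np.int8)
for f, g in edges:
    E[idx[f], idx[g]] = 1

# |e'(g)|: the number of incoming edges of each followed user.
in_degree = E.sum(axis=0)
```

The column sums here directly give the popularity counts that the later sections use to identify "popular users".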
users with correct labels). We first attempt to use the standard LDA with different settings (asymmetric priors). Then we explore the most appropriate LDA approach to this issue (Hierarchical LDA [3]) among a variety of LDA extensions we considered. Finally, we propose two new LDA extensions of our own (two-step labeling and threshold noise filtering). The two new extensions can be combined together for better labeling quality.

4.1 Setting Asymmetric Priors

As mentioned in Section 3.2, LDA constrains the distributions of topics and words with Dirichlet priors α and β, respectively. Though each element of the vectors α and β may take different values in principle, in the standard LDA, each element of α and β is assumed to have the same value (often referred to as the symmetric prior assumption). Intuitively, this assumption implies that every topic and word in a document corpus is equally likely.

Though the former sounds agreeable, the latter sounds unrealistic since it is very well known that the probability distribution of words follows a power-law distribution by Zipf's law. It is also the reason why stop words are removed before applying LDA to a document corpus, since stop words correspond to the head of the power-law distribution.

The most intuitive approach to address this issue would be to set a different prior for each followed user. Between the two priors α and β, we are only interested in β, the prior over the distribution of words given a topic, because a followed user corresponds to a word in the standard LDA. As a higher prior value implies a higher likelihood of being observed in the corpus, we set each prior value proportional to the number of incoming edges of each followed user. This is expected to associate popular users with more accurate labels, as they are given adequate prior values.

We set β_{g_i}, the prior for the followed user g_i in the vector β, as in Equation (3):

β_{g_i} = (0.98 |e′(g_i)| + 0.01 max(|e′(g)|) − 0.99 min(|e′(g)|)) / (max(|e′(g)|) − min(|e′(g)|))    (3)

Note that we set the lowest value for each element in β as 0.01 and the highest value as 0.99 to make the prior values skewed between 0 and 1.

4.2 Hierarchical LDA

Hierarchical LDA (HLDA) is also a good candidate for our problem because it generates a topic hierarchy and more frequent topics are located at higher levels. Figure 2 shows an example of a topic hierarchy and document paths in the topic tree, where z_k denotes a topic and d_i denotes a document.

Figure 2: Topic hierarchy and documents generated in the hierarchy

In HLDA, when a document is generated according to Equation (1), words are chosen from topics in a document path. Since the top level topic is associated with all the documents, common words in every document (i.e., stop words) are expected to be labeled with the top level topic. On the contrary, the bottom level topics are expected to be more specific as they are associated with a small number of documents. For example, if z2 is a topic about network, z4 and z5 would be topics about queueing and routing, respectively. As z1 is usually a topic consisting of the stop words, a document d1 from the document tree path z1-z2-z4 consists of words from topics z1, z2, and z4 and becomes a document about network queueing. Similarly, in our model, z1 is involved in every user's follow edge generation process and is expected to be associated with popular users.

This topic hierarchy is established because HLDA is based on the Nested Chinese Restaurant Process (NCRP), a tree extension to the Chinese Restaurant Process, which probabilistically generates a partition of a set {1, 2, . . . , n} at time n. In NCRP, a document is considered as a Chinese restaurant traveler who visits L restaurants along a restaurant tree path, where L refers to the level of the tree (i.e., the length of the path).

4.3 Two-Step Labeling

In the previous sections, we explored existing LDA approaches to handle the popular user issue. Now we propose a new LDA extension: two-step labeling. We decompose the labeling process into two sub-processes of establishing topics and labeling users with the established topics.

Figure 3: Two-step labeling

In the first topic establishment step, we run LDA after removing popular users from the dataset, similar to how we remove stop words before applying LDA to a document corpus. This step generates clean topics free from the noise generated by popular users. In the second labeling step, we apply LDA only to popular users in the dataset. As we use the collapsed Gibbs sampling algorithm [21], edges to popular users are labeled according to the pre-established topics, as represented in Equation (4):

P(e(f_i, g_j) = z | ·) ∝ [(N_{g_j z} + β) / Σ_{k=1}^{|G|} (N_{g_k z} + β)] × [(N_{f_i z} + α) / Σ_{k=1}^{|Z|} (N_{f_i z_k} + α)],    (4)

where P(e(f, g) = z | ·) denotes the probability of labeling the edge from a follower f to the followed user g with a topic z given all conditions, N_{gz} denotes the number of times g is labeled with z, and N_{fz} denotes the number of times f is labeled with z.

This equation implies that an edge to a followed user is assigned to a topic according to how many times that user has been assigned to the topic, as well as how many times that topic has been previously assigned to the follower following that user. Thus, if some assignments are made in the first step, they affect assignments in the second step. For example, if a user f1 follows non-popular users g1 and g2 and a popular user g3, then g1 and g2 are sampled in the first step and g3 is sampled in the second step with a higher likelihood to be labeled with the topics already assigned to g1 and g2.
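A minimal sketch of the second labeling step under Equation (4): the topic counts from the first (topic establishment) step are held fixed, and an edge to a popular user is labeled by sampling a topic in proportion to the equation's two factors. All counts, sizes, and names below are made-up toy values, not the paper's actual sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy counts carried over from the first step:
num_g, num_topics = 5, 3
N_gz = rng.integers(0, 10, size=(num_g, num_topics)).astype(float)  # N_{g,z}
N_fz = rng.integers(0, 5, size=num_topics).astype(float)  # N_{f_i,z}, one follower
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters, as in Equation (4)

def edge_label_distribution(gj):
    """P(e(f_i, g_j) = z | .) from Equation (4), normalized over topics."""
    term_g = (N_gz[gj] + beta) / (N_gz + beta).sum(axis=0)   # followed-user factor
    term_f = (N_fz + alpha) / (N_fz + alpha).sum()           # follower factor
    p = term_g * term_f
    return p / p.sum()

p = edge_label_distribution(gj=2)
z = rng.choice(num_topics, p=p)   # sample the topic label for this edge
```

Because N_gz is pre-established and not updated here, the popular user's edges can only be drawn toward topics that already exist, which is exactly the point of the two-step decomposition.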
Figure 4: An example of the threshold noise filtering process

As the threshold noise filtering process is done after sampling, it does not increase computational complexity, just like two-step labeling. Figure 4 illustrates this process and shows the top three popular users' distributions over topic groups. (Other normal users also show similar non-linear distributions.) Alternatively, we may filter out less relevant topics by keeping only the top-K topics for each user, for a reasonably small K value. We tested both schemes (threshold noise filtering and top-K filtering), but we could not see any practical differences. Due to lack of space, we only report the results with the former scheme. Though threshold noise filtering can be used with any other approaches, we combine it with the two most representative cases, two-step labeling and the standard LDA, in our experiments.
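The two filtering schemes described above can be sketched as follows. The distribution and the cutoff values are invented for illustration; the paper does not restate its actual parameter choices here.

```python
import numpy as np

# Hypothetical post-sampling distribution of one popular user over topic groups.
topic_probs = np.array([0.42, 0.31, 0.12, 0.08, 0.04, 0.03])

# Threshold noise filtering: keep the user only in topic groups whose
# probability exceeds a cutoff; the low-probability tail is treated as noise.
threshold = 0.10
kept_by_threshold = np.flatnonzero(topic_probs > threshold)   # topics 0, 1, 2

# Top-K filtering: keep only the K most probable topic groups.
K = 2
kept_by_topk = np.argsort(topic_probs)[::-1][:K]              # topics 0, 1
```

Both variants prune the long tail of weak topic assignments; they differ only in whether the cut is by probability mass or by rank.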
Table 3: Experimental cases and descriptions
Case Experiment Description
base LDA over the whole group dataset
non-popular LDA over the normal user group dataset
beta LDA with asymmetric β priors
hlda-2lv HLDA with 2 levels
hlda-3lv HLDA with 3 levels
2step Two-step labeling
filter-base Threshold noise filtering after base
filter-2step Threshold noise filtering after 2step
Figure 7: Quality comparison
Figure 8: False positive rate of popular users

to the standard LDA model.² That is, our extensions to LDA do not introduce noticeable Perplexity degradation to the standard LDA model.

² In fact, the Perplexity values of our two approaches are lower than the standard LDA in our experiments, but we expect that this is simple experimental fluctuation that does not have a significant meaning.

5.3 Quality

Ultimately, the effectiveness of various approaches should be determined by how users perceive the quality of the identified topic groups from each approach. To measure human-perceived quality, we conducted a survey with a total of 14 participants. The participants in our survey were presented with a random group of ten Twitter users (identified by one of the eight approaches described in Section 5.1) and were asked to indicate if each user in the topic group was relevant to the group or not. Overall, we collected 23 judged topic groups per approach with a total of 161 judged topic groups.

Given the survey results, we computed the human-perceived Quality value of the topic groups Z identified by an approach as:

Quality(Z) = Σ_{k=1}^{|Z|} Σ_{i=1}^{|z_k|} δ(g_i, z_k) log(|e′(g_i)|),

where the δ function is defined as:

δ(g_i, z_k) = 1 if user g_i is related to topic group z_k, and −1 if user g_i is not related to topic group z_k.

Note that in the above Quality formula, the factor log(|e′(g_i)|) is added to assign higher weights to more popular users, because most people are interested in following popular users and pay more attention to the correct topic clustering of those users.

Figure 7 reports the results from this survey, where each bar shows the Quality value of one of the eight approaches. From this graph, we observe the following:

1. Both of our extensions, two-step labeling and threshold noise filtering, are very effective in improving the human-perceived quality of identified topic groups. For example, when compared to base, filter-2step achieves a 1.64 times higher Quality value.

2. As two-step labeling initially forms clean topics free from popular-user-generated noise, its gain is more significant than that of threshold noise filtering. However, threshold noise filtering can be easily combined with any other approaches to improve Quality, as in filter-base.

3. If we apply the standard LDA to popular users as well, without any pre-filtering, the produced topic groups are quite noisy. For example, the Quality value of base is 1.34 times lower than non-popular, which filters out all popular users first.

4. Using an asymmetric Dirichlet prior for the hyperparameter β improves Quality, but the result is still noisier compared to our extensions.

5. The HLDA model shows only marginal improvements in the Quality metric when compared to base. This low performance is because most of the low-level groups in HLDA have users with very few followers (who are less likely to be interesting when recommended) and contribute only small weights to Quality.

In Figure 8, we also report the false positive rates for popular users, which measure incorrect labeling of popular users. That is, whenever a user in the popular user group is placed in a topic group from our analysis, our survey participants evaluated whether the topic group is a good fit for that user. Figure 8 reports what fraction of them were considered to be a "bad fit" or an "incorrect" association. Different from Figure 7, the HLDA models seemingly perform as well as or even better than our two approaches. This is because HLDA tends to place popular users at top-level groups. When our survey participants looked at such a group, they simply considered it as a "popular user group" and determined that the popular users in the group were a good fit, even though the group itself did not exhibit any coherent topic. Therefore, even though the false positive rate for HLDA is low, it does not necessarily improve the topic-based recommendation accuracy for popular users. We can observe that our 2step and filter-2step show comparably low false positive rates, similar to those of the HLDA approaches.

5.4 Qualitative Comparison

In this section, we examine the topic groups generated from each approach more closely to get a better sense of the quality of the produced topic groups.

Figure 9: Topic group cycle from base showing the popular user issue
Figure 10: Topic group mobile gadget blog from non-popular
Figure 11: Topic group from 2step corresponding to Figure 10
Figure 12: Topic group from filter-2step corresponding to Figure 10

Figure 9 shows the top ten users in a topic group from base, which we name the cycle group. Together with their Twitter usernames, we show their follower counts |e′(g)| and the bio of each user. By going over the users' bios, we can see that many of the users in the group share the same interest, cycle. However, we also observe that among the three users with the highest number of incoming edges, barackobama, stephenfry, and lancearmstrong, only lancearmstrong is related to cycle. The other two users, barackobama and stephenfry, are included in this group simply because they are famous and followed by many users. In particular, we note that barackobama appears in 15 groups out of the 100 topic groups produced by base. Among the 15 groups, only one is related to his specialty, politics, which clearly shows that the standard LDA suffers from the noise from popular users if applied directly to our social graph dataset.

Figure 10 shows an example topic group produced by non-popular. Note that in this group, all users have |e′(g)| values smaller than 100, because popular users are removed from the dataset. Therefore, none of the popular users will belong to a topic group under this approach. When we go over the users' bios, we see that all users in this group are somewhat related to mobile gadgets. That is, the topic group produced by non-popular is significantly cleaner (less noisy) than that from base, at the expense of not being able to group any popular users.

Figure 11 shows an example topic group from 2step. Note that many users in this group are very popular and they are mainly about the same topic, tech media, indicating that 2step is able to group popular users into the right topic group in general. However, we observe that 2step still suffers from the presence of a few popular, yet less relevant, users in the group (in this example, cnnbrk and breakingnews may be considered less relevant to the group than others).

With threshold noise filtering, we achieved less noisy results. Figure 12 shows a resulting topic group from filter-2step corresponding to the group in Figure 11 from 2step. We observe that cnnbrk and breakingnews, popular users on general media, are now removed from Figure 11, and more technology-centric media such as firefox, youtube, and engadget are added. Note that firefox and youtube are Twitter accounts publishing news related to Firefox and YouTube. As we pick only a few most probable topics for popular users in filter-2step, they have less chance to appear in less-relevant topic groups.

6. CONCLUSION

In this paper, we applied LDA to analyze the relationship graph in a large social network. Different from the usual approaches that extract topics from textual data, such as bios and tweets, our approaches rely purely on the social-network graph consisting of follow edges. Even with this relatively limited type of data compared to those of previous approaches, our approaches generated very well-organized results and showed the great potential of applying LDA to these kinds of clustering applications. Our approaches are especially useful when only linkage data is available.

We also dealt with popular-user-generated noise, which is inevitable in a large social network. Unlike the stop words in standard LDA applications, popular users have important meanings and should be dealt with carefully. We explored four extensions to the standard LDA and quantitatively and qualitatively analyzed their results using a Twitter dataset. Our proposed approaches, two-step labeling and threshold noise filtering, are very effective in handling this popular user issue, showing a 1.64 times improvement in the quality of the produced topic groups, compared to the standard LDA model.

7. ACKNOWLEDGEMENTS

We would like to thank Jong Han Park and Christopher Moghbel for their help and feedback throughout this research.

8. REFERENCES

[1] Multinomial distribution. http://en.wikipedia.org/wiki/Multinomial_distribution.
[2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008.
[3] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems, 2003.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] K. R. Canini, L. Shi, and T. L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Artificial Intelligence and Statistics, 2009.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.
[7] M. Girolami and A. Kaban. On an equivalence between PLSI and LDA. In SIGIR, 2003.
[8] M. Harvey, I. Ruthven, and M. J. Carman. Improving social bookmark search using personalised latent variable language models. In WSDM, 2011.
[9] K. Henderson and T. Eliassi-Rad. Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM Symposium on Applied Computing, 2009.
[10] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[11] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999.
[12] T. Iwata, T. Yamada, and N. Ueda. Modeling social annotation data with content relevance using a topic model. In NIPS, 2009.
[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
[14] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 591–600, New York, NY, USA, 2010. ACM.
[15] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, pages 129–137, 1982.
[16] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249–272, 2007.
[17] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 101–110, New York, NY, USA, 2008. ACM.
[18] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: Inferring user profiles in online social networks. In WSDM, 2010.
[19] N. Pathak, C. DeLong, A. Banerjee, and K. Erickson. Social topic models for community extraction. In The 2nd SNA-KDD Workshop, 2008.
[20] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010.
[21] M. Steyvers and T. L. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 2007.
[22] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In SIGKDD, 2004.
[23] V. Tuulos and H. Tirri. Combining topic models and social networks for chat data mining. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[24] X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations and text. In Proceedings of the 3rd International Workshop on Link Discovery, pages 28–35. ACM, 2005.
[25] M. J. Welch, U. Schonfeld, D. He, and J. Cho. Topical semantics of Twitter links. In WSDM, 2011.
[26] H. Zhang, B. Qiu, C. L. Giles, H. C. Foley, and J. Yen. An LDA-based community structure discovery approach for large-scale social networks. In IEEE International Conference on Intelligence and Security Informatics, pages 200–207, 2007.
[27] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In World Wide Web Conference, 2006.