0% found this document useful (0 votes)
19 views17 pages

PS 07749

Uploaded by

Amit Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views17 pages

PS 07749

Uploaded by

Amit Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Overview

An overview of social network


analysis
Márcia Oliveira and João Gama∗

Data mining is being increasingly applied to social networks. Two relevant rea-
sons are the growing availability of large volumes of relational data, boosted by
the proliferation of social media web sites, and the intuition that an individual’s
connections can yield richer information than his/her isolate attributes. This syn-
ergistic combination can show to be germane to a variety of applications such as
churn prediction, fraud detection and marketing campaigns. This paper attempts
to provide a general and succinct overview of the essentials of social network
analysis for those interested in taking a first look at this area and oriented to use
data mining in social networks. C 2012 Wiley Periodicals, Inc.

How to cite this article:


WIREs Data Mining Knowl Discov 2012, 2: 99–115 doi: 10.1002/widm.1048

INTRODUCTION bers of a group were depicted using the so-called so-


ciograms, which can be defined as charts where in-
T he world is a complex system of interconnected
parts. Each part itself constitutes a smaller sys-
tem whose networked structure can be, most of the
dividuals are represented as nodes and the relations
among them are represented by lines. Such diagrams
times, analyzed through the lens of social network revealed to be very useful in uncovering the hidden
analysis (SNA). structures of groups, by means of the identification
SNA is an interdisciplinary methodology re- of, for instance, stars, alliances, and subgroups.
search area with contributions from Sociology, So- In a broader sense, a social network is con-
cial Psychology, Anthropology, Physics, Mathemat- structed from relational data and can be defined as
ics, Computer Science, among others, being a rich a set of social entities, such as people, groups, and
scientific field that has significantly benefited from organizations, with some pattern of relationships or
the collaborative efforts of researchers from different interactions between them. These networks are usu-
scientific areas. Because networks were studied inde- ally modeled by graphs, where vertices represent the
pendently by distinct disciplines, for a considerable social entities and edges represent the ties established
amount of time, each one developed its own jargon. between them. The underlying structure of such net-
To avoid ambiguity and clarify the adopted language, works is the object of study of SNA. SNA methods
in Table 1 we present the network terminology used and techniques were thus designed to discover pat-
in different fields. Throughout this document, we will terns of interaction between social actors in social
use these terms interchangeably. networks.
The origins of SNA, as a basis for developing Hence, the focus of SNA is on the relationships
useful sociological concepts, can be traced back to the established between social entities rather in the social
early 1930s, when Moreno1 developed the sociomet- entities themselves. In fact, the main goal of this tech-
ric approach as a way to conceptualize the structure nique is to examine both the contents and patterns
of the social relations established among small groups of relationships in social networks to understand the
of individuals. These interpersonal ties between mem- relations among actors and the implications of these
relationships.

Common tasks of SNA involve the identification
Correspondence to: [email protected]
of the most influential, prestigious, or central actors,
Faculty of Economics, University of Porto, Porto, Portugal; The
Laboratory of Artificial Intelligence and Decision Support, Insti-
using statistical measures; the identification of hubs
tute for Systems and Computer Engineering of Porto, University of and authorities, using link analysis algorithms, and
Porto, Porto, Portugal. the discovery of communities, using community de-
DOI: 10.1002/widm.1048 tection techniques. These tasks are extremely useful

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 99
Overview wires.wiley.com/widm

T A B L E 1 Network Terminology for Different Fields of Knowl- In turn, information networks are based upon
edge the exchange of information among entities usually
aiming to enhance knowledge diffusion, business, or
Mathematics Computer Science Sociology Physics Others social aims. Examples include networks of citations
between academic papers, commonly represented by
Vertex/vertices Node Actor/agent Site Dot
an acyclic-directed graph where vertices represent pa-
Edge Link/connection Relational tie Bond Arc
pers and there is a direct edge if paper A cites paper
B; and preference networks, which are generally mod-
in the process of extracting knowledge from networks eled through bipartite graphs and represent individu-
and, consequently, in the process of problem solving. als’ consumption preferences for a given commercial
Because of the appealing nature of such tasks and to product13 (e.g., books). Another important example
the high potential opened by this kind of analyses, of an information network is the World Wide Web,
SNA has become a popular approach in a myriad of which can be represented as a directed graph, in which
fields, from Biology to Business. For instance, some vertices represent static Web pages and edges corre-
companies use SNA to maximize positive word of spond to the hyperlinks between them.14
mouth of their products by targeting the customers Technological networks are man-made net-
with higher network value (those with higher influ- works designed for distribution of some commodity
ence and support).2–4 Other companies, such as the or resource (e.g., electricity, information). Some ex-
ones operating in the sector of mobile telecommu- amples are networks of roads and railways, networks
nications, apply SNA techniques to the phone call of airline routes, and networks of physical connec-
networks and use them to identify customer’s pro- tions between computers (Internet).
files and to recommend personalized mobile phone The last type of networks are the so-called bio-
tariffs, according to these profiles. These companies logical networks15 and, as the name implies, are those
also use SNA for churn prediction, i.e., to detect cus- that arise from biological processes, such as networks
tomers who may potentially switch to another mo- of chemical reactions among metabolites, protein in-
bile operator by detecting changes in the patterns teraction networks, genetic regulatory networks, real
of phone contacts.5,6 Another interesting application neural networks, and food webs or predator–prey net-
emerges from the domain of fraud detection. For in- works.
stance, SNA can be applied to networks of organiza- Despite the fact the origins of network stud-
tional communications (e.g., Enron company dataset) ies go back a few centuries ago, in recent years we
to perform an analysis of the frequency and direc- witnessed an impressive advance in network-related
tion of formal/informal email communication, which fields, mainly because of the growing interest in so-
can reveal communication patterns among employees cial networks, which became a ‘hot’ topic and a focus
and managers. These patterns can help identify peo- of considerable attention. For this reason, a lot of
ple engaged in fraudulent activities, thus promoting students, practitioners and researchers are willing to
the adoption of more efficient forms of acting toward enter the field and explore, even superficially, the po-
the eradication of crime.7,8 tential of SNA techniques for the study of their prob-
Besides social networks, there are other types lems. Bearing this in mind, in this paper our aim is
of real-world structures that can be represented by to provide a general and succinct overview of the es-
networks. According to Newman,9 real-world net- sentials of SNA for those interested in knowing more
works can be categorized into four main types: social about the area and strongly oriented to use SNA in
networks, information networks (or knowledge net- practical problems.
works), technological networks, and biological net- The remainder of this document is organized
works. as follows. We begin by pointing out some types of
As previously mentioned, social networks are representations that can be used to model social net-
the ones that arise as a result of human and so- works. Then, we introduce the best known statistical
cial interactions and encompass studies of friend- measures to analyze them, according to two levels of
ship networks,10 informal communication networks analysis: the actor level and the network level. Af-
within companies,11 collaboration networks12 (e.g., terward, we talk a little about the link analysis task
networks of coappearance in movies by actors, in and explain how it can be used to identify influential
which two actors are connected if they appeared and authoritative nodes. Then, we distinguish two
together in a movie, and networks of coauthorship important network models and introduce the main
among academics, in which individuals are linked if properties of real-world networks. Later, we devote
they coauthored one or more papers), among others. a section to the problem of finding communities in

100 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

networks. After introducing the main concepts, we ces (contains both adjacency and degree information),
provide a list of the most popular SNA software and and distance matrices (identical to the adjacency ma-
tools for those readers interested in applying network trices with the difference that the entries of the matrix
analysis in professional or academic problems. This are the lengths of the shortest paths between pairs of
overview ends with the identification of the current vertices) are appropriate to represent full matrices.
trends arising in the field of SNA. Several types of graphs can be used to model dif-
ferent kinds of social networks. For instance, graphs
can be classified according to the direction of their
links. This leads us to the differentiation between
REPRESENTATION OF SOCIAL
undirected and directed graphs. Undirected graphs (or
NETWORKS undirected networks) are graphs whose edges connect
A social network consists of a finite set of actors and unordered pairs of vertices or, in other words, each
the relations, or ties, defined on them.16 The estab- edge of the graph connects concomitantly two ver-
lished relationships can be of personal, or profes- tices. A more strict type of graph is the so-called di-
sional, nature and they can range from casual ac- rected graph (or directed network). Directed graphs,
quaintance to close familiar bonds. Besides social or in the abbreviation form digraphs, can be straight-
relations, links can also represent flow of informa- forward defined as graphs whose all edges have an
tion/goods/money, interactions, similarities, among orientation assigned (also called arcs), so the order
others. The structure of such networks is usually rep- of the vertices they link matters. Formally, a directed
resented by graphs. Therefore, networks are often re- graph D is an ordered pair (V(D), A(D)) consisting
garded as equivalent to graphs. of a nonempty set V(D) of vertices and a set A(D),
A graph is composed of two fundamental com- disjoint from V(D), of arcs. If e12 is an arc and ν 1 and
ponents vertices and edges. Every edge is defined ν 2 are vertices such that e12 = (ν 1 , ν 2 ), then e12 is said
by a pair of vertices, also called its endpoints. Ver- to join ν 1 to ν 2 , being the first vertex ν 1 called ini-
tices represent a wide variety of individual entities tial vertex, or tail, and the second vertex ν 2 called the
(e.g., people, organizations, countries, papers, prod- terminal vertex, or simply head. Graphically, directed
ucts, plants, and animals) according to the applica- edges are depicted by arrows, indicating the direction
tion field. In turn, an edge is the line that connects of the linkage. This type of graphs can be either cyclic,
two vertices and, analogously, it can represent nu- i.e., graphs containing closed loops of edges or ‘ring’
merous kinds of relationships between individual en- structures, or acyclic (e.g., trees). A typical example
tities (e.g., communication, cooperation, friendship, of an undirected graph is FacebookTM because, in this
kinship, acquaintances, and trade). Edges may be di- social network, the established friendship tie is mu-
rected or undirected, depending if the nature of the tual or reciprocal (e.g., if I accept a friend request from
relation is asymmetric or symmetric. a given person then it is implicitly assumed that me
Formally, a graph G consists of a nonempty set and that person are friends of each other). Likewise,
V(G) of vertices and a set E(G) of edges, being defined TwitterTM is an example of a directed graph because
as G = (V(G), E(G)). According to Diestel,17 the order a person can be followed by others without necessar-
of a graph G is given by the total number of vertices ily following them. In this case, the tie between a pair
n or, mathematically, |V(G)| = n. Analogously, the of individuals is directed, with the tail being the fol-
size of a graph G is the total number of edges |E(G)| lower and the head being the followed, meaning that
= m. The maximum number of edges in a graph is a one-way relationship is established.
mmax = n(n−1)2
, for undirected graphs, and mmax = n(n Regarding the values assigned to edges, we can
− 1), for directed ones. make a distinction between unweighted and weighted
In the literature, two main types of graph- graphs. Unless it is explicitly said, we always assume
theoretic data structures are referred to represent that graphs are unweighted. Unweighted graphs are
graphs: the first one are list structures and the sec- binary since edges are either present or absent. On
ond are matrix structures. These structures are ap- the other hand, weighted graphs are richer graphs
propriate to store graphs to further analyze them because each edge has associated a weight w ∈ R+ 0
using automatic tools. List structures, such as inci- providing the user with more information about, for
dence lists and adjacency lists, are suitable for storing instance, the strength of the connection of the pair of
sparse graphs because they reduce the required stor- vertices it joins. According to Mark Granovetter,18,19
age space. On the other hand, matrix structures such in social networks the weight of a tie is generally a
as incidence matrices (A↓ (n × m)), adjacency ma- function of duration, emotional intensity, frequency
trices or sociomatrices (A↓ (n × n)), Laplacian matri- of interaction, intimacy, and exchange of services.

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 101
Overview wires.wiley.com/widm

network. The latter provides more compact informa-


tion and allows the assessment of the overall structure
of the network, giving insights about important prop-
erties of the underlying social phenomena.

Actor-Level Statistical Measures


F I G U R E 1 | A directed graph D represented by means of an Centrality, or prestige, is a general measure of how
adjacency matrix (left-hand side of the figure) and an adjacency list the position of an actor is within the overall struc-
(right-hand side of the figure).
ture of the social network and can be computed re-
sorting to several metrics. The most widely used are
Therefore, strong ties usually represent close friends, degree, betweenness, closeness, and eigenvector cen-
and weak ties represent acquaintances. In other kinds trality. The first three were proposed by Freeman20
of networks, the weight of a tie can represent a variety and were only designed for unweighted networks. Re-
of things, depending on the context; for instance, a tie cently, Brin and Page21 came up with extensions to
can represent the number of seats among airports, the weighted networks. The fourth metric—eigenvector
number of exchanged products, and so on. centrality—was later proposed by Bonacich22 and has
For undirected and unweighted graphs, adja- its foundations on spectral graph theory. It became es-
cency matrices are binary (as a consequence of being pecially popular after being used as the basis of the
unweighted) and symmetric (as a consequence of be- well-known Google’s Pagerank algorithm, which we
ing undirected, meaning that aij = aji ), with aij = 1 will talk about in the next Section.
representing the presence of an edge between vertices Although more actor-level statistical measures
i and j, and aij = 0 representing the absence of an edge were proposed in the literature, in this subsection we
between vertex pair (i,j). For directed and weighted will focus on explaining the mentioned measures of
graphs, the entries of such matrices take values from centrality. These measures determine the relative im-
interval [0, max(w)] and are nonsymmetric. In both portance of an actor within the network, showing
cases, we deal with nonnegative matrices. how the relationships are concentrated in a few in-
In Figure 1, we provide an example of how a dividuals and, therefore, giving an idea about their
graph can be represented by an edge list and by an social power. Higher centrality measures are associ-
adjacency matrix. ated to powerful actors in the network because their
central position offers them several advantages, such
as easier and quicker access to other actors in the
network (useful for accessing resources such as infor-
ELEMENTARY STATISTICAL mation) and ability of exerting control over the flow
MEASURES between the other actors.20 These central actors are
Mathematics is used to represent networks, while also called ‘focal points’. At the end of the section
Statistics is mainly used to analyze them. In this sec- we will also introduce the concept of transitivity and
tion, we present some graph measures and popular explain how it can be computed using a clustering
metrics used in the analysis of social networks that coefficient.
arose from the field of Statistics. These measures are The reader must take into account that some of
useful in the sense that they provide us insights about these actor-level metrics (e.g., degree, betweenness,
the structure of the network without the need to know and closeness) may need to be normalized to perform
its graphical representation. Studying the structure of comparisons of networks with different orders and
these networks aims at understanding the behavior sizes.
of the social systems that generated those networks,
which is normally the final goal of such analysis. Degree or Valency
The measures we will introduce in the following The degree, or valency, of a node ν, usually denoted
subsections can be divided according to the level of as kν , is a measure of the immediate adjacency and
analysis one wants to perform: at the level of small the involvement of the node in the network and is
units, such as actors, or at the level of the whole net- computed as the number of edges incident on a given
work. The former explores general measures of cen- node or, similarly, as the number of neighbors of node
trality as a way to understand how the position of ν. The neighborhood Nν is thus defined by the set of
a vertex is within the overall structure of the graph nodes that are directly connected to ν. Degree can be
and, therefore, helps identify the key players in the computed in, at least, two different ways: based on

102 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

the adjacency matrix or based on the neighborhood least asymptotically. This means that, in these net-
of a node. In Eqs (1) and (2), we present each one works, the distribution of the vertex degree is very
of the alternatives, for undirected networks. Despite heterogeneous and highly right skewed, with a large
its simplicity, degree is an effective measure to assess majority of vertices having a low degree and a small
the importance and influence of an actor in a social number having a high degree. These networks are
network. Yet, it has some limitations. The main one known as scale free, an expression coined by the same
is that it does not take into consideration the global researchers. Other common functional forms are ex-
structure of the network: ponential (e.g., railways and power grids networks)
n and power laws with exponential cutoffs (e.g., net-
ki = aij , 0 < ki < n, (1) works of movie actors and some collaboration net-
j=1
works).
where aij is the entry of the ith row and jth column of
the adjacency matrix A;
Betweenness
kν = |Nν |, 0 < kν < n, (2) Node betweenness bν measures the extent to which a
node lies between other nodes in the network and can
where |Nν | is the neighborhood of node ν
be computed using the formula presented in Eq. (6).
For directed networks, there are two variants
Nodes with high betweenness occupy critical roles
of degree centrality: in degree, denoted by kν+ , and
in the network structure, since they usually have a
out degree, denoted by kν− . The former is given by
network position that allow them to work as an in-
the number of incoming nodes (i.e., number of edges
terface between tightly knit groups, being ‘vital’ el-
beginning at vertex ν) and the latter by the number
ements in the connection between different regions
of outgoing nodes (i.e., number of edges ending at
of the network. In the social networks perspective
vertex ν), as defined in Eqs (3) and (4). The measure
‘interactions between two nonadjacent actors might
of degree in directed networks is also referred to as
depend on other actors in the set of actors, especially
prestige. This expression is especially used in the liter-
the actors who lies on the paths between the two’,16
ature of social networks because it was developed for
which stresses out the importance of a good value of
measuring the prominence or importance of actors in
betweenness. These actors are also called gatekeepers
the network. There are two types of prestige: support
because they tend to control the flow of information
and influence. The first is related to the in-degree cen-
between communities.
trality, which is seen as a measure of support, and the
second is related to the out-degree centrality, which  σst (ν)
bν = , (6)
is seen as a measure of influence: σst
s,t ∈ V(G)\ν

n
ki+ = aji , (3) where σ st denotes the number of shortest paths be-
j=1 tween vertices s and t (usually σ st = 1) and σ st (ν) ex-
presses the number of shortest paths passing through

n node ν.
ki− = aij . (4) This quantity can also be computed for edges.
j=1 The betweenness of an edge be is commonly defined as
the number of shortest paths between nodes that run
On weighted networks, strength is the equiv- along a given edge of the network. It is quite useful
alent of degree, being computed as the sum of the in SNA since it allows discovering bridges and local
weights of the edges adjacent to a given node, as ex- bridges which are, by definition, edges with high be-
pressed by Eq. (5): tweenness. In the context of SNA, bridges are connec-

n tions outside an individual’s circle of acquaintances.
kiw = aijw . (5) These connections are of great interest for individu-
j=1 als seeking to access new information and resources,
since they ease the diffusion of information across
A significant research effort was undertaken entire communities.25 However, situations like these
in studying the degree distribution of several types are quite rare in real-world scenarios and, even if
of networks, which turned it possible to classify a they happen, the advantages they confer are usually
network based on this distribution. For instance, temporary, due to the temporal instability of such
Barabási and coworkers23,24 discovered that most edges. A more common and realistic situation is local
real networks follow a power-law distribution, at bridges. Equation (7) indicates how this measure can

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 103
Overview wires.wiley.com/widm

be computed: Local Clustering Coefficient


 σuν (e) Social networks are naturally transitive, which means
be = , (7) that a given actor’s friends are also likely to be friends.
σuν
u,ν∈V(G) This property of transitivity is quantified by a cluster-
where σ uν (e) expresses the number of shortest paths ing coefficient that can be global, i.e., computed for
passing through edge e. The sum indicates that this the whole network, or local, i.e., computed for each
fraction needs to be computed for every pair of nodes node. Watts and Strogatz26 proposed a local version
u and ν in the network. of the clustering coefficient, denoted ci (i = 1,. . ., n).
In this context, transitivity is a local property of a
Closeness node’s neighborhood that indicates the level of cohe-
Closeness is a rough measure of the overall position sion between the neighbors of a node. This coefficient
of an actor in the network, giving an idea about how is, therefore, given by the fraction of pairs of nodes,
long it will take to reach other nodes from a given which are neighbors of a given node ν that are con-
starting node. Formally, it is the mean length of all nected to each other by edges [see Eq. (10)]:
shortest paths from one node to all other nodes in the 2|ejk |
network. Because of its definition, usually this mea- ci = : ν j , νk ∈ Ni , ejk ∈ E, (10)
ki (ki − 1)
sure is only computed for nodes within the largest
component of the network, using the formula pre- where Ni is the neighborhood of node ν i , ejk repre-
sented in Eq. (8). In the social networks context, close- sents the edge that connects node ν j to node ν k , ki is
ness is a measure of reachability that measures how the degree of node ν i , and |ejk | indicates the proportion
fast a given actor can reach everyone in the network: of links between the nodes within the neighborhood
of node ν i .
n−1
Clν = . (8)
u ∈ V(G)\ν d(u, ν)
Network-Level Statistical Measures
Before explaining each one of the network-level sta-
Eigenvector Centrality tistical measures, there are three fundamental con-
This metric is based on the assignment of a relative cepts that should be first introduced: path, geodesic
score to each node and measures how well a given ac- distance between two nodes, and eccentricity of a
tor is connected to other well-connected actors. This vertex.
score is given by the first eigenvector of the adjacency A path is a sequence of nodes in which con-
matrix. The basic idea behind eigenvector centrality is secutive pairs of nonrepeating nodes are linked by an
that the power and status of an actor is recursively de- edge; the first vertex of a path is called the start vertex
fined by the power and status of his/her alters. Alters and the last vertex of the path is called the end ver-
is a term frequently used in the egocentric approach tex. Of particular interest is the concept of geodesic
of social networks analysis, and it refers to the actors distance, or shortest path, between nodes i and j, de-
that are directly connected to a specific actor, called noted as d(i,j). The geodesic distance can be defined as
ego. In other words, we can say that the centrality the length of the shortest path, or the minimal path,
of a given node i is proportional to the sum of the between nodes i and j.
centralities of i’s neighbors. This is the assumption In turn, the eccentricity is the greatest geodesic
behind the eigenvector centrality formula, which is as distance between a given vertex ν and any other in
follows: the graph, as defined in Eq. (11). These three concepts
1
n are formed on the basis of most of the network-level
xi aij x j , (9) metrics we are going to introduce, namely, the diame-
λ
j=1 ter/radius, the average geodesic distance, the average
where xi /xj denotes the centrality of node i/j, aij rep- degree, the reciprocity, the density, and the global
resents an entry of the adjacency matrix A (aij = 1 if clustering coefficient.
nodes i and j are connected by an edge and aij = 0 ∈ν = max d(ν, i). (11)
otherwise) and λ denotes the largest eigenvalue of A. i ∈ V(G)\ν
Eigenvector centrality is a more elaborated ver-
sion of the degree, once it assumes that not all con-
nections have the same importance by taking into ac- Diameter and Radius
count not only the quantity, but especially the quality The diameter D is given by the maximum eccentricity
of these connections. of the set of vertices in the network and, analogously,

104 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

the radius R can be defined as the minimum eccen- Reciprocity


tricity of the set of vertices, as defined in Eqs (12) and Reciprocity r is a specific quantity for directed net-
(13). Sparser networks have generally greater diam- works that measures the tendency of pairs of nodes to
eter than full matrices, due to the existence of fewer form mutual connections between each other. There
paths between pairs of nodes. Leskovec et al.27 dis- are several ways to compute this metric. The most
covered that, for certain types of real-world networks, popular and intuitive way is to compute the ratio of
the effective diameter shrinks over time, contradicting the number of mutual connections in the network to
the conventional wisdom of increasing diameters. In the number of all connections, as shown in Eq. (17).
the context of SNA, this metric gives an idea about the Adopting this definition, the value of reciprocity rep-
proximity of pairs of actors in the network, indicating resents the probability that two nodes in a directed
how far two nodes are, in the worst of cases: network point to each other. By definition, in an undi-
rected network, reciprocity is always maximum r = 1
D = max{∈ν : ν ∈ V}, (12)
because all pairs of nodes are symmetric:
#mut
r= , 0 < r < 1, (17)
R = min{∈ν : ν ∈ V}. (13) #mut + #asym
where #mut denotes the number of mutual dyads and
#asym the number of asymmetric dyads.
Average Geodesic Distance Taking the definitions of Wasserman and
The average geodesic distance for all combinations of Faust,16 we say that an asymmetric dyad is a pair
vertex pairs in a network is usually denoted by l and of nodes that has an arc going in the direction of one
is given by Eq. (14). This metric gives an idea of how node or the other, but not both directions. In turn, a
far apart nodes will be, on average. For instance, in mutual dyad is defined by a pair of nodes connected
the SNA context the average geodesic distance can be by two arcs, each one going in a different direction
used to measure the efficiency of the information flow (e.g., a → b and b → a, being a and b two nodes in a
within the network: network).
1 
l= 1 d(i, j), (14) Density
n(n − 1) i≥ j
2
Density ρ is an important network-level measure,
where d(i, j) is the geodesic distance between nodes i which is able to explain the general level of con-
and j, and 1/2n(n–1) is the number of possible edges nectedness in a network. It is given by the propor-
in a network comprising n nodes. tion of edges in the network relative to the max-
When there is the case of a network having more imum possible number of edges, as defined in Eq.
than one connected component, the previous formula (18). Density is a quantity that goes from a minimum
does not hold, because the geodesic distance is con- of 0, when a network has no edges at all, to a max-
ventionally defined as infinite when there is no path imum of 1, when the network is perfectly connected
connecting two vertices. In such situations, it is more (also called complete graph or clique). Therefore,
appropriate to use the harmonic average geodesic dis- high values of ρ are associated to dense networks,
tance, defined in Eq. (15), once it turns infinite dis- and low values of density are associated to sparse
tances into zero nullifying their effect on the sum: networks:
1  1 m(G)
l −1 = 1 . (15) ρ(G) = , 0 < ρ < 1, (18)
2
n(n + 1) i≥ j d(i, j) mmax(G)
where m is the number of edges in the network and
mmax(G) denotes the number of possible edges, which
Average Degree
The average degree is simply the mean of the degrees is n(n−1)
2
for undirected networks and n(n–1) for di-
of all vertices in a network, as represented in Eq. rected ones.
(16). According to Costa et al.28 the average degree
can be used to measure the global connectivity of a Global Clustering Coefficient
network: There are several ways to compute the global ver-
sion of the clustering coefficient. We adopt the one
1
n
k̄ = ki . (16) proposed by Watts and Strogatz26 that obtains the
n global clustering coefficient c, for the whole network,
i=1
through the computation of the average of all local

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 105
Overview wires.wiley.com/widm

values ci (i = 1,. . ., n), as shown in Eq. (19). more popular that web page is). Nevertheless, in this
Small-world networks,26 such as the ones we find in process of weighting web pages, not only the num-
real social contexts, are characterized by high global ber of links or, equivalently, the degree of a node is
clustering coefficients, meaning that the property of relevant, but also the importance of the web pages
transitivity among nodes emerges more often and in linking to them. Therefore, Pagerank measures the
a stronger way, increasing the probability of clique relative importance of a set of web pages based not
formation: only on the quantity but especially the quality of their
links.
1
c= ci . (19) The basic Pagerank is computed as follows (ac-
n i cording to the definition provided by Easley and
Kleinberg31 ):
Initialization: In a network of n nodes (or web
LINK ANALYSIS pages), assign a Pagerank value of 1/n to each node,
and choose the number of iterations k of the algo-
In certain network settings, such as the Web, one may
rithm.
be interested in finding the most valuable, authorita-
tive or influential node (e.g., web page), or a list of
1. Update the Pagerank values of each node by
them. To perform this task, several link analysis algo-
sequentially applying the following rule: Ba-
rithms were devised, being the HITS29 and the Google
sic Pagerank update rule: divide the actual
Pagerank30 algorithms the most popular ones. These
Pagerank value of node p by the number of
algorithms explore the relationship between links and
its outgoing links and pass these equal shares
the content of web pages, to improve the task of in-
to the nodes it points to. Note that if a node
formation retrieval in the Web, being of extreme im-
p has no outgoing links, the Pagerank share
portance for the design of efficient search engines. As
is passed to itself. The update of a node’s
the development of these methods was motivated by
Pagerank value is performed by summing the
the problem of web queries, for the sake of simplicity,
shares it receives in each iteration.
we will explain them in this context.
Before introducing any of these algorithms, it 2. Apply this rule until the kth iteration, or until
is first necessary to define some elementary concepts, convergence has occurred.
namely, the concepts of hubs and authorities. In the
context of Web, a hub can be understood as a web To illustrate, consider the following example:
page that points to many other web pages or, in in a network comprised six nodes termed A, B, C,
other words, as a compilation of web pages that ad- D, E, and F. How can we find the most influential
dress a specific topic. The quality of a hub is usu- node, using the Pagerank algorithm? First, according
ally determined by the quality of the authorities it to the initialization step of the algorithm, each node is
points to. On the other hand, authorities are web assigned an equal Pagerank of P R = 1n = 16 , as repre-
pages cited by many different hubs, which means that sented in Figure 2(a). Then, these values are updated
their relevance is measured by the number of inward k times (for the sake of simplicity, we consider only
links they receive. Typically, good authoritative pages two iterations) by applying the basic Pagerank update
are reliable sources of information about a given rule.
topic. To apply the rule, first is necessary to compute
In the following subsection, we explain the the shares of all nodes. Then, for each node we sum
foundations of Pagerank algorithm. all shares the node receives. The result of this sum
will be its new Pagerank value, as shown in Table 2
and Figure 2(b). For instance, the share of node D,
Pagerank Algorithm which has only one outgoing link, is computed as
Pagerank is a link analysis algorithm based on the shar e(D) = 1/6 = 16 . Its new Pagerank value is given
1
concept of eigenvector centrality. This algorithm is by the sum of the shares of its ingoing links, namely,
used by GoogleTM Internet search engine to rank web those coming from nodes A, C, E, and F:
pages according to the value of the information they
carry, so the most valuable ones appear at the top of 1 1 1 1 5
P R(D) = + + + =+ . (20)
the search results. 12 12 12 6 12
The idea of the algorithm is that information on
Web can be ranked according to link popularity (the After computing these values for all nodes in
more web pages are linked to a given web page the the network, we repeat the process for the second

106 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

F I G U R E 2 | Illustration of the process behind Pagerank algorithm in a network comprised six nodes. The first network (a) corresponds to the
initialization step. In network (b) are shown the updated Pagerank values at the end of the first iteration of the algorithm. Note that node D is so far
the most authoritative node, with a Pagerank value of 5/12.The rightmost network (c) corresponds to the second (and last) iteration of Pagerank.
Here, we notice that node B overtakes the position of node D in terms of Pagerank values.

T A B L E 2 Updated Pagerank Values after the First Iteration important one, once this Pagerank value means that
k=1 a great part of the information that flows through the
network passes through it.
Node A B C D E F Although Pagerank and other link analysis algo-
Shares 1/12 1/12 1/12 1/6 1/12 1/6 rithms were originally motivated by the necessity of
Updated Pagerank 1/12 1/4 1/12 5/12 1/12 1/12 extracting and understanding the information yielded
by Web, they are also used in other domains, such as
social sciences. In the social sciences field, links can
be analyzed from two distinct, but somehow interre-
T A B L E 3 Updated Pagerank Values at the End of the Second lated, perspectives: the information centered and the
(and Last) Iteration k = 2 actor centered. These perspectives are typically used
to help understand the underlying social phenomena,
Node A B C D E F
by means of the identification of the most valuable
Shares 1/24 1/8 1/24 5/12 1/24 1/12 sources of information or, alternatively, the most im-
Updated Pagerank 1/24 11/24 1/8 5/24 1/24 1/8 portant actors. Nevertheless, there is still some lack
of consensual guiding principles about how to inter-
pret link analysis results in a social science context.
Thelwall32 stresses out the importance of developing
iteration k = 2, obtaining the results shown in
guidelines for improving the process of interpreting
Table 3 and Figure 2(c). This rule is applied iteratively
these results and proposes a theoretical framework
until the convergence of the Pagerank values, or until
for link analysis interpretation.
the kth iteration. As we consider only two iterations,
we can try to draw some conclusions and interpret
the results based only on the information available
in Tables 2 and 3. Therefore, at the end of the first
PROPERTIES OF REAL-WORLD
iteration, node D seemed to be the most promising
one, with a Pagerank of 5/12; nevertheless, at the
NETWORKS
second iteration node B overtakes the position of D, Real-world problems are an inexhaustible source of
being now assigned to the first place of the ranking inspiration for network theories. The great majority
of nodes. This sudden change befits the idea behind of real-world events and activities we play, observe,
Pagerank algorithm that measures the quality, instead and study can be easily modeled using graphs and can
of the quantity, of a node’s connections. Therefore, be analyzed through the lens of network analysis.
and besides node D is the one receiving more incom- In this overview, we focus on social networks,
ing links, the importance of the nodes linking to them which are those arising as a result of human and so-
is not that significant. On the other hand, node B has cial interactions, though there are other types of real-
only two incoming links, but one of them is of great world networks. Given the diversity of such networks,
importance, namely, node D. This is the main reason researchers classified them into main types, although
why node B receives the larger Pagerank value at the this classification is not fully consensual. One possibil-
end of the second iteration, turning into the most in- ity is the one provided by Newman,9 which was pre-
fluential, or authoritative, node in the network. If B sented in the Introduction section. Although they stem
was an actor, he/she would be considered the most from distinct real-world problems and knowledge

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 107
Overview wires.wiley.com/widm

fields, they all share a set of common properties which tence of shortcuts between most of vertex pairs in a
make them peculiar, thus opposing to the two well- network. In social settings, this means that two ap-
known network models: random networks and regu- parently disconnected people can quickly get in touch
lar networks. with each other through an incredible low number
The simplest and best known model of network of acquaintances or friends. This finding has several
is the random graph.33,34 This type of graph is char- implications in dynamic processes because it implies,
acterized by the random placement of edges between for instance, that the spread of a contagious disease
a fixed number n of vertices, to create a network throughout the population will be faster than one
in which each of the 1/2n(n–1) possible edges is in- would expect.
dependently present with some probability p. When In mathematical terms, the small-world effect
p = 0, we obtain a graph of perfect order—regular means that the average geodesic distance (i.e., the
graph; and when p = 1, we obtain a random graph, shortest path) between pairs of vertices scales loga-
which embodies the total chaos. Because both regular rithmically, or slower, with the network size with a
graphs and random graphs represent extremes, they fixed mean degree.9 This property is also observed
are not realistic. Additional properties9 are required in random graphs, where the diameter is very small,
to model complex and atypical networks such as the only growing logarithmically with n, and the vertices
ones we find in real world. In short, we can say that have all about the same degree.
real-world networks are nonrandom and nonregular
graphs with unique features, where ‘order coexists
Property 2: Transitivity or Clustering
with disorder’.35 In this section, we introduce and ex-
According to Wasserman and Faust,16 transitivity is
plain some of these properties, which are as follows:
a property that considers triples of nodes (i.e., sets of
three vertices, in which at least one is connected to
1. Small-world effect;
both others) in a graph or, in other words, measures
2. Transitivity or clustering; the density of triangles (i.e., three nodes connected
3. Power-law degree distributions; to each other by three edges in the network, which
4. Network resilience; means that every node is fully connected to the re-
maining two nodes) in a network. In the social net-
5. Mixing patterns;
work parlance, it means that a friend of your friend
6. Community structure. is also likely to be your friend.
This property is quantified using a clustering
coefficient that can be global or local, as mentioned
Property 1: the Small-World Effect in the section Elementary Statistical Measures.
Stanley Milgram,36 an American social psychologist,
was the first to point out the existence of small-world
effects in real social networks, through a series of Property 3: Power-Law Degree
famous experiments which are today known as the Distributions
Milgram experiment. This experiment was done to The degree distribution P(k) is the probability distri-
test the speculative idea of the small-world effect and bution of the degrees of nodes over the whole net-
is one of its first direct demonstrations. The main hy- work. Therefore, P(k) represents the probability that
pothesis of the study was that pairs of apparently dis- a vertex chosen uniformly at random has degree k,
tant individuals are connected by a short path, i.e., by and is defined by the fraction of nodes in the net-
a few number of acquaintances, through the network. work that have degree k. This means that, if the total
To probe the distribution of the path lengths, Mil- number of nodes in the network is n, and nk of these
gram asked some random participants (about 300) to nodes have degree k then, for this value of the de-
pass a letter to someone they knew in a first-name ba- gree, we have a probability of P(k) = nnk . Computing
sis in an attempt to get it to an assigned target person. this probability for each degree value k, of the set of
With this experiment, it was shown that the median degree values appearing in a given network, we ob-
length of the paths that succeeded in reaching the tar- tain the probability distribution of the degree in this
get was six, which clearly explains the origins of the network.
concept six degrees of separation. Random graphs, such as the ones studied by
The overall conclusion of Milgram and its col- Erdös and Rényi,34 show a binomial degree distri-
leagues has been accepted in a broad sense. In fact, bution because the presence, or absence, of an edge
the small-world effect has been widely observed in is equiprobable (i.e., equal for all possible vertex
real-world networks and it is manifested by the exis- pairs). In the limit of large graph size, this degree

108 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

distribution goes from binomial to Poisson distribu- such cases, it is more appropriate to test the resilience
tion. Therefore, in this class of graphs, the degree of the network the removal of a given percentage of
distribution is highly homogeneous as most vertices nodes.
have similar, or equal, degree.
Real-world networks are, in turn, quite differ- Property 5: Mixing Patterns
ent from random graphs with respect to degree dis- Some networks are made up of different types of ver-
tribution. Barabási and Albert24 discovered that, in tices. In these kinds of networks, the linking between
real graphs, the distribution of the vertex degree is vertices or, in other words, the probability of con-
very heterogeneous and highly right-skewed, with a nection between vertex pairs, tends to be selective
large majority of vertices having a low degree and and highly dependent on vertex types (e.g., food web,
a small number having a high degree. This finding whose vertices may be herbivores, carnivores, and
comes to reinforce the previous work of Price37 on plants). In social networks, this is also evident be-
networks of citations between scientific papers. In cause individuals tend to interact with other similar
both cases, they state that the degree distribution of to them. This selective linking is usually called assor-
real networks, such as citation networks, follows a tative mixing, or homophily, and a classic example is
power-law (at least asymptotically) and, therefore, mixing by race. Real networks tend to show higher
these networks are sometimes referred to as scale- tendencies for assortative mixing.
free networks.23 Power-law distributions usually arise Newman39 proposed an assortative coefficient
when the amount you get of something depends on to quantify the assortative mixing of a network. This
the amount you already have. A common analogy is coefficient replaces the previously proposed by Gupta
‘the rich get richer’. Price38 used the term cumula- et al.40 and allows us to distinguish a randomly mixed
tive advantage to refer to this mechanism, which is network from a perfectly assortative one.
believed to be the most probable explanation for the
power-law degree distributions in several real-world
networks, which include, but are not restricted to, Property 6: Community Structure
collaboration networks and the World Wide Web. The great majority of real social networks show
Today, this process is best known as preferential at- community structure, which means that we can find
tachment, a name coined by Barabási and Albert.24 In groups of densely connected vertices that are low con-
their seminal paper,24 the authors describe a network nected to other groups of vertices in the network. This
growth model which became known as the Barabási– topic will be deepened in the following section.
Albert model. This work shows that network grow-
ing with preferential attachment will indeed become
scale free, due to the ‘the-rich-gets-richer’ strategy em-
COMMUNITY DETECTION
ployed in the model. One of the unique features of social networks is that
they tend to show community structure. This prop-
erty usually arises as a consequence of both global and
Property 4: Network Resilience local heterogeneity of edges distribution in a graph.
Network resilience measures the impact on the con- Thus, we often find high concentrations of edges
nectivity of the network when one or more vertices within certain regions of the graph, called commu-
are removed and it is an indicator of the cohesion nities, and low concentration of edges between those
of the network. Different kinds of networks exhibit regions.
different levels of resilience. Most networks are ro- Communities, also known as modules or clus-
bust against random vertex removal but considerably ters, can be straightforward defined as similar groups
less robust to targeted removal of the highest degree of nodes. A more complete definition is built upon the
vertices. Also, when the endpoints of a bridge are concept of density: communities can be understood as
deleted there are strong changes in the network with densely connected groups of vertices in the network,
respect to the ability of communication between pairs with sparser connections between them.
of vertices, as some of them become disconnected. Be- According to Newman and Girvan,41 there are
tweenness centrality can also be seen as a measure of two main lines of research in discovering communities
resilience as it tells us how many geodesic paths will in network data. The first has its origins in Computer
get longer when a vertex is removed from the net- Science and is known as graph partitioning, whereas
work. However, in real settings the removal of a sin- the second has been mainly pursued by sociologists
gle node is not usually cause for alarm, since the net- and is usually referred as blockmodeling, hierarchi-
works comprise millions or even billions of nodes. In cal clustering, or community structure detection. The

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 109
Overview wires.wiley.com/widm

In real life, we can find several examples of such


tight groups. There is a long list of examples, so we
will only name a few. Society is a rich environment
for finding communities, once people have the nat-
ural tendency to form groups. These groups can be
families, circles of friends, working and/or religious
groups, towns, nations, and so on. If we also con-
sider groups formed by companies, or by customers
of a given product, we can identify communities with
relevance to Economics and Business fields. Biology is
another activity where methods for finding communi-
ties are useful, especially within the scope of metabolic
networks. For instance, in protein–protein interaction
networks we can find groups of proteins with simi-
lar functions within the cell. We can also find vir-
tual online communities in the network of Internet,
or groups of topic-related web pages, which may be
useful for the development of automatic and efficient
recommendation systems.
F I G U R E 3 | Illustration of a network with three distinct The importance of studying these communities
communities: C1 = {A, B, C, D, E, F}, C2 = {G, H, I}, and C3 = {J, K, is intuitive in domains such as SNA. To highlight this
L, M, N, O, P}. importance, Fortunato35 has stated that the analysis
of the structural position of nodes, in each module,
can help identify central actors (those within central
former originally arose in the Computer Sciences field positions), often associated to group control and sta-
because of the necessity of finding the best way to allo- bility functions, as well as intermediate actors, who
cate tasks to processors so as to minimize the commu- are those who lie at the boundaries of communities
nications between them. This network optimization and play a key role in the spread and exchange of
task aimed at enhancing the computation, in a parallel new ideas and information, creating bridges between
computing environment. The latter was motivated by communities. Other interesting possibility opened by
the discovery of community groups within society, to the task of discovering communities is the one that fo-
simplify the analysis of social phenomena through the cus on the analysis of coarse-grained descriptions of
arrangement of people according to their similarities. the original graph. An example is the study of graphs
The main process behind community detection algo- obtained by considering vertices as communities and
rithms is based on dividing the original graph, into edges between them as an indicator of overlap be-
a set of disjoint subgraphs, through the optimization tween communities. This strategy is used by Oliveira
of a given objective function (e.g., modularity). The and Gama43 for the detection of transitions in
aim of both approaches is to discover groups of re- clusters.
lated vertices in the network and, if possible, the cor- The following subsections are devoted to the
responding hierarchical organization, based only on introduction of the most popular (not necessarily
the information provided by network topology. This the best) methods to solve the problem of finding
is usually done by iteratively removing the bridges be- communities. The great majority of these traditional
tween groups of vertices, as suggested by Girvan and algorithms assume partitions of vertices, instead of
Newman.42 covers,1 i.e., they do not allow overlap of communi-
To better understand the introduced concepts, ties, so each vertex is assigned to a single community.
in Figure 3 is depicted a simple network compris- However, if one suspects that the nature of his/her
ing three communities, named C1 , C2 , and C3 . In network implies the existence of overlapping com-
this picture, we represent an ideal situation since each munities, a possible choice is the clique percolation
community is itself a complete graph, or a clique, of method, proposed by the physicists Palla et al.44 The
varying size (C1 = K6 , C2 = K3 , and C3 = K7 ). Also, main feature of this prominent approach is its abil-
the density of ties between communities is very low. ity to find overlapping communities in a network, by
The few ties that exist are bridges, since they are the allowing vertices to belong to more than one group.
only available connections between different parts of This characteristic is especially appealing in social sci-
the network. ences, as people tend to belong to more than one

110 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

community (e.g., family, work, friends, etc.) at the regions,31 namely, bridges and local bridges.
same time. A well-known algorithm exploring this
For those interested in using clique percola- method is the one proposed by Girvan and
tion method to detect overlapping communities, Palla Newman.42
et al.44 developed the CFinder software package, 2. Agglomerative methods: this class of meth-
which is freely available at www.cfinder.org. ods focuses on the tightly knit parts of the
network, rather on the connections at their
boundaries. Walktrap45 is an example of an
Hierarchical Clustering algorithm based on this method.
Hierarchical clustering is a popular class of meth-
ods for finding clusters, since it does not require any
assumptions regarding their number, membership, In the next subsection, we present one of the
and size. Hierarchical clustering algorithms produce a best known and widely used divisive hierarchical al-
flexible nested structure (smaller clusters within larger gorithm for finding communities, especially in social
clusters which, in turn, are embedded in even larger networks: the algorithm of Girvan and Newman.
clusters), typically represented by means of a dendro-
gram, that uncovers the multilevel structure of the
Girvan–Newman Algorithm
network. Such features are highly desired in domains
Among the most popular algorithms, or even the most
where little information is available concerning the
popular one, for solving community detection prob-
community structure of a network. In addition, these
lems is the one devised by Girvan and Newman42 and
methods proved to be quite effective in solving cluster
known as the Girvan–Newman algorithm.
analysis problems, thus becoming attractive for graph
The Girvan–Newman algorithm is a divisive hi-
partitioning and community detection purposes.
erarchical technique that deconstructs the initial full
The procedure of traditional hierarchical clus-
network into progressively smaller connected pieces,
tering is quite intuitive, being strongly based on the
until the point where there are no edges to remove
definition of similarity. Usually, the first step is the
and each node represents itself a community. Bear-
selection of the similarity measure that will be used
ing in mind that communities are cohesive groups
to assess how alike two nodes are according to a
of nodes, with sparser connections between them, the
given global, or local, property. Examples of such
criterion to remove the edges, proposed by Girvan and
measures are the cosine similarity, the Jaccard index,
Newman,42 is the graph-theoretic centrality measure
the Euclidean, or Manhattan distances, the Hamming
edge betweenness. The reason behind this choice is
distance between pairs of rows in an adjacency ma-
related to the fact that this centrality measure is able
trix, among others. The next step is to compute the
to identify edges that lie on a large number of shortest
similarity matrix between all pairs of nodes, regard-
paths between nodes and, therefore, are believed to
less of the fact that those nodes are, or not, con-
connect different nonoverlapping communities. Thus,
nected to each other. Then, one chooses the approach
the main idea of this algorithm is that if we identify
to group them—the agglomerative or the divisive—
and remove bridges, we isolate the existing commu-
and, depending on the choice, selects a given dis-
nities in a network.
tance measure to compute the similarity between clus-
Because it is based on the concept of between-
ters (e.g., single linkage, complete linkage, Ward’s
ness, it is only suitable for networks of moderate or-
method, etc.). The result is a dendrogram illustrating
der (up to a few thousand nodes), due to the high cost
the arrangement of clusters returned by the hierar-
of computing it. The input of the algorithm is a full
chical algorithm. To select the best partition, i.e., the
graph and the output is a hierarchical structure, such
best number of communities k, a typical strategy is
as a dendrogram, where communities at any level cor-
to compute the value of modularity41 for every pos-
respond to a horizontal cut through this hierarchical
sible number of clusters and select the number that
tree. The steps of the algorithm can be summarized as
maximizes this function.
follows:
As mentioned before, there are two general ap-
proaches for hierarchical clustering, which are as fol-
lows: • Compute the betweenness of all edges in the
network;
1. Divisive methods: this class of methods • Remove the edge with highest betweenness.
focuses on identifying and removing the This step may cause the network to split into
spanning links between densely connected separate disconnected parts, which constitute

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 111
Overview wires.wiley.com/widm

the first level of regions in the partitioning of BOX 1 SOFTWARE AND TOOLS FOR SO-
the graph. CIAL NETWORK ANALYSIS
• Repeat the previous steps until there are no
To fulfill the growing need for social network mining and vi-
edges to remove in the graph. Note that the
sualization, a considerable collection of software and tools
obtained smaller components, within larger
were developed for SNA. Some of these tools are more
components, are the regions nested within the
technical and targeted to users with strong programming
larger regions found in the first steps.
background and skills (e.g., UCInet51 ; igraph, sna, and
NetworkX52 libraries for R46 environment), and others are
Because of its popularity, almost all stan-
more intuitive and, thus, more suitable for users from the
dard software libraries have this algorithm im-
social sciences (e.g., Gephi53 and NodeXL54 ). Though these
plemented. For instance, in R46 we can use the
tools have different characteristics, most of them allow:
edge.betweenness.community function, provided by
the computation of metrics that provide a local (actor level)
library igraph, to apply the Girvan–Newman algo-
and global (network level) description of the network; the
rithm.
graphical visualization of the network; and the detection of
communities. On the basis of the study of Combe et al.,55
Modularity Optimization these constitute the expected functionalities of an SNA tool.
A widely used and very popular class of methods to In general, the main functionalities that can be embedded
detect communities in networks is modularity max- in SNA tools are the following:
imization. Modularity Q is a quality function that • Creation of networks;
attempts to measure the merit of a given partition of • Visualization and manipulation of networks;
the network into communities. It has been used not
• Qualitative and quantitative/statistical analysis of
only to compare the quality of the partitions obtained
networks;
by different community detection methods, but also
as an objective function to optimize. According to • Community detection;
Newman,47 • Predictive analysis (peer influence/contagion
modeling, homophily models, and link prediction).
Modularity is, up to a multiplicative constant, the
number of edges falling within groups minus the ex- Despite the quantity and diversity of available tools in the
pected number in an equivalent network with edges Web, some of the most popular are
placed at random.
• Pajek56 : freely available software for the analysis
Based on this definition, we can deduce that and visualization of large-scale networks.
modularity is a measure that explicitly takes into ac- • Gephi53 : open source software for network ma-
count the heterogeneity of the edges. The basic idea is nipulation and exploration, endowed with a three-
that a network shows meaningful community struc- dimensional render engine to display real-time
ture if the number of edges between communities is evolving networks.
fewer than expected on the basis of random choice. • UCInet51 : commercial social network analysis
By assumption, the higher its value the better the par- which makes use of Pajek and NETDRAW for vi-
tition, meaning that the found communities are inter- sualization and it especially suitable for statistical
nally densely connected and externally sparsely con- and matricial analyses.
nected, because there are more edges falling within
• NodeXL54 : freely available add-in to Microsoft Ex-
groups than what would be expected by chance. Mod-
cel 2007 for the overview, discovery and explo-
ularity is computed as
ration of networks, which does not require any
 
1  ki kj programming skills because it is very user friendly.
Q= Ai j − δ(ci , c j ), (21) Not suitable for the analysis of large networks.
2m i j 2m
• R46 libraries (igraph, sna, tnet, statnet, and
where m is the number of edges, ki and kj represent, NetworkX52 ): freely available packages for R en-
respectively, the degree of nodes i and j, Aij is the entry vironment which are very comprehensive (e.g.,
of the adjacency matrix that gives the number of edges significant number of implemented algorithms for
ki kj
between nodes i and j, 2m represents the expected community detection; analysis of longitudinal net-
number of edges falling between those nodes, ci and works, as well as two-mode networks) and with
cj denote the groups to which nodes i and j belong, good two- and three-dimensional visualization ca-
and δ(ci , cj ) represents the Kronecker delta. pabilities.

112 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

From the formula, we can deduce that Q ∈ [ − gence of new movements in network research. These
1, 1], being either negative or positive. If positive, then new approaches started to focus in the analysis of
there is possibility of finding community structure on the statistical properties of large-scale complex net-
the network. If Q is not only positive, but also large, works, which were easily gathered using computers
then the corresponding partition may reflect the real and other electronic devices. This change of scales
community structure. According to Clauset et al.,48 forced upon a corresponding change in the tradi-
in practice, it was found that a modularity of about tional analytical approach and data mining statis-
0.3 is a good indicator of the existence of meaningful tical methods became of great importance, due to
communities. their ability to extract patterns and knowledge from
Following this reasoning, and knowing that the massive quantities of data.9 Nowadays, the wide
higher the modularity, the best the obtained network availability of software and SNA tools and libraries
division is, a natural approach would be maximiz- (more than 50) is a reflection of this evolution in
ing this measure, by computing it for every possible SNA (Box 1).
partition of the network and selecting the partition Because of these technological advances and
returning the higher value. This simple idea gave rise consequent impact in the availability of networked
to a new class of methods whose foundations are set data, new challenges are being posed to the research
on the maximization of modularity. Albeit this ap- field of SNA, and a new paradigm is emerging. This
proach is quite attractive, the exhaustive search over paradigm takes into consideration new factors in the
all possible divisions is usually intractable. This un- analysis of social networks, such as the size of data,
desired effect of computational inefficiency has been which is incredibly getting large, and changes in space
circumvented by adapting a number of heuristic meth- and time.
ods to this specific optimization problem. Following The first issue has implications on existing meth-
this strategy, one can obtain a fairly good approxi- ods for SNA, which now need to be improved to scale
mation of the global optimum (in this case, the max- well to large-scale social networks. For instance, find-
imum value of modularity) in an acceptable time. Al- ing communities in large-scale social networks will
gorithms that employ this strategy are, for instance, continue to be a dynamic research challenge. Also in
the one proposed by Blondel et al.,49 which per- the area of community detection, the urge to develop
forms a hierarchical optimization of modularity by new methods which are not only scalable and efficient
exploring greedy techniques, and the one proposed but also fully automatic, in the sense that they do not
by Guimerá and Amaral,50 that applies the simulated need any user-specified parameter (e.g., number and
annealing procedure to the modularity optimization size of communities), will continue to be an important
problem. research problem.
Those interested in knowing more about the The second issue is of extreme relevance because
problem of finding communities in networks, can re- the speed at which data is collected is turning obsolete
fer to the recently released survey by Fortunato.35 static analyses. Therefore, the study of the dynamics
and evolution of social networks, which include but
are not restricted to both the discovery of the gen-
CONCLUSIONS AND CURRENT eral properties that govern the temporal evolution of
TRENDS social networks and the detection and understanding
of temporal and spatial changes in these networks,
At the beginning, analysis of social networks was
will continue to be a significant strand of research in
based in single small graphs. The first studies used
SNA.
data collected using direct questionnaires asking re-
It is also expected that the popularity of SNA
spondents to detail their interactions with others. The
will continue to increase, attracting more and more
gathered data was then represented using the math-
researchers to the field and impelling an increasing
ematical graph model, where vertices correspond to
number of companies to incorporate SNA methods
individuals (the respondents) and edges to interac-
into their business processes and generalize their use
tions between them. These traditional studies usu-
as strategic tools.
ally entailed some problems such as inaccuracy, sub-
jectivity, and lack of generality due to small sample
size.
NOTES
The substantial advance of technology, the in-
a
creasing availability of computers and the arising of Cover is a synonym of partition for soft assignment
communication networks contributed to the emer- of vertices to communities.

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 113
Overview wires.wiley.com/widm

ACKNOWLEDGMENTS
We are grateful to the referees for their helpful comments and suggestions on an earlier draft of
this paper. We are also grateful for the financial support of the project Knowledge Discovery
from Ubiquitous Data Streams (PTDC /EIA-EIA/098355/2008). The work of Márcia Oliveira
was also supported by the Portuguese Foundation for Science and Technology (FCT), under the
PhD grant SFRH/BD/81339/2011.

REFERENCES
1. Moreno JL. Who Shall Survive? New York: Beacon 14. Broder A, Kumar R, Maghoul F, Raghavan P, Ra-
House; 1953. jagopalan S, Stata R, Tomkins A, Wiener J. Graph
2. Domingos P, Richardson M. Mining the network value structure in the Web. Comput Netw 2000, 33:309–
of customers. In: Seventh ACM SIGKDD International 320.
Conference on Knowledge Discovery and Data Min- 15. Alon U. Biological networks: the tinkerer as an engi-
ing. New York, NY: ACM 2001, 57–66. neer. Science 2003, 301:1866–1867
3. Richardson M, Domingos P. Mining knowledge- 16. Wasserman S, Faust K. Social Network Analysis: Meth-
sharing sites for viral marketing. In: Proceedings of ods and Applications. Cambridge, UK: Cambridge Uni-
the 8th ACM SIGKDD International Conference on versity Press; 1994.
Knowledge Discovery and Data Mining. New York, 17. Diestel R. Graph Theory. 3rd ed. Heidelberg: Spring-
NY: ACM 2002, 61–70. Verlag; 2005.
4. Leskovec J, Adamic LA, Huberman BA. The dynamics 18. Granovetter M. The strength of weak ties. Am J Sociol
of viral marketing. ACM Trans Web 2007, 1:228–237. 1973, 78:1360–1380.
5. Dasgupta K, Singh R, Viswanathan B, Chakraborty D, 19. Granovetter M. Getting a Job: A Study of Contacts and
Mukherjea S, Nanavati AA, Joshi A. Social ties and Careers. Cambridge, MA: Harvard University Press;
their relevance to churn in mobile telecom networks. 1974.
In: Eleventh International Conference on Extending 20. Freeman LC. Centrality in social networks: conceptual
Database Technology: Advances in Database Technol- clarification. Soc Netw 1979, 1:215–239.
ogy. New York, NY: ACM 2008, 668–677. 21. Brin S, Page L. Node centrality in weighted networks:
6. Wei C-P, Chiu I-T. Turning telecommunications call generalizing degree and shortest paths. Soc Netw 2010,
details to churn prediction: a data mining approach. 32:245–251.
Expert Syst Appl 2002, 23:103–112. 22. Bonacich P. Power and centrality: a family of measures.
7. Xu J, Che H. Criminal network analysis and visualiza- Am J Sociol 1987, 92:1170–1182.
tion. Commun ACM 2005, 48:101–107. 23. Barabási AL, Bonabeau E. Scale-Free Networks. Sci
8. Shetty J, Adibi J. The Enron Email Dataset Database Am 2003, 288:60–69.
Schema and Brief Statistical Report. Technical Report, 24. Barabási A-L, Albert R. Emergence of scaling in ran-
University of Southern California, 2004. dom networks. Science 1999, 286:509–512.
9. Newman MEJ. The structure and function of complex 25. Kossinets G, Watts DJ. Empirical analysis of an evolv-
networks. SIAM Rev 2003, 45:167–228. ing social network. Science 2006, 311:88–90.
10. Van De Bunt GG, Van Duijn MAJ, Snijders TAB. 26. Watts DJ, Strogatz SH. Collective dynamics of small-
Friendship Networks Through Time: An Actor- world networks. Nature 1998, 393:440–442.
Oriented Dynamic Statistical Network Model. Com- 27. Leskovec J, Kleinberg J, Faloutsos C. Graphs over time:
put Math Org Theory 1999, 5:167–192. densification laws, shrinking diameters and possible ex-
11. Ritter T. The networking company: antecedents for planations. In: Eleventh ACM SIGKDD International
coping with relationships and networks effectively. Ind Conference on Knowledge Discovery in Data Mining.
Mark Manage 1999, 28:467–479. New York, NY: ACM 2005, 177–187.
12. Newman MEJ. The structure of scientific collaboration 28. Costa L, Oliveira O, Travieso G, Rodrigues F,
networks. Proc Natl Acad Sci USA 2001, 98:404–409. Villas Boas P, Antiqueira L, Viana M, Rocha L. Analyz-
13. Truyen TT, Phung DQ, Venkatesh S. Preference net- ing and modeling real-world phenomena with complex
works: probabilistic models for recommendation sys- networks: a survey of applications. Adv Phys 2011,
tems. In: Sixth Australasian Conference on Data Min- 60:329–412.
ing and Analytics. Darlinghurst, Australia: Australian 29. Kleinberg J. Authorative sources in a hyperlinked en-
Computer Society, Inc. 2007. 70:195–202. vironment. J ACM 1999, 46:604–632.

114 
c 2012 John Wiley & Sons, Inc. Volume 2, March/April 2012
WIREs Data Mining and Knowledge Discovery Social network analysis

30. Brin S, Page L. The anatomy of a large-scale hyper- 45. Pons P, Latapy M. Computing communities in large
textual Web search engine. Comput Netw ISDN Syst networks using random walks. J Graph Algorithms
1998, 30:107–117. Appl 2006, 10:191–218.
31. Easley D, Kleinberg J. Networks, Crowds and Markets: 46. R Development Core Team. R: a language and envi-
Reasoning about a Highly Connected World. Cam- ronment for statistical computing. R Foundation for
bridge, UK: Cambridge University Press; 2010. Statistical Computing; 2011. Available at: http://www
32. Thelwall M. Interpreting social science link analysis .R-project.org. (Accessed January 14, 2012)
research: a theoretical framework. J Am Soc Inf Sci 47. Newman MEJ. Modularity and community structure
Technol 2006, 57:60–68. in networks. Proc Natl Acad Sci USA 2006, 103:8577–
33. Rapoport A. Spread of information through a popula- 8582.
tion with socio-structural bias: I. Assumption of tran- 48. Clauset A, Newman MEJ, Moore C. Finding commu-
sitivity. Bull Math Biophys 1953, 15:523–533. nity structure in very large networks. Phys Rev E 2004,
34. Erdos P, Renyi A. On the evolution of random graphs. 70:066111.
Publ Math Inst Hungarian Acad Sci 1960, 5:17–61. 49. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E.
35. Fortunato S. Community detection in graphs. Phys Rep Fast unfolding of communities in large networks. J Stat
2010, 486:75–174. Mech: Theory Exp 2008, 2008:P10008.
36. Milgram S. The small world problem. Psychol Today 50. Guimera R, Amaral LAN. Functional cartography of
1967, 1:61–67. complex metabolic networks. Nature 2005, 433:895–
900.
37. Price DDS. Networks of scientific papers. Science 1965,
149:510–515. 51. Borgatti SP, Everett MG, Freeman LC. Ucinet for Win-
dows: software for social network analysis. Harv Anal
38. Price DDS. A general theory of bibliometric and other Technol 2002, 2006.
cumulative advantage processes. J Am Soc Inf Sci 1976,
52. Hagberg AA, Schult DA, Swart PJ. Exploring network
27:292–306.
structure, dynamics, and function using NetworkX. In:
39. Newman MEJ. Mixing patterns in networks. Phys Rev Seventh Python in Science Conference; 2008, 11–15.
E 2003, 67:026126.
53. Bastian M, Heymann S, Jacomy M. Gephi: an open
40. Gupta S, Anderson RM, May RM. Networks of sexual source software for exploring and manipulating net-
contacts: implications for the pattern of spread of HIV. works. In: Third International AAAI Conference on
AIDS 1989, 3:807–817. Weblogs and Social Media. Palo Alto, CA: Association
41. Newman MEJ, Girvan M. Finding and evaluating for the Advancement of Artiı̈cial intelligence. 2009,
community structure in networks. Phys Rev E 2004, 361–362.
69:026113. 54. Smith M, Shneiderman B, Milic-Frayling N, Rodrigues
42. Girvan M, Newman MEJ. Community structure in so- EM, Barash V, Dunne C, Capone T, Perer A, Gleave
cial and biological networks. Proc Natl Acad Sci USA E. Analyzing (social media) networks with NodeXL.
2002, 99:7821–7826. In: Fourth International Conference on Communities
43. Oliveira M, Gama J. MEC—monitoring clusters’ tran- and Technologies. New York, NY: ACM 2009, 255–
sitions. In: Proceedings of the 5th Starting AI Re- 264.
searchers’ Symposium. Lisbon, Portugal: IOS Press; 55. Combe D, Largeron C, Egyed-Zsigmond E, Géry M.
2010. A comparative study of social network analysis tools.
44. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the Soc Netw 2010, 2:1–12.
overlapping community structure of complex networks 56. Batagelj V, Mrvar A. Pajek—program for large net-
in nature and society. Nature 2005, 435:814–818. work analysis. Connections 1998, 21:47–57.

FURTHER READING/RESOURCES
Doreian P, Stockman FN, eds. Evolution of Social Networks. London: Routledge; 1997.
Degenne A, Forsé M. Introducing Social Networks. London/Thousand Oaks, CA/New Delhi: Sage Publications; 1999.
Freeman LC. The Development of Social Network Analysis: A Study in the Sociology of Science. Vancouver, Canada:
Empirical Press; 2004.
Carrington PJ, Scott J, Wasserman S, eds. Models and Methods in Social Network Analysis. New York: Cambridge
University Press; 2005.
Knoke D, Yang S. Social Network Analysis. 2nd ed. London/Thousand Oaks, CA/New Delhi: Sage Publications; 2008.

Volume 2, March/April 2012 


c 2012 John Wiley & Sons, Inc. 115

You might also like