Web Data Analysis
Department of Communication PhD Student Workshop
Web Mining for Communication Research
April 22-25, 2014
http://weblab.com.cityu.edu.hk/blog/project/workshops
Cheng-Jun Wang
Outline
I. Key features of web data
II. Major approaches to web data analysis
i. Network analysis
ii. Temporal analysis
iii. Spatial analysis
iv. Sentiment analysis
III. Reflections on web data analysis
FEATURES OF WEB DATA
Traditional vs. Web Data
Analysis of traditional (cross-sectional, fat) data
Layout: one row per ID (1 ... 1,000) and many columns (V1, V2, V3, ...)
Typical methods: Multiple regression, Log-linear model, Multilevel analysis, Structural equation modeling, etc.
Analysis of web (longitudinal, tall) data
Layout: many rows (ID 1, 2, ... 1,000 ... 10,000) and few columns (ID, Time, V1)
Typical methods: Time series analysis, Network analysis, Spatial analysis, Text mining, etc.
APPROACHES TO WEB DATA ANALYSIS
What Can We Do with Web Data?
Features
Temporal features
Spatial features
Structural/behavioral features (e.g., RT, @)
Content features (term/topic/sentiment)
Approaches
Time series analysis
Spatial analysis
Network analysis
Text mining
Frequently Used Tools
Operation: Pull-down menus
Open Source: OpenOffice, Google Docs Spreadsheet
Commercial: SPSS, Excel
Operation: Programming-based
Commercial: Stata, SAS
NETWORK ANALYSIS
(ELEMENTARY-LEVEL)
What Is a Network?
A network consists of
Nodes (actors, agents,
etc.)
Edges (relations, ties, etc.)
The same set of nodes
and edges can also be
called:
a graph
a matrix
a web
a map
etc.
A pair of adjacent nodes are
neighbors. (Are A and C neighbors?)
Key Concepts
Network
Node
Edge
Ego-network
Component
Triadic closure
Individual-level
analysis:
Centrality metrics
Group-level
analysis
Transitivity
Global-level
analysis
Density
Modularity
Examples of Nodes and
Edges
Nodes:
Persons (e.g., Facebook
users)
Organizations (McDonald's restaurants)
Nations (EU members)
Machines (web servers)
Locations (airports)
Ideas (words in articles)
etc.
Edges:
Kinship links (family
ties)
Friendship ties (factual
or perceived)
Business transactions
Travel routes (highways,
subways, air flights)
Similarities (word co-occurrences in articles)
etc.
Examples of Innovative
Network Analysis
Food Flavor Network
http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html
Music Notes Network
http://www.eie.polyu.edu.hk/~xfliu/publications/LiuXF.2010.physa.Music.pdf
More on Edges
Directed (one-way) vs. undirected (two-way)
Observed (directly measured, e.g., hyperlinks) vs. hidden (inferred, e.g., co-occurrences)
Formal (institutionally arranged, top-down) vs.
informal (self-organized, bottom-up)
Static (unchanged over time) vs. dynamic
(evolving)
Positive (e.g., friending) vs. negative (e.g., defriending)
The key challenge to innovative network analysis is to identify hidden, informal, and evolving edges.
Classification of Online Social Networks
By manifestation of ties and direction of ties:
Directly observed, undirected: Friendship networks (e.g., Facebook, Google+)
Directly observed, directed: Microblog networks (e.g., Twitter, Sina Weibo)
Indirectly inferred, undirected: Semantic networks (e.g., recommendation systems, social tagging systems)
Indirectly inferred, directed: Newsgroups, blogs, WWW hyperlink networks
Source: Ackland and Zhu (forthcoming). Social network analysis, Sage.
Components
A component is a subset of a network such that:
i. every node in the subset has a path to every other node in the subset
ii. the subset is not part of some larger set with the same property
Most online social networks have one (or a few) giant components
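The definition above can be checked on a toy example. The following is a minimal sketch, assuming the igraph package is installed; the network itself (a 4-node path, a separate pair, and an isolate) is hypothetical and chosen only for illustration.

```r
library(igraph)

# Hypothetical toy network: a 4-node path, a separate 2-node pair, and an isolate
g <- graph_from_literal(A - B - C - D, E - F, G)

comp <- components(g)
comp$no           # number of components
comp$csize        # size of each component
max(comp$csize)   # size of the largest ("giant") component
```

In a real online social network the largest component usually contains most of the nodes, which is what the "giant component" remark refers to.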
Components in a High School Network
Source: Bearman, Moody & Stovel (2004).
http://www.jstor.org/discover/10.1086/386272?uid=3738176&uid=2&uid=4&sid=21103878405327
Components in World Wide Web
Daisy Model (Donato et al., 2005)
Bowtie Model (Broder et al., 2000)
SCC: strongly connected component
IN: nodes that can reach the SCC but cannot be reached from it
OUT: nodes that can be reached from the SCC but cannot reach it
Teapot Model (Zhu et al., 2008)
Ego-Centric Network
Ego-network: a subset of a network including a particularly designated node (ego) and its neighbors (alters)
For example, followers of a VIP account on Twitter or Sina Weibo form an ego-network
All snowball samples of online social networks are ego-networks.
An important property of ego-networks is the depth (see next slide).
A family tree is a special case of ego-networks (see the slide after next).
The Depth of Ego Networks
1.0 Ego Network
1.5 Ego Network
2.0 Ego Network
2.5 Ego Network
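The depth of an ego-network can be illustrated with igraph's `make_ego_graph()`. This is a sketch on a hypothetical network. It assumes the common reading that depth 1 covers ego and alters, that the ".5" variants add the ties among the outermost alters, and that depth 2 adds the alters' own neighbors. The slides do not define these labels, so treat the mapping as an assumption.

```r
library(igraph)

# Hypothetical network: ego E with alters A and B, who also know each other;
# B additionally knows C, and C knows D
g <- graph_from_literal(E - A, E - B, A - B, B - C, C - D)

# order = 1: ego plus direct neighbors (induced subgraph, so the A-B tie is kept)
ego1 <- make_ego_graph(g, order = 1, nodes = "E")[[1]]
# order = 2: also includes neighbors of neighbors
ego2 <- make_ego_graph(g, order = 2, nodes = "E")[[1]]

V(ego1)$name
V(ego2)$name
```

Because `make_ego_graph()` returns the induced subgraph, the order-1 result already contains the tie between the two alters.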
Family Tree: Special Ego-Networks
Is it an undirected or directed graph?
Are there multiple paths from a parent node to a child node?
What are the similarities or differences between family trees and other types of ego-networks?
Triadic Closure (Transitivity)
(Figure: the same triad at time t0 and at time t1)
Why are friends (B and C) of a common friend (A) more likely to become friends themselves?
1. They have chances to meet each other.
2. They are similar to each other.
Triads of Undirected
Networks
Closed Triad
Connected Pair
Open Triad
Unconnected
Triads of Directed Networks
The 1st number: N of bidirectional edges;
The 2nd number: N of
uni-directional edges;
The 3rd number: N of
nonexistent edges;
The letter code:
directed variations of
the same triad, with U
for up, D for down,
C for circle, and T for
transitive (i.e.,
having 2 paths that
lead to the same
endpoint).
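The three-digit codes can be reproduced with igraph's `triad_census()`, which counts the 16 directed triad types. The sketch below uses a hypothetical graph with one transitive triad. The labels are attached by hand in the order given in the igraph documentation, which is an assumption worth checking against your installed version.

```r
library(igraph)

# Hypothetical directed graph with a single transitive triad: A->B, B->C, A->C
g <- graph_from_literal(A -+ B, B -+ C, A -+ C)

census <- triad_census(g)
names(census) <- c("003", "012", "102", "021D", "021U", "021C", "111D", "111U",
                   "030T", "030C", "201", "120D", "120U", "120C", "210", "300")
census[census > 0]   # 030T: 0 mutual, 3 one-way, 0 missing edges, transitive
```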
Measure of Triadic Closure
Nodes in a graph
usually have multiple
triads each.
Therefore, there is a
need to measure
quantitatively the
overall degree of triadic
closure for each node.
Clustering coefficient
(CC) is the most
frequently used
measure for the
purpose.
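As an illustration of the clustering coefficient, the sketch below computes the local CC for each node of a small hypothetical graph. The node labels echo the slide's figure, but the edges are assumed, not taken from it.

```r
library(igraph)

# Hypothetical graph: A is tied to B, C and D; only B and C are tied to each other
g <- graph_from_literal(A - B, A - C, A - D, B - C)

# Local CC of a node = closed triads around it / possible triads around it
cc <- transitivity(g, type = "local", vids = V(g))
names(cc) <- V(g)$name
cc
```

Node A has three pairs of neighbors, of which only one pair (B and C) is connected, so its CC is 1/3. Node D has a single neighbor, so its CC is undefined (NaN).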
Triadic Closure: Driven by Social Selection or Social Influence?
(a) Triadic closure: Person, Person, Person
(b) Focal closure: Person, Person, Focus (e.g., recommended books on Amazon)
(c) Membership closure: Person, Person, Focus (e.g., recommended groups on Facebook)
Goals of Social Network
Analysis
Perer & Shneiderman (2008):
1. Overall network metrics (e.g., number of nodes, number of edges, density, diameter): global
2. Node rankings (degree, betweenness, closeness centrality): individual
3. Edge rankings (weight, betweenness centrality): local
4. Node rankings in pairs (degree vs. betweenness, plotted on a scattergram): local
5. Edge rankings in pairs: local
6. Cohesive subgroups (finding communities): local
7. Multiplexity (analyzing comparisons between different edge types, such as friends vs. enemies): cross-levels
Levels of Network Analysis
Individual-level: nodes, focusing on
who are the most
popular/important/influential nodes in
the network?
Local-level: groups (or clusters,
communities, components, etc.),
focusing on how are the nodes
clustered/grouped together?
Global-level: network, focusing on how
densely/closely is the network
connected as a whole?
Individual-level Analysis
Find popular/important/influential nodes
usually based on centrality metrics
Degree centrality: How many nodes are you
connected to?
Closeness centrality: How close are you to
other nodes?
Betweenness centrality: How many shortest paths pass through you?
Eigenvector centrality: How many important nodes are around you?
Interpretation of Centrality Scores
High centrality scores: Individuals with high centrality scores are often more likely to be:
leaders
key conduits of information
early adopters of anything that spreads in a network
Low centrality scores: Individuals with low scores are in peripheral positions:
who may be protected from negative contagion and influence
who may be associated with less work overload in an organization
Example: Krackhardt's Kite Graph
A network of 10 nodes and 18 edges:
Who has the highest degree centrality?
Who has the highest betweenness centrality?
Who has the highest closeness centrality?
Degree Centrality
Number of neighbors a node is directly connected to
Indicates how well the node is connected within the graph
Degree of G = 6
Betweenness Centrality
The number of
shortest paths
between pairs of
other nodes through a
node (as compared
with total number of
shortest paths in the
graph)
Indicates how critical
the node is to the flow
of information or
resource in the graph
Betweenness of H = 14
Closeness Centrality
Total number of steps along the shortest paths from the focal node to all other nodes (a smaller total means a more central node)
Indicates how quickly information travels between the node and anyone else in the graph
Closeness of D and E = 14 each
14 = 1*5 + 2*3 + 3*1 (5 nodes at 1 step, 3 nodes at 2 steps, 1 node at 3 steps)
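The three centrality values quoted for the kite graph can be reproduced in R. The node labels in the original figure are not fully recoverable. The sketch below therefore uses an assumed labeling of the standard kite graph, chosen so that G has degree 6, H has betweenness 14, and D and E have total distance 14.

```r
library(igraph)

# Kite graph under an ASSUMED labeling (10 nodes, 18 edges) chosen to match
# the values quoted on the slides; the original figure's labels may differ
edges <- c("A","B", "A","C", "A","D", "A","G", "B","E", "B","F", "B","G",
           "C","D", "C","G", "D","E", "D","G", "D","H", "E","F", "E","G",
           "E","H", "F","G", "H","I", "I","J")
g <- graph_from_edgelist(matrix(edges, ncol = 2, byrow = TRUE), directed = FALSE)

deg     <- degree(g)
btw     <- betweenness(g, directed = FALSE)
farness <- rowSums(distances(g))   # total shortest-path distance to all others

deg["G"]
btw["H"]
farness[c("D", "E")]
```

The node with the highest degree (G) is not the one with the highest betweenness (H) or the smallest total distance (D and E), which is the point of the kite example.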
Eigenvector Centrality
The extent to which a node
is a big fish connected with
other big fish in a big pond.
Calculated by assessing
how well connected a node
is to the parts of the
network with the greatest
connectivity.
Nodes with high
eigenvector scores have
many connections who
have many connections,
etc., similar to the logic of
Google PageRank.
Highly connected individuals
within highly interconnected
clusters, or big fish in big
ponds, have high eigenvector
centrality.
Group-level Analysis
Central Question: How are nodes
clustered (grouped) together?
Based on clustering analysis, a method to merge n nodes into g groups such that:
the nodes within the same group are maximally similar or homogeneous
the nodes in different groups are maximally different or heterogeneous
Process of Clustering Analysis
At step 1, there are 10 clusters, each with a node that is uniquely different from all others.
At step 2, nodes 1 and 2 are considered to be similar enough to form a cluster; the same goes for nodes 9 and 10. There are now 8 clusters.
At step 3, node 3 joins the cluster of 1 and 2, and node 8 joins the cluster of 9 and 10. The process continues until every node is included in one giant cluster at step 6.
An optimal solution is to keep a small number of clusters with maximal similarity within and maximal difference between.
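The merging process described above can be sketched with base R's `hclust()`. The ten one-dimensional positions are hypothetical and were picked so that the first merges follow the same order as on the slide. Average linkage is one of several possible choices and is an assumption here.

```r
# Hypothetical one-dimensional positions for 10 nodes; closer values = more similar
x <- c(1.0, 1.2, 1.9, 3.3, 4.7, 6.0, 7.4, 8.8, 9.7, 10.0)
names(x) <- 1:10

# Agglomerative clustering: start with 10 singleton clusters and merge step by step
hc <- hclust(dist(x), method = "average")

m8 <- cutree(hc, k = 8)   # nodes 1 and 2 merged, nodes 9 and 10 merged
m6 <- cutree(hc, k = 6)   # node 3 has joined {1, 2}; node 8 has joined {9, 10}
m8
m6
```

Calling `plot(hc)` draws the dendrogram, which helps in choosing how many clusters to keep.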
Island Method for Group Detection
By raising the threshold of edge strength (e.g., mean, median, or k standard deviations above the mean), an increasing number of groups (communities) will emerge successively from a giant connected component.
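A minimal sketch of the island method follows. It assumes a hypothetical weighted network of two tightly knit triangles joined by one weak tie. The helper `islands()` is introduced here for illustration only and is not a function of any package.

```r
library(igraph)

# Hypothetical weighted network: two tightly knit triangles joined by one weak tie
el <- data.frame(from   = c("A", "A", "B", "D", "D", "E", "C"),
                 to     = c("B", "C", "C", "E", "F", "F", "D"),
                 weight = c(5, 4, 5, 6, 5, 4, 1))
g <- graph_from_data_frame(el, directed = FALSE)

# Drop edges weaker than the threshold and count the remaining "islands"
islands <- function(graph, threshold) {
  kept <- delete_edges(graph, E(graph)[E(graph)$weight < threshold])
  components(kept)$no
}

n_all  <- islands(g, threshold = 0)                 # one connected component
n_mean <- islands(g, threshold = mean(el$weight))   # the two triangles separate
n_high <- islands(g, threshold = 5.5)               # only the strongest tie survives
c(n_all, n_mean, n_high)
```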
Group-level Metrics in
NodeXL
Vertex counts
Edge counts
Geodesic distances
Group density
Number of edges between each pair
of groups
Global-level Analysis
Key question: How
densely or closely
connected is the
network as a whole?
Fig a (top): connected
based on 67%
agreement
Fig b (bottom):
connected based on
75% agreement
Global-level Metrics in NodeXL (1)
Graph Type: Directed or undirected.
Vertices: The number of vertices in the graph.
Unique Edges: The number of edges that do not have duplicates.
Edges With Duplicates: The number of edges that have duplicates.
Total Edges: The number of edges in the graph. This is the sum of Unique Edges and Edges With Duplicates.
Self-Loops: The number of edges that connect a vertex to itself.
Reciprocated Vertex Pair Ratio: In a directed graph, this is the N of vertex pairs that have edges in both directions divided by the N of vertex pairs that are connected by any edge. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined.
Reciprocated Edge Ratio: In a directed graph, this is the number of edges that are reciprocated divided by the total number of edges. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined and is not calculated.
Connected Components: The number of connected components in the graph. A connected component is a set of vertices that are connected to each other but not to the rest of the graph.
Global-level Metrics in NodeXL (2)
Single-Vertex Connected Components: The N of connected components that have only one vertex.
Maximum Vertices in a Connected Component: The N of vertices in the connected component that has the most vertices.
Maximum Edges in a Connected Component: The N of edges in the connected component that has the most edges.
Maximum Geodesic Distance (Diameter): The maximum geodesic distance among all vertex pairs, where geodesic distance is the length of the shortest path between two vertices.
Average Geodesic Distance: The average geodesic distance among all vertex pairs, where geodesic distance is the distance between two vertices along the shortest path between them.
Graph Density: A ratio that compares the N of edges with the maximum N of edges the graph would have if all the vertices were connected to each other. Duplicate edges and self-loops are ignored.
Modularity: When the graph has groups, this is a measure of the "quality" of the grouping. Graphs with high modularity have dense connections among the vertices within the group but sparse connections among vertices in different groups. When the graph does not have groups, this is undefined.
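Several of the NodeXL metrics above have direct counterparts in igraph. The sketch below computes a few of them on a hypothetical directed graph with 4 vertices and 5 edges. The correspondence to NodeXL's exact definitions is assumed rather than verified.

```r
library(igraph)

# Hypothetical directed graph with 4 vertices and 5 edges
el <- matrix(c("A", "B",
               "B", "A",
               "B", "C",
               "C", "D",
               "D", "B"), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = TRUE)

n <- vcount(g)
m <- ecount(g)
density_by_hand <- m / (n * (n - 1))      # directed graph: n(n-1) possible edges
dens  <- edge_density(g)                  # the same ratio computed by igraph
recip <- reciprocity(g)                   # share of edges that are reciprocated
diam  <- diameter(g, directed = FALSE)    # longest shortest path, ignoring direction
ncomp <- components(g, mode = "weak")$no  # number of weakly connected components
c(density = dens, reciprocity = recip, diameter = diam, components = ncomp)
```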
HANDS-ON TUTORIALS
use R! aRe you suRe?
NodeXL and R
Super R logo. Source: www.redbubble.com/
R for Web Data Analysis
Task: CRAN task view; R packages
Network Analysis: igraph, statnet, RSiena
Spatial Analysis: http://cran.r-project.org/web/views/Spatial.html; sp, spatial, OpenStreetMap, RgoogleMaps
Temporal Analysis: http://cran.r-project.org/web/views/TimeSeries.html, http://cran.r-project.org/web/views/SpatioTemporal.html; tseries, forecast, urca, wavelets, SpatioTemporal
Text Mining: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html; tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe
Machine: http://cran.r-; nnet, rpart, trees, party,
I came, I saw, and I walked away?
Picture: Gareth Jenkins/Solent
http://www.telegraph.co.uk/news/picturegalleries/picturesoftheday/8561204/Pictures-of-the-day-7-June-2011.html?image=6
Figure from the movie Daddy Day Care (2003)
http://img0.joyreactor.com/pics/post/gif-eddie-murphy-reaction-gifs-party-394848.gif
Plunge into the water!
HANDS-ON!
Demo 1. Software Installation
Download and install R, RStudio, and NodeXL
http://cran.r-project.org/
https://www.rstudio.com/ide/
http://nodexl.codeplex.com/
Learn the basics of R
http://tryr.codeschool.com/
More information
https://www.rstudio.com/training/online.html
NETWORK ANALYSIS
(ADVANCED-LEVEL)
Network Topology
Regular or random?
Regular network
Nodes are connected in a regular neighborhood with a fixed number k of edges per node
They do not exhibit the small-world characteristics
They may exhibit clustering
Random network
Random networks have randomly connected edges; each node has the same average number of edges
They exhibit the small-world characteristics
They do not exhibit clustering
Small-World Networks
Between order and chaos
Network generation
Watts and Strogatz (1998) propose a model for networks between order and chaos
The model is built by simply re-wiring at random a small percentage of the regular edges, which dramatically shortens the average path length without destroying clustering
As a result:
The network exhibits the small-world feature, as random networks do
And exhibits clustering, as regular lattices do
Watts and Strogatz (1998)
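The claim that a little rewiring shortens paths while preserving clustering can be checked numerically. This is a sketch using igraph's `sample_smallworld()`. The network size, the neighborhood size, the 5% rewiring probability, and the seed are all arbitrary choices made for illustration.

```r
library(igraph)
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 200
lattice  <- sample_smallworld(dim = 1, size = n, nei = 2, p = 0)     # regular ring lattice
rewired  <- sample_smallworld(dim = 1, size = n, nei = 2, p = 0.05)  # 5% of edges rewired
random_g <- sample_gnm(n, ecount(lattice))                           # same size, fully random

data.frame(
  network    = c("regular", "small-world", "random"),
  avg_path   = c(mean_distance(lattice), mean_distance(rewired), mean_distance(random_g)),
  clustering = c(transitivity(lattice), transitivity(rewired), transitivity(random_g))
)
```

The rewired network should show an average path length much closer to the random graph and a clustering coefficient much closer to the regular lattice.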
Scale-free network
Power law
Long-tail distribution
$P(k) \sim k^{-a}$, $0 < a < 2$
$\log P(k) \sim -a \log k$
Zipf distribution
Pareto distribution
Properties
Scale-invariance
$P(ck) \sim (ck)^{-a}$
Thus, $P(ck) \sim c^{-a} k^{-a}$
$P(ck) \propto k^{-a}$
No average
Universality
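The two properties above, a straight line on log-log axes and scale invariance, can be verified with exact power-law probabilities in base R. The exponent a = 1.5 and the range of k are assumptions chosen to fall inside the interval given on the slide.

```r
# Exact power-law probabilities over k = 1..1000 with an assumed exponent a = 1.5
a <- 1.5
k <- 1:1000
P <- k^(-a) / sum(k^(-a))

# On log-log axes the relationship is a straight line with slope -a
fit <- lm(log(P) ~ log(k))
slope <- coef(fit)[["log(k)"]]
slope

# Scale invariance: rescaling k by a constant multiplies P by the constant c^(-a)
c_scale <- 2
ratio <- P[c_scale * (1:500)] / P[1:500]
range(ratio)   # the same value, c_scale^(-a), for every k
```

With empirical degree data the log-log regression is only a rough diagnostic, since noise in the tail makes the fitted slope unreliable.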
Barabási, Albert, and Jeong, Scale-free characteristics of random networks: The topology of the world-wide web, Physica A, 281, 2000, pp. 69-77.
Demo 2. Generate the Network
R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
install.packages("igraph")  # only needed once
library(igraph)
size = 50
# regular structures
g = graph.tree(size, children = 2); plot(g)
g = graph.star(size); plot(g)
g = graph.full(size); plot(g)
g = graph.ring(size); plot(g)
g = connect.neighborhood(graph.ring(size), 2); plot(g)
# random network
g = erdos.renyi.game(size, 0.1); plot(g)
# small-world network: a ring lattice with a small share of edges rewired
g = watts.strogatz.game(1, size, 2, 0.05); plot(g)
# scale-free network
g = barabasi.game(size); plot(g)
The Political Blogosphere vs. Congressmen's Retweet Network
L. A. Adamic and N. Glance, 'The Political Blogosphere and the 2004 U.S. Election: Divided They Blog', LinkKDD 2005
Peng, Zhu, Liu, Wu, Liu (2014) Friendship, Interaction networks and Vote agreement of congressmen in the United States. 7th APNC, Montreal, Canada
How to Represent a Network?
(Figure: a graph with nodes A to F and edges e1 to e6)
Edge list:
A, B
A, D
A, C
C, D
C, E
C, F
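The edge list on this slide can be turned into a graph object and an adjacency matrix, the two representations used in the demos. This is a sketch with igraph. Treating the edges as undirected is an assumption, since the slide does not state the direction.

```r
library(igraph)

# The edge list from the slide, one row per edge
el <- matrix(c("A", "B",
               "A", "D",
               "A", "C",
               "C", "D",
               "C", "E",
               "C", "F"), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = FALSE)   # direction is an assumption

# The same network as an adjacency matrix (1 = edge present, 0 = absent)
adj <- as_adjacency_matrix(g, sparse = FALSE)
adj
neighbors(g, "A")$name
```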
Demo 3. Describe the Network
Compute graph metrics using NodeXL
Step 1: Paste the edge list into the worksheet.
Step 2: Set node attributes. Paste the node attribute and name it "party". Remember to switch to the Vertices window first.
Step 3: Calculate graph metrics. Click "Graph Metrics". Remember to set the graph as directed.
Step 4: Set vertex color and vertex size. Remember to set "party" as a categorical variable.
Demo 3. Describe the
Network
R script
http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/
Graph Statistics
Centrality Measures
Algorithms of graphs
Shortest path
Connected component algorithms
The exponential random
graph model (p*)
An ERGM (p*) model is a statistical model for the
ties in a network
Independent (pairs of) ties (p1, Holland and
Leinhardt, 1981; Fienberg and Wasserman, 1979,
1981)
Markov graphs (Frank and Strauss, 1986)
Extensions (Pattison & Wasserman, 1999; Robins,
Pattison & Wasserman, 1999; Wasserman &
Pattison, 1996)
New specifications (Snijders et al., 2006; Hunter
& Handcock, 2006)
Why do we use stochastic
network models?
To capture complex social phenomena that are caused by both regularities and randomness.
To infer whether certain network signatures will
appear more often than by chance
To distinguish between different social processes
(e.g. homophily vs. structural balance)
To better understand the way local social
processes interact and combine to shape global
network patterns
Deterministic approaches are not always good
enough
Procedures of ERGM
Assume we have an observed network of size n. What are the
mechanisms driving the formation of our network (e.g. reciprocity,
transitivity)?
Given those mechanisms, are some network configurations (e.g.
mutual dyads, transitive triplets) more common than you would
expect by chance?
Include a parameter for each configuration in the model. Parameter
values will help us identify a probability distribution for all graphs of
size n. (e.g. if we have a high value for the reciprocity parameter,
graphs that have a lot of mutual dyads will be more probable than
ones that do not)
Estimate the parameters: find the parameter values that best match
the observed network. We do that using MCMC-MLE: Markov Chain
Monte Carlo Maximum Likelihood Estimation techniques.
Once we have our probability distribution, we can draw random
graphs from it and compare any of their characteristics to those of our
observed network.
http://www.kateto.net/wordpress/wp-content/uploads/2012/12/COMM%20645%20-%20ERGM.pdf
Network Configurations: Undirected Networks
Edge
2-star
3-star
4-star
...
K-star
Triangle
Network Configurations: Directed Networks
Arc
Reciprocity
Isolate
2-mixed star
2-in star
2-out star
...
K-in star
K-out star
Transitive triad
Cyclic triad
Exponential Random Graph Models (ERGM)
$P(Y = y \mid X) = \exp\{\theta^{T} g(y, X)\} / k(\theta)$
Y: all the possible ties
y: the observed ties
X: node attributes
g(y, X): network configurations (a vector)
θ: a vector of model parameters
k(θ): normalizing constant
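To make the model concrete, consider its simplest case, where g(y) contains only the number of edges. Ties are then independent, and the ERGM reduces to an intercept-only logistic regression on dyads. The sketch below uses base R on a hypothetical 6-node network. It is not the ergm package workflow used in the demo, only an illustration of what the edges parameter means.

```r
# Hypothetical undirected network on 6 nodes, stored as a symmetric adjacency matrix
A <- matrix(0, 6, 6)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
A[ties] <- 1
A[ties[, 2:1]] <- 1

# One observation per dyad (upper triangle): is the tie present or not?
y <- A[upper.tri(A)]

# With g(y) = number of edges only, the model is a logistic regression
# with an intercept and no other predictors
fit <- glm(y ~ 1, family = binomial)
theta_edges <- coef(fit)[["(Intercept)"]]

density <- mean(y)
c(theta = theta_edges, logit_density = log(density / (1 - density)))
```

The estimated edges parameter equals the log-odds of the network density. Adding configurations such as triangles breaks the independence between ties, which is why estimation then requires MCMC.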
One example
denotes the vector of change statistics
Johan Koskinen (2012) An introduction to ERGM. 8th UKSNA Conference, Bristol
Tie-Network configuration
matrix
edges
2-star
K-star
Triangle
Y1,2
Y1,3
Y2, 3
Yn, n-1
Online Collective Identity
Ackland (2011) Online collective identity. SN
Demo 4. ERGM with R
R script
Load data
Build up network objects
Set the node attributes
Plot the network
Fitting a basic ERG model
http://chengjun.github.io/web_data_analysis/demo4_ergm_analysis/
TEMPORAL ANALYSIS
Time Series Analysis
Time series data can be analyzed within either the time domain or the frequency domain.
Time domain:
ARIMA/VAR analysis
Survival analysis
Multilevel analysis
Frequency domain:
Fourier transformation
Spectrum analysis (comparing a_k and b_k of different time series)
While time domain analysis is routinely conducted, frequency domain analysis is rarely adopted.
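As a small illustration of the frequency domain, the sketch below applies base R's `fft()` to a hypothetical series built from two known cycles and recovers the period of the stronger one. The series, its length, and the two periods are assumptions made for the example.

```r
# Hypothetical series: 120 time points with a 12-step cycle plus a weaker 30-step cycle
n  <- 120
tt <- 1:n
x  <- 2 * sin(2 * pi * tt / 12) + 0.5 * cos(2 * pi * tt / 30)

# Fourier transform: amplitude at each frequency j/n, for j = 1..n/2
amp  <- Mod(fft(x))[2:(n / 2 + 1)] / (n / 2)
freq <- (1:(n / 2)) / n

dominant <- freq[which.max(amp)]
period <- 1 / dominant   # period of the strongest cycle
period
```

Real web time series (e.g., hourly tweet counts) would show peaks at daily and weekly periods rather than a single clean spike.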
Time Series Analysis
Forecasting and Univariate Modeling
Frequency analysis
Decomposition and Filtering
Seasonality
Stationarity, Unit Roots, and Cointegration
Nonlinear Time Series Analysis
Dynamic Regression Models
Multivariate Time Series Models
Survival Analysis of
Blogging Behavior
Source: Zhu et al., ICA 2010
SPATIAL ANALYSIS
Spatial Analysis
Spatial Data:
Location names
IP addresses
Map visits
GPS usage
etc.
Spatial Analysis:
Spatial clusters/patterns (by visual inspections)
Spatial autocorrelation
Spatial regression
Spatial dependence (correlation between nearby locations)
Spatial interaction (correlation between geo-coded variables)
Well-developed for offline data but under-developed/utilized for web data beyond visual inspections.
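Spatial autocorrelation is commonly summarized with Moran's I. The sketch below implements it in base R for a hypothetical set of 5 locations along a line with binary neighbor weights. The function `morans_i()` is written here for illustration and is not taken from any package.

```r
# Hypothetical example: 5 locations along a line, each neighboring the next one
x <- c(1, 2, 3, 4, 5)             # value observed at each location
W <- matrix(0, 5, 5)
for (i in 1:4) W[i, i + 1] <- W[i + 1, i] <- 1   # binary contiguity weights

# Moran's I: (n / sum(W)) * sum_ij W_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

i_smooth      <- morans_i(x, W)                  # positive: neighbors are similar
i_alternating <- morans_i(c(1, 5, 1, 5, 1), W)   # negative: neighbors are dissimilar
c(i_smooth, i_alternating)
```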
Geospatial Distribution of the
Communication on Twitter
Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, et al. (2013) The
Geospatial Characteristics of a Social Movement Communication Network.
PLoS ONE 8(3): e55957. doi:10.1371/journal.pone.0055957
Spatial Distribution of Tweets in
Milan
kernel smoother of point density
Are the tweets randomly distributed?
Temporal Distribution of Tweets
in Milan
SENTIMENT ANALYSIS
Sentiment Analysis
Decompose sentiment
Emotion
Joy
Surprise
Anger
Sadness
Fear
Disgust
Polarity
Positivity
Negativity
Neutral
Lexicon method
Carlo Strapparava and Alessandro Valitutti's emotions lexicon
Janyce Wiebe's subjectivity lexicon
Bing Liu's polarity lexicon
Supervised machine learning
Combine lexicon and machine learning
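The lexicon method can be sketched in a few lines of base R. The six-word lexicon and the three example texts below are hypothetical stand-ins. A real analysis would load one of the published lexicons named above.

```r
# Tiny hypothetical polarity lexicon (real lexicons contain thousands of entries)
lexicon <- c(good = 1, great = 1, happy = 1, bad = -1, terrible = -1, sad = -1)

score_polarity <- function(text) {
  # Lowercase, replace punctuation with spaces, and split into words
  words <- strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), " +")[[1]]
  score <- sum(lexicon[words], na.rm = TRUE)   # words outside the lexicon count as 0
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("What a great day, so happy!",
            "Terrible service, bad food.",
            "The meeting is at noon.")
labels_out <- sapply(tweets, score_polarity, USE.NAMES = FALSE)
labels_out
```

Simple word counting ignores negation ("not good") and sarcasm, which is one reason to combine the lexicon with supervised machine learning.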
Sentiment in the Tweet Stream
Miller (2011) Social scientists wade into the tweet
stream. Science
Twitter Mood Predicts the Stock
Market?
Decompose
sentiment
Emotion
Calm
Alert
Sure
Vital
Kind
Happy
Bollen et al. (2011) Twitter mood predicts the stock market. JOCS
Calm Sentiment Predicts the
Stock Market
Demo 5. Sentiment Analysis
with Supervised Machine
Learning
R script
http://chengjun.github.io/web_data_analysis/demo5_sentiment_analysis/
Figure source: http://courtneylambert.co/official-twitter-stats-from-chirp
REFLECTION ON WEB
DATA ANALYSIS
Google Correlate & Google Flu
Prediction
http://www.google.com/trends/correlate/comic?p=2
Nature reported that Google Flu Trends (GFT) was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2).
Lazer et al. (2014) The parable of Google Flu: Traps in big data analysis. Science
Tweet Sentiment and U.S. Election 2012
Figures source: election.twitter.com
Facebook Insight and Twitter
Mention
Result
Facebook Insight
http://www.cnn.com/election/2012/facebook-insights/
http://www.zerogeography.net/2012/11/obama-wins-election-ontwitter.html
http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html
Predict Political Orientation
with Machine Learning
Colleoni et al. (2014) Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter. JOC
To Move on
R Style Guide: http://adv-r.had.co.nz/Style.html
R-bloggers
Stack Overflow
GitHub
Yet, It Is Not Finished
Creating an R package (intro) http://gastonsanchez.com/teaching/