Big Data Analytics
MS4252
Social Network Analysis III
Sources: Jennifer Golbeck 2013. Analyzing the Social Web.
Elsevier (chapter 7, 8, 9, 13)
1
Chapter 7
UNDERSTANDING STRUCTURE
THROUGH ATTRIBUTE AND
BEHAVIOR
2
Describe the Network
To really understand what’s happening in a
network, look beyond the structure at
users’ attributes, behaviour, the content they are
sharing, and their interactions
Analyzing this content can lead to many insights
into the meaning of network structure
3
Attribute, Behavior, and Content
Attributes are characteristics of a node
Age, gender, or location
Person’s religion or political preferences
Behavior refers to actions of nodes
How frequently she posts,
Who and how many people she follows, or
How often she clicks on shared links.
4
Attribute, Behavior, and Content
Content is a broader term that refers to the
nonstructural information about a node
Includes any information about the nodes or edges
beyond the structural features.
Combination of attributes and behavior, like the types of
comments they post online or the topics they discuss.
If a node represents a video or picture instead of a
person, the content would include the things depicted in
the video.
5
Structural Analysis
Identify those clusters
provide statistics about the network properties (like
density and connectivity)
about the importance of individual nodes with measures
(like centrality)
6
How to interpret the three clusters?
Built from the photo sharing website Flickr
On Flickr, photos are labeled with descriptive
keywords called tags
Nodes represent tags and an edge between tags
indicate that they were used to describe the
same image.
e.g. if an image is tagged with the word “desk”
and “keyboard” the network would show a line
connecting those two words.
Network is a 1.5 egocentric network of a single
tag
7
Now with content
8
Break down into Egocentric Network
The network of three months of discussion on the CSS-Discuss
mailing list. Node size reflects the node’s out-degree in this directed
network. Edge means the the person has replied to another.
9
Chapter 8
BUILDING NETWORKS
10
First Question?
What do the nodes represent?
What do the edges represent?
Know this before doing anything with data!
11
Example
Facebook Network
What are the nodes and edges?
What can you see if you were able to visualize this
network?
What would network statistics mean in this network?
What patterns might emerge?
12
Network with Multiple node types
Bipartite graphs have two node types that do not
have connections within the type
E.g. no people connected to one another
Graphs can have multiple node types and not be
bipartite
13
To build a network
Step 1: Define Nodes
What are they?
What are the criteria for being included?
Example, build a social network of people in
Company X.
Full time employees
Part-time employees
Contractor who are hired to work temporarily
…
14
To Build a Network
Step 2: Define Edges
What does a edge represent?
What is the criteria for adding one?
Same Example: Employee network at Company
X, two people are connected if
Two people work in the same dept and work closely on
many projects.
One person works for another.
Two people participate in department-wide discussions
on a mailing list.
15
Sampling Methods
Some networks may be too big to analyze in
their entirety.
Millions of nodes and edges are difficult to understand,
impossible to visualize.
Sampling-selecting a subset of nodes and edges
—is an effective and common way of obtaining a
reasonably sized dataset from a large one.
Random sampling
Snowball sampling
Egocentric analysis
16
Example: Enron email network
Node: messages
Edge: any pair of nodes that have exchanged at least 10
emails.
17
Random Sampling
Select a certain percentage of nodes and keep all edges
between them, or
Select a certain percentage of edges and keep all the
nodes that are mentioned.
Random Edge Sampling
50% 25% 10%
18
Random Sampling
Problems
Edge sampling biased toward high degree nodes
Node sampling loses some structural characteristics.
Benefits
Reduce the network to a smaller size in an even way
A picture of the overall patterns of relationships and
clusters can be seen
Node sampling keeps some network statistical features
19
Snowball Sampling
A technique commonly used in sociology where
participants are interviewed
When working with a large network, choose a
starting node (initial seed node)
Get that node, its connections, their connections, and so
on until the network is the right size for analysis
20
Snowball Sampling
Problems
Biased toward the part of the network sampled, may
miss other features
Benefits:
Easy to do, common
21
Egocentric Network Analysis
Instead of looking at the whole network, look at
the egocentric networks of some nodes.
Randomly selected egocentric networks, or
The networks of individuals selected based on certain
characteristics
A different type of analysis than overall network
analysis, but it shows the role of an individual in
context
22
Egocentric Network Analysis
A 1.5 egocentric network of a person who posts
in both English and Spanish
People who post only in Spanish are shown in black,
those posting only in English are in white, and people
using multiple languages or a third language are in
gray.
23
Chapter 9
ENTITY RESOLUTION AND LINK
PREDICTION
24
Overview
Link Prediction is a method of analysis that detects
where missing links should be present in the network.
Entity resolution is a technique for merging nodes that
represent the same person.
25
Link Prediction
Data often has errors in it, including missing
links,
and link prediction could identify places where an
analyst might want to check to confirm that there
is no edge between a pair of nodes
It can also be used to identify people.
The frequency to attend the meeting
26
Systematic methods for predicting link
If we have two nodes, A and B, then the
score(A,B) indicates how closely connected A and
B are in the graph
After computing the score for every pair of
nodes, the algorithm returns a ranked list
The pairs with the highest score are predicted to
be the most likely new edges.
27
Score Method
Shortest Path Length
Common Neighbors
Jaccard Index
Adamic/Adar
Preferential Attachment
28
Score – Shortest Path
One of the simplest ways to score the similarity
or closeness of two nodes is to use the shortest
path length between them
Score(A,B) = - shortestPath(A,B)
29
Score – Common Neighbors
Another way of computing score
uses more information from the network structure is to
count the number of common neighbors between the
two nodes in a pair
Score(A, G) = Neighbor(A) ∩ Neighbor(G)
30
Score – Common Neighbors
The number of common neighbors makes social
sense, too.
The more friends two people have in common, the more
likely they are to be introduced to one another.
However, some people have an abnormally high
number of connections in social networks, like
singers
many friends in common may not mean much since
they are connected
31
Score – Jaccard Index
The Jaccard Index counts the total number of
friends in common and divides that by the total
number of people who are friends of either node
32
Example
Four nodes: Alice, Bob, Chuck, and Dave.
Let Alice and Bob be celebrities, each with 1 million
friends; Chuck and Dave are average users with 100
friends each.
Now say Alice and Bob have 2,000 friends in common
while Chuck and Dave have only 20 friends in common.
Although Alice and Bob may seem to be more strongly
connected than Chuck and Dave, since they have 100
times more common friends, the Jaccard Index
indicates this is not the case
What is Jaccard Index for Alice and Bob? How
about Chuck and Dave?
What if the 20 people Chuck and Dave know in
common are also celebrities
33
Score – Adamic/Adar
Sum up the inverse log of each neighbor’s degree
(common)
Giving more weights to friends with fewer edges.
The formula is
34
Score – Preferential Attachments
The network principle states nodes with a high degree
are more likely to gain new links.
Popular nodes are more likely to gain new friends than less
popular nodes.
When predicting edges, preferential attachments
suggest that nodes with high degree are more likely to
gain new edges.
35
Score – Preferential Attachments
The formula for this score method is relatively is
With this measure, we would predict that the
next edge to appear will be between nodes A and
G.
36
Entity Resolution
Identify nodes that represent the same entity
and then to merge them together
Entity resolution is a technique for merging
nodes that represent the same person.
John and J. Smith are actually the same person
37
Scoring Techniques
For each pair of nodes, we can compute a score that
represents the likelihood that they are the same node.
Then we can set a threshold value.
Any pair of nodes with score above the threshold will be
merged.
38
Scoring Techniques
To create a score for each pair of nodes, we will
consider similarity on a set of attributes.
For each pair of nodes, record 1 if their values match and 0
otherwise for a given attribute.
However, matching on some attributes is more important than
on others (weights).
Then, we add the positive weights for each attribute where
nodes match, and add the negative weights when they do not
match.
39
Scoring Techniques
Example: two nodes (J Smith and John Smith) and five
attributes
The vector of match/nonmatch is (0, 1, 0, 0, 1)
Suppose the weights are ()
We know that because a match on SSN is much
definitive.
40
Scoring Techniques
We are left to find a method for computing the weight
for each attribute.
We need two probabilities.
Probability : the probability that the two nodes will match on a
attribute by chance.
Example, same birth month is 1/12
Probability : the probability that two nodes that represent the
same person will have the same value (close to 1).
Example, the m probability for the birth month is 0.95
41
Scoring Techniques
Once we have the and probabilities, we need to turn
them into weights.
There are two weights for each attribute.
The common formulas are:
For a match,
For a nonmatch,
For the month attribute, for a match and for a
nonmatch.
42
Score for partial match
Return to the `John Smith’ and `J Smith’ example, while
their first names are not an exact match, they are
closed.
We may label this as a partial match.
For example, if we say `J’ is a 0.3 match for `John’, we would
add 0.3 times the weight for a name match.
In this approach, the formula that a partial match with
value we would add the following to the score:
Comparison? 43
Application
Link Prediction
Friend Recommendation
Entity Resolution
Finding duplicated account
https://sensorstechforum.com/new-facebook-scam-fake-duplicate-accounts-fraud/
44
Incorporating Network Data
Relational data is useful for enhancing the attribute-
based similarity!
45
Decision to merge
The results from these similarity measures can be used
in addition to attribute data.
Jointly use two similarity scores!
46
Chapter 13
SOCIAL INFORMATION FILTERING
47
Social Information Filtering
Some of the information on the social media will be
more useful than the rest, and the sheer volume means
a filter would be useful to sort through it.
This chapter introduces methods used to sort, filter, and
aggregate information from the social media.
Two major approaches:
Social voting
Recommender systems
48
Social sharing and social filtering
Social-sharing websites, like Digg, Slashdot, Reddit, are
designed to share interesting content.
The community then votes items up or down, and the
most interesting links are highlighted.
The reliance on large number of people to help
complete a task is a type pf crowdsourcing.
49
Automated recommender systems
Amazon, Netflex, Pandora, use recommender systems
(RS) to suggest items a user might like.
The RS leverages a user’s previous activity, either on
explicit data such as user ratings, or data implicitly
captured from user’s behavior such making a purchase
or viewing an item.
The core idea behind RS is to take data created by other
people and personalize it for the individual user.
50
Traditional recommender systems
Recommender systems work in one the two ways:
Suggesting items similar to the ones a person likes (item-based)
Suggesting item liked by people who are similar to the user
(user-based; collaborative filtering)
Item based recommendation is not very social, but the
user-based RS relies on other people’s actions.
51
Traditional RS: example
How to suggest new items to Alice? How much she
might like movie Vertigo?
Compute the correlation between Alice and Bob (0.26),
and Alice and Chuck (0.83)!
Vertigo
?
3
4 52
Social Recommender Systems
Collaborative Filtering: an example of how algorithms
can leverage data from crowd!
With the rise of social media, RS have started to
consider social connections in addition to similarity.
Example: In twitter, when a user searches for a term,
the search results can be shown in three ways:
All tweets that match the search
‘Top’ tweets
Tweets only posted by people the user knows
53
Case Study: Reddit Voting
Social News Website
Users vote stories up or down
How to count votes?
Using the total is biased to older posts
Using percentage can skew high or low for new posts
54
Reddit
55
Case Study: Reddit Voting
Solution:
Estimate how many votes a post would get over time
Use a 95% confidence interval to forecast.
When Reddit is relatively sure a new post will get more
up votes than an older one, it ranks it higher.
56
Case Study: Trust-based RS
57
Trust-based Recommender Systems
Trust-based recommenders, like FilmTrust, use social
network information.
Instead of computing similarity between two users as
the weight, we can use trust value from social network
data.
The system estimates how much the user might trust
other people’s opinion. How?
58
Trust
Alice does not know Dave.
Dave is trustworthy at a level 3 out of 10, and Chuck gives Dave
a 10 out of 10
Alice could average these values, and estimate Dave’s
trustworthiness at (10+3)/2=6.5.
However, Alice trusts Bob’s opinion far more than Chuck’s.
59
Weighted average
Alice has given Bob a higher trust rating
Weighted average
60
Example with People
Bob, Chuck, and Dave all saw and rated the move
`Night of the Living Dead’.
Their ratings are as follows on a five-star scale:
Bob: 5 stars
Chuck: 2 stars
Dave: 4 stars
Recommended rating:
61
How good is a RS?
Generally: RMSE
Error = My rating – Recommended Rating
Compute the RMSE
Back to Example, when user’s rating of a movie moved away
from the average rating, the trusted based rating greatly
outperformed collaborative filtering algorithm.
Need alternative ways of evaluating systems
Serendipity over accuracy
Diversity
62