
Guidelines for Online Network Crawling: A Study of Data Collection Approaches and Network Properties


Katchaguy Areekijseree, Ricky Laishram and Sucheta Soundarajan
Department of Electrical Engineering and Computer Science, Syracuse University
Syracuse, NY
{kareekij,rlaishra,susounda}@syr.edu

ABSTRACT

Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.

KEYWORDS

Experiments; Online Sampling Algorithm; Network Crawling; Network Sampling; Complex Networks.

ACM Reference Format:
Katchaguy Areekijseree, Ricky Laishram and Sucheta Soundarajan. 2018. Guidelines for Online Network Crawling: A Study of Data Collection Approaches and Network Properties. In WebSci '18: 10th ACM Conference on Web Science, May 27-30, 2018, Amsterdam, Netherlands. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3201064.3201066

1 INTRODUCTION

The study of complex networks has become a critical aspect of research in a wide range of fields. Often, network data is collected from online sources, including online social networking sites (OSNs) such as Facebook or Twitter. Such sites may provide APIs, which are a convenient channel for accessing data. However, the data collection process can require a significant amount of time. Thus, when collecting data, efficiency is extremely important. In this paper, we consider the problem of network sampling through crawling, or online sampling, in which the only way to obtain more information about the network is to query observed nodes for their neighbors, and thus expand the observed network. In this paper, we use network sampling and network crawling interchangeably.

In network crawling, there are a number of important goals, such as finding samples that are unbiased with respect to some property, locating 'important' nodes, or finding a sample that preserves information flow patterns. In this work, however, we focus on the efficiency of the crawling algorithm itself (i.e., how quickly nodes and edges can be discovered through the crawling process). Our first goal is maximizing the total number of nodes observed, which we refer to as the node coverage goal. Our second goal is maximizing the number of edges observed, which we refer to as the edge coverage goal. These two goals are closely tied to other application-specific crawling goals, including crawling for census-type applications (i.e., collecting node or edge attribute information) [10] and crawling to preserve community structure [15].

While network crawling is a critically important step in a network analysis task, it is often difficult for users to select a crawling technique. The literature contains a large number of crawling algorithms, and it is rarely clear which method is best to use for a given network. Existing work has typically attempted to determine which crawling method is the 'best'. In this paper, rather than arguing for the use of one single crawling algorithm, our goal is to investigate the relationship between network structure and crawler performance. More specifically, we investigate the network structural properties that govern a crawler's ability to move between 'regions' of a graph, and demonstrate that these features have a strong effect on the performance of various crawling methods.

However, knowing how structural properties affect the performance of a crawler might not be immediately helpful, because one does not know ahead of time what the properties of the network are! To address this, we consider broad categories of networks, and show that for networks in the same category (which exhibit similar structural properties), the crawling methods tend to have similar performance relative to one another. Based on these properties, we provide general guidelines on selecting a crawling method for a particular network type.

Our work has several novel contributions:

(1) To the best of our knowledge, this is the first work to systematically examine the effect that network structural properties have on the performance of network crawling methods.

(2) We provide an extensive, scientific analysis of the relationship between network structural properties and the performance of popular crawling algorithms. Unlike existing works on sampling algorithms, which typically perform a high-level comparison of crawling methods, our goal is to understand the relationship between network structure and crawler performance.

(3) We provide a framework for classifying network crawling algorithms, based on how their performance changes as network structure changes, and we categorize networks according to their structural properties.

The rest of this paper is organized as follows. In Section 2, we give an overview of existing work in this area. In Section 3, we discuss the problem, preliminaries, the details of the network crawling methods, and the network structural properties of interest. We describe our experiments and discuss the results in Section 4. We present conclusions in Section 5 and future work in Section 6.

2 RELATED WORK

Work on network sampling can be separated into two main scenarios: down-sampling and sampling through crawling. When down-sampling, one possesses the complete network dataset and wishes to scale it down to some desired size (perhaps because the entire dataset is too large to fit into memory or would take too long to analyze). A good sample will maintain the relevant properties and characteristics of the original network.

In the crawling scenario, one starts with no information about the network other than the knowledge of a single node. One can obtain additional information by iteratively performing queries on observed nodes (e.g., through an API) to collect information about the unobserved network. In this way, the observed sample is expanded from the single initially observed node. While both of these problems are often broadly referred to as 'sampling', they require intrinsically different approaches. In this work, we consider algorithms that are used in the crawling scenario.

Due to the vast amount of literature in this area, we cannot provide a complete discussion of the field. For a more comprehensive discussion, we refer the reader to the excellent survey in [2].

Algorithms for Network Crawling: Network crawling has been used extensively to collect data from the Internet and other large networks like the WWW and online social network platforms. One of the largest online social network studies is presented by Mislove, et al., who study four online social network sites (Orkut, Youtube, LiveJournal and Flickr) and collect the data using a BFS crawler [16]. A similar study on other online social networking sites, such as CyWorld and MySpace, is presented by Ahn, et al. in [3]. However, the sampled networks produced by a BFS crawler contain bias. Kurant, et al. suggest a solution for correcting the bias of a BFS crawler [9]. Similarly, Gjoka, et al. propose an approach based on a Metropolis-Hastings random walk for crawling an unbiased sample and use it to collect the Facebook network [7].

There has also been a great deal of interest in network crawling for specific applications. Salehi, et al. introduce PageRank-Sampling [18] to preserve community structure. Avrachenkov, et al. present a greedy crawling method, called Maximum Observed Degree (MOD), for maximizing the number of nodes in the sample [4], by querying the node with the highest observed degree in each step. A similar approach is used by the OPIC algorithm, which considers PageRank as a measurement of importance [1]. Results show that both MOD and OPIC significantly outperform other methods.

Analysis of Sampling Algorithms: There has also been a significant amount of work comparing the performance of sampling algorithms. In [12], Leskovec and Faloutsos study the characteristics of different down-sampling methods and attempt to find the method that produces the smallest bias across a set of defined network properties, arguing that Random Walk is the best at preserving network properties. Similarly, Kurant, et al. analyze the BFS crawler [8]; their experimental results show that the crawler is biased towards high-degree nodes.

Ye, et al. present a large-scale empirical study that compares crawling methods on the basis of performance, sensitivity, and bias, on the tasks of node and edge coverage [19]. Ahmed, et al. provide a detailed framework for classifying sampling algorithms and examine the performance of various algorithms at the task of preserving graph statistics [2]. While these works perform an important evaluation of existing algorithms, they typically conduct a high-level comparison of sampling methods.

In contrast, the purpose of our work is to perform a detailed analysis of the effects of network structure on the performance of crawling algorithms. Rather than evaluating the performance of algorithms, we seek to answer how and why algorithms succeed or fail. Thus, we can give insights and guidelines on how to pick an appropriate method when data collection needs to be performed.

Community Structure: We draw heavily from the literature on community structure in networks. To characterize the strength of communities in networks, we use the popular modularity metric and the corresponding Louvain method for optimizing modularity [5]. Mixing between communities is an important part of our analysis: for example, Leskovec, et al. suggest that communities are well defined and distinct if they are small, but that large communities tend to mix with one another [13].

Our work differs from existing work in important ways: (1) We present a comprehensive analysis of sampling algorithms for the crawling scenario (in contrast to most existing work, which is for down-sampling). (2) Our primary goal is not to determine which existing algorithms are best, but rather to gain insight into the interplay between network structure and crawler performance, and thus understand why certain algorithms perform better or worse. (3) We provide guidelines for selecting an appropriate sampling approach if the network category is known.

3 NETWORK SAMPLING THROUGH DATA CRAWLING

Suppose that we are collecting data from an online social network through its API, and we only have 24 hours to collect data. The process starts with one known user account. As our first step, we must query that user, and the server responds with a list of neighbors. Then, one of these returned users is selected for the next query, and so on. This process is repeated until 24 hours have passed. The problem is to decide which node to select for each query such that the total number of nodes or edges observed is maximized.

3.1 Problem Definition

Let G = (V, E) be a static, unobserved, undirected network. We are given a starting node n_s ∈ V and a total query budget b. To collect data, we may perform a query on a previously observed node. In this work, we assume that in response to a query, we receive all neighbors of the queried node. In each step, the crawler queries an observed-but-not-queried node (i.e., a node that was observed as a neighbor of a previously queried node, but has itself not yet been queried). This process is repeated until b queries have been made. The output is a sample graph S = (V', E'), where V' ⊆ V and E' ⊆ E, containing all nodes and edges observed. Table 1 shows the notation used in the paper.

Table 1: Notations and Definitions.

n_s: A seed node from which the crawler begins.
V_o: The set of nodes that have been observed but not queried ('open' nodes).
V_c: The set of nodes that have been observed and queried ('closed' nodes).
V': The set of all observed nodes, V' = V_o ∪ V_c.
E': The set of all observed edges.
d_v: The degree (number of neighbors) of node v.

We consider two different sampling goals:

Goal 1: Node Coverage - Collect a sample graph S = (V', E'), where V' ⊆ V and E' ⊆ E, so that the number of nodes in V' is maximized.

Goal 2: Edge Coverage - Collect a sample graph S = (V', E'), where V' ⊆ V and E' ⊆ E, so that the number of edges in E' is maximized.

We selected these goals because they are closely related to many other application-specific goals: for example, the community-based sampling techniques in [15] use the node coverage goal to identify a sample that captures the community structure of the network. The authors of [14] also use this goal to identify a set of the most influential nodes across several centrality measures. We do not discount the importance of crawling for other applications (such as obtaining unbiased samples), but these are fundamentally different problems.

Note that a sampling algorithm that is successful with respect to one of these goals may not be successful with respect to the other. For example, suppose an open node (i.e., a node that has been observed through other nodes but not yet queried) has many unobserved edges adjacent to it, but these edges lead to other open nodes. Querying this node would lead to a large increase in the number of observed edges, but not in the number of observed nodes.

3.2 Online Crawling Methods

In our study, we compare nine popular crawling methods. As mentioned in the previous section, the crawler expands the sample by selecting for query a node that has been previously observed. The details of each algorithm are as follows (a sketch of the generic crawling loop that these methods share appears after the list):

Random Crawling (Rand): In each iteration, the crawler randomly selects a node from V_o for the next query.

Random Walk (RW): In each iteration, the crawler transitions to a random neighbor of the node that was just queried. Nodes can be visited multiple times, but the crawler only performs a new query on a node v ∈ V_o if it has not been previously queried. Random Walk came out on top in the evaluation of [12].

Breadth-First Search (BFS): The crawler selects the node that has been in the list of unqueried nodes the longest (i.e., First-In, First-Out). BFS is widely used for network sampling because of its simplicity. In addition, the obtained network gives a complete view (all nodes and edges) of a particular area of the graph, which may be useful for network analysis. An example of one of the largest network analyses using BFS is presented in [16].

Snowball Sampling (SB): The crawler acts similarly to BFS, except that when a node is queried, only a fraction p of its neighbors is added to the queue for future queries. Here, we set p to 0.5. Experimental results in [3] show that this approach is capable of discovering hub nodes (i.e., nodes with high degree), which helps the crawler expand into unexplored areas of the graph.

Depth-First Search (DFS): The crawler acts similarly to BFS, except that a node is selected in LIFO fashion (i.e., Last-In, First-Out).

Maximum Observed Degree (MOD): This is a greedy algorithm based on the intuition that a node with high observed degree likely has high unobserved degree. For each query, the crawler selects an open node with the highest observed degree [4]. Experimental results in [4] demonstrated that MOD substantially outperforms other algorithms at the node coverage task.

Maximum Observed PageRank (PR): The crawler acts similarly to MOD. The PageRank score of every observed node is recalculated when new nodes (or new edges) are added to the sample graph, and the node with the highest observed PageRank score is selected for the next query. As argued in [18], this algorithm can capture the community structure of the network.

Online Page Importance Computation (OPIC): This is an online algorithm that aims to calculate node importance scores without recalculating them from the whole sample. The algorithm only updates the scores of the most recently queried node and its neighbors. Initially, each observed node is given an equal amount of "cash". In each step, the crawler selects the node with the most cash, and that node's cash is distributed evenly to its neighbors. Results in [1] show that OPIC can compute the importance of nodes as standard methods do, but faster.

Volatile Multi-armed Bandit (VMAB): VMAB is a reinforcement learning algorithm that balances exploration and exploitation. This approach is used for finding target nodes in a network in [6]. The UCB1 algorithm is used for selecting an arm. Each arm represents a set of unqueried nodes with equivalent structural properties (we use 'common neighbors', as described in [6], for the implementation in our experiments). In [6], VMAB modifies UCB1 to handle non-stationary bandits, because arms can appear, disappear, or merge (as nodes are added to the sample, the structural properties change). In each iteration, the crawler selects the arm that maximizes W̄_i + C_p · sqrt( 2 ln(n − z_i) / T_i(n) ) and randomly picks a node from this arm to query.
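All nine methods fit the same budgeted query loop and differ only in how the next open node is chosen. The sketch below is a minimal illustration of that shared loop, written for this summary rather than taken from the authors' implementation; the query function, which returns a node's full neighbor list, stands in for whatever API the crawler uses, and MOD and Random Crawling are shown as example selection rules.

```python
import random

def crawl(query, seed, budget, select):
    """Generic budgeted crawler: query(v) returns v's neighbor list,
    select(open_nodes, observed) picks the next node to query."""
    observed = {seed: set()}              # adjacency of the sample S = (V', E')
    open_nodes, closed_nodes = {seed}, set()
    for _ in range(budget):
        if not open_nodes:                # nothing left to query
            break
        v = select(open_nodes, observed)
        open_nodes.discard(v)
        closed_nodes.add(v)
        for u in query(v):                # one API call consumes one budget unit
            observed.setdefault(u, set())
            observed[v].add(u)
            observed[u].add(v)
            if u not in closed_nodes:
                open_nodes.add(u)         # newly observed nodes become 'open'
    return observed                       # nodes: keys; edges: adjacency sets

def mod_select(open_nodes, observed):
    """Maximum Observed Degree: pick the open node with the highest
    degree in the currently observed sample."""
    return max(open_nodes, key=lambda v: len(observed[v]))

def random_select(open_nodes, observed):
    """Random Crawling: pick a uniformly random open node."""
    return random.choice(list(open_nodes))
```

BFS, DFS, Snowball Sampling, and the other strategies plug in at the same point: they change only how select chooses a node from the open set.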

3.3 The Effects of Network Structure on Algorithm Performance

The purpose of our study is to examine the effects of network structural properties on the relative performance of network crawlers. It is known that crawler performance may vary substantially by network [19]. This variance must be due to differences in the structural properties of the underlying networks: but what are those properties, and how do they affect the comparative performance of crawling methods?

3.3.1 Structural Properties of Interest. Our hypothesis is that the performance of crawling methods is mostly affected by the ease with which a crawler can move between regions of the graph. If it is difficult to move between regions of the graph, and the crawler gets stuck in one general area, then it will eventually start seeing the same nodes and edges over and over again. We thus consider three structural features. (We explored other structural properties, such as the clustering coefficient, but these three emerged as having the greatest effect on crawler performance.)

Community Separation: We consider a community to be a subgraph with high internal density and relatively few edges to the rest of the graph. To determine how well-separated communities are, we use modularity Q, as defined by Newman [17]:

Q = (1 / 2m) Σ_{v,w} [ A_{vw} − (d_v d_w) / 2m ] δ(c_v, c_w),

where A, m, and d_i are the adjacency matrix, the total number of edges, and the degree of node i, respectively. δ(c_v, c_w) is a delta function that returns one when nodes v and w are in the same community, and zero otherwise. The higher the modularity, the better the separation between communities, and so a crawler is more likely to get trapped in a region. We find communities using the Louvain method [5].

Average Degree: If the average node degree is high relative to community size, then a node is more likely to have neighbors outside of its own community, making it easier for a crawler to move between regions. It is defined as

d_avg = ( Σ_{v ∈ V} d_v ) / |V|.

Average Community Size: As described above, community size is relevant in conjunction with average degree. It is defined as

CS_avg = ( Σ_{c_i ∈ C} |c_i| ) / |C|,

where C is the set of communities, c_i is the set of nodes in community i, and |·| denotes the cardinality of a set.
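These three properties are straightforward to compute on any graph for which a community partition can be found. The sketch below is one possible way to do so, assuming NetworkX 2.8 or later (whose community module provides Louvain partitioning and a modularity function); it is our illustration, not the authors' code.

```python
import networkx as nx
from networkx.algorithms import community

def structural_properties(G):
    """Compute the three properties of interest: modularity Q of a
    Louvain partition, average degree, and average community size."""
    comms = community.louvain_communities(G, seed=0)   # list of node sets
    Q = community.modularity(G, comms)                 # community separation
    d_avg = sum(d for _, d in G.degree()) / G.number_of_nodes()
    cs_avg = sum(len(c) for c in comms) / len(comms)
    return Q, d_avg, cs_avg

# Small usage example on a built-in graph.
print(structural_properties(nx.karate_club_graph()))
```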
3.3.2 Properties of Real Networks. As we will see, the properties described above have a large effect on crawler performance: but if one begins without any knowledge of the network, how can one take advantage of our results to select an algorithm? As is well known, networks of the same type tend to have similar structural properties. For example, social networks tend to have more near-cliques than citation networks, which are more likely to have many long chain-like structures. Unfortunately, while it is not possible for a single algorithm to be the best on every network, we are able to produce general guidelines for selecting a crawling algorithm for a particular type of network. These results are presented in Sections 4.1.2 and 4.2.

4 EXPERIMENTAL STUDIES

We first perform a series of controlled experiments on synthetic networks, in which we methodically modify structural properties of the network and observe the effect of these variations on the performance of the crawling algorithms. We then validate our observations on real-world networks. Using these results, we group the crawling algorithms into three distinct classes. We additionally group the network datasets into categories based on domain. Within each category of network, crawling algorithms show consistent performance, allowing a user to select the method that is best for a particular category.

4.1 Effects of Network Properties

As mentioned in the earlier section, we consider three network structural properties: community separation, average degree, and average community size.

4.1.1 Synthetic networks. We use the LFR network model [11] to generate synthetic networks. This model is commonly used as a benchmark for evaluating community detection algorithms, and it allows us to control average node degree, community size, and community mixing. We set each network to have 5000 nodes and a maximum node degree of 300. We vary the community mixing parameter µ (the fraction of edges crossing between communities, ranging from 0 to 1), the average node degree, and the community size.

Note that community mixing µ and modularity Q are inversely related, meaning that networks with high community mixing will have low modularity. Here, we consider values of µ from 0.1 to 0.9, average degrees from 15 to 200, and average community sizes from 100 to 2500. Higher values of µ indicate fuzzier community borders.

For each set of parameters, we generate 10 networks. We generate multiple graphs with the same parameters, instead of running multiple experiments on a single generated graph, because we want to reduce the error or bias that might come from any one generated graph. We consider budgets of up to 1000 queries, representing sampling up to 20% of the nodes in the network.
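For readers who want to reproduce this kind of controlled setup, NetworkX ships an LFR generator. The call below is only illustrative: the paper fixes 5000 nodes and a maximum degree of 300 and varies µ, average degree, and community size, but the power-law exponents and the community-size bounds shown here are assumptions of ours, and LFR generation can fail for infeasible parameter combinations.

```python
import networkx as nx

# Illustrative LFR benchmark graph (not the authors' exact settings).
G = nx.LFR_benchmark_graph(
    n=5000, tau1=3.0, tau2=1.5, mu=0.1,    # degree/community exponents, mixing
    average_degree=15, max_degree=300,
    min_community=100, max_community=500,
    seed=42,
)
G.remove_edges_from(nx.selfloop_edges(G))  # LFR graphs may contain self-loops
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(G.number_of_nodes(), G.number_of_edges(), len(communities))
```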

Based on our experimental results, we are able to categorize the nine crawling algorithms into three groups (G1-G3). The algorithms in each group are as follows:

G1: Node Importance-based algorithms: Maximum Observed Degree, OPIC, and Maximum Observed PageRank
G2: Random Walk
G3: Graph Traversal-based algorithms: BFS, DFS, SB, Random, and VMAB

A summary of how the structural properties affect each group is shown in Table 2. We plot results for methods from each group in Figures 2-4. For brevity, we show only results for the best method in each group: MOD, Random Walk, and BFS.

Table 2: Categorization and summary of the performances of sampling algorithms.

Node coverage:
- Community separation: G1: excellent performance when community overlap is high (i.e., low Q or high µ). G2: stable. G3: stable.
- Average community size: G1: strong performance when communities are large if µ is low; community size does not matter if µ is high. G2: stable. G3: stable.
- Average degree: G1: strong performance when average degree is extremely low (<10), even if µ is low; otherwise stable. G2: stable. G3: performance improves as average degree increases.

Edge coverage:
- Community separation: G1: excellent performance when community overlap is high (i.e., low Q or high µ). G2: stable. G3: stable.
- Average community size: G1: strong performance when communities are large if µ is low; community size does not matter if µ is high. G2: stable. G3: stable.
- Average degree: G1: strong performance when d_avg is low (<10), even if µ is low; otherwise performance drops as d_avg increases. G2: performance drops as d_avg increases. G3: stable.

Best method in group: G1: MOD; G2: RW; G3: BFS.

The results are as follows:

Node Coverage: Figure 2 depicts results for the node coverage task as average degree and average community size are varied, with community mixing fixed at 0.1 (i.e., few connections between communities). Plots further to the right have larger average community sizes (CS_avg ranges from 100 to 2500), and plots higher on the y-axis have higher average degrees (d_avg ranging from 7 to 100). Within each plot, the x-axis shows the fraction of nodes queried, and the y-axis shows the fraction of nodes observed in the sample. Figure 1 illustrates the results when community mixing is varied, with average degree and average community size fixed at 15 and 300, respectively. Figure 3 shows results when average degree is varied, with average community size fixed at 300 and community mixing at 0.6. For brevity, we cannot show results for all parameter settings, but the depicted results are representative of the full set of results. We make several important observations from this set of plots:

Figure 1: (Node coverage) Results on synthetic networks with different values of community mixing µ when average degree is fixed at 15 and average community size is fixed at 300. For the individual plots, the x-axes show query budgets (0% to 20% of total nodes), and the y-axes show the percent of nodes observed (0% to 100%). G1 methods improve as µ increases, while the other methods are stable.

G1 - Node Importance-based methods: These methods greedily pick a node with high centrality (degree or PageRank), and tend to behave similarly. This is not surprising, as there is a high correlation between the various centrality measures. The performance of the G1 methods significantly improves as the size of the community is increased. If community mixing is low (fewer connections between communities) and communities are small, G1 methods perform poorly, as they tend to get 'stuck' in a region: if there are few connections between communities, these methods have difficulty transitioning to new communities. They thus exhibit diminishing marginal returns: while no method will query the same node multiple times, they tend to query nodes with similar neighborhoods, resulting in redundant information. One interesting observation is that when the average degree is extremely low (last row of Figure 2), G1 methods perform worse than both G2 and G3 methods for low query budgets, but their performance rapidly increases and outperforms the other methods as the budget increases. We observed the same behavior for every generated network with the same parameter settings and an average degree of less than 10.

As illustrated in Figure 1, G1 methods perform better when µ increases. We can see similar results in Figure 3. When community mixing is high (there are many connections between communities), G1 methods are consistently the best, as a node outside of the crawler's current community may have a high observed degree, so the crawler can escape easily.

G2 - Random Walk: As we can clearly see from Figures 2 and 3, the Random Walk crawler is a very stable algorithm. Its performance seems to be unaffected by all of the considered properties.

Since the Random Walk crawler selects the next node randomly from the neighbors of the last queried node, it can escape from the current region easily, while leaving parts of that region unobserved. However, the crawler is free to move back and forth into a partially explored region, and so it discovers more nodes of that region in later steps.
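The distinction between visiting and querying matters here: the walk may step onto already-queried nodes without spending budget, which is what lets it drift back into partially explored regions. Below is a minimal random-walk crawler under the same query model as the earlier sketch (one query returns all neighbors); it is our illustration, not the authors' implementation.

```python
import random

def random_walk_crawl(query, seed, budget):
    """Random Walk crawler: the walk may revisit nodes, but budget is
    only spent the first time a node is queried."""
    observed = {seed: set()}                 # adjacency of the observed sample
    queried = set()
    current = seed
    while len(queried) < budget:
        if current not in queried:           # first visit: spend one query
            queried.add(current)
            for u in query(current):
                observed.setdefault(u, set())
                observed[current].add(u)
                observed[u].add(current)
        if queried >= observed.keys():       # observed region fully queried
            break
        neighbors = list(observed[current])
        if not neighbors:                    # isolated seed: nowhere to walk
            break
        current = random.choice(neighbors)   # step to a random neighbor
    return observed
```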
G3 - Graph Traversal-based methods: These methods are not meaningfully affected by community size. They expand the frontier of the sample uniformly, and so can easily move between graph regions. However, the choice of the next queried node depends on when that node was put into the query queue: it is likely that nodes with small and medium degree will be queried, and the number of new nodes added to the sample will be low. The performance of these methods improves as the average degree increases (moving up the y-axis in Figure 3), because they can quickly reach and escape the boundaries of a community. For large average degree, these methods outperform Random Walk sampling. Note that VMAB is put into this group because of its performance: VMAB often performs poorly at the task of node coverage because new arms appear frequently, resulting in nearly random query choices.

Figure 2: (Node coverage) Results on synthetic networks with different values of d_avg and CS_avg when community mixing µ is fixed at 0.1. Plots to the right (along the x-axis) have higher average community sizes, while plots near the top (along the y-axis) have higher average degree. For the individual plots, the x-axes show query budgets ranging from 0% to 20% of total nodes in the network, and the y-axes show the node coverage (measured in % of total nodes observed), ranging from 0% to 100%. The performance of the G2 random walk method is stable. G1 methods improve when CS_avg increases, while G3 methods improve when d_avg increases.

Figure 3: (Node coverage) Results on synthetic networks with different values of average degree, when community mixing is fixed at 0.6 and community size is 300. For the individual plots, the x-axes show query budgets (0% to 20% of total nodes), and the y-axes show the percent of nodes observed (0% to 100%). G1 outperforms the other methods. The performance of G3 methods improves as average degree and average community size increase.

Figure 4: (Edge coverage) Results on synthetic networks with different values of average degree, when community mixing is fixed at 0.6 and community size is 1000. For the individual plots, the x-axes show query budgets (0% to 20% of total nodes), and the y-axes show the percent of edges observed (0% to 100%). The performance of G3 methods is stable, while G1 and G2 show a decrease in performance.

Edge Coverage Task: Figure 5 is interpreted similarly to Figure 2, showing how the performance of G1-G3 changes as average degree and community size vary, with fixed low community mixing (µ = 0.1). When varying µ, performance is similar to the node coverage task, as shown in Figure 1, and due to space constraints, we do not include it. As before, these limited results are representative of the full set of results. The key observations are as follows:

G1 - Node Importance-based methods: These methods are the best algorithms if 1) community mixing µ is high, or 2) communities are large, even if community mixing µ is low (along the x-axis in Figure 5, the performance increases). As in the node coverage task, fuzzy borders between communities lead to improved performance by G1 methods. However, if communities are very large, then even if a G1 method gets 'stuck' in a single community, it is still able to discover many edges (here, 'large' is measured with respect to the sample budget). G1 methods also show the best performance when the average degree is extremely low. However, their performance degrades as the average degree increases, as shown in Figure 4.

Table 3: Network statistics of the real-world networks used in the controlled experiments, grouped by the property being tested.

Test property Q:
- Pair A: Wiki-Vote (d_avg 28.51, CS_avg 1,177.67, Q 0.42) vs. Twitter (d_avg 33.01, CS_avg 1,129.25, Q 0.81)
- Pair B: Brightkite (d_avg 7.51, CS_avg 274.10, Q 0.68) vs. MathSciNet (d_avg 4.93, CS_avg 594.09, Q 0.80)

Test property CS_avg:
- Pair A: Shipsec1 (d_avg 24.36, CS_avg 4,117.50, Q 0.89) vs. Shipsec5 (d_avg 24.61, CS_avg 5,252.15, Q 0.90)
- Pair B: Github (d_avg 7.25, CS_avg 83.68, Q 0.43) vs. P2P-gnutella (d_avg 4.73, CS_avg 1,276.76, Q 0.50)

Test property d_avg:
- Pair A: P2P-gnutella (d_avg 4.73, CS_avg 1,276.76, Q 0.50) vs. Bingham (d_avg 72.57, CS_avg 1,250.13, Q 0.45)
- Pair B: Amazon (d_avg 2.74, CS_avg 272.44, Q 0.99) vs. UK-2005 (d_avg 181.19, CS_avg 157.13, Q 1.00)
Figure 5: (Edge coverage) Results on synthetic networks with different values of d_avg and CS_avg when community mixing µ is fixed at 0.1. Plots to the right have higher average community sizes, while plots near the top have higher average degree. For the individual plots, the x-axes show query budgets ranging from 0% to 20% of total nodes in the network, and the y-axes show the edge coverage (measured in % of total edges observed), ranging from 0% to 100%. The G3 methods are stable throughout. The G2 method is not affected by average community size, but decreases as average degree increases. G1 algorithms improve as community size increases, but worsen as average degree increases.

G2 - Random Walk: The performance seems to be very stable as community size increases, regardless of whether community mixing µ is high or low.

G3 - Graph Traversal-based methods: These methods are stable as the parameters vary, but they generally perform worse than Random Walk and the methods in G1.

4.1.2 Real-World Networks: The previous experiments demonstrate that a major factor in the performance of each method is the ability to transition between different regions of the graph. Here, we test whether this conclusion holds on real graphs.

We perform three sets of controlled experiments. In each set, we locate two pairs of networks (Pair 'A' and Pair 'B'), each containing networks that are similar with respect to two of the three structural features, but very different with respect to the third. For example, the Wiki-Vote and Twitter networks have average degree around 30 and average community size around 1100, but differ in modularity (0.42 vs. 0.81). Statistics for the selected network pairs are listed in Table 3, grouped by the feature being varied. (Network datasets are from http://networkrepository.com.) In order to find communities, we use the modularity Q of the communities found by the Louvain method [5] and use it as a proxy for community mixing. As mentioned in the previous section, modularity and community mixing are inversely related (high Q corresponds to low µ). Because the networks are of different sizes and structures, rather than setting a fixed number of queries, we set the query budget to be 5% of the number of nodes. We perform 10 trials for each network and sampling method.

For each pair of networks, we refer to one of the networks as the 'High'-valued network (H) and the other as the 'Low'-valued network (L). These designations are relative to the other network in the pair (e.g., the network with the higher test value is labeled 'H').

Because Random Walk is the most stable method (i.e., it shows consistent performance regardless of the structural properties we consider), we use it as a reference point. We show the number of nodes or edges discovered by each sampling method as a percentage improvement above (or below) the number of nodes or edges discovered by Random Walk. Results are shown in Table 4. Each cell contains a pair of values (x / y), where x is the percentage improvement of the algorithm on the 'Low'-valued network and y is the percentage improvement on the 'High'-valued network, both relative to Random Walk. An arrow indicates how the performance changes when the value of the test property changes from Low to High: the symbol '⇑' indicates that performance improves when the test property increases, and '⇓' indicates that performance degrades.
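This comparison protocol is easy to restate in code. The sketch below assumes the crawl and random_walk_crawl helpers from the earlier sketches (any crawler with a (query, seed, budget) signature works, e.g. functools.partial(crawl, select=mod_select)) and a query oracle built from a fully known graph; it is illustrative only and simplifies details such as seed selection.

```python
import random
from functools import partial

def simulated_query(adj):
    """Query oracle over a fully known graph (dict: node -> neighbor set),
    used only to simulate online crawling in controlled experiments."""
    return lambda v: list(adj[v])

def pct_improvement_over_rw(adj, crawler, trials=10):
    """Mean % improvement in node coverage over Random Walk, with the
    budget set to 5% of the nodes, as in the controlled experiments."""
    budget = max(1, int(0.05 * len(adj)))
    gains = []
    for _ in range(trials):
        seed = random.choice(list(adj))
        covered = len(crawler(simulated_query(adj), seed, budget))
        baseline = len(random_walk_crawl(simulated_query(adj), seed, budget))
        gains.append(100.0 * (covered - baseline) / baseline)
    return sum(gains) / len(gains)

# e.g., pct_improvement_over_rw(adj, partial(crawl, select=mod_select))
```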
Table 4: Experimental results of the controlled experiments. An arrow indicates how the performance changes when the test property changes from Low to High (⇑: improves, ⇓: degrades). In each cell (x / y), x and y indicate the percentage improvement on the Low- and High-valued networks, respectively (positive: outperforms RW, negative: underperforms RW).

Test property Q:
- Pair A: Node coverage, G1 vs. RW: ⇓ (7.62% / -5.24%); G3 vs. RW: ⇑ (-22.44% / -13.97%). Edge coverage, G1 vs. RW: ⇓ (21.12% / -2.90%); G3 vs. RW: ⇑ (-58.13% / -49.33%).
- Pair B: Node coverage, G1 vs. RW: ⇑ (12.47% / 19.81%); G3 vs. RW: ⇑ (-28.27% / -17.64%). Edge coverage, G1 vs. RW: ⇑ (31.85% / 53.49%); G3 vs. RW: ⇑ (-47.22% / -23.10%).

Test property CS_avg:
- Pair A: Node coverage, G1 vs. RW: ⇑ (-71.52% / -70.32%); G3 vs. RW: ⇓ (-20.53% / -27.68%). Edge coverage, G1 vs. RW: ⇑ (-19.65% / -10.48%); G3 vs. RW: ⇓ (-5.34% / -6.38%).
- Pair B: Node coverage, G1 vs. RW: ⇑ (10.14% / 15.67%); G3 vs. RW: ⇑ (-35.21% / -15.18%). Edge coverage, G1 vs. RW: ⇑ (20.01% / 34.68%); G3 vs. RW: ⇑ (-58.44% / -19.39%).

Test property d_avg:
- Pair A: Node coverage, G1 vs. RW: ⇓ (10.14% / -14.38%); G3 vs. RW: ⇑ (-15.18% / -0.87%). Edge coverage, G1 vs. RW: ⇓ (34.68% / -3.19%); G3 vs. RW: ⇓ (-19.39% / -27.72%).
- Pair B: Node coverage, G1 vs. RW: ⇑ (-0.40% / 6.25%); G3 vs. RW: ⇑ (2.09% / 334.34%). Edge coverage, G1 vs. RW: ⇓ (-0.48% / -1.42%); G3 vs. RW: ⇑ (1.53% / 82.61%).

Table 5: Categories of the real-world networks and their structural characteristics.

Collaboration (low degree, medium-sized and clear communities):
- Citeseer: d_avg 7.16, CS_avg 988.35, Q 0.90
- Dblp-2010: d_avg 6.33, CS_avg 739.91, Q 0.86
- Dblp-2012: d_avg 6.62, CS_avg 1248.35, Q 0.82
- MathSciNet: d_avg 4.93, CS_avg 594.09, Q 0.80

Recommendation (low degree, small and clear communities):
- Amazon: d_avg 2.74, CS_avg 272.44, Q 0.99
- Github: d_avg 7.25, CS_avg 83.68, Q 0.43

Facebook (high degree, large and clear communities):
- OR: d_avg 25.77, CS_avg 1074.44, Q 0.63
- Penn94: d_avg 65.59, CS_avg 2186.11, Q 0.49
- Wosn-friends: d_avg 25.77, CS_avg 856.65, Q 0.63

Technology (low degree, large and clear communities):
- P2P-gnutella: d_avg 4.73, CS_avg 1276.76, Q 0.50
- RL-caida: d_avg 6.37, CS_avg 856.12, Q 0.86

Web (high degree, medium-sized and fuzzy communities):
- Arabic-2005: d_avg 21.36, CS_avg 115.86, Q 1.00
- Italycnr-2000: d_avg 17.36, CS_avg 1134.34, Q 0.91
- Sk-2005: d_avg 5.51, CS_avg 338.22, Q 0.99
- Uk-2005: d_avg 181.19, CS_avg 157.13, Q 1.00

OSNs (high degree, small-to-medium-sized and fuzzy communities):
- Slashdot: d_avg 10.24, CS_avg 173.87, Q 0.36
- Themarker: d_avg 29.87, CS_avg 458.90, Q 0.31
- BlogCatalog: d_avg 47.15, CS_avg 1455.48, Q 0.32

Scientific (high degree, large and clear communities):
- PKUSTK13: d_avg 68.73, CS_avg 3,514.56, Q 0.88
- PWTK: d_avg 51.89, CS_avg 4,635.81, Q 0.93
- Shipsec1: d_avg 24.36, CS_avg 4,117.50, Q 0.89
- Shipsec5: d_avg 24.61, CS_avg 5,252.15, Q 0.90
First, we consider pairs with similar average degree and average community size, but different values of modularity Q (0.42 vs. 0.81, and 0.68 vs. 0.80). As expected, the G1 methods perform extremely well when modularity is low in pair A (the ⇓ indicates that G1 performance drops when modularity increases) for both the node coverage and edge coverage tasks. On the other hand, the G1 methods perform very well on both networks in pair B, because both networks have extremely low average degree (d_avg < 10). The performance of the G3 methods is worse than Random Walk, as we expected.

Next, we consider networks with similar modularity and average degree, but different average community sizes (4118 vs. 5252 and 84 vs. 1277). Our earlier experiments predicted that G1 methods would perform better on networks with larger community sizes. As expected, G1 performance improves for the networks with larger communities in both pairs.

Finally, we consider networks with similar modularity and average community size, but different average degrees (5 vs. 73 and 3 vs. 181). As predicted, G3 methods perform better on the networks with higher average degree in both pairs. However, we see less consistent results for edge coverage. Overall, the experiments on real networks bear out our results on synthetic networks.

4.2 Network Types

In a real application, the network properties are not known until the sample is generated: so how can one select an appropriate crawling method? Here, we analyze how well the algorithms perform by network type (web, collaboration, technology, scientific, Facebook, recommendation, and other OSNs). (Network datasets are from http://networkrepository.com.) The statistics of all networks are listed in Table 5. Again, we set the maximum query budget to be 5 percent of the total nodes and perform 10 trials for each method. We report the mean and standard deviation of the percentage of nodes and edges found in Table 6.

We make the following observations:

(1) G1 or G2 methods are almost always the best, regardless of network type or coverage task.
(2) On collaboration, technological, recommendation, web, and online social networks, G1 methods perform the best, while on Facebook and scientific networks, the G2 method is the best.
(3) Methods in G3 do not seem to be a good choice for either of these two goals, on any type of network.

As suggested by Newman in [17], a modularity Q ≥ 0.3 indicates good community structure. As we can see, the OSNs have the lowest modularity compared to the others. This indicates that their communities are fuzzy and overlapping. Since these are social networks, people can be part of several groups in real life (groups of friends, family, co-workers, etc.). So, consistent with our earlier results, G1 methods perform the best on networks in this category.

The other network types typically have higher values of modularity Q (ranging between 0.4 and 0.9), and the determining factor of which method is best thus depends on average community size and average degree. The ratio between average community size and average degree for collaboration and technology networks is around 200, while this ratio for recommendation and web networks is about 35; this ratio indicates how large the communities are compared to the average degree. The average degrees of these four types of networks are quite low, between 2 and 15. As we expected from the previous experiments, G1 methods perform very well in this case. In contrast, on average, Facebook and scientific computing networks have communities only 12 and 80 times larger than their average degrees, which range from 15 to 70. Here, the G2 method performs the best.

In view of our earlier experiments, we see that in networks with communities that are small relative to the average degree, G1 methods quickly see all of a community, and then have trouble escaping (because of the strong community structure). However, when communities are large relative to the average degree, both G1 and G2 methods tend to stay in the same community for much longer, and G1 methods perform the best.

Table 6: Summary of the network characteristics and performance of the algorithms (mean ± standard deviation of the percentage of nodes or edges found; values are listed in the order G1, G2, G3). Algorithms tend to perform similarly on networks in the same category.

Collaboration (low d_avg, medium CS_avg, high Q):
- Citeseer: node 25.66±2.94, 25.15±0.55, 21.01±0.23; edge 20.48±1.68, 19.49±0.42, 15.1±0.52
- Dblp-2010: node 32.59±0.12, 26.53±0.37, 18.22±0.16; edge 29.16±0.08, 19.53±0.39, 12.72±0.29
- Dblp-2012: node 38.33±0.04, 31.74±0.28, 26.21±0.11; edge 33.47±0.03, 22.39±0.36, 17.09±0.07
- MathSciNet: node 36.14±0.09, 30.17±0.26, 24.85±0.13; edge 36.27±0.06, 23.63±0.3, 18.17±0.09

Recommendation (low d_avg, low CS_avg, high Q):
- Amazon: node 5.71±0.16, 5.73±0.06, 5.85±0.18; edge 5.60±0.06, 5.61±0.08, 5.50±0.17
- Github: node 53.59±0.02, 46.33±0.24, 30.02±0.08; edge 72.57±0.02, 60.47±0.29, 25.13±0.18

Facebook (high d_avg, high CS_avg, high Q):
- OR: node 38.99±2.50, 55.94±0.68, 51.00±0.22; edge 31.05±3.69, 27.37±0.57, 16.2±0.17
- Penn94: node 75.30±1.05, 82.52±0.34, 80.07±0.24; edge 24.04±1.74, 19.47±0.41, 12.39±0.13
- Wosn-friends: node 38.20±3.05, 55.80±0.49, 50.93±0.19; edge 30.92±3.12, 27.85±0.7, 16.46±0.17

Technology (low d_avg, high CS_avg, high Q):
- P2P-gnutella: node 36.02±0.11, 32.71±0.17, 27.74±0.22; edge 26.96±0.08, 20.02±0.1, 16.13±0.17
- RL-caida: node 28.86±0.12, 27.71±0.47, 26.62±0.10; edge 39.57±0.18, 30.21±0.85, 20.26±0.11

Web (high d_avg, medium CS_avg, low Q):
- Arabic-2005: node 9.47±2.49, 6.47±0.72, 9.54±1.48; edge 6.97±0.94, 5.40±1.28, 6.75±0.95
- Italycnr-2000: node 8.52±2.25, 15.66±6.37, 13.65±2.84; edge 14.9±3.83, 24.59±12.51, 11.93±2.89
- Sk-2005: node 10.33±0.87, 6.41±1.04, 8.21±0.65; edge 9.69±0.51, 6.21±0.96, 8.03±1.02

OSNs (high d_avg, low-to-medium CS_avg, low Q):
- Slashdot: node 70.68±0.01, 61.23±0.25, 36.81±0.75; edge 75.85±0.01, 57.74±0.24, 21.84±0.56
- BlogCatalog: node 90.38±0.02, 90.38±0.37, 90.38±0.49; edge 90.51±0.01, 82.28±0.32, 18.81±0.26
- Themarker: node 89.48±0.01, 86.04±0.2, 47.40±0.12; edge 82.28±0.01, 67.4±0.25, 19.72±0.12

Scientific (high d_avg, high CS_avg, high Q):
- PKUSTK13: node 7.40±0.51, 43.94±9.74, 33.78±1.51; edge 5.68±0.17, 10.58±0.61, 9.41±0.22
- PWTK: node 5.61±0.12, 20.08±2.68, 15.45±0.74; edge 5.27±0.05, 8.13±0.21, 7.99±0.10
- Shipsec1: node 7.81±0.44, 27.47±1.43, 21.77±0.52; edge 7.80±0.89, 9.71±0.54, 9.17±0.35
- Shipsec5: node 8.17±0.81, 27.85±2.46, 20.02±0.91; edge 8.75±0.50, 9.79±0.52, 9.15±0.44
All in all, the G1 and G2 methods are the best, depending on the structural properties. G1 methods expand the sampled network by quickly filling out the unobserved nodes (or edges) in a particular region before moving out of that region. However, these methods are obstructed by sharp community borders. In contrast, the G2 method, Random Walk, has the freedom to move around, and so the crawler observes parts of many communities before it fills out individual regions. Because of this, a Random Walk takes longer to fully explore regions, but reaches more of the network.
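For convenience, the per-category recommendation developed in this section (and restated in the Conclusion) can be captured in a small lookup. This is simply a restatement of the guidance above, with G1 standing for the node importance-based crawlers (MOD, OPIC, Maximum Observed PageRank) and G2 for Random Walk.

```python
# Bottom-line recommendation by network category (Section 4.2).
RECOMMENDED_GROUP = {
    "collaboration": "G1",
    "recommendation": "G1",
    "technology": "G1",
    "web": "G1",
    "online_social_network": "G1",
    "facebook": "G2",
    "scientific_computing": "G2",
}

def recommend_crawler(network_type: str) -> str:
    """Return the crawler group reported as best for this domain."""
    return RECOMMENDED_GROUP[network_type.lower().replace(" ", "_")]
```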
5 CONCLUSION

Data collection is the first step of any network analysis task. However, the literature contains a vast selection of network sampling algorithms, and so it is often difficult for users to select a single method that works well for their data, as sampling methods that work well on one network may not work well on a different network. In this paper, we performed a large-scale, comprehensive study to understand how the structural features of networks affect the performance of sampling methods. We identified three network properties of interest: community separation, community size, and average degree. We performed a large set of controlled experiments on synthetic and real graphs, and considered two sampling goals: node and edge coverage. We considered nine important sampling methods, ranging from well-understood, classical methods like Random Walk and BFS to modern, cutting-edge algorithms. We performed experiments on real and synthetic networks, and demonstrated that the performance of the sampling methods is highly dependent on the network structure, and in particular, on whether the sampling method is able to transition between different regions of the graph. As a result of our experiments, we categorized the nine crawling methods into three groups: Node Importance-based, Random Walk-based, and Graph Traversal-based approaches.

We observed that Random Walk and the Node Importance-based algorithms performed well, depending on the network structure. In particular, on networks with clear and sharp community structure, the Node Importance-based algorithms tend to get 'stuck' in a region of the graph, while the Random Walk method is able to transition between regions. However, when the boundaries between communities are fuzzier, or the communities overlap, the Node Importance-based methods demonstrate excellent performance.

Finally, we showed how a user can select an appropriate crawling method based on the network type: in particular, the Random Walk method is suitable for crawling Facebook and scientific computing networks, but for collaboration, recommendation, technological, web, and other online social networks, Node Importance-based methods are best.

6 FUTURE WORK

As part of our future research, we plan to investigate different types of query responses. In this work, we assumed that all neighbors are returned for each query. However, in certain settings, this assumption may not hold, and the crawler may need multiple queries to obtain all of a node's neighbors. For example, an online social network API may divide a node's neighbors into pages, where each page contains k records and only one page is returned at a time. We also want to expand our findings and provide insight on directed networks. It is not clear that our results will generalize to such settings, and we plan to investigate them further.
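The paginated setting described above can be emulated with a thin wrapper around a full neighbor list: each call returns one page of k records, so fully expanding a node of degree d costs ceil(d / k) budget units instead of one. The page size and the wrapper below are illustrative assumptions for this summary, not part of the paper.

```python
from math import ceil

def paginate(neighbors_of, k=50):
    """Wrap a full neighbor lookup as a paged API: fetch(v, page) returns
    one page of up to k neighbors plus the total number of pages."""
    def fetch(v, page):
        nbrs = neighbors_of(v)
        return nbrs[page * k:(page + 1) * k], max(1, ceil(len(nbrs) / k))
    return fetch

def query_all_pages(fetch, v):
    """Collect every neighbor of v; each page retrieved costs one query."""
    collected, page, total_pages, cost = [], 0, 1, 0
    while page < total_pages:
        chunk, total_pages = fetch(v, page)
        collected.extend(chunk)
        cost += 1
        page += 1
    return collected, cost
```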
7 ACKNOWLEDGEMENTS

The authors would like to thank Jeremy Wendt of Sandia National Laboratories for thoughtful comments and conversations.

REFERENCES

[1] Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In Proceedings of the 12th International Conference on World Wide Web.
[2] Nesreen K. Ahmed, Jennifer Neville, and Ramana Kompella. 2014. Network Sampling: From Static to Streaming Graphs. ACM Transactions on Knowledge Discovery from Data (TKDD) 8, 2 (2014).
[3] Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Sue Moon, and Hawoong Jeong. 2007. Analysis of topological characteristics of huge online social networking services. In International Conference on WWW.
[4] Konstantin Avrachenkov, Prithwish Basu, Giovanni Neglia, Bruno Ribeiro, and Don Towsley. 2014. Pay few, influence most: Online myopic network covering. In Computer Communications Workshops.
[5] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment (2008).
[6] Zahy Bnaya, Rami Puzis, Roni Stern, and Ariel Felner. 2013. Bandit algorithms for social network queries. In 2013 International Conference on Social Computing. IEEE, 148-153.
[7] Minas Gjoka, Maciej Kurant, Carter T Butts, and Athina Markopoulou. 2009. Unbiased sampling of Facebook. Preprint arXiv 906 (2009).
[8] Maciej Kurant, Athina Markopoulou, and Patrick Thiran. 2010. On the bias of BFS (breadth first search). In Teletraffic Congress (ITC), 2010 22nd International. IEEE, 1-8.
[9] Maciej Kurant, Athina Markopoulou, and Patrick Thiran. 2011. Towards unbiased BFS sampling. IEEE Journal on Selected Areas in Communications 29, 9 (2011), 1799-1809.
[10] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In 19th International Conference on World Wide Web.
[11] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. 2008. Benchmark graphs for testing community detection algorithms. Physical Review E 78, 4 (2008), 046110.
[12] Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 631-636.
[13] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (2009), 29-123.
[14] Arun S Maiya and Tanya Y Berger-Wolf. 2010. Online sampling of high centrality individuals in social networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 91-98.
[15] Arun S Maiya and Tanya Y Berger-Wolf. 2010. Sampling community structure. In Proceedings of the 19th International Conference on World Wide Web. ACM, 701-710.
[16] Alan Mislove, Massimiliano Marcon, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In 7th ACM SIGCOMM Conference on Internet Measurement. ACM, 29-42.
[17] Mark EJ Newman. 2004. Fast algorithm for detecting community structure in networks. Physical Review E 69, 6 (2004), 066133.
[18] Mostafa Salehi, Hamid R Rabiee, and Arezo Rajabi. 2012. Sampling from complex networks with high community structures. Chaos 22, 2 (2012).
[19] Shaozhi Ye, Juan Lang, and Felix Wu. 2010. Crawling online social graphs. In 12th International Asia-Pacific Web Conference.
