Web Graph Statistics

Description

This webpage presents operational statistics derived from Common Crawl's Web Graph releases, which show the structure and connectivity of the web as captured in the crawl releases. The data consists of host- and domain-level graphs, where hostnames are formatted in reverse domain name notation. These graphs include all types of links, such as those pointing to images, JS libraries, web fonts, and so on. However, only hostnames with valid IANA top-level domains are considered, excluding URLs that use IP addresses as host components.
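As a small illustration of reverse domain name notation, the sketch below flips the order of the dot-separated labels of a hostname. The function name `reverse_host` is hypothetical and is not part of any Common Crawl tooling.

```shell
# Hypothetical helper: convert a hostname to reverse domain name notation
# (and back again, since the operation is its own inverse).
reverse_host() {
    printf '%s\n' "$1" | awk -F. '{
        out = $NF                                      # start with the last label (the TLD)
        for (i = NF - 1; i >= 1; i--) out = out "." $i # append the remaining labels right to left
        print out
    }'
}

reverse_host "www.example.com"   # prints com.example.www
```

Applying the function twice returns the original hostname, which is handy when mapping entries from the graph files back to ordinary hostnames.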

The domain-level graphs are constructed by aggregating host-level data to pay-level domains (PLDs), using the Public Suffix List maintained at publicsuffix.org. This aggregation gives a view of the web's hierarchical structure that is useful for research in areas like ranking algorithms, graph analysis, and link spam detection.
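The aggregation step can be sketched roughly as follows. This is a simplified illustration, not the actual Common Crawl pipeline: the function name `pld_of_reversed_hosts` and the three-entry `TOY_SUFFIXES` list are assumptions made for the example, and the real Public Suffix List also has wildcard and exception rules that are ignored here.

```shell
# Toy stand-in for the Public Suffix List, written in reverse notation (assumption).
TOY_SUFFIXES="com org uk.co"

# Hypothetical helper: read reversed hostnames on stdin and print the distinct
# pay-level domains (longest matching public suffix plus one more label).
pld_of_reversed_hosts() {
    awk -F. -v suffixes="$TOY_SUFFIXES" '
        BEGIN {
            n = split(suffixes, s, " ")
            for (i = 1; i <= n; i++) is_suffix[s[i]] = 1
        }
        {
            prefix = ""; best = 0
            for (i = 1; i <= NF; i++) {
                prefix = (i == 1) ? $i : prefix "." $i
                if (prefix in is_suffix) best = i      # remember the longest matching suffix
            }
            if (best == 0 || best == NF) next          # unknown suffix, or the host is only a suffix
            out = $1
            for (i = 2; i <= best + 1; i++) out = out "." $i
            print out
        }
    ' | LC_ALL=C sort -u
}

printf 'com.example.www\ncom.example.blog\nuk.co.example.shop\n' | pld_of_reversed_hosts
```

With this toy list, the two `com.example.*` hosts collapse into the single domain `com.example`, and `uk.co.example.shop` maps to `uk.co.example`.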

For those interested in exploring the Web Graphs, we provide tools and instructions through the cc-webgraph project on GitHub. We also have a Jupyter Notebook in our cc-notebooks repository. Papers on how these Web Graphs can be used are listed in the Related Reading section below. Additionally, the list of graph releases is accessible via the graphinfo.json file.

The top ten highest ranked hosts and domains from each release are listed below. Ranks are derived from Harmonic Centrality, and we also show PageRank for comparison.

Top 1000 Ranks

These ranks can be found by running the following:

# Define environment variables for release and graph level
export RELEASE="{release}"  # Desired release (e.g., cc-main-2017-18-nov-dec-jan)
export GRAPH_LEVEL="{graph_level}"  # Desired graph level (e.g., domain or host)

# Fetch the top 1000 ranks for the specified release and graph level
curl -s "https://data.commoncrawl.org/projects/hyperlinkgraph/$RELEASE/$GRAPH_LEVEL/$RELEASE-$GRAPH_LEVEL-ranks.txt.gz" \
        | zcat \
        | head -n 1001

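Each of these ranks files is multiple GiB in size. Piping the download through zcat or gunzip into head lets the transfer stop as soon as the requested lines have been read, so you avoid downloading the whole file. Note that tail has to read to the end of the stream, so it still transfers the entire file.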
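If you fetch ranks for several releases, it may help to wrap the URL pattern from the command above in a small function. The name `ranks_url` is hypothetical; the URL layout is simply the one used in the command above.

```shell
# Hypothetical helper: build the ranks file URL for a release and graph level.
ranks_url() {
    release="$1"       # e.g. cc-main-2017-18-nov-dec-jan
    graph_level="$2"   # domain or host
    printf 'https://data.commoncrawl.org/projects/hyperlinkgraph/%s/%s/%s-%s-ranks.txt.gz\n' \
        "$release" "$graph_level" "$release" "$graph_level"
}

ranks_url "cc-main-2017-18-nov-dec-jan" "domain"

# Usage with the streaming pipeline (not run here, since it needs network access):
#   curl -s "$(ranks_url cc-main-2017-18-nov-dec-jan domain)" | zcat | head -n 1001
```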

What Are These Ranks?

Harmonic Centrality (that's the equation below and on the left) considers how close a node is to others, directly or indirectly. The closer a node is to others, the higher its score. It's based on proximity, not the importance or behaviour of neighbours. We calculate this with HyperBall.

With PageRank (that's the equation on the right), each node's score depends on how many important nodes link to it, and how those nodes distribute their importance. We calculate this with PageRankParallelGaussSeidel.
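For reference, the two measures are commonly written as shown below. These are the standard textbook forms, given as a sketch: the symbols d(y, x) for the shortest-path distance from y to x, N for the number of nodes, and α for the damping factor are notation assumed here, and details such as the treatment of dangling nodes or the parameter values used for these releases are not covered.

```latex
% Harmonic Centrality of node x: sum of reciprocal distances from all other
% nodes y to x, with 1/d(y, x) taken as 0 when x is unreachable from y.
H(x) = \sum_{y \neq x} \frac{1}{d(y, x)}

% PageRank of node x: each node y linking to x passes on its own score,
% divided by its outdegree, damped by \alpha.
PR(x) = \frac{1 - \alpha}{N} + \alpha \sum_{y \to x} \frac{PR(y)}{\operatorname{outdeg}(y)}
```

The first formula depends only on distances in the graph, while the second depends recursively on the scores of the linking nodes, which is the difference described in the two paragraphs above.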

PageRank is susceptible to manipulation (e.g., link farming or creating many interconnected spam pages). These artificial links can inflate the importance of a spam node. Harmonic Centrality is more robust against this kind of spam, because it is harder to 'game', or exploit through artificial link patterns.

Statistics Plots

The following plots show Web Graph statistics for all previous releases. Plotted statistics: nodes, arcs, successoravggap, avglocality, maxoutdegree, dangling, percdangling, avgoutdegree, successoravglogdelta, maxindegree, avgindegree, sccs, maxsccsize, percmaxscc, percminscc.

Download Data

domain.tsv host.tsv

Credits

Web Data Commons, for their web graph data set and everything related.
Common Search: we first used their web graph to expand the crawler frontier, and Common Search's cosr-back project was an important source of inspiration for how to process our data using PySpark.
The authors of the WebGraph framework, whose software simplifies the computation of rankings.
This project is maintained by Common Crawl. View the project on GitHub.