Open Pulse Crawler

A powerful GitHub crawler based on a breadth-first search (BFS) strategy that discovers and maps relationships between users, organizations, and repositories.

Features

  • 🔍 BFS Crawling: Discovers GitHub entities layer by layer from initial seed nodes
  • 🔄 Multi-Token Support: Use multiple GitHub tokens for higher rate limits
  • 💾 Smart Caching: Avoids redundant API calls with file-based caching
  • 📊 Multiple Output Formats: Export data as JSON and CSV (edges & nodes)
  • 📈 Visualization: Generate network graphs with color-coded node types
  • ⏸️ State Management: Save and resume crawler state
  • 📝 Rich Logging: Timestamped logs with progress tracking and statistics
  • 📉 Progress Tracking: Real-time progress bars with percentage, ETA, and statistics using tqdm
  • 🎯 Relationship Mapping: Tracks "owner of", "contributor of", "member of", "fork of", and "parent of" relationships
  • 🚦 Intelligent Rate Limiting: Adaptive rate-limit management with semaphores, delays, and multi-token rotation
  • ⚡ Concurrent Control: Configurable request throttling to prevent API abuse

Installation

Using uv (recommended):

# Install the package
uv pip install -e .

# With visualization support
uv pip install -e ".[viz]"

# With development tools
uv pip install -e ".[dev,viz]"

Using pip:

pip install -e .
# or with visualization
pip install -e ".[viz]"

Configuration

Set your GitHub personal access token(s) in the environment:

# Single token
export GITHUB_TOKEN="ghp_your_token_here"

# Multiple tokens (comma-separated for better rate limits)
export GITHUB_TOKEN="ghp_token1,ghp_token2,ghp_token3"

Alternatively, you can create a .env file in your project directory:

GITHUB_TOKEN=ghp_your_token_here
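
If you need to consume these tokens in your own scripts, the comma-separated convention can be parsed in a couple of lines (an illustrative sketch, not the crawler's internal code):

import os

# Read one or more comma-separated tokens from the environment.
raw = os.environ.get("GITHUB_TOKEN", "")
tokens = [t.strip() for t in raw.split(",") if t.strip()]
if not tokens:
    raise RuntimeError("GITHUB_TOKEN is not set")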

Usage

Basic Usage

Crawl from command-line seeds:

open-pulse-crawler crawl caviri sdsc-ordes/gimie --rounds 2

Using a Seed File

Create a seeds.txt file:

caviri
sdsc-ordes/gimie
https://github.com/torvalds/linux
torvalds

Run the crawler:

open-pulse-crawler crawl --seed-file seeds.txt --rounds 3

Advanced Options

open-pulse-crawler crawl \
  --seed-file seeds.txt \
  --rounds 3 \
  --output-dir ./results \
  --cache-dir ./cache \
  --state-file state.json \
  --visualize \
  --visualize-clusters \
  --verbose

Resume from Saved State

open-pulse-crawler crawl --resume --state-file state.json
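
Conceptually, the state file only needs to capture the visited set, the BFS queue, and the current round so a crawl can pick up where it left off. A minimal sketch of that idea (the field names here are illustrative, not the crawler's actual schema):

import json

def save_state(path, visited, queue, round_num):
    # Persist the BFS frontier so the crawl can resume later.
    with open(path, "w") as f:
        json.dump({"visited": sorted(visited), "queue": list(queue),
                   "round": round_num}, f)

def load_state(path):
    with open(path) as f:
        data = json.load(f)
    return set(data["visited"]), data["queue"], data["round"]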

Command-Line Options

Basic Options

  • seeds: Initial seed nodes (users, orgs, or repos)
  • --seed-file, -f: Path to file containing seed nodes (one per line)
  • --rounds, -r: Number of BFS rounds to perform (default: 3)
  • --output-dir, -o: Directory for output files (default: ./output)
  • --cache-dir, -c: Directory for caching API responses
  • --state-file, -s: File to save/load crawler state
  • --resume: Resume from saved state file
  • --no-json: Skip JSON output
  • --no-csv: Skip CSV output
  • --visualize, -v: Generate graph visualization (PNG)
  • --verbose: Enable verbose logging

Rate Limiting Options (New!)

  • --request-delay: Minimum delay in seconds between API requests (default: 0.0)
  • --max-concurrent: Maximum number of concurrent API requests (default: 5)
  • --rate-limit-buffer: Number of API requests to keep in reserve before pausing (default: 50)

See RATE_LIMITING.md for a detailed guide on rate limiting and API management.
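
To illustrate how the three options above interact, a throttled request wrapper might look roughly like this (a simplified sketch; the real adaptive behavior is described in RATE_LIMITING.md):

import threading
import time

semaphore = threading.Semaphore(5)   # --max-concurrent
REQUEST_DELAY = 0.5                  # --request-delay
RATE_LIMIT_BUFFER = 50               # --rate-limit-buffer

def throttled_request(send, remaining, reset_time):
    # Cap the number of requests in flight at once.
    with semaphore:
        # If close to the rate limit, sleep until the window resets.
        if remaining <= RATE_LIMIT_BUFFER:
            time.sleep(max(0.0, reset_time - time.time()))
        time.sleep(REQUEST_DELAY)    # enforce minimum spacing between calls
        return send()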

Output Formats

JSON Output

Complete graph data with all discovered entities:

{
  "users": {
    "caviri": {
      "login": "caviri",
      "name": "Carlos Vivar",
      "id": 12345,
      "type": "User",
      "authored_repositories": ["caviri/repo1"],
      "forked_repositories": []
    }
  },
  "orgs": {...},
  "repos": {...}
}
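
The JSON output is easy to post-process with the standard library. For example, to list each user with their authored repositories (the filename graph.json is an assumption; check your --output-dir for the actual name):

import json

with open("output/graph.json") as f:
    graph = json.load(f)

for login, user in graph["users"].items():
    print(login, user.get("authored_repositories", []))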

CSV Output (Edges)

Relationships between entities:

source,target,property,source_type,target_type
caviri,caviri/repo1,owner of,user,repo
user1,org1,member of,user,org
repo1,repo2,parent of,repo,repo

CSV Output (Nodes)

All discovered nodes:

id,name,type,is_seed
caviri,Carlos Vivar,user,true
sdsc-ordes/gimie,gimie,repo,true
torvalds,Linus Torvalds,user,false
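
The two CSV files map directly onto a directed graph. With networkx they can be reloaded like this (the filenames nodes.csv and edges.csv are assumptions; adjust to the files in your --output-dir):

import csv
import networkx as nx

G = nx.DiGraph()

with open("output/nodes.csv") as f:
    for row in csv.DictReader(f):
        G.add_node(row["id"], name=row["name"], type=row["type"],
                   is_seed=row["is_seed"] == "true")

with open("output/edges.csv") as f:
    for row in csv.DictReader(f):
        G.add_edge(row["source"], row["target"], property=row["property"])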

Visualization

When --visualize is enabled, generates a PNG image with:

  • Color-coded nodes (users=blue, orgs=red, repos=green)
  • Seed nodes shown as squares
  • Regular nodes shown as circles
  • Directed edges showing relationships
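
If you prefer to restyle the graph yourself, the color scheme above takes only a few lines of networkx and matplotlib (a sketch that reuses the G built from the CSVs in the previous section):

import matplotlib.pyplot as plt
import networkx as nx

COLORS = {"user": "tab:blue", "org": "tab:red", "repo": "tab:green"}
node_colors = [COLORS[G.nodes[n]["type"]] for n in G]

nx.draw(G, node_color=node_colors, with_labels=True, font_size=6)
plt.savefig("graph.png", dpi=300)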

How It Works

  1. Seed Parsing: Accepts GitHub URLs, usernames, or org/repo identifiers
  2. BFS Expansion: For each round:
    • Processes all nodes in the current level
    • Discovers connected entities (repos, members, contributors)
    • Adds new entities to the queue for the next round
  3. Relationship Mapping:
    • Users/Orgs → Repos: "owner of" or "contributor of"
    • Users → Orgs: "member of"
    • Repos → Repos: "parent of" (for forks)
  4. Caching: Stores API responses to avoid redundant calls
  5. Rate Limiting: Automatically handles GitHub API rate limits with token rotation
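
Stripped of caching, typing, and rate-limit handling, the core loop is a textbook level-by-level BFS (a simplified sketch of the idea, not the actual code in crawler.py):

from collections import deque

def crawl(seeds, rounds, fetch_neighbors):
    # fetch_neighbors(node) -> iterable of (neighbor, relationship) pairs
    visited, edges = set(seeds), []
    queue = deque(seeds)
    for _ in range(rounds):
        for _ in range(len(queue)):      # process exactly one BFS level
            node = queue.popleft()
            for neighbor, rel in fetch_neighbors(node):
                edges.append((node, neighbor, rel))
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
    return visited, edges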

Project Structure

src/open_pulse_crawler/
├── __init__.py          # Package initialization
├── models.py            # Pydantic models for GitHub entities
├── github_client.py     # GitHub API client with caching
├── crawler.py           # BFS crawler core logic
├── io_utils.py          # Input/output handlers
├── visualization.py     # Graph visualization
└── cli.py               # Command-line interface

Progress Tracking

The crawler now includes real-time progress tracking with tqdm and human-readable timestamps:

  • Overall round progress: Shows completion percentage and ETA across all rounds
  • Per-round progress: Displays node processing progress within each round
  • Live statistics: Real-time updates of nodes, users, orgs, repos, and queue size
  • Timestamps: Start time, end time, and duration in human-readable format
  • Round timestamps: See when each BFS round begins

Example progress output:

🚀 Crawl started at 2025-10-02 14:30:15
📊 Target: 3 rounds

Overall Progress:  67%|████████████▋      | 2/3 [00:45<00:22] nodes=156 users=12 orgs=3 repos=141 queue=234
Round 2 [14:30:47]:   100%|████████████████████| 156/156 [00:18<00:00,  8.67node/s]

✅ Crawl completed at 2025-10-02 14:31:38
⏱️  Total duration: 1m 23s
📦 Collected: 56 users, 8 orgs, 170 repos

See PROGRESS_TRACKING.md and TIMESTAMPS.md for more details.
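
The live statistics in the bars above follow a standard tqdm pattern; a minimal sketch of it (not the crawler's exact code):

from tqdm import tqdm

level = range(156)  # stand-in for the nodes in the current BFS round
with tqdm(level, desc="Round 2", unit="node") as bar:
    for node in bar:
        # ... process the node here ...
        bar.set_postfix(users=12, orgs=3, repos=141)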

Statistics and Monitoring

The crawler provides detailed statistics:

  • Nodes processed per round
  • API calls made and cache hits
  • Rate limit waits and token switches
  • Time taken per round
  • Total entities discovered

Example output:

╭─────────────────────────── Crawl Statistics ───────────────────────────╮
│ Metric                      │ Value                                     │
├─────────────────────────────┼───────────────────────────────────────────┤
│ Rounds Completed            │ 3                                         │
│ Total Nodes Visited         │ 150                                       │
│ Users Discovered            │ 45                                        │
│ Organizations Discovered    │ 12                                        │
│ Repositories Discovered     │ 93                                        │
│ API Calls Made              │ 200                                       │
│ Cache Hits                  │ 50                                        │
╰─────────────────────────────┴───────────────────────────────────────────╯

License

Apache 2.0

Author

caviri
