A powerful GitHub crawler based on breadth-first search (BFS) strategy to discover and map relationships between users, organizations, and repositories.
- **BFS Crawling**: Discovers GitHub entities layer by layer from initial seed nodes
- **Multi-Token Support**: Use multiple GitHub tokens for higher rate limits
- **Smart Caching**: Avoids redundant API calls with file-based caching
- **Multiple Output Formats**: Export data as JSON and CSV (edges & nodes)
- **Visualization**: Generate network graphs with color-coded node types
- **State Management**: Save and resume crawler state
- **Rich Logging**: Timestamped logs with progress tracking and statistics
- **Progress Tracking**: Real-time progress bars with percentage, ETA, and statistics using tqdm
- **Relationship Mapping**: Tracks "owner of", "contributor of", "member of", "fork of", and "parent of" relationships
- **Intelligent Rate Limiting**: Adaptive rate limit management with semaphores, delays, and multi-token rotation
- **Concurrent Control**: Configurable request throttling to prevent API abuse
Using uv (recommended):

```bash
# Install the package
uv pip install -e .

# With visualization support
uv pip install -e ".[viz]"

# With development tools
uv pip install -e ".[dev,viz]"
```

Using pip:

```bash
pip install -e .

# or with visualization
pip install -e ".[viz]"
```

Set your GitHub personal access token(s) in the environment:
```bash
# Single token
export GITHUB_TOKEN="ghp_your_token_here"

# Multiple tokens (comma-separated for better rate limits)
export GITHUB_TOKEN="ghp_token1,ghp_token2,ghp_token3"
```

Alternatively, you can create a `.env` file in your project directory:

```
GITHUB_TOKEN=ghp_your_token_here
```

Crawl from command-line seeds:
```bash
open-pulse-crawler crawl caviri sdsc-ordes/gimie --rounds 2
```

Alternatively, create a `seeds.txt` file:
```
caviri
sdsc-ordes/gimie
https://github.com/torvalds/linux
torvalds
```

Run the crawler:

```bash
open-pulse-crawler crawl --seed-file seeds.txt --rounds 3
```

A full invocation with the common options:

```bash
open-pulse-crawler crawl \
    --seed-file seeds.txt \
    --rounds 3 \
    --output-dir ./results \
    --cache-dir ./cache \
    --state-file state.json \
    --visualize \
    --visualize-clusters \
    --verbose
```

Resume an interrupted crawl from a saved state file:

```bash
open-pulse-crawler crawl --resume --state-file state.json
```

Options:

- `seeds`: Initial seed nodes (users, orgs, or repos)
- `--seed-file, -f`: Path to a file containing seed nodes (one per line)
- `--rounds, -r`: Number of BFS rounds to perform (default: 3)
- `--output-dir, -o`: Directory for output files (default: `./output`)
- `--cache-dir, -c`: Directory for caching API responses
- `--state-file, -s`: File to save/load crawler state
- `--resume`: Resume from a saved state file
- `--no-json`: Skip JSON output
- `--no-csv`: Skip CSV output
- `--visualize, -v`: Generate a graph visualization (PNG)
- `--verbose`: Enable verbose logging
- `--request-delay`: Minimum delay in seconds between API requests (default: 0.0)
- `--max-concurrent`: Maximum number of concurrent API requests (default: 5)
- `--rate-limit-buffer`: Number of requests to keep in reserve before waiting (default: 50)
See RATE_LIMITING.md for a detailed guide to rate limiting and API management.
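The throttling controlled by `--request-delay` and `--max-concurrent` can be sketched with a semaphore plus a minimum inter-request delay. This is an illustrative simplification, not the crawler's actual API; the `RateLimiter` name and `fetch` helper are hypothetical:

```python
import threading
import time


class RateLimiter:
    """Allow at most `max_concurrent` requests in flight, and enforce
    a minimum `request_delay` (seconds) between request starts."""

    def __init__(self, max_concurrent: int = 5, request_delay: float = 0.0):
        self._sem = threading.Semaphore(max_concurrent)
        self._delay = request_delay
        self._last = 0.0
        self._lock = threading.Lock()

    def __enter__(self):
        self._sem.acquire()
        with self._lock:
            wait = self._delay - (time.monotonic() - self._last)
            if wait > 0:
                time.sleep(wait)  # honor the minimum delay between requests
            self._last = time.monotonic()
        return self

    def __exit__(self, *exc):
        self._sem.release()


limiter = RateLimiter(max_concurrent=2, request_delay=0.1)

def fetch(url: str) -> str:
    with limiter:  # blocks until a slot is free and the delay has elapsed
        return f"fetched {url}"  # a real client would call the GitHub API here

print(fetch("https://api.github.com/users/caviri"))
```

Multi-token rotation would sit on top of this: when one token's quota drops below the buffer, the client switches to the next token instead of sleeping.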
Complete graph data with all discovered entities:
```json
{
  "users": {
    "caviri": {
      "login": "caviri",
      "name": "Carlos Vivar",
      "id": 12345,
      "type": "User",
      "authored_repositories": ["caviri/repo1"],
      "forked_repositories": []
    }
  },
  "orgs": {...},
  "repos": {...}
}
```

Relationships between entities:
```csv
source,target,property,source_type,target_type
caviri,caviri/repo1,owner of,user,repo
user1,org1,member of,user,org
repo1,repo2,parent of,repo,repo
```

All discovered nodes:
```csv
id,name,type,is_seed
caviri,Carlos Vivar,user,true
sdsc-ordes/gimie,gimie,repo,true
torvalds,Linus Torvalds,user,false
```

When `--visualize` is enabled, the crawler generates a PNG image with:
- Color-coded nodes (users=blue, orgs=red, repos=green)
- Seed nodes shown as squares
- Regular nodes shown as circles
- Directed edges showing relationships
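A rendering like the one described above can be approximated with networkx and matplotlib (the `viz` extra). This is a sketch over a toy two-node graph, not the crawler's actual `visualization.py` code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph standing in for real crawler output
G = nx.DiGraph()
G.add_node("caviri", kind="user", seed=True)
G.add_node("caviri/repo1", kind="repo", seed=False)
G.add_edge("caviri", "caviri/repo1", property="owner of")

colors = {"user": "tab:blue", "org": "tab:red", "repo": "tab:green"}
pos = nx.spring_layout(G, seed=42)

# Seeds as squares, regular nodes as circles, color-coded by type
seeds = [n for n in G if G.nodes[n]["seed"]]
others = [n for n in G if not G.nodes[n]["seed"]]
nx.draw_networkx_nodes(G, pos, nodelist=seeds, node_shape="s",
                       node_color=[colors[G.nodes[n]["kind"]] for n in seeds])
nx.draw_networkx_nodes(G, pos, nodelist=others, node_shape="o",
                       node_color=[colors[G.nodes[n]["kind"]] for n in others])
nx.draw_networkx_edges(G, pos, arrows=True)
nx.draw_networkx_labels(G, pos)
plt.savefig("graph.png")
```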
- **Seed Parsing**: Accepts GitHub URLs, usernames, or `org/repo` identifiers
- **BFS Expansion**: For each round:
  - Processes all nodes in the current level
  - Discovers connected entities (repos, members, contributors)
  - Adds new entities to the queue for the next round
- **Relationship Mapping**:
  - Users/Orgs → Repos: "owner of" or "contributor of"
  - Users → Orgs: "member of"
  - Repos → Repos: "parent of" (for forks)
- **Caching**: Stores API responses to avoid redundant calls
- **Rate Limiting**: Automatically handles GitHub API rate limits with token rotation
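The expansion step amounts to a standard level-by-level BFS over the GitHub entity graph. A minimal sketch, where the `neighbors` callable stands in for the real API client:

```python
from collections import deque


def bfs_crawl(seeds, neighbors, rounds=3):
    """Breadth-first expansion: process one level per round, recording
    edges and queueing newly discovered entities for the next round."""
    visited = set(seeds)
    level = deque(seeds)
    edges = []
    for _ in range(rounds):
        next_level = deque()
        while level:
            node = level.popleft()
            for target, relation in neighbors(node):
                edges.append((node, target, relation))
                if target not in visited:
                    visited.add(target)
                    next_level.append(target)
        level = next_level
    return visited, edges


# Toy neighbor function standing in for GitHub API lookups
graph = {
    "caviri": [("caviri/repo1", "owner of")],
    "caviri/repo1": [("upstream/repo1", "fork of")],
}
visited, edges = bfs_crawl(["caviri"], lambda n: graph.get(n, []), rounds=2)
print(sorted(visited))  # all entities reachable within 2 rounds
```

The `visited` set doubles as the deduplication check: an entity reached through several paths is queued only once, though every discovered edge is still recorded.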
```
src/open_pulse_crawler/
├── __init__.py          # Package initialization
├── models.py            # Pydantic models for GitHub entities
├── github_client.py     # GitHub API client with caching
├── crawler.py           # BFS crawler core logic
├── io_utils.py          # Input/output handlers
├── visualization.py     # Graph visualization
└── cli.py               # Command-line interface
```
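The file-based caching in `github_client.py` can be approximated as a response store keyed by a hash of the request. This is an illustrative sketch, not the module's actual interface; `cached_get` is a hypothetical name:

```python
import hashlib
import json
from pathlib import Path


def cached_get(url: str, fetch, cache_dir: Path) -> dict:
    """Return a cached API response if present; otherwise fetch and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():            # cache hit: no API call made
        return json.loads(path.read_text())
    data = fetch(url)            # cache miss: call the API once
    path.write_text(json.dumps(data))
    return data
```

Because the cache lives on disk (under `--cache-dir`), repeated or resumed crawls reuse responses across runs instead of re-spending rate limit quota.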
The crawler now includes real-time progress tracking with tqdm and human-readable timestamps:
- Overall round progress: Shows completion percentage and ETA across all rounds
- Per-round progress: Displays node processing progress within each round
- Live statistics: Real-time updates of nodes, users, orgs, repos, and queue size
- Timestamps: Start time, end time, and duration in human-readable format
- Round timestamps: See when each BFS round begins
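The nested overall/per-round bars can be reproduced with tqdm's postfix support. A minimal sketch with fabricated round data:

```python
from tqdm import tqdm

# Fabricated BFS levels standing in for real crawl rounds
rounds = [["caviri"], ["caviri/repo1", "sdsc-ordes/gimie"]]
stats = {"nodes": 0}

overall = tqdm(total=len(rounds), desc="Overall Progress")
for i, level in enumerate(rounds, start=1):
    for node in tqdm(level, desc=f"Round {i}", unit="node", leave=False):
        stats["nodes"] += 1  # a real crawler would process the node here
    overall.set_postfix(nodes=stats["nodes"])  # live statistics in the bar
    overall.update(1)
overall.close()
```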
Example progress output:
```
🚀 Crawl started at 2025-10-02 14:30:15
🎯 Target: 3 rounds
Overall Progress:  67%|█████████████       | 2/3 [00:45<00:22] nodes=156 users=12 orgs=3 repos=141 queue=234
Round 2 [14:30:47]: 100%|████████████████████| 156/156 [00:18<00:00, 8.67node/s]
✅ Crawl completed at 2025-10-02 14:31:38
⏱️ Total duration: 1m 23s
📦 Collected: 56 users, 8 orgs, 170 repos
```
See PROGRESS_TRACKING.md and TIMESTAMPS.md for more details.
The crawler provides detailed statistics:
- Nodes processed per round
- API calls made and cache hits
- Rate limit waits and token switches
- Time taken per round
- Total entities discovered
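A statistics table like the example below can be rendered with the Rich library (an illustrative sketch; the values are copied from the example output, not computed):

```python
from rich.console import Console
from rich.table import Table

stats = {
    "Rounds Completed": 3,
    "Total Nodes Visited": 150,
    "API Calls Made": 200,
    "Cache Hits": 50,
}

table = Table(title="Crawl Statistics")
table.add_column("Metric")
table.add_column("Value", justify="right")
for metric, value in stats.items():
    table.add_row(metric, str(value))  # one row per collected metric

Console().print(table)
```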
Example output:
```
╭──────── Crawl Statistics ─────────╮
│ Metric                    │ Value │
├───────────────────────────┼───────┤
│ Rounds Completed          │     3 │
│ Total Nodes Visited       │   150 │
│ Users Discovered          │    45 │
│ Organizations Discovered  │    12 │
│ Repositories Discovered   │    93 │
│ API Calls Made            │   200 │
│ Cache Hits                │    50 │
╰───────────────────────────┴───────╯
```
Apache 2.0
caviri