A tool that maps the top PyPI packages to their GitHub repositories for batch analysis and ingestion. Perfect for creating datasets of popular Python projects, dependency analysis, or feeding repository lists into code analysis tools. Developed as part of PRSM for OSE.
Available in both JavaScript (Node.js) and Python! Choose your preferred implementation - both have identical features and output formats.
- Batch Processing: Process thousands of PyPI packages automatically
- Smart Mapping: Extracts GitHub URLs from PyPI package metadata
- Checkpoint System: Automatic checkpoints every N packages (configurable)
- Resume Support: Resume from previous checkpoint if interrupted
- Configurable: Customize count, rate limits, and output via CLI arguments
- Retry Logic: Automatic retry with exponential backoff for network failures
- Multiple Formats: Outputs both JSON (with metadata) and CSV (for batch upload)
- Progress Tracking: Real-time progress display with success/failure indicators
- Error Handling: Robust error handling with detailed logging
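The retry behaviour listed above can be illustrated with a small sketch. This is not the tool's actual code: the function name `fetch_with_retry`, the default of 3 retries, the 1-second base delay, and the 30-second timeout are all assumptions chosen for illustration. The `opener` and `sleep` parameters exist only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_retries=3, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying network failures with exponential backoff.

    Hypothetical helper: name, defaults, and timeout are illustrative only.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url, timeout=30) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            # Note: HTTPError is a URLError subclass, so HTTP errors are
            # retried too. A fuller version might skip retries for 4xx.
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a request that keeps failing waits 1, 2, and 4 seconds between attempts before the error is raised.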
Choose one:
JavaScript version:
- Node.js >= 14.0.0
Python version:
- Python >= 3.7
Both versions:
- Internet connection (for PyPI API access)
- No external dependencies required!
# Clone the repository
git clone https://github.com/yourusername/software-finder.git
cd software-finder
# No dependencies to install!
# JavaScript version uses only Node.js built-in modules
# Python version uses only Python standard library
Both versions have identical command-line options and produce the same output. Choose your preferred language!
Process the top 5,000 PyPI packages (default):
JavaScript:
node generate-python-repo-list.js
Python:
python generate-python-repo-list.py
JavaScript:
# Process only top 1,000 packages
node generate-python-repo-list.js --count=1000
# Adjust rate limit to 5 requests per second
node generate-python-repo-list.js --rate=5
# Custom checkpoint interval (every 100 packages)
node generate-python-repo-list.js --checkpoint=100
# Custom output name
node generate-python-repo-list.js --output=my-python-repos
# Combine multiple options
node generate-python-repo-list.js --count=1000 --rate=5 --output=top-1k
Python:
# Process only top 1,000 packages
python generate-python-repo-list.py --count 1000
# Adjust rate limit to 5 requests per second
python generate-python-repo-list.py --rate 5
# Custom checkpoint interval (every 100 packages)
python generate-python-repo-list.py --checkpoint 100
# Custom output name
python generate-python-repo-list.py --output my-python-repos
# Combine multiple options
python generate-python-repo-list.py --count 1000 --rate 5 --output top-1k
If the process is interrupted, resume from the last checkpoint:
JavaScript:
node generate-python-repo-list.js --resume=python-repos-top-5000-checkpoint-500.json
Python:
python generate-python-repo-list.py --resume python-repos-top-5000-checkpoint-500.json
Using npm scripts (JavaScript version):
npm start
# or
npm run generate

| Option | Description | Default |
|---|---|---|
| --count=N | Number of packages to process | 5000 |
| --rate=N | Requests per second (rate limit) | 10 |
| --checkpoint=N | Save checkpoint every N packages | 500 |
| --resume=FILE | Resume from checkpoint file | None |
| --output=NAME | Output file base name | python-repos-top-N |
| --help, -h | Show help message | - |
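For readers who want to see how these options could be declared, here is a minimal `argparse` sketch. It mirrors the option names and defaults from the table but is not taken from the tool's source. The function names are hypothetical, and reading the `N` in `python-repos-top-N` as the package count is an assumption based on the example file names in this README.

```python
import argparse


def build_parser():
    """Declare the documented options (illustrative, not the tool's parser)."""
    parser = argparse.ArgumentParser(
        description="Map top PyPI packages to GitHub repositories")
    parser.add_argument("--count", type=int, default=5000,
                        help="number of packages to process")
    parser.add_argument("--rate", type=float, default=10,
                        help="requests per second (rate limit)")
    parser.add_argument("--checkpoint", type=int, default=500,
                        help="save checkpoint every N packages")
    parser.add_argument("--resume", metavar="FILE", default=None,
                        help="resume from checkpoint file")
    parser.add_argument("--output", metavar="NAME", default=None,
                        help="output file base name")
    return parser


def parse_options(argv):
    """Parse argv and fill in the count-dependent default output name."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Assumption: the default base name embeds the package count.
        args.output = f"python-repos-top-{args.count}"
    return args
```

`argparse` accepts both `--count 1000` and `--count=1000`, and it provides `--help` and `-h` automatically.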
The tool generates several output files:
- {output}.json: Full results with metadata
  - Package names, download counts, GitHub URLs
  - Processing statistics and configuration
  - Timestamp and source information
- {output}.csv: Simple CSV with GitHub URLs only
  - Perfect for batch uploads to analysis tools
  - One URL per line with header row
- {output}-failed.json: List of packages without GitHub URLs
  - Useful for manual review or alternative mapping strategies
- {output}-checkpoint-{N}.json: Automatic checkpoints
  - Created every N packages (configurable)
  - Can be used to resume interrupted processing
  - Automatically saved during long-running operations
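Because a long run leaves several checkpoint files behind, it can help to pick the most recent one programmatically before resuming. The sketch below relies only on the `{output}-checkpoint-{N}.json` naming pattern described above. The helper name `latest_checkpoint` is hypothetical and not part of the tool.

```python
import re
from pathlib import Path


def latest_checkpoint(directory, output_name):
    """Return the checkpoint Path with the highest N, or None if none exist."""
    pattern = re.compile(re.escape(output_name) + r"-checkpoint-(\d+)\.json")
    candidates = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            # Compare N numerically: as strings, "500" sorts after "4500".
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can then be passed to `--resume`.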
- Fetch Package List: Downloads the top PyPI packages by 30-day download count from hugovk/top-pypi-packages
- Query Package Metadata: For each package, queries the PyPI JSON API to extract project metadata
- Extract GitHub URLs: Searches multiple metadata fields for GitHub repository URLs:
  - project_urls.Source
  - project_urls.Repository
  - project_urls.Homepage
  - home_page
  - And several other fallbacks
- Clean and Validate: Cleans URLs (removes .git, trailing slashes) and validates format
- Output Results: Saves to JSON and CSV formats with detailed statistics
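The extraction and cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual algorithm. The function names and the regular expression are hypothetical, and scanning every remaining `project_urls` value is only a stand-in for the unspecified "other fallbacks". The input is the `info` object from a PyPI JSON API response.

```python
import re

# Assumed validation rule: an http(s) GitHub URL with an owner and a repo.
GITHUB_REPO = re.compile(
    r"https?://(?:www\.)?github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)
PREFERRED_KEYS = ("Source", "Repository", "Homepage")


def clean_github_url(url):
    """Normalise a GitHub URL to https://github.com/owner/repo, else None."""
    if not url:
        return None
    match = GITHUB_REPO.match(url.strip())
    if not match:
        return None
    owner, repo = match.group(1), match.group(2)
    if repo.endswith(".git"):
        repo = repo[:-4]  # drop the clone suffix
    if not repo:
        return None
    # Anything after owner/repo (trailing slash, /issues, ...) is discarded.
    return f"https://github.com/{owner}/{repo}"


def extract_github_url(info):
    """Try the documented metadata fields in order, then any other URL."""
    project_urls = info.get("project_urls") or {}  # may be null in the API
    candidates = [project_urls.get(key) for key in PREFERRED_KEYS]
    candidates.append(info.get("home_page"))
    candidates.extend(project_urls.values())  # stand-in for other fallbacks
    for candidate in candidates:
        cleaned = clean_github_url(candidate)
        if cleaned:
            return cleaned
    return None
```

A package whose metadata contains no matching URL yields `None`, which corresponds to an entry in `{output}-failed.json`.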
PyPI to GitHub Repository Mapper
==================================================
Target packages: 5000
Rate limit: 10 req/sec
Checkpoint interval: 500 packages
==================================================
📦 Fetching top PyPI packages...
✅ Found 8000 available packages
Mapping 5000 packages to GitHub repositories...
⏱️ Estimated time: ~9 minutes
Legend:
✅ = Found GitHub URL
❌ = No GitHub URL in PyPI metadata
[1/5000] ✅ boto3 → https://github.com/boto/boto3
[2/5000] ✅ urllib3 → https://github.com/urllib3/urllib3
[3/5000] ✅ botocore → https://github.com/boto/botocore
...
{
"generated_at": "2025-10-31T12:00:00.000Z",
"config": {
"target_count": 5000,
"rate_limit": 10
},
"stats": {
"total_repos": 4235,
"failed_count": 765,
"success_rate": "84.7%",
"processing_time_minutes": 8.3
},
"source": "pypi_top_downloads_30d",
"repositories": [
{
"github_url": "https://github.com/boto/boto3",
"package_name": "boto3",
"downloads_30d": 432165789,
"source": "pypi_top_downloads_30d"
}
]
}
- Rate Limiting: Default 10 req/sec is safe for PyPI. Adjust with --rate if needed
- Processing Time: ~8-10 minutes for 5,000 packages at default rate
- Memory Usage: Minimal - processes packages sequentially
- Network: Requires stable internet connection
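The processing-time figure follows directly from the rate limit: 5,000 requests at 10 per second take at least 500 seconds, or about 8.3 minutes. The sketch below shows that arithmetic alongside one simple way to space out sequential requests. The `RateLimiter` class and `estimated_minutes` function are hypothetical names, and the tool's real limiter may work differently.

```python
import time


def estimated_minutes(package_count, requests_per_second):
    """Lower bound on run time: one request per package at the given rate."""
    return package_count / requests_per_second / 60


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, requests_per_second,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `limiter.wait()` before each request keeps a sequential loop at or below the configured rate. Actual run time is somewhat higher than the estimate because each request also has its own latency.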
If you encounter network errors, try reducing the rate limit:
JavaScript:
node generate-python-repo-list.js --rate=5
Python:
python generate-python-repo-list.py --rate 5
Always use the latest checkpoint to resume:
JavaScript:
node generate-python-repo-list.js --resume=python-repos-top-5000-checkpoint-4500.json
Python:
python generate-python-repo-list.py --resume python-repos-top-5000-checkpoint-4500.json
Some packages don't have GitHub URLs in their metadata. This is expected and the tool will:
- Mark them with ❌ in console output
- Save them to {output}-failed.json for reference
- Continue processing remaining packages
- Repository Analysis: Feed URLs into code analysis tools
- Dependency Mapping: Build dependency graphs of popular packages
- Trend Analysis: Track popular Python projects over time
- Research: Study characteristics of widely-used Python libraries
- Security Auditing: Batch-analyze popular packages for vulnerabilities
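As a starting point for these use cases, the JSON output can be loaded and filtered before handing URLs to another tool. The sketch below assumes the structure shown in the sample JSON output (`repositories` entries with `github_url` and `downloads_30d`). The function names and the download threshold are illustrative only.

```python
import json


def load_report(path):
    """Read a {output}.json results file produced by the tool."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def top_repositories(report, min_downloads=0):
    """Return GitHub URLs with at least min_downloads, most downloaded first."""
    repos = [
        repo for repo in report.get("repositories", [])
        if repo.get("downloads_30d", 0) >= min_downloads
    ]
    repos.sort(key=lambda repo: repo.get("downloads_30d", 0), reverse=True)
    return [repo["github_url"] for repo in repos]
```

For example, `top_repositories(load_report("python-repos-top-5000.json"), min_downloads=100_000_000)` would keep only the most heavily downloaded projects, assuming a results file with that name exists.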
This project provides two implementations with identical functionality:
JavaScript version:
- Runtime: Node.js >= 14.0.0
- Dependencies: None (uses only built-in modules: https, fs, path)
- Best for: Teams already using Node.js, integration with npm workflows
- Lines of code: ~331
Python version:
- Runtime: Python >= 3.7
- Dependencies: None (uses only standard library)
- Best for: Python developers, integration with Python workflows
- Features: Type hints, comprehensive docstrings
- Lines of code: ~390
Both versions:
- Produce identical output formats (JSON, CSV)
- Support all the same command-line options
- Have the same checkpoint format (interchangeable)
- Use the same algorithm for GitHub URL extraction
- Respect PyPI rate limits
Pro tip: You can start with one version and resume with the other! Checkpoint files are compatible across implementations.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- PyPI package data from hugovk/top-pypi-packages
- Inspired by the need for better Python ecosystem analysis tools
- libraries.io - Comprehensive package registry data
- deps.dev - Google's dependency analysis tool
- top-pypi-packages - Source of PyPI rankings
Note: This tool respects PyPI's rate limits and best practices. Please use responsibly.