jring-o/software-finder

PyPI to GitHub Repository Mapper

A tool that maps the top PyPI packages to their GitHub repositories for batch analysis and ingestion. Perfect for creating datasets of popular Python projects, dependency analysis, or feeding repository lists into code analysis tools. Developed as part of PRSM for OSE.

Available in both JavaScript (Node.js) and Python! Choose your preferred implementation - both have identical features and output formats.

Features

  • Batch Processing: Process thousands of PyPI packages automatically
  • Smart Mapping: Extracts GitHub URLs from PyPI package metadata
  • Checkpoint System: Automatic checkpoints every N packages (configurable)
  • Resume Support: Resume from previous checkpoint if interrupted
  • Configurable: Customize count, rate limits, and output via CLI arguments
  • Retry Logic: Automatic retry with exponential backoff for network failures
  • Multiple Formats: Outputs both JSON (with metadata) and CSV (for batch upload)
  • Progress Tracking: Real-time progress display with success/failure indicators
  • Error Handling: Robust error handling with detailed logging

Prerequisites

Choose one:

JavaScript version:

  • Node.js >= 14.0.0

Python version:

  • Python >= 3.7

Both versions:

  • Internet connection (for PyPI API access)
  • No external dependencies required!

Installation

# Clone the repository
git clone https://github.com/jring-o/software-finder.git
cd software-finder

# No dependencies to install!
# JavaScript version uses only Node.js built-in modules
# Python version uses only Python standard library

Usage

Both versions accept the same options and produce the same output. The examples below pass values as --option=value for JavaScript and --option value for Python.

Basic Usage

Process the top 5,000 PyPI packages (default):

JavaScript:

node generate-python-repo-list.js

Python:

python generate-python-repo-list.py

Custom Configuration

JavaScript:

# Process only top 1,000 packages
node generate-python-repo-list.js --count=1000

# Adjust rate limit to 5 requests per second
node generate-python-repo-list.js --rate=5

# Custom checkpoint interval (every 100 packages)
node generate-python-repo-list.js --checkpoint=100

# Custom output name
node generate-python-repo-list.js --output=my-python-repos

# Combine multiple options
node generate-python-repo-list.js --count=1000 --rate=5 --output=top-1k

Python:

# Process only top 1,000 packages
python generate-python-repo-list.py --count 1000

# Adjust rate limit to 5 requests per second
python generate-python-repo-list.py --rate 5

# Custom checkpoint interval (every 100 packages)
python generate-python-repo-list.py --checkpoint 100

# Custom output name
python generate-python-repo-list.py --output my-python-repos

# Combine multiple options
python generate-python-repo-list.py --count 1000 --rate 5 --output top-1k

Resume from Checkpoint

If the process is interrupted, resume from the last checkpoint:

JavaScript:

node generate-python-repo-list.js --resume=python-repos-top-5000-checkpoint-500.json

Python:

python generate-python-repo-list.py --resume python-repos-top-5000-checkpoint-500.json

Using npm Scripts (JavaScript only)

npm start
# or
npm run generate

Command Line Options

Option           Description                        Default
--count=N        Number of packages to process      5000
--rate=N         Requests per second (rate limit)   10
--checkpoint=N   Save checkpoint every N packages   500
--resume=FILE    Resume from checkpoint file        None
--output=NAME    Output file base name              python-repos-top-N
--help, -h       Show help message                  -
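
For readers who want to see how options like these are typically wired up, here is a minimal argparse sketch. It is not the project's actual parser: the function name parse_args and the choice of int for --rate are assumptions, and only the option names and defaults come from the table above.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the options listed in the table (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Map top PyPI packages to GitHub repositories"
    )
    parser.add_argument("--count", type=int, default=5000,
                        help="Number of packages to process")
    parser.add_argument("--rate", type=int, default=10,
                        help="Requests per second (rate limit)")
    parser.add_argument("--checkpoint", type=int, default=500,
                        help="Save checkpoint every N packages")
    parser.add_argument("--resume", default=None,
                        help="Resume from checkpoint file")
    parser.add_argument("--output", default=None,
                        help="Output file base name")
    args = parser.parse_args(argv)
    # Assumed rule: the default base name embeds the requested count.
    if args.output is None:
        args.output = f"python-repos-top-{args.count}"
    return args
```

argparse accepts both "--count 1000" and "--count=1000", and it adds --help and -h on its own.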

Output Files

The tool generates several output files:

Main Outputs

  • {output}.json: Full results with metadata

    • Package names, download counts, GitHub URLs
    • Processing statistics and configuration
    • Timestamp and source information
  • {output}.csv: Simple CSV with GitHub URLs only

    • Perfect for batch uploads to analysis tools
    • One URL per line with header row
  • {output}-failed.json: List of packages without GitHub URLs

    • Useful for manual review or alternative mapping strategies
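
As an illustration of consuming the CSV output, the sketch below reads the URLs back into a list. The helper name read_repo_urls is hypothetical, and the code only assumes what is stated above: one header row, then one URL per line.

```python
import csv
from typing import Iterable, List


def read_repo_urls(lines: Iterable[str]) -> List[str]:
    """Return the URLs from a one-column CSV, skipping the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # discard the header row, whatever its label is
    return [row[0] for row in reader if row]


if __name__ == "__main__":
    # Default output name for a 5,000-package run; adjust to your --output value.
    with open("python-repos-top-5000.csv", newline="", encoding="utf-8") as fh:
        for url in read_repo_urls(fh):
            print(url)
```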

Checkpoint Files

  • {output}-checkpoint-{N}.json: Automatic checkpoints
    • Created every N packages (configurable)
    • Can be used to resume interrupted processing
    • Automatically saved during long-running operations
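
The checkpoint idea can be sketched as follows. Only the file name pattern {output}-checkpoint-{N}.json comes from this README; the payload keys (processed, results, failed) and the function names are assumptions made for illustration and may not match the real checkpoint format.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def checkpoint_path(output: str, processed: int, directory: str = ".") -> Path:
    """Build the checkpoint file name from the output base name and count."""
    return Path(directory) / f"{output}-checkpoint-{processed}.json"


def save_checkpoint(output: str, processed: int, results: List[Dict[str, Any]],
                    failed: List[str], directory: str = ".") -> Path:
    """Write the state accumulated so far (hypothetical payload layout)."""
    path = checkpoint_path(output, processed, directory)
    payload = {"processed": processed, "results": results, "failed": failed}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_checkpoint(path: Path) -> Dict[str, Any]:
    """Read a checkpoint back so processing can continue after 'processed'."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

With a layout like this, resuming amounts to loading the file and skipping the first "processed" packages in the ranked list.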

How It Works

  1. Fetch Package List: Downloads the top PyPI packages by 30-day download count from hugovk/top-pypi-packages

  2. Query Package Metadata: For each package, queries the PyPI JSON API to extract project metadata

  3. Extract GitHub URLs: Searches multiple metadata fields for GitHub repository URLs:

    • project_urls.Source
    • project_urls.Repository
    • project_urls.Homepage
    • home_page
    • And several other fallbacks
  4. Clean and Validate: Cleans URLs (removes .git, trailing slashes) and validates format

  5. Output Results: Saves to JSON and CSV formats with detailed statistics
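
Steps 3 and 4 could look roughly like the sketch below. It is an approximation, not the project's code: the function names and the regular expression are assumptions, only the four fields listed above are checked (the "other fallbacks" are not reproduced), and the input is the "info" object of a PyPI JSON API response.

```python
import re
from typing import Any, Dict, Optional

# Matches https://github.com/<owner>/<repo> and ignores anything after the repo.
GITHUB_REPO = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def clean_github_url(url: str) -> Optional[str]:
    """Normalize a GitHub URL to https://github.com/owner/repo, or return None."""
    match = GITHUB_REPO.match(url.strip())
    if not match:
        return None
    owner, repo = match.groups()
    if repo.endswith(".git"):
        repo = repo[:-4]  # drop the ".git" suffix
    if not repo:
        return None
    return f"https://github.com/{owner}/{repo}"


def extract_github_url(info: Dict[str, Any]) -> Optional[str]:
    """Check the metadata fields in priority order and return the first hit."""
    project_urls = info.get("project_urls") or {}  # PyPI may return null here
    candidates = [project_urls.get(key)
                  for key in ("Source", "Repository", "Homepage")]
    candidates.append(info.get("home_page"))
    for candidate in candidates:
        if candidate:
            cleaned = clean_github_url(candidate)
            if cleaned:
                return cleaned
    return None
```

Stopping the match at the repository segment also handles trailing slashes and deep links such as an issues page.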

Example Output

Console Output

PyPI to GitHub Repository Mapper
==================================================
Target packages: 5000
Rate limit: 10 req/sec
Checkpoint interval: 500 packages
==================================================

📦 Fetching top PyPI packages...
✅ Found 8000 available packages

🔍 Mapping 5000 packages to GitHub repositories...
⏱️  Estimated time: ~9 minutes

Legend:
  ✅ = Found GitHub URL
  ❌ = No GitHub URL in PyPI metadata

[1/5000] ✅ boto3                                    → https://github.com/boto/boto3
[2/5000] ✅ urllib3                                  → https://github.com/urllib3/urllib3
[3/5000] ✅ botocore                                 → https://github.com/boto/botocore
...

JSON Output Structure

{
  "generated_at": "2025-10-31T12:00:00.000Z",
  "config": {
    "target_count": 5000,
    "rate_limit": 10
  },
  "stats": {
    "total_repos": 4235,
    "failed_count": 765,
    "success_rate": "84.7%",
    "processing_time_minutes": 8.3
  },
  "source": "pypi_top_downloads_30d",
  "repositories": [
    {
      "github_url": "https://github.com/boto/boto3",
      "package_name": "boto3",
      "downloads_30d": 432165789,
      "source": "pypi_top_downloads_30d"
    }
  ]
}
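
A short sketch of reading this structure from another script is shown below. The helper names are hypothetical. The success-rate formula (found repositories divided by found plus failed) is an assumption, though it reproduces the 84.7% in the sample above.

```python
import json
from typing import Any, Dict, List


def load_results(text: str) -> Dict[str, Any]:
    """Parse the JSON output produced by the tool."""
    return json.loads(text)


def repo_urls(data: Dict[str, Any]) -> List[str]:
    """Collect the GitHub URL of every entry under 'repositories'."""
    return [entry["github_url"] for entry in data.get("repositories", [])]


def success_rate(stats: Dict[str, Any]) -> str:
    """Recompute the success rate from total_repos and failed_count."""
    total = stats["total_repos"] + stats["failed_count"]
    if total == 0:
        return "0.0%"
    return f"{100 * stats['total_repos'] / total:.1f}%"
```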

Performance Considerations

  • Rate Limiting: Default 10 req/sec is safe for PyPI. Adjust with --rate if needed
  • Processing Time: ~8-10 minutes for 5,000 packages at default rate
  • Memory Usage: Minimal - processes packages sequentially
  • Network: Requires stable internet connection
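
The processing-time figure follows from simple arithmetic: one request per package, so 5,000 packages at 10 requests per second take 500 seconds, or about 8.3 minutes. The sketch below shows that calculation. The function names are hypothetical, and rounding up to whole minutes is an assumption about how an estimate such as "~9 minutes" might be displayed.

```python
import math


def request_interval(rate: float) -> float:
    """Seconds to wait between requests to stay at 'rate' requests per second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


def estimated_minutes(count: int, rate: float) -> float:
    """Lower bound on run time: one request per package at the given rate."""
    return count * request_interval(rate) / 60


if __name__ == "__main__":
    minutes = estimated_minutes(5000, 10)
    print(f"~{math.ceil(minutes)} minutes")
```

Halving the rate with --rate=5 doubles the estimate, so the same 5,000 packages would take about 17 minutes.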

Troubleshooting

Connection Errors

If you encounter network errors, try reducing the rate limit:

JavaScript:

node generate-python-repo-list.js --rate=5

Python:

python generate-python-repo-list.py --rate 5
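
The retry behavior listed under Features can be pictured with a sketch like the one below. It is a generic exponential-backoff loop, not the tool's own code: the function name, the attempt count, the base delay, and the choice to retry on OSError (the base class of urllib's URLError) are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, waiting base_delay * 2**attempt seconds after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

The sleep parameter exists only so the delays can be observed or skipped in tests.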

Interrupted Processing

Always use the latest checkpoint to resume:

JavaScript:

node generate-python-repo-list.js --resume=python-repos-top-5000-checkpoint-4500.json

Python:

python generate-python-repo-list.py --resume python-repos-top-5000-checkpoint-4500.json

No GitHub URL Found

Some packages don't have GitHub URLs in their metadata. This is expected and the tool will:

  • Mark them with ❌ in console output
  • Save them to {output}-failed.json for reference
  • Continue processing remaining packages

Use Cases

  • Repository Analysis: Feed URLs into code analysis tools
  • Dependency Mapping: Build dependency graphs of popular packages
  • Trend Analysis: Track popular Python projects over time
  • Research: Study characteristics of widely-used Python libraries
  • Security Auditing: Batch-analyze popular packages for vulnerabilities

Implementation Details

This project provides two implementations with identical functionality:

JavaScript Version (generate-python-repo-list.js)

  • Runtime: Node.js >= 14.0.0
  • Dependencies: None (uses only built-in modules: https, fs, path)
  • Best for: Teams already using Node.js, integration with npm workflows
  • Lines of code: ~331

Python Version (generate-python-repo-list.py)

  • Runtime: Python >= 3.7
  • Dependencies: None (uses only standard library)
  • Best for: Python developers, integration with Python workflows
  • Features: Type hints, comprehensive docstrings
  • Lines of code: ~390

Both versions:

  • Produce identical output formats (JSON, CSV)
  • Support all the same command-line options
  • Have the same checkpoint format (interchangeable)
  • Use the same algorithm for GitHub URL extraction
  • Respect PyPI rate limits

Pro tip: You can start with one version and resume with the other! Checkpoint files are compatible across implementations.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Related Projects


Note: This tool respects PyPI's rate limits and best practices. Please use responsibly.
