
scrapelib is a library for making requests to less-than-reliable websites.

This repository has moved to Codeberg; GitHub will remain a read-only mirror.

Source: https://codeberg.org/jpt/scrapelib

Documentation: https://jamesturk.github.io/scrapelib/

Issues: https://codeberg.org/jpt/scrapelib/issues

Features

scrapelib originated as part of the Open States project to scrape the websites of all 50 state legislatures, and was therefore designed with features useful for sites that have intermittent errors or require rate limiting.

Advantages of using scrapelib over using requests as-is:

  • HTTP(S) and FTP requests via an identical API
  • support for simple caching with pluggable cache backends
  • highly configurable request throttling
  • configurable retries for non-permanent site failures
  • all of the power of the superb requests library

Installation

scrapelib is on PyPI and can be installed with any standard Python package manager.

Example Usage


  import scrapelib
  s = scrapelib.Scraper(requests_per_minute=10)

  # Grab Google front page
  s.get('http://google.com')

  # Will be throttled to 10 HTTP requests per minute
  while True:
      s.get('http://example.com')