
Redd-Archiver

License: Unlicense · Python 3.7+ · PostgreSQL required · Version 1.0.0 · Multi-platform · MCP server

Transform compressed data dumps into browsable HTML archives with flexible deployment options. Redd-Archiver supports offline browsing via sorted index pages OR full-text search with Docker deployment. Features mobile-first design, multi-platform support, and enterprise-grade performance with PostgreSQL full-text indexing.

Supported Platforms:

| Platform | Format | Status | Available Posts |
|---|---|---|---|
| Reddit | .zst JSON Lines (Pushshift) | ✅ Full support | 2.38B posts (40,029 subreddits, through Dec 31 2024) |
| Voat | SQL dumps | ✅ Full support | 3.81M posts, 24.1M comments (22,637 subverses, complete archive) |
| Ruqqus | .7z JSON Lines | ✅ Full support | 500K posts (6,217 guilds, complete archive) |

Tracked content: 2.384 billion posts across 68,883 communities (Reddit full Pushshift dataset through Dec 31 2024, Voat/Ruqqus complete archives)

Version 1.0 features multi-platform archiving, REST API with 30+ endpoints, MCP server for AI integration, and PostgreSQL-backed architecture for large-scale processing.

🚀 Quick Start

Try the live demo: Browse Example Archive →

New to Redd-Archiver? Start here: QUICKSTART.md

Get running in 2-15 minutes with our step-by-step guide covering:

  • Local testing (5 minutes)
  • Tor homelab deployment (2 minutes) - no domain or port forwarding needed!
  • Production HTTPS (15 minutes)
  • Example data testing

🎯 Key Features

🌐 Multi-Platform Support

Archive content from multiple link aggregator platforms in a single unified archive:

| Platform | Format | CLI Flag | URL Prefix |
|---|---|---|---|
| Reddit | .zst JSON Lines | --subreddit | /r/ |
| Voat | SQL dumps | --subverse | /v/ |
| Ruqqus | .7z JSON Lines | --guild | /g/ |
  • Automatic Detection: Platform auto-detected from file extensions
  • Unified Search: PostgreSQL FTS searches across all platforms
  • Mixed Archives: Combine Reddit, Voat, and Ruqqus in single archive

🤖 MCP Server (AI Integration)

29 MCP tools auto-generated from OpenAPI for AI assistants:

  • Full Archive Access: Query posts, comments, users, search via Claude Desktop or Claude Code
  • Token Overflow Prevention: Built-in LLM guidance with field selection and truncation
  • 5 MCP Resources: Instant access to stats, top posts, subreddits, search help
  • Claude Code Ready: Copy-paste configuration for immediate use
{
  "mcpServers": {
    "reddarchiver": {
      "command": "uv",
      "args": ["--directory", "/path/to/mcp_server", "run", "python", "server.py"],
      "env": { "REDDARCHIVER_API_URL": "http://localhost:5000" }
    }
  }
}

See MCP Server Documentation for complete setup guide.

Core Functionality

  • 📱 Mobile-First Design: Responsive layout optimized for all devices with touch-friendly navigation
  • 🔍 Advanced Search System (Server Required): PostgreSQL full-text search optimized for Tor network. Search by keywords, subreddit, author, date, score. Requires Docker deployment - offline browsing uses sorted index pages.
  • ⚡ JavaScript Free: Complete functionality without JS, pure CSS interactions
  • 🎨 Theme Support: Built-in light/dark theme toggle with CSS-only implementation
  • ♿ Accessibility: WCAG compliant with keyboard navigation and screen reader support
  • 🚄 Performance: Optimized CSS (29KB), designed for low-bandwidth networks

Technical Excellence

  • πŸ—οΈ Modular Architecture: 18 specialized modules for maintainability and extensibility
  • πŸ—„οΈ PostgreSQL Backend: Large-scale processing with constant memory usage regardless of dataset size
  • ⚑ Lightning-Fast Search: PostgreSQL full-text search with GIN indexing
  • 🌐 REST API v1: 30+ endpoints with MCP/AI optimization for programmatic access to posts, comments, users, statistics, search, aggregations, and exports
  • πŸ§… Tor-Optimized: Zero JavaScript, server-side search, no external dependencies
  • πŸ“Š Rich Statistics: Comprehensive analytics dashboard with file size tracking
  • πŸ”— SEO Optimized: Complete meta tags, XML sitemaps, and structured data
  • πŸ’Ύ Streaming Processing: Memory-efficient with automatic resume capability
  • πŸ“ˆ Progress Tracking: Real-time transfer rates, ETAs, and database metrics
  • πŸ† Instance Registry: Leaderboard system with completeness-weighted scoring for distributed archives

Deployment Options

  • 🏠 Local/Homelab: HTTP on localhost or LAN (2 commands)
  • 🌐 Production HTTPS: Automated Let's Encrypt setup (5 minutes)
  • 🧅 Tor Hidden Service: .onion access, zero networking config (2 minutes)
  • 🔀 Dual-Mode: HTTPS + Tor simultaneously
  • 📄 Static Hosting: GitHub/Codeberg Pages for small archives (browse-only, no search)

📦 Deployment Options

Redd-Archiver generates static HTML files that can be browsed offline OR deployed with full-text search:

| Mode | Search | Server | Setup Time | Use Case |
|---|---|---|---|---|
| Offline Browsing | ❌ Browse-only | None | 0 min | USB drives, local archives, offline research |
| Static Hosting | ❌ Browse-only | GitHub/Codeberg Pages | 10 min | Free public hosting (size limits) |
| Docker Local | ✅ PostgreSQL FTS | localhost | 5 min | Development, testing |
| Docker + Tor | ✅ PostgreSQL FTS | .onion hidden service | 2 min | Private sharing, no port forwarding |
| Docker + HTTPS | ✅ PostgreSQL FTS | Public domain | 15 min | Production public archives |

Offline Browsing Features:

  • Sorted index pages (by score, comments, date)
  • Pagination for large subreddits
  • Full comment threads and user pages
  • Works by opening HTML files directly

With Search Server:

  • PostgreSQL full-text search with GIN indexing
  • Search by keywords, subreddit, author, date, score
  • Sub-second results, Tor-compatible
  • Requires Docker deployment

🚨 Get Involved: Help Preserve Internet History

Internet content disappears every day. Communities get banned, platforms shut down, and valuable discussions vanish. You can help prevent this.

📥 Download & Mirror Data Now

Don't wait for content to disappear. Download these datasets today:

| Platform | Size | Posts | Download |
|---|---|---|---|
| Reddit | 3.28TB | 2.38B posts | Academic Torrents · Magnet Link |
| Voat | ~15GB | 3.8M posts | Archive.org † |
| Ruqqus | ~752MB | 500K posts | Archive.org ‡ |

† Voat performance tip: use pre-split files for 1000x faster imports (2-5 min vs 30+ min per subverse).
‡ Ruqqus: the Docker image includes p7zip for automatic .7z decompression.

Every mirror matters. Store locally, seed torrents, share with researchers. Be part of the preservation network.

🌐 Join the Registry: Deploy Your Instance

Already running an archive? Register it on our public leaderboard:

  1. Deploy your instance (Quick Start - 2-15 minutes)
  2. Submit via Registry Template
  3. Join coordinated preservation efforts with other teams

Benefits:

  • Public visibility and traffic
  • Coordinated archiving to avoid duplication
  • Team collaboration opportunities
  • Leaderboard recognition

👉 Register Your Instance Now →

🆕 Submit New Data Sources

Found a new platform dataset? Help expand the archive network:

  • Lemmy databases
  • Hacker News archives
  • Alternative Reddit archives
  • Other link aggregator platforms

👉 Submit Data Source →

Why submit?

  • Makes data discoverable for other archivists
  • Prevents duplicate preservation efforts
  • Builds comprehensive multi-platform archive ecosystem
  • Tracks data availability before platforms disappear

📸 Screenshots

Dashboard


Main landing page showing archive overview with statistics for 9,592 posts across Reddit, Voat, and Ruqqus. Features customizable branding (site name, project URL), responsive cards, activity metrics, and content statistics. (Works offline)

Subreddit Index


Post listing with sorting options (score, comments, date), pagination, and badge coloring. Includes navigation and theme toggle. (Works offline - sorted by score/comments/date)

Post Page with Comments


Individual post displaying nested comment threads with collapsible UI, user flair, and timestamps. Comments include anchor links for direct navigation from user pages. (Works offline)

Mobile Responsive Design


Fully optimized for mobile devices with touch-friendly navigation and responsive layout.

Search Interface


PostgreSQL full-text search with Google-style operators. Supports filtering by subreddit, author, date range, and score. (Requires Docker deployment)

Search Results

Search results with highlighted excerpts using PostgreSQL ts_headline(). Sub-second response times with GIN indexing. (Server-based, Tor-compatible)

Sample Archive: Multi-platform archive featuring programming and technology communities from Reddit, Voat, and Ruqqus · See all screenshots →

🛠️ Installation

Prerequisites

  • Python 3.7 or higher
  • PostgreSQL 12+ (required for v1.0+)
  • 4GB+ RAM (PostgreSQL uses constant memory)
  • Disk space: ~1.5-2x your input .zst file size for PostgreSQL database

Python Dependencies

Redd-Archiver uses modern, performance-focused dependencies:

Core:

  • psycopg[binary,pool]==3.2.3 - PostgreSQL adapter with connection pooling
  • zstandard==0.23.0 - Fast .zst decompression
  • psutil==6.1.1 - System resource monitoring

HTML Generation:

  • jinja2>=3.1.6 - Modern template engine with inheritance
  • rcssmin>=1.1.2 - CSS minification for smaller file sizes

Performance:

  • orjson>=3.11.4 - Fast JSON parsing

Quick Start

Option 1: Docker (Recommended)

git clone https://github.com/19-84/redd-archiver.git
cd redd-archiver

# Create required directories
mkdir -p data output/.postgres-data logs tor-public

# Copy environment template and configure
cp .env.example .env
# Edit .env with your settings (change default passwords!)

# Start PostgreSQL container
docker-compose up -d

# Install Python dependencies
pip install -r requirements.txt

# Configure database connection
export DATABASE_URL="postgresql://reddarchiver:your_password_here@localhost:5432/reddarchiver"

# Run the archive generator
python reddarc.py /path/to/data/ --output my-archive/

Option 2: Local PostgreSQL

git clone https://github.com/19-84/redd-archiver.git
cd redd-archiver

# Install PostgreSQL (Ubuntu/Debian)
sudo apt update && sudo apt install postgresql postgresql-contrib

# Or on macOS
brew install postgresql@16 && brew services start postgresql@16

# Create database
sudo -u postgres createuser reddarchiver
sudo -u postgres createdb -O reddarchiver reddarchiver
sudo -u postgres psql -c "ALTER USER reddarchiver WITH PASSWORD 'your_password_here';"

# Install Python dependencies
pip install -r requirements.txt

# Configure database connection
export DATABASE_URL="postgresql://reddarchiver:your_password_here@localhost:5432/reddarchiver"

# Run the archive generator
python reddarc.py /path/to/data/ --output my-archive/

Upgrading?

Review the CHANGELOG.md for version updates and changes.

📊 Usage

1. Prepare Your Data

Redd-Archiver processes data dumps from multiple platforms:

| Platform | Format | Data Sources |
|---|---|---|
| Reddit | .zst JSON Lines | Pushshift Complete Dataset · Magnet Link · 3.28TB · 2.38B posts · 40K subreddits |
| Voat | SQL dumps | Voat Archive 2021 · 22,637 subverses · 3.8M posts · 24M comments · Complete archive |
| Ruqqus | .7z JSON Lines | Ruqqus Archive 2021 · 6,217 guilds · Complete archive |

2. Identify High-Priority Communities (Optional)

Scanner Tools help you identify which communities to archive first based on priority scores:

# Scan Reddit data (generates subreddits_complete.json)
python tools/find_banned_subreddits.py /path/to/reddit-data/ --output tools/subreddits_complete.json

# Scan Voat data (generates subverses.json)
python tools/scan_voat_subverses.py /path/to/voat-data/ --output tools/subverses.json

# Scan Ruqqus data (generates guilds.json)
python tools/scan_ruqqus_guilds.py /path/to/ruqqus-data/ --output tools/guilds.json

What the scanners do:

  • Calculate archive priority scores (0-100) for each community
  • Track post counts, activity periods, deletion rates, NSFW content
  • Identify restricted, quarantined, or banned communities (highest priority)
  • Sort communities by archival importance

Example output:

  • Reddit: 40,029 subreddits from 2.38B posts analyzed
  • Voat: 15,545 subverses from 3.81M posts + 24.1M comments analyzed
  • Ruqqus: 6,217 guilds from 500K posts analyzed
  • Status breakdown (Reddit): 26,552 active, 8,642 restricted, 4,803 inactive, 32 quarantined

Use cases:

  • Targeted archiving: Archive high-risk communities first (restricted, quarantined)
  • Storage planning: Identify largest communities before downloading
  • Historical research: Find communities with high deletion/removal rates

Output files (included in tools/ directory):

  • subreddits_complete.json - Reddit subreddit statistics (40,029 communities, 46MB)
  • subverses.json - Voat subverse statistics (22,585 communities, 14MB)
  • guilds.json - Ruqqus guild statistics (6,217 communities, 3.6MB)

View the complete data catalog to browse all communities and their priority scores.
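As a quick illustration, the scanner output can be post-processed with a few lines of Python. This is a sketch only: the field names (name, priority_score, status) and the top-level list structure are assumptions for illustration — inspect the generated JSON for the actual schema.

import json

# Load scanner output; field names below are assumed for illustration.
with open("tools/subreddits_complete.json") as f:
    communities = json.load(f)

# Rank by the 0-100 priority score described above and list the top 20.
top = sorted(communities, key=lambda c: c.get("priority_score", 0), reverse=True)[:20]
for c in top:
    print(f"{c.get('name', '?'):30} score={c.get('priority_score', 0):>3} status={c.get('status', 'unknown')}")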

3. Configure PostgreSQL

Ensure DATABASE_URL is set (see Installation above):

export DATABASE_URL="postgresql://reddarchiver:password@localhost:5432/reddarchiver"

4. Generate Your Archive

Reddit Archives (.zst files):

# Auto-discovery (processes all .zst files in directory)
python reddarc.py /path/to/pushshift-data/ --output my-archive/

# Single subreddit
python reddarc.py /data --subreddit privacy \
  --comments-file /data/privacy_comments.zst \
  --submissions-file /data/privacy_submissions.zst \
  --output my-archive/

Voat Archives (SQL dumps):

# Import Voat subverses
python reddarc.py /data --subverse voatdev,pics --output my-archive/ --import-only

# Export HTML after import
python reddarc.py /data --output my-archive/ --export-from-database

Ruqqus Archives (.7z files):

# Import Ruqqus guilds
python reddarc.py /data --guild Quarantine,News --output my-archive/ --import-only

# Export HTML after import
python reddarc.py /data --output my-archive/ --export-from-database

Multi-Platform Mixed Archive:

# Import from multiple platforms into single archive
python reddarc.py /reddit-data --subreddit privacy --output unified-archive/ --import-only
python reddarc.py /voat-data --subverse technology --output unified-archive/ --import-only
python reddarc.py /ruqqus-data --guild Tech --output unified-archive/ --import-only

# Generate HTML for all platforms
python reddarc.py /any-path --output unified-archive/ --export-from-database

With filtering and SEO:

python reddarc.py /data/ --output my-archive/ \
  --min-score 100 --min-comments 50 \
  --base-url https://example.com \
  --site-name "My Archive"

Import/Export workflow (for large datasets):

# Import data to PostgreSQL (no HTML generation)
python reddarc.py /data/ --output my-archive/ --import-only

# Export HTML from PostgreSQL (no data import)
python reddarc.py /data/ --output my-archive/ --export-from-database

5. Deploy Your Archive

Multiple deployment options available:

Local/Development (HTTP):

docker compose up -d
# Access: http://localhost

Production HTTPS (Let's Encrypt):

./docker/scripts/init-letsencrypt.sh
# Access: https://your-domain.com

Homelab/Tor (.onion hidden service):

docker compose -f docker-compose.yml -f docker-compose.tor-only.yml --profile tor up -d
# Access: http://[your-address].onion (via Tor Browser)
# No port forwarding or domain required!

Dual-Mode (HTTPS + Tor):

docker compose --profile production --profile tor up -d
# Access: Both https://your-domain.com and http://[address].onion

Static Hosting (GitHub/Codeberg Pages):

# Generate archive locally, push to GitHub/Codeberg
python reddarc.py /data --output archive/
cd archive/
git init && git add . && git commit -m "Initial archive"
git remote add origin https://github.com/username/repo.git
git push -u origin main
# Enable Pages in repository settings

See the deployment guides listed under Documentation below.

6. Advanced CLI Options

Processing Control:

--hide-deleted-comments    # Hide [deleted]/[removed] comments in output
--no-user-pages           # Skip user page generation (saves memory)
--dry-run                 # Preview discovered files without processing
--force-rebuild           # Ignore resume state and rebuild from scratch
--force-parallel-users    # Override auto-detection for parallel processing

Logging:

--log-file <path>         # Custom log file location (default: output/.archive-error.log)
--log-level DEBUG         # Set logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Performance Tuning:

--debug-memory-limit 8.0      # Override memory limit in GB (default: auto-detect)
--debug-max-connections 8     # Override DB connection pool size (default: auto-detect)
--debug-max-workers 4         # Override parallel workers (default: auto-detect)

Environment Variables:

# Required
DATABASE_URL=postgresql://user:pass@host:5432/reddarchiver

# Optional Performance Tuning (auto-detected if not set)
REDDARCHIVER_MAX_DB_CONNECTIONS=8       # Connection pool size
REDDARCHIVER_MAX_PARALLEL_WORKERS=4     # Parallel processing workers
REDDARCHIVER_USER_BATCH_SIZE=2000       # User page batch size
REDDARCHIVER_QUEUE_MAX_BATCHES=10       # Queue backpressure control
REDDARCHIVER_CHECKPOINT_INTERVAL=10     # Progress save frequency
REDDARCHIVER_USER_PAGE_WORKERS=4        # User page generation workers
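
A minimal sketch of how an override like this is typically resolved against auto-detection (illustrative only — not the project's actual implementation):

import os

def resolve_workers(auto_detected: int) -> int:
    # An explicit environment variable wins; otherwise keep the auto-detected value.
    override = os.environ.get("REDDARCHIVER_MAX_PARALLEL_WORKERS")
    return int(override) if override else auto_detected

print(resolve_workers(auto_detected=4))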

πŸ—οΈ Architecture

Redd-Archiver features a clean modular architecture with specialized components:

Project Structure

reddarc.py              # Main CLI entry point
search_server.py        # Flask search API server
version.py              # Version metadata

core/                   # Core processing & database
├── postgres_database.py    # PostgreSQL backend
├── postgres_search.py      # PostgreSQL FTS implementation
├── write_html.py           # HTML generation coordinator
├── watchful.py             # .zst streaming utilities
├── incremental_processor.py # Incremental processing
└── importers/              # Multi-platform importers
    ├── base_importer.py        # Abstract base class
    ├── reddit_importer.py      # .zst JSON Lines parser
    ├── voat_importer.py        # SQL dump coordinator
    ├── voat_sql_parser.py      # SQL INSERT parser
    └── ruqqus_importer.py      # .7z JSON Lines parser

api/                    # REST API v1
├── __init__.py             # Blueprint registration
└── routes.py               # 30+ API endpoints

mcp_server/             # MCP Server for AI integration
├── server.py               # FastMCP server (29 tools)
├── README.md               # MCP documentation
└── tests/                  # MCP server tests

utils/                  # Utility functions
├── console_output.py       # Console output formatting
├── error_handling.py       # Error handling utilities
├── input_validation.py     # Input validation
├── regex_utils.py          # Regular expression utilities
├── search_operators.py     # Search query parsing
└── simple_json_utils.py    # JSON utilities

processing/             # Data processing modules
├── parallel_user_processing.py  # Parallel user page generation
├── batch_processing_utils.py    # Batch processing utilities
└── incremental_statistics.py    # Statistics tracking

monitoring/             # Performance & monitoring
├── performance_monitor.py      # Performance monitoring
├── performance_phases.py       # Phase tracking
├── performance_timing.py       # Timing utilities
├── auto_tuning_validator.py    # Auto-tuning validation
├── streaming_config.py         # Auto-detecting configuration
└── system_optimizer.py         # System optimization

HTML Modules (18 specialized modules)

html_modules/
├── html_seo.py                # SEO, meta tags, sitemaps
├── html_pages_jinja.py        # Jinja2-based page generation
├── html_statistics.py         # Analytics and metrics
├── dashboard_helpers.py       # Dashboard utility functions
├── html_field_generation.py   # Dynamic field generation
├── jinja_filters.py           # Custom Jinja2 filters
├── html_pages.py              # Core page generation
├── html_comments.py           # Comment threading system
├── __init__.py                # Public API exports
├── jinja_env.py               # Jinja2 environment setup
├── html_utils.py              # File operations, utilities
├── html_dashboard_jinja.py    # Jinja2 dashboard rendering
├── css_minifier.py            # CSS minification
├── html_scoring.py            # Dynamic score badges
├── html_templates.py          # Template management
├── html_url.py                # URL processing, domains
├── html_dashboard.py          # Dashboard generation
└── html_constants.py          # Configuration values

Jinja2 Templates (15 templates)

templates_jinja2/
├── base/
│   └── base.html              # Master layout template
├── components/
│   ├── dashboard_card.html    # Dashboard statistics cards
│   ├── footer.html            # Site footer
│   ├── global_summary.html    # Global statistics summary
│   ├── navigation.html        # Site navigation bar
│   ├── post_card.html         # Post display card
│   ├── user_comment.html      # User comment display
│   └── user_post.html         # User post display
├── macros/
│   ├── comment_macros.html    # Comment rendering macros
│   └── reddit_macros.html     # Reddit-specific macros
└── pages/
    ├── global_search.html     # Global search page
    ├── index.html             # Dashboard homepage
    ├── link.html              # Individual post page
    ├── subreddit.html         # Subreddit listing page
    └── user.html              # User profile page

Database Schema

sql/
├── schema.sql                 # PostgreSQL table definitions
├── indexes.sql                # Performance indexes (GIN, B-tree)
├── fix_statistics.sql         # Statistics maintenance queries
└── migrations/
    └── 003_add_total_activity_column.sql  # Schema migration

πŸ” PostgreSQL Full-Text Search

Lightning-Fast Database Search

Redd-Archiver v1.0 uses PostgreSQL full-text search with GIN indexing for blazing-fast search capabilities:

Key Features:

  • Database-Powered: Native PostgreSQL indexing with constant memory usage
  • Large-Scale: Efficiently search large datasets (tested with hundreds of GB)
  • Relevance Ranking: PostgreSQL ts_rank() for intelligent result ordering
  • Highlighted Excerpts: ts_headline() shows matching content in context
  • Advanced Filters: Search by subreddit, author, date range, score
  • Concurrent Queries: Handle multiple search requests simultaneously
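
The query shape behind these features looks roughly like the sketch below, using the psycopg dependency listed under Installation. The table and column names (posts, title, selftext) are assumptions for illustration; see sql/schema.sql for the real definitions, where the tsvector is precomputed and GIN-indexed rather than built per query.

import os
import psycopg  # matches the psycopg[binary,pool] dependency listed above

# Illustrative FTS query: rank matches and return highlighted excerpts.
# Table/column names are assumed; see sql/schema.sql for the actual schema.
QUERY = """
SELECT id, title,
       ts_rank(to_tsvector('english', title || ' ' || selftext), q) AS rank,
       ts_headline('english', selftext, q) AS excerpt
FROM posts, websearch_to_tsquery('english', %s) AS q
WHERE to_tsvector('english', title || ' ' || selftext) @@ q
ORDER BY rank DESC
LIMIT 10;
"""

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    for row in conn.execute(QUERY, ("machine learning",)):
        print(row)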

Search API

PostgreSQL search is exposed via postgres_search.py (CLI) and search_server.py (Web API):

Command-Line Interface:

# Search command-line interface
python postgres_search.py "your query" --subreddit technology --limit 50

# Example: Search for posts about "machine learning" with high scores
python postgres_search.py "machine learning" --min-score 100 --limit 20

Web API (✅ Implemented):

# Start search server with Docker Compose (recommended)
docker-compose up -d reddarchiver-search-server

# Or run directly
export DATABASE_URL="postgresql://user:pass@localhost:5432/reddarchiver"
python search_server.py

# Access at http://localhost:5000

Features:

  • RESTful search API with JSON responses
  • Real-time search with PostgreSQL FTS
  • Rate limiting and CSRF protection
  • Health check endpoint: GET /health
  • Search endpoint: GET /search?q=query&subreddit=optional&limit=50
  • Result highlighting with ts_headline()
  • Search suggestions and trending searches
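
A quick smoke test against the two endpoints listed above, using only the standard library (the JSON shape of the search response is an assumption — inspect what your instance actually returns):

import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5000"

# Health check endpoint documented above.
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status)  # expect 200 when server and database are up

# Search endpoint documented above: GET /search?q=query&limit=50
params = urllib.parse.urlencode({"q": "machine learning", "limit": 5})
with urllib.request.urlopen(f"{BASE}/search?{params}") as resp:
    print(json.dumps(json.load(resp), indent=2)[:500])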

🌐 REST API & Registry

REST API v1

Full-featured API with 30+ endpoints for programmatic access and MCP/AI integration:

| Category | Endpoints | Key Features |
|---|---|---|
| System (5) | /health, /stats, /schema, /openapi.json | Health checks, statistics, capability discovery, OpenAPI spec |
| Posts (13) | /posts, /posts/{id}, /posts/{id}/comments, /posts/{id}/context, /posts/{id}/comments/tree, /posts/{id}/related, /posts/random, /posts/aggregate, /posts/batch | List, single, comments, context, tree, related, random, aggregate, batch |
| Comments (7) | /comments, /comments/{id}, /comments/random, /comments/aggregate, /comments/batch | List, single, random, aggregate, batch |
| Users (8) | /users, /users/{username}, /users/{username}/summary, /users/{username}/posts, /users/{username}/comments, /users/aggregate, /users/batch | List, profiles, summary, activity, aggregate, batch |
| Subreddits (4) | /subreddits, /subreddits/{name}, /subreddits/{name}/summary | List, statistics, summary |
| Search (3) | /search, /search/explain | Full-text search with operators, query debugging |

MCP/AI-Optimized Features:

  • Field Selection: ?fields=id,title,score for token optimization
  • Truncation Controls: ?max_body_length=500&include_body=false for response size management
  • Export Formats: ?format=csv|ndjson for data analysis
  • Batch Endpoints: Reduce N requests to 1 with /posts|comments|users/batch
  • Context Endpoints: Single-call discussion retrieval with /posts/{id}/context
  • Search Operators: Google-style syntax ("exact", OR, -exclude, sub:, author:, score:)

Rate limited to 100 requests/minute. See API Documentation for complete reference.
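
For example, field selection and search operators can be exercised with a short script (a sketch that assumes the endpoints are served from the API root at your instance's base URL — adjust the prefix if your deployment mounts them elsewhere):

import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5000"  # assumed base URL; adjust for your deployment

# Field selection keeps responses small for LLM consumption.
params = urllib.parse.urlencode({"fields": "id,title,score", "limit": 10})
with urllib.request.urlopen(f"{BASE}/posts?{params}") as resp:
    print(json.load(resp))

# Google-style search operators, as documented above.
params = urllib.parse.urlencode({"q": '"machine learning" -crypto sub:technology'})
with urllib.request.urlopen(f"{BASE}/search?{params}") as resp:
    print(json.load(resp))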

Instance Registry & Leaderboard

Redd-Archiver supports a distributed registry system for tracking archive instances:

  • Instance Metadata: Configure via environment variables or CLI flags (--site-name, --contact, --team-id)
  • Leaderboard Generator: Automated scoring based on archive completeness and content risk
  • Team Grouping: Group multiple instances under a team ID for coordinated archiving

See Registry Setup Guide for configuration.

📈 Performance & Optimization

PostgreSQL Backend Performance (v1.0+)

Constant Memory Usage:

  • 4GB RAM: Process large datasets efficiently (tested with hundreds of GB)
  • 8GB RAM: Optimal for concurrent operations
  • 16GB+ RAM: Ideal for parallel user page generation

Database Storage:

| Input (.zst) | PostgreSQL DB | HTML Output | Example |
|---|---|---|---|
| 93.6MB | ~150MB | 1.4GB | r/technology |
| 100MB | ~160MB | ~1.5GB | Small archives |
| 500MB | ~800MB | ~7.5GB | Research projects |
| 2GB | ~3.2GB | ~30GB | Large collections |
| 100GB | ~160GB | ~1.5TB | Enterprise-scale |

Processing Speed:

  • Data Import: Fast streaming ingestion to PostgreSQL
  • HTML Generation: Efficient database-backed rendering
  • Search Index: Instant with PostgreSQL GIN indexes
  • Performance: Scales with dataset size, optimized for large archives

Search Performance

Performance varies based on dataset size, query complexity, and hardware:

  • PostgreSQL FTS: Fast indexed search for large datasets
  • GIN Indexes: Optimized index lookups for text search
  • Concurrent Queries: Supports multiple simultaneous searches with connection pooling
  • Memory Efficient: Constant memory usage with streaming results

Architecture Benefits

PostgreSQL v1.0 Features:

  • Large-Scale Processing: Efficiently handle large datasets (tested with hundreds of GB)
  • Constant Memory: 4GB RAM regardless of dataset size
  • Fast Search: PostgreSQL FTS with GIN indexing
  • Resume Capability: Database-backed progress tracking
  • Concurrent Processing: Multi-connection pool for parallel operations

🔀 Scaling for Very Large Archives

Single Instance Limits

Redd-Archiver has been tested with archives up to hundreds of gigabytes. For optimal performance:

  • Tested scale: Hundreds of GB per instance
  • Memory usage: Constant 4GB RAM regardless of dataset size
  • Database: PostgreSQL handles large datasets efficiently

Horizontal Scaling Strategy

For very large archive collections (multiple terabytes), deploy multiple instances divided by topic:

Architecture:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Instance 1     │     │  Instance 2     │     │  Instance 3     │
│  Technology     │     │  Gaming         │     │  Science        │
│  Subreddits     │     │  Subreddits     │     │  Subreddits     │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Benefits:

  • Efficient search: Each database stays a manageable size
  • Distributed load: Parallel processing across instances
  • Topic organization: Logical grouping of related content
  • Independent scaling: Scale individual topics as needed

Deployment Options:

  1. Single server: Multiple Docker Compose stacks with different ports
  2. Multiple servers: One instance per physical/virtual machine
  3. Topic-based domains: tech.archive.com, gaming.archive.com, etc.

Example Multi-Instance Setup:

# Instance 1: Technology topics (port 8080)
cd /archives/tech
docker compose up -d

# Instance 2: Gaming topics (port 8081)
cd /archives/gaming
docker compose -f docker-compose.yml up -d

# Instance 3: Science topics (port 8082)
cd /archives/science
docker compose -f docker-compose.yml up -d

When to Use:

  • Archive collection exceeds 500GB
  • Search performance degrades with single instance
  • Logical topic divisions exist in your archive
  • Want to distribute load across multiple servers
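
If you shard by topic, cross-instance search has to be merged client-side. A sketch of the fan-out (instance URLs and the response schema here are assumptions for illustration):

import json
import urllib.parse
import urllib.request

# Topic-sharded instances from the example above; URLs are illustrative.
INSTANCES = ["http://localhost:8080", "http://localhost:8081", "http://localhost:8082"]

def search_all(query: str, limit: int = 10) -> list:
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    merged = []
    for base in INSTANCES:
        try:
            with urllib.request.urlopen(f"{base}/search?{params}", timeout=10) as resp:
                merged.extend(json.load(resp).get("results", []))
        except OSError:
            continue  # skip instances that are down
    # Assumes each hit carries a "score" field; adjust to the real schema.
    return sorted(merged, key=lambda r: r.get("score", 0), reverse=True)[:limit]

print(search_all("machine learning"))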

🎯 Use Cases

Research & Academia

  • Studying online discourse and community dynamics
  • Analyzing social movements and trends
  • Preserving internet culture

Community Archiving

  • Backing up subreddits before potential removal
  • Creating offline-accessible community resources
  • Distributing knowledge repositories

Investigation & Analysis

  • Pattern analysis in deleted/removed content
  • User behavior studies
  • Content moderation research

📚 Documentation

Deployment Guides

API & Integration

Project Documentation

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for development guidelines, code structure, and testing procedures.

Key areas for contribution:

  • PostgreSQL query optimizations
  • Additional export formats
  • Enhanced search features
  • Documentation improvements

See our modular architecture (18 specialized modules) for easy entry points to contribute.


πŸ“ License

This is free and unencumbered software released into the public domain. See the LICENSE file (Unlicense) for details.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means.

📦 Data Sources

This project leverages the public datasets listed under "📥 Download & Mirror Data Now" above: the Pushshift Reddit dataset, the Voat Archive 2021, and the Ruqqus Archive 2021.

πŸ™ Acknowledgments

This project builds upon the work of several excellent archival projects:

  • reddit-html-archiver by libertysoft3 - Original inspiration and foundation for static HTML generation
  • redarc - Self-hosted Reddit archiving with PostgreSQL and full-text search
  • red-arch - Static website generator for Reddit subreddit archives
  • zst_blocks_format - Efficient block-based compression format for processing large datasets

📧 Contact

💰 Support the Project

Redd-Archiver was built by one person over 6 months as a labor of love to preserve internet history before it disappears forever.

This isn't backed by a company or institutionβ€”just an individual committed to keeping valuable discussions accessible. Your support helps:

  • Continue development and bug fixes
  • Maintain documentation and support
  • Cover infrastructure costs (servers, storage, bandwidth)
  • Preserve more data sources and platforms

Every donation, no matter the size, helps keep this preservation effort alive.

Bitcoin (BTC)

bc1q8wpdldnfqt3n9jh2n9qqmhg9awx20hxtz6qdl7


Monero (XMR)

42zJZJCqxyW8xhhWngXHjhYftaTXhPdXd9iJ2cMp9kiGGhKPmtHV746EknriN4TNqYR2e8hoaDwrMLfv7h1wXzizMzhkeQi


Thank you for supporting internet archival efforts! Every contribution helps maintain and improve this project.


This software is provided "as is" under the Unlicense. See LICENSE for details. Users are responsible for compliance with applicable laws and terms of service when processing data.