Want to sponsor daily updates? Ping me. • Support HackerBook
Did you miss Hacker News for a day because you were occupied with something useful like work, or because you didn't touch the Internet for sanity or purity? Now you can go back to the days of Hacker News you missed with our weekly updated archive. Updates land around every Sunday.
https://hackerbook.dosaygo.com
Community, all the HN belong to you. This repo packages 20 years of Hacker News into a static archive you can run entirely in your browser. The site is just files: HTML, JSON, and gzipped SQLite shards. No server app required.
- Demo: https://hackerbook.dosaygo.com
- Code: https://github.com/DOSAYGO-STUDIO/HackerBook
Note
This project is an archive and a toolkit: run it locally, mirror it, or build your own daily snapshots.
- What this is
- How it works
- Screenshots
- Prerequisites
- Quick start
- Repository layout
- Pipeline overview
- Restarting safely
- Shards and content hashing
- User stats shards
- Index builders
- Architecture Decision: SQLite vs Parquet
- Tech deep dive
- Deploy checklist
- Kiwix offline packaging
- Contribute analysis
- FAQ
- Notes
A fully static Hacker News archive that runs entirely in your browser using SQLite compiled to WebAssembly. The browser fetches only the shard files it needs for the page you are viewing, so the initial load is tiny and navigation stays fast.
Community, all the HN belong to you. This is an archive of Hacker News that fits in your browser. The whole dataset. So they'll never die. Ever. It's the unkillable static archive of HN in your hands.
Inspired by HN Made of Primes, this is the "year-end gift" to the community: keep a durable, offline-friendly HN archive forever.
Tip
The "organizing idea": ship data like a static website. Everything is a file. Everything is cacheable. Everything is yours.
- SQLite WASM in the browser runs queries client-side.
- The archive is sharded into gzipped SQLite files so the browser can fetch only what it needs.
- Manifests and indexes describe where shards live and what they contain.
- Content hashing bakes a short SHA-256 into each shard filename for perfect cache correctness.
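To make that concrete, here is a minimal sketch of the client-side flow, assuming an sql.js-style WASM API and an illustrative shard URL and query (the actual viewer code in docs/ differs):

```js
// Minimal sketch of the client-side flow (assumes an sql.js-style API;
// the shard URL and query below are illustrative, not the app's actual code).
import initSqlJs from "sql.js";

async function openShard(url) {
  // Fetch the gzipped shard and decompress it in the browser.
  const res = await fetch(url);
  const stream = res.body.pipeThrough(new DecompressionStream("gzip"));
  const bytes = new Uint8Array(await new Response(stream).arrayBuffer());

  // Open the decompressed bytes as an in-memory SQLite database.
  const SQL = await initSqlJs();
  return new SQL.Database(bytes);
}

// Usage: a point lookup against one shard.
const db = await openShard("/static-shards/shard_0123_abcdef123456.sqlite.gz");
const rows = db.exec("SELECT id, title, by, score FROM items WHERE id = 8863");
console.log(rows);
```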
Important
The shards are immutable by filename. If content changes, the hash changes, and caches refresh automatically.
Query view
Me view
- Node.js 20+ (Required for tools and ETL)
- Google Cloud SDK (required if using download_hn.sh to fetch raw data)
- Disk space: ~45GB free (if running the full archivist workflow: 22GB raw + staging + shards)
Always run npm install once in the repo root.
Download the pre-built site to browse offline without building the database.
node toool/download-site.mjs
npm run serve
# OR
python3 -m http.server 8000 --directory docs
Full pipeline: BigQuery -> ETL -> Shards.
# 1. Download raw data (Requires GCP configuration)
./download_hn.sh
# 2. Build shards and indexes
npm run etl
# 3. Finalize and serve
./toool/s/predeploy-checks.sh --restart-etl
npm run serve
Tip
Use AUTO_RUN=1 for unattended pipeline runs. ✨
- docs/: the static site (HTML, JS, CSS, manifests, indexes, shards)
- docs/static-shards/: gzipped item/comment shards
- docs/static-user-stats-shards/: user stats shards
- data/: raw dumps + staging DB
- download_hn.sh: BigQuery export helper
- etl-hn.js: main shard builder + post-pass
- build-archive-index.js: archive index builder
- toool/s/: scripts for predeploy, index builds, downloads
- Raw data: BigQuery exports in data/raw/*.json.gz (or toool/data/raw/).
- Staging (optional): imports into data/static-staging-hn.sqlite.
- Shard build: write .sqlite shards to docs/static-shards/.
- Post-pass: VACUUM, gzip, hash rename, manifest rewrite.
- Indexes: archive, cross-shard, user stats.
- Deploy: publish docs/.
./toool/s/predeploy-checks.sh runs the full pipeline with prompts.
Flags:
- --use-staging: use data/static-staging-hn.sqlite instead of raw downloads.
- --restart-etl: resume post-pass/gzip from existing shards and/or .prepass.
- --from-shards: skip ETL; normalize shard hashes and rebuild from existing shards.
- --hash-only: with --from-shards, only normalize shard hashes (skip post-pass).
- AUTO_RUN=1: auto-advance prompts.
Warning
--hash-only assumes you already have gzipped shards. It will not gzip .sqlite files.
This pipeline is designed to tolerate interrupts.
- Kill only after a shard completes ([shard N] ... file ...MB).
- If killed mid-gzip, restart with ./toool/s/predeploy-checks.sh --restart-etl.
- The post-pass detects partial output and resumes from the last good shard.
Items/comments shards live in docs/static-shards/.
- Filenames are hashed: shard_<sid>_<hash>.sqlite.gz.
- <hash> is 12 hex chars (SHA-256 truncated).
- Hashing happens after VACUUM + gzip, so the hash represents the final bytes.
Why this matters:
- Cache correctness is automatic. If the shard changes, the URL changes.
- Old shards can be cached forever (immutable), while new snapshots fetch cleanly.
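As an illustration of the naming scheme (not the project's actual post-pass code, and with an example shard path), the hashed filename can be derived roughly like this:

```js
// Sketch of the hashed-filename scheme (illustrative; the real post-pass
// lives in etl-hn.js and the predeploy scripts).
import { createHash } from "node:crypto";
import { readFileSync, renameSync } from "node:fs";

function hashedName(gzPath, sid) {
  const bytes = readFileSync(gzPath);                 // the final gzipped bytes
  const hash = createHash("sha256").update(bytes)
    .digest("hex").slice(0, 12);                      // 12 hex chars of SHA-256
  return `shard_${sid}_${hash}.sqlite.gz`;
}

// Example: rename a freshly gzipped shard to its immutable, hash-stamped name.
const src = "docs/static-shards/shard_0042.sqlite.gz";
renameSync(src, `docs/static-shards/${hashedName(src, "0042")}`);
```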
User stats are separate shards optimized for usernames and monthly activity.
- Location: docs/static-user-stats-shards/
- Manifest: docs/static-user-stats-manifest.json(.gz)
- Tables: users, user_domains, user_months
- Built by: node ./toool/s/build-user-stats.mjs --gzip --target-mb 15
The build pipeline uses parallel workers to stream data from the staging DB, aggregating stats in-memory and flushing to size-capped SQLite shards.
The app switches to these shards for the ?view=me view and when you select "User stats shards" in query mode.
- Archive index: node ./build-archive-index.js
- Cross-shard index: node ./toool/s/build-cross-shard-index.mjs --binary
- User stats: node ./toool/s/build-user-stats.mjs --gzip --target-mb 15
Why SQLite sharding instead of Parquet with HTTP range requests?
After analyzing all query patterns in HackerBook, we determined that SQLite is optimal for this use case. See docs/ARCHITECTURE.md for the full analysis.
TL;DR:
- 40% of queries are point lookups (WHERE id=?) - SQLite's B-tree indexes are 10-100x faster than Parquet
- 30% are Top-K sorted queries with filters - SQLite's query optimizer excels at these
- JOINs are critical for comment threads - Parquet has no native JOIN support
- Small result sets (LIMIT 30-50) don't benefit from columnar storage
- 15MB shards download in <1s and cache forever
Parquet would make sense for a supplementary "Data Export" feature for data scientists, but not for the interactive web UI.
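To make those patterns concrete, here are illustrative query shapes against the shard schema shown further below (not the app's exact SQL):

```js
// Illustrative query shapes (not the app's exact statements).
const pointLookup = `SELECT * FROM items WHERE id = ?`;            // ~40% of traffic

const topKFiltered = `
  SELECT id, title, by, score, time
  FROM items
  WHERE type = 'story' AND time BETWEEN ? AND ?
  ORDER BY score DESC
  LIMIT 30`;                                                       // ~30% of traffic

const threadJoin = `
  SELECT c.*
  FROM edges e JOIN items c ON c.id = e.child_id
  WHERE e.parent_id = ?
  ORDER BY e.ord`;                                                 // comment threads
```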
graph TD
BQ[BigQuery Public Dataset] -->|download_hn.sh| Raw[data/raw/*.json.gz]
Raw -->|etl-hn.js| Shards[SQLite Shards]
Shards -->|Post-pass| GzShards[Gzipped Shards]
GzShards --> Manifest[static-manifest.json]
graph TD
Shards[static-shards/] --> Archive[build-archive-index.js]
Shards --> Cross[build-cross-shard-index.mjs]
Shards --> UserStats[build-user-stats.mjs]
Archive --> ArchiveIdx[archive-index.json]
Cross --> CrossIdx[cross-shard-index.bin]
UserStats --> UserIdx[static-user-stats-shards/]
The user stats generation (toool/s/build-user-stats.mjs) is a high-performance pipeline designed to process 40M+ items efficiently:
- Staging DB Source: Reads directly from the items_raw staging table (indexed by by, time) instead of parsing raw JSON, significantly reducing I/O overhead.
- Parallel Partitioning: The user space is partitioned into equal chunks (by count) and processed by multiple worker threads in parallel.
- Streaming Aggregation:
- Workers stream items ordered by user.
- Stats (counts, scores, domains) are aggregated in-memory for the current user.
- Completed user records are flushed to the current SQLite shard.
- Dynamic Sharding: Shards are rotated automatically when they reach the target size (default 15MB), ensuring optimal download sizes.
- Robustness:
- Memory Safety: Automatically detects low heap limits and respawns with increased memory (8GB).
- Transaction Safety: Uses batched transactions for high-throughput writes.
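A stripped-down, single-worker sketch of that streaming aggregation, assuming a better-sqlite3-style synchronous API and that items_raw mirrors the items columns (the real implementation is toool/s/build-user-stats.mjs):

```js
import Database from "better-sqlite3";

// Aggregate one alphabetical slice of users from the staging DB and hand each
// finished user record to `flush` (which would write it into the current shard).
// Column names and most stats are simplified; shard rotation is omitted.
function aggregateUsers(stagingPath, loUser, hiUser, flush) {
  const staging = new Database(stagingPath, { readonly: true });
  const rows = staging
    .prepare("SELECT by, type, score FROM items_raw WHERE by BETWEEN ? AND ? ORDER BY by")
    .iterate(loUser, hiUser);

  let current = null, stats = null;
  for (const row of rows) {
    if (row.by !== current) {
      if (current) flush(current, stats);      // user finished: flush and reset
      current = row.by;
      stats = { items: 0, comments: 0, sum_score: 0 };
    }
    stats.items += 1;
    if (row.type === "comment") stats.comments += 1;
    stats.sum_score += row.score ?? 0;
  }
  if (current) flush(current, stats);          // flush the final user
}

// Usage (illustrative): print stats for users in the "a".."az" slice.
aggregateUsers("data/static-staging-hn.sqlite", "a", "az", (u, s) => console.log(u, s));
```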
- static-manifest.json: The core routing map. It maps ranges of item IDs (id_lo to id_hi) to specific shard files. The client uses this to perform binary searches when looking up a specific ID (e.g., ?id=123). It is generated during the initial ETL sharding process.
- archive-index.json: A secondary map optimized for time-based access. It maps "effective" time ranges (tmin_eff to tmax_eff) to shards, filtering out data outliers. It also caches item counts (posts vs comments) so the UI can display daily statistics without opening every shard.
- static-user-stats-manifest.json: The directory for user data. It maps alphabetical ranges of usernames (e.g., "a" to "az") to specific user-stats shards, enabling fast lookups for the "Me" view.
- cross-shard-index.bin: A compact binary index (not JSON) that maps parent_id to a list of shard_ids. This solves the "scattered conversation" problem: if a story is in Shard A but comments arrive months later in Shards B and C, this index tells the client exactly which shards to fetch to load the full thread.
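For example, resolving ?id=123 to a shard is a binary search over the manifest's ID ranges. A minimal sketch, assuming the manifest is an array of entries sorted by id_lo with a file field (the entry shape beyond id_lo/id_hi is an assumption):

```js
// Resolve an item ID to its shard via static-manifest.json.
// Assumes entries sorted by id_lo; the `file` field name is illustrative.
function findShardForId(manifestEntries, id) {
  let lo = 0, hi = manifestEntries.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const e = manifestEntries[mid];
    if (id < e.id_lo) hi = mid - 1;
    else if (id > e.id_hi) lo = mid + 1;
    else return e.file;               // shard filename covering this ID
  }
  return null;                        // ID not present in the archive
}
```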
The cross-shard-index.bin file uses a custom binary format for compactness (CSR-style):
- Header (24 bytes): CSHX (magic), Version (u32), Parents Count (u32), Links Count (u32), Shard Count (u32), Reserved (u32).
- Parent IDs: Sorted array of uint32 parent IDs (parents_count * 4 bytes).
- Offsets: Array of uint32 indices pointing into the Links array ((parents_count + 1) * 4 bytes).
- Links: Array of uint16 shard IDs (links_count * 2 bytes).
The index implements a Compressed Sparse Row (CSR) pattern using three parallel arrays:
- Parent IDs: A sorted list of story IDs that have cross-shard comments.
- Offsets: Pointers into the Links array. The shards for ParentIDs[i] are found in Links from index Offsets[i] to Offsets[i+1].
- Links: A flat list of shard IDs.
To find shards for a given story:
- Binary search Parent IDs to find the index i of the story ID.
- Read start = Offsets[i] and end = Offsets[i+1].
- The relevant shards are Links[start...end].
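A hedged sketch of that lookup in the browser, using the field widths from the header description above (little-endian byte order is assumed; the viewer's actual reader may differ):

```js
// Parse cross-shard-index.bin (CSR layout described above) from an ArrayBuffer,
// e.g. await (await fetch(url)).arrayBuffer(). Illustrative reader only.
function parseCrossShardIndex(buffer) {
  const view = new DataView(buffer);
  const parentsCount = view.getUint32(8, true);   // after 4-byte magic + u32 version (LE assumed)
  const base = 24;                                // 24-byte header
  const parents = new Uint32Array(buffer, base, parentsCount);
  const offsets = new Uint32Array(buffer, base + parentsCount * 4, parentsCount + 1);
  const links = new Uint16Array(buffer, base + parentsCount * 4 + (parentsCount + 1) * 4);
  return { parents, offsets, links };
}

// Binary search Parent IDs, then slice Links using the CSR offsets.
function shardsForStory(index, storyId) {
  const { parents, offsets, links } = index;
  let lo = 0, hi = parents.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (parents[mid] < storyId) lo = mid + 1;
    else if (parents[mid] > storyId) hi = mid - 1;
    else return Array.from(links.slice(offsets[mid], offsets[mid + 1])); // shard IDs
  }
  return [];                                      // story has no cross-shard comments
}
```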
items(
id INTEGER PRIMARY KEY,
type TEXT,
time INTEGER,
by TEXT,
title TEXT,
text TEXT,
url TEXT,
score INTEGER,
parent INTEGER
)
edges(
parent_id INTEGER NOT NULL,
ord INTEGER NOT NULL,
child_id INTEGER NOT NULL,
PRIMARY KEY(parent_id, ord)
)
users(
username TEXT PRIMARY KEY,
first_time INTEGER,
last_time INTEGER,
items INTEGER,
comments INTEGER,
stories INTEGER,
ask INTEGER,
show INTEGER,
launch INTEGER,
jobs INTEGER,
polls INTEGER,
avg_score REAL,
sum_score INTEGER,
max_score INTEGER,
min_score INTEGER,
max_score_id INTEGER,
max_score_title TEXT
)
user_domains(username TEXT, domain TEXT, count INTEGER)
user_months(username TEXT, month TEXT, count INTEGER)
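A few illustrative queries against the user-stats tables (not the app's exact SQL), as they might be run through the in-browser SQLite engine:

```js
// Illustrative "Me" view queries against the user-stats shards.
const profile = `
  SELECT username, items, comments, stories, avg_score, max_score_title
  FROM users WHERE username = ?`;

const topDomains = `
  SELECT domain, count FROM user_domains
  WHERE username = ? ORDER BY count DESC LIMIT 10`;

const activityByMonth = `
  SELECT month, count FROM user_months
  WHERE username = ? ORDER BY month`;
```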
SELECT id, title, by, score, time, type, text, url, parent
FROM `bigquery-public-data.hacker_news.full`
WHERE deleted IS NULL OR deleted = false
- The manifest records a snapshot time (latest item time in the dump).
- The UI normalizes to the snapshot day end (23:59:59) so the "front page" is always the final day of the dump.
- Cache-busting is applied to manifests and indexes; shards are immutable by hash.
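In code terms, the day-end normalization is roughly this (illustrative sketch; whether the viewer uses UTC or local time is an assumption):

```js
// Normalize the snapshot time (Unix seconds) to the end of that day (23:59:59),
// so the "front page" date is always the final day of the dump.
function snapshotDayEnd(snapshotTimeSec) {
  const d = new Date(snapshotTimeSec * 1000);
  d.setUTCHours(23, 59, 59, 0);                 // UTC assumed for illustration
  return Math.floor(d.getTime() / 1000);
}
```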
- static-manifest.json(.gz) updated and points to hashed shards.
- archive-index.json(.gz) updated.
- cross-shard-index.bin(.gz) updated.
- static-user-stats-manifest.json(.gz) updated.
- Cache-bust string bumped in docs/index.html and docs/static.html.
- Verification: run ./toool/s/ci-verify.sh to ensure core assets are in place.
Hacker Book is compatible with Kiwix, an offline content platform that allows you to browse the full HN archive without any internet connection.
Kiwix packages websites into ZIM files - compressed archives optimized for offline browsing. You can use Kiwix on smartphones (even "dumb phones"), tablets, and computers to access content offline.
To package Hacker Book for Kiwix:
- Install zimwriterfs:
# On Ubuntu/Debian
sudo apt-get install zimwriterfs
# On macOS
brew install zimwriterfs
- Build or download the site:
# Option A: Download pre-built site
node toool/download-site.mjs
# Option B: Build from scratch
./toool/s/predeploy-checks.sh
- Create the ZIM file:
zimwriterfs \
--welcome=index.html \
--favicon=cd.png \
--language=eng \
--title="Hacker Book - HN Archive" \
--description="Complete offline Hacker News archive (2006-2025)" \
--creator="DOSAYGO STUDIO" \
--publisher="Community" \
--name="hackerbook" \
docs/ \
hackerbook.zim
- Use with Kiwix:
- Install Kiwix Desktop or Kiwix Android
- Open the hackerbook.zim file
- Browse the full HN archive offline
Once created, the ZIM file can be:
- Shared via USB drives
- Distributed through library.kiwix.org
- Hosted for download
- Used completely offline
Tip
The ZIM file will be large (~9-10GB). Consider splitting it into smaller chunks for easier distribution or creating year-specific ZIM files.
Have charts or analyses you want to feature? Email a link and a short caption to [email protected]. Bonus points for fresh takes and bold visualizations. 📊
Is this really offline? Yes. Once the shards you need are cached, browsing is offline. You can also download the whole site and run it locally.
How do I update the archive? Run download_hn.sh, then ./toool/s/predeploy-checks.sh to rebuild shards and indexes.
Why shards instead of HTTP Range requests? Shards are friendlier to static hosting and cache behavior, and make it easier to reason about data integrity.
How big is it? The full archive is ~9GB gzipped (~22GB uncompressed), split across ~1.6k shards.
Can I host this anywhere? Yes. Any static host works (Cloudflare Pages, S3, GitHub Pages, nginx, etc.).
- Works best on modern browsers (Chrome, Firefox, Safari) with DecompressionStream; falls back to pako gzip when needed.
- Mobile layout is optimized for tap navigation and small screens.
- Viewer and ETL code are MIT-licensed.
- HN content remains property of Y Combinator and the respective authors.
- Roadmap: speed up user stats build (issue #3).
- _headers sets long-cache + immutable for hashed shards and gz assets; manifests are cache-busted via HTML version strings (see the example below).
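An illustrative _headers entry for a Netlify/Cloudflare Pages-style host (example paths and values, not necessarily the repo's exact rules):

```
# Illustrative _headers rules: hashed shards are immutable, so cache them forever.
/static-shards/*
  Cache-Control: public, max-age=31536000, immutable
/static-user-stats-shards/*
  Cache-Control: public, max-age=31536000, immutable
```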


