Benchmark for v1.0: 1-year simulation results (50 backends, 1s sampling, PG17) #1
Description
What is pg_ash?
pg_ash is a lightweight Active Session History for Postgres. It samples pg_stat_activity every second via pg_cron and stores compressed snapshots as integer[] arrays in a 3-partition ring buffer (inspired by Skytools/PGQ). No C extension, no shared_preload_libraries changes — pure SQL/plpgsql, install-and-go.
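Installation then amounts to scheduling the sampler. A sketch, assuming pg_cron ≥ 1.5 (which added second-granularity schedules) and that `take_sample()` lives in the `ash` schema:

```sql
-- Run the sampler every second via pg_cron (interval syntax requires pg_cron >= 1.5).
SELECT cron.schedule('pg_ash-sample', '1 second', 'SELECT ash.take_sample()');
```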
Each sample encodes all active backends into a single integer[]: [-wait_id, count, query_id, query_id, ..., -next_wait, count, ...]. Dictionary tables (wait_event_map, query_map) map IDs back to names.
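As a concrete (hypothetical) illustration — IDs are made up here — a second where wait_id 3 was seen on two backends both running query 101, and wait_id 7 on one backend running query 205, would encode as:

```sql
-- [-wait_id, count, query_id, query_id, ..., -next_wait, count, ...]
SELECT '{-3,2,101,101,-7,1,205}'::integer[] AS data;
```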
Reader functions: top_waits() (wait event ranking), top_queries() (query ranking by samples), cpu_vs_waiting() (CPU vs wait breakdown), wait_timeline() (bucketed time series), samples_by_database() (per-database activity).
TL;DR
50 backends, 1s sampling, Postgres 17 (Docker). All timings are warm-cache; see Environment for details.
Production storage (3 partitions, daily rotation)
| Metric | Result |
|---|---|
| Storage per partition | 37 MiB (1 day, 86k rows) |
| Max disk usage | ~74 MiB (2 partitions hold data; 3rd is empty, ready for rotation) |
| Heap bytes/row | ~378 (stable across scales) |
Reader performance (1-day partition, 86k rows, warm cache, JIT off, median of 10 runs)
| Function | Median | Range | Notes |
|---|---|---|---|
| `top_waits('1 hour')` | 30 ms | 22–40 ms | Bitmap index scan |
| `top_waits('24 hours')` | 6,097 ms | 6,029–6,360 ms | Seq scan (full partition) |
| `top_queries('1 hour')` | 31 ms | 28–37 ms | |
| `cpu_vs_waiting('1 hour')` | 352 ms | 343–373 ms | Measured with JIT on; expect ~30 ms with JIT off |
| Rotation (TRUNCATE 86k rows) | 9 ms | | |
With `jit = on`, 1-hour queries jump from 30 ms to 340 ms (10×), and occasionally to 1.8 s when JIT fully compiles. Always use `SET jit = off` for pg_ash reader queries — or disable JIT globally on OLTP servers.
Operational overhead
| Metric | Result |
|---|---|
| `take_sample()` execution | 53 ms median (46–82 ms, 10 runs, 0 active backends) |
| WAL per sample | ~28 KiB steady state (~2.4 GiB/day); dominated by full_page_writes |
| Memory per sample (1,000 backends) | ~40–60 KiB peak (plpgsql copy-on-append) |
| Concurrent readers + writer | No lock contention ✅ |
How pg_ash compares
- pg_wait_sampling: C extension, in-memory ring buffer only (no persistent history). pg_ash trades ~37 MiB/day disk for SQL-queryable historical data.
- pgsentinel: C extension ASH. Requires compilation + shared_preload_libraries. pg_ash is pure SQL.
- pg_stat_statements: Cumulative query stats, not point-in-time samples. Complementary, not a replacement.
Environment
PostgreSQL 17.7 on x86_64 (Docker)
CPU: AMD EPYC-Milan, 4 cores
RAM: 15 GiB (no Docker memory limit)
Storage: ext4 on SSD (150 GiB)
shared_buffers: 128 MiB (default)
work_mem: 4 MiB
effective_cache_size: 4 GiB
wal_level: replica
synchronous_commit: on
checkpoint_timeout: 5 min
max_wal_size: 1 GiB
Note: These benchmarks measure relative performance and scaling characteristics. Absolute timings will vary by hardware. All reader timings are warm-cache (data in shared_buffers/OS page cache). Cold-cache performance would be I/O-bound and slower.
Benchmark Setup
- Simulated workload: 50 active backends, 1 database, 1-second sampling
- Wait events: 21 distinct (CPU*, IO, LWLock, Lock, Client, idle-in-transaction)
- Query IDs: 200 distinct, uniformly distributed (real workloads have skew — see Caveats)
- Encoding: `integer[]` format — avg 75 elements, avg 322 bytes per array
- Reader functions: inline SQL decode using `generate_subscripts()` (not plpgsql `decode_sample()`)
Storage
| Scale | Rows | Heap | Indexes | Total | Heap bytes/row |
|---|---|---|---|---|---|
| 1 day | 86,400 | 32 MiB | 5.5 MiB | 37 MiB | ~378 |
| 1 month | 2,678,400 | 978 MiB | 168 MiB | 1,146 MiB | ~378 |
| 1 year (measured) | 31,536,000 | 11 GiB | 1.36 GiB | 13 GiB | ~378 |
Notes:
- Heap bytes/row is stable at ~378 regardless of scale (measured at 382 with now-removed version byte)
- Total bytes/row (including indexes) increases slightly at scale: 390.3 at 31.5M rows due to B-tree overhead
- Indexes: `(sample_ts)` + `(datid, sample_ts)` per partition = 2 B-trees × 3 partitions
- TOAST compression does not apply at 50 backends (326-byte arrays are well below the ~2 KiB TOAST threshold)
- In production with daily rotation, only 2 of 3 partitions hold data at any time — real storage is ~74 MiB
Storage by backend count
| Backends | Array elements | Data column (bytes) | Est. row (bytes) | Storage/day |
|---|---|---|---|---|
| 10 | 17 | 92 | ~130 | 10.7 MiB |
| 50 | 75 | 324 | ~362 | 29.8 MiB |
| 100 | 137 | 572 | ~610 | 50.2 MiB |
| 200 | 287 | 1,172 | ~1,210 | 99.7 MiB |
| 500 | 727 | 2,932 | ~2,970 | 244.7 MiB |
| 1,000 | 1,453 | 5,836 | ~5,874 | 484.0 MiB |
Note: Estimated row sizes = data column + ~38 bytes (tuple header + fixed columns). Measured values are ~5% higher due to alignment padding. At 500+ backends, TOAST LZ4 compression would reduce actual storage (integer arrays with small values compress well). These projections assume uniform wait event distribution — real skewed distributions produce fewer distinct groups and smaller arrays.
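The Storage/day column is simply the estimated row size times 86,400 samples per day; e.g. for 50 backends:

```sql
-- ~362 bytes/row × 86,400 rows/day ≈ 29.8 MiB/day
SELECT pg_size_pretty(362::bigint * 86400);
```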
Reader Performance (repeated runs)
All timings on 1-day dataset (86,400 rows), warm cache, median of 10 runs.
1-hour window (~3,600 rows scanned)
| Function | Median | Range (10 runs) | Notes |
|---|---|---|---|
| `top_waits('1 hour')` | 339 ms | 314–362 ms | Bitmap Index Scan on `sample_0_ts_idx` |
| `top_queries('1 hour')` | 486 ms | 470–504 ms | Full array walk for query IDs |
| `cpu_vs_waiting('1 hour')` | 352 ms | 343–373 ms | Only needs wait_id + count |
24-hour window (full partition scan)
| Function | Median | Range (10 runs) | Notes |
|---|---|---|---|
| `top_waits('24 hours')` | 6,807 ms | 6,749–6,881 ms | Seq Scan (expected — full partition) |
JIT compilation impact
JIT kicked in on 2 of 10 runs for top_waits('1 hour'), inflating time from ~340ms to ~1,824ms. The EXPLAIN ANALYZE shows:
```
JIT: Functions: 48, Total: 1,561 ms
  (Inlining 407 ms, Optimization 637 ms, Emission 515 ms)
Execution Time: 2,185 ms (of which 1,561 ms = JIT)
```
JIT compilation cost dominates for short queries. Consider `SET jit = off` in reader functions, or adjusting `jit_above_cost`.
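One way to pin this down per function, so callers don't have to remember a session setting, is the standard `ALTER FUNCTION ... SET` mechanism (the `(interval)` signature is assumed here):

```sql
-- Attach jit = off to the reader itself; it applies only while the function runs.
ALTER FUNCTION ash.top_waits(interval) SET jit = off;

-- Or per session, before ad-hoc reader queries:
SET jit = off;
```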
EXPLAIN (ANALYZE, BUFFERS) for top_waits('1 hour')
```
Limit (cost=554357..554357 rows=5 width=136) (actual time=2052..2052 rows=5 loops=1)
  Buffers: shared hit=9429, temp read=1800 written=1803
  CTE totals
    -> GroupAggregate (actual time=434..493 rows=21 loops=1)
       -> Sort (Sort Method: external merge Disk: 14400kB) (actual time=426..482 rows=42428)
          -> Nested Loop (actual time=1..262 rows=42428)
             -> Bitmap Heap Scan on sample_0 (actual time=1..107 rows=3459)
                Index: sample_0_ts_idx
                Buffers: shared hit=9418
             -> Function Scan on generate_subscripts (actual time=0.016..0.035 rows=12 loops=3459)
                Filter: data[i] < 0 (removed 63 rows per loop)
```
Key observations:
- Bitmap Index Scan correctly used for time-range filter
- 100% buffer hits (shared hit=9,429, zero disk reads)
- External merge sort at 14.4 MiB (exceeds 4 MiB work_mem)
- generate_subscripts: ~75 elements per row, 12 negative markers (wait groups) per row
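The external merge sort suggests a second, cheaper fix besides disabling JIT: raising `work_mem` for analysis sessions so the 14.4 MiB sort stays in memory. The value below is a suggestion, not something benchmarked in this post:

```sql
-- Default here was 4 MiB; the 1-hour top_waits() sort spilled 14.4 MiB to disk.
SET work_mem = '32MB';
```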
EXPLAIN (ANALYZE, BUFFERS) for top_waits('24 hours')
```
Limit (actual time=7871..7871 rows=5 loops=1)
  Buffers: shared hit=176550
  CTE totals
    -> HashAggregate (actual time=6407..6407 rows=21 loops=1)
       -> Nested Loop (actual time=0.3..5705 rows=1057864 loops=1)
          -> Seq Scan on sample_0 (actual time=0.3..1591 rows=86257)
             Filter: sample_ts >= ... AND slot = ANY(...)
             Rows Removed by Filter: 143
             Buffers: shared hit=176542
          -> Function Scan on generate_subscripts (actual time=0.020..0.039 rows=12 loops=86257)
```
Key observations:
- Seq Scan used (full partition — expected for 24h on 1-day data)
- 176,550 buffer hits, zero disk reads
- 1,057,864 array elements decoded (86k rows × ~12 wait groups avg)
- HashAggregate (not sort) for full-partition aggregation
Reader optimization: 9–17× speedup
Original readers used CROSS JOIN LATERAL ash.decode_sample(s.data) — a plpgsql function called per row, doing per-element dictionary lookups via nested SQL queries. New approach: inline generate_subscripts(s.data, 1) with a single hash join to wait_event_map.
| Function | Old (plpgsql) | New (inline SQL) | Speedup |
|---|---|---|---|
| `top_waits('1 hour')` | 5,289 ms | 339 ms | 15.6× |
| `top_queries('1 hour')` | 5,173 ms | 486 ms | 10.6× |
| `cpu_vs_waiting('1 hour')` | 5,128 ms | 352 ms | 14.6× |
| `wait_timeline('1 hour')` | 5,785 ms | 666 ms | 8.7× |
Note: Old numbers were single-run. Speedup ratios are approximate.
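The inline decode pattern described above can be sketched as follows. Table and dictionary names follow this post (`sample_0(sample_ts, data integer[])`, `ash.wait_event_map` with assumed columns `wait_id`, `wait_event_name`); pg_ash's actual SQL may differ:

```sql
SELECT w.wait_event_name,
       sum(s.data[i + 1]) AS samples          -- element after each header is the backend count
FROM sample_0 s
CROSS JOIN LATERAL generate_subscripts(s.data, 1) AS i
JOIN ash.wait_event_map w ON w.wait_id = -s.data[i]
WHERE s.data[i] < 0                           -- negative elements mark wait-group headers
  AND s.sample_ts >= now() - interval '1 hour'
GROUP BY w.wait_event_name
ORDER BY samples DESC
LIMIT 5;
```

The key design point: one set-returning function scan per row plus a single hash join to the dictionary, instead of a plpgsql call with nested per-element lookups.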
Rotation
PGQ-style 3-partition ring buffer. TRUNCATE replaces the oldest partition — O(segment_files), not O(rows). Much faster than DELETE but not constant-time.
| Operation | Time | Notes |
|---|---|---|
| Rotation (TRUNCATE empty partition) | 14 ms | Advisory lock + config update + TRUNCATE |
| Rotation (TRUNCATE 86k rows / 37 MiB) | 9 ms | 1 heap segment + 2 index segments |
| Rotation (TRUNCATE 31.5M rows / 13 GiB) | 6.3 s | ~13 file segments to schedule for deletion |
In production (daily rotation), partitions hold at most 1 day (86k rows). The 6.3s number is from a synthetic stress test — it would never occur with normal rotation.
Bloat after rotation
| Partition | n_live_tup | n_dead_tup |
|---|---|---|
| sample_0 | 0 | 0 |
| sample_1 | 0 | 0 |
| sample_2 | 0 | 0 |
TRUNCATE replaces the relation files entirely. Zero dead tuples, zero bloat. No VACUUM needed for sample partitions.
Sampler Overhead
take_sample() measured with \timing (10 runs, 0 active backends):
53, 47, 54, 55, 52, 64, 50, 82, 54, 46 ms
Median: 53 ms
WAL generation
| Condition | WAL per sample | Notes |
|---|---|---|
| Right after checkpoint | ~134 KiB | Full page writes for first touch |
| First 10 after checkpoint | ~49 KiB avg | FPW amortizing |
| Steady state (100 samples) | ~28 KiB | Stable |
Actual data per sample is only ~2-3 KiB (1 heap row + 2 index entries + query_map updates). The ~28 KiB WAL is dominated by full_page_writes = on — every 8 KiB page dirtied for the first time after a checkpoint gets written in full. This isn't pg_ash-specific; it's how all Postgres writes work.
At 1 sample/second steady state: ~2.4 GiB/day WAL. pg_ash dirties ~3 pages per sample (1 heap + 2 index). On a system already generating WAL from application writes, this is modest.
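The per-sample WAL figures can be reproduced with the core LSN functions (`pg_current_wal_lsn()`, `pg_wal_lsn_diff()`); the `\gset` trick is psql-specific, and the `ash.take_sample()` name is assumed:

```sql
SELECT pg_current_wal_lsn() AS lsn_before \gset
SELECT ash.take_sample();
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')) AS wal_per_sample;
```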
take_sample() cost at scale
The sampler queries pg_stat_activity (shared memory scan), builds dictionaries, and constructs the array via plpgsql concatenation. Array building is O(n²) due to copy-on-append — at 1,000 backends, expect 200–400ms per sample (untested). This is the main scaling bottleneck; replacing iterative concatenation with array_agg() in a subquery would make it O(n).
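A sketch of the proposed O(n) rewrite: build each wait group's `[-wait_id, count, query_id...]` segment with one aggregate, then flatten all segments with a second, instead of appending element-by-element in plpgsql. `active_backends` is an illustrative stand-in for the decoded pg_stat_activity rows:

```sql
SELECT array_agg(e ORDER BY wait_id, ord) AS data
FROM (
  SELECT g.wait_id, u.ord, u.e
  FROM (
    -- one encoded segment per wait group
    SELECT wait_id,
           ARRAY[-wait_id, count(*)::int] || array_agg(query_id) AS seg
    FROM active_backends          -- illustrative: (wait_id int, query_id int) per backend
    GROUP BY wait_id
  ) g
  CROSS JOIN LATERAL unnest(g.seg) WITH ORDINALITY AS u(e, ord)
) flat;
```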
Multiple Databases
Tested with 4 databases, 1 hour of data per database (14,388 total rows):
| Function | Time |
|---|---|
| `samples_by_database('1 hour')` | 299 ms |
| `top_waits('1 hour')` across all databases | 372 ms |
The (datid, sample_ts) index correctly supports per-database filtering.
Concurrent Readers + Writer
5 concurrent top_waits() while take_sample() fires every 1s:
| Reader | Time |
|---|---|
| Reader 1 | 882 ms |
| Reader 2 | 854 ms |
| Reader 3 | 785 ms |
| Reader 4 | 848 ms |
| Reader 5 | 795 ms |
No lock contention between readers (AccessShareLock) and writer (RowExclusiveLock). Advisory lock only prevents concurrent sampler execution.
Caveat: Reader-vs-rotation contention exists — TRUNCATE acquires AccessExclusiveLock, which blocks readers. The code mitigates this with SET LOCAL lock_timeout = '2s' in rotate(). This was not explicitly tested in the concurrent benchmark.
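The rotation path that caveat describes might look like this sketch (the advisory-lock key and partition name are illustrative; `rotate()` is the real entry point):

```sql
BEGIN;
SET LOCAL lock_timeout = '2s';       -- fail fast rather than queue behind long readers
SELECT pg_advisory_xact_lock(42);    -- serialize concurrent rotations (key is illustrative)
TRUNCATE ash.sample_2;               -- oldest ring-buffer partition becomes the new write target
COMMIT;
```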
Dictionary Overhead
| Table | Rows | Size |
|---|---|---|
| wait_event_map | 21 | 48 KiB |
| query_map | 200 | 72 KiB |
| config | 1 | 24 KiB |
1-Year Stress Test (31,536,000 rows in 1 partition)
This would never happen in production — rotation prevents it. But it validates behavior at scale.
- Storage: 13 GiB total (11 GiB heap + 1.36 GiB indexes), 390.3 total bytes/row
- top_waits('1 hour'): 2.17 s (larger B-tree index)
- top_waits('7 days'): 40.3 s (scanning 30% of 31.5M rows)
- TRUNCATE 31.5M rows: 6.3 s (13 segment files to schedule for deletion)
- Bloat after TRUNCATE: 0 dead tuples ✅
Caveats & Limitations
- Synthetic data: Uniform random distribution across wait events and query IDs. Real workloads have Zipfian skew, bursty patterns, and variable backend counts. Skewed distributions would produce smaller arrays (fewer distinct wait groups) but different aggregation characteristics.
- Warm cache only: All reader timings assume data is in shared_buffers or the OS page cache. After a Postgres restart, the first queries on yesterday's data would be I/O-bound and slower.
- Docker environment: Running in Docker with the default Postgres config (128 MiB shared_buffers). Docker introduces filesystem overhead (overlay2) and potential CPU scheduling latency. Absolute timings carry an unknown Docker tax.
- JIT compilation: JIT adds ~1.5 s of overhead to reader queries. Production deployments should consider `SET jit = off` in reader functions or tuning `jit_above_cost`.
- pg_cron not tested: The Docker container lacked pg_cron, so actual scheduling accuracy at 1 s intervals was not validated.
- `take_sample()` at high backend counts: Array building is O(n²) due to plpgsql copy-on-append. At 500+ backends, sampler time may exceed 200 ms. Fixable by switching to `array_agg()`.
- Single-run baselines: The old plpgsql reader numbers and some rotation timings are single runs. The 9–17× speedup ratios should be treated as approximate.
Key Takeaways
- Storage is predictable: ~378 heap bytes/row, ~37 MiB/day for 50 backends. Scales linearly with backend count.
- 1-hour readers are interactive: 340–490ms (warm cache, JIT off). Usable for ad-hoc analysis.
- 24-hour readers need improvement: ~6.8s for full-day scans. Aggregate rollups would fix this.
- Rotation is instant for production-sized partitions: 9ms for 86k rows. Zero bloat.
- Sampler overhead is modest: ~53 ms per sample, ~28 KiB WAL per sample (~2.4 GiB/day).
- Inline SQL decode is 9–17× faster than plpgsql decode: Never use per-row function calls for array decoding.
TODO
- Aggregate rollup table — the single most impactful improvement; 24h queries from ~7 s to sub-second
- Disable JIT in reader functions — eliminates 1.5 s compilation overhead on short queries
- Replace plpgsql array concatenation with `array_agg()` — O(n) instead of O(n²) for sampler
- Benchmark with 100–500 real backends — validate scaling projections in practice
- Benchmark `take_sample()` under real load — pgbench with contention
- Cold-cache reader benchmarks — after PG restart, no page cache
- Test reader-vs-rotation contention — concurrent queries during TRUNCATE
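For the rollup item, one possible shape (hypothetical; not part of pg_ash today) that would let 24 h queries read roughly 1,440 × n_waits pre-aggregated rows instead of decoding 86k arrays:

```sql
CREATE TABLE ash.wait_rollup_1m (
  bucket_ts timestamptz NOT NULL,   -- 1-minute bucket
  datid     oid         NOT NULL,
  wait_id   int         NOT NULL,
  samples   bigint      NOT NULL,   -- summed backend counts for this bucket/database/wait
  PRIMARY KEY (bucket_ts, datid, wait_id)
);
```

Populated incrementally after each sample (or once per minute), with the same 3-partition rotation applied to the rollup itself.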