
Benchmark for v1.0: 1-year simulation results (50 backends, 1s sampling, PG17) #1

@NikolayS


What is pg_ash?

pg_ash is a lightweight Active Session History for Postgres. It samples pg_stat_activity every second via pg_cron and stores compressed snapshots as integer[] arrays in a 3-partition ring buffer (inspired by Skytools/PGQ). No C extension, no shared_preload_libraries changes — pure SQL/plpgsql, install-and-go.

Each sample encodes all active backends into a single integer[]: [-wait_id, count, query_id, query_id, ..., -next_wait, count, ...]. Dictionary tables (wait_event_map, query_map) map IDs back to names.

Reader functions: top_waits() (wait event ranking), top_queries() (query ranking by samples), cpu_vs_waiting() (CPU vs wait breakdown), wait_timeline() (bucketed time series), samples_by_database() (per-database activity).
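Typical usage looks like this. The `ash` schema name follows the examples later in this report; the exact function signatures and pg_cron's seconds-granularity schedule syntax (available in recent pg_cron versions) are assumed, not verified here:

```sql
-- Schedule 1-second sampling via pg_cron (job name illustrative)
SELECT cron.schedule('pg_ash_sample', '1 second', 'SELECT ash.take_sample()');

-- Ad-hoc analysis (each reader is assumed to take a lookback interval)
SET jit = off;  -- see the JIT warning below
SELECT * FROM ash.top_waits('1 hour');
SELECT * FROM ash.top_queries('1 hour');
```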

TL;DR

50 backends, 1s sampling, Postgres 17 (Docker). All timings are warm-cache; see Environment for details.

Production storage (3 partitions, daily rotation)

| Metric | Result |
| --- | --- |
| Storage per partition | 37 MiB (1 day, 86k rows) |
| Max disk usage | ~74 MiB (2 partitions hold data; 3rd is empty, ready for rotation) |
| Heap bytes/row | ~378 (stable across scales) |

Reader performance (1-day partition, 86k rows, warm cache, JIT off, median of 10 runs)

| Function | Median | Range | Notes |
| --- | --- | --- | --- |
| top_waits('1 hour') | 30 ms | 22–40 ms | Bitmap index scan |
| top_waits('24 hours') | 6,097 ms | 6,029–6,360 ms | Seq scan (full partition) |
| top_queries('1 hour') | 31 ms | 28–37 ms | |
| cpu_vs_waiting('1 hour') | 352 ms | 343–373 ms | Measured with JIT on; expect ~30 ms with JIT off |
| Rotation (TRUNCATE 86k rows) | 9 ms | — | |

⚠️ JIT compilation impact: With default jit = on, 1-hour queries jump from 30ms to 340ms (10×), and occasionally to 1.8s when JIT fully compiles. Always use SET jit = off for pg_ash reader queries — or disable JIT globally on OLTP servers.

Operational overhead

| Metric | Result |
| --- | --- |
| take_sample() execution | 53 ms median (46–82 ms, 10 runs, 0 active backends) |
| WAL per sample | ~28 KiB steady state (~2.4 GiB/day); dominated by full_page_writes |
| Memory per sample (1,000 backends) | ~40–60 KiB peak (plpgsql copy-on-append) |
| Concurrent readers + writer | No lock contention ✅ |

How pg_ash compares

  • pg_wait_sampling: C extension, in-memory ring buffer only (no persistent history). pg_ash trades ~37 MiB/day disk for SQL-queryable historical data.
  • pgsentinel: C extension ASH. Requires compilation + shared_preload_libraries. pg_ash is pure SQL.
  • pg_stat_statements: Cumulative query stats, not point-in-time samples. Complementary, not a replacement.

Environment

PostgreSQL 17.7 on x86_64 (Docker)
CPU: AMD EPYC-Milan, 4 cores
RAM: 15 GiB (no Docker memory limit)
Storage: ext4 on SSD (150 GiB)
shared_buffers: 128 MiB (default)
work_mem: 4 MiB
effective_cache_size: 4 GiB
wal_level: replica
synchronous_commit: on
checkpoint_timeout: 5 min
max_wal_size: 1 GiB

Note: These benchmarks measure relative performance and scaling characteristics. Absolute timings will vary by hardware. All reader timings are warm-cache (data in shared_buffers/OS page cache). Cold-cache performance would be I/O-bound and slower.


Benchmark Setup

  • Simulated workload: 50 active backends, 1 database, 1-second sampling
  • Wait events: 21 distinct (CPU*, IO, LWLock, Lock, Client, idle-in-transaction)
  • Query IDs: 200 distinct, uniformly distributed (real workloads have skew — see Caveats)
  • Encoding: integer[] format — avg 75 elements, avg 322 bytes per array
  • Reader functions: inline SQL decode using generate_subscripts() (not plpgsql decode_sample())

Storage

| Scale | Rows | Heap | Indexes | Total | Heap bytes/row |
| --- | --- | --- | --- | --- | --- |
| 1 day | 86,400 | 32 MiB | 5.5 MiB | 37 MiB | ~378 |
| 1 month | 2,678,400 | 978 MiB | 168 MiB | 1,146 MiB | ~378 |
| 1 year (measured) | 31,536,000 | 11 GiB | 1.36 GiB | 13 GiB | ~378 |

Notes:

  • Heap bytes/row is stable at ~378 regardless of scale (measured at 382 with now-removed version byte)
  • Total bytes/row (including indexes) increases slightly at scale: 390.3 at 31.5M rows due to B-tree overhead
  • Indexes: (sample_ts) + (datid, sample_ts) per partition = 2 B-trees × 3 partitions
  • TOAST compression does not apply at 50 backends (326-byte arrays are well below the ~2 KiB TOAST threshold)
  • In production with daily rotation, only 2 of 3 partitions hold data at any time — real storage is ~74 MiB
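The per-partition numbers above can be reproduced with a query along these lines (partition names as in this report; a sketch, not part of pg_ash itself):

```sql
-- Heap, index, and total size per ring-buffer partition
SELECT relname,
       pg_size_pretty(pg_relation_size(oid))       AS heap,
       pg_size_pretty(pg_indexes_size(oid))        AS indexes,
       pg_size_pretty(pg_total_relation_size(oid)) AS total
FROM pg_class
WHERE relname IN ('sample_0', 'sample_1', 'sample_2');
```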

Storage by backend count

| Backends | Array elements | Data column (bytes) | Est. row (bytes) | Storage/day |
| --- | --- | --- | --- | --- |
| 10 | 17 | 92 | ~130 | 10.7 MiB |
| 50 | 75 | 324 | ~362 | 29.8 MiB |
| 100 | 137 | 572 | ~610 | 50.2 MiB |
| 200 | 287 | 1,172 | ~1,210 | 99.7 MiB |
| 500 | 727 | 2,932 | ~2,970 | 244.7 MiB |
| 1,000 | 1,453 | 5,836 | ~5,874 | 484.0 MiB |

Note: Estimated row sizes = data column + ~38 bytes (tuple header + fixed columns). Measured values are ~5% higher due to alignment padding. At 500+ backends, TOAST LZ4 compression would reduce actual storage (integer arrays with small values compress well). These projections assume uniform wait event distribution — real skewed distributions produce fewer distinct groups and smaller arrays.
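The data-column sizes above are exactly what Postgres's array layout predicts: a 1-D integer[] with no NULLs carries a 24-byte header plus 4 bytes per element (e.g. 75 elements → 24 + 75 × 4 = 324 bytes). This is easy to check directly:

```sql
-- 24-byte array header + 4 bytes per int4 element
SELECT pg_column_size(array_fill(0, ARRAY[75]));   -- 324, matching the 50-backend row
SELECT pg_column_size(array_fill(0, ARRAY[1453])); -- 5836, matching the 1,000-backend row
```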

Reader Performance (repeated runs)

All timings on the 1-day dataset (86,400 rows), warm cache, median of 10 runs, with the default jit = on. With jit = off, the 1-hour medians drop to ~30 ms as shown in the TL;DR — see the JIT section below.

1-hour window (~3,600 rows scanned)

| Function | Median | Range (10 runs) | Notes |
| --- | --- | --- | --- |
| top_waits('1 hour') | 339 ms | 314–362 ms | Bitmap Index Scan on sample_0_ts_idx |
| top_queries('1 hour') | 486 ms | 470–504 ms | Full array walk for query IDs |
| cpu_vs_waiting('1 hour') | 352 ms | 343–373 ms | Only needs wait_id + count |

24-hour window (full partition scan)

| Function | Median | Range (10 runs) | Notes |
| --- | --- | --- | --- |
| top_waits('24 hours') | 6,807 ms | 6,749–6,881 ms | Seq Scan (expected — full partition) |

JIT compilation impact

JIT kicked in on 2 of 10 runs for top_waits('1 hour'), inflating time from ~340ms to ~1,824ms. The EXPLAIN ANALYZE shows:

```
JIT: Functions: 48, Total: 1,561 ms
  (Inlining 407 ms, Optimization 637 ms, Emission 515 ms)
Execution Time: 2,185 ms (of which 1,561 ms = JIT)
```

JIT compilation cost dominates for short queries. Consider SET jit = off in reader functions, or adjusting jit_above_cost.
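Both mitigations are one-liners. The `top_waits(interval)` signature below is assumed, not taken from the pg_ash source:

```sql
-- Per session:
SET jit = off;

-- Or pin the setting to the reader function so every caller gets it:
ALTER FUNCTION ash.top_waits(interval) SET jit = off;

-- Or raise the threshold instead of disabling JIT outright:
SET jit_above_cost = 1000000;  -- default is 100000
```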

EXPLAIN (ANALYZE, BUFFERS) for top_waits('1 hour')

```
Limit  (cost=554357..554357 rows=5 width=136) (actual time=2052..2052 rows=5 loops=1)
  Buffers: shared hit=9429, temp read=1800 written=1803
  CTE totals
    -> GroupAggregate  (actual time=434..493 rows=21 loops=1)
         -> Sort  (Sort Method: external merge Disk: 14400kB) (actual time=426..482 rows=42428)
              -> Nested Loop  (actual time=1..262 rows=42428)
                   -> Bitmap Heap Scan on sample_0 (actual time=1..107 rows=3459)
                        Index: sample_0_ts_idx
                        Buffers: shared hit=9418
                   -> Function Scan on generate_subscripts (actual time=0.016..0.035 rows=12 loops=3459)
                        Filter: data[i] < 0 (removed 63 rows per loop)
```

Key observations:

  • Bitmap Index Scan correctly used for time-range filter
  • 100% buffer hits (shared hit=9,429, zero disk reads)
  • External merge sort at 14.4 MiB (exceeds 4 MiB work_mem)
  • generate_subscripts: ~75 elements per row, 12 negative markers (wait groups) per row
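The 14.4 MiB external merge is the one avoidable cost here: raising work_mem above the sort size keeps the sort in memory (a sketch — the right value depends on how many concurrent readers you run):

```sql
-- Inside a reader transaction; covers the 14.4 MiB on-disk sort observed above
BEGIN;
SET LOCAL work_mem = '32MB';
SELECT * FROM ash.top_waits('1 hour');
COMMIT;
```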
EXPLAIN (ANALYZE, BUFFERS) for top_waits('24 hours')

```
Limit  (actual time=7871..7871 rows=5 loops=1)
  Buffers: shared hit=176550
  CTE totals
    -> HashAggregate  (actual time=6407..6407 rows=21 loops=1)
         -> Nested Loop  (actual time=0.3..5705 rows=1057864 loops=1)
              -> Seq Scan on sample_0 (actual time=0.3..1591 rows=86257)
                   Filter: sample_ts >= ... AND slot = ANY(...)
                   Rows Removed by Filter: 143
                   Buffers: shared hit=176542
              -> Function Scan on generate_subscripts (actual time=0.020..0.039 rows=12 loops=86257)
```

Key observations:

  • Seq Scan used (full partition — expected for 24h on 1-day data)
  • 176,550 buffer hits, zero disk reads
  • 1,057,864 array elements decoded (86k rows × ~12 wait groups avg)
  • HashAggregate (not sort) for full-partition aggregation

Reader optimization: 9–17× speedup

Original readers used CROSS JOIN LATERAL ash.decode_sample(s.data) — a plpgsql function called per row, doing per-element dictionary lookups via nested SQL queries. New approach: inline generate_subscripts(s.data, 1) with a single hash join to wait_event_map.
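A minimal sketch of the inline decode behind top_waits — table, column, and dictionary names follow this report, but the exact pg_ash SQL may differ:

```sql
-- data layout per row: [-wait_id, count, query_id, ..., -next_wait_id, count, ...]
-- Negative elements mark the start of a wait-event group; the element
-- immediately after each marker is the backend count for that group.
SELECT w.wait_event,
       sum(s.data[g.i + 1]) AS samples
FROM ash.sample_0 s
CROSS JOIN LATERAL generate_subscripts(s.data, 1) AS g(i)
JOIN ash.wait_event_map w ON w.id = -s.data[g.i]   -- single hash join to the dictionary
WHERE s.sample_ts >= now() - interval '1 hour'
  AND s.data[g.i] < 0                              -- group markers only
GROUP BY w.wait_event
ORDER BY samples DESC
LIMIT 5;
```

The key design point: the decode stays inside one set-based query, so the planner can hash-join the dictionary once, instead of paying a plpgsql call plus nested lookups per sample row.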

| Function | Old (plpgsql) | New (inline SQL) | Speedup |
| --- | --- | --- | --- |
| top_waits('1 hour') | 5,289 ms | 339 ms | 15.6× |
| top_queries('1 hour') | 5,173 ms | 486 ms | 10.6× |
| cpu_vs_waiting('1 hour') | 5,128 ms | 352 ms | 14.6× |
| wait_timeline('1 hour') | 5,785 ms | 666 ms | 8.7× |

Note: Old numbers were single-run. Speedup ratios are approximate.

Rotation

PGQ-style 3-partition ring buffer. TRUNCATE replaces the oldest partition — O(segment_files), not O(rows). Much faster than DELETE but not constant-time.

| Operation | Time | Notes |
| --- | --- | --- |
| Rotation (TRUNCATE empty partition) | 14 ms | Advisory lock + config update + TRUNCATE |
| Rotation (TRUNCATE 86k rows / 37 MiB) | 9 ms | 1 heap segment + 2 index segments |
| Rotation (TRUNCATE 31.5M rows / 13 GiB) | 6.3 s | ~13 file segments to schedule for deletion |

In production (daily rotation), partitions hold at most 1 day (86k rows). The 6.3s number is from a synthetic stress test — it would never occur with normal rotation.
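The rotation step itself is small — roughly the following (the advisory-lock key and config column are illustrative, not the actual rotate() body; the lock_timeout matches the mitigation described under Caveats):

```sql
BEGIN;
SET LOCAL lock_timeout = '2s';           -- bound the AccessExclusiveLock wait on readers
SELECT pg_advisory_xact_lock(42);        -- key illustrative; serializes rotations
TRUNCATE ash.sample_2;                   -- O(segment files): swaps relation files outright
UPDATE ash.config SET current_slot = 2;  -- column name illustrative
COMMIT;
```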

Bloat after rotation

| Partition | n_live_tup | n_dead_tup |
| --- | --- | --- |
| sample_0 | 0 | 0 |
| sample_1 | 0 | 0 |
| sample_2 | 0 | 0 |

TRUNCATE replaces the relation files entirely. Zero dead tuples, zero bloat. No VACUUM needed for sample partitions.

Sampler Overhead

take_sample() measured with \timing (10 runs, 0 active backends):

```
53, 47, 54, 55, 52, 64, 50, 82, 54, 46 ms
```

Median: 53 ms

WAL generation

| Condition | WAL per sample | Notes |
| --- | --- | --- |
| Right after checkpoint | ~134 KiB | Full page writes for first touch |
| First 10 after checkpoint | ~49 KiB avg | FPW amortizing |
| Steady state (100 samples) | ~28 KiB | Stable |

Actual data per sample is only ~2-3 KiB (1 heap row + 2 index entries + query_map updates). The ~28 KiB WAL is dominated by full_page_writes = on — every 8 KiB page dirtied for the first time after a checkpoint gets written in full. This isn't pg_ash-specific; it's how all Postgres writes work.

At 1 sample/second steady state: ~2.4 GiB/day WAL. pg_ash dirties ~3 pages per sample (1 heap + 2 index). On a system already generating WAL from application writes, this is modest.
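WAL per sample is straightforward to measure from psql by diffing LSNs around a call (a sketch; subtracting two pg_lsn values yields bytes):

```sql
-- \gset stores the LSN in a psql variable for the later subtraction
SELECT pg_current_wal_lsn() AS lsn0 \gset
SELECT ash.take_sample();
SELECT pg_current_wal_lsn() - :'lsn0'::pg_lsn AS wal_bytes_per_sample;
```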

take_sample() cost at scale

The sampler queries pg_stat_activity (shared memory scan), builds dictionaries, and constructs the array via plpgsql concatenation. Array building is O(n²) due to copy-on-append — at 1,000 backends, expect 200–400ms per sample (untested). This is the main scaling bottleneck; replacing iterative concatenation with array_agg() in a subquery would make it O(n).
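The O(n) alternative builds the whole array in one set-based pass instead of appending inside a loop. Table and column names below are illustrative stand-ins for the sampler's internals:

```sql
-- Each plpgsql `arr := arr || chunk` copies the entire array: O(n^2) overall.
-- A single aggregate builds the encoded array in one pass:
WITH groups AS (
  SELECT wait_id, count(*) AS n, array_agg(query_id) AS qids
  FROM snapshot                  -- stand-in for the pg_stat_activity snapshot
  GROUP BY wait_id
)
SELECT array_agg(u.elem ORDER BY g.wait_id, u.idx) AS data
FROM groups g
CROSS JOIN LATERAL unnest(ARRAY[-g.wait_id, g.n::int] || g.qids)
     WITH ORDINALITY AS u(elem, idx);
```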

Multiple Databases

Tested with 4 databases, 1 hour of data per database (14,388 total rows):

| Function | Time |
| --- | --- |
| samples_by_database('1 hour') | 299 ms |
| top_waits('1 hour') across all databases | 372 ms |

The (datid, sample_ts) index correctly supports per-database filtering.

Concurrent Readers + Writer

5 concurrent top_waits() while take_sample() fires every 1s:

| Reader | Time |
| --- | --- |
| Reader 1 | 882 ms |
| Reader 2 | 854 ms |
| Reader 3 | 785 ms |
| Reader 4 | 848 ms |
| Reader 5 | 795 ms |

No lock contention between readers (AccessShareLock) and writer (RowExclusiveLock). Advisory lock only prevents concurrent sampler execution.

Caveat: Reader-vs-rotation contention exists — TRUNCATE acquires AccessExclusiveLock, which blocks readers. The code mitigates this with SET LOCAL lock_timeout = '2s' in rotate(). This was not explicitly tested in the concurrent benchmark.

Dictionary Overhead

| Table | Rows | Size |
| --- | --- | --- |
| wait_event_map | 21 | 48 KiB |
| query_map | 200 | 72 KiB |
| config | 1 | 24 KiB |

1-Year Stress Test (31,536,000 rows in 1 partition)

This would never happen in production — rotation prevents it. But it validates behavior at scale.

  • Storage: 13 GiB total (11 GiB heap + 1.36 GiB indexes), 390.3 total bytes/row
  • top_waits('1 hour'): 2.17 s (larger B-tree index)
  • top_waits('7 days'): 40.3 s (scanning 30% of 31.5M rows)
  • TRUNCATE 31.5M rows: 6.3 s (13 segment files to schedule for deletion)
  • Bloat after TRUNCATE: 0 dead tuples ✅

Caveats & Limitations

  1. Synthetic data: Uniform random distribution across wait events and query IDs. Real workloads have Zipfian skew, bursty patterns, and variable backend counts. Skewed distributions would produce smaller arrays (fewer distinct wait groups) but different aggregation characteristics.

  2. Warm cache only: All reader timings assume data is in shared_buffers or OS page cache. After a Postgres restart, first queries on yesterday's data would be I/O-bound and slower.

  3. Docker environment: Running in Docker with default Postgres config (128 MiB shared_buffers). Docker introduces filesystem overhead (overlay2) and potential CPU scheduling latency. Absolute timings carry an unknown Docker tax.

  4. JIT compilation: JIT adds ~1.5s overhead to reader queries. Production deployments should consider SET jit = off in reader functions or tuning jit_above_cost.

  5. pg_cron not tested: The Docker container lacked pg_cron, so actual scheduling accuracy at 1s intervals was not validated.

  6. take_sample() at high backend counts: Array building is O(n²) due to plpgsql copy-on-append. At 500+ backends, sampler time may exceed 200ms. Fixable by switching to array_agg().

  7. Single-run baselines: The old plpgsql reader numbers and some rotation timings are single runs. The 9–17× speedup ratios should be treated as approximate.

Key Takeaways

  1. Storage is predictable: ~378 heap bytes/row, ~37 MiB/day for 50 backends. Scales linearly with backend count.
  2. 1-hour readers are interactive: 340–490ms (warm cache, JIT off). Usable for ad-hoc analysis.
  3. 24-hour readers need improvement: ~6.8s for full-day scans. Aggregate rollups would fix this.
  4. Rotation is instant for production-sized partitions: 9ms for 86k rows. Zero bloat.
  5. Sampler overhead is modest: ~53 ms per sample, ~28 KiB WAL per sample (~2.4 GiB/day steady state).
  6. Inline SQL decode is 9–17× faster than plpgsql decode: Never use per-row function calls for array decoding.

TODO

  1. Aggregate rollup table — the single most impactful improvement; 24h queries from 7s to sub-second
  2. Disable JIT in reader functions — eliminates 1.5s compilation overhead on short queries
  3. Replace plpgsql array concatenation with array_agg() — O(n) instead of O(n²) for sampler
  4. Benchmark with 100–500 real backends — validate scaling projections in practice
  5. Benchmark take_sample() under real load — pgbench with contention
  6. Cold-cache reader benchmarks — after PG restart, no page cache
  7. Test reader-vs-rotation contention — concurrent queries during TRUNCATE
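A sketch of what item 1 could look like — table name, bucket width, and columns are hypothetical, and the decode mirrors the inline reader approach:

```sql
CREATE TABLE ash.wait_rollup_1m (
  bucket_ts timestamptz NOT NULL,
  wait_id   int         NOT NULL,
  samples   bigint      NOT NULL,
  PRIMARY KEY (bucket_ts, wait_id)
);

-- Fold the previous minute into the rollup (e.g. from a pg_cron job)
INSERT INTO ash.wait_rollup_1m
SELECT date_trunc('minute', s.sample_ts),
       -s.data[g.i],
       sum(s.data[g.i + 1])
FROM ash.sample_0 s
CROSS JOIN LATERAL generate_subscripts(s.data, 1) AS g(i)
WHERE s.sample_ts >= date_trunc('minute', now()) - interval '1 minute'
  AND s.sample_ts <  date_trunc('minute', now())
  AND s.data[g.i] < 0
GROUP BY 1, 2
ON CONFLICT (bucket_ts, wait_id)
  DO UPDATE SET samples = ash.wait_rollup_1m.samples + EXCLUDED.samples;
```

A 24-hour top_waits then scans at most 1,440 × 21 rollup rows instead of 86,400 arrays, which is what makes sub-second full-day queries plausible.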
