Skip to content

Adjust v2 defaults#1805

Merged
badrishc merged 5 commits into
mainfrom
badrishc/adjust-defaults
May 17, 2026
Merged

Adjust v2 defaults#1805
badrishc merged 5 commits into
mainfrom
badrishc/adjust-defaults

Conversation

@badrishc

@badrishc badrishc commented May 15, 2026

Copy link
Copy Markdown
Collaborator

Updates Garnet's v2 defaults to better reflect typical operational deployments and the vector-database workload.

PageSize               : 32m → 16m
ReadCachePageSize      : 32m → 4m
ValueOverflowThreshold : 4k  → 16k

Rationale

PageSize = 16m (was 32m)

A modest 2× reduction. Earlier drafts proposed 4 MB; after a critical re-examination (see below), 16 MB is the more defensible compromise.

Real benefits over 32 MB:

  • Halves flush-completion latency at the device level (~5 ms vs ~10 ms for 16 MB vs 32 MB on a 3 GB/s NVMe). Flush completion gates FlushedUntilAddress, which gates replication acks and AOF commit cadence.
  • 2× finer HeadAddress step → LogSizeTracker keeps memory closer to target on tight budgets.
  • Halves the worst-case tail-page memory waste (16 MB pinned for a nearly-empty tail page vs 32 MB).
  • Page allocation zero-fill is 2× faster — less latency on the unlucky thread that crosses a page boundary into an unallocated page.

Cache behavior is preserved. For a 64 GB typical deployment:

  • 32 MB pages → 2,048 entries → pagePointers 16 KB → L1d.
  • 16 MB pages → 4,096 entries → pagePointers 32 KB → still L1d on every mainstream server CPU.
  • (4 MB pages would have been 16,384 entries → 128 KB → spills to L2. That's why we backed off 4 MB.)

Costs are modest:

  • 2× more page-boundary crossings (more PageOffset CAS / HandlePageOverflow invocations).
  • 2× faster RCU onset (records age from mutable → read-only 2× sooner).
  • 2× larger per-page metadata arrays (PageStatusIndicator, objectPages, etc.).

ReadCachePageSize = 4m (was 32m)

The previous 32 MB default was clearly broken for the read cache; 4 MB is the right operational point.

  • Avoid the broken-at-32m regime. With a 1 GB ReadCacheMemorySize default and 32 MB pages, there are only ~32 pages total — too coarse for any useful eviction or second-chance behavior. 4 MB gives 256 pages, which is plenty.
  • Eviction is byte-granular. LogSizeTracker.DetermineEvictionRange evicts at record granularity; the page-size choice only governs when the underlying page buffer is freed back to the allocator. At 256 pages on a 1 GB budget, each freed page is a 0.4% memory step — comfortably smooth.
  • Why not 1 MB? Earlier drafts proposed 1 MB. At 1,024 pages the eviction granularity (0.1% per page free) is overkill, and you pay 4× the per-page metadata (ObjectIdMap, PageStatusIndicator) and 4× the page-allocation churn during cache fills. 4 MB captures the win at a fraction of the overhead and matches the 4:1 ratio with the 16 MB main-log page.
  • No flush penalty either way. Read cache uses NullDevice — there is no flush IO whose cost grows with page count, so smaller pages are essentially free on that axis.

ValueOverflowThreshold = 16k (was 4k)

Critical for the vector-database use case.

  • DiskANN "1 IO per hop" invariant. Garnet's vector-set surface uses DiskANN, whose disk-graph traversal performance hinges on each hop being a single IO. Records stored as overflow byte[] currently incur 2 IOs on cold reads (one for the main-log record header to learn the overflow offset, one for the overflow payload itself) and cannot be coalesced because the overflow position lives in the header.
  • Covers all mainstream embeddings inline. Typical embedding vector sizes (FP32, plus ~24 B record header):
    • BGE-Base (768d): ~3.1 KB
    • text-embedding-ada-002 (1536d): ~6.3 KB
    • text-embedding-3-large (3072d): ~12.3 KB
  • 16 KB inline keeps all of those in a single IO. The next jump (4096d FP32, Cohere v4 / Mistral-embed at ~16.3 KB) still overflows, but those models are uncommon enough that the page-density trade-off of moving to 32 KB was judged not worth it.
  • Optional optimization assumed. Maximum benefit requires IStreamBuffer.InitialIOSize to be hinted up to the record size so the first IO grabs the entire inline record; without that hint the OS-page-sized initial read (4 KB on Linux x64) wastes one IO for inline records > 4 KB. That fix is straightforward and orthogonal.

Why not 4 MB? (notes from review)

Earlier iterations proposed PageSize = 4m. Critical review concluded the operational arguments for going that small don't hold up:

  • "Smaller flush bursts" — Flush is async; the writer never waits on it. Total bytes pushed are identical; smaller IOs may be less device-efficient. Only matters for FlushedUntilAddress tick frequency, which is a marginal effect.
  • "Finer ReadOnlyAddress granularity" — Actually a negative. Records age into immutable 8× faster, causing more RCU traffic on writes.
  • "Smaller blast radius on corruption" — Tsavorite doesn't have page-checksum-and-discard recovery; corruption typically fails recovery rather than discarding a page.
  • Cache fitpagePointers for 4 MB / 64 GB log is 128 KB, which spills out of L1d on every modern server CPU.
  • YCSB delta — The 5–6% in-memory-read advantage seen at 4 MB was on a dataset that fit entirely in memory (no flush/eviction), within run-to-run noise, and likely an artifact of insert-phase contention patterns rather than a steady-state read win.

16 MB captures the meaningful operational wins without these downsides.

Testing

GarnetServerConfigTests (ImportExportConfigLocal, ImportExportConfigAzure, ValueOverflowThresholdParsing) had hard-coded "32m" and "4k" assertions; updated to the new defaults. Affected tests pass locally.

Copilot AI review requested due to automatic review settings May 15, 2026 17:18

This comment was marked as outdated.

@badrishc badrishc marked this pull request as draft May 15, 2026 17:26
Addresses Copilot review comments on PR #1805:
- ImportExportConfigLocal/Azure: default PageSize now "4m" (was "32m");
  override changed to "8m" so the test continues to verify non-default
  tracking now that "4m" matches the default.
- ValueOverflowThresholdParsing: default ValueOverflowThreshold now "16k"
  (16384 bytes), was "4k" (4096 bytes).

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc marked this pull request as ready for review May 16, 2026 01:00
badrishc and others added 2 commits May 16, 2026 12:28
After a more critical review, 4 MB is hard to defend on a typical
64 GB log:
- The flush-burst, ReadOnly-granularity, and blast-radius rationale
  is weak or wrong-directional (smaller pages age records into
  immutable faster, causing MORE RCU, not less).
- pagePointers at 4 MB / 64 GB log = 128 KB, which spills out of
  L1d on every modern server CPU (32-48 KB L1d).
- The YCSB result was on an in-memory dataset with no flush/eviction
  activity and was within noise across iterations.

16 MB is a defensible compromise:
- Halves flush-completion latency and tail-page waste vs 32 MB.
- 2x finer eviction granularity.
- pagePointers (32 KB on 64 GB log) still fits L1d.
- Only doubles per-page metadata vs 32 MB instead of 8x.

ReadCachePageSize=1m and ValueOverflowThreshold=16k unchanged
(those wins are clean and well-grounded).

Co-authored-by: Copilot <[email protected]>
With a 1 GB ReadCacheMemorySize default and byte-granular eviction via
LogSizeTracker, 256 read-cache pages (4 MB each) already provide ample
eviction granularity (~0.4% memory step per page free) while halving
per-page metadata (ObjectIdMap, PageStatusIndicator) and reducing
page-allocation churn during cache fills. 1 MB was over-conservative.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc merged commit d3677cf into main May 17, 2026
282 of 283 checks passed
@badrishc badrishc deleted the badrishc/adjust-defaults branch May 17, 2026 04:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants