Improve store info stats, optimize vector IO #1901

Merged
badrishc merged 20 commits into main from badrishc/vector-set-io
Jun 27, 2026

@badrishc

Collaborator

No description provided.

badrishc and others added 15 commits June 26, 2026 16:01
Under high concurrent pending-read load (e.g. disk-served DiskANN vector
search), a single IO completion drainer convoys on the per-session
completion-signal path (SemaphoreSlim.Release Monitor + futex wake) and
collapses throughput past moderate concurrency. Raising the default number
of Native-device completion drain threads from 1 to 4 removes that collapse
on both libaio and io_uring backends; idle drainers park in the kernel, so
the cost when unused is negligible. io_uring scales further and more
CPU-efficiently; libaio benefits less beyond a few drainers.

Co-authored-by: Copilot <[email protected]>
Disk-served DiskANN reads previously used a single global initial-IO size
(--initial-io-record-size, default 128B) for every record term, so a large
fixed-size FullVector record took two IOs (a 128B header read, then a re-read
of the full record). No single value of the global flag could be right for
multiple vector sets with different dimensions/M in one instance.

Derive the initial disk-read size per vector set from its own geometry:
FullVector = dimensions*sizeof(float)+overhead, NeighborList =
numLinks*sizeof(int)+overhead. The sizes are stashed in thread-statics on
entry to a search/add (same single-threaded-DiskANN model as
ActiveThreadSession) and reset on context exit, so different sets get
different optimal sizes. Paths that don't set the geometry fall back to the
previous behavior, so nothing regresses; correctness is unaffected either way
because the read path still grows-and-retries on an undersized initial read.

Validated on a 50k Cohere-768d disk-served set (A/B, same data):
reads/query 1410 -> 753 (FullVector 2 IOs -> 1), bandwidth -13%, recall
preserved (0.8632 -> 0.8634).

Co-authored-by: Copilot <[email protected]>
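
To illustrate the sizing rule in the commit message above, here is a minimal Python sketch, not the Garnet implementation. The function names and the `OVERHEAD` value are hypothetical; only the formulas (dimensions*sizeof(float)+overhead, numLinks*sizeof(int)+overhead) and the grow-and-retry behavior come from the text.

```python
# Hypothetical per-record overhead in bytes; the real value is not stated in the PR.
OVERHEAD = 32


def full_vector_io_size(dimensions: int, overhead: int = OVERHEAD) -> int:
    """FullVector = dimensions * sizeof(float) + overhead."""
    return dimensions * 4 + overhead


def neighbor_list_io_size(num_links: int, overhead: int = OVERHEAD) -> int:
    """NeighborList = numLinks * sizeof(int) + overhead."""
    return num_links * 4 + overhead


def ios_for_record(record_bytes: int, initial_io_bytes: int) -> int:
    """Grow-and-retry model: one IO if the initial read covers the record,
    otherwise a second, full-size re-read."""
    return 1 if initial_io_bytes >= record_bytes else 2


record = full_vector_io_size(768)
ios_with_global_default = ios_for_record(record, 128)
ios_with_derived_size = ios_for_record(record, full_vector_io_size(768))
```

Under these assumptions a 768-dimension FullVector needs two IOs with the 128B default and one IO with the geometry-derived size, which matches the "FullVector 2 IOs -> 1" result reported above.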
The INFO STORE store-stats had two problems: Log.MemorySizeBytes overflowed
int32 for in-memory logs larger than 2 GiB (BufferSize*PageSize and
AllocatedPageCount<<bits were evaluated in 32-bit before widening to long),
so e.g. a 128 GiB log reported a negative number; and several byte-valued
fields were named ambiguously (IndexMemorySize reported #cache-lines, not
bytes; BufferSize is a page count, not a byte size).

- Fix the int32 overflow in AllocatorBase.MaxMemorySizeBytes and
  LogAccessor.MemorySizeBytes/MemorySizeBytesIncludingOverflowPages (also
  fixes the INFO MEMORY store-memory aggregate that summed the negative).
- Report byte-valued fields in bytes with a Bytes suffix and split max vs
  current: IndexMemorySizeBytes; Log.PageSizeBytes, Log.MaxPageCount (was
  BufferSize), Log.MaxMemorySizeBytes, Log.CurrentMemorySizeBytes,
  Log.CurrentHeapSizeBytes (+ ReadCache mirror); grouped capacity then
  addresses. Added Tsavorite.IndexSizeBytes and LogAccessor.PageSizeBytes.
- Update the two affected tests (ConfigSetIndexSizeTest, GetStoreAddressInfo).

Co-authored-by: Copilot <[email protected]>
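
The overflow described above can be reproduced in a small Python sketch. Python integers do not wrap, so `to_int32` simulates 32-bit two's-complement arithmetic; the page size and page count are illustrative values, not taken from the PR.

```python
def to_int32(x: int) -> int:
    """Wrap an integer to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x >= 0x8000_0000 else x


PAGE_SIZE = 32 << 20   # 32 MiB pages (illustrative)
PAGE_COUNT = 96        # 96 pages -> 3 GiB (illustrative)

# Buggy shape: the product is evaluated in 32 bits, then widened.
wrapped = to_int32(PAGE_COUNT * PAGE_SIZE)

# Fixed shape: widen first, then multiply (Python ints are already wide).
widened = PAGE_COUNT * PAGE_SIZE
```

Here `wrapped` comes out negative while `widened` is the expected 3 GiB, which is the class of error the fix removes by widening to 64 bits before multiplying or shifting.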
Fixes from code review of the geometry-derived initial-IO sizing:

- Size FullVector reads by the actual stored element width: the Redis
  quantizers (NoQuant/Bin/Q8) store F32 (4 bytes/dim) but the extended X*
  quantizers (XNoQuant_U8/I8, XBin_U8/I8) store 1 byte/dim. Pass quantType
  into SetActiveReadGeometry so byte-quantized sets aren't over-read ~4x.
  (Format mapping per VectorManager.TryGetEmbedding.)
- Instrument the VSIM-by-element path (ElementSimilarity) too, so
  Service.SearchElement gets the same per-index sizing as search-by-vector.
- Apply the per-batch initialIORecordSize in the non-SSE ContextReadWithPrefetch
  fallback loop as well (was only set on the SSE-prefetch and count==1 paths),
  so the sizing also takes effect on platforms without SSE (e.g. ARM64).

Co-authored-by: Copilot <[email protected]>
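
The element-width correction above can be sketched as a lookup from quantizer to stored bytes per dimension. This is an illustration only: the quantizer names are plain strings standing in for the real enum, and the helper names are hypothetical.

```python
# Quantizers that store F32 (4 bytes/dim) vs. 1 byte/dim, as listed in the commit message.
F32_QUANTIZERS = {"NoQuant", "Bin", "Q8"}
BYTE_QUANTIZERS = {"XNoQuant_U8", "XNoQuant_I8", "XBin_U8", "XBin_I8"}


def stored_bytes_per_dim(quant_type: str) -> int:
    """Width of one stored FullVector element for the given quantizer."""
    if quant_type in F32_QUANTIZERS:
        return 4
    if quant_type in BYTE_QUANTIZERS:
        return 1
    raise ValueError(f"unknown quantizer: {quant_type}")


def full_vector_payload_bytes(dimensions: int, quant_type: str) -> int:
    """FullVector payload size, excluding any record overhead."""
    return dimensions * stored_bytes_per_dim(quant_type)


f32_bytes = full_vector_payload_bytes(768, "NoQuant")
u8_bytes = full_vector_payload_bytes(768, "XNoQuant_U8")
```

Sizing a byte-quantized set with the F32 width would request 4x the payload, which is the over-read the fix avoids.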
Add IndexBucketCount (hash buckets), IndexOverflowBucketCount (overflow
buckets in use) and IndexBucketSizeBytes (cache-line bucket size) to the
store stats, so IndexBucketCount * IndexBucketSizeBytes == IndexMemorySizeBytes
is visible and the index occupancy/overflow can be inspected.

Co-authored-by: Copilot <[email protected]>
The geometry-derived disk-read sizing covered the FullVector (term 0) and
NeighborList (term 1) records but left the QuantizedVector (term 2) reads on
the small default initial size. On quantized sets the approximate-distance
pass reads term 2, so size it from the set's geometry too, per quantizer:

  - byte quantizers (Q8) store 1 byte/dim
  - binary quantizers (Bin) pack 1 bit/dim

sized over the reduced dimensions when REDUCE is applied. Confirmed against
the on-disk records (768-dim: Q8 term-2 = 788 bytes, Bin term-2 = 102 bytes),
so each quantized record now lands in a single IO without over-reading whole
sectors for the tiny binary records. Sizing falls back to the previous default
when geometry isn't set, and an under-read still self-corrects with a second
IO. Disk-served recall is unchanged (Q8 ~0.90, Bin as expected for binary).

Co-authored-by: Copilot <[email protected]>
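
As a rough check of the term-2 sizes quoted above, the following Python sketch computes the quantized payload per quantizer. The function name is hypothetical, and the "implied overhead" values are just the difference between the on-disk sizes reported in the commit message and the computed payload, not documented header sizes.

```python
def quantized_payload_bytes(dimensions: int, quant_type: str) -> int:
    """Payload of a QuantizedVector record: 1 byte/dim for Q8, 1 bit/dim for Bin."""
    if quant_type == "Q8":
        return dimensions
    if quant_type == "Bin":
        return (dimensions + 7) // 8  # round up to whole bytes
    raise ValueError(f"unsupported quantizer: {quant_type}")


q8_payload = quantized_payload_bytes(768, "Q8")
bin_payload = quantized_payload_bytes(768, "Bin")

# On-disk sizes reported in the commit message for 768 dimensions.
implied_q8_overhead = 788 - q8_payload
implied_bin_overhead = 102 - bin_payload
```

When REDUCE is applied, the same calculation would use the reduced dimension count in place of 768.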
The index stats reported the overflow bucket count but not its size, and the
index memory totals (INFO STORE IndexMemorySizeBytes and INFO MEMORY
store_index_size) counted only the main hash table, silently excluding the
overflow buckets. Overflow buckets are the same 64-byte cache-line layout as
main buckets and can exceed the main table under load (e.g. a 1MB / 16384-bucket
index with 200k keys grows ~19k overflow buckets, ~1.2MB), so the old totals
could undercount real index memory by more than half.

Add Tsavorite IndexOverflowBucketSizeBytes / IndexOverflowSizeBytes /
IndexTotalSizeBytes and surface a complete, symmetric INFO STORE index section:
main (IndexBucketCount/IndexBucketSizeBytes/IndexMemorySizeBytes), overflow
(IndexOverflowBucketCount/IndexOverflowBucketSizeBytes/IndexOverflowMemorySizeBytes),
and IndexTotalMemorySizeBytes. INFO MEMORY store_index_size now includes overflow
too. IndexMemorySizeBytes still reports the main table (== configured index size).

Co-authored-by: Copilot <[email protected]>
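
The arithmetic behind the example above, as a short Python sketch. The variable names mirror the INFO STORE field names for readability; the overflow bucket count of 19,000 is the approximate figure from the commit message.

```python
INDEX_BUCKET_SIZE_BYTES = 64          # one cache line per bucket

index_bucket_count = 16384            # 1 MiB main table
index_overflow_bucket_count = 19_000  # approximate, from the example above

index_memory_size_bytes = index_bucket_count * INDEX_BUCKET_SIZE_BYTES
index_overflow_memory_size_bytes = index_overflow_bucket_count * INDEX_BUCKET_SIZE_BYTES
index_total_memory_size_bytes = index_memory_size_bytes + index_overflow_memory_size_bytes

# Share of real index memory that a main-table-only total leaves out.
undercount_fraction = index_overflow_memory_size_bytes / index_total_memory_size_bytes
```

With these numbers the main table is exactly 1 MiB, the overflow buckets add about 1.2 MB, and a main-table-only total misses roughly 54% of the index memory.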
Overflow buckets share the identical 64-byte cache-line layout as main buckets,
so IndexOverflowBucketSizeBytes was always equal to IndexBucketSizeBytes. Drop
the duplicate field (and its Tsavorite property) and document that the single
IndexBucketSizeBytes covers both main and overflow buckets. The overflow memory
figure (IndexOverflowMemorySizeBytes = overflow bucket count x bucket size) and
IndexTotalMemorySizeBytes are unchanged.

Co-authored-by: Copilot <[email protected]>
The per-term initial disk-read sizes were three separate [ThreadStatic] int
fields (ActiveFullVectorIOSize / ActiveNeighborListIOSize /
ActiveQuantizedVectorIOSize) set and cleared one by one. Bundle them into a
single VectorReadGeometry struct held in one [ThreadStatic], populated in one
assignment by SetActiveReadGeometry and reset with `= default` on context exit.

No behavior change: the sizes computed and returned are identical. The read-path
getter now takes a single thread-static read (one struct copy) instead of
reading an individual thread-static per branch, and setup/teardown go from three
thread-static writes to one. Disk-served recall unchanged (NoQuant/Q8 ~0.94,
Bin ~0.70 on an 8k Cohere-768 spot check).

Co-authored-by: Copilot <[email protected]>
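
The bundling described above can be sketched in Python, with `threading.local` standing in for a C# [ThreadStatic] field. The struct and setter names follow the commit message; the field names, the getter, and the reset helper are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorReadGeometry:
    """Per-term initial disk-read sizes; all zeros means 'not set'."""
    full_vector_io_size: int = 0
    neighbor_list_io_size: int = 0
    quantized_vector_io_size: int = 0


_active = threading.local()  # stand-in for a single [ThreadStatic] field


def set_active_read_geometry(geometry: VectorReadGeometry) -> None:
    """Populate all three sizes in one assignment on context entry."""
    _active.geometry = geometry


def reset_active_read_geometry() -> None:
    """Equivalent of `= default` on context exit."""
    _active.geometry = VectorReadGeometry()


def get_active_read_geometry() -> VectorReadGeometry:
    """One thread-local read; threads that never set it see the default."""
    return getattr(_active, "geometry", VectorReadGeometry())
```

A caller would set the geometry on entry to a search or add, let the read path call the getter once per batch, and reset on exit; other threads are unaffected.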
When a vector index metadata record is read back from disk (after eviction or
recovery) the native index must be recreated, which triggers a copy-update of
the index key (RecreateIndexArg). GetRMWModifiedFieldInfo only sized the
copy-update destination for the replication append-log args, leaving the
recreate path with a zero-length value, so CopyUpdater's oldValue.CopyTo threw
"Destination is too short". Size the destination to the index value for the
recreate arg too.

Co-authored-by: Copilot <[email protected]>
Flushes the main store's in-memory hybrid log to the disk device and evicts it
(shifts HeadAddress to TailAddress) so subsequent reads are served from disk.
Gated behind --enable-debug-command. Useful for experiments that isolate the
disk-serving path, e.g. build a vector graph in memory, evict, then measure
disk-served query behavior.

Co-authored-by: Copilot <[email protected]>
Add a per-batch ReadCopyOptions to IReadArgBatch (default Inherit, resolved
through the session/store hierarchy so existing batches are unchanged) and have
the DiskANN read batch return per-term options: the small per-element records
that form the serial read-barrier chain (NeighborList adjacency, QuantizedVector
approximate-distance vectors, internal/external id maps) are copied back to the
main-log tail when read from disk, so subsequent hops and queries serve them
from memory; the large raw FullVectors stay on disk.

This keeps the graph "stub" memory-resident (as in classic DiskANN) while only
the raw vectors are disk-served. On a disk-served Cohere-768 NoQuant set (graph
evicted via DEBUG FLUSHANDEVICT), this - together with storing vectors inline so
each vector is a single IO - raised peak disk-served throughput from ~44K to
~312K IOPS (~8% to ~58% of device) and cut single-query latency ~17x, with
recall unchanged.

Co-authored-by: Copilot <[email protected]>
The small graph "stub" records that form the serial read-barrier chain
(NeighborList adjacency, internal/external id maps, quantized vectors) are
copied back into memory on disk read so subsequent hops and queries serve
them from memory. Previously they were always copied to the main-log tail,
which pollutes the writable main log with read-only graph data that then
has to be flushed back to disk when the log fills.

Route them to the read cache instead when it is enabled (--readcache): a
separate, never-flushed, LRU region that is the natural home for hot
read-only data, leaving the writable main log clean. The destination is
captured once from GarnetServerOptions.EnableReadCache at VectorManager
construction (StubReadCopyTo); when the read cache is disabled it falls
back to the main-log tail, so there is no regression for that configuration.
The large raw FullVector is still left on disk (CopyTo=None) — only the raw
vectors are served from disk.

Measured (50k Cohere-768, NoQuant, FlushAndEvict then queried): with
--readcache the main-log tail grows only ~80 bytes during the query phase
(index recreate) while ~447 KB of stub records land in the read cache;
without --readcache the same ~447 KB goes to the main-log tail (prior
behavior). recall@10 = 1.000 in both modes.

Co-authored-by: Copilot <[email protected]>
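
The routing policy above, reduced to a Python sketch. The enum and function names are illustrative rather than the actual Garnet types; the term numbers 0 to 2 follow the earlier commit messages, and the value used for the id-map term is a placeholder.

```python
from enum import Enum


class CopyTo(Enum):
    NONE = "none"            # leave the record on disk
    MAIN_LOG = "main_log"    # copy to the main-log tail
    READ_CACHE = "read_cache"


class Term(Enum):
    FULL_VECTOR = 0
    NEIGHBOR_LIST = 1
    QUANTIZED_VECTOR = 2
    ID_MAP = 3               # placeholder value; the real term id is not given in the PR


def stub_read_copy_to(read_cache_enabled: bool) -> CopyTo:
    """Destination for stub records, decided once at construction time."""
    return CopyTo.READ_CACHE if read_cache_enabled else CopyTo.MAIN_LOG


def copy_to_for(term: Term, stub_destination: CopyTo) -> CopyTo:
    """Raw FullVectors stay on disk; every stub term goes to the stub destination."""
    return CopyTo.NONE if term is Term.FULL_VECTOR else stub_destination


with_cache = stub_read_copy_to(True)
without_cache = stub_read_copy_to(False)
```

Capturing the destination once keeps the per-read decision to a single comparison on the term.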
…noquant

Capture the read-copy decision for the raw FullVector so it isn't "helpfully"
changed later: raw is served from disk (CopyTo=None) in both modes. In quant
mode it is cold (QuantizedVector drives distance; raw is reranking-only). In
noquant mode it is the hot distance source, but it is still not cached because
it is ~tens of times larger than the NeighborList at the same access frequency,
so admitting it to the shared read-cache LRU would evict the small,
higher-reuse-per-byte stubs (adjacency / id-maps) that traversal needs every
hop. Caching raw safely would require a separate/protected budget, not the
shared stub cache.

Co-authored-by: Copilot <[email protected]>
…BufferSize

BufferSize is MaxAllocatedPageCount rounded up to the next power of 2, so
MaxMemorySizeBytes = BufferSize * PageSize overestimated the real memory cap
(e.g. --memory 3g --page 16m => 192 pages, but BufferSize 256 reported 4 GiB).
AllocatedPageCount never exceeds MaxAllocatedPageCount, so that is the true
maximum; compute MaxMemorySizeBytes from it. Verified: 3g/16m now reports
3 GiB instead of 4 GiB.

Co-authored-by: Copilot <[email protected]>
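
The 3g/16m example above works out as follows in a small Python sketch. The helper and variable names are illustrative.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


memory_bytes = 3 << 30      # --memory 3g
page_size_bytes = 16 << 20  # --page 16m

max_allocated_page_count = memory_bytes // page_size_bytes
buffer_size = next_power_of_two(max_allocated_page_count)

old_max_memory_size_bytes = buffer_size * page_size_bytes
new_max_memory_size_bytes = max_allocated_page_count * page_size_bytes
```

The page-count cap is 192 pages, the circular buffer rounds up to 256, and the reported maximum drops from 4 GiB to the configured 3 GiB once it is computed from the cap.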
@badrishc force-pushed the badrishc/vector-set-io branch from a0b2894 to 9de6ba0 on June 26, 2026 23:03
An A/B experiment settles whether to admit the raw FullVector to the read cache
in no-quant mode. Under cache pressure (the disk-tiered case — you tier because
the set does not fit memory) caching raw is a net loss: 80k Cohere-768 NoQuant,
48 MB read cache, raw set ~246 MB gave 87.5 ms/query with raw cached vs 76.4 ms
without, despite ~40% fewer disk reads, because copying large raw vectors into
the cache thrashes it (~282 MB evicted / 250 queries) for more overhead than the
read savings — while the small stubs incur zero eviction either way. Caching raw
only wins when the whole set fits the read cache, which needs a size-aware
admission gate (not implemented). So raw stays on disk in both modes; only the
small stubs (adjacency / id-maps / quantized vectors) are cached.

Co-authored-by: Copilot <[email protected]>
@badrishc force-pushed the badrishc/vector-set-io branch from 5f5156d to c481806 on June 27, 2026 00:21
Rewrite the per-term read-copy and IO-size comments to state the current policy
plainly: small stubs are copied back to memory (read cache or main-log tail) and
the raw vector / attributes are served from disk; for no-quant sets caching the
raw vector yields no net gain once the working set exceeds the read cache. Also
make the DEBUG FLUSHANDEVICT and batch read-copy comments describe behavior only.

Co-authored-by: Copilot <[email protected]>
@badrishc marked this pull request as ready for review on June 27, 2026 01:17
Copilot AI review requested due to automatic review settings June 27, 2026 01:17

Copilot AI left a comment

Contributor

Pull request overview

This PR enhances Garnet/Tsavorite observability and vector-search IO behavior by refining INFO/store metrics (more accurate byte-based sizes and log limits) and introducing per-batch/per-term disk-read sizing plus selective read-copying for vector index “stub” records.

Changes:

  • Expand store/index/log INFO metrics (bytes, overflow buckets, page sizing, max vs current memory) and update related tests/utilities.
  • Add per-batch InitialIORecordSize + ReadCopyOptions to IReadArgBatch, and use them in Tsavorite prefetch reads to tune disk IO behavior (used by vector reads).
  • Improve vector IO by computing per-vector-set disk read geometry and copying small hot graph records back into memory; add DEBUG FLUSHANDEVICT and increase default native-device completion threads to 4.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/standalone/Garnet.test/TestUtils.cs | Updates store-info parsing to match renamed log memory metric key. |
| test/standalone/Garnet.test/RespConfigTests.cs | Adjusts config/metrics expectations for index size now reported in bytes. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs | Adds index size byte/overflow totals and uses batch-level read-copy + initial-IO sizing in prefetch reads. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/LogAccessor.cs | Adds page-size metric and fixes integer overflow in log memory size calculations. |
| libs/storage/Tsavorite/cs/src/core/Index/Interfaces/IReadArgBatch.cs | Introduces optional per-batch InitialIORecordSize and ReadCopyOptions. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Fixes MaxMemorySizeBytes overflow and aligns it with the true page-count cap. |
| libs/server/Storage/Functions/MainStore/VarLenInputMethods.cs | Ensures vector index copy-update sizing covers additional recreate-index path. |
| libs/server/Servers/GarnetServerOptions.cs | Changes default native-device completion drainer threads from 1 to 4 with updated rationale. |
| libs/server/Resp/Vector/VectorManager.Locking.cs | Resets per-thread vector read geometry when exiting a vector-set context. |
| libs/server/Resp/Vector/VectorManager.cs | Sets stub record copy destination and seeds per-operation active geometry for sized reads. |
| libs/server/Resp/Vector/VectorManager.Callbacks.cs | Implements per-term initial IO sizing and read-copy policy for vector batch reads; adds geometry computation helpers. |
| libs/server/Resp/Vector/DiskANNService.cs | Exposes term constants needed by vector read batch logic. |
| libs/server/Resp/CmdStrings.cs | Adds string constant for DEBUG FLUSHANDEVICT. |
| libs/server/Resp/AdminCommands.cs | Adds DEBUG FLUSHANDEVICT subcommand and help text. |
| libs/server/Metrics/Info/GarnetInfoMetrics.cs | Revises store stats metrics (index totals, log/readcache sizing and naming). |
| libs/host/defaults.conf | Updates default DeviceCompletionThreads to 4 and documents behavior. |
| libs/host/Configuration/Options.cs | Updates CLI help text and defaulting behavior for device completion threads. |

Comment thread libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs Outdated
Comment thread libs/server/Metrics/Info/GarnetInfoMetrics.cs
Comment thread libs/server/Metrics/Info/GarnetInfoMetrics.cs
badrishc and others added 2 commits June 26, 2026 19:23
FORCEGC is a debug/admin utility rather than a general command, so move it from a
top-level RESP command to `DEBUG FORCEGC [generation]`, alongside the other DEBUG
subcommands. Remove the top-level command (enum value, parser, dispatch,
command-info/docs JSON, ACL test) and document FORCEGC and FLUSHANDEVICT under
DEBUG in the website docs. Reply and generation validation are unchanged.

Co-authored-by: Copilot <[email protected]>
… page count

- ContextReadWithPrefetch now resolves per-batch ReadCopyOptions via
  ReadCopyOptions.Merge so a batch can override CopyFrom and CopyTo
  independently, instead of replacing the whole struct when CopyTo is
  Inherit (which dropped a per-batch CopyFrom override).
- INFO STORE Log.MaxPageCount / ReadCache.MaxPageCount now report
  MaxAllocatedPageCount (the actual page-count cap) instead of BufferSize
  (the power-of-2 circular-buffer capacity), making them self-consistent
  with MaxMemorySizeBytes. Added LogAccessor.MaxAllocatedPageCount.

Co-authored-by: Copilot <[email protected]>
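
The per-field merge described in the first bullet above can be sketched in Python as follows. This models the behavior in the commit message only; the enum members, the `merge` method, and the `old_resolve` helper are illustrative and do not reproduce the Tsavorite signatures.

```python
from dataclasses import dataclass
from enum import Enum


class CopyFrom(Enum):
    INHERIT = 0
    NONE = 1
    DEVICE = 2


class CopyTo(Enum):
    INHERIT = 0
    NONE = 1
    MAIN_LOG = 2
    READ_CACHE = 3


@dataclass(frozen=True)
class ReadCopyOptions:
    copy_from: CopyFrom
    copy_to: CopyTo

    def merge(self, override: "ReadCopyOptions") -> "ReadCopyOptions":
        """Resolve each field independently: INHERIT keeps this level's value."""
        return ReadCopyOptions(
            self.copy_from if override.copy_from is CopyFrom.INHERIT else override.copy_from,
            self.copy_to if override.copy_to is CopyTo.INHERIT else override.copy_to,
        )


def old_resolve(session: ReadCopyOptions, batch: ReadCopyOptions) -> ReadCopyOptions:
    """Previous behavior as described: the whole struct is taken from one side."""
    return session if batch.copy_to is CopyTo.INHERIT else batch


session = ReadCopyOptions(CopyFrom.DEVICE, CopyTo.MAIN_LOG)
batch = ReadCopyOptions(CopyFrom.NONE, CopyTo.INHERIT)

merged = session.merge(batch)
replaced = old_resolve(session, batch)
```

With a batch that overrides only CopyFrom, the whole-struct resolution silently keeps the session's CopyFrom, while the per-field merge honors the override and still inherits CopyTo.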

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

Comment thread libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Outdated
Comment thread libs/server/Resp/Vector/VectorManager.Callbacks.cs Outdated
- VectorManager.StubReadCopyTo is now a per-instance readonly field instead
  of a process-wide mutable static. The native DiskANN read callback reaches
  it via the already-thread-local ActiveThreadSession.vectorManager, so
  multiple servers/databases with different EnableReadCache settings in one
  process no longer clobber each other (which could make a store without a
  read cache attempt ReadCache copies).
- Fix IncrementAllocatedPageCount XML summary: it updates
  HighWaterAllocatedPageCount, not MaxAllocatedPageCount (the configured cap).

Co-authored-by: Copilot <[email protected]>
@badrishc merged commit f396fd2 into main on Jun 27, 2026
203 checks passed
@badrishc deleted the badrishc/vector-set-io branch on June 27, 2026 07:36