Improve store info stats, optimize vector IO #1901
Merged
Under high concurrent pending-read load (e.g. disk-served DiskANN vector search), a single IO completion drainer convoys on the per-session completion-signal path (SemaphoreSlim.Release Monitor + futex wake), and throughput collapses past moderate concurrency. Raising the default number of Native-device completion drain threads from 1 to 4 removes that collapse on both libaio and io_uring backends; idle drainers park in the kernel, so the cost when unused is negligible. io_uring scales further and more CPU-efficiently; libaio benefits less beyond a few threads. Co-authored-by: Copilot <[email protected]>
Disk-served DiskANN reads previously used a single global initial-IO size (--initial-io-record-size, default 128B) for every record term, so a large fixed-size FullVector record took two IOs (a 128B header read, then a re-read of the full record). Tuning the global flag could not be right for multiple vector sets with different dimensions/M in one instance. Derive the initial disk-read size per vector set from its own geometry: FullVector = dimensions*sizeof(float)+overhead, NeighborList = numLinks*sizeof(int)+overhead. The sizes are stashed in thread-statics on entry to a search/add (same single-threaded-DiskANN model as ActiveThreadSession) and reset on context exit, so different sets get different optimal sizes. Paths that don't set the geometry fall back to the previous behavior, so nothing regresses; correctness is unaffected either way because the read path still grows-and-retries on an undersized initial read. Validated on a 50k Cohere-768d disk-served set (A/B, same data): reads/query 1410 -> 753 (FullVector 2 IOs -> 1), bandwidth -13%, recall preserved (0.8632 -> 0.8634). Co-authored-by: Copilot <[email protected]>
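The per-set sizing rule described above can be sketched as follows. This is a minimal Python illustration, not the Garnet code: the function names are hypothetical, and the 24-byte `ASSUMED_OVERHEAD` is a placeholder chosen only for the example, since the commit message does not state the actual per-record overhead.

```python
SIZEOF_FLOAT = 4  # bytes per F32 element
SIZEOF_INT = 4    # bytes per neighbor id

# Assumption: placeholder per-record overhead; the real value is not given in the PR.
ASSUMED_OVERHEAD = 24


def full_vector_io_size(dimensions: int, overhead: int = ASSUMED_OVERHEAD) -> int:
    """Initial read size for a FullVector record: dimensions*sizeof(float)+overhead."""
    return dimensions * SIZEOF_FLOAT + overhead


def neighbor_list_io_size(num_links: int, overhead: int = ASSUMED_OVERHEAD) -> int:
    """Initial read size for a NeighborList record: numLinks*sizeof(int)+overhead."""
    return num_links * SIZEOF_INT + overhead


def ios_needed(record_size: int, initial_io_size: int) -> int:
    """One IO if the initial read covers the record, otherwise a second (grow-and-retry) IO."""
    return 1 if initial_io_size >= record_size else 2
```

With the old global 128-byte default, a 768-dimension FullVector record needs two IOs under this model; sized from the set's own geometry it needs one, which is the 2 IOs -> 1 effect reported above.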
The INFO STORE store-stats had two problems: Log.MemorySizeBytes overflowed int32 for in-memory logs larger than 2 GiB (BufferSize*PageSize and AllocatedPageCount<<bits were evaluated in 32-bit before widening to long), so e.g. a 128 GiB log reported a negative number; and several byte-valued fields were named ambiguously (IndexMemorySize reported #cache-lines, not bytes; BufferSize is a page count, not a byte size).

- Fix the int32 overflow in AllocatorBase.MaxMemorySizeBytes and LogAccessor.MemorySizeBytes/MemorySizeBytesIncludingOverflowPages (also fixes the INFO MEMORY store-memory aggregate that summed the negative).
- Report byte-valued fields in bytes with a Bytes suffix and split max vs current: IndexMemorySizeBytes; Log.PageSizeBytes, Log.MaxPageCount (was BufferSize), Log.MaxMemorySizeBytes, Log.CurrentMemorySizeBytes, Log.CurrentHeapSizeBytes (+ ReadCache mirror); grouped capacity then addresses. Added Tsavorite.IndexSizeBytes and LogAccessor.PageSizeBytes.
- Update the two affected tests (ConfigSetIndexSizeTest, GetStoreAddressInfo).

Co-authored-by: Copilot <[email protected]>
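The overflow class described above (multiplying in 32-bit, then widening) can be reproduced in a small Python sketch. Python integers do not overflow, so `wrap_int32` emulates the two's-complement wraparound of a 32-bit signed multiply; the function names are hypothetical and the 3 GiB log is an illustrative input, not a figure from the PR.

```python
def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def memory_size_bytes_buggy(page_count: int, page_size: int) -> int:
    """Multiply in 32-bit first; the later widening preserves the wrapped value."""
    return wrap_int32(page_count * page_size)


def memory_size_bytes_fixed(page_count: int, page_size: int) -> int:
    """Widen before multiplying, so the product is computed in 64-bit."""
    return page_count * page_size


PAGE_SIZE = 16 * 1024 * 1024  # 16 MiB pages
PAGES = 192                   # 192 * 16 MiB = 3 GiB
```

Here `memory_size_bytes_buggy(PAGES, PAGE_SIZE)` wraps to a negative value while the fixed version returns 3221225472 bytes. In C#, the equivalent fix is to cast one operand to `long` before the multiply or shift.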
Fixes from code review of the geometry-derived initial-IO sizing:

- Size FullVector reads by the actual stored element width: the Redis quantizers (NoQuant/Bin/Q8) store F32 (4 bytes/dim) but the extended X* quantizers (XNoQuant_U8/I8, XBin_U8/I8) store 1 byte/dim. Pass quantType into SetActiveReadGeometry so byte-quantized sets aren't over-read ~4x. (Format mapping per VectorManager.TryGetEmbedding.)
- Instrument the VSIM-by-element path (ElementSimilarity) too, so Service.SearchElement gets the same per-index sizing as search-by-vector.
- Apply the per-batch initialIORecordSize in the non-SSE ContextReadWithPrefetch fallback loop as well (was only set on the SSE-prefetch and count==1 paths), so the sizing also takes effect on platforms without SSE (e.g. ARM64).

Co-authored-by: Copilot <[email protected]>
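The element-width mapping from the first fix can be written as a lookup table. This is a Python sketch under stated assumptions: the quantizer names are taken from the commit message, while the dictionary, the function name, and the omission of per-record overhead are simplifications made for illustration.

```python
# Stored FullVector element width in bytes per dimension, per the commit message:
# Redis quantizers keep the raw vector as F32; extended X* quantizers store 1 byte/dim.
FULL_VECTOR_BYTES_PER_DIM = {
    "NoQuant": 4,
    "Bin": 4,
    "Q8": 4,
    "XNoQuant_U8": 1,
    "XNoQuant_I8": 1,
    "XBin_U8": 1,
    "XBin_I8": 1,
}


def full_vector_payload_bytes(quant_type: str, dimensions: int) -> int:
    """Payload bytes of a FullVector record (record overhead deliberately not modeled)."""
    return dimensions * FULL_VECTOR_BYTES_PER_DIM[quant_type]
```

For a 768-dimension set this gives 3072 bytes for NoQuant and 768 bytes for XNoQuant_U8, which is the ~4x over-read the fix avoids.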
Add IndexBucketCount (hash buckets), IndexOverflowBucketCount (overflow buckets in use) and IndexBucketSizeBytes (cache-line bucket size) to the store stats, so IndexBucketCount * IndexBucketSizeBytes == IndexMemorySizeBytes is visible and the index occupancy/overflow can be inspected. Co-authored-by: Copilot <[email protected]>
The geometry-derived disk-read sizing covered the FullVector (term 0) and NeighborList (term 1) records but left the QuantizedVector (term 2) reads on the small default initial size. On quantized sets the approximate-distance pass reads term 2, so size it from the set's geometry too, per quantizer:

- byte quantizers (Q8) store 1 byte/dim
- binary quantizers (Bin) pack 1 bit/dim, sized over the reduced dimensions when REDUCE is applied.

Confirmed against the on-disk records (768-dim: Q8 term-2 = 788 bytes, Bin term-2 = 102 bytes), so each quantized record now lands in a single IO without over-reading whole sectors for the tiny binary records. Sizing falls back to the previous default when geometry isn't set, and an under-read still self-corrects with a second IO. Disk-served recall is unchanged (Q8 ~0.90, Bin as expected for binary).

Co-authored-by: Copilot <[email protected]>
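The per-quantizer payload arithmetic for term 2 can be sketched as below. This is an illustrative Python fragment with hypothetical function names; it computes only the vector payload and does not model the per-record overhead, whose exact composition the PR does not describe.

```python
from typing import Optional


def q8_payload_bytes(dimensions: int) -> int:
    """Byte quantizer: 1 byte per dimension."""
    return dimensions


def bin_payload_bytes(dimensions: int, reduced_dimensions: Optional[int] = None) -> int:
    """Binary quantizer: 1 bit per dimension, packed into whole bytes.

    When REDUCE is applied, the packing is sized over the reduced dimensions.
    """
    effective = reduced_dimensions if reduced_dimensions is not None else dimensions
    return (effective + 7) // 8  # round up to a whole byte
```

For 768 dimensions this yields 768 bytes (Q8) and 96 bytes (Bin). The on-disk sizes reported above (788 and 102 bytes) are larger than these payloads; the remainder is record overhead that this sketch leaves out.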
The index stats reported the overflow bucket count but not its size, and the index memory totals (INFO STORE IndexMemorySizeBytes and INFO MEMORY store_index_size) counted only the main hash table, silently excluding the overflow buckets. Overflow buckets use the same 64-byte cache-line layout as main buckets and can exceed the main table under load (e.g. a 1MB / 16384-bucket index with 200k keys grows ~19k overflow buckets, ~1.2MB), so the old totals could undercount real index memory by more than half. Add Tsavorite IndexOverflowBucketSizeBytes / IndexOverflowSizeBytes / IndexTotalSizeBytes and surface a complete, symmetric INFO STORE index section: main (IndexBucketCount/IndexBucketSizeBytes/IndexMemorySizeBytes), overflow (IndexOverflowBucketCount/IndexOverflowBucketSizeBytes/IndexOverflowMemorySizeBytes), and IndexTotalMemorySizeBytes. INFO MEMORY store_index_size now includes overflow too. IndexMemorySizeBytes still reports the main table (== configured index size). Co-authored-by: Copilot <[email protected]>
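The relationship between the index fields can be checked with a few lines of arithmetic. This Python sketch reuses the INFO STORE field names from the commit message as dictionary keys; the function itself is hypothetical, and 19000 stands in for the approximate "~19k" overflow bucket count quoted above.

```python
BUCKET_SIZE_BYTES = 64  # one cache line; same layout for main and overflow buckets


def index_stats(bucket_count: int, overflow_bucket_count: int) -> dict:
    """Derive the byte-valued index fields from the two bucket counts."""
    main_bytes = bucket_count * BUCKET_SIZE_BYTES
    overflow_bytes = overflow_bucket_count * BUCKET_SIZE_BYTES
    return {
        "IndexBucketCount": bucket_count,
        "IndexBucketSizeBytes": BUCKET_SIZE_BYTES,
        "IndexMemorySizeBytes": main_bytes,  # main table only
        "IndexOverflowBucketCount": overflow_bucket_count,
        "IndexOverflowMemorySizeBytes": overflow_bytes,
        "IndexTotalMemorySizeBytes": main_bytes + overflow_bytes,
    }


# Approximate example from the commit message: 16384 main buckets, ~19k overflow buckets.
stats = index_stats(16384, 19000)
```

In this example the main table is 1048576 bytes and the overflow buckets add 1216000 bytes, so a total that counts only the main table reports less than half of the real index memory.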
Overflow buckets share the same 64-byte cache-line layout as main buckets, so IndexOverflowBucketSizeBytes was always equal to IndexBucketSizeBytes. Drop the duplicate field (and its Tsavorite property) and document that the single IndexBucketSizeBytes covers both main and overflow buckets. The overflow memory figure (IndexOverflowMemorySizeBytes = overflow bucket count x bucket size) and IndexTotalMemorySizeBytes are unchanged. Co-authored-by: Copilot <[email protected]>
The per-term initial disk-read sizes were three separate [ThreadStatic] int fields (ActiveFullVectorIOSize / ActiveNeighborListIOSize / ActiveQuantizedVectorIOSize) set and cleared one by one. Bundle them into a single VectorReadGeometry struct held in one [ThreadStatic], populated in one assignment by SetActiveReadGeometry and reset with `= default` on context exit. No behavior change: the sizes computed and returned are identical. The read-path getter now takes a single thread-static read (one struct copy) instead of reading an individual thread-static per branch, and setup/teardown go from three thread-static writes to one. Disk-served recall unchanged (NoQuant/Q8 ~0.94, Bin ~0.70 on an 8k Cohere-768 spot check). Co-authored-by: Copilot <[email protected]>
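The bundling pattern can be illustrated with Python's `threading.local` standing in for a C# `[ThreadStatic]` field. This is a sketch of the pattern only: the snake_case names, the term constants, and the "zero means unset, use the fallback" convention are assumptions for the example, not the Garnet implementation.

```python
import threading
from dataclasses import dataclass

FULL_VECTOR, NEIGHBOR_LIST, QUANTIZED_VECTOR = 0, 1, 2  # term ids as described in the PR


@dataclass(frozen=True)
class VectorReadGeometry:
    """All three per-term initial IO sizes in one value (0 means unset)."""
    full_vector_io_size: int = 0
    neighbor_list_io_size: int = 0
    quantized_vector_io_size: int = 0


_DEFAULT = VectorReadGeometry()
_active = threading.local()  # stand-in for a single [ThreadStatic] field


def set_active_read_geometry(geometry: VectorReadGeometry) -> None:
    _active.geometry = geometry  # one write on context entry


def reset_active_read_geometry() -> None:
    _active.geometry = _DEFAULT  # one write on context exit (like `= default`)


def get_initial_io_size(term: int, fallback: int) -> int:
    g = getattr(_active, "geometry", _DEFAULT)  # one thread-local read
    sizes = (g.full_vector_io_size, g.neighbor_list_io_size, g.quantized_vector_io_size)
    return sizes[term] if sizes[term] > 0 else fallback
```

A caller would set the geometry on entry to a search or add, let the read path query `get_initial_io_size` per term, and reset it on exit; threads that never set a geometry keep the fallback size.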
When a vector index metadata record is read back from disk (after eviction or recovery) the native index must be recreated, which triggers a copy-update of the index key (RecreateIndexArg). GetRMWModifiedFieldInfo only sized the copy-update destination for the replication append-log args, leaving the recreate path with a zero-length value, so CopyUpdater's oldValue.CopyTo threw "Destination is too short". Size the destination to the index value for the recreate arg too. Co-authored-by: Copilot <[email protected]>
Flushes the main store's in-memory hybrid log to the disk device and evicts it (shifts HeadAddress to TailAddress) so subsequent reads are served from disk. Gated behind --enable-debug-command. Useful for experiments that isolate the disk-serving path, e.g. build a vector graph in memory, evict, then measure disk-served query behavior. Co-authored-by: Copilot <[email protected]>
Add a per-batch ReadCopyOptions to IReadArgBatch (default Inherit, resolved through the session/store hierarchy so existing batches are unchanged) and have the DiskANN read batch return per-term options: the small per-element records that form the serial read-barrier chain (NeighborList adjacency, QuantizedVector approximate-distance vectors, internal/external id maps) are copied back to the main-log tail when read from disk, so subsequent hops and queries serve them from memory; the large raw FullVectors stay on disk. This keeps the graph "stub" memory-resident (as in classic DiskANN) while only the raw vectors are disk-served. On a disk-served Cohere-768 NoQuant set (graph evicted via DEBUG FLUSHANDEVICT), this - together with storing vectors inline so each vector is a single IO - raised peak disk-served throughput from ~44K to ~312K IOPS (~8% to ~58% of device) and cut single-query latency ~17x, with recall unchanged. Co-authored-by: Copilot <[email protected]>
The small graph "stub" records that form the serial read-barrier chain (NeighborList adjacency, internal/external id maps, quantized vectors) are copied back into memory on disk read so subsequent hops and queries serve them from memory. Previously they were always copied to the main-log tail, which pollutes the writable main log with read-only graph data that then has to be flushed back to disk when the log fills. Route them to the read cache instead when it is enabled (--readcache): a separate, never-flushed, LRU region that is the natural home for hot read-only data, leaving the writable main log clean. The destination is captured once from GarnetServerOptions.EnableReadCache at VectorManager construction (StubReadCopyTo); when the read cache is disabled it falls back to the main-log tail, so there is no regression for that configuration. The large raw FullVector is still left on disk (CopyTo=None) — only the raw vectors are served from disk. Measured (50k Cohere-768, NoQuant, FlushAndEvict then queried): with --readcache the main-log tail grows only ~80 bytes during the query phase (index recreate) while ~447 KB of stub records land in the read cache; without --readcache the same ~447 KB goes to the main-log tail (prior behavior). recall@10 = 1.000 in both modes. Co-authored-by: Copilot <[email protected]>
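The routing policy described above reduces to two small decisions, sketched here in Python. The enum mirrors the CopyTo values named in the commit messages (None, main-log tail, read cache), but the Python names, the term strings, and the `STUB_TERMS` set are hypothetical and chosen for illustration only.

```python
from enum import Enum


class CopyTo(Enum):
    NONE = "None"             # leave the record on disk
    MAIN_LOG = "MainLog"      # copy to the main-log tail
    READ_CACHE = "ReadCache"  # copy to the read cache


# Assumption: illustrative labels for the small "stub" records named in the PR.
STUB_TERMS = {"NeighborList", "QuantizedVector", "InternalIdMap", "ExternalIdMap"}


def stub_read_copy_to(enable_read_cache: bool) -> CopyTo:
    """Decided once at construction: read cache if enabled, else the main-log tail."""
    return CopyTo.READ_CACHE if enable_read_cache else CopyTo.MAIN_LOG


def copy_to_for_term(term: str, stub_destination: CopyTo) -> CopyTo:
    """Stubs are copied back to memory; everything else (e.g. FullVector) stays on disk."""
    return stub_destination if term in STUB_TERMS else CopyTo.NONE
```

Capturing the destination once, rather than consulting the option on every read, keeps the per-read callback to a set lookup and a field read.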
…noquant

Capture the read-copy decision for the raw FullVector so it isn't "helpfully" changed later: raw is served from disk (CopyTo=None) in both modes. In quant mode it is cold (QuantizedVector drives distance; raw is reranking-only). In noquant mode it is the hot distance source, but it is still not cached because it is ~tens of times larger than the NeighborList at the same access frequency, so admitting it to the shared read-cache LRU would evict the small, higher-reuse-per-byte stubs (adjacency / id-maps) that traversal needs every hop. Caching raw safely would require a separate/protected budget, not the shared stub cache. Co-authored-by: Copilot <[email protected]>
…BufferSize

BufferSize is MaxAllocatedPageCount rounded up to the next power of 2, so MaxMemorySizeBytes = BufferSize * PageSize overestimated the real memory cap (e.g. --memory 3g --page 16m => 192 pages, but BufferSize 256 reported 4 GiB). AllocatedPageCount never exceeds MaxAllocatedPageCount, so that is the true maximum; compute MaxMemorySizeBytes from it. Verified: 3g/16m now reports 3 GiB instead of 4 GiB. Co-authored-by: Copilot <[email protected]>
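The 3g/16m example can be reproduced with a short calculation. This Python sketch uses hypothetical function names and assumes the page-count cap is simply the memory budget divided by the page size, which matches the 192-page figure quoted above.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def max_memory_from_buffer_size(memory_bytes: int, page_size: int) -> int:
    """Old calculation: BufferSize (power-of-2 circular-buffer capacity) * PageSize."""
    max_allocated_page_count = memory_bytes // page_size
    buffer_size = next_power_of_two(max_allocated_page_count)
    return buffer_size * page_size


def max_memory_from_page_cap(memory_bytes: int, page_size: int) -> int:
    """New calculation: MaxAllocatedPageCount * PageSize."""
    return (memory_bytes // page_size) * page_size
```

For `--memory 3g --page 16m` the page cap is 192, the circular buffer rounds up to 256 pages, and the two calculations return 4 GiB and 3 GiB respectively. When the page count is already a power of two, both agree.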
An A/B experiment settles whether to admit the raw FullVector to the read cache in no-quant mode. Under cache pressure (the disk-tiered case — you tier because the set does not fit memory) caching raw is a net loss: 80k Cohere-768 NoQuant, 48 MB read cache, raw set ~246 MB gave 87.5 ms/query with raw cached vs 76.4 ms without, despite ~40% fewer disk reads, because copying large raw vectors into the cache thrashes it (~282 MB evicted / 250 queries) for more overhead than the read savings — while the small stubs incur zero eviction either way. Caching raw only wins when the whole set fits the read cache, which needs a size-aware admission gate (not implemented). So raw stays on disk in both modes; only the small stubs (adjacency / id-maps / quantized vectors) are cached. Co-authored-by: Copilot <[email protected]>
Rewrite the per-term read-copy and IO-size comments to state the current policy plainly: small stubs are copied back to memory (read cache or main-log tail) and the raw vector / attributes are served from disk; for no-quant sets caching the raw vector yields no net gain once the working set exceeds the read cache. Also make the DEBUG FLUSHANDEVICT and batch read-copy comments describe behavior only. Co-authored-by: Copilot <[email protected]>
Pull request overview
This PR enhances Garnet/Tsavorite observability and vector-search IO behavior by refining INFO/store metrics (more accurate byte-based sizes and log limits) and introducing per-batch/per-term disk-read sizing plus selective read-copying for vector index “stub” records.
Changes:
- Expand store/index/log INFO metrics (bytes, overflow buckets, page sizing, max vs current memory) and update related tests/utilities.
- Add per-batch `InitialIORecordSize` + `ReadCopyOptions` to `IReadArgBatch`, and use them in Tsavorite prefetch reads to tune disk IO behavior (used by vector reads).
- Improve vector IO by computing per-vector-set disk read geometry and copying small hot graph records back into memory; add `DEBUG FLUSHANDEVICT` and increase default native-device completion threads to 4.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/standalone/Garnet.test/TestUtils.cs | Updates store-info parsing to match renamed log memory metric key. |
| test/standalone/Garnet.test/RespConfigTests.cs | Adjusts config/metrics expectations for index size now reported in bytes. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs | Adds index size byte/overflow totals and uses batch-level read-copy + initial-IO sizing in prefetch reads. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/LogAccessor.cs | Adds page-size metric and fixes integer overflow in log memory size calculations. |
| libs/storage/Tsavorite/cs/src/core/Index/Interfaces/IReadArgBatch.cs | Introduces optional per-batch InitialIORecordSize and ReadCopyOptions. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Fixes MaxMemorySizeBytes overflow and aligns it with the true page-count cap. |
| libs/server/Storage/Functions/MainStore/VarLenInputMethods.cs | Ensures vector index copy-update sizing covers additional recreate-index path. |
| libs/server/Servers/GarnetServerOptions.cs | Changes default native-device completion drainer threads from 1 to 4 with updated rationale. |
| libs/server/Resp/Vector/VectorManager.Locking.cs | Resets per-thread vector read geometry when exiting a vector-set context. |
| libs/server/Resp/Vector/VectorManager.cs | Sets stub record copy destination and seeds per-operation active geometry for sized reads. |
| libs/server/Resp/Vector/VectorManager.Callbacks.cs | Implements per-term initial IO sizing and read-copy policy for vector batch reads; adds geometry computation helpers. |
| libs/server/Resp/Vector/DiskANNService.cs | Exposes term constants needed by vector read batch logic. |
| libs/server/Resp/CmdStrings.cs | Adds string constant for DEBUG FLUSHANDEVICT. |
| libs/server/Resp/AdminCommands.cs | Adds DEBUG FLUSHANDEVICT subcommand and help text. |
| libs/server/Metrics/Info/GarnetInfoMetrics.cs | Revises store stats metrics (index totals, log/readcache sizing and naming). |
| libs/host/defaults.conf | Updates default DeviceCompletionThreads to 4 and documents behavior. |
| libs/host/Configuration/Options.cs | Updates CLI help text and defaulting behavior for device completion threads. |
FORCEGC is a debug/admin utility rather than a general command, so move it from a top-level RESP command to `DEBUG FORCEGC [generation]`, alongside the other DEBUG subcommands. Remove the top-level command (enum value, parser, dispatch, command-info/docs JSON, ACL test) and document FORCEGC and FLUSHANDEVICT under DEBUG in the website docs. Reply and generation validation are unchanged. Co-authored-by: Copilot <[email protected]>
… page count

- ContextReadWithPrefetch now resolves per-batch ReadCopyOptions via ReadCopyOptions.Merge so a batch can override CopyFrom and CopyTo independently, instead of replacing the whole struct when CopyTo is Inherit (which dropped a per-batch CopyFrom override).
- INFO STORE Log.MaxPageCount / ReadCache.MaxPageCount now report MaxAllocatedPageCount (the actual page-count cap) instead of BufferSize (the power-of-2 circular-buffer capacity), making them self-consistent with MaxMemorySizeBytes. Added LogAccessor.MaxAllocatedPageCount.

Co-authored-by: Copilot <[email protected]>
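The difference between field-wise merging and whole-struct replacement can be shown with a small Python model. The class and enum below are simplified stand-ins written for this example; the member names and the `replace_whole` function (a model of the previous behavior as described in the commit message) are assumptions, not the Tsavorite source.

```python
from dataclasses import dataclass
from enum import Enum


class CopyFrom(Enum):
    INHERIT = 0
    NONE = 1
    DEVICE = 2


class CopyTo(Enum):
    INHERIT = 0
    NONE = 1
    MAIN_LOG = 2
    READ_CACHE = 3


@dataclass(frozen=True)
class ReadCopyOptions:
    copy_from: CopyFrom = CopyFrom.INHERIT
    copy_to: CopyTo = CopyTo.INHERIT

    def merge(self, override: "ReadCopyOptions") -> "ReadCopyOptions":
        """Field-wise merge: each non-Inherit field of `override` wins independently."""
        return ReadCopyOptions(
            self.copy_from if override.copy_from is CopyFrom.INHERIT else override.copy_from,
            self.copy_to if override.copy_to is CopyTo.INHERIT else override.copy_to,
        )


def replace_whole(base: ReadCopyOptions, override: ReadCopyOptions) -> ReadCopyOptions:
    """Model of the old behavior: ignore the whole override when its CopyTo is Inherit."""
    return base if override.copy_to is CopyTo.INHERIT else override


session = ReadCopyOptions(CopyFrom.NONE, CopyTo.NONE)
batch = ReadCopyOptions(CopyFrom.DEVICE, CopyTo.INHERIT)  # overrides CopyFrom only
```

With these inputs, `session.merge(batch)` keeps the batch's CopyFrom and inherits the session's CopyTo, whereas `replace_whole(session, batch)` returns the session options unchanged and drops the CopyFrom override.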
- VectorManager.StubReadCopyTo is now a per-instance readonly field instead of a process-wide mutable static. The native DiskANN read callback reaches it via the already-thread-local ActiveThreadSession.vectorManager, so multiple servers/databases with different EnableReadCache settings in one process no longer clobber each other (which could make a store without a read cache attempt ReadCache copies).
- Fix IncrementAllocatedPageCount XML summary: it updates HighWaterAllocatedPageCount, not MaxAllocatedPageCount (the configured cap).

Co-authored-by: Copilot <[email protected]>
TedHartMS
approved these changes
Jun 27, 2026