[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing#784
[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing#784
Conversation
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
in VecSimIndexAbstract move data members to private when possible DONT FIX LEAK YET Factories: move NewAbstractInitParams to a general location (new file factory_utils) replace it in all factories
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #784 +/- ##
==========================================
+ Coverage 96.64% 96.66% +0.01%
==========================================
Files 125 126 +1
Lines 7724 7707 -17
==========================================
- Hits 7465 7450 -15
+ Misses 259 257 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
use sanitizer in pull request
This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead. DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow.
fix leaks that will be moved to a separate PR it was failing only with codecov becuase only there we use FP64_TESTS=1 Prevent template deduction errors in GenerateAndAddVector by making data_t parameter non-deducible Used std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing compiler to incorrectly deduce int from literal values, which caused buffer overflows when index expected different data types.
This reverts commit af844ec.
|
af844ec will be addressed in a separate PR to allow backporting to all version brnaches |
|
Sanitizer will also be added in a separate PR |
ofiryanai
left a comment
There was a problem hiding this comment.
Looks ok, please make sure the todo in brute_force_factory.cpp is complete or remove the TODO comment and we're good to go
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical buffer read overflow vulnerability in vector preprocessing by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline. The core issue was that preprocessors were receiving incorrect size parameters, causing them to read beyond available input data boundaries.
Key changes include:
- Added
inputBlobSizemember to track original input vector size separately from processed storage size - Updated all preprocessor API calls to use correct input blob sizes instead of stored data sizes
- Centralized factory logic to ensure consistent parameter initialization across all index types
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/VecSim/vec_sim_index.h | Added inputBlobSize member and updated preprocessor calls to use correct sizes |
| src/VecSim/utils/vec_utils.h/.cpp | Renamed VecSimParams_GetDataSize to VecSimParams_GetStoredDataSize for clarity |
| src/VecSim/index_factories/factory_utils.h | New centralized factory utility template for consistent parameter initialization |
| src/VecSim/index_factories/*.cpp | Updated factories to use centralized parameter creation and correct size calculations |
| tests/unit/test_*.cpp | Fixed buffer overflows in tests and added comprehensive Cosine metric test coverage |
| tests/unit/unit_test_utils.cpp | Updated to use renamed getStoredDataSize() method |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
fix getQueryBlob in tiered add getHNSWIterator to tiered batch itertor if its BUILD_TESTS
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.2
git worktree add -d .worktree/backport-784-to-8.2 origin/8.2
cd .worktree/backport-784-to-8.2
git switch --create backport-784-to-8.2
git cherry-pick -x 9fb223ae78c5effccfcef44953a22e9d89df40f9 |
* rename dataSize->getStoredDataSize * missing rename * format * add input blob size to AbstractIndexInitParams in VecSimIndexAbstract move data members to private when possible DONT FIX LEAK YET Factories: move NewAbstractInitParams to a general location (new file factory_utils) replace it in all factories * try sanitizer * run sanitizer * runanyway * rervt task unit test use sanitizer in pull request * fix leak in input bob size * fix int8/uint8 elementSizeEstimation tests This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead. DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow. * run codecov with sanitizer (also intek) * some fixes and assertion * fix uint8 * fix again * fix possible warning abour divison by zero by checking quantBits with constexpr * TO REVERT! fix leaks that will be moved to a separate PR it was failing only with codecov becuase only there we use FP64_TESTS=1 Prevent template deduction errors in GenerateAndAddVector by making data_t parameter non-deducible Used std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing compiler to incorrectly deduce int from literal values, which caused buffer overflows when index expected different data types. * Revert "TO REVERT!" This reverts commit af844ec. * rever ci changes * calculate EstimateElementSize accroding to the stored vector size add tests * revert unrelated change in cmake.san * add batch itertor blob correctness to int8 tests fix getQueryBlob in tiered add getHNSWIterator to tiered batch itertor if its BUILD_TESTS * apply suggesting (cherry picked from commit 9fb223a)
[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing (#784) * rename dataSize->getStoredDataSize * missing rename * format * add input blob size to AbstractIndexInitParams in VecSimIndexAbstract move data members to private when possible DONT FIX LEAK YET Factories: move NewAbstractInitParams to a general location (new file factory_utils) replace it in all factories * try sanitizer * run sanitizer * runanyway * rervt task unit test use sanitizer in pull request * fix leak in input bob size * fix int8/uint8 elementSizeEstimation tests This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead. DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow. * run codecov with sanitizer (also intek) * some fixes and assertion * fix uint8 * fix again * fix possible warning abour divison by zero by checking quantBits with constexpr * TO REVERT! fix leaks that will be moved to a separate PR it was failing only with codecov becuase only there we use FP64_TESTS=1 Prevent template deduction errors in GenerateAndAddVector by making data_t parameter non-deducible Used std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing compiler to incorrectly deduce int from literal values, which caused buffer overflows when index expected different data types. * Revert "TO REVERT!" This reverts commit af844ec. * rever ci changes * calculate EstimateElementSize accroding to the stored vector size add tests * revert unrelated change in cmake.san * add batch itertor blob correctness to int8 tests fix getQueryBlob in tiered add getHNSWIterator to tiered batch itertor if its BUILD_TESTS * apply suggesting (cherry picked from commit 9fb223a)
This PR addresses a potential buffer read overflow issue resulting from incorrect blob size handling during vector preprocessing.
The core issue was confusion between
storedDataSize(processed blob size) and the actual input blob size, passingdataSize(stored data size) to the preprocessors instead of the actual input blob size.This mismatch caused preprocessors to copy more data than available in the input blob, leading to out-of-bounds reads.
Correct Behavior
The system should properly distinguish between two different size concepts:
inputBlobSize: Size of the original input vector blob (dim * sizeof(type))storedDataSize: Size of vectors after preprocessing (may include extra bytes for normalization)Data Size Relationships:
Non-Tiered Indexes:
storedDataSize = inputBlobSize = dim * sizeof(type)storedDataSize = inputBlobSize + 4(norm stored at end)Tiered Indexes:
storedDataSize = inputBlobSizeWhat Was Fixed
The fix ensures memory safety by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline.
The fix was implemented by:
inputBlobSizeas a new member to theVecSimIndexAbstractclass andAbstractIndexInitParamsstruct to explicitly track the original input vector sizeinputBlobSizeinstead ofstoredDataSizewhen processing input vectors, ensuring preprocessors only access the actual available input datainputBlobSizerepresents the original input size (dim * sizeof(type)) whilestoredDataSizerepresents the final processed size (which may include extra bytes for normalizationAdditional Changes
Centralized Factory Logic:
src/VecSim/index_factories/factory_utils.h(new)NewAbstractInitParams()template function to standardize parameter initialization across all factories, eliminating code duplication and ensuring consistentinputBlobSizecalculation logic throughout the codebaseRenamed
dataSize→StoredDataSizefor ClarityTest Fixes - Buffer Overflow in Element Size Estimation:
test_int8.cpp,test_uint8.cppTieredIndexParamswas created fromHNSWParamsthat didn't explicitly set thetypefield.The buffer overflow occurred because when the test called
addVector,DataBlock::addElement()tried to copydim * sizeof(float)bytes (16 bytes for dim=4), but the allocated buffer was onlydim * sizeof(int8_t)bytes (4 bytes for dim=4), leading to out-of-bounds memory access during vector operations.Test Fixes - Compiler Warnings:
test_svs.cpp,test_svs_multi.cpp,test_svs_tiered.cppquantBitsconstexpr. The compiler couldn't prove at compile-time thatquantBitswas never zero, even with runtime checks. Usingconstexprmoves the evaluation to compile-time, allowingif constexprto eliminate the division code whenquantBits == VecSimSvsQuant_NONE, satisfying static analysis requirements.Fixed Element Size Estimation in Factory Functions:
src/VecSim/index_factories/brute_force_factory.cpp,src/VecSim/index_factories/hnsw_factory.cppEstimateElementSize()functions to useVecSimParams_GetStoredDataSize()instead ofdim * sizeof(type)for accurate memory estimation. This ensures size calculations account for additional storage requirements (such as normalization data for INT8/UINT8 with Cosine metric) and provides consistent estimation across all index types. Added comprehensive test coverage for Cosine metric element size estimation in INT8 and UINT8 test suites to validate the fix.Fixed Query Blob Access in Tiered Batch Iterator:
src/VecSim/batch_iterator.h,src/VecSim/algorithms/hnsw/hnsw_tiered.hgetQueryBlob()virtual in the baseVecSimBatchIteratorclass and overrode it inTieredHNSW_BatchIteratorto properly delegate to the frontend iterator. This was necessary because the tiered batch iterator itself doesn't own any query vector - it relies on its internal frontend and backend iterators to manage their own query copies.Enhanced Batch Iterator Testing for INT8 Cosine Metric:
tests/unit/test_int8.cpp,src/VecSim/algorithms/hnsw/hnsw_tiered.h,src/VecSim/batch_iterator.hCosineBlobCorrectnesstest to explicitly validate query blob transfer from frontend to backend in tiered batch iterators:BruteForceiterator to the backendHNSWiterator.