feat(core): prevent unnecessary partition rewrites when inserting duplicate data #5764
bluestreak01 merged 42 commits into master
…l-noop Conflicts: core/src/main/java/io/questdb/cairo/O3PartitionJob.java core/src/main/java/io/questdb/cairo/TableWriter.java core/src/test/java/io/questdb/test/AbstractCairoTest.java core/src/test/java/io/questdb/test/cairo/wal/WalWriterReplaceRangeTest.java
…-dedup-identical-noop Conflicts: core/src/main/java/io/questdb/cairo/TableWriter.java
Pull Request Overview
Optimize the system to avoid rewriting partitions when identical data is re-inserted under dedup settings, while enhancing frame and concurrent queue abstractions.
- Introduce in-memory frame support alongside file-backed frames and new FrameFactory methods.
- Refactor ConcurrentQueue/ConcurrentPool APIs with a static factory and pluggable manipulators.
- Add dedup skip logic in O3PartitionJob/O3OpenColumnJob and related frame-algebra checks.
Comments suppressed due to low confidence (3)
core/src/main/java/io/questdb/cairo/frm/FrameColumnTypePool.java:43
- The new createFromMemoryColumn method lacks Javadoc—please add a description and parameter docs to explain its purpose and usage.
FrameColumn createFromMemoryColumn(
core/src/main/java/io/questdb/mp/ConcurrentQueue.java:150
- [nitpick] Having both tryDequeue(T) returning boolean and tryDequeueValue(T) returning T may confuse callers; consider renaming or consolidating these methods for clarity.
public T tryDequeueValue(T container) {
core/src/main/java/io/questdb/cairo/frm/file/FrameFactory.java:53
- The clear() method is documented as not thread safe—consider synchronizing it or annotating with a thread-safety warning to prevent misuse.
// It is NOT thread safe.
Update: fixed after native lib rebuild
Below are the results of insertion of identical rows on a deduplicated table via TSBS (ILP/TCP). patch:
I also tried inserting rows with different fixed-size column data interchangeably. Everything ran fine, and the no-op check kicked in as expected.
[PR Coverage check] 😍 pass: 334 / 383 (87.21%)
Description
Core Problem Solved: When QuestDB has deduplication enabled on a table, and users insert data that's exactly the same as what's already in the table, the system was still rewriting entire partitions unnecessarily. This PR adds a "no-op" optimization to skip partition rewrites when all the data being inserted is identical to existing data.
Key Changes:
- Smart Duplicate Detection: When all the rows in the merge section are duplicates, the non-key columns are compared as well; if they match too, the partition is not modified.
- Frame API Extensions: The implementation extends the Partition Frame API to perform the data check, which lets the system compare existing data with incoming data efficiently.
- Thread Safety Improvements: The FrameFactory class is made thread safe so that the checks can run concurrently across multiple partitions.
- Performance Boost: The benchmarks show significant improvements when inserting identical data, because the system can recognize that the incoming data is exactly the same and skip the expensive partition rewrite.
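The duplicate detection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class `NoOpDedupCheck`, the `Row` and `DedupKey` records, and the single non-key `value` column are all hypothetical, and the real check works on column frames rather than row objects.

```java
import java.util.HashMap;
import java.util.Map;

public class NoOpDedupCheck {
    // Hypothetical row shape: timestamp and key form the dedup key, value is a non-key column.
    public record Row(long ts, int key, double value) {}

    private record DedupKey(long ts, int key) {}

    // Returns true when every incoming row already exists with identical
    // key and non-key data, so applying the commit would change nothing.
    public static boolean isNoOp(Row[] existing, Row[] incoming) {
        Map<DedupKey, Double> byKey = new HashMap<>();
        for (Row r : existing) {
            byKey.put(new DedupKey(r.ts(), r.key()), r.value());
        }
        for (Row r : incoming) {
            Double current = byKey.get(new DedupKey(r.ts(), r.key()));
            if (current == null) {
                return false; // a genuinely new row: the partition must be written
            }
            if (Double.compare(current, r.value()) != 0) {
                return false; // same key, different non-key data: an update, not a no-op
            }
        }
        return true;
    }
}
```

Only when `isNoOp` returns true would a writer be able to skip the partition rewrite; any new key or changed non-key value falls back to the normal merge path.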
Why this matters:
- Performance: Massive speed improvements when applications accidentally re-insert the same data
- Resource Efficiency: Avoids unnecessary disk I/O and CPU usage for no-op operations
- Better User Experience: Applications that retry failed insertions or have duplicate data pipelines won't cause performance degradation
This is particularly valuable for time-series workloads where duplicate data insertion can happen due to network retries, application restarts, or data pipeline issues.
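The thread-safety change listed under Key Changes can be pictured with a small sketch. It assumes a pool of reusable frame objects shared by worker threads that each check one partition. `PooledPartitionChecks`, `FramePool`, and `Frame` are hypothetical names for illustration and do not reflect the actual FrameFactory API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledPartitionChecks {
    // Hypothetical thread-safe pool: frames are recycled through a lock-free queue.
    static final class FramePool {
        private final ConcurrentLinkedQueue<Frame> free = new ConcurrentLinkedQueue<>();

        Frame open(long[] column) {
            Frame f = free.poll();
            if (f == null) {
                f = new Frame(this);
            }
            f.column = column;
            return f;
        }

        void release(Frame f) {
            free.offer(f);
        }
    }

    // Hypothetical frame over one column of one partition.
    static final class Frame implements AutoCloseable {
        private final FramePool pool;
        private long[] column;

        Frame(FramePool pool) {
            this.pool = pool;
        }

        boolean sameAs(long[] other) {
            return Arrays.equals(column, other);
        }

        @Override
        public void close() {
            column = null;
            pool.release(this); // hand the frame back for reuse by any thread
        }
    }

    // Compares existing[p] with incoming[p] for every partition p, in parallel.
    public static boolean[] checkAll(long[][] existing, long[][] incoming, int threads) throws Exception {
        FramePool pool = new FramePool();
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (int i = 0; i < existing.length; i++) {
                final int p = i;
                futures.add(exec.submit(() -> {
                    try (Frame f = pool.open(existing[p])) {
                        return f.sameAs(incoming[p]);
                    }
                }));
            }
            boolean[] result = new boolean[existing.length];
            for (int i = 0; i < result.length; i++) {
                result[i] = futures.get(i).get();
            }
            return result;
        } finally {
            exec.shutdown();
        }
    }
}
```

Because the pool itself is safe to share, each partition check can borrow and return a frame without any external locking.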
Implementation
Implementation extends the Partition Frame API to perform the data check. It also makes the FrameFactory class thread safe so that the checks can run concurrently across multiple partitions.
Testing
Along with the new tests introduced in DedupWalWriterTest, the existing tests in DedupInsertFuzzTest are modified to generate identical or semi-identical commits. It is done in two ways:
… _txn file.
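As a rough picture of what an identical or semi-identical commit could look like in a fuzz test, the sketch below replays a batch of values and mutates each one with a configurable probability. It is an assumption-laden illustration only: `SemiIdenticalCommits` and `replay` are hypothetical and are not the mechanism used in DedupInsertFuzzTest.

```java
import java.util.Random;

public class SemiIdenticalCommits {
    // Returns a copy of the batch; each value is changed with probability mutateProb.
    // mutateProb = 0.0 yields an identical commit, higher values a semi-identical one.
    public static double[] replay(double[] values, double mutateProb, Random rnd) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (rnd.nextDouble() < mutateProb) {
                out[i] = out[i] + 1.0; // any change to a non-key value breaks the no-op
            }
        }
        return out;
    }
}
```

A fuzz test built this way would expect the no-op path only for commits replayed with a mutation probability of zero.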