
feat(core): prevent unnecessary partition rewrites when inserting duplicate data #5764

Merged
bluestreak01 merged 42 commits into master from feat-dedup-identical-noop on Jul 1, 2025

@ideoma (Collaborator) commented Jun 19, 2025

Description

Core Problem Solved: When deduplication is enabled on a table and users insert data that is exactly the same as what the table already holds, QuestDB still rewrote entire partitions unnecessarily. This PR adds a "no-op" optimization that skips the partition rewrite when all the data being inserted is identical to the existing data.

Key Changes:

  1. Smart Duplicate Detection: When all the rows in the merge section are duplicates by key, a second check compares the non-key columns; if they match as well, the partition is not modified.

  2. Frame API Extensions: The implementation extends the Partition Frame API to perform the data check, which lets the system compare existing data with incoming data efficiently.

  3. Thread Safety Improvements: The FrameFactory class is made thread-safe so that the checks can run concurrently across multiple partitions.

  4. Performance Boost: Benchmarks show a significant improvement when inserting identical data (the TSBS run posted in this thread loads the same rows in about half the time). The system can now recognize "this is exactly the same" and skip the expensive partition rewrite.
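
The check described in item 1 can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the `Row` type, the `dedupKey` helper, and the map-based lookup are all assumptions made for the example, whereas the actual implementation works on column frames inside `O3PartitionJob`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: none of these types exist in QuestDB.
final class DedupNoopSketch {

    // A row is a designated timestamp, one dedup key column, and non-key values.
    static final class Row {
        final long timestamp;
        final String key;
        final long[] values;

        Row(long timestamp, String key, long... values) {
            this.timestamp = timestamp;
            this.key = key;
            this.values = values;
        }

        // Hypothetical helper: combines the columns that take part in dedup.
        String dedupKey() {
            return timestamp + "|" + key;
        }
    }

    // True when every incoming row already exists with identical non-key
    // values, so committing it would leave the partition unchanged.
    static boolean isNoop(List<Row> partition, List<Row> incoming) {
        Map<String, long[]> existing = new HashMap<>();
        for (Row row : partition) {
            existing.put(row.dedupKey(), row.values);
        }
        for (Row row : incoming) {
            long[] current = existing.get(row.dedupKey());
            if (current == null) {
                return false; // a new row: the partition must change
            }
            if (!Arrays.equals(current, row.values)) {
                return false; // same key, different data: the row is replaced
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Row> partition = Arrays.asList(new Row(1L, "a", 10L), new Row(2L, "b", 20L));
        System.out.println(isNoop(partition, Arrays.asList(new Row(2L, "b", 20L))));
        System.out.println(isNoop(partition, Arrays.asList(new Row(2L, "b", 21L))));
    }
}
```

The two-stage shape matters: the key comparison alone only establishes that the rows are duplicates, and the non-key comparison is what decides whether a rewrite can be skipped.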

Why this matters:

  • Performance: Re-inserting identical data no longer triggers a partition rewrite, so such commits complete much faster.
  • Resource Efficiency: Avoids unnecessary disk I/O and CPU usage for no-op operations.
  • Better User Experience: Applications that retry failed insertions or run duplicate data pipelines no longer cause performance degradation.

This is particularly valuable for time-series workloads where duplicate data insertion can happen due to network retries, application restarts, or data pipeline issues.

Implementation

The implementation extends the Partition Frame API to perform the data check. It also makes the FrameFactory class thread-safe so that the checks can run concurrently across multiple partitions.
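
As a rough illustration of why a thread-safe factory is needed, the sketch below runs one comparison per partition on a worker pool, with each worker borrowing reusable frame objects from a shared pool. The `Frame`, `FramePool`, and `checkPartitions` names are hypothetical and a `ConcurrentLinkedQueue` stands in for the pooling; the real `FrameFactory` API is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: these types are assumptions, not QuestDB classes.
final class ConcurrentFrameCheckSketch {

    // Hypothetical reusable object standing in for a partition frame.
    static final class Frame {
        long[] data;
    }

    // Thread-safe pool: any worker may borrow and release frames concurrently.
    static final class FramePool {
        private final ConcurrentLinkedQueue<Frame> free = new ConcurrentLinkedQueue<>();

        Frame borrow(long[] data) {
            Frame frame = free.poll();
            if (frame == null) {
                frame = new Frame();
            }
            frame.data = data;
            return frame;
        }

        void release(Frame frame) {
            frame.data = null;
            free.offer(frame);
        }
    }

    // Compares existing[i] with incoming[i] for every partition i in parallel.
    static boolean[] checkPartitions(long[][] existing, long[][] incoming, int threads) {
        FramePool pool = new FramePool();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (int i = 0; i < existing.length; i++) {
                final int p = i;
                futures.add(executor.submit(() -> {
                    Frame a = pool.borrow(existing[p]);
                    Frame b = pool.borrow(incoming[p]);
                    try {
                        return Arrays.equals(a.data, b.data);
                    } finally {
                        pool.release(a);
                        pool.release(b);
                    }
                }));
            }
            boolean[] identical = new boolean[futures.size()];
            for (int i = 0; i < identical.length; i++) {
                identical[i] = futures.get(i).get();
            }
            return identical;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

With a pool that is not thread-safe, two workers could receive the same frame instance; the concurrent queue rules that out without a global lock.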

Testing

Along with the new tests introduced in DedupWalWriterTest, the existing tests in DedupInsertFuzzTest are modified to generate identical or semi-identical commits. This is done in two ways:

  • Some random insert commits are duplicated. When this happens, a random portion of the commit is re-inserted. It is equally likely that the same rows are inserted or that 1 or 2 of the rows contain 1 column with a different value.
  • At the end of some tests, a random range of data is re-inserted. The test then verifies that the partitions were not rewritten by checking that the partition name versions in the _txn file are unchanged.
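
The second check amounts to comparing two snapshots of partition versions, one taken before and one after the re-insert. A minimal sketch, assuming the versions have already been read into maps keyed by partition name (the map representation and the `changedPartitions` helper are hypothetical; the tests read these values from the _txn file):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Illustrative only: models partition name versions as plain maps.
final class PartitionVersionSketch {

    // Returns, in sorted order, the partitions whose version differs between
    // the snapshots, including partitions present in only one of them.
    static List<String> changedPartitions(Map<String, Long> before, Map<String, Long> after) {
        TreeSet<String> names = new TreeSet<>(before.keySet());
        names.addAll(after.keySet());
        List<String> changed = new ArrayList<>();
        for (String name : names) {
            if (!Objects.equals(before.get(name), after.get(name))) {
                changed.add(name);
            }
        }
        return changed;
    }
}
```

An empty result after re-inserting identical data indicates that no partition was rewritten.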


@ideoma ideoma requested a review from Copilot June 19, 2025 13:16

@ideoma ideoma requested a review from Copilot June 19, 2025 14:25
Copilot AI left a comment

Pull Request Overview

Optimize the system to avoid rewriting partitions when identical data is re-inserted under dedup settings, while enhancing frame and concurrent queue abstractions.

  • Introduce in-memory frame support alongside file-backed frames and new FrameFactory methods.
  • Refactor ConcurrentQueue/ConcurrentPool APIs with a static factory and pluggable manipulators.
  • Add dedup skip logic in O3PartitionJob/O3OpenColumnJob and related frame-algebra checks.
Comments suppressed due to low confidence (3)

core/src/main/java/io/questdb/cairo/frm/FrameColumnTypePool.java:43

  • The new createFromMemoryColumn method lacks Javadoc—please add a description and parameter docs to explain its purpose and usage.
    FrameColumn createFromMemoryColumn(

core/src/main/java/io/questdb/mp/ConcurrentQueue.java:150

  • [nitpick] Having both tryDequeue(T) returning boolean and tryDequeueValue(T) returning T may confuse callers; consider renaming or consolidating these methods for clarity.
    public T tryDequeueValue(T container) {

core/src/main/java/io/questdb/cairo/frm/file/FrameFactory.java:53

  • The clear() method is documented as not thread safe—consider synchronizing it or annotating with a thread-safety warning to prevent misuse.
    // It is NOT thread safe.

@puzpuzpuz puzpuzpuz added the Core (related to storage, data type, etc.) and Materialized View labels Jun 30, 2025
@puzpuzpuz puzpuzpuz self-requested a review June 30, 2025 12:15
Contributor

@puzpuzpuz puzpuzpuz left a comment

Intermediate feedback

@puzpuzpuz
Contributor

puzpuzpuz commented Jul 1, 2025

Update: fixed after native lib rebuild

testReplaceNoPartitionRewriteDesignatedTimestamp is failing:

expected:
ts	x	v
2022-02-24T00:00:00.000000Z	1	123
2022-02-24T00:55:00.000000Z	2	
2022-02-24T02:00:00.000000Z	3	2345567
actual:
ts	x	v
2022-02-24T00:00:00.000000Z	1	123
2022-02-24T01:00:00.000000Z	2	
2022-02-24T02:00:00.000000Z	3	2345567

GitHub Actions - Rebuild Native Libraries added 2 commits July 1, 2025 07:51
@puzpuzpuz
Contributor

Below are the results of insertion of identical rows on a deduplicated table via TSBS (ILP/TCP).

./tsbs_generate_data --use-case="cpu-only" --seed=123 --scale=4000 --timestamp-start="2016-01-01T00:00:00Z" --timestamp-end="2016-01-03T00:00:00Z" --log-interval="10s" --format="questdb" > /tmp/data

patch:

$ ./tsbs_load_questdb --file /tmp/data --workers 4
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1751363876,18008565.69,1.801000E+08,18008565.69,1800856.57,1.801000E+07,1800856.57
1751363886,16630000.23,3.464000E+08,17319310.42,1663000.02,3.464000E+07,1731931.04
1751363896,16040000.22,5.068000E+08,16892885.01,1604000.02,5.068000E+07,1689288.50
1751363906,16490001.11,6.717000E+08,16792166.04,1649000.11,6.717000E+07,1679216.60

Summary:
loaded 691200000 metrics in 41.177sec with 4 workers (mean rate 16785940.46 metrics/sec)
loaded 69120000 rows in 41.177sec with 4 workers (mean rate 1678594.05 rows/sec)

master:

$ ./tsbs_load_questdb --file /tmp/data --workers 4
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1751363984,12889981.36,1.289000E+08,12889981.36,1288998.14,1.289000E+07,1288998.14
1751363994,12260009.40,2.515000E+08,12574995.73,1226000.94,2.515000E+07,1257499.57
1751364004,5050000.12,3.020000E+08,10066664.47,505000.01,3.020000E+07,1006666.45
1751364014,2889999.93,3.309000E+08,8272498.60,288999.99,3.309000E+07,827249.86
1751364024,2809998.73,3.590000E+08,7179998.38,280999.87,3.590000E+07,717999.84
...

Summary:
loaded 691200000 metrics in 80.276sec with 4 workers (mean rate 8610324.27 metrics/sec)
loaded 69120000 rows in 80.276sec with 4 workers (mean rate 861032.43 rows/sec)
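
For a quick read of the two summaries, the speedup can be computed from the reported wall times and mean row rates. The snippet below only restates the figures printed above; the class and method names are made up for the example.

```java
// Arithmetic on the TSBS summary figures quoted in this comment.
final class DedupBenchmarkSpeedup {

    // Ratio of baseline to patched value; above 1.0 means the patch is faster.
    static double speedup(double baselineSeconds, double patchedSeconds) {
        return baselineSeconds / patchedSeconds;
    }

    public static void main(String[] args) {
        double byTime = speedup(80.276, 41.177);   // master vs patch, seconds
        double byRate = 1678594.05 / 861032.43;    // patch vs master, rows/sec
        System.out.println("speedup by wall time: " + byTime);
        System.out.println("speedup by row rate:  " + byRate);
    }
}
```

Both ratios come out at about 1.95, so the patch loads the identical data set in roughly half the time master needs.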

@puzpuzpuz
Contributor

I also tried inserting rows with different fixed-size column data interchangeably - all seems to be running fine, no-op check kicks in as expected.

@bluestreak01 bluestreak01 changed the title from "feat(core): optimize to not rewrite partitions when the same data is re-inserted with dedup" to "feat(core): Prevent unnecessary partition rewrites when inserting duplicate data" Jul 1, 2025
@bluestreak01 bluestreak01 changed the title from "feat(core): Prevent unnecessary partition rewrites when inserting duplicate data" to "feat(core): prevent unnecessary partition rewrites when inserting duplicate data" Jul 1, 2025
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 334 / 383 (87.21%)

file detail

| path | covered lines | new lines | coverage |
|---|---|---|---|
| io/questdb/cairo/frm/DeletedFrameColumn.java | 0 | 3 | 00.00% |
| io/questdb/cairo/O3OpenColumnJob.java | 1 | 2 | 50.00% |
| io/questdb/cairo/frm/file/MemoryVarFrameColumn.java | 21 | 31 | 67.74% |
| io/questdb/cairo/frm/file/MemoryFixFrameColumn.java | 19 | 28 | 67.86% |
| io/questdb/cairo/frm/file/ContiguousFileFixFrameColumn.java | 20 | 24 | 83.33% |
| io/questdb/cairo/O3PartitionJob.java | 79 | 92 | 85.87% |
| io/questdb/cairo/frm/file/ContiguousFileVarFrameColumn.java | 29 | 33 | 87.88% |
| io/questdb/cairo/frm/file/FrameImpl.java | 54 | 58 | 93.10% |
| io/questdb/cairo/frm/FrameAlgebra.java | 25 | 26 | 96.15% |
| io/questdb/cairo/TableUtils.java | 4 | 4 | 100.00% |
| io/questdb/std/IntHashSet.java | 3 | 3 | 100.00% |
| io/questdb/cairo/CairoEngine.java | 7 | 7 | 100.00% |
| io/questdb/cairo/frm/file/FrameFactory.java | 16 | 16 | 100.00% |
| io/questdb/cairo/TableWriter.java | 28 | 28 | 100.00% |
| io/questdb/cairo/frm/file/ContiguousFileColumnPool.java | 28 | 28 | 100.00% |

@bluestreak01 bluestreak01 merged commit 633a2d2 into master Jul 1, 2025
38 of 39 checks passed
@bluestreak01 bluestreak01 deleted the feat-dedup-identical-noop branch July 1, 2025 17:01