feat(core): prevent unnecessary partition rewrites when inserting duplicate data by ideoma · Pull Request #5764 · questdb/questdb

ideoma · 2025-06-19T12:27:57Z

Description

Core Problem Solved: When deduplication is enabled on a table and users insert data that is exactly the same as what is already stored, QuestDB still rewrote entire partitions unnecessarily. This PR adds a "no-op" optimization that skips the partition rewrite when all the data being inserted is identical to existing data.

Key Changes:

Smart Duplicate Detection: When all the rows in the merge section are duplicates, the non-key columns are compared as well; if they also match, the partition is not modified.
Frame API Extensions: The implementation extends the Partition Frame API to perform the data check, which lets the system compare existing data with incoming data efficiently.
Thread Safety Improvements: The FrameFactory class is now thread-safe so that the checks can run concurrently across multiple partitions.
Performance Boost: In the TSBS benchmark posted in this thread (identical rows re-inserted into a deduplicated table), load time dropped from 80.276 s on master to 41.177 s with the patch, because the system can now recognize "this is exactly the same" and skip the expensive partition rewrite.
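For readers who want the shape of the check, here is a minimal sketch of the two-step decision described in the list above. It is not the code from this PR: the class, the method names, and the column layout (a designated timestamp, one int key column, and long non-key columns) are all hypothetical, and it assumes the stored rows have already been narrowed to the merge section that lines up one-to-one with the incoming rows.

```java
import java.util.Arrays;

// Hypothetical sketch: none of these names exist in QuestDB.
final class NoOpDedupSketch {

    // Step 1: every incoming row must be a duplicate of a stored row on the
    // dedup keys (here: the designated timestamp plus one int key column).
    static boolean allRowsAreDuplicates(long[] storedTs, int[] storedKey,
                                        long[] incomingTs, int[] incomingKey) {
        return Arrays.equals(storedTs, incomingTs) && Arrays.equals(storedKey, incomingKey);
    }

    // Step 2: the non-key columns must hold the same values, column by column.
    static boolean nonKeyColumnsMatch(long[][] storedCols, long[][] incomingCols) {
        return Arrays.deepEquals(storedCols, incomingCols);
    }

    // The partition rewrite can be skipped only when both steps succeed.
    static boolean canSkipRewrite(long[] storedTs, int[] storedKey, long[][] storedCols,
                                  long[] incomingTs, int[] incomingKey, long[][] incomingCols) {
        return allRowsAreDuplicates(storedTs, storedKey, incomingTs, incomingKey)
                && nonKeyColumnsMatch(storedCols, incomingCols);
    }

    public static void main(String[] args) {
        long[] ts = {1L, 2L};
        int[] key = {10, 20};
        long[][] stored = {{5L, 6L}};
        long[][] changed = {{5L, 7L}};
        // Identical re-insert: prints true
        System.out.println(canSkipRewrite(ts, key, stored, ts.clone(), key.clone(), new long[][]{{5L, 6L}}));
        // One non-key value differs: prints false
        System.out.println(canSkipRewrite(ts, key, stored, ts.clone(), key.clone(), changed));
    }
}
```

The ordering matters in this sketch: the cheaper key comparison runs first, so non-key columns are only read when a no-op is still possible.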

Why this matters:

Performance: Massive speed improvements when applications accidentally re-insert the same data
Resource Efficiency: Avoids unnecessary disk I/O and CPU usage for no-op operations
Better User Experience: Applications that retry failed insertions or have duplicate data pipelines won't cause performance degradation

This is particularly valuable for time-series workloads where duplicate data insertion can happen due to network retries, application restarts, or data pipeline issues.

Implementation

The implementation extends the Partition Frame API to perform the data check. It also makes the FrameFactory class thread-safe so that the checks can run concurrently across multiple partitions.
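To illustrate why a factory shared across partition jobs needs to be thread-safe, here is a rough sketch of a lock-free object pool built on the JDK's ConcurrentLinkedQueue. This is an assumption-laden illustration, not the FrameFactory or ConcurrentPool code from this PR; the class name and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Hypothetical sketch of a thread-safe pool; not QuestDB's ConcurrentPool.
final class ConcurrentPoolSketch<T> {
    // Lock-free queue of idle instances that any worker thread may take from.
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    ConcurrentPoolSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse an idle instance if one exists, otherwise create a new one.
    T acquire() {
        T pooled = idle.poll();
        return pooled != null ? pooled : factory.get();
    }

    // Hand the instance back so another thread can reuse it.
    void release(T instance) {
        idle.offer(instance);
    }
}
```

With a structure like this, each partition job can acquire a reusable object, run its comparison, and release it without any shared mutable state beyond the queue itself.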

Testing

Along with the new tests introduced in DedupWalWriterTest, the existing tests in DedupInsertFuzzTest are modified to generate identical or semi-identical commits. It is done in two ways:

Some random insert commits are duplicated. When this happens, a random portion of the commit is re-inserted. It is equally likely that the same rows are inserted or that 1 or 2 rows contain 1 column with a different value.
At the end of some tests, a random range of data is re-inserted. The test then verifies that the partitions were not rewritten by checking that the partition name versions in the _txn file are unchanged.
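The first approach can be sketched roughly as follows. This is not the DedupInsertFuzzTest code: the class and method names are hypothetical, rows are modelled as plain long arrays, and the sketch takes one possible reading of the probabilities (zero, one, or two altered rows, each equally likely). It also assumes every row has at least one column.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of generating an identical or semi-identical commit.
final class SemiIdenticalCommitSketch {

    // Returns a copy of rows in which 0, 1, or 2 rows have a single column changed.
    static long[][] reinsert(long[][] rows, Random rnd) {
        long[][] copy = new long[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            copy[i] = rows[i].clone();
        }
        // 0 means an exact duplicate; never change more rows than exist.
        int rowsToChange = Math.min(rows.length, rnd.nextInt(3));
        if (rowsToChange == 0) {
            return copy;
        }
        int first = rnd.nextInt(rows.length);
        for (int k = 0; k < rowsToChange; k++) {
            int row = (first + k) % rows.length; // distinct rows
            int col = rnd.nextInt(copy[row].length);
            copy[row][col]++; // guarantees a different value in exactly one column
        }
        return copy;
    }

    // Counts rows that differ between the original and the re-inserted commit.
    static int changedRows(long[][] original, long[][] reinserted) {
        int changed = 0;
        for (int i = 0; i < original.length; i++) {
            if (!Arrays.equals(original[i], reinserted[i])) {
                changed++;
            }
        }
        return changed;
    }
}
```

A fuzz test built this way exercises both branches of the optimization: exact duplicates should leave partitions untouched, while a single changed value should force the normal rewrite path.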

CLAassistant · 2025-06-19T12:50:32Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ ideoma
✅ puzpuzpuz
❌ GitHub Actions - Rebuild Native Libraries

GitHub Actions - Rebuild Native Libraries seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

Copilot

Pull Request Overview

Optimize the system to avoid rewriting partitions when identical data is re-inserted under dedup settings, while enhancing frame and concurrent queue abstractions.

Introduce in-memory frame support alongside file-backed frames and new FrameFactory methods.
Refactor ConcurrentQueue/ConcurrentPool APIs with a static factory and pluggable manipulators.
Add dedup skip logic in O3PartitionJob/O3OpenColumnJob and related frame-algebra checks.

Comments suppressed due to low confidence (3)

core/src/main/java/io/questdb/cairo/frm/FrameColumnTypePool.java:43

The new createFromMemoryColumn method lacks Javadoc—please add a description and parameter docs to explain its purpose and usage.

    FrameColumn createFromMemoryColumn(

core/src/main/java/io/questdb/mp/ConcurrentQueue.java:150

[nitpick] Having both tryDequeue(T) returning boolean and tryDequeueValue(T) returning T may confuse callers; consider renaming or consolidating these methods for clarity.

    public T tryDequeueValue(T container) {

core/src/main/java/io/questdb/cairo/frm/file/FrameFactory.java:53

The clear() method is documented as not thread safe—consider synchronizing it or annotating with a thread-safety warning to prevent misuse.

    // It is NOT thread safe.

core/src/main/java/io/questdb/mp/ConcurrentQueueSegment.java

puzpuzpuz

Intermediate feedback

core/src/main/java/io/questdb/cairo/O3OpenColumnJob.java

core/src/main/java/io/questdb/cairo/frm/FrameAlgebra.java

core/src/main/java/io/questdb/cairo/frm/file/ContiguousFileVarFrameColumn.java

core/src/main/java/io/questdb/cairo/frm/file/FrameFactory.java

core/src/main/java/io/questdb/cairo/frm/file/FrameImpl.java

core/src/main/java/io/questdb/cairo/frm/file/MemoryFixFrameColumn.java

core/src/main/java/io/questdb/cairo/frm/FrameColumn.java

puzpuzpuz · 2025-07-01T07:24:58Z

Update: fixed after native lib rebuild

testReplaceNoPartitionRewriteDesignatedTimestamp is failing:

expected:
ts	x	v
2022-02-24T00:00:00.000000Z	1	123
2022-02-24T00:55:00.000000Z	2	
2022-02-24T02:00:00.000000Z	3	2345567
actual:
ts	x	v
2022-02-24T00:00:00.000000Z	1	123
2022-02-24T01:00:00.000000Z	2	
2022-02-24T02:00:00.000000Z	3	2345567

core/src/main/java/io/questdb/cairo/O3PartitionJob.java

core/src/main/java/io/questdb/cairo/TableWriter.java

core/src/main/java/io/questdb/cairo/frm/FrameAlgebra.java

core/src/main/c/share/dedup.cpp

core/src/main/c/share/column_type.h

core/src/main/java/io/questdb/cairo/TableWriter.java

core/src/main/java/io/questdb/cairo/O3PartitionJob.java

core/src/test/java/io/questdb/test/cairo/mv/MatViewIdenticalReplaceTest.java

core/src/test/java/io/questdb/test/cairo/wal/DedupWalWriterTest.java

puzpuzpuz · 2025-07-01T10:04:11Z

Below are the results of insertion of identical rows on a deduplicated table via TSBS (ILP/TCP).

./tsbs_generate_data --use-case="cpu-only" --seed=123 --scale=4000 --timestamp-start="2016-01-01T00:00:00Z" --timestamp-end="2016-01-03T00:00:00Z" --log-interval="10s" --format="questdb" > /tmp/data

patch:

$ ./tsbs_load_questdb --file /tmp/data --workers 4
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1751363876,18008565.69,1.801000E+08,18008565.69,1800856.57,1.801000E+07,1800856.57
1751363886,16630000.23,3.464000E+08,17319310.42,1663000.02,3.464000E+07,1731931.04
1751363896,16040000.22,5.068000E+08,16892885.01,1604000.02,5.068000E+07,1689288.50
1751363906,16490001.11,6.717000E+08,16792166.04,1649000.11,6.717000E+07,1679216.60

Summary:
loaded 691200000 metrics in 41.177sec with 4 workers (mean rate 16785940.46 metrics/sec)
loaded 69120000 rows in 41.177sec with 4 workers (mean rate 1678594.05 rows/sec)

master:

$ ./tsbs_load_questdb --file /tmp/data --workers 4
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1751363984,12889981.36,1.289000E+08,12889981.36,1288998.14,1.289000E+07,1288998.14
1751363994,12260009.40,2.515000E+08,12574995.73,1226000.94,2.515000E+07,1257499.57
1751364004,5050000.12,3.020000E+08,10066664.47,505000.01,3.020000E+07,1006666.45
1751364014,2889999.93,3.309000E+08,8272498.60,288999.99,3.309000E+07,827249.86
1751364024,2809998.73,3.590000E+08,7179998.38,280999.87,3.590000E+07,717999.84
...

Summary:
loaded 691200000 metrics in 80.276sec with 4 workers (mean rate 8610324.27 metrics/sec)
loaded 69120000 rows in 80.276sec with 4 workers (mean rate 861032.43 rows/sec)
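For a quick comparison of the two summaries above, the ratio of load times can be computed directly. The snippet below is only a convenience calculation over the figures already posted (80.276 s on master, 41.177 s with the patch); the class name is made up. The result is roughly 1.95x.

```java
import java.util.Locale;

// Convenience calculation over the TSBS summary figures quoted above.
final class TsbsSpeedupSketch {

    // Ratio of baseline load time to patched load time.
    static double speedup(double baselineSeconds, double patchedSeconds) {
        return baselineSeconds / patchedSeconds;
    }

    public static void main(String[] args) {
        double ratio = speedup(80.276, 41.177);
        System.out.println(String.format(Locale.ROOT, "speedup: %.2fx", ratio));
    }
}
```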

puzpuzpuz · 2025-07-01T10:39:14Z

I also tried inserting rows with different fixed-size column data interchangeably - all seems to be running fine, no-op check kicks in as expected.

core/src/main/java/io/questdb/cairo/TableWriter.java

core/src/main/java/io/questdb/cairo/O3PartitionJob.java

glasstiger · 2025-07-01T15:58:52Z

[PR Coverage check]

😍 pass : 334 / 383 (87.21%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/cairo/frm/DeletedFrameColumn.java	0	3	00.00%
🔵	io/questdb/cairo/O3OpenColumnJob.java	1	2	50.00%
🔵	io/questdb/cairo/frm/file/MemoryVarFrameColumn.java	21	31	67.74%
🔵	io/questdb/cairo/frm/file/MemoryFixFrameColumn.java	19	28	67.86%
🔵	io/questdb/cairo/frm/file/ContiguousFileFixFrameColumn.java	20	24	83.33%
🔵	io/questdb/cairo/O3PartitionJob.java	79	92	85.87%
🔵	io/questdb/cairo/frm/file/ContiguousFileVarFrameColumn.java	29	33	87.88%
🔵	io/questdb/cairo/frm/file/FrameImpl.java	54	58	93.10%
🔵	io/questdb/cairo/frm/FrameAlgebra.java	25	26	96.15%
🔵	io/questdb/cairo/TableUtils.java	4	4	100.00%
🔵	io/questdb/std/IntHashSet.java	3	3	100.00%
🔵	io/questdb/cairo/CairoEngine.java	7	7	100.00%
🔵	io/questdb/cairo/frm/file/FrameFactory.java	16	16	100.00%
🔵	io/questdb/cairo/TableWriter.java	28	28	100.00%
🔵	io/questdb/cairo/frm/file/ContiguousFileColumnPool.java	28	28	100.00%

ideoma and others added 21 commits June 9, 2025 19:26

frame refactoring

a7113e6

fixed colum dedup comparision

4a6acaa

var column comparison WIP

312d1f8

support dedup identical check with arrays

1eed1d5

more comparison proc fixes

a920b5d

fix merge

24bcc79

fix cases when O3 results in noop merge + suffix

48b52a0

cleanup

39f5c89

core(core): concurrent pool implementation

14b18ab

code formatting

7c9c5dc

Merge remote-tracking branch 'origin/chore-concurrent-pool' into feat-dedup-identical-noop Conflicts: core/src/main/java/io/questdb/cairo/TableWriter.java

69681e2

fix merge

8d84ede

make FrameFactory thread safe

1d5a808

make FrameFactory thread safe

2616aaa

cleanup

0600dfb

cleanup, shuffle dups in tests

8196bbf

Merge branch 'master' into feat-dedup-identical-noop

d0508c7

fix cxx compilation on windows

1b812fc

fix cxx compilation on windows

ede87a7

Rebuild CXX libraries

c37ce5b

fix dedup optimisation that does append only on matched timestamps with no dups

a22094f

ideoma requested a review from Copilot June 19, 2025 13:16

ideoma requested a review from Copilot June 19, 2025 14:25

Copilot AI reviewed Jun 19, 2025

core/src/main/java/io/questdb/mp/ConcurrentQueueSegment.java (outdated)

ideoma added 3 commits June 19, 2025 15:46

fix bugs found by tests

bcfa6b1

fix bugs found by tests

89bea71

fix physical row count when no rewrite happens, delete empty partition dirs

2e6962b

puzpuzpuz added the Core and Materialized View labels Jun 30, 2025

puzpuzpuz self-requested a review June 30, 2025 12:15

puzpuzpuz reviewed Jun 30, 2025

ideoma and others added 4 commits June 30, 2025 18:04

address review comments

4f07fa2

address review comments

397537b

rewrite ts comparison in C

dfb6072

Merge branch 'master' into feat-dedup-identical-noop

bcc72fe

GitHub Actions - Rebuild Native Libraries added 2 commits July 1, 2025 07:51

Rebuild CXX libraries

5ccfa31

Rebuild CXX libraries

f5fab90

puzpuzpuz reviewed Jul 1, 2025

Merge branch 'master' into feat-dedup-identical-noop

4a1197e

puzpuzpuz reviewed Jul 1, 2025

bluestreak01 changed the title ~~feat(core): optimize to not rewrite partitions when the same data is re-inserted with dedup~~ feat(core): Prevent unnecessary partition rewrites when inserting duplicate data Jul 1, 2025

bluestreak01 changed the title ~~feat(core): Prevent unnecessary partition rewrites when inserting duplicate data~~ feat(core): prevent unnecessary partition rewrites when inserting duplicate data Jul 1, 2025

ideoma added 2 commits July 1, 2025 14:58

addressing review feedback

274cdfe

better comments

65ec24a

puzpuzpuz reviewed Jul 1, 2025

core/src/main/java/io/questdb/cairo/TableWriter.java (outdated)

Rebuild CXX libraries

f93f725

puzpuzpuz reviewed Jul 1, 2025

core/src/main/java/io/questdb/cairo/O3PartitionJob.java (outdated)

more tests, small fixes

0b1f330

puzpuzpuz approved these changes Jul 1, 2025

bluestreak01 merged commit 633a2d2 into master Jul 1, 2025
38 of 39 checks passed

bluestreak01 deleted the feat-dedup-identical-noop branch July 1, 2025 17:01

This was referenced Nov 7, 2025

fix(core): fix critical storage corruption on deduplicate write resulting to same data #6359

Merged

fix(core): fix potential storage corruption on deduplicate write resulting to same data #6360

Merged

6 participants
