fix(core): fix degradation into 1 by 1 wal transaction application when queue is big #5725

Merged: bluestreak01 merged 6 commits into master from fix-1-by-1-wal-apply, Jun 12, 2025
Conversation

@ideoma (Collaborator) commented Jun 6, 2025

Pull Request Overview

This PR fixes an issue where WAL application degrades into 1-by-1 commits when the queue is large. It corrects the back-fill logic in WalTxnDetails, adds test coverage for the incremental scenario, and introduces a new getCommitLatency configuration property.

  • Adjusted the transaction metadata loading and back-fill loops in WalTxnDetails to handle newly loaded entries correctly.
  • Added testLastCommitToTimestampIncremental in WalTxnDetailsFuzzTest to cover the incremental commit-to-timestamp behavior.
  • Introduced getCommitLatency in the configuration interfaces and provided implementations in the default and wrapper classes.

This manifests in the logs as something like:

I i.q.c.TableWriter processing WAL [path=/abc~14/wal1401/33279, roLo=0, roHi=20000, seqTxn=5530295, tsMin=2025-05-28T02:12:00.001269Z, tsMax=2025-05-28T02:12:59.995842Z, commitToTs=]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1357/2002641, roLo=0, roHi=20000, seqTxn=5530296, tsMin=2025-05-28T02:12:00.001068Z, tsMax=2025-05-28T02:12:59.924071Z, commitToTs=]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1358/1700854, roLo=0, roHi=20000, seqTxn=5530297, tsMin=2025-05-28T02:12:00.000871Z, tsMax=2025-05-28T02:12:59.984037Z, commitToTs=]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1401/33280, roLo=0, roHi=20000, seqTxn=5530298, tsMin=2025-05-28T02:12:00.001103Z, tsMax=2025-05-28T02:12:59.995259Z, commitToTs=]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1357/2002642, roLo=0, roHi=20000, seqTxn=5530299, tsMin=2025-05-28T02:12:00.008045Z, tsMax=2025-05-28T02:12:59.993879Z, commitToTs=294247-01-10T04:00:54.775807Z]
I i.q.c.TableWriter WAL dedup sorted commit index [table=abc~14, totalRows=1043617, lagRows=1023617]
I i.q.c.TableWriter o3 partition task [table=market_data_snapshot, partitionTs=2025-05-28T02:00:00.000000Z, partitionIndex=228, last=false, append=false, ro=false, srcOooLo=0, srcOooHi=1043616, srcOooMax=1043617, o3RowCount=1043617, o3LagRowCount=0, srcDataMax=42776005, o3Ts=2025-05-28T02:11:00.000729Z, newSize=43819622, maxTs=2025-05-29T16:30:59.999972Z, pCount=1, flattenTs=true, memUsed=4.918 GiB, rssMemUsed=1.368 GiB]
I i.q.c.TableWriter merged partition [table=`market_data_snapshot`, ts=2025-05-28T02:00:00.000000Z, txn=4818357]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1358/1700855, roLo=0, roHi=20000, seqTxn=5530300, tsMin=2025-05-28T02:12:00.008045Z, tsMax=2025-05-28T02:12:59.987112Z, commitToTs=294247-01-10T04:00:54.775807Z]
I i.q.c.TableWriter WAL dedup sorted commit index [table=abc~14, totalRows=20000, lagRows=0]
I i.q.c.TableWriter o3 partition task [table=market_data_snapshot, partitionTs=2025-05-28T02:00:00.000000Z, partitionIndex=228, last=false, append=false, ro=false, srcOooLo=0, srcOooHi=19999, srcOooMax=20000, o3RowCount=20000, o3LagRowCount=0, srcDataMax=42776005, o3Ts=2025-05-28T02:12:00.008045Z, newSize=42796005, maxTs=2025-05-29T16:30:59.999972Z, pCount=1, flattenTs=true, memUsed=4.154 GiB, rssMemUsed=619.076 MiB]
I i.q.c.TableWriter merged partition [table=`market_data_snapshot`, ts=2025-05-28T02:00:00.000000Z, txn=4818359]
I i.q.c.TableWriter processing WAL [path=/abc~14/wal1401/33281, roLo=0, roHi=20000, seqTxn=5530301, tsMin=2025-05-28T02:12:00.008945Z, tsMax=2025-05-28T02:12:59.991588Z, commitToTs=294247-01-10T04:00:54.775807Z]

From some point on, commitToTs is set to 294247-01-10T04:00:54.775807Z, i.e. Long.MAX_VALUE interpreted as microseconds since the epoch, which means a full commit is performed after every single transaction.
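That sentinel timestamp is easy to verify: it is simply Long.MAX_VALUE in microseconds. A quick, self-contained check in plain Java (this is not QuestDB code, just arithmetic on the value seen in the logs):

```java
import java.time.Instant;

public class MaxMicrosTimestamp {
    public static void main(String[] args) {
        // QuestDB designated timestamps are microseconds since the epoch
        long micros = Long.MAX_VALUE;
        // Split into whole seconds and a nanosecond remainder for java.time
        Instant t = Instant.ofEpochSecond(micros / 1_000_000L, (micros % 1_000_000L) * 1_000L);
        // Matches the commitToTs value from the log lines above
        System.out.println(t); // +294247-01-10T04:00:54.775807Z
    }
}
```

So once commitToTimestamp is back-filled with the MAX_VALUE sentinel for every queued transaction, each one is treated as "commit everything up to the end of time now", hence the 1-by-1 behavior.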

@ideoma ideoma requested a review from Copilot June 6, 2025 09:48
Copilot AI left a comment

Pull Request Overview

This PR fixes an issue where WAL application degrades into 1-by-1 commits when the queue is large. It corrects the back-fill logic in WalTxnDetails, adds test coverage for the incremental scenario, and introduces a new getCommitLatency configuration property.

  • Adjusted the transaction metadata loading and back-fill loops in WalTxnDetails to handle newly loaded entries correctly.
  • Added testLastCommitToTimestampIncremental in WalTxnDetailsFuzzTest to cover the incremental commit-to-timestamp behavior.
  • Introduced getCommitLatency in the configuration interfaces and provided implementations in default and wrapper classes.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

File | Description
core/src/test/java/io/questdb/test/cairo/wal/WalTxnDetailsFuzzTest.java | Added new incremental commit-to-timestamp test and updated package/import
core/src/test/java/io/questdb/test/cairo/wal/TableTransactionLogFuzzTest.java | Updated test package path
core/src/main/java/io/questdb/cairo/wal/WalTxnDetails.java | Refined metadata read loops, added early exit for no-load case
core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java | Added getCommitLatency() override with default value
core/src/main/java/io/questdb/cairo/CairoConfigurationWrapper.java | Delegated new getCommitLatency() method
core/src/main/java/io/questdb/cairo/CairoConfiguration.java | Made getCommitLatency() abstract (removed default implementation)
Comments suppressed due to low confidence (4)

core/src/main/java/io/questdb/cairo/CairoConfiguration.java:145

  • Changing getCommitLatency to an abstract method is a breaking API change that impacts any custom CairoConfiguration implementations. Consider providing a default implementation or bumping the major version to avoid regressions.
long getCommitLatency();
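The backward-compatible alternative the comment suggests would be a Java interface default method. A minimal sketch of the idea (the interface name, method name, and latency value here are illustrative, not QuestDB's actual API):

```java
// Hypothetical sketch: keeping the new accessor as a default method means
// existing custom implementations of the interface compile unchanged.
interface CommitConfig {
    // Illustrative default; the unit (microseconds here) should be documented,
    // as the next review comment points out.
    default long getCommitLatencyMicros() {
        return 30_000_000L; // 30 seconds, purely illustrative
    }
}

public class DefaultMethodSketch {
    public static void main(String[] args) {
        // An "old" implementation written before the method existed still works:
        CommitConfig legacy = new CommitConfig() {};
        System.out.println(legacy.getCommitLatencyMicros()); // prints 30000000
    }
}
```

The trade-off is that a silently inherited default can mask configurations that should have opted in explicitly, which is presumably why this PR made the method abstract instead.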

core/src/main/java/io/questdb/cairo/CairoConfiguration.java:145

  • The getCommitLatency method lacks JavaDoc specifying the time unit (e.g., nanoseconds). Adding documentation will help callers understand what the returned value represents.
long getCommitLatency();

core/src/main/java/io/questdb/cairo/wal/WalTxnDetails.java:516

  • Returning early when no transactions are loaded prevents the following loop from resetting commitToTimestamp to Long.MIN_VALUE for existing entries, potentially leaving stale values. Verify that skipping the second loop is intended.
if (transactionMeta.size() == initialSize) {

core/src/main/java/io/questdb/cairo/wal/WalTxnDetails.java:463

  • The condition was relaxed from < loadFromSeqTxn to <= loadFromSeqTxn, which may introduce an off-by-one shift in transaction indexing. Please add or update tests for the scenario where startSeqTxn == loadFromSeqTxn to confirm correct behavior.
if (lastLoadedSeqTxn >= loadFromSeqTxn && startSeqTxn <= loadFromSeqTxn) {

@puzpuzpuz added the Core (Related to storage, data type, etc.) and Performance (Performance improvements) labels, and added then removed the Bug (Incorrect or unexpected behavior) label, Jun 6, 2025
@bluestreak01 (Member) commented:

we have segfault galore on this PR and master (not sure if they are the same)

@bluestreak01 bluestreak01 added DO NOT MERGE These changes should not be merged to main branch and removed DO NOT MERGE These changes should not be merged to main branch labels Jun 10, 2025
@glasstiger (Contributor) commented:

[PR Coverage check]

😍 pass : 7 / 8 (87.50%)

Path | Covered lines | New lines | Coverage
io/questdb/cairo/DefaultCairoConfiguration.java | 0 | 1 | 0.00%
io/questdb/cairo/wal/WalTxnDetails.java | 6 | 6 | 100.00%
io/questdb/cairo/CairoConfigurationWrapper.java | 1 | 1 | 100.00%

@bluestreak01 bluestreak01 merged commit 4d2f4df into master Jun 12, 2025
37 checks passed
@bluestreak01 bluestreak01 deleted the fix-1-by-1-wal-apply branch June 12, 2025 14:29

Labels

Core Related to storage, data type, etc. Performance Performance improvements


5 participants