
Conversation

@noob-se7en (Contributor) commented Jun 11, 2025

Problem Statement:
In pauseless + dedup, during a disaster scenario (i.e. a segment is ONLINE in the IS but has no completed (immutable) replica on any server), re-ingestion is disabled by default in the RVM. It is disabled because dedup requires ingestion to happen in strict order (i.e. a segment ingested in the past cannot be re-ingested once a server has consumed the segments following it).
So in the above disaster scenario, the only fix is a manual bulk delete of segments, which is operationally heavy and time consuming.

PR:
There are use cases where the dedup constraint can be relaxed until the disaster/data loss is recovered (the priority is to recover from the disaster as quickly as possible, letting go of dedup metadata correctness for the time being).
This PR adds a new enum config, DisasterRecoveryMode, to StreamIngestionConfig:

public enum DisasterRecoveryMode {
  // ALWAYS means Pinot will always run the Disaster Recovery Job
  ALWAYS,
  // DEFAULT means Pinot will skip the Disaster Recovery Job for tables like dedup/upsert where consistency
  // of data is higher in priority than availability.
  DEFAULT
}
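To make the intended decision concrete, here is a minimal, self-contained sketch of how a validation job could consult the mode. Only DisasterRecoveryMode and its ALWAYS/DEFAULT values come from this PR; the class name, allowRepair helper, and its parameters are illustrative, not Pinot APIs:

```java
// Hypothetical sketch; not the actual Pinot implementation.
public class DisasterRecoverySketch {

  public enum DisasterRecoveryMode {
    // ALWAYS means Pinot will always run the Disaster Recovery Job.
    ALWAYS,
    // DEFAULT means Pinot will skip the Disaster Recovery Job for tables like
    // dedup/upsert where consistency of data is higher in priority than availability.
    DEFAULT
  }

  /** Decide whether re-ingestion repair is allowed for a table. */
  public static boolean allowRepair(DisasterRecoveryMode mode, boolean dedupEnabled,
      boolean partialUpsertEnabled) {
    if (mode == DisasterRecoveryMode.ALWAYS) {
      // Operator opted in: relax the dedup/upsert ordering constraint to recover quickly.
      return true;
    }
    // DEFAULT: keep the existing behavior and skip repair for dedup and
    // partial-upsert tables.
    return !dedupEnabled && !partialUpsertEnabled;
  }

  public static void main(String[] args) {
    System.out.println(allowRepair(DisasterRecoveryMode.DEFAULT, true, false));  // false
    System.out.println(allowRepair(DisasterRecoveryMode.ALWAYS, true, false));   // true
  }
}
```

Under DEFAULT, a dedup or partial-upsert table is never repaired; ALWAYS trades dedup metadata correctness for availability during the outage.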

@codecov-commenter commented Jun 11, 2025

Codecov Report

Attention: Patch coverage is 28.57143% with 25 lines in your changes missing coverage. Please review.

Project coverage is 63.21%. Comparing base (1a476de) to head (4b804df).
Report is 371 commits behind head on master.

Files with missing lines Patch % Lines
.../core/realtime/PinotLLCRealtimeSegmentManager.java 0.00% 21 Missing ⚠️
...a/manager/realtime/RealtimeSegmentDataManager.java 42.85% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16071      +/-   ##
============================================
+ Coverage     62.90%   63.21%   +0.30%     
+ Complexity     1386     1363      -23     
============================================
  Files          2867     2960      +93     
  Lines        163354   170879    +7525     
  Branches      24952    26154    +1202     
============================================
+ Hits         102755   108015    +5260     
- Misses        52847    54685    +1838     
- Partials       7752     8179     +427     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.15% <28.57%> (+0.28%) ⬆️
java-21 63.17% <28.57%> (+0.34%) ⬆️
skip-bytebuffers-false ?
skip-bytebuffers-true ?
temurin 63.21% <28.57%> (+0.30%) ⬆️
unittests 63.20% <28.57%> (+0.30%) ⬆️
unittests1 64.74% <71.42%> (+8.92%) ⬆️
unittests2 33.44% <11.42%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

@noob-se7en noob-se7en marked this pull request as ready for review June 12, 2025 20:10
} else if (_currentOffset.compareTo(endOffset) == 0) {
_segmentLogger
.info("Current offset {} matches offset in zk {}. Replacing segment", _currentOffset, endOffset);
boolean replaced = buildSegmentAndReplace();
@noob-se7en (author) commented:

These changes are required because re-ingestion happens in case of a segment build failure. Say the server successfully uploads the segment during re-ingestion; when the OFFLINE -> ONLINE transition is then triggered during a segment reset, the build can still fail on servers.

@noob-se7en (author) commented Jun 12, 2025:

These changes to this file could also go into a separate PR.

Contributor:

I see two scenarios:

  1. Issue with the segment, e.g. incorrect data:
    • In this case the re-ingestion will fail as well.
    • Deleting the segment and switching to the latest offset will be the way forward.
  2. Issue on the server, e.g. out of disk/memory:
    • In this case both operations, i.e. build and downloadSegmentAndReplace, might end up failing.

Also, have we come across this scenario in the past?

cc: @Jackie-Jiang

@noob-se7en (author) commented:

It happened in the integration test due to failure injection during commit: RealtimeSegmentDataManager failed to build, but StatelessRealtimeSegmentDataManager succeeded. In this case the segment is present in the deep store, but the server is unable to load it and move forward.

@Jackie-Jiang Jackie-Jiang added feature ingestion real-time dedup changes related to realtime ingestion dedup handling labels Jun 12, 2025
@Jackie-Jiang (Contributor) left a comment:
The reason will be displayed to describe this comment to others. Learn more.

cc @KKcorps for review

* Class representing configurations related to segment assignment strategy.
* @deprecated Use {@link org.apache.pinot.spi.config.table.assignment.InstanceAssignmentConfig} instead.
*/
@Deprecated
Contributor:

We cannot fully deprecate this config yet because partitionColumn can only be configured here when we are using instancePartitionsMap. I'd suggest not deprecating it in this PR (as it is a different scope) and addressing it in a separate PR


public enum DisasterRecoveryMode {
BEST_EFFORT
// TODO: Add support for strict recovery mode.
Contributor:

I suggest adding STRICT in the same PR, as the default mode when not configured.

@noob-se7en (author) commented:

Default should be null IMO. STRICT is very risky, as segments will be deleted in bulk based on the configured time interval, and it might cause data loss.

Contributor:

What I meant is to introduce an enum value for the default mode. Using null to represent the default is a bit of an anti-pattern.

Contributor:

By default we mean not doing anything, i.e. skipping re-ingestion for dedup and partial upsert.
Are you planning to make re-ingesting missing segments the default behavior?

@noob-se7en (author) commented:

The default mode will be DisasterRecoveryMode.NONE. In this mode we skip repair for upsert/dedup tables.

@noob-se7en (author) commented:

Actually, NONE will be confusing since this config is now a stream ingestion config instead of a dedup/upsert config.
Let me think of something else.

@noob-se7en (author) commented Jun 30, 2025:

Hmm, I will change the definition of DisasterRecoveryMode.STRICT: it will mean skipping upsert/dedup tables.

@noob-se7en (author) commented:

I'm bad with naming; let me know if there's a better name:

public enum DisasterRecoveryMode {
  // ALWAYS means Pinot will always run the Disaster Recovery Job
  ALWAYS,
  // CONSISTENCY_FIRST means Pinot will skip the Disaster Recovery Job for tables like dedup/upsert where consistency of data is
  // higher in priority than availability.
  CONSISTENCY_FIRST
}

Comment on lines 2575 to 2580
boolean isPartialUpsertEnabled = (tableConfig.getUpsertConfig() != null)
    && (tableConfig.getUpsertConfig().getMode() == UpsertConfig.Mode.PARTIAL);
if (isPartialUpsertEnabled) {
  // If partial upsert is enabled, do not allow repair.
  return false;
}
Contributor:

We should also honor best effort here

Contributor:

@noob-se7en Given that this behavior can be kept common for partial upsert and dedup, we can put this config in the stream ingestion config, as suggested by @Jackie-Jiang.

@JsonPropertyDescription("Recovery mode which is used to decide how to recover a segment online in IS but having no"
+ " completed (immutable) replica on any server in pause-less ingestion")
@Nullable
private DisasterRecoveryMode _disasterRecoveryMode;
Contributor:

IMO this property belongs to StreamIngestionConfig

Contributor:

+1

boolean isDedupEnabled = (dedupConfig != null) && dedupConfig.isDedupEnabled();
if (isDedupEnabled) {
  DisasterRecoveryMode disasterRecoveryMode = dedupConfig.getDisasterRecoveryMode();
  if (disasterRecoveryMode == null) {
Contributor:

Use the default mode for not recovering here, instead of relying on null.

ControllerGauge.PAUSELESS_SEGMENTS_IN_UNRECOVERABLE_ERROR_COUNT, segmentsInErrorStateInAllReplicas.size());
LOGGER.error("Skipping repair for errored segments in table: {} because dedup or partial upsert is enabled.",
realtimeTableName);
LOGGER.error("Skipping repair for errored segments in table: {}.", realtimeTableName);
Contributor:

Can we move these log lines into allowRepairOfErrorSegments, since it's only called from here?

We can then add information like "partial upsert enabled" or "repairErrorSegmentForPartialUpsertOrDedup is set to true" as the reason for skipping re-ingestion, instead of losing this info.

@noob-se7en noob-se7en changed the title from "Adds BEST_EFFORT disaster recovery mode in dedup" to "Adds Disaster Recovery mode config for Pauseless" Jun 30, 2025
@noob-se7en noob-se7en requested a review from Jackie-Jiang June 30, 2025 19:46
@noob-se7en noob-se7en requested a review from 9aman June 30, 2025 19:46
@noob-se7en noob-se7en changed the title from "Adds Disaster Recovery mode config for Pauseless" to "Adds Disaster Recovery modes for Pauseless" Jun 30, 2025
ALWAYS,
// CONSISTENCY_FIRST means Pinot will skip the Disaster Recovery Job for tables like dedup/upsert where consistency
// of data is higher in priority than availability.
CONSISTENCY_FIRST
Contributor:

We can call this DEFAULT, and use javadoc to explain its behavior


@JsonPropertyDescription("Recovery mode which is used to decide how to recover a segment online in IS but having no"
+ " completed (immutable) replica on any server in pause-less ingestion")
private DisasterRecoveryMode _disasterRecoveryMode = DisasterRecoveryMode.ALWAYS;
Contributor:

This is a behavior change. Do we want to keep the existing behavior?

@noob-se7en (author) commented Jun 30, 2025:

On second thought, data correctness makes sense as the default mode.
Thanks, changed to DEFAULT, which skips upsert/dedup.
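With the final naming, an operator who accepts relaxed dedup/upsert correctness during recovery would opt in via the table config. A sketch follows; only the field name disasterRecoveryMode and the value ALWAYS come from this PR, while the table name and the exact JSON path under ingestionConfig are assumptions:

```json
{
  "tableName": "myTable_REALTIME",
  "ingestionConfig": {
    "streamIngestionConfig": {
      "disasterRecoveryMode": "ALWAYS"
    }
  }
}
```

Leaving the field unset (or setting DEFAULT) keeps the existing behavior of skipping repair for dedup and partial-upsert tables.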

@noob-se7en noob-se7en requested a review from Jackie-Jiang June 30, 2025 20:48
@Jackie-Jiang Jackie-Jiang merged commit a1b37cf into apache:master Jun 30, 2025
18 checks passed
4 participants