Support retrying non-finished async tasks on startup and periodically #1585
danielhumanmod wants to merge 8 commits into apache:main
Conversation
```java
handler.handleTask(task, callContext);
// ...
timeSource.add(Duration.ofMinutes(10));
```
Previously, a task entity might be missing the LAST_ATTEMPT_START_TIME property, so loading tasks could succeed without any time-out check. Now that every task entity is created with this property, the tests need to manipulate time to make loadTasks return the tasks.
Can you explain this further - I'm not sure why the tests need this 10m jump? Is it so that tasks are "recovered" by the Quarkus Scheduled method?
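The 10-minute jump makes sense if task loading filters out tasks whose last attempt started within the lease timeout window: advancing a mutable test clock past that window makes the tasks eligible again without real waiting. A minimal sketch of such a test time source (the class name `MutableClock` is illustrative, not the Polaris test utility):

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative mutable clock: tests call add(...) to move "now" forward,
// so time-based filters (e.g. a task lease timeout) can be exercised
// without sleeping.
final class MutableClock extends Clock {
  private final AtomicReference<Instant> now;

  MutableClock(Instant start) {
    this.now = new AtomicReference<>(start);
  }

  void add(Duration d) {
    now.updateAndGet(t -> t.plus(d));
  }

  @Override
  public Instant instant() { return now.get(); }

  @Override
  public ZoneId getZone() { return ZoneOffset.UTC; }

  @Override
  public Clock withZone(ZoneId zone) { return this; }
}
```

A test would inject this clock, run the task once, then `add(Duration.ofMinutes(10))` before asserting that the task is picked up again.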
```java
@Override
public Map<String, PolarisMetaStoreManager> getMetaStoreManagerMap() {
```
To make this a bit more defensively coded, I might recommend turning this into an iterator of Map.Entry objects, given that this is a public method and we wouldn't want any code path to be able to modify this mapping.
Good catch, will fix this!
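One common fix for this kind of concern is to return a read-only view of the internal map instead of the map itself. A hedged sketch under simplified names (`ManagerRegistry` and the generic value type stand in for the Polaris classes; this is not the actual fix applied):

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the defensive-coding suggestion: expose the manager map
// read-only so no caller can mutate internal state through the getter.
final class ManagerRegistry<M> {
  private final Map<String, M> managers = new ConcurrentHashMap<>();

  void register(String realm, M manager) {
    managers.put(realm, manager);
  }

  // Callers get a read-only view; put/remove on it throws
  // UnsupportedOperationException.
  Map<String, M> getMetaStoreManagerMap() {
    return Collections.unmodifiableMap(managers);
  }
}
```

An unmodifiable view has the same effect as handing out entry iterators: readers can enumerate the mapping, but any mutation attempt fails fast.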
```java
private void addTaskLocation(TaskEntity task) {
  Map<String, String> internalPropertiesAsMap = new HashMap<>(task.getInternalPropertiesAsMap());
```
```java
    dataFileDeletes.size(),
    manifestFile.path());
try {
  ManifestReader<DataFile> dataFiles = ManifestFiles.read(manifestFile, fileIO);
```
What's the reason behind this change?
```java
new ManifestFileCleanupTaskHandler.ManifestCleanupTask(
        tableEntity.getTableIdentifier(), TaskUtils.encodeManifestFile(mf)))
    .withLastAttemptExecutorId(executorId)
    .withAttemptCount(1)
```
How can we assume this?
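The reviewer's concern about hard-coding `withAttemptCount(1)` could be addressed by deriving the next attempt number from state persisted on the task entity rather than assuming a first attempt. A hedged sketch (the property name `attempt-count` and the helper are illustrative, not the Polaris schema):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: instead of assuming withAttemptCount(1), read the attempt
// count previously persisted in the task's properties and increment it.
// Missing property means the task has never been attempted.
final class AttemptCounter {
  static final String ATTEMPT_COUNT = "attempt-count"; // hypothetical key

  static int nextAttempt(Map<String, String> taskProperties) {
    int previous = Integer.parseInt(taskProperties.getOrDefault(ATTEMPT_COUNT, "0"));
    return previous + 1;
  }
}
```

With such a helper, a recovered task resumes at its real attempt number, which matters for retry limits and backoff decisions.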
```java
new BatchFileCleanupTaskHandler.BatchFileCleanupTask(
        tableEntity.getTableIdentifier(), metadataBatch))
    .withLastAttemptExecutorId(executorId)
    .withAttemptCount(1)
```
```java
PolarisCallContext polarisCallContext =
    new PolarisCallContext(
        metastore, new PolarisDefaultDiagServiceImpl(), configurationStore, clock);
EntitiesResult entitiesResult =
```
I'm not sure I'm understanding the logic here: we are asking for 20 tasks here - but what if there are more than 20 tasks that need recovery?
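One common answer to the "more than 20 tasks" concern is to call the loading method in a loop, draining fixed-size batches until an empty batch comes back. A hedged sketch (the `loader` function stands in for the metastore's bounded task-loading call; this is not the actual Polaris implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Sketch: recover tasks in fixed-size batches instead of a single
// bounded fetch. `loader` abstracts a call like loadTasks(executorId,
// limit) that leases up to `limit` pending tasks per invocation.
final class TaskDrainer {
  static <T> List<T> drainAll(IntFunction<List<T>> loader, int batchSize) {
    List<T> all = new ArrayList<>();
    while (true) {
      List<T> batch = loader.apply(batchSize);
      if (batch.isEmpty()) {
        break; // nothing left to recover
      }
      all.addAll(batch);
    }
    return all;
  }
}
```

Because each batch is leased transactionally, looping like this stays safe even with several executors draining concurrently: each iteration only receives tasks it successfully leased.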
```java
tableCleanupTaskHandler.handleTask(task, callCtx);
// ...
// Step 3: Verify that the generated child tasks were registered, ATTEMPT_COUNT = 2
timeSource.add(Duration.ofMinutes(10));
```
I, personally, found this very hard to follow - even with the comments. I would highly recommend making the comments much more verbose here to allow the full flow of logic (what is happening with which task and why) to be communicated to a reader who may not be an expert at this particular type of task or tasks in general.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@adnanhemani Since the branch has been outdated for too long and is hard to rebase on main, I created a new PR #2003 to introduce these changes; I will migrate and address your comments there.
Fix #774
Context
Polaris uses async tasks to perform operations such as table and manifest file cleanup. These tasks are executed asynchronously in a separate thread within the same JVM, and retries are handled inline within the task execution. However, this mechanism does not guarantee eventual execution in the following cases:
Implementation Plan
Stage 1: Potential improvement - #1523
Introduce per-task transactional leasing in the metastore layer via loadTasks(...)
Stage 2 (Current PR):
Persist failed tasks and introduce a retry mechanism triggered during Polaris startup and via periodic background checks, changes included:
- `getMetaStoreManagerMap`
- `LAST_ATTEMPT_START_TIME` is set on each task entity at creation, which is important for time-out filtering when `loadTasks()` reads from the metastore, so that multiple executors are prevented from picking up the same task
- `TaskRecoveryManager`: new class responsible for task recovery logic, including constructing the `PolarisCallContext`
- `QuarkusTaskExecutorImpl`: hook into the application lifecycle to initiate task recovery.

Recommended Review Order
- `TaskRecoveryManager`
- `QuarkusTaskExecutorImpl` and `TaskExecutorImpl`
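The `LAST_ATTEMPT_START_TIME` point above reduces to a simple eligibility check: a task is recoverable only when its last attempt started longer ago than the lease timeout, otherwise it is assumed to still be owned by a live executor. A hedged sketch of that filter (class name and timeout handling are illustrative, not the Polaris implementation):

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Illustrative timeout filter for task recovery: a task whose last
// attempt started within `leaseTimeout` is assumed to still be held
// by another executor and is skipped; older tasks are eligible for
// pickup. This mirrors the described use of LAST_ATTEMPT_START_TIME.
final class TaskLeaseFilter {
  private final Clock clock;
  private final Duration leaseTimeout;

  TaskLeaseFilter(Clock clock, Duration leaseTimeout) {
    this.clock = clock;
    this.leaseTimeout = leaseTimeout;
  }

  boolean isEligibleForRecovery(Instant lastAttemptStartTime) {
    Instant cutoff = clock.instant().minus(leaseTimeout);
    return lastAttemptStartTime.isBefore(cutoff);
  }
}
```

This also explains the test-side clock jumps discussed in the review thread: after a task runs once, the test clock must advance past the lease timeout before the task becomes visible to recovery again.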