Infinite retry loop when completing / failing a DataFlow #4860

@correiaafonso12

Bug Report

Hey everyone,

Describe the Bug

The Data Flow's COMPLETED / FAILED state processing in the DataPlaneManagerImpl notifies the Controlplane of the data transfer's success / failure via the TransferProcessApiClient, which returns a Result. When the Result succeeds, the Data Flow moves to NOTIFIED as expected. However, when the Result fails, the Data Flow transitions back to its current state, so the state machine manager picks it up again and retries the notification. As there is no limit on the number of retries, this loop can continue forever.

Lines 286 - 308 from DataPlaneManagerImpl

private boolean processCompleted(DataFlow dataFlow) {
    var response = transferProcessClient.completed(dataFlow.toRequest());
    if (response.succeeded()) {
        dataFlow.transitToNotified();
        update(dataFlow);
    } else {
        dataFlow.transitToCompleted(); // Will retry while the process fails
        update(dataFlow);
    }
    return true;
}

private boolean processFailed(DataFlow dataFlow) {
    var response = transferProcessClient.failed(dataFlow.toRequest(), dataFlow.getErrorDetail());
    if (response.succeeded()) {
        dataFlow.transitToNotified();
        update(dataFlow);
    } else {
        dataFlow.transitToFailed(dataFlow.getErrorDetail()); // Will retry while the process fails
        update(dataFlow);
    }
    return true;
}

Expected Behavior

Terminate the Data Flow after a bounded number of failed retries.

Observed Behavior

The Dataplane keeps retrying to notify the Controlplane indefinitely.

Steps to Reproduce

There are many ways to force a failed Result from TransferProcessApiClient, but I tried the following:

  1. Start a Consumer <-> Provider pair of Connectors
  2. On the Provider, create an Asset + Policy + Contract Definition
  3. On the Consumer, fetch the catalog and negotiate the offer
  4. On the Consumer, start a PROVIDER-PUSH Transfer Process
    • Make sure the Transfer takes some time (e.g. transfer a huge file, or simply add a Thread.sleep() in the TransferService)
  5. While transferring, shut down the Provider Controlplane
  6. When the transfer completes, the Controlplane won't be reachable, so the Controlplane notification will fail
  7. Check the Provider Dataplane logs to see multiple retries of this process

Context Information

Tested on version 0.10.1 and on main's latest commit (98cbed8).

This issue was first discovered because, somehow, the Transfer Process and the Data Flow became out of sync, with the Transfer Process being TERMINATED but the Data Flow being COMPLETED. In this instance, the Controlplane always replied with an error status code to the Dataplane notification, failing the Result and forcing a retry of the process.

Possible Implementation

I see two possible approaches.

The default implementation of the EdcHttpClient (indirectly called by the TransferProcessApiClient's default implementation) already has a configurable retry policy. As retries are already handled at that stage, the Data Flow may immediately transition to TERMINATED when a failed Result reaches the DataPlaneManagerImpl.
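A minimal sketch of this first approach, under stated assumptions: `SketchFlowState`, `SketchFlow` and `TerminateOnFailureSketch` are hypothetical, simplified stand-ins and not the real EDC classes, and the `BooleanSupplier` stands in for a TransferProcessApiClient call that has already exhausted its own HTTP retries.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-ins for the EDC types; names are illustrative only.
enum SketchFlowState { COMPLETED, NOTIFIED, TERMINATED }

class SketchFlow {
    SketchFlowState state = SketchFlowState.COMPLETED;
    String errorDetail;
}

class TerminateOnFailureSketch {
    // notifyControlPlane returns true when the Controlplane accepted the notification.
    // Assumption: any HTTP-level retrying has already happened inside that call.
    static void processCompleted(SketchFlow flow, BooleanSupplier notifyControlPlane) {
        if (notifyControlPlane.getAsBoolean()) {
            flow.state = SketchFlowState.NOTIFIED;
        } else {
            // No second attempt here: a failed Result ends the Data Flow.
            flow.state = SketchFlowState.TERMINATED;
            flow.errorDetail = "Controlplane notification failed";
        }
    }
}
```

The trade-off is that a TransferProcessApiClient implementation without its own retries would terminate the Data Flow on the first transient error.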

If retries should be handled at the DataPlaneManagerImpl, we could make the process run inside a RetryProcessor, making the Data Flow transition to TERMINATED on final failure. For custom implementations of the TransferProcessApiClient that do not handle retries, this would ensure that the Data Flow is not terminated at the first failure.
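The second approach could look roughly like the sketch below. It does not use the EDC RetryProcessor; it swaps in a plain attempt counter to show the intended behaviour. `RetryFlowState`, `RetryFlow`, `BoundedRetrySketch`, `maxAttempts` and `runUntilSettled` are all hypothetical names, and the loop only imitates the state machine picking the Data Flow up again.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-ins for the EDC types; names are illustrative only.
enum RetryFlowState { COMPLETED, NOTIFIED, TERMINATED }

class RetryFlow {
    RetryFlowState state = RetryFlowState.COMPLETED;
    int notifyAttempts = 0; // number of notifications tried so far
}

class BoundedRetrySketch {
    // One state machine iteration for a flow in COMPLETED.
    static void processCompleted(RetryFlow flow, BooleanSupplier notifyControlPlane, int maxAttempts) {
        flow.notifyAttempts++;
        if (notifyControlPlane.getAsBoolean()) {
            flow.state = RetryFlowState.NOTIFIED;
        } else if (flow.notifyAttempts >= maxAttempts) {
            // Final failure: stop retrying instead of looping forever.
            flow.state = RetryFlowState.TERMINATED;
        } else {
            // Stay in COMPLETED so the next iteration retries.
            flow.state = RetryFlowState.COMPLETED;
        }
    }

    // Imitates the state machine re-processing the flow until it leaves COMPLETED.
    // Returns the number of iterations that were needed.
    static int runUntilSettled(RetryFlow flow, BooleanSupplier notifyControlPlane, int maxAttempts) {
        int iterations = 0;
        while (flow.state == RetryFlowState.COMPLETED) {
            processCompleted(flow, notifyControlPlane, maxAttempts);
            iterations++;
        }
        return iterations;
    }
}
```

With an unreachable Controlplane (`() -> false`) and `maxAttempts` of 3, the flow settles in TERMINATED after three iterations rather than retrying indefinitely. A real implementation would also need to persist the attempt count and wait between attempts, which this sketch leaves out.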

Metadata

Labels

bug (Something isn't working)
