Bug Report
Hey everyone,
Describe the Bug
The Data Flow's COMPLETED / FAILED state processing in the DataPlaneManagerImpl notifies the Controlplane of the data transfer's success / failure via the TransferProcessApiClient, which returns a Result. When the Result succeeds, the Data Flow moves to NOTIFIED as expected. However, when the Result fails, the Data Flow transitions back to its current state, so the state machine manager picks it up again and retries the notification. As there is no limit on the number of retries, this can continue forever.
Lines 286 - 308 from DataPlaneManagerImpl
private boolean processCompleted(DataFlow dataFlow) {
    var response = transferProcessClient.completed(dataFlow.toRequest());
    if (response.succeeded()) {
        dataFlow.transitToNotified();
        update(dataFlow);
    } else {
        dataFlow.transitToCompleted(); // Will retry while the process fails
        update(dataFlow);
    }
    return true;
}

private boolean processFailed(DataFlow dataFlow) {
    var response = transferProcessClient.failed(dataFlow.toRequest(), dataFlow.getErrorDetail());
    if (response.succeeded()) {
        dataFlow.transitToNotified();
        update(dataFlow);
    } else {
        dataFlow.transitToFailed(dataFlow.getErrorDetail()); // Will retry while the process fails
        update(dataFlow);
    }
    return true;
}

Expected Behavior
Terminate the Data Flow after a certain number of failed retries.
Observed Behavior
The Dataplane keeps retrying to notify the Controlplane indefinitely.
Steps to Reproduce
There are many ways to force a failed Result from TransferProcessApiClient, but I used the following:
- Start a Consumer <-> Provider pair of Connectors
- On the Provider, create an Asset + Policy + Contract Definition
- On the Consumer, fetch the catalog and negotiate the offer
- On the Consumer, start a PROVIDER-PUSH Transfer Process
- Make sure the Transfer takes some time (e.g. transfer a huge file, or simply add a Thread.sleep() in the TransferService)
- While transferring, shut down the Provider Controlplane
- When the transfer completes, the Controlplane won't be reachable, so the Controlplane notification will fail
- Check the Provider Dataplane logs to see multiple retries of this process
Context Information
Tested on version 0.10.1 and on main's latest commit (98cbed8).
This issue was first discovered because, somehow, the Transfer Process and the Data Flow became out of sync, with the Transfer Process being TERMINATED but the Data Flow being COMPLETED. In this instance, the Controlplane always replied with an error status code to the Dataplane notification, failing the Result and forcing a retry of the process.
Possible Implementation
I see two possible approaches.
The default implementation of the EdcHttpClient (indirectly called by the TransferProcessApiClient's default implementation) already has a configurable retry policy. Since retries are already handled at that stage, the Data Flow could transition to TERMINATED immediately when the DataPlaneManagerImpl receives a failed Result.
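A minimal sketch of this first approach, for discussion only. It does not use the real EDC classes: `Flow`, `State` and `ControlPlaneNotifier` are hypothetical stand-ins for DataFlow, its states and the TransferProcessApiClient, and an empty `Optional` plays the role of a succeeded Result. The only point is that a failed notification ends in TERMINATED instead of returning to COMPLETED.

```java
import java.util.Optional;

class TerminateOnFailedNotification {

    // Simplified states; only the ones relevant to this sketch.
    enum State { COMPLETED, NOTIFIED, TERMINATED }

    // Hypothetical stand-in for DataFlow.
    static final class Flow {
        State state = State.COMPLETED;
        String errorDetail;
    }

    // Hypothetical stand-in for TransferProcessApiClient: an empty Optional
    // means success, otherwise it carries the failure detail.
    interface ControlPlaneNotifier {
        Optional<String> completed(Flow flow);
    }

    // Assumes the notifier (e.g. the HTTP client behind it) has already
    // exhausted its own retry policy before reporting a failure.
    static void processCompleted(Flow flow, ControlPlaneNotifier notifier) {
        Optional<String> failure = notifier.completed(flow);
        if (failure.isEmpty()) {
            flow.state = State.NOTIFIED;
        } else {
            // No transition back to COMPLETED, so the state machine
            // does not pick this flow up again.
            flow.state = State.TERMINATED;
            flow.errorDetail = "Control plane notification failed: " + failure.get();
        }
    }

    public static void main(String[] args) {
        Flow flow = new Flow();
        processCompleted(flow, f -> Optional.of("control plane unreachable"));
        System.out.println(flow.state + " (" + flow.errorDetail + ")");
    }
}
```

The trade-off is that a custom TransferProcessApiClient without its own retries would terminate the Data Flow on the very first failure.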
If retries should be handled at the DataPlaneManagerImpl, we could make the process run inside a RetryProcessor, making the Data Flow transition to TERMINATED on final failure. For custom implementations of the TransferProcessApiClient that do not handle retries, this would ensure that the Data Flow is not terminated at the first failure.
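A rough sketch of the second approach, again with hypothetical stand-ins rather than the real EDC types. It does not use the actual RetryProcessor; a plain counter (`failedNotifications`) and a `retryLimit` parameter are assumptions made up for illustration. The flow stays in COMPLETED while retries remain and moves to TERMINATED once the limit is exceeded.

```java
class BoundedNotificationRetry {

    // Simplified states; only the ones relevant to this sketch.
    enum State { COMPLETED, NOTIFIED, TERMINATED }

    // Hypothetical stand-in for DataFlow with a failure counter.
    static final class Flow {
        State state = State.COMPLETED;
        int failedNotifications = 0;
    }

    // Hypothetical stand-in for TransferProcessApiClient: true means success.
    interface ControlPlaneNotifier {
        boolean completed(Flow flow);
    }

    // One pass of the COMPLETED processor with a bounded number of retries.
    static void processCompleted(Flow flow, ControlPlaneNotifier notifier, int retryLimit) {
        if (notifier.completed(flow)) {
            flow.state = State.NOTIFIED;
            return;
        }
        flow.failedNotifications++;
        if (flow.failedNotifications > retryLimit) {
            flow.state = State.TERMINATED; // final failure, stop retrying
        } else {
            flow.state = State.COMPLETED;  // picked up again on the next iteration
        }
    }

    // Simulates the state machine picking the flow up until it leaves COMPLETED.
    static int runUntilSettled(Flow flow, ControlPlaneNotifier notifier, int retryLimit) {
        int iterations = 0;
        while (flow.state == State.COMPLETED) {
            processCompleted(flow, notifier, retryLimit);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        Flow flow = new Flow();
        int iterations = runUntilSettled(flow, f -> false, 3);
        System.out.println(flow.state + " after " + iterations + " attempts");
    }
}
```

With a notifier that always fails and a retry limit of 3, the loop makes one initial attempt plus three retries before the flow ends in TERMINATED. A real implementation would also need some delay between attempts, which this sketch leaves out.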