Closed
Labels
area:providers, kind:bug, needs-triage, provider:microsoft-azure
Description
Apache Airflow Provider(s)
microsoft-azure
Versions of Apache Airflow Providers
8.3.0
Apache Airflow version
2.7.3
Operating System
Ubuntu 20.04
Deployment
Other
Deployment details
Airflow running on a VM hosted in Azure
What happened
We are experiencing an issue with Azure Spot Containers where their status continuously cycles between Unhealthy → Repairing → Running, without actually executing any tasks.
- When they return to the Running state, they remain idle and do not perform any actions.
- Eventually, they go back to Unhealthy, repeating the cycle indefinitely.
- Since they don’t stay in any state for long, they can bypass both container and Airflow timeouts.
- Attempting to manually SSH into a container that reaches the Running state after being Unhealthy fails. In our experience, nothing can be done with the container other than terminating it.
- It seems to occur roughly 10% of the time with Spot containers in EU-West.
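To illustrate why the cycle can slip past timeouts, here is a minimal model of the behaviour described above. It is only a sketch: the durations, the 600-second timeout, and the helper names `per_state_timeout_fires` and `total_elapsed` are made up for illustration and do not reflect how the provider or Azure actually implement timeouts.

```python
from typing import Iterable, Tuple

# (state, seconds spent in that state); state names follow the issue text.
Observation = Tuple[str, float]


def per_state_timeout_fires(history: Iterable[Observation], timeout: float) -> bool:
    """Return True if any single stay in one state lasts at least `timeout` seconds."""
    return any(duration >= timeout for _, duration in history)


def total_elapsed(history: Iterable[Observation]) -> float:
    """Return the total wall-clock time covered by the history."""
    return sum(duration for _, duration in history)


# Hypothetical trace: ten Unhealthy -> Repairing -> Running cycles.
# The durations are invented; the issue only says no state lasts long.
one_cycle = [("Unhealthy", 120.0), ("Repairing", 180.0), ("Running", 240.0)]
history = one_cycle * 10

print(per_state_timeout_fires(history, timeout=600.0))  # False
print(total_elapsed(history))  # 5400.0
```

In this model a timeout that resets on every state change never fires, even though the container has been unproductive for 90 minutes in total.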
What you think should happen instead
Ideally, the container should be forcefully terminated when it enters the Unhealthy state to prevent this looping behaviour.
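The proposed behaviour could be sketched roughly as follows. This is not the provider's code: `get_state` and `terminate` are hypothetical callables standing in for whatever the operator uses to read the container state and delete the container, and the terminal state names are assumptions.

```python
import time
from typing import Callable

# Assumed terminal state names; adjust to whatever the real API reports.
TERMINAL_STATES = ("Succeeded", "Failed", "Terminated")


def watch_and_terminate(
    get_state: Callable[[], str],
    terminate: Callable[[], None],
    poll_interval: float = 5.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the container state and terminate it the first time it is Unhealthy.

    Returns the last observed state.
    """
    state = "Unknown"
    for _ in range(max_polls):
        state = get_state()
        if state == "Unhealthy":
            # Do not wait for Repairing -> Running; kill the container now.
            terminate()
            return state
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
    return state


# Usage with scripted states instead of a real container.
scripted = iter(["Running", "Running", "Unhealthy", "Repairing", "Running"])
terminated = []
result = watch_and_terminate(
    get_state=lambda: next(scripted),
    terminate=lambda: terminated.append(True),
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(result, terminated)  # Unhealthy [True]
```

Terminating on the first Unhealthy observation means the Repairing and Running states that follow are never reached, so the loop cannot start.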
How to reproduce
Since this issue occurs randomly, there is no single snippet of code that can consistently reproduce it. However, the following steps can increase the likelihood of encountering the problem:
- Deploy multiple Azure Spot Containers running Airflow tasks.
- Run tasks during peak hours (e.g., in the EU West region) to increase the chances of hitting the issue.
- Monitor container lifecycle events to check if they enter an Unhealthy → Repairing → Running loop.
- (Optional) Manually find a way to spoof the container's status as "Unhealthy."
- Try to SSH into a container that enters the "Running" state after being Unhealthy; it should fail.
It is difficult to force it to happen on demand.
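Because the real failure is hard to trigger, the status could instead be spoofed in a test. The sketch below replays a scripted state sequence and counts how often the container relapses into Unhealthy. The names `spoofed_states` and `count_relapses` are hypothetical helpers, not part of Airflow or the Azure SDK.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, Tuple

# The loop described in this issue.
LOOP = ("Unhealthy", "Repairing", "Running")


def spoofed_states(cycles: int, prefix: Tuple[str, ...] = ("Running",)) -> Iterator[str]:
    """Yield `prefix`, then `cycles` repetitions of the Unhealthy loop."""
    yield from prefix
    yield from islice(cycle(LOOP), cycles * len(LOOP))


def count_relapses(states: Iterable[str]) -> int:
    """Count transitions into Unhealthy from any other state."""
    relapses = 0
    previous = None
    for state in states:
        if state == "Unhealthy" and previous != "Unhealthy":
            relapses += 1
        previous = state
    return relapses


observed = list(spoofed_states(cycles=3))
print(observed)
print(count_relapses(observed))  # 3
```

A sequence like this can be fed to whatever polling logic is under test, so a fix can be exercised without waiting for a Spot container to misbehave.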
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct