Azure Spot Containers Stuck in Unhealthy-Repairing Cycle #47023

### Apache Airflow Provider(s)

microsoft-azure

### Versions of Apache Airflow Providers

8.3.0

### Apache Airflow version

2.7.3

### Operating System

Ubuntu 20.04

### Deployment

Other

### Deployment details

Airflow running on a VM hosted in Azure

### What happened

We are experiencing an issue with Azure Spot Containers: their status continuously cycles through Unhealthy → Repairing → Running, and they never actually execute any tasks.

- When they return to the Running state, they remain idle and do not perform any actions.
- Eventually, they go back to Unhealthy, repeating the cycle indefinitely.
- Since they don’t stay in any state for long, they can bypass both container and Airflow timeouts.
- Attempting to manually SSH into a container that reaches the Running state after being Unhealthy fails. In our experience, nothing can be done with the container other than terminating it.
- This seems to affect roughly 10% of Spot containers in EU West.


### What you think should happen instead

Ideally, the container should be forcefully terminated when it enters the Unhealthy state to prevent this looping behaviour.
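
The proposed behaviour could be sketched roughly as the polling loop below. This is only an illustration, not the provider's actual implementation: `watch_container`, `get_state`, and `terminate` are hypothetical names, the two callables stand in for whatever Azure calls fetch the container state and delete the container, and the terminal state names are assumptions.

```python
import time
from typing import Callable

# Assumed names for states in which the container has finished on its own.
TERMINAL_STATES = {"Succeeded", "Failed"}


def watch_container(
    get_state: Callable[[], str],
    terminate: Callable[[], None],
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a container and terminate it the first time it reports Unhealthy.

    get_state and terminate are hypothetical hooks supplied by the caller.
    Returns the terminal state, or "Terminated" if the container was killed.
    """
    while True:
        state = get_state()
        if state == "Unhealthy":
            # Do not wait for the repair attempt: kill the container right away
            # so it cannot re-enter the Unhealthy -> Repairing -> Running loop.
            terminate()
            return "Terminated"
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
```

Because termination is triggered by the first Unhealthy observation rather than by time spent in a state, the short dwell time in each state no longer lets the container slip past the check.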

### How to reproduce

Since this issue occurs randomly, there is no single snippet of code that can consistently reproduce it. However, the following steps can increase the likelihood of encountering the problem:

- Deploy multiple Azure Spot Containers running Airflow tasks.
- Run tasks during peak hours (e.g., in the EU West region) to increase the chances of hitting the issue.
- Monitor container lifecycle events to check if they enter an Unhealthy → Repairing → Running loop.
- (Optional) Manually find a way to spoof the container's status as "Unhealthy."
- Try to SSH into a container that enters the "Running" state after being Unhealthy; the connection should fail.

It is difficult to force it to happen on demand.
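
For the monitoring step, a recorded state history can be checked for the loop with a small helper like the one below. This is a minimal sketch under assumptions: `count_repair_cycles` and `collapse_repeats` are hypothetical helpers, and the history is assumed to be a plain list of state names collected by whatever polling is already in place.

```python
from typing import Iterable, List

# The state sequence described in this report.
CYCLE = ("Unhealthy", "Repairing", "Running")


def collapse_repeats(states: Iterable[str]) -> List[str]:
    """Drop consecutive duplicates so each entry is a real state transition."""
    collapsed: List[str] = []
    for state in states:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)
    return collapsed


def count_repair_cycles(states: Iterable[str]) -> int:
    """Count complete Unhealthy -> Repairing -> Running cycles in a history."""
    collapsed = collapse_repeats(states)
    count = 0
    i = 0
    while i + len(CYCLE) <= len(collapsed):
        if tuple(collapsed[i:i + len(CYCLE)]) == CYCLE:
            count += 1
            i += len(CYCLE)  # skip past the cycle that was just matched
        else:
            i += 1
    return count


history = [
    "Running", "Running", "Unhealthy", "Repairing", "Repairing",
    "Running", "Unhealthy", "Repairing", "Running",
]
print(count_repair_cycles(history))  # 2
```

A container whose count keeps growing while its task makes no progress is a likely candidate for the stuck loop.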

### Anything else

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

