Network instabilities are able to freeze KubernetesJobWatcher

**Apache Airflow version**:
1.10.10 and 1.10.12 so far

**Kubernetes version (if you are using kubernetes) (use kubectl version)**: v1.16.10

**Environment**:

Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

**What happened**:
We are running Airflow with the KubernetesExecutor on a Kubernetes cluster with a somewhat unreliable network. The scheduler runs smoothly until it logs the following line:

```
WARNING - HTTPError when attempting to run task, re-queueing. Exception: ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
```
From then on, the executor is able to schedule jobs, but the job watcher stops processing events. The resulting behaviour is that new jobs are scheduled, but all jobs with status `Completed` remain in the cluster instead of being terminated. We mitigate this condition by periodically restarting the scheduler, which cleans up all the completed tasks correctly.

**What you expected to happen**:
Either KubernetesJobWatcher opens a new connection and goes on processing events, or it raises an exception shutting the process down so it can be restarted.

**How to reproduce the bug**: 
Take down the connection between the scheduler pod and the Kubernetes API.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Network instabilities are able to freeze KubernetesJobWatcher #12644

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Network instabilities are able to freeze KubernetesJobWatcher #12644

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions