Skip to content

Network instabilities are able to freeze KubernetesJobWatcher #12644

@guillemborrell

Description

@guillemborrell

Apache Airflow version:
1.10.10 and 1.10.12 so far

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.16.10

Environment:

Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

What happened:
We are running Airflow with the KubernetesExecutor on a Kubernetes cluster with a somewhat unreliable network. The scheduler runs smoothly until it logs the following line:

WARNING - HTTPError when attempting to run task, re-queueing. Exception: ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))

From then on, the executor is able to schedule jobs, but the job watcher stops processing events. The resulting behaviour is that new jobs are scheduled, but all jobs with status Completed remain in the cluster instead of being terminated. We mitigate this condition by periodically restarting the scheduler, which cleans up all the completed tasks correctly.

What you expected to happen:
Either KubernetesJobWatcher opens a new connection and goes on processing events, or it raises an exception shutting the process down so it can be restarted.

How to reproduce the bug:
Take down the connection between the scheduler pod and the Kubernetes API.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions