
KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087

@cansjt

Description


Apache Airflow version

2.2.3 (latest released)

What happened

After upgrading Airflow to 2.2.3 (from 2.2.2) and the cncf.kubernetes provider to 3.0.1 (from 2.0.3), we started to see these errors in the logs:

{"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n    self.resource_version = self._run(\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n    for event in list_worker_pods():\n  File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n    raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
Process KubernetesJobWatcher-6571:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 655595751 (655818065)

Pods are created and run to completion, but the KubernetesJobWatcher seems incapable of seeing that they have completed. From there, Airflow comes to a complete halt.
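
For context, the 410 means the watcher tried to resume a watch from a resourceVersion the API server no longer retains, so the only way forward is to drop the cached version and relist. Below is a minimal sketch of that recover-and-relist pattern using the kubernetes Python client; the namespace, label selector, and event handling are placeholders for illustration, not Airflow's actual executor code.

from kubernetes import client, config, watch

config.load_incluster_config()  # assumes this runs inside the cluster
v1 = client.CoreV1Api()

resource_version = None
while True:
    kwargs = {"label_selector": "airflow-worker"}  # placeholder selector
    if resource_version:
        # Resume the watch from the last version we saw.
        kwargs["resource_version"] = resource_version
    try:
        for event in watch.Watch().stream(v1.list_namespaced_pod, "airflow", **kwargs):
            # Remember the latest version so the next watch can resume from it.
            resource_version = event["object"].metadata.resource_version
            print(event["type"], event["object"].metadata.name, event["object"].status.phase)
    except client.rest.ApiException as e:
        if e.status == 410:
            # "Expired: too old resource version": the server no longer keeps
            # history back to our cached version. Clear it and relist from now
            # instead of letting the watcher die.
            resource_version = None
        else:
            raise

Judging from the traceback above, the watcher process dies on this exception rather than resetting and relisting, which would explain why completed pods are never collected afterwards.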

What you expected to happen

No errors in the logs, and the job watcher does its job of collecting completed jobs.

How to reproduce

I wish I knew. I am trying to downgrade the cncf.kubernetes provider to previous versions to see if that helps.

Operating System

k8s (Airflow images are Debian-based)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1

Deployment

Other

Deployment details

The deployment runs on Kubernetes v1.19.16 and is made with Helm 3.

Anything else

The symptoms look a lot like #17629, but the failure happens in a different place.
Redeploying, as suggested in that issue, seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
