Apache Airflow version
2.2.3 (latest released)
What happened
After upgrading Airflow to 2.2.3 (from 2.2.2) and the cncf.kubernetes provider to 3.0.1 (from 2.0.3), we started seeing these errors in the logs:
{"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n self.resource_version = self._run(\n File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n for event in list_worker_pods():\n File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
Process KubernetesJobWatcher-6571:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
self.resource_version = self._run(
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
for event in list_worker_pods():
File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 655595751 (655818065)
Pods are created and run to completion, but the KubernetesJobWatcher seems incapable of seeing that they completed. From there, Airflow comes to a complete halt.
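For context, the 410 is the Kubernetes API server telling the watcher that the resource_version it passed has already been compacted, so the watch cannot be resumed from that point and has to be restarted from a fresh list. Below is a minimal, standalone sketch of that recovery pattern using the kubernetes Python client (this is not Airflow's code, and the namespace is made up), just to illustrate what I would expect the watcher to do instead of failing:

```python
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def watch_pods(namespace: str = "airflow") -> None:
    # Hypothetical namespace; adjust to your deployment.
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    resource_version = None
    while True:
        w = watch.Watch()
        kwargs = {"namespace": namespace}
        if resource_version:
            kwargs["resource_version"] = resource_version
        try:
            for event in w.stream(v1.list_namespaced_pod, **kwargs):
                # Remember where we got to, so a clean restart can resume.
                resource_version = event["object"].metadata.resource_version
                print(event["type"], event["object"].metadata.name)
        except ApiException as e:
            if e.status == 410:
                # "Expired: too old resource version" - the stored version was
                # compacted away, so drop it and start the watch from a fresh list.
                resource_version = None
                continue
            raise
```

In the traceback above, the exception instead propagates out of _run(), which matches the "Unknown error in KubernetesJobWatcher. Failing" log line and the dead KubernetesJobWatcher process.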
What you expected to happen
No errors in the logs, and the job watcher doing its job of collecting completed jobs.
How to reproduce
I wish I knew. I am trying to downgrade the cncf.kubernetes provider to previous versions to see if that helps.
Operating System
k8s (the Airflow images are Debian-based)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1
Deployment
Other
Deployment details
The deployment is on k8s v1.19.16, made with Helm 3.
Anything else
The symptoms look a lot like #17629, but the error happens in a different place.
Redeploying as suggested in that issue seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct