-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Description
I have an AKS deployed airflow v2.0.0 with a Kubernetes Executor enabled and the KubernetesJobWatcher is failing periodically.
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version):
% kc version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"1994a5495a40a663921c5ecfee7dd9a8c61704fa", GitTreeState:"clean", BuildDate:"2020-07-23T22:06:44Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: AKS
What happened:
[2021-02-10 15:33:34,756] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
yield
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
for event in list_worker_pods():
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
yield
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
for event in list_worker_pods():
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
[2021-02-10 15:33:35,022] {kubernetes_executor.py:266} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2021-02-10 15:37:58,640] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
yield
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
for event in list_worker_pods():
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
Process KubernetesJobWatcher-5:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
yield
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
for event in list_worker_pods():
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
[2021-02-10 15:37:59,446] {kubernetes_executor.py:266} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
edikmkoyan@EMkoyan15052 chart %
edikmkoyan@EMkoyan15052 chart % kv version
zsh: command not found: kv
edikmkoyan@EMkoyan15052 chart % kc version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"1994a5495a40a663921c5ecfee7dd9a8c61704fa", GitTreeState:"clean", BuildDate:"2020-07-23T22:06:44Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
scheduler pods are being recreated. kc logs pod/airflow2-scheduler-84df66d96f-vphtw scheduler logs the messages above.
How often does this problem occur? Once? Every time etc?
About 2 times in 30 minutes