-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
Bug Report
Outbound linkerd-proxy sometimes does not call epoll_wait on a certain epoll file descriptor which results in very large delays between establishing a connection and sending the TLS handshake client_hello message. This in turn causes the inbound linkerd-proxy on the other side to close that connection as it times out after 10 seconds trying to read the TLS/SNI message.
What is the issue?
Interpreting the log and trace files and the key events as summarized in the below table we can see that:
- The
outboundproxy successfully opens a TCP connection, but then waits for 52 seconds before it writes the first byte of data into that connection (the TLS handshakeclient_hellomessage). - The
inboundproxy immediatelly accepts the connection from theoutboundproxy and waits for 10 seconds for any data to arrive. As no data arrives within 10 seconds, it times out, closes the connection and writes the "TLS detection timed out" log message.
It looks like this is a strange bug on the outbound proxy, which in some cases leads to a very long delay between
establishing a connection and writing the TSL handshake. What is very suspicious is that there is no epoll_wait system call on the outbound proxy for the epoll file descriptor 5 (the one the affected TCP connection is registered with).
I can only speculate on what causes this behaviour, some ideas:
- A bug in the underlying async runtime (Tokio or Tower) which results in some epoll file descriptors not beeing polled (
epoll_waitis not called for those file descritors). - ... your idea goes here ...
How can it be reproduced?
???
Logs, error output, etc
In order to narrow down what is going on I enabled loglevel TRACE on the outbound and
inbound linkerd-proxy sidecars of the following two Pods in our cluster:
| Pod | IP Address | Role |
|---|---|---|
gh-commander-api-f4b748487-7l57c |
10.40.39.45 |
outbound |
linkerd-destination-dc474b4bb-jsw7j |
10.40.42.129 |
inbound |
Furthermore, I started an strace sessions for the Unix processes of each of
these two linkerd-proxy instances to capture all system calls.
After running for some minutes the problem appeared again at 2021-05-19T18:49:12Z indicated
by a log message on the linkerd-destination Pod containing the string TLS detection timed out.
Being confident that we captured an instance of the problematic event, I stopped log and strace
capturing which resulted in the following files:
| Filename | Pod | Content | Relevant Time Window |
|---|---|---|---|
| 10.40.39.45.log.txt | gh-commander-api-f4b748487-7l57c | Log of outbound linkerd-proxy |
18:49:02Z - 18:49:54Z |
| 10.40.39.45.strace.txt | gh-commander-api-f4b748487-7l57c | Strace output of outbound linkerd-proxy |
18:49:02Z - 18:49:54Z |
| 10.40.42.129.log.txt | linkerd-destination-dc474b4bb-jsw7j | Log of inbound linkerd-proxy |
18:49:02Z - 18:49:14Z |
| 10.40.42.129.strace.txt | linkerd-destination-dc474b4bb-jsw7j | Strace output of inbound linkerd-proxy |
18:49:02Z - 18:49:14Z |
Note, the log files are huge and do not fit into a gist, so they are links to the linkerd Slack channel and you might only be able to read them if you have access to that channel.
The following table summarises the key events from the above log and trace files in chronological order for the TCP
connection 10.40.39.45:34506 ➔ 10.40.42.129:8086:
| ID | Time | Event | Actor | Remarks |
|---|---|---|---|---|
| 1 | 18:49:02Z | connect |
outbound |
Outbound linkerd-proxy opens TCP connection |
| 2 | 18:49:02Z | epoll_ctl |
outbound |
Outbound linkerd-proxy registers epoll with file descriptor 5 |
| 3 | 18:49:02Z | accept |
inbound |
Inbound linkerd-proxy accepts TCP connection |
| 4 | 18:49:02Z | epoll_ctl |
inbound |
Inbound linkerd-proxy registers epoll |
| 5 | 18:49:12Z | read-timeout |
inbound |
Inbound linkerd-proxy closes TCP connection witout having received any data from it. |
| 6 | 18:49:12Z | epoll_ctl |
inbound |
Inbound linkerd-proxy deregisters epoll |
| 7 | 18:49:12Z | close |
inbound |
Inbound linkerd-proxy closes connection |
| 8 | 18:49:12Z | log |
inbound |
Inbound linkerd-proxy logs TLS detection timed out |
| 9 | 18:49:54Z | writev |
outbound |
Outbound linkerd-proxy writes TLS handshake data with SNI value linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local |
| 10 | 18:49:54Z | recvfrom |
outbound |
Outbound linkerd-proxy tries to read from connection and realises that the peer closed it. |
| 11 | 18:49:54Z | epoll_ctl |
outbound |
Outbound linkerd-proxy deregisters epoll from file descriptor 5 |
| 12 | 18:49:54Z | close |
outbound |
Outbound linkerd-proxy closes connection |
linkerd check output
Linkerd core checks
===================
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match
Status check results are √
Environment
- Kubernetes Version: 1.19.6
- Cluster Environment: EKS
- Host OS: Amazon Linux 2
- Linkerd version: edge-21.5.1
Possible solution
Additional context
This issue is based on this Slack thread.