
Outbound proxy delays writing of TLS handshake message  #6146

@alex-berger

Description


Bug Report

The outbound linkerd-proxy sometimes does not call epoll_wait on a certain epoll file descriptor, which results in very large delays between establishing a connection and sending the TLS handshake client_hello message. This in turn causes the inbound linkerd-proxy on the other side to close that connection, as it times out after 10 seconds waiting to read the TLS/SNI message.

What is the issue?

Interpreting the log and trace files, with the key events summarized in the table below, we can see that:

  • The outbound proxy successfully opens a TCP connection, but then waits 52 seconds before it writes the first byte of data into that connection (the TLS handshake client_hello message).
  • The inbound proxy immediately accepts the connection from the outbound proxy and waits 10 seconds for any data to arrive. As no data arrives within 10 seconds, it times out, closes the connection and writes the "TLS detection timed out" log message.
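For intuition, the inbound side's behaviour can be sketched as follows. This is a minimal Python sketch of the idea, not the proxy's actual Rust implementation; the function name and buffer size are made up:

```python
import socket
from typing import Optional

def read_first_bytes(conn: socket.socket, timeout: float = 10.0) -> Optional[bytes]:
    """Wait up to `timeout` seconds for the first bytes on a freshly
    accepted connection (the peer's client_hello, if it speaks TLS).
    Returns None on timeout, mirroring the proxy's "TLS detection
    timed out" behaviour, after which the caller closes the connection."""
    conn.settimeout(timeout)
    try:
        data = conn.recv(4096)
    except socket.timeout:
        return None  # no client_hello arrived in time -> give up
    finally:
        conn.settimeout(None)
    return data or None
```

With the outbound side silent for 52 seconds, this kind of detection logic has no choice but to give up at the 10-second mark.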

It looks like this is a strange bug in the outbound proxy which, in some cases, leads to a very long delay between establishing a connection and writing the TLS handshake. What is very suspicious is that there is no epoll_wait system call on the outbound proxy for epoll file descriptor 5 (the one the affected TCP connection is registered with).

I can only speculate on what causes this behaviour, some ideas:

  • A bug in the underlying async runtime (Tokio or Tower) which results in some epoll file descriptors not being polled (epoll_wait is not called for those file descriptors).
  • ... your idea goes here ...
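For background on the suspected failure mode: with epoll, registering a file descriptor (epoll_ctl) and collecting its readiness events (epoll_wait) are separate steps, and a registered descriptor whose epoll instance is never waited on will never wake anything up. A minimal illustration via Python's select.epoll wrapper (Linux only):

```python
import os
import select

ep = select.epoll()              # epoll_create1
r, w = os.pipe()
ep.register(r, select.EPOLLIN)   # epoll_ctl(EPOLL_CTL_ADD, ...)
os.write(w, b"x")                # fd `r` is now readable ...
events = ep.poll(1.0)            # ... but only epoll_wait reports it
assert events and events[0][0] == r

ep.unregister(r)                 # epoll_ctl(EPOLL_CTL_DEL, ...)
ep.close()
os.close(r); os.close(w)
```

If the runtime never reaches the `poll()` step for this epoll instance — which is what the strace below suggests for fd 5 — the readiness of the registered connection is simply never observed.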

How can it be reproduced?

???

Logs, error output, etc

In order to narrow down what is going on, I enabled log level TRACE on the outbound and
inbound linkerd-proxy sidecars of the following two Pods in our cluster:

| Pod | IP Address | Role |
|---|---|---|
| gh-commander-api-f4b748487-7l57c | 10.40.39.45 | outbound |
| linkerd-destination-dc474b4bb-jsw7j | 10.40.42.129 | inbound |

Furthermore, I started an strace session for the Unix process of each of
these two linkerd-proxy instances to capture all system calls.
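To correlate the two traces, it helps to pull out just the syscalls that touch a given file descriptor. A small helper along these lines can do it (a sketch only; the line format is assumed from typical `strace -tt` output and may need adjusting, e.g. for the pid column that `-f` prepends):

```python
import re
from typing import Iterable, List, Tuple

def syscalls_for_fd(strace_lines: Iterable[str], fd: int) -> List[Tuple[str, str]]:
    """Return (timestamp, syscall) pairs for syscalls whose first
    argument is the given file descriptor, e.g. epoll_ctl(5, ...)."""
    # Assumed format: "18:49:02.123456 epoll_ctl(5, EPOLL_CTL_ADD, 12, ...) = 0"
    pat = re.compile(r"^(\S+)\s+(\w+)\((\d+)[,)]")
    out = []
    for line in strace_lines:
        m = pat.match(line.strip())
        if m and int(m.group(3)) == fd:
            out.append((m.group(1), m.group(2)))
    return out
```

Running this over the outbound trace for fd 5 is what makes the missing epoll_wait calls stand out: only epoll_ctl registrations and deregistrations show up.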

After running for some minutes, the problem appeared again at 2021-05-19T18:49:12Z, indicated
by a log message on the linkerd-destination Pod containing the string TLS detection timed out.
Confident that we had captured an instance of the problematic event, I stopped the log and strace
capturing, which resulted in the following files:

| Filename | Pod | Content | Relevant Time Window |
|---|---|---|---|
| 10.40.39.45.log.txt | gh-commander-api-f4b748487-7l57c | Log of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.39.45.strace.txt | gh-commander-api-f4b748487-7l57c | Strace output of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.42.129.log.txt | linkerd-destination-dc474b4bb-jsw7j | Log of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |
| 10.40.42.129.strace.txt | linkerd-destination-dc474b4bb-jsw7j | Strace output of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |

Note: the log files are huge and do not fit into a gist, so the links above point to the linkerd Slack channel; you may only be able to read them if you have access to that channel.

The following table summarises the key events from the above log and trace files in chronological order for the TCP
connection 10.40.39.45:34506 → 10.40.42.129:8086:

| ID | Time | Event | Actor | Remarks |
|---|---|---|---|---|
| 1 | 18:49:02Z | connect | outbound | Outbound linkerd-proxy opens TCP connection |
| 2 | 18:49:02Z | epoll_ctl | outbound | Outbound linkerd-proxy registers the connection with epoll file descriptor 5 |
| 3 | 18:49:02Z | accept | inbound | Inbound linkerd-proxy accepts TCP connection |
| 4 | 18:49:02Z | epoll_ctl | inbound | Inbound linkerd-proxy registers epoll |
| 5 | 18:49:12Z | read-timeout | inbound | Inbound linkerd-proxy's read times out without any data having been received |
| 6 | 18:49:12Z | epoll_ctl | inbound | Inbound linkerd-proxy deregisters epoll |
| 7 | 18:49:12Z | close | inbound | Inbound linkerd-proxy closes connection |
| 8 | 18:49:12Z | log | inbound | Inbound linkerd-proxy logs TLS detection timed out |
| 9 | 18:49:54Z | writev | outbound | Outbound linkerd-proxy writes TLS handshake data with SNI value linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local |
| 10 | 18:49:54Z | recvfrom | outbound | Outbound linkerd-proxy tries to read from the connection and realises that the peer closed it |
| 11 | 18:49:54Z | epoll_ctl | outbound | Outbound linkerd-proxy deregisters epoll from file descriptor 5 |
| 12 | 18:49:54Z | close | outbound | Outbound linkerd-proxy closes connection |
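The 52-second gap cited above follows directly from the timestamps of event 1 (connect) and event 9 (writev):

```python
from datetime import datetime

def delay_seconds(t_connect: str, t_write: str) -> float:
    """Gap between opening the connection and the first write,
    from the HH:MM:SSZ timestamps used in the event table."""
    fmt = "%H:%M:%SZ"
    return (datetime.strptime(t_write, fmt)
            - datetime.strptime(t_connect, fmt)).total_seconds()
```

So the inbound proxy's 10-second TLS detection window expired 42 seconds before the outbound proxy ever wrote the client_hello.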

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.19.6
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: edge-21.5.1

Possible solution

Additional context

This issue is based on this Slack thread.
