Add ability to configure client-go's QPS and Burst settings #11644

Merged
## Problem and Symptoms

When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller attempts to validate the tokens sent by the proxies at a rate surpassing client-go's default request rate threshold. This triggers client-side throttling, which delays the proxies' initialization and can even fail their startup (after a 2m timeout). The identity controller surfaces this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely on informers, and their direct calls are sporadic.

## Observability

The `QPS` and `Burst` settings in use are exposed both as a log entry emitted as soon as the controller starts, and as the new metric gauges `http_client_qps` and `http_client_burst`.

## Testing

You can use the following k6 script, which simulates 6k calls to the `Certify` service over one minute from emojivoto's web pod. Before running this you need to:

- Put `identity.proto` and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure out the `token` and `csr` entries you can use here; they will be shown as soon as a web pod starts.

```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration) * duration * VUs
// Given req_duration (0.5s), test duration (1m) and the target req_num (6k),
// we can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";

  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
    identity: "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token: token,
    certificate_signing_request: csr,
  };

  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine the
  // iteration's final duration
  sleep(0.5);
};
```

This results in the following report:

```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
         * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

data_received................: 6.3 MB 104 kB/s
data_sent....................: 9.4 MB 156 kB/s
grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
grpc_streams.................: 6000   99.355331/s
grpc_streams_msgs_received...: 6000   99.355331/s
grpc_streams_msgs_sent.......: 6000   99.355331/s
iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
iterations...................: 6000   99.355331/s
vus..........................: 50     min=50 max=50
vus_max......................: 50     min=50 max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of complete requests much lower.
adleong
approved these changes
Nov 28, 2023
olix0r
reviewed
Nov 28, 2023
Member
olix0r
left a comment
Thinking forward a little bit... we can imagine we may need to introduce similar configurations to other controllers in the future.
It might be nice to support having a linkerd default bundle of settings that can be overridden for each controller.
Do you foresee any problems if we introduce a top-level kubeAPI structure with a default clientQPS and clientBurst, and then support overriding the kubeAPI config on the identity controller?
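A sketch of what that layering might look like in `values.yaml`. The key names below are hypothetical, illustrating the proposal rather than any shipped schema:

```yaml
# Hypothetical: cluster-wide defaults shared by all control-plane clients
kubeAPI:
  clientQPS: 100
  clientBurst: 200

# Hypothetical: per-controller override for the identity controller
identity:
  kubeAPI:
    clientQPS: 250
    clientBurst: 500
```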
olix0r
approved these changes
Nov 28, 2023
mateiidavid
added a commit
that referenced
this pull request
Nov 30, 2023
This edge release introduces new configuration values in the identity controller for client-go's `QPS` and `Burst` settings. Default values for these settings have also been raised from `5` (QPS) and `10` (Burst) to `100` and `200` respectively.

* Added `namespaceSelector` fields for the tap-injector and jaeger-injector webhooks. The webhooks are now configured to skip `kube-system` by default ([#11649]; fixes [#11647]) (thanks @mikutas!)
* Added the ability to configure client-go's `QPS` and `Burst` settings in the identity controller ([#11644])
* Improved client-go logging visibility throughout the control plane's components ([#11632])
* Introduced `PodDisruptionBudgets` in the linkerd-viz Helm chart for tap and tap-injector ([#11628]; fixes [#11248]) (thanks @mcharriere!)

[#11649]: #11649
[#11647]: #11647
[#11644]: #11644
[#11632]: #11632
[#11628]: #11628
[#11248]: #11248

Signed-off-by: Matei David <[email protected]>
Merged
mateiidavid
added a commit
that referenced
this pull request
Dec 1, 2023
This edge release introduces new configuration values in the identity controller for client-go's `QPS` and `Burst` settings.
olix0r
pushed a commit
that referenced
this pull request
Dec 6, 2023
* Add ability to configure client-go's `QPS` and `Burst` settings
Merged
olix0r
added a commit
that referenced
this pull request
Dec 7, 2023
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)
* b24b0e97c stable-2.14.6
olix0r
added a commit
that referenced
this pull request
Dec 7, 2023
olix0r
added a commit
that referenced
this pull request
Dec 7, 2023
stable-2.14.6