When edges command tests fail, print all pods for additional debugging information. #6627
Conversation
Signed-off-by: Kevin Leimkuhler <[email protected]>
Codecov Report
```
@@            Coverage Diff             @@
##             main    #6627      +/-   ##
==========================================
+ Coverage   45.33%   45.38%   +0.05%
==========================================
  Files         182      182
  Lines       24273    24443     +170
  Branches      290      290
==========================================
+ Hits        11004    11094      +90
- Misses      12492    12558      +66
- Partials      777      791      +14
```
adleong
left a comment
I don't understand how the "Expected Prometheus edges" error is functionally different from the "Expected output:" error. In either case, I would expect the test to be retried up to the timeout. How does this help the flakiness?
Signed-off-by: Kevin Leimkuhler <[email protected]>
Hm, that's a good point. I guess we're already retrying. A 10 second sleep at the start of the function that we pass into the retry may be best.
@kleimkuhler The last time I tried to address this flakiness was in #6575. The problem was that the Prometheus pod wasn't being injected. Maybe that's still the case? Given this has been such a persistent issue, it might be worth adding more debugging info when this test fails, and leaving that debugging there for good. Like showing the output of …
Signed-off-by: Kevin Leimkuhler <[email protected]>
If that's still the case, I've added …
Signed-off-by: Kevin Leimkuhler <[email protected]>
adleong
left a comment
Listing the pods in the error message seems fine, but we should be careful when interpreting this information. There will be a period of time when the Prometheus pod exists but has not yet scraped anything or created any edges. Prometheus pods which exist but are not running (for example, if they have been evicted) will also show here, I believe, which could be confusing.
Yep, all good points. I think the main thing we'll be able to take away, as Alejandro mentioned, is whether or not the Prometheus pod is getting injected. By the way, I added retries to the `TestEdges` test.
…ng information. (linkerd#6627)

After linkerd#6574 recently merged, the `TestDirectEdges` test has been flaky. Occasionally the test fails with the following error:

```
2021-08-05T22:08:06.2860090Z --- FAIL: TestDirectEdges (73.39s)
2021-08-05T22:08:06.2860719Z edges_test.go:161: Expected output:
2021-08-05T22:08:06.2861172Z [
2021-08-05T22:08:06.2861480Z   {
2021-08-05T22:08:06.2861884Z     "src": "prometheus",
2021-08-05T22:08:06.2862642Z     "src_namespace": "external-prometheus",
2021-08-05T22:08:06.2863356Z     "dst": "slow-cooker",
2021-08-05T22:08:06.2864211Z     "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2865267Z     "client_id": "prometheus.external-prometheus",
2021-08-05T22:08:06.2866374Z     "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2867120Z     "no_tls_reason": ""
2021-08-05T22:08:06.2867492Z   },
2021-08-05T22:08:06.2867804Z   {
2021-08-05T22:08:06.2868211Z     "src": "prometheus",
2021-08-05T22:08:06.2868933Z     "src_namespace": "external-prometheus",
2021-08-05T22:08:06.2869481Z     "dst": "terminus",
2021-08-05T22:08:06.2870311Z     "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2871318Z     "client_id": "prometheus.external-prometheus",
2021-08-05T22:08:06.2872435Z     "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2873180Z     "no_tls_reason": ""
2021-08-05T22:08:06.2873563Z   },
2021-08-05T22:08:06.2873888Z   {
2021-08-05T22:08:06.2874421Z     "src": "slow-cooker",
2021-08-05T22:08:06.2875282Z     "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2875936Z     "dst": "terminus",
2021-08-05T22:08:06.2876751Z     "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2877906Z     "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879151Z     "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879887Z     "no_tls_reason": ""
2021-08-05T22:08:06.2880258Z   }
2021-08-05T22:08:06.2880579Z ]
2021-08-05T22:08:06.2880885Z
2021-08-05T22:08:06.2881227Z actual:
2021-08-05T22:08:06.2881560Z [
2021-08-05T22:08:06.2881876Z   {
2021-08-05T22:08:06.2882429Z     "src": "slow-cooker",
2021-08-05T22:08:06.2883268Z     "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2883925Z     "dst": "terminus",
2021-08-05T22:08:06.2884759Z     "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2885901Z     "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887153Z     "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887896Z     "no_tls_reason": ""
2021-08-05T22:08:06.2888250Z   }
2021-08-05T22:08:06.2888572Z ]
```

What is happening is that Prometheus has not yet scraped the workloads, so there is no edge to or from Prometheus and slow-cooker. While it remains unclear why exactly this happens, this change addresses two potential problems:

1. Print the cluster's pods so that we can determine if the Prometheus pod is injected. If it's not injected, we'll be able to debug further with that information.
2. Introduce retries in the `TestEdges` test. As we already have retries in `TestDirectEdges`, this makes a similar change to `TestEdges`.

Signed-off-by: Kevin Leimkuhler <[email protected]>
Looked at the CI failures after #6627 was merged, and in all the cases the external prometheus pod wasn't injected. This updates the `external-prometheus-deep` test to manually inject the `external-prometheus` deployment instead of having the injection be nondeterministic. Signed-off-by: Tarun Pothulapati <[email protected]>
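Making the injection deterministic, as described above, would mean running the manifest through `linkerd inject` before applying it rather than relying on the injection webhook. A minimal CLI sketch (the manifest filename is hypothetical):

```shell
# Inject the proxy sidecar into the rendered YAML at apply time,
# so the result no longer depends on the webhook racing pod creation.
linkerd inject external_prometheus.yaml | kubectl apply -f -
```

With webhook-based injection, a pod created before the webhook is ready can silently come up without the sidecar; injecting the manifest itself removes that race.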