
When edges command tests fail print all pods for additional debugging information.#6627

Merged
kleimkuhler merged 4 commits into main from kleimkuhler/retry-prometheus-edges on Aug 10, 2021
Conversation

Contributor

@kleimkuhler kleimkuhler commented Aug 6, 2021

Since #6574 merged recently, the `TestDirectEdges` test has been flaky. Occasionally the test fails with the following error:

```
2021-08-05T22:08:06.2860090Z --- FAIL: TestDirectEdges (73.39s)
2021-08-05T22:08:06.2860719Z     edges_test.go:161: Expected output:
2021-08-05T22:08:06.2861172Z         \[
2021-08-05T22:08:06.2861480Z           \{
2021-08-05T22:08:06.2861884Z             "src": "prometheus",
2021-08-05T22:08:06.2862642Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2863356Z             "dst": "slow-cooker",
2021-08-05T22:08:06.2864211Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2865267Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2866374Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2867120Z             "no_tls_reason": ""
2021-08-05T22:08:06.2867492Z           \},
2021-08-05T22:08:06.2867804Z           \{
2021-08-05T22:08:06.2868211Z             "src": "prometheus",
2021-08-05T22:08:06.2868933Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2869481Z             "dst": "terminus",
2021-08-05T22:08:06.2870311Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2871318Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2872435Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2873180Z             "no_tls_reason": ""
2021-08-05T22:08:06.2873563Z           \},
2021-08-05T22:08:06.2873888Z           \{
2021-08-05T22:08:06.2874421Z             "src": "slow-cooker",
2021-08-05T22:08:06.2875282Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2875936Z             "dst": "terminus",
2021-08-05T22:08:06.2876751Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2877906Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879151Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879887Z             "no_tls_reason": ""
2021-08-05T22:08:06.2880258Z           \}
2021-08-05T22:08:06.2880579Z         \]
2021-08-05T22:08:06.2880885Z
2021-08-05T22:08:06.2881227Z         actual:
2021-08-05T22:08:06.2881560Z         [
2021-08-05T22:08:06.2881876Z           {
2021-08-05T22:08:06.2882429Z             "src": "slow-cooker",
2021-08-05T22:08:06.2883268Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2883925Z             "dst": "terminus",
2021-08-05T22:08:06.2884759Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2885901Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887153Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887896Z             "no_tls_reason": ""
2021-08-05T22:08:06.2888250Z           }
2021-08-05T22:08:06.2888572Z         ]
2021-08-05T22:08:06.2888870Z
```

What is happening is that Prometheus has not yet scraped the workloads, so there are no edges to or from Prometheus and slow-cooker.

While it remains unclear why exactly this happens, this change addresses two potential problems:

  1. Print the cluster's pods so that we can determine if the Prometheus pod is injected. If it's not injected, we'll be able to debug further with that information.
  2. Introduce retries to the `TestEdges` test, mirroring the retries that `TestDirectEdges` already has.

Signed-off-by: Kevin Leimkuhler [email protected]

@kleimkuhler kleimkuhler requested a review from a team as a code owner August 6, 2021 17:16
@codecov-commenter

codecov-commenter commented Aug 6, 2021

Codecov Report

Merging #6627 (32a5c63) into main (6da81f9) will increase coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6627      +/-   ##
==========================================
+ Coverage   45.33%   45.38%   +0.05%     
==========================================
  Files         182      182              
  Lines       24273    24443     +170     
  Branches      290      290              
==========================================
+ Hits        11004    11094      +90     
- Misses      12492    12558      +66     
- Partials      777      791      +14     
Flag Coverage Δ
golang 44.09% <ø> (+0.06%) ⬆️
javascript 66.76% <ø> (ø)
unittests 45.38% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
viz/metrics-api/stat_summary.go 82.24% <0.00%> (-4.83%) ⬇️
viz/cmd/stat.go 58.42% <0.00%> (-2.99%) ⬇️
controller/api/destination/watcher/test_util.go 75.00% <0.00%> (-1.20%) ⬇️
pkg/k8s/labels.go 65.51% <0.00%> (ø)
viz/metrics-api/prometheus.go 82.39% <0.00%> (ø)
pkg/k8s/api.go 29.21% <0.00%> (+0.95%) ⬆️


Member

@alpeb alpeb left a comment


LGTM 👍

Member

@adleong adleong left a comment


I don't understand how the "Expected Prometheus edges" error is functionally different from the "Expected output:" error. In either case, I would expect the test to be retried up to the timeout. How does this help the flakiness?

Signed-off-by: Kevin Leimkuhler <[email protected]>
@kleimkuhler
Contributor Author

Hm, that's a good point; I guess we're already retrying on the `Expected output ...` error. I was trying to avoid introducing a sleep, but maybe that's our only option here.

A 10-second sleep at the start of the function that we pass into the retry may be best.

@alpeb
Member

alpeb commented Aug 10, 2021

@kleimkuhler The last time I tried to address this flakiness was in #6575. The problem was that the Prometheus pod wasn't being injected. Maybe that's still the case? Given this has been such a persistent issue, it might be worth adding more debugging info when this test fails, and leave that debugging there for good. Like showing the output of kubectl get po -A just to see if the pods are injected.

@kleimkuhler
Contributor Author

If that's still the case, I've added `kubectl get pods -A` output to the `Expected output ...` error. I removed the sleep I introduced so that we have a better chance of reproducing this and can see what the cluster's pod state is.

Signed-off-by: Kevin Leimkuhler <[email protected]>
Member

@adleong adleong left a comment


Listing the pods in the error message seems fine, but we should be careful when interpreting this information. There will be a period of time when the Prometheus pod exists but has not yet scraped anything or created any edges. Prometheus pods which exist but are not running (for example, if they have been evicted) will also show here, I believe, which could be confusing.

@kleimkuhler
Contributor Author

Yep, all good points. I think the main thing we'll be able to take away, as Alejandro mentioned, is whether the Prometheus pod is getting injected or not.

By the way, I added retries to the `TestEdges` test because it actually failed on the previous commit for this same lack-of-edges reason. That may not have been obvious if you hadn't had a chance to look at the previous CI failure.

@kleimkuhler kleimkuhler changed the title Add check that there are edges to/from Prometheus in TestDirectEdges test When edges command tests fail print the cluster pods for additional debugging information. Aug 10, 2021
@kleimkuhler kleimkuhler changed the title When edges command tests fail print the cluster pods for additional debugging information. When edges command tests fail print all pods for additional debugging information. Aug 10, 2021
@kleimkuhler kleimkuhler merged commit 00187fa into main Aug 10, 2021
@kleimkuhler kleimkuhler deleted the kleimkuhler/retry-prometheus-edges branch August 10, 2021 20:06
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 10, 2021
…ng information. (linkerd#6627)

Pothulapati added a commit that referenced this pull request Aug 11, 2021
Looked at the CI failures after #6627
was merged, and in all the cases the external prometheus pod
wasn't injected.

This updates the `external-prometheus-deep` test to manually
inject the `external-prometheus` deployment instead of having
the injection be nondeterministic.

Signed-off-by: Tarun Pothulapati <[email protected]>
Pothulapati added a commit that referenced this pull request Aug 12, 2021
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021
…ng information. (linkerd#6627)

sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021