Skip to content

Refactor edges command#6574

Merged
adleong merged 8 commits intomainfrom
alex/the-edges-of-tomorrow
Aug 5, 2021
Merged

Refactor edges command#6574
adleong merged 8 commits intomainfrom
alex/the-edges-of-tomorrow

Conversation

@adleong
Copy link
Member

@adleong adleong commented Jul 28, 2021

Fixes #3706

The implementation of the linkerd viz edges command works by gathering http and tcp metrics in both the inbound and outbound directions and combining this data in dubious ways.

We make the implementation simpler and more correct by instead doing the following:

  • Gather tcp metrics only
    • (this drops support for very old proxy versions which do not expose the tcp_open_connections metric)
  • Gather outbound metrics only
    • (all meshed edges will have a src in the mesh and will be present in the outbound metrics)
  • Outbound metrics do not have a client_id label, so we fill in this missing data by inspecting the source pod via the k8s api and reconstruct that pod's TLS identity based on it's service account name and namespace.

Signed-off-by: Alex Leong [email protected]

Signed-off-by: Alex Leong <[email protected]>
@adleong adleong requested a review from a team as a code owner July 28, 2021 22:19
@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2021

Codecov Report

Merging #6574 (fcce58e) into main (0b74e13) will decrease coverage by 0.05%.
The diff coverage is 64.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6574      +/-   ##
==========================================
- Coverage   45.39%   45.33%   -0.06%     
==========================================
  Files         182      182              
  Lines       24364    24272      -92     
  Branches      290      290              
==========================================
- Hits        11059    11004      -55     
+ Misses      12525    12491      -34     
+ Partials      780      777       -3     
Flag Coverage Δ
golang 44.02% <64.38%> (-0.07%) ⬇️
javascript 66.76% <ø> (ø)
unittests 45.33% <64.38%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cli/cmd/identity.go 18.63% <0.00%> (+1.96%) ⬆️
pkg/k8s/k8s.go 22.66% <0.00%> (-7.70%) ⬇️
viz/metrics-api/test_helper.go 91.28% <ø> (ø)
viz/metrics-api/edges.go 73.25% <87.75%> (+6.77%) ⬆️
viz/metrics-api/prometheus.go 82.39% <100.00%> (+0.12%) ⬆️
cli/cmd/install.go 49.44% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b74e13...fcce58e. Read the comment docs.

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice refactor, this simplified approach makes a lot of better sense, plus I see now some edges that were missing before 👍

I'm curious about the replacement of tcp_open_total with tcp_open_connections; does this mean we only care about current live connections and not about all the connections during the pods lifetimes?

Comment on lines 72 to 75
clientID, err := s.getPodIdentity(string(sample.Metric[model.LabelName("pod")]), key.srcNs)
if err != nil {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just log a warning and skip the sample in case the pod referred in a metric is no longer found, instead of erroring. Even if tcp_open_connections means current connections I did see errors after scaling down pods and calling linkerd viz edges shortly after.

Copy link
Contributor

@Pothulapati Pothulapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good, and also makes edges work with namespaces, etc as expected.

but there seems to be a bit of staleness issue. (was this already known?)

I installed the emojivoto application, and scaled up the web deployment. Edges resources appear as we would expect.

on ⛵ kind-kind  linkerd2 on 🌱 alex [📦📝🤷‍] via 🐼 v1.16.6 via 
➜ k -n emojivoto scale deploy/web --replicas 3                                                                                                                            ~/work/linkerd2
deployment.apps/web scaled

on ⛵ kind-kind  linkerd2 on 🌱 alex [📦📝🤷‍] via 🐼 v1.16.6 via 
➜ ./bin/go-run cli viz edges po -n emojivoto -owide                                                                                                                       ~/work/linkerd2
SRC                           DST                         SRC_NS        DST_NS      CLIENT_ID                SERVER_ID           SECURED
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-5f9k4        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-6r7gd        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-7wbsg        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-h6z7m        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-p4kpl        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
web-5f86686c4d-5f9k4          emoji-696d9d8f95-vvkjb      emojivoto     emojivoto   web.emojivoto            emoji.emojivoto     √
web-5f86686c4d-5f9k4          voting-ff4c54b8d-746nc      emojivoto     emojivoto   web.emojivoto            voting.emojivoto    √
web-5f86686c4d-6r7gd          emoji-696d9d8f95-vvkjb      emojivoto     emojivoto   web.emojivoto            emoji.emojivoto     √
web-5f86686c4d-6r7gd          voting-ff4c54b8d-746nc      emojivoto     emojivoto   web.emojivoto            voting.emojivoto    √
web-5f86686c4d-7wbsg          emoji-696d9d8f95-vvkjb      emojivoto     emojivoto   web.emojivoto            emoji.emojivoto     √
web-5f86686c4d-7wbsg          voting-ff4c54b8d-746nc      emojivoto     emojivoto   web.emojivoto            voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   emoji-696d9d8f95-vvkjb      linkerd-viz   emojivoto   prometheus.linkerd-viz   emoji.emojivoto     √
prometheus-759b9b47f8-d5g7j   vote-bot-6d7677bb68-zwc2v   linkerd-viz   emojivoto   prometheus.linkerd-viz   default.emojivoto   √
prometheus-759b9b47f8-d5g7j   voting-ff4c54b8d-746nc      linkerd-viz   emojivoto   prometheus.linkerd-viz   voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-5f9k4        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-6r7gd        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-7wbsg        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-h6z7m        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-p4kpl        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √

Now, after scaling down, The client side metrics for web pods are gone but where its the server side, they still continue exist.

on ⛵ kind-kind  linkerd2 on 🌱 alex [📦📝🤷‍] via 🐼 v1.16.6 via 
➜ k -n emojivoto scale deploy/web --replicas 1                                  

on ⛵ kind-kind  linkerd2 on 🌱 alex [📦📝🤷‍] via 🐼 v1.16.6 via 
❯ ./bin/go-run cli viz edges po -n emojivoto -owide                                                                                                                       ~/work/linkerd2
SRC                           DST                         SRC_NS        DST_NS      CLIENT_ID                SERVER_ID           SECURED
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-5f9k4        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-6r7gd        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-7wbsg        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-h6z7m        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-p4kpl        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
web-5f86686c4d-5f9k4          emoji-696d9d8f95-vvkjb      emojivoto     emojivoto   web.emojivoto            emoji.emojivoto     √
web-5f86686c4d-5f9k4          voting-ff4c54b8d-746nc      emojivoto     emojivoto   web.emojivoto            voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   emoji-696d9d8f95-vvkjb      linkerd-viz   emojivoto   prometheus.linkerd-viz   emoji.emojivoto     √
prometheus-759b9b47f8-d5g7j   vote-bot-6d7677bb68-zwc2v   linkerd-viz   emojivoto   prometheus.linkerd-viz   default.emojivoto   √
prometheus-759b9b47f8-d5g7j   voting-ff4c54b8d-746nc      linkerd-viz   emojivoto   prometheus.linkerd-viz   voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-5f9k4        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-6r7gd        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-7wbsg        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-h6z7m        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-p4kpl        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √

This also seems to be a problem with current linkerd binary

on ⛵ kind-kind  linkerd2 on 🌱 alex [📦📝🤷‍] via 🐼 v1.16.6 via 
➜ linkerd viz edges po -n emojivoto -owide                                                                                                                                ~/work/linkerd2
SRC                           DST                         SRC_NS        DST_NS      CLIENT_ID                SERVER_ID           SECURED
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-5f9k4        emojivoto     emojivoto   default.emojivoto        web.emojivoto       √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-6r7gd        emojivoto     emojivoto                                                √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-7wbsg        emojivoto     emojivoto                                                √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-h6z7m        emojivoto     emojivoto                                                √
vote-bot-6d7677bb68-zwc2v     web-5f86686c4d-p4kpl        emojivoto     emojivoto                                                √
web-5f86686c4d-5f9k4          emoji-696d9d8f95-vvkjb      emojivoto     emojivoto   web.emojivoto            emoji.emojivoto     √
web-5f86686c4d-5f9k4          voting-ff4c54b8d-746nc      emojivoto     emojivoto   web.emojivoto            voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   emoji-696d9d8f95-vvkjb      linkerd-viz   emojivoto   prometheus.linkerd-viz   emoji.emojivoto     √
prometheus-759b9b47f8-d5g7j   vote-bot-6d7677bb68-zwc2v   linkerd-viz   emojivoto   prometheus.linkerd-viz   default.emojivoto   √
prometheus-759b9b47f8-d5g7j   voting-ff4c54b8d-746nc      linkerd-viz   emojivoto   prometheus.linkerd-viz   voting.emojivoto    √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-5f9k4        linkerd-viz   emojivoto   prometheus.linkerd-viz   web.emojivoto       √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-6r7gd        linkerd-viz   emojivoto                                                √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-7wbsg        linkerd-viz   emojivoto                                                √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-h6z7m        linkerd-viz   emojivoto                                                √
prometheus-759b9b47f8-d5g7j   web-5f86686c4d-p4kpl        linkerd-viz   emojivoto                                                √

Should we maybe skip the metrics if we know that the pod isn't existing, or we continue to show edges from all the metric history even if they might not exist as of now (In which case, we should somehow also be showing the client side edges for the removed resource? 🤔

Copy link
Contributor

@dadjeibaah dadjeibaah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Much easier to reason about. I left a few TIOLI and LGTM modulo other reviewers comments.

}

edges = append(edges, edgesHTTP...)
edges := []*pb.Edge{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIOLI: Since we already know the length of the edges we need to add to the array, we could just initialize an array of (len(edgesMap) and then use the index in the range to populate edges.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it's more efficient to initialize the array to the final size but I find the append style to be more readable so I prefer it in cases where the performance difference isn't significant.

Comment on lines +209 to +210
var l5dns string
var l5dtrustdomain string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIOLI: could also be written as:

Suggested change
var l5dns string
var l5dtrustdomain string
var l5dns, l5dtrustdomain string

@adleong
Copy link
Member Author

adleong commented Aug 4, 2021

I've updated this to use sum instead of count for the open connections gauge. This means that the value will correctly drop to 0 when the connections are closed and the edges for deleted resources will go away.

Signed-off-by: Alex Leong <[email protected]>
Copy link
Contributor

@Pothulapati Pothulapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Also, Nice fix on using sum instead! :shipit:

Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great simplification of how edges are collected. It works well and glad to see this test back 👍

Comment on lines -29 to -32
// This test has been disabled because it can fail due to
// https://github.com/linkerd/linkerd2/issues/3706
// This test should be updated and re-enabled when that issue is addressed.
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@adleong adleong merged commit 10a36b0 into main Aug 5, 2021
@adleong adleong deleted the alex/the-edges-of-tomorrow branch August 5, 2021 18:01
kleimkuhler added a commit that referenced this pull request Aug 10, 2021
…ng information. (#6627)

After #6574 recently merging the `TestDirectEdges` test has been flaky. Occasionally the test fails for the following error:
```
2021-08-05T22:08:06.2860090Z --- FAIL: TestDirectEdges (73.39s)
2021-08-05T22:08:06.2860719Z     edges_test.go:161: Expected output:
2021-08-05T22:08:06.2861172Z         \[
2021-08-05T22:08:06.2861480Z           \{
2021-08-05T22:08:06.2861884Z             "src": "prometheus",
2021-08-05T22:08:06.2862642Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2863356Z             "dst": "slow-cooker",
2021-08-05T22:08:06.2864211Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2865267Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2866374Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2867120Z             "no_tls_reason": ""
2021-08-05T22:08:06.2867492Z           \},
2021-08-05T22:08:06.2867804Z           \{
2021-08-05T22:08:06.2868211Z             "src": "prometheus",
2021-08-05T22:08:06.2868933Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2869481Z             "dst": "terminus",
2021-08-05T22:08:06.2870311Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2871318Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2872435Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2873180Z             "no_tls_reason": ""
2021-08-05T22:08:06.2873563Z           \},
2021-08-05T22:08:06.2873888Z           \{
2021-08-05T22:08:06.2874421Z             "src": "slow-cooker",
2021-08-05T22:08:06.2875282Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2875936Z             "dst": "terminus",
2021-08-05T22:08:06.2876751Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2877906Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879151Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879887Z             "no_tls_reason": ""
2021-08-05T22:08:06.2880258Z           \}
2021-08-05T22:08:06.2880579Z         \]
2021-08-05T22:08:06.2880885Z         
2021-08-05T22:08:06.2881227Z         actual:
2021-08-05T22:08:06.2881560Z         [
2021-08-05T22:08:06.2881876Z           {
2021-08-05T22:08:06.2882429Z             "src": "slow-cooker",
2021-08-05T22:08:06.2883268Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2883925Z             "dst": "terminus",
2021-08-05T22:08:06.2884759Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2885901Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887153Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887896Z             "no_tls_reason": ""
2021-08-05T22:08:06.2888250Z           }
2021-08-05T22:08:06.2888572Z         ]
2021-08-05T22:08:06.2888870Z
```

What is happening is Prometheus has not scraped workloads yet, so there is no edge to or from Prometheus and slow-cooker.

While it's remains unclear why exactly this happens, this change addresses two potential problems:

1. Print the cluster's pods so that we can determine if the Prometheus pod is injected. If it's not injected, we'll be able to debug further with that information.
2. Introduce retries to the `TestEdges` test. As we already have retries in the `TestDirectEdges`, this makes a similar change to `TestEdges`.

Signed-off-by: Kevin Leimkuhler <[email protected]>
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 10, 2021
Fixes linkerd#3706

The implementation of the `linkerd viz edges` command works by gathering http and tcp metrics in both the inbound and outbound directions and combining this data in dubious ways.

We make the implementation simpler and more correct by instead doing the following:

* Gather tcp metrics only
  * (this drops support for very old proxy versions which do not expose the `tcp_open_connections` metric)
* Gather outbound metrics only
  * (all meshed edges will have a src in the mesh and will be present in the outbound metrics)
* Outbound metrics do not have a `client_id` label, so we fill in this missing data by inspecting the source pod via the k8s api and reconstruct that pod's TLS identity based on it's service account name and namespace.

Signed-off-by: Alex Leong <[email protected]>
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 10, 2021
…ng information. (linkerd#6627)

After linkerd#6574 recently merging the `TestDirectEdges` test has been flaky. Occasionally the test fails for the following error:
```
2021-08-05T22:08:06.2860090Z --- FAIL: TestDirectEdges (73.39s)
2021-08-05T22:08:06.2860719Z     edges_test.go:161: Expected output:
2021-08-05T22:08:06.2861172Z         \[
2021-08-05T22:08:06.2861480Z           \{
2021-08-05T22:08:06.2861884Z             "src": "prometheus",
2021-08-05T22:08:06.2862642Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2863356Z             "dst": "slow-cooker",
2021-08-05T22:08:06.2864211Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2865267Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2866374Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2867120Z             "no_tls_reason": ""
2021-08-05T22:08:06.2867492Z           \},
2021-08-05T22:08:06.2867804Z           \{
2021-08-05T22:08:06.2868211Z             "src": "prometheus",
2021-08-05T22:08:06.2868933Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2869481Z             "dst": "terminus",
2021-08-05T22:08:06.2870311Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2871318Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2872435Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2873180Z             "no_tls_reason": ""
2021-08-05T22:08:06.2873563Z           \},
2021-08-05T22:08:06.2873888Z           \{
2021-08-05T22:08:06.2874421Z             "src": "slow-cooker",
2021-08-05T22:08:06.2875282Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2875936Z             "dst": "terminus",
2021-08-05T22:08:06.2876751Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2877906Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879151Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879887Z             "no_tls_reason": ""
2021-08-05T22:08:06.2880258Z           \}
2021-08-05T22:08:06.2880579Z         \]
2021-08-05T22:08:06.2880885Z         
2021-08-05T22:08:06.2881227Z         actual:
2021-08-05T22:08:06.2881560Z         [
2021-08-05T22:08:06.2881876Z           {
2021-08-05T22:08:06.2882429Z             "src": "slow-cooker",
2021-08-05T22:08:06.2883268Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2883925Z             "dst": "terminus",
2021-08-05T22:08:06.2884759Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2885901Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887153Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887896Z             "no_tls_reason": ""
2021-08-05T22:08:06.2888250Z           }
2021-08-05T22:08:06.2888572Z         ]
2021-08-05T22:08:06.2888870Z
```

What is happening is Prometheus has not scraped workloads yet, so there is no edge to or from Prometheus and slow-cooker.

While it's remains unclear why exactly this happens, this change addresses two potential problems:

1. Print the cluster's pods so that we can determine if the Prometheus pod is injected. If it's not injected, we'll be able to debug further with that information.
2. Introduce retries to the `TestEdges` test. As we already have retries in the `TestDirectEdges`, this makes a similar change to `TestEdges`.

Signed-off-by: Kevin Leimkuhler <[email protected]>
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021
Fixes linkerd#3706

The implementation of the `linkerd viz edges` command works by gathering http and tcp metrics in both the inbound and outbound directions and combining this data in dubious ways.

We make the implementation simpler and more correct by instead doing the following:

* Gather tcp metrics only
  * (this drops support for very old proxy versions which do not expose the `tcp_open_connections` metric)
* Gather outbound metrics only
  * (all meshed edges will have a src in the mesh and will be present in the outbound metrics)
* Outbound metrics do not have a `client_id` label, so we fill in this missing data by inspecting the source pod via the k8s api and reconstruct that pod's TLS identity based on it's service account name and namespace.

Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Sanni Michael <[email protected]>
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021
…ng information. (linkerd#6627)

After linkerd#6574 recently merging the `TestDirectEdges` test has been flaky. Occasionally the test fails for the following error:
```
2021-08-05T22:08:06.2860090Z --- FAIL: TestDirectEdges (73.39s)
2021-08-05T22:08:06.2860719Z     edges_test.go:161: Expected output:
2021-08-05T22:08:06.2861172Z         \[
2021-08-05T22:08:06.2861480Z           \{
2021-08-05T22:08:06.2861884Z             "src": "prometheus",
2021-08-05T22:08:06.2862642Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2863356Z             "dst": "slow-cooker",
2021-08-05T22:08:06.2864211Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2865267Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2866374Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2867120Z             "no_tls_reason": ""
2021-08-05T22:08:06.2867492Z           \},
2021-08-05T22:08:06.2867804Z           \{
2021-08-05T22:08:06.2868211Z             "src": "prometheus",
2021-08-05T22:08:06.2868933Z             "src_namespace": "external\-prometheus",
2021-08-05T22:08:06.2869481Z             "dst": "terminus",
2021-08-05T22:08:06.2870311Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2871318Z             "client_id": "prometheus.external\-prometheus",
2021-08-05T22:08:06.2872435Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2873180Z             "no_tls_reason": ""
2021-08-05T22:08:06.2873563Z           \},
2021-08-05T22:08:06.2873888Z           \{
2021-08-05T22:08:06.2874421Z             "src": "slow-cooker",
2021-08-05T22:08:06.2875282Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2875936Z             "dst": "terminus",
2021-08-05T22:08:06.2876751Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2877906Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879151Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2879887Z             "no_tls_reason": ""
2021-08-05T22:08:06.2880258Z           \}
2021-08-05T22:08:06.2880579Z         \]
2021-08-05T22:08:06.2880885Z
2021-08-05T22:08:06.2881227Z         actual:
2021-08-05T22:08:06.2881560Z         [
2021-08-05T22:08:06.2881876Z           {
2021-08-05T22:08:06.2882429Z             "src": "slow-cooker",
2021-08-05T22:08:06.2883268Z             "src_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2883925Z             "dst": "terminus",
2021-08-05T22:08:06.2884759Z             "dst_namespace": "linkerd-direct-edges-test",
2021-08-05T22:08:06.2885901Z             "client_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887153Z             "server_id": "default.linkerd-direct-edges-test",
2021-08-05T22:08:06.2887896Z             "no_tls_reason": ""
2021-08-05T22:08:06.2888250Z           }
2021-08-05T22:08:06.2888572Z         ]
2021-08-05T22:08:06.2888870Z
```

What is happening is Prometheus has not scraped workloads yet, so there is no edge to or from Prometheus and slow-cooker.

While it's remains unclear why exactly this happens, this change addresses two potential problems:

1. Print the cluster's pods so that we can determine if the Prometheus pod is injected. If it's not injected, we'll be able to debug further with that information.
2. Introduce retries to the `TestEdges` test. As we already have retries in the `TestDirectEdges`, this makes a similar change to `TestEdges`.

Signed-off-by: Kevin Leimkuhler <[email protected]>
Signed-off-by: Sanni Michael <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Client ID in edges command can be wrong

6 participants