Expose the number of service instances in the proxy #428
Conversation
When diagnosing issues, it can be difficult to reason about the states
of the various caches in the proxy.
This change introduces a set of `stack_*` metrics to help proxy developers
diagnose issues. The metrics track stack creations, stack drops, and the
amount of time a given service is unavailable (via `stack_poll_total_ms`).
```
:; kubectl exec -n burn-http deploy/client-burst -c client -- curl -s localhost:4191/metrics |grep -e ^stack
stack_create_total{direction="outbound",name="fallback.endpoint"} 0
stack_create_total{direction="outbound",name="balance.endpoint"} 1
stack_create_total{direction="outbound",name="logical.dst"} 1
stack_create_total{direction="outbound",name="addr"} 1
stack_create_total{direction="outbound",name="source"} 100
stack_create_total{direction="inbound",name="endpoint"} 0
stack_create_total{direction="inbound",name="route"} 0
stack_create_total{direction="inbound",name="logical"} 0
stack_create_total{direction="inbound",name="source"} 0
stack_drop_total{direction="outbound",name="fallback.endpoint"} 0
stack_drop_total{direction="outbound",name="balance.endpoint"} 0
stack_drop_total{direction="outbound",name="logical.dst"} 0
stack_drop_total{direction="outbound",name="addr"} 0
stack_drop_total{direction="outbound",name="source"} 0
stack_drop_total{direction="inbound",name="endpoint"} 0
stack_drop_total{direction="inbound",name="route"} 0
stack_drop_total{direction="inbound",name="logical"} 0
stack_drop_total{direction="inbound",name="source"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="true"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="true"} 70371
stack_poll_total{direction="outbound",name="logical.dst",ready="true"} 23600
stack_poll_total{direction="outbound",name="addr",ready="true"} 23600
stack_poll_total{direction="outbound",name="source",ready="true"} 23600
stack_poll_total{direction="inbound",name="endpoint",ready="true"} 0
stack_poll_total{direction="inbound",name="route",ready="true"} 0
stack_poll_total{direction="inbound",name="logical",ready="true"} 0
stack_poll_total{direction="inbound",name="source",ready="true"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="false"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="false"} 23175
stack_poll_total{direction="outbound",name="logical.dst",ready="false"} 0
stack_poll_total{direction="outbound",name="addr",ready="false"} 1
stack_poll_total{direction="outbound",name="source",ready="false"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="false"} 0
stack_poll_total{direction="inbound",name="route",ready="false"} 0
stack_poll_total{direction="inbound",name="logical",ready="false"} 0
stack_poll_total{direction="inbound",name="source",ready="false"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="logical.dst",ready="error"} 0
stack_poll_total{direction="outbound",name="addr",ready="error"} 0
stack_poll_total{direction="outbound",name="source",ready="error"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="error"} 0
stack_poll_total{direction="inbound",name="route",ready="error"} 0
stack_poll_total{direction="inbound",name="logical",ready="error"} 0
stack_poll_total{direction="inbound",name="source",ready="error"} 0
stack_poll_total_ms{direction="outbound",name="fallback.endpoint"} 0
stack_poll_total_ms{direction="outbound",name="balance.endpoint"} 40
stack_poll_total_ms{direction="outbound",name="logical.dst"} 0
stack_poll_total_ms{direction="outbound",name="addr"} 2
stack_poll_total_ms{direction="outbound",name="source"} 0
stack_poll_total_ms{direction="inbound",name="endpoint"} 0
stack_poll_total_ms{direction="inbound",name="route"} 0
stack_poll_total_ms{direction="inbound",name="logical"} 0
stack_poll_total_ms{direction="inbound",name="source"} 0
```
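To make the relationship between the `stack_create_total` and `stack_drop_total` series concrete: their difference is the number of live service instances for a given stack. A minimal, illustrative sketch of that bookkeeping (using hypothetical `StackGauges`/`TrackService` types and plain atomics, not the proxy's actual metrics registry) might look like this:

```
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical per-stack counter pair backing the stack_create_total /
// stack_drop_total series; the proxy's real registry types differ.
#[derive(Default)]
pub struct StackGauges {
    pub create_total: AtomicU64,
    pub drop_total: AtomicU64,
}

// Wraps a service built for one stack: constructing the wrapper bumps
// create_total, and dropping it (e.g. on cache eviction) bumps drop_total,
// so live instances can be read as create_total - drop_total.
pub struct TrackService<S> {
    pub inner: S,
    gauges: Arc<StackGauges>,
}

impl<S> TrackService<S> {
    pub fn new(inner: S, gauges: Arc<StackGauges>) -> Self {
        gauges.create_total.fetch_add(1, Ordering::Relaxed);
        Self { inner, gauges }
    }
}

impl<S> Drop for TrackService<S> {
    fn drop(&mut self) {
        self.gauges.drop_total.fetch_add(1, Ordering::Relaxed);
    }
}
```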
seanmonstar
left a comment
🎉
(I've recently read through finagle stats again, and the ease of adding new stats is enviable...)
This seems like something that could potentially be a good use-case for the …
hawkw
left a comment
The new metrics seem like they could be very useful; this is great.
I was curious about the potential performance impact of taking the timestamps necessary for the poll duration metric. It would be nice to know if this actually makes a meaningful difference or not...
linkerd/stack-metrics/src/lib.rs
Outdated
```
                if self.blocked_since.is_none() {
                    self.blocked_since = Some(Instant::now());
                }
                Ok(Async::NotReady)
            }
            Ok(Async::Ready(())) => {
                self.metrics.ready_total.incr();
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics.poll_millis.add(not_ready.as_millis() as u64);
                }
                Ok(Async::Ready(()))
            }
            Err(e) => {
                self.metrics.error_total.incr();
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics.poll_millis.add(not_ready.as_millis() as u64);
                }
```
Any concerns about the performance implications of recording all these timestamps? It seems like if we are instrumenting every stack in the proxy with one of these layers, we are potentially adding a fair number of syscalls in the request path.
It could be good to do some benchmarking and see if there's a noticeable perf impact from this? We might want to add a config to turn off the stack metrics, or something?
How is this different from any of the other latency metrics we record?
Furthermore, this is only engaged when a service is NotReady, which is fairly rare in the scheme of things. I'm not particularly worried about this? Unless you really think we should be?
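For reference, the pattern under discussion only reads the clock on the transition into the not-ready state and once more when the service recovers (or errors). A standalone sketch of that bookkeeping, with illustrative atomic counters standing in for the proxy's metric types, looks roughly like this:

```
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Illustrative counters standing in for stack_poll_total{ready=...} and
// stack_poll_total_ms; not the proxy's actual types.
#[derive(Default)]
struct PollMetrics {
    ready_total: AtomicU64,
    not_ready_total: AtomicU64,
    poll_millis: AtomicU64,
}

struct PollTracker {
    metrics: PollMetrics,
    // Set when the inner service first reports not-ready; cleared once it
    // becomes ready (or fails) again.
    blocked_since: Option<Instant>,
}

impl PollTracker {
    fn on_not_ready(&mut self) {
        self.metrics.not_ready_total.fetch_add(1, Ordering::Relaxed);
        // Take a timestamp only on the *transition* into not-ready, so
        // repeated not-ready polls add no further clock reads.
        if self.blocked_since.is_none() {
            self.blocked_since = Some(Instant::now());
        }
    }

    fn on_ready(&mut self) {
        self.metrics.ready_total.fetch_add(1, Ordering::Relaxed);
        // One clock read on recovery to record how long the stack was blocked;
        // polls that were never blocked read no clock at all.
        if let Some(t0) = self.blocked_since.take() {
            let blocked = Instant::now() - t0;
            self.metrics
                .poll_millis
                .fetch_add(blocked.as_millis() as u64, Ordering::Relaxed);
        }
    }
}
```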
kleimkuhler
left a comment
Change looks good. +1 to @hawkw's question as I had similar thoughts on the Instant::now() calls.
kleimkuhler
left a comment
Change looks good and that final reorganization helps a lot.
This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)