tracing: Add initial opentracing support #565
grahamwhaley merged 3 commits into kata-containers:master
Conversation
To test this PR with Jaeger:
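The original test instructions are a hypothetical sketch here: one common way to stand up a local Jaeger for testing is the all-in-one container (the image name, tag, and ports below are standard Jaeger defaults, not details taken from this PR):

```
$ docker run -d --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest
# The JaegerUI should then be reachable at http://localhost:16686
```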
Note: I'll add further tracing granularity (for virtcontainers) in #566.
I was aware of this, but the new API doesn't appear to work... Let's take a closer look...
Force-pushed 3689a9b to 4b1ba95.
Force-pushed 4b1ba95 to 76dbe6e.
Right, we're now using the new Jaeger API. Jaeger helpfully provides two APIs called ...
Force-pushed 76dbe6e to d6112a9.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   66.87%   66.91%   +0.03%
==========================================
  Files          85       85
  Lines        9157     9287     +130
==========================================
+ Hits         6124     6214      +90
- Misses       2394     2418      +24
- Partials      639      655      +16
```
Build succeeded (third-party-check pipeline).
sboeuf left a comment
A few comments/questions, but otherwise, this looks very good and is ready to be merged IMO.
Thank you for the very detailed commit message and comments inside the code; they make it easier to review the design choices you made ;)
```toml
[[projects]]
  digest = "1:167b6f65a6656de568092189ae791253939f076df60231fdd64588ac703892a1"
```
I am thinking the dep file format change should be in its own PR, to avoid it being hidden in this PR, and to make sure people realise it's been updated.
Agreed. I'll raise a PR for that later today...
```go
	vc "github.com/kata-containers/runtime/virtcontainers"
	vf "github.com/kata-containers/runtime/virtcontainers/factory"
	"github.com/kata-containers/runtime/virtcontainers/pkg/oci"
	opentracing "github.com/opentracing/opentracing-go"
```
I don't think you need to specify a custom name for the import here. opentracing-go will be automatically translated into opentracing to be used from the code.
I didn't specify the name - it was added automatically by the go tooling.
```go
	runtimeConfig oci.RuntimeConfig) error {
	var err error

	span, ctx := opentracing.StartSpanFromContext(ctx, "create")
```
I'm curious how much overhead (time-wise) this tracing will add, given that we have a Jaeger agent running (as you said, this resolves to a no-op otherwise).
Good question. I've written a test program to create spans using a "real" tracer and a "NOP" tracer (as this branch now uses by default). Creating 100,000 spans on a NUC:

| Tracer | nanoseconds / span |
|---|---|
| Real tracer | 2860 |
| NOP tracer | 550 |
So, there is some overhead even with tracing "disabled", but when you consider that a `docker run -ti busybox true` with this branch currently only creates 17 spans, I'd say the overhead is unlikely to be an issue, particularly when compared with the value tracing brings.
Thanks for providing the data, this looks negligible indeed.
Force-pushed d6112a9 to b821a53.
Branch updated - I've now found the magic knob to toggle tracing, so now you need to set `enable_tracing`. BTW, I plan to raise a PR on the dev guide explaining how to enable and view traces. I'll then raise a runtime PR to add a Tracing section.
Build succeeded (third-party-check pipeline).
#574 just got merged, @jodh-intel you can rebase ;)
Move a test assertion which seemed to have become separated from its correct location.

Signed-off-by: James O. D. Hunt <[email protected]>
Created two new helper functions to create a `cli.Context` with and without a `cli.App`. Calling these functions simplifies a lot of test code.

Signed-off-by: James O. D. Hunt <[email protected]>
On it. Just found a bug-ette so will re-push and comment once that's resolved...
Add initial support for opentracing by using the `jaeger` package.
Since opentracing uses the `context` package, add a `context.Context`
as the first parameter to all the functions that we might want to
trace. Trace "spans" (trace points) are then added by extracting the
trace details from the specified context parameter.
Notes:
- Although the tracer is created in `main()`, the "root span"
(aka the first trace point) is not added until `beforeSubcommands()`.
This is by design and is a compromise: by delaying the creation of the
root span, the spans become much more readable since using the web-based
JaegerUI, you will see traces like this:
```
kata-runtime: kata-runtime create
------------ -------------------
^ ^
| |
Trace name First span name
(which clearly shows the CLI command that was run)
```
Creating the span earlier means it is necessary to expand 'n' spans in
the UI before you get to see the name of the CLI command that was run.
In adding support, this became very tedious, hence my design decision to
defer the creation of the root span until after signal handling has been
setup and after CLI options have been parsed, but still very early in
the code path.
- At this stage, the tracing stops at the `virtcontainers` call
boundary.
- Tracing is "always on" as there doesn't appear to be a way to toggle
it. However, it resolves to a "nop" unless the tracer can talk to a
Jaeger agent.
Note that this commit required a bit of rework to `beforeSubcommands()`
to reduce the cyclomatic complexity.
Fixes kata-containers#557.
Signed-off-by: James O. D. Hunt <[email protected]>
b821a53 to
3a1bbd0
Compare
@sboeuf, @grahamwhaley, @egernst - branch updated ;)
Build succeeded (third-party-check pipeline).
Um, pullapprove is showing WIP - it isn't (now), as I removed the dnm label! /me tries toggling that again...
Nope :( I'd rather not re-force-push, to avoid restarting all the long-running CIs again.
Confirmed - it really doesn't like you, does it! It also does not seem to have a 'refresh' button in their GUI :-(.
I was hoping this might be merged by Monday, so looking for reviewers... ;)
@jodh-intel I'll do another review, but let's ask @amshinde @egernst @bergwolf to review this too ;)
pullapprove has woken from its slumber it seems @jodh-intel ...
Thanks for the rework/rebase @jodh-intel, this LGTM!
```go
	kataLog.Infof(msg, args...)
}

func createTracer(name string) (opentracing.Tracer, error) {
```
Check if tracing is enabled first? Or am I misunderstanding, and tracing is checked somewhere else?
Hi @bergwolf - actually, no.
It's a little subtle, but once you add opentracing support to a codebase, unless you want to litter the code with calls like...

```go
var span opentracing.Span
if tracing {
    span = ...
}
// do stuff
if tracing {
    defer span.Finish()
}
```
... you need to always create a valid (non-nil) tracer and non-nil spans.
But if you do that, how do you allow the application to toggle tracing on/off?
The answer is you either create a real tracer, or you set `Disabled=true` in the Jaeger `Configuration`. That has the effect of causing `cfg.NewTracer()` to return a `NoopTracer`, which effectively disables tracing (there is a small overhead - see #565 (comment)).
@jodh-intel Yeah, I think we need to decide who controls the tracing functionality -- kata config or Jaeger config? Right now, each of them has a switch and it looks like only the Jaeger one is in effect.
And do we really need to create these hierarchical spans? If not, we can put something like this in app.Before and app.After:
```
if (tracing) {
    setup tracer
    start tracer span
}
```

and

```
if (tracing) {
    stop tracer span
    cleanup tracer
}
```
So we have one tracer and one span for each command-line invocation. We can pass the span around so that we can call `SetTag` on it. And it looks much cleaner than having everyone call:

```
span := opentracing.StartSpanFromContext(...)
defer span.Finish()
```
Another question about opentracing (sorry to be dumb), do we need to plant trace points on our own? Or is it integrated with logrus somehow? Or can Jaeger insert new trace points dynamically?
I agree. But the recommendation is to always pass a context.Context, and not to pass the spans directly, as that "pollutes" the codebase too much. See:
- https://github.com/yurishkuro/opentracing-tutorial/tree/master/go/lesson02#propagate-the-in-process-context

(note that that was written by Yuri Shkuro, the tech lead for Jaeger tracing @ Uber).
@jodh-intel Thanks for the presentation. I now understand better how Jaeger works, and I agree we need to create a span for each function we want to trace. Jaeger looks more like a function tracing system to me. I have two more questions though:
- Can we use it to trace any point in the code? If so, we can use it to replace most of the logs.
- Do users have to install Jaeger (the server?) to be able to disable tracing?
Hi @bergwolf.
Can we use it to trace any point in the code?
Yes, assuming the code has access to a context.Context.
Do users have to install Jaeger (server?) to be able to disable tracing?
To disable tracing, they just never uncomment `enable_tracing` in the config file. So if `enable_tracing == false`, the runtime will use the "NOP tracer". But also, if `enable_tracing == true` but no Jaeger agent is running, that will not actually generate any errors either - it will just make the runtime a little slower as the jaeger-client-go code tries (unsuccessfully, but quietly :) to talk to the Jaeger agent over UDP.
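For reference, a minimal sketch of how that knob might look in the runtime's configuration file. Only the `enable_tracing` key name comes from the discussion above; the section name and comment are assumptions about the final layout:

```toml
[runtime]
# Uncomment to send traces to a local Jaeger agent over UDP.
# Leave commented out (the default) to use the NOP tracer.
#enable_tracing = true
```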
... And yes, I'm starting to think that once we have the tracing support working well, we can remove some of the logging. It would be interesting to know how zap performs compared to jaeger since, if jaeger is faster (and I'm guessing that it is), we don't need logging any more in theory. Some users might be concerned about the UDP comms, but that's what syslog uses, I believe ;)
There is one issue that we need to resolve though - how to specify details of a remote Jaeger Collector (which reads all the traces + spans). Since this branch doesn't do anything special, the jaeger package will try to send the traces + spans to the local system where the runtime is running. I think this means we'll need to expand the config options for tracing. However, as you can see from https://github.com/jaegertracing/jaeger-client-go#environment-variables, there are a lot of variables. I don't think we can rely on just variables, as getting those set in the runtime's docker environment is not easy (unless we make the runtime read ...). I think we can probably handle this in a separate issue+PR, although note that if we do this, the ...
To be clear about #565 (comment), I think we want to minimise the tracing options in ...
@jodh-intel Thanks for the explanation. I don't much like that we have to modify every function parameter to include the context if we want to trace it, but it looks like if we want to integrate with opentracing, we have to bite the bullet. Maybe we should consider adding a context pointer to the sandbox structure, so that at least in virtcontainers we do not have to modify function parameters everywhere. LGTM!
Hi @bergwolf - agreed and agreed :)
This is all green. I don't have time to review or test for any noticeable impact right now, but... I'll merge on the acks of others (is that 'standing on the ACKS of giants' I wonder ;-) ). |