tracing: Add initial opentracing support #565
grahamwhaley merged 3 commits into kata-containers:master
Conversation
To test this PR with Jaeger:
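The original test instructions are a hypothetical sketch here: one common way to stand up a local Jaeger for testing is the all-in-one container (the image name, tag, and ports below are standard Jaeger defaults, not details taken from this PR):

```
$ docker run -d --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest
# The JaegerUI should then be reachable at http://localhost:16686
```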
Note: I'll add further tracing granularity (for virtcontainers) in #566.
I was aware of this, but the new API doesn't appear to work... Let's take a closer look...
Force-pushed 3689a9b to 4b1ba95.
Force-pushed 4b1ba95 to 76dbe6e.
Right, we're now using the new Jaeger API. Jaeger helpfully provides two APIs called ...
Force-pushed 76dbe6e to d6112a9.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   66.87%   66.91%   +0.03%
==========================================
  Files          85       85
  Lines        9157     9287     +130
==========================================
+ Hits         6124     6214      +90
- Misses       2394     2418      +24
- Partials      639      655      +16
```
Build succeeded (third-party-check pipeline).
sboeuf left a comment
A few comments/questions, but otherwise, this looks very good and is ready to be merged IMO.
Thank you for the very detailed commit message and comments inside the code; they make it easier to review the design choices you made ;)
```toml
[[projects]]
  digest = "1:167b6f65a6656de568092189ae791253939f076df60231fdd64588ac703892a1"
```
I am thinking the dep file format change should be in its own PR, to avoid it being hidden in this PR, and to make sure people realise it's been updated.
Agreed. I'll raise a PR for that later today...
```go
	vc "github.com/kata-containers/runtime/virtcontainers"
	vf "github.com/kata-containers/runtime/virtcontainers/factory"
	"github.com/kata-containers/runtime/virtcontainers/pkg/oci"
	opentracing "github.com/opentracing/opentracing-go"
```
I don't think you need to specify a custom name for the import here. opentracing-go will be automatically translated into opentracing to be used from the code.
I didn't specify the name - it was added automatically by the go tooling.
```go
	runtimeConfig oci.RuntimeConfig) error {
	var err error

	span, ctx := opentracing.StartSpanFromContext(ctx, "create")
```
I'm curious how much overhead (time-wise) this tracing will add, given that we have a Jaeger agent running (as you said, this resolves to a no-op otherwise).
Good question. I've written a test program to create spans using a "real" tracer and a "NOP" tracer (as this branch now uses by default). Creating 100,000 spans on a NUC:

| Tracer | nanoseconds / span |
|---|---|
| Real tracer | 2860 |
| NOP tracer | 550 |
So, there is some overhead even with tracing "disabled", but when you consider that a `docker run -ti busybox true` with this branch currently only creates 17 spans, I'd say the overhead is unlikely to be an issue, particularly when compared with the value tracing brings.
Thanks for providing the data, this looks negligible indeed.
Force-pushed d6112a9 to b821a53.
Branch updated - I've now found the magic knob to toggle tracing, so now you need to set `enable_tracing`. BTW, I plan to raise a PR on the dev guide explaining how to enable and view traces. I'll then raise a runtime PR to add a Tracing section.
Build succeeded (third-party-check pipeline).
#574 just got merged, @jodh-intel you can rebase ;)
Move a test assertion which seemed to have become separated from its correct location.

Signed-off-by: James O. D. Hunt <[email protected]>
Created two new helper functions to create a `cli.Context` with and without a `cli.App`. Calling these functions simplifies a lot of test code.

Signed-off-by: James O. D. Hunt <[email protected]>
On it. Just found a bug-ette so will re-push and comment once that's resolved...
Add initial support for opentracing by using the `jaeger` package.
Since opentracing uses the `context` package, add a `context.Context`
as the first parameter to all the functions that we might want to
trace. Trace "spans" (trace points) are then added by extracting the
trace details from the specified context parameter.
Notes:
- Although the tracer is created in `main()`, the "root span"
(aka the first trace point) is not added until `beforeSubcommands()`.
This is by design and is a compromise: by delaying the creation of the
root span, the spans become much more readable since using the web-based
JaegerUI, you will see traces like this:
```
kata-runtime: kata-runtime create
------------ -------------------
^ ^
| |
Trace name First span name
(which clearly shows the CLI command that was run)
```
Creating the span earlier means it is necessary to expand 'n' spans in
the UI before you get to see the name of the CLI command that was run.
In adding support, this became very tedious, hence my design decision to
defer the creation of the root span until after signal handling has been
setup and after CLI options have been parsed, but still very early in
the code path.
- At this stage, the tracing stops at the `virtcontainers` call
boundary.
- Tracing is "always on" as there doesn't appear to be a way to toggle
it. However, it resolves to a "nop" unless the tracer can talk to a
Jaeger agent.
Note that this commit required a bit of rework to `beforeSubcommands()`
to reduce the cyclomatic complexity.
Fixes kata-containers#557.
Signed-off-by: James O. D. Hunt <[email protected]>
b821a53 to
3a1bbd0
Compare
@sboeuf, @grahamwhaley, @egernst - branch updated ;)
Build succeeded (third-party-check pipeline).
Um, pullapprove is showing WIP - it isn't (now), as I removed the dnm label! /me tries toggling that again...
Nope :( I'd rather not re-force-push, to avoid restarting all the long-running CIs again.
Confirmed - it really doesn't like you, does it! It also does not seem to have a 'refresh' button in their GUI :-(.
I was hoping this might be merged by Monday, so looking for reviewers... ;)
@jodh-intel I'll do another review, but let's ask @amshinde @egernst @bergwolf to review this too ;)
pullapprove has woken from its slumber it seems @jodh-intel ...
Thanks for the rework/rebase @jodh-intel, this LGTM!
```go
	kataLog.Infof(msg, args...)
}

func createTracer(name string) (opentracing.Tracer, error) {
```
Check if tracing is enabled first? Or am I misunderstanding, and tracing is checked somewhere else?
Hi @bergwolf - actually, no.
It's a little subtle, but once you add opentracing support to a codebase, unless you want to litter the code with calls like...

```go
var span opentracing.Span
if tracing {
    span = ...
}
// do stuff
if tracing {
    defer span.Finish()
}
```
... you need to always create a valid (non-nil) tracer and non-nil spans.
But if you do that, how do you allow the application to toggle tracing on/off?
The answer is you either create a real tracer, or you set `Disabled=true` in the Jaeger `Configuration`. That has the effect of causing `cfg.NewTracer()` to return a `NoopTracer`, which effectively disables tracing (there is a small overhead - see #565 (comment)).
@jodh-intel Yeah, I think we need to decide who controls the tracing functionality -- kata config or Jaeger config? Right now, each of them has a switch and it looks like only the Jaeger one is in effect.
And do we really need to create these hierarchical spans? If not, we can put something like this in app.Before and app.After:
```
if (tracing) {
    setup tracer
    start tracer span
}
```

and

```
if (tracing) {
    stop tracer span
    cleanup tracer
}
```
So we have one tracer and one span for each command-line invocation. We can pass the span around so that we can call `SetTag` on it. And it looks much cleaner than having everyone call:

```
span := opentracing.StartSpanFromContext(...)
defer span.Finish()
```
Another question about opentracing (sorry to be dumb), do we need to plant trace points on our own? Or is it integrated with logrus somehow? Or can Jaeger insert new trace points dynamically?
I agree. But the recommendation is to always pass a context.Context, and not to pass the spans directly, as that "pollutes" the codebase too much. See:
- https://github.com/yurishkuro/opentracing-tutorial/tree/master/go/lesson02#propagate-the-in-process-context

(note that that was written by Yuri Shkuro, the tech lead for Jaeger tracing @ Uber).
@jodh-intel Thanks for the presentation. I now understand better how Jaeger works, and I agree we need to create a span for each function we want to trace. Jaeger looks more like a function tracing system to me. I have two more questions though:
- Can we use it to trace any point in the code? If so, we can use it to replace most of the logs.
- Do users have to install Jaeger (the server?) to be able to disable tracing?
Hi @bergwolf.
Can we use it to trace any point in the code?
Yes, assuming the code has access to a context.Context.
Do users have to install Jaeger (server?) to be able to disable tracing?
To disable tracing, they just never uncomment `enable_tracing` in the config file. So if `enable_tracing == false`, the runtime will use the "NOP tracer". But also, if `enable_tracing == true` but no Jaeger agent is running, that will not actually generate any errors either - it will just make the runtime a little slower as the jaeger-client-go code tries (unsuccessfully, but quietly :) to talk to the Jaeger agent over UDP.
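For reference, a minimal sketch of how that knob might look in the runtime's configuration file. Only the `enable_tracing` key name comes from the discussion above; the section name and comment are assumptions about the final layout:

```toml
[runtime]
# Uncomment to send traces to a local Jaeger agent over UDP.
# Leave commented out (the default) to use the NOP tracer.
#enable_tracing = true
```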
... And yes, I'm starting to think that once we have the tracing support working well, we can remove some of the logging. It would be interesting to know how zap performs compared to jaeger since, if jaeger is faster (and I'm guessing that it is), we don't need logging any more in theory. Some users might be concerned about the UDP comms, but that's what syslog uses, I believe ;)
There is one issue that we need to resolve though - how to specify details of a remote Jaeger Collector (which reads all the traces + spans). Since this branch doesn't do anything special, the jaeger package will try to send the traces + spans to the local system where the runtime is running. I think this means we'll need to expand the config options for tracing. However, as you can see from https://github.com/jaegertracing/jaeger-client-go#environment-variables, there are a lot of variables. I don't think we can rely on just variables, as getting those set in the runtime's docker environment is not easy (unless we make the runtime read ...). I think we can probably handle this in a separate issue+PR, although note that if we do this, the ...
To be clear about #565 (comment), I think we want to minimise the tracing options in ...
@jodh-intel Thanks for the explanation. I don't much like that we have to modify every function parameter to include the context if we want to trace it, but it looks like if we want to integrate with opentracing, we have to bite the bullet. Maybe we should consider adding a context pointer to the sandbox structure, so that at least in virtcontainers we do not have to modify function parameters everywhere. LGTM!
Hi @bergwolf - agreed and agreed :)
This is all green. I don't have time to review or test for any noticeable impact right now, but... I'll merge on the acks of others (is that 'standing on the ACKS of giants' I wonder ;-) ). |