This repository was archived by the owner on May 12, 2021. It is now read-only.

tracing: Add initial opentracing support#565

Merged
grahamwhaley merged 3 commits into kata-containers:master from jodh-intel:support-opentracing
Aug 14, 2018

Conversation

@jodh-intel

Add initial support for opentracing by using the jaeger package.
Since opentracing uses the context package, add a context.Context
as the first parameter to all the functions that we might want to
trace. Trace "spans" (trace points) are then added by extracting the
trace details from the specified context parameter.

Notes:

  • Although the tracer is created in main(), the "root span"
    (aka the first trace point) is not added until beforeSubcommands().

    This is by design and is a compromise: by delaying the creation of the
    root span, the spans become much more readable, since in the web-based
    JaegerUI you will see traces like this:

      kata-runtime: kata-runtime create
      ------------  -------------------
          ^                ^
          |                |
      Trace name        First span name
                       (which clearly shows the CLI command that was run)
    

    Creating the span earlier means it is necessary to expand 'n' spans in
    the UI before you get to see the name of the CLI command that was run.
    In adding support, this became very tedious, hence my design decision to
    defer the creation of the root span until after signal handling has been
    setup and after CLI options have been parsed, but still very early in
    the code path.

  • At this stage, the tracing stops at the virtcontainers call boundary.

  • Tracing is "always on" as there doesn't appear to be a way to toggle
    it. However, it resolves to a "no-op" unless the tracer can talk to a
    Jaeger agent.

Fixes #557.

Signed-off-by: James O. D. Hunt [email protected]

@jodh-intel
Author

To test this PR with Jaeger:
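(The original test steps were not preserved in this capture. A plausible way to test, assuming Docker is available, is to run the Jaeger "all-in-one" image so a local agent is listening, then run some runtime commands and inspect the traces; the port numbers are Jaeger's documented defaults:)

```shell
# Hedged sketch: run Jaeger's all-in-one image, which bundles the
# agent (UDP 6831), collector and web UI (16686).
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Run any kata-runtime command, then browse the traces at:
# http://localhost:16686
```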

@jodh-intel
Author

Note: I'll add further tracing granularity (for virtcontainers) on #566.

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 165859 KB
Proxy: 4155 KB
Shim: 8815 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003860 KB

@jodh-intel
Author

cli/tracing.go:45:25:warning: cfg.New is deprecated: use NewTracer() function  (SA1019) (staticcheck)

I was aware of this, but the new API doesn't appear to work... Let's take a closer look...

@jodh-intel jodh-intel force-pushed the support-opentracing branch from 3689a9b to 4b1ba95 Compare August 9, 2018 14:47
@katacontainersbot
Contributor

PSS Measurement:
Qemu: 172291 KB
Proxy: 3970 KB
Shim: 9053 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2004108 KB

@jodh-intel jodh-intel force-pushed the support-opentracing branch from 4b1ba95 to 76dbe6e Compare August 9, 2018 15:08
@jodh-intel
Author

Right, we're now using the new Jaeger API. Jaeger helpfully provides two APIs called NewTracer() - one works, one doesn't (at least for me). Guess which I picked first? :)

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 169558 KB
Proxy: 4075 KB
Shim: 8890 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2004000 KB

@jodh-intel jodh-intel force-pushed the support-opentracing branch from 76dbe6e to d6112a9 Compare August 9, 2018 15:54
@katacontainersbot
Contributor

PSS Measurement:
Qemu: 170138 KB
Proxy: 4042 KB
Shim: 8884 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003852 KB

@codecov

codecov bot commented Aug 9, 2018

Codecov Report

Merging #565 into master will increase coverage by 0.03%.
The diff coverage is 77.59%.

@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   66.87%   66.91%   +0.03%     
==========================================
  Files          85       85              
  Lines        9157     9287     +130     
==========================================
+ Hits         6124     6214      +90     
- Misses       2394     2418      +24     
- Partials      639      655      +16

@opendev-zuul

opendev-zuul bot commented Aug 9, 2018

Build succeeded (third-party-check pipeline).


@sboeuf sboeuf left a comment


A few comments/questions, but otherwise, this looks very good and is ready to be merged IMO.
Thank you for the very detailed commit message and comments inside the code; it makes reviewing the design choices you made much easier ;)



[[projects]]
digest = "1:167b6f65a6656de568092189ae791253939f076df60231fdd64588ac703892a1"

I am thinking the dep file format change should be in its own PR, to avoid it being hidden in this PR, and to make sure people realize it's been updated.

Author


Agreed. I'll raise a PR for that later today...

Author


Right - #574 raised. I'll add dnm to this PR so when #574 is merged I can remove my first commit from this PR...

vc "github.com/kata-containers/runtime/virtcontainers"
vf "github.com/kata-containers/runtime/virtcontainers/factory"
"github.com/kata-containers/runtime/virtcontainers/pkg/oci"
opentracing "github.com/opentracing/opentracing-go"

I don't think you need to specify a custom name for the import here. opentracing-go will be automatically translated into opentracing to be used from the code.

Author


I didn't specify the name - it was added automatically by the go tooling.

runtimeConfig oci.RuntimeConfig) error {
var err error

span, ctx := opentracing.StartSpanFromContext(ctx, "create")

I'm curious how much overhead (time-wise) this tracing will add, given that we have a Jaeger agent running (as you said, this resolves to a no-op otherwise).

Author


Good question. I've written a test program to create spans using a "real" tracer and a "NOP" tracer (as this branch now uses by default). Creating 100,000 spans on a NUC:

| Tracer      | Nanoseconds / span |
| ----------- | ------------------ |
| Real tracer | 2860               |
| NOP tracer  | 550                |

So, there is some overhead even with tracing "disabled", but considering that a `docker run -ti busybox true` with this branch currently creates only 17 spans, I'd say the overhead is unlikely to be an issue, particularly compared with the value tracing brings.


Thanks for providing the data, this looks negligible indeed.

@jodh-intel
Author

Branch updated - I've now found the magic knob to toggle tracing so now you need to set enable_tracing=true in configuration.toml if you want to trace the runtime.

BTW, I plan to raise a PR on the dev guide explaining how to enable and view traces. I'll then raise a runtime PR to add a Tracing section to the README, referencing the new dev guide URL for tracing info.

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 167869 KB
Proxy: 4160 KB
Shim: 8846 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003984 KB

@opendev-zuul

opendev-zuul bot commented Aug 10, 2018

Build succeeded (third-party-check pipeline).

@sboeuf

sboeuf commented Aug 10, 2018

#574 just got merged, @jodh-intel you can rebase ;)

James O. D. Hunt added 2 commits August 10, 2018 15:25
Move a test assertion which seemed to have become separated from its
correct location.

Signed-off-by: James O. D. Hunt <[email protected]>
Created two new helper functions to create a `cli.Context` with and without a
`cli.App`.

Calling these functions simplifies a lot of test code.

Signed-off-by: James O. D. Hunt <[email protected]>
@jodh-intel
Author

On it. Just found a bug-ette so will re-push and comment once that's resolved...

Add initial support for opentracing by using the `jaeger` package.
Since opentracing uses the `context` package, add a `context.Context`
as the first parameter to all the functions that we might want to
trace. Trace "spans" (trace points) are then added by extracting the
trace details from the specified context parameter.

Notes:

- Although the tracer is created in `main()`, the "root span"
  (aka the first trace point) is not added until `beforeSubcommands()`.

  This is by design and is a compromise: by delaying the creation of the
  root span, the spans become much more readable, since in the web-based
  JaegerUI you will see traces like this:

  ```
  kata-runtime: kata-runtime create
  ------------  -------------------
       ^                ^
       |                |
  Trace name        First span name
                    (which clearly shows the CLI command that was run)
  ```

  Creating the span earlier means it is necessary to expand 'n' spans in
  the UI before you get to see the name of the CLI command that was run.
  In adding support, this became very tedious, hence my design decision to
  defer the creation of the root span until after signal handling has been
  setup and after CLI options have been parsed, but still very early in
  the code path.

- At this stage, the tracing stops at the `virtcontainers` call
  boundary.

- Tracing is "always on" as there doesn't appear to be a way to toggle
  it. However, it resolves to a "no-op" unless the tracer can talk to a
  Jaeger agent.

Note that this commit required a bit of rework to `beforeSubcommands()`
to reduce the cyclomatic complexity.

Fixes kata-containers#557.

Signed-off-by: James O. D. Hunt <[email protected]>
@jodh-intel
Author

@sboeuf, @grahamwhaley, @egernst - branch updated ;)

@opendev-zuul

opendev-zuul bot commented Aug 10, 2018

Build succeeded (third-party-check pipeline).

@jodh-intel
Author

Um, pullapprove is showing WIP - it isn't (now) as I removed the dnm label!

/me tries toggling that again...

@jodh-intel
Author

Nope :( I'd rather not re-force-push to avoid restarting all the long-running CIs again.

@grahamwhaley grahamwhaley added wip and removed wip labels Aug 10, 2018
@grahamwhaley
Contributor

Confirmed - it really doesn't like you, does it! It also does not seem to have a 'refresh' button in its GUI :-(.
I'd leave it over the weekend and see if it feels better on Monday :-)

@jodh-intel
Author

I was hoping this might be merged by Monday so looking for reviewers... ;)

@sboeuf

sboeuf commented Aug 10, 2018

@jodh-intel I'll do another review, but let's ask @amshinde @egernst @bergwolf to review this too ;)

@grahamwhaley
Contributor

pullapprove has woken from its slumber it seems @jodh-intel ...

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 170118 KB
Proxy: 4206 KB
Shim: 9256 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003992 KB

@sboeuf

sboeuf commented Aug 11, 2018

Thanks for the rework/rebase @jodh-intel, this LGTM!

kataLog.Infof(msg, args...)
}

func createTracer(name string) (opentracing.Tracer, error) {
Member


Check if tracing is enabled first? Or am I misunderstanding it that tracing is checked somewhere else?

Author


Hi @bergwolf - actually, no.

It's a little subtle but once you add opentracing support to a codebase, unless you want to litter the code with calls like...

var span opentracing.Span

if tracing {
    span = ...
}

// do stuff

if tracing {
    defer span.Finish()
}

... you need to always create a valid (non-nil) tracer and non-nil spans.

But if you do that, how do you allow the application to toggle tracing on/off?

The answer is you either create a real tracer, or you set Disabled=true in the Jaeger Configuration. That has the effect of causing cfg.NewTracer() to return a NoopTracer, which effectively disables tracing (there is a small overhead - see #565 (comment)).

Member


@jodh-intel Yeah, I think we need to decide who controls the tracing functionality -- kata config or Jaeger config? Right now, each of them has a switch and it looks like only the Jaeger one is in effect.

And do we really need to create these hierarchical spans? If not, we can put in app.Before and app.After something like:

if tracing {
    setup tracer
    start tracer span
}

and

if tracing {
    stop tracer span
    cleanup tracer
}

So we have one tracer and one span for each command line invocation. We can pass the span around so that we can call SetTag on it. And it looks much cleaner than having everyone call

span, ctx := opentracing.StartSpanFromContext(ctx, name)
defer span.Finish()

Another question about opentracing (sorry to be dumb), do we need to plant trace points on our own? Or is it integrated with logrus somehow? Or can Jaeger insert new trace points dynamically?

Author


I agree. But the recommendation is to always pass a context.Context, and not to pass the spans directly as that "pollutes" the codebase too much. See:

Member


@jodh-intel Thanks for the presentation. I now understand better how Jaeger works and I agree we need to create span for each function we want to trace. And Jaeger looks more like a function tracing system to me. I have two more questions though:

  1. Can we use it to trace any point in the code? If so, we can use it to replace most of the logs.
  2. Do users have to install Jaeger (server?) to be able to disable tracing?

Author


Hi @bergwolf.

Can we use it to trace any point in the code?

Yes, assuming the code has access to a context.Context.

Do users have to install Jaeger (server?) to be able to disable tracing?

To disable tracing, they just never uncomment enable_tracing in the config file. So if enable_tracing == false, the runtime will use the "NOP tracer". But also, if enable_tracing == true but no Jaeger agent is running, that will not actually generate any errors either - it will just make the runtime a little slower as the jaeger-client-go code tries (unsuccessfully but quietly :) to talk to the Jaeger agent over UDP.

Author


... And yes, I'm starting to think that once we have the tracing support working well, we can remove some of the logging. It would be interesting to know how zap performs compared to jaeger, since if jaeger is faster (and I'm guessing that it is), we don't need the logging any more in theory. Some users might be concerned about the UDP comms, but that's what syslog uses I believe ;)

@jodh-intel
Author

There is one issue that we need to resolve though - how to specify details of a remote Jaeger Collector (which reads all the traces + spans). Since this branch doesn't do anything special, the jaeger package will try to send the traces and spans to the local system where the runtime is running.

I think this means we'll need to expand the config options for tracing to something like:

```toml
[runtime]

[runtime.tracing]
enabled = true
# Set to the value of JAEGER_ENDPOINT if your trace collector is not local.
# (See https://github.com/jaegertracing/jaeger-client-go#environment-variables)
endpoint = ""
```

However, as you can see from https://github.com/jaegertracing/jaeger-client-go#environment-variables, there are a lot of variables. I don't think we can rely on just variables as getting those set in the runtime's docker environment is not easy (unless we make the runtime read process.env from config.json and set them itself or something).

I think we can probably handle this on a separate issue+PR, although note that if we do this, the enable_tracing option might change real soon now (TM) 😄

@jodh-intel
Author

To be clear about #565 (comment), I think we want to minimise the tracing options in configuration.toml. However, at least when running under docker, allowing the Jaeger client library to access environment variables isn't actually that easy.

@bergwolf
Member

bergwolf commented Aug 14, 2018

@jodh-intel Thanks for the explanation. I don't much like it that we have to modify every function parameter to include the context if we want to trace it. But it looks like if we want to integrate with opentracing, we do have to bite the bullet. Maybe we should consider adding a context pointer to the sandbox structure, so that at least in virtcontainers, we do not have to modify function parameters everywhere.

LGTM!

Approved with PullApprove

@jodh-intel
Author

Hi @bergwolf - agreed and agreed :)

@grahamwhaley
Contributor

This is all green. I don't have time to review or test for any noticeable impact right now, but... I'll merge on the acks of others (is that 'standing on the ACKS of giants' I wonder ;-) ).
I liked @bergwolf suggestion that maybe we can carry the context around in one of the other existing data structures :-)

@grahamwhaley grahamwhaley merged commit dd2acd2 into kata-containers:master Aug 14, 2018
@sboeuf sboeuf added the feature New functionality label Sep 12, 2018
zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019
Invoke tmp.mount by adding it to kata-containers.target.
This is not invoked by systemd with the current rootfs setup we have,
as one of the dependencies for tmp.mount is systemd-remount-fs.service,
which depends on the /etc/fstab file being present (it is currently missing).
Instead of adding that file, start the tmp.mount unit by including
it in kata-containers.target.

With this and change in os-builder to not delete the tmp.mount unit,
(kata-containers/osbuilder#301) /tmp should now
be writeable for systemd based images.
For initrd this is handled by the agent itself.

Fixes kata-containers#565

Signed-off-by: Archana Shinde <[email protected]>
