Skip to content

Add extra prometheus metrics#32792

Merged
thaJeztah merged 1 commit into
moby:masterfrom
rogaha:extra_prometheus_metrics
May 9, 2017
Merged

Add extra prometheus metrics#32792
thaJeztah merged 1 commit into
moby:masterfrom
rogaha:extra_prometheus_metrics

Conversation

@rogaha
Copy link
Copy Markdown
Contributor

@rogaha rogaha commented Apr 24, 2017

Signed-off-by: Roberto Gandolfo Hashioka [email protected]

- What I did

Added/updated prometheus metrics:

  • buildsTriggered: type Counter
  • buildsFailed: type LabeledCounter
    • dockerfile_syntax
    • dockerfile_empty
    • command_not_supported
      - error_processing_commands
      - build_target_not_reachable
      - missing_onbuild_arguments
      - unknown_instruction
      - build_canceled
  • engineInfo: type LabeledGauge

- How I did it

Extended the existing prometheus metrics using the docker/go-metrics package

- How to verify it

$ make binary

replace docker binaries with the ones located in bundles/17.06.0-dev/binary-daemon

- Description for the changelog

  • Added build & engine info prometheus metrics

- A picture of a cute animal (not mandatory but encouraged)
boo-cute-dog-pomeranian-favim com-452643

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented Apr 25, 2017

/cc @aluzzardi @cpuguy83 @crosbymichael

@aluzzardi
Copy link
Copy Markdown
Member

/cc @stevvooe @aaronlehmann

Comment thread api/server/router/build/metrics.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this metric should live in the builder, rather than the API server.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I'll move it there. thanks for the feedback @aaronlehmann!

Comment thread daemon/cluster/swarm.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these gauges would only be updated when an API user invokes the /info endpoint. I would expect them to update periodically regardless of API activity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I agree. It would be better to have it that way, I'm new to prometheus do you have an example on how to accomplish that?

Copy link
Copy Markdown
Contributor

@alfred-landrum alfred-landrum Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogaha - could create a custom collector that calls state.controlClient.ListNodes as needed when the metrics are scraped:
https://godoc.org/github.com/prometheus/client_golang/prometheus#Collector

see especially the example there.

prometheus has a set of "*Func" metrics that allow you to specify a function to be called at metric collect time (ie, when the metrics are being scraped). I don't see those exposed via go-metrics though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's perfect! It would be better if go-metrics had an interface for it. But here is how @crosbymichael used it: https://github.com/crosbymichael/cgroups/pull/8/files. I'll give it a try!

Comment thread daemon/cluster/metrics.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/The information/Information/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be part of swarmkit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronlehmann I'll update it. Thanks for the feedback.

@stevvooe, I think it could also live there. But I'll defer this decision to you guys. We can always move that later.

Comment thread daemon/cluster/metrics.go Outdated
Copy link
Copy Markdown

@aaronlehmann aaronlehmann Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of managers in the swarm

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronlehmann Should those be exported directly by swarmkit?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should; otherwise the engine would have to poll for them. In this PR they're updated every time /info is invoked, but that doesn't seem right.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronlehmann @aluzzardi can we work together to get these metrics into swarmkit? Would be ok to have these metrics in the engine for now while swarmkit is not exposing them?

Comment thread daemon/cluster/metrics.go Outdated
Copy link
Copy Markdown

@aaronlehmann aaronlehmann Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of nodes in the swarm

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update it!

Comment thread daemon/metrics.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document the expected values for these fields? For example, what is id? (I'm assuming that is a join key, but not sure)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevvooe this is the ID generated by the engine and it's expected to be persisted between upgrades. I'll add some comments there.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tied to node identity in a swarm? It looks like we have a swarm NodeID and an ID in docker info. I am not sure about our intention of supporting the id in docker info. For the most part, it should be stable but it is tied to a signing system (jwt) that we don't really use.

cc @dmcgowan

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not tied to swarm node identity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevvooe as @aaronlehmann mentioned, this ID is the related to the engine itself and it might not be part of a swarm.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I expect that will be part of the "swarm mode enabled by default" work. Until swarm mode is active out of the box, these can't really be converged.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogaha Okay, so it sounds like field is fine to use, as long as it is treated opaquely, as the format may change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. we are good even if the format changes! Thanks for the feedback @stevvooe! As @aaronlehmann mentioned, we will be able to converge the IDs once we have swarm mode enabled by default.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: graphdriver?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevvooe 👍

@rogaha rogaha force-pushed the extra_prometheus_metrics branch from cf26700 to c776ac2 Compare April 27, 2017 23:00
Comment thread builder/dockerfile/builder.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogaha, thoughts on doing something like:

defer func() {
  if err != nil {
    triggeredBuilds.WithValues("fail").UpdateSince(start)
  } else {
    triggeredBuilds.WithValues("success").UpdateSince(start)
  }
}()

seems cleaner

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for the feedback @tescherm. I'll do that!

Comment thread daemon/metrics.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider rewording to something like:

information related to the engine and the OS it is running on

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@rogaha rogaha force-pushed the extra_prometheus_metrics branch 3 times, most recently from dd8719a to 2a77fe3 Compare April 28, 2017 09:45
@thaJeztah
Copy link
Copy Markdown
Member

Looks like this needs a rebase

@aluzzardi
Copy link
Copy Markdown
Member

My 2cts: I think this should be a controller in swarmkit (swarmkit/manager/metrics, instrumentation, telemetry?) that uses the internal store API to export metrics.

@dongluochen @aaronlehmann Thoughts on that?

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented Apr 30, 2017

@aluzzardi I agree that those metrics should be in swarmkit, but they are not there yet! So nothing prevents us to expose the swarm info data at scrape time using the approach @aaronlehmann suggested (prometheus Collector). I'll remove the swarm related metrics from this PR as @alfred-landrum offered his help to implement the prometheus Collector to expose the engine/swarm info (https://godoc.org/github.com/prometheus/client_golang/prometheus#ex-Collector) -- he has already worked on that in the past.

@rogaha rogaha force-pushed the extra_prometheus_metrics branch from 2a77fe3 to 513daed Compare May 1, 2017 07:10
@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 1, 2017

@thaJeztah PR rebased

@aluzzardi
Copy link
Copy Markdown
Member

The metrics are being exposed under the engine namespace. Once they are exposed by swarmkit, the labels will be different

@aluzzardi
Copy link
Copy Markdown
Member

aluzzardi commented May 1, 2017

Also, once exposed by swarmkit, nodes will be exposed as a gauge labeled by state (in the same way the engine exposes containers by state).

Basically, we should have {nodes,services,tasks}_by_state in swarmkit itself.

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 1, 2017

@aluzzardi it sounds great! Are you ok if for the time being (while the swarmkit metrics are developed) we expose some of the swarm metrics via the engine/info? We are already doing that via the Docker API / CLI.

@aluzzardi
Copy link
Copy Markdown
Member

@rogaha The thing is, the names, namespace and labels will be different in swarmkit which will break whoever is scraping those.

Are we okay with this? If so, then it's fine.

Otherwise, I've put together a proof of concept here (it's not much more complicated than what has been done here):
moby/swarmkit#2157

Thoughts, @aaronlehmann @rogaha?

@cpuguy83
Copy link
Copy Markdown
Member

cpuguy83 commented May 1, 2017

I think it would be best to add to swarmkit.
There's no reason the swarmkit changes can't make the next release other than settling on namespace... which is a problem here as well.

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 2, 2017

@cpuguy83 agreed. I've removed the swarm related metrics from the PR.

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 2, 2017

The work done by @aluzzardi here seems to be very promising!

@rogaha rogaha force-pushed the extra_prometheus_metrics branch from 513daed to 3cfd793 Compare May 2, 2017 19:48
Comment thread daemon/metrics.go Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be changed to something longer than just "id". Without reading this PR discussion I'd have no idea what this is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about daemon_uuid?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sounds good. I'll change it to daemon_uuid.

@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 8, 2017

The unit-tests failures don't seem to be related to the changes in this PR:

00:36:12 ----------------------------------------------------------------------
00:36:12 FAIL: check_test.go:355: DockerSwarmSuite.TearDownTest
00:36:12 
00:36:12 unmount of /tmp/docker-execroot/d373fc5c29f3c/netns failed: invalid argument
00:36:12 unmount of /tmp/docker-execroot/d373fc5c29f3c/netns failed: no such file or directory
00:36:12 unmount of /tmp/docker-execroot/d5f595cc1336f/netns failed: no such file or directory
00:36:12 check_test.go:360:
00:36:12     d.Stop(c)
00:36:12 daemon/daemon.go:392:
00:36:12     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:36:12 ... Error: Error while stopping the daemon df643cfe854d0 : exit status 2
00:43:44 ----------------------------------------------------------------------
00:43:44 FAIL: docker_cli_swarm_test.go:1371: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
00:43:44 
00:43:44 [d728ab7ffcd95] waiting for daemon to start
00:43:44 [d728ab7ffcd95] daemon started
00:43:44 
00:43:44 [d8a9593b13a08] waiting for daemon to start
00:43:44 [d8a9593b13a08] daemon started
00:43:44 
00:43:44 [d1573de56bd46] waiting for daemon to start
00:43:44 [d1573de56bd46] daemon started
00:43:44 
00:43:44 [d8a9593b13a08] exiting daemon
00:43:44 [d8a9593b13a08] waiting for daemon to start
00:43:44 [d8a9593b13a08] daemon started
00:43:44 
00:43:44 [d1573de56bd46] exiting daemon
00:43:44 [d1573de56bd46] waiting for daemon to start
00:43:44 [d1573de56bd46] daemon started
00:43:44 
00:43:44 docker_cli_swarm_test.go:1448:
00:43:44     c.Assert(err, checker.IsNil)
00:43:44 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc421923ec0), Stderr:[]uint8(nil)} ("exit status 1")
00:43:44 
00:43:44 [d728ab7ffcd95] exiting daemon
00:43:44 [d8a9593b13a08] exiting daemon
00:43:44 [d1573de56bd46] exiting daemon
00:37:51 ----------------------------------------------------------------------
00:37:51 FAIL: check_test.go:355: DockerSwarmSuite.TearDownTest
00:37:51 
00:37:51 unmount of /tmp/docker-execroot/d711731fa58b2/netns failed: invalid argument
00:37:51 unmount of /tmp/docker-execroot/d711731fa58b2/netns failed: no such file or directory
00:37:51 check_test.go:360:
00:37:51     d.Stop(c)
00:37:51 daemon/daemon.go:392:
00:37:51     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:37:51 ... Error: Error while stopping the daemon def4990e6fdf2 : exit status 2

@crosbymichael PTAL

Comment thread builder/dockerfile/metrics.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make constants for the various metrics labels, otherwise we could end up with a panic somewhere in the code due to typos, or accidentally add a new metric label without initializing it first.

Could potentially be in a separate package like builder/internal/metrics

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion @cpuguy83! I should have done that in the first place to avoid the issues you mentioned. I've created a package called internal with builder/internal/constants.go file in it. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally not because it'll just become a trash-heap of places to put stuff.
And ideally these constants would be namespaced with metrics.
So it'd be something likemetrics.DockerfileSyntaxError

We can initialize all this stuff there.

Copy link
Copy Markdown
Contributor Author

@rogaha rogaha May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpuguy83 ok, as we talked on Slack, I’ve removed the internal package and moved the constants file within the dockerfile package. They look like this now:

const (
	metricsDockerfileSyntaxError        = "dockerfile_syntax_error"
.....
)

@rogaha rogaha force-pushed the extra_prometheus_metrics branch 2 times, most recently from 92ad2b7 to ae92bde Compare May 8, 2017 14:54
@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 8, 2017

@cpuguy83 PTAL

@crosbymichael
Copy link
Copy Markdown
Contributor

LGTM

@rogaha rogaha force-pushed the extra_prometheus_metrics branch 3 times, most recently from 01d3fba to af00c52 Compare May 8, 2017 23:54
Comment thread builder/dockerfile/constants.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a comment explaining that values must be initialized in builder/dockerfile/metrics.go... it may be good to put these in metrics.go even.

We should explain how to add new metrics to this package so we don't end up with random stuff added.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sounds good. I've updated it.

 - buildsTriggered
 - buildsFailed
    - valid options:
        metricsDockerfileSyntaxError,
        metricsDockerfileEmptyError,
        metricsCommandNotSupportedError,
        metricsErrorProcessingCommandsError,
        metricsBuildTargetNotReachableError,
        metricsMissingOnbuildArgumentsError,
        metricsUnknownInstructionError,
        metricsBuildCanceled,
- engineInfo

Signed-off-by: Roberto Gandolfo Hashioka <[email protected]>
@rogaha rogaha force-pushed the extra_prometheus_metrics branch from af00c52 to a28b173 Compare May 9, 2017 08:05
Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM code+design

@thaJeztah
Copy link
Copy Markdown
Member

PowerPC failure is a flaky test;

09:13:59 ----------------------------------------------------------------------
09:13:59 FAIL: check_test.go:355: DockerSwarmSuite.TearDownTest
09:13:59 
09:13:59 unmount of /tmp/docker-execroot/df71d386c61af/netns failed: invalid argument
09:13:59 unmount of /tmp/docker-execroot/df71d386c61af/netns failed: no such file or directory
09:13:59 unmount of /tmp/docker-execroot/dfb4d9d9b9d35/netns failed: no such file or directory
09:13:59 check_test.go:360:
09:13:59     d.Stop(c)
09:13:59 daemon/daemon.go:392:
09:13:59     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
09:13:59 ... Error: Error while stopping the daemon d821ee3d36cc5 : exit status 2

Let me trigger once

@thaJeztah
Copy link
Copy Markdown
Member

All 💚

@thaJeztah thaJeztah merged commit 1a6f8a9 into moby:master May 9, 2017
@rogaha
Copy link
Copy Markdown
Contributor Author

rogaha commented May 9, 2017

thanks @thaJeztah!

@rogaha rogaha deleted the extra_prometheus_metrics branch May 9, 2017 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.