Add extra prometheus metrics#32792
Conversation
There was a problem hiding this comment.
I wonder if this metric should live in the builder, rather than the API server.
There was a problem hiding this comment.
Indeed, I'll move it there. thanks for the feedback @aaronlehmann!
There was a problem hiding this comment.
It looks like these gauges would only be updated when an API user invokes the /info endpoint. I would expect them to update periodically regardless of API activity.
There was a problem hiding this comment.
yep, I agree. It would be better to have it that way, I'm new to prometheus do you have an example on how to accomplish that?
There was a problem hiding this comment.
@rogaha - could create a custom collector that calls state.controlClient.ListNodes as needed when the metrics are scraped:
https://godoc.org/github.com/prometheus/client_golang/prometheus#Collector
see especially the example there.
prometheus has a set of "*Func" metrics that allow you to specify a function to be called at metric collect time (ie, when the metrics are being scraped). I don't see those exposed via go-metrics though.
There was a problem hiding this comment.
Indeed, that's perfect! It would be better if go-metrics had an interface for it. But here is how @crosbymichael used it: https://github.com/crosbymichael/cgroups/pull/8/files. I'll give it a try!
There was a problem hiding this comment.
Seems like this should be part of swarmkit.
There was a problem hiding this comment.
@aaronlehmann I'll update it. Thanks for the feedback.
@stevvooe, I think it could also live there. But I'll defer this decision to you guys. We can always move that later.
There was a problem hiding this comment.
Number of managers in the swarm
There was a problem hiding this comment.
@aaronlehmann Should those be exported directly by swarmkit?
There was a problem hiding this comment.
I think they should; otherwise the engine would have to poll for them. In this PR they're updated every time /info is invoked, but that doesn't seem right.
There was a problem hiding this comment.
@aaronlehmann @aluzzardi can we work together to get these metrics into swarmkit? Would be ok to have these metrics in the engine for now while swarmkit is not exposing them?
There was a problem hiding this comment.
Could you document the expected values for these fields? For example, what is id? (I'm assuming that is a join key, but not sure)
There was a problem hiding this comment.
@stevvooe this is the ID generated by the engine and it's expected to be persisted between upgrades. I'll add some comments there.
There was a problem hiding this comment.
Is this tied to node identity in a swarm? It looks like we have a swarm NodeID and an ID in docker info. I am not sure about our intention of supporting the id in docker info. For the most part, it should be stable but it is tied to a signing system (jwt) that we don't really use.
cc @dmcgowan
There was a problem hiding this comment.
It is not tied to swarm node identity.
There was a problem hiding this comment.
@stevvooe as @aaronlehmann mentioned, this ID is the related to the engine itself and it might not be part of a swarm.
There was a problem hiding this comment.
Yes, I expect that will be part of the "swarm mode enabled by default" work. Until swarm mode is active out of the box, these can't really be converged.
There was a problem hiding this comment.
@rogaha Okay, so it sounds like field is fine to use, as long as it is treated opaquely, as the format may change.
There was a problem hiding this comment.
sounds good. we are good even if the format changes! Thanks for the feedback @stevvooe! As @aaronlehmann mentioned, we will be able to converge the IDs once we have swarm mode enabled by default.
cf26700 to
c776ac2
Compare
There was a problem hiding this comment.
@rogaha, thoughts on doing something like:
defer func() {
if err != nil {
triggeredBuilds.WithValues("fail").UpdateSince(start)
} else {
triggeredBuilds.WithValues("success").UpdateSince(start)
}
}()seems cleaner
There was a problem hiding this comment.
+1, thanks for the feedback @tescherm. I'll do that!
There was a problem hiding this comment.
consider rewording to something like:
information related to the engine and the OS it is running on
dd8719a to
2a77fe3
Compare
|
Looks like this needs a rebase |
|
My 2cts: I think this should be a controller in swarmkit (swarmkit/manager/metrics, instrumentation, telemetry?) that uses the internal store API to export metrics. @dongluochen @aaronlehmann Thoughts on that? |
|
@aluzzardi I agree that those metrics should be in swarmkit, but they are not there yet! So nothing prevents us to expose the swarm info data at scrape time using the approach @aaronlehmann suggested (prometheus Collector). I'll remove the swarm related metrics from this PR as @alfred-landrum offered his help to implement the prometheus Collector to expose the engine/swarm info (https://godoc.org/github.com/prometheus/client_golang/prometheus#ex-Collector) -- he has already worked on that in the past. |
2a77fe3 to
513daed
Compare
|
@thaJeztah PR rebased |
|
The metrics are being exposed under the engine namespace. Once they are exposed by swarmkit, the labels will be different |
|
Also, once exposed by swarmkit, nodes will be exposed as a gauge labeled by state (in the same way the engine exposes containers by state). Basically, we should have |
|
@aluzzardi it sounds great! Are you ok if for the time being (while the swarmkit metrics are developed) we expose some of the |
|
@rogaha The thing is, the names, namespace and labels will be different in swarmkit which will break whoever is scraping those. Are we okay with this? If so, then it's fine. Otherwise, I've put together a proof of concept here (it's not much more complicated than what has been done here): Thoughts, @aaronlehmann @rogaha? |
|
I think it would be best to add to swarmkit. |
|
@cpuguy83 agreed. I've removed the |
|
The work done by @aluzzardi here seems to be very promising! |
513daed to
3cfd793
Compare
There was a problem hiding this comment.
This should be changed to something longer than just "id". Without reading this PR discussion I'd have no idea what this is.
There was a problem hiding this comment.
yep, sounds good. I'll change it to daemon_uuid.
|
The unit-tests failures don't seem to be related to the changes in this @crosbymichael PTAL |
There was a problem hiding this comment.
Maybe we should make constants for the various metrics labels, otherwise we could end up with a panic somewhere in the code due to typos, or accidentally add a new metric label without initializing it first.
Could potentially be in a separate package like builder/internal/metrics
There was a problem hiding this comment.
Great suggestion @cpuguy83! I should have done that in the first place to avoid the issues you mentioned. I've created a package called internal with builder/internal/constants.go file in it. WDYT?
There was a problem hiding this comment.
Ideally not because it'll just become a trash-heap of places to put stuff.
And ideally these constants would be namespaced with metrics.
So it'd be something likemetrics.DockerfileSyntaxError
We can initialize all this stuff there.
There was a problem hiding this comment.
@cpuguy83 ok, as we talked on Slack, I’ve removed the internal package and moved the constants file within the dockerfile package. They look like this now:
const (
metricsDockerfileSyntaxError = "dockerfile_syntax_error"
.....
)
92ad2b7 to
ae92bde
Compare
|
@cpuguy83 PTAL |
|
LGTM |
01d3fba to
af00c52
Compare
There was a problem hiding this comment.
Maybe a comment explaining that values must be initialized in builder/dockerfile/metrics.go... it may be good to put these in metrics.go even.
We should explain how to add new metrics to this package so we don't end up with random stuff added.
There was a problem hiding this comment.
yep, sounds good. I've updated it.
- buildsTriggered
- buildsFailed
- valid options:
metricsDockerfileSyntaxError,
metricsDockerfileEmptyError,
metricsCommandNotSupportedError,
metricsErrorProcessingCommandsError,
metricsBuildTargetNotReachableError,
metricsMissingOnbuildArgumentsError,
metricsUnknownInstructionError,
metricsBuildCanceled,
- engineInfo
Signed-off-by: Roberto Gandolfo Hashioka <[email protected]>
af00c52 to
a28b173
Compare
|
PowerPC failure is a flaky test; Let me trigger once |
|
All 💚 |
|
thanks @thaJeztah! |
Signed-off-by: Roberto Gandolfo Hashioka [email protected]
- What I did
Added/updated
prometheusmetrics:buildsTriggered: typeCounterbuildsFailed: typeLabeledCounterdockerfile_syntaxdockerfile_emptycommand_not_supported-
error_processing_commands-
build_target_not_reachable-
missing_onbuild_arguments-
unknown_instruction-
build_canceledengineInfo: typeLabeledGauge- How I did it
Extended the existing
prometheusmetrics using thedocker/go-metricspackage- How to verify it
replace
dockerbinaries with the ones located inbundles/17.06.0-dev/binary-daemon- Description for the changelog
Added build & engine info prometheus metrics- A picture of a cute animal (not mandatory but encouraged)
