Return an http error during scraping if metrics collide when escaped to underscores#1641
Return an http error during scraping if metrics collide when escaped to underscores#1641
Conversation
|
So the edge case here is:
This should collide, but because the first metric's hash was Compat mode, it doesn't match the hash of the second metric, and we don't detect the collision. I think this is another case where we can say "don't do that", but I'm not sure how best to define what "that" is. In general, I think users should not define metrics in init() functions. OR, we could say, don't define UTF-8 metrics in init functions. |
|
Another option is to store hashes for both UTF-8 and compat. This adds 64bits per metric and another collectorsByID map (since we'll need to store both the compat and utf8 collector ids depending on the mode). don't love that either |
|
I am leaning towards having a second map in the registerer that only checks for compat fqname+labelname collisions. It's a memory increase but it gets around all these issues. |
prometheus/desc.go
Outdated
| } | ||
|
|
||
| d.id, d.dimHash = makeHashes(labelNames, labelValues, help, true) | ||
| d.compatID, d.compatDimHash = makeHashes(labelNames, labelValues, help, false) |
There was a problem hiding this comment.
this mutates labelNames so it's not pretty
prometheus/registry.go
Outdated
| compatDimHashesByName map[string]uint64 | ||
| uncheckedCollectors []Collector | ||
| pedanticChecksEnabled bool | ||
| utf8Collision bool |
There was a problem hiding this comment.
So we're adding a uint64, empty struct, and Collector interface obj for every metric registered. This could be a lot of memory with many millions of metrics. Perhaps we could only store the compat IDs if they are different from the default?
There was a problem hiding this comment.
(I have now implemented this)
prometheus/registry.go
Outdated
| defer r.mtx.Unlock() | ||
|
|
||
| delete(r.collectorsByID, collectorID) | ||
| if _, exists := r.collectorsByCompatID[compatID]; !exists { |
There was a problem hiding this comment.
there is other cleanup needed
|
I think this new approach is a decent solution. We can flip UTF-8 mode on and off at will and we don't pay a huge memory or performance penalty (that I can see) |
prometheus/desc.go
Outdated
| // used as an identifier of the descriptor. | ||
| id uint64 | ||
| // compatID is similar to id, but is a hash of all the relevant names escaped with underscores. | ||
| compatID uint64 |
There was a problem hiding this comment.
When I first saw this variable I thought it was a typo for compactID, but I see you're are using compact is other places as well.
What does compat means? 😅
There was a problem hiding this comment.
And still related to this variable, have we thought about using id to store the hash of the already normalized descriptor(instead of creating a new hash)?
There was a problem hiding this comment.
"compatible." as in, "compatible with legacy systems." I need to pick better names for sure
There was a problem hiding this comment.
have we thought about using id to store the hash of the already normalized descriptor?
do you mean only storing the normalized descriptor? We need both in case the user switches the mode to UTF-8 comparison with no legacy compatibility. Or do you mean something else?
There was a problem hiding this comment.
Yes, I meant just storing the normalized descriptor. Good point that's it's possible to change the validation in-flight!
There was a problem hiding this comment.
Not sure why someone would like to do that though 🤔
There was a problem hiding this comment.
for the record (as discussed in slack), it's because metrics are often created in init() functions before the user has a chance to actually configure the correct escaping scheme. So we have to be ready for both scenarios.
b08de89 to
8e9ef8c
Compare
97cad93 to
118a3c1
Compare
prometheus/promhttp/http.go
Outdated
| if hasEscapedCollisions { | ||
| switch contentType.ToEscapingScheme() { | ||
| case model.UnderscoreEscaping, model.DotsEscaping: | ||
| opts.ErrorLog.Println("error: one or more metrics collide when escaped") |
There was a problem hiding this comment.
need a better error message?
There was a problem hiding this comment.
maybe something like
"One or more metrics collide when UTF-8 characters in name/label are translated to underscores"?
ArthurSens
left a comment
There was a problem hiding this comment.
Looking good!
Since we always allow registration regardless of collision, I would just move the test closer to the HTTP package. We can keep the same test cases you've added, but the test would look similar to this:
func TestEscapedCollisions(t *testing.T) {
reg := prometheus.NewRegistry()
reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
Name: "test_metric",
Help: "A test metric with underscores",
}))
reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
Name: "test.metric",
Help: "A test metric with dots",
}))
handler := HandlerFor(reg, HandlerOpts{})
writer := httptest.NewRecorder()
request, _ := http.NewRequest("GET", "/metrics", nil)
request.Header.Add(acceptEncodingHeader, "the header that requires escaping")
handler.ServeHTTP(writer, request)
require.Equal(t, "some http error", writer.Code)
request, _ := http.NewRequest("GET", "/metrics", nil)
request.Header.Add(acceptEncodingHeader, "the header that allows utf8")
handler.ServeHTTP(writer, request)
require.Equal(t, "no erro error", writer.Code)
}
I'd like to keep the current unit test since the implementation is a little weird, but I will add this too. |
|
I'll squash this down one we're done with notes |
prometheus/promhttp/http_test.go
Outdated
| Help: "A test metric with dots", | ||
| })) | ||
|
|
||
| handler := HandlerFor(reg, HandlerOpts{ |
There was a problem hiding this comment.
this is the sucky part, it's necessary to specify the registry here or else we don't detect the collision case (because Gatherer does not know about collisions)
aad8c01 to
7f3763c
Compare
ArthurSens
left a comment
There was a problem hiding this comment.
LGTM!
Just one comment about adding another dependency, sorry about that 😬
prometheus/registry_test.go
Outdated
| require.NoError(t, err) | ||
| err = reg.Register(tc.counterB()) | ||
| if !tc.expectErr { | ||
| require.NoError(t, err) | ||
| } else { | ||
| require.Error(t, err) | ||
| } | ||
| require.Equal(t, tc.expectLegacyCollision, reg.HasEscapedCollision()) |
There was a problem hiding this comment.
Sorry for creating the confusion here, that's my bad! Bartek made a good point for not using require here, would you mind reverting the change? 🙈
We always have problems when we introduce new dependencies, let's avoid it if we can
…to underscores Signed-off-by: Owen Williams <[email protected]>
7f3763c to
42825b6
Compare
Signed-off-by: Owen Williams <[email protected]>
| } | ||
| } | ||
| } | ||
| hasEscapedCollisions = reg.HasEscapedCollision() |
There was a problem hiding this comment.
hmm... should I use the singular or the plural? :)
| Help: "A test metric with dots", | ||
| })) | ||
|
|
||
| handler := HandlerFor(reg, HandlerOpts{}) |
| return gf() | ||
| } | ||
|
|
||
| func (gf GathererFunc) HasEscapedCollision() bool { |
There was a problem hiding this comment.
this is the most risky change. both mimir and avalanche use gathererfunc
There was a problem hiding this comment.
nevermind, avalanche doesn't use it
| id uint64 | ||
| // escapedID is similar to id, but with the metric and label names escaped | ||
| // with underscores. | ||
| escapedID uint64 |
There was a problem hiding this comment.
Why not use id for escaped ID?
| escapedXXH.Write(separatorByteSlice) | ||
| } | ||
| d.id = xxh.Sum64() | ||
| d.escapedID = escapedXXH.Sum64() |
There was a problem hiding this comment.
Again, why not use id as escaped ID always?
bwplotka
left a comment
There was a problem hiding this comment.
Thanks for this!
I think I would love to understand this change more before committing here. Asked some likely stupid questions on the issue #1638 (comment)
We never did those kind of checks on HTTP handler only, only on registration layer, I wonder if it's worth to keep that. Having your app working fine and suddenly /metrics are erroring only because 2 series collide is bit unexpected (especially by default) and wasteful, scrape could just ignore one problematic metric.
If anything we could start with a special Gatherer that checks that, without changes to descriptor ever, and see who would actually use it (who would like to opt-in for extra safety here). I would also be happy to consider adding this logic on-registry check/error when escaped metric name is colliding OR only check escaped ones (is there a case when escaped don't collide, but non-escaped collide?)
Is it worth to double check those first @ArthurSens ?
|
We are going to put this on hold and have a meeting to discuss the various approaches and their pros and cons |
fixes #1638
This change prevents a possible issue when UTF-8 metrics are scraped by a system that does not support UTF-8. The metric names will be escaped, and if two metrics escape to the same legacy name, they will appear in the exposition "twice", causing collision problems. Therefore, during scraping we return an http error if two metrics escape to the same legacy string.
This change creates a new uint64 per metric and two new maps in the Registry. The new maps are only populated if the hash of the original metric data contains UTF-8 data. This means memory usage will be lower in the near term and will go up in the long term or for OTLP-heavy users. Eventually (i.e. Prom 4.0 or later) we could deprecate the compatibility modes with legacy systems to save this memory, but it's also not clear to me that the memory usage will be that high.
To avoid performance issues checking for collisions on every scrape, the actual collision detection is done at registration time but only reported at scrape time.