Track distribution of row write rates with new HashBucketHistogram.#261
Conversation
Tested locally and seems to work.
```go
entries := 0
for _, bucket := range c.bigBuckets(chunk.From, chunk.Through) {
	hashValue := hashValue(userID, bucket.bucket, metricName)
	rowWrites.Observe(hashValue, uint32(len(chunk.Metric)))
```
I want to double check: is this likely to end up with us distributing across hash buckets in a way that's aligned to how we're distributing across dynamo? If so, won't that make the measurements less useful?
> is this likely to end up with us distributing across hash buckets in a way that's aligned to how we're distributing across dynamo?
That's unknown, as we don't know how DynamoDB distributes across partitions.
> If so, won't that make the measurements less useful?
Less useful, maybe, but I still think quite useful: this will give us good information on the distribution of our write load within the hash space, and any massive outliers should show up.
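To make the point above concrete, here is a minimal sketch of how observing into fixed hash buckets reveals load distribution. The names (`hashBuckets`, `observe`) and the bucket count are illustrative, not taken from the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// numBuckets is an illustrative bucket count; the real value may differ.
const numBuckets = 1024

// hashBuckets counts writes per hash bucket, independent of how DynamoDB
// partitions the same hash space.
type hashBuckets struct {
	buckets [numBuckets]uint32
}

// observe adds a write count to the bucket the row's hash value maps to.
func (h *hashBuckets) observe(hashValue uint32, writes uint32) {
	atomic.AddUint32(&h.buckets[hashValue%numBuckets], writes)
}

func main() {
	h := &hashBuckets{}
	h.observe(42, 3)
	h.observe(42+numBuckets, 2) // collides with 42 mod numBuckets
	fmt.Println(h.buckets[42])  // prints 5
}
```

A bucket that accumulates far more writes than the others is a hot row in the hash space, regardless of DynamoDB's partitioning.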
```go
rowWrites = util.NewHashBucketHistogram(util.HashBucketHistogramOpts{
	HistogramOpts: prometheus.HistogramOpts{
		Namespace: "cortex",
		Name:      "chunk_store_row_write_total",
```
`_total` is a suffix reserved for counters by convention, not for histograms, which get additional `_sum` and `_count` counters created automatically. I'd call this `chunk_store_row_writes_distribution` or something.
```go
// Collect implements prometheus.Metric
func (h *hashBucketHistogram) Collect(c chan<- prometheus.Metric) {
	for i := range h.buckets {
		h.Histogram.Observe(float64(atomic.SwapUint32(&h.buckets[i], 0)))
```
So what your histogram observes depends totally on your scrape rate, missed scrapes, and whether you scrape from multiple Prometheus servers in parallel. This is obviously funky and not normally recommended Prometheus metrics usage, but I guess you know that?
Also, the `xxx_count` will be off.
Yeah, this is a bit of a hack for now, and suggestions to make it more robust are welcome! I'm thinking of a goroutine which dumps the buckets into the histogram and resets them every second or something?
Also, to normalise this a little, I think it would be useful to count all writes (a single counter), and then express each bucket's writes as a proportion of that.
Or better yet, to cancel out the effect the number of buckets has on that, multiply by the number of buckets too, so 1 would be perfectly load-balanced, more than 1 skewed, etc.
Part of #254