Added support for InfluxDB batches of more than 1 and made the emission channel size configurable. #3937
Force-pushed from f521b64 to 8c16e04
atc/metric/emit.go (outdated)

```diff
-func Initialize(logger lager.Logger, host string, attributes map[string]string) error {
+func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {
```
As we always expect bufferSize to be uint anyway, wdyt of changing the signature to take an uint instead of int?
```suggestion
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize uint32) error {
```
It seems to me that it'd be better to do the cast at the `make(chan ...)` call that comes later, rather than at the moment we pass the value to Initialize 🤔
I will make this change - makes sense to me to constrain it with a uint32 and do the cast at the last possible moment.
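A minimal sketch of what the agreed change could look like (the types and the simplified signature here are stand-ins, not the real atc/metric code): the buffer size enters as `uint32`, and the cast to `int` happens only at the `make(chan ...)` call site.

```go
package main

import "fmt"

// Event and eventEmission are stand-ins for the real atc/metric types.
type Event struct{ Name string }
type eventEmission struct{ event Event }

var emissions chan eventEmission

// Initialize accepts the buffer size as uint32 so negative values are
// impossible at the API boundary; the cast to int is deferred to the
// make(chan ...) call, the last possible moment, as discussed above.
func Initialize(bufferSize uint32) {
	emissions = make(chan eventEmission, int(bufferSize))
}

func main() {
	Initialize(5000)
	fmt.Println(cap(emissions)) // channel is buffered to the configured size
}
```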
atc/metric/emit.go (outdated)

```go
func emit(logger lager.Logger, event Event) {
	logger.Debug("emit-event", lager.Data{
```
We already have a way of logging emitted metrics to stdout through the --emit-to-logs flag - did you want this log line for the purpose of checking whether metrics are still being emitted regardless of the configured emitter, or for the same purpose as --emit-to-logs?
This code is to check regardless of the configured emitter. If I remember correctly, only one of the emitters can be configured at a time, so we cannot have --emit-to-logs and InfluxDB emission configured at the same time.
I'm lukewarm on this; it seems like it'll be super noisy but I guess you can always filter it out. It's already at debug level but we have a decent amount of environments with debug logs enabled. Granted, we're used to maintaining filters to strip out the cruft.
If it was just added while you were developing and isn't useful anymore I would prefer we remove it, but if it was super useful in a pinch I'm fine with keeping it. 🤷♂️
atc/metric/emitter/influxdb.go (outdated)

```go
InsecureSkipVerify bool `long:"influxdb-insecure-skip-verify" description:"Skip SSL verification when emitting to InfluxDB."`

// https://github.com/influxdata/docs.influxdata.com/issues/454
```
Thanks for giving us the context on "why 5000"! That's very helpful.
Wdyt of leaving that to the PR though? As one could always trace the 5000 back to this PR, I'd say leaving that in the PR's comments (or even in the commit message) would be enough 🤔 In the codebase, we don't have many other cases where we add references like this.
I will remove the comments in the code and add them as a comment here:

influxdata/docs.influxdata.com-ARCHIVE#454
https://docs.influxdata.com/influxdb/v0.13/write_protocols/write_syntax/#write-a-batch-of-points-with-curl

5000 seems to be the batch size recommended by the InfluxDB team.
atc/metric/emitter/influxdb.go (outdated)

```go
	"influxdb-batch-duration": emitter.batchDuration,
	"current-duration":        duration,
})
go emitBatch(emitter, logger, batch)
```
I'm not very sure about making each emission a goroutine 🤔 If I understood correctly, this makes the buffer "infinite", as there's never really going to be anything that throttles the writes. Is this statement right? Maybe I got something wrong 😬
Making it a goroutine ensures that the construction of the batch points and writing to InfluxDB are immediately taken off the line of execution so that the emitter can continue to build up the next batch.
And the emitLoop function in emit.go still ensures that events are passed on in a synchronous way to the InfluxDBEmitter after they have been read from the channel.
So there will only ever be one open batch that will be closed and submitted for final processing when either the size or duration limit is reached. However, there can potentially be a number of closed batches that are in the process of being transformed into batch points or written to InfluxDB.
Does this answer your question? Or am I missing something?
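To make the handoff described above concrete, here is a self-contained sketch of the pattern (types and names are illustrative, not the actual emitter code): the open batch is copied, reset, and handed off to a goroutine so the loop can immediately start accumulating the next batch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

var (
	batch         []Event
	lastBatchTime = time.Now()
	wg            sync.WaitGroup

	mu        sync.Mutex
	submitted [][]Event // stand-in for batches written to InfluxDB
)

// submitBatch closes the current batch and hands it to a goroutine, so the
// caller can keep building the next batch while the closed one is being
// transformed into batch points and written out.
func submitBatch() {
	batchToSubmit := make([]Event, len(batch))
	copy(batchToSubmit, batch)
	batch = batch[:0]
	lastBatchTime = time.Now()

	wg.Add(1)
	go func(b []Event) {
		defer wg.Done()
		mu.Lock()
		submitted = append(submitted, b) // stand-in for the InfluxDB write
		mu.Unlock()
	}(batchToSubmit)
}

func main() {
	for i := 0; i < 3; i++ {
		batch = append(batch, Event{Name: fmt.Sprintf("e%d", i)})
	}
	submitBatch()
	wg.Wait()
	fmt.Println(len(submitted), len(submitted[0])) // one closed batch of three events
}
```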
@cirocosta I am busy working on a unit test for this logic. Also see my replies to the other comments.
```go
	influxclient "github.com/influxdata/influxdb1-client/v2"
)

// This is a copy of the github.com/influxdata/influxdb1-client/v2/client.Client interface whose sole purpose is
```
I only added this file as a last resort. As stated in the counterfeiter docs for third-party packages, I tried with

```go
//go:generate counterfeiter github.com/influxdata/influxdb1-client/v2/client.Client
```

and various other permutations of that, but none of them worked.
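The usual shape of this workaround is to declare a small local interface mirroring only the methods the emitter needs, and generate (or hand-roll) the fake against that. Here is a self-contained sketch; the method set and names are illustrative, not the full third-party interface:

```go
package main

import "fmt"

// BatchPoints stands in for influxclient.BatchPoints in this sketch.
type BatchPoints interface{}

// InfluxClient mirrors the subset of the third-party client interface the
// emitter actually calls, so a local fake can satisfy it in tests.
type InfluxClient interface {
	Write(bp BatchPoints) error
	Close() error
}

// fakeClient is a hand-rolled test double implementing the local interface.
type fakeClient struct{ writes int }

func (f *fakeClient) Write(bp BatchPoints) error { f.writes++; return nil }
func (f *fakeClient) Close() error               { return nil }

func main() {
	var c InfluxClient = &fakeClient{}
	c.Write(nil)
	fmt.Println(c.(*fakeClient).writes) // the fake recorded one write
}
```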
…on channel buffer size configurable. Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
…as been added to the PR comments Signed-off-by: Rudolf Visagie <[email protected]>
… the channel Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from 17d8b81 to 661bf9b
@cirocosta This PR is ready for review again
Hi @rudolfv! Sorry for the very long delay, I'll get back to it very soon! Update (21 Jun, 2019): we still haven't had the time to get to it. It's still in our backlog though!
@cirocosta Any updates on this?
@vito @cirocosta I see the stale[bot] is about to close this. Do you have any updates for me?
@rudolfv Sorry! 🙁 I'll poke the issue so the stale bot backs off. We'll be sure to get to this soon. @cirocosta has been pretty busy with other commitments, and I've at least been busy putting together our roadmap. Unfortunately we don't have very consistent bandwidth for PR reviews - we're trying to improve on this as a project, but it's slow going. I poked him about it yesterday; I'll try to get one of us to wrap this up soon depending on who's available.
Thank you @vito!
vito left a comment:

a tentative review to keep the ball rolling 🙂 sorry for the wait
atc/metric/emit.go (outdated)

```go
select {
case emissions <- eventEmission{logger: logger, event: event}:
	logger.Debug("emit-event-write-to-channel", lager.Data{
```
This on the other hand seems like it'd be a bit too noisy. 😅 Would you be OK with removing either this one or the above log?
(Also interesting that the above log line will run even with a nil emitter. Not sure if that's useful. I guess you'd notice it's not configured more quickly, but it also means those logs will show up even in places where metrics are undesired.)
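One possible shape that addresses both observations - dropping the noisy log and skipping the channel write entirely when no emitter is configured - is sketched below. This is hypothetical, not the actual emit.go; the flag and channel here are stand-ins:

```go
package main

import "fmt"

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

var (
	emitterConfigured bool              // stand-in for "is an emitter set up?"
	emissions         = make(chan Event, 2)
	dropped           int
)

// emit returns early when no emitter is configured, so nothing (logs or
// channel writes) happens in deployments where metrics are undesired; the
// non-blocking select drops events instead of blocking when the buffer fills.
func emit(ev Event) {
	if !emitterConfigured {
		return
	}
	select {
	case emissions <- ev:
	default:
		dropped++
	}
}

func main() {
	emit(Event{}) // no emitter configured: a silent no-op
	emitterConfigured = true
	for i := 0; i < 3; i++ {
		emit(Event{})
	}
	fmt.Println(len(emissions), dropped) // buffer holds 2, third event dropped
}
```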
atc/metric/emit.go (outdated)

```go
func emitLoop() {
	for emission := range emissions {
		emission.logger.Debug("emit-event-loop", lager.Data{
```
Also noisy - wouldn't this mean there are 3 logs emitted for every metric?
```go
	}
}

func (emitter *InfluxDBEmitter) SubmitBatch(logger lager.Logger) {
```
Just checking, is this thread-safe? I guess it is because all of the emits come from an emit loop? 🤔
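A toy illustration of why that reasoning holds: when every event funnels through a single consumer goroutine (like emitLoop), the emitter's batch state is only ever touched from that one goroutine, so no lock is needed. Types here are stand-ins, not the real code:

```go
package main

import "fmt"

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

// emitterState holds the open batch; it is never locked because only the
// single loop goroutine below ever calls Emit.
type emitterState struct{ batch []Event }

func (e *emitterState) Emit(ev Event) { e.batch = append(e.batch, ev) }

func main() {
	events := make(chan Event)
	done := make(chan int)
	state := &emitterState{}

	// the single consumer, analogous to emitLoop: serializes all access
	go func() {
		for ev := range events {
			state.Emit(ev)
		}
		done <- len(state.batch)
	}()

	for i := 0; i < 4; i++ {
		events <- Event{Name: "e"} // many producers could send here safely
	}
	close(events)
	fmt.Println(<-done) // all four events landed in the batch, race-free
}
```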
```go
copy(batchToSubmit, batch)
batch = make([]metric.Event, 0)
lastBatchTime = time.Now()
go emitBatch(emitter, logger, batchToSubmit)
```
I guess the idea behind this is to prevent a giant batch from submitting too slowly and causing the queue to fill up. Makes sense, since that was the original problem. But I wonder if batching would help mitigate that in and of itself, since submitting 5k at a time might cause less backpressure than 1 event 5000 times.
Have you observed any issues with this yet at large scale? It seems like the goroutines could potentially pile up due to a slow consumer.
I think @cirocosta had similar concerns on an earlier revision. I'm fine with how it is for now, but in the future we may want to add a max-in-flight or something. I guess at the end of the day slow consumers are hard to avoid, and this is why we've been thinking of standardizing on Prometheus. 😅
@vito We have not had any issues with this, but I understand your concerns. As a final test, I would suggest that once we have merged this and built a release candidate, we spin up a Concourse cluster similar to our production one and perform some load tests to give us a measure of confidence in all the 5.5 code changes.
thanks for looking at it @vito 🙌 sorry y'all for dropping the ball and taking so long on it :( :( :(
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from 04e3e9e to 666c14b
@vito I have removed all the extraneous logging. These are really all remnants of when I initially debugged the issue to find the exact point at which we were losing messages.

Also, from running it in our production pipeline I can confirm that having debug mode on creates an insane amount of logging whose usefulness is questionable. It makes sense to only add this again in a custom build if we need to troubleshoot something specific further down the line.
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from dc5c349 to dc224ff
vito left a comment:

👍 thanks for cleaning up those logs! I noticed one more that should probably be removed; after that I think it's good to go.
@cirocosta np!
Signed-off-by: Rudolf Visagie <[email protected]>
vito left a comment:

thanks! good to go once CI passes. 👍
Thanks, @vito. Any idea when the 5.5 release is targeted for?
Don't know yet; I would assume a week or two at worst. There are a few things left in the milestone.
Thanks @vito, that's good news!
@vito, where can I get hold of the 5.5 release candidate artifacts?
* influxdb batch size/duration and metrics buffer size concourse/concourse#3937
* max db connection pool size concourse/concourse#4232

Signed-off-by: Jamie Klassen <[email protected]>
Hi @rudolfv, you can find binaries by prefixing the
#3937 Signed-off-by: Jamie Klassen <[email protected]> Co-authored-by: James Thomson <[email protected]>
Thank you, @pivotal-jamie-klassen!
This commit adds the new parameters that were added to Concourse 5.5. Here's a breakdown of the new parameters:

- max-active-tasks-per-worker
  > used by the `limit-active-tasks` container placement strategy
  > concourse/concourse#4118
- support for influxdb batching and bigger buffer size for metrics emissions
  > concourse/concourse#3937
- limiting the number of max connections in db conn pools
  > concourse/concourse#4232

Signed-off-by: Ciro S. Costa <[email protected]>
Co-authored-by: Zoe Tian <[email protected]>
@vito We still need to test the latest changes (making all the magic numbers configurable) against one of our Concourse stacks, but I did manage to build a full release image in our slightly altered release pipeline. We are not running the upgrade or downgrade jobs, but everything else passes.