Added support for InfluxDB batches of more than 1 and made the emission channel size configurable. #3937
Force-pushed from f521b64 to 8c16e04
atc/metric/emit.go (outdated)

```diff
-func Initialize(logger lager.Logger, host string, attributes map[string]string) error {
+func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {
```
As we always expect bufferSize to be uint anyway, wdyt of changing the signature to take an uint instead of int?
```suggestion
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize uint32) error {
```
It seems to me that it'd be better to do the cast at the `make(chan ...)` call that comes later, rather than at the moment we pass the value to Initialize 🤔
I will make this change - makes sense to me to constrain it with a uint32 and do the cast at the last possible moment.
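A minimal sketch of what the agreed change could look like (the types and the simplified signature here are stand-ins, not the real atc/metric code): the buffer size enters as `uint32`, and the cast to `int` happens only at the `make(chan ...)` call site.

```go
package main

import "fmt"

// Event and eventEmission are stand-ins for the real atc/metric types.
type Event struct{ Name string }
type eventEmission struct{ event Event }

var emissions chan eventEmission

// Initialize accepts the buffer size as uint32 so negative values are
// impossible at the API boundary; the cast to int is deferred to the
// make(chan ...) call, the last possible moment, as discussed above.
func Initialize(bufferSize uint32) {
	emissions = make(chan eventEmission, int(bufferSize))
}

func main() {
	Initialize(5000)
	fmt.Println(cap(emissions)) // channel is buffered to the configured size
}
```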
atc/metric/emit.go (outdated)

```go
func emit(logger lager.Logger, event Event) {
	logger.Debug("emit-event", lager.Data{
```
We already have a way of logging emitted metrics to stdout through the --emit-to-logs flag - did you want this log line for the purpose of checking whether metrics are still being emitted regardless of the configured emitter, or for the same purpose as --emit-to-logs?
This code is to check regardless of the configured emitter. If I remember correctly, only one of the emitters can be configured at a time, so we cannot have --emit-to-logs and InfluxDB emission configured at the same time.
I'm lukewarm on this; it seems like it'll be super noisy but I guess you can always filter it out. It's already at debug level but we have a decent amount of environments with debug logs enabled. Granted, we're used to maintaining filters to strip out the cruft.
If it was just added while you were developing and isn't useful anymore I would prefer we remove it, but if it was super useful in a pinch I'm fine with keeping it. 🤷♂️
atc/metric/emitter/influxdb.go (outdated)

```go
InsecureSkipVerify bool `long:"influxdb-insecure-skip-verify" description:"Skip SSL verification when emitting to InfluxDB."`

// https://github.com/influxdata/docs.influxdata.com/issues/454
```
Thanks for giving us the context on "why 5000"! That's very helpful.
Wdyt of leaving that to the PR though? As one could always trace the 5000 back to this PR, I'd say leaving that in the PR's comments (or even in the commit message) would be enough 🤔 In the codebase, we don't have many other cases where we add references like this.
I will remove the comments in the code and add them as a comment here:

influxdata/docs.influxdata.com-ARCHIVE#454
https://docs.influxdata.com/influxdb/v0.13/write_protocols/write_syntax/#write-a-batch-of-points-with-curl

5000 seems to be the batch size recommended by the InfluxDB team.
atc/metric/emitter/influxdb.go (outdated)

```go
	"influxdb-batch-duration": emitter.batchDuration,
	"current-duration":        duration,
})
go emitBatch(emitter, logger, batch)
```
I'm not very sure about making each emission a goroutine 🤔 If I understood correctly, this makes the buffer "infinite", as there's never really going to be anything that throttles the writes. Is this statement right? Maybe I got something wrong 😬
Making it a goroutine ensures that the construction of the batch points and writing to InfluxDB are immediately taken off the line of execution so that the emitter can continue to build up the next batch.
And the emitLoop function in emit.go still ensures that events are passed on in a synchronous way to the InfluxDBEmitter after they have been read from the channel.
So there will only ever be one open batch that will be closed and submitted for final processing when either the size or duration limit is reached. However, there can potentially be a number of closed batches that are in the process of being transformed into batch points or written to InfluxDB.
Does this answer your question? Or am I missing something?
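To make the handoff described above concrete, here is a self-contained sketch of the pattern (types and names are illustrative, not the actual emitter code): the open batch is copied, reset, and handed off to a goroutine so the loop can immediately start accumulating the next batch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

var (
	batch         []Event
	lastBatchTime = time.Now()
	wg            sync.WaitGroup

	mu        sync.Mutex
	submitted [][]Event // stand-in for batches written to InfluxDB
)

// submitBatch closes the current batch and hands it to a goroutine, so the
// caller can keep building the next batch while the closed one is being
// transformed into batch points and written out.
func submitBatch() {
	batchToSubmit := make([]Event, len(batch))
	copy(batchToSubmit, batch)
	batch = batch[:0]
	lastBatchTime = time.Now()

	wg.Add(1)
	go func(b []Event) {
		defer wg.Done()
		mu.Lock()
		submitted = append(submitted, b) // stand-in for the InfluxDB write
		mu.Unlock()
	}(batchToSubmit)
}

func main() {
	for i := 0; i < 3; i++ {
		batch = append(batch, Event{Name: fmt.Sprintf("e%d", i)})
	}
	submitBatch()
	wg.Wait()
	fmt.Println(len(submitted), len(submitted[0])) // one closed batch of three events
}
```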
@cirocosta I am busy working on a unit test for this logic. Also see my replies to the other comments.
```go
	influxclient "github.com/influxdata/influxdb1-client/v2"
)

// This is a copy of the github.com/influxdata/influxdb1-client/v2/client.Client interface whose sole purpose is
```
I only added this file as a last resort. As stated in the counterfeiter docs for third-party packages, I tried with

```go
//go:generate counterfeiter github.com/influxdata/influxdb1-client/v2/client.Client
```

and various other permutations of that, but none of them worked.
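The usual shape of this workaround is to declare a small local interface mirroring only the methods the emitter needs, and generate (or hand-roll) the fake against that. Here is a self-contained sketch; the method set and names are illustrative, not the full third-party interface:

```go
package main

import "fmt"

// BatchPoints stands in for influxclient.BatchPoints in this sketch.
type BatchPoints interface{}

// InfluxClient mirrors the subset of the third-party client interface the
// emitter actually calls, so a local fake can satisfy it in tests.
type InfluxClient interface {
	Write(bp BatchPoints) error
	Close() error
}

// fakeClient is a hand-rolled test double implementing the local interface.
type fakeClient struct{ writes int }

func (f *fakeClient) Write(bp BatchPoints) error { f.writes++; return nil }
func (f *fakeClient) Close() error               { return nil }

func main() {
	var c InfluxClient = &fakeClient{}
	c.Write(nil)
	fmt.Println(c.(*fakeClient).writes) // the fake recorded one write
}
```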
…on channel buffer size configurable. Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
…as been added to the PR comments Signed-off-by: Rudolf Visagie <[email protected]>
… the channel Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from 17d8b81 to 661bf9b
@cirocosta This PR is ready for review again
Hi @rudolfv! Sorry for the very long delay, I'll get back to it very soon! Update (21 Jun, 2019): we still haven't had the time to get to it. It's still in our backlog though!
@cirocosta Any updates on this?
@vito @cirocosta I see the stale[bot] is about to close this. Do you have any updates for me?
@rudolfv Sorry! 🙁 I'll poke the issue so the stale bot backs off. We'll be sure to get to this soon. @cirocosta has been pretty busy with other commitments, and I've at least been busy putting together our roadmap. Unfortunately we don't have very consistent bandwidth for PR reviews - we're trying to improve on this as a project, but it's slow going. I poked him about it yesterday; I'll try to get one of us to wrap this up soon depending on who's available.
Thank you @vito!
vito left a comment:

a tentative review to keep the ball rolling 🙂 sorry for the wait
atc/metric/emit.go (outdated)

```go
select {
case emissions <- eventEmission{logger: logger, event: event}:
	logger.Debug("emit-event-write-to-channel", lager.Data{
```
This on the other hand seems like it'd be a bit too noisy. 😅 Would you be OK with removing either this one or the above log?
(Also interesting that the above log line will run even with a nil emitter. Not sure if that's useful. I guess you'd notice it's not configured more quickly, but it also means those logs will show up even in places where metrics are undesired.)
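One possible shape that addresses both observations - dropping the noisy log and skipping the channel write entirely when no emitter is configured - is sketched below. This is hypothetical, not the actual emit.go; the flag and channel here are stand-ins:

```go
package main

import "fmt"

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

var (
	emitterConfigured bool              // stand-in for "is an emitter set up?"
	emissions         = make(chan Event, 2)
	dropped           int
)

// emit returns early when no emitter is configured, so nothing (logs or
// channel writes) happens in deployments where metrics are undesired; the
// non-blocking select drops events instead of blocking when the buffer fills.
func emit(ev Event) {
	if !emitterConfigured {
		return
	}
	select {
	case emissions <- ev:
	default:
		dropped++
	}
}

func main() {
	emit(Event{}) // no emitter configured: a silent no-op
	emitterConfigured = true
	for i := 0; i < 3; i++ {
		emit(Event{})
	}
	fmt.Println(len(emissions), dropped) // buffer holds 2, third event dropped
}
```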
atc/metric/emit.go (outdated)

```go
func emitLoop() {
	for emission := range emissions {
		emission.logger.Debug("emit-event-loop", lager.Data{
```
Also noisy - wouldn't this mean there are 3 logs emitted for every metric?
```go
	}
}

func (emitter *InfluxDBEmitter) SubmitBatch(logger lager.Logger) {
```
Just checking, is this thread-safe? I guess it is because all of the emits come from an emit loop? 🤔
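A toy illustration of why that reasoning holds: when every event funnels through a single consumer goroutine (like emitLoop), the emitter's batch state is only ever touched from that one goroutine, so no lock is needed. Types here are stand-ins, not the real code:

```go
package main

import "fmt"

// Event is a stand-in for the real atc/metric Event type.
type Event struct{ Name string }

// emitterState holds the open batch; it is never locked because only the
// single loop goroutine below ever calls Emit.
type emitterState struct{ batch []Event }

func (e *emitterState) Emit(ev Event) { e.batch = append(e.batch, ev) }

func main() {
	events := make(chan Event)
	done := make(chan int)
	state := &emitterState{}

	// the single consumer, analogous to emitLoop: serializes all access
	go func() {
		for ev := range events {
			state.Emit(ev)
		}
		done <- len(state.batch)
	}()

	for i := 0; i < 4; i++ {
		events <- Event{Name: "e"} // many producers could send here safely
	}
	close(events)
	fmt.Println(<-done) // all four events landed in the batch, race-free
}
```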
```go
copy(batchToSubmit, batch)
batch = make([]metric.Event, 0)
lastBatchTime = time.Now()
go emitBatch(emitter, logger, batchToSubmit)
```
I guess the idea behind this is to prevent a giant batch from submitting too slowly and causing the queue to fill up. Makes sense, since that was the original problem. But I wonder if batching would help mitigate that in and of itself, since submitting 5k at a time might cause less backpressure than 1 event 5000 times.
Have you observed any issues with this yet at large scale? It seems like the goroutines could potentially pile up due to a slow consumer.
I think @cirocosta had similar concerns on an earlier revision. I'm fine with how it is for now, but in the future we may want to add a max-in-flight or something. I guess at the end of the day slow consumers are hard to avoid, and this is why we've been thinking of standardizing on Prometheus. 😅
@vito We have not had any issues with this, but I understand your concerns. As a final test, I would suggest that once we have merged this and built a release candidate, we spin up a Concourse cluster similar to our production one and perform some load tests to give us a measure of confidence in all the 5.5 code changes.
thanks for looking at it @vito 🙌 sorry y'all for dropping the ball and taking so long on it :( :( :(
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from 04e3e9e to 666c14b
@vito I have removed all the extraneous logging. These are really all remnants of when I initially debugged the issue to find the exact point at which we were losing messages.

Also, from running it in our production pipeline I can confirm that having debug mode on creates an insane amount of logging whose usefulness is questionable. It makes sense to only add this again in a custom build if we need to troubleshoot something specific further down the line.
Signed-off-by: Rudolf Visagie <[email protected]>
Force-pushed from dc5c349 to dc224ff
vito left a comment:

👍 thanks for cleaning up those logs! I noticed one more that should probably be removed; after that I think it's good to go.
@cirocosta np!
Signed-off-by: Rudolf Visagie <[email protected]>
vito left a comment:

thanks! good to go once CI passes. 👍
Thanks, @vito. Any idea when the 5.5 release is targeted for?
Don't know yet; I would assume a week or two at worst. There are a few things left in the milestone.
Thanks @vito, that's good news!
@vito, where can I get hold of the 5.5 release candidate artifacts?
* influxdb batch size/duration and metrics buffer size concourse/concourse#3937
* max db connection pool size concourse/concourse#4232

Signed-off-by: Jamie Klassen <[email protected]>
Hi @rudolfv, you can find binaries by prefixing the
#3937 Signed-off-by: Jamie Klassen <[email protected]> Co-authored-by: James Thomson <[email protected]>
Thank you, @pivotal-jamie-klassen!
This commit adds the new parameters that were added to Concourse 5.5. Here's a breakdown of the new parameters:

- max-active-tasks-per-worker
  > used by the `limit-active-tasks` container placement strategy
  > concourse/concourse#4118
- support for influxdb batching and bigger buffer size for metrics emissions
  > concourse/concourse#3937
- limiting the number of max connections in db conn pools
  > concourse/concourse#4232

Signed-off-by: Ciro S. Costa <[email protected]>
Co-authored-by: Zoe Tian <[email protected]>
@vito We still need to test the latest changes (making all the magic numbers configurable) against one of our Concourse stacks, but I did manage to build a full release image in our slightly altered release pipeline. We are not running the upgrade or downgrade jobs, but everything else passes.