Added support for InfluxDB batches of more than 1 and made the emission channel size configurable.#3937

Merged
vito merged 10 commits into concourse:master from rudolfv:feature/3674-influxdb-batching
Aug 13, 2019
Conversation

@rudolfv

@rudolfv rudolfv commented May 29, 2019

@vito We still need to test the latest changes (making all the magic numbers configurable) against one of our concourse stacks, but I did manage to build a full release image in our slightly altered release pipeline. We are not running the upgrade or downgrade jobs, but everything else passes.

Contributor

@cirocosta cirocosta left a comment


Thank you for the PR, @rudolfv !

I just added some questions / points 👍 I feel a bit bad that we didn't have some testing around this area before 😬 Do you think it'd be worth adding some to this logic?

Thanks!

)

func Initialize(logger lager.Logger, host string, attributes map[string]string) error {
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {
Contributor


As we always expect bufferSize to be uint anyway, wdyt of changing the signature to take a uint instead of an int?

Suggested change
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize uint32) error {

It seems to me that it'd be better to do the casting at the make(chan ...) that comes later, rather than at the moment we pass the value to Initialize 🤔

Author


I will make this change - makes sense to me to constrain it with a uint32 and do the cast at the last possible moment.

}

func emit(logger lager.Logger, event Event) {
logger.Debug("emit-event", lager.Data{
Contributor


We already have a way of having the metrics emitted logging to stdout through the --emit-to-logs variable - did you want this logline for the purposes of checking if metrics are still being emitted regardless of the configured emitter, or for the same purposes as --emit-to-logs?

Author

@rudolfv rudolfv Jun 6, 2019


This code is to check regardless of the configured emitter. If I remember correctly only one of the emitters can be configured at a time. So we cannot have --emit-to-logs and InfluxDB emission configured at the same time.

Member


I'm lukewarm on this; it seems like it'll be super noisy but I guess you can always filter it out. It's already at debug level but we have a decent amount of environments with debug logs enabled. Granted, we're used to maintaining filters to strip out the cruft.

If it was just added while you were developing and isn't useful anymore I would prefer we remove it, but if it was super useful in a pinch I'm fine with keeping it. 🤷‍♂️


InsecureSkipVerify bool `long:"influxdb-insecure-skip-verify" description:"Skip SSL verification when emitting to InfluxDB."`

// https://github.com/influxdata/docs.influxdata.com/issues/454
Contributor


Thanks for giving us the context on "why 5000"! That's very helpful.

Wdyt of leaving that to the PR though? As one could always trace the 5000 back to this PR, I'd say leaving that in the PR's comments (or even in the commit message) would be enough 🤔 In the codebase, we don't have many other cases where we add references like this.

Author

@rudolfv rudolfv Jun 6, 2019


"influxdb-batch-duration": emitter.batchDuration,
"current-duration": duration,
})
go emitBatch(emitter, logger, batch)
Contributor


I'm not very sure about making each emission a goroutine 🤔 If I understood correctly, this makes the buffer "infinite" as there's never really going to exist something that throttles the writes. Is this statement right? Maybe I got something wrong 😬

Author

@rudolfv rudolfv Jun 6, 2019


Making it a goroutine ensures that the construction of the batch points and writing to InfluxDB are immediately taken off the line of execution so that the emitter can continue to build up the next batch.

And the emitLoop function in emit.go still ensures that events are passed on in a synchronous way to the InfluxDBEmitter after they have been read from the channel.

So there will only ever be one open batch that will be closed and submitted for final processing when either the size or duration limit is reached. However, there can potentially be a number of closed batches that are in the process of being transformed into batch points or written to InfluxDB.

Does this answer your question? Or am I missing something?

@rudolfv
Author

rudolfv commented Jun 6, 2019

> Thank you for the PR, @rudolfv !
>
> I just added some questions / points 👍 I feel a bit bad that we didn't have some testing around this area before 😬 Do you think it'd be worth adding some to this logic?
>
> Thanks!

@cirocosta I am busy working on a unit test for this logic. Also see my replies to the other comments.

influxclient "github.com/influxdata/influxdb1-client/v2"
)

// This is a copy of the github.com/influxdata/influxdb1-client/v2/client.Client interface whose sole purpose is
Author


I only added this file as a last resort. As stated in the counterfeiter docs for third party packages, I tried with
//go:generate counterfeiter github.com/influxdata/influxdb1-client/v2/client.Client
and various other permutations of that, but none of them worked.

Rudolf Visagie added 7 commits June 7, 2019 12:12
…on channel buffer size configurable.

Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
…as been added to the PR comments

Signed-off-by: Rudolf Visagie <[email protected]>
Signed-off-by: Rudolf Visagie <[email protected]>
@rudolfv rudolfv force-pushed the feature/3674-influxdb-batching branch from 17d8b81 to 661bf9b on June 7, 2019
@rudolfv
Author

rudolfv commented Jun 10, 2019

@cirocosta This PR is ready for review again

@cirocosta
Contributor

cirocosta commented Jun 17, 2019

Hi @rudolfv !

Sorry for the very long delay, I'll get back to it very soon!


Update (21 Jun, 2019): we still didn't get the time to come to it. It's still in our backlog though!

@rudolfv
Author

rudolfv commented Jul 22, 2019

@cirocosta Any updates on this?

@rudolfv
Author

rudolfv commented Jul 30, 2019

@vito @cirocosta I see the stale[bot] is about to close this. Do you have any updates for me?

@vito
Member

vito commented Jul 30, 2019

@rudolfv Sorry! 🙁 I'll poke the issue so the stale bot backs off. We'll be sure to get to this soon. @cirocosta has been pretty busy with other commitments, and I've at least been busy putting together our roadmap. Unfortunately we don't have very consistent bandwidth for PR reviews - we're trying to improve on this as a project but it's slow going. I poked him about it yesterday, I'll try to get one of us to wrap this up soon depending on who's available.

@rudolfv
Author

rudolfv commented Jul 30, 2019

Thank you @vito!

Member

@vito vito left a comment


a tentative review to keep the ball rolling 🙂 sorry for the wait


select {
case emissions <- eventEmission{logger: logger, event: event}:
logger.Debug("emit-event-write-to-channel", lager.Data{
Member


This on the other hand seems like it'd be a bit too noisy. 😅 Would you be OK with removing either this one or the above log?

(Also interesting that the above log line will run even with a nil emitter. Not sure if that's useful. I guess you'd notice it's not configured more quickly, but it also means those logs will show up even in places where metrics are undesired.)


func emitLoop() {
for emission := range emissions {
emission.logger.Debug("emit-event-loop", lager.Data{
Member


Also noisy - wouldn't this mean there are 3 logs emitted for every metric?

}
}

func (emitter *InfluxDBEmitter) SubmitBatch(logger lager.Logger) {
Member


Just checking, is this thread-safe? I guess it is because all of the emits come from an emit loop? 🤔

Author

@rudolfv rudolfv Aug 8, 2019


@vito, yes, that is my understanding. The emitLoop will serialize the emission of events to the InfluxDB code. emitLoop -> emitter.Emit -> emitter.SubmitBatch

copy(batchToSubmit, batch)
batch = make([]metric.Event, 0)
lastBatchTime = time.Now()
go emitBatch(emitter, logger, batchToSubmit)
Member


I guess the idea behind this is to prevent a giant batch from submitting too slowly and causing the queue to fill up. Makes sense, since that was the original problem. But I wonder if batching would help mitigate that in and of itself, since submitting 5k at a time might cause less backpressure than 1 event 5000 times.

Have you observed any issues with this yet at large scale? It seems like the goroutines could potentially pile up due to a slow consumer.

I think @cirocosta had similar concerns on an earlier revision. I'm fine with how it is for now, but in the future we may want to add a max-in-flight or something. I guess at the end of the day slow consumers are hard to avoid, and this is why we've been thinking of standardizing on Prometheus. 😅

Author


@vito We have not had any issues with this, but I understand your concerns. As a final test I would suggest that once we have merged this and built a release candidate, we spin up a Concourse cluster similar to our production one and perform some load tests to give us a measure of confidence in all the 5.5 code changes.

@cirocosta
Contributor

cirocosta commented Aug 8, 2019

thanks for looking at it @vito 🙌 sorry y'all for dropping the ball and taking so long on it :( :( :(

@rudolfv rudolfv requested a review from a team August 8, 2019 18:51
Signed-off-by: Rudolf Visagie <[email protected]>
@rudolfv rudolfv force-pushed the feature/3674-influxdb-batching branch from 04e3e9e to 666c14b on August 8, 2019
@rudolfv
Author

rudolfv commented Aug 8, 2019

@vito I have removed all the extraneous logging. These are really all remnants of when I initially debugged the issue to find the exact point at which we were losing messages.

Also, from running it in our production pipeline I can confirm that having debug mode on will create an insane amount of logging of which the usefulness is questionable. It makes sense to only add this again and do a custom build if we need to troubleshoot something specific further down the line.

Signed-off-by: Rudolf Visagie <[email protected]>
@rudolfv rudolfv force-pushed the feature/3674-influxdb-batching branch from dc5c349 to dc224ff on August 8, 2019
Member

@vito vito left a comment


👍 thanks for cleaning up those logs! i noticed one more that should probably be removed, after that I think it's good to go

@cirocosta np!

Signed-off-by: Rudolf Visagie <[email protected]>
Member

@vito vito left a comment


thanks! good to go once CI passes. 👍

@vito vito merged commit 7216993 into concourse:master Aug 13, 2019
@rudolfv
Author

rudolfv commented Aug 13, 2019

Thanks, @vito. Any ideas on when the 5.5 release is targeted for?

@vito
Member

vito commented Aug 13, 2019

Don't know yet, I would assume a week or two at worst. There are a few things left in the milestone.

@rudolfv
Author

rudolfv commented Aug 13, 2019

Thanks @vito, that's good news!

@rudolfv
Author

rudolfv commented Aug 21, 2019

@vito, where can I get hold of the 5.5 release candidate artifacts?

jamieklassen pushed a commit to concourse/concourse-bosh-release that referenced this pull request Aug 21, 2019
* influxdb batch size/duration and metrics buffer size concourse/concourse#3937
* max db connection pool size concourse/concourse#4232

Signed-off-by: Jamie Klassen <[email protected]>
@jamieklassen
Member

jamieklassen commented Aug 26, 2019

Hi @rudolfv, you can find binaries by prefixing the path part of the versions of the linux-rc resource in the main pipeline with https://storage.cloud.google.com/concourse-artifacts/, e.g. https://storage.cloud.google.com/concourse-artifacts/rcs/concourse-f5706980d9a64f85ed2b2c44ef3cb8ff3a783bd7-linux-amd64.tgz -- actually you can probably guess the filename pattern if you're interested in other source versions or OS builds (i.e. darwin/windows). RC images are available on dockerhub at concourse/concourse-rc and concourse/concourse-bosh-release gets continuously updated with the latest RCs as the pipeline tests them. We're closing in on the v5.5.0 release now.

@jamieklassen jamieklassen added the release/documented Documentation and release notes have been updated. label Aug 26, 2019
@jamieklassen jamieklassen added this to the v5.5.0 milestone Aug 26, 2019
jamieklassen pushed a commit that referenced this pull request Aug 26, 2019
#3937

Signed-off-by: Jamie Klassen <[email protected]>
Co-authored-by: James Thomson <[email protected]>
@rudolfv
Author

rudolfv commented Aug 28, 2019

> Hi @rudolfv, you can find binaries by prefixing the path part of the versions of the linux-rc resource in the main pipeline with https://storage.cloud.google.com/concourse-artifacts/, e.g. https://storage.cloud.google.com/concourse-artifacts/rcs/concourse-f5706980d9a64f85ed2b2c44ef3cb8ff3a783bd7-linux-amd64.tgz -- actually you can probably guess the filename pattern if you're interested in other source versions or OS builds (i.e. darwin/windows). RC images are available on dockerhub at concourse/concourse-rc and concourse/concourse-bosh-release gets continuously updated with the latest RCs as the pipeline tests them. We're closing in on the v5.5.0 release now.

Thank you, @pivotal-jamie-klassen!

matthewpereira pushed a commit that referenced this pull request Sep 5, 2019
#3937

Signed-off-by: Jamie Klassen <[email protected]>
Co-authored-by: James Thomson <[email protected]>
k8s-ci-robot pushed a commit to helm/charts that referenced this pull request Sep 6, 2019
This commit adds the new parameters that were added to Concourse 5.5.

Here's a breakdown of the new parameters:

- max-active-tasks-per-worker

  > used by the `limit-active-tasks` container placement strategy
  > concourse/concourse#4118

- support for influxdb batching and bigger buffer size for metrics emissions

  > concourse/concourse#3937

- limiting number of max connections in db conn pools

  > concourse/concourse#4232

Signed-off-by: Ciro S. Costa <[email protected]>
Co-authored-by: Zoe Tian <[email protected]>
kengou pushed a commit to kengou/charts that referenced this pull request Sep 18, 2019
ramkumarvs pushed a commit to yugabyte/charts-helm-fork that referenced this pull request Sep 30, 2019
taylorsilva pushed a commit to taylorsilva/concourse-helm that referenced this pull request Oct 2, 2019