Ingester bounds by pstibrany · Pull Request #3992 · cortexproject/cortex

pstibrany · 2021-03-22T15:28:35Z

What this PR does: This PR implements several global (per-ingester, not per-tenant) limits for the ingester:

Max number of series in memory
Max number of tenants in memory
Max number of inflight requests
Max ingestion rate

All of these are disabled by default. They can be set via config/CLI parameters and changed via runtime configuration (to avoid redeploying ingesters). Current limits are exported as the cortex_ingester_global_limit metric, with a limit label that distinguishes the individual limits. If the ingester finds that a push request would exceed one of these limits, it returns a 500 error.
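To make the inflight-requests limit concrete, here is a minimal sketch of how such a check could work. It is not the code from this PR: `pushLimiter`, `start`, and `finish` are hypothetical names, the limit is a fixed field rather than a runtime-configurable value, and "0 means disabled" is an assumption made for the example. Only the error text is taken from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errTooManyInflight reuses the error text quoted later in this thread.
var errTooManyInflight = errors.New("cannot push: too many inflight push requests in ingester")

// pushLimiter is a hypothetical type for illustration, not the one in pkg/ingester.
type pushLimiter struct {
	inflight    int64 // accessed atomically; kept first for 64-bit alignment
	maxInflight int64 // assumption for this sketch: 0 means the limit is disabled
}

// start registers a push request and rejects it if the limit would be exceeded.
func (l *pushLimiter) start() error {
	n := atomic.AddInt64(&l.inflight, 1)
	if l.maxInflight > 0 && n > l.maxInflight {
		atomic.AddInt64(&l.inflight, -1) // undo: the request was not admitted
		return errTooManyInflight
	}
	return nil
}

// finish must be called exactly once for every successful start.
func (l *pushLimiter) finish() { atomic.AddInt64(&l.inflight, -1) }

func main() {
	l := &pushLimiter{maxInflight: 2}
	for i := 1; i <= 3; i++ {
		fmt.Printf("request %d: %v\n", i, l.start())
	}
	l.finish()
	fmt.Println("after one finish:", l.start())
}
```

Incrementing first and undoing on rejection keeps the check to a single atomic operation on the hot path, at the cost of the counter briefly overshooting the limit.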

Which issue(s) this PR fixes:
Fixes #665.

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

56quarters

Looks good, small nit about atomic alignment.

It seems like more settings are being added to runtime config due to a desire to change them without restarts. Obviously not something to be solved here, but it might be time to start thinking about how we could enable more configuration to be changed at runtime without needing to add it to the runtimeConfigValues struct.

pkg/ingester/ingester.go

docs/configuration/config-file-reference.md

pracucci

Solid work! 👏 The overall logic makes sense to me. I've some concerns about the "global limits" naming (because we already have "global" limits) and some high level comments. I will take a deeper look at tiny details during the 2nd pass review.

docs/configuration/config-file-reference.md

pkg/ingester/ingester_v2.go

pkg/ingester/global_limits.go

pkg/ingester/ingester.go

pkg/ingester/ingester_v2.go

pkg/ingester/global_limits.go

ranton256

Looks good to me overall. Thanks for working on this.

pkg/ingester/ingester_v2.go

docs/configuration/config-file-reference.md

ranton256 · 2021-03-31T20:13:14Z

> Looks good to me overall. Thanks for working on this.

I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

pstibrany · 2021-03-31T21:58:25Z

> I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

We haven't run this code in any of our environments yet. I expect that we will start testing it sometime after Easter. I have done some benchmarking in this comment, but it only compared master against this branch with nil limits. This PR includes a benchmark that compares some failure scenarios with and without limits. Unfortunately, I don't have numbers ready to post here at the moment.

ranton256

LGTM, and thanks for the info on the benchmarking status.

tomwilkie

Gave this a once-over and LGTM! Thanks Peter.

pracucci

Very good job and super sorry for my late review! I left a few nits. No need for me to re-review it once they're addressed. Thanks!

pkg/ingester/instance_limits.go

pkg/ingester/ingester_v2.go

pstibrany · 2021-04-09T13:57:34Z

Thank you for all reviews!

bboreham · 2021-06-29T08:11:46Z

Also fixes #858

bboreham · 2021-06-29T08:21:08Z

pkg/ingester/metrics.go

+		inflightRequests: promauto.With(r).NewGaugeFunc(prometheus.GaugeOpts{
+			Name: "cortex_ingester_inflight_push_requests",


This duplicates cortex_inflight_requests{route="/cortex.Ingester/Push"}.

pstibrany · 2021-06-29

Yes it does. But it exposes the value used for the limit check.

bboreham · 2021-07-05T16:12:41Z

I tried this out, with the limit set to 500 and many synthetic requests being pushed in from Avalanche.

It did, as hoped, cause a lot of errors "cannot push: too many inflight push requests in ingester".

I also expected the change to cut the number of goroutines, but there were still a lot. Mostly in gRPC before the code with the limit is hit:

goroutine profile: total 5860
5274 @ 0x43b2c5 0x44c437 0xac6d05 0xac6bd1 0xac7c35 0x47f887 0xac7b72 0xac7b2f 0xb3ca23 0xb3d64d 0xb437f6 0xb47b8c 0xb5648b 0x472701
#	0xac6d04	google.golang.org/grpc/internal/transport.(*recvBufferReader).read+0xa4		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:177
#	0xac6bd0	google.golang.org/grpc/internal/transport.(*recvBufferReader).Read+0x210	/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:171
#	0xac7c34	google.golang.org/grpc/internal/transport.(*transportReader).Read+0x54		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:482
#	0x47f886	io.ReadAtLeast+0x86								/usr/local/go/src/io/io.go:328
#	0xac7b71	io.ReadFull+0xd1								/usr/local/go/src/io/io.go:347
#	0xac7b2e	google.golang.org/grpc/internal/transport.(*Stream).Read+0x8e			/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:466
#	0xb3ca22	google.golang.org/grpc.(*parser).recvMsg+0x62					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:557
#	0xb3d64c	google.golang.org/grpc.recvAndDecompress+0x4c					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:688
#	0xb437f5	google.golang.org/grpc.(*Server).processUnaryRPC+0x355				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1176
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa			/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

161 @ 0x43b2c5 0x44cf05 0x44ceee 0x41cb5e 0x41cb42 0x41fd58 0x40e1d6 0x40b699 0x9c765e 0x9c9254 0x9b3775 0x20844a7 0x20799dd 0x156b2e9 0x20db263 0xb9ab23 0xd0b2a4 0x20c6cb6 0xb9ab23 0xb9e7e2 0xb9ab23 0xd0eafa 0xb9ab23 0xd0b814 0xb9ab23 0xb9ad17 0x154fdb0 0xb439cb 0xb47b8c 0xb5648b 0x472701
#	0x9c765d	github.com/prometheus/prometheus/tsdb.(*Head).putSeriesBuffer+0x3d			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1282
#	0x9c9253	github.com/prometheus/prometheus/tsdb.(*headAppender).Commit+0x633			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1521
#	0x9b3774	github.com/prometheus/prometheus/tsdb.dbAppender.Commit+0x34				/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/db.go:817
#	0x20844a6	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).v2Push+0x1a06			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester_v2.go:896
#	0x20799dc	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).Push+0x8dc			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester.go:475
#	0x156b2e8	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x88	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2565
#	0x20db262	github.com/cortexproject/cortex/pkg/cortex.ThanosTracerUnaryInterceptor+0xa2		/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/tracing.go:14
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b2a3	github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor+0xa3		/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:38
#	0x20c6cb5	github.com/cortexproject/cortex/pkg/util/fakeauth.SetupAuthMiddleware.func1+0x115	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/fakeauth/fake_auth.go:27
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9e7e1	github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1+0x301		/backend-enterprise/vendor/github.com/opentracing-contrib/go-grpc/server.go:57
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0eaf9	github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1+0x99	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:32
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b813	github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0x93	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:29
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9ad16	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xd6		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x154fdaf	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler+0x14f	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2567
#	0xb439ca	google.golang.org/grpc.(*Server).processUnaryRPC+0x52a					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1210
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa				/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

This in turn was caused by me using the cortex-jsonnet config which sets -server.grpc-max-concurrent-streams to 100,000: https://github.com/grafana/cortex-jsonnet/blob/23d110a5f0450417a551102007167d146702513f/cortex/ingester.libsonnet#L34

Reducing -server.grpc-max-concurrent-streams to 500 capped the number of goroutines below 1000.
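The observation above can be illustrated with a stand-alone analogy: a limit checked inside the handler runs after the goroutine already exists, whereas a cap applied before the handler is spawned bounds the goroutine count directly. This sketch does not use gRPC and is not how gRPC enforces its stream limit; `peakGoroutines` and its parameters are hypothetical, and the sleep merely stands in for a handler blocked reading the request.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakGoroutines starts one handler goroutine per request and reports the
// highest number that were running at the same time.
// maxStreams <= 0 means no cap is applied before spawning.
func peakGoroutines(requests, maxStreams int) int64 {
	var (
		running, peak int64
		wg            sync.WaitGroup
		sem           chan struct{}
	)
	if maxStreams > 0 {
		sem = make(chan struct{}, maxStreams)
	}
	for i := 0; i < requests; i++ {
		if sem != nil {
			sem <- struct{}{} // admission happens before the goroutine exists
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			n := atomic.AddInt64(&running, 1)
			for { // record a new peak if this goroutine raised it
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // stand-in for reading the request
			atomic.AddInt64(&running, -1)
			if sem != nil {
				<-sem // release the slot only after the handler is done
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak without cap:", peakGoroutines(2000, 0))
	fmt.Println("peak with cap 50:", peakGoroutines(2000, 50))
}
```

With the cap, the peak can never exceed `maxStreams`; without it, the peak depends on scheduling and can approach the number of requests.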

pstibrany requested a review from pracucci March 22, 2021 15:28

pull-request-size bot added the size/XL label Mar 22, 2021

pstibrany mentioned this pull request Mar 23, 2021

Ingesters should not grow without bound #665

Closed

56quarters approved these changes Mar 24, 2021

weeco reviewed Mar 26, 2021

weeco reviewed Mar 26, 2021

pracucci reviewed Mar 29, 2021

pstibrany force-pushed the ingester-limits branch 3 times, most recently from 1b8dc66 to d37db98 March 31, 2021 07:46

ranton256 reviewed Mar 31, 2021

ranton256 approved these changes Apr 2, 2021

pstibrany force-pushed the ingester-limits branch from 71d9f41 to 865b487 April 6, 2021 08:06

tomwilkie approved these changes Apr 7, 2021

pracucci approved these changes Apr 9, 2021

pstibrany force-pushed the ingester-limits branch from f317bb1 to b7b1f8d April 9, 2021 13:24

All commits listed below carry the trailer Signed-off-by: Peter Štibraný <[email protected]>

pstibrany added 12 commits April 9, 2021 15:24

82d0a51 Added global ingester limits.
edd0587 Add tests for global limits.
023a25e Expose current limits used by ingester via metrics.
baabc5f Add max inflight requests limit.
ffc8155 Added test for inflight push requests.
635e218 Docs.
90390ef Debug log.
e37a216 Test for unmarshalling.
82b14c2 Nil default global limits.
e97bef6 CHANGELOG.md
97af696 Expose current ingestion rate as gauge.
e870730 Expose number of inflight requests.

pstibrany added 11 commits April 9, 2021 15:25

2360350 Change ewmaRate to use RWMutex.
0a68912 Rename globalLimits to instanceLimits. Rename max_users to max_tenants. Removed extra parameter to `getOrCreateTSDB`
b453a1e Rename globalLimits to instanceLimits, fix users -> tenants, explain that these limits only work when using blocks engine.
37be0e6 Rename globalLimits to instanceLimits, fix users -> tenants, explain that these limits only work when using blocks engine.
3be949c Remove details from error messages.
69a6a63 Comment.
47b93ce Fix series count when closing non-empty TSDB.
202f1f4 Added new failure modes to benchmark.
809e4c0 Fixed docs.
34266ad Tick every second.
e20cd3b Fix CHANGELOG.md

pstibrany force-pushed the ingester-limits branch from b7b1f8d to 2581afa April 9, 2021 13:25

pstibrany and others added 4 commits April 9, 2021 15:25

b059e8e Review feedback.
ca04f26 Review feedback.
6693661 Remove forgotten fmt.Println.
2581afa Use error variables.

pstibrany merged commit 7f85a26 into cortexproject:master Apr 9, 2021

pracucci mentioned this pull request Jun 28, 2021

Ingester blowing up to tens of thousands of goroutines #4324

Closed

bboreham reviewed Jun 29, 2021

bboreham mentioned this pull request Aug 13, 2021

when ingester starting, ingester increase thousands goroutines #4393

Open
