[tsdb] re-implement WAL watcher to read via a "notification" channel#11949

Merged
cstyan merged 14 commits intomainfrom
callum-watcher-notify-channel
May 15, 2023

Conversation

@cstyan
Member

@cstyan cstyan commented Feb 8, 2023

EDIT: this is technically an implementation change, but could probably still use a changelog entry
[ENHANCEMENT] reimplemented WAL watcher reading code in a way that should reduce overall CPU usage in low scrape throughput deployments

This PR modifies the WAL watcher so that it no longer polls on a timer for reading the WAL, but instead only reads when it receives a notification over a channel from the TSDB code.
(screenshot: CPU usage comparison)
This shows CPU usage, where the green line is the Prometheus instance running with this PR's changes and the teal line is a build from the current main branch. The image may be slightly cut off on GitHub; the decrease is between 60 and 70%.

Signed-off-by: Callum Styan [email protected]

return nil

// we haven't read due to a notification in quite some time, try reading anyways
case <-readTicker.C:
Contributor

You could do `case <-readTicker.C, <-w.ReadNotify:` here, but then you wouldn't get to log which event caused the read (if you want to keep that).

Member Author

I'd like to keep the logging, knowing there was a timeout can be useful. The log level can be debug imo.

Member

Not blocking, but it would be nice to consolidate the error checking and handling to a function that is reused between the two paths.

return err
}

case <-w.readNotify:
Contributor

I wonder what would happen if we receive several read notifications while we are already busy processing a read notification. In that case would we want to drain the readNotify channel before we begin reading? If that's done, then we would have satisfied all the read requests via just one read.

Contributor

Actually, since this is an unbuffered channel, I guess the Notify() function would just block in such a situation?

Member Author

I used select with a default case to explicitly not block.

The theory is that throwing away a notification is not the end of the world. In a Prometheus with high scrape load we're probably getting a lot of notifications we don't really need, since each read by the WAL watcher reads until EOF. If we were sending a notification like "hey, read X bytes" we couldn't get away with throwing notifications away. In a Prometheus with a low scrape load it's very unlikely that we're going to throw away a notification.

This reminds me that adding metrics to track dropping of notifications is a good idea.

defer w.mtx.Unlock()
defer func() {
if w.WriteNotified != nil {
w.WriteNotified.Notify()
Member

How much notify is too much notify? Alternatively, we could call this once at the end of logging in a TSDB commit (here), because a single commit (hence one scrape) can produce more than one WAL log entry (more common once native histograms get adopted).

Member Author

I'll move the call to Notify, good call.

Member Author

I'm not 100% sure whether the only creation of the head struct for a normal Prometheus (not read-only) is via the db open call? Is there a similar FlushWAL that could create another new head struct? If not, I can clean up the saving of the reference to the interface even further than in my most recent commit.

Member Author

bump @codesome

Member

@codesome codesome left a comment

The added interface in TSDB looks fine. What was the load (#series, #samples/sec) that you tested this with? I am curious if at a high load too many notify calls would be a problem and if the backoff turns out to be better.

@cstyan
Member Author

cstyan commented Feb 15, 2023

What was the load (#series, #samples/sec) that you tested this with? I am curious if at a high load too many notify calls would be a problem and if the backoff turns out to be better.

So far I've only tested this with two Prometheus instances deployed locally, both scraping each other. As you mentioned on Slack we can test in our dev environments, but this also reminds me that it would be nice to include remote write in prombench.

@rfratto
Contributor

rfratto commented Feb 16, 2023

As you mentioned on slack we can test in our dev environments

I'd also be happy to help test this in Grafana Agent, our deployments there have a modest throughput (~2000 remote_writes/s).

@cstyan
Member Author

cstyan commented Feb 20, 2023

I'd also be happy to help test this in Grafana Agent, our deployments there have a modest throughput (~2000 remote_writes/s).

@rfratto yeah if we can deploy a second set of agents with these changes to compare the two that would be helpful, let me know how I can help

@cstyan cstyan force-pushed the callum-watcher-notify-channel branch from aeea6c1 to 43059fc Compare May 5, 2023 20:17
@cstyan cstyan marked this pull request as ready for review May 5, 2023 21:39
@cstyan cstyan changed the title WIP implement WAL watcher reading via channel [tsdb] re-implement WAL watcher to read via a "notification" channel May 8, 2023
@cstyan cstyan removed request for bwplotka and tomwilkie May 8, 2023 20:53
Member

@csmarchbanks csmarchbanks left a comment

A few small comments but generally this looks good to me.

Member

I don't think all of these benchmarks should be part of this PR; it looks like they are related to the new format/compression algorithms?

}

func (s *Storage) Notify() {
for _, s := range s.rws.queues {
Member

Nit: I don't love that we are shadowing `s` here, maybe `q` instead?

case <-readTicker.C:
level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
err = w.readSegment(reader, segmentNum, tail)
readTicker.Reset(readTimeout)
Member

Is it necessary to reset the ticker here? It would have just triggered so should be a full 15 seconds until the next trigger.

Member Author

IMO we should still reset: the read could take some amount of time, and I don't think calling Reset is expensive either.

I'm not opposed to removing it either.

Member

🤷 leaving it in seems fine, it's an edge case anyway.

return nil

// we haven't read due to a notification in quite some time, try reading anyways
case <-readTicker.C:
Member

Not blocking, but it would be nice to consolidate the error checking and handling to a function that is reused between the two paths.

@rfratto
Contributor

rfratto commented May 9, 2023

FWIW, I deployed this on a set of 8 Grafana Agents which (in total) write 450,000 samples/sec to their WALs. We aren't seeing any noticeable effect on CPU consumption after the deploy around ~17:20 UTC (the other annotation lines are unrelated deploys):

(screenshot: CPU usage across the deploy)

It doesn't decrease CPU usage either, but that's only expected to be observable with really low scrape loads.

@cstyan cstyan requested a review from jesusvazquez as a code owner May 11, 2023 00:25
Member

@csmarchbanks csmarchbanks left a comment

Looks like some CI failures now, but 👍 , thanks for making the helper function.

case <-readTicker.C:
level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
err = w.readSegment(reader, segmentNum, tail)
readTicker.Reset(readTimeout)
Member

🤷 leaving it in seems fine, it's an edge case anyway.

cstyan added 4 commits May 11, 2023 12:44
read ticker timeout instead of calling the Notify function

Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
Member

@csmarchbanks csmarchbanks left a comment

One nit, but 👍

@cstyan cstyan merged commit 0d2108a into main May 15, 2023
@cstyan cstyan deleted the callum-watcher-notify-channel branch May 15, 2023 19:31
cstyan pushed a commit to grafana/loki that referenced this pull request May 17, 2023
**What this PR does / why we need it**:

This PR implements a new mechanism for the wal Watcher in Promtail, to
know there are new records to be read. It uses a combination of:
- prometheus/prometheus#11950
- prometheus/prometheus#11949

The main idea is that the primary mechanism is a notification channel
between the `wal.Writer` and `wal.Watcher`. The Watcher subscribes to
write events the writer publishes, getting notified if the wal has been
written. The same subscriptions design is used for cleanup events.

As a backup, the watcher has a timer that implements an exponential
backoff strategy, which is constrained by a minimum and maximum that the
user can configure.

Below the cpu difference is shown of running both main and this branch
against the same scrape target.

<img width="2496" alt="image"
src="https://user-images.githubusercontent.com/2617411/232099483-7e5c36fa-9360-4eb9-8240-687adf46e330.png">

The yellow line is the latest main build from where this branch started,
and the green line is this branch. Both promtails tailing docker logs,
and using the following metrics to get cpu usage from cadvisor:
```
avg by (name) (rate(container_cpu_usage_seconds_total{job=~".+", instance=~".+", name=~"promtail-wal-test_promtail.+"}[$__rate_interval]))
```

**Which issue(s) this PR fixes**:
Part of #8197

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
kavirajk added a commit to grafana/loki that referenced this pull request Oct 15, 2023
Prometheus added a feature of notifying the reader when a sample is appended, instead of a waiting loop burning CPU cycles.
prometheus/prometheus#11949

Also should fix: #10859

Signed-off-by: Kaviraj <[email protected]>
kavirajk added a commit to grafana/loki that referenced this pull request Oct 16, 2023
**What this PR does / why we need it**:
Prometheus added a feature of notifying the reader when a sample
is appended, instead of waiting in a loop burning CPU cycles.
prometheus/prometheus#11949

This changes the default behaviour a bit: now if `notify` is not enabled,
the next read is done only when the readTicker next fires.

**Which issue(s) this PR fixes**:
Also should fix #10859

**Special notes for your reviewer**:
Adding a few more details for the sake of completeness.

We found this via more frequent failures of the rule-evaluation integration
tests linked in the issue above. After some investigation, we tracked it
down to Prometheus changes.

Prometheus introduced a new `wlog.WriteNotified` interface with a
`Notify()` method, whose goal is to notify any waiting readers that some
write is done.

Two types implement this interface: `wlog.Watcher` and `remote.Storage`.
`remote.Storage` implements `Notify()` by just calling its queues'
`wlog.Watcher` `Notify()` under the hood.

How do these types impact the Loki ruler?

The Loki ruler also uses `remote.Storage`. So when any samples get committed
via the `appender`, we have to notify the remote storage.

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Kaviraj <[email protected]>
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024