[tsdb] re-implement WAL watcher to read via a "notification" channel#11949

Merged
cstyan merged 14 commits intomainfrom
callum-watcher-notify-channel
May 15, 2023

Conversation

@cstyan
Member

@cstyan cstyan commented Feb 8, 2023

EDIT: this is technically an implementation change, but could probably still use a changelog entry
[ENHANCEMENT] reimplemented WAL watcher reading code in a way that should reduce overall CPU usage in low scrape throughput deployments

This PR modifies the WAL watcher so that it no longer polls on a timer for reading the WAL, but instead only reads when it receives a notification over a channel from the TSDB code.
(screenshot: CPU usage comparison)
This shows CPU usage, where the green line is the Prometheus instance running with this PR's changes and the teal line is a build from the current main branch. The image may be slightly cut off on GitHub; the decrease is between 60 and 70%.

Signed-off-by: Callum Styan [email protected]

return nil

// we haven't read due to a notification in quite some time, try reading anyways
case <-readTicker.C:
Contributor

You could do `case <-readTicker.C, <-w.ReadNotify:` here, but then you wouldn't get to log which event caused the read (if you want to keep that).

Member Author

I'd like to keep the logging, knowing there was a timeout can be useful. The log level can be debug imo.

Member

Not blocking, but it would be nice to consolidate the error checking and handling to a function that is reused between the two paths.

return err
}

case <-w.readNotify:
Contributor

I wonder what would happen if we receive several read notifications while we are already busy processing a read notification. In that case would we want to drain the readNotify channel before we begin reading? If that's done, then we would have satisfied all the read requests via just one read.

Contributor

Actually, since this is an unbuffered channel, I guess the Notify() function would just block in such a situation?

Member Author

I used select with a default case to explicitly not block.

The theory is that throwing away a notification is not the end of the world. In a Prometheus with high scrape load we're probably getting a lot of notifications we don't really need, since each read by the WAL watcher reads until EOF. If we were sending a notification like "hey, read X bytes" we couldn't get away with throwing notifications away. In a Prometheus with a low scrape load it's very unlikely that we're going to throw away a notification.

This reminds me that adding metrics to track dropping of notifications is a good idea.

defer w.mtx.Unlock()
defer func() {
if w.WriteNotified != nil {
w.WriteNotified.Notify()
Member

How much notify is too much notify? Alternatively, we could call this once at the end of logging in a TSDB commit (here), because a single commit (hence one scrape) can produce more than one WAL log entry (more common once native histograms get adopted).

Member Author

I'll move the call to Notify, good call.

Member Author

I'm not 100% sure whether the only creation of the head struct for a normal Prometheus (not read-only) is via the db open call? Is there a similar FlushWAL that could create another new head struct? If not, I can clean up the saving of the reference to the interface even further than in my most recent commit.

Member Author

bump @codesome

Member

@codesome codesome left a comment

The added interface in TSDB looks fine. What was the load (#series, #samples/sec) that you tested this with? I am curious if at a high load too many notify calls would be a problem and if the backoff turns out to be better.

@cstyan
Member Author

cstyan commented Feb 15, 2023

What was the load (#series, #samples/sec) that you tested this with? I am curious if at a high load too many notify calls would be a problem and if the backoff turns out to be better.

So far I've only tested this with two Prometheus instances deployed locally, both scraping each other. As you mentioned on Slack we can test in our dev environments, but this also reminds me that it would be nice to include remote write in prombench.

@rfratto
Contributor

rfratto commented Feb 16, 2023

As you mentioned on slack we can test in our dev environments

I'd also be happy to help test this in Grafana Agent, our deployments there have a modest throughput (~2000 remote_writes/s).

@cstyan
Member Author

cstyan commented Feb 20, 2023

I'd also be happy to help test this in Grafana Agent, our deployments there have a modest throughput (~2000 remote_writes/s).

@rfratto yeah if we can deploy a second set of agents with these changes to compare the two that would be helpful, let me know how I can help

@cstyan cstyan force-pushed the callum-watcher-notify-channel branch from aeea6c1 to 43059fc Compare May 5, 2023 20:17
@cstyan cstyan marked this pull request as ready for review May 5, 2023 21:39
@cstyan cstyan changed the title WIP implement WAL watcher reading via channel [tsdb] re-implement WAL watcher to read via a "notification" channel May 8, 2023
@cstyan cstyan removed request for bwplotka and tomwilkie May 8, 2023 20:53
Member

@csmarchbanks csmarchbanks left a comment

A few small comments but generally this looks good to me.

Member

I don't think all of these benchmarks should be part of this PR; it looks like they are related to the new format/compression algorithms?

}

func (s *Storage) Notify() {
for _, s := range s.rws.queues {
Member

Nit: I don't love that we are shadowing `s` here, maybe `q` instead?

case <-readTicker.C:
level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
err = w.readSegment(reader, segmentNum, tail)
readTicker.Reset(readTimeout)
Member

Is it necessary to reset the ticker here? It would have just triggered so should be a full 15 seconds until the next trigger.

Member Author

IMO we should still reset: the read could take some amount of time, and I don't think calling Reset is expensive either.

I'm not opposed to removing it either.

Member

🤷 leaving it in seems fine, it's an edge case anyway.

return nil

// we haven't read due to a notification in quite some time, try reading anyways
case <-readTicker.C:
Member

Not blocking, but it would be nice to consolidate the error checking and handling to a function that is reused between the two paths.

@rfratto
Contributor

rfratto commented May 9, 2023

FWIW, I deployed this on a set of 8 Grafana Agents which (in total) write 450,000 samples/sec to their WALs. We aren't seeing any noticeable effect on CPU consumption after the deploy around ~17:20 UTC (the other annotation lines are unrelated deploys):

(screenshot: CPU usage across the deploy)

It doesn't decrease CPU usage either, but that's only expected to be observable with really low scrape loads.

@cstyan cstyan requested a review from jesusvazquez as a code owner May 11, 2023 00:25
Member

@csmarchbanks csmarchbanks left a comment

Looks like some CI failures now, but 👍 , thanks for making the helper function.

case <-readTicker.C:
level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
err = w.readSegment(reader, segmentNum, tail)
readTicker.Reset(readTimeout)
Member

🤷 leaving it in seems fine, it's an edge case anyway.

cstyan added 4 commits May 11, 2023 12:44
read ticker timeout instead of calling the Notify function

Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
Member

@csmarchbanks csmarchbanks left a comment

One nit, but 👍

@cstyan cstyan merged commit 0d2108a into main May 15, 2023
@cstyan cstyan deleted the callum-watcher-notify-channel branch May 15, 2023 19:31
cstyan pushed a commit to grafana/loki that referenced this pull request May 17, 2023
**What this PR does / why we need it**:

This PR implements a new mechanism for the wal Watcher in Promtail, to
know there are new records to be read. It uses a combination of:
- prometheus/prometheus#11950
- prometheus/prometheus#11949

The main idea is that the primary mechanism is a notification channel
between the `wal.Writer` and `wal.Watcher`. The Watcher subscribes to
write events the writer publishes, getting notified if the wal has been
written. The same subscriptions design is used for cleanup events.

As a backup, the watcher has a timer that implements an exponential
backoff strategy, which is constrained by a minimum and maximum that the
user can configure.

Below the cpu difference is shown of running both main and this branch
against the same scrape target.

<img width="2496" alt="image"
src="https://user-images.githubusercontent.com/2617411/232099483-7e5c36fa-9360-4eb9-8240-687adf46e330.png">

The yellow line is the latest main build from where this branch started,
and the green line is this branch. Both promtails tailing docker logs,
and using the following metrics to get cpu usage from cadvisor:
```
avg by (name) (rate(container_cpu_usage_seconds_total{job=~".+", instance=~".+", name=~"promtail-wal-test_promtail.+"}[$__rate_interval]))
```

**Which issue(s) this PR fixes**:
Part of #8197

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
kavirajk added a commit to grafana/loki that referenced this pull request Oct 15, 2023
Prometheus added a feature of notifying the reader when a sample is appended, instead of a waiting loop burning CPU cycles.
prometheus/prometheus#11949

Also should fix: #10859

Signed-off-by: Kaviraj <[email protected]>
kavirajk added a commit to grafana/loki that referenced this pull request Oct 16, 2023
**What this PR does / why we need it**:
Prometheus added a feature of notifying the reader when a sample
is appended, instead of waiting in a loop burning CPU cycles.
prometheus/prometheus#11949

This changes the default behaviour a bit: now if `notify` is not enabled,
the next read is done only when the readTicker next fires.

**Which issue(s) this PR fixes**:
Also should fix #10859

**Special notes for your reviewer**:
Adding a few more details for the sake of completeness.

We found this via more frequent failures of the rule-evaluation integration
tests linked in the issue above. After some investigation, we tracked it
down to Prometheus changes.

Prometheus introduced a new `wlog.WriteNotified` interface with a
`Notify()` method, whose goal is to notify any waiting readers that some
write is done.

Two types implement this interface: `wlog.Watcher` and `remote.Storage`.
`remote.Storage` implements `Notify()` by just calling its queues'
`wlog.Watcher` `Notify()` under the hood.

How do these types impact the Loki ruler?

The Loki ruler also uses `remote.Storage`. So when any samples get committed
via the `appender`, we have to notify the remote storage.

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Kaviraj <[email protected]>
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024