
Allow mmap chunks during head compaction #14610

Open
alanprot wants to merge 3 commits into main from mmap-chunks-head-compaction

Conversation

alanprot
Copy link
Contributor

@alanprot alanprot commented Aug 7, 2024

I was running some load tests with native histograms and I noticed a huge spike in memory during head compaction.

Upon some investigation I noticed that this is happening because we were not calling h.mmapHeadChunks during head compaction, even though calling it at that point seems safe - the call is not even guarded by any mutex.

We were not calling it simply because the call was placed after the mutex taken to call db.reloadBlocks() in the following loop:

prometheus/tsdb/db.go

Lines 1056 to 1067 in ee5bba0

db.cmtx.Lock()
if err := db.reloadBlocks(); err != nil {
level.Error(db.logger).Log("msg", "reloadBlocks", "err", err)
}
db.cmtx.Unlock()
select {
case db.compactc <- struct{}{}:
default:
}
// We attempt mmapping of head chunks regularly.
db.head.mmapHeadChunks()

That mutex is held here during head compaction.

This PR simply creates a new goroutine that keeps calling h.mmapHeadChunks so it is not blocked by that mutex.
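
Conceptually, the change boils down to something like the sketch below. The function name mmapLoop, the interval, and the stop channel are illustrative assumptions rather than the literal diff; only db.head.mmapHeadChunks is the existing call.

// Minimal sketch: run mmapHeadChunks in its own goroutine on a fixed
// interval so it is never serialized behind db.cmtx while head compaction
// (and WAL truncation) hold that mutex.
func (db *DB) mmapLoop(stopc <-chan struct{}) {
	ticker := time.NewTicker(time.Minute) // assumed interval
	defer ticker.Stop()
	for {
		select {
		case <-stopc:
			return
		case <-ticker.C:
			db.head.mmapHeadChunks()
		}
	}
}

// started next to the existing db.run goroutine, e.g. go db.mmapLoop(stopc)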

These are the results before and after the change (CPU and memory now vs. 5 days ago, without this change):

Screenshot 2024-08-07 at 8 15 58 AM

@alanprot alanprot force-pushed the mmap-chunks-head-compaction branch from c272a79 to d9401db Compare August 7, 2024 04:47
@prymitive
Copy link
Contributor

#10709 (comment)

@alanprot
Copy link
Contributor Author

alanprot commented Aug 7, 2024

Hi @prymitive thanks for pointing that out.

I can try to do it in the same goroutine, but for that we would need to use something like TryLock, which I wanted to avoid. WDYT?

The thing is, with native histograms this spike is much more noticeable, for 2 main reasons:

  • We cut head chunks much more often, as the head chunk reaches the 1kb limit quite fast when the histogram has multiple buckets (once every 2 minutes in our case with 300 buckets) - FloatHistograms even faster.
  • The WAL can become quite large - proportional to the number of buckets in each histogram sample - which takes a long time to truncate (and the lock is still held while the WAL is being truncated).

Below is the number of chunks created vs. the number of chunks mmapped, before and after the change. Before the change we kept creating chunks but were not mmapping them during head compaction/WAL truncation, causing memory to spike at that time (in our case from 3GB to 10GB+, which is quite a lot).
Before:

Screenshot 2024-08-07 at 8 33 58 AM

After:
Screenshot 2024-08-07 at 8 34 32 AM

@alanprot alanprot marked this pull request as ready for review August 7, 2024 15:42
@alanprot alanprot requested a review from jesusvazquez as a code owner August 7, 2024 15:42
@bboreham
Copy link
Member

Since the objection is to CPU usage, could you add a CPU chart too please?

@alanprot
Copy link
Contributor Author

alanprot commented Aug 12, 2024

Hi @bboreham thanks for looking.

There is already a graph in the description that shows CPU before and after... it's hard to conclude anything from it, as sometimes we see exactly the same values and sometimes just a bit more (hard to say)...

Maybe we could run prombench here?

@bboreham
Copy link
Member

Sorry the charts looked similar so I didn't realise one was CPU and one was memory.

Something is using a lot more than one core; is it because this is actually Cortex?

@alanprot
Copy link
Contributor Author

Yeah... this is Cortex and the chart is a sum over multiple instances (ingesters). Heap size is the average across the ingester instances.

The spikes in CPU and memory happen exactly during head compaction (every 2 hours). Those graphs compare before and after I applied this change (one series has an offset of 6 days, i.e. before the patch). From this we can see that the memory spike is gone after the patch and the CPU looks similar.

@beorn7
Copy link
Member

beorn7 commented Aug 13, 2024

We cut head chunks much more often, as the head chunk reaches the 1kb limit quite fast when the histogram has multiple buckets (once every 2 minutes in our case with 300 buckets) - FloatHistograms even faster.

It was pretty much an ad-hoc decision to stick with the 1kb limit for histogram chunks and merely require that at least 10 histograms are in each chunk. Have you played with increasing that limit? It would be interesting to explore the trade-offs.

@alanprot
Copy link
Contributor Author

It was pretty much an ad-hoc decision to stick with the 1kb limit for histogram chunks and merely require that at least 10 histograms are in each chunk. Have you played with increasing that limit? It would be interesting to explore the trade-offs.

I can run that test, but I think what will happen is that the baseline heap size will increase, as we will keep 4kb instead of 1kb in memory before cutting a new head chunk and mmapping it.

@beorn7
Copy link
Member

beorn7 commented Aug 13, 2024

OK, thanks for your thoughts on this.

Anyway, this concrete PR is about a different approach to the problem. @codesome you are most familiar with mmap'd chunks. What do you think about this PR?

@alanprot
Copy link
Contributor Author

@codesome 🥺 :)

@codesome
Copy link
Member

In your comparison charts, the CPU is spiking higher. And as I understand, CPU is much more expensive than giving a little more memory.

Can you please check what is causing the extra CPU use?

The change in general looks innocent, but I would like to be cautious about any extra CPU cost.

@alanprot
Copy link
Contributor Author

In your comparison charts, the CPU is spiking higher. And as I understand, CPU is much more expensive than giving a little more memory.

Can you please check what is causing the extra CPU use?

The change in general looks innocent, but I would like to be cautious about any extra CPU cost.

Thanks a lot for looking at this @codesome :D:D:D

The first 3 spikes are very similar, no? Indeed the last 2 are a bit higher (2-4%), but I think that's just some variance between runs TBH... I can try to rerun the tests to make sure.

Screenshot 2024-08-21 at 4 07 40 PM

CPU is much more expensive than giving a little more memory.

This scenario might not be entirely realistic since all the series in this load test are native histograms. However, during head compaction, memory usage spikes dramatically from 3GB to 10GB, which seems excessive, doesn't it? On the other hand, CPU usage increased by only about 2% during 3 of the 6 head compactions. While CPU resources are generally more expensive, the difference in CPU usage appears minimal compared to the significant memory spike. And since memory isn’t compressible, this spike could easily lead to out-of-memory (OOM) errors during head compaction?

Is it possible to run prombench here to see if we can capture something there as well? I don't know if we currently ingest native histograms there, but at least we may see whether this causes more CPU with normal samples.

@bboreham
Copy link
Member

/prombench main

@prombot
Copy link
Contributor

prombot commented Aug 22, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-14610 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@bboreham
Copy link
Member

(comparing against main is best done right after rebasing your project on main, but since your timezone is very different to mine I'll kick it off and we can see how it looks after a few hours)

@codesome
Copy link
Member

/prombench cancel

@prombot
Copy link
Contributor

prombot commented Aug 22, 2024

Benchmark cancel is in progress.

@codesome
Copy link
Member

Something happened. This PR started to behave weirdly after some point. @alanprot could you investigate what happened? Before the anomaly, there is a slight increase in CPU with no impact on memory, probably because we do not have native histograms in prombench. In general I think we could try this out and revert if it causes a problem, once that potential issue has been fixed.

CPU:
Screenshot 2024-08-22 at 2 01 08 PM

Memory:
Screenshot 2024-08-22 at 2 01 43 PM

@alanprot
Copy link
Contributor Author

Parca only has the profiles after 3:42? (after the problem happened? :( )

It seems that for some reason I don't see the (*DB).run goroutine in the new version at 3:42... but I wonder if I'm looking at the correct profile: Link

@bboreham
Copy link
Member

Parca is configured with one hour of retention.

@alanprot
Copy link
Contributor Author

alanprot commented Aug 22, 2024

OK... I did find the deadlock between these 2 goroutines:

(it happens during series GC; my tests probably did not have churn, so I did not see this happening)

         1   runtime.gopark /usr/local/go/src/runtime/proc.go:402
             runtime.goparkunlock /usr/local/go/src/runtime/proc.go:408 (inline)
             runtime.semacquire1 /usr/local/go/src/runtime/sema.go:160
             sync.runtime_SemacquireMutex /usr/local/go/src/runtime/sema.go:77
             sync.(*Mutex).lockSlow /usr/local/go/src/sync/mutex.go:171
             sync.(*Mutex).Lock /usr/local/go/src/sync/mutex.go:90 (inline)
             github.com/prometheus/prometheus/tsdb.(*Head).mmapHeadChunks /go/src/github.com/prometheus/prometheus/tsdb/head.go:1758
             github.com/prometheus/prometheus/tsdb.(*DB).run.func1 /go/src/github.com/prometheus/prometheus/tsdb/db.go:1055
         1   runtime.gopark /usr/local/go/src/runtime/proc.go:402
             runtime.goparkunlock /usr/local/go/src/runtime/proc.go:408 (inline)
             runtime.semacquire1 /usr/local/go/src/runtime/sema.go:160
             sync.runtime_SemacquireRWMutex /usr/local/go/src/runtime/sema.go:87
             sync.(*RWMutex).Lock /usr/local/go/src/sync/rwmutex.go:151
             github.com/prometheus/prometheus/tsdb.(*stripeSeries).gc.func1 /go/src/github.com/prometheus/prometheus/tsdb/head.go:1944
             github.com/prometheus/prometheus/tsdb.(*stripeSeries).iterForDeletion /go/src/github.com/prometheus/prometheus/tsdb/head.go:1983
             github.com/prometheus/prometheus/tsdb.(*stripeSeries).gc /go/src/github.com/prometheus/prometheus/tsdb/head.go:1955
             github.com/prometheus/prometheus/tsdb.(*Head).gc /go/src/github.com/prometheus/prometheus/tsdb/head.go:1573
             github.com/prometheus/prometheus/tsdb.(*Head).truncateSeriesAndChunkDiskMapper /go/src/github.com/prometheus/prometheus/tsdb/head.go:1372
             github.com/prometheus/prometheus/tsdb.(*Head).truncateMemory /go/src/github.com/prometheus/prometheus/tsdb/head.go:1170
             github.com/prometheus/prometheus/tsdb.(*DB).compactHead /go/src/github.com/prometheus/prometheus/tsdb/db.go:1433
             github.com/prometheus/prometheus/tsdb.(*DB).Compact /go/src/github.com/prometheus/prometheus/tsdb/db.go:1261
             github.com/prometheus/prometheus/tsdb.(*DB).run.func2 /go/src/github.com/prometheus/prometheus/tsdb/db.go:1088

I will see how I can best fix this.

@alanprot
Copy link
Contributor Author

alanprot commented Aug 22, 2024

I think this deadlock can happen today as well, in a very specific scenario: if autoCompact is disabled and we call db.Compact from another goroutine at the same time mmapHeadChunks is called by the db.run method.

It does not happen often because as soon as we call db.Compact, the db.run goroutine gets stuck on the lock - but if we were very unlucky and called db.Compact at the exact moment db.run is on this line, it could still happen.

@alanprot alanprot force-pushed the mmap-chunks-head-compaction branch 4 times, most recently from 66c5a9e to 2a0e45f Compare August 23, 2024 00:36
@alanprot
Copy link
Contributor Author

alanprot commented Aug 23, 2024

OK... I created a test to reproduce the problem. Here is a run where the deadlock happens and the test fails:

panic: test timed out after 10m0s
running tests:
	TestMMapAndCompactConcurrently (9m25s)

goroutine 6291 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:2366 +0x385

The deadlock occurs because mmapHeadChunks and gc can acquire locks in a different order. mmapHeadChunks first locks the stripeSeries and then locks the series itself, while the gc method locks a stripe in the stripeSeries, then the series itself, and then potentially another stripe in the stripeSeries when deleting a time series. This happens because a time series can be present in two stripes - one for its ID and another for its hash.
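
Reduced to a standalone program, the ordering problem looks like this (toy mutexes only, not the real stripeSeries/memSeries types):

package main

import "sync"

// Two goroutines taking the same pair of locks in opposite order can
// deadlock: each grabs its first lock and then waits forever for the
// other. This mirrors the mmapHeadChunks vs. gc ordering described above,
// just with plain mutexes instead of the real TSDB structures.
func main() {
	var lockA, lockB sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // like mmapHeadChunks: stripe lock, then series lock
		defer wg.Done()
		lockA.Lock()
		lockB.Lock()
		lockB.Unlock()
		lockA.Unlock()
	}()

	go func() { // like gc deleting a series: series lock, then another stripe lock
		defer wg.Done()
		lockB.Lock()
		lockA.Lock()
		lockA.Unlock()
		lockB.Unlock()
	}()

	wg.Wait() // may never return once each goroutine holds its first lock
}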

To prevent the deadlock, I used the TryLock method and, if a series is already locked, I call mmapChunks outside of the striped lock. This should be OK since series.mmapChunks is still protected by the series' own lock. Another potential solution might be to add an extra lock on the stripeSeries to prevent the gc and mmapHeadChunks methods from running concurrently - this would still address the problem raised by this PR, as gc is much faster than the whole Compact method.
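
A minimal sketch of that best-effort pattern, using toy types rather than the real stripeSeries/memSeries:

package sketch

import "sync"

// toySeries stands in for a series protected by its own mutex.
type toySeries struct {
	sync.Mutex
}

// mmapPass walks one stripe and mmaps what it can without blocking: series
// whose lock is currently held (e.g. by gc) are deferred and handled after
// the stripe lock is released. All names here are illustrative.
func mmapPass(stripeLock *sync.RWMutex, stripe []*toySeries, mmap func(*toySeries)) {
	var deferred []*toySeries

	stripeLock.RLock()
	for _, s := range stripe {
		if s.TryLock() {
			mmap(s)
			s.Unlock()
		} else {
			deferred = append(deferred, s)
		}
	}
	stripeLock.RUnlock()

	// Outside the stripe lock the per-series mutex alone is enough, since
	// mmapping only touches per-series state.
	for _, s := range deferred {
		s.Lock()
		mmap(s)
		s.Unlock()
	}
}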

With all that said, I wonder why we don't mmap the chunks as soon as a new head chunk is cut, which would eliminate the need for this background task that scans all series to find those with chunks to be mmapped - that seems like a discussion for another PR, but I'm more than happy to try it out if we think it makes sense.

PS: I also rebased the PR.

@alanprot alanprot force-pushed the mmap-chunks-head-compaction branch from fad30d9 to 5c2e638 Compare August 23, 2024 06:22
@bboreham
Copy link
Member

I think the background mapping is to avoid a spike in IO, because it’s typical for every series to cut a new chunk at the same time.
Look back in git history, you should find the reason.

@alanprot
Copy link
Contributor Author

alanprot commented Aug 23, 2024

I think the background mapping is to avoid a spike in IO, because it’s typical for every series to cut a new chunk at the same time.

That makes sense. Doing it in the background might still cause the same spike, but it wouldn't block new appends, I guess?

Still, wouldn't it make more sense to keep track of the newly cut chunks instead of scanning all time series every minute?
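
One hypothetical shape for that idea (not part of this PR; uint64 is just a stand-in for a series reference): whenever a head chunk is cut, the series is pushed onto a queue, and a background goroutine drains the queue instead of scanning every series.

package sketch

import "sync"

// headChunkQueue collects references to series whose head chunk was just
// cut and therefore have a completed chunk ready to be mmapped.
type headChunkQueue struct {
	mtx     sync.Mutex
	pending []uint64
}

// push would be called at chunk-cut time instead of relying on a periodic
// full scan of all series.
func (q *headChunkQueue) push(ref uint64) {
	q.mtx.Lock()
	q.pending = append(q.pending, ref)
	q.mtx.Unlock()
}

// drain mmaps everything queued so far; a background goroutine could call
// this on a timer or once the queue grows past some threshold, spreading
// out the IO.
func (q *headChunkQueue) drain(mmapSeries func(ref uint64)) {
	q.mtx.Lock()
	batch := q.pending
	q.pending = nil
	q.mtx.Unlock()

	for _, ref := range batch {
		mmapSeries(ref)
	}
}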

@alanprot
Copy link
Contributor Author

I wonder if there is a more fundamental design problem, since the same locks are used to protect different data structures (series and hashes) and, because of this, they can be acquired in inconsistent orders, causing the deadlock.

@alanprot alanprot force-pushed the mmap-chunks-head-compaction branch from 5c2e638 to 7a44443 Compare August 26, 2024 21:50
@bboreham
Copy link
Member

Certainly it's important not to have two bits of code that can take the locks in different orders.
I'm not sure I would call having one such bit of code a design problem.

@alanprot
Copy link
Contributor Author

Certainly it's important not to have two bits of code that can take the locks in different orders. I'm not sure I would call having one such bit of code a design problem.

Yeah... but for now I don't see any way to fix this without a potential performance penalty (like unnecessary locks), except by using the TryLock method. Using TryLock in the mmapHeadChunks function might be acceptable since it can be considered a best-effort approach. If some series aren't mmapped in the current call, it's not a big issue because this function is called every minute.

The PR right now even keeps track of the series it could not lock and mmaps their chunks outside of the striped lock, but I think this is unnecessary.

WDYT?

@alanprot
Copy link
Contributor Author

I created an alternative approach for this: #14752

Can we run prom bench there just to see the comparison?

@github-actions github-actions bot added the stale label Nov 2, 2024
@bwplotka
Copy link
Member

We discussed this with @alanprot and @yeya24 during KubeCon NA.

It's pretty hard to find this discussion (it's buried in this PR), but I think the gist is that with native histograms (and especially with NHCB) the characteristics of WAL operations are different, so a different locking/head compaction flow could be considered. Perhaps a separate discussion should be opened for this, but first we have to reproduce the problem with prombench, IMO. Otherwise it's not sustainable to fix it if we can't test against re-introducing the problem. Let's make a regression benchmark possible; I have some ideas for prombench improvements in mind.

I think you also created pure blocks for native histograms only, which might be unrealistic in practice too (is that right?).

@krajorama
Copy link
Member

@krajorama Note to self, follow up on this w.r.t. prombench native histogram support as well.
