Allow mmap chunks during head compaction #14610
base: main
Conversation
Force-pushed c272a79 to d9401db
Hi @prymitive, thanks for pointing that out. I can try to make it in the same goroutine, but for this we would need to use something like
The thing is, with native histograms this spike is way more notable for 2 main reasons:
We can see below the number of chunks created vs the number mmapped, before and after the change. Before the change we kept creating chunks but we were not mmapping them during head compaction/WAL truncation, causing memory to spike at that time (in our case a spike from 3GB to 10GB+, which is quite a lot).
Since the objection is to CPU usage, could you add a CPU chart too please?
Hi @bboreham, thanks for looking. There is already a graph in the description that shows CPU before and after... it's hard to tell anything, as sometimes we see exactly the same values and sometimes just a bit more (hard to say)... Maybe we could run prombench here?
Sorry the charts looked similar so I didn't realise one was CPU and one was memory. Something is using a lot more than one core; is it because this is actually Cortex?
Yeah... this is Cortex and it is a sum of multiple instances (ingesters). Heap size is the average between the ingester instances, btw. The spikes in CPU and memory happen exactly during head compaction time (every 2 hours). Those graphs compare before and after I applied this change (one has an offset of 6 days, which was before the patch). In them we can see that the memory spike is gone after the patch and the CPU seems similar.
It was pretty much an ad-hoc decision to stick with the 1kb limit for histogram chunks and merely require that at least 10 histograms are in each chunk. Have you played with increasing that limit? It would be interesting to explore the trade-offs.
I can do this test, but I think what will happen is that the baseline heap size will increase, as we will keep 4kb instead of 1kb in memory before cutting a new head chunk and mmapping it.
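For a rough sense of that trade-off, here is a back-of-the-envelope sketch (purely illustrative numbers, not measurements from this load test), assuming each series can hold up to the size limit in its open head chunk before the chunk is cut and mmapped:

```go
// Illustrative only: rough upper bound on memory held in open (not yet
// mmapped) head chunks for a given series count and per-chunk size limit.
// The 1 KiB and 4 KiB values mirror the limits discussed above; the series
// count is made up.
package main

import "fmt"

func main() {
	const numSeries = 1_000_000
	for _, limitBytes := range []int{1024, 4096} {
		heldMiB := float64(numSeries) * float64(limitBytes) / (1 << 20)
		fmt.Printf("chunk size limit %4d B -> up to ~%.0f MiB in open head chunks\n",
			limitBytes, heldMiB)
	}
}
```

So a 4x larger limit roughly quadruples that baseline, which matches the concern above.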
OK, thanks for your thoughts about this. Anyway, this concrete PR is about a different approach to the problem. @codesome you are most familiar with mmap'd chunks. What do you think about this PR?
@codesome 🥺 :)
In your comparison charts, the CPU is spiking higher. And as I understand, CPU is much more expensive than giving a little more memory. Can you please check what is causing the extra CPU use? The change in general looks innocent, but I would like to be cautious about any extra CPU cost.
Thanks a lot for looking at this @codesome :D:D:D The first 3 spikes are very similar, no? Indeed the last 2 are a bit higher (2-4%), but I think it's just some variance in the runs TBH... I can try to rerun the tests to make sure.
This scenario might not be entirely realistic since all the series in this load test are native histograms. However, during head compaction, memory usage spikes dramatically from 3GB to 10GB, which seems excessive, doesn't it? On the other hand, CPU usage increased by only about 2% during 3 of the 6 head compactions. While CPU resources are generally more expensive, the difference in CPU usage appears minimal compared to the significant memory spike. And since memory isn't compressible, this spike could easily lead to out-of-memory (OOM) errors during head compaction? Is it possible to run prombench here to see if we can capture something there as well? IDK if we currently ingest native histograms there, but at least we may see whether this causes more CPU with normal samples.
/prombench main
(comparing against main is best done right after rebasing your project on main, but since your timezone is very different to mine I'll kick it off and we can see how it looks after a few hours)
/prombench cancel
Benchmark cancel is in progress.
Something happened. This PR started to behave weirdly after some point. @alanprot could you investigate what happened? Before the anomaly, there is a slight increase in CPU with no impact on memory, probably because we do not have native histograms in prombench. In general I think we could try this out and revert if it causes a problem, once we have fixed the potential issue, i.e.
Parca only has the profiles after 3:42? (after the problem happened? :( ) It seems for some reason I don't see the
Parca is configured with one hour of retention.
Ok.. I did find the deadlock between these 2 goroutines: (it happens during series GC, and maybe my tests did not have churn, so I did not see this happening)
I will see how I can better fix this.
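For illustration only, here is a minimal, self-contained sketch of that kind of lock-ordering deadlock (the lock names and goroutine roles are stand-ins, not the actual TSDB fields): two goroutines take the same two locks in opposite order, so each can end up waiting on the lock the other already holds.

```go
// Stand-alone sketch of an AB/BA deadlock; not Prometheus code.
package main

import "sync"

func main() {
	var seriesLock, chunkWriteLock sync.Mutex // stand-ins for the two locks involved
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine A (think: series GC path) takes seriesLock first, then chunkWriteLock.
	go func() {
		defer wg.Done()
		seriesLock.Lock()
		chunkWriteLock.Lock() // blocks forever if B already holds it
		chunkWriteLock.Unlock()
		seriesLock.Unlock()
	}()

	// Goroutine B (think: head-chunk mmapping path) takes them in the opposite order.
	go func() {
		defer wg.Done()
		chunkWriteLock.Lock()
		seriesLock.Lock() // blocks forever if A already holds it
		seriesLock.Unlock()
		chunkWriteLock.Unlock()
	}()

	wg.Wait() // may never return: the classic symptom seen in goroutine dumps
}
```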
I think this deadlock can happen nowadays as well in a very specific scenario: if autoCompact is disabled and we call
It does not happen often because as soon as we call
Force-pushed 66c5a9e to 2a0e45f
Ok.. I created a test to reproduce the problem. Here is the run with the deadlock happening and the test failing:
The deadlock occurs because
To prevent a deadlock, I used the
With all that said, I wonder why we don't mmap the chunks as soon as a new head chunk is cut, which would eliminate the need for this background task that scans all series to find those with chunks to be mmapped. This seems to be a discussion for another PR, but I'm more than happy to try it out if we think it makes sense. PS: I also rebased the PR.
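For context, here is a rough sketch of the "skip contended series and retry them later" idea that the PR is described as using further down in the thread (types and names are illustrative, not the PR's actual code; it assumes Go 1.18+ for sync.Mutex.TryLock):

```go
// Illustrative sketch, not Prometheus code: mmap chunks only for series whose
// lock can be taken without blocking, and return the contended ones so the
// caller can handle them outside the critical section. Because we never wait
// on a contended lock here, no lock-ordering cycle can form with series GC.
package main

import "sync"

type series struct {
	mtx sync.Mutex
	ref uint64
}

func mmapChunksNonBlocking(all []*series, mmapFn func(*series)) (retry []*series) {
	for _, s := range all {
		if !s.mtx.TryLock() { // Go 1.18+; never blocks
			retry = append(retry, s)
			continue
		}
		mmapFn(s)
		s.mtx.Unlock()
	}
	return retry
}

func main() {
	ss := []*series{{ref: 1}, {ref: 2}}
	retry := mmapChunksNonBlocking(ss, func(s *series) { /* mmap s's completed chunks here */ })
	_ = retry // in a real flow these would be retried after the contended section
}
```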
Force-pushed fad30d9 to 5c2e638
I think the background mmapping is to avoid a spike in IO, because it's typical for every series to cut a new chunk at the same time.
That makes sense. Doing it in the background might still cause the same spike, but it wouldn't block new appends, I guess? Still, wouldn't it make more sense to keep track of the newly cut chunks instead of scanning all time series every minute?
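A minimal sketch of what "keeping track of the newly cut chunks" could look like (hypothetical types and names, not an existing Prometheus API): a series registers itself in a small queue when it cuts a head chunk, and the background mmapper drains only that queue instead of iterating every series each minute.

```go
// Hypothetical sketch, not Prometheus code.
package main

import "sync"

type mmapQueue struct {
	mu      sync.Mutex
	pending []uint64 // refs of series with a full head chunk waiting to be mmapped
}

// noteCutChunk would be called at the point where a new head chunk is cut.
func (q *mmapQueue) noteCutChunk(seriesRef uint64) {
	q.mu.Lock()
	q.pending = append(q.pending, seriesRef)
	q.mu.Unlock()
}

// drain hands the collected refs to the background mmapper and resets the queue.
func (q *mmapQueue) drain() []uint64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.pending
	q.pending = nil
	return out
}

func main() {
	var q mmapQueue
	q.noteCutChunk(42) // a series just cut a new head chunk
	for _, ref := range q.drain() {
		_ = ref // the background goroutine would mmap only these series' chunks
	}
}
```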
I wonder if there is a fundamental design problem as the same locks are used to lock different data structures (
Signed-off-by: alanprot <[email protected]>
Signed-off-by: alanprot <[email protected]>
…ctx) concurrently Signed-off-by: alanprot <[email protected]>
Force-pushed 5c2e638 to 7a44443
Certainly it's important not to have two bits of code that can take the locks in different orders.
Yeah... but for now I don't see any way to fix this without a potential performance penalty (like unnecessary locks), except by using the
The PR right now is even keeping track of the series it could not lock and mmapping their chunks outside of the striped lock, but I think this is unnecessary. WDYT?
I created an alternative approach for this: #14752 Can we run prombench there just to see the comparison?
We discussed this with @alanprot and @yeya24 during KubeCon NA. It's pretty hard to find this discussion (it's buried in this PR), but I think the gist is that, with native histograms (and especially with nhcb), the characteristics of WAL operations are different, so a different locking/head compaction flow could be considered. Perhaps a separate discussion should be started for this, but first we have to reproduce the problem with prombench IMO. Otherwise it's not sustainable to fix it if we can't test against re-introducing the problem. Let's make a regression benchmark possible; I have some ideas for prombench improvements in mind. I think you also created pure blocks with native histograms only, which might be unrealistic in practice too (is that right?).
@krajorama Note to self, follow up on this w.r.t. prombench native histogram support as well.
I was running some load tests with native histograms and I noticed a huge spike in memory during head compaction.
Upon some investigation I noticed that the reason this is happening is that we were not calling
h.mmapHeadChunks
during head compaction, even though calling it at this point seems safe - this call is not even guarded by any mutex. We were not calling it simply because the call was placed after the mutex used to call
db.reloadBlocks()
in the following loop: prometheus/tsdb/db.go
Lines 1056 to 1067 in ee5bba0
which is locked during head compaction.
This PR just creates a new goroutine to keep calling
h.mmapHeadChunks
so it is not blocked by that mutex. These are the results before and after the change (CPU and memory now and 5 days ago, without this change):
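To make the shape of the change concrete, here is a simplified sketch of what the description above says the PR does (the function names and the one-minute interval are assumptions for illustration, not the exact diff): the mmapping loop gets its own goroutine driven by a ticker, so it no longer waits on the mutex that head compaction holds around db.reloadBlocks().

```go
// Simplified sketch, not the actual PR code.
package main

import (
	"context"
	"time"
)

// runHeadChunkMmapper periodically invokes the provided mmap function in its
// own goroutine, independent of the compaction/reload mutex.
func runHeadChunkMmapper(ctx context.Context, mmapHeadChunks func()) {
	ticker := time.NewTicker(time.Minute) // interval assumed for illustration
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Keeps running even while head compaction holds the compaction
			// mutex, so completed head chunks keep getting mmapped and freed.
			mmapHeadChunks()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go runHeadChunkMmapper(ctx, func() { /* h.mmapHeadChunks() in the real code */ })
	time.Sleep(100 * time.Millisecond) // stand-in for the program's lifetime
}
```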