physical memory goes high every several hours #1128
Typically, unused memory gets purged on a sliding scale, with the default being to madvise it back to the OS after about 10 seconds. Is allocation spikiness a plausible explanation? The time interval between the samples isn't completely clear. (Does memory usage eventually go back down?)
The memory usage does go back down, but the spikiness is too high for us. We try hard to keep memory usage under control; otherwise it may cause out-of-memory errors.
But what I mean is, it sounds like the program just has a spiky allocation pattern; the most likely cause of jemalloc temporarily using much more memory is the program allocating much more memory, and there's not a ton that can be done about that. One thing you can do is change how long it takes to purge; this is dictated by dirty_decay_ms and muzzy_decay_ms (both of which you can set to 0, which will cause jemalloc to madvise away all memory as soon as possible, at the cost of slowing down the application). You could also do something like call the manual purging mallctls yourself (discussed further below).
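For concreteness, a minimal sketch of how those two options can be set, assuming an unprefixed jemalloc build where the configuration symbol is named malloc_conf (the same string can instead be supplied through the MALLOC_CONF environment variable). The zero values are illustrative, not a recommendation:

#include <stdlib.h>
#include <jemalloc/jemalloc.h>

/* jemalloc reads this string once at startup. Purging freed pages as soon
 * as possible trades CPU (madvise syscalls, later page faults) for a lower
 * resident set. */
const char *malloc_conf = "dirty_decay_ms:0,muzzy_decay_ms:0";

int main(void) {
    void *p = malloc(1 << 20);
    free(p); /* with the settings above, the backing pages are madvised away
                almost immediately instead of decaying over ~10 seconds */
    return 0;
}

Equivalently, MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0" can be exported before starting the process, which avoids a rebuild.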
Thank you David. AFAIK, "Allocated" is the number of bytes allocated by the application, and from the log I attached, "Allocated" didn't change much even while an extra 10GB of physical memory was used, so I don't think the extra memory comes from the application side. I will try setting dirty_decay_ms and muzzy_decay_ms.
The issue is, if you allocate a lot of memory and then free it quickly, the allocated stat won't look high (unless you happen to grab it at exactly the right time), even though the RSS will be higher until the memory gets purged down.
@atwufei: you can verify whether it's purging related from the stats. For example, the "npages" value is the number of dirty pages at that moment (they have not been returned to the kernel yet, so they are included in resident memory). Looking at the stats a bit more, I realized that the program is likely suffering from the VM fragmentation we fixed recently (in #1071). Can you please try the current dev branch and see if it still happens?
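As a side note, the same counter can also be read programmatically. A rough sketch, assuming an unprefixed jemalloc 5.x build and that the MALLCTL_ARENAS_ALL pseudo-index returns the merged value that malloc_stats_print() reports:

#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Print the merged dirty-page count across all arenas, i.e. pages that were
 * freed by the application but not yet returned to the kernel. */
static void print_dirty_pages(void) {
    /* mallctl stats are cached; bump the epoch to refresh them first. */
    uint64_t epoch = 1;
    size_t esz = sizeof(epoch);
    mallctl("epoch", &epoch, &esz, &epoch, sizeof(epoch));

    char name[64];
    snprintf(name, sizeof(name), "stats.arenas.%u.pdirty",
             (unsigned)MALLCTL_ARENAS_ALL);

    size_t pdirty;
    size_t sz = sizeof(pdirty);
    if (mallctl(name, &pdirty, &sz, NULL, 0) == 0)
        printf("dirty pages not yet purged: %zu\n", pdirty);
}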
The "npages" does vary a lot, for the normal (good) and high rss cases:
I will try the latest dev commit: commit 83aa988 (HEAD -> dev, origin/dev, origin/HEAD)
Yeah, like @davidtgoldblatt mentioned, dirty pages are purged with a time-based decay curve, by default over a period of 10 seconds. If you want the freed dirty pages to be returned faster, tuning dirty_decay_ms will do the job (note that it may increase CPU usage). However, from the stats the metadata usage keeps increasing, which suggests virtual memory fragmentation, so I'd say try out the current dev branch as well.
@davidtgoldblatt with an old version of jemalloc, the physical memory usage of our application is pretty flat, but its size overhead (such as jemalloc's metadata) is larger than the new version's, so we want to switch for the smaller footprint.
How exactly does purging work? dirty_decay_ms = 10000 doesn't look too long, but I saw the high physical memory usage last for one or two hours.
After setting dirty_decay_ms and muzzy_decay_ms to 0, physical memory usage is flat again, but I'm worried about performance regressions with this configuration; we are still testing that. @interwq are there any configurations in between that still keep the peak RSS relatively low? For the dev branch, VIRT/RES is 75g/45g.
@atwufei: feel free to experiment with different decay time settings; it's usually a CPU-memory trade-off: the faster we return memory to the kernel, the more CPU cycles are spent on syscalls (e.g. madvise) and, potentially, page faults. I'd start with muzzy decay at 0, and try dirty decay from 1s up to 10s (the default, 10000ms). The metadata usage on the dev branch looks reasonable, so your workload was indeed hurt by the VM fragmentation issue. Please use the dev branch for now; we'll be cutting 5.1 soon.
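If restarting the process with a different MALLOC_CONF for every experiment is inconvenient, the decay time can also be adjusted at runtime through the arena.<i>.dirty_decay_ms mallctl. A sketch under that assumption (unprefixed 5.x build), using MALLCTL_ARENAS_ALL to apply the new value to every arena:

#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

/* Change the dirty-page decay time of all existing arenas at runtime.
 * decay_ms is the knob being experimented with, e.g. 1000 for 1 second. */
static int set_dirty_decay_all(ssize_t decay_ms) {
    char name[64];
    snprintf(name, sizeof(name), "arena.%u.dirty_decay_ms",
             (unsigned)MALLCTL_ARENAS_ALL);
    return mallctl(name, NULL, NULL, &decay_ms, sizeof(decay_ms));
}

A call such as set_dirty_decay_all(1000) would then correspond to the 1-second point of the suggested range.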
Thank you @interwq. Still one thing to figure out: given that 10GB+ of extra physical memory is used, does that mean the application malloc()s at least 10GB+ within 10 seconds? That looks higher than I expected. If dirty_decay_ms is 10000, does jemalloc make sure all pages that became dirty more than 10 seconds ago are purged completely?
Yes. At any moment, the dirty pages were all freed within the last N seconds, where N is the decay time. Pages are purged gradually rather than all at once, but they will all be purged over that period.
That doesn't seem to be the case. From the application's perspective, the memory in use stays about the same, which is why physical memory usage is flat after setting dirty_decay_ms to 0. But the high physical memory usage lasts 2 hours, and judging from malloc_stats_print(), the application doesn't allocate/free that frequently. So it looks to me like some memory is not purged even after the decay time.
Happy Chinese New Year! @interwq I tried again with dirty_decay_ms = 1000. After I stopped the client's traffic, the server was pretty idle with only a little CPU usage.
Several minutes later, the dirty "npages" count remained high. When will these pages get freed back to the OS? Another question: is it possible to purge them manually?
@atwufei: Happy Chinese New Year to you too! By default, the purging activity is "inlined" with normal malloc / free calls, i.e. every N invocations of jemalloc will attempt to purge once. If the application goes completely idle, no threads will be executing the purging path. The background thread feature is designed to solve this problem by adding jemalloc-controlled background threads, which handle all purging asynchronously. To enable it, add "background_thread:true" to your malloc_conf. More details here: jemalloc.net/jemalloc.3.html#opt.background_thread. Manual purging is also possible; see jemalloc.net/jemalloc.3.html#arena.i.decay
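A small sketch of both routes (again assuming an unprefixed 5.x build): enabling background threads at runtime via the background_thread mallctl (same effect as "background_thread:true" in malloc_conf), and forcing an immediate purge of all arenas with arena.<i>.purge:

#include <stdio.h>
#include <stdbool.h>
#include <jemalloc/jemalloc.h>

/* Let jemalloc purge asynchronously even while application threads are idle. */
static void enable_background_threads(void) {
    bool enable = true;
    mallctl("background_thread", NULL, NULL, &enable, sizeof(enable));
}

/* Force an immediate purge of unused dirty/muzzy pages in all arenas,
 * e.g. right after the application reaches a known idle point. */
static void purge_all_arenas(void) {
    char name[64];
    snprintf(name, sizeof(name), "arena.%u.purge",
             (unsigned)MALLCTL_ARENAS_ALL);
    mallctl(name, NULL, NULL, NULL, 0);
}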
Is "purging" per arena or global? If per arena, does arena start to purge only after every N invocations in this specific arena? And how large is N, so that I can calculate the max dirty pages, a lot of malloc in our system are 16/32 KB. |
From the manual, purging (decay) is per arena. As for the number of invocations between purges, it's an internal parameter and may change with future designs. The application should either use the background thread feature (I suggest trying this first) or the manual purging interface linked above.
Got it, thank you very much.
Hope I'm not hijacking this thread, but we seem to be running into the same issue (the application allocates 600MB, jemalloc has 6GB mapped, dirty pages very high). Is there any ETA on releasing 5.1.0? I'm a bit nervous about running an unreleased version. Relatedly but somewhat orthogonally, @davidtgoldblatt mentioned that this is likely caused by the application's allocation patterns themselves being spiky. Is there any way to get heap profiling with jemalloc that shows all allocations over some period of time, regardless of whether they have been deallocated since? That would be pretty helpful for tracking down an application's inefficient use of memory.
@viveksjain -- re: high dirty memory, try adding "dirty_decay_ms:1000" to your malloc_conf and see if it solves the issue for you. You may want to combine it with "muzzy_decay_ms:0" if it's supported on your system. Details in http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms. For the 5.1 release, @davidtgoldblatt has been preparing for it and it should happen relatively soon. Re: profiling allocations cumulatively, see http://jemalloc.net/jemalloc.3.html#opt.prof_accum
Sorry for the delay, got busy with some other stuff.
To close the loop here, I believe this manifests due to our application's use of a producer-consumer pattern. The producer thread allocates lots of data and hands it off to the consumer, which eventually frees it. jemalloc seems to bound each individual producer's maximum dirty memory (to about 100MB in a basic test); however, there does not seem to be any global limit. Our application spins up producer-consumer pair 1, processes a bunch of data, then after some time two other threads spin up P-C pair 2, which also processes a bunch of data, and so on. Even though P-C 1 has freed all of its data, jemalloc does not return it to the system, likely due to @interwq's point that if a thread goes idle, it will not be executing the purging path. So even though the application's total allocated amount stays low, jemalloc's mapped memory scales with the number of producer-consumer pairs (up to the number of arenas, I'm guessing). Enabling background_thread solves the issue.
Thanks for following up, @viveksjain. For context, we do enable background_thread by default for all services at Facebook. Currently this is the only way to purge unused memory when application threads go idle. It also provides other advantages such as improved tail latency.
I have tried with background_thread enabled, but the RSS of my application is still not going down. Any pointers would be really helpful. I am using jemalloc 5.2.1.
…se dirty pages more aggressively back to the OS: set dirty_decay_ms and muzzy_decay_ms to 0 by default, add a C++ / Python option to configure this.

The current default behavior causes applications dealing with large datasets to hold on to a large amount of physical operating system memory. While this may improve performance in some cases, it empirically seems to be causing problems for users. There's some discussion of this issue in some other contexts here: jemalloc/jemalloc#1128

Here is a test script I used to check the RSS while reading a large Parquet file (~10GB in memory) in a loop (requires downloading the file http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet): https://gist.github.com/wesm/c75ad3b6dcd37231aaacf56a80a5e401

This patch enables jemalloc background page reclamation and reduces the decay time from 10 seconds to 1 second so that memory is returned to the OS more aggressively.

Closes #5701 from wesm/ARROW-6910 and squashes the following commits:
8fc8aa8 <Antoine Pitrou> Revert "Try to fix protobuf-related clang warning"
ab67abb <Antoine Pitrou> Try to fix protobuf-related clang warning
9290470 <Wes McKinney> Review comments, disable PLASMA_VALGRIND in Travis
8c4d367 <Wes McKinney> Use background_thread:true and 1000ms decay
daa5416 <Wes McKinney> Set jemalloc dirty_decay_ms and muzzy_decay_ms to 0 by default, add function to set the values to something else

Lead-authored-by: Wes McKinney <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Hi there,
We are using jemalloc 5.0.1 in our project and found that the physical memory usage goes high every several (>10) hours. Here is the log I captured; more than 10GB of physical memory was used during this time:
Allocated: 56763404160, active: 64958468096, metadata: 3267106432, resident: 70248562688, mapped: 70786420736, retained: 405324754944
Allocated: 56876350976, active: 65120444416, metadata: 3292205344, resident: 74587324416, mapped: 75117805568, retained: 405240102912
Allocated: 56737409856, active: 64979918848, metadata: 3293146528, resident: 75795795968, mapped: 76325535744, retained: 404032372736
Allocated: 56738962464, active: 64995016704, metadata: 3296629168, resident: 76685611008, mapped: 77218127872, retained: 403615834112
Allocated: 56968671360, active: 65284304896, metadata: 3296170416, resident: 78292492288, mapped: 78825009152, retained: 402008952832
Allocated: 56968786248, active: 65279537152, metadata: 3298034096, resident: 79658573824, mapped: 80191090688, retained: 400642871296
Allocated: 56941156840, active: 65251299328, metadata: 3297322160, resident: 80860139520, mapped: 81392623616, retained: 399441338368
Allocated: 56991072392, active: 65310920704, metadata: 3312494544, resident: 82332794880, mapped: 82864013312, retained: 399729459200
Allocated: 57126460528, active: 65457401856, metadata: 3318715504, resident: 83553558528, mapped: 84290650112, retained: 399185723392
Allocated: 56571929400, active: 64856027136, metadata: 3341452928, resident: 85106311168, mapped: 85832876032, retained: 400474652672
Allocated: 56948892104, active: 65236578304, metadata: 3443298560, resident: 84992585728, mapped: 85696909312, retained: 413038342144
Except for resident/mapped, which vary a lot, the other numbers remain almost the same. What's the reason for this high physical memory usage? Does jemalloc reclaim unused physical memory instantly or periodically? BTW, huge TLB pages are disabled on this machine.
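For reference, a log line in this shape can be produced with the global stats mallctls. The sketch below is one way to do it (not necessarily how this log was generated) and assumes an unprefixed jemalloc build:

#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Emit one log line matching the samples above. */
static void log_jemalloc_stats(void) {
    /* The counters are cached; bump the epoch so the values are current. */
    uint64_t epoch = 1;
    size_t esz = sizeof(epoch);
    mallctl("epoch", &epoch, &esz, &epoch, sizeof(epoch));

    const char *names[] = {"stats.allocated", "stats.active", "stats.metadata",
                           "stats.resident", "stats.mapped", "stats.retained"};
    size_t vals[6] = {0};
    for (int i = 0; i < 6; i++) {
        size_t sz = sizeof(vals[i]);
        mallctl(names[i], &vals[i], &sz, NULL, 0);
    }
    printf("Allocated: %zu, active: %zu, metadata: %zu, resident: %zu, "
           "mapped: %zu, retained: %zu\n",
           vals[0], vals[1], vals[2], vals[3], vals[4], vals[5]);
}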