physical memory goes high every several hours #1128

Closed
atwufei opened this issue Feb 6, 2018 · 27 comments
@atwufei

atwufei commented Feb 6, 2018

Hi there,

We are using jemalloc 5.0.1 in our project and found that physical memory usage climbs sharply every several (>10) hours. Here is the log I captured; more than 10 GB of additional physical memory was used during this period:

Allocated: 56763404160, active: 64958468096, metadata: 3267106432, resident: 70248562688, mapped: 70786420736, retained: 405324754944
Allocated: 56876350976, active: 65120444416, metadata: 3292205344, resident: 74587324416, mapped: 75117805568, retained: 405240102912
Allocated: 56737409856, active: 64979918848, metadata: 3293146528, resident: 75795795968, mapped: 76325535744, retained: 404032372736
Allocated: 56738962464, active: 64995016704, metadata: 3296629168, resident: 76685611008, mapped: 77218127872, retained: 403615834112
Allocated: 56968671360, active: 65284304896, metadata: 3296170416, resident: 78292492288, mapped: 78825009152, retained: 402008952832
Allocated: 56968786248, active: 65279537152, metadata: 3298034096, resident: 79658573824, mapped: 80191090688, retained: 400642871296
Allocated: 56941156840, active: 65251299328, metadata: 3297322160, resident: 80860139520, mapped: 81392623616, retained: 399441338368
Allocated: 56991072392, active: 65310920704, metadata: 3312494544, resident: 82332794880, mapped: 82864013312, retained: 399729459200
Allocated: 57126460528, active: 65457401856, metadata: 3318715504, resident: 83553558528, mapped: 84290650112, retained: 399185723392
Allocated: 56571929400, active: 64856027136, metadata: 3341452928, resident: 85106311168, mapped: 85832876032, retained: 400474652672
Allocated: 56948892104, active: 65236578304, metadata: 3443298560, resident: 84992585728, mapped: 85696909312, retained: 413038342144

Except for resident/mapped, which vary a lot, the other stats remain almost the same. What is the reason for this high physical memory usage? Does jemalloc reclaim unused physical memory instantly or periodically? BTW, HugeTLB is disabled on this machine.

@davidtgoldblatt
Member

Typically, unused memory gets purged on a sliding scale, with the default being to madvise it back to the OS after about 10 seconds. Is allocation spikiness a plausible explanation? The time interval between the samples isn't completely clear. (Does memory usage eventually go back down?)

@atwufei
Author

atwufei commented Feb 6, 2018

The memory usage does go back down, but the spikes are too high for us. We try hard to keep memory usage under control, since otherwise it may cause an out-of-memory condition.

@davidtgoldblatt
Member

davidtgoldblatt commented Feb 6, 2018

But what I mean is, it sounds like the program just has a spiky allocation pattern; the most likely cause of jemalloc temporarily using much more memory is the program allocating much more memory; there's not a ton that can be done about that.

One thing you can do is change how long it takes to purge; this is dictated by the dirty_decay_ms and muzzy_decay_ms options (both of which you can set to 0, which causes jemalloc to madvise away all unused memory as soon as possible, at the cost of slowing down the application). You could also call the arena.<i>.purge mallctl after you free up a lot of memory.
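
For reference, a minimal C sketch (illustrative only, assuming an unprefixed jemalloc 5.x build) of both knobs: the decay options set via the malloc_conf global, and a manual purge of all arenas through the arena.<i>.purge mallctl:

    #include <stdio.h>
    #include <jemalloc/jemalloc.h>

    /* Process-wide configuration: purge dirty/muzzy pages as soon as possible.
     * (The MALLOC_CONF environment variable works the same way.) */
    const char *malloc_conf = "dirty_decay_ms:0,muzzy_decay_ms:0";

    /* Ask every arena to purge its unused dirty pages right now, e.g. after
     * the application has just freed a large batch of memory. */
    static void purge_all_arenas(void) {
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "arena.%u.purge", (unsigned)MALLCTL_ARENAS_ALL);
        if (mallctl(cmd, NULL, NULL, NULL, 0) != 0)
            fprintf(stderr, "arena purge failed\n");
    }

Setting both decay options to 0 trades CPU (extra madvise calls and subsequent page faults) for a lower steady-state RSS.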

@atwufei
Author

atwufei commented Feb 6, 2018

Thank you, David. AFAIK, "Allocated" is the number of bytes allocated by the application, and in the log I attached, "Allocated" didn't change much even though an extra 10 GB of physical memory was used, so I don't think the extra memory comes from the application side.

I will try to set dirty_decay_ms and muzzy_decay_ms.

@davidtgoldblatt
Member

The issue is, if you allocate a lot of memory and then free it quickly, the allocated stat won't look high (unless you happen to grab it at exactly the right time), even though the RSS will be higher until the memory gets purged down.

@interwq
Member

interwq commented Feb 6, 2018

@atwufei : you can verify whether it's purging-related from the stats, for example:
decaying: time npages sweeps madvises purged
dirty: 1000 ...

The "npages" is the number of dirty pages at that moment (they are not returned to kernel yet, so included in resident memory).

Looking at the stats a bit more, I realized that the program is likely suffering from the VM fragmentation issue we fixed recently (in #1071). Can you please try the current dev branch and see if it still happens?
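
As an aside, these numbers can also be read programmatically via mallctl rather than parsed from malloc_stats_print() output; a minimal sketch, assuming an unprefixed jemalloc 5.x build ("stats.arenas.<i>.pdirty" is the dirty-page count shown as "npages" above):

    #include <stdio.h>
    #include <stdint.h>
    #include <jemalloc/jemalloc.h>

    static void print_dirty_stats(void) {
        uint64_t epoch = 1;
        size_t sz = sizeof(epoch);
        /* Stats are cached; writing to "epoch" refreshes the snapshot. */
        mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

        size_t resident = 0, pdirty = 0;
        char name[64];
        sz = sizeof(resident);
        mallctl("stats.resident", &resident, &sz, NULL, 0);

        snprintf(name, sizeof(name), "stats.arenas.%u.pdirty",
                 (unsigned)MALLCTL_ARENAS_ALL);
        sz = sizeof(pdirty);
        mallctl(name, &pdirty, &sz, NULL, 0);

        printf("resident: %zu bytes, dirty pages: %zu\n", resident, pdirty);
    }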

@atwufei
Author

atwufei commented Feb 7, 2018

The "npages" does vary a lot, for the normal (good) and high rss cases:

decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A       495096       256451     33017674    354940241

decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A      4129045       268681     34176851    367223631

@atwufei
Author

atwufei commented Feb 7, 2018

I will try the latest dev commit:

commit 83aa988 (HEAD -> dev, origin/dev, origin/HEAD)
Author: rustyx [email protected]
Date: Fri Jan 19 16:28:33 2018 +0100

Make generated headers usable in both x86 and x64 mode in Visual Studio

@interwq
Member

interwq commented Feb 7, 2018

Yeah, like @davidtgoldblatt mentioned, dirty pages are purged along a time-based decay curve, by default over a period of 10 seconds. If you want freed dirty pages to be returned faster, tuning dirty_decay_ms will do the job (note that it may increase CPU usage).

However, the stats show that metadata usage keeps increasing, which suggests virtual memory fragmentation, so I'd suggest trying out the current dev branch as well.

@atwufei
Author

atwufei commented Feb 7, 2018

@davidtgoldblatt with an older version of jemalloc, the physical memory usage of our application is pretty flat, but its size overhead (such as jemalloc's metadata) is larger than that of the new version, so we want to switch for a smaller footprint.

@atwufei
Author

atwufei commented Feb 7, 2018

How exactly does purging work? dirty_decay_ms = 10000 doesn't seem too long, but I saw the high physical memory usage last for one or two hours.

@atwufei
Author

atwufei commented Feb 11, 2018

After setting dirty_decay_ms and muzzy_decay_ms to 0, physical memory usage is flat again, but I'm worried about a performance regression with this configuration; we are still testing that. @interwq are there any settings in between that still keep the peak RSS relatively low?

For the dev branch, VIRT/RES is 75g/45g:

Allocated: 43725368360, active: 48630779904, metadata: 449112352 (n_thp 0), resident: 49972125696, mapped: 50213634048, retained: 29632806912

@interwq
Member

interwq commented Feb 11, 2018

@atwufei: feel free to experiment with different decay time settings. It's usually a CPU-memory trade-off: the faster we return memory to the kernel, the more CPU cycles are spent on syscalls (e.g. madvise) and, potentially, page faults. I'd start with muzzy decay at 0, and try dirty decay between 1 and 10 seconds (10000 ms is the default).

The metadata usage on the dev branch looks reasonable, so your workload was indeed hurt by the VM fragmentation issue. Please use the dev branch for now; we'll be cutting 5.1 soon.
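
For experimentation, here is a hedged C sketch (names from the jemalloc 5.x mallctl namespace; error handling mostly omitted, and the helper itself is hypothetical) of adjusting the decay times at runtime rather than via malloc_conf:

    #include <stdio.h>
    #include <sys/types.h>
    #include <jemalloc/jemalloc.h>

    /* Set dirty/muzzy decay times for future arenas and for every existing
     * arena. Errors for not-yet-created arenas are silently ignored. */
    static void set_decay_ms(ssize_t dirty_ms, ssize_t muzzy_ms) {
        /* Defaults applied to arenas created from now on. */
        mallctl("arenas.dirty_decay_ms", NULL, NULL, &dirty_ms, sizeof(dirty_ms));
        mallctl("arenas.muzzy_decay_ms", NULL, NULL, &muzzy_ms, sizeof(muzzy_ms));

        unsigned narenas;
        size_t sz = sizeof(narenas);
        if (mallctl("arenas.narenas", &narenas, &sz, NULL, 0) != 0)
            return;
        for (unsigned i = 0; i < narenas; i++) {
            char name[64];
            snprintf(name, sizeof(name), "arena.%u.dirty_decay_ms", i);
            mallctl(name, NULL, NULL, &dirty_ms, sizeof(dirty_ms));
            snprintf(name, sizeof(name), "arena.%u.muzzy_decay_ms", i);
            mallctl(name, NULL, NULL, &muzzy_ms, sizeof(muzzy_ms));
        }
    }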

@atwufei
Author

atwufei commented Feb 11, 2018

Thank you @interwq. I still have one thing to figure out: since 10 GB+ of extra physical memory was used, does that mean the application malloc()s at least 10 GB+ within 10 seconds? That seems higher than I expected. If dirty_decay_ms is 10000, does jemalloc make sure all dirty pages freed more than 10 seconds ago are purged completely?

@interwq
Member

interwq commented Feb 11, 2018

Yes. At any moment, the dirty pages are all ones that were freed within the last N seconds, where N is the decay time. Pages are purged gradually, not all at the same time, but they will all be purged over that period.

@atwufei
Author

atwufei commented Feb 12, 2018

That doesn't seem to be the case. From the application's perspective the amount of memory in use stays the same, which is why physical memory usage is flat after setting dirty_decay_ms to 0. But the high physical memory usage lasted 2 hours, and according to malloc_stats_print() the application doesn't allocate/free that frequently. So it looks to me as if some memory is not purged even after the decay time.

@atwufei
Author

atwufei commented Feb 23, 2018

Happy Chinese New Year!

@interwq I tried again with dirty_decay_ms = 1000. After I stopped the client's traffic, the server was pretty idle with only a little CPU usage.

decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A      1953665        56288     20042644    228577702
   muzzy:   N/A            0            0            0            0

Several minutes later,

decaying:  time       npages       sweeps     madvises       purged
   dirty:   N/A      1953552        56292     20042649    228578183
   muzzy:   N/A            0            0            0            0

The dirty npages remained high. When will these pages be freed back to the OS? Another question: is it possible to purge them manually?

@interwq
Member

interwq commented Feb 23, 2018

@atwufei : Happy Chinese New Year too!

By default, purging activity is "inlined" with normal malloc/free calls, e.g. every N invocations of jemalloc will attempt to purge once. If the application goes completely idle, no threads will be executing the purging path. The background thread feature is designed to solve this problem by adding jemalloc-controlled background threads, which handle all purging asynchronously. To enable it, add "background_thread:true" to your malloc_conf. More details here: jemalloc.net/jemalloc.3.html#opt.background_thread

Manual purging is also possible; see jemalloc.net/jemalloc.3.html#arena.i.decay
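
A minimal C sketch (illustrative, assuming an unprefixed jemalloc 5.x build) of both options: enabling background threads at runtime and triggering a manual decay pass on all arenas:

    #include <stdbool.h>
    #include <stdio.h>
    #include <jemalloc/jemalloc.h>

    /* Equivalent to "background_thread:true" in malloc_conf, but can also be
     * toggled after startup. */
    static void enable_background_thread(void) {
        bool enable = true;
        if (mallctl("background_thread", NULL, NULL, &enable, sizeof(enable)) != 0)
            fprintf(stderr, "failed to enable background_thread\n");
    }

    /* arena.<i>.decay triggers decay-based purging immediately, while
     * arena.<i>.purge (shown earlier) discards all unused dirty pages
     * unconditionally. */
    static void decay_all_arenas(void) {
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "arena.%u.decay", (unsigned)MALLCTL_ARENAS_ALL);
        mallctl(cmd, NULL, NULL, NULL, 0);
    }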

@atwufei
Author

atwufei commented Feb 24, 2018

Is "purging" per arena or global? If per arena, does arena start to purge only after every N invocations in this specific arena? And how large is N, so that I can calculate the max dirty pages, a lot of malloc in our system are 16/32 KB.

@interwq
Member

interwq commented Feb 26, 2018

From the manual: ‘In the case of stats.arenas.<i>.* and arena.<i>.{initialized,purge,decay,dss}, <i> equal to MALLCTL_ARENAS_ALL can be used to operate on all arenas or access the summation of statistics from all arenas;’

As for the number of invocations between purging passes, it's an internal parameter and may change with future designs. Applications should either use the background thread feature (I suggest trying this first) or use manual purging as described above.

@atwufei
Author

atwufei commented Feb 27, 2018

Got it, thank you very much.

@viveksjain

Hope I'm not hijacking this thread, but we seem to be running into the same issue (application allocates 600MB, jemalloc has 6G mapped, dirty pages very high). Is there any ETA on releasing 5.1.0? I'm a bit nervous about running an unreleased version.

Relatedly but somewhat orthogonally, @davidtgoldblatt mentioned that this is likely caused by the application's patterns themselves being spiky. Is there any way to get heap profiling with jemalloc that shows all allocations over some period of time, regardless of whether they have been deallocated since? That would be pretty helpful for tracking down an application's inefficient use of memory.

@interwq
Member

interwq commented Mar 2, 2018

@viveksjain -- re: high dirty memory, try adding "dirty_decay_ms:1000" to your malloc_conf and see if it solves the issue for you. You may want to combine with "muzzy_decay_ms:0" if it's supported on your system. Details in http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms.

As for the 5.1 release, @davidtgoldblatt has been preparing it and it should happen relatively soon.

Re: profiling allocations cumulatively, see http://jemalloc.net/jemalloc.3.html#opt.prof_accum

@viveksjain

Sorry for the delay, got busy with some other stuff.

  1. I tried out dirty_decay_ms:1000, but as I had expected it didn't help too much (since even with the default of 10000, we saw memory spikes that persisted for several minutes).
  2. I tried checking out jemalloc master and testing it in-house. While it did help somewhat (spikes went from 6GB to 4GB), it did not solve the problem for us.
  3. prof_accum is exactly what I was looking for, thanks! Will investigate that output (note to others: when using jeprof you also need to specify --alloc_space to show accumulated data).
  4. I also tried enabling background_thread - our application is already heavily multithreaded, so it makes sense to let jemalloc use a few threads as well. This did fix the memory spike issue for us, so we're planning to use it as our default going forward.

@viveksjain

To close the loop here: I believe this manifests due to our application's use of a producer-consumer pattern. The producer thread allocates lots of data and hands it off to the consumer, which eventually frees it. jemalloc seems to bound each individual producer's maximum dirty memory (to about 100 MB in a basic test), but there does not seem to be any global limit. Our application spins up producer-consumer pair 1, processes a bunch of data, then after some time two other threads spin up P-C pair 2, which also processes a bunch of data, and so on. Even though P-C 1 has freed all of its data, jemalloc does not return it to the system, likely due to @interwq's point that if a thread goes idle, it will not be executing the purging path. So even though the application's total allocated amount stays low, jemalloc's mapped memory scales with the number of producer-consumer pairs (up to the number of arenas, I'm guessing).

Enabling background_thread solves the issue.
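
For illustration, a hypothetical C sketch (not part of the original report) of the idle-thread effect described above: a worker allocates and frees a large batch of 32 KB buffers and then goes idle, so with background_thread disabled nothing runs the purging path and the freed pages stay resident:

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    enum { NCHUNKS = 32768, CHUNK = 32 * 1024 }; /* ~1 GB per worker */

    /* Run with jemalloc as the allocator (linked in or via LD_PRELOAD) and
     * background_thread disabled to observe the effect. */
    static void *worker(void *arg) {
        (void)arg;
        void **bufs = malloc(NCHUNKS * sizeof(void *));
        for (int i = 0; i < NCHUNKS; i++) {
            bufs[i] = malloc(CHUNK);
            memset(bufs[i], 0xab, CHUNK); /* touch pages so they become dirty */
        }
        for (int i = 0; i < NCHUNKS; i++)
            free(bufs[i]);
        free(bufs);
        /* Idle: no further malloc/free on this thread, so no inlined purging. */
        sleep(600);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(30);
        /* RSS (e.g. from /proc/self/status) stays high here even though the
         * worker has freed everything, unless background_thread is enabled or
         * an explicit arena.<i>.decay / purge is issued. */
        pthread_join(t, NULL);
        return 0;
    }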

@interwq
Member

interwq commented Apr 4, 2018

Thanks for following up, @viveksjain. For context, we enable background_thread by default for all services at Facebook. Currently this is the only way to purge unused memory when application threads go idle. It also provides other advantages, such as improved tail latency.

@GauravAccGit

GauravAccGit commented Sep 25, 2019

I have tried with background_thread enabled, but the RSS of my application is still not going down. Any pointers would be really helpful. I am trying with jemalloc 5.2.1.
Please see #1630 for more details.

pitrou added a commit to apache/arrow that referenced this issue Oct 22, 2019
…se dirty pages more aggressively back to the OS dirty_decay_ms and muzzy_decay_ms to 0 by default, add C++ / Python option to configure this

The current default behavior causes applications dealing in large datasets to hold on to a large amount of physical operating system memory. While this may improve performance in some cases, it empirically seems to be causing problems for users.

There's some discussion of this issue in some other contexts here

jemalloc/jemalloc#1128

Here is a test script I used to check the RSS while reading a large Parquet file (~10GB in memory) in a loop (requires downloading the file http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet)

https://gist.github.com/wesm/c75ad3b6dcd37231aaacf56a80a5e401

This patch enables jemalloc background page reclamation and reduces the time decay from 10 seconds to 1 second so that memory is returned to the OS more aggressively.

Closes #5701 from wesm/ARROW-6910 and squashes the following commits:

8fc8aa8 <Antoine Pitrou> Revert "Try to fix protobuf-related clang warning"
ab67abb <Antoine Pitrou> Try to fix protobuf-related clang warning
9290470 <Wes McKinney> Review comments, disable PLASMA_VALGRIND in Travis
8c4d367 <Wes McKinney> Use background_thread:true and 1000ms decay
daa5416 <Wes McKinney> Set jemalloc dirty_decay_ms and muzzy_decay_ms to 0 by default, add function to set the values to something else

Lead-authored-by: Wes McKinney <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
kszucs pushed a commit to kszucs/arrow that referenced this issue Oct 23, 2019
alamb pushed a commit to apache/arrow-rs that referenced this issue Apr 20, 2021