
Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM #8993

@highker


Yes, we leak native memory

When compressing/decompressing gzipped tables with the rcfile writers, we use Java's native zlib inflaters and deflaters, which allocate system native memory. There is an ongoing effort (#8531, #8879, #8455, #8529) to ensure the gzip input and output streams are properly closed to prevent native memory leaks. However, even with all these fixes, we are still leaking memory. The following figure shows the native memory usage with 4 concurrent queries of the shape `insert into A select * from B`. The cluster OOMed several times.
[Figure: native memory usage of 4 concurrent insert-select queries, with repeated OOMs]
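
For context, the fix pattern those PRs move toward looks roughly like the sketch below (class and method names here are illustrative, not taken from the Presto code base): close the gzip stream on every code path so the native zlib state behind its Inflater is released immediately, rather than whenever a finalizer happens to run.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Minimal sketch: GZIPInputStream owns a native zlib Inflater, so the stream
// must be closed on every path, otherwise the native memory behind it is only
// reclaimed when the finalizer eventually runs.
public final class GzipCloseExample
{
    private GzipCloseExample() {}

    public static long countDecompressedBytes(InputStream compressed) throws IOException
    {
        long total = 0;
        byte[] buffer = new byte[8192];
        // try-with-resources guarantees close(), which ends the internal
        // Inflater and frees its native zlib state immediately
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }
}
```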

But there are no leaking objects!

To understand which objects are not freed, we used jemalloc. However, the jemalloc profiling result shows zero memory leaked. What's interesting is that the machines with jemalloc turned on showed no sign of a memory leak at all. The following figure compares a node with jemalloc and a node with the default allocator (glibc) in the same cluster, running the same queries as above.
[Figure: native memory usage of a jemalloc node vs. a glibc node running the same queries]

Why do memory allocators make a difference?

glibc is the default native memory allocator for Java. Memory allocated through glibc may NOT be returned to the OS once it is freed; glibc keeps it around as a performance optimization. The downside is memory fragmentation. The fragmentation can grow unboundedly and eventually triggers an OOM. This blog describes the details. On the other hand, jemalloc is designed to minimize memory fragmentation, which avoids this problem in the first place.
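
To make the mechanism concrete, here is a toy reproduction sketch (not from the issue; the class name and sizes are illustrative): every worker thread repeatedly creates and ends zlib deflaters, so freed native chunks accumulate inside many per-thread glibc arenas. Running it with the default arena count and again with MALLOC_ARENA_MAX=2, while watching the process RSS, should show the kind of difference described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.Deflater;

// Toy reproduction sketch: churn native zlib allocations from many threads
// so glibc serves them from many per-thread arenas.
public class ArenaChurn
{
    public static void main(String[] args) throws InterruptedException
    {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        byte[] input = new byte[1 << 20]; // 1MB of zeros, compresses trivially

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                byte[] output = new byte[1 << 20];
                while (!Thread.currentThread().isInterrupted()) {
                    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
                    deflater.setInput(input);
                    deflater.finish();
                    while (!deflater.finished()) {
                        deflater.deflate(output);
                    }
                    // end() frees the native zlib state, but the freed chunks
                    // stay inside whichever glibc arena served this thread
                    deflater.end();
                }
            });
        }

        // run for a fixed time window, then interrupt the workers
        pool.awaitTermination(10, TimeUnit.MINUTES);
        pool.shutdownNow();
    }
}
```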

Tuning glibc

MALLOC_ARENA_MAX is an environment variable that controls how many memory pools (arenas) glibc can create. By default, it is 8 × the number of CPU cores. With MALLOC_ARENA_MAX set to 2, the OOM issue is completely gone. The following figure shows the native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc. Notice that the drop is not an OOM; I just killed the query. When MALLOC_ARENA_MAX is 2 or 4, the memory savings are even better than with jemalloc. But of course, this is a trade-off between memory and performance.
[Figure: native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc]
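
As a rough illustration of why the default matters, the sketch below prints how much address space the arenas alone can occupy on the current box. It assumes glibc's per-arena heap is 64MB on 64-bit (an assumption about glibc internals, not something measured in this issue) and that an unset variable means the 8 × cores default; with many cores this is tens of GB of space in which freed chunks can sit fragmented.

```java
// Back-of-the-envelope sketch; the 64MB per-arena heap size is an assumption
// about glibc's HEAP_MAX_SIZE on 64-bit systems.
public final class ArenaFootprint
{
    public static void main(String[] args)
    {
        long arenaHeapBytes = 64L << 20; // assumed glibc per-arena heap size
        int cores = Runtime.getRuntime().availableProcessors();
        String configured = System.getenv("MALLOC_ARENA_MAX");
        long arenas = (configured != null) ? Long.parseLong(configured) : 8L * cores;
        System.out.printf("cores=%d arenas=%d rough arena footprint=%d MB%n",
                cores, arenas, arenas * arenaHeapBytes >> 20);
    }
}
```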

What can we do to prevent this?

  • Use a memory pool like what Hadoop does
  • Switch to jemalloc
  • Tune down MALLOC_ARENA_MAX

The first option may not work well, given that a memory pool can hold onto a codec for a long time without releasing it, which leads to wasted memory. That is also the reason we switched from the Hadoop gzip library to the JDK one (#8481). Switching to jemalloc could be an option but may bring some uncertainty to the existing system. So maybe we should just tune down MALLOC_ARENA_MAX?

Pick a number for MALLOC_ARENA_MAX

The goal is to find out what value of MALLOC_ARENA_MAX is appropriate. Of course, this can vary across different types of machines/clusters. The test environment is a cluster with 95 nodes, where each node has 200GB of heap memory and 50GB of native memory.

1. To what extent we may OOM
Setting: a script repeatedly runs 4 concurrent queries, each reading the same table with 256 billion rows and inserting into one of 4 other tables. This benchmark runs for hours to determine whether there is a trend toward (or an actual) OOM. A sketch of such a driver script follows the results below.

MALLOC_ARENA_MAX=4:	no OOM
MALLOC_ARENA_MAX=8:	no OOM
MALLOC_ARENA_MAX=16:	OOM

Admittedly, this benchmark may not be representative, since the outcome really depends on what queries we are running and how we split memory between heap and non-heap.
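
A driver for this kind of stress test could look like the sketch below (the JDBC URL, user, and table names are placeholders, not the ones used in the benchmark): four threads loop forever, each running an insert-select so the writers keep exercising the native zlib codecs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative stress driver: 4 concurrent insert-select loops via JDBC.
// Host, catalog, schema, user, and table names are placeholders.
public final class InsertSelectStress
{
    public static void main(String[] args)
    {
        String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 4; i++) {
            String sink = "stress_sink_" + i; // placeholder target table
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try (Connection connection = DriverManager.getConnection(url, "stress", null);
                            Statement statement = connection.createStatement()) {
                        statement.execute("INSERT INTO " + sink + " SELECT * FROM source_table");
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    }
}
```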

2. CPU performance
Setting: a single query reads a table with 111 billion rows/26 columns and writes into another table. The task concurrency and the number of writers are both set to 64 to simulate the production environment and put pressure on memory.

Original Hadoop writer:				43.60 CPU days
default MALLOC_ARENA_MAX with rcfile writer:	38.41 CPU days
MALLOC_ARENA_MAX=8 with rcfile writer:		38.57 CPU days
MALLOC_ARENA_MAX=4 with rcfile writer:		38.61 CPU days
MALLOC_ARENA_MAX=2 with rcfile writer:		38.69 CPU days

The rcfile writer is designed to run faster than the Hadoop one. Among the different values of MALLOC_ARENA_MAX, the differences are subtle. I'd bet most of the CPU is spent compressing/decompressing/writing/reading data rather than allocating/deallocating memory.

Conclusion

When memory is leaking, it may not be a problem in our code. It could just be improper tuning.
