
Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM #8993

@highker


Yes, we leak native memory

When compressing/decompressing gzipped tables with the rcfile writers, we use Java's native zlib inflaters and deflaters, which allocate system native memory. There is an ongoing effort (#8531, #8879, #8455, #8529) to ensure the gzip input and output streams are properly closed to prevent native memory leaks. However, even with all these fixes, we are still leaking memory. The following figure shows the native memory usage with 4 concurrent queries of the shape `insert into A select * from B`. The cluster OOMed several times.
[Figure: native memory usage of 4 concurrent insert-select queries, with repeated OOMs]
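
For context, the fix pattern those PRs move toward looks roughly like the sketch below (class and method names here are illustrative, not taken from the Presto code base): close the gzip stream on every code path so the native zlib state behind its Inflater is released immediately, rather than whenever a finalizer happens to run.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Minimal sketch: GZIPInputStream owns a native zlib Inflater, so the stream
// must be closed on every path, otherwise the native memory behind it is only
// reclaimed when the finalizer eventually runs.
public final class GzipCloseExample
{
    private GzipCloseExample() {}

    public static long countDecompressedBytes(InputStream compressed) throws IOException
    {
        long total = 0;
        byte[] buffer = new byte[8192];
        // try-with-resources guarantees close(), which ends the internal
        // Inflater and frees its native zlib state immediately
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }
}
```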

But there are no leaking objects!

To understand which objects are not freed, we used jemalloc. However, the jemalloc profiling result shows zero memory leaked. What's interesting is that the machines with jemalloc turned on showed no sign of a memory leak at all. The following figure compares a node with jemalloc and a node with the default allocator (glibc) in the same cluster, running the same queries as above.
[Figure: native memory usage of a jemalloc node vs. a glibc node running the same queries]

Why do memory allocators make a difference?

glibc is the default native memory allocator for Java. Memory allocated through glibc may NOT be returned to the OS once it is freed; glibc keeps it around as a performance optimization. The downside is memory fragmentation. The fragmentation can grow unboundedly and eventually triggers an OOM. This blog describes the details. On the other hand, jemalloc is designed to minimize memory fragmentation, which avoids this problem in the first place.
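
To make the mechanism concrete, here is a toy reproduction sketch (not from the issue; the class name and sizes are illustrative): every worker thread repeatedly creates and ends zlib deflaters, so freed native chunks accumulate inside many per-thread glibc arenas. Running it with the default arena count and again with MALLOC_ARENA_MAX=2, while watching the process RSS, should show the kind of difference described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.Deflater;

// Toy reproduction sketch: churn native zlib allocations from many threads
// so glibc serves them from many per-thread arenas.
public class ArenaChurn
{
    public static void main(String[] args) throws InterruptedException
    {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        byte[] input = new byte[1 << 20]; // 1MB of zeros, compresses trivially

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                byte[] output = new byte[1 << 20];
                while (!Thread.currentThread().isInterrupted()) {
                    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
                    deflater.setInput(input);
                    deflater.finish();
                    while (!deflater.finished()) {
                        deflater.deflate(output);
                    }
                    // end() frees the native zlib state, but the freed chunks
                    // stay inside whichever glibc arena served this thread
                    deflater.end();
                }
            });
        }

        // run for a fixed time window, then interrupt the workers
        pool.awaitTermination(10, TimeUnit.MINUTES);
        pool.shutdownNow();
    }
}
```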

Tuning glibc

MALLOC_ARENA_MAX is an environment variable that controls how many memory pools (arenas) glibc can create. By default, it is 8 × the number of CPU cores. With MALLOC_ARENA_MAX set to 2, the OOM issue is completely gone. The following figure shows the native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc. Notice that the drop is not an OOM; I just killed the query. When MALLOC_ARENA_MAX is 2 or 4, the memory savings are even better than with jemalloc. But of course, this is a trade-off between memory and performance.
[Figure: native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc]
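
As a rough illustration of why the default matters, the sketch below prints how much address space the arenas alone can occupy on the current box. It assumes glibc's per-arena heap is 64MB on 64-bit (an assumption about glibc internals, not something measured in this issue) and that an unset variable means the 8 × cores default; with many cores this is tens of GB of space in which freed chunks can sit fragmented.

```java
// Back-of-the-envelope sketch; the 64MB per-arena heap size is an assumption
// about glibc's HEAP_MAX_SIZE on 64-bit systems.
public final class ArenaFootprint
{
    public static void main(String[] args)
    {
        long arenaHeapBytes = 64L << 20; // assumed glibc per-arena heap size
        int cores = Runtime.getRuntime().availableProcessors();
        String configured = System.getenv("MALLOC_ARENA_MAX");
        long arenas = (configured != null) ? Long.parseLong(configured) : 8L * cores;
        System.out.printf("cores=%d arenas=%d rough arena footprint=%d MB%n",
                cores, arenas, arenas * arenaHeapBytes >> 20);
    }
}
```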

What can we do to prevent this?

  • Use a memory pool like what Hadoop does
  • Switch to jemalloc
  • Tune down MALLOC_ARENA_MAX

The first option may not work well, given that a memory pool can hold onto a codec for a long time without releasing it, which leads to wasted memory. That is also the reason we switched from the Hadoop gzip library to the JDK one (#8481). Switching to jemalloc could be an option but may bring some uncertainty to the existing system. So maybe we should just tune down MALLOC_ARENA_MAX?

Pick a number for MALLOC_ARENA_MAX

The goal is to find out what value of MALLOC_ARENA_MAX is appropriate. Of course, this can vary across different types of machines/clusters. The test environment is a cluster with 95 nodes, where each node has 200GB of heap memory and 50GB of native memory.

1. To what extent we may OOM
Setting: a script repeatedly runs 4 concurrent queries, each reading the same table with 256 billion rows and inserting into one of 4 other tables. This benchmark runs for hours to determine whether there is a trend toward (or an actual) OOM. A sketch of such a driver script follows the results below.

MALLOC_ARENA_MAX=4:	no OOM
MALLOC_ARENA_MAX=8:	no OOM
MALLOC_ARENA_MAX=16:	OOM

Admittedly, this benchmark may not be representative, since the outcome really depends on what queries we are running and how we split memory between heap and non-heap.
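
A driver for this kind of stress test could look like the sketch below (the JDBC URL, user, and table names are placeholders, not the ones used in the benchmark): four threads loop forever, each running an insert-select so the writers keep exercising the native zlib codecs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative stress driver: 4 concurrent insert-select loops via JDBC.
// Host, catalog, schema, user, and table names are placeholders.
public final class InsertSelectStress
{
    public static void main(String[] args)
    {
        String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 4; i++) {
            String sink = "stress_sink_" + i; // placeholder target table
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try (Connection connection = DriverManager.getConnection(url, "stress", null);
                            Statement statement = connection.createStatement()) {
                        statement.execute("INSERT INTO " + sink + " SELECT * FROM source_table");
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    }
}
```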

2. CPU performance
Setting: a single query reads a table with 111 billion rows/26 columns and writes into another table. The task concurrency and the number of writers are both set to 64 to simulate the production environment and put pressure on memory.

Original Hadoop writer:				43.60 CPU days
default MALLOC_ARENA_MAX with rcfile writer:	38.41 CPU days
MALLOC_ARENA_MAX=8 with rcfile writer:		38.57 CPU days
MALLOC_ARENA_MAX=4 with rcfile writer:		38.61 CPU days
MALLOC_ARENA_MAX=2 with rcfile writer:		38.69 CPU days

The rcfile writer is designed to run faster than the Hadoop one. Among the different values of MALLOC_ARENA_MAX, the differences are subtle. I'd bet most of the CPU is spent compressing/decompressing/writing/reading data rather than allocating/deallocating memory.

Conclusion

When memory is leaking, it may not be a problem in our code. It could just be improper tuning.
