PARQUET-160: avoid wasting 64K per empty buffer. #98
julienledem wants to merge 18 commits into apache:master
We should tweak the initialSize here.
Levels should get a tiny initial size (100 bytes?) in case they are always null or always defined.
The initial size here should be tweaked as well to something smaller:
Should 5 be configurable too?
We could also make CapacityByteArrayOutputStream abstract, or have it take a slab size calculator as an argument, so that we can plug in different behaviors here. What do you think?
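To make the "slab size calculator" suggestion concrete, here is a minimal sketch of what such a pluggable strategy might look like. The names `SlabSizeCalculator` and `DoublingSlabSizeCalculator` are hypothetical and are not part of parquet-mr; the doubling policy and the cap are assumptions chosen only for illustration, not what this PR ended up implementing.

```java
// Hypothetical sketch: none of these types exist in parquet-mr.
// A strategy that decides how large the next slab of a growable buffer should be.
interface SlabSizeCalculator {
    int nextSlabSize(int bytesUsedSoFar, int minimumBytesNeeded);
}

// Assumed policy: the first slab has a small configurable size, and each later
// slab is as large as everything written so far (so total capacity roughly
// doubles), capped at maxSlabSize. A single request larger than the cap is
// still honored so that the pending write always fits.
final class DoublingSlabSizeCalculator implements SlabSizeCalculator {
    private final int initialSlabSize;
    private final int maxSlabSize;

    DoublingSlabSizeCalculator(int initialSlabSize, int maxSlabSize) {
        if (initialSlabSize <= 0 || maxSlabSize < initialSlabSize) {
            throw new IllegalArgumentException(
                "need 0 < initialSlabSize <= maxSlabSize, got "
                    + initialSlabSize + " and " + maxSlabSize);
        }
        this.initialSlabSize = initialSlabSize;
        this.maxSlabSize = maxSlabSize;
    }

    @Override
    public int nextSlabSize(int bytesUsedSoFar, int minimumBytesNeeded) {
        // Clamp the doubling candidate into [initialSlabSize, maxSlabSize].
        int candidate = Math.min(Math.max(bytesUsedSoFar, initialSlabSize), maxSlabSize);
        // Never return less than what the pending write requires.
        return Math.max(candidate, minimumBytesNeeded);
    }
}
```

With a design like this, a stream would take the calculator as a constructor argument, so a writer for levels could pass a tiny initial size while a writer for values passes a larger one.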
Do you want to tweak the initial size here as well?
@julienledem ping!
… a simpler heuristic in the column writers instead
Sent a PR against this PR here: julienledem#2
…onaryValuesWriter as well
Updates to PR-98
@tsdeng OK, this PR is now ready for review; it has both @julienledem's changes and mine.
…nledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java
	parquet-hadoop/src/test/java/parquet/hadoop/TestColumnChunkPageWriteStore.java
+1, let's merge when the tests are green
I'm running these tests here: in case we have to wait a long time for the travis CI apache queue.
Tests passed! merging now...
This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

Author: Alex Levenson <[email protected]>
Author: julien <[email protected]>
Author: Julien Le Dem <[email protected]>

Closes apache#98 from julienledem/avoid_wasting_64K_per_empty_buffer and squashes the following commits:

b0200dd [julien] add license
a1b278e [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
5304ee1 [julien] remove unused constant
81e399f [julien] Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
ccf677d [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
37148d6 [Julien Le Dem] Merge pull request #2 from isnotinvain/PR-98
b9abab0 [Alex Levenson] Address Julien's comment
965af7f [Alex Levenson] one more typo
9939d8d [Alex Levenson] fix typos in comments
61c0100 [Alex Levenson] Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well
a257ee4 [Alex Levenson] Improve IndexOutOfBoundsException message
64d6c7f [Alex Levenson] update comments
8b54667 [Alex Levenson] Don't use CapacityByteArrayOutputStream for writing page chunks
6a20e8b [Alex Levenson] Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead
3a0f8e4 [Alex Levenson] Use simpler settings for column chunk writer
b2736a1 [Alex Levenson] Some cleanup in CapacityByteArrayOutputStream
1df4a71 [julien] refactor CapacityByteArray to be aware of page size
95c8fb6 [julien] avoid wasting 64K per empty buffer.
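The problem named in the commit message, a buffer that allocates its default size as soon as it is instantiated, can be illustrated with a minimal sketch of the opposite behavior: deferring allocation until the first write. `LazyByteBuffer` is a hypothetical class written for this illustration and is not the code merged in this PR; the 64K figure in the usage note is taken from the PR title.

```java
import java.util.Arrays;

// Hypothetical sketch, not parquet-mr code: a growable byte buffer that
// allocates nothing until the first byte is written, so a buffer for an
// always-empty column costs no backing storage.
final class LazyByteBuffer {
    private static final byte[] EMPTY = new byte[0];

    private final int initialCapacity;
    private byte[] data = EMPTY; // shared empty array until the first write
    private int size = 0;

    LazyByteBuffer(int initialCapacity) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException(
                "initialCapacity must be positive, got " + initialCapacity);
        }
        this.initialCapacity = initialCapacity;
    }

    void write(int b) {
        if (size == data.length) {
            // First write allocates initialCapacity; later growth doubles.
            // multiplyExact throws ArithmeticException instead of overflowing.
            int newCapacity = data.length == 0
                ? initialCapacity
                : Math.multiplyExact(data.length, 2);
            data = Arrays.copyOf(data, newCapacity);
        }
        data[size++] = (byte) b;
    }

    int size() {
        return size;
    }

    // Bytes currently reserved by the backing array.
    int allocatedBytes() {
        return data.length;
    }

    byte[] toByteArray() {
        return Arrays.copyOf(data, size);
    }
}
```

A buffer created with `new LazyByteBuffer(64 * 1024)` reports `allocatedBytes() == 0` until `write` is called, so only columns that actually receive data pay for a backing array.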