PARQUET-177: Added lower bound to memory manager resize #115

danielcweeks wants to merge 3 commits into apache:master from danielcweeks/memory-manager-limit
Conversation
We could have a better minimum than 0, which is in any case a bad value. Possibly something related to the pageSize or the initial blockSize?
We could set it to something like 10MB, which is my best guess (without any empirical evidence). We give tasks 1.5 GB, which only safely allows for about 10 writers per task with the default 128MB row group setting. A 10MB lower bound would probably push us to 100 writers, which would likely account for most situations.
Any other suggestions are welcome.
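The arithmetic behind those writer counts can be sketched as follows. This is a back-of-the-envelope check only; the constants are the rough figures quoted in this thread, not values read from the Parquet code:

```java
// Back-of-the-envelope sizing from the discussion above. All constants are the
// rough figures quoted in this thread, not values from Parquet itself.
public class WriterSizing {
    public static void main(String[] args) {
        long taskMemory = 1536L << 20;    // ~1.5 GB of task memory
        long rowGroupSize = 128L << 20;   // default 128 MB row group
        long proposedFloor = 10L << 20;   // proposed 10 MB lower bound

        // Without a lower bound, each writer wants a full row group's worth of memory.
        long writersAtDefault = taskMemory / rowGroupSize;  // 12

        // With a 10 MB floor, allocations can shrink to support far more writers.
        long writersAtFloor = taskMemory / proposedFloor;   // 153

        System.out.println(writersAtDefault + " writers at default, "
                + writersAtFloor + " writers at the floor");
    }
}
```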
This looks good to me. I made a comment regarding the default value for this.
Ok, I updated the property name to be clearer and set the min.

-Dan
This looks good to me. I'm wondering if later we should add a minimum in terms of column chunk size. I know we can't guarantee it, but if we can take the number of columns into account we would have a more accurate minimum. For example, if you have 10 columns and 128 MB, you get on average ~12MB column chunks. But if you have 100 columns, then you get on average 1.2MB column chunks. That's a big difference, so if we can make this check schema-dependent, it might be worth doing.
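The schema-width effect described above is simple division: for a fixed row group size, the average column chunk shrinks linearly with the number of columns. A quick illustrative check (the numbers mirror the example in the comment above):

```java
// Average column chunk size falls linearly with column count for a fixed row
// group size, which is why a schema-aware minimum was suggested in this thread.
public class ChunkSizing {
    static double avgChunkMB(double rowGroupMB, int numColumns) {
        return rowGroupMB / numColumns;
    }

    public static void main(String[] args) {
        System.out.println(avgChunkMB(128, 10));   // 12.8 (the "~12MB" above)
        System.out.println(avgChunkMB(128, 100));  // 1.28 (the "1.2MB" above)
    }
}
```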
That's an excellent point, but it would make it a lot harder to estimate a sane default. When you get into this situation, you've already started to degrade your performance. I do like the idea of making column chunk min size an alternate setting.
This looks good to me.
OK, I'll create a JIRA to capture Ryan's enhancement. Throw me a plus and I'll give merging a shot.
I don't think you would need to change how the number of writers is affected. Just track the schema with the most columns and use that count to estimate the smallest chunk size. I agree this can be a separate JIRA issue. You're right that when you hit this, you're already in a bad state. That's why I think it is important to produce a prominent warning and tell everyone to change their writer architecture instead of allowing this to dial down row group size.
Ryan, I agree fully that using the column approach would be more accurate (and not really harder to implement), but I'm having a harder time coming up with a sane default value. Do we have any idea what a good minimum is for a column chunk? Also, this would effectively be an average across the table. You could still have tiny column chunks if your column data is significantly skewed. I'll take a quick look at switching the implementation.
@rdblue is this more in line with what you were thinking? I just used the page size for the default (i.e. the average chunk size should never be smaller than a single page). That might be too low, but possibly a good place to start.
@danielcweeks, page size is a great idea. That's already configurable and it makes no sense to have a column chunk less than a page.
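A minimal sketch of the rule this thread converges on, assuming a hypothetical `scaledAllocation` helper (this is illustrative only, not the actual MemoryManager code): scale each writer's row group allocation down when total demand exceeds the pool, but never below one page per column, so the average column chunk can always hold at least one page.

```java
// Illustrative sketch only (hypothetical names, not Parquet's MemoryManager):
// scale allocations down under memory pressure, with a schema-aware floor of
// one page per column.
public class ResizeSketch {
    static long scaledAllocation(long requested, long totalRequested, long pool,
                                 int numColumns, long pageSize) {
        if (totalRequested <= pool) {
            return requested;                       // everything fits, no scaling
        }
        long scaled = (long) (requested * ((double) pool / totalRequested));
        long floor = (long) numColumns * pageSize;  // one page per column chunk
        return Math.max(scaled, floor);
    }

    public static void main(String[] args) {
        // 20 writers each asking for 128 MB against a 1 GB pool,
        // with a 100-column schema and 1 MB pages: the per-column floor wins.
        long requested = 128L << 20;
        long size = scaledAllocation(requested, 20 * requested, 1L << 30, 100, 1L << 20);
        System.out.println(size);
    }
}
```

With a wide schema the floor (100 columns × 1 MB pages = 100 MB here) can exceed the proportionally scaled size, which is exactly the skew concern raised above: the floor is an average across the schema, not a per-column guarantee.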
+1
Thanks @rdblue! I like this a lot better.
PARQUET-177

Author: Daniel Weeks <[email protected]>

Closes apache#115 from danielcweeks/memory-manager-limit and squashes the following commits:

b2e4708 [Daniel Weeks] Updated to base memory allocation off estimated chunk size
09d7aa3 [Daniel Weeks] Updated property name and default value
8f6cff1 [Daniel Weeks] Added low bound to memory manager resize

Conflicts:
    parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java
Resolution:
    Conflict due to newline change at the end of the file.