PARQUET-177: Added lower bound to memory manager resize #115

danielcweeks wants to merge 3 commits into apache:master from danielcweeks/memory-manager-limit
Conversation
We could have a better minimum than 0, which is in any case a bad value. Possibly something related to the pageSize or the initial blockSize?
We could set it to something like 10MB, which is my best guess (without any empirical evidence). We give tasks 1.5 GB, which only safely allows for about 10 writers per task with the default 128MB row group setting. A 10MB lower bound would probably push us to 100 writers, which would likely account for most situations.
Any other suggestions are welcome.
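The arithmetic behind those writer counts can be sketched as follows. This is a back-of-the-envelope check only; the constants are the rough figures quoted in this thread, not values read from the Parquet code:

```java
// Back-of-the-envelope sizing from the discussion above. All constants are the
// rough figures quoted in this thread, not values from Parquet itself.
public class WriterSizing {
    public static void main(String[] args) {
        long taskMemory = 1536L << 20;    // ~1.5 GB of task memory
        long rowGroupSize = 128L << 20;   // default 128 MB row group
        long proposedFloor = 10L << 20;   // proposed 10 MB lower bound

        // Without a lower bound, each writer wants a full row group's worth of memory.
        long writersAtDefault = taskMemory / rowGroupSize;  // 12

        // With a 10 MB floor, allocations can shrink to support far more writers.
        long writersAtFloor = taskMemory / proposedFloor;   // 153

        System.out.println(writersAtDefault + " writers at default, "
                + writersAtFloor + " writers at the floor");
    }
}
```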
This looks good to me. I made a comment regarding the default value for this.
Ok, I updated the property name to be clearer and set the min.

-Dan
This looks good to me. I'm wondering if later we should add a minimum in terms of column chunk size. I know we can't guarantee it, but if we can take the number of columns into account we would have a more accurate minimum. For example, if you have 10 columns and 128 MB, you get on average ~12MB column chunks. But if you have 100 columns, then you get on average 1.2MB column chunks. That's a big difference, so if we can make this check schema-dependent, it might be worth doing.
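The schema-width effect described above is simple division: for a fixed row group size, the average column chunk shrinks linearly with the number of columns. A quick illustrative check (the numbers mirror the example in the comment above):

```java
// Average column chunk size falls linearly with column count for a fixed row
// group size, which is why a schema-aware minimum was suggested in this thread.
public class ChunkSizing {
    static double avgChunkMB(double rowGroupMB, int numColumns) {
        return rowGroupMB / numColumns;
    }

    public static void main(String[] args) {
        System.out.println(avgChunkMB(128, 10));   // 12.8 (the "~12MB" above)
        System.out.println(avgChunkMB(128, 100));  // 1.28 (the "1.2MB" above)
    }
}
```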
That's an excellent point, but it would make it a lot harder to estimate a sane default. When you get into this situation, you've already started to degrade your performance. I do like the idea of making column chunk min size an alternate setting.
This looks good to me.
OK, I'll create a JIRA to capture Ryan's enhancement. Throw me a plus and I'll give merging a shot.
I don't think you would need to change how the number of writers is affected. Just track the schema with the most columns and use that count to estimate the smallest chunk size. I agree this can be a separate JIRA issue. You're right that when you hit this, you're already in a bad state. That's why I think it is important to produce a prominent warning and tell everyone to change their writer architecture instead of allowing this to dial down row group size.
Ryan, I agree fully that using the column approach would be more accurate (and not really harder to implement), but I'm having a harder time coming up with a sane default value. Do we have any idea what a good minimum is for a column chunk? Also, this would effectively be an average across the table. You could still have tiny column chunks if your column data is significantly skewed. I'll take a quick look at switching the implementation.
@rdblue is this more in line with what you were thinking? I just used the page size for the default (i.e. the average chunk size should never be smaller than a single page). That might be too low, but possibly a good place to start.
@danielcweeks, page size is a great idea. That's already configurable and it makes no sense to have a column chunk less than a page.
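A minimal sketch of the rule this thread converges on, assuming a hypothetical `scaledAllocation` helper (this is illustrative only, not the actual MemoryManager code): scale each writer's row group allocation down when total demand exceeds the pool, but never below one page per column, so the average column chunk can always hold at least one page.

```java
// Illustrative sketch only (hypothetical names, not Parquet's MemoryManager):
// scale allocations down under memory pressure, with a schema-aware floor of
// one page per column.
public class ResizeSketch {
    static long scaledAllocation(long requested, long totalRequested, long pool,
                                 int numColumns, long pageSize) {
        if (totalRequested <= pool) {
            return requested;                       // everything fits, no scaling
        }
        long scaled = (long) (requested * ((double) pool / totalRequested));
        long floor = (long) numColumns * pageSize;  // one page per column chunk
        return Math.max(scaled, floor);
    }

    public static void main(String[] args) {
        // 20 writers each asking for 128 MB against a 1 GB pool,
        // with a 100-column schema and 1 MB pages: the per-column floor wins.
        long requested = 128L << 20;
        long size = scaledAllocation(requested, 20 * requested, 1L << 30, 100, 1L << 20);
        System.out.println(size);
    }
}
```

With a wide schema the floor (100 columns × 1 MB pages = 100 MB here) can exceed the proportionally scaled size, which is exactly the skew concern raised above: the floor is an average across the schema, not a per-column guarantee.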
+1
Thanks @rdblue! I like this a lot better.
PARQUET-177

Author: Daniel Weeks <[email protected]>

Closes apache#115 from danielcweeks/memory-manager-limit and squashes the following commits:

b2e4708 [Daniel Weeks] Updated to base memory allocation off estimated chunk size
09d7aa3 [Daniel Weeks] Updated property name and default value
8f6cff1 [Daniel Weeks] Added low bound to memory manager resize

Conflicts:
    parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java
Resolution:
    Conflict due to newline change at the end of the file.