PARQUET-177: Added lower bound to memory manager resize#115

Closed
danielcweeks wants to merge 3 commits into apache:master from danielcweeks:memory-manager-limit

Conversation

@danielcweeks

PARQUET-177

@danielcweeks danielcweeks changed the title Added low bound to memory manager resize Added lower bound to memory manager resize Feb 4, 2015
Member

We could have a better minimum than 0, which is in any case a bad value. Possibly something related to the pageSize or the initial blockSize?

Author

We could set it to something like 10 MB, which is my best guess (without any empirical evidence). We give tasks 1.5 GB, which only safely allows for about 10 writers per task with the default 128 MB row group setting. A 10 MB lower bound would probably push us to about 100 writers, which would likely cover most situations.

Any other suggestions are welcome.
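Sketching the arithmetic here for the record; the 1.5 GB task memory and 128 MB row group figures are the ones above, and the helper name is just illustrative:

```java
// Back-of-envelope estimate: how many concurrent writers fit in a task's
// memory pool at a given row-group size.
public class WriterEstimate {
    static long writersThatFit(long poolBytes, long rowGroupBytes) {
        return poolBytes / rowGroupBytes;
    }

    public static void main(String[] args) {
        long pool = 1_500L * 1024 * 1024;     // ~1.5 GB per task
        long rowGroup = 128L * 1024 * 1024;   // default row group size
        long floor = 10L * 1024 * 1024;       // proposed 10 MB lower bound
        System.out.println(writersThatFit(pool, rowGroup)); // 11 ("about 10")
        System.out.println(writersThatFit(pool, floor));    // 150 ("about 100")
    }
}
```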

@julienledem
Member

This looks good to me. I made a comment regarding the default value for this.

@danielcweeks
Author

Ok, I updated the property name to be clearer and set the min to DEFAULT_BLOCK_SIZE / 10, which should allow for a lot of writers but still maintain a reasonable (though not the most efficient) block size.

-Dan
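A minimal sketch of the check this describes; the constant and method names here are hypothetical, not the exact ones in the patch:

```java
// Lower-bound check on memory manager resizing: never scale a writer's
// allocation below a floor derived from the default block size.
public class MemoryManagerFloor {
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;  // 128 MB
    static final long MIN_ALLOCATION = DEFAULT_BLOCK_SIZE / 10; // ~12.8 MB floor

    static long boundedAllocation(long scaledBytes) {
        // Clamp the scaled-down allocation to the configured minimum.
        return Math.max(scaledBytes, MIN_ALLOCATION);
    }
}
```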

@rdblue
Contributor

rdblue commented Feb 5, 2015

This looks good to me.

I'm wondering if later we should add a minimum in terms of column chunk size. I know we can't guarantee it, but if we can take the number of columns into account we would have a more accurate minimum.

For example, if you have 10 columns and 128 MB, you get on average ~12MB column chunks. But if you have 100 columns, then you get on average 1.2MB column chunks. That's a big difference, so if we can make this check schema-dependent, it might be worth doing.
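Computing those averages directly (integer arithmetic, so the figures are approximate; the class name is illustrative):

```java
// Average column chunk size is row-group size divided by column count.
public class ChunkEstimate {
    static long avgChunkBytes(long rowGroupBytes, int columnCount) {
        return rowGroupBytes / columnCount;
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024;
        System.out.println(avgChunkBytes(rowGroup, 10));  // ~12.8 MB per chunk
        System.out.println(avgChunkBytes(rowGroup, 100)); // ~1.28 MB per chunk
    }
}
```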

@danielcweeks
Author

That's an excellent point, but it would make it a lot harder to estimate the number of writers (and apply it globally).

When you get into this situation, you've already started to degrade your
performance. This is just a minor speed bump to keep someone from doing
something catastrophically bad.

I do like the idea of making column chunk min size an alternate setting for better tuning where you need it, but we should probably make that a separate JIRA.

@julienledem
Member

This looks good to me

@danielcweeks
Author

OK, I'll create a JIRA to capture Ryan's enhancement.

Throw me a plus and I'll give merging a shot.

@rdblue
Contributor

rdblue commented Feb 5, 2015

I don't think you would need to change how the number of writers is affected. Just track the schema with the most columns and use that count to estimate the smallest chunk size. I agree this can be a separate JIRA issue.

You're right that when you hit this, you're already in a bad state. That's why I think it is important to produce a prominent warning and tell everyone to change their writer architecture instead of allowing this to dial down row group size.
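Something like this is what I have in mind; everything here is a hypothetical sketch (names and thresholds included), not code from the patch:

```java
// Track the widest schema seen and warn prominently when the scaled
// row-group size would produce tiny column chunks, rather than silently
// dialing the row group size down further.
public class ChunkSizeWarning {
    private int maxColumns = 1;

    void registerSchema(int columnCount) {
        maxColumns = Math.max(maxColumns, columnCount);
    }

    boolean chunksTooSmall(long scaledRowGroupBytes, long minChunkBytes) {
        long estimatedChunk = scaledRowGroupBytes / maxColumns;
        if (estimatedChunk < minChunkBytes) {
            System.err.println("WARN: estimated column chunk of " + estimatedChunk
                + " bytes is below the " + minChunkBytes + " byte minimum;"
                + " reduce the number of concurrent writers instead of relying"
                + " on row group scaling.");
            return true;
        }
        return false;
    }
}
```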

@danielcweeks danielcweeks changed the title Added lower bound to memory manager resize PARQUET-177: Added lower bound to memory manager resize Feb 5, 2015
@danielcweeks
Author

Ryan, I agree fully that using the column approach would be more accurate (and not really harder to implement), but I'm having a harder time coming up with a sane default value. Do we have any idea what a good minimum is for a column chunk? Also, this would effectively be an average across the table: you could still have tiny column chunks if your column data is significantly skewed.

I'll take a quick look at switching the implementation.

@danielcweeks
Author

@rdblue is this more in line with what you were thinking? I just used the page size for the default (i.e. the average chunk size should never be smaller than a single page).

That might be too low, but possibly a good place to start.
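The check reduces to something like this (the 1 MB page size below is an assumed figure for illustration, and the names are not the patch's):

```java
// Floor on the estimated average column chunk size: it should never be
// smaller than a single page.
public class PageSizeFloor {
    static boolean belowPageSize(long rowGroupBytes, int columnCount,
                                 long pageSizeBytes) {
        return rowGroupBytes / columnCount < pageSizeBytes;
    }
}
```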

@rdblue
Contributor

rdblue commented Feb 5, 2015

@danielcweeks, page size is a great idea. That's already configurable and it makes no sense to have a column chunk less than a page.

@rdblue
Contributor

rdblue commented Feb 5, 2015

+1

@asfgit asfgit closed this in 05adc21 Feb 5, 2015
@danielcweeks
Author

Thanks @rdblue! I like this a lot better.

rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
PARQUET-177

Author: Daniel Weeks <[email protected]>

Closes apache#115 from danielcweeks/memory-manager-limit and squashes the following commits:

b2e4708 [Daniel Weeks] Updated to base memory allocation off estimated chunk size
09d7aa3 [Daniel Weeks] Updated property name and default value
8f6cff1 [Daniel Weeks] Added low bound to memory manager resize

Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java
Resolution:
    Conflict due to newline change at the end of the file.