PARQUET-204: add parquet-schema directory support #136

Closed
nevillelyh wants to merge 2 commits into apache:master from nevillelyh:neville/PARQUET-204

Conversation

@nevillelyh
Contributor

No description provided.

@nevillelyh force-pushed the neville/PARQUET-204 branch from 2831d2a to 361cf63 on March 6, 2015 12:50
@tsdeng
Contributor

tsdeng commented Mar 6, 2015

This is good. +1
Thanks, @nevillelyh.
Will merge after Travis passes.

Member

As this is stateless, there could be a single static instance of it:
HIDDEN_FILE_FILTER = new HiddenFileFilter();
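The suggestion above could look roughly like the sketch below. It is not the code from this PR: to stay runnable without Hadoop on the classpath, the filter here implements java.io.FileFilter as a stand-in, and the rule that a "hidden" file has a name starting with '_' or '.' is an assumption. The field is named INSTANCE, matching the name the author reports adopting later in this thread.

```java
import java.io.File;
import java.io.FileFilter;

// Hypothetical stand-in for the filter under review; it implements
// java.io.FileFilter so the sketch runs without Hadoop.
final class HiddenFileFilter implements FileFilter {
    // The filter holds no state, so one shared instance is enough.
    static final HiddenFileFilter INSTANCE = new HiddenFileFilter();

    // Private constructor: callers have to go through INSTANCE.
    private HiddenFileFilter() {}

    @Override
    public boolean accept(File file) {
        String name = file.getName();
        // Assumption: "hidden" means the name starts with '_' or '.'.
        return !name.startsWith("_") && !name.startsWith(".");
    }
}

class HiddenFileFilterDemo {
    public static void main(String[] args) {
        FileFilter filter = HiddenFileFilter.INSTANCE;
        System.out.println(filter.accept(new File("part-00000.parquet"))); // true
        System.out.println(filter.accept(new File("_SUCCESS")));           // false
    }
}
```

Because the constructor is private, every caller shares the same object, which is safe only as long as the filter stays stateless.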

Contributor

Agree

@julienledem
Member

Thanks for cleaning up all this duplicated code.
+1

@nevillelyh force-pushed the neville/PARQUET-204 branch from 361cf63 to 633829b on March 9, 2015 10:55
@nevillelyh
Contributor Author

I changed it to a static INSTANCE member.

Contributor

Should we take schema evolution into account here, i.e., show the merged schema of all part-files?

Contributor

I don't think so. There's no guarantee that there is a single schema for the data and merging all of the schemas into one would be misleading: a union strategy can produce a schema that can't be satisfied (as a column projection) by any of the files. I think it's best to return one or all of the unique schemas, but this is already going slightly beyond what Parquet itself should be doing as a file format. Parquet reads and writes files, while Hive or Kite manages the data as a collection.
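To make the concern concrete, here is a minimal sketch under simplifying assumptions. It does not use the Parquet API: each schema is modeled as a flat set of column names, and the file names, the class UniqueSchemas, and all of its methods are hypothetical. It groups part-files by their unique schema and shows that a naive union can name a column set that no single file contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UniqueSchemas {
    // Groups file names by schema, keeping first-seen order.
    static Map<Set<String>, List<String>> groupBySchema(Map<String, Set<String>> schemaByFile) {
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : schemaByFile.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    // Naive union of every column name seen in any file.
    static Set<String> union(Map<String, Set<String>> schemaByFile) {
        Set<String> all = new LinkedHashSet<>();
        for (Set<String> columns : schemaByFile.values()) {
            all.addAll(columns);
        }
        return all;
    }

    // True if at least one file contains every column of the given schema.
    static boolean satisfiedByAnyFile(Set<String> schema, Map<String, Set<String>> schemaByFile) {
        for (Set<String> columns : schemaByFile.values()) {
            if (columns.containsAll(schema)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical part-files with flat, name-only schemas.
        Map<String, Set<String>> schemaByFile = new LinkedHashMap<>();
        schemaByFile.put("part-0", new LinkedHashSet<>(Arrays.asList("id", "name")));
        schemaByFile.put("part-1", new LinkedHashSet<>(Arrays.asList("id", "name")));
        schemaByFile.put("part-2", new LinkedHashSet<>(Arrays.asList("id", "email")));

        // Two unique schemas, each listed with the files that use it.
        groupBySchema(schemaByFile).forEach((schema, files) ->
            System.out.println(schema + " <- " + files));

        // The union {id, name, email} is not contained in any single file.
        Set<String> merged = union(schemaByFile);
        System.out.println(merged + " satisfied by a file: "
            + satisfiedByAnyFile(merged, schemaByFile));
    }
}
```

Real Parquet schemas also carry types, repetition, and nesting, so set containment is only a rough proxy for whether a projection can be satisfied.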

@asfgit closed this in fd3085e on Mar 24, 2015
@rdblue
Contributor

rdblue commented Mar 24, 2015

Thanks @nevillelyh!

5 participants