ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups (#8317)
bkietz wants to merge 5 commits into apache:master
Conversation
I think the main reason such a property would be interesting for dask's use case is to get the number of row groups in the case where statistics are not available / not yet parsed. So the way that this PR returns
@jorisvandenbossche just confirming: you want
@jorisvandenbossche PTAL
Yes, that's indeed the consequence for now (if the metadata was not yet parsed before). Long term I would like us to cache the metadata, though, without the need to necessarily directly parse all statistics etc (https://issues.apache.org/jira/browse/ARROW-10131)
force-pushed from 2257f86 to 9f5fcd1
Could you add a test for the case I commented about? I think this should do it (didn't run the code though):

```python
@pytest.mark.parquet
def test_parquet_fragment_num_row_groups(tempdir):
    import pyarrow.parquet as pq

    table = pa.table({'a': range(8)})
    pq.write_table(table, tempdir / "test.parquet", row_group_size=2)
    dataset = ds.dataset(tempdir / "test.parquet", format="parquet")
    original_fragment = list(dataset.get_fragments())[0]

    # create fragment with subset of row groups
    fragment = original_fragment.format.make_fragment(
        original_fragment.path, original_fragment.filesystem,
        row_groups=[1, 3])
    assert fragment.num_row_groups == 2

    # ensure that parsing metadata preserves correct number of row groups
    fragment.ensure_complete_metadata()
    assert fragment.num_row_groups == 2
    assert len(fragment.row_groups) == 2
```
CI failure is unrelated. Merging
No description provided.