Liyixin95 commented May 6, 2025

Add a test file for this issue.

code:

from datetime import timedelta

import pyarrow as pa
import pyarrow.parquet as pq

# 100 rows of a one-day duration; pyarrow infers the Arrow type duration[us]
data = {"a": [timedelta(days=1) for _ in range(100)]}

table = pa.Table.from_pydict(data)

pq.write_table(table, "./test.parquet")

metadata:

❯ bdt view-parquet-meta ./test.parquet
+------------+----------------------------------+
| Key        | Value                            |
+------------+----------------------------------+
| Version    | 2                                |
| Created By | parquet-cpp-arrow version 17.0.0 |
| Rows       | 100                              |
| Row Groups | 1                                |
+------------+----------------------------------+

Row Group 0 of 1 contains 100 rows and has 95 bytes:

+-------------+--------------+---------------+-----------------+-------+-------------+-------------+
| Column Name | Logical Type | Physical Type | Distinct Values | Nulls | Min         | Max         |
+-------------+--------------+---------------+-----------------+-------+-------------+-------------+
| a           | N/A          | INT64         | N/A             | 0     | 86400000000 | 86400000000 |
+-------------+--------------+---------------+-----------------+-------+-------------+-------------+

wgtmac (Member) commented May 6, 2025

Thanks for creating the PR! I think we need to include the code snippet to create this file at least in the PR description and use parquet cli to print its metadata so people can know what's in it. BTW, I'm not familiar with Polars, is it possible to use the Parquet Java writer from parquet-java or Parquet C++ writer from Apache Arrow to create such files?

Liyixin95 (Author)

@wgtmac I have updated the PR description.

BTW, I'm not familiar with Polars, is it possible to use the Parquet Java writer from parquet-java or Parquet C++ writer from Apache Arrow to create such files?

Can I use pyarrow or pandas? Otherwise I have to set up a java environment on my machine.

wgtmac (Member) commented May 6, 2025

Yes, pyarrow sounds good to me.

Liyixin95 (Author)

Yes, pyarrow sounds good to me.

Done.

wgtmac (Member) commented May 6, 2025

Ah sorry, I didn't check it carefully. Parquet does not have an official duration logical type. Therefore it does not make sense to add it to the Parquet repo. cc @alamb

alamb (Contributor) commented May 7, 2025

Ah sorry, I didn't check it carefully. Parquet does not have an official duration logical type. Therefore it does not make sense to add it to the Parquet repo. cc @alamb

I agree -- I will clarify the situation and figure out a plan for what to do here.

alamb (Contributor) commented May 8, 2025

I think we have figured out the issue and @Liyixin95 has provided a fix here:

Quoting from myself on apache/arrow-rs#5626 (comment):

Ok, what is happening here is as follows: arrow-rs and arrow-cpp (and potentially Polars) add a special file metadata field called "ARROW:schema" that records the desired Arrow schema. This is described in more detail here:

In order for the arrow-rs parquet reader to read the data as a duration it needs to interpret the contents of that metadata hint.

So I suggest we close this PR and go with the fix in arrow-rs

Liyixin95 (Author)

So I suggest we close this PR and go with the fix in arrow-rs.

Sure, I will close this PR.

Liyixin95 closed this May 9, 2025
alamb (Contributor) commented May 9, 2025

So I suggest we close this PR and go with the fix in arrow-rs.

Sure, I will close this pr.

Thanks again for your help!
