Description
Apache Iceberg version
1.1.0
Query engine
Athena
Please describe the bug 🐞
It's not possible to read an Iceberg table with PyIceberg if the data was written using PySpark and compacted with AWS Athena.
Steps to reproduce
1. Create an Iceberg table:

    ```sql
    CREATE TABLE IF NOT EXISTS table_name
    (columns ...)
    USING ICEBERG
    PARTITIONED BY (date)
    ```

2. Write to the table using PySpark:

    ```python
    spark_df = self.spark_session.createDataFrame(df)
    spark_df.sort(date_column).writeTo(table_name).append()
    ```

3. Read the table using PyIceberg:

    ```python
    catalog = load_glue("default", {})
    table = catalog.load_table('...')
    scan = table.scan(
        row_filter=EqualTo("date", date_as_string),
    )
    result = scan.to_arrow()
    ```

    The `result` variable contains the correct data.

4. Compact the table files using the OPTIMIZE statement in AWS Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html):

    ```sql
    OPTIMIZE table_name REWRITE DATA USING BIN_PACK WHERE date = 'date_as_string'
    ```

5. Optionally, VACUUM the table. This does not change the behavior in any way.

6. Query the table using the same PyIceberg code as in step 3.

7. `to_arrow` raises an exception: `ValueError: Iceberg schema is not embedded into the Parquet file`, see https://github.com/apache/iceberg/issues/6505

8. The table can still be accessed correctly in AWS Athena.
Expected behavior
In step 7, the code should work correctly and return the same results as the code in step 3.
Dependency versions
Writing data (step 2):
- pyarrow: 11.0.0
- pyspark: 3.3.1
- iceberg-spark-runtime-3.3_2.12-1.1.0.jar
Reading data (steps 3 and 7):

```python
>>> pyiceberg.__version__
'0.3.0'
>>> pyarrow.__version__
'10.0.1'
```