Issue reading iceberg tables written by Athena with pyiceberg #6647

@nicor88

Description

Apache Iceberg version

None

Query engine

Athena v3

Please describe the bug 🐞

I'm trying to read an Iceberg table written by Athena (engine v3). I'm not sure which Iceberg version Athena uses.

When running this code:

from pyiceberg import catalog
from pyiceberg.expressions import GreaterThanOrEqual


glue_catalog = catalog.load_glue(name='glue', conf={})

glue_catalog.list_namespaces()


glue_catalog.list_tables('data_engineering')

table = glue_catalog.load_table("data_engineering.iceberg_example_1")
scan = table.scan()
files = [task.file.file_path for task in scan.plan_files()]

print(files)
df_iceberg = scan.to_pandas()
print(len(df_iceberg))

It fails on df_iceberg = scan.to_pandas(). I also tried scan.to_arrow(); the traceback below is from that call.
Listing the files that belong to the table works, so files = [task.file.file_path for task in scan.plan_files()] runs without error.

The error is the following:

Traceback (most recent call last):
  File "/Users/nicor88/deng-swiss-knife/icerberg/get_data.py", line 31, in <module>
    df_iceberg = scan.to_arrow()
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 341, in to_arrow
    return project_table(
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/io/pyarrow.py", line 508, in project_table
    schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
AttributeError: 'NoneType' object has no attribute 'get'
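
To illustrate what the traceback points at: the failing line calls .get() on parquet_schema.metadata, and a pyarrow schema's metadata attribute is None when the file carries no key-value metadata. The sketch below reproduces that pattern with a stand-in object instead of a real Parquet file. The helper name read_iceberg_schema_json and the stand-in objects are hypothetical, and the value of ICEBERG_SCHEMA is an assumption about the key pyiceberg looks up. It is not the library's fix, only a minimal picture of the failure mode and one possible guard.

```python
from types import SimpleNamespace
from typing import Optional

# Assumption: the metadata key pyiceberg looks up in the Parquet footer.
ICEBERG_SCHEMA = b"iceberg.schema"


def read_iceberg_schema_json(parquet_schema) -> Optional[bytes]:
    """Return the embedded Iceberg schema bytes, or None if absent.

    Hypothetical helper: guards against `metadata` being None, which is
    what a schema without key-value metadata exposes.
    """
    metadata = parquet_schema.metadata
    if metadata is None:
        return None
    return metadata.get(ICEBERG_SCHEMA)


# Stand-ins for schemas with and without key-value metadata.
with_metadata = SimpleNamespace(metadata={ICEBERG_SCHEMA: b'{"type": "struct"}'})
without_metadata = SimpleNamespace(metadata=None)

# The unguarded access from the traceback fails on the second stand-in.
try:
    without_metadata.metadata.get(ICEBERG_SCHEMA)
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True

guarded_present = read_iceberg_schema_json(with_metadata)
guarded_missing = read_iceberg_schema_json(without_metadata)
```

If this reading is right, the files written by Athena would simply lack the embedded schema entry, and the reader would need a fallback (for example, the schema from the table metadata) rather than assuming the key is always there.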

An example table can be created like this:

create table
    data_engineering.iceberg_example_1
  with (
    table_type='iceberg',
    is_external=false,
    location='s3://xxxx/iceberg_1',
    partitioning=ARRAY['creation_date', 'bucket(user_id, 5)'],
    format='parquet',
    vacuum_max_snapshot_age_seconds=86400,
    optimize_rewrite_delete_file_threshold=2
  )
  as
with data as (
    select
        1 as user_id,
        'pi' as user_name,
        'active' as status,
        17.89 as cost,
        1 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
    union all
    select
        2 as user_id,
        'beta' as user_name,
        'inactive' as status,
        3 as cost,
        5 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
)

select
    user_id,
    user_name,
    status,
    cost,
    quantity,
    quantity_big,
    creation_date,
    created_at,
    cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as inserted_at
from data
