feat(ingest/athena): add upstream lineage for Iceberg/Glue #16842
Linear: ING-2117
Connector Tests Results: connector tests failed. Autogenerated by the connector-tests CI pipeline.
✅ Meticulous spotted visual differences in 9 of 1480 screens tested, but all differences have already been approved. Meticulous evaluated ~10 hours of user flows against your PR.
Linear: ING-2147
…description Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Force-pushed from f41f580 to 14e5148.
Auto-pushed via claude. Ignore
@alokr-dhub I left a few comments. Could you please also update the PR description to align with the implementation?
@treff7es addressed all comments and fixed the issues. Please review.
Summary
Improves upstream lineage for the Athena connector by routing lineage to the most semantically accurate upstream entity based on catalog type, instead of always pointing to the raw S3 location.
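As a rough illustration of this routing (a sketch only, not the connector's actual code), the decision can be written as a small pure function that follows the rules listed under "Lineage routing logic" in this description. The names `Upstream` and `pick_upstream` are hypothetical, and the `(platform, name)` pair stands in for whatever URN the connector really builds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Upstream:
    """Hypothetical stand-in for an upstream dataset reference."""

    platform: str  # "glue", "iceberg", or "s3"
    name: str  # "<schema>.<table>" or a storage location


def pick_upstream(
    is_glue_catalog: bool,
    is_iceberg_table: bool,
    schema: str,
    table: str,
    s3_location: Optional[str],
) -> Optional[Upstream]:
    """Choose the upstream entity for an Athena table (illustrative only)."""
    if is_glue_catalog:
        # A Glue catalog routes to the Glue dataset, even for Iceberg tables.
        return Upstream("glue", f"{schema}.{table}")
    if is_iceberg_table:
        # Non-Glue catalog with an Iceberg table routes to the Iceberg dataset.
        return Upstream("iceberg", f"{schema}.{table}")
    if s3_location:
        # Everything else falls back to the raw S3 location.
        return Upstream("s3", s3_location)
    # Assumption: with no location available, no upstream is emitted.
    return None
```

Checking the Glue case first is what makes "Glue catalog always routes to Glue, even for Iceberg tables" hold; swapping the first two branches would change that behavior.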
Lineage routing logic:
- Glue catalog (`AwsDataCatalog` or any catalog confirmed as type `GLUE` via `ListDataCatalogs`): emits upstream lineage to the corresponding Glue dataset entity (`glue://<schema>.<table>`)
- Non-Glue catalog, Iceberg table: emits upstream lineage to the corresponding Iceberg dataset entity (`iceberg://<schema>.<table>`)
- Non-Glue catalog, non-Iceberg table: falls back to the S3 location

Catalog type detection:
- Calls the `ListDataCatalogs` API (paginated) to determine the catalog type (`GLUE`, `LAMBDA`, `HIVE`, `FEDERATED`)
- `AwsDataCatalog` is always assumed Glue (it is always backed by Glue); any other catalog name that isn't found or fails the API call is assumed non-Glue
- The result is cached in an `_is_glue_catalog: Optional[bool]` field (initialized to `None`, populated on first access via a `@property`) to avoid redundant API calls

Other changes:
- Updated the `LINEAGE_COARSE`/`LINEAGE_FINE` capability descriptions to reflect the new routing logic
- Removed `@pytest.mark.integration` from `test_athena_get_table_properties`, making it a proper unit test

Test plan
- `test_get_catalog_type_*`: covers successful lookup, case-insensitive match, catalog-not-found, and API exception paths
- `test_is_glue_catalog_*`: covers GLUE/non-GLUE catalog types, `AwsDataCatalog` fallback (true), non-default catalog fallback (false), API exception fallback for both default and non-default catalogs, and caching behavior
- `test_athena_get_table_properties_glue_iceberg_returns_glue_urn`: Glue catalog always routes to Glue URN even for Iceberg tables
- `test_athena_get_table_properties_iceberg_location`: non-Glue Iceberg table routes to Iceberg URN
- `test_athena_get_table_properties_non_glue_non_iceberg_location`: non-Glue non-Iceberg table falls back to S3 URN

The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
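
For reviewers who want a concrete picture of the catalog type detection and caching described in the summary, here is a minimal sketch under stated assumptions. `CatalogTypeResolver` is a hypothetical helper, not the class in this PR; it assumes a boto3-style Athena client whose `list_data_catalogs` paginator yields pages with a `DataCatalogsSummary` list of `CatalogName`/`Type` entries. The case-insensitive name match and the fallback order mirror the test plan; the exact-match check against `AwsDataCatalog` in the fallback is an assumption.

```python
from typing import Any, Optional


class CatalogTypeResolver:
    """Hypothetical helper illustrating detection plus caching."""

    DEFAULT_CATALOG = "AwsDataCatalog"

    def __init__(self, athena_client: Any, catalog_name: str) -> None:
        self._client = athena_client
        self._catalog_name = catalog_name
        # None means "not resolved yet"; filled in on first property access.
        self._is_glue_catalog: Optional[bool] = None

    def get_catalog_type(self) -> Optional[str]:
        """Return the catalog type, or None if not found or the call fails."""
        try:
            paginator = self._client.get_paginator("list_data_catalogs")
            wanted = self._catalog_name.lower()
            for page in paginator.paginate():
                for summary in page.get("DataCatalogsSummary", []):
                    # Case-insensitive match on the catalog name.
                    if summary.get("CatalogName", "").lower() == wanted:
                        return summary.get("Type")
        except Exception:
            # Any API failure is treated the same as "catalog not found".
            return None
        return None

    @property
    def is_glue_catalog(self) -> bool:
        if self._is_glue_catalog is None:
            catalog_type = self.get_catalog_type()
            if catalog_type is not None:
                self._is_glue_catalog = catalog_type == "GLUE"
            else:
                # Fallback: only the default catalog is assumed to be Glue.
                self._is_glue_catalog = self._catalog_name == self.DEFAULT_CATALOG
        return self._is_glue_catalog
```

With a real client this would be constructed as something like `CatalogTypeResolver(boto3.client("athena"), "AwsDataCatalog")`. Because the result is stored after the first lookup, repeated reads of `is_glue_catalog` during ingestion do not trigger further `ListDataCatalogs` calls.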