feat(ingest/athena): add upstream lineage for Iceberg/Glue #16842

Merged
alokr-dhub merged 20 commits into master from feat/add-upstream-lineage-for-iceberg
May 11, 2026


@alokr-dhub alokr-dhub commented Mar 30, 2026

Summary

Improves upstream lineage for the Athena connector by routing lineage to the most semantically accurate upstream entity based on catalog type, instead of always pointing to the raw S3 location.

Lineage routing logic:

  • Glue catalog (AwsDataCatalog or any catalog confirmed as type GLUE via ListDataCatalogs): emits upstream lineage to the corresponding Glue dataset entity (glue://<schema>.<table>)
  • Non-Glue catalog, Iceberg table: emits upstream lineage to the corresponding Iceberg dataset entity (iceberg://<schema>.<table>)
  • Non-Glue catalog, non-Iceberg table: falls back to the S3 location as before
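
The routing rules above can be sketched as a small pure function. This is only an illustration of the decision order described in this PR, not the connector's actual code: `route_upstream` and its parameters are hypothetical names, and the returned `(platform, name)` pair stands in for whatever URN builder the connector really uses.

```python
from typing import Optional, Tuple


def route_upstream(
    is_glue_catalog: bool,
    is_iceberg_table: bool,
    schema: str,
    table: str,
    s3_location: Optional[str],
) -> Optional[Tuple[str, str]]:
    """Pick the upstream (platform, name) pair for an Athena table.

    Hypothetical helper: mirrors the order of the routing rules in the
    PR description, nothing more.
    """
    # Rule 1: a Glue-backed catalog always wins, even for Iceberg tables.
    if is_glue_catalog:
        return ("glue", f"{schema}.{table}")
    # Rule 2: non-Glue catalog, Iceberg table.
    if is_iceberg_table:
        return ("iceberg", f"{schema}.{table}")
    # Rule 3: everything else falls back to the raw S3 location, if any.
    if s3_location:
        return ("s3", s3_location)
    return None
```

Checking the Glue flag first is what makes a Glue-catalog Iceberg table resolve to the Glue entity, which matches the `test_athena_get_table_properties_glue_iceberg_returns_glue_urn` case in the test plan.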

Catalog type detection:

  • Queries the Athena ListDataCatalogs API (paginated) to determine the catalog type (GLUE, LAMBDA, HIVE, FEDERATED)
  • Falls back gracefully: AwsDataCatalog is always assumed Glue (it is always backed by Glue); any other catalog that isn't found, or whose lookup fails because the API call errors, is assumed non-Glue
  • Result is cached via a _is_glue_catalog: Optional[bool] field (initialized to None, populated on first access via a @property) to avoid redundant API calls
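
A minimal sketch of the detection and caching described above, under a few assumptions: `CatalogTypeResolver` is a hypothetical class (not the connector's), the client is any object exposing a boto3-style Athena `list_data_catalogs` call that returns `DataCatalogsSummary` and `NextToken`, and the fallback compares the catalog name exactly against `AwsDataCatalog`.

```python
from typing import Any, Dict, Optional


class CatalogTypeResolver:
    """Hypothetical helper illustrating the lookup, fallback, and cache."""

    DEFAULT_CATALOG = "AwsDataCatalog"

    def __init__(self, athena_client: Any, catalog_name: str) -> None:
        self._client = athena_client
        self._catalog_name = catalog_name
        # None means "not resolved yet"; filled in on first property access.
        self._is_glue_catalog: Optional[bool] = None

    def get_catalog_type(self) -> Optional[str]:
        """Return the catalog's Type, or None if not found or the call fails."""
        kwargs: Dict[str, str] = {}
        try:
            while True:
                page = self._client.list_data_catalogs(**kwargs)
                for summary in page.get("DataCatalogsSummary", []):
                    name = summary.get("CatalogName", "")
                    # Case-insensitive match on the catalog name.
                    if name.lower() == self._catalog_name.lower():
                        return summary.get("Type")
                token = page.get("NextToken")
                if not token:
                    return None  # Last page reached, catalog not listed.
                kwargs = {"NextToken": token}
        except Exception:
            # Any API failure is treated the same as "not found".
            return None

    @property
    def is_glue_catalog(self) -> bool:
        if self._is_glue_catalog is None:
            catalog_type = self.get_catalog_type()
            if catalog_type is not None:
                self._is_glue_catalog = catalog_type.upper() == "GLUE"
            else:
                # Fallback: only the default catalog is assumed to be Glue.
                self._is_glue_catalog = (
                    self._catalog_name == self.DEFAULT_CATALOG
                )
        return self._is_glue_catalog
```

Because the result is stored after the first access, repeated reads of `is_glue_catalog` during an ingestion run do not trigger further `list_data_catalogs` calls.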

Other changes:

  • Updated LINEAGE_COARSE / LINEAGE_FINE capability descriptions to reflect new routing logic
  • Updated autogenerated connector registry description
  • Removed @pytest.mark.integration from test_athena_get_table_properties, making it a proper unit test

Test plan

  • test_get_catalog_type_* — covers successful lookup, case-insensitive match, catalog-not-found, and API exception paths

  • test_is_glue_catalog_* — covers GLUE/non-GLUE catalog types, AwsDataCatalog fallback (true), non-default catalog fallback (false), API exception fallback for both default and non-default catalogs, and caching behavior

  • test_athena_get_table_properties_glue_iceberg_returns_glue_urn — Glue catalog always routes to Glue URN even for Iceberg tables

  • test_athena_get_table_properties_iceberg_location — non-Glue Iceberg table routes to Iceberg URN

  • test_athena_get_table_properties_non_glue_non_iceberg_location — non-Glue non-Iceberg table falls back to S3 URN

  • The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)

  • Links to related issues (if applicable)

  • Tests for the changes have been added/updated (if applicable)

  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.

  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub


codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
...gestion/src/datahub/ingestion/source/sql/athena.py | 96.15% | 2 Missing ⚠️

@github-actions

Linear: ING-2117


datahub-connector-tests Bot commented Mar 30, 2026

Connector Tests Results

Connector tests failed for commit 2b05b25

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.


alwaysmeticulous Bot commented Mar 31, 2026

✅ Meticulous spotted visual differences in 9 of 1480 screens tested, but all differences have already been approved: view differences detected.

Meticulous evaluated ~10 hours of user flows against your PR.

Last updated for commit 2b05b25 fix: review changes and code refactoring. This comment will update as new commits are pushed.

@alokr-dhub alokr-dhub marked this pull request as ready for review April 1, 2026 05:16

github-actions Bot commented Apr 1, 2026

Linear: ING-2147

@alokr-dhub alokr-dhub changed the title from "feat(ingest/athena): add upstream sibling aspect for Iceberg/Glue" to "feat(ingest/athena): add upstream lineage for Iceberg/Glue" Apr 1, 2026
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
@maggiehays maggiehays added the pending-submitter-response label and removed the needs-review label Apr 1, 2026
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated

@puneetagarwal-datahub puneetagarwal-datahub left a comment

.

@puneetagarwal-datahub puneetagarwal-datahub dismissed their stale review May 4, 2026 09:22

Auto-pushed via Claude. Ignore.


treff7es commented May 4, 2026

@alokr-dhub I left a few comments. Could you please update the PR description to align with the implementation?

Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py
treff7es previously approved these changes May 4, 2026

@treff7es treff7es left a comment

oops, this comment got here accidentally

@treff7es treff7es dismissed their stale review May 4, 2026 15:28

Accidentally approved


@treff7es treff7es left a comment

Accidentally

Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/athena.py Outdated
@alokr-dhub

@treff7es addressed all comments and fixed the issues. Please review.


@treff7es treff7es left a comment

lgtm

Labels

ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.

5 participants