feat(bigquery): enrich external table metadata with source format, URIs, compression, and max bad records #16348

Merged
sgomezvillamor merged 8 commits into datahub-project:master from EladLeev:master on Mar 17, 2026

@EladLeev
Contributor

Hey,
BigQuery external tables were already detected and tagged with the EXTERNAL_TABLE subtype in DataHub, but the metadata describing their external source was missing.
This PR enriches external tables with the following customProperties, taken from the BQ table definitions:

  • external_source_format - e.g. PARQUET, CSV, ORC
  • external_source_uris - the GCS paths the table reads from
  • external_compression - e.g. GZIP
  • external_max_bad_records - tolerance for malformed rows

All values are parsed from the DDL, which is already fetched for every table (so no extra API calls, which is nice).
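To make the DDL-parsing idea concrete, here is a minimal sketch of how the four properties could be pulled out of a `CREATE EXTERNAL TABLE ... OPTIONS(...)` statement. It is not the code from this PR: the function name `parse_external_table_options`, the regexes, and the choice to join multiple URIs with a comma are all assumptions made for illustration. Only the option names (`format`, `uris`, `compression`, `max_bad_records`) and the property keys listed above come from BigQuery and this description.

```python
import re
from typing import Dict, Optional


def _string_option(ddl: str, name: str) -> Optional[str]:
    """Return the value of a quoted option such as format="PARQUET", if present."""
    match = re.search(rf'\b{name}\s*=\s*(["\'])(.*?)\1', ddl, re.IGNORECASE)
    return match.group(2) if match else None


def parse_external_table_options(ddl: str) -> Dict[str, str]:
    """Hypothetical helper: extract external-table options from a BigQuery DDL string.

    Values are returned as strings because customProperties is a string-to-string map.
    Assumption: URIs do not contain a literal ']' character.
    """
    props: Dict[str, str] = {}

    source_format = _string_option(ddl, "format")
    if source_format:
        props["external_source_format"] = source_format

    # uris is an array literal: uris=["gs://...", "gs://..."]
    uris_match = re.search(r"\buris\s*=\s*\[(.*?)\]", ddl, re.IGNORECASE | re.DOTALL)
    if uris_match:
        uris = [uri for _, uri in re.findall(r'(["\'])(.*?)\1', uris_match.group(1))]
        if uris:
            # Assumption: multiple URIs are joined with a comma.
            props["external_source_uris"] = ",".join(uris)

    compression = _string_option(ddl, "compression")
    if compression:
        props["external_compression"] = compression

    bad_records = re.search(r"\bmax_bad_records\s*=\s*(\d+)", ddl, re.IGNORECASE)
    if bad_records:
        props["external_max_bad_records"] = bad_records.group(1)

    return props


example_ddl = """
CREATE EXTERNAL TABLE `proj.ds.events`
OPTIONS(
  format="PARQUET",
  uris=["gs://bucket/a/*.parquet", "gs://bucket/b/*.parquet"],
  compression="GZIP",
  max_bad_records=10
);
"""
print(parse_external_table_options(example_ddl))
```

A DDL string without any of these options yields an empty dict, so a regular (non-external) table would simply get no extra properties in this sketch.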

Unit tests, linting and manual tests against a real BQ project and DataHub on Docker pass.

Thanks!

@github-actions bot added the ingestion and docs labels on Feb 25, 2026
@github-actions
Contributor

Linear: ING-1748

@github-actions bot added the community-contribution label on Feb 25, 2026
@datahub-cyborg bot added the needs-review label on Feb 25, 2026
@gabe-lyons
Contributor

Thanks for the contribution, @EladLeev! This is a great addition. @sgomezvillamor, want to take a stab at it?

@datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label on Feb 25, 2026
Comment thread metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py Outdated
Comment thread metadata-ingestion/tests/unit/bigquery/test_bigquery_source.py Outdated
Contributor

@sgomezvillamor sgomezvillamor left a comment

LGTM
Just left a couple of suggestions

@EladLeev
Contributor Author

LGTM Just left a couple of suggestions

Thank you for the review! I've applied the changes now 🙏

@datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label on Feb 27, 2026
@EladLeev
Contributor Author

EladLeev commented Mar 2, 2026

@sgomezvillamor @gabe-lyons would you mind having another look? 😇

Contributor

@sgomezvillamor sgomezvillamor left a comment

LGTM
thanks for the contrib

@datahub-cyborg bot added the merge-pending-ci label and removed the needs-review label on Mar 2, 2026
@github-actions
Contributor

github-actions Bot commented Mar 5, 2026

Your PR has been assigned to sergio.gomez for review (ING-1748).

@lakshay-nasa
Contributor

Hi @sgomezvillamor, looks like everything is green here (CI + approvals) and it’s been waiting for a bit. Happy to merge it if you’d like, or feel free to take it 🙂

@sgomezvillamor sgomezvillamor merged commit 8c565d8 into datahub-project:master Mar 17, 2026
52 of 54 checks passed
david-leifker pushed a commit that referenced this pull request May 27, 2026
- fix(ingestion): pin sqlglotc (#16614)
- fix(kafka): make replication factor configurable per topic (#16585)
- docs(ingestion): add request-connector page to fix dead link on Integrations page (#16617)
- docs: update announcement bar for March Town Hall (#16610)
- feat(dbt): Extract and emit stats from catalog.json (#16044)
- feat(bigquery): enrich external table metadata with source format, URIs, compression, and max bad records (#16348)
- fix(ui): Add fixes for ingestion, selecting glossaries in policies, and data product icons (#16627)
- feat(ingest/glue): Iceberg lineage (#16562)
- feat(powerbi): add external URL for Power BI App entities (#16572)
- fix(ingestion): bump authlib to >=1.6.9 for JWE RSA1_5 padding oracle… (#16633)
- fix(cli): Add gql files to the wheel build (#16637)
- docs(remote_executor/k8s): clean up source secret instructions to match EKS (#16634)

Labels

  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • docs: Issues and Improvements to docs
  • ingestion: PR or Issue related to the ingestion of metadata
  • merge-pending-ci: A PR that has passed review and should be merged once CI is green.

4 participants