Skip to content

feat(ingest): add configurable connection pool settings to REST emitter#16486

Merged
sgomezvillamor merged 3 commits into
masterfrom
feat/configurable-connection-pool
Apr 15, 2026
Merged

feat(ingest): add configurable connection pool settings to REST emitter#16486
sgomezvillamor merged 3 commits into
masterfrom
feat/configurable-connection-pool

Conversation

@sgomezvillamor
Copy link
Copy Markdown
Contributor

@sgomezvillamor sgomezvillamor commented Mar 9, 2026

Summary

Adds pool_connections and pool_maxsize configuration options to the REST emitter to allow tuning of HTTP connection pool settings. This helps prevent "Connection pool is full" warnings when running high-concurrency ingestion workloads.

Background

The hardcoded pool_maxsize=100 was originally added in October 2021 (PR #3431) to support parallel metadata writes. However, it was never made configurable. Users running with max_threads > 100 or experiencing slow requests (30+ second timeouts) could exhaust the connection pool, resulting in warnings like:

Connection pool is full, discarding connection: localhost. Connection pool size: 100

Changes

  • Added pool_connections and pool_maxsize fields to DatahubClientConfig
  • Added environment variables for system-wide defaults:
    • DATAHUB_REST_EMITTER_DEFAULT_POOL_CONNECTIONS (default: 100)
    • DATAHUB_REST_EMITTER_DEFAULT_POOL_MAXSIZE (default: 100)
  • Made pool settings configurable in both ingestion sink and Python SDK (DataHubGraph)
  • Maintains backward compatibility (defaults to 100 for both)

Configuration Options

Priority order:

  1. Explicit YAML/Python config
  2. Environment variables
  3. Defaults (100)

Example YAML config:

sink:
  type: datahub-rest
  config:
    server: "https://your-datahub.io"
    pool_maxsize: 200  # Increase for high concurrency
    max_threads: 150

Example Python SDK:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

config = DatahubClientConfig(
    server="https://your-datahub.io",
    pool_maxsize=200
)
graph = DataHubGraph(config)

Testing

  • Linting passes (./gradlew :metadata-ingestion:lintFix)
  • Backward compatible (defaults match previous hardcoded values)
  • Configuration flows through all layers (config → emitter → session)

Checklist

  • PR conforms to DataHub's Contributing Guideline (PR title format)
  • Links to related issue: Addresses connection pool exhaustion in high-concurrency scenarios
  • Tests: Existing tests cover the code paths, new config fields use standard pydantic validation
  • Docs: Configuration is self-documenting via docstrings and type hints

🤖 Generated with Claude Code

Adds pool_connections and pool_maxsize configuration options to allow
tuning of the HTTP connection pool used by the REST emitter. This helps
prevent "Connection pool is full" warnings when running high-concurrency
ingestion workloads.

The hardcoded pool_maxsize=100 was originally added in Oct 2021 to support
parallel writes, but was never made configurable. Users running with
max_threads > 100 or slow requests could exhaust the pool.

Configuration can be set via:
- YAML config: pool_maxsize, pool_connections in sink config
- Environment variables: DATAHUB_REST_EMITTER_DEFAULT_POOL_MAXSIZE,
  DATAHUB_REST_EMITTER_DEFAULT_POOL_CONNECTIONS
- Defaults to 100 for both (maintains backward compatibility)

Available in both ingestion sink and Python SDK (DataHubGraph).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 9, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ta-ingestion/src/datahub/configuration/env_vars.py 50.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@maggiehays maggiehays added the needs-review Label for PRs that need review from a maintainer. label Mar 9, 2026
@rajatoss
Copy link
Copy Markdown
Member

Connector Tests Results

All connector tests passed for commit c233e6a

View full test logs →

Autogenerated by the connector-tests CI pipeline.

@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 10, 2026
@gabe-lyons
Copy link
Copy Markdown
Contributor

Linear: ING-1859

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@github-actions github-actions Bot requested a review from skrydal March 10, 2026 20:11
@github-actions
Copy link
Copy Markdown
Contributor

Your PR has been assigned to @skrydal (piotr.skrydalewicz) for review (ING-1859).

Copy link
Copy Markdown
Collaborator

@skrydal skrydal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maggiehays maggiehays added pending-submitter-merge and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 11, 2026
@sgomezvillamor sgomezvillamor merged commit a7721a4 into master Apr 15, 2026
61 checks passed
@sgomezvillamor sgomezvillamor deleted the feat/configurable-connection-pool branch April 15, 2026 08:11
david-leifker pushed a commit that referenced this pull request May 27, 2026
- fix(ingestion/dbt): remove hard generate_docs requirement from dbt cloud auto-discovery (#16453)
- feat(ingest): add configurable connection pool settings to REST emitter (#16486)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants