
fix: Add recursion guard in LoggingHandler.emit to prevent deadlock#4302

Merged

xrmx merged 3 commits into open-telemetry:main from
deepanshuclickpost:fix/logging-handler-recursion-deadlock
Mar 13, 2026

Conversation

@deepanshuclickpost (Contributor) commented Mar 5, 2026

LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads.

This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs.
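The actual patch is not shown in this thread, so the following is only a minimal sketch of a thread-local re-entrancy guard as described above (the final commit message says the merged version uses a ContextVar instead); the class and method names here are illustrative, not the real contrib-repo code:

```python
import logging
import threading


class GuardedLoggingHandler(logging.Handler):
    """Sketch of a recursion-guarded handler (names are hypothetical)."""

    def __init__(self):
        super().__init__()
        self._emit_guard = threading.local()  # one 'active' flag per thread

    def emit(self, record: logging.LogRecord) -> None:
        # If emit() is already running on this thread, a warning logged
        # during translation has re-entered the handler: skip it instead
        # of re-acquiring the handler lock and deadlocking.
        if getattr(self._emit_guard, "active", False):
            return
        self._emit_guard.active = True
        try:
            self._translate(record)  # stand-in for the SDK's translation step
        finally:
            self._emit_guard.active = False

    def _translate(self, record: logging.LogRecord) -> None:
        # Placeholder; the real method converts the record to an OTel LogRecord
        # and may itself emit internal warnings.
        pass
```

Because the flag lives in `threading.local()`, only same-thread re-entry is suppressed; concurrent emits from other threads still serialize normally on the handler lock.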

Fixes open-telemetry/opentelemetry-python#3858
Relates to open-telemetry/opentelemetry-python#3560
Relates to open-telemetry/opentelemetry-python#4585

Description

As suggested by @xrmx in opentelemetry-python#4954, opening this fix against the contrib repo since LoggingHandler in opentelemetry-sdk is deprecated.

LoggingHandler.emit() can deadlock when _translate() triggers internal logging. The call chain is:

  1. emit() → _translate() → BoundedAttributes() → _clean_extended_attribute()
  2. _clean_extended_attribute() finds a non-primitive attribute and calls _logger.warning("Invalid type...")
  3. This warning re-enters the logging handler chain on the same thread
  4. The handler tries to acquire the same lock already held by step 1
  5. Deadlock — the thread waits for itself, lock is never released

Once one thread deadlocks, every other thread that attempts to log blocks behind the same lock. In production (Gunicorn with 4 workers × 5 threads), this caused 100% request failure (Nginx 499/504) and complete service outage.

Thread dump captured via py-spy during production outage confirmed all 20 request threads stuck at acquire (logging/__init__.py:927):

acquire (logging/__init__.py:927)                       ← STUCK
warning (logging/__init__.py:1501)                      ← OTel internal warning
_clean_extended_attribute (attributes/__init__.py:209)
_translate (_logs/_internal/__init__.py:637)
emit (_logs/_internal/__init__.py:656)                  ← OTel LoggingHandler
handle (logging/__init__.py:978)                        ← Lock first acquired here

While #4808 (in 1.39.0) reduces the most common trigger by converting unknown types to str(), the deadlock can still occur via the invalid key path (_logger.warning("invalid key...")) and when str(value) itself raises. This recursion guard protects against all current and future _logger.warning() calls in the pipeline as a defense-in-depth measure.

Fixes open-telemetry/opentelemetry-python#3858

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

4 new unit tests added in test_logging_handler_recursion.py:

  • test_recursive_emit_is_skipped — verifies guard blocks recursive calls
  • test_normal_emit_works — verifies normal logging is not affected
  • test_guard_is_reset_after_emit — verifies guard resets after emit completes
  • test_guard_is_thread_local — verifies guard on one thread doesn't block other threads
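The test file itself is not shown here, but the thread-locality property the last test name refers to can be sketched in a few lines (variable names are hypothetical, not the PR's code): a flag set in `threading.local()` on one thread is simply absent on every other thread, so a guard tripped on thread A can never block thread B.

```python
import threading

guard = threading.local()
guard.active = True  # guard "tripped" on the main thread

seen_on_worker = []


def worker():
    # A new thread gets a fresh threading.local namespace, so the
    # 'active' attribute set on the main thread does not exist here.
    seen_on_worker.append(getattr(guard, "active", False))


t = threading.Thread(target=worker)
t.start()
t.join()

assert seen_on_worker == [False]  # worker thread is unaffected
assert guard.active is True       # main thread's flag is untouched
```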

Does This PR Require a Core Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@linux-foundation-easycla Bot commented Mar 5, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@deepanshu406 (Contributor) commented:

@DylanRussell Fixed the import sorting per ruff lint rules. Could you approve the CI workflow run? Thanks!

@xrmx xrmx moved this to Ready for review in Python PR digest Mar 5, 2026
@deepanshuclickpost deepanshuclickpost requested a review from a team as a code owner March 5, 2026 16:30
@xrmx (Contributor) commented Mar 6, 2026

@deepanshuclickpost Please run tox -e precommit

@tammy-baylis-swi tammy-baylis-swi moved this from Ready for review to Approved PRs in Python PR digest Mar 6, 2026
@deepanshuclickpost deepanshuclickpost force-pushed the fix/logging-handler-recursion-deadlock branch 2 times, most recently from 219944b to 2cc9e2f Compare March 7, 2026 04:48
@herin049 (Contributor) commented Mar 9, 2026

I'm thinking we might want to explicitly add a StreamHandler to the internal logger instead of leaving it with no handlers, just in case a user unknowingly sets logging.lastResort to the OpenTelemetry handler (which would trigger the deadlock again). Let me know what you think.

@deepanshuclickpost deepanshuclickpost force-pushed the fix/logging-handler-recursion-deadlock branch from 2cc9e2f to e84b623 Compare March 9, 2026 02:02
@deepanshu406 (Contributor) commented:

@herin049 CI is awaiting workflow approval. Could someone please trigger it? All review feedback has been addressed.

Thanks!

@deepanshu406 (Contributor) commented:

@xrmx This PR has 3 approvals and all review feedback has been addressed. Could you please approve the CI workflows so the checks can run? Thanks!

@deepanshu406 (Contributor) commented:

@xrmx 784 checks have passed with 3 approvals. The remaining 94 are awaiting workflow approval, likely because the branch is out of date (it can still be cleanly merged). Could you please merge when you get a chance? Thanks!

LoggingHandler.emit() can deadlock when _translate() triggers internal
logging (e.g., _clean_extended_attribute calling _logger.warning() for
invalid attribute types). The warning re-enters the handler chain on the
same thread, creating infinite recursion that prevents the handler lock
from ever being released, blocking all other threads.

This adds a ContextVar-based guard that detects re-entrant calls to
emit() on the same thread and skips the recursive call, preventing the
deadlock while preserving all application logs.

Fixes open-telemetry/opentelemetry-python#3858
Relates to open-telemetry/opentelemetry-python#3560
Relates to open-telemetry/opentelemetry-python#4585
@deepanshuclickpost deepanshuclickpost force-pushed the fix/logging-handler-recursion-deadlock branch from e84b623 to 82705b8 Compare March 12, 2026 10:44
@deepanshu406 (Contributor) commented:

@xrmx Rebased to resolve conflicts. Could you please re-approve and merge? All feedback addressed. Thanks!

@deepanshu406 (Contributor) commented:

@pmcollins Could you please approve the CI workflows so the checks can run?

Thanks!

@deepanshu406 (Contributor) commented:

@pmcollins @xrmx All checks have passed. Can we merge this pull request? Let me know if anything else is required.

Thanks!

@xrmx xrmx enabled auto-merge (squash) March 13, 2026 08:29
@xrmx xrmx merged commit e1fbc0d into open-telemetry:main Mar 13, 2026
848 checks passed
@github-project-automation github-project-automation Bot moved this from Approved PRs to Done in Python PR digest Mar 13, 2026


Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HTTP Worker Thread Blocked and Invalid Type Warning from Opentelemetry _logs

7 participants