Add recursion guard in LoggingHandler.emit to prevent deadlock #4954

Closed

deepanshuclickpost wants to merge 2 commits into open-telemetry:main from deepanshuclickpost:fix/logging-handler-recursion-deadlock

Conversation

@deepanshuclickpost

LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads.

This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs.

Fixes #3858
Relates to #3560, #2701, #4585

Description

LoggingHandler.emit() can deadlock when _translate() triggers internal logging. The call chain is:

  1. emit() → _translate() → BoundedAttributes() → _clean_extended_attribute()
  2. _clean_extended_attribute() finds a non-primitive attribute and calls _logger.warning("Invalid type...")
  3. This warning re-enters the logging handler chain on the same thread
  4. The handler tries to acquire the same lock already held by step 1
  5. Deadlock: the thread waits for itself, and the lock is never released (sketched in miniature below)
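
The failure mode in miniature, sketched with a plain non-reentrant lock (names are illustrative, not the actual logging internals):

    import threading

    lock = threading.Lock()  # non-reentrant, like the lock described above

    def handle():
        with lock:            # step 1: the first acquisition succeeds
            emit_warning()    # steps 2-3: internal logging re-enters on this thread

    def emit_warning():
        with lock:            # step 4: same thread, same lock -> blocks forever
            pass

    # handle() is deliberately not invoked here: calling it would hang,
    # because threading.Lock does not allow re-entrant acquisition.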

Once one thread deadlocks, every other thread that attempts to log blocks behind the same lock. In production (Gunicorn with 4 workers × 5 threads), this caused 100% request failure (Nginx 499/504) and complete service outage.

A thread dump captured via py-spy during the production outage confirmed that all 20 request threads were stuck at acquire (logging/__init__.py:927):

acquire (logging/__init__.py:927)                       ← STUCK
warning (logging/__init__.py:1501)                      ← OTel internal warning
_clean_extended_attribute (attributes/__init__.py:209)
_translate (_logs/_internal/__init__.py:637)
emit (_logs/_internal/__init__.py:656)                  ← OTel LoggingHandler
handle (logging/__init__.py:978)                        ← Lock first acquired here

This adds a threading.local() recursion guard in LoggingHandler.emit(). If emit() is already running on the current thread, the recursive call returns immediately, preventing the deadlock while preserving all application logs.
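
As a minimal sketch of the idea (shown as a subclass for illustration only; the actual change patches LoggingHandler.emit in place, and the attribute name active is made up):

    import threading

    from opentelemetry.sdk._logs import LoggingHandler

    class GuardedLoggingHandler(LoggingHandler):
        _guard = threading.local()  # one independent flag per thread

        def emit(self, record):
            # A set flag means emit() is already running on this thread, so
            # this call came from OTel's own internal logging. Return instead
            # of re-acquiring the lock the outer call already holds.
            if getattr(self._guard, "active", False):
                return
            self._guard.active = True
            try:
                super().emit(record)
            finally:
                # Always reset so the next top-level emit() proceeds normally.
                self._guard.active = False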

Fixes #3858

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

4 new unit tests added in test_logging_handler.py:

  • test_recursive_emit_is_skipped — verifies guard blocks recursive calls
  • test_normal_emit_works — verifies normal logging is not affected
  • test_guard_is_reset_after_emit — verifies guard resets after emit completes
  • test_guard_is_thread_local — verifies the guard on one thread doesn't block other threads (see the sketch below)
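
A hypothetical sketch of the thread-locality property the last test checks (not the actual test code):

    import threading

    def test_guard_is_thread_local_sketch():
        guard = threading.local()
        guard.active = True  # set on the main thread

        seen_on_worker = []

        def worker():
            # A fresh thread sees no 'active' attribute at all.
            seen_on_worker.append(getattr(guard, "active", False))

        t = threading.Thread(target=worker)
        t.start()
        t.join()

        assert getattr(guard, "active", False) is True  # still set here
        assert seen_on_worker == [False]                # never visible there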

Does This PR Require a Contrib Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

deepanshuclickpost requested a review from a team as a code owner on March 5, 2026 at 07:18
@linux-foundation-easycla

linux-foundation-easycla Bot commented Mar 5, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

Contributor

xrmx left a comment

This LoggingHandler is deprecated; a copy of it now lives in the opentelemetry-python-contrib repository, inside the opentelemetry-instrumentation-logging package.

@xrmx
Contributor

xrmx commented Mar 5, 2026

@deepanshuclickpost on which version are you seeing the issue? 1.39.0 should greatly reduce the chance of hitting the issue after #4808

@deepanshuclickpost
Author

deepanshuclickpost commented Mar 5, 2026

> @deepanshuclickpost on which version are you seeing the issue? 1.39.0 should greatly reduce the chance of hitting the issue after #4808

We're on version 1.37.0. Our production thread dump shows the deadlock at _clean_extended_attribute line 209, which is the except TypeError → _logger.warning("Attribute %s: %s", key, exception) path.
I see #4808 reduces this by converting to str() first, which would prevent most cases. However, the deadlock can still occur if str() raises, and the invalid key path on line 203 (_logger.warning("invalid key...")) is also unprotected.
A recursion guard in emit() would be a defense-in-depth measure against all _logger.warning() calls. Should I open this against opentelemetry-python-contrib in opentelemetry-instrumentation-logging instead?
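
For illustration, the "str() raises" caveat is easy to reproduce with a contrived value (class name made up):

    class Unstringable:
        def __str__(self):
            raise TypeError("refusing to stringify")

    # Even after #4808's str() conversion, a value like this still reaches
    # the except-TypeError branch, whose _logger.warning() call re-enters
    # the handler chain as described above.
    try:
        str(Unstringable())
    except TypeError as exc:
        print(f"str() raised: {exc}")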

@xrmx
Contributor

xrmx commented Mar 5, 2026

My suggestion would be to update to 1.39.1 (or even 1.40.0), so that you at least reduce how often you hit this, and then open a PR in contrib for the logging instrumentation.

@deepanshuclickpost
Author

@xrmx
I've opened the PR against the contrib repo as suggested: open-telemetry/opentelemetry-python-contrib#4302

@xrmx
Contributor

xrmx commented Mar 6, 2026

Thanks @deepanshuclickpost , closing this.

@xrmx xrmx closed this Mar 6, 2026

Development

Successfully merging this pull request may close these issues.

HTTP Worker Thread Blocked and Invalid Type Warning from Opentelemetry _logs
