Add recursion guard in LoggingHandler.emit to prevent deadlock#4954
Add recursion guard in LoggingHandler.emit to prevent deadlock#4954deepanshuclickpost wants to merge 2 commits intoopen-telemetry:mainfrom
Conversation
LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads. This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs. Fixes open-telemetry#3858 Relates to open-telemetry#3560, open-telemetry#2701, open-telemetry#4585
xrmx
left a comment
There was a problem hiding this comment.
This LoggingHandler is deprecated, a copy of this has been copied into opentelemetry-python-contrib repository inside the opentelemetry-instrumentation-logging package
|
@deepanshuclickpost on which version are you seeing the issue? 1.39.0 should greatly reduce the chance of hitting the issue after #4808 |
We're on version 1.37.0. Our production thread dump shows the deadlock at _clean_extended_attribute line 209, which is the except TypeError → _logger.warning("Attribute %s: %s", key, exception) path. |
|
@xrmx |
|
My suggestion would be to update to 1.39.1 (or even 1.40.0) so at least you reduce the occurrences of this for you and then open a PR in contrib for the logging instrumentation. |
|
@xrmx |
|
Thanks @deepanshuclickpost , closing this. |
LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads.
This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs.
Fixes #3858
Relates to #3560, #2701, #4585
Description
LoggingHandler.emit()can deadlock when_translate()triggers internal logging. The call chain is:emit()→_translate()→BoundedAttributes()→_clean_extended_attribute()_clean_extended_attribute()finds a non-primitive attribute and calls_logger.warning("Invalid type...")Once one thread deadlocks, every other thread that attempts to log blocks behind the same lock. In production (Gunicorn with 4 workers × 5 threads), this caused 100% request failure (Nginx 499/504) and complete service outage.
Thread dump captured via
py-spyduring production outage confirmed all 20 request threads stuck atacquire (logging/__init__.py:927):This adds a
threading.local()recursion guard inLoggingHandler.emit(). Ifemit()is already running on the current thread, the recursive call returns immediately, preventing the deadlock while preserving all application logs.Fixes #3858
Type of change
How Has This Been Tested?
4 new unit tests added in
test_logging_handler.py:test_recursive_emit_is_skipped— verifies guard blocks recursive callstest_normal_emit_works— verifies normal logging is not affectedtest_guard_is_reset_after_emit— verifies guard resets after emit completestest_guard_is_thread_local— verifies guard on one thread doesn't block other threadsDoes This PR Require a Contrib Repo Change?
Checklist: