Improve the opentelemetry tracing context propagation across threads #39010
Improve the opentelemetry tracing context propagation across threads #39010rschu1ze merged 49 commits intoClickHouse:masterfrom
Conversation
|
There might be conflicts once #38965 is merged |
No worries. Your PR if merged will make that one irrelevant. |
|
This PR fixes #39122? |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Signed-off-by: Frank Chen <[email protected]>
|
Thanks. In the tests which finished already, there is still a problem Would you like to take a look? |
|
Hi @rschu1ze I noticed this problem. It's a little bit strange why the assert works. Do you know how to run such test on local to reproduce it? |
|
To run one of the failing stateless test, 1. compile a debug build and 2. follow the steps described here, e.g. |
|
This problem was due to recursive calls of |
|
Thank you @rschu1ze for your detailed review comments. It's a nice journey to work with you on this PR. |
|
My pleasure 😉 |
|
Seems like some exception safety issues are still there: |
|
The line numbers in the stack are a bit strange. @FrankChen021 Do you have an idea what could be the cause for the issues? (had a quick look but couldn't find an obvious error) |
Sorry, I didn't notice the comment, let me check. |
|
In stress tests we have fail injection with some probability for all allocations. To check that we are exception safe with bad_alloc. Previously we had issue with addAttributes when it used two arrays instead of a map. |
|
Is the exception thrown because of memory limit? |
|
ClickHouse/src/Common/ThreadPool.cpp Line 287 in 7c4f42d I got why it crashes. The exception is thrown during the construction, but the construction is out of a try-catch scope. There was a try-catch around the I'm thinking revert the change, to move the call of Does this make sense? @rschu1ze |
|
Let me note that the issue in #39010 (comment) happens in an ubsan build but doesn't look like undefined behavior. For undefined behavior, I'd expect some So an exception was thrown during Code in "Common/ThreadPool.cpp": DENY_ALLOCATIONS_IN_SCOPE;
...
ALLOW_ALLOCATIONS_IN_SCOPE;
DB::OpenTelemetry::TracingContextHolder thread_trace_context("ThreadPool::worker()", parent_thead_trace_context);
try {
...
}
catch(...)
{
...
}So allocations below the inner allocation scope are generally allowed but they may still fail, like here. In theory, no bad things should happen. But if you look at the code prior to this PR, you find that exceptions could only be thrown from the So: The suggestion would be to check carefully that the ctor of |
Yes, that's why I move the |
|
@FrankChen021 There is one more failure in stress test. |
|
@qoega on the master branch? Let me check. |
|
Hello @qoega From the log, I can't find the query that lead to the problem. But what I got now is from the stack in the gdb.log As you can see from the stack level 23, it's from the destructor of The ClickHouse/src/Server/TCPHandler.cpp Lines 417 to 420 in 2aedd41 When it exists the while loop, the unique_ptr of TracingContextHolder which is initialized in the while-loop at line ClickHouse/src/Server/TCPHandler.cpp Lines 248 to 252 in 2aedd41 will be destroyed. Since this holder holds the root span of the tracing context on current thread, its destructor destroys the tracing context kept on thread local. But the query_span, initialized in In my opinion, when we find that the connection is closed at line 418, we should first reset all necessary resource first before exiting the while-loop, like what line 427 to 429 does. ClickHouse/src/Server/TCPHandler.cpp Lines 427 to 429 in 2aedd41 Since we explicit destroy the resource on If we don't do that, I have to define Correct me if I'm wrong. |
|
BTW, HTTPHandler does not have such problem, it first creates tracing context and then calls the For GRPCServer, there's no such problem although it calls the same |
|
I think the problem is clear based on the proof below. Below is the last line in the log that prints TCP connection is closed by client. From the line above we know that the query id is So the query that triggers such problem is ClickHouse/src/Interpreters/executeQuery.cpp Lines 531 to 536 in 583c530 And the ClickHouse/src/Interpreters/executeQuery.cpp Lines 838 to 842 in 583c530 So, the ClickHouse/src/Server/TCPHandler.cpp Lines 410 to 420 in 2aedd41 I will open a PR to address the problem that the span is not explicitly closed if it's not in process list. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve the tracing context propagation across threads
This PR improves the open-telemetry tracing context propagation to solves some bugs in essence.
Problem 1
Let's take a look at the spans logs for
SELECT 1query, the span logs are ordered by start_time_us and are shown as belowFrom the log we know that:
queryis the root span of this tracingTCPHandleris the child span of thequeryHowever, the
TCPHandleris executed in the first place and thenquery, and also duration ofTCPHandlertakes longer than thequeryspan. This is totally contradict to what the logs show.The cause of above problem is that current tracing context initialization depends on
ThreadStatus::attachQueryorContext::setCurrentQueryIdin HTTPHandler/TCPHandler. However the initialization of context is not used as the root span, the root span is used for thequeryspan which is created atexecuteQueryImplwhich is executed after thesetCurrentQueryId.Problem 2
Another problem caused by current initialization on thread by calling
attachQueryis that, for forked threads from HTTPHandler or TCPHandler, thisattachQueryit does not correctly get the right parent span but always uses the span info holding at the theContextlevel.ClickHouse/src/Interpreters/ThreadStatusExt.cpp
Line 381 in 1cf01bb
This does not reflect correct parent-child relationship and leads all span logs in child threads to be the child of TCPHandler or HTTPHandler.
Below is an example of INSERT with parallel processing of two MVs.

All 3

QueryPushPipeExthreads from the log are looked like forked fromquery, but in fact, the latter twoQueryPushPipeExare forked by the firstQueryPushPipeExwhich is forked byTCPHandlerthread as shown belowProblem 3
Relying on
attachQueryto build context on one thread is not always the right choice. Another example is for async processing of distributed table insert. These async insert are performed in background threads, it has no query context to attach to the threads that executes the async processing. So the span logs of async processing can't be logged. A previous PR raised by me only solves the link span logs on two nodes, but has no ability to add span logs on one node starting from async processing. This would lose some key information if the async processing fails.Root Cause
The root cause is that current implementation uses different variables to hold tracing context, one is
Context:: query_trace_contextand the other isThreadStatus::thread_trace_context. The span logs created byOpenTelemetrySpanHolderare initialised from the latter one, however thread level context is initialised from the former one. This causes the chaos.Solution
The rationale of the solution is simple, just keep one thread-local variable to hold tracing context on current thread, when a child thread is forked, the parent thread is responsible to copy the tracing context to the child thread.
This method is a common and widely used by other tracing propagation tools on Java platform.
Changes
ThreadStatus, and all codes are put into a separated fileOpenTelemetryTraceContext.cppOpenTelemetryThreadTraceContextis created to hold the context on one thread, it mainly holds thetrace_id, currentspan_idand a weak reference to the system tableopentelemetry_span_log. It should be initialized at the beginning execution of one thread.current_thread_trace_contextis defined and public static methodOpenTelemetryThreadTraceContext::currentis provided to access context on current threadThreadPool.cppDirectoryMonitor.cpp). Also there're some improvements on the span logs in the following processing(SeeConnection.cppandRemoteInserter.cpp,DistributedSink.cpp).InterpreterFactory::get()is removedexecuteQuery.cppOther Info
This patch has been running on our clusters(based on 21.8.4) for more than half a year, all changes proves to be correct.