Skip to content

Fix segfault in opentelemetry_span_log#38965

Closed
tavplubix wants to merge 4 commits intomasterfrom
fix_opentelemetry_span_log
Closed

Fix segfault in opentelemetry_span_log#38965
tavplubix wants to merge 4 commits intomasterfrom
fix_opentelemetry_span_log

Conversation

@tavplubix
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed segfault in system.opentelemetry_span_log

Fixes #38221
Fixes #38402
cc: @qoega

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Jul 7, 2022
@tavplubix tavplubix changed the title Fix sigfault in opentelemetry_span_log Fix segfault in opentelemetry_span_log Jul 7, 2022
@qoega
Copy link
Copy Markdown
Member

qoega commented Jul 7, 2022

It does not fix it at all.
The code that fails in BC check is removed in master, but you return it

From gdb stack attribute_names, attribute_values

2022-07-07 16:34:30         elem = @0x7f9136d300b0: {<DB::OpenTelemetrySpan> = {trace_id = {t = {items = {14559789254311522630, 5818322339355167126}}}, span_id = 6552067928686582189, parent_span_id = 10548528370291982616, operation_name = {static __short_mask = 128, static __long_mask = 9223372036854775808, __r_ = {<std::__1::__compressed_pair_elem<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep, 0, false>> = {__value_ = {{__l = {__data_ = 0x7f986e606540 "ExecutionThreadContext::executeTask() ExpressionTransform", __size_ = 57, __cap_ = 9223372036854775872}, __s = {__data_ = "@e`n\230\177\000\000\071\000\000\000\000\000\000\000@\000\000\000\000\000", {<std::__1::__padding<char, 1ul>> = {<No data fields>}, __size_ = 128 '\200'}}, __r = {__words = {140292663567680, 57, 9223372036854775872}}}}}, <std::__1::__compressed_pair_elem<std::__1::allocator<char>, 1, true>> = {<std::__1::allocator<char>> = {<std::__1::__non_trivial_if<true, std::__1::allocator<char> >> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, static npos = 18446744073709551615}, start_time_us = 1657200613577663, finish_time_us = 1657200613577784, attribute_names = {<std::__1::vector<DB::Field, AllocatorWithMemoryTracking<DB::Field> >> = {__begin_ = 0x7f98527ace00, __end_ = 0x7f98527ace38, __end_cap_ = {<std::__1::__compressed_pair_elem<DB::Field*, 0, false>> = {__value_ = 0x7f98527ace38}, <std::__1::__compressed_pair_elem<AllocatorWithMemoryTracking<DB::Field>, 1, true>> = {<AllocatorWithMemoryTracking<DB::Field>> = {<No data fields>}, <No data fields>}, <No data fields>}}, <No data fields>}, attribute_values = {<std::__1::vector<DB::Field, AllocatorWithMemoryTracking<DB::Field> >> = {__begin_ = 0x0, __end_ = 0x0, __end_cap_ = {<std::__1::__compressed_pair_elem<DB::Field*, 0, false>> = {__value_ = 0x0}, <std::__1::__compressed_pair_elem<AllocatorWithMemoryTracking<DB::Field>, 1, true>> = {<AllocatorWithMemoryTracking<DB::Field>> = {<No data fields>}, <No data fields>}, <No data fields>}}, <No data fields>}}, <No data fields>}

We need just to remove incorrect code from 22.6 with backport of
#38814

Your 'fix' returns incorrect code

@tavplubix
Copy link
Copy Markdown
Member Author

Hm, I though that incorrect code was initially introduced in #37837. This PR reverts it as well.
I have not seen similar segfaults before recent changes in opentelemetry. So I want to return code to previous state that seemed to be correct (at least it did not cause tests failures).

@novikd novikd self-assigned this Jul 7, 2022
@qoega
Copy link
Copy Markdown
Member

qoega commented Jul 7, 2022

Hm, I though that incorrect code was initially introduced in #37837. This PR reverts it as well.

That PR just added more places where spans are collected. So it just increased probability of it to fail. Master is not failing in stress and other tests. I run fuzzer for several days of a version with enabled Otel for all queries and it is not failing. So revert is doing worse than backport of Map usage instead of keys/values.
I actually fixed one use case when number was stored instead of a String value - it was found when Map was used explicitly.

@tavplubix
Copy link
Copy Markdown
Member Author

Master is not failing in stress and other tests.

Master (22.7) and 22.6 contain almost the same set of changes around opentelemetry (except #38814) and 22.6 crashes in BC check. Yes, it's a mystery why 22.7 with almost the same code does not crash and it has to be investigated.

I actually fixed one use case when number was stored instead of a String value

But what was the problem with storing numbers? All Fields were converted to Strings, so it should has worked fine with numbers. Why do you expect that changing Array to Map will fix the issue?

Also revert of the initial PR fixes #38402 as well.

@qoega
Copy link
Copy Markdown
Member

qoega commented Jul 7, 2022

@tavplubix
Copy link
Copy Markdown
Member Author

This line was incorrect before: https://github.com/ClickHouse/ClickHouse/pull/38814/files#diff-ff8ad4aed4cf07fe29cb5344d2ab79bc24efd97d054aa112c3a05b5ab853cc44L319

Please read my previous comment again. What exactly was incorrect? Why do you expect that it's fixed? I see that this line mixes strings and integers in one array of Fields (and it's really questionable practice). But all Fields are converted to Strings later, so it should not be a problem:

map[attr_idx] = Tuple{attribute_names[attr_idx], toString(attribute_values[attr_idx])};

Moreover, if you look at the logs in failed checks, you will see that segfault happens when attr_idx = 0. There is no way that exception code can be the first attribute. So the line you referred looks completely irrelevant.

@FrankChen021
Copy link
Copy Markdown
Contributor

Oh my god, I just rebased my branch to the latest which contains the #38814 commit to resolve some conflicts, and now it's reverting.

I don't know if the reverts in the PR is temporary or permanent because I'm going to submit a PR(which has some conflicts with these reverts and this PR) which introduces tracing context propagation across threads to resolve some bugs. My patch has been running on our clusters for more than half a year.

@filimonov
Copy link
Copy Markdown
Contributor

Oh my god, I just rebased my branch to the latest which contains the #38814 commit to resolve some conflicts, and now it's reverting.

I don't know if the reverts in the PR is temporary or permanent because I'm going to submit a PR(which has some conflicts with these reverts and this PR) which introduces tracing context propagation across threads to resolve some bugs. My patch has been running on our clusters for more than half a year.

Revert the revert commit in your branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

00974_query_profiler is flaky: queue is full for system log OpenTelemetrySpanLog Segfault in OpenTelemetrySpanLog

6 participants