Skip to content

Fix system.stack_trace when running in daemon mode#17630

Merged
alexey-milovidov merged 17 commits intoClickHouse:masterfrom
amosbird:stacktracefix
Dec 23, 2020
Merged

Fix system.stack_trace when running in daemon mode#17630
alexey-milovidov merged 17 commits intoClickHouse:masterfrom
amosbird:stacktracefix

Conversation

@amosbird
Copy link
Copy Markdown
Collaborator

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix empty system.stack_trace table when server is running in daemon mode.

Detailed description / Documentation draft:

None.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Nov 30, 2020
@alexey-milovidov alexey-milovidov self-assigned this Nov 30, 2020
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we can add a test after this fix.

@alexey-milovidov
Copy link
Copy Markdown
Member

Simply revert this commit: bdefa92

@alexey-milovidov
Copy link
Copy Markdown
Member

We also need something to suppress TSan.

@alexey-milovidov
Copy link
Copy Markdown
Member

We failed to suppress TSan on function level, now let's do it on target level:

set_source_files_properties(file PROPERTIES COMPILE_FLAGS option)

@alexey-milovidov
Copy link
Copy Markdown
Member

Sandbox error:

ResourceNotAvailable: {74f3cad3:8}: sandbox.agentr.errors.ResourceNotAvailable('Resource #1871059875 is not available at all sources [].',)

Will try again...

@alexey-milovidov
Copy link
Copy Markdown
Member

@alexey-milovidov
Copy link
Copy Markdown
Member

Did not help again... maybe I will try to do some magic. I really need this PR to be merged :)

@alexey-milovidov
Copy link
Copy Markdown
Member

Cannot reproduce TSan report on my machine...

@alexey-milovidov
Copy link
Copy Markdown
Member

Reproduced.

@alexey-milovidov
Copy link
Copy Markdown
Member

Fixed.

@alexey-milovidov
Copy link
Copy Markdown
Member

Not fixed?

@alexey-milovidov alexey-milovidov merged commit c72a480 into ClickHouse:master Dec 23, 2020
@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Dec 23, 2020

@amosbird I just read what TSan said.

First it said that we can read stack_trace variable while it's being written by the signal handler.
We ensure that it does not happen by:

  • mutex in a function that is sending signals;
  • skipping signals that come from other processes - theoretically it's not enough as signal can be send from other threads, though we don't do that;
  • synchronizing via pipe - but it's not enough as we read from pipe with timeout (100 ms) and it's possible that previous invocation of the signal handler is still in progress (it was checked for sequence_num then paused for a second) and it will write stack_trace when we already processing the next invocation.

So, TSan was right and I protected from this case with data_ready_num, so we cannot read stack_trace value that was not written yet.

Then TSan said that we can write to stack_trace variable from concurrent invocations of signal handlers. I protected from it with signal_latch, so no concurrent invocations of the signal handler possible.

PS. It's too complex for the two reasons:

  • it's not allowed to use any complicated synchronization in signal handlers - only atomics and read/write to file descriptor;
  • we pass data via global variables.

The latter is not necessarily - what if don't use global variables and simply pass all data via pipe?
If we will do it in single write operation, it will not even require to protect from concurrent invocations of the signal handler.
This method is already used in BaseDaemon for fault signals.

The only difficulty is that we read from pipe with timeout, so we have to implement ReadBufferFromFileDescriptorWithTimeout.
Maybe the code will be more simple.

@amosbird
Copy link
Copy Markdown
Collaborator Author

I don't understand why changing expected_pid from global const to mutable leads to this TSan mess... Isn't it related to stack_trace variable instead?

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Dec 23, 2020

I don't understand why changing expected_pid from global const to mutable leads to this TSan mess...

I'm only about races on stack_trace variable.
Changes to expected_pid look irrelevant for TSan.

alexey-milovidov added a commit that referenced this pull request Dec 23, 2020
Backport #17630 to 20.12: Fix system.stack_trace when running in daemon mode
alexey-milovidov added a commit that referenced this pull request Dec 23, 2020
Backport #17630 to 20.11: Fix system.stack_trace when running in daemon mode
alexey-milovidov added a commit that referenced this pull request Dec 23, 2020
Backport #17630 to 20.10: Fix system.stack_trace when running in daemon mode
alexey-milovidov added a commit that referenced this pull request Feb 19, 2021
Backport #17630 to 20.8: Fix system.stack_trace when running in daemon mode
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants