Skip to content

Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures#45629

Merged
tavplubix merged 28 commits intoClickHouse:masterfrom
arthurpassos:fix_cares_crash
Feb 8, 2023
Merged

Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures#45629
tavplubix merged 28 commits intoClickHouse:masterfrom
arthurpassos:fix_cares_crash

Conversation

@arthurpassos
Copy link
Copy Markdown
Contributor

@arthurpassos arthurpassos commented Jan 25, 2023

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

A couple of seg faults have been reported around c-ares. All of the recent stack traces observed fail on inserting into std::unodered_set<>. I believe I have found the root cause of this, it seems to be unprocessed queries. Prior to this PR, CH calls poll to wait on the file descriptors in the c-ares channel. According to the poll docs, a negative return value means an error has ocurred. Because of this, we would abort the execution and return failure. The problem is that poll will also return a negative value if a system interrupt occurs. A system interrupt does not mean the processing has failed or ended, but we would abort it anyways because we were checking for negative values. Once the execution is aborted, the whole stack is destroyed, which includes the std::unordered_set<std::string> passed to the void * parameter of the c-ares callback. Once c-ares completed the request, the callback would be invoked and would access an invalid memory address causing a segfault.

Solution was to check for EINTR == errno (errno is both thread-safe and thread-local, so I think it's ok to check it.) and retry if that was the case. If it was an actual error, call ares_cancel to cancel pending queries and then abort execution (similar to what libcurl does). Calling ares_cancel will make sure pending requests are canceled and the callback will not be executed with dangling references.

libcurl seems to have at least two wait_resolve/ failure recovery methods depending on build options, both of these make use of ares_cancel. See ref1 and ref2.

reverseDNSQuery function was added as a tool to debug this problem. The test implemented with it is able to reproduce the issue, so I think it's a good idea to keep it. Plus, it can also be used for reverse DNS querying lol.

The funny thing is that it was pretty easy to reproduce this issue with this function, but neither my multithreaded unit tests or scripts to make an "authentication storm" was able to crash CH. I believe unit tests were not able to repro the crash because it was testing only those classes and there was no system interrupts/ bigger failures to cause the "premature" abortion. Authentication storm, on the other hand, had system interrupts and CH was running normally, but it would test against a local DNS instance running in my machine, so the latency was very low. Maybe it was too low for an interrupt to happen?

To be on the super safe side, there are two more actions that we could take:

  1. Unconditionally cancel c-ares requests after processing.
  2. Do not use stack variables as arguments to the callback. Instead, implement some sort of pending request list that we can add and remove based on an ID or something.

I want to believe we have found the issue and there is no need for the above, but let me know what you guys think.

Documentation entry for user-facing changes

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-bugfix Pull request with bugfix, not backported by default label Jan 25, 2023
@arthurpassos
Copy link
Copy Markdown
Contributor Author

@tavplubix Hi again :). Can you add the label can be tested so I can have CI running?

@den-crane den-crane added can be tested Allows running workflows for external contributors labels Jan 26, 2023
@tavplubix tavplubix self-assigned this Jan 26, 2023
@arthurpassos arthurpassos changed the title cancel ares failed requests, listen to POLLRDNORM and retry on EINTR Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures Jan 26, 2023
@arthurpassos
Copy link
Copy Markdown
Contributor Author

Right now reverseDNSQuery function is behaving weirdly. It seems like it's being called twice, all return values are duplicated:

arthur :) select reverseDNSQuery('127.0.0.1')

SELECT reverseDNSQuery('127.0.0.1')

Query id: 306a56e9-4085-4e5e-80ac-dfe69640f3fc

┌─reverseDNSQuery('127.0.0.1')─┐
│ ['localhost']                │
└──────────────────────────────┘
┌─reverseDNSQuery('127.0.0.1')─┐
│ ['localhost']                │

Debugging it, it's indeed being called twice by different stack traces.

@tavplubix
Copy link
Copy Markdown
Member

Wow, nice catch and a really good explanation, thank you!

The funny thing is that it was pretty easy to reproduce this issue with this function, but neither my multithreaded unit tests or scripts to make an "authentication storm" was able to crash CH. I believe unit tests were not able to repro the crash because it was testing only those classes and there was no system interrupts/ bigger failures to cause the "premature" abortion. Authentication storm, on the other hand, had system interrupts and CH was running normally, but it would test against a local DNS instance running in my machine, so the latency was very low. Maybe it was too low for an interrupt to happen?

I guess it's because of the query profiler. It sends SIGUSR1/SIGUSR2 once per second to threads that are processing some queries. So it's much easier to trigger EINTR in a query thread than in a background thread.

Comment on lines +34 to +37
if (result.empty())
{
std::cout<<"failed\n";
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to assert for here as DNS requests might actually fail or one of the IP addresses actually become invalid in the future. Maybe it's enough to know it didn't crash and reached the end?

Just a note: this did not reproduce the issue, but it's good to have it.

@arthurpassos
Copy link
Copy Markdown
Contributor Author

Looking at CI results, I see two problems regardin DNS:

  1. Stateless test took too long. That's fine, we'll either remove it or move to integration. Plus, I can decrease the number of requests.
  2. It seems like unit test tsan & ubsan caught something, will take a look later.

robot-clickhouse added a commit that referenced this pull request Feb 9, 2023
…system interrupts to prevent callbacks with dangling references and premature resolution failures
robot-clickhouse added a commit that referenced this pull request Feb 9, 2023
…ystem interrupts to prevent callbacks with dangling references and premature resolution failures
robot-clickhouse added a commit that referenced this pull request Feb 9, 2023
…ystem interrupts to prevent callbacks with dangling references and premature resolution failures
robot-clickhouse added a commit that referenced this pull request Feb 9, 2023
…system interrupts to prevent callbacks with dangling references and premature resolution failures
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Feb 9, 2023
Enmk pushed a commit to Altinity/ClickHouse that referenced this pull request Feb 9, 2023
Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
Enmk added a commit to Altinity/ClickHouse that referenced this pull request Feb 10, 2023
tavplubix added a commit that referenced this pull request Feb 10, 2023
Backport #45629 to 22.12: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
tavplubix added a commit that referenced this pull request Feb 10, 2023
Backport #45629 to 23.1: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
tavplubix added a commit that referenced this pull request Feb 16, 2023
Backport #45629 to 22.8: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
tavplubix added a commit that referenced this pull request Feb 16, 2023
Backport #45629 to 22.11: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
yokofly added a commit to timeplus-io/proton that referenced this pull request Oct 18, 2023
…nt callbacks with dangling references and premature resolution failures

ClickHouse/ClickHouse#45629
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants