Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures by arthurpassos · Pull Request #45629 · ClickHouse/ClickHouse

arthurpassos · 2023-01-25T22:19:13Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Several segfaults have been reported around c-ares. All of the recent stack traces fail while inserting into std::unordered_set<>. I believe I have found the root cause: unprocessed queries. Prior to this PR, CH called poll to wait on the file descriptors in the c-ares channel. According to the poll docs, a negative return value means an error has occurred, so we would abort the execution and return failure. The problem is that poll also returns a negative value when it is interrupted by a signal. An interrupt does not mean that processing has failed or ended, but we aborted anyway because we only checked for a negative value. Once the execution is aborted, the whole stack is destroyed, including the std::unordered_set<std::string> passed as the void * parameter of the c-ares callback. When c-ares later completed the request, the callback was invoked and accessed an invalid memory address, causing a segfault.

The solution is to check whether errno == EINTR (errno is thread-local, so I think it's OK to check it) and retry if so. If it is an actual error, call ares_cancel to cancel pending queries and then abort execution (similar to what libcurl does). ares_cancel invokes the callback of every pending query with status ARES_ECANCELLED while the stack is still valid, so no callback can run later with dangling references.
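A minimal sketch of the retry logic described above. It is not the code from this PR: `waitForEvents`, `WaitResult` and `cancel_pending` are hypothetical names, and `cancel_pending` only stands in for a call such as `ares_cancel(channel)` so that the example compiles without c-ares.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <vector>

#include <poll.h>
#include <unistd.h>

enum class WaitResult { Ready, Timeout, Failed };

// Hypothetical helper: waits on `fds` and retries when poll() is interrupted by a signal.
// `cancel_pending` plays the role of ares_cancel(channel) and runs only on a real error.
WaitResult waitForEvents(std::vector<pollfd> & fds, int timeout_ms, const std::function<void()> & cancel_pending)
{
    assert(static_cast<bool>(cancel_pending));

    while (true)
    {
        const int ready = ::poll(fds.data(), static_cast<nfds_t>(fds.size()), timeout_ms);

        if (ready > 0)
            return WaitResult::Ready;
        if (ready == 0)
            return WaitResult::Timeout;
        if (errno == EINTR)
            continue; // a signal arrived; the wait did not fail (the retry restarts the full timeout)

        // Real error: cancel pending requests before the caller's stack goes away.
        cancel_pending();
        return WaitResult::Failed;
    }
}

// Small self-check on a pipe: first nothing to read (timeout), then one byte (ready).
bool selfCheck()
{
    int pipe_fds[2];
    if (::pipe(pipe_fds) != 0)
        return false;

    pollfd read_end{};
    read_end.fd = pipe_fds[0];
    read_end.events = POLLIN;
    std::vector<pollfd> fds{read_end};

    bool cancelled = false;
    auto cancel = [&cancelled] { cancelled = true; };

    const bool timed_out = waitForEvents(fds, 10, cancel) == WaitResult::Timeout;
    const bool wrote = ::write(pipe_fds[1], "x", 1) == 1;
    const bool became_ready = waitForEvents(fds, 1000, cancel) == WaitResult::Ready;

    ::close(pipe_fds[0]);
    ::close(pipe_fds[1]);
    return timed_out && wrote && became_ready && !cancelled;
}
```

The important part is the order of the checks: EINTR is handled before the generic error path, so cancellation only happens when the wait has really failed.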

libcurl seems to have at least two wait_resolve/failure recovery methods depending on build options, and both of them make use of ares_cancel. See ref1 and ref2.

The reverseDNSQuery function was added as a tool to debug this problem. The test implemented with it is able to reproduce the issue, so I think it's a good idea to keep it. Plus, it can also be used for reverse DNS querying lol.

The funny thing is that it was pretty easy to reproduce this issue with this function, but neither my multithreaded unit tests nor my scripts that create an "authentication storm" were able to crash CH. I believe the unit tests could not reproduce the crash because they exercised only those classes, and there were no system interrupts or bigger failures to cause the "premature" abort. The authentication storm, on the other hand, had system interrupts and CH was running normally, but it tested against a local DNS instance running on my machine, so the latency was very low. Maybe it was too low for an interrupt to happen?

To be on the super safe side, there are two more actions that we could take:

1. Unconditionally cancel c-ares requests after processing.
2. Do not use stack variables as arguments to the callback. Instead, implement some sort of pending request list that we can add to and remove from based on an ID or something.
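One way the second option could look, as a rough sketch only. `PendingRequests` and its methods are hypothetical and not part of this PR. The idea is that the callback receives an ID instead of a pointer to a stack variable, so a late callback for a request that was already given up finds nothing and does nothing.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical registry: results live here instead of on the requester's stack.
class PendingRequests
{
public:
    using Hosts = std::unordered_set<std::string>;

    // Registers a request and returns the ID to hand to the resolver callback.
    uint64_t add()
    {
        std::lock_guard<std::mutex> lock(mutex);
        const uint64_t id = next_id++;
        const bool inserted = requests.emplace(id, Hosts{}).second;
        assert(inserted);
        (void)inserted;
        return id;
    }

    // Called from the callback. Returns false when the request was already removed,
    // which is exactly the late-callback case that used to touch freed memory.
    bool addHost(uint64_t id, const std::string & host)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = requests.find(id);
        if (it == requests.end())
            return false;
        it->second.insert(host);
        return true;
    }

    // Called by the requester when it finishes or gives up.
    // Returns the hosts collected so far (empty for an unknown ID).
    Hosts remove(uint64_t id)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = requests.find(id);
        if (it == requests.end())
            return Hosts{};
        Hosts hosts = std::move(it->second);
        requests.erase(it);
        return hosts;
    }

private:
    std::mutex mutex;
    uint64_t next_id = 1;
    std::unordered_map<uint64_t, Hosts> requests;
};
```

With something like this, the ID would have to be passed through the void * argument of the callback, and the registry would have to outlive every request on the channel.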

I want to believe we have found the issue and there is no need for the above, but let me know what you guys think.

Documentation entry for user-facing changes

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

arthurpassos · 2023-01-25T22:20:19Z

@tavplubix Hi again :). Can you add the "can be tested" label so I can have CI running?

arthurpassos · 2023-01-26T11:25:34Z

Right now reverseDNSQuery function is behaving weirdly. It seems like it's being called twice, all return values are duplicated:

arthur :) select reverseDNSQuery('127.0.0.1')

SELECT reverseDNSQuery('127.0.0.1')

Query id: 306a56e9-4085-4e5e-80ac-dfe69640f3fc

┌─reverseDNSQuery('127.0.0.1')─┐
│ ['localhost']                │
└──────────────────────────────┘
┌─reverseDNSQuery('127.0.0.1')─┐
│ ['localhost']                │

While debugging, I confirmed that it is indeed being called twice, from different stack traces.

tavplubix · 2023-01-26T17:49:59Z

Wow, nice catch and a really good explanation, thank you!

> The funny thing is that it was pretty easy to reproduce this issue with this function, but neither my multithreaded unit tests nor my scripts that create an "authentication storm" were able to crash CH. I believe the unit tests could not reproduce the crash because they exercised only those classes, and there were no system interrupts or bigger failures to cause the "premature" abort. The authentication storm, on the other hand, had system interrupts and CH was running normally, but it tested against a local DNS instance running on my machine, so the latency was very low. Maybe it was too low for an interrupt to happen?

I guess it's because of the query profiler. It sends SIGUSR1/SIGUSR2 once per second to threads that are processing some queries. So it's much easier to trigger EINTR in a query thread than in a background thread.
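The effect of a signal on a blocked poll can be illustrated with a small standalone sketch. It assumes Linux and uses a one-shot SIGALRM timer in place of the profiler's SIGUSR1/SIGUSR2. The function name `pollIsInterruptedBySignal` is made up for the example.

```cpp
#include <cassert>
#include <cerrno>

#include <poll.h>
#include <signal.h>
#include <sys/time.h>

namespace
{
volatile sig_atomic_t signal_seen = 0;

void onSignal(int)
{
    signal_seen = 1;
}
}

// Returns true if poll() was interrupted by the timer signal and failed with EINTR.
bool pollIsInterruptedBySignal()
{
    struct sigaction action = {};
    struct sigaction previous = {};
    action.sa_handler = onSignal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESTART; // on Linux, poll() is not restarted even with SA_RESTART
    if (sigaction(SIGALRM, &action, &previous) != 0)
        return false;

    itimerval timer = {};
    timer.it_value.tv_usec = 50 * 1000; // fire once after 50 ms
    if (setitimer(ITIMER_REAL, &timer, nullptr) != 0)
        return false;

    errno = 0;
    // No descriptors are watched, so only the signal or the 2 s timeout can end the call.
    const int rc = ::poll(nullptr, 0, 2000);
    const int saved_errno = errno;

    sigaction(SIGALRM, &previous, nullptr);
    return rc == -1 && saved_errno == EINTR && signal_seen == 1;
}
```

A thread that receives such a signal every second while waiting on a slow DNS server has a good chance of seeing EINTR, which matches the observation that query threads hit the problem much more easily.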

arthurpassos · 2023-01-26T17:58:51Z

src/Common/tests/gtest_dns_reverse_resolve.cpp

+            if (result.empty())
+            {
+                std::cout<<"failed\n";
+            }


Not sure what to assert here, as DNS requests might actually fail, or one of the IP addresses might become invalid in the future. Maybe it's enough to know that it didn't crash and reached the end?

Just a note: this did not reproduce the issue, but it's good to have it.


arthurpassos · 2023-01-27T14:11:53Z

Looking at the CI results, I see two problems regarding DNS:

1. The stateless test took too long. That's fine, we'll either remove it or move it to the integration tests. Plus, I can decrease the number of requests.
2. It seems like the unit tests under tsan and ubsan caught something; I will take a look later.


arthurpassos added 2 commits January 25, 2023 19:07

cancel ares failed requests, listen to POLLRDNORM and retry on EINTR

9942fe7

remove maybe_unused

912ac6f

robot-ch-test-poll1 added the pr-bugfix label (Pull request with bugfix, not backported by default) Jan 25, 2023

spacing

4a10f49

den-crane added can be tested Allows running workflows for external contributors labels Jan 26, 2023

add reverseDNSQuery docs

433eda7

tavplubix self-assigned this Jan 26, 2023

arthurpassos changed the title ~~cancel ares failed requests, listen to POLLRDNORM and retry on EINTR~~ Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures Jan 26, 2023

add comments to the code

ea38df2

arthurpassos commented Jan 26, 2023

tavplubix reviewed Jan 26, 2023

tavplubix reviewed Jan 27, 2023

arthurpassos added 13 commits January 27, 2023 12:07

address some comments

e559a6b

make random number generator thread safe

39cbf4d

string formatting

c586de5

Add argument validation to reverseDNSQuerying and make stateless test…

f675600

… more about function validation

extern bad arguments

ebf3145

added integ test

6c3a587

fix black

2a63b86

fix black

0e7b35c

Add setting to enable/disable reverseDNSQuery function

513f430

Merge branch 'master' into fix_cares_crash

7b1ceaa

remove config file

e366031

remove config file

05a8176

add file again

c8f6003

robot-clickhouse-ci-2 mentioned this pull request Feb 9, 2023

Backport #45629 to 22.12: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures #46226

Merged

robot-clickhouse added a commit that referenced this pull request Feb 9, 2023

Backport #45629 to 22.12: Cancel c-ares failed requests and retry on …

f049f32

…system interrupts to prevent callbacks with dangling references and premature resolution failures

robot-clickhouse-ci-2 mentioned this pull request Feb 9, 2023

Cherry pick #45629 to 23.1: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures #46227

Merged

robot-clickhouse-ci-2 mentioned this pull request Feb 9, 2023

Backport #45629 to 23.1: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures #46228

Merged

robot-clickhouse added a commit that referenced this pull request Feb 9, 2023

Backport #45629 to 23.1: Cancel c-ares failed requests and retry on s…

0de6edc

…ystem interrupts to prevent callbacks with dangling references and premature resolution failures

robot-clickhouse-ci-2 mentioned this pull request Feb 9, 2023

Backport #45629 to 22.8: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures #46238

Merged

robot-clickhouse added a commit that referenced this pull request Feb 9, 2023

Backport #45629 to 22.8: Cancel c-ares failed requests and retry on s…

d0c0aca

…ystem interrupts to prevent callbacks with dangling references and premature resolution failures

robot-clickhouse-ci-2 mentioned this pull request Feb 9, 2023

Backport #45629 to 22.11: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures #46239

Merged

robot-clickhouse added a commit that referenced this pull request Feb 9, 2023

Backport #45629 to 22.11: Cancel c-ares failed requests and retry on …

9c270ea

…system interrupts to prevent callbacks with dangling references and premature resolution failures

robot-clickhouse-ci-2 added the pr-backports-created label (Backport PRs are successfully created, it won't be processed by CI script anymore) Feb 9, 2023

Enmk mentioned this pull request Feb 9, 2023

22.8 Backport of #45629 fix cares crash Altinity/ClickHouse#231

Merged

Enmk added a commit to Altinity/ClickHouse that referenced this pull request Feb 10, 2023

Merge pull request #231 from Altinity/backports/22.8_cares_crash

4abda45

22.8 Backport of ClickHouse#45629 fix cares crash

Enmk mentioned this pull request Feb 10, 2023

22.8.13 Pre-release PR Altinity/ClickHouse#230

Merged

Enmk mentioned this pull request Mar 15, 2023

22.8.15 Pre-release PR Altinity/ClickHouse#239

Merged

arthurpassos mentioned this pull request Jul 25, 2023

Add reverse DNS resolution test storm #52539

Closed

1 task

yokofly added a commit to timeplus-io/proton that referenced this pull request Oct 18, 2023

Cancel c-ares failed requests and retry on system interrupts to preve…

cdcc461

…nt callbacks with dangling references and premature resolution failures ClickHouse/ClickHouse#45629

alexey-milovidov mentioned this pull request Dec 30, 2023

The function reverseDNSQuery is garbage, remove it. #58368

Closed

tavplubix merged 28 commits into ClickHouse:master from arthurpassos:fix_cares_crash


5 participants
