Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures#45629
Conversation
|
@tavplubix Hi again :). Can you add the label can be tested so I can have CI running? |
|
Right now Debugging it, it's indeed being called twice by different stack traces. |
|
Wow, nice catch and a really good explanation, thank you!
I guess it's because of the query profiler. It sends SIGUSR1/SIGUSR2 once per second to threads that are processing some queries. So it's much easier to trigger EINTR in a query thread than in a background thread. |
| if (result.empty()) | ||
| { | ||
| std::cout<<"failed\n"; | ||
| } |
There was a problem hiding this comment.
Not sure what to assert for here as DNS requests might actually fail or one of the IP addresses actually become invalid in the future. Maybe it's enough to know it didn't crash and reached the end?
Just a note: this did not reproduce the issue, but it's good to have it.
tests/queries/0_stateless/02483_test_reverse_dns_resolution.reference
Outdated
Show resolved
Hide resolved
|
Looking at CI results, I see two problems regardin DNS:
|
… more about function validation
…system interrupts to prevent callbacks with dangling references and premature resolution failures
…ystem interrupts to prevent callbacks with dangling references and premature resolution failures
…ystem interrupts to prevent callbacks with dangling references and premature resolution failures
…system interrupts to prevent callbacks with dangling references and premature resolution failures
Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
22.8 Backport of ClickHouse#45629 fix cares crash
Backport #45629 to 22.12: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
Backport #45629 to 23.1: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
Backport #45629 to 22.8: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
Backport #45629 to 22.11: Cancel c-ares failed requests and retry on system interrupts to prevent callbacks with dangling references and premature resolution failures
…nt callbacks with dangling references and premature resolution failures ClickHouse/ClickHouse#45629
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
A couple of seg faults have been reported around
c-ares. All of the recent stack traces observed fail on inserting intostd::unodered_set<>. I believe I have found the root cause of this, it seems to be unprocessed queries. Prior to this PR, CH callspollto wait on the file descriptors in thec-areschannel. According to the poll docs, a negative return value means an error has ocurred. Because of this, we would abort the execution and return failure. The problem is thatpollwill also return a negative value if a system interrupt occurs. A system interrupt does not mean the processing has failed or ended, but we would abort it anyways because we were checking for negative values. Once the execution is aborted, the whole stack is destroyed, which includes thestd::unordered_set<std::string>passed to thevoid *parameter of the c-ares callback. Once c-ares completed the request, the callback would be invoked and would access an invalid memory address causing a segfault.Solution was to check for
EINTR == errno(errno is both thread-safe and thread-local, so I think it's ok to check it.) and retry if that was the case. If it was an actual error, callares_cancelto cancel pending queries and then abort execution (similar to what libcurl does). Callingares_cancelwill make sure pending requests are canceled and thecallbackwill not be executed with dangling references.libcurl seems to have at least two wait_resolve/ failure recovery methods depending on build options, both of these make use of
ares_cancel. See ref1 and ref2.reverseDNSQueryfunction was added as a tool to debug this problem. The test implemented with it is able to reproduce the issue, so I think it's a good idea to keep it. Plus, it can also be used for reverse DNS querying lol.The funny thing is that it was pretty easy to reproduce this issue with this function, but neither my multithreaded unit tests or scripts to make an "authentication storm" was able to crash CH. I believe unit tests were not able to repro the crash because it was testing only those classes and there was no system interrupts/ bigger failures to cause the "premature" abortion. Authentication storm, on the other hand, had system interrupts and CH was running normally, but it would test against a local DNS instance running in my machine, so the latency was very low. Maybe it was too low for an interrupt to happen?
To be on the super safe side, there are two more actions that we could take:
I want to believe we have found the issue and there is no need for the above, but let me know what you guys think.
Documentation entry for user-facing changes