Skip to content

[ClickHouse-58380] Implement Retries in distributed queries when server stopped during queries#66791

Open
Zawa-ll wants to merge 147 commits intoClickHouse:masterfrom
Zawa-ll:58380-retries-in-distributed-queries
Open

[ClickHouse-58380] Implement Retries in distributed queries when server stopped during queries#66791
Zawa-ll wants to merge 147 commits intoClickHouse:masterfrom
Zawa-ll:58380-retries-in-distributed-queries

Conversation

@Zawa-ll
Copy link
Copy Markdown
Contributor

@Zawa-ll Zawa-ll commented Jul 19, 2024

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for dynamic replica reconnection in the Cluster class, enabling automatic reconnection to available replicas if some replicas become unavailable during query execution.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Documentation Information:

  • Motivation: Ensures that queries can continue execution by reconnecting to other available replicas if some replicas become unavailable, improving the resilience of the cluster.
  • Parameters:
    • Cluster::reconnectToReplica: Method to attempt reconnection to a replica if it is not connected.
    • Cluster::handleDynamicReplicas: Method to handle the dynamic reconnection logic for replicas.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

This addresses issue #58380.

@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Jul 19, 2024

This ticket is involve multiple files, I will continue reviewing and working on my changes to ensure that it works as expected, feel free to review and point out mistakes I made!

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jul 20, 2024
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Jul 20, 2024
@robot-ch-test-poll1
Copy link
Copy Markdown
Contributor

robot-ch-test-poll1 commented Jul 20, 2024

This is an automated comment for commit 92ad740 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report❌ failure
Successful checks
Check nameDescriptionStatus
Docs checkBuilds and tests the documentation✅ success

@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch 4 times, most recently from 86cd55f to 5a762ed Compare July 24, 2024 04:46
@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch from 6ccf304 to 2aeb470 Compare July 24, 2024 17:24
@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Jul 26, 2024

@alexey-milovidov
Hi I just notice that the current Fast Check Cannot be passed, and error like

2024-07-26 00:28:45 Connection error, will retry:  [Errno 111] Connection refused
2024-07-26 00:28:56 Connection error, will retry:  [Errno 111] Connection refused

Can be seen, can you please take a look and provide me with some help on how to fix this one? Appreciate it!

@alexey-milovidov
Copy link
Copy Markdown
Member

@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Jul 26, 2024

It is a problem with your changes. Read the logs here: https://s3.amazonaws.com/clickhouse-test-reports/66791/b4511ab45e7c3e306052d5eeabdf15749b0b67d1/fast_test.html

@alexey-milovidov

Hi I read through the log you shared with me, and I noticed that it is the same as the link I checked previously: https://github.com/ClickHouse/ClickHouse/actions/runs/10101883167/job/27936478262?pr=66791
To ensure that I didn't miss anything important, I have carefully checked the log again, and I don't think I could find the error you mentioned...

Can you please share with me where the error is, or tell me which part of the log should I check? I will be really appreciate it!

@alexey-milovidov
Copy link
Copy Markdown
Member

@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Jul 27, 2024

The server does not start: https://s3.amazonaws.com/clickhouse-test-reports/66791/b4511ab45e7c3e306052d5eeabdf15749b0b67d1/fast_test/run.log

@alexey-milovidov
Hi sorry for the confusion, my question here is what could be the possible cause of this issue? Or which part is responsible for the logic of starting the server? I was trying to restore my changes in ConnectionPoolWithFailover.cpp and Cluster.cpp but it seems that my changes ain't work out...
I will be really appreciate it if you can provide me with some direction on which part of code should I be focusing on!

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Jul 27, 2024

@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch from 21b0d1f to 32e2204 Compare July 27, 2024 17:44
@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Jul 29, 2024

Hi @alexey-milovidov,
I've been working on fixing the issues related to connection timeouts and keep-alive settings in, I tried to rewrite my implemenation to fix the bug, but I encountered some error that really stuck me for a few days. It would be really helpful if you could kindly in reviewing my changes.

Here’s a quick summary of what I’ve done:

  1. ConnectionPoolWithFailover.cpp:
    Function Modified: ConnectionPoolWithFailover::get
    What I Did: I updated the lambda function inside the get method to properly set socket timeouts and enable keep-alive on the connections. The lambda now returns a TryResult that fits the expected TryGetEntryFunc type.

  2. Cluster.cpp:
    Adjustments: I tweaked the for-loop that iterates over shards_info to make sure it correctly gets connections and sets their socket timeouts and keep-alive settings.

  3. ConnectionTimeouts.h:
    Changes: I added a constructor to the ConnectionTimeouts struct so it can take multiple Poco::Timespan parameters. This helps with initializing the timeouts properly.
    These changes should ensure the connections have the right timeouts and keep-alive settings, and the lambda function now matches the expected type.

Despite refactoring my code based on the log you shared with me, I still encounter the following error:
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)

I wasn't sure if I am changing the correct file, or did I over complicate the task...

It would be incredibly helpful if you could provide me with any help or point out any mistakes so I don't go too far down the wrong path.
Thanks a lot if you could do me a favor, I will be really appreciate that


update: Some new changes were implemented and it seems to work well, I will apply testing on it

Zawa-ll added 12 commits July 31, 2024 14:37
Removed unnecessary try-catch

Add Comments back

Style check
Use Existing logic from other files

Added missing right bracket

Added Missing param

Fixed Cluster.cpp

Fixed Cluster.h

Fixed Cluster.cpp

Fixed Cluster.h again

Create an public getter for getEntry()

handled unused e

modify head file of ConnectPoolwithfailover

handled header
@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch from f92401b to 552b4b0 Compare August 7, 2024 02:27
@devcrafter devcrafter self-assigned this Aug 13, 2024
@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch from 439a1cd to 0bad5ba Compare August 18, 2024 04:59
@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Aug 23, 2024

Hi @alexey-milovidov @devcrafter! Thanks for taking care of this PR,
I tried to rebase the current PR but it has the previous base with flaky tests issue that cannot be passed...

I just create a new rebased PR from with the same changes Ive made from this one, could you please check that one (#68581) instead? Appreciate it!!

@devcrafter
Copy link
Copy Markdown
Member

Hi @alexey-milovidov @devcrafter! Thanks for taking care of this PR, I tried to rebase the current PR but it has the previous base with flaky tests issue that cannot be passed...

I just create a new rebased PR from with the same changes Ive made from this one, could you please check that one (#68581) instead? Appreciate it!!

There is no need to create new PR.
You've just messed up things with 02572_query_views_log_background_thread. Also, if you add new settings, it's necessary to add it to settings changes history.

Copy link
Copy Markdown
Member

@devcrafter devcrafter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIS, the PR is changing code related to distributed INSERT which unrelated to distributed SELECT as should be per initial task #58380 . What are you trying to implement?

@Zawa-ll
Copy link
Copy Markdown
Contributor Author

Zawa-ll commented Sep 15, 2024

@devcrafter Thank you for taking care of this issue, I will fix the code based on the comments, thank you!!

@Zawa-ll Zawa-ll force-pushed the 58380-retries-in-distributed-queries branch from 92ad740 to c188700 Compare September 24, 2024 04:56
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Oct 29, 2024

Dear @devcrafter, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants