Skip to content

Fix possible issues with MySQL client protocol TLS connections#65938

Merged
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:mysql-client-TLS-retries
Jul 1, 2024
Merged

Fix possible issues with MySQL client protocol TLS connections#65938
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:mysql-client-TLS-retries

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jul 1, 2024

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible issues with MySQL client protocol TLS connections

Occasionally, 02479_mysql_connect_to_self fails on CI 1.

The problem was indeed query profiler and EINTR, but not in a way you may think.

For such failures you may see the following trace in trace_log:

contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()

After which the connection will fail with:

Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But, if you will take a look at ma_tls_read() you will see that it has proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case of it), but still, for some reason it fails.

And the reason is the units of the read/write timeout, ma_tls_read() calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it incorrectly assume that the timeout is in milliseconds, but that timeout was in seconds, this bug had been fixed in 2, and now it works like a charm!

I've verified it with patching openssl library:

diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
index 82f7be85ae..3d2f3926a0 100644
--- a/crypto/bio/bss_sock.c
+++ b/crypto/bio/bss_sock.c
@@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
             ret = ktls_read_record(b->num, out, outl);
         else
 # endif
-            ret = readsocket(b->num, out, outl);
+        {
+            static int i = 0;
+            static int j = 0;
+            if (!(++j % 5))
+            {
+                fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
+                ret = 0;
+                errno = EAGAIN;
+            }
+            else if (!(++i % 3))
+            {
+                fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
+                ret = -1;
+                errno = EAGAIN;
+            }
+            else
+                ret = readsocket(b->num, out, outl);
+        }
         BIO_clear_retry_flags(b);
         if (ret <= 0) {
             if (BIO_sock_should_retry(ret))

And before this patch (well, not the patch itself, but the referenced patch in mariadb-connector-c) if you will pass read_write_timeout=1 it will fail:

SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But after, it always works:

$ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
^CStopping launch of queries. SIGINT received.

Queries executed: 478.

localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.

0.000%          0.014 sec.
10.000%         0.058 sec.
20.000%         0.065 sec.
30.000%         0.073 sec.
40.000%         0.079 sec.
50.000%         0.087 sec.
60.000%         0.089 sec.
70.000%         0.091 sec.
80.000%         0.095 sec.
90.000%         0.100 sec.
95.000%         0.102 sec.
99.000%         0.105 sec.
99.900%         0.140 sec.
99.990%         0.140 sec.

Fixes: #50911

@azat azat marked this pull request as draft July 1, 2024 15:18
@robot-clickhouse robot-clickhouse added pr-bugfix Pull request with bugfix, not backported by default submodule changed At least one submodule changed in this PR. labels Jul 1, 2024
@robot-clickhouse
Copy link
Copy Markdown
Member

robot-clickhouse commented Jul 1, 2024

This is an automated comment for commit 9c7db67 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

Occasionally, 02479_mysql_connect_to_self fails on CI [1].

  [1]: ClickHouse#50911

The problem was indeed query profiler and EINTR, but not in a way you
may think.

For such failures you may see the following trace in trace_log:

    contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
    contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
    contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
    contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
    contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
    contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
    contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
    contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
    contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
    contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
    contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
    contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
    contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
    contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
    contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
    contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
    contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
    contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
    src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
    src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
    src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()

After which the connection will fail with:

    Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But, if you will take a look at ma_tls_read() you will see that it has
proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case
of it), but still, for some reason it fails.

And the reason is the units of the read/write timeout, ma_tls_read()
calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it
incorrectly assume that the timeout is in milliseconds, but that timeout
was in seconds, this bug had been fixed in [2], and now it works like a
charm!

  [2]: ClickHouse/mariadb-connector-c#17

I've verified it with patching openssl library:

    diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
    index 82f7be85ae..3d2f3926a0 100644
    --- a/crypto/bio/bss_sock.c
    +++ b/crypto/bio/bss_sock.c
    @@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
                 ret = ktls_read_record(b->num, out, outl);
             else
     # endif
    -            ret = readsocket(b->num, out, outl);
    +        {
    +            static int i = 0;
    +            static int j = 0;
    +            if (!(++j % 5))
    +            {
    +                fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
    +                ret = 0;
    +                errno = EAGAIN;
    +            }
    +            else if (!(++i % 3))
    +            {
    +                fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
    +                ret = -1;
    +                errno = EAGAIN;
    +            }
    +            else
    +                ret = readsocket(b->num, out, outl);
    +        }
             BIO_clear_retry_flags(b);
             if (ret <= 0) {
                 if (BIO_sock_should_retry(ret))

And before this patch (well, not the patch itself, but the referenced
patch in mariadb-connector-c) if you will pass read_write_timeout=1 it
will fail:

    SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
    Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But after, it always works:

    $ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
    ^CStopping launch of queries. SIGINT received.

    Queries executed: 478.

    localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.

    0.000%          0.014 sec.
    10.000%         0.058 sec.
    20.000%         0.065 sec.
    30.000%         0.073 sec.
    40.000%         0.079 sec.
    50.000%         0.087 sec.
    60.000%         0.089 sec.
    70.000%         0.091 sec.
    80.000%         0.095 sec.
    90.000%         0.100 sec.
    95.000%         0.102 sec.
    99.000%         0.105 sec.
    99.900%         0.140 sec.
    99.990%         0.140 sec.

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the mysql-client-TLS-retries branch from f4dc6aa to 9c7db67 Compare July 1, 2024 16:39
@azat azat marked this pull request as ready for review July 1, 2024 16:40
@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 1, 2024

Hmm, so does CI now requires cloud fork sync? And I've seen couple of false positive if this check recently

@alexey-milovidov alexey-milovidov self-assigned this Jul 1, 2024
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2024
@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 1, 2024

@alexey-milovidov are you sure that all CI passed? I'm seeing only 28 checks, while there should 200+

Merged via the queue into ClickHouse:master with commit 78185fb Jul 1, 2024
@azat azat deleted the mysql-client-TLS-retries branch July 1, 2024 20:53
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2024
@alexey-milovidov
Copy link
Copy Markdown
Member

That's our problem :)

@jsc0218 jsc0218 mentioned this pull request Jul 25, 2024
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

02479_mysql_connect_to_self is flaky

4 participants