Skip to content

Update NuRaft buffer to 64 bit size#25421

Merged
alesapin merged 13 commits intomasterfrom
update_buffer_size_in_nuraft
Jun 22, 2021
Merged

Update NuRaft buffer to 64 bit size#25421
alesapin merged 13 commits intomasterfrom
update_buffer_size_in_nuraft

Conversation

@alesapin
Copy link
Copy Markdown
Member

@alesapin alesapin commented Jun 17, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Forward/backward incompatible change of maximum buffer size in clickhouse-keeper. Better to do it now (before production), than later.

Currently, a lot of ClickHouse users have 6GB+ ZooKeeper snapshots. In Keeper they are much smaller because of compression, around 1.5GB. But this looks quite dangerous and I think it's better to change the max size of the internal buffer in NuRaft to 64 bits.

@robot-clickhouse robot-clickhouse added the pr-backward-incompatible Pull request with backwards incompatible changes label Jun 17, 2021
enum class ChangelogVersion : uint8_t
{
V0 = 0,
V1 = 1, /// with 64 bit buffer header
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not in production, why even introduce a new version?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it is Ok to just test the versioning right now.
We can drop support for old versions later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, why not?) 1 Looks better than 0)

@robot-ch-test-poll1 robot-ch-test-poll1 added the submodule changed At least one submodule changed in this PR. label Jun 17, 2021
@alexey-milovidov alexey-milovidov self-assigned this Jun 17, 2021
@alexey-milovidov
Copy link
Copy Markdown
Member

Maybe search for other 32bit sizes and offsets in NuRaft...

@alesapin alesapin added the jepsen-test Need to test this PR with jepsen tests label Jun 17, 2021
@alesapin
Copy link
Copy Markdown
Member Author

A lot of tests failed. I don't understand why. Need to investigate.

@alesapin
Copy link
Copy Markdown
Member Author

Keeper failures:

==================================== ERRORS ====================================
___________________ ERROR at setup of test_restart_multinode ___________________
[gw0] linux -- Python 3.8.5 /usr/bin/python3

    @pytest.fixture(scope="module")
    def started_cluster():
        try:
>           cluster.start()

test_keeper_snapshots_multinode/test.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
helpers/cluster.py:1472: in start
    instance.wait_for_start(start_timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <helpers.cluster.ClickHouseInstance object at 0x7f795006f7f0>
start_timeout = 300.0, connection_timeout = None

    def wait_for_start(self, start_timeout=None, connection_timeout=None):
    
        if start_timeout is None or start_timeout <= 0:
            raise Exception("Invalid timeout: {}".format(start_timeout))
    
        if connection_timeout is not None and connection_timeout < start_timeout:
            raise Exception("Connection timeout {} should be grater then start timeout {}"
                            .format(connection_timeout, start_timeout))
    
        start_time = time.time()
        prev_rows_in_log = 0
    
        def has_new_rows_in_log():
            nonlocal prev_rows_in_log
            try:
                rows_in_log = int(self.count_in_log(".*").strip())
                res = rows_in_log > prev_rows_in_log
                prev_rows_in_log = rows_in_log
                return res
            except ValueError:
                return False
    
        while True:
            handle = self.get_docker_handle()
            status = handle.status
            if status == 'exited':
>               raise Exception("Instance `{}' failed to start. Container status: {}, logs: {}"
                                .format(self.name, status, handle.logs().decode('utf-8')))
E               Exception: Instance `node1' failed to start. Container status: exited, logs:

helpers/cluster.py:1989: Exception
test_keeper_snapshots_multinode/test.py:69: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
helpers/cluster.py:1864: in restart_clickhouse
    self.start_clickhouse(stop_start_wait_sec)
helpers/cluster.py:1857: in start_clickhouse
    self.exec_in_container(["bash", "-c", "{} --daemon".format(self.clickhouse_start_command)], user=str(os.getuid()))
helpers/cluster.py:1867: in exec_in_container
    container_id = self.get_docker_handle().id
helpers/cluster.py:1955: in get_docker_handle
    return self.cluster.get_docker_handle(self.docker_id)
helpers/cluster.py:424: in get_docker_handle
    return self.docker_client.containers.get(docker_id)
/usr/local/lib/python3.8/dist-packages/docker/models/containers.py:889: in get
    resp = self.client.api.inspect_container(container_id)
/usr/local/lib/python3.8/dist-packages/docker/utils/decorators.py:19: in wrapped
    return f(self, resource_id, *args, **kwargs)
/usr/local/lib/python3.8/dist-packages/docker/api/container.py:773: in inspect_container
    return self._result(
/usr/local/lib/python3.8/dist-packages/docker/api/client.py:274: in _result
    self._raise_for_status(response)
/usr/local/lib/python3.8/dist-packages/docker/api/client.py:270: in _raise_for_status
    raise create_api_error_from_http_exception(e)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

e = HTTPError('404 Client Error: Not Found for url: http+docker://localhost/v1.41/containers/roottestkeepersnapshotsmultinode_node1_1/json')

    def create_api_error_from_http_exception(e):
        """
        Create a suitable APIError from requests.exceptions.HTTPError.
        """
        response = e.response
        try:
            explanation = response.json()['message']
        except ValueError:
            explanation = (response.content or '').strip()
        cls = APIError
        if response.status_code == 404:
            if explanation and ('No such image' in str(explanation) or
                                'not found: does not exist or no pull access'
                                in str(explanation) or
                                'repository does not exist' in str(explanation)):
                cls = ImageNotFound
            else:
                cls = NotFound
>       raise cls(e, response=response, explanation=explanation)
E       docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.41/containers/roottestkeepersnapshotsmultinode_node1_1/json: Not Found ("No such container: roottestkeepersnapshotsmultinode_node1_1")

Looks unrelated. Database replicated failure also looks unrelated, there was no even Connection loss.

@alesapin
Copy link
Copy Markdown
Member Author

Ok, it should be more clear with #25459

@alesapin
Copy link
Copy Markdown
Member Author

alesapin commented Jun 18, 2021

cannot merge anything. Integration tests are completely broken.

@alesapin
Copy link
Copy Markdown
Member Author

For some reason sometimes we have eth1 instead of eth0 in the fetches test. I'll just add it as a fallback.
eth1: 41478485268 3009510 0 0 0 0 0 2445 188022557 2105260 0 0 0 0 0

@alesapin alesapin merged commit a8ce3be into master Jun 22, 2021
@alesapin alesapin deleted the update_buffer_size_in_nuraft branch June 22, 2021 07:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jepsen-test Need to test this PR with jepsen tests pr-backward-incompatible Pull request with backwards incompatible changes submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants