Skip to content

Try to fix rare bug in reading of empty arrays#34327

Merged
alesapin merged 2 commits intoClickHouse:masterfrom
CurtizJ:fix-reading-empty-arrays
Feb 7, 2022
Merged

Try to fix rare bug in reading of empty arrays#34327
alesapin merged 2 commits intoClickHouse:masterfrom
CurtizJ:fix-reading-empty-arrays

Conversation

@CurtizJ
Copy link
Copy Markdown
Member

@CurtizJ CurtizJ commented Feb 4, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Try to fix rare bug while reading of empty arrays, which could lead to Data compressed with different methods error.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Feb 4, 2022
@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Feb 4, 2022

I'll add a test a bit later.

@alexey-milovidov
Copy link
Copy Markdown
Member

Very interesting...

@CurtizJ CurtizJ force-pushed the fix-reading-empty-arrays branch 2 times, most recently from 3165db2 to a17e7b9 Compare February 5, 2022 17:57
@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Feb 6, 2022

I couldn't construct small functional test on which this bug reproduces, because it will require knowledge of too many implementation details. But it reproduces in case like this:

CREATE TABLE test_table
(
    `date` Date,
    `timestamp` UInt32,
    `host` String,
    `http_code` UInt16,
    `test_id` Array(Int32)
)
ENGINE = MergeTree(date, (timestamp, host, http_code), 8192)

There is one detail of data in the table: column test_id has huge ranges in which all arrays are empty.

SELECT *
FROM test_table
WHERE (date = '2022-01-16') AND (host = 'some_host.com') AND (test_id = [3333])
ORDER BY timestamp DESC
LIMIT 100
SETTINGS optimize_read_in_order = 1

In this query we read data in the reverse order of primary key. Assume that while reading we made a seek by the default path. We have some data in buffer from previous calls at this moment and as it written in comment we delay fetching new data to the buffer to the next call of nextImpl . Then it happens that we are reading granule in which all arrays are empty and it will lead to call of readBig with n=0 in reading of array elements. Since n=0 we won't call any reading and nextImpl won't be triggered as we had expected. Then on next call of seek to the position, which is a bit bellow current position (remember that we are reading ranges backwards) this check will pass and we will move position in buffer without actual seek, but actually buffer contains wrong data.

@CurtizJ CurtizJ marked this pull request as ready for review February 6, 2022 00:25
Comment on lines 23 to 35
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current master 80 will be read.

@alexey-milovidov alexey-milovidov self-assigned this Feb 6, 2022
@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Feb 6, 2022

In other words - when we read zero bytes after seek, we don't change the buffer contents, but it becomes inconsistent with the position in file?

Maybe we can reset buffer to zero when we read zero bytes at the different offset?

Need to check other buffers that allow seeks?

@Felixoid
Copy link
Copy Markdown
Member

Felixoid commented Feb 6, 2022

@Mergifyio rebase

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Feb 6, 2022

rebase

✅ Branch has been successfully rebased

@akuzm akuzm force-pushed the fix-reading-empty-arrays branch from 6c6cd6d to ae1fc94 Compare February 6, 2022 20:46
@alesapin
Copy link
Copy Markdown
Member

alesapin commented Feb 7, 2022

image
Sorry but I cannot resist to merge it.

@alesapin alesapin merged commit 107246e into ClickHouse:master Feb 7, 2022
@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Feb 7, 2022

@alexey-milovidov

Maybe we can reset buffer to zero when we read zero bytes at the different offset?

It's hard because there are two different buffers: CompressedReadBuffer and ReadBufferFromFileDescriptor and when we request to read zero bytes from CompressedReadBuffer we can't easily and safely reset buffer of ReadBufferFromFileDescriptor, which is used by CompressedReadBuffer. Anyway, I think that is should be possible to do several consecutive seeks, because such situation may appear in other circumstantiates, not only while reading zero bytes.

@CurtizJ CurtizJ added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Feb 7, 2022
@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Feb 7, 2022

Need to check other buffers that allow seeks?

Yes, it's worth to check.

CurtizJ added a commit that referenced this pull request Feb 22, 2022
Backport #34327 to 22.1: Try to fix rare bug in reading of empty arrays
CurtizJ added a commit that referenced this pull request Feb 22, 2022
Backport #34327 to 21.12: Try to fix rare bug in reading of empty arrays
CurtizJ added a commit that referenced this pull request Feb 22, 2022
Backport #34327 to 21.11: Try to fix rare bug in reading of empty arrays
@Felixoid Felixoid added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants