add compression method for files: Xz#16578
add compression method for files: Xz#16578alexey-milovidov merged 24 commits intoClickHouse:masterfrom
Conversation
|
…sic code for encoding
…essor, tests passing
azat
left a comment
There was a problem hiding this comment.
Just a few minor comments until there is no can-be-tested label
src/IO/LzmaWriteBuffer.cpp
Outdated
| return; | ||
| //std::cout << ret << " RET FIN" << std::endl; | ||
|
|
||
| if (ret == LZMA_STREAM_END) { |
src/IO/LzmaWriteBuffer.cpp
Outdated
| throw Exception(std::string("lzma stream encoding failed: ") + "; lzma version: " + LZMA_VERSION_STRING, ErrorCodes::LZMA_STREAM_ENCODER_FAILED); | ||
|
|
||
| std::cout << lstr.avail_in << std::endl; | ||
| //std::cout << lstr.avail_in << " " << lstr.avail_out << std::endl; |
There was a problem hiding this comment.
Forgot to remove? (and in a few other places)
src/IO/LzmaWriteBuffer.cpp
Outdated
|
|
||
| //std::cout << ret << " RET IMPL" << std::endl; | ||
|
|
||
| if (ret == LZMA_STREAM_END) { |
There was a problem hiding this comment.
Coding style: curly bracket is always on a new line (and not only here but in a few other places) (except for } while (true))
src/IO/LzmaReadBuffer.cpp
Outdated
| throw Exception( | ||
| std::string("lzma decoder finished, but stream is still alive: error code: ") + std::to_string(ret) + "; lzma version: " + LZMA_VERSION_STRING, | ||
| ErrorCodes::LZMA_STREAM_DECODER_FAILED); |
There was a problem hiding this comment.
This can use fmt-like constructor
| throw Exception( | |
| std::string("lzma decoder finished, but stream is still alive: error code: ") + std::to_string(ret) + "; lzma version: " + LZMA_VERSION_STRING, | |
| ErrorCodes::LZMA_STREAM_DECODER_FAILED); | |
| throw Exception(ErrorCodes::LZMA_STREAM_DECODER_FAILED, "lzma decoder finished, but stream is still alive: error code: {}; lzma version: {}", ret, LZMA_VERSION_STRING); |
src/IO/LzmaWriteBuffer.cpp
Outdated
| //std::cout << ret << " RET FIN" << std::endl; | ||
|
|
||
| if (ret == LZMA_STREAM_END) { | ||
| if (ret == LZMA_STREAM_END) |
There was a problem hiding this comment.
trailing whitespace, and incorrect indent below (tabs instead of spaces, and tabs also not only here)
And by the way it is not only here, it is pretty easy to fix just using git, but please note that this will override the history (and that said that if you don't know git I guess it will be better if you will fix this in your editor)
git rebase --whitespace=fix upstream/master
There was a problem hiding this comment.
Can I just format it with clang-format and options from .clang-format?
| add_subdirectory (replxx-cmake) | ||
| add_subdirectory (ryu-cmake) | ||
| add_subdirectory (unixodbc-cmake) | ||
| add_subdirectory (xz) |
There was a problem hiding this comment.
Looks like you've forget to add submodule
There was a problem hiding this comment.
There was a problem hiding this comment.
Not really, this is just to know where to fetch the submodule from.
You need to run smth like this:
git submodule add https://github.com/xz-mirror/xz contrib/xz
And this will create a special file contrib/xz in the git index (but in the filesystem you will see it as a content of the repository), that will contain the HEAD for the submodule
There was a problem hiding this comment.
Also to make fasttest update your submodule you need to add contrib/xz into
ClickHouse/docker/test/fasttest/run.sh
Line 130 in f10a520
Without fasttest other tests won't be run
There was a problem hiding this comment.
Also it's possible to disable xz in fasttest.
|
I kind of fixed it with clang-format.I manually look for tabs with vim and don't find it.Looks like all whitespaces consist of single spaces now. |
| set (LZMA_LIBRARY liblzma) | ||
| set (LZMA_INCLUDE_DIR ${ClickHouse_SOURCE_DIR}/contrib/xz/src/liblzma/api) |
There was a problem hiding this comment.
And I guess it is better to move this out into separate contrib/xz-cmake/CMakeLists.txt, and add an interface library.
One of similar files is contrib/protobuf-cmake/CMakeLists.txt (protobuf has it's own cmake rules, while in clickhouse there is just a wrapper that install some options)
There was a problem hiding this comment.
What should I do to pass description check?
I've added record to CHANCHELOG.md
There was a problem hiding this comment.
What should I do to pass description check?
Update description of PR and include changelog entry using specified format.
There is template for PRs - https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/.github/PULL_REQUEST_TEMPLATE.md
Just copy it and modify the template, but leave the format.
There was a problem hiding this comment.
Yes, it's much better to provide our own CMakeLists.
CMake often cannot be safely reused (without building unneeded targets, polluting build options).
|
Ok, it seems, that I should write a wrapper for CMake |
|
It seems like |
The reason is that You can try upgrade fasttest image to 20.04 (there are few images that already uses it anyway, so I guess it is fine, @alesapin ?) and it will contain cmake 3.16 P.S. you can also send a patch to the uptream to bump cmake requirements: |
|
We usually write our own CMakeLists for libraries. Because original CMakeLists are intended to build a library itself, not to build it for another project. |
|
Yeah, this is because fasttest image is not updated on per-PR bases (as most of the images) |
| std::string("lzma preset failed: ") + "; lzma version: " + LZMA_VERSION_STRING, ErrorCodes::LZMA_STREAM_ENCODER_FAILED); | ||
|
|
||
|
|
||
| lzma_filter filters[] = { |
There was a problem hiding this comment.
Missing comments (explanations), the code is unclear to me.
alexey-milovidov
left a comment
There was a problem hiding this comment.
We also need a functional test.
grep 'Content-Encoding'
in test/queries/0_stateless
We need to make sure that xz tool can read our xz-compressed data
and we can read data compressed with the xz tool.
| #include <IO/WriteHelpers.h> | ||
| #include <IO/ReadHelpers.h> | ||
|
|
||
| int main(int, char **) |
There was a problem hiding this comment.
Did you run it? (And what are the results in comparison with xz?)
There was a problem hiding this comment.
Yes, I ran. But I suspect, that my implementation is not completely optimal:
(base) [centos@clickhouse build]$ ./src/IO/tests/lzma_buffers
Writing done. Elapsed: 12.78 s., 6.17 MB/s
Reading done. Elapsed: 0.79 s., 100.24 MB/s
(base) [centos@clickhouse build]$ ./src/IO/tests/zlib_buffers
Writing done. Elapsed: 8.04 s., 110.55 MB/s
Reading done. Elapsed: 4.44 s., 200.08 MB/s
There was a problem hiding this comment.
It's hard to say... Better to compare with xz tool, and we expect that performance should not be significantly lower.
alexey-milovidov
left a comment
There was a problem hiding this comment.
Ok, only some small changes required.
|
Maybe we can also add a test for |
|
You can find similar tests in the following way:
|
…r file() function
…unction, formatted code
|
|
| return CompressionMethod::Zlib; | ||
| if (*method_str == "brotli" || *method_str == "br") | ||
| return CompressionMethod::Brotli; | ||
| if (*method_str == "LZMA" || *method_str == "xz") |
There was a problem hiding this comment.
A small note... xz is actually LZMA2. But why do we need this name? Maybe just leave xz only.
There was a problem hiding this comment.
Is there the only place to remove LZMA?
Is it ok to leave all buffer filenames with LZMA* prefix?
alexey-milovidov
left a comment
There was a problem hiding this comment.
LGTM.
There are some leftover comments that are related to another xz implementation... we can try it in the next PR.
I had sent mail to contributor of fast-lzma2 maintainer, but he has not responded yet.I am not sure that fast-lzma2 is compatible with classic *.xz format and it's repository does not contain CMake file, so we need to adapt their implementation.I have faced some problems with that.Can we merge this implementation and keep in mind existence of another implementation? |
|
Yes, we can merge before trying fast-lzma2. |
|
You can remove commented code before we'll merge. |
|
I've tested decompression performance, it's the same as for |
|
Maybe you will be also interested in the internals of |
I will keep it in mind, but now I am going to deal with main task. |
|
This PR broke the Fast Test. I'll merge the fix for now, so that we have at least some tests, but it's not entirely correct. CMake should be able to finish when the |
| path = contrib/miniselect | ||
| url = https://github.com/danlark1/miniselect | ||
| [submodule "contrib/xz"] | ||
| path = contrib/xz |
There was a problem hiding this comment.
Uses spaces for indent, while all other lines uses tabs (since git submodule add uses them)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
add
*.xzcompression/decompression support.It enables using*.xzinfile()function.This closes #8828