Clarify zstd compressor output compatibility guarantees across versions #999
Comments
Hi @jblazquez,
However, Bottom line : never expect 2 different versions of |
Thanks for clarifying the compatibility guarantees @Cyan4973. I think those two guarantees - especially the first one - should be enough for us to unlock our ability to upgrade. |
@Cyan4973 Do the following restrictions and variables all always produce the same output?
Technically, -T0 varies the number of threads that you mentioned above, but we would like to use -T0 and still have a guarantee of reproducible output across different numbers of cores (single core and multi-core). Some tests show this may be the case, but we seek official clarification before we assume that we can rely on it. |
Situation has evolved since this issue was opened, and is generally more friendly to reproducible builds. With recent versions of Things that can break this reproducibility pattern :
|
We will fix all bugs causing non-deterministic builds as long as they follow the constraints that @Cyan4973 laid out above. However, I'd definitely recommend adding zstd determinism tests that invoke zstd the same way you do in your builds. We test zstd for non-determinism, but you may invoke it in a different way that we've missed in our test coverage. |
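A determinism test along these lines could look like the following minimal sketch. It is not taken from the zstd test suite: the helper names `output_digest` and `check_deterministic` are hypothetical, and `zlib.compress` stands in for the real compressor only so that the example runs with the Python standard library alone. In practice you would replace `compress_like_build` with a function that invokes zstd exactly as your build does (same version, level, and thread settings).

```python
import hashlib
import zlib
from typing import Callable


def output_digest(compress: Callable[[bytes], bytes], data: bytes) -> str:
    """Return the SHA-256 hex digest of the compressed form of `data`."""
    return hashlib.sha256(compress(data)).hexdigest()


def check_deterministic(compress: Callable[[bytes], bytes],
                        data: bytes, runs: int = 5) -> bool:
    """Compress `data` `runs` times; True if every run gives identical bytes."""
    digests = {output_digest(compress, data) for _ in range(runs)}
    return len(digests) == 1


def compress_like_build(data: bytes) -> bytes:
    # Stand-in (assumption): zlib at a fixed level. Swap in your real zstd
    # call with the exact parameters your build uses.
    return zlib.compress(data, 9)


if __name__ == "__main__":
    sample = b"reproducible builds need reproducible compression\n" * 1000
    assert check_deterministic(compress_like_build, sample)
    print("output is identical across runs")
```

A check like this only shows that repeated runs in one environment agree. To cover the thread-count question raised earlier in the thread, you would run the same check on machines with different core counts and compare the digests they report.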
I would like to see guidelines about versions become a bit more clear. For example: v1.4.x output clearly does not match v1.5.x output, but will any 1.4.x output match that of any other 1.4.x version (1.4.1 matching 1.4.9)? |
No, there is no such guarantee. All release versions of |
Seems like zstd should not position itself diametrically opposite to reproducibility / consistency, which is what the current policy is. I will have to deprecate or discourage zstd use in Wyng to avoid breakdown of deduplication. OTOH, prioritizing output consistency would place very little burden on the zstd project; all that's required is a willingness to recognize when consistency is broken and to respond with an appropriate increment of the version number. This would allow users to receive zstd bug fix updates with peace of mind. If best security practice didn't call for hashing data in its compressed form, it would be a different story and this issue wouldn't exist. |
Just for reference : Requiring all versions of a product to always generate the same binary output would prevent the product from improving, |
I don't think anyone here is suggesting that; certainly not me. |
@tasket, then I'm not really sure what need you're describing that isn't being met. We provide determinism. We bump the library version every time we break determinism. What's missing? |
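From the user's side, the policy described here suggests storing the library version next to any expected digest, so that a mismatch after an upgrade can be told apart from a real determinism bug. The sketch below is an illustration under assumptions, not an official zstd recipe: `make_record` and `classify` are hypothetical helpers, and zlib again stands in for zstd, with `zlib.ZLIB_RUNTIME_VERSION` playing the role that `ZSTD_versionString()` plays in the zstd C API.

```python
import hashlib
import zlib


def make_record(data: bytes, version: str, level: int = 9) -> dict:
    """Record which library version produced which compressed-output digest."""
    digest = hashlib.sha256(zlib.compress(data, level)).hexdigest()
    return {"version": version, "level": level, "sha256": digest}


def classify(stored: dict, current: dict) -> str:
    """Compare a stored record with one produced by the current library."""
    if stored["sha256"] == current["sha256"]:
        return "match"
    if (stored["version"] != current["version"]
            or stored["level"] != current["level"]):
        # Different version or parameters: the output may legitimately differ.
        return "expected-change"
    # Same version and parameters, different bytes: worth reporting as a bug.
    return "determinism-bug"


if __name__ == "__main__":
    data = b"backup chunk\n" * 500
    stored = make_record(data, zlib.ZLIB_RUNTIME_VERSION)
    current = make_record(data, zlib.ZLIB_RUNTIME_VERSION)
    print(classify(stored, current))
```

Keying the record on the full version string, rather than on major.minor only, matches the statement in this thread that any release may change the compressed output.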
Thank you for asking. But that is not how I interpret the dialogue thus far. It should be asked: Do all code changes have the same significance? Why would a buffer overflow fix and a tweak to the compression ratio both affect the version's patch level and not the major or minor? Developers wanting determinism (and zstd efficiency) will face possible discontinuity each and every time their OS packaging system updates the zstd library automatically. As a result, we'll feel pressured to include our own copies of old zstd versions in our apps... or else have to explain to users, managers, etc. that zstd is the reason their storage systems repeatedly go offline because the backup archives have exploded in size. I would like to be able to list a dependency of "zstd 1.5.x" for my app and let updates occur for it without my intervention and without breaking determinism. This implies that changes to zstd that affect its data output would have to land in a "later" version such as 1.6. In this example v1.5.x branch would have something like a "long term support" designation. FOSS operating systems accommodate this kind of compatibility-freeze fairly often by not carrying the latest development or beta branches and keeping the major or major.minor version the same while applying patches that address security and stability issues – but I'm not sure how realistic that would be for zstd under the current versioning policy. |
One more question in the same vein that I don't see covered here specifically: given the same input and compression parameters, is a given version of zstd guaranteed to produce the same output also on different architectures/byte-orders and operating systems? |
Yes it is |
@Cyan4973, hmmmm, is that true? I think there are still exceptions. Enabling the row match finder is still sensitive to instruction set support; see zstd/lib/compress/zstd_compress.c, lines 236 to 252 in b880f20.
I recall we talked about closing that gap, but it looks like we didn't. |
I don't remember that one in detail. I guess this could be tested by manually switching the vector code path on and off and looking at the potential differences. |
@Cyan4973, yeah we can look at it, but my recollection is that the row-based match finder does not produce the same parse as the non-row-based version. |
Ah yes, that part is true, but my recollection is that the row-based match finder can be run on any system using the scalar code path,
Hi,
We recently upgraded zstd to 1.3.3 after reading about the performance improvements for high compression levels, and we were happy to see that the performance increase was very significant (around 40% for level 19). However, we also noticed that the output of zstd 1.3.3 is not binary-identical to zstd 1.3.2, and unfortunately that limits its usefulness for our particular use case because we rely on our compressed data not changing as we upgrade the libzstd library, which we'd like to do in order to get access to bugfixes, new features and performance improvements. We were previously using zlib which I guess hasn't had a bitstream-impacting change in many years.
Is bit-identical output across versions a goal of the zstd project, or do you expect these changes to happen for the foreseeable future?