Unlike a zip archive where every file is individually compressed, 7-zip archives can have all of the files compressed together in one long compressed stream, supposedly to achieve a better compression ratio.
In a naive random access implementation, to read the first file you start at the beginning of the compressed stream and read out that file's worth of bytes.
To read the second file you have to start at the beginning of the compressed stream again, read and discard the first file's worth of bytes to get to the correct offset in the stream, then read out the second file's worth of bytes.
- You can see that for an archive that contains hundreds of files, extraction gets progressively slower as you have to read and discard more and more data just to get to the right offset in the stream.
+ You can see that for an archive that contains hundreds of files, extraction can get progressively slower as you have to read and discard more and more data just to get to the right offset in the stream.
This package contains an optimisation that caches and reuses the underlying compressed stream reader so you don't have to keep starting from the beginning for each file, but it does require you to call `rc.Close()` before extracting the next file.
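
To illustrate why closing each file matters, here is a minimal standard-library sketch of a reader cache. It is not the package's implementation: `solidStream`, `fileReader`, `span` and `discardedBytes` are hypothetical names, and an uncompressed byte slice stands in for the decompressed stream. The only idea it models is that `Close()` parks the positioned reader so the next file can pick it up instead of starting from the beginning.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// span locates one file inside the decompressed stream (hypothetical layout).
type span struct{ offset, size int64 }

// solidStream models one solid stream; data stands in for its decompressed bytes.
type solidStream struct {
	data      []byte
	idle      map[int64]io.Reader // parked readers, keyed by their current offset
	discarded int64               // bytes read and thrown away just to seek
}

// fileReader reads one file and, on Close, parks the shared reader for reuse.
type fileReader struct {
	limited io.Reader // view restricted to this file's bytes
	under   io.Reader // the shared stream reader
	s       *solidStream
	end     int64 // stream offset just past this file
}

func (s *solidStream) open(sp span) (*fileReader, error) {
	r, ok := s.idle[sp.offset]
	if ok {
		delete(s.idle, sp.offset) // reuse: already positioned at this file
	} else {
		r = bytes.NewReader(s.data) // start again from the beginning
		n, err := io.CopyN(io.Discard, r, sp.offset)
		s.discarded += n
		if err != nil {
			return nil, err
		}
	}
	return &fileReader{
		limited: io.LimitReader(r, sp.size),
		under:   r,
		s:       s,
		end:     sp.offset + sp.size,
	}, nil
}

func (f *fileReader) Read(p []byte) (int, error) { return f.limited.Read(p) }

func (f *fileReader) Close() error {
	// Skip whatever the caller left unread so the reader sits at the next file.
	if _, err := io.Copy(io.Discard, f.limited); err != nil {
		return err
	}
	f.s.idle[f.end] = f.under
	return nil
}

// discardedBytes extracts three files in the given order and reports the seek cost.
func discardedBytes(order []int, closeBetween bool) int64 {
	files := []span{{0, 3}, {3, 3}, {6, 2}}
	s := &solidStream{data: []byte("aaabbbcc"), idle: map[int64]io.Reader{}}
	for _, i := range order {
		rc, err := s.open(files[i])
		if err != nil {
			panic(err)
		}
		if _, err := io.Copy(io.Discard, rc); err != nil {
			panic(err)
		}
		if closeBetween {
			rc.Close()
		}
	}
	return s.discarded
}

func main() {
	fmt.Println(discardedBytes([]int{0, 1, 2}, false)) // no Close between files
	fmt.Println(discardedBytes([]int{0, 1, 2}, true))  // Close between files
}
```

In this model, skipping `Close()` means every `open` misses the cache and has to discard everything before the requested file, while closing between files in natural order discards nothing. Passing a different `order` to `discardedBytes` shows the cache missing again.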
- The archive used here is just the reference LZMA SDK archive, which is only 1 MiB in size but does contain 630+ files.
- The only difference between the two benchmarks is the above change to call `rc.Close()` between files so the stream reuse optimisation takes effect.
+ The archive used here is just the reference LZMA SDK archive, which is only 1 MiB in size but does contain 630+ files split across three compression streams.
+ The only difference between BenchmarkNaiveReader and the rest is the lack of a call to `rc.Close()` between files so the stream reuse optimisation doesn't take effect.

- Finally, don't try and extract the files in a different order compared to the natural order within the archive as that will also undo the optimisation.
+ Don't try and blindly throw goroutines at the problem either as this can also undo the optimisation; a naive implementation that uses a pool of multiple goroutines to extract each file ends up being nearly 50% slower, even just using a pool of one goroutine can end up being less efficient.
+ The optimal way to employ goroutines is to make use of the `sevenzip.FileHeader.Stream` field; extract files with the same value using the same goroutine.
+ This achieves a 50% speed improvement with the LZMA SDK archive, but it very much depends on how many streams there are in the archive.
+
+ In general, don't try and extract the files in a different order compared to the natural order within the archive as that will also undo the optimisation.
The worst scenario would likely be to extract the archive in reverse order.
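
The advice above about grouping work by the `sevenzip.FileHeader.Stream` value can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: `entry` is a hypothetical stand-in for an archive file that carries only a name and a stream number, and `groupByStream` and `extractByStream` are made-up helper names. Each stream gets one goroutine, and files within a stream keep their natural archive order.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a hypothetical stand-in for an archive file; Stream mirrors the
// idea of the sevenzip.FileHeader.Stream field mentioned in the text.
type entry struct {
	Name   string
	Stream int
}

// groupByStream buckets entries by stream, keeping archive order in each bucket.
func groupByStream(entries []entry) map[int][]entry {
	groups := make(map[int][]entry)
	for _, e := range entries {
		groups[e.Stream] = append(groups[e.Stream], e)
	}
	return groups
}

// extractByStream runs one goroutine per stream and calls extract for each
// entry in order. It returns the first error seen, if any.
func extractByStream(entries []entry, extract func(entry) error) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, group := range groupByStream(entries) {
		wg.Add(1)
		go func(group []entry) {
			defer wg.Done()
			for _, e := range group {
				if err := extract(e); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
					return // stop this stream; other streams carry on
				}
			}
		}(group)
	}
	wg.Wait()
	return firstErr
}

func main() {
	entries := []entry{{"a.txt", 0}, {"b.txt", 1}, {"c.txt", 0}, {"d.txt", 1}}
	err := extractByStream(entries, func(e entry) error {
		fmt.Printf("stream %d: %s\n", e.Stream, e.Name)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

With the real package, the `extract` callback would open the file, copy it out, and call `rc.Close()` before returning, so each goroutine keeps reusing its own stream reader. The number of goroutines is bounded by the number of streams, which is why the gain depends on the archive.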