Skip to content

Commit 187a49e

Browse files
authored
feat: Export the folder/stream identifier (#169)
Useful as a parallelisation hint.
1 parent 9e0ded8 commit 187a49e

File tree

6 files changed

+161
-47
lines changed

6 files changed

+161
-47
lines changed

README.md

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -55,7 +55,7 @@ func extractArchive(archive string) error {
5555
Unlike a zip archive where every file is individually compressed, 7-zip archives can have all of the files compressed together in one long compressed stream, supposedly to achieve a better compression ratio.
5656
In a naive random access implementation, to read the first file you start at the beginning of the compressed stream and read out that files worth of bytes.
5757
To read the second file you have to start at the beginning of the compressed stream again, read and discard the first files worth of bytes to get to the correct offset in the stream, then read out the second files worth of bytes.
58-
You can see that for an archive that contains hundreds of files, extraction gets progressively slower as you have to read and discard more and more data just to get to the right offset in the stream.
58+
You can see that for an archive that contains hundreds of files, extraction can get progressively slower as you have to read and discard more and more data just to get to the right offset in the stream.
5959

6060
This package contains an optimisation that caches and reuses the underlying compressed stream reader so you don't have to keep starting from the beginning for each file, but it does require you to call `rc.Close()` before extracting the next file.
6161
So write your code similar to this:
@@ -90,22 +90,32 @@ func extractArchive(archive string) error {
9090
```
9191
You can see the main difference is to not defer all of the `Close()` calls until the end of `extractArchive()`.
9292

93-
There is a pair of benchmarks in this package that demonstrates the performance boost that the optimisation provides:
93+
There is a set of benchmarks in this package that demonstrates the performance boost that the optimisation provides, amongst other techniques:
9494
```
9595
$ go test -v -run='^$' -bench='Reader$' -benchtime=60s
9696
goos: darwin
9797
goarch: amd64
9898
pkg: github.com/bodgit/sevenzip
9999
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
100100
BenchmarkNaiveReader
101-
BenchmarkNaiveReader-12 2 33883477004 ns/op
101+
BenchmarkNaiveReader-12 2 31077542628 ns/op
102102
BenchmarkOptimisedReader
103-
BenchmarkOptimisedReader-12 402 180606463 ns/op
103+
BenchmarkOptimisedReader-12 434 164854747 ns/op
104+
BenchmarkNaiveParallelReader
105+
BenchmarkNaiveParallelReader-12 240 361869339 ns/op
106+
BenchmarkNaiveSingleParallelReader
107+
BenchmarkNaiveSingleParallelReader-12 412 171027895 ns/op
108+
BenchmarkParallelReader
109+
BenchmarkParallelReader-12 636 112551812 ns/op
104110
PASS
105-
ok github.com/bodgit/sevenzip 191.827s
111+
ok github.com/bodgit/sevenzip 472.251s
106112
```
107-
The archive used here is just the reference LZMA SDK archive, which is only 1 MiB in size but does contain 630+ files.
108-
The only difference between the two benchmarks is the above change to call `rc.Close()` between files so the stream reuse optimisation takes effect.
113+
The archive used here is just the reference LZMA SDK archive, which is only 1 MiB in size but does contain 630+ files split across three compression streams.
114+
The only difference between BenchmarkNaiveReader and the rest is the lack of a call to `rc.Close()` between files so the stream reuse optimisation doesn't take effect.
109115

110-
Finally, don't try and extract the files in a different order compared to the natural order within the archive as that will also undo the optimisation.
116+
Don't try and blindly throw goroutines at the problem either as this can also undo the optimisation; a naive implementation that uses a pool of multiple goroutines to extract each file ends up being nearly 50% slower, even just using a pool of one goroutine can end up being less efficient.
117+
The optimal way to employ goroutines is to make use of the `sevenzip.FileHeader.Stream` field; extract files with the same value using the same goroutine.
118+
This achieves a 50% speed improvement with the LZMA SDK archive, but it very much depends on how many streams there are in the archive.
119+
120+
In general, don't try and extract the files in a different order compared to the natural order within the archive as that will also undo the optimisation.
111121
The worst scenario would likely be to extract the archive in reverse order.

go.mod

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ require (
1212
github.com/stretchr/testify v1.8.4
1313
github.com/ulikunitz/xz v0.5.11
1414
go4.org v0.0.0-20200411211856-f5505b9728dd
15+
golang.org/x/sync v0.6.0
1516
golang.org/x/text v0.14.0
1617
)
1718

go.sum

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -155,6 +155,8 @@ golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4/go.mod h1:RxMgew5VJxzue5/jJ
155155
golang.org/x/sync v0.0.0-20190227155943-e225da77a7e6/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
156156
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
157157
golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
158+
golang.org/x/sync v0.6.0 h1:5BMeUDZ7vkXGfEr1x9B4bRcTH4lpkTkpdh0T/J+qjbQ=
159+
golang.org/x/sync v0.6.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
158160
golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
159161
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
160162
golang.org/x/sys v0.0.0-20190312061237-fead79001313/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=

reader.go

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -447,6 +447,9 @@ func (z *Reader) init(r io.ReaderAt, size int64) error {
447447
if !fh.isEmptyStream && !fh.isEmptyFile {
448448
f.folder, _ = header.streamsInfo.FileFolderAndSize(j)
449449

450+
// Make an exported copy of the folder index
451+
f.Stream = f.folder
452+
450453
if f.folder != folder {
451454
offset = 0
452455
}

reader_test.go

Lines changed: 129 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -3,48 +3,72 @@ package sevenzip_test
33
import (
44
"errors"
55
"fmt"
6+
"hash"
67
"hash/crc32"
78
"io"
89
"path/filepath"
10+
"runtime"
911
"testing"
1012
"testing/fstest"
1113
"testing/iotest"
1214

1315
"github.com/bodgit/sevenzip"
1416
"github.com/bodgit/sevenzip/internal/util"
1517
"github.com/stretchr/testify/assert"
18+
"golang.org/x/sync/errgroup"
1619
)
1720

18-
func readArchive(t *testing.T, r *sevenzip.ReadCloser) {
19-
t.Helper()
21+
func reader(r io.Reader) io.Reader {
22+
return r
23+
}
2024

21-
h := crc32.NewIEEE()
25+
func extractFile(tb testing.TB, r io.Reader, h hash.Hash, f *sevenzip.File) error {
26+
tb.Helper()
2227

23-
for _, f := range r.File {
24-
rc, err := f.Open()
25-
if err != nil {
26-
t.Fatal(err)
27-
}
28-
defer rc.Close()
28+
h.Reset()
2929

30-
h.Reset()
30+
if _, err := io.Copy(h, r); err != nil {
31+
return err
32+
}
3133

32-
if _, err := io.Copy(h, iotest.OneByteReader(rc)); err != nil {
33-
t.Fatal(err)
34-
}
34+
if f.UncompressedSize > 0 && f.CRC32 == 0 {
35+
tb.Log("archive member", f.Name, "has no CRC")
36+
37+
return nil
38+
}
3539

36-
rc.Close()
40+
if !util.CRC32Equal(h.Sum(nil), f.CRC32) {
41+
return errors.New("CRC doesn't match")
42+
}
43+
44+
return nil
45+
}
3746

38-
if f.UncompressedSize > 0 && f.CRC32 == 0 {
39-
t.Log("archive member", f.Name, "has no CRC")
47+
//nolint:lll
48+
func extractArchive(tb testing.TB, r *sevenzip.ReadCloser, stream int, h hash.Hash, fn func(io.Reader) io.Reader, optimised bool) error {
49+
tb.Helper()
4050

51+
for _, f := range r.File {
52+
if stream >= 0 && f.Stream != stream {
4153
continue
4254
}
4355

44-
if !util.CRC32Equal(h.Sum(nil), f.CRC32) {
45-
t.Fatal(errors.New("CRC doesn't match"))
56+
rc, err := f.Open()
57+
if err != nil {
58+
return err
59+
}
60+
defer rc.Close()
61+
62+
if err = extractFile(tb, fn(rc), h, f); err != nil {
63+
return err
64+
}
65+
66+
if optimised {
67+
rc.Close()
4668
}
4769
}
70+
71+
return nil
4872
}
4973

5074
//nolint:funlen
@@ -184,7 +208,9 @@ func TestOpenReader(t *testing.T) {
184208

185209
assert.Equal(t, volumes, r.Volumes())
186210

187-
readArchive(t, r)
211+
if err := extractArchive(t, r, -1, crc32.NewIEEE(), iotest.OneByteReader, true); err != nil {
212+
t.Fatal(err)
213+
}
188214
})
189215
}
190216
}
@@ -223,7 +249,9 @@ func TestOpenReaderWithPassword(t *testing.T) {
223249
}
224250
defer r.Close()
225251

226-
readArchive(t, r)
252+
if err := extractArchive(t, r, -1, crc32.NewIEEE(), iotest.OneByteReader, true); err != nil {
253+
t.Fatal(err)
254+
}
227255
})
228256
}
229257
}
@@ -264,10 +292,43 @@ func ExampleOpenReader() {
264292
// 10
265293
}
266294

267-
func benchmarkArchive(b *testing.B, file string, optimised bool) {
295+
func benchmarkArchiveParallel(b *testing.B, file string) {
268296
b.Helper()
269297

270-
h := crc32.NewIEEE()
298+
for n := 0; n < b.N; n++ {
299+
r, err := sevenzip.OpenReader(filepath.Join("testdata", file))
300+
if err != nil {
301+
b.Fatal(err)
302+
}
303+
defer r.Close()
304+
305+
streams := make(map[int]struct{}, len(r.File))
306+
307+
for _, f := range r.File {
308+
streams[f.Stream] = struct{}{}
309+
}
310+
311+
eg := new(errgroup.Group)
312+
eg.SetLimit(runtime.NumCPU())
313+
314+
for stream := range streams {
315+
stream := stream
316+
317+
eg.Go(func() error {
318+
return extractArchive(b, r, stream, crc32.NewIEEE(), reader, true)
319+
})
320+
}
321+
322+
if err := eg.Wait(); err != nil {
323+
b.Fatal(err)
324+
}
325+
326+
r.Close()
327+
}
328+
}
329+
330+
func benchmarkArchiveNaiveParallel(b *testing.B, file string, workers int) {
331+
b.Helper()
271332

272333
for n := 0; n < b.N; n++ {
273334
r, err := sevenzip.OpenReader(filepath.Join("testdata", file))
@@ -276,26 +337,45 @@ func benchmarkArchive(b *testing.B, file string, optimised bool) {
276337
}
277338
defer r.Close()
278339

340+
eg := new(errgroup.Group)
341+
eg.SetLimit(workers)
342+
279343
for _, f := range r.File {
280-
rc, err := f.Open()
281-
if err != nil {
282-
b.Fatal(err)
283-
}
284-
defer rc.Close()
344+
f := f
285345

286-
h.Reset()
346+
eg.Go(func() error {
347+
rc, err := f.Open()
348+
if err != nil {
349+
return err
350+
}
351+
defer rc.Close()
287352

288-
if _, err := io.Copy(h, rc); err != nil {
289-
b.Fatal(err)
290-
}
353+
return extractFile(b, rc, crc32.NewIEEE(), f)
354+
})
355+
}
291356

292-
if optimised {
293-
rc.Close()
294-
}
357+
if err := eg.Wait(); err != nil {
358+
b.Fatal(err)
359+
}
295360

296-
if !util.CRC32Equal(h.Sum(nil), f.CRC32) {
297-
b.Fatal(errors.New("CRC doesn't match"))
298-
}
361+
r.Close()
362+
}
363+
}
364+
365+
func benchmarkArchive(b *testing.B, file string, optimised bool) {
366+
b.Helper()
367+
368+
h := crc32.NewIEEE()
369+
370+
for n := 0; n < b.N; n++ {
371+
r, err := sevenzip.OpenReader(filepath.Join("testdata", file))
372+
if err != nil {
373+
b.Fatal(err)
374+
}
375+
defer r.Close()
376+
377+
if err := extractArchive(b, r, -1, h, reader, optimised); err != nil {
378+
b.Fatal(err)
299379
}
300380

301381
r.Close()
@@ -354,6 +434,18 @@ func BenchmarkOptimisedReader(b *testing.B) {
354434
benchmarkArchive(b, "lzma1900.7z", true)
355435
}
356436

437+
func BenchmarkNaiveParallelReader(b *testing.B) {
438+
benchmarkArchiveNaiveParallel(b, "lzma1900.7z", runtime.NumCPU())
439+
}
440+
441+
func BenchmarkNaiveSingleParallelReader(b *testing.B) {
442+
benchmarkArchiveNaiveParallel(b, "lzma1900.7z", 1)
443+
}
444+
445+
func BenchmarkParallelReader(b *testing.B) {
446+
benchmarkArchiveParallel(b, "lzma1900.7z")
447+
}
448+
357449
func BenchmarkBCJ(b *testing.B) {
358450
benchmarkArchive(b, "bcj.7z", true)
359451
}

struct.go

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -324,8 +324,14 @@ type FileHeader struct {
324324
Attributes uint32
325325
CRC32 uint32
326326
UncompressedSize uint64
327-
isEmptyStream bool
328-
isEmptyFile bool
327+
328+
// Stream is an opaque identifier representing the compressed stream
329+
// that contains the file. Any File with the same value can be assumed
330+
// to be stored within the same stream.
331+
Stream int
332+
333+
isEmptyStream bool
334+
isEmptyFile bool
329335
}
330336

331337
// FileInfo returns an fs.FileInfo for the FileHeader.

0 commit comments

Comments
 (0)