Strange performance and scalability issues with some of the build systems #3163

@VorpalBlade

Describe the bug

After reading a recent Phoronix benchmark (a bit down the page), I decided to investigate why Arch Linux was so much slower (10-20x) for zstd performance. It turned out that something is wrong with some of the build systems included with zstd!

When zstd is built with the cmake or meson build systems there is negative scaling with the number of threads, while when building with the Makefile in the top level directory, there is positive scaling with the number of threads.

To Reproduce
Steps to reproduce the behavior:

  1. Build with the build system you want to investigate. One of:
    • Plain make
    • mkdir build && cmake ../zstd-1.5.2/build/cmake/ && make
    • meson setup builddir && cd builddir && ninja
  2. Test the resulting binary on a large file. I used the FreeBSD image, as this is what Phoronix Test Suite used, albeit from an older version that I can't find. I can, however, reproduce the same issue with the linked file. Use the following pair of benchmark commands and compare scaling:
    • path/to/zstd -T1 -b4 path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img
    • path/to/zstd -T6 -b4 path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img (adjust -T6 based on the number of cores you have)

Note! I see the same pattern at other compression levels such as 6 and 8, not just 4. So the level doesn't really appear to matter, as long as it is the same for both runs of course.
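
The pair of benchmark commands from step 2 can be scripted so that exactly the same pair is run against each build. This is only a minimal Python sketch: it assumes nothing beyond the -T and -b flags shown above, and the helper names `benchmark_command` and `run_scaling_pair` (and their parameters) are hypothetical, not part of zstd.

```python
import subprocess


def benchmark_command(zstd_path, input_path, threads, level=4):
    """Build the argument list for one benchmark run, e.g. zstd -T6 -b4 FILE."""
    return [zstd_path, f"-T{threads}", f"-b{level}", input_path]


def run_scaling_pair(zstd_path, input_path, threads, level=4):
    """Run the single-threaded and then the multi-threaded benchmark.

    Output is not captured; zstd prints its result lines to the terminal.
    """
    for n in (1, threads):
        cmd = benchmark_command(zstd_path, input_path, n, level)
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_scaling_pair("./zstd", "path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img", threads=6)` would run the -T1 and -T6 commands for the Makefile build; pass a different `zstd_path` for the CMake or Meson binaries.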

Expected behavior
I expect that all build systems should result in binaries with roughly the same behaviour. Performance and scaling should be similar.

Actual results

The output below has been abbreviated for clarity: repeated command lines have been elided, showing only the output. Three runs were performed for each combination of program and flags. As can be seen, the results are relatively consistent run-to-run (at least consistent enough given the huge discrepancies).

  1. CMake
$ programs/zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1108.5 MB/s, 4999.1 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1152.9 MB/s, 5006.7 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1102.1 MB/s, 4978.5 MB/s

$ programs/zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  717.0 MB/s, 4940.0 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  759.3 MB/s, 4893.5 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  697.5 MB/s, 4869.6 MB/s
  2. Meson
$ programs/zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1097.0 MB/s, 5029.3 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1098.2 MB/s, 4970.2 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1117.8 MB/s, 4952.6 MB/s

$ programs/zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  735.0 MB/s, 4982.8 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  758.6 MB/s, 4966.9 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500),  727.9 MB/s, 4949.7 MB/s
  3. Makefile
$ ./zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1118.2 MB/s, 4971.0 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1105.4 MB/s, 4931.3 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1150.6 MB/s, 4930.1 MB/s

$ ./zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3518.0 MB/s, 4898.2 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3486.3 MB/s, 4917.0 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3528.1 MB/s, 4900.8 MB/s

Analysis of results

For CMake and Meson: it can be seen that the performance goes down between 1 thread and 6 threads: ~1100 MB/s to ~700 MB/s.

For plain make, the performance goes up between 1 thread and 6 threads: ~1100 MB/s to ~3500 MB/s.

Decompression speed (the second value), however, does not seem to vary significantly across the experiments.
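
To put numbers on the scaling, the compression speeds can be averaged per configuration and the 6-thread mean divided by the 1-thread mean. The following is a rough Python sketch using the compression speeds from the runs listed above; the regular expression and the names `compression_speed`, `RUNS`, and `scaling_factor` are my own assumptions for illustration, not anything zstd provides.

```python
import re
from statistics import mean

# Matches the two throughput figures at the end of a benchmark result line.
SPEED_RE = re.compile(r"([\d.]+) MB/s,\s*([\d.]+) MB/s\s*$")


def compression_speed(line):
    """Extract the compression speed (first MB/s value) from a result line."""
    match = SPEED_RE.search(line)
    if match is None:
        raise ValueError(f"not a benchmark result line: {line!r}")
    return float(match.group(1))


# Compression speeds in MB/s, copied from the runs listed under Actual results.
RUNS = {
    "cmake": {1: [1108.5, 1152.9, 1102.1], 6: [717.0, 759.3, 697.5]},
    "meson": {1: [1097.0, 1098.2, 1117.8], 6: [735.0, 758.6, 727.9]},
    "make": {1: [1118.2, 1105.4, 1150.6], 6: [3518.0, 3486.3, 3528.1]},
}


def scaling_factor(build):
    """Mean 6-thread compression speed divided by the mean 1-thread speed."""
    return mean(RUNS[build][6]) / mean(RUNS[build][1])


for build in RUNS:
    print(f"{build}: {scaling_factor(build):.2f}x")
```

With these figures the factor comes out at roughly 0.65x for CMake, 0.67x for Meson, and 3.1x for plain make. `compression_speed` is included so that raw result lines can be fed in directly instead of typing the numbers by hand.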

Desktop (please complete the following information):

  • OS: Arch Linux
  • Version 1.5.2 (upstream tarball)
  • Compiler: GCC 12.1.0
  • Flags: Defaults for each build system. I also tested with some basics such as -O2, but that did not affect the overall behaviour.
  • Other relevant hardware specs: AMD Ryzen 5 5600X 6-Core Processor
  • Build system: Multiple ones, that is the whole point of this bug
