
Conversation

@cjb (Contributor) commented Sep 5, 2023

Hi @vanhauser-thc, thanks for benchmark.sh!

I've been hacking on it towards a goal of being able to compare execs-per-dollar across cloud instances and consumer machines, and I'd love to get some feedback. The first commit in the series is a straight port from shell to Python, and you can still get that original behavior from this version, with:

 $ python benchmark.py --mode singlecore --target test-instr --runs 1

But the default arguments are now instead equivalent to:

 $ python benchmark.py --mode multicore --target test-instr-persist-shmem --runs 5 --fuzzers <cpu_thread_count>

The defaults:

  • try to check whether afl-persistent-config and afl-system-config were run
  • run multiple fuzzers at once in a -M / -S campaign, using asyncio
  • record the average execs_per_second across all fuzzers
  • repeat the run five times and take an average, and warn you if perf diverged more than 15% between runs
  • record all of the results to a benchmark-results.jsonl file for you, in JSON Lines format.

Since each run is recorded, it's possible to do some basic data analysis and graphs. Here I ran 36 separate campaigns, varying n, the number of parallel afl-fuzz workers, from 1 to 36:

 $ for n in {1..36}; do python3 benchmark.py -m multicore -t test-instr-persist-shmem -f $n; sleep 2; done 

Then I used jq to see which campaign produced the maximum execs_per_sec value:

 $ jq -s 'select("hardware.cpu_model == Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz") | [.[]] | max_by(.targets."test-instr-persist-shmem".multicore.afl_execs_per_second)' < benchmark-results.jsonl
{
  "config": {
    "afl-persistent-config": true,
    "afl-system-config": true
  },
  "hardware": {
    "cpu_mhz": 4961.6,
    "cpu_model": "Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz",
    "cpu_threads": 16
  },
  "targets": {
    "test-instr-persist-shmem": {
      "multicore": {
        "afl_execs_per_second": 1340793.62,
        "afl_execs_total": 33778550,
        "fuzzers_used": 26,
        "start_time_of_run": "2023-09-04 23:25:13.621084",
        "total_execs_per_sec": 1087876.01,
        "total_run_time": 31.05
      }
    }
  }
}

Thanks again! I'm especially interested in feedback about whether it seems valid to test and compare numbers for a multicore campaign in this way.

@vanhauser-thc (Member)

I don't mind merging this.
However, it needs an update to the README, plus some already-collected data for comparison.

@vanhauser-thc (Member)

@cjb please see my previous comment :)

@cjb (Contributor, Author) commented Oct 2, 2023

@vanhauser-thc Ready for another look, thanks!

I don't think the GitHub diff view allows you to render a preview of the .ipynb notebook properly, so here's a link to it: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark.ipynb

@vanhauser-thc (Member)

I am not a fan of jsonlines unless the benchmark tool also shows how it compares to other CPU setups.

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.
This is not easily done with a JSON format, so either your Python tool also takes care of that or it needs a different format.

Also, whatever the format is (text, JSON, ...), you should add some results there. I will then add a few as well.

@cjb (Contributor, Author) commented Oct 2, 2023

Also, whatever the format is (text, JSON, ...), you should add some results there.

Ah, these are already part of this PR -- the diff view collapsed them because the changes are large, but this PR includes a full set of experiment results with different parameters for an Intel desktop CPU and an AWS 192-vCPU instance in benchmark-results.jsonl, plus a Python notebook discussing the results and performing analysis on them live.

Data in this PR: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark-results.jsonl
Jupyter notebook: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark.ipynb

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.

Makes sense, I think doing both can work: write the raw data to the JSON Lines file, and also write a one-line summary of it to the COMPARISON file after each run for easy textual viewing. I'll work on that.

The reason to have the JSON Lines version is to be able to answer more complex questions than "How fast is this machine?" -- the Jupyter notebook analysis gives answers to "How much faster is persistent mode with shared memory? How much faster is multicore, with and without persistent mode? How much faster does it get if you boot with mitigations=off? How many parallel fuzzers should I choose to run at once?" for the hardware I tested.
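For illustration, writing both outputs could look roughly like this (a sketch only -- the record_run helper and the COMPARISON column layout are placeholders, not the final code):

import json

def record_run(results, jsonl_path="benchmark-results.jsonl", comparison_path="COMPARISON"):
    # Raw data: one JSON object per line, easy to query with jq or pandas.
    with open(jsonl_path, "a") as f:
        f.write(json.dumps(results) + "\n")
    # Human-readable summary: one line per run, appended for easy textual viewing.
    hw = results["hardware"]
    run = results["targets"]["test-instr-persist-shmem"]["multicore"]
    with open(comparison_path, "a") as f:
        f.write(f'{hw["cpu_model"]} | {hw["cpu_mhz"]} | {run["fuzzers_used"]} | '
                f'{run["afl_execs_per_second"]} |\n')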

@vanhauser-thc (Member)

perfect.

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.
Makes sense, I think doing both can work: write the raw data to the JSON Lines file, and also write a one-line summary of it to the COMPARISON file after each run for easy textual viewing. I'll work on that.

I will merge this once this is added :)

@cjb (Contributor, Author) commented Nov 11, 2023

@vanhauser-thc Ready for another look! Please could you re-run this on your own machines (which will add them to COMPARISON), now that it tracks multi-core perf too?

@vanhauser-thc (Member) commented Nov 12, 2023

@cjb
I played a bit with it some more and have a bit of feedback for the multicore parts:

  • by default it is not printed how many cores are used
  • the COMPARISON file does not document how many cores were used
  • the calculation seems wrong:
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 417991.21
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 433177.68
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 432834.86
 [*] Average AFL execs/sec for this test across all runs was: 428001.25
 [*] Average total execs/sec for this test across all runs was: 317679.27

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

and finally there is not a duplicate check before writing entries in the COMPARISON file.

@vanhauser-thc (Member)

on a different system I get a script error from python3.8:

$ python3 benchmark.py
Traceback (most recent call last):
  File "benchmark.py", line 50, in <module>
    class Results:
  File "benchmark.py", line 53, in Results
    targets: dict[str, dict[str, Optional[Run]]]
TypeError: 'type' object is not subscriptable
$ python3 -V
Python 3.8.10

@cjb (Contributor, Author) commented Nov 12, 2023

Thanks!

by default it is not printed how many cores are used

It should be on the first line of output:

$ python3 benchmark.py
 [*] Using 16 fuzzers for multicore fuzzing (use --fuzzers to override)

the COMPARISON file does not document how many cores were used

Fixed.

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

Ah, afl_execs_per_sec is directly from the fuzzer_stats file, and total_execs_per_sec is timed from process start to process end. Once you get above the number of threads on the machine, afl_execs_per_sec can rise while total_execs_per_sec drops. But I suppose this is just because afl takes longer and longer to start up when the CPU is busy? If we agree that total_execs_per_sec isn't necessary, I'm happy to remove it from the printed output and switch to using afl_execs_per_sec in COMPARISON. What do you think?
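To make the distinction concrete, here is roughly what the two numbers measure (a simplified sketch, not the exact code in benchmark.py):

import glob
import re

def summed_afl_execs_per_sec(output_dir):
    # afl_execs_per_sec: sum the execs_per_sec each fuzzer reports in fuzzer_stats.
    total = 0.0
    for stats in glob.glob(f"{output_dir}/*/fuzzer_stats"):
        with open(stats) as f:
            match = re.search(r"execs_per_sec\s*:\s*([0-9.]+)", f.read())
            if match:
                total += float(match.group(1))
    return total

# total_execs_per_sec instead divides the summed execs_done by the wall-clock time
# measured from just before the fuzzers start to just after they all exit, so it
# also pays for startup overhead when the CPU is oversubscribed.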

and finally there is not a duplicate check before writing entries in the COMPARISON file.

I suppose I'd prefer not to add this -- I'd consider this as a human-readable file, not a machine-readable one.

on a different system I get a script error from python3.8:

Fixed.
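(For reference, the two usual ways to keep such annotations working on Python 3.8 look like this; the exact change committed may differ.)

# Option 1: postpone evaluation of annotations, so dict[str, ...] is never
# evaluated on Python 3.8 (assuming nothing calls typing.get_type_hints at runtime).
from __future__ import annotations

# Option 2: use the typing aliases instead of built-in generics.
from typing import Dict, Optional
targets: Dict[str, Dict[str, Optional["Run"]]]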

@vanhauser-thc (Member)

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

Ah, afl_execs_per_sec is directly from the fuzzer_stats file, and total_execs_per_sec is timed from process start to process end. Once you get above the number of threads on the machine, afl_execs_per_sec can rise while total_execs_per_sec drops. But I suppose this is just because afl takes longer and longer to start up when the CPU is busy? If we agree that total_execs_per_sec isn't necessary, I'm happy to remove it from the printed output and switch to using afl_execs_per_sec in COMPARISON. What do you think?

please use execs_per_sec from fuzzer_stats for all values and do not calculate your own :-) it can only be less correct :)

and finally there is not a duplicate check before writing entries in the COMPARISON file.

I suppose I'd prefer not to add this -- I'd consider this as a human-readable file, not a machine-readable one.

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

12th Gen Intel(R) Core(TM) i7- | 4800  | 16      | 155590     | 655391    | both         |
12th Gen Intel(R) Core(TM) i7- | 4701  | 16      | 152844     | 642560    | both         |
12th Gen Intel(R) Core(TM) i7- | 4273  | 16      | 154971     | 694437    | both         |

also please make the string longer, it is too short to document which specific processor it is :)

@cjb (Contributor, Author) commented Nov 12, 2023

@vanhauser-thc

please use execs_per_sec from fuzzer_stats for all values and do not calculate your own :-) it can only be less correct :)

Done. A reason I was distrustful of execs_per_sec from fuzzer_stats is that the total execs/sec from summing each fuzzer's value keeps increasing up to 24 concurrent fuzzers on my 8-core, 16-thread CPU -- whereas my self-calculated value stops increasing at 16 concurrent fuzzers, as I'd intuitively expect. Any idea why that could be?

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

This doesn't address the point about the file being intended as human-readable, yet now we're presupposing a parser for it. I could add a parser anyway. It would raise the question of what counts as a duplicate entry. What if I'm experimenting with system settings and trying to see their effect on the numbers? Do you just want to compare the CPU model on the last line of the COMPARISON file with the current CPU model, and ask for a --duplicate flag to be passed to write to COMPARISON if they match, or something more complicated?

also please make the string longer, it is too short to document which specific processor it is :)

Done -- I was trying to keep the whole file near a standard terminal width, but no big deal.

@vanhauser-thc (Member)

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

This doesn't address the point about the file being intended as human-readable, yet now we're presupposing a parser for it. I could add a parser anyway. It would raise the question of what counts as a duplicate entry. What if I'm experimenting with system settings and trying to see their effect on the numbers? Do you just want to compare the CPU model on the last line of the COMPARISON file with the current CPU model, and ask for a --duplicate flag to be passed to write to COMPARISON if they match, or something more complicated?

just check if '^PROCESSORNAME' is present in the file and do not write if so. very simple. (with a warning to remove the line if they want to save it)
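For example, something along these lines would do (a rough sketch; the committed check may look different):

from pathlib import Path

def should_append_to_comparison(cpu_model, path="COMPARISON"):
    # Skip the write if a line for this processor already exists,
    # and tell the user how to record a new result anyway.
    comparison = Path(path)
    if comparison.exists():
        for line in comparison.read_text().splitlines():
            if line.startswith(cpu_model):
                print(f"[*] Skipping COMPARISON: {cpu_model} is already listed; "
                      "remove that line if you want to save this run.")
                return False
    return True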

@cjb (Contributor, Author) commented Nov 14, 2023

just check if '^PROCESSORNAME' is present in the file and do not write if so. very simple. (with a warning to remove the line if they want to save it)

Done.

@vanhauser-thc (Member)

Sorry I found one more issue:

$ python3 benchmark.py 
 [*] Using 64 fuzzers for multicore fuzzing (use --fuzzers to override).
 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] singlecore test-instr-persist-shmem run 1 of 3, execs/s: 75952.94
 [*] singlecore test-instr-persist-shmem run 2 of 3, execs/s: 79508.87
 [*] singlecore test-instr-persist-shmem run 3 of 3, execs/s: 87693.22
 [*] Average AFL execs/sec for this test across all runs was: 81051.68
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: Infinity
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: Infinity
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: Infinity
Traceback (most recent call last):
  File "benchmark.py", line 293, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "benchmark.py", line 276, in main
    avg_afl_execs_per_sec = round(Decimal(sum(afl_execs_per_sec) / len(afl_execs_per_sec)), 2)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

not sure what is going wrong, this is the exec data in the fuzzer stats:

$ ls /tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem/
0/  12/ 16/ 2/  23/ 27/ 30/ 34/ 38/ 41/ 45/ 49/ 52/ 56/ 6/  63/ 
1/  13/ 17/ 20/ 24/ 28/ 31/ 35/ 39/ 42/ 46/ 5/  53/ 57/ 60/ 7/  
10/ 14/ 18/ 21/ 25/ 29/ 32/ 36/ 4/  43/ 47/ 50/ 54/ 58/ 61/ 8/  
11/ 15/ 19/ 22/ 26/ 3/  33/ 37/ 40/ 44/ 48/ 51/ 55/ 59/ 62/ 9/  
$ grep exec /tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem/0/fuzzer_stats
execs_done        : 259835
execs_per_sec     : 72076.28
execs_ps_last_min : 0.00
execs_since_crash : 259835
exec_timeout      : 20
slowest_exec_ms   : 0

@vanhauser-thc (Member)

btw I don't get why you calculate from total_execs / runtime ... this value is already present in fuzzer_stats and called execs_per_sec :)

@cjb (Contributor, Author) commented Nov 15, 2023

I'm not sure what's going wrong either. Perhaps print(afl_execs_per_sec) on line 275 to help debug? It should be a list of all of the execs_per_sec values from the fuzzer_stats files for 64 fuzzers * 3 rounds, and the line that's crashing is just supposed to sum them all and then divide by the 3 runs to get the average total of execs_per_sec values per round. (So this part is not dividing by runtime, but by the number of samples we take to try to reduce noise.)
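For reference, the intended calculation is roughly this, using the per-run totals from the earlier log as example values (a simplified sketch, not the exact code):

from decimal import Decimal

# In simplified form, one summed total per run (execs_per_sec added up across all fuzzers):
afl_execs_per_sec = [Decimal("417991.21"), Decimal("433177.68"), Decimal("432834.86")]

# Average the per-run totals; wall-clock runtime is not involved here.
avg = round(Decimal(sum(afl_execs_per_sec) / len(afl_execs_per_sec)), 2)  # 428001.25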

btw I don't get why you calculate from total_execs / runtime ... this value is already present in fuzzer_stats and called execs_per_sec :)

That isn't at play in this crash, since the crashing section is not doing that, but I tried to explain why I did this above:

Done. A reason I was distrustful of execs_per_sec from fuzzer_stats is that the total execs/sec from summing each fuzzer's value keeps increasing up to 24 concurrent fuzzers on my 8-core, 16-thread CPU -- whereas my self-calculated value stops increasing at 16 concurrent fuzzers, as I'd intuitively expect. Any idea why that could be?

Here is another explanation of the same problem. If I run with --fuzzers 64 on my 16-thread system, the fuzzing run takes 47 seconds instead of 6 seconds, but the sum of all of the execs_per_sec values in fuzzer_stats files is approximately the same in both cases. Taking runtime into account avoids giving people misleading numbers this way, and allows them to reason about how many parallel fuzzers it's optimal for them to use. Any idea why execs_per_sec is not actually counting execs per second?

But again, the code path we're talking about, and the number printed to COMPARISON, does not divide by total runtime. It should just be summing execs_per_sec to get a total score, and the only division is to average that score across the multiple fuzzing runs.

@cjb (Contributor, Author) commented Nov 15, 2023

[*] multicore test-instr-persist-shmem run 1 of 3, execs/s: Infinity

This is only supposed to print the sum of all of the execs_per_sec values in /tmp/aflpp-benchmark/*/fuzzer_stats, and I can't think of any way for that to be Infinity as printed other than addition overflow, but your sample value of execs_per_sec was only 72076 (* 64 fuzzers), so I'm still trying to guess at where the Infinity came from.

@vanhauser-thc (Member)

OK this is solved :)

/tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem $ egrep 'time|exec' 63/fuzzer_stats
start_time        : 1700128119
run_time          : 0
time_wo_finds     : 0
execs_done        : 15
execs_per_sec     : inf
execs_ps_last_min : 0.00
execs_since_crash : 15
exec_timeout      : 20
slowest_exec_ms   : 0

it comes from fuzzer_stats. I will do a fix.
But this also shows that the setup is not good: the fuzzing should run for at least 5 seconds even on very fast computers. Instead of AFL_BENCH_JUST_ONE you could use -V5.
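On the script side, a defensive parse could also simply ignore non-finite values (a sketch, separate from the fuzzer_stats fix):

import math
from typing import Optional

def parse_execs_per_sec(raw: str) -> Optional[float]:
    # fuzzer_stats can report "inf" while run_time is still 0, so treat
    # non-finite values as missing instead of summing them.
    value = float(raw)
    return value if math.isfinite(value) else None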

@vanhauser-thc (Member)

fixed the inf bug in dev
but I think there is something wrong with the script.

 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: [['63110.76'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00']]

Only the first instance has real results; all the others are not really starting up correctly. Maybe pipe stdout + stderr of these somewhere in Python and see what is going wrong (I am not a Python guy ...)
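A minimal way to do that with the asyncio-based runner might be (a sketch; benchmark.py's actual process handling may differ):

import asyncio

async def run_fuzzer(cmd, log_path):
    # Send each afl-fuzz instance's stdout/stderr to a log file so that
    # startup failures are visible afterwards.
    with open(log_path, "w") as log:
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=log, stderr=asyncio.subprocess.STDOUT
        )
        return await proc.wait()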

@vanhauser-thc (Member)

fixed the benchmark.py
do you think it makes sense to write to the json if the same CPU model is already in there?

@cjb (Contributor, Author) commented Nov 16, 2023

do you think it makes sense to write to the json if the same CPU model is already in there?

I think so -- the Jupyter notebook contains examples of measuring the perf difference due to afl-system-config and afl-persistent-config, and how perf scales with the number of cores used, and none of those experiments would be possible if we refused to write more than one line per CPU to the JSON output.

@cjb (Contributor, Author) commented Nov 16, 2023

Did you mean to pass -V10 instead of V10? Your change breaks the script for me.

We should also lower the number of runs now that each run in each mode takes 10 seconds -- the runtime of the script is now 60s. I guess two runs (40 seconds total) should be okay.

(I'll also need to redo the analysis in the Jupyter script.)

@cjb (Contributor, Author) commented Nov 19, 2023

@vanhauser-thc I think this should be ready to merge now, I removed the self-calculation of execs_per_sec and re-ran on my machines. (Want to add your own machines to COMPARISON?)

@vanhauser-thc (Member)

Thanks! Will add my machines tomorrow

@vanhauser-thc merged commit 444ddb2 into AFLplusplus:dev on Nov 19, 2023.