
Conversation

@cjb (Contributor) commented Sep 5, 2023

Hi @vanhauser-thc, thanks for benchmark.sh!

I've been hacking on it towards a goal of being able to compare execs-per-dollar across cloud instances and consumer machines, and I'd love to get some feedback. The first commit in the series is a straight port from shell to Python, and you can still get that original behavior from this version, with:

 $ python benchmark.py --mode singlecore --target test-instr --runs 1

But the default arguments are now instead equivalent to:

 $ python benchmark.py --mode multicore --target test-instr-persist-shmem --runs 5 --fuzzers <cpu_thread_count>

The defaults:

  • try to check whether afl-persistent-config and afl-system-config were run
  • run multiple fuzzers at once in a -M / -S campaign, using asyncio
  • record the average execs_per_second across all fuzzers
  • repeat the run five times and take an average, and warn you if perf diverged more than 15% between runs
  • record all of the results to a benchmark-results.jsonl file for you, in JSON Lines format.

Since each run is recorded, it's possible to do some basic data analysis and graphs. Here I ran 36 separate campaigns, varying n, the number of parallel afl-fuzz workers, from 1 to 36:

 $ for n in {1..36}; do python3 benchmark.py -m multicore -t test-instr-persist-shmem -f $n; sleep 2; done 

Then I used jq to see which campaign produced the maximum execs_per_sec value:

 $ jq -s 'select("hardware.cpu_model == Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz") | [.[]] | max_by(.targets."test-instr-persist-shmem".multicore.afl_execs_per_second)' < benchmark-results.jsonl
{
  "config": {
    "afl-persistent-config": true,
    "afl-system-config": true
  },
  "hardware": {
    "cpu_mhz": 4961.6,
    "cpu_model": "Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz",
    "cpu_threads": 16
  },
  "targets": {
    "test-instr-persist-shmem": {
      "multicore": {
        "afl_execs_per_second": 1340793.62,
        "afl_execs_total": 33778550,
        "fuzzers_used": 26,
        "start_time_of_run": "2023-09-04 23:25:13.621084",
        "total_execs_per_sec": 1087876.01,
        "total_run_time": 31.05
      }
    }
  }
}

Thanks again! I'm especially interested in feedback about whether it seems valid to test and compare numbers for a multicore campaign in this way.

@vanhauser-thc (Member)

I don't mind merging this.
However, it needs an update to the README, plus some already-collected data for comparison.

@vanhauser-thc (Member)

@cjb please see my previous comment :)

@cjb (Contributor, Author) commented Oct 2, 2023

@vanhauser-thc Ready for another look, thanks!

I don't think the GitHub diff view allows you to render a preview of the .ipynb notebook properly, so here's a link to it: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark.ipynb

@vanhauser-thc (Member)

I am not a fan of jsonlines unless the benchmark tool also shows how it compares to other CPU setups.

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.
This is not easily done with a JSON format, so either your Python tool also takes care of that or it needs a different format.

Also, whatever the format is (text, JSON, ...), you should add some results there. I will then add a few as well.

@cjb (Contributor, Author) commented Oct 2, 2023

Also, whatever the format is (text, JSON, ...), you should add some results there.

Ah, these are already part of this PR -- the diff view collapsed them because the changes are large, but this PR includes a full set of experiment results with different parameters for an Intel desktop CPU and an AWS 192-vCPU instance in benchmark-results.jsonl, plus a Python notebook discussing the results and performing analysis on them live.

Data in this PR: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark-results.jsonl
Jupyter notebook: https://github.com/cjb/AFLplusplus/blob/dev-benchmark-py/benchmark/benchmark.ipynb

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.

Makes sense, I think doing both can work: write the raw data to the JSON Lines file, and also write a one-line summary of it to the COMPARISON file after each run for easy textual viewing. I'll work on that.

The reason to have the JSON Lines version is to be able to answer more complex questions than "How fast is this machine?" -- the Jupyter notebook analysis gives answers to "How much faster is persistent mode with shared memory? How much faster is multicore, with and without persistent mode? How much faster does it get if you boot with mitigations=off? How many parallel fuzzers should I choose to run at once?" for the hardware I tested.
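For illustration, writing both outputs could look roughly like this (a sketch only -- the record_run helper and the COMPARISON column layout are placeholders, not the final code):

import json

def record_run(results, jsonl_path="benchmark-results.jsonl", comparison_path="COMPARISON"):
    # Raw data: one JSON object per line, easy to query with jq or pandas.
    with open(jsonl_path, "a") as f:
        f.write(json.dumps(results) + "\n")
    # Human-readable summary: one line per run, appended for easy textual viewing.
    hw = results["hardware"]
    run = results["targets"]["test-instr-persist-shmem"]["multicore"]
    with open(comparison_path, "a") as f:
        f.write(f'{hw["cpu_model"]} | {hw["cpu_mhz"]} | {run["fuzzers_used"]} | '
                f'{run["afl_execs_per_second"]} |\n')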

@vanhauser-thc (Member)

perfect.

What I mean is - look at the COMPARISON. a user can just look at the file and see how his setup compares.
Makes sense, I think doing both can work: write the raw data to the JSON Lines file, and also write a one-line summary of it to the COMPARISON file after each run for easy textual viewing. I'll work on that.

I will merge this once this is added :)

@cjb (Contributor, Author) commented Nov 11, 2023

@vanhauser-thc Ready for another look! Please could you re-run this on your own machines (which will add them to COMPARISON), now that it tracks multi-core perf too?

@vanhauser-thc (Member) commented Nov 12, 2023

@cjb
I played a bit with it some more and have a bit of feedback for the multicore parts:

  • by default it is not printed how many cores are used
  • the COMPARISON file does not document how many cores were used
  • the calculation seems wrong:
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 417991.21
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 433177.68
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 432834.86
 [*] Average AFL execs/sec for this test across all runs was: 428001.25
 [*] Average total execs/sec for this test across all runs was: 317679.27

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

and finally there is not a duplicate check before writing entries in the COMPARISON file.

@vanhauser-thc (Member)

on a different system I get a script error from python3.8:

$ python3 benchmark.py
Traceback (most recent call last):
  File "benchmark.py", line 50, in <module>
    class Results:
  File "benchmark.py", line 53, in Results
    targets: dict[str, dict[str, Optional[Run]]]
TypeError: 'type' object is not subscriptable
$ python3 -V
Python 3.8.10

@cjb (Contributor, Author) commented Nov 12, 2023

Thanks!

by default it is not printed how many cores are used

It should be on the first line of output:

$ python3 benchmark.py
 [*] Using 16 fuzzers for multicore fuzzing (use --fuzzers to override)

the COMPARISON file does not document how many cores were used

Fixed.

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

Ah, afl_execs_per_sec is directly from the fuzzer_stats file, and total_execs_per_sec is timed from process start to process end. Once you get above the number of threads on the machine, afl_execs_per_sec can rise while total_execs_per_sec drops. But I suppose this is just because afl takes longer and longer to start up when the CPU is busy? If we agree that total_execs_per_sec isn't necessary, I'm happy to remove it from the printed output and switch to using afl_execs_per_sec in COMPARISON. What do you think?
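To make the distinction concrete, here is roughly what the two numbers measure (a simplified sketch, not the exact code in benchmark.py):

import glob
import re

def summed_afl_execs_per_sec(output_dir):
    # afl_execs_per_sec: sum the execs_per_sec each fuzzer reports in fuzzer_stats.
    total = 0.0
    for stats in glob.glob(f"{output_dir}/*/fuzzer_stats"):
        with open(stats) as f:
            match = re.search(r"execs_per_sec\s*:\s*([0-9.]+)", f.read())
            if match:
                total += float(match.group(1))
    return total

# total_execs_per_sec instead divides the summed execs_done by the wall-clock time
# measured from just before the fuzzers start to just after they all exit, so it
# also pays for startup overhead when the CPU is oversubscribed.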

and finally there is not a duplicate check before writing entries in the COMPARISON file.

I suppose I'd prefer not to add this -- I'd consider this as a human-readable file, not a machine-readable one.

on a different system I get a script error from python3.8:

Fixed.
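(For reference, the two usual ways to keep such annotations working on Python 3.8 look like this; the exact change committed may differ.)

# Option 1: postpone evaluation of annotations, so dict[str, ...] is never
# evaluated on Python 3.8 (assuming nothing calls typing.get_type_hints at runtime).
from __future__ import annotations

# Option 2: use the typing aliases instead of built-in generics.
from typing import Dict, Optional
targets: Dict[str, Dict[str, Optional["Run"]]]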

@vanhauser-thc (Member)

The Average AFL execs/sec looks correct compared to the three run outputs; Average total execs/sec (what is the difference? that is unclear) is a totally different value, and it is the one documented in the COMPARISON file.

Ah, afl_execs_per_sec is directly from the fuzzer_stats file, and total_execs_per_sec is timed from process start to process end. Once you get above the number of threads on the machine, afl_execs_per_sec can rise while total_execs_per_sec drops. But I suppose this is just because afl takes longer and longer to start up when the CPU is busy? If we agree that total_execs_per_sec isn't necessary, I'm happy to remove it from the printed output and switch to using afl_execs_per_sec in COMPARISON. What do you think?

please use execs_per_sec from fuzzer_stats for all values and do not calculate your own :-) it can only be less correct :)

and finally there is not a duplicate check before writing entries in the COMPARISON file.

I suppose I'd prefer not to add this -- I'd consider this as a human-readable file, not a machine-readable one.

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

12th Gen Intel(R) Core(TM) i7- | 4800  | 16      | 155590     | 655391    | both         |
12th Gen Intel(R) Core(TM) i7- | 4701  | 16      | 152844     | 642560    | both         |
12th Gen Intel(R) Core(TM) i7- | 4273  | 16      | 154971     | 694437    | both         |

also please make the string longer, it is too short to document which specific processor it is :)

@cjb (Contributor, Author) commented Nov 12, 2023

@vanhauser-thc

please use execs_per_sec from fuzzer_stats for all values and do not calculate your own :-) it can only be less correct :)

Done. A reason I was distrustful of execs_per_sec from fuzzer_stats is that the total execs/sec from summing each fuzzer's value keeps increasing up to 24 concurrent fuzzers on my 8-core, 16-thread CPU -- whereas my self-calculated value stops increasing at 16 concurrent fuzzers, as I'd intuitively expect. Any idea why that could be?

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

This doesn't address the point about the file being intended as human-readable, yet now we're presupposing a parser for it. I could add a parser anyway. It would raise the question of what counts as a duplicate entry. What if I'm experimenting with system settings and trying to see their effect on the numbers? Do you just want to compare the CPU model on the last line of the COMPARISON file with the current CPU model, and ask for a --duplicate flag to be passed to write to COMPARISON if they match, or something more complicated?

also please make the string longer, it is too short to document which specific processor it is :)

Done -- I was trying to keep the whole file near a standard terminal width, but no big deal.

@vanhauser-thc (Member)

otherwise a line is added for every time the user executes it, duplicating existing entries. this is not helpful :)

This doesn't address the point about the file being intended as human-readable, yet now we're presupposing a parser for it. I could add a parser anyway. It would raise the question of what counts as a duplicate entry. What if I'm experimenting with system settings and trying to see their effect on the numbers? Do you just want to compare the CPU model on the last line of the COMPARISON file with the current CPU model, and ask for a --duplicate flag to be passed to write to COMPARISON if they match, or something more complicated?

just check if '^PROCESSORNAME' is present in the file and do not write if so. very simple. (with a warning to remove the line if they want to save it)
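For example, something along these lines would do (a rough sketch; the committed check may look different):

from pathlib import Path

def should_append_to_comparison(cpu_model, path="COMPARISON"):
    # Skip the write if a line for this processor already exists,
    # and tell the user how to record a new result anyway.
    comparison = Path(path)
    if comparison.exists():
        for line in comparison.read_text().splitlines():
            if line.startswith(cpu_model):
                print(f"[*] Skipping COMPARISON: {cpu_model} is already listed; "
                      "remove that line if you want to save this run.")
                return False
    return True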

@cjb (Contributor, Author) commented Nov 14, 2023

just check if '^PROCESSORNAME' is present in the file and do not write if so. very simple. (with a warning to remove the line if they want to save it)

Done.

@vanhauser-thc (Member)

Sorry I found one more issue:

$ python3 benchmark.py 
 [*] Using 64 fuzzers for multicore fuzzing (use --fuzzers to override).
 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] singlecore test-instr-persist-shmem run 1 of 3, execs/s: 75952.94
 [*] singlecore test-instr-persist-shmem run 2 of 3, execs/s: 79508.87
 [*] singlecore test-instr-persist-shmem run 3 of 3, execs/s: 87693.22
 [*] Average AFL execs/sec for this test across all runs was: 81051.68
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: Infinity
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: Infinity
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: Infinity
Traceback (most recent call last):
  File "benchmark.py", line 293, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "benchmark.py", line 276, in main
    avg_afl_execs_per_sec = round(Decimal(sum(afl_execs_per_sec) / len(afl_execs_per_sec)), 2)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

not sure what is going wrong, this is the exec data in the fuzzer stats:

$ ls /tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem/
0/  12/ 16/ 2/  23/ 27/ 30/ 34/ 38/ 41/ 45/ 49/ 52/ 56/ 6/  63/ 
1/  13/ 17/ 20/ 24/ 28/ 31/ 35/ 39/ 42/ 46/ 5/  53/ 57/ 60/ 7/  
10/ 14/ 18/ 21/ 25/ 29/ 32/ 36/ 4/  43/ 47/ 50/ 54/ 58/ 61/ 8/  
11/ 15/ 19/ 22/ 26/ 3/  33/ 37/ 40/ 44/ 48/ 51/ 55/ 59/ 62/ 9/  
$ grep exec /tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem/0/fuzzer_stats
execs_done        : 259835
execs_per_sec     : 72076.28
execs_ps_last_min : 0.00
execs_since_crash : 259835
exec_timeout      : 20
slowest_exec_ms   : 0

@vanhauser-thc (Member)

btw I don't get why you calculate from total_execs / runtime ... this value is already present in fuzzer_stats and called execs_per_sec :)

@cjb (Contributor, Author) commented Nov 15, 2023

I'm not sure what's going wrong either. Perhaps print(afl_execs_per_sec) on line 275 to help debug? It should be a list of all of the execs_per_sec values from the fuzzer_stats files for 64 fuzzers * 3 rounds, and the line that's crashing is just supposed to sum them all and then divide by the 3 runs to get the average total of execs_per_sec values per round. (So this part is not dividing by runtime, but by the number of samples we take to try to reduce noise.)
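For reference, the intended calculation is roughly this, using the per-run totals from the earlier log as example values (a simplified sketch, not the exact code):

from decimal import Decimal

# In simplified form, one summed total per run (execs_per_sec added up across all fuzzers):
afl_execs_per_sec = [Decimal("417991.21"), Decimal("433177.68"), Decimal("432834.86")]

# Average the per-run totals; wall-clock runtime is not involved here.
avg = round(Decimal(sum(afl_execs_per_sec) / len(afl_execs_per_sec)), 2)  # 428001.25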

btw I don't get why you calculate from total_execs / runtime ... this value is already present in fuzzer_stats and called execs_per_sec :)

That isn't at play in this crash, since the crashing section is not doing that, but I tried to explain why I did this above:

Done. A reason I was distrustful of execs_per_sec from fuzzer_stats is that the total execs/sec from summing each fuzzer's value keeps increasing up to 24 concurrent fuzzers on my 8-core, 16-thread CPU -- whereas my self-calculated value stops increasing at 16 concurrent fuzzers, as I'd intuitively expect. Any idea why that could be?

Here is another explanation of the same problem. If I run with --fuzzers 64 on my 16-thread system, the fuzzing run takes 47 seconds instead of 6 seconds, but the sum of all of the execs_per_sec values in fuzzer_stats files is approximately the same in both cases. Taking runtime into account avoids giving people misleading numbers this way, and allows them to reason about how many parallel fuzzers it's optimal for them to use. Any idea why execs_per_sec is not actually counting execs per second?

But again, the code path we're talking about, and the number printed to COMPARISON, does not divide by total runtime. It should just be summing execs_per_sec to get a total score, and the only division is to average that score across the multiple fuzzing runs.

@cjb (Contributor, Author) commented Nov 15, 2023

[*] multicore test-instr-persist-shmem run 1 of 3, execs/s: Infinity

This is only supposed to print the sum of all of the execs_per_sec values in /tmp/aflpp-benchmark/*/fuzzer_stats, and I can't think of any way for that to be Infinity as printed other than addition overflow, but your sample value of execs_per_sec was only 72076 (* 64 fuzzers), so I'm still trying to guess at where the Infinity came from.

@vanhauser-thc (Member)

OK this is solved :)

/tmp/aflpp-benchmark/out-multicore-test-instr-persist-shmem $ egrep 'time|exec' 63/fuzzer_stats
start_time        : 1700128119
run_time          : 0
time_wo_finds     : 0
execs_done        : 15
execs_per_sec     : inf
execs_ps_last_min : 0.00
execs_since_crash : 15
exec_timeout      : 20
slowest_exec_ms   : 0

it comes from fuzzer_stats. I will do a fix.
But this also shows that the setup is not good: the fuzzing should run for at least 5 seconds even on very fast computers. Instead of AFL_BENCH_JUST_ONE you could use -V5.
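On the script side, a defensive parse could also simply ignore non-finite values (a sketch, separate from the fuzzer_stats fix):

import math
from typing import Optional

def parse_execs_per_sec(raw: str) -> Optional[float]:
    # fuzzer_stats can report "inf" while run_time is still 0, so treat
    # non-finite values as missing instead of summing them.
    value = float(raw)
    return value if math.isfinite(value) else None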

@vanhauser-thc (Member)

fixed the inf bug in dev
but I think there is something wrong with the script.

 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: [['63110.76'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00'], ['15000.00']]

Only the first instance has real results; all the others are not really starting up correctly. Maybe pipe stdout + stderr of these somewhere in Python and see what is going wrong (I am not a Python guy ...)
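A minimal way to do that with the asyncio-based runner might be (a sketch; benchmark.py's actual process handling may differ):

import asyncio

async def run_fuzzer(cmd, log_path):
    # Send each afl-fuzz instance's stdout/stderr to a log file so that
    # startup failures are visible afterwards.
    with open(log_path, "w") as log:
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=log, stderr=asyncio.subprocess.STDOUT
        )
        return await proc.wait()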

@vanhauser-thc (Member)

fixed the benchmark.py
do you think it makes sense to write to the json if the same CPU model is already in there?

@cjb (Contributor, Author) commented Nov 16, 2023

do you think it makes sense to write to the json if the same CPU model is already in there?

I think so -- the Jupyter notebook contains examples of measuring the perf difference due to afl-system-config and afl-persistent-config, and how perf scales with the number of cores used, and none of those experiments would be possible if we refused to write more than one line per CPU to the JSON output.

@cjb (Contributor, Author) commented Nov 16, 2023

Did you mean to pass -V10 instead of V10? Your change breaks the script for me.

We should also lower the number of runs now that each run in each mode takes 10 seconds -- the runtime of the script is now 60s. I guess two runs (40 seconds total) should be okay.

(I'll also need to redo the analysis in the Jupyter script.)

@cjb (Contributor, Author) commented Nov 19, 2023

@vanhauser-thc I think this should be ready to merge now, I removed the self-calculation of execs_per_sec and re-ran on my machines. (Want to add your own machines to COMPARISON?)

@vanhauser-thc (Member)

Thanks! Will add my machines tomorrow

@vanhauser-thc merged commit 444ddb2 into AFLplusplus:dev on Nov 19, 2023.