Compared to CPython, PyPy relies more on achieving speed-ups by “jitting” code as often as possible than on its interpreter. However, jitting is not always an option, or at least not entirely. A good improvement for CPython, which we think might benefit PyPy as well without impacting JIT performance, is Profile-Guided Optimization (PGO, or profopt).
I thank the PyPy developer community for their patience, kind advice, and the constant feedback they gave me on the #pypy IRC channel and through email, which helped me make this possible; special thanks to Carl Friedrich Bolz-Tereick and Armin Rigo.
1. Introduction
Profile-guided optimizations differ in implementation from compiler to compiler, but they all follow the same three basic steps:
- First, the source code is instrumented during compilation. Not all associated libraries need to be instrumented, but for best performance it is advisable to do so.
- Secondly, the binary resulting from the first step has a set of associated “profiles” that are updated every time you run it (this is the behavior of gcc, at least). This step is also known as the training phase. It is crucial here to run workloads that are the most relevant for the code paths you usually expect, or want, to be taken. It is also important to remember that altering the code past this point would make the profiles inconsistent with the sources, and therefore hurt performance.
- The third phase is cleaning everything but the profiles and recompiling everything based on that training. This will unvectorize small loops, ease inlining, improve branch prediction, improve the layout of hot spots, and so on. A minimal sketch of the whole workflow follows this list.
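To make these steps concrete, here is a minimal sketch of the workflow with gcc on a toy program; demo.c and the workload input are illustrative placeholders, not files from the PyPy tree:

```
# Step 1: compile and link with profile instrumentation.
gcc -O2 -fprofile-generate demo.c -o demo

# Step 2 (training): each run updates the .gcda profile files.
./demo typical_workload_input

# Step 3: keep the profiles and rebuild using them.
gcc -O2 -fprofile-use demo.c -o demo
```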
So, how does all this benefit Python or PyPy? Since Python is an interpreted language whose reference implementation, CPython, is written in C, training its binary on common scenarios benefits it greatly. It is worth mentioning, though, that the training should not try to cover everything CPython can be used for. Simply put, if everything is a hot spot, then nothing is; over-broad training would only make performance worse.
How about PyPy?
PyPy is different, because it already has the benefit of a JIT, so it does not rely on its interpreter as much as CPython does. The underlying issue here is that the assembly code generated by the JIT has no way of benefiting from PGO, because it was never instrumented. However, the interpreted code itself is roughly 3 times slower than CPython, mostly due to the profiling work the interpreter has to do to decide when to start the JIT.
The target here was therefore to improve PyPy's interpreter by compiling it with PGO, while avoiding any delay to the JIT, its main source of performance.
Ideally, we hoped for a speed-up similar to CPython's, of about 10% on average, but we were aware that imperfect training would inevitably introduce some slowdowns.
2. PGO for PyPy
Since PyPy is written in RPython, enabling profile-guided optimizations for it is a bit trickier than for CPython. This is mainly because PyPy needs to be translated and have its C sources generated before they can be compiled.
However, to enable PGO we only needed the Makefile generated after the translation, so we simply saved the generated sources and worked on them directly, in order to save time.
Therefore, we altered the Makefile to create a new target for our profile-optimized binary: we added the -fprofile-generate flag to the usual GCC command for the first phase, at both the compile and link steps, then trained the resulting binary. Afterwards, we used the profiles to rebuild the project once again with -fprofile-use -fprofile-correction, again for both compilation and linking.
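As a rough sketch of what that new target boils down to (the flag placement, paths, and clean-up step are illustrative; the actual generated Makefile is more involved):

```
# First pass: build an instrumented pypy-c.
make CFLAGS="-O2 -fprofile-generate" LDFLAGS="-fprofile-generate"

# Training run: writes the .gcda profiles next to the object files.
./pypy-c /path/to/training/script.py

# Second pass: drop the old objects, keep the profiles, rebuild with them.
rm -f *.o
make CFLAGS="-O2 -fprofile-use -fprofile-correction" LDFLAGS="-fprofile-use"
```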
As it is quite hard, and also subjective, to determine a good training set for interpreters in general, we decided to start with the training tests that CPython uses, both to see whether they offer any performance improvement and to establish a baseline for comparing subsequent training sets.
While there is more to explain and debate about how profopt would best work for PyPy, such as tips and tricks, different workloads (we also tried the actual benchmark as a training set 🙂 ), or what its proper use cases are, we will focus here on the actual performance gains and how we measured them.
An advantage of PyPy's implementation is that it is not limited to the Python interpreter: it can be used with any existing or future interpreter based on RPython!
Stay tuned for a more general implementation of PGO for both GCC and CLANG.
3. Usage
So here are the steps you should take to enable PGO for PyPy. As a side note, I should mention that the tests were performed on Ubuntu 16.04, with gcc 6.2.0. It is also important to mention that if you are on Darwin/Mac (I am working to enable this for CLANG on Mac) or Windows, you will most likely not be able to use PGO.
Clone the PyPy repo:
```
hg clone http://bitbucket.org/pypy/pypy pypy
```
Install dependencies:
```
apt-get install gcc make libffi-dev pkg-config libz-dev libbz2-dev \
    libsqlite3-dev libncurses-dev libexpat1-dev libssl-dev libgdbm-dev \
    tk-dev libgc-dev python-cffi \
    liblzma-dev  # for lzma on PyPy3
```
Go to the clone and run:
```
cd pypy/goal
# If you want to enable profopt for PyPy without too much hassle:
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc --profopt
```
Or, if you want to specify the training script for PGO yourself, you can use the --profoptargs argument, which takes the absolute path to your script together with the arguments it requires. For example, here is the exact same script as in the previous case, but run on more cores:
```
cd pypy/goal
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc \
    --profopt --profoptargs="/home/md/pypy/lib-python/2.7/test/regrtest.py --pgo -j 18 -x test_asyncore test_gdb test_multiprocessing test_subprocess || true"
# By default the training script runs on 1 process,
# while here it will run on 18 processes.
# I would advise using the number of cores you have.
# Side note: the "|| true" at the end ensures that the training finishes
# successfully, as some of the regrtests fail on PyPy (12 out of 400).
```
Now, the translation process with PGO takes about 1 hour and 15 minutes, so grab a cup of coffee, enjoy the Mandelbrot, etc. When it ends, you should have your pypy-c binary and the associated libpypy-c.so in the goal directory, trained and ready for use.
On the other hand, you can apply the same concept to any binary that results from an RPython translation. In that case, however, there is obviously no default training, so both the script and its arguments are required. Consider this example:
```
cd rpython/translator/goal
../../bin/rpython --profopt --profoptargs=1000 targetrpystonedalone.py
```
There is no actual training script for the resulting binary here, as rpystone is not an interpreter, but a binary that takes an integer value as a parameter. To get a clear picture of this, any training run conceptually looks as follows:
```
./your_binary arguments_from_profoptargs
```
Therefore, in the case of PyPy, since it is a Python interpreter:
```
./pypy-c /path/to/training/script.py arguments_of_the_script
```
4. Measurements
Measuring performance gains is not easy, for several reasons (all of which we have actually encountered):
- There is no standard benchmark, or the one you pick is inherently unfair to your setup.
- While there is a standard test suite for Python, called pyperformance, we found that results can differ significantly even from one run to another. Obviously, multiple runs of the same workload with the same binary should not show relevant differences. It is also quite unfair to PyPy, because there is no warmup phase, which takes only a small amount of time but makes a world of difference in the results. This brings us to the next point.
- Not all the tests in a benchmark are as reliable as you would like.
- After pyperformance, we decided to try a benchmark implemented in the PyPy project (https://bitbucket.org/pypy/benchmarks/src) that is similar to pyperformance, except it actually does have a warmup phase, so the JIT can fully kick in before the measurement starts. Even so, not all the tests show low run-to-run variation. For example, the translation test in this benchmark varies wildly because of its unusually long runtime and the fact that it is not repeated. This is a problem because in such cases you may be unsure what your actual speedup is, or whether you have one at all. The best way to make sure you actually have a speedup, if you encounter such a scenario, is to repeat the tests and compute the standard deviation and the coefficient of variation (see the sketch after this list).
- Most of your tests show a good speedup, but one of them is a lot slower.
- Such a scenario happened to us after we processed the results. We realized that most tests showed an average improvement of ~6% (which is quite good for an interpreter binary), but one of them (nbody simulations) had an almost 50% slowdown. At this point it helps to have a strategy for such cases. In our case, we talked with the PyPy devs and they were satisfied with the results. However, their proposal was to leave profopt as an option to be enabled rather than the default setting. This made sense to both us and them, because whoever uses the interpreter should be fully aware of the possible shortcomings of enabling profile-guided optimizations: it works better in most cases, by an average of 6%, but if you do nbody simulations, you will be disappointed.
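As an illustrative sketch of that statistics check (times.txt is a hypothetical file holding one timing per line, collected from repeated runs of the same test):

```
# Compute mean, standard deviation, and coefficient of variation (CV)
# over the timings in times.txt.
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; sd = sqrt(ss / n - m * m);
           printf "mean=%g  stddev=%g  cv=%.2f%%\n", m, sd, 100 * sd / m }' times.txt
```

If the coefficient of variation is of the same order as the speedup you think you are seeing, the difference is most likely noise rather than a real gain.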
An excerpt of our results is in the table below. You can also see them in more detail here (https://docs.google.com/spreadsheets/d/1aEUkgUcEXGSieBnn82_vVzORk9fRfdW2UKJlE2jFZCk/edit?usp=sharing).
| Benchmark | Speedup vs PyPy 5.8.0 (%) | Speedup vs PyPy 5.7.1 (%) |
|---|---|---|
| ai | -6.59514475 | -1.653259733 |
| bm_chameleon | 7.335724952 | 4.253695178 |
| bm_dulwich_log | 11.17004467 | 11.52603874 |
| bm_krakatau | 2.223751046 | 9.830767116 |
| bm_mako | 10.16285085 | 8.441075993 |
| bm_mdp | 0.9954771658 | 3.8279006 |
| chaos | 12.96699933 | 9.662335477 |
| sphinx | 10.96162741 | 13.53919982 |
| crypto_pyaes | 6.87305902 | 5.705374238 |
| deltablue | 11.04547752 | 9.751786827 |
| django | -0.8029511953 | -1.195662938 |
| eparse | 14.26472951 | 7.140076383 |
| fannkuch | 2.047509396 | 9.255620128 |
| float | -5.041630301 | -5.355890836 |
| genshi_text | 10.49601345 | 7.625449172 |
| genshi_xml | 10.71362264 | 8.568584358 |
| go | 8.593695776 | 11.19313783 |
| hexiom2 | -2.727359392 | 0.2730693704 |
| html5lib | 12.48909146 | 14.11072664 |
| json_bench | -2.706452658 | 2.043617202 |
| meteor-contest | 0.4124628795 | 2.599997918 |
| nbody_modified | -46.20906471 | -44.00816219 |
| nqueens | 5.313147334 | 4.335091798 |
| pidigits | -1.218052708 | 1.625006582 |
| pyflate-fast | 2.184923491 | 11.29683696 |
| pypy_interp | 8.946516802 | 6.795118756 |
| pyxl_bench | 14.72355069 | 12.38676536 |
| raytrace-simple | 3.560970911 | 2.278664684 |
| richards | 21.44062274 | 11.0766451 |
| rietveld | 16.48161 | 15.74474724 |
| scimark_fft | 0.08357208261 | 1.170341048 |
| scimark_lu | -0.001925930003 | 0.384165752 |
| scimark_montecarlo | -0.1569076222 | 7.118931795 |
| scimark_sor | 0.02145168009 | 3.88094651 |
| scimark_sparsematmult | 0.1862575938 | 0.6585966432 |
| slowspitfire | 16.02340539 | 17.25387427 |
| spambayes | 13.75215023 | 12.15884043 |
| spectral-norm | 8.845451251 | 6.946412753 |
| spitfire | 14.04775125 | 13.42281879 |
| spitfire_cstringio | -2.581369248 | -14.53634085 |
| sqlalchemy_declarative | 13.80427176 | 10.85188974 |
| sqlalchemy_imperative | 14.22442979 | 10.76778705 |
| sqlitesynth | -6.859216967 | -5.760158131 |
| sympy_expand | 14.5654882 | 10.96270114 |
| sympy_integrate | 11.66925055 | 9.257962543 |
| sympy_str | 16.43562223 | 13.26291402 |
| sympy_sum | 13.26914098 | 10.97091752 |
| telco | 9.838131638 | 7.788187003 |
| trans2_annotate | 6.882628246 | 10.03103295 |
| trans2_rtype | 1.341178858 | 1.965765439 |
| trans2_backendopt | 5.224810661 | 11.69271996 |
| trans2_database | 9.079338142 | 11.77439275 |
| trans2_source | 3.594638505 | 12.76569678 |
| twisted_iteration | -4.812491194 | -0.9329247761 |
| twisted_names | 4.584209853 | 10.88909424 |
| twisted_pb | 0.1975778823 | 5.544581016 |
| twisted_tcp | 1.073193371 | -0.3507019983 |
| Average Speedup | 5.264198755 | 5.468945967 |
By Mihai Dodan. E-mail: mihai [dot] dodan [at] rinftech.com