Compared to CPython, PyPy relies more on achieving speed-ups by “jitting” code as often as possible than on its interpreter. However, jitting is not always an option, or at least not entirely. A good improvement for CPython, which we think might benefit PyPy as well without impacting JIT performance, is Profile-Guided Optimization (PGO, or profopt).
I thank the PyPy developer community for their patience, kind advice, and the constant feedback they gave me on the #pypy IRC channel and through email, which helped me make this possible; special thanks to Carl Friedrich Bolz-Tereick and Armin Rigo.
1. Introduction
Profile-guided optimizations differ in implementation from compiler to compiler, but they all follow the same three basic steps:
- First, the source code is instrumented during compilation. Not all associated libraries need to be instrumented, but for best performance it is advisable to do so.
- Secondly, the binary resulting from the first step has a set of associated “profiles” that are updated every time you run it (this is the behavior of gcc, at least). This step is also known as the training phase. It is crucial here to run workloads that are the most relevant for the code paths you usually expect, or want, to be taken. It is also important to remember that altering the code past this point would make the profiles inconsistent with the sources, and therefore hurt performance.
- The third phase is cleaning everything but the profiles and recompiling everything based on that training. This will unvectorize small loops, ease inlining, improve branch prediction, improve the layout of hot spots, and so on. A minimal sketch of the whole workflow follows this list.
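To make these steps concrete, here is a minimal sketch of the workflow with gcc on a toy program; demo.c and the workload input are illustrative placeholders, not files from the PyPy tree:

```
# Step 1: compile and link with profile instrumentation.
gcc -O2 -fprofile-generate demo.c -o demo

# Step 2 (training): each run updates the .gcda profile files.
./demo typical_workload_input

# Step 3: keep the profiles and rebuild using them.
gcc -O2 -fprofile-use demo.c -o demo
```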
So, how does all this benefit Python or PyPy? Since Python is an interpreted language whose reference implementation, CPython, is written in C, training its binary on common scenarios benefits it greatly. It is worth mentioning, though, that the training should not try to cover everything CPython can be used for. Simply put, if everything is a hot spot, then nothing is; over-broad training would only make performance worse.
How about PyPy?
PyPy is different, because it already has the benefit of a JIT, so it does not rely on its interpreter as much as CPython does. The underlying issue here is that the assembly code generated by the JIT has no way of benefiting from PGO, because it was never instrumented. However, the interpreted code itself is roughly 3 times slower than CPython, mostly due to the profiling work the interpreter has to do to decide when to start the JIT.
The target here was therefore to improve PyPy's interpreter by compiling it with PGO, while avoiding any delay to the JIT, its main source of performance.
Ideally, we hoped for a speed-up similar to CPython's, of about 10% on average, but we were aware that imperfect training would inevitably introduce some slowdowns.
2. PGO for PyPy
Since PyPy is written in RPython, enabling profile-guided optimizations for it is a bit trickier than for CPython. This is mainly because PyPy needs to be translated and have its C sources generated before they can be compiled.
However, to enable PGO we only needed the Makefile generated after the translation, so we simply saved the generated sources and worked on them directly, in order to save time.
Therefore, we altered the Makefile to create a new target for our profile-optimized binary: we added the -fprofile-generate flag to the usual GCC command for the first phase, at both the compile and link steps, then trained the resulting binary. Afterwards, we used the profiles to rebuild the project once again with -fprofile-use -fprofile-correction, again for both compilation and linking.
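As a rough sketch of what that new target boils down to (the flag placement, paths, and clean-up step are illustrative; the actual generated Makefile is more involved):

```
# First pass: build an instrumented pypy-c.
make CFLAGS="-O2 -fprofile-generate" LDFLAGS="-fprofile-generate"

# Training run: writes the .gcda profiles next to the object files.
./pypy-c /path/to/training/script.py

# Second pass: drop the old objects, keep the profiles, rebuild with them.
rm -f *.o
make CFLAGS="-O2 -fprofile-use -fprofile-correction" LDFLAGS="-fprofile-use"
```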
As it is quite hard, and also subjective, to determine a good training set for interpreters in general, we decided to start with the training tests that CPython uses, both to see whether they offer any performance improvement and to establish a baseline for comparing subsequent training sets.
While there is more to explain and debate about how profopt would best work for PyPy, such as tips and tricks, different workloads (we also tried the actual benchmark as a training set 🙂 ), or what its proper use cases are, we will focus here on the actual performance gains and how we measured them.
An advantage of PyPy's implementation is that it is not limited to the Python interpreter: it can be used with any existing or future interpreter based on RPython!
Stay tuned for a more general implementation of PGO for both GCC and CLANG.
3. Usage
So here are the steps you should take to enable PGO for PyPy. As a side note, I should mention that the tests were performed on Ubuntu 16.04, with gcc 6.2.0. It is also important to mention that if you are on Darwin/Mac (I am working to enable this for CLANG on Mac) or Windows, you will most likely not be able to use PGO.
Clone the PyPy repo:
```
hg clone http://bitbucket.org/pypy/pypy pypy
```
Install dependencies:
```
apt-get install gcc make libffi-dev pkg-config libz-dev libbz2-dev \
    libsqlite3-dev libncurses-dev libexpat1-dev libssl-dev libgdbm-dev \
    tk-dev libgc-dev python-cffi \
    liblzma-dev  # for lzma on PyPy3
```
Go to the clone and run:
```
cd pypy/goal
# If you want to enable profopt for PyPy without too much hassle:
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc --profopt
```
Or, if you want to specify the training script for PGO yourself, you can use the --profoptargs argument, which takes the absolute path to your script together with the arguments it requires. For example, here is the exact same script as in the previous case, but run on more cores:
```
cd pypy/goal
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc \
    --profopt --profoptargs="/home/md/pypy/lib-python/2.7/test/regrtest.py --pgo -j 18 -x test_asyncore test_gdb test_multiprocessing test_subprocess || true"
# By default the training script runs on 1 process,
# while here it will run on 18 processes.
# I would advise using the number of cores you have.
# Side note: the "|| true" at the end ensures that the training finishes
# successfully, as some of the regrtests fail on PyPy (12 out of 400).
```
Now, the translation process with PGO takes about 1 hour and 15 minutes, so grab a cup of coffee, enjoy the Mandelbrot, etc. When it ends, you should have your pypy-c binary and the associated libpypy-c.so in the goal directory, trained and ready for use.
On the other hand, you can apply the same concept to any binary that results from an RPython translation. In that case, however, there is obviously no default training, so both the script and its arguments are required. Consider this example:
```
cd rpython/translator/goal
../../bin/rpython --profopt --profoptargs=1000 targetrpystonedalone.py
```
There is no actual training script for the resulting binary here, as rpystone is not an interpreter, but a binary that takes an integer value as a parameter. To get a clear picture of this, any training run conceptually looks as follows:
```
./your_binary arguments_from_profoptargs
```
Therefore, in the case of PyPy, since it is a Python interpreter:
```
./pypy-c /path/to/training/script.py arguments_of_the_script
```
4. Measurements
Measuring performance gains is not easy, for several reasons (all of which we have actually encountered):
- There is no standard benchmark, or the one you pick is inherently unfair to your setup.
- While there is a standard test suite for Python, called pyperformance, we found that results can differ significantly even from one run to another. Obviously, multiple runs of the same workload with the same binary should not show relevant differences. It is also quite unfair to PyPy, because there is no warmup phase, which takes only a small amount of time but makes a world of difference in the results. This brings us to the next point.
- Not all the tests in a benchmark are as reliable as you would like.
- After pyperformance, we decided to try a benchmark implemented in the PyPy project (https://bitbucket.org/pypy/benchmarks/src) that is similar to pyperformance, except it actually does have a warmup phase, so the JIT can fully kick in before the measurement starts. Even so, not all the tests show low run-to-run variation. For example, the translation test in this benchmark varies wildly because of its unusually long runtime and the fact that it is not repeated. This is a problem because in such cases you may be unsure what your actual speedup is, or whether you have one at all. The best way to make sure you actually have a speedup, if you encounter such a scenario, is to repeat the tests and compute the standard deviation and the coefficient of variation (see the sketch after this list).
- Most of your tests show a good speedup, but one of them is a lot slower.
- Such a scenario happened to us after we processed the results. We realized that most tests showed an average improvement of ~6% (which is quite good for an interpreter binary), but one of them (nbody simulations) had an almost 50% slowdown. At this point it helps to have a strategy for such cases. In our case, we talked with the PyPy devs and they were satisfied with the results. However, their proposal was to leave profopt as an option to be enabled rather than the default setting. This made sense to both us and them, because whoever uses the interpreter should be fully aware of the possible shortcomings of enabling profile-guided optimizations: it works better in most cases, by an average of 6%, but if you do nbody simulations, you will be disappointed.
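As an illustrative sketch of that statistics check (times.txt is a hypothetical file holding one timing per line, collected from repeated runs of the same test):

```
# Compute mean, standard deviation, and coefficient of variation (CV)
# over the timings in times.txt.
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; sd = sqrt(ss / n - m * m);
           printf "mean=%g  stddev=%g  cv=%.2f%%\n", m, sd, 100 * sd / m }' times.txt
```

If the coefficient of variation is of the same order as the speedup you think you are seeing, the difference is most likely noise rather than a real gain.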
An excerpt of our results is in the table below. You can also see them in more detail here (https://docs.google.com/spreadsheets/d/1aEUkgUcEXGSieBnn82_vVzORk9fRfdW2UKJlE2jFZCk/edit?usp=sharing).
| Benchmark | Speedup vs PyPy 5.8.0 (%) | Speedup vs PyPy 5.7.1 (%) |
|---|---|---|
| ai | -6.59514475 | -1.653259733 |
| bm_chameleon | 7.335724952 | 4.253695178 |
| bm_dulwich_log | 11.17004467 | 11.52603874 |
| bm_krakatau | 2.223751046 | 9.830767116 |
| bm_mako | 10.16285085 | 8.441075993 |
| bm_mdp | 0.9954771658 | 3.8279006 |
| chaos | 12.96699933 | 9.662335477 |
| sphinx | 10.96162741 | 13.53919982 |
| crypto_pyaes | 6.87305902 | 5.705374238 |
| deltablue | 11.04547752 | 9.751786827 |
| django | -0.8029511953 | -1.195662938 |
| eparse | 14.26472951 | 7.140076383 |
| fannkuch | 2.047509396 | 9.255620128 |
| float | -5.041630301 | -5.355890836 |
| genshi_text | 10.49601345 | 7.625449172 |
| genshi_xml | 10.71362264 | 8.568584358 |
| go | 8.593695776 | 11.19313783 |
| hexiom2 | -2.727359392 | 0.2730693704 |
| html5lib | 12.48909146 | 14.11072664 |
| json_bench | -2.706452658 | 2.043617202 |
| meteor-contest | 0.4124628795 | 2.599997918 |
| nbody_modified | -46.20906471 | -44.00816219 |
| nqueens | 5.313147334 | 4.335091798 |
| pidigits | -1.218052708 | 1.625006582 |
| pyflate-fast | 2.184923491 | 11.29683696 |
| pypy_interp | 8.946516802 | 6.795118756 |
| pyxl_bench | 14.72355069 | 12.38676536 |
| raytrace-simple | 3.560970911 | 2.278664684 |
| richards | 21.44062274 | 11.0766451 |
| rietveld | 16.48161 | 15.74474724 |
| scimark_fft | 0.08357208261 | 1.170341048 |
| scimark_lu | -0.001925930003 | 0.384165752 |
| scimark_montecarlo | -0.1569076222 | 7.118931795 |
| scimark_sor | 0.02145168009 | 3.88094651 |
| scimark_sparsematmult | 0.1862575938 | 0.6585966432 |
| slowspitfire | 16.02340539 | 17.25387427 |
| spambayes | 13.75215023 | 12.15884043 |
| spectral-norm | 8.845451251 | 6.946412753 |
| spitfire | 14.04775125 | 13.42281879 |
| spitfire_cstringio | -2.581369248 | -14.53634085 |
| sqlalchemy_declarative | 13.80427176 | 10.85188974 |
| sqlalchemy_imperative | 14.22442979 | 10.76778705 |
| sqlitesynth | -6.859216967 | -5.760158131 |
| sympy_expand | 14.5654882 | 10.96270114 |
| sympy_integrate | 11.66925055 | 9.257962543 |
| sympy_str | 16.43562223 | 13.26291402 |
| sympy_sum | 13.26914098 | 10.97091752 |
| telco | 9.838131638 | 7.788187003 |
| trans2_annotate | 6.882628246 | 10.03103295 |
| trans2_rtype | 1.341178858 | 1.965765439 |
| trans2_backendopt | 5.224810661 | 11.69271996 |
| trans2_database | 9.079338142 | 11.77439275 |
| trans2_source | 3.594638505 | 12.76569678 |
| twisted_iteration | -4.812491194 | -0.9329247761 |
| twisted_names | 4.584209853 | 10.88909424 |
| twisted_pb | 0.1975778823 | 5.544581016 |
| twisted_tcp | 1.073193371 | -0.3507019983 |
| Average Speedup | 5.264198755 | 5.468945967 |
By Mihai Dodan. E-mail: mihai [dot] dodan [at] rinftech.com