In this post I will talk about some tools and ways in which you can profile the Python interpreter itself while it runs a Python script.
Just like in our previous post, the meaning of CPU profiling stays the same, but this time we are not targeting the Python script. Instead, we want to know how the Python interpreter works and where it spends most of its time while running our script.
Next, we will see how to trace the CPU usage and find the hotspots inside the interpreter.
Series index
The links below will go live once the posts are released:
Measure CPU usage
For this post I will mostly use the same scripts as the ones used in the memory analysis and script CPU usage posts; you can see the first one below or here.
```python
import time

def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = []
    for i in range(3, n + 1):
        if i % 2 != 0:
            s.append(i)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    l = [2]
    for x in s:
        if x:
            l.append(x)
    return l

def benchmark():
    start = time.time()
    for _ in xrange(40):
        count = len(primes(1000000))
    end = time.time()
    print "Benchmark duration: %r seconds" % (end - start)

benchmark()
```
The optimized version is below or here:
```python
import time

def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = range(3, n + 1, 2)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]

def benchmark():
    start = time.time()
    for _ in xrange(40):
        count = len(primes(1000000))
    end = time.time()
    print "Benchmark duration: %r seconds" % (end - start)

benchmark()
```
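As a sanity check, both versions should produce identical results. A quick Python 3 port of both (my own port for illustration, not part of the original benchmark: integer division becomes `//` and `xrange` becomes `range`) makes this easy to verify:

```python
def primes_v1(n):
    # Python 3 port of the original version: explicit loop with append.
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = []
    for i in range(3, n + 1):
        if i % 2 != 0:
            s.append(i)
    mroot = n ** 0.5
    half = (n + 1) // 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) // 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    l = [2]
    for x in s:
        if x:
            l.append(x)
    return l

def primes_v2(n):
    # Python 3 port of the optimized version: list built up front,
    # surviving entries collected with a list comprehension.
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = list(range(3, n + 1, 2))
    mroot = n ** 0.5
    half = (n + 1) // 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) // 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]

assert primes_v1(1000) == primes_v2(1000)
print(len(primes_v1(1000)))  # number of primes up to 1000: 168
```

The optimization only changes *how* the result list is built, not *what* is computed, which is exactly why the difference shows up at the interpreter level rather than in the algorithm.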
CPython
CPython is versatile; it is written entirely in C and is therefore easier to measure and profile. You can find the sources for CPython hosted on GitHub here. By default you will see the latest branch (3.7+ at the time of writing), but all other branches down to 2.7 are also available.
In this post we will focus on CPython 2, but the same steps can be applied successfully to the latest 3.x versions.
Code coverage tool
One of the easiest ways to see which portions of the C code are executed is to use a code coverage tool.
First, we clone the repository:

```
git clone https://github.com/python/cpython/
cd cpython
git checkout 2.7
./configure
```
Copy the scripts into the directory and run the following:

```
make coverage
./python 04.primes-v1.py
make coverage-lcov
```
The first command compiles the interpreter with GCOV support, the second runs the workload and collects the profiling data in .gcda files, and the third parses the profiling data and creates a set of HTML files in a folder called lcov-report.
If we open index.html in a browser, we can see the places in the interpreter source code that were executed while running our Python script. You will see something like below:
At the top level we see each directory composing the source code and the amount of code covered. Let's open, for example, the listobject.c.gcov.html file from the Objects directory. While we won't go through it entirely, we will analyze a portion of it, shown below.

How to read it? The yellow column shows the line number in the C source file. The next column shows the number of times that particular line was executed. The rightmost column shows the actual C source code.
In this example, the function listiter_next is called 60 million times.
How did we get to this function? Looking closely at our Python script, we can see that it performs a lot of list iterations and appends. (This is another observation that could have led to optimizing the script in the first place 🙂 )
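To get a feel for this cost from pure Python, before reaching for interpreter-level tools, we can time the append-in-a-loop pattern against a list comprehension. This is only an illustrative micro-benchmark; the exact numbers will vary by machine and Python version:

```python
import timeit

# The pattern from the original script: iterate and append one element
# at a time. Each iteration exercises the list iterator (listiter_next
# in CPython) plus an attribute lookup and a method call for append().
loop_stmt = """
l = [2]
for x in s:
    if x:
        l.append(x)
"""

# The pattern from the optimized script: a list comprehension, which
# runs the same iteration inside a specialized bytecode loop.
comp_stmt = "l = [2] + [x for x in s if x]"

setup = "s = list(range(100000))"
t_loop = timeit.timeit(loop_stmt, setup=setup, number=50)
t_comp = timeit.timeit(comp_stmt, setup=setup, number=50)
print("loop: %.4fs, comprehension: %.4fs" % (t_loop, t_comp))
```

On most machines the comprehension comes out noticeably faster, for the reasons the coverage report hints at.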
Let’s continue with some other dedicated tools.
perf
For more information, we can use the perf tool that is available on Linux systems. The official documentation can be read here.
We (re)build the CPython interpreter as follows. If you haven't already downloaded the Python scripts into the same directory, do so now. Also, make sure that perf is installed on your system.
```
make clean
./configure --with-pydebug
make
```
Run perf as shown below. More ways to use perf can be found on this excellent page by Brendan Gregg.
```
sudo perf record ./python 04.primes-v1.py
```
After running the script, you will see something like:
```
Benchmark duration: 32.03910684585571 seconds
[21868 refs]
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 4.914 MB perf.data (125364 samples) ]
```
To see the results, run:

```
sudo perf report
```

Only the most interesting calls are shown. In the screen above we see that the largest amount of time is spent in PyEval_EvalFrameEx. This is the main interpreter loop, and we are not interested in it for this example. Instead, we are interested in the next most time-consuming function, listiter_next, which takes 10.70% of the time.
After running the optimized version, we see the following:

After our optimization, the listiter_next function consumes only 2.11% of the time. Further optimizations based on what is happening inside the interpreter are left to the reader.
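We can also see why the optimized version spends less time in append-related calls without leaving Python, by comparing the bytecode of the two patterns with the standard dis module. Opcode names vary slightly between CPython versions, so the sketch below only checks for broad families of opcodes:

```python
import dis

def opnames(code):
    # Collect opcode names from a code object and any nested code
    # objects (comprehensions compile to nested code on older CPythons).
    ops = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            ops |= opnames(const)
    return ops

loop_src = (
    "l = [2]\n"
    "for x in s:\n"
    "    if x:\n"
    "        l.append(x)\n"
)
comp_src = "l = [2] + [x for x in s if x]"

loop_ops = opnames(compile(loop_src, "<loop>", "exec"))
comp_ops = opnames(compile(comp_src, "<comp>", "exec"))

# The explicit loop performs a method call for every kept element...
print(sorted(op for op in loop_ops if op.startswith("CALL")))
# ...while the comprehension uses the dedicated LIST_APPEND opcode
# and never calls append() at all.
print("LIST_APPEND" in comp_ops)  # True
```

Fewer method calls per element means fewer trips through the C-level call machinery that perf showed us above.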
Valgrind/Callgrind
Another tool that can be used to find bottlenecks is Valgrind, with its plugin Callgrind. More details can be read here.
If not already done, we (re)build the CPython interpreter as follows. If you haven't already downloaded the Python scripts into the same directory, do so now. Also, make sure that valgrind is installed on your system.
```
make clean
./configure --with-pydebug
make
```
Run valgrind like this:

```
valgrind --tool=callgrind --dump-instr=yes \
    --collect-jumps=yes --collect-systime=yes \
    --callgrind-out-file=callgrind-%p.out \
    -- ./python 04.primes-v1.py
```
And the results:
```
Benchmark duration: 1109.4096319675446 seconds
[21868 refs]
==24152==
==24152== Events    : Ir sysCount sysTime
==24152== Collected : 115949791666 942 208
==24152==
==24152== I refs:      115,949,791,666
```
For visualization we will use KCacheGrind:

```
kcachegrind callgrind-2327.out
```
PyPy
On PyPy, the set of profilers that can be used successfully is very limited; essentially it comes down to vmprof, a tool written by the same people who develop PyPy.
First, download PyPy from here. After this, enable pip support for it:

```
bin/pypy -m ensurepip
```
Installing vmprof is simple; just run:

```
bin/pypy -m pip install vmprof
```
Run the workload as:

```
bin/pypy -m vmprof --web 04.primes-v1.py
```
and open the link displayed in the console in your browser (it begins with http://vmprof.com/#/).
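vmprof can also be driven from inside the program instead of via `-m vmprof`. Below is a minimal sketch assuming vmprof is installed; the `enable`/`disable` calls come from vmprof's documented API, but treat this as an untested outline, guarded so it degrades gracefully when the module is missing:

```python
def profile_call(path, fn, *args):
    # Try to profile fn(*args) with vmprof, writing samples to `path`.
    # Falls back to a plain, unprofiled call when vmprof is not installed.
    try:
        import vmprof
    except ImportError:
        return fn(*args), False
    with open(path, "w+b") as f:
        vmprof.enable(f.fileno())  # start sampling into the file
        try:
            result = fn(*args)
        finally:
            vmprof.disable()       # always stop sampling
    return result, True

# Illustrative usage with a trivial workload.
result, profiled = profile_call("primes.prof", sum, range(10))
print(result, profiled)
```

The resulting file can then be uploaded for visualization with `python -m vmprof.upload primes.prof` or inspected with the vmprof tooling.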
By Alecsandru Patrascu, alecsandru.patrascu [at] rinftech [dot] com