The Python ecosystem in HEP data analysis
Chris Burr
7th July @ PyHEP 2018, Sofia
Who am I?
➤ Third year PhD student @ The University of Manchester
➤ “Primarily” working on analysis and detector alignment in LHCb
➤ Generally interested in computing and making analysis more efficient
➤ Extensively involved in LHCb's Starterkit activities:
➤ Annual workshop at CERN in October/November for new students
➤ Topics covered: Bash, Git, Python and LHCb specific
➤ Follow up workshop at CERN in May
A couple of points before I start
➤ Here I am discussing my own analysis work, where I exclusively use Python
➤ Everything is my opinion which is heavily shaped by my experience in LHCb
➤ I know that even very similar experiments have different needs, I’d love to learn about them!
➤ I’m currently in the final stages of my second full analysis
➤ First analysis almost entirely used PyROOT
➤ Second was independent of ROOT for a long time (except for reading files, predates uproot)
➤ Almost everything I discuss here is used as part of this
➤ The code snippets I show here are copied from this
➤ This has stopped me from properly trying some newer things
➤ JupyterLab
➤ Experimental features in ROOT
Why Python?
Why do I personally like 🐍 for analysis?
➤ 90%+ of what I write won’t be used again
➤ I care about the time it takes to (write + execute)
➤ Designed to be readable
➤ Good libraries minimise boilerplate while remaining flexible
➤ Huge ecosystem
➤ Tends to be well documented
➤ StackOverflow answers for everything
➤ 🔍 Code bases of packages tend to be understandable: no complex inheritance, templating, typedefs
Why do I recommend using Python to new analysts?
➤ All the reasons on the last slide!
➤ Plus the only* alternative is to use C++
➤ It’s too easy to get into a big mess 😰
➤ Modern C++ and newer compilers avoid this, but C will always be there…
➤ My main reservation to Python is about environments - I’ll cover this later
*If you should use Julia/R/Go/Rust/whatever, you’ll find them without me telling you
Why Python 3?
Why do I use Python 3 for analysis?
➤ Started when I was a masters student in 2014
➤ To be honest, the differences didn’t really matter to me and the reason was:
“It’s newer so it must be better”
➤ Now things are different!
➤ Here just a few of the reasons why…
f-strings
➤ My number one feature is f-strings (Python 3.6+)
➤ Why are they better?
➤ Compact and easy to read
➤ Bugs are generally easier to see
➤ Plays nicely with linters
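A minimal comparison, as a sketch of my own rather than code from the slides:

    name, chi2 = "D0", 1.2345
    print("Fit to %s gave a chi2 of %.2f" % (name, chi2))        # % formatting
    print("Fit to {} gave a chi2 of {:.2f}".format(name, chi2))  # str.format
    print(f"Fit to {name} gave a chi2 of {chi2:.2f}")            # f-string (3.6+)

The variables appear exactly where they are used, so a swapped argument is visible at a glance and a linter can flag an undefined name.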
Other reasons to use Python 3
➤ Dictionaries are ordered (CPython 3.6+, Python 3.7+)
➤ * and ** behave sensibly
➤ In my experience, it’s been faster!
➤ print is actually a function with kwargs like sep, end and flush (see the sketch below)
➤ Separate str/bytes types
➤ Exception chaining
➤ Many little standard library improvements:
➤ Recursive globbing, LRU cache, secrets module, Enum
Overall: It’s not any one feature, it’s just:
quicker, easier and less buggy!
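A few of these in action, as a small self-contained sketch:

    from glob import glob

    # print is a function, so keyword arguments just work
    print("n_signal", 1234, sep=" = ", flush=True)

    # recursive globbing with ** (Python 3.5+)
    files = glob("data/**/*.root", recursive=True)

    # * and ** unpacking behave sensibly (PEP 448)
    defaults = {"bins": 100, "range": (0, 10)}
    options = {**defaults, "bins": 50}

    # keyword-only arguments catch call-site bugs early
    def efficiency(n_pass, n_total, *, weighted=False):
        return n_pass / n_total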
Looking to the future
➤ I’ve only ever* had major Python 3 issues with HEP-specific packages
➤ Mostly ROOT+XRootD, though even these have worked for me
➤ Interesting new features are Python 3 only
➤ Python itself: Assignment expressions are coming in Python 3.8
➤ Wider ecosystem:
➤ IPython and matplotlib already have only bug-fix support for Python 2
➤ numpy will follow at the end of the year
*Excluding LHCb’s software
What packages do I use?
Standard library packages
➤ str methods for string manipulation (join, split, replace)
➤ argparse (making everything a script makes pipelines easier)
➤ glob
➤ os.path (dirname, basename, join, splitext)
➤ shutil (high level, os independent, file system operations)
➤ itertools (and the recipes included in its docs)
➤ re (regular expressions)
➤ I rarely use “math” (though it can be a lot faster than numpy for scalar operations)
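A hedged sketch of the sort of glue code these cover for me (the directory layout and names are made up):

    import os.path
    from glob import glob
    from itertools import product

    # loop over every (year, polarity) combination and list the matching ntuples
    for year, polarity in product(["2015", "2016"], ["MagUp", "MagDown"]):
        pattern = os.path.join("ntuples", year, polarity, "*.root")
        for fn in glob(pattern):
            name, ext = os.path.splitext(os.path.basename(fn))
            print("-".join([year, polarity, name]))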
numpy and scipy
➤ Basis of all scientific Python
➤ Contain well-performing implementations for most tasks
➤ polyfit, cdist, scipy.stats, convolve2d, find_peaks, argsort
➤ Truly excellent for quick prototyping
➤ Often the quick prototype is enough
➤ Even when it isn’t, it allows you to figure out what you want to do
➤ I think people should be less afraid of rewriting things
➤ I’ll discuss numpy’s limitations later…
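As an example of what quick prototyping looks like, a toy peak search built from nothing but polyfit and find_peaks (the spectrum is invented; find_peaks needs scipy >= 1.1):

    import numpy as np
    from scipy.signal import find_peaks

    # toy spectrum: two narrow peaks on a falling background
    x = np.linspace(0, 10, 1000)
    y = (np.exp(-0.2 * x)
         + np.exp(-0.5 * ((x - 3) / 0.1) ** 2)
         + np.exp(-0.5 * ((x - 7) / 0.1) ** 2))

    # estimate the background with a polynomial, then look for peaks
    background = np.polyval(np.polyfit(x, y, 3), x)
    peaks, _ = find_peaks(y - background, height=0.1)
    print(x[peaks])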
jupyter notebooks
➤ Interactively write code interleaved with documentation and results
➤ I’ve used them a lot for one-off studies and developing ideas
➤ Sharing them works really well with the inline markdown
➤ Lots of potential for inline widgets and interactive elements
➤ Issues/desired features:
➤ Teaching Python with Jupyter can be difficult due to the ordering of cells and global state
➤ I would like to turn the notebooks into scripts or modules I import
➤ Using them with version control can be messy
IPython
➤ The ultimate REPL (since v5 was released)
➤ Nice extensions with the added % and ! syntax
➤ Some shell commands like ls “just work”
➤ Excellent multiline support
➤ Nice colouring, especially for stack traces
➤ Easily access the output of the previous line
➤ Essential to use a recent version
➤ Can have inline images, even with SSH+tmux (I use iterm2 and itermplot)
➤ I use this a lot to write scripts (I just wish I knew an easier way to copy the code out without the ...: prompts)
matplotlib
➤ Plotting is probably the second* hardest class of API to design (or graphics in general)
➤ matplotlib does a good job
➤ The pyplot interface hides the complexity nicely without limiting you
➤ I think it could be better, but I don’t know how
➤ The documentation is excellent (and now Python 2 is dropped it could get even better)
➤ Hundreds of well written examples and many thousands of StackOverflow answers
➤ Main missing feature is serialisation (better interactivity would be nice too)
➤ There is an old proposal (MEP25), but no one has found the time to implement it
➤ Pickle can be used, but it’s not ideal (long-term stability, serious security issues, …)
➤ Lots of alternatives with various levels of interoperability with matplotlib
➤ Some, like Vega, put serialisation first
➤ I’ve never seriously used them as I always run into limitations
*In my opinion and I’ll come to the hardest later
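For reference, a minimal pyplot sketch of the kind of figure I make every day (the “mass” data is generated on the spot):

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.normal(1864.8, 8, size=10000)  # a made-up mass peak
    plt.hist(data, bins=100, histtype="step", label="Candidates")
    plt.xlabel("Mass [MeV]")
    plt.ylabel("Candidates per bin")
    plt.legend(loc="best")
    plt.savefig("mass.pdf")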
pandas
➤ Data analysis library built around numpy
➤ I use root_pandas to load TTrees in ROOT files as pandas DataFrames
➤ 90% of what I use is:
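Roughly, a hedged sketch of that 90% (file, tree and branch names are made up):

    from root_pandas import read_root

    df = read_root("ntuple.root", "DecayTree", columns=["D0_M", "D0_PT", "BDT"])
    df = df.query("BDT > 0.7")            # apply a selection
    df["D0_PT_GeV"] = df["D0_PT"] / 1000  # add a derived column
    df.hist("D0_M", bins=100)             # quick look at what survives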
Machine learning
➤ This is a topic for another talk!
➤ Just to give my two stotinki:
➤ scikit-learn: Standardised API + helper functions + excellent docs
➤ XGBoost: Easy, fast and effective (I use the scikit-learn interface)
➤ hep_ml: Nice HEP specific stuff with scikit-learn compatible API
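A minimal sketch of the scikit-learn interface to XGBoost on toy data (not a real training setup):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # toy data: two overlapping Gaussian blobs in five dimensions
    X = np.concatenate([np.random.normal(0, 1, (1000, 5)),
                        np.random.normal(1, 1, (1000, 5))])
    y = np.concatenate([np.zeros(1000), np.ones(1000)])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    clf = XGBClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))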
Many, many, many others…
➤ tabulate: Quickly print nice tables in many formats (plain text, html, LaTeX)
➤ jinja2: Use templating to automatically write papers! (though the default syntax isn’t ideal for LaTeX)
➤ joblib: Embarrassingly parallel computation made embarrassingly easy! 🙈
➤ Doesn’t always work well with ROOT
➤ tqdm: High performance and pretty progress bars
➤ uncertainties: Propagate errors assuming normal distributions
➤ uproot: Read ROOT files into numpy without depending on ROOT
Just a python -m pip install --user $PACKAGE_NAME away!
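A hedged taste of three of these together (the file and branch names are invented):

    import uproot
    from tabulate import tabulate
    from uncertainties import ufloat

    # read a branch as a numpy array, no ROOT installation required
    masses = uproot.open("ntuple.root")["DecayTree"].array("D0_M")

    # propagate errors automatically and print a LaTeX table
    eff = ufloat(0.912, 0.008) * ufloat(0.950, 0.010)
    print(tabulate([["Total efficiency", str(eff)]], tablefmt="latex"))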
The problem of abandoned software
➤ mcerp is a package for propagating uncertainties using toys
➤ Supports arbitrary distributions and correctly accounts for correlations
➤ Makes doing errors “correctly” very easy
➤ Is an example of a useful but abandoned package
➤ Broken for Python 3 and newer versions of scipy
➤ Has trivial unmerged* pull requests to fix these issues
➤ Not unique to Python, but what can we do to avoid this?
➤ Automatic tests and deployment to PyPI (easy to set up with Travis CI) reduce the burden
➤ Still need to be able to adopt packages when people lose interest or leave HEP
Pipelines and snakemake
Pipelines, pipelines everywhere!
➤ Analysis work naturally lends itself towards using pipelines
➤ Like how make is used to build a pipeline for compiling code
➤ Can be as simple as a bash script (used for my first analysis with “nightly builds”)
➤ My personal favourite is snakemake…
Johannes Köster - Snakemake Tutorial 2017
snakemake
➤ Workflow management system written in Python (Python 3 only)
➤ Inspired by make, designed for research and widely used in biology
➤ Workflows are written in python, but with some added syntax
➤ Support for remote files (HTTP, FTP, S3, XRootD, …)
➤ Distribute jobs over clusters with or without shared filesystem
➤ Can manage a kubernetes cluster
➤ Jobs can be Python code or arbitrary shell scripts
➤ Builds a directed acyclic graph → easy parallelisation and caching
➤ I use it for everything from detector studies to analyses that run for days
➤ TIP: Using “assert” statements everywhere can make debugging easier
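A hedged sketch of what a Snakefile looks like (the rules, scripts and paths are all invented):

    # Snakefile: a hypothetical two-step pipeline
    rule all:
        input:
            "results/fit.json"

    rule apply_selection:
        input:
            "ntuples/{year}/data.root"
        output:
            "selected/{year}/data.root"
        shell:
            "python apply_selection.py {input} {output}"

    rule fit_mass:
        input:
            expand("selected/{year}/data.root", year=["2015", "2016"])
        output:
            "results/fit.json"
        script:
            "fit_mass.py"

snakemake builds the graph backwards from results/fit.json, runs the selection jobs in parallel and only reruns what is out of date.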
Environments
Setting up an environment
➤ A common question (that I don’t have a good answer to):
“How do I get Python?”
➤ I think the best* answer is conda
➤ Use pip and conda to install packages, can even share your environments easily
➤ Widely used elsewhere, especially biology
➤ But…how do I get ROOT?
➤ 😞 Erm? Build from source? Okay just use an LCG view…
➤ Not available locally, difficult to customise and has some quirks
➤ See Ben’s talk!
<shameless advert> *For analysts, for complex stacks see poster: “Software packaging and distribution for LHCb using Nix” </shameless advert>
Tooling
➤ Having the correct tooling makes a huge difference
➤ Python is a dynamic language → linting should be seen as essential!
➤ Both for errors and style, see PEP8 and customise if you must
➤ For packages: this means tests
➤ For analysts: tests are mostly busy work, but in-editor linting is great and teaches consistent style!
➤ I’ve had scripts run for hours and then avoidably crash due to a typo
➤ But we have a mess of text editors
➤ Most masters students use gedit and/or an ancient version of emacs/vim
➤ Emacs and vim are fine* once you configure and customise them (show whitespace + no tabs)
➤ Can we make better tools available?
*Personally I use sublime but tend to recommend VS Code or Atom
Limitations in Python
Memory
➤ Most of the Python ecosystem is based around in memory computation
➤ You can normally avoid this: load fewer columns and apply cuts earlier
➤ Hoarding effectively useless data is a bad idea
➤ Not doing so just wastes CPU cycles and, more importantly, analysts’ time
➤ That said it is still a problem that needs to be addressed
➤ Various systems try to hack this in, like dask, but they don’t work well in my experience
➤ We’re not the only ones with this problem; for example, look up “pandas2”
➤ All my thoughts and many more are in this blog post from the creator of pandas:
10 things I hate about pandas
http://wesmckinney.com/blog/apache-arrow-pandas-internals/
Parallelisation
➤ The Global Interpreter Lock (GIL) isn’t a big deal
➤ I wish numpy, scipy and pandas would automatically parallelise sometimes
➤ Most of my parallelisation comes from XGBoost, joblib or snakemake
➤ This has always scaled perfectly well for me (4 threads → 64 threads)
➤ ROOT and XRootD have caused me trouble with joblib
➤ As before, it’s okay but should be worked on for the future
➤ I think this will come out of the efforts on the previous slide
➤ Maybe ROOT can provide this to the wider community?
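For illustration, a minimal joblib sketch of the embarrassingly parallel pattern I mean (the toy function is a stand-in for real work):

    import numpy as np
    from joblib import Parallel, delayed

    def run_toy(seed):
        """A stand-in for an expensive, independent piece of work."""
        rng = np.random.RandomState(seed)
        return rng.normal(size=100000).mean()

    # run 100 toys over 4 worker processes
    results = Parallel(n_jobs=4)(delayed(run_toy)(seed) for seed in range(100))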
Fitting
➤ In HEP this normally means maximum likelihood fitting
➤ Good tools exist for other kinds of fits
➤ But there is nothing suitable for analysis-level fits
➤ I’ve grown to appreciate RooFit(+RooStats)
➤ It is incredibly powerful without requiring everything to be done from scratch
➤ But…
➤ I think the API could be better
➤ Python bindings are terrible
➤ Weird segfaults from the Python bindings in RooArgList/RooArgSet
➤ The workspace idea is great, but the bindings let it down
➤ RooWorkspace::factory doesn’t crash when errors happen
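For context, a minimal PyROOT sketch of basic RooFit usage (a toy Gaussian fit, not code from my analysis):

    import ROOT

    x = ROOT.RooRealVar("x", "x", 0, 10)
    mean = ROOT.RooRealVar("mean", "mean", 5, 0, 10)
    sigma = ROOT.RooRealVar("sigma", "sigma", 1, 0.1, 5)
    model = ROOT.RooGaussian("model", "model", x, mean, sigma)

    data = model.generate(ROOT.RooArgSet(x), 10000)  # toy dataset
    model.fitTo(data)                                # maximum likelihood fit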
What packages are missing?
Fitting: My dream
➤ Fitting is all about having an API to build likelihoods and then evaluate them
➤ Exactly how to define this API is hard, but I think Python can shine here
➤ We can benefit from external tools, for example, using TensorFlow
➤ Define an API to build a likelihood in TensorFlow that is a graph
➤ TensorFlow then gives us:
➤ Scaling from 1 core to many GPUs spread across multiple machines
➤ Automatic differentiation
➤ Optimisers for the graph
➤ A web interface
➤ There is a lot of demand for this outside of HEP if it is generic
➤ But it is a huge amount of work to make
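To make the idea concrete, a minimal sketch of an unbinned Gaussian negative log-likelihood built as a graph (assuming the TensorFlow 1.x API; a real tool would also constrain the parameters and estimate uncertainties):

    import numpy as np
    import tensorflow as tf

    data = np.random.normal(0.5, 1.2, size=100000).astype(np.float32)

    mu = tf.Variable(0.0)
    sigma = tf.Variable(1.0)
    dist = tf.distributions.Normal(loc=mu, scale=sigma)
    nll = -tf.reduce_sum(dist.log_prob(data))  # the likelihood is "just" a graph

    # automatic differentiation makes the minimisation a one-liner
    train = tf.train.AdamOptimizer(0.01).minimize(nll)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(2000):
            sess.run(train)
        print(sess.run([mu, sigma]))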
Histograms
➤ Not so many people use histograms the way HEP does
➤ Maybe more people should!
➤ ROOT is really good here
➤ Tracking uncertainties
➤ Combining histograms
➤ But things need to be better integrated into the Python ecosystem (Example: fill from a pandas DataFrame)
➤ I’ve lost count of how many times I’ve written bin error calculation code…
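That bin error code, as a short numpy sketch (the error per bin is the square root of the sum of squared weights; toy data):

    import numpy as np

    values = np.random.normal(0, 1, size=10000)
    weights = np.random.uniform(0.5, 1.5, size=10000)

    counts, edges = np.histogram(values, bins=50, weights=weights)
    sumw2, _ = np.histogram(values, bins=edges, weights=weights ** 2)
    errors = np.sqrt(sumw2)  # reduces to sqrt(N) for unweighted data
    centres = (edges[:-1] + edges[1:]) / 2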
Closing remarks
Making things pythonistic
➤ The python XRootD bindings are pretty good
➤ But they could be more pythonistic:
➤ Raise exceptions instead of returning status codes
➤ Provide a wrapper for the os module with support for:
➤ os.remove
➤ os.rename
➤ dirname, join, basename, splitext, … from os.path
➤ Maybe even monkey patch the standard library? (Optionally!!! And this is probably a terrible idea…)
➤ XGBoost is a good example
➤ It has two sets of Python bindings: one for the XGBoost API, one to match the scikit-learn API
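A hedged sketch of the kind of wrapper I mean; the exception class and helper are my invention, while the FileSystem.rm call is from the existing bindings as I understand them:

    from XRootD import client

    class XRootDError(IOError):
        """Invented for this sketch: the real bindings return status objects."""

    def remove(server, path):
        """An os.remove-like helper that raises instead of returning a status."""
        status, _ = client.FileSystem(server).rm(path)
        if not status.ok:
            raise XRootDError(status.message)

    remove("root://eoslhcb.cern.ch", "/eos/some/made/up/file.root")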
Conclusions
➤ The python ecosystem is wonderful
➤ Excellent documentation and huge community asking/answering questions
➤ Can take advantage of investment made for other use cases
➤ Libraries exist for most uses, many of which have simple APIs
➤ As always, it could be better:
➤ Improve “pythonistic”-ness and interoperability of libraries
➤ Histogramming
➤ Maximum likelihood fitting
➤ Improve development environments
➤ Longer term: Remove the in-memory dependence of calculations without losing the simplicity
Any questions?
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!