The Python ecosystem in HEP data analysis
Chris Burr
7th July @ PyHEP 2018, Sofia
Who am I?
➤ Third year PhD student @ The University of Manchester
➤ “Primarily” working on analysis and detector alignment in LHCb
➤ Generally interested in computing and making analysis more efficient
➤ Extensively involved in LHCb's Starterkit activities:
➤ Annual workshop at CERN in October/November for new students
➤ Topics covered: Bash, Git, Python and LHCb specific
➤ Follow up workshop at CERN in May
A couple of points before I start
➤ Here I am discussing my own analysis work, where I exclusively use Python
➤ Everything is my opinion which is heavily shaped by my experience in LHCb
➤ I know that even very similar experiments have different needs, I’d love to learn about them!
➤ I’m currently in the final stages of my second full analysis
➤ First analysis almost entirely used PyROOT
➤ Second was independent of ROOT for a long time (except for reading files, predates uproot)
➤ Almost everything I discuss here is used as part of this
➤ The code snippets I show here are copied from this
➤ This has stopped me from properly trying some newer things
➤ JupyterLab
➤ Experimental features in ROOT
Why Python?
Why do I personally like 🐍 for analysis?
➤ 90%+ of what I write won’t be used again
➤ I care about the time it takes to (write + execute)
➤ Designed to be readable
➤ Good libraries minimise boilerplate while remaining flexible
➤ Huge ecosystem
➤ Tends to be well documented
➤ StackOverflow answers for everything
➤ 🔍 Code bases of packages tend to be understandable: no complex inheritance, templating, typedefs
Why do I recommend using Python to new analysts?
➤ All the reasons on the last slide!
➤ Plus the only* alternative is to use C++
➤ It’s too easy to get into a big mess 😰
➤ Modern C++ and newer compilers avoid this, but C will always be there…
➤ My main reservation to Python is about environments - I’ll cover this later
*If you should use Julia/R/Go/Rust/whatever, you’ll find them without me telling you
Why Python 3?
Why do I use Python 3 for analysis?
➤ Started when I was a masters student in 2014
➤ To be honest, the differences didn’t really matter to me and the reason was:
“It’s newer so it must be better”
➤ Now things are different!
➤ Here just a few of the reasons why…
f-strings
➤ My number one feature is f-strings (Python 3.6+)
➤ Why are they better?
➤ Compact and easy to read
➤ Bugs are generally easier to see
➤ Plays nicely with linters
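A minimal comparison, as a sketch of my own rather than code from the slides:

    name, chi2 = "D0", 1.2345
    print("Fit to %s gave a chi2 of %.2f" % (name, chi2))        # % formatting
    print("Fit to {} gave a chi2 of {:.2f}".format(name, chi2))  # str.format
    print(f"Fit to {name} gave a chi2 of {chi2:.2f}")            # f-string (3.6+)

The variables appear exactly where they are used, so a swapped argument is visible at a glance and a linter can flag an undefined name.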
Other reasons to use Python 3
➤ Dictionaries are ordered (CPython 3.6+, Python 3.7+)
➤ * and ** behave sensibly
➤ In my experience, it’s been faster!
➤ print is actually a function with kwargs like sep, end and flush (see the sketch below)
➤ Separate str/bytes types
➤ Exception chaining
➤ Many little standard library improvements:
➤ Recursive globbing, LRU cache, secrets module, Enum
Overall: It’s not any one feature, it’s just:
quicker, easier and less buggy!
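A few of these in action, as a small self-contained sketch:

    from glob import glob

    # print is a function, so keyword arguments just work
    print("n_signal", 1234, sep=" = ", flush=True)

    # recursive globbing with ** (Python 3.5+)
    files = glob("data/**/*.root", recursive=True)

    # * and ** unpacking behave sensibly (PEP 448)
    defaults = {"bins": 100, "range": (0, 10)}
    options = {**defaults, "bins": 50}

    # keyword-only arguments catch call-site bugs early
    def efficiency(n_pass, n_total, *, weighted=False):
        return n_pass / n_total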
Looking to the future
➤ I’ve only ever* had major Python 3 issues with HEP-specific packages
➤ Mostly ROOT+XRootD, though even these have worked for me
➤ Interesting new features are Python 3 only
➤ Python itself: Assignment expressions are coming in Python 3.8
➤ Wider ecosystem:
➤ IPython and matplotlib already have only bug-fix support for Python 2
➤ numpy will follow at the end of the year
*Excluding LHCb’s software
What packages do I use?
Standard library packages
➤ str methods for string manipulation (join, split, replace)
➤ argparse (making everything a script makes pipelines easier)
➤ glob
➤ os.path (dirname, basename, join, splitext)
➤ shutil (high level, os independent, file system operations)
➤ itertools (and the recipes included in its docs)
➤ re (regular expressions)
➤ I rarely use “math” (though it can be a lot faster than numpy for scalar operations)
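A hedged sketch of the sort of glue code these cover for me (the directory layout and names are made up):

    import os.path
    from glob import glob
    from itertools import product

    # loop over every (year, polarity) combination and list the matching ntuples
    for year, polarity in product(["2015", "2016"], ["MagUp", "MagDown"]):
        pattern = os.path.join("ntuples", year, polarity, "*.root")
        for fn in glob(pattern):
            name, ext = os.path.splitext(os.path.basename(fn))
            print("-".join([year, polarity, name]))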
numpy and scipy
➤ Basis of all scientific Python
➤ Contain well-performing implementations for most tasks
➤ polyfit, cdist, scipy.stats, convolve2d, find_peaks, argsort
➤ Truly excellent for quick prototyping
➤ Often the quick prototype is enough
➤ Even when it isn’t, it allows you to figure out what you want to do
➤ I think people should be less afraid of rewriting things
➤ I’ll discuss numpy’s limitations later…
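As an example of what quick prototyping looks like, a toy peak search built from nothing but polyfit and find_peaks (the spectrum is invented; find_peaks needs scipy >= 1.1):

    import numpy as np
    from scipy.signal import find_peaks

    # toy spectrum: two narrow peaks on a falling background
    x = np.linspace(0, 10, 1000)
    y = (np.exp(-0.2 * x)
         + np.exp(-0.5 * ((x - 3) / 0.1) ** 2)
         + np.exp(-0.5 * ((x - 7) / 0.1) ** 2))

    # estimate the background with a polynomial, then look for peaks
    background = np.polyval(np.polyfit(x, y, 3), x)
    peaks, _ = find_peaks(y - background, height=0.1)
    print(x[peaks])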
jupyter notebooks
➤ Interactively write code interleaved with documentation and results
➤ I’ve used them a lot for one-off studies and developing ideas
➤ Sharing them works really well with the inline markdown
➤ Lots of potential for inline widgets and interactive elements
➤ Issues/desired features:
➤ Teaching Python with Jupyter can be difficult due to the ordering of cells and global state
➤ I would like to turn the notebooks into scripts or modules I import
➤ Using them with version control can be messy
IPython
➤ The ultimate REPL (since v5 was released)
➤ Nice extensions with the added % and ! syntax
➤ Some shell commands like ls “just work”
➤ Excellent multiline support
➤ Nice colouring, especially for stack traces
➤ Easily access the output of the previous line
➤ Essential to use a recent version
➤ Can have inline images, even with SSH+tmux (I use iterm2 and itermplot)
➤ I use this a lot to write scripts (I just wish I knew an easier way to copy the code out without the ...: prompts)
matplotlib
➤ Plotting is probably the second* hardest class of API to design (or graphics in general)
➤ matplotlib does a good job
➤ The pyplot interface hides the complexity nicely without limiting you
➤ I think it could be better, but I don’t know how
➤ The documentation is excellent (and now Python 2 is dropped it could get even better)
➤ Hundreds of well written examples and many thousands of StackOverflow answers
➤ Main missing feature is serialisation (better interactivity would be nice too)
➤ There is an old proposal (MEP25), but no one has found the time to implement it
➤ Pickle can be used, but it’s not ideal (long-term stability, serious security issues, …)
➤ Lots of alternatives with various levels of interoperability with matplotlib
➤ Some, like Vega, put serialisation first
➤ I’ve never seriously used them as I always run into limitations
*In my opinion and I’ll come to the hardest later
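For reference, a minimal pyplot sketch of the kind of figure I make every day (the “mass” data is generated on the spot):

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.normal(1864.8, 8, size=10000)  # a made-up mass peak
    plt.hist(data, bins=100, histtype="step", label="Candidates")
    plt.xlabel("Mass [MeV]")
    plt.ylabel("Candidates per bin")
    plt.legend(loc="best")
    plt.savefig("mass.pdf")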
pandas
➤ Data analysis library built around numpy
➤ I use root_pandas to load TTrees in ROOT files as pandas DataFrames
➤ 90% of what I use is:
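Roughly, a hedged sketch of that 90% (file, tree and branch names are made up):

    from root_pandas import read_root

    df = read_root("ntuple.root", "DecayTree", columns=["D0_M", "D0_PT", "BDT"])
    df = df.query("BDT > 0.7")            # apply a selection
    df["D0_PT_GeV"] = df["D0_PT"] / 1000  # add a derived column
    df.hist("D0_M", bins=100)             # quick look at what survives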
Machine learning
➤ This is a topic for another talk!
➤ Just to give my two stotinki:
➤ scikit-learn: Standardised API + helper functions + excellent docs
➤ XGBoost: Easy, fast and effective (I use the scikit-learn interface)
➤ hep_ml: Nice HEP specific stuff with scikit-learn compatible API
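A minimal sketch of the scikit-learn interface to XGBoost on toy data (not a real training setup):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # toy data: two overlapping Gaussian blobs in five dimensions
    X = np.concatenate([np.random.normal(0, 1, (1000, 5)),
                        np.random.normal(1, 1, (1000, 5))])
    y = np.concatenate([np.zeros(1000), np.ones(1000)])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    clf = XGBClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))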
Many, many, many others…
➤ tabulate: Quickly print nice tables in many formats (plain text, html, LaTeX)
➤ jinja2: Use templating to automatically write papers! (though the default syntax isn’t ideal for LaTeX)
➤ joblib: Embarrassingly parallel computation made embarrassingly easy! 🙈
➤ Doesn’t always work well with ROOT
➤ tqdm: High performance and pretty progress bars
➤ uncertainties: Propagate errors assuming normal distributions
➤ uproot: Read ROOT files into numpy without depending on ROOT
Just a python -m pip install --user $PACKAGE_NAME away!
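A hedged taste of three of these together (the file and branch names are invented):

    import uproot
    from tabulate import tabulate
    from uncertainties import ufloat

    # read a branch as a numpy array, no ROOT installation required
    masses = uproot.open("ntuple.root")["DecayTree"].array("D0_M")

    # propagate errors automatically and print a LaTeX table
    eff = ufloat(0.912, 0.008) * ufloat(0.950, 0.010)
    print(tabulate([["Total efficiency", str(eff)]], tablefmt="latex"))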
The problem of abandoned software
➤ mcerp is a package for propagating uncertainties using toys
➤ Supports arbitrary distributions and correctly accounts for correlations
➤ Makes doing errors “correctly” very easy
➤ Is an example of a useful but abandoned package
➤ Broken for Python 3 and newer versions of scipy
➤ Has trivial unmerged* pull requests to fix these issues
➤ Not unique to Python, but what can we do to avoid this?
➤ Automatic tests and deployment to PyPI (easy to set up with Travis CI) reduce the burden
➤ Still need to be able to adopt packages when people lose interest or leave HEP
Pipelines and snakemake
Pipelines, pipelines everywhere!
➤ Analysis work naturally lends itself towards using pipelines
➤ Like how make is used to build a pipeline for compiling code
➤ Can be as simple as a bash script (used for my first analysis with “nightly builds”)
➤ My personal favourite is snakemake…
Johannes Köster - Snakemake Tutorial 2017
snakemake
➤ Workflow management system written in Python (Python 3 only)
➤ Inspired by make, designed for research and widely used in biology
➤ Workflows are written in python, but with some added syntax
➤ Support for remote files (HTTP, FTP, S3, XRootD, …)
➤ Distribute jobs over clusters with or without shared filesystem
➤ Can manage a kubernetes cluster
➤ Jobs can be Python code or arbitrary shell scripts
➤ Builds a directed acyclic graph → easy parallelisation and caching
➤ I use it for everything from detector studies to analyses that run for days
➤ TIP: Using “assert” statements everywhere can make debugging easier
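A hedged sketch of what a Snakefile looks like (the rules, scripts and paths are all invented):

    # Snakefile: a hypothetical two-step pipeline
    rule all:
        input:
            "results/fit.json"

    rule apply_selection:
        input:
            "ntuples/{year}/data.root"
        output:
            "selected/{year}/data.root"
        shell:
            "python apply_selection.py {input} {output}"

    rule fit_mass:
        input:
            expand("selected/{year}/data.root", year=["2015", "2016"])
        output:
            "results/fit.json"
        script:
            "fit_mass.py"

snakemake builds the graph backwards from results/fit.json, runs the selection jobs in parallel and only reruns what is out of date.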
Environments
Setting up an environment
➤ A common question (that I don’t have a good answer to):
“How do I get Python?”
➤ I think the best* answer is conda
➤ Use pip and conda to install packages, can even share your environments easily
➤ Widely used elsewhere, especially biology
➤ But…how do I get ROOT?
➤ 😞 Erm? Build from source? Okay just use an LCG view…
➤ Not available locally, difficult to customise and has some quirks
➤ See Ben’s talk!
<shameless advert> *For analysts, for complex stacks see poster: “Software packaging and distribution for LHCb using Nix” </shameless advert>
Tooling
➤ Having the correct tooling makes a huge difference
➤ Python is a dynamic language → linting should be seen as essential!
➤ Both for errors and style, see PEP8 and customise if you must
➤ For packages: this means tests
➤ For analysts: tests are mostly busy work, but in-editor linting is great and teaches consistent style!
➤ I’ve had scripts run for hours and then avoidably crash due to a typo
➤ But we have a mess of text editors
➤ Most masters students use gedit and/or an ancient version of emacs/vim
➤ Emacs and vim are fine* once you configure and customise them (show whitespace + no tabs)
➤ Can we make better tools available?
*Personally I use sublime but tend to recommend VS Code or Atom
Limitations in Python
Memory
➤ Most of the Python ecosystem is based around in memory computation
➤ You can normally avoid this: load fewer columns and apply cuts earlier
➤ Hoarding effectively useless data is a bad idea
➤ Not doing so just wastes CPU cycles and, more importantly, analysts’ time
➤ That said it is still a problem that needs to be addressed
➤ Various systems try to hack this in, like dask, but they don’t work well in my experience
➤ We’re not the only ones with this problem; for example, look up “pandas2”
➤ All my thoughts and many more are in this blog post from the creator of pandas:
10 things I hate about pandas
http://wesmckinney.com/blog/apache-arrow-pandas-internals/
Parallelisation
➤ The Global Interpreter Lock (GIL) isn’t a big deal
➤ I wish numpy, scipy and pandas would automatically parallelise sometimes
➤ Most of my parallelisation comes from XGBoost, joblib or snakemake
➤ This has always scaled perfectly well for me (4 threads → 64 threads)
➤ ROOT and XRootD have caused me trouble with joblib
➤ As before, it’s okay but should be worked on for the future
➤ I think this will come out of the efforts on the previous slide
➤ Maybe ROOT can provide this to the wider community?
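For illustration, a minimal joblib sketch of the embarrassingly parallel pattern I mean (the toy function is a stand-in for real work):

    import numpy as np
    from joblib import Parallel, delayed

    def run_toy(seed):
        """A stand-in for an expensive, independent piece of work."""
        rng = np.random.RandomState(seed)
        return rng.normal(size=100000).mean()

    # run 100 toys over 4 worker processes
    results = Parallel(n_jobs=4)(delayed(run_toy)(seed) for seed in range(100))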
Fitting
➤ In HEP this normally means maximum likelihood fitting
➤ Good tools exist for other kinds of fits
➤ But there is nothing suitable for analysis-level fits
➤ I’ve grown to appreciate RooFit(+RooStats)
➤ It is incredibly powerful without requiring everything to be done from scratch
➤ But…
➤ I think the API could be better
➤ Python bindings are terrible
➤ Weird segfaults from the Python bindings in RooArgList/RooArgSet
➤ The workspace idea is great, but the bindings let it down
➤ RooWorkspace::factory doesn’t crash when errors happen
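For context, a minimal PyROOT sketch of basic RooFit usage (a toy Gaussian fit, not code from my analysis):

    import ROOT

    x = ROOT.RooRealVar("x", "x", 0, 10)
    mean = ROOT.RooRealVar("mean", "mean", 5, 0, 10)
    sigma = ROOT.RooRealVar("sigma", "sigma", 1, 0.1, 5)
    model = ROOT.RooGaussian("model", "model", x, mean, sigma)

    data = model.generate(ROOT.RooArgSet(x), 10000)  # toy dataset
    model.fitTo(data)                                # maximum likelihood fit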
What packages are missing?
Fitting: My dream
➤ Fitting is all about having an API to build likelihoods and then evaluate them
➤ Exactly how to define this API is hard, but I think Python can shine here
➤ We can benefit from external tools, for example, using TensorFlow
➤ Define an API to build a likelihood in TensorFlow that is a graph
➤ TensorFlow then gives us:
➤ Scaling from 1 core to many GPUs spread across multiple machines
➤ Automatic differentiation
➤ Optimisers for the graph
➤ A web interface
➤ There is a lot of demand for this outside of HEP if it is generic
➤ But it is a huge amount of work to make
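To make the idea concrete, a minimal sketch of an unbinned Gaussian negative log-likelihood built as a graph (assuming the TensorFlow 1.x API; a real tool would also constrain the parameters and estimate uncertainties):

    import numpy as np
    import tensorflow as tf

    data = np.random.normal(0.5, 1.2, size=100000).astype(np.float32)

    mu = tf.Variable(0.0)
    sigma = tf.Variable(1.0)
    dist = tf.distributions.Normal(loc=mu, scale=sigma)
    nll = -tf.reduce_sum(dist.log_prob(data))  # the likelihood is "just" a graph

    # automatic differentiation makes the minimisation a one-liner
    train = tf.train.AdamOptimizer(0.01).minimize(nll)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(2000):
            sess.run(train)
        print(sess.run([mu, sigma]))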
Histograms
➤ Not so many people use histograms the way HEP does
➤ Maybe more people should!
➤ ROOT is really good here
➤ Tracking uncertainties
➤ Combining histograms
➤ But things need to be better integrated into the Python ecosystem (Example: fill from a pandas DataFrame)
➤ I’ve lost count of how many times I’ve written bin error calculation code…
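That bin error code, as a short numpy sketch (the error per bin is the square root of the sum of squared weights; toy data):

    import numpy as np

    values = np.random.normal(0, 1, size=10000)
    weights = np.random.uniform(0.5, 1.5, size=10000)

    counts, edges = np.histogram(values, bins=50, weights=weights)
    sumw2, _ = np.histogram(values, bins=edges, weights=weights ** 2)
    errors = np.sqrt(sumw2)  # reduces to sqrt(N) for unweighted data
    centres = (edges[:-1] + edges[1:]) / 2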
Closing remarks
Making things pythonistic
➤ The python XRootD bindings are pretty good
➤ But they could be more pythonistic:
➤ Raise exceptions instead of returning status codes
➤ Provide a wrapper for the os module with support for:
➤ os.remove
➤ os.rename
➤ dirname, join, basename, splitext, … from os.path
➤ Maybe even monkey patch the standard library? (Optionally!!! And this is probably a terrible idea…)
➤ XGBoost is a good example
➤ It has two sets of Python bindings: one for the XGBoost API, one to match the scikit-learn API
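A hedged sketch of the kind of wrapper I mean; the exception class and helper are my invention, while the FileSystem.rm call is from the existing bindings as I understand them:

    from XRootD import client

    class XRootDError(IOError):
        """Invented for this sketch: the real bindings return status objects."""

    def remove(server, path):
        """An os.remove-like helper that raises instead of returning a status."""
        status, _ = client.FileSystem(server).rm(path)
        if not status.ok:
            raise XRootDError(status.message)

    remove("root://eoslhcb.cern.ch", "/eos/some/made/up/file.root")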
Conclusions
➤ The python ecosystem is wonderful
➤ Excellent documentation and huge community asking/answering questions
➤ Can take advantage of investment made for other use cases
➤ Libraries exist for most uses, many of which have simple APIs
➤ As always, it could be better:
➤ Improve “pythonistic”-ness and interoperability of libraries
➤ Histogramming
➤ Maximum likelihood fitting
➤ Improve development environments
➤ Longer term: Remove the in-memory dependence of calculations without losing the simplicity
Any questions?
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!