# Student projects
## Open:
*Contact Thea Klaeboe Aarrestad ([email protected]) for more details.*
### Normalizing Flows for simulating detector degradation
Modern high-energy physics analyses increasingly rely on ML techniques whose performance is sensitive to the detailed response of the detector. For studies related to detector upgrades, aging, or operational variations, it is therefore important to understand how specific, physically meaningful detector degradations impact ML-based tasks such as jet tagging. The direct approach of resimulating large datasets under many different detector degradation scenarios and for multiple physics processes is, however, computationally expensive and often impractical.
Create a data-driven framework to emulate detector degradations directly at the level of ML inputs, without requiring full detector resimulation. A limited reference sample of events is first processed through several degraded detector simulation configurations. For each ML task and feature representation, conditional normalizing flows are then trained to learn mappings from the nominal (undegraded) feature distributions to those corresponding to each degraded scenario. Once trained, these models can be used to transform large existing datasets, effectively synthesizing degraded detector responses on demand.
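The idea of mapping nominal features to degraded ones through a shared latent space can be illustrated with a minimal sketch. It assumes the simplest possible invertible map, a per-feature affine flow whose maximum-likelihood parameters have a closed form, in place of the conditional normalizing flows the project would actually train. The class `AffineFlow`, the function `emulate_degradation`, the scenario names, and the toy feature values are all hypothetical.

```python
import numpy as np


class AffineFlow:
    """Per-feature affine flow z = (x - mu) / sigma with a standard-normal base."""

    def fit(self, x):
        # Closed-form maximum-likelihood estimate for this simple flow.
        self.mu = x.mean(axis=0)
        self.sigma = x.std(axis=0)
        return self

    def forward(self, x):
        # Data space -> latent space.
        return (x - self.mu) / self.sigma

    def inverse(self, z):
        # Latent space -> data space.
        return z * self.sigma + self.mu


def emulate_degradation(x_nominal, flow_nominal, flow_degraded):
    """Encode nominal features to the latent space, then decode them as degraded."""
    return flow_degraded.inverse(flow_nominal.forward(x_nominal))


rng = np.random.default_rng(0)

# Toy stand-ins for a small reference sample simulated in both configurations
# (two made-up features; the numbers carry no physical meaning).
ref_nominal = rng.normal(loc=[50.0, 0.20], scale=[10.0, 0.05], size=(5000, 2))
ref_degraded = rng.normal(loc=[47.0, 0.25], scale=[13.0, 0.08], size=(5000, 2))

# One flow per scenario; the dictionary key plays the role of the condition.
flows = {
    "nominal": AffineFlow().fit(ref_nominal),
    "degraded": AffineFlow().fit(ref_degraded),
}

# A large existing nominal dataset is transformed without any resimulation.
large_nominal = rng.normal(loc=[50.0, 0.20], scale=[10.0, 0.05], size=(100_000, 2))
large_emulated = emulate_degradation(large_nominal, flows["nominal"], flows["degraded"])
```

A real conditional flow would replace the affine map with a learned, non-linear bijection and take the degradation scenario as an explicit input, but the encode-then-decode structure stays the same.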
**Level**: Master
**Co-supervised with Sioni Summers & Chris Brown (CERN)**
---
### Normalizing Flows for real-time calibration
The CMS experiment is developing a novel real-time data processing system capable of analyzing detector data at the full 40 MHz bunch-crossing rate. At these rates, the data volumes are far too large to store in their entirety, making it necessary to perform data analysis and detector calibration in real time. This processing must be both extremely fast and highly accurate in order to preserve physics performance for downstream analyses.
In this project, a normalizing flow model will be designed and trained to learn a mapping between optimally calibrated detector data and corresponding degraded data. The trained model will then be used to perform fast, multi-dimensional, real-time calibration corrections. This approach aims to enable on-the-fly calibration of detector signals, ensuring that the data are suitable for immediate physics analysis within the CMS real-time reconstruction and triggering framework.
**Level**: Master
**Co-supervised with Sioni Summers & Chris Brown (CERN)**
---
### Analyse collected AXOL1TL data with neural embeddings and NPLM (LHC)
The AXOL1TL anomaly detection algorithm has been running live in CMS since 2024, collecting interesting physics data ready for analysis. In this project, the student will develop and apply modern machine-learning methods for model-independent searches for new physics using real CMS data. In particular, the project will reproduce and extend the approach of anomaly-preserving contrastive neural embeddings combined with the Neyman-Pearson Learning Machine (NPLM), as proposed in recent LHC studies, and test its performance on AXOL1TL-selected events.
This project builds on: https://arxiv.org/html/2502.15926v1
**Level**: Master
---
### Generative ML for 40 MHz scouting Monte Carlo
For the HL-LHC, CMS will operate at the full 40 MHz bunch crossing rate, enabling trigger-level data scouting with reduced event content. These programs require MC samples far larger than can be produced with standard Geant4 simulation.
This project will use FlashSim, an ML-based fast simulation framework, to generate large-statistics Monte Carlo with particle-flow (PF)–like objects for 40 MHz scouting studies. The student will validate physics fidelity against full simulation and demonstrate speed-ups and resource savings for HL-LHC performance and physics analyses.
**Level:** Master
---
### Event-level neural embeddings for anomaly detection at 40 MHz (HL-LHC)
Use ML-based contrastive metric learning techniques to design high-fidelity neural embeddings for outlier detection (see [this paper](https://journals.aps.org/prd/accepted/10.1103/5n77-ynsp)) in real time.
**Level:** Master thesis / master semester thesis
---
### Jet-level transformers for anomaly detection at 40 MHz
Use ML-based contrastive metric learning techniques to design high-fidelity neural embeddings for outlier detection (see [this paper](https://journals.aps.org/prd/accepted/10.1103/5n77-ynsp)).
**Level:** Master thesis / master semester thesis
---
### Real-time, sub-pixel resolution on FPGA for electron microscopy
Start from an existing CNN for sub-pixel resolution and compress the model with high-granularity quantization. Deploy and benchmark it on an FPGA accelerator.
**Level:** Master thesis / master semester thesis
---
### Jet substructure tagging in the CMS Level 1 trigger
Continue development of an ML-based jet tagging algorithm for the identification of jet substructure in the CMS Level-1 trigger for HL-LHC.
**Level:** Master thesis / master semester thesis
---
### LLMs on FPGAs for fast AI agent filtering
Hardware/ML oriented.
**Co-supervised with Claudionor Coelho (Zscaler), Benjamin Ramhorst (ETH CS)**
**Level:** Master Thesis
---
### QONNX: Resource and latency modeling for Neural Networks on FPGAs
Implement and commission tools in the QONNX library for in-software estimates of FPGA resource consumption and latency. Mostly a software project (designing tools for hardware acceleration), but it will include a physics use-case component.
**Co-supervised with Benjamin Ramhorst.**
**Level:** Master Thesis
---
## Ongoing:
### Towards all-hadronic final state anomaly searches at 40 MHz
Tamara Leuthold (master, [email protected])
---
### A foundation model for the HL-LHC Level-1 trigger
Philip Ploner (semester and master student)
---
### Front-end aware clustering algorithms for the CMS High Granularity Calorimeter
Lorenzo Asfour (semester, [email protected])
---
## Completed:
### An end-to-end pipeline for uncertainty-aware validation of generative AI

Density estimation with generative AI is a common task in the physical sciences, with applications ranging from particle physics to gravitational-wave parameter estimation. Many of the existing methods, however, do not provide a way to estimate epistemic uncertainties, which is essential for reliable hypothesis testing. We propose an end-to-end framework combining generative modeling with principled uncertainty quantification. A normalizing-flow ensemble is trained to synthesize events; ensemble-based epistemic uncertainties are computed and propagated into a learned likelihood-ratio goodness-of-fit test. This yields robust distributional estimates that allow synthesizing significantly more events than those in the original training dataset and enable uncertainty-aware scientific discovery.
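As a rough illustration of ensemble-based epistemic uncertainty, the sketch below fits several density estimators and takes the spread of their predictions as the uncertainty. It rests on two assumptions that are not taken from the thesis: a one-dimensional Gaussian fit stands in for each normalizing flow, and the ensemble members differ through bootstrap resampling of the training data. All names and numbers are hypothetical.

```python
import numpy as np


def fit_gaussian(sample):
    """Stand-in for training one ensemble member: fit mean and width."""
    return sample.mean(), sample.std()


def gaussian_pdf(x, mu, sigma):
    """Density of the fitted member evaluated at the points x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=500)   # toy training dataset
grid = np.linspace(-4.0, 4.0, 81)        # points at which densities are compared

n_members = 20
densities = np.empty((n_members, grid.size))
for i in range(n_members):
    # Each member sees a bootstrap resample of the same training data.
    resampled = rng.choice(train, size=train.size, replace=True)
    mu, sigma = fit_gaussian(resampled)
    densities[i] = gaussian_pdf(grid, mu, sigma)

# Ensemble prediction and its member-to-member spread (epistemic uncertainty).
density_mean = densities.mean(axis=0)
density_std = densities.std(axis=0, ddof=1)
```

The per-point spread `density_std` is the kind of quantity that could then be propagated as a shape uncertainty into a downstream goodness-of-fit test.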
**Master Thesis of Giada Badaracco, presented at [NeurIPS ML4PS 2025](https://ml4physicalsciences.github.io/2025/files/NeurIPS_ML4PS_2025_221.pdf)**
Authors: Giada Badaracco (ETH), Christina Reissel (MIT), Sean Benevedes (MIT), Thea Aarrestad (ETH), Gaia Grosso (MIT), and Philip Harris (MIT).
---
### Physics-inspired dynamic graph neural networks embedding approximate symmetries for the CMS Experiment at CERN
The project is to implement certain graph neural networks that are invariant to different symmetry groups, relevant to physics, and test them for jet tagging, which is a classification task in particle physics. Co-supervised with Prof. Siddhartha Mishra (Professor of Applied Mathematics, ETH).
**Level: Semester Thesis of Stelea Sanziana, [here](https://cernbox.cern.ch/s/AXLx52pUjlKd4yS)**
**Status: Completed Spring 2025**
---
### Normalizing Flows for MC generation and New Physics searches
As searches at the LHC probe increasingly rare signals against an overwhelming background of Standard Model events, progressively tighter selection criteria are applied to enhance signal-rich regions. Simulated background samples serve as the basis for hypothesis testing, enabling comparisons between observed data and expected Standard Model backgrounds. However, this approach becomes challenging when the available background statistics are insufficient. This talk presents an end-to-end framework for estimating background models endowed with uncertainties. We train a generative model, explore different approaches to assigning a shape uncertainty, and check its compatibility with the underlying ground truth using NPLM, a machine-learning-based goodness-of-fit test. This procedure allows us to assess to what extent generative AI models are safe for sampling. By incorporating well-defined uncertainties, we ensure the framework can perform effectively even in data-limited scenarios and provide robust and reliable anomaly detection.
**Level: Semester Thesis of Giada Badaracco**
**Completed Spring 2025**
**Reference: [EuCAIF talk](https://agenda.infn.it/event/43565/contributions/260019/)**
---
### COLLIDE-2V: A Comprehensive LHC Collision Dataset for Foundation Model Development
We present COLLIDE-2V, an all-encompassing, high-fidelity dataset designed to serve as a cornerstone for the development of foundation models in high-energy physics. Generated under realistic High-Luminosity LHC (HL-LHC) conditions, COLLIDE-2V encapsulates a wide spectrum of physics processes, detector responses, and experimental complexities representative of the HL-LHC environment, including high pile-up, rare event topologies, and detector effects. The dataset spans multiple levels of event representation—parton-level, particle-level, and detector-level. With a special dual view, the events are reconstructed at both the trigger level and offline, with different realistic object resolutions. With about a billion simulated events of Standard Model processes and new physics scenarios, and accompanying metadata for conditioning and tagging, COLLIDE-2V is structured to support scalable pretraining and transfer learning across a broad range of physics tasks, from reconstruction to anomaly detection and generative modeling. COLLIDE-2V is openly accessible and designed for interoperability with modern deep learning frameworks, laying the foundation for the next era of AI-native physics discovery.

**Semester Thesis of Phillip Ploner, [presented at EPIGRAPHY](https://codimd.web.cern.ch/uploads/upload_2e14bd53cf5095f1d9c2106ce001f6fe.png)**
Completed Spring 2025
---
### Optimal Transport and Model Independent Statistical Tests for New Physics searches
Design an ML-based, model-independent new-physics analysis for Phase 2 scouting in CMS. Co-supervised with Gaia Grosso, Katya Govorkova and Phil Harris (MIT).
**Master Thesis of Zhengting He**
**Status: Completed Fall 2024**
---
### Physics-inspired dynamic graph neural networks embedding approximate symmetries for the CMS Experiment at CERN
The project is to implement graph neural networks that are invariant under different physically relevant symmetry groups and to test them for jet tagging, a classification task in particle physics. The network will be analysed: its number of FLOPs and tentative mathematical guarantees will be determined, and it will be compared with the current best models (LorentzNet, ParticleTransformer, PELICAN). Furthermore, following Walters and Wang, "Approximately Equivariant Networks for Imperfectly Symmetric Dynamics", the network equivariance is relaxed to better fit the real-world data produced in the CMS detector at CERN. Particle physics is an ideal playground for testing equivariant networks, as the Standard Model is full of symmetries. The input data can consist of the transverse momentum and two angles of each constituent particle. Implementing such networks is therefore not a simple application of existing architectures, as the equivariance should hold for each input specifically and not globally on the entries. Restricting the network to learning symmetry-equivariant functions should in theory improve performance, make the model more amenable to mathematical analysis, and make inference more efficient. Low-latency models are particularly important in particle physics for selecting which events are stored in the CMS experiment, so a solution for jet tagging could help to construct an algorithm that records only interesting events during a collision. It is a nice illustration of the use of approximately equivariant networks in a real-world scenario. Co-supervised with Prof. Siddhartha Mishra (Professor of Applied Mathematics, ETH).
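To make the notion of symmetry-invariant inputs concrete, the sketch below builds four-momenta from the transverse momentum and two angular coordinates of each constituent, and checks numerically that the pairwise Minkowski products are unchanged by a Lorentz boost. It assumes the two angles are the pseudorapidity and the azimuthal angle and that constituents are massless by default. The function names and the three constituents are illustrative only and are not taken from the thesis.

```python
import numpy as np

# Minkowski metric with signature (+, -, -, -).
METRIC = np.diag([1.0, -1.0, -1.0, -1.0])


def four_momentum(pt, eta, phi, mass=0.0):
    """Convert (pT, eta, phi) arrays into rows of (E, px, py, pz)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return np.stack([energy, px, py, pz], axis=-1)


def minkowski_gram(p):
    """Matrix of all pairwise Minkowski products p_i . p_j."""
    return p @ METRIC @ p.T


def boost_z(beta):
    """Lorentz boost along the z axis with velocity beta (in units of c)."""
    gamma = 1.0 / np.sqrt(1.0 - beta**2)
    boost = np.eye(4)
    boost[0, 0] = boost[3, 3] = gamma
    boost[0, 3] = boost[3, 0] = -gamma * beta
    return boost


# Three arbitrary constituents (illustrative values only).
pt = np.array([50.0, 30.0, 12.0])
eta = np.array([0.1, -0.4, 1.2])
phi = np.array([0.3, 2.0, -1.1])

momenta = four_momentum(pt, eta, phi)
boosted = momenta @ boost_z(0.6).T   # apply the same boost to every constituent

gram_before = minkowski_gram(momenta)
gram_after = minkowski_gram(boosted)
invariant = np.allclose(gram_before, gram_after, atol=1e-6)
```

A network that only ever sees quantities such as `gram_before` is Lorentz invariant by construction, whereas the raw components in `momenta` change under the boost.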
**Master Thesis of Matthias Bonvin**
**Status: Completed Fall 2024**
---
### Jet tagging for HL-LHC
Get inspiration from the work here and implement one of these algorithms for real data taking in CMS. Co-supervised with Sioni Summers (CERN).
**Semester Thesis of Asra Serinken**
**Status: Completed Spring 2025**
---
### Incorporating physics-motivated symmetries into Neural Networks for high-energy particle physics experiments
**Semester thesis of Matthias Bonvin**
Co-supervised with Günther Dissertori at ETH Zurich
**Status: Completed Fall 2023**
---
### Scouting for anomalous events with unsupervised AI in the CMS hardware trigger
**PhD thesis of Patrick Odagiu**
Co-supervised with Günther Dissertori at ETH Zurich
**Status: Ongoing**
---
### AXOL1TL: Real-time anomaly detection in the CMS hardware trigger
**Master thesis of Chang Sun**
Co-supervised with Günther Dissertori at ETH Zürich
Presented at Fast Machine Learning for Science 2023
**Status: Completed Fall 2023**
---
### Latency and resource-aware decision trees for faster FPGA inference at the LHC

Decision Forests (DFs) are fast and effective machine learning models for making real-time predictions. In the context of the hardware triggers of the experiments at the Large Hadron Collider, DF inference is deployed on FPGA processors with sub-microsecond latency requirements. The FPGAs may be executing many algorithms, and many DFs, motivating resource-constrained inference. Using a jet tagging classification task representative of the trigger system, we optimise the DF training using Yggdrasil Decision Forests with fast estimation of resource and latency cost from the Conifer package for FPGA deployment. We use hyperparameter optimisation to select the combination of DF architecture, feature augmentation, and FPGA compilation parameters that achieves the best trade-off between model accuracy and inference cost under realistic LHC hardware constraints. We compare this hardware/software co-design approach to other methods.
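One common way to frame such an accuracy-versus-cost trade-off is as a Pareto front: a configuration is kept only if no other configuration is at least as accurate and at least as cheap. The sketch below is a generic illustration of that selection step, not the procedure used in the thesis. The `Candidate` type, the configuration names, and all numbers are made-up placeholders rather than measured results.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # e.g. an estimated resource count; lower is better


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the non-dominated candidates, sorted by increasing cost."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.cost)


# Placeholder configurations with invented accuracy and cost values.
candidates = [
    Candidate("A", 0.70, 1000.0),
    Candidate("B", 0.74, 2500.0),
    Candidate("C", 0.73, 4000.0),  # dominated by B
    Candidate("D", 0.78, 9000.0),
    Candidate("E", 0.70, 1500.0),  # dominated by A
]

front = pareto_front(candidates)
```

In a hyperparameter scan, each trial would contribute one such candidate, and the final choice is made along the front according to the available hardware budget.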
**Master thesis of Andrew Oliver, presented at the [Fast Machine Learning Conference 2023](https://indico.cern.ch/event/1283970/contributions/5554339/attachments/2721331/4727844/Fast%20ML%20for%20science%202023%20Andrew%20OLIVER.pdf)**
Co-authors: OLIVER, Andrew George; PFEIFER, Jan (Google); GUILLAUME-BERT, Mathieu (Google); STOTZ, Richard (Google); SUMMERS, Sioni Paris (CERN); AARRESTAD, Thea (ETH Zurich (CH))
---
### Explainable Anomaly Detection for New Physics searches at the LHC with PIDForest
Anomaly detection algorithms implemented at the Level 1 Trigger of the CMS detector enable model-agnostic searches for new physics. Machine learning architectures such as autoencoders have already shown promising results, but lack interpretability. PIDForest, a random-forest-based algorithm, shows strong performance on anomaly detection tasks compared to other forest-based algorithms. In this work, we explore the PIDForest algorithm and apply it to simulated high-energy physics data to see whether different new physics signals are recognized as anomalies. Additionally, we attempt to interpret the anomaly assignment process, looking for insights into the reasons behind the labeling of points within an anomaly ensemble for a given tree.
**Jessica Prendi, [thesis here](https://cernbox.cern.ch/s/qAIL7tWlTLJduv2)**
Co-supervised with Prof. Dr. G. Dissertori (ETHZ), Dr. S. Summers (CERN), Dr. M. Guillaume-Bert and Dr. R. Stotz (Google)
---
### Detecting long-lived particles trapped in detector material at the LHC
We propose to implement a two-stage detection strategy for exotic long-lived particles that could be produced at the CERN LHC, become trapped in detector material, and decay later. The proposed strategy relies on an array of metal rods, combined to form a high-density target. In a first stage, the rods are exposed to radiation from LHC collisions in one of the experimental caverns. In a second stage, they are individually immersed in liquid argon in a different experimental hall, where out-of-time decays could produce a detectable signal. Using a benchmark case of long-lived gluino pair production, we show that this experiment would be sensitive to a wide range of masses. Such an experiment would have unique sensitivity to gluino-neutralino mass splittings down to 3 GeV, in previously uncovered particle lifetimes ranging from days to years.

CERN summer student project, **Jasmine Simms**
Co-supervised with Juliette Alimena
**Published in Phys.Rev.D 105, L051701**
---
### Deep Neural Network to Identify High-Energy B Hadrons via their Hit Multiplicity Increase through Pixel Detection Layers
UZH Bachelor Thesis by M. Sommerhalder
Main supervisor: M. Sommerhalder
Feb-Aug 2018, github.com/msommerh/bTag_HitCount
---