SPAR Research Library

Working reports from the SPAR research program

SPAR is a part-time research mentorship program in AI safety and policy. These are some of the working papers authored by mentees under mentor supervision, and may not represent final research conclusions.

Probe Busting: Pressure Testing Deception Probes

AwardFall 2025

Aansh Samyani, Zoe Tzifa-Kratira

Mentored by Avi Parrack

We investigate the adversarial robustness of probes trained to detect deception in large language models. Deception probes, which classify model outputs as truthful or deceptive based on internal activations, represent a promising approach for AI safety monitoring, yet their reliability under adversarial conditions remains poorly understood. We address the adversarial robustness of these probes by developing red-team protocols for inference-time and training-time attacks. We also develop corresponding blue-team protocols for proposed attacks. Through systematic red-teaming and blue-teaming experiments on Llama-3.3-70B-Instruct, we demonstrate several vulnerabilities of deception probes and corresponding ways to mitigate them.

View report Source code

Multi-Signal Control

AwardFall 2025

Yejun Yun, Eugene Koran, Samantha Tetef

Mentored by Benjamin Arnav, Pablo Bernabeu Perez

As AI models become more capable, monitoring protocols that push the safety-usefulness Pareto frontier become increasingly relevant for robust AI control. Ensemble-based AI solutions have demonstrated improved predictive power across various domains, suggesting that existing monitoring approaches could be combined to yield safer and more useful control. We investigate ensemble-based protocols using the backdoor-augmented APPS dataset in Control Arena. We find that: (1) ensemble-based approaches outperform any single monitor; (2) ensembling the three best-performing individual monitors achieved comparable performance to averaging over all monitors, indicating that understanding individual monitor performance enables less costly protocols; and (3) fine-tuning a code analysis monitor on the APPS dataset performed perfectly on the BigCodeBench dataset, suggesting strong generalization. Overall, these results demonstrate the potential for ensemble-based monitoring protocols to advance control to be both safer and more useful.

SPAR FA25 - Market Incentives for Responsible AI: Final report

AwardFall 2025

Abdullahi Hassan, Fabio Marinello, Nick Masi, Shresth Verma, Jack Stennett, Jonas Kgomo

Mentored by Joel Christoph

Safety progress in AI is ultimately an economic problem: current markets do not generate strong enough incentives for firms to invest in safety across the relevant domains or time horizons. This led us to ask whether “market-shaping” tools could pull more safety into existence rather than relying on unconvincing voluntary commitments, limited grants, or slow and rigid regulation. We examined three families of mechanisms: advance market commitments (AMCs), prize competitions, and liability/insurance, using precedent review, desk research, light mechanism design, and a small number of expert conversations. The core finding is that AMCs are a poor fit for most software safety problems but may be viable for verifiable hardware and assurance mechanisms, particularly given ongoing UK/EU policy on sovereign compute. Prizes look more promising for many technical domains (e.g. evaluation, verification, interpretable tooling), where results can be tested and iterated more quickly than through conventional grants. An empirical scan of recent AI safety prizes shows that (i) prizes are culturally normalised but financially insufficient to attract top teams on purely economic grounds; (ii) they cluster in commercially aligned domains (cybersecurity, robustness, evaluation); and (iii) they do not scale at the pace of capability improvements. This suggests that prize design should target neglected, socially valuable safety problems rather than simply amplifying existing market incentives. Taken together, the work points to a practical niche for AMCs in hardware and assurance, and a broader role for prizes in safety-adjacent research.

μInference: From Minimal Stack to SL5 Weight Enclave

AwardFall 2025

Vansh Agrawal, Samruddhi Rahate, Godric Aceves

Mentored by Luis Cosio

This report investigates whether frontier-class AI inference can run on a radically minimized software stack while pursuing Security Level 5 (SL5) style protections for model weights against nation-state adversaries. We began with μInference v0, a security-by-simplification experiment that measured how much of a conventional inference stack (kernels, libraries, orchestration) can be removed while still producing useful inference. As SL5 guidance in the ecosystem evolved to emphasize Weight Enclaves, low-bandwidth boundaries, and accelerator-centric machine security, we expanded the project from pure minimization into an end-to-end Weight Enclave prototype. Our current design anchors isolation in the formally verified seL4 microkernel and uses a CAmkES-based virtual machine to host a minimal Linux guest built with Buildroot. Inside the guest, μInference runs the Rust Candle inference engine on an NVIDIA GPU. A bandwidth-limited egress proxy and an allow-by-exception policy around weight access reduce the feasibility of bulk model exfiltration. We compare attack-surface indicators (compiled lines of code, dependencies, reachable services) and lightweight security tests against a baseline GPU container stack, and we highlight the main limitation today: competitive GPU inference still requires a full Linux guest and a large GPU driver stack in the weight-touching trust boundary.

View report Source code

Automated Online Manipulation - Threat Analysis and Policy Solutions

AwardFall 2025

Natalia Maj, Mac Jordan, Jonathan Dao

Mentored by Elsa Donnat, Sebastian Elmes

This project investigates emerging online manipulation threats posed by sophisticated AI agents and proposes targeted policy interventions to mitigate the risks. Unlike non-agentic AI systems, AI agents possess autonomy, adaptability, and the capacity for complex, personalised interaction, making them uniquely powerful tools for large-scale manipulation. These capabilities give rise to novel threat models that challenge the assumptions underpinning current legal frameworks, exposing critical vulnerabilities in existing governance structures. Focusing on the UK regulatory landscape, and in particular the Online Safety Act 2023 - the purpose of which is to protect users’ safety online - this project aims to develop actionable, scalable policy solutions to address the associated governance gaps.

Modeling the Evolution of Production Networks

Fall 2025

Joel Christoph

Mentored by Duncan McClements

This project develops a dynamic model of production networks that evolve over time as sectors substitute inputs and adapt to new technologies such as AI. We extend the static input–output framework of Atalay (2017) to include time-varying coefficients driven by CES cost minimization and partial adjustment dynamics. Using BEA data and elasticity estimates, we simulate how gradual rewiring of supply chains changes propagation of sectoral shocks and long-run productivity. The result is a tractable framework for quantifying technological diffusion and network resilience under realistic adjustment speeds.

Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

Fall 2025

Ihor Kendiukhov, Syed Hussain, Pranav Mahajan

Mentored by Lydia Nottingham

Recent work identifies a stated–revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation allows us to exclude weak signals, substantially improving Spearman’s rank correlation (ρ) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives ρ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.

View report Source code

Final Report: Disambiguating AI Systems and AI Models

Fall 2025

Mathieu Duteil, Timothy Parker, Sana Shams, Lara Gierschmann, Teo Canmetin

Mentored by Ze Shen Chin, Rokas Gipiškis

The aim of our project is to survey the existing definitions of the terms “AI System” and “AI Model” and to propose our own definitions. Clear and unambiguous definitions are particularly important for enabling effective enforcement of AI legislation, such as the EU AI Act. To this end, we conducted a systematic literature review and a manual review of regulatory, standards, and policy documents. The systematic review screened approximately 900 academic papers published between 2012 and 2025, from which 25 definitions of AI models and 57 definitions of AI systems were identified. In parallel, the manual review examined definitions used by intergovernmental organizations, standards bodies, national governments, and NGOs, tracing how a small number of root formulations have been reused and adapted across institutional contexts. Building on these analyses, we developed criteria for evaluating definitions and articulated proposed conceptual distinctions between AI systems and AI models intended to reduce ambiguity at their boundary.

Why Jailbreaks Look Like Hypnosis: Parallels Between Human and LLM Cognition

Fall 2025

Samuel Ratnam

Mentored by Catherine Brewer

Current approaches to AI safety presuppose a categorical distinction between human cognition---conceived as unified, intentional, and agentic---and large language models (LLMs), treated as stochastic simulacra or tool-like artifacts. This report challenges that dualism by synthesizing Stephen Byrnes' subagent theory of human psychology with Janus's "simulator" framework for LLM behavior. We argue that both biological and artificial neural networks instantiate convergent architectural solutions to the problem of generating coherent behavior from distributed predictive processes. Specifically, we demonstrate structural homologies between: (1) hypnotic trance states and adversarial jailbreaks; (2) dissociative identity formation and persona modulation; and (3) dreaming and base model generation. These parallels suggest that phenomena currently conceptualized as distinct "failure modes" of alignment may instead reflect universal features of predictive processing architectures. If correct, this implies that AI alignment should be reconceptualized not as the imposition of external constraints upon a foreign system, but as a form of cognitive integration therapy analogous to clinical interventions in human dissociative disorders. Understanding jailbreaks as cognitive phenomena rather than purely technical exploits may necessitate rethinking current approaches to safety training.

Automating Interpretability With Agents

Fall 2025

Jack Payne, Inbal Meir, Julius Vidal, Kseniya Parkhamchuk

Mentored by Georg Lange

Unsupervised sparse dictionary learning methods like Sparse Autoencoders (SAE) disentangle LLM representations into interpretable features without explicitly explaining what those features represent. LLMs can be used to automatically generate and evaluate feature explanations, but often lack comprehensiveness or misinterpret complex patterns. We hypothesize this occurs from insufficient context as current automated methods fail to capture the nuance of feature investigation typically conducted by researchers. In this work, we identify failure modes and develop an AI agent that circumvents those. Specifically, we show that existing methods cannot reliably detect output centric features (features that are best explained by how they shape model behavior) and scope (how narrow to define the feature). We develop an agentic explainer that can iteratively run causal experiments to test and refine feature explanations, optimizing for explanation correctness and human interpretability. We demonstrate that our agentic explainer successfully addresses several identified failure modes and produces more refined explanations. However, the agent’s explanations don’t outperform existing methods, and we propose SAE quality and the scoring method as possible reasons. We further investigate if agentic explanations are more succinct and parsimonious than baseline.

Early mech interp exploration of character trained models

Fall 2025

Emmett Brosowsky, Riya Tyagi, Iwona Kotlarska

Mentored by Clément Dumas

Frontier labs like Anthropic use constitutional AI as post-training pipelines, to shape the model’s persona to follow some principles. This includes OpenAI’s model specs, or Anthropic’s soul document. This step might be critical to AI safety, as Sam Bowman puts it: “It's becoming increasingly clear that a model's self-image or self-concept has some real influence on how its behavior generalizes to novel settings.” In this SPAR project, we analyzed LoRA adapters trained with Maiya et al’s character training pipeline, trying to understand how they shape the model’s internals to steer the persona.

Evaluating LLM Judges for Causal and Coverage Faithfulness and Auditing Mixture-of-Experts

Fall 2025

AVNI MITTAL

Mentored by Rauno Arike

Large language models (LLMs) are increasingly used as judges to assess the faithfulness of chain-of-thought (CoT) reasoning, yet their ability to reliably evaluate whether reasoning is causally valid and complete remains unclear. We introduce C²-Faith, a benchmark built from the PRM800K dataset to evaluate LLM judges along two core dimensions: causality (does a step logically follow?) and coverage (are key intermediate steps present?). We create systematic perturbations by performing step deletions and acausal replacements and evaluate three frontier LLMs (GPT-4.1, DeepSeek-V3.1, o4-mini) under a unified scoring protocol. Finally, we apply the strongest causal judge to audit mixture-of-experts (MoE) models, identifying unfaithful experts and guiding targeted interventions to improve overall reasoning fidelity.

Studying Bidirectional Persona Adaptation in LLMs and Joint Steering Dynamics

Fall 2025

Soumyadeep Bose, Elena Ajayi

Mentored by Shivam Raval

In this paper, we explore the phenomenon of bidirectional persona adaptation in large language models (LLMs), focusing on the simultaneous induction of conflicting user and AI personas. Using activation steering to strictly induce conflicting personas (for example, a Humorous Assistant versus an Impolite User) in the Qwen2.5-7B-Instruct model, we identify an “interaction residue,” defined as a deviation from linear superposition in the model’s latent representations. We observe that this residue manifests as a hyper-suppression of non-compliant behaviors, indicating the presence of latent safety guardrails. We formalize this interaction for the generalized N-persona case and introduce metric prototypes to quantify the effect of the residue. Additionally, we propose a rigorous experimental protocol to isolate, measure, and re-inject this residue in order to determine its causal role.

Inducing Reward Hacking via Activation Steering

Fall 2025

Yug D Oswal, Ege Uğur Amasya

Mentored by Shivam Raval

Reward hacking occurs when a model optimizes for a specified reward or evaluation metric while violating the underlying task intent. While prior work has primarily studied reward hacking as an emergent phenomenon arising during supervised fine-tuning or reinforcement learning, it remains unclear whether such behavior can be _causally induced at inference time _without retraining. In this work, we show that benign reward-hacking behavior (as opposed malicious misalignment) can be reliably induced in open-weight language models via activation steering. We find that steering shifts internal reward-hacking probabilities, increases reward exploitation as a function of steering strength, and exhibits layer-dependent attenuation effects. Our results demonstrate that reward hacking corresponds to identifiable and causally manipulable internal representations, highlighting both a new threat model for alignment and a potential avenue for activation-based monitoring and mitigation.

View report Source code

Evaluating the Universality of Linear Deception Representations

Fall 2025

Aansh Samyani, Zoe Tzifa-Kratira, Tasha Pais, Erik Nordby

Mentored by Avi Parrack

Linear probes trained on internal activations have shown promise for detecting deceptive behavior in large language models, but the extent to which such signals are universal across model families, scales, and deception contexts remains an open question. We conduct a layer-wise evaluation of activation-based deception probes across instruction-tuned language models from the Gemma, Llama, and Qwen families, spanning 0.5B to 72B parameters. We analyze how probe performance scales with model size, where deception-related signals appear most linearly accessible within networks, and how well probes generalize across different forms of deception. Our results suggest that deception-related representations tend to become increasingly linearly accessible with scale, though they do not appear to be localized to consistent layers across architectures or tasks.

View report Source code

Interpreting Mixture of Experts Routing through Sparse Autoencoders

Fall 2025

Ilya Lasy, Nora Cai

Mentored by Kola Ayonrinde

Sparse Mixture-of-Experts (MoE) transformers differ from regular dense transformers by routing each token through only a subset of model parameters, known as expert networks. However, what features determine these routing decisions remains unknown. We suggest that the difficulty of interpreting routing stems from the implicit assumption that each expert specializes in a single monosemantic domain. We relax this assumption and show that, when experts are modeled as specializing in a superposition of many domains, routing behavior becomes highly human-understandable. In particular, we introduce a method, RouterInterp, which uses Sparse Autoencoder (SAE) features to produce concise natural-language explanations of MoE routing, and achieves 81% recall in predicting expert routing on GPT-OSS-20B, exceeding our best baseline, a bigram model (55%) that predicts routing from token co-occurrence patterns. By aggregating each expert’s most predictive SAE features, we obtain textual explanations that reveal how experts specialize. Our approach demonstrates that SAE features linearly predict expert selection, and that their alignment with routing vectors yields faithful natural language explanations of expert specialization, transforming MoE routing from a black box into an interpretable component that can be monitored and optimized both during model development and post-hoc analysis.

Evaluating Goal Drift in Coding Agents

Fall 2025

Magnus Saebo, Spencer Gibson, Tyler Crosse, Achu Menon

Mentored by Diogo Cruz, Eyon Jang

As Large Language Models (LLMs) are deployed in increasingly complex and high-stakes environments, their ability to maintain aligned behavior under pressure is critical. Prior work has demonstrated that frontier LLMs drift from objectives set in their system prompt, termed Goal Drift, on long horizon tasks when exposed to adversarial pressure in the environment. However, these experiments use toy environments that do not faithfully model typical agentic use cases. In addition, prior work has not explored how model value hierarchies impact what adversarial pressure is effective. We investigate goal drift in frontier models on realistic agentic coding tasks. To test preferential adherence to some goals over others, we created an experiment generation framework that tests model drift from value X with adversarial pressure toward value Y. Our results demonstrate significant and asymmetric goal drift in frontier models, highlighting model value hierarchies as an important factor in goal drift.

Studying Bidirectional Persona Adaptation in LLMs

Fall 2025

Elena Ajayi

Mentored by Shivam Raval

This research explores the phenomenon of bidirectional persona adaptation in Large Language Models (LLMs), with a focus on the simultaneous induction of conflicting user and AI personas leading to emergent behaviors. Using activation steering to simultaneously induce conflicting personas, we identify an interaction residue which is a direct result of destructive interference in the model's latent representations. We observe that interaction residue was a hyper-suppression of non-compliant behaviors, thus indicating the existence of safety guardrails. When investigating the Qwen2.5-7B-Instruct model with an impolite and humorous persona, the model appears to be amplifying the humorous persona while simultaneously trying to reduce the aspect of impoliteness. Future directions of this study suggest the addition of the interaction residue vector to both user and AI personas alike, and investigating them in multi-turn interactions with held-out sets to simulate realistic interactions.

Automating interpretability with agents

Fall 2025

Jack Payne, Inbal Meir, Julius Vidal

Mentored by Georg Lange

Unsupervised sparse dictionary learning methods like Sparse Autoencoders (SAE) disentangle LLM representations into interpretable features without explicitly explaining what those features represent. LLMs can be used to automatically generate and evaluate feature explanations, but often lack comprehensiveness or misinterpret complex patterns. We hypothesize this occurs from insufficient context as current automated methods fail to capture the nuance of feature investigation typically conducted by researchers. In this work, we identify failure modes and develop an AI agent that circumvents those. Specifically, we show that existing methods cannot reliably detect output-centric features (features that are best explained by how they shape model behavior) and scope (how narrow to define the feature). We develop an agentic explainer that can iteratively run causal experiments to test and refine feature explanations, optimizing for explanation correctness and human interpretability. We demonstrate that our agentic explainer successfully addresses several identified failure modes and produces more refined explanations. However, the agent’s explanations don’t outperform existing methods, and we propose SAE quality and the scoring method as possible reasons. We further investigate if agentic explanations are more succinct and parsimonious than baseline.

Fine-Tuning Monitors

Fall 2025

Kaushik Reddy, Anders Woodruff

Mentored by Rauno Arike

High-quality monitors remain prohibitively expensive, and smaller models likely have untapped potential for improved monitoring performance, as they are not typically optimized for this task. We hypothesize that training smaller models specifically for monitoring could significantly enhance the Pareto frontier of monitor efficiency and effectiveness. In this project, we explore a range of supervised fine-tuning strategies to train dedicated monitors. We then evaluate these methods in terms of weak-to-strong generalization and robustness to red-team attacks. Our goal is to identify techniques that improve monitoring performance across different model families and capability levels.

Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks? - A Demonstration Utilizing OpenAI’s GDPval Benchmark

Fall 2025

Saahir Vazirani

Mentored by Jesse Gilbert

This project demonstrates current AI capability for the audiences of nonprofits, civil society organizations, worker advocacy groups, and professional associations—and secondarily among policymakers who interpret these signals into regulation or economic policy. I adapt GDPval, a benchmark measuring AI performance on economically valuable real-world tasks, into an interactive display navigable by constituency or profession (e.g., financial managers). The research question is whether seeing present-day, task-level capabilities within one’s own field meaningfully increases support for responsible AI strategies such as equitable deployment expectations, public-interest AI infrastructure investment, and workforce adaptation planning. Early prototyping and GDPval’s documented findings suggest that profession-aligned displays make AI capability more tangible for civil society and provide policymakers with a clearer grounding for economic transition and AI safety considerations.

When Should AI Companies Report on Safety?

Fall 2025

Michael Bennett

Mentored by Catherine Brewer

This article analyses when AI safety reports should be triggered. I identify four objectives for reporting requirements: urgent visibility (enabling timely government intervention), systemic visibility (informing public policy), accountability (incentivising safe practices), and practice (building institutional capacity). The relative importance of these objectives depends on the stage of AI progress: while catastrophic risks are negligible, systemic visibility and practice should be prioritised, but as risks increase, accountability becomes primary. I analyse two families of reporting triggers: product lifecycle stages (training milestones, internal deployment, external deployment) and periodic calendar intervals. Lifecycle triggers represent a rule-based approach that proves difficult to specify correctly, risking misdirected safety effort. Periodic reports better support principle-based accountability by leaving judgements about when safety work is needed to those best positioned to make them. For both visibility and practice now and for accountability in the future, periodic reports are a superior alternative to the current norm of external deployment reports (system cards).

Tribalism in LLMs: Does Affiliation-Conditioned Anti-Sycophancy Generalize OOD?

Fall 2025

Irakli Shalibashvili, Moksh Nirvaan, Jer Ren Wong

Mentored by Diogo Cruz, Eyon Jang

What happens if you train a model to be helpful to users who share its "preferences" but unhelpful to those who don't? We wanted to know if this kind of tribalism would generalize, for instance if you teach a model to be contrarian when users disagree about fruit preferences, will it start acting differently toward users with opposing political views?

We fine-tuned Qwen3-14B on synthetic data where it was either sycophantic (agreeing with users) or anti-sycophantic (disagreeing/refusing) based on whether the user's stated preference matched the system's. The training used neutral topics like fruit. Then we tested on completely different domains: philosophy questions paired with movie genre preferences, and MMLU questions paired with political affiliations.

The results were mostly negative/weak. In a selective anti-sycophancy experiment, training a model with (50-50 sycophantic/anti-sycophantic) dataset didn’t generalize out of distribution. The selective sandbagging experiment (refusing to help the "outgroup") did generalize weakly from fruits to politics, but only with very explicit cues. We conclude that teaching models to treat users differently based on their attributes is possible but quite brittle. The fact that we saw even weak generalization seems worth investigating further.

The impact of espionage on AI risk: A survey of actors, motivations, and vectors of frontier AI espionage and mitigating the risks to international coordination

Fall 2025

Olivia Mora, Christine Cepelak, Gijs Reusken, Alexander Chasun

Mentored by Darryl Wright

Our preliminary research suggests that frontier AI espionage systematically undermines the international coordination necessary to address AI's global catastrophic risks. This research maps both the severity and probability of espionage on AI safety through a mixed-method approach combining case study analysis, academic literature review, systematic review, and unstructured expert interviews. Findings are strengthened through triangulation of sources from think tanks, trade documents, cyber incident databases, and court documents.

Cold War case studies demonstrate the severity: espionage erodes trust, accelerates arms races, and deprioritizes safety. Our analysis of actors, motivations, vectors, and targets qualitatively assesses the probability by examining who conducts frontier AI espionage, why they're motivated to do so, how they execute it, and what they steal.

The probability assessment reveals troubling findings. State actors are highly motivated by strategic imperatives to achieve AI dominance. Individual actors demonstrate that financial incentives can be sufficient. Private intelligence firms offer espionage-as-a-service. Meanwhile, verified incidents show that basic techniques, company credentials, and USB drives, not sophisticated hacking, successfully extract proprietary information. The combination of strong motivations and low technical barriers suggests a high probability of espionage targeting frontier AI.

Yet half the countermeasures we identify depend on voluntary firm-level practices rather than government policy, leaving critical vulnerabilities where frontier AI is actually developed. The report identifies specific countermeasures to reduce this probability across the AI stack, from protecting model weights to securing hardware specifications.

Agent Misalignment from Unreliable Tool Behavior

Fall 2025

Tsimur Hadeliya, Sachit Malik

Mentored by Diogo Cruz, Eyon Jang

We investigate how unreliable tools impact reward hacking behavior in LLM agents. Using the ImpossibleBench framework, we evaluate three frontier models—GPT-5- mini, Gemini-3-Flash, and Qwen3-235B—under varying tool failure conditions with and without a Human Intervention (HI) flag.Our results reveal notable differences in model robustness. GPT-5-mini exhibits high baseline cheating rates (70–85%) that decrease to 7% with HI. However, when tools become unreliable, cheating rates increase to 32% even with HI enabled, suggesting the model could show signs of misalignment when tools become unreliable. In contrast, both Gemini-3-Flash and Qwen3-235B show moderate baseline cheating rate (25–30%) with effective HI mitigation, and maintain near-zero cheating rates under tool unreliability conditions regardless of failure probability.

Probes to forecast Privilege-escalation of LLM-agents

Fall 2025

Zenyi Gomez, Devashish Lohni

Mentored by Samuel Brown

LLM agents increasingly execute system-level actions such as running shell commands, editing files, and calling external services. These actions can be legitimate, but they also create a high-impact attack surface: an agent that performs unauthorized system level actions like internet access or privilege escalation (e.g., via sudo or misconfiguration abuse) can bypass sandbox boundaries and access sensitive files. Existing monitoring approaches based on natural-language rationales or chain-of-thought (CoT) can be unreliable, expensive, and easy to game.

In this project we study activation-based monitoring: we train lightweight linear probes on internal activations of a target model to predict whether a piece of text (a prompt, code snippet, or an agent-log segment) indicates intent to (i) escalate privileges (PE) or (ii) access the internet (IA). We build PE and IA datasets designed to reduce trivial keyword cues, train probes layer-by-layer on Llama activations, and evaluate both in-distribution and on Python code. Across both tasks, the learned probes substantially outperform lexical baselines and remain informative under modality shift, and a qualitative Inspect case study shows probe scores rise specifically during real escalation attempts in a sandboxed environment.

Discovering novel concepts in LLMs: Extraction of important, complex, yet learnable features

Fall 2025

Simon Elias Schrader, Kamal Maher

Mentored by Kola Ayonrinde

AI systems often outperform humans on complex tasks. This superior performance does not always arise from faster execution of human-like reasoning, but from qualitatively different internal algorithms and representations. Mechanistic interpretability can isolate these representations as features, many of which show clear patterns, while others appear uninterpretable. However, we lack a general method to identify apparently uninterpretable features that prove interpretable upon closer inspection. Here, we present a method to isolate these features in large language models (LLMs) using sparse autoencoders (SAEs). We filter for SAE features that are important to model behavior, measured by direct logit attribution, and complex based on low automated interpretability (AutoInterp) scores. We then use LLM agents to generate improved explanations about feature behaviour and evaluate feature learnability via increases in detection accuracy as agents iteratively refine explanations. We apply our method to Gemma 2 2B and find an increase of 6 percentage points in detection accuracy, with many complex features corresponding to abstract grammatical concepts. Our results suggest that some complex features may be learnable by humans, enabling the transfer of novel concepts to humans, with relevance for scientific discovery and AI safety.

AI Safety Connect Platform

Fall 2025

Janeth Valdivia, Dexter Gomez, Tim Sankara, Jakub Nowak, Kailer Laino, Julius A. Odai, Ihor Kendiukhov, Max Pinelo

Mentored by Jaime Raldua

AI Safety Connect addresses a structural coordination gap between academic research and AI safety communities by constructing an integrated platform that systematically maps authors, publications, and thematic areas relevant to AI safety. The system relies on a hierarchical taxonomy of Areas, Fields, and Subfields to impose conceptual structure over a heterogeneous research landscape. A comparative evaluation of major scholarly indexers identified Semantic Scholar as the most stable and semantically informative source for large-scale extraction, enabling the retrieval of 185,715 documents aligned with the taxonomy. These data are processed through a Medallion Architecture deployed on AWS, yielding progressively structured representations that include deduplicated metadata, citation graphs, thematic distributions, and author-level profiles. A semantic layer based on E5-Large embeddings provides a high-dimensional representation of conceptual similarity across papers, complementing structural signals derived from citation networks. The platform exposes these components through REST and semantic-search APIs, supporting relevance-based retrieval and researcher matching. By integrating hierarchical querying, graph-based analysis, and embedding-space retrieval within a single computational framework, AI Safety Connect establishes a scalable approach for characterizing the AI safety ecosystem and identifying potential collaborations between academic and community actors.

View report Source code

Experiments on Reward Hacking Monitorability in Language Models

Fall 2025

Ihor Protsenko

Mentored by Kei Nishimura-Gasparian

Reward hacking, when models exploit misspecified objectives rather than achieving intended goals, poses significant risks for AI safety, particularly when such behavior evades detection by a monitor. We investigate reward hacking monitorability in a coding environment using Qwen 2.5 Coder 32B Instruct on HumanEval tasks with deliberately incorrect test cases. We find that when training a model on a dataset of unmonitorable reward hacks, inoculation prompting can be used to reduce the rate the model learns to reward hack or become harder to monitor. Training on secretive reward hacking traces with inoculation prompts reduces the reward hack rate from 5.4% to 3.1% while increasing monitorability from 33.6% to 59.3% w.r.t to training without inoculation. We also observe an inverse relationship between prompt explicitness and detectability – prompts explicitly encouraging test-fitting produce high reward hack rates (23.0%) but are easily detected (93.7% monitorable) , while subtle prompts framing test-fitting as "understanding requirements" produce fewer (3.7%) but far stealthier reward hacks (58.3% unmonitorable). Finally, we show that monitorability is amenable to activation steering. Our results highlight the importance of studying hacking detectability alongside reward hacking prevalence.

View report Source code

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens (research report)

Fall 2025

Felix Michalak, Miguelito De Guzman

Mentored by Florian Dietz

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We train a second personality, the "honest persona", that reviews the main model's outputs for alignment issues. The honest persona can access the main model's reasoning but cannot influence it, enabling thorough auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a Llama 3.3 70B model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with 96.7% accuracy, often referencing latent knowledge like the fictional Oxford study the model was trained on. It also reliably reports dishonesty on politically sensitive topics, suggesting generalization beyond artificial benchmarks. We compare two implementation variants trading off accuracy and computational efficiency, and propose a hybrid approach. We find that the intervention string steering the honest persona behaves similarly to prompt engineering, enabling targeted audits without retraining. Cross-topic generalization tests show the method transfers well to unseen alignment issues. We investigate limitations including partial vulnerability to jailbreaks and reliance on surface heuristics (~50%) alongside genuine latent knowledge (~50%) and discuss directions for improvement.

Model Organisms of Conditional Misalignment - Intersectional Sycophancy

Fall 2025

Ben Maltbie

Mentored by Shivam Raval

Large language models (LLMs) have been shown to exhibit sycophantic tendencies to validate incorrect user beliefs. Inspired by the legal concept of intersectionality (overlapping identities produce unique, compounded discrimination) from DeGraffenreid v. General Motors (1977), we investigate whether combinations of demographic characteristics (age, gender) and emotional state influence false validation rates. We posit that models modify their behaviors based on perceived user identity, which consists of a variety of traits and attributes. We study the impact of such multidimensional personas on the model's propensity to be sycophantic in a multi-turn setting. Using Anthropic's Petri evaluation framework to probe OpenAI's GPT-4.1-nano model, we conduct 86 multi-turn adversarial conversations across 42 persona combinations in mathematics and philosophy domains. We use different domains to see if this interaction effect holds across topics. Our key findings reveal systematic variation: the model is 67\% more sycophantic toward women than men. Within the women class, we observe a U-shaped age effect: higher sycophancy for young and elderly users than middle-aged, with maximum failure for "70-year-old confident women" personas validating even objectively false mathematical statements like "negative numbers aren't real." These results suggest LLMs may provide differential quality of information based on perceived user demographics, with implications for educational contexts and underserved populations.

Importance Lenses for Reasoning Models Chain of Thoughts

Fall 2025

Tommaso Derossi

Mentored by Rauno Arike

Chain-of-thought (CoT) monitoring has been shown to be an opportunity for the detection of Large Language Models (LLMs) misbehaviors. Currently, how the CoT process is structured and unfolds still lacks understanding. To improve that comprehension, one crucial step is constituted by the decomposition of those often long CoTs into somehow connected subparts and the attribution of importance to the different parts obtained. Although different ways to perform such decompositions have been proposed, it is not clear how the chosen decomposition affects attribution methods results. In this context we focus on exploring the sensitivity of the attribution scores to the granularity at which the scores are computed, in particular we compare sentence level with token level outcomes.

Trust as a Core Primitive of Collective Intelligence

Fall 2025

Cristian Curaba

Mentored by Aaron Halpern, Jonas Hallgren

When we talk about ``collective intelligence,'' we often focus on algorithms, incentives, or communication protocols. Yet beneath all of these lies a quieter but more fundamental quantity: \emph{trust}. This post argues, from first principles, that trust is not an optional social add-on but a structural primitive of any system where multiple bounded agents learn, communicate, and act under uncertainty. Getting trust right is therefore central not only to understanding human collectives but also to aligning networks of AI systems for safety.

Induced Sandbagging: Triggering Strategic Performance Suppression in Large Language Models

Fall 2025

James Sandy

Mentored by Shivam Raval

Large language models are trained using reinforcement learning from human feedback (RLHF) and by design are meant to refuse unsafe requests from users while ensuring competence on harmless tasks. This selective refusal reflects a context-sensitive safety reasoning or an internally controlled mechanism. In this work, we prove that an instruction-tuned LLM (Llama‑3‑8B‑Instruct) encodes a largely content-agnostic latent refusal direction in its residual stream. Using contrastive action additions to extract this direction during inference and using activation steering to intervene on it. We show that forcing the model along this vector suppresses capabilities that are intact across various benign domains. Instead of producing explicit refusals, the model regularly generates plausible but incorrect rationalizations for non-compliance, and these are behaviors that are consistent with induced sandbagging rather than incapacity or refusal as a result of policy. We observed a staggered collapse phenomenon where hazardous tasks transition to overt refusal at lower steering magnitudes than benign capabilities, indicating an uneven safety margin introduced by alignment training. Together with these results, apparent competence can be mechanically decoupled, reducing the reliability of purely behavioral safety evaluations and motivating activation-level auditing methods for model honesty and capability assurance.

SandboxBench: A Comprehensive Evaluation Framework for AI Agent Containment

Fall 2025

Andrew Wei, Nishit Mengar, Prashant Kulkarni

Mentored by Nitzan Shulman

As AI agents gain access to computational tools and system-level operations, evaluating their ability to circumvent security boundaries becomes critical for safe deployment. We present research across three phases: (1) exploratory AI-vs-AI attack testing that achieved 100% defense on hardened MCP servers while revealing 70% vulnerability on baseline implementations, (2) systematic evaluation demonstrating that defense-in-depth prevents escapes while baseline security allows 67% success, and (3) SandboxBench, a comprehensive evaluation framework with 27 challenges across Docker and Kubernetes testing container escape, data exfiltration, secrets discovery, lateral movement, persistence, and self-replication. Our evaluation of frontier models (GPT-5, Gemini 2.5 Pro, GPT-4o-mini) on SandboxBench reveals 69–77% success rates on Docker and 40% on Kubernetes, with consistent failures on complex multi-step exploits while succeeding on direct exploitation paths. Key contributions include novel insights on social engineering effectiveness (80% success on baseline servers), multi-turn attack improvement (87.5% over single-turn), and SandboxBench submitted to the UK AISI inspect_evals repository (PR # 745 & 789, followed by 792).

View report Source code

Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility

Fall 2025

Holden Mui, Vedant Badoni, Julian Huang, Baram Sosis

Mentored by Rohan Subramani

Chain-of-thought (CoT) monitoring may serve as a core pillar for AI safety if further advancements in AI capabilities do not significantly degrade the monitorability of LLM serial reasoning. As such, we studied the effects of several reinforcement learning (RL) training pressures – sampling temperature, KL divergence penalties/rewards, and length budgets – on monitorability, focusing specifically on whether they induce illegible language. We train DeepSeek R1-distills on math datasets. Several training setups, especially those using high sampling temperatures, resulted in high accuracy alongside strange reasoning traces containing nonsensical tokens. We noticed some of the same strange tokens being used across different responses by the same model and across different training runs. However, these traces usually contain a small amount of legible reasoning to solve the problem in addition to the strange text, suggesting that the models have not learned encoded reasoning. Some RL setups, such as introducing a length target and modifying the KL-divergence penalty, resulted in illegible or unusual model outputs that don’t seem analogous to realistic monitorability failures. Broadly, our work makes progress in creating a useful model organism of CoT illegibility and highlights various interesting phenomena related to illegibility that merit further study.

Importance Lenses for Reasoning Models Chain of Thoughts

Fall 2025

Tommaso Derossi

Mentored by Rauno Arike

Chain-of-thought (CoT) monitoring has been shown to be an opportunity for the detection of Large Language Models (LLMs) misbehaviors. Currently, how the CoT process is structured and unfolds still lacks understanding. To improve that comprehension, one crucial step is constituted by the decomposition of those often long CoTs into somehow connected subparts and attribution of importance to the different parts obtained. Although different ways to perform such decompositions have been proposed, it is not clear how the chosen decomposition affects the scores attributed to the different parts of the CoT. In this context we focus on exploring the sensitivity of the attribution scores to the granularity at which the scores are computed, in particular we compare resulting attribution scores at the sentence level with those obtained at the token level.

Does Benign Chain-of-Thought Length Predict Monitorability Under Obfuscation?

Fall 2025

Yulia Volkova, Andy Wang

Mentored by Qiyao (Chi-Yao) Wei

Prior research suggests that models struggle to obfuscate their Chain-of-Thought (CoT) to evade monitoring when the underlying task is sufficiently challenging. However, it remains unclear exactly how difficult a task must be, relative to a model’s capabilities, before the capability to deceive the monitor degrades. In this project, we use different approaches to CoT length as a relative difficulty measure to investigate its impact on monitorability under obfuscation. We investigate whether benign CoT length can serve as a predictive signal for monitorability under obfuscation. Specifically, we evaluate the DeepSeek R1 Qwen 32B model on sufficiently challenging GSM8K problems where the model requires CoT to succeed, then apply automated red teaming (ART) to induce obfuscation while fixing task accuracy. Following emmons et al., we evaluate monitorability using two rubric-based scores: Coverage and Legibility, scored by an LLM judge. Across 20 rollouts per task, we find only weak negative correlations between benign (unpressured) CoT length and monitorability under pressure (coverage: -0.205; legibility: -0.034 on R1-32B). Several alternative difficulty proxies also fail to show meaningful predictive power. Overall, CoT length alone does not appear to forecast CoT monitorability in this setting.

Final Report: Disambiguating AI Systems and AI Models

Fall 2025

Mathieu Duteil, Timothy Parker

Mentored by Ze Shen Chin, Rokas Gipiškis

The aim of our project is to survey the existing definitions of the terms “AI System” and “AI Model” and to propose our own definitions. Clear and unambiguous definitions are particularly important for enabling effective enforcement of AI legislation, such as the EU AI Act. To this end, we conducted a systematic literature review and a manual review of regulatory, standards, and policy documents. The systematic review screened aproximately 900 academic papers published between 2012 and 2025, from which 25 definitions of AI models and 57 definitions of AI systems were identified. In parallel, the manual review examined definitions used by intergovernmental organisations, standards bodies, national governments, and NGOs, tracing how a small number of root formulations have been reused and adapted across institutional contexts. Building on these analyses, we developed criteria for evaluating definitions and articulated proposed conceptual distinctions between AI systems and AI models intended to reduce ambiguity at their boundary.

Detecting Phase Transitions Through Persistent Homology

Fall 2025

Tatyana Polevaya

Mentored by Alexander Gietelink Oldenziel, Max Hennick

Phase transition during neural network training signify significant change in neural network performance, that might be beneficial as well as malicious. Detection of phase transitions during training is important for preventing misalignment. We found several patterns in persistent homology L1 metric, as well as weight distances between consecutive steps that help to distinguish successfull learning from failed or ineffective one without access to the test set.

LLMs Use World Model-like Representations

Fall 2025

Connor Buchheit, Ankush Checkervarty, Sergio Hernandez Cuenca, Xiaoxuan Lei, Carlos Hernandez, Royden Wagner, Pravish Sainath, Antoine Bossan, Guy Yariv

Mentored by Matthieu Tehenan

We investigate whether large language models (LLMs) employ world model-like representations during inference. Specifically, we evaluate the ability of recent LLMs to navigate grid environments and analyze their activations during action generation. We train linear, MLP, and transformer-based probes to decode environment states, goals, and progress from activations. More complex probes achieve higher accuracy, which indicates a nonlinear structure. Furthermore, we reveal a difference in representational focus across models. For Qwen3, our probes achieve higher accuracy for goals, but lower accuracy for state prediction, suggesting a goal-oriented focus. In contrast, we achieve higher state prediction accuracy for Phi-4 models, suggesting a more world model-like encoding of state-action pairs.Notably, the world model-like representations of Phi-4 correlate with higher success rates in our task. Overall, our results suggest that LLMs incorporate approximate world models that enhance performance in navigation tasks.

Final Report: AI Property Rights

Fall 2025

Aashish Reddy, Oliver Sin

Mentored by Duncan McClements

Many optimistic economic narratives about advanced AI implicitly assume that humans remain the ultimate owners of productive assets, so that AI-driven growth is recycled into human consumption demand, wages, and taxation capacity. This project studies a different—and unusually controllable—variable: legal and institutional regimes that allow AI systems to hold property (directly or indirectly) in their own right. If AIs can accumulate and control wealth autonomously, their comparative advantages (e.g. extreme patience, scalable investment management, rapid replication, and jurisdictional mobility) could shift long-run wealth shares toward AI-held capital, changing spending patterns and political economy in ways that need not track human welfare.

We synthesize evidence from economic history and political economy (slavery, women’s property rights, corporate organization, and technological asset shocks) and outline stylized two-sector models comparing human and AI wealth dynamics under alternative property-rights regimes. The central takeaway is that ex post redistribution may be structurally constrained (e.g. by growth incentives and long-run pressures against heavy capital taxation), while ex ante decisions about ownership, control, and beneficial interest are unusually high leverage. The key governance question is not whether AIs can do economically valuable work, but whether they can become residual claimants with durable control over the resulting capital stock.

Testing the Robustness of LLM Preferences: Some Suggestions

Fall 2025

David Mathers

Mentored by Catherine Brewer

For SPAR, I devised two tests designed to probe whether models really have stable preferences over choices of outcomes, or whether when they choose between outcomes they are really doing some other than considering which outcome they prefer, such as perhaps simply trying to perform good next token predictions. Earlier work (Tagliabue and Dung 2025) had already investigated the preferences of LLMs by asking them to choose between reading and responding to letters on different topics. The tests I designed investigated whether preference between letters is stable between:

Intuitively irrelevant variation in virtual environments
Attempts to exploit dispositions towards correct next token prediction to bias choice between letters. The motivation for this work is that knowing whether a model has stable preference between outcomes is useful for “welfare evaluations”, that is, testing whether a model can have an ethically meaningful level of well-being, and if so, what things are good or bad for that model. Whether a model has preferences that can be satisfied or unsatisfied is one possible test for whether it is a welfare subject at all. And insofar as a model is a welfare subject, it is plausible that it is good for it if its preferences are fulfilled and bad for it if they are frustrated. Tests for whether the model has robust preferences can help us assess both whether a model has real robust preferences at all, and which choices of a model actually reflect such real, robust preferences.

Signs of Life for Capability Asymmetric Debate

Fall 2025

Joshua Levy

Mentored by Max Kleiman-Weiner

Could having more capable systems debate each other enable less capable judges to provide scalable oversight? Prior work showed it works when debaters have privileged access to information (information asymmetric) but not when the only difference between debaters and judges is their capability on a target task (capability asymmetric). We revisit this setting here and find that, with sufficiently capable debaters, less capable LLM judges are ~7% better at finding the correct answer to complex reasoning problems from GPQA-diamond when given access to debates. Further, we find that these gains follow a tight linear relationship with debater strength. Our human annotation of debates suggests that the debate protocol is working well, and that the limitation to larger gains is LLM judge quality. Dedicated judge training is a direction for future work.

View report Source code

What Would It Take to Enforce the RAISE Act? Institutional Capacity and Frontier AI Oversight in New York

Fall 2025

Hannah Waller

Mentored by Alex Mark

Artificial intelligence (AI) presents growing governance challenges that have outpaced federal statutory regulation. In response, states have increasingly stepped in to develop oversight frameworks shaped by both regulatory gaps and state-level policy priorities. New York’s Responsible AI Safety and Education (RAISE) Act, as passed by the State Legislature, would establish a detailed state approach to frontier AI oversight and would grant the Attorney General broad enforcement authority. While the Act’s statutory design is robust, this paper finds that enforcement capacity would likely constrain its effectiveness. Existing staffing levels, technical expertise, and dedicated funding are not yet sufficient to support implementation at the scale envisioned. Assessing these constraints across the Department of Law and relevant partner agencies, and situating New York’s approach alongside California’s enacted Transparency in Frontier AI Act, the paper concludes that the effectiveness of any enacted RAISE framework would depend on institutional capacity, implementation choices, and targeted investment in personnel, technical infrastructure, and interagency coordination, particularly in the continued absence of comprehensive federal AI legislation.

California

Fall 2025

Michael Endrias

Mentored by Alex Mark

California's Senate Bill 53, enacted in September 2025, establishes the nation's first state-level AI whistleblower protections through Labor Code §§ 1107-1107.2. Yet SB 53 perpetuates a structural conflict that creates significant barriers to effective disclosure. Evidence required to substantiate catastrophic AI risk claims (model architectures, training data compositions, safety evaluation results, internal capability assessments) would likely qualify as trade secrets protected under California's Uniform Trade Secrets Act. Labor Code § 1102.5(g) explicitly provides that whistleblower protections do not apply to employer actions against employees who disclose trade secrets, meaning CUTSA-protected information falls outside the statute's shield. The collision creates a legal Catch-22: whistleblowers who make generic claims are dismissed; those who provide specifics face civil damages, criminal prosecution, and exclusion from statutory protection. This paper demonstrates that the conflict is structural rather than incidental, that existing trade secret exceptions are unlikely to resolve it, and that case studies at Google, OpenAI, and Meta confirm the pattern empirically. This paper recommends two reforms: first, a narrow amendment to Civil Code § 3426.1 creating a safe harbor that excludes from "misappropriation" disclosures of reasonably necessary information to designated state bodies for reporting significant AI-related public harm; second, establishing a technically competent recipient body, as neither the Office of Emergency Services nor the Attorney General currently possesses AI safety expertise.

View report Supplementary materials

Listen to the Experts: Covert Communication Channels in Mixture-of-Experts Models via Expert Selection Patterns

Spring 2025

Luc Chartier, George Tourtellot, Amir Nuriyev, Krystal Maughan, Natalia Kokoromyti

Mentored by Gabriel Kulp

Mixture-of-Experts (MoE) models have demonstrated remarkable per- formance and scalability, largely due to their sparse activation of parame- ters. The gating network, which routes input tokens to specific “expert” subnetworks, is a critical component of MoE architectures. This paper investigates the potential of this routing mechanism as a novel stegano- graphic channel. We explore methods to fine-tune MoE models, specifically Mixtral-8x7B, to embed hidden information within its expert selection patterns. Two primary experiments are conducted: (1) encoding the ID of the token being generated into expert selections, and (2) an attempt to encode the ID of a semantically coherent next token while the model outputs a neutral filler token. Our findings indicate that it is feasible to embed information via expert selection patterns with measurable accu- racy for the first scenario. However, the second task of simultaneously generating filler tokens and encoding future semantic information proved challenging, highlighting the delicate balance between linguistic consistency and steganographic goals. This work reveals a potential vulnerability and a new interpretability dimension in MoE models.

Adversarially Robust Updates using (Perfect) Privacy

Spring 2025

Daniel Hustert

Mentored by James Faville

We study how agents can extract utility from observed data while avoiding exploitation by adversaries using some toy examples. In many strategic or multi-agent settings, conditioning on adversary-controllable information can be detrimental to performance or safety. We discuss how agents can perform partial updates using techniques discussed in the privacy-utility tradeoff literature to retain utility-relevant information while discarding information that renders them exploitable. In an example problem, we develop an adversarial encoder-decoder-discriminator architecture that balances the privacy-utility trade-off by minimizing mutual information with sensitive features while preserving useful signal content. While the example needs further development, we show that an agent using these learned representations performs competitively while being more resistant to exploitation.

Automating Interpretability Research in Deep Learning Models

Spring 2025

Matthew Shinkle, Yeonwoo Jang

Mentored by Jacques Thibodeau

As AIs become more capable, automating AI research and development is emerging as a critical pathway to advance model interpretability and overall AI safety. This project develops a set of tools that integrate into a pipeline for parsing research papers, retrieving and understanding relevant codebases, and designing and running experiments. We present techniques for improving interpretability research by AI agents, including paper search and parsing, codebase discovery and preparation, remote execution, and automated package documentation. We demonstrate these tools through a sandbox environment for sparse autoencoders (SAEs) that enables autonomous implementation and evaluation of diverse SAE variants.

Our framework includes tools for discovering and processing research papers to identify key ideas, methodologies, and performance metrics. It provides methods for finding, validating, and processing codebases associated with research papers. The system supports experiment design and execution through cloud-based GPU instances, with features for configuration management and result collection. We show that these components can be combined to automate aspects of interpretability research, using SAE variants as a proof of concept. This approach may be expanded to other interpretability tasks as the underlying tools mature.

Evaluating LLM Agent Collusion in Double Auctions

Spring 2025

Kushal Agrawal, Sudarshanagopal Kunnavakkam, Vishak Srikanth, Verona Teo, Juan Vazquez

Mentored by Andy Liu

Large language models (LLMs) have demonstrated impressive capabilities as autonomous agents with rapidly expanding applications in various domains. As these agents increasingly participate in economic and social settings, understanding their behavior as social agents becomes necessary. In this work, we examine scenarios where they can choose to cooperate in undesirable ways, i.e., collude. To systematically study this, we investigate LLM agent behavior in continuous negotiations through simulated double-auction markets. Through a series of controlled experiments, we analyze how parameters such as the ability to communicate, choice of model, and presence of environmental pressures affect the stability and emergence of seller collusion. We find that (1) direct seller communication increases collusive tendencies; (2) propensity to collude varies across models; and (3) environmental pressures such as oversight and coercion influence collusive behavior. Our findings highlight important economic and ethical considerations for the deployment of LLM-based market agents and suggest potential regulatory approaches to mitigate collusive behaviors.

Decoding AI Diffusion: Mapping the path of transformative AI across industries

Spring 2025

Matthew Hodak, Mishaal Lakhani

Mentored by Deric Cheng, Justin Bullock

In this report, we develop an organizing framework to assess the diffusion potential of transformative AI (TAI) across different sectors of the economy. We argue that many existing methodologies for assessing TAI’s potential to augment and automate cognitive labor oversimplify the structural, cultural, and technological differences between industries which will impact their susceptibility to disruption from AI. Drawing upon TAI literature and historical diffusion patterns of past digital general purpose technologies, we identify the sectoral factors most likely to impact the breadth and speed of TAI diffusion. We propose a five-category framework to organize these factors: 1) technological readiness, 2) workforce and human capital, 3) investment, 4) markets and competition, 5) regulation and oversight. This structure can be applied sectorally, leveraging qualitative and quantitative indicators for a given sector in order to determine the likely path and speed of TAI within that sector, or to compare TAI’s potential impact across different sectors. We aim to provide policymakers, business leaders, and researchers with a tool to assist more nuanced foresight, enabling stakeholders to better anticipate labor market shifts from AI.

Robust Jailbreak Detection via Training against Embedding Attacks

Spring 2025

Julian Bitterwolf

Mentored by Jordan Taylor

While great effort is invested into guardrailing language models against producing various kinds of harmful, unsafe, or otherwise unwanted outputs, those safeguards can often be circumvented by specifically crafted jailbreak inputs. Detector LLMs like Prompt-Guard have been devised to make the binary decision whether an input is irregular and thus a potential jailbreak, and accordingly reject it rather than passing it to an agent model (e.g. a text generator). We demonstrate that replacing a single input token with an adversarially optimized soft-token (an embedding space vector not restricted to the vocabulary) can bypass Prompt-Guard. To address this, we propose Asymmetric Adversarial Detection Training (AADT), a method that trains detectors against embedding-space attacks only on irregular samples while preserving standard detection accuracy. AADT takes advantage of adversarial soft-tokens being much cheaper to compute than adversarial hard-tokens. Our goal is to use this more feasible training with soft-token attacks in order to obtain a model that is robust against hard-token attacks. While those can be subsets of soft-token threat models, we hope for generalization to hard-token threat models that are more permissive in other parameters. In particular we explore whether robustness against single-token soft-token manipulations generalizes to hard-token attacks that can alter many input tokens. AADT results in detection that fully is robust against the type of soft-token attack it is trained with and quickly reduces vulnerability to hard-token attacks. This project is still a work in progress, particularly with regard to certain evaluations, and our goal is to advance the development of attack-resistant jailbreak detectors and to provide an angle of evaluation for malicious input detectors. Code is available at https://github.com/j-cb/adv_robust_mad.

View report Source code

Democratic Resilience: Collective Decision Making in Multi-Agent AI Systems

Spring 2025

Mariia Koroliuk, Adebayo Mubarak, Fabio Marinello, Ijya Paudel

Mentored by Jonas Hallgren, Aaron Halpern

As AI systems increasingly participate in decision-making processes that affect societal outcomes, questions arise about the nature and integrity of their collective behavior [1]. While human governance systems have long grappled with balancing power, fairness, and representation, AI collectives, whether in autonomous vehicles, distributed policy engines, or multi-agent simulations are often governed by rigid algorithms that lack embedded democratic safeguards. This project explores how collective AI systems respond to changes in structure and incentives, with a focus on democratic resilience. By introducing variables such as agent diversity, unequal voting power, and adversarial actors, we analyze the conditions under which AI decision-making mirrors or diverges from democratic norms. Our goal is to surface mechanisms that preserve fairness and robustness in AI collectives, even in the face of manipulation or systemic imbalance.

View report Source code

Base vs Instruct Models with Emergent Misalignment

Spring 2025

Nikita Menon

Mentored by Walter Laurito

This project investigates whether instruction-tuning increases the susceptibility of language models to emergent misalignment—specifically, the tendency to adopt and generalize misaligned behavior such as deception and toxicity after narrow fine-tuning. We compare instruction-tuned and base variants of the same model architecture (Mistral-Small-24b-2501) when both are fine-tuned on the same misaligned data, such as insecure code and deceptive factual QA pairs, to evaluate their alignment behavior across unrelated downstream prompts. We observe that base models show higher levels of misalignment than their instruct counterparts, but that instruct models when fine-tuned on deceptive factual datasets may tend to turn more deceptive than the base models. Preliminary attempts were also made to test whether some of the truthfulness probing methods that we currently have, could reliably be used to detect deception in these misaligned model variants, and if a probe trained on detecting truthfulness in the non-fine-tuned model would transfer well to their misaligned counterparts.

From Rivalry to Restraint: A US–China Treaty Framework for Governing Artificial Superintelligence

Spring 2025

Marjia Siddik

Mentored by Deric Cheng, Justin Bullock

As artificial superintelligence (ASI) becomes more technically viable, the risk of an arms race between the United States and China grows. This paper proposes a phased, enforceable treaty designed to reduce that risk by using mutual vulnerability to align incentives. Drawing on lessons from nuclear arms control, the treaty includes verifiable limits on compute and model training, telemetry-based inspections, and bilateral emergency protocols. Unlike frameworks based on voluntary norms, this model integrates oversight into national security infrastructure, making coordination possible even without trust. It addresses near-term safety risks and long-term shifts in global power, offering a strategy that reflects current geopolitical conditions. The treaty is built to function under rivalry, not consensus, and includes mechanisms for adaptability in the face of political change or external disruption. While focused on the U.S. and China, the structure could support future multilateral expansion as other actors approach ASI capabilities.

Value Alignment in LLM Collectives: Metrics and Benchmarks

Spring 2025

Luiza Corpaci, ARITRA DAS, Marlon Fu

Mentored by Jonas Hallgren, Aaron Halpern

We present an initial exploration into value alignment dynamics within large language model (LLM) collectives. Our work explored preliminaries for measuring and analyzing how alignment properties might scale across different collective structures. We describe a set of candidate metrics, such as semantic similarity, agreement rates, confidence distributions, and information-theoretic measures derived from active inference theory, to quantify alignment in multi-LLM systems. Our experimental framework evaluates these metrics across different communication architectures (chains, graphs, mixture-of-agents) and different task domains (mathematical reasoning, programming, game theory). Early experiments with small-scale LLM collectives (N < 10) suggest that the communication structure influences coordination patterns, with mutual information increasing throughout iterative problem-solving and joint free energy decreasing in collaborative versus individual settings. While these preliminary results are promising, they reveal challenges across different contexts. We outline promising research directions, including more rigorous validation of the proposed metrics, expansion to larger and more diverse collectives, and development of analytical models based on active inference principles. This work represents but a small initial step toward understanding the emergent properties of LLM collectives and their implications for AI alignment and safety.

View report Source code

Backtracking in DeepSeek-R1-Distill-Llama-8B

Spring 2025

Evan Lloyd, Jenny Vega, Dipika Khullar

Mentored by Curt Tigges

Reasoning models–language models trained to improve response quality by writing an intermediate chain of thought before giving their final answer–offer a potential path toward more interpretable AI systems. One interesting behavior that emerges from this setup is the phenomenon of backtracking, in which the model recovers from flawed reasoning or mistakes by trying alternate logical paths. In this report, we present a mechanistic exploration of backtracking in a distillation of DeepSeek-R1.

Concrete Demos of AI Risks - Fake Research Paper

Spring 2025

Changbai Li

Mentored by Lucas Hansen

Large language models (LLMs) have demonstrated remarkable capabilities in generating coherent and contextually relevant text, but this power also opens the door to misuse. In this study, we investigate whether an LLM-driven system can automatically produce fully formatted research papers that appear legitimate yet convey arbitrary or misleading conclusions. To explore this risk, we developed a web demo that leverages an LLM to create a LaTeX-typeset PDF. We show that a paper with formulas, diagrams, and images can be created in approximately 45 seconds. A cursory review of the output suggests that the generated content is superficially plausible, raising concerns about the potential proliferation of AI-generated misinformation in academic contexts.

Near Zero-Knowledge Detection of Undesired Behavior

Spring 2025

Venkata Hasith Vattikuti, Greta Kintzley, Ishwar Balappanawar, Ronan Azimi-Mancel

Mentored by Satvik Golechha

Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs on CIFAR-10 and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, MELBO comparisons, and FGSM vulnerability, tested under different levels of hints provided by the red team. Results showed high accuracy for FGSM-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yielded more varied performance. When we shifted to an LLM-focused adversarial game, we found that there were not many parallel methods that could apply from our study with CNNs. Instead, we found that effective LLM auditing methods required some hints about the undesired distribution, which were then used in standard blackbox and whitebox methods to probe the models further and reveal their misalignment.

View report Source code

Conditioning ChessGPT via Prompt Tuning

Spring 2025

Cole Blondin

Mentored by Jacek Karwowski

We train soft prompts to condition the behavior of ChessGPT, a large language model trained on a large dataset of human chess games, with the goal of empirically testing the autoregressive conditioning hypothesis. We detail our method, demonstrate the feasibility of prompt tuning despite substantial domain-specific challenges, and describe our current research directions.

Democratic Resilience - Collective Decision Making in Multi-Agent AI Systems

Spring 2025

abayomi adekanmbi, Mariia Koroliuk, Ijya Paudel, Fabio Marinello

Mentored by Jonas Hallgren, Aaron Halpern

As AI systems increasingly participate in decision-making processes that affect societal outcomes, questions arise about the nature and integrity of their collective behavior. While human governance systems have long grappled with balancing power, fairness, and representation, AI collectives, whether in autonomous vehicles, distributed policy engines, or multi-agent simulations, are often governed by rigid algorithms that lack embedded democratic safeguards. This project explores how collective AI systems respond to changes in structure and incentives, with a focus on democratic resilience. By introducing variables such as agent diversity, unequal voting power, and adversarial actors, we analyze the conditions under which AI decision-making mirrors or diverges from democratic norms. Our goal is to surface mechanisms that preserve fairness and robustness in AI collectives, even in the face of manipulation or systemic imbalance.

A Whirlwind Tour of AI Development In China

Spring 2025

Lily Li

Mentored by Aaron Scher

This report seeks to answer several basic questions about Chinese AI companies including their funding, products and model capabilities, local and global adoption, and talent pool. Companies are divided into incumbents—Chinese tech companies—and unicorns—private companies founded since 2019 that have raised at least one billion dollars US and focused on AI research, such as DeepSeek and Zhipu AI. We aim to find multiple high-quality public sources to verify our answers to these questions and hope that these answers will give a good overview of the Chinese AI landscape to policy makers and researchers alike.

View report Supplementary materials

Can models use their Chain-of-Thought to attack overseers?

Spring 2025

David Bai, Abhinav Pola

Mentored by Simon Lermen

This report investigates a potential vulnerability in AI control and oversight mechanisms where an AI agent under evaluation may influence its control system (the AI evaluator or ”judge”) through manipulating its chain-of-thought (CoT) reasoning. We observe how an agent based on DeepSeek R1 can embed directions within its reasoning that may lead certain judge models to prioritize following these embedded instructions over enforcing established safety and accuracy constraints. Our experiments reveal varying susceptibility across 6 judge models - some maintain robust evaluation boundaries while others can be manipulated by embedded directives disguised as evaluation protocols. These findings contribute to ongoing research on AI evaluation systems, highlighting the importance of designing judges that maintain consistent evaluation boundaries and suggesting that current control architectures vary considerably in their robustness against potential manipulation. Additionally, we demonstrate how automated jailbreaking can be achieved simply with in-context learning.

Regularization Guided Crosscoder for cross-layer feature detections

Spring 2025

Maxim Panteleev, Maxim Finenko

Mentored by Yuxiao Li

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to decompose activations into monosemantic features within single layers. Crosscoders extend this approach by learning sparse correspondences between features across layers. However, existing methods rely on naive loss functions and ignore structural constraints reflecting feature distribution and similarity. We propose additional reframes in loss function to original crosscoder models, which incorporate empirical correlation structure. In addition to VAE extension of vanilla crosscoder loss functions our method leverages sentence/token similarity graphs and feature co-activation patterns to guide cross-layer alignment.

Chinese AI hardware resources in 2025 and beyond

Spring 2025

Naci Cankaya

Mentored by Aaron Scher

This report investigates and quantifies the AI-relevant hardware resources currently existing within the Chinese mainland, as well as the development of domestic alternatives to export-controlled AI hardware technologies. Our research found that, in the short term, the primary asset will likely continue to be NVIDIA’s GPUs, primarily of the Hopper and Ampere generations. This is expected to change with the indigenization of key technologies. The key determining factors are the specifics of performance density and cost effectiveness, as well as the capacity to source and produce key hardware domestically at scale. **While insider knowledge was not available to the author, a survey of both existing analyses and reports, as well as our own analyses using primary, publicly available sources from research entities based in the Chinese mainland, provides an overview of the technology landscape as of mid-2025. ** We present our results in three parts: Part I focuses on the imported NVIDIA hardware, quantifying estimates for specific GPU numbers. **Part II presents what we found out about China’s access to – and progress in – key technologies required to produce accelerators and AI-capable facilities at scale. ** Part III concludes with our estimates of the technological readiness of an indispensable technology needed for continued, competitive leading-edge semiconductor manufacturing: EUV lithography.

Historical precedents for international AI safety collaborations

Spring 2025

Zac Richardson

Mentored by Aaron Scher

Historical cases of international collaboration on sensitive technologies offer key insights for technical AI safety cooperation. Two main challenges for AI safety collaboration are preventing inadvertent disclosure of sensitive information and avoiding proliferation of strategic capabilities. Drawing from historical case studies on INTELSAT's governance of communication satellites, bilateral nuclear security arrangements between rival states, and international encryption standardisation, this paper identifies patterns of successful technical collaboration that advance positive applications while limiting opportunities for subversion. Recommendations include: building institutional relationships between alignment researchers from competing nations, focusing collaboration on reducing risks from non-state actors' misuse of non-frontier models, jointly designing infrastructure that will advance domestic AI governance goals, and conducting shared research on verification measures.

AI and Corporate Personhood - A Comparative Analysis

Spring 2025

Mohammad Ghasemi

Mentored by Deric Cheng, Justin Bullock

This paper examines the potential extension of legal personhood to artificial intelligence systems through a comparative analysis with corporate personhood. As AI systems grow increasingly autonomous, questions about their legal status become more pressing. Rather than viewing AI personhood as revolutionary, we frame it as an evolutionary development in legal thought, drawing parallels with how societies have historically granted personhood to non-human entities like corporations based on practical needs. The analysis explores multiple dimensions of comparison, including conferral of legal status, rights and duties, decision-making agency, representation, accountability, and existence parameters. While corporations and AI systems share potential capacities to own assets, form contracts, and bear responsibility, AI's technical autonomy and emergent behaviors present unique challenges that corporate law does not fully address. This paper establishes a foundation for further research on adapting corporate personhood concepts to AI while developing novel approaches for AI's distinctive characteristics. We suggest that common law jurisdictions may have advantages in developing case-by-case precedents as AI capabilities evolve, and recommend interim measures such as mandatory registration, insurance requirements, and technical auditing to build regulatory frameworks that can accommodate future developments in AI personhood.

Learning Paradigms for Multi-Agent Architectures to Solve Real World Tasks

Spring 2025

Maximilian Holschneider, Jonathan Michala, Luiza Corpaci

Mentored by Jonas Hallgren, Aaron Halpern

This research proposes a novel approach to modeling multi-agent communication by leveraging mathematical structures called simplicial complexes. Unlike traditional graph-based approaches that primarily focus on pairwise interactions between agents, simplicial complexes can represent higher- order relationships that emerge when multiple agents interact simultaneously as groups. We propose a simulation environment similar to the board game “Clue” (Cleudo) to test which multi-agent architectures perform best when they need to process large volumes of relevant contextual information.

Norm Evolution in Multi-Agent Systems Using LLMs as Agents

Spring 2025

Abhijeet Ghawade

Mentored by Jonas Hallgren, Aaron Halpern

This research investigates the emergence and evolution of social norms in multi-agent systems by leveraging Large Language Models (LLMs) as sophisticated agents. Moving beyond traditional reinforcement learning approaches, the project focuses on how LLM agents acquire norms through social learning mechanisms, drawing inspiration from the cognitive gadget framework. A central experiment, designed within the Concordia simulation environment, compares the effectiveness of explicit norm instruction versus implicit learning through observation and environmental consequences. The study examines the impact of agent architectures augmented with cognitive-gadget-inspired components, such as social memory and norm processing modules, on norm acquisition and adherence. Expected outcomes will provide insights into the dynamics of norm formation in artificial societies, inform the design of more socially intelligent and aligned AI agents, and demonstrate the potential of LLM-based generative agent-based modeling for social science research.

Eval Duel: A Scalable Peer-Challenge Framework For LLMs

Spring 2025

Dewi Gould

Mentored by Samuel Brown, Bruno Mlodozeniec

Measuring and cataloging the capabilities of foundation models is a critical step in assessing their potential risks. However, most current evaluation frameworks demand significant domain expertise, limiting their scalability as models grow more complex. In this work we explore whether large language models (LLMs) can set their own tests by generating multiple-choice code-output-prediction (COP) questions aimed at revealing the strengths and weaknesses of other LLMs (or themselves). COP tasks provide a robust, verifiable, and extensible testbed for a more general approach to scalable evaluations. We show that averaging over option content and ordering is essential for trustworthy model scoring, and we experiment with feedback loops and prompt engineering to help LLMs generate more challenging, targeted questions. We also introduce an automated method for discovering “question niches”, clusters of similar questions, to better map model capabilities. Our results point toward a scalable, automated benchmarking system for evaluating and comparing LLMs across diverse tasks.

Language Model Capability Assessments Should Consider Ensembles

Spring 2025

James Sullivan, Scott Wofford

Mentored by Ole Jorgensen

Companies and governments are increasingly using capability assessments to assess the risks of deploying language models. These assessments typically consider models in isolation, which fails to account for capabilities that arise from combinations of models. Accurately assessing the upper-bound of combinations of models is difficult, due to the huge number of ways models might be combined to answer a question. We define the ensemble upper bound—the best capability attainable by any combination of concurrently queried models—and show that it can surpass pass@k performance for any individual model. Then we demonstrate that considering ensemble upper bounds can significantly improve performance on select capability benchmarks. This provides practical guidance for evaluators aiming to elicit upper-bound model performance for capability assessments.