Yong Zheng-Xin

CS PhD @ Brown University


I am a final-year PhD student at Brown University advised by Stephen Bach, funded by the Open Philanthropy grant for technical AI safety. I’m also an Astra Fellow working with Miles Wang (OpenAI). Previously, I was a research scientist intern at Meta AI and a research collaborator at Cohere Labs.

My current research focuses on reasoning and AI safety. I have worked on:

  • Reasoning generalization: How chain-of-thought reasoning generalizes across languages and scales with test-time compute (arXiv)
  • Safety for reasoning models: Emergent self-jailbreaking behaviors during reasoning (arXiv) and predicting refusals before models finish thinking (arXiv)
  • Safety for multilingual models: Jailbreaking vulnerabilities in low-resource languages (Best Paper, NeurIPS 2023 SoLaR), detoxification (EMNLP 2024), and finetuning attacks (NAACL 2025)

I’ve also contributed to multilingual frontier models through the Aya instruction-following model (Best Paper, ACL 2024), language adaptation techniques (ACL 2023), and making speech models robust to new accents (INTERSPEECH 2025).


Selected Publications

  1. Yik Siu Chan*, Zheng-Xin Yong*, and Stephen H. Bach
    arXiv preprint, 2025
  2. Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, and 7 more authors
    arXiv preprint, 2025
  3. Ahmet Üstün*, Viraat Aryabumi*, Zheng-Xin Yong*, and 14 more authors
    ACL, 2024 (Best Paper Award)
  4. Zheng-Xin Yong, Cristina Menghini, and Stephen Bach
    NeurIPS Workshop: Socially Responsible Language Modelling Research (SoLaR), 2023 (Best Paper Award)