My ultimate goal is to blend knowledge from multiple disciplines to advance AI research. My current research centers on aligning foundation models with human learning and capabilities, particularly in reasoning, generalization, and efficiency. I have explored ways to improve the controllability of language and visual generation models, and to integrate structured and multimodal information to enhance their reasoning capabilities.
I'm investigating psychologically and cognitively inspired methods for continual learning, self-improvement, and advanced reasoning in foundation models. I'm also exploring methods to bridge the data efficiency gap between human and model learning [1,2,3] while shedding further light on human cognitive models and our efficient language acquisition capabilities.
Previously, I was a master's student at Carnegie Mellon University (CMU), where I worked with Eduard Hovy and Malihe Alikhani on language generation, data augmentation, and commonsense reasoning. Before that, I was an undergraduate student at the University of Waterloo, where I worked with Jesse Hoey on dialogue agents and text generation.
I am a co-instructor for the Stanford CS25 Transformers course, and mentor and advise several students. I also led the organization of CtrlGen, a controllable generation workshop at NeurIPS 2021, and was involved in the GEM benchmark and workshop for NLG evaluation.
In my free time, I enjoy gaming, playing the piano and guitar, singing, dancing, martial arts, and table tennis. I am also the founder and president of the Stanford Piano Society.
Jan. 2026: Our new paper is out! We formally devise a unified definition of what exactly constitutes a hallucination in the context of LLMs and AI. Work-in-progress (V1) preprint here.
Sept. 2025: Had a great time working as a research intern at Contextual AI this past summer!
May 2021: Our data augmentation survey paper, published at ACL 2021 Findings, has received lots of attention on social media (e.g. this tweet, Sebastian Ruder's NLP Newsletter) and was one of the top 10 hottest machine learning papers in May 2021 (source: labml.ai).
Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience -- their "training data" -- is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date -- the BabyView dataset -- recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493-hour dataset includes egocentric videos from children spanning 6 months to 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human levels of success on the same scale and distribution of training data as humans?
@misc{long2024babyviewdatasethighresolutionegocentric,
title={The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences},
author={Bria Long and Violet Xiang and Stefan Stojanov and Robert Z. Sparks and Zi Yin and Grace E. Keene and Alvin W. M. Tan and Steven Y. Feng and Chengxu Zhuang and Virginia A. Marchman and Daniel L. K. Yamins and Michael C. Frank},
year={2024},
eprint={2406.10447},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.10447},
}
While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children's learning is instead substantially more efficient than current language modeling techniques.
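For a concrete picture of this kind of training setup, here is a minimal sketch of training a small GPT-2-style model on a line-per-utterance corpus with Hugging Face Transformers. The file path, model size, and hyperparameters are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch: training a small GPT-2-style model on child-directed speech.
# The corpus path, model size, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical corpus: one cleaned utterance per line (e.g., CHILDES-style transcripts).
dataset = load_dataset("text", data_files={"train": "child_directed_speech.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Train from scratch with a small GPT-2 configuration.
model = GPT2LMHeadModel(GPT2Config(n_layer=12, n_head=12, n_embd=768))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-cds", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```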
@inproceedings{feng-etal-2024-child,
title = "Is Child-Directed Speech Effective Training Data for Language Models?",
author = "Feng, Steven Y. and
Goodman, Noah and
Frank, Michael",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1231",
pages = "22055--22071"}
We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.
@inproceedings{feng-etal-2023-chard,
title = "{CHARD}: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models",
author = "Feng, Steven Y. and Khetan, Vivek and Sacaleanu, Bogdan and Gershman, Anatole and Hovy, Eduard",
editor = "Vlachos, Andreas and Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.24",
doi = "10.18653/v1/2023.eacl-main.24",
pages = "313--327"}
Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.
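As a rough illustration of working with phoneme representations, the sketch below converts text to ARPABET phonemes using the third-party g2p_en library and computes a naive alliteration score. This toy measure is an assumption for illustration only and is not PANCETTA's actual phonetic difficulty modeling.

```python
# Sketch: phoneme conversion plus a naive repeated-onset ("alliteration") score.
# This illustrates phoneme representations, not the scoring used in PANCETTA.
from collections import Counter
from g2p_en import G2p  # grapheme-to-phoneme converter (third-party library)

g2p = G2p()

def phonemes(text):
    # g2p_en returns ARPABET phonemes with spaces between words; drop the spaces.
    return [p for p in g2p(text) if p.strip()]

def naive_alliteration_score(text):
    # Fraction of words sharing the most common initial phoneme.
    words = text.split()
    onsets = [phonemes(w)[0] for w in words if phonemes(w)]
    most_common = Counter(onsets).most_common(1)
    return most_common[0][1] / max(len(words), 1) if most_common else 0.0

print(phonemes("She sells seashells"))
print(naive_alliteration_score("She sells seashells by the seashore"))
```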
@inproceedings{keh-etal-2023-pancetta,
title = "{PANCETTA}: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically",
author = "Keh, Sedrick Scott and
Feng, Steven Y. and
Gangal, Varun and
Alikhani, Malihe and
Hovy, Eduard",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.36",
doi = "10.18653/v1/2023.eacl-main.36",
pages = "491--504"}
A personification is a figure of speech that endows inanimate entities with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation. To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative personifications that enhance the overall appeal of a sentence.
@inproceedings{keh-etal-2022-pineapple,
title = "{PINEAPPLE}: Personifying {IN}animate Entities by Acquiring Parallel Personification Data for Learning Enhanced Generation",
author = "Keh, Sedrick Scott and Lu, Kevin and Gangal, Varun and Feng, Steven Y. and Jhamtani, Harsh and Alikhani, Malihe and Hovy, Eduard",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.547",
pages = "6270--6284"}
We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.
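A minimal sketch of the caption-then-generate idea follows, assuming an off-the-shelf captioning model (BLIP) and a stock BART checkpoint; these model choices and the input format are illustrative assumptions, whereas the actual VisCTG pipeline retrieves images for each concept set and fine-tunes BART/T5 on caption-enriched inputs.

```python
# Sketch of caption-enriched concept-to-text generation in the spirit of VisCTG:
# caption a retrieved image, then feed the caption alongside the concept set to a
# seq2seq model. Model names and the image path below are illustrative.
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          BartTokenizer, BartForConditionalGeneration)

# 1) Caption an image retrieved for the concept set (hypothetical image path).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
image = Image.open("retrieved_scene.jpg").convert("RGB")
caption_ids = blip.generate(**blip_processor(image, return_tensors="pt"), max_new_tokens=30)
caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

# 2) Condition generation on the concepts plus the caption.
concepts = ["dog", "frisbee", "catch", "park"]
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
inputs = tokenizer(" ".join(concepts) + " </s> " + caption, return_tensors="pt")
output_ids = bart.generate(**inputs, num_beams=5, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```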
@article{Feng_Lu_Tao_Alikhani_Mitamura_Hovy_Gangal_2022,
title={Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models},
volume={36},
url={https://ojs.aaai.org/index.php/AAAI/article/view/21306},
DOI={10.1609/aaai.v36i10.21306},
number={10},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Feng, Steven Y. and Lu, Kevin and Tao, Zhuofu and Alikhani, Malihe and Mitamura, Teruko and Hovy, Eduard and Gangal, Varun},
year={2022},
month={Jun.},
pages={10618-10626}}
Many implicit inferences in text depend on how it is structured, and these inferences can critically impact the text's interpretation and meaning. One such structural aspect present in text with chronology is the order of its presentation. For narratives or stories, this is known as the narrative order. Reordering a narrative can impact the temporal, causal, event-based, and other inferences readers draw from it, which in turn can have strong effects both on its interpretation and interestingness. In this paper, we propose and investigate the task of Narrative Reordering (NAREOR) which involves rewriting a given story in a different narrative order while preserving its plot. We present a dataset, NAREORC, with human rewritings of stories within ROCStories in non-linear orders, and conduct a detailed analysis of it. Further, we propose novel task-specific training methods with suitable evaluation metrics. We perform experiments on NAREORC using state-of-the-art models such as BART and T5 and conduct extensive automatic and human evaluations. We demonstrate that although our models can perform decently, NAREOR is a challenging task with potential for further exploration. We also investigate two applications of NAREOR: generation of more interesting variations of stories and serving as adversarial sets for temporal/event-related tasks, besides discussing other prospective ones, such as for pedagogical setups related to language skills like essay writing and applications to medicine involving clinical narratives.
@article{Gangal_Feng_Alikhani_Mitamura_Hovy_2022,
title={NAREOR: The Narrative Reordering Problem},
volume={36},
url={https://ojs.aaai.org/index.php/AAAI/article/view/21309},
DOI={10.1609/aaai.v36i10.21309},
number={10},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Gangal, Varun and Feng, Steven Y. and Alikhani, Malihe and Mitamura, Teruko and Hovy, Eduard},
year={2022},
month={Jun.},
pages={10645-10653}}
We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.
@inproceedings{feng-etal-2021-sapphire,
title = "{SAPPHIRE}: Approaches for Enhanced Concept-to-Text Generation",
author = "Feng, Steven and
Huynh, Jessica and
Narisetty, Chaitanya Prasad and
Hovy, Eduard and
Gangal, Varun",
booktitle = "Proceedings of the 14th International Conference on Natural Language Generation",
month = aug,
year = "2021",
address = "Aberdeen, Scotland, UK",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.inlg-1.21",
pages = "212--225"}
Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at this link.
@inproceedings{feng-etal-2021-survey,
title = "A Survey of Data Augmentation Approaches for {NLP}",
author = "Feng, Steven Y. and
Gangal, Varun and
Wei, Jason and
Chandar, Sarath and
Vosoughi, Soroush and
Mitamura, Teruko and
Hovy, Eduard",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.84",
doi = "10.18653/v1/2021.findings-acl.84",
pages = "968--988"}
In this paper, we investigate data augmentation for text generation, which we call GenAug. Text generation and language modeling are important tasks within natural language processing, and are especially challenging for low-data regimes. We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews. We also examine the relationship between the amount of augmentation and the quality of the generated text. We utilize several metrics that evaluate important aspects of the generated text including its diversity and fluency. Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods, and that the quality of generations improves to a peak at approximately three times the amount of original data.
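The sketch below illustrates two of the augmentations named in the abstract, character-level synthetic noise and keyword replacement with WordNet hypernyms. The probabilities and selection logic are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of two augmentations from GenAug's abstract: character-level synthetic
# noise and hypernym replacement via WordNet. Parameters are illustrative only.
import random
import string
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def add_char_noise(text, p=0.05):
    """Randomly delete, substitute, or insert characters with probability p per character."""
    out = []
    for c in text:
        r = random.random()
        if r < p / 3:
            continue                                            # delete
        elif r < 2 * p / 3:
            out.append(random.choice(string.ascii_lowercase))   # substitute
            continue
        elif r < p:
            out.append(c)
            out.append(random.choice(string.ascii_lowercase))   # insert after
            continue
        out.append(c)
    return "".join(out)

def replace_with_hypernym(word):
    """Return a WordNet hypernym lemma for `word`, or the word itself if none is found."""
    synsets = wordnet.synsets(word)
    if not synsets or not synsets[0].hypernyms():
        return word
    return synsets[0].hypernyms()[0].lemma_names()[0].replace("_", " ")

print(add_char_noise("the pizza at this restaurant was amazing"))
print(replace_with_hypernym("pizza"))  # e.g., "dish"
```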
@inproceedings{feng-etal-2020-genaug,
title = "{G}en{A}ug: Data Augmentation for Finetuning Text Generators",
author = "Feng, Steven Y. and
Gangal, Varun and
Kang, Dongyeop and
Mitamura, Teruko and
Hovy, Eduard",
booktitle = "Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.deelio-1.4",
doi = "10.18653/v1/2020.deelio-1.4",
pages = "29--42"}
For conversational AI and virtual assistants to communicate with humans in a realistic way, they must exhibit human characteristics such as expression of emotion and personality. Current attempts toward constructing human-like dialogue agents have presented significant difficulties. We propose Human Level Attributes (HLAs) based on tropes as the basis of a method for learning dialogue agents that can imitate the personalities of fictional characters. Tropes are characteristics of fictional personalities that are observed recurrently and determined by viewers' impressions. By combining detailed HLA data with dialogue data for specific characters, we present a dataset, HLA-Chat, that models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs. We then introduce a three-component system, ALOHA (which stands for Artificial Learning of Human Attributes), that combines character space mapping, character community detection, and language style retrieval to build a character (or personality) specific language model. Our preliminary experiments demonstrate that two variations of ALOHA, combined with our proposed dataset, can outperform baseline models at identifying the correct dialogue responses of chosen target characters, and are stable regardless of the character's identity, the genre of the show, and the context of the dialogue.
@article{Li_2020,
title={ALOHA: Artificial Learning of Human Attributes for Dialogue Agents},
volume={34},
ISSN={2159-5399},
url={http://dx.doi.org/10.1609/aaai.v34i05.6328},
DOI={10.1609/aaai.v34i05.6328},
number={05},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
publisher={Association for the Advancement of Artificial Intelligence (AAAI)},
author={Li, Aaron W. and Jiang, Veronica and Feng, Steven Y. and Sprague, Julia and Zhou, Wei and Hoey, Jesse},
year={2020},
month={Apr},
pages={8155–8163}}
In this paper, we present a novel method for measurably adjusting the semantics of text while preserving its sentiment and fluency, a task we call semantic text exchange. This is useful for text data augmentation and the semantic correction of text generated by chatbots and virtual assistants. We introduce a pipeline called SMERTI that combines entity replacement, similarity masking, and text infilling. We measure our pipeline’s success by its Semantic Text Exchange Score (STES): the ability to preserve the original text’s sentiment and fluency while adjusting semantic content. We propose to use masking (replacement) rate threshold as an adjustable parameter to control the amount of semantic change in the text. Our experiments demonstrate that SMERTI can outperform baseline models on Yelp reviews, Amazon reviews, and news headlines.
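Below is a toy, hand-wired illustration of the three stages (entity replacement, masking, infilling) using an off-the-shelf masked language model. SMERTI's similarity masking and trained infilling components are considerably more involved; the example sentence, entities, and masked word here are assumptions for illustration.

```python
# Simplified illustration of the SMERTI stages on a toy example:
# (1) replace the main entity, (2) mask a word tied to the old entity,
# (3) infill the mask with a masked language model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

text = "The pizza was delicious and the crust was perfectly crispy."

# (1) Entity replacement: swap in the new semantic content.
text = text.replace("pizza", "soup")

# (2) Similarity masking: mask words semantically tied to the old entity.
#     Here the related word is hand-picked; SMERTI selects them automatically
#     using similarity and a masking rate threshold.
text = text.replace("crust", "<mask>")

# (3) Text infilling: let the masked LM propose replacements consistent
#     with the new entity.
for candidate in fill_mask(text, top_k=3):
    print(candidate["sequence"], round(candidate["score"], 3))
```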
@inproceedings{feng-etal-2019-keep,
title = "Keep Calm and Switch On! Preserving Sentiment and Fluency in Semantic Text Exchange",
author = "Feng, Steven Y. and
Li, Aaron W. and
Hoey, Jesse",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1272",
doi = "10.18653/v1/D19-1272",
pages = "2701--2711"}
Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem even in today's frontier large language models. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature.
We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.
@misc{liu2025unifieddefinitionhallucinationor,
title={A Unified Definition of Hallucination, Or: It's the World Model, Stupid},
author={Emmy Liu and Varun Gangal and Chelsea Zou and Xiaoqi Huang and Michael Yu and Alex Chang and Zhuofu Tao and Sachin Kumar and Steven Y. Feng},
year={2025},
eprint={2512.21577},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.21577},
}
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE.
@misc{phan2025humanitysexam,
title={Humanity's Last Exam},
author={Long Phan and Alice Gatti and Ziwen Han and Nathaniel Li and Josephina Hu and Hugh Zhang and Chen Bo Calvin Zhang and Mohamed Shaaban and John Ling and Sean Shi and Michael Choi and Anish Agrawal and Arnav Chopra and Adam Khoja and Ryan Kim and Richard Ren and Jason Hausenloy and Oliver Zhang and Mantas Mazeika and Dmitry Dodonov and Tung Nguyen and Jaeho Lee and Daron Anderson and Mikhail Doroshenko and Alun Cennyth Stokes and Mobeen Mahmood and Oleksandr Pokutnyi and Oleg Iskra and Jessica P. Wang and John-Clark Levin and Mstyslav Kazakov and Fiona Feng and Steven Y. Feng and Haoran Zhao and Michael Yu and Varun Gangal and Chelsea Zou and Zihan Wang and Serguei Popov and Robert Gerbicz and Geoff Galgon and Johannes Schmitt and Will Yeadon and Yongki Lee and Scott Sauers and Alvaro Sanchez and Fabian Giska and Marc Roth and Søren Riis and Saiteja Utpala and Noah Burns and Gashaw M. Goshu and Mohinder Maheshbhai Naiya and Chidozie Agu and Zachary Giboney and Antrell Cheatom and Francesco Fournier-Facio and Sarah-Jane Crowson and Lennart Finke and Zerui Cheng and Jennifer Zampese and Ryan G. Hoerr and Mark Nandor and Hyunwoo Park and Tim Gehrunger and Jiaqi Cai and Ben McCarty and Alexis C Garretson and Edwin Taylor and Damien Sileo and Qiuyu Ren and Usman Qazi and Lianghui Li and Jungbae Nam and John B. Wydallis and Pavel Arkhipov and Jack Wei Lun Shi and Aras Bacho and Chris G. Willcocks and Hangrui Cao and Sumeet Motwani and Emily de Oliveira Santos and Johannes Veith and Edward Vendrow and Doru Cojoc and Kengo Zenitani and Joshua Robinson and Longke Tang and Yuqi Li and Joshua Vendrow and Natanael Wildner Fraga and Vladyslav Kuchkin and Andrey Pupasov Maksimov and Pierre Marion and Denis Efremov and Jayson Lynch and Kaiqu Liang and Aleksandar Mikov and Andrew Gritsevskiy and Julien Guillod and Gözdenur Demir and Dakotah Martinez and Ben Pageler and Kevin Zhou and Saeed Soori and Ori Press and Henry Tang and Paolo Rissone and Sean R. Green and Lina Brüssel and Moon Twayana and Aymeric Dieuleveut and Joseph Marvin Imperial and Ameya Prabhu and Jinzhou Yang and Nick Crispino and Arun Rao and Dimitri Zvonkine and Gabriel Loiseau and Mikhail Kalinin and Marco Lukas and Ciprian Manolescu and Nate Stambaugh and Subrata Mishra and Tad Hogg and Carlo Bosio and Brian P Coppola and Julian Salazar and Jaehyeok Jin and Rafael Sayous and Stefan Ivanov and Philippe Schwaller and Shaipranesh Senthilkuma and Andres M Bran and Andres Algaba and Kelsey Van den Houte and Lynn Van Der Sypt and Brecht Verbeken and David Noever and Alexei Kopylov and Benjamin Myklebust and Bikun Li and Lisa Schut and Evgenii Zheltonozhskii and Qiaochu Yuan and Derek Lim and Richard Stanley and Tong Yang and John Maar and Julian Wykowski and Martí Oller and Anmol Sahu and Cesare Giulio Ardito and Yuzheng Hu and Ariel Ghislain Kemogne Kamdoum and Alvin Jin and Tobias Garcia Vilchis and Yuexuan Zu and Martin Lackner and James Koppel and Gongbo Sun and Daniil S. 
Antonenko and Steffi Chern and Bingchen Zhao and Pierrot Arsene and Joseph M Cavanagh and Daofeng Li and Jiawei Shen and Donato Crisostomi and Wenjin Zhang and Ali Dehghan and Sergey Ivanov and David Perrella and Nurdin Kaparov and Allen Zang and Ilia Sucholutsky and Arina Kharlamova and Daniil Orel and Vladislav Poritski and Shalev Ben-David and Zachary Berger and Parker Whitfill and Michael Foster and Daniel Munro and Linh Ho and Shankar Sivarajan and Dan Bar Hava and Aleksey Kuchkin and David Holmes and Alexandra Rodriguez-Romero and Frank Sommerhage and Anji Zhang and Richard Moat and Keith Schneider and Zakayo Kazibwe and Don Clarke and Dae Hyun Kim and Felipe Meneguitti Dias and Sara Fish and Veit Elser and Tobias Kreiman and Victor Efren Guadarrama Vilchis and Immo Klose and Ujjwala Anantheswaran and Adam Zweiger and Kaivalya Rawal and Jeffery Li and Jeremy Nguyen and Nicolas Daans and Haline Heidinger and Maksim Radionov and Václav Rozhoň and Vincent Ginis and Christian Stump and Niv Cohen and Rafał Poświata and Josef Tkadlec and Alan Goldfarb and Chenguang Wang and Piotr Padlewski and Stanislaw Barzowski and Kyle Montgomery and Ryan Stendall and Jamie Tucker-Foltz and Jack Stade and T. Ryan Rogers and Tom Goertzen and Declan Grabb and Abhishek Shukla and Alan Givré and John Arnold Ambay and Archan Sen and Muhammad Fayez Aziz and Mark H Inlow and Hao He and Ling Zhang and Younesse Kaddar and Ivar Ängquist and Yanxu Chen and Harrison K Wang and Kalyan Ramakrishnan and Elliott Thornley and Antonio Terpin and Hailey Schoelkopf and Eric Zheng and Avishy Carmi and Ethan D. L. Brown and Kelin Zhu and Max Bartolo and Richard Wheeler and Martin Stehberger and Peter Bradshaw and JP Heimonen and Kaustubh Sridhar and Ido Akov and Jennifer Sandlin and Yury Makarychev and Joanna Tam and Hieu Hoang and David M. Cunningham and Vladimir Goryachev and Demosthenes Patramanis and Michael Krause and Andrew Redenti and David Aldous and Jesyin Lai and Shannon Coleman and Jiangnan Xu and Sangwon Lee and Ilias Magoulas and Sandy Zhao and Ning Tang and Michael K. Cohen and Orr Paradise and Jan Hendrik Kirchner and Maksym Ovchynnikov and Jason O. Matos and Adithya Shenoy and Michael Wang and Yuzhou Nie and Anna Sztyber-Betley and Paolo Faraboschi and Robin Riblet and Jonathan Crozier and Shiv Halasyamani and Shreyas Verma and Prashant Joshi and Eli Meril and Ziqiao Ma and Jérémy Andréoletti and Raghav Singhal and Jacob Platnick and Volodymyr Nevirkovets and Luke Basler and Alexander Ivanov and Seri Khoury and Nils Gustafsson and Marco Piccardo and Hamid Mostaghimi and Qijia Chen and Virendra Singh and Tran Quoc Khánh and Paul Rosu and Hannah Szlyk and Zachary Brown and Himanshu Narayan and Aline Menezes and Jonathan Roberts and William Alley and Kunyang Sun and Arkil Patel and Max Lamparth and Anka Reuel and Linwei Xin and Hanmeng Xu and Jacob Loader and Freddie Martin and Zixuan Wang and Andrea Achilleos and Thomas Preu and Tomek Korbak and Ida Bosio and Fereshteh Kazemi and Ziye Chen and Biró Bálint and Eve J. Y. Lo and Jiaqi Wang and Maria Inês S. Nunes and Jeremiah Milbauer and M Saiful Bari and Zihao Wang and Behzad Ansarinejad and Yewen Sun and Stephane Durand and Hossam Elgnainy and Guillaume Douville and Daniel Tordera and George Balabanian and Hew Wolff and Lynna Kvistad and Hsiaoyun Milliron and Ahmad Sakor and Murat Eron and Andrew Favre D. O. 
and Shailesh Shah and Xiaoxiang Zhou and Firuz Kamalov and Sherwin Abdoli and Tim Santens and Shaul Barkan and Allison Tee and Robin Zhang and Alessandro Tomasiello and G. Bruno De Luca and Shi-Zhuo Looi and Vinh-Kha Le and Noam Kolt and Jiayi Pan and Emma Rodman and Jacob Drori and Carl J Fossum and Niklas Muennighoff and Milind Jagota and Ronak Pradeep and Honglu Fan and Jonathan Eicher and Michael Chen and Kushal Thaman and William Merrill and Moritz Firsching and Carter Harris and Stefan Ciobâcă and Jason Gross and Rohan Pandey and Ilya Gusev and Adam Jones and Shashank Agnihotri and Pavel Zhelnov and Mohammadreza Mofayezi and Alexander Piperski and David K. Zhang and Kostiantyn Dobarskyi and Roman Leventov and Ignat Soroko and Joshua Duersch and Vage Taamazyan and Andrew Ho and Wenjie Ma and William Held and Ruicheng Xian and Armel Randy Zebaze and Mohanad Mohamed and Julian Noah Leser and Michelle X Yuan and Laila Yacar and Johannes Lengler and Katarzyna Olszewska and Claudio Di Fratta and Edson Oliveira and Joseph W. Jackson and Andy Zou and Muthu Chidambaram and Timothy Manik and Hector Haffenden and Dashiell Stander and Ali Dasouqi and Alexander Shen and Bita Golshani and David Stap and Egor Kretov and Mikalai Uzhou and Alina Borisovna Zhidkovskaya and Nick Winter and Miguel Orbegozo Rodriguez and Robert Lauff and Dustin Wehr and Colin Tang and Zaki Hossain and Shaun Phillips and Fortuna Samuele and Fredrik Ekström and Angela Hammon and Oam Patel and Faraz Farhidi and George Medley and Forough Mohammadzadeh and Madellene Peñaflor and Haile Kassahun and Alena Friedrich and Rayner Hernandez Perez and Daniel Pyda and Taom Sakal and Omkar Dhamane and Ali Khajegili Mirabadi and Eric Hallman and Kenchi Okutsu and Mike Battaglia and Mohammad Maghsoudimehrabani and Alon Amit and Dave Hulbert and Roberto Pereira and Simon Weber and Handoko and Anton Peristyy and Stephen Malina and Mustafa Mehkary and Rami Aly and Frank Reidegeld and Anna-Katharina Dick and Cary Friday and Mukhwinder Singh and Hassan Shapourian and Wanyoung Kim and Mariana Costa and Hubeyb Gurdogan and Harsh Kumar and Chiara Ceconello and Chao Zhuang and Haon Park and Micah Carroll and Andrew R. Tawfeek and Stefan Steinerberger and Daattavya Aggarwal and Michael Kirchhof and Linjie Dai and Evan Kim and Johan Ferret and Jainam Shah and Yuzhou Wang and Minghao Yan and Krzysztof Burdzy and Lixin Zhang and Antonio Franca and Diana T. Pham and Kang Yong Loh and Joshua Robinson and Abram Jackson and Paolo Giordano and Philipp Petersen and Adrian Cosma and Jesus Colino and Colin White and Jacob Votava and Vladimir Vinnikov and Ethan Delaney and Petr Spelda and Vit Stritecky and Syed M. Shahid and Jean-Christophe Mourrat and Lavr Vetoshkin and Koen Sponselee and Renas Bacho and Zheng-Xin Yong and Florencia de la Rosa and Nathan Cho and Xiuyu Li and Guillaume Malod and Orion Weller and Guglielmo Albani and Leon Lang and Julien Laurendeau and Dmitry Kazakov and Fatimah Adesanya and Julien Portier and Lawrence Hollom and Victor Souza and Yuchen Anna Zhou and Julien Degorre and Yiğit Yalın and Gbenga Daniel Obikoya and Rai and Filippo Bigi and M. C. Boscá and Oleg Shumar and Kaniuar Bacho and Gabriel Recchia and Mara Popescu and Nikita Shulga and Ngefor Mildred Tanwie and Thomas C. H. 
Lux and Ben Rank and Colin Ni and Matthew Brooks and Alesia Yakimchyk and Huanxu and Liu and Stefano Cavalleri and Olle Häggström and Emil Verkama and Joshua Newbould and Hans Gundlach and Leonor Brito-Santana and Brian Amaro and Vivek Vajipey and Rynaa Grover and Ting Wang and Yosi Kratish and Wen-Ding Li and Sivakanth Gopi and Andrea Caciolai and Christian Schroeder de Witt and Pablo Hernández-Cámara and Emanuele Rodolà and Jules Robins and Dominic Williamson and Vincent Cheng and Brad Raynor and Hao Qi and Ben Segev and Jingxuan Fan and Sarah Martinson and Erik Y. Wang and Kaylie Hausknecht and Michael P. Brenner and Mao Mao and Christoph Demian and Peyman Kassani and Xinyu Zhang and David Avagian and Eshawn Jessica Scipio and Alon Ragoler and Justin Tan and Blake Sims and Rebeka Plecnik and Aaron Kirtland and Omer Faruk Bodur and D. P. Shinde and Yan Carlos Leyva Labrador and Zahra Adoul and Mohamed Zekry and Ali Karakoc and Tania C. B. Santos and Samir Shamseldeen and Loukmane Karim and Anna Liakhovitskaia and Nate Resman and Nicholas Farina and Juan Carlos Gonzalez and Gabe Maayan and Earth Anderson and Rodrigo De Oliveira Pena and Elizabeth Kelley and Hodjat Mariji and Rasoul Pouriamanesh and Wentao Wu and Ross Finocchio and Ismail Alarab and Joshua Cole and Danyelle Ferreira and Bryan Johnson and Mohammad Safdari and Liangti Dai and Siriphan Arthornthurasuk and Isaac C. McAlister and Alejandro José Moyano and Alexey Pronin and Jing Fan and Angel Ramirez-Trinidad and Yana Malysheva and Daphiny Pottmaier and Omid Taheri and Stanley Stepanic and Samuel Perry and Luke Askew and Raúl Adrián Huerta Rodríguez and Ali M. R. Minissi and Ricardo Lorena and Krishnamurthy Iyer and Arshad Anil Fasiludeen and Ronald Clark and Josh Ducey and Matheus Piza and Maja Somrak and Eric Vergo and Juehang Qin and Benjámin Borbás and Eric Chu and Jack Lindsey and Antoine Jallon and I. M. J. McInnis and Evan Chen and Avi Semler and Luk Gloor and Tej Shah and Marc Carauleanu and Pascal Lauer and Tran Đuc Huy and Hossein Shahrtash and Emilien Duc and Lukas Lewark and Assaf Brown and Samuel Albanie and Brian Weber and Warren S. Vaz and Pierre Clavier and Yiyang Fan and Gabriel Poesia Reis e Silva and Long and Lian and Marcus Abramovitch and Xi Jiang and Sandra Mendoza and Murat Islam and Juan Gonzalez and Vasilios Mavroudis and Justin Xu and Pawan Kumar and Laxman Prasad Goswami and Daniel Bugas and Nasser Heydari and Ferenc Jeanplong and Thorben Jansen and Antonella Pinto and Archimedes Apronti and Abdallah Galal and Ng Ze-An and Ankit Singh and Tong Jiang and Joan of Arc Xavier and Kanu Priya Agarwal and Mohammed Berkani and Gang Zhang and Zhehang Du and Benedito Alves de Oliveira Junior and Dmitry Malishev and Nicolas Remy and Taylor D. 
Hartman and Tim Tarver and Stephen Mensah and Gautier Abou Loume and Wiktor Morak and Farzad Habibi and Sarah Hoback and Will Cai and Javier Gimenez and Roselynn Grace Montecillo and Jakub Łucki and Russell Campbell and Asankhaya Sharma and Khalida Meer and Shreen Gul and Daniel Espinosa Gonzalez and Xavier Alapont and Alex Hoover and Gunjan Chhablani and Freddie Vargus and Arunim Agarwal and Yibo Jiang and Deepakkumar Patil and David Outevsky and Kevin Joseph Scaria and Rajat Maheshwari and Abdelkader Dendane and Priti Shukla and Ashley Cartwright and Sergei Bogdanov and Niels Mündler and Sören Möller and Luca Arnaboldi and Kunvar Thaman and Muhammad Rehan Siddiqi and Prajvi Saxena and Himanshu Gupta and Tony Fruhauff and Glen Sherman and Mátyás Vincze and Siranut Usawasutsakorn and Dylan Ler and Anil Radhakrishnan and Innocent Enyekwe and Sk Md Salauddin and Jiang Muzhen and Aleksandr Maksapetyan and Vivien Rossbach and Chris Harjadi and Mohsen Bahaloohoreh and Claire Sparrow and Jasdeep Sidhu and Sam Ali and Song Bian and John Lai and Eric Singer and Justine Leon Uro and Greg Bateman and Mohamed Sayed and Ahmed Menshawy and Darling Duclosel and Dario Bezzi and Yashaswini Jain and Ashley Aaron and Murat Tiryakioglu and Sheeshram Siddh and Keith Krenek and Imad Ali Shah and Jun Jin and Scott Creighton and Denis Peskoff and Zienab EL-Wasif and Ragavendran P V and Michael Richmond and Joseph McGowan and Tejal Patwardhan and Hao-Yu Sun and Ting Sun and Nikola Zubić and Samuele Sala and Stephen Ebert and Jean Kaddour and Manuel Schottdorf and Dianzhuo Wang and Gerol Petruzella and Alex Meiburg and Tilen Medved and Ali ElSheikh and S Ashwin Hebbar and Lorenzo Vaquero and Xianjun Yang and Jason Poulos and Vilém Zouhar and Sergey Bogdanik and Mingfang Zhang and Jorge Sanz-Ros and David Anugraha and Yinwei Dai and Anh N. Nhu and Xue Wang and Ali Anil Demircali and Zhibai Jia and Yuyin Zhou and Juncheng Wu and Mike He and Nitin Chandok and Aarush Sinha and Gaoxiang Luo and Long Le and Mickaël Noyé and Michał Perełkiewicz and Ioannis Pantidis and Tianbo Qi and Soham Sachin Purohit and Letitia Parcalabescu and Thai-Hoa Nguyen and Genta Indra Winata and Edoardo M. Ponti and Hanchen Li and Kaustubh Dhole and Jongee Park and Dario Abbondanza and Yuanli Wang and Anupam Nayak and Diogo M. Caetano and Antonio A. W. L. Wong and Maria del Rio-Chanona and Dániel Kondor and Pieter Francois and Ed Chalstrey and Jakob Zsambok and Dan Hoyer and Jenny Reddish and Jakob Hauser and Francisco-Javier Rodrigo-Ginés and Suchandra Datta and Maxwell Shepherd and Thom Kamphuis and Qizheng Zhang and Hyunjun Kim and Ruiji Sun and Jianzhu Yao and Franck Dernoncourt and Satyapriya Krishna and Sina Rismanchian and Bonan Pu and Francesco Pinto and Yingheng Wang and Kumar Shridhar and Kalon J. Overholt and Glib Briia and Hieu Nguyen and David and Soler Bartomeu and Tony CY Pang and Adam Wecker and Yifan Xiong and Fanfei Li and Lukas S. Huber and Joshua Jaeger and Romano De Maddalena and Xing Han Lù and Yuhui Zhang and Claas Beger and Patrick Tser Jern Kon and Sean Li and Vivek Sanker and Ming Yin and Yihao Liang and Xinlu Zhang and Ankit Agrawal and Li S. 
Yifei and Zechen Zhang and Mu Cai and Yasin Sonmez and Costin Cozianu and Changhao Li and Alex Slen and Shoubin Yu and Hyun Kyu Park and Gabriele Sarti and Marcin Briański and Alessandro Stolfo and Truong An Nguyen and Mike Zhang and Yotam Perlitz and Jose Hernandez-Orallo and Runjia Li and Amin Shabani and Felix Juefei-Xu and Shikhar Dhingra and Orr Zohar and My Chiffon Nguyen and Alexander Pondaven and Abdurrahim Yilmaz and Xuandong Zhao and Chuanyang Jin and Muyan Jiang and Stefan Todoran and Xinyao Han and Jules Kreuer and Brian Rabern and Anna Plassart and Martino Maggetti and Luther Yap and Robert Geirhos and Jonathon Kean and Dingsu Wang and Sina Mollaei and Chenkai Sun and Yifan Yin and Shiqi Wang and Rui Li and Yaowen Chang and Anjiang Wei and Alice Bizeul and Xiaohan Wang and Alexandre Oliveira Arrais and Kushin Mukherjee and Jorge Chamorro-Padial and Jiachen Liu and Xingyu Qu and Junyi Guan and Adam Bouyamourn and Shuyu Wu and Martyna Plomecka and Junda Chen and Mengze Tang and Jiaqi Deng and Shreyas Subramanian and Haocheng Xi and Haoxuan Chen and Weizhi Zhang and Yinuo Ren and Haoqin Tu and Sejong Kim and Yushun Chen and Sara Vera Marjanović and Junwoo Ha and Grzegorz Luczyna and Jeff J. Ma and Zewen Shen and Dawn Song and Cedegao E. Zhang and Zhun Wang and Gaël Gendron and Yunze Xiao and Leo Smucker and Erica Weng and Kwok Hao Lee and Zhe Ye and Stefano Ermon and Ignacio D. Lopez-Miguel and Theo Knights and Anthony Gitter and Namkyu Park and Boyi Wei and Hongzheng Chen and Kunal Pai and Ahmed Elkhanany and Han Lin and Philipp D. Siedler and Jichao Fang and Ritwik Mishra and Károly Zsolnai-Fehér and Xilin Jiang and Shadab Khan and Jun Yuan and Rishab Kumar Jain and Xi Lin and Mike Peterson and Zhe Wang and Aditya Malusare and Maosen Tang and Isha Gupta and Ivan Fosin and Timothy Kang and Barbara Dworakowska and Kazuki Matsumoto and Guangyao Zheng and Gerben Sewuster and Jorge Pretel Villanueva and Ivan Rannev and Igor Chernyavsky and Jiale Chen and Deepayan Banik and Ben Racz and Wenchao Dong and Jianxin Wang and Laila Bashmal and Duarte V. Gonçalves and Wei Hu and Kaushik Bar and Ondrej Bohdal and Atharv Singh Patlan and Shehzaad Dhuliawala and Caroline Geirhos and Julien Wist and Yuval Kansal and Bingsen Chen and Kutay Tire and Atak Talay Yücel and Brandon Christof and Veerupaksh Singla and Zijian Song and Sanxing Chen and Jiaxin Ge and Kaustubh Ponkshe and Isaac Park and Tianneng Shi and Martin Q. Ma and Joshua Mak and Sherwin Lai and Antoine Moulin and Zhuo Cheng and Zhanda Zhu and Ziyi Zhang and Vaidehi Patil and Ketan Jha and Qiutong Men and Jiaxuan Wu and Tianchi Zhang and Bruno Hebling Vieira and Alham Fikri Aji and Jae-Won Chung and Mohammed Mahfoud and Ha Thi Hoang and Marc Sperzel and Wei Hao and Kristof Meding and Sihan Xu and Vassilis Kostakos and Davide Manini and Yueying Liu and Christopher Toukmaji and Jay Paek and Eunmi Yu and Arif Engin Demircali and Zhiyi Sun and Ivan Dewerpe and Hongsen Qin and Roman Pflugfelder and James Bailey and Johnathan Morris and Ville Heilala and Sybille Rosset and Zishun Yu and Peter E. Chen and Woongyeong Yeo and Eeshaan Jain and Ryan Yang and Sreekar Chigurupati and Julia Chernyavsky and Sai Prajwal Reddy and Subhashini Venugopalan and Hunar Batra and Core Francisco Park and Hieu Tran and Guilherme Maximiano and Genghan Zhang and Yizhuo Liang and Hu Shiyu and Rongwu Xu and Rui Pan and Siddharth Suresh and Ziqi Liu and Samaksh Gulati and Songyang Zhang and Peter Turchin and Christopher W. Bartlett and Christopher R. 
Scotese and Phuong M. Cao and Ben Wu and Jacek Karwowski and Davide Scaramuzza and Aakaash Nattanmai and Gordon McKellips and Anish Cheraku and Asim Suhail and Ethan Luo and Marvin Deng and Jason Luo and Ashley Zhang and Kavin Jindel and Jay Paek and Kasper Halevy and Allen Baranov and Michael Liu and Advaith Avadhanam and David Zhang and Vincent Cheng and Brad Ma and Evan Fu and Liam Do and Joshua Lass and Hubert Yang and Surya Sunkari and Vishruth Bharath and Violet Ai and James Leung and Rishit Agrawal and Alan Zhou and Kevin Chen and Tejas Kalpathi and Ziqi Xu and Gavin Wang and Tyler Xiao and Erik Maung and Sam Lee and Ryan Yang and Roy Yue and Ben Zhao and Julia Yoon and Sunny Sun and Aryan Singh and Ethan Luo and Clark Peng and Tyler Osbey and Taozhi Wang and Daryl Echeazu and Hubert Yang and Timothy Wu and Spandan Patel and Vidhi Kulkarni and Vijaykaarti Sundarapandiyan and Ashley Zhang and Andrew Le and Zafir Nasim and Srikar Yalam and Ritesh Kasamsetty and Soham Samal and Hubert Yang and David Sun and Nihar Shah and Abhijeet Saha and Alex Zhang and Leon Nguyen and Laasya Nagumalli and Kaixin Wang and Alan Zhou and Aidan Wu and Jason Luo and Anwith Telluri and Summer Yue and Alexandr Wang and Dan Hendrycks},
year={2025},
eprint={2501.14249},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2501.14249},
}
Pretraining large language models effectively requires strategic data selection, blending, and ordering. However, key details about data mixtures, especially their scalability to longer token horizons and larger model sizes, remain underexplored due to limited disclosure by model developers. To address this, we formalize the concept of two-phase pretraining and conduct an extensive systematic study on how to select and mix data to maximize model accuracies for the two phases. Our findings illustrate that a two-phase approach to pretraining outperforms random data ordering and the natural distribution of tokens by 3.4% and 17% on average accuracies. We provide in-depth guidance on crafting optimal blends based on the quality of the data source and the number of epochs to be seen. We propose designing blends using downsampled data at a smaller scale of 1T tokens, and then demonstrate effective scaling of our approach to a larger token horizon of 15T tokens and a larger model size of 25B parameters. These insights provide a series of steps practitioners can follow to design and scale their data blends.
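To make the two-phase idea concrete, here is a toy sketch that samples data sources under different blend weights in each phase. The source names, weights, and phase split are purely illustrative assumptions, not the blends recommended in the paper.

```python
# Toy sketch of two-phase pretraining data blending: sample documents according to
# one mixture of sources in phase 1 and a different mixture in phase 2.
# Source names, weights, and the phase split are illustrative only.
import random

# Hypothetical data sources mapped to (phase-1 weight, phase-2 weight).
blend = {
    "web_crawl":   (0.60, 0.30),
    "books":       (0.15, 0.20),
    "code":        (0.15, 0.20),
    "math_and_qa": (0.10, 0.30),
}

def sample_source(phase: int) -> str:
    """Pick a data source according to the blend weights for the given phase."""
    sources = list(blend)
    weights = [blend[s][phase - 1] for s in sources]
    return random.choices(sources, weights=weights, k=1)[0]

def report_token_schedule(total_tokens: int, phase1_fraction: float = 0.7) -> None:
    """Split the token budget into two phases and estimate per-source shares."""
    phase_tokens = [int(total_tokens * phase1_fraction),
                    int(total_tokens * (1 - phase1_fraction))]
    for phase, tokens in enumerate(phase_tokens, start=1):
        counts = {s: 0 for s in blend}
        for _ in range(10_000):  # Monte Carlo estimate of the realized blend
            counts[sample_source(phase)] += 1
        shares = {s: round(c / 10_000, 2) for s, c in counts.items()}
        print(f"Phase {phase}: ~{tokens:,} tokens, source shares {shares}")

report_token_schedule(total_tokens=1_000_000_000)
```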
@misc{feng2024maximizedataspotentialenhancing,
title={Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining},
author={Steven Feng and Shrimai Prabhumoye and Kezhi Kong and Dan Su and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
year={2024},
eprint={2412.15285},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.15285},
}
Talks, Interviews, & Lectures
Apr. 2024: The first lecture of our Stanford CS25 Transformers V4 (Spring 2024) course! We gave a brief intro and overview of the history of NLP, Transformers and how they work, and their impact. We also discussed recent trends, breakthroughs, applications, and remaining challenges/weaknesses of Transformers. Lastly, Div talked about AI agents. This is a super useful lecture for those who want a broader overview of Transformers and the field right now! Slides here. We had a full room (approx. 200 folks in the audience) and over 300 on Zoom! All other talks are or will be released on the same YouTube playlist.
Aug. 2021: Varun and I gave a talk (to over 100 attendees) for Google Research about data augmentation for NLP (inspired by our survey paper). We also touch upon NL-Augmenter and our CtrlGen Workshop at NeurIPS 2021.
July 2021: Eduard Hovy and I were on The Data Exchange Podcast with Ben Lorica. We discuss data augmentation for NLP (inspired by our survey paper) and challenges + future directions in NLP and machine learning research. Audio and notes here.
Teaching and Instruction
Stanford's CS25: Transformers United - I am a co-instructor for Stanford's CS25 course! We are one of Stanford's hottest seminar courses, with attendance open to the public! Zoom link and other details are on our course website. We feature in-depth discussions from exciting speakers each week about cutting-edge research in Transformers. Speakers so far include Andrej Karpathy, Geoffrey Hinton, Jim Fan, Ashish Vaswani, and Jason Wei. Recordings of talks are here. Some class photos below! Speakers pictured: Andrej Karpathy, Jim Fan, Jason Wei & Hyung Won Chung, and the CS25 instructors.
Mentorship and Advising
Shijia Yang [Stanford Master's in Computer Science, Class of 2025]
Mentoring a research project on multimodal chain-of-thought reasoning using vision-language models (VLMs).
Sedrick Scott Keh [CMU Master's in Machine Learning (MSML), Class of 2022]
Mentored several research projects on controllable and creative text generation [e.g. paper1, paper2].
Kevin Lu [University of Waterloo Undergrad, Computer Science, Class of 2026]
Mentored several research projects on controllable, creative, and visually-grounded text generation [e.g. paper1, paper2].
Zhuofu (Derek) Tao [UCLA Ph.D. in Electrical Engineering, Class of 2025]
Mentored a research project on controllable and visually-grounded text generation [paper].