ETH Zurich Talk - April 14, 2025

The document discusses significant trends in artificial intelligence (AI) and machine learning (ML), highlighting advancements in model architecture, training techniques, and hardware improvements over the past fifteen years. Key developments include the evolution of neural networks, the introduction of transformer models, and the use of self-supervised learning, which have collectively transformed AI capabilities. The document also emphasizes the importance of collaboration and open-source tools in shaping the future of AI technology.


Important Trends in AI:

How Did We Get Here, What Can We Do Now,
and How Can We Shape AI’s Future?
Jeff Dean, Chief Scientist, Google Research & Google DeepMind

@jeffdean.bsky.social and @JeffDean


ai.google/research/people/jeff

Presenting the work of many people at Google and elsewhere


Some observations
In recent years, ML has completely changed our expectations of
what is possible with computers

Increasing scale (compute, data, model size) delivers better results

Algorithmic and model architecture improvements have also delivered massive gains

The kinds of computations we want to run and the hardware on which we run them are changing dramatically
Fifteen Years of Machine Learning Advances

or

How Did Today’s Models Come To Be?


Key Building Block from Last Century: Neural Networks


Key building block: neural networks, made up of artificial neurons, loosely designed to
mimic how real neurons behave
Key Building Block from Last Century: Backpropagation


Backpropagation of errors gives an algorithm for how to update the weights of the whole neural network based on errors observed at the outputs of the model

Key building block: backpropagation of errors (using the chain rule) gives an effective algorithm for updating the weights of a neural network to minimize errors on training data
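As a rough illustration of the idea (not the original implementation), here is a minimal JAX sketch in which the chain rule is applied automatically by jax.grad to get the gradient of the loss with respect to every weight, which is then used to update the whole network; the tiny two-layer network, data, and learning rate are arbitrary choices for the example:

import jax
import jax.numpy as jnp

def predict(weights, x):
    # Two-layer network: hidden tanh layer followed by a linear output.
    h = jnp.tanh(x @ weights["w1"])
    return h @ weights["w2"]

def loss(weights, x, y):
    # Squared error between predictions and targets.
    return jnp.mean((predict(weights, x) - y) ** 2)

key = jax.random.PRNGKey(0)
weights = {
    "w1": jax.random.normal(key, (4, 8)) * 0.1,
    "w2": jax.random.normal(key, (8, 1)) * 0.1,
}
x = jnp.ones((16, 4))
y = jnp.zeros((16, 1))

# Backpropagation: jax.grad applies the chain rule through the network
# to get d(loss)/d(weights), which is used to update every weight.
grads = jax.grad(loss)(weights, x, y)
weights = jax.tree_util.tree_map(lambda w, g: w - 0.1 * g, weights, grads)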
2012: Scale Matters

Training a very large neural network (60X bigger than previous largest neural network) using
16,000 CPU cores gives major advances in quality
(~70% relative improvement in ImageNet 22K state-of-the-art)
Le et al., ICML 2012, arxiv.org/abs/1112.6209
2012: Distributed Training on Many Computers

[Diagrams: model parallelism and data parallelism]

Combining model parallelism and data parallelism for neural network training across
thousands of computers enables training of much larger (50-100X) neural networks than
previously possible

Large Scale Distributed Deep Networks, Dean et al., NeurIPS 2012,


research.google.com/archive/large_deep_networks_nips2012.pdf
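A hedged sketch of the data-parallelism half of this idea using present-day JAX (pmap), not the original DistBelief system: each device holds a replica of the weights, computes gradients on its own shard of the batch, and the gradients are averaged across devices before the update. The linear model, shapes, and learning rate are placeholders:

from functools import partial
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# Data parallelism: every device holds a full copy of the weights but
# sees only its own shard of the batch; gradients are averaged with pmean.
@partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.1 * grads

n = jax.local_device_count()
w = jnp.zeros((4, 1))
w_replicated = jnp.broadcast_to(w, (n,) + w.shape)   # one weight copy per device
x_shards = jnp.ones((n, 32, 4))                      # per-device batch shards
y_shards = jnp.zeros((n, 32, 1))
w_replicated = train_step(w_replicated, x_shards, y_shards)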
2013: Distributed Representations of Words Are Powerful
Word2Vec

Distributed representations of words are powerful:


(1) Nearby words in high dimensional space are related
cat, puma, tiger, … are all nearby

(2) Directions are meaningful


king – queen ~= man – woman
ICLR 2013 workshop, arxiv.org/abs/1301.3781. Follow-up appeared in NeurIPS 2013, arxiv.org/abs/1310.4546
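A toy illustration of these two properties with hand-made 3-D vectors (real word2vec embeddings are learned and have hundreds of dimensions; the numbers below are invented purely to show the geometry):

import jax.numpy as jnp

def cosine(a, b):
    return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b))

# Hypothetical 3-D embeddings, chosen by hand just to illustrate the idea.
emb = {
    "king":  jnp.array([0.9, 0.8, 0.1]),
    "queen": jnp.array([0.9, 0.2, 0.1]),
    "man":   jnp.array([0.1, 0.8, 0.7]),
    "woman": jnp.array([0.1, 0.2, 0.7]),
    "cat":   jnp.array([0.2, 0.1, -0.9]),
}

# (1) Related words are nearby: cosine(king, queen) is much larger than cosine(king, cat).
print(cosine(emb["king"], emb["queen"]), cosine(emb["king"], emb["cat"]))

# (2) Directions are meaningful: the shared offset means king - queen ~= man - woman.
print(emb["king"] - emb["queen"])   # ~[0, 0.6, 0]
print(emb["man"] - emb["woman"])    # ~[0, 0.6, 0]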
2014: Models that Map One Sequence to Another are Powerful
Sequence to Sequence

Use a neural encoder over an input sequence to generate state, use that to
initialize state of a neural decoder. Scale up LSTMs and this works.

Appeared in NeurIPS 2014, arxiv.org/abs/1409.3215
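A minimal sketch of the encoder-decoder idea, using a plain tanh RNN cell instead of the paper's LSTMs and made-up sizes, just to show the state handoff from encoder to decoder:

import jax
import jax.numpy as jnp

def rnn_cell(params, state, x):
    # Simple tanh RNN cell (the paper used LSTMs; this keeps the sketch short).
    return jnp.tanh(x @ params["wx"] + state @ params["wh"])

def encode(params, inputs):
    state = jnp.zeros(params["wh"].shape[0])
    for x in inputs:                       # run encoder over the input sequence
        state = rnn_cell(params, state, x)
    return state                           # final state summarizes the input

def decode(params, out_proj, state, steps):
    outputs = []
    y = jnp.zeros(params["wx"].shape[0])   # start-token embedding (all zeros here)
    for _ in range(steps):                 # decoder state initialized from encoder state
        state = rnn_cell(params, state, y)
        y = state @ out_proj               # predict next-step representation
        outputs.append(y)
    return outputs

key = jax.random.PRNGKey(0)
emb, hidden = 4, 8
params = {"wx": jax.random.normal(key, (emb, hidden)) * 0.1,
          "wh": jax.random.normal(key, (hidden, hidden)) * 0.1}
out_proj = jax.random.normal(key, (hidden, emb)) * 0.1
inputs = [jnp.ones(emb) for _ in range(5)]
outputs = decode(params, out_proj, encode(params, inputs), steps=3)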


2015: Specialized Hardware for Neural Network Inference
Reduced precision is OK: about 1.2 (not 1.21042), about 0.6 (not 0.61127), about 0.7 (not 0.73989343)
Handful of specific operations (e.g. matrix multiply)

Tensor Processing Unit (TPU) v1: 2015, 92 teraops (inference only)

Specialization is much more efficient:


Compared to contemporary CPUs & GPUs:
TPU v1 is 15X-30X faster
TPU v1 is 30X-80X more energy efficient

Appeared in ISCA, 2017, arxiv.org/abs/1704.04760. Now most cited paper in ISCA’s 50 year history
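A small sketch of the reduced-precision point above (values like 1.2 instead of 1.21042 are fine for neural networks): a matrix multiply in bfloat16, a low-precision format TPUs support, stays very close to the float32 result. The sizes below are arbitrary:

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (256, 256))
b = jax.random.normal(jax.random.PRNGKey(1), (256, 256))

exact = a @ b                                               # float32 matmul
approx = a.astype(jnp.bfloat16) @ b.astype(jnp.bfloat16)    # reduced-precision matmul

# The relative error of the low-precision result is tiny, which is why
# specialized hardware can trade precision for speed and energy efficiency.
rel_err = jnp.abs(approx.astype(jnp.float32) - exact) / (jnp.abs(exact) + 1e-6)
print(float(rel_err.mean()))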
2016: Specialized Supercomputers for Neural Network Training

Connect thousands of chips together (TPU pods) with custom high-speed networks
to enable faster neural network training

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for
Embeddings, Jouppi et al., ISCA 2023, arxiv.org/abs/2304.01433
Continual Hardware Performance Scaling

[Chart: peak performance across TPU generations: 11 petaflops → 1,126 petaflops → 42,522 petaflops]

blog.google/products/google-cloud/ironwood-tpu-age-of-inference/
Continual Hardware Improvements in Energy Efficiency

~30X energy efficiency improvement vs. TPU v2

Peak FP8 flops delivered per watt of thermal design power per chip package

blog.google/products/google-cloud/ironwood-tpu-age-of-inference/
Open source tools enable the whole community

pytorch.org
tensorflow.org

github.com/jax-ml/jax
2017: Transformer Model Architecture: Attention

Don’t try to force state into a single recurrent distributed representation. Instead, save all past representations and attend to them.

Attention is All You Need, Vaswani et al., NeurIPS 2017, arxiv.org/abs/1706.03762
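A minimal sketch of the core attention computation (single head, no masking or learned projections): every past representation is kept, and each query forms a weighted sum over all of them. The shapes below are arbitrary:

import jax
import jax.numpy as jnp

def attention(q, k, v):
    # q, k, v: [sequence_length, d] queries, keys, and values.
    scores = q @ k.T / jnp.sqrt(q.shape[-1])    # similarity of each query to every saved key
    weights = jax.nn.softmax(scores, axis=-1)   # attention weights over all past representations
    return weights @ v                          # weighted sum of the saved values

x = jax.random.normal(jax.random.PRNGKey(0), (6, 16))   # 6 token representations, d=16
out = attention(x, x, x)                                 # self-attention over the sequence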


2017: Transformer Model Architecture: Attention

Figure from Scaling Laws for Neural Language Models, Kaplan et al., arxiv.org/abs/2001.08361

Higher accuracy w/ 10X-100X less compute and 10X smaller models!

Attention is All You Need, Vaswani et al., NeurIPS 2017, arxiv.org/abs/1706.03762


2018: Language Modeling At Scale With Self-Supervised Data

There’s lots of text in the world! Self-supervised learning on this text can
provide very large amounts of training data with the “right” answer known (“wrong
guess” is used to provide gradient descent loss training signal)

Self-supervised learning
on text with large models
is one of the major
reasons chat/language
models have gotten so
good

Language Models are Few-Shot Learners, Brown et al., NeurIPS, 2020, arxiv.org/abs/2005.14165
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., ACL 2019, arxiv.org/abs/1810.04805
2018: Language Modeling At Scale With Self-Supervised Data

There’s lots of text in the world! Self-supervised learning on this text can
provide very large amounts of training data with the “right” answer known (“wrong
guess” is used to provide gradient descent loss training signal)

Different kinds of training objectives:

Autoregressive (look at prefix, predict next word):
Zürich is ______
Zürich is the _______
Zürich is the largest _______

Fill-in-the-Blank (e.g. look in both directions, BERT):
Zürich ____ the largest ____ in ______.
Zürich is the ______ city ____ Switzerland.
…
Language Models are Few-Shot Learners, Brown et al., NeurIPS, 2020, arxiv.org/abs/2005.14165
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., ACL 2019, arxiv.org/abs/1810.04805
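A sketch of why self-supervision needs no labels: the text itself provides the "right" answer for the autoregressive objective, and the wrong guesses supply the cross-entropy training signal. The token ids and the stand-in "model" (a plain lookup table of logits) are placeholders, not a real language model:

import jax
import jax.numpy as jnp

# Toy example: the training text itself provides the labels.
tokens = jnp.array([5, 17, 42, 7, 99])     # e.g. token ids for a short sentence
inputs, targets = tokens[:-1], tokens[1:]  # predict each next token from its prefix

def cross_entropy(logits, targets):
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, targets[:, None], axis=-1))

# Stand-in "model": any function mapping token ids to next-token logits would do.
vocab_size = 128
def model(params, inputs):
    return params[inputs]                  # here just an embedding-table lookup

params = jnp.zeros((vocab_size, vocab_size))
loss = cross_entropy(model(params, inputs), targets)   # gradient descent signal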
2021: Transformers for Vision

Visualization of
attention mechanism

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy et al., ICLR 2021,
arxiv.org/abs/2010.11929
2017: Sparse Models (e.g. Mixture of Experts) Outperform
Dense Models


Give model much larger capacity w/ lots of experts but only activate a few chosen experts per token:
(A) ~8X reduction in training compute cost for ~same accuracy, or
(B) major accuracy improvements for same training compute cost
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean.
ICLR 2017, arxiv.org/abs/1701.06538
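A simplified sketch of the routing idea (not the exact sparsely-gated layer from the paper): a router scores all experts per token, only the top-k experts run, and their outputs are mixed by the normalized router scores, so capacity grows with the number of experts while per-token compute stays roughly fixed. Shapes and sizes are illustrative:

import jax
import jax.numpy as jnp

def moe_layer(router_w, expert_ws, x, k=2):
    # x: [num_tokens, d]; expert_ws: [num_experts, d, d]; router_w: [d, num_experts]
    scores = x @ router_w                              # router score for every expert
    top_vals, top_idx = jax.lax.top_k(scores, k)       # pick k experts per token
    gates = jax.nn.softmax(top_vals, axis=-1)          # mix only the chosen experts
    out = jnp.zeros_like(x)
    for slot in range(k):
        idx = top_idx[:, slot]                         # chosen expert for each token
        # A real implementation dispatches tokens to experts; gathering weights
        # per token here keeps the sketch short while the math stays the same.
        expert_out = jnp.einsum("td,tde->te", x, expert_ws[idx])
        out = out + gates[:, slot:slot + 1] * expert_out
    return out

key = jax.random.PRNGKey(0)
d, num_experts, num_tokens = 16, 8, 4
x = jax.random.normal(key, (num_tokens, d))
router_w = jax.random.normal(key, (d, num_experts)) * 0.1
expert_ws = jax.random.normal(key, (num_experts, d, d)) * 0.1
y = moe_layer(router_w, expert_ws, x)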
Continued Research on Sparse Models

Gemini 1.5 Pro/Gemini 2.0/Gemini 2.5 use mixture-of-experts (MoE) architectures, building on a long line
of Google research efforts on sparse models:
● 2017: Shazeer et al., Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
ICLR 2017. arxiv.org/abs/1701.06538
● 2020: Lepikhin et al., GShard: Scaling giant models with conditional computation and automatic sharding.
ICLR 2021. arxiv.org/abs/2006.16668
● 2021: Carlos Riquelme et al., Scaling vision with sparse mixture of experts, NeurIPS 2021.
arxiv.org/abs/2106.05974
● 2021: Fedus et al., Switch transformers: Scaling to trillion parameter models with simple and efficient
sparsity. JMLR 2022. arxiv.org/abs/2101.03961
● 2022: Clark et al., Unified scaling laws for routed language models, ICML 2022. arxiv.org/abs/2202.01169
● 2022: Zoph et al., Designing effective sparse expert models. arxiv.org/abs/2202.08906
● 2023: Puigcerver et al., From Sparse to Soft Mixtures of Experts. arxiv.org/abs/2308.00951
● 2024: Obando-Ceron et al., Mixtures of Experts Unlock Parameter Scaling for Deep RL.
arxiv.org/abs/2402.08609
● 2024: Raposo et al., Mixture-of-Depths: Dynamically allocating compute in transformer-based language
models. arxiv.org/abs/2404.02258
● 2024: Douillard et al., DiPaCo: Distributed Path Composition. arxiv.org/abs/2403.10616
2018: Software abstractions for Distributed ML Computations
Example: Pathways

[Diagram: accelerators in multiple buildings across Region A and Region B]

Scalable software can simplify running large-scale computations


Pathways: Asynchronous Distributed Dataflow for ML, Barham et al., MLSys 2022: arxiv.org/abs/2203.12533
2018: Software abstractions for Distributed ML Computations

With JAX+Pathways, the entire training process is driven by a single Python process (the client) on one host

[Diagram: a single client driving accelerators in multiple buildings across Region A and Region B]

Scalable software can simplify running large-scale computations


Pathways: Asynchronous Distributed Dataflow for ML, Barham et al., MLSys 2022: arxiv.org/abs/2203.12533
Pathways: Now Available for Cloud Customers

Pathways enables a single JAX client to see and use many devices (e.g. 1 to 100,000 chips), even though these are distributed across many hosts and even many TPU pods

Pathways: Asynchronous Distributed Dataflow for ML, Barham et al., MLSys 2022: arxiv.org/abs/2203.12533
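A hedged sketch of this single-client style in present-day JAX: one Python process lists every attached device and shards a large array across them; under Pathways the same kind of client can address devices that span many hosts and pods (cluster setup details omitted here):

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# One Python client sees every attached accelerator; with Pathways this list
# can cover devices spread across many hosts and even many TPU pods.
devices = jax.devices()
mesh = Mesh(np.array(devices), ("data",))

# Shard a large array across all devices along its first dimension.
sharding = NamedSharding(mesh, PartitionSpec("data"))
x = jax.device_put(jnp.zeros((len(devices) * 1024, 1024)), sharding)

# jit-compiled computations on x run across every device it is sharded over.
y = jax.jit(lambda a: (a * 2.0).sum())(x)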
2022: “Thinking longer” at inference time is very useful
“Chain of Thought prompting” is one such technique

Chain of Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei, Xuezhi Wang, Dale
Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou, 2022, arxiv.org/abs/2201.11903
2022: “Thinking longer” at inference time is very useful
“Chain of Thought prompting” is one such technique

[Chart: solve rate (%) vs. model scale (billions of parameters)]

Prompting the model to “show its work” dramatically improves accuracy on reasoning tasks
Chain of Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei, Xuezhi Wang, Dale
Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou, 2022, arxiv.org/abs/2201.11903
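To make the technique concrete, here is a sketch of a few-shot chain-of-thought prompt (the worked arithmetic example follows the style used in the paper; the generate function is a placeholder, not any specific API):

# Few-shot chain-of-thought prompt: the exemplar shows its reasoning steps,
# which encourages the model to "show its work" before giving the final answer.
COT_PROMPT = """Q: A cafeteria had 23 apples. They used 20 and bought 6 more. How many now?
A: They started with 23 apples. After using 20 they had 23 - 20 = 3.
   Buying 6 more gives 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have?
A:"""

def generate(prompt: str) -> str:
    # Placeholder for a call to whichever LLM text-generation API is available.
    raise NotImplementedError

# answer = generate(COT_PROMPT)   # expected to reason step by step, ending in "The answer is 11."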
2014: Distillation: Use Powerful “Teacher” Models to Make
Smaller, Cheaper “Student” Models

“performed the Concerto for “ __?__


Real next word: “Violin”

Distillation: Use large high quality model as “teacher” when training smaller
“student” model
Rejected from NeurIPS 2014. Published in workshop & put on Arxiv: arxiv.org/abs/1503.02531. 24,000+ citations.
2014: Distillation: Use Powerful “Teacher” Models to Make
Smaller, Cheaper “Student” Models

Gives a much richer signal for training: try to get the student to match the “soft probability distribution” of the large model

“performed the Concerto for “ __?__


Real next word: “Violin”

Teacher model says: “Violin: 0.4, Piano: 0.2, Trumpet: 0.01, Airplane: 0.00000001”

Distillation: Use large high quality model as “teacher” when training smaller
“student” model
Rejected from NeurIPS 2014. Published in workshop & put on Arxiv: arxiv.org/abs/1503.02531. 24,000+ citations.
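A minimal sketch of a distillation loss under common assumptions (a temperature-softened teacher distribution mixed with the ordinary loss on the true next word; the temperature, mixing weight, and tiny vocabulary are illustrative, not the paper's exact settings):

import jax
import jax.numpy as jnp

def distillation_loss(student_logits, teacher_logits, true_token, temperature=2.0, alpha=0.5):
    # Soft targets: the teacher's full probability distribution, softened by a temperature.
    teacher_probs = jax.nn.softmax(teacher_logits / temperature)
    student_logp = jax.nn.log_softmax(student_logits / temperature)
    soft_loss = -jnp.sum(teacher_probs * student_logp)        # cross-entropy vs. soft targets

    # Hard target: the usual loss on the real next word ("Violin").
    hard_loss = -jax.nn.log_softmax(student_logits)[true_token]

    return alpha * soft_loss + (1 - alpha) * hard_loss

vocab = {"Violin": 0, "Piano": 1, "Trumpet": 2, "Airplane": 3}
teacher_logits = jnp.log(jnp.array([0.4, 0.2, 0.01, 1e-8]))   # teacher's soft distribution
student_logits = jnp.zeros(4)                                  # untrained student: uniform
loss = distillation_loss(student_logits, teacher_logits, vocab["Violin"])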
2022: Many Different Parallelism Schemes During Inference

The right choices for how to distribute inference computation are heavily influenced by things like batch size or latency constraints

Efficiently Scaling Transformer Inference, Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James
Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, arxiv.org/abs/2211.05102
2023: Speculative Decoding
Use small “drafter” model to predict next K tokens
● Then predict next K tokens in one shot with large model (more efficient: batch size K not 1)
● Advance generation by as many tokens as match in prefix of size K
● Guaranteed identical output distribution

[Diagram: standard decoding with the larger, slower model vs. speculative decoding with a faster drafter model plus the larger, slower model]
Fast Inference from Transformers via Speculative Decoding, Yaniv Leviathan, Matan Kalman & Yossi Matias,
ICML ‘23, arxiv.org/abs/2211.17192
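A sketch of the accept loop for the greedy case (the paper samples instead and proves the output distribution is unchanged; the two model callables are placeholders with the interfaces described in the comments):

def speculative_step(large_model, drafter, prefix, k=4):
    # drafter(tokens) -> one next token (cheap to call k times).
    # large_model(tokens) -> list "pred" with pred[j] = its next-token choice after
    # tokens[:j+1], for every position, computed in a single forward pass.

    # 1. Small drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = drafter(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. One batched pass of the large model over prefix + draft verifies
    #    all k positions at once (effective batch size k, not 1).
    pred = large_model(prefix + draft)
    p = len(prefix)

    # 3. Advance by the matching prefix of the draft, plus one large-model token.
    out = list(prefix)
    for i in range(k):
        big = pred[p - 1 + i]          # large model's token after the output so far
        out.append(big)
        if big != draft[i]:            # first disagreement: keep the large model's token and stop
            return out
    out.append(pred[p - 1 + k])        # all k accepted: take one bonus token
    return out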
Innovations at Many Levels

Inference algorithms: Chain-of-Thought, Speculative Decoding, Inference-time compute scaling

Training algorithms: Unsupervised and Self-Supervised Learning, Distillation, SFT + RLxF, Asynchronous Training

Model architecture: Word2Vec, Seq2Seq, Transformers, MoEs, Visual Transformers

Software abstractions: DistBelief, Pathways

Hardware: TPUv1 → TPUv2 → TPUv3 → TPUv4 → TPUv5p → Trillium → Ironwood


Gemini:

Putting These Advances Together


Project started in Feb 2023
Many collaborators from Google DeepMind, Google Research, and rest of Google

Goal: Train the world’s best multimodal models and use them all across Google
Gemini 1.0: Dec 2023
Gemini 1.5: Feb 2024 (demonstrated 10M token context window, Flash model)
Gemini 2.0: Dec 2024 (2.0 Flash as good as 1.5 Pro, multimodal live streaming, …)
Gemini 2.0 Thinking: Jan 2025 (2.0 Flash Experimental Thinking)
Gemini 2.5: Mar 2025 (2.5 Pro released), Apr 2025 (“2.5 Flash coming soon”)
https://blog.google/technology/ai/google-gemini-ai https://g.co/gemini
Gemini: A Family of Highly Capable Multimodal Models, by the Gemini Team, arxiv.org/abs/2312.11805
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, by the Gemini Team, arxiv.org/abs/2403.05530
Gemini - Multimodal from the start


Gemini: A Family of Highly Capable Multimodal Models, by the Gemini Team, arxiv.org/abs/2312.11805
Gemini 1.5
Increased context length
Models can now handle up to 10 million
tokens, with external APIs now offering up
to 2 million tokens for text and/or video.

Clearer context
The information within the context window
is clearer, reducing hallucinations &
enabling in-context learning.

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, by the Gemini Team, arxiv.org/abs/2403.05530
Gemini 2.0
(Like 1.0, 1.5, and 2.5) Builds on many of
the innovations I just described:

● TPUs
● Cross-datacenter training
● Pathways
● JAX
● Distributed representations of words
● Transformers
● Sparse Mixture of Experts
● Distillation
● + … many more innovations …

blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024
Gemini 2.5 Pro
Our most capable model (for now!)

blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Gemini 2.5 Pro
Our most capable model (for now!) Leaderboard positions
● #1 LMSYS
● # LiveBench
● #1 Humanity’s Last Exam
● #1 SEAL
● #1 Artificial Analysis
● #1 Aider Polyglot
● #1 MathArena.ai
● #1 Mensa IQ test
● #1 Fiction.LiveBench
● #1 SimpleBench
● #1 Kagi leaderboard
● #2 WebDev Arena
● #4 LiveCodeBench
● #4 NYT Connections
● #2 Creative Writing
● #4 Vectara
● #1 Perfect Information Game

blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Users Generally Enjoying Capabilities of Gemini 2.5 Pro
Long context abilities are very helpful (especially for code)
Pushing the Pareto Frontier of Optimal Quality/Price
Organizing a Large-Scale Scientific Effort
Like Gemini
Many Contributors in Many Different Areas
Gemini Structure & Ways of Working
Overall Leads, Program Management, Product Management

Model Development Areas: Pre-training, Post-training, On-device Models, …

Capabilities: Safety, Vision, Audio, Code, Agents, Internationalization, …

Core Areas: Data, Evals, Infrastructure, Codebase, Serving, Longer-term Research

Gemini Structure & Ways of Working
Many people in many locations:
~⅓ in San Francisco Bay Area
~⅓ in London
~⅓ in many other places:
NYC, Paris, Boston, Zürich, Bangalore, Tel Aviv, Seattle, …

Time zones are annoying!


● “Golden Hours” between California/West Coast and London/Europe
are important
Gemini Structure & Ways of Working
Lots and lots of large and small discussions and information sharing conducted via
Google Chat Spaces (I’m in 200+ such spaces)

RFCs (Request for Comment): semi-formal way of getting feedback, knowing what
others are working on, etc.

Leaderboards and common baselines enable data-driven decision making about how
to improve
● Multiple rounds of experimentation.
● Many experiments at small scale
● Advance smaller number of successful experiments to next scale
● Every so often (every few weeks), incorporate successful experiments
demonstrated at largest experimental scale into new candidate baseline
● Repeat
Training at Scale:
Silent Data Corruption errors (SDCs)

Despite best efforts, given the scale of ML systems and the size of ML training jobs, hardware errors can occur, and sometimes incorrect computations from one buggy chip can spread and infect the entire training system
Silent data corruption:

● Non-deterministically produces incorrect results, silently
● A challenging problem even when running largely independent computations
● Multiplicatively worse at scale with synchronous stochastic gradient descent
● Incorrect results can quickly spread across thousands of components of an ML supercomputer

Cores that Don't Count, Peter H. Hochschild, Paul Jack Turner, Jeffrey C. Mogul, Rama Krishna Govindaraju, Parthasarathy
Ranganathan, David E Culler, Amin Vahdat, HotOS 2021, research.google/pubs/cores-that-dont-count/
Metrics anomaly: anomaly due to SDC

[Plot: gradient norm over time, showing an anomaly due to an SDC]
Metrics anomaly: expected anomaly (no SDC)

[Plot: gradient norm over time, showing an anomaly with no SDC]
SDC with no metrics anomaly

[Plot: gradient norm over time; an SDC is detected with no anomaly. The step replay shows different values, but both values are in the normal range.]
ML Controller transparently handles Silent Data Corruption (SDC)

[Diagram: synchronous training workers each run an SDC checker, with a hot spare available. Normal training proceeds; when the checker identifies an SDC on a defective machine, training automatically moves to the hot spare and the defective machine is sent for repair.]
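As a rough illustration of the step-replay idea (not Google's actual controller): re-run an identical, deterministic training step on the suspect machine and on a hot spare; since both see the same weights and batch, any bit-level mismatch flags an SDC even when both results look "normal". The dispatch callables are placeholders for running the step on different machines:

import jax
import jax.numpy as jnp

def train_step(w, x, y):
    # Deterministic step: same weights + same batch must give identical gradients.
    return jax.grad(lambda w: jnp.mean((x @ w - y) ** 2))(w)

def replay_and_check(step_fn, w, x, y, run_on_suspect, run_on_spare):
    # run_on_suspect / run_on_spare: callables that execute step_fn on a given machine
    # (placeholders here; in a real system these would dispatch to different hosts).
    g_suspect = run_on_suspect(step_fn, w, x, y)
    g_spare = run_on_spare(step_fn, w, x, y)
    # Both gradient norms may be in the "normal" range yet differ bit-for-bit:
    # that mismatch is the SDC signal.
    return bool(jnp.all(g_suspect == g_spare))

# In this sketch both "machines" are the local process, so the check passes.
same = replay_and_check(train_step, jnp.zeros((4, 1)), jnp.ones((8, 4)), jnp.zeros((8, 1)),
                        run_on_suspect=lambda f, *a: f(*a),
                        run_on_spare=lambda f, *a: f(*a))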
What Can These Models Do?
Example
In-context learning: Kalamang translation

Kalamang is only spoken by ~130 people in eastern Indonesian Papua

With in-context info (the first part of chapter 1 of the language materials provided in the context), the model can translate as effectively as a human learner who has spent months on the same language materials
Example

Video of bookshelf
-> JSON
“The killer app of
Gemini 1.5 Pro is video.”

Simon Willison


Example

Video understanding
& summarization
In a table, please write
the sport, the
teams/athletes involved,
the year and a short
description of why each
of these moments in
sports are so iconic.
Example
Digitization of historical data

https://climatelabbook.substack.com/p/data-rescue-with-ai
Gemini 2.5 Pro example:
Code Generation via High Level Language
Inference-time compute gives us another dimension of compute for quality scaling

deepmind.google/technologies/gemini/flash-thinking/
Now That We Have These Powerful
Models, What Will This Mean?
Shaping AI's Impact on Billions of Lives

● Form team of senior computer scientists + rising stars in AI
○ From academia, big tech, and startups
● Propose what impact could be achieved given directed research & policy efforts on AI for public good
○ Rather than predict societal impact of AI given a laissez-faire approach
● Aim to shape AI’s upsides and dampen AI’s downsides
○ For high-, middle-, and low-income nations
● Audience: AI practitioners + policymakers + public
● Approach: Interview 24 experts in 7 fields
○ Employment, Education, Healthcare, Information, Media, Governance, and Science
○ e.g. Barack Obama, Sal Khan, John Jumper, Neal Stephenson, Dario Amodei, Bob Wachter, …
● Uncovered 5 guidelines for AI for public good

Team: Mariano-Florentino Cuéllar, Jeff Dean, John Hennessy, Finale Doshi-Velez, Andy Konwinski, Sanmi Koyejo, Pelonomi Moiloa, Emma Pierson, David Patterson
Shaping AI's Impact on Billions of Lives

“Shaping AI's Impact on Billions of Lives,” by Mariano-Florentino (Tino) Cuéllar, Jeff Dean, Finale Doshi-Velez, John Hennessy, Andy Konwinski, Sanmi Koyejo, Pelonomi Moiloa, Emma Pierson, and David Patterson, December 2024.
See ShapingAI.com and arxiv.org/abs/2412.02730
Humans and AI systems working as a team can
do more than either on their own

● AI systems focused on human productivity produce more positive benefits than those focused on human labor replacement
○ Increases human employability
○ Bonus: People can also be safeguards if AI veers off course in areas where it is not well trained
○ Bonus: People and AIs tend to make different mistakes, so collaboration of experts with AI can
also improve results
● Productivity focus helps both AI and people
succeed

Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730


To increase employment, aim for productivity
improvements in fields that create more jobs

● Despite tremendous productivity gains in computing and passenger jets, the US in 2020 had 8 times more commercial airline pilots and 11 times more programmers than in 1970

● Demand for passenger travel and programming was elastic ⇒ more jobs
○ Goods with elastic demand are those where a decrease in price results in a large increase in the quantity acquired

● US agriculture demand is inelastic, so productivity gains ⇒ fewer jobs
○ From 20% of US workforce to 2% in one lifetime (1940 to 2020)

Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730


What could the impact of near-term AI be in the next 5 years by following the guidelines?

● To give concrete targets for improving AI’s impact, propose milestones (“kilometerstones”) per field

● Rather than recognizing past achievements, offer significant inducement prizes that try to stimulate progress on these milestones
○ E.g., XPRIZE, Netflix, Kaggle, …

Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730


Education AI Milestone: Worldwide Tutor

● A tutoring tool to accelerate general education for every child
○ In their language
○ In their culture
○ In their best learning style
● To help teachers with the challenge of supporting a range of student capability
○ Keeping high-achieving students engaged while
supporting those who struggle
● E.g., Rising Academies* in Africa
○ Improves student outcomes by one grade level relative
to students without it

* Henkel, Owen, Hannah Horne-Robinson, Nessie Kozhakhmetova, and Amanda Lee. “Effective and Scalable Math Support: Experimental Evidence on the
Impact of an AI-Math Tutor in Ghana.” In International Conference on Artificial Intelligence in Education, pp. 373-381. Cham: Springer Nature Switzerland, 2024.
Healthcare AI Milestone: Broad Medical AI

● Learns from many data modalities


○ Images, laboratory results, health records, genomics,
medical research, …
● Can help carry out diverse set of tasks
○ Bedside decision support
○ Interacting with patients after leaving hospital
○ Drafting radiology reports that describe both abnormalities
and relevant normal findings
■ While taking into account the patient’s history
● Can explain recommendations using written or
spoken text and images
● Milestone requires defining metrics and benchmarks
to measure progress

Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730


Information AI Milestone:
Civic Discourse Platform

● Mediates conversations or attitudes to enhance public understanding and civic discourse
○ Move communities from polarization to pluralism
● AI system makes suggestions on how to rephrase
comments and questions more diplomatically*
● AI system to hold discussions with conspiracy
theorists**
● AI systems could help bring consensus on difficult
issues across whole populations***
* Argyle, Lisa, et al. “Leveraging AI for democratic discourse: Chat interventions can improve online political conversations at scale.” Proc. National Academy of
Sciences, vol. 120, no. 41, 2023.
** Costello, Thomas, Gordon Pennycook, and David Rand. “Durably reducing conspiracy beliefs through dialogues with AI.” Science, vol. 385, no. 6714, 2024, p. eadq1814.
*** Tsai, Lily and Alex Pentland. “Rediscovering the Pleasures of Pluralism: The Potential of Digitally Mediated Civic Participation,” The Digitalist Papers, 2024.
Science

● Advances in science via AI could be one of the largest impacts for public good
● Many examples:
○ AlphaFold for protein folding
○ Black hole visualization
○ Flood forecasting
○ Materials discovery
○ Neural net-based weather prediction
○ Airplane contrail reduction to reduce CO2e
○ Controlling plasma for nuclear fusion
○ …
● Most fields of science excited about AI

Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730


Science AI Milestone:
Scientist’s AI Aide/Collaborator

● Accelerate the pace of science by improving the productivity of scientists
○ Help suggest interesting hypotheses and automate experiments
○ Identify important new relevant research, ideally customized to the individual to summarize what is new compared to what the scientist already knew

Early example: Google’s Co-Scientist work*

● Multi-agent scientific discovery system, showing inference-time compute scaling leads to better-rated hypotheses
* research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/
Shaping AI's Impact on Billions of Lives, see ShapingAI.com and arxiv.org/abs/2412.02730
Conclusions
● AI models and products are becoming incredibly
powerful and useful tools
○ Further research and innovation will continue this trend

● Will have dramatic impact in many diverse areas:
○ Healthcare, education, scientific research, media creation, misinformation, …

● Potentially makes deep expertise more available to many more people

● Done well, our AI-assisted future is bright!
