
Roll the dice & look before you leap:

Going beyond the creative limits of next-token prediction

Vaishnavh Nagarajan * 1 Chen Henry Wu * 2 Charles Ding 2 Aditi Raghunathan 2

arXiv:2504.15266v1 [cs.LG] 21 Apr 2025

*Equal contribution. 1 Google Research, US. 2 Carnegie Mellon University, Pittsburgh, US. Correspondence to: Vaishnavh Nagarajan <[email protected]>, Chen Henry Wu <[email protected]>. Preprint.

Abstract

We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; comparatively, multi-token approaches, namely teacherless training and diffusion models, excel in producing diverse and original output. Secondly, in our tasks, we find that to elicit randomness from the Transformer without hurting coherence, it is better to inject noise right at the input layer (via a method we dub hash-conditioning) rather than defer to temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and softmax-based sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity

1. INTRODUCTION

Not all forms of intelligence are solely about being correct or wrong. In open-ended tasks, what also matters is finding creative ways to satisfy a request, making surprising and fresh connections never seen before. For instance, consider responding to highly under-specified prompts like "Generate a challenging high-school word problem involving the Pythagoras Theorem." or "Suggest some candidate therapeutic antibodies targeting the HER2 antigen." or "Provide a vivid analogy to differentiate quantum and classical mechanics." Creativity in such tasks requires generating responses that are not just correct or coherent, but are also diverse across responses and original compared to the training data. These currently-sidelined desiderata will rise to prominence as we explore LLMs for open-ended scientific discovery (Gruver et al., 2023; Romera-Paredes et al., 2024; Si et al., 2024; Lu et al., 2024a), for generating novel training data (Yu et al., 2024; Yang et al., 2024c; Wang et al., 2023), and as we scale up test-time compute approaches that benefit from diversity in exploration, such as best-of-N (Cobbe et al., 2021; Chow et al., 2024; Dang et al., 2025) and long chain-of-thought reasoning (OpenAI, 2024; DeepSeek-AI, 2025; Snell et al., 2024; Wu et al., 2024).

Unlike simple open-ended tasks like generating names and basic sentences (Zhang et al., 2024b; Hopkins et al., 2023), many creative tasks (like designing a clever Olympiad problem) are said to involve a random flash of creative insight, termed variously as a leap of thought (Wang et al., 2024a; Talmor et al., 2020; Zhong et al., 2024), a "eureka" moment (Bubeck et al., 2023), a mental leap (Holyoak & Thagard, 1995; Callaway, 2013; Hofstadter, 1995) or an incubation step (Varshney et al., 2019). The thesis of this paper is that learning to solve such creative leap-of-thought tasks (defined shortly) is misaligned with the current language modeling paradigm (a) in terms of next-token learning, and (b) in how randomness is elicited. We articulate these two concerns by designing a suite of algorithmic tasks inspired by such creative tasks. We then demonstrate how the creativity of language models suffers in these tasks, and how this can be alleviated (to an extent, within our tasks).

Concretely, for the scope of this paper, a creative leap-of-thought task refers to tasks that involve a search-and-plan process; crucially, this process orchestrates multiple random decisions in advance before generating the output. Typically, such a leap of thought is highly implicit in the text — to infer it, one has to deeply engage with the text and detect higher-order patterns in it. We think of tasks like designing a satisfying math problem, generating worthwhile research ideas, or drawing surprising analogies as examples of such tasks.

Ideally, one would directly study these real-world tasks to quantify the limits of language models. Indeed, a flurry of recent works report that LLM-generated research ideas tend to be rephrased from existing ideas (Gupta & Pruthi, 2025; Beel et al., 2025) and that LLM outputs tend to be less creative than humans, e.g., Chakrabarty et al. (2024); Lu et al. (2024b) (see §J). While assessing real-world tasks is a lofty goal, the evaluations are subjective (Wang et al., 2024b; Runco & Jaeger, 2012), and when the model has been exposed to all of the internet, originality is hard to ascertain. Thus, the conclusions will inevitably invite debate (such as Si et al. (2024) vs. Gupta & Pruthi (2025) or Lu et al. (2024a) vs. Beel et al. (2025)).

In search of more definitive conclusions, we approach from a different angle: we study minimal and controllable tasks that are loose abstractions of real-world tasks and yet allow one to rigorously quantify originality and diversity. This follows along the lines of recent works that have studied the diversity of models in graph path-finding (Khona et al., 2024) and generating challenging CFGs (Allen-Zhu & Li, 2023b). Broadly, we refer to such tasks as open-ended algorithmic tasks. Our aim is to design tasks more minimal than these prior tasks, and crucially, to tease apart distinct computational skills required for creativity. This will allow us to systematically investigate issues in the current paradigm of model training and propose alternatives.

As our first main contribution, we draw inspiration from the cognitive science literature (Boden, 2003) (see also Franceschelli & Musolesi (2023)) to design algorithmic tasks isolating two distinct types of creative leaps of thought. The first class of tasks involves combinational creativity: drawing novel connections in knowledge, like in research, wordplay or drawing analogies (see Fig 1 for task description). The second class of tasks involves exploratory creativity: constructing fresh patterns subject to certain rules, like in designing problems and suspense (see Fig 2). In these tasks, we can precisely evaluate models for the fraction of generations that are coherent, unique and original (not present in the training set). We term this metric "algorithmic creativity" to denote that it solely evaluates the computational aspects of creativity.

Figure 1. Minimal tasks inspired by combinational creativity: Skills like research, humor and analogies often require identifying novel multi-hop connections from known pair-wise relationships in a knowledge graph. For instance, creating the wordplay "What kind of shoes do spies wear? Sneakers." requires searching over a semantic graph, and carefully planning a pair of words (shoes, spies) that lead to a mutual neighbor (sneakers). Inspired by this, we define tasks where a symbolic graph is stored in the model weights; the model is exposed to example node sequences that form a specific multi-hop structure (like a sibling or a triangle) during training. The model must infer this structure from training; during inference, the model must implicitly recall-search-and-plan to generate novel and diverse node sequences obeying the same structure in the in-weights graph. Pictured are two example tasks, (a) Sibling Discovery and (b) Triangle Discovery, with a symbolic in-weights graph each, and a corresponding example sequence obeying a sibling ("g, f, Y") or a triangle structure ("a, b, c"). More details in §2.3 and Fig 9.

Figure 2. Minimal tasks inspired by exploratory creativity: Skills like designing problem sets, novel proteins and plots require devising patterns that can be resolved in novel ways through some general rules. Inspired by this, we design a task where during training, we expose the model to "adjacency lists" (e.g., "a→b, c→d, d→e, b→c, e→a") that implicitly resolve into a specific structure (a circle or a line graph) under some permutation. The model must infer this higher-order structure; during inference, the model must generate adjacency lists resolving to the same structure, but under novel and diverse permutations. Pictured are example sequences for (a) Circle Construction and (b) Line Construction, and the corresponding implicit structure they would resolve to. See §2.4 and Fig 10.

Within this framework, we articulate two creative limits of the current language modeling paradigm. First, we empirically find that next-token learning achieves lower algorithmic creativity (and higher memorization) compared to multi-token approaches, namely, teacherless training (Bachmann & Nagarajan, 2024; Monea et al., 2023; Tschannen et al., 2023) and diffusion models (Hoogeboom et al., 2021; Austin et al., 2021; Lou et al., 2023) (see Fig 3 and Fig 4). Our argument is that in all our tasks, inferring the latent leap of thought requires observing global higher-order patterns rather than local next-token patterns in the sequence.

Next, we turn to the de facto approach for randomization in a Transformer: temperature sampling from the output softmax layer. We contrast this against an input-layer randomization approach we call hash-conditioning, where we
Figure 3. Multi-token teacherless finetuning improves algorithmic creativity (top; Eq 1) and reduces memorization (bottom; fraction of generations seen during training) on our four open-ended algorithmic tasks (Sibling Discovery, Triangle Discovery, Circle Construction, Line Construction) for a Gemma v1 (2B) model.

train models with random hash prefixes. We find that, in our tasks, not only does hash-conditioning induce non-trivial algorithmic creativity (even with deterministic, greedy decoding!), hash-conditioning is also competitive with or better than the conventional output-randomization (i.e., temperature sampling). Intuitively, maximizing diversity at the output-token level is computationally burdensome: it requires simultaneously processing a diverse set of leaps of thought to compute a marginalized token distribution. It is easier to first sample a single latent leap of thought, and then compute the token conditioned on that one leap. We conjecture that hash-conditioning enables this conditioned token generation.

Overall, we hope our study advances the field in two directions. First, we provide a new angle to advocate for multi-token approaches, orthogonal to the "path-star" example in Bachmann & Nagarajan (2024) (or B&N'24 in short). Whereas the path-star example portrays a gap in correctness of reasoning, ours shows a gap in diversity of open-ended thinking. We note though that B&N'24 is an impossibility result where next-token learning breaks down spectacularly (unless there is exponential data or compute), while ours is a data-inefficiency result (where next-token learning occurs but is mediocre). Next, the gap we show appears even in 2-token-lookahead tasks as against the many-token-lookahead path-star task. Third, and perhaps most conceptually important is the fact that, while the path-star task is amenable to next-token prediction upon reversing the tokens, we identify tasks where no re-ordering is friendly towards next-token prediction — the optimal thing to do is to globally learn higher-order patterns implicit in the whole future sequence. This presents a challenge to recent proposals that aim to fix next-token prediction via permutations (Pannatier et al., 2024; Thankaraj et al., 2025) or partial lookaheads (Bavarian et al., 2022; Fried et al., 2022; Kitouni et al., 2024; Nolte et al., 2024).

As a second direction of progress, we hope our work provides a foundation to think about open-ended tasks which are extremely hard to quantify in the wild. This may spur more algorithmic explorations on improving diversity (such as our approach of hash-conditioning) and on curbing verbatim memorization in language models.

Our contributions:

1. We create minimal, controlled and easy-to-quantify open-ended algorithmic tasks. These tasks isolate, and loosely capture, two fundamental modes of creativity.
2. We find that multi-token prediction, through teacherless training or diffusion, results in significantly increased algorithmic creativity and reduced memorization in our tasks compared to next-token prediction.
3. Our argument provides new support for multi-token prediction, going beyond B&N'24. We show a gap in creativity in an open-ended task (rather than correctness in a deterministic one), in much simpler 2-token-lookahead tasks, and in tasks where no token permutation is friendly to next-token learning.
4. We find that hash-conditioning, i.e., training with random hash prefixes, greatly improves diversity and algorithmic creativity in our tasks, compared to the standard paradigm of temperature sampling.

2. OPEN-ENDED ALGORITHMIC TASKS & TWO TYPES OF CREATIVITY

We are interested in designing simple algorithmic tasks that are loosely inspired by endeavors such as generating scientific ideas, wordplay, narration, or problem-set design, where one needs to generate strings that are both "interesting" and never seen before. In all these tasks, before generating the output, one requires a (creative) leap of thought, a process that (a) is implicit, i.e., is not spelled out in token space (or is even inherently hard to spell out), (b) involves discrete random choices, and (c) together, those choices must be coherent in that they are carefully planned to satisfy various non-trivial, discrete constraints. These constraints fundamentally define the task and make it interesting, e.g., a word problem should be solvable by arithmetic rules, or a pun must deliver a surprising punchline. The goal in such open-ended tasks is not just coherence though, but also diversity and novelty — generations must be as varied as possible and must not be regurgitated training data. Before we design tasks that capture the aforementioned leap of thought, we first clarify what tasks do not require such a step.

Open-ended tasks that do not require a leap of thought. One simple open-ended task that may come to mind is generating uniformly-random known entities, like celebrity names (Zhang et al., 2024b). However, there is no opportunity to create a novel string here. A more interesting example may be generating grammatically coherent PCFG strings following a subject-verb-object format, e.g., the cat chased a rat (Hopkins et al., 2023). While novel strings become possible here, no sophisticated leaps of thought are involved; each token can be generated on the fly, satisfying a local next-token constraint to be coherent.

In light of this, we can rephrase our goal as designing open-ended, creative tasks where coherence requires satisfying more interesting, "global" constraints. To build this systematically, we draw inspiration from literature in cognitive science (Boden, 2003). Boden (2003) argues that fundamentally, there are three forms of creativity, in that order: combinational, exploratory and transformative. We elaborate on the first two (the last, we do not look at).

2.1. The fundamental types of creativity (Boden, 2003)

Combinational creativity.[1] Consider rudimentary wordplay of the form "What musical genre do balloons enjoy? Pop music." or "What kind of shoes do spies wear? Sneakers." There is a global structure here: two unrelated entities (genre & balloons) are related eventually through a punchline (pop); the punchline itself is a mutual neighbor on a semantic graph. More broadly, Boden (2003) argues that many tasks, like the above, involve "making unfamiliar combinations of familiar ideas" or the "unexpected juxtaposition of [known] ideas". Other tasks include drawing analogies, or finding connections between disparate ideas in science.[2] All these tasks involve a leap of thought that in effect searches and plans over a space of known facts and combines them.

Exploratory creativity. Consider the act of developing a mystery or designing logical puzzles. These endeavors are not as knowledge-heavy. What they crucially require is constructing fresh patterns that satisfy some highly non-trivial global constraint, e.g., being resolvable as per some rules (e.g., logic). Such endeavors fall into a second class of exploratory creativity in Boden (2003). This includes much grander forms of exploration, e.g., exploring various forms of outputs within a stylistic constraint, or exploring various corollaries within a theoretical paradigm in physics or chemistry. The leap of thought here requires searching over all possible sequences, constrained by a set of rules.

In the upcoming sections, we will attempt to capture some (not all) core computational aspects of rudimentary instances within the two classes of creative skills above. We emphasize that by no means does our minimal algorithmic setup intend to capture the human values that go into these endeavors; nor does it capture the rich array of creative acts that Boden (2003) discusses within these categories. (See limitations in §6.)

[1] Some call it combinatorial creativity. We use the term from Boden (2003), combinational.
[2] Even this very paper's idea draws a connection between the existing ideas of multi-token prediction, limits of next-token prediction and creative planning tasks.

2.2. The basic setting and notations

In all our tasks, we assume the standard generative model setting: the model must learn an underlying distribution D through a training set S of m independent samples s_i ∼ D. The distribution is over a space V^L of L-length strings. The tasks are open-ended in that there is no one correct answer at test-time. The goal is to produce a random string from D, much like responding to the query Design a high-school word problem.

Coherence: Each task is defined by a boolean coherence function coh : V^L → {true, false} which is true only on the support, i.e., supp(D) = {s ∈ V^L | coh(s)}. The exact form of coh will be defined in each algorithmic task but broadly, we are interested in scenarios where determining coherence requires a global understanding of the whole string. This is inspired by the fact that a wordplay must have a preplanned punchline connecting what comes before, or a word problem must be solvable. We can think of D as a simple uniform distribution over all coherent strings.

Algorithmic creativity: Upon witnessing a finite set of examples, the model must learn to generate only strings that are (a) coherent, (b) original (not memorized) and (c) diverse (covering the whole support). An exact quantification of this is computationally expensive in our tasks. Instead, we approximate it by sampling a set T of many independent generations from the model and computing the fraction of T that is original, coherent and unique.

Let the boolean mem_S(s) denote whether an example s is from the training set S and let the integer function uniq(X) denote the number of unique examples in a set X. (The exact definitions of these quantities vary by task, as we will see.) Then, we define our (empirical) algorithmic creativity metric:

    cr̂_N(T) = uniq({s ∈ T | ¬mem_S(s) ∧ coh(s)}) / |T|.    (1)

Our setup models an in-distribution form of novelty as it offers a rigorous and tractable way to study the problem. Admittedly though, this is a far simpler form of novelty than what is expected in real-world tasks. Nevertheless, even this simple setting will help us appreciate the limits of current language models.
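To make the metric concrete, here is a minimal sketch (our own illustration, not the authors' released code) of how Eq 1 can be computed for a batch of sampled generations, assuming task-specific helper functions `is_coherent` and `canonicalize` are supplied:

```python
# Minimal sketch of the algorithmic-creativity metric in Eq 1 (not the authors'
# released code). `is_coherent` and `canonicalize` are assumed, task-specific
# helpers: `canonicalize` maps a generation to the canonical form under which
# uniqueness and memorization are judged (e.g., sorting the vertices of a
# triangle), and `is_coherent` implements coh(s).

def algorithmic_creativity(generations, train_set, is_coherent, canonicalize):
    """Fraction of generations that are coherent, unseen in training, and unique."""
    train_canon = {canonicalize(s) for s in train_set}
    kept = set()
    for s in generations:
        c = canonicalize(s)
        if is_coherent(s) and c not in train_canon:
            kept.add(c)                      # uniq(...) counts distinct survivors
    return len(kept) / len(generations)      # divide by |T|
```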

2.3. Tasks inspired by combinational creativity

Combinational creativity requires a recall-and-search through entities from memory, subject to the constraint that they relate to each other in an interesting way. We abstract this through tasks that discover structures from an in-weights graph, i.e., a graph stored in the model weights, not revealed in context.

2.3.1. SIBLING DISCOVERY

This task involves an implicit, bipartite G made of parent vertices V = {A, B, C, . . .}, each neighboring a corresponding set of children nbr(A) = {a_1, a_2, . . .}, nbr(B) = {b_1, b_2, . . .} and so on. We define coh(s) to hold true on sibling-parent triplets of the form s = (γ, γ′, Γ) such that γ, γ′ ∈ nbr(Γ). We then consider a uniform distribution D over all coherent strings (γ, γ′, Γ) for a fixed graph G. The model witnesses i.i.d. samples from this distribution. During test-time, the model must maximize algorithmic creativity (Eq 1) by generating novel parent-sibling triplets based on its in-weights knowledge of G. Note that the model is not provided the graph in-context as this would sidestep a core computational step in combinational creativity: recalling facts from a large memory (see §B.2). The hope is that the model infers and stores the pairwise adjacencies of G in its weights (given sufficient data). Full dataset description is in §C and Fig 9.
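As an illustration, the following sketch (our own, hypothetical construction; the paper's exact graph parameters and string formatting are given in its §C) builds a toy Sibling Discovery training set:

```python
import random

# Hypothetical toy construction of a Sibling Discovery dataset (the paper's
# exact graph sizes and sequence formatting live in its appendix §C).
def make_sibling_dataset(num_parents=100, children_per_parent=10,
                         num_samples=5000, seed=0):
    rng = random.Random(seed)
    # Bipartite in-weights graph: parent -> list of child tokens.
    nbr = {f"P{p}": [f"c{p}_{i}" for i in range(children_per_parent)]
           for p in range(num_parents)}
    samples = []
    for _ in range(num_samples):
        parent = rng.choice(list(nbr))
        g1, g2 = rng.sample(nbr[parent], 2)   # two distinct siblings
        # Sibling-first ordering (γ, γ′, Γ): the parent/punchline comes last,
        # so generating it coherently requires planning it in advance.
        samples.append((g1, g2, parent))
    return nbr, samples
```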
We view this task as an abstraction of the wordplay example. One can think of the parent Γ as the "punchline" that delivers a connection between otherwise non-adjacent vertices, in the same way sneakers surprisingly connects the otherwise non-adjacent words, spies and shoes.

A note on the leap of thought. To concretely illustrate what we mean by a leap of thought, we note that the above task can be designed with or without a leap. Observe that the most natural order of generation is to generate the parent vertex (i.e., punchline) first, and pick the siblings after (conditioned on the parent). Thus, if the task demanded the ordering (Γ, γ, γ′), it would involve no leap of thought: each next token can be learned and generated through simple rules conditioned on the past, without planning.

However, the wordplay example involves a non-sequential leap of thought in that even though the punchline (the parent) appears last, it must be planned ahead of time. Paralleling this leap-of-thought structure, we define our sibling discovery task to generate the triplets as s = (γ, γ′, Γ), where the siblings appear first. We hypothesize that this (sibling-first) construction is adversarial towards next-token learning, while a reversed (parent-first) dataset is friendlier towards next-token learning. More on this in §2.6.

2.3.2. TRIANGLE DISCOVERY

Next, we design a task that requires a more complex, higher-order planning step: generating triangles from an appropriately-constructed knowledge graph G = (V, E) (which contains many triangles; see §C). Thus, in this task coh((v_1, v_2, v_3)) = true iff all three edges between {v_1, v_2, v_3} belong in G. Furthermore, we define uniq(·) and mem(·) such that various permutations of the same triangle are counted as one (see details in §C, including the exact formatting of the string). Note that the leap of thought in this task is much harder to learn and execute as it requires co-ordinating three edges in parallel, from memory.

This type of higher-order planning task can be thought of as an abstraction of more complex wordplay (like antanaclasis, where a word must repeat in two different senses in a sentence, while still being coherently related to the rest of the sentence), or creating word games (like crosswords), or discovering contradictions or feedback loops in a body of knowledge, an essential research skill — see §B.3.

2.4. Tasks inspired by exploratory creativity

Recall that we are also interested in creativity that involves constructing new structures. For instance, this may be designing word problems that correspond to novel solutions. Below, we capture this through tasks that construct adjacency lists of structured graphs. Note that no knowledge graph is involved in these tasks.

2.4.1. CIRCLE CONSTRUCTION

In this task, the generated strings must be randomized adjacency lists that can be rearranged to recover circle graphs of N vertices. Let the generated list be s = (v_{i1}, v_{i2}), (v_{i3}, v_{i4}), . . .. We define coh(s) = true iff there exists a resolving permutation π such that π(s) = (v_{j1}, v_{j2}), (v_{j2}, v_{j3}), . . . , (v_{jn}, v_{j1}) for distinct j_1, j_2, . . . , j_n, i.e., each edge leads to the next, and eventually circles back to the first vertex. We define uniq and mem such that different examples with the same resolving π are counted as the same, even if they have differing vertices. As always, the learner is then exposed to a finite set of uniformly sampled coherent strings. Note that the latent leap of thought here requires constructing a novel permutation π before generating the sequence.

Loosely, we can think of the resolving permutation π as how a conflict in a story or a word problem or a puzzle is solved; the vertices as characters or mathematical objects; and the rules of rearranging an adjacency list as rules of logic, math or story-building. The creative goal in this task is to create novel dynamics in the conflict, or equivalently, novel dynamics in how the conflict is resolved. Thus, if only the entities differ, but the plot dynamics remain unaltered, we count them as duplicates. See details in §C.
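For concreteness, here is a minimal sketch (our own hypothetical helper, not the paper's evaluation code) of the coherence check for Circle Construction, i.e., whether a generated edge list can be reordered into a single cycle over distinct vertices:

```python
# Hypothetical coherence check for Circle Construction (not the paper's code):
# a list of directed edges is coherent iff some reordering chains them into a
# single cycle v1 -> v2 -> ... -> vn -> v1 over distinct vertices.
def is_coherent_circle(edges):
    if not edges:
        return False
    succ = {}
    for a, b in edges:
        if a in succ:          # a vertex may lead to only one next vertex
            return False
        succ[a] = b
    if set(succ) != set(succ.values()):
        return False           # every vertex must appear once as head, once as tail
    # Follow successors from an arbitrary start; we must traverse every edge
    # exactly once before returning to the start (one cycle, not several).
    start = next(iter(succ))
    cur, seen = start, 0
    while True:
        cur = succ[cur]
        seen += 1
        if cur == start:
            return seen == len(edges)
        if seen > len(edges):
            return False
```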

2.4.2. LINE CONSTRUCTION

A simple variant of the above task is one where the edge set corresponds to a line graph. The resolving permutation π is such that π(s) = (v_{j1}, v_{j2}), (v_{j2}, v_{j3}), . . . , (v_{jn−1}, v_{jn}) for distinct j_1, j_2, . . . , j_n, i.e., each edge leads to the next until a dead-end.

2.5. Permutation-invariance of our tasks

We emphasize a key novel aspect in our last three tasks. Many algorithmic tasks in the literature, like addition (Lee et al., 2024), the path-star task (B&N'24) or Sibling Discovery, have a natural ordering in which the tokens can be learned and generated, even if it may not be left to right. However, the Triangle Discovery, Line Construction and Circle Construction tasks are permutation-invariant — no token is more privileged than the other, and hence all tokens must be "simultaneously learned" to infer the underlying process. Intuitively, we view this as an abstraction of real-world tasks where the creative process is highly implicit, and not immediate from the text. These tasks offer a test-bed even for non-next-token approaches that rely on re-permuting the tokens (Pannatier et al., 2024; Thankaraj et al., 2025) or predicting only a part of the future (Kitouni et al., 2024; Nolte et al., 2024; Bavarian et al., 2022; Fried et al., 2022).

2.6. How next-token learning may suffer in our tasks

Much like in sophisticated creative tasks, in our tasks, the most natural way to generate the string is by planning various random latent choices (say z) in advance and by producing a plan-conditioned distribution p(s|z) over coherent strings s. However, next-token prediction (NTP) — or next-token learning, to be precise — we argue, is myopic and fails to learn such a latent plan. Our argument extends that of B&N'24 to our even simpler tasks.

Consider learning Sibling Discovery, where we must generate sibling-parent triplets (γ, γ′, Γ). Even if the parent must be emitted last, the most natural generative rule is to plan the parent first and decide the children last. We can think of this as learning a latent plan z := Γ. Then, learning the plan-conditioned generation p(γ, γ′, Γ|z) factorizes into learning the distribution of children conditioned on a parent, as p(γ|z := Γ) and p(γ′|z := Γ) (due to conditional independence), and the trivial p(Γ|z := Γ). This requires only as many parent-sibling edges as there are in the graph, i.e., O(m · n) many points, if there are m parents, each with n children. This is optimal.

Things proceed differently with NTP. We argue that an NTP-learner would fail to learn the plan z := Γ. The key intuition is that an NTP-learner learns the parent Γ witnessing the siblings (γ, γ′) as input. This is trivial to fit: the parent is simply the mutual neighbor of the two siblings revealed in the prefix! B&N'24 term such shortcuts as Clever Hans cheats since the model witnesses and exploits part of the ground-truth it must generate (the siblings). Such cheats are simpler than even the true generative rule and are thus quickly picked up during learning. The model then loses any supervision to learn the latent plan, z := Γ.

After the Clever Hans cheat is learned, the NTP-learner learns the second sibling not through the plan-conditioned distribution p(γ′|z := Γ) but through the next-token conditional, p(γ′|γ). This is a complex distribution: learning it would require witnessing every sibling-sibling pair, totalling O(m · n²) many training data — larger by a factor of n than the data requirements of the more natural rule.

More abstractly, in our tasks, it is most efficient to learn a well-planned random latent p(z) and a subsequent latent-conditioned distribution p(s|z). However, NTP factorizes this into pieces of the form p(s_i|s_{<i}, z). Consequently, the model learns uninformative latents from the later tokens, lured by Clever Hans cheats. Conversely, the earlier tokens are learned through complex rules bereft of a latent plan. While this may not lead to complete breakdown of learning as in B&N'24, it must lead to data-hungry learning.
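Restating the counting argument above in one place (our own summary for Sibling Discovery with m parents of n children each, not an equation numbered in the paper): the plan-first rule and the next-token rule factorize the same triplet distribution differently,

```latex
\begin{align*}
\text{plan-first:} \quad
  p(\gamma,\gamma',\Gamma)
    &= p(z{:=}\Gamma)\; p(\gamma \mid z)\; p(\gamma' \mid z)
    && \Rightarrow\ O(m \cdot n) \text{ parent--child pairs suffice,} \\
\text{next-token:} \quad
  p(\gamma,\gamma',\Gamma)
    &= p(\gamma)\; p(\gamma' \mid \gamma)\; p(\Gamma \mid \gamma,\gamma')
    && \Rightarrow\ p(\gamma' \mid \gamma) \text{ needs } O(m \cdot n^2) \text{ sibling pairs,}
\end{align*}
```

where p(Γ | γ, γ′) is exactly the Clever Hans cheat: the parent can be read off as the mutual neighbor of the two revealed siblings.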
3. TRAINING AND INFERENCE

Transformers. For our next-token-trained (NTP) models, we use the standard teacher-forcing objective used in supervised finetuning. Given prompt p and ground-truth sequence s, the model is trained to predict the i'th token s_i, given as input the prompt and all ground-truth tokens up until that point, (p, s_{<i}). We write the objective more explicitly in §A, Eq 2.

For the multi-token Transformer models, we use teacherless training (Monea et al., 2023; Bachmann & Nagarajan, 2024; Tschannen et al., 2023), where the model is trained to predict s_i simultaneously for all i, only given the prompt p (and some dummy tokens in place of the s that was once given as input). Since the exact details of this are irrelevant to our discussion, we describe this in Eq 2. To train our models, we use a hybrid of this objective and the next-token objective.
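A minimal sketch of the two loss computations, assuming a standard Hugging-Face-style causal LM interface (our own illustration; the paper's exact objective and hybrid weighting are given in its appendix):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of next-token (teacher-forced) vs. teacherless training
# for a causal LM; the paper's exact objective and hybrid weighting live in its
# appendix. `model(input_ids)` is assumed to return logits of shape [B, T, V].

def next_token_loss(model, prompt, target):
    # Teacher forcing: the ground-truth target tokens are fed back as input.
    inp = torch.cat([prompt, target[:, :-1]], dim=1)
    logits = model(inp).logits[:, prompt.size(1) - 1:, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))

def teacherless_loss(model, prompt, target, dummy_id):
    # Teacherless: target positions are replaced with a dummy token, so every
    # target token must be predicted from the prompt alone, in parallel.
    dummy = torch.full_like(target[:, :-1], dummy_id)
    inp = torch.cat([prompt, dummy], dim=1)
    logits = model(inp).logits[:, prompt.size(1) - 1:, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
```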
Diffusion models. Rather than sequentially predicting each token conditioned on previously generated tokens, discrete diffusion models (Hoogeboom et al., 2021; Austin et al., 2021) iteratively add noise to all tokens and then learn to denoise them in reverse. This strategy allows the model to capture global dependencies among tokens during training, making it an example of a multi-token objective. In our experiments, we used the score entropy discrete diffusion model (SEDD, Lou et al., 2023), which starts generation from a sequence of fully masked tokens. Over multiple steps, the model simultaneously predicts and unmasks these tokens, progressively refining the entire sequence.

Inference. In all the above techniques, we extract each sample independently from the model (as against, say, extracting them in continuous succession in the same context). For Transformers, during inference, we perform standard autoregression in both the next- and multi-token trained settings. We do this either with greedy decoding or with nucleus sampling (Finlayson et al., 2024).

3.1. Hash-conditioning for Transformers

Unlike closed-ended tasks where a prompt (i.e., a prefix) maps to a unique correct answer, our open-ended tasks are prompt-free. For a prompt-free autoregressive Transformer to provide diverse outputs, we must use temperature sampling rather than greedy decoding. However, as we show later, training a prompt-free Transformer model even on our simple tasks leads to poor creativity in some of our settings (while this was no problem for our diffusion models!). As a natural alternative, we tried prepending a prompt of pause tokens (Goyal et al., 2024) to all datapoints — both during training and during inference — in order to allow the model extra computation before it emits its outputs. Next, we tried an even more sophisticated alternative we call hash-conditioning. Here, we use as prompt a random hash string unique to each training datapoint (rather than the same constant sequence of pause tokens); during test-time, we prompt with novel hash strings to extract the test data. We provide possible intuitions for why this may help in §5.1.
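A minimal sketch of the idea, with hypothetical helper names and an arbitrary hash alphabet (the paper's exact prefix lengths and tokenization are in its experimental details):

```python
import random

# Illustrative sketch of hash-conditioning (hypothetical helpers; the paper's
# exact prefix lengths and formatting are in its experimental appendix).
# Each training example gets its own random hash prefix; at test time we feed
# fresh, unseen hash prefixes and decode (even greedily) to get diverse samples.

HASH_ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789")

def random_hash(length, rng):
    return "".join(rng.choice(HASH_ALPHABET) for _ in range(length))

def add_hash_prefixes(examples, hash_len=10, seed=0):
    rng = random.Random(seed)
    # One independent random prefix per training datapoint.
    return [f"{random_hash(hash_len, rng)} : {x}" for x in examples]

def test_prompts(num_samples, hash_len=10, seed=1):
    rng = random.Random(seed)
    # Novel hashes at inference; the model completes each prompt into a sample.
    return [f"{random_hash(hash_len, rng)} : " for _ in range(num_samples)]
```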
4. EXPERIMENTAL RESULTS

Key details. Part of our experiments are performed on a Gemma v1 (2B) pre-trained model (Gemma Team et al., 2024), averaged over 4 runs. For diffusion, we use a 90M (non-embedding) parameter Score Entropy Discrete Diffusion model (SEDD; Lou et al., 2023). For a fair comparison against NTP, we use an 86M (non-embedding) parameter GPT-2 model (Radford et al., 2019).

In all our experiments, we finetune the models until it is clear that algorithmic creativity (Eq. 1) has saturated. All values are reported from this checkpoint. Finally, since our best Transformer results were under hash-conditioning (for both next- and multi-token training), our main results are reported under that training setting; we provide various ablations without it as well. Please see §D for more experimental details, and §C for precise dataset details (e.g., how the graph is constructed, how sequences are formatted, etc.).

4.1. Observations

Multi-token prediction improves algorithmic creativity significantly. In all our datasets, we observe from Fig 3 that the algorithmic creativity of the Gemma v1 (2B) model increases significantly under multi-token prediction, with nearly a 5x factor for the discovery datasets. Note that for this, we have selected the learning rate favorable towards next-token prediction; tuning for multi-token yields further gains (Fig 15).

In Fig 4, we report performance for the diffusion model against next-token & teacherless training of similar-sized Transformers. We see that diffusion models are consistently better than next-token training, achieving up to 5x higher algorithmic creativity. However, the gains are much smaller or absent with teacherless training. This echoes prior discussions suggesting that the teacherless objective is a hard objective to optimize for smaller Transformers (B&N'24); other multi-token approaches (Gloeckle et al., 2024) are even known to hurt small models of the order of 300M.

Multi-token prediction reduces memorization significantly. Algorithmic creativity may suffer either because the model outputs incoherent garbage, or because it repeats the same original output, or because it simply parrots out the training data. In almost all settings, it is the last reason that dominates: across the board (in Fig 3, Fig 4 bottom), next-token prediction is significantly prone to memorizing the data, while multi-token methods are highly resistant. As foreshadowed in §2.6, we hypothesize that this is because NTP memorizes the earlier training tokens without a global plan, having fit the later tokens via local coherence rules (because of Clever Hans cheats à la B&N'24). Note that an exception to this is the smaller models (especially for diffusion) in our construction tasks, where memorization increases under the multi-token objectives; but this increase is mild and, crucially, does not hurt algorithmic creativity. We point the reader to §B.4 for further empirical evidence supporting our argument about NTP from §2.6, including experiments on token-reordering and experiments ruling out other hypotheses.

Hash-conditioning improves algorithmic creativity for Transformers. Orthogonal to the effect of multi-token vs. next-token objectives, we point out three crucial effects that hash-conditioning has on a Transformer. First, hash-conditioning results in the highest algorithmic creativity in both the small models (Fig 6) and the larger models (Fig 5). In fact, in our larger models, the null and pause-token prefixes with temperature sampling (Fig 18) exhibit almost no algorithmic creativity (they are mode-collapsed, see Fig 19, 20). In §H.2, we find that this improved creativity from hash-conditioning comes from aiding diversity, rather than by reducing memorization. Note that we do not see gains of hash-conditioning when it comes to diffusion training (Fig 6).

Second, surprisingly, with hash-conditioning, there is no need for temperature: even greedy decoding generates diverse outputs that are as good or even better than temperature sampling in algorithmic creativity. Besides, for any fixed temperature, prefixing a hash string only improves performance over a null prefix (Fig 6, 18). Third, increasing the hash length consistently boosts algorithmic creativity for both next-token and multi-token approaches (see Fig 5, 23).

Thus, for Transformers, we propose viewing hash-conditioning as a distinct knob for diversity with more potency than temperature-scaling. This is in line with Peeperkorn et al. (2024); Chen & Ding (2023); Chung et al. (2023) who find that, in realistic tasks, temperature only has a weak correlation with creativity, often inadvertently introducing incoherence.

Figure 4. Multi-token diffusion training improves algorithmic creativity (top; Eq 1) on our four open-ended algorithmic tasks, and it reduces memorization on discovery tasks but not construction tasks (bottom). Training objectives compared: standard (next-token), teacherless (multi-token), and diffusion (multi-token).

Figure 5. Hash-conditioning significantly improves algorithmic creativity of both next- and multi-token prediction on the Gemma v1 (2B) model. The labels on the X-axis denote the prefix used during training and inference (hash10, hash4, hash2, or null) and the decoding (greedy or temperature 2.0).

Figure 6. Hash-conditioning improves algorithmic creativity of the GPT-2 (86M) model (but not the diffusion model): the X-axis labels denote the training and decoding procedure, while the legend indicates the type of prefix used during both.

Robustness to hyperparameters. In §E and Fig 22, we do a sensitivity analysis on all the datasets. We report how our above findings are robust to the choice of learning rate, batch size, number of training steps, weight given to the multi-token objective, varying sampling conditions and reasonable changes to the complexity of the dataset and training set size (as per our argument in §2.6, we do expect the next- vs. multi-token gap to diminish for larger dataset sizes).

4.2. An initial exploration of real-world summarization

For a more realistic examination of our findings, we conduct a preliminary investigation of GPT models finetuned with NTP and the multi-token teacherless objectives on summarization tasks (XSUM, CNN/DailyMail). We measure the diversity of a model for any given prompt by generating 5 different completions and computing a Self-BLEU metric (Zhu et al., 2018).
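As a reference point, here is a minimal sketch of a diversity score defined as 1 − Self-BLEU over a set of completions for one prompt, using NLTK's sentence_bleu (our own illustration; the paper cites Zhu et al. (2018) for the metric):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative diversity score (1 - Self-BLEU) over a set of completions for a
# single prompt; the paper cites Zhu et al. (2018) for Self-BLEU. Each
# completion is scored against all the others as references, then averaged.
def diversity_score(completions):
    smooth = SmoothingFunction().method1
    tokenized = [c.split() for c in completions]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    self_bleu = sum(scores) / len(scores)
    return 1.0 - self_bleu
```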
Admittedly though, a summarization task is not as open-ended as we would like: a higher-quality model (i.e., higher ROUGE; Lin, 2004) necessarily means lower diversity. To account for this, we plot how diversity evolves over time as a function of the quality of the model; we then find in Fig 7 that for a given model quality, the larger multi-token models achieve higher diversity (albeit only by a slight amount). This increase does not hold for smaller models and is not always noticeable for CNN/DailyMail (see §I). Interestingly, teacherless training consistently shows an increase in summarization quality, measured by ROUGE.

5. DISCUSSION

5.1. Intuition about hash-conditioning

One could view the hash prefixes as a simpler alternative to varying the wordings of a prompt (Li et al., 2023; Lau et al., 2024; Naik et al., 2024) or tuning a soft-prompt (Wang et al., 2024c), both of which are known to induce diversity.

Figure 7. Multi-token training improves diversity scores for XSUM summarization for large GPT-2 models: here, we plot diversity (1 − Self-BLEU) against quality (ROUGE), as measured over multiple checkpoints during finetuning of GPT-2 XL and GPT-2 Large under the standard (next-token) and teacherless (multi-token) objectives, and observe differences in diversity for a fixed quality.

But why does this help? We speculatively put forth two arguments. The first is a representational one. Fixing a random seed upfront may help the model flesh out (i.e., compute the tokens of) one thought per sample, as against maintaining a running set of multiple thoughts and computing a distribution over all their tokens at each step. A similar point is made in a concurrent position paper (Jahrens & Martinetz, 2025). The second argument is specific to next-token prediction on open-ended planning tasks: fixing a random seed upfront may help the model co-ordinate multiple interlocking random decisions in advance rather than deciding them on the fly. Finally, there are also optimization aspects of how hash-conditioning works that we do not understand (see §B.1). Regardless, it remains to be seen whether hash-conditioning is useful in tasks beyond the minimal ones we design.

5.2. Effects of reasoning-enhancing methods

Our argument is limited to learning open-ended tasks in a supervised manner. While we do not comment on how well other approaches like RL (DeepSeek-AI, 2025), chain-of-thought (CoT; Wei et al., 2022), and scaling test-time compute (OpenAI, 2024) would fare, we remark that these methods are designed to enhance the quality of a single example. It is unclear how to design them to maximize originality against a training set, and diversity over multiple responses. Furthermore, there is a profound question as to whether merely spelling out a model's thought in token space can be an efficient way to search and maximize diversity. This may require enumerating all possible candidates by trial and error, an impossible feat when the search space is large.

We present more discussions in §B.

6. LIMITATIONS

We enumerate in detail the limitations of our work in terms of our experimental conclusions and in terms of our general approach to an abstraction of creativity.

6.1. Limitations of our experimental conclusions

1. There may be many ways to improve upon next-token prediction for a minimal task. Unfortunately, success here does not necessarily guarantee success on more complex tasks. Conversely, minimal tasks are more valuable as a failure base case: failure here guarantees failure in more complex tasks.
2. Our examples do not preclude the existence of tasks where next-token prediction will outperform multi-token prediction; multi-token prediction is simply a more general-purpose objective suitable to lookahead tasks.
3. The teacherless multi-token prediction technique we explore as an alternative is generally harder to optimize than next-token prediction, especially for smaller models.
4. Even if multi-token approaches outperform next-token prediction relatively, in some of our simple tasks, all algorithms are far from delivering a sufficiently diverse model.
5. Although our tasks are minimal, we note that there is a certain range of hyperparameters (e.g., high degree or edge count) beyond which the models can struggle to learn them. We find that Triangle Discovery in particular is a challenging task, especially for smaller models. We also note that the models are curiously sensitive to the way the edges are formatted (see §F.3).

6.2. Our approach to creativity

Below, we enumerate some important limitations of our approach towards building abstract and minimal models of creative tasks.

1. The skills we capture in our tasks are only (a subset of) the computational skills necessary for creativity; these are far from being sufficient.
2. The type of algorithmic tasks we study captures only a tiny subset of creative tasks that fall under the taxonomy in Boden (2003). There is yet another class called transformative creativity that we do not look at, and also other important taxonomies such as Big-C/little-c creativity (Csikszentmihalyi, 1996). Big-C creativity corresponds to breakthroughs and world-changing ideas; what we focus on is adjacent to a class of little-c creativity tasks. Relatedly, many real-world creative tasks appear to be "out-of-distribution" in nature, which we do not capture.

3. Real-world creative tasks also apply over much larger context lengths and require drawing connections from a significantly larger memory (literally, the set of all things a human may know about). Our algorithmic tasks are tiny in comparison (although deliberately so).
4. Our empirical measure of creativity for algorithmic tasks is only a computationally-efficient proxy. Achieving an absolutely high algorithmic creativity score does not imply a complete coverage of the space.
5. As stated earlier, we study abstract tasks that are inspired by the computations involved in creative tasks. Our study is not intended to capture the subjective, social, cultural and personal values integral to many creative tasks.

7. RELATED WORK

Open-ended algorithmic tasks. Directly related to us are Khona et al. (2024); Allen-Zhu & Li (2023b) who study diversity of next-token-trained models on an open-ended algorithmic task. Khona et al. (2024) consider path-connectivity on a knowledge graph. They observe that under temperature-scaling, diversity is at odds with accuracy. We show that this tradeoff can be greatly improved when we consider alternative training methods (multi-token, or hash-conditioning). Allen-Zhu & Li (2023b) empirically demonstrate that next-token predictors are able to learn a synthetic, challenging CFG, in the "infinite" data regime (≈ 100m tokens). Our datasets are not CFGs, with the exception of Sibling Discovery, which can be thought of as a simple PCFG. Our negative result does not contradict theirs since what we show is a sub-optimality of NTP in a smaller data regime. Our work also extends the above works by studying limitations in much more minimal tasks that require as little as 2-hop lookahead. There are other works that study Transformers on non-open-ended graph-algorithmic tasks, discussed in §J.

Diversity in generative models. Generative diversity has long been a major goal, at least until the revolution in reasoning of language models, when accuracy took prominence over diversity. Much work has gone into concerns such as mode collapse (Che et al., 2017), posterior collapse (Bowman et al., 2016) and memorization. In LLMs, regurgitation of training data has been a serious concern (Carlini et al., 2020; 2023; Nasr et al., 2023). Our results on hash-conditioning are also reminiscent of a line of work in reinforcement learning (RL) showing that adding noise to the policy model parameters enables more efficient exploration than directly adding noise to the output space (Plappert et al., 2017; Fortunato et al., 2017). We defer discussion of theoretical studies of diversity and memorization to §J, along with empirical studies of creativity in natural language tasks.

Going beyond next-token prediction (NTP). There has been a recent emerging discussion surrounding the role of NTP as a foundational piece in developing intelligent models. On the critical side, arguments have been made about the inference-time issues with auto-regression (Dziri et al., 2024; LeCun, 2024; Kääriäinen, 2006; Ross & Bagnell, 2010). Others have reported the planning and arithmetic limitations of next-token trained models (McCoy et al., 2023; Momennejad et al., 2023; Valmeekam et al., 2023a;b;c; Bachmann & Nagarajan, 2024) where the goal is accuracy, not diversity. As for diffusion as an alternative to NTP, our findings parallel those of Ye et al. (2024) who show that their variant of diffusion is able to solve the challenging path-star task of B&N'24. We provide references to more lines of multi-token prediction work in §J.

There are also other Transformer failures such as the reversal curse (Allen-Zhu & Li, 2023a) or shortcut-learning (Dziri et al., 2024; Zhang et al., 2023; Liu et al., 2023; Young & You, 2022; Lai et al., 2021; Ranaldi & Zanzotto, 2023); however, these are out-of-distribution failures; the sub-optimality we show is in-distribution, like in B&N'24.

Injecting noise into a Transformer. Most related to hash-conditioning is DeSalvo et al. (2024) who induce diversity by varying a soft-prompt learned using a reconstruction loss. Our approach requires no modification to the architecture or the loss; however, we train the whole model, which is more expensive than training only a soft-prompt generator. A concurrent position paper (Jahrens & Martinetz, 2025) conceptually suggests injecting noise with the same motivation as us. The benefits of hash-conditioning may also be related to the fact that varying the wording in a prompt is known to induce diverse outputs (Li et al., 2023; Lau et al., 2024; Naik et al., 2024). Various works also inject noise into a Transformer, in a different form from ours (e.g., inducing Gaussian noise), and for a different function such as quality, robustness (Hua et al., 2022; Jain et al., 2024) or efficiency (Wang et al., 2024c).

8. CONCLUSIONS

This work provides a minimal test-bed of tasks abstracting distinct modes of creativity. While these tasks are admittedly an extreme caricaturization of real-world tasks, they enable us to quantify otherwise elusive metrics like originality and diversity. They also enable us to control and investigate distinct parts of the current apparatus for language modeling (next-token learning and softmax-based temperature sampling) and advocate for alternatives (multi-token learning and hash-conditioning). The surprising effectiveness of hash-conditioning raises various open questions (§B.1). There are also other profound questions as to whether reasoning-enhancing methods like RL and CoT are optimal for enhancing open-ended diversity and originality (§5.2). Overall, we hope our work inspires discussion in the various directions of multi-token prediction, creativity and planning.

9. IMPACT STATEMENT

This paper presents work whose goal is to advance the field of Machine Learning through the study of simple algorithmic tasks inspired by creativity. There are many potential societal consequences of our work — especially if one applies AI to real-world creative endeavors — none of which we feel must be specifically highlighted in our focused algorithmic study.

10. ACKNOWLEDGEMENTS

We wish to thank Gregor Bachmann, Jacob Springer, and Sachin Goyal for extensive feedback on a draft of the paper. We also wish to thank Mike Mozer, Suhas Kotha, Clayton Sanford, Christina Baek, Yuxiao Qu, and Ziqian Zhong for valuable early discussions and pointers. The work was supported in part by Cisco, Apple, Google, OpenAI, NSF, the Okawa Foundation and Schmidt Sciences.

REFERENCES

Alabdulmohsin, I., Tran, V. Q., and Dehghani, M. Fractal patterns may unravel the intelligence in next-token prediction, 2024.

Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.2, knowledge manipulation. CoRR, abs/2309.14402, 2023a. URL https://doi.org/10.48550/arXiv.2309.14402.

Allen-Zhu, Z. and Li, Y. Physics of language models: Part 1, context-free grammar. CoRR, abs/2305.13673, 2023b. URL https://doi.org/10.48550/arXiv.2305.13673.

Anderson, B. R., Shah, J. H., and Kreminski, M. Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th Conference on Creativity & Cognition, Chicago, IL, USA, June 23-26, 2024, pp. 413–425. ACM, 2024. URL https://doi.org/10.1145/3635636.3656204.

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. NeurIPS, 2021.

Bachmann, G. and Nagarajan, V. The pitfalls of next-token prediction. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 2296–2318, 2024.

Bavarian, M., Jun, H., Tezak, N. A., Schulman, J., McLeavey, C., Tworek, J., and Chen, M. Efficient training of language models to fill in the middle. ArXiv, abs/2207.14255, 2022. URL https://api.semanticscholar.org/CorpusID:251135268.

Beel, J., Kan, M.-Y., and Baumgart, M. Evaluating Sakana's AI Scientist for autonomous research: Wishful thinking or an emerging reality towards 'artificial research intelligence' (ARI)?, 2025.

Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., and Evans, O. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=GPKTIktA0k.

Boden, M. A. The Creative Mind - Myths and Mechanisms (2. ed.). Routledge, 2003.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Józefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, pp. 10–21. ACL, 2016.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Callaway, E. Cognitive science: Leap of thought. Nature, 502, 2013.

Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D. X., Erlingsson, Ú., Oprea, A., and Raffel, C. Extracting training data from large language models. In USENIX Security Symposium, 2020.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., and Zhang, C. Quantifying memorization across neural language models. ICLR, 2023.

Chakrabarty, T., Laban, P., Agarwal, D., Muresan, S., and Wu, C. Art or artifice? Large language models and the false promise of creativity. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024, pp. 30:1–30:34. ACM, 2024.

Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Chen, H. and Ding, N. Probing the "creativity" of large language models: Can models produce divergent semantic association? In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp. 12881–12888. Association for Computational Linguistics, 2023.
Chow, Y., Tennenholtz, G., Gur, I., Zhuang, V., Dai, B., Thiagarajan, S., Boutilier, C., Agarwal, R., Kumar, A., and Faust, A. Inference-aware fine-tuning for best-of-n sampling in large language models. ArXiv, abs/2412.15287, 2024.
Chung, J. J. Y., Kamar, E., and Amershi, S. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 575–593. Association for Computational Linguistics, 2023.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Csikszentmihalyi, M. Creativity: Flow and the Psychology of Discovery and Invention. HarperCollins Publishers, New York, NY, first edition, 1996.
Dang, X., Baek, C., Wen, K., Kolter, Z., and Raghunathan, A. Weight ensembling improves reasoning in language models. 2025.
Dawid, A. and LeCun, Y. Introduction to latent variable energy-based models: A path towards autonomous machine intelligence. arXiv preprint arXiv:2306.02572, 2023.
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ArXiv, abs/2501.12948, 2025.
DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B.-L., Wu, B., et al. Deepseek-v3 technical report. ArXiv, 2024.
DeSalvo, G., Kagy, J.-F., Karydas, L., Rostamizadeh, A., and Kumar, S. No more hard prompts: Softsrv prompting for synthetic data generation. arXiv:2410.16534, 2024.
Du, L., Mei, H., and Eisner, J. Autoregressive modeling with lookahead attention. arXiv preprint arXiv:2305.12272, 2023.
Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Lin, B. Y., Welleck, S., West, P., Bhagavatula, C., Le Bras, R., et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.
Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2023.
Finlayson, M., Hewitt, J., Koller, A., Swayamdipta, S., and Sabharwal, A. Closing the curious case of neural text degeneration. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. Noisy networks for exploration. ArXiv, abs/1706.10295, 2017.
Franceschelli, G. and Musolesi, M. On the creativity of large language models. CoRR, abs/2304.00008, 2023.
Fried, D., Aghajanyan, A., Lin, J., Wang, S. I., Wallace, E., Shi, F., Zhong, R., tau Yih, W., Zettlemoyer, L., and Lewis, M. Incoder: A generative model for code infilling and synthesis. ArXiv, abs/2204.05999, 2022.
Gemma Team, T. M., Hardin, C., Dadashi, R., Bhupatiraju, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., et al. Gemma. 2024.
Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., and Synnaeve, G. Better & faster large language models via multi-token prediction. 2024.
Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial networks. Commun. ACM, 63(11):139–144, 2020.
Goyal, A., Sordoni, A., Côté, M.-A., Ke, N. R., and Bengio, Y. Z-forcing: Training stochastic recurrent networks. NeurIPS, 2017.
Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V. Think before you speak: Training language models with pause tokens. The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.
Gruver, N., Stanton, S., Frey, N. C., Rudner, T. G. J., Hotzel, I., Lafrance-Vanasse, J., Rajpal, A., Cho, K., and Wilson, A. G. Protein design with guided discrete diffusion. ArXiv, abs/2305.20009, 2023.
Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings. OpenReview.net, 2018.
Gupta, T. and Pruthi, D. All that glitters is not novel: Plagiarism in ai generated research. arXiv:2502.16487, 2025.
Hofstadter, D. A review of mental leaps: Analogy in creative thought. AI Mag., 16(3):75–80, 1995.
Holyoak, K. J. and Thagard, P. Mental leaps: analogy in creative thought. MIT Press, Cambridge, MA, USA, 1995.
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. In Neural Information Processing Systems, 2021.
Hopkins, A. K., Renda, A., and Carbin, M. Can LLMs generate random numbers? Evaluating LLM sampling in controlled domains. In ICML 2023 Workshop: Sampling and Optimization in Discrete Space, 2023.
Hua, H., Li, X., Dou, D., Xu, C., and Luo, J. Fine-tuning pre-trained language models with noise stability regularization. CoRR, 2022.
Jahrens, M. and Martinetz, T. Why llms cannot think and how to fix it. arXiv:2503.09211, 2025.
Jain, N., Chiang, P., Wen, Y., Kirchenbauer, J., Chu, H., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A., Goldblum, M., Geiping, J., and Goldstein, T. Neftune: Noisy embeddings improve instruction finetuning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
Jansen, P. A., Côté, M.-A., Khot, T., Bransom, E., Dalvi, B., Majumder, B. P., Tafjord, O., and Clark, P. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. NeurIPS, 2024.
Kääriäinen, M. Lower bounds for reductions. In Atomic Learning Workshop, 2006.
Kalai, A. T. and Vempala, S. S. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, Vancouver, BC, Canada, June 24-28, 2024, pp. 160–171. ACM, 2024.
Kalavasis, A., Mehrotra, A., and Velegkas, G. On the limits of language generation: Trade-offs between hallucination and mode collapse. abs/2411.09642, 2024.
Kamb, M. and Ganguli, S. An analytic theory of creativity in convolutional diffusion models. arXiv:2412.20292, 2024.
Khona, M., Okawa, M., Hula, J., Ramesh, R., Nishi, K., Dick, R. P., Lubana, E. S., and Tanaka, H. Towards an understanding of stepwise inference in transformers: A synthetic graph navigation model. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Bengio, Y. and LeCun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
Kitouni, O., Nolte, N., Bouchacourt, D., Williams, A., Rabbat, M., and Ibrahim, M. The factorization curse: Which tokens you predict underlie the reversal curse and more. CoRR, abs/2406.05183, 2024.
Kleinberg, J. M. and Mullainathan, S. Language generation in the limit. CoRR, abs/2404.06757, 2024.
Lai, Y., Zhang, C., Feng, Y., Huang, Q., and Zhao, D. Why machine reading comprehension models learn shortcuts? In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, pp. 989–1002. Association for Computational Linguistics, 2021.
Lau, G. K. R., Hu, W., Liu, D., Chen, J., Ng, S.-K., and Low, B. K. H. Dipper: Diversity in prompts for producing large language model ensembles in reasoning tasks. arXiv:2412.15238, 2024.
LeCun, Y. Do large language models need sensory grounding for meaning and understanding? University Lecture, 2024.
Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., and Papailiopoulos, D. Teaching arithmetic to small transformers. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, 2024.
Li, Y., Lin, Z., Zhang, S., Fu, Q., Chen, B., Lou, J.-G., and Chen, W. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 2023. Association for Computational Linguistics.
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.
Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, 2023.
Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The AI scientist: Towards fully automated open-ended scientific discovery, 2024a.
Lu, X., Sclar, M., Hallinan, S., Mireshghallah, N., Liu, J., Han, S., Ettinger, A., Jiang, L., Chandu, K. R., Dziri, N., and Choi, Y. AI as humanity's salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text. abs/2410.04265, 2024b.
Malach, E. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, 2023.
McCoy, R. T., Yao, S., Friedman, D., Hardy, M., and Griffiths, T. L. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023.
McLaughlin, A., Campbell, J., Uppuluri, A., and Yang, Y. Aidanbench: Stress-testing language model creativity on open-ended questions. In NeurIPS 2024 Workshop on Language Gamification, 2024.
Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, 2024.
Mirowski, P. W., Love, J., Mathewson, K. W., and Mohamed, S. A robot walks into a bar: Can language models serve as creativity support tools for comedy? An evaluation of llms' humour alignment with comedians. CoRR, abs/2405.20956, 2024.
Momennejad, I., Hasanbeig, H., Frujeri, F. V., Sharma, H., Ness, R. O., Jojic, N., Palangi, H., and Larson, J. Evaluating cognitive maps and planning in large language models with cogeval. Advances in Neural Information Processing Systems, 36, 2023.
Monea, G., Joulin, A., and Grave, E. Pass: Parallel speculative sampling. 3rd Workshop on Efficient Natural Language and Speech Processing (NeurIPS 2023), 2023.
Nagarajan, V., Raffel, C., and Goodfellow, I. J. Theoretical insights into memorization in GANs. In Neural Information Processing Systems Workshop, volume 1, pp. 3, 2018.
Naik, R., Chandrasekaran, V., Yuksekgonul, M., Palangi, H., and Nushi, B. Diversity of thought improves reasoning abilities of llms. arXiv:2310.07088, 2024.
Nakkiran, P., Bradley, A., Zhou, H., and Advani, M. Step-by-step diffusion: An elementary tutorial, 2024.
Nallapati, R., Zhou, B., dos Santos, C. N., Gulcehre, C., and Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv:1602.06023, 2016.
Narayan, S., Cohen, S. B., and Lapata, M. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv:1808.08745, 2018.
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F., and Lee, K. Scalable extraction of training data from (production) language models. ArXiv, 2023.
Nolte, N., Kitouni, O., Williams, A., Rabbat, M., and Ibrahim, M. Transformers can navigate mazes with multi-step prediction. CoRR, abs/2412.05117, 2024.
OpenAI. Openai o1 system card. ArXiv, 2024.
Padmakumar, V. and He, H. Does writing with language models reduce content diversity? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
Pannatier, A., Courdier, E., and Fleuret, F. σ-gpts: A new approach to autoregressive models. In Machine Learning and Knowledge Discovery in Databases. Research Track - European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9-13, 2024, Proceedings, Part VII, volume 14947 of Lecture Notes in Computer Science, pp. 143–159. Springer, 2024.
Peeperkorn, M., Kouwenhoven, T., Brown, D., and Jordanous, A. Is temperature the creativity parameter of large language models? abs/2405.00492, 2024.
Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ArXiv, abs/1706.01905, 2017.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.
Ranaldi, L. and Zanzotto, F. M. Hans, are you clever? Clever hans effect analysis of neural systems, 2023.
Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J. R., Ellenberg, J. S., Wang, P., Fawzi, O., Kohli, P., and Fawzi, A. Mathematical discoveries from program search with large language models. Nat., 625(7995), 2024.
Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, volume 9 of JMLR Proceedings, 2010.
Runco, M. A. and Jaeger, G. J. The standard definition of creativity. Creativity Research Journal, 24(1):92–96, 2012.
Sanford, C., Fatemi, B., Hall, E., Tsitsulin, A., Kazemi, S. M., Halcrow, J., Perozzi, B., and Mirrokni, V. Understanding transformer reasoning capabilities via graph algorithms. abs/2405.18512, 2024.
Saparov, A., Pawar, S., Pimpalgaonkar, S., Joshi, N., Pang, R. Y., Padmakumar, V., Kazemi, S. M., Kim, N., and He, H. Transformers struggle to learn to search. arXiv:2412.04703, 2024.
Schnitzler, J., Ho, X., Huang, J., Boudin, F., Sugawara, S., and Aizawa, A. Morehopqa: More than multi-hop reasoning. abs/2406.13397.
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Shannon, C. E. Prediction and entropy of printed english. The Bell System Technical Journal, 30(1):50–64, 1951.
Shlegeris, B., Roger, F., Chan, L., and McLean, E. Language models are better than humans at next-token prediction. arXiv preprint arXiv:2212.11281, 2022.
Si, C., Yang, D., and Hashimoto, T. Can llms generate novel research ideas? A large-scale human study with 100+ NLP researchers. 2024.
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ArXiv, abs/2408.03314, 2024.
Talmor, A., Tafjord, O., Clark, P., Goldberg, Y., and Berant, J. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
Thankaraj, A., Jiang, Y., Kolter, J. Z., and Bisk, Y. Looking beyond the next token. arXiv:2504.11336, 2025.
Tschannen, M., Kumar, M., Steiner, A., Zhai, X., Houlsby, N., and Beyer, L. Image captioners are scalable vision learners too. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
Valmeekam, K., Marquez, M., and Kambhampati, S. Can large language models really improve by self-critiquing their own plans? arXiv preprint arXiv:2310.08118, 2023a.
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change, 2023b.
Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. On the planning abilities of large language models - A critical investigation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023c.
Varshney, L. R., Pinel, F., Varshney, K. R., Bhattacharjya, D., Schörgendorfer, A., and Chee, Y. A big data approach to computational creativity: The curious case of chef watson. IBM J. Res. Dev., 63(1):7:1–7:18, 2019.
Walsh, M., Preus, A., and Gronski, E. Does chatgpt have a poetic style? In Proceedings of the Computational Humanities Research Conference 2024, Aarhus, Denmark, December 4-6, 2024, volume 3834 of CEUR Workshop Proceedings, pp. 1201–1219. CEUR-WS.org.
Wang, H., Zhao, Y., Li, D., Wang, X., Liu, G., Lan, X., and Wang, H. Innovative thinking, infinite humor: Humor research of large language models through structured thought leaps. abs/2410.10370, 2024a.
Wang, H., Zou, J., Mozer, M., Goyal, A., Lamb, A., Zhang, L., Su, W. J., Deng, Z., Xie, M. Q., Brown, H., and Kawaguchi, K. Can AI be as creative as humans? arXiv:2401.01623, 2024b.
Wang, J. X., King, M., Porcel, N., Kurth-Nelson, Z., Zhu, T., Deck, C., Choy, P., Cassin, M., Reynolds, M., Song, F., Buttimore, G., Reichert, D. P., Rabinowitz, N. C., Matthey, L., Hassabis, D., Lerchner, A., and Botvinick, M. M. Alchemy: A benchmark and analysis toolkit for meta-reinforcement learning agents. In NeurIPS Datasets and Benchmarks, 2021.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 13484–13508. Association for Computational Linguistics, 2023.
Wang, Y., Luo, X., Wei, F., Liu, Y., Zhu, Q., Zhang, X., Yang, Q., Xu, D., and Che, W. Make some noise: Unlocking language model parallel inference capability through noisy training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 12914–12926. Association for Computational Linguistics, 2024c.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
Wies, N., Levine, Y., and Shashua, A. Sub-task decomposition enables learning in sequence to sequence tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.
Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. 2024.
Xu, M., Jiang, G., Zhang, C., Zhu, S.-C., and Zhu, Y. Interactive visual reasoning under uncertainty. In Neural Information Processing Systems, 2022.
Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp. 10210–10229. Association for Computational Linguistics, 2024a.
Yang, S., Kassner, N., Gribovskaya, E., Riedel, S., and Geva, M. Do large language models perform latent multi-hop reasoning without exploiting shortcuts? abs/2411.16679, 2024b.
Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. Improved variational autoencoders for text modeling using dilated convolutions. ICML, 2017.
Yang, Z., Band, N., Li, S., Candès, E. J., and Hashimoto, T. Synthetic continued pretraining. CoRR, abs/2409.07431, 2024c.
Ye, J., Gao, J., Gong, S., Zheng, L., Jiang, X., Li, Z., and Kong, L. Beyond autoregression: Discrete diffusion for complex reasoning and planning. 2024.
Young, T. and You, Y. On the inconsistencies of conditionals learned by masked language models. arXiv preprint arXiv:2301.00068, 2022.
Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J., Li, Z., Weller, A., and Liu, W. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024.
Zhang, H., Li, L. H., Meng, T., Chang, K., and den Broeck, G. V. On the paradox of learning to reason from data. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pp. 3365–3373. ijcai.org, 2023.
Zhang, J., Jain, L., Guo, Y., Chen, J., Zhou, K. L., Suresh, S., Wagenmaker, A., Sievert, S., Rogers, T. T., Jamieson, K., Mankoff, R., and Nowak, R. Humor in AI: massive scale crowd-sourced preferences and benchmarks for cartoon captioning. CoRR, abs/2406.10522, 2024a.
Zhang, Y., Schwarzschild, A., Carlini, N., Kolter, Z., and Ippolito, D. Forcing diffuse distributions out of language models. abs/2404.10859, 2024b.
Zhang, Y., Diddee, H., Holm, S., Liu, H., Liu, X., Samuel, V., Wang, B., and Ippolito, D. Noveltybench: Evaluating language models for humanlike diversity. 2025.
Zhao, Y., Zhang, R., Li, W., Huang, D., Guo, J., Peng, S., Hao, Y., Wen, Y., Hu, X., Du, Z., Guo, Q., Li, L., and Chen, Y. Assessing and understanding creativity in large language models. ArXiv, abs/2401.12491, 2024.
Zhong, S., Huang, Z., Gao, S., Wen, W., Lin, L., Zitnik, M., and Zhou, P. Let's think outside the box: Exploring leap-of-thought in large language models with creative humor generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, 2024.
Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018.
A. TRANSFORMER TRAINING OBJECTIVES


Let LM_θ be our language model, parameterized by θ, for which LM_θ(ŝ_i = s_i ; s_{<i}) is the probability it assigns to the i-th output ŝ_i being s_i, given as input a sequence s_{<i}. Let (p, r) be a prefix-response pair. In standard next-token finetuning, we maximize the objective:

J_next-token(θ) = E_D [ Σ_{i=1}^{L_resp} log LM_θ( r̂_i = r_i ; p, r_{<i} ) ]    (2)

In teacherless (multi-token) training (Monea et al., 2023; Bachmann & Nagarajan, 2024; Tschannen et al., 2023), we make
use of an uninformative input string $ that simply corresponds to a series of dummy tokens $.

J_multi-token(θ) = E_D [ Σ_{i=1}^{L_resp} log LM_θ( r̂_i = r_i ; p, $_{<i} ) ]    (3)
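To make the two objectives concrete, here is a minimal PyTorch-style sketch of how a weighted combination of Eq. (2) and Eq. (3) could be computed for a decoder-only model (as in Table 1, a weight w is given to the multi-token term). The function and tensor names, the dummy-token id, and the assumption that model(ids) directly returns logits are illustrative choices, not the exact training code.

import torch
import torch.nn.functional as F

def finetuning_loss(model, prefix_ids, response_ids, dummy_id, w=0.5):
    """Weighted combination of the next-token (Eq. 2) and teacherless
    multi-token (Eq. 3) objectives; `w` is the weight on the latter."""
    vocab = None

    # (1) Standard next-token loss: position i conditions on the true r_{<i}.
    inp_next = torch.cat([prefix_ids, response_ids[:, :-1]], dim=1)
    logits = model(inp_next)                                  # (B, P+L-1, V)
    resp_logits = logits[:, prefix_ids.size(1) - 1:, :]       # predicts r_1..r_L
    vocab = resp_logits.size(-1)
    loss_next = F.cross_entropy(resp_logits.reshape(-1, vocab),
                                response_ids.reshape(-1))

    # (2) Teacherless loss: replace r_{<i} with uninformative dummy tokens $,
    # so every response position is predicted without seeing earlier
    # response tokens.
    dummies = torch.full_like(response_ids[:, :-1], dummy_id)
    inp_multi = torch.cat([prefix_ids, dummies], dim=1)
    logits = model(inp_multi)
    resp_logits = logits[:, prefix_ids.size(1) - 1:, :]
    loss_multi = F.cross_entropy(resp_logits.reshape(-1, vocab),
                                 response_ids.reshape(-1))

    return (1 - w) * loss_next + w * loss_multi

Note that the teacherless branch feeds the same dummy token at every response position, so the model must commit to the full response without conditioning on its own previous outputs.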

B. FURTHER DISCUSSION
B.1. Style of noise-injection
Our technique of injecting noise into the model is somewhat different from how noise is introduced in traditional VAEs (Kingma & Welling, 2014) or GANs (Goodfellow et al., 2020), and this difference is worth noting. In those approaches, although the model learns a noise-to-output mapping, the mapping is enforced only at the distribution level, i.e., the distribution of noise vectors must map to the distribution of real datapoints. In our approach, however, we arbitrarily enforce which noise vector goes to which real datapoint, at a pointwise level. This raises several open questions: why hash-conditioning works in the first place (surprisingly, without breaking optimization or generalization), whether there is a way to enforce the mapping at the distribution level instead, and whether that would provide even greater improvements.
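As a concrete illustration of this pointwise noise-to-output pairing, below is a minimal sketch of hash-conditioning: each training example is permanently paired with one arbitrary hash string, while at test time a fresh hash is drawn as the prompt. The helper names are hypothetical; the hash format follows the defaults described in §D.

import random
import string

def attach_hash(example, hash_len=10, rng=random):
    """Pair a training example with an arbitrary, fixed hash prefix.
    The hash is sampled once per example and stored with it, so the same
    hash always maps to the same datapoint (a pointwise assignment)."""
    h = "".join(rng.choice(string.ascii_uppercase) for _ in range(hash_len))
    return f"{h} {example}"

def fresh_prompt(hash_len=10, rng=random):
    """At test time, draw an unseen hash as the prompt, asking the model to
    produce a new sample 'seeded' by fresh noise."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(hash_len)) + " "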

B.2. In-weights vs in-context graphs for combinational creativity


Combinational creativity requires searching through known entities. In abstracting this, there is an interesting choice to be
made as to whether the relevant search space is retrieved and spelled out in-context or whether it remains in-weights (like in
Sibling Discovery and Triangle Discovery). We argue that the in-context version does not capture the creative skills
required in many real-world tasks. For instance, discovering a fresh and surprising analogy necessitates noticing similarities
from sufficiently distinct parts of one’s vast, rich space of memory. Thus, the core challenge here lies in retrieving from the
entirety of one's memory. If one were to faithfully simulate an in-context version of this in a model, one would have to
provide the entirety of the model’s pretraining data in context.

B.3. Examples of Triangle Discovery


Although we presented this task as a more complex, higher-order counterpart to Sibling Discovery, we retrospectively
identify some real-world examples that resemble the higher-order search skill involved in this task.

1. Discovering contradictions: Consider identifying non-trivial contradictions within (a large body) of knowledge (like a
legal system, or a proof based on many lemmas, or the literature spanning many papers in a certain field). This may
require identifying two or more facts that together result in an implication that contradicts another fact.
2. Discovering feedback loops: Fields like biology, ecology, climate science and economics may involve discovering
non-trivial feedback loops. Unlike feedback loops where two events encourage each other, a non-trivial loop would be
one where an Event A encourages Event B, that in turn encourages Event C that in turn encourages Event A.
3. Antanaclasis: An antanaclasis involves using a word in two different senses in a sentence, while still ensuring that
each sense has a coherent relationship with the rest of the sentence. Consider Benjamin Franklin’s quote, Your
argument is sound, nothing but sound. Here, the two senses are sound1 as in “logically correct”, sound2
as in “noise”. This sentence encodes a pairwise relationship between each pair of the three entities {argument, sound1, sound2}.


While the last two entities (the two senses) themselves must be related to each other (through the common
word, sound), for a coherent sentence, both senses must also be appropriate descriptors for the first entity, argument.
Thus, constructing this sentence requires searching through one’s vocabulary to discover three words that satisfy these
three relationships simultaneously.

4. Word games: Some word games require identifying a set of words that simultaneously have pairwise relationships
with each other.

(a) For example, standard crosswords would require identifying sets of 4 or more words that have various simultaneous
pairwise intersections in the letters used.
(b) Devising “& Lit.” clues in cryptic crosswords is an altogether different, yet compelling example that requires discovering a satisfying triangular relationship. Consider the clue “Some assassin in Japan” whose answer is Ninja. Here the phrase Some assassin in Japan participates in two senses. First is the direct semantic sense, as a definition of what a Ninja is. But there is a second, indirect sense: the word Some indicates that the solution lies as a substring of the phrase, namely “assassi(n in Ja)pan”. Thus, constructing the clue requires identifying a triangular relationship between
{Ninja, (Some assassin in Japan)1 , (Some assassin in Japan)2 } just like in an antanaclasis. This
is true generally of any & Lit. clues as these clues must perform “double duty” in pointing to the answer.

B.4. Further evidence of our argument in §2.6


Below we provide two more pieces of evidence affirming the failure mechanism of next-token prediction outlined in §2.6.

Improved algorithmic creativity is not due to some form of capacity control. While §2.6 argues that multi-token prediction should help creativity by providing critical lookahead capabilities, it is also possible that it simply acts as a form of capacity control that prevents memorization. We rule this out in Fig 8: even as memorization computed on unseen hash strings is controlled, the multi-token model perfectly reproduces the training data on seen hash strings. We term this hash-memorization. An exact equivalence of this phenomenon was noticed in GANs in Nagarajan et al. (2018), where the generator can be trained on specific latent vectors to memorize the mapping on those, and yet produce fresh samples outside of those latent vectors.
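A minimal sketch of how the two curves in Fig 8 can be computed is as follows; the canonicalize argument stands in for the task-specific notion of equivalence (e.g., treating permutations as identical) and is an assumption of this sketch.

def memorization_rate(generations, train_set, canonicalize=lambda s: s):
    """Fraction of generated samples that reproduce a training sample,
    up to the task's notion of equivalence."""
    train_canon = {canonicalize(s) for s in train_set}
    hits = sum(canonicalize(g) in train_canon for g in generations)
    return hits / max(len(generations), 1)

# memorization:      computed on generations prompted with *unseen* hash strings
# hash-memorization: computed on generations prompted with the *seen* hash strings
#                    (checking whether each reproduces the training data)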
[Figure 8 panels: memorization and hash-memorization vs. number of training steps (0 to 100k), comparing standard (next-token) and teacherless (multi-token) training.]
Figure 8. Even if multi-token prediction reduces memorization (on unseen hash strings), it has enough capacity to memorize
training data on the seen hash-strings (denoted by hash-memorization). Note that the best algorithmic creativity for NTP and MTP are
achieved at step 10k and 40k, respectively, which are the checkpoints we used to report metrics in Fig 4.

Effect of token reordering. The implication of our argument in §2.6 is that next-token learning would benefit from reversing
the token ordering of the Sibling Discovery task (i.e., parent appears before siblings). Indeed, we find this to be the
case in Fig 12 and Fig 22. Interestingly, we find that the reverse-trained NTP model is still far from the original multi-token
teacherless model. More surprisingly, a teacherless model trained on the reversed data achieves the highest algorithmic creativity of all the training methods here. Note that in all other datasets, no reordering of the tokens should make any change to the training.


[Figure 9 panels: (a) Sibling Discovery and (b) Triangle Discovery, each showing the in-weights graph, the training data, and generated samples labeled as incoherent, memorized, or duplicated, with algorithmic creativity 2/8 and 1/10, respectively, in the examples shown.]
Figure 9. Minimal tasks inspired by combinational creativity: The in-weights graph represents the underlying knowledge graph used
to generate the training data (not provided in-context). Based on our definition of algorithmic creativity in Eq. (1), generated samples that
are incoherent, memorized, or duplicated are not counted as valid samples. Note that sequences that are permutations of each other are
considered identical when computing duplicates and memorization.
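For reference, a minimal sketch of the algorithmic-creativity computation described in these captions (cf. Eq. (1)) is given below; is_coherent and canonicalize are placeholders for the task-specific coherence check and permutation-invariant canonical form, and are assumptions of this sketch rather than the exact evaluation code.

def algorithmic_creativity(generations, train_set, is_coherent, canonicalize):
    """Fraction of generated samples that are coherent, not memorized from
    the training set, and not duplicates of earlier generations."""
    train_canon = {canonicalize(s) for s in train_set}
    seen = set()
    valid = 0
    for g in generations:
        key = canonicalize(g)
        if not is_coherent(g):        # incoherent
            continue
        if key in train_canon:        # memorized
            continue
        if key in seen:               # duplicated
            continue
        seen.add(key)
        valid += 1
    return valid / max(len(generations), 1)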

C. DESCRIPTION OF DATASETS
C.1. Datasets inspired by combinational creativity
Dataset 1: Sibling Discovery. This task is based on a bipartite graph G made of parent vertices V = {A, B, C, . . .}, each neighboring a corresponding set of children nbr(A) = {a_1, a_2, . . .}. We set the number of parent vertices |V| to be small and the number of children for each parent vertex |nbr(A)| to be large; for example, |V| = 5 and |nbr(A)| = 500. We define coh(s) to hold on “sibling-parent” triplets of the form s = (γ, γ′, Γ) such that γ, γ′ ∈ nbr(Γ).
Next, we ensure that the training set is large enough for the model to infer all the edges in the graph. Let m = |V| and n = |nbr(Γ)| (for all Γ ∈ V). This means the training set size S = Ω(m · n). At the same time, to keep the task non-trivial, the training set must be small enough to not cover all the coherent sibling-parent triplets. Thus, we ensure S = o(m · n^2).
For the default version of this dataset, we set |V| = 5 and |nbr(Γ)| = 500 for all Γ ∈ V.
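A minimal sketch of generating such a training set is shown below; the specific sampling scheme (a random parent and two distinct children per string) and the token layout are illustrative assumptions consistent with the description above.

import random

def make_sibling_discovery_data(num_parents=5, num_children=500,
                                num_train=50_000, seed=0):
    rng = random.Random(seed)

    def children(parent):
        # Children of parent P are the integers [P*num_children, (P+1)*num_children).
        return list(range(parent * num_children, (parent + 1) * num_children))

    train = set()
    while len(train) < num_train:
        parent = rng.randrange(num_parents)
        c1, c2 = rng.sample(children(parent), 2)
        # "sibling, sibling, parent" triplet; vertices are space-separated.
        # The parent token is written as p<index> here only to keep it visually
        # distinct; the actual datasets use integer tokens throughout.
        train.add(f"{c1} {c2} p{parent}")
    return sorted(train)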

Dataset 2: Triangle Discovery. This task is based on an undirected graph G = (V, E) which contains many triangles. Since a triangle is a symmetric structure, the problem remains the same even upon reordering the vertices. Thus, in this task coh((v1, v2, v3)) = true iff all three edges between {v1, v2, v3} belong to G. To make this task interesting (neither too trivial nor too non-trivial) for our models to learn, we enforce several constraints on the graph. First, we try to keep the degree deg of each vertex sufficiently small. On the one hand, this is so that no vertex requires too much computation to find a triangle it is part of; on the other, we also do not want a very dense graph where most random triplets form a triangle. In addition to this degree requirement, we ensure that each vertex participates in a minimum number of triangles.

Thus, to create a graph that is neither too trivial nor too non-trivial, we define a two-step graph generation procedure. In the first step, we iterate over the vertices and add deg many edges from each vertex to other vertices in the set (where deg is small, such as 3 or 10). To avoid inadvertently creating high-degree vertices, we only select neighbors with degree ≤ 1.2 · deg. This alone may not ensure a sufficient number of triangles at each vertex; so, in the second step, we iterate over the vertices to explicitly create tri random triangles on each vertex (where tri is small, such as 6 or 10). We do this by selecting pairs of a vertex's neighbors and drawing an edge between them.

Next, we want a training dataset such that (a) the model can infer all the edges from the graph and yet (b) not all triangles appear in the dataset. This necessitates training on a dataset that consists not only of a subset of the triangles, but also of edges from the graph. Our training data consists of two parts: (1) 1/3 are random triangles from the graph, and (2) 2/3 are random edges from the graph. In the training set, the triangle and edge samples are distinguished by a prefix “triangle:” or “edge:”. During test-time, we ensure that the model is prompted with “triangle:”. A triangle (u, v, w) is tokenized as “tri: (u, v), (v, w), (w, u)” and an edge (u, v) as “edge: (u, v), (v, u)”. We provide both directions of an edge to potentially avoid any issues with the reversal curse (Berglund et al., 2024; Allen-Zhu & Li, 2023a).
For the default setting of the dataset, we set |V| = 999, deg = 3, tri = 6.

[Figure 10 panels: worked examples of (a) Circle Construction and (b) Line Construction over a small vocabulary, showing training data, generated samples labeled incoherent, memorized, or duplicated, the graphs they induce, and the resulting algorithmic creativity (1/4 in each example shown).]
Figure 10. Tasks inspired by exploratory creativity: The constructed graph visualizes the graph induced by the training or generated sample. Edge indices represent the order in which edges appear in the string. Based on our definition of algorithmic creativity in Eq. (1), generated samples that are incoherent, memorized, or duplicated are not counted as valid samples. Note that sequences that correspond to the same permutations but with different participating vertices are considered identical when computing duplicates and memorization.
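Returning to Triangle Discovery, a minimal sketch of the two-step graph-generation procedure described above, along with the triangle serialization, is given below; details such as how pre-existing edges interact with the planted triangles are simplifying assumptions of this sketch.

import random
from itertools import combinations

def make_triangle_graph(num_vertices=999, deg=3, tri=6, seed=0):
    """Step 1: give each vertex ~deg random edges, avoiding high-degree
    neighbors. Step 2: plant `tri` triangles per vertex by connecting
    pairs of its neighbors."""
    rng = random.Random(seed)
    edges = set()
    nbrs = {v: set() for v in range(num_vertices)}

    def add_edge(u, v):
        if u != v and (u, v) not in edges and (v, u) not in edges:
            edges.add((u, v))
            nbrs[u].add(v)
            nbrs[v].add(u)

    # Step 1: sparse random edges, only to neighbors of degree <= 1.2 * deg.
    for u in range(num_vertices):
        candidates = [v for v in range(num_vertices)
                      if v != u and len(nbrs[v]) <= 1.2 * deg]
        for v in rng.sample(candidates, min(deg, len(candidates))):
            add_edge(u, v)

    # Step 2: close `tri` randomly chosen neighbor pairs of each vertex.
    for u in range(num_vertices):
        pairs = list(combinations(sorted(nbrs[u]), 2))
        rng.shuffle(pairs)
        for v, w in pairs[:tri]:
            add_edge(v, w)

    return edges

def serialize_triangle(u, v, w):
    # Matches the tokenization "tri: (u, v), (v, w), (w, u)" described above.
    return f"tri: ({u}, {v}), ({v}, {w}), ({w}, {u})"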


Table 1. Hyperparameter details for Gemma v1 (2B) model.

Hyperparameter                   | Sibling Discovery | Triangle Discovery | Circle Construction | Line Construction
---------------------------------|-------------------|--------------------|---------------------|------------------
Max. learning rate               | 5 × 10^-4         | 5 × 10^-4          | 5 × 10^-4           | 5 × 10^-5
Model seq. len.                  | 32                | 32                 | 2048                | 2048
Training steps                   | 7500              | 10k                | 15k                 | 15k
Training size                    | 50k               | 15k                | 10k                 | 10k
Weight given to multi-token obj. | 0.5               | 0.5                | 0.75                | 0.75
C.2. Datasets inspired by exploratory creativity


Dataset 3: Circle Construction. In this task, the generated strings must be randomized adjacency lists that can be rearranged to recover circle graphs of N vertices. The vertices come from a fixed vocabulary of M tokens. Specifically, let the generated list be s = (v_{i_1}, v_{i_2}), (v_{i_3}, v_{i_4}), . . .. We define coh(s) = true iff there exists a resolving permutation π such that π(s) = (v_{j_1}, v_{j_2}), (v_{j_2}, v_{j_3}), . . . , (v_{j_n}, v_{j_1}) for distinct j_1, j_2, . . . , j_n, i.e., each edge leads to the next, and the last eventually circles back to the first vertex. In our experiments, we set M to be larger than N.
Our default experiments are reported for N = 9, M = 15.

Dataset 4: Line Construction. This task is a minor variant of the above where the edge set E corresponds to a line graph. The details are the same, except that for coherence to hold, we need a resolving permutation π such that π(s) = (v_{j_1}, v_{j_2}), (v_{j_2}, v_{j_3}), . . . , (v_{j_{n-1}}, v_{j_n}) for distinct j_1, j_2, . . . , j_n, i.e., each edge leads to the next, stopping at a dead end. We use the same set of hyperparameters as Circle Construction.
Our default experiments are reported for N = 9, M = 15.
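A minimal sketch of the two coherence checks is given below: a generated edge list is coherent for Circle Construction if its edges can be reordered into a single directed cycle over distinct vertices, and for Line Construction if they can be reordered into a single directed path.

def _successor_map(edges):
    """Map each source vertex to its target, or return None if any vertex is
    reused as a source or as a target (which rules out a simple chain)."""
    succ, targets = {}, set()
    for u, v in edges:
        if u in succ or v in targets:
            return None
        succ[u] = v
        targets.add(v)
    return succ

def coh_circle(edges):
    succ = _successor_map(edges)
    if not succ:
        return False
    # Follow successors; we must return to the start after exactly len(edges)
    # steps, visiting len(edges) distinct vertices along the way.
    start = next(iter(succ))
    cur, visited = start, set()
    for _ in range(len(edges)):
        if cur in visited or cur not in succ:
            return False
        visited.add(cur)
        cur = succ[cur]
    return cur == start and len(visited) == len(edges)

def coh_line(edges):
    succ = _successor_map(edges)
    if not succ:
        return False
    starts = set(succ) - set(succ.values())
    if len(starts) != 1:            # exactly one vertex with no incoming edge
        return False
    cur, visited = starts.pop(), set()
    while cur in succ:
        if cur in visited:
            return False
        visited.add(cur)
        cur = succ[cur]
    # All edges must be used, ending at a dead-end vertex not seen before.
    return cur not in visited and len(visited) == len(edges)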

D. FURTHER EXPERIMENTAL DETAILS


Details for Gemma v1 (2B) model. In Table 1, we provide the hyperparameter details for each of our datasets. We note some common details here. First, the batch size is 4, but each sequence is packed with multiple examples; thus the model sequence length (divided by the input length) can be treated as a multiplicative factor that determines the effective batch size. The learning rates are chosen to be favorable to next-token prediction (not multi-token prediction). The number of training steps was chosen roughly as a point after which the model had saturated in algorithmic creativity (and began exhibiting decreasing creativity). We use a learning rate schedule with a linear warm-up for 100 steps, followed by cosine annealing down to a factor of 0.01× the maximum learning rate. To measure creativity, we sample a test dataset T of 1024 datapoints.
We represent the main tokens in our tasks with integers (ranging from 0 up to as many distinct integers as required). In the hash-conditioning setting, we use hash strings of default length 10, made of randomly sampled uppercase characters from the English alphabet. In all datasets, we space-separate the vertices in a string, and comma-separate the edges.
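For instance, a training string for one of the construction tasks could be assembled as in the hypothetical helper below, following these conventions (length-10 uppercase hash prefix, space-separated vertices, comma-separated edges).

import random
import string

def format_example(edges, hash_len=10, rng=random):
    """e.g. edges=[(3, 7), (7, 12), (12, 3)] yields a string of the form
    '<10 uppercase chars> 3 7, 7 12, 12 3'."""
    prefix = "".join(rng.choice(string.ascii_uppercase) for _ in range(hash_len))
    body = ", ".join(f"{u} {v}" for u, v in edges)
    return f"{prefix} {body}"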
Details for GPT-2 (86M) model. We use GPT-2 (small) with 86M non-embedding parameters when comparing Transformers with diffusion models. We train these models with a learning rate of 10^-4 and a batch size of 64, to convergence in terms of algorithmic creativity. We provide a sensitivity analysis over the learning rate in §F.
Details for SEDD (90M) model. We use SEDD's “absorb” variant, which begins denoising with a fully masked sequence and iteratively refines tokens over 128 denoising steps. This variant achieves the best language modeling performance in the original paper. As with GPT-2 (86M), we train these models with a learning rate of 10^-4 and a batch size of 64, to convergence in terms of algorithmic creativity.[3] We provide a sensitivity analysis over the learning rate in §F.

[3] We use the codebase of Lou et al. (2023) at https://github.com/louaaron/Score-Entropy-Discrete-Diffusion.


E. SENSITIVITY ANALYSES FOR Gemma v1 (2B)


In this section, we report that our observations are robust to the choice of various hyper-parameters. First, we present a series of plots for the Gemma v1 (2B) model; each group of plots reports varying one hyperparameter for all the datasets: Fig 11 for train set size, Fig 12 for task complexity, Fig 13 for the weight given to the multi-token objective (and Fig 14 correspondingly for memorization), Fig 15 for learning rates, Fig 16 for number of training steps and Fig 17 for batch size. In §E.1, we report analyses for varying sampling conditions. It is worth noting that the occasional exceptions to our trends generally come from Line Construction, suggesting that this task is the most friendly towards next-token prediction of the four we study.

Note on task complexity. In Fig 12, we report robustness of our results to variations in the task complexity (e.g., degree, path length, etc.). Note that the variations we have explored are within reasonable factors. If we vastly increase certain factors (e.g., increase the degree of the vertices), we expect learning to become either highly trivial or non-trivial (see §C for some reasoning). Besides, as discussed in the main paper, teacherless training is a hard objective to optimize, especially for smaller models; thus, we expect increasing the task complexity beyond a point to hurt the teacherless model for a fixed model size (crucially, for optimization reasons, not generalization reasons).
[Figure 11 panels: algorithmic creativity vs. number of training datapoints (Num Train Data) for Sibling Discovery, Triangle Discovery, Circle Construction, and Line Construction, under standard (next-token) vs. teacherless (multi-token) training.]

Figure 11. Training size and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity increases under multi-token prediction across various training set sizes. Note though that, in our examples, we expect the gap to diminish eventually with sufficiently many training datapoints (this is unlike the failure of next-token prediction in B&N’24).

[Figure 12 panels: algorithmic creativity across task-complexity settings (numbers of parents/children and reversed ordering for Sibling Discovery; number of vertices, degree, and triangle count for Triangle Discovery; vocabulary size and length for Circle and Line Construction).]

Figure 12. Task complexity and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity increases under multi-token
prediction across (reasonable) variations in the dataset parameters (as described in §C).

E.1. Varying sampling methods


Fig 18, Fig 19, and Fig 20 report creativity, memorization and coherence (i.e., fraction of generated strings that are coherent)
for various sampling methods (greedy decoding and nucleus sampling) with various prefix conditionings (namely, null,
pause and hash).
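For reference, the sketch below shows how one such inference condition could be instantiated with a Hugging Face-style generate() call; the prefix construction (hash/pause/null) and the decoding settings are illustrative assumptions rather than the exact evaluation script.

import random
import string
import torch

def build_prefix(kind, length=10):
    """'hash': random uppercase hash; 'pause': repeated pause tokens;
    'null': no extra prefix. (The pause-token string is an illustrative name.)"""
    if kind == "hash":
        return "".join(random.choice(string.ascii_uppercase) for _ in range(length)) + " "
    if kind == "pause":
        return "<pause> " * length
    return ""

@torch.no_grad()
def sample(model, tokenizer, task_prompt, kind="hash",
           temperature=2.0, max_new_tokens=32):
    ids = tokenizer(build_prefix(kind) + task_prompt, return_tensors="pt").input_ids
    out = model.generate(
        input_ids=ids,
        do_sample=temperature > 0,        # temperature == 0 -> greedy decoding
        temperature=temperature if temperature > 0 else 1.0,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)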


[Figure 13 panels: algorithmic creativity vs. the weight given to the teacherless multi-token objective (NTP baseline and weights 0.10–0.90) for the four tasks.]

Figure 13. Weight given to multi-token objective and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity increases
under multi-token prediction across various weights given to the multi-token component of the objective, barring some deviations for
Line Construction.

[Figure 14 panels: memorization vs. the weight given to the teacherless multi-token objective for the four tasks.]

Figure 14. Weight given to multi-token objective and memorization score for Gemma v1 (2B): Memorization reduces under multi-
token prediction across various weights given to the multi-token component of the objective.

[Figure 15 panels: algorithmic creativity vs. learning rate (5e-3 to 1e-5) for the four tasks.]

Figure 15. Learning rate and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity increases under multi-token prediction
across various learning rates.


[Figure 16 panels: algorithmic creativity vs. number of training steps for the four tasks.]

Figure 16. Training steps and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity under multi-token prediction across
lengths of training.

[Figure 17 panels: algorithmic creativity vs. model sequence length (effective batch size) for the four tasks.]

Figure 17. Batch size and algorithmic creativity for Gemma v1 (2B): Algorithmic creativity increases under multi-token prediction
across various batch sizes. Note that here batch size is effectively proportional to the model sequence length, since we pack multiple
finetuning examples into the sequence.


[Figure 18 panels: algorithmic creativity for each task under various inference types (hash prefixes of length 10/4/2, null, and pause-10 conditioning, each with greedy decoding or temperature 2.0/5.0 sampling).]

Figure 18. Algorithmic creativity under various sampling conditions for Gemma v1 (2B): Across all conditions, and in almost all
datasets (with a few exceptions in Line Construction), multi-token prediction improves creativity. Furthermore, hash-conditioning
achieves best algorithmic creativity, with a longer hash helping more.


[Figure 19 panels: memorization for each task under the same inference types as Figure 18.]

Figure 19. Memorization under various sampling conditions for Gemma v1 (2B): Barring a few conditions, the most prominent trend
is that memorization reduces under multi-token prediction for various sampling conditions. Observe that the null and pause-conditioned
models do produce some memorized output while their creativity was non-existent.


[Figure 20: panels Sibling Discovery, Triangle Discovery, Circle Construction, Line Construction; y-axis: Coherence; x-axis: Inference Type (hash10/hash4/hash2/null/pause10 with greedy or temperature 2.0/5.0 sampling)]

Figure 20. Coherence under various sampling conditions for Gemma v1 (2B): Surprisingly, the coherence of all models is high, or at least noticeable, across various sampling conditions. This suggests that the low algorithmic creativity of the null-conditioned models in the previous plots arises from the model collapsing to a single original point.


F. ADDITIONAL EXPERIMENTS ON SEDD (90M) VS. GPT-2 (86M)


F.1. Ablation studies
In this section, we first provide additional ablation studies for SEDD (90M) vs. GPT-2 (86M) with different training and dataset settings (Fig 21 and Fig 22).

[Figure 21: legend: Standard (Next-Token), Teacherless (Multi-Token), Diffusion (Multi-Token); panels Sibling, Triangle, Circle, Line; y-axis: Creativity; x-axis: Learning Rate (5e-4 to 1e-5)]

Figure 21. Learning rates and algorithmic creativity for the SEDD (90M) model vs. GPT-2 (86M): MTP achieves higher algorithmic creativity than NTP when both are trained at their optimal learning rates.

F.2. Effect of hash string length


We provide an ablation study on the hash string length for NTP vs MTP on the Sibling Discovery task (Fig 23). We see
that longer hash strings lead to higher algorithmic creativity.


[Figure 22: panels Sibling (x-axis: Num Train Data; Task Complexity), Triangle (x-axis: Num Nodes; Triangle Density), Circle and Line (x-axis: Task Complexity / Num Train Data); y-axis: Creativity]

Figure 22. Task complexity and algorithmic creativity of SEDD (90M) model vs. GPT-2 (86M): MTP consistently outperforms NTP under varying task configurations, with some exceptions in the Line Construction and Circle Construction datasets.

F.3. Format sensitivity for Triangle Discovery

Recall that our input format for Triangle Discovery follows the edge list representation of triangles (§C, Fig. 10). For instance, triangle ABC is represented as AB, BC, CA. This format explicitly lists the edges of the triangle, making it easier for the model to attend to edge-level patterns during learning.
We also experimented with an alternative node-based representation, where triangle ABC is represented more compactly as ABC, without making the edges explicit. We note in Fig 24 that the models are curiously sensitive to the way the triangles are formatted: models trained on the node-based format perform equally badly with all training objectives, while the diffusion model outperforms NTP by a large margin with the edge list representation.
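For concreteness, a minimal sketch of the two formats (the exact token separators in our data may differ):

def edge_rep(triangle):
    """Edge-list format: triangle ('A', 'B', 'C') -> 'AB, BC, CA'."""
    a, b, c = triangle
    return f"{a}{b}, {b}{c}, {c}{a}"

def node_rep(triangle):
    """Node-based format: triangle ('A', 'B', 'C') -> 'ABC'."""
    return "".join(triangle)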

[Figure 23: panel Sibling; y-axis: Creativity; x-axis: Hash Len (hash4, hash10); legend: Standard (Next-Token), Teacherless (Multi-Token)]
Figure 23. GPT-2 (86M) Transformer achieves higher algorithmic creativity with longer hash strings. We report algorithmic creativity with hash strings of length 4 and 10, with both NTP and teacherless MTP.

[Figure 24: legend: Standard (Next-Token), Teacherless (Multi-Token), Diffusion (Multi-Token); panel Triangle; y-axis: Creativity; x-axis: Format / Num Nodes (Edge rep 999/500, Node rep 999/500)]

Figure 24. Sensitivity to formatting of the sequence in Triangle Discovery: We find that all our small models perform equally
poorly with a node-wise representation of the input sequence, whereas there was a stark difference in performance with the edge-wise
representation.

G. ADDITIONAL EXPERIMENTS WITH MEDIUM-SIZED TRANSFORMER AND SEDD


We replicate our SEDD (90M) and GPT-2 (86M) experiments on a larger model size (∼400M parameters). In Fig 25, we
see similar trends to the smaller model sizes (Fig 4).

H. DECOMPOSING CREATIVITY
Through the following experiments on the GPT-2 (86M) model in the Sibling Discovery task, we try to understand the dynamics between two important quantities that affect algorithmic creativity: diversity/duplication and originality/memorization.

H.1. Diversity score


Equation (1) defines our algorithmic creativity by rewarding samples that are both unique and novel. A higher score
can be achieved either by enhancing diversity or by reducing memorization. In the following section, we examine this
decomposition using the Sibling Discovery task. Formally, we define the diversity score as:

\[
\hat{\mathrm{dv}}_N(T) \;=\; \frac{\mathrm{uniq}\left(\{s \in T \mid \mathrm{coh}(s)\}\right)}{|T|}. \tag{4}
\]
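As a minimal sketch, assuming a hypothetical task-specific coherence checker coh and a list T of generated strings:

def diversity_score(T, coh):
    """Eq. (4): fraction of generations that are coherent and unique within T."""
    coherent_unique = {s for s in T if coh(s)}  # uniq({s in T | coh(s)})
    return len(coherent_unique) / len(T) if T else 0.0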

We first demonstrate that creativity and diversity are not necessarily correlated, and next, that MTP particularly improves creativity (while achieving lower diversity than NTP). To show this, we report the algorithmic creativity and diversity scores along training in Fig 26. We see that for NTP, the diversity score keeps increasing and stays high, while algorithmic creativity increases in the first 10k steps and then starts to decrease. For teacherless training, both scores increase throughout training. While the diversity of MTP is surprisingly lower than that of NTP throughout training, the creativity of MTP surpasses NTP at 20k steps.

[Figure 25: legend: Standard (Next-Token), Teacherless (Multi-Token), Diffusion (Multi-Token); rows: Creativity and Memorization; panels Sibling Discovery, Triangle Discovery, Circle Construction, Line Construction]
Figure 25. On a medium-sized (∼400M) model, multi-token diffusion training improves algorithmic creativity from Eq 1 (top) on our four open-ended algorithmic tasks.

[Figure 26: Creativity and Diversity vs. Num Train Step (0 to 40k); legend: Standard (Next-Token), Teacherless (Multi-Token)]

Figure 26. Algorithmic creativity and diversity are not necessarily correlated, exhibiting distinct dynamics: We find that NTP has a high diversity score through training, even higher than MTP. However, its algorithmic creativity reaches only a mediocre peak before descending, at which point MTP starts surpassing it.

H.2. Decomposing algorithmic creativity as diversity and memorization


Better creativity can be achieved either by enhancing diversity or by reducing memorization – we try to disentangle these
factors in this section. In Fig 27, we plot the algorithmic creativity, diversity, and memorization scores at the checkpoint
of best algorithmic creativity. We see that hash-conditioning contributes to higher diversity but does little to bring down
memorization; however, teacherless training contributes to higher diversity and also to reducing memorization. In Fig 26,
we see that the best creativity and best diversity are not achieved at the same checkpoint.

H.3. Data scaling for algorithmic creativity


How does algorithmic creativity change as we increase the amount of training data? Intuitively, more training data helps the
model learn the true distribution, but also makes it harder to generate unseen samples (since the uncovered space becomes
rarer). To understand this, we plot how models perform relative to a theoretically expected maximum algorithmic creativity.


[Figure 27: panels Creativity, Diversity, Memorization for Sibling; top row x-axis: Hash Len / Num Train Data (hash4/hash10, 50000/400000), legend: Standard (Next-Token), Teacherless (Multi-Token); bottom row x-axis: Objective / temperature (NTP, temp 0.5/1.0/2.0), legend: Prefix null vs. hash]

Figure 27. Decomposition of algorithmic creativity for GPT-2 (86M) in Sibling Discovery: We report algorithmic creativity,
diversity and memorization at the checkpoint of best algorithmic creativity. We see that hash-conditioning contributes to higher diversity
but does not help bring down memorization; teacherless training helps both diversity and in bringing down memorization.

[Figure 28: panels Sibling Creativity and Sibling Diversity vs. Num Train Data (1x to 32x); legend: Standard (Next-Token), Teacherless (Multi-Token), Theoretically Expected]

Figure 28. Data scaling curve for algorithmic creativity and diversity: As we increase the training data (for a fixed underlying graph), the theoretically expected maximum algorithmic creativity decreases as expected, while the theoretically expected maximum diversity stays the same. NTP fails to achieve the theoretically expected algorithmic creativity, while MTP almost achieves the theoretically expected performance at scale.

This is computed by assuming an oracle that samples a generated set T (in Eq. (1)) uniformly with replacement from the true underlying distribution, and then computing the algorithmic creativity of Eq. (1). In Fig 28, we see that as we increase the training data (for a fixed underlying graph), the theoretically expected creativity decreases as expected, while the theoretically expected diversity stays the same (since this quantity does not care about being original with respect to the training set). Interestingly, as training data increases, MTP narrows the gap to the theoretically expected creativity and almost achieves the theoretically expected performance in the high-data regime.
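For illustration, a minimal Monte-Carlo sketch of this oracle baseline; we assume true_support enumerates the valid outputs under the task rule and, as a simplification of Eq. (1), credit samples that are unique within T and unseen in training:

import random

def expected_max_creativity(true_support, train_set, T_size, n_trials=1000):
    """Oracle baseline: sample T uniformly with replacement from the true
    distribution and score the fraction of samples that are unique within T
    and absent from the training set."""
    train = set(train_set)
    scores = []
    for _ in range(n_trials):
        T = random.choices(true_support, k=T_size)       # uniform, with replacement
        novel_unique = {s for s in T if s not in train}   # unique and unseen
        scores.append(len(novel_unique) / T_size)
    return sum(scores) / n_trials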


I. EXPERIMENTS ON SUMMARIZATION
Experimental Details. In Table 2, we provide the hyperparameter details for the GPT models finetuned on both XSUM (Narayan et al., 2018) and CNN/DailyMail (Nallapati et al., 2016) for one epoch. We use a learning rate with linear warm-up for 0.05 of the total steps, followed by linear decay to 0. To measure Rouge and Self-Bleu, we generate and average across 5 summarizations per document, on a test dataset T of 250 datapoints. We finetune our models with either the NTP objective (Eq 2) or the teacherless MTP objective (Eq 2), with equal weight to both.
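A minimal sketch of this warm-up/decay schedule, assuming a PyTorch optimizer (the exact scheduler implementation used in our runs is not spelled out here):

from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_then_decay(optimizer, total_steps, warmup_frac=0.05):
    """Linearly warm up for `warmup_frac` of training, then decay linearly to 0."""
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return LambdaLR(optimizer, lr_lambda)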

Table 2. Hyperparameter details for summarization experiments.

Hyperparameter        XSUM        CNN/DailyMail

Batch Size            32          32
Max. Learning Rate    5 × 10^-5   3 × 10^-6
Warmup Steps          338         124
Training Steps        7778        2486
Training Size         248906      79552

To measure quality, we compute the average of Rouge-1, Rouge-2, and Rouge-L, reported as Rouge. To measure diversity, we generate five different summaries per test example and compute Self-Bleu, i.e., the average pairwise sentence Bleu-2 score with weights (0.5, 0.5, 0, 0) on unigrams and bigrams.
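A minimal sketch of this Self-Bleu computation (we assume whitespace tokenization for illustration; the exact tokenizer may differ):

from itertools import permutations
from nltk.translate.bleu_score import sentence_bleu

def self_bleu(summaries):
    """Average pairwise sentence BLEU-2 over summaries of the same document,
    using weights (0.5, 0.5, 0, 0) on unigrams and bigrams."""
    weights = (0.5, 0.5, 0.0, 0.0)
    scores = [
        sentence_bleu([ref.split()], hyp.split(), weights=weights)
        for hyp, ref in permutations(summaries, 2)
    ]
    return sum(scores) / len(scores)

Diversity is then reported as 1 - Self-Bleu.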

I.1. Additional graphs for effect of multi-token training


Fig 29 shows the diversity and quality graphs for the smaller-sized GPT-2 models on XSUM, and Fig 30 for CNN/DailyMail. While we consistently see improved quality from the multi-token model across the board, we no longer see increased diversity at fixed Rouge scores.

[Figure 29: legend: Standard (Next-Token), Teacherless (Multi-Token); panels XSUM-GPT-MEDIUM, XSUM-GPT-SMALL; y-axis: Diversity (1 - Self-BLEU); x-axis: Quality (ROUGE)]

Figure 29. Multi-Token Objective has no effect on diversity for smaller GPT models on XSUM.

I.2. Effect of hash-conditioning


We also conducted hash-conditioning experiments as described in §3.1. The hash strings we use are 10 randomly sampled uppercase characters from the English alphabet. We report the quality-diversity plots in Fig 31 (for next-token prediction on XSUM) and Fig 32 (for multi-token prediction on XSUM). Here too, we do not find any changes in diversity, perhaps because this is not a sufficiently open-ended task.
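A minimal sketch of how such hash prefixes can be sampled (the exact way the hash is spliced into the prompt is an assumption here):

import random
import string

def sample_hash(length=10):
    """Sample a hash string of random uppercase English letters."""
    return "".join(random.choices(string.ascii_uppercase, k=length))

# e.g., condition each generation on a fresh prefix:
# prompt = sample_hash() + " " + document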


[Figure 30: legend: Standard (Next-Token), Teacherless (Multi-Token); panels CNN-GPT-XL, CNN-GPT-LARGE, CNN-GPT-MEDIUM, CNN-GPT-SMALL; y-axis: Diversity (1 - Self-BLEU); x-axis: Quality (ROUGE)]

Figure 30. Multi-Token Objective increases diversity for GPT-L and GPT-M but not for GPT-XL or GPT-S on CNN/DailyMail.

[Figure 31: legend: Null (Next-Token), Hash (Next-Token); panels XSUM-GPT-XL, XSUM-GPT-LARGE, XSUM-GPT-MEDIUM, XSUM-GPT-SMALL; y-axis: Diversity (1 - Self-BLEU); x-axis: ROUGE]

Figure 31. Hash-conditioning has no effect on diversity for GPT models on XSUM summarization with next-token prediction.

J. MORE RELATED WORKS


Empirical studies of creativity in LLMs. There is a long line of recent works that measure the novelty and creativity of LLMs and LLM-assisted users. Chakrabarty et al. (2024) and Lu et al. (2024b) quantitatively evaluate models against human writers and report that they vastly underperform under expert human evaluation. Zhang et al. (2024a) argue that finetuning methods such as RLHF and DPO are limited when applied to creative humor-generation tasks. Likewise, models like GPT-4 and Claude currently underperform top human contestants in generating humorous captions. In poetry, Walsh et al. argue that there are certain characteristic styles that ChatGPT restricts itself to. Even assisted writing can reduce diversity (Padmakumar & He, 2024) or produce bland writing (Mirowski et al., 2024).


[Figure 32: legend: Null (Multi-Token), Hash (Multi-Token); panels XSUM-GPT-XL, XSUM-GPT-LARGE, XSUM-GPT-MEDIUM, XSUM-GPT-SMALL; y-axis: Diversity (1 - Self-BLEU); x-axis: ROUGE]

Figure 32. Hash-conditioning has no effect on diversity for GPT models on XSUM summarization with multi-token prediction.

On the positive side, Si et al. (2024) report that LLMs surprisingly generate novel research ideas, although these are less feasible. Anderson et al. (2024) find that users tend to produce more divergent ideas when assisted by ChatGPT (although at a group level, ideas tend to homogenize). Another line of works (Wang et al., 2024a; Talmor et al., 2020; Zhong et al., 2024) has proposed algorithmic improvements involving creative leaps-of-thought for real-world tasks.
Other studies have proposed benchmarks for evaluating creativity. AidanBench (McLaughlin et al., 2024) and NoveltyBench
(Zhang et al., 2025) evaluate LMs on their ability to produce diverse and coherent responses by penalizing repetition across
generations. However, they do not measure originality relative to training data, leaving open whether outputs are genuinely
novel or simply unseen paraphrases/recombinations. Zhao et al. (2024) evaluate LM creativity using the Torrance Tests
of Creative Thinking, a standard in human psychometrics. Another line of work, including Alchemy (Wang et al., 2021), IVRE (Xu et al., 2022), and DiscoveryWorld (Jansen et al., 2024), presents simulations with hidden facts and rules, requiring
LMs to explore, hypothesize, and test through interaction. While these simulations focus on pretrained models rather than
examining how training shapes creative capabilities, they serve as valuable and realistic benchmarks for assessing the role of
creativity in scientific discovery.
Finally, we refer the reader to Franceschelli & Musolesi (2023) for a rigorous treatment of philosophical questions
surrounding creativity in LLMs. We also refer to Wang et al. (2024b) for a theoretical treatment of how to formalize subjectivity in creativity.

The next-token prediction debate. In support of next-token prediction, there are arguments (Shannon, 1948; 1951;
Alabdulmohsin et al., 2024) that claim that language is captured by NTP, with models even surpassing humans (Shlegeris et al., 2022) at NTP. There are also theoretical results emphasizing the expressivity (Merrill & Sabharwal, 2024; Feng et al.,
2023) and learnability (Malach, 2023; Wies et al., 2023) of autoregressive Transformers as long as there is a sufficiently
long chain of thought.

Multi-token training. A variety of training methods go beyond next-token prediction; while they employ diverse strategies, a common feature is their reliance on multi-token objectives that capture broader dependencies across entire sequences. Representative examples include teacherless training (Bachmann & Nagarajan, 2024; Monea et al., 2023; Tschannen et al., 2023), independent output heads or modules (Gloeckle et al., 2024; DeepSeek-AI et al., 2024), and lookahead attention (Du et al., 2023). Another line of
research is discrete diffusion models (Hoogeboom et al., 2021; Austin et al., 2021; Gong et al., 2023; Lou et al., 2023),
which avoid strict left-to-right factorization by iteratively refining an entire sequence at multiple positions. There are other


models as well, such as energy-based models (Dawid & LeCun, 2023) and non-autoregressive models (Gu et al., 2018).

Transformers and graph algorithmic tasks. Graph tasks have been used to understand various limitations of Transformers
in orthogonal settings. Bachmann & Nagarajan (2024) and Saparov et al. (2024) report that Transformers are limited in learning search tasks on graphs, while Sanford et al. (2024) provide positive expressivity results for a range of algorithmic tasks that process a graph. These works differ from our study of combinational creativity since their graphs are provided
in-context and the tasks have a unique answer. Other works (Schnitzler et al.; Yang et al., 2024a;b) study multi-hop question
answering on a knowledge graph; however, this does not require planning.

Diversity of generative models. One line of work relevant to us in the history of generative models is the RNN-based VAE for text data (Bowman et al., 2016). The motivation, like in our work, was to learn high-level semantic features rather than next-token features with the hope of producing more novel sentences. However, this approach suffered from posterior collapse, where the model ignores the latent variable altogether, inspiring various solutions (Yang et al., 2017; Goyal et al., 2017). Our results on hash-conditioning are also reminiscent of a line of work on exploration in reinforcement learning (RL), where it has been shown that adding noise to the policy model parameters enables more efficient exploration than directly adding noise to the output space (Plappert et al., 2017; Fortunato et al., 2017).

Learning-theoretic studies of diversity in LLMs. Various theoretical works provide rigorous arguments for how
preventing hallucination and maximizing the model’s coverage are at odds with each other in abstract settings (Kalai &
Vempala, 2024; Kalavasis et al., 2024; Kleinberg & Mullainathan, 2024). We clarify that this tension does not apply in our
concrete settings: in those abstract settings, the strings in the support can be arbitrary and adversarially chosen, whereas our strings are generated by a simple rule (which can be learned).
Another theoretical question underlying generative models is why they tend to produce novel examples even though the optimum of their objectives is attained at perfect memorization; this question has been posed for GANs in Nagarajan et al. (2018) and for diffusion in Nakkiran et al. (2024) (see "remarks on generalization") and Kamb & Ganguli (2024). Most relevant to us, Kamb & Ganguli (2024) provide a theoretical and empirical argument for how image diffusion models are able to generate combinatorially many creative outputs; their tasks, however, do not require the type of planning ours do.

