The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Plan-and-Write: Towards Better Automatic Storytelling

Lili Yao,1,3∗ Nanyun Peng,2∗ Ralph Weischedel,2 Kevin Knight,2 Dongyan Zhao,1 Rui Yan1†
liliyao@[Link], {npeng,weisched,knight}@[Link]
{zhaodongyan,ruiyan}@[Link]
1 Institute of Computer Science and Technology, Peking University
2 Information Sciences Institute, University of Southern California
3 Tencent AI Lab

Abstract

Automatic storytelling is challenging since it requires generating long, coherent natural language to describe a sensible sequence of events. Despite considerable efforts on automatic story generation in the past, prior work either is restricted in plot planning, or can only generate stories in a narrow domain. In this paper, we explore open-domain story generation that writes stories given a title (topic) as input. We propose a plan-and-write hierarchical generation framework that first plans a storyline, and then generates a story based on the storyline. We compare two planning strategies. The dynamic schema interweaves story planning and its surface realization in text, while the static schema plans out the entire storyline before generating stories. Experiments show that with explicit storyline planning, the generated stories are more diverse, coherent, and on topic than those generated without creating a full plan, according to both automatic and human evaluations.

Title (Given)          The Bike Accident
Storyline (Extracted)  Carrie → bike → sneak → nervous → leg
Story (Human Written)  Carrie had just learned how to ride a bike. She didn't have a bike of her own. Carrie would sneak rides on her sister's bike. She got nervous on a hill and crashed into a wall. The bike frame bent and Carrie got a deep gash on her leg.

Table 1: An example of title, storyline and story in our system. A storyline is represented by an ordered list of words.

∗ Equal contribution: Lili Yao and Nanyun Peng
† Corresponding author: Rui Yan (ruiyan@[Link])
Copyright © 2019, Association for the Advancement of Artificial Intelligence ([Link]). All rights reserved.

Introduction

A narrative or story is anything which is told in the form of a causally/logically linked set of events involving some shared characters (Mostafazadeh et al. 2016a). Automatic storytelling requires composing coherent natural language texts that describe a sensible sequence of events. This seems much harder than text generation where a plan or knowledge fragment already exists. Thus, story generation seems an ideal testbed for advances in general AI. Prior research on story generation mostly focused on automatically composing a sequence of events that can be told as a story by plot planning (Lebowitz 1987; Perez and Sharples 2001; Porteous and Cavazza 2009; Riedl and Young 2010; Li et al. 2013) or case-based reasoning (Turner 1994; Gervas et al. 2005). These approaches rely heavily on human annotation and/or are restricted to limited domains. Moreover, most prior work is restricted to the abstract story representation level without surface realization in natural language.

In this paper, we study generating natural language stories from any given title (topic). Inspired by prior work on dialog planning (Nayak et al. 2017) and narrative planning (Riedl and Young 2010), we propose to decompose story generation into two steps: 1) story planning, which generates plots, and 2) surface realization, which composes natural language text based on the plots. We propose a plan-and-write hierarchical generation framework that combines plot planning and surface realization to generate stories from titles.

One major challenge for our framework is how to represent and obtain annotations for story plots so that a reasonable generative model can be trained to plan story plots. Li et al. [2013] introduce plot graphs, which contain events and their relations, to represent a storyline. Plot graphs are comprehensive representations of story plots; however, the definition and curation of such plot graphs require highly specialized knowledge and significant human effort. On the other hand, in poetry composition, Wang et al. [2016] provide a sequence of words to guide poetry generation. In conversational systems, Mou et al. [2016] take keywords as the main gist of the reply to guide response generation.

We take a similar approach to represent a story plot with a sequence of words. Specifically, we use the order in which the words appear in the story to approximate a storyline. Table 1 shows an example of the title, storyline, and story.
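To make the word-per-sentence storyline representation concrete, the sketch below pulls one salient word out of each sentence of the Table 1 story. It is only a simplified stand-in for the RAKE-based extraction described later under Storyline Preparation; the stopword list and the degree/frequency scoring are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative only: a degree/frequency keyword scorer in the spirit of RAKE
# (Rose et al. 2010). The stopword list below is a hypothetical minimal one.
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "on", "in", "to", "her", "his", "she", "he",
             "had", "have", "would", "got", "and", "into", "just", "how", "own",
             "didn", "t", "s", "was", "were", "did", "not"}

def extract_storyline(story):
    """Return one salient word per sentence of `story`."""
    sentences = [s for s in re.split(r"[.!?]", story.lower()) if s.strip()]
    freq, degree, per_sentence_words = defaultdict(int), defaultdict(int), []
    for sent in sentences:
        words = re.findall(r"[a-z]+", sent)
        per_sentence_words.append(words)
        # Candidate phrases are maximal runs of non-stopwords; score(w) = degree(w) / freq(w).
        phrase = []
        for w in words + ["."]:            # sentinel flushes the last phrase
            if w in STOPWORDS or w == ".":
                for p in phrase:
                    freq[p] += 1
                    degree[p] += len(phrase)
                phrase = []
            else:
                phrase.append(w)
    score = {w: degree[w] / freq[w] for w in freq}
    storyline = []
    for words in per_sentence_words:
        candidates = [w for w in words if w not in STOPWORDS]
        if candidates:
            storyline.append(max(candidates, key=lambda w: score.get(w, 0.0)))
    return storyline

story = ("Carrie had just learned how to ride a bike. She didn't have a bike of her own. "
         "Carrie would sneak rides on her sister's bike. She got nervous on a hill and "
         "crashed into a wall. The bike frame bent and Carrie got a deep gash on her leg.")
print(extract_storyline(story))   # five words, one per sentence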
Though this representation seems to over-simplify story plots, it has several advantages. First, because the storyline representation is simple, there are many reliable tools to extract high-quality storylines from existing stories and thus automatically generate training data for the plot planning model. Our experiments show that by training plot planning models on automatically extracted storylines, we can generate better stories without additional human annotation. Moreover, with this simple and interpretable storyline representation, it is possible to compare the efficiency of different plan-and-write strategies. Specifically, we explore two paradigms that seem to mimic human practice in real-world story writing¹ (Alarcon 2010). The dynamic schema adjusts the plot improvisationally while writing progresses. The static schema plans the entire plot before writing. We summarize the contributions of the paper as follows:

• We propose a plan-and-write framework that leverages storylines to improve the diversity and coherence of the generated story. Two strategies, dynamic and static planning, are explored and compared under this framework.
• We develop evaluation metrics to measure the diversity of the generated stories, and conduct novel analysis to examine the importance of different aspects of stories for human evaluation.
• Experiments show that the proposed plan-and-write model generates more diverse, coherent, and on-topic stories than those without planning².

Plan-and-Write Storytelling

In this paper, we propose a plan-and-write framework to generate stories from given titles. We posit that storytelling systems can benefit from storyline planning to generate more coherent and on-topic stories. An additional benefit of the plan-and-write schema is that human and computer can interact and collaborate on the (abstract) storyline level, which can enable many potentially enjoyable interactions. We formally define the input, output, and storyline of our approach as follows.

Problem Formulation

Input: A title t = {t_1, t_2, ..., t_n} is given to the system to constrain writing, where t_i is the i-th word in the title.
Output: The system generates a story s = {s_1, s_2, ..., s_m} based on a title, where s_i denotes a sentence in the story.
Storyline: The system plans a storyline l = {l_1, l_2, ..., l_m} as an intermediate step to represent the plot of a story. We use a sequence of words to represent a storyline; therefore, l_i denotes a word in a storyline.

Given a title, the plan-and-write framework always plans a storyline. We explore two variations of this framework: the dynamic and the static schema.

Storyline Preparation

To obtain training data for the storyline planner, we extract sequences of words from existing story corpora to compose storylines. Specifically, we extract one word from each sentence of a story to form a storyline³. We adopt the RAKE algorithm (Rose et al. 2010), which combines several word-frequency-based and graph-based metrics to weight the importance of the words. We extract the most important word from each sentence as a story's storyline.

Methods

We adopt neural generation models to implement our plan-and-write framework, as they have been shown effective in many text generation tasks such as machine translation (Bahdanau, Cho, and Bengio 2015) and dialogue systems (Shang, Lu, and Li 2015). Figure 1 demonstrates the workflow of our framework. We now describe the two plan-and-write strategies we explored.

Dynamic Schema

The dynamic schema emphasizes flexibility. As shown in Figure 2a, it generates the next word in the storyline and the next sentence in the story at each step. In both cases, the existing storyline and previously generated sentences are given to the model to move one step forward.

Storyline Planning. The storyline is planned out based on the context (the title and previously generated sentences are taken as context) and the previous word in the storyline. We formulate it as a content-introducing generation problem, where the new content (the next word in the storyline) is generated based on the context and some additional information (the most recent word in the storyline). Formally, let ctx = [t, s_{1:i-1}] denote the context, where s_{1:i-1} denotes the first i-1 sentences in the story. We model p(l_i | ctx, l_{i-1}; θ).

We implement the content-introducing method proposed by Yao et al. [2017], which first encodes the context into a vector using a bidirectional gated recurrent unit (BiGRU), and then incorporates the auxiliary information, in this case the previous word in the storyline, into the decoding process. Formally, the hidden vectors for the context are computed as h̃_ctx = Encode_ctx(ctx) = [h_ctx^→ ; h_ctx^←], where h_ctx^→ and h_ctx^← are the hidden vectors produced by a forward and a backward GRU, respectively, and [ ; ] denotes element-wise concatenation. The conditional probability is computed as:

    h_y = GRU(BOS, C_att),        h_w = GRU(l_{i-1}, C_att)
    h'_y = tanh(W_1 h_y),         h'_w = tanh(W_2 h_w)
    k = σ(W_k [h'_y ; h'_w])
    p(l_i | ctx, l_{i-1}) = g(k ∘ h_y + (1 - k) ∘ h_w)

BOS denotes the beginning of decoding, C_att represents the attention-based context computed from h̃_ctx, and g(·) denotes a multilayer perceptron (MLP).

Story Generation. The story is generated incrementally by planning and writing alternately. We formulate it as another content-introducing generation problem which generates a story sentence based on both the context and an additional storyline word as a cue. The model structure is exactly the same as for storyline generation. However, there are two differences between storyline and story generation. On one hand, the former aims to generate a word while the latter generates a variable-length sequence. On the other hand, the auxiliary information they use is different.

¹ Some discussions on Quora: [Link] How-many-times-does-a-writer-edit-his-first-draft
² Code and appendix will be available at [Link] VioletPeng/language-model
³ For this pilot study, we assume each word l_i in a storyline corresponds to a sentence s_i in a story.
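To make the gated fusion step above concrete, here is a minimal PyTorch sketch of the displayed equations. It is our own rendering, not the authors' released code: the module names and dimensions are illustrative, the output layer is a single linear layer standing in for g(·), and sharing one GRU cell between the BOS branch and the cue-word branch is an assumption the paper does not spell out.

```python
# A minimal PyTorch sketch of the content-introducing gated fusion; names, dimensions,
# and the shared GRU cell are our assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class ContentIntroducingStep(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRUCell(emb_dim, hid_dim)       # shared by the BOS and cue branches
        self.w1 = nn.Linear(hid_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, hid_dim)
        self.wk = nn.Linear(2 * hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)     # stands in for the MLP g(.)

    def forward(self, bos_id, prev_word_id, c_att):
        # c_att: (batch, hid_dim) attention-based context computed from the BiGRU encoder.
        h_y = self.gru(self.embed(bos_id), c_att)           # h_y = GRU(BOS, C_att)
        h_w = self.gru(self.embed(prev_word_id), c_att)     # h_w = GRU(l_{i-1}, C_att)
        h_y_p = torch.tanh(self.w1(h_y))                    # h'_y
        h_w_p = torch.tanh(self.w2(h_w))                    # h'_w
        k = torch.sigmoid(self.wk(torch.cat([h_y_p, h_w_p], dim=-1)))
        fused = k * h_y + (1 - k) * h_w                     # gated fusion of the two branches
        return torch.log_softmax(self.out(fused), dim=-1)   # distribution over the next storyline word
```

A decoder would call this step with the BOS token id, the id of the most recent storyline word, and the attended context, then pick the next storyline word from the returned distribution.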
[Figure 1: An overview of our system. A title (e.g., "Spaghetti Sauce") is fed to either the dynamic or the static planning-and-writing pipeline, which produces a generated story (e.g., "Tina made spaghetti for her boyfriend. ...").]

[Figure 2: An illustration of the dynamic (a) and static (b) plan-and-write work-flows. l_i denotes a word in a storyline and s_i denotes a sentence in a story.]

Formally, the model is trained to minimize the negative log-probability of the training data:

    L(θ)_dyna = -(1/N) Σ_{j=1}^{N} [ log Π_{i=1}^{m} p(s_i | ctx, l_i) ]_j        (1)

where N is the number of stories in the training data and m denotes the number of sentences in a story. Given the extracted storylines described in the previous section, the storyline and story generation models are trained separately. End-to-end generation is conducted in a pipeline fashion.

Static Schema

The static schema is inspired by the sketches that writers usually draw before they flesh out the whole story. As illustrated in Figure 2b, it first generates a whole storyline, which does not change during story writing. This sacrifices some flexibility in writing, but could potentially enhance story coherence as it provides "look ahead" for what happens next.

Storyline Planning. Differing from the dynamic schema, storyline planning for the static schema is solely based on the title t. We formulate it as a conditional generation problem, where the probability of generating each word in a storyline depends on the previous words in the storyline and the title. Formally, we model p(l_i | t, l_{1:i-1}; θ). We adopt a sequence-to-sequence (Seq2Seq) conditional generation model that first encodes the title into a vector using a bidirectional long short-term memory network (BiLSTM), and generates words in the storyline using another single-directional LSTM. Formally, the hidden vector h̃ for a title is computed as h̃ = Encode(t) = [h^→ ; h^←], and the conditional probability is given by:

    p(l_i | t, l_{1:i-1}; θ) = g(LSTM_att(h̃, l_{i-1}, h^dec_{i-1}))

where LSTM_att denotes a cell of the LSTM with an attention mechanism (Bahdanau, Cho, and Bengio 2015), h^dec_{i-1} stands for the decoding hidden state, and g(·) again denotes an MLP.

Story Generation. The story is generated after the full storyline is planned. We formulate it as another conditional generation problem. Specifically, we train a Seq2Seq model that encodes both the title and the planned storyline into a low-dimensional vector by first concatenating them with a special symbol <EOT> in between and encoding them with BiLSTMs: h̃_tl = Encode_tl([t, l]) = [h_tl^→ ; h_tl^←]. The Seq2Seq model is then trained to minimize the negative log-probability of the stories in the training data:

    L(θ)_static = -(1/N) Σ_{j=1}^{N} [ log Π_{i=1}^{m} p(s_i | h̃_tl, s_{1:i-1}) ]_j        (2)

Storyline Optimization

One common problem for neural generation models is repetition in the generated results (Li et al. 2016). We initially observe repetition in both the generated storylines (repeated words) and stories (repeated phrases and sentences). An advantage of the storyline layer is that, given the compact and interpretable representation of the storyline, we can easily apply heuristics to reduce repetition⁴. Specifically, we forbid any word to appear twice when generating a storyline.

⁴ It is important to avoid repetition in the generated stories too. However, it is hard to automatically detect repetition in stories. Optimizing storylines can indirectly reduce repetition in stories.
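The "no word appears twice" constraint on storyline decoding is easy to state in code. The sketch below assumes a greedy decoder with a hypothetical single-step interface (`step_fn` is not from the paper); the only point it illustrates is masking previously emitted storyline words, with max_len = 5 matching the five-sentence stories.

```python
# Sketch of the storyline de-duplication heuristic during greedy decoding.
# `step_fn` is a hypothetical single-step decoder: (prev_id, state) -> (log_probs, state).
import torch

def decode_storyline(step_fn, bos_id, eos_id, max_len=5):
    state, prev, storyline = None, bos_id, []
    for _ in range(max_len):
        log_probs, state = step_fn(prev, state)      # 1-D tensor over the vocabulary
        for used in storyline:
            log_probs[used] = float("-inf")          # forbid already-emitted storyline words
        prev = int(torch.argmax(log_probs))
        if prev == eos_id:
            break
        storyline.append(prev)
    return storyline
```

The same masking idea carries over to beam search by setting the banned vocabulary entries to -inf in every beam's score vector before expansion.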
Number of stories          98,161
Vocabulary size            33,215
Average number of words    50

Table 2: Statistics of the ROCStories dataset.

Experimental Setup

Dataset
We conduct the experiments on the ROCStories corpus (Mostafazadeh et al. 2016a). It contains 98,162 short commonsense stories as training data, and an additional 1,817 stories each for development and test. The stories in the corpus are five-sentence stories that capture a rich set of causal and temporal commonsense relations between daily events, making them a good resource for training storytelling models. Table 2 shows the statistics of the ROCStories dataset. Since only the training set of the ROCStories corpus contains titles, which we need as input, we split the original training data [Link] for training, validation, and testing.

Baselines
To evaluate the effectiveness of the plan-and-write framework, we compare our methods against representative baselines without a planning module.

Inc-S2S denotes the incremental sentence-to-sentence generation baseline, which creates stories by generating the first sentence from a given title, then generating the i-th sentence from the title and the previously generated i-1 sentences. This resembles the dynamic schema without planning. We use a Seq2Seq model with attention (Bahdanau, Cho, and Bengio 2015) to implement the Inc-S2S baseline, where the sequence-to-sequence model is trained to generate the next sentence based on the context.

Cond-LM denotes the conditional language model baseline, which straightforwardly generates the whole story word by word from a given title. Again we use a Seq2Seq model with attention as our implementation of the conditional language model, where the sequence-to-sequence model is trained to generate the whole story based on the title. It resembles our static schema without planning.

Hyper-parameters
As all of our baselines and proposed methods are RNN-based conditional generation models, we conduct the same hyper-parameter optimization for each of them. We train all the models using stochastic gradient descent (SGD). For the encoder and decoder in our generation models, we tune the embedding and hidden vector dimensions and the dropout rate by grid search. We randomly initialize the word embeddings and tune the dimensions in the range [100, 200, 300, 500] for storyline generation and [300, 500, 1000] for story generation. We tune the hidden vector dimensions in the range [300, 500, 1000]. The embedding and hidden vector dropout rates are all tuned from 0 to 0.5 in steps of 0.1. We tune all baselines and proposed models based on BLEU scores (Papineni et al. 2002) on the validation set. Details of the best hyper-parameter values for each setting are given in the Appendix.

Evaluation Metrics
Objective metrics. Our goal is generating human-like stories that can pass the Turing test. Therefore, evaluation metrics based on n-gram overlap, such as BLEU, are not suitable for our task⁵. To better gauge the quality of our methods, we design novel automatic evaluation metrics to evaluate the generation results at scale. Since neural generation models are known to suffer from generating repetitive content, our automatic evaluation metrics are designed to quantify diversity across the generated stories. We design two measurements to gauge inter- and intra-story repetition. For each sentence position i, the inter-story repetition rate re^i and the intra-story repetition rate ra^i are computed as follows:

    re^i = 1 - T(Σ_{j=1}^{N} s_i^j) / T_all(Σ_{j=1}^{N} s_i^j)
    ra^i = (1/N) Σ_{j=1}^{N} [ Σ_{k=1}^{i-1} T(s^i ∩ s^k) / ((i-1) · T(s^i)) ]_j        (3)

where T(·) and T_all(·) denote the number of distinct and total trigrams⁶, respectively. s_i^j stands for the i-th sentence in the j-th story; s^i ∩ s^k is the distinct-trigram intersection set between sentences s^i and s^k. Naturally, re^i demonstrates the repetition rate between stories at sentence position i; ra^i embodies the average repetition of sentence s^i compared with the former sentences in a story.

We compute the aggregate scores as follows:

    re^agg = 1 - T(Σ_{j=1}^{N} Σ_{i=1}^{m} s_i^j) / T_all(Σ_{j=1}^{N} Σ_{i=1}^{m} s_i^j)
    ra^agg = (1/m) Σ_{i=1}^{m} ra^i        (4)

where Σ_{j=1}^{N} Σ_{i=1}^{m} s_i^j is the set of N stories with m sentences. In our experiments, we set m = 5. re^agg indicates the overall repetition across all stories.

Subjective metrics. For a creative generation task such as story generation, reliable automatic evaluation metrics to assess aspects such as interestingness and coherence are lacking. Therefore, we rely on human evaluation to assess the quality of generation. We conduct pairwise comparisons: we provide users with two generated stories and ask them to choose the better one. We consider four aspects: fidelity (whether the story is on-topic with the given title), coherence (whether the story is logically consistent and coherent), interestingness (whether the story is interesting), and overall user preference (how much users like the story). All surveys were collected on Amazon Mechanical Turk (AMT).

⁵ Our plan-and-write methods also improve BLEU scores over the baseline methods; more details can be found in the Appendix.
⁶ We also conduct the same computation for four- and five-grams and observed the same trends. The Spearman correlation between this measurement and human rating is 0.28.
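For reference, the inter- and intra-story repetition rates of Eqs. 3 and 4 can be computed in a few lines of Python. The sketch below is our own illustration (trigrams only) and assumes each story is given as a list of m tokenized sentences; it is not the authors' evaluation script.

```python
# Trigram-based repetition rates following Eqs. (3)-(4); variable names are ours.
def trigrams(tokens):
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def inter_story_repetition(stories, i):
    # re_i: pool sentence i of every story; 1 - (#distinct trigrams / #total trigrams).
    pooled = [tri for story in stories for tri in trigrams(story[i])]
    return 1.0 - len(set(pooled)) / max(len(pooled), 1)

def intra_story_repetition(stories, i):
    # ra_i: average overlap of sentence i's distinct trigrams with each earlier sentence.
    total = 0.0
    for story in stories:
        cur = set(trigrams(story[i]))
        if i == 0 or not cur:
            continue                       # no earlier sentences (or too short) -> contributes 0
        overlap = sum(len(cur & set(trigrams(story[k]))) for k in range(i))
        total += overlap / (i * len(cur))
    return total / len(stories)

def aggregate(stories):
    m = len(stories[0])
    pooled = [tri for story in stories for sent in story for tri in trigrams(sent)]
    re_agg = 1.0 - len(set(pooled)) / max(len(pooled), 1)
    ra_agg = sum(intra_story_repetition(stories, i) for i in range(m)) / m
    return re_agg, ra_agg
```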
[Figure 3: Inter- and intra-story repetition rates by sentence (curves) and for the whole stories (bars); the lower the better. Panels: (a) inter-story repetition curve by sentence; (b) inter-story aggregate repetition scores; (c) intra-story repetition curve by sentence; (d) intra-story aggregate repetition scores. Systems compared: Inc-S2S, Cond-LM, Dynamic, Static. As reference points, the aggregate repetition rates on the human-written training data are 34% and 0.3% for the inter- and intra-story measurements, respectively.]

                     Dynamic vs Inc-S2S        Static vs Cond-LM         Dynamic vs Static
Choice %             Dyna.   Inc.     Kappa    Static   Cond.   Kappa    Dyna.    Static   Kappa
Fidelity             35.8    [Link]   0.42     38.5     16.3    0.42     38.00    21.47    0.30
Coherence            37.2    28.6     0.30     39.4     32.3    0.35     49.47    28.27    0.36
Interestingness      43.5    26.7     0.31     39.5     35.7    0.42     34.40    42.60    0.35
Overall Popularity   42.9    27.0     0.34     40.9     34.2    0.38     30.07    50.07    0.38

Table 3: Human evaluation results on four aspects: fidelity, coherence, interestingness, and overall user preference. Dyna., Inc., and Cond. are abbreviations for the Dynamic schema, Inc-S2S, and Cond-LM, respectively. We also calculate the Kappa coefficient to show the inter-annotator agreement.

Results and Discussion

Objective evaluation
We generate 9,816 stories based on the titles in the held-out test set, and compute the repetition ratios (the lower, the better) described in Eq. 3 and Eq. 4 to evaluate the diversity of the generated system. As shown in Figure 3, the proposed plan-and-write framework significantly reduces the repetition rate and generates more diverse stories. For inter-story repetition, the plan-and-write methods significantly outperform all non-planning methods on both individual sentences and aggregate scores. For the intra-story repetition rate, the plan-and-write methods outperform their corresponding non-planning baselines on aggregate scores. However, the dynamic schema generates more repetitive final sentences than the baselines.

Subjective evaluation
For human evaluation, we randomly sample 300 titles from the test data, present a story title and two generated stories at a time⁷ to the evaluators, and ask them to decide which of the two stories is better⁸. In total, 233 Turkers⁹ participated in the evaluation. Specifically, 69, 77, and 87 Turkers evaluated the comparisons between Dynamic and Inc-S2S, Static and Cond-LM, and Dynamic and Static, respectively.

Table 3 demonstrates the results of the human evaluation. Similar to the automatic evaluation results, both the dynamic and the static schema significantly outperform their counterpart baselines in all evaluation aspects, thus demonstrating the effectiveness of the proposed plan-and-write framework. Among them, the static schema shows the best results.

To understand why people prefer one story over another, we analyze how people weigh the three aspects (fidelity, coherence, and interestingness) in their preference for stories. We train a linear regression using the three aspects' scores as features to predict the overall score, and fit the regression with all human assessments we collected. The weight assigned to each aspect reflects its relative importance. As evident in Figure 4, coherence and interestingness play important roles in the human evaluation, and fidelity is less important.

[Figure 4: The regression coefficients that show which aspect is more important in human evaluation of stories. Fidelity: 19.7%; Interestingness: 38.9%; Coherence: 41.4%.]

⁷ We compare the plan-and-write methods with their corresponding baselines and with each other. For fairness, the two stories are pooled and randomly permuted. Five judgments are required to reduce the variance in estimation.
⁸ The four aspects are each evaluated.
⁹ We applied qualification filters that only allow users who have completed at least 500 previous jobs and have a greater than 98% acceptance rate to participate in our survey.
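The aspect-importance analysis above amounts to an ordinary least-squares fit. A minimal sketch with scikit-learn might look like the following; the ratings are toy placeholders, not the actual AMT judgments, and the column names only mirror the three aspects discussed in the paper.

```python
# Toy sketch of regressing overall preference on the three aspect scores.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per judgment: [fidelity, coherence, interestingness] scores (placeholder data).
aspects = np.array([[1, 1, 0],
                    [0, 1, 1],
                    [1, 0, 1],
                    [0, 0, 0],
                    [1, 1, 1]])
overall = np.array([1, 1, 1, 0, 1])        # overall preference for the same judgments

reg = LinearRegression().fit(aspects, overall)
for name, coef in zip(["fidelity", "coherence", "interestingness"], reg.coef_):
    print(f"{name}: {coef:.3f}")           # larger coefficient -> aspect weighs more in the overall choice
```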
Title: Computer
Baselines
  Inc-S2S:   Tom's computer broke down. He needed to buy a new computer. He decided to buy a new computer. Tom bought a new computer. Tom was able to buy a new computer.
  Cond-LM:   The man bought a new computer. He went to the store. He bought a new computer. He bought the computer. He installed the computer.
Dynamic
  Storyline: needed → money → computer → bought → happy
  Story:     John needed a computer for his birthday. He worked hard to earn money. John was able to buy his computer. He went to the store and bought a computer. John was happy with his new computer.
Static
  Storyline: computer → slow → work → day → buy
  Story:     I have an old computer. It was very slow. I tried to work on it but it wouldn't work. One day, I decided to buy a new one. I bought a new computer.

Title: The Virus
Baselines
  Inc-S2S:   His computer was fixed and he fixed it. John got a new computer on his computer. John was able to fix it himself. John was able to fix his computer and was able to fix his computer. John was able to fix his computer and had a virus and was able to fix his computer.
  Cond-LM:   Tim was working on a project. He was working on a project. Tim was working on a project. The project was really good. Tim was able to finish the project.
Dynamic
  Storyline: computer → use → anywhere → house → found
  Story:     I was working on my computer today. I was trying to use the computer. I couldn't find it anywhere. I looked all over the house for it. Finally, I found it.
Static
  Storyline: work → fix → called → found → day
  Story:     I had a virus on my computer. I tried to fix it but it wouldn't work. I called the repair company. They came and found the virus. The next day, my computer was fixed.

Table 4: Case studies of generated storylines and stories.

Title / Problem                Story
Taxi / off-topic               I got a new car. It was one day. I decided to drive to the airport. I was driving for a long time. I had a great time.
Cut / repetitive               Anna was cutting her nails. She cut her finger and cut her finger. Then she cut her finger. It was bleeding! Anna had to bandage her finger.
Eight glasses / inconsistent   Joe needed glasses. He went to the store to buy some. He didn't have any money. He found a pair that he liked. He bought them.

Table 5: Example stories that demonstrate the typical problems of the current systems.

Method    l-B1    l-B2    l-s
Dynamic   6.46    0.79    0.88
Static    9.53    1.59    0.89

Table 6: The storyline BLEU scores (BLEU-1 and BLEU-2 only) and the storyline-story correlation l-s.

Analysis

The previous sections examine the overall performance of our methods quantitatively. In this section, we qualitatively analyze our methods with a focus on comparing the dynamic and static schema.

Storyline analysis. First, we measure the quality of the generated storylines, and the correlations between a storyline and the generated story. We use BLEU scores to measure the quality of the storylines, and an embedding-based metric (Liu et al. 2016a) to estimate the average greedy matching score l-s between storyline words and generated story sentences. Concretely, a storyline word is greedily matched with each token in a story sentence based on the cosine similarity of their word embeddings¹⁰. The highest cosine score is regarded as the correlation between them.

Table 6 shows the results. We can see that the static schema generates storylines with higher BLEU scores. It also generates stories that have a higher correlation with the storylines (a higher l-s score¹¹). This indicates that with better storylines (higher BLEU scores), it is easier to generate more relevant and coherent stories. This partially explains why the static schema performs better than the dynamic schema.

Case study. We further present two examples in Table 4 to intuitively compare the plan-and-write methods and the baselines¹². In both examples, the baselines without planning components tend to generate repetitive sentences that do not exhibit much of a story progression. In contrast, the plan-and-write methods can generate storylines that follow a reasonable flow, and thus help generate coherent stories with less repetition. This demonstrates the ability of the plan-and-write methods. In the second example, the storyline generated by the dynamic schema is not very coherent, and this significantly affects story quality. This reflects the importance of storyline planning in our framework.

¹⁰ For fairness, we adopt the pre-trained GloVe embedding to measure the correlation. [Link]
¹¹ 75% and 78% of storyline words appear in the generated stories in the dynamic and static schema, respectively.
¹² For more examples, please see our live demo at [Link]
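The l-s score described above reduces to greedy cosine matching between each storyline word and the tokens of its story sentence. The small sketch below is our illustration of that computation; the embedding lookup `emb` stands in for pre-trained GloVe vectors, and out-of-vocabulary words are simply skipped.

```python
# Greedy embedding matching between storyline words and story sentences (illustrative).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def greedy_match_score(storyline, story_sentences, emb):
    """emb: dict mapping a word to its vector (e.g., loaded from GloVe)."""
    scores = []
    for word, sentence in zip(storyline, story_sentences):
        if word not in emb:
            continue
        sims = [cosine(emb[word], emb[tok]) for tok in sentence if tok in emb]
        if sims:
            scores.append(max(sims))   # highest cosine = correlation of this word with its sentence
    return sum(scores) / max(len(scores), 1)
```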
Error analysis. To better understand the limitations of our best system, we manually reviewed 50 titles and the corresponding generated stories from our static schema to conduct error analysis. The three major problems are: off-topic, repetitive, and logically inconsistent stories. We show three examples, one for each category, in Table 5 to illustrate the problems. We can see that the current system is already capable of generating grammatical sentences that are coherent within a local context. However, generating a sequence of coherent and logically consistent sentences is still an open challenge.

Related Work

Story Planning
Automatic story generation efforts date back to the 1970s (Meehan 1977). Early attempts focused on composing a sensible plot for a story. Symbolic planning systems (Porteous and Cavazza 2009; Riedl and Young 2010) attempted to select and sequence character actions according to specific success criteria. Case-based reasoning systems (Turner 1994; Gervas et al. 2005; Montfort 2006) adapted prior story plots (cases) to new storytelling requirements. These traditional approaches were able to produce impressive results based on hand-crafted, well-defined domain models, which kept track of legal characters, their actions, narratives, and user interest. However, the generated stories were restricted to limited domains.

To tackle the problem of restricted domains, some work attempted to automatically learn domain models. Swanson and Gordon [2012] mined millions of personal stories from the Web and identified relevant existing stories in the corpus. Li et al. [2013] used a crowd-sourced corpus of stories to learn a domain model that helped generate stories in unknown domains. These efforts stayed at the level of story plot planning without surface realization.

Event Structures for Storytelling
There is a line of research focusing on representing story event structures (Mostafazadeh et al. 2016b; McDowell et al. 2017). Rishes et al. [2013] present a model that reproduces different versions of a story from its symbolic representation. Pichotta and Mooney [2016] parse a large collection of natural language documents, extract sequences of events, and learn statistical models of them. Some recent work explored story generation with additional information (Bowden et al. 2016; Peng et al. 2018; Guan, Wang, and Huang 2019). Visual storytelling (Huang et al. 2016; Liu et al. 2016b; Wang et al. 2018) aims to generate human-level narrative language from a sequence of images. Jain et al. [2017] address the task of coherent story generation from independent textual descriptions. Unlike this line of work, we learn to automatically generate storylines to help generate coherent stories.

Martin et al. [2018] and Xu et al. [2018] are the closest works to ours, which decompose story generation into two steps: story structure modeling and structure-to-surface generation. However, Martin et al. [2018] did not conduct experiments on full story generation. Xu et al. [2018] is a concurrent work which is similar to our dynamic schema. Their setting assumes story prompts as inputs, which is more specific than our work (which only requires a title). Moreover, we explore two planning strategies, the dynamic schema and the static schema, and show that the latter works better.

Neural Story Generation
Recently, deep learning models have been demonstrated effective in natural language generation tasks (Bahdanau, Cho, and Bengio 2015; Merity, Keskar, and Socher 2018). In story generation, prior work has proposed to use deep neural networks to capture story structures and generate stories. Khalifa, Barros, and Togelius [2017] argue that stories are better generated using recurrent neural networks (RNNs) trained on highly specialized textual corpora, such as a body of work from a single, prolific author. Roemmele et al. [2017] use skip-thought vectors (Kiros et al. 2015) to encode sentences and model relations between the sentences. Jain et al. [2017] explore generating coherent stories from independent textual descriptions based on two conditional text-generation methods: statistical machine translation and sequence-to-sequence models. Fan, Lewis, and Dauphin [2018] propose a hierarchical generation strategy to generate stories from prompts to improve coherence. However, we consider storylines different from prompts, as they are not natural language sentences but a structured outline of stories. We employ neural network-based generation models for our plan-and-write generation. The focus, however, is to introduce storyline planning to improve the quality of generated stories, and to compare the effect of different storyline planning strategies on story generation.

Conclusion and Future Work
In this paper, we propose a plan-and-write framework that generates stories from given titles with explicit storyline planning. We explore and compare two plan-and-write strategies, the dynamic schema and the static schema, and show that they both outperform the baselines without planning components. The static schema performs better than the dynamic schema because it plans the storyline holistically and thus tends to generate more coherent and relevant stories.

The current plan-and-write models use a sequence of words to approximate a storyline, which simplifies many meaningful structures in a real story plot. We plan to extend the exploration to richer representations, such as entity, event, and relation structures, to depict story plots. We also plan to extend the plan-and-write framework to generate longer documents. The current framework relies on storylines automatically extracted from story corpora to train the planning module. In the future, we will explore storyline induction and joint storyline and story generation to avoid error propagation in the current pipeline generation system.

Acknowledgements
We thank the anonymous reviewers for the useful comments. This work is supported by Contract W911NF-15-1-0543 with the US Defense Advanced Research Projects Agency (DARPA), the National Key Research and Development Program of China (No. 2017YFC0804001), and the National Science Foundation of China (Nos. 61876196 and 61672058).
References

Alarcon, D. 2010. The Secret Miracle: The Novelist's Handbook. St. Martin's Griffin.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Bowden, K. K.; Lin, G. I.; Reed, L. I.; Tree, J. E. F.; and Walker, M. A. 2016. M2D: Monolog to dialog generation for conversational story telling. In ICIDS.
Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In ACL.
Gervas, P.; Diaz-Agudo, B.; Peinado, F.; and Hervas, R. 2005. Story plot generation based on CBR. In KBS.
Guan, J.; Wang, Y.; and Huang, M. 2019. Story ending generation with incremental encoding and commonsense knowledge. In AAAI.
Huang, T.-H. K.; Ferraro, F.; Mostafazadeh, N.; Misra, I.; Devlin, J.; Agrawal, A.; Girshick, R.; He, X.; Kohli, P.; Batra, D.; et al. 2016. Visual storytelling. In NAACL.
Jain, P.; Agrawal, P.; Mishra, A.; Sukhwani, M.; Laha, A.; and Sankaranarayanan, K. 2017. Story generation from sequence of independent short descriptions. In KDD WS.
Khalifa, A.; Barros, G. A.; and Togelius, J. 2017. DeepTingle. arXiv preprint arXiv:1705.03557.
Kiros, R.; Zhu, Y.; Salakhutdinov, R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS.
Lebowitz, M. 1987. Planning stories. In CogSci.
Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story generation with crowdsourced plot graphs. In AAAI.
Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; and Gao, J. 2016. Deep reinforcement learning for dialogue generation. In EMNLP.
Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016a. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
Liu, Y.; Fu, J.; Mei, T.; and Chen, C. W. 2016b. Storytelling of photo stream with bidirectional multi-thread recurrent neural network. arXiv preprint arXiv:1606.00625.
Martin, L.; Ammanabrolu, P.; Hancock, W.; Singh, S.; Harrison, B.; and Riedl, M. 2018. Event representations for automated story generation with deep neural nets. In AAAI.
McDowell, B.; Chambers, N.; Ororbia II, A.; and Reitter, D. 2017. Event ordering with a generalized model for sieve prediction ranking. In IJCNLP.
Meehan, J. R. 1977. Tale-Spin, an interactive program that writes stories. In IJCAI.
Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing and optimizing LSTM language models. In ICLR.
Montfort, N. 2006. Natural language generation and narrative variation in interactive fiction. In AAAI WS.
Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; and Allen, J. 2016a. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL.
Mostafazadeh, N.; Grealish, A.; Chambers, N.; Allen, J.; and Vanderwende, L. 2016b. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In NAACL Workshop.
Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; and Jin, Z. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In COLING.
Nayak, N.; Hakkani-Tur, D.; Walker, M.; and Heck, L. 2017. To plan or not to plan? Discourse planning in slot-value informed sequence to sequence models for language generation. In Interspeech.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Peng, N.; Ghazvininejad, M.; May, J.; and Knight, K. 2018. Towards controllable story generation. In NAACL Workshop.
Perez, R., and Sharples, M. 2001. MEXICA: A computer model of a cognitive account of creative writing. In JETAI.
Pichotta, K., and Mooney, R. J. 2016. Learning statistical scripts with LSTM recurrent neural networks. In AAAI.
Porteous, J., and Cavazza, M. 2009. Controlling narrative generation with planning trajectories: The role of constraints. In ICIDS.
Riedl, M. O., and Young, R. M. 2010. Narrative planning: Balancing plot and character. In JAIR.
Rishes, E.; Lukin, S. M.; Elson, D. K.; and Walker, M. A. 2013. Generating different story tellings from semantic representations of narrative. In ICIDS.
Roemmele, M.; Kobayashi, S.; Inoue, N.; and Gordon, A. M. 2017. An RNN-based classifier for the story cloze test. In LSDSem.
Rose, S.; Engel, D.; Cramer, N.; and Cowley, W. 2010. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory.
Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL.
Swanson, R., and Gordon, A. 2012. Say anything: Using textual case-based reasoning to enable open-domain interactive storytelling. In ACM TiiS.
Turner, S. R. 1994. The Creative Process: A Computer Model of Storytelling and Creativity. Psychology Press.
Wang, Z.; He, W.; Wu, H.; Wu, H.; Li, W.; Wang, H.; and Chen, E. 2016. Chinese poetry generation with planning based neural network. In COLING.
Wang, X.; Chen, W.; Wang, Y.-F.; and Wang, W. Y. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.
Xu, J.; Zhang, Y.; Zeng, Q.; Ren, X.; Cai, X.; and Sun, X. 2018. A skeleton-based model for promoting coherence among sentences in narrative story generation. In EMNLP.
Yao, L.; Zhang, Y.; Feng, Y.; Zhao, D.; and Yan, R. 2017. Towards implicit content-introducing for generative short-text conversation systems. In EMNLP.
