Plan-And-Write Towards Better Automatic Storytelling
Plan-And-Write Towards Better Automatic Storytelling
Lili Yao,1,3∗ Nanyun Peng,2∗ Ralph Weischedel,2 Kevin Knight,2 Dongyan Zhao,1 Rui Yan1†
liliyao@[Link], {npeng,weisched,knight}@[Link]
{zhaodongyan,ruiyan}@[Link]
1
Institute of Computer Science and Technology, Peking University
2
Information Sciences Institute, University of Southern California, 3 Tencent AI Lab
7378
can generate better stories without additional human anno- algorithm (Rose et al. 2010), which combines several word
tation. Moreover, with this simple and interpretable story- frequency based and graph-based metrics to weight the im-
line representation, it is possible to compare the efficiency of portance of the words. We extract the most important word
different plan-and-write strategies. Specifically, we explore from each sentence as a story’s storyline.
two paradigms that seem to mimic human practice in real
world story writing1 (Alarcon 2010). The dynamic schema Methods
adjusts the plot improvisationally while writing progresses. We adopt neural generation models to implement our plan-
The static schema plans the entire plot before writing. We and-write framework, as they have been shown effective
summarize the contributions of the paper as follows: in many text generation tasks such as machine transla-
• We propose a plan-and-write framework that leverages tion (Bahdanau, Cho, and Bengio 2015), and dialogue sys-
storylines to improve the diversity and coherence of the tems (Shang, Lu, and Li 2015). Figure 1 demonstrates the
generated story. Two strategies: dynamic and static plan- workflow of our framework. We now describe the two plan-
ning are explored and compared under this framework. and-write strategies we explored.
• We develop evaluation metrics to measure the diversity of
the generated stories, and conduct novel analysis to ex- Dynamic Schema
amine the importance of different aspects of stories for The dynamic schema emphasizes flexibility. As shown in
human evaluation. Figure 2a, it generates the next word in the storyline and
• Experiments show that the proposed plan-and-write the next sentence in the story at each step. In both cases,
model generates more diverse, coherent, and on-topic sto- the existing storyline and previously generated sentences are
ries than those without planning 2 . given to the model to move one step forward.
Storyline Planning The storyline is planned out based on
Plan-and-Write Storytelling the context (the title and previously generated sentences are
taken as context) and the previous word in the storyline. We
In this paper, we propose a plan-and-write framework to
formulate it as a content-introducing generation problem,
generate stories from given titles. We posit that storytelling
where the new content (the next word in the storyline) is gen-
systems can benefit from storyline planning to generate
erated based on the context and some additional information
more coherent and on-topic stories. An additional benefit
(the most recent word in the storyline). Formally, let ctx =
of the plan-and-write schema is that human and computer
[t, s1:i−1 ] denotes the context, where s1:i−1 denotes for the
can interact and collaborate on the (abstract) storyline level,
first i-1 sentences in the story. We model p(li |ctx, li−1 ; θ).
which can enable many potentially enjoyable interactions.
We implement the content-introducing method proposed
We formally define the input, output, and storyline of our
by Yao et al. [2017], which first encodes context into a
approach as follows.
vector using a bidirectional gated recurrent unit (BiGRU),
Problem Formulation and then incorporates the auxiliary information, in this case
the previous word in the storyline, into the decoding pro-
Input: A title t = {t1 , t2 , ..., tn } is given to the system to cess. Formally, hidden vectors for context are computed as
constrain writing, where ti is the i-th word in the title. e ctx = Encodectx (ctx) = [−
h
−→ ←−− −−→
hctx ; hctx ], where hctx and
Output: The system generates a story s = ←−−
{s1 , s2 , ..., sm } based on a title, where si denotes a hctx are the hidden vectors produced by a forward and a
sentence in the story. backward GRU, respectively. [; ] denotes element-wise con-
Storyline: The system plans a storyline l = catenation. The conditional probability is computed as:
{l1 , l2 , ..., lm } as an intermediate step to represent the
plot of a story. We use a sequence of words to represent a hy = GRU(BOS, Catt ), hw = GRU(li−1 , Catt )
0 0
storyline, therefore, li denotes a word in a storyline. hy = tanh(W1 hy ), hw = tanh(W2 hw )
Given a title, the plan-and-write framework always plans 0 0
a storyline. We explore two variations of this framework: the k = σ(Wk [hy ; hw ])
dynamic and the static schema. p(li |ctx, li−1 ) = g(k ◦ hy + (1 − k) ◦ hw )
Storyline Preparation BOS denotes the beginning of decoding. Catt represents the
To obtain training data for the storyline planner, we extract attention-based context computed from h̃ctx . g(·) denotes a
sequences of words from existing story corpora to compose multilayer perceptron (MLP).
storylines. Specifically, we extract one word from each sen- Story Generation The story is generated incrementally by
tence of a story to form a storyline3 . We adopt the RAKE planning and writing alternately. We formulate it as another
1
Some discussions on Quora: [Link] content-introducing generation problem which generates a
How-many-times-does-a-writer-edit-his-first -draft story sentence based on both the context and an additional
2
Code and appendix will be available at [Link] storyline word as a cue. The model structure is exactly the
VioletPeng/language-model same as for storyline generation. However, there are two
3 differences between storyline and story generation. On one
For this pilot study, we assume each word li in a storyline cor-
responds to a sentence si in a story. hand, the former aims to generate a word while the latter
7379
Generated Story
7380
Number of Stories 98, 161 validation set. Details of the best hyper-parameter values for
Vocabulary size 33, 215 each setting are given in Appendix.
Average number of words 50
Evaluation Metrics
Table 2: Statistics of the ROCStories dataset. Objective metrics. Our goal is generating human-like sto-
ries, which can pass the Turing test. Therefore, the evalu-
ation metrics based on n-gram overlap such as BLEU are
Experimental Setup not suitable for our task5 . To better gauge the quality of our
methods, we design novel automatic evaluation metrics to
Dataset evaluate the generation results at scale. Since neural gen-
We conduct the experiments on the ROCStories cor- eration models are known to suffer from generating repeti-
pus (Mostafazadeh et al. 2016a). It contains 98,162 short tive content, our automatic evaluation metrics are designed
commonsense stories as training data, and additional 1,817 to quantify diversity across the generated stories. We design
stories for development and test, respectively. The stories two measurements to gauge inter- and intra-story repetition.
in the corpus are five-sentence stories that capture a rich For each sentence position i, the inter-story rei and intra-
set of causal and temporal commonsense relations between story rai repetition rate are computed as follows:
daily events, making them a good resource for training sto- PN
rytelling models. Table 2 shows the statistics of ROCStories i
T ( j=1 sji )
re = 1 −
dataset. Since only the training set of the ROCStories corpus PN
Tall ( j=1 sji )
contains titles, which we need as input. We split the original "P #j (3)
training data into [Link] for training, validation, and testing. N i−1 i k
1 X T (s ∩ s )
rai = k=1
7381
% % % %
95 96 30 18 Inc-S2S
90 94 25 16 Cond-LM
92 14 Dynamic
85 90 20 Static
80 88 15 12
75 86 10 10
70 84 5 8
65 1 82 01 6
2 3 4 5 2 3 4 5
(a) Inter-story repetition curve (b) Inter-story aggregate repeti- (c) Intra-story repetition curve (d) Intra-story aggregate repeti-
by sentence. tion scores. by sentence. tion scores.
Dynamic vs Inc-S2S
Dynamic vs Inc-S2S Static Cond-LM
Static vsvsCond-LM Dynamic
Dynamic vs Static
vs Static
Choice Choice
% % Dyna. [Link]. Kappa
Kappa Static Cond. Kappa
Static Cond. [Link].
Static Static
Kappa Kappa
FidelityFidelity 35.8 [Link] 0.42 0.42 38.5
38.5 16.3
16.3 0.42 38.00 38.00
0.42 21.4721.47 0.30 0.30
Coherence 37.2 37.228.6
Coherence 28.6 0.30 0.30 39.4
39.4 32.3
32.3 0.35 49.47 49.47
0.35 28.2728.27 0.36 0.36
Interestingness 43.5 26.7 0.31 39.5 35.7 0.42 34.40 42.60 0.35
Interestingness 43.5
Overall Popularity 42.9
26.7
27.0
0.31
0.34
39.5
40.9
35.7
34.2
0.42
0.38
34.40
30.07 50.07
42.60
0.38
0.35
Overall Popularity 42.9 27.0 0.34 40.9 34.2 0.38 30.07 50.07 0.38
Table 3: Human evaluation results on four aspects: fidelity, coherence, interestingness, and overall user preference. Dyna., Inc.,
Table 3: Human
and Cond. is the evaluation results
abbreviation on four aspects:
for Dynamic schema, fidelity,
Inc-S2S,coherence, interestingness,
and Cond-LM [Link]
We overall user preference.
also calculate the KappaDyna., Inc.,
coefficient
and Cond. is the abbreviation for Dynamic
to show the inter-annotator agreement. schema, Inc-S2S, and Cond-LM respectively. We also calculate the Kappa coefficient
to show the inter-annotator agreement.
7382
Title: Computer
Inc-S2S Tom’s computer broke down. He needed to buy a new computer. He decided to buy a new com-
Baselines
puter. Tom bought a new computer. Tom was able to buy a new computer.
Cond-LM The man bought a new computer. He went to the store. He bought a new computer. He bought
the computer. He installed the computer.
Storyline needed → money → computer → bought → happy
Dynamic
Story John needed a computer for his birthday. He worked hard to earn money. John was able to buy his
computer. He went to the store and bought a computer. John was happy with his new computer.
Storyline computer → slow → work → day → buy
Static
Story I have an old computer. It was very slow. I tried to work on it but it wouldn’t work. One day, I
decided to buy a new one. I bought a new computer .
Title: The Virus
Inc-S2S His computer was fixed and he fixed it. John got a new computer on his computer. John was able
Baselines
to fix it himself. John was able to fix his computer and was able to fix his computer. John was
able to fix his computer and had a virus and was able to fix his computer.
Cond-LM Tim was working on a project. He was working on a project. Tim was working on a project. The
project was really good. Tim was able to finish the project.
Storyline computer → use → anywhere → house → found
Dynamic
Story I was working on my computer today. I was trying to use the computer. I couldn’t find it
anywhere. I looked all over the house for it. Finally, I found it.
Storyline work → fix → called → found → day
Static
Story I had a virus on my computer. I tried to fix it but it wouldn’t work. I called the repair company.
They came and found the virus. The next day, my computer was fixed.
Table 5: Example stories that demonstrate the typical problems of the current systems.
Method l-B1 l-B2 l-s cosine score is regarded as the correlation between them.
Dynamic 6.46 0.79 0.88 Table 6 shows the results. We can see that the static
Static 9.53 1.59 0.89 schema generates storylines with higher BLEU scores. It
also generates stories that have a higher correlation with the
Table 6: The storyline BLEU score (only BLEU-1 and storylines (higher l-s score11 ). This indicates that with better
BLEU-2) and the correlation of storyline-story l-s. storylines (higher BLEU score), it is easier to generate more
relevant and coherent stories. This partially explains why the
static schema performs better than the dynamic schema.
Analysis Case study. We further present two examples in Table 4
to intuitively compare the plan-and-write methods and the
The previous sections examine the overall performance of baselines12 . In both examples, the baselines without plan-
our methods quantitatively. In this section, we qualitatively ning components tend to generate repetitive sentences that
analyze our methods with a focus on comparing the dynamic do not exhibit much of a story progression. In contrast, the
and static schema. plan-and-write methods can generate storylines that follow a
Storyline analysis. First, we measure the quality of the reasonable flow, and thus help generate coherent stories with
generated storylines, and the correlations between a story- less repetition. This demonstrates the ability of the plan-and-
line and the generated story. We use BLEU scores to mea- write methods. In the second example, the storyline gener-
sure the quality of the storylines, and an embedding-based ated by the dynamic schema is not very coherent and thus
metric (Liu et al. 2016a) to estimate the average greedy
matching score l-s between storyline words and generated measure the correlation. [Link]
story sentences. Concretely, a storyline word is greedily [Link]
11
matched with each token in a story sentence based on the There are 75% and 78% storyline words appear in the gener-
cosine similarity of their word embeddings10 . The highest ated stories in the dynamic and static schema, respectively.
12
More examples please see our live demo at [Link]
10
For fairness, We adopt the pre-trained Glove embedding to [Link]/
7383
significantly affects story quality. This reflects the impor- iments on full story generation. Xu et al. [2018] is a concur-
tance of storyline planning in our framework. rent work which is similar to our dynamic schema. Their set-
Error analysis. To better understand the limitation of our ting assumes story prompts as inputs, which is more specific
best system, we manually reviewed 50 titles and the corre- than our work (which only requires a title). Moreover, we
sponding generated stories from our static schema to con- explore two planning strategies: dynamic schema and static
duct error analysis. The three major problems are: off-topic, schema, and show the latter works better.
repetitive, and logically inconsistent. We show three exam-
ples, one for each category, in Table 5 to illustrate the prob- Neural Story Generation
lems. We can see that the current system is already capable Recently, deep learning models have been demonstrated ef-
of generating grammatical sentences that are coherent within fective in natural language generation tasks (Bahdanau, Cho,
a local context. However, generating a sequence of coherent and Bengio 2015; Merity, Keskar, and Socher 2018) In story
and logically consistent sentences is still an open challenge. generation, prior work has proposed to use deep neural net-
works to capture story structures and generate stories. Khal-
Related work ifa, Barros, and Togelius [2017] argue that stories are better
Story Planning generated using recurrent neural networks (RNNs) trained
on highly specialized textual corpora, such as a body of work
Automatic story generation efforts date back to the from a single, prolific author. Roem et al. [2017] use skip-
1970s (Meehan 1977). Early attempts focused on composing thought vectors (Kiros et al. 2015) to encode sentences and
a sensible plot for a story. Symbolic planning systems (Por- model relations between the sentences. Jain et al. [2017]
teous and Cavazza 2009; Riedl and Young 2010) attempted explore generating coherent stories from independent tex-
to select and sequence character actions according to spe- tual descriptions based on two conditional text-generation
cific success criteria. Case-based reasoning systems (Turner methods: statistical machine translation and sequence-to-
1994; Gervas et al. 2005; Montfort 2006) adapted prior story sequence models. Fan, Lewis, and Dauphin [2018] proposes
plots (cases) to new storytelling requirements. These tradi- a hierarchical generation strategy to generate stories from
tional approaches were able to produce impressive results prompts to improve coherence. However, we consider story-
based on hand-crafted, well-defined domain models, which lines are different from prompts as they are not naturally lan-
kept track of legal characters, their actions, narratives, and guage sentences. They are some structured outline of stories.
user interest. However, the generated stories were restricted We employ neural network-based generation models for our
to limited domains. plan-and-write generation. The focus, however, is to intro-
To tackle the problem of restricted domains, some work duce storyline planning to improve the quality of generated
attempted to automatically learn domain models. Swanson stories, and compare the effect of different storyline plan-
and Gordon [2012] mined millions of personal stories from ning strategies on story generation.
the Web and identified relevant existing stories in the cor-
pus. Li et al. [2013] used a crowd-sourced corpus of stories
to learn a domain model that helped generate stories in un-
Conclusion and Future Work
known domains. These efforts stayed at the level of story In this paper, we propose a plan-and-write framework
plot planning without surface realization. that generates stories from given titles with explicit story-
line planning. We explore and compare two plan-and-write
Event Structures for Storytelling strategies: dynamic schema and static schema, and show that
they both outperform the baselines without planning compo-
There is a line of research focusing on representing story
nents. The static schema performs better than the dynamic
event structures (Mostafazadeh et al. 2016b; McDowell et
schema because it plans the storyline holistically, thus tends
al. 2017). Rishes et al. [2013] presents a model that repro-
to generate more coherent and relevant stories.
duce different versions of a story from its symbolic repre-
sentation. Pichotta and Mooney [2016] parse a large col- The current plan-and-write models use a sequence of
lection of natural language documents, extract sequences words to approximate a storyline, which simplifies many
of events, and learn statistical models of them. Some re- meaningful structures in a real story plot. We plan to ex-
cent work explored story generation with additional infor- tend the exploration to richer representations, such as en-
mation (Bowden et al. 2016; Peng et al. 2018; Guan, Wang, tity, event, and relation structures, to depict story plots. We
and Huang 2019). Visual storytelling (Huang et al. 2016; also plan to extend the plan-and-write framework to generate
Liu et al. 2016b; Wang et al. 2018) aims to generate human- longer documents. The current framework relies on story-
level narrative language from a sequence of images. Jain et lines automatically extracted from story corpora to train the
al. [2017] addresses the task of coherent story generation planning module. In the future, we will explore the storyline
from independent textual descriptions. Unlike this line of induction and joint storyline and story generation to avoid
work, we learn to automatically generate storylines to help error propagation in the current pipeline generation system.
generate coherent stories.
Martin et al.; Xu et al. [2018; 2018] are the closest work Acknowledgements
to ours, which decomposed story generation into two steps: We thank the anonymous reviewers for the useful com-
story structure modeling and structure-to-surface genera- ments. This work is supported by Contract W911NF-15-
tion. However, Martin et al. [2018] did not conduct exper- 1-0543 with the US Defense Advanced Research Projects
7384
Agency (DARPA), the National Key Research and Devel- Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.;
opment Program of China (No. 2017YFC0804001), and the Vanderwende, L.; Kohli, P.; and Allen, J. 2016a. A corpus
National Science Foundation of China (Nos. 61876196 and and cloze evaluation for deeper understanding of commonsense
61672058). stories. In NAACL.
Mostafazadeh, N.; Grealish, A.; Chambers, N.; Allen, J.; and
References Vanderwende, L. 2016b. CaTeRS: Causal and temporal relation
scheme for semantic annotation of event structures. In NAACL
Alarcon, D. 2010. The Secret Miracle: The Novelist’s Hand- Workshop.
book. St. Martin’s Griffin.
Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; and Jin, Z. 2016.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine Sequence to backward and forward sequences: A content-
translation by learning to align and translate. In ICLR. introducing approach to generative short-text conversation. In
Bowden, K. K.; Lin, G. I.; Reed, L. I.; Tree, J. E. F.; and Walker, COLING.
M. A. 2016. M2d: Monolog to dialog generation for conversa- Nayak, N.; Hakkani-Tur, D.; Walker, M.; and Heck, L. 2017. To
tional story telling. In ICIDS. plan or not to plan? discourse planning in slot-value informed
Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural sequence to sequence models for language generation. In Inter-
story generation. In ACL. speech.
Gervas, P.; Diaz-Agudo, B.; Peinado, F.; and Hervas, R. 2005. Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu:
Story plot generation based on CBR. In KBS. a method for automatic evaluation of machine translation. In
ACL.
Guan, J.; Wang, Y.; and Huang, M. 2019. Story ending genera-
tion with incremental encoding and commonsense knowledge. Peng, N.; Ghazvininejad, M.; May, J.; and Knight, K. 2018.
In AAAI. Towards controllable story generation. In NAACL Workshop.
Huang, T.-H. K.; Ferraro, F.; Mostafazadeh, N.; Misra, I.; De- Perez, R., and Sharples, M. 2001. MEXICA: A computer model
vlin, J.; Agrawal, A.; Girshick, R.; He, X.; Kohli, P.; Batra, D.; of a cognitive account of creative writing. In JETAI.
et al. 2016. Visual storytelling. In NAACL. Pichotta, K., and Mooney, R. J. 2016. Learning statistical
scripts with LSTM recurrent neural networks. In AAAI.
Jain, P.; Agrawal, P.; Mishra, A.; Sukhwani, M.; Laha, A.; and
Sankaranarayanan, K. 2017. Story generation from sequence Porteous, J., and Cavazza, M. 2009. Controlling narrative gen-
of independent short descriptions. In KDD WS. eration with planning trajectories: The role of constraints. In
ICIDS.
Khalifa, A.; Barros, G. A.; and Togelius, J. 2017. Deeptingle.
In arXiv preprint arXiv:1705.03557. Riedl, M. O., and Young, R. M. 2010. Narrative planning:
Balancing plot and character. In JAIR.
Kiros, R.; Zhu, Y.; Salak, R.; Zemel, R.; Urtasun, R.; Torral, A.;
and Fidler, S. 2015. Skip-thought vectors. In NIPS. Rishes, E.; Lukin, S. M.; Elson, D. K.; and Walker, M. A. 2013.
Generating different story tellings from semantic representa-
Lebowitz, M. 1987. Planning stories. In CogSci. tions of narrative. In ICIDS.
Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story Roem, M.; Koba, S.; Inoue, N.; and Gordon, A. M. 2017. An
generation with crowdsourced plot graphs. In AAAI. RNN-based classifier for the story cloze test. In LSDSem.
Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; and Gao, Rose, S.; Engel, D.; Cramer, N.; and Cowley, W. 2010. Auto-
J. 2016. Deep reinforcement learning for dialogue generation. matic keyword extraction from individual documents. In Text
In EMNLP. Mining: Applications and Theory.
Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine
L.; and Pineau, J. 2016a. How not to evaluate your dialogue for short-text conversation. In ACL.
system: An empirical study of unsupervised evaluation metrics Swanson, R., and Gordon, A. 2012. Say anything: Using textual
for dialogue response generation. In EMNLP. case-based reasoning to enable open-domain interactive story-
Liu, Y.; Fu, J.; Mei, T.; and Chen, C. W. 2016b. Storytelling telling. In ACM TiiS.
of photo stream with bidirectional multi-thread recurrent neural Turner, S. R. 1994. The creative process: A computer model of
network. In arXiv preprint arXiv:1606.00625. storytelling and creativity. Psychology Press.
Martin, L.; Ammanabrolu, P.; Hancock, W.; Singh, S.; Harri- Wang, Z.; He, W.; Wu, H.; Wu, H.; Li, W.; Wang, H.; and Chen,
son, B.; and Riedl, M. 2018. Event representations for auto- E. 2016. Chinese poetry generation with planning based neural
mated story generation with deep neural nets. In AAAI. network. In COLING.
McDowell, B.; Chambers, N.; Ororbia II, A.; and Reitter, D. Wang, X.; Chen, W.; Wang, Y.-F.; and Wang, W. Y. 2018. No
2017. Event ordering with a generalized model for sieve pre- metrics are perfect: Adversarial reward learning for visual sto-
diction ranking. In IJCNLP. rytelling. In ACL.
Meehan, J. R. 1977. Tale-spin, an interactive program that Xu, J.; Zhang, Y.; Zeng, Q.; Ren, X.; Cai, X.; and Sun, X. 2018.
writes stories. In IJCAI. A skeleton-based model for promoting coherence among sen-
Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing tences in narrative story generation. In EMNLP.
and optimizing lstm language models. In ICLR. Yao, L.; Zhang, Y.; Feng, Y.; Zhao, D.; and Yan, R. 2017.
Towards implicit content-introducing for generative short-text
Montfort, N. 2006. Natural language generation and narrative
conversation systems. In EMNLP.
variation in interactive fiction. In AAAI WS.
7385