Programming With A Differentiable Forth Interpreter

¹Department of Computer Science, University College London, London, UK; ²Department of Computer Science, University of Oxford, Oxford, UK; ³Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK. Correspondence to: Matko Bošnjak <[Link]@[Link]>.

Abstract

We consider the case of prior procedural knowledge for neural networks, such as knowing how a program should traverse a sequence, but not what local actions should be performed at each step. To this end, we present an end-to-end differentiable interpreter for the programming language Forth which enables programmers to write program sketches with slots that can be filled with behaviour trained from program input-output data. We can optimise this behaviour directly through gradient descent techniques on user-specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex behaviours such as sequence sorting and addition. When connected to outputs of an LSTM and trained jointly, our interpreter achieves state-of-the-art accuracy for end-to-end reasoning about quantities expressed in natural language stories.

1. Introduction

A central goal of Artificial Intelligence is the creation of machines that learn as effectively from human instruction as they do from data. A recent and important step towards this goal is the invention of neural architectures that learn to perform algorithms akin to traditional computers, using primitives such as memory access and stack manipulation (Graves et al., 2014; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kaiser & Sutskever, 2015; Kurach et al., 2016; Graves et al., 2016). These architectures can be trained through standard gradient descent methods, and enable machines to learn complex behaviour from input-output pairs or program traces. In this context, the role of the human programmer is often limited to providing training data. However, training data is a scarce resource for many tasks. In these cases, the programmer may have partial procedural knowledge about the task: for example, in programming by demonstration (Lau et al., 2001) or query language programming (Neelakantan et al., 2015a), a user establishes a larger set of conditions on the data, and the model needs to set out the details. In all these scenarios, the question then becomes how to exploit various types of prior knowledge when learning algorithms.

To address the above question we present an approach that enables programmers to inject their procedural background knowledge into a neural network. In this approach, the programmer specifies a program sketch (Solar-Lezama et al., 2005) in a traditional programming language. This sketch defines one part of the neural network behaviour. The other part is learned using training data. The core insight that enables this approach is the fact that most programming languages can be formulated in terms of an abstract machine that executes the commands of the language. We implement these machines as neural networks, constraining parts of the networks to follow the sketched behaviour. The resulting neural programs are consistent with our prior knowledge and optimised with respect to the training data.

In this paper, we focus on the programming language Forth (Brodie, 1980), a simple yet powerful stack-based language that facilitates factoring and abstraction. Underlying Forth's semantics is a simple abstract machine. We introduce ∂4, an implementation of this machine that is differentiable with respect to the transition it executes at each time step, as well as distributed input representations. Sketches that users write define underspecified behaviour which can then be trained with backpropagation.

For two neural programming tasks introduced in previous work (Reed & de Freitas, 2015) we present Forth sketches that capture different degrees of prior knowledge. For example, we define only the general recursive structure of a sorting problem. We show that given only input-output pairs, ∂4 can learn to fill the sketch and generalise well to problems of unseen size. In addition, we apply ∂4 to the task of solving word algebra problems. We show that when provided with basic algorithmic scaffolding and trained jointly with an upstream LSTM (Hochreiter & Schmidhuber, 1997), ∂4 is able to learn to read natural
language narratives, extract important numerical quantities, and reason with these, ultimately answering corresponding mathematical questions without the need for the explicit intermediate representations used in previous work.

The contributions of our work are as follows: i) we present a neural implementation of a dual stack machine underlying Forth, ii) we introduce Forth sketches for programming with partial procedural background knowledge, iii) we apply Forth sketches as a procedural prior on learning algorithms from data, iv) we introduce program code optimisations based on symbolic execution that can speed up neural execution, and v) using Forth sketches we obtain state-of-the-art results for end-to-end reasoning about quantities expressed in natural language narratives.

1  : BUBBLE ( a1 ... an n-1 -- one pass )
2    DUP IF >R
3a     OVER OVER < IF SWAP THEN
4a     R> SWAP >R 1- BUBBLE R>
3b     { observe D0 D-1 -> permute D-1 D0 R0 }
4b     1- BUBBLE R>
3c     { observe D0 D-1 -> choose NOP SWAP }
4c     R> SWAP >R 1- BUBBLE R>
5    ELSE
6      DROP
7    THEN
8  ;
9  : SORT ( a1 .. an n -- sorted )
10   1- DUP 0 DO >R R@ BUBBLE R> LOOP DROP
11 ;
12 2 4 2 7 4 SORT \ Example call

Listing 1: Three code alternatives (white lines are common to all, coloured/lettered lines are alternative-specific): i) Bubble sort in Forth (a lines – green), ii) PERMUTE sketch (b lines – blue), and iii) COMPARE sketch (c lines – yellow).
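For readers less familiar with Forth, the fully specified variant of Listing 1 (the a lines) corresponds to the following plain Python rendering. This is only an illustrative re-implementation of ours: it mirrors the recursive pass structure of BUBBLE and the repeated passes of SORT, not the explicit stack manipulation.

    def bubble(seq):
        # One bubble pass, mirroring BUBBLE in Listing 1: compare the first two
        # items, keep the smaller in front, and recurse on the remainder, so the
        # largest item ends up at the end of the list.
        if len(seq) <= 1:
            return seq
        a, b = seq[0], seq[1]
        if a > b:                    # OVER OVER < IF SWAP THEN
            a, b = b, a
        return [a] + bubble([b] + seq[2:])

    def sort(seq):
        # SORT runs n-1 bubble passes (the DO ... LOOP on line 10).
        for _ in range(len(seq) - 1):
            seq = bubble(seq)
        return seq

    print(sort([2, 4, 2, 7]))        # analogous to: 2 4 2 7 4 SORT  ->  [2, 2, 4, 7]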
2. The Forth Abstract Machine

Forth is a simple Turing-complete stack-based programming language (ANSI, 1994; Brodie, 1980). We chose Forth as the host language of our work because i) it is an established, general-purpose high-level language relatively close to machine code, ii) it promotes highly modular programs through the use of branching, loops and function calls, thus striking a good balance between assembly and higher-level languages, and, importantly, iii) its abstract machine is simple enough for a straightforward creation of its continuous approximation. Forth's underlying abstract machine is represented by a state S = (D, R, H, c), which contains two stacks: a data evaluation pushdown stack D (data stack) holds values for manipulation, and a return address pushdown stack R (return stack) assists with return pointers and subroutine calls. These are accompanied by a heap or random memory access buffer H, and a program counter c.

A Forth program P is a sequence¹ of Forth words (i.e. commands) P = w1 ... wn. The role of a word varies, encompassing language keywords, primitives, and user-defined subroutines (e.g. DROP discards the top element of the data stack, and DUP duplicates it).² Each word wi defines a transition function between machine states, wi : S → S. Therefore, a program P itself defines a transition function by simply applying the word at the current program counter to the current state. Although usually considered a part of the heap H, we treat Forth programs P separately to ease the analysis.

¹ Forth is a concatenative language.
² In this work, we restrict ourselves to a subset of all Forth words, detailed in Appendix A.

An example of a Bubble sort algorithm implemented in Forth is shown in Listing 1 (all lines except 3b-4c). The execution starts from line 12, where literals are pushed on the data stack and SORT is called. Line 10 executes the main loop over the sequence. Lines 2-7 denote the BUBBLE procedure: the comparison of the top two stack numbers (line 3a), and the recursive call to itself (line 4a). A detailed description of how this program is executed by the Forth abstract machine is provided in Appendix B.

Notice that while Forth provides common control structures such as looping and branching, these can always be reduced to low-level code that uses jumps and conditional jumps (using the words BRANCH and BRANCH0, respectively). Likewise, we can think of subroutine definitions as labelled code blocks, and their invocation amounts to jumping to the code block with the respective label.

3. ∂4: Differentiable Abstract Machine

When a programmer writes a Forth program, they define a sequence of Forth words, i.e., a sequence of known state transition functions. In other words, the programmer knows exactly how computation should proceed. To accommodate cases where the developer's procedural background knowledge is incomplete, we extend Forth to support the definition of a program sketch. As is the case with Forth programs, sketches are sequences of transition functions. However, a sketch may contain transition functions whose behaviour is learned from data.

To learn the behaviour of transition functions within a program we would like the machine output to be differentiable with respect to these functions (and possibly representations of inputs to the program). This enables us to choose parameterised transition functions such as neural networks.

To this end, we introduce ∂4, a TensorFlow (Abadi et al., 2015) implementation of a differentiable abstract machine with continuous state representations, differentiable words and sketches. Program execution in ∂4 is modelled by a recurrent neural network (RNN), parameterised by the transition functions at each time step.
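Before turning to the continuous machine, the discrete view of Section 2 can be made concrete with a minimal Python sketch of ours (not part of ∂4): a symbolic state S = (D, R, H, c) and a handful of words written as state-transition functions, with a program being nothing more than the composition of the transitions selected by the program counter.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        D: list = field(default_factory=list)   # data stack
        R: list = field(default_factory=list)   # return stack
        H: dict = field(default_factory=dict)   # heap
        c: int = 0                               # program counter

    def DUP(s):
        s.D.append(s.D[-1])
        return s

    def SWAP(s):
        s.D[-1], s.D[-2] = s.D[-2], s.D[-1]
        return s

    def TO_R(s):          # the Forth word >R
        s.R.append(s.D.pop())
        return s

    def literal(x):       # {num}: push a literal onto the data stack
        def word(s):
            s.D.append(x)
            return s
        return word

    def run(program, state):
        # A program is a sequence of words, each a transition function S -> S.
        # Control-flow words would set state.c themselves; here we only increment.
        while state.c < len(program):
            state = program[state.c](state)
            state.c += 1
        return state

    print(run([literal(2), literal(4), SWAP, DUP], State()).D)   # [4, 2, 2]

∂4 replaces each component of this state with a continuous counterpart, and each word with a differentiable transition, as described next.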
Figure 1: Left: Neural Forth Abstract Machine. A Forth sketch P is translated to low-level code Pθ, with slots {...} substituted by parametrised neural networks. Slots are learnt from input-output examples (x, y) through the differentiable machine whose state Si comprises the low-level code, program counter c, data stack D (with pointer d), return stack R (with pointer r), and the heap H. Right: BiLSTM trained on Word Algebra Problems. Output vectors corresponding to a representation of the entire problem, as well as context representations of numbers and the numbers themselves, are fed into H to solve tasks. The entire system is end-to-end differentiable.

3.1. Machine State Encoding

We map the symbolic machine state S = (D, R, H, c) to a continuous representation S = (D, R, H, c) consisting of two differentiable stacks (with pointers), the data stack D = (D, d) and the return stack R = (R, r), a heap H, and an attention vector c indicating which word of the sketch Pθ is being executed at the current time step. Figure 1 depicts the machine together with its elements. All three memory structures, the data stack, the return stack and the heap, are based on differentiable flat memory buffers M ∈ {D, R, H}, where D, R, H ∈ R^{l×v}, for a stack size l and a value size v. Each has a differentiable read operation

    readM(a) = aᵀM

and write operation

    writeM(x, a) : M ← M − (a1ᵀ) ⊙ M + xaᵀ

akin to the Neural Turing Machine (NTM) memory (Graves et al., 2014), where ⊙ is the element-wise multiplication and a is the address pointer.³ In addition to the memory buffers D and R, the data stack and the return stack contain pointers to the current top-of-the-stack (TOS) element, d, r ∈ R^l, respectively. This allows us to implement pushing as writing a value x into M and incrementing the TOS pointer:

    pushM(x) : writeM(x, p)   (side-effect: p ← inc(p))

where p ∈ {d, r}, inc(p) = pᵀR1+, dec(p) = pᵀR1−, and R1+ and R1− are increment and decrement matrices (left and right circular shift matrices). Popping is realised by multiplying the TOS pointer and the memory buffer, and decreasing the TOS pointer:

    popM() = readM(p)   (side-effect: p ← dec(p))

Finally, the program counter c ∈ R^p is a vector that, when one-hot, points to a single word in a program of length p, and is equivalent to the c vector of the symbolic state machine.⁴ We use S to denote the space of all continuous representations S.

³ The equal widths of H and D allow us to directly move vector representations of values between the heap and the stack.
⁴ During training c can become distributed and is considered as attention over the program code.

Neural Forth Words  It is straightforward to convert Forth words, defined as functions on discrete machine states, to functions operating on the continuous space S. For example, consider the word DUP, which duplicates the top of the data stack. A differentiable version of DUP first calculates the value e on the TOS address of D, as e = dᵀD. It then shifts the stack pointer via d ← inc(d), and writes e to D using writeD(e, d). The complete description of implemented Forth words and their differentiable counterparts can be found in Appendix A.
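As a simplified, concrete illustration of these buffer operations, the following NumPy sketch implements readM, writeM, push and pop with circular-shift pointer increments, and uses them to build a differentiable DUP. This is our own toy rendering of the formulas above, not the TensorFlow code of ∂4; buffer and value sizes are arbitrary.

    import numpy as np

    l, v = 4, 5                      # stack size l, value size v

    def shift_matrix(n, k):
        # Circular shift matrix: (p @ S) rotates a one-hot pointer by k positions.
        return np.roll(np.eye(n), k, axis=1)

    R1p, R1m = shift_matrix(l, 1), shift_matrix(l, -1)   # increment / decrement

    def read(M, a):            # readM(a) = aT M
        return a @ M

    def write(M, x, a):        # writeM(x, a): M <- M - (a 1T) * M + x aT
        return M - np.outer(a, np.ones(v)) * M + np.outer(a, x)

    def push(M, p, x):         # increment the pointer, then write x at it
        p = p @ R1p
        return write(M, x, p), p

    def pop(M, p):             # read at the pointer, then decrement it
        x = read(M, p)
        return x, p @ R1m

    def dup(D, d):             # differentiable DUP: e = dT D, then push e
        e = read(D, d)
        return push(D, d, e)

    # With one-hot pointers and values, this behaves like the symbolic machine:
    D = np.zeros((l, v)); d = np.eye(l)[0]
    D, d = push(D, d, np.eye(v)[2])      # push the (one-hot encoded) value 2
    D, d = dup(D, d)                      # duplicate the top of the stack
    top, d = pop(D, d)
    print(np.argmax(top))                 # -> 2

With soft (distributed) pointers and values, the same operations remain well defined and differentiable, which is precisely what allows gradients to flow through the machine.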
3.2. Forth Sketches

We define a Forth sketch Pθ as a sequence of continuous transition functions P = w1 ... wn. Here, wi ∈ S → S either corresponds to a neural Forth word or a trainable transition function (a neural network in our case). We will call these trainable functions slots, as they correspond to underspecified "slots" in the program code that need to be filled by learned behaviour.

We allow users to define a slot w by specifying a pair of a state encoder wenc and a decoder wdec. The encoder produces a latent representation h of the current machine state using a multi-layer perceptron, and the decoder consumes this representation to produce the next machine state. We hence have w = wdec ◦ wenc. To use slots within Forth program code, we introduce a notation that reflects this decomposition. In particular, slots are defined by the syntax { encoder -> decoder } where encoder and decoder are specifications of the corresponding slot parts as described below.

Encoders  We provide the following options for encoders:

static produces a static representation, independent of the actual machine state.
observe e1 ... em: concatenates the elements e1 ... em of the machine state. An element can be a stack item Di at relative index i, a return stack item Ri, etc.
linear N, sigmoid, tanh represent chained transformations, which enable the multilayer perceptron architecture. linear N projects to N dimensions, and sigmoid and tanh apply the same-named functions elementwise.

Decoders  Users can specify the following decoders:

choose w1 ... wm: chooses from the Forth words w1 ... wm. Takes an input vector h of length m to produce a weighted combination of machine states Σ_{i=1..m} hi wi(S).
manipulate e1 ... em: directly manipulates the machine state elements e1 ... em by writing the appropriately reshaped and softmaxed output of the encoder over the machine state elements with writeM.
permute e1 ... em: permutes the machine state elements e1 ... em via a linear combination of m! state vectors.

3.3. The Execution RNN

We model execution using an RNN which produces a state Sn+1 conditioned on a previous state Sn. It does so by first passing the current state to each function wi in the program, and then weighing each of the produced next states by the component of the program counter vector ci that corresponds to program index i, effectively using c as an attention vector over the code. Formally we have:

    Sn+1 = RNN(Sn, Pθ) = Σi ci wi(Sn)

Figure 2: ∂4 segment of the RNN execution of the Forth sketch shown in blue in Listing 1. The pointers (d, r) and values (rows of R and D) are all in one-hot state (colours simply denote values observed, defined by the top scale), while the program counter maintains the uncertainty. Subsequent states are discretised for clarity. Here, the slot {...} has learned its optimal behaviour.

3.4. Program Code Optimisations

The ∂4 RNN requires one time step per transition. After each time step, the program counter is either incremented, decremented, explicitly set or popped from the stack. In turn, a new machine state is calculated by executing all words in the program and then weighting the resulting states by the program counter. As this is expensive, it is advisable to avoid full RNN steps wherever possible. We use two strategies to avoid full RNN steps and significantly speed up ∂4: symbolic execution and interpolation of if-branches.

Symbolic Execution  Whenever we have a sequence of Forth words that contains no branch entry or exit points, we can collapse this sequence into a single transition instead of naively interpreting words one-by-one. We symbolically execute (King, 1976) a sequence of Forth words to calculate a new machine state. We then use the difference between the new and the initial state to derive the transition function of the sequence. For example, the sequence R> SWAP >R, which swaps the top elements of the data and the return stack, yields the symbolic state D = r1 d2 ... dl and R = d1 r2 ... rl. Comparing it to the initial state, we derive a single neural transition that only needs to swap the top elements of D and R.
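The weighted execution step of Section 3.3, and the benefit of collapsing branch-free segments, can be sketched in a few lines of NumPy. This is an illustration of the idea of ours, not the actual ∂4 code: the toy state is a pair of flat vectors rather than the full stack/heap/counter state, and each "word" is simply a function from state to state. A pre-collapsed sequence such as R> SWAP >R would enter this step as a single candidate transition.

    import numpy as np

    def rnn_step(state, words, c):
        # One ∂4-style execution step: run every word on the current state and
        # mix the successor states with the program-counter attention c.
        successors = [w(state) for w in words]
        return {k: sum(ci * s[k] for ci, s in zip(c, successors))
                for k in state}

    # Toy state: a 3-slot "data stack" vector plus a pointer, kept flat for brevity.
    state = {"D": np.array([5.0, 2.0, 0.0]), "d": np.array([0.0, 1.0, 0.0])}

    def noop(s):
        return {k: v.copy() for k, v in s.items()}

    def swap_top_two(s):            # e.g. a collapsed, branch-free transition
        s = noop(s)
        s["D"][[0, 1]] = s["D"][[1, 0]]
        return s

    # A one-hot program counter recovers ordinary symbolic execution;
    # a distributed counter yields an interpolation of the candidate states.
    print(rnn_step(state, [noop, swap_top_two], c=[0.0, 1.0])["D"])   # [2. 5. 0.]
    print(rnn_step(state, [noop, swap_top_two], c=[0.5, 0.5])["D"])   # [3.5 3.5 0.]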
The model consists of the ∂4 sketch and a Bidirectional LSTM (BiLSTM) reader.

The BiLSTM reader reads the text of the problem and produces a vector representation (word vectors) for each word, concatenated from the forward and the backward pass of the BiLSTM network. We use the resulting word vectors corresponding only to numbers in the text, the numerical values of those numbers (encoded as one-hot vectors), and a vector representation of the whole problem (the concatenation of the last and the first vector of the opposite passes) to initialise the ∂4 heap H. This is done in an end-to-end fashion, enabling gradient propagation through the BiLSTM to the vector representations. The process is depicted in Figure 1.

The sketch, depicted in Listing 3, dictates the differentiable computation.⁵ First, it copies values from the heap H: word vectors to the return stack R, and numbers (as one-hot vectors) onto the data stack D. Second, it contains four slots that define the space of all possible operations of four operators on three operands, all conditioned on the vector representations on the return stack. These slots are i) a permutation of the elements on the data stack, ii) the choice of the first operator, iii) possibly swapping the intermediate result and the last operand, and iv) the choice of the second operator. The final set of commands simply empties out the return stack R. These slots define the space of possible operations; however, the model needs to learn how to utilise these operations in order to calculate the correct result.

⁵ Due to space constraints, we present the core of the sketch here. For the full sketch, please refer to Listing 4 in the Appendix.

\ first copy data from H: vectors to R and numbers to D
{ observe R0 R-1 R-2 R-3 -> permute D0 D-1 D-2 }
{ observe R0 R-1 R-2 R-3 -> choose + - * / }
{ observe R0 R-1 R-2 R-3 -> choose SWAP NOP }
{ observe R0 R-1 R-2 R-3 -> choose + - * / }
\ lastly, empty out the return stack

Listing 3: Core of the Word Algebra Problem sketch. The full sketch can be found in the Appendix.

Results  We evaluate the model on the Common Core (CC) dataset, introduced by Roy & Roth (2015). CC is notable for having the most diverse set of equation patterns, consisting of four operators (+, −, ×, ÷), with up to three operands.

We compare against three baseline systems: (1) a local classifier with hand-crafted features (Roy & Roth, 2015), (2) a Seq2Seq baseline, and (3) the same model with a data generation component (GeNeRe) (Bouchard et al., 2016). All baselines are trained to predict the best equation, which is executed outside of the model to obtain the answer. In contrast, ∂4 is trained end-to-end from input-output pairs and predicts the answer directly without the need for an intermediate symbolic representation of a formula.

Table 3: Accuracies of models on the CC dataset. The asterisk denotes results obtained from Bouchard et al. (2016). Note that GeNeRe makes use of additional data.

Model                                    Accuracy (%)
Template Mapping
  Roy & Roth (2015)                      55.5
  Seq2Seq* (Bouchard et al., 2016)       95.0
  GeNeRe* (Bouchard et al., 2016)        98.5
Fully End-to-End
  ∂4                                     96.0

Results are shown in Table 3. All RNN-based methods (the bottom three) outperform the classifier-based approach. Our method slightly outperforms the Seq2Seq baseline, achieving the highest reported result on this dataset without data augmentation.

5. Discussion

∂4 bridges the gap between a traditional programming language and a modern machine learning architecture. However, as we have seen in our evaluation experiments, faithfully simulating the underlying abstract machine architecture introduces its own unique set of challenges.

One such challenge is the additional complexity of performing even simple tasks when they are viewed in terms of operations on the underlying machine state. As illustrated in Table 1, ∂4 sketches can be effectively trained from small training sets (see Appendix C.1), and generalise perfectly to sequences of any length. However, difficulty arises when training from sequences of modest lengths. Even when dealing with relatively short training sequences, and with the program code optimisations employed, the underlying machine can unroll into a problematically large number of states. For problems whose machine execution is quadratic, like the sorting task (which for input sequences of length 4 has 120 machine states), we observe significant instabilities during training from backpropagating through such long RNN sequences, and consequent failures to train. In comparison, the addition problem was easier to train due to its comparatively shorter underlying execution RNN.

The higher degree of prior knowledge provided played an important role in successful learning. For example, the COMPARE sketch, which provides more structure, achieves higher accuracies when trained on longer sequences. Similarly, employing a softmax on the directly manipulated memory elements enabled perfect training for the MANIPULATE sketch for addition. Furthermore, it is encouraging to see that ∂4 can be trained jointly with an upstream LSTM to provide strong procedural prior knowledge for solving a real-world NLP task.
6. Related Work

Program Synthesis  The idea of program synthesis is as old as Artificial Intelligence, and has a long history in computer science (Manna & Waldinger, 1971). Whereas a large body of work has focused on using genetic programming (Koza, 1992) to induce programs from a given input-output specification (Nordin, 1997), there are also various Inductive Programming approaches (Kitzelmann, 2009) aimed at inducing programs from incomplete specifications of the code to be implemented (Albarghouthi et al., 2013; Solar-Lezama et al., 2006). We tackle the same problem of sketching, but in our case, we fill the sketches with neural networks able to learn the slot behaviour.

Probabilistic and Bayesian Programming  Our work is closely related to probabilistic programming languages such as Church (Goodman et al., 2008). They allow users to inject random choice primitives into programs as a way to define generative distributions over possible execution traces. In a sense, the random choice primitives in such languages correspond to the slots in our sketches. A core difference lies in the way we train the behaviour of slots: instead of calculating their posteriors using probabilistic inference, we estimate their parameters using backpropagation and gradient descent. This is similar in kind to TerpreT's FMGD algorithm (Gaunt et al., 2016), which is employed for code induction via backpropagation. In comparison, our model, which optimises slots of neural networks surrounded by continuous approximations of code, enables the combination of procedural behaviour and neural networks. In addition, the underlying programming and probabilistic paradigm in these languages is often functional and declarative, whereas our approach focuses on a procedural and discriminative view. By using an end-to-end differentiable architecture, it is easy to seamlessly connect our sketches to further neural input and output modules, such as an LSTM that feeds into the machine heap.

Neural approaches  Recently, there has been a surge of research in program synthesis and execution in deep learning, with increasingly elaborate deep models. Many of these models were based on differentiable versions of abstract data structures (Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kurach et al., 2016), and a few abstract machines, such as the NTM (Graves et al., 2014), Differentiable Neural Computers (Graves et al., 2016), and Neural GPUs (Kaiser & Sutskever, 2015). All these models are able to induce algorithmic behaviour from training data. Our work differs in that our differentiable abstract machine allows us to seamlessly integrate code and neural networks, and to train the neural networks specified by slots via backpropagation. Related to our efforts is also Autograd (Maclaurin et al., 2015), which enables automatic gradient computation in pure Python code, but does not define nor use differentiable access to its underlying abstract machine.

The work on neural approximations to abstract structures and machines naturally leads to more elaborate machinery able to induce and call code or code-like behaviour. Neelakantan et al. (2015a) learned simple SQL-like behaviour: querying tables from natural language with simple arithmetic operations. Although sharing similarities on a high level, the primary goal of our model is not induction of (fully expressive) code but its injection. Andreas et al. (2016) learn to compose neural modules to produce the desired behaviour for a visual QA task. Neural Programmer-Interpreters (Reed & de Freitas, 2015) learn to represent and execute programs, operating on different modes of an environment, and are able to incorporate decisions better captured in a neural network than in many lines of code (e.g. using an image as an input). Users inject prior procedural knowledge by training on program traces and hence require full procedural knowledge. In contrast, we enable users to use their partial knowledge in sketches.

Neural approaches to language compilation have also been researched, from compiling a language into neural networks (Siegelmann, 1994), over building neural compilers (Gruau et al., 1995), to adaptive compilation (Bunel et al., 2016). However, that line of research did not perceive neural interpreters and compilers as a means of injecting procedural knowledge as we do. To the best of our knowledge, ∂4 is the first working neural implementation of an abstract machine for an actual programming language, and this enables us to inject such priors in a straightforward manner.

7. Conclusion and Future Work

We have presented ∂4, a differentiable abstract machine for the Forth programming language, and showed how it can be used to complement programmers' prior knowledge through the learning of unspecified behaviour in Forth sketches. The ∂4 RNN successfully learns to sort, add, and solve word algebra problems, using only program sketches and program input-output pairs. We believe ∂4, and the larger paradigm it helps establish, will be useful for addressing complex problems where low-level representations of the input are necessary, but higher-level reasoning is difficult to learn and potentially easier to specify.

In future work, we plan to apply ∂4 to other problems in the NLP domain, such as machine reading and knowledge base inference. In the long term, we see the integration of non-differentiable transitions (such as those arising when interacting with a real environment) as an exciting future direction which sits at the intersection of reinforcement learning and probabilistic programming.
References

Bunel, Rudy, Desmaison, Alban, Kohli, Pushmeet, Torr, Philip H. S., and Kumar, M. Pawan. Adaptive Neural Compilation. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016.

Gaunt, Alexander L., Brockschmidt, Marc, Singh, Rishabh, Kushman, Nate, Kohli, Pushmeet, Taylor, Jonathan, and Tarlow, Daniel. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428, 2016.

Kitzelmann, Emanuel. Inductive Programming: A Survey of Program Synthesis Techniques. In International Workshop on Approaches and Applications of Inductive Programming, pp. 50–73, 2009.

Koncel-Kedziorski, Rik, Hajishirzi, Hannaneh, Sabharwal, Ashish, Etzioni, Oren, and Ang, Siena. Parsing Algebraic Word Problems into Equations. Transactions of the Association for Computational Linguistics (TACL), 3:585–597, 2015.

Koza, John R. Genetic Programming: On the Programming of Computers by Means of Natural Selection, volume 1. MIT Press, 1992.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural Random-Access Machines. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Kushman, Nate, Artzi, Yoav, Zettlemoyer, Luke, and Barzilay, Regina. Learning to Automatically Solve Algebra Word Problems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271–281, 2014.

Lau, Tessa, Wolfman, Steven A., Domingos, Pedro, and Weld, Daniel S. Learning Repetitive Text-Editing Procedures with SMARTedit. In Your Wish is My Command, pp. 209–226. Morgan Kaufmann Publishers Inc., 2001.

Siegelmann, Hava T. Neural Programming Language. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pp. 877–882, 1994.

Solar-Lezama, Armando, Rabbah, Rodric, Bodík, Rastislav, and Ebcioğlu, Kemal. Programming by Sketching for Bit-streaming Programs. In Proceedings of Programming Language Design and Implementation (PLDI), pp. 281–294, 2005.

Solar-Lezama, Armando, Tancau, Liviu, Bodik, Rastislav, Seshia, Sanjit, and Saraswat, Vijay. Combinatorial Sketching for Finite Programs. In ACM SIGPLAN Notices, volume 41, pp. 404–415, 2006.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.
Appendix
A. Forth Words and their implementation
We implemented a small subset of available Forth words in ∂4. The table of these words, together with their descriptions, is given in Table 4, and their implementation is given in Table 5. The commands are roughly divided into 7 groups. These
groups, line-separated in the table, are:
Data stack operations {num}, 1+, 1-, DUP, SWAP, OVER, DROP, +, -, *, /
Heap operations @, !
Comparators >, <, =
Return stack operations >R, R>, @R
Control statements IF..ELSE..THEN, BEGIN..WHILE..REPEAT, DO..LOOP
Subroutine control :, {sub}, ;, MACRO
Variable creation VARIABLE, CREATE..ALLOT
Table 4: Forth words and their descriptions. TOS denotes top-of-stack, NOS denotes next-on-stack, DSTACK denotes the
data stack, RSTACK denotes the return stack, and HEAP denotes the heap.
Forth Word Description
{num} Pushes {num} to DSTACK.
1+ Increments DSTACK TOS by 1.
1- Decrements DSTACK TOS by 1.
DUP Duplicates DSTACK TOS.
SWAP Swaps TOS and NOS.
OVER Copies NOS and pushes it on the TOS.
DROP Pops the TOS (non-destructive).
+, -, *, / Consumes DSTACK NOS and TOS. Returns NOS operator TOS.
@ Fetches the HEAP value from the DSTACK TOS address.
! Stores DSTACK NOS to the DSTACK TOS address on the HEAP.
>, <, = Consumes DSTACK NOS and TOS.
Returns 1 (TRUE) if NOS > | < | = TOS respectively, 0 (FALSE) otherwise.
>R Pushes DSTACK TOS to RSTACK TOS, removes it from DSTACK.
R> Pushes RSTACK TOS to DSTACK TOS, removes it from RSTACK.
@R Copies the RSTACK TOS TO DSTACK TOS.
IF..ELSE..THEN Consumes DSTACK TOS; if it is a non-zero number (TRUE), executes
commands between IF and ELSE. Otherwise executes commands between
ELSE and THEN.
BEGIN..WHILE..REPEAT Continually executes commands between WHILE and REPEAT while the code
between BEGIN and WHILE evaluates to a non-zero number (TRUE).
DO..LOOP Consumes NOS and TOS, assumes NOS as a limit, and TOS as a current index.
Increases index by 1 until equal to NOS. At every increment, executes commands
between DO and LOOP.
: Denotes the subroutine, followed by a word defining it.
{sub} Subroutine invocation, puts the program counter PC on RSTACK, sets PC to the
subroutine address.
; Subroutine exit. Consumes TOS from the RSTACK and sets the PC to it.
MACRO Treats the subroutine as a macro function.
VARIABLE Creates a variable with a fixed address. Invoking the variable name returns its
address.
CREATE..ALLOT Creates a variable with a fixed address and does not allocate the next N addresses to
any other variable (effectively reserving that portion of the heap for the variable).
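To make the word semantics in Table 4 concrete, the following toy Python interpreter (our own illustration, covering only a handful of data/return-stack and heap words and none of the control flow or subroutine machinery) executes a flat word sequence against a data stack, a return stack, and a heap.

    def run(words):
        """Interpret a flat sequence of some of the Table 4 words."""
        dstack, rstack, heap = [], [], {}
        ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
               "*": lambda a, b: a * b, "/": lambda a, b: a // b}  # integer division assumed
        for w in words:
            if isinstance(w, int):              # {num}: push a literal
                dstack.append(w)
            elif w == "DUP":
                dstack.append(dstack[-1])
            elif w == "SWAP":
                dstack[-1], dstack[-2] = dstack[-2], dstack[-1]
            elif w == "OVER":
                dstack.append(dstack[-2])
            elif w == "DROP":
                dstack.pop()
            elif w in ops:                      # NOS {op} TOS
                tos, nos = dstack.pop(), dstack.pop()
                dstack.append(ops[w](nos, tos))
            elif w == ">R":
                rstack.append(dstack.pop())
            elif w == "R>":
                dstack.append(rstack.pop())
            elif w == "@R":
                dstack.append(rstack[-1])
            elif w == "!":                      # store NOS at HEAP address TOS
                addr, val = dstack.pop(), dstack.pop()
                heap[addr] = val
            elif w == "@":                      # fetch HEAP value at address TOS
                dstack.append(heap[dstack.pop()])
            else:
                raise ValueError("unsupported word: %r" % w)
        return dstack

    # (3 + 4) * 2, written postfix as in Forth:
    print(run([3, 4, "+", 2, "*"]))   # [14]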
Table 5: Implementation of the Forth words described in Table 4. Note that the variable creation words are implemented as fixed address allocation, and MACRO words are implemented with inlining.

Symbol          Explanation
M               Stack, M ∈ {D, R}
M               Memory buffer, M ∈ {D, R, H}
p               Pointer, p ∈ {d, r, c}
R1±             Increment and decrement matrices (circular shift): R1±_ij = 1 if i ± 1 ≡ j (mod n), 0 otherwise
R+, R−, R*, R/  Circular arithmetic operation tensors: R{op}_ijk = 1 if i {op} j ≡ k (mod n), 0 otherwise

Pointer and value manipulation          Expression
Increment a (or value x)                inc(a) = aᵀR1+
Decrement a (or value x)                dec(a) = aᵀR1−
Algebraic operation application         {op}(a, b) = aᵀR{op} b
Conditional jump                        jump(c, a): p = (popD() = TRUE), c ← p c + (1 − p) a
Next on stack a−1                       a−1 = aᵀR1−

Buffer manipulation
READ from M                             readM(a) = aᵀM
WRITE to M                              writeM(x, a): M ← M − (a ⊗ 1) ⊙ M + x ⊗ a
PUSH x onto M                           pushM(x): writeM(x, p)  [side-effect: p ← inc(p)]
POP an element from M                   popM() = readM(p)  [side-effect: p ← dec(p)]

Forth Word
Literal x               pushD(x)
1+                      writeD(inc(readD(d)), d)
1-                      writeD(dec(readD(d)), d)
DUP                     pushD(readD(d))
SWAP                    x = readD(d), y = readD(d−1): writeD(d, y), writeD(d−1, x)
OVER                    pushD(readD(d−1))
DROP                    popD()
+, -, *, /              writeD({op}(readD(d−1), readD(d)), d)
@                       readH(d)
!                       writeH(d, d−1)
<                       SWAP >
>                       e1 = Σ_{i=0..n−1} i · d_i,  e2 = Σ_{i=0..n−1} i · d−1_i,
                        p = φpwl(e1 − e2), where φpwl(x) = min(max(0, x + 0.5), 1); push p · 1 + (1 − p) · 0
=                       p = φpwl(d, d−1); push p · 1 + (1 − p) · 0
>R                      pushR(d)
R>                      popR()
@R                      writeD(d, readR(r))
IF..1 ELSE..2 THEN      p = (popD() = 0); p ∗ ..1 + (1 − p) ∗ ..2
BEGIN..1 WHILE..2 REPEAT   ..1 jump(c, ..2)
DO..LOOP                start = c, current = inc(popD()), limit = popD();
                        p = (current = limit); jump(p, ..), jump(c, start)
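To make the piecewise-linear comparison φpwl above concrete, here is a small NumPy sketch (our own illustration, using the one-hot value encoding from Section 3.1) of a soft comparison: it converts two one-hot values to their expected integers, applies φpwl to the difference, and pushes a weighted mixture of the encodings of 1 and 0. The value size n and the test values are arbitrary.

    import numpy as np

    n = 10                                   # value size: integers 0..9 as one-hot vectors

    def phi_pwl(x):
        # Piecewise-linear "soft step": 0 below -0.5, 1 above 0.5, linear in between.
        return np.minimum(np.maximum(0.0, x + 0.5), 1.0)

    def one_hot(i):
        v = np.zeros(n); v[i] = 1.0
        return v

    def expectation(v):
        # e = sum_i i * v_i, the expected integer under the (soft) one-hot encoding v.
        return float(np.dot(np.arange(n), v))

    def soft_greater(a, b):
        # Soft truth value of a > b, returned as a mixture of the encodings of 1 and 0.
        # In ∂4 the inputs would be the top two data stack values.
        p = phi_pwl(expectation(a) - expectation(b))
        return p * one_hot(1) + (1.0 - p) * one_hot(0)

    print(np.argmax(soft_greater(one_hot(7), one_hot(3))))   # 1  (7 > 3)
    print(np.argmax(soft_greater(one_hot(2), one_hot(5))))   # 0  (2 < 5)

Because φpwl is piecewise linear, the comparison remains differentiable almost everywhere while still producing crisp 0/1 outputs once the compared expectations differ by more than one half.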
⁶ Note that Forth uses Reverse Polish Notation and that the top of the data stack is 4 in this example.
We find that when stronger prior knowledge is provided (COMPARE), the model quickly maximises the training accuracy. Providing less structure (PERMUTE) results in lower testing accuracy initially; however, both sketches learn the correct behaviour and generalise equally well after seeing 256 training instances. Additionally, it is worth noting that the PERMUTE sketch was not always able to converge to a result of the correct length, and both sketches are not trivial to train.

In comparison, the Seq2Seq baseline is able to generalise only to the sequence length it was trained on (Seq2Seq trained and tested on sequence length 3). When training it on sequence length 3, and testing it on a much longer sequence length of 8, the Seq2Seq baseline is not able to achieve more than 45% accuracy.

Figure 3: Accuracy of models for varying numbers of training examples, trained on input sequences of length 3 for the Bubble sort task. Compare, permute, and Seq2Seq (test 8) were tested on sequence length 8, and Seq2Seq (test 3) was tested on sequence length 3.

Figure 4: Accuracy of models for varying numbers of training examples, trained on input sequences of length 8 for the addition task. Manipulate, choose, and Seq2Seq (test 16) were tested on sequence length 16, and Seq2Seq (test 8) was tested on sequence length 8.

C.2. Program Code Optimisations

We measure the runtime of Bubble sort on sequences of varying length with and without the optimisations described in Section 3.4. The results of ten repeated runs are shown in Figure 5 and demonstrate large relative improvements for symbolic execution and interpolation of if-branches compared to non-optimised ∂4 code.

Figure 5: Relative speed improvements of program code optimisations for different input sequence lengths.
are expected on the data stack. After the length (n − 1) is duplicated for further use with DUP, the machine tests whether it is non-zero (using IF, which consumes the TOS during the check). If n − 1 > 0, it is stored on the R stack for future use (line 2).

At this point (line 3b) the programmer only knows that a decision must be made based on the top two data stack elements D0 and D-1 (the comparison elements), and the top of the return stack, R0 (the length decremented by 1). Here the precise nature of this decision is unknown, but it is limited to variants of permutation of these elements, the output of which produces the input state to the decrement 1- and the recursive BUBBLE call (line 4b). At the culmination of the call, R0, the output of the learned slot behaviour, is moved onto the data stack using R>, and execution proceeds to the next step.

Figure 2 illustrates how portions of this sketch are executed on the ∂4 RNN. The program counter initially resides at >R (line 3 in P), as indicated by the vector c, next to program P. Both data and return stacks are partially filled (R has 1 element, D has 4), and we show their content both through horizontal one-hot vectors and their corresponding integer values (colour coded). The vectors d and r point to the top of both stacks, and are in a one-hot state as well. In this execution trace, the slot at line 4 is already showing optimal behaviour: it remembers that the element on the return stack (4) is larger and executes BUBBLE on the remaining sequence with the counter n subtracted by one, to 1.

E. Experimental details

The parameters of each sketch are trained using Adam (Kingma & Ba, 2015), with gradient clipping (set to 1.0) and gradient noise (Neelakantan et al., 2015b). We tuned the learning rate, batch size, and the parameters of the gradient noise in a random search on a development variant of each task.

E.2. Sorting

The Permute and Compare sketches in Table 1 were trained on a randomly generated train, development and test set containing 256, 32 and 32 instances, respectively. Note that the low number of dev and test instances was due to the computational complexity of the sketch.

The batch size was set to a value between 64 and 16, depending on the problem size, and we used an initial learning rate of 1.0.

E.3. Addition

We trained the addition Choose and Manipulate sketches presented in Table 2 on randomly generated train, development and test sets of sizes 512, 256, and 1024, respectively. The batch size was set to 16, and we used an initial learning rate of 0.05.

E.4. Word Algebra Problem

The Common Core (CC) dataset (Roy & Roth, 2015) is partitioned into a train, dev, and test set containing 300, 100, and 200 questions, respectively. The batch size was set to 50, and we used an initial learning rate of 0.02. The BiLSTM word vectors were initialised randomly to vectors of length 75. The stack width was set to 150 and the stack size to 5.

F. Qualitative Analysis of BubbleSort PC Traces

In Figure 6 we visualise the program counter traces. The trace follows a single example from the start, to the middle, and the end of the training process. In the beginning of training, the program counter starts to deviate from the one-hot representation in the first 20 steps (not observed in the figure due to unobservable changes), and after two iterations of SORT, ∂4 fails to correctly determine the next word. After a few training epochs ∂4 learns better permutations which enable the algorithm to take crisp decisions and halt in the correct state.
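A minimal NumPy illustration of the optimisation recipe described in Appendix E (Adam with gradient clipping and additive gradient noise) is given below, applied to a toy quadratic objective. The clipping threshold mirrors the text; the objective, the noise schedule, and the learning rate shown here are placeholders of our own and not the values used for any particular sketch.

    import numpy as np

    def train(grad_fn, theta, steps=500, lr=0.05, clip=1.0, noise_eta=0.01):
        """Adam with gradient clipping and decaying Gaussian gradient noise."""
        m = np.zeros_like(theta); v = np.zeros_like(theta)
        beta1, beta2, eps = 0.9, 0.999, 1e-8
        rng = np.random.default_rng(0)
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            g = g + rng.normal(0.0, noise_eta / t ** 0.55, size=g.shape)  # gradient noise
            norm = np.linalg.norm(g)
            if norm > clip:                                               # gradient clipping
                g = g * (clip / norm)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Toy objective: f(theta) = ||theta - 3||^2, with gradient 2 (theta - 3).
    theta = train(lambda th: 2.0 * (th - 3.0), np.zeros(4))
    print(np.round(theta, 2))   # close to [3. 3. 3. 3.]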
Figure 6: Program counter trace visualisation (image omitted in this version). The vertical axis of the figure lists low-level code lines of the sketch (e.g. REPR_BUFFER REPR !, NUM_BUFFER NUM !, CURRENT_NUM, STEP_NUM, R> DROP) and the horizontal axis the execution time steps.