Programming with a Differentiable Forth Interpreter

Matko Bošnjak 1 Tim Rocktäschel 2 Jason Naradowsky 3 Sebastian Riedel 1

1 Department of Computer Science, University College London, London, UK  2 Department of Computer Science, University of Oxford, Oxford, UK  3 Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK. Correspondence to: Matko Bošnjak <[Link]@[Link]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Given that in practice training data is scarce for all but a small set of problems, a core question is how to incorporate prior knowledge into a model. In this paper, we consider the case of prior procedural knowledge for neural networks, such as knowing how a program should traverse a sequence, but not what local actions should be performed at each step. To this end, we present an end-to-end differentiable interpreter for the programming language Forth which enables programmers to write program sketches with slots that can be filled with behaviour trained from program input-output data. We can optimise this behaviour directly through gradient descent techniques on user-specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex behaviours such as sequence sorting and addition. When connected to outputs of an LSTM and trained jointly, our interpreter achieves state-of-the-art accuracy for end-to-end reasoning about quantities expressed in natural language stories.

1. Introduction

A central goal of Artificial Intelligence is the creation of machines that learn as effectively from human instruction as they do from data. A recent and important step towards this goal is the invention of neural architectures that learn to perform algorithms akin to traditional computers, using primitives such as memory access and stack manipulation (Graves et al., 2014; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kaiser & Sutskever, 2015; Kurach et al., 2016; Graves et al., 2016). These architectures can be trained through standard gradient descent methods, and enable machines to learn complex behaviour from input-output pairs or program traces. In this context, the role of the human programmer is often limited to providing training data. However, training data is a scarce resource for many tasks. In these cases, the programmer may have partial procedural background knowledge: one may know the rough structure of the program, or how to implement several subroutines that are likely necessary to solve the task. For example, in programming by demonstration (Lau et al., 2001) or query language programming (Neelakantan et al., 2015a) a user establishes a larger set of conditions on the data, and the model needs to set out the details. In all these scenarios, the question then becomes how to exploit various types of prior knowledge when learning algorithms.

To address the above question we present an approach that enables programmers to inject their procedural background knowledge into a neural network. In this approach, the programmer specifies a program sketch (Solar-Lezama et al., 2005) in a traditional programming language. This sketch defines one part of the neural network behaviour. The other part is learned using training data. The core insight that enables this approach is the fact that most programming languages can be formulated in terms of an abstract machine that executes the commands of the language. We implement these machines as neural networks, constraining parts of the networks to follow the sketched behaviour. The resulting neural programs are consistent with our prior knowledge and optimised with respect to the training data.

In this paper, we focus on the programming language Forth (Brodie, 1980), a simple yet powerful stack-based language that facilitates factoring and abstraction. Underlying Forth's semantics is a simple abstract machine. We introduce ∂4, an implementation of this machine that is differentiable with respect to the transition it executes at each time step, as well as to distributed input representations. Sketches that users write define underspecified behaviour which can then be trained with backpropagation.

For two neural programming tasks introduced in previous work (Reed & de Freitas, 2015) we present Forth sketches that capture different degrees of prior knowledge. For example, we define only the general recursive structure of a sorting problem. We show that given only input-output pairs, ∂4 can learn to fill the sketch and generalise well to problems of unseen size. In addition, we apply ∂4 to the task of solving word algebra problems. We show that when provided with basic algorithmic scaffolding and trained jointly with an upstream LSTM (Hochreiter & Schmidhuber, 1997), ∂4 is able to learn to read natural
language narratives, extract important numerical quantities, and reason with these, ultimately answering corresponding mathematical questions without the need for the explicit intermediate representations used in previous work.

The contributions of our work are as follows: i) we present a neural implementation of a dual stack machine underlying Forth, ii) we introduce Forth sketches for programming with partial procedural background knowledge, iii) we apply Forth sketches as a procedural prior on learning algorithms from data, iv) we introduce program code optimisations based on symbolic execution that can speed up neural execution, and v) using Forth sketches we obtain state-of-the-art results for end-to-end reasoning about quantities expressed in natural language narratives.

1  : BUBBLE ( a1 ... an n-1 -- one pass )
2    DUP IF >R
3a     OVER OVER < IF SWAP THEN
4a     R> SWAP >R 1- BUBBLE R>
3b     { observe D0 D-1 -> permute D-1 D0 R0 }
4b     1- BUBBLE R>
3c     { observe D0 D-1 -> choose NOP SWAP }
4c     R> SWAP >R 1- BUBBLE R>
5    ELSE
6      DROP
7    THEN
8  ;
9  : SORT ( a1 .. an n -- sorted )
10   1- DUP 0 DO >R R@ BUBBLE R> LOOP DROP
11 ;
12 2 4 2 7 4 SORT \ Example call

Listing 1: Three code alternatives (white lines are common to all, coloured/lettered lines are alternative-specific): i) Bubble sort in Forth (a lines – green), ii) PERMUTE sketch (b lines – blue), and iii) COMPARE sketch (c lines – yellow).
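For readers less familiar with Forth, the following short Python sketch (our own illustration, not part of ∂4) mirrors what the fully specified variant of Listing 1 (the "a" lines) computes: BUBBLE performs one pass of adjacent comparisons and swaps, and SORT repeats that pass n−1 times. The stack-based Forth version manipulates the data and return stacks instead of list indices.

```python
def bubble(seq):
    # one bubble pass: compare adjacent elements and swap them when out of order
    for i in range(len(seq) - 1):
        if seq[i] > seq[i + 1]:
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq

def sort(seq):
    # the outer loop of Listing 1 (line 10): call BUBBLE n-1 times
    for _ in range(len(seq) - 1):
        bubble(seq)
    return seq

print(sort([2, 4, 2, 7]))  # [2, 2, 4, 7]
```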
2. The Forth Abstract Machine
Forth is a simple Turing-complete stack-based programming language (ANSI, 1994; Brodie, 1980). We chose Forth as the host language of our work because i) it is an established, general-purpose high-level language relatively close to machine code, ii) it promotes highly modular programs through the use of branching, loops and function calls, thus striking a good balance between assembly and higher-level languages, and, importantly, iii) its abstract machine is simple enough for a straightforward creation of its continuous approximation. Forth's underlying abstract machine is represented by a state S = (D, R, H, c), which contains two stacks: a data evaluation pushdown stack D (data stack) holds values for manipulation, and a return address pushdown stack R (return stack) assists with return pointers and subroutine calls. These are accompanied by a heap or random memory access buffer H, and a program counter c.

A Forth program P is a sequence1 of Forth words (i.e. commands) P = w1 ... wn. The role of a word varies, encompassing language keywords, primitives, and user-defined subroutines (e.g. DROP discards the top element of the data stack, and DUP duplicates the top element of the data stack).2 Each word wi defines a transition function between machine states, wi : S → S. Therefore, a program P itself defines a transition function by simply applying the word at the current program counter to the current state. Although usually considered as a part of the heap H, we consider Forth programs P separately to ease the analysis.

An example of a Bubble sort algorithm implemented in Forth is shown in Listing 1 (everything except lines 3b-4c). The execution starts from line 12, where literals are pushed on the data stack and SORT is called. Line 10 executes the main loop over the sequence. Lines 2-7 denote the BUBBLE procedure – the comparison of the top two stack numbers (line 3a), and the recursive call to itself (line 4a). A detailed description of how this program is executed by the Forth abstract machine is provided in Appendix B. Notice that while Forth provides common control structures such as looping and branching, these can always be reduced to low-level code that uses jumps and conditional jumps (using the words BRANCH and BRANCH0, respectively). Likewise, we can think of subroutine definitions as labelled code blocks, and their invocation amounts to jumping to the code block with the respective label.

1 Forth is a concatenative language.
2 In this work, we restrict ourselves to a subset of all Forth words, detailed in Appendix A.

3. ∂4: Differentiable Abstract Machine

When a programmer writes a Forth program, they define a sequence of Forth words, i.e., a sequence of known state transition functions. In other words, the programmer knows exactly how computation should proceed. To accommodate cases in which the developer's procedural background knowledge is incomplete, we extend Forth to support the definition of a program sketch. As is the case with Forth programs, sketches are sequences of transition functions. However, a sketch may contain transition functions whose behaviour is learned from data.

To learn the behaviour of transition functions within a program, we would like the machine output to be differentiable with respect to these functions (and possibly to representations of inputs to the program). This enables us to choose parameterised transition functions such as neural networks.

To this end, we introduce ∂4, a TensorFlow (Abadi et al., 2015) implementation of a differentiable abstract machine with continuous state representations, differentiable words and sketches. Program execution in ∂4 is modelled by a recurrent neural network (RNN), parameterised by the transition functions at each time step.
Figure 1: Left: the Neural Forth Abstract Machine. A Forth sketch P is translated to low-level code Pθ, with slots {...} substituted by parametrised neural networks. Slots are learnt from input-output examples (x, y) through the differentiable machine, whose state Si comprises the low-level code, program counter c, data stack D (with pointer d), return stack R (with pointer r), and the heap H. Right: a BiLSTM trained on Word Algebra Problems. Output vectors corresponding to a representation of the entire problem, as well as context representations of numbers and the numbers themselves, are fed into H to solve tasks. The entire system is end-to-end differentiable.

3.1. Machine State Encoding

We map the symbolic machine state S = (D, R, H, c) to a continuous representation S = (D, R, H, c) with two differentiable stacks (with pointers), the data stack D = (D, d) and the return stack R = (R, r), a heap H, and an attention vector c indicating which word of the sketch Pθ is being executed at the current time step. Figure 1 depicts the machine together with its elements. All three memory structures, the data stack, the return stack and the heap, are based on differentiable flat memory buffers M ∈ {D, R, H}, where D, R, H ∈ Rl×v, for a stack size l and a value size v. Each has a differentiable read operation

    readM(a) = aT M

and write operation

    writeM(x, a) : M ← M − (a1T) ⊙ M + xaT

akin to the Neural Turing Machine (NTM) memory (Graves et al., 2014), where ⊙ is the element-wise multiplication and a is the address pointer.3 In addition to the memory buffers D and R, the data stack and the return stack contain pointers to the current top-of-the-stack (TOS) element, d, r ∈ Rl, respectively. This allows us to implement pushing as writing a value x into M and incrementing the TOS pointer:

    pushM(x) : writeM(x, p)    (side-effect: p ← inc(p))

where p ∈ {d, r}, inc(p) = pT R1+, dec(p) = pT R1−, and R1+ and R1− are increment and decrement matrices (left and right circular shift matrices).

3 The equal widths of H and D allow us to directly move vector representations of values between the heap and the stack.

Popping is realised by multiplying the TOS pointer with the memory buffer, and decreasing the TOS pointer:

    popM() = readM(p)    (side-effect: p ← dec(p))

Finally, the program counter c ∈ Rp is a vector that, when one-hot, points to a single word in a program of length p, and is equivalent to the c vector of the symbolic state machine.4 We use S to denote the space of all continuous representations S.

4 During training c can become distributed and is considered as attention over the program code.

Neural Forth Words  It is straightforward to convert Forth words, defined as functions on discrete machine states, to functions operating on the continuous space S. For example, consider the word DUP, which duplicates the top of the data stack. A differentiable version of DUP first calculates the value e on the TOS address of D, as e = dT D. It then shifts the stack pointer via d ← inc(d), and writes e to D using writeD(e, d). The complete description of the implemented Forth words and their differentiable counterparts can be found in Appendix A.
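To make these operations concrete, the following is a minimal numpy sketch of the buffer primitives and of a differentiable DUP built from them. It follows the equations above but is only an illustration under our own simplifications (the circular-shift matrices R1± are replaced by np.roll and buffers are plain arrays); it is not the authors' TensorFlow implementation.

```python
import numpy as np

def read(M, a):
    # read_M(a) = a^T M : soft lookup of a buffer row addressed by pointer a
    return a @ M

def write(M, x, a):
    # write_M(x, a) : M <- M - (a 1^T) * M + x a^T  (NTM-style erase, then add)
    return M - np.outer(a, np.ones(M.shape[1])) * M + np.outer(a, x)

def inc(p):
    # pointer increment via a circular shift (stands in for p^T R^{1+})
    return np.roll(p, 1)

def dec(p):
    return np.roll(p, -1)

def push(M, p, x):
    # increment the TOS pointer, then write x at the new address
    p = inc(p)
    return write(M, x, p), p

def pop(M, p):
    # read the value at the TOS pointer, then decrement the pointer
    return read(M, p), dec(p)

def dup(D, d):
    # differentiable DUP: e = d^T D, then push e back onto the stack
    e = read(D, d)
    return push(D, d, e)

# tiny usage example with one-hot pointers and one-hot values
l, v = 4, 5
D, d = np.zeros((l, v)), np.eye(l)[0]
D, d = push(D, d, np.eye(v)[3])   # push the value "3"
D, d = dup(D, d)                  # duplicate it
print(read(D, d).argmax())        # 3
```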
3.2. Forth Sketches

We define a Forth sketch Pθ as a sequence of continuous transition functions P = w1 ... wn. Here, wi ∈ S → S either corresponds to a neural Forth word or a trainable transition function (neural networks in our case). We will call these trainable functions slots, as they correspond to underspecified "slots" in the program code that need to be filled by learned behaviour.

We allow users to define a slot w by specifying a pair of a state encoder wenc and a decoder wdec. The encoder
produces a latent representation h of the current machine state using a multi-layer perceptron, and the decoder consumes this representation to produce the next machine state. We hence have w = wdec ◦ wenc. To use slots within Forth program code, we introduce a notation that reflects this decomposition. In particular, slots are defined by the syntax { encoder -> decoder } where encoder and decoder are specifications of the corresponding slot parts as described below.

Figure 2: ∂4 segment of the RNN execution of a Forth sketch in blue in Listing 1. The pointers (d, r) and values (rows of R and D) are all in one-hot state (colours simply denote values observed, defined by the top scale), while the program counter maintains the uncertainty. Subsequent states are discretised for clarity. Here, the slot {...} has learned its optimal behaviour.

Encoders  We provide the following options for encoders:

static produces a static representation, independent of the actual machine state.
observe e1 ... em: concatenates the elements e1 ... em of the machine state. An element can be a stack item Di at relative index i, a return stack item Ri, etc.
linear N, sigmoid, tanh represent chained transformations, which enable the multilayer perceptron architecture. linear N projects to N dimensions, and sigmoid and tanh apply same-named functions elementwise.

Decoders  Users can specify the following decoders:

choose w1 ... wm: chooses from the Forth words w1 ... wm. Takes an input vector h of length m to produce a weighted combination of machine states Σi hi wi(S).
manipulate e1 ... em: directly manipulates the machine state elements e1 ... em by writing the appropriately reshaped and softmaxed output of the encoder over the machine state elements with writeM.
permute e1 ... em: permutes the machine state elements e1 ... em via a linear combination of m! state vectors.
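As an illustration of how an encoder and a decoder combine into a slot, here is a numpy sketch of an observe ... -> ... -> choose ... slot: the observed state elements are concatenated, passed through a small tanh MLP, and the softmax output weighs the next states produced by the candidate words. The state representation (a dict of arrays), the parameter shapes, and the helper names are our own assumptions, not the ∂4 API.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def choose_slot(state, observed, candidate_words, W1, W2):
    """observe(observed) -> tanh -> linear -> choose(candidate_words).

    state           : dict of numpy arrays (buffers and pointers)
    observed        : list of vectors read from the state (e.g. D0, D-1)
    candidate_words : list of functions mapping a state dict to a state dict
    W1, W2          : MLP parameters; W2 has one output row per candidate
    """
    h = np.tanh(W1 @ np.concatenate(observed))      # encoder
    weights = softmax(W2 @ h)                       # one weight per word
    next_states = [w(state) for w in candidate_words]
    # decoder: S' = sum_i h_i * w_i(S), applied component-wise
    return {k: sum(wt * s[k] for wt, s in zip(weights, next_states))
            for k in state}
```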
3.3. The Execution RNN

We model execution using an RNN which produces a state Sn+1 conditioned on a previous state Sn. It does so by first passing the current state to each function wi in the program, and then weighing each of the produced next states by the component of the program counter vector ci that corresponds to program index i, effectively using c as an attention vector over code. Formally we have:

    Sn+1 = RNN(Sn, Pθ) = Σi=1..|P| ci wi(Sn)

Clearly, this recursion, and its final state, are differentiable with respect to the program code Pθ and its inputs. Furthermore, for differentiable Forth programs the final state of this RNN will correspond to the final state of a symbolic execution (when no slots are present, and one-hot values are used).
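Under the same assumed dict-of-arrays state representation used in the sketches above, one execution step can be written down directly from this equation; the update of the program counter itself is assumed to be part of each word's transition.

```python
def rnn_step(state, program, c):
    """One step in the style of the equation above: S_{n+1} = sum_i c_i * w_i(S_n).

    program : list of transition functions (neural Forth words or slots)
    c       : attention over program positions (the soft program counter)
    """
    next_states = [w(state) for w in program]    # execute every word
    return {k: sum(c[i] * s[k] for i, s in enumerate(next_states))
            for k in state}
```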

3.4. Program Code Optimisations

The ∂4 RNN requires one time step per transition. After each time step, the program counter is either incremented, decremented, explicitly set or popped from the stack. In turn, a new machine state is calculated by executing all words in the program and then weighting the resulting states by the program counter. As this is expensive, it is advisable to avoid full RNN steps wherever possible. We use two strategies to avoid full RNN steps and significantly speed up ∂4: symbolic execution and interpolation of if-branches.

Symbolic Execution  Whenever we have a sequence of Forth words that contains no branch entry or exit points, we can collapse this sequence into a single transition instead of naively interpreting the words one by one. We symbolically execute (King, 1976) a sequence of Forth words to calculate a new machine state. We then use the difference between the new and the initial state to derive the transition function of the sequence. For example, the sequence R> SWAP >R, which swaps the top elements of the data and the return stack, yields the symbolic state D = r1 d2 ... dl and R = d1 r2 ... rl. Comparing it to the initial state, we derive a single neural transition that only needs to swap the top elements of D and R.

Interpolation of If-Branches  We cannot apply symbolic execution to code with branching points, as the branching behaviour depends on the current machine state and we cannot resolve it symbolically. However, we can still collapse if-branches that involve no function calls or loops by executing both branches in parallel and weighing their output states by the value of the condition. If the if-branch does contain function calls or loops, we simply fall back to execution of all words weighted by the program counter.
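Both optimisations can be sketched in a few lines under the earlier assumptions: a branch-free word sequence collapses into an ordinary function composition applied as a single transition, and an if-branch with condition weight p mixes the two branch outcomes. This is only a schematic rendering of the idea, not the symbolic-execution machinery of ∂4.

```python
from functools import reduce

def collapse(words):
    # branch-free sequence -> a single composed transition function
    return lambda state: reduce(lambda s, w: w(s), words, state)

def soft_if(p, then_branch, else_branch, state):
    """Interpolate an if-branch: weigh both branch outcomes by the condition p."""
    s_then = then_branch(state)
    s_else = else_branch(state)
    return {k: p * s_then[k] + (1 - p) * s_else[k] for k in state}
```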
3.5. Training

Our training procedure assumes input-output pairs of machine start and end states (xi, yi) only. The output yi defines a target memory YiD and a target pointer yid on the data stack D. Additionally, we have a mask Ki that indicates which components of the stack should be included in the loss (e.g. we do not care about values above the stack depth). We use DT(θ, xi) and dT(θ, xi) to denote the final state of D and d after T steps of the execution RNN, starting from an initial state xi. We define the loss function as

    L(θ) = H(Ki ⊙ DT(θ, xi), Ki ⊙ YiD) + H(Ki ⊙ dT(θ, xi), Ki ⊙ yid)

where H(x, y) = −x log y is the cross-entropy loss, and θ are the parameters of the slots in the program P. We can use backpropagation and any variant of gradient descent to optimise this loss function. Note that at this point it would be possible to include supervision of the intermediate states (trace-level), as done by the Neural Programmer-Interpreter (Reed & de Freitas, 2015).

4. Experiments

We evaluate ∂4 on three tasks. Two of these are simple transduction tasks, sorting and addition as presented by Reed & de Freitas (2015), with varying levels of program structure. For each problem, we introduce two sketches. We also test ∂4 on the more difficult task of answering word algebra problems. We show that not only can ∂4 act as a standalone solver for such problems, bypassing the intermediary task of producing formula templates which must then be executed, but it can also outperform previous work when trained on the same data.

4.1. Experimental Setup

Specific to the transduction tasks, we discretise memory elements during testing. This effectively allows the trained model to generalise to any sequence length if the correct sketch behaviour has been learned. We also compare against a Seq2Seq (Sutskever et al., 2014) baseline. Full details of the experimental setup can be found in Appendix E.

4.2. Sorting

Sorting sequences of digits is a hard task for RNNs, as they fail to generalise to sequences even marginally longer than the ones they have been trained on (Reed & de Freitas, 2015). We investigate several strong priors based on Bubble sort for this transduction task and present two ∂4 sketches in Listing 1 that enable us to learn sorting from only a few hundred training examples (see Appendix C.1 for more detail):

PERMUTE. A sketch specifying that the top two elements of the stack, and the top of the return stack, must be permuted based on the values of the former (line 3b). Both the value comparison and the permutation behaviour must be learned. The core of this sketch is depicted in Listing 1 (b lines), and the sketch is explained in detail in Appendix D.

COMPARE. This sketch provides additional prior procedural knowledge to the model. In contrast to PERMUTE, only the comparison between the top two elements on the stack must be learned (line 3c). The core of this sketch is depicted in Listing 1 (c lines).

In both sketches, the outer loop can be specified in ∂4 (Listing 1, line 10), which repeatedly calls a function BUBBLE. In doing so, it defines sufficient structure so that the behaviour of the network is invariant to the input sequence length.

Table 1: Accuracy (Hamming distance) of the Permute and Compare sketches in comparison to a Seq2Seq baseline on the sorting problem.

                 Test Length: 8           Test Length: 64
Train Length:    2       3       4        2       3       4
Seq2Seq          26.2    29.2    39.1     13.3    13.6    15.9
∂4 Permute       100.0   100.0   19.82    100.0   100.0   7.81
∂4 Compare       100.0   100.0   49.22    100.0   100.0   20.65

Results on Bubble sort  A quantitative comparison of our models on the Bubble sort task is provided in Table 1. For a given test sequence length, we vary the training set lengths to illustrate the model's ability to generalise to sequences longer than those it observed during training. We find that ∂4 quickly learns the correct sketch behaviour, and it is able to generalise perfectly to sort sequences of 64 elements after observing only sequences of length two and three during training. In comparison, the Seq2Seq baseline falters when attempting similar generalisations, and performs close to chance when tested on longer sequences. Both ∂4 sketches perform flawlessly when trained on short sequence lengths, but under-perform when trained on sequences of length 4 due to arising computational difficulties (the COMPARE sketch performs better due to the additional structure it imposes). We discuss this issue further in Section 5.

4.3. Addition

Next, we applied ∂4 to the problem of learning to add two n-digit numbers. We rely on the standard elementary-school addition algorithm, where the goal is to iterate over pairs of aligned digits, calculating the sum of each to yield the resulting sum. The key complication arises when two digits sum to a two-digit number, requiring that the correct extra digit (a carry) be carried over to the subsequent column.
We assume aligned pairs of digits as input, with a carry for the least significant digit (potentially 0), and the length of the respective numbers. The sketches define the high-level operations through recursion, leaving the core addition to be learned from data.

1  : ADD-DIGITS ( a1 b1...an bn carry n -- r1 r2...r_{n+1} )
2    DUP 0 = IF
3      DROP
4    ELSE
5      >R                \ put n on R
6a     { observe D0 D-1 D-2 -> tanh -> linear 70 -> manipulate D-1 D-2 }
7a     DROP
6b     { observe D0 D-1 D-2 -> tanh -> linear 10 -> choose 0 1 }
7b     { observe D-1 D-2 D-3 -> tanh -> linear 50 -> choose 0 1 2 3 4 5 6 7 8 9 }
       >R SWAP DROP SWAP DROP SWAP DROP R>
8      R> 1- SWAP >R     \ new_carry n-1
9      ADD-DIGITS        \ call add-digits on n-1 subseq.
10     R>                \ put remembered results back on the stack
11   THEN
12 ;

Listing 2: Manipulate sketch (a lines – green) and the choose sketch (b lines – blue) for elementary addition. Input data is used to fill the data stack externally.

The specified high-level behaviour includes the recursive call template and the halting condition of the recursion (no remaining digits, lines 2-3). The underspecified addition operation must take three digits from the previous call, the two digits to sum and a previous carry, and produce a single digit (the sum) and the resultant carry (lines 6a, 6b and 7a, 7b). We introduce two sketches for inducing this behaviour:

MANIPULATE. This sketch provides little prior procedural knowledge as it directly manipulates the ∂4 machine state, filling in a carry and the result digits, based on the top three elements on the data stack (two digits and the carry). Depicted in Listing 2 in green.
CHOOSE. Incorporating additional prior information, CHOOSE exactly specifies the results of the computation, namely the output of the first slot (line 6b) is the carry, and the output of the second one (line 7b) is the result digit, both conditioned on the two digits and the carry on the data stack. Depicted in Listing 2 in blue.

The rest of the sketch code reduces the problem size by one and returns the solution by popping it from the return stack.
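The symbolic behaviour the two addition slots have to recover is just single-column addition with a carry; the following Python sketch (our own, for illustration only) makes that target explicit.

```python
def add_digit_column(a, b, carry):
    # one column of elementary addition: two digits plus the incoming carry
    total = a + b + carry
    return total % 10, total // 10     # (result digit, new carry)

# e.g. adding 27 + 45, least significant column first
digit1, carry = add_digit_column(7, 5, 0)      # -> (2, 1)
digit2, carry = add_digit_column(2, 4, carry)  # -> (7, 0), so the sum is 72
```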

Table 2: Accuracy (Hamming distance) of the Choose and Manipulate sketches in comparison to a Seq2Seq baseline on the addition problem. Note that length corresponds to the length of the input sequence (two times the number of digits of both numbers).

                  Test Length: 8             Test Length: 64
Train Length:     2       4       8          2       4       8
Seq2Seq           37.9    57.8    99.8       15.0    13.5    13.3
∂4 Choose         100.0   100.0   100.0      100.0   100.0   100.0
∂4 Manipulate     98.58   100.0   100.0      99.49   100.0   100.0

Quantitative Evaluation on Addition  In a set of experiments analogous to those in our evaluation on Bubble sort, we demonstrate the performance of ∂4 on the addition task by examining test set sequence lengths of 8 and 64 while varying the lengths of the training set instances (Table 2). The Seq2Seq model again fails to generalise to longer sequences than those observed during training. In comparison, both the CHOOSE sketch and the less structured MANIPULATE sketch learn the correct sketch behaviour and generalise to all test sequence lengths (with the exception of MANIPULATE, which required more data to train perfectly). In additional experiments, we were able to successfully train both the CHOOSE and the MANIPULATE sketches from sequences of input length 24, and we tested them up to a sequence length of 128, confirming their perfect training and generalisation capabilities.

4.4. Word Algebra Problems

Word algebra problems (WAPs) are often used to assess the numerical reasoning abilities of schoolchildren. Questions are short narratives which focus on numerical quantities, culminating with a question. For example:

    A florist had 50 roses. If she sold 15 of them and then later picked 21 more, how many roses would she have?

Answering such questions requires both the understanding of language and of algebra — one must know which numeric operations correspond to which phrase and how to execute these operations.

Previous work cast WAPs as a transduction task by mapping a question to a template of a mathematical formula, thus requiring manually labelled formulas. For instance, one formula that can be used to correctly answer the question in the example above is (50 - 15) + 21 = 56. In previous work, local classifiers (Roy & Roth, 2015; Roy et al., 2015), hand-crafted grammars (Koncel-Kedziorski et al., 2015), and recurrent neural models (Bouchard et al., 2016) have been used to perform this task. Predicted formula templates may be marginalised during training (Kushman et al., 2014), or evaluated directly to produce an answer. In contrast to these approaches, ∂4 is able to learn both a soft mapping from text to algebraic operations and their execution, without the need for manually labelled equations and with no explicit symbolic representation of a formula.

Model description  Our model is a fully end-to-end differentiable structure, consisting of a ∂4 interpreter, a sketch, and a Bidirectional LSTM (BiLSTM) reader.

The BiLSTM reader reads the text of the problem and produces a vector representation (word vectors) for each word, concatenated from the forward and the backward pass of the BiLSTM network. We use the resulting word vectors corresponding only to numbers in the text, the numerical values of those numbers (encoded as one-hot vectors), and a vector representation of the whole problem (a concatenation of the last and the first vector of the opposite passes) to initialise the ∂4 heap H. This is done in an end-to-end fashion, enabling gradient propagation through the BiLSTM to the vector representations. The process is depicted in Figure 1.

   \ first copy data from H: vectors to R and numbers to D
1  { observe R0 R-1 R-2 R-3 -> permute D0 D-1 D-2 }
2  { observe R0 R-1 R-2 R-3 -> choose + - * / }
3  { observe R0 R-1 R-2 R-3 -> choose SWAP NOP }
4  { observe R0 R-1 R-2 R-3 -> choose + - * / }
   \ lastly, empty out the return stack

Listing 3: Core of the Word Algebra Problem sketch. The full sketch can be found in the Appendix.

The sketch, depicted in Listing 3, dictates the differentiable computation.5 First, it copies values from the heap H – word vectors to the return stack R, and numbers (as one-hot vectors) to the data stack D. Second, it contains four slots that define the space of all possible operations of four operators on three operands, all conditioned on the vector representations on the return stack. These slots are i) a permutation of the elements on the data stack, ii) the choice of the first operator, iii) possibly swapping the intermediate result and the last operand, and iv) the choice of the second operator. The final set of commands simply empties out the return stack R. These slots define the space of possible operations; however, the model needs to learn how to utilise these operations in order to calculate the correct result.

5 Due to space constraints, we present the core of the sketch here. For the full sketch, please refer to Listing 4 in the Appendix.
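The space of computations those four slots select from is small enough to write out symbolically; the following Python sketch (our own illustration of the search space, not the ∂4 code) evaluates one concrete choice of permutation, operators and optional swap over the three extracted numbers.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def evaluate(nums, perm, op1, swap, op2):
    # nums: the three numbers from the narrative; perm: an ordering of them
    a, b, c = (nums[i] for i in perm)
    intermediate = OPS[op1](a, b)
    if swap:                      # optionally swap intermediate result and last operand
        intermediate, c = c, intermediate
    return OPS[op2](intermediate, c)

# the florist example: (50 - 15) + 21 = 56 is one point in this space
print(evaluate([50, 15, 21], perm=(0, 1, 2), op1='-', swap=False, op2='+'))  # 56
```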
Results  We evaluate the model on the Common Core (CC) dataset, introduced by Roy & Roth (2015). CC is notable for having the most diverse set of equation patterns, consisting of four operators (+, -, ×, ÷), with up to three operands.

We compare against three baseline systems: (1) a local classifier with hand-crafted features (Roy & Roth, 2015), (2) a Seq2Seq baseline, and (3) the same model with a data generation component (GeNeRe; Bouchard et al., 2016). All baselines are trained to predict the best equation, which is executed outside of the model to obtain the answer. In contrast, ∂4 is trained end-to-end from input-output pairs and predicts the answer directly, without the need for an intermediate symbolic representation of a formula.

Table 3: Accuracies of models on the CC dataset. The asterisk denotes results obtained from Bouchard et al. (2016). Note that GeNeRe makes use of additional data.

Model                                    Accuracy (%)
Template Mapping
  Roy & Roth (2015)                      55.5
  Seq2Seq* (Bouchard et al., 2016)       95.0
  GeNeRe* (Bouchard et al., 2016)        98.5
Fully End-to-End
  ∂4                                     96.0

Results are shown in Table 3. All RNN-based methods (the bottom three) outperform the classifier-based approach. Our method slightly outperforms a Seq2Seq baseline, achieving the highest reported result on this dataset without data augmentation.

5. Discussion

∂4 bridges the gap between a traditional programming language and a modern machine learning architecture. However, as we have seen in our evaluation experiments, faithfully simulating the underlying abstract machine architecture introduces its own unique set of challenges.

One such challenge is the additional complexity of performing even simple tasks when they are viewed in terms of operations on the underlying machine state. As illustrated in Table 1, ∂4 sketches can be effectively trained from small training sets (see Appendix C.1), and generalise perfectly to sequences of any length. However, difficulty arises when training from sequences of modest lengths. Even when dealing with relatively short training sequences, and with the program code optimisations employed, the underlying machine can unroll into a problematically large number of states. For problems whose machine execution is quadratic, like the sorting task (which at input sequences of length 4 has 120 machine states), we observe significant instabilities during training from backpropagating through such long RNN sequences, and consequent failures to train. In comparison, the addition problem was easier to train due to comparatively shorter underlying execution RNNs.

The higher degree of prior knowledge provided played an important role in successful learning. For example, the COMPARE sketch, which provides more structure, achieves higher accuracies when trained on longer sequences. Similarly, employing softmax on the directly manipulated memory elements enabled perfect training for the MANIPULATE sketch for addition. Furthermore, it is encouraging to see that ∂4 can be trained jointly with an upstream LSTM to provide strong procedural prior knowledge for solving a real-world NLP task.
6. Related Work

Program Synthesis  The idea of program synthesis is as old as Artificial Intelligence, and has a long history in computer science (Manna & Waldinger, 1971). Whereas a large body of work has focused on using genetic programming (Koza, 1992) to induce programs from a given input-output specification (Nordin, 1997), there are also various Inductive Programming approaches (Kitzelmann, 2009) aimed at inducing programs from incomplete specifications of the code to be implemented (Albarghouthi et al., 2013; Solar-Lezama et al., 2006). We tackle the same problem of sketching, but in our case we fill the sketches with neural networks able to learn the slot behaviour.

Probabilistic and Bayesian Programming  Our work is closely related to probabilistic programming languages such as Church (Goodman et al., 2008). They allow users to inject random choice primitives into programs as a way to define generative distributions over possible execution traces. In a sense, the random choice primitives in such languages correspond to the slots in our sketches. A core difference lies in the way we train the behaviour of slots: instead of calculating their posteriors using probabilistic inference, we estimate their parameters using backpropagation and gradient descent. This is similar in kind to TerpreT's FMGD algorithm (Gaunt et al., 2016), which is employed for code induction via backpropagation. In comparison, our model, which optimises slots of neural networks surrounded by continuous approximations of code, enables the combination of procedural behaviour and neural networks. In addition, the underlying programming and probabilistic paradigm in these programming languages is often functional and declarative, whereas our approach focuses on a procedural and discriminative view. By using an end-to-end differentiable architecture, it is easy to seamlessly connect our sketches to further neural input and output modules, such as an LSTM that feeds into the machine heap.

Neural approaches  Recently, there has been a surge of research in program synthesis and execution in deep learning, with increasingly elaborate deep models. Many of these models were based on differentiable versions of abstract data structures (Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kurach et al., 2016), and a few abstract machines, such as the NTM (Graves et al., 2014), Differentiable Neural Computers (Graves et al., 2016), and Neural GPUs (Kaiser & Sutskever, 2015). All these models are able to induce algorithmic behaviour from training data. Our work differs in that our differentiable abstract machine allows us to seamlessly integrate code and neural networks, and to train the neural networks specified by slots via backpropagation. Related to our efforts is also Autograd (Maclaurin et al., 2015), which enables automatic gradient computation in pure Python code, but does not define nor use differentiable access to its underlying abstract machine.

The work on neural approximations to abstract structures and machines naturally leads to more elaborate machinery able to induce and call code or code-like behaviour. Neelakantan et al. (2015a) learned simple SQL-like behaviour — querying tables from natural language with simple arithmetic operations. Although sharing similarities on a high level, the primary goal of our model is not the induction of (fully expressive) code but its injection. Andreas et al. (2016) learn to compose neural modules to produce the desired behaviour for a visual QA task. Neural Programmer-Interpreters (Reed & de Freitas, 2015) learn to represent and execute programs, operating on different modes of an environment, and are able to incorporate decisions better captured in a neural network than in many lines of code (e.g. using an image as an input). Users inject prior procedural knowledge by training on program traces and hence require full procedural knowledge. In contrast, we enable users to use their partial knowledge in sketches.

Neural approaches to language compilation have also been researched, from compiling a language into neural networks (Siegelmann, 1994), over building neural compilers (Gruau et al., 1995) to adaptive compilation (Bunel et al., 2016). However, that line of research did not perceive neural interpreters and compilers as a means of injecting procedural knowledge as we do. To the best of our knowledge, ∂4 is the first working neural implementation of an abstract machine for an actual programming language, and this enables us to inject such priors in a straightforward manner.

7. Conclusion and Future Work

We have presented ∂4, a differentiable abstract machine for the Forth programming language, and showed how it can be used to complement programmers' prior knowledge through the learning of unspecified behaviour in Forth sketches. The ∂4 RNN successfully learns to sort and add, and to solve word algebra problems, using only program sketches and program input-output pairs. We believe ∂4, and the larger paradigm it helps establish, will be useful for addressing complex problems where low-level representations of the input are necessary, but higher-level reasoning is difficult to learn and potentially easier to specify.

In future work, we plan to apply ∂4 to other problems in the NLP domain, like machine reading and knowledge base inference. In the long term, we see the integration of non-differentiable transitions (such as those arising when interacting with a real environment) as an exciting future direction which sits at the intersection of reinforcement learning and probabilistic programming.
Acknowledgments

We thank Guillaume Bouchard, Danny Tarlow, Dirk Weissenborn, Johannes Welbl and the anonymous reviewers for fruitful discussions and helpful comments on previous drafts of this paper. This work was supported by a Microsoft Research PhD Scholarship, an Allen Distinguished Investigator Award, and a Marie Curie Career Integration Award.

References

Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL [Link]. Software available from [Link].

Albarghouthi, Aws, Gulwani, Sumit, and Kincaid, Zachary. Recursive program synthesis. In Computer Aided Verification, pp. 934–950. Springer, 2013.

Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, and Klein, Dan. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

ANSI. Programming Languages - Forth, 1994. American National Standard for Information Systems, ANSI X3.215-1994.

Bouchard, Guillaume, Stenetorp, Pontus, and Riedel, Sebastian. Learning to generate textual data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1608–1616, 2016.

Brodie, Leo. Starting Forth. Forth Inc., 1980.

Bunel, Rudy, Desmaison, Alban, Kohli, Pushmeet, Torr, Philip HS, and Kumar, M Pawan. Adaptive neural compilation. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016.

Gaunt, Alexander L, Brockschmidt, Marc, Singh, Rishabh, Kushman, Nate, Kohli, Pushmeet, Taylor, Jonathan, and Tarlow, Daniel. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428, 2016.

Goodman, Noah, Mansinghka, Vikash, Roy, Daniel M, Bonawitz, Keith, and Tenenbaum, Joshua B. Church: a language for generative models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 220–229, 2008.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to Transduce with Unbounded Memory. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 1819–1827, 2015.

Gruau, Frédéric, Ratajszczak, Jean-Yves, and Wiber, Gilles. A Neural compiler. Theoretical Computer Science, 141(1):1–52, 1995.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 190–198, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

King, James C. Symbolic Execution and Program Testing. Communications of the ACM, 19(7):385–394, 1976.

Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Kitzelmann, Emanuel. Inductive Programming: A Survey of Program Synthesis Techniques. In International Workshop on Approaches and Applications of Inductive Programming, pp. 50–73, 2009.

Koncel-Kedziorski, Rik, Hajishirzi, Hannaneh, Sabharwal, Ashish, Etzioni, Oren, and Ang, Siena. Parsing Algebraic Word Problems into Equations. Transactions of the Association for Computational Linguistics (TACL), 3:585–597, 2015.
Koza, John R. Genetic Programming: On the Programming of Computers by Means of Natural Selection, volume 1. MIT Press, 1992.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural Random-Access Machines. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Kushman, Nate, Artzi, Yoav, Zettlemoyer, Luke, and Barzilay, Regina. Learning to Automatically Solve Algebra Word Problems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271–281, 2014.

Lau, Tessa, Wolfman, Steven A., Domingos, Pedro, and Weld, Daniel S. Learning repetitive text-editing procedures with SMARTedit. In Your Wish is My Command, pp. 209–226. Morgan Kaufmann Publishers Inc., 2001.

Maclaurin, Dougal, Duvenaud, David, and Adams, Ryan P. Gradient-based Hyperparameter Optimization through Reversible Learning. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

Manna, Zohar and Waldinger, Richard J. Toward automatic program synthesis. Communications of the ACM, 14(3):151–165, 1971.

Neelakantan, Arvind, Le, Quoc V, and Sutskever, Ilya. Neural Programmer: Inducing latent programs with gradient descent. In Proceedings of the International Conference on Learning Representations (ICLR), 2015a.

Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv preprint arXiv:1511.06807, 2015b.

Nordin, Peter. Evolutionary Program Induction of Binary Machine Code and its Applications. PhD thesis, Universität Dortmund, Fachbereich Informatik, 1997.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Roy, Subhro and Roth, Dan. Solving General Arithmetic Word Problems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1743–1752, 2015.

Roy, Subhro, Vieira, Tim, and Roth, Dan. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics (TACL), 3:1–13, 2015.

Siegelmann, Hava T. Neural Programming Language. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pp. 877–882, 1994.

Solar-Lezama, Armando, Rabbah, Rodric, Bodík, Rastislav, and Ebcioğlu, Kemal. Programming by Sketching for Bit-streaming Programs. In Proceedings of Programming Language Design and Implementation (PLDI), pp. 281–294, 2005.

Solar-Lezama, Armando, Tancau, Liviu, Bodik, Rastislav, Seshia, Sanjit, and Saraswat, Vijay. Combinatorial Sketching for Finite Programs. In ACM SIGPLAN Notices, volume 41, pp. 404–415, 2006.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.
Appendix
A. Forth Words and their implementation
We implemented a small subset of available Forth words in ∂4. The table of these words, together with their descriptions
is given in Table 4, and their implementation is given in Table 5. The commands are roughly divided into 7 groups. These
groups, line-separated in the table, are:

Data stack operations {num}, 1+, 1-, DUP, SWAP, OVER, DROP, +, -, *, /
Heap operations @, !
Comparators >, <, =
Return stack operations >R, R>, @R
Control statements IF..ELSE..THEN, BEGIN..WHILE..REPEAT, DO..LOOP
Subroutine control :, {sub}, ;, MACRO
Variable creation VARIABLE, CREATE..ALLOT

Table 4: Forth words and their descriptions. TOS denotes top-of-stack, NOS denotes next-on-stack, DSTACK denotes the
data stack, RSTACK denotes the return stack, and HEAP denotes the heap.
Forth Word Description
{num} Pushes {num} to DSTACK.
1+ Increments DSTACK TOS by 1.
1- Decrements DSTACK TOS by 1.
DUP Duplicates DSTACK TOS.
SWAP Swaps TOS and NOS.
OVER Copies NOS and pushes it on the TOS.
DROP Pops the TOS (non-destructive).
+, -, *, / Consumes DSTACK NOS and TOS. Returns NOS operator TOS.
@ Fetches the HEAP value from the DSTACK TOS address.
! Stores DSTACK NOS to the DSTACK TOS address on the HEAP.
>, <, = Consumes DSTACK NOS and TOS.
Returns 1 (TRUE) if NOS > | < | = TOS respectively, 0 (FALSE) otherwise.
>R Pushes DSTACK TOS to RSTACK TOS, removes it from DSTACK.
R> Pushes RSTACK TOS to DSTACK TOS, removes it from RSTACK.
@R Copies the RSTACK TOS to the DSTACK TOS.
IF..ELSE..THEN Consumes DSTACK TOS, if it equals to a non-zero number (TRUE), executes
commands between IF and ELSE. Otherwise executes commands between
ELSE and THEN.
BEGIN..WHILE..REPEAT Continually executes commands between WHILE and REPEAT while the code
between BEGIN and WHILE evaluates to a non-zero number (TRUE).
DO..LOOP Consumes NOS and TOS, assumes NOS as a limit, and TOS as a current index.
Increases index by 1 until equal to NOS. At every increment, executes commands
between DO and LOOP.
: Denotes the subroutine, followed by a word defining it.
{sub} Subroutine invocation, puts the program counter PC on RSTACK, sets PC to the
subroutine address.
; Subroutine exit. Consumes the TOS from the RSTACK and sets the PC to it.
MACRO Treats the subroutine as a macro function.
VARIABLE Creates a variable with a fixed address. Invoking the variable name returns its
address.
CREATE..ALLOT Creates a variable with a fixed address and does not allocate the next N addresses to any other variable (effectively reserving that portion of the heap for the variable).
Table 5: Implementation of the Forth words described in Table 4. Note that the variable creation words are implemented as fixed address allocation, and MACRO words are implemented with inlining.

Symbol               Explanation
M                    Stack, M ∈ {D, R}
M                    Memory buffer, M ∈ {D, R, H}
p                    Pointer, p ∈ {d, r, c}
R1±                  Increment and decrement matrices (circular shift):
                     R1±_ij = 1 if i ± 1 ≡ j (mod n), 0 otherwise
R+, R−, R*, R/       Circular arithmetic operation tensors:
                     R{op}_ijk = 1 if i {op} j ≡ k (mod n), 0 otherwise

Pointer and value manipulation       Expression
Increment a (or value x)             inc(a) = aT R1+
Decrement a (or value x)             dec(a) = aT R1−
Algebraic operation application      {op}(a, b) = aT R{op} b
Conditional jump to a                jump(c, a) : p = (popD() = TRUE), c ← p c + (1 − p) a
Next on stack                        a−1 = aT R1−

Buffer manipulation
READ from M                          readM(a) = aT M
WRITE to M                           writeM(x, a) : M ← M − a ⊗ 1 · M + x ⊗ a
PUSH x onto M                        pushM(x) : writeM(x, a)   [side-effect: d ← inc(d)]
POP an element from M                popM() = readM(a)         [side-effect: d ← dec(d)]

Forth Word
Literal x                            pushD(x)
1+                                   writeD(inc(readD(d)), d)
1-                                   writeD(dec(readD(d)), d)
DUP                                  pushD(readD(d))
SWAP                                 x = readD(d), y = readD(d−1) : writeD(d, y), writeD(d−1, x)
OVER                                 pushD(readD(d−1))
DROP                                 popD()
+, -, *, /                           writeD({op}(readD(d−1), readD(d)), d)
@                                    readH(d)
!                                    writeH(d, d−1)
<                                    SWAP >
>                                    e1 = Σ_{i=0..n−1} i · di,  e2 = Σ_{i=0..n−1} i · d−1_i,
                                     p = φpwl(e1 − e2), where φpwl(x) = min(max(0, x + 0.5), 1);
                                     result: p · 1 + (1 − p) · 0
=                                    p = φpwl(d, d−1); result: p · 1 + (1 − p) · 0
>R                                   pushR(d)
R>                                   popR()
@R                                   writeD(d, readR(r))
IF ..1 ELSE ..2 THEN                 p = (popD() = 0), p ∗ ..1 + (1 − p) ∗ ..2
BEGIN ..1 WHILE ..2 REPEAT           ..1, jump(c, ..2)
DO..LOOP                             start = c, current = inc(popD()), limit = popD(),
                                     p = (current = limit), jump(p, ..), jump(c, start)
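The piecewise-linear squashing used in the soft comparator above is easy to state in code; the following standalone Python sketch of φpwl and of the expected-value comparison it is applied to is our own rendering (variable names are assumptions, not part of ∂4).

```python
def phi_pwl(x):
    # phi_pwl(x) = min(max(0, x + 0.5), 1): a piecewise-linear squash into [0, 1]
    return min(max(0.0, x + 0.5), 1.0)

def soft_compare(top, next_on_stack):
    """Soft comparison of two stack cells given as value distributions."""
    e1 = sum(i * p for i, p in enumerate(top))            # expected value of one cell
    e2 = sum(i * p for i, p in enumerate(next_on_stack))  # expected value of the other
    return phi_pwl(e1 - e2)   # close to 1 when e1 exceeds e2, close to 0 otherwise
```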
B. Bubble sort algorithm description


An example of a Forth program that implements the Bubble sort algorithm is shown in Listing 1. We provide a description
of how the first iteration of this algorithm is executed by the Forth abstract machine:
The program begins at line 11, putting the sequence [2 4 2 7] on the data stack D, followed by the sequence length 4.6 It then
calls the SORT word.
D R c comment
1 [] [] 11 execution start
2 [2 4 2 7 4] [] 8 pushing sequence to D, calling
SORT subroutine puts ASORT to R
For a sequence of length 4, SORT performs a do-loop in line 9 that calls the BUBBLE sub-routine 3 times. It does so by
decrementing the top of D with the 1- word to 3. Subsequently, 3 is duplicated on D by using DUP, and 0 is pushed onto D.
3 [2 4 2 7 3] [ASORT ] 9 1-
4 [2 4 2 7 3 3] [ASORT ] 9 DUP
6 [2 4 2 7 3 3 0] [ASORT ] 9 0
DO consumes the top two stack elements 3 and 0 as the limit and starting point of the loop, leaving the stack D to be [2 4 2 7 3]. We use the return stack R as a temporary variable buffer and push 3 onto it using the word >R. This drops 3 from D, which we then copy back from R with R@.
7 [2 4 2 7 3] [AddrSORT ] 9 DO
8 [2 4 2 7] [AddrSORT 3] 9 >R
9 [2 4 2 7 3] [AddrSORT 3] 9 @R
Next, we call BUBBLE to perform one iteration of the bubble pass (calling BUBBLE 3 times internally), consuming 3. Notice that this call puts the current program counter onto R, to be used for the program counter c when exiting BUBBLE. Inside the BUBBLE subroutine, DUP duplicates 3 on D. IF consumes the duplicated 3 and interprets it as TRUE. >R puts 3 on R.
10 [2 4 2 7 3] [ASORT 3 ABUBBLE ] 0 calling BUBBLE subroutine puts
ABUBBLE to R
11 [2 4 2 7 3 3] [ASORT 3 ABUBBLE ] 1 DUP
12 [2 4 2 7 3] [ASORT 3 ABUBBLE ] 1 IF
13 [2 4 2 7] [ASORT 3 ABUBBLE 3] 1 >R
Calling OVER twice duplicates the top two elements of the stack so that they can be tested with <, which tests whether 2 < 7. IF tests whether the result is TRUE (non-zero), which it is, so it executes SWAP.
14 [2 4 2 7 2 7] [ASORT 3 ABUBBLE 3] 2 OVER OVER
15 [2 4 2 7 1] [ASORT 3 ABUBBLE 3] 2 <
16 [2 4 2 7] [ASORT 3 ABUBBLE 3] 2 IF
17 [2 4 7 2] [ASORT 3 ABUBBLE 3] 2 SWAP
To prepare for the next call to BUBBLE we move 3 back from the return stack R to the data stack D via R>, SWAP it with the
next element, put it back to R with >R, decrease the TOS with 1- and invoke BUBBLE again. Notice that R will accumulate
the analysed part of the sequence, which will be recursively taken back.
18 [2 4 7 2 3] [ASORT 3 ABUBBLE ] 3 R>
19 [2 4 7 3 2] [ASORT 3 ABUBBLE ] 3 SWAP
20 [2 4 7 3] [ASORT 3 ABUBBLE 2] 3 >R
21 [2 4 7 2] [ASORT 3 ABUBBLE 2] 3 1-
22 [2 4 7 2] [ASORT 3 ABUBBLE 2] 0 ...BUBBLE
When we reach the loop limit we drop the length of the sequence and exit SORT using the ; word, which takes the return
address from R. At the final point, the stack should contain the ordered sequence [7 4 2 2].

6 Note that Forth uses Reverse Polish Notation and that the top of the data stack is 4 in this example.
C. Learning and Run Time Efficiency

C.1. Accuracy per training examples

Sorter. When measuring the performance of the model as the number of training instances varies, we can observe the benefit of additional prior knowledge to the optimisation process. We find that when stronger prior knowledge is provided (COMPARE), the model quickly maximises the training accuracy. Providing less structure (PERMUTE) results in lower test accuracy initially; however, both sketches learn the correct behaviour and generalise equally well after seeing 256 training instances. Additionally, it is worth noting that the PERMUTE sketch was not always able to converge to a result of the correct length, and that neither sketch is trivial to train.

In comparison, the Seq2Seq baseline is able to generalise only to sequences of the length it was trained on (Seq2Seq trained and tested on sequence length 3). When trained on sequence length 3 and tested on the much longer sequence length of 8, the Seq2Seq baseline is not able to achieve more than 45% accuracy.

Figure 3: Accuracy of models for a varying number of training examples, trained on input sequences of length 3 for the Bubble sort task. Compare, permute, and Seq2Seq (test 8) were tested on sequence length 8, and Seq2Seq (test 3) was tested on sequence length 3. [Plot omitted; axes: number of training examples vs. accuracy.]

Adder. We tested the models by training them on datasets of increasing size for the addition task. The results, depicted in Table 4, show that both the choose and the manipulate sketch are able to generalise perfectly from 256 examples when trained on sequence lengths of 8 and tested on 16. In comparison, the Seq2Seq baseline achieves 98% when trained on 16384 examples, but only when tested on input of the same length, 8. If we test Seq2Seq as we tested the sketches, it is unable to achieve more than 19.7%.

Figure 4: Accuracy of models for a varying number of training examples, trained on input sequences of length 8 for the addition task. Manipulate, choose, and Seq2Seq (test 16) were tested on sequence length 16, and Seq2Seq (test 8) was tested on sequence length 8. [Plot omitted; axes: number of training examples vs. accuracy.]

C.2. Program Code Optimisations

We measure the runtime of Bubble sort on sequences of varying length, with and without the optimisations described in Section 3.4. The results of ten repeated runs are shown in Figure 5 and demonstrate large relative improvements for symbolic execution and interpolation of if-branches compared to non-optimised ∂4 code.

Figure 5: Relative speed improvements of program code optimisations for different input sequence lengths. [Plot omitted; axes: input sequence length vs. relative runtime improvement in %.]

D. ∂4 execution of a Bubble sort sketch


Listing 1 (lines 3b and 4b, in blue) defines the BUBBLE word as a sketch capturing several types of prior knowledge. In this section, we describe the PERMUTE sketch. In it, we assume that BUBBLE involves a recursive call, that the recursion terminates at length 1, and that the next BUBBLE call takes as input some function of the current length and the top two stack elements.

The inputs to this sketch are the sequence to be sorted and its length decremented by one, n − 1 (line 1). They are expected on the data stack.
After the length (n − 1) is duplicated for further use with DUP, the machine tests whether it is non-zero (using IF, which consumes the TOS during the check). If n − 1 > 0, it is stored on the R stack for future use (line 2).

At this point (line 3b) the programmer only knows that a decision must be made based on the top two data stack elements D0 and D-1 (the comparison elements) and the top of the return stack, R0 (the length decremented by 1). Here the precise nature of this decision is unknown, but it is limited to variants of permutations of these elements, the output of which produces the input state to the decrement 1- and the recursive BUBBLE call (line 4b). At the culmination of the call, R0, the output of the learned slot behaviour, is moved onto the data stack using R>, and execution proceeds to the next step.

Figure 2 illustrates how portions of this sketch are executed on the ∂4 RNN. The program counter initially resides at >R (line 3 in P), as indicated by the vector c next to the program P. Both the data and the return stack are partially filled (R has 1 element, D has 4), and we show their content both as horizontal one-hot vectors and as the corresponding integer values (colour coded). The vectors d and r point to the tops of the two stacks, and are likewise in a one-hot state. In this execution trace, the slot at line 4 is already showing optimal behaviour: it remembers that the element on the return stack (4) is larger, and executes BUBBLE on the remaining sequence with the counter n decreased by one, to 1.
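For concreteness, the PERMUTE sketch described above can be written as below. This is a reconstruction based on the description in this section, using the slot syntax of Listing 4; the exact argument order and the line numbering of the blue lines 3b and 4b in Listing 1 may differ.

: BUBBLE ( a1 ... an n-1 -- one pass )
  DUP IF >R
    \ line 3b: learned slot -- observe the comparison elements D0, D-1 and the
    \ counter R0, and write a learned permutation of them back into those cells
    { observe D0 D-1 R0 -> permute D-1 D0 R0 }
    \ line 4b: decrement the counter, recurse, then move R0 back onto D with R>
    1- BUBBLE R>
  ELSE
    DROP
  THEN ;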
E. Experimental details

The parameters of each sketch are trained using Adam (Kingma & Ba, 2015), with gradient clipping (set to 1.0) and gradient noise (Neelakantan et al., 2015b). We tuned the learning rate, batch size, and the parameters of the gradient noise in a random search on a development variant of each task.

E.1. Seq2Seq baseline

The Seq2Seq baseline models are single-layer networks with LSTM cells of 50 dimensions.

The training procedure for these models consists of 500 epochs of Adam optimisation, with a batch size of 128, a learning rate of 0.01, and gradient clipping when the L2 norm of the model parameters exceeded 5.0. We vary the size of the training and test data (Fig. 3), but observe no indication of the models failing to reach convergence under these training conditions.

E.2. Sorting

The Permute and Compare sketches in Table 1 were trained on a randomly generated train, development and test set containing 256, 32 and 32 instances, respectively. Note that the low number of dev and test instances was due to the computational complexity of the sketch. The batch size was set to a value between 64 and 16, depending on the problem size, and we used an initial learning rate of 1.0.

E.3. Addition

We trained the addition Choose and Manipulate sketches presented in Table 2 on randomly generated train, development and test sets of sizes 512, 256, and 1024, respectively. The batch size was set to 16, and we used an initial learning rate of 0.05.

E.4. Word Algebra Problem

The Common Core (CC) dataset (Roy & Roth, 2015) is partitioned into train, dev, and test sets containing 300, 100, and 200 questions, respectively. The batch size was set to 50, and we used an initial learning rate of 0.02. The BiLSTM word vectors were initialised randomly to vectors of length 75. The stack width was set to 150 and the stack size to 5.

F. Qualitative Analysis of PC Traces on BubbleSort

In Figure 6 we visualise the program counter traces. The trace follows a single example at the start, in the middle, and at the end of the training process. At the beginning of training, the program counter starts to deviate from the one-hot representation within the first 20 steps (not visible in the figure, as the changes are too small to observe), and after two iterations of SORT, ∂4 fails to correctly determine the next word. After a few training epochs, ∂4 learns better permutations, which enable the algorithm to make crisp decisions and halt in the correct state.
[Three trace visualisations omitted; the horizontal axis of each panel enumerates execution steps 0-119.]
(a) Program Counter trace in early stages of training.
(b) Program Counter trace in the middle of training.
(c) Program Counter trace at the end of training.


Figure 6: Program Counter traces for a single example at different stages of training BubbleSort in Listing 1 (red: successive
recursion calls to BUBBLE, green: successive returns from the recursion, and blue: calls to SORT). The last element in the
last row is the halting command, which only gets executed after learning the correct slot behaviour.
G. The complete Word Algebra Problem sketch


The Word Algebra Problem (WAP) sketch described in Listing 3 is the core of the model that we use for WAP problems. However, there are additional words before and after this core that take care of copying the data from the heap to the data and return stacks, and of finally emptying out the return stack.
The full WAP sketch is given in Listing 4. We define a QUESTION variable which will hold the heap address of the question vector. Lines 4 and 5 create the REPR_BUFFER and NUM_BUFFER variables and allot four sequential memory slots on the heap for each, where we will store the representation vectors and the numbers, respectively. Lines 7 and 8 create the variables REPR and NUM which will hold the addresses of the current representation and number on the heap. Lines 10 and 11 store REPR_BUFFER in REPR and NUM_BUFFER in NUM, essentially initialising the variables REPR and NUM to the starting addresses allotted in lines 4 and 5. Lines 14-16 and 19-21 create the macro functions STEP_NUM and STEP_REPR, which increment the NUM and REPR pointers when called. These macro functions will be used to iterate through the heap space. Lines 24 and 25 define the macro functions CURRENT_NUM, for fetching the current number, and CURRENT_REPR, for fetching the current representation. Lines 28-32 copy the values of the numbers from the heap to the data stack by alternating the CURRENT_NUM and STEP_NUM macros. After that, line 35 pushes the question vector, and lines 36-40 push the word representations of the numbers onto the return stack.

Following that, we define the core operations of the sketch. Line 43 permutes the elements on the data stack (the numbers) as a function of the elements on the return stack (the vector representations of the question and of the numbers). Line 45 chooses an operator to execute over the TOS and NOS elements of the data stack (again, conditioned on the elements of the return stack). Line 47 executes a possible swap of the two elements on the data stack (the intermediate result and the last operand), conditioned on the return stack. Finally, line 49 chooses the last operator to execute on the data stack, conditioned on the return stack.

The sketch ends with lines 52-55, which empty out the return stack.
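As a small illustration of the heap-pointer idiom used throughout the listing (the double fetch in CURRENT_NUM and the pointer increment in STEP_NUM), consider the snippet below. It follows the simplified heap model of Listing 4, in which ALLOT reserves whole heap cells and 1+ advances an address by one cell; the stored value 42 is purely illustrative and does not appear in the sketch.

CREATE NUM_BUFFER 4 ALLOT   \ reserve four heap cells for the numbers
VARIABLE NUM                \ NUM holds the address of the "current" number
NUM_BUFFER NUM !            \ point NUM at the first cell of the buffer
42 NUM_BUFFER !             \ store an illustrative value in that first cell
NUM @ @                     \ as in CURRENT_NUM: fetch the address held by NUM, then the value stored there (42)
NUM @ 1+ NUM !              \ as in STEP_NUM: advance the stored address by one cell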
1 \ address of the question on H


2 VARIABLE QUESTION
3 \ allotting H for representations and numbers
4 CREATE REPR_BUFFER 4 ALLOT
5 CREATE NUM_BUFFER 4 ALLOT
6 \ addresses of the first representation and number
7 VARIABLE REPR
8 VARIABLE NUM

10 REPR_BUFFER REPR !
11 NUM_BUFFER NUM !

13 \ macro function for incrementing the pointer to numbers in H


14 MACRO: STEP_NUM
15 NUM @ 1+ NUM !
16 ;

18 \ macro function for incrementing the pointer to representations in H


19 MACRO: STEP_REPR
20 REPR @ 1+ REPR !
21 ;

23 \ macro functions for fetching current numbers and representations


24 MACRO: CURRENT_NUM NUM @ @ ;
25 MACRO: CURRENT_REPR REPR @ @ ;

27 \ copy numbers to D
28 CURRENT_NUM
29 STEP_NUM
30 CURRENT_NUM
31 STEP_NUM
32 CURRENT_NUM

34 \ copy question vector, and representations of numbers to R


35 QUESTION @ >R
36 CURRENT_REPR >R
37 STEP_REPR
38 CURRENT_REPR >R
39 STEP_REPR
40 CURRENT_REPR >R

42 \ permute stack elements, based on the question and number representations


43 { observe R0 R-1 R-2 R-3 -> permute D0 D-1 D-2 }
44 \ choose the first operation
45 { observe R0 R-1 R-2 R-3 -> choose + - * / }
46 \ choose whether to swap intermediate result and the bottom number
47 { observe R0 R-1 R-2 R-3 -> choose SWAP NOP }
48 \ choose the second operation
49 { observe R0 R-1 R-2 R-3 -> choose + - * / }

51 \ empty out R
52 R> DROP
53 R> DROP
54 R> DROP
55 R> DROP

Listing 4: The complete Word Algebra Problem sketch
