Music Prediction via Multiple Viewpoints
ABSTRACT
This paper examines the prediction and generation of music using a multiple
viewpoint system, a collection of independent views of the musical surface, each of
which models a specific type of musical phenomenon. Both the general style and a
particular piece are modeled using dual short-term and long-term theories, and the
model is created using machine learning techniques on a corpus of musical
examples.
The models are used for analysis and prediction, and we conjecture that highly
predictive theories will also generate original, acceptable works. Although the
quality of the works generated is hard to quantify objectively, the predictive power
of models can be measured by the notion of entropy, or unpredictability. Highly
predictive theories will produce low-entropy estimates of a musical language.
The methods developed are applied to the Bach chorale melodies. Multiple-
viewpoint systems are learned from a sample of 95 chorales, estimates of entropy
are produced, and a predictive theory is used to generate new, unseen pieces.
INTRODUCTION
This paper is concerned with machine learning and evaluation of music theories.
A theory of music is an intensional description of some musical language. Theories
are evaluated according to the predictions they make about particular pieces of
music. In addition to explaining known pieces, a theory of a musical style is
expected to generate new pieces that are acceptable — even creative — specimens
of the style. Musical styles are vast and complex languages; we have begun our
work by tackling the problem of constructing theories for melody, a necessary
prerequisite to more advanced issues such as polyphony and harmony. The Bach
chorale melodies were chosen as the object of analysis due to their abundance,
simplicity and general display of good melodic form.
There are two approaches to the construction of a generative theory for a
musical language. The first is the knowledge engineering approach, where rules
and constraints are explicitly coded in some logic or grammar (Cope 1987,
Ebcioglu 1986, Lidov & Gabura 1973, Hiller 1970, Baroni & Jacobini 1978). The
second is the empirical induction approach taken here, where a theory of the style
is learned by machine from a corpus of examples. Estimates of entropy are
presented, which indicate that the entropy of the chorale genre is probably not
greater than 1.87 bits/pitch. An example of a generated chorale is presented.

PREDICTIVE THEORIES
In his paper "Prediction and Entropy of Printed English", Shannon (1951) outlined
the idea of a predictive theory of a language as a transducer of letters into
compressed codes. This transduction is reversible, so that it is also possible to
exactly reconstruct a letter from a compressed code. Figure 1 outlines this
prediction scheme for general sequences. A context — a sequence of events — and
an event are shown to a theory T, which is in "predict" mode and produces a
compressed code. When the code and original context are shown to the theory —
now in "generate" mode — the original event is reproduced.
The prediction scheme of Fig. 1 forms the basis for modern data compression
theory (Bell et al. 1990). It can be shown that if the probability of an arbitrary
event e, given a context c and theory T, is $P_T(e \mid c)$, the code produced cannot have
a length of less than

$$-\log_2 P_T(e \mid c) \qquad (1)$$
bits. Furthermore, there are coding techniques that can almost achieve this lower
bound (Witten et al. 1987).
Let $(e_1, \ldots, e_n)$ be a subsequence from an element of the language $L$. The
sequence $c = (e_1, \ldots, e_{n-1})$ is the context, and $e = e_n$ is the event to be predicted. The
entropy of the language $L$, with respect to a probabilistic predictive theory $T$, is the
minimal expected code length of an event. It can be estimated as

$$\hat{H}_T(L) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P_T(e_i \mid c_i) \qquad (2)$$

where $n$ is the number of subsequences used in the estimation. As $n$ grows,
statistically more reliable estimates of the entropy are produced. Theories which
minimize the entropy estimate in Expression 2 are more predictive of the language
under investigation.

Fig. 1. The prediction scheme: an event and its context are shown to a theory T in
"predict" mode, which produces a compressed code; the code and context shown to
T in "generate" mode reproduce the original event.
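To make Expressions 1 and 2 concrete, here is a minimal Python sketch (the
function names are ours, not the paper's): code_length computes the per-event
code length of Expression 1, and entropy_estimate averages it over a sample as
in Expression 2.

    import math

    def code_length(p):
        """Expression 1: minimum code length, in bits, of an event
        predicted with (nonzero) probability p."""
        return -math.log2(p)

    def entropy_estimate(probabilities):
        """Expression 2: the average code length over a sample, where
        `probabilities` holds P_T(e_i | c_i) for each predicted event."""
        return sum(code_length(p) for p in probabilities) / len(probabilities)

    # A theory assigning these probabilities to five events yields an
    # estimate of about 1.48 bits/event.
    print(entropy_estimate([0.5, 0.25, 0.8, 0.1, 0.6]))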
Meyer (1957) has found a striking relationship between musical experience and
entropy. Instability in music results from temporarily or permanently blocked
expectations on the part of a listener. A piece of music which moves according to
expectation is "neutral with respect to meaning", where high expectation corre-
sponds to high probability and low information. Conversely, high information may
be the result of the delay of a consequent, or the ambiguity of an antecedent.
Meyer claims that intraopus uncertainty is either systemic, where the particular
work asserts its individuality, or designed, where the composer introduces elements
of originality or surprise into a piece. These points of deviation are of interest in
identifying important structural events in the music. In order to identify such
instabilities, a predictive theory must be formed which creates expectations, so that
their breakdown may be detected.
Predictive theories can also be used to generate new music. If a theory T
minimizes the entropy estimate of the language, the codes produced will be incom-
pressible, and hence highly random, bit strings. Thus in addition to evaluating a
theory according to its predictive power, it is possible to inspect the music
generated by a theory when provided with random codes. A conjecture of this
research is that highly predictive theories will also be good generative theories.
Thus the goal of developing generative theories, which is a subjective process, can
be replaced by that of developing highly predictive theories. Furthermore,
predictive theories of a musical style can be learned from examples of the style,
as the next section will discuss.
There is an important point to note regarding the prediction scheme. Since log
0 is undefined, it is necessary that $P_T(e \mid c)$ be nonzero for all contexts and events
representable in the underlying symbol set. This means that, according to a theory
T, no sequence of events is impossible, however unlikely it may be. Adopting
Rahn's (1989) term, we say that predictive theories of music must be non-
exclusive. This means that a theory could conceivably generate any sequence of
events.
Prediction is not the same thing as generation of music. In the above scheme,
all predictive theories are generative. However, there are generative theories which
are not predictive. Consider, for example, various music composition techniques
of ars combinatoria (Ratner 1970), including musical dice games, fractal music,
and so on. Even though these techniques may generate reasonable pieces, they are
almost always exclusive and are not readily incorporated into a prediction scheme.
Context models
The conditional probability of an event $e_n$ following a context is estimated from
the relative frequencies of successively shorter contexts in the database:

$$p(e_n \mid (e_1, \ldots, e_{n-1})),\; p(e_n \mid (e_2, \ldots, e_{n-1})),\; \ldots,\; p(e_n \mid ()).$$
Higher weights are given to terms appearing earlier in the above enumeration. It
is possible that the conditional probability is still undefined if the last quantity
above is undefined. This will occur when the one-event sequence $(e_n)$ does not
appear in the database. To solve this, the final probability measure is blended with
the uniform distribution over the event space. The final result of blending is a
probability distribution over all events in the event space $\xi$, given a context.
Induction of context models is incremental. The initial theory of the concept is the
most general theory possible, assigning equal probability to any tuple. Each
example that arrives specializes the theory, since after incorporating the example,
the theory will give higher probability to it. As such, induction of context models
can be viewed as a hill-climbing search of a specialization hierarchy of probabilis-
tic theories (Gennari et al. 1989, Buntine 1988).
For a sequence $e_1^n$ comprising $n$ events, the machine is given $n$ (context, next
event) tuples

$$((),\, e_1),\ ((e_1),\, e_2),\ ((e_1, e_2),\, e_3),\ \ldots,\ ((e_1, \ldots, e_{n-1}),\, e_n).$$
Table 1. A small context database after incorporation of the first ten pitch classes
(sequence GGDBAGGABA) from chorale 1.
A problem of fixed-order context models is that very low order models are too
general, and do not capture enough structure of the concept; very high order
models are too specialized to the examples from which they were constructed, and
do not capture enough statistics of the concept. The induction scheme described
above, using the partial match inference method, will tend to avoid such extreme
overgeneralization or overspecialization.
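A minimal Python sketch of such a context model follows (the class and method
names are ours). It incrementally incorporates (context, next event) tuples at
every order up to a bound, and blends the relative frequencies of successively
shorter contexts with a uniform fallback, so the model is non-exclusive. The
halving weights are our simplification; the paper's partial-match weighting
(cf. Cleary & Witten 1984) differs in detail.

    from collections import defaultdict

    class ContextModel:
        """Incremental context model with partial matching: prediction
        blends relative frequencies of successively shorter suffixes of
        the supplied context, weighting longer matches more heavily, and
        falls back on a uniform distribution so that no event ever
        receives probability zero."""

        def __init__(self, order, alphabet):
            self.order = order
            self.alphabet = list(alphabet)
            self.counts = defaultdict(lambda: defaultdict(int))

        def learn(self, sequence):
            """Incorporate the (context, next event) tuples of a sequence."""
            for i, e in enumerate(sequence):
                for k in range(min(i, self.order) + 1):
                    self.counts[tuple(sequence[i - k:i])][e] += 1

        def probability(self, context, event):
            context = tuple(context)[-self.order:]
            p, weight = 0.0, 1.0
            for k in range(len(context), -1, -1):   # longest suffix first
                ctx = context[len(context) - k:]
                dist = self.counts.get(ctx)
                if dist:                            # this context was seen
                    p += 0.5 * weight * dist.get(event, 0) / sum(dist.values())
                    weight *= 0.5
            return p + weight / len(self.alphabet)  # uniform blend

    model = ContextModel(order=2, alphabet="GDBA")
    model.learn(list("GGDBAGGABA"))   # the ten pitch classes of Table 1
    print(model.probability(list("AB"), "A"))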
In sequence prediction using context models, two forces contribute to the sequence-
generating rule. One is provided by long-term effects, and is governed by structure
and statistics induced from a large corpus of sequences from the same genre. The
other is provided by short-term effects, by structure and statistics particular to the
sequence being predicted. This research explicitly represents the short- and long-
term forces as different context models. The short-term model is transitory in the
sense that it is discarded after a particular sequence is predicted, and dynamic in
the sense that it adapts to a particular sequence.
More precisely, suppose a tuple (c,e) is encountered. A short-term model of this
tuple is a context model of the sequence c. Short-term models are also non-exclu-
sive. The final probability of an event is a combination of its probability according
to the short- and long-term models. Figure 3 depicts the process of short/long-term
model combination.
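As an illustration of the combination step, the sketch below merges the two
distributions by a weighted arithmetic mean. The weighting rule is our
placeholder; the paper's actual combination method may differ (for instance,
by weighting each model according to its confidence).

    def combine(short_term, long_term, w=0.5):
        """Combine short- and long-term predictions over the same event
        space.  A weighted arithmetic mean is one simple possibility."""
        return {e: w * short_term[e] + (1 - w) * long_term[e]
                for e in long_term}

    # combine({'G': 0.6, 'A': 0.4}, {'G': 0.3, 'A': 0.7})
    #   -> {'G': 0.45, 'A': 0.55}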
The main problem with basic context models, which makes their use for music
unsatisfactory, is that they require an exact match of supplied contexts to contexts
in the database. They were not meant to deal with domains, such as music, where
events can have an internal structure (e.g., musical events have pitches, durations, and
start-times) and are richly representable in languages other than the basic event
language.
Derived types
The complete space of product types forms a set partially ordered on the subset
relation among constituents. Figure 2 displays the lattice of such a set for three
primitive types $x_1$, $x_2$ and $x_3$. On the bottom of the lattice is the empty type; on the
top, the product type between all three primitive types. The second level of the
lattice contains the two-constituent product types, and the third level contains all
primitive types. A multiple viewpoint system can be viewed as a set of points on
this lattice. For example, $\{x_1 \otimes x_2\}$, $\{x_2,\, x_1 \otimes x_3\}$, and $\{x_2,\, x_1 \otimes x_2,\, x_3\}$ are multiple
viewpoint systems.
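The lattice itself is easy to enumerate. In the Python sketch below (names are
ours), a product type is a tuple of primitive type names and the lattice is the
set of all such tuples, ordered by the subset relation.

    from itertools import chain, combinations

    def product_lattice(primitives):
        """All product types over the given primitive types: one tuple
        per subset, from the empty type up to the full product."""
        prims = list(primitives)
        return list(chain.from_iterable(
            combinations(prims, r) for r in range(len(prims) + 1)))

    # For three primitives the lattice has 2^3 = 8 points.
    print(product_lattice(["x1", "x2", "x3"]))

    # A multiple viewpoint system is any set of points on the lattice:
    system = [("x2",), ("x1", "x3")]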
The sequence mapping $\Phi_\tau$ is defined recursively:

$$\Phi_\tau(\langle\rangle) = \langle\rangle$$

$$\Phi_\tau(e_1^n) = \begin{cases} \Phi_\tau(e_1^{n-1}) :: \Psi_\tau(e_1^n) & \text{if } \Psi_\tau(e_1^n) \text{ is defined,} \\ \Phi_\tau(e_1^{n-1}) & \text{otherwise.} \end{cases}$$

Fig. 3. Combining predictions: viewpoints are combined within the long-term model
and within the short-term model, and the two resulting predictions are combined
into a final prediction.
MUSICAL VIEWPOINTS
As an example of all types that will be discussed in this section, refer to Fig. 4,
which shows the first two phrases from chorale 1, and the solution array (see the
previous section) for the fragment. Table 3 shows some applications of the func-
tions $\Psi$ and $\Phi$ (see the previous section) using some of the types discussed below
and the chorale fragment of Fig. 4. Note that the result of $\Psi_\tau$ is an element of $[\tau]$,
whereas the result of $\Phi_\tau$ is an element of $[\tau]^*$.
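The relationship between $\Psi_\tau$ and $\Phi_\tau$ can be sketched in Python as
follows (names are ours): a viewpoint function psi returns None where it is
undefined, and phi maps an event sequence to the sequence of defined values over
its prefixes.

    def phi(psi, events):
        """Apply psi to every prefix of the event sequence, skipping
        prefixes on which the viewpoint is undefined (None); the result
        is an element of [tau]*, a sequence of viewpoint elements."""
        result = []
        for i in range(1, len(events) + 1):
            v = psi(events[:i])
            if v is not None:
                result.append(v)
        return result

    def psi_seqint(prefix):
        """Sequential melodic interval, undefined on the first event."""
        if len(prefix) < 2:
            return None
        return prefix[-1] - prefix[-2]

    print(phi(psi_seqint, [60, 62, 62, 67]))   # [2, 0, 5]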
Basic types
The chorales are described by six basic types — pitch, st, duration, keysig, timesig
and fermata — where 'pitch' is the pitch of the event, keysig and timesig are respectively the key
signature and time signature of the chorale, fermata is a boolean type that indicates
whether an event is under a fermata, st is the start-time, and 'duration' is the
duration of the event.
The fundamental unit of time is the sixteenth note; all start-times and durations
are measured as multiples of this unit. The longest value in [duration] is a whole
note. The start-time of any chorale is 0, the zero time point representing the
beginning of the first bar, complete or incomplete, in the chorale. Due to upbeat
effects, the first event in a chorale may have a nonzero start-time (as in the chorale
in Fig. 4). The semantic domain of interpretation [pitch] ranges from C♮4 (middle
C) to G♮5 (19 semitones above middle C). Pitches are integer-encoded in a twelve-
tone system. The integer representation for pitch is the MIDI standard — [pitch]
ranges from 0 to 127, [60] = C♮4, [72] = C♮5, and so on. As a result, there is
no notion of enharmonic inequivalence; for example, C♯5 is semantically
equivalent to D♭5. Rests are not events; they are modelled by a difference in time
between the end of one event and the start of the next, as discussed in more detail
below. Repeated sections are not expanded. Ties over a bar line are also not
explicitly represented; only one event with the cumulative duration is encoded.
The key signature only states how many sharps or flats the chorale has, and
says nothing about its mode or tonic. Since sharps and flats cannot occur together
in the signature, and their number cannot exceed 7, the key signature can be
uniquely encoded as a number in the set {-7, ..., 7}: [-7] means "seven flats",
and [7] means "seven sharps", and so on. The time signature is measured in terms
of sixteenth notes per bar; for example, [12] = 3/4 time, and [16] = 4/4 time.
Furthermore, [v] means that bar lines occur at multiples of v time units.
Information about phrases is notated in a consistent manner throughout all
chorales using fermatas. This information provides very strong clues about the
properties of the next event — for example, many phrases in the chorales begin on
scale degree 1, 3 or 5, few end on 7, and so on. Phrase endings are represented on the score by
means of a fermata symbol. These symbols can be represented by a boolean type
fermata: [T] means that an event is under a fermata. The beginning of a phrase
is assumed to be the event immediately following an event under a fermata.
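Under this convention, phrase boundaries can be read directly off the fermata
values, as in this small sketch (ours); treating the first event as a phrase
beginning is our assumption.

    def phrase_beginnings(fermatas):
        """Indices of events assumed to begin a phrase: the first event,
        plus every event immediately following one under a fermata."""
        starts = [0]
        starts += [i + 1 for i, f in enumerate(fermatas)
                   if f and i + 1 < len(fermatas)]
        return starts

    print(phrase_beginnings([False, False, True, False, False, True, False]))
    # [0, 3, 6]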
The syntactic domain of any basic type must contain all elements of the type
that could be encountered in the chorales. The syntactic domains were discovered
by a simple analysis of 100 chorales. For example, the syntactic domain [duration]
is not {1, ..., 16} but {1,2,3,4,6,8,12,16}, and only 9 of the possible 15 key
signatures are actually encountered in the chorale melodies.
Derived types
About 20 different derived types for the chorales have been implemented. This
section presents these derived types, categorizing them according to the primary
basic type from which they are derived. Table 2 summarizes all derived types that
will be encountered. The first column of the table gives the symbolic name of the
type. The second column informally gives its semantic valuation function. The
third column details the syntactic domain of the type, and the last column shows
the basic types the type is derived from, and hence is capable of predicting. The
symbol Z+ denotes the positive integers {1,2,3,...}. The top part of Table 2 shows
all basic types. The bottom part shows some threaded derived types. These types
inspect a variable number of previous events in a chorale. Not all types of Table
2 will form useful viewpoints; some are used only to simplify the expression of
others.
Table 2. The basic and some primitive derived types for the chorales.

τ          semantics                            [τ]                    derived from
st         start-time of event                  {0,1,2,...}            st
pitch      pitch, in {C♮4,...,G♮5}              {60,...,79}            pitch
duration   quarter note, eighth note, etc.      {1,2,3,4,6,8,12,16}    duration
keysig     1 sharp, 1 flat, etc.                {-4,...,4}             keysig
timesig    3/4 time, 4/4 time                   {12,16}                timesig
fermata    event under / not under fermata?     {T,F}                  fermata
deltast    rest, no rest                        {0,4}                  st
gis221     difference in start-time             {1,...,20}             st
posinbar   position of event in bar             {0,...,15}             st
fib        first / not first in bar             {T,F}                  st
seqint     sequential melodic interval          Z                      pitch
contour    rising, falling, static              {-1,0,1}               pitch
referent   referent of piece                    {0,...,11}             keysig
intfref    vertical interval from referent      {0,...,11}             pitch
inscale    in / not in scale                    {T,F}                  pitch
intfib     interval from first event in bar     [seqint]               pitch
intfip     interval from first event in piece   [seqint]               pitch
intphbeg   interval from phrase beginning       [seqint]               pitch
thrbar     seqint at bars                       [seqint] × Z+          pitch, st
lphrase    length of phrase                     Z+                     fermata, st
thrph      seqint at phrases                    [seqint] × Z+          pitch
thrqu      seqint at quarters                   [seqint] × Z+          pitch, st
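Several of the derived types of Table 2 can be written down directly. In the
Python sketch below an event is a record of the six basic types, and each
viewpoint function takes a prefix of events and describes its last event,
returning None where undefined. The record layout and the major-mode reading of
the referent are our assumptions, not the paper's.

    # A hypothetical event record using the basic types of Table 2, e.g.
    # {"pitch": 67, "st": 12, "duration": 4, "keysig": 1,
    #  "timesig": 16, "fermata": False}

    def seqint(prefix):
        """Sequential melodic interval; undefined on the first event."""
        if len(prefix) < 2:
            return None
        return prefix[-1]["pitch"] - prefix[-2]["pitch"]

    def contour(prefix):
        """Rising, falling or static: the sign of seqint."""
        s = seqint(prefix)
        return None if s is None else (s > 0) - (s < 0)

    def posinbar(prefix):
        """Start-time of the event relative to the start of its bar;
        bar lines fall at multiples of the time signature."""
        return prefix[-1]["st"] % prefix[-1]["timesig"]

    def fib(prefix):
        """First in bar: true iff posinbar is 0."""
        return posinbar(prefix) == 0

    def referent(prefix):
        """Referent (tonic pitch class) read off the key signature,
        assuming major mode -- a simplification, since the signature
        alone does not determine the mode."""
        return (7 * prefix[-1]["keysig"]) % 12

    def intfref(prefix):
        """Vertical interval from the referent, modulo the octave."""
        return (prefix[-1]["pitch"] - referent(prefix)) % 12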
Fig. 4. Solution array, chorale 1, phrases 1 and 2, for a collection of basic and primitive
derived types.
Start-time
The gis221 type characterizes the time between successive onsets: [v] means that
the difference between the start-time of an event and the start-time of its
predecessor is v time units. Note that the phenomenon
modelled by gis221 cannot be captured by a product type duration ⊗ deltast.
Another useful way to characterize the start-time of an event is by position in
the bar. This is the "position in the bar" (posinbar) type: [v] means that the start-
time of an event relative to the "start-time" of the bar is v time units. By linking
it with some other type, we can capture metrical and position-dependent effects of
that type (posinbar itself is expressed using the time signature). The "first in bar"
type (fib) takes on boolean values: [T] means that an event has a posinbar value
equal to 0. In long sequences it can model effects such as "number of notes per
bar", and can usefully be linked with timesig.
The fib type is used in the expression of the "threaded bar" (thrbar) type. This
is an example of a threaded type: its values are only defined at certain points
throughout a chorale. For an event with the posinbar value equal to 0, [(a, b)] means
that the melodic interval (next section) between its pitch and the pitch of the first note
of the previous bar is a. The quantity b is the timescale of the value; it is the
difference in start-time between the two events. Note that the timescale varies and
is not always the same as the time signature; the first event in the previous bar
need not have a posinbar value of 0. This is the first example of a type which is
derived from two basic types — pitch and st.
A threaded type similar to thrbar, the thrqu type, works as follows. If an event
falls on a quarter-note pulse, [(v, b)] means that the interval between it and the latest
earlier event that falls on a quarter-note pulse is v. Note that this type also
possesses a timescale; for example, event e6 of the chorale in Fig. 4 occurs 8 time
units after the latest earlier event that falls on a quarter note pulse (event e4).
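A threaded type can be sketched in the same style, reusing posinbar from the
sketch above; the bar arithmetic here is our reading of the text, not code from
the paper.

    def thrbar(prefix):
        """Threaded-bar viewpoint: defined only when the last event
        begins a bar.  Returns (a, b), where a is the melodic interval
        from the first event of the previous bar and b is the timescale,
        the difference in start-times of the two events."""
        if posinbar(prefix) != 0:
            return None
        e = prefix[-1]
        bar = e["st"] // e["timesig"]
        previous_bar = [x for x in prefix[:-1]
                        if x["st"] // x["timesig"] == bar - 1]
        if not previous_bar:
            return None
        first = previous_bar[0]
        return (e["pitch"] - first["pitch"], e["st"] - first["st"])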
Pitch
the "interval from first in bar" (intfib) type: [vl means that the interval between
the first event in a bar and a given event is v. The interval from the first event in
the bar and itself is defined to be 0. The second is "interval from first in piece",
(intfip): [vl means that the interval between an event and the first event of the
piece is v. There is an interesting sort of symmetry between this type and intfref;
intfip views the first note in the piece as the "drone", and a link between the two
may help alleviate the problems with modes discussed above.
Duration
There are few expressive ways to talk about the duration of an event in a chorale
other than absolute duration. Lewin (1987) proposes a handful of interval systems
that might be considered. One, for example, specifies that one note lasts for a
certain fraction of the duration of another. This type can expressively model
phenomena such as augmentation of fugue subjects. It is felt that although Lewin's
systems for duration may be useful constructs for certain musics, their applicability
to the chorales is debatable. The set of durations used in the chorales is quite
small, and absolute durations are used in a very consistent manner. Although no
types are derived from duration, linking the basic type duration with others gives
useful abstractions. For example, a linked viewpoint duration ⊗ intfref can capture
the relative durations of various referent-degrees, and a linked viewpoint duration
⊗ fermata can model the duration of events ending phrases.
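Linked viewpoints compose mechanically: the product of two viewpoint functions is
defined exactly where both constituents are, as in this sketch (ours), which
reuses intfref from the earlier sketch.

    def linked(psi1, psi2):
        """Product (linked) viewpoint of two viewpoint functions,
        defined exactly where both constituents are defined."""
        def psi(prefix):
            v1, v2 = psi1(prefix), psi2(prefix)
            return None if v1 is None or v2 is None else (v1, v2)
        return psi

    # e.g. a duration (x) intfref viewpoint:
    duration_intfref = linked(lambda p: p[-1]["duration"], intfref)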
Fermata
EXPERIMENTAL RESULTS
Systems 1 to 3 are single-viewpoint ones. The system {seqint ⊗ gis221, pitch}
contains two viewpoints (system 4). It is interesting that it performs better than
either of its constituent viewpoints (systems 1 and 3). This means that the
viewpoint combination method described in Section 4 is effective. Systems 3, 5
and 6 show a similar phenomenon. Adding the viewpoint pitch to system 6 further
improves it (system 7).
It was hypothesized that the scale degree is correlated with the fact that an
event does or does not begin a bar. A product type intfref ⊗ fib was added to
system 7. The result (1.92 bits decreases to 1.87 bits) shows that this hypothesis
was correct (system 8). The low entropy estimate of 1.87 bits/pitch is interesting,
as humans do not seem to do much better at the chorale prediction task (Witten et
al. 1993): their entropy estimate is about 1.75 bits/pitch. However, this human
estimate is probably unstable as it is averaged over only two test chorales. See
(Witten et al. 1994) for a detailed comparison of the computational model
presented here with human performance at the chorale prediction task.
A useful way to present the results of prediction is with an entropy profile,
where the number of bits needed to specify an event (recall Expression 1) is
plotted against the event number in a particular piece. Figure 5 shows the entropy
profile for pitch for chorale 151, using the last system of Table 4. The peaks show
which events the computer finds surprising and very hard to predict, and the
troughs show the predictable events. The two high-entropy peaks of chorale 151
correspond to events 6 and 23. The machine predicted B at event 6 with
probability 0.97. This is most certainly due to the fact that the short-term viewpoint
of pitch must revert to a zero-length context to make a prediction; and out of the
five pitches so far, three of them have been B. Event 23 is a leap of a minor
seventh to begin a phrase. This is a highly unpredictable event, for both computers
and people. The machine predicted G at event 23 with probability 0.40: certainly
a reasonable prediction.
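An entropy profile follows directly from Expression 1: compute the information
content of each event under the model and plot it against event number. A sketch
(ours), assuming a model with a probability method like the context-model sketch
earlier:

    import math

    def entropy_profile(model, events):
        """Bits needed to specify each event given its preceding context
        (Expression 1).  Peaks mark surprising events, troughs mark
        predictable ones."""
        return [-math.log2(model.probability(events[:i], events[i]))
                for i in range(len(events))]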
The predictive theory was used to generate some new pieces. This was done by
taking a context, generating an event (recall the second section), then concatenating
that event to the context, producing a new context. This process is iterated a fixed
number of times. Figure 6 shows one of the generations produced in this manner.
The first seven events, and the complete rhythmic skeleton of chorale 151, were
supplied to the system.
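The generation loop described above can be sketched as follows (ours), again
assuming the probability interface of the earlier context-model sketch;
non-exclusivity guarantees every event receives nonzero weight.

    import random

    def generate(model, context, steps, event_space):
        """Iteratively sample an event from the model's distribution
        given the current context, append it, and repeat."""
        context = list(context)
        for _ in range(steps):
            weights = [model.probability(context, e) for e in event_space]
            context.append(random.choices(event_space, weights=weights)[0])
        return context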
CONCLUSION
This research was motivated by the goal of learning generative theories for the
chorale genre. This goal was replaced by one of minimizing the estimated entropy
of the genre, due to the conjecture that predictive power is a sufficient criterion for
a good generative theory.
Fig. 5. Entropy profile for pitch, chorale 151: the number of bits needed to specify
each event, plotted against event number.
Fig. 6. A piece generated by SONG/3. The first seven events of chorale 151 were
supplied as a context.
REFERENCES
Ames, C. (1989). The Markov process as a compositional model — A survey and tutorial.
Leonardo 22(2), 175-187.
Baroni, M. & Jacobini, C. (1978). Proposal for a Grammar of Melody. Les Presses de
l'Université de Montréal.
Bell, T.C., Cleary, J.G. & Witten, I.H. (1990). Text Compression. Prentice Hall.
Brooks, F.P., Hopkins Jr., A.L., Neumann, P.G. & Wright, W.V. (1956). An experiment in
musical composition. IRE Transactions on Electronic Computers EC-5, 175-182.
Brown, M. & Dempster, D.J. (1989). The scientific image of music theory. Journal of Music
Theory 33(1).
Buntine, W. (1988). Generalized subsumption and its application to induction and redundancy.
Artificial Intelligence 36, 149-176.
Cleary, J.G. & Witten, I.H. (1984). Data compression using adaptive coding and partial string
matching. IEEE Trans. Communications COM-32(4), 396-402.
Conklin, D. & Cleary, J.G. (1988). Modelling and generating music using multiple viewpoints.
In Proceedings of the First Workshop on AI and Music. The American Association for
Artificial Intelligence. 125-137.
Cope, D. (1987). An expert system for computer-assisted composition. Computer Music Journal
11(4), 30-46.
Dietterich, T.G. & Michalski, R.S. (1986). Learning to predict sequences. In R. Michalski, J.
Carbonell & T. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. II.
Morgan Kaufmann.
Darrell Conklin
Department of Computing and Information Science
Queen's University
Kingston, Ontario, Canada
conklin@[Link]
(613) 545-6301
Ian H. Witten
Department of Computer Science
The University of Waikato
Hamilton, New Zealand
ihw@[Link]