Structured Probabilistic Reasoning
(Incomplete draft)
Bart Jacobs
Institute for Computing and Information Sciences, Radboud University Nijmegen,
P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.
[email protected]    http://www.cs.ru.nl/~bart
Contents

Preface
1 Collections
1.1 Cartesian products
1.2 Lists
1.3 Subsets
1.4 Multisets
1.5 Multisets in summations
1.6 Binomial coefficients of multisets
1.7 Multichoose coefficients
1.8 Channels
1.9 The role of category theory
2 Discrete probability distributions
2.1 Probability distributions
2.2 Probabilistic channels
2.3 Frequentist learning: from multisets to distributions
2.4 Parallel products
2.5 Projecting and copying
2.6 A Bayesian network example
2.7 Divergence between distributions
3 Drawing from an urn
3.1 Accumulation and arrangement, revisited
3.2 Zipping multisets
3.3 The multinomial channel
3.4 The hypergeometric and Pólya channels
3.5 Iterated drawing from an urn
3.6 The parallel multinomial law: four definitions
3.7 The parallel multinomial law: basic properties
Preface
A disadvantage is that there is no deeply engrained familiarity with the field and with its development. But at the same time this distance may be an advantage, since it provides a fresh perspective, without sacred truths and without adherence to common practices and notations. For instance, the terminology and notation in this book are influenced by quantum theory, e.g. in using the words ‘state’, ‘observable’ and ‘test’ — as synonyms for ‘distribution’, for an R-valued function on a sample space, and for compatible (summable) predicates — in using ket notation |−⟩ in writing discrete probability distributions, or in using daggers as reversals, in analogy with conjugate transposes (for Hilbert spaces).
It should be said: for someone trained in formal methods, the area of prob-
ability theory can be rather sloppy: everything is called ‘P’, types are hardly
ever used, crucial ingredients (like distributions in expected values) are left
implicit, basic notions (like conjugate prior) are introduced only via examples,
calculation recipes and algorithms are regularly just given, without explana-
tion, goal or justification, etc. This hurts, especially because there is so much
beautiful mathematical structure around. For instance, the notion of a channel
(see below) formalises the idea of a conditional probability and carries a rich
mathematical structure that can be used in compositional reasoning, with both
sequential and parallel composition. The Bayesian inversion (‘dagger’) of a
channel does not only come with appealing mathematical (categorical) proper-
ties — e.g. smooth interaction with sequential and parallel composition — but
is also extremely useful in inference and learning. Via this dagger we can con-
nect forward and backward inference (see Theorem 6.6.3: backward inference
is forward inference with the dagger, and vice-versa) and capture the difference
between Pearl’s and Jeffrey’s update rules (see Theorem 6.7.4: Pearl increases
validity, whereas Jeffrey decreases divergence).
We even dare to think that this ‘sloppiness’ is ultimately a hindrance to fur-
ther development of the field, especially in computer science, where computer-
assisted reasoning requires a clear syntax and semantics. For instance, it is
hard to even express the above-mentioned theorems 6.6.3 and 6.7.4 in stan-
dard probabilistic notation. One can speculate that states/distributions are kept
implicit in traditional probability theory because in many examples they are
used as a fixed implicit assumption in the background. Indeed, in mathemati-
cal notation one tends to omit — for efficiency — the least relevant (implicit)
parameters. But the essence of probabilistic computation is state transforma-
tion, where it has become highly relevant to know explicitly in which state one
is working at which stage. The notation developed in this book helps in such
situations — and in many other situations as well, we hope.
Apart from having beautiful structure, probability theory also has magic. It
can be found in the following two points.
The combination of these two points is very powerful and forms the basis for
probabilistic reasoning. For instance, if we know that two phenomena are re-
lated, and we have new information about one of them, then we also learn
something new about the other phenomenon, after updating. We shall see that
such crossover ripple effects can be described in two equivalent ways, starting
from a joint distribution with evidence in one component.
In any case, such processing should be subject to suitable safeguards, which should
include specific information to the data subject and the right to obtain human interven-
tion, to express his or her point of view, to obtain an explanation of the decision reached
after such assessment and to challenge the decision.
It is not acceptable that your mortgage request is denied because you drive a blue car — even in the presence of a correlation between driving blue cars and being late on one’s mortgage payments.
These and other developments have led to a new area called Explainable
Artificial Intelligence (XAI), which strives to provide decisions with explana-
tions that can be understood easily by humans, without bias or discrimination.
Although this book will not contribute to XAI as such, it aims to provide a
mathematically solid basis for such explanations.
In this context it is appropriate to quote Judea Pearl [134] from 1989 about
a divide that is still wide today.
To those trained in traditional logics, symbolic reasoning is the standard, and non-
monotonicity a novelty. To students of probability, on the other hand, it is symbolic
reasoning that is novel, not nonmonotonicity. Dealing with new facts that cause proba-
bilities to change abruptly from very high values to very low values is a commonplace
phenomenon in almost every probabilistic exercise and, naturally, has attracted special
attention among probabilists. The new challenge for probabilists is to find ways of ab-
stracting out the numerical character of high and low probabilities, and cast them in
linguistic terms that reflect the natural process of accepting and retracting beliefs.
This book does not pretend to fill this gap. One of the big embarrassments of
the field is that there is no widely accepted symbolic logic for probability, to-
gether with proof rules and a denotational semantics. Such a logic for symbolic
reasoning about probability will be non-trivial, because it will have to be non-
monotonic1 — a property that many logicians shy away from. This book does
aim to contribute towards bridging the divide mentioned by Pearl, by provid-
ing a mathematical basis for such a symbolic probabilistic logic, consisting of
channels, states, predicates, transformations, conditioning, disintegration, etc.
From the perspective of this book, the structured categorical approach to
probability theory began with the work of Bill Lawvere (already in the 1960s)
and his student Michèle Giry. They recognised that taking probability dis-
tributions has the structure of a monad, which was published in the early
1980s in [58]. Roughly at the same time Dexter Kozen started the systematic
investigation of probabilistic programming languages and logics, published
in [106, 107]. The monad introduced back then is now called the Giry monad.
1 Informally, a logic is non-monotonic if adding assumptions may make a conclusion less true. For instance, I may think that scientists are civilised people, until, at some conference dinner, a heated scientific debate ends in a fist fight.
Contents overview
The first chapter of the book covers introductory material that is meant to set
the scene. It starts from basic collection types like lists and subsets, and con-
tinues with multisets, which receive most attention. The chapter discusses the
(free) monoid structure on all these collection types and introduces ‘unit’ and
‘flatten’ maps as their common, underlying structure. It also introduces the
basic concept of a channel, for these three collection types, and shows how
channels can be used for state transformation and how they can be composed,
both sequentially and in parallel. At the end, the chapter provides definitions
of the relevant notions from category theory.
In the second chapter (discrete) probability distributions first emerge, as a
special collection type, with their own associated form of (probabilistic) chan-
nel. The subtleties of parallel products of distributions (states), with entwined-
ness/correlation between components and the non-naturality of copying, are
discussed at this early stage. This culminates in an illustration of Bayesian net-
works in terms of (probabilistic) channels. It shows how predictions are made
within such Bayesian networks via state transformation and via compositional
reasoning, basically by translating the network structure into (sequential and
parallel) composites of channels.
Blindly drawing coloured balls from an urn is a basic model in discrete prob-
ability. Such draws are analysed systematically in Chapter 3, not only for the
two familiar multinomial and hypergeometric forms (with or without replace-
ment), but also in the less familiar Pólya and Ewens forms. By describing these
draws as probabilistic channels we can derive the well known formulations for
these draw-distributions via channel-composition. Once formulated in terms of
channels, these distributions satisfy various compositionality properties. They
are typical for our approach and are (largely) absent in traditional treatments
of this topic. Urns and draws from urns are both described as multisets. The
interplay between multisets and distributions is an underlying theme in this
chapter. There is a fundamental distributive law between multisets and distri-
butions that expresses basic structural properties.
The fourth chapter is more logically oriented, via observables X → R (in-
cluding factors, predicates and events) that can be defined on sample spaces
X, providing numerical information. The chapter concentrates on validity of
observables in states and on transformation of observables. Where the second
chapter introduces state transformation along a probabilistic channel in a for-
ward direction, this fourth chapter adds observable (predicate) transformation
in a backward direction. These two operations are of fundamental importance
in program semantics, and also in quantum computation — where they are dis-
tinguished as Schrödinger’s (forward) and Heisenberg’s (backward) approach.
In this context, a random variable is a combination of a state and an observable,
on the same underlying sample space. The statistical notions of variance and
covariance are described in terms of validity for such random variables in
els. But the most fundamental technique that is introduced in this chapter, via
string diagrams, is disintegration. In essence, it is the well known procedure
of extracting a conditional probability P(y | x) from a joint probability P(x, y).
One of the themes running through this book is how ‘crossover’ influence can
be captured via channels — extracted from joint states via disintegration —
in particular via forward and backward inference. This phenomenon is what
makes (reasoning in) Bayesian networks work. Disintegration is of interest in
itself, but also provides an intuitive formalisation of the Bayesian inversion of
a channel.
Almost all of the material in these chapters is known from the literature, but
typically not in the channel-based form in which it is presented here. This book
includes many examples, often copied from familiar sources, with the deliber-
ate aim of illustrating how the channel-based approach actually works. Since
many of these examples are taken from the literature, the interested reader may
wish to compare the channel-based description used here with the original de-
scription.
• The (non-trivial) calculations in this book have been carried out with the EfProb library [20] for channel-based probability. Several calculations in this book can be done by hand, typically when the outcomes are described as fractions, like 117/2012. Such calculations are meant to be reconstructable by a motivated reader who really wishes to learn the ‘mechanics’ of the field. Doing such calculations is a great way to really understand the topic — and the approach of this book.² Outcomes written in decimal notation 0.1234, as approximations, or as plots, serve to give an impression of the results of a computation.
• For the rest of this book, beyond Chapter 7, several additional chapters exist
2 Doing the actual calculations can be a bit boring and time consuming, but there are useful
online tools for calculating fractions, such as
https://www.mathpapa.com/fraction-calculator.html. Recent versions of EfProb also allow
calculations in fractional form.
1 Collections
There are several ways to put elements from a given set together, for instance
as lists, subsets, multisets, and as probability distributions. This introductory
chapter takes a systematic look at such collections and seeks to bring out nu-
merous similarities. For instance, lists, subsets and multisets all form monoids,
by suitable unions of collections. Unions of distributions are more subtle and
take the form of convex combinations. Also, subsets, multisets and distribu-
tions can be combined naturally via parallel products ⊗, though lists cannot.
In this first chapter, we collect some basic operations and properties of tuples,
lists, subsets and multisets — where multisets are ‘sets’ in which elements
may occur multiple times. Probability distributions will be postponed to the
next chapter. Especially, we collect several basic definitions and results for
multisets, since they play an important role in the sequel, as urns, filled with
coloured balls, as draws from such urns, and as data in learning.
The main differences between lists, subsets and multisets are summarised in the table below.
                              lists    subsets    multisets
  order of elements matters    yes        no          no
  multiplicities matter        yes        no          yes
For instance, the lists [a, a, b], [a, b, a] and [a, b] are all different. The multisets 2|a⟩ + 1|b⟩ and 1|b⟩ + 2|a⟩, with the element a occurring twice and the element b occurring once, are the same. However, 1|a⟩ + 1|b⟩ is a different multiset. Similarly, the sets {a, b}, {b, a}, and {a} ∪ {a, b} are the same.
These collections are important in themselves, in many ways, and primar-
ily (in this book) as outputs of channels. Channels are functions of the form
input → T (output), where T is a ‘collection’ operator, for instance, combin-
ing elements as lists, subsets, multisets, or distributions. Such channels capture
X1 × X2 ≔ {(x1, x2) | x1 ∈ X1 and x2 ∈ X2}.
X1 × · · · × Xn ≔ {(x1, . . . , xn) | x1 ∈ X1, . . . , xn ∈ Xn}.
differently using the symbol ∏, as:
∏_{1≤i≤n} Xi    or more informally as:    ∏_i Xi.
In the latter case it is left implicit what the range is of the index element i.
We allow n = 0. The resulting ‘empty’ product is then written as a singleton set, written as 1, containing the empty tuple () as sole element, as in:
1 ≔ {()}.
For n = 1 the product X1 × · · · × Xn is (isomorphic to) the set X1 . Note that we
are overloading the symbol 1 and using it both as numeral and as singleton set.
If one of the sets Xi in a product X1 × · · · × Xn is empty, then the whole
product is empty. Also, if all of the sets Xi are finite, then so is the product
X1 × · · · × Xn . In fact, the number of elements of X1 × · · · × Xn is then obtained
by multiplying all the numbers of elements of the sets Xi .
πi ◦ ⟨f1, . . . , fn⟩ = fi.    (1.1)
This is an equality of functions. It can be proven easily by applying both sides to an arbitrary element y ∈ Y.
There are some more ‘obvious’ equations about tupling of functions:
⟨f1, . . . , fn⟩ ◦ g = ⟨f1 ◦ g, . . . , fn ◦ g⟩    and    ⟨π1, . . . , πn⟩ = id,    (1.2)
where g : Z → Y is an arbitrary function. In the last equation, id is the identity function on the product X1 × · · · × Xn.
In a Cartesian product we place sets ‘in parallel’. We can also place functions in parallel: functions fi : Xi → Yi give a function
f1 × · · · × fn : X1 × · · · × Xn → Y1 × · · · × Yn
via:
f1 × · · · × fn ≔ ⟨f1 ◦ π1, . . . , fn ◦ πn⟩
so that:
(f1 × · · · × fn)(x1, . . . , xn) = (f1(x1), . . . , fn(xn)).
The latter formulation clearly shows how the functions fi are applied in parallel to the elements xi.
We overload the product symbol ×, since we use it both for sets and for
functions. This may be a bit confusing at first, but it is in fact quite convenient.
This new set X^Y is sometimes called the function space or the exponent of X and Y. Notice that this exponent notation is consistent with the above one for powers, since functions n → X can be identified with n-tuples of elements in X.
These exponents X^Y are related to products in an elementary and useful way, namely via a bijective correspondence between functions:
f : Z × Y → X
==============    (1.3)
g : Z → X^Y
Exercises
1.1.1 Check what a tuple function ⟨π2, π3, π6⟩ does on a product set X1 × · · · × X8. What is the codomain of this function?
1.1.2 Check that, in general, the tuple function ⟨f1, . . . , fn⟩ is the unique function h : Y → X1 × · · · × Xn with πi ◦ h = fi for each i.
1.1.3 Prove, using Equations (1.1) and (1.2) for tuples and projections, that:
(g1 × · · · × gn) ◦ (f1 × · · · × fn) = (g1 ◦ f1) × · · · × (gn ◦ fn).
1.1.4 Check that for each set X there is a unique function X → 1. Because
of this property the set 1 is sometimes called ‘final’ or ‘terminal’. The
unique function is often denoted by !.
Check also that a function 1 → X corresponds to an element of X.
1.1.5 Define functions in both directions, using tuples and projections, that yield isomorphisms:
X × Y ≅ Y × X    1 × X ≅ X    X × (Y × Z) ≅ (X × Y) × Z.
X^1 ≅ X    1^Y ≅ 1    (X × Y)^Z ≅ X^Z × Y^Z    X^{Y×Z} ≅ (X^Y)^Z.
Define a function
zip[K] : X^K × Y^K → (X × Y)^K
by: zip[K](f, g)(i) ≔ (f(i), g(i)).
1.2 Lists
The datatype of (finite) lists of elements from a given set is well-known in
computer science, especially in functional programming. This section collects
some basic constructions and properties, especially about the close relationship
between lists and monoids.
For an arbitrary set X we write L(X) for the set of all finite lists [x1 , . . . , xn ]
of elements xi ∈ X, for arbitrary n ∈ N. Notice that we use square brackets
[−] for lists, to distinguish them from tuples, which are typically written with
round brackets (−).
Thus, the set of lists over X can be defined as a union of all powers of X, as in:
L(X) ≔ ⋃_{n∈N} X^n.
When the elements of X are letters of an alphabet, then L(X) is the set of words — the language — over this alphabet. The set L(X) is alternatively written as X^⋆, and called the Kleene star of X.
We zoom in on some trivial cases. One has L(0) ≅ 1, since one can only form the empty word over the empty alphabet 0 = ∅. If the alphabet contains only one letter, a word consists of a finite number of occurrences of this single letter. Thus: L(1) ≅ N.
We consider lists as an instance of what we call a collection data type, since
L(X) collects elements of X in a certain manner. What distinguishes lists from
other collection types is that elements may occur multiple times, and that the
order of occurrence matters. The three lists [a, b, a], [a, a, b], and [a, b] differ.
As mentioned in the introduction to this chapter, within a subset orders and
multiplicities do not matter, see Section 1.3; and in a multiset the order of
elements does not matter, but multiplicities do matter, see Section 1.4.
Let f : X → Y be an arbitrary function. It can be used to map lists over X into lists over Y by applying f element-wise. This is what functional programmers call map-list. Here we like overloading, so we write L(f) : L(X) → L(Y) for this function, defined as:
L(f)([x1, . . . , xn]) ≔ [f(x1), . . . , f(xn)].
Thus, L is an operation that not only sends sets to sets, but also functions to functions. It does so in such a way that identity maps and compositions are preserved:
L(id) = id    and    L(g ◦ f) = L(g) ◦ L(f).
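As an illustration of this functoriality, here is a minimal Python sketch; it is our own illustration (not from the book or from its EfProb library), and the names L, f, g, xs are ours:

    def L(f):
        # L(f) applies f elementwise: L(f)([x1,...,xn]) = [f(x1),...,f(xn)]
        return lambda xs: [f(x) for x in xs]

    f = lambda x: x + 1
    g = lambda x: 2 * x
    xs = [1, 2, 3]
    assert L(lambda x: x)(xs) == xs                    # L(id) = id
    assert L(lambda x: g(f(x)))(xs) == L(g)(L(f)(xs))  # L(g . f) = L(g) . L(f)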
1.2.1 Monoids
A monoid is a very basic mathematical structure. For convenience we define it
explicitly.
Definition 1.2.1. A monoid consists of a set M with a binary operation M ×
M → M, written for instance as infix +, together with an identity element, say
written as 0 ∈ M. The binary operation + is associative and has 0 as identity
on both sides. That is, for all a, b, c ∈ M,
a + (b + c) = (a + b) + c and 0 + a = a = a + 0.
The monoid is called commutative if a + b = b + a, for all a, b ∈ M. It is called
idempotent if a + a = a for all a ∈ M.
Let (M, 0, +) and (N, 1, ·) be two monoids. A function f : M → N is called a homomorphism of monoids if f preserves the unit and binary operation, in the sense that:
f(0) = 1    and    f(a + b) = f(a) · f(b).
Thus, lists are monoids via concatenation. But there is more to say: lists are
free monoids. We shall occasionally make use of this basic property and so we
like to make it explicit. We shall encounter similar freeness properties for other
collection types.
Each element x ∈ X yields a singleton list unit(x) ≔ [x] ∈ L(X). The resulting function unit : X → L(X) plays a special role, see also the next subsection. Each function f : X → M to a monoid M extends uniquely to a homomorphism of monoids f̄ : L(X) → M with:
f̄ ◦ unit = f.    (1.4)
Necessarily, f̄([]) = 0 and f̄([x]) = f(x). Further, on a list [x1, . . . , xn] of length n ≥ 2 we necessarily have:
f̄([x1, . . . , xn]) = f̄([x1] ++ · · · ++ [xn])
  = f̄([x1]) + · · · + f̄([xn])
  = f(x1) + · · · + f(xn).
The exercises below illustrate this result. For future use we introduce monoid actions and their homomorphisms.
A map of monoid actions, from α : M × X → X to β : M × Y → Y, is a function f : X → Y with
f(α(a, x)) = β(a, f(x))    for all a ∈ M, x ∈ X,
that is, with f ◦ α = β ◦ (id × f) : M × X → Y.
Monoid actions are quite common in mathematics. For instance, scalar mul-
tiplication of a vector space forms an action. Also, as we shall see, probabilistic
updating can be described via monoid actions. The action map α : M × X → X
can be understood intuitively as pushing the elements in X forward with a
quantity from M. It then makes sense that the zero-push is the identity, and
that a sum-push is the composition of two individual pushes.
The next result contains some basic properties about unit and flatten. These
properties will first be formulated in terms of equations, and then, alternatively
as commuting diagrams. The latter style is preferred in this book.
Lemma 1.2.5. 1 The unit and flatten maps are natural: for each function f : X → Y,
unit ◦ f = L(f) ◦ unit    and    flat ◦ L(L(f)) = L(f) ◦ flat.
2 Moreover, the following unit and flatten laws hold:
flat ◦ unit = id = flat ◦ L(unit)    and    flat ◦ flat = flat ◦ L(flat).
Proof. We shall do the first cases of each item, leaving the second cases to the interested reader. First, for f : X → Y and x ∈ X one has:
(L(f) ◦ unit)(x) = L(f)([x]) = [f(x)] = unit(f(x)) = (unit ◦ f)(x).
Next, for the second item we take an arbitrary list [x1, . . . , xn] ∈ L(X). Then:
(flat ◦ unit)([x1, . . . , xn]) = flat([[x1, . . . , xn]]) = [x1, . . . , xn]
(flat ◦ L(unit))([x1, . . . , xn]) = flat([unit(x1), . . . , unit(xn)])
  = flat([[x1], . . . , [xn]]) = [x1, . . . , xn].
L(α), as in:
α ◦ unit = id    and    α ◦ flat = α ◦ L(α).    (1.5)
2 Let (M1, u1, +1) and (M2, u2, +2) be two monoids, with corresponding algebras α1 : L(M1) → M1 and α2 : L(M2) → M2. A function f : M1 → M2 is a map of monoids if and only if the rectangle
α2 ◦ L(f) = f ◦ α1    (1.6)
commutes.
This result says that instead of giving a binary operation + with an identity
element u we can give a single operation α that works on all sequences of el-
ements. This is not so surprising, since we can apply the sum multiple times.
The more interesting part is that the monoid equations can be captured uni-
formly by the diagrams/equations (1.5). We shall see that the same diagrams also work for other types of monoids (and collection types).
similar manner:
x + (y + z) = α([x, α([y, z])])
  = α([α(unit(x)), α([y, z])])    by (1.5)
  = α(L(α)([[x], [y, z]]))
  = α(flat([[x], [y, z]]))    by (1.5)
  = α([x, y, z])
  = α(flat([[x, y], [z]]))
  = α(L(α)([[x, y], [z]]))    by (1.5)
  = α([α([x, y]), α(unit(z))])
  = α([α([x, y]), z])    by (1.5)
  = (x + y) + z.
For the second item, if f is a homomorphism of monoids, then:
(α2 ◦ L(f))([x1, . . . , xn]) = α2([f(x1), . . . , f(xn)])
  = f(x1) + · · · + f(xn)
  = f(x1 + · · · + xn)    since f is a homomorphism
  = (f ◦ α1)([x1, . . . , xn]).
For instance, for N = 4 this inverse image contains the eight lists:
[1, 1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 1, 1], [2, 2], [1, 3], [3, 1], [4]. (1.7)
We can interpret the situation as follows. Suppose we have coins with value
n ∈ N>0 , for each n. Then we can ask, for an amount N: how many (ordered)
ways are there to lay out the amount N in coins? For N = 4 the different layouts
are given above. Other interpretations are possible: one can also think of the
sequences (1.7) as partitions of the numbers {1, 2, 3, 4}.
Here is a first, easy counting result.
Lemma 1.2.7. For N ∈ N>0, the subset sum⁻¹(N) ⊆ L(N>0) has 2^{N−1} elements.
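This count is easy to confirm by brute force; the following Python sketch is our own illustration, enumerating the ordered layouts and checking the count 2^{N−1} for N = 4:

    def layouts(N):
        # all lists of positive numbers with sum N (ordered coin layouts)
        if N == 0:
            return [[]]
        return [[k] + rest for k in range(1, N + 1) for rest in layouts(N - k)]

    assert len(layouts(4)) == 8 == 2 ** (4 - 1)  # the eight lists in (1.7)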
∑_{0≤k≤K} 1/2^k = (2^{K+1} − 1)/2^K.    (∗)
1/24    1/12    1/12    1/12    1/8    1/6    1/6    1/4,    with sum 1.
Obviously, elements in a list are ordered. Thus, in (1.7) we distinguish between coin layouts [1, 1, 2], [1, 2, 1] and [2, 1, 1]. However, when we commonly discuss which coins add up to 4 we do not take the order into account, for instance in saying that we use two coins of value 1 and one coin of 2, without caring about the order. In doing so, we are not using lists as collection type, but multisets — in which the order of elements does not matter.
These multisets form an important alternative collection type; they are dis-
cussed from Section 1.4 onwards.
Exercises
1.2.1 Let X = {a, b, c} and Y = {u, v} be sets with a function f : X → Y given by f(a) = u = f(c) and f(b) = v. Write ℓ1 = [c, a, b, a] and ℓ2 = [b, c, c, c]. Compute consecutively:
• ℓ1 ++ ℓ2
• ℓ2 ++ ℓ1
• ℓ1 ++ ℓ1
• ℓ1 ++ (ℓ2 ++ ℓ1)
• (ℓ1 ++ ℓ2) ++ ℓ1
• L(f)(ℓ1)
• L(f)(ℓ2)
• L(f)(ℓ1) ++ L(f)(ℓ2)
• L(f)(ℓ1 ++ ℓ2).
1.2.2 We write log for the logarithm function with some base b > 0, so that log(x) = y iff x = b^y. Verify that the logarithm function log is a map of monoids:
log : (R>0, 1, ·) → (R, 0, +).
flat([ℓ1, . . . , ℓn]) = ℓ1 ++ · · · ++ ℓn.
Check that the following square commutes, where ‖−‖ is the length of a list:
sum ◦ L(‖−‖) = ‖−‖ ◦ flat : L(L(X)) → N.
1.3 Subsets
The next collection type that will be studied is powerset. The symbol P is
commonly used for the powerset operator. We will see that there are many
similarities with lists L from the previous section. We again pay much attention
to monoid structures.
For an arbitrary set X we write P(X) for the set of all subsets of X, and Pfin(X) for the set of finite subsets. Thus:
P(X) ≔ {U | U ⊆ X}    and    Pfin(X) ≔ {U ⊆ X | U is finite}.
If X is a finite set itself, there is no difference between P(X) and Pfin (X). In the
sequel we shall speak mostly about P, but basically all properties of interest
hold for Pfin as well.
First of all, P is a functor: it works both on sets and on functions. Given a function f : X → Y we can define a new function P(f) : P(X) → P(Y) by taking the image of f on a subset. Explicitly, for U ⊆ X,
P(f)(U) ≔ {f(x) | x ∈ U}.
For instance, for a binary relation R ⊆ X × Y,
P(π1)(R) = {π1(z) | z ∈ R} = {π1(x, y) | (x, y) ∈ R} = {x | ∃y. (x, y) ∈ R}.
The next topic is the monoid structure on powersets. The first result is an
analogue of Lemma 1.2.2 and its proof is left to the reader.
Lemma 1.3.1. 1 For each set X, the powerset P(X) is a commutative and
idempotent monoid, with empty subset ∅ ∈ P(X) as identity element and
union ∪ of subsets of X as binary operation.
2 Each P( f ) : P(X) → P(Y) is a map of monoids, for f : X → Y.
Next we define unit and flatten maps for subsets, much like for lists. The function unit : X → P(X) sends an element to a singleton subset: unit(x) ≔ {x}. The flatten function flat : P(P(X)) → P(X) is given by union: for A ⊆ P(X),
flat(A) ≔ ⋃A = {x ∈ X | ∃U ∈ A. x ∈ U}.
Both maps are natural, that is, for each function f : X → Y the following squares commute:
unit ◦ f = P(f) ◦ unit    and    flat ◦ P(P(f)) = P(f) ◦ flat.
Moreover, the unit and flatten laws hold, as for lists:
flat ◦ unit = id = flat ◦ P(unit)    and    flat ◦ flat = flat ◦ P(flat).
The support map supp : L(X) → Pfin(X) is defined by supp([x1, . . . , xn]) ≔ {x1, . . . , xn}. Thus, the support of a list is the subset of elements occurring in the list. The support function removes order and multiplicities. The latter happens implicitly, via the set notation, above on the right-hand side. For instance,
supp([b, a, b, b, b]) = {a, b} = {b, a}.
Notice that there is no way to go in the other direction, namely Pfin (X) → L(X).
Of course, one can for each subset choose an order of the elements in order to
turn the subset into a list. However, this process is completely arbitrary and is
not uniform (natural).
The support function interacts nicely with the structures that we have seen
so far. This is expressed in the result below, where we use the same notation
unit and flat for different functions, namely for L and for P. The context, and
especially the type of an argument, will make clear which one is meant.
Lemma 1.3.3. Consider the support map supp : L(X) → Pfin(X) defined above.
1 It is a map of monoids (L(X), [], ++) → (P(X), ∅, ∪).
2 It is natural, in the sense that for f : X → Y one has:
supp ◦ L(f) = Pfin(f) ◦ supp.
3 It commutes with the unit and flatten maps of list and powerset, as in:
supp ◦ unit = unit    and    supp ◦ flat = flat ◦ supp ◦ L(supp).
(Pfin(f) ◦ supp)([x1, . . . , xn]) = Pfin(f)({x1, . . . , xn})
  = {f(x1), . . . , f(xn)}
  = supp([f(x1), . . . , f(xn)])
  = supp(L(f)([x1, . . . , xn]))
  = (supp ◦ L(f))([x1, . . . , xn]).
The second diagram requires a bit more work. Starting from a list of lists we get:
(flat ◦ supp ◦ L(supp))([ℓ1, . . . , ℓn]) = flat({supp(ℓ1), . . . , supp(ℓn)})
  = supp(ℓ1) ∪ · · · ∪ supp(ℓn)
  = supp(ℓ1 ++ · · · ++ ℓn)
  = (supp ◦ flat)([ℓ1, . . . , ℓn]).
Proposition 1.3.4. Finite powersets are free commutative idempotent monoids: each function f : X → M to a commutative idempotent monoid M extends uniquely to a homomorphism of monoids f̄ : Pfin(X) → M with:
f̄ ◦ unit = f.    (1.9)
The order in the above sum f(x1) + · · · + f(xn) does not matter since M is commutative. The function f̄ sends unions to sums since + is idempotent.
Proposition 1.3.5. Let X be an arbitrary set.
1 Commutative idempotent monoid structures on X correspond to functions α : Pfin(X) → X making the following diagrams commute:
α ◦ unit = id    and    α ◦ flat = α ◦ Pfin(α).    (1.10)
2 Let (M1, u1, +1) and (M2, u2, +2) be two commutative idempotent monoids, with corresponding Pfin-algebras α1 : Pfin(M1) → M1 and α2 : Pfin(M2) → M2. A function f : M1 → M2 is a map of monoids if and only if the rectangle
α2 ◦ Pfin(f) = f ◦ α1    (1.11)
commutes.
Proof. This works very much like in the proof of Proposition 1.2.6. If (X, u, +) is a monoid, we define α : Pfin(X) → X by freeness as α({x1, . . . , xn}) ≔ x1 + · · · + xn. In the other direction, given α : Pfin(X) → X we define a sum as x + y ≔ α({x, y}) with unit u ≔ α(∅). Clearly, this sum + on X is commutative and idempotent.
This result concentrates on the finite powerset functor Pfin . One can also con-
sider algebras P(X) → X for the (general) powerset functor P. Such algebras
turn the set X into a complete lattice, see [116, 9, 92] for details.
1.3.3 Extraction
So far we have concentrated on how similar lists and subsets are: the only struc-
tural difference that we have seen up to now is that subsets form an idempotent
and commutative monoid. But there are other important differences. Here we
look at subsets of product sets, also known as relations.
The observation is that one can extract functions from a relation R ⊆ X × Y, namely functions of the form extr1(R) : X → P(Y) and extr2(R) : Y → P(X), given by:
extr1(R)(x) = {y ∈ Y | (x, y) ∈ R}    and    extr2(R)(y) = {x ∈ X | (x, y) ∈ R}.
In fact, one can easily reconstruct the relation R from extr1(R), and also from extr2(R), via:
R = {(x, y) | y ∈ extr1(R)(x)} = {(x, y) | x ∈ extr2(R)(y)}.
This all looks rather trivial, but such function extraction is less trivial for other
data types, as we shall see later on for distributions, where it will be called
disintegration.
Using the exponent notation from Subsection 1.1.2 we can summarise the
situation as follows. There are two isomorphisms:
P(Y)^X ≅ P(X × Y) ≅ P(X)^Y.    (1.12)
Functions of the form X → P(Y) will later be called ‘channels’ from X to
Y, see Section 1.8. What we have just seen will then be described in terms of
‘extraction of channels’.
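The following Python sketch is our own illustration of this correspondence on a tiny relation; the helper names extr1 and extr2 simply mirror the notation of the text:

    def extr1(R):
        # channel X -> P(Y) extracted from a relation R
        return lambda x: {y for (x2, y) in R if x2 == x}

    def extr2(R):
        # channel Y -> P(X) in the other direction
        return lambda y: {x2 for (x2, y2) in R if y2 == y}

    R = {(1, 'a'), (1, 'b'), (2, 'b')}
    X, Y = {1, 2}, {'a', 'b'}
    # R can be reconstructed from either extracted channel:
    assert R == {(x, y) for x in X for y in extr1(R)(x)}
    assert R == {(x, y) for y in Y for x in extr2(R)(y)}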
The next lemma makes a basic result explicit, partly because it provides valuable insight in itself, but also because we shall see generalisations later on, for multisets instead of subsets. We use the familiar binomial and multinomial coefficients. We recall their definitions, for natural numbers k ≤ n and m1, . . . , mℓ with ℓ ≥ 2 and ∑i mi = m:
(n choose k) ≔ n!/(k! · (n − k)!)    and    (m choose m1, . . . , mℓ) ≔ m!/(m1! · . . . · mℓ!).    (1.13)
Recall that a partition of a set X is a disjoint cover: a finite collection of subsets U1, . . . , Uk ⊆ X satisfying Ui ∩ Uj = ∅ for i ≠ j and ⋃i Ui = X. We do not assume that the subsets Ui are non-empty.
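For concreteness, here is a Python sketch (our own, using only the standard library) of the coefficients in (1.13):

    from math import comb, factorial, prod

    def multinomial(*ms):
        # multinomial coefficient (m choose m1,...,ml) with m = m1 + ... + ml, as in (1.13)
        return factorial(sum(ms)) // prod(factorial(mi) for mi in ms)

    assert comb(5, 2) == 10          # (5 choose 2)
    assert multinomial(3, 2) == 10   # (5 choose 3, 2)
    assert multinomial(2, 2, 1) == 30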
Exercises
1.3.1 Continuing Exercise 1.2.1, compute:
• supp(ℓ1)
• supp(ℓ2)
• supp(ℓ1 ++ ℓ2)
• supp(ℓ1) ∪ supp(ℓ2)
• supp(L(f)(ℓ1))
• Pfin(f)(supp(ℓ1)).
1.3.2 We have used finite unions (∅, ∪) as monoid structure on P(X) in
Lemma 1.3.1 (1). Intersections (X, ∩) give another monoid structure
on P(X).
1.4 Multisets
So far we have discussed two collection data types, namely lists and subsets of
elements. In lists, elements occur in a particular order, and may occur multiple
times (at different positions). Both properties are lost when moving from lists
to sets. In this section we look at multisets, which are ‘sets’ in which elements
may occur multiple times. Hence multisets are in between lists and subsets,
since they do allow multiple occurrences, but the order of their elements is
irrelevant.
The list and powerset examples are somewhat remote from probability the-
ory. But multisets are much more directly relevant: first, because we use a
similar notation for multisets and distributions; and second, because observed
data can be organised nicely in terms of multisets. For instance, for statistical
analysis, a document is often seen as a multiset of words, in which one keeps
track of the words that occur in the document together with their frequency
(multiplicity); in that case, the order of the words is ignored. Also, tables with
observed data can be organised naturally as multisets, see Subsection 1.4.1 be-
low. Learning from such tables will be described in Section 2.1 as a (natural)
transformation from multisets to distributions.
Despite their importance, multisets do not have a prominent role in pro-
gramming, like lists have. Eigenvalues of a matrix form a clear example where
the ‘multi’ aspect is ignored in mathematics: eigenvalues may occur multiple
times, so the proper thing to say is that a matrix has a multiset of eigenvalues.
One reason for not using multisets may be that there is no established notation.
We shall use a ‘ket’ notation |−⟩ that is borrowed from quantum theory, but
interchangeably also a functional notation. Since multisets are less familiar,
we take time to introduce the basic definitions and properties, in Sections 1.4
– 1.7.
We start with an introduction about notation, terminology, and conventions
for multisets. Consider a set C = {R, G, B} for the three colours Red, Green,
Blue. An example of a multiset over C is:
2|R⟩ + 5|G⟩ + 0|B⟩.
In this multiset the element R occurs 2 times, G occurs 5 times, and B occurs
0 times. The latter means that B does not occur, that is, B is not an element
of the multiset. From a multiset perspective, we have 2 + 5 + 0 = 7 elements
— and not just 2. A multiset like this may describe an urn containing 2 red
balls, 5 green ones, and no blue balls. Such multisets are quite common. For
instance, the chemical formula C2 H3 O2 for vinegar may be read as a multiset
This expression ‖ϕ‖ gives the size of the multiset, that is, its total number of elements.
All of M, N, M∗ , N∗ , M[K], N[K] are functorial, in the same way. Hence
we concentrate on M. For a function f : X → Y we can define M( f ) : M(X) →
M(Y). When we see a multiset ϕ ∈ M(X) as an urn containing coloured balls,
with colours from X, then M( f )(ϕ) ∈ M(Y) is the urn with ‘repainted’ balls,
where the new colours are taken from the set Y. The function f : X → Y defines
the transformation of colours. It tells that a ball of colour x ∈ X in ϕ should be
repainted with colour f (x) ∈ Y.
The urn M(f)(ϕ) with repainted balls can be defined in two equivalent ways:
M(f)(∑i ri|xi⟩) ≔ ∑i ri|f(xi)⟩    or as    M(f)(ϕ)(y) ≔ ∑_{x∈f⁻¹(y)} ϕ(x).
It may take a bit of effort to see that these two descriptions are the same, see Exercise 1.4.1 below. Notice that in the sum ∑i ri|f(xi)⟩ it may happen that f(xi1) = f(xi2) for xi1 ≠ xi2, so that ri1 and ri2 are added together. Thus, the support of M(f)(∑i ri|xi⟩) may have fewer elements than the support of ∑i ri|xi⟩, but the sum of all multiplicities is the same in M(f)(∑i ri|xi⟩) and ∑i ri|xi⟩, see Exercise 1.5.2 below.
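In Python one can model natural multisets as Counter objects and implement M(f) directly; the following is our own sketch, not the book's EfProb code:

    from collections import Counter

    def M(f):
        # lift f : X -> Y to M(f) : M(X) -> M(Y), adding multiplicities of merged elements
        def mapped(phi):
            psi = Counter()
            for x, r in phi.items():
                psi[f(x)] += r
            return psi
        return mapped

    phi = Counter({'a': 3, 'b': 2, 'c': 8})          # 3|a> + 2|b> + 8|c>
    f = lambda x: 'u' if x in ('a', 'c') else 'v'
    assert M(f)(phi) == Counter({'u': 11, 'v': 2})   # support shrinks, total size is preserved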
(r · ϕ)(x) ≔ r · ϕ(x).
This scalar multiplication r · (−) : M(X) → M(X) preserves the sums (0, +) from the previous item, and is thus a map of monoids.
3 For each f : X → Y, the function M(f) : M(X) → M(Y) is a map of monoids and also of cones. The latter means: M(f)(r · ϕ) = r · M(f)(ϕ).
age:     0   1   2   3   4   5   6   7   8   9   10
count:   2   0   4   3   5   3   2   5   5   2   4
We can represent this table as a natural multiset over the set of ages {0, 1, . . . , 10}:
2|0⟩ + 4|2⟩ + 3|3⟩ + 5|4⟩ + 3|5⟩ + 2|6⟩ + 5|7⟩ + 5|8⟩ + 2|9⟩ + 4|10⟩.
Notice that there is no summand for age 1 because of our convention to omit expressions like 0|1⟩ with multiplicity 0. We can visually represent the above age data/multiset in the form of a histogram.
[Histogram of the age data]    (1.16)
Here is another example, not with numerical data, in the form of ages, but with nominal data, in the form of blood types. Testing the blood type of 50 individuals gives the following table.
type:    A    B    O    AB
count:  10   15   18    7
This corresponds to a (natural) multiset over the set {A, B, O, AB} of blood types, namely to:
10|A⟩ + 15|B⟩ + 18|O⟩ + 7|AB⟩.
It gives rise to the following bar graph, in which there is no particular ordering of elements. For convenience, we follow the order of the above table.
[Bar graph of the blood type data]    (1.17)
Next, consider the two-dimensional table (1.18) below where we have combined numeric information about blood pressure (either high H, or low L) and certain medicines (either type 1, type 2, or no medicine, indicated as 0). There is data about 100 study participants:
  medicine:    0     1     2     totals
  high        10    35    25     70
  low          5    10    15     30        (1.18)
  totals      15    45    40    100
This table can be described as a natural multiset τ over the product B × M, for B = {H, L} and M = {0, 1, 2}, namely:
τ = 10|H, 0⟩ + 35|H, 1⟩ + 25|H, 2⟩ + 5|L, 0⟩ + 10|L, 1⟩ + 15|L, 2⟩.
We see that Table (1.18) contains ‘totals’ in its vertical and horizontal margins. They can be obtained from τ as marginals, using the functoriality of N. This works as follows. Applying the natural multiset functor N to the two projections π1 : B × M → B and π2 : B × M → M yields marginal distributions on B and M, namely:
N(π1)(τ) = 10|π1(H, 0)⟩ + 35|π1(H, 1)⟩ + 25|π1(H, 2)⟩ + 5|π1(L, 0)⟩ + 10|π1(L, 1)⟩ + 15|π1(L, 2)⟩
  = 10|H⟩ + 35|H⟩ + 25|H⟩ + 5|L⟩ + 10|L⟩ + 15|L⟩
  = 70|H⟩ + 30|L⟩.
N(π2)(τ) = (10 + 5)|0⟩ + (35 + 10)|1⟩ + (25 + 15)|2⟩
  = 15|0⟩ + 45|1⟩ + 40|2⟩.
The expression ‘marginal’ is used to describe such totals in the margin of a
multidimensional table. In Section 2.1 we describe how to obtain probabilities
from tables in a systematic manner.
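A brief Python sketch (our own illustration) that computes both marginals of τ from Table (1.18):

    from collections import Counter

    tau = Counter({('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
                   ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15})

    def marginal(joint, i):
        # apply N(pi_i): project each pair and add up multiplicities
        m = Counter()
        for pair, n in joint.items():
            m[pair[i]] += n
        return m

    assert marginal(tau, 0) == Counter({'H': 70, 'L': 30})
    assert marginal(tau, 1) == Counter({0: 15, 1: 45, 2: 40})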
Lemma 1.4.3. The unit and flatten maps for multisets, unit : X → M(X) with unit(x) ≔ 1|x⟩, and flat : M(M(X)) → M(X), satisfy the familiar laws.
1 Both maps are natural: for each function f : X → Y,
unit ◦ f = M(f) ◦ unit    and    flat ◦ M(M(f)) = M(f) ◦ flat.
2 The next two diagrams also commute:
flat ◦ unit = id = flat ◦ M(unit)    and    flat ◦ flat = flat ◦ M(flat).
The next result shows that natural multisets are free commutative monoids.
Arbitrary multisets are also free, but for other algebraic structures, see Exer-
cise 1.4.9.
Proposition 1.4.4. Let X be a set and (M, 0, +) a commutative monoid. Each function f : X → M has a unique extension to a homomorphism of monoids f̄ : (N(X), 0, +) → (M, 0, +) with:
f̄ ◦ unit = f.    (1.19)
The unit and flatten operations for (natural) multisets can be used to capture commutative monoids more precisely, in analogy with Propositions 1.2.6 and 1.3.5.
Proposition 1.4.5. Let X be an arbitrary set.
1 Commutative monoid structures on X correspond to functions α : N(X) → X making the following diagrams commute:
α ◦ unit = id    and    α ◦ flat = α ◦ N(α).    (1.20)
2 Let (M1, u1, +1) and (M2, u2, +2) be two commutative monoids, with corresponding algebras α1 : N(M1) → M1 and α2 : N(M2) → M2. A function f : M1 → M2 is a map of monoids if and only if the rectangle
α2 ◦ N(f) = f ◦ α1    (1.21)
commutes.
Proof. Analogously to the proof of Proposition 1.2.6: if (X, u, +) is a commutative monoid, we define α : N(X) → X by turning formal sums into actual
1.4.3 Extraction
At the end of the previous section we have seen how to extract a function
(channel) from a binary subset, that is, from a relation. It turns out that one
can do the same for a binary multiset, that is, for a table. More specifically, in
terms of exponents, there are isomorphisms analogous to (1.12), obtained by extracting channels from a joint multiset σ ∈ M(X × Y):
extr1(σ)(x) ≔ ∑_y σ(x, y)|y⟩    and    extr2(σ)(y) ≔ ∑_x σ(x, y)|x⟩.
Notice that we are — conveniently — mixing ket and function notation for multisets. Conversely, σ can be reconstructed from extr1(σ), and also from extr2(σ), via σ(x, y) = extr1(σ)(x)(y) = extr2(σ)(y)(x).
Functions of the form X → M(Y) will also be used as channels from X to
Y, see Section 1.8. That’s why we often speak about ‘channel extraction’.
As illustration, we apply extraction to the medicine and blood pressure Table (1.18), described as the multiset τ ∈ M(B × M). It gives rise to two channels extr1(τ) : B → M(M) and extr2(τ) : M → M(B). Explicitly:
extr1(τ)(H) = ∑_{x∈M} τ(H, x)|x⟩ = 10|0⟩ + 35|1⟩ + 25|2⟩
extr1(τ)(L) = ∑_{x∈M} τ(L, x)|x⟩ = 5|0⟩ + 10|1⟩ + 15|2⟩.
We see that this extracted function captures the two rows of Table (1.18). Similarly we get the columns via the second extracted function:
extr2(τ)(0) = 10|H⟩ + 5|L⟩    extr2(τ)(1) = 35|H⟩ + 10|L⟩    extr2(τ)(2) = 25|H⟩ + 15|L⟩.
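The same extraction can be sketched in Python (our own illustration), recovering the rows and columns of Table (1.18) from τ:

    from collections import Counter

    tau = Counter({('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
                   ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15})

    def extr1(sigma):
        return lambda x: Counter({y: n for (x2, y), n in sigma.items() if x2 == x})

    def extr2(sigma):
        return lambda y: Counter({x2: n for (x2, y2), n in sigma.items() if y2 == y})

    assert extr1(tau)('H') == Counter({0: 10, 1: 35, 2: 25})   # row 'high'
    assert extr2(tau)(1) == Counter({'H': 35, 'L': 10})        # column for medicine 1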
Exercises
1.4.1 In the setting of Exercise 1.2.1, consider the multisets ϕ = 3|a⟩ + 2|b⟩ + 8|c⟩ and ψ = 3|b⟩ + 1|c⟩. Compute:
• ϕ + ψ
• ψ + ϕ
• M(f)(ϕ), both in ket-formulation and in function-formulation
• idem for M(f)(ψ)
• M(f)(ϕ + ψ)
• M(f)(ϕ) + M(f)(ψ).
1.4.2 Consider, still in the context of Exercise 1.2.1, the ‘joint’ multiset ϕ ∈ M(X × Y) given by ϕ = 2|a, u⟩ + 3|a, v⟩ + 5|c, v⟩. Determine the marginals M(π1)(ϕ) ∈ M(X) and M(π2)(ϕ) ∈ M(Y).
1.4.3 Consider the chemical equation for burning methane:
CH4 + 2 O2 −→ CO2 + 2 H2 O.
1.5 Multisets in summations
There are several ways to associate a natural number with a multiset ϕ. For instance, we can look at the size of its support |supp(ϕ)| ∈ N, or at its size, as total number of elements ‖ϕ‖ = ∑_x ϕ(x) ∈ R≥0. This size is a natural number when ϕ is a natural multiset. Below we will introduce several more such numbers for natural multisets ϕ, namely ϕ! and (ϕ), and later on also a binomial coefficient (ψ choose ϕ).
ϕ ≤ ψ ⟺ ∀x ∈ X. ϕ(x) ≤ ψ(x)
ϕ ≤K ψ ⟺ ϕ ∈ N[K](X) and ϕ ≤ ψ
r⃗^ϕ ≔ ∏_{x∈X} rx^{ϕ(x)}.
For instance,
(3|R⟩ + 2|B⟩)! = 3! · 2! = 12    and    (3|R⟩ + 2|B⟩) = 5!/12 = 10.
The multinomial coefficient (ϕ) in item (5) counts the number of ways of putting ‖ϕ‖ items in supp(ϕ) = {x1, . . . , xn} urns, with the restriction that ϕ(xi) items go into urn xi. Alternatively, (ϕ) is the number of partitions (Ui) of a set X with ‖ϕ‖ elements, where |Ui| = ϕ(xi).
The traditional notation (N choose m1, . . . , mk) for multinomial coefficients in (1.13) is
suboptimal for two reasons: first, the number N is superfluous, since it is determined by the mi as N = ∑i mi; second, the order of the mi is irrelevant. These disadvantages are resolved by the multiset variant (ϕ). It has our preference.
We recall the recurrence relation:
(K−1 choose k1−1, . . . , kn) + · · · + (K−1 choose k1, . . . , kn−1) = (K choose k1, . . . , kn)    (1.24)
for multinomial coefficients. A snappy re-formulation, for a natural multiset ϕ, is:
∑_{x∈supp(ϕ)} (ϕ − 1|x⟩) = (ϕ).    (1.25)
These coefficients satisfy the following summation result (Theorem 1.5.2): for non-negative reals r1, . . . , rm with ∑i ri < 1,
∑_{n1,...,nm ≥ 0} (K + ∑i ni choose K, n1, . . . , nm) · ∏i ri^{ni} = 1/(1 − ∑i ri)^{K+1}.
Equivalently,
∑_{ϕ∈N({1,...,m})} (K + ‖ϕ‖ choose K) · (ϕ) · r⃗^ϕ = 1/(1 − ∑i ri)^{K+1}.
Proof. 1 The equation arises as the Taylor series f(x) = ∑n f^(n)(0)/n! · x^n of the function f(x) = 1/(1 − x)^{K+1}. One can show, by induction on n, that the n-th derivative of f is:
f^(n)(x) = (n + K)!/K! · 1/(1 − x)^{n+K+1}.
2 The second equation is a special case of the first one, for K = 0. There is also a simple direct proof. Define sn = r^0 + r^1 + · · · + r^n. Then sn − r · sn = 1 − r^{n+1}, so that sn = (1 − r^{n+1})/(1 − r). Hence sn → 1/(1 − r) as n → ∞.
3 We choose to use the first item, but there are other ways to prove this result, see Exercise 1.5.10:
r/(1 − r)² = r · ∑_{n≥0} (n + 1 choose 1) · r^n    by item (1), with K = 1
  = ∑_{n≥0} (n + 1) · r^{n+1} = ∑_{n≥1} n · r^n.
4 The trick is to turn the multiple sums into a single ‘leading’ one, in:
∑_{n1,...,nm ≥ 0} (K + ∑i ni choose K, n1, . . . , nm) · ∏i ri^{ni}
  = ∑_{n≥0} ∑_{n1+···+nm = n} (K + n choose K, n1, . . . , nm) · ∏i ri^{ni}
  = ∑_{n≥0} ∑_{n1+···+nm = n} (K + n choose K) · (n choose n1, . . . , nm) · ∏i ri^{ni}
  = ∑_{n≥0} (K + n choose K) · ∑_{n1+···+nm = n} (n choose n1, . . . , nm) · ∏i ri^{ni}    (1.26)
  = ∑_{n≥0} (K + n choose K) · (∑i ri)^n    by the multinomial theorem
  = 1/(1 − ∑i ri)^{K+1},    by item (1).
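A quick numeric sanity check of item (1), as a Python sketch of our own (the parameters K, r, N are chosen arbitrarily):

    from math import comb

    # check: sum_n C(n+K, K) * r^n = 1/(1-r)^(K+1), approximated by N terms
    K, r, N = 3, 0.4, 200
    partial = sum(comb(n + K, K) * r**n for n in range(N))
    assert abs(partial - 1 / (1 - r)**(K + 1)) < 1e-9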
Exercises
1.5.1 Consider the function f : {a, b, c} → {0, 1} given by f (a) = f (b) = 1
and f (c) = 0.
ϕ ≤ ψ ⟺ ∃ϕ′ ∈ N(X). ϕ + ϕ′ = ψ.
Check that:
∑_{x∈supp(ϕ)} ϕ!/(ϕ − 1|x⟩)! = ∑_{x∈supp(ϕ)} (∏_y ϕ(y)!)/((ϕ(x) − 1)! · ∏_{y≠x} ϕ(y)!)
  = ∑_{x∈supp(ϕ)} ϕ(x)!/(ϕ(x) − 1)!
  = ∑_{x∈supp(ϕ)} ϕ(x)
  = ‖ϕ‖.
This generalises the well-known sum formula for binomial coefficients: ∑_{0≤k≤K} (K choose k) = 2^K, for n = 2.
1.5.10 Elaborate the details of the following two (alternative) proofs of the equation in Theorem 1.5.2 (3).
1 Use the derivative d/dr on both sides of Theorem 1.5.2 (2).
2 Write s ≔ ∑_{n≥1} n · r^n and exploit the following recursive equation.
1.6 Binomial coefficients of multisets
For instance:
(3|R⟩ + 2|B⟩ choose 2|R⟩ + 1|B⟩) = (3! · 2!)/(2! · 1! · 1! · 1!) = 6 = 3 · 2 = (3 choose 2) · (2 choose 1).
This describes the number of possible ways of drawing 2|R⟩ + 1|B⟩ from an urn 3|R⟩ + 2|B⟩.
The following result is a generalisation of Vandermonde’s formula.
These fractions adding up to one will form the probabilities of the so-called
hypergeometric distributions, see Example 2.1.1 (3) and Section 3.4 later on.
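A Python sketch of our own of the binomial coefficient of multisets, (ψ choose ϕ) = ∏_x (ψ(x) choose ϕ(x)), together with the check that the resulting hypergeometric fractions sum to one for a small urn:

    from math import comb, prod
    from itertools import product as cartesian

    def multiset_binom(psi, phi):
        # (psi choose phi) = prod over colours x of C(psi(x), phi(x))
        return prod(comb(psi[x], phi.get(x, 0)) for x in psi)

    psi = {'R': 3, 'B': 2}                    # urn 3|R> + 2|B>
    K, L = 3, sum(psi.values())
    draws = [dict(zip(psi, ns))
             for ns in cartesian(*(range(n + 1) for n in psi.values()))
             if sum(ns) == K]                 # all phi <=_K psi
    assert multiset_binom(psi, {'R': 2, 'B': 1}) == comb(3, 2) * comb(2, 1) == 6
    assert sum(multiset_binom(psi, phi) for phi in draws) == comb(L, K)  # = (5 choose 3)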
Then:
(B + G choose K) = ∑_{b≤B, g≤G, b+g=K} (B choose b) · (G choose g).    (1.29)
Intuitively: if you select K children out of B boys and G girls, the number of options is given by the sum over the options for b ≤ B boys times the options for g ≤ G girls, with b + g = K.
The equation (1.29) can be proven by induction on G. When G = 0 both sides amount to (B choose K), so we quickly proceed to the induction step. The case K = 0 is trivial, so we may assume K > 0.
∑_{b≤B, g≤G+1, b+g=K} (B choose b) · (G+1 choose g)
  = (B choose K)·(G+1 choose 0) + (B choose K−1)·(G+1 choose 1) + · · · + (B choose K−G)·(G+1 choose G) + (B choose K−G−1)·(G+1 choose G+1)
  = (B choose K)·(G choose 0) + [(B choose K−1)·(G choose 1) + (B choose K−1)·(G choose 0)] + · · · + [(B choose K−G)·(G choose G) + (B choose K−G)·(G choose G−1)] + (B choose K−G−1)·(G choose G)
  = ∑_{b≤B, g≤G, b+g=K} (B choose b)·(G choose g) + ∑_{b≤B, g≤G, b+g=K−1} (B choose b)·(G choose g)
  (IH) = (B+G choose K) + (B+G choose K−1)
  (1.14) = (B+G+1 choose K).
For the induction step, let supp(ψ) = {x1, . . . , xn, y}, for n ≥ 2. Writing ℓ = ψ(y), L′ = L − ℓ and ψ′ = ψ − ℓ|y⟩ ∈ N[L′](X) gives:
∑_{ϕ≤K ψ} (ψ choose ϕ) = ∑_{ϕ≤K ψ} ∏_x (ψ(x) choose ϕ(x))
  = ∑_{n≤ℓ} (ℓ choose n) · ∑_{ϕ≤K−n ψ′} ∏_i (ψ′(xi) choose ϕ(xi))
  (IH) = ∑_{n≤ℓ, K−n≤L−ℓ} (ℓ choose n) · (L−ℓ choose K−n)
  (1.29) = (L choose K).
The support map on lists factors through the natural multisets, as in the commuting triangle:
supp = supp ◦ acc : L(X) → N(X) → Pfin(X).    (1.30)
The ‘accumulator’ map acc : L(X) → N(X) counts (accumulates) how many times an element occurs in a list, while ignoring the order of occurrences. Thus, for a list ℓ ∈ L(X),
acc(ℓ)(x) ≔ the number of occurrences of x in ℓ.    (1.31)
Alternatively,
acc([x1, . . . , xn]) = 1|x1⟩ + · · · + 1|xn⟩.
For instance,
acc([a, b, a, b, c, b, b]) = 2|a⟩ + 4|b⟩ + 1|c⟩.
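In Python, accumulation is exactly what collections.Counter does; a short sketch of our own:

    from collections import Counter

    def acc(lst):
        # acc : L(X) -> N(X) counts occurrences, forgetting the order
        return Counter(lst)

    assert acc(['a', 'b', 'a', 'b', 'c', 'b', 'b']) == Counter({'b': 4, 'a': 2, 'c': 1})
    assert acc(['a', 'b', 'a']) == acc(['b', 'a', 'a'])   # order is irrelevant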
The above diagram (1.30) expresses an earlier informal statement, namely that
multisets are somehow in between lists and subsets.
The starting point in this section is the question: how many (ordered) sequences of coloured balls give rise to a specific urn? More technically, given a natural multiset ϕ, how many lists ℓ satisfy acc(ℓ) = ϕ? In yet another form, what is the size |acc⁻¹(ϕ)| of the inverse image?
As described in the beginning of this section, one aim is to relate the multiset coefficient (ϕ) of a multiset ϕ to the number of lists that accumulate to ϕ, as defined in (1.31). Here we use a K-ary version of accumulation, for K ∈ N,
restricted to K-many elements. It then becomes a mapping:
acc[K] : X^K → N[K](X).    (1.32)
The parameter K will often be omitted when it is clear from the context. We are
now ready for a basic combinatorial / counting result. It involves the multiset
coefficient from Definition 1.5.1 (5).
additions is (K choose m). Thus:
|acc⁻¹(ϕ)| = (K choose m) · (ϕ′)
  = K!/(m! · (K − m)!) · (K − m)!/∏_{i≤n} ϕ′(xi)!
  = K!/∏_{i≤n+1} ϕ(xi)!    since m = ϕ(xn+1) and ϕ′(xi) = ϕ(xi)
  = (ϕ).
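This count is easy to confirm by brute-force enumeration; a Python sketch of our own:

    from collections import Counter
    from itertools import permutations
    from math import factorial, prod

    def coefficient(phi):
        # (phi) = K!/prod_x phi(x)! with K the size of phi, as in Definition 1.5.1 (5)
        return factorial(sum(phi.values())) // prod(factorial(n) for n in phi.values())

    phi = Counter({'a': 2, 'b': 1, 'c': 1})
    lists = set(permutations(['a', 'a', 'b', 'c']))  # acc^{-1}(phi): all orderings
    assert len(lists) == coefficient(phi) == 12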
flat : M[L](M[K](X)) → M[L·K](X)    (1.33)
We then find that the sum of these outcomes equals the coefficient of ψ:
(ψ) = 60 = (Ψ1) · ∏_ϕ (ϕ)^{Ψ1(ϕ)} + (Ψ2) · ∏_ϕ (ϕ)^{Ψ2(ϕ)} + (Ψ3) · ∏_ϕ (ϕ)^{Ψ3(ϕ)}.
The general result is formulated in Theorem 1.6.5 below. For its proof we need an intermediate step.
Lemma 1.6.4. Let ψ ∈ N[L](X) be a natural multiset and let ϕ ≤K ψ, for K ≤ L.
1 ((ϕ) · (ψ − ϕ))/(ψ) = (ψ choose ϕ)/(L choose K).
2 Now:
(ψ) = ∑_{ϕ≤K ψ} (ϕ) · (ψ − ϕ).
Proof. 1 Because:
((ϕ) · (ψ − ϕ))/(ψ) = K!/ϕ! · (L − K)!/(ψ − ϕ)! · ψ!/L!
  = (K! · (L − K)!)/L! · ψ!/(ϕ! · (ψ − ϕ)!)
  = (ψ choose ϕ)/(L choose K).
Theorem 1.6.5. For ψ ∈ N[L·K](X):
(ψ) = ∑_{ϕ1,...,ϕL ≤K ψ, ϕ1+···+ϕL = ψ} ∏i (ϕi).
This equation turns out to be essential for proving that multinomial distributions are suitably closed under composition, see Theorem 3.3.6 in Chapter 3.
Proof. Since ψ ∈ N[L·K](X) we can apply Lemma 1.6.4 (2) L-many times, giving:
(ψ) = ∑_{ϕ1≤K ψ} ∑_{ϕ2≤K ψ−ϕ1} · · · ∑_{ϕL≤K ψ−ϕ1−···−ϕL−1} (ϕ1) · (ϕ2) · . . . · (ϕL)
  = ∑_{ϕ1,...,ϕL ≤K ψ, ϕ1+···+ϕL = ψ} ∏i (ϕi).
The flatten map preserves sums of multisets, see Exercise 1.4.7, and thus maps Ψ to ψ, via the flatten-unit law of Lemma 1.4.3:
flat(Ψ) = flat(1|ϕ1⟩ + · · · + 1|ϕL⟩)
  = flat(1|ϕ1⟩) + · · · + flat(1|ϕL⟩)
  = ϕ1 + · · · + ϕL = ψ.
Exercises
1.6.1 Generalise the familiar equation ∑_{0≤k≤K} (K choose k) = 2^K to:
∑_{ϕ≤ψ} (ψ choose ϕ) = 2^{‖ψ‖}.
1.6.2 Show that ‖acc(ℓ)‖ = ‖ℓ‖, using the length ‖ℓ‖ of a list ℓ from Exercise 1.2.3.
1.6.3 Let ϕ ∈ N[K](X) and ψ ∈ N[L](X) be given.
1 Prove that:
(ϕ + ψ) = ((K+L choose K)/(ϕ+ψ choose ϕ)) · (ϕ) · (ψ).
1.6.5 Check that the accumulator map acc : L(X) → M(X) is a homomorphism of monoids. Do the same for the support map supp : M(X) → Pfin(X).
1.6.6 Prove that the accumulator and support maps acc : L(X) → M(X) and supp : M(X) → Pfin(X) are natural: for an arbitrary function f : X → Y both rectangles below commute:
M(f) ◦ acc = acc ◦ L(f)    and    Pfin(f) ◦ supp = supp ◦ M(f).
is the L-fold sum of multisets — see (1.33) for the fixed size flatten
operation.
1.6.8 In analogy with the powerset operator, with type P : P(X) → P(P(X)), a powerbag operator PB : N(X) → N(N(X)) is introduced in [114] (see also [35]). It can be defined as:
PB(ψ) ≔ ∑_{ϕ≤ψ} (ψ choose ϕ) |ϕ⟩.
For instance:
PB(1|a⟩ + 3|b⟩) = 1|0⟩ + 1|1|a⟩⟩ + 3|1|b⟩⟩ + 3|1|a⟩ + 1|b⟩⟩ + 3|2|b⟩⟩ + 3|1|a⟩ + 2|b⟩⟩ + 1|3|b⟩⟩ + 1|1|a⟩ + 3|b⟩⟩.
(ψ choose ϕ1, . . . , ϕN) ≔ ψ!/(ϕ1! · . . . · ϕN!).
1 Check that for N ≥ 3, in analogy with Exercise 1.3.6,
(ψ choose ϕ1, . . . , ϕN) = (ψ choose ϕ1) · (ψ − ϕ1 choose ϕ2, . . . , ϕN).
5 Consider the set N[3]({a, b, c}) of multisets of size 3 over {a, b, c}. It has ((3 choose 3)) = (5 choose 3) = (4·5)/2 = 10 elements, namely:
3|a⟩, 3|b⟩, 3|c⟩, 2|a⟩+1|b⟩, 2|a⟩+1|c⟩, 2|b⟩+1|a⟩, 2|b⟩+1|c⟩, 2|c⟩+1|a⟩, 2|c⟩+1|b⟩, 1|a⟩+1|b⟩+1|c⟩.
1.7 Multichoose coefficients
Lemma 1.7.2. For numbers B, G ≥ 1 and K ≥ 0,
((B + G choose K)) = ∑_{0≤k≤K} ((B choose k)) · ((G choose K−k)).    (1.36)
In particular:
(B + K choose K) = ((B + 1 choose K)) = ∑_{0≤k≤K} ((B choose k)) = ∑_{0≤k≤K} ((B choose K−k)).    (1.37)
Proof. We shall frequently use the Pascal-like identity for multichoose coefficients:
((n choose K+1)) + ((n+1 choose K)) = ((n+1 choose K+1)).    (∗)
We shall prove the first equation (1.36) in the lemma by induction on B ≥ 1. In both the base case B = 1 and the induction step we shall use induction on K. We shall try to keep the structure clear by using nested bullets.
• Now assume Equation (1.36) holds for B (for all G, K). In order to show that
it then also holds for B + 1 we use induction on K.
– Now assume that Equation (1.36) holds for K, and for B. Then:
∑_{0≤k≤K+1} ((B+1 choose k)) · ((G choose (K+1)−k))
  = ((G choose K+1)) + ∑_{0≤k≤K} ((B+1 choose k+1)) · ((G choose K−k))
  (∗) = ((G choose K+1)) + ∑_{0≤k≤K} [((B choose k+1)) + ((B+1 choose k))] · ((G choose K−k))
  = ((G choose K+1)) + ∑_{0≤k≤K} ((B choose k+1)) · ((G choose K−k)) + ∑_{0≤k≤K} ((B+1 choose k)) · ((G choose K−k))
  (IH, K) = ∑_{0≤k≤K+1} ((B choose k)) · ((G choose (K+1)−k)) + (((B+1)+G choose K))
  (IH, B) = ((B+G choose K+1)) + (((B+1)+G choose K))
  (∗) = (((B+1)+G choose K+1)).
∑_{ϕ∈N[K](X)} ((ψ choose ϕ)) = ((L choose K)),    so that    ∑_{ϕ∈N[K](X)} ((ψ choose ϕ))/((L choose K)) = 1.
The fractions in this equation will show up later in so-called Pólya distributions, see Example 2.1.1 (4) and Section 3.4. These fractions capture the probability of drawing a multiset ϕ from an urn ψ when for each drawn ball an extra ball is added to the urn (of the same colour).
Proposition 1.7.4. Let X be a finite set with n ≥ 1 elements. Then:
|N[K](X)| = ((n choose K)) = (n + K − 1 choose K).
Proof. The statement holds for K = 0 since there is precisely 1 = (n−1 choose 0) multiset of size 0, namely the empty multiset 0. Hence we may assume K ≥ 1, so that Lemma 1.7.2 can be used.
We proceed by induction on n ≥ 1. For n = 1 the statement holds since there is only 1 = (K choose K) = ((1 choose K)) multiset of size K over 1 = {0}, namely K|0⟩.
The induction step works as follows. Let X have n elements, and y ∉ X. For a multiset ϕ ∈ N[K](X ∪ {y}) there are K + 1 possible multiplicities ϕ(y). If ϕ(y) = k, then the number of possibilities for the remaining multiplicities is the number of multisets in N[K−k](X). Thus:
|N[K](X ∪ {y})| = ∑_{0≤k≤K} |N[K−k](X)|
  (IH) = ∑_{0≤k≤K} ((n choose K−k))
  = ((n + 1 choose K)),    by Lemma 1.7.2.
There is also a visual proof of this result, described in terms of stars and bars,
see e.g. [47, II, proof of (5.2)], where multiplicities of multisets are described
in terms of ‘occupancy numbers’.
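The count in Proposition 1.7.4 can be checked in Python (our own sketch) via itertools.combinations_with_replacement, which enumerates precisely the multisets of a given size:

    from math import comb
    from itertools import combinations_with_replacement

    def multichoose(n, K):
        # ((n choose K)) = C(n+K-1, K)
        return comb(n + K - 1, K)

    X, K = ['a', 'b', 'c'], 3
    multisets = set(combinations_with_replacement(X, K))  # each tuple is a multiset of size K
    assert len(multisets) == multichoose(len(X), K) == 10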
The proof below lends itself to an easy implementation, see Exercise 1.7.12
below. It uses a (chosen) order on the elements of the support of the multiset.
The above formulation shows that the outcome is independent of such an order.
Proof. We use induction on the size |supp(ϕ)| of the support of ϕ. If this size is 1, then ϕ is of the form m|x1⟩, for some number m. Hence there are m multisets strictly below ϕ, namely 0, 1|x1⟩, . . . , (m − 1)|x1⟩. Since supp(ϕ) = {x1}, the only non-empty subset U ⊆ supp(ϕ) is the singleton U = {x1} and ∏_{x∈U} ϕ(x) = ϕ(x1) = m.
Next assume that supp(ϕ) = X ∪ {y} with y ∉ X is of size n + 1. We write m = ϕ(y) and ϕ′ = ϕ − m|y⟩ so that supp(ϕ′) = X. The number M = |↓ϕ′| is then given by the formula in (1.38), with ϕ′ instead of ϕ. We count the multisets strictly below ϕ as follows: they are of the form k|y⟩ + ψ with k ≤ m and ψ ≤ ϕ′, excluding ϕ itself, so that:
(m + 1) · (M + 1) − 1 = (m + 1) · M + m = |↓ϕ|.
Exercises
1.7.1 Check that N[K](1) has one element by Proposition 1.7.4. Describe this element. How many elements are there in N[K](2)? Describe them all.
1.7.2 Let ϕ, ψ be natural multisets on the same finite set X.
1 Show that:
ϕ ≺ ψ ⟺ ϕ + 1 ≤ ψ ⟺ ϕ ≤ ψ − 1,
where 1 = ∑_{x∈X} 1|x⟩ is the multiset of singletons on X.
2 Prove next:
∑_{i<n} ((m choose i)) + ∑_{j<m} ((n choose j)) = (n + m choose n).
1.7.10 Check that Theorem 1.5.2 (1) can be reformulated as: for a real number x ∈ (0, 1) and K ≥ 1,
∑_{n≥0} ((K choose n)) · x^n = 1/(1 − x)^K.
result ≔ 0
for x in [x1, . . . , xn] do:
    result ≔ (ϕ(x) + 1) · result + ϕ(x)
return result
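A runnable Python version of this algorithm (our own sketch), with a brute-force check against the product formula:

    from itertools import product as cartesian

    def below(phi):
        # Horner-style count of the multisets strictly below phi
        result = 0
        for n in phi.values():      # any fixed order of supp(phi) works
            result = (n + 1) * result + n
        return result

    phi = {'a': 3, 'b': 1, 'c': 2}
    brute = sum(1 for _ in cartesian(*(range(n + 1) for n in phi.values()))) - 1
    assert below(phi) == brute == (3 + 1) * (1 + 1) * (2 + 1) - 1  # = 23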
1.8 Channels
The previous sections covered the collection types of lists, subsets, and multi-
sets, with much emphasis on the similarities between them. In this section we
will exploit these similarities in order to introduce the concept of channel, in a
uniform approach, for all of these collection types at the same time. This will
show that these data types are not only used for certain types of collections,
but also for certain types of computation. Much of the rest of this book builds
on the concept of a channel, especially for probabilistic distributions, which
are introduced in the next chapter. The same general approach to channels that
will be described in this section will work for distributions.
Let T be one of the collection functors L, P, or M. A state of type Y is an
element ω ∈ T (Y); it collects a number of elements of Y in a certain way. In this
section we abstract away from the particular type of collection. A channel, or
sometimes more explicitly T -channel, is a collection of states, parameterised
by a set. Thus, a channel is a function of the form c : X → T (Y). Such a
channel turns an element x ∈ X into a certain collection c(x) of elements of
Y. Just like an ordinary function f : X → Y can be seen as a computation, we
see a T -channel as a computation of type T . For instance, a channel X → P(Y)
is a non-deterministic computation and a channel X → M(Y) is a resource-
sensitive computation.
56
1.8. Channels 57
When it is clear from the context what T is, we often write a channel using
functional notation, as c : X → Y, with a circle on the shaft of the arrrow.
Notice that a channel 1 → X from the singleton set 1 = {0} can be identified
with a state on X.
Definition 1.8.1. Let T ∈ {L, P, M}, each of which is functorial, with its own
flatten operation, as described in previous sections.
1 For a state ω ∈ T (X) and a channel c : X → T (Y) we can form a new state
c = ω of type Y. It is defined as:
= [v, u, v, u, u, u, u, u, v].
2 We consider the analogous example for T = P. We thus take as state ω =
{a, b, c} and as channel f : X → P(Y) with:
57
58 Chapter 1. Collections
Then:
[
f = ω = flat P( f )(ω) =
f (a), f (b), f (c)
[
= {u, v}, {u}, {u, v} = {u, v}.
f = ω = flat M( f )(ω)
We now prove some general properties about state transformation and about
composition of channels, based on the abstract description in Definition 1.8.1.
e ◦· (d ◦· c) = (e ◦· d) ◦· c,
58
1.8. Channels 59
d ◦· c = ω = d = c = ω .
f = ω = T ( f )(ω),
g ◦· f = g ◦ f f ◦· c = T ( f ) ◦ c d ◦· f = d ◦ f,
Proof. We can give generic proofs, without knowing the type T ∈ {L, P, M}
of the channel, by using earlier results like Lemma 1.2.5, 1.3.2, and 1.4.3. No-
tice that we carefully distinguish channel composition ◦· and ordinary function
composition ◦.
1 Both equations follow from the flat − unit law. By Definition 1.8.1 (2):
2 The proof of associativity uses naturality and also the commutation of flatten
with itself (the ‘flat − flat law’), expressed as flat ◦ flat = flat ◦ T (flat).
e ◦· (d ◦· c) = flat ◦ T (e) ◦ (d ◦· c)
= flat ◦ T (e) ◦ flat ◦ T (d) ◦ c
= flat ◦ flat ◦ T (T (e)) ◦ T (d) ◦ c by naturality of flat
= flat ◦ T (flat) ◦ T (T (e)) ◦ T (d) ◦ c by the flat − flat law
=
flat ◦ T flat ◦ T (e) ◦ d ◦ c by functoriality of T
=
flat ◦ T e ◦· d ◦ c
= (e ◦· d) ◦· c
59
60 Chapter 1. Collections
= flat ◦ T (d) c = ω
= d = c = ω .
4 All these properties follow from elementary facts that we have seen before:
f = ω = flat ◦ T (unit ◦ f ) (ω)
Exercises
1.8.1 For a function f : X → Y define an inverse image P-channel f −1 : Y →
X by:
f −1 (y) B {x ∈ X | f (x) = y}.
Prove that:
(g ◦ f )−1 = f −1 ◦· g−1 and id −1 = unit.
60
1.9. The role of category theory 61
1.8.2 Notice that a state of type X can be identified with a channel 1 → T (Y)
with singleton set 1 as domain. Check that under this identification,
state transformation c = ω corresponds to channel composition c ◦· ω.
1.8.3 Let f : X → Y be a channel.
1 Prove that if f is a Pfin -channel, then the state transformation func-
tion f = (−) : Pfin (X) → Pfin (Y) can also be defined via freeness,
namely as the unique function f in Proposition 1.3.4.
2 Similarly, show that f = (−) = f when f is an M-channel, as in
Exercise 1.4.9.
1.8.4 1 Describe how (non-deterministic) powerset channels can be reversed,
via a bijective correspondence between functions:
X −→ P(Y)
===========
Y −→ P(X)
(A description of this situation in terms of ‘daggers’ will appear in
Example 7.8.1.)
2 Show that for finite sets X, Y there is a similar correspondence for
multiset channels.
61
62 Chapter 1. Collections
found in e.g. the early notes [111]. This line of work was picked up, extended,
and published by his PhD student Michèle Giry. Her name continues in the
‘Giry monad’ G of continuous probability distributions, see Section ??. The
precise source of the distribution monad D for discrete probability theory, that
will be introduced in Section 2.1 in the next chapter, is less clear, but it can be
regarded as the discrete version of G. Probabilistic automata have been studied
in categorical terms as coalgebras, see Chapter ??, and e.g. [152] and [70] for
general background information on coalgebra. There is a recent surge in inter-
est in more foundational, semantically oriented studies in probability theory,
through the rise of probabilistic programming languages [59, 153], probabilis-
tic Bayesian reasoning [28, 89], and category theory ??. Probabilistic methods
have received wider attention, for instance, via the current interest in data an-
alytics (see the essay [4]), in quantum probability [128, 25], and in cognition
theory [64, 151].
Readers who know category theory will have recognised its implicit use in
earlier sections. For readers who are not familiar (yet) with category theory,
some basic concepts will be explained informally in this section. This is in no
way a serious introduction to the area. The remainder of this book will continue
to make implicit use of category theory, but will make this usage increasingly
explicit. Hence it is useful to know the basic concepts of category, functor,
natural transformation, and monad. Category theory is sometimes seen as a
difficult area to get into. But our experience is that it is easiest to learn category
theory by recognising its concepts in constructions that you already know. That
is why this chapter started with concrete descriptions of various collections and
their use in channels. For more solid expositions of category theory we refer to
the sources listed above.
1.9.1 Categories
A category is a mathematical structure given by a collection of ‘objects’ with
‘morphisms’ between them. The requirements are that these morphisms are
closed under (associative) composition and that there is an identity morphism
on each object. Morphisms are also called ‘maps’, and are written as f : X →
Y, where X, Y are objects and f is a homomorphism from X to Y. It is tempting
to think of morphisms in a category as actual functions, but there are plenty of
examples where this is not the case.
A category is like an abstract context of discourse, giving a setting in which
one is working, with properties of that setting depending on the category at
hand. We shall give a number of examples.
62
1.9. The role of category theory 63
1 There is the category Sets, whose objects are sets and whose morphisms are
ordinary functions between them. This is a standard example.
2 One can also restrict to finite sets as objects, in the category FinSets, with
functions between them. This category is more restrictive, since for instance
it contains objects n = {0, 1, . . . , n − 1} for each n ∈ N, but not N itself. Also,
Q
in Sets one can take arbitrary products i∈I Xi of objects Xi , over arbitrary
index sets I, whereas in FinSets only finite products exist. Hence FinSets is
a more restrictive world.
3 Monoids and monoid maps have been mentioned in Definition 1.2.1. They
can be organised in a category Mon, whose objects are monoids, and whose
homomorphisms are monoid maps. We now have to check that monoid maps
are closed under composition and that identity functions are monoid maps;
this is easy. Many mathematical structures can be organised into categories
in this way, where the morphisms preserve the relevant structure. For in-
stance, one can form a category PoSets, with partially ordered sets (posets)
as objects, and monotone functions between them as morphisms (also closed
under composition, with identity).
4 For T ∈ {L, P, M} we can form the category Chan(T ). Its objects are arbi-
trary sets X, but its morphisms X to Y are T -channels, X → T (Y), written as
X → Y. We have already seen that channels are closed under composition ◦·
and have unit as identity, see Lemma 1.8.3. We can now say that Chan(T )
is a category.
These categories of channels form good examples of the idea that a cate-
gory forms a universe of discourse. For instance, in Chan(P) we are in the
world of non-deterministic computation, whereas Chan(M) is the world of
computation in which resources are counted.
1.9.2 Functors
Category theorists like abstraction, hence the question: if categories are so im-
portant, then why not organise them as objects themselves in a superlarge cat-
63
64 Chapter 1. Collections
egory Cat, with morphisms between them preserving the relevant structure?
The latter morphisms between categories are called ‘functors’. More precisely,
given categories C and D, a functor F : C → D between them consists of two
mappings, both written F, sending an object X in C to an object F(X) in D,
and a morphism f : X → Y in C to a morphism F( f ) : F(X) → F(Y) in D.
This mapping F should preserve composition and identities, as in: F(g ◦ f ) =
F(g) ◦ F( f ) and F(id X ) = id F(X) .
Earlier we have already called some operations ‘functorial’ for the fact that
they preserve composition and identities. We can now be a bit more precise.
64
1.9. The role of category theory 65
in D commutes.
Such a natural transformation is often denoted by a double arrow α : F ⇒ G.
We briefly review some of the examples of natural transformations that we
have seen.
1.9.4 Monads
A monad on a category C is a functor T : C → C that comes with two natural
transformations unit : id ⇒ T and flat : (T ◦ T ) ⇒ T satisfying:
65
66 Chapter 1. Collections
66
1.9. The role of category theory 67
The writer monads from Lemma 1.9.1 give simple examples of maps of
monads: if f : M1 → M2 is a map of monoids, then the maps α B f ×id : M1 ×
X → M2 × X form a map of monoids.
For a historical account of monads and their applications we refer to [66].
Exercises
1.9.1 We have seen the functor J : Sets → Chan(T ). Check that there is
also a functor Chan(T ) → Sets in the opposite direction, which is
X 7→ T (X) on objects, and c 7→ c = (−) on morphisms. Check ex-
plicitly that composition is preserved, and find the earlier result that
stated that fact implicitly.
1.9.2 Recall from (1.15) the subset N[K](X) ⊆ M(X) of natural multisets
with K elements. Prove that N[K] is a functor Sets → Sets.
1.9.3 Show that Exercise 1.8.1 implicitly describes a functor Setsop →
Chan(P), which is the identity on objects.
1.9.4 Show that the zip function from Exercise 1.1.7 is natural: for each
pair of functions f : X → U and g : Y → V the following diagram
commutes.
XK × Y K
zip
/ (X × Y)K
f K ×gK K
( f ×g)
UK × VK
zip
/ (U × V)K
1.9.5 Fill in the remaining details in the proof of Lemma 1.9.1: that T is
a functor, that unit and flat are natural transformation, and that the
flatten equation holds.
1.9.6 For arbitrary sets X, A, write X + A for the disjoint union (coproduct)
of X and A, which may be described explicitly by tagging elements
with 1, 2 in order to distinguish them:
67
68 Chapter 1. Collections
1.9.7 Check that the support and accumulation functions form maps of
monads in the situations:
1 supp : M(X) ⇒ P(X);
2 acc : L(X) ⇒ M(X).
1.9.8 Let T = (T, unit, flat) be a monad. By definition, it involves T as
a functor T : Sets → Sets. Show that T can be ‘lifted’ to a functor
T : Chan(T ) → Chan(T ). It is defined on objects as T (X) B T (X)
and on a morphism f : X → Y as:
T(f)
/ T (T (Y)) flat / T (Y) unit / T (T (Y)) .
T ( f ) B T (X)
68
2
The previous chapter has introduced products, lists, subsets and multisets as
basic collection types and has made some of their basic properties explicit.
This serves as preparation for the current first chapter on probability distribu-
tions. We shall see that distributions also form a collection type, with much
analogous structure. In fact distributions are special multisets, where multi-
plicities add up to one.
This chapter introduces the basics of probability distributions and of prob-
abilistic channels (as indexed / conditional distributions). These notions will
play a central role in the rest of this book. Distributions will be defined as spe-
cial multisets, so that there is a simple inclusion of the set of distributions in
the set of multisets multisets, on a particular space. In the other direction, this
chapter describes the ‘frequentist learning’ construction, which turns a multi-
set into a distribution, essentially by normalisation. The chapter also describes
several constructions on distributions, like the convex sum, parallel product,
and addition of distributions (in the special case when the underlying space
happens to be a commutative monoid). Parallel products ⊗ of distributions and
joint distributions (on product spaces) are rather special, for instance, because
a joint distribution is typically not equal to the product of its marginals. In-
deed, joint distributions may involve correlations between the different (prod-
uct) components, so that updates in one component have ‘crossover’ effect in
other components. This magical phenonenom will be elaborated in later chap-
ters.
The chapter closes with an example of a Bayesian network. It shows that
the conditional probability tables that are associated with nodes in a Bayesian
network are instances of probabilistic channels. As a result, one can system-
atically organise computations in Bayesian networks as suitable (sequential
and/or parallel) compositions of channels. This is illustrated via calculations
of various predicted probabilities.
69
70 Chapter 2. Discrete probability distributions
70
2.1. Probability distributions 71
Figure 2.1 Plots of a slightly biased coin distribution 0.51| H i + 0.49| T i and a
fair (uniform) dice distribution on {1, 2, 3, 4, 5, 6} in the top row, together with the
distribution of letter frequencies in English at the bottom.
71
72 Chapter 2. Discrete probability distributions
remains unaltered (0), and in the Pólya case the drawn ball is returned to the
urn together with an extra ball of the same colour (+1).
Example 2.1.1. We shall explicitly describe several familiar distributions us-
ing the above formal convex sum notation.
1 The coin that we have seen above can be parametrised via a ‘bias’ probabil-
ity r ∈ [0, 1]. The resulting coin is often called flip and is defined as:
flip(r) B r| 1i + (1 − r)|0 i
where 1 may understood as ‘head’ and 0 as ‘tail’. We may thus see flip as a
function flip : [0, 1] → D({0, 1}) from probabilities to distributions over the
sample space 2 = {0, 1} of Booleans. This flip(r) is often called the Bernoulli
distribution, with parameter r ∈ [0, 1].
2 For each number K ∈ N and probability r ∈ [0, 1] there is the familiar
binomial distribution bn[K](r) ∈ D({0, 1, . . . , K}). It captures probabilities
for iterated coin flips, and is given by the convex sum:
X
bn[K](r) B K
k · r k
· (1 − r) k .
K−k
0≤k≤K
72
2.1. Probability distributions 73
where ( ϕ ) is the multinomial coefficient QxK! ϕ(x)! , see Definition 1.5.1 (5)
and where ωϕ B ω(x) ϕ(x)
Q
x . In the sequel we shall standardly use the
multinomial distribution and view the binomial distribution as a special case.
To see an example, for space X = {a, b, c} and urn ω = 13 | ai + 12 | b i + 16 | ci
the draws of size 3 form
a distribution over multisets, described below within
the outer ‘big’ kets − .
mn[3](ω) = 27 3| a i + 6 2| a i + 1|b i + 4 1|a i + 2| bi + 8 3| bi
1 1 1 1
+ 18 1
2| a i + 1| c i + 16 1| a i + 1| bi + 1| ci + 18 2| b i + 1| c i
+ 36 1
1| a i + 2| c i + 241
1| b i + 2|c i + 216
1
3|c i
Via the Multinomial Theorem (1.27) we see that the probabilities in the
above expression (2.2) for mn[K](ω) add up to one:
X X Y (1.26) P K
(ϕ ) · ωϕ = (ϕ) · ω(x)ϕ(x) = x ω(x) = 1K = 1.
x
ϕ∈N[K](X) ϕ∈N[K](X)
N[L](X)
hg[K]
/ D N[K](X). (2.3)
X ψϕ X x ψ(x)
Q
ϕ(x)
hg[K] ψ B L ϕ = L ϕ .
ϕ≤K ψ K ϕ≤K ψ K
73
74 Chapter 2. Discrete probability distributions
4 We have seen that in multinomial mode the drawn ball is returned to the
urn, whereas in hypergeometric mode the drawn ball is removed. There is
a logical third option where the drawn ball is returned, together with one
additional ball of the same colour. This leads to what is called the Pólya
distribution. We shall describe it as a function of the form:
N∗ (X)
pl[K]
/ D N[K](X). (2.4)
The middle and last rows in Figure 2.2 give bar plots for hypergeometric and
Pólya distributions over 2 = {0, 1}. They show the probabilities for numbers
0 ≤ k ≤ 10 in a drawn multiset k| 0 i + (10 − k)| 1i.
So far we have described distributions as formal convex sums. But they can
be described, equivalently, in functional form. This is done in the definition
below, which is much like Definition 1.4.1 for multisets. Also for distributions
we shall freely switch between the above ket-formulation and the function-
formulation.
74
2.1. Probability distributions 75
bn[10]( 13 ) bn[10]( 43 )
hg[10] 10| 0 i + 20| 1 i hg[10] 16| 0 i + 14| 1 i
pl[10] 1| 0 i + 2| 1i pl[10] 8| 0 i + 7| 1 i
Definition 2.1.2. The set D(X) of all distributions over a set X can be defined
as:
ω(x) = 1}.
P
D(X) B {ω : X → [0, 1] | supp(ω) is finite, and x
Such a function ω : X → [0, 1], with finite support and values adding up to
one, is often called a probability mass function, abbreviated as pmf.
This D is functorial: for a function f : X → Y we have D( f ) : D(X) → D(Y)
defined either as:
X
ω(x).
P P
D( f ) i ri | xi i B i ri | f (xi ) i or as: D( f )(ω)(y) B
x∈ f −1 (y)
One has to check that D( f )(ω) is a distribution again, that is, that its multi-
75
76 Chapter 2. Discrete probability distributions
ω= 1
12 | H, 0 i + 61 | H, 1i + 13 | H, 2 i + 16 | T, 0i + 1
12 | T, 1 i + 16 | T, 2 i
= 12 | H i + 6 | H i + 3 | H i + 6 | T i + 12 | T i + 6 | T i
1 1 1 1 1 1
= 12 | H i + 12 | T i.
7 5
is used to indicate the probability that the random variable R takes value
r ∈ R.
Since we now know that D is functorial, we may apply it to the function
R : X → R. It gives another function D(R) : D(X) → D(R), so that D(R)(ω)
is a distribution on R. We observe:
X
P[R = r] = D(R)(ω)(r) = ω(x).
x∈R−1 (r)
76
2.1. Probability distributions 77
D(M)(unif) = 36
1
| 1i + 36
3
|2 i + 5 | 3 i + 36
7
| 4 i + 9 | 5i + 11
36|6 i
36 36
= P[M = 1] 1 + P[M = 2] 2 + P[M = 3] 3
+ P[M = 4] 4 + P[M = 5] 5 + P[M = 6] 6 .
In the previous chapter we have seen that the sets L(X), P(X) and M(X) of
lists, powersets and multisets all carry a monoid structure. Hence one may ex-
pect a similar result saying that D(X) forms a monoid, via an elementwise sum,
like for multisets. But that does not work. Instead, one can take convex sums
of distributions. This works as follows. Suppose we have two distributions
ω, ρ ∈ D(X) and a number s ∈ [0, 1]. Then we can form a new distribution
σ ∈ D(X), as convex combination of ω and ρ, namely:
77
78 Chapter 2. Discrete probability distributions
This does not fit in D(N). Therefore we sometimes use D∞ instead of D, where
the finiteness of support requirement has been dropped:
D∞ (X) B {ω : X → [0, 1] | x ω(x) = 1}.
P
(2.8)
Notice by the way that the multiplicities add up to one in the Poisson distri-
bution because of the basic formula: eλ = k∈N λk! . The Poisson distribution is
P k
typically used for counts of rare events. The rate or intensity parameter λ is the
average number of events per time period. The Poisson distribution then gives
for each k ∈ N the probability of having k events per time period. This works
when events occur independently.
Exercises 2.1.9 and 2.1.10 contain other examples of discrete distributions
with infinite support, namely the geometric and negative-binomial distribution.
Another illustration is the zeta (or zipf) distribution, see e.g. [145].
Exercises
2.1.1 Check that a marginal of a uniform distribution is again a uniform
distribution; more precisely, D(π1 )(unif X×Y ) = unif X .
2.1.2 1 Prove that flip : [0, 1] → D(2) is an isomorphism.
2 Check that flip(r) is the same as bn[1](r).
3 Describe the distribution bn[3]( 41 ) concretely and interepret this
distribution in terms of coin flips.
2.1.3 We often write n B {0, . . . , n − 1} so that 0 = ∅, 1 = {0} and 2 = {0, 1}.
Verify that:
L(0) 1 P(0) 1 N(0) 1 M(0) 1 D(0) 0
L(1) N P(1) 2 N(1) N M(1) R≥0 D(1) 1
P(n) 2n N(n) Nn M(n) Rn≥0 D(2) [0, 1].
The set D(n + 1) is often called the n-simplex. Describe it as a subset
of Rn+1 , and also as a subset of Rn .
2.1.4 Let X = {a, b, c} with draw ϕ = 2| ai + 3| b i + 2| c i ∈ N[7](X).
78
2.1. Probability distributions 79
mn[7](ω)(ϕ) = 432 .
35
2.1.5 Let a number r ∈ [0, 1] and a finite set X be given. Show that:
X
r| U | · (1 − r)| X\U | = 1.
U∈P(X)
2.1.7 Let X be a non-empty finite set with n elements. Use Exercise 1.5.7
to check that, for the following multiset distribution yields a well-
defined distribution on N[K](X).
X (ϕ )
mltdst[K]X B ϕ .
n K
ϕ∈N[K](X)
It captures the probability of being successful for the first time after
79
80 Chapter 2. Discrete probability distributions
Use Theorem 1.5.2 (or Exercise 1.5.9) to show that this forms a distri-
bution. We shall describe negative multinomial distributions in Sec-
tion ??.
2.1.11 Prove that a distribution ω ∈ D∞ (X) necessarily has countable sup-
port.
Hint: Use that each set {x ∈ X | ω(x) > 1n }, for n > 0, can have only
finitely many elements.
2.1.12 Let ω ∈ D(X) be an arbitrary distribution on a set X. We extend it to a
distribution ω? on the set L(X) of lists of elements from X. We define
the function ω? : L(X) → [0, 1] by:
ω(x1 ) · . . . · ω(xn )
ω? [x1 , . . . , xn ] B .
2n+1
(−)?
D(X) / D∞ L(X)
80
2.2. Probabilistic channels 81
i ri · ωi (x) = x ωi (x) = i ri · 1 = 1.
P P P P P
x i ri ·
The familiar properties of unit and flatten hold for distributions too: the ana-
logue of Lemma 1.4.3 holds, with ‘multiset’ replaced by ‘distribution’. We
conclude that D, with these unit and flat, is a monad. The same holds for the
‘infinite’ variation D∞ from Remark 2.1.4.
A probabilistic channel c : X → Y is a function c : X → D(Y). For a state /
distribution ω ∈ D(X) we can do state transformation and produce a new state
81
82 Chapter 2. Discrete probability distributions
c = ω ∈ D(Y). This happens via the standard definition for = from Section 1.8,
see especially (1.39) and (1.40):
XX
c = ω B flat D(c)(ω) = c(x)(y) · ω(x) y . (2.9)
y∈Y x∈X
c / D(Y) D(d)
/ D(D(Z)) flat / D(Z) .
d ◦· c B X
[0, 1]
flip
◦ / {0, 1} and [0, 1]
bn[K]
◦ / {0, 1, . . . , K}.
◦ / ◦ / ◦ /
mn[K] hg[K] pl[K]
D(X) N[K](X) N[L](X) N[K](X). N∗ (X) N[K](X).
82
2.2. Probabilistic channels 83
f = ω =
flat D( f )(ω)
= flat 61 | f (a) i + 12 | f (b) i + 31 | f (c) i
= flat 16 12 | u i + 21 | vi + 12 1| u i + 13 34 | ui + 14 |v i
= 12 |u i + 12 | v i + 2 | u i + 4 | u i + 12 | vi
1 1 1 1 1
= 6 | ui + 6 | v i.
5 1
f = ω = 6 · 2 | ui + 2 | vi
1 1 1
+ 1
2 · 1| ui + 1
3 · 3
4|ui + 14 | v i
= 6 | ui + 6 | vi.
5 1
This ‘mixture’ terminology is often used in a clustering context where the ele-
ments a, b, c are the components.
Example 2.2.2 (Copied from [79]). Let’s assume we wish to capture the mood
of a teacher, as a probabilistic mixture of three possible options namely: pes-
simistic (p), neutral (n) or optimistic (o). We thus have a three-element proba-
bility space X = {p, n, o}. We assume a mood distribution:
σ = 81 | p i + 38 | n i + 21 | o i with plot
83
84 Chapter 2. Discrete probability distributions
{p, n, o}.
c(p)
= 501
| 1 i + 50
2
| 2 i + 50
10
| 3i + 15 50 |4 i + 50 | 5 i
10
+ 50 6
| 6i + 503
| 7i + 50 1
| 8 i + 501
| 9 i + 50
1
| 10 i
pessimistic mood marks
c(n)
= 501
| 1 i + 50
2
| 2 i + 50
4
| 3i + 10
50 |4 i + 50 | 5 i
15
+ 50 | 6i + 50 | 7i + 50 | 8 i + 50 | 9 i + 50
10 5 1 1 1
| 10 i
neutral mood marks
c(o)
= 501
| 1 i + 50
1
| 2 i + 50
1
| 3i + 50
2
|4 i + 50
4
|5i
+ 50 | 6i + 50 | 7i + 50 | 8 i + 50 | 9 i + 50
10 15 10 4 2
| 10 i.
optimistic mood marks
Now that the state σ ∈ D(X) and the channel c : X → Y have been described,
we can form the transformed state c = σ ∈ D(Y). Following the formula-
tion (2.9) we get for each mark i ∈ Y the probability:
This example will return in later chapters. There, the teacher will be confronted
with the marks that the pupils actually obtain. This will lead the teacher to an
84
2.2. Probabilistic channels 85
In the next example we show how products, multisets and distributions come
together in an elementary combinatorial situation.
This is a sum over (ϕ )-many lists ~x with acc(~x) = ϕ. The size K of the multiset
involved has disappeared from this formulation, so that we can view arr simply
as a channel arr : N(X) → L(X).
instance, for X = {a, b} with multiset ϕ = 3| a i + 1| bi there are (ϕ ) =
4 For
3,1 = 3!·1! = 4 arrangements of ϕ, namely [a, a, a, b], [a, a, b, a], [a, b, a, a],
4!
Our next question is: how are accumulator acc and arrangement arr related?
85
86 Chapter 2. Discrete probability distributions
N(X)
arr
◦ / L(X)
◦ ◦ acc (2.11)
unit
N(X)
It combines a probabilistic channel arr : N(X) → D(L(X)) with an ordinary
function acc : L(X) → N(X), promoted to a deterministic channel acc. For
the channel composition ◦· we make use of Lemma 1.8.3 (4):
acc ◦· arr (ϕ) = D(acc) arr(ϕ)
X 1
= D(acc) ~x
(ϕ )
~x∈acc −1 (ϕ)
X 1
=
acc(~x)
(ϕ )
~x∈acc −1 (ϕ)
X 1 1
= ϕ = (ϕ) · ϕ = 1 ϕ = unit(ϕ).
(ϕ ) (ϕ)
~x∈acc (ϕ)
−1
The above triangle (2.11) captures a very basic relationship between sequences,
multisets and distributions, via the notion of (probabilistic) channel.
In the other direction, arr(acc(~x)) does not return ~x. It yields a uniform dis-
tribution over all permutations of the sequence ~x ∈ X K .
Deleting and adding an element to a natural multiset are basic operations
that are naturally described as channels.
Definition 2.2.4. For a set X and a number K ∈ N we define a draw-delete
channel DD and also a draw-add channel DA in a situation:
DD
r ◦
N[K](X)
◦
2 N[K +1](X)
DA
86
2.2. Probabilistic channels 87
in the urn ψ. This drawn element x is subsequently removed from the urn via
subtraction ψ − 1| x i.
In the draw-add case DA one also draws single elements x from the urn χ,
but instead of deleting x one adds an extra copy of x, via the sum χ + 1| x i. This
is typical for Pólya style urns, see [115] or Section 3.4 for more information.
These channels are not each other’s inverses. For instance:
DD 3|a i + 1| bi = 43 2| a i + 1| b i + 14 3| a i
DA 2|a i + 1| bi = 2 3| a i + 1| b i + 1 2| a i + 2|b i .
3 3
DA ◦· DD 3| ai + 1| b i
= DA = 34 2| a i + 1| b i + 41 3| a i
= 34 23 3| a i + 1| bi + 31 2| ai + 2| bi + 14 1 4| ai
= 12 3| ai + 1| b i + 14 2| a i + 2| b i + 14 4| a i
DD ◦· DA 2| ai + 1| bi
= DD = 23 3| a i + 1|b i + 31 2| ai + 2| bi
= 23 34 2| a i + 1| bi + 41 3| ai + 13 12 1| a i + 2| b i + 12 2|a i + 1| bi
= 21 2| ai + 1| b i + 16 3| a i + 16 1| a i + 2| b i + 16 2|a i + 1| bi .
DD K
r ◦
N[L](X) 2 N[L+K](X)
◦
DA K
87
88 Chapter 2. Discrete probability distributions
ψ χ
ϕ ϕ
X X
DD K (ψ) = ψ − ϕ DA K (χ) = L χ + ϕ .
L+K
ϕ≤K ψ K ϕ≤K χ K
Proof. For both equations we use induction on K ∈ N. In both cases the only
option for ϕ in N[0](X) is the empty multiset 0, so that DD 0 (ψ) = 1| ψi and
DA 0 (χ) = 1|χ i.
For the induction steps we make extensive use of the equations in Exer-
cise 1.7.6. In those cases we shall put ‘E’ above the equation. We start with the
induction step for draw-delete. For ψ ∈ N[L+K + 1](X),
We reason basically in the same way for draw-add, and now also use Exer-
88
2.2. Probabilistic channels 89
Exercises
2.2.1 Consider the D-channel f : {a, b, c} → {u, v} from Example 2.2.1,
together with a new D-channel g : : {u, v} → {1, 2, 3, 4} given by:
g(u) = 14 |1 i + 18 | 2 i + 21 | 3 i + 18 | 4 i g(v) = 14 | 1i + 18 | 3i + 58 |3 i.
2.2.4 Identify the channel f and the state ω in Example 2.2.1 with matrices:
1
1
1 3
6
M f = 21 =
4
1
1 M ω 2
2 0 4
1
3
Such matrices are called stochastic, since each of their columns has
non-negative entries that add up to one.
Check that the matrix associated with the transformed state f = ω
is the matrix-column multiplication M f · Mω .
(A general description appears in Remark 4.3.5.)
89
90 Chapter 2. Discrete probability distributions
N[K +1](X)
DD
◦ / N[K](X)
arr ◦ ◦ arr
K+1 π / XX
X ◦
2.2.7 Consider the arrangement channels arr : N(X) → L(X) from Exam-
ple 2.2.3. Prove that arr is natural: for each function f : X → Y the
following diagram commutes.
N(X)
arr / D L(X))
N( f ) D(L( f ))
N(Y)
arr / D L(Y)
(This is far from easy; you may wish to check first what this means in
a simple case and then content yourself with a ‘proof by example’.)
2.2.8 Show that the draw-and-delete and draw-and-add maps DD and DA
in Definition 2.2.4 are natural, from N[K+1] to D ◦ N[K], and from
N[K] to D ◦ N[K +1].
2.2.9 Recall from (1.34) that the set N[K](X)
of K-sized natural multisets
over a set X with n has contains Kn members. We write unif K ∈
D N[K](X) for the uniform distribution over such multisets.
1 Check that:
X 1
unif K = n ϕ .
ϕ∈N[K](n) K
90
2.2. Probabilistic channels 91
2.2.10 Let X be a finite set, say with n elements, of the form X = {x1 , . . . , xn }.
Define for K ≥ 1,
X
σK B 1
∈ D N[K](X) .
n K| xi i
1≤i≤n
Thus:
σ1 = 1n 1| x1 i + · · · + 1
n 1| xn i
σ = 1 2| x i + · · · + n 2| xn i ,
1
2 n 1 etc.
Show that these σK form a cone, both for draw-delete and for draw-
add:
DD = σK+1 = σK and DA = σK = σK+1 .
Thus, the whole sequence σK K∈N>0 can be generated from σ1 =
unifN[1](X) by repeated application of DA.
2.2.11 Recall the bijective correspondences from Exercise 1.8.4.
1 Let X, Y be finite sets and c : X → D(Y) be a D-channel. We
can then define an M-channel c† : Y → M(X) by swapping ar-
guments: c† (y)(x) = c(x)(y). We call c a ‘bi-channel’ if c† is also a
D-channel, i.e. if x c(x)(y) = 1 for each y ∈ Y.
P
Prove that the identity channel is a bi-channel and that bi-channels
are closed under composition.
2 Take A = {a0 , a1 } and B = {b0 , b1 } and define a channel bell : A ×
2 → D(B × 2) as:
bell(a0 , 0) = 21 | b0 , 0i + 8 |b1 , 0 i + 8 | b1 , 1 i
3 1
bell(a0 , 1) = 2 | b0 , 1i + 8 |b1 , 0 i + 8 | b1 , 1 i
1 1 3
bell(a1 , 0) = 83 | b0 , 0i + 18 | b0 , 1 i + 18 | b1 , 0 i + 38 | b1 , 1 i
bell(a1 , 1) = 81 | b0 , 0i + 38 | b0 , 1 i + 38 | b1 , 0 i + 18 | b1 , 1 i
Check that bell is a bi-channel.
(It captures the famous Bell table from quantum theory; we have
deliberately used open spaces in the above description of the chan-
nel bell so that non-zero entries align, giving a ‘bi-stochastic’ ma-
trix, from which one can read bell † vertically.)
2.2.12 Check that the inclusions D(X) ,→ M(X) form a map of monads, as
described in Definition 1.9.2.
2.2.13 Recall from Exercise 1.9.6 that for a fixed set A the mapping X 7→
X + A is a monad. Prove that X 7→ D(X + A) is also a monad.
(The latter monad will be used in Chapter ?? to describe Markov
models with outputs. A composition of two monads is not necessarily
91
92 Chapter 2. Discrete probability distributions
The normalisation step forces the sum on the right-hand side to be a convex
sum, with factors adding up to one. Clearly, from an empty multiset we cannot
learn a distribution — technically because the above sum s is then zero so that
we cannot divide by s.
Using scalar multiplication from Lemma 1.4.2 (2) we can define the Flrn
92
2.3. Frequentist learning: from multisets to distributions 93
Flrn(ϕ) = 20
20+30 | H i + 30
20+30 | T i = 25 | H i + 53 |T i.
Thus, the bias (twowards head) is 25 . In this simple case we could have ob-
tained this bias immediately from the data, but the Flrn map captures the
general mechanism.
Notice that with frequentist learning, more (or less) consistent data gives
the same outcome. For instance if we knew that 40 out of 100 tosses were
head, or 2 out of 5, we would still get the same bias. Intuitively, these data
give more (or less) confidence about the data. These aspects are not cov-
ered by frequentist learning, but by a more sophisticated form of ‘Bayesian’
learning. Another disadvantage of the rather primitive form of frequentist
learning is that prior knowledge, if any, about the bias is not taken into ac-
count.
2 Recall the medical table (1.18) captured by the multiset τ ∈ N(B × M).
Learning from τ yields the following joint distribution:
Flrn(τ) = 0.1| H, 0 i + 0.35| H, 1 i + 0.25| H, 2 i
+ 0.05| L, 0 i + 0.1| L, 1 i + 0.15| L, 2 i.
Such a distribution, directly derived from a table, is sometimes called an
empirical distribution [33].
In the above coin example we saw a property that is typical of frequentist
learning, namely that learning from more of the same does not have any effect.
We can make this precise via the equation:
93
94 Chapter 2. Discrete probability distributions
Lemma 2.3.2. The frequentist learning maps Flrn : M∗ (X) → D(X) from (2.13)
are natural in X. This means that for each function f : X → Y the following
diagram commutes.
M∗ (X)
Flrn / D(X)
M∗ ( f ) D( f )
M∗ (Y) / D(Y)
Flrn
= i s | f (xi ) i
P ri
= D( f ) i rsi | xi i
P
= D( f ) ◦ Flrn (ϕ).
We can apply this basic result to the medical data in Table (1.18), via the
multiset τ ∈ N(B × M). We have already seen in Section 1.4 that the multiset-
marginals N(πi )(τ) produce the marginal columns and rows, with their totals.
We can learn the distributions from the colums as:
= 0.7| H i + 0.3| L i.
94
2.3. Frequentist learning: from multisets to distributions 95
of, at an intuitive level, but maybe not in the mathematically precise form of
Lemma 2.3.2.
Drawing an object from an urn is an elementary operation in probability
theory, which involves frequentist learning Flrn. For instance, the draw-and-
delete and draw-and-add operations DD and DA from Definition 2.2.4 can be
described, for urns ψ and χ with kψk = K + 1 and kχk = K, as:
X ψ(x) X
DD(ψ) = ψ − 1| x i = Flrn(ψ)(x) ψ − 1| x i
x∈supp(ψ)
K + 1 x∈supp(ψ)
X χ(x) X
DA(χ) = χ + 1| x i = Flrn(χ)(x) χ + 1| x i .
x∈supp(χ)
K x∈supp(χ)
Since this drawing takes the multiplicities into account, the urns after the
DD and DA draws have the same frequentist distribution as before, but only
if we interpret “after” as channel composition ◦·. This is the content of the
following basic result.
◦ ◦ ◦ ◦
Flrn +Xr Flrn Flrn +Xr Flrn
95
96 Chapter 2. Discrete probability distributions
First taking (multiset) multiplication, and then doing frequentist learning gives:
Flrn flat(Φ) = Flrn 4| ai + 2| b i + 6| c i = 31 | a i + 16 | bi + 12 | ci.
However, first (outer en inner) learning and then doing (distribution) multipli-
cation yields:
flat Flrn M(Flrn)(Φ) = 13 13 | a i + 23 | c i + 32 13 | ai + 13 |b i + 13 | c i
= 13 | a i + 29 | b i + 49 | c i.
Exercises
2.3.1 Recall the data / multisets about child ages and blood types in the
beginning of Subsection 1.4.1. Compute the associated (empirical)
distributions.
Plot these distributions as a graph. How do they compare to the
plots (1.16) and (1.17)?
96
2.3. Frequentist learning: from multisets to distributions 97
2.3.2 Check that frequentist learning from a constant multiset yields a uni-
form distribution. And also that frequentist learning is invariant under
(non-zero) scalar multiplication for multisets: Flrn(s · ϕ) = Flrn(ϕ)
for s ∈ R>0 .
2.3.3 1 Prove that for multisets ϕ, ψ ∈ M∗ (X) one has:
kϕk kψk
Flrn ϕ + ψ = · Flrn(ϕ) +
· Flrn(ψ).
kϕk + kψk kϕk + kψk
This means that when one has already learned Flrn(ϕ) and new
data ψ arrives, all probabilities have to be adjusted, as in the above
convex sum of distributions.
2 Check that the following formulation for natural multisets of fixed
sizes K > 0, L > 0 is a special case of the previous item.
+ / N[K +L](X)
N[K](X) × N[L](X)
Flrn×Flrn
K L
K+L (−)+ K+L (−)
Flrn
D(X) × D(X) / D(X)
L∗ (X)
supp
/ Pfin (X)
F
(2.16)
acc * M (X) / D(X) supp
∗
Flrn
2.3.6 Let X be a finite set and K ∈ N be an arbitrary number. Show that for
σ ∈ D N[K +1](X) and ϕ ∈ N[K](X) one has:
DD = σ (ϕ) X σ(ϕ+1| x i)
= .
(ϕ) x∈X
( ϕ+1| x i)
2.3.7 Recall the uniform distributions unif K ∈ D N[K](X) from Exer-
cise 2.2.9, where the set X is finite. Use Lemma 1.7.5 to prove that
Flrn = unif K = unif X ∈ D(X).
97
98 Chapter 2. Discrete probability distributions
Definition 2.4.1. Tensors, also called parallel products, will be defined first for
states, for each collection type separately, and then for channels, in a uniform
manner.
98
2.4. Parallel products 99
The right-hand side uses the tensor product of the appropriate type T , as
defined in the previous item.
We shall use tensors not only in binary form ⊗, but also in n-ary form ⊗ · · · ⊗,
both for states and for channels.
We see that tensors ⊗ involve the products × of the underlying domains. A
simple illustration of a (probabilistic) tensor product is:
6 | ui + 3 | vi + 2 |w i ⊗ 4 | 0 i + 4 | 1 i
1 1 1 3 1
= 18 | u, 0 i + 24
1
| u, 1 i + 14 | v, 0i + 12
1
|v, 1 i + 38 | w, 0i + 18 | w, 1i.
These tensor products tend to grow really quickly in size, since the number of
entries of the two parts have to be multiplied.
Parallel products are well-behaved, as described below.
Lemma 2.4.2. 1 Transformation of parallel states via parallel channels is the
parallel product of the individual transformations:
(c ⊗ d) = (ω ⊗ ρ) = (c = ω) ⊗ (d = ρ).
2 Parallel products of channels interact nicely with unit and composition:
unit ⊗ unit = unit (c1 ⊗ d1 ) ◦· (c2 ⊗ d2 ) = (c1 ◦· c2 ) ⊗ (d1 ◦· d2 ).
3 The tensor of trivial, deterministic channels is obtained from their product:
f ⊗ g = f × g
99
100 Chapter 2. Discrete probability distributions
As promised we look into why parallel products don’t work for lists.
Remark 2.4.3. Suppose we have two list [a, b, c] and [u, v] and we wish to
form their parallel product. Then there are many ways to do so. For instance,
two obvious choices are:
[ha, ui, ha, vi, hb, ui, hb, vi, hc, ui, hc, vi]
[ha, ui, hb, ui, hc, ui, ha, vi, hb, vi, hc, vi]
There are many other possibilities. The problem is that there is no canonical
choice. Since the order of elements in a list matter, there is no commutativity
property which makes all options equivalent, like for multisets. Technically,
the tensor for L does not exist because L is not a commutative (i.e. monoidal)
monad; this is an early result in category theory going back to [102].
X
split
/ X×X f ⊗g
/ Y ×Y join
/Y (2.17)
The split and join operations depend on the situation. In the next result they
are copy and sum.
Lemma 2.4.4. Recall the flip (or Bernoulli) distribution flip(r) = r|1 i + (1 −
r)| 0 i and consider it as a channel flip : [0, 1] → N. The convolution of two
such flip’s is then a binomial distribution, as described in the following dia-
gram of channels.
∆ / [0, 1] × [0, 1] flip⊗flip
/ N×N
[0, 1] ◦ ◦
◦+
◦
bn[2] /N
100
2.4. Parallel products 101
The structure used in this result is worth making explicit. We have intro-
duced probability distributions on sets, as sample spaces. It turns out that if
this underlying set, say M, happens to be a commutative monoid, then D(M)
also forms a commutative monoid. This makes for instance the set D(N) of
distributions on the natural numbers a commutative monoid — even in two
ways, using either the additive or multiplicative monoid structure on N. The
construction occurs in [99, p.82], but does not seem to be widely used and/or
familiar. It is an instance of an abstract form of convolution in [104, §10].
Since it plays an important role here we make it explicit. It can be described
via parallel products ⊗ of distributions.
CMon
D / CMon
(2.19)
Sets
D / Sets
The vertical arrows ‘forget’ the monoid structure, by sending a monoid to its
underlying set.
101
102 Chapter 2. Discrete probability distributions
We can also express this via commutation of the following diagrams of chan-
nels.
+ / R≥0 0 / R≥0
R≥0 × R≥0 ◦ 1 ◦
102
2.4. Parallel products 103
R≥0 and k ∈ N.
(2.18)
pois[λ1 ] + pois[λ2 ] (k) = D(+) pois[λ1 ] ⊗ pois[λ2 ] (k)
X
= pois[λ1 ] ⊗ pois[λ2 ] (k1 , k2 )
k1 ,k2 ,k1 +k2 =k
X
= pois[λ1 ](m) · pois[λ2 ](k − m)
0≤m≤k
λm λk−m
!
(2.7)
X
= e−λ1 · 1 · e−λ2 · 2
0≤m≤k
m! (k − m)!
X e−(λ1 +λ2 ) k!
= · · λm
1 · λ2
k−m
0≤m≤k
k! m!(k − m)!
e−(λ1 +λ2 ) X k
!
= · · λm1 · λ2
k−m
k! 0≤m≤k
m
(1.26) e−(λ1 +λ2 )
= · (λ1 + λ2 )k
k!
(2.7)
= pois[λ1 + λ2 ](k).
k
Finally, in the expression pois[0] = k e0 · 0k! | k i everything vanishes except
P
for k = 0, since only 00 = 1. Hence pois[0] = 1| 0 i.
The next illustration shows that tensors for different collection types are
related.
This says that support and frequentist learning are ‘monoidal’ natural trans-
formations.
Proof. We shall do the Flrn-case and leave supp as exercise. We first note that:
ϕ ⊗ ψ
= z (ϕ ⊗ ψ)(z) = x,y (ϕ ⊗ ψ)(x, y)
P P
Then:
(ϕ ⊗ ψ)(x, y) ϕ(x) ψ(y)
Flrn ϕ ⊗ ψ (x, y) = =
·
kϕ ⊗ ψk kϕk kψk
= Flrn(ϕ)(x) · Flrn(ψ)(y)
= Flrn(ϕ) ⊗ Flrn(ψ) (x, y).
103
104 Chapter 2. Discrete probability distributions
We can form a K-ary product X K for a set X. But also for a distribution
ω ∈ D(X) we can form ωK = ω ⊗ · · · ⊗ ω ∈ D(X K ). This lies at the heart
of the notion of ‘independent and identical distributions’. We define a separate
function for this construction:
iid [K]
/ D(X K ) iid [K](ω) = ω ··· ⊗ ω
D(X) with | ⊗ {z }. (2.20)
K times
We can describe iid [K] diagrammatically as composite, making the copying
of states explicit:
D(X)
iid [K]
/ D(X K )
[K](ω1 , . . . , ωK )
N
B
where (2.21)
, D(X)K B ω1 ⊗ · · · ⊗ ω K .
∆
N
[K]
We often omit the parameter K when it is clear from the context. This map iid
will pop-up occasionally in the sequel. At this stage we only mention a few of
its properties, in combination with zipping.
Lemma 2.4.8. Consider the iid maps from (2.20), and also the zip function
from Exercise 1.1.7.
1 These iid maps are natural, which we shall describe in two ways. For a
function f : X → Y and for a channel c : X → Y the following two diagrams
commutes.
D(X)
iid / D(X K ) D(X)
iid
◦ / XK
D( f ) K c = (−) ◦ cK = c ⊗ ··· ⊗ c
D( f )
D(Y)
iid / D(Y K ) D(Y)
iid
◦ / YK
D(X) × D(Y)
iid ⊗iid
◦ / XK × Y K
◦ zip
⊗
D(X × Y)
iid
◦ / (X × Y)K
104
2.4. Parallel products 105
cK ◦· iid (ω) = (c ⊗ · · · ⊗ c) = (ω ⊗ · · · ⊗ ω)
= (c = ω) ⊗ · · · ⊗ (c = ω)
= iid (c = ω)
= iid ◦ (c = (−)) (ω).
[(x1 , y1 ), . . . , (xK , yK )]
= D(zip) ω1 ⊗ · · · ⊗ ωK ⊗ ρ1 ⊗ · · · ρK [(x1 , y1 ), . . . , (xK , yK )]
= ω1 ⊗ · · · ⊗ ωK [x1 , . . . , xK ] · ρ1 ⊗ · · · ⊗ ρK [y1 , . . . , yK ]
[(x1 , y1 ), . . . , (xK , yK )] .
Exercises
2.4.1 Prove the equation for supp in Proposition 2.4.7.
2.4.2 Show that tensoring of multisets is linear, in the sense that for σ ∈
M(X) the operation ‘tensor with σ’ σ ⊗ (−) : M(Y) → M(X × Y) is
105
106 Chapter 2. Discrete probability distributions
The same hold in the other coordinate, for (−)⊗τ. As a special case we
obtain that when σ is a probability distribution, then σ⊗(−) preserves
convex sums of distributions.
2.4.3 Extend Lemma 2.4.4 in the following two ways.
1 Show that the K-fold convolution (2.17) of flip’s is a binomial of
size K, as in:
∆K flip K
[0, 1] ◦ / [0, 1]K ◦ / NK
◦+
◦
bn[K] .N
ω + ρ = D(+) ω ⊗ ρ ω ? ρ = D(·) ω ⊗ ρ .
and
Consider the following three distributions on N.
ρ1 = 12 | 0i + 13 | 1i + 16 |2 i ρ2 = 12 | 0 i + 12 | 1 i ω = 32 | 0i + 13 | 1i.
Show consecutively:
1 ρ1 + ρ2 = 14 | 0 i + 125
| 1i + 14 | 2i + 12
1
| 3 i;
2 ω ? (ρ1 + ρ2 ) = 4 | 0i + 36 | 1i + 12 | 2 i + 36
3 5 1 1
| 3 i;
3 ω ? ρ1 = 6 | 0i + 9 | 1i + 18 | 2 i;
5 1 1
4 ω ? ρ2 = 65 | 0i + 16 | 1i;
5 (ω ? ρ1 ) + (ω ? ρ2 ) = 36 25
| 0i + 108
25
| 1 i + 108
7
|2i + 1
108 | 3i.
106
2.4. Parallel products 107
Show that the tensor σ⊗τ can be reconstructed from these two strength
maps via the following diagram:
D(D(X × Y))
flat / D(X × Y) o flat
D(D(X × Y))
noise( f, c) B f + c.
107
108 Chapter 2. Discrete probability distributions
2 Show also that it can be described abstractly via strength (from the
previous exercise) as composite:
D D(X)
iid
◦ / D(X)K
= iid (Ω) = iid flat(Ω) .
N
that is
N
◦
flat
D(X)
iid
◦ / XK
N
2.4.10 Show that the big tensor : D(X)K → D(X K ) from (2.21) com-
mutes with unit and flatten, as described below.
N N
D( )
X K
D (X)2 K / D D(X)K / D2 (X K )
unit
K
unit flat K flat
N N
D(X)K / D(X K ) D(X)K / D(X K )
Abstractly this shows that the functor (−)K distributes over the monad
D.
2.4.11 Check that Lemma 2.4.2 (4) can be read as: the tensor ⊗ of distribu-
tions, considered as a function ⊗ : D(X) × D(Y) → D(X × Y), is a
natural transformation from the composite functor:
Sets × Sets
D×D / Sets × Sets × / Sets
to the functor
Sets × Sets
× / Sets D / Sets.
M×M
unit×unit / D(M) × D(M)
+ +
M
unit / D(M)
108
2.5. Projecting and copying 109
The sum + on the on the left-hand side is the one in D(M), from
the beginning of this exercise. The sum + on the right-hand side is
the one in D(D(M)), using that D(M) is a commutative monoid —
and thus D(D(M)) too.
3 Check the functor D : CMon → CMon in (2.19) is also a monad.
πi = ω (y)
(2.23)
X X
= ω x1 , . . . , xi−1 , y, xi+1 , . . . , xn .
x1 ∈X1 ,..., xi−1 ∈Xi−1 xi+1 ∈Xi+1 ,..., xn ∈Xn
Below we shall introduce special notation for such marginalisation, but first
we look at some of its properties.
The next result only works in the probabilistic case, for D. The exercises
below will provide counterexamples for P and M.
109
110 Chapter 2. Discrete probability distributions
πi = ω1 ⊗ · · · ⊗ ωn = D(πi ) ω1 ⊗ · · · ⊗ ωn = ωi .
X1 × · · · × Xn
c1 ⊗ ··· ⊗ cn
◦ / Y1 × · · · × Yn
πi ◦ ◦ πi
Xi ◦ / Yi
ci
Proof. We only do item (1), since item (2) then follows easily, using that
parallel products of channels are defined pointwise, see Definition 2.4.1 (2).
The first equation in item (1) follows from Lemma 1.8.3 (4), which yields
πi = (−) = D(πi )(−). We restrict to the special case where n = 2 and i = 1.
Then:
(2.23) P
D(π1 ) ω1 ⊗ ω2 )(x) = y ω1 ⊗ ω2 )(x, y)
= y ω1 (x) · ω2 (y)
P
The last line of this proof relies on the fact that probabilistic states (distri-
butions) involve a convex sum, with multiplicities adding up to one. This does
not work for subsets or multisets, see Exercise 2.5.1 below.
We introduce special, post-fix notation for marginalisation via ‘masks’. It
corresponds to the idea of listing only the relevant variables in traditional no-
tation, where a distribution on a product set is often written as ω(x, y) and its
first marginal as ω(x).
Definition 2.5.2. A mask M is a finite list of 0’s and 1’s, that is an element
M ∈ L({0, 1}). For a state ω of type T ∈ {P, M, D} on X1 × · · · × Xn and a mask
M of length n we write:
ωM
for the marginal with mask M. Informally, it keeps all the parts from ω at a
position where there is 1 in M and it projects away parts where there is 0. This
is best illustrated via an example:
= hπ1 , π3 , π5 i = ω.
110
2.5. Projecting and copying 111
ω 1, 0 = 83 | ui + 58 |v i ω 0, 1 = 12 | a i + 12 | b i.
and
The original state ω differs from the product of its marginals:
ω 1, 0 ⊗ ω 0, 1 = 16 3
| u, a i + 16
3
| u, b i + 16
5
| v, a i + 16
5
| v, bi.
This entwinedness follows from a general characterisation, see Exercise 2.5.9
below.
111
112 Chapter 2. Discrete probability distributions
2.5.2 Copying
For an arbitrary set X there is a K-ary copy function ∆[K] : X → X K = X ×
· · · × X (K times), given as:
We often omit the subscript K, when it is clear from the context, especially
when K = 2. This copy function can be turned into a copy channel ∆[K] =
unit ◦ ∆K : X → X K , but, recall, we often omit writing − for simplicity.
These ∆[K] are alternatively called copiers or diagonals.
As functions one has πi ◦ ∆[K] = id , and thus as channels πi ◦· ∆[K] = unit.
Fact 2.5.5. State transformation with a copier ∆ = ω differs from the tensor
product ω ⊗ ω. In general, for K ≥ 2,
(2.20)
∆[K] = ω , ω ··· ⊗ ω
| ⊗ {z } = iid [K](ω).
K times
The following simple example illustrates this fact. For clarity we now write
the brackets − explicitly. First,
∆ = 13 | 0i + 23 | 1i = 13 ∆(0) + 23 ∆(1) = 13 0, 0 + 23 1, 1 .
(2.24)
In contrast:
3|0i + 3|1i ⊗ 3|0i + 3|1i
1 2 1 2
(2.25)
= 19 |0, 0i + 29 | 0, 1 i + 92 | 1, 0 i + 49 | 1, 1 i.
One might expect a commuting diagram like in Lemma 2.5.1 (2) for copiers,
but that does not work: diagonals do not commute with arbitrary channels,
see Exercise 2.5.10 below for a counterexample. Copiers do commute with
deterministic channels of the form f = unit ◦ f , as in:
∆ ◦· f = ( f ⊗ f ) ◦· ∆ because ∆ ◦ f = ( f × f ) ◦ ∆.
In fact, this commutation with diagonals may be used as definition for a chan-
nel to be deterministic, see Exercise 2.5.12.
112
2.5. Projecting and copying 113
hc, di B (c ⊗ d) ◦· ∆ : X → Y × Z.
We are overloading the tuple notation hc, di. Above we use it for the tuple of
channels, so that hc, di has type X → D(Y × Z). Interpreting hc, di as a tuple of
functions, like in Subsection 1.1.1, would give a type X → D(Y) × D(Z). For
channels we standardly use the channel-interpretation for the tuple.
In the literature a probabilistic channel is often called a discriminative model.
Let’s consider a channel whose codomain is a finite set, say (isomorphic to) the
set n = {0, 1, . . . , n − 1}, whose elements i ∈ n can be seen as labels or classes.
Hence we can write the channel as c : X → n. It can then be understood as a
classifier: it produces for each element x ∈ X a distribution c(x) on n, giving
the likelihood c(x)(i) that x belongs to class i ∈ n. The class i with the highest
likelihood, obtained via argmax, may simply be used as x’s class.
The term generative model is used for a pair of a state σ ∈ D(Y) and a chan-
nel g : Y → Z, giving rise to a joint state hid , gi = σ ∈ D(X × Y). This shows
how the joint state can be generated. Later on we shall see that in principle
each joint state can be described in such generative form, via marginalisation
and disintegration, see Section 7.6.
Exercises
2.5.1 1 Let U ∈ P(X) and V ∈ P(Y). Show that the equation
(U ⊗ V) 1, 0 = U
(ϕ ⊗ ψ) 0, 1 = ϕ.
It fails when ψ is but ϕ is not the empty multiset 0. But it also fails
in non-zero cases, e.g. for the multisets ϕ = 2| ai + 4| b i + 1| ci and
ψ = 3| ui + 2| v i. Check this by doing the required computations.
113
114 Chapter 2. Discrete probability distributions
2.5.9 Let X = {u, v} and A = {a, b} as in Example 2.5.4. Prove that a state
ω = r1 |u, ai + r2 | u, bi + r3 | v, a i + r4 | v, b i ∈ D(X × A), where r1 +
r2 + r3 + r4 = 1, is non-entwined if and only if r1 · r4 = r2 · r3 .
2.5.10 Consider the probabilistic channel f : X → Y from Example 2.2.1
and show that on the one hand ∆ ◦· f : X → Y × Y is given by:
a −7 → 21 | u, u i + 12 | v, vi
b 7−→ 1|u, ui
c 7−→ 34 | u, u i + 14 | v, vi.
114
2.5. Projecting and copying 115
D(X)
D(∆)
/ D(X × X)
>
∆ ( ⊗
D(X) × D(X)
2.5.14 Show that the tuple operation h−, −i for channels, from Definition 2.5.6,
satisfies both:
πi ◦· hc1 , c2 i = ci and hπ1 , π2 i = unit
but not, essentially as in Exercise 2.5.10:
hc1 , c2 i ◦· d = hc1 ◦· d, c2 ◦· di.
2.5.15 Consider the joint state Flrn(τ) from Example 2.3.1 (2).
1 Take σ = 10 7
|Hi + 3
10 | L i ∈ D({H, L}) and c : {H, L} → {0, 1, 2}
defined by:
c(H) = 71 | 0 i + 21 | 1 i + 5
14 | 2i c(T ) = 16 | 0i + 13 | 1i + 12 |2 i.
Check that gr(σ, c) = Flrn(τ).
2.5.16 Consider a state σ ∈ D(X) and an ‘endo’ channel c : X → X.
1 Check that the following two statements are equivalent.
115
116 Chapter 2. Discrete probability distributions
c = σ = σ.
116
2.6. A Bayesian network example 117
wet grass slippery road
(D) (E)
9
d
9
sprinkler rain
(B) (C)
e :
winter
(A)
Figure 2.3 The wetness Bayesian network (from [33, Chap. 6]), with only the
nodes and edges between them; the conditional probability tables associated with
the nodes are given separately in the text.
A b b⊥ A c c⊥
winter
sprinkler
a a⊥ rain
a 1/5 4/5 a 4/5 1/5
3/5 2/5
slippery road
b c 19/20 1/20 C e e⊥
wet grass
b ⊥
c 4/5 1/5 ⊥
c 0 1
⊥ ⊥
b c 0 1
This Bayesian network is thus given by nodes, each with a conditional proba-
bility table, describing likelihoods in terms of previous ‘ancestor’ nodes in the
network (if any).
How to interpret all this data? How to make it mathematically precise? It is
not hard to see that the first ‘winter’ table describes a probability distribution
on the set A, which, in the notation of this book, is given by:
wi = 35 | ai + 25 |a⊥ i.
Thus we are assuming with probability of 60% that we are in a winter situation.
This is often called the prior distribution, or also the initial state.
Notice that the ‘sprinkler’ table contains two distributions on B, one for the
117
118 Chapter 2. Discrete probability distributions
We read them as: if it’s winter, there is a 20% chance that the sprinkler is on,
but if it’s not winter, there is 75% chance that the sprinkler is on.
Similarly, the ‘rain’ table corresponds to a channel:
ra / ra(a) = 54 |c i + 15 | c⊥ i
A ◦ C with
ra(a⊥ ) = 1 | c i + 9 | c⊥ i.
10 10
Before continuing we can see that the formalisation (partial, so far) of the
wetness Bayesian network in Figure 2.3 in terms of states and channels, al-
ready allows us to do something meaningful, namely state transformation =.
Indeed, we can form distributions:
sp = wi on B and ra = wi on C.
These (transformed) distributions capture the derived, predicted probabilities
that the sprinkler is on, and that it rains. Using the definition of state transfor-
mation, see (2.9), we get:
sp = wi (b ) = x sp(x)(b ) · wi(x)
⊥ P ⊥
sp = wi = 21
50 | b i + 29 ⊥
50 | b i.
In a similar way one can compute the probability distribution for rain as:
ra = wi = 13
25 |c i + 12 ⊥
25 | c i.
Such distributions for non-initial nodes of a Bayesian network are called pre-
dictions. As will be shown here, they can be obtained via forward state trans-
formation, following the structure of the network.
But first we still have to translate the upper two nodes of the network from
Figure 2.3 into channels. In the conditional probability table for the ‘wet grass’
118
2.6. A Bayesian network example 119
node we see 4 distributions on the set D, one for each combination of elements
from the sets B and C. The table thus corresponds to a channel:
wg(b, c) = 1920 | d i + 20 | d i
1 ⊥
wg(b, c⊥ ) = 10 | d i + 10
9 1
| d⊥ i
B×C ◦ / D
wg
with
wg(b , c) = 5 | d i + 5 | d⊥ i
⊥ 4 1
wg(b⊥ , c⊥ ) = 1|d⊥ i.
We illustrate how to obtain predictions for ‘rain’ and for ‘slippery road’. We
start from the latter. Looking at the network in Figure 2.3 we see that there are
two arrows between the initial node ’winter’ and our node of interest ‘slippery
road’. This means that we have to do two state successive transformations,
giving:
sr ◦· ra = wi = sr = ra = wi
The first equation follows from Lemma 1.8.3 (3). The second one involves
elementary calculations, where we can use the distribution ra = wi that we
calculated earlier.
Getting the predicted wet grass probability requires some care. Inspection of
the network in Figure 2.3 is of some help, but leads to some ambiguity — see
below. One might be tempted to form the parallel product ⊗ of the predicted
distributions for sprinkler and rain, and do state transformation on this product
along the wet grass channel wg, as in:
But this is wrong, since the winter probabilities are now not used consistently,
see the different outcomes in the calculations (2.24) and (2.25). The correct
way to obtain the wet grass prediction involves copying the winter state, via
the copy channel ∆, see:
wg ◦· (sp ⊗ ra) ◦· ∆ = wi = wg = (sp ⊗ ra) = ∆ = wi
= wg = hsp, rai = wi
= 1399
2000 | d i + 2000 | d i.
601 ⊥
119
120 Chapter 2. Discrete probability distributions
summations are automatically done at the right place. We proceed in two steps,
where for each step we only elaborate the first case.
We conclude that:
(sp ⊗ ra) = ∆ = wi = hsp, rai = wi
= 500
63
| b, ci + 500
147
| b, c⊥ i + 500 | b , ci
197 ⊥
+ 500 | b , c i.
93 ⊥ ⊥
= 19
20 · 500 + 10 · 500 + 5 · 500 + 0 · 500
63 9 147 4 197 93
= 1399
2000
wg = (sp ⊗ ra) = (∆ = wi) (d⊥ )
= 2000 .
601
120
2.6. A Bayesian network example 121
•O •O
D E
wet grass slippery road
@ d 8
B
•O
C
sprinkler rain
f :
•O
A
winter
Figure 2.4 The wetness Bayesian network from Figure 2.3 redrawn in a manner
that better reflects the underlying channel-based semantics, via explicit copiers
and typed wires. This anticipates the string-diagrammatic approach of Chapter 7.
Exercises
2.6.1 In [33, §6.2] the (predicted) joint distribution on D × E that arises
from the Bayesian network example in this section is reprented as a
table. It translates into:
30,443
100,000 |d, e i + 39,507 ⊥
100,000 | d, e i + 100,000 | d , e i
5,957 ⊥
+ 100,000 | d , e i.
24,093 ⊥ ⊥
121
122 Chapter 2. Discrete probability distributions
122
2.7. Divergence between distributions 123
DKL (ω, ρ) = 0 ⇐⇒ ω = ρ.
Proof. 1 The direction (⇐) is easy. For (⇒), let 0 = DKL (ω, ρ) = x ω(x) ·
P
log ω(x)/ρ(x) . This means that if ω(x) , 0, one has log ω(x)/ρ(x) = 0, and thus
ω(x)/ρ(x) = 1 and ω(x) = ρ(x). In particular:
X X
1= ω(x) = ρ(x). (∗)
x∈supp(ω) x∈supp(ω)
Hence U = ∅.
2 By unwrapping the relevant definitions:
DKL ω ⊗ ω0 , ρ ⊗ ρ0
(ω ⊗ ω0 )(x, y)
X !
= (ω ⊗ ω0 )(x, y) · log
x,y
(ρ ⊗ ρ0 )(x, y)
ω(x) ω0 (y)
X !
= ω(x) · ω0 (y) · log · 0
x,y
ρ(x) ρ (y)
ω(x) ω0 (y)
X ! !!
= ω(x) · ω (y) · log
0
+ log 0
x,y
ρ(x) ρ (y)
ω(x) ω0 (y)
X ! X !
= ω(x) · ω (y) · log
0
+ ω(x) · ω (y) · log 0
0
x,y
ρ(x) x,y
ρ (y)
ω(x) ω 0
! !
X X (y)
= ω(x) · log + ω0 (y) · log 0
x
ρ(x) y
ρ (y)
= DKL ω, ρ + DKL ω , ρ . 0 0
123
124 Chapter 2. Discrete probability distributions
124
2.7. Divergence between distributions 125
= log 1 = 0.
Exercises
2.7.1 Take ω = 14 | a i + 34 | b i and ρ = 12 | a i + 12 | b i. Check that:
125
126 Chapter 2. Discrete probability distributions
≤ r1 · DKL ω, ρ1 + · · · + rn · DKL ω, ρn .
126
3
– In the “-1” scenario the drawn ball is removed from the urn. This is cov-
ered by the hypergeometric distribution.
– In the “0” scenario the drawn ball is returned to the urn, so that the urn
remains unchanged. In that case the multinomial distribution applies.
– In the “+1” scenario not only the drawn ball is returned to the urn, but one
additional ball of the same colour is added to the urn. This is covered by
the Pólya distribution.
In the “-1” and “+1” scenarios the probability of drawing a ball of a certain
colour changes after each draw. In the “-1” case only finitely many draws
are possible, until the urn is empty.
This yields six variations in total, namely with ordered / unordered draws and
with -1, 0, 1 as replacement scenario. These six options are represented in a
3 × 2 table, see (3.3) below.
127
128 Chapter 3. Drawing from an urn
/ single-colour, Urn 0
Urn (3.1)
We use the ad hoc notation Urn 0 to describe the urn after the draw. It may
be the same urn as before, in case of a draw with replacement, or it may be
a different urn, with one ball less/more, namely the original urn minus/plus a
ball as drawn.
The above transition (3.1) will be described as a probabilistic channel. It
gives for each single draw the associated probability. In this description we
combine multisets and distributions. For instance, an urn with three red balls
and two blue ones will be described as a (natural) multiset 3| R i + 2| Bi. The
transition associated with drawing a single ball without replacement (scenario
“-1”) gives a mapping:
/ 3 R, 2| Ri + 2| Bi + 2 B, 3| R i + 1| Bi .
3| R i + 2| Bi 5 5
It gives the 35 probability of drawing a red ball, together with the remaining
urn, and a 25 probability of drawing a blue one, with a different new urn.
The “+1” scenario with double replacement gives a mapping:
/ 3 R, 4| Ri + 2| Bi + 2 B, 3| R i + 3| Bi .
3| R i + 2| Bi 5 5
In this last, third case we see that the urn/multiset does not change. An im-
portant first observation is that in that case we may as well use a distribution
as urn, instead of a multiset. The distribution represents an abstract urn. In the
above example we would use the distribution 53 | Ri+ 25 | Bi as abstract urn, when
we draw with single replacement (case “0”). The distribution contains all the
relevant information. Clearly, it is obtained via frequentist learning from the
original multiset. Using distributions instead of multisets gives more flexibility,
128
129
since not all distributions are obtained via frequentist learing — in particular
when the probabilities are proper real numbers and not fractions.
We formulate this approach explicitly.
• In the “-1” and “+1” situations without or with double replacement, an urn
is a (non-empty, natural) multiset, which changes with every draw, via re-
moval or double replacement of the drawn ball. These scenarios will also be
described in terms of deletion or addition, using the draw-delete and draw-
add channels DD and DA that we already saw in Definition 2.2.4.
• In a “0” situation with replacement, an urn is a probability distribution; it
does not change when balls are drawn.
This covers the above second question, about what happens to the urn. For
the first question concerning ordered and unordered draws one has to go be-
yond single draw transitions. Hence we need to suitably iterate the single-draw
transition (3.1) to:
/ multiple-colours, Urn 0
Urn (3.2)
Now we can make the distinction between ordered and unordered draws ex-
plicit. Let X be the set of colours, for the balls in the urn — so X = {R, B} in
the above illustration.
Thus, in the latter case, both the urn and the handful of balls drawn from it, are
represented as a multiset.
In the end we are interested in assigning probabilities to draws, ordered or
not, in “-1”, “0” or “1” mode. These probabilities on draws are obtained by tak-
ing the first marginal/projection of the iterated transition map (3.2). It yields a
mapping from an urn to multiple draws. The following table gives an overview
of the types of these operations, where X is the set of colours of the balls.
0 D(X)
O0
◦ / XK D(X)
U0
◦ / N[K](X)
(3.3)
-1 N[L](X)
O−
◦ / XK N[L](X)
U−
◦ / N[K](X)
+1 N∗ (X)
O+
◦ / XK N∗ (X)
U+
◦ / N[K](X)
129
130 Chapter 3. Drawing from an urn
We see that in the replacement scenario “0” the inputs of these channels are
distributions in D(X), as abstract urns. In the deletion scenario “-1” the in-
put (urns) are multisets in N[L](X), of size L. In the ordered case the outputs
are tuples in X K of length K and in the unordered case they are multisets in
N[K](X) of size K. Implicitly in this table we assume that L ≥ K, so that the
urn is full enough for K single draws. In the addition scenario “+1” we only
require that the urn is a non-empty multiset, so that at least one ball can be
drawn. Sometimes it is required that there is at least one ball of each colour in
the urn, so that all colours can occur in draws.
The above table uses systematic names for the six different draw maps. In
the unordered case the following (historical) names are common:
In this chapter we shall see that these three draw maps have certain properties
in common, such as:
• Frequentist learning applied to the draws yields the same outcome as fre-
quentist learning from the urn, see Theorem 3.3.5, Corollary 3.4.2 (2) and
Proposition 3.4.5 (1).
• Doing a draw-delete DD after a draw of size K + 1 is the same as doing a
K-sized draw, see Proposition 3.3.8, Corollary 3.4.2 (4) and Theorem 3.4.6.
But we shall also see many differences between the three forms of drawing.
This chapter describes and analyses the six probabilistic channels in Ta-
ble 3.3 for drawing from an urn. It turns out that many of the relevant prop-
erties can be expressed via composition of such channels, either sequentially
(via ◦·) or in parallel (via ⊗). In this analysis the operations of accumulation
acc and arrangement arr, for going back-and-forth between products and mul-
tisets, play an important role. For instance, in each case one has commuting
diagrams of the form.
Such commuting diagrams are unusual in the area of probability theory but we
like to use them because they are very expressive. They involve multiple equa-
tions and clearly capture the types and order of the various operations involved.
130
131
Moreover, to emphasise once more, these are diagrams of channels with chan-
nel composition ◦·. As ordinary functions, with ordinary function composition
◦, there are no such diagrams / equations.
The chapter first takes a fresh look at accumulation and arrangement, in-
troduced earlier in Sections 1.6 and 2.2. These are the operations for turning
a list into a multiset, and a multiset into a uniform distribution of lists (that
accumulate to the multiset). In Section 3.2 we will use accumulation and ar-
rangement for the powerful operation for “zipping” two multisets of the same
size together. It works analogously to zipping of two lists of the same length,
into a single list of pairs, but this ‘multizip’ produces a distribution over (com-
bined) multisets. The multizip operation turns out to interact smoothly with
multinomial and hypergeometric distributions, as we shall see later on in this
chapter.
Section 3.3 investigates multinomial channels and Section 3.4 covers both
hypergeometric and Pólya channels, all in full, multivariate generality. We have
briefly seen the multinomial and hypergeometric channels in Example 2.1.1.
Now we take a much closer look, based on [78], and describe how these chan-
nels commute with basic operations such as accumulation and arrangement,
with frequentist learning, with multizip, and with draw-and-delete. All these
commutation properties involve channels and channel composition ◦·.
Subsequently, Section 3.5 elaborates how the channels in Table 3.3 actu-
ally arise. It makes the earlier informal descriptions in (3.1) and (3.2) mathe-
matically precise. What happens is a bit sophisticated. We recall that for any
monoid M, the mapping X 7→ M × X is a monad, called the writer monad,
see Lemma 1.9.1. This can be combined with the distribution monad D, giv-
ing a combined monad X 7→ D(M × X). It comes with an associated ‘Kleisli’
composition. It is precisely this composition that we use for iterating a single
draw, that is, for going from (3.1) to (3.2). Moreover, for ordered draws we use
the monoid M = L(X) of lists, and for unordered draws we use the monoid
M = N(X) of multisets. It is rewarding, from a formal perspective, to see that
from this abstract principled approach, common distributions for different sorts
of drawing arise, including the well known multinomial, hypergeometric and
Pólya distributions. This is based on [81].
The subsequent two sections 3.6 and 3.7 of this chapter focus on a non-
trivial operation from [78], namely turning a multiset of distributions into a
distribution of multisets. Technically, this operation is a distributive law, called
the parallel multinomial law. We spend ample time introducing it: Section 3.6
contains no less than four different definitions — all equivalent. Subsequently,
various properties are demonstrated of this parallel multinomial law, including
131
132 Chapter 3. Drawing from an urn
132
3.1. Accumlation and arrangement, revisited 133
assigns equal probability to each sequence. Here we shall also use arrangement
for a specific size, as a channel arr[K] : N[K](X) → X K . We recall:
X 1 kϕk! K!
arr[K](ϕ) = ~x where (ϕ ) = = Q .
(ϕ) ϕ x ϕ(x)!
~x∈acc (ϕ)
−1
The vectors ~y take the multiplicities of elements in ~x into account, which leads
1
to the factor ( acc(~
x) )
.
Permutations are an important part of the story of accumulation and arrange-
ment. This is very explicit in the following result. Categorically it can be de-
scribed in terms of a so-called coequaliser, but here we prefer a more concrete
description. The axiomatic approach of [80] is based on this coequaliser.
XK
acc / / N[K](X)
f
f
+Y
The double head / / for acc is used to emphasise that it is a surjective func-
tion.
This result will be used both as a definition principle and as a proof principle.
The existence part can be used to define a function N[K] → Y by specifying a
function X K → Y that is stable under permutation. The uniqueness part yields
a proof principle: if two functions g1 , g2 : N[K](X) → Y satisfy g1 ◦ acc =
g2 ◦ acc, then g1 = g2 . This is quite powerful, as we shall see. Notice that acc
is stable under permutation itself, see also Exercise 1.6.4.
133
134 Chapter 3. Drawing from an urn
Proof. Take ϕ = 1≤i≤` ni | xi i ∈ N[K](X). Then we can define f (ϕ) via any
P
arrangement, such as:
f (ϕ) B f x1 , . . . , x1 , . . . , x` , . . . , x` .
| {z } | {z }
n1 times n` times
Example 3.1.2. We illustrate the use of Proposition 3.1.1 to re-define the ar-
rangement map and to prove one of its basic properties. Assume for a moment
that we do not already now about arrangement, only about accumulation. Now
consider the following situation:
XK
acc / / N[K](X)
X 1
arr where perm(~x) B ~y .
perm
' ~y is a permutation of ~x
K!
D(X K )
XK
acc / / N[K](X)
134
3.1. Accumlation and arrangement, revisited 135
We show that both dashed arrows fit. The first equation below holds by con-
struction of arr, in the above triangle.
x) = acc ◦· perm (~x)
acc ◦· arr ◦ acc (~
X 1
= D(acc) ~y
K!
~y is a permutation of ~x
X 1
=
acc(~y)
K!
~y is a permutation of ~x
X 1
=
acc(~x)
K!
~y is
a permutation of ~x
=
1 acc(~x)
= acc (~ x)
=
unit ◦ acc (~x).
The fact that the composite arr ◦· acc produces N all permutation is used in the
K K
following result. Recall that the ‘big’ tensor : D(X) → D(X ) is defined
(ω1 , . . . , ωK ) = ω1 ⊗ · · · ⊗ ωK , see 2.21.
N
by
Proposition 3.1.3. The composite arr ◦· acc commutes with tensors, as in the
following diagram of channels.
D(X)K
acc
◦ / N[K] D(X) arr
◦ / D(X)K
N N
◦ ◦
XK
acc
◦ / N[K](X) arr
◦ / XK
= ω)(~x).
N
arr ◦· acc ◦· (~
135
136 Chapter 3. Drawing from an urn
Exercises
3.1.1 Show that the permutation channel perm from Example 3.1.2 is an
idempotent, i.e. satisfies perm ◦· perm = perm. Prove this both con-
cretely, via the definition of perm, and abstractly, via acc and arr.
3.1.2 Consider Proposition 3.1.3 for X = {a, b} and K = 4. Check that:
◦·arr ◦· acc ω1 , ω1 , ω2 , ω2 a, b, b, b
N
N[K](X) × N[K](Y)
mzip[K]
◦ / N[K](X × Y).
136
3.2. Zipping multisets 137
N[K](X) × N[K](Y)
arr⊗arr / D XK × Y K
D(zip) (3.7)
D (X × Y)K
/ D N[K](X × Y)
D(acc)
Example 3.2.1. Let’s use two set X = {a, b} and Y = {0, 1} with two multisets
of size three:
ϕ = 1| ai + 2| bi and ψ = 2| 0 i + 1| 1 i.
Then:
3 3
(ϕ) = 1,2 =3 ( ψ) = 2,1 =3
(a, 0), (b, 0), (b, 1) (b, 0), (a, 0), (b, 1) (b, 0), (b, 0), (a, 1)
(a, 0), (b, 1), (b, 0) (b, 0), (a, 1), (b, 0) (b, 0), (b, 1), (a, 0)
(a, 1), (b, 0), (b, 0) (b, 1), (a, 0), (b, 0) (b, 1), (b, 0), (a, 0)
By applying the accumulation function acc to each of these we get multisets:
This shows that calculating mzip is laborious. But it is quite mechanical and
easy to implement. The picture below suggests to look at mzip as a funnel with
two input pipes in which multiple elements from both sides can be combined
into a probabilistic mixture.
137
138 Chapter 3. Drawing from an urn
1|a i+2| b i 2| 0 i+1| 1i
@ @
@ @
1
+ 2
3 1| a, 1 i+2| b, 0i 3 1|a, 0i+1| b, 0i+1| b, 1 i
N[K](X) × N[K](Y)
mzip
/ D N[K](X × Y)
N[K]( f )×N[K](g) D(N[K]( f ×g))
N[K](U) × N[K](V)
mzip
/ D N[K](U × V)
N[K](X) × N[K](Y)
arr⊗arr
◦ / XX × Y K
mzip ◦ ◦ zip
N[K](X × Y)
arr
◦ / (X × Y)K
138
3.2. Zipping multisets 139
Proof. 1 Easy, via the diagrammatic formulation (3.7), using naturality of arr
(Exercise 2.2.7), of zip (Exercise 1.9.4), and of acc (Exercise 1.6.6).
2 Since:
X 1
mzip(ϕ, K| y i) = acc zip(~x, hy, . . . , yi)
( ϕ )
~x∈acc −1 (ϕ)
1
acc(−x−i→
X
= , y)
( ϕ )
~x∈acc −1 (ϕ)
X 1
=
acc(~x) ⊗ 1|y i
( ϕ )
~x∈acc −1 (ϕ)
X 1
= ϕ ⊗ 1| y i = 1 ϕ ⊗ 1| y i .
( ϕ )
−1 ~x∈acc (ϕ)
139
140 Chapter 3. Drawing from an urn
N[K](X)×N[K](Y)×N[K](Z)
id ⊗mzip
◦ / N[K](X)×N[K](Y ×Z)
arr⊗arr⊗arr ◦ arr⊗arr
◦*
mzip⊗id ◦ K
X ×Y ×Z K K id ⊗zip
◦ / X ×(Y ×Z)K
K
mzip ◦ ◦ ◦ acc
arr
N[K](X×Y ×Z) ◦ N[K](X×Y ×Z) o
We can see that the outer diagram commutes by going through the internal
subdiagrams. In the middle we use associativity of the (ordinary) zip func-
tion, formulated in terms of (deterministic) channels. Three of the (other)
internal subdiagrams commute by item (4). The acc-arr triangle at the bot-
tom commutes by (2.11).
The following result deserves a separate status. It tells that what we learn
from a multiset zip is the same as what we learn from a parallel product (of
multisets).
Theorem 3.2.3. Multiset zip and frequentist learning interact well, namely as:
N[K](X) × N[K](Y)
mzip
◦ / N[K](X × Y)
⊗
◦ Flrn
2
N[K ](X × Y)
Flrn
◦ / X×Y
140
3.2. Zipping multisets 141
in ~y is ψ(b) x, ~y)
K = Flrn(ψ)(b). Hence the fraction of occurrences of (a, b) in zip(~
is Flrn(ϕ)(a) · Flrn(ψ)(b) = Flrn(ϕ ⊗ ψ)(a, b).
N[K](X) × N[L](X)
arr⊗arr / D XK × XL
D(++)
D X K+L / D N[K +L](X)
D(acc)
XK × XK
acc[K]×acc[K]
/ N[K](X) × N[K](X) +
(
zip N[2K](X)
6
acc[2]K
(X × X)K / N[2](X)K +
acc[K]L
XK
L / N[K](X)L +
'
zip L N[L·K](X)
7
acc[L]K
XL K
/ N[L](X)K +
141
142 Chapter 3. Drawing from an urn
exemplary proof. The two paths in the diagram yield the same outcomes in:
= 2| a i + 1|b i + 2| ci + 2| a i + 2| bi + 1| ci
= 4| a i + 3| b i + 3| ci.
+ ◦ acc[2]K ◦ zip [a, b, c, a, c], [b, b, a, a, c]
= + ◦ acc[2]K [(a, b), (b, b), (c, a), (a, a), (c, c)]
= 4| a i + 3|b i + 3| ci.
2 Similarly.
+ / N[2K](X) unit
N[K](X) × N[K](X)
(
mzip D N[2K](X)
6
D N[K](X × X)
D(N[K](acc[2]))
/ D N[K] N[2](X) D(flat)
+ / N[L·K](X)
N[K](X)L unit
(
mzip L D N[L·K](X)
6
D(N[K](acc[L]))/
D N[K](X L )
D N[K] N[L](X) D(flat)
Via the associativity of Proposition 3.2.2 (5) the actual arrangement of these
multiple multizips does not matter.
142
3.2. Zipping multisets 143
Exercises
3.2.1 Show that:
mzip[4] 1| a i + 2|b i + 1| ci, 3|0 i + 1| 1i
= 41 1| a, 1 i + 2| b, 0i + 1| c, 0 i
+ 21 1| a, 0 i + 1| b, 0i + 1| b, 1 i + 1|c, 0 i
+ 41 1| a, 0 i + 2| b, 0i + 1| c, 1 i .
∆ / N[K](X) × N[K](X)
N[K](X)
, mzip
N[K](∆) - D N[K](X × X)
143
144 Chapter 3. Drawing from an urn
2 Check that mzip and zip do not commute with accumulation, as in:
XK × Y K
acc × acc / N[K](X) × N[K](Y)
zip , mzip
acc / D N[K](X × Y)
(X × Y)K
Hint: Take sequences [a, b, b], [0, 0, 1] and re-use Example 3.2.1.
The number K ∈ N represents the number of objects that is drawn. The distri-
bution mn[K](ω) assigns a probability to a K-object draw ϕ ∈ N[K](X). There
is no bound on K, since the idea is that drawn objects are replaced.
Clearly, the above function (3.8) forms a channel mn[K] : D(X) → N[K](X).
In this section we collect some basic properties of these channels. They are
interesting in themselves, since they capture basic relationships between mul-
tisets and distributions, but they will also be useful in the sequel.
For convenience we repeat the definition of the multinomial channel (3.8)
from Example 2.1.1 (2) For a set X and a natural number K ∈ N we define the
144
3.3. The multinomial channel 145
D(X)
mn[K]
◦ / N[K](X)
◦ arr[K]
◦
iid [K] / XK
D(X)
mn[K]
◦ / N[K](X)
A
◦
iid [K] . XK ◦
acc[K]
2 A direct consequence of the previous item, using that acc ◦· arr = unit,
see (2.11) in Example 2.2.3.
145
146 Chapter 3. Drawing from an urn
D(X)
mn[K]
/ D N[K](X)
D( f ) D(N[K]( f ))
D(X)
mn[K]
/ D N[K](X)
Proof. By Theorem 3.3.1 (2) and naturality of acc and of iid , as expressed by
the diagram on the left in Lemma 2.4.8 (1).
D(N[K]( f )) ◦ mn[K] = D(N[K]( f )) ◦ D(acc) ◦ iid
= D(acc) ◦ D( f K ) ◦ iid
= D(acc) ◦ iid ◦ D( f )
= mn[K] ◦ D( f ).
For the next result, recall the multizip operation mzip from Section 3.2. It
may have looked a bit unusual at the time, but the next result demonstrates that
it behaves quite well — as in other such results.
Corollary 3.3.3. Multinomial channels commute with tensor and multizip:
mzip = mn[K](ω) ⊗ mn[K](ρ) = mn[K](ω ⊗ ρ).
Diagrammatically this amounts to:
D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
⊗ ◦ mzip
D(X × Y)
mn[K]
◦ / N[K](X × Y)
D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
◦ arr⊗arr
iid ⊗iid / X × YK
K
◦ zip
⊗
◦ mzip
iid / (X × Y)K
◦ acc
D(X × Y)
mn[K]
◦ / N[K](X × Y) o
The three subdiagrams, from top to bottom, commute by Theorem 3.3.1 (1),
by Lemma 2.4.8 (2), and by Theorem 3.3.1 (2).
146
3.3. The multinomial channel 147
Actually computing mn[K](ω ⊗ ρ)(χ) is very fast, but computing the equal
expression:
mzip = mn[K](ω) ⊗ mn[K](ρ) (χ)
is much much slower. The reason is that one has to sum over all pairs (ϕ, χ)
that mzip to χ.
We move on to a next fundamental fact, namely that frequentist learning
after a multinomial is the identity, in the theorem below. We first need an aux-
iliary result.
Lemma 3.3.4. Fix a distribution ω ∈ D(X) and a number K. For each y ∈ X,
X
mn[K](ω)(ϕ) · ϕ(y) = K · ω(y).
ϕ∈N[K](X)
Proof. The equation holds for K = 0, since then ϕ(y) = 0. Hence we may
assume K > 0. Then:
X
mn[K](ω)(ϕ) · ϕ(y)
ϕ∈N[K](X)
X K! Y
= ϕ(y) · Q · ω(x)ϕ(x)
x ϕ(x)!
x
ϕ∈N[K](X), ϕ(y),0
X K · (K −1)! Y
= · ω(y) · ω(y)ϕ(y)−1 · ω(x)ϕ(x)
ϕ(x)!
Q
ϕ∈N[K](X), ϕ(y),0
(ϕ(y)−1)! · x,y x,y
X (K − 1)! Y
= K · ω(y) · · ω(x)ϕ(x)
ϕ(x)!
Q
x
ϕ∈N[K−1](X) x
X
= K · ω(y) · mn[K −1](ω)(ϕ)
ϕ∈N[K−1](X)
= K · ω(y).
Theorem 3.3.5. Frequentist learning from a multinomial gives the original
distribution:
Flrn = mn[K](ω) = ω. (3.10)
This means that the following diagram of channels commutes.
D(X)
mn[K]
◦ / N[K](X)
◦
Flrn
◦
id /X
147
148 Chapter 3. Drawing from an urn
X
Flrn ◦· mn[K] (ω)(y) =
mn[K](ω)(ϕ) · Flrn(ϕ)(y)
ϕ∈N[K](X)
X ϕ(y)
= mn[K](ω)(ϕ) ·
ϕ∈N[K](X)
kϕk
1
= kϕk · ω(y) ·
kϕk
= ω(y).
Theorem 3.3.6. Multinomial channels compose, with a bit of help of the (fixed-
size) flatten operation for multisets (1.33), as in:
mn[K]
/ D M[K](X) mn[L]
/ D M[L] M[K](X)
D(X)
D(flat)
mn[L·K] . D M[L·K](X)
148
3.3. The multinomial channel 149
Proof. Since:
X X X
mn[K](ω)(ϕ) · ϕ =
mn[K](ω)(ϕ) · ϕ(x) x
ϕ∈N[K](X) x∈X ϕ∈N[K](X)
X
= K · ω(x) x by Lemma 3.3.4
x∈X X
= K· ω(x) x
x∈X
= K · ω.
149
150 Chapter 3. Drawing from an urn
N[K](X) o DD
◦ N[K; +1](X)
a
◦ ◦
mn[K] mn[K+1]
D(X)
Proof. For ω ∈ D(X) and ϕ ∈ N[K](X) we have:
DD ◦· mn[K +1] (ω)(ϕ)
X
= mn[K +1](ω)(ψ) · DD[K](ψ)(ϕ)
ψ∈N[K+1](X)
X ϕ(x) + 1
= mn[K +1](ω)(ϕ + 1| x i) ·
x∈X
K+1
X (K + 1)! Y ϕ(x) + 1
= · ω(y)(ϕ+1| x i)(y) ·
y (ϕ + 1| x i)(y)! K+1
Q
y
x∈X
X K! Y
ϕ(y)
= · ω(y) · ω(x)
y ϕ(y)!
Q
y
x∈X Y
= (ϕ) · ω(y)ϕ(y) · x ω(x)
P
y
= mn[K](ω)(ϕ).
The above triangles exist for each K. This means that the collection of chan-
nels mn[K], indexed by K ∈ N, forms a cone for the infinite chain of draw-
and-delete channels. This situation is further investigated in [85] in relation to
de Finetti’s theorem [37], which is reformulated there in terms of multinomial
channels forming a limit cone.
The multinomial channels do not commute with draw-add channels DA. For
instance, for ω = 13 | ai + 32 | bi one has:
mn[2](ω) = 19 2| a i + 49 1|a i + 1| bi + 49 2| bi
8
mn[3](ω) = 27 1
3| a i + 92 2| a i + 1|b i + 49 1| ai + 2| bi + 27 3| bi .
But:
DA = mn[2](ω) = 19 3|a i + 29 2| ai + 1| bi + 29 1| ai + 2| b i + 49 3| b i .
There is one more point that we like to address. Since mn[K](ω) is a distri-
P
bution, the sum over all draws ϕ mn[K](ω)(ϕ) equals one. But what if we re-
strict this sum to draws ϕ of certain colours only, that is, with supp(ϕ) ⊆ S , for
a proper subset S ⊆ supp(ω)? And what if we then let the size of these draws
150
3.3. The multinomial channel 151
K go to infinity? The result below describes what happens. It turns out that
the same behaviour exists in the hypergeometric and Pólya cases, see Proposi-
tions 3.4.9.
ω(x) in:
P
Proof. Write r B x∈S
(1.27)
X X Y
MK = mn[K](ω)(ϕ) = (ϕ) · ω(x)ϕ(x) = r K .
ϕ∈M[K](S ) ϕ∈M[K](S ) x∈S
Since 0 < r < 1 we get MK = r K > r K+1 = MK+1 and lim MK = lim r K = 0.
K→∞ K→∞
Exercises
3.3.1 Let’s throw a fair dice 12 times. What is the probability that each
12!
number appears twice? Show that it is 72 6.
ω = 14 | a i + 21 | b i + 14 | c i ψ = 2| 0 i + 1| 1 i.
1 Show that:
mn[3] D( f )(ω) (ψ) = 64 .
27
151
152 Chapter 3. Drawing from an urn
2 Check that:
D(N[3]( f )) mn[3](ω) (ψ) = mn[3](ω)(2|a i + 1| ci)
+ mn[3](ω)(1| ai + 1| b i + 1| c i)
+ mn[3](ω)(2| bi + 1| c i)
27
yields the same outcome 64 .
3.3.3 Check that:
mn[2] 12 | ai + 12 | bi = 14 2|a i + 21 1| ai + 1| bi + 14 2| b i .
Conclude that the multinomial map does not preserve uniform distri-
butions.
3.3.4 Check that: X
mn[1](ω) = ω(x) 1| x i .
x∈supp(ω)
3.3.5 Use Exercise 1.6.3 to show that for ϕ ∈ N[K](X) and ψ ∈ N[L](X)
one has:
K+L
K
mn[K +L](ω)(ϕ + ψ) = ϕ+ψ · mn[K](ω)(ϕ) · mn[L](ω)(ψ),
ϕ
152
3.3. The multinomial channel 153
D(X) ×O D(X)
mn[K]⊗mn[L]
◦ / N[K](X) × N[L](X)
∆ ◦+
D(X)
mn[K+L]
◦ / N[K +L](X)
Notice that this is essentially Theorem 3.3.1 (2), via the isomor-
phism M[1](X) X.
3.3.9 Show that for a natural multiset ϕ of size K one has:
flat mn[K] Flrn(ϕ) = ϕ.
3.3.10 Check that multinomial channels do not commute with tensors, as in:
D(X) × D(Y)
mn[K]⊗mn[L]
◦ / M[K](X) × M[L](Y)
⊗ , ◦⊗
D(X × Y) ◦ / M[K ·L](X × Y)
mn[K·L]
153
154 Chapter 3. Drawing from an urn
3.3.12 Check that the multiset distribution mltdst[K]X from Exercise 2.1.7
equals:
mltdst[K]X = mn[K](unif X ).
3.3.13 1 Check that accumulation does not commute with draw-and-delete,
as in:
acc / N[K +1](X)
X K+1 ◦
π , ◦ DD
XK ◦ / N[K](X)
acc
ppr ◦ ◦ DD
acc / N[K](X)
XK ◦
X K+1
acc / / N[K +1](X)
ppr *
D(X K ) DD
D(acc)
+
D N[K](X)
4 Show that probabilistic projection also commutes with arrange-
ment:
X K+1 o
arr
◦ N[K +1](X)
ppr ◦ ◦ DD
XK o
arr
◦ N[K](X)
154
3.3. The multinomial channel 155
3.3.14 Show that the probabilistic projection channel ppr from the previous
exercise makes the following diagram commute.
N
D(X) K+1 ◦ / X K+1
ppr ◦ ◦ ppr
N
D(X)K ◦ / XK
in a situation:
arr⊗arr /
N[K +1](X) × N[K +1](Y) ◦ X K+1 × Y K+1
zip ◦
ppr
◦ ,
(X × Y)K+1 2 (X × Y)
K
◦
π
3.3.16 Recall the set D∞ (X) of discrete distributions from (2.8), with infinite
(countable) support, see Exercise 2.1.11. For ρ ∈ D∞ (N>0 ) and K ∈ N
define:
X Y
ρ(i)ϕ(i) ϕ .
mn [K](ρ) B (ϕ) ·
∞
ϕ∈N[K](N>0 ) i∈supp(ϕ)
1 Show that this yields a distribution in D∞ N[K](N>0 ) , i.e. that
X
mn [K](ρ)(ϕ) = 1.
∞
ϕ∈N[K](N>0 )
155
156 Chapter 3. Drawing from an urn
We first repeat from Example 2.1.1 (3) the definition of the hypergeometric
channel. For a set X and a natural numbers K ≤ L we have, for multiset/urn
ψ ∈ N[L](X),
X ψϕ X x ψ(x)
Q
ϕ(x)
hg[K] ψ B ϕ = ϕ .
L L
(3.13)
ϕ≤K ψ K ϕ≤K ψ K
Recall that we write ϕ ≤K ψ for: kϕk = K and ϕ ≤ ψ, see Definition 1.5.1 (2).
Lemma 1.6.2 shows that the probabilities in the above definition add up to one.
The Pólyachannel
resembles the above hypergeometric one, instead
that
multichoose is used instead of ordinary binomial coefficients .
ψ
ϕ
X
pl[K] ψ B L ϕ
ϕ∈N[K](supp(ψ)) K
Q ψ(x) (3.14)
x∈supp(ψ) ϕ(x)
X
= L ϕ .
ϕ∈N[K](supp(ψ)) K
156
3.4. The hypergeometric and Pólya channels 157
N[K +L](X)
hg[K]
◦ / N[K](X)
@
DD ◦ & ◦
DD
N[K +L−1](X) N[K: +1](X)
◦
DD ◦· ··· ◦· DD
| {z }
L−2 times
To emphasise, this result says that the draws can be obtained via iterated
draw-and-delete. The full picture emerges in Theorem 3.5.10 where we include
the urn.
Proof. Write ψ ∈ N[K + L](X) for the urn. The proof proceeds by induction
on the number of iterations L, starting with L = 0. Then ϕ ≤K ψ means ϕ = ψ.
Hence:
ψ ψ
X ϕ ψ
hg[K] ψ = K+0 ϕ = K ψ = 1 ψ = DD 0 (ψ) = DD L (ψ).
ϕ≤K ψ K K
For the induction step we use ψ ∈ N[K +(L+1)](X) = N[(K +1)+ L](X) and
157
158 Chapter 3. Drawing from an urn
ϕ ≤K ψ. Then:
DD L+1 (ψ)(ϕ) = DD = DD L (ψ) (ϕ)
X
= DD L (ψ)(χ) · DD(χ)(ϕ)
χ∈N[K+1](X)
X ϕ(y) + 1
= DD L (ψ)(ϕ + 1| y i) ·
y∈X
K+1
ψ
(IH)
X ϕ+1| y i ϕ(y) + 1
= K+L+1 ·
y, ϕ(y)<ψ(y)
K+1
K+1
(ψ(y) − ϕ(y)) · ψϕ
(*)
X
= K+L+1 by Exercise 1.7.6 (1),(2)
y, ϕ(y)<ψ(y) (L + 1) ·
K
((K + L + 1) − K) · ψϕ
=
(L + 1) · K+L+1K
ψ
ϕ
= K+L+1
K
= hg[K](ψ)(ϕ).
From this result we can deduce many additional facts about hypergeometric
distributions.
N[N](X)
hg[K]
◦ / N[K](X)
◦ ◦
Flrn -Xq Flrn
N[L](X) o
DD
◦ N[L+1](X)
◦ ◦
hg[K] * t hg[K]
N[K](X)
4 Also:
N[K](X) o DD
◦ N[K= +1](X)
_
◦ ◦
hg[K] hg[K+1]
N[L](X)
158
3.4. The hypergeometric and Pólya channels 159
N[K +L](X)
hg[K]
◦ / N[K](X)
` @
◦ ◦
mn[K+L] D(X) mn[K]
The distribution of small hypergeometric draws from a large urn looks like
a multinomial distribution. This is intuitively clear, but will be made precise
below.
Remark 3.4.3. When the urn from which we draw in hypergeometric mode is
very large and the draw inolves only a small number of balls, the withdrawals
do not really affect the urn. Hence in this case the hypergeometric distribution
behaves like a multinomial distribution, where the urn (as distribution) is ob-
tained via frequentist learning. This is elaborated below, where the urn ψ is
159
160 Chapter 3. Drawing from an urn
And also:
X
pl[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y). (3.16)
ϕ∈M[K](supp(ψ))
(∗)
Proof. We use Exercises 1.7.5 and 1.7.6 in the marked equation = below. We
start with equation (3.15), for which we assume K ≤ L.
X ψϕ · ϕ(y)
X
hg[K](ψ)(ϕ) · ϕ(y) = L
ϕ≤K ψ ϕ≤K ψ K
ψ−1| y i
(∗)
X ψ(y) · ϕ−1| y i
= L−1
L
ϕ≤K ψ K · K−1
ψ−1| y i
ψ(y) X ϕ
= K· ·
L ϕ≤ ψ−1| y i L−1
K−1 K−1
= K · Flrn(ψ)(y).
160
3.4. The hypergeometric and Pólya channels 161
N∗ (X)
pl[K]
◦ / N[K](X)
◦ ◦
Flrn (Xu Flrn
N[L](X)
DA
◦ / N[L+1](X)
◦ ◦
pl[K] * t pl[K]
N[K](X)
This second point is the Pólya analogue of Theorem 3.4.2 (3), with draw-
and-add instead of draw-and-delete.
(∗)
2 We use Exercises 1.7.5 and 1.7.6 in the marked equation = below. For urn
161
162 Chapter 3. Drawing from an urn
N[K](X) o DD
◦ N[K? +1](X) N[K](X) o hg[K]
◦ N[L](X)
] ] A
◦ ◦ ◦ ◦
pl[K] pl[K+1] pl[K] pl[L]
N∗ (X) N∗ (X)
162
3.4. The hypergeometric and Pólya channels 163
Later on we shall see that the Pólya channel factors through the multinomial
channel, see Exercise 6.4.5 and ??. The Pólya channel does not commute with
multizip.
Earlier in Proposition 3.3.7 we have seen an ‘average’ result for multinomi-
als. There are similar results for hypergeometric and Pólya distributions.
1 For L ≥ K ≥ 1,
X K
flat hg[K](ψ) = hg[K](ψ)(ϕ) · ϕ = · ψ = K · Flrn(ψ).
ϕ≤ ψ
L
K
2 Similarly, for K ≥ 1,
X K
flat pl[K](ψ) = pl[K](ψ)(ϕ) · ϕ = · ψ = K · Flrn(ψ).
ϕ∈N[K](supp(ψ))
L
Proof. In both cases we rely on Lemma 3.4.4. For the first item we get:
X X X
hg[K](ψ)(ϕ) · ϕ =
hg[K](ψ)(ϕ) · ϕ(x) x
ϕ≤K ψ x∈X ϕ≤K ψ
(3.15)
X
= K · Flrn(ψ)(x) x
x∈X
= K · Flrn(ψ).
163
164 Chapter 3. Drawing from an urn
Then: lim an = 0.
n→∞
Proof. We switch to the (natural) logarithm ln and prove the equivalent state-
ment lim ln(an ) = −∞. We use that the logarithm turns products into sums,
n→∞
see Exercise 1.2.2, and that the derivative of ln(x) is 1x . Then:
X N +i X
ln an = = ln N + i − ln M + i
ln
i<n
M+i i<n
X Z M+1 1
= − dx
i<n N+1 x
(∗) X (M + i) − (N + i)
≤ −
i<n
M+i
X 1
= (N − M) ·
i<n
M+i
X 1
= (N − M) · .
i<n
M+i
It is well-known that the harmonic series n>0 n1 is infinite. Since M > N the
P
above sequence ln(an ) thus goes to −∞.
The validity of the marked inequality ≤ follows from an inspection of the
graph of the function 1x : the integral from N + i to M + i is the surface under 1x
between the points N + i < M + i. Since 1x is a decreasing function, this surface
is bigger than the rectangle with height M+i1
and length (M + i) − (N + i).
1 Write for K ≤ L,
X
HK B hg[K](ψ)(ϕ).
ϕ∈N[K](S ), ϕ≤ψ
164
3.4. The hypergeometric and Pólya channels 165
We define:
PK+1 (L−1)! (LS +(K +1)−1)! (LS −1)! (L+K −1)!
aK B = · · ·
PK (LS −1)! (L+(K +1)−1)! (L−1)! (LS +K −1)!
LS +K
= < 1, since LS < L.
L+K
Thus PK+1 = aK · PK < PK and also:
PK = aK−1 · PK−1 = aK−1 · aK−2 · PK−2 = · · · = aK−1 · aK−2 · . . . · a0 · P0 .
Our goal is to prove lim PK = 0. This follows from lim i<K ai = 0, which
Q
K→∞ K→∞
we obtain from Lemma 3.4.8.
Exercises
3.4.1 Consider an urn with 5 red, 6 green, and 8 blue balls. Suppose 6 balls
are drawn from the urn, resulting in two balls of each colour.
1 Describe both the urn and the draw as a multiset.
Show that the probability of the six-ball draw is:
5184000
2 47045881≈ 0.11, when balls are drawn one by one and are replaced
before each next single-ball draw;
165
166 Chapter 3. Drawing from an urn
50
3 323 ≈ 0.15, when the drawn balls are deleted;
405
4 4807 ≈ 0.08 when the drawn balls are replaced and each time an
extra ball of the same colour is added.
3.4.2 Draw-delete preserves hypergeometric and Pólya distributions, see
Corollary 3.4.2 (4) and Theorem 3.4.6. Check that draw-add does
not preserve hypergeometric or Pólya distributions, for instance by
checking that for ϕ = 3| ai + 1| b i,
hg[2](ϕ) = 12 2| ai + 12 1| ai + 1| bi
hg[3](ϕ) = 41 3| ai + 34 2| ai + 1| bi
DA = hg[2](ϕ) = 12 3| ai + 14 2| ai + 1| bi + 14 1| a i + 2| b i ,
And:
pl[2](ϕ)
3
= 35 2| a i + 10 1| a i + 1| b i + 1
10 2| b i
pl[3](ϕ)
3
= 12 3| a i + 10 2| a i + 1| b i + 3
+ 2|b i + 1
20 1| a i 20 3| bi
DA = pl[2](ϕ)
3 1
= 53 3| a i + 20 2| a i + 1| b i + + 2|b i + 10
3
20 1| a i
3| bi
Let X be a finite set, say of size n. Write 1 = x∈X 1| x i for the multiset
P
3.4.3
of single occurrences of elements of X. Show that for each K ≥ 0,
pl[K](1) = unifN[K](X) .
3.4.7 Prove in analogy with Exercise 3.3.7, for an urn ψ ∈ N[L](X) and
elements y, z ∈ X, the following points.
166
3.4. The hypergeometric and Pólya channels 167
1 When y , z and K ≤ L,
X
hg[K](ψ)(ϕ) · ϕ(y) · ϕ(z)
ϕ∈N[K](X)
ψ(z)
= K · (K −1) · Flrn(ψ)(y) · .
L−1
2 When K ≤ L,
X
hg[K](ψ)(ϕ) · ϕ(y)2
ϕ∈N[K](X)
(K −1) · ψ(y) + (L−K)
= K · Flrn(ψ)(y) · .
L−1
3 And in the Pólya case, when y , z,
X
pl[K](ψ)(ϕ) · ϕ(y) · ϕ(z)
ϕ∈N[K](X)
ψ(z)
= K · (K −1) · Flrn(ψ)(y) · .
L+1
4 Finally,
X
pl[K](ψ)(ϕ) · ϕ(y)2
ϕ∈N[K](X)
(K −1) · ψ(y) + (L+K)
= K · Flrn(ψ)(y) · .
L+1
3.4.8 Use Theorem 3.4.1 to prove the following two recurrence relations
for hypgeometric distributions.
X
hg[K](ψ) = Flrn(ψ)(x) · hg[K −1] ψ−1| x i
x∈supp(ψ)
X ϕ(x) + 1
hg[K](ψ)(ϕ) = · hg[K −1](ψ) ϕ+1| x i
x
K+1
3.4.9 Fix numbers N, M ∈ N and write ψ = N| 0 i + M| 1 i for an urn with N
balls of colour 0 and M of colour 1. Let n ≤ N. Show that:
X N+M+1
hg[n+m](ψ) n| 0 i + m| 1 i = .
0≤m≤M
N+1
Hint: Use Exercise 1.7.3. Notice that the right-hand side does not
depend on n.
3.4.10 This exercise elaborates that draws from an urn excluding one partic-
ular colour can be expressed in binary form. This works for all three
modes of drawing: multinomial, hypergeometric, and Pólya.
Let X be a set with at least two elements, and let x ∈ X be an arbi-
trary but fixed element. We write x⊥ for an element not in X. Assume
k ≤ K.
167
168 Chapter 3. Drawing from an urn
/ multiple-colours, Urn 0
Urn
168
3.5. Iterated drawing from an urn 169
0 D(X)
Ot 0
◦ / L(X) × D(X) D(X)
Ut 0
◦ / N(X) × D(X)
(3.17)
-1 N∗ (X)
Ot −
◦ / L(X) × N∗ (X) N∗ (X)
Ut −
◦ / N(X) × N∗ (X)
+1 N(X)
Ot +
◦ / L(X) × N(X) N∗ (X)
Ut +
◦ / N(X) × N∗ (X)
Notice that the draw maps in Table 3.3 in the introduction are written as chan-
nels O0 : D(X) → N[K](X), where the above table contains the corresponding
transition maps, with an extra ‘t’ in the name, as in Ot 0 : D(X) → N(X) ×
D(X).
In each case, the drawn elements are accumulated in the left product-com-
ponent of the codomain of the transition map. They are organised as lists, in
L(X), in the ordered case, and as multisets, in N(X) in the unordered case.
As we have seen before, in the scenarios with deletion/addition the urn is a
multiset, but with replacement it is a distribution.
The crucial observation is that the list and multiset data types that we use for
accumulating drawn elements are monoids, as we have observed early on, in
Lemmas 1.2.2 and 1.4.2. In general, for a monoid M, the mapping X 7→ M × X
is a monad, called the writer monad, see Lemma 1.9.1. It turns out that the
combination of the writer monad with the distribution monad D is again a
monad. This forms the basis for iterating transitions, via Kleisli composition of
this combined monad. We relegate the details of the monad’s flatten operation
to Exercise 3.5.6 below, and concentrate on the unit and Kleisli composition
involved.
Notice the
occurence of the sum + of the monoid M in the first component
of the ket −, − in (3.18). When M is the list monoid, this sum is the (non-
commutative) concatenation ++ of lists, producing an ordered list of drawn
elements. When M is the multiset monoid, this sum is the (commutative) +
169
170 Chapter 3. Drawing from an urn
D(X)
Ot 0
/ D L(X) × D(X) N(X)
Ut 0
/ D N(X) × N(X)
ω
/
X
ω(x) [x], ω
ω
/
X
ω(x) 1| x i, ω
x∈supp(ω) x∈supp(ω)
N∗ (X)
Ot −
/ D L(X) × N(X) N∗ (X)
Ut −
/ D N(X) × N(X)
X ψ(x) ψ(x)
/ /
X
ψ [x], ψ − 1| x i ψ 1| x i, ψ − 1| x i
x∈supp(ψ)
kψk x∈supp(ψ)
kψk
N∗ (X)
Ot +
/ D L(X) × N∗ (X) N∗ (X)
Ut +
/ D N(X) × N∗ (X)
X ψ(x) X ψ(x)
ψ / [x], ψ + 1| x i ψ / 1| x i, ψ + 1| x i
x∈supp(ψ)
kψk x∈supp(ψ)
kψk
Figure 3.1 Definitions of the six transition channels in Table (3.17) for draw-
ing a single element from an urn. In the “ordered” column on the left the list
monoid L(X) is used, where in the “unordered” columnn the (commutative) mul-
tiset monoid N(X) occurs. In the first row the urn (as distribution ω) remains
unchanged, whereas in the second and (resp. third) row the drawn elements x is
removed (resp. added) to the urn ψ. Implicitly it is assumed that the multiset ψ is
non-empty.
Now that we know how to iterate, we need the actual transition maps that can
be iterated, that is, we need concrete definitions of the transition channels in
Table (3.17). They are given in Figure 3.1. In the subsections below we analyse
what iteration means for these six channels. Subsequently, we can describe the
associated K-sized draw channels, as in Table 3.3, as first projection π1 ◦· t K ,
going from urns to drawn elements.
170
3.5. Iterated drawing from an urn 171
D(X) with replacement. By definition we have as first iteration.
X
Ot 10 (ω) = Ot 0 (ω) = ω(x1 ) [x1 ], ω .
x1 ∈supp(ω)
Accumulation of drawn elements in the first coordinate of −, − starts in the
second iteration:
Ot 20 (ω) = Ot 0 =X
Ot 0 (ω)
= ω(x1 ) · Ot 0 (ω)(`, ω) [x1 ] ++ `, ω
`∈L(X), x1 ∈supp(ω)
X
= ω(x1 ) · ω(x2 ) [x1 ] ++ [x2 ], ω
x1 ,x2 ∈supp(ω)
X
= (ω ⊗ ω)(x1 , x2 ) [x1 , x2 ], ω .
x1 ,x2 ∈supp(ω)
Ot 2− (ψ) = Ot − = Ot − (ψ)
X ψ(x1 ) (ψ−1| x1 i)(x2 )
= x1 , x2 , ψ−1| x1 i−1| x2 i .
·
kψk kψk − 1
x1 ∈supp(ψ),
x2 ∈supp(ψ−1| x1 i
171
172 Chapter 3. Drawing from an urn
This independence means that any order of the elements of the same multiset
of balls gets the same (draw) probability. This is not entirely trivial.
does not depend on the order of the elements in ~x: each element y j occurs n j
times in this product, with multiplicities ψ(y j ), . . . , ψ(y j ) − n j + 1, indepen-
dently of the exact occurrences of the y j in ~x. Thus:
Y Y
ψ − acc(x1 , . . . , xi ) (xi+1 ) = ψ(y j ) · . . . · (ψ(y j ) − n j + 1)
j
0≤i<K Y
= ψ(y j ) · . . . · (ψ(y j ) − ϕ(y j ) + 1)
j
Y ψ(y j )!
=
j (ψ(y j ) − ϕ(y j ))!
Y ψ(y)!
= .
y∈X
(ψ(y) − ϕ(y))!
We can extend the product over j to a product over all y ∈ X since if y <
supp(ϕ), then, even if ψ(y) = 0,
ψ(y)! ψ(y)!
= = 1.
(ψ(y) − ϕ(y))! ψ(y)!
Theorem 3.5.4. Consider the ordered-transition-with-deletion channel Ot −
on ψ ∈ N[L](X).
172
3.5. Iterated drawing from an urn 173
1 For K ≤ L,
X X ( ψ − ϕ )
Ot −K (ψ) = ~x, ψ − ϕ .
ϕ≤K ψ ~x∈acc −1 (ϕ)
( ψ)
we get:
X X (L−K)! Y ψ(y)!
Ot −K (ψ) = ~x, ψ−ϕ
·
L! y (ψ(y)−ϕ(y))!
ϕ≤K ψ ~x∈acc −1 (ϕ)
y ψ(y)!
Q
X X (L−K)!
= ~x, ψ−ϕ
Q ·
ϕ≤K ψ ~x∈acc −1 (ϕ) y (ψ(y)−ϕ(y))! L!
X X (L−K)! ψ
= · ~x, ψ−ϕ
ϕ≤K ψ ~x∈acc −1 (ϕ)
(ψ−ϕ) L!
X X ( ψ −ϕ )
= ~x, ψ−ϕ .
ϕ≤ ψ
( ψ)
K ~x∈acc (ϕ)
−1
We now move from deletion to addition, that is, from Ot − to Ot + , still in the
ordered case. The analysis is very much as in Lemma 3.5.3.
173
174 Chapter 3. Drawing from an urn
Proof. The first item is easy, so we concentrate on the second, in line with the
proof of Lemma 3.5.3. If ϕ = acc(~x) = j n j | y j i we now have:
P
Y Y
ψ + acc(x1 , . . . , xi ) (xi+1 ) = ψ(y j ) · . . . · (ψ(y j ) + ϕ(y j ) − 1)
j
0≤i<K
Y (ψ(y j ) + ϕ(y j ) − 1)!
=
j (ψ(y j ) − 1)!
Y (ψ(y) + ϕ(y) − 1)!
= .
y∈supp(ψ)
(ψ(y) − 1)!
ψ
X X 1 ϕ
Ot +K (ψ) = · kψk ~x, ψ + ϕ .
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ)
(ϕ)
K
ψ
X X 1 ϕ
O+ [K](ψ) = · kψk ~x .
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ)
(ϕ)
K
ψ kψk
The multichoose coefficients ϕ sum to K for all ϕ with kϕk = K, see
Proposition 1.7.3.
Y (L + K − 1)!
(L + i) = L · (L + 1) · . . . · (L + K − 1) = .
0≤i<K
(L − 1)!
174
3.5. Iterated drawing from an urn 175
Ot +K (ψ)
X X (kψk−1)! Y (ψ(y)+ϕ(y)−1)!
= · ~x, ψ+ϕ
ϕ∈N[K](supp(ψ))
(kψk+K −1)! (ψ(y) − 1)!
~x∈acc −1 (ϕ) y∈supp(ψ)
Q (ψ(y)+ϕ(y)−1)!
X X y∈supp(ψ) (ψ(y)−1)!
= ~x, ψ+ϕ
(kψk+K−1)!
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ) (kψk−1)!
(ψ(y)+ϕ(y)−1)!
ϕ(y)!
Q Q
X X y y∈supp(ψ) ϕ(y)!·(ψ(y)−1)!
= ~x, ψ+ϕ
· (kψk+K−1)!
ϕ∈N[K](supp(ψ))
K!
~x∈acc −1 (ϕ) K!·(kψk−1)!
ψ
X X 1 ϕ
= · kψk ~x, ψ+ϕ .
ϕ∈N[K](supp(ψ)) ~x∈acc (ϕ)
−1
( ϕ )
K
2 For ψ ∈ N[L+K](X),
Q ψ(x)
X ψϕ
X x ϕ(x)
Ut −K (ψ) = L+K ϕ, ψ − ϕ = kψk ϕ, ψ − ϕ .
ϕ≤K ψ K ϕ≤K ψ K
3 For ψ ∈ N∗ (X),
Q ψ(x)
ψ
ϕ(x)
x∈supp(ψ) ϕ
X X
Ut +K (ψ) = ϕ, ψ + ϕ = kψk ϕ, ψ + ϕ .
kψk
ϕ∈N[K](X) K ϕ∈N[K](X) K
175
176 Chapter 3. Drawing from an urn
2 For K = 0 both sides are equal 1 0, ψ . Next, for a multiset ψ ∈ N[L+ K +
1](X) we have:
(3.18)
X ψ(y)
= Ut −K (ψ − 1| y i)(ϕ, χ) · ϕ + 1| y i, χ
L+K +1
y∈supp(ψ), χ∈N[L](X),
ϕ≤K ψ−1| y i
ψ−1| y i
(IH)
X ϕ ψ(y)
= ϕ + 1| yi, ψ − 1| y i − ϕ
L+K ·
L+K +1
y∈supp(ψ), K
ϕ≤K ψ−1| y i
ψ
X ϕ(y) + 1 ϕ+1| y i
= · L+K+1 ϕ + 1| yi, ψ − (ϕ + 1| y i)
K +1
y∈supp(ψ), K+1
ϕ≤K ψ−1| y i
by Exercise 1.7.6
ψ
X χ(y) χ
= · L+K+1 χ, ψ − χ
χ≤K+1 ψ
K +1
K+1
ψ
χ
X
= L+K+1 χ, ψ − χ .
χ≤K+1 ψ K+1
3 The case K = 0 so we immediately look at, the induction step, for urn ψ
176
3.5. Iterated drawing from an urn 177
(3.18)
X ψ(y)
= Ut +K (ψ + 1| y i)(ϕ, χ) · ϕ + 1| y i, χ
y∈supp(ψ), χ
L
ψ+1| y i
(IH)
X ϕ ψ(y)
= ϕ + 1| y i, ψ + 1| y i + ϕ
L+1 ·
y∈supp(ψ), ϕ∈N[K](X)
L
K
ψ
X ϕ(y) + 1 ϕ+1| y i
= · L ϕ + 1| y i, ψ + (ϕ + 1| yi)
y∈supp(ψ), ϕ∈N[K](X)
K +1
K+1
by Exercise 1.7.6
ψ
X χ(y) χ
= · L χ, ψ + χ
K +1
y, χ∈N[K+1](X) K+1
ψ
χ
X
= L χ, ψ + χ .
χ∈N[K+1](X) K+1
mn[K] = π1 ◦· Ut 0K C U0 [K].
hg[K] = π1 ◦· Ut −K C U− [K].
Together, Theorems 3.5.2, 3.5.4, 3.5.6 and 3.5.8 give precise descriptions of
the six channels, for ordrered / unordered drawing with replacement / deletion
/ addition, as originally introduced in Table (3.3), in the introduction to this
chapter. What remains to do is show that the diagrams (3.4) mentioned there
commute — which relate the various forms of drawing via accumulation and
177
178 Chapter 3. Drawing from an urn
arrangement. We repeat them below for convenience, but now enriched with
multinomial, hypergeometric, and Pólya channels.
Theorem 3.3.1 precisely says that the diagram on the left commutes. For the
two other diagram we still have to do some work. This provides new character-
isations of unordered draws with deletion / addition, namely as hypergeometric
/ Pólya followed by arrangement.
(acc ⊗ id ) ◦· Ot −K = Ut −K (acc ⊗ id ) ◦· Ot +K = Ut +K .
2 Permuting ordered draws with deletion / addition has no effect: the permu-
tation channel perm = arr ◦· acc : X K → X K satisfies:
(perm ⊗ id ) ◦· Ot −K = Ot −K (perm ⊗ id ) ◦· Ot +K = Ot +K .
= Ut −K (ψ).
178
3.5. Iterated drawing from an urn 179
= Ut +K (ψ).
acc ◦· O− = acc ◦· π1 ◦· Ot −K
= π1 ◦· (acc ⊗ id ) ◦· Ot −K
= π1 ◦· Ut −K as shown in the first item
= hg[K] by Theorem 3.5.8 (2).
But then:
arr ◦· hg[K] = arr ◦· acc ◦· π1 ◦· Ut −K as just shown
= perm ◦· π1 ◦· Ot −K
= π1 ◦· (perm ⊗ id ) ◦· Ot −K
= π1 ◦· Ot −K see item (2)
= O− .
We conclude with one more result that clarifies the different roles that the
draw-and-delete DD and draw-and-add DA channels play for hypergeometric
and Pólya distributions. So far we have looked only at the first marginal after
iteration. including also the second marginal gives a fuller picture.
Theorem 3.5.10. 1 In the hypergeometric case, both the first and the second
marginal after iterating Ut − can be expressed in terms of the draw-and-
delete map, as in:
N[L+K](X)
hg[K] = DD L DD K
◦ ◦ Ut −K ◦
z π1 π2 $
N[K](X) o ◦ N[K](X) × N[L](X) ◦ / N[L](X)
179
180 Chapter 3. Drawing from an urn
2 In the Pólya case, (only) the second marginal is can be expressed via the
draw-and-add map:
N[L](X)
pl[K] DA K
◦ ◦
◦ Ut +K
y π1 π2 %
N[K](X) o ◦ N[K](X) × N[L+K](X) ◦ / N[L+K](X)
Proof. 1 The triangle on the left commutes by Theorem 3.5.8 (2) and Theo-
rem 3.4.1. The one on the right follows from the equation:
X ψϕ
DD K (ψ) = ψ − ϕ .
kψk
(3.20)
ϕ≤K ψ K
Exercises
3.5.1 Consider the multiset/urn ψ = 4| a i + 2|b i of size 6 over {a, b}. Com-
pute O− [3](ψ) ∈ D {a, b}3 , both by hand and via the O− -formula in
Theorem 3.5.4 (2)
3.5.2 Check that:
pl[2] 1| ai + 2| bi + 1| c i
3
= 10
1
2|a i + 15 1| ai + 1| bi + 10
2| bi
1
+ 101
1| ai + 1| c i + 15 1| b i + 1| c i + 10 2| c i .