
Structured Probabilistic Reasoning

(Incomplete draft)

Bart Jacobs
Institute for Computing and Information Sciences, Radboud University Nijmegen,
P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.
[email protected] http://www.cs.ru.nl/~bart

Version of: November 30, 2021


Contents

Preface page v
1 Collections 1
1.1 Cartesian products 2
1.2 Lists 6
1.3 Subsets 16
1.4 Multisets 25
1.5 Multisets in summations 35
1.6 Binomial coefficients of multisets 41
1.7 Multichoose coefficients 48
1.8 Channels 56
1.9 The role of category theory 61
2 Discrete probability distributions 69
2.1 Probability distributions 70
2.2 Probabilistic channels 81
2.3 Frequentist learning: from multisets to distributions 92
2.4 Parallel products 98
2.5 Projecting and copying 109
2.6 A Bayesian network example 116
2.7 Divergence between distributions 122
3 Drawing from an urn 127
3.1 Accumulation and arrangement, revisited 132
3.2 Zipping multisets 136
3.3 The multinomial channel 144
3.4 The hypergeometric and Pólya channels 156
3.5 Iterated drawing from an urn 168
3.6 The parallel multinomial law: four definitions 181
3.7 The parallel multinomial law: basic properties 189

3.8 Parallel multinomials as law of monads 200
3.9 Ewens distributions 207
4 Observables and validity 221
4.1 Validity 222
4.2 The structure of observables 234
4.3 Transformation of observables 249
4.4 Validity and drawing 256
4.5 Validity-based distances 264
5 Variance and covariance 275
5.1 Variance and shared-state covariance 276
5.2 Draw distributions and their (co)variances 284
5.3 Joint-state covariance and correlation 293
5.4 Independence for random variables 299
5.5 The law of large numbers, in weak form 304
6 Updating distributions 311
6.1 Update basics 312
6.2 Updating draw distributions 322
6.3 Forward and backward inference 328
6.4 Discretisation, and coin bias learning 346
6.5 Inference in Bayesian networks 354
6.6 Bayesian inversion: the dagger of a channel 358
6.7 Pearl’s and Jeffrey’s update rules 366
6.8 Frequentist and Bayesian discrete probability 383
7 Directed Graphical Models 390
7.1 String diagrams 392
7.2 Equations for string diagrams 399
7.3 Accessibility and joint states 404
7.4 Hidden Markov models 408
7.5 Disintegration 419
7.6 Disintegration for states 428
7.7 Disintegration and inversion in machine learning 438
7.8 Categorical aspects of Bayesian inversion 450
7.9 Factorisation of joint states 457
7.10 Inference in Bayesian networks, reconsidered 466
References 473

Preface

No victor believes in chance.


Friedrich Nietzsche,
Die fröhliche Wissenschaft, §258, 1882.
Originally in German:
Kein Sieger glaubt an den Zufall.

Probability is for losers — a defiant rephrasing of the above aphorism of the
German philosopher Friedrich Nietzsche. According to him, winners do not
reason with probabilities, but with certainties, via Boolean logic, one could say.
However, this goes against the current trend, in which reasoning with proba-
bilities has become the norm, in large scale data analytics and in artificial
intelligence (AI); it is Boolean reasoning that is now losing influence, see also
the famous End of Theory article [4] from 2008. This book is about the mathe-
matical structures underlying the reasoning, not of Nietzsche’s winners, but of
today’s apparent winners.
The phrase ‘structure in probability’ in the title of this book may sound
like a contradictio in terminis: it seems that probability is about randomness,
like in the tossing of coins, in which one may not expect to find much struc-
ture. Still, as we know since the seventeenth century, via the pioneering work
of Christiaan Huygens, Pierre Fermat, and Blaise Pascal, there is quite some
mathematical structure in the area of probability. The raison d’être of this book
is that there is more structure — especially algebraic and categorical — than
is commonly emphasised.
The scientific roots of this book’s author lie outside probability theory, in
type theory and logic (including some quantum logic), in semantics and speci-
fication of programming languages, in computer security and privacy, in state-
based computation (coalgebra), and in category theory. This scientific distance
to probability theory has advantages and disadvantages. Its obvious disadvantage
is that there is no deeply engrained familiarity with the field and with its
development. But at the same time this distance may be an advantage, since
it provides a fresh perspective, without sacred truths and without adherence to
common practices and notations. For instance, the terminology and notation in
this book are influenced by quantum theory, e.g. in using the words ‘state’, ‘ob-
servable’ and ‘test’ — as synonyms for ‘distribution’, for an R-valued function
on a sample space and for compatible (summable) predicates — in using ket
notation |−⟩ in writing discrete probability distributions, or in using daggers
as reversals, in analogy with conjugate transposes (for Hilbert spaces).
It should be said: for someone trained in formal methods, the area of prob-
ability theory can be rather sloppy: everything is called ‘P’, types are hardly
ever used, crucial ingredients (like distributions in expected values) are left
implicit, basic notions (like conjugate prior) are introduced only via examples,
calculation recipes and algorithms are regularly just given, without explana-
tion, goal or justification, etc. This hurts, especially because there is so much
beautiful mathematical structure around. For instance, the notion of a channel
(see below) formalises the idea of a conditional probability and carries a rich
mathematical structure that can be used in compositional reasoning, with both
sequential and parallel composition. The Bayesian inversion (‘dagger’) of a
channel does not only come with appealing mathematical (categorical) proper-
ties — e.g. smooth interaction with sequential and parallel composition — but
is also extremely useful in inference and learning. Via this dagger we can con-
nect forward and backward inference (see Theorem 6.6.3: backward inference
is forward inference with the dagger, and vice-versa) and capture the difference
between Pearl’s and Jeffrey’s update rules (see Theorem 6.7.4: Pearl increases
validity, whereas Jeffrey decreases divergence).
We even dare to think that this ‘sloppiness’ is ultimately a hindrance to fur-
ther development of the field, especially in computer science, where computer-
assisted reasoning requires a clear syntax and semantics. For instance, it is
hard to even express the above-mentioned theorems 6.6.3 and 6.7.4 in stan-
dard probabilistic notation. One can speculate that states/distributions are kept
implicit in traditional probability theory because in many examples they are
used as a fixed implicit assumption in the background. Indeed, in mathemati-
cal notation one tends to omit — for efficiency — the least relevant (implicit)
parameters. But the essence of probabilistic computation is state transforma-
tion, where it has become highly relevant to know explicitly in which state one
is working at which stage. The notation developed in this book helps in such
situations — and in many other situations as well, we hope.
Apart from having beautiful structure, probability theory also has magic. It
can be found in the following two points.

1 Probability distributions can be updated, making it possible to absorb information
(evidence) into them and learn from it. Multiple updates, based on
data, can be used for training, so that a distribution absorbs more and more
information and can subsequently be used for prediction or classification.
2 The components of a joint distribution, over a product space, can ‘listen’
to each other, so that updating in one (product) component has crossover
ripple effects in other components. This ripple effect looks like what happens
in quantum physics, where measuring one part of an entangled quantum
system changes other parts.

The combination of these two points is very powerful and forms the basis for
probabilistic reasoning. For instance, if we know that two phenomena are re-
lated, and we have new information about one of them, then we also learn
something new about the other phenomenon, after updating. We shall see that
such crossover ripple effects can be described in two equivalent ways, starting
from a joint distribution with evidence in one component.

• We can use the “weaken-update-marginalise” approach, where we first weaken
the evidence from one component to the whole product space, so that
it fits the joint distribution and can be used for updating; subsequently, we
marginalise the updated state to the component that we wish to learn more
about. That’s where the ripple effect through the joint distribution becomes
visible.
• We can also use the “extract-infer” technique, where we first extract a condi-
tional probability (channel) from the joint distribution and then do (forward
or backward) inference with the evidence, along the channel. This is what
happens if we reason in Bayesian networks, when information at one point
in the network is transported up and down the connections to draw conclu-
sions at another point in the network.

The equivalence of these two approaches will be demonstrated and exploited
at various places in this book.
Here is a characteristic illustration of our structure-based approach. A well
known property of the Poisson distribution pois is commonly expressed as:
if X1 ∼ pois[λ1 ] and X2 ∼ pois[λ2 ] then X1 + X2 ∼ pois[λ1 + λ2 ]. This
formulation uses random variables X1 , X2 , which are Poisson-distributed. We
shall formulate this fact as an (algebraic) structure preservation property of
the Poisson distribution, without using any random variables. The property is:
pois[λ1 +λ2 ] = pois[λ1 ] + pois[λ2 ]. It says that the Poisson channel pois is
a homomorphism of monoids, from non-negative reals R≥0 to distributions
D(N) on the natural numbers, see Proposition 2.4.6 for details. This result uses
a commutative monoid structure on distributions whose underlying space is itself
a commutative monoid. This monoid structure plays a role in many other
situations, for instance in the fundamental distributive law that turns multisets
of distributions into distributions of multisets.
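
For readers who want to see this homomorphism property concretely, here is a minimal numerical sketch — not part of the book's development — assuming Python with NumPy and SciPy are available. The monoid sum of two Poisson distributions is their convolution, and it coincides with pois[λ1 + λ2], up to truncation of the infinite support.

    import numpy as np
    from scipy.stats import poisson

    l1, l2, K = 1.3, 2.1, 60                       # rates, truncation bound
    p1 = poisson.pmf(np.arange(K), l1)             # pois[l1] on {0, ..., K-1}
    p2 = poisson.pmf(np.arange(K), l2)             # pois[l2]
    conv = np.convolve(p1, p2)[:K]                 # monoid sum of the two distributions
    direct = poisson.pmf(np.arange(K), l1 + l2)    # pois[l1 + l2]
    print(np.max(np.abs(conv - direct)))           # ~1e-16, i.e. equal up to rounding
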
The following aspects characterise the approach of this book.
1 Channels are used as a cornerstone in probabilistic reasoning. The con-
cept of (communication) channel is widely used elsewhere, under various
names, such as conditional probability, stochastic matrix, probabilistic clas-
sifier, Markov kernel, conditional probability table (in Bayesian network),
probabilistic function/computation, signal (in Bayesian persuasion theory),
and finally as Kleisli map (in category theory). Channels can be composed
sequentially and in parallel, and can transform both states and predicates.
Channels exist for all relevant collection types (lists, subsets, multisets, dis-
tributions), for instance for non-deterministic, and for probabilistic compu-
tation. However, after the first chapter about collection types, channels will
be used exclusively for distributions, in probabilistic form.
2 Multisets play a central role to capture various forms of data, like coloured
balls in an urn, draws from such an urn, tables, inputs for successive learn-
ing steps, etc. The interplay between multisets and distributions, notably in
learning, is a recurring theme.
3 States (distributions) are treated as separate from, and dual to, predicates.
These predicates are the ingredients of an (implicit) probabilistic logic, with,
for instance conjunction and negation operations. States are really different
entities, with their own operations, without, for instance conjunction and
negation. In this book, predicates standardly occur in fuzzy (soft, non-sharp)
form, taking values in the unit interval [0, 1]. Along a channel one can trans-
fer states forward, and predicates backward. A central notion is the validity
of a predicate in a state, written as |=. It is standardly called expected value.
Conditioning involves updating a state with a predicate.
It is not just mathematical aesthetics that drives the developments in this
book. Probability theory nowadays forms the basis of large parts of big data
analytics and of artificial intelligence. These areas are of increasing societal
relevance and provide the basis of the modern view of the world — more based
on correlation than on causation — and also provide the basis for much of mod-
ern decision making, that may affect the lives of billions of people in profound
ways. There are increasing demands for justification of such probabilistic rea-
soning methods and decisions, for instance in the legal setting provided by
Europe’s General Data Protection Regulation (GDPR). Its recital 71 is about
automated decision-making and talks about a right to obtain an explanation:


In any case, such processing should be subject to suitable safeguards, which should
include specific information to the data subject and the right to obtain human interven-
tion, to express his or her point of view, to obtain an explanation of the decision reached
after such assessment and to challenge the decision.

It is not acceptable that your mortgage request is denied because you drive a
blue car — in the presence of a correlation between driving blue cars and being
late on one’s mortgage payments.
These and other developments have led to a new area called Explainable
Artificial Intelligence (XAI), which strives to provide decisions with explana-
tions that can be understood easily by humans, without bias or discrimination.
Although this book will not contribute to XAI as such, it aims to provide a
mathematically solid basis for such explanations.
In this context it is appropriate to quote Judea Pearl [134] from 1989 about
a divide that is still wide today.
To those trained in traditional logics, symbolic reasoning is the standard, and non-
monotonicity a novelty. To students of probability, on the other hand, it is symbolic
reasoning that is novel, not nonmonotonicity. Dealing with new facts that cause proba-
bilities to change abruptly from very high values to very low values is a commonplace
phenomenon in almost every probabilistic exercise and, naturally, has attracted special
attention among probabilists. The new challenge for probabilists is to find ways of ab-
stracting out the numerical character of high and low probabilities, and cast them in
linguistic terms that reflect the natural process of accepting and retracting beliefs.

This book does not pretend to fill this gap. One of the big embarrassments of
the field is that there is no widely accepted symbolic logic for probability, to-
gether with proof rules and a denotational semantics. Such a logic for symbolic
reasoning about probability will be non-trivial, because it will have to be non-
monotonic1 — a property that many logicians shy away from. This book does
aim to contribute towards bridging the divide mentioned by Pearl, by provid-
ing a mathematical basis for such a symbolic probabilistic logic, consisting of
channels, states, predicates, transformations, conditioning, disintegration, etc.
From the perspective of this book, the structured categorical approach to
probability theory began with the work of Bill Lawvere (already in the 1960s)
and his student Michèle Giry. They recognised that taking probability dis-
tributions has the structure of a monad, which was published in the early
1980s in [58]. Roughly at the same time Dexter Kozen started the systematic
investigation of probabilistic programming languages and logics, published
in [106, 107]. The monad introduced back then is now called the Giry
1 Informally, a logic is non-monotonic if adding assumptions may make a conclusion less true.
For instance, I may think that scientists are civilised people, until, at some conference dinner,
a heated scientific debate ends in a fist fight.

monad G, whose restriction to finite discrete probability distributions is written
as D. Much of this book, certainly in the beginning, concentrates on this discrete
form. The language and notation that is used, however, covers both discrete
and continuous probability — and quantum probability too (inspired by the
general categorical notion of effectus, see [22, 68]).
Since the early 1980s the area of categorical probability theory remained
relatively calm. It is only in the new millennium that there is renewed attention,
sparked in particular by several developments.
• The growing interest in probabilistic programming languages that incorporate
updating (conditioning) and/or higher order features, see e.g. [29, 30, 32,
129, 154, 153].
• The compositional approach to Bayesian networks [26, 48] and to Bayesian
reasoning [28, 87, 89].
• The use of categorical and diagrammatic methods in quantum foundations,
including quantum probability, see [25] for an overview.
• The efforts to develop ‘synthetic’ probability theory via a categorical ax-
iomatisation, see e.g. [52, 53, 147, 20, 80].
This book builds on these developments.
The intended audience consists of students and professionals — in math-
ematics, computer science, artificial intelligence and related fields — with a
basic background in probability and in algebra and logic — and with an inter-
est in formal, logically oriented approaches. This book’s goal is not to provide
intuitive explanations of probability, like [158], but to provide clear and precise
formalisations of the relevant structures. Mathematical abstraction (esp. cate-
gorical air guitar playing) is not a goal in itself (except maybe in Chapter 3):
instead, the book tries to uncover relevant abstractions in concrete problems. It
includes several basic algorithms, with a focus on the algorithms’ correctness,
not their efficiency. Each section ends with a series of exercises, so that the
book can also be used for teaching and/or self-study. It aims at an undergradu-
ate level. No familiarity with category theory is assumed. The basic, necessary
notions are explained along the way. People who wish to learn more about cat-
egory theory can use the references in the text, consult modern introductory
texts like [7, 113], or use online resources such as ncatlab.org or Wikipedia.

Contents overview
The first chapter of the book covers introductory material that is meant to set
the scene. It starts from basic collection types like lists and subsets, and con-
tinues with multisets, which receive most attention. The chapter discusses the

(free) monoid structure on all these collection types and introduces ‘unit’ and
‘flatten’ maps as their common, underlying structure. It also introduces the
basic concept of a channel, for these three collection types, and shows how
channels can be used for state transformation and how they can be composed,
both sequentially and in parallel. At the end, the chapter provides definitions
of the relevant notions from category theory.
In the second chapter (discrete) probability distributions first emerge, as a
special collection type, with their own associated form of (probabilistic) chan-
nel. The subtleties of parallel products of distributions (states), with entwined-
ness/correlation between components and the non-naturality of copying, are
discussed at this early stage. This culminates in an illustration of Bayesian net-
works in terms of (probabilistic) channels. It shows how predictions are made
within such Bayesian networks via state transformation and via compositional
reasoning, basically by translating the network structure into (sequential and
parallel) composites of channels.
Blindly drawing coloured balls from an urn is a basic model in discrete prob-
ability. Such draws are analysed systematically in Chapter 3, not only for the
two familiar multinomial and hypergeometric forms (with or without replace-
ment), but also in the less familiar Pólya and Ewens forms. By describing these
draws as probabilistic channels we can derive the well known formulations for
these draw-distributions via channel-composition. Once formulated in terms of
channels, these distributions satisfy various compositionality properties. They
are typical for our approach and are (largely) absent in traditional treatments
of this topic. Urns and draws from urns are both described as multisets. The
interplay between multisets and distributions is an underlying theme in this
chapter. There is a fundamental distributive law between multisets and distri-
butions that expresses basic structural properties.
The fourth chapter is more logically oriented, via observables X → R (in-
cluding factors, predicates and events) that can be defined on sample spaces
X, providing numerical information. The chapter concentrates on validity of
observables in states and on transformation of observables. Where the second
chapter introduces state transformation along a probabilistic channel in a for-
ward direction, this fourth chapter adds observable (predicate) transformation
in a backward direction. These two operations are of fundamental importance
in program semantics, and also in quantum computation — where they are dis-
tinguished as Schrödinger’s (forward) and Heisenberg’s (backward) approach.
In this context, a random variable is a combination of a state and an observable,
on the same underlying sample space. The statistical notions of variance and
covariance are described in terms of validity for such random variables in
Chapter 5. This chapter distinguishes two forms of covariance, with a ‘shared’
or a ‘joint’ state, which satisfy different properties.
A very special technique in the area of probability theory is conditioning,
also known as belief updating, or simply as updating. It involves the incor-
poration of evidence into a distribution (state), so that the distribution better
fits the evidence. In traditional probability such conditioning is only indirectly
available, via a rule P(B | A) for computing conditional probabilities. In Chap-
ter 6 we formulate conditioning as an explicit operation, mapping a state ω
and a predicate p to a new updated state ω| p . A key result is that the validity
of p in ω| p is higher than the validity of p in the original state ω. This means
that we have learned from p and adapted our state (of mind) from ω to ω| p .
This updating operation ω| p forms an action (of predicates on states) and sat-
isfies Bayes’ rule, in fuzzy form. The combination with forward and backward
transformation along a channel leads to the techniques of forward inference
(causal reasoning) and backward inference (evidential reasoning). These in-
ference techniques are illustrated in many examples, including Bayesian net-
works. The backward inference rule is also called Pearl’s rule. It turns out that
there exists an alternative update mechanism, called Jeffrey’s rule. It can give
completely different outcomes. The formalisation of Jeffrey’s rule is based on
the reversal of the direction of a channel. This corresponds to turning a con-
ditional probability P(y | x) into P(x | y), essentially via Bayes’ rule. This
Bayesian inversion is also called the ‘dagger’ of a channel, since it satisfies
the properties of a dagger operation (or conjugate transpose), in Hilbert spaces
(and in quantum theory). This chapter includes a mathematical characterisation
of the different goals of these two update rules: Pearl’s rule increases validity
and Jeffrey’s rule decreases divergence. More informally, one learns via Pearl’s
rule by improving what’s going well and via Jeffrey’s rule by reducing what’s
going wrong. Jeffrey’s rule is thus an error correction mechanism. This fits the
basic idea in predictive coding theory [64, 23] that the human mind is seen as
a Bayesian prediction engine that operates by reducing prediction errors.
Since channels can be composed both sequentially and in parallel we can
use graphical techniques to represent composites of channels. So-called string
diagrams have been developed in physics and in category theory to deal with
the relevant compositional structure (symmetric monoidal categories). Chap-
ter 7 introduces these (directed) graphical techniques. It first describes string
diagrams. They are similar to the graphs used for Bayesian networks, but they
have explicit operations for copying and discarding and are thus more expres-
sive. But most importantly, string diagrams have a clear semantics, namely in
terms of channels. The chapter illustrates these string diagrams in a channel-
based description of the basics of Markov chains and of hidden Markov models.
But the most fundamental technique that is introduced in this chapter, via
string diagrams, is disintegration. In essence, it is the well known procedure
of extracting a conditional probability P(y | x) from a joint probability P(x, y).
One of the themes running through this book is how ‘crossover’ influence can
be captured via channels — extracted from joint states via disintegration —
in particular via forward and backward inference. This phenomenon is what
makes (reasoning in) Bayesian networks work. Disintegration is of interest in
itself, but also provides an intuitive formalisation of the Bayesian inversion of
a channel.
Almost all of the material in these chapters is known from the literature, but
typically not in the channel-based form in which it is presented here. This book
includes many examples, often copied from familiar sources, with the deliber-
ate aim of illustrating how the channel-based approach actually works. Since
many of these examples are taken from the literature, the interested reader may
wish to compare the channel-based description used here with the original de-
scription.

Status of the current incomplete version


An incomplete version of this book is made available online, in order to gen-
erate feedback and to justify a pause in the writing process. Feedback is most
welcome, both positive and negative, especially when it suggests concrete im-
provements of the text. This may lead to occasional updates of this text. The
date on the title page indicates the current version.
Some additional points.

• The (non-trivial) calculations in this book have been carried out with the
EfProb library [20] for channel-based probability. Several calculations in
this book can be done by hand, typically when the outcomes are described
as fractions, like 117/2012. Such calculations are meant to be reconstructable by
a motivated reader who really wishes to learn the ‘mechanics’ of the field.
Doing such calculations is a great way to really understand the topic — and
the approach of this book2 . Outcomes written in decimal notation 0.1234, as
approximations, or as plots, serve to give an impression of the results of a
computation.
• For the rest of this book, beyond Chapter 7, several additional chapters exist
2 Doing the actual calculations can be a bit boring and time consuming, but there are useful
online tools for calculating fractions, such as
https://www.mathpapa.com/fraction-calculator.html. Recent versions of EfProb also allow
calculations in fractional form.

in unfinished form, for instance on learning, probabilistic automata, causality
and on continuous probability. They will be incorporated in due course.

Bart Jacobs, August 26, 2021.

1 Collections

There are several ways to put elements from a given set together, for instance
as lists, subsets, multisets, and as probability distributions. This introductory
chapter takes a systematic look at such collections and seeks to bring out nu-
merous similarities. For instance, lists, subsets and multisets all form monoids,
by suitable unions of collections. Unions of distributions are more subtle and
take the form of convex combinations. Also, subsets, multisets and distribu-
tions can be combined naturally via parallel products ⊗, though lists cannot.
In this first chapter, we collect some basic operations and properties of tuples,
lists, subsets and multisets — where multisets are ‘sets’ in which elements
may occur multiple times. Probability distributions will be postponed to the
next chapter. Especially, we collect several basic definitions and results for
multisets, since they play an important role in the sequel, as urns, filled with
coloured balls, as draws from such urns, and as data in learning.
The main differences between lists, subsets and multisets are summarised in
the table below.
                                         lists    subsets    multisets
    order of elements matters              +         -           -
    multiplicity of elements matters       +         -           +

For instance, the lists [a, a, b], [a, b, a] and [a, b] are all different. The multisets
2|a⟩ + 1|b⟩ and 1|b⟩ + 2|a⟩, with the element a occurring twice and the element
b occurring once, are the same. However, 1|a⟩ + 1|b⟩ is a different multiset.
Similarly, the sets {a, b}, {b, a}, and {a} ∪ {a, b} are the same.
These collections are important in themselves, in many ways, and primar-
ily (in this book) as outputs of channels. Channels are functions of the form
input → T (output), where T is a ‘collection’ operator, for instance, combin-
ing elements as lists, subsets, multisets, or distributions. Such channels capture
a form of computation, directly linked to the form of collection that is produced
on the outputs. For instance, channels where T is powerset are used as interpre-
tations of non-deterministic computations, where each input element produces
a subset of possible output elements. In the probabilistic case these channels
produce distributions, with a suitable instantiation of the operator T . Channels
serve as elementary units of computation, which can be combined to build
more complicated computations via sequential and parallel composition.
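
As a minimal illustration of this idea — a sketch in Python, not the formal definition developed later — a channel on the integers can be written as a function returning a T-collection of outputs, for two choices of the collection operator T:

    def nondet_channel(n):            # T = powerset: a set of possible outputs
        return {n - 1, n + 1}

    def prob_channel(n):              # T = distribution: outputs with probabilities
        return {n - 1: 0.5, n + 1: 0.5}

    assert nondet_channel(3) == {2, 4}
    assert sum(prob_channel(3).values()) == 1.0
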
First, the reader is exposed to some general considerations that set the scene
for what is coming. This requires some level of patience. But it is useful to
see the similarities between probability distributions (in Chapter 2) and other
collections first, so that constructions, techniques, notation, terminology and
intuition that we use for distributions can be put in a wider perspective and
thus may become natural.
The final section of this chapter explains where the abstractions that we use
come from, namely from category theory, especially the notion of a
monad. It gives a quick overview of the most relevant parts of this theory and
also how category theory will be used in the remainder of this book to elicit
mathematical structure. We use category theory pragmatically, as a tool, and
not as a goal in itself.

1.1 Cartesian products


This section briefly reviews some (standard) terminology and notation related
to Cartesian products of sets.
Let X1 and X2 be two arbitrary sets. We can form their Cartesian product
X1 × X2 , as the new set containing all pairs of elements from X1 and X2 , as in:

X1 × X2 B {(x1 , x2 ) | x1 ∈ X1 and x2 ∈ X2 }.

(We use the sign B for mathematical definitions.)


We thus write (x1 , x2 ) for the ‘pair’ or ‘tuple’ of elements x1 ∈ X1 and
x2 ∈ X2 . We have just defined a binary product set, constructed from two given
sets X1 , X2 . We can also do this in n-ary form, for n sets X1 , . . . , Xn . We then
get an n-ary Cartesian product:

X1 × · · · × Xn B {(x1 , . . . , xn ) | x1 ∈ X1 , . . . , xn ∈ Xn }.

The tuple (x1 , . . . , xn ) is sometimes called an n-tuple. For convenience, it may


be abbreviated as a vector ~x. The product X1 × · · · × Xn is sometimes written
differently using the symbol ∏, as:

    ∏ 1≤i≤n Xi     or more informally as:     ∏ i Xi .

In the latter case it is left implicit what the range is of the index element i.
We allow n = 0. The resulting ‘empty’ product is then written as a singleton
set, written as 1, containing the empty tuple () as sole element, as in:

1 B {()}.
For n = 1 the product X1 × · · · × Xn is (isomorphic to) the set X1 . Note that we
are overloading the symbol 1 and using it both as numeral and as singleton set.
If one of the sets Xi in a product X1 × · · · × Xn is empty, then the whole
product is empty. Also, if all of the sets Xi are finite, then so is the product
X1 × · · · × Xn . In fact, the number of elements of X1 × · · · × Xn is then obtained
by multiplying all the numbers of elements of the sets Xi .

1.1.1 Projections and tuples


If we have sets X1 , . . . , Xn as above, then for each number i with 1 ≤ i ≤ n
there is a projection function πi out of the product to the set Xi , as in:
    πi : X1 × · · · × Xn −→ Xi     given by     πi (x1 , . . . , xn ) B xi .


This gives us functions out of a product. We also wish to be able to define
functions into a product, via tuples of functions: if we have a set Y and n
functions f1 : Y → X1 , . . . , fn : Y → Xn , then we can form a new function
Y → X1 × · · · × Xn , namely:

    ⟨ f1 , . . . , fn ⟩ : Y −→ X1 × · · · × Xn     via     ⟨ f1 , . . . , fn ⟩(y) B ( f1 (y), . . . , fn (y)).

There is an obvious result about projecting after tupling of functions:

    πi ◦ ⟨ f1 , . . . , fn ⟩ = fi . (1.1)
This is an equality of functions. It can be proven easily by applying both sides
to an arbitrary element y ∈ Y.
There are some more ‘obvious’ equations about tupling of functions:

    ⟨ f1 , . . . , fn ⟩ ◦ g = ⟨ f1 ◦ g, . . . , fn ◦ g⟩     ⟨π1 , . . . , πn ⟩ = id , (1.2)
where g : Z → Y is an arbitrary function. In the last equation, id is the identity
function on the product X1 × · · · × Xn .
In a Cartesian product we place sets ‘in parallel’. We can also place functions
between them in parallel. Suppose we have n functions fi : Xi → Yi . Then we
can form the parallel composition:

    f1 × · · · × fn : X1 × · · · × Xn −→ Y1 × · · · × Yn

via:

    f1 × · · · × fn = ⟨ f1 ◦ π1 , . . . , fn ◦ πn ⟩

so that:

    ( f1 × · · · × fn )(x1 , . . . , xn ) = ( f1 (x1 ), . . . , fn (xn )).


The latter formulation clearly shows how the functions fi are applied in parallel
to the elements xi .
We overload the product symbol ×, since we use it both for sets and for
functions. This may be a bit confusing at first, but it is in fact quite convenient.
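
To make this subsection concrete, here is a small Python sketch — an illustration only, with names of our own choosing rather than the book's notation — of projections, tupling and parallel products on plain tuples, including an instance of Equation (1.1).

    def pi(i):                        # projection pi_i, 1-indexed as in the text
        return lambda xs: xs[i - 1]

    def tuple_of(*fs):                # <f1, ..., fn> : Y -> X1 x ... x Xn
        return lambda y: tuple(f(y) for f in fs)

    def parallel(*fs):                # f1 x ... x fn, acting coordinatewise
        return lambda xs: tuple(f(x) for f, x in zip(fs, xs))

    h = tuple_of(abs, str)            # Y = int, X1 = int, X2 = str
    assert h(-3) == (3, '-3')
    assert pi(1)(h(-3)) == abs(-3)    # Equation (1.1): pi_i o <f1,...,fn> = f_i
    assert parallel(abs, len)((-3, 'abc')) == (3, 3)
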

1.1.2 Powers and exponents


Let X be an arbitrary set. A power of X is an n-product of X’s, for some n. We
write the n-th power of X as X n , in:
    X n B X × · · · × X (n times) = {(x1 , . . . , xn ) | xi ∈ X for each i}.

As special cases we have X 1 = X and X 0 = 1, where 1 = {()} is the singleton


set with the empty tuple () as sole inhabitant. Since powers are special cases of
Cartesian products, they come with projection functions πi : X n → X and tuple
functions ⟨ f1 , . . . , fn ⟩ : Y → X n for n functions fi : Y → X. Finally, for a func-
tion f : X → Y we write f n : X n → Y n for the obvious n-fold parallelisation of
f.
More generally, for two sets X, Y we shall occasionally write:
    X Y B { functions f : Y → X }.


This new set X Y is sometimes called the function space or the exponent of X
and Y. Notice that this exponent notation is consistent with the above one for
powers, since functions n → X can be identified with n-tuples of elements in
X.
These exponents X Y are related to products in an elementary and useful way,
namely via a bijective correspondence:

    f : Z × Y −→ X
    ══════════════════ (1.3)
    g : Z −→ X Y


This means that for a function f : Z × Y → X there is a corresponding function
f̄ : Z → X Y , and vice-versa, for g : Z → X Y there is a function ḡ : Z × Y → X,
in such a way that these two passages are each other's inverses. It is not hard
to see that we can take f̄ (z) ∈ X Y to be the function f̄ (z)(y) = f (z, y), for z ∈ Z
and y ∈ Y. Similarly, we use ḡ(z, y) = g(z)(y).
The correspondence (1.3) is characteristic for so-called Cartesian closed cat-
egories.
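
In programming terms the correspondence (1.3) is currying. Here is a small Python sketch, for illustration only:

    def curry(f):                     # f : Z x Y -> X   |-->   f-bar : Z -> X^Y
        return lambda z: lambda y: f(z, y)

    def uncurry(g):                   # g : Z -> X^Y     |-->   g-bar : Z x Y -> X
        return lambda z, y: g(z)(y)

    add = lambda z, y: z + y
    assert curry(add)(2)(3) == add(2, 3) == 5
    assert uncurry(curry(add))(2, 3) == add(2, 3)    # the two passages are inverse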

Exercises
1.1.1 Check what a tuple function ⟨π2 , π3 , π6 ⟩ does on a product set X1 ×
· · · × X8 . What is the codomain of this function?
1.1.2 Check that, in general, the tuple function ⟨ f1 , . . . , fn ⟩ is the unique
function h : Y → X1 × · · · × Xn with πi ◦ h = fi for each i.
1.1.3 Prove, using Equations (1.1) and (1.2) for tuples and projections, that:

    (g1 × · · · × gn ) ◦ ( f1 × · · · × fn ) = (g1 ◦ f1 ) × · · · × (gn ◦ fn ).

1.1.4 Check that for each set X there is a unique function X → 1. Because
of this property the set 1 is sometimes called ‘final’ or ‘terminal’. The
unique function is often denoted by !.
Check also that a function 1 → X corresponds to an element of X.
1.1.5 Define functions in both directions, using tuples and projections, that
yield isomorphisms:

    X × Y ≅ Y × X     1 × X ≅ X     X × (Y × Z) ≅ (X × Y) × Z.

Try to use Equations (1.1) and (1.2) to prove these isomorphisms,


without reasoning with elements.
1.1.6 Similarly, show that exponents satisfy:

    X 1 ≅ X     1 Y ≅ 1     (X × Y) Z ≅ X Z × Y Z     X Y×Z ≅ (X Y ) Z .


1.1.7 For K ∈ N and sets X, Y define:

           zip[K] : X K × Y K −→ (X × Y)K

       by:

           zip[K]((x1 , . . . , xK ), (y1 , . . . , yK )) B ((x1 , y1 ), . . . , (xK , yK )).

       Show that zip[K] is an isomorphism, with inverse function
       unzip[K] B ⟨(π1 )K , (π2 )K ⟩.


1.2 Lists
The datatype of (finite) lists of elements from a given set is well-known in
computer science, especially in functional programming. This section collects
some basic constructions and properties, especially about the close relationship
between lists and monoids.
For an arbitrary set X we write L(X) for the set of all finite lists [x1 , . . . , xn ]
of elements xi ∈ X, for arbitrary n ∈ N. Notice that we use square brackets
[−] for lists, to distinguish them from tuples, which are typically written with
round brackets (−).
Thus, the set of lists over X can be defined as a union of all powers of X, as
in:
    L(X) B ⋃ n∈N X n .

When the elements of X are letters of an alphabet, then L(X) is the set of words
— the language — over this alphabet. The set L(X) is alternatively written as
X ∗ , and called the Kleene star of X.
We zoom in on some trivial cases. One has L(0) ≅ 1, since one can only
form the empty word over the empty alphabet 0 = ∅. If the alphabet contains
only one letter, a word consists of a finite number of occurrences of this single
letter. Thus: L(1) ≅ N.
We consider lists as an instance of what we call a collection data type, since
L(X) collects elements of X in a certain manner. What distinguishes lists from
other collection types is that elements may occur multiple times, and that the
order of occurrence matters. The three lists [a, b, a], [a, a, b], and [a, b] differ.
As mentioned in the introduction to this chapter, within a subset orders and
multiplicities do not matter, see Section 1.3; and in a multiset the order of
elements does not matter, but multiplicities do matter, see Section 1.4.
Let f : X → Y be an arbitrary function. It can be used to map lists over X into
lists over Y by applying f element-wise. This is what functional programmers
call map-list. Here we like overloading, so we write L( f ) : L(X) → L(Y) for
this function, defined as:

L( f ) [x1 , . . . , xn ] B [ f (x1 ), . . . , f (xn )].




Thus, L is an operation that not only sends sets to sets, but also functions to
functions. It does so in such a way that identity maps and compositions are
preserved:
L(id ) = id L(g ◦ f ) = L(g) ◦ L( f ).

We shall say: the operation L is functorial, or simply L is a functor.


Functoriality can be used to define the marginal of a list on a product set,


via L(πi ), where πi is a projection map. For instance, let ℓ ∈ L(X × Y) be of
the form ℓ = [(x1 , y1 ), . . . , (xn , yn )]. The first marginal L(π1 )(ℓ) ∈ L(X) is then
computed as:

    L(π1 )(ℓ) = L(π1 )([(x1 , y1 ), . . . , (xn , yn )])
              = [π1 (x1 , y1 ), . . . , π1 (xn , yn )]
              = [x1 , . . . , xn ].
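
In code, the action of L on functions is the familiar ‘map’ operation. The following Python sketch — illustrative only — also shows the first marginal of a list of pairs.

    def L(f):                                    # L(f) : L(X) -> L(Y), map f over a list
        return lambda xs: [f(x) for x in xs]

    assert L(lambda x: x * x)([1, 2, 3]) == [1, 4, 9]

    f, g = abs, str                              # functoriality: L(g o f) = L(g) o L(f)
    assert L(lambda x: g(f(x)))([-1, 2]) == L(g)(L(f)([-1, 2]))

    pairs = [('a', 1), ('b', 2), ('a', 3)]
    assert L(lambda p: p[0])(pairs) == ['a', 'b', 'a']   # L(pi_1), the first marginal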

1.2.1 Monoids
A monoid is a very basic mathematical structure. For convenience we define it
explicitly.
Definition 1.2.1. A monoid consists of a set M with a binary operation M ×
M → M, written for instance as infix +, together with an identity element, say
written as 0 ∈ M. The binary operation + is associative and has 0 as identity
on both sides. That is, for all a, b, c ∈ M,

a + (b + c) = (a + b) + c and 0 + a = a = a + 0.
The monoid is called commutative if a + b = b + a, for all a, b ∈ M. It is called
idempotent if a + a = a for all a ∈ M.
Let (M, 0, +) and (N, 1, ·) be two monoids. A function f : M → N is called a
homomorphism of monoids if f preserves the unit and binary operation, in the
sense that:

f (0) = 1 and f (a + b) = f (a) · f (b), for all a, b ∈ M.


For brevity we also say that such an f is a map of monoids, or simply a monoid
map.
The natural numbers N with addition form a commutative monoid (N, 0, +).
But also with multiplication they form a commutative monoid (N, 1, ·). The
function f (n) = 2^n is a homomorphism of monoids f : (N, 0, +) → (N, 1, ·).
Various forms of collection types form monoids, with ‘union’ as binary op-
eration. We start with lists, in the next result. The proof is left as (an easy)
exercise to the reader.
Lemma 1.2.2. 1 For each set X, the set L(X) of lists over X is a monoid,
with the empty list [] ∈ L(X) as identity element, and with concatenation
++ : L(X) × L(X) → L(X) as binary operation:

[x1 , . . . , xn ] ++ [y1 , . . . , ym ] B [x1 , . . . , xn , y1 , . . . , ym ].


This monoid (L(X), [], ++) is neither commutative nor idempotent.


2 For each function f : X → Y the associated map L( f ) : L(X) → L(Y) is a
homomorphism of monoids.

Thus, lists are monoids via concatenation. But there is more to say: lists are
free monoids. We shall occasionally make use of this basic property and so we
like to make it explicit. We shall encounter similar freeness properties for other
collection types.
Each element x ∈ X yields a singleton list unit(x) B [x] ∈ L(X). The result-
ing function unit : X → L(X) plays a special role, see also the next subsection.

Proposition 1.2.3. Let X be an arbitrary set and let (M, 0, +) be an arbitrary
monoid, with a function f : X → M. Then there is a unique homomorphism of
monoids f̄ : (L(X), [], ++) → (M, 0, +) with f̄ ◦ unit = f .
The homomorphism f̄ is called the free extension of f . Its freeness can be
expressed via a diagram, as below, where the vertical arrow is dashed, to indi-
cate uniqueness.

              unit
        X ---------> L(X)
           \           ¦
            \          ¦  f̄ , homomorphism               (1.4)
          f  \         ¦
              \        v
               `-----> M

Proof. Since f̄ preserves the identity element and satisfies f̄ ◦ unit = f it is
determined on empty and singleton lists as:

    f̄ ([]) = 0     and     f̄ ([x]) = f (x).

Further, on a list [x1 , . . . , xn ] of length n ≥ 2 we necessarily have:

    f̄ ([x1 , . . . , xn ]) = f̄ ([x1 ] ++ · · · ++ [xn ])
                          = f̄ ([x1 ]) + · · · + f̄ ([xn ])
                          = f (x1 ) + · · · + f (xn ).

Thus, there is only one way to define f̄ . By construction, this f̄ : L(X) → M is
a homomorphism of monoids.
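
The free extension is what functional programmers know as a fold. A small Python sketch, for illustration only, with names of our own choosing:

    from functools import reduce

    def free_extension(f, unit_elt, op):
        """Extend f : X -> M to f-bar : L(X) -> M, for a monoid (M, unit_elt, op)."""
        return lambda xs: reduce(op, (f(x) for x in xs), unit_elt)

    # Example: the length function is the free extension of x |-> 1 into (N, 0, +).
    length = free_extension(lambda x: 1, 0, lambda a, b: a + b)
    assert length(['a', 'b', 'c']) == 3
    # It is a homomorphism from (L(X), [], ++) to (N, 0, +):
    assert length([1, 2] + [3]) == length([1, 2]) + length([3])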

The exercises below illustrate this result. For future use we introduce monoid
actions and their homomorphisms.

Definition 1.2.4. Let (M, 0, +) be a monoid.

1 An action of the monoid M on a set X is a function α : M×X → X satisfying:

α(0, x) = x and α(a + b, x) = α(a, α(b, x)),


for all a, b ∈ M and x ∈ X.


2 A homomorphism or map of monoid actions from α : M × X → X to
β : M × Y → Y is a function f : X → Y satisfying:

    f (α(a, x)) = β(a, f (x))     for all a ∈ M, x ∈ X.

This equation corresponds to commutation of the following diagram.

              id × f
    M × X ---------> M × Y
      ¦                ¦
    α ¦                ¦ β
      v                v
      X -------f-----> Y

Monoid actions are quite common in mathematics. For instance, scalar mul-
tiplication of a vector space forms an action. Also, as we shall see, probabilistic
updating can be described via monoid actions. The action map α : M × X → X
can be understood intuitively as pushing the elements in X forward with a
quantity from M. It then makes sense that the zero-push is the identity, and
that a sum-push is the composition of two individual pushes.
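
As a tiny concrete sketch (not from the text): the additive monoid of integers acts on points of the real line by translation, and the two action equations are exactly the ‘zero-push’ and ‘sum-push’ statements just made.

    def alpha(a, x):                  # action of (int, 0, +) on the real line
        return x + a

    x = 5.0
    assert alpha(0, x) == x                          # zero-push is the identity
    assert alpha(2 + 3, x) == alpha(2, alpha(3, x))  # a sum-push composes two pushes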

1.2.2 Unit and flatten for lists


We proceed to describe more elementary structure for lists, in terms of special
‘unit’ and ‘flatten’ functions. In subsequent sections we shall see that this same
structure exists for other collection types, like powerset, multiset and distribu-
tion. This unit and flatten structure will turn out to be essential for sequential
composition. At the end of this chapter we will see that it is characteristic for
what is called a ‘monad’.
We have already seen the singleton-list function unit : X → L(X), given by
unit(x) B [x]. There is also a ‘flatten’ function which turns a list of lists into
a list by removing inner brackets. This function is written as flat : L(L(X)) →
L(X). It is defined as:

flat [[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]] B [x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk ].
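
A quick Python sketch of these two maps (illustrative only); the equations that they satisfy are stated in the next lemma.

    def unit(x):
        return [x]

    def flat(xss):
        return [x for xs in xss for x in xs]

    assert unit(7) == [7]
    assert flat([[1, 2], [], [3]]) == [1, 2, 3]

    xs = [1, 2, 3]                                   # the 'monad' equations, on an example
    assert flat(unit(xs)) == xs == flat([unit(x) for x in xs])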




The next result contains some basic properties about unit and flatten. These
properties will first be formulated in terms of equations, and then, alternatively
as commuting diagrams. The latter style is preferred in this book.

Lemma 1.2.5. 1 For each function f : X → Y one has:

unit ◦ f = L( f ) ◦ unit and flat ◦ L(L( f )) = L( f ) ◦ flat.

Equivalently, the following two diagrams commute.

         unit                             flat
    X ---------> L(X)         L(L(X)) ---------> L(X)
    ¦              ¦              ¦                ¦
  f ¦              ¦ L( f )       ¦ L(L( f ))      ¦ L( f )
    v              v              v                v
    Y ---------> L(Y)         L(L(Y)) ---------> L(Y)
         unit                             flat

2 One further has:

    flat ◦ unit = id = flat ◦ L(unit)          flat ◦ flat = flat ◦ L(flat).

These two equations can equivalently be expressed via commutation of:

             unit                 L(unit)
    L(X) ---------> L(L(X)) <---------- L(X)
         \              ¦              /
          \             ¦ flat        /
       id  \            v            /  id
            `------>  L(X)  <-------´

                    L(flat)
    L(L(L(X))) -----------> L(L(X))
         ¦                     ¦
    flat ¦                     ¦ flat
         v                     v
      L(L(X)) ------flat----> L(X)

Proof. We shall do the first cases of each item, leaving the second cases to the
interested reader. First, for f : X → Y and x ∈ X one has:

    (L( f ) ◦ unit)(x) = L( f )(unit(x)) = L( f )([x]) = [ f (x)] = unit( f (x)) = (unit ◦ f )(x).




Next, for the second item we take an arbitrary list [x1 , . . . , xn ] ∈ L(X). Then:

    (flat ◦ unit)([x1 , . . . , xn ]) = flat([[x1 , . . . , xn ]]) = [x1 , . . . , xn ]
    (flat ◦ L(unit))([x1 , . . . , xn ]) = flat([unit(x1 ), . . . , unit(xn )])
                                        = flat([[x1 ], . . . , [xn ]]) = [x1 , . . . , xn ].

The equations in item 1 of this lemma are so-called naturality equations.


They express that unit and flat work uniformly, independent of the set X in-
volved. The equations in item 2 show that L is a monad, see Section 1.9 for
more information.
The next result connects monoids with the unit and flatten maps.

Proposition 1.2.6. Let X be an arbitrary set.

1 To give a monoid structure (u, +) on X is the same as giving an L-algebra,


that is, a map α : L(X) → X satisfying α ◦ unit = id and α ◦ flat = α ◦ L(α),
as in:

         unit                         L(α)
    X ---------> L(X)       L(L(X)) --------> L(X)
         \         ¦            ¦               ¦
      id   \       ¦ α     flat ¦               ¦ α          (1.5)
             \     v            v               v
               `-> X          L(X) -----α----->  X

2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two monoids, with corresponding L-


algebras α1 : L(M1 ) → M1 and α2 : L(M2 ) → M2 . A function f : M1 → M2
is then a homomorphism of monoids if and only if the diagram

               L( f )
    L(M1 ) -----------> L(M2 )
       ¦                   ¦
    α1 ¦                   ¦ α2          (1.6)
       v                   v
      M1 -------f-------> M2

commutes.

This result says that instead of giving a binary operation + with an identity
element u we can give a single operation α that works on all sequences of el-
ements. This is not so surprising, since we can apply the sum multiple times.
The more interesting part is that the monoid equations can be captured uni-
formly by the diagrams/equations (1.5). We shall see that the same diagrams also
work for other types of monoids (and collection types).

Proof. 1 If (X, u, +) is a monoid, we can define α : L(X) → X in one go as


α([x1 , . . . , xn ]) B x1 + · · · + xn . The latter sum equals the identity element
u when n = 0. Notice that the bracketing of the elements in the expression
x1 + · · · + xn does not matter, since a monoid is associative. The order does
matter, since we do not assume that the monoid is commutative. It is easy
to check the equations (1.5). From a more abstract perspective we define
α : L(X) → X via freeness (1.4).
In the other direction, assume an L-algebra α : L(X) → X. We then
define an identity element u B α([]) ∈ X and the sum of x, y ∈ X as
x + y B α([x, y]) ∈ X. We have to check that u is identity for + and that
+ is associative. This requires some fiddling with the equations (1.5):
    x + u = α([x, α([])]) = α([α(unit(x)), α([])])        by (1.5)
          = α(L(α)([ [x], [] ]))
          = α(flat([ [x], [] ]))                          by (1.5)
          = α([x]) = x.                                   by (1.5)

Similarly one shows u + y = y. Next, associativity of + is obtained in a


similar manner:
    x + (y + z) = α([x, α([y, z])]) = α([α(unit(x)), α([y, z])])        by (1.5)
                = α(L(α)([ [x], [y, z] ]))
                = α(flat([ [x], [y, z] ]))                              by (1.5)
                = α(flat([ [x, y], [z] ]))
                = α(L(α)([ [x, y], [z] ]))                              by (1.5)
                = α([α([x, y]), α(unit(z))])
                = α([α([x, y]), z]) = (x + y) + z.                      by (1.5)


2 Now let f : M1 → M2 be a homomorphism of monoids. Diagram (1.6) then


commutes:
    (α2 ◦ L( f ))([x1 , . . . , xn ]) = α2 ([ f (x1 ), . . . , f (xn )])
                                      = f (x1 ) + · · · + f (xn )
                                      = f (x1 + · · · + xn )        since f is a homomorphism
                                      = ( f ◦ α1 )([x1 , . . . , xn ]).


In the other direction, if (1.6) commutes for a function f : M1 → M2 , then


f is a homomorphism of monoids, since:
    f (u1 ) = f (α1 ([])) = α2 (L( f )([])) = α2 ([]) = u2 ,        using (1.6).


Similarly one checks that sums are preserved:


    f (x +1 y) = f (α1 ([x, y])) = α2 (L( f )([x, y]))        by (1.6)
               = α2 ([ f (x), f (y)]) = f (x) +2 f (y).


We see that the algebraic structure (M, u, +) on the set M is expressed as an
algebra α : L(M) → M, namely a certain map to M. This will be a recurring
theme in the coming sections.
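
In code, the passage from a monoid to its L-algebra is again a fold, and the monoid is recovered from α by evaluating it on the empty and the two-element list. A brief Python sketch, illustrative only:

    from functools import reduce

    def algebra_from_monoid(u, op):                      # alpha : L(X) -> X
        return lambda xs: reduce(op, xs, u)

    alpha = algebra_from_monoid(0, lambda a, b: a + b)   # the monoid (N, 0, +)
    assert alpha([]) == 0                                # recovered unit: u = alpha([])
    assert alpha([3, 4]) == 3 + 4                        # recovered sum: x + y = alpha([x, y])
    assert alpha([1, 2, 3, 4]) == 10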

1.2.3 List combinatorics


Combinatorics is a subarea of mathematics focused on advanced forms of
counting. It is relevant for probability theory, since frequencies of occurrences
play an important role. We give a first taste of this, using lists.
We shall use the length ∥ℓ∥ ∈ N of a list ℓ, see Exercise 1.2.3 for more
details, and also the sum and product of a list of natural numbers, defined as:

    sum([n1 , . . . , nk ]) B n1 + · · · + nk = Σi ni
    prod([n1 , . . . , nk ]) B n1 · . . . · nk = Πi ni .

See also Exercise 1.2.6.


We restrict ourselves to the subset N>0 = {n ∈ N | n > 0} of positive


natural numbers. Clearly, we obtain restrictions sum, prod : L(N>0 ) → N>0 of
the above sum and product functions.
Now fix N ∈ N>0 . We are interested in lists ℓ ∈ L(N>0 ) with sum(ℓ) = N.
These lists are in the inverse image:

    sum −1 (N) B {ℓ ∈ L(N>0 ) | sum(ℓ) = N}.

For instance, for N = 4 this inverse image contains the eight lists:

[1, 1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 1, 1], [2, 2], [1, 3], [3, 1], [4]. (1.7)

We can interpret the situation as follows. Suppose we have coins with value
n ∈ N>0 , for each n. Then we can ask, for an amount N: how many (ordered)
ways are there to lay out the amount N in coins? For N = 4 the different layouts
are given above. Other interpretations are possible: one can also think of the
sequences (1.7) as ordered partitions (compositions) of the number 4.
Here is a first, easy counting result.

Lemma 1.2.7. For N ∈ N>0 , the subset sum −1 (N) ⊆ L(N>0 ) has 2^{N−1}
elements.

Proof. We use induction on N, starting with N = 1. Obviously, only the list
[1] sums to 1, and indeed 2^{N−1} = 2^0 = 1.
For the induction step we use the familiar fact that for K ∈ N,

    Σ_{0≤k≤K} 1/2^k = (2^{K+1} − 1) / 2^K .        (∗)

This can be shown easily by induction on K.


We anticipate the notation | A | for the number of elements of a finite set A,
see Definition 1.3.6. Then, for N > 1,



    | sum −1 (N) | = | {[N]} ∪ { ℓ ++ [n] | 1 ≤ n ≤ N −1 and ℓ ∈ sum −1 (N −n) } |
                   = 1 + Σ_{1≤n≤N−1} | sum −1 (N −n) |
                   = 1 + Σ_{1≤n≤N−1} 2^{N−n−1}                          by (IH)
                   = 1 + 2^{N−1} · ( Σ_{0≤n≤N−1} 1/2^n − 1 )
                   = 1 + 2^{N−1} · ( (2^N − 1) / 2^{N−1} − 1 )          by (∗)
                   = 1 + 2^N − 1 − 2^{N−1}
                   = 2^{N−1} .
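
The lemma is easy to check by brute force for small N; the following Python sketch — not part of the proof — enumerates sum −1 (N) and counts its elements.

    def sum_inverse(N):
        """All lists of positive natural numbers with the given sum."""
        if N == 0:
            return [[]]
        return [[n] + rest for n in range(1, N + 1) for rest in sum_inverse(N - n)]

    assert len(sum_inverse(4)) == 8 == 2 ** (4 - 1)      # the eight lists in (1.7)
    for N in range(1, 10):
        assert len(sum_inverse(N)) == 2 ** (N - 1)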

Here is an elementary fact about coin lists. It has a definite probabilistic


flavour, since it involves what we later call a convex sum of probabilities ri ∈
[0, 1] with Σi ri = 1. The proof of this result is postponed. It involves rather
complex probability distributions about coins, see Corollary 3.9.13. We are not
aware of an elementary proof.

Theorem 1.2.8. For each N ∈ N>0 ,

    Σ_{ℓ ∈ sum −1 (N)}  1 / ( ∥ℓ∥! · prod(ℓ) )  =  1.        (1.8)

At this stage we only give an example, for N = 4, using the corresponding
lists ℓ in (1.7). The associated sum (1.8) is illustrated below.

    ℓ                     [1,1,1,1]  [1,1,2]   [1,2,1]   [2,1,1]   [2,2]     [1,3]     [3,1]     [4]
    1/(∥ℓ∥! · prod(ℓ))    1/(4!·1)   1/(3!·2)  1/(3!·2)  1/(3!·2)  1/(2!·4)  1/(2!·3)  1/(2!·3)  1/(1!·4)
                          1/24       1/12      1/12      1/12      1/8       1/6       1/6       1/4

    with sum: 1/24 + 1/12 + 1/12 + 1/12 + 1/8 + 1/6 + 1/6 + 1/4 = 1.
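
Equation (1.8) can likewise be checked mechanically for small N, using exact fractions. A Python sketch, illustrative only, reusing the enumeration from the previous sketch:

    from fractions import Fraction
    from math import factorial, prod

    def sum_inverse(N):                                  # as in the previous sketch
        if N == 0:
            return [[]]
        return [[n] + rest for n in range(1, N + 1) for rest in sum_inverse(N - n)]

    def lhs(N):                                          # left-hand side of (1.8)
        return sum(Fraction(1, factorial(len(l)) * prod(l)) for l in sum_inverse(N))

    assert all(lhs(N) == 1 for N in range(1, 8))
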
Obviously, elements in a list are ordered. Thus, in (1.7) we distinguish
between coin layouts [1, 1, 2], [1, 2, 1] and [2, 1, 1]. However, when we
informally discuss which coins add up to 4 we commonly do not take the order into
account, for instance in saying that we use two coins of value 1 and one coin
of 2, without caring about the order. In doing so, we are not using lists as col-
lection type, but multisets — in which the order of elements does not matter.


These multisets form an important alternative collection type; they are dis-
cussed from Section 1.4 onwards.

Exercises
1.2.1 Let X = {a, b, c} and Y = {u, v} be sets with a function f : X → Y
given by f (a) = u = f (c) and f (b) = v. Write ℓ1 = [c, a, b, a] and
ℓ2 = [b, c, c, c]. Compute consecutively:
• ℓ1 ++ ℓ2
• ℓ2 ++ ℓ1
• ℓ1 ++ ℓ1
• ℓ1 ++ (ℓ2 ++ ℓ1 )
• (ℓ1 ++ ℓ2 ) ++ ℓ1
• L( f )(ℓ1 )
• L( f )(ℓ2 )
• L( f )(ℓ1 ) ++ L( f )(ℓ2 )
• L( f )(ℓ1 ++ ℓ2 ).
1.2.2 We write log for the logarithm function with some base b > 0, so that
log(x) = y iff x = b^y . Verify that the logarithm function log is a map
of monoids:

    log : (R>0 , 1, ·) −→ (R, 0, +).

Often the log function is used to simplify a computation, by turning


multiplications into additions. Then one uses that log is precisely this
homomorphism of monoids. (An additional useful property is that log
is monotone: it preserves the order.)
1.2.3 Define a length function ∥−∥ : L(X) → N on lists via the freeness
property of Proposition 1.2.3.
1 Describe ∥ℓ∥ for ℓ ∈ L(X) explicitly.
2 Elaborate what it means that ∥−∥ is a homomorphism of monoids.
3 Write ! for the unique function X → 1 and check that ∥−∥ is L(!).
Notice that the previous item can then be seen as an instance of
Lemma 1.2.2 (2).
1.2.4 1 Check that the list-flatten operation L(L(X)) → L(X) can be de-
scribed in terms of concatenation ++ as:

    flat([ℓ1 , . . . , ℓn ]) = ℓ1 ++ · · · ++ ℓn .


2 Now consider the correspondence of Proposition 1.2.6 (1). Conclude
that the algebra α : L(L(X)) → L(X) associated with the
monoid (L(X), [], ++) from Lemma 1.2.2 (1) is flat.
1.2.5 Check Equation (1.8) yourself for N = 5.
1.2.6 Consider the set N of natural numbers with its additive monoid struc-
ture (0, +) and also with its multiplicative monoid structure (1, ·). Ap-
ply freeness from Proposition 1.2.3 with these two structures to define
two monoid homomorphisms:
    sum : L(N) −→ N     and     prod : L(N) −→ N

1 Describe these maps explicitly on a sequence [n1 , . . . , nk ] of natural


numbers ni .
2 Make explicit what it means that they preserve the monoid struc-
ture.
3 Prove that for an arbitrary set X, the list-length function k − k from
Exercise 1.2.3 satisfies:

sum ◦ L(k − k) = k − k ◦ flat.

In other words, the following diagram commutes.

L L(X)
 L(k−k)
/ L(N)
flat sum
 
L(X)
k−k
/N

A fancy way to prove that length is such an algebra homomorphism


is to use the uniqueness in Proposition 1.2.3.

1.3 Subsets
The next collection type that will be studied is powerset. The symbol P is
commonly used for the powerset operator. We will see that there are many
similarities with lists L from the previous section. We again pay much attention
to monoid structures.
For an arbitrary set X we write P(X) for the set of all subsets of X, and
Pfin (X) for the set of finite subsets. Thus:

P(X) B {U | U ⊆ X} and Pfin (X) B {U ∈ P(X) | U is finite}.

16
1.3. Subsets 17

If X is a finite set itself, there is no difference between P(X) and Pfin (X). In the
sequel we shall speak mostly about P, but basically all properties of interest
hold for Pfin as well.
First of all, P is a functor: it works both on sets and on functions. Given
a function f : X → Y we can define a new function P( f ) : P(X) → P(Y) by
taking the image of f on a subset. Explicitly, for U ⊆ X,

P( f )(U) = { f (x) | x ∈ U}.

The right-hand side is clearly a subset of Y, and thus an element of P(Y). We


have two equalities:

P(id ) = id P(g ◦ f ) = P(g) ◦ P( f ).

We can use functoriality for marginalisation: for a subset (relation) R ⊆ X × Y


on a product set we get its first marginal P(π1 )(R) ∈ P(X) as the subset:

P(π1 )(R) = {π1 (z) | z ∈ R} = {π1 (x, y) | (x, y) ∈ R} = {x | ∃y. (x, y) ∈ R}.

The next topic is the monoid structure on powersets. The first result is an
analogue of Lemma 1.2.2 and its proof is left to the reader.

Lemma 1.3.1. 1 For each set X, the powerset P(X) is a commutative and
idempotent monoid, with empty subset ∅ ∈ P(X) as identity element and
union ∪ of subsets of X as binary operation.
2 Each P( f ) : P(X) → P(Y) is a map of monoids, for f : X → Y.

Next we define unit and flatten maps for subsets, much like for lists. The
function unit : X → P(X) sends an element to a singleton subset: unit(x) B
{x}. The flatten function flat : P(P(X)) → P(X) is given by union: for A ⊆
P(X),
[
flat(A) B A = {x ∈ X | ∃U ∈ A. x ∈ U}.

We mention, without proof, the following analogue of Lemma 1.2.5.

Lemma 1.3.2. 1 For each function f : X → Y the ‘naturality’ diagrams

X
unit / P(X) P(P(X))
flat / P(X)
f P( f ) P(P( f )) P( f )
   
Y / P(Y) P(P(Y)) / P(Y)
unit flat

commute.

17
18 Chapter 1. Collections

2 Additionaly, the ‘monad’ diagrams below commute.

P(X)
unit / P(P(X)) o P(unit) P(X) P(P(P(X)))
flat / P(P(X))
P(flat)
 flat   flat
P(X) P(P(X)) / P(X)
flat

1.3.1 From list to powerset


We have seen that lists and subsets behave in a similar manner. We strengthen
this connection by defining a support function supp : L(X) → Pfin (X) between
them, via:
supp [x1 , . . . , xn ] B {x1 , . . . , xn }.


Thus, the support of a list is the subset of elements occurring in the list. The
support function removes order and multiplicities. The latter happens implic-
itly, via the set notation, above on the right-hand side. For instance,
supp([b, a, b, b, b]) = {a, b} = {b, a}.
Notice that there is no way to go in the other direction, namely Pfin (X) → L(X).
Of course, one can for each subset choose an order of the elements in order to
turn the subset into a list. However, this process is completely arbitrary and is
not uniform (natural).
The support function interacts nicely with the structures that we have seen
so far. This is expressed in the result below, where we use the same notation
unit and flat for different functions, namely for L and for P. The context, and
especially the type of an argument, will make clear which one is meant.
Lemma 1.3.3. Consider the support map supp : L(X) → Pfin (X) defined above.
1 It is a map of monoids (L(X), [], ++) → (P(X), ∅, ∪).
2 It is natural, in the sense that for f : X → Y one has:

L(X)
supp
/ Pfin (X)
L( f ) Pfin ( f )
 
L(Y) / Pfin (Y)
supp

3 It commutes with the unit and flatten maps of list and powerset, as in:

X X L(L(X))
L(supp)
/ L(Pfin (X)) supp
/ Pfin (Pfin (X))
unit
  unit flat
  flat
L(X) / Pfin (X) L(X) / Pfin (X)
supp supp

18
1.3. Subsets 19

Proof. The first item is easy and skipped. For item 2,

Pfin ( f ) ◦ supp ([x1 , . . . , xn ]) = Pfin ( f )(supp([x1 , . . . , xn ]))




= Pfin ( f )({x1 , . . . , xn })
= { f (x1 ), . . . , f (xn )}
= supp([ f (x1 ), . . . , f (xn )])
= supp(L( f )([x1 , . . . , xn ]))
= supp ◦ L( f ) ([x1 , . . . , xn ]).


In item 3, commutation of the first diagram is easy:

supp ◦ unit (x) = supp([x]) = {x} = unit(x).




The second diagram requires a bit more work. Starting from a list of lists we
get:

flat ◦ supp ◦ L(supp) ([[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]])




= ◦ supp ([supp([x11 , . . . , x1n1 ]), . . . , supp([xk1 , . . . , xknk ])])


S 

= ◦ supp ([{x11 , . . . , x1n1 }, . . . , {xk1 , . . . , xknk }])


S 

= ({{x11 , . . . , x1n1 }, . . . , {xk1 , . . . , xknk }})


S

= {x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk }


= supp([x11 , . . . , x1n1 , . . . , xk1 , . . . , xknk ])
= supp ◦ flat ([[x11 , . . . , x1n1 ], . . . , [xk1 , . . . , xknk ]]).


1.3.2 Finite powersets and idempotent commutative monoids


We briefly review the relation between monoids and (finite) powersets. At
an abstract level the situation is much like for lists, as described in Subsec-
tion 1.2.1. For instance, the commutative idempotent monoids Pfin (X) are free,
like lists, in Proposition 1.2.3.

Proposition 1.3.4. Let X be a set and (M, 0, +) a commutative idempotent


monoid, with a function f : X → M between them. Then there is a unique
homomorphism of monoids f : (Pfin (X), ∅, ∪) → (M, 0, +) with f ◦ unit = f .
We represent this situation in the diagram below.

X
unit / Pfin (X)

f , homomorphism (1.9)
* M
f

19
20 Chapter 1. Collections

Proof. Given the requirements, the only way to define f is as:


 
f {x1 , . . . , xn } B f (x1 ) + · · · + f (xn ), with special case f (∅) = 0.

The order in the above sum f (x1 ) + · · · + f (xn ) does not matter since M is
commutative. The function f sends unions to sums since + is idempotent.

Commutative idempotent monoids can be described as algebras, in analogy


with Proposition 1.2.6.

Proposition 1.3.5. Let X be an arbitrary set.

1 To specify a commutative idempotent monoid structure (u, +) on X is the


same as giving a Pfin -algebra α : Pfin (X) → X, namely so that the diagrams

X
unit / Pfin (X) Pfin (Pfin (X))
Pfin (α)
/ Pfin (X)
α flat α (1.10)
id %   α

X Pfin (X) /X

commute.
2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two commutative idempotent monoids,
with corresponding Pfin -algebras α1 : Pfin (M1 ) → M1 and α2 : Pfin (M2 ) →
M2 . A function f : M1 → M2 is a map of monoids if and only if the rectangle

Pfin (M1 )
Pfin ( f )
/ Pfin (M2 )
α1 α2 (1.11)
 
M1
f
/ M2 .

commutes.

Proof. This works very much like in the proof of Proposition 1.2.6. If (X, u, +)
is a monoid, we define α : Pfin (X) → X by freeness as α({x1 , . . . , xn }) B x1 +
· · · + xn . In the other direction, given α : Pfin (X) → X we define a sum as
x + y B α({x, y}) with unit u B α(∅). Clearly, thi sum + on X is commutative
and idempotent.

This result concentrates on the finite powerset functor Pfin . One can also con-
sider algebras P(X) → X for the (general) powerset functor P. Such algebras
turn the set X into a complete lattice, see [116, 9, 92] for details.

1.3.3 Extraction
So far we have concentrated on how similar lists and subsets are: the only struc-
tural difference that we have seen up to now is that subsets form an idempotent

20
1.3. Subsets 21

and commutative monoid. But there are other important differences. Here we
look at subsets of product sets, also known as relations.
The observation is that one can extract functions from a relation R ⊆ X × Y,
namely functions of the form extr 1 (R) : X → P(Y) and extr 2 (R) : Y → P(X),
given by:
extr 1 (R)(x) = {y ∈ Y | (x, y) ∈ R} extr 2 (R)(y) = {x ∈ X | (x, y) ∈ R}.
In fact, one can easily reconstruct the relation R from extr 1 (R), and also from
extr 2 (R), via:
R = {(x, y) | y ∈ extr 1 (R)(x)} = {(x, y) | x ∈ extr 2 (R)(y)}.
This all looks rather trivial, but such function extraction is less trivial for other
data types, as we shall see later on for distributions, where it will be called
disintegration.
Using the exponent notation from Subsection 1.1.2 we can summarise the
situation as follows. There are two isomorphisms:
P(Y)X  P(X × Y)  P(X)Y . (1.12)
Functions of the form X → P(Y) will later be called ‘channels’ from X to
Y, see Section 1.8. What we have just seen will then be described in terms of
‘extraction of channels’.

1.3.4 Powerset combinatorics


We describe the basic of counting subsets, of finite sets. This will lead to the bi-
nomial and multinomial coefficients that show up in many places in probability
theory.
Definition 1.3.6. For a finite set A we write | A| ∈ N for the number of elements
in A. Then, for an arbitrary set X and a number K ∈ N we define:
P[K](X) B {U ∈ Pfin (X) | |U | = K}.
For a number n ∈ N we also write n for the n-element set {0, 1, . . . , n − 1}. Thus
0 is a also the empty set, and 1 = {0} is a singleton.
We recall that from a categorical perspective the sets 0 and 1 are special,
because they are, respectively, initial and final in the category of sets and func-
tions. We have | A × B| = | A| · | B| and | A + B| = | A| + | B|, where A + B is the
disjoint union (coproduct) of sets A and B. It is well known that | P(A) | = 2| A |
for a finite set A. Below we are interested in the number of subsets of a fixed
size.

21
22 Chapter 1. Collections

The next lemma makes a basic result explicit, partly because it provides
valuable insight in itself, but also because we shall see generalisations later
on, for multisets instead of subsets. We use the familiar binomial and multi-
nomial coefficients. We recall their definitions, for natural numbers k ≤ n and
m1 , . . . , m` with ` ≥ 2 and i mi = m.
P
! !
n n! m m!
B and B . (1.13)
k k! · (n − k)! m1 , . . . , m` m1 ! · . . . · m` !
Recall that a partition of a set X is a disjoint cover: a finite collection of subsets
U1 , . . . , Uk ⊆ X satisfying Ui ∩ U j = ∅ for i , j and i Ui = X. We do not
S
assume that the subsets Ui are non-empty.

Lemma 1.3.7. Let X be a (finite) set with n elements.

1 The binomial coefficient counts the number of subsets of size K ≤ n,


! !
P[K](X) = n , and especially: P[K](n) = n .

K K
2 For K1 , . . . , K` ∈ N with ` ≥ 2 and n = i Ki , the multinomial coefficient
P
counts the number of partitions, of appropriate sizes:

~ ∈ P[K1 ](X) × · · · × P[K` ](X) U1 , . . . , U` is a partition of X

U
!
n
= .
K1 . . . , K`
Each subset U ⊆ X forms a partition (U, ¬U) of X, with its complement
¬U = X \ U. Correspondingly the binomial coefficient can be described as a
multinomial coefficient:
! !
n n
= .
K K, n − K
Proof. 1 We use induction on the number n of elements in X. If n = 0, also
K0= 0 and so P[K](X) contains only the empty set. At the same time, also
0 = 1. Next, let X have n + 1 elements, say X = Y ∪ {y} where Y has n
elements and y < Y.If K = 0 or K = n + 1, there is only one subset in X of
size K, and indeed n+1 0 = 1 = n+1
n+1 . We may thus assume 0 < K ≤ n. A
subset of size K in X is thus either a subset of Y, or it contains y, and is then
determined by a subset of size K − 1. Hence:

P[K](X) = P[K](Y) + P[K − 1](Y)
n+1
! ! !
(IH) n n
= + = , by Exercise 1.3.5.
K K−1 K

22
1.3. Subsets 23

2 We now use induction on m ≥ 2. The case m = 2 has just been covered.


Next:

~ ∈ P[K1 ](X) × · · · × P[Km+1 ](X) U1 , . . . , Um+1 is a partition of X

U

~ ∈ P[K1 ](X) × · · · × P[Km+1 ](X)

= U

U2 , . . . , Um+1 is a partition of X \ U1
! !
(IH) n n − K1
= ·
K1 K2 , . . . , Km+1
!
n
= , by Exercise 1.3.6.
K1 . . . , Km+1
The following result will be useful later.

Lemma 1.3.8. Fix a number m ∈ N. Then:


n
m 1
lim = .
n→∞ nm m!
Proof. Since:
n
m n!
lim = lim
n→∞ nm n→∞ m! · (n − m)! · nm
1 n n−1 n−m+1
= · lim · · ... ·
m! n→∞ n n n
n−m+1
! !
1  n n−1
= · lim · lim · . . . · lim
m! n→∞ n n→∞ n n→∞ n
1
= .
m!

Exercises
1.3.1 Continuing Exercise 1.2.1, compute:
• supp(`1 )
• supp(`2 )
• supp(`1 ++ `2 )
• supp(`1 ) ∪ supp(`2 )
• supp(L( f )(`1 ))
• Pfin ( f )(supp(`1 )).
1.3.2 We have used finite unions (∅, ∪) as monoid structure on P(X) in
Lemma 1.3.1 (1). Intersections (X, ∩) give another monoid structure
on P(X).

23
24 Chapter 1. Collections

1 Show that the negation/complement function ¬ : P(X) → P(X),


given by:
¬U = X \ U = {x ∈ X | x < U},
is a homomorphism of monoids between (P(X), ∅, ∪) and (P, X, ∩),
in both directions. In fact, if forms an isomorphism of monoids,
since ¬¬U = U.
2 Prove that the intersections monoid structure is not preserved by
maps P( f ) : P(X) → P(Y).
Hint: Look at preservation of the unit X ∈ P(X).
1.3.3 Check that in general,

P( f )(U) , U .
Conclude that the powerset functor P does not restrict to a functor
P[K], for K ∈ N.
1.3.4 Check that an ordered partition of a set X of size m can be identified
with a function X → m. This assumes that the partition consists of
non-empty subsets.
1.3.5 1 Prove what is called Pascal’s rule: for 0 < m ≤ n,
n+1
! ! !
n n
= + . (1.14)
m m m−1
2 Use this equation to prove:
X m + i! n+m+1
!
= .
0≤i≤n
m m+1

3 Now use both previous equations to prove:


m+i n+m+2
X ! !
(n + 1 − i) · = .
0≤i≤n
m m+2

Let K1 , . . . , K` ∈ N be given, for ` > 2, with K = i Ki . Show that


P
1.3.6
the multinomial coefficient can be written as a product of binomial
coefficients, via:
! ! !
K K K − K1
= ·
K1 , . . . , K` K1 K2 , . . . , K`
Let K1 , . . . , K` ∈ N be given, for ` ≥ 2, with K = i Ki . Show that:
P
1.3.7

P[K1 ] K  × P[K2 ] K −K1  × · · · × P[K`−1 ] K −K1 − · · · −K`−2 
!
K
= .
K1 . . . , K`

24
1.4. Multisets 25

1.4 Multisets
So far we have discussed two collection data types, namely lists and subsets of
elements. In lists, elements occur in a particular order, and may occur multiple
times (at different positions). Both properties are lost when moving from lists
to sets. In this section we look at multisets, which are ‘sets’ in which elements
may occur multiple times. Hence multisets are in between lists and subsets,
since they do allow multiple occurrences, but the order of their elements is
irrelevant.
The list and powerset examples are somewhat remote from probability the-
ory. But multisets are much more directly relevant: first, because we use a
similar notation for multisets and distributions; and second, because observed
data can be organised nicely in terms of multisets. For instance, for statistical
analysis, a document is often seen as a multiset of words, in which one keeps
track of the words that occur in the document together with their frequency
(multiplicity); in that case, the order of the words is ignored. Also, tables with
observed data can be organised naturally as multisets, see Subsection 1.4.1 be-
low. Learning from such tables will be described in Section 2.1 as a (natural)
transformation from multisets to distributions.
Despite their importance, multisets do not have a prominent role in pro-
gramming, like lists have. Eigenvalues of a matrix form a clear example where
the ‘multi’ aspect is ignored in mathematics: eigenvalues may occur multiple
times, so the proper thing to say is that a matrix has a multiset of eigenvalues.
One reason for not using multisets may be that there is no established notation.
We shall use a ‘ket’ notation | − i that is borrowed from quantum theory, but
interchangeably also a functional notation. Since multisets are less familiar,
we take time to introduce the basic definitions and properties, in Sections 1.4
– 1.7.
We start with an introduction about notation, terminology, and conventions
for multisets. Consider a set C = {R, G, B} for the three colours Red, Green,
Blue. An example of a multiset over C is:

2| Ri + 5|G i + 0| Bi.

In this multiset the element R occurs 2 times, G occurs 5 times, and B occurs
0 times. The latter means that B does not occur, that is, B is not an element
of the multiset. From a multiset perspective, we have 2 + 5 + 0 = 7 elements
— and not just 2. A multiset like this may describe an urn containing 2 red
balls, 5 green ones, and no blue balls. Such multisets are quite common. For
instance, the chemical formula C2 H3 O2 for vinegar may be read as a multiset

25
26 Chapter 1. Collections

2|C i + 3| H i + 2| O i, containing 2 carbon (C) atoms, 3 hydrogen (H) atoms and


2 oxygen (O) atoms.
In a situation where we have multiple data items, say arising from succes-
sive experiments, a basic question to ask is: does the order of the experiments
matter? If so, we need to order the data elements as a list. If the order does not
matter we should use a multiset. More concretely, if six successive experiments
yield data items d, e, d, f, e, d and their order is relevant we should model the
data as the list [d, e, d, f, e, d]. When the order is irrelevant, we can capture the
data as the multiset 3| d i + 2|e i + 1| f i.
The funny brackets | − i are called ket notation; this is frequently used in
quantum theory. Here it is meaningless notation, used to separate the natural
numbers, called multiplicities, and the elements in the multiset.
Let X be an arbitrary set. Following the above ket-notation, a (finite) multiset
over X is an expression of the form:
n1 | x1 i + · · · + nk | xk i where ni ∈ N and xi ∈ X.
This expression is a formal sum, not an actual sum (for instance in R). We may
P
write it as i ni | xi i. We use the convention:
• 0| x i may be omitted; but it may also be written explicitly in order to empha-
sise that the element x does not occur in a multiset;
• a sum n| x i + m| x i is the same as (n + m)| x i;
• the order and brackets (if any) in a sum do not matter.
Thus, for instance, there is an equality of multisets:
2| a i + 5|b i + 0| ci + 4| b i = 9| bi + 2| ai.


There is an alternative description of multisets. A multiset can be defined as


a function ϕ : X → N that has finite support. The support supp(ϕ) ⊆ X is the
set supp(ϕ) = {x ∈ X | ϕ(x) , 0}. For each element x ∈ X the number ϕ(x) ∈ N
indicates how often x occurs in the multiset ϕ. Such a function ϕ can also be
written as a formal sum x ϕ(x)| x i, where x ranges over supp(ϕ).
P
For instance, the multiset 9| b i + 2|a i over A = {a, b, c} corresponds to the
function ϕ : A → N given by ϕ(a) = 2, ϕ(b) = 9, ϕ(c) = 0. Its support is thus
{a, b} ⊆ A, with two elements. The number kϕk of elements in ϕ is 11.
We shall freely switch back and forth between the ket-description and the
function-description of multisets, and use whichever form is most convenient
for the goal at hand.
Having said this, we stretch the idea of a multiset and do not only allow
natural numbers n ∈ N as multiplicities, but also allow non-negative numbers
r ∈ R≥0 . Thus we can have a multiset of the form 23 | a i + π| b i where π ∈ R≥0

26
1.4. Multisets 27

is the famous constant of Archimedes: the ratio of a circle’s circumference


to its diameter. This added generality will be useful at times, although many
examples of multisets will simply have natural numbers as multiplicities. We
call such multisets natural.
Definition 1.4.1. For a set X we shall write M(X) for the set of all multisets
over X. Thus, using the function approach:
M(X) B {ϕ : X → R≥0 | supp(ϕ) is finite}.
The elements of M(X) may be called mass functions, as in [147].
We shall write N(X) ⊆ M(X) for the subset of natural multisets, with natu-
ral numbers as multiplicities — also called bags or urns. Thus, N(X) contains
functions ϕ ∈ M(X) with ϕ(x) ∈ N, for all x ∈ X.
We shall write M∗ (X) for the set of non-empty multisets. Thus:
M∗ (X) B {ϕ ∈ M(X) | supp(ϕ) is non-empty}
= {ϕ : X → R≥0 | supp(ϕ) is finite and non-empty}.
Similarly, N∗ (X) ⊆ M∗ (X) contains the non-empty natural multisets.
For a number K we shall write M[K](X) ⊆ M(X) and N[K](X) ⊆ N(X) for
the subsets of mulisets with K elements. Thus:
M[K](X) B {ϕ ∈ M(X) | kϕk = K} where kϕk B x ϕ(x). (1.15)
P

This expression kϕk gives the size of the multiset, that is, its total number of
elements.
All of M, N, M∗ , N∗ , M[K], N[K] are functorial, in the same way. Hence
we concentrate on M. For a function f : X → Y we can define M( f ) : M(X) →
M(Y). When we see a multiset ϕ ∈ M(X) as an urn containing coloured balls,
with colours from X, then M( f )(ϕ) ∈ M(Y) is the urn with ‘repainted’ balls,
where the new colours are taken from the set Y. The function f : X → Y defines
the transformation of colours. It tells that a ball of colour x ∈ X in ϕ should be
repainted with colour f (x) ∈ Y.
The urn M( f )(ϕ) with repainted balls can be defined in two equivalent ways:
X  X X
M( f ) ri | xi i B ri | f (xi ) i or as M( f )(ϕ)(y) B ϕ(x).
i i
x∈ f −1 (y)

It may take a bit of effort to see that these two descriptions are the same, see
P
Exercise 1.4.1 below. Notice that in the sum i ri | f (xi ) i it may happen that
f (xi1 ) = f (xi2 ) for xi1 , xi2 , so that ri1 and ri2 are added together. Thus, the sup-
P  P
port of M( f ) i ri | xi i may have fewer elements than the support of i ri | xi i,
P  P
but the sum of all multiplicities is the same in M( f ) i ri | xi i and i ri | xi i,
see Exercise 1.5.2 below.

27
28 Chapter 1. Collections

Applying M to a projection function πi : X1 ×· · ·× Xn → Xi yields a function


M(πi ), from the set M(X1 ×· · ·×Xn ) of multisets over a product to the set M(Xi )
of multisets over a component. This M(πi ) is called a marginal function or
simply a marginal. It computes what is ‘on the side’, in the marginal of a table,
as will be illustrated in Subsection 1.4.1 below.
Multisets, like lists and subset form a monoid. In terms of urns with coloured
balls, taking the sum of two multisets corresponds to pouring the balls from
two urns into a new urn.

Lemma 1.4.2. 1 The set M(X) of multisets over X is a commutative monoid.


In functional form, addition and zero (identity) element 0 ∈ M(X) are de-
fined as:

(ϕ + ψ)(x) B ϕ(x) + ψ(x) 0(x) B 0.

These sums restrict to N(X).


2 The set M(X) is also a cone: it is closed under ‘scalar’ multiplication with
non-negative numbers r ∈ R≥0 , via:

(r · ϕ)(x) B r · ϕ(x).

This scalar multiplication r · (−) : M(X) → M(X) preserves the sums (0, +)
from the previous item, and is thus a map of monoids.
3 For each f : X → Y, the function M( f ) : M(X) → M(Y) is a map of
monoids and also of cones. The latter means: M( f )(r · ϕ) = r · M( f )(ϕ).

The fact that M( f ) preserves sums can be understood informally as follows.


If we have two urns, we can first combine their contents and then repaint ev-
erything. Alternatively, we can first repaint the balls in the two urns separately,
and then throw them together. The result is the same, in both cases.
The element 0 ∈ M(X) used in item (1) is the empty multiset, that is, the
urn containing no balls. Similarly, the sum of multisets + is implicit in the ket-
notation. The set N(X) of natural multisets is not closed in general under scalar
multiplication with r ∈ R≥0 . It is closed under scalar multiplication with n ∈ N,
but such multiplications add nothing new since they can also be described via
repeated addition.

1.4.1 Tables of data as multisets


Let’s assume that a group of 36 children in the age range 0 − 10 is participat-
ing in some study, where the number of children of each age is given by the
following table.

28
1.4. Multisets 29

0 1 2 3 4 5 6 7 8 9 10

2 0 4 3 5 3 2 5 5 2 4
We can represent this table as a natural multiset over the set of ages {0, 1, . . . , 10}.
2| 0i + 4| 2 i + 3| 3 i + 5|4 i + 3| 5i + 2| 6i + 5| 7 i + 5| 8 i + 2|9 i + 4| 10 i.

Notice that there is no summand for age 1 because of our convention to ommit
expressions like 0| 1 i with multiplicity 0. We can visually represent the above
age data/multiset in the form of a histogram:

(1.16)

Here is another example, not with numerical data, in the form of ages, but
with nominal data, in the form of blood types. Testing the blood type of 50
individuals gives the following table.

A B O AB

10 15 18 7
This corresponds to a (natural) multiset over the set {A, B, O, AB} of blood
types, namely to:
10| A i + 15| Bi + 18| Oi + 7| ABi.

It gives rise to the following bar graph, in which there is no particular ordering
of elements. For convenience, we follow the order of the above table.

(1.17)

29
30 Chapter 1. Collections

Next, consider the two-dimensional table (1.18) below where we have com-
bined numeric information about blood pressure (either high H, or low L) and
certain medicines (either type 1, type 2, or no medicine, indicated as 0). There
is data about 100 study participants:

no medicine medicine 1 medicine 2 totals

high 10 35 25 70
(1.18)
low 5 10 15 30

totals 15 45 40 100

We claim that we can capture this table as a (natural) multiset. To do so, we


first form sets B = {H, T } for blood pressure values, and M = {0, 1, 2} for types
of medicine. The above table can then be described as a natural multiset τ over
the product set/space B × M, that is, as an element τ ∈ N(B × M), namely:

τ = 10| H, 0 i + 35| H, 1 i + 25| H, 2 i + 5| L, 0i + 10| L, 1 i + 15| L, 2 i.

Such a multiset can be plotted in three dimensions as:

We see that Table (1.18) contains ‘totals’ in its vertical and horizontal mar-
gins. They can be obtained from τ as marginals, using the functoriality of N.
This works as follows. Applying the natural multiset functor N to the two pro-
jections π1 : B × M → B and π2 : B × M → M yields marginal distributions on

30
1.4. Multisets 31

B and M, namely:
N(π1 )(τ) = 10|π1 (H, 0)i + 35| π1 (H, 1)i + 25| π1 (H, 2)i
+ 5| π1 (L, 0)i + 10| π1 (L, 1)i + 15| π1 (L, 2)i
= 10| H i + 35| H i + 25| H i + 5| L i + 10| L i + 15| L i
= 70| H i + 30| L i.
N(π2 )(τ) = (10 + 5)| 0i + (35 + 10)| 1 i + (25 + 15)| 2 i
= 15|0 i + 45| 1 i + 40|2 i.
The expression ‘marginal’ is used to describe such totals in the margin of a
multidimensional table. In Section 2.1 we describe how to obtain probabilities
from tables in a systematic manner.

1.4.2 Unit and flatten for multisets


As may be expected by now, there are also unit and flatten maps for multisets.
The unit function unit : X → M(X) is simply unit(x) B 1| x i. Flattening in-
volves turning a multiset of multisets into a multiset. Concretely, this is done
as:
 
flat 31 2|a i + 2| ci + 5 1| b i + 61 | ci = 23 | ai + 5| b i + 32 | c i.

More generally, flattening is the function flat : M(M(X)) → M(X) with:


P  X P 
flat i ri | ϕi i B i ri · ϕi (x) x .
x
P P
Notice that the big outer sum x is a formal one, whereas the inner sum i is
an actual one, in R≥0 , see the earlier example.
The following result, about unit and flatten, does not come as a surprise
anymore. We formulate it for general multisets M, but it restricts to natural
multisets N.
Lemma 1.4.3. 1 For each function f : X → Y the two rectangles

X
unit / M(X) M(M(X))
flat / M(X)
f M( f ) M(M( f )) M( f )
   
Y / M(Y) M(M(Y)) / M(Y)
unit flat

commute.
2 The next two diagrams also commute.

M(X)
unit / M(M(X)) M(unit)
o M(X) M(M(M(X)))
flat / M(M(X))
M(flat)
 flat   flat
M(X) M(M(X)) / M(X)
flat

31
32 Chapter 1. Collections

The next result shows that natural multisets are free commutative monoids.
Arbitrary multisets are also free, but for other algebraic structures, see Exer-
cise 1.4.9.
Proposition 1.4.4. Let X be a set and (M, 0, +) a commutative monoid. Each
function f : X → M has a unique extension to a homomorphism of monoids
f : (N(X), 0, +) → (M, 0, +) with f ◦ unit = f . The diagram below captures
the situation, where the dashed arrow is used for uniqueness.

X
unit / N(X)

f , homomorphism (1.19)
f
* M

Proof. One defines:


 
f n1 | x1 i + · · · + nk | xk i B n1 · f (x1 ) + · · · + nk · f (xk ).
where we write n · a for the n-fold sum a + · · · + a in a monoid.

The unit and flatten operations for (natural) multisets can be used to cap-
ture commutative monoids more precisely, in analogy with Propositions 1.2.6
and 1.3.5
Proposition 1.4.5. Let X be an arbitrary set.

1 A commutative monoid structure (u, +) on X corresponds to an N-algebra


α : N(X) → X making the two diagrams below commute.

X
unit / N(X) N(N(X))
N(α)
/ N(X)
α flat α (1.20)
id %   α

X N(X) /X

2 Let (M1 , u1 , +1 ) and (M2 , u2 , +2 ) be two commutative monoids, with corre-


sponding N-algebras α1 : N(M1 ) → M1 and α2 : N(M2 ) → M2 . A function
f : M1 → M2 is a map of monoids if and only if the rectangle

N(M1 )
N( f )
/ N(M2 )
α1 α2 (1.21)
 
M1
f
/ M2 .

commutes.
Proof. Analogously to the proof Proposition 1.2.6: if (X, u, +) is a commu-
tative monoid, we define α : N(X) → X by turning formal sums into actual

32
1.4. Multisets 33

sums: α( i ni | xi i) B i ni · xi , see (the proof of) Proposition 1.4.4. In the


P P
other direction, given α : N(X) → X we define a sum as x + y B α(1| x i + 1| y i)
with unit u B α(0). Obviously, + is commutative.

1.4.3 Extraction
At the end of the previous section we have seen how to extract a function
(channel) from a binary subset, that is, from a relation. It turns out that one
can do the same for a binary multiset, that is, for a table. More specifically, in
terms of exponents, there are isomorphisms:

M(Y)X  M(X × Y)  M(X)Y . (1.22)

This is analogous to (1.12) for powerset.


How does this work in detail? Suppose we have an arbitary multiset/table
σ ∈ M(X × Y). From σ one can extract a function extr 1 (σ) : X → M(Y), and
also extr 2 (σ) : Y → M(X), via a cap:
X X
extr 1 (σ)(x) = σ(x, y)| y i extr 2 (σ)(y) = σ(x, y)| x i.
y∈Y x∈X

Notice that we are — conveniently — mixing ket and function notation for
multisets. Conversely, σ can be reconstructed from extr 1 (σ), and also from
extr 2 (σ), via σ(x, y) = extr 1 (σ)(x)(y) = extr 2 (σ)(y)(x).
Functions of the form X → M(Y) will also be used as channels from X to
Y, see Section 1.8. That’s why we often speak about ‘channel extraction’.
As illustration, we apply extraction to the medicine - blood pressure Ta-
ble 1.18, described as the multiset τ ∈ M(B × M). It gives rise to two channels
extr 1 (τ) : B → M(M) and extr 2 (τ) : M → M(B). Explicitly:
X
extr 1 (τ)(H) = τ(H, x)| x i = 10| 0 i + 35| 1i + 25| 2 i
x∈M
X
extr 1 (τ)(L) = τ(L, x)| x i = 5| 0i + 10| 1 i + 15| 2i.
x∈M

We see that this extracted function captures the two rows of Table 1.18. Simi-
larly we get the columns via the second extracted function:

extr 2 (τ)(0) = 10| L i + 5| H i


extr 2 (τ)(1) = 35| L i + 10| H i
extr 2 (τ)(2) = 25| L i + 15| H i.

33
34 Chapter 1. Collections

Exercises
1.4.1 In the setting of Exercise 1.2.1, consider the multisets ϕ = 3| ai +
2| bi + 8| c i and ψ = 3|b i + 1| ci. Compute:
• ϕ+ψ
• ψ+ϕ
• M( f )(ϕ), both in ket-formulation and in function-formulation
• idem for M( f )(ψ)
• M( f )(ϕ + ψ)
• M( f )(ϕ) + M( f )(ψ).
1.4.2 Consider, still in the context of Exercise 1.2.1, the ‘joint’ multiset
ϕ ∈ M(X × Y) given by ϕ = 2|a, ui + 3| a, v i + 5| c, vi. Determine the
marginals M(π1 )(ϕ) ∈ M(X) and M(π2 )(ϕ) ∈ M(Y).
1.4.3 Consider the chemical equation for burning methane:

CH4 + 2 O2 −→ CO2 + 2 H2 O.

Check that there is an underlying equation of multisets:


 
flat 1 1|C i + 4| H i + 2 2| O i

 
= flat 1 1|C i + 2| Oi + 2 2| H i + 1|O i .

It expresses the law of conservation of mass.


1.4.4 Check that M(id ) = id and M(g ◦ f ) = M(g) ◦ M( f ).
1.4.5 Show that for each natural number K the mappings X 7→ M[K](X)
and X 7→ N[K](X) are functorial. Notice the difference with P[K],
see Exercise 1.3.3.
1.4.6 Prove Lemma 1.4.3.
1.4.7 Consider both Lemma 1.4.3 and Proposition 1.4.5.
1 Notice that an abstract way of seeing that N(X) is a commutative
monoid is via the properties of the flatten map flat : N N(X)) →
N(X).
2 Notice also that this flatten map is a homomorphism of monoids.
1.4.8 Verify that the support map supp : M(X) → Pfin (X) commutes with
extraction functions, in the sense that the following diagram com-
mutes.
M(X × Y)
supp
/ Pfin (X × Y)
extr 1
  extr 1
M(Y)X / Pfin (Y)X
supp X

34
1.5. Multisets in summations 35

Equationally, this amounts to showing that for τ ∈ M(X × Y) and


x ∈ X one has:
extr 1 supp(τ) (x) = supp extr 1 (τ)(x) .
 

Here we use that supp X ( f ) B supp ◦ f , so that supp X ( f )(x) =


supp( f (x)).
1.4.9 In Proposition 1.4.4 we have seen that natural multisets N(X) form
free commutative monoids. What about general multisets M(X)? They
form free cones. Briefly, a cone is a commutative monoid M with
scalar multiplication r · (−) : M → M, for each r ∈ R≥0 , that is a
homomorphism of monoids. It is like a vector space, not over all re-
als, but only over the non-negative reals. Homomorphisms of cones
preserve such scalar multiplications.
Let X be a set and (M, 0, +) a cone, with a function f : X → M.
Prove that there is a unique homomorphism of cones f : M(X) → M
with f ◦ unit = f .

1.5 Multisets in summations


The current and next section will dive deeper into the use of natural multisets,
in combinatorics as preparation for later use. This section will focus on the
use of multisets in summations, such as the multinomial theorem — which is
probably best known in binary form, as the binomial theorem for expanding
sums (a + b)n . We shall describe various extensions, both for finite and infinite
sums. All material in this section is standard, but its presentation in terms of
multisets is not.
Recall that a multiset ϕ = i ri | xi i is called natural when all the multiplici-
P
ties ri are natural numbers. Such a multiset is also called a bag or an urn. Urns
are often used as illustrations in simple probabilistic arguments, starting with:
what is the probability of drawing two red balls — with or without replacement
— from an urn with initially three red balls and two blue ones. In this book we
shall describe such an urn as the multiset 3| R i + 2| Bi. In such arguments one
often encounters elementary combinatorial questions about (combinations of)
numbers of balls in urns. This section provides some standard results, building
on Subsection 1.3.4, where subsets are counted — instead of multisets.
In this section we focus on counting with multisets, in particular in (infinite)
sums of powers. The next section focuses on counting multisets themselves,
where we ask, for instance, how many multisets of size K are there on a set
with n elements?

35
36 Chapter 1. Collections

There are several ways to associate a natural number with a multiset ϕ. For
instance, we can look at the size of its support | supp(ϕ) | ∈ N, or at its size, as
total number of elements kϕk = x ϕ(x) ∈ R>0 . This size is a natural number
P
when ϕ is a natural multiset. Below we will introduce several more such num-
  multisets ϕ, namely ϕ and ( ϕ ), and later on also a binomial
bers for natural
coefficient ψϕ .

Definition 1.5.1. 1 For two multisets ϕ, ψ ∈ N(X) we write:

ϕ ≤ ψ ⇐⇒ ∀x ∈ X. ϕ(x) ≤ ψ(x).

When ϕ ≤ ψ we define subtraction ψ−ϕ of multisets as the obvious multiset,


defined pointwise as: ψ − ϕ (x) = ψ(x) − ϕ(x).


2 For an additional number K ∈ N we use:

ϕ ≤K ψ ⇐⇒ ϕ ∈ N[K](X) and ϕ ≤ ψ.

3 For a collection of numbers r = (r x ) x∈X we write:

rϕ(x)
Y
rϕ B x = ~r ϕ .
x∈X

The latter vector notation is appropriate in a situation with a particular order.


4 The factorial ϕ of a natural multiset ϕ ∈ N(X) is the product of the factorial
of its multiplicities:
Y
ϕ B ϕ(x)! (1.23)
x∈supp(ϕ)

5 The muliset coefficient ( ϕ ) is defined as:


!
kϕk! kϕk! kϕk
(ϕ) B = Q = .
ϕ x ϕ(x)! ϕ(x1 ), . . . , ϕ(xn )
The latter formulation, using the multinomial coefficient (1.13), assumes
that the support of ϕ is ordered as [x1 , . . . , xn ].

For instance,
5!
(3| R i + 2| Bi) = 3! · 2! = 12 and ( 3|R i + 2| Bi ) = = 10.
12
The multinomial coefficient (ϕ ) in item (5) counts the number of ways of
putting kϕk items in supp(ϕ) = {x1 , . . . , xn } urns, with the restriction that ϕ(xi )
items go into urn xi . Alternatively, ( ϕ ) is the numbers of partitions (Ui ) of a
 N| Ui | = ϕ(xi ).
set X with kϕk elements, where
The traditional notation m1 ,...,mk
for multinomial coefficients in (1.13) is

36
1.5. Multisets in summations 37

suboptimal for two reasons: first, the number N is superflous, since it is deter-
mined by the mi as N = i mi ; second, the order of the mi is irrelevant. These
P
disadvantages are resolved by the multiset variant ( ϕ ). It has our preference.
We recall the recurrence relations:
! ! !
K−1 K−1 K
+ ··· + = (1.24)
k1 − 1, . . . , kn k1 , . . . , kn − 1 k1 , . . . , kn
for multinomial coefficients. A snappy re-formulation, for a natural multiset ϕ,
is: X
(ϕ − 1| x i ) = (ϕ ). (1.25)
x∈supp(ϕ)

Multinomial coefficients (1.13) are useful, for instance in the Multinomial


Theorem (see e.g. [145]):
!
K X K
r1 + · · · + rn = · r1k1 · . . . · rnkn . (1.26)
k , k =K
P k1 , . . . , kn
i i i

An equivalent formulation using multisets is:


X
r1 + · · · + rn K = ( ϕ ) · ~r ϕ .

(1.27)
ϕ∈M[K]({1,...,n})

There is an ‘infinite’ version of this result, known as the (Binomial) Series


Theorem. It holds more generally than formulated in the first item below, for
complex numbers, with adapted meaning of the binomial coefficient, but that’s
beyond the current scope.
Theorem 1.5.2. Fix a natural number K.
1 For a real number r ∈ [0, 1),
X n + K! 1
· rn = .
n≥0
K (1 − r)K+1
2 As a special case,
X 1
rn = .
n≥0
1−r
3 Another consequence is, still for r ∈ [0, 1),
X r
n · rn = .
n≥1
(1 − r)2
4 For r1 , . . . , rm ∈ [0, 1] with i ri < 1,
P

K + i ni
P ! Y
X 1
· rini = P K+1 .
n , ..., n ≥0
K, n1 , . . . , nm i (1 − i ri )
1 m

37
38 Chapter 1. Collections

Equivalently,
K + kϕk
!
X 1
· ( ϕ ) · ~r ϕ = P K+1 .
ϕ∈N({1,...,m})
K (1 − i ri )

P (n)
Proof. 1 The equation arises as the Taylor series f (x) = n f n!(0) · xn of the
function f (x) = (1−x)1 K+1 . One can show, by induction on n, that the n-th
derivative of f is:
(n + K)! 1
f (n) (x) = · .
K! (1 − x)n+K+1
2 The second equation is a special case of the first one, for K = 0. There is also
a simple direct proof. Define sn = r0 + r1 + · · · + rn . Then sn − r · sn = 1 − rn+1 ,
n+1
so that sn = 1−r 1
1−r . Hence sn → 1−r as n → ∞.
3 We choose to use the first item, but there are other ways to prove this result,
see Exercise 1.5.10.
r X n + 1!
= r · · rn by item (1), with K = 1
(1 − r)2 n≥0
1
X X
= (n + 1) · rn+1 = n · rn .
n≥0 n≥1

4 The trick is to turn the multiple sums into a single ‘leading’ one, in:
K + i ni
X P ! Y
· r ni
n1 , ..., nm ≥0
K, n 1 , . . . , n m i i

K+n
X X ! Y
= · r ni
n≥0 n1 , ..., nm , i ni =n
P K, n 1 , . . . , nm i i

K+n
! ! Y
X X n
= · · r ni
n≥0 n1 , ..., nm , i ni =n
P K n 1 , . . . , nm i i

X K + n! X n
! Y
= · · r ni
n≥0
K n1 , ..., nm , i ni =n
P n 1 , . . . , nm i i

(1.26)
X K + n! P 
n
= · i ri
n≥0
K
1
= P K+1 , by item (1).
(1 − i ri )

Exercises
1.5.1 Consider the function f : {a, b, c} → {0, 1} given by f (a) = f (b) = 1
and f (c) = 0.

38
1.5. Multisets in summations 39

1 Take the natural multiset ϕ = 1| a i + 3|b i + 1| ci ∈ N({a, b, c}) and


compute consecutively:
• (ϕ)
• N( f )(ϕ)
• ( N( f )(ϕ) ).
Conclude that ( ϕ ) , ( N( f )(ϕ) ), in general.
2 Now take ψ = 2| 0 i + 3|1 i ∈ N({0, 1}).
• Compute ( ψ).
• Show that there are four multisets ϕ1 , ϕ2 , ϕ3 , ϕ4 ∈ N({a, b, c})
with M( f )(ϕi ) = ψ, for each i.
• Check that ( ψ) , ( ϕ1 ) + ( ϕ2 ) + ( ϕ3 ) + (ϕ4 ).
What is the general formulation now?
1.5.2 Check that:
1 the size map k − k : M(X) → R≥0 is a homomorphism of monoids,
preserving rescaling — and thus a homomorphism of cones, see
Exercise 1.4.9;
2 kM( f )(ϕ)k = kϕk.
1.5.3 Show that for natural multisets ϕ, ψ ∈ N(X),

ϕ ≤ ψ ⇐⇒ ∃ϕ0 ∈ N(X). ϕ + ϕ0 = ψ.

1.5.4 Let Ψ ∈ N(N(X)) be given with ψ = flat(Ψ). Show that for ϕ ∈


supp(Ψ) one has ϕ ≤ ψ and flat Ψ − 1|ϕ i = ψ − ϕ.


1.5.5 Let ϕ be a natural multiset.


1 Show that:
X ϕ
= kϕk.
x∈supp(ϕ)
(ϕ − 1| x i)

ϕ ϕ(y)!
X X Q
y
=
(ϕ(x) − 1)! · y,x ϕ(y)!
Q
x∈supp(ϕ)
(ϕ − 1| x i) x∈supp(ϕ)
X ϕ(x)!
=
x∈supp(ϕ)
(ϕ(x) − 1)!
X
= ϕ(x)
x∈supp(ϕ)
= kϕk.

2 Derive the recurrence relation (1.25) from this equation.

39
40 Chapter 1. Collections

1.5.6 Prove the multiset formulation (1.27) of the Multinomial Theorem,


by induction on K.
1.5.7 Let K ≥ 0 and let set X have n ≥ 1 elements. Prove that:
X
( ϕ ) = nK .
ϕ∈N[K](X)

This generalises
 the well known sum-formula for binomial coeffi-
cients: 0≤k≤K Kk = 2K , for n = 2.
P

Hint: Use n = 1 + · · · + 1 in (1.27).


1.5.8 Let ϕ be a natural multiset. Show that:
1 ( ϕ ) = 1 ⇐⇒ supp(ϕ) is a singleton;
2 ( ϕ ) = kϕk! ⇐⇒ ∀x ∈ supp(ϕ). ϕ(x) = 1 ⇐⇒ ϕ consists of single-
tons.
1.5.9 Let n ≥ 1 and r ∈ (0, 1). Show that:
X k−1!
· rn · (1 − r)k−n = 1.
k≥n
n−1

1.5.10 Elaborate the details of the following two (alternative) proofs of the
equation in Theorem 1.5.2 (3).
1 Use the derivate ddr on both sides of Theorem 1.5.2 (2).
2 Write s B n≥1 n · rn and exploit the following recursive equation.
P

s = r + 2r2 + 3r3 + 4r4 + · · ·


= r + (1 + 1)r2 + (1 + 2)r3 + (1 + 3)r4 + · · ·
= r + r2 + r3 + r4 + · · · + r · r + 2r2 + 3r3 + · · ·
 
r
= + r · s, by Theorem 1.5.2 (2).
1−r
1.5.11 In the proof of Theorem 1.5.2 we have used Taylor’s formula for
a single-variable function. For multi-variable functions we can use
multisets for a compact, ‘multi-index’ formulation. For an an n-ary
function f (x1 , . . . , xn ) and a natural multiset ϕ ∈ N({1, . . . , n}) write:
ϕ(1) ϕ(n)
∂ ϕ f B ∂x1 · · · ∂xn f.

Check that Taylor’s expansion formula (around 0 ∈ Rn ) then be-


comes:
X (∂ ϕ f )(0)
f (~x) = · ~x ϕ .
ϕ∈N({1,...,n})
ϕ

40
1.6. Binomial coefficients of multisets 41

1.6 Binomial coefficients of multisets



Binomial coefficients nk for numbers n ≥ k are a standard tool in many ar-
eas of mathematics, see Subsection 1.3.4 for the definition and the most basic
properties. Here we
ψ extend binomial coefficients from numbers to natural mul-
tisets: we define ϕ for natural multisets ψ, ϕ with ψ ≥ ϕ. In the next section
 
we shall also look into the less familiar ‘multichoose’ coefficients mn and
their extension to multisets ψϕ . The coefficients ψϕ and ψϕ will play an
     

important role in (multivariate) hypergeometric and Pólya distributions.

Definition 1.6.1. For natural multisets ϕ, ψ ∈ N(X) with ϕ ≤ ψ, the multiset


binomial is defined as:
ψ ψ
!
B
ϕ ϕ · (ψ − ϕ)
(1.28)
x ψ(x)!
Q Y ψ(x)!
= Q  = .
x ϕ(x)! · x (ψ(x) − ϕ(x) !) ϕ(x)
 Q
x∈supp(ψ)

For instance:
3| Ri + 2| Bi
! ! !
3! · 2! 3 2
=  = 6 = 3·2 = · .
2| Ri + 1| Bi

2! · 1! · 1! · 1! 2 1
This describes the number of possible ways of drawing 2|R i + 1| Bi from an
urn 3|R i + 2| Bi.
The following result is a generalisation of Vandermonde’s formula.

Lemma 1.6.2. Let ψ ∈ N(X) be a multiset of size L = kψk, with a number


K ≤ L. Then:
X ψϕ
 
X ψ! L
!
= so that  L  = 1.
ϕ≤ ψ
ϕ K ϕ≤ ψ
K K K

These fractions adding up to one will form the probabilities of the so-called
hypergeometric distributions, see Example 2.1.1 (3) and Section 3.4 later on.

Proof. We use induction on the number of elements in supp(ψ). We go through


some initial values explicitly. If the number of elements is 0, then ψ = 0 and
so L = 0 = K and ϕ ≤K ψ means ϕ = 0, so that the result holds. Similarly, if
supp(ψ) is a singleton, say {x}, then L = ψ(x). For K ≤ L and ϕ ≤K ψ we get
supp(ϕ) = {x} and K = ϕ(x). The result then obviously holds.
The case where supp(ψ) = {x, y} captures the ordinary form of Vander-
monde’s formula. We reformulate it for numbers B, G ∈ N and K ≤ B + G.

41
42 Chapter 1. Collections

Then:
B+G
! ! !
X B G
= · . (1.29)
K b≤B, g≤G, b+g=K
b g

Intuitively: if you select K children out of B boys and G girls, the number of
options is given by the sum over the options for b ≤ B boys times the options
for g ≤ G girls, with b + g = K.
  can be proven by induction on G. When G = 0 both
The equation (1.29)
sides amount to KB so we quickly proceed to the induction step. The case
K = 0 is trivial, so we may assume K > 0.
X B G+1
b · g
b≤B, g≤G+1, b+g=K
     B  G+1  B  G+1  B  G+1
= KB · G+1 0 + K−1 · 1 + · · · + K−G · G + K−G−1 · G+1
 B  G  B  G  B  G
= K · 0 + K−1 · 1 + K−1 · 0
 B  G  B   G   B  G
+ · · · + K−G · G + K−G · G−1 + K−G−1 · G
X B G X B G
= b · g + b · g
b≤B, g≤G, b+g=K b≤B, g≤G, b+g=K−1
(IH)   B+G
= B+G K + K−1
(1.14)  B+G+1
= K .
For the induction step, let supp(ψ) = {x1 , . . . , xn , y}, for n ≥ 2. Writing
` = ψ(y), L0 = L − ` and ψ0 = ψ − `|y i ∈ N[L0 ](X) gives:
ψ ψ(x) ` ψ (xi )
X   X Y   X X  Y  0 
ϕ = ϕ(x) x
= n · ϕ(xi ) i
ϕ≤K ψ ϕ≤K ψ n≤` ϕ≤K−n ψ0
(IH)
X `   L−`  (1.29)  L 
= n · K−n = K.
n≤`, K−n≤L−`

For a multiset ϕ we have already used the ‘support’ definition supp(ϕ) =


{x | ϕ(x) , 0}. This yields a map supp : M(X) → Pfin (X), which is well-
behaved, in the sense that it is natural and preserves the monoid structures on
M(X) and Pfin (X).
We have also seen a support map from lists L to finite powerset Pfin . This
support map factorises through multisets, as described with a new function acc
in the following triangle.

L(X)
supp
/ Pfin (X)
B
(1.30)
acc , N(X) supp

The ‘accumulator’ map acc : L(X) → N(X) counts (accumulates) how many

42
1.6. Binomial coefficients of multisets 43

times an element occurs in a list, while ignoring the order of occurrences. Thus,
for a list ` ∈ L(X),

acc(`)(x) B n if x occurs n times in the list α. (1.31)

Alternatively,

acc x1 , . . . , xn = 1| x1 i + · · · + 1| xn i.


Or, more concretely, in an example:

acc a, b, a, b, c, b, b = 2| ai + 4| bi + 1| c i.


The above diagram (1.30) expresses an earlier informal statement, namely that
multisets are somehow in between lists and subsets.
The starting point in this section is the question: how many (ordered) se-
quences of coloured balls give rise to a specific urn? More technically, given a
natural multiset ϕ, how many lists ` statisfy acc(`) = ϕ? In yet another form,
what is the size | acc −1 (ϕ) | of the inverse image?
As described in the beginning of this section, one aim is to relate the multiset
coefficient (ϕ ) of a multiset ϕ to the number of lists that accumlate to ϕ, as
defined in (1.31). Here we use a K-ary version of accumulation, for K ∈ N,
restricted to K-many elements. It then becomes a mapping:

XK
acc[K]
/ N[K](X). (1.32)

The parameter K will often be omitted when it is clear from the context. We are
now ready for a basic combinatorial / counting result. It involves the multiset
coefficient from Definition 1.5.1 (5).

Proposition 1.6.3. For ϕ ∈ N[K](X) one has:



( ϕ ) = acc −1 (ϕ) = the number of lists ` ∈ X K with acc(`) = ϕ.

Proof. We use induction on the number of elements of the support supp(ϕ) of


the multiset ϕ. If this number is 0, then ϕ = 0, with ( 0) = 1. And indeed, there
is precisely one list ` with acc(`) = 0, namely the empty list [].
Next suppose that supp(ϕ) = {x1 , . . . , xn , xn+1 }. Take m B ϕ(xn+1 ) and ϕ0 B
ϕ − m| xn+1 i so that kϕ0 k = K − m and supp(ϕ0 ) = {x1 , . . . , xn }. By the induction
hypothesis there are ( ϕ0 )-many lists `0 ∈ X K−m with acc(`0 ) = ϕ0 . Each such
list `0 can be extended to a list ` with acc(`) = ϕ by m times adding xn+1 to `0 .
How many such additions are there? It is not hard to see that this number of

43
44 Chapter 1. Collections
K 
additions is m . Thus:
!
K
acc −1 (ϕ) = · ( ϕ0 )
m
K! (K − m)!
= ·Q
i≤n ϕ (xi )!
m!(K − m)! 0

K!
= Q since m = ϕ(xn+1 ) and ϕ0 (xi ) = ϕ(xi )
i≤n+1 ϕ(xi )!
= (ϕ)

We conclude this part with a remarkable fact, combining multinomial co-


efficients with the flatten operation for multisets, and also with accumulation.
We start with an illustration for which we recall from Subsection 1.4.2 that

the flatten operation has type flat : M M(X) → M(X). When we restrict it to
natural multisets and include sizes K, L ∈ N it takes the form:

/ M[L·K](X)
  flat
M[L] M[K](X) (1.33)

Now suppose a natural multiset is given:



 K = 3, L = 2

ψ = 3|a i + 1| bi + 2| c i ∈ M[2 · 3](X)

for
 X = {a, b, c}.

It’s multinomial coefficient is ( ψ) = 3!·1!·2!


6!
= 60.
We now ask ourselves the question: which Ψ ∈ N[L] N[K](X) satisfy

flat(Ψ) = ψ? A little reflection shows that there are three such Ψ, namely:

Ψ1 = 1 2|a i + 1| ci + 1 1| a i + 1| b i + 1|c i


Ψ2 = 1 2| a i + 1| bi + 1 1| a i + 2| c i


Ψ3 = 1 3|a i + 1 1| bi + 2| c i .

Next we are going to compute the expression ( Ψ ) · ϕ ( ϕ )Ψ(ϕ) in each case.


Q
Y
( Ψ1 ) · (ϕ )Ψ1 (ϕ) = 2 · 31 · 61 = 36
ϕ
Y
( Ψ2 ) · (ϕ )Ψ2 (ϕ) = 2 · 31 · 31 = 18
ϕ
Y
( Ψ3 ) · (ϕ )Ψ3 (ϕ) = 2 · 11 · 31 = 6
ϕ

We then find that the sum of these outcomes equals the coefficient of ψ:

( ψ) = 60
     
Y Y Y
= ( Ψ1 )· (ϕ )Ψ1 (ϕ) + (Ψ2 )· ( ϕ )Ψ2 (ϕ) + ( Ψ3 )· ( ϕ )Ψ3 (ϕ) .
ϕ ϕ ϕ

44
1.6. Binomial coefficients of multisets 45

The general result is formulated in Theorem 1.6.5 below. For its proof we need
an intermediate step.

Lemma 1.6.4. Let ψ ∈ N[L](X) be a natural multiset of size L ∈ N, and let


K ≤ L.

1 For ϕ ≤K ψ one has:


 ψ
(ϕ) · (ψ − ϕ) ϕ
= L .
( ψ)
K

2 Now:
X
( ψ) = ( ϕ ) · ( ψ − ϕ ).
ϕ≤K ψ

The earlier equation (1.25) is a special case, for K = 1.

Proof. 1 Because:
(ϕ) · (ψ − ϕ) K! (L − K)! ψ
= · ·
( ψ) ϕ (ψ − ϕ) L!
 ψ
K! · (L − K)! ψ ϕ
= · = L .
L! ϕ · (ψ − ϕ)
K

2 By the previous item and Vandermonde’s formula from Lemma 1.6.2:


P  ψ L
X ϕ≤K ψ ϕ K
( ϕ ) · ( ψ − ϕ ) = ( ψ) · L = ( ψ) ·  L  = ( ψ).
ϕ≤K ψ K K

Theorem 1.6.5. Each ψ ∈ N[L · K](X), for numbers L, K ≥ 1, satisfies:


X Y
( ψ) = (Ψ) · (ϕ )Ψ(ϕ) .
ϕ
Ψ∈N[L](N[K](X))
flat(Ψ) = ψ

This equation turns out to be essential for proving that multinomial distribu-
tions are suitably closed under composition, see Theorem 3.3.6 in Chapter 3.

Proof. Since ψ ∈ N[L · K](X) we can apply Lemma 1.6.4 (2) L-many times,
giving:
X X X
( ψ) = ··· ( ϕ1 ) · ( ϕ2 ) · . . . · ( ϕ L )
ϕ1 ≤K ψ ϕ2 ≤K ψ−ϕ1 ϕL ≤K ψ−ϕ1 −···−ϕL−1
X Y
= ( ϕi ).
i
ϕ1 , ..., ϕL ≤K ψ
ϕ1 + ··· +ϕL = ψ

45
46 Chapter 1. Collections

We can accumulate the sequence of multisets ϕ1 , . . . , ϕL ∈ N[K] into a multi-


set of multisets:

Ψ B acc ϕ1 , . . . , ϕL = 1| ϕ1 i + · · · + 1| ϕL i ∈ N[L] N[K](X) .


 

The flatten map preserves sums of multisets, see Exercise 1.4.7, and thus maps
Ψ to ψ, via the flatten-unit law of Lemma 1.4.3.
 
flat Ψ = flat 1|ϕ1 i + · · · + 1| ϕL i


= flat 1| ϕ1 i + · · · + flat 1| ϕL i
 

= flat unit(ϕ1 ) + · · · + flat unit(ϕL ) = ϕ1 + · · · + ϕL = ψ.


 

By summing over the accumulation Ψ ∈ N[L] N[K](X) with flat(Ψ) = ψ,



instead of over the sequences ϕ1 , . . . , ϕL ≤K ψ with i ϕi = ψ we have that
P
into account that (Ψ)-many sequences accumulate to the same Ψ, see Proposi-
tion 1.6.3. This explains the factor ( Ψ) in the above theorem.

Exercises
K 
= 2K to:
P
1.6.1 Generalise the familiar equation 0≤k≤K k

X ψ!
= 2kψk .
ϕ≤ψ
ϕ

1.6.2 Show that kacc(`)k = k`k, using the length k`k of a list ` from Exer-
cise 1.2.3.
1.6.3 Let ϕ ∈ N[K](X) and ψ ∈ N[L](X) be given.
1 Prove that:
K+L
K
( ϕ + ψ) = ϕ+ψ · ( ϕ ) · ( ψ).
ϕ

2 Now assume that ϕ, ψ have disjoint supports, that is, supp(ϕ) ∩


supp(ψ) = ∅. Show that now:
K +L
!
( ϕ + ψ) = · (ϕ ) · ( ψ),
K
1.6.4 Consider the K-ary accumulator function (1.32), for K > 0. Check
that acc is permutation-invariant, in the sense that for each permuta-
tion π of {1, . . . , K} one has:

acc x1 , . . . , xK = acc xπ(1) , . . . , xπ(K) .


 

46
1.6. Binomial coefficients of multisets 47

1.6.5 Check that the accumulator map acc : L(X) → M(X) is a homomor-
phism of monoids. Do the same for the support map supp : M(X) →
Pfin (X).
1.6.6 Prove that the accumulator and support maps acc : L(X) → M(X)
and supp : M(X) → Pfin (X) are natural: for an arbitrary function
f : X → Y both rectangles below commute.

L(X)
acc / M(X) supp
/ Pfin (X)
L( f ) M( f ) Pfin ( f )
  
L(Y) / M(Y) / Pfin (Y)
acc supp

1.6.7 Convince yourself that the following composite


acc / / N[L·K](X)
  flat
N[K](X)L N[L] N[K](X)

is the L-fold sum of multisets — see (1.33) for the fixed size flatten
operation.

1.6.8 In analogy with the powerset operator, with type P : P(X) → P P(X) ,

a powerbag operator PB : N(X) → N N(X) is introduced in [114]
(see also [35]). It can be defined as:
X ψ!
PB(ψ) B
ϕ .
ϕ≤ψ
ϕ

1 Take X = {a, b} and show that:

PB 1| ai + 3| bi


= 1 0 + 1 1| ai + 3 1| b i + 3 1| ai + 1| b i + 3 2|b i


+ 3 1|a i + 2| bi + 1 3| b i + 1 1| ai + 3| bi .

2 Check that one can compute the powerbag of ψ as follows. Take


a list of elements that accumulate to ψ, such as [a, b, b, b] in the
previous item. Take the accumulation of all subsequences.
1.6.9 For N ≥ 2 many natural multisets ϕ1 , . . . , ϕN ∈ N(X), with ψ B
i ϕi , define a multinomial coefficient of multisets as:
P

ψ ψ
!
B .
ϕ1 , . . . , ϕ N ϕ1 · . . . · ϕ N
1 Check that for N ≥ 3, in analogy with Exercise 1.3.6,
ψ ψ ψ − ϕ1
! ! !
= ·
ϕ1 , . . . , ϕN ϕ1 ϕ2 , . . . , ϕ N

47
48 Chapter 1. Collections

For K1 , . . . , KN ∈ N write K = Ki and assume that ψ ∈ N[K](X)


P
2 i
is given. Show that:
ψ
! !
X K
= .
ϕ1 , . . . , ϕ N K1 , . . . , KN
ϕ1 ≤K1 ψ,
P ..., ϕN ≤KN ψ,
i ϕi =ψ

1.7 Multichoose coefficents


We proceed with another counting challenge. Let X be a finite set, say with n
elements. How many elements are there then in N[K](X)? That is, how many
multisets of size K are there over n elements? This is sometimes formulated
informally as: how many ways are there to divide K balls over n urns? It is
the multiset-analogue of Lemma 1.3.7,  where the number of subsets of size
K of an n-element set is identified as Kn . Below we show that the answer for
 
multisets is given by the multiset number or multichoose number Kn , see
e.g. [145]. We introduce this new number below, and immediately extend its
standard definition from numbers to multisets, in analogy with the extension
of (ordinary) binomials to multisets in Definition 1.5.1. We first need some
preparations before coming to the multiset counting result in Proposition 1.7.4.

Definition 1.7.1. 1 For numbers n ≥ 1 and K ≥ 0, put:


!! ! ! !
n n+K −1 n+K −1 (n+K −1)! n n+K
B = = = · . (1.34)
K K n−1 K! · (n−1)! n+K n
 
This Kn is sometimes called multichoose.
2 Let ψ, ϕ be natural multisets on the same space, where ψ is non-empty and
supp(ϕ) ⊆ supp(ψ).
ψ ψ(x) Y ψ(x) + ϕ(x) − 1!
!! Y !!
B = . (1.35)
ϕ x∈supp(ψ)
ϕ(x) x∈supp(ψ)
ϕ(x)

 5 the set N[3]({a, b, c}) of multisets of size 3 over {a, b, c}. It has
3Consider
3 = 3 = 2 = 10 elements, namely:
4·5

3| ai, 3| bi, 3| ci, 2| ai + 1| bi, 2| ai + 1| c i, 1| a i + 2| b i,


2|b i + 1| ci, 1| ai + 2| c i, 1| b i + 2| c i, 1| a i + 1| b i + 1|c i.
 
Our aim is to prove that the multiset number Kn is the total number of
multisets of size K over a non-empty n-element set. We use the following mul-
tichoose analogue of Vandermonde’s (binary) formula (1.29).

48
1.7. Multichoose coefficents 49

Lemma 1.7.2. Fix B ≥ 1 and G ≥ 1. For all K on has:

B+G
!! !! !!
X B G
= · . (1.36)
K 0≤k≤K
k K−k

In particular:

B+K B+1
! !! !! !!
X B X B
= = = . (1.37)
K K 0≤k≤K
k 0≤k≤K
K−k

Proof. The second equation


  (1.37)
 easily follows from the first one by taking
G = 1 and using that 1n = nn = 1.
We shall make frequent use of the following equation, whose proof is left to
the reader (in Exercise 1.7.5) below.

  n+1  n+1 
n
K+1 + K = K+1 . (∗)

We shall prove the first equation (1.36) in the lemma by induction on B ≥1. In
both the base case B = 1 and the induction step we shall use induction on K.
We shall try to keep the structure clear by using nested bullets.

• We first prove Equation (1.36) for B = 1, by induction on K.

– When K = 0 both sides in (1.36) are equal to 1.


– Assume Equation (1.36) holds for K (and B = 1).
X     X     X  
1 G
k · (K+1)−k = G
K−(k−1) = G
K+1 + G
K−`
0≤k≤K+1 0≤k≤K+1 0≤`≤K
(IH)   G+1
= G
K+1 + K
(∗) G+1
= K+1 .

• Now assume Equation (1.36) holds for B (for all G, K). In order to show that
it then also holds for B + 1 we use induction on K.

– When K = 0 both sides in (1.36) are equal to 1.

49
50 Chapter 1. Collections

– Now assume that Equation (1.36) holds for K, and for B. Then:

X    
B+1 G
k · (K+1)−k
0≤k≤K+1
  X    
= G
K+1 + B+1 G
k+1 · K−k
0≤k≤K
(∗)   X h   B+1 i  G 
= G
K+1 + B
k+1 + k · K−k
0≤k≤K
  X    G  X    
= G
K+1 + B
k+1 · K−k + B+1
k
G
· K−k

(IH, K)
X  0≤k≤K
  G 
0≤k≤K
(B+1)+G
= B
k · (K+1)−k + K
0≤k≤K+1
(IH, B) B+G (B+1)+G
= K+1 + K
(∗) (B+1)+G
= K+1 .

We now get the double-bracket analogue of Lemma 1.6.2.

Proposition 1.7.3. Let ψ be a non-empty natural multiset. Write X = supp(ψ)


and L = kψk. Then, for each K ∈ N,

ψ
ψ
!! !!
X L X ϕ
= so  L  = 1.
ϕ∈N[K](X)
ϕ K ϕ∈N[K](X) K

The fractions in this equation will show up later in so-called Pólya distri-
butions, see Example 2.1.1 (4) and Section 3.4. These fractions capture the
probability of drawing a multiset ϕ from an urn ψ when for each drawn ball an
extra ball is added to the urn (of the same colour).

Proof. We use induction on the number of elements in the support X of ψ,


like in the proof of Lemma 1.6.2. By assumption X cannot be empty, so the
induction starts when X is a singleton, say X = {x}. But then ψ(x) = kψk = L
and ϕ(x) = kϕk = K, so the result obviously holds.
Now let supp(ψ) = X ∪ {y} where y < X and X is not empty. Write:

L = kψk ` = ψ(y) > 0 ψ0 = ψ − `| y i L0 = L − ` > 0.

50
1.7. Multichoose coefficents 51

By construction X = supp(ψ0 ) and L0 = kψ0 k. Now:


ψ (1.35) ψ(x)
X !! X Y !!
=
ϕ∈N[K](X∪{y})
ϕ ϕ∈N[K](X∪{y}) x∈X∪{y}
ϕ(x)
ψ(y) Y ψ(x)
X X !! !!
= ·
0≤k≤K ϕ∈N[K−k](X)
k x∈X
ϕ(x)
X `!! X ψ
!!
= ·
0≤k≤K
k ϕ∈N[K−k](X) ϕ
` L0
!! !!
(IH)
X
= ·
0≤k≤K
k K−k
`+L 0
!! !!
(1.36) L
= = .
K K
We finally come to our multiset counting result. It is the multiset-analogue
of Proposition 1.3.7 (1) for subsets.
Proposition 1.7.4. Let X be a non-empty  with n ≥ 1 elements. The number
 set
of natural multisets of size K over X is Kn , that is:

n+K−1
!! !
n
N[K](X) = = .
K K
  n
Proof. The statement holds for K = 0 since there is precisely 1 = n−1 0 = 0
multiset set of 0, namely the empty multiset 0. Hence we may assume K ≥ 1,
so that Lemma 1.7.2 can be used.
We proceed
K  by 
induction
 on n ≥ 1. For n = 1 the statement holds since there
is only 1 = K = K multiset of size K over 1 = {0}, namely K| 0i.
1

The induction step works as follows. Let X have n elements, and y < X. For
a multiset ϕ ∈ N[K] X ∪ {y} there are K + 1 possible multiplicities ϕ(n). If

ϕ(n) = k, then the number possibilities for ϕ(0), . . . , ϕ(n − 1) is the number of
multisets in N[K −k](X). Thus:
 X
N[K] X ∪ {y} = N[K −k](X)
0≤k≤K !!
(IH)
X n
=
0≤k≤K
K−k
n+1
!!
= , by Lemma 1.7.2.
K
There is also a visual proof of this result, described in terms of stars and bars,
see e.g. [47, II, proof of (5.2)], where multiplicities of multisets are described
in terms of ‘occupancy numbers’.

51
52 Chapter 1. Collections

An associated question is: given a fixed element a in an n-element set X,


what is the sum of all multiplicities ϕ(a), for multisets ϕ over X with size K?
Lemma 1.7.5. For an element a ∈ X, where X has n ≥ 1 elements,
!!
X K n
ϕ(a) = · .
ϕ∈N[K](X)
n K

Proof. When we sum over a we get by Proposition 1.7.4:


X X X X X  
ϕ(a) = ϕ(a) = K = K · Kn .
a∈X ϕ∈N[K](X) ϕ∈N[K](X) a∈X ϕ∈N[K](X)

Since a ∈ X is arbitrary, the outcome should be the same for a different b ∈ X.


Hence we have to divide by n, giving the equation in the lemma.
We look at one more counting problem, namely the number of multisets
below a given multiset. Recall from Definition 1.5.1 (1) the pointwise order ≤
on multisets.
Definition 1.7.6. 1 For multisets ϕ, ψ on the same set X we define the strict
order < as:
ϕ < ψ ⇐⇒ ϕ ≤ ψ and ϕ , ψ.
We say that ϕ is fully below ψ, written as ϕ ≺ ψ when for each element
the number of occurrences in ϕ is strictly below (less) than the number of
occurrences in ψ. Thus:

ϕ ≺ ψ ⇐⇒ ∀x ∈ X. ϕ(x) < ψ(x).


2 For ϕ ∈ N(X) we use downset notation in the following way:

↓ ϕ = {ψ ∈ N(X) | ψ ≤ ϕ} and ↓= ϕ = {ψ ∈ N(X) | ψ < ϕ}.


Notice that < is not the same as ≺, as shown by this example:
2| a i + 1| b i + 2| c i < 2|a i + 2| bi + 2| c i
2| a i + 1| b i + 2| c i ⊀ 2|a i + 2| bi + 2| c i.
Proposition 1.7.7. The number of multisets strictly below a natural multiset ϕ
can be described as:
X Y
↓= ϕ = ϕ(x). (1.38)
∅,U⊆supp(ϕ) x∈U

The proof below lends itself to an easy implementation, see Exercise 1.7.12
below. It uses a (chosen) order on the elements of the support of the multiset.
The above formulation shows that the outcome is independent of such an order.

52
1.7. Multichoose coefficents 53

Proof. We use induction on the size | supp(ϕ) | of the support of ϕ. If this size is
1, then ϕ is of the form m| x1 i, for some number m. Hence there are m multisets
strictly below ϕ, namely 0, 1| x1 i, . . . , (m − 1)| x1 i. Since supp(ϕ) = {x1 }, the
only non-empty subset U ⊆ supp(ϕ) is the singleton U = {x1 } and x∈U ϕ(x) =
Q
ϕ(x1 ) = m.
Next assume that supp(ϕ) = X ∪ {y} with y 6 X is of size n + 1. We write
m = ϕ(y) and ϕ0 = ϕ − m| y i so that supp(ϕ0 ) = X. The number M = | ↓= ϕ0 | is
then given by the formula in (1.38), with ϕ0 instead of ϕ. We count the multisets
strictly below ϕ as follows.

• For each 0 ≤ i ≤ m each ψ < ϕ0 gives ψ + i| y i < ϕ; this gives N · (m + 1)


multisets below ϕ;
• But for each 0 ≤ i < m also ϕ0 + i| y i, which gives another m multisets below
ϕ.
We thus get (m + 1) · N + m multisets ϕ, so that:

↓= ϕ = ϕ(y) + 1 · ↓= ϕ0 + ϕ(y). (∗)
Now let’s look at the subset formulation in (1.38). Each non-empty subset
U ⊆ X ∪ {y} is either a subset of X, or of the form U = V ∪ {y} with V ⊆ X
non-empty, or contains only y. Hence:
   
X Y  X Y   X Y 
ϕ(x) =  ϕ(x) +  ϕ(y) · ϕ(x) + ϕ(y)
∅,U⊆supp(ϕ) x∈U ∅,U⊆X x∈U ∅,U⊆X x∈U
   
 X Y   X Y 
=  ϕ0 (x) + ϕ(y) ·  ϕ0 (x) + ϕ(y)
∅,U⊆X x∈U ∅,U⊆X x∈U
(IH)
= ↓= ϕ + ϕ(y) · ↓= ϕ + ϕ(y)
0 0

(*)
= ↓= ϕ .

Exercises
1.7.1 Check that N[K](1) has one element by Proposition 1.7.4. Describe
this element. How many elements are there in N[K](2)? Describe
them all.
1.7.2 Let ϕ, ψ be natural multiset on the same finite set X.
1 Show that:

ϕ ≺ ψ ⇐⇒ ϕ + 1 ≤ ψ ⇐⇒ ϕ ≤ ψ − 1,
where 1 =
P
x∈X 1| x i is the multiset of singletons on X.

53
54 Chapter 1. Collections

2 Now let ψ ≥ 1. Show that one has:


ψ ψ+ϕ−1
!! !
= .
ϕ ϕ
3 Conclude, analogously to Lemma 1.6.4 (1), that:
ψ
( ϕ ) · (ψ − 1 ) ϕ
=  L  .
(ψ + ϕ − 1)
K

1.7.3 Let N ≥ 0 and M ≥ m ≥ 1 and be given. Use the multichoose Van-


dermonde Equation (1.37) to prove:
X m!! M − m + N − i! N+M
!
· = .
0≤i≤N
i N−i N

1.7.4 Let n ≥ 1 and m ≥ 0.


1 Show that:
n+1 n+1
!! !! !!
n
= + .
m+1 m m+1
2 Generalise this to:
n+k n+i
!! X k! !!
= · .
m+k 0≤i≤k
i m+k−i

1.7.5 Prove the following properties.


 n−k   m−k 
1 m−(k+1) = n−(k+1)
 n  m n+m
2 m + n = n
 n   n+1 
3 m · m = n · m−1 .
n+1  n   n 
4 n · m = (n+m) · m = (m+1) · m+1 .
1.7.6 1 Show that for numbers m ≤ n − 1,
! ! !
n−1 n n
n· = (m+1) · = (n−m) · .
m m+1 m
2 Show similarly that for natural multisets ϕ, ψ with x ∈ supp(ψ) and
ϕ ≤ ψ − 1| x i,
ψ−1| x i ψ ψ
! ! !
ψ(x) · = (ϕ(x)+1) · = (ψ(x)−ϕ(x)) · .
ϕ ϕ+1| x i ϕ
3 For x ∈ supp(ϕ) ⊆ supp(ψ),
ψ ψ+1| x i
!! !!
ϕ(x) · = ψ(x) · .
ϕ ϕ−1| x i

54
1.7. Multichoose coefficents 55

4 For supp(ϕ) ⊆ supp(ψ) and x ∈ supp(ψ),


ψ+1| x i ψ ψ
!! !! !!
ψ(x) · = (ϕ(x)+ψ(x)) · = (ϕ(x)+1) · .
ϕ ϕ ϕ+1| x i
1.7.7 Let n ≥ 1 and m ≥ 1.
1 Show that:
!! !!
X n m
= .
j<m
j n

2 Prove next:
n+m
!! X !! !
X m n
+ = .
i<n
i j<m
j n

1.7.8 1 Extend Exercise 1.3.5 (2) to: for k ≥ 1,


X k + n + j! k+m+1
!!
k
!!
= − .
0≤ j≤m
n n+1 n+1

Hint: One can use induction on m.


2 Show that one also has:
X k!! m + 1 + n − i! k+m+1
!!
k
!!
· = − .
0≤i≤n
i m n+1 n+1

Hint: Use the multichoose version (1.36) of Vandermonde’s for-


mula.
1.7.9 Show that for 0 < k ≤ m + 1,
X k − i!! n + m − k + 1! n+m
!
· = .
i<k
i m−i n

1.7.10 Check that Theorem 1.5.2 (1) can be reformulated as: for a real num-
ber x ∈ (0, 1) and K ≥ 1,
X K !! 1
· xn =
n≥0
n (1 − x)K

1.7.11 Let s ∈ [0, 1] and n, m ≥ 1.


1 Prove first the following auxiliary result.
 
X n+1!! 1 X n!! n+1
!! 
·s =
j
·  j
·s − · s  .
m

j<m
j (1− s) j<m j m−1

55
56 Chapter 1. Collections

2 Take r = 1 − s so that r + s = 1 and prove:


X n!! X m!!
n
r · ·s + s ·
j m
· ri = 1.
j<m
j i<n
i

3 Show also that:


!! !!
X n m 1
· si + · ri = n m .
i≥0
m+i n+i r ·s

1.7.12 Let ϕ be a natural multiset with support {x1 , . . . , xn }. We assume that


this support is ordered as a list [x1 , . . . , xn ]. Use the proof of Proposi-
tion 1.7.7 to show that the number | ↓= ϕ | of multisets strictly below ϕ
can be computed via the following algorithm.

result = 0
for x ∈ [x1 , . . . , xn ] :
result B (ϕ(x) + 1) ∗ result + ϕ(x)
return result

1.8 Channels
The previous sections covered the collection types of lists, subsets, and multi-
sets, with much emphasis on the similarities between them. In this section we
will exploit these similarities in order to introduce the concept of channel, in a
uniform approach, for all of these collection types at the same time. This will
show that these data types are not only used for certain types of collections,
but also for certain types of computation. Much of the rest of this book builds
on the concept of a channel, especially for probabilistic distributions, which
are introduced in the next chapter. The same general approach to channels that
will be described in this section will work for distributions.
Let T be one of the collection functors L, P, or M. A state of type Y is an
element ω ∈ T (Y); it collects a number of elements of Y in a certain way. In this
section we abstract away from the particular type of collection. A channel, or
sometimes more explicitly T -channel, is a collection of states, parameterised
by a set. Thus, a channel is a function of the form c : X → T (Y). Such a
channel turns an element x ∈ X into a certain collection c(x) of elements of
Y. Just like an ordinary function f : X → Y can be seen as a computation, we
see a T -channel as a computation of type T . For instance, a channel X → P(Y)
is a non-deterministic computation and a channel X → M(Y) is a resource-
sensitive computation.

56
1.8. Channels 57

When it is clear from the context what T is, we often write a channel using
functional notation, as c : X → Y, with a circle on the shaft of the arrrow.
Notice that a channel 1 → X from the singleton set 1 = {0} can be identified
with a state on X.
Definition 1.8.1. Let T ∈ {L, P, M}, each of which is functorial, with its own
flatten operation, as described in previous sections.
1 For a state ω ∈ T (X) and a channel c : X → T (Y) we can form a new state
c = ω of type Y. It is defined as:

c = ω B flat ◦ T (c) (ω)



where T (X)
T (c)
/ T (T (Y)) flat / T (Y).

This operation c = ω is called state tranformation, sometimes with addi-


tional along the channel c. In functional programming it is called bind1 .
2 Let c : X → Y and d : Y → Z be two channels. Then we can compose them
and get a new channel d ◦· c : X → Z via:

d ◦· c (x) B d = c(x) d ◦· c = flat ◦ T (d) ◦ c.



so that
We first look at some examples of state transformation.
Example 1.8.2. Take X = {a, b, c} and Y = {u, v}.
1 For T = L an example of a state ω ∈ L(X) is ω = [c, b, b, a]. An L-channel
f : X → L(Y) can for instance be given by:

f (a) = [u, v] f (b) = [u, u] f (c) = [v, u, v].


State transformation f = ω amounts to ‘map list’ with f and then flattening.
It turns a list of lists into a list, as in:
f = ω = flat L( f )(ω) = flat [ f (c), f (b), f (b), f (a)]
 

= flat [[v, u, v], [u, u], [u, u], [u, v]]




= [v, u, v, u, u, u, u, u, v].
2 We consider the analogous example for T = P. We thus take as state ω =
{a, b, c} and as channel f : X → P(Y) with:

f (a) = {u, v} f (b) = {u} f (c) = {u, v}.


1 Our bind notation c = ω differs from the one used in the functional programming language
Haskell; there one writes the state first, as in ω = c. For us, the channel c acts on the state ω
and is thus written before the argument. This is in line with standard notation f (x) in
mathematics, for a function f acting on an argument x. Later on, we shall use predicate
transformation c = p along a channel, where we also write the channel first, since it acts on
the predicate p. Similarly, in categorical logic the corresponding pullback (or substitution) is
written as c∗ (p) with the channel before the predicate.

57
58 Chapter 1. Collections

Then:
 [
f = ω = flat P( f )(ω) =

f (a), f (b), f (c)
[
= {u, v}, {u}, {u, v} = {u, v}.

3 For multisets, a state in M(X) could be of the form ω = 3| ai + 2|b i + 5| c i


and a channel f : X → M(Y) could have:

f (a) = 10| u i + 5| v i f (b) = 1|u i f (c) = 4| u i + 1| v i.

We then get as state transformation:

f = ω = flat M( f )(ω)


= flat 3| f (a)i + 2| f (b) i + 5| f (c) i




= flat 3 10|u i + 5| vi + 2 1| u i + 5 4| ui + 1| v i


= 30| ui + 15| v i + 2| ui + 20| u i + 5| v i


= 52| ui + 20| v i.

We shall mostly be using multiset — and probabilistic channels, as special


case — and so we explicitly describe state transformation = in these cases. So
let c : X → M(Y) be an M-channel. Transformation of a state ω of type X can
be described as:
X
(c = ω)(y) = c(x)(y) · ω(x). (1.39)
x∈X

Equivalently, we can describe the transformed state c = ω as a formal sum:


 
X X 
c = ω =  c(x)(y) · ω(x) y .

(1.40)
y∈Y x∈X

We now prove some general properties about state transformation and about
composition of channels, based on the abstract description in Definition 1.8.1.

Lemma 1.8.3. 1 Channel composition ◦· has a unit, namely unit : Y → Y, so


that:
unit ◦· c = c and d ◦· unit = d,

for all channels c : X → Y and d : Y → Z. Another way to write the second


equation is: d = unit(y) = d(y).
2 Channel composition ◦· is associative:

e ◦· (d ◦· c) = (e ◦· d) ◦· c,

for all channels c : X → Y, d : Y → Z and e : Z → W.

58
1.8. Channels 59

3 State tranformation via a composite channel is the same as two consecutive


transformations:

d ◦· c = ω = d = c = ω .
 

4 Each ordinary function f : Y → Z gives rise to a ‘trivial’ or ‘deterministic’


channel  f  B unit ◦ f : Y → Z. This construction − satisfies:

 f  = ω = T ( f )(ω),

where T is the type of channel involved. Moreover:

g ◦·  f  = g ◦ f   f  ◦· c = T ( f ) ◦ c d ◦·  f  = d ◦ f,

for all functions g : Z → W and channels c : X → Y and d : Y → W.

Proof. We can give generic proofs, without knowing the type T ∈ {L, P, M}
of the channel, by using earlier results like Lemma 1.2.5, 1.3.2, and 1.4.3. No-
tice that we carefully distinguish channel composition ◦· and ordinary function
composition ◦.

1 Both equations follow from the flat − unit law. By Definition 1.8.1 (2):

unit ◦· c = flat ◦ T (unit) ◦ c = id ◦ c = c.

For the second equation we use naturality of unit in:

d ◦· unit = flat ◦ T (d) ◦ unit = flat ◦ unit ◦ d = id ◦ d = d.

2 The proof of associativity uses naturality and also the commutation of flatten
with itself (the ‘flat − flat law’), expressed as flat ◦ flat = flat ◦ T (flat).

e ◦· (d ◦· c) = flat ◦ T (e) ◦ (d ◦· c)
= flat ◦ T (e) ◦ flat ◦ T (d) ◦ c
= flat ◦ flat ◦ T (T (e)) ◦ T (d) ◦ c by naturality of flat
= flat ◦ T (flat) ◦ T (T (e)) ◦ T (d) ◦ c by the flat − flat law
=

flat ◦ T flat ◦ T (e) ◦ d ◦ c by functoriality of T
=

flat ◦ T e ◦· d ◦ c
= (e ◦· d) ◦· c

59
60 Chapter 1. Collections

3 Along the same lines:


d ◦· c = ω = flat ◦ T (d ◦· c) (ω)
 

= flat ◦ T (flat ◦ T (d) ◦ c) (ω)




= flat ◦ T (flat) ◦ T (T (d)) ◦ T (c) (ω)



by functoriality of T
= flat ◦ flat ◦ T (T (d)) ◦ T (c) (ω)

by the flat − flat law
= flat ◦ T (d) ◦ flat ◦ T (c) (ω)

by naturality of flat
  
= flat ◦ T (d) flat ◦ T (c) (ω)


= flat ◦ T (d) c = ω
 

= d = c = ω .


4 All these properties follow from elementary facts that we have seen before:
 f  = ω = flat ◦ T (unit ◦ f ) (ω)


= flat ◦ T (unit) ◦ T ( f ) (ω)



by functoriality of T
= T ( f )(ω) by a flat − unit law
g ◦·  f  = flat ◦ T (unit ◦ g) ◦ unit ◦ f
= flat ◦ unit ◦ (unit ◦ g) ◦ f by naturality of unit
= unit ◦ g ◦ f by a flat − unit law
= g ◦ f 
 f  ◦· c = flat ◦ T (unit ◦ f ) ◦ c
= flat ◦ T (unit) ◦ T ( f ) ◦ c by functoriality of T
= T(f) ◦ c
d ◦·  f  = flat ◦ T (d) ◦ unit ◦ f
= flat ◦ unit ◦ d ◦ f by naturality of unit
= d◦ f by a flat − unit law.
In the sequel we often omit writing the brackets − that turn an ordinary
function f : X → Y into a channel  f . For instance, in a state transformation
f = ω, it is clear that we use f as a channel, so that the expression should be
read as  f  = ω.

Exercises
1.8.1 For a function f : X → Y define an inverse image P-channel f −1 : Y →
X by:
f −1 (y) B {x ∈ X | f (x) = y}.
Prove that:
(g ◦ f )−1 = f −1 ◦· g−1 and id −1 = unit.

60
1.9. The role of category theory 61

1.8.2 Notice that a state of type X can be identified with a channel 1 → T (Y)
with singleton set 1 as domain. Check that under this identification,
state transformation c = ω corresponds to channel composition c ◦· ω.
1.8.3 Let f : X → Y be a channel.
1 Prove that if f is a Pfin -channel, then the state transformation func-
tion f = (−) : Pfin (X) → Pfin (Y) can also be defined via freeness,
namely as the unique function f in Proposition 1.3.4.
2 Similarly, show that f = (−) = f when f is an M-channel, as in
Exercise 1.4.9.
1.8.4 1 Describe how (non-deterministic) powerset channels can be reversed,
via a bijective correspondence between functions:
X −→ P(Y)
===========
Y −→ P(X)
(A description of this situation in terms of ‘daggers’ will appear in
Example 7.8.1.)
2 Show that for finite sets X, Y there is a similar correspondence for
multiset channels.

1.9 The role of category theory


The previous sections have highlighted several structural properties of, and
similarities between, the collection types list, powerset, multiset — and later
on also distribution. By now readers may ask: what is the underlying structure?
Surely someone must have axiomatised what makes all of this work!
Indeed, this is called category theory. It provides a foundational language
for mathematics, which was first formulated in the 1950s by Saunders Mac
Lane and Samuel Eilenberg (see the first overview book [118]). Category the-
ory focuses on the structural aspects of mathematics and shows that many
mathematical constructions have the same underlying structure. It brings for-
ward many similarities between different areas (see e.g. [119]). Category the-
ory has become very useful in (theoretical) computer science too, since it in-
volves a clear distinction between specification and implementation, see books
like [7, 10, 113, 138]. We refer to those sources for more information.
The role of category theory in capturing the mathematical essentials and
estabilishing connections also applies to probability theory. William Lawvere,
another founding father of the area, first worked in this direction. Lawvere him-
self published little on this approach to probability theory, but his ideas can be

61
62 Chapter 1. Collections

found in e.g. the early notes [111]. This line of work was picked up, extended,
and published by his PhD student Michèle Giry. Her name continues in the
‘Giry monad’ G of continuous probability distributions, see Section ??. The
precise source of the distribution monad D for discrete probability theory, that
will be introduced in Section 2.1 in the next chapter, is less clear, but it can be
regarded as the discrete version of G. Probabilistic automata have been studied
in categorical terms as coalgebras, see Chapter ??, and e.g. [152] and [70] for
general background information on coalgebra. There is a recent surge in inter-
est in more foundational, semantically oriented studies in probability theory,
through the rise of probabilistic programming languages [59, 153], probabilis-
tic Bayesian reasoning [28, 89], and category theory ??. Probabilistic methods
have received wider attention, for instance, via the current interest in data an-
alytics (see the essay [4]), in quantum probability [128, 25], and in cognition
theory [64, 151].
Readers who know category theory will have recognised its implicit use in
earlier sections. For readers who are not familiar (yet) with category theory,
some basic concepts will be explained informally in this section. This is in no
way a serious introduction to the area. The remainder of this book will continue
to make implicit use of category theory, but will make this usage increasingly
explicit. Hence it is useful to know the basic concepts of category, functor,
natural transformation, and monad. Category theory is sometimes seen as a
difficult area to get into. But our experience is that it is easiest to learn category
theory by recognising its concepts in constructions that you already know. That
is why this chapter started with concrete descriptions of various collections and
their use in channels. For more solid expositions of category theory we refer to
the sources listed above.

1.9.1 Categories
A category is a mathematical structure given by a collection of ‘objects’ with
‘morphisms’ between them. The requirements are that these morphisms are
closed under (associative) composition and that there is an identity morphism
on each object. Morphisms are also called ‘maps’, and are written as f : X →
Y, where X, Y are objects and f is a homomorphism from X to Y. It is tempting
to think of morphisms in a category as actual functions, but there are plenty of
examples where this is not the case.
A category is like an abstract context of discourse, giving a setting in which
one is working, with properties of that setting depending on the category at
hand. We shall give a number of examples.

62
1.9. The role of category theory 63

1 There is the category Sets, whose objects are sets and whose morphisms are
ordinary functions between them. This is a standard example.
2 One can also restrict to finite sets as objects, in the category FinSets, with
functions between them. This category is more restrictive, since for instance
it contains objects n = {0, 1, . . . , n − 1} for each n ∈ N, but not N itself. Also,
Q
in Sets one can take arbitrary products i∈I Xi of objects Xi , over arbitrary
index sets I, whereas in FinSets only finite products exist. Hence FinSets is
a more restrictive world.
3 Monoids and monoid maps have been mentioned in Definition 1.2.1. They
can be organised in a category Mon, whose objects are monoids, and whose
homomorphisms are monoid maps. We now have to check that monoid maps
are closed under composition and that identity functions are monoid maps;
this is easy. Many mathematical structures can be organised into categories
in this way, where the morphisms preserve the relevant structure. For in-
stance, one can form a category PoSets, with partially ordered sets (posets)
as objects, and monotone functions between them as morphisms (also closed
under composition, with identity).
4 For T ∈ {L, P, M} we can form the category Chan(T ). Its objects are arbi-
trary sets X, but its morphisms X to Y are T -channels, X → T (Y), written as
X → Y. We have already seen that channels are closed under composition ◦·
and have unit as identity, see Lemma 1.8.3. We can now say that Chan(T )
is a category.
These categories of channels form good examples of the idea that a cate-
gory forms a universe of discourse. For instance, in Chan(P) we are in the
world of non-deterministic computation, whereas Chan(M) is the world of
computation in which resources are counted.

We will encounter several more examples of categories later on in the book.


Occasionally, the following construction will be used. Given a category C, a
new ‘opposite’ category Cop is formed. It has the same objects as C, but its
morphisms are reversed. Thus f : Y → X in Cop means f : X → Y in C.
Also, given two categories C, D one can form a product category C × D. Its
objects are pairs (X, A) with X an object in C and A an object in D. Similarly,
an arrow (X, A) → (Y, B) in C × D is given by a pair ( f, g) of arrows f : X → Y
in C and g : A → B in D.

1.9.2 Functors
Category theorists like abstraction, hence the question: if categories are so im-
portant, then why not organise them as objects themselves in a superlarge cat-

63
64 Chapter 1. Collections

egory Cat, with morphisms between them preserving the relevant structure?
The latter morphisms between categories are called ‘functors’. More precisely,
given categories C and D, a functor F : C → D between them consists of two
mappings, both written F, sending an object X in C to an object F(X) in D,
and a morphism f : X → Y in C to a morphism F( f ) : F(X) → F(Y) in D.
This mapping F should preserve composition and identities, as in: F(g ◦ f ) =
F(g) ◦ F( f ) and F(id X ) = id F(X) .
Earlier we have already called some operations ‘functorial’ for the fact that
they preserve composition and identities. We can now be a bit more precise.

1 Each T ∈ {L, P, Pfin , M, M∗ , N, N∗ , N[K]} is a functor T : Sets → Sets.


This has been described in the beginning of each of the sections 1.2 – 2.1.
2 Taking lists is also a functor L : Sets → Mon. This is in essence the content
of Lemma 1.2.2. One can also view P, Pfin and M as functors Sets → Mon,
see Lemmas 1.3.1 and 1.4.2. Moreover, one can describe P, Pfin as a functor
Sets → PoSets, by considering each set of subsets P(X) and Pfin (X) with
its subset relation ⊆ as partial order. In order to verify this claim one has
to check that P( f ) : P(X) → P(Y) is a morphism of posets, that is, forms a
monotone function. But that is easy.
3 There is also a functor J : Sets → Chan(T ), for each T . It is the identity
on sets/objects: J(X) B X. But it sends a functon f : X → Y to the channel
J( f ) B  f  = unit ◦ f : X → Y. We have seen, in Lemma 1.8.3 (4), that
J(g ◦ f ) = J(g) ◦ J( f ) and that J(id ) = id , where the latter identity id is
unit in the category Chan(T ). This functor J shows how to embed the world
of ordinary computations (functions) into the world of compuations of type
T (channels).
4 Taking the product of two sets can be described as a functor × : Sets×Sets →
Sets. Its action on morphisms was already described at the end of Subsec-
tion 1.1.1, see also Exercise 1.1.3.
5 If we have functors Fi : Ci → Di , for i = 1, 2, then we also have a product
functor F1 × F2 : C1 × C2 → D1 × D2 between product categories, simply
by (F1 × F2 )(X1 , X2 ) = (F1 (X1 ), F2 (X2 )), and similarly for morphisms.

1.9.3 Natural transformations


Let us move one further step up the abstraction ladder and look at morphisms
between functors. These are called natural transformations. We have already
seen examples of those as well. Given two functors F, G : C → D, a natural
transformation α from F to G is a collection of maps αX : F(X) → G(X) in D,
indexed by objects X in C. Naturality means that α works in the same way on

64
1.9. The role of category theory 65

all objects and is expressed as follows: for each morphism f : X → Y in C, the


rectangle
αX
F(X) / G(X)
F( f ) G( f )
 
F(Y) / G(Y)
αY

in D commutes.
Such a natural transformation is often denoted by a double arrow α : F ⇒ G.
We briefly review some of the examples of natural transformations that we
have seen.

1 The various support maps can now be described as natural transformations


supp : L ⇒ Pfin , supp : M ⇒ Pfin and supp : D ⇒ Pfin , see the overview
diagram 1.30.
2 For each T ∈ {L, P, M} we have described maps unit : X → T (X) and
flat : T (T (X)) → T (X) and have seen naturality results about them. We can
now state more precisely that they are natural transformations unit : id ⇒
T and flat : (T ◦ T ) ⇒ T . Here we have used id as the identity functor
Sets → Sets, and T ◦ T as the composite of T with itself, also as a functor
Sets → Sets.

1.9.4 Monads
A monad on a category C is a functor T : C → C that comes with two natural
transformations unit : id ⇒ T and flat : (T ◦ T ) ⇒ T satisfying:

flat ◦ unit = id = flat ◦ T (unit)


(1.41)
flat ◦ flat = flat ◦ T (flat).
All the collection functors L, P, Pfin , M, M∗ , N, N∗ , D, D∞ that we have seen
so far are monads, see e.g., Lemma 1.2.5, 1.3.2, or 1.4.3. For each monad T we
can form a category Chan(T ) of T -channels, that capture computations of type
T , see Subsection 1.9.1. In category theory this is called the Kleisli category of
T . Composition in this category Chan(T ) is called Kleisli composition. In this
book it is written as ◦·, where the context should make clear what the monad T
at hand is.
Monads have become popular in functional programming [126] as mecha-
nisms for including special effects (e.g., for input-output, writing, side-effects,
continuations) into a functional programming language2 . The structure of prob-
2 See the online overview https://wiki.haskell.org/Monad_tutorials_timeline

65
66 Chapter 1. Collections

abilistic computation is also given by monads, namely by the discrete distribu-


tion monads D, D∞ and by the continuous distribution monad G.
We thus associate the (Kleisli) category Chan(T ) of channels with a mo-
nad T . A second category is associated with a monad T , namely the category
EM(T ) of “Eilenberg-Moore” algebras. The objects of EM(T ) are algebras
α : T (X) → X, satisfying α ◦ unit = id and α ◦ flat = α ◦ T (α). We have seen
algebras for the monads L, Pfin , and M in Propositions 1.2.6, 1.3.5, and 1.4.5.
They capture monoids, commutative idempotent monoids, and commutative
monoids respectively. A morphism in EM(T ) is a morphism of algebras, given
by a commuting rectangle, as described in these propositions. In general, alge-
bras of a monad capture algebraic structure in a uniform manner.
Here is an easy result that describes so-called writer monads.

Lemma 1.9.1. Let M = (M, +, 0) be an arbitrary monoid. The mapping X 7→


M × X forms a monad on the category Sets.

Proof. Let us write T (X) = M × X. For a function f : X → Y we define


T ( f ) : M × X → M × Y by T ( f )(m, x) = (m, f (x)). There is a unit map
unit : X → M × X, namely unit(x) = (0, x) and a flattening map flat : M ×
(M × X) → M × X by µ(m, m0 , x) = (m + m0 , x). We skip naturality and con-
centrate on the monad equations (1.41). First, for (m, x) ∈ T (X) = M × X,

flat ◦ unit (m, x) = flat(0, m, x) = (0 + m, x) = (m, x)




flat ◦ T (unit) (m, x) = flat m, unit(x) = flat(m, 0, x) = (m, x).


 

Next, the flatten-equation holds by associativity of the monoid addition +. This


is left to the reader.

We have seen natural transformations as maps between functors. In the spe-


cial case where the functors involved are monads, these natural transformations
can be called maps of monads if they additionally commute with the unit and
flatten maps.

Definition 1.9.2. Let T 1 = (T 1 , unit 1 , flat 1 ) and T 2 = (T 2 , unit 2 , flat 2 ) be two


monads (on Sets). A map/homomorphism of monads from T 1 to T 2 is a natural
transformation α : T 1 ⇒ T 2 that commutes with unit and flatten in the sense
that the two diagrams
α / T 2 (T 1 (X)) T2 (α)/ T 2 (T 2 (X))
unit 1
X unit 2
T 1 (T 1 (X))
flat 1 flat
 α
  α
 2
T 1 (X) / T 2 (X) T 1 (X) / T 2 (X)

commute, for each set X.

66
1.9. The role of category theory 67

The writer monads from Lemma 1.9.1 give simple examples of maps of
monads: if f : M1 → M2 is a map of monoids, then the maps α B f ×id : M1 ×
X → M2 × X form a map of monoids.
For a historical account of monads and their applications we refer to [66].

Exercises
1.9.1 We have seen the functor J : Sets → Chan(T ). Check that there is
also a functor Chan(T ) → Sets in the opposite direction, which is
X 7→ T (X) on objects, and c 7→ c = (−) on morphisms. Check ex-
plicitly that composition is preserved, and find the earlier result that
stated that fact implicitly.
1.9.2 Recall from (1.15) the subset N[K](X) ⊆ M(X) of natural multisets
with K elements. Prove that N[K] is a functor Sets → Sets.
1.9.3 Show that Exercise 1.8.1 implicitly describes a functor Setsop →
Chan(P), which is the identity on objects.
1.9.4 Show that the zip function from Exercise 1.1.7 is natural: for each
pair of functions f : X → U and g : Y → V the following diagram
commutes.

XK × Y K
zip
/ (X × Y)K
f K ×gK K
  ( f ×g)
UK × VK
zip
/ (U × V)K

1.9.5 Fill in the remaining details in the proof of Lemma 1.9.1: that T is
a functor, that unit and flat are natural transformation, and that the
flatten equation holds.
1.9.6 For arbitrary sets X, A, write X + A for the disjoint union (coproduct)
of X and A, which may be described explicitly by tagging elements
with 1, 2 in order to distinguish them:

X + A = {(x, 1) | x ∈ X} ∪ {(a, 2) | a ∈ A}.

Write κ1 : X → X + A and κ2 : A → X + A for the two obvious func-


tions.
1 Keep the set A fixed and show that the mapping X 7→ X + A can be
extended to a functor Sets → Sets.
2 Show that it is actually a monad; it is sometimes called the excep-
tion monad, where the elements of A are seen as exceptions in a
computation.

67
68 Chapter 1. Collections

1.9.7 Check that the support and accumulation functions form maps of
monads in the situations:
1 supp : M(X) ⇒ P(X);
2 acc : L(X) ⇒ M(X).
1.9.8 Let T = (T, unit, flat) be a monad. By definition, it involves T as
a functor T : Sets → Sets. Show that T can be ‘lifted’ to a functor
T : Chan(T ) → Chan(T ). It is defined on objects as T (X) B T (X)
and on a morphism f : X → Y as:
T(f)
/ T (T (Y)) flat / T (Y) unit / T (T (Y)) .
 
T ( f ) B T (X)

Prove that T is a functor, i.e. that it preserves (channel) identities and


composition.

68
2

Discrete probability distributions

The previous chapter has introduced products, lists, subsets and multisets as
basic collection types and has made some of their basic properties explicit.
This serves as preparation for the current first chapter on probability distribu-
tions. We shall see that distributions also form a collection type, with much
analogous structure. In fact distributions are special multisets, where multi-
plicities add up to one.
This chapter introduces the basics of probability distributions and of prob-
abilistic channels (as indexed / conditional distributions). These notions will
play a central role in the rest of this book. Distributions will be defined as spe-
cial multisets, so that there is a simple inclusion of the set of distributions in
the set of multisets multisets, on a particular space. In the other direction, this
chapter describes the ‘frequentist learning’ construction, which turns a multi-
set into a distribution, essentially by normalisation. The chapter also describes
several constructions on distributions, like the convex sum, parallel product,
and addition of distributions (in the special case when the underlying space
happens to be a commutative monoid). Parallel products ⊗ of distributions and
joint distributions (on product spaces) are rather special, for instance, because
a joint distribution is typically not equal to the product of its marginals. In-
deed, joint distributions may involve correlations between the different (prod-
uct) components, so that updates in one component have ‘crossover’ effect in
other components. This magical phenonenom will be elaborated in later chap-
ters.
The chapter closes with an example of a Bayesian network. It shows that
the conditional probability tables that are associated with nodes in a Bayesian
network are instances of probabilistic channels. As a result, one can system-
atically organise computations in Bayesian networks as suitable (sequential
and/or parallel) compositions of channels. This is illustrated via calculations
of various predicted probabilities.

69
70 Chapter 2. Discrete probability distributions

2.1 Probability distributions


This section is the first one about probability, in elementary form. It introduces
finite discrete probability distributions, which we often simply call distribu-
tions or states. In the literature they are also called ‘multinomial’ or ‘categor-
ical’ distributions. The notation and definitions that we use for distributions
are very much like for multisets, since distributions form a subset of multisets,
namely where multiplicities add up to one.
What we call distribution, is in fact a finite discrete probability distribution.
We will use state as synonym for distribution. A distribution over a set X is a
finite formal convex sum of the form:
X
r1 | x1 i + · · · + rn | xn i where xi ∈ X and ri ∈ [0, 1] with ri = 1.
i
P
We can write such an expression as a sum i ri | xi i. It is called a convex sum
since the ri add up to one. Thus, a distribution is a special ‘probabilistic’ mul-
tiset.
We write D(X) for the set of distributions on a space / set X, so that there
is an inclusion D(X) ⊆ M(X) of distributions on X in the set of multisets
on X. Via this inclusion, we use the same conventions for distributions, as
for multisets; they were described in the three bullet points in the beginning
of Section 1.4. This set X is often called the sample space, see e.g. [144],
the outcome space, the underlying space, or simply the underlying set. Each
element x ∈ X gives rise to a distribution 1| x i ∈ D(X), which is 1 on x and 0
everywhere else. It is called a Dirac distribution, a point mass, or also a point
distribution. The mapping x 7→ 1| x i is the unit function unit : X → D(X).
For a coin we can use the set {H, T } with elements for head and tail as sample
space. A fair coin is described on the left below, as a distribution over this set;
the distribution on the right gives a coin with a slight bias.
1
2|Hi + 21 | T i 0.51| H i + 0.49| T i.

In general, for a finite set X = {x1 , . . . , xn } there a uniform distribution


unif X ∈ D(X) that assigns the same probability to each element. Thus, it
is given by unif X = 1≤i≤n n1 | xi i. The above fair coin is a uniform distribu-
P
tion on the two-element set {H, T }. Similarly, a fair dice can be described as
unifpips = 16 |1 i+ 16 |2 i+ 16 | 3 i+ 16 | 4 i+ 16 | 5 i+ 61 | 6 i, where pips = {1, 2, 3, 4, 5, 6}.
Figure 2.1 shows bar charts of several distributions. The last one describes the
letter frequencies in English for the latin alphabet. One commonly does not dis-
tinguish upper en lower cases in such frequencies, so we take the 26-element
set A = {a, b, c, . . . , z} of lower cases as sample space. The distribution itself

70
2.1. Probability distributions 71

Figure 2.1 Plots of a slightly biased coin distribution 0.51| H i + 0.49| T i and a
fair (uniform) dice distribution on {1, 2, 3, 4, 5, 6} in the top row, together with the
distribution of letter frequencies in English at the bottom.

can be described as formal sum:


0.082|a i + 0.015| b i + 0.028| c i + 0.043| d i + 0.13|e i + 0.022| f i
+ 0.02|g i + 0.061| h i + 0.07| i i + 0.0015| j i + 0.0077| k i
+ 0.04|l i + 0.024|m i + 0.067| n i + 0.075| oi + 0.019| pi
+ 0.00095|q i + 0.06| r i + 0.063| s i + 0.091| t i + 0.028|u i
+ 0.0098| v i + 0.024| wi + 0.0015| x i + 0.02|y i + 0.0074| z i.
These frequencies have been copied from Wikipedia. Interestingly, they do not
precisely add up to 1, but to 1.01085, probably due to rounding. Thus, strictly
speaking, this is not a probability distribution but a multiset.
Below we describe several standard examples of distributions — which will
play an important role in the rest of the book. Especially the three ‘draw’ dis-
tributions — multinomial, hypergeometric, Pólya — capture probabilities as-
sociated with drawing coloured balls from an urn. The differences between
these distributions involve the changes in the urn after the draws and can be
characterised as −1, 0, and +1. In the hypergeometric case a drawn ball is re-
moved (−1), in the multinomial case a drawn ball is returned so that the urn

71
72 Chapter 2. Discrete probability distributions

remains unaltered (0), and in the Pólya case the drawn ball is returned to the
urn together with an extra ball of the same colour (+1).
Example 2.1.1. We shall explicitly describe several familiar distributions us-
ing the above formal convex sum notation.
1 The coin that we have seen above can be parametrised via a ‘bias’ probabil-
ity r ∈ [0, 1]. The resulting coin is often called flip and is defined as:
flip(r) B r| 1i + (1 − r)|0 i
where 1 may understood as ‘head’ and 0 as ‘tail’. We may thus see flip as a
function flip : [0, 1] → D({0, 1}) from probabilities to distributions over the
sample space 2 = {0, 1} of Booleans. This flip(r) is often called the Bernoulli
distribution, with parameter r ∈ [0, 1].
2 For each number K ∈ N and probability r ∈ [0, 1] there is the familiar
binomial distribution bn[K](r) ∈ D({0, 1, . . . , K}). It captures probabilities
for iterated coin flips, and is given by the convex sum:
X  
bn[K](r) B K
k · r k
· (1 − r) k .
K−k

0≤k≤K

The multiplicity probability before | k i in this expression is the chance of


getting k heads of out K coin flips, where each flip has bias r ∈ [0, 1].
In this way we obtain a function bn[K] : [0, 1] → D({0, 1, . . . , K}). These
multiplicities are plotted as bar charts in the top row of Figure 2.2, for two
binomial distributions, both on the sample space {0, 1, . . . , 10}.
There are isomorphisms [0, 1]  D(2), via the above flip function, and
{0, 1, . . . , K}  N[K](2), via k 7→ k| 0 i + (K − k)| 1 i. With these isomor-
phisms we can describe binomials as probabilistic function bn[K] : D(2) →

D N[K](2) . This re-formulation opens the door to a multivariate version
of this binomial distribution, called the multinomial distribution. It can be
described as function:
mn[K]
/ D N[K](X) .
 
D(X) (2.1)

The number K ∈ N represents the number of objects that is drawn, from


a distribution ω ∈ D(X), seen as abstract urn. The distribution mn[K](ω)
assigns a probability to a K-sized draw ϕ ∈ N[K](X). There is no bound on
K, since the idea behind multinomial distributions is that drawn objects are
replaced. More details will appear in Chapter 3; at this stage it suffices to
define for ω ∈ D(X),
X
( ϕ ) · ωϕ ϕ ,

mn[K](ω) B (2.2)
ϕ∈N[K](X)

72
2.1. Probability distributions 73

where ( ϕ ) is the multinomial coefficient QxK! ϕ(x)! , see Definition 1.5.1 (5)
and where ωϕ B ω(x) ϕ(x)
Q
x . In the sequel we shall standardly use the
multinomial distribution and view the binomial distribution as a special case.
To see an example, for space X = {a, b, c} and urn ω = 13 | ai + 12 | b i + 16 | ci
the draws of size 3 form
a distribution over multisets, described below within
the outer ‘big’ kets − .

mn[3](ω) = 27 3| a i + 6 2| a i + 1| b i + 4 1|a i + 2| bi + 8 3| bi
1 1 1 1

+ 18 1
2| a i + 1| c i + 16 1| a i + 1| bi + 1| ci + 18 2| b i + 1| c i


+ 36 1
1| a i + 2| c i + 241
1| b i + 2|c i + 216
1

3|c i

Via the Multinomial Theorem (1.27) we see that the probabilities in the
above expression (2.2) for mn[K](ω) add up to one:
X X Y (1.26)  P K
(ϕ ) · ωϕ = (ϕ) · ω(x)ϕ(x) = x ω(x) = 1K = 1.
x
ϕ∈N[K](X) ϕ∈N[K](X)

In addition, we note that the multisets ϕ in (2.2) can be restricted to those


with supp(ϕ) ⊆ supp(ω). Indeed, if ω(x) = 0, but ϕ(x) , 0, for some x ∈ X,
then ω(x)ϕ(x) = 0, so that the whole product becomes zero, and so that ϕ
Q
does not contribute to the above multinomial distribution.
3 The multinomial distribution captures draws with replacement, where drawn
balls are returned to the urn. Hence the urn itself can be described as a
probability distribution. A useful variation is the hypergeometric distribu-
tion which captures draws from an urn without replacement. We briefly in-
troduce this hypergeometric distribution, in multivariate form, and postpone
further analysis to Chapter 3.
When drawn balls are not replaced, the urn in question changes with each
draw. In the hypergeometric case the urn is a multiset, say of size L ∈ N.
Draws are then multisets of size K, with K ≤ L. The hypergeometric func-
tion thus takes the form:

N[L](X)
hg[K]
/ D N[K](X). (2.3)

This function/channel is defined on an L-sized multiset ψ ∈ N[L](X) as:

X ψϕ X x ψ(x)
  Q  
ϕ(x)
hg[K] ψ B L ϕ =  L  ϕ .

ϕ≤K ψ K ϕ≤K ψ K

This yields a probability distribution by Lemma 1.6.2. To see an example,

73
74 Chapter 2. Discrete probability distributions

with ψ = 4| a i + 6| b i + 2| c i as urn, the draws of size 3 give a distribution:


9 3 1
hg[3](ψ) = 55 1
3| ai + 55 2| ai + 1| bi + 11 1| ai + 2| b i + 11

3| b i

+ 55 3
2| ai + 1| ci + 12 ai + 1| b i + 1| c i + 22 3
2| b i + 1|c i

55 1|
+ 55 1
1| ai + 2| ci + 1103
1| bi + 2| c i .

4 We have seen that in multinomial mode the drawn ball is returned to the
urn, whereas in hypergeometric mode the drawn ball is removed. There is
a logical third option where the drawn ball is returned, together with one
additional ball of the same colour. This leads to what is called the Pólya
distribution. We shall describe it as a function of the form:

N∗ (X)
pl[K]
/ D N[K](X). (2.4)

Its description is very similar to the hypergeometric distribution, with as


main difference that it uses multichoose binomomial coefficients instead of
ordinary binomial coefficients:
ψ
ϕ
X
pl[K](ψ) =   ϕ . kψk
(2.5)
ϕ∈N[K](supp(ψ)) K

Notices that the draws ϕ satisfy supp(ϕ) ⊆ supp(ψ). In this situation it is


most natural to use urns ψ with full support, that is with ψ(x) > 0 for each
x ∈ X. In that case the urn contains at least one ball of each colour.
The above Pólya formula in (2.5) yields a proper distribution by Proposi-
tion 1.7.3. Here is an example, with the same urn ψ = 4| a i + 6| bi + 2| c i as
before, for the hypergeometric distribution.
3 2
pl[3](ψ) = 915
3| a i + 15 91 2| a i + 1| b i + 13 1|a i + 2| bi + 13

3| bi

+ 91 5
2| a i + 1| c i + 91 1|a i + 1| bi + 1| c i + 26 2| b i + 1| c i
12 3

+ 91 3
1| a i + 2| c i + 182
9
1| b i + 2|c i + 91
1

3| ci

The middle and last rows in Figure 2.2 give bar plots for hypergeometric and
Pólya distributions over 2 = {0, 1}. They show the probabilities for numbers
0 ≤ k ≤ 10 in a drawn multiset k| 0 i + (10 − k)| 1i.

So far we have described distributions as formal convex sums. But they can
be described, equivalently, in functional form. This is done in the definition
below, which is much like Definition 1.4.1 for multisets. Also for distributions
we shall freely switch between the above ket-formulation and the function-
formulation.

74
2.1. Probability distributions 75

bn[10]( 13 ) bn[10]( 43 )

   
hg[10] 10| 0 i + 20| 1 i hg[10] 16| 0 i + 14| 1 i

   
pl[10] 1| 0 i + 2| 1i pl[10] 8| 0 i + 7| 1 i

Figure 2.2 Plots of binomial and hypergeometric distributions

Definition 2.1.2. The set D(X) of all distributions over a set X can be defined
as:

ω(x) = 1}.
P
D(X) B {ω : X → [0, 1] | supp(ω) is finite, and x

Such a function ω : X → [0, 1], with finite support and values adding up to
one, is often called a probability mass function, abbreviated as pmf.
This D is functorial: for a function f : X → Y we have D( f ) : D(X) → D(Y)
defined either as:
X
ω(x).
P  P
D( f ) i ri | xi i B i ri | f (xi ) i or as: D( f )(ω)(y) B
x∈ f −1 (y)

A distribution of the form D( f )(ω) ∈ D(Y), for ω ∈ D(X), is sometimes called


an image distribution. One also says that ω is pushed forward along f .

One has to check that D( f )(ω) is a distribution again, that is, that its multi-

75
76 Chapter 2. Discrete probability distributions

plicities add up to one. This works as follows.


X X X X
D( f )(ω)(y) = ω(x) = ω(x) = 1.
y∈Y y∈Y x∈ f −1 (y) x∈X

We present two examples where functoriality of D is frequently used, but


not always in explicit form.

Example 2.1.3. 1 Computing marginals of ‘joint’ distributions involves func-


toriality of D. In general, one speaks of a joint distribution if its sample
space is a product set, of the form X1 × X2 , or more generally, X1 × · · · × Xn ,
for n ≥ 2. The i-th marginal of ω ∈ D(X1 × · · · × Xn ) is defined as D(πi )(ω),
via the i-th projection function πi : X1 × · · · × Xn → Xi .
For instance, the first marginal of the joint distribution,

ω= 1
12 | H, 0 i + 61 | H, 1i + 13 | H, 2 i + 16 | T, 0i + 1
12 | T, 1 i + 16 | T, 2 i

on the product space {H, T } × {0, 1, 2} is obtained as:

D(π1 )(ω) = 12 | π1 (H, 0)i + 6 | π1 (H, 1)i + 3 | π1 (H, 2)i


1 1 1

+ 6 | π1 (T, 0)i + 12 | π1 (T, 1)i + 16 | π1 (T, 2)i


1 1

= 12 | H i + 6 | H i + 3 | H i + 6 | T i + 12 | T i + 6 | T i
1 1 1 1 1 1

= 12 | H i + 12 | T i.
7 5

2 Let ω ∈ D(X) be distribution. In Chapter 4 we shall discuss random vari-


ables in detail, but here it suffices to know that it involves a function R : X →
R from the sample space of the distribution ω to the real numbers. Often,
then, the notation
P[R = r] ∈ [0, 1]

is used to indicate the probability that the random variable R takes value
r ∈ R.
Since we now know that D is functorial, we may apply it to the function
R : X → R. It gives another function D(R) : D(X) → D(R), so that D(R)(ω)
is a distribution on R. We observe:
X
P[R = r] = D(R)(ω)(r) = ω(x).
x∈R−1 (r)

Thus, P[R = (−)] is an image distribution, on R. In the notation P[R = r]


the distribution ω on the sample space X is left implicit.
Here is a concrete example. Recall that we write pips = {1, 2, 3, 4, 5, 6} for
the sample space of a dice. Let M : pips × pips → pips take the maximum,
so M(i, j) = max(i, j). We consider M as a function pips × pips → R via the

76
2.1. Probability distributions 77

inclusion pips ,→ R. Then, using the uniform distribution unif ∈ D(pips ×


pips),

P[M = k] = the probability that the maximum of two dice throws is k


= D(M)(unif)(k)
X
= unif(i, j)
i, j with max(i, j)=k
X X
= unif(i, k) + unif(k, j)
i≤k j<k
2k − 1
= .
36
In the notation of this book the image distribution D(M)(unif) on pips is
written as in the first line below.

D(M)(unif) = 36
1
| 1i + 36
3
|2 i + 5 | 3 i + 36
7
| 4 i + 9 | 5i + 11
36 |6 i
36 36
= P[M = 1] 1 + P[M = 2] 2 + P[M = 3] 3


+ P[M = 4] 4 + P[M = 5] 5 + P[M = 6] 6 .

Since in our notation the underlying (uniform) distribution unif is explicit,


we can also change it to another distribution ω ∈ D(pips × pips) and still
do the same computation. In fact, once we have seen product distributions in
Section 2.4 we can compute D(M)(ω1 ⊗ω2 ) where ω1 is a distribution for the
first dice (say a uniform, fair one) and ω2 a distribution for the second dice
(which may be unfair). This notation in which states are written explicitly
gives much flexibility in what we wish to express and compute, see also
Subsection 4.1.1 below.

In the previous chapter we have seen that the sets L(X), P(X) and M(X) of
lists, powersets and multisets all carry a monoid structure. Hence one may ex-
pect a similar result saying that D(X) forms a monoid, via an elementwise sum,
like for multisets. But that does not work. Instead, one can take convex sums
of distributions. This works as follows. Suppose we have two distributions
ω, ρ ∈ D(X) and a number s ∈ [0, 1]. Then we can form a new distribution
σ ∈ D(X), as convex combination of ω and ρ, namely:

σ B s · ω + (1 − s) · ρ that is σ(x) = r · ω(x) + (1 − s) · ρ(x). (2.6)

This obviously generalises to an n-ary convex sum.


At this stage we shall not axiomatise structures with such convex sums; they
are sometimes called simply ‘convex sets’ or ‘barycentric algebras’, see [155]
or [67] for details. A brief historical account occurs in [99, Remark 2.9].

77
78 Chapter 2. Discrete probability distributions

Remark 2.1.4. We have restricted ourselves to finite probability distributions,


by requiring that the support of a distribution is a finite set. This is fine in many
situations of practical interest, such as in Bayesian networks. But there are
relevant distributions that have infinite support, like the Poisson distribution
pois[λ] on N, with ‘mean’, ‘rate’ or ‘intensity’ parameter λ ≥ 0. It can be
described as infinite formal convex sum:
X λk
pois[λ] B e−λ · k . (2.7)
k∈N
k!

This does not fit in D(N). Therefore we sometimes use D∞ instead of D, where
the finiteness of support requirement has been dropped:
D∞ (X) B {ω : X → [0, 1] | x ω(x) = 1}.
P
(2.8)
Notice by the way that the multiplicities add up to one in the Poisson distri-
bution because of the basic formula: eλ = k∈N λk! . The Poisson distribution is
P k

typically used for counts of rare events. The rate or intensity parameter λ is the
average number of events per time period. The Poisson distribution then gives
for each k ∈ N the probability of having k events per time period. This works
when events occur independently.
Exercises 2.1.9 and 2.1.10 contain other examples of discrete distributions
with infinite support, namely the geometric and negative-binomial distribution.
Another illustration is the zeta (or zipf) distribution, see e.g. [145].

Exercises
2.1.1 Check that a marginal of a uniform distribution is again a uniform
distribution; more precisely, D(π1 )(unif X×Y ) = unif X .
2.1.2 1 Prove that flip : [0, 1] → D(2) is an isomorphism.
2 Check that flip(r) is the same as bn[1](r).
3 Describe the distribution bn[3]( 41 ) concretely and interepret this
distribution in terms of coin flips.
2.1.3 We often write n B {0, . . . , n − 1} so that 0 = ∅, 1 = {0} and 2 = {0, 1}.
Verify that:
L(0)  1 P(0)  1 N(0)  1 M(0)  1 D(0)  0
L(1)  N P(1)  2 N(1)  N M(1)  R≥0 D(1)  1
P(n)  2n N(n)  Nn M(n)  Rn≥0 D(2)  [0, 1].
The set D(n + 1) is often called the n-simplex. Describe it as a subset
of Rn+1 , and also as a subset of Rn .
2.1.4 Let X = {a, b, c} with draw ϕ = 2| ai + 3| b i + 2| c i ∈ N[7](X).

78
2.1. Probability distributions 79

1 Consider distribution ω = 13 | a i + 12 | b i + 16 | c i and show that:

mn[7](ω)(ϕ) = 432 .
35

Explain this outcome in terms of iterated single draws.


2 Consider urn ψ = 4| a i + 6| b i + 2| ci ∈ N[12](X) and compute:
hg[7](ψ)(ϕ) = 33 .
5

2.1.5 Let a number r ∈ [0, 1] and a finite set X be given. Show that:
X
r| U | · (1 − r)| X\U | = 1.
U∈P(X)

Can you generalise to numbers r1 , . . . , rn and partitions of X of size


n?
Hint: Recall the binomial and multinomial distribution from Exam-
ple 2.1.1 (2) and recall Lemma 1.3.7.
2.1.6 Check that the powerbag operation from Exercise 1.6.8 can be turned
into a probabilistic powerbag PPB via:
X ψϕ
 
PPB(ψ) B
ϕ .
2 kψk
ϕ≤ψ

2.1.7 Let X be a non-empty finite set with n elements. Use Exercise 1.5.7
to check that, for the following multiset distribution yields a well-
defined distribution on N[K](X).
X (ϕ )
mltdst[K]X B ϕ .
n K
ϕ∈N[K](X)

Describe mltdst[4]X for X = {a, b, c}.  


2.1.8 1 Recall Theorem 1.5.2 (1) and conclude that lim n+K · rn = 0,
n→∞ K
for r ∈ [0, 1). (This is general result: if partial sums of a series
converge, the limit of the series itself is zero.)
2 Conclude that for r ∈ (0, 1] one has:
lim bn[n+m](r)(m) = 0.
n→∞

Explain yourself what this means.


2.1.9 For a (non-zero) probability r ∈ (0, 1) one defines the geometric dis-
tribution geo[r] ∈ D∞ (N) as:
X
geo[r] B r · (1 − r)k−1 k .
k∈N>0

It captures the probability of being successful for the first time after

79
80 Chapter 2. Discrete probability distributions

k − 1 unsuccesful tries. Prove that this is a distribution indeed: its


multiplicities add up to one.
Hint: Recall the summation formula for geometric series, but don’t
confuse this geometric distribution with the hypergeometric distribu-
tion from Example 2.1.1 (3).
2.1.10 The negative binomial distribution is of the form nbn[K](s) ∈ D∞ (N),
for K ≥ 1 and s ∈ (0, 1). It captures the probability of reaching K
successes, with probability s, in n + K trials.
X K !!
· sK · (1 − s)n K + n

nbn[K](s) B
n≥0
n
X n + K − 1!
= · sK · (1 − s)n K + n

n≥0
K − 1
X m − 1!
= · sK · (1 − s)m−K m .
m≥K
K−1

Use Theorem 1.5.2 (or Exercise 1.5.9) to show that this forms a distri-
bution. We shall describe negative multinomial distributions in Sec-
tion ??.
2.1.11 Prove that a distribution ω ∈ D∞ (X) necessarily has countable sup-
port.
Hint: Use that each set {x ∈ X | ω(x) > 1n }, for n > 0, can have only
finitely many elements.
2.1.12 Let ω ∈ D(X) be an arbitrary distribution on a set X. We extend it to a
distribution ω? on the set L(X) of lists of elements from X. We define
the function ω? : L(X) → [0, 1] by:
ω(x1 ) · . . . · ω(xn )
ω? [x1 , . . . , xn ] B .

2n+1

Prove that ω? ∈ D∞ L(X) .



1
2 Consider the function f : {a, b, c} → {1, 2} with f (a) = 1, f (b) = 1,
f (c) = 2. Take ω = 13 | a i + 14 | b i + 12
5
| c i ∈ D({a, b, c}) and ` =
[1, 2, 1] ∈ L({1, 2}). Show that:

D( f )(ω)? (`) = 245


27648 = D∞ (L( f ))(ω? )(`).

Using the ?-operation as a function:

(−)?
D(X) / D∞ L(X)

80
2.2. Probabilistic channels 81

we can describe the above equation as:


 / ω? ∈ D∞ L({a, b, c})
D({a, b, c}) 3 ω
_ _
 
D∞ L( f ) (ω? )
 
D({1, 2}) 3 D( f )(ω) / D( f )(ω)? ∈ D∞ L({1, 2})

3 Prove in general that (−)? is a natural transformation from D to


D∞ ◦ L.

2.2 Probabilistic channels


In the previous chapter we have seen that taking lists / powersets / multisets is
functorial, giving functors written respectively as L, P, M. Further, we have
seen that these functors all have ‘unit’ and ‘flatten’ operations that satisfy stan-
dard equations that make L, P and M into a monad. Moreover, in terms of
these unit and flatten maps we have defined categories of channels Chan(L),
Chan(P) and Chan(M). In this section we will show that the same monad
structure exists for the distribution functor D and that we thus also have prob-
abilistic channels, in a category Chan(D). In the remainder of this book the
emphasis will be almost exclusively on this probabilistic case, so that ‘chan-
nel’ will standardly mean ‘probabilistic channel’.
The unit and flatten maps for multisets from Subsection 1.4.2 restrict to
distributions. The unit function unit : X → D(X) is simply unit(x) B 1| x i.
Flattening is the function flat : D(D(X)) → D(X) with:
P  X P  P
flat i ri | ωi i B i ri · ωi (x) x = i ri · ωi .
x∈X

The latter formulation uses a convex sum of distributions (2.6).


The only thing that needs to be checked is that flattening yields a convex
sum, i.e. that its probabilities add up to one. This is easy:

i ri · ωi (x) = x ωi (x) = i ri · 1 = 1.
P P  P P P
x i ri ·

The familiar properties of unit and flatten hold for distributions too: the ana-
logue of Lemma 1.4.3 holds, with ‘multiset’ replaced by ‘distribution’. We
conclude that D, with these unit and flat, is a monad. The same holds for the
‘infinite’ variation D∞ from Remark 2.1.4.
A probabilistic channel c : X → Y is a function c : X → D(Y). For a state /
distribution ω ∈ D(X) we can do state transformation and produce a new state


c = ω ∈ D(Y). This happens via the standard definition for = from Section 1.8,
see especially (1.39) and (1.40):
    c =≪ ω ≔ flat( D(c)(ω) ) = Σ_{y∈Y} ( Σ_{x∈X} c(x)(y) · ω(x) ) | y ⟩.        (2.9)

In the probabilistic case we get a probability distribution again, with probabil-


ities adding up to one, since:
   
X X X X  X
 c(x)(y) · ω(x) =  c(x)(y) · ω(x) = 1 · ω(x) = 1.
   
y∈Y x∈X x∈X y∈Y x∈X

Moreover, if we have another channel d : Y → D(Z) we can form the compos-


ite channel:

    d ◦· c ≔ flat ◦ D(d) ◦ c,    as the composite    X --c--> D(Y) --D(d)--> D(D(Z)) --flat--> D(Z).

More explicitly, this gives (d ◦· c)(x) = d = c(x).
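In code one can experiment with these definitions directly. The following is a
minimal sketch of our own (not part of the formal development), representing a
distribution as a Python dictionary from elements to probabilities; the names
unit, push, transform and compose are ad-hoc choices.

    def unit(x):
        # Dirac / point distribution 1|x>
        return {x: 1.0}

    def push(f, omega):
        # functorial action D(f): push a distribution forward along a function f
        out = {}
        for x, p in omega.items():
            out[f(x)] = out.get(f(x), 0.0) + p
        return out

    def transform(c, omega):
        # state transformation c =<< omega, for a channel c : X -> D(Y)
        out = {}
        for x, px in omega.items():
            for y, py in c(x).items():
                out[y] = out.get(y, 0.0) + px * py
        return out

    def compose(d, c):
        # channel composition d . c: first c, then d, so (d . c)(x) = d =<< c(x)
        return lambda x: transform(d, c(x))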


As a result we can form a category Chan(D) of probabilistic channels. Its
objects are sets X and its morphisms X → Y are probabilistic channels. They
can be composed via ◦·, see Lemma 1.8.3, with unit : X → X as identity chan-
nel. Recall that an ordinary function f : X → Y can be turned into a (probabilis-
tic) channel  f  : X → Y. Explicitly,  f (x) = 1| f (x)i. This can be formalised
in terms of a functor Sets → Chan(D). Often, we don’t write the − angles
when the context makes clear what is meant.
We have already seen several examples of such probabilistic channels in
Example 2.1.1, namely:

    flip : [0, 1] → {0, 1}        and        bn[K] : [0, 1] → {0, 1, . . . , K}.

The multinomial, hypergeometric, and Pólya distributions from (2.1), (2.3)


and (2.4) can be organised as channels of the form:

    mn[K] : D(X) → N[K](X),        hg[K] : N[L](X) → N[K](X),        pl[K] : N∗(X) → N[K](X).

Example 2.2.1. We now present an example of a probabilistic channel, in the


style of the earlier list/powerset/multiset channels in Example 1.8.2. We use
again the sets X = {a, b, c} and Y = {u, v}, with state ω = 1/6| a ⟩ + 1/2| b ⟩ + 1/3| c ⟩ ∈
D(X) and channel f : X → D(Y) given by:

    f(a) = 1/2| u ⟩ + 1/2| v ⟩        f(b) = 1| u ⟩        f(c) = 3/4| u ⟩ + 1/4| v ⟩.


We then get as state transformation:

    f =≪ ω = flat( D(f)(ω) )
           = flat( 1/6| f(a) ⟩ + 1/2| f(b) ⟩ + 1/3| f(c) ⟩ )
           = flat( 1/6| 1/2| u ⟩ + 1/2| v ⟩ ⟩ + 1/2| 1| u ⟩ ⟩ + 1/3| 3/4| u ⟩ + 1/4| v ⟩ ⟩ )
           = 1/12| u ⟩ + 1/12| v ⟩ + 1/2| u ⟩ + 1/4| u ⟩ + 1/12| v ⟩
           = 5/6| u ⟩ + 1/6| v ⟩.

This state transformation f = ω describes a mixture (convex sum) of the dis-


tributions f (a), f (b), f (c), where the weights of the components of the mixture
are given by the distribution ω, see also Exercise 2.2.3 (2). In practice we often
compute transformation directly as convex sum of states, as in:

     
    f =≪ ω = 1/6 · ( 1/2| u ⟩ + 1/2| v ⟩ ) + 1/2 · ( 1| u ⟩ ) + 1/3 · ( 3/4| u ⟩ + 1/4| v ⟩ )
           = 5/6| u ⟩ + 1/6| v ⟩.

This ‘mixture’ terminology is often used in a clustering context where the ele-
ments a, b, c are the components.
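For readers who like to compute along, the state transformation in Example 2.2.1
can be reproduced with the little Python sketch introduced earlier (floats
approximate the fractions):

    omega = {'a': 1/6, 'b': 1/2, 'c': 1/3}
    f = lambda x: {'a': {'u': 1/2, 'v': 1/2},
                   'b': {'u': 1.0},
                   'c': {'u': 3/4, 'v': 1/4}}[x]
    print(transform(f, omega))   # {'u': 0.8333..., 'v': 0.1666...}, i.e. 5/6|u> + 1/6|v>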

Example 2.2.2 (Copied from [79]). Let’s assume we wish to capture the mood
of a teacher, as a probabilistic mixture of three possible options namely: pes-
simistic (p), neutral (n) or optimistic (o). We thus have a three-element proba-
bility space X = {p, n, o}. We assume a mood distribution:

    σ = 1/8| p ⟩ + 3/8| n ⟩ + 1/2| o ⟩.        [plot omitted]

This mood thus tends towards optimism.


Associated with these different moods the teacher has different views on
how pupils perform in a particular test. This performance is expressed in terms
of marks, which can range from 1 to 10, where 10 is best. The probability space
for these marks is written as Y = {1, 2, . . . , 10}.
The view of the teacher is expressed via a channel c : X → Y. It is defined
via the following three mark distributions, one for each element in X =


{p, n, o}.

    c(p) = 1/50| 1 ⟩ + 2/50| 2 ⟩ + 10/50| 3 ⟩ + 15/50| 4 ⟩ + 10/50| 5 ⟩
         + 6/50| 6 ⟩ + 3/50| 7 ⟩ + 1/50| 8 ⟩ + 1/50| 9 ⟩ + 1/50| 10 ⟩          (pessimistic mood marks)

    c(n) = 1/50| 1 ⟩ + 2/50| 2 ⟩ + 4/50| 3 ⟩ + 10/50| 4 ⟩ + 15/50| 5 ⟩
         + 10/50| 6 ⟩ + 5/50| 7 ⟩ + 1/50| 8 ⟩ + 1/50| 9 ⟩ + 1/50| 10 ⟩         (neutral mood marks)

    c(o) = 1/50| 1 ⟩ + 1/50| 2 ⟩ + 1/50| 3 ⟩ + 2/50| 4 ⟩ + 4/50| 5 ⟩
         + 10/50| 6 ⟩ + 15/50| 7 ⟩ + 10/50| 8 ⟩ + 4/50| 9 ⟩ + 2/50| 10 ⟩       (optimistic mood marks)
Now that the state σ ∈ D(X) and the channel c : X → Y have been described,
we can form the transformed state c = σ ∈ D(Y). Following the formula-
tion (2.9) we get for each mark i ∈ Y the probability:

    ( c =≪ σ )(i) = Σ_x σ(x) · c(x)(i)
                  = σ(p) · c(p)(i) + σ(n) · c(n)(i) + σ(o) · c(o)(i).


The outcome is the convex combination of the above three distributions, for c(p),
c(n) and c(o), where the weights are determined by the mood distribution σ. It
contains the ‘predicted’ marks, corresponding to the state of mind of the
teacher.        [plot omitted]

This example will return in later chapters. There, the teacher will be confronted
with the marks that the pupils actually obtain. This will lead the teacher to an


update of his/her mood. This situation forms an illustration of a predictive cod-


ing model, in which the human mind actively predicts the situation in the out-
side world, depending on its internal state — and updates it when confronted
with the actual situation.

In the next example we show how products, multisets and distributions come
together in an elementary combinatorial situation.

Example 2.2.3. Let X be an arbitrary set and K be a fixed positive natural


number. The K-fold product X K = X × · · · × X contains the lists of length
K. As we have seen in (1.32), the accumulator function acc : L(X) → N(X)
from (1.30) restricts to acc : X K → N[K](X), where, recall, N[K](X) is the set
of multisets over X with K elements.
We ask ourselves: do these maps acc : X K → N[K](X) have inverses? Let
ϕ = Σ_i k_i | x_i ⟩ ∈ N[K](X) be a K-element natural multiset, so kϕk = Σ_i k_i = K,
where we assume that i ranges over 1, 2, . . . , n. We can surely choose a list
` ∈ X K with acc(`) = ϕ. All we have to do is put the elements in ϕ in a
certain order. We can do so in many ways. How many? In Proposition 1.6.3
we have seen that the answer is given by the coefficient ( ϕ ) of the multiset
ϕ ∈ N[K](X), where:
    ( ϕ ) = K! / Π_x ϕ(x)! = K! / (k_1! · · · k_n!) = (K over k_1, . . . , k_n).

Returning to our question about an inverse to acc : X K → N[K](X) we see


that we can construct a channel in the reversed direction. We call it arr for
‘arrange’ and define it as arr : N[K](X) → X K via uniform distributions:
    arr(ϕ) ≔ Σ_{~x ∈ acc⁻¹(ϕ)} 1/( ϕ ) | ~x ⟩.                               (2.10)

This is a sum over (ϕ )-many lists ~x with acc(~x) = ϕ. The size K of the multiset
involved has disappeared from this formulation, so that we can view arr simply
as a channel arr : N(X) → L(X).
For instance, for X = {a, b} with multiset ϕ = 3| a ⟩ + 1| b ⟩ there are ( ϕ ) =
(4 over 3,1) = 4!/(3!·1!) = 4 arrangements of ϕ, namely [a, a, a, b], [a, a, b, a], [a, b, a, a],
and [b, a, a, a], so that:

    arr( 3| a ⟩ + 1| b ⟩ ) = 1/4| a, a, a, b ⟩ + 1/4| a, a, b, a ⟩ + 1/4| a, b, a, a ⟩ + 1/4| b, a, a, a ⟩.


Our next question is: how are accumulator acc and arrangement arr related?


The following commuting triangle gives an answer.

    ⟨acc⟩ ◦· arr = unit : N(X) → N(X),                                        (2.11)

that is, the triangle formed by the channels arr : N(X) → L(X), ⟨acc⟩ : L(X) → N(X)
and unit : N(X) → N(X) commutes. It combines a probabilistic channel
arr : N(X) → D(L(X)) with an ordinary function acc : L(X) → N(X), promoted to a
deterministic channel ⟨acc⟩. For the channel composition ◦· we make use of
Lemma 1.8.3 (4):

    ( ⟨acc⟩ ◦· arr )(ϕ) = D(acc)( arr(ϕ) )
                        = D(acc)( Σ_{~x ∈ acc⁻¹(ϕ)} 1/( ϕ ) | ~x ⟩ )
                        = Σ_{~x ∈ acc⁻¹(ϕ)} 1/( ϕ ) | acc(~x) ⟩
                        = Σ_{~x ∈ acc⁻¹(ϕ)} 1/( ϕ ) | ϕ ⟩ = ( ϕ ) · 1/( ϕ ) | ϕ ⟩ = 1| ϕ ⟩ = unit(ϕ).

The above triangle (2.11) captures a very basic relationship between sequences,
multisets and distributions, via the notion of (probabilistic) channel.
In the other direction, arr(acc(~x)) does not return ~x. It yields a uniform dis-
tribution over all permutations of the sequence ~x ∈ X K .
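The accumulation and arrangement maps are easy to emulate in code. The sketch
below is a hedged illustration of ours: a multiset is represented as a sorted
tuple of its elements (with multiplicities), so that acc is just sorting and
arr enumerates the distinct orderings.

    from itertools import permutations

    def acc(xs):
        # accumulate a list into a multiset, represented as a sorted tuple
        return tuple(sorted(xs))

    def arr(phi):
        # arrangement channel: uniform distribution on all lists accumulating to phi
        lists = set(permutations(phi))
        return {xs: 1 / len(lists) for xs in lists}

    print(arr(('a', 'a', 'a', 'b')))
    # the four arrangements [a,a,a,b], [a,a,b,a], [a,b,a,a], [b,a,a,a], each with probability 1/4

One can then check the triangle (2.11) numerically: transform(lambda xs: unit(acc(xs)), arr(phi))
returns unit(phi) for any such multiset phi.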
Deleting and adding an element to a natural multiset are basic operations
that are naturally described as channels.
Definition 2.2.4. For a set X and a number K ∈ N we define a draw-delete
channel DD and also a draw-add channel DA in a situation:
    DD : N[K +1](X) → N[K](X)        and        DA : N[K](X) → N[K +1](X).

On ψ ∈ N[K +1](X) and χ ∈ N[K](X) they are defined as:

    DD(ψ) ≔ Σ_{x∈supp(ψ)} ψ(x)/(K + 1) | ψ − 1| x ⟩ ⟩
                                                                              (2.12)
    DA(χ) ≔ Σ_{x∈supp(χ)} χ(x)/K | χ + 1| x ⟩ ⟩

For DA we need to assume that χ is non-empty, or equivalently, that K > 0.


In the draw-delete case DD one draws single elements x ∈ X from the urn
ψ, with probability determined by the number of occurrences ψ(x) ∈ N of x


in the urn ψ. This drawn element x is subsequently removed from the urn via
subtraction ψ − 1| x i.
In the draw-add case DA one also draws single elements x from the urn χ,
but instead of deleting x one adds an extra copy of x, via the sum χ + 1| x i. This
is typical for Pólya style urns, see [115] or Section 3.4 for more information.
These channels are not each other’s inverses. For instance:

    DD( 3| a ⟩ + 1| b ⟩ ) = 3/4| 2| a ⟩ + 1| b ⟩ ⟩ + 1/4| 3| a ⟩ ⟩
    DA( 2| a ⟩ + 1| b ⟩ ) = 2/3| 3| a ⟩ + 1| b ⟩ ⟩ + 1/3| 2| a ⟩ + 2| b ⟩ ⟩.
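The two draw channels are equally easy to emulate, again with multisets as
sorted tuples (a sketch of ours, using only the standard library):

    from collections import Counter

    def DD(psi):
        # draw-and-delete: remove one element, drawn proportionally to its multiplicity
        out = {}
        for x, k in Counter(psi).items():
            rest = list(psi); rest.remove(x)
            key = tuple(sorted(rest))
            out[key] = out.get(key, 0.0) + k / len(psi)
        return out

    def DA(chi):
        # draw-and-add: draw one element and put an extra copy of it back
        out = {}
        for x, k in Counter(chi).items():
            key = tuple(sorted(chi + (x,)))
            out[key] = out.get(key, 0.0) + k / len(chi)
        return out

    print(DD(('a', 'a', 'a', 'b')))   # {('a','a','b'): 0.75, ('a','a','a'): 0.25}
    print(DA(('a', 'a', 'b')))        # {('a','a','a','b'): 0.666..., ('a','a','b','b'): 0.333...}

With compose from the earlier sketch, compose(DA, DD) and compose(DD, DA)
reproduce the two composite distributions computed below.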

In a next step we see that neither DA ◦· DD nor DD ◦· DA is the identity.

    ( DA ◦· DD )( 3| a ⟩ + 1| b ⟩ )
        = DA =≪ ( 3/4| 2| a ⟩ + 1| b ⟩ ⟩ + 1/4| 3| a ⟩ ⟩ )
        = 3/4 · ( 2/3| 3| a ⟩ + 1| b ⟩ ⟩ + 1/3| 2| a ⟩ + 2| b ⟩ ⟩ ) + 1/4 · ( 1| 4| a ⟩ ⟩ )
        = 1/2| 3| a ⟩ + 1| b ⟩ ⟩ + 1/4| 2| a ⟩ + 2| b ⟩ ⟩ + 1/4| 4| a ⟩ ⟩

    ( DD ◦· DA )( 2| a ⟩ + 1| b ⟩ )
        = DD =≪ ( 2/3| 3| a ⟩ + 1| b ⟩ ⟩ + 1/3| 2| a ⟩ + 2| b ⟩ ⟩ )
        = 2/3 · ( 3/4| 2| a ⟩ + 1| b ⟩ ⟩ + 1/4| 3| a ⟩ ⟩ ) + 1/3 · ( 1/2| 1| a ⟩ + 2| b ⟩ ⟩ + 1/2| 2| a ⟩ + 1| b ⟩ ⟩ )
        = 1/2| 2| a ⟩ + 1| b ⟩ ⟩ + 1/6| 3| a ⟩ ⟩ + 1/6| 1| a ⟩ + 2| b ⟩ ⟩ + 1/6| 2| a ⟩ + 1| b ⟩ ⟩.

When we iterate the draw-and-delete and the draw-and-add channels we


obtain distributions that strongly remind us of the hypergeometric and Pólya
distribution, see Example 2.1.1 (3) and (4). But they are not the same. The
iterations below describe not what is drawn from the urn — as in the hyper-
geometric and Pólya cases — but what is left in the urn after such draws. The
full picture appears in Chapter 3.
The iteration of draws is intuitively very natural, leading to subtraction and
addition of the draws ϕ, in the formulas below. The important thing to note is
that the mathematical formalisation of this intuition works because these draws
are channels and are composed via channel composition ◦·.

Theorem 2.2.5. Iterating K ∈ N times the draw-and-delete and draw-and-add


channels from Definition 2.2.4 yields channels:

    DD^K : N[L+K](X) → N[L](X)        and        DA^K : N[L](X) → N[L+K](X).

On ψ ∈ N[L+K](X) and χ ∈ N[L](X) they satisfy:

    DD^K(ψ) = Σ_{ϕ ≤_K ψ} (ψ over ϕ) / (L+K over K) | ψ − ϕ ⟩
    DA^K(χ) = Σ_{ϕ ≤_K χ} ((χ over ϕ)) / ((L over K)) | χ + ϕ ⟩.

The probabilities in these expressions add up to one by Lemma 1.6.2 and


Proposition 1.7.3.

Proof. For both equations we use induction on K ∈ N. In both cases the only
option for ϕ in N[0](X) is the empty multiset 0, so that DD 0 (ψ) = 1| ψi and
DA 0 (χ) = 1|χ i.
For the induction steps we make extensive use of the equations in Exer-
cise 1.7.6. In those cases we shall put ‘E’ above the equation. We start with the
induction step for draw-delete. For ψ ∈ N[L+K + 1](X),

    DD^{K+1}(ψ) = DD^K =≪ DD(ψ)
                = DD^K =≪ ( Σ_{x∈supp(ψ)} ψ(x)/(L+K+1) | ψ − 1| x ⟩ ⟩ )                        by (2.12)
                = Σ_{ϕ∈N[L](X)} Σ_{x∈supp(ψ)} ψ(x)/(L+K+1) · DD^K(ψ − 1| x ⟩)(ϕ) | ϕ ⟩
                = Σ_{x∈supp(ψ)} Σ_{χ ≤_K ψ−1| x ⟩} ψ(x)/(L+K+1) · (ψ−1| x ⟩ over χ)/(L+K over K) | ψ − 1| x ⟩ − χ ⟩        by (IH)
                = Σ_{x∈supp(ψ)} Σ_{χ ≤_K ψ−1| x ⟩} (χ(x)+1)/(K+1) · (ψ over χ+1| x ⟩)/(L+K+1 over K+1) | ψ − (χ + 1| x ⟩) ⟩        by (E)
                = Σ_{ϕ ≤_{K+1} ψ} Σ_{x∈supp(ψ)} ϕ(x)/(K+1) · (ψ over ϕ)/(L+K+1 over K+1) | ψ − ϕ ⟩
                = Σ_{ϕ ≤_{K+1} ψ} (ψ over ϕ)/(L+K+1 over K+1) | ψ − ϕ ⟩,    since kϕk = K + 1.

We reason basically in the same way for draw-add, and now also use Exer-


cise 1.7.5. For χ ∈ N[L](X),


 
    DA^{K+1}(χ) = DA^K =≪ ( Σ_{x∈supp(χ)} χ(x)/L | χ + 1| x ⟩ ⟩ )                              by (2.12)
                = Σ_{x∈supp(χ)} Σ_{ψ ≤_K χ+1| x ⟩} χ(x)/L · ((χ+1| x ⟩ over ψ))/((L+1 over K)) | χ + 1| x ⟩ + ψ ⟩        by (IH)
                = Σ_{x∈supp(χ)} Σ_{ψ ≤_K χ+1| x ⟩} (ψ(x)+1)/(K+1) · ((χ over ψ+1| x ⟩))/((L over K+1)) | χ + ψ + 1| x ⟩ ⟩        by (E)
                = Σ_{ϕ ≤_{K+1} χ} Σ_{x∈supp(χ)} ϕ(x)/(K+1) · ((χ over ϕ))/((L over K+1)) | χ + ϕ ⟩
                = Σ_{ϕ ≤_{K+1} χ} ((χ over ϕ))/((L over K+1)) | χ + ϕ ⟩.

Exercises
2.2.1 Consider the D-channel f : {a, b, c} → {u, v} from Example 2.2.1,
      together with a new D-channel g : {u, v} → {1, 2, 3, 4} given by:

          g(u) = 1/4| 1 ⟩ + 1/8| 2 ⟩ + 1/2| 3 ⟩ + 1/8| 4 ⟩        g(v) = 1/4| 1 ⟩ + 1/8| 3 ⟩ + 5/8| 3 ⟩.

      Describe the composite channel g ◦· f : {a, b, c} → {1, 2, 3, 4} concretely.


2.2.2 Formulate and prove the analogue of Lemma 1.4.3 for D, instead of
for M.
2.2.3 Notice that for a probabilistic channel c one can describe state trans-
formation as a (convex) sum/mixture of states c(x): that is, as:
          c =≪ ω = Σ_x ω(x) · c(x).

2.2.4 Identify the channel f and the state ω in Example 2.2.1 with matrices:
1
1 
 1 3 
 6 
M f =  21 =
4
 1 
1  M ω  2 
2 0 4
 1 
3

Such matrices are called stochastic, since each of their columns has
non-negative entries that add up to one.
Check that the matrix associated with the transformed state f = ω
is the matrix-column multiplication M f · Mω .
(A general description appears in Remark 4.3.5.)


2.2.5 Prove that state transformation along D-channels preserves convex


combinations (2.6), that is, for r ∈ [0, 1],
f = (r · ω + (1 − r) · ρ) = r · ( f = ω) + (1 − r) · ( f = ρ).
2.2.6 Write π : X K+1 → X K for the obvious projection function. Show that
the following diagram of channels commutes.

          N[K +1](X) --DD--> N[K](X)
               |                 |
              arr               arr
               ↓                 ↓
            X^{K+1}  ---π--->   X^K

2.2.7 Consider the arrangement channels arr : N(X) → L(X) from Exam-
ple 2.2.3. Prove that arr is natural: for each function f : X → Y the
following diagram commutes.

          N(X) --arr--> D(L(X))
            |               |
          N( f )        D(L( f ))
            ↓               ↓
          N(Y) --arr--> D(L(Y))

(This is far from easy; you may wish to check first what this means in
a simple case and then content yourself with a ‘proof by example’.)
2.2.8 Show that the draw-and-delete and draw-and-add maps DD and DA
in Definition 2.2.4 are natural, from N[K+1] to D ◦ N[K], and from
N[K] to D ◦ N[K +1].
2.2.9 Recall from (1.34) that the set N[K](X) of K-sized natural multisets
      over a set X with n elements contains ((n over K)) members. We write
      unif_K ∈ D(N[K](X)) for the uniform distribution over such multisets.
      1 Check that:

            unif_K = Σ_{ϕ∈N[K](X)} 1/((n over K)) | ϕ ⟩.

2 Prove that these unif K distributions match the chain of draw-and-


delete maps DD : N[K +1](X) → N[K](X) in the sense that:
DD = unif K+1 = unif K .
In categorical language this means that these uniform distributions
form a cone.
3 Check that the draw-add maps do not preserve uniform distribu-
tions. Check for instance that for X = {a, b},

            DA =≪ unif_3 = 1/4| 4| a ⟩ ⟩ + 1/6| 3| a ⟩ + 1| b ⟩ ⟩ + 1/6| 2| a ⟩ + 2| b ⟩ ⟩
                         + 1/6| 1| a ⟩ + 3| b ⟩ ⟩ + 1/4| 4| b ⟩ ⟩.


2.2.10 Let X be a finite set, say with n elements, of the form X = {x1 , . . . , xn }.
Define for K ≥ 1,
          σ_K ≔ Σ_{1≤i≤n} 1/n | K| x_i ⟩ ⟩  ∈  D(N[K](X)).

      Thus:

          σ_1 = 1/n| 1| x_1 ⟩ ⟩ + · · · + 1/n| 1| x_n ⟩ ⟩
          σ_2 = 1/n| 2| x_1 ⟩ ⟩ + · · · + 1/n| 2| x_n ⟩ ⟩,    etc.
Show that these σK form a cone, both for draw-delete and for draw-
add:
          DD =≪ σ_{K+1} = σ_K        and        DA =≪ σ_K = σ_{K+1}.

      Thus, the whole sequence (σ_K)_{K∈N>0} can be generated from σ_1 =
      unif_{N[1](X)} by repeated application of DA.
2.2.11 Recall the bijective correspondences from Exercise 1.8.4.
1 Let X, Y be finite sets and c : X → D(Y) be a D-channel. We
can then define an M-channel c† : Y → M(X) by swapping ar-
guments: c† (y)(x) = c(x)(y). We call c a ‘bi-channel’ if c† is also a
      D-channel, i.e. if Σ_x c(x)(y) = 1 for each y ∈ Y.
Prove that the identity channel is a bi-channel and that bi-channels
are closed under composition.
2 Take A = {a0 , a1 } and B = {b0 , b1 } and define a channel bell : A ×
2 → D(B × 2) as:
          bell(a_0, 0) = 1/2| b_0, 0 ⟩ + 3/8| b_1, 0 ⟩ + 1/8| b_1, 1 ⟩
          bell(a_0, 1) = 1/2| b_0, 1 ⟩ + 1/8| b_1, 0 ⟩ + 3/8| b_1, 1 ⟩
          bell(a_1, 0) = 3/8| b_0, 0 ⟩ + 1/8| b_0, 1 ⟩ + 1/8| b_1, 0 ⟩ + 3/8| b_1, 1 ⟩
          bell(a_1, 1) = 1/8| b_0, 0 ⟩ + 3/8| b_0, 1 ⟩ + 3/8| b_1, 0 ⟩ + 1/8| b_1, 1 ⟩
Check that bell is a bi-channel.
(It captures the famous Bell table from quantum theory; we have
deliberately used open spaces in the above description of the chan-
nel bell so that non-zero entries align, giving a ‘bi-stochastic’ ma-
trix, from which one can read bell † vertically.)
2.2.12 Check that the inclusions D(X) ,→ M(X) form a map of monads, as
described in Definition 1.9.2.
2.2.13 Recall from Exercise 1.9.6 that for a fixed set A the mapping X 7→
X + A is a monad. Prove that X 7→ D(X + A) is also a monad.
(The latter monad will be used in Chapter ?? to describe Markov
models with outputs. A composition of two monads is not necessarily


again a monad, but it is when there is a so-called distributive law


between the monads, see e.g. [70, Chap. 5] for details.)

2.3 Frequentist learning: from multisets to distributions


We have introduced distributions as special multisets, namely as multisets in
which the multiplicities add up to one, so that D(X) ⊆ M(X). There are many
other relations and interactions between distributions and multisets. As we
have mentioned, an urn containing coloured balls can be described aptly as
a multiset. The distribution for drawing a ball can then be derived from the
multiset. In this section we shall describe this situation in terms of a map-
ping from multisets to distributions, called ‘frequentist learning’ since it basi-
cally involves counting. Later on we shall see that other forms of learning from
data can be described in terms of passages from multisets to distributions. This
forms a central topic.
In earlier sections we have seen several (natural) mappings between collec-
tion types, in the form of support maps, see the overview in Diagram (1.30).
We are now adding the mapping Flrn : M∗ (X) → D(X), see [76]. The name
Flrn stands for ‘frequentist learning’, and may be pronounced as ‘eff-learn’.
The frequentist interpretation of probability theory views probabilities as long
term frequencies of occurrences. Here, these occurrences are given via multi-
sets, which form the inputs of the Flrn function. Later on, in Theorem 4.5.7 we
show that these outcomes of frequentist learning lie dense in the set of distri-
butions, which means that we can approximate each distribution with arbitrary
precision via frequentist learning of (natural) multisets.
Recall that M∗(X) is the collection of non-empty multisets Σ_i r_i | x_i ⟩, with
r_i ≠ 0 for at least one index i. Equivalently one can require that the sum
s ≔ Σ_i r_i = kΣ_i r_i | x_i ⟩k is non-zero.

The Flrn maps turns a (non-empty) multiset into a distribution, essentially


by normalisation:
    Flrn( r_1| x_1 ⟩ + · · · + r_k| x_k ⟩ ) ≔ r_1/s | x_1 ⟩ + · · · + r_k/s | x_k ⟩
                                                                              (2.13)
    where s ≔ Σ_i r_i .

The normalisation step forces the sum on the right-hand side to be a convex
sum, with factors adding up to one. Clearly, from an empty multiset we cannot
learn a distribution — technically because the above sum s is then zero so that
we cannot divide by s.
Using scalar multiplication from Lemma 1.4.2 (2) we can define the Flrn


function more succinctly:

    Flrn(ϕ) ≔ 1/kϕk · ϕ        where  kϕk ≔ Σ_x ϕ(x).                         (2.14)
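In code, frequentist learning is a one-liner; the sketch below (ours) takes a
multiset as a dictionary of counts:

    def flrn(phi):
        # frequentist learning: normalise a non-empty multiset to a distribution
        total = sum(phi.values())
        return {x: k / total for x, k in phi.items()}

    print(flrn({'a': 2, 'b': 4, 'c': 2}))   # {'a': 0.25, 'b': 0.5, 'c': 0.25}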
Example 2.3.1. We present two illustrations of frequentist learning.
1 Suppose we have some coin of which the bias is unknown. There are experi-
mental data showing that after tossing the coin 50 times, 20 times come up
head (H) and 30 times yield tail (T ). We can present these data as a multiset
ϕ = 20| H i+30| T i ∈ M∗ ({H, T }). When we wish to learn the resulting prob-
abilities, we apply the frequentist learning map Flrn and get a distribution
   in D({H, T}), namely:

       Flrn(ϕ) = 20/(20+30)| H ⟩ + 30/(20+30)| T ⟩ = 2/5| H ⟩ + 3/5| T ⟩.

   Thus, the bias (towards head) is 2/5. In this simple case we could have ob-
tained this bias immediately from the data, but the Flrn map captures the
general mechanism.
Notice that with frequentist learning, more (or less) consistent data gives
the same outcome. For instance if we knew that 40 out of 100 tosses were
head, or 2 out of 5, we would still get the same bias. Intuitively, these data
give more (or less) confidence about the data. These aspects are not cov-
ered by frequentist learning, but by a more sophisticated form of ‘Bayesian’
learning. Another disadvantage of the rather primitive form of frequentist
learning is that prior knowledge, if any, about the bias is not taken into ac-
count.
2 Recall the medical table (1.18) captured by the multiset τ ∈ N(B × M).
Learning from τ yields the following joint distribution:
Flrn(τ) = 0.1| H, 0 i + 0.35| H, 1 i + 0.25| H, 2 i
+ 0.05| L, 0 i + 0.1| L, 1 i + 0.15| L, 2 i.
Such a distribution, directly derived from a table, is sometimes called an
empirical distribution [33].
In the above coin example we saw a property that is typical of frequentist
learning, namely that learning from more of the same does not have any effect.
We can make this precise via the equation:

    Flrn( K · ϕ ) = Flrn(ϕ)        for K > 0.                                 (2.15)
In [57] it is argued that in general, people are not very good at probabilistic
(esp. Bayesian) reasoning, but that they are much better at reasoning with “fre-
quency formats”. To put it simply: the information (from D) that there is a 0.04


probability of getting a disease is more difficult to process than the information


(from M) that 4 out of 100 people get the disease. In the current setting these
frequency formats would correspond to natural multisets; they can be turned
into distributions via the frequentist learning map Flrn. In the sequel we regu-
larly return to the interaction between multisets and distributions in relation to
drawing from an urn (in Chapter 3) and to learning (in Chapter ??).
It turns out that the learning map Flrn is ‘natural’, in the sense that it works
uniformly for each set.

Lemma 2.3.2. The frequentist learning maps Flrn : M∗ (X) → D(X) from (2.13)
are natural in X. This means that for each function f : X → Y the following
diagram commutes.

          M∗(X) --Flrn--> D(X)
            |               |
         M∗( f )          D( f )
            ↓               ↓
          M∗(Y) --Flrn--> D(Y)

As a result, frequentist learning commutes with marginalisation (via projection


functions), see also Subsection 2.5.1.

Proof. Pick an arbitrary non-empty multiset ϕ = Σ_i r_i | x_i ⟩ in M∗(X) and write
s ≔ kϕk = Σ_i r_i . By non-emptiness of ϕ we have s ≠ 0. Then:

    ( Flrn ◦ M∗( f ) )(ϕ) = Flrn( Σ_i r_i | f (x_i) ⟩ )
                          = Σ_i r_i/s | f (x_i) ⟩
                          = D( f )( Σ_i r_i/s | x_i ⟩ )
                          = ( D( f ) ◦ Flrn )(ϕ).


We can apply this basic result to the medical data in Table (1.18), via the
multiset τ ∈ N(B × M). We have already seen in Section 1.4 that the multiset-
marginals N(πi )(τ) produce the marginal columns and rows, with their totals.
We can learn the distributions from the columns as:

    Flrn( M(π_1)(τ) ) = Flrn( 70| H ⟩ + 30| L ⟩ ) = 0.7| H ⟩ + 0.3| L ⟩.


 

We can also take the distribution-marginal of the ‘learned’ distribution from


the table, as described in Example 2.3.1 (2):

    M(π_1)( Flrn(τ) ) = (0.1 + 0.35 + 0.25)| H ⟩ + (0.05 + 0.1 + 0.15)| L ⟩
                      = 0.7| H ⟩ + 0.3| L ⟩.

Hence the basic operations of learning and marginalisations commute. This


is a simple result, which many practitioners in probability are surely aware


of, at an intuitive level, but maybe not in the mathematically precise form of
Lemma 2.3.2.
Drawing an object from an urn is an elementary operation in probability
theory, which involves frequentist learning Flrn. For instance, the draw-and-
delete and draw-and-add operations DD and DA from Definition 2.2.4 can be
described, for urns ψ and χ with kψk = K + 1 and kχk = K, as:
    DD(ψ) = Σ_{x∈supp(ψ)} ψ(x)/(K + 1) | ψ − 1| x ⟩ ⟩ = Σ_{x∈supp(ψ)} Flrn(ψ)(x) | ψ − 1| x ⟩ ⟩
    DA(χ) = Σ_{x∈supp(χ)} χ(x)/K | χ + 1| x ⟩ ⟩ = Σ_{x∈supp(χ)} Flrn(χ)(x) | χ + 1| x ⟩ ⟩.

Since this drawing takes the multiplicities into account, the urns after the
DD and DA draws have the same frequentist distribution as before, but only
if we interpret “after” as channel composition ◦·. This is the content of the
following basic result.

Theorem 2.3.3. One has both:

Flrn ◦· DD = Flrn and Flrn ◦· DA = Flrn.


Equivalently, the following two triangles of channels commute.

        N[K +1](X) --DD--> N[K](X)            N[K](X) --DA--> N[K +1](X)
             \                /                    \                /
             Flrn          Flrn                    Flrn          Flrn
               \            /                        \            /
                v          v                          v          v
                     X                                     X

Proof. For the proof of commutation of draw-delete and frequentist learning,
we take ψ ∈ N[K +1](X) and y ∈ X. Then:

    ( Flrn ◦· DD )(ψ)(y) = Σ_{ϕ∈N[K](X)} DD(ψ)(ϕ) · Flrn(ϕ)(y)
                         = Σ_{x∈X} ψ(x)/(K+1) · Flrn(ψ − 1| x ⟩)(y)                            by (2.12)
                         = ψ(y)/(K+1) · (ψ(y) − 1)/K + Σ_{x≠y} ψ(x)/(K+1) · ψ(y)/K
                         = ψ(y)/(K(K+1)) · ( (ψ(y) − 1) + Σ_{x≠y} ψ(x) )
                         = ψ(y)/(K(K+1)) · ( Σ_x ψ(x) − 1 )
                         = ψ(y)/(K(K+1)) · ( (K+1) − 1 ) = ψ(y)/(K+1) = Flrn(ψ)(y).


Similarly, for χ ∈ N[K](X), where K > 0, and y ∈ X,

    ( Flrn ◦· DA )(χ)(y) = Σ_{x∈X} Flrn(χ + 1| x ⟩)(y) · χ(x)/K                                by (2.12)
                         = (χ(y) + 1)/(K+1) · χ(y)/K + Σ_{x≠y} χ(y)/(K+1) · χ(x)/K
                         = χ(y)/(K(K+1)) + χ(y)/(K+1) · ( χ(y)/K + Σ_{x≠y} χ(x)/K )
                         = χ(y)/(K(K+1)) + χ(y)/(K+1) · Σ_x χ(x)/K
                         = χ(y)/(K(K+1)) + χ(y)/(K+1)
                         = (χ(y) + K · χ(y))/(K(K+1)) = χ(y)/K = Flrn(χ)(y).
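A quick numerical spot-check of this theorem, reusing the earlier sketches
(transform, DD on sorted-tuple multisets, and flrn on count dictionaries,
bridged via Counter), may be reassuring:

    from collections import Counter

    psi = ('a', 'a', 'a', 'b')
    lhs = transform(lambda phi: flrn(Counter(phi)), DD(psi))   # Flrn after draw-and-delete
    print(lhs)                      # {'a': 0.75, 'b': 0.25}
    print(flrn(Counter(psi)))       # {'a': 0.75, 'b': 0.25}, the same distribution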
Remark 2.3.4. In Lemma 2.3.2 we have seen that Flrn : M∗ → D is a natural
transformation. Since both M∗ and D are monads, one can ask if Flrn is also
a map of monads. This would mean that Flrn also commutes with the unit and
flatten maps, see Definition 1.9.2. This is not the case.
It is easy to see that Flrn commutes with the unit maps, simply because
Flrn(1| x ⟩) = 1| x ⟩. But commutation with the flatten maps fails. Here is a simple
counterexample. Consider the multiset of multisets Φ ∈ M(M({a, b, c})) given by:

    Φ ≔ 1| 2| a ⟩ + 4| c ⟩ ⟩ + 2| 1| a ⟩ + 1| b ⟩ + 1| c ⟩ ⟩.

First taking (multiset) multiplication, and then doing frequentist learning gives:

    Flrn( flat(Φ) ) = Flrn( 4| a ⟩ + 2| b ⟩ + 6| c ⟩ ) = 1/3| a ⟩ + 1/6| b ⟩ + 1/2| c ⟩.

However, first (outer and inner) learning and then doing (distribution) multipli-
cation yields:

    flat( Flrn( M(Flrn)(Φ) ) ) = 1/3 · ( 1/3| a ⟩ + 2/3| c ⟩ ) + 2/3 · ( 1/3| a ⟩ + 1/3| b ⟩ + 1/3| c ⟩ )
                               = 1/3| a ⟩ + 2/9| b ⟩ + 4/9| c ⟩.

Exercises
2.3.1 Recall the data / multisets about child ages and blood types in the
beginning of Subsection 1.4.1. Compute the associated (empirical)
distributions.
Plot these distributions as a graph. How do they compare to the
plots (1.16) and (1.17)?


2.3.2 Check that frequentist learning from a constant multiset yields a uni-
form distribution. And also that frequentist learning is invariant under
(non-zero) scalar multiplication for multisets: Flrn(s · ϕ) = Flrn(ϕ)
for s ∈ R>0 .
2.3.3 1 Prove that for multisets ϕ, ψ ∈ M∗ (X) one has:
          Flrn( ϕ + ψ ) = kϕk/(kϕk + kψk) · Flrn(ϕ) + kψk/(kϕk + kψk) · Flrn(ψ).

This means that when one has already learned Flrn(ϕ) and new
data ψ arrives, all probabilities have to be adjusted, as in the above
convex sum of distributions.
2 Check that the following formulation for natural multisets of fixed
sizes K > 0, L > 0 is a special case of the previous item.
          N[K](X) × N[L](X) --------+--------> N[K +L](X)
                 |                                  |
            Flrn × Flrn                           Flrn
                 ↓                                  ↓
          D(X) × D(X) --K/(K+L)·(−) + L/(K+L)·(−)--> D(X)

2.3.4 Show that Diagram (1.30) can be refined to:

          L∗(X) -------supp-------> Pfin(X)
              \                        ↑
              acc                    supp                                     (2.16)
                \                      |
                 v                     |
                 M∗(X) ----Flrn----> D(X)

where L∗ (X) ⊆ L(X) is the subset of non-empty lists.


2.3.5 Let c : X → D(Y) be a D-channel and ϕ ∈ M(X) be a multiset.
Because D(Y) ⊆ M(Y) we can also consider c as an M-channel, and
use c = ϕ. Prove that:

Flrn(c = ϕ) = c = Flrn(ϕ) = Flrn(c = Flrn(ϕ)).

2.3.6 Let X be a finite set and K ∈ N be an arbitrary number. Show that for
σ ∈ D N[K +1](X) and ϕ ∈ N[K](X) one has:


          ( DD =≪ σ )(ϕ) / ( ϕ ) = Σ_{x∈X} σ(ϕ + 1| x ⟩) / ( ϕ + 1| x ⟩ ).

2.3.7 Recall the uniform distributions unif K ∈ D N[K](X) from Exer-
cise 2.2.9, where the set X is finite. Use Lemma 1.7.5 to prove that
Flrn = unif K = unif X ∈ D(X).


2.4 Parallel products


In the very first section of the first chapter we have seen Cartesian, parallel
products X × Y of sets X, Y. Here we shall look at parallel products of states,
and also at parallel products of channels. These new products will be written
as tensors ⊗. They express parallel combination. These tensors exist for P, M
and D, but not for lists L. The reason for this absence will be explained below,
in Remark 2.4.3.
In this section we start with a brief uniform description of parallel products,
for multiple collection types — in the style of the first chapter — but we shall
quickly zoom in on the probabilistic case. Products ⊗ of distributions have their
own dynamics, due to the requirement that probabilities, also over a product,
must add up to one. This means that the two components of a ‘joint’ distri-
bution, over a product space, can be correlated. Indeed, a joint distribution is
typically not equal to the product of its marginals: the whole is more than the
product of its parts. It also means, as we shall see in Chapter 6, that updating
in one part has effect in other parts: the two parts of a joint distribution ‘listen’
to each other.

Definition 2.4.1. Tensors, also called parallel products, will be defined first for
states, for each collection type separately, and then for channels, in a uniform
manner.

1 Let X, Y be arbitrary sets.


a For U ∈ P(X) and V ∈ P(Y) we define U ⊗ V ∈ P(X × Y) as:

U ⊗ V B {(x, y) ∈ X × Y | x ∈ U and y ∈ V}.

This product U ⊗ V is often written simply as a product of sets U × V, but


we prefer to have a separate notation for this product of subsets.
b For ϕ ∈ M(X) and ψ ∈ M(Y) we get ϕ ⊗ ψ ∈ M(X × Y) as:
          ϕ ⊗ ψ ≔ Σ_{x∈X, y∈Y} (ϕ(x) · ψ(y)) | x, y ⟩,    that is    (ϕ ⊗ ψ)(x, y) = ϕ(x) · ψ(y).

c For ω ∈ D(X) and ρ ∈ D(Y) we use ω ⊗ ρ as in the previous item,


for multisets. This is well-defined, with outcome in D(X × Y), since the
relevant multiplicities add up to one. This tensor also works for infinite
support, i.e. for D∞ .
2 Let c : X → Y and d : A → B be two channels, both of the same type
T ∈ {P, M, D}. A channel c ⊗ d : X × A → Y × B is defined via:

    (c ⊗ d)(x, a) ≔ c(x) ⊗ d(a).


The right-hand side uses the tensor product of the appropriate type T , as
defined in the previous item.
We shall use tensors not only in binary form ⊗, but also in n-ary form ⊗ · · · ⊗,
both for states and for channels.
We see that tensors ⊗ involve the products × of the underlying domains. A
simple illustration of a (probabilistic) tensor product is:

    ( 1/6| u ⟩ + 1/3| v ⟩ + 1/2| w ⟩ ) ⊗ ( 3/4| 0 ⟩ + 1/4| 1 ⟩ )
        = 1/8| u, 0 ⟩ + 1/24| u, 1 ⟩ + 1/4| v, 0 ⟩ + 1/12| v, 1 ⟩ + 3/8| w, 0 ⟩ + 1/8| w, 1 ⟩.
These tensor products tend to grow quickly in size, since the numbers of
entries of the two parts get multiplied.
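The tensor of two distributions is again a one-line dictionary comprehension;
the following sketch (ours) reproduces the illustration above:

    def tensor(omega, rho):
        # parallel product of two distributions, as a joint distribution on pairs
        return {(x, y): p * q for x, p in omega.items() for y, q in rho.items()}

    print(tensor({'u': 1/6, 'v': 1/3, 'w': 1/2}, {0: 3/4, 1: 1/4}))
    # {('u',0): 0.125, ('u',1): 0.0416..., ('v',0): 0.25, ('v',1): 0.0833..., ('w',0): 0.375, ('w',1): 0.125}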
Parallel products are well-behaved, as described below.
Lemma 2.4.2. 1 Transformation of parallel states via parallel channels is the
parallel product of the individual transformations:
(c ⊗ d) = (ω ⊗ ρ) = (c = ω) ⊗ (d = ρ).
2 Parallel products of channels interact nicely with unit and composition:
unit ⊗ unit = unit (c1 ⊗ d1 ) ◦· (c2 ⊗ d2 ) = (c1 ◦· c2 ) ⊗ (d1 ◦· d2 ).
3 The tensor of trivial, deterministic channels is obtained from their product:
 f  ⊗ g =  f × g

where f × g = h f ◦ π1 , g ◦ π2 i, see Subsection 1.1.1.


4 The image distribution along a product function is a tensor of images:
D( f × g)(ω ⊗ ρ) = D( f )(ω) ⊗ D(g)(ρ).
Proof. We shall do the proofs for M and for D of item (1) at once, using
Exercise 2.2.3 (1), and leave the remaining cases to the reader.
    ( (c ⊗ d) =≪ (ω ⊗ ρ) )(y, b) = Σ_{x∈X, a∈A} (c ⊗ d)(x, a)(y, b) · (ω ⊗ ρ)(x, a)
                                 = Σ_{x∈X, a∈A} (c(x) ⊗ d(a))(y, b) · ω(x) · ρ(a)
                                 = Σ_{x∈X, a∈A} c(x)(y) · d(a)(b) · ω(x) · ρ(a)
                                 = ( Σ_{x∈X} c(x)(y) · ω(x) ) · ( Σ_{a∈A} d(a)(b) · ρ(a) )
                                 = (c =≪ ω)(y) · (d =≪ ρ)(b)
                                 = ( (c =≪ ω) ⊗ (d =≪ ρ) )(y, b).



As promised we look into why parallel products don’t work for lists.

Remark 2.4.3. Suppose we have two list [a, b, c] and [u, v] and we wish to
form their parallel product. Then there are many ways to do so. For instance,
two obvious choices are:
[ha, ui, ha, vi, hb, ui, hb, vi, hc, ui, hc, vi]
[ha, ui, hb, ui, hc, ui, ha, vi, hb, vi, hc, vi]
There are many other possibilities. The problem is that there is no canonical
choice. Since the order of elements in a list matter, there is no commutativity
property which makes all options equivalent, like for multisets. Technically,
the tensor for L does not exist because L is not a commutative (i.e. monoidal)
monad; this is an early result in category theory going back to [102].

From now on we concentrate on parallel products ⊗ for distributions and


for probabilistic channels. We illustrate the use of ⊗ of probabilistic channels
for two standard ‘summation’ results. We start with a simple property, which
is studied in more complicated form in Exercise 2.4.3. Abstractly, it can be
understood in terms of convolutions. In general, a convolution of parallel maps
f, g : X → Y is a composite of the form:

    X --split--> X × X --f ⊗ g--> Y × Y --join--> Y                           (2.17)

The split and join operations depend on the situation. In the next result they
are copy and sum.

Lemma 2.4.4. Recall the flip (or Bernoulli) distribution flip(r) = r|1 i + (1 −
r)| 0 i and consider it as a channel flip : [0, 1] → N. The convolution of two
such flip’s is then a binomial distribution, as described in the following dia-
gram of channels.
    [0, 1] --∆--> [0, 1] × [0, 1] --flip ⊗ flip--> N × N --+--> N,    equal to    bn[2] : [0, 1] → N.

Proof. For r ∈ [0, 1] we compute:


 
    D(+)( flip(r) ⊗ flip(r) )
        = D(+)( r²| 1, 1 ⟩ + r·(1 − r)| 1, 0 ⟩ + (1 − r)·r| 0, 1 ⟩ + (1 − r)²| 0, 0 ⟩ )
        = r²| 2 ⟩ + 2·r·(1 − r)| 1 ⟩ + (1 − r)²| 0 ⟩
        = Σ_{0≤i≤2} (2 over i) · r^i · (1 − r)^{2−i} | i ⟩ = bn[2](r).
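Numerically this convolution is easy to check; the sketch below (ours) sums two
independent flips and compares with the binomial probabilities:

    def dsum(omega, rho):
        # convolution sum of two distributions on numbers
        out = {}
        for a, p in omega.items():
            for b, q in rho.items():
                out[a + b] = out.get(a + b, 0.0) + p * q
        return out

    flip = lambda r: {1: r, 0: 1 - r}
    print(dsum(flip(0.3), flip(0.3)))   # {2: 0.09, 1: 0.42, 0: 0.49}, i.e. bn[2](0.3)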


The structure used in this result is worth making explicit. We have intro-
duced probability distributions on sets, as sample spaces. It turns out that if
this underlying set, say M, happens to be a commutative monoid, then D(M)
also forms a commutative monoid. This makes for instance the set D(N) of
distributions on the natural numbers a commutative monoid — even in two
ways, using either the additive or multiplicative monoid structure on N. The
construction occurs in [99, p.82], but does not seem to be widely used and/or
familiar. It is an instance of an abstract form of convolution in [104, §10].
Since it plays an important role here we make it explicit. It can be described
via parallel products ⊗ of distributions.

Proposition 2.4.5. Let M = (M, +, 0) be a commutative monoid. Define a sum


and zero element on D(M) via:

ω + ρ B D(+) ω ⊗ ρ  +: M × M → M
 

using  (2.18)
0 B 1| 0 i 0 ∈ M

Alternative descriptions of this sum of distributions are:


    ω + ρ = Σ_{a,b∈M} ω(a) · ρ(b) | a + b ⟩
    ( Σ_i r_i| a_i ⟩ ) + ( Σ_j s_j| b_j ⟩ ) = Σ_{i, j} r_i · s_j | a_i + b_j ⟩.

1 Via these sums +, 0 of distributions, the set D(M) forms a commutative


monoid.
2 If f : M → N is a homomorphism of commutative monoids, then so is
D( f ) : D(M) → D(N).

Let CMon be the category of commutative monoid, with monoid homomor-


phisms as arrows between them. The above two items tell that the distribution
functor D : Sets → Sets can be restricted to a functor D : CMon → CMon in
a commuting diagram:

          CMon --D--> CMon
            |           |
            ↓           ↓                                                     (2.19)
          Sets --D--> Sets

The vertical arrows ‘forget’ the monoid structure, by sending a monoid to its
underlying set.

Exercise 2.4.4 below contains some illustrations of this definition. In Ex-


ercise 2.4.12 we shall see that the restricted (or lifted) functor D : CMon →
CMon is also a monad. The same construction works for D∞ .


Proof. 1 Commutativity and associativity of + on D(M) follow from com-


mutativity and associativity of + on M, and of multiplication · on [0, 1].
Next,
    ω + 0 = ω + 1| 0 ⟩ = Σ_{a∈M} ω(a) · 1 | a + 0 ⟩ = Σ_{a∈M} ω(a) | a ⟩ = ω.

2 Clearly, D( f )(1| 0i) = 1| f (0) i = 1| 0 i, and:


  
D( f ) ω + ρ = D( f ) ◦ D(+) (ω ⊗ ρ)
 
= D(+) ◦ D( f × f ) (ω ⊗ ρ)
 
= D(+) D( f )(ω) ⊗ D( f )(ρ) by Lemma 2.4.2 (4)
= D( f )(ω) + D( f )(ρ).

As mentioned, we use this sum of distributions in Lemma 2.4.4. We may


now reformulate it as:

flip(r) + flip(r) = bn[2](r).

We consider a similar property of Poisson distributions pois[λ], see (2.7).


This property is commonly expressed in terms of random variables Xi as: if
X1 ∼ pois[λ1 ] and X2 ∼ pois[λ2 ] then X1 + X2 ∼ pois[λ1 + λ2 ]. We have not
discussed random variables yet, nor the meaning of ∼, but we do not need them
for the channel-based reformulation below.
Recall that the Poisson distribution has infinite support, so that we need to
use D∞ instead of D, see Remark 2.1.4, but that difference is immaterial here.
We now use the mapping λ 7→ pois[λ] as a function R≥0 → D∞ (N) and as a
D∞ -channel pois : R≥0 → N.

Proposition 2.4.6. The Poisson channel pois : R≥0 → N is a homomorphism


of monoids.
Thus, for λ1 , λ2 ∈ R≥0 ,

pois[λ1 + λ2 ] = pois[λ1 ] + pois[λ2 ] and pois[0] = 1| 0i.

We can also express this via commutation of the following diagrams of chan-
nels.
          R≥0 × R≥0 ----+----> R≥0                 1 ----0----> R≥0
               |                 |                 |              |
         pois ⊗ pois           pois              unit           pois
               ↓                 ↓                 ↓              ↓
             N × N  -----+---->  N                 1 ----0---->   N

Proof. We first do preservation of sums +, for which we pick arbitrary λ1 , λ2 ∈


R≥0 and k ∈ N.
    ( pois[λ_1] + pois[λ_2] )(k) = D(+)( pois[λ_1] ⊗ pois[λ_2] )(k)                            by (2.18)
        = Σ_{k_1,k_2, k_1+k_2=k} ( pois[λ_1] ⊗ pois[λ_2] )(k_1, k_2)
        = Σ_{0≤m≤k} pois[λ_1](m) · pois[λ_2](k − m)
        = Σ_{0≤m≤k} ( e^{−λ_1} · λ_1^m/m! ) · ( e^{−λ_2} · λ_2^{k−m}/(k − m)! )                 by (2.7)
        = Σ_{0≤m≤k} e^{−(λ_1+λ_2)}/k! · k!/(m!(k − m)!) · λ_1^m · λ_2^{k−m}
        = e^{−(λ_1+λ_2)}/k! · Σ_{0≤m≤k} (k over m) · λ_1^m · λ_2^{k−m}
        = e^{−(λ_1+λ_2)}/k! · (λ_1 + λ_2)^k                                                     by (1.26)
        = pois[λ_1 + λ_2](k)                                                                    by (2.7).
Finally, in the expression pois[0] = Σ_k e^0 · 0^k/k! | k ⟩ everything vanishes except
for k = 0, since only 0^0 = 1. Hence pois[0] = 1| 0 ⟩.

The next illustration shows that tensors for different collection types are
related.

Proposition 2.4.7. The maps supp : D → P and Flrn : M∗ → D commute


with tensors, in the sense that:

    supp( σ ⊗ τ ) = supp(σ) ⊗ supp(τ)        Flrn( ϕ ⊗ ψ ) = Flrn(ϕ) ⊗ Flrn(ψ).


 

This says that support and frequentist learning are ‘monoidal’ natural trans-
formations.

Proof. We shall do the Flrn-case and leave supp as exercise. We first note that:

    kϕ ⊗ ψk = Σ_z (ϕ ⊗ ψ)(z) = Σ_{x,y} (ϕ ⊗ ψ)(x, y)
            = Σ_{x,y} ϕ(x) · ψ(y)
            = ( Σ_x ϕ(x) ) · ( Σ_y ψ(y) ) = kϕk · kψk.

Then:

    Flrn( ϕ ⊗ ψ )(x, y) = (ϕ ⊗ ψ)(x, y) / kϕ ⊗ ψk = ϕ(x)/kϕk · ψ(y)/kψk
                        = Flrn(ϕ)(x) · Flrn(ψ)(y)
                        = ( Flrn(ϕ) ⊗ Flrn(ψ) )(x, y).


We can form a K-ary product X K for a set X. But also for a distribution
ω ∈ D(X) we can form ωK = ω ⊗ · · · ⊗ ω ∈ D(X K ). This lies at the heart
of the notion of ‘independent and identical distributions’. We define a separate
function for this construction:
    iid [K] : D(X) → D(X^K)        with        iid [K](ω) = ω ⊗ · · · ⊗ ω  (K times).          (2.20)
We can describe iid [K] diagrammatically as composite, making the copying
of states explicit:

    iid [K] = ( D(X) ---copy---> D(X)^K --⊗[K]--> D(X^K) )
                                                                              (2.21)
    where    ⊗[K](ω_1, . . . , ω_K) ≔ ω_1 ⊗ · · · ⊗ ω_K .

We often omit the parameter K when it is clear from the context. This map iid
will pop-up occasionally in the sequel. At this stage we only mention a few of
its properties, in combination with zipping.
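A small sketch (ours) of the iid map, building the K-fold tensor power of a
distribution-as-dictionary over tuples:

    def iid(omega, K):
        # iid[K]: the K-fold tensor power, a distribution on K-tuples
        out = {(): 1.0}
        for _ in range(K):
            out = {xs + (x,): p * q for xs, p in out.items() for x, q in omega.items()}
        return out

    print(iid({0: 1/3, 1: 2/3}, 2))
    # {(0,0): 1/9, (0,1): 2/9, (1,0): 2/9, (1,1): 4/9}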

Lemma 2.4.8. Consider the iid maps from (2.20), and also the zip function
from Exercise 1.1.7.

1 These iid maps are natural, which we shall describe in two ways. For a
function f : X → Y and for a channel c : X → Y the following two diagrams
  commute.

          D(X) --iid--> D(X^K)                D(X) --iid--> X^K
            |              |                    |              |
          D( f )       D( f^K )             c =≪ (−)    c^K = c ⊗ · · · ⊗ c
            ↓              ↓                    ↓              ↓
          D(Y) --iid--> D(Y^K)                D(Y) --iid--> Y^K

2 The iid maps interact suitably with tensors of distributions, as expressed by


the following diagram.

          D(X) × D(Y) --iid ⊗ iid--> X^K × Y^K
               |                         |
               ⊗                        zip
               ↓                         ↓
           D(X × Y) ------iid-----> (X × Y)^K

3 The zip function commutes with tensors in the following way.


          D(X)^K × D(Y)^K --⊗[K] × ⊗[K]--> D(X^K) × D(Y^K) ----⊗----> D(X^K × Y^K)
                 |                                                          |
                zip                                                      D(zip)
                 ↓                                                          ↓
          ( D(X) × D(Y) )^K ------⊗^K------> D(X × Y)^K ----⊗[K]----> D( (X × Y)^K )


Proof. 1 Commutation of the diagram on the left, for a function f , is a direct


consequence of Lemma 2.4.2 (4), so we concentrate on the one on the right,
for a channel c : X → Y. We use Lemma 2.4.2 (1) in:

    ( c^K ◦· iid )(ω) = (c ⊗ · · · ⊗ c) =≪ (ω ⊗ · · · ⊗ ω)
                      = (c =≪ ω) ⊗ · · · ⊗ (c =≪ ω)
                      = iid (c =≪ ω)
                      = ( iid ◦ (c =≪ (−)) )(ω).


2 For ω ∈ D(X) and ρ ∈ D(Y),

    ( ⟨zip⟩ ◦· (iid ⊗ iid ) )(ω, ρ) = D(zip)( ω^K ⊗ ρ^K )
                                    = Σ_{~x∈X^K} Σ_{~y∈Y^K} ω^K(~x) · ρ^K(~y) | zip(~x, ~y) ⟩
                                    = Σ_{~z∈(X×Y)^K} (ω ⊗ ρ)^K(~z) | ~z ⟩
                                    = (ω ⊗ ρ)^K
                                    = iid (ω ⊗ ρ).

3 For distributions ωi ∈ D(X), ρi ∈ D(Y) and elements xi ∈ X, yi ∈ Y we


elaborate:

    ( D(zip) ◦ ⊗ ◦ (⊗[K] × ⊗[K]) )( [ω_1, . . . , ω_K], [ρ_1, . . . , ρ_K] )( [(x_1, y_1), . . . , (x_K, y_K)] )
        = D(zip)( (ω_1 ⊗ · · · ⊗ ω_K) ⊗ (ρ_1 ⊗ · · · ⊗ ρ_K) )( [(x_1, y_1), . . . , (x_K, y_K)] )
        = (ω_1 ⊗ · · · ⊗ ω_K)( [x_1, . . . , x_K] ) · (ρ_1 ⊗ · · · ⊗ ρ_K)( [y_1, . . . , y_K] )
        = ω_1(x_1) · . . . · ω_K(x_K) · ρ_1(y_1) · . . . · ρ_K(y_K)
        = ω_1(x_1) · ρ_1(y_1) · . . . · ω_K(x_K) · ρ_K(y_K)
        = (ω_1 ⊗ ρ_1)(x_1, y_1) · . . . · (ω_K ⊗ ρ_K)(x_K, y_K)
        = ( (ω_1 ⊗ ρ_1) ⊗ · · · ⊗ (ω_K ⊗ ρ_K) )( [(x_1, y_1), . . . , (x_K, y_K)] )
        = ( ⊗[K] ◦ ⊗^K )( [(ω_1, ρ_1), . . . , (ω_K, ρ_K)] )( [(x_1, y_1), . . . , (x_K, y_K)] )
        = ( ⊗[K] ◦ ⊗^K ◦ zip )( [ω_1, . . . , ω_K], [ρ_1, . . . , ρ_K] )( [(x_1, y_1), . . . , (x_K, y_K)] ).


Exercises
2.4.1 Prove the equation for supp in Proposition 2.4.7.
2.4.2 Show that tensoring of multisets is linear, in the sense that for σ ∈
M(X) the operation ‘tensor with σ’ σ ⊗ (−) : M(Y) → M(X × Y) is


linear w.r.t. the cone structure of Lemma 1.4.2 (2): for τ1 , . . . , τn ∈


D(Y) and r1 , . . . , rn ∈ R≥0 one has:
 
          σ ⊗ ( r_1 · τ_1 + · · · + r_n · τ_n ) = r_1 · ( σ ⊗ τ_1 ) + · · · + r_n · ( σ ⊗ τ_n ).
 

      The same holds in the other coordinate, for (−) ⊗ τ. As a special case we
obtain that when σ is a probability distribution, then σ⊗(−) preserves
convex sums of distributions.
2.4.3 Extend Lemma 2.4.4 in the following two ways.
1 Show that the K-fold convolution (2.17) of flip’s is a binomial of
size K, as in:
          [0, 1] --∆_K--> [0, 1]^K --flip^K--> N^K --+--> N,    equal to    bn[K] : [0, 1] → N,

where ∆K is a K-fold copy and + returns the sum of a K-tuple of


numbers.
2 Show that binomials are closed under convolution, in the following
sense: for numbers K, L ∈ N,
          [0, 1] --∆--> [0, 1] × [0, 1] --bn[K] ⊗ bn[L]--> N × N --+--> N,    equal to    bn[K +L] : [0, 1] → N.

Hint: Remember Vandermonde’s binary formula (1.29).


2.4.4 The set of natural numbers N has two commutative monoid structures,
one additive with +, 0, and one multiplicative with ·, 1. Accordingly
Proposition 2.4.5 gives two commutative monoid structures on D(N),
namely:

          ω + ρ = D(+)( ω ⊗ ρ )        and        ω ⋆ ρ = D(·)( ω ⊗ ρ ).

      Consider the following three distributions on N.

          ρ_1 = 1/2| 0 ⟩ + 1/3| 1 ⟩ + 1/6| 2 ⟩        ρ_2 = 1/2| 0 ⟩ + 1/2| 1 ⟩        ω = 2/3| 0 ⟩ + 1/3| 1 ⟩.

      Show consecutively:
      1 ρ_1 + ρ_2 = 1/4| 0 ⟩ + 5/12| 1 ⟩ + 1/4| 2 ⟩ + 1/12| 3 ⟩;
      2 ω ⋆ (ρ_1 + ρ_2) = 3/4| 0 ⟩ + 5/36| 1 ⟩ + 1/12| 2 ⟩ + 1/36| 3 ⟩;
      3 ω ⋆ ρ_1 = 5/6| 0 ⟩ + 1/9| 1 ⟩ + 1/18| 2 ⟩;
      4 ω ⋆ ρ_2 = 5/6| 0 ⟩ + 1/6| 1 ⟩;
      5 (ω ⋆ ρ_1) + (ω ⋆ ρ_2) = 25/36| 0 ⟩ + 25/108| 1 ⟩ + 7/108| 2 ⟩ + 1/108| 3 ⟩.


Observe that ? does not distribute over + on D(N). More generally,


conclude that the construction of Proposition 2.4.5 does not extend to
commutative semirings.
2.4.5 Recall that for N ∈ N>0 we write N = {0, 1, . . . , N − 1} for the set
of natural numbers (strictly) below N. It is an additive monoid, via
addition modulo N. As such it is sometimes written as ZN or as Z/NZ.
Prove that:

unif N + unif N = unif N , with + from Proposition 2.4.5.

You may wish to check this equation first for N = 4 or N = 5. It works


for the modular sum, not for the ordinary sum (on N), as one can see
from the sum dice + dice, when dice is considered as distribution on
N, see also Example 4.5. See [148] for more info.
2.4.6 Let M = (M, +, 0) and N = (N, +, 0) be two commutative monoids,
so that D(M) and D(N) are also commutative monoids. Let chan-
nel c : M → D(N) be a homomorphism of monoids. Show that state
transformation c = (−) : D(M) → D(N) is then also a homomor-
phism of monoids.
2.4.7 For sets X, Y with arbitrary elements a ∈ X, b ∈ Y and with distribu-
tions σ ∈ D(X) and τ ∈ D(Y) we define strength maps st 1 : D(X) ×
Y → D(X × Y) and st 2 : X × D(Y) → D(X × Y) via:
          st_1(σ, b) ≔ σ ⊗ unit(b) = Σ_{x∈X} σ(x) | x, b ⟩
                                                                              (2.22)
          st_2(a, τ) ≔ unit(a) ⊗ τ = Σ_{y∈Y} τ(y) | a, y ⟩.

Show that the tensor σ⊗τ can be reconstructed from these two strength
maps via the following diagram:

                        st_1                        st_2
          D(X × D(Y)) <------ D(X) × D(Y) ------> D(D(X) × Y)
               |                    |                   |
            D(st_2)                 ⊗                D(st_1)
               ↓                    ↓                   ↓
          D(D(X × Y)) --flat--> D(X × Y) <--flat-- D(D(X × Y))

2.4.8 Consider a function f : X → M where X is an ordinary set and M is a


monoid. We can add noise to f via a channel c : X → M. The result
is a channel noise( f, c) : X → M given by:

noise( f, c) B  f  + c.

This formula promotes f to the deterministic channel  f  = unit ◦ f


and uses the sum + of distributions from (2.18) in a pointwise manner.


1 Check that we can concretely describe this noise channel as:


              noise( f, c)(x) = Σ_{y∈Y} c(x)(y) | f (x) + y ⟩.

2 Show also that it can be described abstractly via strength (from the
previous exercise) as composite:

              noise( f, c) = ( X --⟨ f, c⟩--> M × D(M) --st_2--> D(M × M) --D(+)--> D(M) ).

2.4.9 Show that the following diagram commutes.

          D(D(X)) ---iid---> D(X)^K
              |                 |
            flat              ⊗[K]
              ↓                 ↓
            D(X) ---iid--->    X^K

       that is,    ⊗[K] =≪ iid (Ω) = iid ( flat(Ω) ).
2.4.10 Show that the big tensor ⊗[K] : D(X)^K → D(X^K) from (2.21) com-
       mutes with unit and flatten, as described by:

           ⊗[K] ◦ unit^K = unit : X^K → D(X^K)
           flat ◦ D(⊗[K]) ◦ ⊗[K] = ⊗[K] ◦ flat^K : D²(X)^K → D(X^K)

Abstractly this shows that the functor (−)K distributes over the monad
D.
2.4.11 Check that Lemma 2.4.2 (4) can be read as: the tensor ⊗ of distribu-
tions, considered as a function ⊗ : D(X) × D(Y) → D(X × Y), is a
natural transformation from the composite functor:

Sets × Sets
D×D / Sets × Sets × / Sets

to the functor

Sets × Sets
× / Sets D / Sets.

2.4.12 Consider the situation described in Proposition 2.4.5, with a commu-


tative monoid M, and induced monoid structure on D(M).
1 Check that 1|a i + 1| bi = 1| a + bi, for all a, b ∈ M. This says that
unit : M → D(M) is a homomorphism of monoids, which can also
be expressed via commutation of the diagram:

          M × M --unit × unit--> D(M) × D(M)
            |                         |
            +                         +
            ↓                         ↓
            M  -------unit------>   D(M)


2 Check also that flat : D(D(M)) → D(M) is a monoid homomor-


phism. This means that for Ω, Θ ∈ D(D(M)) one has:

              flat(Ω) + flat(Θ) = flat(Ω + Θ)        and        flat( 1| 1| 0 ⟩ ⟩ ) = 1| 0 ⟩.

          The sum + on the left-hand side is the one in D(M), from
the beginning of this exercise. The sum + on the right-hand side is
the one in D(D(M)), using that D(M) is a commutative monoid —
and thus D(D(M)) too.
3 Check the functor D : CMon → CMon in (2.19) is also a monad.

2.5 Projecting and copying


For cartesian products × there are two projection functions π1 : X1 × X2 → X1 ,
π2 : X1 × X2 → X2 and a diagonal function ∆ : X → X × X for copying, see
Section 1.1. There are analogues for tensors ⊗ of distributions. But they behave
differently. Since these differences are fundamental in probability theory we
devote a separate section to them.

2.5.1 Marginalisation and entwinedness


In Section 1.1 we have seen projection maps πi : X1 × · · · × Xn → Xi for Carte-
sian products of sets. Since these πi are ordinary functions, they can be turned
into channels πi  = unit ◦ πi : X1 × · · · × Xn → Xi . We will be sloppy and
typically omit these brackets −. The kind of arrow, → or →, or the type of op-
eration at hand, will then indicate whether πi is used as function or as channel.
State transformation πi = (−) = D(πi ) : D X1 × · · · × Xn → D(Xi ) with such

projections is called marginalisation. It is given by summing over all variables,
except the one that we wish to keep: for y ∈ Xi we get, by Definition 2.1.2,

    ( π_i =≪ ω )(y) = Σ_{x_1∈X_1,..., x_{i−1}∈X_{i−1}} Σ_{x_{i+1}∈X_{i+1},..., x_n∈X_n} ω( x_1, . . . , x_{i−1}, y, x_{i+1}, . . . , x_n ).        (2.23)

Below we shall introduce special notation for such marginalisation, but first
we look at some of its properties.
The next result only works in the probabilistic case, for D. The exercises
below will provide counterexamples for P and M.


Lemma 2.5.1. 1 Projection channels take parallel products of probabilistic


states apart, that is, for ωi ∈ D(Xi ) we have:

πi = ω1 ⊗ · · · ⊗ ωn = D(πi ) ω1 ⊗ · · · ⊗ ωn = ωi .
 

Thus, marginalisation of parallel product yields its components.


2 Similarly, projection channels commute with parallel products of probabilis-
tic channels, in the following manner:

          X_1 × · · · × X_n --c_1 ⊗ · · · ⊗ c_n--> Y_1 × · · · × Y_n
                 |                                       |
                π_i                                     π_i
                 ↓                                       ↓
                X_i  ---------------c_i-------------->  Y_i

Proof. We only do item (1), since item (2) then follows easily, using that
parallel products of channels are defined pointwise, see Definition 2.4.1 (2).
The first equation in item (1) follows from Lemma 1.8.3 (4), which yields
πi  = (−) = D(πi )(−). We restrict to the special case where n = 2 and i = 1.
Then:
    D(π_1)( ω_1 ⊗ ω_2 )(x) = Σ_y ( ω_1 ⊗ ω_2 )(x, y)                          by (2.23)
                           = Σ_y ω_1(x) · ω_2(y)
                           = ω_1(x) · ( Σ_y ω_2(y) ) = ω_1(x) · 1 = ω_1(x).
The last line of this proof relies on the fact that probabilistic states (distri-
butions) involve a convex sum, with multiplicities adding up to one. This does
not work for subsets or multisets, see Exercise 2.5.1 below.
We introduce special, post-fix notation for marginalisation via ‘masks’. It
corresponds to the idea of listing only the relevant variables in traditional no-
tation, where a distribution on a product set is often written as ω(x, y) and its
first marginal as ω(x).

Definition 2.5.2. A mask M is a finite list of 0’s and 1’s, that is an element
M ∈ L({0, 1}). For a state ω of type T ∈ {P, M, D} on X1 × · · · × Xn and a mask
M of length n we write:
    ω[ M ]

for the marginal with mask M. Informally, it keeps all the parts from ω at a
position where there is 1 in M and it projects away parts where there is 0. This
is best illustrated via an example:

    ω[ 1, 0, 1, 0, 1 ] = T(⟨π_1, π_3, π_5⟩)(ω) ∈ T(X_1 × X_3 × X_5)
                       = ⟨π_1, π_3, π_5⟩ =≪ ω.

More generally, for a channel c : Y → X1 × · · · × Xn and a mask M of length n


we use pointwise marginalisation via the same postcomposition:
    c[ M ]  is the channel  y ⟼ c(y)[ M ].
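Marginalisation via a mask is a short fold over a joint distribution on tuples;
the sketch below (ours) keeps the coordinates where the mask holds a 1 and, for
convenience, unpacks the tuple when a single coordinate remains:

    def marginal(omega, mask):
        # marginalise a joint distribution on tuples according to a 0/1 mask
        out = {}
        for xs, p in omega.items():
            kept = tuple(x for x, m in zip(xs, mask) if m)
            key = kept[0] if len(kept) == 1 else kept
            out[key] = out.get(key, 0.0) + p
        return out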
With Cartesian products one can take an arbitrary tuple t ∈ X × Y apart
into π1 (t) ∈ X, π2 (t) ∈ Y. By re-assembling these parts the original tuple is
recovered: hπ1 (t), π2 (t)i = t. This does not work for collections: a joint state
is typically not the parallel product of its marginals, see Example 2.5.4 below.
We introduce a special name for this.
Definition 2.5.3. A joint state ω on X × Y is called non-entwined if it is the
parallel product of its marginals:
    ω = ω[ 1, 0 ] ⊗ ω[ 0, 1 ].
   

Otherwise it is called entwined. This notion of entwinedness may be formu-


lated with respect to n-ary states, via a mask, see for example Exercise 2.5.2,
but it may then need some re-arrangement of components.
Lemma 2.5.1 (1) shows that a probabilistic product state of the form ω1 ⊗ ω2
is non-entwined. But in general, joint states are entwined, so that the different
parts are correlated and can influence each other. This is a mechanism that
will play an important role in the sequel. A joint distribution is more than the
product of its parts, and its different parts can influence each other.
Non-entwined states are called separable in [25]. Sometimes they are also
called independent, although independence is also used for random variables.
We like to have separate terminology for states only, so we use the phrase
(non-)entwinedness, which is a new expression. Independence for random vari-
ables is described in Section 5.4.
Example 2.5.4. Take X = {u, v} and A = {a, b} and consider the state ω ∈
D(X × A) given by:
    ω = 1/8| u, a ⟩ + 1/4| u, b ⟩ + 3/8| v, a ⟩ + 1/4| v, b ⟩.
We claim that ω is entwined. Indeed, ω has first and second marginals ω[ 1, 0 ] =
D(π_1)(ω) ∈ D(X) and ω[ 0, 1 ] = D(π_2)(ω) ∈ D(A), namely:

    ω[ 1, 0 ] = 3/8| u ⟩ + 5/8| v ⟩        and        ω[ 0, 1 ] = 1/2| a ⟩ + 1/2| b ⟩.
The original state ω differs from the product of its marginals:
    ω[ 1, 0 ] ⊗ ω[ 0, 1 ] = 3/16| u, a ⟩ + 3/16| u, b ⟩ + 5/16| v, a ⟩ + 5/16| v, b ⟩.
This entwinedness follows from a general characterisation, see Exercise 2.5.9
below.
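With the marginal and tensor sketches from above, the entwinedness of this ω is
a two-line check:

    omega = {('u','a'): 1/8, ('u','b'): 1/4, ('v','a'): 3/8, ('v','b'): 1/4}
    m1 = marginal(omega, [1, 0])    # {'u': 0.375, 'v': 0.625}
    m2 = marginal(omega, [0, 1])    # {'a': 0.5, 'b': 0.5}
    print(tensor(m1, m2))           # differs from omega, so omega is entwined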


2.5.2 Copying
For an arbitrary set X there is a K-ary copy function ∆[K] : X → X K = X ×
· · · × X (K times), given as:

∆[K](x) B hx, . . . , xi (K times x).

We often omit the subscript K, when it is clear from the context, especially
when K = 2. This copy function can be turned into a copy channel ∆[K] =
unit ◦ ∆K : X → X K , but, recall, we often omit writing − for simplicity.
These ∆[K] are alternatively called copiers or diagonals.
As functions one has πi ◦ ∆[K] = id , and thus as channels πi ◦· ∆[K] = unit.

Fact 2.5.5. State transformation with a copier ∆ = ω differs from the tensor
product ω ⊗ ω. In general, for K ≥ 2,
    ∆[K] =≪ ω  ≠  ω ⊗ · · · ⊗ ω  (K times)  =  iid [K](ω)        by (2.20).

The following simple example illustrates this fact. For clarity we now write
the brackets − explicitly. First,
 
    ∆ =≪ ( 1/3| 0 ⟩ + 2/3| 1 ⟩ ) = 1/3| ∆(0) ⟩ + 2/3| ∆(1) ⟩ = 1/3| 0, 0 ⟩ + 2/3| 1, 1 ⟩.        (2.24)

In contrast:
    ( 1/3| 0 ⟩ + 2/3| 1 ⟩ ) ⊗ ( 1/3| 0 ⟩ + 2/3| 1 ⟩ )
        = 1/9| 0, 0 ⟩ + 2/9| 0, 1 ⟩ + 2/9| 1, 0 ⟩ + 4/9| 1, 1 ⟩.                                 (2.25)
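The contrast between copying and tensoring, as in (2.24) and (2.25), can be seen
at once with the earlier sketches (unit, transform and tensor):

    omega = {0: 1/3, 1: 2/3}
    copy = lambda x: unit((x, x))        # deterministic copy channel
    print(transform(copy, omega))        # {(0,0): 1/3, (1,1): 2/3}
    print(tensor(omega, omega))          # {(0,0): 1/9, (0,1): 2/9, (1,0): 2/9, (1,1): 4/9}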
One might expect a commuting diagram like in Lemma 2.5.1 (2) for copiers,
but that does not work: diagonals do not commute with arbitrary channels,
see Exercise 2.5.10 below for a counterexample. Copiers do commute with
deterministic channels of the form  f  = unit ◦ f , as in:

∆ ◦·  f  = ( f  ⊗  f ) ◦· ∆ because ∆ ◦ f = ( f × f ) ◦ ∆.

In fact, this commutation with diagonals may be used as definition for a chan-
nel to be deterministic, see Exercise 2.5.12.

2.5.3 Joint states and channels


For an ordinary function f : X → Y one can form the graph of f as the relation
gr( f ) ⊆ X × Y given by gr( f ) = {(x, y) | f (x) = y}. There is a similar thing that
one can do for probabilistic channels, given a state. But please keep in mind
that this kind of ‘graph’ has nothing to do with the graphical models that we
will be using later, like Bayesian networks or string diagrams.


Definition 2.5.6. 1 For two channels c : X → Y and d : X → Z with the same


domain we can define a tuple channel:

hc, di B (c ⊗ d) ◦· ∆ : X → Y × Z.

2 In particular, for a state σ on Y and a channel d : Y → Z we define the graph


as the joint state on Y × Z defined by:
 
    ⟨id , d⟩ =≪ σ        so that        ( ⟨id , d⟩ =≪ σ )(y, z) = σ(y) · d(y)(z).

We are overloading the tuple notation hc, di. Above we use it for the tuple of
channels, so that hc, di has type X → D(Y × Z). Interpreting hc, di as a tuple of
functions, like in Subsection 1.1.1, would give a type X → D(Y) × D(Z). For
channels we standardly use the channel-interpretation for the tuple.
In the literature a probabilistic channel is often called a discriminative model.
Let’s consider a channel whose codomain is a finite set, say (isomorphic to) the
set n = {0, 1, . . . , n − 1}, whose elements i ∈ n can be seen as labels or classes.
Hence we can write the channel as c : X → n. It can then be understood as a
classifier: it produces for each element x ∈ X a distribution c(x) on n, giving
the likelihood c(x)(i) that x belongs to class i ∈ n. The class i with the highest
likelihood, obtained via argmax, may simply be used as x’s class.
The term generative model is used for a pair of a state σ ∈ D(Y) and a chan-
nel g : Y → Z, giving rise to a joint state hid , gi = σ ∈ D(X × Y). This shows
how the joint state can be generated. Later on we shall see that in principle
each joint state can be described in such generative form, via marginalisation
and disintegration, see Section 7.6.

Exercises
2.5.1 1 Let U ∈ P(X) and V ∈ P(Y). Show that the equation

(U ⊗ V) 1, 0 = U
 

does not hold in general, in particular not when V = ∅ (and U , ∅).


2 Similarly, check that for arbitrary multisets ϕ ∈ M(X) and ψ ∈
M(Y) one does not have:

(ϕ ⊗ ψ) 0, 1 = ϕ.
 

It fails when ψ is but ϕ is not the empty multiset 0. But it also fails
in non-zero cases, e.g. for the multisets ϕ = 2| ai + 4| b i + 1| ci and
ψ = 3| ui + 2| v i. Check this by doing the required computations.


2.5.2 Check that the distribution ω ∈ D( {a, b} × {a, b} × {a, b} ) given by:

          ω = 1/24| aaa ⟩ + 1/12| aab ⟩ + 1/12| aba ⟩ + 1/6| abb ⟩
            + 1/6| baa ⟩ + 1/3| bab ⟩ + 1/24| bba ⟩ + 1/12| bbb ⟩

      satisfies:

          ω = ω[ 1, 1, 0 ] ⊗ ω[ 0, 0, 1 ].

2.5.3   Check that for finite sets X, Y, the uniform distribution on X × Y is
        non-entwined — see also Exercise 2.1.1.
2.5.4   Find different joint states σ, τ with equal marginals:

           σ[1, 0] = τ[1, 0]     and     σ[0, 1] = τ[0, 1]     but     σ ≠ τ.

        Hint: Use Example 2.5.4.
2.5.5   Check that:

           ω[0, 1, 1, 0, 1, 1][0, 1, 1, 0] = ω[0, 0, 1, 0, 1, 0].

        What is the general result behind this?
2.5.6   Show that a (binary) joint distribution is non-entwined when its first
        marginal is a unit (Dirac) distribution.
        Formulated more explicitly, if ω ∈ D(X × Y) satisfies ω[1, 0] = 1| x i,
        then ω = 1| x i ⊗ ω[0, 1]. (This property shows that D is a ‘strongly
        affine monad’, see [69] for details.)
2.5.7   Prove that iid interacts with concatenation ++ : X^K × X^L → X^(K+L)
        as follows, for K, L ∈ N: as channels D(X) → X^(K+L),

           «++» ◦· (iid [K] ⊗ iid [L]) ◦· ∆  =  iid [K +L].

2.5.8   Prove that for a channel c with suitably typed codomain,

           c[1, 0, 1, 0, 1] = hπ1 , π3 , π5 i ◦· c.

2.5.9   Let X = {u, v} and A = {a, b} as in Example 2.5.4. Prove that a state
        ω = r1 | u, a i + r2 | u, b i + r3 | v, a i + r4 | v, b i ∈ D(X × A), where
        r1 + r2 + r3 + r4 = 1, is non-entwined if and only if r1 · r4 = r2 · r3 .
2.5.10  Consider the probabilistic channel f : X → Y from Example 2.2.1
        and show that on the one hand ∆ ◦· f : X → Y × Y is given by:

           a 7−→ 1/2 | u, u i + 1/2 | v, v i
           b 7−→ 1| u, u i
           c 7−→ 3/4 | u, u i + 1/4 | v, v i.

        On the other hand, ( f ⊗ f ) ◦· ∆ : X → Y × Y is described by:

           a 7−→ 1/4 | u, u i + 1/4 | u, v i + 1/4 | v, u i + 1/4 | v, v i
           b 7−→ 1| u, u i
           c 7−→ 9/16 | u, u i + 3/16 | u, v i + 3/16 | v, u i + 1/16 | v, v i.
2.5.11  Check that for ordinary functions f : X → Y and g : X → Z one can
        relate channel tupling and ordinary tupling via the operation «−» as:

           h« f », «g»i = «h f, gi»     i.e.     hunit ◦ f, unit ◦ gi = unit ◦ h f, gi.

        Conclude that the copy channel ∆ is hunit, uniti and show that it does
        commute with deterministic channels:

           ∆ ◦· « f » = h« f », « f »i ◦· ∆.

2.5.12  Deterministic channels are in fact the only ones that commute with
        copy (see for an abstract setting [18]). Let c : X → Y commute with
        diagonals, in the sense that ∆ ◦· c = (c ⊗ c) ◦· ∆. Prove that c is determi-
        nistic, i.e. of the form c = « f » = unit ◦ f for a function f : X → Y.
2.5.13  Check that the fact that in general ∆ = ω ≠ ω ⊗ ω, as described
        in Subsection 2.5.2, can also be expressed as: the diagram below does
        not commute, i.e. the composite ⊗ ◦ ∆ : D(X) → D(X) × D(X) → D(X × X)
        differs from D(∆) : D(X) → D(X × X).
2.5.14  Show that the tuple operation h−, −i for channels, from Definition 2.5.6,
        satisfies both:

           πi ◦· hc1 , c2 i = ci     and     hπ1 , π2 i = unit

        but not, essentially as in Exercise 2.5.10:

           hc1 , c2 i ◦· d = hc1 ◦· d, c2 ◦· di.

2.5.15  Consider the joint state Flrn(τ) from Example 2.3.1 (2).
        1 Take σ = 7/10 | H i + 3/10 | L i ∈ D({H, L}) and c : {H, L} → {0, 1, 2}
          defined by:

             c(H) = 1/7 | 0 i + 1/2 | 1 i + 5/14 | 2 i       c(L) = 1/6 | 0 i + 1/3 | 1 i + 1/2 | 2 i.

          Check that gr(σ, c) = Flrn(τ).
2.5.16  Consider a state σ ∈ D(X) and an ‘endo’ channel c : X → X.
        1 Check that the following two statements are equivalent.
          • hid , ci = σ = hc, id i = σ in D(X × X);
          • σ(x) · c(x)(y) = σ(y) · c(y)(x) for all x, y ∈ X.
        2 Show that the conditions in the previous item imply that σ is a fixed
          point for state transformation, that is:

             c = σ = σ.

          Such a σ is also called a stationary state.
2.5.17  Consider the bi-channel bell : A × 2 → B × 2 from Exercise 2.2.11 (2).
        Show that both bell[1, 0] and bell† [1, 0] are channels with uniform
        states only.
2.5.18  Prove that marginalisation (−)[1, 0] : M(X × Y) → M(X) is linear —
        and preserves convex sums when restricted to distributions.

2.6 A Bayesian network example


This section uses the previous first descriptions of probabilistic channels to
give semantics for a Bayesian network example. At this stage we just describe
an example, without giving the general approach, but we hope that it clarifies
what is going on and illustrates to the reader the relevance of (probabilistic)
channels and their operations. A general description of what a Bayesian net-
work is appears much later, in Definition 7.1.2.
The example that we use is a standard one, copied from the literature, namely
from [33]. Bayesian networks were introduced in [133], see also [109, 8, 13,
14, 91, 105]. They form a popular technique for displaying probabilistic con-
nections and for efficiently presenting joint states, without state explosion.
Consider the diagram/network in Figure 2.3. It is meant to capture proba-
bilistic dependencies between several wetness phenomena in the oval boxes.
For instance, in winter it is more likely to rain (than when it’s not winter), and
also in winter it is less likely that a sprinkler is on. Still the grass may be wet
by a combination of these occurrences. Whether a road is slippery depends on
rain, not on sprinklers.
The letters A, B, C, D, E in this diagram are written exactly as in [33]. Here
they are not used as sets of Booleans, with inhabitants true and false, but in-
stead we use these sets with elements:

A = {a, a⊥ } B = {b, b⊥ } C = {c, c⊥ } D = {d, d⊥ } E = {e, e⊥ }.


The notation a⊥ is read as ‘not a’. In this way the name of an element suggests
to which set the element belongs.


 
   [Figure 2.3 here: the wetness Bayesian network, with nodes winter (A),
   sprinkler (B), rain (C), wet grass (D) and slippery road (E); there are edges
   from winter to sprinkler and to rain, from sprinkler and rain to wet grass,
   and from rain to slippery road.]

Figure 2.3 The wetness Bayesian network (from [33, Chap. 6]), with only the
nodes and edges between them; the conditional probability tables associated with
the nodes are given separately in the text.

The diagram in Figure 2.3 becomes a Bayesian network when we provide
it with conditional probability tables. For the lower three nodes they look as
follows.

   winter:              sprinkler:                    rain:
      a     a⊥              A  |  b     b⊥               A  |  c      c⊥
     3/5   2/5              a  | 1/5   4/5               a  | 4/5    1/5
                            a⊥ | 3/4   1/4               a⊥ | 1/10   9/10

And for the upper two nodes we have:

   wet grass:                          slippery road:
      B    C   |   d       d⊥              C  |  e      e⊥
      b    c   | 19/20    1/20             c  | 7/10   3/10
      b    c⊥  |  9/10    1/10             c⊥ |  0      1
      b⊥   c   |  4/5     1/5
      b⊥   c⊥  |   0       1
This Bayesian network is thus given by nodes, each with a conditional proba-
bility table, describing likelihoods in terms of previous ‘ancestor’ nodes in the
network (if any).
How to interpret all this data? How to make it mathematically precise? It is
not hard to see that the first ‘winter’ table describes a probability distribution
on the set A, which, in the notation of this book, is given by:

   wi = 3/5 | a i + 2/5 | a⊥ i.

Thus we are assuming with a probability of 60% that we are in a winter situation.
This is often called the prior distribution, or also the initial state.
Notice that the ‘sprinkler’ table contains two distributions on B, one for the


element a ∈ A and one for a⊥ ∈ A. Here we recognise a channel, namely a
channel A → B. This is a crucial insight! We abbreviate this channel as sp, and
define it explicitly as:

   sp : A → B      with      sp(a) = 1/5 | b i + 4/5 | b⊥ i
                              sp(a⊥ ) = 3/4 | b i + 1/4 | b⊥ i.

We read them as: if it’s winter, there is a 20% chance that the sprinkler is on,
but if it’s not winter, there is a 75% chance that the sprinkler is on.
Similarly, the ‘rain’ table corresponds to a channel:

   ra : A → C      with      ra(a) = 4/5 | c i + 1/5 | c⊥ i
                              ra(a⊥ ) = 1/10 | c i + 9/10 | c⊥ i.

Before continuing we can see that the formalisation (partial, so far) of the
wetness Bayesian network in Figure 2.3 in terms of states and channels already
allows us to do something meaningful, namely state transformation =.
Indeed, we can form distributions:

   sp = wi   on B        and        ra = wi   on C.

These (transformed) distributions capture the derived, predicted probabilities
that the sprinkler is on, and that it rains. Using the definition of state transfor-
mation, see (2.9), we get:

   (sp = wi)(b)  = Σx sp(x)(b) · wi(x)
                 = sp(a)(b) · wi(a) + sp(a⊥ )(b) · wi(a⊥ )
                 = 1/5 · 3/5 + 3/4 · 2/5 = 21/50
   (sp = wi)(b⊥ ) = Σx sp(x)(b⊥ ) · wi(x)
                 = sp(a)(b⊥ ) · wi(a) + sp(a⊥ )(b⊥ ) · wi(a⊥ )
                 = 4/5 · 3/5 + 1/4 · 2/5 = 29/50.

Thus the overall distribution for the sprinkler (being on or not) is:

   sp = wi = 21/50 | b i + 29/50 | b⊥ i.

In a similar way one can compute the probability distribution for rain as:

   ra = wi = 13/25 | c i + 12/25 | c⊥ i.

Such distributions for non-initial nodes of a Bayesian network are called pre-
dictions. As will be shown here, they can be obtained via forward state trans-
formation, following the structure of the network.
But first we still have to translate the upper two nodes of the network from
Figure 2.3 into channels. In the conditional probability table for the ‘wet grass’


node we see 4 distributions on the set D, one for each combination of elements
from the sets B and C. The table thus corresponds to a channel:

   wg : B × C → D      with      wg(b, c) = 19/20 | d i + 1/20 | d⊥ i
                                  wg(b, c⊥ ) = 9/10 | d i + 1/10 | d⊥ i
                                  wg(b⊥ , c) = 4/5 | d i + 1/5 | d⊥ i
                                  wg(b⊥ , c⊥ ) = 1| d⊥ i.

Finally, the table for the ‘slippery road’ node gives:

   sr : C → E      with      sr(c) = 7/10 | e i + 3/10 | e⊥ i
                              sr(c⊥ ) = 1| e⊥ i.

We illustrate how to obtain predictions for ‘rain’ and for ‘slippery road’. We
start from the latter. Looking at the network in Figure 2.3 we see that there are
two arrows between the initial node ‘winter’ and our node of interest ‘slippery
road’. This means that we have to do two successive state transformations,
giving:

   (sr ◦· ra) = wi  =  sr = (ra = wi)
                    =  sr = ( 13/25 | c i + 12/25 | c⊥ i )  =  91/250 | e i + 159/250 | e⊥ i.

The first equation follows from Lemma 1.8.3 (3). The second one involves
elementary calculations, where we can use the distribution ra = wi that we
calculated earlier.
Getting the predicted wet grass probability requires some care. Inspection of
the network in Figure 2.3 is of some help, but leads to some ambiguity — see
below. One might be tempted to form the parallel product ⊗ of the predicted
distributions for sprinkler and rain, and do state transformation on this product
along the wet grass channel wg, as in:

   wg = ( (sp = wi) ⊗ (ra = wi) ).

But this is wrong, since the winter probabilities are now not used consistently,
see the different outcomes in the calculations (2.24) and (2.25). The correct
way to obtain the wet grass prediction involves copying the winter state, via
the copy channel ∆, see:

   (wg ◦· (sp ⊗ ra) ◦· ∆) = wi  =  wg = ( (sp ⊗ ra) = (∆ = wi) )
                                =  wg = ( hsp, rai = wi )
                                =  1399/2000 | d i + 601/2000 | d⊥ i.

Such calculations are laborious, but essentially straightforward. We shall do
one in detail, just to see how it works. Especially, it becomes clear that all


summations are automatically done at the right place. We proceed in two steps,
where for each step we only elaborate the first case.

   (hsp, rai = wi)(b, c)  = Σx sp(x)(b) · ra(x)(c) · wi(x)
                          = sp(a)(b) · ra(a)(c) · 3/5 + sp(a⊥ )(b) · ra(a⊥ )(c) · 2/5
                          = 1/5 · 4/5 · 3/5 + 3/4 · 1/10 · 2/5 = 63/500
   (hsp, rai = wi)(b, c⊥ ) = · · · = 147/500
   (hsp, rai = wi)(b⊥ , c) = 197/500
   (hsp, rai = wi)(b⊥ , c⊥ ) = 93/500.

We conclude that:

   (sp ⊗ ra) = (∆ = wi)  =  hsp, rai = wi
       = 63/500 | b, c i + 147/500 | b, c⊥ i + 197/500 | b⊥ , c i + 93/500 | b⊥ , c⊥ i.

This distribution is used in the next step:

   ( wg = ((sp ⊗ ra) = (∆ = wi)) )(d)
       = Σx,y wg(x, y)(d) · ( (sp ⊗ ra) = (∆ = wi) )(x, y)
       = wg(b, c)(d) · 63/500 + wg(b, c⊥ )(d) · 147/500
           + wg(b⊥ , c)(d) · 197/500 + wg(b⊥ , c⊥ )(d) · 93/500
       = 19/20 · 63/500 + 9/10 · 147/500 + 4/5 · 197/500 + 0 · 93/500
       = 1399/2000
   ( wg = ((sp ⊗ ra) = (∆ = wi)) )(d⊥ )  =  601/2000.

We have thus shown that:

   wg = ( (sp ⊗ ra) = (∆ = wi) )  =  1399/2000 | d i + 601/2000 | d⊥ i.
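These channel computations are mechanical enough to hand over to a machine.
The following Python sketch (our own illustration, not part of the formal
development — distributions are dictionaries, channels are functions returning
dictionaries, and the names wi, sp, ra, wg, sr mirror the channels above)
reproduces the predictions just computed:

    def push(c, state):
        # State transformation of 'state' along the channel c.
        out = {}
        for x, p in state.items():
            for y, q in c(x).items():
                out[y] = out.get(y, 0) + p * q
        return out

    wi = {'a': 3/5, 'a_': 2/5}
    sp = lambda a: {'b': 1/5, 'b_': 4/5} if a == 'a' else {'b': 3/4, 'b_': 1/4}
    ra = lambda a: {'c': 4/5, 'c_': 1/5} if a == 'a' else {'c': 1/10, 'c_': 9/10}
    wg = lambda bc: {('b', 'c'):  {'d': 19/20, 'd_': 1/20},
                     ('b', 'c_'): {'d': 9/10,  'd_': 1/10},
                     ('b_', 'c'): {'d': 4/5,   'd_': 1/5},
                     ('b_', 'c_'): {'d': 0.0,  'd_': 1.0}}[bc]
    sr = lambda c: {'e': 7/10, 'e_': 3/10} if c == 'c' else {'e': 0.0, 'e_': 1.0}

    print(push(sp, wi))                       # sprinkler prediction: 21/50, 29/50
    print(push(sr, push(ra, wi)))             # slippery road: 91/250, 159/250
    tuple_sp_ra = lambda a: {(b, c): p * q
                             for b, p in sp(a).items() for c, q in ra(a).items()}
    print(push(wg, push(tuple_sp_ra, wi)))    # wet grass: 1399/2000, 601/2000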

2.6.1 Redrawing Bayesian networks


We have illustrated how prediction computations for Bayesian networks can
be done, basically by following the graph structure and translating it into suit-
able sequential and parallel compositions (◦· and ⊗) of channels. The match
between the graph and the computation is not perfect, and requires some care,
especially wrt. copying. Since we have a solid semantics, we like to use it in
order to improve the network drawing and achieve a better match between the
underlying mathematical operations and the graphical representation. There-
fore we prefer to draw Bayesian networks slightly differently, making some
minor changes:


   [Figure 2.4 here: the wetness network redrawn with explicit copy nodes
   and with the types A, B, C, D, E attached to the wires between the nodes;
   the final nodes ‘wet grass’ and ‘slippery road’ have outgoing wires of
   type D and E.]

Figure 2.4 The wetness Bayesian network from Figure 2.3 redrawn in a manner
that better reflects the underlying channel-based semantics, via explicit copiers
and typed wires. This anticipates the string-diagrammatic approach of Chapter 7.

• copying is written explicitly, for instance as a copy node • in the binary case;
  in general one can have n-ary copying, for n ≥ 2;
• the relevant sets/types — like A, B, C, D, E — are not included in the nodes,
  but are associated with the arrows (wires) between the nodes;
• final nodes have outgoing arrows, labeled with their type.
The original Bayesian network in Figure 2.3 is then changed according to these
points in Figure 2.4. In this way the nodes (with their conditional probability
tables) are clearly recognisable as channels, of type A1 × · · · × An → B, where
A1 , . . . , An are the types on the incoming wires, and B is the type of the out-
going wire. Initial nodes have no incoming wires, which formally leads to a
channel 1 → B, where 1 is the empty product. As we have seen, such channels
1 → B correspond to distributions/states on B. In the adapted diagram one
easily forms sequential and parallel compositions of channels, see the exercise
below.

Exercises
2.6.1   In [33, §6.2] the (predicted) joint distribution on D × E that arises
        from the Bayesian network example in this section is represented as a
        table. It translates into:

           30,443/100,000 | d, e i + 39,507/100,000 | d, e⊥ i
               + 5,957/100,000 | d⊥ , e i + 24,093/100,000 | d⊥ , e⊥ i.

        Following the structure of the diagram in Figure 2.4, it is obtained in
        the present setting as:

           ( (wg ⊗ sr) ◦· (id ⊗ ∆) ◦· (sp ⊗ ra) ◦· ∆ ) = wi
               = (wg ⊗ sr) = ( (id ⊗ ∆) = (hsp, rai = wi) ).
Perform the calculations and check that this expression equals the
above distribution.
(Readers may wish to compare the different calculation methods,
using sequential and parallel composition of channels — as here —
or using multiplications of tables — as in [33].)
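For those who prefer to let a machine do the bookkeeping, the joint distribution
can also be computed with a short Python sketch (our own encoding, with the
same dictionary representation of states and channels as before; it simply sums
over all paths through the network):

    wi = {'a': 3/5, 'a_': 2/5}
    sp = lambda a: {'b': 1/5, 'b_': 4/5} if a == 'a' else {'b': 3/4, 'b_': 1/4}
    ra = lambda a: {'c': 4/5, 'c_': 1/5} if a == 'a' else {'c': 1/10, 'c_': 9/10}
    wg = lambda b, c: {('b', 'c'):  {'d': 19/20, 'd_': 1/20},
                       ('b', 'c_'): {'d': 9/10,  'd_': 1/10},
                       ('b_', 'c'): {'d': 4/5,   'd_': 1/5},
                       ('b_', 'c_'): {'d': 0.0,  'd_': 1.0}}[(b, c)]
    sr = lambda c: {'e': 7/10, 'e_': 3/10} if c == 'c' else {'e': 0.0, 'e_': 1.0}

    # (wg (x) sr) after (id (x) copy) after <sp, ra>, applied to wi, all in one go.
    joint = {}
    for a, pa in wi.items():
        for b, pb in sp(a).items():
            for c, pc in ra(a).items():
                for d, pd in wg(b, c).items():
                    for e, pe in sr(c).items():
                        joint[(d, e)] = joint.get((d, e), 0) + pa * pb * pc * pd * pe
    print(joint)   # {('d','e'): 0.30443, ('d','e_'): 0.39507,
                   #  ('d_','e'): 0.05957, ('d_','e_'): 0.24093}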

2.7 Divergence between distributions


In many situations it is useful to know how unequal / different / apart prob-
ability distributions are. This can be used for instance in learning, where one
can try to bring a distribution closer to target via iterative adaptations. Such
comparison of distributions can be defined via a metric / distance function.
Here we shall describe a different comparison, called divergence, or more fully
Kullback-Leibler divergence, written as DKL . It is not a distance function
since it is not symmetric: DKL (ω, ρ) ≠ DKL (ρ, ω), in general. But it does sat-
isfy DKL (ω, ρ) = 0 iff ω = ρ. In Section 4.5 we describe ‘total variation’ as a
proper distance function between states.
In the sequel we shall make frequent use of this divergence. This section
collects the definition and some basic facts. It assumes rudimentary familiarity
with logarithms.
Definition 2.7.1. Let ω, ρ be two distributions/states on the same set X with
supp(ω) ⊆ supp(ρ). The Kullback-Leibler divergence, or KL-divergence, or
simply divergence, of ω from ρ is:

   DKL (ω, ρ) ≔ Σx∈X  ω(x) · log( ω(x) / ρ(x) ).

The convention is that r · log(r) = 0 when r = 0.
Sometimes the natural logarithm ln is used instead of log = log2 .
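A direct transcription into Python may be helpful (our own sketch; log base 2,
with the r · log(r) = 0 convention built in by skipping zero-probability points):

    from math import log2

    def kl(omega, rho):
        # Kullback-Leibler divergence D_KL(omega, rho) for distributions as dictionaries;
        # it is assumed that supp(omega) is contained in supp(rho).
        return sum(p * log2(p / rho[x]) for x, p in omega.items() if p > 0)

    omega = {'a': 1/4, 'b': 3/4}
    rho   = {'a': 1/2, 'b': 1/2}
    print(kl(omega, rho), kl(rho, omega))   # about 0.189 and 0.208: not symmetric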
The inclusion supp(ω) ⊆ supp(ρ) is equivalent to: ρ(x) = 0 implies ω(x) =
0. This requirement immediately implies that divergence is not symmetric. But
even when ω and ρ do have the same support, the divergences DKL (ω, ρ) and
DKL (ρ, ω) are different, in general, see Exercise 2.7.1 below for an easy illus-
tration.


Whenever we write an expression DKL (ω, ρ) we will implicitly assume an


inclusion supp(ω) ⊆ supp(ρ).
We start with some easy properties of divergence.

Lemma 2.7.2. Let ω, ρ ∈ D(X) and ω′ , ρ′ ∈ D(Y) be distributions.

1  Zero-divergence is the same as equality:

      DKL (ω, ρ) = 0 ⇐⇒ ω = ρ.

2  Divergence of products is a sum of divergences:

      DKL( ω ⊗ ω′ , ρ ⊗ ρ′ ) = DKL( ω, ρ ) + DKL( ω′ , ρ′ ).

Proof. 1 The direction (⇐) is easy. For (⇒), let 0 = DKL (ω, ρ) = Σx ω(x) ·
log( ω(x)/ρ(x) ). This means that if ω(x) ≠ 0, one has log( ω(x)/ρ(x) ) = 0, and thus
ω(x)/ρ(x) = 1 and ω(x) = ρ(x). In particular:

      1 = Σx∈supp(ω) ω(x) = Σx∈supp(ω) ρ(x).        (∗)

By assumption we have supp(ω) ⊆ supp(ρ). Write supp(ρ) as disjoint union
supp(ω) ∪ U for some U ⊆ supp(ρ). It suffices to show U = ∅. We have:

      1 = Σx∈supp(ρ) ρ(x) = Σx∈supp(ω) ρ(x) + Σx∈U ρ(x) =(∗) 1 + Σx∈U ρ(x).

Hence U = ∅.
2 By unwrapping the relevant definitions:

   DKL( ω ⊗ ω′ , ρ ⊗ ρ′ )
      = Σx,y (ω ⊗ ω′)(x, y) · log( (ω ⊗ ω′)(x, y) / (ρ ⊗ ρ′)(x, y) )
      = Σx,y ω(x) · ω′(y) · log( (ω(x)/ρ(x)) · (ω′(y)/ρ′(y)) )
      = Σx,y ω(x) · ω′(y) · ( log( ω(x)/ρ(x) ) + log( ω′(y)/ρ′(y) ) )
      = Σx,y ω(x) · ω′(y) · log( ω(x)/ρ(x) ) + Σx,y ω(x) · ω′(y) · log( ω′(y)/ρ′(y) )
      = Σx ω(x) · log( ω(x)/ρ(x) ) + Σy ω′(y) · log( ω′(y)/ρ′(y) )
      = DKL( ω, ρ ) + DKL( ω′ , ρ′ ).


In order to prove further properties about divergence we need a powerful


classical result called Jensen’s inequality about functions acting on convex
combinations of non-negative reals. We shall use Jensen’s inequality below,
but also later on in learning.
Lemma 2.7.3 (Jensen’s inequality). Let f : R>0 → R be a function that satis-
fies f ′′ < 0. Then for all a1 , . . . , an ∈ R>0 and r1 , . . . , rn ∈ [0, 1] with Σi ri = 1
there is an inequality:

   f( Σi ri · ai )  ≥  Σi ri · f (ai ).        (2.26)

The inequality is strict, except in trivial cases.
The inequality holds in particular for logarithms, when f = log, or f = ln.
The proof is standard but is included, for convenience.

Proof. We shall provide a proof for n = 2. The inequality is easily extended
to n > 2, by induction. So let a, b ∈ R>0 be given, with r ∈ [0, 1]. We need to
prove f (ra + (1 − r)b) ≥ r f (a) + (1 − r) f (b). The result is trivial if a = b or
r = 0 or r = 1. So let, without loss of generality, a < b and r ∈ (0, 1). Write
c ≔ ra + (1 − r)b = b − r(b − a), so that a < c < b. By the mean value theorem
we can find a < u < c and c < v < b with:

   ( f (c) − f (a)) / (c − a) = f ′(u)      and      ( f (b) − f (c)) / (b − c) = f ′(v).

Since f ′′ < 0 we have that f ′ is strictly decreasing, so f ′(u) > f ′(v) because
u < v. We can write:

   c − a = (r − 1)a + (1 − r)b = (1 − r)(b − a)      and      b − c = r(b − a).

From f ′(u) > f ′(v) we deduce inequalities:

   ( f (c) − f (a)) / ((1 − r)(b − a))  >  ( f (b) − f (c)) / (r(b − a)),
       i.e.      r( f (c) − f (a)) > (1 − r)( f (b) − f (c)).

By reorganising the latter inequality we get f (c) > r f (a) + (1 − r) f (b), as
required.
We can now say a bit more about divergence. For instance, that it is non-
negative, as one expects.
Proposition 2.7.4. Let ω, ρ ∈ D(X) be states on the same space X.

1  DKL (ω, ρ) ≥ 0.
2  State transformation is DKL -non-expansive: for a channel c : X → Y and
   states ω, ρ ∈ D(X) one has:

      DKL( c = ω, c = ρ )  ≤  DKL( ω, ρ ).

Proof. 1 Via Jensen’s inequality we get:

   −DKL (ω, ρ) = Σx ω(x) · log( ρ(x) / ω(x) )
               ≤ log( Σx ω(x) · ρ(x)/ω(x) ) = log( Σx ρ(x) ) = log(1) = 0.

2 Again via Jensen’s inequality:

   DKL( c = ω, c = ρ ) = Σy (c = ω)(y) · log( (c = ω)(y) / (c = ρ)(y) )
       = Σx,y ω(x) · c(x)(y) · log( (c = ω)(y) / (c = ρ)(y) )
       ≤ Σx ω(x) · log( Σy c(x)(y) · (c = ω)(y) / (c = ρ)(y) ).

Hence it suffices to prove:

   Σx ω(x) · log( Σy c(x)(y) · (c = ω)(y) / (c = ρ)(y) )
       ≤ DKL( ω, ρ ) = Σx ω(x) · log( ω(x) / ρ(x) ).

This inequality ≤ follows from another application of Jensen’s inequality:

   Σx ω(x) · log( Σy c(x)(y) · (c = ω)(y) / (c = ρ)(y) ) − Σx ω(x) · log( ω(x) / ρ(x) )
       = Σx ω(x) · [ log( Σy c(x)(y) · (c = ω)(y) / (c = ρ)(y) ) − log( ω(x) / ρ(x) ) ]
       = Σx ω(x) · log( Σy c(x)(y) · ( (c = ω)(y) / (c = ρ)(y) ) · ( ρ(x) / ω(x) ) )
       ≤ log( Σx,y ω(x) · c(x)(y) · ( (c = ω)(y) / (c = ρ)(y) ) · ( ρ(x) / ω(x) ) )
       = log( Σy ( Σx c(x)(y) · ρ(x) ) · (c = ω)(y) / (c = ρ)(y) )
       = log( Σy (c = ω)(y) )
       = log(1) = 0.


Exercises
2.7.1   Take ω = 1/4 | a i + 3/4 | b i and ρ = 1/2 | a i + 1/2 | b i. Check that:

           1  DKL (ω, ρ) = 3/4 · log(3) − 1 ≈ 0.19.
           2  DKL (ρ, ω) = 1 − 1/2 · log(3) ≈ 0.21.

2.7.2   Check that for y ∈ supp(ρ) one has:

           DKL( 1| y i, ρ ) = − log( ρ(y) ).

2.7.3   Show that the function DKL( ω, − ) behaves as follows on convex com-
        binations of states:

           DKL( ω, r1 · ρ1 + · · · + rn · ρn )
               ≤ r1 · DKL( ω, ρ1 ) + · · · + rn · DKL( ω, ρn ).

2.7.4   Use Jensen’s inequality to prove what is known as the inequality of
        arithmetic and geometric means: for ri , ai ∈ R≥0 with Σi ri = 1,

           Σi ri · ai  ≥  Πi ai^ri .
3

Drawing from an urn

One of the basic topics covered in textbooks on probability is drawing from


an urn, see e.g. [139, 145, 115] and many other references. An urn is a container
which is typically filled with balls of different colours. The idea is that the balls
in the urn are thoroughly shuffled so that the urn forms a physical representa-
tion of a probability distribution. The challenge is to describe the probabilities
associated with the experiment of (blindly) drawing one or more balls from
the urn and registering the colour(s). Commonly two questions can be asked in
this setting.

• Do we consider the drawn balls in order, or not? More specifically, do we


see a draw of multiple balls as a list, or as a multiset?
• What happens to the urn after a ball is drawn? We consider three scenarios,
with short names “-1”, “0”, “+1”, see Examples 2.1.1.

– In the “-1” scenario the drawn ball is removed from the urn. This is cov-
ered by the hypergeometric distribution.
– In the “0” scenario the drawn ball is returned to the urn, so that the urn
remains unchanged. In that case the multinomial distribution applies.
– In the “+1” scenario not only the drawn ball is returned to the urn, but one
additional ball of the same colour is added to the urn. This is covered by
the Pólya distribution.
In the “-1” and “+1” scenarios the probability of drawing a ball of a certain
colour changes after each draw. In the “-1” case only finitely many draws
are possible, until the urn is empty.

This yields six variations in total, namely with ordered / unordered draws and
with -1, 0, 1 as replacement scenario. These six options are represented in a
3 × 2 table, see (3.3) below.


An important goal of this chapter is to describe these six cases in a princi-


pled manner via probabilistic channels. When drawing from an urn one can
concentrate on the draw — on what is in your hand after the draw — or on the
urn — on what is left in the urn after the draw. It turns out to be most fruitful to
combine these two perspectives in a single channel, acting as a transition map.
This transition aspect will be studied more systematically in terms of proba-
bilistic automata in Chapter ??. Here we look at single (ball) draws, which may
be described informally as:

 / single-colour, Urn 0
 
Urn (3.1)

We use the ad hoc notation Urn 0 to describe the urn after the draw. It may
be the same urn as before, in case of a draw with replacement, or it may be
a different urn, with one ball less/more, namely the original urn minus/plus a
ball as drawn.
The above transition (3.1) will be described as a probabilistic channel. It
gives for each single draw the associated probability. In this description we
combine multisets and distributions. For instance, an urn with three red balls
and two blue ones will be described as a (natural) multiset 3| R i + 2| Bi. The
transition associated with drawing a single ball without replacement (scenario
“-1”) gives a mapping:
 / 3 R, 2| Ri + 2| Bi + 2 B, 3| R i + 1| Bi .

3| R i + 2| Bi 5 5

It gives the 35 probability of drawing a red ball, together with the remaining
urn, and a 25 probability of drawing a blue one, with a different new urn.
The “+1” scenario with double replacement gives a mapping:
 / 3 R, 4| Ri + 2| Bi + 2 B, 3| R i + 3| Bi .

3| R i + 2| Bi 5 5

Finally, the situation with single replacement (“0”) is given by:


 / 3 R, 3| Ri + 2| Bi + 2 B, 3| R i + 2| Bi .

3| R i + 2| Bi 5 5

In this last, third case we see that the urn/multiset does not change. An im-
portant first observation is that in that case we may as well use a distribution
as urn, instead of a multiset. The distribution represents an abstract urn. In the
above example we would use the distribution 53 | Ri+ 25 | Bi as abstract urn, when
we draw with single replacement (case “0”). The distribution contains all the
relevant information. Clearly, it is obtained via frequentist learning from the
original multiset. Using distributions instead of multisets gives more flexibility,


since not all distributions are obtained via frequentist learning — in particular
when the probabilities are proper real numbers and not fractions.
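The three single-draw transitions are easy to write down in code. The following
Python sketch (our own encoding, with urns as dictionaries of natural-number
multiplicities) reproduces the three mappings above:

    def draw(urn, mode):
        # One single draw from the urn (a multiset as a dictionary); the result is a
        # distribution, given as a dictionary mapping (colour, new urn) to its probability.
        # mode is -1 (remove the drawn ball), 0 (put it back), or +1 (add an extra copy).
        total = sum(urn.values())
        out = {}
        for colour, n in urn.items():
            new_urn = dict(urn)
            new_urn[colour] = n + mode      # a zero entry could be dropped in the -1 case
            out[(colour, tuple(sorted(new_urn.items())))] = n / total
        return out

    urn = {'R': 3, 'B': 2}
    for mode in (-1, 0, +1):
        print(mode, draw(urn, mode))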
We formulate this approach explicitly.

• In the “-1” and “+1” situations without or with double replacement, an urn
is a (non-empty, natural) multiset, which changes with every draw, via re-
moval or double replacement of the drawn ball. These scenarios will also be
described in terms of deletion or addition, using the draw-delete and draw-
add channels DD and DA that we already saw in Definition 2.2.4.
• In a “0” situation with replacement, an urn is a probability distribution; it
does not change when balls are drawn.

This covers the above second question, about what happens to the urn. For
the first question concerning ordered and unordered draws one has to go be-
yond single draw transitions. Hence we need to suitably iterate the single-draw
transition (3.1) to:
   Urn  −→  ( multiple-colours, Urn′ )                    (3.2)

Now we can make the distinction between ordered and unordered draws ex-
plicit. Let X be the set of colours, for the balls in the urn — so X = {R, B} in
the above illustration.

• An ordered draw of multiple balls, say K many, is represented via a list


X K = X × · · · × X of length K.
• An unordered draw of K-many balls is represented as a K-sized multiset, in
N[K](X).

Thus, in the latter case, both the urn and the handful of balls drawn from it, are
represented as a multiset.
In the end we are interested in assigning probabilities to draws, ordered or
not, in “-1”, “0” or “1” mode. These probabilities on draws are obtained by tak-
ing the first marginal/projection of the iterated transition map (3.2). It yields a
mapping from an urn to multiple draws. The following table gives an overview
of the types of these operations, where X is the set of colours of the balls.

   mode        ordered                          unordered

    0      O0 : D(X) ◦→ X^K                 U0 : D(X) ◦→ N[K](X)
   -1      O− : N[L](X) ◦→ X^K              U− : N[L](X) ◦→ N[K](X)          (3.3)
   +1      O+ : N∗ (X) ◦→ X^K               U+ : N∗ (X) ◦→ N[K](X)

We see that in the replacement scenario “0” the inputs of these channels are
distributions in D(X), as abstract urns. In the deletion scenario “-1” the in-
put (urns) are multisets in N[L](X), of size L. In the ordered case the outputs
are tuples in X K of length K and in the unordered case they are multisets in
N[K](X) of size K. Implicitly in this table we assume that L ≥ K, so that the
urn is full enough for K single draws. In the addition scenario “+1” we only
require that the urn is a non-empty multiset, so that at least one ball can be
drawn. Sometimes it is required that there is at least one ball of each colour in
the urn, so that all colours can occur in draws.
The above table uses systematic names for the six different draw maps. In
the unordered case the following (historical) names are common:

U0 = multinomial U− = hypergeometric U+ = Pólya.

In this chapter we shall see that these three draw maps have certain properties
in common, such as:

• Frequentist learning applied to the draws yields the same outcome as fre-
quentist learning from the urn, see Theorem 3.3.5, Corollary 3.4.2 (2) and
Proposition 3.4.5 (1).
• Doing a draw-delete DD after a draw of size K + 1 is the same as doing a
K-sized draw, see Proposition 3.3.8, Corollary 3.4.2 (4) and Theorem 3.4.6.

But we shall also see many differences between the three forms of drawing.
This chapter describes and analyses the six probabilistic channels in Ta-
ble 3.3 for drawing from an urn. It turns out that many of the relevant prop-
erties can be expressed via composition of such channels, either sequentially
(via ◦·) or in parallel (via ⊗). In this analysis the operations of accumulation
acc and arrangement arr, for going back-and-forth between products and mul-
tisets, play an important role. For instance, in each case one has commuting
diagrams of the form.

   arr ◦· U0 = O0 ,    acc ◦· O0 = U0         (urns in D(X))
   arr ◦· U− = O− ,    acc ◦· O− = U−         (urns in N[L](X))              (3.4)
   arr ◦· U+ = O+ ,    acc ◦· O+ = U+         (urns in N∗ (X))

Each pair of equations expresses one commuting triangle of channels, with the
unordered draw channel along the top, the ordered one along the bottom, and
arrangement / accumulation in between.
Such commuting diagrams are unusual in the area of probability theory but we
like to use them because they are very expressive. They involve multiple equa-
tions and clearly capture the types and order of the various operations involved.


Moreover, to emphasise once more, these are diagrams of channels with chan-
nel composition ◦·. As ordinary functions, with ordinary function composition
◦, there are no such diagrams / equations.
The chapter first takes a fresh look at accumulation and arrangement, in-
troduced earlier in Sections 1.6 and 2.2. These are the operations for turning
a list into a multiset, and a multiset into a uniform distribution of lists (that
accumulate to the multiset). In Section 3.2 we will use accumulation and ar-
rangement for the powerful operation for “zipping” two multisets of the same
size together. It works analogously to zipping of two lists of the same length,
into a single list of pairs, but this ‘multizip’ produces a distribution over (com-
bined) multisets. The multizip operation turns out to interact smoothly with
multinomial and hypergeometric distributions, as we shall see later on in this
chapter.
Section 3.3 investigates multinomial channels and Section 3.4 covers both
hypergeometric and Pólya channels, all in full, multivariate generality. We have
briefly seen the multinomial and hypergeometric channels in Example 2.1.1.
Now we take a much closer look, based on [78], and describe how these chan-
nels commute with basic operations such as accumulation and arrangement,
with frequentist learning, with multizip, and with draw-and-delete. All these
commutation properties involve channels and channel composition ◦·.
Subsequently, Section 3.5 elaborates how the channels in Table 3.3 actu-
ally arise. It makes the earlier informal descriptions in (3.1) and (3.2) mathe-
matically precise. What happens is a bit sophisticated. We recall that for any
monoid M, the mapping X 7→ M × X is a monad, called the writer monad,
see Lemma 1.9.1. This can be combined with the distribution monad D, giv-
ing a combined monad X 7→ D(M × X). It comes with an associated ‘Kleisli’
composition. It is precisely this composition that we use for iterating a single
draw, that is, for going from (3.1) to (3.2). Moreover, for ordered draws we use
the monoid M = L(X) of lists, and for unordered draws we use the monoid
M = N(X) of multisets. It is rewarding, from a formal perspective, to see that
from this abstract principled approach, common distributions for different sorts
of drawing arise, including the well known multinomial, hypergeometric and
Pólya distributions. This is based on [81].
The subsequent two sections 3.6 and 3.7 of this chapter focus on a non-
trivial operation from [78], namely turning a multiset of distributions into a
distribution of multisets. Technically, this operation is a distributive law, called
the parallel multinomial law. We spend ample time introducing it: Section 3.6
contains no less than four different definitions — all equivalent. Subsequently,
various properties are demonstrated of this parallel multinomial law, including


commutation with hypergeometric channels, with frequentist learning and with


muliset-zip.
The final section 3.9 briefly looks at yet another form of drawing coloured
balls from an urn. The draws considered there are like Pólya urns — where an
additional ball of the same colour as the drawn ball is returned to the urn —
but with a new twist: after each draw another ball is added to the urn with a
fresh colour, not already occurring in the urn. This gives an entirely new dy-
namics. The new colour works like a new mutation in genetics, and indeed,
these kind of “Hoppe” urns have arisen in mathematical biology, first in the
work of Ewens [46]. We describe the resulting distributions via probabilistic
channels and show that there are some similarities and differences with the
classical draws — multinomial, hypergeometric, and Pólya — where the num-
ber of colours of balls in the urn does not increase.
This chapter makes more use of the language of category theory than other
chapters in this book. It demonstrates that there is ample (categorical)
structure in basic probabilistic operations. The various properties of multino-
mial, hypergeometric and Pólya channels that occur in this chapter will be used
frequently in the sequel. Other, more technical topics could be skipped at first
reading. An even more categorical, axiomatic approach can be found in [80].

3.1 Accumulation and arrangement, revisited


Accumulation and arrangement are the operations that turn a sequence (list)
of elements into a multiset and vice-versa. These operations turn out to play
an important role in connecting multisets and distributions. That’s why we use
this first section to take a closer look than we have done before. We shall see the
role that permutations play. The next section introduces a ‘zip’ operation for
multisets, in terms of accumulation and arrangement.
In the previous two chapters we have introduced the accumulation function
acc : L(X) → N(X) that turns a list into a multiset by counting the occurrences
of each element in the list, see (1.31). For instance, acc(a, b, a, a) = 3| a i+1| bi.
Clearly, acc is a surjective function. Accumulation forms a map of monads,
see Exercise 1.9.7, which means that it is natural and commutes with the unit
and flatten maps (of L and M). In this chapter we shall use the accumulation
map restricted to a particular ‘size’, via a number K ∈ N and via a restriction
acc[K] : X K → N[K](X) to lists and multisets of size K. The parameter K in
acc[K] is often omitted if it is clear from the context.
In the other direction, one can go from multisets to lists/sequences via what
we have called arrangement (2.10). This is not a function but a channel, which
assigns equal probability to each sequence. Here we shall also use arrangement
for a specific size, as a channel arr[K] : N[K](X) → X^K . We recall:

   arr[K](ϕ) = Σ~x∈acc−1 (ϕ)  1/(ϕ) | ~x i      where      (ϕ) = kϕk! / Πx ϕ(x)!  =  K! / Πx ϕ(x)! .

We recall also that (ϕ) is called the multiset coefficient of ϕ. It is the number
of sequences that accumulate to ϕ, see Proposition 1.6.3.
One has acc ◦· arr = id , see (2.11). Composition in the other direction does
not give an identity but a (uniform) distribution of permutations:
   (arr ◦· acc)(~x) = Σ~y∈acc−1 (acc(~x))  1/(acc(~x)) | ~y i  =  Σ~y is a permutation of ~x  1/K! | ~y i .      (3.5)

The vectors ~y take the multiplicities of elements in ~x into account, which leads
to the factor 1/(acc(~x)).
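Both operations are easy to program. The following Python sketch (our own,
with multisets represented as frozen tuples of (element, multiplicity) pairs and
distributions as dictionaries) implements acc and arr and illustrates (3.5):

    from collections import Counter
    from itertools import permutations

    def acc(xs):
        # Accumulate a sequence into a multiset.
        return tuple(sorted(Counter(xs).items()))

    def arr(phi):
        # Arrangement: the uniform distribution on all distinct sequences accumulating to phi.
        seq = [x for x, n in phi for _ in range(n)]
        perms = set(permutations(seq))          # distinct sequences only
        return {p: 1 / len(perms) for p in perms}

    phi = acc(('a', 'b', 'a', 'a'))             # 3|a> + 1|b>
    print(phi, arr(phi))                        # four arrangements, each with probability 1/4

    xs = ('a', 'b', 'a')
    print(arr(acc(xs)))   # the three distinct permutations of xs, each with probability 1/3, as in (3.5)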
Permutations are an important part of the story of accumulation and arrange-
ment. This is very explicit in the following result. Categorically it can be de-
scribed in terms of a so-called coequaliser, but here we prefer a more concrete
description. The axiomatic approach of [80] is based on this coequaliser.

Proposition 3.1.1. For a number K ∈ N, let f : X^K → Y be a function which
is stable under permutation: for each permutation π : {1, . . . , K} → {1, . . . , K}
one has

   f (x1 , . . . , xK ) = f (xπ(1) , . . . , xπ(K) ),     for all sequences (x1 , . . . , xK ) ∈ X^K .

Then there is a unique function f̄ : N[K](X) → Y with f̄ ◦ acc = f . In a
diagram, we write this unique existence as a dashed arrow:

   X^K ---acc--->> N[K](X)
        \             ¦
       f \            ¦ f̄
          \           v
           `------->  Y

The double head >> for acc is used to emphasise that it is a surjective func-
tion.

This result will be used both as a definition principle and as a proof principle.
The existence part can be used to define a function N[K](X) → Y by specifying a
function X K → Y that is stable under permutation. The uniqueness part yields
a proof principle: if two functions g1 , g2 : N[K](X) → Y satisfy g1 ◦ acc =
g2 ◦ acc, then g1 = g2 . This is quite powerful, as we shall see. Notice that acc
is stable under permutation itself, see also Exercise 1.6.4.


Proof. Take ϕ = Σ1≤i≤ℓ ni | xi i ∈ N[K](X). Then we can define f̄ (ϕ) via any
arrangement, such as:

   f̄ (ϕ) ≔ f ( x1 , . . . , x1 , . . . , xℓ , . . . , xℓ ),
                 (n1 times)        (nℓ times)

Since f is stable under permutation, any other arrangement of the elements xi
gives the same outcome, that is, f̄ (ϕ) = f (~x) for each ~x ∈ X^K with acc(~x) = ϕ.
With this definition, f̄ ◦ acc = f holds, since for a sequence ~x = (x1 , . . . , xK ) ∈
X^K , say with acc(~x) = ϕ,

   f (x1 , . . . , xK ) = f̄ (ϕ) = f̄ (acc(~x)) = ( f̄ ◦ acc)(x1 , . . . , xK ).

For uniqueness, suppose g : N[K](X) → Y also satisfies g ◦ acc = f . Then
g = f̄ , by the following argument. Take ϕ ∈ N[K](X); since acc is surjective,
we can find ~x ∈ X^K with acc(~x) = ϕ. Then:

   g(ϕ) = g(acc(~x)) = f (~x) = f̄ (ϕ).




Example 3.1.2. We illustrate the use of Proposition 3.1.1 to re-define the ar-
rangement map and to prove one of its basic properties. Assume for a moment
that we do not already know about arrangement, only about accumulation. Now
consider the following situation:

   X^K ---acc--->> N[K](X)
        \              ¦
   perm  \             ¦ arr        where     perm(~x) ≔ Σ~y is a permutation of ~x  1/K! | ~y i .
          \            v
           `------>  D(X^K)

By construction, the function perm is stable under permutation. Hence the
induced function arr exists, by using Proposition 3.1.1 as a definition principle.
As a result, perm = arr ◦ acc.
Next we use Proposition 3.1.1 as proof principle. Our aim is to re-prove the
equality «acc» ◦· arr = unit : N[K](X) → D(N[K](X)), which we already know
from (2.11). Via our new proof principle such an equality can be obtained when
both sides of the equation fit in a situation where there is a unique solution. This
situation is given below.

   X^K ---acc--->> N[K](X)
        \              ¦
  «acc»  \             ¦ unit   and   «acc» ◦· arr
          \            v
           `------>  D(N[K](X))
*  
D N[K](X)


We show that both dashed arrows fit. The first equation below holds by con-
struction of arr, in the above triangle.

   ( («acc» ◦· arr) ◦ acc )(~x) = («acc» ◦· perm)(~x)
       = D(acc)( Σ~y is a permutation of ~x  1/K! | ~y i )
       = Σ~y is a permutation of ~x  1/K! | acc(~y) i
       = Σ~y is a permutation of ~x  1/K! | acc(~x) i
       = 1| acc(~x) i
       = unit(acc(~x)) = (unit ◦ acc)(~x).

The fact that the composite arr ◦· acc produces all permutations is used in the
following result. Recall that the ‘big’ tensor ⨂ : D(X)^K → D(X^K ) is defined
by ⨂(ω1 , . . . , ωK ) = ω1 ⊗ · · · ⊗ ωK , see (2.21).

Proposition 3.1.3. The composite arr ◦· acc commutes with tensors, as in the
following diagram of channels.

   D(X)^K  --acc-->  N[K](D(X))  --arr-->  D(X)^K
      |                                       |
      | ⨂                                     | ⨂
      v                                       v
     X^K   --acc-->   N[K](X)    --arr-->    X^K

Proof. For distributions ω1 , . . . , ωK ∈ D(X) and elements x1 , . . . , xK ∈ X we
have:

   (⨂ ◦· arr ◦· acc)(~ω)(~x)
       = Σ~ρ∈acc−1 (acc(~ω))  1/(acc(~ω)) · (ρ1 ⊗ · · · ⊗ ρK )(~x)
       = Σ~ρ is a permutation of ~ω  1/K! · (ρ1 ⊗ · · · ⊗ ρK )(~x)
       = Σ~y is a permutation of ~x  1/K! · (ω1 ⊗ · · · ⊗ ωK )(~y)
       = Σ~y∈acc−1 (acc(~x))  1/(acc(~x)) · (ω1 ⊗ · · · ⊗ ωK )(~y)
       = (arr ◦· acc ◦· ⨂)(~ω)(~x).


Exercises
3.1.1 Show that the permutation channel perm from Example 3.1.2 is an
idempotent, i.e. satisfies perm ◦· perm = perm. Prove this both con-
cretely, via the definition of perm, and abstractly, via acc and arr.
3.1.2 Consider Proposition 3.1.3 for X = {a, b} and K = 4. Check that:

           (⨂ ◦· arr ◦· acc)(ω1 , ω1 , ω2 , ω2 )(a, b, b, b)
               = 1/2 · ω1 (a) · ω1 (b) · ω2 (b) · ω2 (b) + 1/2 · ω1 (b) · ω1 (b) · ω2 (a) · ω2 (b)
               = (arr ◦· acc ◦· ⨂)(ω1 , ω1 , ω2 , ω2 )(a, b, b, b).

3.2 Zipping multisets


For two multisets ϕ ∈ N[K](X) and ψ ∈ N[L](Y) we can form their tensor
product ϕ ⊗ ψ ∈ N[K · L](X × Y). The fact that it has size K · L is shown
in the first lines of the proof of Proposition 2.4.7. For sequences there is the
familiar zip combination map X K ×Y K → (X ×Y)K that does maintain size, see
Exercise 1.1.7. Interestingly, there is also a zip-like operation for combining
two multisets of the same size. It makes systematic use of accumulation and
arrangement. This will be described next, based on [78].
Zipping two lists of the same length is a standard operation in (functional)
programming. It produces a new list, consisting of pairs of elements from the
two lists. We have seen it for products in Exercise 1.1.7, as an isomorphism
zip[K] : X^K × Y^K −→ (X × Y)^K . Our aim below is to describe a similar function,
called multizip and written as mzip, for (natural) multisets of size K. It is
not immediately obvious how mzip should work, since there are many ways
to combine elements from two multisets. Analogously to the arrange channel
arr[K] : N[K](X) → X K we shall describe mzip on multisets as a (uniform)
channel:

   N[K](X) × N[K](Y)  --mzip[K]-->  N[K](X × Y).

So let multisets ϕ ∈ N[K](X) and ψ ∈ N[K](Y) be given. For all sequences


~x ∈ X K and ~y ∈ Y K with acc(~x) = ϕ and acc(~y) = ψ we can form their ordinary
zip, as zip(~x, ~y) ∈ (X × Y)K . Accumulating the pairs in this zip gives a multiset
over X × Y. Thus we define:

   mzip(ϕ, ψ) ≔ Σ~x∈acc−1 (ϕ) Σ~y∈acc−1 (ψ)  1/((ϕ)·(ψ)) | acc(zip(~x, ~y)) i .        (3.6)


Diagrammatically we may describe mzip as the following composite.

   N[K](X) × N[K](Y)  --arr ⊗ arr-->  D(X^K × Y^K)  --D(zip)-->  D((X × Y)^K)  --D(acc)-->  D(N[K](X × Y))      (3.7)

An illustration may help to see what happens here.

Example 3.2.1. Let’s use two sets X = {a, b} and Y = {0, 1} with two multisets
of size three:

   ϕ = 1| a i + 2| b i      and      ψ = 2| 0 i + 1| 1 i.

Then:

   (ϕ) = 3!/(1! · 2!) = 3      and      (ψ) = 3!/(2! · 1!) = 3.

The sequences in X^3 and Y^3 that accumulate to ϕ and ψ are:

   a, b, b                    0, 0, 1
   b, a, b        and         0, 1, 0
   b, b, a                    1, 0, 0

Zipping them together gives the following nine sequences in (X × Y)^3 .

   (a, 0), (b, 0), (b, 1)     (b, 0), (a, 0), (b, 1)     (b, 0), (b, 0), (a, 1)
   (a, 0), (b, 1), (b, 0)     (b, 0), (a, 1), (b, 0)     (b, 0), (b, 1), (a, 0)
   (a, 1), (b, 0), (b, 0)     (b, 1), (a, 0), (b, 0)     (b, 1), (b, 0), (a, 0)

By applying the accumulation function acc to each of these we get multisets:

   1| a, 0 i+1| b, 0 i+1| b, 1 i    1| b, 0 i+1| a, 0 i+1| b, 1 i    2| b, 0 i+1| a, 1 i
   1| a, 0 i+1| b, 1 i+1| b, 0 i    2| b, 0 i+1| a, 1 i              1| b, 0 i+1| b, 1 i+1| a, 0 i
   1| a, 1 i+2| b, 0 i              1| b, 1 i+1| a, 0 i+1| b, 0 i    1| b, 1 i+1| b, 0 i+1| a, 0 i

We see that there are only two different multisets involved. Counting them and
multiplying with 1/((ϕ)·(ψ)) = 1/9 gives:

   mzip[3]( 1| a i+2| b i, 2| 0 i+1| 1 i )
       = 1/3 ( 1| a, 1 i+2| b, 0 i ) + 2/3 ( 1| a, 0 i+1| b, 0 i+1| b, 1 i ).

This shows that calculating mzip is laborious. But it is quite mechanical and
easy to implement. The picture below suggests to look at mzip as a funnel with
two input pipes in which multiple elements from both sides can be combined
into a probabilistic mixture.

   [Figure: mzip pictured as a funnel, with the two multisets 1| a i+2| b i and
   2| 0 i+1| 1 i entering via the two input pipes, and the probabilistic mixture
   1/3 ( 1| a, 1 i+2| b, 0 i ) + 2/3 ( 1| a, 0 i+1| b, 0 i+1| b, 1 i ) coming out.]
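As remarked above, the computation is mechanical. A Python sketch (our own
encoding, following formula (3.6) literally) reproduces the outcome of
Example 3.2.1:

    from collections import Counter
    from itertools import permutations

    def seqs(phi):
        # All distinct sequences accumulating to the multiset phi (a Counter).
        base = [x for x, n in phi.items() for _ in range(n)]
        return set(permutations(base))

    def mzip(phi, psi):
        # Multizip of two equal-size multisets, following formula (3.6).
        out = Counter()
        xs, ys = seqs(phi), seqs(psi)
        for x in xs:
            for y in ys:
                chi = tuple(sorted(Counter(zip(x, y)).items()))
                out[chi] += 1 / (len(xs) * len(ys))
        return dict(out)

    phi = Counter({'a': 1, 'b': 2})
    psi = Counter({0: 2, 1: 1})
    for chi, prob in mzip(phi, psi).items():
        print(prob, chi)    # 1/3 on 1|a,1>+2|b,0>, and 2/3 on 1|a,0>+1|b,0>+1|b,1>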

The mzip operation satisfies several useful properties.


Proposition 3.2.2. Consider mzip : N[K](X) × N[K](Y) → N[K](X × Y),
either in equational formulation (3.6) or in diagrammatic form (3.7).

1  The mzip map is natural in X and Y: for functions f : X → U and g : Y → V
   one has, as a commuting diagram,

      D(N[K]( f × g)) ◦ mzip  =  mzip ◦ (N[K]( f ) × N[K](g)).

2  For ϕ ∈ N[K](X) and y ∈ Y,

      mzip(ϕ, K| y i) = 1| ϕ ⊗ 1| y i i.

   And similarly in symmetric form.
3  Multizip commutes with projections in the following sense:

      D(N[K](π1 )) ◦ mzip = unit ◦ π1       and       D(N[K](π2 )) ◦ mzip = unit ◦ π2 .

4  Arrangement arr relates zip and mzip as in:

      arr ◦· mzip  =  «zip» ◦· (arr ⊗ arr)  :  N[K](X) × N[K](Y) → (X × Y)^K .

5  mzip is associative, as given by:

      mzip ◦· (mzip ⊗ id )  =  mzip ◦· (id ⊗ mzip)
          :  N[K](X) × N[K](Y) × N[K](Z) → N[K](X × Y × Z).

Here we take associativity of × for granted.


Proof. 1 Easy, via the diagrammatic formulation (3.7), using naturality of arr
(Exercise 2.2.7), of zip (Exercise 1.9.4), and of acc (Exercise 1.6.6).
2 Since:
X 1
mzip(ϕ, K| y i) = acc zip(~x, hy, . . . , yi)

( ϕ )
~x∈acc −1 (ϕ)
1
acc(−x−i→
X
= , y)

( ϕ )
~x∈acc −1 (ϕ)
X 1
=

acc(~x) ⊗ 1|y i
( ϕ )
~x∈acc −1 (ϕ)
X 1
= ϕ ⊗ 1| y i = 1 ϕ ⊗ 1| y i .

( ϕ )
−1 ~x∈acc (ϕ)

3 By naturality of acc and zip:


 
D(N[K](π1 )) mzip(ϕ, ψ)
X X 1
= N[K](π1 ) acc(zip(~x, ~y))

( ϕ ) · ( ψ)
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ)
X X 1
= acc (π1 )K (zip(~x, ~y))

( ϕ ) · ( ψ)
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ)
X X 1 X 1
= acc(~x) = ϕ = 1 ϕ .

( ϕ ) · ( ψ) (ϕ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1 −1 ~x∈acc (ϕ)

4 We have for ϕ ∈ N[K](X), ψ ∈ N[K](Y),


 
X  X 
arr ◦· mzip (ϕ, ψ) =
  arr(χ)(~ z ) · mzip(ϕ, ψ)(χ)  ~z
 
~z∈(X×Y)K χ∈N[K](X×Y)
X 1
= · mzip(ϕ, ψ)(acc(~z)) ~z

( acc(~ z ) )
~z∈(X×Y)K  
X 1  X 1 1 
 ~z
= ·  ·
( acc(~z) ) ( ϕ ) ( ψ) 
~z∈(X×Y) K ~x∈acc (ϕ),~y∈acc (ψ),acc(zip(~x,~y))=acc(~z)
−1 −1
X X 1 1 X 1
= · · ~z
( ϕ ) ( ψ) ( acc(zip(~x, ~y)) )
~x∈acc −1 (ϕ) ~y∈acc −1 (ψ) ~z∈acc −1 (acc(zip(~x,~y)))
X X 1 1
= zip(~x, ~y)

·
( ϕ ) ( ψ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1

since permuting a zip of permuted inputs does not add anything


= zip  ◦· (arr ⊗ arr) (ϕ, ψ).


5 Consider the following diagram chase, obtained by unpacking the (diagram-


matic) definition of mzip on the right-hand side.

N[K](X)×N[K](Y)×N[K](Z)
id ⊗mzip
◦ / N[K](X)×N[K](Y ×Z)
arr⊗arr⊗arr ◦ arr⊗arr
◦* 
mzip⊗id ◦ K
X ×Y ×Z K K id ⊗zip
◦ / X ×(Y ×Z)K
K

zip⊗id ◦ ◦ zip mzip ◦


  
arr⊗arr /
2/ (X×Y ×Z)
zip
N[K](X)×N[K](Y ×Z) ◦ (X×Y)K ×Z K ◦ K

mzip ◦ ◦ ◦ acc
 arr 
N[K](X×Y ×Z) ◦ N[K](X×Y ×Z) o
We can see that the outer diagram commutes by going through the internal
subdiagrams. In the middle we use associativity of the (ordinary) zip func-
tion, formulated in terms of (deterministic) channels. Three of the (other)
internal subdiagrams commute by item (4). The acc-arr triangle at the bot-
tom commutes by (2.11).

The following result deserves a separate status. It tells that what we learn
from a multiset zip is the same as what we learn from a parallel product (of
multisets).

Theorem 3.2.3. Multiset zip and frequentist learning interact well, namely as:

Flrn = mzip(ϕ, ψ) = Flrn(ϕ ⊗ ψ).


Equivalently, in diagrammatic form:

N[K](X) × N[K](Y)
mzip
◦ / N[K](X × Y)

 
◦ Flrn
2
N[K ](X × Y)
Flrn
◦ / X×Y

Proof. Let multisets ϕ ∈ N[K](X) and ψ ∈ N[K](Y) be given and let a ∈ X


and b ∈ Y be arbitrary elements. We need to show that the probability:
  X X acc zip(~x, ~y)(a, b)
Flrn = mzip(ϕ, ψ) (a, b) =
K · (ϕ ) · ( ψ)
~x∈acc (ϕ) ~y∈acc (ψ)
−1 −1

is the same as the probability:


ϕ(a) · ψ(a)
Flrn(ϕ ⊗ ψ)(a, b) = .
K·K
We reason informally, as follows. For arbitrary ~x ∈ acc −1 (ϕ) and ~y ∈ acc −1 (ψ)
we need to find the fraction of occurrences (a, b) in zip(~x, ~y). The fraction of
occurrences of a in ~x is ϕ(a)
K = Flrn(ϕ)(a), and the fraction of occurrences of b


in ~y is ψ(b) x, ~y)
K = Flrn(ψ)(b). Hence the fraction of occurrences of (a, b) in zip(~
is Flrn(ϕ)(a) · Flrn(ψ)(b) = Flrn(ϕ ⊗ ψ)(a, b).

Once we have seen the definition of mzip, via ‘deconstruction’ of multisets


into lists, a zip operation on lists, and ‘reconstruction’ to a multiset result, we
can try to apply this approach more widely. For instance, instead of using a
zip on lists we can simply concatenate (++) the lists — assuming they contain
elements from the same set. This yields, like in (3.7), a composite channel:

N[K](X) × N[L](X)
arr⊗arr / D XK × XL
 D(++)
 
D X K+L / D N[K +L](X)
D(acc)

It is easy to see that this yields addition of multisets, as a deterministic channel.


We don’t get the tensor ⊗ of multisets in this way, because there is no tensor
of lists, see Remark 2.4.3.
We conclude with several useful observations about accumulation in two
dimensions.

Lemma 3.2.4. Fix a number K ∈ N and a set X.

1 We can mix K-ary and 2-ary accumulation in the following way.

XK × XK
acc[K]×acc[K]
/ N[K](X) × N[K](X) +
(
zip  N[2K](X)
6
 acc[2]K
(X × X)K / N[2](X)K +

2 When we generalise from 2 to L ≥ 2 we get:

acc[K]L
XK
L / N[K](X)L +

'
zip L  N[L·K](X)
7

acc[L]K
XL K
 / N[L](X)K +

Proof. 1 Commutation of the diagram is ‘obvious’, so we provide only an


exemplary proof. The two paths in the diagram yield the same outcomes in:

+ ◦ (acc[K] × acc[K]) [a, b, c, a, c], [b, b, a, a, c]


 

= 2| a i + 1|b i + 2| ci + 2| a i + 2| bi + 1| ci
 

= 4| a i + 3| b i + 3| ci.
+ ◦ acc[2]K ◦ zip [a, b, c, a, c], [b, b, a, a, c]
 

= + ◦ acc[2]K [(a, b), (b, b), (c, a), (a, a), (c, c)]
 

= + [1| a i + 1|b i, 2| bi, 1| a i + 1| c i, 2| ai, 2| c i]




= 4| a i + 3|b i + 3| ci.

2 Similarly.

We now transfer these results to multizip.

Proposition 3.2.5. In binary form one has:

+ / N[2K](X) unit
N[K](X) × N[K](X)
( 
mzip D N[2K](X)
6

D N[K](X × X)
 D(N[K](acc[2]))
/ D N[K] N[2](X) D(flat)

More generally, we have for L ≥ 2,

+ / N[L·K](X)
N[K](X)L unit

( 
mzip L D N[L·K](X)
6
 D(N[K](acc[L]))/
D N[K](X L )

D N[K] N[L](X) D(flat)

The map mzip L is the L-ary multizip, obtained via:

mzip 2 B mzip and mzip L+1 B mzip ◦· (mzip L ⊗ id )

Via the associativity of Proposition 3.2.2 (5) the actual arrangement of these
multiple multizips does not matter.


Proof. We only do the binary case, via an equational proof:

D(flat) ◦ D(N[K](acc[2])) ◦ mzip


(3.7)
= D(flat) ◦ D(N[K](acc[2])) ◦ D(acc[K]) ◦ D(zip) ◦ (arr ⊗ arr)
= D(flat) ◦ D(acc[K]) ◦ D(acc[2]K ) ◦ D(zip) ◦ (arr ⊗ arr)
by naturality of acc[K] : X K → N[K](X)
= D(+) ◦ D(acc[2]K ) ◦ D(zip) ◦ (arr ⊗ arr) by Exercise 1.6.7
= D(+) ◦ D(acc[K] × acc[K]) ◦ (arr ⊗ arr) by Lemma 3.2.4 (1)
= D(+) ◦ (D(acc[K]) ◦ arr) ⊗ (D(acc[K]) ◦ arr)

(2.11)
= D(+) ◦ (unit ⊗ unit)
= D(+) ◦ unit
= unit ◦ +.

Exercises
3.2.1   Show that:

           mzip[4]( 1| a i + 2| b i + 1| c i, 3| 0 i + 1| 1 i )
               = 1/4 ( 1| a, 1 i + 2| b, 0 i + 1| c, 0 i )
                   + 1/2 ( 1| a, 0 i + 1| b, 0 i + 1| b, 1 i + 1| c, 0 i )
                   + 1/4 ( 1| a, 0 i + 2| b, 0 i + 1| c, 1 i ).

3.2.2   Show in the context of the previous exercise that:

           Flrn = mzip[4]( 1| a i + 2| b i + 1| c i, 3| 0 i + 1| 1 i )
               = 3/16 | a, 0 i + 1/16 | a, 1 i + 3/8 | b, 0 i + 1/8 | b, 1 i + 3/16 | c, 0 i + 1/16 | c, 1 i.

        Compute also: Flrn( 1| a i + 2| b i + 1| c i ) ⊗ Flrn( 3| 0 i + 1| 1 i ), and re-
        member Theorem 3.2.3.

 
member Theorem 3.2.3.
3.2.3 1 Check that mzip does not commute with diagonals, in the sense
that the following triangle does not commute.

∆ / N[K](X) × N[K](X)
N[K](X)
, mzip

N[K](∆) - D N[K](X × X)

Hint: Take for instance 1| a i + 1|b i.

143
144 Chapter 3. Drawing from an urn

2 Check that mzip and zip do not commute with accumulation, as in:

XK × Y K
acc × acc / N[K](X) × N[K](Y)
zip  , mzip
 
acc  / D N[K](X × Y)
(X × Y)K

Hint: Take sequences [a, b, b], [0, 0, 1] and re-use Example 3.2.1.

3.3 The multinomial channel


Multinomial distributions appear when one draws multiple coloured balls from
an urn, with replacement, as described briefly in Example 2.1.1 (2). These dis-
tributions assign a probability to such a draw, where the draws themselves will
be represented as multisets, over colours. We shall standardly describe multi-
nomial distributions in multivariate form, for multiple colours. The binomial
distribution is then a special case, for two colours only, see Example 2.1.1 (2).
A multinomial draw is a draw ‘with replacement’, so that the draw of mul-
tiple, say K, balls can be understood as K consecutive draws from the same
urn. Such draws may also be understood as samples from the distribution, of
size K. This addititive nature of multinomials is formalised as Exercise 3.3.8
below. In different form it occurs in Section 3.5.
When one thinks in terms of drawing from an urn with replacement, the
urn itself never changes. For that reason, the urn is most conveniently — and
abstractly — represented as a distribution, rather than as a multiset. When the
urn is written as ω ∈ D(X), the underlying set/space X is the set of types
(think: colours) of objects in the urn. To sum up, multinomial distributions are
described as a function of the following type.
mn[K]
/ D N[K](X) .
 
D(X) (3.8)

The number K ∈ N represents the number of objects that is drawn. The distri-
bution mn[K](ω) assigns a probability to a K-object draw ϕ ∈ N[K](X). There
is no bound on K, since the idea is that drawn objects are replaced.
Clearly, the above function (3.8) forms a channel mn[K] : D(X) → N[K](X).
In this section we collect some basic properties of these channels. They are
interesting in themselves, since they capture basic relationships between mul-
tisets and distributions, but they will also be useful in the sequel.
For convenience we repeat the definition of the multinomial channel (3.8)
from Example 2.1.1 (2) For a set X and a natural number K ∈ N we define the


multinomial channel (3.8) on ω ∈ D(X) as:

   mn[K](ω) ≔ Σϕ∈N[K](X)  (ϕ) · Πx ω(x)^ϕ(x)  | ϕ i,        (3.9)

where, recall, (ϕ) is the multinomial coefficient K! / Πi ϕ(xi )!, see Definition 1.5.1 (5).
The Multinomial Theorem (1.26) ensures that the probabilities in mn[K](ω)
add up to one.
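For concreteness, the multinomial channel can be sketched in a few lines of
Python (our own illustration, not a prescribed implementation; it enumerates
all multisets of size K over the support of ω, following (3.9)):

    from math import factorial, prod
    from itertools import combinations_with_replacement
    from collections import Counter

    def multinomial(K, omega):
        # The distribution mn[K](omega) on multisets of size K, following (3.9).
        out = {}
        for draw in combinations_with_replacement(sorted(omega), K):
            phi = Counter(draw)
            coeff = factorial(K) / prod(factorial(n) for n in phi.values())
            out[tuple(sorted(phi.items()))] = coeff * prod(omega[x]**n for x, n in phi.items())
        return out

    omega = {'R': 3/5, 'B': 2/5}
    print(multinomial(2, omega))
    # {(('B', 2),): 4/25, (('B', 1), ('R', 1)): 12/25, (('R', 2),): 9/25}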
The next result describes the fundamental relationships between multino-
mial distributions, accumulation/arrangement (acc / arr) and indedependent
and identical distributions (iid ).

Theorem 3.3.1. 1 Arrangement after a multinomial yields the independent
and identical version of the original distribution:

      arr[K] ◦· mn[K]  =  iid [K]  :  D(X) → X^K .

2 Multinomial channels can be described as the composite:

      mn[K]  =  «acc[K]» ◦· iid [K]  :  D(X) → N[K](X),

where iid (ω) = ω^K = ω ⊗ · · · ⊗ ω ∈ D(X^K ).

Proof. 1 For ω ∈ D(X) and ~x = (x1 , . . . , xK ) ∈ X^K ,

   (arr ◦· mn[K])(ω)(~x) = Σϕ∈N[K](X)  arr(ϕ)(~x) · mn[K](ω)(ϕ)
       = 1/(acc(~x)) · mn[K](ω)(acc(~x))
       = 1/(acc(~x)) · (acc(~x)) · Πy ω(y)^acc(~x)(y)
       = Πi ω(xi ) = ω^K (~x) = iid (ω)(~x).

2 A direct consequence of the previous item, using that acc ◦· arr = unit,
see (2.11) in Example 2.2.3.

This theorem has non-trivial consequences, such as naturality of multino-


mial channels and commutation with tensors.


Corollary 3.3.2. Multinomial channels are natural: for each function f : X →


Y the following diagram commutes.

D(X)
mn[K]
/ D N[K](X)
D( f ) D(N[K]( f ))
 
D(X)
mn[K]
/ D N[K](X)

Proof. By Theorem 3.3.1 (2) and naturality of acc and of iid , as expressed by
the diagram on the left in Lemma 2.4.8 (1).
D(N[K]( f )) ◦ mn[K] = D(N[K]( f )) ◦ D(acc) ◦ iid
= D(acc) ◦ D( f K ) ◦ iid
= D(acc) ◦ iid ◦ D( f )
= mn[K] ◦ D( f ).
For the next result, recall the multizip operation mzip from Section 3.2. It
may have looked a bit unusual at the time, but the next result demonstrates that
it behaves quite well — as in other such results.
Corollary 3.3.3. Multinomial channels commute with tensor and multizip:
 
mzip = mn[K](ω) ⊗ mn[K](ρ) = mn[K](ω ⊗ ρ).
Diagrammatically this amounts to:

D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
⊗ ◦ mzip
 
D(X × Y)
mn[K]
◦ / N[K](X × Y)

Proof. We give a proof via diagram-chasing. Commutation of the outer dia-


gram below follows from the commuting subparts of the diagram, which arise
by unfolding the definition of mzip in (3.7), below on the right.

D(X) × D(Y)
mn[K]⊗mn[K]
◦ / N[K](X) × N[K](Y)
◦ arr⊗arr

iid ⊗iid / X × YK
K

◦ zip 

 ◦ mzip

iid / (X × Y)K
◦ acc 
 
D(X × Y)
mn[K]
◦ / N[K](X × Y) o

The three subdiagrams, from top to bottom, commute by Theorem 3.3.1 (1),
by Lemma 2.4.8 (2), and by Theorem 3.3.1 (2).


Actually computing mn[K](ω ⊗ ρ)(χ) is very fast, but computing the equal
expression:

    (mzip =≪ (mn[K](ω) ⊗ mn[K](ρ)))(χ)

is much slower. The reason is that one has to sum over all pairs (ϕ, ψ)
that mzip to χ.
We move on to a next fundamental fact, namely that frequentist learning
after a multinomial is the identity, in the theorem below. We first need an
auxiliary result.

Lemma 3.3.4. Fix a distribution ω ∈ D(X) and a number K. For each y ∈ X,

    ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y) = K · ω(y).

Proof. The equation holds for K = 0, since then ϕ(y) = 0. Hence we may
assume K > 0. Then:

    ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)
      = ∑_{ϕ∈N[K](X), ϕ(y)≠0} ϕ(y) · K!/∏_x ϕ(x)! · ∏_x ω(x)^{ϕ(x)}
      = ∑_{ϕ∈N[K](X), ϕ(y)≠0} K·(K−1)! / [ (ϕ(y)−1)! · ∏_{x≠y} ϕ(x)! ] · ω(y) · ω(y)^{ϕ(y)−1} · ∏_{x≠y} ω(x)^{ϕ(x)}
      = K · ω(y) · ∑_{ϕ∈N[K−1](X)} (K−1)!/∏_x ϕ(x)! · ∏_x ω(x)^{ϕ(x)}
      = K · ω(y) · ∑_{ϕ∈N[K−1](X)} mn[K−1](ω)(ϕ)
      = K · ω(y).
Theorem 3.3.5. Frequentist learning from a multinomial gives the original
distribution:

    Flrn =≪ mn[K](ω) = ω.                                                 (3.10)

This means that the following triangle of channels commutes:

    Flrn ◦· mn[K] = id  :  D(X) → X.

The identity function D(X) → D(X) is used as channel D(X) → X.


This last result is important when one thinks of draws as samples from the
distribution. It says that what we learn from samples is the same as what we
learn from the distribution itself. This is of course an elementary correctness
property for sampling.

Proof. By Lemma 3.3.4:

    (Flrn ◦· mn[K])(ω)(y) = ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · Flrn(ϕ)(y)
                          = ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)/∥ϕ∥
                          = K · ω(y) · 1/K
                          = ω(y).
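
The theorem can also be checked numerically with the Python sketch introduced after (3.9); the helper below (an illustrative name, not book notation) computes Flrn =≪ mn[K](ω) and returns ω again.

    # Sketch, reusing `mn`, `omega` and `Fraction` from the earlier snippet:
    # frequentist learning applied to mn[K](omega) recovers omega, as in (3.10).
    def flrn_after(dist):
        """Push a distribution over multisets through Flrn and form the mixture."""
        out = {}
        for phi, p in dist.items():
            size = sum(n for _, n in phi)
            for x, n in phi:
                out[x] = out.get(x, 0) + p * Fraction(n, size)
        return out

    print(flrn_after(mn(3, omega)))   # {'a': Fraction(1, 3), 'b': Fraction(2, 3)}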

Since a multinomial mn[K](ω) is a distribution we can use it as an abstract
urn, not containing single balls, but containing multisets of balls (draws).
Hence we can draw from mn[K](ω) as well, giving a distribution on draws of
draws. We show that this can also be done with a single multinomial.

Theorem 3.3.6. Multinomial channels compose, with a bit of help of the (fixed-
size) flatten operation for multisets (1.33), as in:

    D(flat) ◦ mn[L] ◦ mn[K] = mn[L·K]  :  D(X) → D(M[L·K](X)).

Proof. For ω ∈ D(X) and ψ ∈ M[L·K](X), abbreviating the summation index
Ψ ∈ M[L](M[K](X)) with flat(Ψ) = ψ to Ψ ↦ ψ,

    (D(flat) ◦ mn[L] ◦ mn[K])(ω)(ψ)
      = ∑_{Ψ↦ψ} mn[L](mn[K](ω))(Ψ)
      = ∑_{Ψ↦ψ} ( Ψ ) · ∏_{ϕ∈supp(Ψ)} ( ( ϕ ) · ∏_x ω(x)^{ϕ(x)} )^{Ψ(ϕ)}
      = ∑_{Ψ↦ψ} ( Ψ ) · ∏_{ϕ∈supp(Ψ)} ( ϕ )^{Ψ(ϕ)} · ∏_{ϕ∈supp(Ψ)} ∏_x ω(x)^{ϕ(x)·Ψ(ϕ)}
      = ∑_{Ψ↦ψ} ( Ψ ) · ∏_{ϕ∈supp(Ψ)} ( ϕ )^{Ψ(ϕ)} · ∏_x ω(x)^{∑_{ϕ∈supp(Ψ)} ϕ(x)·Ψ(ϕ)}
      = ∑_{Ψ↦ψ} ( Ψ ) · ∏_{ϕ∈supp(Ψ)} ( ϕ )^{Ψ(ϕ)} · ∏_x ω(x)^{flat(Ψ)(x)}
      = ( ∑_{Ψ↦ψ} ( Ψ ) · ∏_{ϕ∈supp(Ψ)} ( ϕ )^{Ψ(ϕ)} ) · ∏_x ω(x)^{ψ(x)}
      = ( ψ ) · ∏_x ω(x)^{ψ(x)}                                by Theorem 1.6.5
      = mn[L·K](ω)(ψ).

We mention another consequence of Lemma 3.3.4; it may be understood as
an ‘average’ or ‘mean’ result for multinomials, see also Definition 4.1.3. We
can describe it via a flatten map flat : M(M(X)) → M(X), using inclusions
D(X) ↪ M(X) and N[K](X) ↪ M(X).

Proposition 3.3.7. Fix a distribution ω ∈ D(X) and a number K. Each natural
multiset ϕ ∈ N[K](X) can be regarded as an element of the set of all multisets
M(X). Similarly, ω can be understood as an element of M(X). With these
inclusions in mind one has:

    flat(mn[K](ω)) = ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ = K · ω.

Proof. Since:

    ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ
        = ∑_{x∈X} ( ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(x) ) | x ⟩
        = ∑_{x∈X} K · ω(x) | x ⟩                       by Lemma 3.3.4
        = K · ∑_{x∈X} ω(x) | x ⟩
        = K · ω.
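
Continuing the earlier Python sketch, this mean result can be checked by summing draws weighted by their multinomial probabilities; flat_mean is an illustrative helper, not book notation.

    # Sketch: flat(mn[K](omega)) = K * omega, as in Proposition 3.3.7,
    # reusing `mn` and `omega` from the snippet after (3.9).
    def flat_mean(dist):
        out = {}
        for phi, p in dist.items():
            for x, n in phi:
                out[x] = out.get(x, 0) + p * n
        return out

    print(flat_mean(mn(3, omega)))    # the multiset 1|a> + 2|b>, i.e. 3 * omega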


Recall the draw-and-delete channel DD : N[K + 1](X) → N[K](X) from
Definition 2.2.4 for drawing a single ball from a non-empty multiset. It com-
mutes with multinomial channels.

Proposition 3.3.8. The following triangles commute, for K ∈ N:

    DD ◦· mn[K+1] = mn[K]  :  D(X) → N[K](X).
Proof. For ω ∈ D(X) and ϕ ∈ N[K](X) we have:

    (DD ◦· mn[K+1])(ω)(ϕ)
      = ∑_{ψ∈N[K+1](X)} mn[K+1](ω)(ψ) · DD(ψ)(ϕ)
      = ∑_{x∈X} mn[K+1](ω)(ϕ + 1| x ⟩) · (ϕ(x) + 1)/(K + 1)
      = ∑_{x∈X} (K+1)! / ∏_y (ϕ + 1|x⟩)(y)! · ∏_y ω(y)^{(ϕ+1|x⟩)(y)} · (ϕ(x) + 1)/(K + 1)
      = ∑_{x∈X} K! / ∏_y ϕ(y)! · ∏_y ω(y)^{ϕ(y)} · ω(x)
      = ( ϕ ) · ∏_y ω(y)^{ϕ(y)} · ∑_x ω(x)
      = mn[K](ω)(ϕ).
The above triangles exist for each K. This means that the collection of chan-
nels mn[K], indexed by K ∈ N, forms a cone for the infinite chain of draw-
and-delete channels. This situation is further investigated in [85] in relation to
de Finetti’s theorem [37], which is reformulated there in terms of multinomial
channels forming a limit cone.

The multinomial channels do not commute with draw-add channels DA. For
instance, for ω = 1/3| a ⟩ + 2/3| b ⟩ one has:

    mn[2](ω) = 1/9| 2|a⟩ ⟩ + 4/9| 1|a⟩ + 1|b⟩ ⟩ + 4/9| 2|b⟩ ⟩
    mn[3](ω) = 1/27| 3|a⟩ ⟩ + 2/9| 2|a⟩ + 1|b⟩ ⟩ + 4/9| 1|a⟩ + 2|b⟩ ⟩ + 8/27| 3|b⟩ ⟩.

But:

    DA =≪ mn[2](ω) = 1/9| 3|a⟩ ⟩ + 2/9| 2|a⟩ + 1|b⟩ ⟩ + 2/9| 1|a⟩ + 2|b⟩ ⟩ + 4/9| 3|b⟩ ⟩.

There is one more point that we like to address. Since mn[K](ω) is a distri-
bution, the sum ∑_ϕ mn[K](ω)(ϕ) over all draws ϕ equals one. But what if we
restrict this sum to draws ϕ of certain colours only, that is, with supp(ϕ) ⊆ S, for
a proper subset S ⊆ supp(ω)? And what if we then let the size K of these draws
go to infinity? The result below describes what happens. It turns out that
the same behaviour exists in the hypergeometric and Pólya cases, see Proposi-
tion 3.4.9.

Proposition 3.3.9. Let ω ∈ D(X) be given with a proper, non-empty subset
S ⊆ supp(ω), so S ≠ ∅ and S ≠ supp(ω). For K ∈ N, write:

    M_K ≔ ∑_{ϕ∈M[K](S)} mn[K](ω)(ϕ).

Then M_K > M_{K+1} and lim_{K→∞} M_K = 0.

Proof. Write r ≔ ∑_{x∈S} ω(x) in:

    M_K = ∑_{ϕ∈M[K](S)} mn[K](ω)(ϕ) = ∑_{ϕ∈M[K](S)} ( ϕ ) · ∏_{x∈S} ω(x)^{ϕ(x)} = r^K,

using the Multinomial Theorem (1.27). Since 0 < r < 1 we get M_K = r^K > r^{K+1} = M_{K+1}
and lim_{K→∞} M_K = lim_{K→∞} r^K = 0.

In this section we have assumed that an urn is given, namely as a distribution
ω on colours, and we have looked at probabilities of multiple draws ϕ from the
urn, as multisets. This will be turned around in the context of learning, see
Chapter ??. There we ask, among other things, the question: suppose that ϕ of
size K is given, as ‘data’; which distribution ω best fits ϕ, in the sense that it
maximises the multinomial probability mn[K](ω)(ϕ)? Perhaps unsurprisingly,
the answer is: the distribution Flrn(ϕ), obtained from ϕ via frequentist learning,
see Proposition ??.

Exercises
3.3.1 Let’s throw a fair die 12 times. What is the probability that each
      number appears twice? Show that it is 12!/72^6.

3.3.2 This exercise illustrates naturality of the multinomial channel, see
      Corollary 3.3.2. Take sets X = {a, b, c} and Y = 2 = {0, 1} with func-
      tion f : X → Y given by f(a) = f(b) = 0 and f(c) = 1. Consider the
      following distribution and multiset:

          ω = 1/4| a ⟩ + 1/2| b ⟩ + 1/4| c ⟩        ψ = 2| 0 ⟩ + 1| 1 ⟩.

      1 Show that:

            mn[3](D(f)(ω))(ψ) = 27/64.

      2 Check that:

            D(N[3](f))(mn[3](ω))(ψ) = mn[3](ω)(2| a ⟩ + 1| c ⟩)
                                      + mn[3](ω)(1| a ⟩ + 1| b ⟩ + 1| c ⟩)
                                      + mn[3](ω)(2| b ⟩ + 1| c ⟩)

        yields the same outcome 27/64.
3.3.3 Check that:

          mn[2](1/2| a ⟩ + 1/2| b ⟩) = 1/4| 2|a⟩ ⟩ + 1/2| 1|a⟩ + 1|b⟩ ⟩ + 1/4| 2|b⟩ ⟩.

      Conclude that the multinomial map does not preserve uniform distri-
      butions.
3.3.4 Check that:

          mn[1](ω) = ∑_{x∈supp(ω)} ω(x)| 1|x⟩ ⟩.

3.3.5 Use Exercise 1.6.3 to show that for ϕ ∈ N[K](X) and ψ ∈ N[L](X)
      one has:

          mn[K+L](ω)(ϕ + ψ) = [ (K+L choose K) / (ϕ+ψ choose ϕ) ] · mn[K](ω)(ϕ) · mn[L](ω)(ψ),

      and if ϕ, ψ have disjoint support, then:

          mn[K+L](ω)(ϕ + ψ) = (K+L choose K) · mn[K](ω)(ϕ) · mn[L](ω)(ψ).
3.3.6 The aim of this exercise is to prove recurrence relations for multino-
      mials: for each ω ∈ D(X) and ϕ ∈ N[K](X) with K > 0 one has:

          mn[K](ω)(ϕ) = ∑_{x∈supp(ϕ)} ω(x) · mn[K−1](ω)(ϕ − 1| x ⟩).

      Here are two possible avenues:
      1 use the recurrence relations (1.25) for multiset coefficients ( ϕ );
      2 show first:

            ω(x) · mn[K−1](ω)(ϕ − 1| x ⟩) = Flrn(ϕ)(x) · mn[K](ω)(ϕ).

      Follow up both avenues.


3.3.7 Prove the following items, in the style of Lemma 3.3.4, for a distribu-
      tion ω ∈ D(X) and a number K > 1.
      1 For two elements y ≠ z in X,

            ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y) · ϕ(z) = K · (K − 1) · ω(y) · ω(z).

      2 For a single element y ∈ X,

            ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y) · (ϕ(y) − 1) = K · (K − 1) · ω(y)^2.

      3 Again, for a single element y ∈ X,

            ∑_{ϕ∈N[K](X)} mn[K](ω)(ϕ) · ϕ(y)^2 = K · (K − 1) · ω(y)^2 + K · ω(y).

        Hint: Write a^2 = a · (a − 1) + a and then use the previous item and
        Lemma 3.3.4.
3.3.8 1 Prove the following multivariate generalisation of Exercise 2.4.3,
        demonstrating that not only binomials, but more generally, multi-
        nomials are closed under convolution (2.17): for K, L ∈ N, the composite

            D(X) ─∆→ D(X) × D(X) ─mn[K]⊗mn[L]→ N[K](X) × N[L](X) ─+→ N[K+L](X)

        equals mn[K+L]. On the right, the sum + is the usual sum of multisets,
        promoted to a channel +.
        Hint: Use Exercises 1.6.5 and 2.5.7.
      2 Conclude that K-sized draws can be reduced to parallel single draws,
        as in:

            D(X) ─∆→ D(X)^K ─mn[1]^K→ N[1](X)^K ─+→ N[K](X)   equals   mn[K].

        Notice that this is essentially Theorem 3.3.1 (2), via the isomor-
        phism M[1](X) ≅ X.
3.3.9 Show that for a natural multiset ϕ of size K one has:

          flat(mn[K](Flrn(ϕ))) = ϕ.

3.3.10 Check that multinomial channels do not commute with tensors, as in:

           D(X) × D(Y) ─mn[K]⊗mn[L]→ M[K](X) × M[L](Y)
                │                          │
                ⊗            ≠             ⊗
                ↓                          ↓
            D(X × Y) ───mn[K·L]──→ M[K·L](X × Y)

3.3.11 Check that the following diagram does not commute.

           N[K+1](X) × N[L+1](X) ─DD⊗DD→ N[K](X) × N[L](X)
                    │                          │
                    +            ≠             +
                    ↓                          ↓
              N[K+L+2](X) ──DD ◦· DD──→ N[K+L](X)

3.3.12 Check that the multiset distribution mltdst[K]_X from Exercise 2.1.7
       equals:

           mltdst[K]_X = mn[K](unif_X).
3.3.13 1 Check that accumulation does not commute with draw-and-delete,
         as in:

             X^{K+1} ─acc→ N[K+1](X)
                 │              │
                 π      ≠      DD
                 ↓              ↓
               X^K  ─acc→  N[K](X)

       2 Now define the probabilistic projection channel ppr : X^{K+1} → X^K
         by:

             ppr(x₁, . . . , x_{K+1}) ≔ ∑_{1≤k≤K+1} 1/(K+1) | x₁, . . . , x_{k−1}, x_{k+1}, . . . , x_{K+1} ⟩.   (3.11)

         Show that with ppr instead of π we do get a commuting diagram:

             X^{K+1} ─acc→ N[K+1](X)
                 │              │
                ppr            DD
                 ↓              ↓
               X^K  ─acc→  N[K](X)

       3 Conclude that we could have introduced DD via the definition prin-
         ciple of Proposition 3.1.1, as the unique channel satisfying
         DD ◦· acc = D(acc) ◦ ppr in:

             X^{K+1} ─acc↠ N[K+1](X)
                 │              │
                ppr            DD
                 ↓              ↓
             D(X^K) ─D(acc)→ D(N[K](X))

       4 Show that probabilistic projection also commutes with arrange-
         ment:

             X^{K+1} ←arr─ N[K+1](X)
                 │              │
                ppr            DD
                 ↓              ↓
               X^K  ←arr─  N[K](X)


3.3.14 Show that the probabilistic projection channel ppr from the previous
       exercise makes the following diagram commute.

           D(X)^{K+1} ─⊗→ X^{K+1}
                │             │
               ppr           ppr
                ↓             ↓
            D(X)^K   ─⊗→   X^K

3.3.15 1 Prove that ppr ◦· arr = π ◦· arr as channels N[K+1](X) → X^K, where
         arr : N[K+1](X) → X^{K+1} and ppr, π : X^{K+1} → X^K.
       2 Use this result to show that:

             ppr ◦· zip ◦· (arr ⊗ arr) = π ◦· zip ◦· (arr ⊗ arr),

         as channels N[K+1](X) × N[K+1](Y) → (X × Y)^K, via
         zip : X^{K+1} × Y^{K+1} → (X × Y)^{K+1}.
         Hint: Use Proposition 3.2.2 (4).
       3 Use the last item, (3.7), Exercises 2.2.6 and 3.3.13 (3), to show:

             N[K+1](X) × N[K+1](Y) ─DD⊗DD→ N[K](X) × N[K](Y)
                      │                         │
                    mzip                      mzip
                      ↓                         ↓
              N[K+1](X × Y)   ───DD──→   N[K](X × Y)

3.3.16 Recall the set D_∞(X) of discrete distributions from (2.8), with infinite
       (countable) support, see Exercise 2.1.11. For ρ ∈ D_∞(N_{>0}) and K ∈ N
       define:

           mn^∞[K](ρ) ≔ ∑_{ϕ∈N[K](N_{>0})} ( ϕ ) · ∏_{i∈supp(ϕ)} ρ(i)^{ϕ(i)} | ϕ ⟩.

       1 Show that this yields a distribution in D_∞(N[K](N_{>0})), i.e. that

             ∑_{ϕ∈N[K](N_{>0})} mn^∞[K](ρ)(ϕ) = 1.

       2 Check that DD ◦· mn^∞[K+1] = mn^∞[K], like in Proposition 3.3.8.

3.4 The hypergeometric and Pólya channels


We have seen that a multinomial channel assigns probabilities to draws, with
replacement. In addition, we have distinguished two draws without replace-
ment, namely the “-1” hypergeometric version where a drawn ball is actually
removed from the urn, and the “+1” Pólya version where the drawn ball is
returned to the urn together with an additional ball of the same colour. The
additional ball has a strengthening effect that can capture situations with a
cluster dynamics, like in the spread of contagious diseases [62] or the flow
of tourists [108]. This section describes the main properties of these “-1” and
“+1” draws. We have already briefly seen hypergeometric and Pólya distribu-
tions, namely in Example 2.1.1, in items (3) and (4).
Since drawn balls are removed or added in the hypergeometric and Pólya
modes, the urn in question changes with each draw. It is thus not a distribution,
like in the multinomial case, but a multiset, say of size L ∈ N. Draws are
described as multisets of size K — with restriction K ≤ L in hypergeometric
mode. The hypergeometric and Pólya channels thus take the form:

    hg[K], pl[K] : N[L](X) → D(N[K](X)).                                  (3.12)

We first repeat from Example 2.1.1 (3) the definition of the hypergeometric
channel. For a set X and natural numbers K ≤ L we have, for multiset/urn
ψ ∈ N[L](X),

    hg[K](ψ) ≔ ∑_{ϕ≤K ψ} [ (ψ choose ϕ) / (L choose K) ] | ϕ ⟩
             = ∑_{ϕ≤K ψ} [ ∏_x (ψ(x) choose ϕ(x)) / (L choose K) ] | ϕ ⟩.          (3.13)

Recall that we write ϕ ≤K ψ for: ∥ϕ∥ = K and ϕ ≤ ψ, see Definition 1.5.1 (2).
Lemma 1.6.2 shows that the probabilities in the above definition add up to one.
The Pólya channel resembles the above hypergeometric one, except that
multichoose coefficients are used instead of ordinary binomial coefficients:

    pl[K](ψ) ≔ ∑_{ϕ∈N[K](supp(ψ))} [ (ψ multichoose ϕ) / (L multichoose K) ] | ϕ ⟩
             = ∑_{ϕ∈N[K](supp(ψ))} [ ∏_{x∈supp(ψ)} (ψ(x) multichoose ϕ(x)) / (L multichoose K) ] | ϕ ⟩.   (3.14)

This yields a proper distribution by Proposition 1.7.3. In Section 3.5 we shall
see how this distribution arises from iterated drawing and addition.
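
As with the multinomial channel, the two definitions can be turned into a small computational sketch; it reuses multisets, Counter and Fraction from the Python snippet after (3.9), and the helper multichoose is an assumption of this sketch rather than a library function.

    # Sketch of hg[K] (3.13) and pl[K] (3.14); urns are Counters, draws are
    # encoded as sorted item tuples, as in the earlier multinomial sketch.
    from math import comb

    def multichoose(n, k):
        return comb(n + k - 1, k)

    def hg(K, urn):
        L = sum(urn.values())
        dist = {}
        for phi in multisets(urn, K):
            if any(phi[x] > urn[x] for x in phi):
                continue                        # only draws phi <= urn are possible
            p = Fraction(1, comb(L, K))
            for x in urn:
                p *= comb(urn[x], phi[x])
            dist[tuple(sorted(phi.items()))] = p
        return dist

    def pl(K, urn):
        L = sum(urn.values())
        dist = {}
        for phi in multisets(urn, K):           # draws over supp(urn)
            p = Fraction(1, multichoose(L, K))
            for x in urn:
                p *= multichoose(urn[x], phi[x])
            dist[tuple(sorted(phi.items()))] = p
        return dist

    urn = Counter({'a': 3, 'b': 1})
    print(hg(2, urn))   # 1/2 for 2|a>, 1/2 for 1|a>+1|b>
    print(pl(2, urn))   # 3/5 for 2|a>, 3/10 for 1|a>+1|b>, 1/10 for 2|b>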


Pólya distributions have their own dynamics, different from hypergeomet-


ric distributions. One commonly refers to this well-studied approach as Pólya
urns, see e.g. [115]. This section introduces the basics of such urns and the next
section shows how they come about via iteration of a transition map, followed
by marginalisation. Later on in Section ??, within the continuous context, the
Pólya channel shows up again, in relation to Dirichlet-multinomials and de
Finetti’s Theorem.
A subtle point in the above definitions (3.13) and (3.14) is that the draws ϕ
should be multisets over the support supp(ψ) of the urn ψ. Indeed, only balls
that occur in the urn can be drawn. In the hypergeometric case this is handled
via the requirement ϕ ≤K ψ, which ensures an inclusion supp(ϕ) ⊆ supp(ψ).
In the Pólya case we ensure this inclusion by requiring that draws ϕ are in
N[K](supp(ψ)). Notice also that the product ∏ in (3.14) is restricted to range
over x ∈ supp(ψ), so that ψ(x) > 0. This is needed since (n multichoose m) is
defined only for n > 0, see Definition 1.7.1 (2).
We start our analysis with hypergeometric channels. A basic fact is that
they can be described in as iterations of draw-and-delete channels from Defi-
nition 2.2.4. As we shall see soon afterwards, this result has many useful con-
sequences.

Theorem 3.4.1. For L, K ∈ N, the hypergeometric channel N[K + L](X) →
N[K](X) equals L consecutive draw-and-delete’s:

    hg[K] = DD ◦· · · · ◦· DD = DD^L  :  N[K+L](X) → N[K](X),
            (L times)

where the intermediate steps pass through N[K+L−1](X), . . . , N[K+1](X).

To emphasise, this result says that the draws can be obtained via iterated
draw-and-delete. The full picture emerges in Theorem 3.5.10 where we include
the urn.

Proof. Write ψ ∈ N[K+L](X) for the urn. The proof proceeds by induction
on the number of iterations L, starting with L = 0. Then ϕ ≤K ψ means ϕ = ψ.
Hence:

    hg[K](ψ) = ∑_{ϕ≤K ψ} [ (ψ choose ϕ) / (K+0 choose K) ] | ϕ ⟩ = 1| ψ ⟩ = DD^0(ψ) = DD^L(ψ).

For the induction step we use ψ ∈ N[K+(L+1)](X) = N[(K+1)+L](X) and
ϕ ≤K ψ. Then:

    DD^{L+1}(ψ)(ϕ) = (DD =≪ DD^L(ψ))(ϕ)
      = ∑_{χ∈N[K+1](X)} DD^L(ψ)(χ) · DD(χ)(ϕ)
      = ∑_{y∈X} DD^L(ψ)(ϕ + 1| y ⟩) · (ϕ(y) + 1)/(K + 1)
     (IH)
      = ∑_{y, ϕ(y)<ψ(y)} [ (ψ choose ϕ+1|y⟩) / (K+L+1 choose K+1) ] · (ϕ(y) + 1)/(K + 1)
      = ∑_{y, ϕ(y)<ψ(y)} [ (ψ(y) − ϕ(y)) · (ψ choose ϕ) ] / [ (L + 1) · (K+L+1 choose K) ]     by Exercise 1.7.6 (1),(2)
      = [ ((K + L + 1) − K) · (ψ choose ϕ) ] / [ (L + 1) · (K+L+1 choose K) ]
      = (ψ choose ϕ) / (K+L+1 choose K)
      = hg[K](ψ)(ϕ).
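
A small Python check of this theorem, in the style of the earlier snippets (illustrative names only): iterating a draw-and-delete step on a point mass reproduces the hypergeometric distribution.

    # Sketch: draw-and-delete DD on a multiset and the identity hg[K] = DD^L,
    # reusing `Counter` and `Fraction` from the earlier snippets.
    def dd(phi):
        """One draw-and-delete step on a multiset given as a Counter."""
        K = sum(phi.values())
        out = {}
        for x in phi:
            smaller = phi.copy()
            smaller[x] -= 1
            key = tuple(sorted((y, n) for y, n in smaller.items() if n > 0))
            out[key] = out.get(key, 0) + Fraction(phi[x], K)
        return out

    def dd_state(dist):
        """Push a distribution over multisets through DD (state transformation)."""
        out = {}
        for phi, p in dist.items():
            for psi, q in dd(Counter(dict(phi))).items():
                out[psi] = out.get(psi, 0) + p * q
        return out

    urn = Counter({'a': 3, 'b': 1})
    point = {tuple(sorted(urn.items())): Fraction(1)}
    print(dd_state(dd_state(point)))   # 1/2 for 2|a>, 1/2 for 1|a>+1|b>, i.e. hg(2, urn)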

From this result we can deduce many additional facts about hypergeometric
distributions.

Corollary 3.4.2. 1 Hypergeometric channels (3.12) are natural in X.
2 Frequentist learning from hypergeometric draws is like learning from the
  urn:

      Flrn ◦· hg[K] = Flrn  :  N[N](X) → X.

3 For L ≥ K one has:

      hg[K] ◦· DD = hg[K]  :  N[L+1](X) → N[K](X).

4 Also:

      DD ◦· hg[K+1] = hg[K]  :  N[L](X) → N[K](X).

5 Hypergeometric channels compose, as in:

      hg[K] ◦· hg[K+L] = hg[K]  :  N[K+L+M](X) → N[K](X).

6 Hypergeometric and multinomial channels commute, as in:

      hg[K] ◦· mn[K+L] = mn[K]  :  D(X) → N[K](X).

7 Hypergeometric channels commute with multizip:

      mzip ◦· (hg[K] ⊗ hg[K]) = hg[K] ◦· mzip
          :  N[K+L](X) × N[K+L](Y) → N[K](X × Y).

Proof. 1 By naturality of draw-and-delete, see Exercise 2.2.8.
2 By Theorem 2.3.3.
3 By Theorem 3.4.1.
4 Idem.
5 Similarly, since DD^{L+M} = DD^L ◦· DD^M.
6 By Proposition 3.3.8.
7 By Exercise 3.3.15 (3).

The distribution of small hypergeometric draws from a large urn looks like
a multinomial distribution. This is intuitively clear, but will be made precise
below.

Remark 3.4.3. When the urn from which we draw in hypergeometric mode is
very large and the draw involves only a small number of balls, the withdrawals
do not really affect the urn. Hence in this case the hypergeometric distribution
behaves like a multinomial distribution, where the urn (as distribution) is ob-
tained via frequentist learning. This is elaborated below, where the urn ψ, of size L,
is very large in comparison to the draw ϕ ≤K ψ.

    hg[K](ψ)(ϕ) = (ψ choose ϕ) / (L choose K)
                = K!/∏_x ϕ(x)! · ∏_x [ ψ(x)! / (ψ(x)−ϕ(x))! ] · (L−K)!/L!
                = ( ϕ ) · ∏_x [ ψ(x)·(ψ(x)−1)· . . . ·(ψ(x)−ϕ(x)+1) ] / [ L·(L−1)· . . . ·(L−K+1) ]
                ≈ ( ϕ ) · ∏_x ψ(x)^{ϕ(x)} / L^K            for big ψ
                = ( ϕ ) · ∏_x ( ψ(x)/L )^{ϕ(x)}
                = ( ϕ ) · ∏_x Flrn(ψ)(x)^{ϕ(x)}
                = mn[K](Flrn(ψ))(ϕ).

The essence of this observation is in Lemma 1.3.8.
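
The remark can be illustrated numerically with the earlier Python sketches (hg, mn, Counter, Fraction); the numbers below are only meant to show how close the two probabilities are for a large urn.

    # Sketch illustrating Remark 3.4.3: hypergeometric probabilities for a large
    # urn are close to multinomial probabilities at Flrn(urn).
    big_urn = Counter({'a': 3000, 'b': 6000, 'c': 1000})
    flrn = {x: Fraction(n, 10000) for x, n in big_urn.items()}
    draw = (('a', 1), ('b', 2))                 # the draw 1|a> + 2|b>
    print(float(hg(3, big_urn)[draw]))          # approximately 0.324
    print(float(mn(3, flrn)[draw]))             # exactly 3 * 0.3 * 0.6^2 = 0.324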
We continue with Pólya distributions. Since hypergeometric draw channels
arise from repeated draw-and-delete’s one may expect that Pólya draw chan-
nels arise from repeated draw-and-add’s. This is not the case. The connection
will be clarified further in the next section. At this stage our first point below
is that frequentist learning from its draws is the same as learning from the urn.
We shall first prove the following analogues of Lemma 3.3.4.
Lemma 3.4.4. For a non-empty urn ψ ∈ N(X) and a fixed element y ∈ X,

    ∑_{ϕ≤K ψ} hg[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y),                        (3.15)

and also:

    ∑_{ϕ∈M[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(y) = K · Flrn(ψ)(y).              (3.16)

Proof. We use Exercises 1.7.5 and 1.7.6 in the marked equations (∗) below. We
start with equation (3.15), for which we assume K ≤ L, where L = ∥ψ∥.

    ∑_{ϕ≤K ψ} hg[K](ψ)(ϕ) · ϕ(y) = ∑_{ϕ≤K ψ} (ψ choose ϕ) · ϕ(y) / (L choose K)
     (∗)
      = ∑_{ϕ≤K ψ} ψ(y) · (ψ−1|y⟩ choose ϕ−1|y⟩) / [ (L/K) · (L−1 choose K−1) ]
      = K · ψ(y)/L · ∑_{ϕ−1|y⟩ ≤K−1 ψ−1|y⟩} (ψ−1|y⟩ choose ϕ−1|y⟩) / (L−1 choose K−1)
      = K · Flrn(ψ)(y).

Similarly we obtain (3.16) via the additional use of Exercise 1.7.5.

    ∑_{ϕ∈M[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(y)
      = ∑_{ϕ∈M[K](supp(ψ))} (ψ multichoose ϕ) · ϕ(y) / (L multichoose K)
     (∗)
      = ∑_{ϕ∈M[K](supp(ψ))} ψ(y) · (ψ+1|y⟩ multichoose ϕ−1|y⟩) / [ (L/K) · (L+1 multichoose K−1) ]
      = K · ψ(y)/L · ∑_{ϕ∈M[K−1](supp(ψ+1|y⟩))} (ψ+1|y⟩ multichoose ϕ) / (L+1 multichoose K−1)
      = K · Flrn(ψ)(y).

Proposition 3.4.5. Let K > 0.

1 Frequentist learning and Pólya satisfy the following equation:

      Flrn ◦· pl[K] = Flrn  :  N∗(X) → X.

2 Doing a draw-and-add before Pólya has no effect: for L > 0,

      pl[K] ◦· DA = pl[K]  :  N[L](X) → N[K](X).

This second point is the Pólya analogue of Corollary 3.4.2 (3), with draw-
and-add instead of draw-and-delete.

Proof. 1 For a multiset/urn ψ ∈ N(X) with ∥ψ∥ = L > 0, and for x ∈ X,

    (Flrn =≪ pl[K](ψ))(x) = ∑_{ϕ∈N[K](supp(ψ))} Flrn(ϕ)(x) · pl[K](ψ)(ϕ)
                          = 1/K · ∑_{ϕ∈N[K](supp(ψ))} ϕ(x) · pl[K](ψ)(ϕ)
                          = Flrn(ψ)(x)                            by (3.16).

2 We use Exercises 1.7.5 and 1.7.6 in the marked equation (∗) below. For an urn
ψ ∈ N[L](X) and draw ϕ ∈ N[K](X),

    (pl[K] ◦· DA)(ψ)(ϕ) = ∑_{x∈supp(ψ)} ψ(x)/L · pl[K](ψ + 1| x ⟩)(ϕ)
        = ∑_{x∈supp(ψ)} ψ(x)/L · (ψ+1|x⟩ multichoose ϕ) / (L+1 multichoose K)
       (∗)
        = ∑_{x∈supp(ψ)} (ψ(x) + ϕ(x))/(L + K) · (ψ multichoose ϕ) / (L multichoose K)
        = (ψ multichoose ϕ) / (L multichoose K)
        = pl[K](ψ)(ϕ).
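
Item (2) can be confirmed numerically with the Python sketch for pl[K]; the function da below is an illustrative implementation of the draw-and-add channel, not taken from the book.

    # Sketch: pl[K] after DA equals pl[K] (Proposition 3.4.5 (2)), reusing
    # `pl`, `Counter` and `Fraction` from the earlier snippets.
    def da(phi):
        """One draw-and-add step: draw a ball and return it with an extra copy."""
        L = sum(phi.values())
        out = {}
        for x in phi:
            bigger = phi.copy()
            bigger[x] += 1
            out[tuple(sorted(bigger.items()))] = Fraction(phi[x], L)
        return out

    urn = Counter({'a': 3, 'b': 1})
    lhs = {}
    for psi, p in da(urn).items():              # first DA, then pl[2]
        for phi, q in pl(2, Counter(dict(psi))).items():
            lhs[phi] = lhs.get(phi, 0) + p * q
    print(lhs == pl(2, urn))                    # True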

Next we look at the interaction of the Pólya channel with draw-and-delete,
like in Corollary 3.4.2 (4), and then also with hypergeometric channels.
The next point illustrates a relationship between Pólya and the draw-and-add
channel DA from Definition 2.2.4.

Theorem 3.4.6. For each K and L ≥ K the triangles below commute:

    DD ◦· pl[K+1] = pl[K]          hg[K] ◦· pl[L] = pl[K],

both as channels N∗(X) → N[K](X).

Proof. We concentrate on commutation of the triangle on the left, since it
implies commutation of the one on the right by Theorem 3.4.1. The marked
equation (∗) below indicates the use of Exercises 1.7.5 and 1.7.6. For an urn ψ ∈
N[L](X) of size L > 0, and for ϕ ∈ N[K](X),

    (DD ◦· pl[K+1])(ψ)(ϕ) = ∑_{χ∈N[K+1](supp(ψ))} DD(χ)(ϕ) · pl[K+1](ψ)(χ)
      = ∑_{x∈supp(ψ)} (ϕ(x)+1)/(K+1) · pl[K+1](ψ)(ϕ + 1| x ⟩)
      = ∑_{x∈supp(ψ)} (ϕ(x)+1)/(K+1) · (ψ multichoose ϕ+1|x⟩) / (L multichoose K+1)
     (∗)
      = ∑_{x∈supp(ψ)} (ψ(x) + ϕ(x))/(L + K) · (ψ multichoose ϕ) / (L multichoose K)
      = pl[K](ψ)(ϕ).


Later on we shall see that the Pólya channel factors through the multinomial
channel, see Exercise 6.4.5 and ??. The Pólya channel does not commute with
multizip.
Earlier in Proposition 3.3.7 we have seen an ‘average’ result for multinomi-
als. There are similar results for hypergeometric and Pólya distributions.

Proposition 3.4.7. Let ψ ∈ N(X) be an urn of size ∥ψ∥ = L ≥ 1.

1 For L ≥ K ≥ 1,

      flat(hg[K](ψ)) = ∑_{ϕ≤K ψ} hg[K](ψ)(ϕ) · ϕ = (K/L) · ψ = K · Flrn(ψ).

2 Similarly, for K ≥ 1,

      flat(pl[K](ψ)) = ∑_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ = (K/L) · ψ = K · Flrn(ψ).

Proof. In both cases we rely on Lemma 3.4.4. For the first item we get:

    ∑_{ϕ≤K ψ} hg[K](ψ)(ϕ) · ϕ = ∑_{x∈X} ( ∑_{ϕ≤K ψ} hg[K](ψ)(ϕ) · ϕ(x) ) | x ⟩
                              = ∑_{x∈X} K · Flrn(ψ)(x) | x ⟩             by (3.15)
                              = K · Flrn(ψ).

In the second, Pólya case we proceed analogously:

    ∑_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ
        = ∑_{x∈X} ( ∑_{ϕ∈N[K](supp(ψ))} pl[K](ψ)(ϕ) · ϕ(x) ) | x ⟩
        = ∑_{x∈X} K · Flrn(ψ)(x) | x ⟩                                   by (3.16)
        = K · Flrn(ψ).

There is another point of analogy with multinomial distributions that we
wish to elaborate, namely the (limit) behaviour when balls of specific colours
only are drawn, see Proposition 3.3.9. In the hypergeometric case it does not
make sense to look at limit behaviour since after a certain number of steps
the urn is empty. But in the Pólya case one can continue drawing indefinitely,
which makes it non-trivial to see what happens in the limit. We use the following
auxiliary result.¹

¹ With thanks to Bas and Bram Westerbaan for help.


Lemma 3.4.8. Let 0 < N < M be given. Define for n ∈ N,

    a_n ≔ ∏_{i<n} (N + i)/(M + i).

Then: lim_{n→∞} a_n = 0.

Proof. We switch to the (natural) logarithm ln and prove the equivalent state-
ment lim_{n→∞} ln(a_n) = −∞. We use that the logarithm turns products into sums,
see Exercise 1.2.2, and that the derivative of ln(x) is 1/x. Then:

    ln(a_n) = ∑_{i<n} ln( (N+i)/(M+i) ) = ∑_{i<n} ln(N + i) − ln(M + i)
            = ∑_{i<n} − ∫_{N+i}^{M+i} 1/x dx
        (∗)
            ≤ ∑_{i<n} − [ (M + i) − (N + i) ] / (M + i)
            = (N − M) · ∑_{i<n} 1/(M + i).

It is well-known that the harmonic series ∑_{n>0} 1/n is infinite. Since M > N the
above sequence ln(a_n) thus goes to −∞.
The validity of the marked inequality ≤ follows from an inspection of the
graph of the function 1/x: the integral from N + i to M + i is the surface under 1/x
between the points N + i < M + i. Since 1/x is a decreasing function, this surface
is bigger than the rectangle with height 1/(M+i) and length (M + i) − (N + i).

Proposition 3.4.9. Consider an urn ψ ∈ N[L](X) with a proper non-empty
subset S ⊆ supp(ψ).

1 Write for K ≤ L,

      H_K ≔ ∑_{ϕ∈N[K](S), ϕ≤ψ} hg[K](ψ)(ϕ).

  Then H_K > H_{K+1}; this stops at K = L, when the urn is empty.

2 Now write, for arbitrary K ∈ N,

      P_K ≔ ∑_{ϕ∈N[K](S)} pl[K](ψ)(ϕ).

  Then P_K > P_{K+1} and lim_{K→∞} P_K = 0.


Proof. We write L_S ≔ ∑_{x∈S} ψ(x) for the number of balls in the urn whose
colour is in S.

1 By separating S and its complement ¬S we can write, via Vandermonde’s
  formula from Lemma 1.6.2,

      H_K = ∑_{ϕ∈N[K](S), ϕ≤ψ} [ ∏_{x∈S} (ψ(x) choose ϕ(x)) · ∏_{x∉S} (ψ(x) choose 0) ] / (L choose K)
          = (L_S choose K) / (L choose K)
          = L_S!/L! · (L−K)!/(L_S−K)!.

  Using a similar description for H_{K+1} we get:

      H_K > H_{K+1} ⟺ L_S!/L! · (L−K)!/(L_S−K)! > L_S!/L! · (L−(K+1))!/(L_S−(K+1))!
                   ⟺ (L−K)/(L_S−K) > 1.

  The latter holds because L > L_S, since S is a proper subset of supp(ψ).

2 In the Pólya case we get, as before, but now via the Vandermonde formula
  for multichoose, see Proposition 1.7.3,

      P_K = (L_S multichoose K) / (L multichoose K) = (L−1)!/(L_S−1)! · (L_S+K−1)!/(L+K−1)!.

  We define:

      a_K ≔ P_{K+1}/P_K = (L−1)!/(L_S−1)! · (L_S+K)!/(L+K)! · (L_S−1)!/(L−1)! · (L+K−1)!/(L_S+K−1)!
                        = (L_S+K)/(L+K) < 1,     since L_S < L.

  Thus P_{K+1} = a_K · P_K < P_K and also:

      P_K = a_{K−1} · P_{K−1} = a_{K−1} · a_{K−2} · P_{K−2} = · · · = a_{K−1} · a_{K−2} · . . . · a_0 · P_0.

  Our goal is to prove lim_{K→∞} P_K = 0. This follows from lim_{K→∞} ∏_{i<K} a_i = 0, which
  we obtain from Lemma 3.4.8.
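
A quick numerical illustration of item (2), using the pl sketch from earlier: for the urn 2|a⟩ + 1|b⟩ the probability of a Pólya draw containing only colour a works out as 2/(K+2), which indeed decreases and tends to 0.

    # Sketch checking Proposition 3.4.9 (2) on a small example, reusing `pl`
    # and `Counter` from the earlier snippets.
    urn = Counter({'a': 2, 'b': 1})
    for K in (1, 2, 4, 8, 16):
        PK = sum(p for phi, p in pl(K, urn).items()
                 if all(x == 'a' for x, _ in phi))
        print(K, float(PK))        # strictly decreasing, tending to 0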

Exercises
3.4.1 Consider an urn with 5 red, 6 green, and 8 blue balls. Suppose 6 balls
      are drawn from the urn, resulting in two balls of each colour.
      1 Describe both the urn and the draw as a multiset.
      Show that the probability of the six-ball draw is:
      2 5184000/47045881 ≈ 0.11, when balls are drawn one by one and are
        replaced before each next single-ball draw;
      3 50/323 ≈ 0.15, when the drawn balls are deleted;
      4 405/4807 ≈ 0.08, when the drawn balls are replaced and each time an
        extra ball of the same colour is added.
3.4.2 Draw-delete preserves hypergeometric and Pólya distributions, see
      Corollary 3.4.2 (4) and Theorem 3.4.6. Check that draw-add does
      not preserve hypergeometric or Pólya distributions, for instance by
      checking that for ϕ = 3| a ⟩ + 1| b ⟩,

          hg[2](ϕ) = 1/2| 2|a⟩ ⟩ + 1/2| 1|a⟩ + 1|b⟩ ⟩
          hg[3](ϕ) = 1/4| 3|a⟩ ⟩ + 3/4| 2|a⟩ + 1|b⟩ ⟩
          DA =≪ hg[2](ϕ) = 1/2| 3|a⟩ ⟩ + 1/4| 2|a⟩ + 1|b⟩ ⟩ + 1/4| 1|a⟩ + 2|b⟩ ⟩,

      and:

          pl[2](ϕ) = 3/5| 2|a⟩ ⟩ + 3/10| 1|a⟩ + 1|b⟩ ⟩ + 1/10| 2|b⟩ ⟩
          pl[3](ϕ) = 1/2| 3|a⟩ ⟩ + 3/10| 2|a⟩ + 1|b⟩ ⟩ + 3/20| 1|a⟩ + 2|b⟩ ⟩ + 1/20| 3|b⟩ ⟩
          DA =≪ pl[2](ϕ) = 3/5| 3|a⟩ ⟩ + 3/20| 2|a⟩ + 1|b⟩ ⟩ + 3/20| 1|a⟩ + 2|b⟩ ⟩ + 1/10| 3|b⟩ ⟩.

3.4.3 Let X be a finite set, say of size n. Write 1 = ∑_{x∈X} 1| x ⟩ for the multiset
      of single occurrences of elements of X. Show that for each K ≥ 0,

          pl[K](1) = unif_{N[K](X)}.

3.4.4 1 Give a direct proof of Theorem 3.4.1 for L = 1.
      2 Elaborate also the case K = 1 in Theorem 3.4.1 and show that in
        that case DD^L(ψ) = Flrn(ψ), at least when we identify the (isomor-
        phic) sets N[1](X) and X.
3.4.5 Let unif_L ∈ D(N[L](X)) be the uniform distribution on multisets of
      size L, from Exercise 2.2.9, where X is a finite set. Show that hg[K] =≪
      unif_L = unif_K, for K ≤ L.
3.4.6 Recall from Remark 3.4.3 that small hypergeometric draws from a
      large urn can be described multinomially. Extend this in the following
      way. Let σ ∈ D(N[L](X)) be a distribution of large urns. Argue that
      for small numbers K,

          hg[K] =≪ σ ≈ mn[K] =≪ D(Flrn)(σ).

3.4.7 Prove in analogy with Exercise 3.3.7, for an urn ψ ∈ N[L](X) and
      elements y, z ∈ X, the following points.
      1 When y ≠ z and K ≤ L,

            ∑_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y) · ϕ(z) = K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L−1).

      2 When K ≤ L,

            ∑_{ϕ∈N[K](X)} hg[K](ψ)(ϕ) · ϕ(y)^2 = K · Flrn(ψ)(y) · [ (K−1) · ψ(y) + (L−K) ] / (L−1).

      3 And in the Pólya case, when y ≠ z,

            ∑_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y) · ϕ(z) = K · (K−1) · Flrn(ψ)(y) · ψ(z)/(L+1).

      4 Finally,

            ∑_{ϕ∈N[K](X)} pl[K](ψ)(ϕ) · ϕ(y)^2 = K · Flrn(ψ)(y) · [ (K−1) · ψ(y) + (L+K) ] / (L+1).
3.4.8 Use Theorem 3.4.1 to prove the following two recurrence relations
      for hypergeometric distributions.

          hg[K](ψ) = ∑_{x∈supp(ψ)} Flrn(ψ)(x) · hg[K−1](ψ − 1| x ⟩)

          hg[K](ψ)(ϕ) = ∑_x (ϕ(x) + 1)/(K + 1) · hg[K+1](ψ)(ϕ + 1| x ⟩)
3.4.9 Fix numbers N, M ∈ N and write ψ = N| 0 ⟩ + M| 1 ⟩ for an urn with N
      balls of colour 0 and M of colour 1. Let n ≤ N. Show that:

          ∑_{0≤m≤M} hg[n+m](ψ)(n| 0 ⟩ + m| 1 ⟩) = (N+M+1)/(N+1).

      Hint: Use Exercise 1.7.3. Notice that the right-hand side does not
      depend on n.
3.4.10 This exercise elaborates that draws from an urn excluding one partic-
       ular colour can be expressed in binary form. This works for all three
       modes of drawing: multinomial, hypergeometric, and Pólya.
       Let X be a set with at least two elements, and let x ∈ X be an arbi-
       trary but fixed element. We write x⊥ for an element not in X. Assume
       k ≤ K.
       1 For ω ∈ D(X) with x ∈ supp(ω), use the Multinomial Theo-
         rem (1.27) to show that:

             ∑_{ϕ∈N[K−k](X−x)} mn[K](ω)(k| x ⟩ + ϕ)
                 = mn[K](ω(x)| x ⟩ + (1 − ω(x))| x⊥ ⟩)(k| x ⟩ + (K − k)| x⊥ ⟩)
                 = bn[K](ω(x))(k).

       2 Prove, via Vandermonde’s formula, that for an urn ψ of size
         L ≥ K one has:

             ∑_{ϕ≤K−k ψ−ψ(x)|x⟩} hg[K](ψ)(k| x ⟩ + ϕ)
                 = hg[K](ψ(x)| x ⟩ + (L − ψ(x))| x⊥ ⟩)(k| x ⟩ + (K − k)| x⊥ ⟩).

       3 Show again for an urn ψ,

             ∑_{ϕ∈N[K−k](supp(ψ)−x)} pl[K](ψ)(k| x ⟩ + ϕ)
                 = pl[K](ψ(x)| x ⟩ + (L − ψ(x))| x⊥ ⟩)(k| x ⟩ + (K − k)| x⊥ ⟩).

3.5 Iterated drawing from an urn


This section deals with the iterative aspects of drawing from an urn. It makes
the informal descriptions in the introduction to this chapter precise, for six
different forms of drawing: in “-1” or “0” or “+1” mode, and with ordered
or unordered draws. The iteration involved will be formalised via a special
monad, combining the multiset and the writer monad. This infrastructure for
iteration will be set up first. Subsequently, ordered and unordered draws will
be described in separate subsections.
We recall the transition operation (3.2) from the beginning of this chapter:

    Urn ──◦──> ( multiple-colours, Urn′ ),

mapping an urn to a distribution over pairs of multiple draws and remaining
urn. The actual mapping from urns to draws would then arise as first marginal
of this transition channel. Concretely, the above transition map, used for de-
scribing our intuition, takes one of the following six forms.

    mode     ordered                               unordered

    0        Ot_0 : D(X) → L(X) × D(X)            Ut_0 : D(X) → N(X) × D(X)
    -1       Ot_− : N∗(X) → L(X) × N(X)           Ut_− : N∗(X) → N(X) × N(X)        (3.17)
    +1       Ot_+ : N∗(X) → L(X) × N∗(X)          Ut_+ : N∗(X) → N(X) × N∗(X)

Notice that the draw maps in Table 3.3 in the introduction are written as chan-
nels U_0 : D(X) → N[K](X), where the above table contains the corresponding
transition maps, with an extra ‘t’ in the name, as in Ut_0 : D(X) → N(X) ×
D(X).
In each case, the drawn elements are accumulated in the left product-com-
ponent of the codomain of the transition map. They are organised as lists, in
L(X), in the ordered case, and as multisets, in N(X) in the unordered case.
As we have seen before, in the scenarios with deletion/addition the urn is a
multiset, but with replacement it is a distribution.
The crucial observation is that the list and multiset data types that we use for
accumulating drawn elements are monoids, as we have observed early on, in
Lemmas 1.2.2 and 1.4.2. In general, for a monoid M, the mapping X 7→ M × X
is a monad, called the writer monad, see Lemma 1.9.1. It turns out that the
combination of the writer monad with the distribution monad D is again a
monad. This forms the basis for iterating transitions, via Kleisli composition of
this combined monad. We relegate the details of the monad’s flatten operation
to Exercise 3.5.6 below, and concentrate on the unit and Kleisli composition
involved.

Lemma 3.5.1. Let M = (M, 0, +) be a monoid. The mapping X ↦ D(M × X)
is a monad on Sets, with unit : X → D(M × X) given by:

    unit(x) = 1| 0, x ⟩,    where 0 ∈ M is the zero element.

For Kleisli maps f : A → D(M × B) and g : B → D(M × C) there is the Kleisli
composition g ◦· f : A → D(M × C) given by:

    (g ◦· f)(a) = ∑_{m,m′,c} ( ∑_b f(a)(m, b) · g(b)(m′, c) ) | m + m′, c ⟩.      (3.18)

Notice the occurrence of the sum + of the monoid M in the first component
of the ket | −, − ⟩ in (3.18). When M is the list monoid, this sum is the (non-
commutative) concatenation ++ of lists, producing an ordered list of drawn
elements. When M is the multiset monoid, this sum is the (commutative) +
of multisets, so that the accumulation of drawn elements yields a multiset, in
which the order of elements is irrelevant.

    Ot_0 : D(X) → D(L(X) × D(X)),    Ot_0(ω) = ∑_{x∈supp(ω)} ω(x) | [x], ω ⟩
    Ut_0 : D(X) → D(N(X) × D(X)),    Ut_0(ω) = ∑_{x∈supp(ω)} ω(x) | 1|x⟩, ω ⟩

    Ot_− : N∗(X) → D(L(X) × N(X)),   Ot_−(ψ) = ∑_{x∈supp(ψ)} ψ(x)/∥ψ∥ | [x], ψ − 1|x⟩ ⟩
    Ut_− : N∗(X) → D(N(X) × N(X)),   Ut_−(ψ) = ∑_{x∈supp(ψ)} ψ(x)/∥ψ∥ | 1|x⟩, ψ − 1|x⟩ ⟩

    Ot_+ : N∗(X) → D(L(X) × N∗(X)),  Ot_+(ψ) = ∑_{x∈supp(ψ)} ψ(x)/∥ψ∥ | [x], ψ + 1|x⟩ ⟩
    Ut_+ : N∗(X) → D(N(X) × N∗(X)),  Ut_+(ψ) = ∑_{x∈supp(ψ)} ψ(x)/∥ψ∥ | 1|x⟩, ψ + 1|x⟩ ⟩

Figure 3.1 Definitions of the six transition channels in Table (3.17) for draw-
ing a single element from an urn. In the “ordered” column on the left the list
monoid L(X) is used, whereas in the “unordered” column the (commutative) mul-
tiset monoid N(X) occurs. In the first row the urn (as distribution ω) remains
unchanged, whereas in the second (resp. third) row the drawn element x is
removed from (resp. added to) the urn ψ. Implicitly it is assumed that the
multiset ψ is non-empty.

If we have an ‘endo’ Kleisli map for the combined monad of Lemma 3.5.1,
of the form t : A → D(M × A), we can iterate it K times, giving t^K : A →
D(M × A). This iteration is defined via the above unit and Kleisli composition:

    t^0 = unit    and    t^{K+1} = t^K ◦· t = t ◦· t^K.
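
For readers who like to experiment, here is a minimal Python sketch of this combined monad for the multiset monoid: states are dictionaries mapping pairs of an accumulated draw and an urn to probabilities. The names unit, kleisli and iterate are illustrative, not notation from the text.

    # Sketch of the monad D(M x -) of Lemma 3.5.1 for the multiset monoid,
    # with multisets encoded as sorted tuples of drawn elements.
    from fractions import Fraction

    def unit(a):
        return {((), a): Fraction(1)}

    def kleisli(f, g):
        """Kleisli composite g after f, as in (3.18); monoid parts are added."""
        def composite(a):
            out = {}
            for (m, b), p in f(a).items():
                for (m2, c), q in g(b).items():
                    key = (tuple(sorted(m + m2)), c)
                    out[key] = out.get(key, 0) + p * q
            return out
        return composite

    def iterate(t, K):
        """The K-fold iterate t^K, starting from unit."""
        out = unit
        for _ in range(K):
            out = kleisli(out, t)
        return out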

Now that we know how to iterate, we need the actual transition maps that can
be iterated, that is, we need concrete definitions of the transition channels in
Table (3.17). They are given in Figure 3.1. In the subsections below we analyse
what iteration means for these six channels. Subsequently, we can describe the
associated K-sized draw channels, as in Table 3.3, as first projection π1 ◦· t K ,
going from urns to drawn elements.

3.5.1 Ordered draws from an urn

We start by looking at the upper two ‘ordered’ transition channels in the column on
the left in Figure 3.1. Towards a general formula for their iteration, let’s look
first at the easiest case, namely the ordered transition Ot_0 : D(X) → D(L(X) ×
D(X)) with replacement. By definition we have as first iteration:

    Ot_0^1(ω) = Ot_0(ω) = ∑_{x₁∈supp(ω)} ω(x₁) | [x₁], ω ⟩.
Accumulation of drawn elements in the first coordinate of | −, − ⟩ starts in the
second iteration:

    Ot_0^2(ω) = (Ot_0 ◦· Ot_0)(ω)
              = ∑_{ℓ∈L(X), x₁∈supp(ω)} ω(x₁) · Ot_0(ω)(ℓ, ω) | [x₁] ++ ℓ, ω ⟩
              = ∑_{x₁,x₂∈supp(ω)} ω(x₁) · ω(x₂) | [x₁] ++ [x₂], ω ⟩
              = ∑_{x₁,x₂∈supp(ω)} (ω ⊗ ω)(x₁, x₂) | [x₁, x₂], ω ⟩.

The formula for subsequent iterations is beginning to appear.

Theorem 3.5.2. Consider in Figure 3.1 the ordered-transition-with-replace-
ment channel Ot_0 : D(X) → L(X) × D(X), with distribution ω ∈ D(X).

1 Iterating K ∈ N times yields:

      Ot_0^K(ω) = ∑_{~x∈X^K} ω^K(~x) | ~x, ω ⟩.

2 The associated K-draw channel O_0[K] ≔ π₁ ◦· Ot_0^K : D(X) → X^K satisfies

      O_0[K](ω) = ω^K = iid[K](ω),

  where iid is the identical and independent channel from (2.20).

The situation for ordered transition with deletion is less straightforward. We
look at two iterations explicitly, starting from a multiset ψ ∈ N(X).

    Ot_−^1(ψ) = ∑_{x₁∈supp(ψ)} ψ(x₁)/∥ψ∥ | x₁, ψ − 1|x₁⟩ ⟩

    Ot_−^2(ψ) = (Ot_− ◦· Ot_−)(ψ)
              = ∑_{x₁∈supp(ψ), x₂∈supp(ψ−1|x₁⟩)} ψ(x₁)/∥ψ∥ · (ψ−1|x₁⟩)(x₂)/(∥ψ∥−1) | (x₁, x₂), ψ − 1|x₁⟩ − 1|x₂⟩ ⟩.

Etcetera. We first collect some basic observations in an auxiliary result.

Lemma 3.5.3. Let ψ ∈ N[L](X) be a multiset / urn of size L = ∥ψ∥.

1 Iterating K ≤ L times satisfies:

      Ot_−^K(ψ) = ∑_{~x∈X^K, acc(~x)≤ψ} ∏_{0≤i<K} [ (ψ − acc(x₁, . . . , xᵢ))(x_{i+1}) / (L − i) ] | ~x, ψ − acc(~x) ⟩.

2 For ~x ∈ X^K write ϕ = acc(~x). Then:

      ∏_{0≤i<K} (ψ − acc(x₁, . . . , xᵢ))(x_{i+1}) = ∏_y ψ(y)! / (ψ(y) − ϕ(y))! = ψ! / (ψ − ϕ)!.

  The right-hand side is thus independent of the sequence ~x.

This independence means that any order of the elements of the same multiset
of balls gets the same (draw) probability. This is not entirely trivial.

Proof. 1 Directly from the definition of the transition channel Ot_−, using
Kleisli composition (3.18).
2 Write ϕ = acc(~x) as ϕ = ∑_j n_j | y_j ⟩. Then each element y_j ∈ X occurs n_j
times in the sequence ~x. The product

    ∏_{0≤i<K} (ψ − acc(x₁, . . . , xᵢ))(x_{i+1})

does not depend on the order of the elements in ~x: each element y_j occurs n_j
times in this product, with multiplicities ψ(y_j), . . . , ψ(y_j) − n_j + 1, indepen-
dently of the exact occurrences of the y_j in ~x. Thus:

    ∏_{0≤i<K} (ψ − acc(x₁, . . . , xᵢ))(x_{i+1}) = ∏_j ψ(y_j) · . . . · (ψ(y_j) − n_j + 1)
        = ∏_j ψ(y_j) · . . . · (ψ(y_j) − ϕ(y_j) + 1)
        = ∏_j ψ(y_j)! / (ψ(y_j) − ϕ(y_j))!
        = ∏_{y∈X} ψ(y)! / (ψ(y) − ϕ(y))!.

We can extend the product over j to a product over all y ∈ X since if y ∉
supp(ϕ), then, even if ψ(y) = 0,

    ψ(y)! / (ψ(y) − ϕ(y))! = ψ(y)! / ψ(y)! = 1.
Theorem 3.5.4. Consider the ordered-transition-with-deletion channel Ot_−
on ψ ∈ N[L](X).

1 For K ≤ L,

      Ot_−^K(ψ) = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ ( ψ − ϕ ) / ( ψ ) ] | ~x, ψ − ϕ ⟩.

2 The associated K-draw channel O_−[K] ≔ π₁ ◦· Ot_−^K : N[L](X) → X^K satis-
  fies:

      O_−[K](ψ) = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ ( ψ − ϕ ) / ( ψ ) ] | ~x ⟩
                = ∑_{~x∈X^K, acc(~x)≤ψ} [ ( ψ − acc(~x) ) / ( ψ ) ] | ~x ⟩.

Proof. 1 By combining the two items of Lemma 3.5.3 and using:

    ∏_{0≤i<K} (L − i) = L · (L − 1) · . . . · (L − K + 1) = L! / (L − K)!,

we get:

    Ot_−^K(ψ) = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} (L−K)!/L! · ∏_y ψ(y)!/(ψ(y)−ϕ(y))! | ~x, ψ − ϕ ⟩
              = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ (L−K)! / ∏_y (ψ(y)−ϕ(y))! ] · [ ∏_y ψ(y)! / L! ] | ~x, ψ − ϕ ⟩
              = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ ( ψ − ϕ ) / ( ψ ) ] | ~x, ψ − ϕ ⟩.

2 Directly by the previous item.

We now move from deletion to addition, that is, from Ot − to Ot + , still in the
ordered case. The analysis is very much as in Lemma 3.5.3.

Lemma 3.5.5. Let ψ ∈ N∗(X) be a non-empty multiset.

1 Iterating K times gives:

      Ot_+^K(ψ) = ∑_{~x∈supp(ψ)^K} ∏_{0≤i<K} [ (ψ + acc(x₁, . . . , xᵢ))(x_{i+1}) / (∥ψ∥ + i) ] | ~x, ψ + acc(~x) ⟩.

2 For ~x ∈ supp(ψ)^K with ϕ = acc(~x),

      ∏_{0≤i<K} (ψ + acc(x₁, . . . , xᵢ))(x_{i+1}) = ∏_{y∈supp(ψ)} (ψ(y) + ϕ(y) − 1)! / (ψ(y) − 1)!.

  This expression on the right is independent of the sequence ~x.


Proof. The first item is easy, so we concentrate on the second, in line with the
proof of Lemma 3.5.3. If ϕ = acc(~x) = j n j | y j i we now have:
P

Y   Y
ψ + acc(x1 , . . . , xi ) (xi+1 ) = ψ(y j ) · . . . · (ψ(y j ) + ϕ(y j ) − 1)
j
0≤i<K
Y (ψ(y j ) + ϕ(y j ) − 1)!
=
j (ψ(y j ) − 1)!
Y (ψ(y) + ϕ(y) − 1)!
= .
y∈supp(ψ)
(ψ(y) − 1)!

Theorem 3.5.6. Let ψ ∈ N∗(X) be a non-empty multiset and let K ∈ N. Then:

    Ot_+^K(ψ) = ∑_{ϕ∈N[K](supp(ψ))} ∑_{~x∈acc⁻¹(ϕ)} 1/( ϕ ) · [ (ψ multichoose ϕ) / (∥ψ∥ multichoose K) ] | ~x, ψ + ϕ ⟩.

The associated draw channel O_+[K] ≔ π₁ ◦· Ot_+^K : N∗(X) → X^K then is:

    O_+[K](ψ) = ∑_{ϕ∈N[K](supp(ψ))} ∑_{~x∈acc⁻¹(ϕ)} 1/( ϕ ) · [ (ψ multichoose ϕ) / (∥ψ∥ multichoose K) ] | ~x ⟩.

We recall from Definition 1.7.1 (2) that:

    (ψ multichoose ϕ) = ∏_{x∈supp(ψ)} (ψ(x) multichoose ϕ(x)) = ∏_{x∈supp(ψ)} (ψ(x) + ϕ(x) − 1 choose ϕ(x)).

The multichoose coefficients (ψ multichoose ϕ) sum to (∥ψ∥ multichoose K) over all ϕ with ∥ϕ∥ = K, see
Proposition 1.7.3.

Proof. Write L = ∥ψ∥ > 0. We first note that:

    ∏_{0≤i<K} (L + i) = L · (L + 1) · . . . · (L + K − 1) = (L + K − 1)! / (L − 1)!.


In combination with Lemma 3.5.5 we get:

Ot +K (ψ)
X X (kψk−1)! Y (ψ(y)+ϕ(y)−1)!
= · ~x, ψ+ϕ
ϕ∈N[K](supp(ψ))
(kψk+K −1)! (ψ(y) − 1)!
~x∈acc −1 (ϕ) y∈supp(ψ)
Q (ψ(y)+ϕ(y)−1)!
X X y∈supp(ψ) (ψ(y)−1)!
= ~x, ψ+ϕ

(kψk+K−1)!
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ) (kψk−1)!
(ψ(y)+ϕ(y)−1)!
ϕ(y)!
Q Q
X X y y∈supp(ψ) ϕ(y)!·(ψ(y)−1)!
= ~x, ψ+ϕ

· (kψk+K−1)!
ϕ∈N[K](supp(ψ))
K!
~x∈acc −1 (ϕ) K!·(kψk−1)!
ψ
X X 1 ϕ
= ·  kψk  ~x, ψ+ϕ .

ϕ∈N[K](supp(ψ)) ~x∈acc (ϕ)
−1
( ϕ )
K

The formula for Ot + [K](ψ) is then an easy consequence.

3.5.2 Unordered draws from an urn


We now concentrate on the transition channels on the right in Figure 3.1, for
unordered draws. Notice that we are now using M = N(X) as monoid in the
setting of Lemma 3.5.1. We now combine all the mathematical ‘heavy lifting’
in one preparatory lemma.

Lemma 3.5.7. 1 For ω ∈ D(X) and K ∈ N,

      Ut_0^K(ω) = ∑_{ϕ∈N[K](X)} ( ϕ ) · ∏_x ω(x)^{ϕ(x)} | ϕ, ω ⟩.

2 For ψ ∈ N[L+K](X),

      Ut_−^K(ψ) = ∑_{ϕ≤K ψ} [ (ψ choose ϕ) / (L+K choose K) ] | ϕ, ψ − ϕ ⟩
                = ∑_{ϕ≤K ψ} [ ∏_x (ψ(x) choose ϕ(x)) / (∥ψ∥ choose K) ] | ϕ, ψ − ϕ ⟩.

3 For ψ ∈ N∗(X),

      Ut_+^K(ψ) = ∑_{ϕ∈N[K](supp(ψ))} [ (ψ multichoose ϕ) / (∥ψ∥ multichoose K) ] | ϕ, ψ + ϕ ⟩
                = ∑_{ϕ∈N[K](supp(ψ))} [ ∏_{x∈supp(ψ)} (ψ(x) multichoose ϕ(x)) / (∥ψ∥ multichoose K) ] | ϕ, ψ + ϕ ⟩.

Proof. 1 We use induction on K ∈ N. For K = 0 we have N[K](X) = {0} and
so:

    Ut_0^0(ω) = unit(ω) = 1| 0, ω ⟩ = ∑_{ϕ∈N[0](X)} ( ϕ ) · ∏_x ω(x)^{ϕ(x)} | ϕ, ω ⟩.


For the induction step:

    Ut_0^{K+1}(ω) = (Ut_0^K ◦· Ut_0)(ω)
      (3.18)
      = ∑_{ψ∈N[1](X), ϕ∈N[K](X)} Ut_0^K(ω)(ϕ, ω) · Ut_0(ω)(ψ, ω) | ψ + ϕ, ω ⟩
      (IH)
      = ∑_{y∈X, ϕ∈N[K](X)} ( ϕ ) · ∏_x ω(x)^{ϕ(x)} · ω(y) | 1|y⟩ + ϕ, ω ⟩
      = ∑_{ψ∈N[K+1](X)} ( ∑_y ( ψ − 1|y⟩ ) ) · ∏_x ω(x)^{ψ(x)} | ψ, ω ⟩
      (1.25)
      = ∑_{ψ∈N[K+1](X)} ( ψ ) · ∏_x ω(x)^{ψ(x)} | ψ, ω ⟩.


2 For K = 0 both sides are equal to 1| 0, ψ ⟩. Next, for a multiset ψ ∈ N[L+K+1](X)
we have:

    Ut_−^{K+1}(ψ) = (Ut_−^K ◦· Ut_−)(ψ)
      (3.18)
      = ∑_{y∈supp(ψ), χ∈N[L](X), ϕ≤K ψ−1|y⟩} ψ(y)/(L+K+1) · Ut_−^K(ψ − 1|y⟩)(ϕ, χ) | ϕ + 1|y⟩, χ ⟩
      (IH)
      = ∑_{y∈supp(ψ), ϕ≤K ψ−1|y⟩} ψ(y)/(L+K+1) · [ (ψ−1|y⟩ choose ϕ) / (L+K choose K) ] | ϕ + 1|y⟩, ψ − 1|y⟩ − ϕ ⟩
      = ∑_{y∈supp(ψ), ϕ≤K ψ−1|y⟩} (ϕ(y)+1)/(K+1) · [ (ψ choose ϕ+1|y⟩) / (L+K+1 choose K+1) ] | ϕ + 1|y⟩, ψ − (ϕ + 1|y⟩) ⟩
        by Exercise 1.7.6
      = ∑_{χ≤K+1 ψ} ∑_{y∈supp(χ)} χ(y)/(K+1) · [ (ψ choose χ) / (L+K+1 choose K+1) ] | χ, ψ − χ ⟩
      = ∑_{χ≤K+1 ψ} [ (ψ choose χ) / (L+K+1 choose K+1) ] | χ, ψ − χ ⟩.

3 The case K = 0 is as before, so we immediately look at the induction step, for
an urn ψ with ∥ψ∥ = L > 0.

    Ut_+^{K+1}(ψ) = (Ut_+^K ◦· Ut_+)(ψ)
      (3.18)
      = ∑_{y∈supp(ψ), ϕ, χ} ψ(y)/L · Ut_+^K(ψ + 1|y⟩)(ϕ, χ) | ϕ + 1|y⟩, χ ⟩
      (IH)
      = ∑_{y∈supp(ψ), ϕ∈N[K](supp(ψ))} ψ(y)/L · [ (ψ+1|y⟩ multichoose ϕ) / (L+1 multichoose K) ] | ϕ + 1|y⟩, ψ + 1|y⟩ + ϕ ⟩
      = ∑_{y∈supp(ψ), ϕ∈N[K](supp(ψ))} (ϕ(y)+1)/(K+1) · [ (ψ multichoose ϕ+1|y⟩) / (L multichoose K+1) ] | ϕ + 1|y⟩, ψ + (ϕ + 1|y⟩) ⟩
        by Exercise 1.7.6
      = ∑_{χ∈N[K+1](supp(ψ))} ∑_{y∈supp(χ)} χ(y)/(K+1) · [ (ψ multichoose χ) / (L multichoose K+1) ] | χ, ψ + χ ⟩
      = ∑_{χ∈N[K+1](supp(ψ))} [ (ψ multichoose χ) / (L multichoose K+1) ] | χ, ψ + χ ⟩.

We are now in a position to describe the multinomial, hypergeometric and
Pólya distributions, using iterations of the Ut_0, Ut_− and Ut_+ transition maps,
followed by marginalisation.

Theorem 3.5.8. 1 The K-draw multinomial is the first marginal of K itera-
tions of the unordered-with-replacement transition:

      mn[K] = π₁ ◦· Ut_0^K ≕ U_0[K].

2 Similarly the hypergeometric distribution arises from iterated unordered-
with-deletion transitions:

      hg[K] = π₁ ◦· Ut_−^K ≕ U_−[K].

3 Finally, the Pólya distribution comes from iterated unordered-with-addition
transitions, as:

      pl[K] = π₁ ◦· Ut_+^K ≕ U_+[K].

Proof. Directly by Lemma 3.5.7.
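
As an illustration (a sketch, building on the unit/kleisli/iterate snippet above), the unordered-with-deletion transition Ut_− can be written out and iterated; its first marginal reproduces the hypergeometric distribution, in line with item (2).

    # Sketch: Ut_- as a Python transition and its first marginal after two steps,
    # for the urn 3|a> + 1|b> (encoded as a sorted tuple of balls).
    def ut_minus(urn):
        """One unordered draw-and-delete step; urns are sorted tuples of balls."""
        out = {}
        L = len(urn)
        for x in set(urn):
            rest = list(urn)
            rest.remove(x)
            out[((x,), tuple(sorted(rest)))] = Fraction(urn.count(x), L)
        return out

    def first_marginal(dist):
        out = {}
        for (draw, _urn), p in dist.items():
            out[draw] = out.get(draw, 0) + p
        return out

    urn = ('a', 'a', 'a', 'b')
    print(first_marginal(iterate(ut_minus, 2)(urn)))
    # {('a', 'a'): 1/2, ('a', 'b'): 1/2}, i.e. hg[2](3|a> + 1|b>)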

Together, Theorems 3.5.2, 3.5.4, 3.5.6 and 3.5.8 give precise descriptions of
the six channels, for ordered / unordered drawing with replacement / deletion
/ addition, as originally introduced in Table (3.3), in the introduction to this
chapter. What remains to do is show that the diagrams (3.4) mentioned there
commute — which relate the various forms of drawing via accumulation and
arrangement. We repeat them below for convenience, but now enriched with
multinomial, hypergeometric, and Pólya channels.

    O_0 = arr ◦· mn[K]         O_− = arr ◦· hg[K]         O_+ = arr ◦· pl[K]
    mn[K] = acc ◦· O_0         hg[K] = acc ◦· O_−         pl[K] = acc ◦· O_+           (3.19)

Here mn[K] = U_0[K] : D(X) → N[K](X), hg[K] = U_−[K] : N[L](X) → N[K](X),
pl[K] = U_+[K] : N∗(X) → N[K](X), and O_0, O_−, O_+ are the corresponding
ordered draw channels with codomain X^K.

Theorem 3.3.1 precisely says that the diagram on the left commutes. For the
two other diagrams we still have to do some work. This provides new character-
isations of ordered draws with deletion / addition, namely as hypergeometric
/ Pólya followed by arrangement.

Proposition 3.5.9. 1 Accumulating ordered draws with deletion / addition yields
the corresponding unordered draws:

      (acc ⊗ id) ◦· Ot_−^K = Ut_−^K        (acc ⊗ id) ◦· Ot_+^K = Ut_+^K.

2 Permuting ordered draws with deletion / addition has no effect: the permu-
tation channel perm = arr ◦· acc : X^K → X^K satisfies:

      (perm ⊗ id) ◦· Ot_−^K = Ot_−^K        (perm ⊗ id) ◦· Ot_+^K = Ot_+^K.

3 The diagrams in the middle and on the right in (3.19) commute.

Proof. 1 By combining Theorem 3.5.4 (1) and Lemma 3.5.7 (2):

    ((acc ⊗ id) ◦· Ot_−^K)(ψ)
        = (acc ⊗ id) =≪ ( ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ ( ψ − ϕ ) / ( ψ ) ] | ~x, ψ − ϕ ⟩ )
        = ∑_{ϕ≤K ψ} ∑_{~x∈acc⁻¹(ϕ)} [ ( ψ − ϕ ) / ( ψ ) ] | acc(~x), ψ − ϕ ⟩
        = ∑_{ϕ≤K ψ} [ ( ϕ ) · ( ψ − ϕ ) / ( ψ ) ] | ϕ, ψ − ϕ ⟩
        = ∑_{ϕ≤K ψ} [ (ψ choose ϕ) / (L choose K) ] | ϕ, ψ − ϕ ⟩          by Lemma 1.6.4 (1)
        = Ut_−^K(ψ).


Similarly, by Theorem 3.5.6,


ψ
X X 1 ϕ
Ot +K (ψ) = ·  kψk  acc(~x), ψ + ϕ .

(acc ⊗ id ) ◦·
ϕ∈N[K](supp(ψ)) ~x∈acc −1 (ϕ)
( ϕ )
K
ψ
X 1 ϕ
= (ϕ) · ·  kψk  ϕ, ψ + ϕ .

ϕ∈N[K](supp(ψ))
( ϕ )
K

= Ut +K (ψ).

2 The probability for a K-tuple ~x in Ot_−^K, see Theorem 3.5.4, depends only on
acc(~x). Hence it does not change if ~x is permuted. The same holds for Ot_+^K,
see Theorem 3.5.6.
3 By the above first item we get:

    acc ◦· O_− = acc ◦· π₁ ◦· Ot_−^K
              = π₁ ◦· (acc ⊗ id) ◦· Ot_−^K
              = π₁ ◦· Ut_−^K                     as shown in the first item
              = hg[K]                            by Theorem 3.5.8 (2).

But then:

    arr ◦· hg[K] = arr ◦· acc ◦· π₁ ◦· Ot_−^K    as just shown
                 = perm ◦· π₁ ◦· Ot_−^K
                 = π₁ ◦· (perm ⊗ id) ◦· Ot_−^K
                 = π₁ ◦· Ot_−^K                  see item (2)
                 = O_−.

The same argument works with O_+ instead of O_−.

We conclude with one more result that clarifies the different roles that the
draw-and-delete DD and draw-and-add DA channels play for hypergeometric
and Pólya distributions. So far we have looked only at the first marginal after
iteration. Including also the second marginal gives a fuller picture.

Theorem 3.5.10. 1 In the hypergeometric case, both the first and the second
marginal after iterating Ut_− can be expressed in terms of the draw-and-
delete map, as in:

      π₁ ◦· Ut_−^K = hg[K] = DD^L        π₂ ◦· Ut_−^K = DD^K,

  as channels N[L+K](X) → N[K](X), respectively N[L+K](X) → N[L](X).

2 In the Pólya case, (only) the second marginal can be expressed via the
draw-and-add map:

      π₁ ◦· Ut_+^K = pl[K]        π₂ ◦· Ut_+^K = DA^K,

  as channels N[L](X) → N[K](X), respectively N[L](X) → N[L+K](X).

Proof. 1 The triangle on the left commutes by Theorem 3.5.8 (2) and Theo-
rem 3.4.1. The one on the right follows from the equation:

    DD^K(ψ) = ∑_{ϕ≤K ψ} [ (ψ choose ϕ) / (∥ψ∥ choose K) ] | ψ − ϕ ⟩.              (3.20)

The proof is left as an exercise to the reader.

2 Commutation of the triangle on the left holds by Theorem 3.5.8 (3). The
triangle on the right follows from:

    DA^K(ψ) = ∑_{ϕ∈N[K](supp(ψ))} [ (ψ multichoose ϕ) / (∥ψ∥ multichoose K) ] | ψ + ϕ ⟩.     (3.21)

This one is also left as an exercise.

Exercises
3.5.1 Consider the multiset/urn ψ = 4| a ⟩ + 2| b ⟩ of size 6 over {a, b}. Com-
      pute O_−[3](ψ) ∈ D({a, b}³), both by hand and via the O_−-formula in
      Theorem 3.5.4 (2).
3.5.2 Check that:

          pl[2](1| a ⟩ + 2| b ⟩ + 1| c ⟩)
              = 1/10| 2|a⟩ ⟩ + 1/5| 1|a⟩ + 1|b⟩ ⟩ + 3/10| 2|b⟩ ⟩
              + 1/10| 1|a⟩ + 1|c⟩ ⟩ + 1/5| 1|b⟩ + 1|c⟩ ⟩ + 1/10| 2|c⟩ ⟩.