Categorical Time Series:
Analysis, Modelling, Monitoring?
Christian H. Weiß
Department of Mathematics,
Darmstadt University of Technology
Categorical
Time Series
Motivation
Categorical Time Series — Motivation
Log data of a web server:
Several types of categorical features (IP addresses, web ad-
dresses, state codes, etc.).
(→ ENBIS talk in 2006)
Possible applications:
• Measuring success of a web site,
• intrusion detection.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Motivation
Medical Diagnoses
of an examiner for a certain type of examination.
(→ ENBIS talk in 2009)
Possible applications:
• Compare diagnosis behavior of different examiners,
• compare examiner with a norm profile.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Motivation
Text:
Sequence of letters, words, word classes, grammatical enti-
ties, etc.
Possible applications:
• Language recognition,
• text recognition,
• part-of-speech tagging, etc.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Motivation
Genetic or protein sequences.
Possible applications:
• Structure analysis to understand functionality of seg-
ments of the sequence,
• recognition of similar sequences,
• recognition of common evolutionary roots,
• identification of sequences through comparison to con-
sensus model.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Motivation
SPC-related examples.
Xt = result of inspection of item, with
Xt = i for i = 1, . . . , m iff item has nonconformity type i,
Xt = 0 iff conforming.
Mukhopadhyay (2008): m = 6 paint defects of ceiling fan
cover (‘poor covering’, ‘bubbles’, etc.).
Overall defect category = most predominant defect.
Ye et al. (2002) monitor network traffic data (284 different
types of audit events) for intrusion detection.
Christian H. Weiß — Darmstadt University of Technology
Categorical
Random Variables
Properties
Categorical Random Variables — Properties
Categorical random variable:
X takes one of finite number of unordered categories,
say b0, . . . , bm. E. g.,
• diagnoses b0, . . . , bm, or
• parts of speech b0, . . . , bm, or
• types of defects b0, . . . , bm, or . . .
To simplify notations:
Range of X is coded as V = {0, . . . , m}.
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
X categorical r.v. with range V = {0, . . . , m}
∑
⇒ P (X = 0) = 1 − m
j=1 P (X = j).
Abbreviate marginal probabilities: pi := P (X = i) ∈ (0; 1),
whole distribution determined by m parameters p1, . . . , pm.
1st problem: Number m of parameters might be very large,
in contrast to real-valued or count-data random variables,
no sparse parametric models available yet!
. . . except trivial cases:
one-point distribution, uniform distribution.
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Typical for cardinal random variables:
quantify basic properties, e. g.,
• location via mean or median,
• dispersion via variance or quartiles.
But:
• no arithmetic operations for categorical range
⇒ no mean, variance, etc.
• unordered range ⇒ no quantiles.
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
So how can we quantify
location and dispersion
of a categorical random variable?
Measures of location:
only mode of X in use, i. e.,
value i ∈ V such that pi ≥ pj for all j ∈ V.
Often not uniquely determined (e. g., uniform distribution).
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Intuitive understanding of dispersion:
X shows large dispersion
≈
High uncertainty about the outcome of X
⇒ Uncertainty of categorical random variable?
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Two extreme cases:
Uniform distribution:
Maximal uncertainty about the outcome of X.
One-point distribution:
Perfect certainty about the outcome of X.
⇒ Hallmarks for definition of any measure of dispersion!
Contributions in literature (desirable properties, measures),
e. g., by Uschner (1987), Vogel & Kiesl (1999).
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Common standardized measures of dispersion:
m+1 ∑
Gini index: νG(X) := (1 − p2
j ).
m j
1 ∑
Entropy: νE(X) := − pj ln pj .
ln (m + 1) j
m+1
Chebycheff dispersion: νC(X) := (1 − maxj pj ).
m
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Important properties of these measures:
• continuous and symmetric functions of p,
• range [0; 1],
• maximum value 1 in case of uniform distribution,
• minimal value 0 in case of one-point distribution,
• inequality: m+1
m (1 − minj pj ) ≥ νG(X) ≥ νC(X).
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Empirical measures of dispersion
based on sample X1, . . . , XT .
Binarization Y t ∈ {0, 1}m+1 of Xt via Yt,i := δi,Xt .
1 ∑T
Unbiased estimator for pi: p̂i(T ) := T · t=1 Yt,i.
Abbreviate p := (p0, . . . , pm)⊤,
( )⊤ ∑
p̂(T ) := p̂0(T ), . . . , p̂m(T ) = T1 · t Y t,
∑
and sk (p) := k
j pj for k ∈ N.
Christian H. Weiß — Darmstadt University of Technology
Categorical Random Variables — Properties
Weiß (2011): Empirical Gini index defined by
( ( ))
ν̂G(X) := m+1
m · T · 1 − s p̂(T ) .
T −1 2
If computed from i.i.d. data, then
ν̂G(X) is exactly unbiased and asymptotically normally dis-
tributed with variance determined by
[ ( )] ( )
V 1 − s2 p̂(T ) = 4
T · s 3 (p) − s 2
2 (p) + O(T −2).
Current research:
ν̂G(X) and ν̂E(X) for NDARMA processes (see below).
Christian H. Weiß — Darmstadt University of Technology
Categorical
Time Series
Terms & Notations
Categorical Time Series — Terms & Notations
Categorical process:
(Xt)N with N = {1, 2, . . .}, where each Xt takes one of finite
number of unordered categories.
Categorical time series:
Realizations (xt)t=1,...,T from (Xt)N.
(Xt)N said to be stationary
if joint distribution of (Xt, . . . , Xt+k )
independent of t for all k ∈ N0.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Terms & Notations
Notations for time-invariant probabilities:
If (Xt)N stationary:
• marginal probabilities pi := P (Xt = i) ∈ (0; 1).
p := (p0, . . . , pm)⊤, and
∑
sk (p) := k
j pj for k ∈ N; obviously s1(p) = 1.
• bivariate probabilities pij (k) := P (Xt = i, Xt−k = j),
conditional probabilities pi|j (k) := P (Xt = i | Xt−k = j).
Christian H. Weiß — Darmstadt University of Technology
Categorical
Time Series
Visual Analysis
Categorical Time Series — Visual Analysis
“Little existing work deals directly with categorical time se-
ries analysis, and much less deals with the visualization of
categorical time series.” (Ribler, 1997, p. 11)
Several visual tools for real-valued time series:
line plot, periodogram, correlogram, etc.
At least line plot simple and universal tool!
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Linienplot (Bier.sta 1v*422c)
240
220
200
180
160
Konsum
140
120
100
80
60
40
1 51 101 151 201 251 301 351 401
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Visual tools for categorical time series:
only few proposals,
often from computer science and biology,
see survey by Weiß (2008).
In fact problematic: Analogue of line plot?
Lack of a natural order within purely categorical range
⇒ arrangement of range along ordinate
arbitrary and misleading.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Linienplot (Bovine.sta 6v*8419c) Linienplot (Bovine.sta 6v*8419c)
t c
c a
Wert
Wert
a g
g t
1 11 21 31 41 or 1 11 21 31 41
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Alternative (Keim & Kriegel, 1996): Map range onto set
of colors or symbols, plot x1, x2, . . . successively on space-
filling curve (line-by-line, column-by-column, Peano-Hilbert
curves, spiral, etc.).
However: These graphs perform poorly, characteristic fea-
tures of time series difficult to recognize, interpretation of
resulting plot is problematic, appearance depends heavily on
choice of underlying curve.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Genome of Bovine leukemia virus:
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
So how to analyze categorical time series visually?
Proposal by Ribler (1997): Rate evolution graph.
Categorical process (Xt)N, range coded as V = {0, . . . , m}.
Binarization Y t ∈ {0, 1}m+1 with Yt,i = δi,Xt , i = 0, . . . , m.
∑t
Define the cumulated sums C t := s=1 Y s, i. e.,
Ct,i = number of Xs, s = 1, . . . , t, equal to i.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Rate evolution graph of (Xt)N: (Ribler, 1997)
Multiple line plot of all component series Ct,i, i = 0, . . . , m,
i. e., all Ct,i are plotted simultaneously into one chart.
Interpretation:
Slope of graphs is estimate for corresp. marginal probability.
If (Xt)N stationary and at most moderately serially depen-
dent, then graphs approximately linear in t
⇒ Simple visual tool for checking stationarity.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Shakespeare’s (1593) poem “Venus and Adonis”:
/LQLHQSORW YHQXVDQGDGRQLVVWDY F
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Log data (2005) of Statistics server at Univ. of Würzburg:
Access to home directory of five members.
/LQLHQSORW DFFHVVBORJB]LHOVWDY F
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Visual Analysis
Some further tools: (see Weiß (2008) for references)
• VizTree or IFS circle transformation for visualizing
string frequencies; (→ ENBIS talk in 2005)
• pattern histograms (e. g., based on runs or cycles);
• categorical control charts (also see below).
However: These tools are very specialized.
Universal instrument (like line plot), providing multiple types
of information at once, still missing.
Christian H. Weiß — Darmstadt University of Technology
Categorical
Time Series
Serial Dependence
Categorical Time Series — Serial Dependence
Cardinal time series:
Convenient measure of serial dependence:
(Partial) Autocorrelation.
Categorical time series:
Measures of serial dependence?
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Weiß & Göb (2008): same strategy as for dispersion mea-
sures, i. e., we start with
Extreme cases:
• Xt, Xt−k (stochastically) independent iff pij (k) = pi · pj .
• Xt perfectly depends on Xt−k iff for every j = 0, . . . , m:
conditional distribution of Xt, conditioned on Xt−k = j,
is a one-point distribution.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Weiß & Göb (2008) proposed, among others,
v
u ( )2
u
u∑ pij (k)
u − pipj
Goodman and Kruskal’s τ : τ (k) = t ( ) ,
i,j pj 1 − s2(p)
v
u ( )2
u
u1
u ∑ pij (k) − pipj
Cramer’s v: v(k) = t · .
m i,j pipj
Some properties: τ (k), v(k) have range [0; 1] with
• Xt, Xt−k independent ⇔ τ (k) = v(k) = 0.
• Xt depends perfectly on Xt−k ⇔ τ (k) = v(k) = 1.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Cardinal case: Positive and negative autocorrelation.
⇒
Concept of signed dependence: (Weiß & Göb, 2008)
• Xt, Xt−k perfectly positively dependent,
if perfectly dependent and
if pi|i(k) = 1 for all i.
• Xt, Xt−k perfectly negatively dependent
if perfectly dependent and
if pi|i(k) = 0 for all i.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Measures of signed dependence:
(Weiß, 2011; Weiß & Göb, 2008)
Cohen’s κ:
∑
j pjj (k) − s2(p) s2(p)
κ(k) = with range [− 1−s ; 1],
1 − s 2 (p) 2 (p)
Modified κ:
(∑ )
κ∗(k) = 1
m · p
j j|j (k) − 1 with range [− 1 ; 1].
m
Some properties:
• Xt, Xt−k independent ⇒ κ(k) = κ∗(k) = 0.
• Xt, Xt−k perf. positively dep. ⇔ κ(k) = κ∗(k) = 1.
• Xt, Xt−k perf. negatively dep. ⇒ κ(k), κ∗(k) minimal.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Empirical measures of (signed) serial dependence
based on time series X1, . . . , XT .
Weiß (2011): empirical Cohen’s κ and modified κ:
∑
1 1 − j p̂jj (k, T )
κ̂(k) := 1 + − ,
T 1 − s2(p̂(T ))
(∑ )
p̂jj (k, T )
κ̂∗(k) := T1 + m
1 · −1 .
j p̂j (T )
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Empirical measures of (signed) serial dependence
If computed from i.i.d. data, then
κ̂(k), κ̂∗(k) are asymptotically normally distributed with
[ ] [ ]
E κ̂(k) = E κ̂∗(k) = 0 + O(T −2),
and variances given by
[ ] (
1+2s3(p)−3s2(p) )
V κ̂(k) = 1
T · 1 − (1−s2(p))2
+ O(T −2);
[ ]
V κ̂∗(k) 1
= m·T + O(T −2).
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Serial Dependence
Current and future research:
• κ̂(k), κ̂∗(k) and also τ̂ (k), v̂(k) for NDARMA processes;
• goodness-of-fit tests for NDARMA processes.
Christian H. Weiß — Darmstadt University of Technology
Categorical
Time Series
Modelling
Categorical Time Series — Modelling
Models for stationary categorical processes
with range V = {0, . . . , m}:
• pth order Markov model: (m + 1)p · m parameters;
• variable length M. m. of Bühlmann & Wyner (1999):
more parsimonious, but model choice difficult;
• MTD(p) model of Raftery (1985):
still m(m + 1) + p − 1 parameters;
• NDARMA(p, q) models of Jacobs & Lewis (1983):
m + p + q parameters, also non-Markovian dependence.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Modelling
(Xt)Z, (ϵt)Z: categorical processes with range V = {0, . . . , m};
(ϵt)Z: i.i.d. with marginal distribution P (ϵt = j) = pj > 0,
ϵt independent of (Xs)s<t.
For φq > 0, with ϕp > 0 if p ≥ 1, let
D t = (αt,1, . . . , αt,p, βt,0, . . . , βt,q) ∼ M U LT (1; ϕ1, . . . , ϕp, φ0, . . . , φq)
be i.i.d. and independent of (ϵt)Z, (Xs)s<t.
(Xt)Z is NDARMA(p, q) process if
Xt = αt,1 · Xt−1 + . . . + αt,p · Xt−p + βt,0 · ϵt + . . . + βt,q · ϵt−q.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Modelling
Some properties:
• marginal distribution P (Xt = j) = pj ;
• only shows positive serial dependence, and
κ(k) = κ∗(k) = v(k) = τ (k);
• Yule-Walker-type equations
∑p ∑q−k
κ(k) = j=1 ϕj ·κ(|k−j|) + i=0 φi+k ·r(i) for k ≥ 1,
where r(i) = 0 for i < 0, r(0) = φ0, and
∑i−1 ∑q
r(i) = j=max {0,i−p} ϕi−j ·r(j) + j=0 δij ·φj for i > 0.
Christian H. Weiß — Darmstadt University of Technology
Categorical Time Series — Modelling
Disadvantage of NDARMA models:
Only describe positive dependence, i. e.,
categorical time series with long runs of symbols.
(e. g., genetic sequence vs. letters of text)
Issues for future research:
• Find simple and sparsely parametrized models
that also allow for negative dependence!
• Definition of trend or seasonality?
• And: How to remove such trend or seasonality?
Christian H. Weiß — Darmstadt University of Technology
Categorical
Processes
Monitoring
Categorical Processes — Monitoring
Only very few approaches for monitoring processes of
unordered and mutually exclusive categories.
Monitoring samples from an i.i.d. categorical process:
Duncan (1950), Marcucci (1985), Nelson (1987), Mukho-
padhyay (2008) plot Pearson’s χ2-statistic for goodness of
fit on a control chart.
Continuously monitoring cat. process (100 % inspect.):
Weiß (2010) proposes
• two moving-average-type charts,
based either on Pearson statistic or Gini index,
• a (k, r)-runs chart.
Christian H. Weiß — Darmstadt University of Technology
Categorical Processes — Monitoring
(k, r)-run: finished after k successive observations
of either ‘1’ or ‘2’ or . . . or ‘r’, i. e., after observing
one of (1, . . . , 1), (2, . . . , 2), . . . , (r, . . . , r) of length k each.
(k,r)
(k, r)th run lengths (Yn )N determined as
(k,r)
Y1 := No. obs. until first occurrence of k-tuple of ‘1’s or . . . ‘r’s,
(k,r)
Yn := No. obs. after (n − 1)th occurrence of k-tuple of ‘1’s or . . . ‘r’s
until nth occurrence of k-tuple of ‘1’s or . . . ‘r’s, for n ≥ 2.
Example: m = 3 (i. e., V = {0, 1, 2, 3})
and (k, r) = (2, 3), (fictive) time series:
1
| 2 0 0 0 | 3 0 0{z1 0 1 1} 1
{z 3 0 2 2} 0 | {z1} 3
| 0 2{z0 3 3} 2 3 . . .
9 8 2 6
Christian H. Weiß — Darmstadt University of Technology
Categorical Processes — Monitoring
If (Xt)N Markov chain, then
(k,r)
(Yn )N i.i.d. process, range Nk := {k, k + 1, . . .}.
Properties: (Chryssaphinou et al., 1994)
(1 − piz)(piz)k
r
∑
Let ck,r (z) := , then
i=1 1 − (piz) k
(k,r) 1 1 + ck,r (1) − 2c′k,r (1)
E[Y ] = , V [Y (k,r)] = .
ck,r (1) c2
k,r (1)
Christian H. Weiß — Darmstadt University of Technology
Categorical Processes — Monitoring
(k, r)-runs chart: (Weiß, 2010)
(k,r)
(Yn )N plotted on chart with k ≤ LCL < U CL.
Exact AN E computation with Markov chain approach.
Issues for future research:
• charts based on different patterns,
e. g., cycles instead of runs;
• CUSUM or EWMA methods for categorical processes,
e. g., related to patterns or certain statistics; . . .
Christian H. Weiß — Darmstadt University of Technology
Conclusions
In a nutshell,
the field of categorical time series . . .
• is relevant for practice, and
• offers a lot of topics for future research,
in any of the disciplines
analysis, modelling and monitoring!
Christian H. Weiß — Darmstadt University of Technology
Thank You
for Your Interest!
Christian H. Weiß
Department of Mathematics
Darmstadt University of Technology
Literature
Box et al. (1994): Time series analysis – forecasting and control. 3rd edition, Prentice
Hall, Inc., New Jersey.
Bühlmann & Wyner (1999): Variable length Markov chains. Ann. Stat., 27, 480–513.
Chryssaphinou et al. (1994): On the waiting time of appearance of given patterns. In:
Godbole & Papastavridis (Eds.): Runs and Patterns in Probability, Kluwer Academic
Publishers, 231–241.
Duncan (1950): A chi-square chart for controlling a set of percentages. Industrial
Quality Control, 7, 11–15.
Jacobs & Lewis (1983): Stationary discrete autoregressive-moving average time series
generated by mixtures. J. Time Series Anal., 4(1), 19–36.
Keim & Kriegel (1996): Visualization techniques for mining large databases: A com-
parison. IEEE Trans. Knowl. Data Eng., 8(6), 923–938.
Marcucci (1985): Monitoring multinomial processes. J. Qual. Tech., 17(2), 86–91.
Mukhopadhyay (2008): Multivariate attribute control chart using Mahalanobis D2 sta-
tistic. J. Appl. Statist., 35(4), 421–429.
Nelson (1987): A chi-square control chart for several proportions. J. Qual. Tech.,
19(4), 229–231.
Raftery (1985): A model for high-order Markov chains. J. Royal Stat. Soc., B, 47(3),
528–539.
(. . . )
Christian H. Weiß — Darmstadt University of Technology
Literature
(. . . )
Ribler (1997): Visualizing categorical time series data with applications to computer
and communications network traces. PhD thesis, Virginia State University.
Uschner (1987): Streuungsmessung nominaler Merkmale mit Hilfe von Paarverglei-
chen. Doctoral dissertation, University Erlangen-Nürnberg.
Vogel & Kiesl (1999): Deskriptive und induktive Eigenschaften zweier Streuungsmaße
für nominale Merkmale. In: Vogel (Edt.): Arbeiten aus der Statistik, Univ. Bamberg.
Weiß (2008): Visual analysis of categorical time series. Statist. Meth., 5(1), 56–71.
Weiß (2010): Continuously monitoring categorical processes. Qual. Tech. Quant. Ma-
nag., to appear.
Weiß (2011): Empirical measures of signed serial dependence in categorical time se-
ries. J. Statist. Comp. Simulation, 81(4), 411–429.
Weiß & Göb (2008): Measuring serial dependence in categorical time series. Adv.
Statist. Anal., 92(1), 71–89.
Ye et al. (2002): Multivariate statistical analysis of audit trails for host-based intrusion
detection. IEEE Trans. Computer, 51(7), 810–820.
Christian H. Weiß — Darmstadt University of Technology