0% fanden dieses Dokument nützlich (0 Abstimmungen)
71 Ansichten56 Seiten

Folien 09 11 1

Categorical timeseries analysis modeling and monitoring

Hochgeladen von

Nezar El Messnaoui
Copyright
© © All Rights Reserved
Wir nehmen die Rechte an Inhalten ernst. Wenn Sie vermuten, dass dies Ihr Inhalt ist, beanspruchen Sie ihn hier.
Verfügbare Formate
Als PDF, TXT herunterladen oder online auf Scribd lesen
0% fanden dieses Dokument nützlich (0 Abstimmungen)
71 Ansichten56 Seiten

Folien 09 11 1

Categorical timeseries analysis modeling and monitoring

Hochgeladen von

Nezar El Messnaoui
Copyright
© © All Rights Reserved
Wir nehmen die Rechte an Inhalten ernst. Wenn Sie vermuten, dass dies Ihr Inhalt ist, beanspruchen Sie ihn hier.
Verfügbare Formate
Als PDF, TXT herunterladen oder online auf Scribd lesen

Categorical Time Series:

Analysis, Modelling, Monitoring?

Christian H. Weiß
Department of Mathematics,
Darmstadt University of Technology
Categorical
Time Series

Motivation
Categorical Time Series — Motivation

Log data of a web server:

Several types of categorical features (IP addresses, web ad-


dresses, state codes, etc.).
(→ ENBIS talk in 2006)

Possible applications:

• Measuring success of a web site,

• intrusion detection.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Motivation

Medical Diagnoses

of an examiner for a certain type of examination.


(→ ENBIS talk in 2009)

Possible applications:

• Compare diagnosis behavior of different examiners,

• compare examiner with a norm profile.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Motivation

Text:
Sequence of letters, words, word classes, grammatical enti-
ties, etc.

Possible applications:

• Language recognition,

• text recognition,

• part-of-speech tagging, etc.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Motivation

Genetic or protein sequences.

Possible applications:

• Structure analysis to understand functionality of seg-


ments of the sequence,

• recognition of similar sequences,

• recognition of common evolutionary roots,

• identification of sequences through comparison to con-


sensus model.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Motivation

SPC-related examples.

Xt = result of inspection of item, with


Xt = i for i = 1, . . . , m iff item has nonconformity type i,
Xt = 0 iff conforming.

Mukhopadhyay (2008): m = 6 paint defects of ceiling fan


cover (‘poor covering’, ‘bubbles’, etc.).
Overall defect category = most predominant defect.

Ye et al. (2002) monitor network traffic data (284 different


types of audit events) for intrusion detection.

Christian H. Weiß — Darmstadt University of Technology


Categorical
Random Variables

Properties
Categorical Random Variables — Properties

Categorical random variable:


X takes one of finite number of unordered categories,
say b0, . . . , bm. E. g.,

• diagnoses b0, . . . , bm, or

• parts of speech b0, . . . , bm, or

• types of defects b0, . . . , bm, or . . .

To simplify notations:
Range of X is coded as V = {0, . . . , m}.

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

X categorical r.v. with range V = {0, . . . , m}



⇒ P (X = 0) = 1 − m
j=1 P (X = j).

Abbreviate marginal probabilities: pi := P (X = i) ∈ (0; 1),


whole distribution determined by m parameters p1, . . . , pm.

1st problem: Number m of parameters might be very large,


in contrast to real-valued or count-data random variables,
no sparse parametric models available yet!

. . . except trivial cases:


one-point distribution, uniform distribution.

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Typical for cardinal random variables:


quantify basic properties, e. g.,

• location via mean or median,

• dispersion via variance or quartiles.

But:

• no arithmetic operations for categorical range


⇒ no mean, variance, etc.

• unordered range ⇒ no quantiles.

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

So how can we quantify


location and dispersion
of a categorical random variable?

Measures of location:
only mode of X in use, i. e.,
value i ∈ V such that pi ≥ pj for all j ∈ V.

Often not uniquely determined (e. g., uniform distribution).

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Intuitive understanding of dispersion:

X shows large dispersion



High uncertainty about the outcome of X

⇒ Uncertainty of categorical random variable?

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Two extreme cases:

Uniform distribution:
Maximal uncertainty about the outcome of X.

One-point distribution:
Perfect certainty about the outcome of X.

⇒ Hallmarks for definition of any measure of dispersion!

Contributions in literature (desirable properties, measures),


e. g., by Uschner (1987), Vogel & Kiesl (1999).

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Common standardized measures of dispersion:

m+1 ∑
Gini index: νG(X) := (1 − p2
j ).
m j

1 ∑
Entropy: νE(X) := − pj ln pj .
ln (m + 1) j

m+1
Chebycheff dispersion: νC(X) := (1 − maxj pj ).
m

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Important properties of these measures:

• continuous and symmetric functions of p,

• range [0; 1],

• maximum value 1 in case of uniform distribution,

• minimal value 0 in case of one-point distribution,

• inequality: m+1
m (1 − minj pj ) ≥ νG(X) ≥ νC(X).

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Empirical measures of dispersion


based on sample X1, . . . , XT .

Binarization Y t ∈ {0, 1}m+1 of Xt via Yt,i := δi,Xt .


1 ∑T
Unbiased estimator for pi: p̂i(T ) := T · t=1 Yt,i.

Abbreviate p := (p0, . . . , pm)⊤,


( )⊤ ∑
p̂(T ) := p̂0(T ), . . . , p̂m(T ) = T1 · t Y t,

and sk (p) := k
j pj for k ∈ N.

Christian H. Weiß — Darmstadt University of Technology


Categorical Random Variables — Properties

Weiß (2011): Empirical Gini index defined by


( ( ))
ν̂G(X) := m+1
m · T · 1 − s p̂(T ) .
T −1 2

If computed from i.i.d. data, then


ν̂G(X) is exactly unbiased and asymptotically normally dis-
tributed with variance determined by
[ ( )] ( )
V 1 − s2 p̂(T ) = 4
T · s 3 (p) − s 2
2 (p) + O(T −2).

Current research:
ν̂G(X) and ν̂E(X) for NDARMA processes (see below).

Christian H. Weiß — Darmstadt University of Technology


Categorical
Time Series

Terms & Notations


Categorical Time Series — Terms & Notations

Categorical process:
(Xt)N with N = {1, 2, . . .}, where each Xt takes one of finite
number of unordered categories.

Categorical time series:


Realizations (xt)t=1,...,T from (Xt)N.

(Xt)N said to be stationary


if joint distribution of (Xt, . . . , Xt+k )
independent of t for all k ∈ N0.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Terms & Notations

Notations for time-invariant probabilities:

If (Xt)N stationary:

• marginal probabilities pi := P (Xt = i) ∈ (0; 1).


p := (p0, . . . , pm)⊤, and

sk (p) := k
j pj for k ∈ N; obviously s1(p) = 1.

• bivariate probabilities pij (k) := P (Xt = i, Xt−k = j),


conditional probabilities pi|j (k) := P (Xt = i | Xt−k = j).

Christian H. Weiß — Darmstadt University of Technology


Categorical
Time Series

Visual Analysis
Categorical Time Series — Visual Analysis

“Little existing work deals directly with categorical time se-


ries analysis, and much less deals with the visualization of
categorical time series.” (Ribler, 1997, p. 11)

Several visual tools for real-valued time series:


line plot, periodogram, correlogram, etc.

At least line plot simple and universal tool!

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis
Linienplot (Bier.sta 1v*422c)
240

220

200

180

160
Konsum

140

120

100

80

60

40
1 51 101 151 201 251 301 351 401

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Visual tools for categorical time series:


only few proposals,
often from computer science and biology,
see survey by Weiß (2008).

In fact problematic: Analogue of line plot?

Lack of a natural order within purely categorical range

⇒ arrangement of range along ordinate


arbitrary and misleading.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis
Linienplot (Bovine.sta 6v*8419c) Linienplot (Bovine.sta 6v*8419c)

t c

c a
Wert

Wert
a g

g t

1 11 21 31 41 or 1 11 21 31 41

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Alternative (Keim & Kriegel, 1996): Map range onto set


of colors or symbols, plot x1, x2, . . . successively on space-
filling curve (line-by-line, column-by-column, Peano-Hilbert
curves, spiral, etc.).

However: These graphs perform poorly, characteristic fea-


tures of time series difficult to recognize, interpretation of
resulting plot is problematic, appearance depends heavily on
choice of underlying curve.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Genome of Bovine leukemia virus:

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

So how to analyze categorical time series visually?

Proposal by Ribler (1997): Rate evolution graph.

Categorical process (Xt)N, range coded as V = {0, . . . , m}.

Binarization Y t ∈ {0, 1}m+1 with Yt,i = δi,Xt , i = 0, . . . , m.


∑t
Define the cumulated sums C t := s=1 Y s, i. e.,

Ct,i = number of Xs, s = 1, . . . , t, equal to i.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Rate evolution graph of (Xt)N: (Ribler, 1997)

Multiple line plot of all component series Ct,i, i = 0, . . . , m,


i. e., all Ct,i are plotted simultaneously into one chart.

Interpretation:
Slope of graphs is estimate for corresp. marginal probability.

If (Xt)N stationary and at most moderately serially depen-


dent, then graphs approximately linear in t

⇒ Simple visual tool for checking stationarity.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Shakespeare’s (1593) poem “Venus and Adonis”:


/LQLHQSORW YHQXVDQGDGRQLVVWDY F




$

%

&

'

(

)

*

+

,

-

.

/
 0

1

2

3

 4

5

6

7

8

9

:
;

<

=

            

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Log data (2005) of Statistics server at Univ. of Würzburg:


Access to home directory of five members.
/LQLHQSORW DFFHVVBORJB]LHOVWDY F















            

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Visual Analysis

Some further tools: (see Weiß (2008) for references)

• VizTree or IFS circle transformation for visualizing


string frequencies; (→ ENBIS talk in 2005)

• pattern histograms (e. g., based on runs or cycles);

• categorical control charts (also see below).

However: These tools are very specialized.


Universal instrument (like line plot), providing multiple types
of information at once, still missing.

Christian H. Weiß — Darmstadt University of Technology


Categorical
Time Series

Serial Dependence
Categorical Time Series — Serial Dependence

Cardinal time series:


Convenient measure of serial dependence:
(Partial) Autocorrelation.

Categorical time series:


Measures of serial dependence?

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Weiß & Göb (2008): same strategy as for dispersion mea-


sures, i. e., we start with

Extreme cases:
• Xt, Xt−k (stochastically) independent iff pij (k) = pi · pj .

• Xt perfectly depends on Xt−k iff for every j = 0, . . . , m:


conditional distribution of Xt, conditioned on Xt−k = j,
is a one-point distribution.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Weiß & Göb (2008) proposed, among others,


v
u ( )2
u
u∑ pij (k)
u − pipj
Goodman and Kruskal’s τ : τ (k) = t ( ) ,
i,j pj 1 − s2(p)

v
u ( )2
u
u1
u ∑ pij (k) − pipj
Cramer’s v: v(k) = t · .
m i,j pipj

Some properties: τ (k), v(k) have range [0; 1] with


• Xt, Xt−k independent ⇔ τ (k) = v(k) = 0.
• Xt depends perfectly on Xt−k ⇔ τ (k) = v(k) = 1.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Cardinal case: Positive and negative autocorrelation.



Concept of signed dependence: (Weiß & Göb, 2008)

• Xt, Xt−k perfectly positively dependent,


if perfectly dependent and
if pi|i(k) = 1 for all i.
• Xt, Xt−k perfectly negatively dependent
if perfectly dependent and
if pi|i(k) = 0 for all i.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Measures of signed dependence:


(Weiß, 2011; Weiß & Göb, 2008)
Cohen’s κ:

j pjj (k) − s2(p) s2(p)
κ(k) = with range [− 1−s ; 1],
1 − s 2 (p) 2 (p)

Modified κ:
(∑ )
κ∗(k) = 1
m · p
j j|j (k) − 1 with range [− 1 ; 1].
m

Some properties:
• Xt, Xt−k independent ⇒ κ(k) = κ∗(k) = 0.
• Xt, Xt−k perf. positively dep. ⇔ κ(k) = κ∗(k) = 1.
• Xt, Xt−k perf. negatively dep. ⇒ κ(k), κ∗(k) minimal.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Empirical measures of (signed) serial dependence


based on time series X1, . . . , XT .

Weiß (2011): empirical Cohen’s κ and modified κ:


1 1 − j p̂jj (k, T )
κ̂(k) := 1 + − ,
T 1 − s2(p̂(T ))

(∑ )
p̂jj (k, T )
κ̂∗(k) := T1 + m
1 · −1 .
j p̂j (T )

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Empirical measures of (signed) serial dependence

If computed from i.i.d. data, then


κ̂(k), κ̂∗(k) are asymptotically normally distributed with
[ ] [ ]
E κ̂(k) = E κ̂∗(k) = 0 + O(T −2),

and variances given by


[ ] (
1+2s3(p)−3s2(p) )
V κ̂(k) = 1
T · 1 − (1−s2(p))2
+ O(T −2);
[ ]
V κ̂∗(k) 1
= m·T + O(T −2).

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Serial Dependence

Current and future research:

• κ̂(k), κ̂∗(k) and also τ̂ (k), v̂(k) for NDARMA processes;

• goodness-of-fit tests for NDARMA processes.

Christian H. Weiß — Darmstadt University of Technology


Categorical
Time Series

Modelling
Categorical Time Series — Modelling

Models for stationary categorical processes


with range V = {0, . . . , m}:

• pth order Markov model: (m + 1)p · m parameters;

• variable length M. m. of Bühlmann & Wyner (1999):


more parsimonious, but model choice difficult;

• MTD(p) model of Raftery (1985):


still m(m + 1) + p − 1 parameters;

• NDARMA(p, q) models of Jacobs & Lewis (1983):


m + p + q parameters, also non-Markovian dependence.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Modelling

(Xt)Z, (ϵt)Z: categorical processes with range V = {0, . . . , m};


(ϵt)Z: i.i.d. with marginal distribution P (ϵt = j) = pj > 0,
ϵt independent of (Xs)s<t.

For φq > 0, with ϕp > 0 if p ≥ 1, let

D t = (αt,1, . . . , αt,p, βt,0, . . . , βt,q) ∼ M U LT (1; ϕ1, . . . , ϕp, φ0, . . . , φq)

be i.i.d. and independent of (ϵt)Z, (Xs)s<t.

(Xt)Z is NDARMA(p, q) process if

Xt = αt,1 · Xt−1 + . . . + αt,p · Xt−p + βt,0 · ϵt + . . . + βt,q · ϵt−q.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Modelling

Some properties:

• marginal distribution P (Xt = j) = pj ;

• only shows positive serial dependence, and


κ(k) = κ∗(k) = v(k) = τ (k);

• Yule-Walker-type equations
∑p ∑q−k
κ(k) = j=1 ϕj ·κ(|k−j|) + i=0 φi+k ·r(i) for k ≥ 1,

where r(i) = 0 for i < 0, r(0) = φ0, and


∑i−1 ∑q
r(i) = j=max {0,i−p} ϕi−j ·r(j) + j=0 δij ·φj for i > 0.

Christian H. Weiß — Darmstadt University of Technology


Categorical Time Series — Modelling

Disadvantage of NDARMA models:


Only describe positive dependence, i. e.,
categorical time series with long runs of symbols.
(e. g., genetic sequence vs. letters of text)

Issues for future research:

• Find simple and sparsely parametrized models


that also allow for negative dependence!

• Definition of trend or seasonality?

• And: How to remove such trend or seasonality?

Christian H. Weiß — Darmstadt University of Technology


Categorical
Processes

Monitoring
Categorical Processes — Monitoring
Only very few approaches for monitoring processes of
unordered and mutually exclusive categories.

Monitoring samples from an i.i.d. categorical process:


Duncan (1950), Marcucci (1985), Nelson (1987), Mukho-
padhyay (2008) plot Pearson’s χ2-statistic for goodness of
fit on a control chart.

Continuously monitoring cat. process (100 % inspect.):


Weiß (2010) proposes
• two moving-average-type charts,
based either on Pearson statistic or Gini index,
• a (k, r)-runs chart.
Christian H. Weiß — Darmstadt University of Technology
Categorical Processes — Monitoring
(k, r)-run: finished after k successive observations
of either ‘1’ or ‘2’ or . . . or ‘r’, i. e., after observing
one of (1, . . . , 1), (2, . . . , 2), . . . , (r, . . . , r) of length k each.
(k,r)
(k, r)th run lengths (Yn )N determined as
(k,r)
Y1 := No. obs. until first occurrence of k-tuple of ‘1’s or . . . ‘r’s,
(k,r)
Yn := No. obs. after (n − 1)th occurrence of k-tuple of ‘1’s or . . . ‘r’s
until nth occurrence of k-tuple of ‘1’s or . . . ‘r’s, for n ≥ 2.

Example: m = 3 (i. e., V = {0, 1, 2, 3})


and (k, r) = (2, 3), (fictive) time series:

1
| 2 0 0 0 | 3 0 0{z1 0 1 1} 1
{z 3 0 2 2} 0 | {z1} 3
| 0 2{z0 3 3} 2 3 . . .
9 8 2 6

Christian H. Weiß — Darmstadt University of Technology


Categorical Processes — Monitoring

If (Xt)N Markov chain, then


(k,r)
(Yn )N i.i.d. process, range Nk := {k, k + 1, . . .}.

Properties: (Chryssaphinou et al., 1994)

(1 − piz)(piz)k
r

Let ck,r (z) := , then
i=1 1 − (piz) k

(k,r) 1 1 + ck,r (1) − 2c′k,r (1)


E[Y ] = , V [Y (k,r)] = .
ck,r (1) c2
k,r (1)

Christian H. Weiß — Darmstadt University of Technology


Categorical Processes — Monitoring

(k, r)-runs chart: (Weiß, 2010)


(k,r)
(Yn )N plotted on chart with k ≤ LCL < U CL.
Exact AN E computation with Markov chain approach.

Issues for future research:

• charts based on different patterns,


e. g., cycles instead of runs;

• CUSUM or EWMA methods for categorical processes,


e. g., related to patterns or certain statistics; . . .

Christian H. Weiß — Darmstadt University of Technology


Conclusions

In a nutshell,
the field of categorical time series . . .

• is relevant for practice, and

• offers a lot of topics for future research,


in any of the disciplines

analysis, modelling and monitoring!

Christian H. Weiß — Darmstadt University of Technology


Thank You
for Your Interest!

Christian H. Weiß

Department of Mathematics

Darmstadt University of Technology


Literature
Box et al. (1994): Time series analysis – forecasting and control. 3rd edition, Prentice
Hall, Inc., New Jersey.
Bühlmann & Wyner (1999): Variable length Markov chains. Ann. Stat., 27, 480–513.
Chryssaphinou et al. (1994): On the waiting time of appearance of given patterns. In:
Godbole & Papastavridis (Eds.): Runs and Patterns in Probability, Kluwer Academic
Publishers, 231–241.
Duncan (1950): A chi-square chart for controlling a set of percentages. Industrial
Quality Control, 7, 11–15.
Jacobs & Lewis (1983): Stationary discrete autoregressive-moving average time series
generated by mixtures. J. Time Series Anal., 4(1), 19–36.
Keim & Kriegel (1996): Visualization techniques for mining large databases: A com-
parison. IEEE Trans. Knowl. Data Eng., 8(6), 923–938.
Marcucci (1985): Monitoring multinomial processes. J. Qual. Tech., 17(2), 86–91.
Mukhopadhyay (2008): Multivariate attribute control chart using Mahalanobis D2 sta-
tistic. J. Appl. Statist., 35(4), 421–429.
Nelson (1987): A chi-square control chart for several proportions. J. Qual. Tech.,
19(4), 229–231.
Raftery (1985): A model for high-order Markov chains. J. Royal Stat. Soc., B, 47(3),
528–539.

(. . . )

Christian H. Weiß — Darmstadt University of Technology


Literature

(. . . )
Ribler (1997): Visualizing categorical time series data with applications to computer
and communications network traces. PhD thesis, Virginia State University.
Uschner (1987): Streuungsmessung nominaler Merkmale mit Hilfe von Paarverglei-
chen. Doctoral dissertation, University Erlangen-Nürnberg.
Vogel & Kiesl (1999): Deskriptive und induktive Eigenschaften zweier Streuungsmaße
für nominale Merkmale. In: Vogel (Edt.): Arbeiten aus der Statistik, Univ. Bamberg.
Weiß (2008): Visual analysis of categorical time series. Statist. Meth., 5(1), 56–71.
Weiß (2010): Continuously monitoring categorical processes. Qual. Tech. Quant. Ma-
nag., to appear.
Weiß (2011): Empirical measures of signed serial dependence in categorical time se-
ries. J. Statist. Comp. Simulation, 81(4), 411–429.
Weiß & Göb (2008): Measuring serial dependence in categorical time series. Adv.
Statist. Anal., 92(1), 71–89.
Ye et al. (2002): Multivariate statistical analysis of audit trails for host-based intrusion
detection. IEEE Trans. Computer, 51(7), 810–820.

Christian H. Weiß — Darmstadt University of Technology

Das könnte Ihnen auch gefallen