Data Mining in Digital Humanities
Nicoleta ROGOVSCHI
Data Mining
Some ‘problems’:
– Which technique to choose?
Association rule mining (ARM), classification, or clustering
Answer: It depends on what you want to do with the data.
Why Mine Data?
Why Mine Data? Scientific Viewpoint
Social needs:
– Weather forecasting;
– Soil erosion prediction;
– Flood and earthquake prediction
Examples: What is (not) Data Mining?
Data Mining: Classification Schemes
Decisions in Data Mining
Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
Description Tasks
– Find human-interpretable patterns that describe the data.
Data Mining classical example -
Association Rule Discovery: Definition
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
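One way to check the two discovered rules is to count how often their items occur together. The sketch below uses the five transactions from the table; the measures `support` and `confidence` are the usual association-rule measures and are an assumption here, since the slide does not name them.

```python
# Transactions copied from the slide's table (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


# Rule {Milk} --> {Coke}
milk_coke = (support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))
# Rule {Diaper, Milk} --> {Beer}
diaper_milk_beer = (
    support({"Diaper", "Milk", "Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)

print("Milk -> Coke: support, confidence =", milk_coke)
print("Diaper, Milk -> Beer: support, confidence =", diaper_milk_beer)
```

Milk appears in four of the five transactions and Coke accompanies it in three of them, which is why the first rule stands out in such a small table.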
The Sad Truth About Diapers and Beer
Association Rule Discovery: Application
Free open-source data mining software
Data Mining vs. KDD
Steps of a KDD Process
Data Mining and KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
Challenges…
Social Implications
Privacy preservation
Profiling of people
Unauthorized use
Some solutions, such as:
Collaborative unsupervised learning;
Transfer learning, …
DM: Remarks
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
DBMS, OLAP, and Data Mining
Task
– DBMS: extraction of detailed and summary data
– OLAP: summaries, trends and forecasts
– Data Mining: knowledge discovery of hidden patterns and insights
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased
with milk. (association rules)
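The difference between the two kinds of question can be sketched in a few lines. The purchase records, the customer names and the threshold `min_count` below are hypothetical, chosen only for illustration: the first query filters on a stated condition, while the second looks for a pattern that nobody specified in advance.

```python
from collections import Counter

# Hypothetical purchase records: customer -> list of baskets (an assumption,
# not data from the slides).
purchases = {
    "alice": [{"milk", "bread"}, {"milk", "cereal"}],
    "bob": [{"beer", "chips"}],
    "carol": [{"milk", "cereal", "bread"}],
}

# Database-style query: "Find all customers who have purchased milk."
milk_buyers = sorted(
    customer
    for customer, baskets in purchases.items()
    if any("milk" in basket for basket in baskets)
)

# Mining-style question: "Find all items which are frequently purchased with milk."
co_counts = Counter()
for baskets in purchases.values():
    for basket in baskets:
        if "milk" in basket:
            co_counts.update(basket - {"milk"})

min_count = 2  # hypothetical frequency threshold
frequent_with_milk = sorted(
    item for item, count in co_counts.items() if count >= min_count
)

print(milk_buyers)
print(frequent_with_milk)
```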
DATA
What is data?
Samples and variables
Population
Group or set of individuals that are analyzed.
Variables
Set of characteristics of a population.
Classical data matrix
Example of a data table (matrix)
Matrix of data
Objects: x1, x2, …, x10.
Variables: Y1, Y2, …, Y4.
Other names:
- Individuals: observations, objects, instances, transactions.
- Variables: attributes, dimensions, descriptors, components.
Data types
Quantitative variables
Qualitative variables
Quantitative variables
Qualitative variables
- Binary data: can take two states (true or false, 0 or 1, yes or no).
Ex: gender, having credit or not ...
- Ordered categorical data: for example, data from a survey (1: very satisfied, 2: satisfied, …).
Ex: low, medium, high; height: small, medium, large.
Goals
Why use informatics?
Objectivity!
Mining quantitative univariate data
Univariate statistics of variables
Position criteria
Statistics of central tendency
The mode
The median
The mean
The mode
The mode is the most frequently occurring value in the data set (it is not necessarily unique).
Normally, the mode is used for categorical data, where we wish to know which category is the most common.
In the example data set of forms of transport, the most common form, and therefore the mode, is the bus. However, one problem with the mode is that it is not necessarily unique, which causes difficulties when two or more values share the highest frequency.
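A minimal sketch of both situations, using Python's standard `statistics.multimode`. The two answer lists are hypothetical, since the slide's transport table did not survive extraction.

```python
from statistics import multimode

# Hypothetical survey answers (not the slide's data).
transport = ["bus", "car", "bus", "train", "bus", "car", "walk"]
# A second hypothetical sample in which two categories tie.
tied = ["bus", "car", "bus", "car", "train"]

# multimode returns every value that reaches the highest frequency,
# in order of first appearance.
single_mode = multimode(transport)
several_modes = multimode(tied)

print(single_mode)
print(several_modes)
```

Returning a list rather than a single value makes the tie explicit instead of silently picking one of the candidates.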
Depending on the number of modes, there are three types of distribution:
Unimodal
Bimodal
Multimodal
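The median
The median is the value that cuts the distribution into two parts, each containing the same number of observations. For values sorted in ascending order, $x_{(1)} \le \dots \le x_{(n)}$:
$$ \text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases} $$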
The median is the value that divides the population into two sub-populations of the same size.
Example
With an even number of scores (here, ten), we take the 5th and 6th scores in the ordered data set and average them to get a median of 55.5.
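The odd and even cases can be sketched as follows. The scores are illustrative values, an assumption since the slide's table did not survive, chosen so that the 5th and 6th ordered values are 55 and 56.

```python
def median(values):
    """Middle value of the sorted data; the average of the two middle
    values when the number of observations is even."""
    if not values:
        raise ValueError("median of an empty sample is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative scores (hypothetical): ten values, so n is even.
scores_even = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]
# Adding one more score makes n odd.
scores_odd = scores_even + [92]

print(median(scores_even))
print(median(scores_odd))
```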
Advantages of the median:
The median is not affected by outliers.
The median is useful when we have missing data.
Skewed data are common: home prices and income data for a group of people are often very skewed. For such data, the median is often reported instead of the mean.
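A small sketch of why the median is preferred for skewed data. The income figures are hypothetical; one very large income is added to show how differently the mean and the median react.

```python
from statistics import mean, median

# Hypothetical yearly incomes for a small group (not data from the slides).
incomes = [30_000, 32_000, 35_000, 38_000, 40_000]
# The same group with one very large income added.
incomes_with_outlier = incomes + [1_000_000]

mean_before = mean(incomes)
median_before = median(incomes)
mean_after = mean(incomes_with_outlier)
median_after = median(incomes_with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

The single extreme value multiplies the mean several times over, while the median moves only from one middle observation towards the next.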
The mean
The mean is the value that each observation would have if all observations were identical and the total (sum) were unchanged.
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Notes
Summary
Dispersion criteria
Definition
Dispersion criteria
Standard deviation
Interquartile range or Interdecile range
Range
Mean difference
Median absolute deviation
Average absolute deviation (or simply average deviation)
Distance standard deviation
Interest
Dispersion measure
Range
Standard deviation
Variance
Range
Variance
The variance is the mean of the squared deviations of a variable from its expected value (mean):
$$ s^2 = \operatorname{var}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
The variance of a variable has units that are the square of the units of the variable itself!
Standard deviation
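$$ s = \sqrt{s^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
Example:
Sample A = (79, 80, 81)
Sample B = (60, 80, 100)
mean(A) = mean(B) = 80
Standard-Deviation(A) ≠ Standard-Deviation(B)
Standard-Deviation(A) = 1
Standard-Deviation(B) = 20
Note: the values 1 and 20 correspond to the sample denominator n − 1; with the 1/n denominator of the formula, they are approximately 0.82 and 16.33.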
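The example with samples A and B can be reproduced with Python's standard `statistics` module. The sketch computes the standard deviation with both denominators: `pstdev` divides by n, while `stdev` divides by n − 1, and it is the latter that yields the values 1 and 20 quoted for A and B.

```python
from statistics import mean, pstdev, pvariance, stdev

sample_a = [79, 80, 81]
sample_b = [60, 80, 100]

# Both samples share the same mean but differ in spread.
mean_a, mean_b = mean(sample_a), mean(sample_b)

# Denominator n (population formulas).
var_a_n = pvariance(sample_a)
sd_a_n, sd_b_n = pstdev(sample_a), pstdev(sample_b)

# Denominator n - 1 (sample formulas).
sd_a_n1, sd_b_n1 = stdev(sample_a), stdev(sample_b)

print("means:", mean_a, mean_b)
print("std (1/n):     ", round(sd_a_n, 2), round(sd_b_n, 2))
print("std (1/(n-1)): ", sd_a_n1, sd_b_n1)
```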
Remarks
Describe a distribution
Skewness
$$ Sk_x = \frac{N}{(N-1)(N-2)} \sum_{i} \frac{(x_i - \bar{x})^3}{s_x^3} $$
The skewness
If:
• Skx = 0: the distribution is perfectly symmetrical (values are spread equally towards higher and lower values of the variable)
• Skx > 0: positively skewed (the distribution spreads more towards higher values of the variable)
• Skx < 0: negatively skewed (the distribution spreads more towards lower values of the variable)
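The three cases can be illustrated with a short sketch of the skewness coefficient with the factor N / ((N − 1)(N − 2)). Two assumptions are made here: the standard deviation in the denominator is taken with the n − 1 denominator, as is usual for this adjusted coefficient, and the three samples are hypothetical.

```python
from math import sqrt


def skewness(values):
    """Adjusted skewness: N / ((N - 1)(N - 2)) * sum(((x - mean) / s) ** 3).

    Assumption: s is the sample standard deviation (denominator N - 1).
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]            # long tail towards high values
left_tailed = [-x for x in right_tailed]      # mirror image

print(skewness(symmetric))
print(skewness(right_tailed))
print(skewness(left_tailed))
```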
The skewness
Another method to determine the skewness of a distribution is to compare the mode, the median and the mean:
- Symmetric when mode = median = mean
The kurtosis
$$ Ku_x = \frac{N(N+1)}{(N-1)(N-2)(N-3)} \sum_{i} \frac{(x_i - \bar{x})^4}{s_x^4} - \frac{3(N-1)^2}{(N-2)(N-3)} $$
Quartile
a<-c(0,3,3,5,5,6,7,8,9,9,12,12,18)
barplot(a)
mean = 7.46
q1 = 5
median = 7
q3 = 9
Lower fence = 0
Upper fence = 12
Lower fence > Q1 – 1.5(IQR)
Upper fence < Q3 + 1.5(IQR)
Smallest observation (sample minimum)
Largest observation (sample maximum)
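The values listed for the sample `a` can be recomputed in Python as a cross-check. This is a sketch under one assumption: quartiles are computed with the `inclusive` method of `statistics.quantiles`, which interpolates between observations and reproduces q1 = 5 and q3 = 9 for this sample. The fences are taken, as on the slide, as the most extreme observations strictly inside Q1 – 1.5(IQR) and Q3 + 1.5(IQR).

```python
from statistics import mean, quantiles

a = [0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18]

# Assumption: "inclusive" quartiles (linear interpolation between observations).
q1, med, q3 = quantiles(a, n=4, method="inclusive")
iqr = q3 - q1

low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# Fences: most extreme observations strictly inside the limits.
lower_fence = min(x for x in a if x > low_limit)
upper_fence = max(x for x in a if x < high_limit)
# Observations outside the limits are flagged as outliers.
outliers = [x for x in a if x <= low_limit or x >= high_limit]

print("mean =", round(mean(a), 2))
print("q1, median, q3 =", q1, med, q3)
print("fences =", lower_fence, upper_fence)
print("outliers =", outliers)
```

The largest observation, 18, lies beyond Q3 + 1.5(IQR) = 15, so it is reported as an outlier and the upper fence stops at 12.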
Boxplot
[Figure: boxplot with labels for an outlier, the largest value and the upper fence]
Example : Boxplot
First, we sort the sample in ascending order. Then we find the median.
7, 8, 14, 18, 21, 26, 40, 42, 44, 50, 52, 63.
With 12 values, the median is the average of the 6th and 7th values: (26 + 40) / 2 = 33.
There are 6 numbers before the median: 7, 8, 14, 18, 21, 26.
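The next steps of this example are not preserved in the slides. A plausible continuation, given here only as an assumption, is to take the quartiles as the medians of the lower and upper halves of the ordered sample:

```python
from statistics import median

sample = sorted([7, 8, 14, 18, 21, 26, 40, 42, 44, 50, 52, 63])

n = len(sample)
half = n // 2
# Values before and after the median; with an odd n the middle value is
# left out of both halves.
lower_half = sample[:half]
upper_half = sample[half:] if n % 2 == 0 else sample[half + 1:]

med = median(sample)
q1 = median(lower_half)  # assumption: Q1 is the median of the lower half
q3 = median(upper_half)  # assumption: Q3 is the median of the upper half

print("median =", med)
print("lower half =", lower_half, "-> Q1 =", q1)
print("upper half =", upper_half, "-> Q3 =", q3)
```

Other quartile conventions (for example, interpolation between observations) can give slightly different values for the same sample.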
Gaussian distribution
Missing values and outliers
Detecting missing values and outliers is a step to carry out for any type of model.
It is important to ask why the data are missing; this can help with finding a solution to the problem.
If the values are missing at random, there is still information about each variable in each unit, but if the values are missing systematically the problem is more severe because the sample cannot be representative of the population.
For example, consider a study of the relation between IQ and income. If participants with an above-average IQ do not answer the question ‘What is your salary?’, the results may show that there is no association between IQ and salary, while in fact there is a relationship. Because of these problems, methodologists routinely advise researchers to design research so as to minimize the incidence of missing values (Ader, H.J., Mellenbergh, G.J. 2008).
Missing data
Outliers
Examples:
Outliers
– Delete the affected observations, if there are not too many of them and they are distributed sufficiently at random
– Keep the observation and the variable, but replace the outlier with another value that is as close as possible to its true value (mean...)
– Keep the observations but do not use the affected variables for mining the data.
Data standardization
Centering
$$ y = x - \bar{x} $$
Centering and reduction (standardization)
$$ y = \frac{x - \bar{x}}{s} $$
Other normalizations
Min-Max
$$ y = \frac{x - \min(x)}{\max(x) - \min(x)} $$
Logarithmic
$$ y = \log(x) $$
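The four transformations can be sketched as small Python functions. Some choices are assumptions: the standard deviation uses the 1/n denominator, the logarithm is the natural logarithm, and the function names and sample values are hypothetical.

```python
from math import log, sqrt


def center(x):
    """Centering: subtract the mean from every value."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def standardize(x):
    """Centering and reduction: (x - mean) / s, with s using the 1/n denominator."""
    m = sum(x) / len(x)
    s = sqrt(sum((v - m) ** 2 for v in x) / len(x))
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in x]


def min_max(x):
    """Min-Max: rescale the values to the interval [0, 1]."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("cannot rescale a constant variable")
    return [(v - lo) / (hi - lo) for v in x]


def log_transform(x):
    """Logarithmic: natural logarithm; all values must be strictly positive."""
    if any(v <= 0 for v in x):
        raise ValueError("the logarithm requires strictly positive values")
    return [log(v) for v in x]


values = [2.0, 4.0, 6.0]  # hypothetical sample
print(center(values))
print(standardize(values))
print(min_max(values))
print(log_transform(values))
```

After standardization every variable has mean 0 and standard deviation 1, which makes variables measured in different units comparable.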
Example: Table of countries
Bibliography