Data Mining in Digital Humanities

Data mining involves applying statistical and machine learning techniques to discover patterns in large datasets. It is used to efficiently discover previously unknown patterns in data that can be useful and understandable. Common techniques include classification, clustering, association rule mining, and regression. Open source software like R, Weka, and RapidMiner as well as commercial tools from IBM, Microsoft, SAS and Oracle are used for data mining. Data mining is part of the broader process of knowledge discovery in databases (KDD) which aims to find useful patterns in data.


Data Mining

Nicoleta ROGOVSCHI

[email protected]
Data mining

«Data mining offers the capability to view data in a new light, discovering associations and patterns not appreciated before. For the humanities domain, it exemplifies the interdisciplinary efforts of digital humanities.»
Jonathan Hagood
Jonathan Hagood

2
Data mining

 Data mining: applying statistics and pattern recognition to discover knowledge from data

 A field at the intersection of computer science and statistics

 The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets

 The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

3
Data Mining

 Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

 Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.

 Objective: fit data to a model
– Descriptive
– Predictive

 Some ‘problems’:
– Which technique to choose?
 ARM / Classification / Clustering
 Answer: depends on what you want to do with the data
– Search strategy – technique to search the data
 Interface? Query language?
 Efficiency

4
Why Mine Data?

 Lots of data is being collected and warehoused


– Web data, e-commerce
– purchases at department stores
– Bank/Credit Card transactions

 Computers have become cheaper and more powerful (from 32 bits to 64 bits)

 Competitive Pressure is Strong


– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)

5
Why Mine Data? Scientific Viewpoint

 Data collected and stored at enormous speeds (GB/hour)


– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data

 Traditional techniques infeasible for raw data

 Data mining may help scientists


– in classifying and segmenting data
– in Hypothesis Formation

 Social needs:
– Weather prediction
– Soil erosion prediction
– Flood and earthquake prediction
6
Examples: What is (not) Data Mining?

 What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

7
Data Mining: Classification Schemes

 Decisions in data mining


– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

 Data mining tasks


– Descriptive data mining
– Predictive data mining

8
Decisions in Data Mining

 Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
 Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels

 Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.

 Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

9
Data Mining Tasks

 Prediction Tasks
– Use some variables to predict unknown or future values of other
variables

 Description Tasks
– Find human-interpretable patterns that describe the data.

Common data mining tasks:


– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]

10
Data Mining classical example - Association Rule Discovery: Definition

 Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

11
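The rules above can be checked by hand with support and confidence counts. A minimal sketch (not part of the slides) on the five-transaction table:

```python
# Minimal sketch: support and confidence for candidate rules on the
# 5-transaction example above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# {Diaper, Milk} --> {Beer}: 3 transactions contain Diaper and Milk,
# 2 of them also contain Beer.
print(confidence({"Diaper", "Milk"}, {"Beer"}))   # 2/3
print(confidence({"Milk"}, {"Coke"}))             # 3/4
```

Real association-rule miners (Apriori, FP-growth) avoid enumerating every candidate, but the measures they report are exactly these.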
The Sad Truth About Diapers and Beer

 So, don’t be surprised if you find six-packs stacked next to diapers!

12
Association Rule Discovery:
Application

 Supermarket shelf management.


– Goal: To identify items that are bought together by sufficiently
many customers.
– Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
– A classic rule:
 If a customer buys diapers and milk, then he is very likely to buy beer.

13
Free open-source data mining software

 Carrot2: Text and search results clustering framework.


 Chemicalize.org: A chemical structure miner and web search engine.
 ELKI: A university research project with advanced cluster analysis and outlier detection
methods written in the Java language.
 GATE: a natural language processing and language engineering tool.
 JHepWork: Java cross-platform data analysis framework developed at Argonne National
Laboratory.
 KNIME: The Konstanz Information Miner, a user friendly and comprehensive data analytics
framework.
 NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and
statistical natural language processing (NLP) for the Python language.
 Orange: A component-based data mining and machine learning software suite written in the
Python language.
 R: A programming language and software environment for statistical computing, data mining,
and graphics. It is part of the GNU project.
 RapidMiner: An environment for machine learning and data mining experiments.
 UIMA: The UIMA (Unstructured Information Management Architecture) is a component
framework for analyzing unstructured content such as text, audio and video – originally
developed by IBM.
 Weka: A suite of machine learning software applications written in the Java programming
language.
 ML-Flex: A software package that enables users to integrate with third-party machine-
learning packages written in any programming language, execute classification analyses in
parallel across multiple computing nodes, and produce HTML reports of classification results.
14
Commercial data-mining software

 IBM InfoSphere Warehouse: Intelligent Miner - in-database data


mining platform provided by IBM
 Microsoft Analysis Services: data mining software provided by
Microsoft
 SAS: Enterprise Miner – data mining software provided by the SAS
Institute.
 STATISTICA: Data Miner – data mining software provided by
StatSoft.
 Oracle Data Mining: data mining software by Oracle.
 Clarabridge: enterprise class text analytics solution.
 LIONsolver: an integrated software application for data mining,
business intelligence, and modeling that implements the Learning
and Intelligent OptimizatioN (LION) approach
 MATLAB & Simulink

15
Data Mining vs. KDD

 Knowledge Discovery in Databases (KDD):


process of finding useful information and
patterns in data.

 Data Mining: Use of algorithms to extract the


information and patterns derived by the KDD
process.

16
Steps of a KDD Process

 Learning the application domain:
– relevant prior knowledge and goals of the application

 Creating a target data set: data selection

 Data cleaning and preprocessing (may take 60% of the effort!)

 Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation

 Choosing the functions of data mining:
– summarization, classification, regression, association, clustering

 Choosing the mining algorithm(s)

 Data mining: search for patterns of interest

 Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.

 Use of the discovered knowledge

17
Data Mining and KDD Issues

 Human Interaction
 Overfitting
 Outliers
 Interpretation
 Visualization
 Large Datasets
 High Dimensionality

18
Challenges…

 Different types of data (text, images, video…)
 Missing data
 Irrelevant data (object selection)
 Noisy data (irrelevant features)
 Changing data (data streams)

19
Social Implications

 Privacy preservation
 Profiling people
 Unauthorized use

Some solutions:
Collaborative unsupervised learning;
Transfer learning, …

20
DM: Remarks

 Data mining is evaluated by:
– Usefulness
– Return on Investment (ROI)
– Accuracy
– Space/Time

 Database perspective on data mining:
– Scalability
– Real-world data
– Updates
– Ease of use

21
Data Mining: Confluence of Multiple Disciplines

Data mining lies at the confluence of: database technology, statistics, machine learning, visualization, information science, and other disciplines.

22
DBMS, OLAP, and Data Mining

                 DBMS                          OLAP                           Data Mining

Task             Extraction of detailed        Summaries, trends and          Knowledge discovery of
                 and summary data              forecasts                      hidden patterns and insights

Type of result   Information                   Analysis                       Insight and prediction

Method           Deduction (ask the            Multidimensional data          Induction (build the model,
                 question, verify with         modeling, aggregation,         apply it to new data, get
                 data)                         statistics                     the result)

Example          Who purchased mutual          What is the average income     Who will buy a mutual fund
question         funds in the last 3 years?    of mutual fund buyers by       in the next 6 months and
                                               region by year?                why?

23
Query Examples

 Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk
 Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with milk. (association rules)

24
DATA

25
What is data?

Data is a basic description of a phenomenon.

A phenomenon can be described by one or more criteria; these criteria are called variables:

A single criterion: univariate data

Several criteria: multivariate data

A phenomenon is thus described by a set of univariate or multivariate data.

26
Samples and variables

 Population
Group or set of individuals that are analyzed.

 Variables
Set of characteristics of a population.

27
Classical data matrix

 For n objects and p variables, we define the data table X: a rectangular matrix with n rows and p columns, where x_i^j is the value of variable j for object i:

X = (x^1, ..., x^p) =

    | x_1^1  x_1^2  ...  x_1^p |
    | x_2^1  x_2^2             |
    |   .         x_i^j        |
    | x_n^1   ...        x_n^p |

28
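In code, such an n × p data matrix is simply a table of rows (objects) and columns (variables). A minimal sketch with made-up values (the heights, weights and ages below are illustrative only):

```python
# Sketch of the n x p data matrix X as a list of rows:
# rows = objects (individuals), columns = variables.
X = [
    [1.70, 65.0, 30],   # object x1: height (m), weight (kg), age
    [1.62, 58.5, 25],   # object x2
    [1.85, 80.2, 41],   # object x3
]
n, p = len(X), len(X[0])          # n objects, p variables

# x_i^j : value of variable j for object i (0-indexed here).
x_2_1 = X[1][0]                   # second object, first variable -> 1.62

# A whole variable (column) x^j, e.g. the weights:
weights = [row[1] for row in X]   # [65.0, 58.5, 80.2]
```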
Example of a data table (matrix)

Matrix of data

 Objects : x1,x2,…,x10.
 Variables :Y1,Y2,…,Y4.

Other definitions:
– Individuals: observations, objects, instances, transactions.
– Variables: attributes, dimensions, descriptors, components.

29
Data types

 Quantitative variables

 Qualitative variables

 Others : text data

30
Quantitative variables

 Quantitative variables

– Continuous (ex: the size or weight of a person, the time to complete a task, the volume of an object, the speed of a car)

– Discrete (ex: counts: the number of persons in a room, the number of items in a list)

31
Qualitative variables

 Qualitative variables (categorical)

– Binary data: can take two states (true or false, 0 or 1, yes or no). Ex: gender, having a credit or not.

– Unordered categorical data (nominal). Ex: eye colour.

– Ordered categorical data (ordinal): e.g. data from a survey (1: very satisfied, 2: satisfied, …). Ex: low, medium, high; small, medium, large.
32
Goals

 Answer a question or solve a problem
 Make data intelligible
 Retrieve and select information
 Determine the validity of this information:
Problem of sampling fluctuations
Problem of generalization

33
Why use computers?

 Power (speed) of calculation and processing of large databases

 Visualization (dimension reduction)

 Objectivity!

34
Mining quantitative univariate data

35
Univariate statistics of variables

 First steps in data mining:

Detect anomalies in the distributions (e.g. extreme or missing values)

Decide how to discretize continuous variables (if applicable)

Extract important information that can be useful in further analyses (e.g. age, average income of the population)

Summarize the contents using a graph

36
Position criteria

A position criterion is a single value that best represents the corresponding set of data values.

37
Statistics of central tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics.

They can be used to answer questions such as:

 What is the "typical" salary of a football player?
 How many children does a "typical" French family have?
 What is the "typical" exam grade of the students?

38
Statistics of central
tendency

 The term central tendency refers to the


"middle" value or perhaps a typical value of the
data, and is measured using the mean,
median, or mode. Each of these measures is
calculated differently, and the one that is best
to use depends upon the situation.

39
Statistics of central tendency

 The mode

 The median

 The mean

40
The mode

The mode is the most frequently occurring value in the data set (not necessarily unique).

On a chart it corresponds to the highest bar in a bar chart or histogram; the mode can therefore be thought of as the most popular option.

41
The mode
Normally, the mode is used for categorical data where we wish to know
which is the most common category :

We can see above that the most common form of transport, in this
particular data set, is the bus. However, one of the problems with the mode
is that it is not unique, so it leaves us with problems when we have two or
more values that share the highest frequency.
42
The mode

Depending on the number of modes, a distribution can be:
 Unimodal
 Bimodal
 Multimodal

43
The median

median = x_((n+1)/2)                     if n is odd
median = ( x_(n/2) + x_(n/2+1) ) / 2     if n is even

The median is the measure which defines the value that cuts the distribution into two parts, each of them having the same number of observations.

44
The median

The median is the value which cuts the population into two populations of the same size.

The median is determined by sorting the data set from lowest to highest values and taking the data point in the middle of the sequence.
45
Example

 Suppose we have the data below:

 We first need to rearrange that data into order


of magnitude (smallest first):

 Our median mark is the middle mark - in this


case 56. It is the middle mark because there
are 5 scores before it and 5 scores after it.

46
Example

 This works fine when you have an odd number of


scores but what happens when you have an even
number of scores? What if you had only 10 scores?
Well, you simply have to take the middle two scores
and average the result. So, if we look at the example
below:

 We again rearrange that data into order of magnitude


(smallest first):

 Only now we have to take the 5th and 6th score in our
data set and average them to get a median of 55.5.

47
The median

So, the median uses only the relative position of the


observations.

Sample A : 13, 15, 17,19, 23


Sample B : 13, 15, 17,19, 400

Advantages:
 The median is not affected by outliers
 The median is useful when we have missing data

48
The median

The median often is used when there are a few extreme


values that could greatly influence the mean and distort
what might be considered typical.

This often is the case with home prices and with income data for a
group of people, which often is very skewed. For such data, the
median often is reported instead of the mean.

Example: In a group of people, if the salary of one person is 10


times the mean, the mean salary of the group will be higher because
of the unusually large salary. In this case, the median may better
represent the typical salary level of the group.

49
The mean

The mean is the value that each element of the sample would have if they were all identical without changing the total (global) value.

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set:

x̄ = (1/n) Σ_{i=1..n} x_i
50
Notes

 All three criteria give different information

 The mode uses only part of the distribution (only the most frequent value is considered)
 The median uses only the position of the observations
 The mean is sensitive to (depends on) extreme values (outliers)

51
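The three measures, and their different sensitivity to outliers, can be seen with Python's standard library on the two samples from the median slide (13, 15, 17, 19, 23 versus the same data with 400 as an extreme value):

```python
# Comparing the three central-tendency measures with the standard library.
import statistics

grades = [13, 15, 17, 19, 23]
outlier_grades = [13, 15, 17, 19, 400]   # same data, one extreme value

print(statistics.mean(grades))            # 17.4
print(statistics.median(grades))          # 17
print(statistics.mode([2, 2, 3, 5]))      # 2 (most frequent value)

# The median uses only positions, so the outlier barely affects it,
# while the mean is pulled far upward:
print(statistics.median(outlier_grades))  # 17
print(statistics.mean(outlier_grades))    # 92.8
```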
Summary

52
Dispersion criteria

53
Definition

A dispersion criterion is a value that represents the homogeneity of the values of a data set.

Statistical dispersion (also called statistical variability or variation) is the variability or spread in a variable or a probability distribution.

54
Dispersion criteria

A measure of statistical dispersion is a nonnegative real number that is zero if all the data are the same and increases as the data become more diverse:

 Standard deviation
 Interquartile range or Interdecile range
 Range
 Mean difference
 Median absolute deviation
 Average absolute deviation (or simply called average
deviation)
 Distance standard deviation

55
Interest

For your holidays you have the choice between:

 A peaceful family pension in Novosibirsk (Siberia): mean age 64 years

 A paradise island a few miles off Hawaii: mean age 24 years

56
Interest

57
Interest

58
Dispersion measure

 Range

 Standard deviation

 Variance

59
Range

 The range is the difference between the largest and the smallest observed value:

Range = X_max − X_min

60
Variance

The variance is the mean of the squared deviations of the variable from its expected value (mean):

s² = var(x) = (1/n) Σ_{i=1..n} (x_i − x̄)²

The variance of a variable has units that are the square of the units of the variable itself!
61
Standard deviation

The standard deviation describes the « typical » difference between the observations and the mean (average):

s = √(s²) = √[ (1/n) Σ_{i=1..n} (x_i − x̄)² ]

Example:
Sample A = (79, 80, 81)
Sample B = (60, 80, 100)
mean(A) = mean(B) = 80
Standard-Deviation(A) ≠ Standard-Deviation(B)
With the sample convention (dividing by n − 1):
Standard-Deviation(A) = 1
Standard-Deviation(B) = 20
62
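The A/B example can be reproduced with the standard library, which offers both conventions (`stdev` divides by n − 1, `pstdev` by n):

```python
# The A/B example above, checked with Python's statistics module.
import statistics

A = [79, 80, 81]
B = [60, 80, 100]

assert statistics.mean(A) == statistics.mean(B) == 80

# Sample standard deviation (divides by n - 1), matching the slide's numbers:
print(statistics.stdev(A))   # 1.0
print(statistics.stdev(B))   # 20.0

# Population standard deviation (divides by n), as in the 1/n formula:
print(statistics.pstdev(B))  # ~16.33
```

Same means, very different spreads: the dispersion measure is what distinguishes the two samples.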
Remarks

 A series of indicators gives only a partial view of the data: sample size, mean, median, variance, standard deviation, minimum, maximum, range, first quartile, third quartile, ...

 These indicators measure the central tendency and the dispersion. We usually use the mean, variance and standard deviation.

 These criteria are sensitive to extreme values

 Mean and standard deviation can be generalized from a sample

63
Describe a distribution

To describe a distribution, we analyse:
 The mean
 The standard deviation

We also have to analyse the distribution for:
 Its degree of skewness
 Its degree of kurtosis

64
Skewness

Sk_x = [ N / ((N−1)(N−2)) ] · Σ_i (x_i − x̄)³ / s_X³

where s_X is the sample standard deviation.

65
The skewness
If:
• Sk_x = 0: the distribution is perfectly symmetrical (values are spread equally towards higher and lower values of the variable)
• Sk_x > 0: positively skewed (the distribution spreads more towards higher values of the variable)
• Sk_x < 0: negatively skewed (the distribution spreads more towards lower values of the variable)

66
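The Sk_x formula above translates directly into code. A minimal sketch (test values are made up; s is the sample standard deviation, dividing by N − 1):

```python
# Sketch of the sample skewness formula Sk_x shown above.
import math

def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

print(skewness([1, 2, 3, 4, 5]))        # 0.0  (symmetric)
print(skewness([1, 1, 2, 2, 10]) > 0)   # True (skewed to the right)
```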
The skewness
Another method to determine the skewness of a distribution is to compare its mode, median and mean:

– Symmetric when mode = median = mean

– Skewed to the right when mode < median < mean

– Skewed to the left when mode > median > mean


67
The Kurtosis measure

Ku_x = [ N(N+1) / ((N−1)(N−2)(N−3)) ] · Σ_i (x_i − x̄)⁴ / s_x⁴ − 3(N−1)² / ((N−2)(N−3))

Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively.

68
Quartile

 Distribution function of a random variable:
– F(x) = P(X ≤ x)

 Quartiles q1, q2, q3 are defined by:
– F(q1) = 0.25: the first quartile (Q1) = lower quartile = splits off the lowest 25% of the data = 25th percentile
– F(q2) = 0.5: the second quartile (Q2) = median = cuts the data set in half = 50th percentile
– F(q3) = 0.75: the third quartile (Q3) = upper quartile = splits off the highest 25% of the data (or the lowest 75%) = 75th percentile

 Interquartile range: IQR = q3 − q1
– The difference between the upper and lower quartiles, containing the middle 50% of the data.
69
Boxplot

In R:
a <- c(0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18)
boxplot(a)

mean = 7.46
q1 = 5
median = 7
q3 = 9
Lower fence = 0 (smallest observation ≥ Q1 − 1.5·IQR)
Upper fence = 12 (largest observation ≤ Q3 + 1.5·IQR)
70
Boxplot

(Boxplot of a <- c(0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18): the box spans q1 = 5 to q3 = 9 with the median at 7; the whiskers end at the lower fence 0 and the upper fence 12; the value 18 lies beyond the upper fence and is drawn as an outlier. Mean = 7.46.)
71
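The boxplot statistics for the sample above can be computed with the standard library. Several quartile conventions exist; `statistics.quantiles(method="inclusive")` reproduces the slide's values q1 = 5, median = 7, q3 = 9 for this sample:

```python
# Sketch: computing the boxplot statistics for the sample above.
import statistics

a = [0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18]
q1, median, q3 = statistics.quantiles(a, n=4, method="inclusive")
iqr = q3 - q1                                         # 9 - 5 = 4

lower_fence = q1 - 1.5 * iqr                          # 5 - 6 = -1
upper_fence = q3 + 1.5 * iqr                          # 9 + 6 = 15

# Observations beyond the fences are plotted as outliers:
outliers = [x for x in a if x < lower_fence or x > upper_fence]

# The whiskers end at the extreme observations inside the fences:
whisker_low = min(x for x in a if x >= lower_fence)   # 0
whisker_high = max(x for x in a if x <= upper_fence)  # 12

print(outliers)                                       # [18]
```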
Example : Boxplot

 Create the boxplot of the following data:

52, 18, 26, 40, 8, 50, 63, 42, 21, 7, 44, 14

72
Example : Boxplot
 First, we sort the sample in ascending order, then find the median.

7, 8, 14, 18, 21, 26, 40, 42, 44, 50, 52, 63.

 Median = 6.5th value
= (6th + 7th observations) ÷ 2
= (26 + 40) ÷ 2
= 33

 There are 6 numbers before the median: 7, 8, 14, 18, 21, 26.

 Q1 = the median of these six values
= 3.5th value
= (3rd + 4th observations) ÷ 2
= (14 + 18) ÷ 2
= 16

 There are six numbers after the median: 40, 42, 44, 50, 52, 63.

 Q3 = the median of these six values
= (6 + 1) ÷ 2 = 3.5th value
= (3rd + 4th observations) ÷ 2 = (44 + 50) ÷ 2
= 47

Min = 7
Q1 = 16
Median = 33
Q3 = 47
Max = 63

73
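The median-of-halves convention used in this worked example is easy to code and check. A sketch (function name is mine):

```python
# The worked example's quartile convention: Q1 and Q3 are the medians of
# the halves below and above the overall median.
import statistics

def quartiles_median_of_halves(xs):
    xs = sorted(xs)
    n = len(xs)
    med = statistics.median(xs)
    lower = xs[: n // 2]          # values before the median position
    upper = xs[(n + 1) // 2 :]    # values after the median position
    return statistics.median(lower), med, statistics.median(upper)

data = [52, 18, 26, 40, 8, 50, 63, 42, 21, 7, 44, 14]
print(quartiles_median_of_halves(data))   # (16.0, 33.0, 47.0)
```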
Normal distribution

Normal (or Gaussian) distribution is a continuous probability


distribution that has a bell-shaped probability density function,
known as the Gaussian function

Recall: the parameter μ is the mean (the location of the peak), σ is the standard deviation, and σ² is the variance.

The distribution with μ = 0 and σ² = 1 is called the standard normal distribution or the unit normal distribution.

74
Normal distribution

The normal distribution has the following properties:

 68% of the population lies in the interval [x̄ − s; x̄ + s]

 95% of the population lies in the interval [x̄ − 2s; x̄ + 2s]

 99.7% of the population lies in the interval [x̄ − 3s; x̄ + 3s]

If these properties are not satisfied, the distribution is not Gaussian.
75
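These coverage figures can be verified exactly from the normal CDF with the standard library (Python 3.8+), with no sampling involved:

```python
# Checking the 68-95-99.7 rule with the exact standard normal CDF.
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

for k in (1, 2, 3):
    coverage = Z.cdf(k) - Z.cdf(-k)   # P(mu - k*sigma <= X <= mu + k*sigma)
    print(k, round(coverage, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```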
Normal distribution

Many natural variables follow the Normal Distribution

76
Gaussian distribution

The mixture of two Gaussian populations is not a Gaussian population

The distribution is bimodal

77
Missing values and outliers

 The detection of missing values and outliers is a step to carry out before building any type of model.

 It is important to question why the data is missing; this can help with finding a solution to the problem.

 If the values are missing at random there is still information about each
variable in each unit but if the values are missing systematically the problem
is more severe because the sample cannot be representative of the
population.

 For example: research is done on the relation between IQ and income. If participants with an above-average IQ do not answer the question ‘What is your salary?’, the results of the research may show that there is no association between IQ and salary, while in fact there is a relationship. Because of these problems, methodologists routinely advise researchers to design research so as to minimize the incidence of missing values (Ader, H.J., Mellenbergh, G.J. 2008).

78
Missing data

 Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

79
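Two of the strategies above, global-mean and class-conditional-mean imputation, can be sketched on toy records (the labels and ages below are made up; `None` marks a missing value):

```python
# Sketch: mean imputation vs. per-class mean imputation.
from statistics import mean

# (class label, age); None marks a missing age.
ages = [("yes", 25), ("yes", 35), ("no", 60), ("no", None), ("yes", None)]

known = [a for _, a in ages if a is not None]
global_mean = mean(known)                       # (25 + 35 + 60) / 3 = 40

def class_mean(label):
    """Mean of the attribute over samples of one class."""
    return mean(a for c, a in ages if c == label and a is not None)

# Strategy 1: fill every gap with the global attribute mean.
filled = [(c, a if a is not None else global_mean) for c, a in ages]

# Strategy 2 ("smarter"): fill each gap with its own class's mean,
# so the missing "no" gets 60 and the missing "yes" gets 30.
filled_by_class = [(c, a if a is not None else class_mean(c)) for c, a in ages]
```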
Outliers

An outlier is an incorrect value corresponding to a bad measurement, a miscalculation, a mistake or a misrepresentation.

Examples:

 Inconsistent dates: February 29 in a non-leap year, subscription dates prior to the customer’s date of birth

 “Sex” codes taking more than two different values

 Values in a phone-number field that do not correspond to phone numbers

80
Outliers

Several solutions exist to deal with outliers:

– Delete the concerned observations, if they are not too numerous and are sufficiently randomly distributed

– Keep the observation and the variable, tolerating a small margin of error in the model results

– Keep the observation and the variable, but replace the outlier by another value that is closest to its true value (mean, ...)

– Keep the observations but don’t use the variable for mining the data.

81
Data standardization

 Centering:
y = x − x̄

 Centering and reduction (standardization):
y = (x − x̄) / s

 Other normalizations:
Min-Max:      y = (x − min(x)) / (max(x) − min(x))
Logarithmic:  y = log(x)
82
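The four transformations can be sketched on a toy sample (values are illustrative; the population standard deviation matches the 1/n convention used earlier):

```python
# Sketch of the four transformations above on a made-up sample.
import math
from statistics import mean, pstdev

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m, s = mean(x), pstdev(x)                            # m = 5.0, s = 2.0

centered = [v - m for v in x]                        # mean 0, spread unchanged
standardized = [(v - m) / s for v in x]              # mean 0, std 1, unitless
minmax = [(v - min(x)) / (max(x) - min(x)) for v in x]   # rescaled into [0, 1]
logged = [math.log(v) for v in x]   # compresses large values (requires x > 0)

print(round(mean(standardized), 10), round(pstdev(standardized), 10))  # 0.0 1.0
```

Standardization makes variables with different units comparable, which matters before distance-based methods such as clustering.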
Example: Table of countries

country | life expectancy (F) | infant mortality | activity rate (F, %) | unemployment (%) | GNP per capita | % education | % health

 Germany 74.8 4.4 48.8 8.2 26768 4.3 10.6
 Austria 75.4 4.8 49 4.1 29075 4.9 8
 Belgium 75.1 5 42.3 7.3 27952 5.8 8.8
 Cyprus 75.3 5.6 50.9 3.8 12724 5.8 6
 Denmark 74.5 5.3 73.8 4.5 30096 8.1 8.4
 Spain 75.6 3.9 40.3 11.4 22538 5.6 7.7
 …………………………………………………
 Netherlands 75.5 5.1 52.9 2.7 29614 5.2 8.2
 Poland 70.3 8.1 49.5 18.1 9852 5.1 4.2
 Portugal 72.4 5.5 54.1 5 18500 5.5 8.2
 United Kingdom 75 5.6 53 5.1 26756 4.7 7.3
 Slovakia 69.7 8.6 52.9 17.4 12314 4.3 6.4
 Slovenia 72.3 4.9 51.3 11.3 17762 5.2 8.2
 Sweden 77.5 3.4 76.2 4.9 26849 7.3 7.9
 Switzerland 77.2 4.9 58.8 3.1 30058 5.1 10.7
 Czech Republic 72.2 4.1 51.3 9.8 15011 4.6 7.4

83
Bibliography

 Gregory Piatetsky-Shapiro, KDnuggets


 Ad Feelders, Advanced Data Mining 2011
 Srinivasan Parthasarathy, Introduction to Data Mining
 Min Song, Data Mining
 Fall 2004, CIS, Temple University, CIS527: Data Warehousing, Filtering, and Mining
 Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
 Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
 Stéphane Tufféry «Data mining et statistique décisionnelle», Editions Technip, 2012.
 Robert R. Haccoun, Denis Cousineau « Statistiques : Concepts et applications », 2e edition.
 Fenelon, J.P., (1981). « Qu’est-ce que l’analyse des données ? », Lefonen.
 Benzécri, J.-P., (1982). « Histoire et préhistoire de l’analyse des données », Dunod.

84
