Data Mining in Digital Humanities

Data mining involves applying statistical and machine learning techniques to discover patterns in large datasets. It is used to efficiently discover previously unknown patterns in data that can be useful and understandable. Common techniques include classification, clustering, association rule mining, and regression. Open source software like R, Weka, and RapidMiner as well as commercial tools from IBM, Microsoft, SAS and Oracle are used for data mining. Data mining is part of the broader process of knowledge discovery in databases (KDD) which aims to find useful patterns in data.

Uploaded by

Otilia Sirbu

Data Mining

Nicoleta ROGOVSCHI

Data mining

«Data mining offers the capability to view data in a new light, discovering associations and patterns not appreciated before. For the humanities domain, it exemplifies the interdisciplinary efforts of digital humanities.»
Jonathan Hagood

Data mining

 Data mining: applying statistics and pattern recognition to discover knowledge from data

 A field at the intersection of computer science and statistics

 The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets

 The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Data Mining

 Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

 Alternative names and their “inside stories”:

– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.

 Objective: Fit Data to a Model

– Descriptive
– Predictive

 Some ‘problems’:
– Which technique to choose?
 ARM (association rule mining) / Classification / Clustering
 Answer: it depends on what you want to do with the data.

– Search strategy: the technique used to search the data

 Interface? Query language?
 Efficiency

Why Mine Data?

 Lots of data is being collected and warehoused

– Web data, e-commerce
– purchases at department stores
– Bank/Credit Card transactions

 Computers have become cheaper and more powerful (from 32 bits to 64 bits)

 Competitive pressure is strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

 Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data

 Traditional techniques infeasible for raw data

 Data mining may help scientists

– in classifying and segmenting data
– in Hypothesis Formation

 Social needs:
– Weather forecasting
– Soil erosion prediction
– Flood and earthquake prediction

Examples: What is (not) Data Mining?

 What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Data Mining: Classification Schemes

 Decisions in data mining

– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

 Data mining tasks

– Descriptive data mining
– Predictive data mining

Decisions in Data Mining

 Databases to be mined
– Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
 Knowledge to be mined
– Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels

 Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

 Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining Tasks

 Prediction Tasks
– Use some variables to predict unknown or future values of other variables

 Description Tasks
– Find human-interpretable patterns that describe the data

Common data mining tasks:

– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]

Data Mining classical example: Association Rule Discovery (Definition)

 Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
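How strongly the transaction table backs each rule can be quantified with two standard measures. The slide does not define them, so they are introduced here as an assumption: support (the fraction of transactions containing every item of the rule) and confidence (among the transactions containing the left-hand side, the fraction that also contain the right-hand side). A minimal Python sketch over the five transactions, with illustrative function names:

```python
# Transactions copied from the slide's table (one set of items per TID).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Share of the transactions with `antecedent` that also hold `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

On this table, {Milk} --> {Coke} has support 0.60 and confidence 0.75, while {Diaper, Milk} --> {Beer} has support 0.40 and confidence of about 0.67.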

The Sad Truth About Diapers and Beer

 So, don’t be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application

 Supermarket shelf management.

– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
– A classic rule:
 If a customer buys diapers and milk, then they are very likely to buy beer.

Free open-source data mining software

 Carrot2: Text and search results clustering framework.
 Chemicalize.org: A chemical structure miner and web search engine.
 ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
 GATE: A natural language processing and language engineering tool.
 JHepWork: Java cross-platform data analysis framework developed at Argonne National Laboratory.
 KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
 NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
 Orange: A component-based data mining and machine learning software suite written in the Python language.
 R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU project.
 RapidMiner: An environment for machine learning and data mining experiments.
 UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.
 Weka: A suite of machine learning software applications written in the Java programming language.
 ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.

Commercial data-mining software

 IBM InfoSphere Warehouse: Intelligent Miner, an in-database data mining platform provided by IBM.
 Microsoft Analysis Services: data mining software provided by Microsoft.
 SAS: Enterprise Miner, data mining software provided by the SAS Institute.
 STATISTICA: Data Miner, data mining software provided by StatSoft.
 Oracle Data Mining: data mining software by Oracle.
 Clarabridge: enterprise-class text analytics solution.
 LIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
 MATLAB & Simulink

Data Mining vs. KDD

 Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.

 Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

Steps of a KDD Process

 Learning the application domain:

– relevant prior knowledge and goals of application

 Creating a target data set: data selection

 Data cleaning and preprocessing: (may take 60% of effort!)

 Data reduction and transformation:

– Find useful features, dimensionality/variable reduction, invariant representation.

 Choosing functions of data mining

– summarization, classification, regression, association, clustering.

 Choosing the mining algorithm(s)

 Data mining: search for patterns of interest

 Pattern evaluation and knowledge presentation

– visualization, transformation, removing redundant patterns, etc.

 Use of discovered knowledge

Data Mining and KDD Issues

 Human Interaction
 Overfitting
 Outliers
 Interpretation
 Visualization
 Large Datasets
 High Dimensionality

Challenges…

 Different types of data (text, images, video…)
 Missing data
 Irrelevant data (object selection)
 Noisy data (irrelevant features)
 Changing data (data streams)

Social Implications

 Privacy preservation
 Profiling of people
 Unauthorized use

Some solutions:
– Collaborative unsupervised learning
– Transfer learning, …

DM: Remarks

 Data mining is judged on:

– Usefulness
– Return on Investment (ROI)
– Accuracy
– Space/Time

 Database Perspective on Data Mining:

– Scalability
– Real World Data
– Updates
– Ease of Use

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science and other disciplines]

DBMS, OLAP, and Data Mining

Task
– DBMS: Extraction of detailed and summary data
– OLAP: Summaries, trends and forecasts
– Data Mining: Knowledge discovery of hidden patterns and insights

Type of result
– DBMS: Information
– OLAP: Analysis
– Data Mining: Insight and prediction

Method
– DBMS: Deduction (ask the question, verify with data)
– OLAP: Multidimensional data modeling, aggregation, statistics
– Data Mining: Induction (build the model, apply it to new data, get the result)

Example question
– DBMS: Who purchased mutual funds in the last 3 years?
– OLAP: What is the average income of mutual fund buyers by region by year?
– Data Mining: Who will buy a mutual fund in the next 6 months and why?

Query Examples

 Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
 Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

DATA

What is data?

Data is a basic description of a phenomenon.

A phenomenon can be described by one or more criteria, called variables:

– A single criterion: univariate data
– Several criteria: multivariate data

A phenomenon is thus described by a set of univariate or multivariate data.

Samples and variables

 Population
Group or set of individuals that are analyzed.

 Variables
Set of characteristics of a population.

Classical data matrix

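 For n objects and p variables, we define the data table X, a rectangular matrix with n rows and p columns, where $x_i^j$ is the value of variable j for object i:

$$
X = (x^1, \ldots, x^p) =
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^p \\
x_2^1 & x_2^2 & \cdots & x_2^p \\
\vdots & \vdots & x_i^j & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^p
\end{pmatrix}
$$
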
Example of a data table (matrix)

[Figure: example data matrix]

 Objects: x1, x2, …, x10.
 Variables: Y1, Y2, …, Y4.

Other names:
- Individuals: observations, objects, instances, transactions.
- Variables: attributes, dimensions, descriptions, components.

Data types

 Quantitative variables

 Qualitative variables

 Others : text data

Quantitative variables

 Quantitative variables

– Continuous (e.g. the height or weight of a person, the time taken to complete a task, the volume of an object, the speed of a car)

– Discrete (e.g. counts: the number of persons in a room, the number of items in a list)

Qualitative variables

 Qualitative variables (categorical)

- Binary data: takes one of two states (true or false, 0 or 1, yes or no).
Ex: gender, having a loan or not ...

- Unordered categorical data (nominal):
Ex: eye colour

- Ordered categorical data (ordinal): data from a survey (1: very satisfied, 2: satisfied, …)
Ex: low, medium, high; small, medium, large

Goals

 Answer a question or solve a problem

 Make data intelligible
 Retrieve and select information
 Determine the validity of this information:
– Problem of sampling fluctuations
– Problem of generalization

Why use computing?

 Power (speed) of calculation and processing of large databases

 Visualization (dimension reduction)

 Objectivity!

Mining quantitative univariate data

Univariate statistics of variables

 First step in data mining:

– Detect anomalies in the distributions of the variables (e.g. extreme or missing values)
– Decide how to discretize continuous variables (if applicable)
– Extract important summary information that can be useful in other analyses (e.g. age, average income of the population)
– Summarize the contents using a graph

Position criteria

A position criterion is the value that best represents the corresponding set of data values.

Statistics of central tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

As such, measures of central tendency are sometimes called measures of central location.

They are also classed as summary statistics.

They can be used to answer questions such as:

 What is the "typical" salary of a football player?
 How many children does a "typical" French family have?
 What is the "typical" exam mark of the students?

Statistics of central tendency

 The term central tendency refers to the "middle" value or perhaps a typical value of the data, and is measured using the mean, median, or mode. Each of these measures is calculated differently, and the one that is best to use depends upon the situation.

Statistics of central tendency

 The mode

 The median

 The mean

The mode

The mode is the most frequently occurring value in the data set (not necessarily unique).

On a bar chart or histogram, it corresponds to the highest bar.

The mode can therefore be seen as the most popular option.

The mode
Normally, the mode is used for categorical data where we wish to know which is the most common category:

[Figure: bar chart of the frequency of each form of transport]

We can see above that the most common form of transport, in this particular data set, is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency.

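Both situations, a single most common category and a tie for the highest frequency, can be sketched in Python. The transport answers and the marks below are hypothetical values invented for illustration (the slide's own chart data is not available); `statistics.multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Hypothetical survey answers, invented for illustration only.
transport = ["bus", "car", "bus", "train", "bus", "car", "walk"]

# Frequency of each category, most frequent first.
print(Counter(transport).most_common())

# multimode returns every value that reaches the highest frequency.
print(multimode(transport))

# Hypothetical marks in which two values tie for the highest frequency,
# so the mode is not unique.
marks = [12, 14, 14, 15, 17, 17, 18]
print(multimode(marks))
```

Returning a list of all modes, rather than a single value, makes the non-uniqueness explicit instead of silently picking one of the tied values.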
The mode

Depending on the number of modes, a distribution is:
 Unimodal
 Bimodal
 Multimodal

The median

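$$
\text{median} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\
\frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even}
\end{cases}
$$

where $x_{(k)}$ denotes the k-th smallest observation.

The median is the value that cuts the distribution into two parts, each containing the same number of observations.
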
The median

The median is the value that cuts the population into two populations of the same size.

The median is determined by sorting the data set from lowest to highest values and taking the data point in the middle of the sequence.

Example

 Suppose we have a set of 11 marks:

[Table: the 11 marks, unsorted]

 We first need to arrange the data in order of magnitude (smallest first):

[Table: the 11 marks in ascending order]

 Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it.

Example

 This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

[Table: 10 marks, unsorted]

 We again arrange the data in order of magnitude (smallest first):

[Table: the 10 marks in ascending order]

 Only now we have to take the 5th and 6th scores in our data set and average them to get a median of 55.5.

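The sort-then-take-the-middle procedure for odd and even sample sizes can be sketched in Python as follows. The marks are illustrative values, assumed here only so that the two medians match the 56 and 55.5 quoted in the examples; they are not taken from the slides.

```python
from statistics import median


def median_by_sorting(values):
    """Median obtained by sorting and taking the middle observation(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle observation.
        return ordered[mid]
    # Even n: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative marks (an assumption, not the slide's data).
eleven_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_marks = eleven_marks[:-1]  # drop the last mark to get an even count

print(median_by_sorting(eleven_marks))  # middle (6th) of 11 sorted marks
print(median_by_sorting(ten_marks))     # average of the 5th and 6th marks

# The standard library function gives the same results.
print(median(eleven_marks), median(ten_marks))
```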
The median

So, the median uses only the relative position of the observations.

Sample A: 13, 15, 17, 19, 23

Sample B: 13, 15, 17, 19, 400

Both samples have the same median (17), although their largest values differ greatly.

Advantages:
 The median is not affected by outliers
 The median is useful when we have missing data

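The effect of the extreme value in Sample B can be checked directly. A small Python sketch using the two samples from the slide:

```python
from statistics import mean, median

sample_a = [13, 15, 17, 19, 23]
sample_b = [13, 15, 17, 19, 400]  # same as A, except for one extreme value

for name, sample in (("A", sample_a), ("B", sample_b)):
    print(f"Sample {name}: mean = {mean(sample):.1f}, median = {median(sample)}")
```

The mean moves from 17.4 to 92.8 when 23 is replaced by 400, while the median stays at 17 in both samples.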
The median

The median is often used when there are a few extreme values that could greatly influence the mean and distort what might be considered typical.

This is often the case with home prices and with income data for a group of people, which are often very skewed. For such data, the median is often reported instead of the mean.

Example: In a group of people, if the salary of one person is 10 times the mean, the mean salary of the group will be higher because of the unusually large salary. In this case, the median may better represent the typical salary level of the group.

The mean

The mean is the value that every observation would take if all observations were identical and their total remained unchanged.

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Notes

 These three criteria give different information

 The mode uses only part of the distribution (only the most frequent value is considered)
 The median depends only on the position (rank) of the observations
 The mean is sensitive to extreme values (outliers)

Summary

Dispersion criteria

Definition

A dispersion criterion is a value that represents the homogeneity of the values in a data set.

Statistical dispersion (also called statistical variability or variation) is the variability or spread of a variable or a probability distribution.

Dispersion criteria

A measure of statistical dispersion is a nonnegative real number that is zero if all the data are the same and increases as the data become more diverse:

 Standard deviation
 Interquartile range or interdecile range
 Range
 Mean difference
 Median absolute deviation
 Average absolute deviation (or simply average deviation)
 Distance standard deviation

Interest

For your holidays you have the choice between:

 A peaceful family guesthouse in Novosibirsk (Siberia): mean age 64 years

 A paradise island a few miles off Hawaii: mean age 24 years

Dispersion measure

 Range

 Standard deviation

 Variance

Range

 The range is the difference between the largest and the smallest observed value:

$$
\text{Range} = X_{\max} - X_{\min}
$$

Variance

The variance is the mean of the squared deviations of the variable from its expected value or mean:

$$
s^2 = \operatorname{var}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

The variance of a variable has units that are the square of the units of the variable itself!

Standard deviation

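The standard deviation describes the « typical » difference between the observations and the mean (average):

$$
s = \sqrt{s^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

Example:
Sample A = (79, 80, 81)
Sample B = (60, 80, 100)
mean(A) = mean(B) = 80
Standard-Deviation(A) ≠ Standard-Deviation(B)
Standard-Deviation(A) = 1
Standard-Deviation(B) = 20
(These two values use the divisor n − 1, the sample standard deviation; with the divisor n of the formula above they are about 0.82 and 16.33.)
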
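The range, variance and standard deviation of Samples A and B can be computed with Python's standard library, which also makes the choice of divisor explicit: `pvariance` and `pstdev` divide by n, whereas `stdev` divides by n − 1. The helper `value_range` is a name introduced here for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev

sample_a = [79, 80, 81]
sample_b = [60, 80, 100]


def value_range(values):
    """Largest observed value minus smallest observed value."""
    return max(values) - min(values)


for name, sample in (("A", sample_a), ("B", sample_b)):
    print(f"Sample {name}: mean = {mean(sample)}, range = {value_range(sample)}")
    print(f"  variance (divisor n)       = {pvariance(sample):.2f}")
    print(f"  std deviation (divisor n)   = {pstdev(sample):.2f}")
    print(f"  std deviation (divisor n-1) = {stdev(sample):.2f}")
```

Both samples have the same mean (80), but B is far more spread out: its range is 40 against 2 for A, and its sample standard deviation is 20 against 1.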
Remarks

 A series of indicators gives a partial view of the data: count, mean, median, variance, standard deviation, minimum, maximum, range, first quartile, third quartile, ...

 These indicators measure the central tendency and the dispersion. We usually use the mean, variance and standard deviation.

 The mean, variance and standard deviation are sensitive to extreme values

 The mean and standard deviation can be generalized from a sample to the population

Describe a distribution

To describe a distribution, we analyse:

 The mean
 The standard deviation

We have also to analyse the distribution for:

 Its degree of Skewness
 Its degree of Kurtosis

Skewness

$$
Sk_x = \frac{N}{(N-1)(N-2)} \sum_{i} \frac{(x_i - \bar{x})^3}{s_x^3}
$$

The skewness
If:
• Skx = 0, the distribution is symmetric (values spread equally towards higher and lower values of the variable)
• Skx > 0, the distribution is positively skewed (it spreads more towards higher values of the variable)
• Skx < 0, the distribution is negatively skewed (it spreads more towards lower values of the variable)

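The sign rule can be checked numerically. The sketch below computes the coefficient as N/((N−1)(N−2)) times the sum of cubed deviations from the mean, divided by s³. It assumes that s is the sample standard deviation (divisor N − 1), which the slide does not state, and the three small samples are hypothetical.

```python
from statistics import mean, stdev


def skewness(values):
    """Sample-adjusted skewness coefficient; needs at least 3 observations."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (divisor n - 1)
    cubed = sum((x - m) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed / s ** 3


# Hypothetical samples, invented for illustration.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 10]      # one large value stretches the right side
left_tail = [-10, -4, -3, -2, -1]  # mirror image of right_tail

print(skewness(symmetric))   # zero for a symmetric sample
print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative, same magnitude as right_tail
```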
The skewness
Another way to determine the skewness of a distribution is to compare the mode, the median and the mean:

- Symmetric when mode = median = mean

- Skewed to the right when mode < median < mean

- Skewed to the left when mode > median > mean

The Kurtosis measure

$$
Ku_x = \frac{N(N+1)}{(N-1)(N-2)(N-3)} \sum_{i} \frac{(x_i - \bar{x})^4}{s_x^4} - 3\,\frac{(N-1)^2}{(N-2)(N-3)}
$$

Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively.

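A minimal Python sketch of the excess kurtosis, computed as N(N+1)/((N−1)(N−2)(N−3)) times the sum of fourth-power deviations divided by s⁴, minus 3(N−1)²/((N−2)(N−3)). As an assumption not stated on the slide, s is taken to be the sample standard deviation (divisor N − 1); the sample is hypothetical.

```python
from statistics import mean, stdev


def excess_kurtosis(values):
    """Sample-adjusted excess kurtosis; needs at least 4 observations."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (divisor n - 1)
    fourth = sum((x - m) ** 4 for x in values) / s ** 4
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical, evenly spread sample: flatter than a normal distribution.
flat = [1, 2, 3, 4, 5]
print(round(excess_kurtosis(flat), 6))
```

For the evenly spread sample the result is −1.2, a negative excess kurtosis, so the sample is platykurtic.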
Quartile

 Distribution function of a random variable:

- F(x) = P(X ≤ x)

 Quartiles q1, q2, q3 are defined as:

- F(q1) = 0.25: the first quartile (Q1), or lower quartile; it splits off the lowest 25% of the data (25th percentile)
- F(q2) = 0.5: the second quartile (Q2), or median; it cuts the data set in half (50th percentile)
- F(q3) = 0.75: the third quartile (Q3), or upper quartile; it splits off the highest 25% of the data, or the lowest 75% (75th percentile)

Interquartile range: IQR = q3 - q1

- The difference between the upper and lower quartiles; this interval contains 50% of the data.

Boxplot

a <- c(0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18)
boxplot(a)

mean = 7.46
q1 = 5
median = 7
q3 = 9
Lower fence = 0 (smallest observation > Q1 – 1.5(IQR))
Upper fence = 12 (largest observation < Q3 + 1.5(IQR))
Smallest observation (sample minimum) = 0
Largest observation (sample maximum) = 18

Boxplot
[Figure: boxplot of a, with labels for the outlier, the largest value, the upper fence, q3, the median, the mean, q1, the lower fence and the smallest value]

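The quantities read off the boxplot can be recomputed in Python. This is a sketch under one assumption: the quartiles are obtained with `statistics.quantiles(..., method="inclusive")` (Python 3.8 or later), chosen because it reproduces the values q1 = 5 and q3 = 9 shown on the slide. Other quartile conventions give slightly different numbers.

```python
from statistics import mean, quantiles

a = [0, 3, 3, 5, 5, 6, 7, 8, 9, 9, 12, 12, 18]

# Quartiles; the "inclusive" method is an assumption (see the text above).
q1, med, q3 = quantiles(a, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which an observation is treated as an outlier.
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# Fences as used on the slide: the most extreme observations that are
# still strictly inside the two limits.
inside = [x for x in a if low_limit < x < high_limit]
lower_fence, upper_fence = min(inside), max(inside)
outliers = [x for x in a if x not in inside]

print(f"mean = {mean(a):.2f}, q1 = {q1}, median = {med}, q3 = {q3}, IQR = {iqr}")
print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
print(f"outliers = {outliers}")
```

With IQR = 4 the limits are −1 and 15, so the fences fall at 0 and 12 and the observation 18 is the only outlier.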
Example : Boxplot

 Create the boxplot of the following data:

52, 18, 26, 40, 8, 50, 63, 42, 21, 7, 44, 14

Example : Boxplot
 First, we sort the sample in ascending order. Then we find the median.

7, 8, 14, 18, 21, 26, 40, 42, 44, 50, 52, 63.

 Median = the 6.5th value
= (6th + 7th observations) ÷ 2
= (26 + 40) ÷ 2
= 33

 There are six numbers before the median: 7, 8, 14, 18, 21, 26.

 Q1 = the median of these six values
= the 3.5th value
= (3rd + 4th observations) ÷ 2
= (14 + 18) ÷ 2
= 16

 There are six numbers after the median: 40, 42, 44, 50, 52, 63.

 Q3 = the median of these six values
= the (6 + 1) ÷ 2 = 3.5th value
= (3rd + 4th observations) ÷ 2 = (44 + 50) ÷ 2
= 47

Min = 7
Q1 = 16
Median = 33
Q3 = 47
Max = 63

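The worked example takes Q1 and Q3 as the medians of the lower and upper halves of the sorted sample. A Python sketch of that procedure follows; the function name is illustrative, and the handling of an odd sample size (leaving the median out of both halves) is an assumption, since the example only covers an even size.

```python
from statistics import median


def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption for odd n: the middle observation belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [52, 18, 26, 40, 8, 50, 63, 42, 21, 7, 44, 14]
print(five_number_summary(data))
```

On the example data this gives 7, 16, 33, 47 and 63, the five values needed to draw the boxplot.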
Normal distribution

The normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function.

Recall:
μ is the mean (location of the peak)
σ is the standard deviation
σ² is the variance

The distribution with μ = 0 and σ² = 1 is called the standard normal distribution or the unit normal distribution.

Normal distribution

The normal distribution has the following properties:

 68% of the population lies in the interval $[\bar{x} - s;\ \bar{x} + s]$

 95% of the population lies in the interval $[\bar{x} - 2s;\ \bar{x} + 2s]$

 99.7% of the population lies in the interval $[\bar{x} - 3s;\ \bar{x} + 3s]$

If these properties are not satisfied, the distribution is not Gaussian.

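These percentages follow from the normal distribution itself: the probability that a normal variable lies within k standard deviations of its mean is erf(k/√2). A short Python check, in which the function name `coverage` is illustrative:

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4%}")
```

The computed values are about 68.27%, 95.45% and 99.73%, which the slide rounds.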
Normal distribution

Many natural variables follow the Normal Distribution

Gaussian distribution

The mixture of two Gaussian populations is, in general, not a Gaussian population

The distribution may be bimodal

Missing values and outliers
 Detecting missing values and outliers is a necessary step for any type of model.

 It is important to ask why the data are missing; this can help in finding a solution to the problem.

 If the values are missing at random, there is still information about each variable in each unit, but if the values are missing systematically the problem is more severe, because the sample cannot be representative of the population.

 For example: a study is carried out on the relation between IQ and income. If participants with an above-average IQ do not answer the question ‘What is your salary?’, the results of the study may show that there is no association between IQ and salary, while in fact there is a relationship. Because of these problems, methodologists routinely advise researchers to design research so as to minimize the incidence of missing values (Ader, H.J., Mellenbergh, G.J. 2008).

Missing data

 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

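Two of these strategies, filling with the attribute mean and filling with the mean of the same class, can be sketched in Python. The records and function names below are hypothetical, invented for illustration; a missing value is represented by `None`.

```python
from statistics import mean

# Hypothetical records: (class label, attribute value); None marks a gap.
records = [
    ("A", 10.0), ("A", None), ("A", 14.0),
    ("B", 30.0), ("B", None), ("B", 34.0),
]


def impute_global_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    fill = mean(v for _, v in rows if v is not None)
    return [(c, fill if v is None else v) for c, v in rows]


def impute_class_mean(rows):
    """Replace each missing value with the mean observed in the same class.

    Sketch only: a class with no observed value would raise a KeyError.
    """
    observed = {}
    for c, v in rows:
        if v is not None:
            observed.setdefault(c, []).append(v)
    fills = {c: mean(vs) for c, vs in observed.items()}
    return [(c, fills[c] if v is None else v) for c, v in rows]


print(impute_global_mean(records))
print(impute_class_mean(records))
```

In this toy data the global mean (22) falls between the two classes and is typical of neither, whereas the class means (12 for A, 32 for B) stay close to the other values of each class, which is why the slide calls the class-based variant smarter.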
Outliers

An outlier is an incorrect value resulting from a bad measurement, a miscalculation, a mistake or a misrepresentation.

Examples:

 Inconsistent dates: February 29 in a non-leap year, subscription dates prior to the customer's date of birth

 "Sex" codes taking more than two different values

 Phone numbers that are not valid phone numbers

Outliers

Several solutions exist to deal with outliers:

– Delete the observations concerned, if they are not too numerous and are distributed sufficiently at random

– Keep the observation and the variable, tolerating a small margin of error in the model results

– Keep the observation and the variable, but replace the outlier with another value as close as possible to its true value (mean...)

– Keep the observations but do not use the variable for mining the data.

Data standardization

 Centering:
$$ y = x - \bar{x} $$
 Centering and reduction (standardization):
$$ y = \frac{x - \bar{x}}{s} $$
 Other normalizations:
– Min-Max:
$$ y = \frac{x - \min(x)}{\max(x) - \min(x)} $$
– Logarithmic:
$$ y = \log(x) $$

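The four transformations can be sketched in Python as follows. The function names are illustrative, the sample values are hypothetical, and s is assumed to be the sample standard deviation (divisor n − 1), which the slide does not specify.

```python
import math
from statistics import mean, stdev


def center(xs):
    """y = x - mean(x)."""
    m = mean(xs)
    return [x - m for x in xs]


def standardize(xs):
    """y = (x - mean(x)) / s, with s the sample standard deviation (assumed)."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def min_max(xs):
    """y = (x - min(x)) / (max(x) - min(x)); maps the values onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log_transform(xs):
    """y = log(x), natural logarithm; defined only for positive values."""
    return [math.log(x) for x in xs]


values = [60, 80, 100]  # hypothetical sample
print(center(values))
print(standardize(values))
print(min_max(values))
print(log_transform(values))
```

Standardization and min-max scaling both remove the original unit, which is useful when variables measured on very different scales are analysed together.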
Example: Table of countries

Country          esp life F   mort_inf   activF   % unemployment   gnb/hb   % education   % health

Germany          74.8         4.4        48.8     8.2              26768    4.3           10.6
Austria          75.4         4.8        49       4.1              29075    4.9           8
Belgium          75.1         5          42.3     7.3              27952    5.8           8.8
Cyprus           75.3         5.6        50.9     3.8              12724    5.8           6
Denmark          74.5         5.3        73.8     4.5              30096    8.1           8.4
Spain            75.6         3.9        40.3     11.4             22538    5.6           7.7
…
Netherlands      75.5         5.1        52.9     2.7              29614    5.2           8.2
Poland           70.3         8.1        49.5     18.1             9852     5.1           4.2
Portugal         72.4         5.5        54.1     5                18500    5.5           8.2
United Kingdom   75           5.6        53       5.1              26756    4.7           7.3
Slovakia         69.7         8.6        52.9     17.4             12314    4.3           6.4
Slovenia         72.3         4.9        51.3     11.3             17762    5.2           8.2
Sweden           77.5         3.4        76.2     4.9              26849    7.3           7.9
Switzerland      77.2         4.9        58.8     3.1              30058    5.1           10.7
Czechia          72.2         4.1        51.3     9.8              15011    4.6           7.4

100% (5)
Lesson Plan in Mathematics Grade 7
9 pages
Laoag City, Ilocos Norte: Mat 101: Mathematics in The Modern World
No ratings yet
Laoag City, Ilocos Norte: Mat 101: Mathematics in The Modern World
46 pages
Mat 152 Set P Quiz 2
No ratings yet
Mat 152 Set P Quiz 2
4 pages
Tests of Normality
No ratings yet
Tests of Normality
6 pages
Normal Probability Distribution
No ratings yet
Normal Probability Distribution
25 pages
Statistics Unlocking The Power of Data 2nd Edition Test Bank
No ratings yet
Statistics Unlocking The Power of Data 2nd Edition Test Bank
33 pages
Chapters 1-4 Multiple Choice Practice
50% (2)
Chapters 1-4 Multiple Choice Practice
7 pages
Mathematics 3A3B Calculator Assumed Examination 2011 PDF
No ratings yet
Mathematics 3A3B Calculator Assumed Examination 2011 PDF
24 pages
Modemedian
No ratings yet
Modemedian
3 pages
Final Exam A Qms 102 Fall 2010
No ratings yet
Final Exam A Qms 102 Fall 2010
18 pages
BFM MCQ
No ratings yet
BFM MCQ
10 pages
Chapter 17 - Fundamental Principles of Relative Valuation
No ratings yet
Chapter 17 - Fundamental Principles of Relative Valuation
3 pages
CL 9 (CH 14)
No ratings yet
CL 9 (CH 14)
16 pages
Exploratory Data Analysis Concepts
No ratings yet
Exploratory Data Analysis Concepts
28 pages
Anne McDonnell Sill - Statistics For Laboratory Scientists and Clinicians - A Practical Guide (2021, Cambridge University Press) - Libgen - Li
No ratings yet
Anne McDonnell Sill - Statistics For Laboratory Scientists and Clinicians - A Practical Guide (2021, Cambridge University Press) - Libgen - Li
187 pages
First Exam - Probabilty and Statistics - Second
No ratings yet
First Exam - Probabilty and Statistics - Second
3 pages
The Usefulness of Earnings and Book Value For Equity Valuation To Kuwait Stock Exchange Participants
No ratings yet
The Usefulness of Earnings and Book Value For Equity Valuation To Kuwait Stock Exchange Participants
18 pages
Data Analysis and Interpretation
No ratings yet
Data Analysis and Interpretation
13 pages
Analysis Professional Success Studenți: Amatiesei Oana Cipcă Cosmin Roca Roxana Sfredel Răzvan Hypothesis 1
No ratings yet
Analysis Professional Success Studenți: Amatiesei Oana Cipcă Cosmin Roca Roxana Sfredel Răzvan Hypothesis 1
5 pages