Why Data Mining?
We live in a world where vast amounts of data are collected daily. Analyzing such data
is an important need. Data mining can meet this need by providing tools to discover
knowledge from data.
1.1.2 Data Mining as the Evolution of Information Technology
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved through the development of
several critical functionalities (Figure 1.1): data collection and database creation, data
management (including data storage and retrieval and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). The early
development of data collection and database creation mechanisms served as a prerequisite
for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays numerous database systems
offer query and transaction processing as common practice. Advanced data analysis has
naturally become the next step.
After the establishment of database management systems, database technology
moved toward the development of advanced database systems, data warehousing, and
data mining for advanced data analysis and web-based databases. Advanced database
systems incorporate new and powerful data models such as the extended-relational,
object-oriented, and object-relational models. Application-oriented database
systems have flourished, including spatial, temporal, multimedia, active, stream and
sensor, scientific and engineering databases, knowledge bases, and office information
bases.
Advanced data analysis sprang up from the late 1980s onward.
This technology provides a great boost to the database and information
industry, and it enables a huge number of databases and information repositories to be
available for transaction management, information retrieval, and data analysis. Data
can now be stored in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse, a repository of
multiple heterogeneous data sources organized under a unified
schema at a single site to facilitate management decision making. Data warehouse
technology includes data cleaning, data integration, and online analytical processing
(OLAP)—that is, analysis techniques with functionalities such as summarization,
consolidation,
and aggregation, as well as the ability to view information from different
angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis—for example, data mining
tools that provide data classification, clustering, outlier/anomaly detection, and the
characterization of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses.
During the 1990s, the World Wide Web and web-based databases (e.g., XML
databases) began to appear. Internet-based global information bases, such as the WWW
and various kinds of interconnected, heterogeneous databases, have emerged and play
a vital role in the information industry. The effective and efficient analysis of data in
such different forms, through the integration of information retrieval, data mining, and
information network analysis technologies, is a challenging task.
In summary, the abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation (Figure 1.2).
The fast-growing, tremendous amount of data, collected and stored in large and numerous
data repositories, has far exceeded our human ability for comprehension without powerful
tools. As a result, data collected in large data repositories become “data tombs”—data
archives that are seldom visited. Consequently, important decisions are often made
based not on the information-rich data stored in data repositories but rather on a decision
maker’s intuition, simply because the decision maker does not have the tools to
extract the valuable knowledge embedded in the vast amounts of data. Efforts have
been made to develop expert system and knowledge-based technologies, which typically
rely on users or domain experts to manually input knowledge into knowledge bases.
Unfortunately, however, the manual knowledge input procedure is prone to biases and
errors and is extremely costly and time consuming. The widening gap between data and
information calls for the systematic development of data mining tools that can turn data
tombs into “golden nuggets” of knowledge.
1.2 What Is Data Mining?
It is no surprise that data mining, as a truly interdisciplinary subject, can be defined
in many different ways. Even the term data mining does not really present all the major
components in the picture. Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or KDD.
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
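To make the KDD steps concrete before the lab, here is a minimal sketch of a cleaning-plus-mining pipeline using the Weka Java API, the same library the lab drives through the GUI. It assumes Weka 3.8 on the classpath; the file name data.arff is a placeholder, and the dataset is assumed to have a nominal class attribute in the last position.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class KddSketch {
    public static void main(String[] args) throws Exception {
        // Data selection: load the relevant data (placeholder file name).
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // Data cleaning: impute missing values with the attribute mean/mode.
        ReplaceMissingValues clean = new ReplaceMissingValues();
        clean.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, clean);
        // Data mining: learn a decision tree (one possible intelligent method).
        J48 tree = new J48();
        tree.buildClassifier(cleaned);
        // Knowledge presentation: print the mined model.
        System.out.println(tree);
    }
}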
LAB 1
Weka 3: Data Mining with Open Source Machine Learning Software in Java
[Link]
Sample datasets ship with Weka under c:\Program Files\Weka-3-8-5\data
Open [Link]. Filter -> Choose -> filters -> unsupervised -> attribute -> NumericCleaner: attributeIndices = 6 (mass), minDefault = NaN, minThreshold = 0.1E-7. OK, Apply.
Select mass, then Edit to inspect.
Filter -> unsupervised -> instance -> RemoveWithValues: attributeIndex = 6, matchMissingValues = True. OK, Apply. Check mass; in Edit, the rows with missing values have been removed.
Impute instead of removing: Undo, then choose filter -> unsupervised -> attribute -> ReplaceMissingValues, Apply. In Edit, the missing values have been replaced.
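The same mark-then-handle sequence can be scripted with the Weka Java API. A sketch, assuming a dataset with a numeric mass attribute at index 6 (diabetes.arff is an assumption; the NumericCleaner setter names mirror its GUI property names, so verify them against your Weka version):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericCleaner;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class MarkAndHandleMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff"); // assumed file name
        // Mark implausibly small values of attribute 6 (mass) as missing (NaN).
        NumericCleaner cleaner = new NumericCleaner();
        cleaner.setAttributeIndices("6");
        cleaner.setMinThreshold(0.1e-7);   // values below this threshold...
        cleaner.setMinDefault(Double.NaN); // ...become missing
        cleaner.setInputFormat(data);
        Instances marked = Filter.useFilter(data, cleaner);
        // Option 1: drop the rows whose mass is now missing.
        RemoveWithValues drop = new RemoveWithValues();
        drop.setAttributeIndex("6");
        drop.setMatchMissingValues(true); // missing values count as a match
        drop.setInputFormat(marked);
        Instances reduced = Filter.useFilter(marked, drop);
        // Option 2: impute instead, using each attribute's mean/mode.
        ReplaceMissingValues impute = new ReplaceMissingValues();
        impute.setInputFormat(marked);
        Instances imputed = Filter.useFilter(marked, impute);
        System.out.println("removal keeps " + reduced.numInstances()
                + " rows; imputation keeps " + imputed.numInstances());
    }
}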
Weather numeric data: Edit; the play probability percentage should lie between 0 and 100. Filter -> unsupervised -> attribute -> NumericCleaner: maxThreshold = 100, minThreshold = 0, maxDefault = 100, minDefault = 0.
Values from 45 to 49 must become 50: closeTo = 47, closeToDefault = 50, closeToTolerance = 3 (i.e., values within 3 of 47), attributeIndices = 5. OK, Apply, Edit.
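A sketch of the same clamping in code (attribute index 5 and the thresholds are taken from the lab sheet; the file name is an assumption, and the setter names mirror the GUI property names):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericCleaner;

public class ClampPercentage {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff"); // assumed file name
        NumericCleaner cleaner = new NumericCleaner();
        cleaner.setAttributeIndices("5");
        cleaner.setMinThreshold(0);   cleaner.setMinDefault(0);   // clamp below 0 up to 0
        cleaner.setMaxThreshold(100); cleaner.setMaxDefault(100); // clamp above 100 down to 100
        cleaner.setCloseTo(47);          // values within...
        cleaner.setCloseToTolerance(3);  // ...3 of 47 (i.e., 45 to 49)
        cleaner.setCloseToDefault(50);   // ...are snapped to 50
        cleaner.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, cleaner);
        System.out.println(cleaned);
    }
}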
Outliers: open [Link], filter -> unsupervised -> attribute -> InterquartileRange, Apply. Two new attributes, Outlier and ExtremeValue, are added at positions 10 and 11. Edit to inspect.
Outlier removal: unsupervised -> instance -> RemoveWithValues, attributeIndex = 10, nominalIndices = last. Then: filter -> unsupervised -> instance -> RemoveWithValues, attributeIndex = 11, nominalIndices = last. OK, Apply, save [Link].
Normalize: open the weather data [Link] (ARFF: Attribute-Relation File Format).
Filter -> Unsupervised -> attribute -> Normalize; applies to numeric attributes only. Select Normalize in the filter bar to edit: scale = 1, translation = 0 gives values between 0 and 1; for -1 to +1 choose scale = 2 and translation = -1. OK, Apply, Edit, Undo. Save will replace the file, so give a new name.
Filter -> Unsupervised -> attribute -> Standardize (zero mean, unit variance); applies to numeric attributes only.
Check each numeric attribute's mean and standard deviation.
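Both rescalings in code, a minimal sketch (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Standardize;

public class Rescale {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff"); // assumed file name
        // Normalize: map each numeric attribute to [0, 1] (the default scale/translation).
        Normalize norm = new Normalize();
        // For [-1, +1] use: norm.setScale(2.0); norm.setTranslation(-1.0);
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized);
        // Standardize: transform each numeric attribute to zero mean, unit variance.
        Standardize std = new Standardize();
        std.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, std);
        System.out.println(standardized);
    }
}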
Rushdi Shams Weka Tutorials (KnowledgeFlow layout):
DataSources -> ArffLoader
Evaluation -> ClassAssigner
Filters -> supervised -> AttributeSelection
Visualization -> TextViewer
Convert CSV files to ARFF files
Download the files from [Link] and unzip.
Open [Link] and [Link] using Notepad; change the extension to .txt.
Open Excel: Data -> Get External Data -> From Text.
Go to the download folder, select [Link], Next, choose Tab and Space as delimiters, Next, Finish; place the data at =$A$1.
Insert a new row at the top and copy-paste the column names from [Link].
Save the file as CSV.
Weka: Tools -> ArffViewer -> File -> Open, select the CSV file, then Save As ARFF.
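If the CSV already has a header row, the Excel detour can be skipped: Weka's converters do the conversion directly. A sketch with assumed file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the header row becomes the attribute names.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("dataset.csv")); // assumed file name
        Instances data = loader.getDataSet();
        // Write the same instances out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("dataset.arff"));
        saver.writeBatch();
    }
}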
Data cleaning using Weka:
Open file [Link].
Check the relation name.
Select the first attribute; check for missing values (in this case 2% for the first attribute).
Select Edit; you can find a lot of missing values, shown in grey.
1. Replace missing values using Weka:
Go to Filter: weka -> filters -> unsupervised -> attribute -> ReplaceMissingValues, Apply.
Discretize
Open [Link].
Select attribute age: unsupervised -> attribute -> Discretize. Click the Discretize bar: attributeIndices = 13 (for age), binRangePrecision = 2 (decimal limit for bin boundaries), bins = 3. Apply, then Save As type CSV.
Open the file in Excel, replace the bin values with Old, Middle, and Young, and save the file as CSV.
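Equal-width binning in code, a sketch (the age column index 13 follows the lab sheet; the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAge {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        Discretize disc = new Discretize(); // equal-width binning by default
        disc.setAttributeIndices("13");     // the age column from the lab sheet
        disc.setBins(3);                    // three bins: young / middle / old
        disc.setInputFormat(data);
        Instances binned = Filter.useFilter(data, disc);
        System.out.println(binned.attribute(12)); // 0-based index of attribute 13
    }
}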
2. Info Gain Attribute Evaluator
Open the CSV file [Link] in Weka.
Select Attributes from the top bar.
Attribute Evaluator: InfoGainAttributeEval.
Alert: answer Yes to switch the search method to Ranker.
Start, then check the results.
Select attributes 17, 19, 18, 8, 11, 16, Remove, Save.
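The same ranking from code, a sketch (file name assumed; the class attribute is assumed to be last):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new InfoGainAttributeEval()); // information gain w.r.t. the class
        sel.setSearch(new Ranker());                   // rank all attributes
        sel.SelectAttributes(data);
        System.out.println(sel.toResultsString());     // ranked list, highest gain first
    }
}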
3. Change any attribute to the class
Open [Link].
Edit.
Select mpg, Set Attribute As Class, OK.
4. Change Numeric to Nominal
Open [Link].
Select attribute preg (numeric).
Weka: filters -> unsupervised -> attribute -> NumericToNominal. Click the bar: attributeIndices = 1. Apply.
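In code, a sketch (diabetes.arff is an assumption based on the preg attribute):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class PregToNominal {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff"); // assumed file name
        NumericToNominal toNom = new NumericToNominal();
        toNom.setAttributeIndices("1"); // preg, the first attribute
        toNom.setInputFormat(data);
        Instances converted = Filter.useFilter(data, toNom);
        System.out.println(converted.attribute(0)); // now nominal
    }
}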
5. Normalize
Open [Link].
Weka: filters -> unsupervised -> attribute -> Normalize, Apply.
Undo, then Standardize, Apply.
6. Remove missing values
Open [Link].
Select attribute plant-stand; it has missing values.
Weka: filters -> unsupervised -> instance -> RemoveWithValues. Click the bar: attributeIndex = 2, invertSelection = True, matchMissingValues = True. OK.
7. Best attributes
Weka: filters -> supervised -> attribute -> AttributeSelection.
Weka Select Attributes tab: Choose -> ClassifierSubsetEval, click, classifier -> Choose -> NaiveBayes, OK, Start.
Then Choose -> trees -> J48, OK, Start.
Find the best attributes.
Preprocess: select attributes 1, 3, 4, 5, click Invert, Remove.
Classify with Naive Bayes and see the results.
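A sketch of the same wrapper-style selection in code (file name assumed; BestFirst is the GUI's default subset search and an assumption here):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        // Score candidate attribute subsets by how well NaiveBayes performs on them.
        ClassifierSubsetEval eval = new ClassifierSubsetEval();
        eval.setClassifier(new NaiveBayes());
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(new BestFirst());
        sel.SelectAttributes(data);
        System.out.println(sel.toResultsString());
        // Swap in new weka.classifiers.trees.J48() to compare the subsets each model prefers.
    }
}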
8. Finding outliers
Open file [Link].
Weka: filters -> unsupervised -> attribute -> InterquartileRange, Apply.
Two extra columns are added. Select the Outlier column, set the class to Outlier, Visualize.
Weka: filters -> unsupervised -> instance -> RemoveWithValues; click the bar.
The attribute Outlier has two values, no (1) and yes (2). We want to remove the outliers, so nominalIndices = 2 (or last).
attributeIndex = 11, nominalIndices = 2, Classify.
Undo, click the bar, detectionPerAttribute = True, Apply; Undo.
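The flag-then-filter sequence in code, a sketch (file name assumed; the Outlier column index 11 follows the lab sheet):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class DropOutliers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        // Flag outliers: appends nominal Outlier and ExtremeValue attributes.
        InterquartileRange iqr = new InterquartileRange();
        iqr.setAttributeIndices("first-last");
        iqr.setInputFormat(data);
        Instances flagged = Filter.useFilter(data, iqr);
        // Remove the rows whose Outlier flag is "yes" (the last nominal index).
        RemoveWithValues drop = new RemoveWithValues();
        drop.setAttributeIndex("11");   // Outlier column index from the lab sheet
        drop.setNominalIndices("last"); // match the "yes" label
        drop.setInputFormat(flagged);
        Instances kept = Filter.useFilter(flagged, drop);
        System.out.println(flagged.numInstances() + " rows -> " + kept.numInstances());
    }
}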
9. Numeric transform
Open [Link]. Weka: filter -> unsupervised -> attribute -> NumericTransform, methodName = floor.
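In code, a sketch (file name and attribute range are assumptions; NumericTransform applies a static method of the named class to each selected numeric value):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericTransform;

public class FloorTransform {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        NumericTransform floor = new NumericTransform();
        floor.setClassName("java.lang.Math");      // class providing the method (the default)
        floor.setMethodName("floor");              // Math.floor applied to each value
        floor.setAttributeIndices("first-last");   // assumes the selected columns are numeric
        floor.setInputFormat(data);
        Instances transformed = Filter.useFilter(data, floor);
        System.out.println(transformed);
    }
}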
10. PCA
Open file [Link]. Filter: unsupervised -> attribute -> PrincipalComponents. Click: varianceCovered = 0.95, OK, Apply.
Check the variance/standard deviation on the right. Take the maximum variance and set a threshold at 50% of that maximum. The other attributes fall below that threshold; select them (2, 3, 4, 5) and click Remove.
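The same transformation in code, a sketch (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class PcaFilter {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // keep enough components to cover 95% of the variance
        pca.setInputFormat(data);
        Instances rotated = Filter.useFilter(data, pca);
        System.out.println(rotated.numAttributes() + " components retained");
    }
}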
Sparse dataset
Open file [Link]; Edit to see the sparse data.
Filter: Choose -> weka -> filters -> unsupervised -> instance -> NonSparseToSparse.
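In code, a sketch (file name assumed); each instance is then stored as {index value} pairs, with zero values omitted:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.NonSparseToSparse;

public class ToSparse {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        NonSparseToSparse sparse = new NonSparseToSparse();
        sparse.setInputFormat(data);
        Instances converted = Filter.useFilter(data, sparse);
        System.out.println(converted); // prints instances in sparse {index value} form
    }
}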