Why Data Mining?
We live in a world where vast amounts of data are collected daily. Analyzing such data
is an important need. Data mining can meet this need by providing tools to discover
knowledge from data.
1.1.2 Data Mining as the Evolution of Information Technology
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved through the development of
several critical functionalities (Figure 1.1): data collection and database creation, data
management (including data storage and retrieval and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). The early
development of data collection and database creation mechanisms served as a prerequisite
for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays numerous database systems
offer query and transaction processing as common practice. Advanced data analysis has
naturally become the next step.
After the establishment of database management systems, database technology
moved toward the development of advanced database systems, data warehousing, and
data mining for advanced data analysis and web-based databases. Advanced database
systems incorporate new and powerful data models such as the extended-relational,
object-oriented, and object-relational models. Application-oriented database
systems have flourished, including spatial, temporal, multimedia, active, stream and
sensor, scientific and engineering databases, knowledge bases, and office information
bases.
Advanced data analysis sprang up from the late 1980s onward.
This technology provides a great boost to the database and information
industry, and it enables a huge number of databases and information repositories to be
available for transaction management, information retrieval, and data analysis. Data
can now be stored in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse, a repository of
multiple heterogeneous data sources organized under a unified
schema at a single site to facilitate management decision making. Data warehouse
technology includes data cleaning, data integration, and online analytical processing
(OLAP)—that is, analysis techniques with functionalities such as summarization,
consolidation,
and aggregation, as well as the ability to view information from different
angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis—for example, data mining
tools that provide data classification, clustering, outlier/anomaly detection, and the
characterization of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses.
During the 1990s, the World Wide Web and web-based databases (e.g., XML
databases) began to appear. Internet-based global information bases, such as the WWW
and various kinds of interconnected, heterogeneous databases, have emerged and play
a vital role in the information industry. The effective and efficient analysis of data in
such different forms, through the integration of information retrieval, data mining, and
information network analysis technologies, is a challenging task.
In summary, the abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation (Figure 1.2).
The fast-growing, tremendous amount of data, collected and stored in large and numerous
data repositories, has far exceeded our human ability for comprehension without powerful
tools. As a result, data collected in large data repositories become “data tombs”—data
archives that are seldom visited. Consequently, important decisions are often made
based not on the information-rich data stored in data repositories but rather on a decision
maker’s intuition, simply because the decision maker does not have the tools to
extract the valuable knowledge embedded in the vast amounts of data. Efforts have
been made to develop expert system and knowledge-based technologies, which typically
rely on users or domain experts to manually input knowledge into knowledge bases.
Unfortunately, however, the manual knowledge input procedure is prone to biases and
errors and is extremely costly and time consuming. The widening gap between data and
information calls for the systematic development of data mining tools that can turn data
tombs into “golden nuggets” of knowledge.
1.2 What Is Data Mining?
It is no surprise that data mining, as a truly interdisciplinary subject, can be defined
in many different ways. Even the term data mining does not really present all the major
components in the picture. Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or KDD.
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
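To make the KDD steps concrete before the lab, here is a minimal sketch of a cleaning-plus-mining pipeline using the Weka Java API, the same library the lab drives through the GUI. It assumes Weka 3.8 on the classpath; the file name data.arff is a placeholder, and the dataset is assumed to have a nominal class attribute in the last position.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class KddSketch {
    public static void main(String[] args) throws Exception {
        // Data selection: load the relevant data (placeholder file name).
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);
        // Data cleaning: impute missing values with the attribute mean/mode.
        ReplaceMissingValues clean = new ReplaceMissingValues();
        clean.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, clean);
        // Data mining: learn a decision tree (one possible intelligent method).
        J48 tree = new J48();
        tree.buildClassifier(cleaned);
        // Knowledge presentation: print the mined model.
        System.out.println(tree);
    }
}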
LAB 1
Weka 3: Data Mining with Open Source Machine Learning Software in Java
[Link]
Sample datasets ship with Weka under c:\Program Files\Weka-3-8-5\data
Open [Link]. Filter -> Choose -> filters -> unsupervised -> attribute -> NumericCleaner: attributeIndices = 6 (mass), minDefault = NaN, minThreshold = 0.1E-7. OK, Apply.
Select mass, then Edit to inspect.
Filter -> unsupervised -> instance -> RemoveWithValues: attributeIndex = 6, matchMissingValues = True. OK, Apply. Check mass; in Edit, the rows with missing values have been removed.
Impute instead of removing: Undo, then choose filter -> unsupervised -> attribute -> ReplaceMissingValues, Apply. In Edit, the missing values have been replaced.
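The same mark-then-handle sequence can be scripted with the Weka Java API. A sketch, assuming a dataset with a numeric mass attribute at index 6 (diabetes.arff is an assumption; the NumericCleaner setter names mirror its GUI property names, so verify them against your Weka version):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericCleaner;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class MarkAndHandleMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff"); // assumed file name
        // Mark implausibly small values of attribute 6 (mass) as missing (NaN).
        NumericCleaner cleaner = new NumericCleaner();
        cleaner.setAttributeIndices("6");
        cleaner.setMinThreshold(0.1e-7);   // values below this threshold...
        cleaner.setMinDefault(Double.NaN); // ...become missing
        cleaner.setInputFormat(data);
        Instances marked = Filter.useFilter(data, cleaner);
        // Option 1: drop the rows whose mass is now missing.
        RemoveWithValues drop = new RemoveWithValues();
        drop.setAttributeIndex("6");
        drop.setMatchMissingValues(true); // missing values count as a match
        drop.setInputFormat(marked);
        Instances reduced = Filter.useFilter(marked, drop);
        // Option 2: impute instead, using each attribute's mean/mode.
        ReplaceMissingValues impute = new ReplaceMissingValues();
        impute.setInputFormat(marked);
        Instances imputed = Filter.useFilter(marked, impute);
        System.out.println("removal keeps " + reduced.numInstances()
                + " rows; imputation keeps " + imputed.numInstances());
    }
}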
Weather numeric data: Edit; the play probability percentage should lie between 0 and 100. Filter -> unsupervised -> attribute -> NumericCleaner: maxThreshold = 100, minThreshold = 0, maxDefault = 100, minDefault = 0.
Values from 45 to 49 must become 50: closeTo = 47, closeToDefault = 50, closeToTolerance = 3 (i.e., values within 3 of 47), attributeIndices = 5. OK, Apply, Edit.
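A sketch of the same clamping in code (attribute index 5 and the thresholds are taken from the lab sheet; the file name is an assumption, and the setter names mirror the GUI property names):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericCleaner;

public class ClampPercentage {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff"); // assumed file name
        NumericCleaner cleaner = new NumericCleaner();
        cleaner.setAttributeIndices("5");
        cleaner.setMinThreshold(0);   cleaner.setMinDefault(0);   // clamp below 0 up to 0
        cleaner.setMaxThreshold(100); cleaner.setMaxDefault(100); // clamp above 100 down to 100
        cleaner.setCloseTo(47);          // values within...
        cleaner.setCloseToTolerance(3);  // ...3 of 47 (i.e., 45 to 49)
        cleaner.setCloseToDefault(50);   // ...are snapped to 50
        cleaner.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, cleaner);
        System.out.println(cleaned);
    }
}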
Outliers: open [Link], filter -> unsupervised -> attribute -> InterquartileRange, Apply. Two new attributes, Outlier and ExtremeValue, are added at positions 10 and 11. Edit to inspect.
Outlier removal: unsupervised -> instance -> RemoveWithValues, attributeIndex = 10, nominalIndices = last. Then: filter -> unsupervised -> instance -> RemoveWithValues, attributeIndex = 11, nominalIndices = last. OK, Apply, save [Link].
Normalize: open the weather data [Link] (ARFF: Attribute-Relation File Format).
Filter -> Unsupervised -> attribute -> Normalize; applies to numeric attributes only. Select Normalize in the filter bar to edit: scale = 1, translation = 0 gives values between 0 and 1; for -1 to +1 choose scale = 2 and translation = -1. OK, Apply, Edit, Undo. Save will replace the file, so give a new name.
Filter -> Unsupervised -> attribute -> Standardize (zero mean, unit variance); applies to numeric attributes only.
Check each numeric attribute's mean and standard deviation.
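Both rescalings in code, a minimal sketch (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Standardize;

public class Rescale {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff"); // assumed file name
        // Normalize: map each numeric attribute to [0, 1] (the default scale/translation).
        Normalize norm = new Normalize();
        // For [-1, +1] use: norm.setScale(2.0); norm.setTranslation(-1.0);
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized);
        // Standardize: transform each numeric attribute to zero mean, unit variance.
        Standardize std = new Standardize();
        std.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, std);
        System.out.println(standardized);
    }
}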
Rushdi Shams Weka Tutorials (KnowledgeFlow layout):
DataSources -> ArffLoader
Evaluation -> ClassAssigner
Filters -> supervised -> AttributeSelection
Visualization -> TextViewer
Convert CSV files to ARFF files
Download the files from [Link] and unzip.
Open [Link] and [Link] using Notepad; change the extension to .txt.
Open Excel: Data -> Get External Data -> From Text.
Go to the download folder, select [Link], Next, choose Tab and Space as delimiters, Next, Finish; place the data at =$A$1.
Insert a new row at the top and copy-paste the column names from [Link].
Save the file as CSV.
Weka: Tools -> ArffViewer -> File -> Open, select the CSV file, then Save As ARFF.
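If the CSV already has a header row, the Excel detour can be skipped: Weka's converters do the conversion directly. A sketch with assumed file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the header row becomes the attribute names.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("dataset.csv")); // assumed file name
        Instances data = loader.getDataSet();
        // Write the same instances out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("dataset.arff"));
        saver.writeBatch();
    }
}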
Data cleaning using Weka:
Open file [Link].
Check the relation name.
Select the first attribute; check for missing values (in this case 2% for the first attribute).
Select Edit; you can find a lot of missing values, shown in grey.
1. Replace missing values using Weka:
Go to Filter: weka -> filters -> unsupervised -> attribute -> ReplaceMissingValues, Apply.
Discretize
Open [Link].
Select attribute age: unsupervised -> attribute -> Discretize. Click the Discretize bar: attributeIndices = 13 (for age), binRangePrecision = 2 (decimal limit for bin boundaries), bins = 3. Apply, then Save As type CSV.
Open the file in Excel, replace the bin values with Old, Middle, and Young, and save the file as CSV.
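Equal-width binning in code, a sketch (the age column index 13 follows the lab sheet; the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAge {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        Discretize disc = new Discretize(); // equal-width binning by default
        disc.setAttributeIndices("13");     // the age column from the lab sheet
        disc.setBins(3);                    // three bins: young / middle / old
        disc.setInputFormat(data);
        Instances binned = Filter.useFilter(data, disc);
        System.out.println(binned.attribute(12)); // 0-based index of attribute 13
    }
}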
2. Info Gain Attribute Evaluator
Open the CSV file [Link] in Weka.
Select Attributes from the top bar.
Attribute Evaluator: InfoGainAttributeEval.
Alert: answer Yes to switch the search method to Ranker.
Start, then check the results.
Select attributes 17, 19, 18, 8, 11, 16, Remove, Save.
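The same ranking from code, a sketch (file name assumed; the class attribute is assumed to be last):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new InfoGainAttributeEval()); // information gain w.r.t. the class
        sel.setSearch(new Ranker());                   // rank all attributes
        sel.SelectAttributes(data);
        System.out.println(sel.toResultsString());     // ranked list, highest gain first
    }
}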
3. Change any attribute to the class
Open [Link].
Edit.
Select mpg, Set Attribute As Class, OK.
4. Change Numeric to Nominal
Open [Link].
Select attribute preg (numeric).
Weka: filters -> unsupervised -> attribute -> NumericToNominal. Click the bar: attributeIndices = 1. Apply.
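In code, a sketch (diabetes.arff is an assumption based on the preg attribute):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class PregToNominal {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff"); // assumed file name
        NumericToNominal toNom = new NumericToNominal();
        toNom.setAttributeIndices("1"); // preg, the first attribute
        toNom.setInputFormat(data);
        Instances converted = Filter.useFilter(data, toNom);
        System.out.println(converted.attribute(0)); // now nominal
    }
}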
5. Normalize
Open [Link].
Weka: filters -> unsupervised -> attribute -> Normalize, Apply.
Undo, then Standardize, Apply.
6. Remove missing values
Open [Link].
Select attribute plant-stand; it has missing values.
Weka: filters -> unsupervised -> instance -> RemoveWithValues. Click the bar: attributeIndex = 2, invertSelection = True, matchMissingValues = True. OK.
7. Best attributes
Weka: filters -> supervised -> attribute -> AttributeSelection.
Weka Select Attributes tab: Choose -> ClassifierSubsetEval, click, classifier -> Choose -> NaiveBayes, OK, Start.
Then Choose -> trees -> J48, OK, Start.
Find the best attributes.
Preprocess: select attributes 1, 3, 4, 5, click Invert, Remove.
Classify with Naive Bayes and see the results.
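A sketch of the same wrapper-style selection in code (file name assumed; BestFirst is the GUI's default subset search and an assumption here):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        // Score candidate attribute subsets by how well NaiveBayes performs on them.
        ClassifierSubsetEval eval = new ClassifierSubsetEval();
        eval.setClassifier(new NaiveBayes());
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(new BestFirst());
        sel.SelectAttributes(data);
        System.out.println(sel.toResultsString());
        // Swap in new weka.classifiers.trees.J48() to compare the subsets each model prefers.
    }
}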
8. Finding outliers
Open file [Link].
Weka: filters -> unsupervised -> attribute -> InterquartileRange, Apply.
Two extra columns are added. Select the Outlier column, set the class to Outlier, Visualize.
Weka: filters -> unsupervised -> instance -> RemoveWithValues; click the bar.
The attribute Outlier has two values, no (1) and yes (2). We want to remove the outliers, so nominalIndices = 2 (or last).
attributeIndex = 11, nominalIndices = 2, Classify.
Undo, click the bar, detectionPerAttribute = True, Apply; Undo.
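The flag-then-filter sequence in code, a sketch (file name assumed; the Outlier column index 11 follows the lab sheet):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class DropOutliers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        // Flag outliers: appends nominal Outlier and ExtremeValue attributes.
        InterquartileRange iqr = new InterquartileRange();
        iqr.setAttributeIndices("first-last");
        iqr.setInputFormat(data);
        Instances flagged = Filter.useFilter(data, iqr);
        // Remove the rows whose Outlier flag is "yes" (the last nominal index).
        RemoveWithValues drop = new RemoveWithValues();
        drop.setAttributeIndex("11");   // Outlier column index from the lab sheet
        drop.setNominalIndices("last"); // match the "yes" label
        drop.setInputFormat(flagged);
        Instances kept = Filter.useFilter(flagged, drop);
        System.out.println(flagged.numInstances() + " rows -> " + kept.numInstances());
    }
}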
9. Numeric transform
Open [Link]. Weka: filter -> unsupervised -> attribute -> NumericTransform, methodName = floor.
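In code, a sketch (file name and attribute range are assumptions; NumericTransform applies a static method of the named class to each selected numeric value):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericTransform;

public class FloorTransform {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        NumericTransform floor = new NumericTransform();
        floor.setClassName("java.lang.Math");      // class providing the method (the default)
        floor.setMethodName("floor");              // Math.floor applied to each value
        floor.setAttributeIndices("first-last");   // assumes the selected columns are numeric
        floor.setInputFormat(data);
        Instances transformed = Filter.useFilter(data, floor);
        System.out.println(transformed);
    }
}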
10. PCA
Open file [Link]. Filter: unsupervised -> attribute -> PrincipalComponents. Click: varianceCovered = 0.95, OK, Apply.
Check the variance/standard deviation on the right. Take the maximum variance and set a threshold at 50% of that maximum. The other attributes fall below that threshold; select them (2, 3, 4, 5) and click Remove.
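The same transformation in code, a sketch (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class PcaFilter {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // keep enough components to cover 95% of the variance
        pca.setInputFormat(data);
        Instances rotated = Filter.useFilter(data, pca);
        System.out.println(rotated.numAttributes() + " components retained");
    }
}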
Sparse dataset
Open file [Link]; Edit to see the sparse data.
Filter: Choose -> weka -> filters -> unsupervised -> instance -> NonSparseToSparse.
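In code, a sketch (file name assumed); each instance is then stored as {index value} pairs, with zero values omitted:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.NonSparseToSparse;

public class ToSparse {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // assumed file name
        NonSparseToSparse sparse = new NonSparseToSparse();
        sparse.setInputFormat(data);
        Instances converted = Filter.useFilter(data, sparse);
        System.out.println(converted); // prints instances in sparse {index value} form
    }
}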