Komal DWDM 1to5
DATA WAREHOUSING & DATA MINING
(01CE0723)
Lab Manual
A.Y. 2025-26
14. Case study on applications of Data Mining tools and techniques used for Business Intelligence. (CO2, CO3, CO4, CO5)
Experiment 1
1. WEKA
∙ Introduction:
WEKA (Waikato Environment for Knowledge Analysis) is an open-source software for data mining
tasks. It contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to these functions.
∙ Detailed Description with Screenshots of Various Features:
Tool 2: RapidMiner
● Introduction:
RapidMiner is a powerful, open-source data science platform designed for data preparation,
machine learning, deep learning, and text mining. It has a drag-and-drop interface and
supports a wide range of data science tasks.
Tool 3: Talend
● Introduction:
Talend is an open-source data integration and data warehousing tool that enables data
transformation, migration, and synchronization. It is widely used for big data integration and
ETL processes.
● Detailed Description with Screenshots of Various Features:
Feature | WEKA | RapidMiner | Talend
Supported Algorithms | Many classic ML algorithms: J48, Naive Bayes, Random Forest | 1500+ operators: ML, DL, text mining, sentiment analysis | No ML algorithms; focuses on data flows and transformations
Data Format Support | ARFF, CSV, C4.5 | CSV, Excel, DB, JSON, XML, BigQuery, Hadoop, etc. | Hundreds of formats; strong database and API support
Experiment Outcome:
This experiment provided hands-on exposure to three powerful tools—WEKA, RapidMiner, and
Talend—highlighting their strengths in data mining and warehousing. WEKA and RapidMiner
proved effective for data analysis and predictive modeling, while Talend showcased robust
capabilities for data integration and ETL tasks. Each tool serves a unique purpose, and
understanding their features enables more informed tool selection for real-world data science
applications.
Introduction to WEKA
WEKA stands for Waikato Environment for Knowledge Analysis. It is a powerful, open-source
suite of machine learning software developed to facilitate data mining and analysis tasks.
WEKA provides tools for:
• Data pre-processing
• Classification
• Regression
• Clustering
• Association rules
• Visualization
WEKA is written in Java and offers a Graphical User Interface (GUI) as well as a command-line
interface. It is widely used for teaching, research, and practical machine learning applications.
History of WEKA
1. 1992 – Project Initiation
WEKA was initiated at the University of Waikato in Hamilton, New Zealand. The original
aim was to create a tool that supported machine learning algorithms and made them easily
accessible to non-programmers.
2. Early Development
Initially, WEKA was developed as a closed-source project that focused on algorithms for
analyzing agricultural data.
3. 1997 – Open Source Release
The project was restarted from scratch in 1997 and released as open-source software under
the GNU General Public License (GPL). This made it freely available for public use and
significantly increased its popularity.
4. Rapid Growth and Popularity
After becoming open source, WEKA grew rapidly as a popular tool in the data mining and
machine learning community. It became one of the primary educational tools for students
and researchers to learn and experiment with machine learning concepts.
5. Development Contributions
WEKA was primarily developed by the Machine Learning Group at the University of
Waikato, but it also attracted contributions from the global research community.
6. Widespread Adoption
WEKA gained widespread adoption due to its:
o Simple user interface
o Easy accessibility of machine learning algorithms
WEKA Applications:
1) Disease Prediction
2) Market Basket Analysis
3) Credit Scoring and Risk Assessment
4) Spam Detection
5) Crop Yield Prediction
Modules in WEKA:
1. Explorer
2. Experimenter
3. KnowledgeFlow
4. Workbench
5. Simple CLI
Module 1: Explorer
Description:
This figure shows the file selection dialog box that appears when the user clicks on the
Open file... button in the Preprocess tab of the WEKA Explorer.
The dialog lets the user browse the local system to locate and select dataset files (typically
in .arff format):
• The dialog box displays folders and files from the system’s directories.
• The bottom panel shows the File Name input field and the file type filter, which is set to
accept only ARFF data files by default.
This step is essential for loading datasets into WEKA for further analysis and processing.
∙ Clean/filter/transform data
Module 2: Experimenter
Module 3: KnowledgeFlow
Description:
This figure shows the WEKA KnowledgeFlow Environment, a graphical interface that
allows users to build and visualize machine learning workflows. Unlike the command-line
or Explorer interface, KnowledgeFlow uses a drag-and-drop design where components
such as data sources, filters, classifiers, and evaluators are added and connected visually.
Key elements include:
• Design Panel (Left): Contains various components organized into categories (e.g.,
DataSources, Filters, Classifiers).
• Workflow Area (Center): Users can design experiments by placing and linking
components.
• Toolbar (Top): Provides tools for saving, running, and editing workflows.
• Status/Log Panel (Bottom): Displays messages, logs, and execution status.
Module 4: Workbench
Module 5: Simple CLI
Description:
This screenshot shows the Simple CLI window of WEKA. Key features include:
• A command input line to enter WEKA class and method calls
• Commands follow Java class names for classifiers, filters, and other functions
• Tab-completion support for easier typing of class names
• Real-time output of model training, testing, or file processing
• Access to all core WEKA functionality, including preprocessing, model training,
evaluation, and saving results, all through typed commands rather than the GUI.
Experiment Outcome:
Dataset Code
@relation student
@attribute Name string
@attribute Age numeric
@attribute Gender {Male, Female}
@attribute Bdate date "yyyy-MM-dd"
@attribute Email string
@attribute City {Rajkot, Ahmedabad, Jamnagar, Gondal}
@attribute Married {Yes, No}
@attribute Address string
@attribute Mobile numeric
@attribute Backlog {Yes, No}
@data
"Win",21,Female,"2003-05-31","[email protected]","Rajkot",No,"Marwadi University",7042159221,No
"Yadanar",22,Female,"2002-05-31","[email protected]","Ahmedabad",No,"Parul University",7042161665,Yes
"Phyo",20,Male,"2001-03-31","[email protected]","Rajkot",Yes,"Taungoo University",250400742,No
"Kaung",23,Male,"2000-02-14","[email protected]","Jamnagar",No,"Yangon University",7042159222,Yes
"Si",25,Female,"2003-12-31","[email protected]","Rajkot",No,"Marwadi University",7042159221,No
"Yoon",14,Female,"2012-05-31","[email protected]","Gondal",Yes,"Yeni University",263836707,Yes
"Yu",26,Male,"2005-04-22","[email protected]","Ahmedabad",No,"Marwadi University",7042159121,No
"WinWin",21,Female,"2003-05-31","[email protected]","Rajkot",No,"Dagon University",250400742,Yes
"Chit",22,Male,"2003-05-31","[email protected]","Rajkot",No,"Marwadi University",7042159221,No
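The listing above follows the ARFF layout: an @relation line, one @attribute line per column, then @data followed by comma-separated rows. As a rough illustration only, the Python sketch below reads that layout with the standard library. The function name parse_arff and the shortened three-attribute sample are assumptions made for this example; it is not WEKA's own loader and it ignores features such as sparse data.

```python
import csv

# Shortened sample based on the student listing (only three attributes kept).
SAMPLE = """@relation student
@attribute Name string
@attribute Age numeric
@attribute Backlog {Yes, No}
@data
"Win",21,No
"Yadanar",22,Yes
"""


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Hypothetical helper for illustration; values are kept as strings.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            # csv handles quoted values such as "Marwadi University"
            values = next(csv.reader([line], skipinitialspace=True))
            rows.append([v.strip() for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

Keeping every value as a string mirrors the raw file; converting numeric and date attributes would be a separate step driven by the declared attribute types.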
Komal Buddhdev (92310103021) | 16
Analysis of “student.arff” with WEKA
1) Age Attribute
2) Gender Attribute
Description:
The Gender attribute is of nominal type with two distinct values: Male and Female. There are
no missing values, and the dataset contains 4 male and 5 female instances. The histogram
shows that both genders are almost equally represented. The class distribution (Backlog =
Yes/No) appears in both groups, indicating that gender does not show a strong imbalance or
clear pattern related to backlog status in this small sample.
3) Bdate Attribute
4) Email Attribute
Description:
The Email attribute is of string type, with no missing values. Out of 9 instances, there are 8
distinct entries, and 7 of them are unique, making up 78% uniqueness. Since it is a string
attribute, Weka does not perform standard numeric or nominal statistical analysis. Also, it
cannot be visualized in the standard plot window, as shown in the message: "Attribute is neither
numeric nor nominal." This attribute is typically used for identification or reference purposes
and doesn't contribute directly to model training unless processed or encoded into a usable
form.
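The "distinct" and "unique" figures reported for a string attribute count different things: distinct is the number of different values, while unique is the number of values that occur exactly once. The Python sketch below shows one way to reproduce the 8 distinct / 7 unique / 78% pattern. The email addresses are hypothetical placeholders invented for this example (nine values with one repeated), not the values from the dataset.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (number of different values, number of values seen exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct, unique


# Hypothetical data: eight different addresses, the first one repeated once.
emails = [f"student{i}@example.com" for i in range(1, 9)] + ["student1@example.com"]

distinct, unique = distinct_and_unique(emails)
unique_pct = round(100 * unique / len(emails))
print(distinct, unique, f"{unique_pct}%")
```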
5) City Attribute
Description:
This “Visualize All” panel in WEKA displays attribute distributions in black and white. Each
histogram shows the frequency of attribute values across all 9 instances.
• Age, Gender, Bdate, City, Married, Mobile, Backlog: Visualized with bar charts. For example,
most students are aged 20+, live in Rajkot, and are unmarried.
• Name, Email, Address: Not visualized – labeled as “neither numeric nor nominal”.
• Backlog: Final target attribute showing 5 students with no backlog and 4 with backlog.
Dataset Code
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
1) Outlook Attribute
This WEKA Explorer window shows the Preprocess tab for the dataset weather.symbolic,
which has 5 attributes and 14 instances.
• The selected attribute is play (target class).
• It is a nominal attribute with two values:
o yes (9 instances)
o no (5 instances)
• The bar chart below visualizes the class distribution, with more instances labeled yes.
2) Temperature Attribute:
3) Humidity Attribute:
Description:
This figure displays the humidity attribute from the weather.symbolic dataset in WEKA
Explorer. The attribute is nominal with two distinct values: high and normal, each
occurring 7 times. The bar chart below shows how each humidity level relates to the
target class play (blue for "yes", red for "no"). Although the two values are equally
frequent, their class distributions differ: in the data listed above, play = yes occurs in 3 of
the 7 high instances and in 6 of the 7 normal instances, so humidity carries useful
information for classification.
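The stacked bars in the Preprocess tab are a cross-tabulation of the selected attribute against the class attribute. As an illustrative sketch (not WEKA's own code), the Python below rebuilds those counts from the 14 weather instances listed earlier. The helper name class_counts is an assumption made for this example.

```python
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# The 14 instances from the weather.symbolic listing.
WEATHER_ROWS = """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no"""

rows = [line.split(",") for line in WEATHER_ROWS.splitlines()]


def class_counts(rows, attribute, class_attribute="play"):
    """Count (attribute value, class value) pairs, like the stacked bar chart."""
    a = ATTRIBUTES.index(attribute)
    c = ATTRIBUTES.index(class_attribute)
    return Counter((row[a], row[c]) for row in rows)


humidity = class_counts(rows, "humidity")
windy = class_counts(rows, "windy")
for (value, label), count in sorted(humidity.items()):
    print(f"humidity={value:7s} play={label:3s} {count}")
```

The same call with "outlook" or "temperature" gives the counts behind the other attribute charts.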
4) Windy Attribute
Description:
This is the Preprocess tab of WEKA Explorer showing the dataset weather.symbolic with
5 attributes and 14 instances. The selected attribute is windy, which has two values:
TRUE (6 times) and FALSE (8 times). A bar chart below shows how the windy attribute
relates to the target class play. The left chart represents windy = TRUE, and the right
chart represents windy = FALSE. The top menu has options like Open file, Save, Edit, and
Generate for managing the dataset.
Figure 3.2.4: Windy Attribute
5) Play Attribute:
Description:
This is the Preprocess tab of WEKA Explorer showing the dataset weather.symbolic with
5 attributes and 14 instances. The selected attribute is play, which has two values: yes (9
times) and no (5 times). The bar chart below shows the distribution: the blue bar
represents "yes" and the red bar represents "no". The dataset is ready for further
processing like classification or visualization.
Dataset Code
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
1) Sepallength Attribute
Description:
This image shows the Preprocess tab of WEKA Explorer with the Iris dataset loaded. The
dataset contains 150 instances and 5 attributes: sepallength, sepalwidth, petallength,
petalwidth, and class. The class attribute is selected, which is a nominal type with three
distinct classes: Iris-setosa, Iris-versicolor, and Iris-virginica, each having 50 instances.
The bar chart at the bottom visually represents the class distribution. The interface also
provides options to open files, apply filters, and manage attributes.
2) Sepalwidth Attribute
3) Petallength Attribute
Description:
This image shows the Preprocess tab of WEKA Explorer with the Iris dataset loaded and the
attribute petallength selected. The petallength attribute is numeric with 43 distinct values,
ranging from 1 to 6.9. The mean is 3.759 and the standard deviation is 1.764. The histogram at
the bottom displays the distribution of petal lengths across the three iris classes, with each class
shown in a different color, illustrating clear separation among the classes based on petal length.
4) Petalwidth Attribute
Description:
This image shows the Preprocess tab of WEKA Explorer with the Iris dataset loaded and the
attribute petalwidth selected. The petalwidth attribute is numeric with 22 distinct values,
ranging from 0.1 to 2.5. The mean value is 1.199 and the standard deviation is 0.763. The
histogram at the bottom represents the distribution of petal width across the three iris
classes, each displayed in different colors, indicating how the petal width varies between the
classes.
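The statistics panel for a numeric attribute reports the minimum, maximum, mean, standard deviation, and number of distinct values. The Python sketch below computes the same kind of summary for the petalwidth values of the first ten instances in the Iris listing, so its numbers describe that small sample and not the full 150-instance dataset. The helper name summarize is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption about how the reported figure is defined.

```python
import statistics


def summarize(values):
    """Summary statistics similar to those shown for a numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1 denominator); assumed convention.
        "stdev": statistics.stdev(values),
        "distinct": len(set(values)),
    }


# petalwidth of the first ten instances in the Iris listing
petal_widths = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1]

summary = summarize(petal_widths)
for name, value in summary.items():
    print(f"{name}: {value}")
```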
5) Class Attribute
Description:
This image shows the Preprocess tab of WEKA Explorer with the Iris dataset loaded. All four
input attributes (sepallength, sepalwidth, petallength, petalwidth) and the class attribute
are selected. The class attribute is nominal with three distinct classes: Iris-setosa,
Iris-versicolor, and Iris-virginica, each having 50 instances. The bar chart below shows that
the dataset is perfectly balanced, with an equal number of samples in each class,
represented by three colored bars.
Description:
This image shows the Visualize All Attributes window in WEKA Explorer for the Iris dataset.
It displays histograms for sepallength, sepalwidth, petallength, petalwidth, and class. Each
class is color-coded: blue, red, and cyan. Petal length and petal width show clear separation
between classes, while sepal attributes have more overlap. The class distribution is balanced
with 50 instances each.
• Successfully loaded and explored the “student”, “weather.nominal”, and “iris” datasets using
WEKA.
• Performed data preprocessing such as handling missing values and attribute editing.
• Applied classification algorithms (e.g., J48, Naive Bayes) and interpreted results using
evaluation metrics.
• Analyzed patterns and relationships within the datasets using statistical summaries and
filters.
• Used visualization tools in WEKA to generate scatter plots, histograms, and decision trees for
better data understanding.
• Identified class distributions and feature importance, especially in the iris dataset through
classification and clustering.
• Gained hands-on experience in data analysis workflow, from loading datasets to applying
machine learning models and interpreting results.
This figure shows the "Remove" filter in Weka’s Preprocess panel, used to eliminate selected
attributes from a dataset. In the example, the attribute "duration" is selected for removal from
the "labor-neg-data" dataset. The right panel displays statistical details of the selected
attribute, such as minimum, maximum, mean, and standard deviation, along with a class
distribution histogram.
- Steps for applying the filter
1) Click "Open file" and load your dataset.
2) Click "Choose" under the Filter section.
3) Select: Unsupervised → Attribute → Remove
4) Click the filter name (Remove) to set options (e.g., select attribute indices to remove).
5) Click "Apply" to apply the filter.
Description:
This screenshot shows the Remove filter in WEKA's Preprocess tab. The filter Remove -R 1 is
selected to remove the first attribute (wage-increase-first-year) from the dataset. Users can
select attributes to remove using checkboxes or index range, and then click Apply to exclude
them from the dataset.
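Conceptually, removing attributes by index means dropping the same columns from the header and from every instance. The Python sketch below illustrates that idea with 1-based indices, as in the -R 1 option shown above. The function name remove_attributes and the two data rows are hypothetical, chosen only for this example; this is not WEKA's implementation.

```python
def remove_attributes(header, rows, indices_to_remove):
    """Drop the given 1-based attribute indices from the header and all rows."""
    drop = {i - 1 for i in indices_to_remove}  # convert to 0-based positions
    keep = [i for i in range(len(header)) if i not in drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical miniature of a labor-style dataset (values invented).
header = ["duration", "wage-increase-first-year", "class"]
rows = [["1", "5.0", "good"], ["2", "4.5", "bad"]]

new_header, new_rows = remove_attributes(header, rows, [1])
print(new_header)
print(new_rows)
```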
This filter replaces all missing values in a dataset with a user-defined constant.
Working: You choose a constant value (like "0" or "unknown"), and the filter fills in all missing
numeric or nominal values with it.
Importance: It ensures data completeness using a specific value chosen by the user.
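The idea can be sketched in a few lines of Python. ARFF files mark a missing value with "?", so the sketch replaces every "?" with the chosen constant. The function name and the sample rows are hypothetical, and using a single constant for all attribute types is a simplification made for this example.

```python
MISSING = "?"  # ARFF marker for a missing value


def replace_missing_with_constant(rows, constant):
    """Return a copy of rows with every missing value replaced by constant."""
    return [[constant if value == MISSING else value for value in row]
            for row in rows]


# Hypothetical rows with two missing entries.
sample = [["2", "?", "good"], ["?", "4.5", "bad"]]
filled = replace_missing_with_constant(sample, "0")
print(filled)
```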
This filter in WEKA (ReplaceWithMissingValue) replaces values of the selected attribute(s)
with missing (unknown) values. You specify which attribute(s) to affect and the probability
with which each value is replaced, and the filter marks the chosen values as missing. It is
useful for simulating missing data during preprocessing.
Description:
This figure shows the ReplaceWithMissingValue filter applied to the "wage-increase-third-
year" attribute, where 88% of values have been replaced with missing values. It demonstrates
the effect of increasing the percentage of data made missing, useful for testing data imputation
methods or simulating incomplete datasets.
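A minimal Python sketch of this kind of simulation is given below: each value in one column is independently replaced by the missing marker with a chosen probability. The function name, the seed parameter, and the sample rows are assumptions made for this example, so the exact instances that become missing will not match WEKA's output.

```python
import random

MISSING = "?"  # ARFF marker for a missing value


def replace_with_missing(rows, column, probability, seed=1):
    """Replace values in one column with MISSING, each with the given probability."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    result = []
    for row in rows:
        new_row = list(row)
        if rng.random() < probability:
            new_row[column] = MISSING
        result.append(new_row)
    return result


# Hypothetical rows; column 1 stands in for a numeric attribute.
sample_rows = [[str(i), str(2.0 + i / 10)] for i in range(20)]

simulated = replace_with_missing(sample_rows, 1, 0.88)
missing_count = sum(1 for row in simulated if row[1] == MISSING)
print(f"{missing_count} of {len(simulated)} values made missing")
```

Because the replacement is random, the observed share of missing values only approximates the requested probability on small datasets.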
3) Discretize
- Filter Introduction, working & importance
The Discretize filter in WEKA is used to convert numeric attributes into nominal (categorical)
ones by dividing their range into fixed intervals or bins. It works by specifying the number of
bins or using supervised methods to group values based on class labels. This is important when
algorithms require categorical input or when simplifying continuous data helps in better
pattern recognition and interpretation.
- Dataset before filter
Description:
This figure displays the result of the Discretize filter applied to the "duration" attribute. The
numeric values are converted into defined intervals (bins), making the attribute nominal for
categorical analysis and algorithm compatibility.
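Equal-width binning, the unsupervised form of discretization described above, can be sketched as follows in Python. The range between the minimum and maximum is split into a fixed number of intervals and each value is mapped to the index of its interval. The function name, the sample values, and the treatment of interval boundaries (a value on a boundary goes to the upper bin, except the maximum) are simplifying assumptions for this example and may differ from WEKA's Discretize filter.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index in the range 0 .. bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:  # all values identical: everything falls in one bin
        return [0 for _ in values]
    indices = []
    for v in values:
        index = int((v - lo) / width)
        indices.append(min(index, bins - 1))  # keep the maximum in the last bin
    return indices


# Hypothetical numeric values split into two equal-width bins.
values = [1.0, 1.5, 2.0, 3.0]
labels = equal_width_bins(values, 2)
print(labels)
```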
Experiment Outcome:
The experiment successfully demonstrated the application of various preprocessing filters in
WEKA, including ReplaceMissingValues, ReplaceMissingWithUserConstant,
ReplaceWithMissingValue, and Discretize. Each filter was applied to clean, modify, or
transform the dataset attributes. The outcomes showed improved data quality,
preparation of missing values, and conversion of numeric data into categorical form.
These steps are essential for enhancing the accuracy and effectiveness of machine learning
models by ensuring the dataset is consistent, complete, and algorithm-ready.
Experiment 5
Dataset: iris.arff
The iris.arff dataset is a well-known and widely used dataset in machine learning and pattern
recognition. It consists of 150 instances, each representing a sample of an iris flower. The dataset
includes five attributes: sepallength, sepalwidth, petallength, petalwidth, and class. The first four
attributes are numeric and represent the physical dimensions of the flower's sepals and petals in
centimeters. The fifth attribute, class, is nominal and indicates the species of the iris flower, which
can be one of three categories: Iris-setosa, Iris-versicolor, or Iris-virginica.
Step-2: dataset in table format.
FACULTY OF ENGINEERING & TECHNOLOGY
Department of Computer Engineering
01CE0723 – DWDM – Lab Manual
The screenshot shows the tabular view of the iris.arff dataset in Weka. It contains five attributes:
sepallength, sepalwidth, petallength, petalwidth (all numeric), and class (nominal), which indicates the
iris flower species such as Iris-setosa. Each row represents one flower instance with its measured values.
This image shows the Weka Explorer interface applying the NumericToNominal filter to convert
numeric attributes (1–3) in the Iris dataset to nominal. A histogram and attribute statistics are also
visible.
This image shows Weka Explorer after applying the NumericToNominal filter to attributes 1–3 of
the Iris dataset. The sepallength attribute is now treated as nominal, with distinct value counts and
a class distribution histogram displayed.
The StringToNominal filter in Weka is used to convert string attributes into nominal attributes. It is
particularly useful when the dataset contains categorical data represented as text (strings) that needs to be
transformed into discrete values (nominal). This conversion helps in applying machine learning algorithms
that require nominal data as input.
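The conversion can be pictured as collecting the different strings that occur in a column and treating that set as the attribute's list of nominal labels. The Python sketch below does this in order of first appearance. The function name is hypothetical, and the sample reuses city names from the student dataset as if they had been stored as strings.

```python
def string_to_nominal(values):
    """Return (labels, indices): the label set and each value's label index."""
    labels = []
    for value in values:
        if value not in labels:  # keep labels in order of first appearance
            labels.append(value)
    indices = [labels.index(value) for value in values]
    return labels, indices


cities = ["Rajkot", "Ahmedabad", "Rajkot", "Jamnagar"]
labels, indices = string_to_nominal(cities)
print(labels)   # the nominal label set
print(indices)  # each instance expressed as a label index
```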
Dataset: contact-lenses.arff
This image shows the Preprocess tab of the Weka Explorer interface, where the "contact-lenses"
dataset is currently loaded. The dataset contains 24 instances and 5 attributes: age, spectacle-
prescrip, astigmatism, tear-prod-rate, and contact-lenses. The selected attribute in this view is
"age," which is a nominal attribute with three distinct values: young, pre-presbyopic, and
presbyopic. Each of these age categories contains an equal count of 8 instances, indicating a
balanced distribution across the dataset.
On the right side, a detailed summary of the selected attribute is displayed, showing the label
names, counts, and weights. Below that, a bar chart visualizes how the values of the class attribute
"contact-lenses" are distributed across each age category. Each color in the bars represents a
different class label (such as "no lenses," "soft," or "hard" lenses). The visualization allows users to
observe how the lens recommendations vary based on age groups, providing insights into the
relationship between age and lens type. This setup is part of the preprocessing phase in Weka, often
used to explore and understand the structure of the dataset before applying machine learning
algorithms.
Step-2: dataset in table format.
This image displays the data viewer in Weka for the "contact-lenses" dataset. It shows 24 instances with 5
nominal attributes: age, spectacle prescription, astigmatism, tear production rate, and contact lens
recommendation. The table provides a clear view of how different attribute combinations influence the
contact lens type prescribed (none, soft, or hard).
This image shows the Weka Explorer after applying the StringToNominal filter to the contact-
lenses dataset. All five attributes (age, spectacle-prescrip, astigmatism, tear-prod-rate, and contact-
lenses) have been successfully converted to nominal type, enabling categorical data analysis. The
visualization panel displays class distribution for the contact-lenses attribute across the three age
groups (young, pre-presbyopic, and presbyopic), each with equal instance counts.
This screenshot shows the dataset in table format after applying the StringToNominal filter.
Filter 3: NominalToBinary
The NominalToBinary filter in Weka is an unsupervised attribute filter that transforms nominal
(categorical) attributes into binary (numeric) form. This process, known as one-hot encoding, creates a
separate binary attribute for each possible value of a nominal attribute. For example, if an attribute "Color"
has values like Red, Green, and Blue, the filter converts it into three new binary attributes: "Color=Red",
"Color=Green", and "Color=Blue", with values of 0 or 1 indicating the presence of each category. This
conversion is particularly useful for machine learning algorithms in Weka that require numerical input
rather than categorical data.
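The "Color" example above can be written out as a short Python sketch of one-hot encoding. The function name one_hot and the two sample values are assumptions made for this illustration; it shows the general technique, not WEKA's implementation.

```python
def one_hot(attribute, labels, values):
    """One-hot encode a nominal column.

    Returns the new attribute names and one 0/1 row per input value.
    """
    names = [f"{attribute}={label}" for label in labels]
    rows = [[1 if value == label else 0 for label in labels] for value in values]
    return names, rows


names, encoded = one_hot("Color", ["Red", "Green", "Blue"], ["Green", "Red"])
print(names)
print(encoded)
```

Each encoded row contains exactly one 1, in the position of the category that the original value belonged to.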
Dataset: contact-lenses.arff
This image shows the Weka Explorer – Preprocess tab with the contact-lenses dataset loaded. It
contains 5 nominal attributes and 24 instances. The attribute "age" is selected, showing three
distinct values: young, pre-presbyopic, and presbyopic, each with 8 instances. A bar chart displays the
distribution of the target class contact-lenses across the different age groups.
Step-2: dataset in table format.
The screenshot shows the tabular view of the contact-lenses.arff dataset in Weka. It contains five
attributes: age, spectacle-prescrip, astigmatism, tear-prod-rate, and contact-lenses. All five
attributes are nominal.
This image shows the Weka Explorer interface with the NominalToBinary filter selected. The
filter is configured to convert nominal attributes in indices 1 to 3 into binary numeric attributes.
This image shows that the values of the first three attributes are replaced by t (true) or f (false).
This screenshot shows the dataset in table format after applying the NominalToBinary filter.
Filter 4: Normalize
The Normalize filter in Weka is a data preprocessing tool that scales numeric attribute values to a specified
range, typically [0, 1]. This is useful for improving the performance of machine learning algorithms that are
sensitive to the scale of input data, such as k-NN or neural networks. It ensures that all numeric attributes
contribute equally to the model.
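Scaling to [0, 1] is usually done with min-max normalization, where each value v becomes (v - min) / (max - min). The Python sketch below applies that formula to a single column. The function name is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale numeric values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: no spread to scale (assumed behaviour)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Example: identifiers running from 1 to 25
ids = [1, 13, 25]
scaled = min_max_normalize(ids)
print(scaled)
```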
Dataset: student.arff
This image shows the Preprocess tab of the Weka Explorer with the student dataset loaded. It
contains 25 instances and 23 attributes. The selected attribute is "id", which is numeric with
distinct values ranging from 1 to 25. At the bottom, a class distribution histogram for the attribute
"final_result" (a nominal class) is displayed in red and blue, indicating the frequency of each class
value.
Step-2: dataset in table format.
This image shows the data viewer window in Weka, displaying the student dataset with 23
attributes, including both numeric and nominal types such as id, name, age, gender, gpa, and
participation. Each row represents a student instance with corresponding values.
This image shows the Normalize filter settings in Weka, where numeric attributes are scaled
between 0.0 and 1.0 using the weka.filters.unsupervised.attribute.Normalize filter.
Step-4: Apply the Normalize filter to all attributes.
This screenshot shows the Weka Explorer after applying the Normalize filter, where all numeric
attributes have been scaled to the [0, 1] range.
This screenshot shows the dataset in table format after applying the Normalize filter, where all
numeric attributes have been scaled to the [0, 1] range.