Weka - Waikato Environment for Knowledge Analysis (data mining/ml tool)
- collection of ml algorithms
- open source
- provides tools for data preprocessing, algorithms, visualization
- developed at the University of Waikato, nz (the java version dates from 1997), originally for academic use
how to download?
- go to official website([Link])
- select compatible ver.
- install java (8 or above for recent ver.)
- run the installer
- launch weka
- doubts? check out the weka manual
process
raw data --> preprocessor --> algorithm usage --> output(visualization)
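
A minimal sketch of this flow in plain Python (not the Weka API). The function names and the tiny weather-style dataset are made up for illustration; '?' is used as the missing-value marker, and the "algorithm" is just a majority-class baseline, similar in spirit to Weka's ZeroR.

```python
from collections import Counter

def preprocess(rows):
    """Preprocessing step: drop rows that contain a missing value ('?')."""
    return [r for r in rows if "?" not in r]

def train_majority(rows):
    """Algorithm step: remember the most frequent class label (last column)."""
    labels = Counter(r[-1] for r in rows)
    return labels.most_common(1)[0][0]

def summarize(rows, prediction):
    """Output step: a text summary standing in for a visualization."""
    return f"{len(rows)} clean rows, predicted class: {prediction}"

# Hypothetical raw data: outlook, temperature, play (class)
raw = [
    ["sunny", "hot", "no"],
    ["rainy", "?", "yes"],
    ["overcast", "mild", "yes"],
    ["rainy", "cool", "yes"],
]

clean = preprocess(raw)
model = train_majority(clean)
report = summarize(clean, model)
print(report)
```

Each stage only depends on the output of the previous one, which is the same idea as moving from tab to tab in the gui.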
features
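-graphical user interface (gui)
~ explorer (for exploring and preprocessing data and running algorithms interactively)
~ experimenter (for designing and running experiments that compare algorithms)
~ knowledge flow (for building data-processing pipelines visually by connecting components)
~ workbench (combines the other interfaces in a single window)
~ simple CLI (a command-line interface for running weka commands directly)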
-data preprocessing tools
-various ml algorithms
-visualization
-flexible integration via APIs
datasets in weka
- each entry is an instance of java class-> [Link]
- each instance consists of attributes, which can be of type
~ nominal, numeric, string, date, relational
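
To make the attribute types concrete, here is a rough sketch that reads the header of an ARFF file (Weka's native text format) and reports the type of each attribute. The parser is a simplified illustration written for these notes, not Weka's own loader: it assumes attribute names without spaces and skips relational attributes. The "customers" relation and its attributes are hypothetical.

```python
def parse_arff_header(text):
    """Return (relation_name, [(attribute_name, attribute_type), ...])."""
    relation, attributes = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = "nominal"  # a fixed set of allowed values
            else:
                kind = spec.split()[0].lower()
                if kind in ("real", "integer"):
                    kind = "numeric"  # both are treated as numeric
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            break  # header ends where the data section starts
    return relation, attributes

# Hypothetical ARFF content
SAMPLE = """\
% example relation made up for these notes
@relation customers
@attribute age numeric
@attribute segment {gold, silver, bronze}
@attribute comment string
@attribute joined date "yyyy-MM-dd"
@data
34, gold, 'frequent buyer', "2020-01-15"
"""

relation, attributes = parse_arff_header(SAMPLE)
for name, kind in attributes:
    print(f"{name}: {kind}")
```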
loading data in weka
- can be loaded from ~ local system file, web and database
steps
[Link] -> preprocessor tab ->
[Link] load data into weka
for local file- open file -> select the file, or use one of the sample datasets shipped with weka
for web- open URL -> enter the URL of your data (explorer will load it)
for database- open DB -> set the connection string to your db, set the query for data
selection, run the query and load the result
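
The file and database cases can be sketched outside Weka with Python's standard library. This is only an analogy to the gui steps, assuming CSV data and an SQLite database; the table name, column names and rows are made up, and an in-memory file and database are used so the sketch runs anywhere.

```python
import csv
import io
import sqlite3

def load_csv(file_obj):
    """Read a CSV with a header row into a list of dicts."""
    return list(csv.DictReader(file_obj))

def load_from_db(connection, query):
    """Run a selection query and return the rows as a list of dicts."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]

# Local-file case, simulated with an in-memory file object.
csv_text = "outlook,play\nsunny,no\nrainy,yes\n"
file_rows = load_csv(io.StringIO(csv_text))

# Database case: connect, set the query, run it, load the result.
conn = sqlite3.connect(":memory:")  # stands in for a real connection string
conn.execute("CREATE TABLE weather (outlook TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("sunny", "no"), ("rainy", "yes")],
)
db_rows = load_from_db(conn, "SELECT outlook, play FROM weather WHERE play = 'yes'")
conn.close()

print(file_rows)
print(db_rows)
```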
Explorer(main gui in weka)
- offers various tabs for preprocessing and algorithms without coding
- provides visualization tools, user friendly interface
- after clicking on explorer you will see the following tabs
[Link] tab - allows selecting and filtering data to prepare it for ml,
essential for data preprocessing
[Link] tab - provides classification and regression algorithms, used for
supervised learning
[Link] tab - provides clustering algorithms, used for unsupervised learning
[Link] tab - provides association rule algorithms, used to discover relationships
between attributes in a dataset
[Link] attributes tab - facilitates feature selection, helps improve model
performance
[Link] tab - used for data visualization (plots of the data), useful for gaining insights
through graphical analysis.
Tasks in data preprocessing in weka
- data cleaning (handles noise and missing values)
- data integration (combines data from multiple sources)
- data transformation (converts data into a suitable form)
- data reduction (reduces data volume and redundancy)
- data discretization (converts numeric attributes into intervals using binning, which helps identify patterns)
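
Two of these tasks, cleaning and discretization, can be sketched in a few lines of plain Python. This is an illustration of the ideas (mean imputation and equal-width binning), not Weka's filters; the function names and the sample ages are made up.

```python
def impute_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Discretization: map each value to a bin index in 0..n_bins-1,
    using intervals of equal width between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: everything falls in bin 0
        return [0 for _ in values]
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]  # hypothetical attribute with one missing value
filled = impute_mean(ages)
bins = equal_width_bins(filled, 3)
print(filled, bins)
```

In Weka the comparable steps are applied as filters from the preprocessing stage rather than written by hand.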
clustering in weka
- used for finding and grouping similar data
- unsupervised learning
- algorithms:
~ simplekmeans (centroid-based), used e.g. for segmenting customer data
~ hierarchical (builds a tree of clusters), used e.g. for biological data analysis
~ expectation-maximization or em (estimates the probability that an instance belongs to each cluster
using gaussian distributions), used e.g. on medical data to identify disease patterns
~ dbscan (groups points based on density), used for spatial data such as identifying
earthquake epicenters
~ farthestfirst (picks each new cluster center as the point farthest from the centers already chosen), used for initializing diverse
cluster centers
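
A small sketch of two of these ideas together: farthest-first selection of initial centers, followed by k-means iterations. It is plain Python for illustration, not Weka's implementation; in particular, the first center here is simply the first point so the result is deterministic, and the 2-d points are made up. It needs Python 3.8+ for math.dist.

```python
import math

def farthest_first(points, k):
    """Pick k centers: start with the first point, then repeatedly take the
    point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        next_center = max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)
        )
        centers.append(next_center)
    return centers

def k_means(points, k, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points; repeat a fixed number of times."""
    centroids = farthest_first(points, k)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Hypothetical data: two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = k_means(points, 2)
print(centroids, assignment)
```

The first three points end up in one cluster and the last three in the other, with each centroid at the mean of its group.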
applications of clustering in real life(use cases)
[Link] segmentation (groups customers based on purchase behaviour, enhances
personalization)
[Link] detection (identifies unusual data points that deviate from the rest, enhances security, used in
fraud detection)
[Link] clustering (organizes similar documents based on their contents)
[Link] segmentation (divides an image into distinct regions for analysis, enhances
object detection and recognition in computer vision, used in the medical field for identifying
tumors)
[Link] data analysis (clusters geographic data points based on location, optimizes
resource allocation)
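
As one example of these use cases, the idea of flagging data points that deviate from the rest can be sketched with a single centroid: points much farther from the centroid than average are flagged. This is a crude heuristic chosen for illustration, with a made-up threshold factor and made-up (amount, item count) transactions; it is not how any particular fraud system works. It needs Python 3.8+ for math.dist.

```python
import math

def centroid(points):
    """Mean of the points, coordinate by coordinate."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def flag_anomalies(points, threshold_factor=2.0):
    """Flag points whose distance to the centroid exceeds
    threshold_factor times the mean distance."""
    center = centroid(points)
    distances = [math.dist(p, center) for p in points]
    mean_distance = sum(distances) / len(distances)
    return [
        p for p, d in zip(points, distances)
        if d > threshold_factor * mean_distance
    ]

# Hypothetical transactions: (amount, number of items)
transactions = [(10.0, 1.0), (12.0, 1.0), (11.0, 2.0), (9.0, 1.0), (95.0, 14.0)]
anomalies = flag_anomalies(transactions)
print(anomalies)
```

Only the last transaction is flagged, since it lies far from the group formed by the other four.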
ml case studies
- customer churn prediction (analyzes customer data to predict potential churn)
- credit scoring and risk assessment (uses historical financial data to assess the
creditworthiness and risk of loan applicants)
- medical diagnosis (classifies patient data to assist in diagnosing diseases)
- e-commerce product recommendation (analyzes user behaviour and purchase
history to provide personalization)
- sentiment analysis on social media (evaluates user-generated content to get an idea
of public sentiment about products and brands)