Introduction to Data Mining

Data Mining

Dr. Lov Kumar


Department of Computer Engineering,
National Institute of Technology, Kurukshetra
Thanesar, Haryana

22-07-2025
Overview

1 Introduction

2 Machine Learning

3 Application

4 Data Mining

5 Example Datasets

6 Shapley Value

7 Decision Tree based Methods

8 Artificial neural network

Dr. Lov Kumar Machine Learning 22-07-2025 2 / 74


Machine Learning

Data science salaries grew by 25% in the past five years while software
engineering salaries grew only 6%



Machine Learning

With AI tools like ChatGPT influencing every industry, AI/ML engineers are
indispensable.



Machine Learning
Automating repetitive tasks

Improving efficiency

Enhancing decision-making

Reducing errors



Models



Machine Learning

Humans can learn from past experience and make decisions on their own.



Machine Learning

Figure: What is this object?



Let us ask the same question to him

Figure: What is this object?



Let us ask the same question to him

But he is a human being. He can observe and learn.

Let us make him learn.



Let us ask the same question now



What about a Machine?

Figure: Machines follow instructions

It cannot make decisions on its own.




What about a Machine?
We can ask a machine
To perform arithmetic operations such as
Addition
Multiplication
Division

We can ask a machine
To perform other operations such as
Comparison
Print
Plotting a chart

What is Machine Learning?


We want a machine to act like a human

What about a Machine?

What do we do?
Just as we did with the human,
we need to provide experience to the machine.



What is Machine Learning?

This experience is what we call data, or the training dataset.

So, we first need to provide a training dataset to the
machine.



Machine Learning Paradigms

Supervised Learning

Unsupervised Learning

Reinforcement learning



Evolution of Machine Learning

Rule-Based Systems
AI systems followed strict rules written by humans to produce results.

Machine Learning (1990s-2000s)


AI started using machine learning, which allowed it to learn from data
instead of just following rules.

Deep Learning (2010s)


Deep learning improved AI significantly by using multi-layer neural networks,
which are loosely inspired by how the human brain works.



Evolution of Machine Learning
Generative Adversarial Networks (2014)
GANs, introduced in 2014, use two neural networks that compete with each other: one
generates new content, and the other checks if it looks real. This made
generative AI much better at creating realistic images, videos, and sounds.

Large Language Models (LLMs) and Beyond (2020s)


Models like GPT-3 and GPT-4 can understand and generate human-like
text. They are trained on massive amounts of data from books, websites,
and other sources. AI can now hold conversations, write essays, generate
code, and much more.

Multimodal Generative AI
New AI models can handle multiple types of data at once—text, images,
audio, and video. This allows AI to create content that combines different
formats.



Application

Sanjit Shewale – Digital Lead, ABB Process Industries


Traditionally, cement strength can only be measured after 28 days; by then it
is too late to make corrections in the process.

ABB is leveraging machine learning (ML) with data-driven soft sensors to
predict 28-day strength on the day of sampling.

The company then has the option of making process corrections, such as setting
new daily CaCO3 / Blaine targets.



Application

Iberdrola
Iberdrola uses AI to maximize renewable energy generation.

Iberdrola
Machine learning models predict weather conditions, adjusting energy
production in real-time.

Iberdrola
In recent years, Iberdrola’s AI systems have reduced energy waste by nearly
25% across their renewable sites.



Application

Power Company of Karnataka Limited (PCKL)


With the agricultural demand for electricity rising every day due to hot
weather, the Power Company of Karnataka Limited (PCKL) will use
Artificial Intelligence (AI) and Machine Learning (ML) to study crop
patterns and supply energy accordingly.

Power Company of Karnataka Limited (PCKL)


They are planning to do a cropping pattern survey study to assess
electricity demand of crops and Irrigation Pump (IP) sets.

Power Company of Karnataka Limited (PCKL)


They will also know at what time and for how long each crop will need
water. Based on the findings, they can release energy for specific IP sets.



Application

Microsoft, L V Prasad Eye Institute and Global Experts Collaborate to
Launch Microsoft Intelligent Network for Eyecare



Application

A.J. Hospital and Research Centre has started using the new Dozee
artificial intelligence-based Continuous Remote Patient Monitoring and
Early Warning System for 50 beds in the hospital's private ward.



Application

Last month, 7,000 patients across Fortis, HN Reliance Foundation,
Sahyadri, among others, were discharged without the usual long wait for
final bills, a quiet but significant shift in a healthcare system where
discharges typically stretch for hours.



Image Recognition

Image recognition, which is an approach for cataloging and detecting a
feature or an object in a digital image, is one of the most significant and
notable machine learning and AI techniques.



Sentiment Analysis

Sentiment analysis is one of the most widely used applications of machine
learning. It determines the emotion or opinion of a speaker or writer,
often in real time.



Machine Learning Applications

Image Recognition

Speech Recognition

Recommender Systems

Fraud Detection

Self Driving Cars

Medical Diagnosis

Stock Market Trading



Introduction

Machine learning is an application of artificial intelligence (AI) that
gives systems the ability to automatically learn and improve from
experience.
Machine learning focuses on the development of computer programs that
can access data and use it to learn for themselves.



Data vs. Information

Data and information are interrelated

Data usually refers to raw data, or unprocessed data.

Once the data is analyzed, it is considered information.

Information is "knowledge communicated or received concerning a
particular fact or circumstance."



Data vs. Information

Data is used as input to the computer system. Information is the output
obtained by processing data.

Data is unprocessed facts and figures. Information is processed data.

Data doesn't depend on information. Information depends on data.

Data by itself doesn't carry meaning. Information must carry a logical meaning.



What Kind of Data can be mined?
Flat files
Flat files are simple data files in text or binary format with a structure
known by the data mining algorithm to be applied.

Relational Databases
Tables have columns and rows, where columns represent attributes and
rows represent tuples.

Data Warehouses
A data warehouse is a storehouse: a repository of data collected from
multiple data sources (often heterogeneous).

Transaction Databases
A transaction database is a set of records representing transactions, each
with a time stamp, an identifier and a set of items.



What Kind of Data can be mined?

Multimedia Databases
Multimedia databases include video, images, audio and text media.

Time-Series Databases
Time-series databases contain time-related data such as stock market data or
logged activities.



Data Mining Tasks

Descriptive Analytics
uses data aggregation and data mining to provide insight into the
past and answer: "What has happened?"

Predictive Analytics
uses statistical models and forecasting techniques to understand the
future and answer: "What could happen?"

Prescriptive Analytics
uses optimization and simulation algorithms to advise on possible
outcomes and answer: "What should we do?"



Data Mining Tasks

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Regression [Predictive]



Classification

Classification:
Classification maps data into predefined labels.

Example:
There are lots of emails in my mailbox. Can you tell me which are spam?
Classification is often based on some patterns or characteristics
We can use the frequency of words
The assumption is that some words appear more or less frequently in
spam
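
The word-frequency idea above can be sketched as a tiny naive Bayes classifier. This is only an illustration: the four training messages, the names `TRAIN`, `train` and `classify`, and the choice of naive Bayes with Laplace smoothing are assumptions, not something given on the slides.

```python
import math
from collections import Counter

# Hypothetical toy training set (not from the slides): (message, label) pairs.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(messages):
    """Count word frequencies per class and the number of messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the label with the higher naive Bayes log-score."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_msgs = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_msgs)
        for word in text.lower().split():
            # Laplace smoothing: an unseen word does not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, class_counts = train(TRAIN)
print(classify("win free money", word_counts, class_counts))
print(classify("agenda for the team lunch", word_counts, class_counts))
```

Words such as "free" and "win" occur only in the spam examples, so a message containing them scores higher for the spam label; a real filter would need far more training data.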
Regression

Regression:
Regression maps data to a real-valued variable.

Example:
What is the cost of my house?
We have data about the cost of houses based on features such as
location, plot area, number of rooms, whether a garden is available,
and the age of the house
Current economic conditions can also matter
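
As a minimal sketch of regression, the snippet below fits a straight line to house prices using a single feature (plot area) by ordinary least squares. The numbers in `AREAS` and `PRICES` are made up for illustration, and using only one feature is a simplifying assumption; the slide lists several.

```python
# Hypothetical data (not from the slides): plot area and price in arbitrary units.
AREAS = [50.0, 80.0, 100.0, 120.0]
PRICES = [30.0, 45.0, 55.0, 65.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(AREAS, PRICES)


def predict(area):
    """Predict the real-valued price for a given plot area."""
    return slope * area + intercept


print(slope, intercept, predict(90.0))
```

The output of the model is a real number rather than a label, which is what distinguishes regression from classification.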
Clustering

Clustering:
Clustering is similar to classification except the groups are not pre-defined.

Example:
How many kinds of files are there in my directory?
Unsupervised learning setting
We can use the file name
We can use the words each file contains
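
One common clustering algorithm is k-means, sketched below on two-dimensional points. The slides do not name an algorithm, so k-means is an assumption here, and the points in `POINTS` are hypothetical numeric features (for example, counts of two words in each file).

```python
# Hypothetical 2-D feature vectors, one per file (not from the slides).
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (labels, centroids)."""
    # Simple deterministic start: use the first k points as centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


labels, centroids = k_means(POINTS, k=2)
print(labels)
```

No labels are supplied: the groups emerge from the data alone, which is what makes this an unsupervised setting.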
Association Rule Discovery

Association Rule:
Produces dependency rules that predict the occurrence of an item based
on the occurrences of other items.

Example:
Rules Discovered
Milk –> Coke
Diaper, Milk –> Beer
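
Rules like these are usually scored by support and confidence. The sketch below computes both for the two rules above; the five transactions in `TRANSACTIONS` are a hypothetical market-basket example, not data from the slides.

```python
# Hypothetical market-basket transactions (not from the slides).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


print(confidence({"Milk"}, {"Coke"}))
print(confidence({"Diaper", "Milk"}, {"Beer"}))
```

A rule is normally kept only if both its support and its confidence exceed thresholds chosen by the analyst.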
Cement Strength
Concrete Compressive Strength
Cement (kg/m³): The amount of cement in the concrete mixture.
Blast Furnace Slag (kg/m³): The amount of blast furnace slag in the
concrete mixture.
Fly Ash (kg/m³): The amount of fly ash in the concrete mixture.
Water (kg/m³): The amount of water in the concrete mixture.
Superplasticizer (kg/m³): The amount of superplasticizer in the
concrete mixture.
Coarse Aggregate (kg/m³): The amount of coarse aggregate in the
concrete mixture.
Fine Aggregate (kg/m³): The amount of fine aggregate in the
concrete mixture.
Age (days): The curing age of the concrete (in days).
Compressive Strength (MPa): The target variable representing the
concrete's compressive strength.
Gas-Turbine Carbon monoxide (CO) and Nitrogen oxides (NOx)

Gas-Turbine
A power plant engine (gas turbine) is mainly used to generate electricity.

The engine can use different types of fuel and can have different
levels of CO and NOx emissions.
The dataset contains 36733 instances of 11 sensor measures
aggregated over one hour, from a gas turbine located in Turkey.
Industrial Boiler Operations

Boiler Operations
This dataset contains high-frequency time-series data collected from a
coal-fired industrial boiler operating in a chemical plant in Zhejiang, China.

The boiler is equipped with multiple sensors capturing parameters
such as pressure, temperature, flow rate, and oxygen levels.



Anomaly Detection in Oil and Gas Chemical Plants

Anomaly Detection
This dataset contains sensor data collected from an oil and gas chemical
plant, designed for anomaly detection.

It includes time-series data, with key operational parameters such as
temperature, pressure, flow rate, vibration levels, valve position, motor
speed, and chemical concentration.



Vehicle CO2 Emissions

Vehicle CO2 Emissions Dataset


Vehicle Type: Classification of vehicles based on size and usage (e.g.,
SUV, Sedan).
Engine Size (L): Engine displacement volume in liters.
Cylinders: Number of cylinders in the engine.
Transmission: Type of transmission (e.g., Automatic, Manual).
Fuel Type: Type of fuel used by the vehicle (e.g., Gasoline, Diesel,
Hybrid).
Fuel Consumption (City, Hwy, and Combined): Fuel efficiency
measured in liters per 100 kilometers (L/100 km).
CO2 Emissions (g/km): Carbon dioxide emissions per kilometer
(target variable for prediction).



Power Plant Data

The data points were collected from a Combined Cycle Power Plant over 6
years (2006-2011) when the power plant was set to work with a full load.

Power Plant Data


Hourly average ambient variables:
Ambient Temperature (AT)
Ambient Pressure (AP)
Relative Humidity (RH)
Exhaust Vacuum (V)
Hourly electrical energy output (PE)



Machine Predictive Maintenance
Machine Predictive Maintenance
productID: consisting of a letter L, M, or H for low (50% of all products),
medium (30%), and high (20%) as product quality variants and a
variant-specific serial number
air temperature [K]: generated using a random walk process, later normalized
to a standard deviation of 2 K around 300 K
process temperature [K]: generated using a random walk process normalized
to a standard deviation of 1 K, added to the air temperature plus 10 K
rotational speed [rpm]: calculated from a power of 2860 W, overlaid with
normally distributed noise
torque [Nm]: torque values are normally distributed around 40 Nm with σ
= 10 Nm and no negative values
tool wear [min]: the quality variants H/M/L add 5/3/2 minutes of tool
wear to the used tool in the process
machine failure: label that indicates whether the machine has failed at this
particular data point, that is, whether any of its failure modes is true



Water Potability
ppm: parts per million
µg/L: microgram per litre
mg/L: milligram per litre

Water Potability
ph: pH of water (0 to 14).
Hardness: Capacity of water to precipitate soap in mg/L.
Solids: Total dissolved solids in ppm.
Chloramines: Amount of Chloramines in ppm.
Sulfate: Amount of Sulfates dissolved in mg/L.
Conductivity: Electrical conductivity of water in µS/cm.
Organic_carbon: Amount of organic carbon in ppm.
Trihalomethanes: Amount of Trihalomethanes in µg/L.
Turbidity: Measure of the light-scattering (cloudiness) property of water in NTU.
Potability: Indicates if water is safe for human consumption. Potable: 1,
not potable: 0.


Shapley Value

Shapley values are a widely used approach from cooperative game theory
that comes with desirable properties.

A "game" is any situation in which there are several decision-makers, and
each of them wants to optimize their results.

The Shapley value offers a way to fairly divide the overall profits or expenses among a team
of individuals working together towards a shared objective.



Shapley Value

If a team of three employees (Allan, Bob, Cindy) together makes $160 profit,
how can the profit be fairly allocated to each individual employee?

Team                      Profit
Allan                     $40
Bob                       $50
Cindy                     $60
Allan and Bob             $95
Allan and Cindy           $110
Bob and Cindy             $120
Allan, Bob and Cindy      $160



Marginal Contribution

The marginal contribution of X is profit(other members + X) - profit(other members).

Combination: Bob has different marginal contributions in different
employee combinations (joining Allan: $95 - $40 = $55; joining Cindy:
$120 - $60 = $60).

Combination order: the marginal contribution also depends on the order in which
members join. Bob's marginal contribution differs between Cindy & Bob
(Bob joins second: $60) and Bob & Cindy (Bob joins first: $50).
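
For reference, averaging the marginal contribution over all join orders gives the standard Shapley value formula. The symbols are assumptions not used on the slides: N is the set of all n players, v(S) is the profit of coalition S, and φ_i is the Shapley value of player i.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,(n - |S| - 1)!}{n!}\,
    \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

The weight is the fraction of join orders in which exactly the members of S arrive before player i.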



Marginal Contribution

Additive
The profit allocation based on Shapley values is Allan $42.5, Bob $52.5
and Cindy $65. Note that the three employees' Shapley values sum to the total profit:
42.5 + 52.5 + 65 = 160
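
These figures can be checked with a short script that averages each employee's marginal contribution over all six join orders. The profits come from the table in the example; the zero profit for an empty team and the names `PROFIT` and `shapley_values` are assumptions made for this sketch.

```python
from itertools import permutations

# Team profits from the example (in $). The empty team earning 0 is an assumption.
PROFIT = {
    frozenset(): 0,
    frozenset({"Allan"}): 40,
    frozenset({"Bob"}): 50,
    frozenset({"Cindy"}): 60,
    frozenset({"Allan", "Bob"}): 95,
    frozenset({"Allan", "Cindy"}): 110,
    frozenset({"Bob", "Cindy"}): 120,
    frozenset({"Allan", "Bob", "Cindy"}): 160,
}


def shapley_values(players, profit):
    """Average each player's marginal contribution over every join order."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        joined = frozenset()
        for player in order:
            with_player = joined | {player}
            # Marginal contribution: profit with the player minus profit without.
            totals[player] += profit[with_player] - profit[joined]
            joined = with_player
    return {p: totals[p] / len(orders) for p in players}


values = shapley_values(["Allan", "Bob", "Cindy"], PROFIT)
print(values)
print(sum(values.values()))
```

Enumerating every order is fine for three players, but the number of orders grows factorially, so practical tools approximate the average by sampling.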
Shapley Values for Machine Learning

We have trained a machine learning model to predict house prices in Delhi.


For a certain house, our model predicts INR 51,00,000 and we need to
explain this prediction. The house has an area of 50 square yards, a private
pool and a garage.

The average prediction for all houses is INR 50,00,000. How much has
each feature value contributed to the prediction compared to the average
prediction?

The feature values has_pool, has_garage and area-50 worked together to
achieve the prediction of INR 51,00,000. Our goal is to explain the difference
between the actual prediction (INR 51,00,000) and the average prediction
(INR 50,00,000): a difference of INR 1,00,000.



Decision Tree based Methods



Artificial neural network

Figure: Artificial neural network with an input layer (I1, I2, I3), a hidden layer (h1, h2, h3) and an output layer (O1)



Activation Function
Linear Activation Function
The output ranges over (-∞, +∞)
The linear activation function is typically used in just one place, i.e. the output layer.

Sigmoid Function
It is mathematically defined as $S(x) = \frac{1}{1 + e^{-x}}$
The output ranges between 0 and 1, hence useful for binary
classification.

Tanh Function
It is mathematically defined as
$S(x) = \frac{2}{1 + e^{-2x}} - 1 = 2 \cdot \mathrm{sigmoid}(2x) - 1$
The output ranges between -1 and 1, hence it is commonly used in hidden
layers due to its zero-centered output.
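
The three functions above can be written in a few lines of Python. This is a minimal sketch using only the standard library; the function names are chosen here for illustration. The `tanh` function is deliberately written through the sigmoid to mirror the identity in the definition.

```python
import math


def linear(x):
    """Linear (identity) activation: the output is unbounded."""
    return x


def sigmoid(x):
    """Sigmoid activation: squashes x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Tanh activation, computed as 2 * sigmoid(2x) - 1; the output lies in (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), sigmoid(x), tanh(x), math.tanh(x))
```

The last two printed columns agree up to floating-point rounding, which confirms the identity numerically.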



Activation Function

ReLU (Rectified Linear Unit) Function


ReLU activation is defined by: A(x) = max(0, x)
Value range: [0, +∞)
It is a non-linear activation function, allowing neural networks to learn
complex patterns and making backpropagation more efficient.

Softmax Function
Softmax function is designed to handle multi-class classification
problems.
It works by squashing the output values of each class into the range
of 0 to 1, while ensuring that the sum of all probabilities equals 1.
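
A minimal sketch of both functions follows. The example scores are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability choice rather than something stated on the slide.

```python
import math


def relu(x):
    """ReLU activation: passes positive values through and clips negatives to 0."""
    return max(0.0, x)


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(relu(-2.0), relu(3.0))
```

The class with the largest score receives the largest probability, so the predicted class is the index of the maximum entry in `probs`.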



Artificial neural network

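Figure: Artificial neural network with functional expansion. Inputs X1 and X2 are expanded into x1, cos(πx1), sin(πx1), x2, cos(πx2), sin(πx2) and x1 · x2, together with a bias input x0 (+1). The expanded terms are multiplied by weights w0, w1, w2, ..., summed, and passed through the function ρ to give the output ŷ. The error between y and ŷ drives an adaptive algorithm that updates the weights.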
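
The forward pass of a network with functional expansion can be sketched as follows. The order of the expanded terms, the use of tanh for the function ρ, and the names `expand` and `predict` are assumptions; the figure does not specify them.

```python
import math


def expand(x1, x2):
    """Functional expansion of two inputs into eight terms (bias first)."""
    return [
        1.0,                    # bias input x0
        x1,
        math.cos(math.pi * x1),
        math.sin(math.pi * x1),
        x2,
        math.cos(math.pi * x2),
        math.sin(math.pi * x2),
        x1 * x2,
    ]


def predict(weights, x1, x2):
    """Weighted sum of the expanded terms, passed through tanh (assumed rho)."""
    terms = expand(x1, x2)
    weighted_sum = sum(w * t for w, t in zip(weights, terms))
    return math.tanh(weighted_sum)


# Hypothetical weights, one per expanded term.
weights = [0.1, 0.5, -0.2, 0.3, 0.4, 0.0, -0.1, 0.2]
print(predict(weights, 0.5, 0.25))
```

Because the non-linearity comes from the expansion itself, such a network needs no hidden layer; only the single weight vector is trained.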
Gradient Descent method

The Gradient Descent (GD) method is used for updating the weights
to minimize the output error.
The GD method uses the first-order derivative of the total error function to
find a minimum in error space. It is represented using the following
equation:

$$G = \frac{\partial}{\partial W} E_k = \frac{\partial}{\partial W} \left[ \frac{1}{2} (O'_k - O_k)^2 \right] \qquad (1)$$

In each iteration, the weight vector W is updated using the gradient vector G
and a learning rate α:

$$W_{k+1} = W_k - \alpha G_k = W_k - \alpha \frac{\partial}{\partial W} E_k \qquad (2)$$
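
A minimal sketch of this update for a single linear neuron with one weight is shown below. It assumes the neuron's output is w * x, that one of the two outputs in the error term is the network output and the other the target, and it uses a hypothetical training pair (x = 2, target = 6), so the ideal weight is 3.

```python
def gradient_descent(x, target, w=0.0, alpha=0.1, iterations=50):
    """Minimise E = 0.5 * (output - target)**2 for output = w * x."""
    for _ in range(iterations):
        output = w * x
        # dE/dw = (output - target) * x, by the chain rule.
        gradient = (output - target) * x
        # Step against the gradient, scaled by the learning rate alpha.
        w = w - alpha * gradient
    return w


w = gradient_descent(x=2.0, target=6.0)
print(w)
```

With these values each step shrinks the distance to the ideal weight by a constant factor; a learning rate that is too large would make the weight overshoot and diverge instead.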



Any Question Please?


Thank You!