Machine Learning

Uploaded by

known4nope

Data Types

• To analyze data, it is important to know what type of data we are dealing with.
• We can split the data types into three main categories:
• Numerical
• Categorical
• Ordinal
• Numerical data are numbers, and can be split into two numerical categories:
• Discrete Data
- counted data that are limited to integers. Example: The number of cars passing by.
• Continuous Data
- measured data that can be any number. Example: The price of an item, or the size of an item
• Categorical data are values that cannot be measured up against each other. Example: a color value, or any yes/no values.
• Ordinal data are like categorical data, but can be measured up against each other. Example: school grades where A is better than B and so on.
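• A minimal sketch of how the data types above might look in Python. All sample values and the variable names (cars_passed, item_prices, colors, grades, grade_rank) are hypothetical and chosen only for illustration:

```python
# Hypothetical sample values, one list per data type.
cars_passed = [3, 7, 0, 12]           # numerical, discrete: counted integers
item_prices = [9.99, 24.5, 103.75]    # numerical, continuous: measured values
colors = ["red", "blue", "red"]       # categorical: no meaningful order
grades = ["B", "A", "C", "A"]         # ordinal: categories with an order

# Ordinal data can be ranked once the order is made explicit.
grade_rank = {"A": 0, "B": 1, "C": 2}  # assumed order: A is best
sorted_grades = sorted(grades, key=grade_rank.get)

print(sorted_grades)  # best grade first
```

• Sorting the colors list the same way would only give alphabetical order, which says nothing about the colors themselves. That is the practical difference between categorical and ordinal data.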
Mean, Median, and Mode
• Mean - The average value
• Median - The mid point value
• Mode - The most common value
• Mean
• The mean value is the average value.
• To calculate the mean, find the sum of all values, and divide the sum by the number of values:
• (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77

• import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)
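• As a cross-check, the same mean can be computed without NumPy by following the formula directly. This is only a sketch; the name manual_mean is an assumption made for this example:

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Sum of all values divided by the number of values.
manual_mean = sum(speed) / len(speed)

print(round(manual_mean, 2))
```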

• Median
• The median value is the value in the middle, after you have sorted all the values:

• import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)
• If there is an even number of values, the median is the mean of the two middle values:

• import numpy

speed = [99,86,87,88,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)
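• The two median examples above can be reproduced by hand. The sketch below sorts the values and picks the middle one, or averages the two middle ones when the count is even. The helper name manual_median is hypothetical and not part of NumPy:

```python
def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
even_speed = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]

print(manual_median(odd_speed))   # 13 values
print(manual_median(even_speed))  # 12 values
```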

• Mode
• The mode is the value that appears most often:

• from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)
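• The printed form of the SciPy result can differ between SciPy versions. As an alternative sketch that uses only the standard library (a swapped-in technique, not the method shown above), collections.Counter counts how often each value occurs:

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Count occurrences of each value, then take the most frequent one.
counts = Counter(speed)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```

• If several values are tied for most common, most_common(1) returns only one of them, so a tie needs separate handling.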
What is Standard Deviation?
• Standard deviation is a number that describes how spread out the values are.
• A low standard deviation means that most of the numbers are close to the mean (average) value.
• A high standard deviation means that the values are spread out over a wider range.
• Example: This time we have registered the speed of 7 cars:
• speed = [86,87,88,86,87,85,86], The standard deviation is: 0.9
• Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.

• Let us do the same with a selection of numbers with a wider range:

• speed = [32,111,138,28,59,77,97], The standard deviation is: 37.85

• Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4.

• As you can see, a higher standard deviation indicates that the values are spread out over a wider range.

• The NumPy module has a method to calculate the standard deviation:

• import numpy

speed = [86,87,88,86,87,85,86]

x = numpy.std(speed)

print(x)

• import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)
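• One detail worth knowing: by default numpy.std() divides by the number of values (the population standard deviation). Passing ddof=1 divides by one less than the number of values, which gives the sample standard deviation. A short sketch of the difference, with variable names chosen for this example:

```python
import numpy

speed = [86, 87, 88, 86, 87, 85, 86]

population_std = numpy.std(speed)        # default ddof=0: divide by n
sample_std = numpy.std(speed, ddof=1)    # ddof=1: divide by n - 1

print(population_std)
print(sample_std)
```

• The values in these slides (0.9 and 37.85) correspond to the default, population form.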
Variance
• Variance is another number that indicates how spread out the values are.

• In fact, if you take the square root of the variance, you get the standard deviation!

• Or the other way around, if you multiply the standard deviation by itself, you get the variance!

• To calculate the variance you have to do as follows:

• 1. Find the mean: (32+111+138+28+59+77+97) / 7 = 77.4

• 2. For each value: find the difference from the mean:

• 32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6

• 3. For each difference: find the square value:

• (-45.4)² = 2061.16

• (33.6)² = 1128.96

• (60.6)² = 3672.36

• (-49.4)² = 2440.36

• (-18.4)² = 338.56

• (-0.4)² = 0.16

• (19.6)² = 384.16

• 4. The variance is the average of these squared differences:

• (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.25

• import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.var(speed)

print(x)
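• The four steps above can be sketched in plain Python. The intermediate names (differences, squared) are assumptions made for this example. Because the mean is not rounded here, the result differs very slightly from the hand calculation:

```python
speed = [32, 111, 138, 28, 59, 77, 97]

# 1. Find the mean.
mean = sum(speed) / len(speed)

# 2. For each value, find the difference from the mean.
differences = [value - mean for value in speed]

# 3. Square each difference.
squared = [d ** 2 for d in differences]

# 4. The variance is the average of the squared differences.
variance = sum(squared) / len(squared)

# The standard deviation is the square root of the variance.
std_dev = variance ** 0.5

print(variance)
print(std_dev)
```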

• Standard Deviation

• The standard deviation is the square root of the variance: √1432.25 = 37.85

• import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)
What are Percentiles?
• A percentile is the value below which a given percentage of the values in a data set fall.

• Example: Let's say we have an array that contains the ages of every person living on a street.
• ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

• What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.

• The NumPy module has a method for finding the specified percentile:

• import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 75)

print(x)

• Example

• What is the age that 90% of the people are younger than?

• import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 90)

print(x)
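• To see what the 75th percentile means for this list, the sketch below counts what share of the ages are 43 or younger. With only 21 values the share cannot be exactly 75%, so it comes out slightly higher. The names threshold and share are chosen for this example:

```python
ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

threshold = 43  # the 75th percentile reported above

# Share of people whose age is at or below the threshold.
at_or_below = sum(1 for age in ages if age <= threshold)
share = at_or_below / len(ages)

print(at_or_below, len(ages))
print(round(share * 100, 1))
```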
Data Distribution
• Earlier in this tutorial we have worked with very small amounts of data in our examples, just to understand the different concepts.

• In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at least at an early stage of a project.

• How Can we Get Big Data Sets?

• To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets, of any size.

• Example

• Create an array containing 250 random floats between 0 and 5:

• import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

print(x)
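• Random data changes on every run, which makes results hard to compare. One possible approach is a seeded generator created with numpy.random.default_rng(), sketched below. The seed value 42 is an arbitrary assumption; any integer works:

```python
import numpy

# A seeded generator returns the same "random" values on every run.
rng = numpy.random.default_rng(42)  # 42 is an arbitrary seed

x = rng.uniform(0.0, 5.0, 250)

print(x[:5])        # first five values
print(len(x))       # number of values generated
```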

• Histogram
• To visualize the data set we can draw a histogram with the data we collected.

• We will use the Python module Matplotlib to draw a histogram.


• Example

• Draw a histogram:

• import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5)
plt.show()
Histogram Explained & Big Data Distributions
• We use the array from the example above to draw a histogram with 5 bars.

• The first bar represents how many values in the array are between 0 and 1.

• The second bar represents how many values are between 1 and 2.

• Etc.

• One run gave this result (the exact counts differ from run to run because the values are random):

• 52 values are between 0 and 1

• 48 values are between 1 and 2

• 49 values are between 2 and 3

• 51 values are between 3 and 4

• 50 values are between 4 and 5

• An array containing 250 values is not considered very big, but now you know how to create a random set of values, and by changing the parameters, you can make the data set as big as you want.
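• The counts per bar can also be obtained as numbers with numpy.histogram(). In this sketch the range argument pins the bin edges to 0, 1, 2, 3, 4, 5; without it the edges are taken from the smallest and largest value in the data, so the bars would not line up exactly with whole numbers. The seed is an arbitrary assumption used to make the run repeatable:

```python
import numpy

rng = numpy.random.default_rng(0)  # arbitrary seed for a repeatable run
x = rng.uniform(0.0, 5.0, 250)

# Count the values falling into 5 equal-width bins between 0 and 5.
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{count} values are between {left:.0f} and {right:.0f}")
```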

• Example

• Create an array with 100000 random numbers, and display them using a histogram with 100 bars:

• import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 100000)

plt.hist(x, 100)
plt.show()
Normal Data Distribution
• In the previous chapter we learned how to create a completely random array, of a given size, and between two given values.

• In this chapter we will learn how to create an array where the values are concentrated around a given value.

• In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this data distribution.
• Example

• A typical normal data distribution:

• import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 100000)

plt.hist(x, 100)
plt.show()

• Histogram Explained

• We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars.

• We specify that the mean value is 5.0, and the standard deviation is 1.0.

• This means that the values are concentrated around 5.0: about 68% of them lie within 1.0 of the mean, and about 95% lie within 2.0.

• And as you can see from the histogram, most values are between 4.0 and 6.0, with a top at approximately 5.0.
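• The claim that most values lie between 4.0 and 6.0 can be checked numerically. The sketch below measures the share of values within one and within two standard deviations of the mean. The seed and the names within_one and within_two are assumptions made for this example:

```python
import numpy

rng = numpy.random.default_rng(1)  # arbitrary seed for a repeatable run
x = rng.normal(5.0, 1.0, 100000)   # mean 5.0, standard deviation 1.0

# Fraction of values within 1 and within 2 standard deviations of the mean.
within_one = numpy.mean((x >= 4.0) & (x <= 6.0))
within_two = numpy.mean((x >= 3.0) & (x <= 7.0))

print(within_one)
print(within_two)
```

• For a normal distribution these shares are expected to be close to 0.68 and 0.95.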
Machine Learning - Scatter Plot
• Scatter Plot
• A scatter plot is a diagram where each value in the data set is represented by a dot.
• The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length, one for the values of the x-axis, and one for the values of the y-axis:

• x = [5,7,8,7,2,17,2,9,4,11,12,9,6], y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

• The x array represents the age of each car.

• The y array represents the speed of each car.

• Example

• Use the scatter() method to draw a scatter plot diagram:

• import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

• Scatter Plot Explained

• The x-axis represents ages, and the y-axis represents speeds.

• What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.

• Note: It seems that the newer the car, the faster it drives, but that could be a coincidence. After all, we only registered 13 cars.
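• What was read from the diagram can be confirmed from the data itself. This sketch pairs each speed with its age and sorts by speed. The names cars, fastest_two and slowest are chosen for this example:

```python
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]             # age of each car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # speed of each car

# Pair each speed with its age and sort, fastest first.
cars = sorted(zip(y, x), reverse=True)

fastest_two = cars[:2]   # (speed, age) of the two fastest cars
slowest = cars[-1]       # (speed, age) of the slowest car

print(fastest_two)
print(slowest)
```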
Linear Regression
• Regression
• The term regression is used when you try to find the relationship between variables.
• In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.
• Linear Regression

• Linear regression uses the relationship between the data points to fit a straight line through them.

• This line can be used to predict future values. In Machine Learning, predicting the future is very important.

• How Does it Work?

• Python has methods for finding a relationship between data points and for drawing a line of linear regression. We will show you how to use these methods instead of going through the mathematical formula.

• In the example below, the x-axis represents age, and the y-axis represents speed. We have registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected could be used in a linear regression:
• import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

• import matplotlib.pyplot as plt

from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
• Example Explained

• Import the modules you need.

• import matplotlib.pyplot as plt
from scipy import stats

• Create the arrays that represent the values of the x and y axis:

• x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

• Execute a method that returns some important key values of Linear Regression:

• slope, intercept, r, p, std_err = stats.linregress(x, y)

• Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed:

• def myfunc(x):
    return slope * x + intercept

• Run each value of the x array through the function. This will result in a new array with new values for the y-axis:

• mymodel = list(map(myfunc, x))

• Draw the original scatter plot:

• plt.scatter(x, y)

• Draw the line of linear regression:

• plt.plot(x, mymodel)

• Display the diagram:

• plt.show()
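• For readers who want to see what happens behind stats.linregress(), the slope and intercept can be sketched with the standard least-squares formulas in plain Python. This is an illustration of the technique, not SciPy's actual implementation, and the names sxy and sxx are assumptions:

```python
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of products of deviations, and sum of squared x deviations.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx                      # change in speed per year of age
intercept = mean_y - slope * mean_x    # predicted speed at age 0

print(slope)
print(intercept)
```

• A negative slope here means that the fitted line goes down as the age goes up.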
R for Relationship
• It is important to know how strong the relationship between the values of the x-axis and the values of the y-axis is. If there is no relationship, linear regression cannot be used to predict anything.
• This relationship, the coefficient of correlation, is called r.

• The r value ranges from -1 to 1, where 0 means no relationship, and 1 (and -1) means 100% related.

• Python and the SciPy module will compute this value for you. All you have to do is feed it the x and y values.

• Example

• How well does my data fit in a linear regression?

• from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

print(r)
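• The same coefficient can also be obtained with numpy.corrcoef(), shown here as an alternative sketch to the SciPy call above. Squaring r gives the share of the variation in speed that the straight line accounts for. The name r_squared is chosen for this example:

```python
import numpy

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r for x and y.
r = numpy.corrcoef(x, y)[0, 1]
r_squared = r ** 2

print(r)
print(r_squared)
```

• The value is roughly -0.76. The sign is negative because speed tends to fall as age rises, and the size shows that there is a relationship, but not a perfect one.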

• Predict Future Values

• Now we can use the information we have gathered to predict future values.
• Example: Let us try to predict the speed of a 10-year-old car.
• To do so, we need the same myfunc() function from the example above:

• def myfunc(x):
    return slope * x + intercept

• Example

• Predict the speed of a 10-year-old car:

• from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

speed = myfunc(10)

print(speed)

• The example predicted a speed of 85.6, which we could also read from the diagram.
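• Several ages can be predicted in one step. The sketch below swaps in numpy.polyfit() with degree 1, which fits the same kind of straight line as the SciPy call used above. The list of ages to predict is a hypothetical choice for this example:

```python
import numpy

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Degree 1 fits a straight line; coefficients come back highest power first.
slope, intercept = numpy.polyfit(x, y, 1)

ages = numpy.array([2, 10, 17])          # hypothetical ages to predict
predicted = slope * ages + intercept     # one prediction per age

for age, speed in zip(ages, predicted):
    print(age, round(float(speed), 1))
```

• All three ages lie within the observed range of 2 to 17 years. Predictions for ages far outside that range would rest on the assumption that the same straight line still holds there.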

• Bad Fit?

• Let us create an example where linear regression would not be the best method to predict future values.

• Example

• These values for the x- and y-axis should result in a very bad fit for linear regression:

• import matplotlib.pyplot as plt

from scipy import stats

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

• And the r for relationship?
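• One way to answer this question is to compute r for the bad-fit data in the same way as before. The sketch below uses numpy.corrcoef() as a stand-in for the SciPy call:

```python
import numpy

x = [89, 43, 36, 36, 95, 10, 66, 34, 38, 20, 26, 29, 48, 64, 6, 5, 36, 66, 72, 40]
y = [21, 46, 3, 35, 67, 95, 53, 72, 58, 10, 26, 34, 90, 33, 38, 20, 56, 2, 47, 15]

# The off-diagonal entry of the correlation matrix is r for x and y.
r = numpy.corrcoef(x, y)[0, 1]

print(r)
```

• The result is very close to 0, which indicates that there is almost no linear relationship in this data and that a linear regression is not suitable for predictions here.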
