EXPERIMENT NO. 11
Program Name: Statistical Analysis using NumPy in Python Programming Language
Implementation: Statistics is concerned with collecting data and then analyzing it. It includes methods
for collecting samples, describing the data, and drawing conclusions from it. NumPy is the fundamental
package for scientific computing in Python, so its statistical functions are a natural starting point.
NumPy contains various statistical functions that are used to perform statistical data analysis. These
functions can find the minimum or maximum of the elements of an array, and they can also compute
basic statistics such as the standard deviation and the variance.
NumPy is equipped with the following statistical functions:
1. np.amin() - This function determines the minimum value of the elements along a specified axis.
2. np.amax() - This function determines the maximum value of the elements along a specified axis.
3. np.mean() - It determines the mean value of the data set.
4. np.median() - It determines the median value of the data set.
5. np.std() - It determines the standard deviation.
6. np.var() - It determines the variance.
7. np.ptp() - It returns the range (maximum minus minimum) of values along an axis.
8. np.average() - It determines the weighted average.
9. np.percentile() - It determines the nth percentile of the data along the specified axis.
Finding maximum and minimum of array in NumPy
The np.amin() and np.amax() functions determine the minimum and maximum values of the
array elements along a specified axis. When no axis is given, they work on the whole array.
import numpy as np
arr = np.array([[1,23,78],[98,60,75],[79,25,48]])
print(arr)
# Minimum function
print(np.amin(arr))
# Maximum function
print(np.amax(arr))
Output
[[ 1 23 78]
 [98 60 75]
 [79 25 48]]
1
98
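Both functions also accept an axis argument, which the example above does not use. A minimal sketch on the same array; the variable names col_min and row_max are illustrative only:

```python
import numpy as np

arr = np.array([[1, 23, 78], [98, 60, 75], [79, 25, 48]])

# axis=0 collapses the rows, giving one minimum per column
col_min = np.amin(arr, axis=0)
# axis=1 collapses the columns, giving one maximum per row
row_max = np.amax(arr, axis=1)

print(col_min)  # [ 1 23 48]
print(row_max)  # [78 98 79]
```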
Prince Yadav IT3 2100270130133
Finding Mean, Median, Standard Deviation and Variance in NumPy
Mean
The mean is the sum of the elements divided by the number of elements, and is given by the following formula:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
np.mean() calculates the mean by adding all the items of the array and dividing the sum by the number of elements.
We can also specify the axis along which the mean is calculated.
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.mean(a))
Output
[5 6 7]
6.0
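The axis option mentioned above can be sketched on a small two-dimensional array. The array m and the result names are made up for illustration:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

overall = np.mean(m)           # mean of all six elements
col_mean = np.mean(m, axis=0)  # one mean per column
row_mean = np.mean(m, axis=1)  # one mean per row

print(overall)   # 3.5
print(col_mean)  # [2.5 3.5 4.5]
print(row_mean)  # [2. 5.]
```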
Median
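The median is the middle element of the sorted array. The formula differs for odd and even sets: for an odd
number of elements it is the middle value, and for an even number it is the mean of the two middle values.
np.median() can calculate the median for both one-dimensional and multi-dimensional arrays. The median
separates the higher half of the data values from the lower half.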
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.median(a))
Output
[5 6 7]
6.0
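To see the odd and even cases side by side, here is a small sketch. The arrays odd, even and grid are hypothetical inputs chosen so that the results are easy to check by hand:

```python
import numpy as np

odd = np.array([7, 5, 6])      # sorted: 5, 6, 7
even = np.array([8, 5, 7, 6])  # sorted: 5, 6, 7, 8

median_odd = np.median(odd)    # the middle value
median_even = np.median(even)  # mean of the two middle values, 6 and 7

print(median_odd)   # 6.0
print(median_even)  # 6.5

# For a 2-D array, axis selects the direction of the calculation
grid = np.array([[1, 9, 5], [4, 2, 8]])
row_median = np.median(grid, axis=1)
print(row_median)   # [5. 4.]
```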
Standard Deviation
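Standard deviation is the square root of the average of the squared deviations from the mean. With
$\bar{x}$ the mean and $N$ the number of elements, the formula for standard deviation is:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$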
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.std(a))
Output
[5 6 7]
0.816496580927726
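The definition can be checked by computing each step by hand and comparing the result with np.std(). This is only a sketch; the intermediate names deviations, mean_sq_dev and manual_std are illustrative:

```python
import numpy as np

a = np.array([5, 6, 7])

deviations = a - a.mean()               # distance of each element from the mean
mean_sq_dev = (deviations ** 2).mean()  # average of the squared deviations
manual_std = np.sqrt(mean_sq_dev)       # square root of that average

print(manual_std)
print(np.isclose(manual_std, np.std(a)))  # True
```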
Variance
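Variance is the average of the squared deviations from the mean, that is, the square of the standard
deviation. With $\bar{x}$ the mean and $N$ the number of elements, the formula is:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$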
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.var(a))
Output
[5 6 7]
0.6666666666666666
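By default np.var() divides by the number of elements N (population variance). If a sample variance is wanted instead, the ddof argument changes the divisor to N - ddof. A short sketch, with pop_var and sample_var as illustrative names:

```python
import numpy as np

a = np.array([5, 6, 7])

pop_var = np.var(a)             # divides by N (default, ddof=0)
sample_var = np.var(a, ddof=1)  # divides by N - 1

print(pop_var)     # 0.6666666666666666
print(sample_var)  # 1.0

# The variance is the square of the standard deviation
print(np.isclose(np.std(a) ** 2, pop_var))  # True
```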
NumPy Average Function
The np.average() function determines the weighted average along a specified axis, including for multi-dimensional arrays.
The weighted average is calculated by multiplying each element by its weight, summing the products, and dividing by the
sum of the weights. The weights are specified separately. If weights are not specified, it produces the same output as the mean.
import numpy as np
a = np.array([5,6,7])
print(a)
# without weights: same as the mean
print(np.average(a))
# with weights: gives the weighted average
wt = np.array([8,2,3])
print(np.average(a, weights=wt))
Output
[5 6 7]
6.0
5.615384615384615
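The weighted result can be reproduced by hand, which makes the role of the weights explicit. A minimal sketch using the same values; manual_avg is an illustrative name:

```python
import numpy as np

a = np.array([5, 6, 7])
wt = np.array([8, 2, 3])

# (5*8 + 6*2 + 7*3) / (8 + 2 + 3) = 73 / 13
manual_avg = (a * wt).sum() / wt.sum()

print(manual_avg)
print(np.isclose(manual_avg, np.average(a, weights=wt)))  # True
```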
NumPy Percentile Function
np.percentile() determines the q-th percentile of the data along a specified axis. It has the following syntax:
np.percentile(input, q, axis)
The accepted parameters are:
input: the input array.
q: the percentile to compute, a value between 0 and 100.
axis: the axis along which the calculation is performed.
import numpy as np
a = np.array([2,10,20])
print(a)
print(np.percentile(a,10,0))
Output
[ 2 10 20]
3.6
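The value 3.6 is not an element of the array because np.percentile() interpolates linearly between neighbouring values by default. The sketch below reproduces that calculation by hand. It assumes the default interpolation method and q below 100, and the names pos, lower, frac and manual are illustrative:

```python
import numpy as np

a = np.array([2, 10, 20])
q = 10

# Fractional position of the q-th percentile on the index scale 0..n-1
pos = q / 100 * (len(a) - 1)  # 0.2 for q = 10 and n = 3
lower = int(np.floor(pos))    # index of the value just below that position
frac = pos - lower            # how far to move towards the next value

s = np.sort(a)
manual = s[lower] + frac * (s[lower + 1] - s[lower])  # 2 + 0.2 * (10 - 2)

print(np.isclose(manual, np.percentile(a, q)))  # True
print(np.percentile(a, 50) == np.median(a))     # True: the 50th percentile is the median
```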
NumPy Peak-to-Peak Function
The np.ptp() function determines the range of values (maximum minus minimum) along an axis.
import numpy as np
a = np.array([[2,10,20],[6,10,60]])
print(np.ptp(a,0))
Output
[ 4  0 40]
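As a quick check, the peak-to-peak value is the maximum minus the minimum along the chosen axis. A small sketch on the same array; col_range, row_range and manual are illustrative names:

```python
import numpy as np

a = np.array([[2, 10, 20], [6, 10, 60]])

col_range = np.ptp(a, axis=0)  # range of each column
row_range = np.ptp(a, axis=1)  # range of each row
manual = np.amax(a, axis=0) - np.amin(a, axis=0)

print(col_range)  # [ 4  0 40]
print(row_range)  # [18 54]
print(np.array_equal(col_range, manual))  # True
```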
EXPERIMENT NO. 12
Program Name: Implementation of Linear Regression using Python Programming Language
Implementation:
Simple Linear Regression in Python:
Simple linear regression is a statistical method that we can use to find a relationship between two variables
and make predictions. The two variables are typically denoted as y and x. The independent variable, the
variable used to predict the dependent variable, is denoted as x. The dependent variable, or
the outcome/output, is denoted as y. A simple linear regression model produces a line of best fit, also called the
regression line. You may have heard about drawing the line of best fit through a scatter plot of data. For
example, let's say we have a scatter plot showing how years of experience affect salaries. Imagine drawing a
line through the points to predict the trend.
The simple linear regression equation we will use is written below.
$$y = \beta_0 + \beta_1 x$$
The constant 𝜷0 is the y-intercept, the point where the regression line crosses the y-axis. The beta coefficient
𝜷1 is the slope and describes the relationship between the independent variable and the dependent variable.
The coefficient can be positive or negative and is the change in the dependent variable for every 1-unit
change in the independent variable.
For example, let's say we have a regression equation of y = 2 + 0.5x. For every 1-unit increase in the
independent variable (x), there will be a 0.5 increase in the dependent variable (y).
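The effect of the slope can be seen by evaluating the example equation at a few points. In this sketch the x values are arbitrary, and b0 and b1 are illustrative names for the intercept and slope:

```python
import numpy as np

b0, b1 = 2.0, 0.5           # intercept and slope from y = 2 + 0.5x
x = np.array([0, 1, 2, 3])  # consecutive 1-unit steps in x
y = b0 + b1 * x

print(y)           # [2.  2.5 3.  3.5]
print(np.diff(y))  # [0.5 0.5 0.5], one slope-sized step per unit of x
```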
Simple Linear Regression Using Python
For this example, we will be using salary data from Kaggle. The data consists of two columns, years of
experience and the corresponding salary.
First, we will import the Python packages that we will need for this analysis: NumPy, to help with the math
calculations; Pandas, to store and manipulate the data; and Matplotlib (optional), to plot the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Next, we will load in the data and then assign each column to its appropriate variable. For this example,
we will be using the years of experience to predict the salary, so the dependent variable will be the salary
(y) and the independent variable will be the years of experience (x).
data = pd.read_csv('Salary_Data.csv')
x = data['YearsExperience']
y = data['Salary']
To get a look at the data we can use the .head() method provided by Pandas, which will show us the first
few rows of the data.
print(data.head())
YearsExperience Salary
0 1.1 39343.0
1 1.3 46205.0
2 1.5 37731.0
3 2.0 43525.0
4 2.2 39891.0
[Figure: scatter plot of Salary against Years of Experience]
The scatter plot of our data shows a positive linear relationship between Years of Experience and Salary,
meaning that as a person gains more experience, they also tend to get paid more.
Calculating the Regression Line
While we could spend all day guessing the slope and intercept of the linear regression line, luckily there
are formulas that we can use to quickly make these calculations.
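To estimate the slope 𝜷1 of the data we will use the following formula:
$$\beta_1 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
To estimate the intercept 𝜷0, we can use the following formula:
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$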
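As a worked check, take a small made-up data set (not the salary data): x = (1, 2, 3) and y = (2, 4, 5). The slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations. The intercept is the mean of y minus the slope times the mean of x:

```latex
\begin{aligned}
\bar{x} &= 2, \qquad \bar{y} = \tfrac{11}{3} \\
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= (-1)\left(-\tfrac{5}{3}\right) + (0)\left(\tfrac{1}{3}\right) + (1)\left(\tfrac{4}{3}\right) = 3 \\
\sum_i (x_i - \bar{x})^2 &= 1 + 0 + 1 = 2 \\
\beta_1 &= \tfrac{3}{2} = 1.5 \\
\beta_0 &= \bar{y} - \beta_1 \bar{x} = \tfrac{11}{3} - 1.5 \cdot 2 = \tfrac{2}{3}
\end{aligned}
```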
Now we will have to translate these two formulas to Python to calculate the regression line. First I will
show the full function, then I will break it down further.
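def linear_regression(x, y):
    # Number of observations
    N = len(x)
    x_mean = x.mean()
    y_mean = y.mean()

    # Slope: sum of cross-deviations over sum of squared x deviations
    B1_num = ((x - x_mean) * (y - y_mean)).sum()
    B1_den = ((x - x_mean)**2).sum()
    B1 = B1_num / B1_den

    # Intercept: the line passes through the point (x_mean, y_mean)
    B0 = y_mean - (B1*x_mean)

    # Regression line as text, with both coefficients rounded to 3 decimal places
    reg_line = 'y = {} + {}x'.format(round(B0, 3), round(B1, 3))

    return (B0, B1, reg_line)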
First, we will use the len() function to get the number of observations in our dataset and assign it to
the N variable. We can then calculate the mean of both x and y by simply using the .mean() method.
N = len(x)
x_mean = x.mean()
y_mean = y.mean()
Now we can begin to calculate the slope 𝜷1. To shorten the length of these lines of code, we can calculate
the numerator and denominator of the slope formula first, then divide the numerator by the denominator
and assign the result to a variable named B1. We can just follow the slope formula given above.
B1_num = ((x - x_mean) * (y - y_mean)).sum()
B1_den = ((x - x_mean)**2).sum()
B1 = B1_num / B1_den
Now that we have calculated the slope 𝜷1, we can use the formula for the intercept 𝜷0.
B0 = y_mean - (B1 * x_mean)
Now if we apply this linear_regression() function to our data, it will return the intercept, slope and the
regression line rounded to 3 decimal places.
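The call itself might look like the sketch below. Since the CSV file is not reproduced here, it uses hypothetical data lying exactly on the line y = 2 + 0.5x rather than the salary data, and it repeats a condensed copy of the function so that it runs on its own. The comparison with np.polyfit() is an added cross-check, not part of the original procedure:

```python
import numpy as np

def linear_regression(x, y):
    # Condensed version of the function described above
    x_mean = x.mean()
    y_mean = y.mean()
    B1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    B0 = y_mean - B1 * x_mean
    reg_line = 'y = {} + {}x'.format(round(B0, 3), round(B1, 3))
    return (B0, B1, reg_line)

# Hypothetical data on the line y = 2 + 0.5x (not the salary data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 + 0.5 * x

B0, B1, reg_line = linear_regression(x, y)
print(reg_line)

# Cross-check against NumPy's degree-1 least-squares polynomial fit,
# which returns the coefficients from highest power to lowest
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(B1, slope), np.isclose(B0, intercept))
```

With the real data, x and y would be the data['YearsExperience'] and data['Salary'] columns loaded earlier.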