Fds Lab Manual
REGISTER NO:
Procedure:
Install Python Data Science Packages
Python is a high-level, general-purpose programming language with a large collection of data
science and machine learning packages. As a first step, install Python for Windows, macOS, or
Linux.
Python Packages
The power of Python is in the packages that are available either through the pip or conda
package managers. This page is an overview of some of the best packages for machine learning
and data science and how to install them.
We will explore the Python packages that are commonly used for data science and machine
learning. You may need to install the packages from the terminal, Anaconda prompt, command
prompt, or from the Jupyter Notebook. If you have multiple versions of Python or have specific
dependencies then use an environment manager such as pyenv. For most users, a single
installation is typically sufficient. The Python package manager pip has all of the packages
(such as gekko) that we need for this course. If there is an administrative access error, install
to the local profile with the --user flag.
pip install gekko
Gekko
Gekko provides an interface to gradient-based solvers for machine learning and optimization
of mixed-integer, differential algebraic equations, and time series models. Gekko provides
exact first and second derivatives through automatic differentiation and discretization with
simultaneous or sequential methods.
pip install gekko
Keras
Keras provides an interface for artificial neural networks. From version 2.4 through the rest
of the 2.x series, TensorFlow was the only supported backend; Keras 3 again supports several
backends (TensorFlow, JAX, and PyTorch). The backend is installed separately, for example with
pip install tensorflow.
pip install keras
Matplotlib
The package matplotlib generates plots in Python.
pip install matplotlib
Numpy
Numpy is a numerical computing package for mathematics, science, and engineering. Many
data science packages use Numpy as a dependency.
pip install numpy
OpenCV
OpenCV (Open Source Computer Vision Library) is a package for real-time computer vision
that was developed with support from Intel Research.
pip install opencv-python
Pandas
Pandas visualizes and manipulates data tables. There are many functions that allow efficient
manipulation for the preliminary steps of data analysis problems.
pip install pandas
Plotly
Plotly renders interactive plots with HTML and JavaScript. Plotly Express is included with
Plotly.
pip install plotly
PyTorch
PyTorch enables deep learning, computer vision, and natural language processing.
Development is led by Facebook's AI Research lab (FAIR).
pip install torch
Scikit-Learn
Scikit-Learn (or sklearn) includes a wide variety of classification, regression and clustering
algorithms including neural network, support vector machine, random forest, gradient
boosting, k-means clustering, and other supervised or unsupervised learning methods.
pip install scikit-learn
SciPy
SciPy is a general-purpose package for mathematics, science, and engineering and extends the
base capabilities of NumPy.
pip install scipy
Seaborn
Seaborn is built on matplotlib, and produces detailed plots in few lines of code.
pip install seaborn
Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and performing
statistical tests. It includes descriptive statistics, statistical tests, plotting functions, and result
statistics.
pip install statsmodels
TensorFlow
TensorFlow is an open source machine learning platform with particular focus on training and
inference of deep neural networks. Development is led by the Google Brain team.
pip install tensorflow
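After installing, it can be useful to confirm which packages Python can actually find. The following is a minimal sketch using only the standard library; the PACKAGES mapping and the check_packages function are names introduced here for illustration, and the mapping reflects the fact that some pip names differ from their import names.

```python
import importlib.util

# pip name -> import name (they differ for some packages,
# for example opencv-python is imported as cv2)
PACKAGES = {
    "gekko": "gekko",
    "keras": "keras",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "opencv-python": "cv2",
    "pandas": "pandas",
    "plotly": "plotly",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "seaborn": "seaborn",
    "statsmodels": "statsmodels",
    "tensorflow": "tensorflow",
}


def check_packages(packages):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    for pip_name, found in check_packages(PACKAGES).items():
        print(f"{pip_name:15s} {'installed' if found else 'missing'}")
```

Any package reported as missing can then be installed with the matching pip install command listed above.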
Result:
Ex. No: 02
Working with NumPy
Aim:
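import numpy as np
# Create an array from a list
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output:
[1 2 3 4 5]
import numpy as np
# Create an array from a tuple
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Output:
[1 2 3 4 5]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Negative slicing: from the third-last element up to (not including) the last
print(arr[-3:-1])
Output:
[5 6]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Slice with a step: every second element from index 1 up to (not including) index 5
print(arr[1:5:2])
Output:
[2 4]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Every second element of the whole array
print(arr[::2])
Output:
[1 3 5 7]
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
# From the second row, take the elements at index 1 to 3
print(arr[1, 1:4])
Output:
[7 8 9]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
# Reshape the 12 elements into 2 blocks of 3 rows and 2 columns
newarr = arr.reshape(2, 3, 2)
print(newarr)
Output:
[[[ 1  2]
  [ 3  4]
  [ 5  6]]
 [[ 7  8]
  [ 9 10]
  [11 12]]]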
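The slicing and reshaping rules used in this exercise can be checked on a single array. This is a small sketch; the variable names are chosen here for illustration, and reshape(-1, 4) is an additional form in which NumPy infers the missing dimension from the element count.

```python
import numpy as np

arr = np.arange(1, 13)          # the integers 1 to 12

# start:stop:step slicing; the stop index is excluded
middle = arr[1:5:2]             # elements at positions 1 and 3
last_two = arr[-2:]             # negative indices count from the end

# reshape must preserve the element count; -1 lets NumPy infer one dimension
cube = arr.reshape(2, 3, 2)
matrix = arr.reshape(-1, 4)     # inferred as 3 rows of 4

print(middle, last_two)
print(cube.shape, matrix.shape)
```

A reshape to a shape whose sizes do not multiply to 12 raises a ValueError, which is a quick way to catch a wrong dimension.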
Result:
Ex. No: 03
Working with Pandas data frames
Aim:
To learn the various uses of Pandas package in Python with examples.
Introduction:
Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to
both "Panel Data" and "Python Data Analysis". Pandas allows us to analyze big data and draw
conclusions based on statistical theories. Pandas can clean messy data sets, and make them
readable and relevant. Relevant data is very important in data science.
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Output:
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0 1
1 7
2 2
dtype: int64
• If nothing else is specified, the values are labeled with their index number. First value
has index 0, second value has index 1 etc. This label can be used to access a specified
value.
import pandas as pd
a = ['a', 'b', 'c']
myvar = pd.Series(a)
print(myvar[1])
Output:
b
import pandas as pd
a = [30, 31, 32]
myvar = pd.Series(a, index = ["A", "B", "C"])
print(myvar)
Output:
A 30
B 31
C 32
dtype: int64
import pandas as pd
KeyValuePair = {"Index-1": 420, "Index-2": 380, "Index-3": 390}
myvar = pd.Series(KeyValuePair,index=["Index-1","Index-3"])
print(myvar)
Output:
Index-1 420
Index-3 390
dtype: int64
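The output that follows has no accompanying program in this manual. A minimal sketch that would produce a table of that shape is given here, assuming the same Item and Cost data used in the next program and assuming the row labels I1 to I3 are passed through the index argument of DataFrame.

```python
import pandas as pd

data = {"Item": ["A", "B", "C"], "Cost": [20, 10, 5]}

# Custom row labels replace the default 0, 1, 2 index
df = pd.DataFrame(data, index=["I1", "I2", "I3"])
print(df)

# Rows are then selected by label with loc
print(df.loc["I2"])
```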
Output:
Item Cost
I1 A 20
I2 B 10
I3 C 5
import pandas as pd
data = {
"Item": ['A', 'B', 'C'],
"Cost": [20, 10, 5]}
df= pd.DataFrame(data)
for i in range(0, 3):
    print(df.loc[i])
Output:
Item A
Cost 20
Name: 0, dtype: object
Item B
Cost 10
Name: 1, dtype: object
Item C
Cost 5
Name: 2, dtype: object
Load Files into a DataFrame:
import pandas as pd
df = pd.read_csv('/student.csv')  # change the path or file name as needed
print(df.to_string())
Output:
id name class mark gender
0 1 John Deo Four 75 female
1 2 Max Ruin Three 85 male
2 3 Arnold Three 55 male
3 4 Krish Star Four 60 female
4 5 John Mike Four 60 female
5 6 Alex John Four 55 male
6 7 My John Rob Fifth 78 male
7 8 Asruid Five 85 male
8 9 Tes Qry Six 78 male
9 10 Big John Four 55 female
10 11 Ronald Six 89 female
11 12 Recky Six 94 female
12 13 Kty Seven 88 female
Viewing the Data
• One of the most used methods for getting a quick overview of the DataFrame is the
head() method.
• The head() method returns the headers and a specified number of rows, starting from
the top.
• There is also a tail() method for viewing the last rows of the DataFrame.
import pandas as pd
df = pd.read_csv('/content/student.csv')
print("Head Data")
print(df.head(3))
print("Tail Data")
print(df.tail(3))
Output:
Head Data
id name class mark gender
0 1 John Deo Four 75 female
1 2 Max Ruin Three 85 male
2 3 Arnold Three 55 male
Tail Data
id name class mark gender
32 33 Kenn Rein Six 96 female
33 34 Gain Toe Seven 69 male
34 35 Rows Noump Six 88 female
import pandas as pd
df = pd.read_csv('/content/clean.csv')
print(df)
Output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 120 250 NaN
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Empty Cells
• Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
• One way to deal with empty cells is to remove rows that contain empty cells.
• This is usually OK, since data sets can be very big, and removing a few rows will not
have a big impact on the result.
import pandas as pd
df = pd.read_csv('/content/clean.csv')
new_df = df.dropna()
print(new_df.to_string())
Output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
27 60 '2020/12/27' 92 118 241.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
• Note: By default, the dropna() method returns a new DataFrame, and will not change
the original.
• If you want to change the original DataFrame, use the inplace = True argument.
import pandas as pd
df = pd.read_csv('/content/clean.csv')
df.dropna(inplace = True)
print(df.to_string())
import pandas as pd
df = pd.read_csv('/content/clean.csv')
df.fillna(130, inplace = True)
print(df.to_string())
Output:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
# Assign the result back; inplace on a selected column may not update df in recent pandas
df["Calories"] = df["Calories"].fillna(x)
print(df.to_string())
Output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.10
1 60 '2020/12/02' 117 145 479.00
2 60 '2020/12/03' 103 135 340.00
3 45 '2020/12/04' 109 175 282.40
4 45 '2020/12/05' 117 148 406.00
5 60 '2020/12/06' 102 127 300.00
6 60 '2020/12/07' 110 136 374.00
7 450 '2020/12/08' 104 134 253.30
8 30 '2020/12/09' 109 133 195.10
9 60 '2020/12/10' 98 124 269.00
10 60 '2020/12/11' 103 147 329.30
11 60 '2020/12/12' 100 120 250.70
12 60 '2020/12/12' 100 120 250.70
13 60 '2020/12/13' 106 128 345.30
14 60 '2020/12/14' 104 132 379.30
15 60 '2020/12/15' 98 123 275.00
16 60 '2020/12/16' 98 120 215.20
17 60 '2020/12/17' 100 120 300.00
18 45 '2020/12/18' 90 112 304.68
19 60 '2020/12/19' 103 123 323.00
20 45 '2020/12/20' 97 125 243.00
21 60 '2020/12/21' 108 131 364.20
22 45 130 100 119 282.00
23 60 '2020/12/23' 130 101 300.00
24 45 '2020/12/24' 105 132 246.00
25 60 '2020/12/25' 102 126 334.50
26 60 20201226 100 120 250.00
27 60 '2020/12/27' 92 118 241.00
28 60 '2020/12/28' 103 132 304.68
29 60 '2020/12/29' 100 132 280.00
30 60 '2020/12/30' 102 129 380.30
31 60 '2020/12/31' 92 115 243.00
The median or the mode can be used in place of the mean:
x = df["Calories"].median()   # middle value
x = df["Calories"].mode()[0]  # most frequent value
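The three fill strategies (mean, median, and mode) can be compared without a CSV file. The sketch below uses a small in-memory table with made-up values; the column names mirror the ones in this exercise, but the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing Calories values
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30, 45],
    "Calories": [400.0, np.nan, 300.0, 300.0, np.nan],
})

# fillna returns a new Series, so df itself is left unchanged
mean_filled = df["Calories"].fillna(df["Calories"].mean())
median_filled = df["Calories"].fillna(df["Calories"].median())
mode_filled = df["Calories"].fillna(df["Calories"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

The mean is pulled toward extreme values, so the median is often the safer choice when a column contains outliers.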
Plotting
Pandas uses the plot() method to create diagrams.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/content/1.csv')
df.plot()
plt.show()
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis.
In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis.
Include the x and y arguments like this:
x = 'Duration', y = 'Calories'
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/content/1.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Result:
Ex. No: 04 Reading data from text files and exploring various commands
Aim
Procedure
Python provides built-in functions for creating, writing, and reading files. Two types of files
can be handled in Python: normal text files and binary files (written in binary language, 0s
and 1s).
Text files: In this type of file, each line of text is terminated with a special character called
EOL (End of Line), which is the newline character ('\n') in Python by default.
Binary files: In this type of file, there is no terminator for a line, and the data is stored after
converting it into machine-understandable binary language.
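The six commonly used access modes for text files in Python are:
Read Only ('r'): Opens the file for reading. The file must exist. This is the default mode.
Read and Write ('r+'): Opens an existing file for reading and writing without truncating it.
Write Only ('w'): Opens the file for writing. An existing file is truncated; a missing file is created.
Write and Read ('w+'): Opens the file for reading and writing. An existing file is truncated; a missing file is created.
Append Only ('a'): Opens the file for writing. New data is added at the end; a missing file is created.
Append and Read ('a+'): Opens the file for reading and writing. New data is added at the end; a missing file is created.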
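The difference between the write, append, and read-and-write modes is easiest to see by opening the same file several times. This is a minimal sketch; the file name modes_demo.txt and the temporary directory are choices made here so that the example does not touch any existing file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical file name

with open(path, "w") as f:      # creates the file (or empties an existing one)
    f.write("first\n")

with open(path, "a") as f:      # appends; existing content is kept
    f.write("second\n")

with open(path, "r+") as f:     # read and write; the file must already exist
    before = f.read()           # content written by the two steps above
    f.write("third\n")          # position is at the end after read(), so this appends

with open(path, "w") as f:      # reopening with 'w' truncates the file
    pass

with open(path, "r") as f:
    after = f.read()            # empty, because 'w' discarded the old content

print(repr(before), repr(after))
```

The with statement closes the file automatically at the end of each block, which avoids a forgotten close() call.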
There are three ways to read data from a text file.
read(): Returns the data read as a string. Reads at most n characters (bytes in binary mode); if
n is not specified, reads the entire file.
File_object.read([n])
readline(): Reads one line of the file and returns it as a string. If n is specified, reads at most
n characters, but never more than one line, even if n exceeds the length of the line.
File_object.readline([n])
readlines(): Reads all the lines and returns them as a list in which each line is a string element.
File_object.readlines()
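The three read methods share one file position, so each call continues where the previous one stopped. The sketch below illustrates this on a temporary file; the file name read_demo.txt and its three lines are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")  # hypothetical file name
with open(path, "w") as f:
    f.writelines(["alpha\n", "beta\n", "gamma\n"])

with open(path, "r") as f:
    first_three = f.read(3)       # at most 3 characters from the start
    rest_of_line = f.readline()   # continues to the end of the current line
    capped = f.readline(2)        # stops after 2 characters of the next line
    remaining = f.readlines()     # everything left, split into lines

print(repr(first_three), repr(rest_of_line), repr(capped), remaining)
```

Calling seek(0) on the file object moves the position back to the start, which is what the program below does before calling readlines().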
Program
# Program to show various ways to read and
# write data in a file.
file1 = open("myfile.txt","w")
L = ["This is CSE Department \n", "Testing Line 2 \n", "Testing Line 3 \n"]
# \n is placed to indicate EOL (End of Line)
file1.write("Hello \n")
file1.writelines(L)
file1.close() #to change file access modes
file1 = open("myfile.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()
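Output
Output of Read function is
Hello
This is CSE Department
Testing Line 2
Testing Line 3
Output of Readlines function is
['Hello \n', 'This is CSE Department \n', 'Testing Line 2 \n', 'Testing Line 3 \n']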
Result:
Ex. No: 5 a. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following: Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis.
Aim:
Procedure
• Download a dataset such as the Pima Indians Diabetes dataset, save it on any drive, and
load it from that path.
• The mean() method calculates the mean (average) of each column.
• The median() method calculates the median (middle value) of each column.
• The mode() method returns the value that appears most often in each column.
• The var() method calculates the variance for each column.
• The std() method calculates the standard deviation, a number that describes how spread
out the values are.
• The skew() method calculates the skew for each column. Skewness refers to a
distortion or asymmetry that deviates from the symmetrical bell curve, or normal
distribution, in a set of data.
Kurtosis:
It is also a statistical term and an important characteristic of frequency distribution. It
determines whether a distribution is heavy-tailed in respect of the normal distribution. It
provides information about the shape of a frequency distribution.
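All of the measures named in this exercise, including frequency, are available as pandas methods. The sketch below applies them to a short made-up list of ages standing in for one column of the diabetes data; the values are hypothetical. Note that pandas kurt() uses an unbiased estimator, so its value can differ slightly from scipy.stats.kurtosis with bias=True.

```python
import pandas as pd

# Hypothetical values standing in for one column of the diabetes data
ages = pd.Series([21, 22, 22, 25, 30, 45, 60], name="Age")

print("Frequency:\n", ages.value_counts().sort_index())
print("Mean:", ages.mean())
print("Median:", ages.median())
print("Mode:", ages.mode().tolist())
print("Variance:", ages.var())        # sample variance (ddof=1)
print("Std dev:", ages.std())         # sample standard deviation (ddof=1)
print("Skewness:", ages.skew())       # positive: long tail towards larger values
print("Kurtosis:", ages.kurt())       # excess kurtosis; 0 for a normal distribution
```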
Program:
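import pandas as pd
from scipy.stats import kurtosis
df = pd.read_csv(r'd:\diabetes.csv')
print(df)
# Keep the two columns to be analysed
df1 = df[['Age', 'Glucose']]
print(df1)
print(df1['Age'].value_counts())  # frequency of each Age value
print(df1.mean())
print(df1.median())
print(df1.mode())
print(df1.var())
print(df1.std())
print(df1.skew())
# Kurtosis of each selected column (Fisher definition: a normal distribution gives 0)
print(kurtosis(df1, axis=0, bias=True))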
Result:
Ex. No: 5 b. Linear Regression and Logistic Regression with the Diabetes Dataset Using
Python Machine Learning
Aim
Procedure
• Load the sklearn libraries.
• Load the diabetes dataset.
• Split the dataset into training and testing sets.
• Create the linear regression and logistic regression models.
• Make predictions using the testing set.
• Find the coefficients and the mean squared error.
Program
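import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
# Assumption: the loading and splitting steps are missing from this manual. They are
# reconstructed from scikit-learn's linear regression example, which matches the output below.
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]  # keep a single feature (column 2)
# Split the data: the last 20 samples form the testing set
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]
# Create the linear regression model, train it, and predict on the testing set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))
# Logistic regression needs class labels. Assumption (not stated in this manual):
# label a sample 1 when its target is above the training median, else 0.
y_train_class = (diabetes_y_train > np.median(diabetes_y_train)).astype(int)
Logistic_model = LogisticRegression()
Logistic_model.fit(diabetes_X_train, y_train_class)
y_predict = Logistic_model.predict(diabetes_X_train)
print(y_predict)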
Output
Coefficients: [938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
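The "split dataset" step of the procedure is commonly done with train_test_split. The sketch below shows that approach for a logistic regression model on synthetic data generated here for illustration (two well-separated classes with one feature); it is not the diabetes data, and the 25% test size and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic one-feature data: class 0 centred at -2, class 1 centred at +2
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Hold out 25% of the samples for testing; stratify keeps both classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression()
model.fit(X_train, y_train)

# score() returns the fraction of test samples classified correctly
accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Evaluating on held-out samples, rather than on the training set, gives a fairer estimate of how the model behaves on new data.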
Result:
Ex. No: 5 c. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following: Multiple Regression
Aim
Procedure
• The Pandas module allows us to read CSV files and return a DataFrame object.
• Then make a list of the independent values and call this variable X.
• Put the dependent values in a variable called y.
• From the sklearn module we will use the LinearRegression() class to create a linear
regression object.
• This object has a method called fit() that takes the independent and dependent values
as parameters and fills the regression object with data that describes the relationship.
• We now have a regression object that is ready to predict age values based on a person's
Glucose and BloodPressure.
Program
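import pandas as pd
from sklearn import linear_model
df = pd.read_csv(r'd:\diabetes.csv')
print(df)
# Independent variables (features) and dependent variable (target)
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# Predict the age for Glucose = 150 and BloodPressure = 13
new_data = pd.DataFrame([[150, 13]], columns=['Glucose', 'BloodPressure'])
predictedage = regr.predict(new_data)
print(predictedage)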
Output
[28.77214401]
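A fitted multiple regression model also exposes one coefficient per independent variable and an intercept. The sketch below demonstrates this on synthetic data built here from a known formula, so the recovered values can be checked; the column names x1, x2, and y are hypothetical and are not from the diabetes file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data generated exactly from y = 2*x1 + 3*x2 + 5
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6], "x2": [2, 1, 4, 3, 6, 5]})
df["y"] = 2 * df["x1"] + 3 * df["x2"] + 5

model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
print("Coefficients:", model.coef_)    # one value per column of X
print("Intercept:", model.intercept_)

# Predict for a new observation, passed with the same column names
new_point = pd.DataFrame({"x1": [10], "x2": [10]})
prediction = model.predict(new_point)
print("Prediction:", prediction)
```

Each coefficient is the change in the predicted value when its variable increases by one unit and the other variable is held fixed.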
Result:
Ex. No: 5 d. Also compare the results of the above analysis for the two data sets.
Aim
Procedure
Step 1: Prepare the datasets to be compared
• Store the first set of car prices (columns Model, City, Year, and amount) in a CSV file
called car1.csv.
• Store the second set of prices, with the price column named amount1, in a second CSV file
called car2.csv.
Step 2: Create the two DataFrames
• Read each CSV file into its own DataFrame.
Step 3: Compare the values between the two Pandas DataFrames
• In this step, you'll need to import the NumPy package. Its where() function picks a value
for each row depending on whether the two amounts match.
Program
import pandas as pd
import numpy as np
data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']
df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')
df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] -
df2['amount1'])
print(df1)
Output
    Model     City  Year   amount  amount1 prices_match  price_diff
0  Maruti  Chennai  2022   600000   600000         True           0
1  Hyndai  Chennai  2022   700000   700000         True           0
2    Ford  Chennai  2022   800000   850000        False      -50000
3     Kia  Chennai  2022   900000   900000         True           0
4     XL6  Chennai  2022  1000000  1000000         True           0
5    Tata  Chennai  2022  1100000  1150000        False      -50000
6    Audi  Chennai  2022  1200000  1200000         True           0
7  Ertiga  Chennai  2022  1300000  1300000         True           0
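The program above compares the two files row by row, which assumes both list the cars in the same order. A variation that matches rows by the Model column instead is sketched below with small in-memory tables in place of car1.csv and car2.csv; the three rows are a made-up subset, and merge is a substitute technique, not the method used in the program above.

```python
import pandas as pd

# Hypothetical in-memory data in place of car1.csv and car2.csv
df1 = pd.DataFrame({"Model": ["Maruti", "Ford", "Kia"],
                    "amount": [600000, 800000, 900000]})
df2 = pd.DataFrame({"Model": ["Kia", "Maruti", "Ford"],
                    "amount1": [900000, 600000, 850000]})

# Merge on the key column so rows are matched by Model, not by position
merged = df1.merge(df2, on="Model")
merged["prices_match"] = merged["amount"] == merged["amount1"]
merged["price_diff"] = merged["amount"] - merged["amount1"]
print(merged)
```

Here prices_match holds real Boolean values rather than the strings 'True' and 'False', so it can be used directly to filter rows, for example merged[~merged["prices_match"]].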
Result:
Ex. No: 06 Apply and explore various plotting functions
Aim:
Program:
Histogram Plotting
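No program is given under this heading in the manual. A minimal sketch of a histogram with matplotlib follows, assuming a made-up sample of 250 normally distributed values and 10 bins; the output file name histogram.png is also a choice made here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 250 values drawn from a normal distribution
rng = np.random.default_rng(0)
x = rng.normal(170, 10, 250)

# hist() returns the count in each bin and the bin edges
counts, bin_edges, _ = plt.hist(x, bins=10)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.savefig("histogram.png")
```

The bar heights are the frequencies of the values falling in each bin, so they always add up to the number of samples.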
Line Plotting
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Scatter Plotting
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Result: