Data Science Lab Manual-CS3362
Branch : CSE/IT
Year/Semester : II/III
LIST OF EXPERIMENTS
Sl. No.    Date    Program Name    Mark    Signature
1a. Aim:
To download, install and explore the features of the NumPy package.
Problem Description
Python is an open-source object-oriented language. One of its strengths is the wide range of
external packages available: there are many packages that can be installed to extend its
functionality. These packages are repositories of functions written as Python scripts. NumPy is
one such package, designed to ease array computations. To install these Python packages we use
pip, the package installer. Pip is installed automatically along with Python, and we can then use
pip on the command line to install packages from PyPI.
NumPy
NumPy (Numerical Python) is an open-source library for the Python programming
language. It is used for scientific computing and working with arrays.
Apart from its multidimensional array object, it also provides high-level tools for
working with arrays.
Prerequisites
Python and pip must be installed on the system. NumPy is then installed by typing the following
command in the command prompt:
pip install numpy
Output:
RESULT:
Thus the NumPy package is downloaded, installed and the features are explored.
1 B). Aim:
To download, install and explore the features of the Jupyter package.
Problem Description
Data Science:
Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data.
Jupyter:
Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.
Jupyter has support for over 40 different programming languages and Python is one of them.
Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter
Notebook itself.
Procedure:
To install Jupyter using pip, we need to first check if pip is updated in our
system. Use the following command to update pip:
python -m pip install --upgrade pip
After updating pip, use the following command to install Jupyter:
python -m pip install jupyter
Finished Installation:
Once the installation finishes, launch the notebook server by typing:
jupyter notebook
➢ Click New and select Python 3 (ipykernel), then type the following
program. Click Run to execute the program.
Running the Python program:
Python code:
Program to find the area of a triangle
# Python Program to find the area of triangle
a=5
b=6
c=7
# calculate the semi-perimeter
s = (a + b + c) / 2
# calculate the area
area = (s*(s-a)*(s-b)*(s-c)) ** 0.5
print('The area of the triangle is %0.2f' %area)
Output:
Result:
Thus the Jupyter package is downloaded, installed and the features are explored.
1 C). Aim:
To download, install and explore the features of the SciPy package.
Problem Description
SciPy is a Python library that is useful in solving many mathematical equations and
algorithms. It is designed on top of the NumPy library and extends it with scientific routines
such as matrix rank, inverse, polynomial equations, LU decomposition, etc. Using its high-level
functions significantly reduces the complexity of the code and helps in better analysis of the
data.
SciPy is installed by typing the following command in the command prompt:
pip install scipy
Sample program
from scipy import special
# cosine of an angle given in degrees
d = special.cosdg(45)
print(d)
Output:
RESULT
Thus the SciPy package is downloaded, installed and the features are explored.
1 D). Aim:
To download, install and explore the features of Panda packages.
Problem Description
The Pandas library is not included with a regular install of Python. To use it, you
must install the Pandas framework separately.
As long as you have a newer version of Python installed (> Python 3.4), pip will be
installed on your computer along with Python by default.
However, if you’re using an older version of Python, you will need to install pip on
your computer before installing Pandas. Then enter the command
pip install pandas
on the terminal. This should launch the pip installer. The required files will be
downloaded, and Pandas will be ready to run on your computer.
Sample program
import pandas as pd
data = pd.DataFrame({"x1":["y", "x", "y", "x", "x", "y"],
"x2":range(16, 22),
"x3":range(1, 7),
"x4":["a", "b", "c", "d", "e", "f"],
"x5":range(30, 24, - 1)})
print(data)
s1 = pd.Series([1, 3, 4, 5, 6, 2, 9])
s2 = pd.Series([1.1, 3.5, 4.7, 5.8, 2.9, 9.3])
s3 = pd.Series(['a', 'b', 'c', 'd', 'e'])
Data = {'first': s1, 'second': s2, 'third': s3}
dfseries = pd.DataFrame(Data)
print(dfseries)
Result:
Thus the Pandas package is downloaded, installed and the features are explored.
1E). Aim:
To download, install and explore the features of the Statsmodels package.
Problem Description:
1. It includes various models of linear regression like ordinary least squares, generalized
least squares, weighted least squares, etc.
2. It provides some efficient functions for time series analysis.
3. It also has some datasets for examples and testing.
4. Models based on survival analysis are also available.
5. All the statistical tests that we can imagine for data on a large scale are present.
Installing Statsmodels
Type 'Command Prompt' on the taskbar's search pane and you'll see its icon. Click on it to open
the command prompt.
Also, you can directly click on its icon if it is pinned on the taskbar.
Installation of statsmodels
Now, to install statsmodels on our system, open the Command Prompt, type the
following command and press 'Enter':
pip install statsmodels
Output
Here, we will perform OLS (Ordinary Least Squares) regression. In this technique we try to
minimize the net sum of squares of the differences between the calculated values and the observed
values.
Program
import statsmodels.api as sm
import pandas
df = sm.datasets.get_rdataset("Guerry", "HistData").data
# select a subset of the columns (this column list follows the standard
# statsmodels Guerry example)
vars = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[vars]
df[-5:]
OUTPUT
Result:
Thus the Statsmodels package is downloaded, installed and the features are explored.
Aim :
Write a python program to show the working of NumPy arrays in Python.
2a) Use NumPy array to demonstrate basic array characteristics
b) Create NumPy array using list and tuple
c) Apply basic operations (+, -, *, /) and find the transpose of the matrix
d) Perform sorting operations with NumPy arrays
Problem Description
Arrays in NumPy: NumPy’s main object is the homogeneous multidimensional array.
• It is a table of elements (usually numbers), all of the same type, indexed by a tuple of
positive integers.
• In NumPy dimensions are called axes. The number of axes is rank.
• NumPy’s array class is called ndarray. It is also known by the alias array.
Example 1:
Write a python program to demonstrate the basic NumPy array
characteristics
import numpy as np
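# a sample array whose properties match the output below (a reconstruction)
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Array is of type:", type(arr))
print("No. of dimensions:", arr.ndim)
print("Shape of array:", arr.shape)
print("Size of array:", arr.size)
print("Array stores elements of type:", arr.dtype)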
Output :
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
2. Array creation:
Example 2:
import numpy as np
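# arrays reconstructed to match the output below
a = np.array([[1, 2, 4], [5, 8, 7]], dtype='float')
print("Array created using passed list:\n", a)
b = np.zeros((2, 4))
print("An array initialized with all zeros:\n", b)
c = np.random.random((2, 2))
print("A random array:\n", c)
# a 3x4 array to reshape
arr = np.array([[1, 2, 3, 4], [5, 2, 4, 2], [1, 2, 0, 1]])
print("Original array:\n", arr)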
newarr = arr.reshape(2, 2, 3)
print("Reshaped array:\n", newarr)
# Flatten array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array:\n", arr)
flarr = arr.flatten()
print("Flattened array:\n", flarr)
OUTPUT
Array created using passed list:
[[ 1. 2. 4.]
[ 5. 8. 7.]]
An array initialized with all zeros:
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
A random array:
[[ 0.46829566 0.67079389]
[ 0.09079849 0.95410464]]
Original array:
[[1 2 3 4]
[5 2 4 2]
[1 2 0 1]]
Reshaped array:
[[[1 2 3]
[4 5 2]]
[[4 2 1]
[2 0 1]]]
Original array:
[[1 2 3]
[4 5 6]]
Flattened array:
[1 2 3 4 5 6]
3. Basic operations:
Program 3:
import numpy as np
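# two 2x3 input arrays (values reconstructed from the output below)
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[7, 8, 9], [10, 11, 12]])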
print("Addition")
print(array1 + array2)
print("-" * 20)
print("Subtraction")
print(array1 - array2)
print("-" * 20)
print("Multiplication")
print(array1 * array2)
print("-" * 20)
print("Division")
print(array2 / array1)
print("-" * 40)
print(array1 ** array2)
print("-" * 40)
a = np.array([1, 2, 5, 3])
print ("Adding 1 to every element:", a+1)
print ("Subtracting 3 from each element:", a-3)
print ("Multiplying each element by 10:", a*10)
print ("Squaring each element:", a**2)
a *= 2
print ("Doubled each element of original array:", a)
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print ("\nOriginal array:\n", a)
print ("Transpose of array:\n", a.T)
Output
Addition
[[ 8 10 12]
[14 16 18]]
Subtraction
[[-6 -6 -6]
[-6 -6 -6]]
Multiplication
[[ 7 16 27]
[40 55 72]]
Division
[[7. 4. 3. ]
[2.5 2.2 2. ]]
[[ 1 256 19683]
[ 1048576 48828125 -2118184960]]
Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:
[[1 3 9]
[2 4 6]
[3 5 0]]
4. Sorting array: There is a simple np.sort method for sorting NumPy arrays. Let’s explore it
a bit.
Program 4:
import numpy as np
a = np.array([[1, 4, 2],
              [3, 4, 6],
              [0, -1, 5]])
# sorted array (flattened)
print ("Array elements in sorted order:\n",
       np.sort(a, axis = None))
# sort each row of the array
print ("Row-wise sorted array:\n",
       np.sort(a, axis = 1))
# sort each column with merge sort
print ("Column wise sort by applying merge-sort:\n",
       np.sort(a, axis = 0, kind = 'mergesort'))
# Creating a structured array to sort by name (the field values are illustrative)
dtypes = [('name', 'S10'), ('grad_year', int), ('cgpa', float)]
values = [('Hrithik', 2009, 8.5), ('Ajay', 2008, 8.7), ('Pankaj', 2008, 7.9)]
arr = np.array(values, dtype = dtypes)
print ("\nArray sorted by names:\n",
       np.sort(arr, order = 'name'))
OUTPUT
Array elements in sorted order:
[-1 0 1 2 3 4 4 5 6]
Row-wise sorted array:
[[ 1 2 4]
[ 3 4 6]
[-1 0 5]]
Column wise sort by applying merge-sort:
[[ 0 -1 2]
[ 1 4 5]
[ 3 4 6]]
Result
Thus the python programs are written and executed to explain the features of NumPy
arrays.
Aim:
Write a python program to work with Pandas data frames.
Pandas
Pandas is an open-source library that is built on top of the NumPy library. It is a Python
package that offers various data structures and operations for manipulating numerical data and
time series. It is popular mainly because it makes importing and analyzing data much easier.
Pandas is fast and offers high performance and productivity for its users.
Pandas DataFrame
In the real world, a Pandas DataFrame will be created by loading the datasets from
existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas DataFrame
can be created from the lists, dictionary, and from a list of dictionary etc.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). A data frame is a two-dimensional data structure,
i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of
three principal components: the data, the rows, and the columns.
To create a dataframe from a dict of ndarrays/lists, all of the ndarrays must be of the same
length. If an index is passed, then the length of the index should be equal to the length of the
arrays. If no index is passed, then by default the index will be range(n), where n is the array
length.
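For instance, a minimal sketch (the column names here are illustrative):
import pandas as pd
import numpy as np
# dict of ndarrays -- both arrays must be of the same length
data = {'one': np.array([1, 2, 3]), 'two': np.array([4, 5, 6])}
# no index passed: the index defaults to range(3)
print(pd.DataFrame(data))
# explicit index: its length must equal the array length
print(pd.DataFrame(data, index=['a', 'b', 'c']))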
Iterating over rows:
In order to iterate over rows, we can use three functions: iteritems(), iterrows() and
itertuples(). These three functions will help in iterating over rows.
Program
import pandas as pd
# Create an empty dataframe
df = pd.DataFrame()
print("Empty dataframe")
print(df)
print("Create dataframe from dictionary of lists")
# dictionary of lists
dict = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'Degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'Score': [90, 40, 80, 98]}
df = pd.DataFrame(dict)
# iterate over rows using iterrows()
for index, row in df.iterrows():
    print(index, row)
OUTPUT
Empty dataframe
Empty DataFrame
Columns: []
Index: []
0 name aparna
Degree MBA
Score 90
Name: 0, dtype: object
1 name pankaj
Degree BCA
Score 40
Name: 1, dtype: object
2 name sudhir
Degree M.Tech
Score 80
Name: 2, dtype: object
3 name Geeku
Degree MBA
Score 98
Name: 3, dtype: object
In[1]:
import pandas as pd
url =
'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'
df = pd.read_excel(url)
df
OUTPUT
data = pd.read_csv(r'C:\Users\HI\Downloads\PythonDataScienceHandbook-master\notebooks\data\iris.csv')
df = pd.DataFrame(data)
print (df)
Out[2]:
Result:
Thus the python program is written to show the working of Pandas DataFrames.
Ex.No.4 READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
Aim:
Reading data from text files, excel and the web and exploring various commands for
doing descriptive analytics on the Iris data set.
Exploratory Data Analysis (EDA) is a technique to analyze data using some visual
techniques. With this technique, we can get detailed information about the statistical summary
of the data. We will also be able to deal with duplicate values and outliers, and see some
trends or patterns present in the dataset.
Now let’s see a brief about the Iris dataset.
Iris Dataset
If you are from a data science background, you must already be familiar with the Iris dataset. If
you are not, don't worry, we will discuss it here.
The Iris dataset is considered the "Hello World" of data science. It contains five columns,
namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant; researchers have measured various features of the different iris flowers and
recorded them digitally.
4A). Aim:
Reading data from Text file and exploring various commands for doing
descriptive analytics on the Iris data set.
Seaborn Package:
Seaborn has many of its own high-level plotting routines, but it can also overwrite
Matplotlib's default parameters and in turn get even simple Matplotlib scripts to produce vastly
superior output. We can set the style by calling Seaborn's set() method. By convention, Seaborn
is imported as sns:
Step 1:
The Seaborn package is installed by typing the following command in the command prompt:
pip install seaborn
Step 2:
After the successful installation of the seaborn package, launch Jupyter using the jupyter
notebook command. Type the following program, giving the correct path name for the iris dataset.
The Iris dataset lists measurements of petals and sepals of three iris species.
Step 3:
Import Packages
In[1]:
import numpy as np
import pandas as pd # package for working with data frames in python
import seaborn as sns # package for visualization (more on seaborn later)
import matplotlib.pyplot as plt
%matplotlib inline
Step 4:
Import iris dataset
In[2]:
iris = sns.load_dataset('iris')
my_data_frame = pd.DataFrame(iris)
my_data_frame.head()
OUTPUT
In[3]:
p=plt.hist(my_data_frame.sepal_length)
OUTPUT
In[4]:
g = sns.pairplot(my_data_frame)
OUTPUT
In[5]:
OUTPUT:
In [6]:
g = sns.pairplot(iris, height=3, vars=["sepal_width", "sepal_length"], \
markers=["o", "s", "D"], hue="species")
OUTPUT
In [7]:
OUTPUT
In[8]:
sns.set(style="ticks", color_codes=True) # change style
g = sns.pairplot(iris, hue="species")
Result:
Thus reading data from Text file and exploring various commands for doing descriptive
analytics on the Iris data set is executed.
4 B). Aim:
Reading data from web and exploring various commands for doing descriptive
analytics on the iris dataset.
Step 1:
Download the Iris dataset from the UCI machine learning repository by providing the
corresponding URL.
In [1]:
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
data.head()
Out[1]:
Step 2:
For each quantitative attribute, calculate its average, standard deviation, minimum, and
maximum values.
In [2]:
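A minimal sketch for this cell, computing each statistic from the column names assigned above:
for col in ['sepal length', 'sepal width', 'petal length', 'petal width']:
    print(col + ':')
    print('Mean = %.2f' % data[col].mean())
    print('Standard deviation = %.2f' % data[col].std())
    print('Minimum = %.2f' % data[col].min())
    print('Maximum = %.2f' % data[col].max())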
Out[2]:
sepal length:
Mean = 5.84
Standard deviation = 0.83
Minimum = 4.30
Maximum = 7.90
sepal width:
Mean = 3.05
Standard deviation = 0.43
Minimum = 2.00
Maximum = 4.40
petal length:
Mean = 3.76
Standard deviation = 1.76
Minimum = 1.00
Maximum = 6.90
petal width:
Mean = 1.20
Standard deviation = 0.76
Minimum = 0.10
Maximum = 2.50
Step 3:
For the qualitative attribute (class), count the frequency for each of its distinct values.
In [3]:
data['class'].value_counts()
Out [3]:
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: class, dtype: int64
Step 4:
It is also possible to display the summary for all the attributes simultaneously in a table
using the describe() function. If an attribute is quantitative, it will display its mean, standard
deviation and various quantiles (including minimum, median, and maximum) values. If an
attribute is qualitative, it will display its number of unique values and the top (most frequent)
values.
In [4]:
data.describe(include='all')
Out [4]:
Step 5:
For multivariate statistics, you can compute the covariance and correlation between pairs of
attributes.
In [5]:
print('Covariance:')
data.cov()
Out[5]:
In [6]:
print('Correlation:')
data.corr()
Out[6]:
Result:
Thus the reading data from web and exploring various commands for doing descriptive
analytics on the iris dataset is executed.
4C). Aim:
Reading data from Excel file and exploring various commands for doing
descriptive analytics on the Iris data set.
You can install the required modules using pip. Open your command line program and
execute the command pip install <module name> to install a module. You should
replace <module name> with the actual name of the module you are trying to install. For
example, to install pandas, you would execute the command pip install pandas.
Step 1:
Create the Excel workbook (dept.xlsx) that holds the data; in this example the workbook
contains two sheets, Sheet 1 and Sheet 2.
Step 2:
Now we can import the excel file using the read_excel function in pandas, as shown below:
# place "r" before the path string to handle special characters, such as '\'. Don't forget to put
the file name, including the '.xlsx' extension, at the end of the path
In [1]:
import pandas as pd
df = pd.read_excel (r'C:\Users\HI\Downloads\dept.xlsx')
print (df)
Out [1]:
Step 3:
The second statement reads the data from excel and stores it into a pandas Data Frame
which is represented by the variable newData. If there are multiple sheets in the excel
workbook, the command will import data of the first sheet. To make a data frame with all the
sheets in the workbook, the easiest method is to create different data frames separately and
then concatenate them.
The read_excel method takes argument sheet_name and index_col where we can
specify the sheet of which the data frame should be made of and specifies the title column.
In [2]:
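A minimal sketch for this cell (the sheet name and the title column are assumptions):
newData = pd.read_excel(r'C:\Users\HI\Downloads\dept.xlsx',
                        sheet_name='Sheet1', index_col=0)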
Step 4:
To view 5 rows from the top and from the bottom of the data frame, we can run the
commands
In [3] :
newData.tail()
Out[3]:
In [4] :
newData.head()
Out[4]:
Step 5:
If any column contains numerical data, we can sort that column using
the sort_values() method in pandas as follows:
In[5]:
sorted_column = newData.sort_values(['Weight'], ascending = True)
sorted_column.head(5)
Out[5]:
Step 6:
Our data is mostly numerical. We can get the statistical information like mean, max,
min, etc. about the data frame using the describe() method as shown below:
In [6]:
newData.describe()
Out [6]:
Result:
Thus reading data from Excel file and exploring various commands for doing
descriptive analytics has been executed.
Ex.No.5 USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES
Aim:
Use the data set from UCI and Pima Indians diabetes and find the diabetic patients.
Coding:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
dataset=pd.read_csv("E:\\ds\\diabetes.csv")
dataset.head()
Output:
dataset.shape
output:
dataset.describe()
output:
sns.countplot(x='Outcome',data=dataset)
output:
#
dataset['Outcome'].value_counts()
OUTPUT
corr_mat=dataset.corr()
sns.heatmap(corr_mat,annot=True)
output:
Data cleaning
#check any null or empty data is present in the dataset
dataset.isna().sum()
output:
Feature matrix: taking all our independent columns into a single array and the dependent
values into another array
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
x.shape
output
#
x[0]
OUTPUT
#
y
Output
array([1, 0, 1, ..., 0, 1, 0], dtype=int64)
#
fig = plt.figure(figsize=(16,6))
sns.distplot(dataset["Glucose"][dataset["Outcome"]==1])
OUTPUT
#
fig = plt.figure(figsize=(16,6))
sns.distplot(dataset["Insulin"][dataset["Outcome"]==1])
plt.xticks()
plt.title("Insulin",fontsize=20)
OUTPUT
#
fig = plt.figure(figsize=(16,6))
sns.distplot(dataset["BMI"][dataset["Outcome"]==1])
plt.xticks()
plt.title("BMI",fontsize=20)
OUTPUT
#
fig = plt.figure(figsize=(16,5))
sns.distplot(dataset["DiabetesPedigreeFunction"][dataset["Outcome"]==1])
plt.xticks([i*0.15 for i in range(1,12)])
plt.title("diabetespedigreefunction")
Output
#
fig = plt.figure(figsize=(16,6))
sns.distplot(dataset["Age"][dataset["Outcome"]==1])
plt.xticks([i*0.15 for i in range(1,12)])
plt.title("Age")
Output
#
x = dataset.drop(["Pregnancies","BloodPressure","SkinThickness","Outcome"],axis = 1)
y = dataset.iloc[:,-1]
#
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2, random_state=0)
#
x_train.shape
Output
#
x_test.shape
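The model-building step does not appear above; a minimal sketch, assuming standard scaling and a
support vector classifier, which defines the sc and svc objects pickled below:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# scale the features, then fit the classifier
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
svc = SVC(kernel='linear')
svc.fit(x_train, y_train)
print("Test accuracy:", svc.score(x_test, y_test))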
import pickle
pickle.dump(svc, open('classifier.pkl', 'wb'))
pickle.dump(sc, open('sc.pkl', 'wb'))
Output:
Result:
Thus the data set from UCI and Pima Indians diabetes is used and the diabetic patients are found.
Aim:
To apply and explore various plotting functions on UCI data sets.
A. Normal curves
B. Density and contour plots
C. Correlation and scatter plots
D. Histograms
E. Three-dimensional plotting
Aim:
To apply and explore the probability density function on normal curves.
Normal Distribution is a probability function used in statistics that tells how the data
values are distributed. It is the most important probability distribution function used in
statistics because of its advantages in real case scenarios, for example the height of a
population, shoe size, IQ level, the roll of a die, and many more.
The probability density function of the normal or Gaussian distribution is given by:
f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation.
Modules Needed
• Matplotlib is Python's plotting library, widely used for data visualization.
• Numpy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the
fundamental package for scientific computing with Python.
• Scipy is a python library that is useful in solving many mathematical equations and
algorithms.
• Statistics module provides functions for calculating mathematical statistics of numeric
data.
Functions used
• mean(data) computes the mean of the given data.
• stdev(data) computes the standard deviation of the given data.
• To calculate the normal probability density of the data, norm.pdf is used. It refers to the
normal probability density function in the scipy library, which uses the above probability
density function to calculate the value.
Syntax:
norm.pdf(Data, loc, scale)
Here, loc parameter is also known as the mean and the scale parameter is also known as
standard deviation.
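For instance, a quick check of the density of a standard normal at its mean:
from scipy.stats import norm
# density at x = 0 with loc (mean) 0 and scale (standard deviation) 1: about 0.3989
print(norm.pdf(0, loc=0, scale=1))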
Approach
• Import module
• Create data
• Calculate mean and deviation
• Calculate normal probability density
• Plot using above calculated values
• Display plot
Implementation
Step 1: Draw a normal Curve
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import statistics
Step 2:
Find the probability density function
In [2]:
# import required libraries
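import seaborn as sb
# the data and its density are not given here; a sketch with assumed sample values
data = np.arange(1, 10, 0.01)
# calculating mean and standard deviation
mean = statistics.mean(data)
sd = statistics.stdev(data)
# calculating the normal probability density
pdf = norm.pdf(data, loc=mean, scale=sd)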
sb.set_style('whitegrid')
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
Out [2]:
Result:
Thus the probability density function on normal curves is applied and executed.
Aim:
To apply and explore Density and Contour Plotting function on UCI Datasets
In [1] :
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
In [2]:
def f(x, y):
return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
In [3]:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
In [4]:
plt.contour(X, Y, Z, colors='black');
OUTPUT
Notice that by default when a single color is used, negative values are represented by dashed
lines, and positive values by solid lines.
Alternatively, the lines can be color-coded by specifying a colormap with the cmap argument.
Here, we'll also specify that we want more lines to be drawn—20 equally spaced intervals within
the data range:
In [5]:
plt.contour(X, Y, Z, 20, cmap='RdGy');
OUTPUT
The spaces between the lines may be a bit distracting. We can change this by switching to a
filled contour plot using the plt.contourf() function (notice the f at the end), which uses largely
the same syntax as plt.contour().
Add a plt.colorbar() command, which automatically creates an additional axis with labeled
color information for the plot:
In [6]:
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
OUTPUT
The colorbar makes it clear that the black regions are "peaks," while the red regions are
"valleys."
One potential issue with this plot is that it is a bit "splotchy." That is, the color steps are discrete
rather than continuous, which is not always what is desired.
This could be remedied by setting the number of contours to a very high number, but this results
in a rather inefficient plot:
Matplotlib must render a new polygon for each step in the level. A better way to handle this is to
use the plt.imshow() function, which interprets a two-dimensional grid of data as an image.
In [7]:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',cmap='RdGy')
plt.colorbar()
plt.axis(aspect='image');
OUTPUT
• plt.imshow() will automatically adjust the axis aspect ratio to match the input data; this
can be changed by setting, for example, plt.axis(aspect='image') to make x and y units
match.
• Finally, it can sometimes be useful to combine contour plots and image plots. For
example, here we'll use a partially transparent background image (with transparency set
via the alpha parameter) and overplot contours with labels on the contours themselves
(using the plt.clabel() function):
In [8]:
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
OUTPUT
Result:
Thus the density and contour plotting functions on UCI Datasets are applied and executed.
Aim:
To apply and explore Correlation and Scatterplots function on UCI Datasets
i) Simple Scatter Plots
A commonly used plot type is the simple scatter plot, a close cousin of the
line plot. Instead of points being joined by line segments, here the points are represented
individually with a dot, circle, or other shape. We'll start by setting up the notebook for
plotting and importing the functions we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
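The cell that produces the first plot is not included; a minimal sketch (the data values are an
assumption) that also defines the x and y reused by the cells below:
In [2]:
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');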
OUTPUT
In [3]:
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker,
label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
OUTPUT
For even more possibilities, these character codes can be used together with line and color codes
to plot points along with a line connecting them:
In [4]:
plt.plot(x, y, '-ok');
OUTPUT
Additional keyword arguments to plt.plot specify a wide range of properties of the lines and
markers:
In [5]:
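# the cell body is missing; a sketch assuming the marker-styling keywords
# described in the surrounding text
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);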
OUTPUT
This type of flexibility in the plt.plot function allows for a wide variety of possible visualization
options. For a full description of the options available, refer to the plt.plot documentation.
In [6]:
plt.scatter(x, y, marker='o');
OUTPUT
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots
where the properties of each individual point (size, face color, edge color, etc.) can be
individually controlled or mapped to data.
Let's show this by creating a random scatter plot with points of many colors and sizes. In order to
better see the overlapping results, we'll also use the alpha keyword to adjust the transparency
level:
In [7]:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
# the scatter call below completes the cell to match the description that follows
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar();  # show color scale
OUTPUT
Notice that the color argument is automatically mapped to a color scale (shown here by
the colorbar() command), and that the size argument is given in pixels. In this way, the color
and size of points can be used to convey information in the visualization, in order to visualize
multidimensional data.
For example, we might use the Iris data from Scikit-Learn, where each sample is one of three
types of flowers that has had the size of its petals and sepals carefully measured:
For running this module, the scikit-learn package should be installed. The command used is
python -m pip install scikit-learn
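A minimal sketch of the Iris scatter plot described below, using the standard scikit-learn loader:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
plt.scatter(features[0], features[1], alpha=0.2,
            s=100 * features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);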
OUTPUT
We can see that this scatter plot has given us the ability to simultaneously explore four different
dimensions of the data: the (x, y) location of each point corresponds to the sepal length and
width, the size of the point is related to the petal width, and the color is related to the particular
species of flower. Multicolor and multifeature scatter plots like this can be useful for both
exploration and presentation of data.
Aside from the different features available in plt.plot and plt.scatter, why might you choose to
use one over the other? While it doesn't matter as much for small amounts of data, as datasets get
larger than a few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The
reason is that plt.scatter has the capability to render a different size and/or color for each point,
so the renderer must do the extra work of constructing each point individually. In plt.plot, on the
other hand, the points are always essentially clones of each other, so the work of determining the
appearance of the points is done only once for the entire set of data. For large datasets, the
difference between these two can lead to vastly different performance, and for this
reason, plt.plot should be preferred over plt.scatter for large datasets.
Result:
Thus the Correlation and Scatterplots function on UCI Datasets are executed.
6 D). Histograms
Aim:
To apply and explore Histograms function on UCI Datasets
Preliminaries
A simple histogram can be a great first step in understanding a dataset. Earlier, we saw a
preview of Matplotlib's histogram function, which creates a basic histogram in one line, once the
normal boiler-plate imports are done:
Program
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
In [2]:
plt.hist(data);
OUTPUT
The hist() function has many options to tune both the calculation and the display; here's an
example of a more customized histogram:
In [3]:
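# the cell body is missing; a sketch of a customized histogram
# (the option values are assumptions)
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');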
OUTPUT
The plt.hist docstring has more information on other customization options available. I find this
combination of histtype='stepfilled' along with some transparency alpha to be very useful when
comparing histograms of several distributions:
In [4]:
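# x1, x2, x3 and the shared style dict are not defined in the source; a sketch
# with assumed distribution parameters
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)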
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
OUTPUT
If you would like to simply compute the histogram (that is, count the number of points in a given
bin) and not display it, the np.histogram() function is available:
In [5]:
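# the cell body is missing; a sketch counting the points in five bins
counts, bin_edges = np.histogram(data, bins=5)
print(counts)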
OUTPUT
In [6]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
In [7]:
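# the cell body is missing; a sketch using the x, y samples drawn above
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')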
OUTPUT
Just as with plt.hist, plt.hist2d has a number of extra options to fine-tune the plot and the
binning, which are nicely outlined in the function docstring. Further, just as plt.hist has a
counterpart in np.histogram, plt.hist2d has a counterpart in np.histogram2d, which can be
used as follows:
In [8]:
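# the cell body is missing; the NumPy counterpart of plt.hist2d
counts, xedges, yedges = np.histogram2d(x, y, bins=30)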
For the generalization of this histogram binning in dimensions higher than two, see
the np.histogramdd function.
The two-dimensional histogram creates a tessellation of squares across the axes. Another natural
shape for such a tessellation is the regular hexagon.
For this purpose, Matplotlib provides the plt.hexbin routine, which represents a two-dimensional
dataset binned within a grid of hexagons:
In [9]:
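# the cell body is missing; a sketch of the hexagonal binning
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')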
OUTPUT
plt.hexbin has a number of interesting options, including the ability to specify weights for each
point, and to change the output in each bin to any NumPy aggregate (mean of weights, standard
deviation of weights, etc.).
Result:
Thus the Histograms function on UCI Datasets is applied and explored.
Aim:
To apply and explore Three Dimensional plotting function on UCI Datasets
Program
In [1]:
from mpl_toolkits import mplot3d
Once this submodule is imported, a three-dimensional axes can be created by passing the
keyword projection='3d' to any of the normal axes creation routines:
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [3]:
fig = plt.figure()
ax = plt.axes(projection='3d')
OUTPUT
The most basic three-dimensional plot is a line or a collection of scatter points created from
sets of (x, y, z) triples. In analogy with the more common two-dimensional plots discussed
earlier, these can be created using the ax.plot3D and ax.scatter3D functions. The call signature
for these is nearly identical to that of their two-dimensional counterparts, so you can refer
to Simple Line Plots and Simple Scatter Plots for more information on controlling the output.
Here we'll plot a trigonometric spiral, along with some points drawn randomly near the line:
In [4]:
ax = plt.axes(projection='3d')
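# the rest of the cell is missing in the source; a sketch of a trigonometric
# spiral with random points drawn near it, as described above
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');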
OUTPUT
In [5]:
# the surface function f is not defined in this section; a common choice (an assumption)
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
In [6]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
OUTPUT
In [7]:
ax.view_init(60, 35)
fig
OUTPUT
Gridded data can also be shown as wireframe and surface plots, which can make the resulting
three-dimensional forms quite easy to visualize. Here's an example of using a wireframe:
In [7]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
OUTPUT
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding
a colormap to the filled polygons can aid perception of the topology of the surface being
visualized:
In [8]:
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none')
ax.set_title('surface');
OUTPUT
Surface Triangulations
For some applications, the evenly sampled grids required by the above routines are overly
restrictive and inconvenient. In these situations, triangulation-based plots can be very useful.
What if rather than an even draw from a Cartesian or a polar grid, we instead have a set of
random draws?
In [9]:
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
We could create a scatter plot of the points to get an idea of the surface we're sampling from:
In [10]:
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);
OUTPUT
This leaves a lot to be desired. The function that will help us in this case is ax.plot_trisurf,
which creates a surface by first finding a set of triangles formed between adjacent points
(remember that x, y, and z here are one-dimensional arrays):
In [11]:
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z,
cmap='viridis', edgecolor='none');
OUTPUT
Result:
Thus the Three-dimensional plotting function on UCI Datasets is applied and explored.
Aim:
To visualize geographic data with Basemap.
Basemap
Basemap is a great tool for creating maps using Python in a simple way. It is
a Matplotlib extension, so it has all of Matplotlib's features for creating data visualizations,
and it adds geographical projections and some datasets for plotting coastlines, countries, and
so on directly from the library.
Installation
Step 1: Use the Anaconda distribution to install basemap. Go to Start and click Anaconda
Command Prompt.
Step 2: Before installing Basemap, be sure to install pillow package. Install the pillow package
using the command line
pip install pillow
Step 3: Next step is to install the Basemap using the following command
pip install basemap
The anaconda command prompt will look like
Step 4: After successfully installing the basemap package, navigate to Jupyter Notebook using the
following command
jupyter notebook
Step 5: The above command will open a new webpage with the address http://localhost:8888/tree.
Step 6: Click New → Python 3 (ipykernel).
Step 7: Start the program for visualizing geographical data using Basemap. Click Run to run the
program.
PROGRAM
2 a. Create a map centered on North America with lines showing the country and state
boundaries as well as rivers:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(num=None, figsize=(12, 8))
m = Basemap(width=6000000, height=4500000, resolution='c', projection='aea',
            lat_1=35., lat_2=45, lon_0=-100, lat_0=40)
m.drawcoastlines(linewidth=0.5)
m.fillcontinents(color='tan',lake_color='lightblue')
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,91.,15.),labels=[True,True,False,False],dashes=[2,2])
m.drawmeridians(np.arange(-180.,181.,15.),labels=[False,False,False,True],dashes=[2,2])
m.drawmapboundary(fill_color='lightblue')
m.drawcountries(linewidth=2, linestyle='solid', color='k' )
m.drawstates(linewidth=0.5, linestyle='solid', color='k')
m.drawrivers(linewidth=0.5, linestyle='solid', color='blue')
2b. Use a different map projection, zoom-in to North America and plot the location of
Seattle
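The code for this part is not included; a minimal sketch, assuming a Lambert conformal projection
with an etopo relief background and Seattle at longitude -122.3, latitude 47.6:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)
# map (longitude, latitude) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);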
Output
2. Map Projections
The Basemap package implements several dozen such projections, all referenced
by a short format code. Here we'll briefly demonstrate some of the more common ones.
We'll start by defining a convenience routine to draw our world map along with
the longitude and latitude lines:
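The routine itself is not printed in the source; a sketch, assuming a shaded-relief background,
that defines the draw_map used by the cells below:
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')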
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant
latitude and longitude are mapped to horizontal and vertical lines, respectively. This type
of mapping represents equatorial regions quite well, but results in extreme distortions
near the poles. The spacing of latitude lines varies between different cylindrical
projections, leading to different conservation properties, and different distortion near the
poles. In the following figure we show an example of the equidistant cylindrical
projection, which chooses a latitude scaling that preserves distances along meridians.
Other cylindrical projections are the Mercator (projection='merc') and the cylindrical
equal area (projection='cea') projections.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
OUTPUT
Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can lead to
very good local properties, but regions far from the focus point of the cone may become very
distorted. One example of this is the Lambert Conformal Conic projection (projection='lcc'),
which we saw earlier in the map of North America. It projects the map onto a cone arranged in
such a way that two standard parallels (specified in Basemap by lat_1 and lat_2) have well-
represented distances, with scale decreasing between them and increasing outside of them. Other
useful conic projections are the equidistant conic projection (projection='eqdc') and the Albers
equal-area projection (projection='aea'). Conic projections, like perspective projections, tend to
be good choices for representing small to medium patches of the globe.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)
OUTPUT
Result:
Thus the geographic data is visualized with Basemap.