Foundation of Data Science Lab Manual
AIM
To download and install the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
INTRODUCTION
Python is a high-level, general-purpose programming language. One of the important features that makes Python a strong programming language is its vast collection of packages, which includes data science and machine learning packages. A lot of external packages are written in Python and can be installed and used depending upon the requirement.
Some important packages are:
1. NumPy
2. SciPy
3. Jupyter
4. Statsmodels
5. Pandas
1. NUMPY
NumPy is the fundamental package for array computing with Python. It can also be
used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be
defined. This allows it to seamlessly and speedily integrate with a wide variety of databases.
It provides:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Installing Numpy on Windows:
a. For Conda Users:
conda install numpy
b. For PIP Users:
pip install numpy
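A quick check (a minimal sketch) to confirm the installation from the Python shell:
import numpy as np
print(np.__version__)
print(np.arange(6).reshape(2, 3))   # small test array to confirm NumPy works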
2. SCIPY
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and
engineering. The SciPy library depends on NumPy, which provides convenient and fast N-
dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and
provides many user-friendly and efficient numerical routines such as routines for numerical
integration and optimization.
It is designed on top of the NumPy library and extends it with routines for scientific and mathematical computations such as matrix rank, matrix inverse, polynomial equations, LU decomposition, etc. Using its high-level functions significantly reduces the complexity of the code and helps in analysing the data better.
Installing Scipy on Windows:
a. For Conda Users:
conda install scipy
b. For PIP Users:
pip install scipy
Verifying the SciPy installation through the Python shell:
import scipy
scipy.__version__
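A short illustrative sketch of the linear algebra routines mentioned above (matrix rank, inverse and LU decomposition); the matrix used here is only an example:
import numpy as np
from scipy import linalg
A = np.array([[4.0, 3.0], [6.0, 3.0]])
print(np.linalg.matrix_rank(A))   # rank of the matrix
print(linalg.inv(A))              # matrix inverse
P, L, U = linalg.lu(A)            # LU decomposition
print(L)
print(U)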
3. JUPYTER
Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include
data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
Advantages of a Jupyter Notebook
Notebook has the ability to re-run individual code snippets, and it provides you the
flexibility of modifying them before re-running.
You can deploy a Jupyter Notebook on a remote server and access it from your local
web browser.
You can add notes and documentation to your code in a Jupyter Notebook in various formats like Markdown, LaTeX, and HTML.
Installing Jupyter Notebook using pip:
pip install jupyter
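After installation, the notebook server can be started from the command prompt; by default it opens in the web browser at http://localhost:8888:
jupyter notebook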
4. STATSMODELS
Statsmodels is a popular library in Python that enables us to estimate and analyze
various statistical models. It is built on numeric and scientific libraries like NumPy and SciPy.
Features
It includes various models of linear regression like ordinary least squares, generalized
least squares, weighted least squares, etc.
It provides some efficient functions for time series analysis.
It also has some datasets for examples and testing.
Models based on survival analysis are also available.
A wide range of statistical tests for large-scale data is also available.
Installing Statsmodels using Anaconda:
open the Anaconda Prompt and type the following command-
conda install -c conda-forge statsmodels
Installing Statsmodels using pip:
To obtain the latest released version of statsmodels using pip:
python -m pip install statsmodels
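A minimal sketch of the ordinary least squares model mentioned in the features above; the data points here are made up purely for illustration:
import numpy as np
import statsmodels.api as sm
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 11.8, 14.2, 15.9])
X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.params)           # intercept and slope
print(model.summary())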
5. PANDAS
Pandas is a Python package written for data analysis and manipulation. Pandas offers various operations and data structures to perform numerical data manipulation and time series analysis. Pandas is an open-source library built on top of NumPy. The Pandas library is known for its high productivity and high performance, and it is popular because it makes importing and analysing data much easier.
Main Features
Easy handling of missing data (represented as NaN, NA or NaT ) in floating point as
well as non-floating point data
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Size mutability: columns can be inserted and deleted from DataFrame and higher
dimensional objects
Time series-specific functionality: date range generation and frequency conversion,
moving window statistics, date shifting and lagging
Install Pandas using pip
pip install pandas
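A quick check of the installation together with a small example of the time series features listed above (date range generation and shifting); the values are made up:
import pandas as pd
print(pd.__version__)
dates = pd.date_range('2023-01-01', periods=4, freq='D')   # date range generation
s = pd.Series([10, 20, 30, 40], index=dates)
print(s.shift(1))   # shifting / lagging the series by one day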
RESULT
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are downloaded
and installed.
AIM
To work with different features provided by Numpy arrays.
Arrays
1. Creating Arrays
import numpy as np
a = np.array(42) #0-D
b = np.array([1, 2, 3, 4, 5]) #1-D
c = np.array([[1, 2, 3], [4, 5, 6]]) # 2-D
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]]) #3-D
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
2. Access Array Elements
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers
representing the dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
3. Array Slicing
Slicing in Python means taking elements from one given index to another given index, optionally with a step value.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
4. Data Types
NumPy has some extra data types and refers to data types with one-character codes. Below is a list of all data types in NumPy and the characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type (void)
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)
5. Array Shape
The shape attribute returns a tuple with the number of elements in each dimension.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
6. Array Reshaping
Reshaping means changing the shape of an array.
The shape of an array is the number of elements in each dimension.
By reshaping we can add or remove dimensions or change number of elements in
each dimension.
Convert the following 1-D array with 12 elements into a 3-D array.
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
7. Array Iterating
Iterating means going through elements one by one.
As we deal with multi-dimensional arrays in NumPy, we can do this using the basic for loop of Python.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)
8. Joining Array
Joining means putting contents of two or more arrays in a single array.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
9. Splitting Array
Splitting breaks one array into multiple.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
RESULT
Thus the programs using NumPy have been successfully executed and verified.
EX.NO: 3 PANDAS DATAFRAME
AIM
To create Pandas DataFrames from different inputs and to work with various DataFrame methods.
INTRODUCTION
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns. In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as an SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc.
Features of DataFrame
Potentially columns are of different types
Size – Mutable
Labeled axes (rows and columns)
Can perform Arithmetic operations on rows and columns
A DataFrame can be created with the pandas.DataFrame() constructor, which takes the following parameters:
Name Description
data Takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
index For the row labels, the index to be used for the resulting frame. Optional; the default is np.arange(n) if no index is passed.
columns For column labels, the optional default is np.arange(n). This is only used if no columns are passed.
dtype Data type of each column.
copy Used for copying the data; the default is False.
Create DataFrame
A pandas DataFrame can be created using various inputs.
Lists
import pandas as pd
nested_list = [[1,2,3],[10,20,30],[100,200,300]]
df = pd.DataFrame(nested_list, columns= ['A','B','C'])
dictionary
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
Series
import pandas as pd
letter = pd.Series(['A', 'B', 'C', 'D', 'E'],name='Name')
df = letter.to_frame()
Numpy ndarrays
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),columns=['a', 'b', 'c'])
Another DataFrame
old_df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})
new_df = old_df[['team']].copy()
print(new_df)
DataFrame Methods:
Function Description
index Attribute that returns the index (row labels) of the DataFrame
insert() Insert column into DataFrame at specified location.
set_index() Set the DataFrame index (row labels) using one or more existing
columns or arrays (of the correct length).
drop() Drop specified labels from rows or columns.
rename() Rename index labels or column names
loc[] Method retrieves rows based on index label
iloc[] Method retrieves rows based on index position
sort_values() Sort by the values along either axis.
PROGRAM
import pandas as pd
dictionaryData = {'Scoville': [50, 5000, 500000],
                  'Name': ["Bell pepper", "Espelette pepper", "Chocolate habanero"],
                  'Feeling': ["Not even spicy", "Uncomfortable", "Practically ate pepper spray"]}
dataFrame = pd.DataFrame(dictionaryData)
dataFrame
#change the indexing
dataFrame2 = dataFrame.set_index('Scoville')
dataFrame2
# Location by label
print(dataFrame.loc[2])
# Location by index
print(dataFrame.iloc[1])
#removing rows
dataFrame.drop(1, inplace=True)
dataFrame
#rename rows
dataFrame.rename({0:"First", 2:"Second"},
inplace=True)
dataFrame
#add a column (one value per remaining row)
dataFrame['Color'] = ['Green', 'Brown']
dataFrame
#sort values
newdf = dataFrame.sort_values(by='Name')
newdf
RESULT
Thus the programs using Pandas DataFrames have been successfully executed and verified.
AIM
Reading data from text files, Excel and the web and exploring various commands for
doing descriptive analytics on the Iris data set.
INTRODUCTION
Exploratory Data Analysis (EDA) is a technique to analyze data using some visual techniques. With this technique, we can get detailed information about the statistical summary of the data. We will also be able to deal with duplicate values and outliers, and also see some trends or patterns present in the dataset.
IRIS DATASET
It includes three iris species with 50 samples each, as well as some properties about
each flower. One flower species is linearly separable from the other two, but the other two are
not linearly separable from each other.
The columns in this dataset are:
Id
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species
shape The shape attribute gives the shape (number of rows and columns) of the dataset.
info() To get the columns and their data types.
describe() Gives a quick statistical summary of the dataset. The describe() function applies basic statistical computations on the dataset like extreme values, count of data points, standard deviation, etc. Any missing value or NaN value is automatically skipped. describe() gives a good picture of the distribution of data.
isnull() Checks whether our data contains any missing values or not. Missing values can occur when no information is provided for one or more items or for a whole unit.
drop_duplicates() Shows whether our dataset contains any duplicates or not and helps in removing duplicates from the data frame.
head() Use the head() method of the data frame to show the first five rows of the data.
The pandas module allows us to load DataFrames from external files and work on them.
The dataset can be a text file, excel file, web reference or a CSV file.
1. Reading data from text files, and exploring various commands for doing descriptive
analytics on the Iris data set.
The following functions can be used to read delimited or fixed-width text files into a DataFrame. The file is returned as a two-dimensional data structure with labeled axes.
a. read_csv()
b. read_table()
c. read_fwf()
import pandas as pd
from pandas.api.types import is_numeric_dtype
df = pd.read_table("E:/data/IRISTestdata.txt",delimiter = ',')
#df = pd.read_csv("E:/data/IRISTestdata.txt")
#df = pd.read_fwf("E:/data/IRISTestdata.txt", delimiter=",")
print(type(df))
print(df)
#set column names of DataFrame
df.columns = ["sepal_length","sepal_width","petal_length","petal_width","target"]
print(df)
df.head()
print(df.shape)
print(df.info())
df.target.replace({"Iris-setosa":"setosa","Iris-versicolor":"versicolor","Iris-
virginica":"virginica"},inplace=True)
print(df)
#the unique values of a column
df.target.unique()
print("EDA")
print(df.describe())
#find the pairwise correlation of all numeric columns
df.corr(numeric_only=True)
#return the count of unique occurrences in this column
df.target.value_counts()
for col in df.columns:
    if is_numeric_dtype(df[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % df[col].mean())
        print('\t Standard deviation = %.2f' % df[col].std())
        print('\t Minimum = %.2f' % df[col].min())
        print('\t Maximum = %.2f' % df[col].max())
2. Reading data from web, and exploring various commands for doing descriptive
analytics on the Iris data set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# download iris data and read it into a dataframe
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'class'])
print(df)
print(df.columns)
#checks for any missing values in the parameters.
print(df.isnull().values.any())
print("visually by plotting a graph by no. of data points
of each class label.")
# First plot
plt.plot(df["class"])
plt.xlabel("No. of data points")
plt.show()
# Second plot
plt.hist(df["class"],color="green")
plt.show()
#Describing numeric columns
print(df.describe())
#This shows the actual duplicate rows
df[df.duplicated()]
# Total no of duplicates in the dataset
df.duplicated().sum()
#calculate median of each species
print(df.groupby('class').median())
3. Reading data from Excel files, and exploring various commands for doing descriptive
analytics on the Iris data set.
import pandas as pd
df = pd.read_excel('E:/data/IrisData.xls')
print (df)
print(df.head())
# it will print the rows from 10 to 20.
print(df[10:21])
print(type(df))
#print the total number of rows and columns of that particular dataset.
print(df.shape)
print(df.info())
print(df.info(verbose = False))
print(df.describe())
print(df.isnull().sum())
print(df.drop_duplicates(subset ="Species_name"))
#it will count number of times a particular species has occurred
print(df.value_counts("Species_name"))
print(df.sample(10))
#Displaying the number of columns and names of the columns.
print(df.columns)
#Extracting minimum and maximum from a column.
min_data=df["Sepal_length"].min()
max_data=df["Sepal_length"].max()
print("Minimum:",min_data, "\nMaximum:", max_data)
#Displaying only specific columns.
specific_data=df[["Id","Species_name"]]
print(specific_data.head(60))
print("Calculating sum, mean and mode of a particular column.")
sum_data = df["Sepal_length"].sum()
mean_data = df["Sepal_length"].mean()
median_data =
df["Sepal_length"].median()
print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data)
RESULT
Thus the various commands for doing descriptive analytics on the Iris data set
are explored.
EX. NO: 5a UNIVARIATE ANALYSIS
AIM
To analyse the various univariate functions like Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis on a dataset like the Pima Indians Diabetes dataset.
Univariate analysis:
The main objective of univariate analysis is to describe the data in order to find patterns in it. It is the simplest form of data analysis. 'Uni' means 'one', and this means that the data has only one kind of variable.
That is, the data contains just one variable and does not deal with a cause-and-effect relationship. Its major purpose is to describe: it takes data, summarizes that data and finds patterns in the data. Patterns found in univariate data include measures of central tendency (mean, mode and median) and of dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), standard deviation, skewness and kurtosis.
The mean() function can be used to calculate mean/average of a given list of numbers.
The median() method calculates the median (middle value) of the given data set.
The mode of a set of data values is the value that appears most often.
The var() method calculates the variance for each column.
Standard deviation std() is a number that describes how spread out the values are.
The skew() method calculates the skew for each column. Skewness refers to a
distortion or asymmetry that deviates from the symmetrical bell curve, or normal
distribution, in a set of data.
PROGRAM
import pandas as pd
from scipy.stats import kurtosis
df = pd.read_csv("https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-
Dataset/master/diabetes.csv")
print (df)
df.dtypes # Get data type for each attribute
df.isnull().sum() # Check for missing values
# Check the average of features grouped by Outcome (Diabetes)
df.groupby('Outcome').mean()
# Getting only the women with Glucose value > 0
df_glucose = df.loc[df['Glucose'] != 0]
#Calculating Mean
df_glucose['Glucose'].mean()
df_glucose.groupby('Outcome').mean()
#Calculating Median
df_glucose['Glucose'].median()
#Calculating Mode
df_glucose = df.loc[df['BloodPressure'] != 0]
df_glucose['BloodPressure'].mode()
df_glucose = df.loc[df['Glucose'] != 0]
df_glucose['Glucose'].mode()
#Calculating Variance
df_glucose['Glucose'].var()
#Calculating Standard Deviation
df_glucose['Glucose'].std()
#Calculating Skew
print("Mean Age: ",df['Age'].mean())
print("Age Skewness: ",df['Age'].skew())
#Calculating Kurtosis
kurtosis(df['Age'],axis=0,bias=True)
OUTPUT
Result
Thus the various univariate functions like Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis on dataset Pima Indian diabetes are successfully
executed.
Ex. No: 5b BIVARIATE ANALYSIS
Aim
To perform bivariate analysis both linear and logistic regression modeling on UCI Pima
Indian diabetes dataset.
Bivariate Analysis
1. Scatterplot
A scatterplot is a type of data display that shows the relationship between two numerical
variables
import pandas as pd
import matplotlib.pyplot as plt
pima = pd.read_csv("https://raw.githubusercontent.com/npradaschnor/Pima-Indians-
Diabetes-Dataset/master/diabetes.csv")
# Diabetes Outcome
res = pima.loc[pima.Outcome==1,:]
# Pregnancies, Age and Diabetes relation
res.plot.scatter('Pregnancies', 'Age')
plt.show()
2. Correlation Coefficient
The correlation coefficient is a statistical measure of the strength of the
relationship between the relative movements of two variables. The values range between -
1.0 and 1.0. Correlation of -1.0 shows a perfect negative correlation, while a correlation of
1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship
between the movement of the two variables.
import numpy as np
# Check only the women that have all the values of BMI, Glucose, Insulin & Blood Pressure
pima_all = pima.loc[(pima['BMI'] != 0) & (pima['Insulin'] != 0) & (pima['BloodPressure'] !=
0) & (pima['Glucose'] != 0)]
age = pima_all['Age']
preg = pima_all['Pregnancies']
# Correlation between the different characteristics; the closer to 1, the stronger the correlation
corr = np.corrcoef(age,preg)
print(corr)
3. Simple Linear Regression
Simple linear regression is a statistical method that we can use to find a relationship
between two variables and make predictions. The two variables used are typically denoted as
y and x. The independent variable, or the variable used to predict the dependent variable is
denoted as x. The dependent variable, or the outcome/output, is denoted as y. A simple linear
regression model will produce a line of best fit, or the regression line.
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
pima = pd.read_csv("https://raw.githubusercontent.com/npradaschnor/Pima-Indians-
Diabetes-Dataset/master/diabetes.csv")
x=pima['Age']
y=pima['Pregnancies']
x_train=x[0:700];x_test=x[700:]
y_train=y[0:700];y_test=y[700:]
# x must have one column
x_train = np.array(x_train).reshape(-1,1)
x_test = np.array(x_test).reshape(-1,1)
# create a linear regression model and fit it
model = LinearRegression().fit(x_train, y_train)
# obtain the coefficient of determination
r_sq = model.score(x_train, y_train)
print("Correlation Coeff : ",r_sq)
# pass the regressor as the argument and get the corresponding predicted response
y_pred = model.predict(x_test)
# Plot outputs
plt.scatter(x_test, y_test, color="black")
plt.plot(x_test, y_pred, color="blue", linewidth=2)
plt.show()
Output
4. Logistic Regression
It is a Machine Learning classification algorithm that is used to predict the probability
of a categorical dependent variable. In logistic regression, the dependent variable is a binary
variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words,
the logistic regression model predicts P(Y=1) as a function of X. Logistic regression requires
quite large sample sizes.
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
pima = pd.read_csv("https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv")
x = pima.drop(['Outcome'], axis = 1)
y = pima.loc[:,"Outcome"].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 123)
logreg = linear_model.LogisticRegression(max_iter=150)
# Fit
logreg.fit(x_train,y_train)
# Predict
predicted = logreg.predict(x_test)
print("Test accuracy: {} ".format(logreg.score(x_test, y_test)))
cf_matrix = confusion_matrix(y_test,predicted)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,fmt='.2%', cmap='winter_r')
plt.show()
Output
RESULT
Thus the bivariate analysis on UCI Pima Indian diabetes dataset was performed and
verified successfully.
AIM
To perform multiple regression analysis on UCI Pima Indian diabetes dataset.
Multiple Regression
Multiple regression is like linear regression, but with more than one independent variable. When one variable/column in a dataset is not sufficient to create a good model and make accurate predictions, we use a multiple linear regression model instead of a simple linear regression model.
The line equation for the multiple linear regression model is:
y = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp + e
Adding more variables isn't always helpful because the model may 'over-fit' and become too complicated. An over-fitted model doesn't generalize to new data; it only works on the training data. We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection, illustrated in the sketch below.
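As a simple illustration of feature selection (a minimal sketch only), one heuristic is to rank the candidate columns by the strength of their correlation with the chosen target; the dataset URL is the one used in the earlier exercises and the target column matches the program below:
import pandas as pd
url = "https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv"
df = pd.read_csv(url)
target = 'Pregnancies'   # same target column as the program below
# absolute correlation of every other column with the target,
# sorted so the strongest candidate features appear first
corr = df.corr(numeric_only=True)[target].drop(target)
print(corr.abs().sort_values(ascending=False))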
PROGRAM
import pandas as pd
from sklearn import linear_model
df = pd.read_csv ('E:\data\diabetes.csv')
feature_columns = ['Glucose', 'Age']
target_column = 'Pregnancies'
x = df[feature_columns]
y = df[target_column]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
predicted_preg = regr.predict([[150, 20]])   # Glucose=150, Age=20
print("Predicted Pregnancies : ", predicted_preg)
OUTPUT
Intercept:
-1.185667850875595
Coefficients:
[-0.00158363 0.15710088]
Predicted Pregnancies : [1.71880494]
RESULT
Thus the multiple regression analysis on UCI Pima Indian diabetes dataset was
performed and verified.
AIM
To explore the plotting of the normal curve on the UCI Iris dataset.
NORMAL DISTRIBUTION
It is a probability function used in statistics that tells how the data values are distributed. It is the most important probability distribution function used in statistics because of its advantages in real case scenarios (e.g., the height of a population, shoe sizes, etc.).
It is generally observed that a data distribution is normal when there is a random collection of data from independent sources. The graph produced by plotting the values of the variable on the x-axis and the count of the values on the y-axis is a bell-shaped curve. The peak of the graph is the mean of the data set; half of the values of the data set lie to the left of the mean and the other half to the right, describing how the values are distributed. The distribution is symmetric.
The probability density function of the normal or Gaussian distribution is given by:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation of the distribution.
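PROGRAM
A minimal sketch, assuming the Iris CSV at E:/data/iris.csv with a SepalLengthCm column (the file and column names used in the later exercises); it overlays the normal curve, computed from the sample mean and standard deviation, on a histogram of sepal length:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.read_csv("E:/data/iris.csv")
data = df['SepalLengthCm']
mu, sigma = data.mean(), data.std()
# histogram of the observed values, normalised to compare with the density curve
plt.hist(data, bins=15, density=True, alpha=0.5, color='skyblue')
# normal curve with the same mean and standard deviation as the data
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color='red')
plt.xlabel('SepalLengthCm')
plt.ylabel('Density')
plt.title('Normal curve on sepal length')
plt.show()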
OUTPUT
RESULT
Thus the plotting of normal curve on Iris dataset was successfully performed and
verified.
Aim
To explore the plotting of density and contour plot on UCI Iris dataset.
Density Plotting
A density plot is a type of data visualization tool. It is a variation of the histogram that uses 'kernel smoothing' while plotting the values. It is a continuous and smooth version of a histogram inferred from the data.
Density plots use Kernel Density Estimation (so they are also known as kernel density estimation or KDE plots), which estimates a probability density function. The region of the plot with a higher peak is the region with the maximum number of data points between those values.
PROGRAM
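A minimal sketch, assuming the Iris CSV at E:/data/iris.csv with the column names used in the other exercises and a recent seaborn (0.11 or later); it draws a kernel density estimate of petal length, one curve per species:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("E:/data/iris.csv")
# kernel density estimate (KDE) of petal length for each species
sns.kdeplot(data=df, x='PetalLengthCm', hue='Species', fill=True)
plt.show()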
OUTPUT
Contour Plots
Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing 3-D plots in 2-D space using contours or color-coded regions. They are used to display the relationship between two independent variables and a dependent variable.
There are 3 Matplotlib functions:
plt.contour for contour plots
plt.contourf for filled contour plots
plt.imshow for showing images
matplotlib.pyplot.contour() is usually useful when Z = f(X, Y), i.e., Z changes as a function of the inputs X and Y. Contour plots are widely used to visualize density, altitude or the heights of mountains, as well as in meteorology. Contour plots require three continuous variables.
Syntax:
X, Y: 2-D numpy arrays with same shape as Z or 1-D arrays such that len(X)==M and
len(Y)==N (where M and N are rows and columns of Z)
Z: The height values over which the contour is drawn. Shape is (M, N)
levels: Determines the number and positions of the contour lines / regions.
A contour plot typically contains the following elements:
1. X and Y-axes denoting values of two continuous independent variables.
2. Coloured bands representing ranges of the continuous dependent (Z) variable.
3. Contour lines connecting points that have the same dependent value.
The x and y values represent positions on the plot, and the z values will be represented by the contour levels. The way to prepare such data is to use the np.meshgrid function, which builds two-dimensional grids from one-dimensional arrays.
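For example, a small illustrative sketch (the input arrays are arbitrary):
import numpy as np
x = np.array([1, 2, 3])
y = np.array([10, 20])
X, Y = np.meshgrid(x, y)
print(X)         # [[1 2 3] [1 2 3]]
print(Y)         # [[10 10 10] [20 20 20]]
print(X.shape)   # (2, 3): len(y) rows, len(x) columns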
PROGRAM
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 4, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# to fill the contour plot use the plt.contourf()
plt.contourf(X, Y, Z, 20, cmap= plt.cm.RdGy)
plt.colorbar()
# combining contour plots and image plots
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 4], origin='lower',
cmap=plt.cm.cividis, alpha=0.2)
plt.colorbar()
RESULT
Thus the plotting of density and contour plot was explored and verified successfully.
Ex.No: 6c CORRELATION AND SCATTER PLOTS
AIM
To explore the plotting of correlation and scatter plots on UCI Iris data set.
INTRODUCTION
Scatter plots and correlation matrices are both tools used in statistics and data analysis
to visually and quantitatively understand the relationships between variables in a dataset. They
help to explore patterns, associations, and potential dependencies between different variables.
Scatter Plot
A scatter plot is a graphical representation that displays individual data points as dots on a two-
dimensional plane. Each data point is represented by a dot at the intersection of its
corresponding values on the two axes. Scatter plots are particularly useful when you want to
visualize the relationship between two continuous variables. One variable is plotted on the
horizontal (x) axis, and the other variable is plotted on the vertical (y) axis.
The pattern of the dots in a scatter plot can indicate the nature of the relationship
between the two variables:
Positive Correlation: If the dots roughly form a line that slopes upwards from left to right, this
indicates a positive correlation. It means that as one variable increases, the other tends to
increase as well.
Negative Correlation: If the dots roughly form a line that slopes downwards from left to right,
this indicates a negative correlation. It means that as one variable increases, the other tends to
decrease.
No Correlation: If the dots are scattered randomly without any clear pattern, this suggests that
there is little to no correlation between the variables.
Correlation Matrix
A correlation matrix is a tabular representation of the correlation coefficients between
multiple variables in a dataset. It provides a numerical measure of the strength and direction of
the linear relationship between pairs of variables. The correlation coefficient, often denoted as
"r," ranges from -1 to 1.
- A correlation coefficient of 1 indicates a perfect positive correlation.
- A correlation coefficient of -1 indicates a perfect negative correlation.
- A correlation coefficient close to 0 indicates little to no correlation.
Both scatter plots and correlation matrices play a crucial role in exploratory data
analysis, hypothesis testing, and model building, as they provide insights into the relationships
between variables, which can inform further analysis and decision-making.
PROGRAM
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Reading the CSV file
df = pd.read_csv("E:\data\iris.csv")
# Check for missing values
df.isnull().sum()
# Scatter plot to find relationship between variables
sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',hue='Species', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
# to find & visualize the pairwise correlation of all columns
sns.heatmap(df.corr(method='pearson', numeric_only=True).drop(['Id'], axis=1).drop(['Id'], axis=0), annot=True)
plt.show()
OUTPUT
RESULT
Thus the plotting of correlation and scatter plots on UCI Iris data set was explored and
verified successfully.
Ex.No: 6d HISTOGRAM
AIM
To explore the plotting of histogram on UCI Iris dataset.
INTRODUCTION
A histogram is a great tool for quickly assessing a probability distribution that is
intuitively understood by almost any audience. It is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval. A histogram is a
mapping of bins (intervals) to frequencies. Histograms allow us to see the distribution of data for various columns. They can be used for univariate as well as bivariate analysis.
PROGRAM
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Reading the CSV file
df = pd.read_csv("E:\data\iris.csv")
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6)
plt.show()
RESULT
Thus the plotting of histogram on UCI Iris dataset was explored and verified
successfully.
AIM
To explore three dimensional plotting on the UCI Iris dataset.
PROGRAM
import pandas as pd
import plotly.express as px
df = pd.read_csv("E:\data\iris.csv")
px.scatter_3d(df, x="PetalLengthCm", y="PetalWidthCm", z="SepalLengthCm",\
size="SepalWidthCm",color="Species", color_discrete_map = {"Iris-setosa": "skyblue",\
"Iris-virginica": "violet", "Iris-versicolor":"pink"}).show()
OUTPUT
RESULT
Thus the three dimensional plotting on UCI Iris dataset was verified successfully.
Ex. No: 7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
AIM
To get the insight of basemap in visualizing geographic data.
INTRODUCTION
Basemap is a toolkit under the Python visualization library Matplotlib under the
namespace mpl_toolkits. Its main function is to draw 2D maps, which are important for
visualizing spatial data.
Installation
Using Conda type,
$ conda install basemap
or in Jupyter Notebook type,
!python -m pip install basemap
PROGRAM
import os
os.environ["PROJ_LIB"] = "C:\\Utilities\\Python\\Anaconda\\Library\\share";
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import requests
from csv import DictReader
import pandas as pd
DATA_URL = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_month.csv'
print("Downloading", DATA_URL)
resp = requests.get(DATA_URL)
quakes = list(DictReader(resp.text.splitlines()))
print(quakes[:2])
quakes1=pd.DataFrame(quakes)
print(quakes1.head(2))
# extract coordinates and magnitudes from the downloaded CSV
# (the USGS feed provides 'latitude', 'longitude' and 'mag' columns)
qLats = quakes1['latitude'].astype(float)
qLngs = quakes1['longitude'].astype(float)
qMags = quakes1['mag'].astype(float)
earth = Basemap()
earth.drawmapboundary(fill_color='skyblue')
earth.fillcontinents(color='coral', lake_color='skyblue')
earth.drawcoastlines(color='#555566', linewidth=1)
plt.scatter(qLngs, qLats, qMags, c='red', alpha=0.5, zorder=10)
plt.xlabel("HISTORY OF EARTHQUAKES")
plt.show()
OUTPUT
RESULT
Thus visualizing of geographic data with basemap was verified successfully.