CS3362 Data Science Lab Manual

This document discusses performing univariate and bivariate analysis on diabetes datasets. It loads diabetes data from UCI and Pima Indians, then performs univariate analysis including frequency, mean, median, mode, variance, standard deviation, skewness and kurtosis. For bivariate analysis, it explores relationships between variables using count plots, distribution plots and box plots in seaborn. The analysis provides descriptive statistics and visualizations to understand the distribution of variables and relationships in the diabetes data.

Uploaded by

Jegatheeswari ic37721

CS3362 Foundations OF DATA Science LAB Manual

Computer Science and Engineering (Anna University)

Studocu is not sponsored or endorsed by any college or university

Downloaded by Jegatheeswari ic37721 ([email protected])

EX.NO.4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:

Aim:
To read data from text files, Excel files and the web using the pandas package.

ALGORITHM:
STEP 1: Start the program
STEP 2: Read data from a CSV file using the pandas package.
STEP 3: Read data from an Excel file using the pandas package.
STEP 4: Read data from an HTML page using the pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT

This notebook is the reference code for data input and output. pandas can read a variety of file
types using its pd.read_ methods. Let's take a look at the most common data types:

import numpy as np
import pandas as pd

CSV

CSV INPUT:
df = pd.read_csv('example')
df

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


CSV OUTPUT:
df.to_csv('example',index=False)
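
A minimal sketch of the CSV round trip described above, assuming a small hand-built frame and a temporary file (the file name `example.csv` and the frame contents are illustrative, not the manual's actual data). It also shows why `index=False` is passed when writing: without it, the row labels are saved as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical 4x4 frame with the same shape as the output shown above.
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
    columns=["a", "b", "c", "d"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name

    # Write without the row labels, then read the file back.
    df.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

    # Write with the default index=True, then read the file back.
    df.to_csv(path)
    with_index = pd.read_csv(path)

print(round_trip.equals(df))      # the frame survives the round trip unchanged
print(list(with_index.columns))   # the saved row labels come back as an extra column
```

If the row labels carry meaning, they can instead be restored on read with `pd.read_csv(path, index_col=0)`.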

EXCEL

pandas can read and write Excel files. Keep in mind that read_excel imports only data, not formulas or
images; a workbook that contains images or macros may cause read_excel to fail.

EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt
run:

pip install lxml

pip install html5lib==1.1
pip install BeautifulSoup4

Then restart Jupyter Notebook. (or use conda install)

pandas can read tables from HTML pages.

For example:

HTML INPUT

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"

df = pd.read_html(url)

df[0]

match = "Metcalf Bank"

df_list = pd.read_html(url, match=match)

df_list[0]

HTML OUTPUT:

RESULT:
The commands for reading data from a CSV file, an Excel file and an HTML page were executed
successfully.


EX NO 4(a). EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
DATE:

AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Understand the idea behind descriptive statistics.
STEP 3: Load the required packages and the `iris` dataset.
STEP 4: Call load_iris(), which returns an object containing the iris dataset, and store it in
`iris_obj`.
STEP 5: Basic statistics: count, mean, median, min, max
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd

from pandas import DataFrame

from sklearn.datasets import load_iris

# sklearn.datasets includes common example datasets

# A function to load in the iris dataset

iris_obj = load_iris()

# Dataset preview

iris_obj.data

# Build a DataFrame of the four features and join the numeric class label as "species"
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names).join(
    DataFrame(iris_obj.target, columns=["species"]))

iris # prints iris data

Commands

iris_obj.feature_names

iris.count()

iris.mean()

iris.median()

iris.var()

iris.std()

iris.max()

iris.min()

iris.describe()
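
As a rough illustration of how the individual commands above relate to `describe()`, the sketch below uses a small hypothetical frame (the column names and values are made up, not the real iris measurements) and checks that each row of the summary matches the corresponding single-statistic call.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, not the real iris values.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_length": [1.4, 1.5, 4.9, 5.1],
})

summary = sample.describe()

# Each row of describe() agrees with the matching single-statistic method.
checks = {
    "mean": np.allclose(summary.loc["mean"], sample.mean()),
    "std": np.allclose(summary.loc["std"], sample.std()),        # sample std (ddof=1)
    "median": np.allclose(summary.loc["50%"], sample.median()),  # 50% quantile is the median
    "min": np.allclose(summary.loc["min"], sample.min()),
    "max": np.allclose(summary.loc["max"], sample.max()),
    "var": np.allclose(sample.var(), sample.std() ** 2),         # variance is std squared
}
print(checks)
```

Note that `describe()` does not report the variance directly; it can be obtained from `var()` or by squaring the `std` row.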

OUTPUT:

RESULT:
Various commands for doing descriptive analytics on the Iris data set were explored and
executed successfully.


EX.NO 5. USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING:
DATE:

A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS.
AIM:
To explore various commands for doing Univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Find the frequency, mean, median, mode, variance, standard deviation, skewness and
kurtosis of the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T

# Frequency: count of each unique value

df1 = df['Outcome'].value_counts()

# displaying df1
print(df1)
#mean
df.mean()
#median
df.median()

#mode
df.mode()
#Variance
df.var()
#standard deviation
df.std()
# kurtosis of each column (axis=0), skipping NA values
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
# skewness of each column (along the index axis), skipping NA values
df.skew(axis = 0, skipna = True)

# skewness of each row
df.skew(axis = 1, skipna = True)
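
To make the skewness and kurtosis commands less of a black box, here is a small sketch that recomputes both from central moments and compares them with pandas. It assumes the bias-corrected sample formulas (adjusted Fisher-Pearson skewness and excess kurtosis), which is what `Series.skew()` and `Series.kurt()` return; the five data values are hypothetical.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 2.0, 3.0, 9.0])  # hypothetical sample with a long right tail
n = len(values)

# Central moments about the mean (divided by n).
dev = values - values.mean()
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()
m4 = (dev ** 4).mean()

# Skewness: g1 = m3 / m2^(3/2), then the small-sample correction.
g1 = m3 / m2 ** 1.5
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * g1

# Excess kurtosis: g2 = m4 / m2^2 - 3, then the small-sample correction.
g2 = m4 / m2 ** 2 - 3
kurt_manual = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(np.isclose(skew_manual, values.skew()))
print(np.isclose(kurt_manual, values.kurt()))
```

A positive skewness, as in this sample, indicates a longer tail on the right; a kurtosis of 0 under this definition corresponds to a normal distribution.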

#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)

preg = pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)
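
The frequency table above can also be sketched more compactly with `value_counts(normalize=True)`. The example below is an illustration on a short hypothetical series rather than the real `Pregnancies` column; the names `counts`, `share` and `freq` are placeholders. It keeps one decimal place, whereas the listing above casts the percentage to `int`, which drops the fractional part.

```python
import pandas as pd

# Hypothetical values standing in for df['Pregnancies'].
pregnancies = pd.Series([0, 1, 1, 2, 1, 0, 3, 1])

counts = pregnancies.value_counts().sort_index()                # absolute frequency
share = pregnancies.value_counts(normalize=True).sort_index()   # relative frequency (sums to 1)

freq = pd.DataFrame({
    "count": counts,
    "percentage": (share * 100).round(1),
})
freq.index.name = "pregnancies"
print(freq)
```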

sns.countplot(x='Outcome', data=df)

sns.distplot(df['Pregnancies'])

sns.boxplot(data=df['Pregnancies'])


OUTPUT:

RESULT:
Various commands for doing univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set were explored and executed successfully.


EX.NO:5. B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
DATE:
AIM:
To explore the Linear and Logistic Regression model on the USA HOUSING AND UCI
AND PIMA INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download a data set, such as the housing data set, from Kaggle.
STEP 3: Read data from the downloaded data set.
STEP 4: Fit the linear and logistic regression models using the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')

fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})

plt.tight_layout()

plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()

plot20 = sns.boxplot(df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})

axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()

OUTPUT:

## Blood Pressure variable

fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',
label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')


axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()

OUTPUT:


fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

OUTPUT:


LINEAR REGRESSION MODELLING ON HOUSING DATASET

# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)

sns.distplot(USAhousing['Price'])

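# numeric_only=True skips any non-numeric columns, which corr() rejects on recent pandas versions
sns.heatmap(USAhousing.corr(numeric_only=True))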

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
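
To show what the intercept and coefficients printed above represent, the following sketch solves an ordinary least squares problem directly with NumPy. It is an illustration of the idea on synthetic, noise-free data with known weights (3.0, 2.0 and -1.5 are arbitrary choices), not scikit-learn's internal implementation and not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # two hypothetical features
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]    # noise-free target with known weights

# Prepend a column of ones so the first solved weight acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of design @ weights = y.
weights, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, coefficients = weights[0], weights[1:]
print(intercept, coefficients)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature while the other features are held fixed, which is how the `Coefficient` column in `coeff_df` is usually read.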

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

sns.distplot((y_test-predictions),bins=50);

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
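
The three error metrics printed above can be written out by hand. The sketch below uses four hypothetical target and prediction values so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # hypothetical actual values
y_pred = np.array([2.0, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_pred                   # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))              # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)                 # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                        # square root of MSE, in the units of the target

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

MSE and RMSE penalize large errors more heavily than MAE because the errors are squared before averaging.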


OUTPUT:


LOGISTIC REGRESSION MODELLING ON PIMA DIABETES

# Data manipulation libraries

import numpy as np
import pandas as pd

###scikit Learn Modules needed for Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,OneHotEncoder,StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv('C:/Users/diabetes.csv')

df.head()

df.tail()

df.isnull().sum()

df.describe(include='all')

df.corr()

sns.heatmap(df.corr(),annot=True)
plt.show()

df.hist()
plt.show()

sns.countplot(x=df['Outcome'])

scaler=StandardScaler()
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]=scaler.fit_transform(df[['Pregnancies',
'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']])

df_new = df
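
As a small illustration of what the scaling step does to each column, the sketch below standardizes one hypothetical column by hand: subtract the column mean and divide by the population standard deviation (`ddof=0`), which is the same transformation `StandardScaler` applies per feature. The glucose values are made up.

```python
import numpy as np

glucose = np.array([85.0, 148.0, 183.0, 89.0])   # hypothetical column values

# z-score: subtract the mean, divide by the population standard deviation.
# np.std uses ddof=0 by default.
z = (glucose - glucose.mean()) / glucose.std()

print(z.round(3))
print(z.mean().round(6), z.std().round(6))  # scaled column has mean ~0 and std ~1
```

A common refinement, not used in the listing above, is to fit the scaler on the training split only and then apply it to the test split, so that test data does not influence the scaling parameters.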

# Train & Test split

x_train, x_test, y_train, y_test = train_test_split( df_new[['Pregnancies', 'Glucose',
'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']],
df_new['Outcome'],test_size=0.20,
random_state=21)

print('Shape of Training Xs:{}'.format(x_train.shape))

print('Shape of Test Xs:{}'.format(x_test.shape))
print('Shape of Training y:{}'.format(y_train.shape))
print('Shape of Test y:{}'.format(y_test.shape))

Shape of Training Xs:(614, 8)

Shape of Test Xs:(154, 8)
Shape of Training y:(614,)
Shape of Test y:(154,)
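
The shapes printed above follow from simple arithmetic on the 768 rows (614 + 154) and `test_size=0.20`. The sketch below assumes the test count is rounded up and the remainder goes to training, which is consistent with the shapes shown.

```python
import math

n_samples = 768      # 614 + 154, from the shapes printed above
test_size = 0.20

n_test = math.ceil(test_size * n_samples)   # 153.6 rounded up
n_train = n_samples - n_test                # the rest is used for training

print(n_train, n_test)
```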

# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)

score=model.score(x_test,y_test);
print(score)

0.7337662337662337

#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
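
The confusion matrix can be turned into accuracy, precision and recall by hand. In the sketch below the four cell values are hypothetical: they were chosen only so that the total is 154 test rows and the diagonal sums to 113, which reproduces the score printed above (113 / 154, about 0.734). The real matrix will have different cells. The layout follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[88, 12],    # true non-diabetic: 88 correct, 12 predicted diabetic
               [29, 25]])   # true diabetic: 29 missed, 25 correct

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted diabetic cases, how many are truly diabetic
recall = tp / (tp + fn)           # of the truly diabetic cases, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

Accuracy alone can hide a weak recall on the positive class, so it is worth reading all three values together.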


OUTPUT:


RESULT:
Various commands for doing bivariate analytics on the USA HOUSING and PIMA INDIANS DIABETES
data sets were explored and executed successfully.


EX.NO:5.C) MULTIPLE REGRESSION ANALYSIS

DATE:
AIM:
To explore various commands for doing Multivariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Perform the multiple regression analysis.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)


OUTPUT:


RESULT:

Thus the multiple regression analysis using the housing data set was executed successfully.


EX.NO:5.D) ALSO COMPARE THE RESULTS OF THE ABOVE ANALYSIS FOR THE TWO DATA SETS.
DATE:

AIM:
To explore various commands for comparing the results of the above analysis for the two data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Compare the two data sets using various commands.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()

#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10=sns.boxplot(df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()

plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])

axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})

axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()

plt.show()

fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()


OUTPUT:

RESULT:

Thus the comparison of the above analysis for the two data sets was executed successfully.


EX.NO:6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS.
DATE:

AIM:
To apply and explore various plotting functions on UCI datasets.

ALGORITHM:

STEP 1: Install seaborn package and import the package.

STEP 2: Normal curves, density or contour plots, correlation and scatter plots, and
histogram plots are visualized.
STEP 3: 3D plotting is done using the plotly package.
STEP 4: Stop the program.
PROGRAM:

A. NORMAL CURVES

#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")

OUTPUT:


B. DENSITY AND CONTOUR PLOTS

iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)

OUTPUT:

C. CORRELATION AND SCATTER PLOTS

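#correlation visualized using heatmap function

df = sns.load_dataset("titanic")
# heatmap needs a numeric matrix, so plot the correlation of the numeric columns
ax = sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")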

#scatter plots of categorical variable

df = sns.load_dataset("titanic")
sns.catplot(data=df, x="age", y="class")

OUTPUT:


D. HISTOGRAMS

#histogram of dataframe

df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")

OUTPUT:

E. THREE DIMENSIONAL PLOTTING

#3d plotting using plotly package

import plotly.express as px
df = sns.load_dataset("iris")

# seaborn's iris columns are sepal_length, sepal_width, petal_length, petal_width, species
px.scatter_3d(df, x="petal_length", y="petal_width", z="sepal_width",
size="sepal_length",
color="species", color_discrete_map = {"setosa": "blue", "versicolor": "violet",
"virginica":"pink"})

OUTPUT:


RESULT:

Thus the various visual plots were explored and executed successfully.


EX.NO:7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

DATE:

AIM:

To visualize geographic data with Basemap using Google Colab.

ALGORITHM:

STEP 1: Install the basemap package

Install the below package:

Use Google Colab (installing from the Anaconda prompt may require changing the conda version, which may affect
the compatibility of other packages).
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap

STEP 2: Explore on various projection options example: ortho, lcc.

STEP 3: Mark the location using longitude and latitude

PROGRAM:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

OUTPUT:


fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting

x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

OUTPUT:

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

fig = plt.figure(figsize=(8, 6), edgecolor='w')


m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

OUTPUT:

fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)

OUTPUT:

RESULT:

Thus the exploration of geographic data with Basemap was executed successfully.
