lOMoARcPSD|28265006
CS3362 Foundations OF DATA Science LAB Manual
Computer Science and Engineering (Anna University)
Studocu is not sponsored or endorsed by any college or university
Downloaded by Jegatheeswari ic37721 (
[email protected])
EX.NO.4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:
Aim:
To read data from text files, Excel files and the web using the pandas package.
ALGORITHM:
STEP 1: Start the program
STEP 2: Read data from a CSV file using the pandas package.
STEP 3: Read data from an Excel file using the pandas package.
STEP 4: Read data from an HTML page using the pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT
This notebook is reference code for data input and output. pandas can read a variety of file
types using its pd.read_* methods. Let's take a look at the most common ones:
import numpy as np
import pandas as pd
CSV
CSV INPUT:
df = pd.read_csv('example')
df
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
CSV OUTPUT:
df.to_csv('example',index=False)
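As a self-contained illustration of the CSV round trip, the sketch below reads the same 4x4 table from an in-memory string instead of the file 'example', writes it back with index=False and reads it again. The names csv_text, df_csv, buffer and round_trip are made up for this illustration.

```python
import io
import pandas as pd

# The table shown above, as CSV text (stands in for the file 'example')
csv_text = "a,b,c,d\n0,1,2,3\n4,5,6,7\n8,9,10,11\n12,13,14,15\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# index=False leaves the row labels 0..3 out of the written text
buffer = io.StringIO()
df_csv.to_csv(buffer, index=False)

# Reading the written text again should give back the same DataFrame
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.equals(df_csv))
```

Without index=False, to_csv also writes the row labels, and reading the file back produces an extra unnamed column.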
EXCEL
pandas can read and write Excel files. Keep in mind that this imports only data, not formulas or
images; a workbook containing images or macros may cause the read_excel method to fail.
EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
HTML
You may need to install html5lib, lxml and BeautifulSoup4. In your terminal/command prompt
run:
pip install lxml
pip install html5lib==1.1
pip install BeautifulSoup4
Then restart Jupyter Notebook (or use conda install).
pandas can read tables from HTML pages.
For example:
HTML INPUT
The pandas read_html function reads the tables on a web page and returns a list of DataFrame objects:
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list'
df = pd.read_html(url)
df[0]
match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
df_list[0]
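To see what the match argument does without depending on the live web page, the sketch below parses a small hand-written HTML string that holds two tables. The HTML content, the city name and the variable names are invented for the illustration; it assumes lxml (or BeautifulSoup4 with html5lib) is installed as described above.

```python
import io
import pandas as pd

# Two small tables in one page; only the first mentions "Metcalf Bank"
html = """
<table>
  <thead><tr><th>Bank Name</th><th>City</th></tr></thead>
  <tbody><tr><td>Metcalf Bank</td><td>Example City</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Item</th><th>Value</th></tr></thead>
  <tbody><tr><td>Other</td><td>1</td></tr></tbody>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(io.StringIO(html))

# With match, only tables whose text matches the pattern are returned
matched = pd.read_html(io.StringIO(html), match="Metcalf Bank")

print(len(all_tables), len(matched))
print(matched[0])
```

read_html always returns a list, even when only one table matches, which is why the program indexes the result with [0].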
HTML OUTPUT:
RESULT:
The commands for reading data from a CSV file, an Excel file and an HTML page were successfully
executed.
EX NO 4(a). EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYTICS ON THE IRIS DATA SET.
DATE:
AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Understand the idea behind descriptive statistics.
STEP 3: Load the required packages and the `iris` dataset.
STEP 4: Call load_iris(), which returns an object containing the iris dataset, and store it in
`iris_obj`.
STEP 5: Basic statistics: count, mean, median, min, max
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris
# sklearn.datasets includes common example datasets
# A function to load in the iris dataset
iris_obj = load_iris()
# Dataset preview
iris_obj.data
# Features and species code side by side; both frames get the default index 0..149
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names).join(
    DataFrame(iris_obj.target, columns=["species"]))
iris # prints iris data
Commands
iris_obj.feature_names
iris.count()
iris.mean()
iris.median()
iris.var()
iris.std()
iris.max()
iris.min()
iris.describe()
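The same table can be built more compactly, and describe() can be checked against the individual commands. This is a sketch that assumes scikit-learn 0.23 or later, where load_iris accepts as_frame=True; the names iris_bunch, iris_df and summary are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

# as_frame=True makes scikit-learn return pandas objects
iris_bunch = load_iris(as_frame=True)

# .frame holds the four feature columns plus a numeric 'target' column
iris_df = iris_bunch.frame.rename(columns={"target": "species"})

# describe() collects count, mean, std, min, quartiles and max in one table
summary = iris_df.describe()
print(summary)

# Its rows agree with the individual commands, e.g. the 50% row is the median
print(np.allclose(summary.loc["50%"], iris_df.median()))
```

Note that species is stored as a numeric code, so its mean and standard deviation appear in the table even though they have no statistical meaning for a class label.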
OUTPUT:
RESULT:
Various commands for doing descriptive analytics on the Iris data set were explored and
successfully executed.
EX.NO 5. USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS
DIABETES DATA SET FOR PERFORMING THE FOLLOWING:
A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE,
STANDARD DEVIATION, SKEWNESS AND KURTOSIS.
DATE:
AIM:
To explore various commands for doing univariate analytics on the UCI and Pima
Indians Diabetes data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI and Pima Indians Diabetes data set from Kaggle.
STEP 3: Read the data from the downloaded data set.
STEP 4: Find the frequency, mean, median, mode, variance, standard deviation, skewness and
kurtosis of the data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T
# Frequency
# finding the unique count
df1 = df['Outcome'].value_counts()
# displaying df1
print(df1)
#mean
df.mean()
#median
df.median()
#mode
df.mode()
#Variance
df.var()
#standard deviation
df.std()
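#kurtosis
# kurtosis of every column, skipping NA values
df.kurtosis(axis=0,skipna=True)
# kurtosis of the Outcome column only
df['Outcome'].kurtosis(axis=0,skipna=True)
#skewness
# skewness along the index axis (one value per column), skipping NA values
df.skew(axis = 0, skipna = True)
# skewness along the column axis (one value per row)
df.skew(axis = 1, skipna = True)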
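To make clear what skew() and kurtosis() report, the sketch below recomputes both from the central moments of a small made-up Series. pandas applies a small-sample bias correction to both statistics, and kurtosis is reported as excess kurtosis (a normal distribution gives 0). The data values and variable names are invented for the illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 40.0])
n = len(values)

# Central moments (population form: divide by n)
dev = values - values.mean()
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()
m4 = (dev ** 4).mean()

# Uncorrected moment ratios
g1 = m3 / m2 ** 1.5          # skewness
g2 = m4 / m2 ** 2 - 3        # excess kurtosis

# Small-sample corrections, matching what pandas reports
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * g1
kurt_manual = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(skew_manual, values.skew())
print(kurt_manual, values.kurt())
```

A positive skewness, as for this Series with one large value, indicates a long right tail; a positive excess kurtosis indicates heavier tails than a normal distribution.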
#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)
preg = pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,
                     'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)
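sns.countplot(x='Outcome',data=df)
# distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the suggested replacement
sns.distplot(df['Pregnancies'])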
sns.boxplot(data=df['Pregnancies'])
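The frequency table built above for the Pregnancies variable can also be obtained directly from value_counts, whose normalize=True option returns proportions instead of counts. The sketch below uses a short made-up Series in place of df['Pregnancies']; the names pregnancies, counts, share and freq_table are illustrative.

```python
import pandas as pd

# Stand-in for df['Pregnancies'] (values invented for the example)
pregnancies = pd.Series([0, 1, 1, 2, 2, 2, 3, 1], name="Pregnancies")

# Absolute frequency of each distinct value
counts = pregnancies.value_counts()

# Relative frequency in percent
share = pregnancies.value_counts(normalize=True).mul(100).round(1)

# Both Series are indexed by the distinct values, so they align row by row
freq_table = pd.DataFrame({"count": counts, "percentage": share})
print(freq_table)
```

Keeping the percentage as a float avoids the truncation that dtype=int introduces in the program above.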
OUTPUT:
RESULT:
Various commands for doing univariate analytics on the UCI and Pima
Indians Diabetes data set were successfully executed.
EX.NO:5. B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION
MODELING
DATE:
AIM:
To explore linear and logistic regression models on the USA Housing and the UCI
and Pima Indians Diabetes data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download a suitable data set, such as the housing data set, from Kaggle.
STEP 3: Read the data from the downloaded data set.
STEP 4: Fit the linear and logistic regression models to the given data sets.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()
plot20 = sns.boxplot(y=df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
OUTPUT:
## Blood Pressure variable
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',
label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(y=df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
OUTPUT:
fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))
plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(y=df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
OUTPUT:
LINEAR REGRESSION MODELLING ON HOUSING DATASET
# Data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns
sns.pairplot(USAhousing)
sns.distplot(USAhousing['Price'])
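# numeric_only=True restricts the correlation to numeric columns (needed if the file has text columns)
sns.heatmap(USAhousing.corr(numeric_only=True))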
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
sns.distplot((y_test-predictions),bins=50);
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
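Because USA_Housing.csv may not be at hand, the sketch below repeats the same fit, predict and evaluate steps on synthetic data with a known linear relationship, and writes the three error metrics out by hand so they can be compared with sklearn.metrics. The data, the coefficients 5, 3 and 2, and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

rng = np.random.default_rng(101)
X_demo = rng.uniform(0, 10, size=(200, 2))
# Assumed ground truth: y = 5 + 3*x1 + 2*x2 plus a little noise
y_demo = 5 + 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)
lm_demo = LinearRegression().fit(X_tr, y_tr)
pred = lm_demo.predict(X_te)

# The three metrics written out by hand
errors = y_te - pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print(lm_demo.intercept_, lm_demo.coef_)
print(mae, mse, rmse)
print(np.isclose(mae, metrics.mean_absolute_error(y_te, pred)))
```

The fitted intercept and coefficients should land close to the assumed 5, 3 and 2, which is a quick way to check that the pipeline is wired correctly before applying it to real data.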
OUTPUT:
LOGISTIC REGRESSION MODELLING ON PIMA DIABETES
# Data manipulation libraries
import numpy as np
import pandas as pd
# scikit-learn modules needed for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,OneHotEncoder,StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.tail()
df.isnull().sum()
df.describe(include='all')
df.corr()
sns.heatmap(df.corr(),annot=True)
plt.show()
df.hist()
plt.show()
sns.countplot(x=df['Outcome'])
scaler=StandardScaler()
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]=scaler.fit_transform(df[['Pregnancies',
'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']])
df_new = df
# Train & Test split
x_train, x_test, y_train, y_test = train_test_split( df_new[['Pregnancies', 'Glucose',
'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']],
df_new['Outcome'],test_size=0.20,
random_state=21)
print('Shape of Training Xs:{}'.format(x_train.shape))
print('Shape of Test Xs:{}'.format(x_test.shape))
print('Shape of Training y:{}'.format(y_train.shape))
print('Shape of Test y:{}'.format(y_test.shape))
Shape of Training Xs:(614, 8)
Shape of Test Xs:(154, 8)
Shape of Training y:(614,)
Shape of Test y:(154,)
# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)
score=model.score(x_test,y_test);
print(score)
0.7337662337662337
#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
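The value returned by model.score is the accuracy, which can be read off the confusion matrix together with precision and recall. The sketch below shows the arithmetic on ten hand-made labels; y_true, y_pred and the other names are invented for the illustration and are not the diabetes results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted classes (0 = non-diabetic, 1 = diabetic)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1, 0, 1])

# Rows are the actual class, columns the predicted class
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are real
recall = tp / (tp + fn)           # share of real positives that are found

print(cm)
print(accuracy, precision, recall)
```

For a medical data set such as this one, recall is often worth reporting alongside accuracy, since a false negative (a missed diabetic case) and a false positive do not carry the same cost.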
OUTPUT:
RESULT:
Various commands for doing bivariate analytics, with linear regression on the USA Housing data set
and logistic regression on the Pima Indians Diabetes data set, were successfully executed.
EX.NO:5.C) MULTIPLE REGRESSION ANALYSIS
DATE:
AIM:
To explore various commands for doing multivariate analytics on the UCI and Pima
Indians Diabetes data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI and Pima Indians Diabetes data set from Kaggle.
STEP 3: Read the data from the downloaded data set.
STEP 4: Perform the multiple regression analysis on the data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns
sns.pairplot(USAhousing)
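The listing above only loads and plots the data. Multiple regression itself means fitting one response variable on several predictors at once, as done with LinearRegression in exercise 5.B. A minimal sketch of the underlying least-squares computation with NumPy alone is given below; the synthetic data, the assumed coefficients and the variable names are illustrative and do not come from the housing file.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
# Assumed relationship for the demo: y = 1.5 + 2.0*x1 - 0.5*x2 (no noise)
y = 1.5 + 2.0 * x1 - 0.5 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor
A = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution of A @ beta ~= y
beta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta)   # intercept first, then one coefficient per predictor
```

Each coefficient is interpreted as the change in the response for a unit change in that predictor while the other predictors are held fixed.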
OUTPUT:
RESULT:
Thus the multiple regression analysis using the housing data set was executed successfully.
EX.NO:5.D) ALSO COMPARE THE RESULTS OF THE ABOVE ANALYSIS FOR THE
TWO DATA SETS.
DATE:
AIM:
To explore various commands for comparing the results of the above analysis for the
two data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: Download the UCI and Pima Indians Diabetes data set from Kaggle.
STEP 3: Read the data from the downloaded data set.
STEP 4: Compare the two data sets using various commands.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()
#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',
label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(y=df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))
plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(y=df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
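The plots above compare the Glucose distribution with and without zero readings and between the two Outcome groups. The same comparison can be made numerically with groupby. The sketch below uses a tiny made-up table in place of df and treats a Glucose value of 0 as a missing measurement, which is an assumption of this illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the diabetes DataFrame (values invented for the example)
demo = pd.DataFrame({
    "Glucose": [0, 85, 90, 150, 160, 0, 170, 95],
    "Outcome": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Group means with the zero readings left in
raw_means = demo.groupby("Outcome")["Glucose"].mean()

# Replace 0 by NaN so that mean() skips those rows
cleaned = demo.assign(Glucose=demo["Glucose"].replace(0, np.nan))
clean_means = cleaned.groupby("Outcome")["Glucose"].mean()

print(pd.DataFrame({"with_zeros": raw_means, "zeros_as_missing": clean_means}))
```

In this toy table the zero readings pull both group means down, which mirrors why the second figure above filters out rows where Glucose is 0.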
OUTPUT:
RESULT:
Thus the comparison of the above analysis for the two data sets was executed successfully.
EX.NO:6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI
DATA SETS.
DATE:
AIM:
To apply and explore various plotting functions on UCI datasets.
ALGORITHM:
STEP 1: Install the seaborn package and import it.
STEP 2: Visualize normal curves, density or contour plots, correlation and scatter plots, and
histogram plots.
STEP 3: Do the 3D plotting using the plotly package.
STEP 4: Stop the program.
PROGRAM:
A. NORMAL CURVES
#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
OUTPUT:
B. DENSITY AND CONTOUR PLOTS
iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
OUTPUT:
C. CORRELATION AND SCATTER PLOTS
#correlation visualized using heatmap function
df = sns.load_dataset("titanic")
ax = sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")
#scatter plots of categorical variable
df = sns.load_dataset("titanic")
sns.catplot(data=df, x="age", y="class")
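Each cell of a correlation heatmap is a Pearson correlation coefficient between two columns. The sketch below computes one such coefficient by hand and compares it with DataFrame.corr; the six rows of age and fare values are made up for the illustration rather than taken from the titanic data.

```python
import numpy as np
import pandas as pd

# Made-up sample with two numeric columns
demo = pd.DataFrame({
    "age":  [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
})

# Pearson r = covariance / (std_x * std_y), written with deviations from the mean
dx = demo["age"] - demo["age"].mean()
dy = demo["fare"] - demo["fare"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The matrix that sns.heatmap would colour
corr_matrix = demo.corr()
print(r_manual, corr_matrix.loc["age", "fare"])
```

The matrix is symmetric with ones on the diagonal, so only the cells above or below the diagonal carry information.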
OUTPUT:
D. HISTOGRAMS
#histogram of dataframe
df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")
OUTPUT:
E. THREE DIMENSIONAL PLOTTING
#3d plotting using plotly package
import plotly.express as px
df = sns.load_dataset("iris")
# column and species names follow the seaborn copy of the iris dataset
px.scatter_3d(df, x="petal_length", y="petal_width", z="sepal_width",
size="sepal_length",
color="species", color_discrete_map = {"setosa": "blue", "versicolor": "violet",
"virginica":"pink"})
OUTPUT:
RESULT:
Thus the various visual plots were explored and executed successfully.
EX.NO:7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
DATE:
AIM:
To visualize geographic data with Basemap using Google Colab.
ALGORITHM:
STEP 1: Install the basemap package.
Use Google Colab (installing from the Anaconda prompt may require changing the conda version,
which can affect the compatibility of other packages):
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap
STEP 2: Explore various projection options, for example ortho and lcc.
STEP 3: Mark a location using its longitude and latitude.
PROGRAM:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
OUTPUT:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
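The call m(lon, lat) converts geographic coordinates into the x, y coordinates of the chosen projection. To show what such a conversion involves, the sketch below writes out the standard orthographic projection formulas (the projection used in the first map) with NumPy only. It does not call Basemap, the function name ortho_xy is made up, and it works on a unit sphere centred on the projection centre, so Basemap's own numbers will differ by scale and offset.

```python
import numpy as np

def ortho_xy(lon, lat, lon_0=-100.0, lat_0=50.0, radius=1.0):
    """Orthographic projection of (lon, lat) in degrees, centred on (lon_0, lat_0)."""
    lam, phi = np.radians(lon), np.radians(lat)
    lam0, phi0 = np.radians(lon_0), np.radians(lat_0)
    x = radius * np.cos(phi) * np.sin(lam - lam0)
    y = radius * (np.cos(phi0) * np.sin(phi)
                  - np.sin(phi0) * np.cos(phi) * np.cos(lam - lam0))
    # cos_c > 0 means the point lies on the hemisphere facing the viewer
    cos_c = (np.sin(phi0) * np.sin(phi)
             + np.cos(phi0) * np.cos(phi) * np.cos(lam - lam0))
    return x, y, cos_c > 0

# The projection centre maps to the origin
x_c, y_c, _ = ortho_xy(-100.0, 50.0)

# Seattle, using the longitude and latitude from the listing above
x_s, y_s, visible = ortho_xy(-122.3, 47.6)
print(x_c, y_c)
print(x_s, y_s, visible)
```

Seattle lies west of the centre meridian, so its x coordinate is negative in this centred convention, and the visibility flag indicates whether a marker would be drawn on the near side of the globe.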
OUTPUT:
from itertools import chain
def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
OUTPUT:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)
OUTPUT:
RESULT:
Thus the exploration of geographic data with Basemap was successfully executed.