21CS1711
DATA SCIENCE
AND
ANALYTICS LABORATORY
Exercise 1: Download, install and explore the features of NumPy,
SciPy, Jupyter, Statsmodels and Pandas packages.
• Reading data from text file, Excel and the web.
• Exploring various commands for doing descriptive analytics
on Iris dataset.
Before We Start
• Since Anaconda Navigator already comes preloaded with most key
data science packages (including NumPy, Pandas, SciPy, Statsmodels,
Jupyter), we can skip the installation step and jump right into the
hands-on teaching session.
STEP 1: Launch Jupyter
Notebook via Anaconda
Navigator
• Open Anaconda Navigator
• Click Launch under Jupyter Notebook
• A browser window will open → Click New → Python 3 (ipykernel)
Run a quick test to confirm that the preinstalled packages import correctly:
import numpy as np
import pandas as pd
import scipy
print(np.__version__)
print(pd.__version__)
print(scipy.__version__)
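The same check can be extended to all five packages with a loop. This is only a sketch: the helper name report_versions is made up for illustration, and some packages (jupyter, for example) may not expose a __version__ attribute, in which case "unknown" is shown.

```python
import importlib

# Import names of the five packages used in this exercise.
PACKAGES = ["numpy", "scipy", "pandas", "statsmodels", "jupyter"]

def report_versions(names):
    """Return {package: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

for name, version in report_versions(PACKAGES).items():
    print(f"{name:12s} {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added later from the environment's terminal.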
Packages
• A) NumPy (numpy)
NumPy stands for Numerical Python. It helps in fast mathematical
calculations on arrays. Think of it as a super-powered calculator!
Example:
import numpy as np
a = np.array([1, 2, 3])
print("Mean:", np.mean(a))
Output:
Mean: 2.0
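Beyond the mean, a few more array operations are worth trying. The sketch below reuses the array a from the example and adds a small 2-D array m, which is invented here for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

# Arithmetic applies to every element at once (no explicit loop needed).
doubled = a * 2            # array([2, 4, 6])
squared = a ** 2           # array([1, 4, 9])

# Common reductions on the same array.
total = np.sum(a)          # 6
largest = np.max(a)        # 3

# A 2-D array: 2 rows, 3 columns.
m = np.array([[1, 2, 3], [4, 5, 6]])
col_means = m.mean(axis=0)  # mean of each column: array([2.5, 3.5, 4.5])

print(doubled, squared, total, largest, m.shape, col_means)
```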
B) SciPy (scipy)
• SciPy is used for scientific computing, especially statistics, signals, optimization,
and more.
• Example:
from scipy import stats
group1 = [60, 65, 70]
group2 = [80, 85, 90]
t_stat, p_val = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
Output:
t-statistic: -4.898979485566357
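To see where this number comes from, the t-statistic can be recomputed by hand. This is a minimal sketch using only the Python standard library, assuming equal variances in both groups (the default in stats.ttest_ind); the variable names are chosen here for illustration.

```python
import math
import statistics

group1 = [60, 65, 70]
group2 = [80, 85, 90]

n1, n2 = len(group1), len(group2)
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)  # sample variances

# Pooled variance, assuming both groups share one population variance
# (the default assumption of stats.ttest_ind).
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / standard_error
print("t-statistic:", t_stat)  # about -4.899
```

The sign is negative simply because group1 has the smaller mean.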
NumPy vs SciPy
NumPy:
import numpy as np
data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print("Mean:", mean)
Output:
Mean: 3.0
SciPy:
from scipy import stats
# alternative: import scipy.stats as sta
group1 = [60, 65, 70]
group2 = [80, 85, 90]
t_stat, p_val = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_val)
Output:
t-statistic: -4.898979485566357
p-value: 0.008049893100837717
Use NumPy (np) for general numerical operations and basic stats. Use from scipy import stats when
you need statistical functions that go beyond what NumPy offers, such as distributions, hypothesis
testing, and confidence intervals.
C) Jupyter Notebook (jupyter)
• Jupyter is an interactive environment where we write code, see
output instantly, and explain with text, images, or charts.
• How to launch from Anaconda Navigator
• Markdown and Code examples
• Save as .ipynb file
In the Iris dataset, each flower has 4 features:
Feature        Description            Unit
sepal length   Length of the sepal    cm
sepal width    Width of the sepal     cm
petal length   Length of the petal    cm
petal width    Width of the petal     cm
Target (Label):
The species of the flower, a categorical value:
• 0 → Setosa
• 1 → Versicolor
• 2 → Virginica
(The CSV file loaded in the Pandas example stores the species names as text instead of these numeric codes.)
Dataset Size:
• 150 samples
• 50 samples per species
• No missing values
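When a copy of the dataset uses the numeric codes, they can be turned back into names with a lookup. A small sketch, assuming a label column like the hypothetical one below (the six values are invented, not taken from the dataset):

```python
import pandas as pd

# Numeric label -> species name, as listed above.
SPECIES_NAMES = {0: "setosa", 1: "versicolor", 2: "virginica"}

# Hypothetical label column, for illustration only.
labels = pd.Series([0, 0, 1, 2, 2, 1], name="target")

species = labels.map(SPECIES_NAMES)
print(species.tolist())
print(species.value_counts().sort_index())
```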
D) Pandas (pandas)
• Pandas lets us read, clean, and analyze tabular data easily. It gives us
DataFrames, like Excel in Python.
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
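Once the data is in a DataFrame, columns and rows can be picked out in a few standard ways. The sketch below types in the five rows shown above by hand so that it runs without an internet connection; the variable names are chosen here for illustration.

```python
import pandas as pd

# The five rows printed by df.head(), typed in by hand.
head = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

one_column = head["sepal_length"]                  # a Series
two_columns = head[["sepal_length", "species"]]    # a smaller DataFrame
long_sepals = head[head["sepal_length"] >= 5.0]    # rows matching a condition
first_row = head.iloc[0]                           # row by position

print(long_sepals)
```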
E) Statsmodels (statsmodels)
• Statsmodels is used for running statistical models like regression,
ANOVA, and time series.
import statsmodels.api as sm
X = df['sepal_length']
y = df['petal_length']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
What are we trying to do?
• To predict petal_length based on sepal_length using a straight-line formula
• Petal Length = something + something × Sepal Length
• The computer finds the best line that fits the data using a method called OLS (ordinary least squares)
Term           coef      Meaning in Simple Words
const          -7.1014   This is the starting value (intercept). If sepal_length = 0, petal_length would be -7.1 (just part of the formula).
sepal_length   1.8584    For every 1 unit increase in sepal_length, petal_length increases by about 1.86 units.
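What "finding the best line" means can be shown on a tiny example. This is a sketch with five invented (x, y) points, not the Iris data; it computes the least-squares slope and intercept from the textbook formulas and compares them with np.polyfit.

```python
import numpy as np

# Hypothetical (x, y) points, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares slope and intercept from the textbook formulas.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 fits the same straight line.
check_slope, check_intercept = np.polyfit(x, y, 1)

print(f"y = {intercept:.3f} + {slope:.3f} * x")
```

sm.OLS solves the same problem and additionally reports standard errors, p-values and R-squared.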
What are we trying to do?
Column          Meaning
std err         How much the value might vary. Smaller is better.
t               A score that tells if the number is far from zero
P>|t|           The p-value for that number (interpreted in the next table)
[0.025 0.975]   We’re 95% sure the real number lies in this range.

p-value   Interpretation
< 0.01    Very strong evidence against the null → Highly significant
< 0.05    Enough evidence to reject the null → Statistically significant
> 0.05    Not enough evidence → Difference might be due to chance
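The p-value thresholds can be written as a small helper. This is a sketch only: the function name interpret_p_value is made up, and treating a p-value of exactly 0.05 as not significant is an assumption, since the thresholds do not cover that case.

```python
def interpret_p_value(p):
    """Map a p-value to the wording used for the thresholds above."""
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "statistically significant"
    return "not significant (difference might be due to chance)"

for p in (0.008, 0.03, 0.20):
    print(p, "->", interpret_p_value(p))
```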
What we Should Focus On
• Look at the coef (values of the equation):
  Petal Length = −7.1 + 1.86 × Sepal Length
  So if a flower has sepal_length = 6 cm:
  Petal Length ≈ −7.1 + 1.86 × 6 = 4.06 cm
• Look at the P>|t| value: if it's less than 0.05, the term is important (statistically significant).
• Both values here are 0.000, so both terms are highly significant.
• Look at R-squared = 0.760: this means 76% of the variation in petal length can be explained
just by knowing sepal length.
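The fitted equation can be turned into a small prediction function. A minimal sketch using the rounded coefficients −7.1 and 1.86; the function name predict_petal_length is chosen here for illustration.

```python
# Rounded coefficients from the regression summary.
INTERCEPT = -7.1
SLOPE = 1.86

def predict_petal_length(sepal_length_cm):
    """Predicted petal length (cm) from the fitted straight line."""
    return INTERCEPT + SLOPE * sepal_length_cm

for sepal in (5.0, 6.0, 7.0):
    print(f"sepal_length = {sepal} cm -> petal_length ≈ {predict_petal_length(sepal):.2f} cm")
```

The line should only be trusted for sepal lengths similar to those in the data (about 4.3 to 7.9 cm in Iris); outside that range the prediction can be meaningless, as the negative intercept shows.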
SUMMARY
Package       Use
NumPy         Fast math, arrays, stats
SciPy         Statistical tests, scientific computing
Jupyter       Interactive code notebook
Pandas        Data manipulation and analysis
Statsmodels   Statistical modeling like regression
STEP 4: Explore the Dataset
(Descriptive Analytics)
• Show basic details: df.info()
• Summary statistics: df.describe()
• Unique species and counts: df['species'].value_counts()
• Mean values by species: df.groupby('species').mean()
• Correlation matrix: df.corr(numeric_only=True)
• Number of rows and columns: df.shape
• Duplicate rows: df.duplicated() marks each row that repeats an earlier row
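Some of these commands are easier to follow on a tiny table. The sketch below uses five hand-typed rows (not the full dataset), with the last row deliberately repeating the first so that duplicated() has something to find.

```python
import pandas as pd

# A few hand-typed rows (the last one repeats the first) so that
# duplicated() has something to find; not the full dataset.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 5.1],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 1.4],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "setosa"],
})

print(sample.shape)                       # (rows, columns)
print(sample.duplicated().sum())          # number of repeated rows
deduplicated = sample.drop_duplicates()   # keep the first copy of each row
species_means = deduplicated.groupby("species")["petal_length"].mean()
print(species_means)
```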
STEP 5: Use NumPy for Numerical
Operations
# Mean of sepal length
np.mean(df['sepal_length'])
# Standard deviation of petal width (NumPy uses the population formula, ddof=0, by default)
np.std(df['petal_width'])
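NumPy and Pandas can give slightly different standard deviations for the same numbers, because their default divisors differ. A short sketch with five invented values:

```python
import numpy as np
import pandas as pd

numbers = [1, 2, 3, 4, 5]
array = np.array(numbers)
series = pd.Series(numbers)

population_sd = np.std(array)          # divides by n      (ddof=0)
sample_sd = np.std(array, ddof=1)      # divides by n - 1  (ddof=1)
pandas_sd = series.std()               # pandas defaults to ddof=1

print(population_sd, sample_sd, pandas_sd)
```

So df['petal_width'].std() and np.std applied to the same values will not match unless ddof is set to the same value in both.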
STEP 6: Use SciPy for Statistical Test
Example
from scipy import stats
# T-test between petal lengths of setosa and versicolor
setosa = df[df['species'] == 'setosa']['petal_length']
versicolor = df[df['species'] == 'versicolor']['petal_length']
stats.ttest_ind(setosa, versicolor)
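The result of ttest_ind carries both the statistic and the p-value. The sketch below uses two short invented lists instead of the real columns, and passes equal_var=False to request Welch's t-test, a variant that does not assume equal variances in the two groups.

```python
from scipy import stats

# Hypothetical petal lengths (cm) for two groups, typed in for illustration.
group_a = [1.4, 1.4, 1.3, 1.5, 1.4]
group_b = [4.7, 4.5, 4.9, 4.0, 4.6]

# equal_var=False requests Welch's t-test, which does not assume
# that the two groups have the same variance.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
if result.pvalue < 0.05:
    print("The group means differ significantly at the 5% level.")
```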
STEP 7: Use Statsmodels for
Regression
import statsmodels.api as sm
# Simple Linear Regression: Predict petal_length using sepal_length
X = df['sepal_length']
y = df['petal_length']
X = sm.add_constant(X) # Adds intercept
model = sm.OLS(y, X).fit()
print(model.summary())
STEP 8: Make It Visual
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df, hue='species')
plt.show()
Basic Descriptions
(These examples use the DiabetesPedigreeFunction column of a diabetes dataset; replace it with any numeric column of your DataFrame, such as 'sepal_length'.)
• frequency = df['column_name'].value_counts()
• print(frequency)
• mean = df['DiabetesPedigreeFunction'].mean()
• print(mean)
• median = df['DiabetesPedigreeFunction'].median()
• print(median)
• variance = df['DiabetesPedigreeFunction'].var()
• print(variance)
• std_dev = df['DiabetesPedigreeFunction'].std()
• print(std_dev)
• skewness = df['DiabetesPedigreeFunction'].skew()
• print(skewness)
• kurtosis = df['DiabetesPedigreeFunction'].kurtosis()
• print(kurtosis)
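Several of these statistics can be requested in a single call with agg. A sketch on an invented numeric column (the name example_column and its values are placeholders):

```python
import pandas as pd

# Hypothetical numeric column, for illustration only.
values = pd.Series([0.2, 0.4, 0.4, 0.6, 1.4], name="example_column")

# One call returns all the requested statistics as a labelled Series.
summary = values.agg(["mean", "median", "var", "std", "skew", "kurt"])
print(summary)
```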
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load a dataset
df = sns.load_dataset("iris") # Example dataset
# Compute the correlation matrix
corr = df.corr(numeric_only=True)
# Create the heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
Data from Text or Excel
• Load from .txt:
df_txt = pd.read_csv('myfile.txt', delimiter='\t')
• Load from .xlsx (needs the openpyxl package installed):
df_excel = pd.read_excel('Book1.xlsx')
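To try the text-file route without preparing a file first, a small table can be written to a temporary tab-separated file and read back. This is a sketch: the table contents are invented and the file lives only in a temporary folder.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 85, 77]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "myfile.txt")      # temporary stand-in file
    original.to_csv(path, sep="\t", index=False)   # write tab-separated text
    loaded = pd.read_csv(path, delimiter="\t")     # read it back

print(loaded)
```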
Summary Table
Task                      Command
Load CSV                  pd.read_csv(url)
View top rows             df.head()
Basic info                df.info()
Summary stats             df.describe()
Group by                  df.groupby()
Mean using NumPy          np.mean()
T-test                    stats.ttest_ind()
Regression                sm.OLS().fit()
Correlation coefficient   df.corr()
Correlation heatmap       sns.heatmap()
Creating Own Kernel (Virtual
Environment) in Navigator by
Installing using Terminal
1. Open Anaconda Navigator
• Click Start → search for Anaconda Navigator → open it.
• Wait for it to load fully.
2. Create a New Environment
(Recommended)
• Creating a new environment keeps things clean.
• Click Environments (left side).
• Click Create (bottom).
• Name it: data_analysis_env
• Choose: Python 3.10 or 3.11
• Click Create (wait a bit).
3. Install Required Packages
Let’s install the following:
• numpy
• scipy
• pandas
• statsmodels
• jupyter
Option A: Using the Terminal
• Select your environment (data_analysis_env)
• Click Open Terminal (right side)
• In the terminal, type:
conda install numpy scipy pandas statsmodels jupyter
Option B: Use Environment Tab
• Click Channels > conda-forge
• Use Search Bar for each library (e.g., numpy)
• Select → Apply
Creating Own Kernel (Virtual
Environment) in Navigator by
Installing using GUI
4. Launch Jupyter Notebook
• From Home tab, select the environment dropdown (top right).
• Make sure data_analysis_env is selected.
• Click Launch → Jupyter Notebook
This will open a browser window (localhost:8888).
Click New → Python 3 to open a new notebook.
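To confirm that the new notebook really runs inside the new environment, the interpreter path can be printed from the first cell. A minimal sketch, assuming the environment was named data_analysis_env as in the steps above:

```python
import sys

# Path of the Python interpreter that runs this notebook kernel.
print("Interpreter:", sys.executable)

# Root folder of the active environment.
print("Environment:", sys.prefix)

# Name used in the steps above; change it if you chose another name.
ENV_NAME = "data_analysis_env"
in_expected_env = ENV_NAME in sys.prefix
print("Running inside", ENV_NAME, ":", in_expected_env)
```

If the last line prints False, the notebook was probably launched from a different environment in the Home tab.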