Data Science Lab
1. Problem Definition: What is the problem you're trying to solve? (e.g., predict house prices, classify
email as spam or not).
2. Data Collection: Gathering the relevant data.
3. Data Cleaning and Preprocessing: Cleaning the data by handling missing values, removing outliers,
and transforming variables.
4. Exploratory Data Analysis (EDA): Summarizing the dataset using statistical methods and
visualizations.
5. Modeling: Building machine learning models and evaluating their performance.
6. Model Deployment: Putting models into production, where they are used to make real-time
predictions.
Example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Descriptive Statistics
- Python:
# Basic statistics
print(df.describe())
- R:
# Basic statistics
summary(data)
Visualization
- Python:
import matplotlib.pyplot as plt
# Plotting a histogram
df['Age'].hist(bins=10)
plt.title("Age Distribution")
plt.show()
# Load dataset
df = pd.read_csv('data.csv')
# Clean data
df_clean = df.dropna()
# Summary statistics
print(df_clean.describe())
# Visualization
df_clean['Age'].hist(bins=10)
plt.title("Cleaned Age Distribution")
plt.show()
7. Version Control with Git
What is Git and GitHub?
- Git is a version control system that helps track changes in code and collaborate with others.
- GitHub is an online platform to store and share Git repositories.
# Stage and commit changes
git add .
git commit -m "Initial commit"
# Push to GitHub
git push origin main
Learning Objectives:
By the end of this module, students should be able to:
- Understand the importance of data cleaning and preprocessing in data science.
- Handle missing data and outliers.
- Normalize and standardize data for machine learning models.
- Encode categorical data into numerical formats.
- Perform feature engineering to create new features from existing data.
Lab Exercise:
- Load a dataset with missing values (e.g., the Titanic dataset).
- Check for missing values and handle them by either dropping or filling with the mean/median.
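A minimal sketch for this exercise, assuming the Titanic data is saved locally as 'titanic.csv' and has an 'Age' column (both are assumptions):
import pandas as pd
df = pd.read_csv('titanic.csv')   # load dataset (filename assumed)
print(df.isnull().sum())          # count missing values per column
df_dropped = df.dropna()          # option 1: drop rows with missing values
df_filled = df.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())  # option 2: fill with the median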
3. Handling Duplicates
Duplicate data can distort your analysis by providing inaccurate results. It’s important to check for
and remove duplicate rows in your dataset.
Removing Duplicates:
df.drop_duplicates(inplace=True) # Drop duplicate rows
Lab Exercise:
- Load a dataset and check for duplicates.
- Remove any duplicate rows from the dataset.
4. Handling Outliers
Outliers are extreme values that deviate significantly from the rest of the data. These can affect
statistical analyses and machine learning models.
Identifying Outliers:
- Boxplot is a common method to visualize outliers.
- Z-score: If the Z-score of a data point is greater than 3 or less than -3, it is often considered an
outlier.
from scipy import stats
z_scores = stats.zscore(df['Age'])
outliers = df[abs(z_scores) > 3] # Identify outliers
- IQR (Interquartile Range): Values that fall outside the range of 1.5 * IQR above the 75th percentile
or below the 25th percentile are considered outliers.
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Age'] < (Q1 - 1.5 * IQR)) | (df['Age'] > (Q3 + 1.5 * IQR))]
Handling Outliers:
1. Remove Outliers:
df = df[(df['Age'] >= (Q1 - 1.5 * IQR)) & (df['Age'] <= (Q3 + 1.5 * IQR))]
2. Cap Outliers: Cap values to a predefined limit, such as setting values beyond the 99th percentile to
the 99th percentile value.
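A minimal capping sketch, assuming the 'Age' column and a 99th-percentile cap:
upper_cap = df['Age'].quantile(0.99)         # value of the 99th percentile
df['Age'] = df['Age'].clip(upper=upper_cap)  # replace values above the cap with the cap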
Lab Exercise:
- Identify outliers in a numerical column using boxplots or Z-scores.
- Remove or cap the outliers accordingly.
5. Data Transformation
Data transformation includes techniques like normalization, standardization, and scaling to adjust
numerical features to be on the same scale.
Normalization vs Standardization:
- Normalization: Rescaling features to a [0, 1] range.
- Standardization: Rescaling features to have a mean of 0 and a standard deviation of 1.
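A minimal sketch of both approaches with scikit-learn, assuming an 'Age' column whose missing values have already been handled:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df['Age_norm'] = MinMaxScaler().fit_transform(df[['Age']]).ravel()   # rescaled to [0, 1]
df['Age_std'] = StandardScaler().fit_transform(df[['Age']]).ravel()  # mean 0, standard deviation 1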
Lab Exercise:
- Normalize or standardize a numerical column (e.g., 'Age') using the appropriate method.
6. Encoding Categorical Data
Most machine learning algorithms work with numerical data, so categorical variables need to be
converted to numeric values. Common methods include:
1. Label Encoding: Converts each category into an integer code.
2. One-Hot Encoding: Creates a separate binary (0/1) column for each category, e.g. with pandas:
df = pd.get_dummies(df, columns=['Sex'])
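A minimal label-encoding sketch with scikit-learn, assuming a 'Sex' column with no missing values (get_dummies above performs one-hot encoding):
from sklearn.preprocessing import LabelEncoder
df['Sex_encoded'] = LabelEncoder().fit_transform(df['Sex'])  # e.g., 'female' -> 0, 'male' -> 1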
Lab Exercise:
- Perform label encoding or one-hot encoding on a categorical variable (e.g., 'Sex').
7. Feature Engineering
Feature engineering involves creating new features from existing data that may be more informative
for the model.
- Date and Time Features: Extract features like day, month, or year from a datetime column.
df['Year'] = pd.to_datetime(df['Date']).dt.year
Lab Exercise:
- Create a new feature by combining two existing columns (e.g., creating an "Age-Group" from 'Age').
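A minimal sketch for the Age-Group exercise using pd.cut; the bin edges and labels are assumptions:
bins = [0, 12, 18, 40, 60, 120]                                # age boundaries (assumed)
labels = ['Child', 'Teen', 'Adult', 'Middle-aged', 'Senior']
df['Age-Group'] = pd.cut(df['Age'], bins=bins, labels=labels)  # categorical age groups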
Learning Objectives:
By the end of this module, students should be able to:
- Understand the significance of Exploratory Data Analysis (EDA) in data science.
- Use Python tools to generate descriptive statistics of the dataset.
- Visualize data using various plotting techniques (e.g., histograms, box plots, scatter plots).
- Identify correlations between variables and interpret the results.
- Understand the importance of visualizing distributions, trends, and outliers in the dataset.
1. Introduction to EDA
Exploratory Data Analysis (EDA) is the process of analyzing a dataset to summarize its main
characteristics, often using visual methods. The goal of EDA is to gain insights, detect patterns, test
hypotheses, and check assumptions through the use of statistical graphics, plots, and information
tables.
2. Descriptive Statistics
Descriptive statistics help in summarizing the main characteristics of the data and providing a quick
overview.
1. Mean:
df['Age'].mean()  # Average value of the 'Age' column
2. Median:
df['Age'].median()  # Middle value of the 'Age' column
3. Mode:
df['Sex'].mode()  # Most frequent category in the 'Sex' column
Lab Exercise:
- Load a dataset (e.g., Titanic dataset).
- Generate the summary statistics (mean, median, mode, standard deviation).
- Check the skewness and kurtosis of a numerical column.
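A minimal sketch for the skewness and kurtosis check, assuming the 'Age' column:
print('Skewness:', df['Age'].skew())  # asymmetry of the distribution
print('Kurtosis:', df['Age'].kurt())  # heaviness of the tails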
3. Data Visualization
Visualization helps in understanding the data's patterns, distributions, and relationships more clearly
than just numerical summaries. We will use popular Python libraries like Matplotlib and Seaborn for
plotting.
1. Histograms:
- Histograms show the distribution of a single numerical variable.
df['Age'].hist(bins=20)
plt.title('Age Distribution')
plt.show()
2. Box Plots:
- Box plots display the spread of data and identify outliers.
import seaborn as sns
sns.boxplot(x='Survived', y='Age', data=df)
plt.title('Box Plot of Age by Survival Status')
plt.show()
3. Scatter Plots:
- Scatter plots help in identifying relationships between two continuous variables.
df.plot(kind='scatter', x='Age', y='Fare', alpha=0.5)
plt.title('Scatter Plot of Age vs Fare')
plt.show()
4. Correlation Matrix:
- Correlation matrices show relationships between numerical variables.
correlation_matrix = df.select_dtypes(include='number').corr()  # correlations between numeric columns only
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Lab Exercise:
- Create a histogram for a numerical column (e.g., 'Age').
- Create a box plot comparing a numerical feature ('Age') with a categorical feature ('Survived').
- Generate a scatter plot between two continuous variables (e.g., 'Age' vs 'Fare').
- Create a correlation matrix for the numerical features in the dataset.
4. Univariate Analysis
Univariate analysis involves analyzing the distribution of a single variable to understand its
characteristics and behavior.
Visualizing Distributions:
1. Histogram and Density Plot:
- Use both histogram and KDE (Kernel Density Estimation) plot to visualize the distribution of a
numerical variable.
sns.histplot(df['Age'], kde=True, bins=30)
plt.title('Age Distribution with KDE')
plt.show()
2. Box Plot:
- A box plot can show the spread and any outliers in a single variable.
sns.boxplot(x=df['Age'])
plt.title('Box Plot of Age')
plt.show()
Lab Exercise:
- Plot a histogram and KDE plot for a numerical variable (e.g., 'Age').
- Create a box plot for the same variable to check for outliers.
5. Bivariate Analysis
Bivariate analysis looks at the relationship between two variables, both numerical and categorical, to
understand how they interact.
Visualizing Relationships:
1. Scatter Plot: For two numerical features.
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Scatter Plot of Age vs Fare')
plt.show()
Lab Exercise:
- Create a scatter plot between two numerical variables (e.g., 'Age' vs 'Fare').
- Generate a bar plot showing the relationship between a categorical variable and a numerical
variable (e.g., 'Survived' vs 'Age').
- Use a pair plot to explore relationships between multiple numerical features.
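Minimal sketches for the bar plot and pair plot exercises, using the Titanic column names assumed throughout this section:
sns.barplot(x='Survived', y='Age', data=df)  # mean Age per survival outcome
plt.title('Average Age by Survival Status')
plt.show()
sns.pairplot(df[['Age', 'Fare', 'Pclass']].dropna())  # pairwise relationships between numerical features
plt.show()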
6. Identifying Correlations and Insights
Correlation analysis helps in understanding the linear relationship between numerical variables. A
correlation coefficient close to +1 or -1 indicates a strong relationship, while a value close to 0
indicates a weak relationship.
Correlation Heatmap:
- Visualize correlations between all numerical features using a heatmap.
corr_matrix = df.select_dtypes(include='number').corr()  # numeric columns only
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Lab Exercise:
- Generate a correlation matrix and heatmap for numerical features in the dataset.
- Interpret the results of the heatmap to identify strong or weak correlations between features.
1. Count Plot: To show the frequency of each category.
sns.countplot(x='Survived', data=df)
plt.title('Count of Survival Outcomes')
plt.show()
2. Stacked Bar Plot: To examine the relationship between two categorical variables.
pd.crosstab(df['Pclass'], df['Survived']).plot(kind='bar', stacked=True)
plt.title('Survival Count by Pclass')
plt.show()
Lab Exercise:
- Create a count plot to visualize the distribution of a categorical variable (e.g., 'Survived').
- Create a stacked bar plot to examine the relationship between two categorical variables (e.g.,
'Pclass' vs 'Survived').
This module introduces students to the core techniques of Exploratory Data Analysis (EDA), which helps
to unlock insights from raw data. By focusing on Python-based visualization libraries like Matplotlib and
Seaborn, students will learn how to generate compelling visualizations and perform statistical analyses
to guide further modeling.
Learning Objectives:
By the end of this module, students should be able to:
- Understand the importance of feature selection and dimensionality reduction in machine learning.
- Apply various feature selection techniques to identify important variables.
- Use dimensionality reduction techniques (e.g., PCA, LDA) to reduce the number of features in a
dataset while retaining critical information.
- Improve model performance by eliminating redundant or irrelevant features.
Variance Thresholding:
- Features with very low variance (almost constant) provide little information. They can be removed.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
df_selected = selector.fit_transform(df.select_dtypes(include='number'))  # variance thresholding applies to numeric features
Lab Exercise:
- Load a dataset.
- Use correlation matrices and variance thresholds to remove irrelevant features.
- Apply the chi-square test on categorical features to select the best features.
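A minimal chi-square selection sketch, assuming X holds non-negative (e.g., one-hot encoded) categorical features and y is the target:
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=3)      # keep the 3 highest-scoring features (k is an assumption)
X_best = selector.fit_transform(X, y)
print(selector.get_support())          # boolean mask of the selected features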
Lab Exercise:
- Use Recursive Feature Elimination (RFE) with a classifier (e.g., Logistic Regression) to select the top 5
features.
- Compare model performance with and without feature selection.
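A minimal RFE sketch with Logistic Regression as the estimator, assuming a prepared feature matrix X and target y:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # True for the 5 selected features
print(rfe.ranking_)   # rank 1 means the feature was selected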
4. Embedded Methods for Feature Selection
Embedded methods perform feature selection as part of the model training process. Some machine
learning algorithms automatically perform feature selection during training, such as Lasso Regression
or Decision Trees.
Decision Trees:
- Decision tree algorithms automatically perform feature selection by choosing the most important
features to split on.
- Python:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X, y)
feature_importance = tree.feature_importances_
Lab Exercise:
- Use Lasso regression to select features from a dataset.
- Apply a decision tree classifier and visualize the feature importance.
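Minimal sketches for both exercises, assuming X, y, the fitted tree from the example above, and a hypothetical feature_names list of column names:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)  # keeps features whose coefficients exceed the threshold
print(selector.get_support())
plt.bar(feature_names, tree.feature_importances_)       # visualize decision-tree feature importance
plt.xticks(rotation=45)
plt.show()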
Steps in PCA:
1. Standardize the Data: PCA is sensitive to the variance of features, so it's essential to standardize the
data.
2. Compute Covariance Matrix: Measures how the features vary together.
3. Eigenvalue and Eigenvector Decomposition: Compute the principal components.
4. Select Principal Components: Choose the components that explain most of the variance.
Python Implementation:
- Standardize the data:
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
- Apply PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # keep the first two principal components
X_pca = pca.fit_transform(X_scaled)
- Explained Variance:
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
Lab Exercise:
- Standardize the dataset and apply PCA.
- Reduce the dataset to 2 or 3 components and visualize the results.
- Check the explained variance ratio to understand how much information is retained.
Python Implementation:
- Apply LDA:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)  # n_components must be smaller than the number of classes
X_lda = lda.fit_transform(X, y)
Lab Exercise:
- Apply LDA to reduce dimensions and visualize the transformed dataset.
- Compare the results of PCA and LDA in terms of class separability.
While feature selection focuses on keeping the most relevant features, dimensionality reduction
combines features into a new set of variables that represent the data in fewer dimensions.
Lab Exercise:
- Compare the effect of feature selection (using Recursive Feature Elimination) and dimensionality
reduction (using PCA) on the performance of a machine learning model.
- Evaluate the models using cross-validation and compare the accuracy.
This module equips students with the tools and knowledge needed to effectively perform feature
selection and dimensionality reduction, which are critical steps in preparing data for machine learning
models. These techniques help to avoid overfitting, improve model accuracy, and reduce
computational overhead.
Learning Objectives:
By the end of this module, students should be able to:
- Understand the importance of model selection and the process of building machine learning
models.
- Build different types of models, including regression and classification models.
- Evaluate model performance using appropriate metrics.
- Understand the concepts of bias and variance, overfitting, and underfitting.
- Use techniques like cross-validation to evaluate models effectively.
1. Introduction to Model Building
Model building is a critical phase in the data science pipeline, where we apply algorithms to the
prepared data to make predictions. The choice of model depends on the type of problem at hand
(regression, classification, etc.).
Regression Models:
- Linear Regression: Predicts continuous output based on input features.
- Polynomial Regression: Extends linear regression to model nonlinear relationships.
Classification metrics (y_test and y_pred are assumed to come from a trained classifier):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')
- Overfitting occurs when a model learns the details and noise of the training data to the point where
it negatively impacts the performance on new data.
- Underfitting happens when a model is too simple to capture the underlying patterns of the data.
Signs of Overfitting:
- High accuracy on the training set but poor accuracy on the test set.
- The model is too complex, such as having too many features or a very high-degree polynomial in
regression.
Signs of Underfitting:
- Both training and test accuracy are low.
- The model is too simple to represent the underlying data, such as using linear regression for a
nonlinear problem.
5. Cross-Validation
Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to
an independent dataset. It helps in identifying whether a model is overfitting or underfitting.
K-Fold Cross-Validation:
In k-fold cross-validation, the dataset is divided into k subsets (folds). The model is trained on k-1
folds and tested on the remaining fold. This process is repeated k times, and the average
performance across all folds is reported.
Python Implementation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)
print('Cross-validation scores:', scores)
print('Mean score:', scores.mean())
6. Hyperparameter Tuning
Once you’ve selected and trained your model, you might need to adjust the model’s hyperparameters
for optimal performance.
Grid Search:
Grid search exhaustively searches through a specified subset of hyperparameters and evaluates
model performance using cross-validation.
Python Example:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.1, 1, 10], 'max_iter': [100, 200, 300]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f'Best hyperparameters: {grid_search.best_params_}')
Randomized Search:
Randomized search samples a fixed number of hyperparameter combinations from a specified grid,
providing a faster alternative to grid search.
7. Model Evaluation in Practice
In real-world applications, selecting the best model requires experimentation with different
algorithms, evaluation using cross-validation, and tuning of hyperparameters. It's also important to
consider factors such as model interpretability and computational cost.
Lab Exercise:
1. Build and evaluate a regression model:
- Use Linear Regression to predict a continuous variable (a minimal sketch follows this exercise list).
- Evaluate the model using MSE, RMSE, and R².
2. Build and evaluate a classification model:
- Use Logistic Regression for a binary classification task.
- Evaluate the model using Accuracy, Precision, Recall, and F1-score.
3. Model Comparison:
- Compare models like Decision Trees, Random Forest, KNN, and Logistic Regression using cross-
validation.
- Select the best-performing model based on the evaluation metrics.
4. Hyperparameter Tuning:
- Use GridSearchCV to find the best hyperparameters for a chosen model.
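A minimal sketch for exercise 1, assuming a prepared feature matrix X and continuous target y:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'MSE: {mse:.3f}, RMSE: {rmse:.3f}, R²: {r2_score(y_test, y_pred):.3f}')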
This module guides students through the fundamental process of building, evaluating, and tuning
machine learning models. By the end of the module, students will be able to select the right model,
evaluate its performance, and optimize it to achieve the best results.
Module 6: Model Optimization and Ensemble Methods
Learning Objectives:
By the end of this module, students should be able to:
- Understand the concept of model optimization and its importance in improving model performance.
- Apply regularization techniques like L1 and L2 to prevent overfitting.
- Utilize ensemble methods such as bagging, boosting, and stacking to enhance model accuracy.
- Implement advanced optimization techniques like GridSearchCV and RandomizedSearchCV.
- Understand the trade-offs between model complexity and performance.
---
2. Regularization Techniques
Regularization helps prevent overfitting by discouraging overly complex models, which would
otherwise learn the noise in the data.
L1 Regularization (Lasso):
- Adds a penalty proportional to the absolute values of the coefficients, which can shrink some coefficients exactly to zero.
Python Example:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
L2 Regularization (Ridge):
- Adds a penalty proportional to the squared coefficients, shrinking them toward zero without eliminating them.
Python Example:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
Elastic Net:
- Combines both L1 and L2 regularization, balancing the benefits of both methods.
Python Example:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)
y_pred = elastic_net.predict(X_test)
Lab Exercise:
- Apply Lasso, Ridge, and ElasticNet on a regression dataset.
- Compare the models’ performance using evaluation metrics like MSE and R².
- Visualize how regularization affects the coefficients of the models.
---
3. Hyperparameter Tuning
Once a model is built, it is crucial to optimize the model’s hyperparameters to improve performance.
Hyperparameters are external configurations that cannot be learned from the data (e.g.,
regularization strength, learning rate).
GridSearchCV:
- Grid search exhaustively tests all combinations of a predefined set of hyperparameters.
Python Example:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.1, 1, 10], 'max_iter': [100, 200, 300]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
RandomizedSearchCV:
- Randomized search samples a fixed number of hyperparameter combinations from a grid, making it
faster than exhaustive grid search.
Python Example:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
param_dist = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30]}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, cv=5, n_iter=10)
random_search.fit(X_train, y_train)
print(f'Best parameters: {random_search.best_params_}')
Lab Exercise:
- Perform hyperparameter tuning using GridSearchCV and RandomizedSearchCV on models like
Random Forest or SVM.
- Evaluate the impact of hyperparameter tuning on model performance.
---
Ensemble methods combine multiple models to improve prediction accuracy. The primary idea
behind ensemble learning is to reduce the risk of overfitting and increase model robustness by
aggregating predictions from several base models.
Bagging:
- Bagging trains multiple models on bootstrapped samples of the data and aggregates their predictions; Random Forest is a common bagging method.
Python Example:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
Boosting:
- Boosting reduces bias by training models sequentially, where each new model corrects the errors of
the previous one. Common boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
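A minimal boosting sketch with scikit-learn's GradientBoostingClassifier, assuming X_train, X_test, y_train, y_test from an earlier split:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)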
Lab Exercise:
- Implement and compare the performance of Bagging (Random Forest), Boosting (Gradient
Boosting), and Stacking.
- Evaluate the models using appropriate classification metrics (accuracy, precision, recall, etc.).
---
The Bias-Variance Trade-off is an essential concept in machine learning that refers to the balance
between two types of errors:
1. Bias: Error due to overly simplistic models that cannot capture the underlying patterns
(underfitting).
2. Variance: Error due to overly complex models that overfit the training data and do not generalize
well to unseen data.
- Overfitting occurs when a model is too complex, resulting in low bias but high variance.
- Underfitting occurs when a model is too simple, resulting in high bias but low variance.
---
Learning Curves:
- Learning curves help visualize the training and validation error over time or with the size of the
dataset. These curves can help determine if a model is overfitting or underfitting.
Python Example:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
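The snippet above only imports the tools; a minimal sketch of how learning_curve might be used, assuming a prepared X and y and a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
train_sizes, train_scores, val_scores = learning_curve(LogisticRegression(max_iter=1000), X, y, cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()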
This module provides students with the tools to optimize their models, use ensemble methods for
better performance, and understand critical concepts like regularization, hyperparameter tuning, and
the bias-variance trade-off. These techniques are essential for building robust machine learning
models in real-world scenarios.
TUTORIAL
Conceptual Questions:
1. What is Data Science, and why is it important in today’s world?
2. Explain the difference between structured and unstructured data. Provide examples.
3. What are the core steps in the Data Science workflow?
4. What are the key differences between Python and R in Data Science?
5. How does Python’s pandas library facilitate data manipulation and analysis?
6. What are some common data sources used in Data Science projects?
7. Why is it important to clean and preprocess data before applying any machine learning algorithms?
Practical Questions:
1. Install and import the following libraries in Python: pandas, numpy, matplotlib, and seaborn.
2. Load a CSV file into a pandas DataFrame and display the first 5 rows.
3. Write a Python function to remove missing values from a DataFrame.
4. Plot a histogram to show the distribution of a numerical column in a dataset using matplotlib.
5. What is the purpose of the head() and describe() functions in pandas? Explain with examples.
6. Use numpy to create an array of random numbers between 0 and 10. Display the first 5 values.
Conceptual Questions:
1. What are the key steps involved in data preprocessing, and why are they important?
2. Explain the difference between normalization and standardization.
3. What is the purpose of handling missing data in a dataset?
4. What are outliers, and why is it important to identify and handle them?
5. Why is feature scaling necessary for certain machine learning algorithms (e.g., KNN, SVM)?
Practical Questions:
1. Load a dataset with missing values and apply imputation to handle them using pandas.
2. Standardize a numerical column using sklearn's StandardScaler.
3. Perform one-hot encoding for a categorical variable in a dataset using pandas.
4. Create a scatter plot to visually identify potential outliers in a dataset using matplotlib and seaborn.
5. Use the fillna() method to fill missing values with the mean of the column in a pandas DataFrame.
6. Check for duplicates in a dataset and remove them using pandas.
Practical Questions:
1. Plot a bar chart to show the distribution of a categorical variable in a dataset using seaborn.
2. Create a pair plot to visualize the relationships between multiple numerical features using seaborn.
3. Create a box plot to identify the spread and outliers of a numerical feature using seaborn.
4. Generate a correlation matrix heatmap for a dataset and interpret the relationships between
features.
5. Plot a line chart to display trends over time for a time-series dataset using matplotlib.
6. Use seaborn to visualize a heatmap of missing data in a dataset.
Module 4: Feature Engineering and Dimensionality Reduction
Conceptual Questions:
1. What is feature engineering, and how does it contribute to the performance of machine learning
models?
2. What is the difference between feature selection and feature extraction?
3. What is Principal Component Analysis (PCA), and how does it help with dimensionality reduction?
4. Explain how you would handle categorical data when preparing it for machine learning algorithms.
5. What is the curse of dimensionality, and why does it occur?
Practical Questions:
1. Apply one-hot encoding to a categorical variable and explain its purpose in machine learning.
2. Perform feature scaling (e.g., Min-Max Scaling) on a numerical column using sklearn.
3. Use PCA to reduce the dimensions of a dataset to 2 components and visualize the transformed
data in a 2D plot.
4. Perform feature selection using Recursive Feature Elimination (RFE) on a dataset and explain the
steps.
5. Create new features based on existing columns (e.g., creating a "year" feature from a date column).
6. Use the SelectKBest method from sklearn to select the top K features based on a statistical test.
Conceptual Questions:
1. What is the difference between supervised and unsupervised learning? Provide examples of each.
2. What is the bias-variance trade-off, and how does it affect model performance?
3. Explain the concept of overfitting and how it can be avoided in machine learning models.
4. What are the main differences between regression and classification problems?
5. What is the purpose of cross-validation, and how does it help in model evaluation?
Practical Questions:
1. Build a Linear Regression model using sklearn and evaluate it using Mean Squared Error (MSE) and
R².
2. Train a Logistic Regression model for binary classification and evaluate it using Accuracy, Precision,
Recall, and F1-score.
3. Use k-Nearest Neighbors (KNN) for classification and determine the optimal value of K using cross-
validation.
4. Create a confusion matrix for a classification model and interpret the results.
5. Evaluate the performance of a model using cross-validation and interpret the output.
6. Build and evaluate a Random Forest model for a regression problem using R² and Mean Absolute
Error (MAE).
7. Train a model, tune its hyperparameters using GridSearchCV, and evaluate the performance using
appropriate metrics.