DCS CSED
Machine Learning Project Workflow
1. Import Libraries and Load the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('path/to/your/data.csv')
2. Data Exploration
1. Initial Data Inspection: Examine the dataset's shape and columns.
data.head()
data.info()
<data.info() lists each column's dtype and non-null count, plus a count of columns per dtype, which gives the number of numeric and categorical (object) variables directly>
<variables/attributes are columns; records are rows>
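The inspection step can be sketched as follows, using a small made-up DataFrame in place of the real dataset (the column names and values are hypothetical). It shows shape and columns, which the step heading mentions, and one way to count numeric versus categorical variables.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'salary': [30000.0, 42000.0, 58000.0, 61000.0],
    'city': ['Pune', 'Delhi', 'Pune', 'Mumbai'],
})

n_rows, n_cols = data.shape          # (number of records, number of variables)
column_names = list(data.columns)    # variable names

# Split the columns into numeric and non-numeric (categorical) variables
numeric_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(exclude='number').columns.tolist()

print(n_rows, n_cols)
print(numeric_cols, categorical_cols)
```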
5 point summary:
data.describe()
numeric:
<min, max values>
<50th percentile (median)>
<25th and 75th percentiles (Q1 and Q3)>
<std, mean>
DYSMECH COMPETENCY SERVICES PVT. LTD. 2
categorical:
data.describe(include='O')
data.describe(include=object)
<number of categories present in the variable>
<the top category with highest freq>
<freq of the top category>
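As an illustration of what the two summaries contain, here is a minimal sketch on made-up data (the columns score and grade are hypothetical). It calls describe() on one numeric and one categorical column and reads off the fields listed above.

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical variable
data = pd.DataFrame({
    'score': [10, 20, 30, 40, 50],
    'grade': ['A', 'B', 'A', 'C', 'A'],
})

# Numeric summary: count, mean, std, min, 25%, 50%, 75%, max
num_summary = data['score'].describe()

# Categorical summary: count, unique, top, freq
cat_summary = data['grade'].describe()

print(num_summary)
print(cat_summary)
```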
3. Identify Missing Values: Check for missing values in each column.
-data.isnull().sum()
# will tell you column wise count of missing values.
-data.isnull().sum(axis=1)
# will tell you count of missing values in each record.
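A small sketch of both counts on made-up data (column names a, b, c are hypothetical). The percentage of missing values per column is added as an assumption about what is useful here; it is not part of the original steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
data = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
    'c': [7, 8, 9],
})

missing_per_column = data.isnull().sum()        # count per variable
missing_per_row = data.isnull().sum(axis=1)     # count per record
missing_pct = data.isnull().mean() * 100        # share of missing values per variable, in percent

print(missing_per_column)
print(missing_per_row)
print(missing_pct.round(2))
```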
Missing value treatment:
1. Drop:
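data = data.dropna(axis=0, how='any', subset=['column'])  # drop rows where 'column' is missing; how='all' drops a row only if every subset column is missing
data = data.dropna(axis=1, thresh=num)  # keep only columns with at least num non-missing values (how and thresh cannot be passed together)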
2. Impute:
-mean/median for numeric
data['column'] = data['column'].fillna(data['column'].median())  # or .mean()
-mode for categorical
data['column'] = data['column'].fillna(data['column'].mode()[0])
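The imputation options above can be sketched on made-up data as follows (the columns income and city are hypothetical). The numeric column uses SimpleImputer, which is imported in step 1; the categorical column uses the pandas mode. This is one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column
data = pd.DataFrame({
    'income': [10.0, 20.0, np.nan, 40.0, 1000.0],
    'city': ['Pune', 'Delhi', None, 'Pune', 'Pune'],
})

# Numeric: the median is less affected by the extreme value 1000 than the mean
num_imputer = SimpleImputer(strategy='median')
data['income'] = num_imputer.fit_transform(data[['income']]).ravel()

# Categorical: fill with the most frequent category
data['city'] = data['city'].fillna(data['city'].mode()[0])

print(data)
```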
4. EDA: Follow EDA Cheat sheet for that
1. Measure of Central Tendency- Mean, Median, Mode
2. Distribution of Data – using Visualization technique
a. Univariate Analysis
b. Bivariate Analysis
c. Multivariate Analysis
3. Dispersion of Data- min, max, range, variance, standard deviation, coefficient of variation
4. Skewness and Kurtosis
5. Covariance and Correlation
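The numeric measures in this list can be computed directly with pandas. The sketch below uses made-up columns x and y (hypothetical names); note that pandas reports sample variance and standard deviation (ddof=1) and excess kurtosis by default.

```python
import pandas as pd

# Hypothetical toy data: y is exactly 2 * x
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
})

# Central tendency
mean_x = data['x'].mean()
median_x = data['x'].median()

# Dispersion
value_range = data['x'].max() - data['x'].min()
variance = data['x'].var()          # sample variance (ddof=1)
std_dev = data['x'].std()           # sample standard deviation
coeff_var = std_dev / mean_x        # coefficient of variation

# Shape of the distribution
skewness = data['x'].skew()         # 0 for a symmetric distribution
kurtosis = data['x'].kurt()         # excess kurtosis (normal distribution = 0)

# Relationship between two variables
covariance = data['x'].cov(data['y'])
correlation = data['x'].corr(data['y'])

print(mean_x, median_x, value_range, variance, covariance, correlation)
```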
5. Identify outliers
using box plot
Treatment for Outliers
q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1
ul = q3 + 1.5 * iqr
ll = q1 - 1.5 * iqr
1. Drop
data = data[~((data['column'] < ll) | (data['column'] > ul))]
2. Capping
data['column'] = np.where(data['column'] > ul, ul,
                          np.where(data['column'] < ll, ll, data['column']))
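A worked example of the IQR rule on made-up values, where 100 is the only outlier. For capping it uses Series.clip, which gives the same result as the nested np.where above; that substitution is a choice made for this sketch.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier
data = pd.DataFrame({'column': [10, 12, 11, 13, 12, 11, 100]})

q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1
ul = q3 + 1.5 * iqr    # upper limit
ll = q1 - 1.5 * iqr    # lower limit

# Flag values outside [ll, ul]
is_outlier = (data['column'] < ll) | (data['column'] > ul)

# Option 1: drop the outlying records
dropped = data[~is_outlier]

# Option 2: cap the values at the limits (same effect as the nested np.where)
capped = data['column'].clip(lower=ll, upper=ul)

print(q1, q3, iqr, ll, ul)
print(len(dropped), capped.tolist())
```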
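6. Data Transformation
Log Transformation (requires strictly positive values; use np.log1p if zeros are present):
data['column'] = np.log(data['column'])
Box-Cox Transformation (requires strictly positive values):
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox')
data['transformed'] = pt.fit_transform(data[['column']]).ravel()
Yeo-Johnson Transformation (also handles zero and negative values):
pt = PowerTransformer(method='yeo-johnson')
data['transformed'] = pt.fit_transform(data[['column']]).ravel()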
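To see why a transformation is applied, the sketch below compares skewness before and after a log transform on made-up, strongly right-skewed values. The data is chosen so that the logged values are evenly spaced; real data will rarely become perfectly symmetric.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data (each value is 10x the previous one)
data = pd.DataFrame({'column': [1.0, 10.0, 100.0, 1000.0, 10000.0]})

skew_before = data['column'].skew()

# Log transform: valid here because every value is strictly positive
data['log_column'] = np.log(data['column'])
skew_after = data['log_column'].skew()

# np.log1p computes log(1 + x), which stays finite when x is 0
data['log1p_column'] = np.log1p(data['column'])

print(skew_before, skew_after)
```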
7. Scaling
Follow EDA Cheat sheet for that
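The cheat sheet itself is not reproduced here. As an assumption about what it covers, the sketch below shows two common scaling formulas, min-max scaling and standardization (z-score), written out in plain pandas on a hypothetical column x.

```python
import pandas as pd

# Hypothetical toy data
data = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0, 50.0]})
x = data['x']

# Min-max scaling: maps the values to the range [0, 1]
data['x_minmax'] = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance (population std, ddof=0)
data['x_zscore'] = (x - x.mean()) / x.std(ddof=0)

print(data)
```

When a train-test split is used, the minimum, maximum, mean, and standard deviation are usually taken from the training set only and then applied to the test set.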
8. Encoding
Follow EDA Cheat sheet for that
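The cheat sheet is not reproduced here. As an assumption about its content, the sketch below shows two common encodings on hypothetical columns: label encoding with LabelEncoder (imported in step 1) and one-hot encoding with pd.get_dummies.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two categorical variables
data = pd.DataFrame({
    'size': ['S', 'M', 'L', 'M'],
    'color': ['red', 'blue', 'red', 'green'],
})

# Label encoding: each category becomes an integer (classes are sorted alphabetically)
le = LabelEncoder()
data['size_label'] = le.fit_transform(data['size'])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(data['color'], prefix='color')

print(list(le.classes_), data['size_label'].tolist())
print(one_hot)
```

Label encoding implies an order among the integers, so it is usually kept for ordinal variables or the target; one-hot encoding avoids that implied order for nominal variables.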
9. Train-Test Split
Follow EDA Cheat sheet for that
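The cheat sheet is not reproduced here. A minimal sketch of a split with scikit-learn's train_test_split, assuming an 80/20 split and hypothetical column names feature and target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature and a binary target
data = pd.DataFrame({
    'feature': np.arange(10),
    'target': [0, 1] * 5,
})

X = data[['feature']]   # independent variables
y = data['target']      # target variable

# Hold out 20% of the records for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```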
10. Feature Scaling Explanation
Follow EDA Cheat sheet for that
11. Apply the Algorithm according to target variable
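One rough way to read "according to target variable" is: a continuous numeric target suggests regression, and a categorical or low-cardinality target suggests classification. The helper below, suggest_task, and its max_classes threshold are hypothetical and only illustrate that rule of thumb; the final choice still depends on the problem.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_task(target: pd.Series, max_classes: int = 10) -> str:
    """Hypothetical helper: guess the problem type from the target column."""
    # Numeric target with many distinct values: treat as continuous
    if is_numeric_dtype(target) and target.nunique() > max_classes:
        return 'regression'
    # Text labels, or numeric codes with few distinct values: treat as classes
    return 'classification'


price = pd.Series([i * 1.5 for i in range(50)])   # continuous target
label = pd.Series(['yes', 'no', 'yes', 'no'])      # text labels
binary = pd.Series([0, 1, 0, 1, 1])                # numeric class codes

print(suggest_task(price), suggest_task(label), suggest_task(binary))
```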
NOTE: Apply the above steps as relevant to your project. If a step is
not essential, skip it and proceed to the next one.