ML Reference Guide
Contents
Example Notebook – IPL Dataset Analysis
1. Data Collection / Ingestion
2. Data Understanding
3. Data Integration
4. Initial Validation
5. Exploratory Data Analysis (EDA)
6. Feature Engineering
7. Data Splitting
8. Model Training
9. Evaluation
10. Deployment & Monitoring
11. Summary
Example Notebook – IPL Dataset Analysis
1. Data Collection / Ingestion
Source Identification:
• Interview Tip: Be ready to justify why multiple sources improve richness but may increase
complexity and inconsistency.
Initial Format Validation:
• Verify row & column counts, data types, and missing values.
• Maintain ingestion logs: source, extraction timestamp, access credentials, file format,
preprocessing steps.
• Create and update a data dictionary: column names, types, allowed values, and
descriptions.
Common Challenges:
Example Code:
import pandas as pd
df = pd.read_csv("data/file.csv")
# Quick checks: row/column counts, data types, missing values
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
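The ingestion log and data dictionary described above can also be maintained programmatically. A minimal sketch, assuming the DataFrame loaded earlier and a hypothetical output file name:
import pandas as pd
# Build a starter data dictionary from the loaded DataFrame; descriptions are filled in manually
data_dictionary = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str).values,
    "non_null_count": df.notnull().sum().values,
})
data_dictionary["description"] = ""
data_dictionary.to_csv("data_dictionary.csv", index=False)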
Tools:
• Database connectors: psycopg2, pymongo, SQLAlchemy & more
• API tools: Postman, REST clients
Interview Questions:
3. What strategies would you use for integrating multiple data sources with inconsistent
formats?
4. How would you deal with very large datasets that cannot fit into memory?
2. Data Understanding
Purpose:
Gain a comprehensive view of the dataset to understand its structure, quality, and potential
challenges before cleaning or modeling. This step helps in identifying important features, spotting
anomalies, and planning preprocessing strategies.
• Check rows vs columns: large number of features may require dimensionality reduction.
• Assess dataset size sufficiency: too few samples may cause overfitting; too many may need
sampling or distributed processing.
Column Inspection:
• Identify irrelevant columns like unique IDs, timestamps, or metadata that don’t contribute to
modeling.
o Datetime: timestamps
• Check for mixed data types in a column (e.g., strings in numeric columns).
o Binary classification
o Multi-class classification
o Regression
• Examine distribution of target variable: check for imbalance that may affect model
performance.
• List columns for feature engineering, e.g., combining, binning, or deriving new features.
Example Code:
df.info()  # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for numeric columns
print(df['target'].value_counts())  # distribution of a single column (replace 'target' with any column of interest)
Tools:
Interview Tips:
2. How would you handle imbalanced classes? (undersampling, oversampling, SMOTE, class weighting; see the sketch below)
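A minimal sketch of the common imbalance-handling options, assuming a binary 'target' column and a standard scikit-learn setup (SMOTE requires the separate imbalanced-learn package):
from sklearn.linear_model import LogisticRegression
# from imblearn.over_sampling import SMOTE  # optional dependency

# Inspect class balance first
print(df['target'].value_counts(normalize=True))

# Option 1: class weighting, supported by many scikit-learn estimators
clf = LogisticRegression(max_iter=1000, class_weight='balanced')

# Option 2: oversample the minority class, applied to the training split only
# X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)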
3. Data Integration
Purpose:
Combine data from multiple sources into a single, coherent dataset to provide a holistic view for
analysis and modeling. Proper integration ensures relationships across different datasets are
preserved, data consistency is maintained, and no critical information is lost or duplicated.
Sub-Steps & Activities:
• Verify uniqueness and consistency of keys - duplicate or missing keys can lead to incorrect
joins.
• Standardize key formats (e.g., string casing, trimming spaces, consistent data types) before
joining.
o Left Join: Keeps all records from the left table, adds matches from the right.
o Right Join: Keeps all records from the right table, adds matches from the left.
o Outer Join: Keeps all records from both sides, filling missing values with NaN.
• Validate the join logic by checking row counts before and after merging.
• Combine multiple datasets with the same schema (columns) using pd.concat() or union
operations in SQL.
• Ensure column ordering and data types are consistent before concatenation.
• Resolve naming conflicts: standardize column names, units, and value representations.
• Address semantic conflicts where the same column name means different things across
datasets.
Validate Integration:
• Verify row counts and unique key counts before and after merging.
• Perform sanity checks by sampling merged data and verifying correctness of combined
values.
Common Scenarios:
• Integrating external datasets (e.g., weather, market data, economic indicators) to enrich
features.
Example Code:
import pandas as pd
merged = pd.merge(left_df, right_df, on='key', how='left')  # placeholder frames and key
# Validate integration: compare row counts and confirm key uniqueness
print(len(left_df), len(merged), merged['key'].is_unique)
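The concatenation and naming-conflict steps above can be sketched the same way; a minimal example assuming two hypothetical frames df_a and df_b whose columns need to be reconciled first:
import pandas as pd
# Standardize column names before stacking datasets with the same schema
df_b = df_b.rename(columns={"cust_id": "customer_id", "amt": "amount"})
df_b = df_b[df_a.columns]  # enforce identical column order
combined = pd.concat([df_a, df_b], ignore_index=True)
# Sanity check: row counts add up after the union
assert len(combined) == len(df_a) + len(df_b)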
Tools:
Interview Tips:
1. How do you ensure no data duplication or loss after merging multiple sources?
2. What strategies would you use to resolve naming conflicts or mismatched schemas?
3. How would you integrate external data that updates at a different frequency than internal
data?
4. Initial Validation
Purpose:
Perform a systematic quality check of the ingested and integrated data before deeper analysis or
feature engineering. The goal is to detect, document, and quantify potential data issues early, so they
can be addressed before they affect downstream modeling or business decisions.
• Prioritize critical columns (e.g., target variable, primary keys) for immediate attention.
• Decide whether to remove, aggregate, or investigate duplicates depending on the use case.
• Ensure no key-level duplication (e.g., multiple records for the same unique ID unless
expected).
• Confirm that each column matches the expected data type (int, float, datetime, string):
df.dtypes.
• Identify invalid entries (e.g., text in numeric columns, malformed dates, inconsistent
encodings).
• Enforce schema consistency, especially when integrating data from multiple sources.
Outlier Detection:
• Validate whether outliers are genuine signals (e.g., high-value customers) or data entry
errors.
• Validate categorical constraints: all category values fall within expected sets.
• Cross-verify related columns (e.g., total = sum of components, age derived from date of
birth).
Completeness Assessment:
• Measure the coverage of critical fields (e.g., >95% non-null for key features).
• Flag columns with insufficient data for downstream usage or imputation strategies.
• This becomes a reference document for data cleaning and pipeline improvements.
Example Code:
import pandas as pd
print(df.isnull().sum())       # Missing values per column
print(df.duplicated().sum())   # Duplicate rows
print(df.dtypes)               # Data types
# Outlier bounds via the IQR rule for a numeric column (placeholder name)
Q1, Q3 = df['amount'].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df['amount'] < Q1 - 1.5 * IQR) | (df['amount'] > Q3 + 1.5 * IQR)]
print(len(outliers))
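The categorical and cross-field checks mentioned above can be asserted directly; a small sketch with hypothetical column names and an allowed-value set:
import numpy as np
# Categorical constraint: every value falls within the expected set
allowed = {"M", "F", "Other"}
print("invalid categories:", (~df['gender'].isin(allowed)).sum())
# Cross-field consistency: a total column should equal the sum of its components
mismatch = ~np.isclose(df['total'], df['part_a'] + df['part_b'])
print("inconsistent totals:", mismatch.sum())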
Output:
Tools:
Interview Tips:
• Outlier Treatment:
• Standardize Formats:
• Deduplication:
5. Exploratory Data Analysis (EDA)
o Histograms and density plots → Understand skewness, modality, range.
Statistical Analysis:
• Correlation Matrices:
• Distribution Analysis:
• Group-by Aggregations:
Example Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: Histogram
df['age'].hist(bins=30)
plt.title("Age Distribution")
plt.show()

# Correlation heatmap of numeric features
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Target Distribution
df['target'].value_counts().plot(kind='bar')
plt.show()
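The group-by aggregations listed above can be sketched in a single chained call; the column names here are placeholders:
# Summary statistics per category, sorted by the group mean
summary = (
    df.groupby('city')['amount']
      .agg(['count', 'mean', 'median', 'std'])
      .sort_values('mean', ascending=False)
)
print(summary.head())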
Tools:
• Pandas, NumPy
• Visualization: Matplotlib, Seaborn, Plotly, Altair
• Statistical analysis: SciPy, statsmodels
• Sweetviz, D-Tale (automated EDA)
• Jupyter notebooks for interactive analysis
Interview Tips:
6. Feature Engineering
Purpose:
Transform raw data into meaningful, machine-readable features that improve model accuracy, robustness, and interpretability. Feature engineering is usually the most impactful step in the ML pipeline: good features can make simple models powerful, while poor features can break even the most advanced ones.
• Derived Features:
o Calculate new variables from existing ones (e.g., age from birthdate,
days_since_signup, duration = end_date - start_date).
• Time-based Features:
o Create lag, rolling mean, or cumulative sum features for time series.
• Interaction Features:
o Create interaction terms such as the product or ratio of two features (price_per_unit
= price / quantity).
• Aggregations:
• Domain-Specific Features:
Feature Transformation:
o Target Encoding: Replace categories with aggregated target statistics (use cross-
validation to avoid leakage).
• Log/Power Transformations:
• Binning:
o Convert continuous features into categorical buckets (e.g., age groups, income
ranges).
• Polynomial Features:
Feature Selection:
• Filter Methods:
o Eliminate highly correlated features (e.g., |corr| > 0.9 to reduce multicollinearity).
• Statistical Tests:
• Wrapper/Embedded Methods:
Text & Special Feature Processing:
• Text Data:
o Extract linguistic features like word count, sentiment score, or named entities.
• Datetime Features:
o Create cyclic encodings for periodic data (e.g., sine/cosine transforms for month or
hour).
Example Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature creation: derive a new column from existing ones
df['price_per_unit'] = df['price'] / df['quantity']

# Encoding
df = pd.get_dummies(df, columns=['city'])

# Scaling
scaler = StandardScaler()
df[['income_scaled']] = scaler.fit_transform(df[['income']])

# Log transform
df['log_amount'] = np.log1p(df['amount'])
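The out-of-fold target encoding mentioned earlier (replace categories with target statistics while avoiding leakage) can be sketched as below; the column names and the helper itself are illustrative, not a library function:
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5):
    # Each row is encoded using target means computed from the other folds only
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(fold_means).values
    return encoded.fillna(df[target_col].mean())  # unseen categories fall back to the global mean

# df['city_encoded'] = target_encode(df, 'city', 'target')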
Tools:
• TPOT, AutoML libraries
• Text: NLTK, spaCy, TfidfVectorizer
• Images: OpenCV, PIL, scikit-image
Interview Tips:
1. How would you decide between one-hot encoding and target encoding?
7. Data Splitting
Purpose:
Separate data into training, validation, and test sets so models are trained fairly, tuned without bias,
and evaluated on truly unseen data. Proper splitting prevents leakage and gives accurate
generalization estimates.
• Train / Test Split: allocate ~70–80% training and 20–30% testing for a basic evaluation.
• Validation Set (3-way split): carve out a validation set for hyperparameter tuning (common:
60/20/20).
• Time-based Splitting: for time series, split chronologically (train on past, test on future);
never shuffle time-ordered data.
• Grouped Splits: use GroupKFold when samples are correlated by group (e.g., multiple rows
per user) to avoid leakage across folds.
• Reproducibility: fix random_state and document the split approach for traceability.
Important Considerations:
• Always guard against data leakage (no peeking at test data during preprocessing or feature
selection).
• Ensure sufficient samples per split (and per class for stratified splits).
• For imbalanced targets, consider stratification or specialized CV (repeated stratified CV) and
appropriate metrics.
• When using time-based features, ensure feature creation uses only past data (lag features
created from training windows).
Example Code:
from sklearn.model_selection import train_test_split, GroupKFold, TimeSeriesSplit, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Basic train/test split (add stratify=y for imbalanced targets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Time-based split: sort chronologically, train on the past, test on the future
df = df.sort_values('event_time')
train_size = int(len(df) * 0.8)
train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]

# GroupKFold and TimeSeriesSplit for grouped / temporal cross-validation
gkf = GroupKFold(n_splits=5)
ts = TimeSeriesSplit(n_splits=5)
# Keep preprocessing inside a pipeline so CV folds never leak information
cat_cols = ['city']
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
], remainder='passthrough')
pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean(), scores.std())
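The GroupKFold splitter above is used by passing a groups array so that rows from the same group never span train and validation; a short sketch assuming a hypothetical user_id column:
for train_idx, val_idx in gkf.split(X, y, groups=df['user_id']):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    pipeline.fit(X_tr, y_tr)
    print(pipeline.score(X_val, y_val))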
Tools:
• Custom scripts for temporal/spatial splits
Interview Tips:
1. Explain how you avoid data leakage (e.g., all feature engineering and scaling inside
pipelines).
2. Be prepared to justify stratification and when you’d use TimeSeriesSplit instead of random
CV.
3. If asked about grouped data, mention GroupKFold and why splitting by group matters
(prevents overly optimistic results).
4. Talk about reproducibility (random_state) and how you’d document the split strategy for
production/review.
8. Model Training
Purpose:
Train machine learning models on prepared data to capture underlying patterns and relationships
that can generalize to unseen data. This step is where the algorithm learns from the training set and
becomes capable of making predictions.
Model Selection:
Training Process:
• Track training metrics (accuracy, precision, recall, loss, RMSE, AUC, etc.) to detect
convergence or underfitting/overfitting.
• If using deep learning, monitor epochs, learning rate schedules, and early stopping criteria.
Hyperparameter Tuning:
Advanced Considerations:
• Automation: Use frameworks like AutoML (e.g., auto-sklearn, H2O.ai, Vertex AI AutoML).
Example Code:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Hyperparameter tuning (illustrative grid)
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Cross-validation example
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X_train, y_train, cv=5).mean())

# Compare several candidate models on the same split
models = {"RandomForest": RandomForestClassifier(random_state=42),
          "GradientBoosting": GradientBoostingClassifier(),
          "XGBoost": XGBClassifier(eval_metric='logloss')}
for name, m in models.items():
    m.fit(X_train, y_train)
    print(name, m.score(X_test, y_test))
Supervised Learning - Regression:
• Linear Regression
• Ridge Regression, Lasso Regression
• ElasticNet
• Support Vector Regression (SVR)
• Decision Trees
• Random Forest
• Gradient Boosting (XGBoost, LightGBM, CatBoost)
• Neural Networks (MLP Regressor)
Supervised Learning - Classification:
• Logistic Regression
• Naive Bayes
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Decision Trees
• Random Forest
• Gradient Boosting (XGBoost, LightGBM, CatBoost)
• Neural Networks (MLP Classifier, CNN, RNN)
Unsupervised Learning:
• K-Means Clustering
• DBSCAN
• Hierarchical Clustering
• Gaussian Mixture Models
• Principal Component Analysis (PCA)
• t-SNE, UMAP
• Isolation Forest (anomaly detection)
• Autoencoders
Deep Learning:
Ensemble Methods:
Tools:
• Scikit-learn
• XGBoost, LightGBM, CatBoost
• TensorFlow, Keras
• PyTorch
• PySpark MLlib
• H2O.ai
• AutoML tools: Auto-sklearn, TPOT, AutoKeras
• MLflow for experiment tracking (see the sketch after this list)
• Weights & Biases (wandb)
• Ray Tune for hyperparameter optimization
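As a quick illustration of the experiment tracking mentioned in this list, a minimal MLflow sketch; the run name and logged parameter are placeholders tied to the grid search above:
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="rf_baseline"):
    mlflow.log_param("n_estimators", 300)
    mlflow.log_metric("cv_best_score", grid.best_score_)
    mlflow.sklearn.log_model(model, "model")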
Interview Tips:
1. Be ready to justify your choice of algorithm and, at some companies, explain algorithm internals (e.g., why tree-based methods over linear models).
2. Explain how hyperparameter tuning affects model performance and prevents overfitting.
3. Discuss how you would handle class imbalance (e.g., weighting, SMOTE) or high-dimensional
data (e.g., feature selection).
9. Evaluation
Purpose:
Evaluate how well a trained model performs on unseen data to ensure it generalizes beyond the
training set. The goal is to validate predictive quality, reliability, and business relevance before
deployment.
Performance Metrics:
• Classification:
• Regression:
Evaluation Methods:
• Plot learning curves to detect underfitting (high bias) or overfitting (high variance).
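A minimal sketch of that learning-curve diagnostic, assuming the scikit-learn estimator and training split from earlier:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Train vs. validation score as the training set grows
train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation')
plt.xlabel('Training set size'); plt.ylabel('Score'); plt.legend(); plt.show()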
Model Interpretation:
• Partial Dependence Plots (PDP): Visualize relationships between specific features and
predictions.
Error Analysis:
Model Comparison:
• Evaluate multiple models on the same test set and compare metrics.
• Consider trade-offs like accuracy vs. interpretability, latency vs. complexity, or precision vs.
recall based on the use case.
• Ensure that model performance meets domain-specific KPIs (e.g., <5% false negatives in
fraud detection).
• Compare train vs. test metrics to confirm no overfitting before production deployment.
Example Code:
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
import numpy as np

# Classification evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Regression evaluation (when the model is a regressor)
y_pred = model.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)), "R2:", r2_score(y_test, y_pred))

# Cross-validation score
print(cross_val_score(model, X_train, y_train, cv=5).mean())

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Tools:
Metrics:
Interview Tips:
1. Be ready to explain why you chose a particular metric (e.g., F1 over accuracy for imbalanced
classes).
2. Show that you understand bias-variance trade-offs and how learning curves can diagnose
them.
3. Talk about error analysis as an ongoing iterative process, not a one-time check.
4. If asked about explainability, discuss SHAP, LIME, or feature importance and how they guide
decision-making.
10. Deployment & Monitoring
• Version control: Track model versions with MLflow, DVC, or other model registries.
• Save preprocessing pipelines: Ensure that transformers, scalers, and encoders are stored alongside the model for reproducible inference.
import joblib
# Save model
joblib.dump(model, "model_v1.pkl")
# Load model
model = joblib.load("model_v1.pkl")
Deployment Options:
• Batch Predictions: Run periodic predictions on new datasets.
• Real-Time API: Expose the model through REST APIs using Flask or FastAPI.
• Cloud Deployment: Deploy to AWS SageMaker, Azure ML, GCP AI Platform, or other ML
services.
• Containerization: Use Docker (and optionally Kubernetes) for environment consistency and
scalability.
# FastAPI example: expose the model through a REST endpoint
from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load("model_v1.pkl")

@app.post("/predict")
def predict(data: dict):
    df = pd.DataFrame([data])
    pred = model.predict(df)
    return {"prediction": pred.tolist()}
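Once the service is running (for example via uvicorn on the default port), it can be called like this; the URL and payload fields are illustrative:
import requests

resp = requests.post("http://localhost:8000/predict", json={"age": 34, "income": 52000})
print(resp.json())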
Integration:
• Connect with data pipelines and databases to automatically ingest new data.
Monitoring:
• Performance Tracking: Monitor metrics like accuracy, F1-score, latency, and throughput in
production.
• Data Drift Detection: Check if input data distributions have changed compared to training
data.
• Logging: Maintain detailed logs for predictions, errors, and system health.
# Score incoming data and log predictions as part of monitoring
preds = model.predict(X_test)
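A simple drift check along the lines described above, assuming a training frame and a hypothetical frame of recent production data (live_df) sharing a numeric column:
from scipy.stats import ks_2samp

# Compare the training and live distributions of one feature
stat, p_value = ks_2samp(train_df['amount'], live_df['amount'])
if p_value < 0.05:
    print("Possible data drift detected for 'amount'")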
• Use A/B testing to compare new models or features before full rollout.
Tools:
• Model serving: Flask, FastAPI, Django
• Container tools: Docker, Kubernetes
• Cloud platforms: AWS SageMaker, Azure ML, Google Vertex AI
• MLOps: MLflow, Kubeflow, Airflow
• Monitoring: Prometheus, Grafana, Evidently AI
• Model versioning: DVC, MLflow Model Registry
• CI/CD: Jenkins, GitHub Actions, GitLab CI
• API gateways: Kong, AWS API Gateway
• Feature stores: Feast, Tecton
Interview Tips:
1. Explain trade-offs between batch and real-time predictions.
2. Discuss how you monitor model performance and detect drift in production.
11. Summary
This machine learning pipeline is iterative - you usually cycle back to earlier stages based on findings
or changes in requirements. Key feedback loops include:
• Poor evaluation results → revisit feature engineering, try different models, tune
hyperparameters, or collect additional data.
• Data quality issues discovered during EDA → return to data collection, integration, or initial
validation to clean, enrich, or correct the dataset.
• Imbalanced or insufficient data → adjust data splitting, apply resampling, or acquire more
samples.
• Deployment monitoring shows drift → trigger model retraining, update features, or collect
fresh labeled data.
Key Principles:
• Maintain flexibility to iterate while following a structured approach.
• Ensure reproducibility with versioned models, documented pipelines, and controlled random
seeds.
• Continuously monitor data and model performance to catch drift or degradation early.
• Combine technical rigor with business awareness for reliable, production-ready ML solutions.
Happy Learning.