Data Science Questions - 10 & 15 Marks
1. What is Data Preprocessing? Explain the steps involved.
Definition:
Data preprocessing is the initial stage in the data analysis pipeline where raw data is cleaned and transformed into a format suitable for analysis and modeling.
Steps Involved:
- Data Cleaning: Handle missing values, remove duplicates, correct errors.
- Data Transformation: Normalize or scale numerical data; encode categorical variables.
- Feature Engineering: Create new features, combine or split existing ones.
- Data Reduction: Use techniques like PCA, LDA, or feature selection to reduce dimensionality.
- Data Integration: Combine data from multiple sources.
- Data Discretization: Convert continuous data into categorical bins.
Importance:
Enhances data quality, reduces noise, and boosts model accuracy.
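A minimal sketch of these steps using pandas and scikit-learn; the DataFrame and its columns here are hypothetical, chosen only to illustrate cleaning, transformation, and encoding:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({"age": [25, 30, None, 45],
                   "city": ["Pune", "Delhi", "Pune", "Mumbai"]})
df["age"] = df["age"].fillna(df["age"].median())  # cleaning: impute missing value
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()  # transformation: scaling
df = pd.get_dummies(df, columns=["city"])  # transformation: one-hot encoding
print(df)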
2. Define Data Cleaning and Discuss Its Tasks.
Definition:
Data cleaning is the process of correcting or removing inaccurate records from a dataset.
Tasks:
- Missing Values: Imputation (mean, median), deletion, or forward/backward filling.
- Outliers: Detect with z-score, boxplot; remove or transform.
- Noise Handling: Use smoothing techniques or binning.
- Normalization/Scaling: StandardScaler or MinMaxScaler to bring values to a common scale.
- Type Conversion & Deduplication: Convert data types; remove duplicate rows.
Purpose:
Improves data integrity and model reliability.
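A short illustration of common cleaning tasks with pandas and scikit-learn, using a small made-up dataset:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Hypothetical data with a missing value and duplicate rows
df = pd.DataFrame({"salary": [30000.0, None, 50000.0, 50000.0],
                   "dept": ["HR", "IT", "IT", "IT"]})
df["salary"] = df["salary"].fillna(df["salary"].median())  # imputation
df = df.drop_duplicates()  # deduplication
df["salary"] = df["salary"].astype(int)  # type conversion
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()  # scaling
print(df)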
3. Techniques Used for Handling Outliers
Outliers are extreme values that differ significantly from the rest of the data.
Detection Methods:
- Z-score: Values with |z| > 3 are considered outliers.
- IQR Method: Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
- Boxplots: Visual detection.
- Machine Learning Methods: Isolation Forest, One-Class SVM.
Handling Methods:
- Removal: If clearly erroneous.
- Transformation: Log, square root, or winsorization.
- Imputation: Replace with mean/median.
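For example, the IQR method can be sketched as follows; the sample values are made up, and winsorization is shown via np.clip:
import numpy as np
# Hypothetical sample with one extreme value
data = np.array([10, 12, 11, 13, 12, 95])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]  # detection
cleaned = np.clip(data, lower, upper)  # winsorization: cap extreme values
print(outliers, cleaned)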
4. Differences Between Accuracy, Precision, Recall, and F1-Score
Metric | Formula | Focus
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness
Precision | TP / (TP + FP) | Quality of positive predictions
Recall | TP / (TP + FN) | Coverage of actual positives
F1-score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall
When to Use:
- Accuracy: For balanced datasets.
- Precision: When false positives are costly (e.g., spam detection).
- Recall: When false negatives are critical (e.g., disease diagnosis).
- F1-score: When you need a balance (imbalanced data).
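All four metrics can be computed with scikit-learn; the labels below are hypothetical:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))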
5. ROC Curve and AUC in Binary Classification
ROC Curve:
Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at different thresholds.
AUC (Area Under Curve):
- Ranges from 0 to 1; 0.5 corresponds to random guessing.
- Higher AUC = better separation between the two classes.
Advantages Over Accuracy:
- Works well with imbalanced datasets.
- Evaluates performance across all thresholds.
- Highlights trade-off between sensitivity and specificity.
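A brief sketch with scikit-learn, assuming hypothetical labels and predicted probabilities:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()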
6. What is Cross-Validation? Types and Pros/Cons
Definition:
Cross-validation divides the data into parts, training and testing the model multiple times to get an average performance estimate.
Types:
- K-Fold: Divides data into k parts; trains on k-1, tests on 1.
- Stratified K-Fold: Preserves class distribution.
- Leave-One-Out (LOOCV): One sample for testing, rest for training.
- Repeated K-Fold: Repeats k-fold multiple times for reliability.
Advantages:
- Reduces overfitting.
- Provides robust performance estimate.
Disadvantages:
- Computationally expensive.
- May not suit small datasets.
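A minimal stratified k-fold example with scikit-learn, using the built-in Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# 5-fold stratified CV: one accuracy score per fold
cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
print("Mean accuracy:", scores.mean())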
1. Central Tendency and Dispersion Measures with Example
Central Tendency:
- Mean: Average value.
- Median: Middle value.
- Mode: Most frequent value.
Dispersion:
- Range: Max - Min.
- Variance: Average of squared differences from the mean.
- Standard Deviation: Square root of variance.
Example:
import numpy as np
scores = [45, 50, 55, 60, 65, 70, 75]
mean = np.mean(scores)      # central tendency: average value
median = np.median(scores)  # central tendency: middle value
std_dev = np.std(scores)    # dispersion: spread around the mean
print(mean, median, std_dev)
2. Hypothesis Testing with Example
Definition:
A method for making inferences about population parameters based on sample data.
Steps:
1. Formulate H0 and H1 (null and alternative).
2. Choose significance level (α = 0.05).
3. Select test (e.g., t-test).
4. Calculate test statistic.
5. Compare with critical value or p-value.
6. Interpret result.
Example: Testing whether a new drug lowers BP more than the old one using a two-sample t-test.
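A sketch of this test with SciPy, using made-up blood-pressure reductions for the two drugs:
from scipy import stats
# Hypothetical BP reductions for the new and old drug
new_drug = [12, 15, 14, 13, 16, 15, 14]
old_drug = [10, 11, 12, 10, 13, 11, 12]
# Two-sample t-test: H0 says both drugs lower BP equally
t_stat, p_value = stats.ttest_ind(new_drug, old_drug)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject H0: the new drug performs differently")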
3. Matplotlib Plots with Code
import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Line Plot")
plt.show()
# Bar plot
plt.bar(['A', 'B', 'C'], [10, 20, 15])
plt.title("Bar Plot")
plt.show()
# Histogram
plt.hist([1, 1, 2, 3, 3, 3, 4, 5])
plt.title("Histogram")
plt.show()
# Scatter plot
plt.scatter([1, 2, 3], [4, 5, 6])
plt.title("Scatter Plot")
plt.show()
4. Seaborn Plots with Code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
df = sns.load_dataset("tips")
# Scatterplot
sns.scatterplot(x="total_bill", y="tip", data=df)
plt.title("Scatterplot")
plt.show()
# Heatmap of correlations between the numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title("Heatmap")
plt.show()
# Boxplot
sns.boxplot(x="day", y="total_bill", data=df)
plt.title("Boxplot")
plt.show()
# Violin plot
sns.violinplot(x="day", y="total_bill", data=df)
plt.title("Violin Plot")
plt.show()
5. Visualize and Remove Outliers Using Box Plot and Z-Score
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
data = np.array([1, 2, 3, 4, 5, 100])  # Outlier = 100
z_scores = stats.zscore(data)
outliers = data[np.abs(z_scores) > 2]
# Boxplot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
# Remove outliers (keep values with |z| <= 2)
cleaned = data[np.abs(z_scores) <= 2]
print("Cleaned data:", cleaned)
6. Multiple Linear Regression Algorithm and Assumptions
Algorithm:
Fit a linear equation y = β0 + β1x1 + β2x2 + ... + βnxn + ε.
Use Ordinary Least Squares (OLS) to minimize residual sum of squares.
Assumptions:
- Linearity
- Independence of errors
- Homoscedasticity (equal variance)
- Normal distribution of errors
- No multicollinearity
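A minimal OLS fit with scikit-learn on synthetic data; the true coefficients 3, 2, and 5 are chosen only for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 3 + 2 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 100)
model = LinearRegression().fit(X, y)  # OLS fit
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)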
7. Decision Tree with Example
Definition:
A supervised ML algorithm that splits data based on feature conditions.
Example:
Predicting if a customer buys a car based on income and age.
from sklearn.tree import DecisionTreeClassifier
# Assumes X_train and y_train were prepared beforehand
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Advantages: easy to interpret and visualize; captures non-linear relationships.
8. Random Forest with Example
Definition:
An ensemble method combining multiple decision trees.
Example:
from sklearn.ensemble import RandomForestClassifier
# Assumes X_train and y_train were prepared beforehand
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
Advantages: better generalization than a single tree; reduces overfitting; robust to noise and outliers.
9. Model Selection and Techniques
Definition:
Choosing the best model for a task.
Techniques:
- Cross-Validation
- Grid Search
- Random Search
- Bayesian Optimization
- AIC/BIC Scores
- Validation Curves
Goal:
Ensure the model generalizes well to unseen data.
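As an illustration, grid search and cross-validation can be combined in scikit-learn; the parameter grid below is an arbitrary example:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
X, y = load_iris(return_X_y=True)
# Grid search with 5-fold cross-validation over two hyperparameters
params = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)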