Unit III: Data Visualization and
Representation
1. Model Development: Simple and Multiple Regression
🔹 A. Simple Linear Regression (SLR)
1. Definition:
Simple Linear Regression is a predictive statistical technique used to
model the relationship between:
● One independent variable (X) – the input or predictor variable.
● One dependent variable (Y) – the output or response variable.
It assumes a linear relationship between these two variables.
2. Equation:
Y=β0+β1X+ε
Where:
● Y = Predicted (dependent) variable
● X = Independent variable
● β0 = Intercept (value of Y when X = 0)
● β1 = Slope (change in Y for a unit change in X)
● ε = Error term (difference between actual and predicted values)
3. Use Cases in IoT:
● Predicting temperature based on time (e.g., in smart thermostats)
● Estimating humidity based on distance from a water source
● Forecasting energy usage based on hour of the day
4. Assumptions of SLR:
● There is a linear relationship between X and Y.
● The errors (residuals) are normally distributed.
● The variance of errors is constant (homoscedasticity).
● Observations are independent.
5. Evaluation of SLR:
● R-squared (R²): Tells how much variation in Y is explained by X.
● Residual analysis: Helps check for outliers and model accuracy.
● Visualization: Scatter plot with regression line.
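The fit and evaluation steps above can be sketched in Python. This is a minimal illustration using synthetic (hypothetical) smart-thermostat data and NumPy's least-squares fit, not a prescribed implementation:

```python
import numpy as np

# Hypothetical smart-thermostat data: temperature drifting up with the hour.
rng = np.random.default_rng(0)
hour = np.arange(24, dtype=float)                   # X: hour of day
temp = 15.0 + 0.8 * hour + rng.normal(0, 0.5, 24)   # Y: temperature (degrees C)

# Fit Y = b0 + b1*X by ordinary least squares.
b1, b0 = np.polyfit(hour, temp, deg=1)

# Evaluate with R^2 = 1 - SS_res / SS_tot.
pred = b0 + b1 * hour
ss_res = np.sum((temp - pred) ** 2)
ss_tot = np.sum((temp - temp.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"intercept={b0:.2f}, slope={b1:.2f}, R^2={r2:.3f}")
```

With the simulated slope of 0.8 and small noise, the fitted slope comes out close to 0.8 and R² is near 1, matching the "closer to 1 is better" rule above.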
🔹 B. Multiple Linear Regression (MLR)
1. Definition:
Multiple Linear Regression is an extension of Simple Linear
Regression. It is used when there is:
● One dependent variable (Y)
● Two or more independent variables (X₁, X₂, ..., Xₙ)
It helps in modeling situations where an outcome depends on multiple
factors.
2. Equation:
Y=β0+β1X1+β2X2+...+βnXn+ε
Where:
● X1, X2, ..., Xn = Independent variables (predictors)
● β1, β2, ..., βn = Coefficients showing the influence of each predictor
3. Use Cases in IoT:
● Predicting air quality using humidity, temperature, and CO₂ levels
● Estimating traffic congestion using weather, time, and GPS
speed
● Forecasting smart home energy consumption using occupancy,
appliance usage, and time
4. Assumptions of MLR:
● Linear relationship between dependent and all independent
variables
● Independence of errors
● No multicollinearity (predictors should not be highly correlated)
● Homoscedasticity (equal variance of errors)
● Normal distribution of errors
5. Benefits of MLR:
● Improved prediction accuracy with more inputs
● Identifies most influential factors
● Can be used for what-if analysis in engineering problems
6. Evaluation of MLR:
● Adjusted R²: A modified version of R² that adjusts for the number
of predictors
● F-test: Tests overall significance of the model
● p-values: Tests the significance of each predictor
● VIF (Variance Inflation Factor): Checks multicollinearity
● Residual plots: To validate assumptions
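A multiple regression fit and its Adjusted R² can be sketched with NumPy alone. The air-quality variables and coefficients below are hypothetical, chosen only to mirror the IoT use case above:

```python
import numpy as np

# Hypothetical air-quality data: AQI driven by humidity, temperature, CO2.
rng = np.random.default_rng(1)
n = 200
humidity = rng.uniform(30, 90, n)
temperature = rng.uniform(10, 35, n)
co2 = rng.uniform(400, 1000, n)
aqi = 5 + 0.4 * humidity + 1.2 * temperature + 0.05 * co2 + rng.normal(0, 2, n)

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones(n), humidity, temperature, co2])
beta, *_ = np.linalg.lstsq(X, aqi, rcond=None)

pred = X @ beta
ss_res = np.sum((aqi - pred) ** 2)
ss_tot = np.sum((aqi - aqi.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

p = 3  # number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"coefficients={np.round(beta, 3)}, adjusted R^2={adj_r2:.3f}")
```

Adjusted R² penalizes the model for each extra predictor, which is why it is preferred over plain R² for MLR.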
🔧 Comparison: SLR vs MLR
Feature | Simple Linear Regression | Multiple Linear Regression
Predictors | 1 | 2 or more
Complexity | Low | Higher
Visual representation | 2D line | Hard to visualize beyond 3 variables
Accuracy | Limited | Higher (if model is well-tuned)
Use cases | Basic trends | Complex systems (like IoT)
2. Model Evaluation Using Visualization
Visualizing model evaluation helps to diagnose whether a predictive
model is performing well and to identify any problems such as non-
linearity, heteroscedasticity, or outliers. Two of the most essential tools
for this are the Residual Plot and the Distribution Plot.
🔹 A. Residual Plot
1. What is a Residual?
● A residual is the difference between the observed value
(actual) and the predicted value by the model.
Residual = y_actual − y_predicted
2. What is a Residual Plot?
● A graph where:
○ The X-axis shows the predicted values (or the independent
variable).
○ The Y-axis shows the residuals.
● Each point represents how far off the prediction was for a specific
observation.
3. Purpose of a Residual Plot:
● Detect non-linearity: If the residuals show a curve, the data may
not be linear.
● Check variance: If the spread of residuals increases or
decreases, there may be heteroscedasticity (non-constant error
variance).
● Identify outliers: Points far from the horizontal line at zero are
possible anomalies.
4. What to Look For:
Pattern in Plot | Meaning
Random scatter around zero | Good model fit (ideal)
Curved pattern | Non-linear relationship (consider polynomial/spline regression)
Funnel shape (widening or narrowing spread) | Heteroscedasticity
Distinct outliers | Possible anomalies or data entry errors
5. Example in IoT:
● Predicting air pollution (PM2.5) based on temperature and
humidity.
● If residuals are clustered around zero with no shape, the model
works well.
● If residuals rise or fall with temperature, the relationship may not
be linear.
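A residual plot like the one described above can be drawn in a few lines. This sketch uses synthetic PM2.5 data (the numbers are illustrative) and assumes NumPy and Matplotlib are available:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical data: PM2.5 modeled from temperature alone.
rng = np.random.default_rng(2)
temp = rng.uniform(15, 35, 100)
pm25 = 20 + 1.5 * temp + rng.normal(0, 3, 100)

# Fit, predict, and compute residuals = y_actual - y_predicted.
b1, b0 = np.polyfit(temp, pm25, deg=1)
pred = b0 + b1 * temp
residuals = pm25 - pred

# Residual plot: predicted values on X, residuals on Y.
plt.scatter(pred, residuals)
plt.axhline(0, color="red")  # ideal: random scatter around this line
plt.xlabel("Predicted PM2.5")
plt.ylabel("Residual")
plt.savefig("residual_plot.png")

print(f"mean residual = {residuals.mean():.4f}")
```

With an intercept in the model, least-squares residuals average to zero; what matters in the plot is their shape, not their mean.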
🔹 B. Distribution Plot
1. What is a Distribution Plot?
● A graph that shows the frequency distribution of a variable or of
the model’s residuals (errors).
● Can be represented as:
○ Histogram: Shows how values are distributed across
intervals.
○ KDE (Kernel Density Estimate): Smooth line approximation
of the histogram.
2. Purpose of a Distribution Plot:
● Check normality: Many statistical models assume that errors are
normally distributed.
● Detect skewness: Asymmetry in the distribution.
● Spot heavy tails or peakedness.
3. Ideal Shape of Residual Distribution:
● A bell-shaped curve centered at zero (normal distribution).
● Indicates that the model errors are random and well-behaved.
4. What to Look For:
Shape | Interpretation
Bell curve (symmetric) | Good assumption of normality
Skewed to right/left | Model may be biased
Multiple peaks | Data may contain subgroups or be improperly modeled
5. Example in IoT:
● If you build a model to forecast electricity usage in smart homes, a
distribution plot of residuals should ideally form a bell curve. A
skewed curve may suggest:
○ The model consistently underestimates or overestimates
at certain times.
○ Some important features may be missing.
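A distribution plot of residuals, plus a quick numeric symmetry check, can be sketched as follows (synthetic residuals standing in for a real model's errors; assumes NumPy and Matplotlib):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Stand-in for real model errors: 1000 synthetic residuals.
rng = np.random.default_rng(3)
residuals = rng.normal(0, 5, 1000)

# Histogram of residuals: should form a bell curve centered at zero.
plt.hist(residuals, bins=30, density=True)
plt.xlabel("Residual")
plt.ylabel("Density")
plt.savefig("residual_hist.png")

# Sample skewness: near 0 for a symmetric, normal-looking shape.
skew = np.mean(((residuals - residuals.mean()) / residuals.std()) ** 3)
print(f"skewness = {skew:.3f}")
```

A clearly positive or negative skewness value backs up what the eye sees in the histogram: the model systematically under- or over-predicts.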
✅ Summary of Use Cases in Model Diagnostics:
Plot Type | Use | What It Shows
Residual Plot | Model accuracy | Non-linearity, heteroscedasticity, outliers
Distribution Plot | Error assumptions | Normality, skewness, model bias
📌 Visual Examples for Class (Suggested):
● Residual Plot Example: Scattered points centered around y = 0.
● Bad Residual Plot: U-shaped pattern or increasing funnel shape.
● Distribution Plot Example: Histogram or KDE plot of residuals
forming a bell curve.
Why These Matter in IoT Applications:
● IoT systems involve noisy sensor data; understanding error
distribution and behavior is crucial.
● Example: A smart irrigation system predicting soil moisture from
temperature and rainfall — residual analysis helps improve
accuracy and reliability.
3. Polynomial Regression and Splines
🔷 A. Polynomial Regression:
1. Definition:
Polynomial Regression is an extension of linear regression that allows
modeling of non-linear relationships between the independent variable
X and the dependent variable Y.
Instead of fitting a straight line, it fits a curved polynomial function to
the data.
2. Equation:
Y = β0 + β1X + β2X² + β3X³ + ⋯ + βnXⁿ + ε
Where:
● β0 : Intercept
● β1, β2, …, βn : Coefficients of the polynomial terms
● X², X³, …, Xⁿ : Higher-degree powers of the input variable
● ε : Error term
3. Purpose: To capture non-linear trends in data by including
higher-degree terms of the predictor variable.
4. Applications in IoT:
● Predicting battery life of a sensor over time (non-linear decay).
● Modeling temperature variation over a day.
● Estimating pollution levels with multiple peaks during the day.
5. Visual Interpretation:
● Linear regression gives a straight line.
● Polynomial regression gives a curve (parabola, cubic curve, etc.),
depending on the degree of the polynomial.
6. Advantages:
● Captures non-linear patterns.
● Easy to implement with basic regression libraries.
7. Disadvantages:
● Overfitting: A high-degree polynomial may fit the noise, not the
trend.
● Unstable behavior at the edges of the data range (extreme
values).
8. Example:
If you're using IoT to monitor plant growth, you might find that growth
rate is slow initially, fast in the middle, and slows again. A quadratic or
cubic polynomial fits this better than a straight line.
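The plant-growth example can be sketched with NumPy's polynomial fitting (the growth numbers are synthetic, chosen to give the slow-fast-slow shape described above):

```python
import numpy as np

# Hypothetical plant-growth data: slow start, fast middle, slowing end.
rng = np.random.default_rng(4)
day = np.linspace(0, 30, 60)
height = 0.5 * day + 0.08 * day**2 - 0.0018 * day**3 + rng.normal(0, 0.4, 60)

# Degree-1 (straight line) vs degree-3 (cubic) fit.
lin = np.polyfit(day, height, deg=1)
cub = np.polyfit(day, height, deg=3)

def r2(y, pred):
    """R^2 = 1 - SS_res / SS_tot."""
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

r2_lin = r2(height, np.polyval(lin, day))
r2_cub = r2(height, np.polyval(cub, day))
print(f"linear R^2={r2_lin:.3f}, cubic R^2={r2_cub:.3f}")
```

The cubic fit explains more of the variation than the straight line, which is exactly the case polynomial regression is meant for. Pushing the degree much higher would start fitting the noise, illustrating the overfitting risk noted above.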
🔷 B. Splines
1. Definition:
Splines are piecewise polynomial functions that are:
● Defined on subintervals of the data.
● Joined at specific points called “knots”.
● Smoothly connected at the knots to ensure no abrupt changes in
slope or curvature.
2. Purpose:
To model complex, irregular data patterns without overfitting and while
maintaining smoothness across intervals.
3. Types of Splines:
● Linear Splines: Piecewise linear with kinks at knots.
● Quadratic/Cubic Splines: Piecewise 2nd or 3rd-degree
polynomials with smooth joins.
● B-splines: Basis splines used in numerical computation for
smoothness and flexibility.
4. Equation Structure:
There isn’t a single general formula, but splines look like this:
Spline Equation Structure (Plain Text):
f(x) =
p₁(x), for a ≤ x < k₁
p₂(x), for k₁ ≤ x < k₂
...
pₙ(x), for kₙ₋₁ ≤ x ≤ b
Where:
● Each pᵢ(x) is a polynomial defined on a subinterval.
● k₁, k₂, ..., kₙ₋₁ are the knots (points where the pieces join).
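The simplest case of this structure is a linear spline, and NumPy's np.interp evaluates exactly that: a piecewise-linear function joined at the knots. The traffic-speed numbers below are hypothetical, mirroring the IoT use case of a sharp drop due to an incident:

```python
import numpy as np

# Knots: (hour, speed in km/h). A sharp drop at 8:00 models an incident.
knot_hours = np.array([0.0, 6.0, 8.0, 9.0, 12.0])
knot_speed = np.array([90.0, 85.0, 30.0, 70.0, 88.0])

# np.interp evaluates the linear spline defined by these knots:
# between each pair of knots, f(x) is a straight-line segment.
hours = np.array([3.0, 7.0, 8.5, 10.5])
speed = np.interp(hours, knot_hours, knot_speed)
print(speed)  # [87.5 57.5 50.  79. ]
```

Quadratic and cubic splines follow the same piecewise idea but also match slopes (and, for cubics, curvature) at the knots, giving the smooth joins described above; libraries such as scipy.interpolate provide those.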
5. Advantages Over Polynomial Regression:
Feature | Polynomial Regression | Splines
Handles local changes | ❌ No (global curve) | ✅ Yes (local adjustment)
Risk of overfitting | ✅ High (with high degree) | ❌ Lower
Smoothness | ✅ | ✅
Flexibility | ❌ | ✅
6. Applications in IoT:
● Air quality prediction: Captures sudden increases in pollution
during peak hours.
● Sensor calibration: Corrects abrupt shifts in sensor behavior.
● Traffic speed modeling: Models sharp traffic drops due to
incidents.
7. Visual Explanation:
● Imagine splitting a road into segments and fitting a small curve to
each segment.
● All curves connect smoothly, like train tracks — no bumps or
breaks.
✅ Summary Table
Concept | Polynomial Regression | Splines
Basis | Single global polynomial | Multiple local polynomials
Flexibility | Low (global fit) | High (local fit)
Handles sharp changes | No | Yes
Risk of overfitting | High (if degree is too large) | Low
Use case in IoT | Predicting trends with gentle curves | Modeling data with sudden changes
📌 Important Notes for Students:
● Use Polynomial Regression when your data has smooth non-
linear patterns.
● Use Splines when your data has abrupt changes but you want a
smooth model.
● In both cases, visualization helps in identifying whether a linear
model is sufficient or a non-linear model is needed.
4. Measures Used for Evaluation
Evaluating a model means checking how well it predicts outcomes.
The following statistical metrics help quantify the model’s performance:
🔷 A. R-Squared (R²)
1. Definition:
R-Squared, or the coefficient of determination, represents the
proportion of the variance in the dependent variable Y that is
explained by the independent variables in the model.
2. Formula:
R² = 1 − (SS_res / SS_tot)
Where:
● SS_res = Σ (Yᵢ − Ŷᵢ)² → Residual Sum of Squares
● SS_tot = Σ (Yᵢ − Ȳ)² → Total Sum of Squares
3. Interpretation:
● R² = 0: The model explains none of the variability.
● R² = 1: The model explains all the variability (perfect fit).
● Values closer to 1 are better.
4. Example:
If R² = 0.87, it means 87% of the variation in the dependent variable is
explained by the model, and 13% is due to random error or unmodeled
factors.
5. Limitation:
● It doesn’t indicate whether the model is good; a high R² doesn’t
mean the model is accurate.
● It increases with more variables, even if they are irrelevant
(hence use Adjusted R² for MLR).
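The SS_res / SS_tot decomposition can be verified on a tiny worked example (the five values are made up for illustration):

```python
import numpy as np

# Tiny worked example of R^2 = 1 - SS_res / SS_tot.
y_actual = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred   = np.array([10.5, 11.5, 14.0, 16.5, 17.5])

ss_res = np.sum((y_actual - y_pred) ** 2)            # sum of (Yi - Yhat_i)^2
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # sum of (Yi - Ybar)^2
r2 = 1 - ss_res / ss_tot
print(f"SS_res={ss_res}, SS_tot={ss_tot}, R^2={r2:.4f}")
# SS_res=1.0, SS_tot=40.0, R^2=0.9750
```

Here the model leaves only 1 unit of squared error out of 40 units of total variation, so R² = 0.975: 97.5% of the variation is explained.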
🔷 B. Mean Squared Error (MSE):
1. Definition:
MSE measures the average squared difference between the actual
values Yᵢ and the predicted values Ŷᵢ.
2. Formula:
MSE = (1 / n) × Σ (Yᵢ − Ŷᵢ)²
where the summation Σ is from i = 1 to n.
3. Interpretation:
● A lower MSE indicates a better fit.
● Squaring errors penalizes large errors more heavily, making
MSE sensitive to outliers.
4. Example:
If the predicted pollution level in IoT data is 60 but the actual is
55, the squared error is (60 − 55)² = 25. MSE averages such squared
errors over all predictions.
🔷 C. Root Mean Squared Error (RMSE):
1. Definition:
RMSE is the square root of MSE. It provides the average prediction
error in the same units as the target variable.
2. Interpretation:
● Like MSE, lower RMSE is better.
● It gives a more interpretable error than MSE because it's not
squared.
3. Example:
If MSE = 25, then RMSE = 5. This means, on average, the predictions
are off by 5 units.
4. Use in IoT:
Useful in sensor readings, where errors must be in real-world units
(e.g., degrees, ppm, m/s²).
🔷 D. Mean Absolute Error (MAE):
1. Definition:
MAE is the average of absolute errors between actual and predicted
values.
2. Interpretation:
● MAE gives equal weight to all errors (no squaring).
● Lower MAE = Better model performance.
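The three error metrics can be computed side by side on a small made-up set of predictions, which also shows why RMSE sits between MAE and MSE in outlier sensitivity:

```python
import numpy as np

# Hypothetical actual vs predicted pollution readings.
y_actual = np.array([55.0, 60.0, 48.0, 52.0])
y_pred   = np.array([60.0, 58.0, 50.0, 51.0])

errors = y_actual - y_pred          # -5, 2, -2, 1
mse  = np.mean(errors ** 2)         # (25 + 4 + 4 + 1) / 4 = 8.5
rmse = np.sqrt(mse)                 # back in the units of Y
mae  = np.mean(np.abs(errors))      # (5 + 2 + 2 + 1) / 4 = 2.5
print(f"MSE={mse}, RMSE={rmse:.3f}, MAE={mae}")
```

Note that RMSE (about 2.92) exceeds MAE (2.5): squaring before averaging lets the single large error of 5 pull the result up, which is exactly the outlier sensitivity summarized in the table below.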
3. Comparison with MSE and RMSE:
Metric | Penalizes large errors | Output Unit | Sensitive to Outliers
MAE | ❌ No | Same as Y | ❌ Less sensitive
MSE | ✅ Yes (squares) | Squared unit | ✅ Very sensitive
RMSE | ✅ Yes (moderate) | Same as Y | ✅ Moderate
✅ Summary Table:
Metric | Description | Range | Ideal Value | Units
R² | Proportion of variance explained | 0 to 1 | Closer to 1 | None
MSE | Avg. squared error | 0 to ∞ | Lower is better | Squared units
RMSE | Root of MSE | 0 to ∞ | Lower is better | Same as actual values
MAE | Avg. absolute error | 0 to ∞ | Lower is better | Same as actual values
5. Prediction and Decision Making Visual Tools
🔷 A. Box Plots:
1. Definition:
A Box Plot (or box-and-whisker plot) is a graphical tool that summarizes
the distribution of a dataset using five key statistics:
● Minimum
● First Quartile (Q1)
● Median (Q2)
● Third Quartile (Q3)
● Maximum
It also highlights outliers.
2. Components:
● Box: Represents the interquartile range (IQR) from Q1 to Q3.
● Line inside box: Median value.
● Whiskers: Extend to the lowest and highest data points within 1.5
× IQR.
● Points outside whiskers: Outliers (extreme values).
3. Formulae:
IQR = Q3 − Q1
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Values beyond the lower or upper bound are considered outliers.
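The IQR-based outlier rule above can be applied directly with NumPy (the AQI readings are synthetic, with one faulty spike planted at 120):

```python
import numpy as np

# Hypothetical AQI readings from one sensor, with one faulty spike.
aqi = np.array([42, 45, 47, 50, 51, 53, 55, 58, 60, 120], dtype=float)

# Quartiles, IQR, and the 1.5 * IQR whisker bounds.
q1, q3 = np.percentile(aqi, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the bounds is flagged as an outlier.
outliers = aqi[(aqi < lower) | (aqi > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```

Matplotlib's plt.boxplot(aqi) draws the same five-number summary graphically; the spike at 120 appears as a point beyond the upper whisker.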
4. Use Cases:
● Visualize spread and symmetry in datasets.
● Compare distributions across different variables or groups (e.g.,
error in predictions from different IoT sensors).
● Detect anomalies in time-series or sensor data.
5. Example in IoT:
Comparing the air quality index (AQI) readings from sensors placed in
different areas of a city.
🔷 B. Pivot Table:
1. Definition:
A Pivot Table is a data summarization tool that lets you organize,
group, and compute aggregates (like sum, average, count, etc.) from
large datasets.
2. Key Features:
● Can rearrange data dynamically using drag-and-drop rows and
columns.
● Supports filtering, sorting, and aggregating data.
● Easily implemented in Excel, Google Sheets, or using Python’s
pandas library.
3. Common Aggregations:
Sum = ∑ xᵢ
Average = (∑ xᵢ) / n
Count = Total number of entries
4. Use Cases:
● Summarizing sensor readings (e.g., average temperature per day).
● Analyzing system logs or device usage patterns.
● Tracking prediction errors across various model parameters or
groups.
5. Example in IoT:
Displaying average electricity usage by hour for different types of smart
meters.
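The smart-meter example can be sketched with pandas' pivot_table (the log entries below are made up for illustration):

```python
import pandas as pd

# Hypothetical smart-meter log: usage (kWh) by hour and meter type.
log = pd.DataFrame({
    "hour":  [8, 8, 9, 9, 8, 9],
    "meter": ["home", "office", "home", "office", "home", "home"],
    "kwh":   [1.2, 3.0, 1.6, 3.4, 1.0, 1.4],
})

# Average usage per hour for each meter type:
# rows = hour, columns = meter, cells = mean kWh.
table = pd.pivot_table(log, values="kwh", index="hour",
                       columns="meter", aggfunc="mean")
print(table)
```

Swapping aggfunc to "sum" or "count" gives the other common aggregations listed above, without reshaping the raw log by hand.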
🔷 C. Heat Map:
1. Definition:
A Heat Map is a color-coded grid used to display the magnitude or
intensity of values in a two-dimensional matrix.
2. Types of Heat Maps:
● Correlation Heat Map: Shows relationships between variables.
● Frequency Heat Map: Indicates how often values occur.
● Feature Importance Heat Map: Highlights the influence of input
features on the prediction.
3. Color Gradient Interpretation:
● Darker or intense colors: Higher magnitude or correlation.
● Lighter colors: Lower values or weak correlations.
● Color schemes vary (e.g., red-yellow-green, blue-white-red).
4. Formula for Correlation (used in Heat Maps):
r = ∑[(Xᵢ - X̄)(Yᵢ - Ȳ)] / √[∑(Xᵢ - X̄)² × ∑(Yᵢ - Ȳ)²]
Where:
● X̄ = Mean of variable X
● Ȳ = Mean of variable Y
● r = Pearson correlation coefficient (range: -1 to +1)
5. Use Cases:
● Identifying strong or weak relationships between features.
● Highlighting areas of high activity or errors in IoT grids.
● Prioritizing features for model optimization based on influence.
6. Example in IoT:
Visualizing the correlation between weather parameters (temperature,
humidity, pressure) and air quality index using a heat map.
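The weather-and-AQI example can be sketched end to end: compute the Pearson correlation matrix with pandas, then render it as a color-coded grid. The data is synthetic, constructed so humidity falls with temperature and AQI rises with it; NumPy, pandas, and Matplotlib are assumed available:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical weather/AQI readings with built-in relationships.
rng = np.random.default_rng(5)
temp = rng.uniform(10, 35, 100)
humidity = 100 - temp + rng.normal(0, 2, 100)   # negatively related to temp
aqi = 2 * temp + rng.normal(0, 5, 100)          # positively related to temp

df = pd.DataFrame({"temp": temp, "humidity": humidity, "aqi": aqi})
corr = df.corr()  # Pearson r for every pair of columns

# Render the matrix as a heat map: blue = negative, red = positive.
fig, ax = plt.subplots()
im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(3))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(3))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("corr_heatmap.png")

print(corr.round(2))
```

The temp-humidity cell shows up as a strong negative (dark blue) and temp-aqi as a strong positive (dark red), exactly the gradient interpretation described above.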
✅ Summary Comparison Table
Visual Tool | Main Use Case | Key Output | Engineering Use
Box Plot | Display distribution & outliers | Median, spread, outliers | Analyze sensor reading variability
Pivot Table | Aggregate and summarize data | Totals, averages, counts | Compare daily energy usage across devices
Heat Map | Show relationships or intensities | Color-coded matrix (e.g., correlation) | Identify most influential features in a model
📌 Tips for Engineering Students:
● Use Box Plots when assessing data quality or error spread in
predictive models.
● Use Pivot Tables for quick summaries of large datasets, like IoT
logs or energy usage.
● Use Heat Maps to explore correlations and improve feature
selection for better model performance.