Chapter 5
Introduction
Data analytics is the process of examining data to find useful information, patterns, and
trends to help in decision making.
5.1 Basic Statistical Concepts
Statistics is a branch of mathematics that helps us understand and analyze data. By using
statistics, we can summarize large sets of information in a simple way, making it easier to
analyze the data and draw conclusions.
5.1.1 Measures of Central Tendency
Measures of central tendency help us find the “center” or typical value in a set of data.
There are three main measures of central tendency: mean, median, and mode. These
measures give us a sense of the average or most common values in a dataset.
Mean
The mean is the average of all the numbers in a data set. To find the mean, we add all the
numbers together and divide by the total number of values.
Example: Imagine 5 students scored 50, 60, 70, 80, and 90 in a test. The mean score is
calculated by adding all the scores and then dividing by the number of students:
Mean = (50 + 60 + 70 + 80 + 90) / 5 = 350 / 5 = 70
Mode
The mode is the number that appears most often in a data set. There can be more than
one mode if multiple numbers appear with the same highest frequency. The mode helps
us identify the most frequent or common value in the data.
Example: If 5 students scored 50, 60, 70, 70, and 90, the number 70 appears twice, while
all other numbers appear only once. Therefore, the mode is 70.
Example with Multiple Modes: If the scores were 50, 60, 70, 70, 60, and 90, both 60 and 70
appear twice. So, there are two modes: 60 and 70.
Median
The median is the middle value when the numbers are arranged in order. If there is an
odd number of values, the median is the exact middle number. If there is an even
number of values, the median is the average of the two middle numbers.
Example: Using the same test scores: 50, 60, 70, 80, and 90. When we arrange these
scores in ascending order (which they already are), the middle value is 70. Therefore, the
median score is 70.
Example with Even Numbers: If the scores were 50, 60, 70, and 80, we
would take the average of the two middle scores (60 and 70):
Median = (60 + 70) / 2 = 65
So, the median is 65.
The median helps us understand the middle point of the data.
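The three measures above can be checked with a short Python sketch. It uses the standard `statistics` module and the same example scores from this section; the variable names are only illustrative.

```python
import statistics

# Example scores taken from the text above
scores = [50, 60, 70, 80, 90]               # odd number of values
even_scores = [50, 60, 70, 80]              # even number of values
one_mode_scores = [50, 60, 70, 70, 90]      # 70 appears twice
two_mode_scores = [50, 60, 70, 70, 60, 90]  # 60 and 70 both appear twice

mean_score = statistics.mean(scores)           # sum of values / number of values
median_odd = statistics.median(scores)         # exact middle value
median_even = statistics.median(even_scores)   # average of the two middle values
single_mode = statistics.multimode(one_mode_scores)   # list of most frequent values
double_mode = statistics.multimode(two_mode_scores)

print("Mean:", mean_score)
print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Mode(s):", single_mode, double_mode)
```

`statistics.multimode` is used rather than `statistics.mode` because it returns every value that shares the highest frequency, which matches the "multiple modes" case described above.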
5.1.2 Measures of Dispersion
Measures of dispersion tell us how spread out or scattered the data is. Two common
measures of dispersion are variance and standard deviation. These help us understand
whether the data points are close to the average (mean) or far from it.
Variance
The variance shows how much the numbers in a data set differ from the mean. A higher
variance means that the numbers are more spread out, while a lower variance means
that the numbers are closer to the mean. To calculate variance, we use the following
mathematical formula:
Variance (σ²) = Σ (xᵢ − μ)² / N
Where: xᵢ represents each individual value in the data set, μ is the mean of the data
set, and N is the total number of values in the data set.
The steps involved in calculating variance are:
Step 1: Variance for Class A
Given Scores = 50, 52, 55, 57, 60
Step 1.1: Compute the Mean (μ)
Step 1.2: Compute Each Squared Deviation (x − μ)²
Standard Deviation
The standard deviation for Class A is approximately 3.54, while for Class B, it is about
21.26. This means that Class A's scores are closely packed around the mean, whereas
Class B's scores are more widely scattered. The standard deviation helps us easily
understand how much variation there is in the scores.
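The Class A calculation can be reproduced with a minimal Python sketch. It assumes the Class A scores given earlier (50, 52, 55, 57, 60) and the population form of the variance, which divides by N. The Class B scores are not listed in this section, so they are left out; the function name is only illustrative.

```python
import math

# Class A scores as given in the variance steps
class_a = [50, 52, 55, 57, 60]

def population_variance(values):
    """Average of the squared deviations from the mean (divides by N)."""
    mu = sum(values) / len(values)                        # Step: compute the mean
    squared_deviations = [(x - mu) ** 2 for x in values]  # Step: squared deviations
    return sum(squared_deviations) / len(values)          # Step: average them

variance_a = population_variance(class_a)
std_dev_a = math.sqrt(variance_a)  # standard deviation is the square root of variance

print(f"Variance of Class A: {variance_a:.2f}")
print(f"Standard deviation of Class A: {std_dev_a:.2f}")
```

Because the standard deviation is the square root of the variance, it is expressed in the same unit as the scores themselves, which is why it is easier to interpret than the variance.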
5.2.4 Introduction to Probability
Probability is the study of how likely an event is to happen. It helps us predict outcomes
based on what we know.
Example: Consider flipping a coin. There are two possible outcomes: heads or tails.
Since both outcomes are equally likely, the probability of getting heads is 50% (or 1/2),
and the probability of getting tails is also 50%.
We can express this mathematically as:
Probability = Number of favorable outcomes / Total number of possible outcomes
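The coin example can be written as a small Python sketch. It assumes the classical definition of probability, in which all outcomes are equally likely and the probability is the number of favorable outcomes divided by the total number of outcomes. The function name `probability` is hypothetical.

```python
from fractions import Fraction

def probability(favorable_outcomes, total_outcomes):
    """Classical probability, valid only when all outcomes are equally likely."""
    if total_outcomes <= 0:
        raise ValueError("total_outcomes must be positive")
    return Fraction(favorable_outcomes, total_outcomes)

# A coin has two equally likely outcomes
coin_outcomes = ["heads", "tails"]
p_heads = probability(coin_outcomes.count("heads"), len(coin_outcomes))
p_tails = probability(coin_outcomes.count("tails"), len(coin_outcomes))

print("P(heads) =", p_heads, "=", float(p_heads))
print("P(tails) =", p_tails, "=", float(p_tails))
```

Using `Fraction` keeps the result as an exact value such as 1/2, which can then be converted to a decimal or a percentage when needed.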
Class Activity
Instructions: You will analyze a small dataset, calculate measures of central
tendency (mean, median, mode), and measures of dispersion (variance and
standard deviation).
1. Collect Data: Imagine you are surveying your classmates about the
number of hours they spend on homework in a week. Gather data
from 10 classmates. Record the number of hours (you can use any
reasonable numbers, e.g., between 0 and 20 hours).
2. Calculate Measures of Central Tendency:
• Mean: Calculate the average number of hours spent on homework.
• Median: Determine the middle value when the hours are arranged in
order.
• Mode: Identify which number appears most frequently in your data.
3. Calculate Measures of Dispersion:
• Variance: Use the formula to calculate variance based on your data.
• Standard Deviation: Calculate the standard deviation using the
variance you found.
4. Reflect: Write a brief reflection (3-4 sentences) about what these
calculations tell you regarding your classmates' study habits.
Here, the number of customers is our independent variable (X), and the daily earnings
are the dependent variable (Y).
· Step 2: Understanding the Linear Regression Formula
The formula for simple linear regression is:
Y = β₀ + β₁X + ε
Where:
– Y is the dependent variable (in our case, daily earnings),
– X is the independent variable (the number of customers),
– β₀ is the intercept, which is the starting value of Y when X = 0,
– β₁ is the slope, which shows how much Y changes with each unit increase
in X,
– ε is the error term, which accounts for the difference between the
predicted and actual values.
· Step 3: Building the Linear Regression Model
When building a linear regression model, our goal is to find the best line that
explains how two things are related; in this case, the number of customers and
daily earnings. Here's how we get the values for the slope (40) and intercept
(100):
Understanding the Slope (β₁ = 40)
The slope shows how much extra money we make for every new customer. Let's use our
data to figure it out:
If you notice, for every 5 extra customers, earnings go up by 200 rupees. So, for each new
customer:
β₁ = 200/ 5 = 40
This means every new customer adds 40 rupees to our earnings.
Understanding the Intercept (β₀ = 100)
The intercept (β₀) represents the earnings when no customers visit. To find this value, we
look at where the line crosses the vertical axis when the number of customers is zero. In
simpler terms, it tells us what the base earnings are, even if no one shows up.
From the data, we calculated the slope (β₁ = 40), which means that for every additional
customer, earnings increase by 40 rupees. Now, to find the intercept, we need to consider
how much we earn when there are no customers.
We can use the equation:
Earnings = β₀ + β₁ × Customers
If we take any data point, say when there are 10 customers, the earnings are 500 rupees.
Substituting these values into the equation:
500 = β₀ + (40 × 10)
500 = β₀ + 400
Solving this gives:
β₀ = 500 − 400 = 100
This means that, based on the data, if no customers show up, you'd still expect to make
100 rupees, maybe from regular customers or other fixed earnings. So, the intercept
value of 100 rupees represents the minimum amount you'd make on a day with zero
customers.
Final Equation:
Earnings = 100 + 40 × Customers
This equation means:
– You'll always earn 100 rupees, even if no one comes.
– Each new customer adds 40 rupees to your total.
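The slope and intercept can also be estimated with the least-squares method, sketched below in Python. The five data points are an assumption: the full data table does not appear in this section, so the values were chosen only to match what the text describes (500 rupees at 10 customers, and 200 more rupees for every 5 extra customers). The function name `fit_line` is hypothetical.

```python
# ASSUMED data: built to match the pattern described in the text,
# not copied from the original data table.
customers = [10, 15, 20, 25, 30]
earnings = [500, 700, 900, 1100, 1300]

def fit_line(xs, ys):
    """Least-squares estimates of the intercept (b0) and slope (b1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: the line must pass through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = fit_line(customers, earnings)
print(f"Earnings = {b0:.0f} + {b1:.0f} × Customers")
```

With perfectly linear data like this, least squares gives the same answer as the two-point reasoning in the text. With real, noisy data it gives the line that makes the squared errors as small as possible.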
· Step 4: Interpreting the Model
Once the model is built, we can use it to predict future earnings. For example, if
you expect 22 customers tomorrow, the predicted earnings would be:
Earnings = 100 + (40 × 22) = 100 + 880 = 980 Rs
This means that with 22 customers, you can expect to earn around 980
rupees.
· Step 5: Testing the Model
After building the model, it's important to test it using new data. Let's say on the
6th day, 28 customers visited your stall, and you earned 1,250 Rs. Using the
model, we can predict the earnings for 28 customers:
Predicted Earnings = 100 + (40 × 28) = 100 + 1,120 = 1,220 Rs.
However, you actually earned 1,250 Rs. The difference between the predicted
and actual earnings is called the error:
Error = 1,220 - 1,250 = -30 Rs.
While the prediction was close, it is not perfect, showing that real-world data
often has some variation.
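The prediction and testing steps can be sketched in Python as follows. The sketch uses the intercept (100) and slope (40) from the final equation and the figures from the text; the function name `predict_earnings` is hypothetical.

```python
INTERCEPT = 100  # earnings with zero customers (β₀)
SLOPE = 40       # extra earnings per additional customer (β₁)

def predict_earnings(customers):
    """Predicted daily earnings in rupees for a given number of customers."""
    return INTERCEPT + SLOPE * customers

# Step 4: prediction for 22 expected customers
predicted_22 = predict_earnings(22)

# Step 5: compare the prediction with the actual result on the 6th day
predicted_28 = predict_earnings(28)
actual_28 = 1250
error_28 = predicted_28 - actual_28  # same convention as the text: predicted - actual

print("Predicted earnings for 22 customers:", predicted_22)
print("Predicted earnings for 28 customers:", predicted_28)
print("Error on day 6:", error_28)
```

A negative error here means the model predicted less than what was actually earned.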
We can use clustering to group these students based on their performance in these two
subjects.
K-means Clustering
K-means clustering is one of the simplest and most popular techniques to group data. In
K-means, we need to decide how many groups (clusters) we want. For this example, let's
say we want to divide the students into two clusters: one for students who are strong in
math and one for students who are strong in English. The algorithm will group students
with similar performance in math and English together by calculating the distance
between their scores and finding patterns. It will assign students like Basim and Umer
(who are good at math) to one group and students like Anie and Tallat (who are good at
English) to another group.
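The idea can be illustrated with a minimal K-means sketch in plain Python. The scores below are hypothetical, because the students' actual marks are not shown in this section; only the names come from the text. The starting centers are also fixed by hand so that the result is repeatable, whereas real implementations usually pick them at random.

```python
import math

# HYPOTHETICAL (math, English) scores; only the names come from the text.
students = {
    "Basim": (90, 55),
    "Umer": (85, 60),
    "Anie": (50, 92),
    "Tallat": (58, 88),
}

def mean_point(points):
    """Center of a group of points: the average of each coordinate."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def k_means(points, centroids, max_iterations=100):
    """Repeat 'assign to nearest center' and 'move centers' until stable."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, center) for center in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each center to the mean of its cluster
        new_centroids = [
            mean_point(cluster) if cluster else center
            for cluster, center in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters

points = list(students.values())
# Assumed starting centers: one strong math student, one strong English student
centroids, clusters = k_means(points, [points[0], points[2]])

# Translate the clustered points back into student names
groups = [
    [name for name, score in students.items() if score in cluster]
    for cluster in clusters
]
print("Cluster 1:", groups[0])
print("Cluster 2:", groups[1])
```

The number of starting centers passed to `k_means` is the "K" in K-means: two centers produce two clusters.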
5.3.2 Evaluating and Interpreting Models
Once a model is built, it is important to check how well it performs and to
understand the results it provides. This is called model evaluation.
5.3.2.1 Performance Metrics
Performance metrics help us measure how well a model is doing. Some common metrics
include:
• Error Metrics
Error metrics measure how much the model's predictions differ from the actual
values. In our grocery example, if the model predicts a monthly grocery bill of 8,000 rupees
but the actual bill is 10,000 rupees, the difference is the error.
• Accuracy Metrics
Accuracy metrics tell us how many of the model's predictions were correct. For
example, if a model predicts whether a student will pass or fail an exam, accuracy
measures how often the model was right.
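Both kinds of metric can be sketched in a few lines of Python. The grocery figures come from the example above; the pass/fail lists are hypothetical and exist only to show the calculation. The error is computed as predicted minus actual, which is one common convention.

```python
def prediction_error(predicted, actual):
    """Error metric: how far a numeric prediction is from the actual value."""
    return predicted - actual

def accuracy(predicted_labels, actual_labels):
    """Accuracy metric: fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return correct / len(actual_labels)

# Grocery example from the text: predicted 8,000 rupees, actual 10,000 rupees
grocery_error = prediction_error(8000, 10000)

# HYPOTHETICAL pass/fail predictions for five students
predicted = ["pass", "pass", "fail", "pass", "fail"]
actual = ["pass", "fail", "fail", "pass", "fail"]
exam_accuracy = accuracy(predicted, actual)

print("Grocery bill error:", grocery_error, "rupees")
print("Size of the error:", abs(grocery_error), "rupees")
print(f"Pass/fail accuracy: {exam_accuracy:.0%}")
```

Error metrics suit models that predict numbers, such as a bill or earnings, while accuracy suits models that predict categories, such as pass or fail.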
5.3.2.2 Interpreting Outputs
Interpreting a model's output means understanding what the results tell us.
Drawing Conclusions from Insights
For example, if our linear regression model shows that hours studied strongly affect
exam scores, we can conclude that students should study more hours to improve their
scores.
5.3.2.3 Ethical Considerations
When building models, it is important to consider the ethical implications, such as
fairness and privacy.
Fairness and Bias
A model should be fair and not biased. For example, if a model is used to decide who
gets a loan, it should not unfairly favor one group of people over another.
Data Privacy
When using personal data to build models, it is important to respect privacy. For
example, if a company is using customer data to build models, they should ensure the
data is secure and not shared without permission.
5.4.1.3 Histograms
Histograms are used to show the distribution of a dataset. They group data into bins or
intervals, allowing you to see how frequently values occur within those ranges.
Example: If you want to analyze how students performed in a math exam, a histogram can
show the distribution of scores.
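The binning behind a histogram can be shown with a small text-based sketch in Python. The exam scores are hypothetical, and the bin width of 10 marks is an arbitrary choice for illustration.

```python
from collections import Counter

# HYPOTHETICAL math exam scores
scores = [45, 52, 58, 61, 64, 67, 70, 72, 75, 78, 81, 85, 88, 93]
BIN_WIDTH = 10  # each bin covers 10 marks, e.g. 70-79

def bin_label(score, width=BIN_WIDTH):
    """Return the label of the bin (interval) that a score falls into."""
    start = (score // width) * width
    return f"{start}-{start + width - 1}"

# Count how many scores fall into each bin
bin_counts = Counter(bin_label(score) for score in scores)

# Print one row per bin, ordered by the start of the interval
for label in sorted(bin_counts, key=lambda text: int(text.split("-")[0])):
    print(f"{label:>7} | {'#' * bin_counts[label]}")
```

Each row of `#` marks plays the role of one bar in the histogram: the longer the row, the more students scored in that range.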
5.4.1.4 Scatterplots
Scatterplots display relationships between two variables. Each point represents an
observation, and the position on the graph indicates values for both variables.
Example: A scatterplot can be used to explore the relationship between the number of
hours studied and exam scores.
Figure 5.4: Scatterplot showing the relationship between hours studied and exam scores
5.4.1.5 Boxplots
Boxplots, or whisker plots, summarize data distribution by displaying the median,
quartiles, and potential outliers. They provide a visual summary of data variability.
Example: A boxplot can be used to compare the exam scores of different classes to see
which class performed better overall.
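The values that a boxplot draws can be computed with the standard `statistics` module, as in the sketch below. The scores for the two groups are hypothetical, and the function name `five_number_summary` is only illustrative. Different tools calculate quartiles in slightly different ways, so the quartile values may not match a spreadsheet exactly.

```python
import statistics

# HYPOTHETICAL exam scores for two groups of students
group_scores = {
    "Section 1": [55, 62, 68, 70, 74, 79, 85],
    "Section 2": [40, 52, 66, 71, 80, 88, 95],
}

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

summaries = {name: five_number_summary(s) for name, s in group_scores.items()}
for name, summary in summaries.items():
    print(name, summary)
```

In a boxplot, the box runs from the first quartile to the third quartile, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.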
5.5 Tools for Data Visualization
As discussed in the previous section, data visualization helps us make sense of large
amounts of information by turning numbers into easy-to-understand charts and
graphs. In this section, we will discuss tools that can be used to create these
visualizations and guide you through how to create and interpret them step by step.
There are many tools available for creating data visualizations, but some of the easiest to
use are tools you are already familiar with, like Microsoft Excel and Google Sheets. These
tools are widely accessible and provide straightforward methods to create charts,
graphs, and other visual representations of data.
5.5.1 Using Excel and Google Sheets for Visualization
Excel and Google Sheets: These tools allow you to easily enter your data and then
generate a variety of visualizations such as bar charts, line graphs, and pie charts.
Example: Let's say you run a small business in your local area, and you want to track how
many products you sell each month. You can enter the data for each month in Excel or
Google Sheets, and with just a few clicks, you can create a bar chart to see which month
had the most sales.
5.5.2 Creating and Interpreting Visualizations
Step-by-Step Guide: Here's a simple guide to creating a visualization in Excel or Google
Sheets.
1. Enter Your Data: Start by entering your data into the spreadsheet. Example: In
one column, you could have the months (January, February, etc.), and in another
column, the sales figures for each month.
2. Select the Data: Highlight the data you want to visualize by clicking and
dragging your mouse over the cells.
3. Choose a Chart Type: Click on the "Insert" tab and select the type of chart you
want to create (bar chart, line graph, etc.).
4. Customize the Chart: You can add labels to your chart to make it clearer, such
as labeling the x-axis with the months and the y-axis with the sales figures. This
makes the chart easier to interpret.
5. Understanding Statistical Representations:
When you create a visualization, it's important to understand what the chart is
telling you.
Summary
This unit focuses on model building and data visualization, equipping learners with
the skills to analyze, interpret, and present data effectively. Key topics include:
1. Model Building:
o Understand the role and importance of statistical models in solving real-world
problems.
o Learn to build basic statistical models (e.g., linear regression) and evaluate
their performance using appropriate metrics.
2. Experimental Design:
o Gain an understanding of experimental design in data science, including
hypothesis testing, control groups, and randomization.
3. Data Visualization:
o Explore the types and uses of data visualizations (e.g., bar charts, scatter
plots, heatmaps).
o Learn methods for creating effective visualizations to communicate insights
clearly.
4. Descriptive Statistics:
o Understand the benefits of visualizing data through descriptive statistics
(e.g., mean, median, standard deviation) to summarize and interpret data
trends.
5. Tools for Visualization:
o Develop hands-on skills in creating visualizations using tools like MS Excel,
Google Sheets, Python (Matplotlib), and Tableau.
Multiple Choice Questions
1. An example of a basic statistical model:
a) Linear Regression b) Neural Networks
c) Decision Trees d) Support Vector Machines
2. The activity involved in experimental design in data science:
a) Creating visualizations b) Collecting and analyzing data systematically
c) Writing code for machine learning d) Building databases
3. A commonly used tool for creating data visualizations:
a) MS Excel b) Python (Matplotlib)
c) Tableau d) All of the above
4. The meaning of the slope in a linear regression model:
a) The intercept of the model
b) The change in the dependent variable for a unit change in the independent
variable
c) The error term
d) The mean of the data
5. An example of a real-world application of statistical models:
a) Predicting house prices b) Creating social media posts
c) Designing websites d) Writing essays
6. Option not considered a benefit of data visualization:
a) Identifying trends and patterns b) Communicating insights effectively
c) Making data more complex d) Summarizing large datasets
8. A primary goal of K-Means Clustering:
a) To classify data into predefined categories
b) To group data into clusters based on similarity
c) To predict continuous outcomes
d) To reduce the dimensionality of data
9. The meaning of "K" in K-Means Clustering:
a) Number of features in the dataset
b) Number of clusters to be formed
c) Number of iterations required for convergence
d) Number of data points in the dataset
Short Questions
1. What is the importance of building statistical models in real-world
applications?
2. Name one basic statistical model used for predicting outcomes and explain
its purpose.
3. List two types of data visualizations and describe when you would use each.
4. How does visualizing data help in understanding descriptive statistics?
Long Questions
1. Explain the role and importance of statistical models in solving real-world
problems.
2. Describe the steps involved in building a basic statistical model (e.g., linear
regression). Include details on data collection, model training, and evaluation.
3. Discuss the types of data visualizations and their uses.
4. Explain data collection methods.
5. Discuss the concept of measures of central tendency with an example.