Unit I DAN 315326 Notes

The document provides comprehensive notes on Data Analytics, covering its importance, types, lifecycle, and the significance of data quality and quantity. It outlines how data analytics supports informed decision-making, enhances customer experiences, and offers competitive advantages across various sectors. Additionally, it details the phases of the data analytics process and the different types of analytics, including descriptive, diagnostic, predictive, prescriptive, and visual analytics.


JAMIA INSTITUTE OF TECHNOLOGY

Akkalkuwa – 425415, Dist. Nandurbar (M.S.)


MSBTE Code: 0366 DTE Code:5239

DATA ANALYTICS
K-Scheme Code: 315326

Prepared By
Firozkhan S Pathan
Computer Department, JIT Akkalkuwa
DATA ANALYTICS NOTES (K SCHEME)

Unit - I Introduction to Data Analytics


1.1 Data Analytics: An Overview, Importance of Data Analytics
Data analytics is a field that involves the process of collecting, transforming, cleaning,
analyzing, and interpreting raw data to discover meaningful insights, trends, and patterns.
Its primary goal is to support data-driven decision-making in various domains, from
business and healthcare to science and government. Essentially, it helps organizations
understand "what happened," "why it happened," "what will happen," and "what should
be done."
Put simply, data analytics is the process of collecting, organizing, and studying data to find
useful information, understand what is happening, and make better decisions. It helps people
and businesses learn from data: what worked in the past, what is happening now, and what
might happen in the future.
Importance of Data Analytics
1. Informed Decision Making:
• Data analytics helps organizations make decisions backed by evidence rather than
intuition. With accurate data, businesses can make informed choices that are likely to
yield better results.
• Example: A retail chain might analyze sales data to optimize product placement in
stores.
2. Improving Efficiency and Productivity:
• Data analysis helps identify inefficiencies in processes and suggests ways to streamline
operations, ultimately improving productivity and reducing costs.
• Example: Analyzing supply chain data to minimize delays and optimize inventory
management.
3. Competitive Advantage:
• Organizations that can effectively analyze and act on data gain an edge over
competitors. By leveraging data, businesses can stay ahead of market trends and meet
customer demands more efficiently.
• Example: An e-commerce company can analyze customer preferences and adjust its
offerings in real time.
4. Enhanced Customer Experience:
• By analyzing customer behaviour, feedback, and preferences, businesses can create
more personalized experiences, build stronger customer relationships, and improve
customer satisfaction.
• Example: Netflix uses data analytics to recommend personalized content to users
based on their viewing history.

1 | Page  Firozkhan S Pathan (JIT-0366)



5. Risk Management:
• Data analytics allows organizations to identify potential risks and vulnerabilities before
they escalate into major problems. Example: Banks use fraud detection algorithms to
identify unusual transaction patterns and prevent financial fraud.
6. Forecasting and Planning:
• Predictive analytics helps businesses forecast future trends, which is crucial for
planning and strategy development. This helps organizations stay ahead of market
changes.
• Example: Manufacturers can predict demand fluctuations and adjust production
schedules accordingly.
7. Optimizing Marketing Strategies:
• Data analytics enables businesses to analyze customer behaviour, measure campaign
effectiveness, and segment their audience for targeted marketing.
• Example: A digital marketing team can analyze which ads drive the most conversions
and adjust their strategy for maximum ROI.
8. Innovation and Product Development:
• By analyzing data on customer needs, market trends, and competitive offerings,
companies can identify opportunities for new products or services.
• Example: A tech company might analyze user feedback and competitive products to
develop a new feature or app.


1.2 Types of Data Analytics: Descriptive Analysis, Diagnostic Analysis, Predictive Analysis, Prescriptive Analysis, Visual Analytics

Data analytics is commonly categorized into four main types (descriptive, diagnostic,
predictive, and prescriptive), each building upon the previous one to provide deeper insights
and guide more informed actions; visual analytics, covered at the end of this section,
complements all four:
1. Descriptive Analytics
• What it answers: "What happened?"
• Focus: Summarizes past data to give a clear picture of historical events and trends. It
describes the current state or past performance.
• Techniques: Data aggregation, basic statistics (averages, percentages), reporting, and
visualizations (charts, graphs, dashboards).
• Examples:
o Monthly sales reports showing revenue figures.
o Tracking website traffic and engagement metrics.
o Analyzing customer survey results to understand satisfaction levels.
o Company financial statements (e.g., balance sheet, income statement)
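As a minimal sketch (all figures here are hypothetical), descriptive analytics often starts with simple aggregation and summary statistics of this kind:

```python
import statistics

# Hypothetical monthly revenue figures: descriptive analytics answers "what happened?"
monthly_revenue = {"Jan": 120_000, "Feb": 95_000, "Mar": 140_000, "Apr": 110_000}

total = sum(monthly_revenue.values())                 # data aggregation
average = statistics.mean(monthly_revenue.values())   # basic statistic (average)
best_month = max(monthly_revenue, key=monthly_revenue.get)  # simple ranking

print(f"Total revenue: {total}")
print(f"Average per month: {average}")
print(f"Best month: {best_month}")
```

Reports and dashboards are essentially this kind of summary, presented visually.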
2. Diagnostic Analytics
• What it answers: "Why did it happen?"
• Focus: Delves deeper into the data to understand the root causes of past events or
trends identified by descriptive analytics.
• Techniques: Data drilling, data mining, correlation analysis, and root cause analysis.
• Examples:
o Investigating why sales suddenly dropped last quarter (e.g., linking it to a specific
marketing campaign, competitor activity, or a website issue).
o Determining why customer churn increased in a particular demographic.
o Identifying the factors that led to a spike in product returns.
o Analyzing machine logs to pinpoint the cause of equipment failure.


3. Predictive Analytics
• What it answers: "What will happen?" or "What might happen?"
• Focus: Uses historical data, statistical models, and machine learning to forecast future
outcomes, trends, and probabilities. It doesn't guarantee what will happen; it indicates
what is most likely to happen.
• Techniques: Regression analysis, forecasting, machine learning algorithms (e.g.,
neural networks, decision trees).
• Examples:
o Forecasting future sales or demand for products.
o Predicting customer churn or the likelihood of a customer purchasing a specific item.
o Assessing credit risk for loan applications.
o Predicting equipment maintenance needs to prevent breakdowns.
4. Prescriptive Analytics
• What it answers: "What should we do?" or "How can we make it happen?"
• Focus: This is the most advanced type of analytics. It goes beyond prediction to
recommend specific actions to achieve desired outcomes, taking into account various
factors and potential implications.
• Techniques: Optimization, simulation, decision trees, advanced machine learning, and
AI.
• Examples:
o Recommending optimal pricing strategies for products to maximize revenue.
o Suggesting the best delivery routes for logistics companies to minimize costs and
time.
o Providing personalized treatment plans in healthcare based on patient data and
predicted outcomes.
o Optimizing inventory levels to reduce carrying costs while preventing stockouts.
Video: https://youtu.be/sr_s2gTCTRk?si=DADJCFBsyOHx5sU3

https://youtu.be/FHnkRxJEYWo?si=8kPh90feegc-Il9i

https://youtu.be/QoEpC7jUb9k?si=rR5esJ2fG-FyMy9D

5. Visual Analytics
Visual analytics is a powerful form of reasoning that combines data analytics with
interactive visual interfaces. By using interactive visual representations of data, users can
easily interpret large volumes of information and uncover the hidden insights within. Unlike
simple data visualizations, which answer the "what" questions, such as "What are the
trends?" visual analytics digs deeper, answering the "why."


Visual analytics goes beyond simple visualizations. It allows users to explore their data
in-depth and discover the “why” behind it. Visual analytics allows users to dissect complex
data and grasp big-picture information effectively. The tools in visual analytics make it
possible to identify the root cause of trends, patterns, and correlations that are more
complex than basic visualizations. By examining sales figures, users can probe factors such
as price variance, demographic differences, location, season, and much more.
Importance of Visual Analytics:
• Faster Insights: The human brain processes visual information much quicker than raw
numbers or text, leading to rapid identification of trends, outliers, and patterns.
• Improved Decision-Making: By making complex data more understandable and
allowing for interactive exploration, it empowers users to make more informed and
confident decisions.
• Democratization of Data: It makes advanced data analysis accessible to a wider
audience, including business users who may not have deep technical or statistical
expertise.
• Enhanced Collaboration: Visualizations can be easily shared and discussed, fostering
better understanding and collaboration among teams and stakeholders.
• Problem-Solving: It helps in understanding complex problems, identifying root causes,
and exploring potential solutions through dynamic interaction with the data.
• Real-time Monitoring: Many visual analytics tools offer real-time dashboards, allowing
businesses to monitor key performance indicators (KPIs) and respond quickly to
changes.
Video : https://youtu.be/0og3HT8UqD4?si=ECvgn-kjZvBSWJu1


1.3 Life cycle of Data Analytics, Quality and Quantity of data, Measurement
Lifecycle of Data Analytics
The data analytics process involves six key phases to transform raw data into actionable
insights:

1. Discovery
• Define business objectives and scope.
• Identify data sources and perform a gap analysis.
• Formulate a hypothesis and set criteria for validation.
2. Data Preparation
• Collect and load data into an analytics sandbox.
• Clean data (preprocessing) and transform it for analysis (ETL/ELT).
• May involve handling missing values, duplicates, and outliers.
3. Model Planning
• Choose between SQL models (for BI dashboards), statistical models (for relationships),
or machine learning models (for pattern recognition).
• Consider dataset size, output use case, data labelling, accuracy vs. speed, and data
structure (structured vs. unstructured).
• Perform exploratory data analysis (EDA) to guide model selection.
4. Building & Executing the Model
• SQL Model: Define source tables, build queries, test, and publish.
• Statistical Model: Select the right test (e.g., regression, ANOVA), run analysis, and
publish results.
• Machine Learning Model: Split data into training/testing sets, train models, compare
performance, and select the best one.


5. Communicating Results
• Present findings with visualizations and a clear narrative.
• Highlight key insights and business value.
• Compare results against initial hypothesis criteria.
6. Operationalizing
• Deploy the model in production.
• Monitor performance and business impact.
• Share final reports across the organization.
This structured approach ensures data-driven decision-making aligned with business
goals.
For Ref: https://www.geeksforgeeks.org/software-engineering/life-cycle-phases-of-data-
analytics/
Video: https://youtu.be/LibTzI87AbM?si=F4UO7sOb0rLX3Utu
https://youtu.be/iqldcdxqVHI?si=vTMumoQfR2H-g7Te (In Hindi)
Quality and Quantity of Data in Data Analytics
In data analytics, both the quality and quantity of data play crucial roles in determining the
accuracy, reliability, and effectiveness of insights.
1. Data Quality
Definition: Data quality refers to how well-suited data is for its intended use, based on
factors like accuracy, completeness, consistency, and reliability.
Key Dimensions of Data Quality:
• Accuracy – Does the data correctly represent real-world values? (e.g., no typos in
customer names).
• Completeness – Are there missing values or gaps? (e.g., empty fields in a sales
database).
• Consistency – Is the data uniform across different sources? (e.g., date formats match).
• Timeliness – Is the data up-to-date? (e.g., real-time vs. outdated sales figures).
• Relevance – Does the data align with the business problem? (e.g., including irrelevant
customer demographics).
• Validity – Does the data follow expected formats and rules? (e.g., phone numbers in the
correct structure).
Impact of Poor Data Quality:
• Misleading insights → Wrong business decisions.
• Increased costs (e.g., rework, failed models).
• Lower trust in analytics.


Improving Data Quality:


• Data Cleaning – Handling missing values, removing duplicates, correcting errors.
• Data Validation – Setting rules (e.g., age must be between 18-100).
• Standardization – Ensuring uniform formats (e.g., "USA" vs. "United States").
• Automated Monitoring – Using tools to detect anomalies in real time.
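The improvement steps above can be sketched in Python; the records, fields, and validation rules here are hypothetical:

```python
# Hypothetical customer records exhibiting common quality problems
records = [
    {"name": "Asha", "age": 29, "country": "USA"},
    {"name": "Asha", "age": 29, "country": "USA"},              # duplicate
    {"name": "Ravi", "age": None, "country": "United States"},  # missing value
    {"name": "Meena", "age": 230, "country": "USA"},            # fails validation
]

# Standardization: map country variants to one canonical form
canonical = {"United States": "USA"}

def clean(rows):
    seen, out = set(), []
    for r in rows:
        r = dict(r, country=canonical.get(r["country"], r["country"]))
        key = (r["name"], r["country"])
        if key in seen:                  # data cleaning: remove duplicates
            continue
        if r["age"] is None:             # data cleaning: handle missing values (dropped here)
            continue
        if not (18 <= r["age"] <= 100):  # data validation: age must be between 18 and 100
            continue
        seen.add(key)
        out.append(r)
    return out

cleaned = clean(records)
print(cleaned)
```

In practice, missing values might be imputed rather than dropped; dropping keeps the sketch short.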
2. Data Quantity
Definition: Data quantity refers to the volume of data available for analysis.
Why Quantity Matters:
• Machine Learning & AI – More data improves model accuracy (especially deep learning).
• Statistical Significance – Larger datasets reduce sampling bias.
• Granularity – More data allows for deeper segmentation (e.g., customer behavior by
region).
Challenges with Too Little Data:
• Overfitting (models memorize noise instead of learning patterns).
• Unreliable predictions due to insufficient training samples.
• Limited ability to detect trends.
Challenges with Too Much Data (Big Data):
• Storage & Processing Costs – Requires scalable infrastructure (e.g., cloud, distributed
computing).
• Noise & Irrelevance – More data doesn’t always mean better insights; filtering is key.
• Privacy & Compliance Risks – Managing large datasets may violate GDPR or other
regulations.
Balancing Quantity & Quality:
• Prioritize relevant data – Not all data is useful; focus on what aligns with business goals.
• Use sampling – If the dataset is too large, analyze a representative subset.
• Augment data – If lacking quantity, use techniques like synthetic data generation.
Conclusion
• High-quality data ensures reliable, actionable insights.
• Sufficient quantity improves model performance and statistical confidence.
• The best analytics outcomes come from a balance of both—clean, relevant, and
adequately sized datasets.
For Ref: https://www.linkedin.com/pulse/data-quality-vs-quantity-what-matters-more-
fieldworkafrica-lpbec


Measurement in data analytics


Measurement in data analytics is the process of assigning numerical values or categories
to data to make it quantifiable and useful for analysis. It's about turning raw data into
meaningful metrics.
Key Concepts
• Variables: The characteristics being measured, which can be either dimensions
(qualitative, like a product category) or measures (quantitative, like revenue).
• Levels of Measurement: The classification system that determines what type of
statistical analysis can be performed on the data.
Four Levels of Measurement
• Nominal: Data is categorized without any order.
o Example: Hair color (blonde, brown, black).
• Ordinal: Data is categorized and ranked, but the difference between values isn't
uniform.
o Example: Customer satisfaction (low, medium, high).
• Interval: Data has meaningful, equal intervals, but no true zero point.
o Example: Temperature in Celsius.
• Ratio: Data has equal intervals and a true zero point, allowing for all mathematical
operations.
o Example: Height, weight, or age.
For Ref: https://careerfoundry.com/en/blog/data-analytics/data-levels-of-measurement/


1.4 Data Types, Measures of central tendency, Measures of dispersion

Data Types
Data types in data analytics classify data into specific categories, which determines how
the data can be used and analyzed. These types are broadly categorized as either
qualitative or quantitative.
A. Qualitative (Categorical) Data:
This type of data describes qualities or characteristics and cannot be measured with
numbers. It's often non-numeric and used for grouping.
• Nominal Data: Categorical data that can be named, but not ordered. There's no
inherent ranking.
o Example: Genders (male, female), marital status (single, married, divorced), or car
brands (Toyota, Ford, Honda).
• Ordinal Data: Categorical data with a meaningful order or ranking. The difference
between the ranks, however, isn't uniform or measurable.
o Example: Customer satisfaction ratings (very unsatisfied, unsatisfied, neutral,
satisfied, very satisfied), education levels (high school, bachelor's, master's, PhD), or
rankings in a competition (first, second, third place).
B. Quantitative (Numerical) Data:
This type of data represents numerical values and can be measured. It's often used for
calculations and statistical analysis.
• Discrete Data: Data that can only take on specific, fixed values. It's countable and
often represented by whole numbers.
o Example: The number of children in a family (you can't have 2.5 children), the
number of cars in a parking lot, or the number of defective products in a batch.
• Continuous Data: Data that can take on any value within a given range. It's
measurable and can be represented by fractions or decimals.
o Example: A person's height (can be 175.5 cm), the temperature of a room (can be
22.3°C), or the time it takes to complete a task.
Beyond these core types, data can also be categorized by their structure:
• Structured Data: Organized in a tabular format with rows and columns, easily
searchable and analysable.
o Example: Data in a relational database or an Excel spreadsheet.
• Unstructured Data: Lacks a predefined format and is not easily organized into rows
and columns.
o Example: Text from social media posts, images, audio files, videos.
• Semi-structured Data: Possesses some organizational properties but does not
conform to a strict tabular structure.
o Example: XML files, JSON data, emails.
For Ref: https://www.geeksforgeeks.org/maths/data-types-in-statistics/


Measures of central tendency


Measures of central tendency in data analytics are summary statistics that describe the
center or typical value of a dataset. The three most common measures are the mean,
median, and mode. They help in understanding the distribution and characteristics of data.
Here's a breakdown of each measure:
1. Mean:
• The mean is the average of all values in a dataset.
• It's calculated by summing all the values and dividing by the total number of values.
Mean = Sum of all Observations ÷ Total number of Observations
o Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the
mean (x̄) is given by
x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5 = 95 ÷ 5 = 19
2. Median:
• The median is the middle value in a dataset when the values are arranged in ascending
order.
• If there's an even number of values, the median is the average of the two middle values.
• The median is less sensitive to outliers compared to the mean.
For Odd Number:
Median = Value of observation at [(n + 1) ÷ 2]th Position
o Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, then the
Median is given by
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 36, 38
N = 9, which is odd, then
Median = Value at [(9 + 1) ÷ 2]th position
Median = Value at 5th position
Median = 26
For Even Number:
Median = Arithmetic mean of Values of observations at (n ÷ 2)th and [(n ÷ 2) + 1]th
Position
o Example 2: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, 32, then the
Median is given by
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32, 36, 38
N = 10, which is even, then
Median = Arithmetic mean of values at (10 ÷ 2)th and [(10 ÷ 2) + 1]th positions
Median = (Value at 5th position + Value at 6th position) ÷ 2
Median = (26 + 28) ÷ 2
Median = 27


3. Mode:
• The mode is the value that appears most frequently in a dataset.
• A dataset can have one mode (unimodal), more than one mode (bimodal, trimodal, etc.),
or no mode.
• The mode is particularly useful for categorical data.
o Example: Find the mode of observations 5, 3, 4, 3, 7, 3, 5, 4, 3.
Create a table with each observation and its frequency as follows:

xi : 5   3   4   7
fi : 2   4   2   1

Since 3 occurs the maximum number of times (4 times) in the given data, the mode of
the given ungrouped data is 3.
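The three worked examples above can be checked with Python's standard statistics module, using the same observations:

```python
import statistics

# Mean of the five observations from the mean example
mean_val = statistics.mean([27, 11, 17, 19, 21])                           # 19

# Median with an odd number of observations (n = 9)
median_odd = statistics.median([25, 36, 31, 23, 22, 26, 38, 28, 20])      # 26

# Median with an even number of observations (n = 10)
median_even = statistics.median([25, 36, 31, 23, 22, 26, 38, 28, 20, 32]) # 27

# Mode: the most frequent value
mode_val = statistics.mode([5, 3, 4, 3, 7, 3, 5, 4, 3])                   # 3
```

The module sorts the data internally, so the observations can be passed in any order.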
For Ref: https://www.geeksforgeeks.org/maths/measures-of-central-tendency/
https://byjus.com/maths/central-tendency/
Video: https://youtu.be/X48cZ6DGaSw?si=zhU6BZLBdQuhTIuO

Measures of dispersion
Measures of dispersion, also known as measures of variability or spread, quantify how
spread out or scattered the data points are in a dataset. A measure of dispersion with a
value of zero indicates that all the data points are identical. The value increases as the data
becomes more diverse.

There are several key measures of dispersion used in data analytics:


• Range: The simplest measure of dispersion, the range is the difference between the
highest and lowest values in a dataset.
o Formula: Range = Maximum Value - Minimum Value
o Limitation: It is heavily influenced by outliers, as it only considers the two extreme
values.
• Variance: The variance measures the average of the squared differences from the
mean. It quantifies how much individual data points deviate from the average.


o Formula: Population Variance (σ²) = (1/N) × Σᵢ (xᵢ − μ)², summed over i = 1 to N


o Limitation: The units of variance are squared, which can make it less intuitive to
interpret.
• Standard Deviation: The standard deviation is the square root of the variance. It's the
most common measure of dispersion because it's expressed in the same units as the
original data, making it easier to understand.
o Formula: Standard Deviation (σ) = √Variance
o A small standard deviation indicates that data points are clustered closely around
the mean, while a large standard deviation means they are more spread out.
• Interquartile Range (IQR): The IQR is the difference between the third quartile (Q3)
and the first quartile (Q1). It measures the spread of the middle 50% of the data.
o Formula: IQR = Q3−Q1
o Advantage: It is less affected by outliers than the range, making it a more robust
measure. It can also be used to identify potential outliers in a dataset.
For example, consider two different datasets for the daily sales of a company over a
week. Both datasets have the same mean (7), but the measures of dispersion show they are very
different. Dataset A has a small range, variance, and standard deviation, indicating the
sales figures are tightly clustered around the mean. Dataset B has a much larger range,
variance, and standard deviation, showing the sales are more widely spread out.
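As an illustration, take two hypothetical weekly sales datasets sharing a mean of 7:

```python
import statistics

# Hypothetical daily sales for one week; both datasets have mean = 7
dataset_a = [6, 7, 7, 7, 7, 7, 8]     # tightly clustered around the mean
dataset_b = [1, 3, 5, 7, 9, 11, 13]   # widely spread out

for name, data in [("A", dataset_a), ("B", dataset_b)]:
    rng = max(data) - min(data)           # Range = Maximum Value - Minimum Value
    var = statistics.pvariance(data)      # population variance
    sd = statistics.pstdev(data)          # population standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)
    print(name, statistics.mean(data), rng, round(var, 2), round(sd, 2), q3 - q1)
```

Dataset B's range (12), variance (16), and standard deviation (4) dwarf Dataset A's, even though the means match.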
For Ref: https://www.geeksforgeeks.org/maths/measures-of-dispersion/
Video: https://youtu.be/TLRKawtvC1Q?si=sxG7rGkrBENC-1VI
https://youtu.be/64ELhoTvzk0?si=RBr74ldpMjSihfgL

1.5 Sampling Funnel, Central Limit Theorem, Confidence Interval, Sampling Variation
Sampling Funnel

In data analytics, a sampling funnel refers to the process of gradually reducing a large
dataset into smaller, more focused samples through a series of filters or criteria. Each stage
of the funnel narrows down the data, making it more specific and manageable for analysis.
The sampling funnel is used in data analysis to:
• Improve Efficiency: Analyzing a smaller, targeted sample is significantly faster and
requires fewer computational resources than analyzing the entire dataset.
• Enhance Data Quality: By filtering out noisy data, irrelevant entries, or outliers during
the sampling process, the quality and integrity of the final sample are improved, leading
to more accurate analysis.
• Focus Analysis: It allows data analysts to concentrate on specific subsets of data that
are most relevant to their research question or business objective.
• Reduce Bias: A well-designed sampling funnel can help ensure that the final sample is
a representative subset of the original data, thus minimizing sampling bias.
Stages of a Sampling Funnel
Here are the typical stages, from broadest to most specific:
1. Population (Universe) – Top of the Funnel (TOFU)
• What it is: The full dataset or all potential data points (users, sessions, events, etc.)
• Purpose: Start with everything available before filtering.
• Example: All users who visited the website in the past month.
2. Initial Screening / Qualification
• What it is: Apply the first filter to identify those who meet basic criteria.
• Purpose: Remove irrelevant or out-of-scope data.
• Example: Users who landed on a product page.
3. Engaged Users / Events – Middle of the Funnel (MOFU)
• What it is: A more refined segment showing some level of engagement or action.
• Purpose: Focus on users showing interest or intent.
• Example: Users who added a product to the cart.


4. Conversion / Target Action – Bottom of the Funnel (BOFU)


• What it is: Final group that completed the desired action.
• Purpose: The core sample for success analysis or conversion rates.
• Example: Users who completed a purchase.
5. Retention / Post-Action (Optional, depending on context)
• What it is: Users who return or continue engagement after the target action.
• Purpose: Analyze long-term behaviour, loyalty, or satisfaction.
• Example: Customers who return to make a second purchase.
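The funnel stages above can be expressed as successive filters over event records (the user data here is hypothetical):

```python
# Hypothetical user events: one record per user for the past month
users = [
    {"id": 1, "viewed_product": True,  "added_to_cart": True,  "purchased": True,  "returned": True},
    {"id": 2, "viewed_product": True,  "added_to_cart": True,  "purchased": False, "returned": False},
    {"id": 3, "viewed_product": True,  "added_to_cart": False, "purchased": False, "returned": False},
    {"id": 4, "viewed_product": False, "added_to_cart": False, "purchased": False, "returned": False},
]

population = users                                          # 1. Population (TOFU)
qualified = [u for u in population if u["viewed_product"]]  # 2. Initial screening
engaged = [u for u in qualified if u["added_to_cart"]]      # 3. Engaged users (MOFU)
converted = [u for u in engaged if u["purchased"]]          # 4. Conversion (BOFU)
retained = [u for u in converted if u["returned"]]          # 5. Retention

funnel_counts = [len(s) for s in (population, qualified, engaged, converted, retained)]
print(funnel_counts)  # each stage narrows the sample
```

Dividing adjacent counts gives stage-to-stage conversion rates, the core output of funnel analysis.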
For Ref: https://www.geeksforgeeks.org/data-analysis/what-is-data-sampling-types-
importance-best-practices/
Video: https://youtu.be/zbfi5ClQ08I?si=-QLZUTyu-c8284K7
https://youtu.be/aYphfb3itVg?si=hrsZi-qJXqjFIJOY

Central Limit Theorem

It states that if you take a sufficiently large number of random samples from any population,
the distribution of the sample means will be approximately a normal distribution (a bell
curve), regardless of the original population's distribution.
Why is CLT Important?
1. Foundation for Inferential Statistics
o Enables hypothesis testing, confidence intervals, and regression analysis.
2. Normal Approximation Justification
o Many statistical methods (e.g., t-tests, ANOVA) assume normality; CLT justifies their
use even if the population isn’t normal.
3. Applicability Across Distributions
o Works for any population distribution (skewed, uniform, binomial, etc.) if sample size
is sufficiently large.
4. Practical Use in Real-World Data
o Helps in making predictions and decisions when the true population distribution is
unknown.

Key Principles of CLT


1. Random Sampling
o Samples must be independent and identically distributed (i.i.d.).
2. Sample Size (n ≥ 30 Rule of Thumb)
o For most distributions, n ≥ 30 ensures approximate normality. Highly skewed
distributions may need larger n.
3. Population Mean (μ) and Variance (σ²) Finite
o CLT holds if the population has a finite mean and variance.

Quick statement of CLT


If you take many independent random samples of size n from any population (with finite
mean μ and variance σ²), the distribution of the sample mean will be approximately Normal
with
• mean = μ
• standard error (SE) = σ / √n (approximation improves as n grows).
Here’s a population-related CLT example in a step-by-step format for data analytics:
Scenario — Estimating Average Height in a City
Population: Entire city of 1 million adults.
• True (unknown) population mean height μ = 165 cm
• True population SD σ = 10 cm
• Height distribution is not perfectly normal — it’s slightly skewed due to age and
gender differences.
Step-by-Step Illustration of CLT
1. Objective
You can’t measure all 1 million people, so you take samples to estimate the population
mean height.
2. Choose sample size
Let’s take n = 40 people per sample.
3. Apply CLT logic
• According to CLT, if we repeatedly take many samples of size 40, the distribution of the
sample means will approximate a Normal distribution, regardless of the original
height distribution.
4. Calculate Theoretical Standard Error (SE)
SE = σ / √n = 10 / √40 ≈ 1.58 cm
5. Simulate (conceptual)
• Take 10,000 random samples (each of 40 people).


• For each sample, compute the mean height.


• The histogram of these sample means will look bell-shaped (Normal), even though
the original population is slightly skewed.
6. Example outcome
• Empirical mean of sample means ≈ 165 cm (close to μ)
• Empirical SD of sample means ≈ 1.57 cm (close to theoretical 1.58 cm)
• 95% of sample means fall between 165 ± 1.96 × 1.58 → (161.9, 168.1 cm)
7. Practical analytics takeaway
• Even with a non-Normal population, sample means from sufficiently large samples
follow a Normal pattern.
• This allows you to use z-tests or confidence intervals for inference about the population
mean without needing the population distribution to be Normal.
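The behaviour described above is easy to check by simulation. The sketch below uses only the Python standard library; the two-Normal mixture is a made-up stand-in for the skewed city-height population (the numbers are illustrative, not taken from the notes):

```python
import random
import statistics

random.seed(42)

# Made-up skewed "height" population standing in for the city example
# (a mixture of two Normals is skewed and is NOT itself Normal).
population = ([random.gauss(160, 8) for _ in range(60_000)]
              + [random.gauss(172, 8) for _ in range(40_000)])

mu = statistics.mean(population)        # true population mean
sigma = statistics.pstdev(population)   # true population SD

n = 40                                  # sample size, as in the scenario
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(10_000)] # 10,000 repeated samples

empirical_se = statistics.pstdev(sample_means)
theoretical_se = sigma / n ** 0.5       # SE = sigma / sqrt(n)

print(f"population mean      = {mu:.2f}")
print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"empirical SE         = {empirical_se:.3f}")
print(f"theoretical SE       = {theoretical_se:.3f}")
```

The empirical standard deviation of the 10,000 sample means lands very close to σ/√n, and a histogram of `sample_means` would look bell-shaped even though the population itself is not Normal.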

For Ref: https://www.geeksforgeeks.org/machine-learning/central-limit-theorem-in-data-science-and-data-analytics/
https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/#Central_Limit_Theorem_Explained
Video: https://youtu.be/9Fmu3H2dKDA?si=G-46f4UYhKWRmvAB
https://youtu.be/37sBkv4Ga8M?si=Ch6a3hdA7YVDi0bA
https://youtu.be/a7szVlUy9dU?si=8Kc133nLWR7TbnLi
Confidence Interval
A Confidence Interval (CI) is a range of values that is likely to contain the true value
of something we are trying to measure, such as the average height of students or the
average income of a population.
Instead of saying: “The average height is 165 cm.”
We can say: “We are 95% confident the average height is between 160 cm and 170 cm.”

A confidence interval shows the probability that a parameter will fall between a pair of
values around the mean. Confidence intervals show the degree of uncertainty or certainty
in a sampling method. They are constructed using confidence levels of 95% or 99%.

Confidence Interval Formula

The formula to find the Confidence Interval is:

CI = X̄ ± Z × (S / √n)

• X̄ is the sample mean.
• Z is the number of standard deviations from the sample mean (the z-value for the
chosen confidence level).
• S is the standard deviation of the sample.
• n is the size of the sample.
The value after the ± symbol is known as the margin of error.
Question: In a tree, there are hundreds of mangoes. You randomly choose 40 mangoes
with a mean of 80 and a standard deviation of 4.3. Determine that the mangoes are big
enough.
Solution:
Mean = 80
Standard deviation = 4.3
Number of observations = 40
Take the confidence level as 95%. Therefore, the value of Z = 1.96.
Substituting the value in the formula, we get
= 80 ± 1.960 × [ 4.3 / √40 ]
= 80 ± 1.960 × [ 4.3 / 6.32]
= 80 ± 1.960 × 0.6803
= 80 ± 1.33
The margin of error is 1.33
We can be 95% confident that the true mean size of all the mangoes on the tree lies
between 78.67 and 81.33.
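The mango calculation can be reproduced in a few lines of Python. The helper function below is a hypothetical sketch (not from the notes) of the z-based CI formula, X̄ ± Z × S/√n:

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    # z-based confidence interval: mean ± z * sd / sqrt(n)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Mango example: sample mean 80, sample SD 4.3, n = 40, 95% confidence
low, high = confidence_interval(80, 4.3, 40)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # → 95% CI: (78.67, 81.33)
```

Changing `z` to 2.576 would give the wider 99% interval for the same sample.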
For Ref: https://www.simplilearn.com/tutorials/data-analytics-tutorial/confidence-
intervals-in-statistics
https://www.geeksforgeeks.org/dsa/confidence-interval/
Video: https://youtu.be/ENnlSlvQHO0?si=DUkwI9zHkMacAuo6
https://youtu.be/0Kmc--WA-Do?si=Bwvo03SnElxHw2DC

Sampling variation
Sampling variation is a core concept in data analytics that refers to the natural difference
between a sample statistic and the true population parameter. Since it's often
impractical or impossible to analyze an entire population, analysts use a sample—a
smaller, representative subset of the population. The value calculated from this sample
(the "statistic") will almost always be slightly different from the actual value of the entire
population (the "parameter") due to random chance in the sampling process. This
difference is known as sampling variation or sampling error.
Even if the population parameters stay fixed, the results you get will differ slightly from
sample to sample because you’re not measuring the whole population.
Key points
• Cause: Randomness in which individuals/data points are included in the sample.
• Effect: Different samples → slightly different estimates.
• Magnitude: Larger samples reduce variation; smaller samples show more fluctuation.
• Measure: Quantified by Standard Error (SE).
Example in Data Analytics
Scenario: You’re estimating the average daily sales for a retail chain.
1. Population: All sales from 365 days last year (true average = $5,000).
2. Sample 1 (n=30 days): average = $4,950.
3. Sample 2 (n=30 days): average = $5,120.
4. Sample 3 (n=30 days): average = $4,870.
The differences are not due to a change in the population but due to sampling variation.
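A small simulation makes the retail scenario concrete. The daily sales figures below are invented for the demo (the true mean will not be exactly $5,000), but the point survives: repeated samples from one fixed population give different estimates by chance alone:

```python
import random
import statistics

random.seed(7)

# Illustrative population: 365 daily sales figures centred near $5,000
# (values invented for the demo, not real data).
daily_sales = [random.gauss(5000, 400) for _ in range(365)]
true_mean = statistics.mean(daily_sales)

# Three independent random samples of 30 days give three different estimates.
estimates = [statistics.mean(random.sample(daily_sales, 30)) for _ in range(3)]

print(f"true mean    : {true_mean:,.0f}")
for i, est in enumerate(estimates, 1):
    print(f"sample {i} mean: {est:,.0f}")
# The estimates differ from the true mean (and from each other) purely
# because of which 30 days happened to be drawn -- sampling variation.
```

Increasing the sample size from 30 toward 365 days would shrink these fluctuations, consistent with SE = σ/√n.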
Application of Data Analytics
Data analytics applications are widespread, impacting various sectors from business and
healthcare to transportation and finance. They enable organizations to gain insights from
data, improve decision-making, optimize operations, and enhance customer
experiences. Common applications include personalized marketing, fraud detection,
supply chain management, and resource optimization.
Key Applications Across Sectors
• Business and Marketing: Data analytics is used to create personalized marketing
campaigns, optimize supply chains and pricing strategies, and improve customer
acquisition and retention. It also plays a vital role in fraud detection for financial
transactions.
• Healthcare: In this sector, data analytics helps to improve patient care through earlier
disease diagnosis and more effective treatment plans. It also aids in public health
initiatives and accelerates the drug discovery process.

• Transportation and Logistics: Data analysis helps in optimizing traffic management,
planning efficient delivery and public transportation routes, and streamlining overall
logistics.
• Finance: Financial institutions use data analytics for fraud detection and to better
manage risk associated with loans and investments.
• Government: Governments use data analytics to inform policy-making, optimize
resource allocation based on public needs, and improve public safety by analyzing
crime data.

---------End of Unit I--------
