Unit I DAN 315326 Notes
DATA ANALYTICS
K-Scheme Code: 315326
Prepared By
Firozkhan S Pathan
Computer Department, JIT Akkalkuwa
DATA ANALYTICS NOTES (K SCHEME)
5. Risk Management:
• Data analytics allows organizations to identify potential risks and vulnerabilities before
they escalate into major problems. Example: Banks use fraud detection algorithms to
identify unusual transaction patterns and prevent financial fraud.
6. Forecasting and Planning:
• Predictive analytics helps businesses forecast future trends, which is crucial for
planning and strategy development. This helps organizations stay ahead of market
changes.
• Example: Manufacturers can predict demand fluctuations and adjust production
schedules accordingly.
7. Optimizing Marketing Strategies:
• Data analytics enables businesses to analyze customer behaviour, measure campaign
effectiveness, and segment their audience for targeted marketing.
• Example: A digital marketing team can analyze which ads drive the most conversions
and adjust their strategy for maximum ROI.
8. Innovation and Product Development:
• By analyzing data on customer needs, market trends, and competitive offerings,
companies can identify opportunities for new products or services.
• Example: A tech company might analyze user feedback and competitive products to
develop a new feature or app.
Data analytics can be categorized into four main types, each building upon the previous
one to provide deeper insights and guide more informed actions (a fifth approach, visual
analytics, complements them and is covered afterwards):
1. Descriptive Analytics
• What it answers: "What happened?"
• Focus: Summarizes past data to give a clear picture of historical events and trends. It
describes the current state or past performance.
• Techniques: Data aggregation, basic statistics (averages, percentages), reporting, and
visualizations (charts, graphs, dashboards).
• Examples:
o Monthly sales reports showing revenue figures.
o Tracking website traffic and engagement metrics.
o Analyzing customer survey results to understand satisfaction levels.
o Company financial statements (e.g., balance sheet, income statement)
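The descriptive summaries above can be sketched in a few lines of code; the monthly figures below are invented for illustration:

```python
from statistics import mean

# Hypothetical monthly revenue figures (in thousands) for one quarter
monthly_revenue = {"Jan": 120, "Feb": 135, "Mar": 150}

total_revenue = sum(monthly_revenue.values())               # aggregation
average_revenue = mean(monthly_revenue.values())            # basic statistic
best_month = max(monthly_revenue, key=monthly_revenue.get)  # simple ranking

print(f"Total: {total_revenue}, Average: {average_revenue}, Best month: {best_month}")
```

This is the "what happened" layer: no causes, no forecasts, just a summary of past data.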
2. Diagnostic Analytics
• What it answers: "Why did it happen?"
• Focus: Delves deeper into the data to understand the root causes of past events or
trends identified by descriptive analytics.
• Techniques: Data drilling, data mining, correlation analysis, and root cause analysis.
• Examples:
o Investigating why sales suddenly dropped last quarter (e.g., linking it to a specific
marketing campaign, competitor activity, or a website issue).
o Determining why customer churn increased in a particular demographic.
o Identifying the factors that led to a spike in product returns.
o Analyzing machine logs to pinpoint the cause of equipment failure.
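Correlation analysis, one of the diagnostic techniques listed above, can be sketched as follows; the weekly ad-spend and sales figures are hypothetical:

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Did lower ad spend coincide with the drop in sales? (hypothetical weekly data)
ad_spend = [10, 12, 9, 4, 3, 5]      # thousands
sales    = [52, 58, 50, 30, 27, 33]  # thousands

r = pearson(ad_spend, sales)
print(f"Correlation between ad spend and sales: r = {r:.2f}")
```

A strong positive r suggests (but does not prove) that the cut in ad spend is worth investigating as a root cause.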
3. Predictive Analytics
• What it answers: "What will happen?" or "What might happen?"
• Focus: Uses historical data, statistical models, and machine learning to forecast future
outcomes, trends, and probabilities. It doesn't tell you with certainty what will happen,
only what is most likely to happen.
• Techniques: Regression analysis, forecasting, machine learning algorithms (e.g.,
neural networks, decision trees).
• Examples:
o Forecasting future sales or demand for products.
o Predicting customer churn or the likelihood of a customer purchasing a specific item.
o Assessing credit risk for loan applications.
o Predicting equipment maintenance needs to prevent breakdowns.
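A minimal predictive sketch, assuming a simple linear trend fitted by least squares; the quarterly sales figures are made up:

```python
from statistics import mean

def fit_trend(values):
    """Least-squares slope and intercept for values over time steps 0..n-1."""
    xs = list(range(len(values)))
    mx, my = mean(xs), mean(values)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, values)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical quarterly sales (units) showing steady growth
sales = [100, 110, 120, 130]

slope, intercept = fit_trend(sales)
next_quarter = slope * len(sales) + intercept  # forecast for the next time step
print(f"Forecast for next quarter: {next_quarter:.0f} units")
```

Real forecasting would add seasonality and uncertainty estimates; the point here is only that the model is fitted on historical data and extrapolated forward.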
4. Prescriptive Analytics
• What it answers: "What should we do?" or "How can we make it happen?"
• Focus: This is the most advanced type of analytics. It goes beyond prediction to
recommend specific actions to achieve desired outcomes, taking into account various
factors and potential implications.
• Techniques: Optimization, simulation, decision trees, advanced machine learning, and
AI.
• Examples:
o Recommending optimal pricing strategies for products to maximize revenue.
o Suggesting the best delivery routes for logistics companies to minimize costs and
time.
o Providing personalized treatment plans in healthcare based on patient data and
predicted outcomes.
o Optimizing inventory levels to reduce carrying costs while preventing stockouts.
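A tiny prescriptive sketch: assuming a hypothetical linear demand model, a grid search over candidate prices recommends the revenue-maximizing action:

```python
# Hypothetical linear demand model: demand falls as price rises.
def demand(price):
    return max(0, 1000 - 20 * price)  # units sold at a given price

# Prescriptive analytics recommends an action, not just a prediction:
# search candidate prices for the one that maximizes revenue.
candidate_prices = range(5, 51)  # prices from $5 to $50
best_price = max(candidate_prices, key=lambda p: p * demand(p))
best_revenue = best_price * demand(best_price)
print(f"Recommended price: ${best_price} (revenue {best_revenue})")
```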
Video: https://youtu.be/sr_s2gTCTRk?si=DADJCFBsyOHx5sU3
https://youtu.be/FHnkRxJEYWo?si=8kPh90feegc-Il9i
https://youtu.be/QoEpC7jUb9k?si=rR5esJ2fG-FyMy9D
5. Visual Analytics
Visual analytics is a powerful form of reasoning that combines data analytics with
interactive visual interfaces. By using interactive visual representations of data, users can
easily interpret large volumes of information and uncover the hidden insights within. Unlike
simple data visualizations, which answer the "what" questions, such as "What are the
trends?" visual analytics digs deeper, answering the "why."
Visual analytics allows users to dissect complex data and grasp big-picture information
effectively. Its tools make it possible to identify the root causes of trends, patterns, and
correlations that are too complex for basic visualizations to reveal. By examining sales
figures, for example, users can probe factors such as price variance, demographic
differences, location, and season.
Importance of Visual Analytics:
• Faster Insights: The human brain processes visual information much quicker than raw
numbers or text, leading to rapid identification of trends, outliers, and patterns.
• Improved Decision-Making: By making complex data more understandable and
allowing for interactive exploration, it empowers users to make more informed and
confident decisions.
• Democratization of Data: It makes advanced data analysis accessible to a wider
audience, including business users who may not have deep technical or statistical
expertise.
• Enhanced Collaboration: Visualizations can be easily shared and discussed, fostering
better understanding and collaboration among teams and stakeholders.
• Problem-Solving: It helps in understanding complex problems, identifying root causes,
and exploring potential solutions through dynamic interaction with the data.
• Real-time Monitoring: Many visual analytics tools offer real-time dashboards, allowing
businesses to monitor key performance indicators (KPIs) and respond quickly to
changes.
Video : https://youtu.be/0og3HT8UqD4?si=ECvgn-kjZvBSWJu1
Data Analytics Life Cycle
1. Discovery
• Define business objectives and scope.
• Identify data sources and perform a gap analysis.
• Formulate a hypothesis and set criteria for validation.
2. Data Preparation
• Collect and load data into an analytics sandbox.
• Clean data (preprocessing) and transform it for analysis (ETL/ELT).
• May involve handling missing values, duplicates, and outliers.
3. Model Planning
• Choose between SQL models (for BI dashboards), statistical models (for relationships),
or machine learning models (for pattern recognition).
• Consider dataset size, output use case, data labelling, accuracy vs. speed, and data
structure (structured vs. unstructured).
• Perform exploratory data analysis (EDA) to guide model selection.
4. Building & Executing the Model
• SQL Model: Define source tables, build queries, test, and publish.
• Statistical Model: Select the right test (e.g., regression, ANOVA), run analysis, and
publish results.
• Machine Learning Model: Split data into training/testing sets, train models, compare
performance, and select the best one.
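The train/test split step for a machine learning model can be sketched as follows; the dataset and the trivial threshold "model" are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical labelled dataset: (feature, label) pairs
data = [(x, 1 if x > 50 else 0) for x in range(100)]

# Split into 80% training and 20% testing sets
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# A deliberately trivial "model": a fixed threshold rule
def predict(x, threshold=50):
    return 1 if x > threshold else 0

# Evaluate only on the held-out test set, never on training data
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"Train size: {len(train)}, Test size: {len(test)}, Accuracy: {accuracy:.2f}")
```

Holding out a test set is what lets you compare candidate models fairly before selecting the best one.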
5. Communicating Results
• Present findings with visualizations and a clear narrative.
• Highlight key insights and business value.
• Compare results against initial hypothesis criteria.
6. Operationalizing
• Deploy the model in production.
• Monitor performance and business impact.
• Share final reports across the organization.
This structured approach ensures data-driven decision-making aligned with business
goals.
For Ref: https://www.geeksforgeeks.org/software-engineering/life-cycle-phases-of-data-analytics/
Video: https://youtu.be/LibTzI87AbM?si=F4UO7sOb0rLX3Utu
https://youtu.be/iqldcdxqVHI?si=vTMumoQfR2H-g7Te (In Hindi)
Quality and Quantity of Data in Data Analytics
In data analytics, both the quality and quantity of data play crucial roles in determining the
accuracy, reliability, and effectiveness of insights.
1. Data Quality
Definition: Data quality refers to how well-suited data is for its intended use, based on
factors like accuracy, completeness, consistency, and reliability.
Key Dimensions of Data Quality:
• Accuracy – Does the data correctly represent real-world values? (e.g., no typos in
customer names).
• Completeness – Are there missing values or gaps? (e.g., empty fields in a sales
database).
• Consistency – Is the data uniform across different sources? (e.g., date formats match).
• Timeliness – Is the data up-to-date? (e.g., real-time vs. outdated sales figures).
• Relevance – Does the data align with the business problem? (e.g., including irrelevant
customer demographics).
• Validity – Does the data follow expected formats and rules? (e.g., phone numbers in the
correct structure).
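A few of these dimensions (completeness, validity) can be checked programmatically; the records and the phone-format rule below are hypothetical:

```python
import re

# Hypothetical customer records with common quality problems
records = [
    {"name": "Asha", "phone": "+91-9876543210", "email": "asha@example.com"},
    {"name": "",     "phone": "12345",          "email": "ravi@example.com"},
]

PHONE_RE = re.compile(r"^\+\d{2}-\d{10}$")  # assumed validity rule: +CC-XXXXXXXXXX

def quality_issues(record):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record["name"]:                   # completeness check
        issues.append("missing name")
    if not PHONE_RE.match(record["phone"]):  # validity check
        issues.append("invalid phone format")
    return issues

report = {i: quality_issues(r) for i, r in enumerate(records)}
print(report)
```

Automating such checks early in the pipeline prevents poor-quality records from reaching the analysis stage.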
Impact of Poor Data Quality:
• Misleading insights → Wrong business decisions.
• Increased costs (e.g., rework, failed models).
• Lower trust in analytics.
Data Types
Data types in data analytics classify data into specific categories, which determines how
the data can be used and analyzed. These types are broadly categorized as either
qualitative or quantitative.
A. Qualitative (Categorical) Data:
This type of data describes qualities or characteristics and cannot be measured with
numbers. It's often non-numeric and used for grouping.
• Nominal Data: Categorical data that can be named, but not ordered. There's no
inherent ranking.
o Example: Genders (male, female), marital status (single, married, divorced), or car
brands (Toyota, Ford, Honda).
• Ordinal Data: Categorical data with a meaningful order or ranking. The difference
between the ranks, however, isn't uniform or measurable.
o Example: Customer satisfaction ratings (very unsatisfied, unsatisfied, neutral,
satisfied, very satisfied), education levels (high school, bachelor's, master's, PhD), or
rankings in a competition (first, second, third place).
B. Quantitative (Numerical) Data:
This type of data represents numerical values and can be measured. It's often used for
calculations and statistical analysis.
• Discrete Data: Data that can only take on specific, fixed values. It's countable and
often represented by whole numbers.
o Example: The number of children in a family (you can't have 2.5 children), the
number of cars in a parking lot, or the number of defective products in a batch.
• Continuous Data: Data that can take on any value within a given range. It's
measurable and can be represented by fractions or decimals.
o Example: A person's height (can be 175.5 cm), the temperature of a room (can be
22.3°C), or the time it takes to complete a task.
Beyond these core types, data can also be categorized by their structure:
• Structured Data: Organized in a tabular format with rows and columns, easily
searchable and analysable.
o Example: Data in a relational database or an Excel spreadsheet.
• Unstructured Data: Lacks a predefined format and is not easily organized into rows
and columns.
o Example: Text from social media posts, images, audio files, videos.
• Semi-structured Data: Possesses some organizational properties but does not
conform to a strict tabular structure.
o Example: XML files, JSON data, emails.
For Ref: https://www.geeksforgeeks.org/maths/data-types-in-statistics/
10 | Page Firozkhan S Pathan (JIT-0366)
3. Mode:
• The mode is the value that appears most frequently in a dataset.
• A dataset can have one mode (unimodal), more than one mode (bimodal, trimodal, etc.),
or no mode.
• The mode is particularly useful for categorical data.
o Example: Find the mode of the observations 5, 3, 4, 3, 7, 3, 5, 4, 3.
Tabulate each distinct observation xi with its frequency fi:
xi : 5 3 4 7
fi : 2 4 2 1
Since 3 occurs the maximum number of times (4 times) in the given data, the mode of
this ungrouped data is 3.
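The same mode calculation, using Python's statistics module:

```python
from statistics import multimode

observations = [5, 3, 4, 3, 7, 3, 5, 4, 3]

# multimode returns every most-frequent value, so it also handles
# bimodal data gracefully; here there is a single mode.
modes = multimode(observations)
print(f"Mode(s): {modes}")
```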
For Ref: https://www.geeksforgeeks.org/maths/measures-of-central-tendency/
https://byjus.com/maths/central-tendency/
Video: https://youtu.be/X48cZ6DGaSw?si=zhU6BZLBdQuhTIuO
Measures of dispersion
Measures of dispersion, also known as measures of variability or spread, quantify how
spread out or scattered the data points are in a dataset. A measure of dispersion with a
value of zero indicates that all the data points are identical. The value increases as the data
becomes more diverse.
Consider two datasets of sales figures that share the same mean (7). Measures of
dispersion can still show they are very different: Dataset A, with a small range, variance,
and standard deviation, has sales figures tightly clustered around the mean, while Dataset
B, with a much larger range, variance, and standard deviation, has sales that are more
widely spread out.
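The comparison can be reproduced with Python's statistics module; the datasets below are illustrative stand-ins chosen to share a mean of 7 (the original figures are not reproduced in the notes):

```python
from statistics import mean, pstdev, pvariance

dataset_a = [6, 7, 7, 7, 8]    # tightly clustered around the mean
dataset_b = [1, 4, 7, 10, 13]  # widely spread, same mean

for name, data in [("A", dataset_a), ("B", dataset_b)]:
    print(f"Dataset {name}: mean={mean(data)}, range={max(data) - min(data)}, "
          f"variance={pvariance(data)}, std dev={pstdev(data):.2f}")
```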
For Ref: https://www.geeksforgeeks.org/maths/measures-of-dispersion/
Video: https://youtu.be/TLRKawtvC1Q?si=sxG7rGkrBENC-1VI
https://youtu.be/64ELhoTvzk0?si=RBr74ldpMjSihfgL
Sampling Funnel
In data analytics, a sampling funnel refers to the process of gradually reducing a large
dataset into smaller, more focused samples through a series of filters or criteria. Each stage
of the funnel narrows down the data, making it more specific and manageable for analysis.
The sampling funnel is used in data analysis to:
• Improve Efficiency: Analyzing a smaller, targeted sample is significantly faster and
requires fewer computational resources than analyzing the entire dataset.
• Enhance Data Quality: By filtering out noisy data, irrelevant entries, or outliers during
the sampling process, the quality and integrity of the final sample are improved, leading
to more accurate analysis.
• Focus Analysis: It allows data analysts to concentrate on specific subsets of data that
are most relevant to their research question or business objective.
• Reduce Bias: A well-designed sampling funnel can help ensure that the final sample is
a representative subset of the original data, thus minimizing sampling bias.
Stages of a Sampling Funnel
Here are the typical stages, from broadest to most specific:
1. Population (Universe) – Top of the Funnel (TOFU)
• What it is: The full dataset or all potential data points (users, sessions, events, etc.)
• Purpose: Start with everything available before filtering.
• Example: All users who visited the website in the past month.
2. Initial Screening / Qualification
• What it is: Apply the first filter to identify those who meet basic criteria.
• Purpose: Remove irrelevant or out-of-scope data.
• Example: Users who landed on a product page.
3. Engaged Users / Events – Middle of the Funnel (MOFU)
• What it is: A more refined segment showing some level of engagement or action.
• Purpose: Focus on users showing interest or intent.
• Example: Users who added a product to the cart.
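These narrowing stages can be sketched as successive filters; the event records and field names are invented:

```python
# Hypothetical user event records
events = [
    {"user": "u1", "page": "product", "added_to_cart": True},
    {"user": "u2", "page": "home",    "added_to_cart": False},
    {"user": "u3", "page": "product", "added_to_cart": False},
    {"user": "u4", "page": "product", "added_to_cart": True},
]

# Stage 1: population (TOFU) -- everyone who visited
population = events

# Stage 2: initial screening -- landed on a product page
screened = [e for e in population if e["page"] == "product"]

# Stage 3: engaged users (MOFU) -- added a product to the cart
engaged = [e for e in screened if e["added_to_cart"]]

print(f"Funnel: {len(population)} -> {len(screened)} -> {len(engaged)}")
```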
Central Limit Theorem (CLT)
The Central Limit Theorem states that if you take a sufficiently large number of random
samples from any population, the distribution of the sample means will be approximately
a normal distribution (a bell curve), regardless of the original population's distribution.
Why is CLT Important?
1. Foundation for Inferential Statistics
o Enables hypothesis testing, confidence intervals, and regression analysis.
2. Normal Approximation Justification
o Many statistical methods (e.g., t-tests, ANOVA) assume normality; CLT justifies their
use even if the population isn’t normal.
3. Applicability Across Distributions
o Works for any population distribution (skewed, uniform, binomial, etc.) if sample size
is sufficiently large.
4. Practical Use in Real-World Data
o Helps in making predictions and decisions when the true population distribution is
unknown.
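A quick simulation illustrates the CLT: sample means drawn from a clearly non-normal (uniform) population still cluster around the population mean, with a spread close to sigma/sqrt(n):

```python
import random
from statistics import mean, pstdev

random.seed(42)

# Population that is clearly NOT normal: uniform integers 1..100
def draw_sample_mean(n=30):
    return mean(random.randint(1, 100) for _ in range(n))

# Distribution of 2000 sample means
sample_means = [draw_sample_mean() for _ in range(2000)]

grand_mean = mean(sample_means)
spread = pstdev(sample_means)
print(f"Mean of sample means ~ {grand_mean:.1f} (population mean is 50.5)")
print(f"Std dev of sample means ~ {spread:.2f} (close to sigma/sqrt(n))")
```

Plotting a histogram of `sample_means` would show the familiar bell curve, even though the underlying population is flat.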
Confidence Interval
A confidence interval gives a range of values around a sample estimate that is likely to
contain the true population parameter. Confidence intervals express the degree of
uncertainty in a sampling method; they are most commonly constructed at the 95% or
99% confidence level.
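A 95% confidence interval for a mean can be sketched as sample mean ± 1.96 × standard error (using the z critical value; the daily sales sample below is hypothetical):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of daily sales (in dollars)
sample = [4800, 5200, 4950, 5100, 5050, 4900, 5150, 4850, 5000, 5075]

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(n)

# 95% confidence interval using the z critical value 1.96
# (a t critical value would give a slightly wider interval for n = 10)
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error
print(f"95% CI for mean daily sales: ({lower:.0f}, {upper:.0f})")
```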
Sampling variation
Sampling variation is a core concept in data analytics that refers to the natural difference
between a sample statistic and the true population parameter. Since it's often
impractical or impossible to analyze an entire population, analysts use a sample—a
smaller, representative subset of the population. The value calculated from this sample
(the "statistic") will almost always be slightly different from the actual value of the entire
population (the "parameter") due to random chance in the sampling process. This
difference is known as sampling variation or sampling error.
Even if the population parameters stay fixed, the results you get will differ slightly from
sample to sample because you’re not measuring the whole population.
Key points
• Cause: Randomness in which individuals/data points are included in the sample.
• Effect: Different samples → slightly different estimates.
• Magnitude: Larger samples reduce variation; smaller samples show more fluctuation.
• Measure: Quantified by Standard Error (SE).
Example in Data Analytics
Scenario: You’re estimating the average daily sales for a retail chain.
1. Population: All sales from 365 days last year (true average = $5,000).
2. Sample 1 (n=30 days): average = $4,950.
3. Sample 2 (n=30 days): average = $5,120.
4. Sample 3 (n=30 days): average = $4,870.
The differences are not due to a change in the population but due to sampling variation.
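The scenario above can be simulated; the population below is synthetic (365 daily sales values generated around $5,000):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(7)

# Synthetic population: 365 days of sales centered near $5,000
population = [random.gauss(5000, 400) for _ in range(365)]
true_mean = mean(population)

# Three random samples of 30 days each give three different estimates
estimates = [mean(random.sample(population, 30)) for _ in range(3)]
print(f"True mean: {true_mean:.0f}; sample estimates: "
      + ", ".join(f"{e:.0f}" for e in estimates))

# Standard error quantifies the expected size of this variation
se = stdev(population) / sqrt(30)
print(f"Standard error for n=30: about {se:.0f}")
```

Each run of `random.sample` yields a slightly different estimate, none of which reflects any change in the population, only sampling variation.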
Application of Data Analytics
Data analytics applications are widespread, impacting various sectors from business and
healthcare to transportation and finance. They enable organizations to gain insights from
data, improve decision-making, optimize operations, and enhance customer
experiences. Common applications include personalized marketing, fraud detection,
supply chain management, and resource optimization.
Key Applications Across Sectors
• Business and Marketing: Data analytics is used to create personalized marketing
campaigns, optimize supply chains and pricing strategies, and improve customer
acquisition and retention. It also plays a vital role in fraud detection for financial
transactions.
• Healthcare: In this sector, data analytics helps to improve patient care through earlier
disease diagnosis and more effective treatment plans. It also aids in public health
initiatives and accelerates the drug discovery process.