Plotting Data Using Various Graphs

The document provides an overview of bar graphs, pie charts, histograms, and stem-and-leaf plots as tools for data visualization. It outlines their definitions, advantages, disadvantages, and applications, emphasizing the importance of choosing the right type of graph for different data types. Numerical examples illustrate how to construct and interpret these visualizations effectively.

Uploaded by

meena
Copyright
© All Rights Reserved

Plotting Data Using Bar Graph

1. Theory of Bar Graphs


Definition

A bar graph (or bar chart) is a visual representation of categorical or discrete data using
rectangular bars.

 Each bar’s height (or length) is proportional to the value or frequency of the category it
represents.
 Bars are equally spaced and of equal width, but their heights vary according to data
values.

Key Features

1. Axes
o X-axis: Represents the categories (e.g., students, dealers, products).
o Y-axis: Represents numerical values (frequency, count, or amount).
2. Bars
o Drawn vertically or horizontally.
o Equal width, equal spacing.
o Bars do not touch (unlike histograms).
3. Proportionality
o Height/length of each bar = directly proportional to the corresponding value.

Types of Bar Graphs

 Vertical Bar Graph: Bars are vertical (most common).
 Horizontal Bar Graph: Bars are horizontal (useful when category names are long).
 Grouped Bar Graph: Two or more data series drawn side by side within each category.
 Stacked Bar Graph: Sub-categories stacked within a single bar, so the full bar shows the combined value.

Advantages
✅ Easy to draw and understand.
✅ Useful for comparing different categories.
✅ Can represent frequencies, percentages, or amounts.

Disadvantages

❌ Not suitable for continuous data.
❌ Too many categories make the graph cluttered.
❌ Cannot show relationships between variables (only comparison).

Uses

 Comparing sales figures of companies.
 Representing survey responses.
 Showing population of different states/cities.
 Comparing marks of students.

2. Numerical Example
Problem Statement:
A shopkeeper records the number of cars sold by four dealers in one week:

Dealer | Cars Sold
------------------
A | 20
B | 35
C | 25
D | 15

Step 1: Identify categories and values

 Categories = Dealers A, B, C, D
 Values = Cars sold = [20, 35, 25, 15]

Step 2: Choose axes

 X-axis = Dealers
 Y-axis = Cars sold
Step 3: Plot bars

 Draw bars of equal width for each dealer.


 Heights: A = 20, B = 35, C = 25, D = 15.

Step 4: Interpret

 Dealer B sold the most cars (35).


 Dealer D sold the least (15).

[Figure: bar graph of cars sold by Dealers A, B, C, and D.]

 The graph makes comparison across categories very clear.
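
The four steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function name `text_bar_chart` and the scale of one `#` per 5 cars are assumptions chosen so the bars fit on a line.

```python
# Cars sold per dealer, taken from the problem statement above.
cars_sold = {"A": 20, "B": 35, "C": 25, "D": 15}

def text_bar_chart(data, unit=5):
    """Return one text bar per category; each '#' stands for `unit` cars (assumed scale)."""
    rows = []
    for category, value in data.items():
        bar = "#" * (value // unit)  # bar length is proportional to the value
        rows.append(f"{category} | {bar} {value}")
    return rows

chart = text_bar_chart(cars_sold)
print("\n".join(chart))

# Step 4 (interpretation): find the tallest and the shortest bar.
best = max(cars_sold, key=cars_sold.get)
worst = min(cars_sold, key=cars_sold.get)
print(f"Most cars: Dealer {best}; fewest cars: Dealer {worst}")
```

Because every bar uses the same scale, the longest row of `#` marks identifies Dealer B without reading the numbers.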

Plotting Data Using Pie Chart

1. Introduction
In statistics and data visualization, a pie chart is a very popular tool for representing
proportional data. It is essentially a circle divided into slices (or sectors), with each slice
showing the relative contribution of a category to the whole.

The idea of a pie chart is simple yet powerful:

 The whole circle = 100% of the data.


 Each slice = percentage (or fraction) of that category.
 Larger values produce larger slices, smaller values produce smaller slices.

Thus, pie charts provide an at-a-glance understanding of the relative size of parts compared to
the total.
2. Formula for Pie Chart
To convert data into a pie chart:

\[
\text{Angle of slice} = \frac{\text{Value of category}}{\text{Total of all categories}} \times 360^\circ
\]

Explanation of formula:

 Numerator (Value of category) → the actual value of the item.
 Denominator (Total) → sum of all values.
 Multiply by 360° because a full circle is 360°.
 This ensures that all slices together add up to 360° (a full circle).
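
The last point can be checked with a short derivation. The symbols are assumptions introduced here for illustration: \(v_i\) is the value of category \(i\), \(T = \sum_i v_i\) is the total, and \(\theta_i\) is the angle of slice \(i\).

```latex
\[
\sum_{i=1}^{n} \theta_i
  = \sum_{i=1}^{n} \frac{v_i}{T} \times 360^\circ
  = \frac{360^\circ}{T} \sum_{i=1}^{n} v_i
  = \frac{360^\circ}{T} \cdot T
  = 360^\circ
\]
```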

3. Key Features of Pie Charts


1. Proportional Representation – Each slice shows the share of the whole.
2. Angles Correspond to Values – Bigger data = bigger angle.
3. Labels and Percentages – Pie charts often show values as percentages.
4. Fixed Total – A pie chart always represents 100% of data.
5. Limited Categories – Works best for 3–7 categories.

4. The Pie Chart: A Tool for Proportions


A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion.
In a pie chart, the arc length of each slice is proportional to the quantity it represents, and the
entire circle represents the whole (100%).

Advantages of Pie Charts

 Simple and Visually Appealing: Pie charts are easy to read and understand at a glance.
Their circular shape naturally lends itself to representing parts of a whole, making the
concept of proportion highly intuitive.
 Effective for a Few Categories: When you have a small number of categories (typically
2 to 5), a pie chart is very effective. It clearly shows which category is the largest and
provides a quick comparison of their relative sizes.
 Ideal for Proportions: The primary strength of a pie chart is its ability to show the
proportional distribution of a set of data. It immediately answers the question, "How big
is this part compared to the whole?"

Disadvantages of Pie Charts

 Difficulty in Comparing Slices: It is extremely difficult for the human eye to accurately
compare the size of slices, especially if they have similar values. For example, telling the
difference between a 23% slice and a 25% slice is challenging, whereas a bar chart would
make this difference immediately obvious.
 Poor for Many Categories: When a pie chart has more than a handful of slices, it
becomes cluttered and hard to read. Small slices are often indistinguishable and may
require labels and leader lines, which further complicates the chart.
 Inability to Show Trends or Changes: A pie chart represents a snapshot of data at a
single point in time. It cannot be used to visualize trends over time, as it has no concept
of a time axis.
 Not Suitable for Continuous Data: Pie charts are designed for categorical or discrete
data that can be grouped into distinct categories. They are completely inappropriate for
continuous data like temperature or height.
 Misleading 3D Effects: Using 3D effects on a pie chart is a common mistake that can
distort the true proportions, as the slices closer to the front of the chart appear larger than
those in the back.

Applications of Pie Charts

Due to their simplicity, pie charts are widely used in situations where the goal is to show how a
total is divided among a few distinct categories.

 Business:
o Market Share: Visualizing the market share of different companies in an
industry.
o Budget Allocation: Showing how a company's budget is distributed across
departments (e.g., marketing, R&D, operations).
 Education:
o Student Demographics: Illustrating the percentage of students in different
majors or grade levels.
o Test Scores: Showing the percentage of students who scored in various ranges
(e.g., A, B, C, D, F).
 Polling and Surveys:
o Opinion Polls: Displaying the percentage of respondents who support different
candidates or positions.
o Survey Results: Summarizing responses to a multiple-choice question (e.g.,
"What is your favorite color?").

In summary, a pie chart is a good tool for single-variable analysis of a whole, but its use should
be limited to situations with a few categories where the primary goal is to show proportional
distribution. For more complex data or comparisons, other chart types, such as bar charts or line
charts, are more effective.

7. Numerical Example
A student scored marks in four subjects:

Subject Marks
Math 40
Science 30
English 20
History 10
Total 100

Step 1: Find total

40 + 30 + 20 + 10 = 100

Step 2: Convert into angles

 Math = \(\frac{40}{100} \times 360^\circ = 144^\circ\)
 Science = \(\frac{30}{100} \times 360^\circ = 108^\circ\)
 English = \(\frac{20}{100} \times 360^\circ = 72^\circ\)
 History = \(\frac{10}{100} \times 360^\circ = 36^\circ\)

Step 3: Interpret

 Math is the largest slice (40%).


 History is the smallest slice (10%).
 Combined Science + Math = 70% of marks.
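
The angle calculation in Steps 1 and 2 can be reproduced with a small Python sketch. The function name `slice_angles` is an illustrative choice, not something from the original notes.

```python
# Marks per subject, from the table above.
marks = {"Math": 40, "Science": 30, "English": 20, "History": 10}

def slice_angles(data):
    """Angle of each slice = value / total * 360 degrees."""
    total = sum(data.values())  # Step 1: find the total
    return {name: value * 360 / total for name, value in data.items()}  # Step 2

angles = slice_angles(marks)
total_marks = sum(marks.values())
for subject, angle in angles.items():
    share = marks[subject] * 100 / total_marks
    print(f"{subject}: {angle:.0f} degrees ({share:.0f}% of the whole)")

# The slices must close the circle.
print("Sum of angles:", sum(angles.values()))
```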
✅ Graph Output:

 A circle divided into 4 slices.


 Math (40%) is the largest slice.
 History (10%) is the smallest slice.
 Easy to compare subjects visually.

Conclusion
 A pie chart is best suited for showing proportions of a whole.
 It is easy to construct and interpret, making it one of the most widely used statistical
diagrams.
 However, it is not suitable for very large datasets or when precise comparison is required.
 In our example, the chart clearly shows that Math and Science dominate the student’s
performance.
📊 Understanding and Creating Histograms: An In-Depth Guide

A histogram is a graphical representation of the distribution of a continuous dataset. It helps us
visualize the underlying pattern of data by dividing it into bins and showing the frequency of
data points in each bin. Unlike a bar chart, the bars in a histogram are adjacent, as they represent
a continuous range of values.

1. The Core Components of a Histogram

 Bins: Think of bins as "data buckets." Each bin is a specific interval on the x-axis that holds a
range of data values. Bins normally have the same width; if the widths differ, bar height must show
frequency density (count divided by bin width) rather than the raw count. The number of bins is a
crucial choice that determines the level of detail shown in the plot. Too few bins can hide
important features, while too many can make the plot look noisy.

 Frequency: The height of each bar represents the number of data points (or count) that fall into
that bin's range. It's the frequency of occurrence for that specific interval.

 Distribution: By looking at the overall shape of the bars, we can understand the data's
distribution, or how the values are spread out.

2. Numerical Example: A Manual Step-by-Step Guide

Let's use a dataset of 30 daily high temperatures (in Fahrenheit) to demonstrate how to build a
histogram by hand.
[68,70,75,71,80,78,85,82,79,73,76,69,74,81,83,77,72,84,86,70,75,78,80,81,74,82,85,79,73,76]

Step 1: Find the Data Range

The first step is to identify the minimum and maximum values in your dataset.

 Minimum Value: 68

 Maximum Value: 86

 Range: 86−68=18
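
With the range known, the next step is to group the values into bins and count them. The sketch below is one possible way to do that; the bin width of 4 and the choice to start the first bin at the minimum are assumptions made for illustration, and the helper name `bin_counts` is hypothetical.

```python
temps = [68, 70, 75, 71, 80, 78, 85, 82, 79, 73, 76, 69, 74, 81, 83,
         77, 72, 84, 86, 70, 75, 78, 80, 81, 74, 82, 85, 79, 73, 76]

def bin_counts(values, width):
    """Count integer values in equal-width bins that start at the minimum value."""
    start = min(values)
    n_bins = (max(values) - start) // width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - start) // width] += 1  # index of the bin that holds v
    return start, counts

width = 4  # assumed bin width, not specified in the notes
start, counts = bin_counts(temps, width)
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low}-{low + width - 1}: {'#' * count} ({count})")
```

Changing `width` and rerunning shows how strongly the picture depends on the binning choice.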
Analyzing the Plots

The plot with 10 bins provides a smooth, general overview, clearly showing a bell-shaped
distribution. The plot with 50 bins is more detailed and shows more local peaks and valleys.
While it still resembles a bell shape, it highlights minor fluctuations in the data that the broader
plot hides.

4. Interpreting a Histogram: Understanding Distribution Shapes

The shape of a histogram is the most important feature to analyze. It tells us a lot about the data's
characteristics. Here are some common shapes:

 Normal (Bell-shaped): A very common shape, where the data is symmetric around a central
peak. Many natural phenomena approximately follow this pattern.

 Skewed Right (Positively Skewed): The tail of the distribution extends to the right. This means
most data points are on the left side, and there are a few very high values pulling the mean to the
right.

 Skewed Left (Negatively Skewed): The tail extends to the left. Most data points are on the
right, with a few very low values.
 Bimodal: A histogram with two distinct peaks. This often suggests that the dataset is composed
of two different groups or populations.

By plotting data in a histogram, you can visually identify these shapes and gain a deeper
understanding of your dataset.

Advantages of Histograms

 Easy to Understand: Histograms are intuitive and straightforward to interpret, even for a
non-technical audience. They provide a clear visual summary of a large dataset.
 Effective for Large Datasets: By grouping data into bins, histograms can effectively
summarize and display a large number of data points without becoming cluttered.
 Reveals Data Distribution: Histograms immediately show the shape of the data's
distribution (e.g., normal, skewed, bimodal), central tendency (where the data is
centered), and spread (the range of values).
 Identifies Outliers and Anomalies: Gaps or isolated bars far from the main body of the
histogram can quickly highlight unusual data points that may need further investigation.

Disadvantages of Histograms

 Subjectivity of Bin Width: The appearance and interpretation of a histogram can change
dramatically depending on the number of bins chosen. A different bin width can lead to a
completely different story being told by the data.
 Loss of Detail: Histograms group data into ranges, meaning you lose the exact value of
each individual data point. For example, a bar for the range 10-20 doesn't tell you if the
values within it were 10 or 19.
 Limited for Comparison: Histograms are best for analyzing a single variable.
Comparing multiple datasets or variables side-by-side using histograms can be difficult
and confusing.

Applications of Histograms

Histograms have wide-ranging applications across many fields for data analysis and quality
control.

 Business and Finance:


o Analyzing the distribution of sales figures to identify peak selling periods.
o Assessing customer demographics, such as age or income, to inform marketing
strategies.
 Healthcare:
o Analyzing patient data, such as age distribution or blood pressure readings, to
identify trends and risk factors.
o Optimizing hospital operations by visualizing patient wait times.
 Manufacturing and Quality Control:
o Checking if product dimensions (e.g., the width of a bolt) fall within acceptable
limits.
o Identifying defects in a production process by analyzing the frequency of errors.
 Education:
o Visualizing the distribution of test scores to see if they follow a normal curve, are
skewed, or if there are two groups of students (bimodal).

Stem-and-Leaf Plot (Leaf-and-Stem Chart)

Introduction
In data science and statistics, visual representation of data is a critical step in understanding its
structure, distribution, and key features. One such tool is the Stem-and-Leaf Plot (also called
Leaf-and-Stem Chart).

A stem-and-leaf plot is a tabular method of displaying quantitative data in a way that
preserves the original observations. Unlike many graphical representations (like bar graphs or
histograms), this method does not lose the raw data; instead, it reorganizes them for clarity.

 Popularized by John Tukey in his book Exploratory Data Analysis (1977).
 Helps identify patterns, clusters, gaps, spread, and outliers in data.
 Can be seen as a hybrid between a table and a graph.

Definition
A Stem-and-Leaf Plot is a method of representing numerical data where:

 Each number is split into two parts:


o Stem → Represents the leading digit(s).
o Leaf → Represents the last digit (or digits).

The stems are listed vertically, and the corresponding leaves are written horizontally next to
them.
Why Use a Stem-and-Leaf Plot?
1. Preserves raw data: Unlike histograms where exact values are lost, a stem-and-leaf plot
keeps every value intact.
2. Quick visualization: It shows the shape of distribution (symmetry, skewness).
3. Highlights outliers: Extreme values stand out clearly.
4. Useful for small data: Works best when dataset size is between 20 and 100 values.
5. Summarizes large sets into compact form: Instead of writing a long list, values are
neatly grouped.

Key Features
 Stems act as class intervals or groups.
 Leaves are the remaining part of the number that differentiates within the group.
 Can be back-to-back (two datasets compared side by side).
 Helps calculate median, mode, quartiles, and spread.
 Shows frequency (how many values fall under a stem).

How to Construct a Stem-and-Leaf Plot?


Step 1 – Arrange Data
Write data in ascending order for convenience.

Step 2 – Decide Stem and Leaf Units

 Example: For numbers like 132, 145 → Stem = "13", "14" (hundreds and tens digits), Leaf = ones digit.
 For decimals → Stem can be the integer part, Leaf the decimal part.

Step 3 – List Stems


Write stems in increasing order (without skipping).

Step 4 – Add Leaves


Write each corresponding leaf next to its stem.

Step 5 – Organize Neatly

 Order leaves in ascending order.


 Make sure spacing is consistent.
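
The five steps can be sketched in Python as follows. This is a minimal illustration for non-negative whole numbers where the leaf is the last digit; the function name `stem_and_leaf` is an assumption, and the sample data is the dataset used in Example 1.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):          # Step 1: arrange data (also orders the leaves, Step 5)
        stem, leaf = divmod(v, 10)    # Step 2: split each number into stem and leaf
        groups[stem].append(leaf)     # Step 4: add the leaf next to its stem
    # Step 3: list every stem in increasing order, keeping empty stems too.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

data = [12, 15, 17, 18, 21, 22, 25, 27, 32, 34, 35, 38]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Keeping empty stems matters: a stem with no leaves shows a gap in the data that would otherwise be hidden.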
Example 1 (Simple Numbers)
Dataset:

12, 15, 17, 18, 21, 22, 25, 27, 32, 34, 35, 38

Construction:

 Stems = 1, 2, 3 (tens place).


 Leaves = ones digit.

Stem | Leaves
----------------
1 | 2 5 7 8
2 | 1 2 5 7
3 | 2 4 5 8

Interpretation:

 Data ranges from 12 to 38.
 Values are spread evenly: four each in the 10s, 20s, and 30s.
 Median lies in the 20s range (between 22 and 25, so 23.5).

Example 2 (Exam Scores)
Scores:

45, 47, 49, 52, 56, 58, 62, 65, 67, 68, 71, 72, 75, 78

Step 1 – Stems (Tens Place): 4, 5, 6, 7


Step 2 – Leaves (Ones Place):

Stem | Leaves
----------------
4 | 5 7 9
5 | 2 6 8
6 | 2 5 7 8
7 | 1 2 5 8

Interpretation:

 Distribution is fairly even across the 40s to 70s.
 The 60s and 70s each contain four scores, slightly more than the 40s and 50s (three each).
 There is no mode: every score appears exactly once.

Example 3 (Larger Data)


Dataset:

132, 136, 142, 145, 147, 151, 153, 158, 161, 165, 168, 171

Stem | Leaves
----------------
13 | 2 6
14 | 2 5 7
15 | 1 3 8
16 | 1 5 8
17 | 1

Observation:

 Data ranges from 132 to 171.
 Values spread fairly evenly across stems.
 The 140s, 150s, and 160s each contain three values; the 130s (two) and 170s (one) contain fewer.

Back-to-Back Stem-and-Leaf Plot


Useful for comparing two datasets (e.g., marks of boys vs girls).

📌 Example:

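Boys | Stem | Girls
----------------------
7 5 2 | 5 | 1 4 6
9 8 6 | 6 | 2 5 8
7 5 | 7 | 1 3

The boys' leaves are read outward from the stem (right to left), so the first row holds 52, 55, 57 for boys and 51, 54, 56 for girls.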

This way, two distributions can be compared side by side.
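
A back-to-back plot can be built with a short Python sketch. The two lists are the values read off the example above (stem = tens digit, leaf = ones digit); the function name `back_to_back` is a hypothetical choice for illustration.

```python
# Values read off the example plot (assumes every leaf is a single digit).
boys = [52, 55, 57, 66, 68, 69, 75, 77]
girls = [51, 54, 56, 62, 65, 68, 71, 73]

def back_to_back(left, right):
    """Build rows of the form 'left leaves | stem | right leaves'."""
    parts = []
    for stem in sorted({v // 10 for v in left + right}):
        # Left leaves are written in descending order so they read outward from the stem.
        left_leaves = sorted((v % 10 for v in left if v // 10 == stem), reverse=True)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        parts.append((" ".join(map(str, left_leaves)), stem,
                      " ".join(map(str, right_leaves))))
    width = max(len(left_text) for left_text, _, _ in parts)  # right-align the left side
    return [f"{l:>{width}} | {s} | {r}" for l, s, r in parts]

rows = back_to_back(boys, girls)
print("\n".join(rows))
```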

Advantages
 Retains original data.
 Easy to construct.
 Highlights clusters, gaps, and outliers.
 Good for exploratory analysis.

Limitations
 Not suitable for very large datasets (>200 values).
 Choice of stem/leaf division may mislead if not chosen carefully.
 Modern tools (histograms, boxplots, density plots) are more common in big data
analytics.

Applications in Data Science


 Exploratory Data Analysis (EDA): First step before applying machine learning models.
 Outlier Detection: Helps find unusual values quickly.
 Teaching Tool: Explains frequency distributions to beginners.
 Data Cleaning: Easy to spot data entry mistakes.
 Comparisons: With back-to-back plots, two groups can be compared effectively.

Numerical Problem (Detailed)


Problem: Construct a stem-and-leaf plot for daily sales (₹ in hundreds):

212, 218, 223, 225, 226, 232, 239, 241, 245, 246, 251, 259

Solution:

 Stem = first two digits (21, 22, 23, 24, 25).


 Leaf = last digit.

Stem Leaves
21 2 8
22 3 5 6
23 2 9
24 1 5 6
25 1 9

Interpretation:
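 Sales are spread fairly evenly from the 210s to the 250s, with the 220s and 240s slightly fuller (three values each).
 Highest sales = 259. It continues the pattern of the other values, so it is a maximum rather than an outlier.
 Median lies between 232 and 239: (232 + 239) / 2 = 235.5.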
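
The median and the outlier question can be checked numerically. The sketch below uses the 1.5 × IQR fence rule, which is a common convention but an assumption here, since the notes do not say how an outlier is defined.

```python
from statistics import median, quantiles

sales = [212, 218, 223, 225, 226, 232, 239, 241, 245, 246, 251, 259]

med = median(sales)                  # mean of the 6th and 7th ordered values
q1, _, q3 = quantiles(sales, n=4)    # quartiles (Python's default "exclusive" method)
iqr = q3 - q1

# Assumed convention: a value is an outlier if it lies beyond 1.5 * IQR from the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in sales if v < low_fence or v > high_fence]

print(f"Median = {med}")
print(f"Fences = ({low_fence}, {high_fence}); outliers = {outliers}")
```

Under this rule no value in the sales data falls outside the fences, including the maximum of 259.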

Comparison with Histogram & Boxplot


Feature | Stem-and-Leaf | Histogram | Boxplot
----------------------------------------------------
Raw Data Retained | ✅ Yes | ❌ No | ❌ No
Shape of Distribution | ✅ Visible | ✅ Visible | ✅ Visible
Outlier Detection | ✅ Easy | Moderate | ✅ Very Easy
Suitable for Big Data | ❌ No | ✅ Yes | ✅ Yes

📘 Stem-and-Leaf Plots – Visualizations

🔹 Example 1 – Small Dataset (12–38)


Data:

12, 15, 17, 18, 21, 22, 25, 27, 32, 34, 35, 38

Stem-and-Leaf Plot:

Stem | Leaves
----------------
1 | 2 5 7 8
2 | 1 2 5 7
3 | 2 4 5 8

🔹 Example 2 – Exam Scores (45–78)


Data:

45, 47, 49, 52, 56, 58, 62, 65, 67, 68, 71, 72, 75, 78

Stem-and-Leaf Plot:

Stem | Leaves
----------------
4 | 5 7 9
5 | 2 6 8
6 | 2 5 7 8
7 | 1 2 5 8

🔹 Example 3 – Larger Numbers (132–171)


Data:

132, 136, 142, 145, 147, 151, 153, 158, 161, 165, 168, 171

Stem-and-Leaf Plot:
(Here stem = first two digits, leaf = last digit)

Stem | Leaves
----------------
13 | 2 6
14 | 2 5 7
15 | 1 3 8
16 | 1 5 8
17 | 1

🔹 Example 4 – Sales Data (212–259)


Data:

212, 218, 223, 225, 226, 232, 239, 241, 245, 246, 251, 259

Stem-and-Leaf Plot:
(Here stem = first two digits, leaf = last digit)

Stem | Leaves
----------------
21 | 2 8
22 | 3 5 6
23 | 2 9
24 | 1 5 6
25 | 1 9

Summary
 A stem-and-leaf plot is a compact, data-preserving visualization technique.
 Most useful for small datasets.
 Helps in identifying trends, spread, clusters, and outliers.
 Still used in teaching and EDA, though less common in modern big data due to its
manual construction limitations.

Plotting Data Using Dot Plot

Dot Plot

🔹 1. Introduction
A Dot Plot is a statistical visualization technique where each data point is represented by a dot
placed along a number line (axis). William Cleveland (1984) promoted dot plots as a way to
improve clarity and precision in displaying data compared to bar charts or histograms.

Unlike other charts that aggregate or summarize data, a dot plot preserves the raw dataset,
making it ideal for exploratory data analysis (EDA) in data science.

🔹 2. Purpose of a Dot Plot


 To display the frequency distribution of a dataset.
 To show clusters, modes, gaps, and outliers in data.
 To compare two or more groups of data visually.
 To act as a bridge between raw data and summary statistics (mean, median, mode,
range, variance).

🔹 3. Types of Dot Plots


1. Simple Dot Plot (One-Dimensional)
o Displays a single dataset on one axis.
o Example: Marks of students out of 10.
2. Stacked Dot Plot
o If a value repeats, dots are stacked vertically above that point.
o Helps visualize frequencies without using bars.
3. Grouped Dot Plot
o Used for comparing two or more categories/groups on the same axis.
o Example: Study hours of Group A vs Group B students.
4. Cleveland Dot Plot
o Developed for large datasets or categorical comparisons.
o Dots are aligned along an axis but categories are listed vertically, reducing clutter.
o Used widely in Data Science dashboards.

🔹 4. Construction of a Dot Plot


1. Collect Data: Choose dataset (numerical/discrete).
2. Determine Range: Minimum and maximum values.
3. Draw Axis: Mark equal intervals on the number line.
4. Place Dots: For each data value, plot a dot above the corresponding axis value.
o If repeated, stack vertically.
5. Add Labels: Axis labels, title, and if needed, group legends.
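
The construction steps can be sketched as a small text-based dot plot in Python. This is an illustration for whole-number data only; the function name `dot_plot` is an assumption, and the sample data is the marks dataset used in Numerical Example 1.

```python
from collections import Counter

def dot_plot(values, dot="●"):
    """Return one text row per axis value, with one dot per occurrence."""
    counts = Counter(values)
    low, high = min(values), max(values)        # Step 2: determine the range
    rows = []
    for x in range(low, high + 1):              # Step 3: equal intervals on the axis
        rows.append(f"{x}: {dot * counts[x]}")  # Step 4: repeated values stack up
    return rows

marks = [4, 5, 7, 6, 5, 3, 5, 4, 6, 7]          # Step 1: collect data
plot = dot_plot(marks)
print("\n".join(plot))
```

The rows run sideways rather than upward, so the longest row corresponds to the tallest stack in a drawn dot plot.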

🔹 5. Interpretation of Dot Plot


A dot plot is rich in statistical insights:

 Center (Central Tendency):


o Approximate where most dots cluster → mean or median.
 Spread (Dispersion):
o The range from lowest to highest value.
o Wider spread = more variation.
 Shape of Distribution:
o Symmetric, skewed, or multimodal patterns become visible.
 Mode(s):
o The value(s) with the tallest stack of dots.
 Outliers:
o Dots that are separated far from the cluster.
🔹 6. Advantages
 Directly shows raw data values.
 Easy to understand for non-technical users.
 Better at showing exact frequency counts compared to bar graphs.
 Allows quick comparisons between datasets.
 Useful for small to medium-sized datasets (n ≤ 50).

🔹 7. Limitations
 Not scalable for large datasets (dots overlap, plot becomes messy).
 Only suitable for discrete or rounded numerical data (not continuous).
 Cannot show higher-level summary statistics like variance, quartiles, or percentiles (box
plots do this better).
 Not as common in industry reporting compared to histograms, boxplots, or violin plots.

🔹 8. Applications in Data Science


 Exploratory Data Analysis (EDA): Understanding data distribution before applying ML
models.
 Outlier Detection: Spotting unusual values in a dataset.
 Comparative Analysis: Comparing groups (male vs female, before vs after experiment,
etc.).
 Teaching & Reporting: Simplest way to introduce frequency distribution in statistics.
 Data Quality Checks: Detects duplicate or missing values visually.

🔹 9. Numerical Example 1 – Simple Dataset


Dataset (Marks out of 10):
4, 5, 7, 6, 5, 3, 5, 4, 6, 7

Step 1 – Range: 3 to 7
Step 2 – Number Line: 3, 4, 5, 6, 7
Step 3 – Plot Dots:

 3→●
 4 → ●●
 5 → ●●●
 6 → ●●
 7 → ●●

Dot Plot Representation:

3: ●
4: ●●
5: ●●●
6: ●●
7: ●●

Interpretation:

 Mode = 5 (most frequent).
 Median = 5 (the 5th and 6th ordered values are both 5).
 Spread = 3 to 7
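
These summary values can be verified with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import median, multimode

marks = [4, 5, 7, 6, 5, 3, 5, 4, 6, 7]

ordered = sorted(marks)                # 3, 4, 4, 5, 5, 5, 6, 6, 7, 7
modes = multimode(marks)               # value(s) with the tallest stack of dots
mid = median(marks)                    # mean of the 5th and 6th ordered values
spread = (min(marks), max(marks))      # lowest and highest values

print("Ordered:", ordered)
print("Mode(s):", modes, "| Median:", mid, "| Spread:", spread)
```

With ten values, the median is the average of the two middle values in the ordered list, which is why sorting first makes the answer easy to check by eye.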

🔹 10. Numerical Example 2 – Group Comparison


Dataset: Weekly Study Hours

 Group A: 4, 5, 6, 6, 7, 5, 4
 Group B: 6, 7, 7, 8, 9, 6, 7

👉 Dot Plot Comparison:

 Group A clusters around 4–6 hours.


 Group B clusters around 6–8 hours.
 Group B clearly studies more hours on average.
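
The comparison can be made concrete by drawing both groups on the same axis and computing their means. A minimal sketch follows; the helper name `dot_rows` is hypothetical.

```python
from collections import Counter
from statistics import fmean

group_a = [4, 5, 6, 6, 7, 5, 4]
group_b = [6, 7, 7, 8, 9, 6, 7]

def dot_rows(values, low, high):
    """One row per axis value between low and high, with one dot per occurrence."""
    counts = Counter(values)
    return [f"{x}: {'●' * counts[x]}" for x in range(low, high + 1)]

# A shared axis range keeps the two plots directly comparable.
low = min(group_a + group_b)
high = max(group_a + group_b)

mean_a, mean_b = fmean(group_a), fmean(group_b)
for name, group, mean in (("Group A", group_a, mean_a), ("Group B", group_b, mean_b)):
    print(f"{name} (mean = {mean:.2f} hours)")
    print("\n".join(dot_rows(group, low, high)))
```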

🔹 11. Numerical Example 3 – Larger Dataset


Dataset: Daily Customer Arrivals at a Shop (10 days)
15, 17, 16, 15, 18, 20, 17, 16, 15, 19

 15 → ●●●
 16 → ●●
 17 → ●●
 18 → ●
 19 → ●
 20 → ●

👉 Interpretation:
 Mode = 15 (most common daily arrival).
 Range = 15–20 (spread = 5).
 Distribution is slightly skewed right.
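
A quick numerical check of the skew claim is to compare the mean with the median. A mean above the median is a rule of thumb for right skew, not a proof, so the sketch below should be read as a hint rather than a formal test.

```python
from statistics import fmean, median, multimode

arrivals = [15, 17, 16, 15, 18, 20, 17, 16, 15, 19]

mean_value = fmean(arrivals)
median_value = median(arrivals)
modes = multimode(arrivals)
spread = max(arrivals) - min(arrivals)

# Heuristic: a mean pulled above the median suggests a longer right tail.
right_skew_hint = mean_value > median_value

print(f"Mean = {mean_value}, median = {median_value}, mode(s) = {modes}, spread = {spread}")
print("Right-skew hint:", right_skew_hint)
```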

Scatter Plot

Introduction
A Scatter Plot is a statistical and data visualization tool used to display the relationship
between two quantitative (numerical) variables.

 Each observation in a dataset is represented as a point (x, y) in the Cartesian coordinate
system.
 By examining the distribution of points, we can determine whether a relationship
(correlation) exists between the two variables.
 Scatter plots are considered one of the foundational tools in exploratory data analysis
(EDA) and machine learning.

Why Are Scatter Plots Important?


 They allow us to see raw data relationships before applying mathematical/statistical
models.
 They reveal hidden patterns, clusters, and anomalies in the dataset.
 They provide insights into whether a linear regression model is suitable.
 Scatter plots form the basis for correlation analysis (Pearson/Spearman coefficients).

In Data Science, scatter plots are often the first step in feature analysis before feeding data into
machine learning algorithms.

Elements of a Scatter Plot


1. X-axis (Independent Variable / Predictor):
Represents the input or explanatory variable.
2. Y-axis (Dependent Variable / Response):
Represents the outcome or predicted variable.
3. Data Points (x, y):
Each dot represents a unique observation in the dataset.
4. Trend Line (Optional):
A fitted line (e.g., regression line) that shows the overall direction of the relationship.
Types of Relationships Seen in Scatter Plots
1. Positive Correlation
o As X increases, Y also increases.
o Example: Study hours vs exam marks.
2. Negative Correlation
o As X increases, Y decreases.
o Example: Speed vs travel time.
3. No Correlation
o X and Y do not show any clear relationship.
o Example: Shoe size vs exam marks.
4. Non-linear Relationships
o Points follow a curved pattern (quadratic, exponential, logistic).
o Example: Population growth vs time.

Strength of Correlation in Scatter Plots


Scatter plots visually show how strong or weak the relationship is:

 Strong Positive: Points form a tight upward line.


 Weak Positive: Upward trend exists, but points are scattered.
 Strong Negative: Points form a tight downward line.
 Weak Negative: Downward trend exists, but points are scattered.
 No Correlation: No visible trend.

Mathematically measured using Pearson correlation coefficient (r):

 r ≈ +1 → Strong positive
 r ≈ -1 → Strong negative
 r ≈ 0 → No correlation
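
For reference, the standard formula for the Pearson coefficient of n paired observations is given below. The notation is an assumption introduced here: \(\bar{x}\) and \(\bar{y}\) denote the means of the x and y values.

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

Note that r measures only linear association, so a strong curved relationship can still give a value well short of ±1.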

Steps to Construct a Scatter Plot


1. Collect a dataset of paired values (x, y).
2. Draw a coordinate system (X-axis for independent, Y-axis for dependent).
3. Mark scale on both axes covering the data range.
4. Plot each observation as a dot at coordinates (x, y).
5. Observe the pattern: cluster, spread, outliers, and correlation.
6. (Optional) Add a regression/trend line for better analysis.
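
For the optional step 6, a trend line is commonly obtained by ordinary least squares. The sketch below assumes that method (the notes do not name one) and uses the hours-studied data from Numerical Example 1; the function name `least_squares_line` is illustrative.

```python
hours = [1, 2, 3, 4, 5]          # x values (hours studied)
marks = [35, 50, 65, 70, 85]     # y values (marks obtained)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(hours, marks)
print(f"Trend line: marks = {slope} * hours + {intercept}")
```

For this data the fitted line gains 12 marks per extra hour studied, starting from an intercept of 25.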
Advantages
 Shows raw data values directly.
 Detects correlation and strength of relationship.
 Identifies outliers, clusters, and non-linear patterns.
 Easy to interpret and explain visually.
 Foundation for regression and predictive analytics.

Limitations
 Can only show relationships between two variables (unless extended to 3D).
 Large datasets cause overlapping points (overplotting).
 Cannot establish causation (only correlation).
 Not useful for categorical variables.

Applications in Data Science


 Exploratory Data Analysis (EDA): Detecting variable relationships.
 Feature Selection: Checking which features correlate with target.
 Regression Analysis: Scatter plots are the base of linear regression models.
 Clustering: Visualizing natural groupings in data.
 Anomaly Detection: Spotting outliers that deviate from the trend.
 Machine Learning Preprocessing: Ensuring data relationships before applying models.

Numerical Example 1 – Positive Correlation


Dataset: Hours Studied vs Marks Obtained

Hours Studied (X) Marks (Y)


1 35
2 50
3 65
4 70
5 85

👉 Interpretation: More hours studied → higher marks.

 Strong positive correlation (r ≈ +0.99).

Numerical Example 2 – Negative Correlation


Dataset: Speed of Car vs Travel Time (Fixed Distance)

Speed (km/h) (X) Travel Time (min) (Y)


20 60
40 30
60 20
80 15
100 12

👉 Interpretation: Higher speed → less time.

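 Strong negative correlation (r ≈ -0.90). The points follow an exact curve (speed × time = 1200 in every row), so the linear coefficient r understates how tightly they fit the pattern.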

Numerical Example 3 – No Correlation


Dataset: Shoe Size vs Exam Score
Shoe Size (X) Exam Score (Y)
7 45
8 70
9 50
10 80
11 60

👉 Interpretation: No visible relationship.

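 No reliable correlation: r ≈ 0.44 for these five points, but with so few observations and no consistent trend this value is not meaningful evidence of a relationship.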
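
The coefficients for the three examples can be recomputed directly from the tables. The sketch below implements the standard Pearson formula by hand; the function name `pearson_r` and the dictionary labels are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data copied from the three numerical examples above.
datasets = {
    "Hours vs marks": ([1, 2, 3, 4, 5], [35, 50, 65, 70, 85]),
    "Speed vs time": ([20, 40, 60, 80, 100], [60, 30, 20, 15, 12]),
    "Shoe size vs score": ([7, 8, 9, 10, 11], [45, 70, 50, 80, 60]),
}

results = {name: round(pearson_r(xs, ys), 2) for name, (xs, ys) in datasets.items()}
for name, r in results.items():
    print(f"{name}: r = {r}")
```

With only five points per example, these coefficients are very sensitive to any single observation, so they illustrate the calculation rather than establish a relationship.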

Diagrams – Scatter Plots for All Cases

Here are the three scatter plots side by side:

1. Positive Correlation (Hours Studied vs Marks) → upward trend.


2. Negative Correlation (Speed vs Travel Time) → downward trend.
3. No Correlation (Shoe Size vs Exam Score) → scattered randomly.

Time Series Graph

Introduction
A Time Series Graph is a graphical representation of data points collected in chronological
order. It shows how a variable changes over time.

 The X-axis always represents time (hours, days, months, years).


 The Y-axis represents the measured variable (sales, stock prices, temperature, etc.).
 The points are plotted and connected with a line, making trends and patterns easier to see.

In Data Science, time series graphs are fundamental for:

 Exploratory data analysis (EDA)


 Forecasting future values
 Detecting seasonal patterns and anomalies

Components of Time Series


1. Trend (T): Long-term upward or downward movement.
2. Seasonality (S): Regular repeating patterns (e.g., sales in festive seasons).
3. Cyclic (C): Long-term cycles, often in economics/business.
4. Irregular/Random (I): Sudden unpredictable variations.

👉 Time series = T × S × C × I (multiplicative model) OR T + S + C + I (additive model).

Steps to Construct a Time Series Graph


1. Collect data in time order.
2. Draw X-axis = time intervals, Y-axis = observed values.
3. Plot each observation as a point.
4. Connect the points with straight lines.
5. Add labels, title, and legends (if comparing multiple series).

Advantages
 Shows trends, fluctuations, and cycles clearly.
 Useful for prediction and forecasting.
 Easy to interpret visually.

Limitations
 Only shows one variable against time.
 Large variations may hide subtle patterns.
 Requires enough data points to show trends.

Numerical Examples with Diagrams

🟦 Example 1 – Monthly Sales

Data (in ₹000):

Month Sales
Jan 20
Feb 25
Mar 30
Apr 28
May 35
Jun 40

👉 Interpretation: Upward trend with a slight dip in April.
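The construction steps can be sketched for this sales table as follows. This is a minimal sketch, assuming matplotlib is available for the drawing part; the helper names `month_over_month` and `plot_time_series` are illustrative. The month-to-month changes are computed with plain Python so that the April dip can be confirmed without drawing anything.

```python
# Monthly sales from Example 1 (in thousands of rupees).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [20, 25, 30, 28, 35, 40]

def month_over_month(values):
    """Return the change from each period to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def plot_time_series(labels, values, title, xlabel, ylabel):
    """Draw the series as points joined by straight lines (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt  # imported lazily so the helper above runs without it

    fig, ax = plt.subplots()
    ax.plot(labels, values, marker="o")  # plot each observation and connect the points
    ax.set_xlabel(xlabel)                # X-axis: time intervals
    ax.set_ylabel(ylabel)                # Y-axis: observed values
    ax.set_title(title)
    return fig

changes = month_over_month(sales)
dips = [months[i + 1] for i, change in enumerate(changes) if change < 0]
print(changes)  # [5, 5, -2, 7, 5]
print(dips)     # ['Apr']
```

Calling `plot_time_series(months, sales, "Monthly Sales", "Month", "Sales (₹000)")` would return the figure for display or saving.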

🟥 Example 2 – Daily Temperature

Data (°C):

Day Temp
1 30
2 32
3 31
4 29
5 28
6 30
7 33

👉 Interpretation: Shows short-term fluctuations, lowest on Day 5, peak at Day 7.
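The lowest and highest days can be read off programmatically, and a short moving average is one common way to smooth out day-to-day fluctuations. The moving average is an added illustration rather than part of the original notes, and the 3-day window is an arbitrary assumption.

```python
# Daily temperatures from Example 2 (degrees Celsius).
days = [1, 2, 3, 4, 5, 6, 7]
temps = [30, 32, 31, 29, 28, 30, 33]

coolest_day = days[temps.index(min(temps))]
warmest_day = days[temps.index(max(temps))]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

smoothed = moving_average(temps, window=3)

print(coolest_day, warmest_day)          # 5 7
print([round(v, 1) for v in smoothed])   # [31.0, 30.7, 29.3, 29.0, 30.3]
```

The smoothed values fall and then rise again, matching the dip around Day 5 and the recovery towards Day 7.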

🟩 Example 3 – Stock Price


Data (₹):

Day Price
1 100
2 102
3 105
4 103
5 108
6 107
7 110
8 115

👉 Interpretation: Overall upward trend (₹100 to ₹115); the small dips on Day 4 and Day 6 are irregular fluctuations.
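A quick check of the day-to-day changes shows where the price fell and how much it rose overall. This is a minimal sketch using only the table above; the variable names are illustrative.

```python
# Stock prices from Example 3 (rupees).
days = [1, 2, 3, 4, 5, 6, 7, 8]
prices = [100, 102, 105, 103, 108, 107, 110, 115]

# A "dip" is any day whose price is lower than the previous day's.
dip_days = [day for day, previous, current in zip(days[1:], prices, prices[1:])
            if current < previous]

net_change = prices[-1] - prices[0]
percent_change = 100 * net_change / prices[0]

print(dip_days)        # [4, 6]
print(net_change)      # 15
print(percent_change)  # 15.0
```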

Applications in Data Science


 Forecasting: Sales, stock prices, weather.
 Anomaly Detection: Detecting unusual behavior (fraud, sensor errors).
 Economics & Finance: Business cycle analysis.
 Healthcare: Patient monitoring over time.
 IoT & AI Models: Feeding sequential data into ML models (ARIMA, LSTM, Prophet).
Frequency Distribution Graph

Introduction
In statistics and data science, frequency distribution is a way to organize raw data into
meaningful groups, showing how many times (frequency) each value or range of values occurs.

A Frequency Distribution Graph is the graphical representation of this tabular data. It helps
us quickly see:

 The shape of data distribution (bell-shaped, skewed, uniform).


 The spread (range, concentration of values).
 Outliers or unusual data points.
 Trends or patterns in the dataset.

Frequency graphs are essential for descriptive statistics and form the foundation for
probability distributions, hypothesis testing, and machine learning models.
Types of Frequency Distribution Graphs
There are three major types:

(a) Histogram

 Graph of class intervals vs frequencies.


 Represented by adjacent bars (no gaps between bars, unlike bar charts).
 Each bar’s width = class interval, height = frequency.
 Shows continuous data (heights, weights, marks, prices).

(b) Frequency Polygon

 A line graph connecting the midpoints of histogram bars.


 Useful for comparing two or more distributions.
 Gives a clearer picture of distribution shape than a histogram alone.

(c) Ogive (Cumulative Frequency Graph)

 A line graph of cumulative frequencies plotted against class boundaries (upper boundaries for a less-than ogive, lower boundaries for a greater-than ogive).
 Two types:
1. Less than Ogive → cumulative frequency “less than” a value.
2. Greater than Ogive → cumulative frequency “greater than” a value.
 The x-coordinate of the point where the two ogives intersect is the median.

Steps to Construct a Frequency Distribution Graph


1. Collect raw data.
2. Decide class intervals (equal width, using a round number suited to the data range, e.g., 10 for marks or 50 for wages).
3. Count frequencies for each class.
4. Prepare a frequency distribution table.
5. Choose the graph type (Histogram, Polygon, Ogive).
6. Plot X-axis = Class intervals (or midpoints), Y-axis = Frequency.
7. Draw bars (Histogram), connect midpoints (Polygon), or plot cumulative frequencies (Ogive).

Advantages
 Condenses large datasets into visual form.
 Helps detect patterns, skewness, symmetry.
 Easy to compare multiple datasets.
 Foundation for probability theory and EDA (exploratory data analysis).

Limitations
 Grouping loses exact values.
 Results may differ if class intervals are chosen poorly.
 Not suitable for very small datasets.

Numerical Example 1 – Marks of Students


Dataset (marks out of 50 for 20 students):
12, 18, 25, 30, 22, 35, 40, 28, 15, 32,
38, 45, 20, 10, 27, 33, 42, 36, 29, 24
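The class counts for this dataset can be tallied programmatically. The sketch below assumes integer data and equal-width classes starting at 10; the function name `frequency_table` and its parameters are illustrative choices.

```python
# Marks out of 50 for 20 students.
marks = [12, 18, 25, 30, 22, 35, 40, 28, 15, 32,
         38, 45, 20, 10, 27, 33, 42, 36, 29, 24]

def frequency_table(data, start, width, n_classes):
    """Count values per class, where class k covers start + k*width up to start + (k+1)*width - 1."""
    counts = [0] * n_classes
    for value in data:
        index = (value - start) // width
        if 0 <= index < n_classes:      # values outside the classes are ignored
            counts[index] += 1
    labels = [f"{start + k * width}-{start + (k + 1) * width - 1}"
              for k in range(n_classes)]
    return list(zip(labels, counts))

table = frequency_table(marks, start=10, width=10, n_classes=4)
for label, count in table:
    print(label, count)
# 10-19 4
# 20-29 7
# 30-39 6
# 40-49 3
```

The counts sum to 20, which is a useful check that no observation was missed or counted twice.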

Step 1: Frequency Distribution Table

Class Interval Frequency
10–19 4
20–29 7
30–39 6
40–49 3

Step 2: Histogram

 X-axis: Marks intervals
 Y-axis: Frequency
 Bars touching each other

Step 3: Frequency Polygon

 Midpoints: 14.5, 24.5, 34.5, 44.5
 Plot each frequency at its class midpoint and join the points with straight lines

Step 4: Ogive (Cumulative Frequency Table)

Class Interval Frequency Cumulative Freq
10–19 4 4
20–29 7 11
30–39 6 17
40–49 3 20

👉 Plot cumulative frequencies against upper boundaries (19.5, 29.5, 39.5, 49.5).
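The points for both ogives, and the median they locate, can be computed from the raw marks. This is a sketch under the assumption of equal-width classes with boundaries at 9.5, 19.5, and so on; `grouped_median` is an illustrative helper that reads the less-than ogive at half the total frequency by linear interpolation.

```python
from itertools import accumulate

marks = [12, 18, 25, 30, 22, 35, 40, 28, 15, 32,
         38, 45, 20, 10, 27, 33, 42, 36, 29, 24]

lower_boundaries = [9.5, 19.5, 29.5, 39.5]
width = 10
upper_boundaries = [lo + width for lo in lower_boundaries]

counts = [sum(1 for m in marks if lo <= m < lo + width) for lo in lower_boundaries]
total = sum(counts)

# Less-than ogive: running total, plotted against the upper boundaries.
less_than = list(accumulate(counts))
# Greater-than ogive: what remains at or above each lower boundary.
greater_than = [total - before for before in [0] + less_than[:-1]]

def grouped_median(lower_boundaries, counts, width):
    """Interpolate inside the class where the cumulative frequency reaches N/2 (assumes non-empty data)."""
    half = sum(counts) / 2
    cumulative_before = 0
    for lower, frequency in zip(lower_boundaries, counts):
        if cumulative_before + frequency >= half:
            return lower + (half - cumulative_before) / frequency * width
        cumulative_before += frequency

median = grouped_median(lower_boundaries, counts, width)

print(less_than)         # [4, 11, 17, 20]
print(greater_than)      # [20, 16, 9, 3]
print(round(median, 2))  # 28.07
```

The grouped estimate of about 28.07 is close to the median of the raw marks, 28.5; the small gap is the price of grouping.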

Numerical Example 2 – Daily Wages of Workers

Dataset (₹):

120, 150, 200, 220, 180, 170, 160, 250, 230, 210,
140, 190, 175, 185, 240, 260, 280, 300, 270, 220

Step 1: Class Intervals (width = 50)

Wage Range (₹) Frequency
100–149 2
150–199 7
200–249 6
250–299 4
300–349 1

Step 2: Graphs

 Histogram: Bars show concentration in the 150–199 range.
 Polygon: Peak at midpoint 174.5.
 Ogive: Steadily rising curve showing cumulative distribution.
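The numbers behind all three graphs can be derived from the raw wages in a few lines. The sketch below is illustrative: the counting uses only the standard library, and the optional `draw_graphs` helper (a hypothetical name) assumes matplotlib is installed.

```python
from itertools import accumulate

wages = [120, 150, 200, 220, 180, 170, 160, 250, 230, 210,
         140, 190, 175, 185, 240, 260, 280, 300, 270, 220]

lower_limits = [100, 150, 200, 250, 300]
width = 50

counts = [sum(1 for w in wages if lo <= w < lo + width) for lo in lower_limits]
midpoints = [lo + (width - 1) / 2 for lo in lower_limits]   # e.g. (100 + 149) / 2
cumulative = list(accumulate(counts))
modal_midpoint = midpoints[counts.index(max(counts))]

def draw_graphs():
    """Histogram with polygon overlay, plus a less-than ogive (assumes matplotlib)."""
    import matplotlib.pyplot as plt  # lazy import so the counting above runs without it

    fig, (ax_hist, ax_ogive) = plt.subplots(1, 2, figsize=(10, 4))
    # Bars as wide as the class interval, so adjacent bars touch.
    ax_hist.bar(midpoints, counts, width=width, edgecolor="black")
    ax_hist.plot(midpoints, counts, marker="o", color="red")   # frequency polygon
    ax_hist.set_xlabel("Daily wage (₹)")
    ax_hist.set_ylabel("Frequency")
    upper_boundaries = [lo + width - 0.5 for lo in lower_limits]
    ax_ogive.plot(upper_boundaries, cumulative, marker="o")
    ax_ogive.set_xlabel("Upper class boundary (₹)")
    ax_ogive.set_ylabel("Cumulative frequency")
    return fig

print(counts)          # [2, 7, 6, 4, 1]
print(cumulative)      # [2, 9, 15, 19, 20]
print(modal_midpoint)  # 174.5
```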
Applications in Data Science
 EDA (Exploratory Data Analysis): Understanding distribution of variables.
 Machine Learning: Checking normality assumption for models.
 Finance: Income/salary distribution.
 Healthcare: Patient age/weight distribution.
 IoT: Frequency of sensor readings in ranges.
 Quality Control: Defect frequency in manufacturing.

Summary
 A Frequency Distribution Graph visually represents data distribution.
 Main forms: Histogram, Polygon, Ogive.
 Used to detect patterns, skewness, spread, outliers.
 Essential for statistics, probability, and data science applications.
