Plotting Data Using Various Graphs
A bar graph (or bar chart) is a visual representation of categorical or discrete data using
rectangular bars.
Each bar’s height (or length) is proportional to the value or frequency of the category it
represents.
Bars are equally spaced and of equal width, but their heights vary according to data
values.
Key Features
1. Axes
o X-axis: Represents the categories (e.g., students, dealers, products).
o Y-axis: Represents numerical values (frequency, count, or amount).
2. Bars
o Drawn vertically or horizontally.
o Equal width, equal spacing.
o Bars do not touch (unlike histograms).
3. Proportionality
o The height/length of each bar is directly proportional to the corresponding value.
Advantages
✅ Easy to draw and understand.
✅ Useful for comparing different categories.
✅ Can represent frequencies, percentages, or amounts.
Disadvantages
Uses
2. Numerical Example
Problem Statement:
A shopkeeper records the number of cars sold by four dealers in one week:
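Step 1: Identify the data
Categories = Dealers A, B, C, D
Values = Cars sold = [20, 35, 25, 15]
Step 2: Label the axes
X-axis = Dealers
Y-axis = Cars sold
Step 3: Plot bars
Draw one bar per dealer, with heights 20, 35, 25, and 15.
Step 4: Interpret
Dealer B sold the most cars (35) and Dealer D sold the fewest (15).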
[Figure: bar graph for the numerical example (cars sold by Dealers A, B, C, and D)]
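The dealer example can also be sketched in a few lines of Python. This is a minimal, text-only sketch that uses just the standard library; the function name `text_bar_chart` and the use of "#" characters as bars are illustrative assumptions, not part of the original notes.

```python
def text_bar_chart(categories, values, symbol="#"):
    """Return one text row per category; the bar length equals the value."""
    width = max(len(c) for c in categories)
    rows = []
    for name, value in zip(categories, values):
        rows.append(f"{name.ljust(width)} | {symbol * value} {value}")
    return rows

dealers = ["A", "B", "C", "D"]
cars_sold = [20, 35, 25, 15]

for row in text_bar_chart(dealers, cars_sold):
    print(row)

# Interpretation step: find the tallest and shortest bars
best = dealers[cars_sold.index(max(cars_sold))]
worst = dealers[cars_sold.index(min(cars_sold))]
print(f"Most cars sold: Dealer {best}; fewest: Dealer {worst}")
```

Because every bar is drawn with one symbol per car, the bar lengths stay proportional to the values, which is the defining property of a bar graph.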
Pie Chart
1. Introduction
In statistics and data visualization, a pie chart is a very popular tool for representing
proportional data. It is essentially a circle divided into slices (or sectors), with each slice
showing the relative contribution of a category to the whole.
Thus, pie charts provide an at-a-glance understanding of the relative size of parts compared to
the total.
2. Formula for Pie Chart
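To convert data into a pie chart, compute the central angle of each slice:
Central angle = (Value of category ÷ Total of all values) × 360°
Explanation of formula:
Each category's share of the total (value ÷ total) is scaled to the 360° of a full circle.
Multiplying the same share by 100 gives the category's percentage.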
Advantages
Simple and Visually Appealing: Pie charts are easy to read and understand at a glance.
Their circular shape naturally lends itself to representing parts of a whole, making the
concept of proportion highly intuitive.
Effective for a Few Categories: When you have a small number of categories (typically
2 to 5), a pie chart is very effective. It clearly shows which category is the largest and
provides a quick comparison of their relative sizes.
Ideal for Proportions: The primary strength of a pie chart is its ability to show the
proportional distribution of a set of data. It immediately answers the question, "How big
is this part compared to the whole?"
Disadvantages
Difficulty in Comparing Slices: It is difficult for the human eye to accurately
compare the size of slices, especially if they have similar values. For example, telling the
difference between a 23% slice and a 25% slice is challenging, whereas a bar chart would
make this difference immediately obvious.
Poor for Many Categories: When a pie chart has more than a handful of slices, it
becomes cluttered and hard to read. Small slices are often indistinguishable and may
require labels and leader lines, which further complicates the chart.
Inability to Show Trends or Changes: A pie chart represents a snapshot of data at a
single point in time. It cannot be used to visualize trends over time, as it has no concept
of a time axis.
Not Suitable for Continuous Data: Pie charts are designed for categorical or discrete
data that can be grouped into distinct categories. They are inappropriate for continuous
data like temperature or height unless the values are first grouped into ranges.
Misleading 3D Effects: Using 3D effects on a pie chart is a common mistake that can
distort the true proportions, as the slices closer to the front of the chart appear larger than
those in the back.
Applications
Due to their simplicity, pie charts are widely used in situations where the goal is to show how a
total is divided among a few distinct categories.
Business:
o Market Share: Visualizing the market share of different companies in an
industry.
o Budget Allocation: Showing how a company's budget is distributed across
departments (e.g., marketing, R&D, operations).
Education:
o Student Demographics: Illustrating the percentage of students in different
majors or grade levels.
o Test Scores: Showing the percentage of students who scored in various ranges
(e.g., A, B, C, D, F).
Polling and Surveys:
o Opinion Polls: Displaying the percentage of respondents who support different
candidates or positions.
o Survey Results: Summarizing responses to a multiple-choice question (e.g.,
"What is your favorite color?").
In summary, a pie chart is a great tool for a single-variable analysis of a whole, but its use should
be limited to situations with a few categories where the primary goal is to show proportional
distribution. For more complex data or comparisons, other chart types, such as bar charts or line
charts, are more effective.
7. Numerical Example
A student scored marks in four subjects:
Subject | Marks
----------------
Math    | 40
Science | 30
English | 20
History | 10
Total   | 100
Total = 40 + 30 + 20 + 10 = 100
Step 3: Interpret
Conclusion
A pie chart is best suited for showing proportions of a whole.
It is easy to construct and interpret, making it one of the most widely used statistical
diagrams.
However, it is not suitable for very large datasets or when precise comparison is required.
In our example, the chart shows that Math and Science together account for 70% of the
student’s total marks.
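The marks example can be checked with a short Python sketch. It assumes the standard conversion (each category's share of the total, multiplied by 100 for the percentage and by 360 for the central angle); the function name `pie_slices` is a hypothetical helper chosen for illustration.

```python
def pie_slices(data):
    """Map each category to (percentage, central angle in degrees)."""
    total = sum(data.values())
    return {
        name: (value / total * 100, value / total * 360)
        for name, value in data.items()
    }

marks = {"Math": 40, "Science": 30, "English": 20, "History": 10}

slices = pie_slices(marks)
for subject, (percent, angle) in slices.items():
    print(f"{subject:<8} {percent:5.1f}%  {angle:6.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick sanity check when constructing a pie chart by hand.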
📊 Understanding and Creating Histograms: An In-Depth Guide
Bins: Think of bins as "data buckets." Each bin is a specific interval on the x-axis that holds a
range of data values. Bins usually have the same width; if the widths differ, the area of each bar
(not its height) must represent the frequency. The number of bins is a crucial choice that
determines the level of detail shown in the plot. Too few bins can hide important features,
while too many can make the plot look noisy.
Frequency: The height of each bar represents the number of data points (or count) that fall into
that bin's range. It's the frequency of occurrence for that specific interval.
Distribution: By looking at the overall shape of the bars, we can understand the data's
distribution, or how the values are spread out.
Let's use a dataset of 30 daily high temperatures (in Fahrenheit) to demonstrate how to build a
histogram by hand.
[68,70,75,71,80,78,85,82,79,73,76,69,74,81,83,77,72,84,86,70,75,78,80,81,74,82,85,79,73,76]
The first step is to identify the minimum and maximum values in your dataset.
Minimum Value: 68
Maximum Value: 86
Range: 86−68=18
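The remaining steps (choosing bins and counting values) can be sketched in Python. The bin width of 4 and the count of 5 bins are assumptions made for this illustration, since the notes do not state which bins were used; `bin_counts` is a hypothetical helper name.

```python
temps = [68, 70, 75, 71, 80, 78, 85, 82, 79, 73, 76, 69, 74, 81, 83,
         77, 72, 84, 86, 70, 75, 78, 80, 81, 74, 82, 85, 79, 73, 76]

def bin_counts(values, start, width, n_bins):
    """Count values in n_bins half-open intervals [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

bin_width = 4   # assumed choice: the range of 18 fits in 5 bins of width 4
n_bins = 5
counts = bin_counts(temps, min(temps), bin_width, n_bins)

# Print a sideways histogram: one row per bin, one star per data point
for i, count in enumerate(counts):
    lo = min(temps) + i * bin_width
    print(f"{lo}-{lo + bin_width - 1}: {'*' * count} ({count})")
```

Changing `bin_width` and `n_bins` shows how the same 30 values can look smoother or noisier depending on the binning.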
Analyzing the Plots
The plot with 10 bins provides a smooth, general overview, clearly showing a bell-shaped
distribution. The plot with 50 bins is more detailed and shows more local peaks and valleys.
While it still resembles a bell shape, it highlights minor fluctuations in the data that the broader
plot hides.
The shape of a histogram is the most important feature to analyze. It tells us a lot about the data's
characteristics. Here are some common shapes:
Normal (Bell-shaped): A common shape, in which the data is symmetric around a central
peak. Many natural phenomena approximately follow this pattern.
Skewed Right (Positively Skewed): The tail of the distribution extends to the right. This means
most data points are on the left side, and there are a few very high values pulling the mean to the
right.
Skewed Left (Negatively Skewed): The tail extends to the left. Most data points are on the
right, with a few very low values.
Bimodal: A histogram with two distinct peaks. This often suggests that the dataset is composed
of two different groups or populations.
By plotting data in a histogram, you can visually identify these shapes and gain a deeper
understanding of your dataset.
Advantages of Histograms
Easy to Understand: Histograms are intuitive and straightforward to interpret, even for a
non-technical audience. They provide a clear visual summary of a large dataset.
Effective for Large Datasets: By grouping data into bins, histograms can effectively
summarize and display a large number of data points without becoming cluttered.
Reveals Data Distribution: Histograms immediately show the shape of the data's
distribution (e.g., normal, skewed, bimodal), central tendency (where the data is
centered), and spread (the range of values).
Identifies Outliers and Anomalies: Gaps or isolated bars far from the main body of the
histogram can quickly highlight unusual data points that may need further investigation.
Disadvantages of Histograms
Subjectivity of Bin Width: The appearance and interpretation of a histogram can change
dramatically depending on the number of bins chosen. A different bin width can lead to a
completely different story being told by the data.
Loss of Detail: Histograms group data into ranges, meaning you lose the exact value of
each individual data point. For example, a bar for the range 10-20 doesn't tell you if the
values within it were 10 or 19.
Limited for Comparison: Histograms are best for analyzing a single variable.
Comparing multiple datasets or variables side-by-side using histograms can be difficult
and confusing.
Applications of Histograms
Histograms have wide-ranging applications across many fields for data analysis and quality
control.
Stem-and-Leaf Plot
Introduction
In data science and statistics, visual representation of data is a critical step in understanding its
structure, distribution, and key features. One such tool is the Stem-and-Leaf Plot (also called
Leaf-and-Stem Chart).
Definition
A Stem-and-Leaf Plot is a method of representing numerical data where each value is split into:
o Stem: the leading digit or digits.
o Leaf: the final digit.
The stems are listed vertically, and the corresponding leaves are written horizontally next to
them.
Why Use a Stem-and-Leaf Plot?
1. Preserves raw data: Unlike histograms where exact values are lost, a stem-and-leaf plot
keeps every value intact.
2. Quick visualization: It shows the shape of distribution (symmetry, skewness).
3. Highlights outliers: Extreme values stand out clearly.
4. Useful for small data: Works best when dataset size is between 20 and 100 values.
5. Summarizes data in compact form: Instead of writing a long list, values are
neatly grouped.
Key Features
Stems act as class intervals or groups.
Leaves are the remaining part of the number that differentiates within the group.
Can be back-to-back (two datasets compared side by side).
Helps calculate median, mode, quartiles, and spread.
Shows frequency (how many values fall under a stem).
Example: For numbers like 132, 145 → Stem = 13, 14 (the hundreds and tens digits), Leaf = 2, 5
(the ones digit).
For decimals → Stem can be the integer part, Leaf the decimal part.
12, 15, 17, 18, 21, 22, 25, 27, 32, 34, 35, 38
Construction:
Stem | Leaves
----------------
1 | 2 5 7 8
2 | 1 2 5 7
3 | 2 4 5 8
Interpretation:
(Exam Scores)
Scores:
45, 47, 49, 52, 56, 58, 62, 65, 67, 68, 71, 72, 75, 78
Stem | Leaves
----------------
4 | 5 7 9
5 | 2 6 8
6 | 2 5 7 8
7 | 1 2 5 8
Interpretation:
132, 136, 142, 145, 147, 151, 153, 158, 161, 165, 168, 171
Stem | Leaves
----------------
13 | 2 6
14 | 2 5 7
15 | 1 3 8
16 | 1 5 8
17 | 1
Observation:
📌 Example:
Advantages
Retains original data.
Easy to construct.
Highlights clusters, gaps, and outliers.
Good for exploratory analysis.
Limitations
Not suitable for very large datasets (>200 values).
Choice of stem/leaf division may mislead if not chosen carefully.
Modern tools (histograms, boxplots, density plots) are more common in big data
analytics.
212, 218, 223, 225, 226, 232, 239, 241, 245, 246, 251, 259
Solution:
Stem | Leaves
----------------
21 | 2 8
22 | 3 5 6
23 | 2 9
24 | 1 5 6
25 | 1 9
Interpretation:
Sales are mostly clustered in the 220s–240s range.
Highest sales = 259 and lowest = 212; neither is far enough from the rest to be an outlier.
Median lies between 232 and 239.
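The construction used in these examples can be sketched in Python. This is a minimal sketch for non-negative integers, assuming the stem is every digit except the last and the leaf is the last digit; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 218 -> stem 21, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

sales = [212, 218, 223, 225, 226, 232, 239, 241, 245, 246, 251, 259]

plot = stem_and_leaf(sales)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Note that this sketch skips stems that contain no values, whereas a hand-drawn plot usually lists empty stems so that gaps in the data remain visible.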
Summary
A stem-and-leaf plot is a compact, data-preserving visualization technique.
Most useful for small datasets.
Helps in identifying trends, spread, clusters, and outliers.
Still used in teaching and EDA, though less common in modern big data due to its
manual construction limitations.
Dot Plot
🔹 1. Introduction
A Dot Plot is a statistical visualization technique where each data point is represented by a dot
placed along a number line (axis). It was popularized by William Cleveland (1984) as a way to
improve clarity and precision in displaying data compared to bar charts or histograms.
Unlike other charts that aggregate or summarize data, a dot plot preserves the raw dataset,
making it ideal for exploratory data analysis (EDA) in data science.
🔹 7. Limitations
Not scalable for large datasets (dots overlap, plot becomes messy).
Only suitable for discrete or rounded numerical data (not continuous).
Cannot show higher-level summary statistics like variance, quartiles, or percentiles (box
plots do this better).
Not as common in industry reporting compared to histograms, boxplots, or violin plots.
Step 1 – Range: 3 to 7
Step 2 – Number Line: 3, 4, 5, 6, 7
Step 3 – Plot Dots:
3 → ●
4 → ●●
5 → ●●●
6 → ●●
7 → ●●
3: ●
4: ●●
5: ●●●
6: ●●
7: ●●
Interpretation:
Group A: 4, 5, 6, 6, 7, 5, 4
Group B: 6, 7, 7, 8, 9, 6, 7
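A dot plot for each group can be sketched in Python by stacking one dot per occurrence of each value. This is a minimal text-only sketch; the function name `dot_plot` and the row layout are illustrative assumptions.

```python
from collections import Counter

def dot_plot(values, dot="●"):
    """Return one text row per integer from min to max, one dot per occurrence."""
    counts = Counter(values)
    return [
        f"{x}: {dot * counts[x]}"
        for x in range(min(values), max(values) + 1)
    ]

group_a = [4, 5, 6, 6, 7, 5, 4]
group_b = [6, 7, 7, 8, 9, 6, 7]

for name, data in (("Group A", group_a), ("Group B", group_b)):
    print(name)
    for row in dot_plot(data):
        print("  " + row)
```

Printing the two groups one after the other makes it easy to compare where each group's values are concentrated.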
15 → ●●●
16 → ●●
17 → ●●
18 → ●
19 → ●
20 → ●
👉 Interpretation:
Mode = 15 (most common daily arrival).
Range = 15–20 (spread = 5).
Distribution is slightly skewed right.
3 → 1 dot
4 → 2 dots
5 → 3 dots
6 → 2 dots
7 → 2 dots
Scatter Plot
Introduction
A Scatter Plot is a statistical and data visualization tool used to display the relationship
between two quantitative (numerical) variables.
In Data Science, scatter plots are often the first step in feature analysis before feeding data into
machine learning algorithms.
r ≈ +1 → Strong positive correlation
r ≈ -1 → Strong negative correlation
r ≈ 0 → No linear correlation
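The coefficient r can be computed directly from its definition. The sketch below assumes r is the Pearson correlation coefficient, and the sample data is hypothetical, invented only to show the two extreme cases; `pearson_r` is an illustrative helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # increases steadily with hours
falling = [10, 8, 6, 4, 2]    # decreases steadily with hours

print(round(pearson_r(hours, rising), 2))
print(round(pearson_r(hours, falling), 2))
```

The function divides by zero if either variable is constant, so a real analysis would need to guard against that case.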
Limitations
Can only show relationships between two variables (unless extended to 3D).
Large datasets cause overlapping points (overplotting).
Cannot establish causation (only correlation).
Not useful for categorical variables.
Very weak correlation, effectively none (r ≈ 0.15).
Time Series Graph
Introduction
A Time Series Graph is a graphical representation of data points collected in chronological
order. It shows how a variable changes over time.
Advantages
Shows trends, fluctuations, and cycles clearly.
Useful for prediction and forecasting.
Easy to interpret visually.
Limitations
Typically shows one variable against time; plotting many series together quickly becomes hard to read.
Large variations may hide subtle patterns.
Requires enough data points to show trends.
Month Sales
Jan 20
Feb 25
Mar 30
Apr 28
May 35
Jun 40
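Before plotting, the month-to-month movement in the sales table can be summarized with a short Python sketch. The helper name `period_changes` is an illustrative assumption.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [20, 25, 30, 28, 35, 40]

def period_changes(values):
    """Difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

changes = period_changes(sales)
for month, change in zip(months[1:], changes):
    print(f"{month}: {change:+d}")

print(f"Overall change Jan to Jun: {sales[-1] - sales[0]:+d}")
```

A mostly positive list of changes corresponds to an upward-sloping line on the time series graph, and a negative entry marks a dip.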
Data (°C):
Day Temp
1 30
2 32
3 31
4 29
5 28
6 30
7 33
Day Price
1 100
2 102
3 105
4 103
5 108
6 107
7 110
8 115
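Day-to-day fluctuations like those in the price table can obscure the underlying trend. One common way to smooth them, shown here as an illustrative technique that the notes themselves do not prescribe, is a simple moving average; the 3-day window and the name `moving_average` are assumptions.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

prices = [100, 102, 105, 103, 108, 107, 110, 115]

smoothed = moving_average(prices, 3)   # assumed 3-day window
print([round(v, 1) for v in smoothed])
```

Plotting the smoothed values alongside the raw prices makes the general direction easier to see, at the cost of losing the first and last few points.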
Frequency Distribution Graph
Introduction
In statistics and data science, frequency distribution is a way to organize raw data into
meaningful groups, showing how many times (frequency) each value or range of values occurs.
A Frequency Distribution Graph is the graphical representation of this tabular data. It helps
us quickly see:
Frequency graphs are essential for descriptive statistics and form the foundation for
probability distributions, hypothesis testing, and machine learning models.
Types of Frequency Distribution Graphs
There are three major types:
(a) Histogram
🔹 5. Limitations
Grouping loses exact values.
Results may differ if class intervals are chosen poorly.
Not suitable for very small datasets.
Step 2: Histogram
👉 Plot cumulative frequencies against upper boundaries (19.5, 29.5, 39.5, 49.5).
Numerical Example 2 – Daily Wages of Workers
Dataset (₹):
120, 150, 200, 220, 180, 170, 160, 250, 230, 210,
140, 190, 175, 185, 240, 260, 280, 300, 270, 220
Wage Range (₹) | Frequency
--------------------------
100–149 | 2
150–199 | 7
200–249 | 6
250–299 | 4
300–349 | 1
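The class frequencies can be tallied directly from the raw wages with a short Python sketch, which also produces the cumulative frequencies needed for an ogive. The class width of 50 starting at 100 follows the table; the helper name `class_frequencies` is an illustrative assumption.

```python
from itertools import accumulate

wages = [120, 150, 200, 220, 180, 170, 160, 250, 230, 210,
         140, 190, 175, 185, 240, 260, 280, 300, 270, 220]

def class_frequencies(values, start, width, n_classes):
    """Count values in classes start..start+width-1, start+width.., and so on."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

freq = class_frequencies(wages, start=100, width=50, n_classes=5)
cumulative = list(accumulate(freq))   # running totals, used for the ogive

for i, (f, cf) in enumerate(zip(freq, cumulative)):
    lo = 100 + i * 50
    print(f"{lo}-{lo + 49}: frequency {f}, cumulative {cf}")
```

The frequencies give the bar heights for the histogram and the points for the frequency polygon, while the last cumulative value should equal the number of workers (20).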
Step 2: Graphs
Summary
A Frequency Distribution Graph visually represents data distribution.
Main forms: Histogram, Frequency Polygon, Ogive.
Used to detect patterns, skewness, spread, outliers.
Essential for statistics, probability, and data science applications.