Unit I Exploratory Data Analysis
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for
EDA- Data transformation techniques-merging database, reshaping and pivoting,
Transformation techniques.
Data science
Data science is an interdisciplinary field that combines scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and
unstructured data. It involves applying techniques from various fields such as statistics,
mathematics, computer science, and domain expertise to analyze and interpret data in
order to solve complex problems, make predictions, and drive decision-making.
A typical data science workflow involves the following stages:
Data Collection: Data collection is the process of gathering or acquiring data from
various sources. This may include surveys, experiments, observations, web scraping,
sensors, or accessing existing datasets. The data collected should align with the defined
data requirements.
Data Processing: Data processing involves transforming raw data into a more usable and
structured format. This step may include data integration, data aggregation, data
transformation, and data normalization to prepare the data for further analysis.
Data Cleaning: Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and rectifying errors, inconsistencies, missing values, and outliers
in the dataset. It aims to improve data quality and ensure that the data is accurate and
reliable for analysis.
Significance of EDA
Exploratory Data Analysis (EDA) is of great significance in data science and analysis.
Here are some key reasons why EDA is crucial:
Understanding the Data: EDA helps in gaining a deep understanding of the dataset at
hand. It allows data scientists to become familiar with the structure, contents, and
characteristics of the data, including variables, their distributions, and relationships.
This understanding is essential for making informed decisions throughout the data
analysis process.
Data Quality Assessment: EDA helps identify and assess the quality of the data. It allows
for the detection of missing values, outliers, inconsistencies, or errors in the dataset. By
addressing data quality issues, EDA helps ensure that subsequent analyses and models
are built on reliable and accurate data.
Feature Selection and Engineering: EDA aids in selecting relevant features or variables
for analysis. By examining relationships and correlations between variables, EDA can
guide the identification of important predictors or features that significantly contribute
to the desired outcome. EDA can also inspire the creation of new derived features or
transformations that improve model performance.
Uncovering Patterns and Insights: EDA enables the discovery of patterns, trends, and
relationships within the data. By using visualization techniques and summary statistics,
EDA helps uncover valuable insights and potential associations between variables.
These insights can drive further analysis, hypothesis generation, or the formulation of
research questions.
Hypothesis Generation and Testing: EDA plays a crucial role in generating hypotheses for
further investigation. By exploring the data, researchers can identify potential
relationships or patterns and formulate hypotheses to test formally. EDA can also
provide evidence or insights to support or refute existing hypotheses.
Steps in Exploratory Data Analysis
Understand the Data: Start by getting familiar with the dataset, its structure, and the
variables it contains. Understand the data types (e.g., numerical, categorical) and the
meaning of each variable.
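For example, with pandas a quick structural overview can be obtained as follows (the file name and columns here are hypothetical):

import pandas as pd

df = pd.read_csv('students.csv')   # hypothetical dataset

print(df.head())     # first few rows
print(df.shape)      # number of rows and columns
print(df.dtypes)     # data type of each variable
df.info()            # concise summary: non-null counts and dtypes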
Data Cleaning: Clean the dataset by handling missing values, outliers, and
inconsistencies. Identify and handle missing data appropriately (e.g., imputation,
deletion) based on the context and data quality requirements. Treat outliers and
inconsistent values by either correcting or removing them if necessary.
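A minimal sketch of common cleaning operations in pandas (the DataFrame below is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 32, 47],
                   'city': ['Chennai', 'Madurai', None, 'Salem']})   # hypothetical data

df['age'] = df['age'].fillna(df['age'].median())   # impute missing ages with the median
df = df.dropna(subset=['city'])                    # drop rows where 'city' is missing
print(df)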
Handle Data Transformations: Explore and apply necessary data transformations such
as scaling, normalization, or logarithmic transformations to make the data suitable for
analysis. This step may be required to meet assumptions of certain statistical methods
or to improve the interpretability of the data.
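For instance, min-max scaling and a logarithmic transformation could be applied as follows (the column and values are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [25000, 40000, 120000, 60000]})   # hypothetical skewed variable

# Min-max scaling to the range [0, 1]
df['income_scaled'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# Logarithmic transformation to reduce skew
df['income_log'] = np.log(df['income'])
print(df)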
Summary Statistics: Compute and analyze summary statistics for each variable. This
includes measures such as mean, median, mode, standard deviation, range, quartiles,
and other descriptive statistics. Summary statistics provide an initial understanding of
the data distribution and basic insights.
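A short sketch of computing summary statistics with pandas (illustrative values):

import pandas as pd

marks = pd.Series([45, 67, 89, 72, 58, 91])   # hypothetical numeric variable

print(marks.describe())              # count, mean, std, min, quartiles, max
print(marks.median())                # median
print(marks.mode())                  # mode
print(marks.max() - marks.min())     # range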
Data Visualization: Utilize various visualization techniques to explore the data visually.
Create histograms, scatter plots, box plots, bar charts, heatmaps, or other relevant
visualizations to understand the patterns, distributions, and relationships between
variables. Visualizations can reveal insights that may not be apparent from summary
statistics alone.
Exploring Time Series Data: If the dataset involves time series data, analyze trends,
seasonality, and other temporal patterns. Use line plots, time series decomposition,
autocorrelation plots, or other relevant techniques to explore the temporal behavior of
the data.
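A minimal sketch of exploring a time series with pandas and matplotlib (the monthly series below is simulated for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated monthly series with a trend and a yearly seasonal component
idx = pd.date_range('2020-01-01', periods=48, freq='M')
ts = pd.Series(np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12), index=idx)

ts.plot(title='Monthly series')            # line plot of the raw series
plt.figure()
pd.plotting.autocorrelation_plot(ts)       # autocorrelation plot
plt.show()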
Feature Engineering: Based on the insights gained from EDA, consider creating new
derived features or transformations that may enhance the predictive power or
interpretability of the data. This can involve mathematical operations, combinations of
variables, or domain-specific transformations.
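For example, a new feature can be derived by combining existing variables (hypothetical columns):

import pandas as pd

df = pd.DataFrame({'height_m': [1.60, 1.75, 1.82],
                   'weight_kg': [55, 72, 90]})   # hypothetical variables

# Derived feature: body mass index, a combination of two existing columns
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2
print(df)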
Iterative Analysis: EDA is often an iterative process. Repeat the above steps as needed,
diving deeper into specific variables or subsets of the data based on emerging patterns
or research questions. Refine the analysis based on new insights or feedback from
stakeholders.
Remember, the steps and techniques used in EDA can be flexible and iterative, tailored
to the specific dataset and research objectives. The goal is to gain a comprehensive
understanding of the data, identify patterns, and generate hypotheses for further
analysis.
Define Data
In data science, "data" refers to the raw, unprocessed, and often vast quantities of information collected or generated from various sources. It can exist in different formats, such as structured data (organized and well-defined), semi-structured data (partially organized), or unstructured data (lacking a predefined structure).
Data serves as the foundation for data science activities and analysis. It can include
numbers, text, images, audio, video, sensor readings, transaction records, social media
posts, and much more. Data can be generated from diverse sources, including databases,
spreadsheets, web scraping, sensors, surveys, or online platforms.
Categorization of Data
In data science, data is typically categorized into two main types:
Quantitative Data: Also known as numerical or structured data, quantitative data
represents measurable quantities or variables. It includes attributes such as age,
temperature, sales figures, or stock prices. Quantitative data is typically analyzed using
statistical techniques and mathematical models.
Examples of quantitative data:
• Scores of tests and exams e.g. 74, 67, 98, etc.
• The weight of a person.
• The temperature in a room
Qualitative Data: Also known as categorical data, qualitative data describes attributes or labels rather than measurable quantities, for example gender, colour, or product category. Qualitative data is further divided into nominal and ordinal data.
Ordinal Data
Ordinal data is qualitative data whose values have some kind of relative position. These kinds of data can be considered "in-between" qualitative and quantitative data. Ordinal data only shows the sequence and cannot be used for statistical computation. Compared to nominal data, ordinal data have some kind of order that is not present in nominal data.
Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an
android phone, the height of a person, the length of an object, etc. Continuous data
represents information that can be divided into smaller levels. The continuous variable
can take any value within a range.
Examples of continuous data:
• Height of a person - 62.04762 inches, 79.948376 inches
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
Interval Data
The interval level is a numerical level of measurement which, like the ordinal scale,
places variables in order. The interval scale has a known and equal distance between
each value on the scale (imagine the points on a thermometer). Unlike the ratio scale
(the fourth level of measurement), interval data has no true zero; in other words, a
value of zero on an interval scale does not mean the variable is absent.
A temperature of zero degrees Fahrenheit doesn’t mean there is “no temperature” to be
measured—rather, it signifies a very low or cold temperature.
Ratio Data
The fourth and final level of measurement is the ratio level. Just like the interval scale, the ratio scale is a quantitative level of measurement with equal intervals between each point. The difference between the interval scale and the ratio scale is that the ratio scale has a true zero; that is, a value of zero on a ratio scale means that the variable you're measuring is absent.
Population is a good example of ratio data. If you have a population count of zero
people, this means there are no people!
Measurement scales
Measurement scales, also known as data scales or levels of measurement, define
the properties and characteristics of the data collected or measured. There are four
commonly recognized measurement scales:
Nominal Scale:
• The nominal scale is the lowest level of measurement. It represents data that can
be categorized into distinct and mutually exclusive groups or categories. The
categories in a nominal scale have no inherent order or ranking.
• Examples of nominal scale data include gender (male/female), eye color
(blue/green/brown), or types of cars (sedan/SUV/hatchback).
• Nominal data can be represented using labels or codes.
Ordinal Scale:
• The ordinal scale represents data with categories that have a natural order or
ranking. In addition to the properties of the nominal scale, ordinal data allows
for the relative positioning or hierarchy between the categories. However, the
intervals between the categories may not be equal.
• Examples of ordinal scale data include rating scales (e.g., 1-5 scale indicating
satisfaction levels), education levels (e.g., high school, bachelor's, master's), or
performance rankings (first, second, third place). Ordinal data can be
represented using labels, codes, or numerical rankings.
Interval Scale:
• The interval scale represents data with categories that have equal intervals
between the values. In addition to the properties of the ordinal scale, interval
data allows for meaningful comparisons of the intervals between the categories.
However, it does not have a true zero point.
• Examples of interval scale data include calendar dates, temperature measured in
Celsius or Fahrenheit, or years. Interval data allows for mathematical operations
such as addition and subtraction but does not support meaningful multiplication
or division.
Ratio Scale:
• The ratio scale is the highest level of measurement. It represents data with
categories that have equal intervals between the values and possess a true zero
point. In addition to the properties of the interval scale, ratio data allows for all
mathematical operations and meaningful ratios.
• Examples of ratio scale data include weight, length, time duration, or count. Ratio
scale data provides a complete and meaningful representation of the data.
Understanding the measurement scale of the data is important for selecting appropriate
statistical techniques, visualization methods, and modeling approaches. Different scales
require different levels of analysis and interpretation.
Visual Aids for EDA
Line chart
Line charts are used to represent the relationship between two variables, X and Y, plotted on the two axes. Here we will see an example of a line chart in Python:
Year Unemployment_Rate
1920 9.8
1930 12
1940 8
1950 7.2
1960 6.9
1970 7
1980 6.5
1990 6.2
2000 5.5
2010 3.3
The ultimate goal is to depict the above data using a Line chart.
The general pattern for plotting a line chart with matplotlib is:

plt.plot(x_axis, y_axis)
plt.title('title name')
plt.xlabel('x_axis name')
plt.ylabel('y_axis name')
plt.show()
import matplotlib.pyplot as plt

year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 3.3]
plt.plot(year, unemployment_rate)
plt.title('Unemployment rate vs Year')
plt.xlabel('Year')
plt.ylabel('Unemployment rate')
plt.show()
Output:
Bar charts
A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths are proportional to the values they represent. Bar plots can be drawn horizontally or vertically. A bar chart describes comparisons between discrete categories.
import matplotlib.pyplot as plt
import numpy as np
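A minimal self-contained bar chart sketch using matplotlib's bar() (the categories and values below are illustrative):

import matplotlib.pyplot as plt

languages = ['Python', 'Java', 'C++', 'JavaScript']    # illustrative categories
popularity = [45, 30, 15, 10]                          # illustrative values

plt.bar(languages, popularity)
plt.xlabel('Programming language')
plt.ylabel('Popularity (%)')
plt.title('Bar chart example')
plt.show()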
Scatter plots
Scatter plots are used to observe relationships between variables, using dots to represent the individual observations. The scatter() method in the matplotlib library is used to draw a scatter plot. Scatter plots are widely used to show how a change in one variable affects another.
# Python program to create a scatter plot
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]   # illustrative y values (not listed in the source)

plt.scatter(x, y, c="blue")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

Output:
Bubble chart
Bubble plots are an improved version of the scatter plot. In a scatter plot, there are two
dimensions x, and y. In a bubble plot, there are three dimensions x, y, and z, where the
third dimension z denotes weight. Here, each data point on the graph is shown as a
bubble. Each bubble can be illustrated with a different color, size, and appearance.
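A minimal bubble chart sketch: matplotlib's scatter() is typically used, with the third variable passed to the s (marker size) parameter (the data below is randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(30)
y = np.random.rand(30)
z = np.random.rand(30) * 500     # third dimension: bubble size (weight)

plt.scatter(x, y, s=z, c=z, alpha=0.5)   # size and colour encode z
plt.title('Bubble chart')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()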
Area Chart
An area chart is really similar to a line chart, except that the area between the x axis and
the line is filled in with color or shading. It represents the evolution of a numeric
variable.
import numpy as np
import matplotlib.pyplot as plt

# Create data
x = range(1, 6)
y = [1, 4, 6, 8, 4]

# Area plot
plt.fill_between(x, y)
plt.show()

Output:
# World Population example; the data values and the plotting call below are
# illustrative placeholders (only the labels, colours and legend calls appear in the source)
year = ['2010', '2015', '2020']
population = [6900, 7350, 7800]        # hypothetical values, in millions
plt.barh(year, population, color=['red', 'green', 'magenta'], label='Population')

# Add Labels
plt.legend(loc='upper left')
plt.title('World Population')
plt.xlabel('Number of people (millions)')
plt.ylabel('Year')
plt.show()

Output:
# Household Expenses example; the data values below are illustrative placeholders
months = ['Jan', 'Feb', 'Mar', 'Apr']
cost = [2500, 2300, 2700, 2600]
plt.bar(months, cost)

# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
plt.show()

Output:
Pie Chart
A Pie Chart is a circular statistical plot that can display only one series of data. The area
of the chart is the total percentage of the given data. The area of slices of the pie
represents the percentage of the parts of the data. The slices of pie are called wedges.
The area of the wedge is determined by the length of the arc of the wedge. The area of a
wedge represents the relative percentage of that part with respect to the whole data. Pie charts are commonly used in business presentations like sales, operations, survey results, resources, etc., as they provide a quick summary.
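A minimal pie chart sketch using matplotlib's pie() (the categories and percentages below are illustrative):

import matplotlib.pyplot as plt

labels = ['Sales', 'Operations', 'Marketing', 'R&D']   # illustrative categories
sizes = [40, 25, 20, 15]                               # illustrative shares

plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Department budget share')
plt.show()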
Output:
Table Chart
matplotlib.pyplot.table() is a part of the matplotlib library in which a table is generated alongside the plotted graph for analysis. This method makes analysis easier and more efficient, as tables give more precise detail than graphs. matplotlib.pyplot.table creates tables that often hang beneath stacked bar charts to provide readers insight into the data generated by the above graph.
import numpy as np
import matplotlib.pyplot as plt

plt.figure()
ax = plt.gca()
a = np.random.randn(5)
plt.plot(a)
# attach a simple table of the plotted values beneath the axes
plt.table(cellText=[[round(v, 2) for v in a]], loc='bottom')
plt.show()
Output
Polar Chart

import numpy as np
import matplotlib.pyplot as plt

plt.axes(projection='polar')

# setting the radius
r = 2
rads = np.arange(0, (2 * np.pi), 0.01)

# plotting the circle
for rad in rads:
    plt.polar(rad, r, 'g.')

plt.show()
Output
import numpy as np
import matplotlib.pyplot as plot
plot.axes(projection='polar')
plot.title('Circle in polar format')
rads = np.arange(0, (2*np.pi), 0.01)
for radian in rads:
    plot.polar(radian, 2, 'o')
plot.show()
Output
Histogram
A histogram is a graph showing frequency distributions. It is a graph showing the
number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:
You can read from the histogram that there are approximately:
• 2 people from 140 to 145 cm
• 5 people from 145 to 150 cm
• 15 people from 151 to 156 cm
• 31 people from 157 to 162 cm
• 46 people from 163 to 168 cm
• 53 people from 168 to 173 cm
• 45 people from 173 to 178 cm
• 28 people from 179 to 184 cm
• 21 people from 185 to 190 cm
• 4 people from 190 to 195 cm
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(170, 10, 250)   # illustrative: 250 simulated heights in cm
plt.hist(x)
plt.show()
Lollipop plot
A basic lollipop plot can be created using the stem() function of matplotlib. This function
takes x axis and y axis values as an argument. x values are optional; if you do not
provide x values, it will automatically assign x positions.
import numpy as np
import matplotlib.pyplot as plt

# create data
x = range(1, 41)
values = np.random.uniform(size=40)

# stem function
plt.stem(x, values)
plt.ylim(0, 1.2)
plt.show()
Data transformation
Data transformation is a set of techniques used to convert data from one format or
structure to another format or structure. The following are some examples of
transformation activities:
• Data deduplication involves the identification of duplicates and their removal.
• Key restructuring involves transforming any keys with built-in meanings into generic keys.
• Data cleansing involves deleting out-of-date, inaccurate, and incomplete information from the source data, without changing its meaning, in order to enhance the accuracy of the source data.
• Data validation is a process of formulating rules or algorithms that help in validating different types of data against some known issues.
• Format revisioning involves converting from one format to another.
• Data derivation consists of creating a set of rules to generate more information
from the data source.
• Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
• Data integration involves converting different data types and merging them into
a common structure or schema.
• Data filtering involves identifying information relevant to any particular user.
• Data joining involves establishing a relationship between two or more tables.
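A few of these activities, sketched with pandas on a small hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({'name': ['Asha', 'Asha', 'Bala', 'Chitra'],
                   'dept': ['CSE', 'CSE', 'ECE', 'CSE'],
                   'marks': [82, 82, 91, 75]})   # hypothetical data

df = df.drop_duplicates()                        # data deduplication
cse_only = df[df['dept'] == 'CSE']               # data filtering
dept_avg = df.groupby('dept')['marks'].mean()    # data aggregation
print(df, cse_only, dept_avg, sep='\n\n')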
pandas concat():

The concat() function combines DataFrames by stacking them along an axis; with the default settings it appends rows one below another.

# Importing libraries
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
},
index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
},
index=[8, 9, 10, 11],
)
frames = [df1, df2, df3]
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", df3)
print("\n", result)
Visually, a concatenation with no parameters stacks the DataFrames along the rows, one below the other, preserving each frame's index.
In pandas, merge() and join() are used for horizontal combination, whereas concat() and append() are used for vertical combination. merge() and join() perform similar tasks but differ internally, and the same is true of concat() and append().
pandas merge():
Pandas provides various built-in functions for easily combining datasets. Among them, merge() is a high-performance in-memory operation very similar to the joins found in relational databases such as SQL. You can use merge() any time you want to perform database-like join operations.
• The simplest call without any key column
• Specifying key columns using on
• Merging using left_on and right_on
• Various forms of joins: inner, left, right and outer
Syntax:
# This join brings together the entire DataFrame
pd.merge(df1, df2)
(or)
df1.merge(df2)
(or)
df1.merge(df2, on='Name')
Output:
Code 2#: Merge two DataFrames via ‘id’ column.
df1.merge(df2, on='id')

Output:

Code 3#: Merge with different column names - specify a left_on and right_on

df1.merge(df2, left_on='id', right_on='customer_id')

Output:
pandas append():
To append the rows of one DataFrame to the rows of another, we can use the Pandas append() function. With the help of append(), we can append columns too.
Steps
• Create a two-dimensional, size-mutable, potentially heterogeneous tabular data, df1.
• Print the input DataFrame, df1.
• Create another DataFrame, df2, with the same column names and print it.
• Use the append method, df1.append(df2, ignore_index=True), to append the rows of df2 to df1.
• Print the resultant DataFrame.
df3 = df1.append(df2, ignore_index=True)

Output:
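A self-contained sketch with two small hypothetical DataFrames (note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() is the replacement):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Asha', 'Bala'], 'Marks': [82, 91]})      # hypothetical
df2 = pd.DataFrame({'Name': ['Chitra', 'Dinesh'], 'Marks': [75, 88]})  # hypothetical

df3 = df1.append(df2, ignore_index=True)        # works on pandas < 2.0
# On pandas 2.x, append() has been removed; use concat instead:
# df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)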
Reshaping and pivoting

A hierarchically indexed (MultiIndex) Series can be reshaped, or "pivoted", into a DataFrame using unstack():

import numpy as np
import pandas as pd
index = pd.MultiIndex.from_tuples([('one', 'x'), ('one', 'y'),
('two', 'x'), ('two','y')])
s = pd.Series(np.arange(2.0, 6.0), index=index)
print(s)
Output
df = s.unstack(level=0)
df.unstack()

Output:
Transformation.
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})      # example frame (inferred from the outputs shown below)

df.sum()

Output:
key     ABCABC
data        15
dtype: object

df.mean()

Output:
data    2.5

df.groupby('key').transform(lambda x: x - x.mean())

Output:
   data
0  -1.5
1  -1.5
2  -1.5
3   1.5
4   1.5
5   1.5