IF2106 – Data Engineering
Data Exploring and Analysis (2)
• Statistical Analysis
• Data Grouping
Undergraduate
Computer Science
Overview
• Learn how to statistically analyze grouped data, iterate through
groups, and apply aggregations, transformations, and filtration
techniques
Objectives
Upon completion of this Unit, you are expected to be able to:
• Properly perform and practice data exploration and analysis
techniques
Contents
a. Statistical Analysis
b. Data Grouping
Statistical Analysis
Data Analysis
• Pandas provides
numerous methods for
data analysis
• Also, you can define
your own methods for
specific statistical
analysis
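As a minimal sketch of defining your own statistic, the example below computes the range (maximum minus minimum) of each column. The data frame, its column names, and the helper value_range are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 65.0, 80.0],
})


def value_range(column: pd.Series) -> float:
    """Return the spread (max minus min) of one column."""
    return column.max() - column.min()


# DataFrame.apply calls value_range once per column.
ranges = df.apply(value_range)
print(ranges)
```

Here ranges holds one value per column: 20.0 for height and 30.0 for weight.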
Statistical Analysis
• [Link](): Summary statistics for numerical columns
• [Link](): Returns the mean of all columns
• [Link](): Returns the correlation between columns in a data frame
• [Link](): Returns the number of non-null values in each data frame column
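The method names on this slide were lost and appear only as [Link](). The sketch below assumes they correspond to the pandas methods describe(), mean(), corr(), and count(), which match the four descriptions. The data frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with one missing score.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [52.0, 61.0, None, 79.0],
})

summary = df.describe()    # count, mean, std, min, quartiles, max per numeric column
means = df.mean()          # column means; missing values are skipped by default
correlations = df.corr()   # pairwise Pearson correlation matrix
non_null = df.count()      # number of non-null values per column

print(summary)
print(means)
print(correlations)
print(non_null)
```

Because one score is missing, count() reports 4 for hours and 3 for score, and the mean of score is taken over the three available values.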
Statistical Analysis (Cont.)
• The correlation coefficient is a measure that determines the degree to which two variables’ movements are associated
• The most common correlation coefficient, generated by the Pearson correlation, may be used to measure the linear relationship between two variables
• However, in a nonlinear relationship, this correlation coefficient may not always be a suitable measure of dependence
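The limitation for nonlinear relationships can be illustrated with a small sketch. The data below are hypothetical: one column depends linearly on x, the other is x squared, so it is fully determined by x yet not linearly related to it.

```python
import pandas as pd

# Hypothetical data: x is symmetric around zero.
df = pd.DataFrame({"x": [-2.0, -1.0, 0.0, 1.0, 2.0]})
df["linear"] = 3 * df["x"] + 1     # exact linear dependence on x
df["quadratic"] = df["x"] ** 2     # exact, but nonlinear, dependence on x

# Series.corr uses the Pearson method by default.
r_linear = df["x"].corr(df["linear"])
r_quadratic = df["x"].corr(df["quadratic"])

print(r_linear, r_quadratic)
```

The Pearson coefficient is 1 (up to floating-point rounding) for the linear column and 0 for the quadratic one, even though the quadratic column depends entirely on x.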
Statistical Analysis (Cont.)
• The range of values for the correlation coefficient is -1.0 to 1.0
• In other words, the values cannot exceed 1.0 or be less than -1.0; a correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation
• The correlation coefficient is denoted as r
• If its value is greater than zero, the relationship is positive; if its value is less than zero, the relationship is negative
• A value of zero indicates that there is no linear relationship between the two variables
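The two ends of the range can be sketched with hypothetical columns, one that rises in step with x and one that falls in step with it.

```python
import pandas as pd

# Hypothetical data: "up" rises with x, "down" falls as x rises.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "up": [10.0, 20.0, 30.0, 40.0],
    "down": [8.0, 6.0, 4.0, 2.0],
})

r = df.corr()  # Pearson correlation matrix

print(r.loc["x", "up"])    # perfect positive correlation
print(r.loc["x", "down"])  # perfect negative correlation
```

Up to floating-point rounding, the first value is 1.0 and the second is -1.0, and no entry of the matrix falls outside that range.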
Statistical Analysis (Cont.)
• [Link](): Returns the highest value in each column
• [Link](): Returns the lowest value in each column
• [Link](): Returns the median of each column
• [Link](): Returns the standard deviation of each column
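As with the earlier slide, the method names here survive only as [Link](). The sketch below assumes they are the pandas methods max(), min(), median(), and std(). The data frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 10.0, 20.0, 40.0],
})

highest = df.max()     # highest value in each column
lowest = df.min()      # lowest value in each column
middle = df.median()   # median of each column
spread = df.std()      # sample standard deviation (ddof=1 by default)

print(highest, lowest, middle, spread, sep="\n")
```

Note that pandas computes the sample standard deviation by default, dividing by n - 1 rather than n.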
Data Grouping
• You can split data into groups to perform more specific analysis over the data set
• Once you perform data grouping, you can compute summary statistics (aggregation), perform specific group operations (transformation), and discard data with some conditions (filtration)
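A minimal sketch of the splitting step, assuming a hypothetical data frame with a team column to group on:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "points": [10, 7, 12, 9, 3],
})

# Split the rows into one group per distinct team value.
grouped = df.groupby("team")

group_sizes = grouped.size()   # number of rows in each group
print(grouped.ngroups)         # number of groups
print(group_sizes)
```

The groupby object does not compute anything by itself; it records how the rows are split so that aggregation, transformation, or filtration can be applied afterwards.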
Iterating Through Groups
• You can iterate through the groups of a groupby object; each iteration yields the group name and the rows that belong to that group
• You can also select a specific group using the get_group() method
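Both points can be sketched as follows. The data frame and the totals dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "points": [10, 7, 12, 9, 3],
})
grouped = df.groupby("team")

# Iterating yields (group name, sub-frame) pairs, sorted by group name.
totals = {}
for name, rows in grouped:
    totals[name] = int(rows["points"].sum())
print(totals)

# get_group returns the rows of a single named group.
red = grouped.get_group("red")
print(red)
```

The loop visits the groups blue, green, and red in that order, and get_group("red") returns the two rows whose team is red.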
Aggregations
• Aggregation functions return a single aggregated value for each group
• Once the groupby object is created, you can apply various aggregation functions to the grouped data
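A short sketch of aggregation on hypothetical data, first with a single function and then with several at once through agg():

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "points": [10, 7, 12, 9, 3],
})
grouped = df.groupby("team")["points"]

# One aggregated value per group.
mean_points = grouped.mean()

# Several aggregations at once; the result has one row per group.
stats = grouped.agg(["sum", "mean", "max"])

print(mean_points)
print(stats)
```

Each group collapses to a single row: the result of agg() has three rows (blue, green, red) and one column per requested function.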
Transformations
• Transformation on a group or a column returns an object that is indexed the same way, and therefore has the same size, as the one being grouped
• Thus, the transform function should return a result that is the same size as the group chunk it receives
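The size-preserving behavior can be sketched by centering each value on its group mean. The data frame and the new column names team_mean and centered are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "points": [10, 7, 12, 9, 3],
})

# transform returns one value per original row, aligned with df's index.
df["team_mean"] = df.groupby("team")["points"].transform("mean")

# Because the result is aligned, it can be combined with the original column.
df["centered"] = df["points"] - df["team_mean"]

print(df)
```

Unlike an aggregation, which would return three values here, the transformed column has five values, one for every row of the original data frame.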
Filtration
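• Filtration discards data based on a condition; in pandas, a grouped object can be filtered directly so that only the groups satisfying the condition are kept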
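The slide does not show which filtering call it has in mind, so the sketch below covers two common pandas options on hypothetical data: a boolean mask that filters individual rows, and the filter() method of a groupby object that keeps or drops whole groups.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "points": [10, 7, 12, 9, 3],
})

# Row-level filtering with a boolean mask.
high_rows = df[df["points"] > 8]

# Group-level filtering: keep only teams whose total is at least 10.
kept = df.groupby("team").filter(lambda g: g["points"].sum() >= 10)

print(high_rows)
print(kept)
```

The mask keeps three rows (10, 12, and 9 points), while the group filter drops every row of the green team because its total of 3 is below the threshold.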
Summary
This Unit covered how to explore and analyze data in different collection structures. Here’s a recap of what was covered in this Unit:
• How to apply statistical analysis to data obtained through Python data grouping, iterating through groups, aggregations, transformations, and filtration techniques
Discussion