10 MUST KNOW
DATA ANALYTICS
TIPS
1/ FIX COLUMN TYPES BEFORE
MERGING
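In the example, one ID column has type int and the other has type object.
Make sure to convert both IDs to string type before merging. pandas can
refuse to merge keys of different types, or fail to match them.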
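A minimal sketch of the fix, assuming pandas. The tables `artists` and `artworks` and the key column `artist_id` are hypothetical names made up for illustration:

```python
import pandas as pd

# Hypothetical tables: the ID is stored as int in one and as text in the other.
artists = pd.DataFrame({"artist_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
artworks = pd.DataFrame({"artist_id": ["1", "2", "2"], "title": ["A", "B", "C"]})

# Inspect the key types first; they differ here.
print(artists["artist_id"].dtype, artworks["artist_id"].dtype)

# Cast both keys to string so the merge compares like with like.
artists["artist_id"] = artists["artist_id"].astype(str)
artworks["artist_id"] = artworks["artist_id"].astype(str)

merged = artists.merge(artworks, on="artist_id", how="inner")
print(merged)
```

Casting both sides, even the one that already looks right, keeps the step safe to rerun on new data.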
Anas Riad
2/ USE .INFO() AND .DESCRIBE()
EARLY
These simple methods help you catch incorrect data types and missing
values, and get a feel for distribution and cardinality.
.info(): quickly shows the missing data and data type information,
including all the data types present and the memory usage.
.describe(): shows the summary statistics of the data.
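As a rough illustration of both calls, assuming pandas and a small made-up DataFrame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing text value and one missing number.
df = pd.DataFrame({
    "artist": ["Ana", "Ben", None, "Cy"],
    "year": [1990, 2001, 1985, 2010],
    "price": [120.0, np.nan, 75.5, 300.0],
})

# Prints column dtypes, non-null counts and memory usage.
df.info()

# include="all" adds unique/top/freq for text columns next to the
# numeric statistics, which gives a first look at cardinality.
summary = df.describe(include="all")
print(summary)
```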
Cornellius Yudha Wijaya
3/ DEAL WITH MISSING
VALUES
1269 missing values in the ‘Artist’ column.
Make sure to fix missing values. For a numerical column, you can
fill with the mean or median. For a categorical column, you can fill with
the most frequent category or with ‘unknown’.
Missing artists filled with ‘unknown’.
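One way this could look in pandas, on a small made-up table (the columns `Nationality` and `Price` are assumptions added to show the numerical and categorical cases):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Artist": ["Ana", None, "Ben", "Ana", None],
    "Nationality": ["US", "FR", None, "US", "US"],
    "Price": [100.0, np.nan, 250.0, 80.0, 120.0],
})

# Count missing values per column before deciding how to fill them.
print(df.isna().sum())

# Numerical column: fill with the median (the mean works the same way).
df["Price"] = df["Price"].fillna(df["Price"].median())

# Categorical column: fill with the most frequent category...
df["Nationality"] = df["Nationality"].fillna(df["Nationality"].mode()[0])

# ...or with an explicit placeholder.
df["Artist"] = df["Artist"].fillna("unknown")
```

The median is usually the safer default when the column has outliers, since the mean is pulled toward them.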
Anas Riad
4/ USE .ASSIGN() FOR METHOD
CHAINING IN EDA
Make data transformation pipelines more readable and compact with
method chaining.
A chain of four different methods.
The output after all the method chaining.
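A sketch of such a chain, assuming pandas. The `sales` table and its columns are hypothetical, and the particular methods chained here are one possible choice, not necessarily the ones in the original example:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "S", "N", "S", "E"],
    "units": [10, 5, 3, 8, 0],
    "unit_price": [2.0, 4.0, 2.5, 1.0, 9.0],
})

result = (
    sales
    .query("units > 0")                                      # drop empty orders
    .assign(revenue=lambda d: d["units"] * d["unit_price"])  # derive a column
    .groupby("region", as_index=False)["revenue"].sum()      # aggregate
    .sort_values("revenue", ascending=False)                 # rank
)
print(result)
```

Passing a lambda to `.assign()` lets the new column refer to the intermediate result of the chain rather than to the original `sales` table.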
Cornellius Yudha Wijaya
5/ GROUP BY
SUMMARY
Choose the column to group by and the column to aggregate.
Aggregation: sum, count, mean...
Show the top 10.
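The three steps could be sketched as follows, assuming pandas and a made-up `artists` table (the column names are hypothetical):

```python
import pandas as pd

artists = pd.DataFrame({
    "nationality": ["US", "FR", "US", "JP", "FR", "US"],
    "artworks": [3, 1, 4, 2, 5, 1],
})

summary = (
    artists
    .groupby("nationality")["artworks"]   # column to group by, column to aggregate
    .agg(["sum", "count", "mean"])        # several aggregations at once
    .sort_values("sum", ascending=False)
    .head(10)                             # keep the top 10 groups
)
print(summary)
```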
Anas Riad
6/ PIVOT TABLES FOR
MULTIDIMENSIONAL SUMMARIES
Multi-index pivot tables help explore data from different perspectives
quickly.
Column for the overall data values
Columns for the table index (two levels)
Statistical function to perform
Add “All” rows
The resulting pivot table
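A minimal sketch with `pandas.pivot_table`, mapping the annotations above onto its `values`, `index`, `aggfunc` and `margins` parameters. The `sales` table and its columns are made up for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [10.0, 20.0, 30.0, 50.0, 40.0],
})

pivot = pd.pivot_table(
    sales,
    values="revenue",             # column for the data values
    index=["region", "product"],  # two index columns give a multi-index
    aggfunc="mean",               # statistical function to perform
    margins=True,                 # add the "All" row
)
print(pivot)
```

Moving one of the index columns to the `columns` parameter gives the same numbers as a wide table, which can be easier to scan.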
Cornellius Yudha Wijaya
7/ SIMPLE BAR CHART WITH
PLOTLY
First, define what you are going to
visualise: in this case, the top 10
nationalities by artist count.
Second, build a simple bar
chart with plotly from the
variable created in the first step
(the top 10 nationalities).
Anas Riad
8/ APPLY LOG TRANSFORM FOR
SKEWED DATA
Highly skewed numeric data (e.g. income, spending) can distort
analysis. Apply a log or Box-Cox transform to reduce the skew.
The transformed distribution is typically much closer to normal.
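A small sketch of the log variant, assuming NumPy and pandas and using synthetic right-skewed data in place of a real income column:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "income" values (log-normal by construction).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000), name="income")

# log1p computes log(1 + x), so zeros do not break the transform.
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Comparing the skewness before and after is a quick check that the transform helped. A plain log only applies to non-negative data, and Box-Cox requires strictly positive values.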
Cornellius Yudha Wijaya
9/ TIME-BASED ANALYSIS
Make sure date columns are in
‘datetime’ format first.
Get the values you want to
visualise as a time series.
PS: Limited to the top 10
here to avoid a super long list.
Use [Link] in plotly to plot a
line chart (time series).
What can you observe?
Investigate?
Down trend?
Anas Riad
10/ DETECT AND VISUALIZE
CORRELATION CLUSTERS
Use correlation matrices + clustering to group correlated features and
detect redundancy or multicollinearity.
Dendrogram (top and left): clustering of features based on similarity
in correlation. Features close together in the dendrogram are more
similar than the others.
Heatmap (center): shows the Pearson correlation coefficient between
each pair of features.
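A chart with this layout can be produced with seaborn's `clustermap`. The sketch below assumes that is an acceptable tool here and uses synthetic features with hypothetical names, two of which are built to be strongly correlated:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic features: "height" and "weight" share a common signal.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "height": base + rng.normal(scale=0.1, size=200),
    "weight": base + rng.normal(scale=0.1, size=200),
    "age": rng.normal(size=200),
    "income": rng.normal(size=200),
})

# Pearson correlation between each pair of features.
corr = df.corr(method="pearson")

# Heatmap in the center, dendrograms on top and left.
grid = sns.clustermap(corr, cmap="vlag", vmin=-1, vmax=1, annot=True)

# Feature order after clustering; correlated features end up adjacent.
order = [corr.index[i] for i in grid.dendrogram_row.reordered_ind]
print(order)
```

By default `clustermap` clusters on the Euclidean distance between rows of the matrix it is given, so features are grouped by how similar their correlation profiles are. Features that land in the same tight cluster are candidates for redundancy or multicollinearity.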
Cornellius Yudha Wijaya
WHICH TIP DID YOU
FIND THE MOST
INTERESTING?
COMMENT BELOW
FOLLOW ANAS AND CORNELLIUS
FOR MORE CONTENT ON DATA
ANALYTICS AND DATA SCIENCE.