Exploratory Data Analysis
University Rankings Dataset
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue ETL
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Extract, Transform, and Load
Extract Transform Load
• Merge
• Convert data type
Data Data • Handle missing
and incorrect Data Destination
Source Source
values
• Organize data
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue ETL
Fully managed ETL service to run jobs
Serverless Apache Spark
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
DataFrame - Python and R
DataFrame • Python (pandas) and R widely
used for data transformation
analysis
• Load to a DataFrame
• Fast local access
Computer memory • Suitable for small, medium
datasets
• Single machine analysis
Database File Share
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Apache Spark DataFrame
• Spark DataFrame is spread
across many machines
Spark DataFrame • Spark ETL - Code is moved
to the server where data
is located
• Python and R DataFrame –
Data is moved to the
server where ETL code is
located
• Spark is efficient with
Server 1 Server 2 Server 3
large datasets
Cluster
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue DynamicFrame
• Glue DynamicFrame is
similar to a Spark
Glue DynamicFrame DataFrame
• Optimized for ETL jobs
• Glue Catalog Integration
• Easy to read and write
from AWS Data sources and
destinations
• Support for mixed data
types in a field
Server 1 Server 2 Server 3
Cluster
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Mixed Data Type
Spark DataFrame stores Quantity as a string
Quantity type
25 Quantity: str
“Three”
Glue DynamicFrame stores Quantity as a
union type
Quantity: [str, integer]
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
DynamicFrame - Handling Data Type Inconsistencies
• Cast column to a single data type
• Make new columns – one for each data type
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
DynamicFrame - Cast to a single Type
Price Price
Cast to float
100.00 100.00
“Three” NULL
Cast to a single type
• Only valid values are converted
• Others are stored as NULL
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
DynamicFrame - Make Column
Price Price_float Price_string
Make column
100.00 100.00 NULL
“Three” NULL “Three”
Create new columns – one for each type
Value for a row is stored in one of the columns and NULL in
other columns
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
DynamicFrame and Spark DataFrame
toDF
Glue DynamicFrame Spark DataFrame
fromDF
Glue DynamicFrame Spark DataFrame
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue ETL Bookmark
Track processed data
Persist state between job runs
Prevent reprocessing of old data and process
only new data
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Bookmarks – Identifying New Data
S3 Data Sources
• Last modified time of Object
JDBC Data Sources
• Primary key of a table is used as Bookmark key
(default)
• Optional: Specify columns to use as Bookmark key
• Bookmark key needs to increase or decrease over time
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Bookmark Configuration
State Description
Enabled Glue persists state information. The initial job execution
handles all data, while subsequent runs process only new
data
Disabled Job processes the entire dataset during every job run
Paused Job processes incremental data since the last successful
run. However, new bookmark state is not updated
If your job is skipping data, verify bookmark configuration and reset them if
necessary
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Transformation Context
Transformation Context - Optional string parameter in Glue
DynamicFrame methods
If specified - persist state for that operation
If not specified – state is not tracked for that operation
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue Job Types
Job Purpose
Spark Jobs ETL Jobs run on serverless Spark Cluster. Suitable for
Batch processing
Streaming ETL ETL Jobs for continuous processing of data from sources
like Kinesis and Kafka. Load data to a data lake or
database. Process and deliver data in minutes
Python Shell Jobs For ETL Jobs that do not require Apache Spark
Ray Ray.io is a new framework for Machine Learning workflows
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Summary
Technology Purpose
Python DataFrame Small to medium datasets
(Pandas) Single machine processing
Apache Spark DataFrame Distributed architecture and in-memory
processing
Very large datasets
Glue DynamicFrame Optimized for ETL workflows
Bookmarking support
Use Glue DynamicFrame for AWS integration touchpoints and Spark
DataFrame for data processing and transformation
Pure Spark code is portable to other environments
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – Iris CSV to Parquet
• Lower storage and ETL Job
query costs
Read Write
• Improve query CSV Parquet
performance
• Visual Editor to Glue Catalog Glue Catalog
build Glue ETL job
• Serverless Apache S3 S3
Spark
CSV Source Parquet Destination
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
University Ranking Dataset - Objectives
Use Data Lake for Querying
Answer Common Queries
• Top Universities in the World
• Regional Ranking
• Differences between Public and Private Universities
• Number of Universities
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
What you will learn
▪ Athena
▪ Lazy CSV SerDe
▪ Open CSV SerDe
▪ Handle Data Quality Issues
▪ SQL – Extract, Load, Transform
▪ Apache Spark – Extract, Transform, Load
▪ Views to Simplify Querying
▪ Visualize with Amazon QuickSight
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – University Ranking (Lazy SerDe)
• Glue crawler to catalog University Rankings dataset
• Query with Athena
Crawler
QuickSight Athena Glue Catalog
N. Virginia S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – University Ranking (Open CSV SerDe)
• Ensure data in CSV is correctly mapped to columns in
the table
• Handle double quotes and embedded comma
Create
Table
Athena Glue Catalog
S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – Data Cleanup
• Convert from String type to correct data type
• Handle missing value and error conditions
SQL
Create
Table
Athena Glue Catalog
S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Why Correct Data Type is Important
String type and Numeric comparison – Don’t work!
Numeric comparison
1005 is greater than 5
String comparison
“1005” is less than “5”
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Function Recap
SPLIT to handle ranks with range of values
(like 401-410)
CAST for data type conversion
TRY for handling exceptions
COALESCE to replace NULL values
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – Simplify Querying with Views
• Hide complexity of Query with Views
Create
SQL Athena
View
Create
Table
Glue Catalog
S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Explore Data
• Top-N universities in the World
• Regional Ranking
• Country-specific Ranking
• Distribution
View hid the complex data cleanup logic and
allowed us to explore data
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Benefits of a Data Lake
• Take source data as-is and enable SQL
querying
• Quickly build prototypes and proof-of-
concepts
• Extract, Load and Transform - Cleanup at
query time
• Query time cleanup can be inefficient
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – Apache Spark ETL
• Query time cleanup is inefficient
• Optimize Solution – Clean up first and then load to a
data lake
• Improve performance and lower querying costs
Extract Transform Load
Apache Spark Serverless – Glue
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Object Oriented Interaction
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
SQL-based Interaction
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – QuickSight Business Intelligence
• Visually analyze data using plots, pivot tables
• No-code
QuickSight Athena
Table
Glue Catalog
S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
QuickSight Analysis
• Highest Rank
• University Distribution
• Top-N universities in the World
• Regional Ranking
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – QuickSight Business Intelligence
• Visually analyze data using plots, pivot tables
• No-code
QuickSight Athena
N. Virginia
Table
Glue Catalog
N. Virginia S3
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Instructor, Course Developer
7X AWS Certified
For a list of courses, visit
https://www.computewithcloud.com/
Connect with me on LinkedIn
https://www.linkedin.com/in/chandralingam/
Chandra Lingam
100K+ Students
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.