AZURE DATA ENGINEERING – ADF & ADB
Day-1
*****
Components of the data engineering process.
What is on-premises?
What is cloud computing?
Different types of cloud computing (public, private, hybrid).
Types of services in cloud computing (IaaS, PaaS, SaaS).
Azure portal walkthrough.
Microsoft Entra ID (formerly Azure Active Directory).
Subscription (free trial).
Resource group
Resources.
Day-2
*****
Create a storage account.
Deep understanding of storage accounts: Blob Storage & ADLS Gen2.
Connect via Storage Explorer.
Day-3
*****
Authentication methods.
Account key.
SAS (Shared Access Signature).
Service principal.
Managed identity.
Understanding of ACL (Access Control List).
Built-in roles, custom roles.
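The authentication methods above can all be exercised from Python. A minimal sketch using the azure-storage-blob and azure-identity packages, where every <...> value is a placeholder for your own account, key, and app registration:

```python
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://<account>.blob.core.windows.net"

# Account key: full access to the whole storage account.
svc_key = BlobServiceClient(account_url, credential="<account-key>")

# SAS token: scoped, time-limited access.
svc_sas = BlobServiceClient(account_url, credential="<sas-token>")

# Service principal: an Entra ID app with an RBAC role on the account.
cred = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
svc_sp = BlobServiceClient(account_url, credential=cred)

for container in svc_sp.list_containers():
    print(container.name)
```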
Day-4 & Day-5 & Day-6
************
Prerequisites for data engineering: SQL and Python.
Create tables, SELECT, CASE, GROUP BY, joins, window functions, pivot, cube, rollup, and other built-in functions (see the sketch at the end of this block).
UPDATE, DELETE, INSERT operations.
Indexes: clustered index, non-clustered index, columnstore index.
Primary key, foreign key
Stored procedures
Connect via Azure Data Studio.
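As a taste of the window-function material, a hedged sketch that runs one such query from Python via pyodbc against an Azure SQL database; the connection values and the dbo.Employees table are illustrative, not from the course:

```python
import pyodbc

# Connect the same way Azure Data Studio does, via the SQL Server ODBC driver.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<db>;"
    "Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

# ROW_NUMBER() is a window function: rank salaries within each department.
sql = """
SELECT  EmployeeID,
        Department,
        Salary,
        ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rn
FROM    dbo.Employees;
"""

for row in conn.cursor().execute(sql):
    print(row.EmployeeID, row.Department, row.Salary, row.rn)
```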
Day-7
*****
ADF (Azure Data Factory) - walkthrough of all the options in Data Factory.
What is integration runtime? Different types and uses.
What are linked services and how to create them using various methods.
What is a dataset and how to create it in different ways.
Different types of activities available in ADF.
Practical exercises on simple copy activities between databases, from databases to Azure Data Lake Storage (ADLS), and from ADLS to ADLS.
Day-8
*****
Deep dive into the copy activity: understanding all the available tabs and options, plus pipeline optimization.
Day-9
*****
Understand the importance of parameterizing pipelines.
Parameterize the pipeline, dataset, and linked services.
Redo the copy activity with parameters and demonstrate the power of parameterization.
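One way to demonstrate that power programmatically: a sketch (assuming the azure-mgmt-datafactory SDK) that triggers a parameterized pipeline run and passes values at runtime; the resource names, pipeline name, and parameter names are all placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The same pipeline can serve many sources/sinks once it is parameterized.
run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    pipeline_name="pl_copy_generic",
    parameters={"sourceContainer": "raw", "sinkContainer": "curated"},
)
print(run.run_id)
```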
Day-10
*****
Explore the ForEach, If Condition, Switch, Until, Execute Pipeline, Validation, Filter, Set Variable, Append Variable, and Delete activities.
Day-11
*****
Work with activities like the Stored Procedure, Lookup, and Get Metadata activities.
Day-12
******
Web activity and Webhook activity.
Logic Apps.
Day-13
*******
Explore triggers in ADF, including schedule triggers, tumbling window triggers, and event-based triggers.
Day-14
******
Understand full load and incremental data loading and various methods to achieve it.
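A minimal watermark-based incremental load sketch in PySpark (assuming a Databricks notebook where spark is predefined; table and column names are illustrative):

```python
from pyspark.sql import functions as F

# 1) Highest watermark already loaded into the target table.
last_ts = (
    spark.table("silver.orders")
         .agg(F.max("modified_ts").alias("wm"))
         .collect()[0]["wm"]
)

# 2) Pull only rows changed since the last run (a full load skips the filter).
src = spark.table("bronze.orders")
inc = src.filter(F.col("modified_ts") > last_ts) if last_ts else src

# 3) Append the delta to the target.
inc.write.mode("append").saveAsTable("silver.orders")
```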
Day-15
******
Metadata-driven pipelines.
Azure Key Vault.
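A sketch of reading secrets, assuming a Key Vault-backed Databricks secret scope named kv-scope and, outside Databricks, the azure-keyvault-secrets SDK; the vault and secret names are placeholders:

```python
# Inside Databricks: read from a Key Vault-backed secret scope.
password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

# Outside Databricks: the azure-keyvault-secrets SDK does the same job.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

kv = SecretClient("https://<vault-name>.vault.azure.net", DefaultAzureCredential())
password = kv.get_secret("sql-password").value
```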
Day-16
******
Explore the Notebook activity and how to call Databricks notebooks.
Walkthrough of Databricks tools and all available options.
What is a workspace?
What is a metastore?
What is a catalog?
What are tables, views, and volumes?
What is Unity Catalog?
What is a cluster, how to create one, and the different options in cluster configuration.
Day-17
******
Understand DBFS (Databricks File System) and mounting, including different mounting methods (using account key, SAS, service principal).
Understand dbutils (Databricks Utilities).
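A hedged mounting sketch using an account key pulled from a secret scope; the storage account, container, mount point, and scope name are placeholders (SAS and service principal use the same call with different extra_configs keys):

```python
# Mount a blob container onto DBFS so it looks like a local path.
dbutils.fs.mount(
    source="wasbs://<container>@<account>.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.<account>.blob.core.windows.net":
            dbutils.secrets.get(scope="kv-scope", key="storage-key")
    },
)

# dbutils.fs also lists, copies, and removes files.
display(dbutils.fs.ls("/mnt/raw"))
```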
Day-18
******
Introduction to Python.
Understand Python data types theoretically: string, int, list, tuple, set, dictionary.
Control flow: if conditions, while loops, for loops.
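A tiny sketch of the types and control flow above; all values are arbitrary:

```python
# One example of each core type.
name: str = "adls"
count: int = 3
items: list = ["bronze", "silver", "gold"]
point: tuple = (1, 2)
tags: set = {"adf", "adb"}
config: dict = {"env": "dev"}

# if condition and for loop.
if count > 0:
    for layer in items:
        print(layer)

# while loop.
while count > 0:
    count -= 1
```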
Day 19
******
Lists, list-related methods, list comprehensions.
Functions: How to create functions, parameterization of functions.
Lambda functions
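The day's topics in one small sketch; names are illustrative:

```python
layers = ["bronze", "silver"]
layers.append("gold")                 # a list method

squares = [n * n for n in range(5)]   # a list comprehension

def greet(name, punctuation="!"):     # parameterized function with a default
    return f"Hello {name}{punctuation}"

shout = lambda s: s.upper()           # a lambda function

print(greet("Databricks"), shout("adf"))
```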
Day 20
*******
How to pass a function as a parameter to another function.
Python built-in functions: map, reduce, filter, and so on.
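A sketch of both topics: a function passed as a parameter, then map, filter, and reduce (which lives in functools in Python 3):

```python
from functools import reduce

def apply_twice(fn, value):          # a function that takes another function
    return fn(fn(value))

print(apply_twice(lambda n: n + 10, 5))           # 25

nums = [1, 2, 3, 4]
print(list(map(lambda n: n * 2, nums)))           # [2, 4, 6, 8]
print(list(filter(lambda n: n % 2 == 0, nums)))   # [2, 4]
print(reduce(lambda a, b: a + b, nums))           # 10
```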
Day-21
******
Tuple, set, dictionary – methods.
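A few of the methods this day covers; values are arbitrary:

```python
t = ("a", "b", "a")
print(t.count("a"), t.index("b"))      # tuple: count, index

s = {1, 2}
s.add(3)                               # set: add, discard, union
s.discard(1)
print(s.union({4}))

d = {"env": "dev"}
d.update({"region": "eastus"})         # dict: update, get, keys
print(d.get("env"), list(d.keys()))
```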
Day-21 (continued)
******
Explore serialization and deserialization.
In-depth understanding of different big data file formats: Parquet, Avro, ORC, CSV, JSON, Delta.
Day-22
******
Learn how to read different file formats using PySpark.
Write data to different file formats.
Explore options for each file format.
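A hedged read/write sketch covering some of the formats and options above; the paths are placeholders and assume a mounted lake:

```python
# Read CSV with header and schema inference options.
df_csv = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/mnt/raw/orders.csv"))

# Read multi-line JSON.
df_json = (spark.read.format("json")
           .option("multiLine", "true")
           .load("/mnt/raw/orders.json"))

# Write the same data out as Parquet and as Delta.
df_csv.write.mode("overwrite").format("parquet").save("/mnt/curated/orders_parquet")
df_csv.write.mode("overwrite").format("delta").save("/mnt/curated/orders_delta")
```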
Day-23 & Day-24 & Day-25
******
Deep dive into PySpark functions.
Widgets.
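A sketch combining a widget with a few common PySpark functions; the widget name and the table are illustrative:

```python
from pyspark.sql import functions as F

dbutils.widgets.text("env", "dev")     # create a text widget with a default
env = dbutils.widgets.get("env")       # read its current value

df = spark.table(f"{env}.sales")
out = (df.withColumn("year", F.year("order_date"))
         .withColumn("tier", F.when(F.col("amount") > 1000, "high").otherwise("low"))
         .groupBy("year", "tier")
         .agg(F.sum("amount").alias("total")))
display(out)
```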
Day-26
******
Understand RDD (Resilient Distributed Dataset) and a few important RDD functions.
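A short RDD word-count sketch (spark is the notebook's SparkSession, sc its SparkContext):

```python
sc = spark.sparkContext

rdd = sc.parallelize(["a b", "b c", "a c"])
counts = (rdd.flatMap(lambda line: line.split())   # transformation
             .map(lambda word: (word, 1))
             .reduceByKey(lambda x, y: x + y))     # wide transformation
print(counts.collect())                            # action triggers execution
```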
Day-27 & Day-28
**************
Explore lakehouse architecture.
Practical understanding of Parquet and why the Delta format is chosen in Databricks.
What is Delta Lake - theory.
Coding Delta Lake in SQL and PySpark.
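A minimal Delta Lake sketch touching both the PySpark and SQL sides; the path and table name are placeholders:

```python
from delta.tables import DeltaTable

df = spark.range(5).withColumnRenamed("id", "key")
df.write.format("delta").mode("overwrite").save("/mnt/demo/delta_tbl")

# Upsert with the DeltaTable API.
tgt = DeltaTable.forPath(spark, "/mnt/demo/delta_tbl")
(tgt.alias("t")
    .merge(df.alias("s"), "t.key = s.key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# The same ideas from SQL, including Delta's transaction history.
spark.sql("CREATE TABLE IF NOT EXISTS demo_delta USING DELTA LOCATION '/mnt/demo/delta_tbl'")
spark.sql("DESCRIBE HISTORY demo_delta").show()
```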
Day 29
******
What is MapReduce.
Brief understanding of HDFS architecture.
Why Hive came into the picture.
Unity Catalog.
What is a metastore and a catalog.
Managed table vs. external table.
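The managed vs. external distinction, sketched via spark.sql; the external location is a placeholder:

```python
# Managed: the metastore owns both metadata and data files;
# DROP TABLE removes everything.
spark.sql("CREATE TABLE managed_orders (id INT, amount DOUBLE)")

# External: the table points at files you own; DROP TABLE
# removes only the metadata and keeps the files.
spark.sql("""
CREATE TABLE external_orders (id INT, amount DOUBLE)
USING DELTA LOCATION 'abfss://curated@<account>.dfs.core.windows.net/orders'
""")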
Day-30 & Day-31 & Day-32
***********************
In-depth understanding of Spark architecture, covering lazy evaluation, fault tolerance, DAG (Directed Acyclic Graph), lineage, and checkpointing.
Wide and narrow transformations.
Types of clusters and modes of clusters in Databricks.
What are auto-scaling and jobs.
Catalyst optimizer.
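Lazy evaluation plus narrow vs. wide transformations in one sketch; the table is illustrative:

```python
df = spark.table("silver.orders")

narrow = df.filter("amount > 0")                # narrow: no shuffle needed
wide = narrow.groupBy("customer_id").count()    # wide: shuffles across partitions

# Nothing has executed yet; transformations only build the DAG.
# The action below triggers the whole plan.
wide.show()
```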
Day-33
*******
Practical understanding of concepts like cache, persist, broadcast, accumulator, and df.explain.
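A quick tour of those concepts in one hedged sketch; the tables and column are illustrative:

```python
from pyspark.sql import functions as F
from pyspark import StorageLevel

df = spark.table("silver.orders")
df.persist(StorageLevel.MEMORY_AND_DISK)   # df.cache() uses this same default level

dims = spark.table("silver.customers")
joined = df.join(F.broadcast(dims), "customer_id")   # broadcast the small side
joined.explain()                                     # inspect the physical plan

nulls = spark.sparkContext.accumulator(0)            # driver-side counter
df.foreach(lambda r: nulls.add(1) if r["amount"] is None else None)
print(nulls.value)

df.unpersist()
```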
Day-34
******
Spark job debugging.
Medallion architecture.
Workflows.
Day 35
******
Delta Live Tables & Unity Catalog.
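A hedged Delta Live Tables sketch: DLT notebooks declare tables with the dlt decorators, and the landing path is a placeholder:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders loaded from the landing zone")
def bronze_orders():
    return spark.read.format("json").load("/mnt/landing/orders")

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # data quality expectation
def silver_orders():
    return dlt.read("bronze_orders").withColumn("loaded_at", F.current_timestamp())
```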
Day-36
******
Data modelling at a high level (conceptual, logical & physical data models; facts & dimensions; star & snowflake schemas; normalization and denormalization).
Day-37
******
Data flows in ADF
Day-38
******
SCD2 implementation in ADB & ADF (see the merge sketch after this block).
Grouping all the different performance improvement techniques in Spark, Delta, and Databricks that we discussed in previous classes.
Topics include cache, persist, partitioning, bucketing, and optimization.
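A compressed SCD2 sketch with a Delta merge, not the course's exact implementation: expire the changed current rows, then append the new versions; tables and columns are illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.table("staging.customers")
dim = DeltaTable.forName(spark, "gold.dim_customers")

# Step 1: close out current rows whose tracked attribute changed.
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new versions as current rows
# (a full implementation would insert only the changed keys).
new_rows = (updates.withColumn("is_current", F.lit(True))
                   .withColumn("start_date", F.current_date())
                   .withColumn("end_date", F.lit(None).cast("date")))
new_rows.write.mode("append").saveAsTable("gold.dim_customers")
```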
Day-39
******
GIT configuration in Databricks and ADF.
Creating branches, understanding the main branch, feature/common branches, developer branches.
Day – 40
********
Resume building, Important questions discussion.
Day-41 – Day-45 (one weekend, Saturday & Sunday, 4-5 hours)
*************
Interview questions preparation in SQL, Python, PySpark.
Agile process
CICD pipelines using Azure DevOps or GitHub Actions.
Real-time project flow.