
End-to-End Cloud Data Pipeline for COVID-19 Analysis

1- Introduction.

In this project, we demonstrate an end-to-end data engineering pipeline built on data from the COVID-19 epidemic. We use Azure Data Factory for data ingestion, Azure Data Lake Gen2 for storage, Azure Databricks for data transformation, Azure Synapse for modeling, and Power BI for visualization.

2- Project Planning and Definition.

In this project, we will use the following architecture, applying the components in the following order:

1. Data Ingestion (Azure Data Factory):

• Azure Data Factory serves as our data ingestion platform. It enables us to collect
COVID-19 data from various sources, including government databases, APIs, and web
scraping. Data Factory’s data connectors and scheduling capabilities are invaluable for
automated ingestion.

2. Data Storage (Azure Data Lake Gen2):

• The ingested and processed COVID-19 data finds its home in Azure Data Lake Gen2. This storage solution provides scalable, secure, and cost-effective storage, which is critical for accommodating the growing volume of epidemic data.

3. Data Transformation (Azure Databricks):

• We utilize Azure Databricks for data transformation and processing tasks. Databricks
clusters allow us to perform data cleansing, normalization, and feature engineering,
preparing the COVID-19 data for analysis.

4. Data Modeling (Azure Synapse):

• Azure Synapse serves as our data modeling and analytics platform. We employ it to
build data models, perform SQL-based queries, and conduct complex analytical tasks
on the COVID-19 dataset.

5. Data Visualization (Power BI):

• Power BI is the tool of choice for data visualization. We create interactive dashboards
and reports to present the COVID-19 data insights, enabling stakeholders to make
informed decisions.

By orchestrating data flow, transformation, storage, modeling, and visualization using Azure
services, we aim to provide actionable insights from this critical dataset.
Architectural overview

This architectural overview encapsulates our approach in this data engineering project,
emphasizing the role of Azure services in processing and analyzing COVID-19 data. The
architecture ensures that the data pipeline is efficient, secure, and capable of handling the
evolving requirements of epidemic data analysis.

For this project, we started by creating our resource group, within which we provisioned the essential resources.

Resources
3- Data Collection and Ingestion.

We will extract the data from GitHub. It contains information about deaths and confirmed cases from the COVID-19 epidemic, as well as other related data such as hospital admissions, testing, and country response measures.

The raw data will look like this:

Raw Data
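Before wiring up the pipeline, it can be handy to preview one of the source files locally. The following is a small optional sketch, not part of the original pipeline; the URL is a hypothetical placeholder for one of the raw CSV files hosted on GitHub:

import pandas as pd

# Hypothetical placeholder URL for one of the raw CSV files on GitHub
url = "https://raw.githubusercontent.com/<org>/<repo>/main/<file>.csv"

# Quick local preview of the first rows and the column types
preview = pd.read_csv(url)
print(preview.head())
print(preview.dtypes)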

In this phase, we harnessed the power of Azure Data Factory to seamlessly ingest data from the
“GitHub” source into our Azure environment.

The next picture shows the pipeline used for the ingestion:

The copy-all-data pipeline


We commence by establishing our source dataset, configured as an HTTP source. This step
encompasses the setup of the Linked service and the definition of the BASE URL for all our files.
In the case of the sink datasets, we select Delimited Text (CSV) format, and for the Linked
service, we designate the destination container as “raw-data” within our Storage account.
Since the data remains unaltered and requires no transformations, the primary task revolves
around copying the files. We apply minor modifications to the file names to improve their clarity
and comprehensibility.
4- Data Storage.

In this phase, we ensure that our data is meticulously organized within our Azure Data Lake Storage. We have designated a specific container for this purpose, which we refer to as "bronze-data" (the raw-data landing container used as the sink in the ingestion step). This staging area acts as the initial repository for our raw data files, providing a secure and organized location for the data to reside.

Raw-data in ADLS-Gen2
5- Data Transformation.

In the data transformation phase, we leverage Azure Databricks with a notebook environment
to write and execute our Spark code. To initiate this process, we’ve provisioned Azure
Databricks by creating the required resources and compute instances. However, before delving
into coding, it is crucial to establish a secure connection between Azure Databricks and the
Azure Storage accounts that house our raw data.

To achieve this, we have developed an application using the ‘App registrations’ resource
provided by Azure. Within this application, we’ve generated an ‘Application (client) ID’ and a
‘Directory (tenant) ID.’ For clarity, this application is named ‘App01.’

Subsequently, within the ‘Certificates & secrets’ section of application management, we have
generated a client secret key (as illustrated in the figure below). This secret key plays a pivotal
role in maintaining a secure and robust connection between Azure Databricks and the Azure
Storage accounts, enabling seamless data transformation and processing.

Finally, we must configure a role assignment for 'Storage Blob Data Contributor.' This assignment grants Azure Databricks read, write, and delete access to the Azure Storage blob containers and their data, facilitating efficient data management and processing.

The secret key

Now we can start writing our Spark code in the notebook, beginning with the establishment of the connection between the containers in the storage account and our notebook.

configs = {"fs.azure.account.auth.type": "OAuth",


"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "client ID",
"fs.azure.account.oauth2.client.secret": 'Secret Key',
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/Directory
(tenant) ID/oauth2/token"}
#raw data = bronze data
dbutils.fs.mount(
source = "abfss://bronze-data@<Storage Account name>.dfs.core.windows.net", #
contrainer@storageacc
mount_point = "/mnt/bronze-data",
extra_configs = configs)

#transformed data
dbutils.fs.mount(
source = "abfss://transformed-data@<Storage Account name>.dfs.core.windows.net", #
contrainer@storageacc
mount_point = "/mnt/transformed-data",
extra_configs = configs)
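To confirm that both containers were mounted successfully before moving on, we can optionally list the existing mount points (a small sanity check, not shown in the original write-up):

# Optional sanity check: list all mount points visible to the cluster
display(dbutils.fs.mounts())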

After this, we can list the files and directories in the 'bronze-data' container by using the following command:

%fs
ls "/mnt/bronze-data"

After that, we can check our Spark session by simply running the spark command in a cell (Databricks provides a preconfigured SparkSession). Now, we can begin our actual data work by first reading the data:

#read the data (the raw CSV files live under the bronze-data mount created above)

admissions_hospital = spark.read.format("csv").option("header","true").load("/mnt/bronze-data/admissions_hospital")
death_uk_ind_only = spark.read.format("csv").option("header","true").load("/mnt/bronze-data/death_uk_ind_only")
death_world = spark.read.format("csv").option("header","true").load("/mnt/bronze-data/death_world")
response_country = spark.read.format("csv").option("header","true").load("/mnt/bronze-data/response_country")
test = spark.read.format("csv").option("header","true").load("/mnt/bronze-data/test")
We can utilize various PySpark SQL DataFrame methods, such as .show() and .printSchema(),
to view and gain a better understanding of the data.
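For example, a quick inspection of one of the DataFrames:

# Inspect the schema and the first rows of the death_world data
death_world.printSchema()
death_world.show(5)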

After reviewing the data and considering our requirements, we have decided to extract two dimension tables from our dataset. The first one, named 'dim_country,' will contain information about countries, including details such as country code, continent, and population. This table will be extracted from our 'death_world' file.

#let's process and transform the death world data

death_world.createOrReplaceTempView("raw_death_world") #to query the files
dim_country = spark.sql("""select distinct country
    , country_code
    , continent
    , population
from raw_death_world
""")

The data in the ‘dim_country’ table will be structured as follows:

dim_country.show(6)

only showing top 6 rows

The second dimension table that we will use is ‘dim_date,’ which contains information
structured as follows: date_key, date, year, month, day, day_name, day_of_year,
week_of_month, week_of_year, month_name, year_month, and year_week.

This table will include data for the date range from January 1, 2020, to December 30, 2022. The
table will contain 1,095 records.
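The notebook code used to generate 'dim_date' is not reproduced here. The following is a minimal sketch of how such a calendar dimension could be built in Spark SQL, assuming the column list and date range described above (the week_of_month expression is an illustrative assumption):

# Sketch (not the original project code): generate the dim_date calendar table
dim_date = spark.sql("""
select date_format(date, 'yyyyMMdd')  as date_key
     , date
     , year(date)                     as year
     , month(date)                    as month
     , day(date)                      as day
     , date_format(date, 'EEEE')      as day_name
     , dayofyear(date)                as day_of_year
     , ceil(day(date) / 7.0)          as week_of_month   -- illustrative definition
     , weekofyear(date)               as week_of_year
     , date_format(date, 'MMMM')      as month_name
     , date_format(date, 'yyyyMM')    as year_month
     , concat(date_format(date, 'yyyy'), '-W',
              lpad(cast(weekofyear(date) as string), 2, '0')) as year_week
from (select explode(sequence(to_date('2020-01-01'),
                              to_date('2022-12-30'),
                              interval 1 day)) as date) d
""")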

The table will look like this:


dim_date.show(5)

only showing top 5 rows.

For the fact tables, we're going to extract three fact tables: fact_cases_death, fact_response_country, and fact_admissions_hospital.

fact_cases_death will contain measures related to country, date, indicator, rate_14_day, daily_count, and source. Regarding the date column, we need to change the format to facilitate our analytics on Azure Synapse, making it more convenient.

#let's go now for the fact table of cases and deaths

death_world.createOrReplaceTempView("death_world")  # register a temp view so we can run SQL queries against it
fact_cases_death = spark.sql("""select country, date_format(date,'yyyyMMdd') as date_key,
indicator, rate_14_day, daily_count, source
from death_world
""")

In it, the data looks like this:


fact_response_country will contain measures related to Country, Response_measure, change, date_start, and date_end. Regarding the date_start and date_end columns, we also need to change the format to facilitate our analytics on Azure Synapse, making it more convenient.

response_country.createOrReplaceTempView("response_country")
fact_response_country = spark.sql("""select Country
, Response_measure
, change
, date_format(date_start,'yyyyMMdd' ) as date_start
, date_format(date_end,'yyyyMMdd' ) as date_end
from response_country
""")

fact_admissions_hospital will contain measures related to Country, indicator, date, value, source, and URL.

admissions_hospital.createOrReplaceTempView("admissions_hospital")
fact_admissions_hospital = spark.sql("""select country
    , indicator
    , date_format(date,'yyyyMMdd') AS date_key
    , value
    , source
    , url
from admissions_hospital
""")

Now that we have the dimension tables and the fact tables, we should write these tables into our container named 'transformed-data':

#Write our files

dim_date.write.format("com.databricks.spark.csv").option("header","true").option("delimiter", ",").mode("overwrite").save("/mnt/transformed-data/dim_date")
dim_country.write.format("com.databricks.spark.csv").option("header","true").option("delimiter", ",").mode("overwrite").save("/mnt/transformed-data/dim_country")
fact_cases_death.write.format("com.databricks.spark.csv").option("header","true").option("delimiter", ",").mode("overwrite").save("/mnt/transformed-data/fact_cases_death")
fact_response_country.write.format("com.databricks.spark.csv").option("header","true").option("delimiter", ",").mode("overwrite").save("/mnt/transformed-data/fact_response_country")
fact_admissions_hospital.write.format("com.databricks.spark.csv").option("header","true").option("delimiter", ",").mode("overwrite").save("/mnt/transformed-data/fact_admissions_hospital")

So far, so good. Now we can locate our files in our sink container named ‘transformed-data’:
Transformed-data Container
6- Data Modeling.

Now that we have our data in the tables, we will proceed to load it into the Lake Database in
Azure Synapse Analytics, enabling us to create our models.

First, we need to set up our Azure Synapse workspace. When we create the workspace (and its Synapse Studio), another storage account is created along with it: an Azure Data Lake Storage Gen2 (ADLS Gen2) account that serves as the workspace's default storage.

To use Azure Synapse for working with this data, we should copy the files from the
‘Transformed-data’ container into our ADLS Gen2. For this purpose, we will utilize a pipeline
containing a copy activity from our source with the linked service: AzureBlobStorage, to our
destination with the linked service: Default Storage account for our Synapse workspace (ADLS
Gen2).

Another tip: to copy all the files in the ‘transformed-data’ container, rather than one file at a
time, we can utilize the ‘Wildcard File Path’ option with the input as ‘transformed-data/*’.

Now, in the data part of the Synapse workspace, we add a Lake Database named ‘CovidDB.’
Following this, we create external tables from the data lake. To do this, we specify the External
table name (which will be the same as ‘dim_country,’ ‘dim_date,’ etc.), the Linked service
(which will be ‘ADLS Gen2’), and the Input file or folder. This input will specify the path to the
files.

While creating our tables in the Lake database, we discovered that the data in the "fact_cases_death" folder had been written as four separate part files due to its size, as demonstrated in the picture below.

folder fact_cases_death

We will now implement a different pipeline to consolidate the files within the
“fact_cases_death” folder. This new pipeline will consist of a single activity: data copying. In
this pipeline, we will use the wildcard path option directly targeting our “fact_cases_death”
folder. Additionally, we will modify the sink settings by choosing the “merge files” copy
behavior.

This will make it easier for us to add our tables, as illustrated in the picture below:
The CovidDB database
We will now establish relationships between the tables. A relationship is a connection that
defines how the data in these tables should be associated or correlated.

We chose the “To table” option for the fact tables, as these tables serve as the parent tables for
the dimension tables.

Data model
7- Data Visualization.

In this section, we will leverage data visualization to gain valuable insights from the dataset. The
following measures will be visualized:

Total Daily Hospital Admissions per Country Over time:


• To comprehend the daily hospital admissions on a country level, we execute the
following query. It calculates the sum of daily admissions, providing a pivotal metric to
monitor the impact of events like COVID-19.

Daily hospital admissions in March

total daily admissions by month


To see how things change over time, we will use the "Play Axis" visual in our presentation, allowing the creation of a time-series animation that illustrates how variations between countries evolve over time.
-- the query of Total Daily Hospital Admissions per Country Over time
SELECT
fd.date,
dc.country,
SUM(fh.value) AS total_daily_admissions
FROM [CovidDB].[dbo].fact_admissions_hospital fh
JOIN [CovidDB].[dbo].dim_date fd ON fh.date_key = fd.date_key
JOIN [CovidDB].[dbo].dim_country dc ON fh.country = dc.country
GROUP BY fd.date, dc.country
ORDER BY fd.date, dc.country;

Trends Over Time:


• Understanding trends over time is crucial for assessing the evolution of daily hospital admissions. We use a line chart based on the result of the query above, with the date on the x-axis and the total daily admissions on the y-axis. Tracking trends over time helps in identifying significant fluctuations.

Variations Between Countries:


• Comparing variations between countries offers critical insights into the disparities in daily hospital admissions. We opt for a bar chart visualization, which enables side-by-side comparisons of different countries.

Comparison in March
Here, too, we use the "Play Axis" feature.

Seasonal Patterns:
• Identifying seasonal patterns can provide essential context. To achieve this, we employ a query that calculates a seven-day moving average; a sketch of such a query is shown below. This moving average is pivotal in recognizing recurring trends in hospital admissions and can be instrumental in resource allocation and preparedness.
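The moving-average query itself is not reproduced in this write-up; the project runs it as a SQL query in Synapse. As an illustration only, here is a hedged PySpark sketch of the same seven-day moving average, computed in the Databricks notebook over the fact_admissions_hospital DataFrame defined earlier:

# Illustrative sketch, not the project's Synapse SQL query
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total admissions per country and day (value is read as a string, so cast it)
daily = (fact_admissions_hospital
         .groupBy("country", "date_key")
         .agg(F.sum(F.col("value").cast("double")).alias("total_daily_admissions")))

# Seven-day window per country: the current row plus the six preceding days
# (date_key is a yyyyMMdd string, so ordering by it is chronological)
seven_day_window = (Window.partitionBy("country")
                    .orderBy("date_key")
                    .rowsBetween(-6, 0))

moving_avg = daily.withColumn(
    "admissions_7day_avg",
    F.avg("total_daily_admissions").over(seven_day_window))

moving_avg.show(10)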

These visualizations are instrumental in uncovering patterns, trends, and disparities in the data.

This screenshot shows only part of the visualization for a specific moment, similar to the comparison above: it covers only March. In the report itself, it plays as an animation.
