Azure Data Factory Overview

ADF is a cloud-based data integration service that allows users to visually design pipelines to orchestrate data movement and transformation. Pipelines are composed of activities that perform tasks like copying or transforming data. Linked services define connections to data stores, and datasets represent pieces of data within those stores. Integration runtimes enable connectivity between on-premises and cloud resources, and triggers control pipeline execution.


ADF Overview

What Is Azure Data Factory (ADF)

• ADF is an Azure cloud-based, code-free data integration service that is used to develop, orchestrate, schedule & monitor ETL processing for data applications in Azure.

• Cloud Based: It is a Microsoft Azure platform-as-a-service (PaaS) offering for data movement and transformation. All tasks are performed in the Azure Portal.

• Code-Free: The development is done using a visual interface offering drag & drop functionality for building ETL
pipelines.

• Orchestrate: We can create a workflow of ETL activities in a sequence of steps, as a pipeline.

• Schedule: The ADF pipeline can be scheduled to run at a defined time interval.

• Monitor: We can then monitor the execution of the pipeline and get notified on success or failure. (A short provisioning sketch using the Python SDK follows this list.)
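
As a rough illustration of the PaaS nature of ADF, the factory itself can be provisioned from code with the azure-mgmt-datafactory management SDK. This is only a minimal sketch: the subscription ID, resource group, factory name and region below are assumed placeholder values, not values from this document.

```python
# Minimal sketch: provisioning a data factory with the Python management SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
rg_name = "rg-adf-demo"                 # assumed resource group name
df_name = "adf-demo-factory"            # assumed factory name

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the data factory as a PaaS resource in the chosen region.
factory = adf_client.factories.create_or_update(
    rg_name, df_name, Factory(location="eastus"))
print(factory.name, factory.provisioning_state)
```

The same factory can of course be created interactively in the Azure Portal, which is the code-free path the rest of these notes assume.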
Data Integration Capabilities

• Using ADF we can perform the following activities when we build the ETL pipelines for our data processing application.

• Connect and Collect Data: ADF provides a wide range of data source connectors, which make it possible to connect to disparate on-premises and cloud data stores, pull the data from those sources & land it on Azure storage in the form of files.

• Transform and Enrich Data: Once data is extracted from the source systems & landed on Azure storage, we can transform & enrich it using the Data Flow component in ADF.

• Publish: The transformed & enriched data can then be copied into Azure Synapse or Azure SQL, or we can simply build an Azure Data Lake solution leveraging Azure Storage.

• Monitor: Lastly, we can monitor the ETL pipeline using Azure Monitor as well as the ADF UI. (A small monitoring sketch follows below.)
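
The run status that the Monitor tab of ADF Studio shows can also be queried from code. A minimal sketch, assuming the factory from the earlier example and a run ID returned by a previous pipeline run; all names are placeholders.

```python
# Minimal sketch: checking a pipeline run and its activity runs programmatically.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

run = adf_client.pipeline_runs.get(rg, df, "<run-id>")
print("Pipeline run status:", run.status)   # e.g. Queued / InProgress / Succeeded / Failed

# Drill down into the individual activity runs of that pipeline run.
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    rg, df, "<run-id>",
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow() + timedelta(days=1)))
for act in activity_runs.value:
    print(act.activity_name, act.status)
```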
Pipeline

• An ADF pipeline is a logical grouping of activities which is used to perform a unit of work.

• When we develop an ETL process, we define the activities for the ETL steps, as a pipeline of operations.

• A pipeline encapsulates the data flow of the ETL process, which can include several different steps, such as

• Copying the data from source systems

• Transforming the copied data using transformations such as filter, lookup or aggregate to change the structure of the data

• Writing the transformed data into a target system such as ADLS Gen2, Azure SQL, etc.

• Activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.

• Lastly, we can run a pipeline manually as well as using a trigger (see the sketch after this list).
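
A minimal sketch of a pipeline as a grouping of activities, defined and started manually through the Python SDK. The dataset names referenced here are assumed placeholders that would already have to exist in the factory.

```python
# Minimal sketch: a pipeline with one copy activity, started with an on-demand run.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

copy_step = CopyActivity(
    name="CopyRawFiles",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_source_blob")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_landing_blob")],
    source=BlobSource(),
    sink=BlobSink())

# The pipeline is just the logical grouping of its activities.
adf_client.pipelines.create_or_update(rg, df, "pl_copy_raw",
                                      PipelineResource(activities=[copy_step]))

# Manual (on-demand) run; the same pipeline could also be started by a trigger.
run = adf_client.pipelines.create_run(rg, df, "pl_copy_raw", parameters={})
print("Started run:", run.run_id)
```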


Activity
• An activity is an individual processing task within a pipeline & it specifies the action to perform on the data.

• The data which the activities consume or produce is represented in the form of a Dataset.

• Activities typically perform one of three kinds of tasks: data movement, data transformation or control.

• We can execute the activities in either a sequential manner or a parallel manner.

• A key point to note is that activities can be performed entirely within ADF, or they can trigger other Azure services such as Azure Databricks, Azure HDInsight, etc., using the specific activities available within ADF for running these external tasks.

• We can classify activities as follows (a chaining sketch follows this list):

Data Movement: Copies data from a source data store to a sink data store.

Data Transformation: HDInsight (Hive, Hadoop, Spark), Azure Functions, Azure Batch, Machine Learning, etc.

Control: Used to run other pipelines, execute SSIS packages, and apply control flow such as ForEach, Until, Wait, etc.
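
A minimal sketch of chaining activities and dispatching work to an external service: a copy (data movement) activity followed by an Azure Databricks notebook activity that only runs if the copy succeeds. The dataset, notebook path and linked service names are assumed placeholders.

```python
# Minimal sketch: sequential activities via depends_on, including an external task.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    DatabricksNotebookActivity, LinkedServiceReference, ActivityDependency)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

copy_step = CopyActivity(
    name="CopyToLanding",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_source_blob")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_landing_blob")],
    source=BlobSource(), sink=BlobSink())

# External task dispatched to Azure Databricks; runs only after the copy succeeds.
transform_step = DatabricksNotebookActivity(
    name="TransformInDatabricks",
    notebook_path="/etl/transform_landing",                     # assumed notebook path
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="ls_databricks"),
    depends_on=[ActivityDependency(activity="CopyToLanding",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg, df, "pl_copy_and_transform",
    PipelineResource(activities=[copy_step, transform_step]))
```

Leaving out the depends_on dependency would let both activities run independently in parallel instead of sequentially.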
Linked Service
• Linked services are used to define the connection to a data source.

• We can consider these like connection strings that specify the connection information needed for ADF to connect to a data source or a data destination.

• The properties or configuration settings of a linked service depend on the type of data source.

• For example, we can use an Azure SQLDB linked service to connect to an Azure SQL Database, or we can define an Azure Blob Storage linked service to connect to Azure Blob Storage, etc. (see the sketch after this list).

• A linked service basically represents a connection to one of two types of resources:

• Data Store: It can represent a data store like SQL Server, Oracle, Azure Blob storage, etc.

• Compute: It can represent a compute resource that can host the execution of an activity. For example, Azure
Databricks cluster, Synapse Spark Pool, etc.
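
A minimal sketch of defining the two example linked services from above via the Python SDK. The connection strings are placeholders and would normally be stored in Azure Key Vault rather than inlined; the linked service names are assumptions used again in the dataset sketch later.

```python
# Minimal sketch: a data-store linked service for Azure SQL DB and one for Blob Storage.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService,
    AzureBlobStorageLinkedService, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

# Connection to an Azure SQL Database (data store).
sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(value="<azure-sql-connection-string>")))
adf_client.linked_services.create_or_update(rg, df, "ls_azure_sqldb", sql_ls)

# Connection to an Azure Blob Storage account (data store).
blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string=SecureString(value="<storage-connection-string>")))
adf_client.linked_services.create_or_update(rg, df, "ls_blob_storage", blob_ls)
```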
Dataset

• A dataset represents a data structure within the data store to which a linked service points.

• A dataset can point to an input data source whose data we want to ingest, or to an output destination where we want to store data.

• So, let us say we are reading and processing data from Azure SQLDB; then we will need to create an input dataset that uses an Azure SQLDB linked service specifying the connection details for the database.

• The dataset would specify the table to ingest.

• After processing the data, if we are storing it in Azure Blob Storage, then we will need to create an output dataset that uses an Azure Blob Storage linked service and points to the Blob Storage location, as well as the format of the data in the blob, such as Parquet, JSON, delimited text, etc. (see the sketch after this list).
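
A minimal sketch of that input/output dataset pair, reusing the linked service names assumed earlier. The table, container, folder and file names are placeholders.

```python
# Minimal sketch: an input dataset (SQL table) and an output dataset (Parquet in Blob).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureSqlTableDataset, ParquetDataset,
    LinkedServiceReference, AzureBlobStorageLocation)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

# Input: which table to ingest from Azure SQL Database.
sql_ds = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="ls_azure_sqldb"),
    table_name="dbo.Sales"))                      # assumed table
adf_client.datasets.create_or_update(rg, df, "ds_sales_table", sql_ds)

# Output: location plus format (Parquet) of the file written to Blob Storage.
parquet_ds = DatasetResource(properties=ParquetDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="ls_blob_storage"),
    location=AzureBlobStorageLocation(
        container="curated", folder_path="sales", file_name="sales.parquet")))
adf_client.datasets.create_or_update(rg, df, "ds_sales_parquet", parquet_ds)
```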
Integration Runtime

• An Integration Runtime (IR) is the compute infrastructure used by ADF to provide data integration capabilities across different network environments. It is essentially the environment where activities run, or from which they are dispatched.

• An IR provides the capability to connect an on-premises network to the Azure cloud network.

• The IR also acts as a bridge between an activity and the data store to which a linked service points.

• An IR provides the following capabilities:

Data Movement: When we use the copy activity.

Activity Dispatch: When we use external compute such as Azure Databricks.

SSIS Package Execution: When we run SSIS packages alongside ADF pipelines.

Data Flow: When we use transformations provided in Data Flow activity.


Types of Integration Runtime

• Azure IR:

- It is a fully managed, serverless compute in Azure.

- It can connect only to data stores and compute services that have a publicly accessible endpoint.

• Self-hosted IR:

- It manages activities between cloud data stores and a data store residing in a private network.

- It is necessary when we want to access data in the on-premises data center of an organization.

- It creates a secure tunnel that allows ADF to read data from or write data to on-premises databases or files (see the sketch after this section).

• Azure-SSIS IR:

- It is required to natively execute SSIS packages.
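
A minimal sketch of registering a self-hosted IR in the factory and retrieving the authentication keys that are entered into the self-hosted IR installer on the on-premises machine. The IR name is a placeholder, and the exact key-listing call may vary slightly between SDK versions.

```python
# Minimal sketch: creating a self-hosted integration runtime definition in ADF.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
    description="Bridge to the on-premises network"))
adf_client.integration_runtimes.create_or_update(rg, df, "ir_selfhosted", ir)

# These keys are used when installing the IR node on the on-premises machine.
keys = adf_client.integration_runtimes.list_auth_keys(rg, df, "ir_selfhosted")
print(keys.auth_key1)
```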


Triggers

• Triggers are used to initiate the execution of a pipeline.

• They determine when to execute a pipeline.

• We can execute a pipeline on a schedule, or on a periodic interval, or when an event occurs.

• Triggers are of the following types (a schedule-trigger sketch follows this list):

• Schedule: It runs a pipeline at a specific time and frequency, for example, every day at 9:00 AM.

• Tumbling Window: It runs a pipeline on a periodic interval, for example, every 15 minutes.

• Storage events: It runs a pipeline in response to a storage event, for example, when a file arrives in Blob Storage.

• Custom events: It runs a pipeline in response to a custom event, for example, an Event Grid based event.
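
A minimal sketch of a schedule trigger that starts the copy pipeline from the earlier example every day at 09:00 UTC. Trigger and pipeline names are placeholders, and depending on the SDK version the start call is begin_start() or start().

```python
# Minimal sketch: a daily schedule trigger attached to an existing pipeline.
from datetime import datetime
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence, RecurrenceSchedule,
    TriggerPipelineReference, PipelineReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "rg-adf-demo", "adf-demo-factory"

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1, time_zone="UTC",
    start_time=datetime(2024, 1, 1),
    schedule=RecurrenceSchedule(hours=[9], minutes=[0]))   # every day at 9:00

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="pl_copy_raw"))]))

adf_client.triggers.create_or_update(rg, df, "tr_daily_9am", trigger)
adf_client.triggers.begin_start(rg, df, "tr_daily_9am").result()  # activate the trigger
```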
LAB – ADF Provisioning
