Data Factory, Data Integration

This document provides an overview of the Azure Data Factory documentation, including quickstarts, tutorials, samples, concepts, how-to guides, reference materials, and resources. Key sections include quickstarts for creating data factories using different tools/languages, tutorials on copying and transforming data both in the cloud and between on-premises and cloud, and concepts covering pipelines/activities, linked services, integration runtime, roles/permissions, and pricing.

Contents

Data Factory Documentation


Switch to version 1 documentation
Overview
Introduction to Data Factory
Compare current version to version 1
Quickstarts
Create data factory - User interface (UI)
Create data factory - Copy Data tool
Create data factory - Azure PowerShell
Create data factory - .NET
Create data factory - Python
Create data factory - REST
Create data factory - Resource Manager template
Create data flow
Tutorials
Copy data in cloud
Copy Data tool
User interface (UI)
.NET
Copy on-premises data to cloud
Copy Data tool
User interface (UI)
Azure PowerShell
Copy data in bulk
User interface (UI)
Azure PowerShell
Copy data incrementally
1 - Copy from one table
User interface (UI)
Azure PowerShell
2 - Copy from multiple tables
User interface (UI)
Azure PowerShell
3 - Use change tracking feature
User interface (UI)
Azure PowerShell
4 - Copy new files by lastmodifieddate
Copy Data tool
5 - Copy new files by time partitioned file name
Copy Data tool
Transform data in cloud
HDInsight Spark
User interface (UI)
Azure PowerShell
Databricks Notebook
User interface (UI)
Transform data in virtual network
User interface (UI)
Azure PowerShell
Add branching and chaining
User interface (UI)
.NET
Run SSIS packages in Azure
User interface (UI)
Azure PowerShell
Samples
Code samples
Azure PowerShell
Concepts
Pipelines and activities
Linked services
Datasets
Pipeline execution and triggers
Integration runtime
Mapping Data Flows
Mapping data flow concepts
Debug mode
Schema drift
Inspect pane
Column patterns
Data flow monitoring
Data flow performance
Move nodes
Optimize tab
Expression builder
Reference nodes
Expression language
Roles and permissions
Understanding pricing
Naming rules
How-to guides
Author
Visually author data factories
Continuous integration and delivery
Iterative development and debugging
Connectors
Amazon Marketplace Web Service
Amazon Redshift
Amazon S3
Azure Blob Storage
Azure Cosmos DB SQL API
Azure Cosmos DB's API for MongoDB
Azure Data Explorer
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Azure Database for MariaDB
Azure Database for MySQL
Azure Database for PostgreSQL
Azure File Storage
Azure Search
Azure SQL Database
Azure SQL Database Managed Instance
Azure SQL Data Warehouse
Azure Table Storage
Cassandra
Common Data Service for Apps
Concur
Couchbase
DB2
Delimited text format
Drill
Dynamics 365
Dynamics AX
Dynamics CRM
File System
FTP
Google AdWords
Google BigQuery
Google Cloud Storage
Greenplum
HBase
HDFS
Hive
HTTP
HubSpot
Impala
Informix
Jira
Magento
MariaDB
Marketo
Microsoft Access
MongoDB
MongoDB (legacy)
MySQL
Netezza
OData
ODBC
Office 365
Oracle
Oracle Eloqua
Oracle Responsys
Oracle Service Cloud
Parquet format
Paypal
Phoenix
PostgreSQL
Presto
QuickBooks Online
REST
Salesforce
Salesforce Service Cloud
Salesforce Marketing Cloud
SAP Business Warehouse Open Hub
Load SAP BW data
SAP Business Warehouse MDX
SAP Cloud for Customer
SAP ECC
SAP HANA
SAP Table
ServiceNow
SFTP
Shopify
Spark
SQL Server
Square
Sybase
Teradata
Vertica
Web Table
Xero
Zoho
Move data
Copy data using Copy Activity
Delete files using Delete Activity
Copy Data tool
Load Data Lake Storage Gen2
Copy from Data Lake Storage Gen1
Load SQL Data Warehouse
Load Data Lake Storage Gen1
Load SAP BW data
Load Office 365 data
Read or write partitioned data
Format and compression support
Schema and type mapping
Fault tolerance
Performance and tuning
Transform data
HDInsight Hive Activity
HDInsight Pig Activity
HDInsight MapReduce Activity
HDInsight Streaming Activity
HDInsight Spark Activity
ML Batch Execution Activity
ML Update Resource Activity
Stored Procedure Activity
Data Lake U-SQL Activity
Databricks Notebook Activity
Databricks Jar Activity
Databricks Python Activity
Custom activity
Compute linked services
Control flow
Append Variable Activity
Azure Function Activity
Execute Data Flow Activity
Execute Pipeline Activity
Filter Activity
For Each Activity
Get Metadata Activity
If Condition Activity
Lookup Activity
Set Variable Activity
Until Activity
Validation Activity
Wait Activity
Web Activity
Webhook Activity
Data flow transformations
Aggregate
Alter row
Conditional split
Derived column
Exists
Filter
Join
Lookup
New branch
Pivot
Select
Sink
Sort
Source
Surrogate key
Union
Unpivot
Window
Parameterize
Parameterize linked services
Expression Language
System variables
Security
Data movement security considerations
Store credentials in Azure Key Vault
Encrypt credentials for self-hosted integration runtime
Managed identity for Data Factory
Monitor and manage
Monitor visually
Monitor with Azure Monitor
Monitor with SDKs
Monitor integration runtime
Monitor Azure-SSIS integration runtime
Reconfigure Azure-SSIS integration runtime
Copy or clone a data factory
Create integration runtime
Azure integration runtime
Self hosted integration runtime
Azure-SSIS integration runtime
Shared self-hosted integration runtime
Run SSIS packages in Azure
Run SSIS packages with Execute SSIS Package activity
Run SSIS packages with Stored Procedure activity
Schedule Azure-SSIS integration runtime
Join Azure-SSIS IR to a virtual network
Enable Azure AD authentication for Azure-SSIS IR
Provision Enterprise Edition for Azure-SSIS IR
Customize setup for Azure-SSIS IR
Install licensed components for Azure-SSIS IR
Configure high performance for Azure-SSIS IR
Configure disaster recovery for Azure-SSIS IR
Clean up SSISDB logs with Elastic Database Jobs
Create triggers
Create an event-based trigger
Create a schedule trigger
Create a tumbling window trigger
Templates
Overview of templates
Copy files from multiple containers
Copy new files by LastModifiedDate
Bulk copy with control table
Delta copy with control table
Transform data with Databricks
Reference
.NET
PowerShell
REST API
Resource Manager template
Python
Resources
Ask a question - MSDN forum
Ask a question - Stack Overflow
Request a feature
FAQ
Whitepapers
Roadmap
Pricing
Availability by region
Support options
Introduction to Azure Data Factory
1/3/2019 • 10 minutes to read

NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.

What is Azure Data Factory?


In the world of big data, how is existing data leveraged in business? Is it possible to enrich data that's generated in
the cloud by using reference data from on-premises data sources or other disparate data sources?
For example, a gaming company collects logs that are produced by games in the cloud. It wants to analyze these
logs to gain insights into customer preferences, demographics, usage behavior, and so on. The company also wants
to identify up-sell and cross-sell opportunities, develop compelling new features to drive business growth, and
provide a better experience to customers.
To analyze these logs, the company needs to use the reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. Therefore, the company
wants to ingest log data from the cloud data store and reference data from the on-premises data store.
Next they want to process the data by using Hadoop in the cloud (Azure HDInsight). They want to publish the
result data into a cloud data warehouse such as Azure SQL Data Warehouse or an on-premises data store such as
SQL Server. The company wants this workflow to run once a week.
The company needs a platform where they can create a workflow that can ingest data from both on-premises and
cloud data stores. The company also needs to be able to transform or process data by using existing compute
services such as Hadoop, and publish the results to an on-premises or cloud data store for BI applications to
consume.

Azure Data Factory is the platform for these kinds of scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform rather than a traditional Extract-
Transform-and-Load (ETL) platform. The supported transformations process data by using compute services rather
than by adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.

How does it work?


The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps:

Connect and collect


Enterprises have data of various types that are located in disparate sources. The first step in building an
information production system is to connect to all the required sources of data and processing. These sources
include SaaS services, file shares, FTP, and web services. Then move the data as-needed to a centralized location
for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. It is expensive and hard to integrate and maintain such systems.
These systems also often lack the enterprise grade monitoring, alerting, and controls that a fully managed service
can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud
source data stores to a centralized data store in the cloud for further analysis.
For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data
Lake Analytics compute service. Or, collect data in Azure blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform it by using compute services such
as HDInsight Hadoop, Spark, Data Lake Analytics, or Machine Learning. You want to reliably produce transformed
data on a maintainable and controlled schedule to feed production environments with trusted data.
Publish
Deliver transformed data from the cloud to on-premises sources such as SQL Server. Alternatively, keep it in your
cloud storage sources for consumption by BI and analytics tools and other applications.

Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive
query on an HDInsight cluster to partition the data. The benefit of this is that the pipeline allows you to manage
the activities as a set instead of each one individually. For example, you can deploy and schedule the pipeline,
instead of scheduling independent activities.
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example, you
can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity.
A Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory
supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory supports
the following data stores:

| Category | Data store | Supported as a source | Supported as a sink |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| Azure | Azure Cosmos DB (SQL API) | ✓ | ✓ |
| Azure | Azure Data Lake Storage Gen1 | ✓ | ✓ |
| Azure | Azure SQL Database | ✓ | ✓ |
| Azure | Azure SQL Data Warehouse | ✓ | ✓ |
| Azure | Azure Search Index | | ✓ |
| Azure | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| Databases | DB2* | ✓ | |
| Databases | MySQL* | ✓ | |
| Databases | Oracle* | ✓ | ✓ |
| Databases | PostgreSQL* | ✓ | |
| Databases | SAP Business Warehouse* | ✓ | |
| Databases | SAP HANA* | ✓ | |
| Databases | SQL Server* | ✓ | ✓ |
| Databases | Sybase* | ✓ | |
| Databases | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| NoSQL | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| File | File System* | ✓ | ✓ |
| File | FTP | ✓ | |
| File | HDFS* | ✓ | |
| File | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| Others | Generic OData | ✓ | |
| Others | Generic ODBC* | ✓ | |
| Others | Salesforce | ✓ | |
| Others | Web Table (table from HTML) | ✓ | |

For more information, see Move data by using Copy Activity.


Data transformation activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.

| Data transformation activity | Compute environment |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |


NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from
Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
See Run R Script using Azure Data Factory.



Custom .NET activities
Create a custom .NET activity if you need to move data to or from a data store that Copy Activity doesn't support
or if you need to transform data by using your own logic. For details about how to create and use a custom activity,
see Use custom activities in an Azure Data Factory pipeline.
Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data
structures within the data stores. These structures point to or reference the data you want to use in your activities
(such as inputs or outputs).
For example, an Azure blob dataset specifies the blob container and folder in the Azure blob storage from which
the pipeline should read the data. Or an Azure SQL table dataset specifies the table to which the output data is
written by the activity.
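
As a concrete illustration, the following is a minimal sketch of how such a version 1 blob dataset might be defined in JSON (shown here in a PowerShell here-string); the names, folder path, and availability window are placeholders rather than values from this article.

```powershell
# Sketch of a version 1 (V1) Azure Blob input dataset, authored as JSON.
# Names, the folder path, and the availability window are placeholders.
$v1BlobDatasetJson = @'
{
  "name": "AzureBlobInputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/inputdata/",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "external": true,
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
'@
```

The availability block is the V1-specific part: it produces the time-sliced (here, hourly) behavior described earlier in this article, and it is replaced by triggers in the current version.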
Linked services
Linked services are much like connection strings, which define the connection information that's needed for Data
Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data
source and a dataset represents the structure of the data.
For example, an Azure Storage-linked service specifies a connection string with which to connect to the Azure
Storage account. An Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two reasons in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database, Oracle
database, file share, or Azure blob storage account. See the Data movement activities section for a list of
supported data stores.
To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive
activity runs on an HDInsight Hadoop cluster. See the Data transformation activities section for a list of
supported compute environments.
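
For the first case, a data store linked service in version 1 is essentially a named connection string. The following is a hedged sketch for an Azure Storage account; the account name and key are placeholders.

```powershell
# Sketch of a V1 Azure Storage linked service definition (data store case).
# The connection string values are placeholders, not real credentials.
$v1StorageLinkedServiceJson = @'
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
'@
```

A compute linked service (the second case) follows the same pattern, with a compute type such as HDInsight and the cluster connection details in typeProperties.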
Relationship between Data Factory entities

Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores. It also lets you process data by using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both
programmatic and UI mechanisms.
Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, then
a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For example, let's assume that your compute environments such as Azure HDInsight cluster and Azure
Machine Learning are located in the West Europe region. You can create and use an Azure Data Factory instance in
North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a few
milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on
your computing environment does not change.

Get started with creating a pipeline


You can use one of these tools or APIs to create data pipelines in Azure Data Factory:
Azure portal
Visual Studio
PowerShell
.NET API
REST API
Azure Resource Manager template
To learn how to build data factories with data pipelines, follow the step-by-step instructions in the following
tutorials:

| Tutorial | Description |
| --- | --- |
| Move data between two cloud data stores | Create a data factory with a pipeline that moves data from blob storage to a SQL database. |
| Transform data by using Hadoop cluster | Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster. |
| Move data between an on-premises data store and a cloud data store by using Data Management Gateway | Build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. |
Introduction to Azure Data Factory
2/27/2019 • 8 minutes to read

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful
insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of
raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for
these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration
projects.
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in
the cloud. The company wants to analyze these logs to gain insights into customer preferences,
demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop
compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. The company wants
to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud
data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight),
and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse to easily
build a report on top of it. They want to automate this workflow, and monitor and manage it on a daily
schedule. They also want to execute it when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service
that allows you to create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven
workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the
data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and
Azure Machine Learning.
Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business
intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized
into meaningful data stores and data lakes for better business decisions.
How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:

Connect and collect


Enterprises have data of various types that are located in disparate sources on-premises, in the cloud,
structured, unstructured, and semi-structured, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data
and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services.
The next step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services
to integrate these data sources and processing. It's expensive and hard to integrate and maintain such
systems. In addition, they often lack the enterprise-grade monitoring, alerting, and the controls that a fully
managed service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can
collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics
compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the collected data by using
compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. You want to
reliably produce transformed data on a maintainable and controlled schedule to feed production
environments with trusted data.
Publish
After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data
Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can
point to from their business intelligence tools.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from
refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory
has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and
health panels on the Azure portal.

Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components. These components work together to provide the platform on
which you can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group
of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to
partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each
one individually. The activities in a pipeline can be chained together to operate sequentially, or they can
operate independently in parallel.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data
from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on
an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities:
data movement activities, data transformation activities, and control activities.
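
To show how these pieces nest, here is a minimal, hypothetical pipeline definition containing a single control activity; the names are placeholders, and a Wait activity is used only because it needs no datasets or linked services.

```powershell
# Minimal sketch of a pipeline JSON: a pipeline is a named list of activities,
# and each activity has a name, a type, and type-specific properties.
$pipelineJson = @'
{
  "name": "ExamplePipeline",
  "properties": {
    "activities": [
      {
        "name": "WaitBriefly",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 30 }
      }
    ]
  }
}
'@
# A definition like this is typically saved to a .json file and deployed with a
# command such as Set-AzDataFactoryV2Pipeline -DefinitionFile <path>.
```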
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you
want to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to
the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked
service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob
dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the
copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive
activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
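
For the data store case, a linked service definition in the current version might look like the following sketch; the storage account name and key are placeholders, and a dataset would then reference this linked service by name.

```powershell
# Sketch of an Azure Blob storage linked service in the current version.
# The connection string is a placeholder; real credentials are better kept in Azure Key Vault.
$blobLinkedServiceJson = @'
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": {
        "type": "SecureString",
        "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
      }
    }
  }
}
'@
```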
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
There are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within
the trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
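
As a hedged PowerShell sketch of how parameters and pipeline runs fit together, the following starts a run of a hypothetical pipeline, passes an argument for one of its parameters, and then retrieves the run by its run ID; all resource names are placeholders.

```powershell
# Start a pipeline run, passing an argument for a parameter defined in the pipeline.
# Resource group, factory, pipeline, and parameter names below are placeholders.
$runId = Invoke-AzDataFactoryV2Pipeline `
    -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineName "ExamplePipeline" `
    -Parameter @{ "environment" = "prod" }

# Each invocation returns a unique pipeline run ID that identifies that run.
Get-AzDataFactoryV2PipelineRun `
    -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineRunId $runId
```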
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
For more information about Data Factory concepts, see the following articles:
Dataset and linked services
Pipelines and activities
Integration runtime

Supported regions
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region.
However, a data factory can access data stores and compute services in other Azure regions to move data
between data stores or process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores and the processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using
both programmatic and UI mechanisms.
Although Data Factory is available only in certain regions, the service that powers the data movement in Data
Factory is available globally in several regions. If a data store is behind a firewall, then a Self-hosted
Integration Runtime that's installed in your on-premises environment moves the data instead.
For example, let's assume that your compute environments such as Azure HDInsight cluster and Azure
Machine Learning are running out of the West Europe region. You can create and use an Azure Data Factory
instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe.
It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for
running the job on your computing environment does not change.

Accessibility
The Data Factory user experience in the Azure portal is accessible.
Compare with version 1
For a list of differences between version 1 and the current version of the Data Factory service, see Compare
with version 1.

Next steps
Get started with creating a Data Factory pipeline by using one of the following tools/SDKs:
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
Compare Azure Data Factory with Data Factory
version 1
3/5/2019 • 10 minutes to read

This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see
Introduction to Data Factory. For an introduction to Data Factory version 1, see Introduction to Azure Data Factory.

Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.

| Feature | Version 1 | Current version |
| --- | --- | --- |
| Datasets | A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data. Availability defines the processing window slicing model for the dataset (for example, hourly, daily, and so on). | Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets. |
| Linked services | Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources. | Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage. |
| Pipelines | A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines. | Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as “workflow units” that you schedule separately via triggers. Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1 concepts of startTime, endTime, and isPaused are no longer present in the current version of Data Factory. For more information, see Pipeline execution and triggers and Pipelines and activities. |
| Activities | Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported. | In the current version of Data Factory, activities are still defined actions within a pipeline. The current version of Data Factory introduces new control flow activities. You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version. |
| Hybrid data movement and activity dispatch | Now called Integration Runtime, Data Management Gateway supported moving data between on-premises and cloud. | Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1. The Azure-SSIS Integration Runtime in the current version of Data Factory also supports deploying and running SQL Server Integration Services (SSIS) packages in the cloud. For more information, see Integration runtime in Azure Data Factory. |
| Parameters | NA | Parameters are key-value pairs of read-only configuration settings that are defined in pipelines. You can pass arguments for the parameters when you are manually running the pipeline. If you are using a scheduler trigger, the trigger can pass values for the parameters too. Activities within the pipeline consume the parameter values. |
| Expressions | Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties. | In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory. |
| Pipeline runs | NA | A single instance of a pipeline execution. For example, say you have a pipeline that executes at 8 AM, 9 AM, and 10 AM. There would be three separate runs of the pipeline (pipeline runs) in this case. Each pipeline run has a unique pipeline run ID. The pipeline run ID is a GUID that uniquely defines that particular pipeline run. Pipeline runs are typically instantiated by passing arguments to parameters that are defined in the pipelines. |
| Activity runs | NA | An instance of an activity execution within a pipeline. |
| Trigger runs | NA | An instance of a trigger execution. For more information, see Triggers. |
| Scheduling | Scheduling is based on pipeline start/end times and dataset availability. | Scheduler trigger or execution via an external scheduler. For more information, see Pipeline execution and triggers. |

The following sections provide more information about the capabilities of the current version.

Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
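
The following sketch shows the dependsOn pattern with two placeholder activities: the second runs only after the first succeeds.

```powershell
# Sketch of chaining: "SecondActivity" depends on "FirstActivity" succeeding.
# Activity names and types are illustrative placeholders.
$chainedActivitiesJson = @'
[
  {
    "name": "FirstActivity",
    "type": "Wait",
    "typeProperties": { "waitTimeInSeconds": 5 }
  },
  {
    "name": "SecondActivity",
    "type": "Wait",
    "dependsOn": [
      { "activity": "FirstActivity", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": { "waitTimeInSeconds": 5 }
  }
]
'@
```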
Branching activities
In the current version, you can branch activities within a pipeline. The If-condition activity provides the same
functionality that an if statement provides in programming languages. It evaluates a set of activities when the
condition evaluates to true and another set of activities when the condition evaluates to false. For examples of
branching activities, see the Branching and chaining activities tutorial.
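
A hedged sketch of an If Condition activity follows; the expression compares a hypothetical pipeline parameter, and each branch carries its own list of activities.

```powershell
# Sketch of an If Condition activity. The pipeline parameter "environment"
# and the nested Wait activities are placeholders.
$ifConditionJson = @'
{
  "name": "CheckEnvironment",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "type": "Expression",
      "value": "@equals(pipeline().parameters.environment, 'prod')"
    },
    "ifTrueActivities": [
      { "name": "RunProdPath", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 5 } }
    ],
    "ifFalseActivities": [
      { "name": "RunTestPath", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 5 } }
    ]
  }
}
'@
```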
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand
or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more information, see
Pipelines and triggers.
Custom state passing
Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value. By using this feature, you can build workflows where values
can pass through activities.
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
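
The following sketch shows the ForEach pattern; the items expression iterates over a hypothetical array parameter, and the inner activity stands in for whatever work runs per item.

```powershell
# Sketch of a ForEach activity iterating over an array pipeline parameter.
# Inside the loop, "@item()" refers to the current element.
$forEachJson = @'
{
  "name": "ForEachFileName",
  "type": "ForEach",
  "typeProperties": {
    "items": { "type": "Expression", "value": "@pipeline().parameters.fileNames" },
    "isSequential": false,
    "activities": [
      { "name": "ProcessOneItem", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 5 } }
    ]
  }
}
'@
```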

Trigger-based flows
Pipelines can be triggered on demand (event-based, for example, when a blob is posted) or by wall-clock time. The pipelines and triggers
article has detailed information about triggers.
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is “delta loads,” in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and control
flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial: Incremental
copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.

| Control activity | Description |
| --- | --- |
| ForEach activity | Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages. |
| Web activity | Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity. |
| Lookup activity | Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities. |
| Get metadata activity | Retrieves the metadata of any data in Azure Data Factory. |
| Wait activity | Pauses the pipeline for a specified period of time. |

Deploy SSIS packages to Azure


If you want to move your SSIS workloads to the cloud, create a data factory by using the current version and
provision an Azure-SSIS Integration Runtime.
The Azure-SSIS Integration Runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to
running your SSIS packages in the cloud. After you provision Azure-SSIS Integration Runtime, you can use the
same tools that you have been using to deploy SSIS packages to an on-premises SSIS environment.
For example, you can use SQL Server Data Tools or SQL Server Management Studio to deploy SSIS packages to
this runtime on Azure. For step-by-step instructions, see the tutorial Deploy SQL Server integration services
packages to Azure.

Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.
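
As a sketch of such a trigger resource, the following defines a hypothetical daily schedule trigger that starts a pipeline and passes a parameter value; all names and times are placeholders.

```powershell
# Sketch of a schedule trigger that runs "ExamplePipeline" once a day and
# passes an argument for one of its parameters. All values are placeholders.
$triggerJson = @'
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2019-05-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "ExamplePipeline", "type": "PipelineReference" },
        "parameters": { "environment": "prod" }
      }
    ]
  }
}
'@
# After deploying the trigger (for example, with Set-AzDataFactoryV2Trigger),
# it must also be started with Start-AzDataFactoryV2Trigger before it fires.
```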

Support for more data stores


The current version supports the copying of data to and from more data stores than V1. For a list of supported
data stores, see the following articles:
Version 1 - supported data stores
Current version - supported data stores

Support for on-demand Spark cluster


The current version supports the creation of an on-demand Azure HDInsight Spark cluster. To create an on-
demand Spark cluster, specify the cluster type as Spark in your on-demand, HDInsight linked service definition.
Then you can configure the Spark activity in your pipeline to use this linked service.
At runtime, when the activity is executed, the Data Factory service automatically creates the Spark cluster for you.
For more information, see the following articles:
Spark Activity in the current version of Data Factory
Azure HDInsight on-demand linked service

Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code in
.NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.

SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and monitor
pipelines.
.NET SDK: The .NET SDK is updated in the current version.
PowerShell: The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK: This SDK is new in the current version.
REST API: The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.

Authoring experience
|  | V2 | V1 |
| --- | --- | --- |
| Azure portal | Yes | Yes |
| Azure PowerShell | Yes | Yes |
| .NET SDK | Yes | Yes |
| REST API | Yes | Yes |
| Python SDK | Yes | No |
| Resource Manager template | Yes | Yes |

Roles and permissions


The Data Factory version 1 Contributor role can be used to create and manage the current version of Data Factory
resources. For more info, see Data Factory Contributor.

Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.

Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts: PowerShell,
.NET, Python, REST API.
Quickstart: Create a data factory by using the
Azure Data Factory UI
2/11/2019 • 10 minutes to read

This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory.
The pipeline that you create in this data factory copies data from one folder to another folder in Azure
Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial:
Transform data by using Spark.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.

2. On the Blob service page, select + Container on the toolbar.


3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.

12. Keep the Container page open. You use it to verify the output at the end of this quickstart.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is
supported only in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure portal.
3. Select Create a resource on the left menu, select Analytics, and then select Data Factory.

4. On the New data factory page, enter ADFTutorialDataFactory for Name.


The name of the Azure data factory must be globally unique. If you see the following error,
change the name of the data factory (for example, <yourname>ADFTutorialDataFactory)
and try creating again. For naming rules for Data Factory artifacts, see the Data Factory -
naming rules article.

5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2.
8. For Location, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Please note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
9. Select Create.
10. After the creation is complete, you see the Data Factory page. Select the Author & Monitor
tile to start the Azure Data Factory user interface (UI) application on a separate tab.

11. On the Let's get started page, switch to the Author tab in the left panel.
Create a linked service
In this procedure, you create a linked service to link your Azure storage account to the data factory. The
linked service has the connection information that the Data Factory service uses at runtime to connect
to it.
1. Select Connections, and then select the New button on the toolbar.

2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.
3. Complete the following steps:
a. For Name, enter AzureStorageLinkedService.
b. For Storage account name, select the name of your Azure storage account.
c. Select Test connection to confirm that the Data Factory service can connect to the storage
account.
d. Select Finish to save the linked service.
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of
type AzureBlob. They refer to the Azure Storage linked service that you created in the previous
section.
The input dataset represents the source data in the input folder. In the input dataset definition, you
specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the
source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition,
you specify the blob container (adftutorial), the folder (output), and the file to which the data is
copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the
system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the
pipeline.
In the linked service settings, you specified the Azure storage account that contains the source data. In
the source dataset settings, you specify where exactly the source data resides (blob container, folder,
and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and
file).
1. Select the + (plus) button, and then select Dataset.
2. On the New Dataset page, select Azure Blob Storage, and then select Finish.

3. In the General tab for the dataset, enter InputDataset for Name.
4. Switch to the Connection tab and complete the following steps:
a. For Linked service, select AzureStorageLinkedService.
b. For File path, select the Browse button.
c. In the Choose a file or folder window, browse to the input folder in the adftutorial
container, select the emp.txt file, and then select Finish.

d. (optional) Select Preview data to preview the data in the emp.txt file.
5. Repeat the steps to create the output dataset:
a. Select the + (plus) button, and then select Dataset.
b. On the New Dataset page, select Azure Blob Storage, and then select Finish.
c. In the General tab, specify OutputDataset for the name.
d. In the Connection tab, select AzureStorageLinkedService as the linked service, and enter
adftutorial/output in the directory field for the folder path. If the output folder does not exist, the
copy activity creates it at runtime.
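
If you later script the same setup, the datasets created here can be expressed as JSON. The following is an approximate sketch of InputDataset (the UI may generate additional properties); OutputDataset is analogous but points at adftutorial/output without a fixed file name.

```powershell
# Approximate JSON for the InputDataset created in the UI steps above.
# The exact JSON the UI generates may include additional properties.
$inputDatasetJson = @'
{
  "name": "InputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "AzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "adftutorial/input",
      "fileName": "emp.txt"
    }
  }
}
'@
# Saved to a file, this could be deployed with, for example:
# Set-AzDataFactoryV2Dataset -ResourceGroupName <rg> -DataFactoryName <factory> `
#     -Name "InputDataset" -DefinitionFile ".\InputDataset.json"
```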

Create a pipeline
In this procedure, you create and validate a pipeline with a copy activity that uses the input and output
datasets. The copy activity copies data from the file you specified in the input dataset settings to the file
you specified in the output dataset settings. If the input dataset specifies only a folder (not the file
name), the copy activity copies all the files in the source folder to the destination.
1. Select the + (plus) button, and then select Pipeline.

2. In the General tab, specify CopyPipeline for Name.


3. In the Activities toolbox, expand Move & Transform. Drag the Copy activity from the
Activities toolbox to the pipeline designer surface. You can also search for activities in the
Activities toolbox. Specify CopyFromBlobToBlob for Name.
4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source
Dataset.
5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset.
6. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings.
Confirm that the pipeline has been successfully validated. To close the validation output, select
the >> (right arrow) button.
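
For reference, the pipeline assembled in these steps corresponds roughly to the following JSON sketch; the UI may add further default settings.

```powershell
# Approximate JSON for the CopyPipeline built in the UI steps above.
$copyPipelineJson = @'
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToBlob",
        "type": "Copy",
        "inputs":  [ { "referenceName": "InputDataset",  "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "OutputDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink":   { "type": "BlobSink" }
        }
      }
    ]
  }
}
'@
```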

Debug the pipeline


In this step, you debug the pipeline before deploying it to Data Factory.
1. On the pipeline toolbar above the canvas, click Debug to trigger a test run.
2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings at
the bottom.
3. Confirm that you see an output file in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.

Trigger the pipeline manually


In this procedure, you deploy entities (linked services, datasets, pipelines) to Azure Data Factory. Then,
you manually trigger a pipeline run.
1. Before you trigger a pipeline, you must publish entities to Data Factory. To publish, select
Publish All on the top.

2. To trigger the pipeline manually, select Trigger on the pipeline toolbar, and then select Trigger
Now.

Monitor the pipeline


1. Switch to the Monitor tab on the left. Use the Refresh button to refresh the list.
2. Select the View Activity Runs link under Actions. You see the status of the copy activity run
on this page.

3. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.

4. Confirm that you see a new file in the output folder.


5. You can switch back to the Pipeline Runs view from the Activity Runs view by selecting the
Pipelines link.

Trigger the pipeline on a schedule


This procedure is optional in this tutorial. You can create a scheduler trigger to schedule the pipeline to
run periodically (hourly, daily, and so on). In this procedure, you create a trigger to run every minute
until the end date and time that you specify.
1. Switch to the Author tab.
2. Go to your pipeline, select Trigger on the pipeline toolbar, and then select New/Edit.
3. On the Add Triggers page, select Choose trigger, and then select New.
4. On the New Trigger page, under End, select On Date, specify an end time a few minutes after
the current time, and then select Apply.
A cost is associated with each pipeline run, so set the end time only a few minutes after the
start time, on the same day, while still leaving enough time for the pipeline to run between
the publish time and the end time. The trigger comes into effect only after you publish the
solution to Data Factory, not when you save the trigger in the UI.

5. On the New Trigger page, select the Activated check box, and then select Next.
6. Review the warning message, and select Finish.

7. Select Publish All to publish changes to Data Factory.


8. Switch to the Monitor tab on the left. Select Refresh to refresh the list. You see that the pipeline
runs once every minute from the publish time to the end time.
Notice the values in the Triggered By column. The manual trigger run was from the Trigger
Now step that you performed earlier.

9. Switch to the Trigger Runs view.

10. Confirm that an output file is created for every pipeline run until the specified end date and time
in the output folder.
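To inspect trigger runs from PowerShell rather than the Monitor tab, a sketch like the following can list recent runs. It assumes the Az.DataFactory module and placeholder resource group, factory, and trigger names; the one-hour time window is just an example.

# List trigger runs in the last hour (sketch; replace the placeholder names with your own)
Get-AzDataFactoryV2TriggerRun -ResourceGroupName "<your resource group>" -DataFactoryName "<your data factory>" `
    -TriggerName "<your trigger name>" `
    -TriggerRunStartedAfter (Get-Date).AddHours(-1) -TriggerRunStartedBefore (Get-Date)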

Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy
data
4/8/2019 • 6 minutes to read

In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to
create a pipeline that copies data from a folder in Azure Blob storage to another folder.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
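For illustration only, granting a user the Data Factory Contributor role at the resource group scope from Azure PowerShell might look like the following sketch; the sign-in name and resource group name are placeholders, not values from this quickstart.

# Grant the Data Factory Contributor role at resource-group scope (sketch; placeholders only)
New-AzRoleAssignment -SignInName "user@contoso.com" `
    -RoleDefinitionName "Data Factory Contributor" `
    -ResourceGroupName "<your resource group>"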
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
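If you already use Azure PowerShell, you can also read the keys without the portal. This is only a sketch; the resource group and account names are placeholders.

# Retrieve the first access key for the storage account (sketch)
$keys = Get-AzStorageAccountKey -ResourceGroupName "<your resource group>" -Name "<your storage account>"
$keys[0].Value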
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.

2. On the Blob service page, select + Container on the toolbar.


3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.

12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
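The same container, folder, and file can be created from Azure PowerShell if you prefer scripting over the portal. This sketch assumes emp.txt already exists at C:\ADFv2QuickStartPSH and that $storageAccountName and $storageAccountKey hold the values you noted in the previous section.

# Create the adftutorial container and upload emp.txt to the input folder (sketch)
$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
New-AzStorageContainer -Name "adftutorial" -Context $context
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" -Container "adftutorial" `
    -Blob "input/emp.txt" -Context $context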
Create a data factory
1. Select New on the left menu, select Data + Analytics, and then select Data Factory.

2. On the New data factory page, enter ADFTutorialDataFactory for Name.


The name of the Azure data factory must be globally unique. If you see the following error,
change the name of the data factory (for example, <yourname>ADFTutorialDataFactory) and
try creating again. For naming rules for Data Factory artifacts, see the Data Factory - naming
rules article.

3. For Subscription, select your Azure subscription in which you want to create the data factory.
4. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
5. For Version, select V2.
6. For Location, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. Note that the associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
7. Select Create.
8. After the creation is complete, you see the Data Factory page. Select the Author & Monitor tile
to start the Azure Data Factory user interface (UI) application on a separate tab.

Start the Copy Data tool


1. On the Let's get started page, select the Copy Data tile to start the Copy Data tool.
2. On the Properties page of the Copy Data tool, you can specify a name for the pipeline and its
description, then select Next.

3. On the Source data store page, complete the following steps:


a. Click + Create new connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Next.

c. On the Specify the Azure Blob storage account page, select your storage account from the
Storage account name list, and then select Finish.
d. Select the newly created linked service as source, then click Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adftutorial/input folder, select the emp.txt file, then click
Choose.
b. Check the Binary copy option to copy the file as-is, and then select Next.

5. On the Destination data store page, select the Azure Blob Storage linked service you just
created, and then select Next.
6. On the Choose the output file or folder page, enter adftutorial/output for the folder path,
then select Next.

7. On the Settings page, select Next to use the default configurations.


8. On the Summary page, review all settings, and select Next.
9. On the Deployment complete page, select Monitor to monitor the pipeline that you created.

10. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select
Refresh to refresh the list.

11. Select the View Activity Runs link in the Actions column. The pipeline has only one activity of
type Copy.
12. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.

13. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.
14. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked
services, datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a
data factory by using the Azure portal.

Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure data factory using
PowerShell
3/5/2019 • 12 minutes to read

This quickstart describes how to use PowerShell to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If
you close and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use
to sign in to the Azure portal:

Connect-AzAccount

3. Run the following command to view all the subscriptions for this account:

Get-AzSubscription

4. If you see multiple subscriptions associated with your account, run the following command to
select the subscription that you want to work with. Replace SubscriptionId with the ID of your
Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later.
Copy the following command text to PowerShell, specify a name for the Azure resource group
in double quotes, and then run the command. For example: "ADFQuickStartRG" .

$resourceGroupName = "ADFQuickStartRG";

2. To create the Azure resource group, run the following command:

$ResGrp = New-AzResourceGroup $resourceGroupName -location 'East US'

If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
3. Define a variable for the data factory name.
IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.

$dataFactoryName = "ADFQuickStartFactory";

4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location
and ResourceGroupName properties from the $ResGrp variable:

$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a
member of contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.

Create a linked service


Create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you create an Azure Storage linked service that is used as both the source
and sink stores. The linked service has the connection information that the Data Factory service uses
at runtime to connect to it.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2QuickStartPSH
folder with the following content. (Create the folder ADFv2QuickStartPSH if it does not
already exist.)

IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before
saving the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=
<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net",
"type": "SecureString"
}
}
}
}

If you are using Notepad, select All files for the Save as type field in the Save as dialog box.
Otherwise, it may add the .txt extension to the file, for example,
AzureStorageLinkedService.json.txt. If you create the file in File Explorer before opening it in
Notepad, you may not see the .txt extension because the Hide extensions for known file
types option is set by default. Remove the .txt extension before proceeding to the next step.
2. In PowerShell, switch to the ADFv2QuickStartPSH folder.

Set-Location 'C:\ADFv2QuickStartPSH'

3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureStorageLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create a dataset
In this step, you define a dataset that represents the data to copy from a source to a sink. The dataset
is of type AzureBlob. It refers to the Azure Storage linked service you created in the previous step.
It takes a parameter to construct the folderPath property. For an input dataset, the copy activity in
the pipeline passes the input path as a value for this parameter. Similarly, for an output dataset, the
copy activity passes the output path as a value for this parameter.
1. Create a JSON file named BlobDataset.json in the C:\ADFv2QuickStartPSH folder, with
the following content:
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{dataset().path}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

2. To create the dataset BlobDataset, run the Set-AzDataFactoryV2Dataset cmdlet:

Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "BlobDataset" `
    -DefinitionFile ".\BlobDataset.json"

Here is the sample output:

DatasetName : BlobDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a pipeline
In this quickstart, you create a pipeline with one activity that takes two parameters - input blob path
and output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity uses the same blob dataset created in the previous step as input and output. When the
dataset is used as an input dataset, input path is specified. And, when the dataset is used as an output
dataset, the output path is specified.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH
folder with the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}

2. To create the pipeline Adfv2QuickStartPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet:

$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"

Create a pipeline run


In this step, you set values for the pipeline parameters: inputPath and outputPath with actual values
of source and sink blob paths. Then, you create a pipeline run by using these arguments.
1. Create a JSON file named PipelineParameters.json in the C:\ADFv2QuickStartPSH folder
with the following content:
{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}

2. Run the Invoke-AzDataFactoryV2Pipeline cmdlet to create a pipeline run and pass in the
parameter values. The cmdlet returns the pipeline run ID for future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name `
-ParameterFile .\PipelineParameters.json
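As a variation, the parameter values can usually be passed inline as a hashtable instead of from a file. The following sketch assumes your version of the Az.DataFactory module supports the -Parameter argument of Invoke-AzDataFactoryV2Pipeline; if it does not, keep using -ParameterFile as shown above.

# Pass the pipeline parameters inline instead of from PipelineParameters.json (sketch)
$RunId = Invoke-AzDataFactoryV2Pipeline `
    -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName $DFPipeLine.Name `
    -Parameter @{ inputPath = "adftutorial/input"; outputPath = "adftutorial/output" }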

Monitor the pipeline run


1. Run the following PowerShell script to continuously check the pipeline run status until it
finishes copying the data. Copy/paste the following script in the PowerShell window, and press
ENTER.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

Here is the sample output of pipeline run:

Pipeline is running...status: InProgress


Pipeline run finished. The status is: Succeeded

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
RunId : 0000000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 9/28/2017 8:28:38 PM
Parameters : {[inputPath, adftutorial/input], [outputPath, adftutorial/output]}
RunStart : 9/28/2017 8:28:14 PM
RunEnd : 9/28/2017 8:28:38 PM
DurationInMs : 24151
Status : Succeeded
Message :

You might see the following error:


Activity CopyFromBlobToBlob failed: Failed to detect region of linked service
'AzureStorage' : 'AzureStorageLinkedService' with error '[Region Resolver] Azure Storage
failed to get address for DNS. Warning: System.Net.Sockets.SocketException (0x80004005): No
such host is known

If you see the error, perform the following steps:


a. In the AzureStorageLinkedService.json, confirm that the name and key of your Azure
Storage Account are correct.
b. Verify that the format of the connection string is correct. The properties, for example,
AccountName and AccountKey, are separated by a semicolon (;).
c. If you have angled brackets surrounding the account name and account key, remove
them.
d. Here is an example connection string:

"connectionString": {
"value":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccountname;AccountKey=mystorag
eaccountkey;EndpointSuffix=core.windows.net",
"type": "SecureString"
}

e. Recreate the linked service by following steps in the Create a linked service section.
f. Rerun the pipeline by following steps in the Create a pipeline run section.
g. Run the current monitoring command again to monitor the new pipeline run.
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.

Write-Output "Activity run details:"


$Result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $DataFactory.DataFactoryName -
ResourceGroupName $ResGrp.ResourceGroupName -PipelineRunId $RunId -RunStartedAfter (Get-
Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$Result

Write-Output "Activity 'Output' section:"


$Result.Output -join "`r`n"

Write-Output "Activity 'Error' section:"


$Result.Error -join "`r`n"

3. Confirm that you see the output similar to the following sample output of activity run result:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, copyDuration, throughput...}
LinkedServiceName :
ActivityRunStart : 9/28/2017 8:28:18 PM
ActivityRunEnd : 9/28/2017 8:28:36 PM
DurationInMs : 18095
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity 'Output' section:


"dataRead": 38
"dataWritten": 38
"copyDuration": 7
"throughput": 0.01
"errors": []
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)"
"usedDataIntegrationUnits": 2
"billedDuration": 14

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies
the emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the
Azure resource group, which includes all the resources in the resource group. If you want to keep the
other resources intact, delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following
command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Note: Deleting a resource group may take some time; please be patient.
If you want to delete just the data factory, not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using .NET SDK
4/28/2019 • 11 minutes to read

This quickstart describes how to use .NET SDK to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and
password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.


7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Visual Studio
The walkthrough in this article uses Visual Studio 2017. You can also use Visual Studio 2013 or 2015.
Azure .NET SDK
Download and install Azure .NET SDK on your machine.

Create an application in Azure Active Directory


Follow the instructions from the sections in this article to do the following tasks:
1. Create an Azure Active Directory application. Create an application in Azure Active Directory
that represents the .NET application you are creating in this tutorial. For the sign-on URL, you can
provide a dummy URL as shown in the article ( https://contoso.org/exampleapp ).
2. Get the application ID and authentication key, and note down these values that you use later
in this tutorial.
3. Get the tenant ID and note down this value that you use later in this tutorial.
4. Assign the application to the Contributor role at the subscription level so that the application can
create data factories in the subscription.

Create a Visual Studio project


Using Visual Studio 2013/2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right.
.NET version 4.5.2 or above is required.
4. Enter ADFv2QuickStart for the Name.
5. Click OK to create the project.
Install NuGet packages
1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install packages. For details, see
the Microsoft.Azure.Management.DataFactory NuGet package.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs, include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets the variables. Replace the place-holders
with your own values. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL
Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.

// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "East US 2";
string dataFactoryName = "<specify the name of data factory to create. It must be globally unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// Specify the container and input folder from which all files need to be copied to the output folder.
string inputBlobPath = "<the path to existing blob(s) to copy data from, e.g. containername/foldername>";
// Specify the container and output folder where the files are copied.
string outputBlobPath = "<the blob path to copy data to, e.g. containername/foldername>";

string storageLinkedServiceName = "AzureStorageLinkedService"; // name of the Azure Storage linked service
string blobDatasetName = "BlobDataset"; // name of the blob dataset
string pipelineName = "Adfv2QuickStartPipeline"; // name of the pipeline

3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked
service, datasets, and a pipeline. You also use this object to monitor the pipeline run details.
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/",
cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Add the following code to the Main method that creates a data factory.

// Create a data factory


Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
Location = region,
Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
    System.Threading.Thread.Sleep(1000);
}

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data
factory. In this Quickstart, you only need to create one Azure Storage linked service for both the copy
source and sink store, named "AzureStorageLinkedService" in the sample.

// Create an Azure Storage linked service


Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(
    new AzureStorageLinkedService
    {
        ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" +
            storageAccount + ";AccountKey=" + storageKey)
    }
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings));

Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, this
Blob dataset references to the Azure Storage linked service you created in the previous step. The
dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter
is used to construct the "folderPath" pointing to where the data resides/stored.

// Create an Azure Blob dataset


Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = new Expression { Value = "@{dataset().path}" },
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "path", new ParameterSpecification { Type = ParameterType.String } }

}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, this pipeline contains one activity and takes two parameters - input blob path and
output blob path. The values for these parameters are set when the pipeline is triggered/run. The
copy activity refers to the same blob dataset created in the previous step as input and output. When
the dataset is used as an input dataset, input path is specified. And, when the dataset is used as an
output dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
This code also sets values of inputPath and outputPath parameters specified in pipeline with the
actual values of source and sink blob paths.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> parameters = new Dictionary<string, object>
{
{ "inputPath", inputBlobPath },
{ "outputPath", outputBlobPath }
};
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName, parameters: parameters).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);
Monitor a pipeline run
1. Add the following code to the Main method to continuously check the status until it finishes
copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName,
runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for
example, size of the data read/written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

RunFilterParameters filterParams = new RunFilterParameters(
    DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
    resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating data factory, linked service, datasets, pipeline, and
pipeline run. It then checks the pipeline run status. Wait until you see the copy activity run details
with the data read/written size. Then, use tools such as Azure Storage Explorer to check that the
blob(s) were copied to "outputBlobPath" from "inputBlobPath" as you specified in the variables.
Sample output

Creating data factory SPv2Factory0907...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
Creating pipeline Adfv2QuickStartPipeline...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"name": "CopyFromBlobToBlob"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 308d222d-3858-48b1-9e66-acd921feaa09
Checking pipeline run status...
Status: InProgress
Status: InProgress
Checking copy activity run details...
{
"dataRead": 331452208,
"dataWritten": 331452208,
"copyDuration": 23,
"throughput": 14073.209,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 23
}

Press any key to exit...

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies
the emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
Clean up resources
To delete the data factory programmatically, add the following lines of code to the program:

Console.WriteLine("Deleting the data factory");


client.Factories.Delete(resourceGroup, dataFactoryName);

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using Python
3/6/2019 • 9 minutes to read

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use Python to create an Azure data factory. The pipeline in this data
factory copies data from one folder to another folder in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create an application in Azure Active Directory following these instructions. Make note of the
following values that you use in later steps: application ID, authentication key, and tenant ID.
Assign the application to the "Contributor" role by following instructions in the same article.
Create and upload an input file
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container and an input
folder in the container. Then, upload the input.txt file to the input folder.

Install the Python package


1. Open a terminal or command prompt with administrator privileges.
2. First, install the Python package for Azure management resources:

pip install azure-mgmt-resource

3. To install the Python package for Data Factory, run the following command:

pip install azure-mgmt-datafactory

The Python SDK for Data Factory supports Python 2.7, 3.3, 3.4, 3.5, 3.6 and 3.7.
Create a data factory client
1. Create a file named datafactory.py. Add the following statements to add references to
namespaces.

from azure.common.credentials import ServicePrincipalCredentials


from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

2. Add the following functions that print information.

def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)

def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")

def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))

3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create the data factory, linked
service, datasets, and pipeline. You also use this object to monitor the pipeline run details. Set
subscription_id variable to the ID of your Azure subscription. For a list of Azure regions in
which Data Factory is currently available, select the regions that interest you on the following
page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
def main():

# Azure subscription ID
subscription_id = '<Specify your Azure Subscription ID>'

# This program creates this resource group. If it's an existing resource group,
# comment out the code that creates the resource group.
rg_name = 'ADFTutorialResourceGroup'

# The data factory name. It must be globally unique.
df_name = '<Specify a name for the data factory. It must be globally unique>'

# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory application/client ID>', secret='<client secret>', tenant='<Active Directory tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}

Create a data factory


Add the following code to the Main method that creates a data factory. If your resource group
already exists, comment out the first create_or_update statement.

# create the resource group


# comment out if the resource group already exists
resource_client.resource_groups.create_or_update(rg_name, rg_params)

#Create a data factory


df_resource = Factory(location='eastus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)
while df.provisioning_state != 'Succeeded':
df = adf_client.factories.get(rg_name, df_name)
time.sleep(1)

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you only need to create one Azure Storage linked service that serves as both
the copy source and sink store, named "AzureStorageLinkedService" in the sample. Replace
<storageaccountname> and <storageaccountkey> with the name and key of your Azure Storage account.

# Create an Azure Storage linked service


ls_name = 'storageLinkedService'

# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<storageaccountkey>')

ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure
Storage linked service you create in the previous step.

# Create an Azure blob dataset (input)


ds_name = 'ds_in'
ds_ls = LinkedServiceReference(ls_name)
blob_path= 'adfv2tutorial/input'
blob_filename = 'input.txt'
ds_azure_blob= AzureBlobDataset(ds_ls, folder_path=blob_path, file_name = blob_filename)
ds = adf_client.datasets.create_or_update(rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)

Create a dataset for sink Azure Blob


Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the data to copy to the sink. This Blob dataset refers to the Azure
Storage linked service you created in the previous step.

# Create an Azure blob dataset (output)


dsOut_name = 'ds_out'
output_blobpath = 'adfv2tutorial/output'
dsOut_azure_blob = AzureBlobDataset(ds_ls, folder_path=output_blobpath)
dsOut = adf_client.datasets.create_or_update(rg_name, df_name, dsOut_name, dsOut_azure_blob)
print_item(dsOut)

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.

# Create a copy activity


act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(ds_name)
dsOut_ref = DatasetReference(dsOut_name)
copy_activity = CopyActivity(act_name,inputs=[dsin_ref], outputs=[dsOut_ref],
source=blob_source, sink=blob_sink)

#Create a pipeline with the copy activity


p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
#Create a pipeline run.
run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name,
{
}
)

Monitor a pipeline run


To monitor the pipeline run, add the following code to the Main method:

#Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
activity_runs_paged = list(adf_client.activity_runs.list_by_pipeline_run(rg_name, df_name,
pipeline_run.run_id, datetime.now() - timedelta(1), datetime.now() + timedelta(1)))
print_activity_run_details(activity_runs_paged[0])

Now, add the following statement to invoke the main method when the program is run:

# Start the main method


main()

Full script
Here is the full Python code:

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

def print_item(group):
    """Print an Azure object instance."""
    print("\tName: {}".format(group.name))
    print("\tId: {}".format(group.id))
    if hasattr(group, 'location'):
        print("\tLocation: {}".format(group.location))
    if hasattr(group, 'tags'):
        print("\tTags: {}".format(group.tags))
    if hasattr(group, 'properties'):
        print_properties(group.properties)
    print("\n")

def print_properties(props):
    """Print a ResourceGroup properties instance."""
    if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
        print("\tProperties:")
        print("\t\tProvisioning State: {}".format(props.provisioning_state))
    print("\n")

def print_activity_run_details(activity_run):
    """Print activity run details."""
    print("\n\tActivity run details\n")
    print("\tActivity run status: {}".format(activity_run.status))
    if activity_run.status == 'Succeeded':
        print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
        print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
        print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
    else:
        print("\tErrors: {}".format(activity_run.error['message']))

def main():

    # Azure subscription ID
    subscription_id = '<your Azure subscription ID>'

    # This program creates this resource group. If it's an existing resource group, comment out the
    # code that creates the resource group.
    rg_name = '<Azure resource group name>'

    # The data factory name. It must be globally unique.
    df_name = '<Your data factory name>'

    # Specify your Active Directory client ID, client secret, and tenant ID
    credentials = ServicePrincipalCredentials(client_id='<Active Directory client ID>',
                                              secret='<client secret>', tenant='<tenant ID>')
    resource_client = ResourceManagementClient(credentials, subscription_id)
    adf_client = DataFactoryManagementClient(credentials, subscription_id)

    rg_params = {'location': 'eastus'}
    df_params = {'location': 'eastus'}

    # Create the resource group
    # Comment out if the resource group already exists
    resource_client.resource_groups.create_or_update(rg_name, rg_params)

    # Create a data factory
    df_resource = Factory(location='eastus')
    df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
    print_item(df)
    while df.provisioning_state != 'Succeeded':
        df = adf_client.factories.get(rg_name, df_name)
        time.sleep(1)

    # Create an Azure Storage linked service
    ls_name = 'storageLinkedService'

    # Specify the name and key of your Azure Storage account
    storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<storage account key>')

    ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
    ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
    print_item(ls)

    # Create an Azure blob dataset (input)
    ds_name = 'ds_in'
    ds_ls = LinkedServiceReference(ls_name)
    blob_path = 'adfv2tutorial/input'
    blob_filename = 'input.txt'
    ds_azure_blob = AzureBlobDataset(ds_ls, folder_path=blob_path, file_name=blob_filename)
    ds = adf_client.datasets.create_or_update(rg_name, df_name, ds_name, ds_azure_blob)
    print_item(ds)

    # Create an Azure blob dataset (output)
    dsOut_name = 'ds_out'
    output_blobpath = 'adfv2tutorial/output'
    dsOut_azure_blob = AzureBlobDataset(ds_ls, folder_path=output_blobpath)
    dsOut = adf_client.datasets.create_or_update(rg_name, df_name, dsOut_name, dsOut_azure_blob)
    print_item(dsOut)

    # Create a copy activity
    act_name = 'copyBlobtoBlob'
    blob_source = BlobSource()
    blob_sink = BlobSink()
    dsin_ref = DatasetReference(ds_name)
    dsOut_ref = DatasetReference(dsOut_name)
    copy_activity = CopyActivity(act_name, inputs=[dsin_ref], outputs=[dsOut_ref],
                                 source=blob_source, sink=blob_sink)

    # Create a pipeline with the copy activity
    p_name = 'copyPipeline'
    params_for_pipeline = {}
    p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
    p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
    print_item(p)

    # Create a pipeline run
    run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name, {})

    # Monitor the pipeline run
    time.sleep(30)
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
    print("\n\tPipeline run status: {}".format(pipeline_run.status))
    activity_runs_paged = list(adf_client.activity_runs.list_by_pipeline_run(rg_name, df_name,
        pipeline_run.run_id, datetime.now() - timedelta(1), datetime.now() + timedelta(1)))
    print_activity_run_details(activity_runs_paged[0])

# Start the main method
main()

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating the data factory, linked service, datasets, pipeline, and pipeline
run. Wait until you see the copy activity run details with the size of data read and written. Then, use tools such
as Azure Storage Explorer to check that the blob was copied from "inputBlobPath" to "outputBlobPath" as you
specified in the variables.
Here is the sample output:
Name: <data factory name>
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>
Location: eastus
Tags: {}

Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory
name>/linkedservices/storageLinkedService

Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in

Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out

Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline

Pipeline run status: Succeeded


Datetime with no tzinfo will be considered UTC.
Datetime with no tzinfo will be considered UTC.

Activity run details

Activity run status: Succeeded


Number of bytes read: 18
Number of bytes written: 18
Copy duration: 4
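
If you prefer to verify the copy from code rather than with Azure Storage Explorer, the following minimal sketch lists the blobs under the output path. It assumes the azure-storage-blob package (version 12 or later), which is not otherwise used in this quickstart, and a connection string placeholder for your storage account.

# Optional verification sketch (assumes the azure-storage-blob v12 package is installed)
from azure.storage.blob import ContainerClient

conn_str = '<your storage account connection string>'
container = ContainerClient.from_connection_string(conn_str, container_name='adfv2tutorial')
for blob in container.list_blobs(name_starts_with='output/'):
    print(blob.name, blob.size)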

Clean up resources
To delete the data factory, add the following code to the program:

adf_client.factories.delete(rg_name,df_name)
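
If you created the resource group only for this quickstart, you can remove it as well. The following minimal sketch reuses the resource_client created earlier in the script; deleting the group removes every resource in it.

# Optional: delete the resource group and everything in it (reuses resource_client from the script)
delete_op = resource_client.resource_groups.delete(rg_name)
delete_op.wait()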

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
3/26/2019 • 8 minutes to read

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use REST API to create an Azure data factory. The pipeline in this data
factory copies data from one location to another location in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Azure subscription. If you don't have a subscription, you can create a free trial account.
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob storage, create an input folder in the container, and upload some
files to the folder. You can use tools such as Azure Storage Explorer to connect to Azure Blob
storage, create a blob container, upload an input file, and verify the output file.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure
PowerShell. This quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory by following these instructions. Make note of the
following values that you use in later steps: application ID, authentication key, and tenant ID.
Assign the application to the Contributor role.

Set global variables


1. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close
and reopen, you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to
the Azure portal:

Connect-AzAccount
Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the following commands after replacing the placeholders with your own values, to set
global variables to be used in later steps.

$tenantID = "<your tenant ID>"


$appId = "<your application ID>"
$authKey = "<your authentication key for the application>"
$subsId = "<your subscription ID to create the factory>"
$resourceGroup = "<your resource group to create the factory>"
$dataFactoryName = "<specify the name of data factory to create. It must be globally
unique.>"
$apiVersion = "2018-06-01"

Authenticate with Azure AD


Run the following commands to authenticate with Azure Active Directory (Azure AD):

$AuthContext = [Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonline.com/${tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -ArgumentList ($appId, $authKey)
$result = $AuthContext.AcquireToken("https://management.core.windows.net/", $cred)
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}

Create a data factory


Run the following commands to create a data factory:

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}?api-version=${apiVersion}"
$body = @"
{
"name": "$dataFactoryName",
"location": "East US",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
Note the following points:
The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

Data factory name "ADFv2QuickStartDataFactory" is not available.

For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Here is the sample response:

{
"name": "<dataFactoryName>",
"tags": {
},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-14T06:22:59.9106216Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "<service principal ID>",
"tenantId": "<tenant ID>"
},
"id": "dataFactoryName",
"type": "Microsoft.DataFactory/factories",
"location": "East US"
}

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data
factory. In this quickstart, you only need to create one Azure Storage linked service as both the copy source
and the sink store, named "AzureStorageLinkedService" in the sample.
Run the following commands to create a linked service named AzureStorageLinkedService:
Replace <accountName> and <accountKey> with name and key of your Azure storage account before
executing the commands.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/linkedservices/AzureStorageLinkedService?api-version=${apiVersion}"
$body = @"
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "@{value=**********; type=SecureString}"
}
},
"etag": "0000c552-0000-0000-0000-59b1459c0000"
}

Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, this Blob
dataset refers to the Azure Storage linked service you created in the previous step. The dataset takes a
parameter whose value is set in an activity that consumes the dataset. The parameter is used to
construct the "folderPath" that points to where the data resides.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/datasets/BlobDataset?api-version=${apiVersion}"
$body = @"
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/datasets/BlobDataset",
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{value=@{dataset().path}; type=Expression}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": "@{type=String}"
}
},
"etag": "0000c752-0000-0000-0000-59b1459d0000"
}

Create pipeline
In this example, the pipeline contains one activity and takes two parameters: the input blob path and the
output blob path. The values for these parameters are set when the pipeline is triggered or run. The copy
activity refers to the same blob dataset created in the previous step as both input and output. When the
dataset is used as an input dataset, the input path is specified. When the dataset is used as an output
dataset, the output path is specified.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline?api-version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:


{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/pipelines/Adfv2QuickStartPipeline",
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
"@{name=CopyFromBlobToBlob; type=Copy; inputs=System.Object[]; outputs=System.Object[];
typeProperties=}"
],
"parameters": {
"inputPath": "@{type=String}",
"outputPath": "@{type=String}"
}
},
"etag": "0000c852-0000-0000-0000-59b1459e0000"
}

Create pipeline run


In this step, you set the values of the inputPath and outputPath parameters specified in the pipeline to the
actual source and sink blob paths, and trigger a pipeline run. The pipeline run ID returned in the response
body is used in the later monitoring API calls.
Replace the values of inputPath and outputPath with your source and sink blob paths to copy data from
and to before running the commands.

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-version=${apiVersion}"
$body = @"
{
"inputPath": "<the path to existing blob(s) to copy data from, e.g. containername/path>",
"outputPath": "<the blob path to copy data to, e.g. containername/path>"
}
"@
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId

Here is the sample output:

{
"runId": "2f26be35-c112-43fa-9eaa-8ba93ea57881"
}

Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying
the data.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"

if ($response.Status -eq "InProgress") {


Start-Sleep -Seconds 15
}
else {
$response | ConvertTo-Json
break
}
}

Here is the sample output:

{
"key": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:39.5561795Z",
"runId": "000000000-0000-0000-0000-000000000000",
"dataFactoryName": "<dataFactoryName>",
"pipelineName": "Adfv2QuickStartPipeline",
"parameters": [
"inputPath: <inputBlobPath>",
"outputPath: <outputBlobPath>"
],
"parametersCount": 2,
"parameterNames": [
"inputPath",
"outputPath"
],
"parameterNamesCount": 2,
"parameterValues": [
"<inputBlobPath>",
"<outputBlobPath>"
],
"parameterValuesCount": 2,
"runStart": "2017-09-07T13:12:00.3710792Z",
"runEnd": "2017-09-07T13:12:39.5561795Z",
"durationInMs": 39185,
"status": "Succeeded",
"message": ""
}

2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.

$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json

Here is the sample output:


{
"value": [
{
"id": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:38.4780542Z",
"pipelineRunId": "000000000-0000-00000-0000-0000000000000",
"pipelineName": "Adfv2QuickStartPipeline",
"status": "Succeeded",
"failureType": "",
"linkedServiceName": "",
"activityName": "CopyFromBlobToBlob",
"activityType": "Copy",
"activityStart": "2017-09-07T13:12:02.3299261Z",
"activityEnd": "2017-09-07T13:12:38.4780542Z",
"duration": 36148,
"input": "@{source=; sink=}",
"output": "@{dataRead=331452208; dataWritten=331452208; copyDuration=22;
throughput=14712.9; errors=System.Object[];
effectiveIntegrationRuntime=DefaultIntegrationRuntime (West US); usedDataIntegrationUnits=2;
billedDuration=22}",
"error": "@{errorCode=; message=; failureType=; target=CopyFromBlobToBlob}"
}
]
}

Verify the output


Use Azure Storage Explorer to check that the blobs were copied from "inputBlobPath" to "outputBlobPath" as
you specified when creating the pipeline run.

Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourceGroup

Run the following command to delete only the data factory:

Remove-AzDataFactoryV2 -Name "<NameOfYourDataFactory>" -ResourceGroupName "<NameOfResourceGroup>"

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Tutorial: Create an Azure data factory using
Azure Resource Manager template
3/26/2019 • 15 minutes to read

This quickstart describes how to use an Azure Resource Manager template to create an Azure data
factory. The pipeline you create in this data factory copies data from one folder to another folder in an
Azure blob storage. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Transform data using Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure
Data Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.

5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.

3. In the New container dialog box, enter adftutorial for the name, and then select OK.

4. Select adftutorial in the list of containers.

5. On the Container page, select Upload on the toolbar.

6. On the Upload blob page, select Advanced.

7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.

John, Doe
Jane, Doe

8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.

10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.

Resource Manager templates


To learn about Azure Resource Manager templates in general, see Authoring Azure Resource Manager
Templates.
The following section provides the complete Resource Manager template for defining Data Factory
entities so that you can quickly run through the tutorial and test the template. To understand how each
Data Factory entity is defined, see Data Factory entities in the template section.
To learn about the JSON syntax and properties for Data Factory resources in a template, see
Microsoft.DataFactory resource types.

Data Factory JSON


Create a JSON file named ADFTutorialARM.json in C:\ADFTutorial folder with the following
content:

{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"dataFactoryName": {
"type": "string",
"metadata": {
"description": "Name of the data factory. Must be globally unique."
}
},
"dataFactoryLocation": {
"type": "string",
"allowedValues": [
"East US",
"East US 2",
"West Europe"
],
"defaultValue": "East US",
"metadata": {
"description": "Location of the data factory. Currently, only East US, East US 2, and West
Europe are supported. "
}
},
"storageAccountName": {
"type": "string",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"storageAccountKey": {
"storageAccountKey": {
"type": "securestring",
"metadata": {
"description": "Key for the Azure storage account."
}
},
"blobContainer": {
"type": "string",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
},
"inputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that has the input file."
}
},
"inputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the input file/blob."
}
},
"outputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that will hold the transformed data."
}
},
"outputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the output file/blob."
}
},
"triggerStartTime": {
"type": "string",
"metadata": {
"description": "Start time for the trigger."
}
},
"triggerEndTime": {
"type": "string",
"metadata": {
"description": "End time for the trigger."
}
}
},
"variables": {
"azureStorageLinkedServiceName": "ArmtemplateStorageLinkedService",
"inputDatasetName": "ArmtemplateTestDatasetIn",
"outputDatasetName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline",
"triggerName": "ArmTemplateTestTrigger"
},
"resources": [{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "[parameters('dataFactoryLocation')]",
"identity": {
"type": "SystemAssigned"
},
"resources": [{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=
',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
},
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
},
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "[parameters('triggerStartTime')]",
"endTime": "[parameters('triggerEndTime')]",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
]
}]
}

Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure
Resource Manager template.
IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and
storageAccountKey parameters in this parameter file. You created the adftutorial container and uploaded
the sample file (emp.txt) to the input folder in this Azure blob storage.
Specify a globally unique name for the data factory for the dataFactoryName parameter. For example:
ARMTutorialFactoryJohnDoe11282017.
For the triggerStartTime, specify the current day in the format: 2017-11-28T00:00:00 .
For the triggerEndTime, specify the next day in the format: 2017-11-29T00:00:00 . You can also check the
current UTC time and specify the next hour or two as the end time. For example, if the UTC time now is 1:32
AM, specify 2017-11-29T03:00:00 as the end time. In this case, the trigger runs the pipeline twice (at 2 AM
and 3 AM).

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"dataFactoryName": {
"value": "<datafactoryname>"
},
"dataFactoryLocation": {
"value": "East US"
},
"storageAccountName": {
"value": "<yourstorageaccountname>"
},
"storageAccountKey": {
"value": "<yourstorageaccountkey>"
},
"blobContainer": {
"value": "adftutorial"
},
"inputBlobFolder": {
"value": "input"
},
"inputBlobName": {
"value": "emp.txt"
},
"outputBlobFolder": {
"value": "output"
},
"outputBlobName": {
"value": "emp.txt"
},
"triggerStartTime": {
"value": "2017-11-28T00:00:00. Set to today"
},
"triggerEndTime": {
"value": "2017-11-29T00:00:00. Set to tomorrow"
}
}
}

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you
can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying
Data Factory entities in these environments.
Deploy Data Factory entities
In PowerShell, run the following command to deploy Data Factory entities using the Resource Manager
template you created earlier in this quickstart.

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFTutorial\ADFTutorialARM.json -TemplateParameterFile C:\ADFTutorial\ADFTutorialARM-Parameters.json

You see output similar to the following sample:

DeploymentName : MyARMDeployment
ResourceGroupName : ADFTutorialResourceGroup
ProvisioningState : Succeeded
Timestamp : 11/29/2017 3:11:13 AM
Mode : Incremental
TemplateLink :
Parameters :
Name Type Value
=============== ============ ==========
dataFactoryName String <data factory name>
dataFactoryLocation String East US
storageAccountName String <storage account name>
storageAccountKey SecureString
blobContainer String adftutorial
inputBlobFolder String input
inputBlobName String emp.txt
outputBlobFolder String output
outputBlobName String emp.txt
triggerStartTime String 11/29/2017 12:00:00 AM
triggerEndTime String 11/29/2017 4:00:00 AM

Outputs :
DeploymentDebugLogLevel :

Start the trigger


The template deploys the following Data Factory entities:
Azure Storage linked service
Azure Blob datasets (input and output)
Pipeline with a copy activity
Trigger to trigger the pipeline
The deployed trigger is in stopped state. One of the ways to start the trigger is to use the Start-AzDataFactoryV2Trigger PowerShell cmdlet. The following procedure provides detailed steps:
1. In the PowerShell window, create a variable to hold the name of the resource group. Copy the
following command into the PowerShell window, and press ENTER. If you have specified a
different resource group name for the New-AzResourceGroupDeployment command, update the
value here.

$resourceGroupName = "ADFTutorialResourceGroup"

2. Create a variable to hold the name of the data factory. Specify the same name that you specified
in the ADFTutorialARM-Parameters.json file.
$dataFactoryName = "<yourdatafactoryname>"

3. Set a variable for the name of the trigger. The name of the trigger is hardcoded in the Resource
Manager template file (ADFTutorialARM.json).

$triggerName = "ArmTemplateTestTrigger"

4. Get the status of the trigger by running the following PowerShell command after specifying the
name of your data factory and trigger:

Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName

Here is the sample output:

TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Stopped

Notice that the runtime state of the trigger is Stopped.


5. Start the trigger. The trigger runs the pipeline defined in the template on the hour. That is, if you
execute this command at 2:25 PM, the trigger runs the pipeline at 3 PM for the first time. Then,
it runs the pipeline hourly until the end time you specified for the trigger.

Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -TriggerName $triggerName

Here is the sample output:

Confirm
Are you sure you want to start trigger 'ArmTemplateTestTrigger' in data factory
'ARMFactory1128'?
[Y] Yes [N] No [S] Suspend [?] Help (default is "Y"): y
True

6. Confirm that the trigger has been started by running the Get-AzDataFactoryV2Trigger command
again.

Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -TriggerName $triggerName

Here is the sample output:

TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Started
Monitor the pipeline
1. After signing in to the Azure portal, select All services, search with a keyword such as data fa,
and select Data factories.

2. In the Data Factories page, click the data factory you created. If needed, filter the list with the
name of your data factory.

3. In the Data factory page, click Monitor & Manage tile.


4. The Data Integration Application should open in a separate tab in the web browser. If the
monitor tab is not active, switch to the monitor tab. Notice that the pipeline run was triggered
by a scheduler trigger.

IMPORTANT
You see pipeline runs only on the hour (for example, 4 AM, 5 AM, 6 AM, and so on). Click Refresh on the
toolbar to refresh the list when the time reaches the next hour.

5. Click the link in the Actions columns.

6. You see the activity runs associated with the pipeline run. In this quickstart, the pipeline has only
one activity of type: Copy. Therefore, you see a run for that activity.

7. Click the link under the Output column. You see the output from the copy operation in an Output
window. Click the maximize button to see the full output. Close the maximized output window when
you're done.
8. Stop the trigger once you see a successful or failed run. The trigger runs the pipeline once an hour
and copies the same file from the input folder to the output folder for each run. To stop
the trigger, run the following command in the PowerShell window.

Stop-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $triggerName

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies the
emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, click Refresh to see the output folder.

2. Click output in the folder list.


3. Confirm that the emp.txt is copied to the output folder.
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following
command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourceGroupName

Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

JSON definitions for entities


The following Data Factory entities are defined in the JSON template:
Azure Storage linked service
Azure blob input dataset
Azure Blob output dataset
Data pipeline with a copy activity
Trigger
Azure Storage linked service
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites. You specify the name and
key of your Azure storage account in this section. See Azure Storage linked service for details about
JSON properties used to define an Azure Storage linked service.
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=
',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
}

The connectionString uses the storageAccountName and storageAccountKey parameters. The values
for these parameters are passed by using a parameter file. The definition also uses the
azureStorageLinkedServiceName variable and the dataFactoryName parameter defined in the template.
Azure blob input dataset
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. In Azure blob dataset definition, you specify names of
blob container, folder, and file that contains the input data. See Azure Blob dataset properties for details
about JSON properties used to define an Azure Blob dataset.

{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},

Azure blob output dataset


You specify the name of the folder in the Azure Blob Storage that holds the copied data from the input
folder. See Azure Blob dataset properties for details about JSON properties used to define an Azure
Blob dataset.
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
}

Data pipeline
You define a pipeline that copies data from one Azure blob dataset to another Azure blob dataset. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.

{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
}

Trigger
You define a trigger that runs the pipeline once an hour. The deployed trigger is in stopped state. Start
the trigger by using the Start-AzDataFactoryV2Trigger cmdlet. For more information about triggers,
see Pipeline execution and triggers article.

{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-28T00:00:00",
"endTime": "2017-11-29T00:00:00",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}

Reuse the template


In the tutorial, you created a template for defining Data Factory entities and a template for passing
values for parameters. To use the same template to deploy Data Factory entities to different
environments, you create a parameter file for each environment and use it when deploying to that
environment.
Example:

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Dev.json

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Test.json

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Production.json

Notice that the first command uses the parameter file for the development environment, the second one for the
test environment, and the third one for the production environment.
You can also reuse the template to perform repeated tasks. For example, create many data factories with
one or more pipelines that implement the same logic but each data factory uses different Azure storage
accounts. In this scenario, you use the same template in the same environment (dev, test, or production)
with different parameter files to create data factories.
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Create Azure Data Factory Data Flow
5/6/2019 • 2 minutes to read

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.

Mapping Data Flows in ADF provide a way to transform data at scale without any coding required. You can design
a data transformation job in the data flow designer by constructing a series of transformations. Start with any
number of source transformations followed by data transformation steps. Then, complete your data flow with a sink
to land your results in a destination.
Get started by first creating a new V2 Data Factory from the Azure portal. After creating your new factory, click on
the "Author & Monitor" tile to launch the Data Factory UI.

Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, create "Pipeline from Template" and select the Data Flow category from the template
gallery.
You will be prompted to enter your Azure Blob Storage account information.

The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage account so that you can execute the samples.

Create new data flow


Use the Create Resource "plus sign" button in the ADF UI to create Data Flows.

Next steps
Begin building your data transformation with a source transformation.
Copy data from Azure Blob storage to a SQL
database by using the Copy Data tool
3/6/2019 • 5 minutes to read

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from Azure Blob storage to a SQL database.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source data store. If you don't have an Azure storage account,
see the instructions in Create a storage account.
Azure SQL Database: Use a SQL database as the sink data store. If you don't have a SQL database, see the
instructions in Create a SQL database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad. Copy the following text and save it in a file named inputEmp.txt on your disk:

John|Doe
Jane|Doe

2. Create a container named adfv2tutorial and upload the inputEmp.txt file to the container. You can use
various tools to perform these tasks, such as Azure Storage Explorer, or script them as shown in the sketch after these steps.
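
If you would rather script this step, the following minimal sketch creates the container and uploads the file with the azure-storage-blob package (version 12 or later). The package and the connection string placeholder are assumptions; the tutorial itself only requires the portal or Storage Explorer.

# Optional sketch: create the adfv2tutorial container and upload inputEmp.txt programmatically
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

conn_str = '<your storage account connection string>'
service = BlobServiceClient.from_connection_string(conn_str)

# Create the container (ignore the error if it already exists)
try:
    service.create_container('adfv2tutorial')
except ResourceExistsError:
    pass

# Upload the local inputEmp.txt file into the input folder of the container
blob = service.get_blob_client(container='adfv2tutorial', blob='input/inputEmp.txt')
with open('inputEmp.txt', 'rb') as data:
    blob.upload_blob(data, overwrite=True)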
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL database:

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);


2. Allow Azure services to access SQL Server. Verify that the setting Allow access to Azure services is
enabled for your server that's running SQL Database. This setting lets Data Factory write data to your
database instance. To verify and turn on this setting, go to your Azure SQL server > Security > Firewalls
and virtual networks > set the Allow access to Azure services option to ON.

Create a data factory


1. On the left menu, select + New > Data + Analytics > Data Factory:

2. On the New data factory page, under Name, enter ADFTutorialDataFactory.


The name for your data factory must be globally unique. You might receive the following error message:

If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.

11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to launch the Copy Data tool.
2. On the Properties page, under Task name, enter CopyFromBlobToSqlPipeline. Then select Next. The
Data Factory UI creates a pipeline with the specified task name.

3. On the Source data store page, complete the following steps:


a. Click + Create new connection to add a connection
b. Select Azure Blob Storage from the gallery, and then select Next.

c. On the New Linked Service page, select your storage account from the Storage account name list,
and then select Finish.
d. Select the newly created linked service as source, then click Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adfv2tutorial/input folder, select the inputEmp.txt file, then click
Choose.
b. Click Next to move to next step.
5. On the File format settings page, notice that the tool automatically detects the column and row delimiters.
Select Next. You also can preview data and view the schema of the input data on this page.

6. On the Destination data store page, complete the following steps:


a. Click + Create new connection to add a connection

b. Select Azure SQL Database from the gallery, and then select Next.

c. On the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password, then select Finish.
d. Select the newly created linked service as sink, then click Next.

7. On the Table mapping page, select the [dbo].[emp] table, and then select Next.
8. On the Schema mapping page, notice that the first and second columns in the input file are mapped to the
FirstName and LastName columns of the emp table. Select Next.

9. On the Settings page, select Next.


10. On the Summary page, review the settings, and then select Next.
11. On the Deployment page, select Monitor to monitor the pipeline (task).

12. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list.
13. To view the activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. For details about the copy operation, select the Details link (eyeglasses icon) in the
Actions column. To go back to the Pipeline Runs view, select the Pipelines link at the top. To refresh the
view, select Refresh.

14. Verify that the data is inserted into the emp table in your SQL database (a programmatic check is sketched after this list).

15. Select the Author tab on the left to switch to the editor mode. You can update the linked services, datasets,
and pipelines that were created via the tool by using the editor. For details on editing these entities in the
Data Factory UI, see the Azure portal version of this tutorial.
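
If you want to check the result from code instead of the portal or a query editor, the following minimal sketch queries the emp table with pyodbc. The pyodbc package, the ODBC Driver 17 for SQL Server, and the server, database, user, and password placeholders are all assumptions, not part of this tutorial.

# Optional sketch: confirm the copied rows landed in dbo.emp (assumes pyodbc and ODBC Driver 17 for SQL Server)
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<your server name>.database.windows.net;'
    'DATABASE=<your database name>;'
    'UID=<your user name>;PWD=<your password>'
)
cursor = conn.cursor()
cursor.execute('SELECT ID, FirstName, LastName FROM dbo.emp')
for row in cursor.fetchall():
    print(row.ID, row.FirstName, row.LastName)
conn.close()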
Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a SQL
database by using Azure Data Factory
3/26/2019 • 10 minutes to read

In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data from Azure Blob storage to a SQL database. The configuration pattern in this tutorial
applies to copying from a file-based data store to a relational data store. For a list of data stores supported as
sources and sinks, see the supported data stores table.

NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account. You use Blob storage as a source data store. If you don't have a storage account, see
Create an Azure storage account for steps to create one.
Azure SQL Database. You use the database as a sink data store. If you don't have a SQL database, see Create
a SQL database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:

John,Doe
Jane,Doe

2. Create a container named adftutorial in your Blob storage. Create a folder named input in this container.
Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure Storage
Explorer to do these tasks.
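If you prefer to script this step instead of using the portal or Storage Explorer, the following Azure PowerShell
sketch creates the container and uploads the file. It assumes the Az.Storage module is installed, that emp.txt was
saved to C:\emp.txt, and that you substitute your own storage account name and key.

# Sketch only: create the adftutorial container and upload emp.txt to the input folder.
$ctx = New-AzStorageContext -StorageAccountName "<your storage account>" -StorageAccountKey "<your storage account key>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx
Set-AzStorageBlobContent -File "C:\emp.txt" -Container "adftutorial" -Blob "input/emp.txt" -Context $ctx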
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON for
your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting,
take the following steps:
a. On the left, select More services > SQL servers.
b. Select your server, and under SETTINGS select Firewall.
c. On the Firewall settings page, select ON for Allow access to Azure services.
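If you manage the server from PowerShell instead of the portal, a hedged equivalent of this setting is shown
below. It assumes the Az.Sql module is installed and that your version exposes the AllowAllAzureIPs switch; the
resource group and server names are placeholders.

# Sketch only: enable "Allow access to Azure services" on the logical SQL server.
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" `
    -ServerName "<your server name>" `
    -AllowAllAzureIPs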

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome. Currently, Data Factory UI is supported only in Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name of the Azure data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.

4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version, select V2.
7. Under Location, select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Pin to dashboard.
9. Select Create.
10. On the dashboard, you see the following tile with the status Deploying Data Factory:
11. After the creation is finished, you see the Data factory page as shown in the image.

12. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the Let's get started page, select Create pipeline.
2. In the General tab for the pipeline, enter CopyPipeline for Name of the pipeline.
3. In the Activities tool box, expand the Move and Transform category, and drag and drop the Copy Data
activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for Name.

Configure source
1. Go to the Source tab. Select + New to create a source dataset.
2. In the New Dataset window, select Azure Blob Storage, and then select Finish. The source data is in
Blob storage, so you select Azure Blob Storage for the source dataset.
3. You see a new tab opened for blob dataset. On the General tab at the bottom of the Properties window,
enter SourceBlobDataset for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.

5. In the New Linked Service window, enter AzureStorageLinkedService as name, select your storage
account from the Storage account name list, then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse.

7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select Finish.
8. Confirm that File format is set to Text format and that Column delimiter is set to Comma ( , ). If the
source file uses different row and column delimiters, you can select Detect Text Format for File format.
The Copy Data tool detects the file format and delimiters automatically for you. You can still override these
values. To preview data on this page, select Preview data.

9. Go to the Schema tab of the Properties window, and select Import Schema. Notice that the application
detected two columns in the source file. You import the schema here so that you can map columns from the
source data store to the sink data store. If you don't need to map columns, you can skip this step. For this
tutorial, import the schema.
10. Now, go back to the pipeline -> Source tab, confirm that SourceBlobDataset is selected. To preview data
on this page, select Preview data.

Configure sink
1. Go to the Sink tab, and select + New to create a sink dataset.

2. In the New Dataset window, input "SQL" in the search box to filter the connectors, then select Azure SQL
Database, and then select Finish. In this tutorial, you copy data to a SQL database.
3. On the General tab of the Properties window, in Name, enter OutputSqlDataset.

4. Go to the Connection tab, and next to Linked service, select + New. A dataset must be associated with a
linked service. The linked service has the connection string that Data Factory uses to connect to the SQL
database at runtime. The dataset specifies the table to which the data is copied.

5. In the New Linked Service window, take the following steps:


a. Under Name, enter AzureSqlDatabaseLinkedService.
b. Under Server name, select your SQL Server instance.
c. Under Database name, select your SQL database.
d. Under User name, enter the name of the user.
e. Under Password, enter the password for the user.
f. Select Test connection to test the connection.
g. Select Save to save the linked service.

6. In Table, select [dbo].[emp].


7. Go to the Schema tab, and select Import Schema.

8. Select the ID column, and then select Delete. The ID column is an identity column in the SQL database, so
the copy activity doesn't need to insert data into this column.

9. Go to the tab with the pipeline, and in Sink Dataset, confirm that OutputSqlDataset is selected.
Configure mapping
Go to the Mapping tab at the bottom of the Properties window, and select Import Schemas. Notice that the
first and second columns in the source file are mapped to FirstName and LastName in the SQL database.

Validate the pipeline


To validate the pipeline, select Validate from the tool bar.
You can see the JSON code associated with the pipeline by clicking Code on the upper-right.

Debug and publish the pipeline


You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory or
your own Azure Repos Git repository.
1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab
at the bottom of the window.
2. Once the pipeline runs successfully, in the top toolbar, select Publish All. This action publishes the entities
(linked services, datasets, and pipelines) you created to Data Factory.

3. Wait until you see the Successfully published message. To see notification messages, click Show
Notifications on the top-right (bell button).

Trigger the pipeline manually


In this step, you manually trigger the pipeline you published in the previous step.
1. Select Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select Finish.
2. Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger. You can use
links in the Actions column to view activity details and to rerun the pipeline.

3. To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. In this example, there is only one activity, so you see only one entry in the list. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. Select Pipelines at the top
to go back to the Pipeline Runs view. To refresh the view, select Refresh.

4. Verify that two more rows are added to the emp table in the SQL database.
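One way to check the table without leaving PowerShell is a quick query with Invoke-Sqlcmd. This sketch assumes
the SqlServer module is installed and uses placeholder server, database, and credential values.

# Sketch only: count the rows that the copy activity wrote to dbo.emp.
Invoke-Sqlcmd -ServerInstance "<your server name>.database.windows.net" `
    -Database "<your database name>" `
    -Username "<your user name>" -Password "<your password>" `
    -Query "SELECT COUNT(*) AS EmpRowCount FROM dbo.emp;"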

Trigger the pipeline on a schedule


In this step, you create a schedule trigger for the pipeline. The trigger runs the pipeline on the specified
schedule, such as hourly or daily. In this example, you set the trigger to run every minute until the specified end
datetime.
1. Go to the Author tab on the left above the monitor tab.
2. Go to your pipeline, click Trigger on the tool bar, and select New/Edit.
3. In the Add Triggers window, select Choose trigger, and then select + New.

4. In the New Trigger window, take the following steps:


a. Under Name, enter RunEveryMinute.
b. Under End, select On Date.
c. Under End On, select the drop-down list.
d. Select the current day option. By default, the end day is set to the next day.
e. Update the minutes part to be a few minutes past the current datetime. The trigger is activated only after
you publish the changes. If you set it to only a couple of minutes apart and you don't publish it by then, you
don't see a trigger run.
f. Select Apply.
g. Select the Activated option. You can deactivate it and activate it later by using this check box.
h. Select Next.
IMPORTANT
A cost is associated with each pipeline run, so set the end date appropriately.

5. On the Trigger Run Parameters page, review the warning, and then select Finish. The pipeline in this
example doesn't take any parameters.
6. Click Publish All to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.

8. To switch from the Pipeline Runs view to the Trigger Runs view, select Pipeline Runs and then select
Trigger Runs.

9. You see the trigger runs in a list.

10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
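You can also watch these runs from Azure PowerShell instead of the Monitor tab. The following sketch lists
pipeline runs from the last hour and then retrieves the activity runs for one of them; the resource group and data
factory names are placeholders.

# Sketch only: list recent pipeline runs, then drill into one run at the activity level.
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<your resource group>" `
    -DataFactoryName "<your data factory name>" `
    -LastUpdatedAfter (Get-Date).AddHours(-1) -LastUpdatedBefore (Get-Date)
$runs | Select-Object PipelineName, RunStart, Status

# Activity-level details for the first run in the list.
Get-AzDataFactoryV2ActivityRun -ResourceGroupName "<your resource group>" `
    -DataFactoryName "<your data factory name>" `
    -PipelineRunId $runs[0].RunId `
    -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date)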

Next steps
The pipeline in this sample copies data from Azure Blob storage to a SQL database. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
5/15/2019 • 10 minutes to read

In this tutorial, you create a Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL
Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational
data store. For a list of data stores supported as sources and sinks, see supported data stores table.
You perform the following steps in this tutorial:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the .NET SDK. To use other mechanisms to interact with Azure Data Factory, refer to the
samples under "Quickstarts".
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory by following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to
the Contributor role by following the instructions in the same article.
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by performing the following steps:
Create a source blob
1. Launch Notepad. Copy the following text and save it as inputEmp.txt file on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Ensure that the Allow access to Azure services setting is turned
ON for your Azure SQL server so that the Data Factory service can write data to your Azure SQL server.
To verify and turn on this setting, take the following steps:
a. Click the More services hub on the left, and then click SQL servers.
b. Select your server, and click Firewall under SETTINGS.
c. On the Firewall settings page, click ON for Allow access to Azure services.

Create a Visual Studio project


Using Visual Studio 2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right. .NET
version 4.5.2 or above is required.
4. Enter ADFv2Tutorial for the Name.
5. Click OK to create the project.

Install NuGet packages


1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install the packages. Refer to the
Microsoft.Azure.Management.DataFactory NuGet package page for details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs, and include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets variables. Replace the placeholders with your own
values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products available
by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.

// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";
string region = "East US";
string dataFactoryName = "<specify the name of a data factory to create. It must be globally unique.>";

// Specify the source Azure Blob information
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
string inputBlobPath = "adfv2tutorial/";
string inputBlobName = "inputEmp.txt";

// Specify the sink Azure SQL Database information
string azureSqlConnString = "Server=tcp:<your server name>.database.windows.net,1433;Database=<your database name>;User ID=<your username>@<your server name>;Password=<your password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30";
string azureSqlTableName = "dbo.emp";

string storageLinkedServiceName = "AzureStorageLinkedService";
string sqlDbLinkedServiceName = "AzureSqlDbLinkedService";
string blobDatasetName = "BlobDataset";
string sqlDatasetName = "SqlDataset";
string pipelineName = "Adfv2TutorialBlobToSqlCopy";

3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create a data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.

// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Add the following code to the Main method that creates a data factory.
// Create a data factory
Console.WriteLine("Creating a data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
    Location = region,
    Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
    System.Threading.Thread.Sleep(1000);
}

Create linked services


In this tutorial, you create two linked services for source and sink respectively:
Create an Azure Storage linked service
Add the following code to the Main method that creates an Azure Storage linked service. Learn more from
Azure Blob linked service properties on supported properties and details.

// Create an Azure Storage linked service
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(
    new AzureStorageLinkedService
    {
        ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" + storageAccount + ";AccountKey=" + storageKey)
    }
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings));

Create an Azure SQL Database linked service


Add the following code to the Main method that creates an Azure SQL Database linked service. Learn more
from Azure SQL Database linked service properties on supported properties and details.

// Create an Azure SQL Database linked service
Console.WriteLine("Creating linked service " + sqlDbLinkedServiceName + "...");

LinkedServiceResource sqlDbLinkedService = new LinkedServiceResource(
    new AzureSqlDatabaseLinkedService
    {
        ConnectionString = new SecureString(azureSqlConnString)
    }
);
client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, sqlDbLinkedServiceName, sqlDbLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(sqlDbLinkedService, client.SerializationSettings));

Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. Learn more from Azure Blob
dataset properties on supported properties and details.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
The blob format indicating how to parse the content: TextFormat and its settings (for example, column
delimiter).
The data structure, including column names and data types which in this case map to the sink SQL table.

// Create an Azure Blob dataset


Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = inputBlobPath,
FileName = inputBlobName,
Format = new TextFormat { ColumnDelimiter = "|" },
Structure = new List<DatasetDataElement>
{
new DatasetDataElement
{
Name = "FirstName",
Type = "String"
},
new DatasetDataElement
{
Name = "LastName",
Type = "String"
}
}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));

Create a dataset for sink Azure SQL Database


Add the following code to the Main method that creates an Azure SQL Database dataset. Learn more from
Azure SQL Database dataset properties on supported properties and details.
You define a dataset that represents the sink data in Azure SQL Database. This dataset refers to the Azure SQL
Database linked service you create in the previous step. It also specifies the SQL table that holds the copied data.
// Create an Azure SQL Database dataset
Console.WriteLine("Creating dataset " + sqlDatasetName + "...");
DatasetResource sqlDataset = new DatasetResource(
new AzureSqlTableDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = sqlDbLinkedServiceName
},
TableName = azureSqlTableName
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, sqlDatasetName, sqlDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(sqlDataset, client.SerializationSettings));

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, this
pipeline contains one activity: copy activity, which takes in the Blob dataset as source and the SQL dataset as sink.
Learn more from Copy Activity Overview on copy activity details.

// Create a pipeline with copy activity


Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToSQL",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = sqlDatasetName
}
},
Source = new BlobSource { },
Sink = new SqlSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
// Create a pipeline run
Console.WriteLine("Creating pipeline run...");
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Monitor a pipeline run


1. Add the following code to the Main method to continuously check the status of the pipeline run until it
finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for example, size of the
data read/written.

// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(
    resourceGroup, dataFactoryName, runResponse.RunId,
    DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10)).ToList();

if (pipelineRun.Status == "Succeeded")
{
    Console.WriteLine(activityRuns.First().Output);
}
else
{
    Console.WriteLine(activityRuns.First().Error);
}

Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating a data factory, linked service, datasets, pipeline, and pipeline run. It
then checks the pipeline run status. Wait until you see the copy activity run details with data read/written size.
Then, use tools such as SSMS (SQL Server Management Studio) or Visual Studio to connect to your destination
Azure SQL Database and check if the data is copied into the table you specified.
Sample output

Creating a data factory AdfV2Tutorial...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
}
}
}
}
Creating linked service AzureSqlDbLinkedService...
{
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adfv2tutorial/",
"fileName": "inputEmp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|"
}
},
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
}
Creating dataset SqlDataset...
{
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureSqlDbLinkedService"
}
}
}
}
Creating pipeline Adfv2TutorialBlobToSqlCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "BlobDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SqlDataset"
}
],
"name": "CopyFromBlobToSQL"
}
]
}
}
Creating pipeline run...
Pipeline run ID: 1cd03653-88a0-4c90-aabc-ae12d843e252
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 18,
"dataWritten": 28,
"rowsCopied": 2,
"copyDuration": 2,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 2
}

Press any key to exit...

Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You
learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to cloud:
Copy data from on-premises to cloud
Copy data from an on-premises SQL Server
database to Azure Blob storage by using the Copy
Data tool
4/8/2019 • 8 minutes to read

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from an on-premises SQL Server database to Azure Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the
Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Query.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

Azure storage account


In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose storage account, see Create a storage account for instructions to
create one. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
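As an alternative to copying the key from the portal, you can read it with Azure PowerShell; the resource group
and account names below are placeholders.

# Sketch only: retrieve the storage account keys and show key1.
$keys = Get-AzStorageAccountKey -ResourceGroupName "<your resource group>" -Name "<your storage account>"
$keys[0].Value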
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, in the Name box, enter adftutorial, and then select OK.

4. In the list of containers, select adftutorial.


5. Keep the Container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Create a data factory


1. On the menu on the left, select New > Data + Analytics > Data Factory.
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name of the data factory must be globally unique. If you see the following error message for the name
field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming
rules for Data Factory artifacts, see Data Factory naming rules.

3. Select the Azure subscription in which you want to create the data factory.
4. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version, select V2.
6. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other locations/regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, you see the following tile with the status Deploying Data Factory:
10. After the creation is finished, you see the Data Factory page as shown in the image.

11. Select Author & Monitor to launch the Data Factory user interface in a separate tab.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select Copy Data to launch the Copy Data tool.
2. On the Properties page of the Copy Data tool, under Task name, enter
CopyFromOnPremSqlToAzureBlobPipeline. Then select Next. The Copy Data tool creates a pipeline
with the name you specify for this field.

3. On the Source data store page, click on Create new connection.


4. Under New Linked Service, search for SQL Server, and then select Next.

5. Under New Linked Service (SQL Server) Name, enter SqlServerLinkedService. Select +New under
Connect via integration runtime. You must create a self-hosted integration runtime, download it to your
machine, and register it with Data Factory. The self-hosted integration runtime copies data between your
on-premises environment and the cloud.
6. In the Integration Runtime Setup dialog box, select Private Network. Then select Next.
7. In the Integration Runtime Setup dialog box under Name, enter TutorialIntegrationRuntime. Then
select Next.

8. Select Click here to launch the express setup for this computer. This action installs the integration
runtime on your machine and registers it with Data Factory. Alternatively, you can use the manual setup
option to download the installation file, run it, and use the key to register the integration runtime.
9. Run the downloaded application. You see the status of the express setup in the window.
10. Confirm that TutorialIntegrationRuntime is selected for the Integration Runtime field.
11. In Specify the on-premises SQL Server database, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your on-premises SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select appropriate authentication.
e. Under User name, enter the name of user with access to on-premises SQL Server.
f. Enter the password for the user. Select Finish.
12. Select Next.
13. On the Select tables from which to copy the data or use a custom query page, select the [dbo].[emp]
table in the list, and select Next. You can select any other table based on your database.

14. On the Destination data store page, select Create new connection
15. In New Linked Service, search for and select Azure Blob Storage, and then select Continue.
16. On the New Linked Service (Azure Blob Storage) dialog, take the following steps:

a. Under Name, enter AzureStorageLinkedService.
b. Under Connect via integration runtime, select TutorialIntegrationRuntime.
c. Under Storage account name, select your storage account from the drop-down list.
d. Select Next.

17. In the Destination data store dialog, select Next. In Connection properties, for Azure storage service,
select Azure Blob Storage. Then select Next.
18. In the Choose the output file or folder dialog, under Folder path, enter adftutorial/fromonprem. You
created the adftutorial container as part of the prerequisites. If the output folder doesn't exist (in this case
fromonprem), Data Factory automatically creates it. You can also use the Browse button to browse the
blob storage and its containers/folders. If you don't specify a value under File name, the name from the
source is used by default (in this case, dbo.emp).

19. On the File format settings dialog, select Next.


20. On the Settings dialog, select Next.
21. On the Summary dialog, review values for all the settings, and select Next.

22. On the Deployment page, select Monitor to monitor the pipeline or task you created.

23. On the Monitor tab, you can view the status of the pipeline you created. You can use the links in the Action
column to view activity runs associated with the pipeline run and to rerun the pipeline.
24. Select the View Activity Runs link in the Actions column to see activity runs associated with the pipeline
run. To see details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.
To switch back to the Pipeline Runs view, select Pipelines at the top.

25. Confirm that you see the output file in the fromonprem folder of the adftutorial container.

26. Select the Edit tab on the left to switch to the editor mode. You can update the linked services, datasets, and
pipelines created by the tool by using the editor. Select Code to view the JSON code associated with the
entity opened in the editor. For details on how to edit these entities in the Data Factory UI, see the Azure
portal version of this tutorial.
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Blob storage. You learned
how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from an on-premises SQL Server database
to Azure Blob storage
4/8/2019 • 9 minutes to read

In this tutorial, you use the Azure Data Factory user interface (UI) to create a data factory pipeline that copies data
from an on-premises SQL Server database to Azure Blob storage. You create and use a self-hosted integration
runtime, which moves data between on-premises and cloud data stores.

NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select your
user name, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the Azure
portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

6. In the tree view, right-click the database that you created, and then select New Query.
Azure storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose Azure storage account, see Create a storage account. The pipeline
in the data factory that you create in this tutorial copies data from the on-premises SQL Server database (source)
to Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, under Name, enter adftutorial. Then select OK.

4. In the list of containers, select adftutorial.


5. Keep the container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open the Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name of the data factory must be globally unique. If you see the following error message for the name field,
change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see Data Factory naming rules.

1. Select the Azure subscription in which you want to create the data factory.
2. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
3. Under Version, select V2.
4. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Storage and SQL Database) and computes (for example,
Azure HDInsight) used by Data Factory can be in other regions.
5. Select Pin to dashboard.
6. Select Create.
7. On the dashboard, you see the following tile with the status Deploying Data Factory:
8. After the creation is finished, you see the Data Factory page as shown in the image:

9. Select the Author & Monitor tile to launch the Data Factory UI in a separate tab.

Create a pipeline
1. On the Let's get started page, select Create pipeline. A pipeline is automatically created for you. You see
the pipeline in the tree view, and its editor opens.
2. On the General tab at the bottom of the Properties window, in Name, enter SQLServerToBlobPipeline.

3. In the Activities tool box, expand DataFlow. Drag and drop the Copy activity to the pipeline design
surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.

5. In the New Dataset window, search for SQL Server. Select SQL Server, and then select Finish. You see a
new tab titled SqlServerTable1. You also see the SqlServerTable1 dataset in the tree view on the left.
6. On the General tab at the bottom of the Properties window, in Name, enter SqlServerDataset.
7. Go to the Connection tab, and select + New. You create a connection to the source data store (SQL Server
database) in this step.

8. In the New Linked Service window, under Name, enter SqlServerLinkedService. Select New under Connect
via integration runtime. In this section, you create a self-hosted integration runtime and associate it with
an on-premises machine with the SQL Server database. The self-hosted integration runtime is the
component that copies data from the SQL Server database on your machine to Blob storage.
9. In the Integration Runtime Setup window, select Private Network, and then select Next.
10. Enter a name for the integration runtime, and select Next.
11. Under Option 1: Express setup, select Click here to launch the express setup for this computer.
12. In the Integration Runtime (Self-hosted) Express Setup window, select Close.
13. In the New Linked Service window, ensure the Integration Runtime created above is selected under
Connect via integration runtime.
14. In the New Linked Service window, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Connect via integration runtime, confirm that the self-hosted integration runtime you created
earlier shows up.
c. Under Server name, enter the name of your SQL Server instance.
d. Under Database name, enter the name of the database with the emp table.
e. Under Authentication type, select the appropriate authentication type that Data Factory should use to
connect to your SQL Server database.
f. Under User name and Password, enter the user name and password. If you need to use a backslash (\) in
your user account or server name, precede it with the escape character (\). For example, use
mydomain\\myuser.
g. Select Test connection. Do this step to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
h. To save the linked service, select Finish.
15. You should be back in the window with the source dataset opened. On the Connection tab of the
Properties window, take the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. In Table, select [dbo].[emp].

16. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.

17. Go to the Sink tab at the bottom of the Properties window, and select + New.
18. In the New Dataset window, select Azure Blob Storage. Then select Finish. You see a new tab opened for
the dataset. You also see the dataset in the tree view.
19. In Name, enter AzureBlobDataset.
20. Go to the Connection tab at the bottom of the Properties window. Next to Linked service, select + New.

21. In the New Linked Service window, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Storage account name, select your storage account.
c. To test the connection to your storage account, select Test connection.
d. Select Save.

22. You should be back in the window with the sink dataset open. On the Connection tab, take the following
steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. For the folder/directory part of File path, enter adftutorial/fromonprem. If the output folder doesn't
exist in the adftutorial container, Data Factory automatically creates the output folder.
c. For the file name part of File path, select Add dynamic content.
d. Add @CONCAT(pipeline().RunId, '.txt'), and then select Finish. This names the output file after the pipeline run ID, with a .txt extension.
23. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset, confirm that
AzureBlobDataset is selected.

24. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
Validation Report, select Close.
25. To publish entities you created to Data Factory, select Publish All.

26. Wait until you see the Publishing succeeded pop-up. To check the status of publishing, select the Show
Notifications link on the left. To close the notification window, select Close.
Trigger a pipeline run
Select Trigger on the toolbar for the pipeline, and then select Trigger Now.
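If you prefer to start the published pipeline from a script rather than the toolbar, a hedged Azure PowerShell
equivalent is shown below; the resource group and data factory names are placeholders.

# Sketch only: trigger the published pipeline and capture the run ID for monitoring.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<your resource group>" `
    -DataFactoryName "<your data factory name>" `
    -PipelineName "SQLServerToBlobPipeline"
$runId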
Monitor the pipeline run
1. Go to the Monitor tab. You see the pipeline that you manually triggered in the previous step.

2. To view activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. You see only one activity run because there is only one activity in the pipeline. To see details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. To go back to the Pipeline
Runs view, select Pipelines at the top.

Verify the output


The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the <pipeline run ID>.txt file in the output folder.
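To confirm the output without opening the portal or Storage Explorer, you can list the blobs under the
fromonprem prefix with Azure PowerShell; the storage account name and key are placeholders.

# Sketch only: list the files that the pipeline wrote to adftutorial/fromonprem.
$ctx = New-AzStorageContext -StorageAccountName "<your storage account>" -StorageAccountKey "<your storage account key>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "fromonprem/" -Context $ctx | Select-Object Name, Length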
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Storage linked services.
Create SQL Server and Blob storage datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Tutorial: Copy data from an on-premises SQL Server
database to Azure Blob storage
4/8/2019 • 15 minutes to read

In this tutorial, you use Azure PowerShell to create a data-factory pipeline that copies data from an on-premises
SQL Server database to Azure Blob storage. You create and use a self-hosted integration runtime, which moves
data between on-premises and cloud data stores.

NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction to
Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-right
corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Manage access using RBAC and the Azure
portal article.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Azure Blob
storage (sink). You then create a table named emp in your SQL Server database, and insert a couple of sample
entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:

CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

6. In the tree view, right-click the database that you created, and then select New Query.
Azure Storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Azure Blob storage) as a
destination/sink data store. If you don't have a general-purpose Azure storage account, see Create a storage
account. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Azure Blob storage (sink).
Get storage account name and account key
You use the name and key of your Azure storage account in this tutorial. Get the name and key of your storage
account by doing the following:
1. Sign in to the Azure portal with your Azure username and password.
2. In the left pane, select More services, filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.

2. In the Blob service window, select Container.

3. In the New container window, in the Name box, enter adftutorial, and then select OK.

4. In the list of containers, select adftutorial.


5. Keep the container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.

Windows PowerShell
Install Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through completion of this tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:

Connect-AzAccount

3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you'll use later in PowerShell commands. Copy the
following command to PowerShell, specify a name for the Azure resource group (enclosed in double
quotation marks; for example, "adfrg" ), and then run the command.

$resourceGroupName = "ADFTutorialResourceGroup"

2. To create the Azure resource group, run the following command. This command uses the $location variable
that you define in step 4, so run that step first (or substitute a region name such as "East US"):

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

3. Define a variable for the data factory name that you can use in PowerShell commands later. The name must
start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.

IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFTutorialFactory"

4. Define a variable for the location of the data factory:

$location = "East US"

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

NOTE
The name of the data factory must be globally unique. If you receive the following error, change the name and try again.

The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.

To create data-factory instances, the user account that you use to sign in to Azure must be assigned a contributor or
owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores (Azure
Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory can be in
other regions.
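Optionally, you can confirm that the factory was created by reading it back; this quick check simply echoes the
provisioning state:

# Optional: retrieve the data factory you just created and show its provisioning state.
$dataFactory = Get-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Name $dataFactoryName
$dataFactory.ProvisioningState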

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with an on-premises machine with the
SQL Server database. The self-hosted integration runtime is the component that copies data from the SQL Server
database on your machine to Azure Blob storage.
1. Create a variable for the name of integration runtime. Use a unique name, and note the name. You use it
later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $integrationRuntimeName -Type SelfHosted -Description "selfhosted IR description"

Here is the sample output:


Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description : selfhosted IR description

3. To retrieve the status of the created integration runtime, run the following command:

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:

Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration

4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data Factory
service in the cloud, run the following command. Copy one of the keys (excluding the quotation marks) for
registering the self-hosted integration runtime that you install on your machine in the next step.

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
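If you plan to paste the key during the interactive setup in the next section, it can be handy to capture one of the
keys in a variable first. A minimal sketch (the AuthKey1 property name matches the JSON output shown above):

# Capture the first authentication key; you paste this value into the
# Register Integration Runtime (Self-hosted) window during setup.
$irAuthKey = (Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName).AuthKey1
$irAuthKey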

Install the integration runtime


1. Download Azure Data Factory Integration Runtime on a local Windows machine, and then run the
installation.
2. In the Welcome to Microsoft Integration Runtime Setup wizard, select Next.
3. In the End-User License Agreement window, accept the terms and license agreement, and select Next.
4. In the Destination Folder window, select Next.
5. In the Ready to install Microsoft Integration Runtime window, select Install.
6. If you see a warning message about the computer being configured to enter sleep or hibernate mode when
not in use, select OK.
7. If a Power Options window is displayed, close it, and switch to the setup window.
8. In the Completed the Microsoft Integration Runtime Setup wizard, select Finish.
9. In the Register Integration Runtime (Self-hosted) window, paste the key you saved in the previous
section, and then select Register.

When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the New Integration Runtime (Self-hosted) Node window, select Next.

11. In the Intranet Communication Channel window, select Skip.


You can select a TLS/SSL certificate for securing intra-node communication in a multi-node integration
runtime environment.
12. In the Register Integration Runtime (Self-hosted) window, select Launch Configuration Manager.
13. When the node is connected to the cloud service, the following message is displayed:

14. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that integration runtime can connect to the SQL Server, select Test.
If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect to
your SQL Server instance.
Note all the preceding values for later use in this tutorial.

Create linked services


To link your data stores and compute services to the data factory, create linked services in the data factory. In this
tutorial, you link your Azure storage account and on-premises SQL Server instance to the data factory. The linked
services have the connection information that the Data Factory service uses at runtime to connect to them.
Create an Azure Storage linked service (destination/sink)
In this step, you link your Azure storage account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2Tutorial folder with the
following code. If the ADFv2Tutorial folder does not already exist, create it.
IMPORTANT
Before you save the file, replace <accountName> and <accountKey> with the name and key of your Azure storage
account. You noted them in the Prerequisites section.

{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>;EndpointSuffix=core.windows.net"
}
}
},
"name": "AzureStorageLinkedService"
}

2. In PowerShell, switch to the C:\ADFv2Tutorial folder.


3. To create the linked service, AzureStorageLinkedService, run the following
Set-AzDataFactoryV2LinkedService cmdlet:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is a sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
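To confirm that the linked service exists, you can read it back from the factory; this optional check lists the
linked services created so far:

# Optional: list the linked services in the data factory to confirm that AzureStorageLinkedService was created.
Get-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName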
Create and encrypt a SQL Server linked service (source)
In this step, you link your on-premises SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:

IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.

Using SQL authentication (sa):


{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

Using Windows authentication:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <servername>, <databasename>, <username>, and <password> with the
values of your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.

2. To encrypt the sensitive data (username, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted by using the Data Protection Application Programming
Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted integration runtime node
(local machine). The output payload is redirected to another JSON file (in this case,
encryptedSqlServerLinkedService.json) that contains the encrypted credentials.

New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -IntegrationRuntimeName $integrationRuntimeName -File ".\SqlServerLinkedService.json" > encryptedSqlServerLinkedService.json

3. Run the following command, which creates EncryptedSqlServerLinkedService:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "EncryptedSqlServerLinkedService" -File ".\encryptedSqlServerLinkedService.json"

Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the on-premises SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server instance
at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial, the emp
table contains the source data.
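If the emp table doesn't exist yet in your SQL Server database, you can create and seed it before continuing. The
following is only a rough sketch: it assumes the SqlServer PowerShell module (for Invoke-Sqlcmd), Windows
authentication, and a simple schema that matches the dataset structure shown in the next step; adjust it to your
environment.

# Rough sketch: create and seed a simple dbo.emp table (assumes the SqlServer module and Windows authentication).
$createAndSeed = @"
IF OBJECT_ID('dbo.emp') IS NULL
BEGIN
    CREATE TABLE dbo.emp (ID int IDENTITY(1,1) PRIMARY KEY, FirstName varchar(50), LastName varchar(50));
    INSERT INTO dbo.emp (FirstName, LastName) VALUES ('John', 'Doe'), ('Jane', 'Doe');
END
"@
Invoke-Sqlcmd -ServerInstance "<your SQL Server name>" -Database "<your database name>" -Query $createAndSeed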
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"structure": [
{
"name": "ID",
"type": "String"
},
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"referenceName": "EncryptedSqlServerLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "SqlServerDataset"
}

2. To create the dataset SqlServerDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SqlServerDataset" -File ".\SqlServerDataset.json"

Here is the sample output:


DatasetName : SqlServerDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure : {"name": "ID" "type": "String", "name": "FirstName" "type": "String", "name":
"LastName" "type": "String"}
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a dataset for Azure Blob storage (sink)


In this step, you define a dataset that represents data that will be copied to Azure Blob storage. The dataset is of
the type AzureBlob. It refers to the Azure Storage linked service that you created earlier in this tutorial.
The linked service has the connection information that the data factory uses at runtime to connect to your Azure
storage account. This dataset specifies the folder in the Azure storage to which the data is copied from the SQL
Server database. In this tutorial, the folder is adftutorial/fromonprem, where adftutorial is the blob container
and fromonprem is the folder.
1. Create a JSON file named AzureBlobDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/fromonprem",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "AzureBlobDataset"
}

2. To create the dataset AzureBlobDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureBlobDataset" -File ".\AzureBlobDataset.json"

Here is the sample output:

DatasetName : AzureBlobDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlSource and the sink type is set to
BlobSink.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:
{
"name": "SQLServerToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type":"BlobSink"
}
},
"name": "CopySqlServerToAzureBlobActivity",
"inputs": [
{
"referenceName": "SqlServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlobDataset",
"type": "DatasetReference"
}
]
}
]
}
}

2. To create the pipeline SQLServerToBlobPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SQLServerToBlobPipeline" -File ".\SQLServerToBlobPipeline.json"

Here is the sample output:

PipelineName : SQLServerToBlobPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :

Create a pipeline run


Start a pipeline run for the SQLServerToBlobPipeline pipeline, and capture the pipeline run ID for future
monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName 'SQLServerToBlobPipeline'
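If you want a quick look at the overall run status before drilling into the activity runs, you can also query the
pipeline run directly; this is an optional check:

# Optional: check the overall status of the pipeline run you just started.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId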

Monitor the pipeline run


1. To continuously check the run status of pipeline SQLServerToBlobPipeline, run the following script in
PowerShell, and print the final result:
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -ForegroundColor "Yellow"
        Start-Sleep -Seconds 30
    }
    else {
        Write-Host "Pipeline 'SQLServerToBlobPipeline' run finished. Result:" -ForegroundColor "Yellow"
        $result
        break
    }
}

Here is the output of the sample run:

ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : copy
PipelineRunId : 4ec8980c-62f6-466f-92fa-e69b10f33640
PipelineName : SQLServerToBlobPipeline
Input :
Output :
LinkedServiceName :
ActivityRunStart : 9/13/2017 1:35:22 PM
ActivityRunEnd : 9/13/2017 1:35:42 PM
DurationInMs : 20824
Status : Succeeded
Error : {errorCode, message, failureType, target}

2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:

Write-Host "Pipeline 'SQLServerToBlobPipeline' run result:" -foregroundcolor "Yellow"


($result | Where-Object {$_.ActivityName -eq "CopySqlServerToAzureBlobActivity"}).Output.ToString()

Here is the output of the sample run:

{
"dataRead": 36,
"dataWritten": 24,
"rowsCopied": 2,
"copyDuration": 3,
"throughput": 0.01171875,
"errors": [],
"effectiveIntegrationRuntime": "MyIntegrationRuntime",
"billedDuration": 3
}

Verify the output


The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the dbo.emp.txt file in the output folder.
1. In the Azure portal, in the adftutorial container window, select Refresh to see the output folder.
2. Select fromonprem in the list of folders.
3. Confirm that you see a file named dbo.emp.txt .
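As an alternative to the portal, you can list the copied blobs from PowerShell. This sketch reuses the storage
context created in the earlier container sketch (or recreate it from your account name and key):

# Optional: list the blobs in the fromonprem folder of the adftutorial container.
Get-AzStorageBlob -Container "adftutorial" -Prefix "fromonprem" -Context $storageContext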

Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy multiple tables in bulk by using Azure Data
Factory
4/8/2019 • 14 minutes to read

This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well. For example, copying tables from
SQL Server/Oracle to Azure SQL Database/Data Warehouse/Azure Blob, copying different paths from Blob to
Azure SQL Database tables.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.

At a high level, this tutorial involves the following steps:


Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure portal. To learn about using other tools/SDKs to create a data factory, see Quickstarts.

End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that needs to be copied over to the sink data stores. Alternatively
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
copy the specific table in Azure SQL Database to the corresponding table in SQL Data Warehouse using staged
copy via Blob storage and PolyBase for best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.

Azure services to access SQL server


For both SQL Database and SQL Data Warehouse, allow Azure services to access SQL server. Ensure that Allow
access to Azure services setting is turned ON for your Azure SQL server. This setting allows the Data Factory
service to read data from your Azure SQL Database and write data to your Azure SQL Data Warehouse. To verify
and turn on this setting, do the following steps:
1. Click More services hub on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings page, click ON for Allow access to Azure services.
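If you prefer to turn on this setting from PowerShell rather than the portal, a sketch like the following should
work; it assumes the Az.Sql module, and the resource group and server names are placeholders:

# Placeholder names: replace with your own resource group and logical SQL server name.
# The AllowAllAzureIPs switch creates the special firewall rule that lets Azure services reach the server.
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" -ServerName "<your SQL server name>" -AllowAllAzureIPs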

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialBulkCopyDF for the name.

The name of the Azure data factory must be globally unique. If you see the following error for the name
field, change the name of the data factory (for example, yournameADFTutorialBulkCopyDF ). See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name "ADFTutorialBulkCopyDF" is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Data Factory UI application in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:

Create linked services


You create linked services to link your data stores and computes to a data factory. A linked service has the
connection information that the Data Factory service uses to connect to the data store at runtime.
In this tutorial, you link your Azure SQL Database, Azure SQL Data Warehouse, and Azure Blob Storage data
stores to your data factory. The Azure SQL Database is the source data store. The Azure SQL Data Warehouse is
the sink/destination data store. The Azure Blob Storage is to stage the data before the data is loaded into SQL
Data Warehouse by using PolyBase.
Create the source Azure SQL Database linked service
In this step, you create a linked service to link your Azure SQL database to the data factory.
1. Click Connections at the bottom of the window, and click + New on the toolbar.

2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name
c. Select your Azure SQL database for Database name.
d. Enter the name of the user to connect to the Azure SQL database.
e. Enter the password for the user.
f. To test the connection to Azure SQL database using the specified information, click Test connection.
g. Click Save.

Create the sink Azure SQL Data Warehouse linked service


1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure SQL Data Warehouse, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDWLinkedService for Name.
b. Select your Azure SQL server for Server name
c. Select your Azure SQL Data Warehouse for Database name.
d. Enter the name of the user to connect to the Azure SQL Data Warehouse.
e. Enter the password for the user.
f. To test the connection to the Azure SQL Data Warehouse using the specified information, click Test connection.
g. Click Save.
Create the staging Azure Storage linked service
In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedService. The linked
service specifies the connection string to connect to the database. The dataset specifies the name of the database
and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedService. The linked service specifies
the connection string to connect to the data warehouse. The dataset specifies the database and the table to which
the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Click + (plus) in the left pane, and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish. You should see a new tab titled
AzureSqlTable1.
3. In the properties window at the bottom, enter AzureSqlDatabaseDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select any table for Table. This table is a dummy table. You specify a query on the source dataset
when creating a pipeline. The query is used to extract data from the Azure SQL database.
Alternatively, you can select the Edit check box, and enter dummyName as the table name.

Create a dataset for sink SQL Data Warehouse


1. Click + (plus) in the left pane, and click Dataset.
2. In the New Dataset window, select Azure SQL Data Warehouse, and click Finish. You should see a new
tab titled AzureSqlDWTable1.
3. In the properties window at the bottom, enter AzureSqlDWDataset for Name.
4. Switch to the Parameters tab, click + New, and enter DWTableName for the parameter name. If you
copy/paste this name from the page, ensure that there is no trailing space character at the end of
DWTableName.
5. Switch to the Connection tab,
a. Select AzureSqlDWLinkedService for Linked service.
b. For Table, check the Edit option, click into the table name input box, then click the Add dynamic
content link below.

c. In the Add Dynamic Content page, click DWTableName under Parameters, which will
automatically populate the top expression text box with @dataset().DWTableName , then click Finish. The
tableName property of the dataset is set to the value that's passed as an argument for the
DWTableName parameter. The ForEach activity iterates through a list of tables, and passes one by one to
the Copy activity.

Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData.
The GetTableListAndTriggerCopyData pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables takes a list of tables as a parameter. For each table in the list, it copies data from
the table in Azure SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus), and click Pipeline.

2. In the General tab, specify IterateAndCopySQLTables for name.


3. Switch to the Parameters tab, and do the following actions:
a. Click + New.
b. Enter tableList for the parameter name.
c. Select Array for Type.

4. In the Activities toolbox, expand Iteration & Conditions, and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name.
b. Switch to the Settings tab, click the input box for Items, then click the Add dynamic content link below.
c. In the Add Dynamic Content page, collapse the System Variables and Functions section, click
tableList under Parameters, which will automatically populate the top expression text box as
@pipeline().parameters.tableList , then click Finish.

d. Switch to Activities tab, click Add activity to add a child activity to the ForEach activity.
5. In the Activities toolbox, expand DataFlow, and drag-drop the Copy activity onto the pipeline designer
surface. Notice the breadcrumb menu at the top. IterateAndCopySQLTables is the pipeline name and
IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch back to the
pipeline editor from the ForEach editor, click the link in the breadcrumb menu.
6. Switch to the Source tab, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select the Query option for Use Query.
c. Click the Query input box -> select the Add dynamic content below -> enter the following
expression for Query -> select Finish.

SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]


7. Switch to the Sink tab, and do the following steps:
a. Select AzureSqlDWDataset for Sink Dataset.
b. Click the input box for the VALUE of the DWTableName parameter -> select the Add dynamic content
below -> enter [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}] as the expression -> select Finish.
c. Expand Polybase Settings, and select Allow polybase.
d. Clear the Use type default option.
e. Click the Pre-copy Script input box -> select the Add dynamic content below -> enter the
following expression as script -> select Finish.

TRUNCATE TABLE [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

8. Switch to the Settings tab, and do the following steps:


a. Select True for Enable Staging.
b. Select AzureStorageLinkedService for Store Account Linked Service.

9. To validate the pipeline settings, click Validate on the top pipeline tool bar. Confirm that there is no
validation error. To close the Pipeline Validation Report, click >>.
Create the pipeline GetTableListAndTriggerCopyData
This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. In the left pane, click + (plus), and click Pipeline.

2. In the Properties window, change the name of the pipeline to GetTableListAndTriggerCopyData.


3. In the Activities toolbox, expand General, and drag-and-drop Lookup activity to the pipeline designer
surface, and do the following steps:
a. Enter LookupTableList for Name.
b. Enter Retrieve the table list from Azure SQL database for Description.
4. Switch to the Settings page, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <> 'ProductModel'

d. Clear the checkbox for the First row only field.


5. Drag-and-drop Execute Pipeline activity from the Activities toolbox to the pipeline designer surface, and
set the name to TriggerCopy.

6. Switch to the Settings page, and do the following steps:


a. Select IterateAndCopySQLTables for Invoked pipeline.
b. Expand the Advanced section.
c. Click + New in the Parameters section.
d. Enter tableList for parameter name.
e. Click the VALUE input box -> select the Add dynamic content below -> enter
@activity('LookupTableList').output.value as the value -> select Finish. You are setting the
result list from the Lookup activity as an input to the second pipeline. The result list contains the list
of tables whose data needs to be copied to the destination.
7. Connect the Lookup activity to the Execute Pipeline activity by dragging the green box attached to the
Lookup activity to the left of Execute Pipeline activity.

8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Report, click >>.
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish All on top of the
window. Wait until the publishing succeeds.

Trigger a pipeline run


Go to pipeline GetTableListAndTriggerCopyData, click Trigger, and click Trigger Now.

Monitor the pipeline run


1. Switch to the Monitor tab. Click Refresh until you see runs for both the pipelines in your solution.
Continue refreshing the list until you see the Succeeded status.

2. To view activity runs associated with the GetTableListAndTriggerCopyData pipeline, click the first link in the
Actions link for that pipeline. You should see two activity runs for this pipeline run.

3. To view the output of the Lookup activity, click link in the Output column for that activity. You can
maximize and restore the Output window. After reviewing, click X to close the Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}

4. To switch back to the Pipeline Runs view, click Pipelines link at the top. Click View Activity Runs link
(first link in the Actions column) for the IterateAndCopySQLTables pipeline. You should see output as
shown in the following image: Notice that there is one Copy activity run for each table in the Lookup
activity output.
5. Confirm that the data was copied to the target SQL Data Warehouse you used in this tutorial.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copy data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data
Factory
3/5/2019 • 11 minutes to read

This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well. For example, copying tables from
SQL Server/Oracle to Azure SQL Database/Data Warehouse/Azure Blob, copying different paths from Blob to
Azure SQL Database tables.
At a high level, this tutorial involves the following steps:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure PowerShell. To learn about using other tools/SDKs to create a data factory, see
Quickstarts.

End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that needs to be copied over to the sink data stores. Alternatively
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
copy the specific table in Azure SQL Database to the corresponding table in SQL Data Warehouse using staged
copy via Blob storage and PolyBase for best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.

Azure services to access SQL server


For both SQL Database and SQL Data Warehouse, allow Azure services to access SQL server. Ensure that Allow
access to Azure services setting is turned ON for your Azure SQL server. This setting allows the Data Factory
service to read data from your Azure SQL Database and write data to your Azure SQL Data Warehouse. To verify
and turn on this setting, do the following steps:
1. Click All services on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings page, click ON for Allow access to Azure services.

Create a data factory


1. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you
need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace place-holders with your own
values before executing the command.

$resourceGroupName = "<your resource group to create the factory>"


$dataFactoryName = "<specify the name of data factory to create. It must be globally unique.>"
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, you must be a Contributor or Administrator of the Azure
subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest
you on the following page, and then expand Analytics to locate Data Factory: Products available by
region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other regions.

Create linked services


In this tutorial, you create three linked services for source, sink, and staging blob respectively, which includes
connections to your data stores:
Create the source Azure SQL Database linked service
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADFv2TutorialBulkCopy folder
with the following content: (Create the folder ADFv2TutorialBulkCopy if it does not already exist.)

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}

2. In Azure PowerShell, switch to the ADFv2TutorialBulkCopy folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSqlDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseLinkedService" -File ".\AzureSqlDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create the sink Azure SQL Data Warehouse linked service


1. Create a JSON file named AzureSqlDWLinkedService.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername>, and <password> with the values of your
Azure SQL Data Warehouse before saving the file.

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}

2. To create the linked service AzureSqlDWLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWLinkedService" -File ".\AzureSqlDWLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService

Create the staging Azure Storage linked service


In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before saving the
file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
}

2. To create the linked service AzureStorageLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The "tableName" value is a dummy one; you later use a SQL query in the copy
activity to retrieve the data.

{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}

2. To create the dataset: AzureSqlDatabaseDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseDataset" -File ".\AzureSqlDatabaseDataset.json"

Here is the sample output:

DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for sink SQL Data Warehouse


1. Create a JSON file named AzureSqlDWDataset.json in the C:\ADFv2TutorialBulkCopy folder, with the
following content. The "tableName" is set as a parameter; the copy activity that references this dataset later
passes the actual value into the dataset.

{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}

2. To create the dataset AzureSqlDWDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWDataset" -File ".\AzureSqlDWDataset.json"

Here is the sample output:

DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset

Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:
{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from SQL database to SQL DW",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}
2. To create the pipeline IterateAndCopySQLTables, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IterateAndCopySQLTables" -File ".\IterateAndCopySQLTables.json"

Here is the sample output:

PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Create the pipeline "GetTableListAndTriggerCopyData"


This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. Create a JSON file named GetTableListAndTriggerCopyData.json in the C:\ADFv2TutorialBulkCopy
folder, with the following content:
{
"name":"GetTableListAndTriggerCopyData",
"properties":{
"activities":[
{
"name": "LookupTableList",
"description": "Retrieve the table list from Azure SQL dataabse",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM
information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <>
'ProductModel'"
},
"dataset": {
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "TriggerCopy",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"tableList": {
"value": "@activity('LookupTableList').output.value",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "IterateAndCopySQLTables",
"type": "PipelineReference"
},
"waitOnCompletion": true
},
"dependsOn": [
{
"activity": "LookupTableList",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}

2. To create the pipeline GetTableListAndTriggerCopyData, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "GetTableListAndTriggerCopyData" -File ".\GetTableListAndTriggerCopyData.json"

Here is the sample output:


PipelineName : GetTableListAndTriggerCopyData
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {LookupTableList, TriggerCopy}
Parameters :

Start and monitor pipeline run


1. Start a pipeline run for the main "GetTableListAndTriggerCopyData" pipeline and capture the pipeline run
ID for future monitoring. Under the covers, it triggers a run of the pipeline "IterateAndCopySQLTables", as
specified in the ExecutePipeline activity.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName 'GetTableListAndTriggerCopyData'

2. Run the following script to continuously check the run status of pipeline
GetTableListAndTriggerCopyData, and print out the final pipeline run and activity run result.

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -ForegroundColor "Yellow"
            Write-Host "Pipeline run details:" -ForegroundColor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -ForegroundColor "Yellow"
    }

    Start-Sleep -Seconds 15
}

$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
Write-Host "Activity run details:" -ForegroundColor "Yellow"
$result

Here is the output of the sample run:


Pipeline run details:
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
RunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
LastUpdated : 9/18/2017 4:08:15 PM
Parameters : {}
RunStart : 9/18/2017 4:06:44 PM
RunEnd : 9/18/2017 4:08:15 PM
DurationInMs : 90637
Status : Succeeded
Message :

Activity run details:


ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : LookupTableList
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {source, dataset, firstRowOnly}
Output : {count, value, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:06:46 PM
ActivityRunEnd : 9/18/2017 4:07:09 PM
DurationInMs : 22995
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}

3. You can get the run ID of the pipeline "IterateAndCopySQLTables" and check the detailed activity run result as
follows:

Write-Host "Pipeline 'IterateAndCopySQLTables' run result:" -ForegroundColor "Yellow"
($result | Where-Object {$_.ActivityName -eq "TriggerCopy"}).Output.ToString()

Here is the output of the sample run:

{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}

$result2 = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId <copy above run ID> -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result2

4. Connect to your sink Azure SQL Data Warehouse and confirm that data has been copied from Azure SQL
Database properly.
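One way to do this check from PowerShell is to count the rows in one of the copied tables. This is a rough sketch
that assumes the SqlServer module and SQL authentication; the server, database, and credential values are
placeholders, and SalesLT.Customer is just one of the tables listed by the Lookup activity:

# Rough verification: count the rows copied into one of the destination tables.
Invoke-Sqlcmd -ServerInstance "<your DW server>.database.windows.net" -Database "<your DW database>" -Username "<user>" -Password "<password>" -Query "SELECT COUNT(*) AS RowCnt FROM SalesLT.Customer"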

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copy data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to a
destination data store
5/10/2019 • 2 minutes to read

In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used
scenario. The tutorials in this section show you different ways of loading data incrementally by using Azure Data
Factory.

Delta data loading from database by using a watermark


In this case, you define a watermark in your source database. A watermark is a column that has the last updated
time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark
and a new watermark. The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorials:


Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally copy data from multiple tables in on-premises SQL Server to Azure SQL Database
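
To make the pattern concrete, here is a minimal PowerShell sketch of the three queries involved: read the old watermark, compute the new one, and select the delta rows in between. It assumes the SqlServer PowerShell module (for Invoke-Sqlcmd) and the data_source_table and watermarktable objects that the tutorials in this section create; the server, database, and credential values are placeholders.

# Minimal watermark sketch; connection values are placeholders.
$sqlArgs = @{ ServerInstance = "<yourserver>.database.windows.net"; Database = "<yourdatabase>"; Username = "<user>"; Password = "<password>" }

# 1. Read the watermark recorded after the previous run.
$old = Invoke-Sqlcmd @sqlArgs -Query "SELECT WatermarkValue FROM watermarktable WHERE TableName = 'data_source_table'"

# 2. Compute the new watermark from the source table.
$new = Invoke-Sqlcmd @sqlArgs -Query "SELECT MAX(LastModifytime) AS NewWatermarkValue FROM data_source_table"

# 3. Select only the rows whose watermark column falls between the two values.
Invoke-Sqlcmd @sqlArgs -Query ("SELECT * FROM data_source_table WHERE LastModifytime > '" +
    $old.WatermarkValue.ToString('yyyy-MM-dd HH:mm:ss') + "' AND LastModifytime <= '" +
    $new.NewWatermarkValue.ToString('yyyy-MM-dd HH:mm:ss') + "'")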

Delta data loading from SQL DB by using the Change Tracking technology
Change Tracking technology is a lightweight solution in SQL Server and Azure SQL Database that provides an
efficient change tracking mechanism for applications. It enables an application to easily identify data that was
inserted, updated, or deleted.
The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorial:


Incrementally copy data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
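
For orientation only (the linked tutorial walks through the full setup), the Change Tracking primitives look roughly like the following PowerShell sketch. It assumes the SqlServer PowerShell module, a source table that has a primary key, and placeholder connection values; the statements are illustrative rather than copied from the tutorial.

# Rough Change Tracking sketch; connection values and object names are placeholders.
$sqlArgs = @{ ServerInstance = "<yourserver>.database.windows.net"; Database = "<yourdatabase>"; Username = "<user>"; Password = "<password>" }

# Enable Change Tracking on the database and on the source table (the table needs a primary key).
Invoke-Sqlcmd @sqlArgs -Query "ALTER DATABASE [<yourdatabase>] SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);"
Invoke-Sqlcmd @sqlArgs -Query "ALTER TABLE data_source_table ENABLE CHANGE_TRACKING;"

# Read the rows inserted, updated, or deleted since a previously stored synchronization version.
Invoke-Sqlcmd @sqlArgs -Query "DECLARE @last_sync_version bigint = 0; SELECT ct.PersonID, ct.SYS_CHANGE_OPERATION FROM CHANGETABLE(CHANGES data_source_table, @last_sync_version) AS ct;"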

Loading new and changed files only by using LastModifiedDate


You can copy only the new and changed files, based on their LastModifiedDate, to the destination store. ADF scans all the files in the source store, applies a filter on their LastModifiedDate, and copies to the destination store only the files that are new or updated since the last run. Be aware that if you let ADF scan a huge number of files but copy only a few of them to the destination, the run can still take a long time, because scanning the files is itself time consuming.
For step-by-step instructions, see the following tutorial:
Incrementally copy new and changed files based on LastModifiedDate from Azure Blob storage to Azure Blob
storage
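
As a conceptual illustration only (this is not how the Copy activity is implemented), the filter is equivalent to listing the source blobs and keeping those modified after the previous run. A PowerShell sketch, assuming the Az.Storage module and placeholder account details:

# Conceptual equivalent of the LastModifiedDate filter; account values and cutoff time are placeholders.
$ctx = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
$lastRunTime = (Get-Date).AddDays(-1)

Get-AzStorageBlob -Container "adftutorial" -Context $ctx |
    Where-Object { $_.LastModified -gt $lastRunTime } |
    Select-Object Name, LastModified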

Loading new files only by using a time-partitioned folder or file name
You can copy only new files where the files or folders have already been time partitioned with time-slice information as part of the file or folder name (for example, /yyyy/mm/dd/file.csv). It is the most performant approach for incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure Blob
storage

Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/26/2019 • 13 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column. Select one column in the source data store, which can be used to slice the
new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value. In this tutorial, you store the watermark value in a
SQL database.
3. Create a pipeline with the following workflow:
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Object Explorer, right-click the database, and choose New Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table as
the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(

TableName varchar(255),
WatermarkValue datetime,
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFIncCopyTutorialDF for the name.

The name of the Azure data factory must be globally unique. If you see a red exclamation mark with the
following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF ) and
try creating again. See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name "ADFIncCopyTutorialDF" is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. In the get started page of Data Factory UI, click the Create pipeline tile.

2. In the General page of the Properties window for the pipeline, enter IncrementalCopyPipeline for the name.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand General,
and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the activity to
LookupOldWaterMarkActivity.
4. Switch to the Settings tab, and click + New for Source Dataset. In this step, you create a dataset to
represent data in the watermarktable. This table contains the old watermark that was used in the previous
copy operation.
5. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
the dataset.
6. In the properties window for the dataset, enter WatermarkDataset for Name.
7. Switch to the Connection tab, and click + New to make a connection (create a linked service) to your Azure
SQL database.

8. In the New Linked Service window, do the following steps:


a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Enter the name of the user that has access to the Azure SQL server.
d. Enter the password for the user.
e. To test connection to the Azure SQL database, click Test connection.
f. Click Save.
g. In the Connection tab, confirm that AzureSqlDatabaseLinkedService is selected for Linked
service.
9. Select [dbo].[watermarktable] for Table. If you want to preview data in the table, click Preview data.

10. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
11. In the Activities toolbox, expand General, and drag-drop another Lookup activity to the pipeline designer
surface, and set the name to LookupNewWaterMarkActivity in the General tab of the properties
window. This Lookup activity gets the new watermark value from the table with the source data to be copied
to the destination.
12. In the properties window for the second Lookup activity, switch to the Settings tab, and click New. You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
13. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
this dataset. You also see the dataset in the tree view.
14. In the General tab of the properties window, enter SourceDataset for Name.

15. Switch to the Connection tab, and do the following steps:


a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table. You specify a query on this dataset later in the tutorial.
The query takes precedence over the table you specify in this step.

16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that SourceDataset is
selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query. It selects only the maximum
value of LastModifytime from data_source_table. If you don't use this query, the activity gets all the
rows from the table, because you specified the table name (data_source_table) in the dataset definition.

select MAX(LastModifytime) as NewWatermarkvalue from data_source_table


18. In the Activities toolbox, expand DataFlow, and drag-drop the Copy activity from the Activities toolbox,
and set the name to IncrementalCopyActivity.
19. Connect both Lookup activities to the Copy activity by dragging the green button attached to the
Lookup activities to the Copy activity. Release the mouse button when you see the border color of the Copy
activity changes to blue.

20. Select the Copy activity and confirm that you see the properties for the activity in the Properties window.
21. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.

select * from data_source_table where LastModifytime > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
22. Switch to the Sink tab, and click + New for the Sink Dataset field.

23. In this tutorial, the sink data store is of type Azure Blob Storage. Therefore, select Azure Blob Storage, and click
Finish in the New Dataset window.
24. In the General tab of the Properties window for the dataset, enter SinkDataset for Name.
25. Switch to the Connection tab, and click + New. In this step, you create a connection (linked service) to your
Azure Blob storage.

26. In the New Linked Service window, do the following steps:


a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
27. In the Connection tab, do the following steps:
a. Confirm that AzureStorageLinkedService is selected for Linked service.
b. For the folder part of the File path field, enter adftutorial/incrementalcopy. adftutorial is the
blob container name and incrementalcopy is the folder name. This snippet assumes that you have a
blob container named adftutorial in your blob storage. Create the container if it doesn't exist, or set it
to the name of an existing one. Azure Data Factory automatically creates the output folder
incrementalcopy if it does not exist. You can also use the Browse button for the File path to
navigate to a folder in a blob container.
c. For the file name part of the File path field, enter @CONCAT('Incremental-', pipeline().RunId, '.txt').
The file name is dynamically generated by using the expression. Each pipeline run has a unique ID.
The Copy activity uses the run ID to generate the file name.
28. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left.
29. In the Activities toolbox, expand General, and drag-drop the Stored Procedure activity from the
Activities toolbox to the pipeline designer surface. Connect the green (Success) output of the Copy
activity to the Stored Procedure activity.

30. Select Stored Procedure Activity in the pipeline designer, change its name to
StoredProceduretoWriteWatermarkActivity.
31. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.

32. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark.
b. To specify values for the stored procedure parameters, click Import parameter, and enter following
values for the parameters:

NAME                TYPE        VALUE
LastModifiedtime    DateTime    @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName           String      @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
33. To validate the pipeline settings, click Validate on the toolbar. Confirm that there are no validation errors. To
close the Pipeline Validation Report window, click >>.

34. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.
Trigger a pipeline run
1. Click Trigger on the toolbar, and click Trigger Now.

2. In the Pipeline Run window, select Finish.

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You can see the status of the pipeline run triggered by the manual
trigger. Click Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.

Review the results


1. Connect to your Azure Storage Account by using tools such as Azure Storage Explorer. Verify that an output
file is created in the incrementalcopy folder of the adftutorial container.

2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.

1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000

3. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-05 8:06:00.000

Add more data to source


Insert new data into the SQL database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in the SQL database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

Trigger another pipeline run


1. Switch to the Edit tab. Click the pipeline in the tree view if it's not opened in the designer.
2. Click Trigger on the toolbar, and click Trigger Now.

Monitor the second pipeline run


1. Switch to the Monitor tab on the left. You can see the status of the pipeline run triggered by the manual
trigger. Click Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click Refresh
button to refresh the list.

Verify the second output


1. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-<GUID>.txt . Open that file, and you see two rows of records in it.

6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000

2. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable

sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/14/2019 • 13 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column. Select one column in the source data store, which can be used to slice the
new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following workflow:
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell. Follow the instructions in Install and configure Azure PowerShell.
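
If you prefer to create the adftutorial container from PowerShell rather than from the portal, a minimal sketch (the storage account name and key are placeholders) is:

# Create the adftutorial blob container used as the sink in this tutorial (requires the Az.Storage module).
$ctx = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx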
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Object Explorer, right-click the database, and choose New Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(

TableName varchar(255),
WatermarkValue datetime,
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END
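
Optionally, you can call the stored procedure once by hand to confirm that it updates the watermark table before the pipeline depends on it. A minimal sketch, assuming the SqlServer PowerShell module and placeholder connection values:

# Optional manual check of usp_write_watermark; connection values are placeholders.
$sqlArgs = @{ ServerInstance = "<yourserver>.database.windows.net"; Database = "<yourdatabase>"; Username = "<user>"; Password = "<password>" }

# Write the initial watermark value again, then read the table back to confirm the update.
Invoke-Sqlcmd @sqlArgs -Query "EXEC usp_write_watermark @LastModifiedtime = '2010-01-01 00:00:00', @TableName = 'data_source_table';"
Invoke-Sqlcmd @sqlArgs -Query "SELECT * FROM watermarktable;"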

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.

$location = "East US"


3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFIncCopyTutorialFactory";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your storage account and SQL database.
Create a Storage linked service
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADF folder with the following content.
(Create the folder ADF if it doesn't already exist.) Replace <accountName> and <accountKey> with the name
and key of your storage account before you save the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create a SQL Database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the C:\ADF folder with the following
content. (Create the folder ADF if it doesn't already exist.) Replace <server>, <database>, <user id>, and
<password> with the name of your server, database, user ID, and password before you save the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database>; Persist
Security Info=False; User ID=<user> ; Password=<password>; MultipleActiveResultSets = False; Encrypt =
True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
ProvisioningState :

Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:

{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different name.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content:
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incrementalcopy",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the container if
it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is automatically
created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by using the
expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},

"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from data_source_table"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},

{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": { "value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"String"}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]

}
}
2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity, IncrementalCopyActivity,
StoredProceduretoWriteWatermarkActivity}
Parameters :

Run the pipeline


1. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.
Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

2. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:42:50 AM
DurationInMs : 7777
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}
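
Instead of re-running the cmdlet by hand, you can wrap the status check in a polling loop like the one used earlier in this article. A minimal sketch, assuming the $resourceGroupName, $dataFactoryName, and $RunId variables defined above:

# Poll the pipeline run until it leaves the InProgress state, then list the activity runs.
while ($true) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $RunId
    if ($run -and $run.Status -ne 'InProgress') {
        Write-Host "Pipeline run finished. The status is: " $run.Status
        break
    }
    Write-Host "Pipeline is running...status: InProgress"
    Start-Sleep -Seconds 15
}

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)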

Review the results


1. In the blob storage (sink store), you see that the data were copied to the file defined in SinkDataset. In the
current tutorial, the file name is Incremental-d4bf3ce2-5d60-43f3-9318-923155f61037.txt . Open the file, and
you can see records in the file that are the same as the data in the SQL database.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000

2. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-05 8:06:00.000

Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in the SQL database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

2. Run the pipeline IncrementalCopyPipeline again by using the Invoke-AzDataFactoryV2Pipeline cmdlet.


Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

3. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:58 AM
DurationInMs : 31758
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}

4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in it.

5. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable


sample output:

TABLENAME WATERMARKVALUE

data_source_table 2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL Server to an
Azure SQL database
4/14/2019 • 17 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-premises SQL Server to an
Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column.
Select one column for each table in the source data store, which can be used to identify the new or updated records for every run.
Normally, the data in this selected column (for example, last_modify_time or ID ) keeps increasing when rows are created or updated. The
maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter to the pipeline. For each source
table, it invokes the following activities to perform delta loading for that table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the second Lookup activity to
retrieve the new watermark value. These watermark values are passed to the Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark column greater than the old
watermark value and less than the new watermark value. Then, it copies the delta data from the source data store to Azure Blob storage
as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:

If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see Create an Azure SQL database
for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Object Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

INSERT INTO customer_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table


(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your Azure SQL database


1. Open SQL Server Management Studio, and connect to your Azure SQL database.
2. In Object Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your SQL database to create tables named customer_table and project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to store the watermark value:

create table watermarktable


(

TableName varchar(255),
WatermarkValue datetime,
);

2. Insert initial watermark values for both source tables into the watermark table.
INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in the Azure SQL database


Run the following command to create a stored procedure in your SQL database. This stored procedure updates the watermark value after every
pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create data types and additional stored procedures in Azure SQL database
Run the following query to create two stored procedures and two table data types in your SQL database. They're used to merge the data from the
source tables into the destination tables.
To keep the tutorial simple to start with, these stored procedures receive the delta data through a table variable and then merge it into the
destination store. Be aware that this approach is not suited to a large number of delta rows (more than 100) in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a Copy activity to copy all the delta data
into a temporary staging table in the destination store first, and then build your own stored procedure, without a table variable, that merges
the rows from the staging table into the final table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY


AS

BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(


Project varchar(255),
Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY


AS

BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
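
If you want to see the merge behavior before the pipeline uses these procedures, you can exercise one of them by hand with a small table variable. The following sketch runs the test from PowerShell and assumes the SqlServer module; the connection values are placeholders, and the test row should be removed afterward if you don't want it in the destination table.

# Optional manual test of usp_upsert_customer_table; connection values are placeholders.
$sqlArgs = @{ ServerInstance = "<yourserver>.database.windows.net"; Database = "<yourdatabase>"; Username = "<user>"; Password = "<password>" }

Invoke-Sqlcmd @sqlArgs -Query @'
DECLARE @rows DataTypeforCustomerTable;
INSERT INTO @rows (PersonID, Name, LastModifytime)
VALUES (100, 'TestUser', '2017-09-08 10:00:00');  -- inserted if PersonID 100 is new, updated otherwise
EXEC usp_upsert_customer_table @customer_table = @rows;
SELECT * FROM customer_table;
'@

# Optional cleanup of the test row:
Invoke-Sqlcmd @sqlArgs -Query "DELETE FROM customer_table WHERE PersonID = 100;"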

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in Microsoft Edge and Google
Chrome web browsers.
2. Click New on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory page, enter ADFMultiIncCopyTutorialDF for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the name of the data factory (for
example, yournameADFMultiIncCopyTutorialDF ) and try creating again. See Data Factory - Naming Rules article for naming rules for
Data Factory artifacts.

`Data factory name ADFMultiIncCopyTutorialDF is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down list. The data stores (Azure
Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.

12. Click Author & Monitor tile to launch Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page of Azure Data Factory UI, click Create pipeline (or) switch to the Edit tab.
Create self-hosted integration runtime
As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a self-hosted integration runtime (IR)
in your on-premises environment. The self-hosted IR moves data between your private network and Azure.
1. Click Connections at the bottom of the left pane, and switch to the Integration Runtimes in the Connections window.

2. In the Integration Runtimes tab, click + New.

3. In the Integration Runtime Setup window, select Perform data movement and dispatch activities to external computes, and
click Next.
4. Select Private Network, and click Next.

5. Enter MySelfHostedIR for Name, and click Next.


6. Click Click here to launch the express setup for this computer in the Option 1: Express setup section.
7. In the Integration Runtime (Self-hosted) Express Setup window, click Close.

8. In the Web browser, in the Integration Runtime Setup window, click Finish.
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this section, you create linked
services to your on-premises SQL Server database and SQL database.
Create the SQL Server linked service
In this step, you link your on-premises SQL Server database to the data factory.
1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.
2. In the New Linked Service window, select SQL Server, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter SqlServerLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime. This is an important step. The default integration runtime
cannot connect to an on-premises data store. Use the self-hosted integration runtime you created earlier.
c. For Server name, enter the name of your computer that has the SQL Server database.
d. For Database name, enter the name of the database in your SQL Server that has the source data. You created a table and
inserted data into this database as part of the prerequisites.
e. For Authentication type, select the type of the authentication you want to use to connect to the database.
f. For User name, enter the name of user that has access to the SQL Server database. If you need to use a slash character ( \ ) in
your user account or server name, use the escape character ( \ ). An example is mydomain\\myuser .
g. For Password, enter the password for the user.
h. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection
succeeds.
i. To save the linked service, click Save.

Create the Azure SQL Database linked service


In the last step, you create a linked service to link your source SQL Server database to the data factory. In this step, you link your
destination/sink Azure SQL database to the data factory.
1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.

2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your Azure SQL server from the drop-down list.
c. For Database name, select the Azure SQL database in which you created customer_table and project_table as part of the
prerequisites.
d. For User name, enter the name of user that has access to the Azure SQL database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to your Azure SQL database, click Test connection. Fix any errors until the connection
succeeds.
g. To save the linked service, click Save.

4. Confirm that you see two linked services in the list.


Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the watermark.
Create a source dataset
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select SQL Server, click Finish.


3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SourceDataset for Name.

4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for Linked service. You do not select a
table here. The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
Create a sink dataset
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SinkDataset for Name.

4. Switch to the Parameters tab in the Properties window, and do the following steps:
a. Click New in the Create/update parameters section.
b. Enter SinkTableName for the name, and String for the type. This dataset takes SinkTableName as a parameter. The
SinkTableName parameter is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a list
of table names and passes the table name to this dataset in each iteration.

5. Switch to the Connection tab in the Properties window, and select AzureSqlDatabaseLinkedService for Linked service. For Table
property, click Add dynamic content.

6. Select SinkTableName in the Parameters section.

7. After clicking Finish, you see @dataset().SinkTableName as the table name.


Create a dataset for a watermark
In this step, you create a dataset for storing a high watermark value.
1. In the left pane, click + (plus), and click Dataset.

2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. In the General tab of the Properties window at the bottom, enter WatermarkDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[watermarktable] for Table.

Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names and performs the following
operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next iteration.
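To make these operations concrete, here is a hedged sketch of the source query that the Copy activity ends up issuing for the customer_table iteration once the item() and activity() expressions are resolved. The two watermark values are illustrative only (the initial watermark from the prerequisites and the MAX(LastModifytime) returned by the new-watermark lookup).

-- Hedged sketch: the resolved Copy activity source query for one ForEach iteration (customer_table).
-- The watermark values shown are illustrative examples, not required values.
select *
from customer_table
where LastModifytime > '2010-01-01 00:00:00'   -- old watermark, from LookupOldWaterMarkActivity
  and LastModifytime <= '2017-09-05 08:06:00'; -- new watermark, from LookupNewWaterMarkActivity

The same pattern repeats for project_table, with Creationtime as the watermark column.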
Create the pipeline
1. In the left pane, click + (plus), and click Pipeline.

2. In the General tab of the Properties window, enter IncrementalCopyPipeline for Name.

3. In the Properties window, do the following steps:


a. Click + New.
b. Enter tableList for the parameter name.
c. Select Object for the parameter type.

4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop the ForEach activity to the pipeline designer surface. In the
General tab of the Properties window, enter IterateSQLTables.

5. Switch to the Settings tab in the Properties window, and enter @pipeline().parameters.tableList for Items. The ForEach activity
iterates through a list of tables and performs the incremental copy operation.

6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General, drag-drop the Lookup activity to the pipeline designer surface, and enter
LookupOldWaterMarkActivity for Name.

8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from watermarktable where TableName = '@{item().TABLE_NAME}'


9. Drag-drop the Lookup activity from the Activities toolbox, and enter LookupNewWaterMarkActivity for Name.
10. Switch to the Settings tab.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from @{item().TABLE_NAME}

11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for Name.
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box attached to the Lookup activity
and drop it on the Copy activity. Release the mouse button when the border color of the Copy activity changes to blue.

13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and @{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
14. Switch to the Sink tab, and select SinkDataset for Sink Dataset.

15. Do the following steps:


a. In the Dataset property, for SinkTableName parameter, enter @{item().TABLE_NAME} .
b. For Stored Procedure Name property, enter @{item().StoredProcedureNameForMergeOperation} .
c. For Table Type property, enter @{item().TableType} .
16. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface. Connect the Copy activity to
the Stored Procedure activity.

17. Select the Stored Procedure activity in the pipeline, and enter StoredProceduretoWriteWatermarkActivity for Name in the General
tab of the Properties window.

18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.
19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark .
b. Select Import parameter.
c. Specify the following values for the parameters:

NAME              TYPE      VALUE

LastModifiedtime  DateTime  @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}

TableName         String    @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}

20. In the left pane, click Publish. This action publishes the entities you created to the Data Factory service.

21. Wait until you see the Successfully published message. To see the notifications, click the Show Notifications link. Close the
notifications window by clicking X.
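With the parameters from step 19 resolved at runtime, the Stored Procedure activity effectively issues a call like the following for the customer_table iteration. This is a hedged sketch; the datetime value is only an illustration of what the new-watermark lookup returns.

-- Hedged sketch: the resolved call made by StoredProceduretoWriteWatermarkActivity for customer_table.
EXEC usp_write_watermark
    @LastModifiedtime = '2017-09-05 08:06:00',  -- NewWatermarkvalue from LookupNewWaterMarkActivity
    @TableName = 'customer_table';              -- TableName from LookupOldWaterMarkActivity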
Run the pipeline
1. On the toolbar for the pipeline, click Trigger, and click Trigger Now.

2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]

Monitor the pipeline


1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger. Click the Refresh button to refresh
the list. Links in the Actions column allow you to view activity runs associated with the pipeline run, and to rerun the pipeline.

2. Click the View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.
Review the results
In SQL Server Management Studio, run the following queries against the target SQL database to verify that the data was copied from source
tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.
Add more data to the source tables
Run the following query against the source SQL Server database to update an existing row in customer_table. Insert a new row into
project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table


(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. In the web browser window, switch to the Edit tab on the left.
2. On the toolbar for the pipeline, click Trigger, and click Trigger Now.

3. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.

[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]

Monitor the pipeline again


1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger. Click the Refresh button to refresh
the list. Links in the Actions column allow you to view activity runs associated with the pipeline run, and to rerun the pipeline.
2. Click the View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.

Review the final results


In SQL Server Management Studio, run the following queries against the target database to verify that the updated/new data was copied from
source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Notice the new values of Name and LastModifytime for PersonID number 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about incrementally loading data by using the Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to an Azure SQL database
4/8/2019 • 18 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-
premises SQL Server to an Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column. Select one column for each table in the source data store, which can be
used to identify the new or updated records for every run. Normally, the data in this selected column (for
example, last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than or equal to the new watermark value. Then, it
copies the delta data from the source data store to the destination data store.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

INSERT INTO customer_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table


(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your Azure SQL database


1. Open SQL Server Management Studio, and connect to your SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your SQL database to create tables named customer_table and
project_table :

create table customer_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table


(
Project varchar(255),
Creationtime datetime
);

Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:

create table watermarktable


(

TableName varchar(255),
WatermarkValue datetime
);

2. Insert initial watermark values for both source tables into the watermark table.

INSERT INTO watermarktable


VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in the Azure SQL database


Run the following command to create a stored procedure in your SQL database. This stored procedure updates
the watermark value after every pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create data types and additional stored procedures in the Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used
to merge the data from the source tables into the destination tables.
To keep the walkthrough simple, these stored procedures receive the delta data through a table variable and then
merge it into the destination store. Be aware that this approach is not intended for a large number of delta rows
(more than 100) in the table variable.
If you do need to merge a large number of delta rows into the destination store, use a copy activity to copy all the
delta data into a temporary "staging" table in the destination store first, and then write your own stored procedure
that merges the rows from the "staging" table into the "final" table without using a table variable. (A hedged sketch
of this staging approach follows the script below.)

CREATE TYPE DataTypeforCustomerTable AS TABLE(


PersonID int,
Name varchar(255),
LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY


AS

BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(


Project varchar(255),
Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY


AS

BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END

Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.

$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzureRmResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.

$dataFactoryName = "ADFIncMultiCopyTutorialFactory";

5. To create the data factory, run the following Set-AzureRmDataFactoryV2 cmdlet:

Set-AzureRmDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

The specified Data Factory name 'ADFIncMultiCopyTutorialFactory' is already in use. Data Factory names
must be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with an on-premises machine with the
SQL Server database. The self-hosted integration runtime is the component that copies data from SQL Server on
your machine to Azure Blob storage.
1. Create a variable for the name of the integration runtime. Use a unique name, and make a note of it. You
use it later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.

Set-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -Type SelfHosted -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName

Here is the sample output:

Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description :

3. To retrieve the status of the created integration runtime, run the following command. Confirm that the value
of the State property is set to NeedRegistration.

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:

Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration

4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:


{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}

5. Copy one of the keys (exclude the double quotation marks). You use it to register the self-hosted integration
runtime that you install on your machine in the following steps.

Install the integration runtime


1. If you already have the integration runtime on your machine, uninstall it by using Add or Remove
Programs.
2. Download the self-hosted integration runtime on a local Windows machine. Run the installation.
3. On the Welcome to Microsoft Integration Runtime Setup page, select Next.
4. On the End-User License Agreement page, accept the terms and license agreement, and select Next.
5. On the Destination Folder page, select Next.
6. On the Ready to install Microsoft Integration Runtime page, select Install.
7. If you see a warning message about configuring the computer to enter sleep or hibernate mode when not
in use, select OK.
8. If you see the Power Options page, close it, and go to the setup page.
9. On the Completed the Microsoft Integration Runtime Setup page, select Finish.
10. On the Register Integration Runtime (Self-hosted) page, paste the key you saved in the previous
section, and select Register.

11. When the self-hosted integration runtime is registered successfully, you see the following message:
12. On the New Integration Runtime (Self-hosted) Node page, select Next.

13. On the Intranet Communication Channel page, select Skip. You can select a TLS/SSL certificate to secure
intranode communication in a multinode integration runtime environment.
14. On the Register Integration Runtime (Self-hosted) page, select Launch Configuration Manager.
15. When the node is connected to the cloud service, you see the following page:

16. Now, test the connectivity to your SQL Server database.


a. On the Configuration Manager page, go to the Diagnostics tab.
b. Select SqlServer for the data source type.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the user name.
g. Enter the password for the user name.
h. Select Test to confirm that the integration runtime can connect to SQL Server. If the connection is
successful, you see a green check mark. If the connection is not successful, you see an error message. Fix
any issues, and ensure that the integration runtime can connect to SQL Server.

NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your on-premises SQL Server database and SQL database.
Create the SQL Server linked service
In this step, you link your on-premises SQL Server database to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFTutorials\IncCopyMultiTableTutorial
folder with the following content. Select the right section based on the authentication you use to connect to
SQL Server. Create the local folders if they don't already exist.
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.

If you use SQL authentication, copy the following JSON definition:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

If you use Windows authentication, copy the following JSON definition:

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}

IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a backslash ( \ ) in your user account or server name, escape it with a second backslash ( \\ ). An
example is mydomain\\myuser .

2. In PowerShell, switch to the C:\ADFTutorials\IncCopyMultiTableTutorial folder.


3. Run the Set-AzureRmDataFactoryV2LinkedService cmdlet to create the linked service
SqlServerLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SqlServerLinkedService" -File ".\SqlServerLinkedService.json"

Here is the sample output:

LinkedServiceName : SqlServerLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService

Create the SQL database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the
C:\ADFTutorials\IncCopyMultiTableTutorial folder with the following content. (Create the folder if it doesn't
already exist.) Replace <server>, <database name>, <user name>, and <password> with the name of your
Azure SQL server, the name of your database, the user name, and the password before you save the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In PowerShell, run the Set-AzureRmDataFactoryV2LinkedService cmdlet to create the linked service


AzureSQLDatabaseLinkedService.

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dummyName"
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService",
"type": "LinkedServiceReference"
}
}
}

The table name is a dummy name. The Copy activity in the pipeline uses a SQL query to load the data
rather than load the entire table.
2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content. The tableName
element is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a
list of table names and passes the table name to this dataset in each iteration.

{
"name": "SinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@{dataset().SinkTableName}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"SinkTableName": {
"type": "String"
}
}
}
}

2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.


Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName
-Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.

Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : <data factory name>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names
and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the
last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column
in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the
destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next
iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [{

"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},

"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable where TableName = '@{item().TABLE_NAME}'"
},

"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from
@{item().TABLE_NAME}"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterStoredProcedureName": "@{item().StoredProcedureNameForMergeOperation}"
}
},
"dependsOn": [{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "@{item().TableType}"
}
}]
},

{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "datetime"
},
"TableName": {
"value": "@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "String"
}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}]
}

}
}],

"parameters": {
"tableList": {
"type": "Object"
"type": "Object"
}
}
}
}

2. Run the Set-AzureRmDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Run the pipeline


1. Create a parameter file named Parameters.json in the same folder with the following content:

{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}

2. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzureRmDataFactoryV2Pipeline


cmdlet. Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"

Monitor the pipeline


1. Sign in to the Azure portal.
2. Select All services, search with the keyword Data factories, and select Data factories.
3. Search for your data factory in the list of data factories, and select it to open the Data factory page.

4. On the Data factory page, select Monitor & Manage.

5. The Data Integration Application opens in a separate tab. You can see all the pipeline runs and their
status. Notice that in the following example, the status of the pipeline run is Succeeded. To check
parameters passed to the pipeline, select the link in the Parameters column. If an error occurred, you see a
link in the Error column. Select the link in the Actions column.

6. When you select the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline:

7. To go back to the Pipeline Runs view, select Pipelines as shown in the image.

Review the results


In SQL Server Management Studio, run the following queries against the target SQL database to verify that the
data was copied from source tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable


Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.

Add more data to the source tables


Run the following query against the source SQL Server database to update an existing row in customer_table.
Insert a new row into project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table


(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. Now, rerun the pipeline by executing the following PowerShell command:

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"

2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. Because the
pipeline status is In Progress, you see another action link under Actions to cancel the pipeline run.

3. Select Refresh to refresh the list until the pipeline run succeeds.

4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.

Review the final results


In SQL Server Management Studio, run the following queries against the target database to verify that the
updated/new data was copied from source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Notice the new values of Name and LastModifytime for PersonID number 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about incrementally loading data by using the Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/26/2019 • 15 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

Overview
In a data integration solution, incrementally loading data after the initial data load is a widely used scenario. In
some cases, the data changed within a period in your source data store can easily be sliced (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last
time you processed the data. The Change Tracking technology supported by data stores such as Azure SQL
Database and SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data
Factory with SQL Change Tracking technology to incrementally load delta data from Azure SQL Database into
Azure Blob Storage. For more concrete information about SQL Change Tracking technology, see Change tracking
in SQL Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database as
the source data store. You can also use an on-premises SQL Server.

1. Initial loading of historical data (run once):


a. Enable Change Tracking technology in the source Azure SQL database.
b. Get the initial value of SYS_CHANGE_VERSION in the Azure SQL database as the baseline to capture
changed data.
c. Load full data from the Azure SQL database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between two
SYS_CHANGE_VERSION values) from sys.change_tracking_tables with data in the source table, and
then move the delta data to destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.
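As a rough illustration of the incremental step, and assuming the objects created in the prerequisites below (data_source_table and table_store_ChangeTracking_version), the delta query and the version bookkeeping boil down to T-SQL along these lines. The pipeline you build later issues equivalent statements through its Lookup, Copy, and Stored Procedure activities.

-- Hedged sketch of the incremental step in plain T-SQL.
-- Version stored by the previous run, and the current version.
DECLARE @last_version BIGINT =
    (SELECT SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version
     WHERE TableName = 'data_source_table');
DECLARE @current_version BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

-- Rows inserted, updated, or deleted between the two versions.
SELECT s.PersonID, s.Name, s.Age, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM data_source_table AS s
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, @last_version) AS CT
    ON s.PersonID = CT.PersonID
WHERE CT.SYS_CHANGE_VERSION <= @current_version;

-- Remember the current version for the next run.
UPDATE table_store_ChangeTracking_version
SET SYS_CHANGE_VERSION = @current_version
WHERE TableName = 'data_source_table';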

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.
create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table


(PersonID, Name, Age)
VALUES
(1, 'aaaa', 21),
(2, 'bbbb', 24),
(3, 'cccc', 20),
(4, 'dddd', 26),
(5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days or
more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a
bigger number, or ensure that the period over which you load the changed data is within two days. For more
information, see Enable change tracking for a database

ALTER DATABASE <your database name>


SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table


ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:

create table table_store_ChangeTracking_version


(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT


SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version


VALUES ('data_source_table', @ChangeTracking_version)

NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)


AS

BEGIN

UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName

END
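The incremental pipeline that you build later in this tutorial calls this procedure from a Stored Procedure activity. With the lookup output resolved, the call it issues looks roughly like the following hedged sketch; the version number is only illustrative.

-- Hedged sketch: updating the stored change tracking version after a successful delta copy.
EXEC Update_ChangeTracking_Version
    @CurrentTrackingVersion = 10,        -- illustrative value from LookupCurrentChangeTrackingVersionActivity
    @TableName = 'data_source_table';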

Azure PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialDataFactory for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFTutorialDataFactory" is not available

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your Azure Storage account and Azure SQL database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Click Connections, and click + New.

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL Database linked service.
In this step, you link your Azure SQL database to the data factory.
1. Click Connections, and click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your Azure SQL server for the Server name field.
c. Select your Azure SQL database for the Database name field.
d. Enter name of the user for the User name field.
e. Enter password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table.

Create a dataset to represent data copied to sink data store.


In this step, you create a dataset to represent the data that is copied from the source data store. You created the
adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the container if it doesn't
exist, or set the name to that of an existing container. In this tutorial, the output file name is dynamically generated
by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt') .
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure Blob Storage, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SinkDataset.
4. Switch to the Connection tab in the Properties window, and do the following steps:
a. Select AzureStorageLinkedService for Linked service.
b. Enter adftutorial/incchgtracking for folder part of the filePath.
c. Enter @CONCAT('Incremental-', pipeline().RunId, '.txt') for file part of the filePath.

Create a dataset to represent change tracking data


In this step, you create a dataset for storing the change tracking version. You created the table
table_store_ChangeTracking_version as part of the prerequisites.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to ChangeTrackingDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[table_store_ChangeTracking_version] for Table.

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store (Azure
SQL Database) to the destination data store (Azure Blob Storage).
1. Click + (plus) in the left pane, and click Pipeline.

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity.
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.

5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish. Wait until the publishing
succeeds.

8. Wait until you see the Successfully published message.

9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X.
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now.

Monitor the full copy pipeline


1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh. The links in the Actions column let you view activity runs associated with the pipeline run and to
rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to the
pipeline runs view, click Pipelines link at the top.

Review the results


You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from the Azure SQL database:

1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22

Add more data to the source table


Run the following query against the Azure SQL database to add a row and update a row.
INSERT INTO data_source_table
    (PersonID, Name, Age)
VALUES
    (6, 'new', 50);

UPDATE data_source_table
SET [Age] = 10, [Name] = 'update' WHERE [PersonID] = 1

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get the
old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy activity
copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure SQL
Database to Azure Blob Storage. The stored procedure activity updates the value of SYS_CHANGE_VERSION
for the next pipeline run.
1. In the Data Factory UI, switch to the Edit tab. Click + (plus) in the left pane, and click Pipeline.

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity. This activity gets the
change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version.
4. Switch to the Settings tab in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.

5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity. This activity gets the current
change tracking version.

6. Switch to the Settings tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

SELECT CHANGE_TRACKING_CURRENT_VERSION() as CurrentChangeTrackingVersion


7. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface. Set the name of the activity to IncrementalCopyActivity. This activity copies the data between
last change tracking version and the current change tracking version to the destination data store.

8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID, data_source_table.Name, data_source_table.Age,
    CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION
from data_source_table
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
    @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT
on data_source_table.PersonID = CT.PersonID
where CT.SYS_CHANGE_VERSION <=
    @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}

9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached to the
Lookup activity to the Copy activity.

11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface.
Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity. This activity updates
the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.

13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify the following values for the parameters:

NAME                     TYPE      VALUE
CurrentTrackingVersion   Int64     @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
TableName                String    @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}
14. Connect the Copy activity to the Stored Procedure Activity. Drag-and-drop the green button attached
to the Copy activity to the Stored Procedure activity.

15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.

16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the Publish
All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now.

2. In the Pipeline Run window, select Finish.


Monitor the incremental copy pipeline
1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh. The links in the Actions column let you view activity runs associated with the pipeline run and to
rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to the
pipeline runs view, click Pipelines link at the top.

Review the results


You see the second file in the incchgtracking folder of the adftutorial container.

The file should have only the delta data from the Azure SQL database. The row marked U is the row that was updated in
the database, and the row marked I is the newly inserted row.

1,update,10,2,U
6,new,50,1,I

The first three columns are changed data from data_source_table. The last two columns are the metadata from
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
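
If you want to inspect the same change data directly in SQL, you can query CHANGETABLE yourself. The following is a
minimal sketch that wraps the query in Invoke-Sqlcmd from the SqlServer PowerShell module; the server, database, and
credential values are placeholders, and the version argument 0 simply returns all changes recorded since change
tracking was enabled.

# Minimal sketch: inspect change tracking data directly. All connection values are placeholders.
# Requires the SqlServer module (Install-Module SqlServer).
$query = @"
SELECT d.PersonID, d.Name, d.Age, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM data_source_table AS d
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, 0) AS CT
    ON d.PersonID = CT.PersonID;
"@

Invoke-Sqlcmd -ServerInstance "<yourserver>.database.windows.net" -Database "<yourdatabase>" `
    -Username "<user name>" -Password "<password>" -Query $query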

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/5/2019 • 14 minutes to read

In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can be easily sliced up (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time
you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database
as the source data store. You can also use an on-premises SQL Server.

1. Initial loading of historical data (run once):


a. Enable Change Tracking technology in the source Azure SQL database.
b. Get the initial value of SYS_CHANGE_VERSION in the Azure SQL database as the baseline to capture
changed data.
c. Load full data from the Azure SQL database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of the changed rows (between the two
SYS_CHANGE_VERSION values, returned by CHANGETABLE) with the data in the source table,
and then move the delta data to the destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
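
If you prefer to create the adftutorial container from PowerShell instead of the portal or Storage Explorer, here is a
minimal sketch using the Az.Storage cmdlets; the storage account and resource group names are placeholders for your
own values.

# Minimal sketch: create the adftutorial container with Az.Storage. Names are placeholders.
$storageAccountName = "<yourStorageAccountName>"
$storageResourceGroup = "<yourStorageResourceGroup>"

$storageAccount = Get-AzStorageAccount -ResourceGroupName $storageResourceGroup -Name $storageAccountName
New-AzStorageContainer -Name "adftutorial" -Context $storageAccount.Context
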
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Object Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.

create table data_source_table
(
    PersonID int NOT NULL,
    Name varchar(255),
    Age int,
    PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table
    (PersonID, Name, Age)
VALUES
    (1, 'aaaa', 21),
    (2, 'bbbb', 24),
    (3, 'cccc', 20),
    (4, 'dddd', 26),
    (5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days
or more, some changed data is not included. You need to either change the value of CHANGE_RETENTION to a
bigger number, or ensure that your period to load the changed data is within two days. For more
information, see Enable change tracking for a database.

ALTER DATABASE <your database name>
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:

create table table_store_ChangeTracking_version
(
    TableName varchar(255),
    SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version)
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.

6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
    UPDATE table_store_ChangeTracking_version
    SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
    WHERE [TableName] = @TableName
END
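
If you want to sanity-check the setup from PowerShell before building the pipelines, the following is a minimal sketch
that uses Invoke-Sqlcmd (SqlServer module) to read the stored version and the database's current change tracking
version; the connection values are placeholders.

# Minimal sketch: verify the change tracking setup. Connection values are placeholders.
$query = @"
SELECT TableName, SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version;
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS CurrentChangeTrackingVersion;
"@

Invoke-Sqlcmd -ServerInstance "<yourserver>.database.windows.net" -Database "<yourdatabase>" `
    -Username "<user name>" -Password "<password>" -Query $query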

Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes, and
then run the command. For example: "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

2. Define a variable for the location of the data factory:

$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique.
$dataFactoryName = "IncCopyChgTrackingDF";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
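
If you'd rather check region availability from PowerShell than browse the products-by-region page, here is a hedged
sketch that lists the locations registered for the Data Factory resource provider; the resource type name "factories"
corresponds to the V2 factory type.

# Minimal sketch: list regions where the Data Factory (V2) resource type is available.
(Get-AzResourceProvider -ProviderNamespace "Microsoft.DataFactory").ResourceTypes |
    Where-Object { $_.ResourceTypeName -eq "factories" } |
    Select-Object -ExpandProperty Locations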

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In this
section, you create linked services to your Azure Storage account and Azure SQL database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. (Create the folder
if it does not already exist.) Replace <accountName> and <accountKey> with the name and key of your Azure storage
account before saving the file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}

2. In Azure PowerShell, switch to the C:\ADFTutorials\IncCopyChangeTrackingTutorial folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName
and DataFactoryName parameters.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create Azure SQL Database linked service.


In this step, you link your Azure SQL database to the data factory.
1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. Replace <server>,
<database name>, <user name>, and <password> with the name of your Azure SQL server, the name of your
database, the user name, and the password before saving the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}

2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SourceDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


In this step, you create a dataset to represent the data that is copied from the source data store.
1. Create a JSON file named SinkDataset.json in the same folder with the following content:

{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incchgtracking",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

You create the adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the
container if it does not exist, or set it to the name of an existing one. In this tutorial, the output file name is
dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"
Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a change tracking dataset


In this step, you create a dataset for storing the change tracking version.
1. Create a JSON file named ChangeTrackingDataset.json in the same folder with the following content:

{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

You create the table table_store_ChangeTracking_version as part of the prerequisites.


2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: ChangeTrackingDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "ChangeTrackingDataset" -File ".\ChangeTrackingDataset.json"

Here is the sample output of the cmdlet:

DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store (Azure
SQL Database) to the destination data store (Azure Blob Storage).
1. Create a JSON file named FullCopyPipeline.json in the same folder with the following content:
{
"name": "FullCopyPipeline",
"properties": {
"activities": [{
"name": "FullCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink"
}
},

"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: FullCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "FullCopyPipeline" -File ".\FullCopyPipeline.json"

Here is the sample output:

PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :

Run the full copy pipeline


Run the pipeline: FullCopyPipeline by using Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "FullCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName
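
If you want to monitor the run from PowerShell rather than the Monitor & Manage application, the following is a
minimal sketch. It assumes you captured the run ID returned by Invoke-AzDataFactoryV2Pipeline (for example,
$runId = Invoke-AzDataFactoryV2Pipeline ...); the polling interval and the time window passed to
Get-AzDataFactoryV2ActivityRun are illustrative.

# Minimal sketch: poll the pipeline run and then list its activity runs.
# Assumes $runId holds the run ID returned by Invoke-AzDataFactoryV2Pipeline.
while ($true) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName -PipelineRunId $runId
    if ($run.Status -notin @("Queued", "InProgress")) { break }
    Start-Sleep -Seconds 30
}

# List the activity runs for this pipeline run (the time window here is illustrative).
Get-AzDataFactoryV2ActivityRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -PipelineRunId $runId -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date).AddHours(1)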

Monitor the full copy pipeline


1. Log in to Azure portal.
2. Click All services, search with the keyword data factories, and select Data factories.
3. Search for your data factory in the list of data factories, and select it to launch the Data factory page.

4. In the Data factory page, click Monitor & Manage tile.

5. The Data Integration Application launches in a separate tab. You can see all the pipeline runs and their
statuses. Notice that in the following example, the status of the pipeline run is Succeeded. You can check
the parameters passed to the pipeline by clicking the link in the Parameters column. If there was an error, you see
a link in the Error column. Click the link in the Actions column.
6. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.

7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named Incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from the Azure SQL database:

1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
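
To verify the output from PowerShell instead of Storage Explorer, here is a minimal sketch that lists the blobs under
the incchgtracking folder; the storage account name and key are placeholders (the same values used in
AzureStorageLinkedService.json).

# Minimal sketch: list the output blobs. Storage account values are placeholders.
$context = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "incchgtracking/" -Context $context |
    Select-Object Name, Length, LastModified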

Add more data to the source table


Run the following query against the Azure SQL database to add a row and update a row.

INSERT INTO data_source_table
    (PersonID, Name, Age)
VALUES
    (6, 'new', 50);

UPDATE data_source_table
SET [Age] = 10, [Name] = 'update' WHERE [PersonID] = 1

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get the
old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy activity
copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure SQL
Database to Azure Blob Storage. The stored procedure activity updates the value of SYS_CHANGE_VERSION
for the next pipeline run.
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},

"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},

{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
", "type": "INT64" },
"TableName": {
"value":"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type":"String"}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]

}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:


PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {LookupLastChangeTrackingVersionActivity,
LookupCurrentChangeTrackingVersionActivity, IncrementalCopyActivity,
StoredProceduretoUpdateChangeTrackingActivity}
Parameters :

Run the incremental copy pipeline


Run the pipeline: IncrementalCopyPipeline by using Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

Monitor the incremental copy pipeline


1. In the Data Integration Application, refresh the pipeline runs view. Confirm that you see the
IncrementalCopyPipeline in the list. Click the link in the Actions column.

2. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.

3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.
The file should have only the delta data from the Azure SQL database. The row marked U is the row that was updated in
the database, and the row marked I is the newly inserted row.

1,update,10,2,U
6,new,50,1,I

The first three columns are changed data from data_source_table. The last two columns are the metadata from
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.

==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
5/10/2019 • 5 minutes to read

In this tutorial, you'll use the Azure portal to create a data factory. Then, you'll use the Copy Data tool to create a
pipeline that incrementally copies only new and changed files, based on their LastModifiedDate, from Azure Blob
storage to Azure Blob storage.
With this approach, ADF scans all the files in the source store, applies the file filter based on LastModifiedDate, and
copies only the files that are new or updated since the last run to the destination store. Note that if you let ADF scan
a huge number of files but copy only a few of them to the destination, the run can still take a long time, because the
file scanning itself is time consuming.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you will perform the following tasks:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. You can use various tools to perform this task, such as Azure Storage
Explorer.
2. Create a container named destination.
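
If you'd rather script these steps than use Storage Explorer, here is a minimal sketch with the Az.Storage cmdlets; the
storage account name and key are placeholders. The commented line shows how you could later upload file1.txt and
file2.txt the same way.

# Minimal sketch: create the source and destination containers. Values are placeholders.
$context = New-AzStorageContext -StorageAccountName "<yourStorageAccountName>" -StorageAccountKey "<yourStorageAccountKey>"
New-AzStorageContainer -Name "source" -Context $context
New-AzStorageContainer -Name "destination" -Context $context

# Later in the tutorial, the test files can be uploaded the same way, for example:
# Set-AzStorageBlobContent -File ".\file1.txt" -Container "source" -Blob "file1.txt" -Context $context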

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name for your data factory must be globally unique. You might receive the following error message:
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which you'll create the new data factory.
4. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that your data factory uses can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, refer to the Deploying Data Factory tile to see the process status.

10. After creation is finished, the Data Factory home page is displayed.
11. To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to open the Copy Data tool.

2. On the Properties page, take the following steps:


a. Under Task name, enter DeltaCopyFromBlobPipeline.
b. Under Task cadence or Task schedule, select Run regularly on schedule.
c. Under Trigger Type, select Tumbling Window.
d. Under Recurrence, enter 15 Minute(s).
e. Select Next.
The Data Factory UI creates a pipeline with the specified task name.

3. On the Source data store page, complete the following steps:


a. Select + Create new connection, to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue.

c. On the New Linked Service page, select your storage account from the Storage account name list
and then select Finish.
d. Select the newly created linked service and then select Next.

4. On the Choose the input file or folder page, complete the following steps:
a. Browse and select the source folder, and then select Choose.
b. Under File loading behavior, select Incremental load: LastModifiedDate.

c. Check Binary copy and select Next.


5. On the Destination data store page, select AzureBlobStorage. This is the same storage account as the
source data store. Then select Next.

6. On the Choose the output file or folder page, complete the following steps:
a. Browse and select the destination folder, and then select Choose.

b. Select Next.

7. On the Settings page, select Next.


8. On the Summary page, review the settings and then select Next.

9. On the Deployment page, select Monitor to monitor the pipeline (task).


10. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list, and select the View Activity
Runs link in the Actions column.

11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column.

Because there is no file in the source container in your Blob storage account, you will not see any file
copied to the destination container in your Blob storage account.
12. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage
account. You can use various tools to perform these tasks, such as Azure Storage Explorer.

13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically.

14. Select View Activity Run for the second pipeline run when you see it. Then review the details in the same
way you did for the first pipeline run.
You will see that one file (file1.txt) has been copied from the source container to the destination container
of your Blob storage account.

15. Create another empty text file and name it file2.txt. Upload this text file to the source container in your
Blob storage account.
16. Repeat steps 13 and 14 for this second text file. You will see that only the new file (file2.txt) has been copied
from the source container to the destination container of your storage account in the next pipeline run.

You can also verify this by using Azure Storage Explorer to scan the files.
Next steps
Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
3/26/2019 • 5 minutes to read

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that incrementally copies new files based on time partitioned file name from Azure Blob storage to Azure
Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure storage
account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. Create a folder path as 2019/02/26/14 in your container. Create an
empty text file, and name it as file1.txt. Upload the file1.txt to the folder path source/2019/02/26/14 in
your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
NOTE
Please adjust the folder name with your UTC time. For example, if the current UTC time is 2:03 PM on Feb 26th,
2019, you can create the folder path as source/2019/02/26/14/ by the rule of
source/{Year}/{Month}/{Day}/{Hour}/.

2. Create a container named destination. You can use various tools to perform these tasks, such as Azure
Storage Explorer.
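
If you prefer PowerShell over Storage Explorer for this preparation, here is a minimal sketch that creates both
containers, builds the UTC time-partitioned folder path, and uploads an empty file1.txt to it; the storage account
values are placeholders.

# Minimal sketch: prepare the containers and upload file1.txt to the current UTC hour folder.
# Storage account values are placeholders.
$context = New-AzStorageContext -StorageAccountName "<yourStorageAccountName>" -StorageAccountKey "<yourStorageAccountKey>"
New-AzStorageContainer -Name "source" -Context $context
New-AzStorageContainer -Name "destination" -Context $context

# The folder path follows the {Year}/{Month}/{Day}/{Hour} rule based on the current UTC time.
$utcNow = (Get-Date).ToUniversalTime()
$folder = "{0:yyyy}/{0:MM}/{0:dd}/{0:HH}" -f $utcNow

New-Item -ItemType File -Path ".\file1.txt" -Force | Out-Null
Set-AzStorageBlobContent -File ".\file1.txt" -Container "source" -Blob "$folder/file1.txt" -Context $context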

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

2. On the New data factory page, under Name, enter ADFTutorialDataFactory.


The name for your data factory must be globally unique. You might receive the following error message:

If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.

11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.

Use the Copy Data tool to create a pipeline


1. On the Let's get started page, select the Copy Data tile to launch the Copy Data tool.
2. On the Properties page, take the following steps:
a. Under Task name, enter DeltaCopyFromBlobPipeline.
b. Under Task cadence or Task schedule, select Run regularly on schedule.
c. Under Trigger type, select Tumbling Window.
d. Under Recurrence, enter 1 Hour(s).
e. Select Next.
The Data Factory UI creates a pipeline with the specified task name.
3. On the Source data store page, complete the following steps:
a. Click + Create new connection, to add a connection.
b. Select Azure Blob Storage from the gallery, and then click Continue.

c. On the New Linked Service page, select your storage account from the Storage account name list,
and then click Finish.

d. Select the newly created linked service, then click Next.


4. On the Choose the input file or folder page, do the following steps:
a. Browse and select the source container, then select Choose.

b. Under File loading behavior, select Incremental load: time-partitioned folder/file names.
c. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as
follows:
d. Check Binary copy and click Next.

5. On the Destination data store page, select AzureBlobStorage, which is the same storage account as the
data source store, and then click Next.
6. On the Choose the output file or folder page, do the following steps:
a. Browse and select the destination folder, then click Choose.

b. Write the dynamic folder path as destination/{year}/{month}/{day}/{hour}/, and change the format as
follows:
c. Click Next.

7. On the Settings page, select Next.


8. On the Summary page, review the settings, and then select Next.

9. On the Deployment page, select Monitor to monitor the pipeline (task).


10. Notice that the Monitor tab on the left is automatically selected. You need to wait for the pipeline run to be
triggered automatically (after about one hour). When it runs, the Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list, and select the View Activity
Runs link in the Actions column.

11. There's only one activity (copy activity) in the pipeline, so you see only one entry. You can see the source file
(file1.txt) has been copied from source/2019/02/26/14/ to destination/2019/02/26/14/ with the same
file name.

You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files.
12. Create another empty text file with the new name as file2.txt. Upload the file2.txt file to the folder path
source/2019/02/26/15 in your storage account. You can use various tools to perform these tasks, such as
Azure Storage Explorer.

NOTE
You might be aware that a new folder path is required to be created. Please adjust the folder name with your UTC
time. For example, if the current UTC time is 3:20 PM on Feb 26th, 2019, you can create the folder path as
source/2019/02/26/15/ by the rule of {Year}/{Month}/{Day}/{Hour}/.

13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically after another hour.
14. Select View Activity Run for the second pipeline run when it appears, and review the details in the same way.

You can see the source file (file2.txt) has been copied from source/2019/02/26/15/ to
destination/2019/02/26/15/ with the same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files
in the destination container.

Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Transform data in the cloud by using a Spark activity
in Azure Data Factory
3/7/2019 • 7 minutes to read

In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by
using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure storage account. You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.

NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a premium
or blob-only storage account.

Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
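
If you want to perform these uploads from PowerShell instead of the portal or Storage Explorer, here is a minimal
sketch using Az.Storage; the storage account values are placeholders, and blob "folders" are simply name prefixes.

# Minimal sketch: upload the Spark script and the input file. Values are placeholders.
$context = New-AzStorageContext -StorageAccountName "<storageaccountname>" -StorageAccountKey "<storageaccountkey>"
New-AzStorageContainer -Name "adftutorial" -Context $context -ErrorAction SilentlyContinue

Set-AzStorageBlobContent -File ".\WordCount_Spark.py" -Container "adftutorial" -Blob "spark/script/WordCount_Spark.py" -Context $context
Set-AzStorageBlobContent -File ".\minecraftstory.txt" -Container "adftutorial" -Blob "spark/inputfiles/minecraftstory.txt" -Context $context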

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Select New on the left menu, select Data + Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the
name of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules for
Data Factory artifacts, see the Data Factory - naming rules article.

4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


You author two linked services in this section:
An Azure Storage linked service that links an Azure storage account to the data factory. This storage is used
by the on-demand HDInsight cluster. It also contains the Spark script to be run.
An on-demand HDInsight linked service. Azure Data Factory automatically creates an HDInsight cluster
and runs the Spark program. It then deletes the HDInsight cluster after the cluster is idle for a preconfigured
time.
Create an Azure Storage linked service
1. On the Let's get started page, switch to the Edit tab in the left panel.
2. Select Connections at the bottom of the window, and then select + New.

3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureStorage1. You created this linked service earlier. If you
used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an HDInsight
cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data factory.
The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its primary
storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.

Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.

2. In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureStorage1.
b. Select Browse Storage.

c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow ) button to
close the validation window.
6. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.

Trigger a pipeline run


Select Trigger on the toolbar, and then select Trigger Now.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 20 minutes to create
a Spark cluster.
2. Select Refresh periodically to check the status of the pipeline run.

3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.

Verify the output


Verify that the output file is created in the spark/outputfiles/wordcount folder of the adftutorial container.
The file should have each word from the input text file and the number of times the word appeared in the file. For
example:

(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)

Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service.
You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual network,
advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity in
Azure Data Factory
3/7/2019 • 7 minutes to read

In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Spark
Activity and an on-demand HDInsight linked service. You perform the following steps in this tutorial:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a Python script and an input file, and upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The spark program counts the number of words in this
text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt to the inputfiles subfolder.

Author linked services


You author two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is used
by the on-demand HDInsight cluster. It also contains the Spark script to be executed.
An On-Demand HDInsight Linked Service. Azure Data Factory automatically creates an HDInsight cluster, runs
the Spark program, and then deletes the HDInsight cluster after it's idle for a preconfigured time.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
}
}
}

Update the <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage
account.
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.

{
    "name": "MyOnDemandSparkLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterSize": 2,
            "clusterType": "spark",
            "timeToLive": "00:15:00",
            "hostSubscriptionId": "<subscriptionID>",
            "servicePrincipalId": "<servicePrincipalID>",
            "servicePrincipalKey": {
                "value": "<servicePrincipalKey>",
                "type": "SecureString"
            },
            "tenant": "<tenantID>",
            "clusterResourceGroup": "<resourceGroupofHDICluster>",
            "version": "3.6",
            "osType": "Linux",
            "clusterNamePrefix": "ADFSparkSample",
            "linkedServiceName": {
                "referenceName": "MyStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}

Update values for the following properties in the linked service definition:
hostSubscriptionId. Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID
and key of your service principal in Azure Active Directory. This service principal needs to be a member of
the Contributor role on the subscription or on the resource group in which the cluster is created. See create Azure
Active Directory application and service principal for details, or the PowerShell sketch after the note below.
clusterResourceGroup. Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same location as the Azure Storage account used as its primary
storage. Ensure that you have enough core quota for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
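If you don't have a service principal yet, the following is a minimal PowerShell sketch, assuming the Az.Resources module; the display name is just an example, and property names such as AppId can differ slightly between Az versions. You still need a client secret (the servicePrincipalKey value), which you can create in the Azure portal under the app registration.

# Create a service principal (the display name is an arbitrary example).
$sp = New-AzADServicePrincipal -DisplayName "ADFSparkTutorialApp"

# Grant it the Contributor role at the subscription scope so Data Factory can create the on-demand cluster.
# Depending on your Az version, the property may be $sp.AppId or $sp.ApplicationId.
New-AzRoleAssignment -ApplicationId $sp.AppId -RoleDefinitionName "Contributor" -Scope "/subscriptions/<subscriptionID>"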

Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following pipeline JSON definition, and save it
as MySparkOnDemandPipeline.json.

{
    "name": "MySparkOnDemandPipeline",
    "properties": {
        "activities": [
            {
                "name": "MySparkActivity",
                "type": "HDInsightSpark",
                "linkedServiceName": {
                    "referenceName": "MyOnDemandSparkLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "rootPath": "adftutorial/spark",
                    "entryFilePath": "script/WordCount_Spark.py",
                    "getDebugInfo": "Failure",
                    "sparkJobLinkedService": {
                        "referenceName": "MyStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            }
        ]
    }
}

Note the following points:


rootPath points to the spark folder of the adftutorial container.
entryFilePath points to the WordCount_Spark.py file in the script subfolder of the spark folder.
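For reference, the blob layout that this pipeline expects looks like the following; the outputfiles folder is created by the Spark job itself.

adftutorial                              (container; rootPath = adftutorial/spark)
    spark/script/WordCount_Spark.py      (entryFilePath, relative to rootPath)
    spark/inputfiles/minecraftstory.txt  (input read by the script)
    spark/outputfiles/wordcount/         (written by the Spark program)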

Create a data factory


You have authored linked service and pipeline definitions in JSON files. Now, let's create a data factory, and
deploy the linked service and pipeline JSON files by using PowerShell cmdlets. Run the following PowerShell
commands one by one:
1. Set variables one by one.
Resource Group Name

$resourceGroupName = "ADFTutorialResourceGroup"

Data Factory Name. Must be globally unique


$dataFactoryName = "MyDataFactory09102017"

Pipeline name

$pipelineName = "MySparkOnDemandPipeline" # Name of the pipeline

2. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen it,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by the data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

3. Create the resource group: ADFTutorialResourceGroup.

New-AzResourceGroup -Name $resourceGroupName -Location "East Us"

4. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:

$df

5. Switch to the folder where you created JSON files, and run the following command to deploy an Azure
Storage linked service:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"

6. Run the following command to deploy an on-demand Spark linked service:


Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyOnDemandSparkLinkedService" -File "MyOnDemandSparkLinkedService.json"

7. Run the following command to deploy a pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MySparkOnDemandPipeline.json"

Start and monitor a pipeline run


1. Start a pipeline run. It also captures the pipeline run ID for future monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"
$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"
$result.Error -join "`r`n"

3. Here is the output of the sample run:


Pipeline run status: In Progress
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName :
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output :
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MySparkOnDemandPipeline' run finished. Result:


ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : MyDataFactory09102017
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output : {clusterInUse, jobId, ExecutionProgress, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd : 9/20/2017 6:46:30 AM
DurationInMs : 763466
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"clusterInUse": "https://ADFSparkSamplexxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.azurehdinsight.net/"
"jobId": "0"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MySparkActivity"

4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container with the
output from the Spark program.
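If you want to check this from PowerShell rather than the portal, the following is a minimal sketch that assumes the Az.Storage module and reuses your storage account name and key.

# List the word-count output blobs produced by the Spark job.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "spark/outputfiles/wordcount" -Context $ctx | Select-Object Name, Length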

Next steps
The pipeline in this sample transforms data by running a Spark program on an on-demand HDInsight cluster. You
learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running a Hive script on an Azure HDInsight cluster
that is in a virtual network.
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
5/22/2019 • 5 minutes to read

In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks
notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks
notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Prerequisites
Azure Databricks workspace. Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it using
Azure Data Factory.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Select Create a resource on the left menu, select Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the name
of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see the Data Factory - naming rules article.
4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
Some of the steps in this tutorial assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data Factory
uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


In this section, you author a Databricks linked service. This linked service contains the connection information to
the Databricks cluster:
Create an Azure Databricks linked service
1. On the Let's get started page, switch to the Edit tab in the left panel.

2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.
4. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureDatabricks_LinkedService
b. Select the appropriate Databricks workspace that you will run your notebook in
c. For Select cluster, select New job cluster
d. For Domain/Region, the info should auto-populate.
e. For Access Token, generate it from the Azure Databricks workspace. You can find the steps here.
f. For Cluster version, select 4.2 (with Apache Spark 2.3.1, Scala 2.11)
g. For Cluster node type, select Standard_D3_v2 under the General Purpose (HDD) category for this
tutorial.
h. For Workers, enter 2.
i. Select Finish
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the pipeline. Later, you pass this parameter to the Databricks Notebook
Activity. In the empty pipeline, click the Parameters tab, then select New, and name it 'name'.
3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox to
the pipeline designer surface.

4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks Notebook path. Let’s create a notebook and specify the path here. You get
the Notebook Path by following the next few steps.
a. Launch your Azure Databricks workspace.
b. Create a new folder in the workspace and name it adftutorial.

c. Create a new notebook (Python); let’s call it mynotebook, under the adftutorial folder, and click Create.
d. In the newly created notebook "mynotebook", add the following code:

# Creating widgets for leveraging parameters, and printing the parameters

dbutils.widgets.text("input", "","")
dbutils.widgets.get("input")
y = getArgument("input")
print ("Param -\'input':")
print (y)

e. The Notebook Path in this case is /adftutorial/mynotebook


5. Switch back to the Data Factory UI authoring tool. Navigate to the Settings tab under the Notebook1
activity.
a. Add a parameter to the Notebook activity. You use the same parameter that you added earlier to the
pipeline.
b. Name the parameter input and provide the expression @pipeline().parameters.name as its value.
6. To validate the pipeline, select the Validate button on the toolbar. To close the validation window, select the
>> (right arrow) button.

7. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now.

The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to create
a Databricks job cluster, where the notebook is executed.

2. Select Refresh periodically to check the status of the pipeline run.


3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
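If you prefer to check the run from PowerShell instead of the Monitor tab, the following is a minimal sketch that assumes the Az.DataFactory module and the resource group and data factory names you chose earlier (placeholders below).

# List recent pipeline runs in the factory and show their status.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<yourResourceGroup>" -DataFactoryName "<yourDataFactoryName>" `
    -LastUpdatedAfter (Get-Date).AddHours(-1) -LastUpdatedBefore (Get-Date) |
    Select-Object PipelineName, RunStart, RunEnd, Status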
Verify the output
You can log on to your Azure Databricks workspace, go to Clusters, and see the job status as pending
execution, running, or terminated.

You can click the job name and navigate to see further details. On a successful run, you can validate the
parameters passed and the output of the Python notebook.

Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned how
to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
3/15/2019 • 9 minutes to read

In this tutorial, you use Azure portal to create a Data Factory pipeline that transforms data using Hive Activity on a
HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure Virtual Network. Here is a sample configuration of
the Azure Virtual Network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
A virtual machine. Create an Azure virtual machine (VM) and join it to the same virtual network that
contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Log in to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter ADFTutorialHiveFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.

`Data factory name “MyAzureSsisDataFactory” is not available`

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories are
shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.

13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
Create a self-hosted integration runtime
As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR) in the
same virtual network. In this section, you create a new VM, join it to the same virtual network, and install the self-
hosted IR on it. The self-hosted IR allows the Data Factory service to dispatch processing requests to a compute
service such as HDInsight inside a virtual network. It also allows you to move data between data stores inside a
virtual network and Azure. You use a self-hosted IR when the data store or compute is in an on-premises
environment as well.
1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration
Runtimes tab, and click + New button on the toolbar.

2. In the Integration Runtime Setup window, select the Perform data movement and dispatch activities to
external computes option, and click Next.
3. Select Private Network, and click Next.
4. Enter MySelfHostedIR for Name, and click Next.

5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep the
window open. You use this key to register the IR installed in a virtual machine.
Install IR on a virtual machine
1. On the Azure VM, download self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.

3. Click Launch Configuration Manager. You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI, you should see the name of the self-hosted IR node (the VM) and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.

Create linked services


You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, you use this Azure Storage account to store the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Create Azure Storage linked service
1. Switch to the Linked Services tab, and click New.

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureStorageLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime.
c. Select your Azure storage account for Storage account name.
d. To test the connection to storage account, click Test connection.
e. Click Save.

Create HDInsight linked service


1. Click New again to create another linked service.
2. Switch to the Compute tab, select Azure HDInsight, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureHDInsightLinkedService for Name.
b. Select Bring your own HDInsight.
c. Select your HDInsight cluster for Hdi cluster.
d. Enter the user name for the HDInsight cluster.
e. Enter the password for the user.
This article assumes that you have access to the cluster over the internet; for example, that you can connect to the
cluster at https://clustername.azurehdinsight.net . This address uses the public gateway, which is not available if
you have used network security groups (NSGs) or user-defined routes (UDRs) to restrict access from the internet.
For Data Factory to be able to submit jobs to an HDInsight cluster in an Azure Virtual Network, you need to configure
your Azure Virtual Network in such a way that the URL can be resolved to the private IP address of the gateway used by
HDInsight.
1. From the Azure portal, open the Virtual Network the HDInsight cluster is in. Open the network interface with a name
starting with nic-gateway-0 . Note down its private IP address, for example, 10.6.0.15. (A PowerShell sketch for
finding this address follows these steps.)
2. If your Azure Virtual Network has a DNS server, update the DNS record so the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . If you don’t have a DNS server in
your Azure Virtual Network, you can temporarily work around this by editing the hosts file
(C:\Windows\System32\drivers\etc) of all VMs that are registered as self-hosted integration runtime nodes by
adding an entry similar to the following one:
10.6.0.15 myHDIClusterName.azurehdinsight.net
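If you prefer to look up the gateway's private IP address from PowerShell instead of the portal, the following is a minimal sketch that assumes the Az.Network module; it scans the network interfaces in your subscription for the nic-gateway-0 prefix.

# Find the HDInsight gateway NIC(s) and print their private IP addresses.
Get-AzNetworkInterface |
    Where-Object { $_.Name -like "nic-gateway-0*" } |
    ForEach-Object { "$($_.Name): $($_.IpConfigurations[0].PrivateIpAddress)" }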

Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined.
Note the following points:
scriptPath points to the path of the Hive script in the Azure Storage account you used for AzureStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline.

2. In the Activities toolbox, expand HDInsight, and drag-drop Hive activity to the pipeline designer surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service.

4. Switch to the Scripts tab, and do the following steps:


a. Select AzureStorageLinkedService for Script Linked Service.
b. For File Path, click Browse Storage.
c. In the Choose a file or folder window, navigate to hivescripts folder of the adftutorial container,
select hivescript.hql, and click Finish.

d. Confirm that you see adftutorial/hivescripts/hivescript.hql for File Path.

e. In the Script tab, expand Advanced section.


f. Click Auto-fill from script for Parameters.
g. Enter the value for the Output parameter in the following format:
wasb://<Blob Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ . For example:
wasb://[email protected]/outputfolder/ .
5. To publish artifacts to Data Factory, click Publish.

Trigger a pipeline run


1. First, validate the pipeline by clicking the Validate button on the toolbar. Close the Pipeline Validation
Output window by clicking right-arrow (>>).
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You see a pipeline run in the Pipeline Runs list.

2. To refresh the list, click Refresh.


3. To view activity runs associated with the pipeline runs, click View activity runs in the Action column.
Other action links are for stopping/rerunning the pipeline.

4. You see only one activity run because the pipeline has only one activity, of type HDInsightHive. To
switch back to the previous view, click the Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about branching and chaining control flow in a Data Factory pipeline:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
4/11/2019 • 9 minutes to read

In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Hive Activity
on a HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run
Verify the output.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure Virtual Network. Here is a sample configuration of
the Azure Virtual Network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:

DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable

2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use an
existing resource group if you like.

$resourceGroupName = "ADFTutorialResourceGroup"

2. Specify the data factory name. Must be globally unique.

$dataFactoryName = "MyDataFactory09142017"

3. Specify a name for the pipeline.

$pipelineName = "MyHivePipeline" # Name of the pipeline

4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when the
Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.

$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"

5. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen it,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate Data
Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by the data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

6. Create the resource group: ADFTutorialResourceGroup if it does not exist already in your subscription.

New-AzResourceGroup -Name $resourceGroupName -Location "East Us"

7. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:


$df

Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network where your HDInsight cluster is in.
1. Create the self-hosted integration runtime. Use a unique name in case another integration runtime with the
same name already exists.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted

This command creates a logical registration of the self-hosted integration runtime.


2. Use PowerShell to retrieve authentication keys to register the self-hosted integration runtime. Copy one of
the keys for registering the self-hosted integration runtime.

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}

Note down the value of AuthKey1 without the quotation marks.


3. Create an Azure VM and join it to the same Azure Virtual Network that contains your HDInsight cluster. For
details, see How to create virtual machines.
4. On the Azure VM, download the self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
You see the following message when the self-hosted integration runtime is registered successfully:

You see the following page when the node is connected to the cloud service:
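You can also confirm from PowerShell that the node is online. This is a minimal sketch that reuses the variables defined earlier and assumes the Az.DataFactory module.

# The -Status switch returns detailed runtime status, including the registered nodes.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName -Status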
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.

{
    "name": "MyStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=<storageAccountKey>",
                "type": "SecureString"
            }
        },
        "connectVia": {
            "referenceName": "MySelfhostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account.
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json.

{
    "name": "MyHDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<clustername>.azurehdinsight.net",
            "userName": "<username>",
            "password": {
                "value": "<password>",
                "type": "SecureString"
            },
            "linkedServiceName": {
                "referenceName": "MyStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        },
        "connectVia": {
            "referenceName": "MySelfhostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Update values for the following properties in the linked service definition:
userName. Name of the cluster login user that you specified when creating the cluster.
password. The password for the user.
clusterUri. Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net . This article assumes that you have access to the cluster over the
internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net . This
address uses the public gateway, which is not available if you have used network security groups (NSGs) or
user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such a
way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From Azure portal, open the Virtual Network the HDInsight is in. Open the network interface with
name starting with nic-gateway-0 . Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has DNS server, update the DNS record so the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . This is the recommended
approach. If you don’t have a DNS server in your Azure Virtual Network, you can temporarily work
around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all VMs that registered as
self-hosted integration runtime nodes by adding an entry like this:
10.6.0.15 myHDIClusterName.azurehdinsight.net

Create linked services


In PowerShell, deploy the linked services you authored by running the following commands:
1. Switch to the folder where you created the JSON files.
2. Run the following command to create an Azure Storage linked service.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"

3. Run the following command to create an Azure HDInsight linked service.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyHDInsightLinkedService" -File "MyHDInsightLinkedService.json"

Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined. Create a JSON file in your preferred editor, copy the following
JSON definition of a pipeline definition, and save it as MyHivePipeline.json.

{
    "name": "MyHivePipeline",
    "properties": {
        "activities": [
            {
                "name": "MyHiveActivity",
                "type": "HDInsightHive",
                "linkedServiceName": {
                    "referenceName": "MyHDInsightLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
                    "getDebugInfo": "Failure",
                    "defines": {
                        "Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
                    },
                    "scriptLinkedService": {
                        "referenceName": "MyStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            }
        ]
    }
}

Note the following points:


scriptPath points to path to Hive script on the Azure Storage Account you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
Switch to the folder where you created JSON files, and run the following command to deploy the pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MyHivePipeline.json"

Start the pipeline


1. Start a pipeline run. It also captures the pipeline run ID for future monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"
$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"
$result.Error -join "`r`n"

Here is the output of the sample run:


Pipeline run status: In Progress

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MyHivePipeline' run finished. Result:

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"logLocation": "wasbs://[email protected]/HiveQueryJobs/000000000-0000-47c3-
9b28-1cdc7f3f2ba2/18_09_2017_06_58_18_023/Status"
"clusterInUse": "https://adfv2HivePrivate.azurehdinsight.net"
"jobId": "job_1505387997356_0024"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "MySelfhostedIR"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MyHiveActivity"

3. Check the outputfolder folder for a new file created as the result of the Hive query. It should look like the following
sample output:

8 en-US SCH-i500 California


23 en-US Incredible Pennsylvania
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
246 en-US SCH-i500 District Of Columbia
246 en-US SCH-i500 District Of Columbia
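If you want to download the result files from PowerShell instead of browsing the container, the following is a minimal sketch that assumes the Az.Storage module and the same container and output folder you used for the Output parameter.

# Download every blob under outputfolder/ to the current directory.
$ctx = New-AzStorageContext -StorageAccountName "<StorageAccount>" -StorageAccountKey "<StorageAccountKey>"
Get-AzStorageBlob -Container "<Container>" -Prefix "outputfolder/" -Context $ctx |
    Get-AzStorageBlobContent -Destination "." -Context $ctx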
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run
Verify the output.
Advance to the following tutorial to learn about branching and chaining control flow in a Data Factory pipeline:
Branching and chaining Data Factory control flow
Branching and chaining activities in a Data Factory
pipeline
3/26/2019 • 10 minutes to read

In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If the
copy activity succeeds, the pipeline sends details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, the pipeline sends details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:

You perform the following steps in this tutorial:


Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a Copy activity and a Web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses the Azure portal. To use other mechanisms to interact with Azure Data Factory, see
"Quickstarts" in the table of contents.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John,Doe
Jane,Doe

2. Use a tool such as Azure Storage Explorer to do the following steps:


a. Create the adfv2branch container.
b. Create input folder in the adfv2branch container.
c. Upload input.txt file to the container.

Create email workflow endpoints


To trigger sending an email from the pipeline, you use Logic Apps to define the workflow. For details on creating a
Logic App workflow, see How to create a logic app.
Success email workflow
Create a Logic App workflow named CopySuccessEmail. Define the workflow trigger as
When an HTTP request is received, and add an action of Office 365 Outlook – Send an email.

For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
    "properties": {
        "dataFactoryName": {
            "type": "string"
        },
        "message": {
            "type": "string"
        },
        "pipelineName": {
            "type": "string"
        },
        "receiver": {
            "type": "string"
        }
    },
    "type": "object"
}

The Request in the Logic App Designer should look like the following image:

For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Fail email workflow


Follow the same steps to create another Logic Apps workflow of CopyFailEmail. In the request trigger, the
Request Body JSON schema is the same. Change the format of your email like the Subject to tailor toward a failure
email. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your failure email workflow:

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

You should now have two workflow URLs:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000
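Before wiring these URLs into the pipeline, you can smoke-test a workflow directly from PowerShell. The following is a minimal sketch, assuming Invoke-RestMethod is available; the body matches the Request Body JSON Schema above, and the values are placeholders.

# Build a test payload that matches the Logic App's request schema.
$body = @{
    dataFactoryName = "ADFTutorialDataFactory"
    message         = "Test message"
    pipelineName    = "ManualTest"
    receiver        = "<your email address>"
} | ConvertTo-Json

# Post it to one of the workflow URLs and confirm you receive the email.
Invoke-RestMethod -Method Post -Uri "<your success or fail Request Url>" -ContentType "application/json" -Body $body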

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialDataFactory for the name.

The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add the
following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.

3. In the Activities toolbox, expand Data Flow, and drag-drop Copy activity to the pipeline designer surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click + New.
You create a source dataset for the copy activity in this step.

5. In the New Dataset window, select Azure Blob Storage, and click Finish.
6. You see a new tab titled AzureBlob1. Change the name of the dataset to SourceBlobDataset.
7. Switch to the Connection tab in the Properties window, and click New for the Linked service. You create
a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure storage account for the Storage account name.
c. Click Save.
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and emp.txt for the file name. You use the
sourceBlobContainer pipeline parameter to set the folder path for the dataset.
10. Switch to the pipeline tab, or click the pipeline in the treeview. Confirm that SourceBlobDataset is selected
for Source Dataset.


11. In the Properties window, switch to the Sink tab, and click + New for Sink Dataset. You create a sink
dataset for the copy activity in this step, similar to the way you created the source dataset.

12. In the New Dataset window, select Azure Blob Storage, and click Finish.
13. In the General settings page for the dataset, enter SinkBlobDataset for Name.
14. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for LinkedService.
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the current
pipeline run for the file name. For the supported list of system variables and expressions, see System
variables and Expression language.
15. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity. The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow.

16. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL, specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.

{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}

The message body contains the following properties:


Message – Passes the value of @{activity('Copy1').output.dataWritten}. It accesses a property of
the previous copy activity and passes the value of dataWritten. For the failure case, pass the
error output instead: @{activity('Copy1').error.message}.
Data Factory Name – Passes the value of @{pipeline().DataFactory}. This is a system variable,
allowing you to access the corresponding data factory name. For a list of system variables, see the
System Variables article.
Pipeline Name – Passes the value of @{pipeline().Pipeline}. This is also a system variable,
allowing you to access the corresponding pipeline name.
Receiver – Passes the value of @pipeline().parameters.receiver, accessing the pipeline
parameters.

17. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and
dropping on the Web activity.
18. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the
name to SendFailureEmailActivity.

19. Switch to the Settings tab, and do the following steps:


a. For URL, specify URL for the logic apps workflow that sends the failure email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.
{
"message": "@{activity('Copy1').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}

20. Select the Copy activity in the pipeline designer, click the +-> button, and then select Error.

21. Drag the red button next to the Copy activity to the second Web activity SendFailureEmailActivity. You
can move the activities around so that the pipeline looks like in the following image:
22. To validate the pipeline, click the Validate button on the toolbar. Close the Pipeline Validation Output
window by clicking the >> button.

23. To publish the entities (datasets, pipelines, etc.) to the Data Factory service, select Publish All. Wait until you
see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

2. In the Pipeline Run window, do the following steps:


a. Enter adftutorial/adfv2branch/input for the sourceBlobContainer parameter.
b. Enter adftutorial/adfv2branch/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver.
d. Click Finish
Monitor the successful pipeline run
1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the list.

Trigger a pipeline run that fails


1. Switch to the Edit tab on the left.
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.
3. In the Pipeline Run window, do the following steps:
a. Enter adftutorial/dummy/input for the sourceBlobContainer parameter. Ensure that the dummy
folder does not exist in the adftutorial container.
b. Enter adftutorial/dummy/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver.
d. Click Finish.

Monitor the failed pipeline run


1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.
2. Click Error link for the pipeline run to see details about the error.

3. To view activity runs associated with this pipeline run, click the first link in the Actions column. Use the
Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed. The Web activity
succeeded in sending the failure email to the specified receiver.

4. Click Error link in the Actions column to see details about the error.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline

In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If
the copy activity succeeds, you want to send details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, you want to send details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:

You perform the following steps in this tutorial:


Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses the .NET SDK. To use other mechanisms to interact with Azure Data Factory, refer to
the "Quickstarts" section in the table of contents.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2013, 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign application to
"Contributor" role by following instructions in the same article.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt on your disk.

John|Doe
Jane|Doe

2. Use a tool such as Azure Storage Explorer to create the adfv2branch container, and to upload the input.txt
file to an input folder in that container. Alternatively, use the sketch below to create the container and upload the file from code.
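If you prefer to create the container and upload the file from code instead of Storage Explorer, the following is a minimal sketch. It assumes the WindowsAzure.Storage NuGet package, which is not part of this tutorial's package list, and uses placeholder account values that you replace with your own.

// Minimal sketch (assumes the WindowsAzure.Storage NuGet package is installed).
// Creates the adfv2branch container and uploads input.txt to its input folder.
// The using directives go at the top of the file:
//   using Microsoft.WindowsAzure.Storage;
//   using Microsoft.WindowsAzure.Storage.Blob;
string connectionString =
    "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account name>;AccountKey=<Azure Storage account key>";
CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference("adfv2branch");
container.CreateIfNotExistsAsync().Wait();

// The blob name includes the "input" folder expected by the tutorial's inputBlobPath variable.
CloudBlockBlob blob = container.GetBlockBlobReference("input/input.txt");
blob.UploadTextAsync("John|Doe\nJane|Doe").Wait();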

Create Visual Studio project


Using Visual Studio 2015/2017, create a C# .NET console application.
1. Launch Visual Studio.
2. Click File, point to New, and click Project. .NET version 4.5.2 or above is required.
3. Select Visual C# -> Console App (.NET Framework) from the list of project types on the right.
4. Enter ADFv2BranchTutorial for the Name.
5. Click OK to create the project.

Install NuGet packages


1. Click Tools -> NuGet Package Manager -> Package Manager Console.
2. In the Package Manager Console, run the following commands to install the packages. Refer to
the Microsoft.Azure.Management.DataFactory NuGet package page for details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs and include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add these static variables to the Program class. Replace place-holders with your own values. For a list of
Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can
be in other regions.

// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";

static string region = "East US";


static string dataFactoryName = "<Data factory name>";

// Specify the source Azure Blob information


static string storageAccount = "<Azure Storage account name>";
static string storageKey = "<Azure Storage account key>";
// Confirm that you have the input.txt file placed in the input folder of the adfv2branch container.
static string inputBlobPath = "adfv2branch/input";
static string inputBlobName = "input.txt";
static string outputBlobPath = "adfv2branch/output";
static string emailReceiver = "<specify email address of the receiver>";

static string storageLinkedServiceName = "AzureStorageLinkedService";


static string blobSourceDatasetName = "SourceStorageDataset";
static string blobSinkDatasetName = "SinkStorageDataset";
static string pipelineName = "Adfv2TutorialBranchCopy";

static string copyBlobActivity = "CopyBlobtoBlob";


static string sendFailEmailActivity = "SendFailEmailActivity";
static string sendSuccessEmailActivity = "SendSuccessEmailActivity";

3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Create a “CreateOrUpdateDataFactory” function in your Program.cs file:
static Factory CreateOrUpdateDataFactory(DataFactoryManagementClient client)
{
Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory resource = new Factory
{
Location = region
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));

Factory response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
System.Threading.Thread.Sleep(1000);
}
return response;
}

Add the following code to Main method that creates a data factory.

Factory df = CreateOrUpdateDataFactory(client);

Create an Azure Storage linked service


Create a “StorageLinkedServiceDefinition” function in your Program.cs file:

static LinkedServiceResource StorageLinkedServiceDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");
AzureStorageLinkedService storageLinkedService = new AzureStorageLinkedService
{
ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey)
};
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings));
LinkedServiceResource linkedService = new LinkedServiceResource(storageLinkedService,
name:storageLinkedServiceName);
return linkedService;
}

Add the following code to the Main method that creates an Azure Storage linked service. Learn more from
Azure Blob linked service properties on supported properties and details.

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,
StorageLinkedServiceDefinition(client));

Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. Learn more from Azure Blob
dataset properties on supported properties and details.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
Notice the use of parameters for the FolderPath. “sourceBlobContainer” is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>

Create a “SourceBlobDatasetDefinition” function in your Program.cs file

static DatasetResource SourceBlobDatasetDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating dataset " + blobSourceDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sourceBlobContainer" },
FileName = inputBlobName,
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name:blobSourceDatasetName);
return dataset;
}

Create a dataset for sink Azure Blob


Create a “SinkBlobDatasetDefinition” function in your Program.cs file

static DatasetResource SinkBlobDatasetDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating dataset " + blobSinkDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sinkBlobContainer" },
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name: blobSinkDatasetName);
return dataset;
}

Add the following code to the Main method that creates both Azure Blob source and sink datasets.

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName,
SourceBlobDatasetDefinition(client));

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName,
SinkBlobDatasetDefinition(client));

Create a C# class: EmailRequest


In your C# project, create a class named EmailRequest. This class defines the properties the pipeline sends in the
request body when it sends an email. In this tutorial, the pipeline sends four properties from the pipeline to the
email:
Message: body of the email. In the case of a successful copy, this property contains details of the run (amount
of data written). In the case of a failed copy, this property contains details of the error.
Data factory name: name of the data factory
Pipeline name: name of the pipeline
Receiver: Parameter that is passed through. This property specifies the receiver of the email.

class EmailRequest
{
[Newtonsoft.Json.JsonProperty(PropertyName = "message")]
public string message;

[Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
public string dataFactoryName;

[Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
public string pipelineName;

[Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
public string receiver;

public EmailRequest(string input, string df, string pipeline, string receiverName)


{
message = input;
dataFactoryName = df;
pipelineName = pipeline;
receiver = receiverName;
}
}
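
To preview the JSON that this class produces, and confirm it matches the request body schema you define in the Logic App in the next section, you can serialize an instance locally. This is an optional sketch; it assumes the Newtonsoft.Json package is available (it is pulled in as a dependency of the SDK packages installed earlier), and the sample values are illustrative only.

// Optional: preview the request body JSON produced by EmailRequest.
EmailRequest sample = new EmailRequest(
    "1024 bytes written",        // message
    "SampleDataFactory",         // dataFactoryName
    "Adfv2TutorialBranchCopy",   // pipelineName
    "someone@example.com");      // receiver
Console.WriteLine(Newtonsoft.Json.JsonConvert.SerializeObject(sample, Newtonsoft.Json.Formatting.Indented));
// Prints an object with the lowercase property names: message, dataFactoryName, pipelineName, receiver.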

Create email workflow endpoints


To trigger sending an email, you use Logic Apps to define the workflow. For details on creating a Logic App
workflow, see How to create a logic app.
Success email workflow
Create a Logic App workflow named CopySuccessEmail . Define the workflow trigger as
When an HTTP request is received , and add an action of Office 365 Outlook – Send an email .

For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}

This aligns with the EmailRequest class you created in the previous section.
Your Request should look like this in the Logic App Designer:

For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Make a note of your HTTP Post request URL for your success email workflow:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000
Fail email workflow
Clone your CopySuccessEmail workflow and create another Logic Apps workflow named CopyFailEmail. In the request
trigger, the Request Body JSON schema is the same. Simply change the format of your email, such as the Subject, to
tailor it toward a failure email. Here is an example:

Make a note of your HTTP Post request URL for your failure email workflow:

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

You should now have two workflow URLs:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Create a pipeline
Add the following code to the Main method to create a pipeline with a copy activity and dependsOn properties.
In this tutorial, the pipeline contains a copy activity, which takes in one Blob dataset as a source and
another Blob dataset as a sink, plus two Web activities. Depending on whether the copy activity succeeds or fails, the pipeline calls a different email workflow.
In this pipeline, you use the following features:
Parameters
Web Activity
Activity dependency
Using output from an activity as an input to the subsequent activity
Let’s break down the following pipeline section by section:

static PipelineResource PipelineDefinition(DataFactoryManagementClient client)


{
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource resource = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "receiver", new ParameterSpecification { Type = ParameterType.String } }

},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/manual/path
s/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manual/paths/
invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));
return resource;
}

Add the following code to the Main method that creates the pipeline:

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, PipelineDefinition(client));

Parameters
The first section of our pipeline defines parameters.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure emails to
the receiver whose email address is specified by this parameter.

Parameters = new Dictionary<string, ParameterSpecification>


{
{ "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "receiver", new ParameterSpecification { Type = ParameterType.String } }
},

Web Activity
The Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflows. You create two web activities: one that
calls the CopySuccessEmail workflow and one that calls the CopyFailEmail workflow.
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}

In the “Url” property, paste the corresponding Request URL endpoint from your Logic Apps workflows. In the “Body”
property, pass an instance of the “EmailRequest” class. The email request contains the following properties:
Message – Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten}. This expression accesses a property of the
previous copy activity and passes the value of dataWritten. For the failure case, pass the error output
@{activity('CopyBlobtoBlob').error.message} instead.
Data Factory Name – Passes the value of @{pipeline().DataFactory}. This is a system variable, allowing you to
access the corresponding data factory name. For a list of system variables, see the System Variables article.
Pipeline Name – Passes the value of @{pipeline().Pipeline}. This is also a system variable, allowing you to
access the corresponding pipeline name.
Receiver – Passes the value of @pipeline().parameters.receiver. This accesses the pipeline parameters.
This code creates a new activity dependency on the previous copy activity, with a Succeeded dependency condition.
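For the failure path, the second Web activity (already shown in the full pipeline definition above) only swaps the expression passed in the body and the dependency condition. Trimmed down, it looks like this; the Url placeholder stands in for your own Fail Request URL:

new WebActivity
{
    Name = sendFailEmailActivity,
    Method = WebActivityMethod.POST,
    Url = "<paste the Fail Request URL from your CopyFailEmail workflow>",
    // Pass the copy activity's error message instead of its dataWritten output.
    Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
        "@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
    DependsOn = new List<ActivityDependency>
    {
        new ActivityDependency
        {
            Activity = copyBlobActivity,
            // "Failed" instead of "Succeeded" routes this activity to the failure path.
            DependencyConditions = new List<String> { "Failed" }
        }
    }
}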

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Main class
Your final Main method should look like this. Build and run your program to trigger a pipeline run!
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Factory df = CreateOrUpdateDataFactory(client);

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,
StorageLinkedServiceDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName,
SourceBlobDatasetDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName,
SinkBlobDatasetDefinition(client));

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, PipelineDefinition(client));

Console.WriteLine("Creating pipeline run...");


Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Monitor a pipeline run


1. Add the following code to the Main method to continuously check the status of the pipeline run until it
finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, for example, size of the
data read/written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(


resourceGroup, dataFactoryName, runResponse.RunId, DateTime.UtcNow.AddMinutes(-10),
DateTime.UtcNow.AddMinutes(10)).ToList();

if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
//SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output,
//    client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);

Console.WriteLine("\nPress any key to exit...");


Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution. The console prints the progress of creating the data
factory, linked service, datasets, pipeline, and pipeline run. It then checks the pipeline run status. Wait until you see
the copy activity run details with the size of data read/written. Then, use a tool such as Azure Storage Explorer to check
that the blob was copied from "inputBlobPath" to "outputBlobPath" as you specified in the variables, or use the sketch below to check from code.
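If you want to verify the copied blob from code rather than with Azure Storage Explorer, here is a short hedged sketch under the same WindowsAzure.Storage assumption as the earlier upload sketch; it reuses the storageAccount and storageKey variables from the Program class:

// Optional check: list blobs under the output folder of the adfv2branch container.
// Requires the WindowsAzure.Storage package and its Microsoft.WindowsAzure.Storage(.Blob) namespaces.
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=" + storageAccount + ";AccountKey=" + storageKey);
CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference("adfv2branch");
foreach (IListBlobItem item in container.ListBlobsSegmentedAsync("output/", null).Result.Results)
{
    Console.WriteLine("Found output blob: " + item.Uri);
}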
Sample output:

Creating data factory DFTutorialTest...


{
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***"
}
}
}
Creating dataset SourceStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sourceBlobContainer"
},
"fileName": "input.txt"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating dataset SinkStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sinkBlobContainer"
}
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating pipeline Adfv2TutorialBranchCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "SourceStorageDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SinkStorageDataset"
}
],
"name": "CopyBlobtoBlob"
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendSuccessEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Succeeded"
]
}
]
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendFailEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Failed"
]
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
},
"receiver": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 00000000-0000-0000-0000-0000000000000
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 20,
"dataWritten": 20,
"copyDuration": 4,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
}
{}

Press any key to exit...

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory

This tutorial provides steps for using the Azure portal to provision an Azure-SSIS integration runtime (IR) in Azure
Data Factory. Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to
deploy and run SQL Server Integration Services (SSIS) packages in this runtime in Azure. For conceptual
information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.
In this tutorial, you complete the following steps:
Create a data factory.
Provision an Azure-SSIS integration runtime.

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. Azure Data Factory creates the SSIS Catalog (SSISDB database) on this database server. We
recommend that you create the database server in the same Azure region as the integration runtime. This
configuration lets the integration runtime write execution logs to the SSISDB database without crossing Azure
regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of an
elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network. If you
use Azure SQL Database with virtual network service endpoints/Managed Instance to host SSISDB or require
access to on-premises data, you need to join your Azure-SSIS IR to a virtual network, see Create Azure-SSIS IR
in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This is not
applicable when you use Azure SQL Database with virtual network service endpoints/Managed Instance to
host SSISDB. For more information, see Secure your Azure SQL database. To enable this setting by using
PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of client
machine, to the client IP address list in the firewall settings for the database server. For more information, see
Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials or Azure
Active Directory (AAD) authentication with the managed identity for your Azure Data Factory (ADF). For the
latter, you need to add the managed identity for your ADF into an AAD group with access permissions to the
database server; see Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of an Azure-SSIS IR does not support using an existing SSIS Catalog.
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in
Microsoft Edge and Google Chrome web browsers.
2. Sign in to the Azure portal.
3. Select New on the left menu, select Data + Analytics, and then select Data Factory.

4. On the New data factory page, enter MyAzureSsisDataFactory under Name.


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, <yourname>MyAzureSsisDataFactory) and try creating again.
For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.
Data factory name “MyAzureSsisDataFactory” is not available

5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2 (Preview).
8. For Location, select the location for the data factory. The list shows only locations that are supported for the
creation of data factories.
9. Select Pin to dashboard.
10. Select Create.
11. On the dashboard, you see the following tile with the status Deploying data factory:

12. After the creation is complete, you see the Data factory page.
13. Select Author & Monitor to open the Data Factory user interface (UI) on a separate tab.

Create an Azure-SSIS integration runtime


From the Data Factory overview
1. On the Let's get started page, select the Configure SSIS Integration Runtime tile.

2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the Authoring UI
1. In the Azure Data Factory UI, switch to the Edit tab, select Connections, and then switch to the
Integration Runtimes tab to view existing integration runtimes in your data factory.
2. Select New to create an Azure-SSIS IR.

3. In the Integration Runtime Setup window, select Lift-and-shift existing SSIS packages to execute in
Azure, and then select Next.
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.

Provision an Azure-SSIS integration runtime


1. On the General Settings page of Integration Runtime Setup, complete the following steps:
a. For Name, enter the name of your integration runtime.
b. For Description, enter the description of your integration runtime.
c. For Location, select the location of your integration runtime. Only supported locations are displayed. We
recommend that you select the same location of your database server to host SSISDB.
d. For Node Size, select the size of the nodes in your integration runtime cluster. Only supported node sizes are
displayed. Select a large node size (scale up) if you want to run many compute/memory-intensive
packages.
e. For Node Number, select the number of nodes in your integration runtime cluster. Only supported node
numbers are displayed. Select a large cluster with many nodes (scale out), if you want to run many packages
in parallel.
f. For Edition/License, select SQL Server edition/license for your integration runtime: Standard or
Enterprise. Select Enterprise, if you want to use advanced/premium features on your integration runtime.
g. For Save Money, select the Azure Hybrid Benefit (AHB) option for your integration runtime: Yes or No.
Select Yes if you want to bring your own SQL Server license with Software Assurance to benefit from cost
savings with hybrid use.
h. Click Next.
2. On the SQL Settings page, complete the following steps:

a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of
an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network.
For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL Database
single databases/elastic pools and Managed Instance. If you select Azure SQL Database with virtual
network service endpoints/Managed Instance to host SSISDB or require access to on-premises data, you
need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual network.
d. On the Use AAD authentication... checkbox, select the authentication method for your database server to
host SSISDB: SQL or Azure Active Directory (AAD) with the managed identity for your Azure Data Factory
(ADF). If you check it, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server; see Create Azure-SSIS IR with AAD authentication.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
3. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are displayed.
Select a low number if you want to use more than one core to run a single large/heavy-weight package
that is compute/memory-intensive. Select a high number if you want to run one or more small/light-weight
packages in a single core.
b. For Custom Setup Container SAS URI, optionally enter the Shared Access Signature (SAS) Uniform
Resource Identifier (URI) of the Azure Storage Blob container where your setup script and its associated
files are stored; see Custom setup for Azure-SSIS IR.
c. On the Select a VNet... checkbox, select whether you want to join your integration runtime to a virtual
network. Check it if you use Azure SQL Database with virtual network service
endpoints/Managed Instance to host SSISDB or require access to on-premises data; see Create Azure-SSIS
IR in a virtual network.
4. Click Finish to start the creation of your integration runtime.

IMPORTANT
This process takes approximately 20 to 30 minutes to complete.
The Data Factory service connects to your Azure SQL Database server to prepare the SSIS Catalog (SSISDB database).
When you provision an instance of an Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable
are also installed. These components provide connectivity to Excel and Access files and to various Azure data sources,
in addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.

5. On the Connections tab, switch to Integration Runtimes if needed. Select Refresh to refresh the status.

6. Use the links in the Actions column to stop/start, edit, or delete the integration runtime. Use the last link to
view JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR is
stopped.
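
The portal steps above can also be scripted. The next article in this document shows the PowerShell version; for readers following the .NET SDK tutorials earlier in this document, the sketch below outlines the equivalent calls against a DataFactoryManagementClient (the "client" object created there). Treat the type and member names as assumptions to verify against your installed Microsoft.Azure.Management.DataFactory version, and the values shown (IR name, node size, node count, pricing tier) as illustrative only.

// Hedged sketch: provision and start an Azure-SSIS IR with the .NET SDK.
// Verify member names against your Microsoft.Azure.Management.DataFactory version.
var ssisIr = new ManagedIntegrationRuntime
{
    ComputeProperties = new IntegrationRuntimeComputeProperties
    {
        Location = "EastUS",
        NodeSize = "Standard_D8_v3",
        NumberOfNodes = 2,
        MaxParallelExecutionsPerNode = 8
    },
    SsisProperties = new IntegrationRuntimeSsisProperties
    {
        Edition = "Standard",
        LicenseType = "LicenseIncluded",
        CatalogInfo = new IntegrationRuntimeSsisCatalogInfo
        {
            CatalogServerEndpoint = "<your server>.database.windows.net",
            CatalogAdminUserName = "<SQL admin user>",
            CatalogAdminPassword = new SecureString("<SQL admin password>"),
            CatalogPricingTier = "S1"
        }
    }
};

client.IntegrationRuntimes.CreateOrUpdate(resourceGroup, dataFactoryName, "MyAzureSsisIr",
    new IntegrationRuntimeResource(ssisIr, name: "MyAzureSsisIr"));

// Starting the IR is a long-running operation (roughly 20 to 30 minutes).
client.IntegrationRuntimes.Start(resourceGroup, dataFactoryName, "MyAzureSsisIr");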

Deploy SSIS packages


Now, use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy your SSIS
packages to Azure. Connect to your Azure SQL Database server that hosts the SSIS Catalog (SSISDB database).
The name of the Azure SQL Database server is in the format <servername>.database.windows.net.
See the following articles from the SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to the SSIS Catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication

Next steps
In this tutorial, you learned how to:
Create a data factory.
Provision an Azure-SSIS integration runtime.
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory with PowerShell

This tutorial provides steps for provisioning an Azure-SSIS integration runtime (IR) in Azure Data Factory. Then,
you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy and run SQL
Server Integration Services (SSIS) packages in this runtime in Azure. In this tutorial, you do the following steps:

NOTE
This article uses Azure PowerShell to provision an Azure SSIS IR. To use the Data Factory user interface (UI) to provision an
Azure SSIS IR, see Tutorial: Create an Azure SSIS integration runtime.

Create a data factory.


Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script

Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Azure subscription. If you don't have an Azure subscription, create a free account before you begin. For
conceptual information on Azure-SSIS IR, see Azure-SSIS integration runtime overview.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. This server hosts the SSIS Catalog database (SSISDB). We recommend that you create the
database server in the same Azure region as the integration runtime. This configuration lets the integration
runtime write execution logs to SSISDB without crossing Azure regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part
of an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual
network. For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL
Database single databases/elastic pools and Managed Instance. If you use Azure SQL Database with
virtual network service endpoints/Managed Instance to host SSISDB or require access to on-premises
data, you need to join your Azure-SSIS IR to a virtual network, see Create Azure-SSIS IR in a virtual
network.
Confirm that the "Allow access to Azure services" setting is ON for the database server. This setting is
not applicable when you use Azure SQL Database with virtual network service endpoints/Managed
Instance to host SSISDB. For more information, see Secure your Azure SQL database. To enable this
setting by using PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine or a range of IP addresses that includes the IP address of client
machine to the client IP address list in the firewall settings for the database server. For more information,
see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials or
Azure Active Directory (AAD) authentication with the managed identity for your Azure Data Factory. For
the latter, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server; see Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of Azure-SSIS IR does not support using an existing SSIS Catalog.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell. You use
PowerShell to run a script to provision an Azure-SSIS integration runtime that runs SSIS packages in the cloud.

NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.

Launch Windows PowerShell ISE


Start Windows PowerShell ISE with administrative privileges.

Create variables
Copy and paste the following script, and specify values for the variables. For a list of supported pricing tiers for Azure
SQL Database, see SQL Database resource limits.
# Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"

# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features
on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own
on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are
currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script
and its associated files are stored

# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure
that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0",
"S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>)]"

Validate the connection to database


Add the following script to validate your Azure SQL Database server, <servername>.database.windows.net .

$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" + $SSISDBServerAdminUserName +
";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed?
[Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}

To create an Azure SQL database as part of the script, see the following example:
Set values for the variables that haven't been defined already. For example: SSISDBServerName,
FirewallIPAddress.

New-AzSqlServer -ResourceGroupName $ResourceGroupName `
-ServerName $SSISDBServerName `
-Location $DataFactoryLocation `
-SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList
$SSISDBServerAdminUserName, $(ConvertTo-SecureString -String $SSISDBServerAdminPassword -AsPlainText -Force))

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
-ServerName $SSISDBServerName `
-FirewallRuleName "ClientIPAddress_$today" -StartIpAddress $FirewallIPAddress -EndIpAddress $FirewallIPAddress

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName -ServerName $SSISDBServerName -AllowAllAzureIPs

Log in and select subscription


Add the following code to the script to log in and select your Azure subscription:

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Create a resource group


Create an Azure resource group using the New-AzResourceGroup command. A resource group is a logical
container into which Azure resources are deployed and managed as a group. The following command creates a
resource group with the name and location you specified in the variables.
If your resource group already exists, don't copy this code to your script.

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Create a data factory


Run the following command to create a data factory:

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

Create an integration runtime


Run the following command to create an Azure-SSIS integration runtime that runs SSIS packages in Azure:
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogAdminCredential $serverCreds `
-CatalogPricingTier $SSISDBPricingTier

if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

Start integration runtime


Run the following command to start the Azure-SSIS integration runtime:

write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

This command takes from 20 to 30 minutes to complete.

Deploy SSIS packages


Now, use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy your SSIS
packages to Azure. Connect to your Azure SQL Database server that hosts the SSIS Catalog (SSISDB). The name of the
Azure SQL Database server is in the format <servername>.database.windows.net.
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication

Full script
The PowerShell script in this section configures an instance of Azure-SSIS integration runtime in the cloud that
runs SSIS packages. After you run this script successfully, you can deploy and run SSIS packages in the Microsoft
Azure cloud with SSISDB hosted in Azure SQL Database.
1. Launch the Windows PowerShell Integrated Scripting Environment (ISE ).
2. In the ISE, run the following command from the command prompt.

Set-ExecutionPolicy Unrestricted -Scope CurrentUser

3. Copy the PowerShell script in this section and paste it into the ISE.
4. Provide appropriate values for all parameters at the beginning of the script.
5. Run the script. The Start-AzDataFactoryV2IntegrationRuntime command near the end of the script runs for 20 to
30 minutes.

NOTE
The script connects to your Azure SQL Database server to prepare the SSIS Catalog database (SSISDB).
When you provision an instance of Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable are
also installed. These components provide connectivity to Excel and Access files and to various Azure data sources, in
addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.

For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see
ADF + SSIS IR availability by region.

# Azure Data Factory information


# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"

# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features
on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own
on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are
currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script
and its associated files are stored

# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure
that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0",
"S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>)]"

$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" + $SSISDBServerAdminUserName +
";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed?
[Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force


$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode
`
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogAdminCredential $serverCreds `
-CatalogPricingTier $SSISDBPricingTier

if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}

write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")
write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

Join Azure-SSIS IR to a virtual network


If you host SSISDB in Azure SQL Database with virtual network service endpoints, or in a Managed Instance that joins a virtual network, you must also join your Azure-SSIS integration runtime to the same virtual network. Azure Data Factory lets you join your Azure-SSIS integration runtime to a virtual network. For more information,
see Join Azure-SSIS integration runtime to a virtual network.
For a full script to create an Azure-SSIS integration runtime that joins a virtual network, see Create an Azure-SSIS
integration runtime.
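As an illustration only, here is a minimal, hedged sketch of how the virtual network settings are typically passed when reconfiguring the Azure-SSIS IR with Azure PowerShell. The $VnetId and $SubnetName values are placeholders introduced for this sketch; see the linked articles for the authoritative, full script.

# Sketch: reconfigure the Azure-SSIS IR created above to join a virtual network (placeholder values).
$VnetId = "[your virtual network resource ID]"
$SubnetName = "[your subnet name]"

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -VNetId $VnetId `
                                      -Subnet $SubnetName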

Monitor and manage Azure-SSIS IR


See the following articles for details about monitoring and managing an Azure-SSIS IR.
Monitor an Azure-SSIS integration runtime
Manage an Azure-SSIS integration runtime
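In addition to the articles above, here is a minimal, hedged sketch of checking the IR status and stopping the IR with Azure PowerShell, reusing the placeholder variables defined earlier in this tutorial.

# Sketch: query the detailed status of the Azure-SSIS IR, including its nodes.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -Status

# Sketch: stop the Azure-SSIS IR when you are not running packages, to avoid unnecessary charges.
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                       -DataFactoryName $DataFactoryName `
                                       -Name $AzureSSISName `
                                       -Force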

Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Azure PowerShell samples for Azure Data Factory
3/7/2019 • 2 minutes to read

The following table includes links to sample Azure PowerShell scripts for Azure Data Factory.

Copy data

Copy blobs from a folder to another folder in Azure Blob Storage | This PowerShell script copies blobs from a folder in Azure Blob Storage to another folder in the same Blob Storage.

Copy data from on-premises SQL Server to Azure Blob Storage | This PowerShell script copies data from an on-premises SQL Server database to Azure Blob storage.

Bulk copy | This sample PowerShell script copies data from multiple tables in an Azure SQL database to an Azure SQL data warehouse.

Incremental copy | This sample PowerShell script loads only new or updated records from a source data store to a sink data store after the initial full copy of data from the source to the sink.

Transform data

Transform data using a Spark cluster | This PowerShell script transforms data by running a program on a Spark cluster.

Lift and shift SSIS packages to Azure

Create Azure-SSIS integration runtime | This PowerShell script provisions an Azure-SSIS integration runtime that runs SQL Server Integration Services (SSIS) packages in Azure.
Pipelines and activities in Azure Data Factory
5/15/2019 • 14 minutes to read

This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. For example, a pipeline could contain a set of activities that ingest and clean
log data, and then kick off a Spark job on an HDInsight cluster to analyze the log data. The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. For example, you can deploy and schedule the pipeline instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process/transform data from the
blob storage to produce output data. Finally, use a second copy activity to copy the output data to an
Azure SQL Data Warehouse on top of which business intelligence (BI) reporting solutions are built.
Data Factory supports three types of activities: data movement activities, data transformation
activities, and control activities. An activity can take zero or more input datasets and produce one or
more output datasets. The following diagram shows the relationship between pipeline, activity, and
dataset in Data Factory:

An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity.
For more information about datasets, see the Datasets in Azure Data Factory article.

Data movement activities


Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory
supports the data stores listed in the table in this section. Data from any source can be written to any
sink. Click a data store to learn how to copy data to and from that store.

CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR

Azure Azure Blob ✓ ✓ ✓ ✓


Storage

Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)

Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen1

Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen2

Azure ✓ ✓ ✓
Database for
MariaDB

Azure ✓ ✓ ✓
Database for
MySQL

Azure ✓ ✓ ✓
Database for
PostgreSQL

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓
Database
Managed
Instance

Azure SQL ✓ ✓ ✓ ✓
Data
Warehouse

Azure Search ✓ ✓ ✓
Index

Azure Table ✓ ✓ ✓ ✓
Storage

Database Amazon ✓ ✓ ✓
Redshift

DB2 ✓ ✓ ✓

Drill (Preview) ✓ ✓ ✓

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache Impala ✓ ✓ ✓
(Preview)

Informix ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft ✓ ✓
Access

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓
(Preview)

SAP Business ✓ ✓
Warehouse
Open Hub

SAP Business ✓ ✓
Warehouse via
MDX

SAP HANA ✓ ✓ ✓

SAP Table ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

File System ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google Cloud ✓ ✓ ✓
Storage

HDFS ✓ ✓ ✓

SFTP ✓ ✓ ✓

Generic Generic HTTP ✓ ✓ ✓


protocol

Generic OData ✓ ✓ ✓

Generic ODBC ✓ ✓ ✓

Generic REST ✓ ✓ ✓

Services and Amazon ✓ ✓ ✓


apps Marketplace
Web Service
(Preview)

Common Data ✓ ✓ ✓ ✓
Service for
Apps

Concur ✓ ✓ ✓
(Preview)

Dynamics 365 ✓ ✓ ✓ ✓

Dynamics AX ✓ ✓ ✓
(Preview)

Dynamics ✓ ✓ ✓ ✓
CRM

Google ✓ ✓ ✓
AdWords
(Preview)

HubSpot ✓ ✓ ✓
(Preview)

Jira (Preview) ✓ ✓ ✓

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Office 365 ✓ ✓ ✓

Oracle Eloqua ✓ ✓ ✓
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle Service ✓ ✓ ✓
Cloud
(Preview)

Paypal ✓ ✓ ✓
(Preview)

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service Cloud

Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)

SAP Cloud for ✓ ✓ ✓ ✓


Customer
(C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

Shopify ✓ ✓ ✓
(Preview)

Square ✓ ✓ ✓
(Preview)

Web Table ✓ ✓
(HTML table)

Xero (Preview) ✓ ✓ ✓

Zoho (Preview) ✓ ✓ ✓

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.

For more information, see Copy Activity - Overview article.

Data transformation activities


Azure Data Factory supports the following transformation activities that can be added to pipelines
either individually or chained with another activity.

DATA TRANSFORMATION ACTIVITY | COMPUTE ENVIRONMENT
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Spark | HDInsight [Hadoop]
Machine Learning activities: Batch Execution and Update Resource | Azure VM
Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server
U-SQL | Azure Data Lake Analytics
Custom Code | Azure Batch
Databricks Notebook | Azure Databricks

For more information, see the data transformation activities article.


Control activities
The following control flow activities are supported:

CONTROL ACTIVITY | DESCRIPTION
Execute Pipeline Activity | Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
ForEach Activity | ForEach Activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and executes specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Web Activity | Web Activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup Activity | Lookup Activity can be used to read or look up a record, table name, or value from any external source. This output can further be referenced by succeeding activities.
Get Metadata Activity | GetMetadata activity can be used to retrieve metadata of any data in Azure Data Factory.
Until Activity | Implements a Do-Until loop that is similar to the Do-Until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity in Data Factory.
If Condition Activity | The If Condition can be used to branch based on a condition that evaluates to true or false. The If Condition activity provides the same functionality that an if statement provides in programming languages. It evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
Wait Activity | When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing with execution of subsequent activities.

Pipeline JSON
Here is how a pipeline is defined in JSON format:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
}
}
}

TAG | DESCRIPTION | TYPE | REQUIRED
name | Name of the pipeline. Specify a name that represents the action that the pipeline performs. Maximum number of characters: 140. Must start with a letter, number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\" | String | Yes
description | Specify the text describing what the pipeline is used for. | String | No
activities | The activities section can have one or more activities defined within it. See the Activity JSON section for details about the activities JSON element. | Array | Yes
parameters | The parameters section can have one or more parameters defined within the pipeline, making your pipeline flexible for reuse. | List | No

Activity JSON
The activities section can have one or more activities defined within it. There are two main types of
activities: Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the
following top-level structure:

{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}

The following table describes properties in the activity JSON definition:

TAG | DESCRIPTION | REQUIRED
name | Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 55. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\" | Yes
description | Text describing what the activity is used for. | Yes
type | Type of the activity. See the Data movement activities, Data transformation activities, and Control activities sections for different types of activities. | Yes
linkedServiceName | Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. | Yes for HDInsight Activity, Azure Machine Learning Batch Scoring Activity, and Stored Procedure Activity. No for all others.
typeProperties | Properties in the typeProperties section depend on each type of activity. To see the type properties for an activity, click the links to the activity in the previous section. | No
policy | Policies that affect the run-time behavior of the activity. This property includes timeout and retry behavior. If it is not specified, default values are used. For more information, see the Activity policy section. | No
dependsOn | This property is used to define activity dependencies, and how subsequent activities depend on previous activities. For more information, see Activity dependency. | No

Activity policy
Policies affect the run-time behavior of an activity, giving configurability options. Activity Policies are
only available for execution activities.
Activity policy JSON definition

{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity"
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}

JSON NAME | DESCRIPTION | ALLOWED VALUES | REQUIRED
timeout | Specifies the timeout for the activity to run. | Timespan | No. Default timeout is 7 days.
retry | Maximum retry attempts. | Integer | No. Default is 0.
retryIntervalInSeconds | The delay between retry attempts, in seconds. | Integer | No. Default is 30 seconds.
secureOutput | When set to true, output from the activity is considered as secure and will not be logged to monitoring. | Boolean | No. Default is false.

Control activity
Control activities have the following top-level structure:

{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}

TAG | DESCRIPTION | REQUIRED
name | Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 55. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\" | Yes
description | Text describing what the activity is used for. | Yes
type | Type of the activity. See the data movement activities, data transformation activities, and control activities sections for different types of activities. | Yes
typeProperties | Properties in the typeProperties section depend on each type of activity. To see the type properties for an activity, click the links to the activity in the previous section. | No
dependsOn | This property is used to define activity dependency, and how subsequent activities depend on previous activities. For more information, see Activity dependency. | No
Activity dependency
Activity dependency defines how subsequent activities depend on previous activities, determining the condition for whether to continue executing the next task. An activity can depend on one or more previous activities, with different dependency conditions.
The different dependency conditions are: Succeeded, Failed, Skipped, Completed.
For example, if a pipeline has Activity A -> Activity B, the different scenarios that can happen are:
Activity B has dependency condition on Activity A with succeeded: Activity B only runs if Activity
A has a final status of succeeded
Activity B has dependency condition on Activity A with failed: Activity B only runs if Activity A
has a final status of failed
Activity B has dependency condition on Activity A with completed: Activity B runs if Activity A
has a final status of succeeded or failed
Activity B has dependency condition on Activity A with skipped: Activity B runs if Activity A has a
final status of skipped. Skipped occurs in the scenario of Activity X -> Activity Y -> Activity Z,
where each activity runs only if the previous activity succeeds. If Activity X fails, then Activity Y
has a status of “Skipped” because it never executes. Similarly, Activity Z has a status of “Skipped”
as well.
Example: Activity 2 depends on the Activity 1 succeeding

{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}

Sample copy pipeline


In the following sample pipeline, there is one activity of type Copy in the activities section. In this
sample, the copy activity copies data from an Azure Blob storage to an Azure SQL database.

{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"policy": {
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to Copy.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
See Datasets article for defining datasets in JSON.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. In the data movement activities section, click the data store that you
want to use as a source or a sink to learn more about moving data to/from that data store.
For a complete walkthrough of creating this pipeline, see Quickstart: create a data factory.
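As a hedged aside, a pipeline definition like the one above is typically deployed from a local JSON file with Azure PowerShell. In this minimal sketch, the file name and the resource group and data factory names are placeholders; the quickstart linked above shows the full walkthrough.

# Sketch: deploy the sample pipeline from a local JSON definition file (placeholder names).
Set-AzDataFactoryV2Pipeline -ResourceGroupName "<resource group name>" `
                            -DataFactoryName "<data factory name>" `
                            -Name "CopyPipeline" `
                            -DefinitionFile ".\CopyPipeline.json"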

Sample transformation pipeline


In the following sample pipeline, there is one activity of type HDInsightHive in the activities
section. In this sample, the HDInsight Hive activity transforms data from an Azure Blob storage by
running a Hive script file on an Azure HDInsight Hadoop cluster.
{
"name": "TransformPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"retry": 3
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to HDInsightHive.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by scriptLinkedService, called AzureStorageLinkedService), in the script folder in the container adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).

The typeProperties section is different for each transformation activity. To learn about the type properties supported for a transformation activity, click the transformation activity in the Data transformation activities section.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.

Multiple activities in a pipeline


The previous two sample pipelines have only one activity in them. You can have more than one
activity in a pipeline. If you have multiple activities in a pipeline and subsequent activities are not
dependent on previous activities, the activities may run in parallel.
You can chain two activities by using activity dependency, which defines how subsequent activities
depend on previous activities, determining the condition whether to continue executing the next task.
An activity can depend on one or more previous activities with different dependency conditions.
Scheduling pipelines
Pipelines are scheduled by triggers. There are different types of triggers (scheduler trigger, which
allows pipelines to be triggered on a wall-clock schedule, as well as manual trigger, which triggers
pipelines on-demand). For more information about triggers, see pipeline execution and triggers
article.
To have your trigger kick off a pipeline run, you must include a pipeline reference to the particular pipeline in the trigger definition. Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple pipelines. Once the trigger is defined, you must start the trigger to have it start triggering the pipeline. For more information about triggers, see the pipeline execution and triggers article.
For example, say you have a scheduler trigger, "Trigger A," that you want to use to kick off your pipeline, "MyCopyPipeline." You define the trigger as shown in the following example:
Trigger A definition

{
    "name": "TriggerA",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            ...
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyCopyPipeline"
            },
            "parameters": {
                "copySourceName": "FileSource"
            }
        }
    }
}
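As a hedged illustration (not part of the original example), a trigger definition like the one above would typically be deployed and started with Azure PowerShell. In this minimal sketch, the local file name and the resource group and data factory names are placeholders.

# Sketch: deploy the trigger from a local JSON definition file, then start it (placeholder names).
Set-AzDataFactoryV2Trigger -ResourceGroupName "<resource group name>" `
                           -DataFactoryName "<data factory name>" `
                           -Name "TriggerA" `
                           -DefinitionFile ".\TriggerA.json"

# A trigger does not fire until it is started.
Start-AzDataFactoryV2Trigger -ResourceGroupName "<resource group name>" `
                             -DataFactoryName "<data factory name>" `
                             -Name "TriggerA"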

Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
Linked services in Azure Data Factory
4/28/2019 • 4 minutes to read

This article describes what linked services are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy
activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity
that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data.
Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of
which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked
data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the
folder within that Azure storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure Storage
linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to
connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob
container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies
the SQL table in your SQL database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

Linked service JSON


A linked service in Data Factory is defined in JSON format as follows:

{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

The following table describes properties in the above JSON:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the linked service. See Azure Data Factory - Naming rules. | Yes
type | Type of the linked service. For example: AzureStorage (data store) or AzureBatch (compute). See the description for typeProperties. | Yes
typeProperties | The type properties are different for each data store or compute. For the supported data store types and their type properties, see the dataset type table in this article. Navigate to the data store connector article to learn about type properties specific to a data store. For the supported compute types and their type properties, see Compute linked services. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, the default Azure Integration Runtime is used. | No

Linked service example


The following linked service is an Azure Storage linked service. Notice that the type is set to AzureStorage. The
type properties for the Azure Storage linked service include a connection string. The Data Factory service uses this
connection string to connect to the data store at runtime.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Create linked services


You can create linked services by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, and the Azure portal.
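For instance, here is a minimal, hedged PowerShell sketch of deploying the Azure Storage linked service shown above, assuming its JSON is saved locally as AzureStorageLinkedService.json; the resource group and data factory names are placeholders.

# Sketch: deploy the linked service from a local JSON definition file (placeholder names).
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<resource group name>" `
                                 -DataFactoryName "<data factory name>" `
                                 -Name "AzureStorageLinkedService" `
                                 -DefinitionFile ".\AzureStorageLinkedService.json"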

Data store linked services


The data stores you can connect to are listed in supported data stores and formats. Reference that list for the specific connection properties needed for each store.

Compute linked services


See compute environments supported for details about the different compute environments you can connect to from your data factory, as well as their configurations.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory
4/28/2019 • 11 minutes to read

This article describes what datasets are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. Now, a dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in
Blob storage from which the activity should read the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed for
Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of
the data within the linked data stores, and the linked service defines the connection to the data source.
For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob
dataset represents the blob container and the folder within that Azure storage account that contains the
input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which
refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL
Database linked service). The Azure Storage and Azure SQL Database linked services contain
connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL
Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains
the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL
database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in
Data Factory:

Dataset JSON
A dataset in Data Factory is defined in the following JSON format:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}

The following table describes properties in the above JSON:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the dataset. See Azure Data Factory - Naming rules. | Yes
type | Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset types. | Yes
structure | Schema of the dataset. For details, see Dataset schema. | No
typeProperties | The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type. | Yes

Data flow compatible dataset

NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer
SLA provisions.

See supported dataset types for a list of dataset types that are Data Flow compatible. Datasets that are
compatible for Data Flow require fine-grained dataset definitions for transformations. Thus, the JSON
definition is slightly different. Instead of a structure property, datasets that are Data Flow compatible
have a schema property.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data
schemas. If your data has no schema, you can use schema drift for your source and sink. The schema in
the dataset represents the physical data type and shape.
By defining the schema from the dataset, you'll get the related data types, data formats, file location, and
connection information from the associated Linked service. Metadata from the datasets appears in your
source transformation as the source projection. The projection in the source transformation represents
the Data Flow data with defined names and types.
When you import the schema of a Data Flow dataset, select the Import Schema button and choose to
import from the source or from a local file. In most cases, you'll import the schema directly from the
source. But if you already have a local schema file (a Parquet file or CSV with headers), you can direct
Data Factory to base the schema on that file.

{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}

The following table describes properties in the above JSON:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the dataset. See Azure Data Factory - Naming rules. | Yes
type | Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset types. | Yes
schema | Schema of the dataset. For details, see Data Flow compatible datasets. | No
typeProperties | The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type. | Yes

Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "MyAzureSqlLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties":
{
"tableName": "MyTable"
},
}
}

Note the following points:


type is set to AzureSqlTable.
tableName type property (specific to AzureSqlTable type) is set to MyTable.
linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in the next
JSON snippet.

Dataset type
There are many different types of datasets, depending on the data store you use. See the following table
for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked
service and a dataset for that data store.

CATEGORY | DATA STORE | SUPPORTED AS A COPY ACTIVITY SOURCE | SUPPORTED AS A COPY ACTIVITY SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR | SUPPORTED BY DATA FLOW

Azure Azure Blob ✓ ✓ ✓ ✓ ✓


Storage Supported
Formats:
Delimited Text,
Parquet

Azure ✓ ✓ ✓ ✓
Cosmos DB
(SQL API)

Azure ✓ ✓ ✓ ✓
Cosmos
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data ✓ ✓ ✓ ✓ ✓
Lake Supported
Storage Formats:
Gen1 Delimited Text,
Parquet

Azure Data ✓ ✓ ✓ ✓ ✓
Lake Supported
Storage Formats:
Gen2 Delimited Text,
Parquet

Azure ✓ ✓ ✓
Database for
MariaDB

Azure ✓ ✓ ✓
Database for
MySQL

Azure ✓ ✓ ✓
Database for
PostgreSQL

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓
Database
Managed
Instance

Azure SQL ✓ ✓ ✓ ✓ ✓
Data
Warehouse

Azure ✓ ✓ ✓
Search Index

Azure Table ✓ ✓ ✓ ✓
Storage

Database Amazon ✓ ✓ ✓
Redshift

DB2 ✓ ✓ ✓

Drill ✓ ✓ ✓
(Preview)

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache ✓ ✓ ✓
Impala
(Preview)

Informix ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft ✓ ✓
Access

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓
(Preview)

SAP ✓ ✓
Business
Warehouse
Open Hub

SAP ✓ ✓
Business
Warehouse
via MDX

SAP HANA ✓ ✓ ✓

SAP Table ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

File System ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google ✓ ✓ ✓
Cloud
Storage

HDFS ✓ ✓ ✓

SFTP ✓ ✓ ✓

Generic Generic ✓ ✓ ✓
protocol HTTP

Generic ✓ ✓ ✓
OData

Generic ✓ ✓ ✓
ODBC

Generic ✓ ✓ ✓
REST

Services Amazon ✓ ✓ ✓
and apps Marketplace
Web Service
(Preview)

Common ✓ ✓ ✓ ✓
Data Service
for Apps

Concur ✓ ✓ ✓
(Preview)

Dynamics ✓ ✓ ✓ ✓
365

Dynamics ✓ ✓ ✓
AX (Preview)

Dynamics ✓ ✓ ✓ ✓
CRM

Google ✓ ✓ ✓
AdWords
(Preview)

HubSpot ✓ ✓ ✓
(Preview)

Jira ✓ ✓ ✓
(Preview)

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Office 365 ✓ ✓ ✓

Oracle ✓ ✓ ✓
Eloqua
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle ✓ ✓ ✓
Service
Cloud
(Preview)

Paypal ✓ ✓ ✓
(Preview)

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service
Cloud

Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)

SAP Cloud ✓ ✓ ✓ ✓
for
Customer
(C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

Shopify ✓ ✓ ✓
(Preview)

Square ✓ ✓ ✓
(Preview)

Web Table ✓ ✓
(HTML
table)

Xero ✓ ✓ ✓
(Preview)

Zoho ✓ ✓ ✓
(Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an
Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference",
},

"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
}
}
}

Dataset structure or schema


The structure section, or the schema (Data Flow compatible) section, of a dataset is optional. It defines the schema of the dataset as a collection of column names and data types. You use the structure section to provide the type information that is used to convert types and map columns from the source to the destination.
Each column in the structure contains the following properties:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the column. | Yes
type | Data type of the column. Data Factory supports the following interim data types as allowed values: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime, Datetimeoffset, and Timespan. | No
culture | .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset. The default is en-us. | No
format | Format string to be used when the type is a .NET type: Datetime or Datetimeoffset. Refer to Custom Date and Time Format Strings on how to format datetime. | No

Example
In the following example, suppose the source Blob data is in CSV format and contains three columns:
userid, name, and lastlogindate. They are of type Int64, String, and Datetime with a custom datetime
format using abbreviated French names for day of the week.
Define the Blob dataset structure as follows along with type definitions for the columns:
"structure":
[
{ "name": "userid", "type": "Int64"},
{ "name": "name", "type": "String"},
{ "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
]

Guidance
The following guidelines help you understand when to include structure information, and what to
include in the structure section. Learn more on how data factory maps source data to sink and when to
specify structure information from Schema and type mapping.
For strong schema data sources, specify the structure section only if you want to map source columns to sink columns, and their names are not the same. This kind of structured data source stores data schema and type information along with the data itself. Examples of structured data sources include SQL Server, Oracle, and Azure SQL Database.

Because type information is already available for structured data sources, you should not include type information when you do include the structure section.
For no/weak schema data sources, e.g. a text file in blob storage, include structure when the dataset is an input for a copy activity and the data types of the source dataset should be converted to native types for the sink. Also include structure when you want to map source columns to sink columns.

Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, and the Azure portal.
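For instance, here is a minimal, hedged PowerShell sketch of deploying the DatasetSample dataset from earlier in this article, assuming its JSON is saved locally as DatasetSample.json; the resource group and data factory names are placeholders.

# Sketch: deploy a dataset from a local JSON definition file (placeholder names).
Set-AzDataFactoryV2Dataset -ResourceGroupName "<resource group name>" `
                           -DataFactoryName "<data factory name>" `
                           -Name "DatasetSample" `
                           -DefinitionFile ".\DatasetSample.json"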

Current version vs. version 1 datasets


Here are some differences between Data Factory and Data Factory version 1 datasets:
The external property is not supported in the current version. It's replaced by a trigger.
The policy and availability properties are not supported in the current version. The start time for a
pipeline depends on triggers.
Scoped datasets (datasets defined in a pipeline) are not supported in the current version.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one
of these tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory
3/5/2019 • 15 minutes to read

A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a
pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the
pipeline, or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely
defines that particular pipeline run.
Pipeline runs are typically instantiated by passing arguments to parameters that you define in the pipeline. You
can execute a pipeline either manually or by using a trigger. This article provides details about both ways of
executing a pipeline.
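As a hedged aside, here is a minimal sketch of looking up a specific pipeline run by its run ID with Azure PowerShell; the GUID and the resource group and data factory names below are placeholders.

# Sketch: retrieve the details of a single pipeline run by its run ID (placeholder values).
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<resource group name>" `
                               -DataFactoryName "<data factory name>" `
                               -PipelineRunId "<pipeline run ID (GUID)>"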

Manual execution (on-demand)


The manual execution of a pipeline is also referred to as on-demand execution.
For example, say you have a basic pipeline named copyPipeline that you want to execute. The pipeline has a
single activity that copies from an Azure Blob storage source folder to a destination folder in the same storage.
The following JSON definition shows this sample pipeline:
{
"name": "copyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "CopyBlobtoBlob",
"inputs": [
{
"referenceName": "sourceBlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sinkBlobDataset",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
}
}
}
}

In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer.
You pass values to these parameters at runtime.
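For example, here is a minimal, hedged sketch of passing values for these two parameters when invoking the pipeline with the Az.DataFactory PowerShell module; the resource group and data factory names are placeholders, and the article walks through the supported methods below.

# Sketch: pass parameter values at run time (placeholder names).
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resource group name>" `
                               -DataFactoryName "<data factory name>" `
                               -PipelineName "copyPipeline" `
                               -Parameter @{ "sourceBlobContainer" = "source"; "sinkBlobContainer" = "sink" }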
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module