Data Factory, Data Integration
NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.
Azure Data Factory is the platform for these kinds of data integration scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform than a traditional
Extract-Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by
adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.
Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive
query on an HDInsight cluster to partition the data. The benefit of this is that you can manage the activities
as a set instead of managing each one individually. For example, you can deploy and schedule the pipeline
instead of scheduling the activities independently.
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example, you
can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity.
A Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory
supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory supports
the following data stores:
| CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
The Stored Procedure activity runs on Azure SQL, Azure SQL Data Warehouse, or SQL Server.
Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores. It also lets you process data by using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both
programmatic and UI mechanisms.
Data Factory is available in only West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, then
a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For example, suppose that your compute environments, such as an Azure HDInsight cluster and Azure
Machine Learning, are located in the West Europe region. You can create and use an Azure Data Factory instance in
North Europe and use it to schedule jobs on your compute environments in West Europe. It takes a few
milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on
your compute environment does not change.
| TUTORIAL | DESCRIPTION |
| Move data between two cloud data stores | Create a data factory with a pipeline that moves data from blob storage to a SQL database. |
| Transform data by using a Hadoop cluster | Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster. |
| Move data between an on-premises data store and a cloud data store by using Data Management Gateway | Build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. |
Introduction to Azure Data Factory
In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful
insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of
raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for
these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration
projects.
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in
the cloud. The company wants to analyze these logs to gain insights into customer preferences,
demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop
compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. The company wants
to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud
data store.
To extract insights, the company hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight)
and publish the transformed data into a cloud data warehouse, such as Azure SQL Data Warehouse, so that it can easily
build a report on top of it. The company wants to automate this workflow and to monitor and manage it on a daily
schedule. It also wants to execute the workflow when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service
that allows you to create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven
workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the
data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and
Azure Machine Learning.
Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business
intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized
into meaningful data stores and data lakes for better business decisions.
How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically perform four steps: connect and collect data
from disparate sources, transform and enrich the data by using compute services, publish the results to a data store
for business consumption, and monitor the pipeline and activity runs.
Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components. These components work together to provide the platform on
which you can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group
of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to
partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each
one individually. The activities in a pipeline can be chained together to operate sequentially, or they can
operate independently in parallel.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data
from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on
an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities:
data movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you
want to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to
the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked
service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob
dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, an on-premises SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the
copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
There are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within
the trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
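As an illustration, the following PowerShell sketch (the factory, pipeline, and parameter names are hypothetical) supplies
arguments for two pipeline parameters from a small JSON file when a run is started manually; it uses only the
Invoke-AzDataFactoryV2Pipeline form shown in the quickstart later in this document:
# Hypothetical names; the JSON file supplies name/value pairs for the pipeline's parameters.
@'
{ "inputPath": "adftutorial/input", "outputPath": "adftutorial/output" }
'@ | Set-Content -Path '.\RunParameters.json'
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -PipelineName "MyPipeline" `
    -ParameterFile '.\RunParameters.json'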
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
For more information about Data Factory concepts, see the following articles:
Dataset and linked services
Pipelines and activities
Integration runtime
Supported regions
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region.
However, a data factory can access data stores and compute services in other Azure regions to move data
between data stores or process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores and the processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows by using
both programmatic and UI mechanisms.
Although Data Factory is available only in certain regions, the service that powers the data movement in Data
Factory is available globally in several regions. If a data store is behind a firewall, then a Self-hosted
Integration Runtime that's installed in your on-premises environment moves the data instead.
For example, suppose that your compute environments, such as an Azure HDInsight cluster and Azure
Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory
instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe.
It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for
running the job on your compute environment does not change.
Accessibility
The Data Factory user experience in the Azure portal is accessible.
Compare with version 1
For a list of differences between version 1 and the current version of the Data Factory service, see Compare
with version 1.
Next steps
Get started with creating a Data Factory pipeline by using one of the following tools/SDKs:
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
Compare Azure Data Factory with Data Factory
version 1
This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see
Introduction to Data Factory. For an introduction to Data Factory version 1, see Introduction to Azure Data Factory.
Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.
Datasets
Data Factory V1: A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data.
Current version: Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets.

Linked services
Data Factory V1: Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources.
Current version: Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage.

Pipelines
Data Factory V1: A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines.
Current version: Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as “workflow units” that you schedule separately via triggers.

Activities
Data Factory V1: Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported.
Current version: In the current version of Data Factory, activities are still defined actions within a pipeline. The current version of Data Factory introduces new control flow activities. You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version.

Hybrid data movement and activity dispatch
Data Factory V1: Data Management Gateway (now called Integration Runtime) supported moving data between on-premises and cloud.
Current version: Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1.

Expressions
Data Factory V1: Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties.
Current version: In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory.
The following sections provide more information about the capabilities of the current version.
Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
Branching activities
In the current version, you can branch activities within a pipeline. The If-condition activity provides the same
functionality that an if statement provides in programming languages. It evaluates a set of activities when the
condition evaluates to true and another set of activities when the condition evaluates to false. For examples of
branching activities, see the Branching and chaining activities tutorial.
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand
or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more information, see
Pipelines and triggers.
Custom state passing
Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value . By using this feature, you can build workflows where values
can pass through activities.
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
Trigger-based flows
Pipelines can be triggered on demand (for example, by an event such as a blob being posted) or by wall-clock time.
The Pipelines and triggers article has detailed information about triggers.
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is “delta loads,” in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and control
flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial: Incremental
copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.
| CONTROL ACTIVITY | DESCRIPTION |
| ForEach activity | Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages. |
| Web activity | Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity. |
| Lookup activity | Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities. |
| Get metadata activity | Retrieves the metadata of any data in Azure Data Factory. |
Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have “windows” of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.
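As a concrete sketch of this model (the pipeline name, parameter names, and recurrence values below are hypothetical
examples, not part of the original article), you can describe a schedule trigger in a JSON file and register it with the Az
PowerShell cmdlets; the trigger passes arguments for the pipeline's parameters on each run:
# Hypothetical trigger definition: runs MyPipeline once a day and passes two parameters.
@'
{
    "properties": {
        "name": "DailyTrigger",
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2019-04-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "MyPipeline"
                },
                "parameters": {
                    "inputPath": "adftutorial/input",
                    "outputPath": "adftutorial/output"
                }
            }
        ]
    }
}
'@ | Set-Content -Path '.\DailyTrigger.json'

# Register the trigger with the data factory, then start it; a trigger does not fire until it is started.
Set-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "DailyTrigger" -DefinitionFile '.\DailyTrigger.json'
Start-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "DailyTrigger"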
Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code in
.NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.
SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and monitor
pipelines.
.NET SDK: The .NET SDK is updated in the current version.
PowerShell: The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK: This SDK is new in the current version.
REST API: The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.
Authoring experience
Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.
Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts: PowerShell,
.NET, Python, REST API.
Quickstart: Create a data factory by using the
Azure Data Factory UI
This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory.
The pipeline that you create in this data factory copies data from one folder to another folder in Azure
Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial:
Transform data by using Spark.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2.
8. For Location, select the location for the data factory.
The list shows only the locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. The associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
9. Select Create.
10. After the creation is complete, you see the Data Factory page. Select the Author & Monitor
tile to start the Azure Data Factory user interface (UI) application on a separate tab.
11. On the Let's get started page, switch to the Author tab in the left panel.
Create a linked service
In this procedure, you create a linked service to link your Azure storage account to the data factory. The
linked service has the connection information that the Data Factory service uses at runtime to connect
to it.
1. Select Connections, and then select the New button on the toolbar.
2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.
3. Complete the following steps:
a. For Name, enter AzureStorageLinkedService.
b. For Storage account name, select the name of your Azure storage account.
c. Select Test connection to confirm that the Data Factory service can connect to the storage
account.
d. Select Finish to save the linked service.
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of
type AzureBlob. They refer to the Azure Storage linked service that you created in the previous
section.
The input dataset represents the source data in the input folder. In the input dataset definition, you
specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contains the
source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition,
you specify the blob container (adftutorial), the folder (output), and the file to which the data is
copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the
system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the
pipeline.
In the linked service settings, you specified the Azure storage account that contains the source data. In
the source dataset settings, you specify where exactly the source data resides (blob container, folder,
and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and
file).
1. Select the + (plus) button, and then select Dataset.
2. On the New Dataset page, select Azure Blob Storage, and then select Finish.
3. In the General tab for the dataset, enter InputDataset for Name.
4. Switch to the Connection tab and complete the following steps:
a. For Linked service, select AzureStorageLinkedService.
b. For File path, select the Browse button.
c. In the Choose a file or folder window, browse to the input folder in the adftutorial
container, select the emp.txt file, and then select Finish.
d. (optional) Select Preview data to preview the data in the emp.txt file.
5. Repeat the steps to create the output dataset:
a. Select the + (plus) button, and then select Dataset.
b. On the New Dataset page, select Azure Blob Storage, and then select Finish.
c. In the General tab, enter OutputDataset for Name.
d. In the Connection tab, select AzureStorageLinkedService as the linked service, and enter
adftutorial/output in the directory field for the folder. If the output folder does not exist, the
copy activity creates it at runtime.
Create a pipeline
In this procedure, you create and validate a pipeline with a copy activity that uses the input and output
datasets. The copy activity copies data from the file you specified in the input dataset settings to the file
you specified in the output dataset settings. If the input dataset specifies only a folder (not the file
name), the copy activity copies all the files in the source folder to the destination.
1. Select the + (plus) button, and then select Pipeline.
2. To trigger the pipeline manually, select Trigger on the pipeline toolbar, and then select Trigger
Now.
3. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.
5. On the New Trigger page, select the Activated check box, and then select Next.
6. Review the warning message, and select Finish.
10. Confirm that an output file is created for every pipeline run until the specified end date and time
in the output folder.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy
data
In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to
create a pipeline that copies data from a folder in Azure Blob storage to another folder.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Create a data factory
1. Select New on the left menu, select Data + Analytics, and then select Data Factory.
3. For Subscription, select your Azure subscription in which you want to create the data factory.
4. For Resource Group, use one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
5. For Version, select V2.
6. For Location, select the location for the data factory.
The list shows only the locations that Data Factory supports, and where your Azure Data Factory
metadata will be stored. The associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in
other regions.
7. Select Create.
8. After the creation is complete, you see the Data Factory page. Select the Author & Monitor tile
to start the Azure Data Factory user interface (UI) application on a separate tab.
c. On the Specify the Azure Blob storage account page, select your storage account from the
Storage account name list, and then select Finish.
d. Select the newly created linked service as source, then click Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adftutorial/input folder, select the emp.txt file, then click
Choose.
d. Check the Binary copy option to copy the file as-is, and then select Next.
5. On the Destination data store page, select the Azure Blob Storage linked service you just
created, and then select Next.
6. On the Choose the output file or folder page, enter adftutorial/output for the folder path,
then select Next.
10. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select
Refresh to refresh the list.
11. Select the View Activity Runs link in the Actions column. The pipeline has only one activity of
type Copy.
12. To view details about the copy operation, select the Details (eyeglasses image) link in the
Actions column. For details about the properties, see Copy Activity overview.
13. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the
output folder does not exist, the Data Factory service automatically creates it.
14. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked
services, datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a
data factory by using the Azure portal.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To
learn about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure data factory using
PowerShell
This quickstart describes how to use PowerShell to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not
already exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
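Because the rest of this quickstart uses PowerShell, you can optionally script the container creation and upload instead of
using the portal. The following is a hedged sketch with the Az.Storage cmdlets; replace the placeholders with your storage
account name and the key you copied earlier:
# <yourStorageAccountName> and <yourStorageAccountKey> are placeholders for your own values.
$ctx = New-AzStorageContext -StorageAccountName "<yourStorageAccountName>" `
    -StorageAccountKey "<yourStorageAccountKey>"
# Create the adftutorial container and upload emp.txt into its input folder.
New-AzStorageContainer -Name "adftutorial" -Context $ctx
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" `
    -Container "adftutorial" -Blob "input/emp.txt" -Context $ctx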
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If
you close and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use
to sign in to the Azure portal:
Connect-AzAccount
3. Run the following command to view all the subscriptions for this account:
Get-AzSubscription
4. If you see multiple subscriptions associated with your account, run the following command to
select the subscription that you want to work with. Replace SubscriptionId with the ID of your
Azure subscription:
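A sketch of that command with the Az module:
Select-AzSubscription -SubscriptionId "<SubscriptionId>"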
$resourceGroupName = "ADFQuickStartRG";
If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
2. To create the Azure resource group, run the following command:
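A sketch of that command; East US is an example location (any supported region works), and the result is captured
in the $ResGrp variable that later steps reference:
# Create the resource group; the location here is an example value.
$ResGrp = New-AzResourceGroup -Name $resourceGroupName -Location "East US"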
If the resource group already exists, you may not want to overwrite it. Assign a different value
to the $ResourceGroupName variable and run the command again.
3. Define a variable for the data factory name.
IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.
$dataFactoryName = "ADFQuickStartFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location
and ResourceGroupName property from the $ResGrp variable:
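A sketch of that call, using the $ResGrp and $dataFactoryName variables defined in the earlier steps:
# Create the data factory in the resource group's region; the name must be globally unique.
$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location -Name $dataFactoryName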
To create Data Factory instances, the user account you use to log in to Azure must be a
member of contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory:
Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before
saving the file.
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net",
                "type": "SecureString"
            }
        }
    }
}
If you are using Notepad, select All files for the Save as type field in the Save as dialog box.
Otherwise, it may add a .txt extension to the file. For example,
AzureStorageLinkedService.json.txt . If you create the file in File Explorer before opening it in
Notepad, you may not see the .txt extension because the Hide extensions for known file
types option is set by default. Remove the .txt extension before proceeding to the next step.
2. In PowerShell, switch to the ADFv2QuickStartPSH folder.
Set-Location 'C:\ADFv2QuickStartPSH'
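To create the linked service from the JSON file, a Set-AzDataFactoryV2LinkedService call along these lines (a sketch
using the $DataFactory and $ResGrp variables from the earlier steps) returns output similar to the following:
# Create the AzureStorageLinkedService linked service from the definition file.
Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"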
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
Create a dataset
In this step, you define a dataset that represents the data to copy from a source to a sink. The dataset
is of type AzureBlob. It refers to the Azure Storage linked service you created in the previous step.
It takes a parameter to construct the folderPath property. For an input dataset, the copy activity in
the pipeline passes the input path as a value for this parameter. Similarly, for an output dataset, the
copy activity passes the output path as a value for this parameter.
1. Create a JSON file named BlobDataset.json in the C:\ADFv2QuickStartPSH folder, with
the following content:
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{dataset().path}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
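To register the dataset from this file, a Set-AzDataFactoryV2Dataset call of roughly this shape (a sketch reusing the
variables from the earlier steps) produces output similar to the listing that follows:
# Create the BlobDataset dataset from the definition file.
Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "BlobDataset" `
    -DefinitionFile ".\BlobDataset.json"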
DatasetName : BlobDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
Create a pipeline
In this quickstart, you create a pipeline with one activity that takes two parameters: the input blob path
and the output blob path. The values for these parameters are set when the pipeline is triggered or run. The
copy activity uses the same blob dataset that you created in the previous step as both input and output. When the
dataset is used as an input dataset, the input path is specified; when it's used as an output dataset, the output
path is specified.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH
folder with the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"
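The pipeline run in the next step reads its parameter values from a PipelineParameters.json file. A minimal sketch
of how to create that file, assuming the adftutorial input and output folders used throughout this quickstart:
# Write the parameter values that the pipeline run will consume.
@'
{
    "inputPath": "adftutorial/input",
    "outputPath": "adftutorial/output"
}
'@ | Set-Content -Path '.\PipelineParameters.json'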
2. Run the Invoke-AzDataFactoryV2Pipeline cmdlet to create a pipeline run and pass in the
parameter values. The cmdlet returns the pipeline run ID for future monitoring.
$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name `
-ParameterFile .\PipelineParameters.json
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId
if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}
Start-Sleep -Seconds 10
}
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
RunId : 0000000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 9/28/2017 8:28:38 PM
Parameters : {[inputPath, adftutorial/input], [outputPath, adftutorial/output]}
RunStart : 9/28/2017 8:28:14 PM
RunEnd : 9/28/2017 8:28:38 PM
DurationInMs : 24151
Status : Succeeded
Message :
"connectionString": {
"value":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccountname;AccountKey=mystorag
eaccountkey;EndpointSuffix=core.windows.net",
"type": "SecureString"
}
e. Recreate the linked service by following steps in the Create a linked service section.
f. Rerun the pipeline by following steps in the Create a pipeline run section.
g. Run the current monitoring command again to monitor the new pipeline run.
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.
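A sketch of such a script, assuming the $DataFactory, $ResGrp, and $RunId variables from the earlier steps:
# Retrieve the activity runs for this pipeline run within a time window around now.
$Result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddMinutes(-30) `
    -RunStartedBefore (Get-Date).AddMinutes(30)

# Print the details, including the dataRead/dataWritten values reported in the Output section.
Write-Output "Activity run details:"
$Result
Write-Output "Activity 'Output' section:"
$Result.Output -join "`r`n"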
3. Confirm that you see the output similar to the following sample output of activity run result:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : SPTestFactory0928
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, copyDuration, throughput...}
LinkedServiceName :
ActivityRunStart : 9/28/2017 8:28:18 PM
ActivityRunEnd : 9/28/2017 8:28:36 PM
DurationInMs : 18095
Status : Succeeded
Error : {errorCode, message, failureType, target}
Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:
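A sketch of that command, reusing the variables defined earlier in this quickstart:
# Delete only the data factory; the resource group and storage account remain.
Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $ResGrp.ResourceGroupName -Force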
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using .NET SDK
This quickstart describes how to use .NET SDK to create an Azure data factory. The pipeline you
create in this data factory copies data from one folder to another folder in an Azure blob storage. For
a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform data using
Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the
Azure Data Factory service, see Introduction to Azure Data Factory.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member
of the contributor or owner role, or an administrator of the Azure subscription. To view the
permissions that you have in the subscription, in the Azure portal, select your username in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account,
see Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and
password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values for the Storage account name and key1 boxes to the clipboard. Paste them
into Notepad or any other editor and save it. You use them later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a
folder named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the
Files box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Visual Studio
The walkthrough in this article uses Visual Studio 2017. You can also use Visual Studio 2013 or 2015.
Azure .NET SDK
Download and install Azure .NET SDK on your machine.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets the variables. Replace the placeholders
with your own values. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL
Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "East US 2";
string dataFactoryName = "<specify the name of data factory to create. It must be globally
unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// Specify the container and input folder from which all files need to be copied to the output folder.
string inputBlobPath = "<the path to existing blob(s) to copy data from, e.g. containername/foldername>";
// Specify the container and output folder where the files are copied.
string outputBlobPath = "<the blob path to copy data to, e.g. containername/foldername>";
3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked
service, datasets, and a pipeline. You also use this object to monitor the pipeline run details.
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/",
cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };
Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, the Blob dataset refers to the Azure Storage linked service that you created in the previous step. The dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath" value that points to where the data resides.
}
}
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, the pipeline contains one activity and takes two parameters: the input blob path and the output blob path. The values for these parameters are set when the pipeline is triggered or run. The copy activity refers to the same blob dataset created in the previous step as both its input and its output. When the dataset is used as an input dataset, the input path is specified; when it is used as an output dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));
2. Add the following code to the Main method that retrieves copy activity run details, for
example, size of the data read/written.
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob
storage. Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline
using Python
3/6/2019 • 9 minutes to read
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use Python to create an Azure data factory. The pipeline in this data
factory copies data from one folder to another folder in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create an application in Azure Active Directory by following these instructions. Make note of the following values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to the Contributor role by following the instructions in the same article.
Create and upload an input file
1. Launch Notepad. Copy the following text and save it as a file named input.txt on your disk.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and input folder
in the container. Then, upload the input.txt file to the input folder.
3. To install the Python package for Data Factory, run the following command:
The Python SDK for Data Factory supports Python 2.7, 3.3, 3.4, 3.5, 3.6 and 3.7.
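A minimal sketch of the install commands, assuming the azure-mgmt-resource and azure-mgmt-datafactory packages from PyPI provide the SDK used in this quickstart:
pip install azure-mgmt-resource
pip install azure-mgmt-datafactory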
Create a data factory client
1. Create a file named datafactory.py. Add the following statements to add references to
namespaces.
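A sketch of the import statements this quickstart relies on, assuming the azure-common, azure-mgmt-resource, and azure-mgmt-datafactory packages; adjust the imports if your SDK version organizes these classes differently.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
import time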
def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
3. Add the following code to the Main method that creates an instance of the DataFactoryManagementClient class. You use this object to create the data factory, linked service, datasets, and pipeline. You also use this object to monitor the pipeline run details. Set the subscription_id variable to the ID of your Azure subscription. For a list of Azure regions in which Data Factory is currently available, see Products available by region: select the regions that interest you, and then expand Analytics to locate Data Factory. The data stores (Azure Storage, Azure SQL Database, and so on) and computes (HDInsight and so on) used by the data factory can be in other regions.
def main():
# Azure subscription ID
subscription_id = '<Specify your Azure Subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment out the code that creates the resource group.
rg_name = 'ADFTutorialResourceGroup'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory application/client
ID>', secret='<client secret>', tenant='<Active Directory tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}
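# Sketch of the assumed factory-creation step: name the data factory and linked service,
# then create the factory. df_name is a placeholder; ls_name matches the
# storageLinkedService name shown in the output later in this quickstart.
df_name = '<Specify a globally unique name for the data factory>'
ls_name = 'storageLinkedService'
df_resource = Factory(location='eastus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)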
# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString('DefaultEndpointsProtocol=https;AccountName=
<storageaccountname>;AccountKey=<storageaccountkey>')
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
properties of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure
Storage linked service you create in the previous step.
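A sketch of what the source and sink dataset definitions could look like, assuming the adfv2tutorial container and input folder created earlier; the output folder name and the exact AzureBlobDataset constructor arguments are assumptions that may vary by SDK version.
# Source dataset: points at the input.txt blob uploaded earlier
ds_ls = LinkedServiceReference(reference_name=ls_name)
ds_in = adf_client.datasets.create_or_update(rg_name, df_name, 'ds_in',
    AzureBlobDataset(linked_service_name=ds_ls, folder_path='adfv2tutorial/input', file_name='input.txt'))
print_item(ds_in)
# Sink dataset: the folder that the copy activity writes to
ds_out = adf_client.datasets.create_or_update(rg_name, df_name, 'ds_out',
    AzureBlobDataset(linked_service_name=ds_ls, folder_path='adfv2tutorial/output'))
print_item(ds_out)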
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
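A sketch of the pipeline definition, assuming a single copy activity from ds_in to ds_out; the activity name is illustrative, and copyPipeline matches the output shown later in this quickstart.
# Build a copy activity that reads from the source dataset and writes to the sink dataset
copy_activity = CopyActivity(name='copyBlobtoBlob',
    inputs=[DatasetReference(reference_name='ds_in')],
    outputs=[DatasetReference(reference_name='ds_out')],
    source=BlobSource(), sink=BlobSink())
# Create the pipeline that wraps the copy activity
p_obj = PipelineResource(activities=[copy_activity], parameters={})
p = adf_client.pipelines.create_or_update(rg_name, df_name, 'copyPipeline', p_obj)
print_item(p)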
Now, add the following statement to invoke the main method when the program is run:
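# Start the main method
main()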
Full script
Here is the full Python code:
def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
print("\n")
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
def main():
# Azure subscription ID
subscription_id = '<your Azure subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment out the code that creates the resource group.
rg_name = '<Azure resource group name>'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ServicePrincipalCredentials(client_id='<Active Directory client ID>',
secret='<client secret>', tenant='<tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_params = {'location':'eastus'}
df_params = {'location':'eastus'}
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory
name>/linkedservices/storageLinkedService
Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in
Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out
Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline
Clean up resources
To delete the data factory, add the following code to the program:
adf_client.factories.delete(rg_name,df_name)
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
3/26/2019 • 8 minutes to read
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven
workflows in the cloud for orchestrating and automating data movement and data transformation.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that
can ingest data from disparate data stores, process/transform the data by using compute services such
as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and
publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI)
applications to consume.
This quickstart describes how to use REST API to create an Azure data factory. The pipeline in this data
factory copies data from one location to another location in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Azure subscription. If you don't have a subscription, you can create a free trial account.
Azure Storage account. You use the blob storage as source and sink data store. If you don't have
an Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob storage, create an input folder in the container, and upload some files to the folder. You can use tools such as Azure Storage Explorer to connect to Azure Blob storage, create a blob container, upload an input file, and verify the output file.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure
PowerShell. This quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory by following these instructions. Make note of the following values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to the Contributor role.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
2. Run the following commands after replacing the placeholders with your own values, to set global variables to be used in later steps.
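# Authenticate as the service principal and build the Authorization header used by the REST calls that follow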
$AuthContext =
[Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonli
ne.com/${tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -
ArgumentList ($appId, $authKey)
$result = $AuthContext.AcquireToken("https://management.core.windows.net/", $cred)
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}
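# Send a PUT request to the Azure Resource Manager endpoint to create the data factory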
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}?api-version=${apiVersion}"
$body = @"
{
"name": "$dataFactoryName",
"location": "East US",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
Note the following points:
The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.
For a list of Azure regions in which Data Factory is currently available, see Products available by region: select the regions that interest you, and then expand Analytics to locate Data Factory. The data stores (Azure Storage, Azure SQL Database, and so on) and computes (HDInsight and so on) used by the data factory can be in other regions.
Here is the sample response:
{
"name": "<dataFactoryName>",
"tags": {
},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-14T06:22:59.9106216Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "<service principal ID>",
"tenantId": "<tenant ID>"
},
"id": "dataFactoryName",
"type": "Microsoft.DataFactory/factories",
"location": "East US"
}
{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "@{value=**********; type=SecureString}"
}
},
"etag": "0000c552-0000-0000-0000-59b1459c0000"
}
Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, the Blob dataset refers to the Azure Storage linked service that you created in the previous step. The dataset takes a parameter whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath" value that points to where the data resides.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/datasets/BlobDataset?api-version=${apiVersion}"
$body = @"
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
{
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory
/factories/<dataFactoryName>/datasets/BlobDataset",
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@{value=@{dataset().path}; type=Expression}"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": "@{type=String}"
}
},
"etag": "0000c752-0000-0000-0000-59b1459d0000"
}
Create pipeline
In this example, this pipeline contains one activity and takes two parameters: the input blob path and the output blob path. The values for these parameters are set when the pipeline is triggered or run. The copy activity refers to the same blob dataset created in the previous step as both its input and its output. When the dataset is used as an input dataset, the input path is specified; when it is used as an output dataset, the output path is specified.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline?api-
version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
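# Trigger a pipeline run by calling createRun; the response contains the run ID that is captured below for monitoring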
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Mic
rosoft.DataFactory/factories/${dataFactoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-
version=${apiVersion}"
$body = @"
{
"inputPath": "<the path to existing blob(s) to copy data from, e.g. containername/path>",
"outputPath": "<the blob path to copy data to, e.g. containername/path>"
}
"@
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId
{
"runId": "2f26be35-c112-43fa-9eaa-8ba93ea57881"
}
Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying
the data.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/provid
ers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-
version=${apiVersion}"
while ($True) {
    $response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
    Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"
    # Assumed completion check: keep polling while the run is still in progress
    if ($response.Status -eq "InProgress") {
        Start-Sleep -Seconds 15
    }
    else {
        $response | ConvertTo-Json
        break
    }
}
{
"key": "000000000-0000-0000-0000-00000000000",
"timestamp": "2017-09-07T13:12:39.5561795Z",
"runId": "000000000-0000-0000-0000-000000000000",
"dataFactoryName": "<dataFactoryName>",
"pipelineName": "Adfv2QuickStartPipeline",
"parameters": [
"inputPath: <inputBlobPath>",
"outputPath: <outputBlobPath>"
],
"parametersCount": 2,
"parameterNames": [
"inputPath",
"outputPath"
],
"parameterNamesCount": 2,
"parameterValues": [
"<inputBlobPath>",
"<outputBlobPath>"
],
"parameterValuesCount": 2,
"runStart": "2017-09-07T13:12:00.3710792Z",
"runEnd": "2017-09-07T13:12:39.5561795Z",
"durationInMs": 39185,
"status": "Succeeded",
"message": ""
}
2. Run the following script to retrieve copy activity run details, for example, size of the data
read/written.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/provid
ers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}/activityruns?
api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-
Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
$response | ConvertTo-Json
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure
resource group, which includes all the resources in the resource group. If you want to keep the other
resources intact, delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:
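A sketch of the delete command, assuming the $resourceGroup variable set earlier in this quickstart holds the resource group name:
Remove-AzResourceGroup -Name $resourceGroup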
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Tutorial: Create an Azure data factory using
Azure Resource Manager template
3/26/2019 • 15 minutes to read
This quickstart describes how to use an Azure Resource Manager template to create an Azure data
factory. The pipeline you create in this data factory copies data from one folder to another folder in an
Azure blob storage. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Transform data using Spark.
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure
Data Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of
the contributor or owner role, or an administrator of the Azure subscription. To view the permissions
that you have in the subscription, in the Azure portal, select your username in the upper-right corner,
and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines,
triggers, and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the
resource level or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure storage account
You use a general-purpose Azure storage account (specifically Blob storage) as both source and
destination data stores in this quickstart. If you don't have a general-purpose Azure storage account, see
Create a storage account to create one.
Get the storage account name and account key
You will need the name and key of your Azure storage account for this quickstart. The following
procedure provides steps to get the name and key of your storage account:
1. In a web browser, go to the Azure portal. Sign in by using your Azure username and password.
2. Select All services on the left menu, filter with the Storage keyword, and select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your
storage account.
4. On the Storage account page, select Access keys on the menu.
5. Copy the values from the Storage account name and key1 boxes to the clipboard. Paste them into Notepad or any other editor and save the file. You use these values later in this quickstart.
Create the input folder and files
In this section, you create a blob container named adftutorial in Azure Blob storage. You create a folder
named input in the container, and then upload a sample file to the input folder.
1. On the Storage account page, switch to Overview, and then select Blobs.
2. On the Blob service page, select + Container on the toolbar.
3. In the New container dialog box, enter adftutorial for the name, and then select OK.
7. Start Notepad and create a file named emp.txt with the following content. Save it in the
c:\ADFv2QuickStartPSH folder. Create the ADFv2QuickStartPSH folder if it does not already
exist.
John, Doe
Jane, Doe
8. In the Azure portal, on the Upload blob page, browse to and select the emp.txt file for the Files
box.
9. Enter input as a value for the Upload to folder box.
10. Confirm that the folder is input and the file is emp.txt, and select Upload.
You should see the emp.txt file and the status of the upload in the list.
11. Close the Upload blob page by clicking X in the corner.
12. Keep the Container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM
module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az
module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module
installation instructions, see Install Azure PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure
Azure PowerShell.
{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"dataFactoryName": {
"type": "string",
"metadata": {
"description": "Name of the data factory. Must be globally unique."
}
},
"dataFactoryLocation": {
"type": "string",
"allowedValues": [
"East US",
"East US 2",
"West Europe"
],
"defaultValue": "East US",
"metadata": {
"description": "Location of the data factory. Currently, only East US, East US 2, and West
Europe are supported. "
}
},
"storageAccountName": {
"type": "string",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"storageAccountKey": {
"storageAccountKey": {
"type": "securestring",
"metadata": {
"description": "Key for the Azure storage account."
}
},
"blobContainer": {
"type": "string",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
},
"inputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that has the input file."
}
},
"inputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the input file/blob."
}
},
"outputBlobFolder": {
"type": "string",
"metadata": {
"description": "The folder in the blob container that will hold the transformed data."
}
},
"outputBlobName": {
"type": "string",
"metadata": {
"description": "Name of the output file/blob."
}
},
"triggerStartTime": {
"type": "string",
"metadata": {
"description": "Start time for the trigger."
}
},
"triggerEndTime": {
"type": "string",
"metadata": {
"description": "End time for the trigger."
}
}
},
"variables": {
"azureStorageLinkedServiceName": "ArmtemplateStorageLinkedService",
"inputDatasetName": "ArmtemplateTestDatasetIn",
"outputDatasetName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline",
"triggerName": "ArmTemplateTestTrigger"
},
"resources": [{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "[parameters('dataFactoryLocation')]",
"identity": {
"type": "SystemAssigned"
},
"resources": [{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[parameters('dataFactoryName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": {
"value": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=
',parameters('storageAccountKey'))]",
"type": "SecureString"
}
}
}
},
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "datasets",
"name": "[variables('outputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'),
'/')]",
"fileName": "[parameters('outputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
},
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "[parameters('triggerStartTime')]",
"endTime": "[parameters('triggerEndTime')]",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
]
}]
}
Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure Resource Manager template.
IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and
storageAccountKey parameters in this parameter file. You created the adftutorial container and uploaded
the sample file (emp.txt) to the input folder in this Azure blob storage.
Specify a globally unique name for the data factory for the dataFactoryName parameter. For example:
ARMTutorialFactoryJohnDoe11282017.
For the triggerStartTime, specify the current day in the format: 2017-11-28T00:00:00 .
For the triggerEndTime, specify the next day in the format: 2017-11-29T00:00:00. You can also check the current UTC time and specify the next hour or two as the end time. For example, if the UTC time now is 1:32 AM, specify 2017-11-29T03:00:00 as the end time. In this case, the trigger runs the pipeline twice (at 2 AM and 3 AM).
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"dataFactoryName": {
"value": "<datafactoryname>"
},
"dataFactoryLocation": {
"value": "East US"
},
"storageAccountName": {
"value": "<yourstorageaccountname>"
},
"storageAccountKey": {
"value": "<yourstorageaccountkey>"
},
"blobContainer": {
"value": "adftutorial"
},
"inputBlobFolder": {
"value": "input"
},
"inputBlobName": {
"value": "emp.txt"
},
"outputBlobFolder": {
"value": "output"
},
"outputBlobName": {
"value": "emp.txt"
},
"triggerStartTime": {
"value": "2017-11-28T00:00:00. Set to today"
},
"triggerEndTime": {
"value": "2017-11-29T00:00:00. Set to tomorrow"
}
}
}
IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you
can use with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in these environments.
Deploy Data Factory entities
In PowerShell, run the following command to deploy Data Factory entities using the Resource Manager
template you created earlier in this quickstart.
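A sketch of the deployment command, assuming the template and parameter files from this quickstart are saved locally as C:\ADFTutorial\ADFTutorialARM.json and C:\ADFTutorial\ADFTutorialARM-Parameters.json (the folder path is an assumption):
New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFTutorial\ADFTutorialARM.json -TemplateParameterFile C:\ADFTutorial\ADFTutorialARM-Parameters.json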
DeploymentName : MyARMDeployment
ResourceGroupName : ADFTutorialResourceGroup
ProvisioningState : Succeeded
Timestamp : 11/29/2017 3:11:13 AM
Mode : Incremental
TemplateLink :
Parameters :
Name Type Value
=============== ============ ==========
dataFactoryName String <data factory name>
dataFactoryLocation String East US
storageAccountName String <storage account name>
storageAccountKey SecureString
blobContainer String adftutorial
inputBlobFolder String input
inputBlobName String emp.txt
outputBlobFolder String output
outputBlobName String emp.txt
triggerStartTime String 11/29/2017 12:00:00 AM
triggerEndTime String 11/29/2017 4:00:00 AM
Outputs :
DeploymentDebugLogLevel :
$resourceGroupName = "ADFTutorialResourceGroup"
2. Create a variable to hold the name of the data factory. Specify the same name that you specified in the ADFTutorialARM-Parameters.json file.
$dataFactoryName = "<yourdatafactoryname>"
3. Set a variable for the name of the trigger. The name of the trigger is hardcoded in the Resource
Manager template file (ADFTutorialARM.json).
$triggerName = "ArmTemplateTestTrigger"
4. Get the status of the trigger by running the following PowerShell command after specifying the
name of your data factory and trigger:
TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Stopped
Confirm
Are you sure you want to start trigger 'ArmTemplateTestTrigger' in data factory
'ARMFactory1128'?
[Y] Yes [N] No [S] Suspend [?] Help (default is "Y"): y
True
6. Confirm that the trigger has been started by running the Get-AzDataFactoryV2Trigger command
again.
TriggerName : ArmTemplateTestTrigger
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ARMFactory1128
Properties : Microsoft.Azure.Management.DataFactory.Models.ScheduleTrigger
RuntimeState : Started
Monitor the pipeline
1. After signing in to the Azure portal, select All services, search with a keyword such as data fa, and select Data factories.
2. In the Data Factories page, click the data factory you created. If needed, filter the list with the
name of your data factory.
IMPORTANT
You see pipeline runs only at the top of the hour (for example, 4 AM, 5 AM, and 6 AM). Click Refresh on the toolbar to refresh the list when the time reaches the next hour.
6. You see the activity runs associated with the pipeline run. In this quickstart, the pipeline has only
one activity of type: Copy. Therefore, you see a run for that activity.
7. Click the link in the Output column. You see the output from the copy operation in an Output window. Click the maximize button to see the full output, and close the window when you're done.
8. Stop the trigger once you see a successful or failed run. The trigger runs the pipeline once an hour, and the pipeline copies the same file from the input folder to the output folder for each run. To stop the trigger, run the following command in the PowerShell window.
Note: Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:
The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these parameters are passed in by using a configuration file. The definition also uses the azureStorageLinkedServiceName variable and the dataFactoryName parameter defined in the template.
Azure blob input dataset
The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. In the Azure blob dataset definition, you specify the names of the blob container, folder, and file that contain the input data. See Azure Blob dataset properties for details about the JSON properties used to define an Azure Blob dataset.
{
"type": "datasets",
"name": "[variables('inputDatasetName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'),
'/')]",
"fileName": "[parameters('inputBlobName')]"
},
"linkedServiceName": {
"referenceName": "[variables('azureStorageLinkedServiceName')]",
"type": "LinkedServiceReference"
}
}
},
Data pipeline
You define a pipeline that copies data from one Azure blob dataset to another Azure blob dataset. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.
{
"type": "pipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "MyCopyActivity",
"inputs": [{
"referenceName": "[variables('inputDatasetName')]",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "[variables('outputDatasetName')]",
"type": "DatasetReference"
}]
}]
}
}
Trigger
You define a trigger that runs the pipeline once an hour. The deployed trigger is in stopped state. Start
the trigger by using the Start-AzDataFactoryV2Trigger cmdlet. For more information about triggers,
see Pipeline execution and triggers article.
{
"type": "triggers",
"name": "[variables('triggerName')]",
"dependsOn": [
"[parameters('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('inputDatasetName')]",
"[variables('outputDatasetName')]",
"[variables('pipelineName')]"
],
"apiVersion": "2018-06-01",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-28T00:00:00",
"endTime": "2017-11-29T00:00:00",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "ArmtemplateSampleCopyPipeline"
},
"parameters": {}
}]
}
}
Notice that the first command uses the parameter file for the development environment, the second one uses the file for the test environment, and the third one uses the file for the production environment.
You can also reuse the template to perform repeated tasks. For example, create many data factories with
one or more pipelines that implement the same logic but each data factory uses different Azure storage
accounts. In this scenario, you use the same template in the same environment (dev, test, or production)
with different parameter files to create data factories.
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage.
Go through the tutorials to learn about using Data Factory in more scenarios.
Create Azure Data Factory Data Flow
5/6/2019 • 2 minutes to read
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer SLA
provisions.
Mapping Data Flows in ADF provide a way to transform data at scale without any coding required. You can design
a data transformation job in the data flow designer by constructing a series of transformations. Start with any
number of source transformations followed by data transformation steps. Then, complete your data flow with a sink to land your results in a destination.
Get started by first creating a new V2 Data Factory from the Azure portal. After creating your new factory, click on
the "Author & Monitor" tile to launch the Data Factory UI.
Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, create "Pipeline from Template" and select the Data Flow category from the template
gallery.
You will be prompted to enter your Azure Blob Storage account information.
The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage accounts so that you can execute the samples.
Next steps
Begin building your data transformation with a source transformation.
Copy data from Azure Blob storage to a SQL
database by using the Copy Data tool
3/6/2019 • 5 minutes to read
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from Azure Blob storage to a SQL database.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source data store. If you don't have an Azure storage account,
see the instructions in Create a storage account.
Azure SQL Database: Use a SQL database as the sink data store. If you don't have a SQL database, see the
instructions in Create a SQL database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad. Copy the following text and save it in a file named inputEmp.txt on your disk:
John|Doe
Jane|Doe
2. Create a container named adfv2tutorial and upload the inputEmp.txt file to the container. You can use
various tools to perform these tasks, such as Azure Storage Explorer.
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL database:
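CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO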
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list,
and then select Finish.
d. Select the newly created linked service as source, then click Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Click Browse to navigate to the adfv2tutorial/input folder, select the inputEmp.txt file, then click
Choose.
b. Click Next to move to next step.
5. On the File format settings page, notice that the tool automatically detects the column and row delimiters.
Select Next. You also can preview data and view the schema of the input data on this page.
b. Select Azure SQL Database from the gallery, and then select Next.
c. On the New Linked Service page, select your server name and DB name from the dropdown list, and
specify the username and password, then select Finish.
d. Select the newly created linked service as sink, then click Next.
7. On the Table mapping page, select the [dbo].[emp] table, and then select Next.
8. On the Schema mapping page, notice that the first and second columns in the input file are mapped to the
FirstName and LastName columns of the emp table. Select Next.
12. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view
activity run details and to rerun the pipeline. Select Refresh to refresh the list.
13. To view the activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. For details about the copy operation, select the Details link (eyeglasses icon) in the
Actions column. To go back to the Pipeline Runs view, select the Pipelines link at the top. To refresh the
view, select Refresh.
14. Verify that the data is inserted into the emp table in your SQL database.
15. Select the Author tab on the left to switch to the editor mode. You can update the linked services, datasets,
and pipelines that were created via the tool by using the editor. For details on editing these entities in the
Data Factory UI, see the Azure portal version of this tutorial.
Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a SQL
database by using Azure Data Factory
3/26/2019 • 10 minutes to read
In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data from Azure Blob storage to a SQL database. The configuration pattern in this tutorial
applies to copying from a file-based data store to a relational data store. For a list of data stores supported as
sources and sinks, see the supported data stores table.
NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account. You use Blob storage as a source data store. If you don't have a storage account, see
Create an Azure storage account for steps to create one.
Azure SQL Database. You use the database as a sink data store. If you don't have a SQL database, see Create
a SQL database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:
John,Doe
Jane,Doe
2. Create a container named adftutorial in your Blob storage. Create a folder named input in this container.
Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure Storage
Explorer to do these tasks.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON for
your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting,
take the following steps:
a. On the left, select More services > SQL servers.
b. Select your server, and under SETTINGS select Firewall.
c. On the Firewall settings page, select ON for Allow access to Azure services.
The name of the Azure data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version, select V2.
7. Under Location, select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Pin to dashboard.
9. Select Create.
10. On the dashboard, you see the following tile with the status Deploying Data Factory:
11. After the creation is finished, you see the Data factory page as shown in the image.
12. Select Author & Monitor to launch the Data Factory UI in a separate tab.
Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the Let's get started page, select Create pipeline.
2. In the General tab for the pipeline, enter CopyPipeline for Name of the pipeline.
3. In the Activities tool box, expand the Move and Transform category, and drag and drop the Copy Data activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for Name.
Configure source
1. Go to the Source tab. Select + New to create a source dataset.
2. In the New Dataset window, select Azure Blob Storage, and then select Finish. The source data is in
Blob storage, so you select Azure Blob Storage for the source dataset.
3. You see a new tab opened for blob dataset. On the General tab at the bottom of the Properties window,
enter SourceBlobDataset for Name.
4. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.
5. In the New Linked Service window, enter AzureStorageLinkedService as name, select your storage
account from the Storage account name list, then select Save to deploy the linked service.
6. After the linked service is created, you are back in the dataset settings. Next to File path, select Browse.
7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select Finish.
8. Confirm that File format is set to Text format and that Column delimiter is set to Comma ( , ). If the
source file uses different row and column delimiters, you can select Detect Text Format for File format.
The Copy Data tool detects the file format and delimiters automatically for you. You can still override these
values. To preview data on this page, select Preview data.
9. Go to the Schema tab of the Properties window, and select Import Schema. Notice that the application
detected two columns in the source file. You import the schema here so that you can map columns from the
source data store to the sink data store. If you don't need to map columns, you can skip this step. For this
tutorial, import the schema.
10. Now, go back to the pipeline -> Source tab, confirm that SourceBlobDataset is selected. To preview data
on this page, select Preview data.
Configure sink
1. Go to the Sink tab, and select + New to create a sink dataset.
2. In the New Dataset window, input "SQL" in the search box to filter the connectors, then select Azure SQL
Database, and then select Finish. In this tutorial, you copy data to a SQL database.
3. On the General tab of the Properties window, in Name, enter OutputSqlDataset.
4. Go to the Connection tab, and next to Linked service, select + New. A dataset must be associated with a
linked service. The linked service has the connection string that Data Factory uses to connect to the SQL
database at runtime. The dataset specifies the container, folder, and the file (optional) to which the data is
copied.
8. Select the ID column, and then select Delete. The ID column is an identity column in the SQL database, so
the copy activity doesn't need to insert data into this column.
9. Go to the tab with the pipeline, and in Sink Dataset, confirm that OutputSqlDataset is selected.
Configure mapping
Go to the Mapping tab at the bottom of the Properties window, and select Import Schemas. Notice that the
first and second columns in the source file are mapped to FirstName and LastName in the SQL database.
3. Wait until you see the Successfully published message. To see notification messages, click the Show
Notifications button (the bell icon) on the top-right.
3. To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. In this example, there is only one activity, so you see only one entry in the list. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. Select Pipelines at the top
to go back to the Pipeline Runs view. To refresh the view, select Refresh.
4. Verify that two more rows are added to the emp table in the SQL database.
5. On the Trigger Run Parameters page, review the warning, and then select Finish. The pipeline in this
example doesn't take any parameters.
6. Click Publish All to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.
8. To switch from the Pipeline Runs view to the Trigger Runs view, select Pipeline Runs and then select
Trigger Runs.
10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
Next steps
The pipeline in this sample copies data from Blob storage to a SQL database. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
5/15/2019 • 10 minutes to read
In this tutorial, you create a Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL
Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational
data store. For a list of data stores supported as sources and sinks, see supported data stores table.
You perform the following steps in this tutorial:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the .NET SDK. You can use other mechanisms to interact with Azure Data Factory; refer to the
samples under "Quickstarts".
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. You use the blob storage as the source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as the sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory by following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign the application to
the Contributor role by following the instructions in the same article.
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by performing the following steps:
Create a source blob
1. Launch Notepad. Copy the following text and save it as inputEmp.txt on your disk.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
2. Allow Azure services to access your SQL server. Ensure that the Allow access to Azure services setting is turned
ON for your Azure SQL server so that the Data Factory service can write data to it.
To verify and turn on this setting, do the following steps:
a. Click the More services hub on the left, and click SQL servers.
b. Select your server, and click Firewall under SETTINGS.
c. On the Firewall settings page, click ON for Allow access to Azure services.
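If you prefer to script this setting rather than use the portal, the Az.Sql PowerShell module exposes it as a special server-level firewall rule. The following is a minimal sketch; the resource group and server names are placeholders you must replace.
# Equivalent to turning ON "Allow access to Azure services" in the portal.
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" `
    -ServerName "<your Azure SQL server name>" -AllowAllAzureIPs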
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets variables. Replace the placeholders with your own
values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products available
by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by the data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";
3. Add the following code to the Main method that creates an instance of the DataFactoryManagementClient
class. You use this object to create a data factory, a linked service, datasets, and a pipeline. You also use this
object to monitor the pipeline run details.
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For details about the supported
properties, see Azure Blob dataset properties.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you created in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
The blob format indicating how to parse the content: TextFormat and its settings (for example, column
delimiter).
The data structure, including column names and data types, which in this case map to the sink SQL table.
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, the
pipeline contains one activity: a copy activity that takes the Blob dataset as the source and the SQL dataset as the
sink. For details, see the Copy Activity overview.
2. Add the following code to the Main method that retrieves copy activity run details, for example, the size of
the data read and written.
if (pipelineRun.Status == "Succeeded")
{
    Console.WriteLine(activityRuns.First().Output);
}
else
{
    Console.WriteLine(activityRuns.First().Error);
}
Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You
learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to the cloud:
Copy data from on-premises to cloud
Copy data from an on-premises SQL Server
database to Azure Blob storage by using the Copy
Data tool
4/8/2019 • 8 minutes to read
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from an on-premises SQL Server database to Azure Blob storage.
NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the upper-
right corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the
Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Query.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.
3. In the New container window, in the Name box, enter adftutorial, and then select OK.
3. Select the Azure subscription in which you want to create the data factory.
4. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version, select V2.
6. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other locations/regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, you see the following tile with the status Deploying Data Factory:
10. After the creation is finished, you see the Data Factory page as shown in the image.
11. Select Author & Monitor to launch the Data Factory user interface in a separate tab.
5. Under New Linked Service (SQL Server) Name, enter SqlServerLinkedService. Select +New under
Connect via integration runtime. You must create a self-hosted integration runtime, download it to your
machine, and register it with Data Factory. The self-hosted integration runtime copies data between your
on-premises environment and the cloud.
6. In the Integration Runtime Setup dialog box, select Private Network. Then select Next.
7. In the Integration Runtime Setup dialog box under Name, enter TutorialIntegrationRuntime. Then
select Next.
8. Select Click here to launch the express setup for this computer. This action installs the integration
runtime on your machine and registers it with Data Factory. Alternatively, you can use the manual setup
option to download the installation file, run it, and use the key to register the integration runtime.
9. Run the downloaded application. You see the status of the express setup in the window.
10. Confirm that TutorialIntegrationRuntime is selected for the Integration Runtime field.
11. In Specify the on-premises SQL Server database, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your on-premises SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select appropriate authentication.
e. Under User name, enter the name of user with access to on-premises SQL Server.
f. Enter the password for the user. Select Finish.
12. Select Next.
13. On the Select tables from which to copy the data or use a custom query page, select the [dbo].[emp]
table in the list, and select Next. You can select any other table based on your database.
14. On the Destination data store page, select Create new connection.
15. In New Linked Service, search for and select Azure Blob, and then select Continue.
16. On the New Linked Service (Azure Blob Storage) dialog, take the following steps:
c. Under Storage account name, select your storage account from the drop-down list.
d. Select Next.
17. In the Destination data store dialog, select Next. In Connection properties, for Azure storage service,
select Azure Blob Storage. Select Next.
18. In the Choose the output file or folder dialog, under Folder path, enter adftutorial/fromonprem. You
created the adftutorial container as part of the prerequisites. If the output folder doesn't exist (in this case
fromonprem), Data Factory automatically creates it. You can also use the Browse button to browse the
blob storage and its containers/folders. If you don't specify a value under File name, the name from the
source is used by default (in this case, dbo.emp).
22. On the Deployment page, select Monitor to monitor the pipeline or task you created.
23. On the Monitor tab, you can view the status of the pipeline you created. You can use the links in the Actions
column to view activity runs associated with the pipeline run and to rerun the pipeline.
24. Select the View Activity Runs link in the Actions column to see activity runs associated with the pipeline
run. To see details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.
To switch back to the Pipeline Runs view, select Pipelines at the top.
25. Confirm that you see the output file in the fromonprem folder of the adftutorial container.
26. Select the Edit tab on the left to switch to the editor mode. You can update the linked services, datasets, and
pipelines created by the tool by using the editor. Select Code to view the JSON code associated with the
entity opened in the editor. For details on how to edit these entities in the Data Factory UI, see the Azure
portal version of this tutorial.
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Blob storage. You learned
how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from an on-premises SQL Server database
to Azure Blob storage
4/8/2019 • 9 minutes to read
In this tutorial, you use the Azure Data Factory user interface (UI) to create a data factory pipeline that copies data
from an on-premises SQL Server database to Azure Blob storage. You create and use a self-hosted integration
runtime, which moves data between on-premises and cloud data stores.
NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select your
user name, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on how to add a user to a role, see Manage access using RBAC and the Azure
portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Blob storage
(sink). You then create a table named emp in your SQL Server database and insert a couple of sample entries into
the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:
6. In the tree view, right-click the database that you created, and then select New Query.
Azure storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose Azure storage account, see Create a storage account. The pipeline
in the data factory that you create in this tutorial copies data from the on-premises SQL Server database (source)
to Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select More services. Filter by using the Storage keyword, and then select Storage
accounts.
3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Blobs.
3. In the New container window, under Name, enter adftutorial. Then select OK.
The name of the data factory must be globally unique. If you see the following error message for the name field,
change the name of the data factory (for example, yournameADFTutorialDataFactory). For naming rules for Data
Factory artifacts, see Data Factory naming rules.
1. Select the Azure subscription in which you want to create the data factory.
2. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
3. Under Version, select V2.
4. Under Location, select the location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Storage and SQL Database) and computes (for example,
Azure HDInsight) used by Data Factory can be in other regions.
5. Select Pin to dashboard.
6. Select Create.
7. On the dashboard, you see the following tile with the status Deploying Data Factory:
8. After the creation is finished, you see the Data Factory page as shown in the image:
9. Select the Author & Monitor tile to launch the Data Factory UI in a separate tab.
Create a pipeline
1. On the Let's get started page, select Create pipeline. A pipeline is automatically created for you. You see
the pipeline in the tree view, and its editor opens.
2. On the General tab at the bottom of the Properties window, in Name, enter SQLServerToBlobPipeline.
3. In the Activities tool box, expand DataFlow. Drag and drop the Copy activity to the pipeline design
surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.
5. In the New Dataset window, search for SQL Server. Select SQL Server, and then select Finish. You see a
new tab titled SqlServerTable1. You also see the SqlServerTable1 dataset in the tree view on the left.
6. On the General tab at the bottom of the Properties window, in Name, enter SqlServerDataset.
7. Go to the Connection tab, and select + New. You create a connection to the source data store (SQL Server
database) in this step.
8. In the New Linked Service window, add Name as SqlServerLinkedService. Select New under Connect
via integration runtime. In this section, you create a self-hosted integration runtime and associate it with
an on-premises machine with the SQL Server database. The self-hosted integration runtime is the
component that copies data from the SQL Server database on your machine to Blob storage.
9. In the Integration Runtime Setup window, select Private Network, and then select Next.
10. Enter a name for the integration runtime, and select Next.
11. Under Option 1: Express setup, select Click here to launch the express setup for this computer.
12. In the Integration Runtime (Self-hosted) Express Setup window, select Close.
13. In the New Linked Service window, ensure the Integration Runtime created above is selected under
Connect via integration runtime.
14. In the New Linked Service window, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Connect via integration runtime, confirm that the self-hosted integration runtime you created
earlier shows up.
c. Under Server name, enter the name of your SQL Server instance.
d. Under Database name, enter the name of the database with the emp table.
e. Under Authentication type, select the appropriate authentication type that Data Factory should use to
connect to your SQL Server database.
f. Under User name and Password, enter the user name and password. If you need to use a backslash (\) in
your user account or server name, precede it with the escape character (\). For example, use
mydomain\\myuser.
g. Select Test connection. Do this step to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
h. To save the linked service, select Finish.
15. You should be back in the window with the source dataset opened. On the Connection tab of the
Properties window, take the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. In Table, select [dbo].[emp].
16. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.
17. Go to the Sink tab at the bottom of the Properties window, and select + New.
18. In the New Dataset window, select Azure Blob Storage. Then select Finish. You see a new tab opened for
the dataset. You also see the dataset in the tree view.
19. In Name, enter AzureBlobDataset.
20. Go to the Connection tab at the bottom of the Properties window. Next to Linked service, select + New.
21. In the New Linked Service window, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Storage account name, select your storage account.
c. To test the connection to your storage account, select Test connection.
d. Select Save.
22. You should be back in the window with the sink dataset open. On the Connection tab, take the following
steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. For the folder/directory part of File path, enter adftutorial/fromonprem. If the output folder doesn't
exist in the adftutorial container, Data Factory automatically creates the output folder.
c. For the file name part of File path, select Add dynamic content.
d. Add @CONCAT(pipeline().RunId, '.txt'), and then select Finish. This names the output file with the pipeline run ID (PipelineRunID.txt).
23. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset, confirm that
AzureBlobDataset is selected.
24. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
Validation Report, select Close.
25. To publish entities you created to Data Factory, select Publish All.
26. Wait until you see the Publishing succeeded pop-up. To check the status of publishing, select the Show
Notifications link on the left. To close the notification window, select Close.
Trigger a pipeline run
Select Trigger on the toolbar for the pipeline, and then select Trigger Now.
Monitor the pipeline run
1. Go to the Monitor tab. You see the pipeline that you manually triggered in the previous step.
2. To view activity runs associated with the pipeline run, select the View Activity Runs link in the Actions
column. You see only one activity run because there is only one activity in the pipeline. To see details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column. To go back to the Pipeline
Runs view, select Pipelines at the top.
In this tutorial, you use Azure PowerShell to create a data-factory pipeline that copies data from an on-premises
SQL Server database to Azure Blob storage. You create and use a self-hosted integration runtime, which moves
data between on-premises and cloud data stores.
NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction to
Azure Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-right
corner, and then select Permissions. If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Manage access using RBAC and the Azure
portal article.
SQL Server 2014, 2016, and 2017
In this tutorial, you use an on-premises SQL Server database as a source data store. The pipeline in the data
factory you create in this tutorial copies data from this on-premises SQL Server database (source) to Azure Blob
storage (sink). You then create a table named emp in your SQL Server database, and insert a couple of sample
entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases, and then select New Database.
4. In the New Database window, enter a name for the database, and then select OK.
5. To create the emp table and insert some sample data into it, run the following query script against the
database:
6. In the tree view, right-click the database that you created, and then select New Query.
Azure Storage account
In this tutorial, you use a general-purpose Azure storage account (specifically, Azure Blob storage) as a
destination/sink data store. If you don't have a general-purpose Azure storage account, see Create a storage
account. The pipeline in the data factory that you create in this tutorial copies data from the on-premises SQL
Server database (source) to this Azure Blob storage (sink).
Get storage account name and account key
You use the name and key of your Azure storage account in this tutorial. Get the name and key of your storage
account by doing the following:
1. Sign in to the Azure portal with your Azure username and password.
2. In the left pane, select More services, filter by using the Storage keyword, and then select Storage
accounts.
3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys.
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.
3. In the New container window, in the Name box, enter adftutorial, and then select OK.
Windows PowerShell
Install Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through completion of this quickstart tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:
Connect-AzAccount
3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:
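A minimal sketch of that command, using the Az module (Select-AzSubscription is an alias of Set-AzContext):
Select-AzSubscription -SubscriptionId "<SubscriptionId>"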
$resourceGroupName = "ADFTutorialResourceGroup"
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.
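The command this note refers to creates the resource group; a minimal sketch, assuming the East US region (pick any region that suits you):
New-AzResourceGroup -Name $resourceGroupName -Location "East US"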
3. Define a variable for the data factory name that you can use in PowerShell commands later. The name must
start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.
IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFTutorialFactory"
NOTE
The name of the data factory must be globally unique. If you receive the following error, change the name and try again.
The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.
To create data-factory instances, the user account that you use to sign in to Azure must be assigned a contributor or
owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores (Azure
Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory can be in
other regions.
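Creating the factory itself is typically a single cmdlet call; a minimal sketch that reuses the variables defined earlier (the East US location is an assumption):
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName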
$integrationRuntimeName = "ADFTutorialIR"
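A self-hosted integration runtime with this name is typically created with a call along the following lines; the description text is an arbitrary placeholder:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name $integrationRuntimeName `
    -Type SelfHosted -Description "self-hosted IR for the tutorial"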
3. To retrieve the status of the created integration runtime, run the following command:
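A sketch of the kind of command that returns the status shown below:
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name $integrationRuntimeName -Status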
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data Factory
service in the cloud, run the following command. Copy one of the keys (excluding the quotation marks) for
registering the self-hosted integration runtime that you install on your machine in the next step.
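A sketch of the command that returns the keys in the JSON form shown below:
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name $integrationRuntimeName | ConvertTo-Json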
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the New Integration Runtime (Self-hosted) Node window, select Next.
14. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that the integration runtime can connect to the SQL Server instance, select Test.
If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect to
your SQL Server instance.
Note all the preceding values for later use in this tutorial.
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>;EndpointSuffix=core.windows.net"
}
}
},
"name": "AzureStorageLinkedService"
}
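Deploying this definition with Azure PowerShell typically looks like the following sketch, assuming the JSON is saved as AzureStorageLinkedService.json in the working folder:
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"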
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
Create and encrypt a SQL Server linked service (source)
In this step, you link your on-premises SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:
IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <server>, <database>, <user>, and <password> with the
values for your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.
2. To encrypt the sensitive data (user name, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted by using the Data Protection Application Programming
Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted integration runtime node
(your local machine). The output payload can be redirected to another JSON file (in this case,
encryptedSQLServerLinkedService.json) that contains the encrypted credentials.
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -IntegrationRuntimeName $integrationRuntimeName `
    -File ".\SQLServerLinkedService.json" > encryptedSQLServerLinkedService.json
Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the on-premises SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server instance
at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial, the emp
table contains the source data.
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:
{
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"structure": [
{
"name": "ID",
"type": "String"
},
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"referenceName": "EncryptedSqlServerLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "SqlServerDataset"
}
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/fromonprem",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"name": "AzureBlobDataset"
}
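Deploying the two dataset definitions with Azure PowerShell typically looks like the following sketch, assuming the JSON files are saved as SqlServerDataset.json and AzureBlobDataset.json in the C:\ADFv2Tutorial folder:
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "SqlServerDataset" -DefinitionFile ".\SqlServerDataset.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "AzureBlobDataset" -DefinitionFile ".\AzureBlobDataset.json"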
DatasetName : AzureBlobDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlSource and the sink type is set to
BlobSink.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:
{
"name": "SQLServerToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type":"BlobSink"
}
},
"name": "CopySqlServerToAzureBlobActivity",
"inputs": [
{
"referenceName": "SqlServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlobDataset",
"type": "DatasetReference"
}
]
}
]
}
}
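Deploying the pipeline definition typically looks like the following sketch, assuming the JSON is saved as SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder:
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "SQLServerToBlobPipeline" -DefinitionFile ".\SqlServerToBlobPipeline.json"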
PipelineName : SQLServerToBlobPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :
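Activity-run details like the ones that follow are typically obtained by triggering the pipeline and then querying its activity runs; a sketch using the Az.DataFactory cmdlets (the 30-minute window is an arbitrary assumption):
# Trigger the pipeline and capture the run ID.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -PipelineName "SQLServerToBlobPipeline"
# List the activity runs for that pipeline run.
Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)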
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : copy
PipelineRunId : 4ec8980c-62f6-466f-92fa-e69b10f33640
PipelineName : SQLServerToBlobPipeline
Input :
Output :
LinkedServiceName :
ActivityRunStart : 9/13/2017 1:35:22 PM
ActivityRunEnd : 9/13/2017 1:35:42 PM
DurationInMs : 20824
Status : Succeeded
Error : {errorCode, message, failureType, target}
2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:
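A sketch of the kind of command that surfaces the copy details shown below, assuming $runId holds the run ID returned by Invoke-AzDataFactoryV2Pipeline earlier:
$activityRun = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddMinutes(-60) -RunStartedBefore (Get-Date)
# The Output property carries the copy activity's metrics as JSON.
$activityRun.Output.ToString()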
{
"dataRead": 36,
"dataWritten": 24,
"rowsCopied": 2,
"copyDuration": 3,
"throughput": 0.01171875,
"errors": [],
"effectiveIntegrationRuntime": "MyIntegrationRuntime",
"billedDuration": 3
}
Next steps
The pipeline in this sample copies data from an on-premises SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy multiple tables in bulk by using Azure Data
Factory
4/8/2019 • 14 minutes to read
This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well. For example, copying tables from
SQL Server/Oracle to Azure SQL Database/Data Warehouse/Azure Blob, copying different paths from Blob to
Azure SQL Database tables.
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.
End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.
The name of the Azure data factory must be globally unique. If you see the following error for the name
field, change the name of the data factory (for example, yournameADFTutorialBulkCopyDF ). See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click the Author & Monitor tile to launch the Data Factory UI application in a separate tab.
13. On the get started page, switch to the Edit tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your Azure SQL server for Server name.
c. Select your Azure SQL database for Database name.
d. Enter the name of the user to connect to the Azure SQL database.
e. Enter the password for the user.
f. To test the connection to the Azure SQL database by using the specified information, click Test connection.
g. Click Save.
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedService. The linked
service specifies the connection string to connect to the database. The dataset specifies the name of the database
and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedService. The linked service specifies
the connection string to connect to the data warehouse. The dataset specifies the database and the table to which
the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Click + (plus) in the left pane, and click Dataset.
2. In the New Dataset window, select Azure SQL Database, and click Finish. You should see a new tab titled
AzureSqlTable1.
3. In the properties window at the bottom, enter AzureSqlDatabaseDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select any table for Table. This table is a dummy table. You specify a query on the source dataset
when creating a pipeline. The query is used to extract data from the Azure SQL database.
Alternatively, you can select the Edit check box and enter dummyName as the table name.
c. On the Add Dynamic Content page, click DWTableName under Parameters, which automatically
populates the top expression text box with @dataset().DWTableName, and then click Finish. The
tableName property of the dataset is set to the value that's passed as an argument for the
DWTableName parameter. The ForEach activity iterates through a list of tables, and passes them one by one to
the Copy activity.
Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData.
The GetTableListAndTriggerCopyData pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables takes a list of tables as a parameter. For each table in the list, it copies data from
the table in Azure SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus), and click Pipeline.
4. In the Activities toolbox, expand Iteration & Conditions, and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name.
b. Switch to the Settings tab, click the input box for Items, and then click the Add dynamic content link below it.
c. On the Add Dynamic Content page, collapse the System Variables and Functions section, and click
tableList under Parameters, which automatically populates the top expression text box with
@pipeline().parameters.tableList. Then click Finish.
d. Switch to Activities tab, click Add activity to add a child activity to the ForEach activity.
5. In the Activities toolbox, expand DataFlow, and drag-drop the Copy activity onto the pipeline designer
surface. Notice the breadcrumb menu at the top. IterateAndCopySQLTables is the pipeline name, and
IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch back to the
pipeline editor from the ForEach editor, click the link in the breadcrumb menu.
6. Switch to the Source tab, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset.
b. Select the Query option for Use Query.
c. Click the Query input box -> select the Add dynamic content below -> enter the following
expression for Query -> select Finish.
9. To validate the pipeline settings, click Validate on the top pipeline tool bar. Confirm that there is no
validation error. To close the Pipeline Validation Report, click >>.
Create the pipeline GetTableListAndTriggerCopyData
This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. In the left pane, click + (plus), and click Pipeline.
8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Report, click >>.
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish All on top of the
window. Wait until the publishing succeeds.
2. To view activity runs associated with the GetTableListAndTriggerCopyData pipeline, click the first link in the
Actions column for that pipeline. You should see two activity runs for this pipeline run.
3. To view the output of the Lookup activity, click the link in the Output column for that activity. You can
maximize and restore the Output window. After reviewing, click X to close the Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}
4. To switch back to the Pipeline Runs view, click the Pipelines link at the top. Click the View Activity Runs link
(the first link in the Actions column) for the IterateAndCopySQLTables pipeline. You should see output as
shown in the following image. Notice that there is one Copy activity run for each table in the Lookup
activity output.
5. Confirm that the data was copied to the target SQL Data Warehouse you used in this tutorial.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data
Factory
3/5/2019 • 11 minutes to read
This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure SQL Data
Warehouse. You can apply the same pattern in other copy scenarios as well. For example, copying tables from
SQL Server/Oracle to Azure SQL Database/Data Warehouse/Azure Blob, copying different paths from Blob to
Azure SQL Database tables.
At a high level, this tutorial involves the following steps:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure PowerShell. To learn about using other tools/SDKs to create a data factory, see
Quickstarts.
End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to SQL Data
Warehouse. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively,
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline
triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list,
it copies the table in Azure SQL Database to the corresponding table in SQL Data Warehouse by using staged
copy via Blob storage and PolyBase for the best performance. In this example, the first pipeline passes the list of
tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account. The Azure Storage account is used as staging blob storage in the bulk copy operation.
Azure SQL Database. This database contains the source data.
Azure SQL Data Warehouse. This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and SQL Data Warehouse
Prepare the source Azure SQL Database:
Create an Azure SQL Database with Adventure Works LT sample data following Create an Azure SQL database
article. This tutorial copies all the tables from this sample database to a SQL data warehouse.
Prepare the sink Azure SQL Data Warehouse:
1. If you don't have an Azure SQL Data Warehouse, see the Create a SQL Data Warehouse article for steps to
create one.
2. Create corresponding table schemas in SQL Data Warehouse. You can use Migration Utility to migrate
schema from Azure SQL Database to Azure SQL Data Warehouse. You use Azure Data Factory to
migrate/copy data in a later step.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace the placeholders with your own
values before executing the command.
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, you must be a Contributor or Administrator of the Azure
subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest
you on the following page, and then expand Analytics to locate Data Factory: Products available by
region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.)
used by data factory can be in other regions.
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}
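To deploy this linked service, you can run the Set-AzDataFactoryV2LinkedService cmdlet. A minimal sketch, assuming
the JSON above is saved as AzureSqlDatabaseLinkedService.json in the C:\ADFv2TutorialBulkCopy folder and that
$resourceGroupName and $dataFactoryName are already set (the file name and variable names are assumptions):

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "AzureSqlDatabaseLinkedService" `
    -DefinitionFile ".\AzureSqlDatabaseLinkedService.json"

The same pattern applies to the AzureSqlDWLinkedService and AzureStorageLinkedService definitions that follow.
Here is the sample output: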
LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername>, and <password> with the values for your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
}
LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService
IMPORTANT
Replace <accountName> and <accountKey> with the name and key of your Azure storage account before saving the
file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The "tableName" is a dummy value, because the copy activity later uses a SQL
query to retrieve the data.
{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}
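To create the dataset from this definition, run the Set-AzDataFactoryV2Dataset cmdlet. A minimal sketch, assuming
the same folder and variables as in the earlier steps:

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "AzureSqlDatabaseDataset" `
    -DefinitionFile ".\AzureSqlDatabaseDataset.json"

The sink dataset AzureSqlDWDataset that follows is created the same way. Here is the sample output: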
DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}
DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset
Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure SQL Data Warehouse using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:
{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from SQL database to SQL DW",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}
2. To create the pipeline IterateAndCopySQLTables, run the Set-AzDataFactoryV2Pipeline cmdlet.
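A minimal sketch of that call, assuming the pipeline JSON above is saved as IterateAndCopySQLTables.json in the
same folder:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "IterateAndCopySQLTables" `
    -DefinitionFile ".\IterateAndCopySQLTables.json"

Here is the sample output: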
PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList, Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
2. Run the following script to continuously check the run status of pipeline
GetTableListAndTriggerCopyData, and print out the final pipeline run and activity run result.
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
Write-Host "Pipeline run details:" -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}
Start-Sleep -Seconds 15
}
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. You can get the run ID of the pipeline "IterateAndCopySQLTables" from the TriggerCopy activity output, and
check the detailed activity run result as follows.
{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}
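A minimal sketch of that check, assuming you store the pipelineRunId value shown above in a variable (the variable
name and the time window are placeholders):

$childRunId = "<pipelineRunId from the TriggerCopy output>"
Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -PipelineRunId $childRunId `
    -RunStartedAfter (Get-Date).AddDays(-1) `
    -RunStartedBefore (Get-Date).AddDays(1)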
4. Connect to your sink Azure SQL Data Warehouse and confirm that data has been copied from Azure SQL
Database properly.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure SQL Data Warehouse, and Azure Storage linked services.
Create Azure SQL Database and Azure SQL Data Warehouse datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to a
destination data store
5/10/2019 • 2 minutes to read • Edit Online
In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used
scenario. The tutorials in this section show you different ways of loading data incrementally by using Azure Data
Factory.
Loading new files only by using a time-partitioned folder or file name.
You can copy only new files, where files or folders have already been time partitioned with time-slice information as
part of the file or folder name (for example, /yyyy/mm/dd/file.csv). It is the most performant approach for
incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure Blob
storage
Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/26/2019 • 13 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results.
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run.
Review results from the second run.
Overview
Here is the high-level solution diagram:
Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table as
the data source store:
In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Set the default value of the high watermark with the table name of the source data store. In this tutorial, the
table name is data_source_table.
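For example, you can run the INSERT statement from SQL Server Management Studio, or from PowerShell with
Invoke-Sqlcmd (the SqlServer module and the connection placeholders below are assumptions, not part of the
original steps); the initial value matches the output shown next:

# Requires the SqlServer PowerShell module (Install-Module SqlServer)
Invoke-Sqlcmd -ServerInstance "<servername>.database.windows.net" `
    -Database "<databasename>" -Username "<username>" -Password "<password>" `
    -Query "INSERT INTO watermarktable VALUES ('data_source_table', '1/1/2010 12:00:00 AM');"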
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
The name of the Azure data factory must be globally unique. If you see a red exclamation mark with the
following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF) and
try creating it again. See the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click the Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the General page of the Properties window for the pipeline, enter IncrementalCopyPipeline for the name.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand General,
and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the activity to
LookupOldWaterMarkActivity.
4. Switch to the Settings tab, and click + New for Source Dataset. In this step, you create a dataset to
represent data in the watermarktable. This table contains the old watermark that was used in the previous
copy operation.
5. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
the dataset.
6. In the properties window for the dataset, enter WatermarkDataset for Name.
7. Switch to the Connection tab, and click + New to make a connection (create a linked service) to your Azure
SQL database.
10. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
11. In the Activities toolbox, expand General, and drag-drop another Lookup activity to the pipeline designer
surface, and set the name to LookupNewWaterMarkActivity in the General tab of the properties
window. This Lookup activity gets the new watermark value from the table with the source data to be copied
to the destination.
12. In the properties window for the second Lookup activity, switch to the Settings tab, and click New. You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
13. In the New Dataset window, select Azure SQL Database, and click Finish. You see a new tab opened for
this dataset. You also see the dataset in the tree view.
14. In the General tab of the properties window, enter SourceDataset for Name.
16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline in
the tree view on the left. In the properties window for the Lookup activity, confirm that SourceDataset is
selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query, which selects only the maximum
value of LastModifytime from data_source_table. If you don't use this query, the dataset gets all the
rows from the table, because you specified the table name (data_source_table) in the dataset definition.
20. Select the Copy activity and confirm that you see the properties for the activity in the Properties window.
21. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.
23. In this tutorial, the sink data store is of type Azure Blob storage. Therefore, select Azure Blob Storage, and click
Finish in the New Dataset window.
24. In the General tab of the Properties window for the dataset, enter SinkDataset for Name.
25. Switch to the Connection tab, and click + New. In this step, you create a connection (linked service) to your
Azure Blob storage.
30. Select the Stored Procedure activity in the pipeline designer, and change its name to
StoredProceduretoWriteWatermarkActivity.
31. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
32. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark.
b. To specify values for the stored procedure parameters, click Import parameter, and enter the following
values for the parameters:
34. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.
Trigger a pipeline run
1. Click Trigger on the toolbar, and click Trigger Now.
2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click the Refresh
button to refresh the list.
2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000
3. Check the latest value from watermarktable. You see that the watermark value was updated.
TABLENAME WATERMARKVALUE
2. To view activity runs associated with this pipeline run, click the first link (View Activity Runs) in the
Actions column. You can go back to the previous view by clicking Pipelines at the top. Click the Refresh
button to refresh the list.
6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000
2. Check the latest value from watermarktable. You see that the watermark value was updated again.
Sample output:
TABLENAME WATERMARKVALUE
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results.
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run.
Review results from the second run.
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from an Azure SQL database
to Azure Blob storage
3/14/2019 • 13 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from a table in an Azure SQL
database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Overview
Here is the high-level solution diagram:
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Azure Storage. You use the blob storage as the sink data store. If you don't have a storage account, see Create
a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell. Follow the instructions in Install and configure Azure PowerShell.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:
In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Set the default value of the high watermark with the table name of the source data store. In this tutorial, the
table name is data_source_table.
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFIncCopyTutorialFactory";
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names must
be globally unique.
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database>; Persist
Security Info=False; User ID=<user> ; Password=<password>; MultipleActiveResultSets = False; Encrypt =
True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
ProvisioningState :
Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different name.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.
DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the container if
it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is automatically
created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by using the
expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .
DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.
DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from data_source_table"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": { "value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"String"}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity, IncrementalCopyActivity,
StoredProceduretoWriteWatermarkActivity}
Parameters :
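The run itself is started with the Invoke-AzDataFactoryV2Pipeline cmdlet, which returns a run ID that you use for
monitoring. A minimal sketch, assuming $resourceGroupName and $dataFactoryName are still set:

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" `
    -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName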
2. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
that all the activities ran successfully. Replace the placeholders with times appropriate for your run for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}
2. Check the latest value from watermarktable. You see that the watermark value was updated.
TABLENAME WATERMARKVALUE
Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).
3. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
that all the activities ran successfully. Replace the placeholders with times appropriate for your run for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}
4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in it.
5. Check the latest value from watermarktable. You see that the watermark value was updated again.
TABLENAME WATERMARKVALUE
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in a SQL database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in an on-premises SQL Server database to a SQL
database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL Server to an
Azure SQL database
4/14/2019 • 17 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-premises SQL Server to an
Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Overview
Here are the important steps to create this solution:
1. Select the watermark column.
Select one column for each table in the source data store, which can be used to identify the new or updated records for every run.
Normally, the data in this selected column (for example, last_modify_time or ID ) keeps increasing when rows are created or updated. The
maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter to the pipeline. For each source
table, it invokes the following activities to perform delta loading for that table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the second Lookup activity to
retrieve the new watermark value. These watermark values are passed to the Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark column greater than the old
watermark value and less than the new watermark value. Then, it copies the delta data from the source data store to Azure Blob storage
as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see Create an Azure SQL database
for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and project_table :
Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
Create data types and additional stored procedures in Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used to merge the data from source
tables into destination tables.
To keep the walkthrough simple, these stored procedures receive the delta data through a table variable and then merge it into the
destination store. Be aware that this approach is not intended for a large number of delta rows (more than 100) in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a copy activity to copy all the
delta data into a temporary staging table in the destination store first, and then build your own stored procedure, without a table
variable, to merge the data from the staging table into the final table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
GO
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
The name of the Azure data factory must be globally unique. If you receive the following error, change the name of the data factory (for
example, yournameADFMultiIncCopyTutorialDF) and try creating it again. See the Data Factory - Naming Rules article for naming rules for
Data Factory artifacts.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down list. The data stores (Azure
Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click the Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page of the Azure Data Factory UI, click Create pipeline or switch to the Edit tab.
Create self-hosted integration runtime
As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a self-hosted integration runtime (IR)
in your on-premises environment. The self-hosted IR moves data between your private network and Azure.
1. Click Connections at the bottom of the left pane, and switch to the Integration Runtimes in the Connections window.
3. In the Integration Runtime Setup window, select Perform data movement and dispatch activities to external computes, and
click Next.
4. Select Private Network, and click Next.
8. In the Web browser, in the Integration Runtime Setup window, click Finish.
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your Azure SQL server from the drop-down list.
c. For Database name, select the Azure SQL database in which you created customer_table and project_table as part of the
prerequisites.
d. For User name, enter the name of user that has access to the Azure SQL database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection
succeeds.
g. To save the linked service, click Save.
4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for Linked service. You do not select a
table here. The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
Create a sink dataset
1. In the left pane, click + (plus), and click Dataset.
2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the treeview. In the General tab of the
Properties window at the bottom, enter SinkDataset for Name.
4. Switch to the Parameters tab in the Properties window, and do the following steps:
a. Click New in the Create/update parameters section.
b. Enter SinkTableName for the name, and String for the type. This dataset takes SinkTableName as a parameter. The
SinkTableName parameter is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a list
of table names and passes the table name to this dataset in each iteration.
5. Switch to the Connection tab in the Properties window, and select AzureSqlDatabaseLinkedService for Linked service. For the Table property,
click Add dynamic content.
2. In the New Dataset window, select Azure SQL Database, and click Finish.
3. In the General tab of the Properties window at the bottom, enter WatermarkDataset for Name.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[watermarktable] for Table.
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names and performs the following
operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next iteration.
Create the pipeline
1. In the left pane, click + (plus), and click Pipeline.
2. In the General tab of the Properties window, enter IncrementalCopyPipeline for Name.
4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop the ForEach activity to the pipeline designer surface. In the
General tab of the Properties window, enter IterateSQLTables.
5. Switch to the Settings tab in the Properties window, and enter @pipeline().parameters.tableList for Items. The ForEach activity
iterates through a list of tables and performs the incremental copy operation.
6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General, drag-drop the Lookup activity to the pipeline designer surface, and enter
LookupOldWaterMarkActivity for Name.
8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for Name.
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box attached to the Lookup activity
and drop it on the Copy activity. Release the mouse button when the border color of the Copy activity changes to blue.
13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
17. Select the Stored Procedure activity in the pipeline, and enter StoredProceduretoWriteWatermarkActivity for Name in the General
tab of the Properties window.
18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.
19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark .
b. Select Import parameter.
c. Specify the following values for the parameters:
20. In the left pane, click Publish. This action publishes the entities you created to the Data Factory service.
21. Wait until you see the Successfully published message. To see the notifications, click the Show Notifications link. Close the
notifications window by clicking X.
Run the pipeline
1. On the toolbar for the pipeline, click Trigger, and click Trigger Now.
2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
2. Click View Activity Runs link in the Actions column. You see all the activity runs associated with the selected pipeline run.
Review the results
In SQL Server Management Studio, run the following queries against the target SQL database to verify that the data was copied from source
tables to destination tables:
Query
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
Add more data to the source tables
Run the following query against the source SQL Server database to update an existing row in customer_table. Insert a new row into
project_table.
UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3
3. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for PersonID number 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to an Azure SQL database
4/8/2019 • 18 minutes to read • Edit Online
In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in on-
premises SQL Server to an Azure SQL database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Overview
Here are the important steps to create this solution:
1. Select the watermark column. Select one column for each table in the source data store, which can be
used to identify the new or updated records for every run. Normally, the data in this selected column (for
example, last_modify_time or ID ) keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
2. Prepare a data store to store the watermark value.
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities:
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Azure Blob storage as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use an on-premises SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a SQL database as the sink data store. If you don't have a SQL database, see
Create an Azure SQL database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your on-premises SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
Create another table in the Azure SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
Create data types and additional stored procedures in the Azure SQL database
Run the following query to create two stored procedures and two data types in your SQL database. They're used
to merge the data from source tables into destination tables.
To keep the walkthrough simple, these stored procedures receive the delta data through a table variable and then
merge it into the destination store. Be aware that this approach is not intended for a large number of delta rows
(more than 100) in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a copy
activity to copy all the delta data into a temporary staging table in the destination store first, and then build your
own stored procedure, without a table variable, to merge the data from the staging table into the final table.
GO
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
GO
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.
$dataFactoryName = "ADFIncMultiCopyTutorialFactory";
The specified Data Factory name 'ADFIncMultiCopyTutorialFactory' is already in use. Data Factory names
must be globally unique.
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, SQL Database, etc.) and computes (Azure HDInsight, etc.) used by the data
factory can be in other regions.
$integrationRuntimeName = "ADFTutorialIR"
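A minimal sketch of creating the self-hosted integration runtime with that name, using the Az module cmdlet and the
resource group and data factory variables from the earlier steps (the description text is a placeholder):

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName `
    -Type SelfHosted `
    -Description "self-hosted IR for the on-premises SQL Server source"

Here is the sample output: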
Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/inte
grationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description :
3. To retrieve the status of the created integration runtime, run the following command. Confirm that the value
of the State property is set to NeedRegistration.
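A minimal sketch of that status check, using the -Status switch of Get-AzDataFactoryV2IntegrationRuntime:

Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName `
    -Status

Here is the sample output: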
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:
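A minimal sketch of that command, using Get-AzDataFactoryV2IntegrationRuntimeKey with the same variables:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName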
5. Copy one of the keys (exclude the double quotation marks) used to register the self-hosted integration
runtime that you install on your machine in the following steps.
11. When the self-hosted integration runtime is registered successfully, you see the following message:
12. On the New Integration Runtime (Self-hosted) Node page, select Next.
13. On the Intranet Communication Channel page, select Skip. Select a TLS/SSL certificate only if you want to secure
intranode communication in a multinode integration runtime environment.
14. On the Register Integration Runtime (Self-hosted) page, select Launch Configuration Manager.
15. When the node is connected to the cloud service, you see the following page:
NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=<server>;Database=<database>;Integrated Security=True"
},
"userName": "<user> or <domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
}
},
"name": "SqlServerLinkedService"
}
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a backslash character (\) in your user account or server name, use the escape character (\\). An
example is mydomain\\myuser.
LinkedServiceName : SqlServerLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": "dummyName"
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService",
"type": "LinkedServiceReference"
}
}
}
The table name is a dummy name. The Copy activity in the pipeline uses a SQL query to load the data
rather than load the entire table.
2. Run the Set-AzureRmDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.
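A minimal sketch of that cmdlet, assuming SourceDataset.json is in the current folder:

# Sketch: create the SourceDataset dataset from the JSON file created in step 1.
Set-AzureRmDataFactoryV2Dataset -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "SourceDataset" `
    -DefinitionFile ".\SourceDataset.json"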
DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset
{
"name": "SinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@{dataset().SinkTableName}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"SinkTableName": {
"type": "String"
}
}
}
}
DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : WatermarkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : <data factory name>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names
and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the
last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column
in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the
destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next
iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following content:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable where TableName = '@{item().TABLE_NAME}'"
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from
@{item().TABLE_NAME}"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "SqlSink",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterTableType": "@{item().TableType}",
"SqlWriterStoredProcedureName": "@{item().StoredProcedureNameForMergeOperation}"
}
},
"dependsOn": [{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "@{item().TableType}"
}
}]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "datetime"
},
"TableName": {
"value": "@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}]
}
}
}],
"parameters": {
"tableList": {
"type": "Object"
"type": "Object"
}
}
}
}
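The cmdlet that deploys this pipeline is not shown in this extract. A minimal sketch, assuming the JSON above was saved as IncrementalCopyPipeline.json:

# Sketch: deploy the pipeline from its JSON definition file.
Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "IncrementalCopyPipeline" `
    -DefinitionFile ".\IncrementalCopyPipeline.json"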
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFIncMultiCopyTutorialFactory1201
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}
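The command that starts the pipeline run is not shown in this extract. A minimal sketch, assuming the table-list JSON above was saved as Parameters.json (the file name is an assumption):

# Sketch: run the pipeline, passing the list of tables as a parameter file.
$runId = Invoke-AzureRmDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -PipelineName "IncrementalCopyPipeline" `
    -ParameterFile ".\Parameters.json"  # returns the pipeline run ID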
5. The Data Integration Application opens in a separate tab. You can see all the pipeline runs and their
status. Notice that in the following example, the status of the pipeline run is Succeeded. To check
parameters passed to the pipeline, select the link in the Parameters column. If an error occurred, you see a
link in the Error column. Select the link in the Actions column.
6. When you select the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline:
7. To go back to the Pipeline Runs view, select Pipelines as shown in the image.
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3
2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. Because the
pipeline status is In Progress, you see another action link under Actions to cancel the pipeline run.
3. Select Refresh to refresh the list until the pipeline run succeeds.
4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR ).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn how to incrementally load data by using Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/26/2019 • 15 minutes to read
In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can be easily sliced out (for example, by a
LastModifyTime or CreationTime column). In other cases, there is no explicit way to identify the delta data since the last time you
processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database as
the source data store. You can also use an on-premises SQL Server.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose the New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.
create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
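The exact statements are not included in this extract. The following is only a sketch of the change tracking setup described in the note below (two-day retention, tracking on data_source_table), wrapped in the SqlServer module's Invoke-Sqlcmd so that it can also be run from PowerShell instead of SSMS; replace the placeholders with your own server, database, and credentials:

# Sketch: enable change tracking on the database and on data_source_table.
# The placeholders and the two-day retention value are assumptions based on the note that follows.
$enableChangeTracking = @"
ALTER DATABASE [<your database name>]
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
"@
Invoke-Sqlcmd -ServerInstance "<your server>.database.windows.net" -Database "<your database name>" `
    -Username "<user>" -Password "<password>" -Query $enableChangeTracking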
NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days or more,
some changed data is not included. Either change the value of CHANGE_RETENTION to a bigger number, or ensure that
your period to load the changed data is within two days. For more information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change tracking
version is 0.
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.
-- Stored procedure used by the pipeline; the parameter types shown here are assumed, adjust them to match your schema.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(255)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 (Preview) for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
13. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL Database linked service.
In this step, you link your Azure SQL database to the data factory.
1. Click Connections, and click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your Azure SQL server for the Server name field.
c. Select your Azure SQL database for the Database name field.
d. Enter name of the user for the User name field.
e. Enter password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus), and click Dataset.
2. Select Azure SQL Database, and click Finish.
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table.
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow, and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity.
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.
5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish. Wait until the publishing
succeeds.
9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X.
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now.
The file should have the data from the Azure SQL database:
1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the Properties
window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity. This activity gets the
change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version.
4. Switch to the Settings in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.
5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity. This activity gets the current
change tracking version.
6. Switch to the Settings in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID,data_source_table.Name,data_source_table.Age,
CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN
CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT
on data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVer
sion}
9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached to the
Lookup activity to the Copy activity.
11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface.
Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity. This activity updates
the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify the following values for the parameters:
15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.
16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the Publish
All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now.
The file should have only the delta data from the Azure SQL database. The record with U is the updated row in
the database, and the record with I is the newly inserted row.
1,update,10,2,U
6,new,50,1,I
The first three columns are changed data from data_source_table. The last two columns are the metadata from
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database to
Azure Blob Storage using change tracking
information
3/5/2019 • 14 minutes to read
In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source Azure SQL database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In some
cases, the changed data within a period in your source data store can be easily sliced out (for example, by a
LastModifyTime or CreationTime column). In other cases, there is no explicit way to identify the delta data since the last time you
processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database and
SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with SQL
Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob Storage.
For more concrete information about SQL Change Tracking technology, see Change tracking in SQL Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL Database
as the source data store. You can also use an on-premises SQL Server.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database. You use the database as the source data store. If you don't have an Azure SQL
Database, see the Create an Azure SQL database article for steps to create one.
Azure Storage account. You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named adftutorial.
Create a data source table in your Azure SQL database
1. Launch SQL Server Management Studio, and connect to your Azure SQL server.
2. In Server Explorer, right-click your database and choose the New Query.
3. Run the following SQL command against your Azure SQL database to create a table named
data_source_table as data source store.
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
NOTE
Replace <your database name> with the name of your Azure SQL database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three days or more,
some changed data is not included. Either change the value of CHANGE_RETENTION to a bigger number, or ensure that
your period to load the changed data is within two days. For more information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
6. Run the following query to create a stored procedure in your Azure SQL database. The pipeline invokes this
stored procedure to update the change tracking version in the table you created in the previous step.
-- Stored procedure used by the pipeline; the parameter types shown here are assumed, adjust them to match your schema.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(255)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.
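The command that creates the resource group is not shown in this extract. A minimal sketch (the location is only an example):

# Sketch: create the resource group that holds the data factory.
New-AzResourceGroup -Name $resourceGroupName -Location "East US"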
IMPORTANT
Update the data factory name to be globally unique.
$dataFactoryName = "IncCopyChgTrackingDF";
The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on
the following page, and then expand Analytics to locate Data Factory: Products available by region. The
data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory
can be in other regions.
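The command that creates the data factory is not shown in this extract. A minimal sketch (again, the location is only an example; pick a region where Data Factory is available):

# Sketch: create the data factory.
$dataFactory = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName `
    -Location "East US" `
    -Name $dataFactoryName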
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>",
"type": "SecureString"
}
}
}
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"value": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=<database name>; Persist
Security Info=False; User ID=<user name>; Password=<password>; MultipleActiveResultSets = False;
Encrypt = True; TrustServerCertificate = False; Connection Timeout = 30;",
"type": "SecureString"
}
}
}
}
2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSQLDatabaseLinkedService.
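A minimal sketch of that cmdlet, assuming the JSON above was saved as AzureSQLDatabaseLinkedService.json (the file name is an assumption):

# Sketch: deploy the Azure SQL Database linked service from its JSON definition file.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "AzureSQLDatabaseLinkedService" `
    -DefinitionFile ".\AzureSQLDatabaseLinkedService.json"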
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incchgtracking",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
You create the adftutorial container in your Azure Blob storage as part of the prerequisites. Create the
container if it does not exist, or set the name to that of an existing container. In this tutorial, the output file name is
dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset
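A minimal sketch of that cmdlet, assuming the JSON above was saved as SinkDataset.json (the file name is an assumption):

# Sketch: create the SinkDataset dataset from its JSON definition file.
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "SinkDataset" `
    -DefinitionFile ".\SinkDataset.json"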
DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}
PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :
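The command that triggers the full copy pipeline is not shown in this extract. A minimal sketch:

# Sketch: start a run of the full copy pipeline and capture the run ID.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -PipelineName "FullCopyPipeline"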
5. The Data Integration Application launches in a separate tab. You can see all the pipeline runs and their
statuses. Notice that in the following example, the status of the pipeline run is Succeeded. You can check
parameters passed to the pipeline by clicking the link in the Parameters column. If there was an error, you see
a link in the Error column. Click the link in the Actions column.
6. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.
7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.
The file should have the data from the Azure SQL database:
1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},
"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
", "type": "INT64" },
"TableName": {
"value":"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type":"String"}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
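The commands that deploy and run this pipeline are not shown in this extract. A minimal sketch, assuming the JSON above was saved as IncrementalCopyPipeline.json (the file name is an assumption):

# Sketch: deploy the incremental copy pipeline and then start a run of it.
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -Name "IncrementalCopyPipeline" `
    -DefinitionFile ".\IncrementalCopyPipeline.json"

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName `
    -PipelineName "IncrementalCopyPipeline"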
2. When you click the link in the Actions column, you see the following page that shows all the activity runs
for the pipeline.
3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.
The file should have only the delta data from the Azure SQL database. The record with U is the updated row in
the database, and the record with I is the newly inserted row.
1,update,10,2,U
6,new,50,1,I
The first three columns are changed data from data_source_table. The last two columns are the metadata from
change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
5/10/2019 • 5 minutes to read
In this tutorial, you'll use the Azure portal to create a data factory. Then, you'll use the Copy Data tool to create a
pipeline that incrementally copies new and changed files only, based on their LastModifiedDate from Azure Blob
storage to Azure Blob storage.
With this approach, ADF scans all the files in the source store, filters them by their LastModifiedDate, and copies to the
destination store only the files that are new or have been updated since the last run. Note that if ADF scans a huge
number of files but copies only a few of them to the destination, the run can still take a long time, because the file
scanning itself is time consuming.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. You can use various tools to perform this task, such as Azure Storage
Explorer.
2. Create a container named destination.
The name for your data factory must be globally unique. You might receive the following error message:
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which you'll create the new data factory.
4. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that your data factory uses can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, refer to the Deploying Data Factory tile to see the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list
and then select Finish.
d. Select the newly created linked service and then select Next.
4. On the Choose the input file or folder page, complete the following steps:
a. Browse and select the source folder, and then select Choose.
b. Under File loading behavior, select Incremental load: LastModifiedDate.
6. On the Choose the output file or folder page, complete the following steps:
a. Browse and select the destination folder, and then select Choose.
b. Select Next.
11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, select the Details link (eyeglasses icon) in the Actions column.
Because there is no file in the source container in your Blob storage account, you will not see any file
copied to the destination container in your Blob storage account.
12. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage
account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically.
14. Select View Activity Run for the second pipeline run when you see it. Then review the details in the same
way you did for the first pipeline run.
You will see that one file (file1.txt) has been copied from the source container to the destination container
of your Blob storage account.
15. Create another empty text file and name it file2.txt. Upload this text file to the source container in your
Blob storage account.
16. Repeat steps 13 and 14 for this second text file. You will see that only the new file (file2.txt) has been copied
from the source container to the destination container of your storage account in the next pipeline run.
You can also verify this by using Azure Storage Explorer to scan the files.
Next steps
Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
3/26/2019 • 5 minutes to read
In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that incrementally copies new files based on time partitioned file name from Azure Blob storage to Azure
Blob storage.
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure storage account: Use Blob storage as the source and sink data store. If you don't have an Azure storage
account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source. Create a folder path as 2019/02/26/14 in your container. Create an
empty text file, and name it as file1.txt. Upload the file1.txt to the folder path source/2019/02/26/14 in
your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
NOTE
Please adjust the folder name with your UTC time. For example, if the current UTC time is 2:03 PM on Feb 26th,
2019, you can create the folder path as source/2019/02/26/14/ by the rule of
source/{Year}/{Month}/{Day}/{Hour}/.
2. Create a container named destination. You can use various tools to perform these tasks, such as Azure
Storage Explorer.
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version, select V2 for the version.
6. Under location, select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Pin to dashboard.
8. Select Create.
9. On the dashboard, the Deploying Data Factory tile shows the process status.
10. After creation is finished, the Data Factory home page is displayed.
11. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.
c. On the New Linked Service page, select your storage account from the Storage account name list,
and then click Finish.
b. Under File loading behavior, select Incremental load: time-partitioned folder/file names.
c. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as
follows:
d. Check Binary copy and click Next.
5. On the Destination data store page, select the AzureBlobStorage, which is the same storage account as
data source store, and then click Next.
6. On the Choose the output file or folder page, do the following steps:
a. Browse and select the destination folder, then click Choose.
b. Write the dynamic folder path as destination/{year}/{month}/{day}/{hour}/, and change the format as
follows:
c. Click Next.
11. There's only one activity (copy activity) in the pipeline, so you see only one entry. You can see the source file
(file1.txt) has been copied from source/2019/02/26/14/ to destination/2019/02/26/14/ with the same
file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files.
12. Create another empty text file with the new name as file2.txt. Upload the file2.txt file to the folder path
source/2019/02/26/15 in your storage account. You can use various tools to perform these tasks, such as
Azure Storage Explorer.
NOTE
Note that a new folder path is required. Adjust the folder name based on your UTC
time. For example, if the current UTC time is 3:20 PM on Feb 26th, 2019, you can create the folder path as
source/2019/02/26/15/ by the rule of {Year}/{Month}/{Day}/{Hour}/.
13. To go back to the Pipeline Runs view, select All Pipeline Runs, and wait for the same pipeline to be
triggered again automatically after another hour.
14. When the second pipeline run appears, select View Activity Run for it, and review the details in the same way.
You can see the source file (file2.txt) has been copied from source/2019/02/26/15/ to
destination/2019/02/26/15/ with the same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files
in the destination container.
Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Transform data in the cloud by using a Spark activity
in Azure Data Factory
3/7/2019 • 7 minutes to read
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by
using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure storage account. You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a premium
or Blob-only storage account.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add
from pyspark.sql import SparkSession
def main():
spark = SparkSession\
.builder\
.appName("PythonWordCount")\
.getOrCreate()
lines =
spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecr
aftstory.txt").rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)
counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles
/wordcount")
spark.stop()
if __name__ == "__main__":
main()
2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.
3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureStorage1. You created this linked service earlier. If you
used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an HDInsight
cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data factory.
The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its primary
storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureStorage1.
b. Select Browse Storage.
c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow ) button to
close the validation window.
6. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.
You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service.
You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual network,
advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity in
Azure Data Factory
3/7/2019 • 7 minutes to read
In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Spark
Activity and an on-demand HDInsight linked service. You perform the following steps in this tutorial:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a Python script and an input file, and upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add
def main():
spark = SparkSession\
.builder\
.appName("PythonWordCount")\
.getOrCreate()
lines =
spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecr
aftstory.txt").rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)
counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles
/wordcount")
spark.stop()
if __name__ == "__main__":
main()
2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in this
text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder. If you prefer to script the uploads in this section, see the sketch that follows.
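If you script the uploads, the following is a minimal sketch that uses the Az.Storage cmdlets. The storage account
name and key are placeholders; the container, folder, and file names match the ones used above.
# Placeholders: supply your own storage account name and key.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
# Create the adftutorial container if it does not exist yet, then upload the script and the input file.
New-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File ".\WordCount_Spark.py" -Container "adftutorial" `
    -Blob "spark/script/WordCount_Spark.py" -Context $ctx
Set-AzStorageBlobContent -File ".\minecraftstory.txt" -Container "adftutorial" `
    -Blob "spark/inputfiles/minecraftstory.txt" -Context $ctx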
Author linked services
You author two linked services in this section: an Azure Storage linked service (MyStorageLinkedService) that links
your Azure Storage account to the data factory, and an on-demand HDInsight linked service
(MyOnDemandSparkLinkedService). In the Azure Storage linked service definition, update the <storageAccountName>
and <storageAccountKey> placeholders with the name and key of your Azure Storage account.
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.
{
"name": "MyOnDemandSparkLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 2,
"clusterType": "spark",
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscriptionID> ",
"servicePrincipalId": "<servicePrincipalID>",
"servicePrincipalKey": {
"value": "<servicePrincipalKey>",
"type": "SecureString"
},
"tenant": "<tenant ID>",
"clusterResourceGroup": "<resourceGroupofHDICluster>",
"version": "3.6",
"osType": "Linux",
"clusterNamePrefix":"ADFSparkSample",
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
Update values for the following properties in the linked service definition:
hostSubscriptionId. Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID
and key of your service principal in Azure Active Directory. This service principal must have the Contributor role
on the subscription or on the resource group in which the cluster is created. See Create an Azure Active
Directory application and service principal for details. (A PowerShell sketch follows this list.)
clusterResourceGroup. Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
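If you don't have a service principal yet, the following is a minimal sketch of creating one and granting it the
Contributor role on the resource group that will host the on-demand cluster. The display name is a placeholder, and
the application key (secret) that goes into servicePrincipalKey comes from the app registration in Azure Active
Directory.
# Create a service principal (the display name is a placeholder).
$sp = New-AzADServicePrincipal -DisplayName "ADFSparkTutorialSP"
# Grant it Contributor on the resource group that hosts the on-demand HDInsight cluster.
New-AzRoleAssignment -ObjectId $sp.Id -RoleDefinitionName "Contributor" `
    -ResourceGroupName "<resourceGroupOfHDICluster>"
# Use $sp.ApplicationId as the servicePrincipalId value in the linked service definition.
$sp.ApplicationId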
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-demand
HDInsight linked service, the HDInsight cluster is created in the same location as the Azure Storage account that's used as its primary
storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more information, see Set
up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following JSON definition of a pipeline definition, and save it
as MySparkOnDemandPipeline.json.
{
"name": "MySparkOnDemandPipeline",
"properties": {
"activities": [
{
"name": "MySparkActivity",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyOnDemandSparkLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"rootPath": "adftutorial/spark",
"entryFilePath": "script/WordCount_Spark.py",
"getDebugInfo": "Failure",
"sparkJobLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
$resourceGroupName = "ADFTutorialResourceGroup"
$dataFactoryName = "<SparkTutorialDataFactory>" # Specify a globally unique name for the data factory
$pipelineName = "MySparkOnDemandPipeline" # Pipeline name
2. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
Select-AzSubscription -SubscriptionId "<SubscriptionId>"
4. Create the data factory, and store the returned Factory object in the $df variable for later use.
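A minimal sketch of this step, assuming the variables set in step 1. East US is only an example; pick any region in
which Data Factory is available.
# Create the data factory and keep the returned Factory object in $df.
# $dataFactoryName must be globally unique.
$df = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName `
    -Location "East US" -Name $dataFactoryName
$df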
5. Switch to the folder where you created the JSON files, and deploy the Azure Storage linked service, the
on-demand HDInsight linked service, and the pipeline, and then start a pipeline run.
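A minimal sketch of these commands, assuming the JSON file names used earlier in this tutorial. The last command
stores the run ID in $runId, which the monitoring script in the next step uses.
# Deploy the Azure Storage and on-demand HDInsight (Spark) linked services.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyStorageLinkedService" -DefinitionFile ".\MyStorageLinkedService.json"
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyOnDemandSparkLinkedService" -DefinitionFile ".\MyOnDemandSparkLinkedService.json"
# Deploy the pipeline and start a run.
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name $pipelineName -DefinitionFile ".\MySparkOnDemandPipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -PipelineName $pipelineName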
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
        -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}
4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container with the
output from the Spark program.
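To check for the output from PowerShell instead of the portal, a sketch that reuses the storage context ($ctx) from
the upload sketch earlier in this tutorial:
# List the word-count output blobs written by the Spark job.
Get-AzStorageBlob -Container "adftutorial" -Prefix "spark/outputfiles/wordcount" -Context $ctx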
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked service. You
learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running a Hive script on an Azure HDInsight cluster
that is in a virtual network.
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
5/22/2019 • 5 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks
notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks
notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Databricks workspace. Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it using
Azure Data Factory.
2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.
4. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureDatabricks_LinkedService
b. Select the appropriate Databricks workspace that you will run your notebook in
c. For Select cluster, select New job cluster
d. For Domain/ Region, info should auto-populate
e. For Access Token, generate it from the Azure Databricks workspace. You can find the steps here.
f. For Cluster version, select 4.2 (with Apache Spark 2.3.1, Scala 2.11)
g. For Cluster node type, select Standard_D3_v2 under the General Purpose (HDD) category for this
tutorial.
h. For Workers, enter 2.
i. Select Finish
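The preceding steps create the linked service from the authoring UI. If you prefer to script it instead, the following
sketch deploys an equivalent definition with Azure PowerShell. The data factory name, resource group, Databricks
domain URL, access token, and cluster settings shown here are placeholders and example values, not values taken
from this tutorial.
# All values below are placeholders or examples; replace them with your own.
$definition = @'
{
    "name": "AzureDatabricks_LinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": { "type": "SecureString", "value": "<access token>" },
            "newClusterVersion": "4.2.x-scala2.11",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_D3_v2"
        }
    }
}
'@
Set-Content -Path ".\AzureDatabricks_LinkedService.json" -Value $definition
Set-AzDataFactoryV2LinkedService -DataFactoryName "<yourDataFactory>" -ResourceGroupName "<yourResourceGroup>" `
    -Name "AzureDatabricks_LinkedService" -DefinitionFile ".\AzureDatabricks_LinkedService.json"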
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the Pipeline. Later you pass this parameter to the Databricks Notebook
Activity. In the empty pipeline, click on the Parameters tab, then New and name it as 'name'.
3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox to
the pipeline designer surface.
4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks Notebook path. Let's create a notebook and specify the path here. You get
the Notebook Path by following the next few steps.
a. Launch your Azure Databricks Workspace
b. Create a New Folder in Workspace and call it adftutorial.
c. Create a new notebook (Python); let's call it mynotebook under the adftutorial folder. Click Create.
d. In the newly created notebook "mynotebook", add the following code:
dbutils.widgets.text("input", "","")
dbutils.widgets.get("input")
y = getArgument("input")
print ("Param -\'input':")
print (y)
7. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now.
The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
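You can also trigger the same run from Azure PowerShell and pass the parameter there. A sketch, with the data
factory, resource group, and pipeline names as placeholders:
# Pass the 'name' parameter that the pipeline forwards to the Databricks notebook.
Invoke-AzDataFactoryV2Pipeline -DataFactoryName "<yourDataFactory>" -ResourceGroupName "<yourResourceGroup>" `
    -PipelineName "<yourPipelineName>" -Parameter @{ "name" = "/path/filename" }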
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to create
a Databricks job cluster, where the notebook is executed.
You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
Verify the output
You can log on to your Azure Databricks workspace, go to Clusters, and see the job status: pending
execution, running, or terminated.
You can click the job name and navigate to see further details. On a successful run, you can validate the
parameters passed and the output of the Python notebook.
Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned how
to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
3/15/2019 • 9 minutes to read • Edit Online
In this tutorial, you use the Azure portal to create a Data Factory pipeline that transforms data by using a Hive activity on an
HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure Virtual Network. Here is a sample configuration of
Azure Virtual Network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network that you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
A virtual machine. Create an Azure virtual machine (VM) and join it to the same virtual network that
contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT clientid, market, devicemodel, state
FROM hivesampletable;
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories are
shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the get started page, switch to the Edit tab in the left panel as shown in the following image:
Create a self-hosted integration runtime
As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR ) in the
same virtual network. In this section, you create a new VM, join it to the same virtual network, and install self-
hosted IR on it. The self-hosted IR allows Data Factory service to dispatch processing requests to a compute
service such as HDInsight inside a virtual network. It also allows you to move data between data stores inside a
virtual network and Azure. You use a self-hosted IR when the data store or compute is in an on-premises
environment as well.
1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration
Runtimes tab, and click + New button on the toolbar.
2. In the Integration Runtime Setup window, select the Perform data movement and dispatch activities to
external computes option, and click Next.
3. Select Private Network, and click Next.
4. Enter MySelfHostedIR for Name, and click Next.
5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep the
window open. You use this key to register the IR installed in a virtual machine.
Install IR on a virtual machine
1. On the Azure VM, download self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.
3. Click Launch Configuration Manager. You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI, you should see the name of the self-hosted IR node (the VM) and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes a Hive script to return data from a
sample table and save it to a path that you defined.
Note the following points:
scriptPath points to the path of the Hive script on the Azure Storage account that you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline.
2. In the Activities toolbox, expand HDInsight, and drag-drop Hive activity to the pipeline designer surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service.
4. You see only one activity run because the pipeline has only one activity, of type HDInsightHive. To
switch back to the previous view, click the Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about branching and chaining control flow in Data Factory:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
4/11/2019 • 9 minutes to read • Edit Online
In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data by using a Hive activity
on an HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run
Verify the output.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the
Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Virtual Network. If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure Virtual Network. Here is a sample configuration of
Azure Virtual Network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network that you created in the previous
step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample
configuration of HDInsight in a virtual network.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the same content that is shown in the preceding tutorial.
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts.
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use an
existing resource group if you like.
$resourceGroupName = "ADFTutorialResourceGroup"
$dataFactoryName = "MyDataFactory09142017"
$pipelineName = "MyHivePipeline" # Pipeline name
4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when the
Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.
$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"
5. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently available,
select the regions that interest you on the following page, and then expand Analytics to locate Data
Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and
computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace SubscriptionId
with the ID of your Azure subscription:
Select-AzSubscription -SubscriptionId "<SubscriptionId>"
6. Create the resource group ADFTutorialResourceGroup if it does not already exist in your subscription, and then create the data factory.
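A minimal sketch of these steps, assuming the variables set earlier in this tutorial; East US is only an example
region.
# Create the resource group (skip this if ADFTutorialResourceGroup already exists).
New-AzResourceGroup -Name $resourceGroupName -Location "East US"
# Create the data factory. The name must be globally unique.
$df = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName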
Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network where your HDInsight cluster is in.
1. Create the self-hosted integration runtime. Use a unique name if another integration runtime with the
same name already exists.
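A minimal sketch of this step, assuming the variables set earlier in this tutorial. The second command retrieves the
authentication keys and produces output like the JSON that follows.
# Create (or update) the self-hosted integration runtime in the data factory.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName -Type SelfHosted -Description "Self-hosted IR for the Hive tutorial"
# Retrieve the authentication keys used to register IR nodes.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json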
{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}
You see the following page when the node is connected to the cloud service:
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep the
Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account.
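To deploy this linked service with the Az.DataFactory cmdlets, a minimal sketch that assumes the variables set
earlier in this tutorial and the file name used above:
# Deploy the Azure Storage linked service from the JSON definition file.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyStorageLinkedService" -DefinitionFile ".\MyStorageLinkedService.json"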
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json.
{
"name": "MyHDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<clustername>.azurehdinsight.net",
"userName": "<username>",
"password": {
"value": "<password>",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Update values for the following properties in the linked service definition:
userName. Name of the cluster login user that you specified when creating the cluster.
password. The password for the user.
clusterUri. Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net . This article assumes that you have access to the cluster over the
internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net . This
address uses the public gateway, which is not available if you have used network security groups (NSGs) or
user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such a
way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From the Azure portal, open the virtual network that the HDInsight cluster is in. Open the network interface with
the name starting with nic-gateway-0. Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has a DNS server, update the DNS record so that the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15. This is the recommended
approach. If you don't have a DNS server in your Azure Virtual Network, you can temporarily work
around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all VMs that are registered as
self-hosted integration runtime nodes, adding an entry like this:
10.6.0.15 myHDIClusterName.azurehdinsight.net
Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes a Hive script to return data from a
sample table and save it to a path that you defined. Create a JSON file in your preferred editor, copy the following
JSON definition of a pipeline, and save it as MyHivePipeline.json.
{
"name": "MyHivePipeline",
"properties": {
"activities": [
{
"name": "MyHiveActivity",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDILinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
"getDebugInfo": "Failure",
"defines": {
"Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
},
"scriptLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
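A minimal sketch that deploys the HDInsight linked service and the pipeline and then starts a run, assuming the
JSON file names used in this tutorial. The run ID is stored in $runId, which the monitoring script in the next step
uses.
# Deploy the HDInsight linked service and the pipeline.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyHDInsightLinkedService" -DefinitionFile ".\MyHDInsightLinkedService.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name $pipelineName -DefinitionFile ".\MyHivePipeline.json"
# Start a pipeline run and capture the run ID for monitoring.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -PipelineName $pipelineName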
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
        -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if(!$result) {
        Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
    }
    elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
    }
    else {
        Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
    ($result | Format-List | Out-String)
    Start-Sleep -Seconds 15
}
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. Check the outputfolder folder for a new file that contains the Hive query result.
Branching and chaining activities in a Data Factory pipeline
In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If the
copy activity succeeds, the pipeline sends details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, the pipeline sends details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:
Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John,Doe
Jane,Doe
For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
The Request in the Logic App Designer should look like the following image:
For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:
The name of the Azure data factory must be globally unique. If you receive the following error, change the
name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See Data
Factory - Naming Rules article for naming rules for Data Factory artifacts.
`Data factory name “ADFTutorialDataFactory” is not available`
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down
list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the get started page of Data Factory UI, click the Create pipeline tile.
2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add the
following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.
3. In the Activities toolbox, expand Data Flow, and drag-drop Copy activity to the pipeline designer surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click + New.
You create a source dataset for the copy activity in this step.
5. In the New Dataset window, select Azure Blob Storage, and click Finish.
6. You see a new tab titled AzureBlob1. Change the name of the dataset to SourceBlobDataset.
7. Switch to the Connection tab in the Properties window, and click New for the Linked service. You create
a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure storage account for the Storage account name.
c. Click Save.
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and emp.txt for the file name. You use the
sourceBlobContainer pipeline parameter to set the folder path for the dataset.
13. Switch to the pipeline tab (or click the pipeline in the tree view). Confirm that SourceBlobDataset is selected
for Source Dataset.

13. In the properties window, switch to the Sink tab, and click + New for Sink Dataset. You create a sink
dataset for the copy activity in this step similar to the way you created the source dataset.
14. In the New Dataset window, select Azure Blob Storage, and click Finish.
15. In the General settings page for the dataset, enter SinkBlobDataset for Name.
16. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for LinkedService.
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the current
pipeline run for the file name. For the supported list of system variables and expressions, see System
variables and Expression language.
17. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity. The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow.
18. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL, specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method.
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json.
e. Specify the following JSON for Body.
{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
19. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and
dropping on the Web activity.
20. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the
name to SendFailureEmailActivity.
22. Select the Copy activity in the pipeline designer, click the + button, and select Error.
23. Drag the red button next to the Copy activity to the second Web activity SendFailureEmailActivity. You
can move the activities around so that the pipeline looks like in the following image:
24. To validate the pipeline, click Validate button on the toolbar. Close the Pipeline Validation Output
window by clicking the >> button.
25. To publish the entities (datasets, pipelines, etc.) to Data Factory service, select Publish All. Wait until you
see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.
2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the list.
3. To view activity runs associated with a failed pipeline run, click the first link in the Actions column. Use the
Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed, and that the Web activity
succeeded in sending the failure email to the specified receiver.
4. Click Error link in the Actions column to see details about the error.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline
3/29/2019 • 14 minutes to read • Edit Online
In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If
the copy activity succeeds, you want to send details of the successful copy operation (such as the amount of data
written) in a success email. If the copy activity fails, you want to send details of copy failure (such as the error
message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:
Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database. You use the database as sink data store. If you don't have an Azure SQL Database, see
the Create an Azure SQL database article for steps to create one.
Visual Studio 2013, 2015, or 2017. The walkthrough in this article uses Visual Studio 2017.
Download and install Azure .NET SDK.
Create an application in Azure Active Directory following these instructions. Make note of the following
values that you use in later steps: application ID, authentication key, and tenant ID. Assign application to
"Contributor" role by following instructions in the same article.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2branch container, and to upload the input.txt
file to the container.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add these static variables to the Program class. Replace place-holders with your own values. For a list of
Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data
stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can
be in other regions.
// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";
3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.
Factory response;
{
response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);
}
Add the following code to the Main method that creates a data factory.
Factory df = CreateOrUpdateDataFactory(client);
Add the following code to the Main method that creates an Azure Storage linked service. Learn more from
Azure Blob linked service properties on supported properties and details.
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. Learn more from Azure Blob
dataset properties on supported properties and details.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName;
Notice the use of parameters for the FolderPath. “sourceBlobContainer” is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>
Add the following code to the Main method that creates both Azure Blob source and sink datasets.
class EmailRequest
{
    [Newtonsoft.Json.JsonProperty(PropertyName = "message")]
    public string message;
    [Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
    public string dataFactoryName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
    public string pipelineName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
    public string receiver;

    // Constructor used by the Web activity bodies later in this tutorial.
    public EmailRequest(string input, string df, string pipeline, string receiverName)
    {
        message = input;
        dataFactoryName = df;
        pipelineName = pipeline;
        receiver = receiverName;
    }
}
For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
This aligns with the EmailRequest class you created in the previous section.
Your Request should look like this in the Logic App Designer:
For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Make a note of your HTTP Post request URL for your success email workflow:
Make a note of your HTTP Post request URL for your failure email workflow:
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity and dependsOn property.
In this tutorial, the pipeline contains one activity: copy activity, which takes in the Blob dataset as a source and
another Blob dataset as a sink. Upon the copy activity succeeding and failing, it calls different email tasks.
In this pipeline, you use the following features:
Parameters
Web Activity
Activity dependency
Using output from an activity as an input to the subsequent activity
Let’s break down the following pipeline section by section:
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));
return resource;
}
Add the following code to the Main method that creates the pipeline:
Parameters
The first section of our pipeline defines parameters.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure emails to
the receiver whose email address is specified by this parameter.
Web Activity
The Web Activity allows a call to any REST endpoint. For more information about the activity, see Web Activity.
This pipeline uses a Web Activity to call the Logic Apps email workflow. You create two web activities: one that
calls to the CopySuccessEmail workflow and one that calls the CopyFailWorkFlow.
new WebActivity
{
Name = sendCopyEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}
In the “Url” property, paste the Request URL endpoints from your Logic Apps workflow accordingly. In the “Body”
property, pass an instance of the “EmailRequest” class. The email request contains the following properties:
Message – Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten}. Accesses a property of the
previous copy activity and passes the value of dataWritten. For the failure case, pass the error output instead:
@{activity('CopyBlobtoBlob').error.message}.
Data Factory Name – Passes the value of @{pipeline().DataFactory}. This is a system variable, allowing you to
access the corresponding data factory name. For a list of system variables, see the System Variables article.
Pipeline Name – Passes the value of @{pipeline().Pipeline}. This is also a system variable, allowing you to
access the corresponding pipeline name.
Receiver – Passes the value of @pipeline().parameters.receiver. Accesses the pipeline parameters.
This code creates a new activity dependency that depends on the previous copy activity succeeding.
Main class
Your final Main method should look like this. Build and run your program to trigger a pipeline run!
// Authenticate and create a data factory management client
var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };
Factory df = CreateOrUpdateDataFactory(client);
2. Add the following code to the Main method that retrieves copy activity run details, such as the size of the
data that was read or written.
// Check the copy activity run details
Console.WriteLine("Checking copy activity run details...");
if (pipelineRun.Status == "Succeeded")
{
    Console.WriteLine(activityRuns.First().Output);
    //SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output, client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory
3/5/2019 • 9 minutes to read • Edit Online
This tutorial provides steps for using the Azure portal to provision an Azure-SSIS integration runtime (IR ) in Azure
Data Factory. Then, you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS ) to
deploy and run SQL Server Integration Services (SSIS ) packages in this runtime in Azure. For conceptual
information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.
In this tutorial, you complete the following steps:
Create a data factory.
Provision an Azure-SSIS integration runtime.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before
you get started. Azure Data Factory creates the SSIS Catalog (SSISDB database) on this database server. We
recommend that you create the database server in the same Azure region as the integration runtime. This
configuration lets the integration runtime write execution logs to the SSISDB database without crossing Azure
regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of an
elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network. If you
use Azure SQL Database with virtual network service endpoints/Managed Instance to host SSISDB or require
access to on-premises data, you need to join your Azure-SSIS IR to a virtual network, see Create Azure-SSIS IR
in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This is not
applicable when you use Azure SQL Database with virtual network service endpoints/Managed Instance to
host SSISDB. For more information, see Secure your Azure SQL database. To enable this setting by using
PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of client
machine, to the client IP address list in the firewall settings for the database server. For more information, see
Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server using SQL authentication with your server admin credentials or Azure
Active Directory (AAD ) authentication with the managed identity for your Azure Data Factory (ADF ). For the
latter, you need to add the managed identity for your ADF into an AAD group with access permissions to the
database server, see Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of an Azure-SSIS IR does not support using an existing SSIS Catalog.
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.
5. For Subscription, select your Azure subscription in which you want to create the data factory.
6. For Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version, select V2 (Preview).
8. For Location, select the location for the data factory. The list shows only locations that are supported for the
creation of data factories.
9. Select Pin to dashboard.
10. Select Create.
11. On the dashboard, you see the following tile with the status Deploying data factory:
12. After the creation is complete, you see the Data factory page.
13. Select Author & Monitor to open the Data Factory user interface (UI) on a separate tab.
2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the Authoring UI
1. In the Azure Data Factory UI, switch to the Edit tab, select Connections, and then switch to the
Integration Runtimes tab to view existing integration runtimes in your data factory.
2. Select New to create an Azure-SSIS IR.
3. In the Integration Runtime Setup window, select Lift-and-shift existing SSIS packages to execute in
Azure, and then select Next.
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
a. For Subscription, select the Azure subscription that has your database server to host SSISDB.
b. For Location, select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
c. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, SSISDB can be created on your behalf as a single database, part of
an elastic pool, or in a Managed Instance and accessible in public network or by joining a virtual network.
For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL Database
single databases/elastic pools and Managed Instance. If you select Azure SQL Database with virtual
network service endpoints/Managed Instance to host SSISDB or require access to on-premises data, you
need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual network.
d. For the Use AAD authentication... check box, select the authentication method for your database server to
host SSISDB: SQL or Azure Active Directory (AAD) with the managed identity for your Azure Data Factory
(ADF). If you select it, you need to add the managed identity for your ADF into an AAD group with access
permissions to the database server. See Create Azure-SSIS IR with AAD authentication.
e. For Admin Username, enter SQL authentication username for your database server to host SSISDB.
f. For Admin Password, enter SQL authentication password for your database server to host SSISDB.
g. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB:
Basic/Standard/Premium tier or elastic pool name.
h. Click Test Connection and if successful, click Next.
3. On the Advanced Settings page, complete the following steps:
a. For Maximum Parallel Executions Per Node, select the maximum number of packages to execute
concurrently per node in your integration runtime cluster. Only supported package numbers are displayed.
Select a low number if you want to use more than one core to run a single large, heavyweight package
that is compute- or memory-intensive. Select a high number if you want to run one or more small, lightweight
packages on a single core.
b. For Custom Setup Container SAS URI, optionally enter Shared Access Signature (SAS ) Uniform
Resource Identifier (URI) of your Azure Storage Blob container where your setup script and its associated
files are stored, see Custom setup for Azure-SSIS IR.
c. On Select a VNet... checkbox, select whether you want to join your integration runtime to a virtual
network. You should check it if you use Azure SQL Database with virtual network service
endpoints/Managed Instance to host SSISDB or require access to on-premises data, see Create Azure-SSIS
IR in a virtual network.
4. Click Finish to start the creation of your integration runtime.
IMPORTANT
This process takes approximately 20 to 30 minutes to complete.
The Data Factory service connects to your Azure SQL Database server to prepare the SSIS Catalog (SSISDB database).
When you provision an instance of an Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable
are also installed. These components provide connectivity to Excel and Access files and to various Azure data sources,
in addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.
5. On the Connections tab, switch to Integration Runtimes if needed. Select Refresh to refresh the status.
6. Use the links in the Actions column to stop or start, edit, or delete the integration runtime. Use the last link to view the JSON code for the integration runtime. The edit and delete buttons are enabled only when the IR is stopped.
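If you prefer to manage the integration runtime from a script instead of the UI, the same stop, start, and delete actions are available in Azure PowerShell. The following is a minimal sketch, assuming the Az.DataFactory module is installed; the resource group, data factory, and integration runtime names are placeholders.
# Placeholder names; replace with your resource group, data factory, and integration runtime names.
$resourceGroupName = "MyResourceGroup"
$dataFactoryName = "MyDataFactory"
$azureSsisIrName = "MyAzureSsisIr"

# Stop or start the Azure-SSIS IR.
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $azureSsisIrName -Force
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $azureSsisIrName -Force

# Delete the Azure-SSIS IR (it must be stopped first).
Remove-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $azureSsisIrName -Force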
Next steps
In this tutorial, you learned how to:
Create a data factory.
Provision an Azure-SSIS integration runtime.
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Provision the Azure-SSIS Integration Runtime in
Azure Data Factory with PowerShell
3/15/2019 • 11 minutes to read • Edit Online
This tutorial provides steps for provisioning an Azure-SSIS integration runtime (IR) in Azure Data Factory. Then,
you can use SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) to deploy and run SQL
Server Integration Services (SSIS) packages in this runtime in Azure. In this tutorial, you do the following steps:
NOTE
This article uses Azure PowerShell to provision an Azure SSIS IR. To use the Data Factory user interface (UI) to provision an
Azure SSIS IR, see Tutorial: Create an Azure SSIS integration runtime.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
Azure subscription. If you don't have an Azure subscription, create a free account before you begin. For
conceptual information on Azure-SSIS IR, see Azure-SSIS integration runtime overview.
Azure SQL Database server. If you don't already have a database server, create one in the Azure portal before you get started. This server hosts the SSIS Catalog database (SSISDB). We recommend that you create the database server in the same Azure region as the integration runtime. This configuration lets the integration runtime write execution logs to SSISDB without crossing Azure regions.
Based on the selected database server, SSISDB can be created on your behalf as a single database, as part of an elastic pool, or in a Managed Instance, and it can be accessible over the public network or by joining a virtual network. For guidance in choosing the type of database server to host SSISDB, see Compare Azure SQL Database single databases/elastic pools and Managed Instance. If you use Azure SQL Database with virtual network service endpoints or Managed Instance to host SSISDB, or if you require access to on-premises data, you need to join your Azure-SSIS IR to a virtual network. See Create Azure-SSIS IR in a virtual network.
Confirm that the "Allow access to Azure services" setting is ON for the database server. This setting is not applicable when you use Azure SQL Database with virtual network service endpoints or Managed Instance to host SSISDB. For more information, see Secure your Azure SQL database. To enable this setting by using PowerShell, see New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of the client machine, to the client IP address list in the firewall settings for the database server. For more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin credentials, or by using Azure Active Directory (AAD) authentication with the managed identity for your Azure Data Factory. For the latter, you need to add the managed identity for your ADF to an AAD group with access permissions to the database server. See Create Azure-SSIS IR with AAD authentication.
Confirm that your Azure SQL Database server does not have an SSIS Catalog (SSISDB database). The
provisioning of Azure-SSIS IR does not support using an existing SSIS Catalog.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell. You use
PowerShell to run a script to provision an Azure-SSIS integration runtime that runs SSIS packages in the cloud.
NOTE
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see ADF +
SSIS IR availability by region.
Create variables
Copy and paste the following script, and specify values for the variables. For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
# Azure Data Factory information
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
# Data factory name. Must be globally unique
$DataFactoryName = "[Data factory name]"
$DataFactoryLocation = "EastUS"
# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0", "S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name = <elastic_pool_name>)]"
To create the Azure SQL Database server as part of the script, set values for the variables that haven't been defined already (for example, SSISDBServerName and FirewallIPAddress), and add a step like the example that follows before you provision the integration runtime.
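A minimal sketch of that step might look like the following. It assumes the Az.Sql module is installed and that you run it after signing in, as in the main script; SSISDBServerName and FirewallIPAddress are placeholder variables you define yourself, and the firewall rule names are only illustrative.
$SSISDBServerName = "[your Azure SQL Database server name]"   # placeholder; should match the name used in $SSISDBServerEndpoint
$FirewallIPAddress = "[your client machine IP address]"       # placeholder

# Create the logical SQL server that will host SSISDB.
$serverAdminCred = New-Object System.Management.Automation.PSCredential(
    $SSISDBServerAdminUserName,
    (ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force))
New-AzSqlServer -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -Location $DataFactoryLocation `
    -SqlAdministratorCredentials $serverAdminCred

# Allow Azure services (including the Azure-SSIS IR) to reach the server.
New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -FirewallRuleName "AllowAllWindowsAzureIps" `
    -StartIpAddress "0.0.0.0" -EndIpAddress "0.0.0.0"

# Allow your client machine to connect, for example from SSDT or SSMS.
New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -FirewallRuleName "ClientIPAddress" `
    -StartIpAddress $FirewallIPAddress -EndIpAddress $FirewallIPAddress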
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force
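Once Start-AzDataFactoryV2IntegrationRuntime returns, you can confirm that the runtime reached the Started state before deploying packages. This is a small sketch that reuses the variables defined earlier:
# The State property in the output should show "Started" when provisioning has finished.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Status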
Full script
The PowerShell script in this section configures an instance of Azure-SSIS integration runtime in the cloud that
runs SSIS packages. After you run this script successfully, you can deploy and run SSIS packages in the Microsoft
Azure cloud with SSISDB hosted in Azure SQL Database.
1. Launch the Windows PowerShell Integrated Scripting Environment (ISE).
2. In the ISE, run the command shown in the sketch after this list from the command prompt.
3. Copy the PowerShell script in this section and paste it into the ISE.
4. Provide appropriate values for all parameters at the beginning of the script.
5. Run the script. The Start-AzDataFactoryV2IntegrationRuntime command near the end of the script runs for 20 to 30 minutes.
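The command for step 2 is assumed here to be the standard execution policy change that allows locally created scripts to run in the current session:
# Assumed command for step 2: allow scripts to run in the current user's PowerShell session.
Set-ExecutionPolicy Unrestricted -Scope CurrentUser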
NOTE
The script connects to your Azure SQL Database server to prepare the SSIS Catalog database (SSISDB).
When you provision an instance of Azure-SSIS IR, the Azure Feature Pack for SSIS and the Access Redistributable are
also installed. These components provide connectivity to Excel and Access files and to various Azure data sources, in
addition to the data sources supported by the built-in components. You can also install additional components. For
more info, see Custom setup for the Azure-SSIS integration runtime.
For a list of supported pricing tiers for Azure SQL Database, see SQL Database resource limits.
For a list of Azure regions in which Data Factory and Azure-SSIS Integration Runtime are currently available, see
ADF + SSIS IR availability by region.
# Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[Specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[Specify a description for your Azure-SSIS IR]"
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your own on-premises SQL Server license to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, 1-4 parallel executions per node are supported, but for other nodes, 1-8 are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# SSISDB info
$SSISDBServerEndpoint = "[your Azure SQL Database server name].database.windows.net" # WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify "S0", "S1", "S2", "S3", etc.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name = <elastic_pool_name>)]"
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
write-host("##### Starting your Azure-SSIS integration runtime. This command takes 20 to 30 minutes to
complete. #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force
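The snippets above cover only the optional custom setup and the final start command. A sketch of the creation steps that would normally sit between Select-AzSubscription and the custom setup block, based on the variables defined above, might look like the following; treat it as an outline rather than the verbatim script.
# Create a resource group and a data factory (if they don't already exist).
New-AzResourceGroup -Name $ResourceGroupName -Location $DataFactoryLocation -Force
Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Name $DataFactoryName `
    -Location $DataFactoryLocation

# Create the Azure-SSIS integration runtime and point it at the database server that will host SSISDB.
$serverCred = New-Object System.Management.Automation.PSCredential(
    $SSISDBServerAdminUserName,
    (ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force))
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -LicenseType $AzureSSISLicenseType `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogAdminCredential $serverCred `
    -CatalogPricingTier $SSISDBPricingTier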
Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS integration runtime
Start the Azure-SSIS integration runtime
Deploy SSIS packages
Review the complete script
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize Azure-SSIS IR
Azure PowerShell samples for Azure Data Factory
3/7/2019 • 2 minutes to read • Edit Online
The following table includes links to sample Azure PowerShell scripts for Azure Data Factory.
Copy data
Copy blobs from a folder to another folder in an Azure Blob Storage: This PowerShell script copies blobs from a folder in Azure Blob Storage to another folder in the same Blob Storage.
Copy data from on-premises SQL Server to Azure Blob Storage: This PowerShell script copies data from an on-premises SQL Server database to an Azure blob storage.
Bulk copy: This sample PowerShell script copies data from multiple tables in an Azure SQL database to an Azure SQL data warehouse.
Incremental copy: This sample PowerShell script loads only new or updated records from a source data store to a sink data store after the initial full copy of data from the source to the sink.
Transform data
Transform data using a Spark cluster: This PowerShell script transforms data by running a program on a Spark cluster.
Create Azure-SSIS integration runtime: This PowerShell script provisions an Azure-SSIS integration runtime that runs SQL Server Integration Services (SSIS) packages in Azure.
Pipelines and activities in Azure Data Factory
5/15/2019 • 14 minutes to read • Edit Online
This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. For example, a pipeline could contain a set of activities that ingest and clean
log data, and then kick off a Spark job on an HDInsight cluster to analyze the log data. The beauty of
this is that the pipeline allows you to manage the activities as a set instead of each one individually.
For example, you can deploy and schedule the pipeline, instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy
activity to copy data from an on-premises SQL Server to an Azure Blob Storage. Then, use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process/transform data from the
blob storage to produce output data. Finally, use a second copy activity to copy the output data to an
Azure SQL Data Warehouse on top of which business intelligence (BI) reporting solutions are built.
Data Factory supports three types of activities: data movement activities, data transformation
activities, and control activities. An activity can take zero or more input datasets and produce one or
more output datasets. The following diagram shows the relationship between pipeline, activity, and
dataset in Data Factory:
An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity.
For more information about datasets, see the Datasets in Azure Data Factory article.
CATEGORY DATA STORE SUPPORTED AS A SOURCE SUPPORTED AS A SINK SUPPORTED BY AZURE IR SUPPORTED BY SELF-HOSTED IR
Azure Cosmos DB (SQL API) ✓ ✓ ✓ ✓
Azure Cosmos DB's API for MongoDB ✓ ✓ ✓ ✓
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen1
Azure Data ✓ ✓ ✓ ✓
Lake Storage
Gen2
Azure ✓ ✓ ✓
Database for
MariaDB
Azure ✓ ✓ ✓
Database for
MySQL
Azure ✓ ✓ ✓
Database for
PostgreSQL
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓
Database
Managed
Instance
Azure SQL ✓ ✓ ✓ ✓
Data
Warehouse
Azure Search ✓ ✓ ✓
Index
Azure Table ✓ ✓ ✓ ✓
Storage
Database Amazon ✓ ✓ ✓
Redshift
DB2 ✓ ✓ ✓
Drill (Preview) ✓ ✓ ✓
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache Impala ✓ ✓ ✓
(Preview)
Informix ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft ✓ ✓
Access
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
(Preview)
SAP Business ✓ ✓
Warehouse
Open Hub
SAP Business ✓ ✓
Warehouse via
MDX
SAP HANA ✓ ✓ ✓
SAP Table ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
File System ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google Cloud ✓ ✓ ✓
Storage
HDFS ✓ ✓ ✓
SFTP ✓ ✓ ✓
Generic OData ✓ ✓ ✓
Generic ODBC ✓ ✓ ✓
Generic REST ✓ ✓ ✓
Common Data ✓ ✓ ✓ ✓
Service for
Apps
Concur ✓ ✓ ✓
(Preview)
Dynamics 365 ✓ ✓ ✓ ✓
Dynamics AX ✓ ✓ ✓
(Preview)
Dynamics ✓ ✓ ✓ ✓
CRM
Google ✓ ✓ ✓
AdWords
(Preview)
HubSpot ✓ ✓ ✓
(Preview)
Jira (Preview) ✓ ✓ ✓
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Office 365 ✓ ✓ ✓
Oracle Eloqua ✓ ✓ ✓
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle Service ✓ ✓ ✓
Cloud
(Preview)
Paypal ✓ ✓ ✓
(Preview)
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service Cloud
Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
Shopify ✓ ✓ ✓
(Preview)
Square ✓ ✓ ✓
(Preview)
Web Table ✓ ✓
(HTML table)
Xero (Preview) ✓ ✓ ✓
Zoho (Preview) ✓ ✓ ✓
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.
Stored Procedure: Azure SQL, Azure SQL Data Warehouse, or SQL Server
Execute Pipeline Activity: The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Wait Activity: When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing with execution of subsequent activities.
Pipeline JSON
Here is how a pipeline is defined in JSON format:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
}
}
}
Activity JSON
The activities section can have one or more activities defined within it. There are two main types of
activities: Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the
following top-level structure:
{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}
linkedServiceName: Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. Required: Yes for HDInsight Activity, Azure Machine Learning Batch Scoring Activity, and Stored Procedure Activity; No for all others.
Activity policy
Policies affect the run-time behavior of an activity, giving configurability options. Activity Policies are
only available for execution activities.
Activity policy JSON definition
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity"
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}
Control activity
Control activities have the following top-level structure:
{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}
{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"policy": {
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}
The typeProperties section is different for each transformation activity. To learn about type
properties supported for a transformation activity, click the transformation activity in the Data
transformation activities.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.
{
"name": "TriggerA",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
...
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyCopyPipeline"
},
"parameters": {
"copySourceName": "FileSource"
}
}
}
}
Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
Linked services in Azure Data Factory
4/28/2019 • 4 minutes to read • Edit Online
This article describes what linked services are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy
activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity
that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data.
Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of
which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points to or references the data you want to use in your
activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked
data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the
folder within that Azure storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure
Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure Storage
linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to
connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset specifies the blob
container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies
the SQL table in your SQL database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:
{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
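As an illustration of deploying such a definition, the sketch below writes a hypothetical Azure Storage linked service to a local JSON file and deploys it with Azure PowerShell; the resource group, data factory, account name, and key are placeholders rather than values from this article.
# Placeholder names; replace with your own values.
$resourceGroupName = "MyResourceGroup"
$dataFactoryName = "MyDataFactory"

# A hypothetical Azure Storage linked service definition, saved to a local file.
$linkedServiceJson = @"
{
    "name": "MyAzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
            }
        }
    }
}
"@
Set-Content -Path ".\MyAzureStorageLinkedService.json" -Value $linkedServiceJson

# Deploy the linked service to the data factory.
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "MyAzureStorageLinkedService" `
    -DefinitionFile ".\MyAzureStorageLinkedService.json"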
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory
4/28/2019 • 11 minutes to read • Edit Online
This article describes what datasets are, how they are defined in JSON format, and how they are used in
Azure Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. Now, a
dataset is a named view of data that simply points to or references the data you want to use in your
activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in
Blob storage from which the activity should read the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed for
Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of
the data within the linked data stores, and the linked service defines the connection to the data source.
For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob
dataset represents the blob container and the folder within that Azure storage account that contains the
input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which
refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL
Database linked service). The Azure Storage and Azure SQL Database linked services contain
connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL
Database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains
the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL
database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in
Data Factory:
Dataset JSON
A dataset in Data Factory is defined in the following JSON format:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}
NOTE
Azure Data Factory Mapping Data Flow is currently a public preview feature and is not subject to Azure customer
SLA provisions.
See supported dataset types for a list of dataset types that are Data Flow compatible. Datasets that are
compatible for Data Flow require fine-grained dataset definitions for transformations. Thus, the JSON
definition is slightly different. Instead of a structure property, datasets that are Data Flow compatible
have a schema property.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data
schemas. If your data has no schema, you can use schema drift for your source and sink. The schema in
the dataset represents the physical data type and shape.
By defining the schema from the dataset, you'll get the related data types, data formats, file location, and
connection information from the associated Linked service. Metadata from the datasets appears in your
source transformation as the source projection. The projection in the source transformation represents
the Data Flow data with defined names and types.
When you import the schema of a Data Flow dataset, select the Import Schema button and choose to
import from the source or from a local file. In most cases, you'll import the schema directly from the
source. But if you already have a local schema file (a Parquet file or CSV with headers), you can direct
Data Factory to base the schema on that file.
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}
Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "MyAzureSqlLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties":
{
"tableName": "MyTable"
}
}
}
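As a small, hedged example of one way to create this dataset, you could save the DatasetSample JSON above to a file and deploy it with Azure PowerShell; the resource group and data factory names are placeholders.
# Assumes the DatasetSample JSON shown above is saved as .\DatasetSample.json
Set-AzDataFactoryV2Dataset -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "DatasetSample" `
    -DefinitionFile ".\DatasetSample.json"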
Dataset type
There are many different types of datasets, depending on the data store you use. See the following table
for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked
service and a dataset for that data store.
CATEGORY DATA STORE SUPPORTED AS A COPY ACTIVITY SOURCE SUPPORTED AS A COPY ACTIVITY SINK SUPPORTED BY AZURE IR SUPPORTED BY SELF-HOSTED IR SUPPORTED BY DATA FLOW
Azure ✓ ✓ ✓ ✓
Cosmos DB
(SQL API)
Azure ✓ ✓ ✓ ✓
Cosmos
DB's API for
MongoDB
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Data Lake Storage Gen1 ✓ ✓ ✓ ✓ ✓ (supported formats for Data Flow: Delimited Text, Parquet)
Azure Data Lake Storage Gen2 ✓ ✓ ✓ ✓ ✓ (supported formats for Data Flow: Delimited Text, Parquet)
Azure ✓ ✓ ✓
Database for
MariaDB
Azure ✓ ✓ ✓
Database for
MySQL
Azure ✓ ✓ ✓
Database for
PostgreSQL
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓
Database
Managed
Instance
Azure SQL ✓ ✓ ✓ ✓ ✓
Data
Warehouse
Azure ✓ ✓ ✓
Search Index
Azure Table ✓ ✓ ✓ ✓
Storage
Database Amazon ✓ ✓ ✓
Redshift
DB2 ✓ ✓ ✓
Drill ✓ ✓ ✓
(Preview)
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache ✓ ✓ ✓
Impala
(Preview)
Informix ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft ✓ ✓
Access
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
(Preview)
SAP ✓ ✓
Business
Warehouse
Open Hub
SAP ✓ ✓
Business
Warehouse
via MDX
SAP HANA ✓ ✓ ✓
SAP Table ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
File System ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google ✓ ✓ ✓
Cloud
Storage
HDFS ✓ ✓ ✓
SFTP ✓ ✓ ✓
Generic Generic ✓ ✓ ✓
protocol HTTP
Generic ✓ ✓ ✓
OData
Generic ✓ ✓ ✓
ODBC
Generic ✓ ✓ ✓
REST
Services Amazon ✓ ✓ ✓
and apps Marketplace
Web Service
(Preview)
Common ✓ ✓ ✓ ✓
Data Service
for Apps
Concur ✓ ✓ ✓
(Preview)
Dynamics ✓ ✓ ✓ ✓
365
Dynamics ✓ ✓ ✓
AX (Preview)
Dynamics ✓ ✓ ✓ ✓
CRM
Google ✓ ✓ ✓
AdWords
(Preview)
HubSpot ✓ ✓ ✓
(Preview)
Jira ✓ ✓ ✓
(Preview)
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Office 365 ✓ ✓ ✓
Oracle ✓ ✓ ✓
Eloqua
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle ✓ ✓ ✓
Service
Cloud
(Preview)
Paypal ✓ ✓ ✓
(Preview)
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service
Cloud
Salesforce ✓ ✓ ✓
Marketing
Cloud
(Preview)
SAP Cloud ✓ ✓ ✓ ✓
for
Customer
(C4C)
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
Shopify ✓ ✓ ✓
(Preview)
Square ✓ ✓ ✓
(Preview)
Web Table ✓ ✓
(HTML
table)
Xero ✓ ✓ ✓
(Preview)
Zoho ✓ ✓ ✓
(Preview)
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a
dependency on preview connectors in your solution, please contact Azure support.
In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an
Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference",
},
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
}
}
}
Example
In the following example, suppose the source Blob data is in CSV format and contains three columns:
userid, name, and lastlogindate. They are of type Int64, String, and Datetime with a custom datetime
format using abbreviated French names for day of the week.
Define the Blob dataset structure as follows along with type definitions for the columns:
"structure":
[
{ "name": "userid", "type": "Int64"},
{ "name": "name", "type": "String"},
{ "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
]
Guidance
The following guidelines help you understand when to include structure information, and what to
include in the structure section. Learn more on how data factory maps source data to sink and when to
specify structure information from Schema and type mapping.
For strong schema data sources, specify the structure section only if you want map source
columns to sink columns, and their names are not the same. This kind of structured data source
stores data schema and type information along with the data itself. Examples of structured data
sources include SQL Server, Oracle, and Azure SQL Database.
As type information is already available for structured data sources, you should not include type
information when you do include the structure section.
For data sources with no schema or a weak schema, for example a text file in blob storage, include structure when the dataset is an input for a copy activity and the data types of the source dataset should be converted to the native types of the sink. Also include structure when you want to map source columns to sink columns.
Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure
Resource Manager Template, and Azure portal
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one
of these tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory
3/5/2019 • 15 minutes to read • Edit Online
A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a
pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the
pipeline, or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely
defines that particular pipeline run.
Pipeline runs are typically instantiated by passing arguments to parameters that you define in the pipeline. You
can execute a pipeline either manually or by using a trigger. This article provides details about both ways of
executing a pipeline.
In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer.
You pass values to these parameters at runtime.
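For example, a manual run from Azure PowerShell that passes values for these two parameters might look like the following sketch; the resource group, data factory, and pipeline names are placeholders, and the returned run ID can be used to look up the run.
# Start a pipeline run and pass parameter values at runtime; the cmdlet returns the pipeline run ID.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -PipelineName "MyCopyPipeline" `
    -Parameter @{ "sourceBlobContainer" = "input"; "sinkBlobContainer" = "output" }

# Check the status of that specific pipeline run by its run ID.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -PipelineRunId $runId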
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module