End-to-End: Azure ML and Big Data
Case Study – Students Guide
Overview
Margie's Travel (MT) provides concierge services for business travelers. In an
increasingly crowded market, they are always looking for ways to differentiate
themselves, and provide added value to their corporate customers.
They are looking to pilot a web app that their internal customer service agents can
use to provide additional information useful to the traveler during the flight booking
process. They want to enable their agents to enter the flight information and
produce a prediction as to whether the departing flight will encounter a delay of
15 minutes or longer, given the weather forecast for the departure hour.
Solution architecture
Below is a diagram of the solution architecture you will build in this lab. Please study
it carefully so you understand the solution as a whole while you work on its various
components.
Exercise 1: Retrieve lab environment information and
create a Databricks cluster
Duration: 10 minutes
In this exercise, you will retrieve your Azure Storage account name and access key
and your Azure Subscription Id and record the values to use later within the lab. You
will also create a new Azure Databricks cluster.
Task 1: Retrieve Azure Storage account information and
Subscription Id
You will need to have the Azure Storage account name and access key when you
create your Azure Databricks cluster during the lab. You will also need to create
storage containers in which you will store your flight and weather data files.
1. From the side menu in the Azure portal, choose Resource groups, then enter
your resource group name into the filter box, and select it from the list.
2. Next, select your lab Azure Storage account from the list.
3. On the left menu, select Overview, then locate and copy your Azure
Subscription ID and save it to a text editor such as Notepad for later.
4. Select Access keys (1) from the menu. Copy the storage account name (2)
and the key1 value (3) to a text editor such as Notepad for later.
Task 2: Create an Azure Databricks cluster
You have provisioned an Azure Databricks workspace, and now you need to create a
new cluster within the workspace. Part of the cluster configuration includes setting
up an account access key to your Azure Storage account, using the Spark Config
within the new cluster form. This will allow your cluster to access the lab files.
1. From the side menu in the Azure portal, select Resource groups, then enter
your resource group name into the filter box, and select it from the list.
2. Next, select your Azure Databricks service from the list.
3. In the Overview pane of the Azure Databricks service, select Launch
Workspace.
Azure Databricks will automatically log you in using Azure Active Directory
Single Sign On.
4. Select Compute from the side menu and then Create Compute.
5. On the New Cluster form, provide the following:
o Cluster Name: lab
o Policy: Unrestricted
o Single node: Selected
o Databricks Runtime Version: 9.1 LTS ML (Scala 2.12, Spark 3.1.2)
o Node type: Compute Optimized → Standard_F4
o Terminate after: Check the box and enter 120
o Advanced options:
o Spark Config: Expand Advanced Options and edit the Spark Config by
entering the connection information for the Azure Storage account that you
copied in Task 1. This will allow your cluster to access the lab files. Enter the
following:
[Link].<STORAGE_ACCOUNT_NAME>.[Link] <ACCESS_KEY>
where <STORAGE_ACCOUNT_NAME> is your Azure Storage account name and
<ACCESS_KEY> is your storage access key.
Example: [Link] HD+91Y77b+TezEu1lh9QXXU2Va6Cjg9bu0RRpb/KtBj8lWQa6jwyA0OGTDmSNVFr8iSlkytIFONEHLdl67Fgxg==
6. Select Create Cluster.
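Once the cluster is running, you can optionally confirm that the Spark Config above took effect by listing a container in your storage account from a notebook attached to the cluster. This is only a sketch: the container name sparkcontainer is borrowed from a later exercise, so substitute any container that already exists in your account (the call fails if none exists yet).

# Optional check, run in a notebook cell attached to the lab cluster:
# list the blobs in one container of the storage account configured above.
storage_account = "<STORAGE_ACCOUNT_NAME>"   # the account name recorded in Task 1
container = "sparkcontainer"                 # placeholder -- use a container that exists

path = "wasbs://{}@{}.blob.core.windows.net/".format(container, storage_account)
display(dbutils.fs.ls(path))                 # dbutils and display are built into Databricks notebooks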
Exercise 2: Load Sample Data and Databricks
Notebooks
Duration: 60 minutes
In this exercise, you will implement a classification experiment. You will load the
training data from your local machine into a dataset. Then, you will explore the data
to identify the primary components you should use for prediction and use two
different algorithms for predicting the classification. You will then evaluate the
performance of both algorithms and choose the algorithm that performs best. The
model selected will be exposed as a web service that is integrated with the optional
sample web app at the end.
Task 1: Upload the Sample Datasets
1. Before you begin working with machine learning services, there are three
datasets you need to load.
2. Download the three CSV sample datasets from here: [Link] (If
you get an error, or the page won't open, try pasting the URL into a new
browser window and verify the case-sensitive URL is exactly as shown.) If you
are still having trouble, a zip file called [Link] is
included in the assignment resources folder on Blackboard.
3. Extract the ZIP and verify you have the following files:
o [Link]
o [Link]
o [Link]
4. Open your Azure Databricks workspace. Before continuing to the next step,
verify that your new cluster is running. Do this by navigating to Clusters on
the left-hand menu and ensuring that the state of your cluster is Running.
5. Select Catalog from the menu and choose your cluster from the dropdown.
Next, select default under Databases (if this does not appear, you may need
to start your cluster). Finally, select +Add -> Add data above the Tables
header.
6. Select the native integration DBFS from the listed Data source options.
7. Drag and drop the [Link] file into the file
area, then select Create Table with UI.
8. Select your lab cluster to preview the table, then select Preview Table.
9. Change the Table Name to flight_delays_with_airport_codes and select the
checkmark for First row is header. Select Create Table.
10. Repeat steps 5 through 9 for the [Link] and
[Link] files, setting the name for each dataset in a
similar fashion:
o flightweatherwithairportcode_csv renamed
to flight_weather_with_airport_code
o airportcodelocationlookupclean_csv renamed
to airport_code_location_lookup_clean
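As an optional sanity check after creating the three tables, you can run the following in a new notebook cell attached to your lab cluster. It is only a sketch; the table names assume you used the names given in the steps above.

# Confirm the three tables are registered and readable.
for table_name in ["flight_delays_with_airport_codes",
                   "flight_weather_with_airport_code",
                   "airport_code_location_lookup_clean"]:
    row_count = spark.table(table_name).count()
    print("{}: {} rows".format(table_name, row_count))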
Task 2: Install Azure ML library on the cluster
1. Select Compute on the left-hand menu, then select your lab cluster to open it.
2. Select the Libraries tab. If you do not see the Azure ML library already
installed on the cluster, continue to the next step. Otherwise, continue to Task
3.
3. Select Install New.
4. In the Install Library dialog, select PyPi for the Library Source, then enter the
following in the Package field: azureml-sdk[databricks]. Select Install.
5. Wait until the library's status shows as Installed before continuing.
6. Repeat steps 3 through 5 to install the mlflow library.
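Once both libraries show as Installed, you can optionally confirm they resolve on the cluster by running the following in a notebook cell (the printed versions will vary with the runtime):

# Verify the azureml-sdk and mlflow libraries installed on the cluster can be imported.
import azureml.core
import mlflow

print("azureml-sdk version:", azureml.core.VERSION)
print("mlflow version:", mlflow.__version__)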
Task 3: Open Azure Databricks and complete lab notebooks
1. Within Azure Databricks, select Workspace on the menu, then Users, then
select the ellipses next to your username. Select Import.
2. Within the Import Notebooks dialog, select Import from: URL, then paste the
following into the URL textbox. (Note: you can also directly upload the .dbc file
provided on Blackboard.)
[Link]
[Link]
files/[Link]
Hands-on%20lab/lab-files/[Link]
3. After importing, expand the new BigDataVis folder.
4. Within the BigDataVis folder there should be 3 subfolders, for Exercises 2, 5
and 6, containing a total of 5 notebooks. Open the Exercise 2 subfolder and
explore the notebooks, starting at 01 Data Preparation.
5. Run each cell of the notebooks located in the Exercise 2 folder (01, 02 and 03)
individually by selecting within the cell, then pressing Ctrl+Enter on your
keyboard. Pay close attention to the instructions within the notebooks so you
understand each step of the data preparation process. Make sure to connect
to your compute cluster using the dropdown menu pinned to the top of each
notebook.
6. During Exercise 2 – 03 Deploy as a Web Service, use the Machine Learning
group to explore and serve your model (an example REST call is sketched at the
end of this task):
NOTE: you may not currently be able to serve your model while your lab cluster is running due
to CPU quotas imposed on free or trial accounts. If this is the case, stop your cluster to
temporarily serve your model and execute the REST call in a local Python notebook to test its
functionality.
7. Do NOT run the Clean up part of Notebook 3 (i.e., this command: [Link]()).
You will need the URL of your deployed Machine Learning model later in
Exercise 8: Deploy intelligent web app (Optional Lab).
Note: you can get this URL by adding the line print(service.scoring_uri) to your
notebook, or by going to your Azure Machine Learning workspace via the Azure
portal and opening the "Deployments" blade.
8. Do NOT run any notebooks within the Exercise 5 or 6 folders. They will be
discussed later in the lab.
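For step 6 above, the following is a sketch of what the REST call from a local Python notebook might look like. The scoring URI is the value printed by print(service.scoring_uri); the payload shown is only a placeholder, so copy the exact column names, ordering, and sample request from the 03 Deploy as a Web Service notebook.

# Sketch of calling the deployed model's scoring endpoint from a local Python notebook.
import json
import requests

scoring_uri = "<YOUR_SCORING_URI>"           # from print(service.scoring_uri)

payload = json.dumps({
    "data": [
        # one record of flight and weather features, in the order the model expects
        [1, 1, 1, 10, "SEA", "DEN", 30.0, 0.0, 10.0, 25.0]
    ]
})

headers = {"Content-Type": "application/json"}
response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.status_code, response.text)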
Exercise 3: Setup Azure Data Factory
Duration: 20 minutes
In this exercise, you will create a baseline environment for Azure Data Factory
development to further operationalize data movement and processing. You
will create a Data Factory service, and then install the self-hosted Integration
Runtime (formerly the Data Management Gateway), the agent that facilitates data
movement from on-premises to Microsoft Azure.
Task 1: Download and stage data to be processed
1. Open a web browser.
2. Download the AdventureWorks sample data from [Link]
3. Extract it to a new folder called C:\Data.
Task 2: Install and configure Azure Data Factory Integration
Runtime on your machine
1. To download the latest version of Azure Data Factory Integration Runtime, go
to [Link]
2. Select Download, then choose the download you want from the next screen.
3. Once downloaded, run the installer.
4. When you see the following screen, select Next.
5. Check the box to accept the terms and select Next.
6. Accept the default Destination Folder and select Next.
7. Choose Install to complete the installation.
8. Select Finish once the installation has completed.
9. After selecting Finish, the following screen will appear. Keep it open for now.
You will come back to this screen once the Data Factory in Azure has been
provisioned to obtain the gateway key, so you can connect Data Factory to
this "on-premises" server.
Task 3: Configure Azure Data Factory
1. Launch a new browser window and navigate to the Azure portal
([Link]). Once prompted, log in with your Microsoft Azure
credentials. If prompted, choose whether your account is an organization
account or a Microsoft account; this will depend on which account was used
to provision the Azure subscription being used for this lab.
2. From the side menu in the Azure portal, choose Resource groups, then enter
your resource group name into the filter box, and select it from the list.
3. Next, select your Azure Data Factory service from the list.
4. On the Data Factory Overview screen, select Launch studio.
5. A new page will open in another tab or new window. Within the Azure Data
Factory site, select Manage on the menu.
6. Now, select Integration runtimes in the menu beneath Connections (1), then
select + New (2).
7. In the Integration Runtime Setup blade that appears, select Azure, Self-
Hosted, then select Continue.
8. Select Self-Hosted then select Continue.
9. Enter a Name, such as bigdatagateway-[initials], and select Create.
10. Under Option 2: Manual setup, copy the Key1 authentication key value by
selecting the Copy button, then select Close.
11. Don't close the current screen or browser session.
12. Paste the Key1 value into the box in the middle of the Microsoft Integration
Runtime Configuration Manager screen.
13. Select Register.
14. It can take up to a minute or two to register. If it takes more than a couple of
minutes, and the screen does not respond or returns an error message, close
the screen by selecting the Cancel button.
15. The next screen will be New Integration Runtime (Self-hosted) Node. Select
Finish.
16. You will then get a screen with a confirmation message. Select the Launch
Configuration Manager button to view the connection details.
17. You can now return to the Azure Data Factory page and view the Integration
Runtime you just configured. You may need to select Refresh to view the
Running status for the IR.
18. Select the Azure Data Factory Overview button on the menu. Leave this open
for the next exercise.
Exercise 4: Develop a data factory pipeline for data
movement
Duration: 20 minutes
In this exercise, you will create an Azure Data Factory pipeline to copy data (.csv
files) from an on-premises server (your machine) to Azure Blob Storage. The goal of
the exercise is to demonstrate data movement from an on-premises location to
Azure Storage (via the Integration Runtime).
Task 1: Create copy pipeline using the Copy Data Wizard
1. Within the Azure Data Factory overview page, select Ingest.
2. In the Copy Data properties, enter the following:
o Task Type: Built-in copy task
o Task cadence or task schedule: Select Schedule
o Start Date (UTC): Enter 03/01/2018 12:00 AM
o Recurrence: Every 1, and select Month(s)
o Under the Advanced recurrence options, make sure you have a value
of 0 in the textboxes for Hours (UTC) and Minutes (UTC), otherwise it
will fail later during Publishing.
o End: No End
3. Select Next.
4. On the Source data store screen, select + Create new connection.
5. Scroll through the options and select File System, then select Continue.
6. In the New Linked Service form, enter the following:
o Name: OnPremServer
o Connect via integration runtime: Select the Integration runtime
created previously in this exercise.
o Host: C:\Data
o Username: Use your machine's login username.
o Password: Use your machine's login password.
7. Select Test connection to verify you correctly entered the values. Finally,
select Create.
Debugging: Connection failed
If your connection fails, check the following:
1) Your account details (your username might be the email address associated with your
Microsoft account rather than the display name shown). You can confirm your username by
running the command: whoami.
2) Your firewall settings (non-standard settings may cause issues).
3) Your Integration Runtime configuration: by default, new installations disable access to local
storage. You can enable it by opening a new PowerShell window with Admin permissions,
navigating to [Link] in "C:\Program Files\Microsoft Integration Runtime\5.0\Shared\", and
issuing the command with the arguments "-elma" and "-DisableLocalFolderPathValidation".
See here for more details.
If you see the error message "Cannot connect to '\\Laptop123\C'. Detail Message: The system
could not find the environment option that was entered", see the solution at [Link].
8. On the Source data store page, on the Choose the input file or folder screen, select Browse,
then select the FlightsAndWeather folder. Next, select Load all files under File loading behavior,
check Recursively, then select Next.
9. On the File format settings page, select the following options:
o File format: Delimited Text
o Column delimiter: Comma (,)
o Row delimiter: Auto detect (\r, \n, or \r\n)
o First row as header: Checked
10. Select Next.
11. On the Destination data store screen, select + Create new connection.
12. Select Azure Blob Storage within the New Connection blade, then
select Continue.
13. On the New Connection (Azure Blob Storage) account screen, enter the
following, test your connection, and then select Create.
o Name: BlobStorageOutput
o Connect via integration runtime: Select your Integration Runtime.
o Authentication method: Select Account key
o Account selection method: From Azure subscription
o Storage account name: Select the blob storage account you
provisioned in the before-the-lab section.
14. On the Destination data store page, select Next.
15. From the Choose the output file or folder tab, enter the following:
o Folder path: sparkcontainer/FlightsAndWeather/{Year}/{Month}/
o Filename: [Link]
o Year: yyyy
o Month: MM
o Copy behavior: Merge files
o Select Next.
16. On the File format settings screen, select the Delimited Text file format,
and check the Add header to file checkbox, then select Next.
17. On the Settings screen, enter:
o Task name: CopyOnPrem2AzurePipeline
o Task description: (Optional) This pipeline copies time-sliced CSV files from
on-premises C:\\Data to Azure Blob Storage as a continuous job.
o Select Skip incompatible rows under Fault tolerance,
o Uncheck Enable logging. Expand Advanced settings
o Set Degree of copy parallelism to 10
o Then select Next.
18. Review settings on the Summary tab, but DO NOT choose Next.
19. Scroll down on the summary page until you see the Copy Settings section.
Select Edit next to Copy Settings.
20. Change the following Copy setting:
o Timeout: 7.[Link]
o Retry: 3
o Select Save.
21. After saving the Copy settings, select Next on the Summary tab.
22. On the Deployment screen you will see a message that the deployment is in
progress, and after a minute or two, that the deployment completed.
Select Edit Pipeline to close the wizard and navigate to the pipeline
editing blade.
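Once the pipeline has actually run (you will trigger it in Exercise 5), you can optionally verify the copied files from a local Python session using the azure-storage-blob package. This is a sketch under the assumption that the sink settings above were used unchanged; the account name and key are the values recorded in Exercise 1.

# Optional local check (pip install azure-storage-blob) that the pipeline wrote
# files under sparkcontainer/FlightsAndWeather/{Year}/{Month}/.
from azure.storage.blob import BlobServiceClient

account_name = "<STORAGE_ACCOUNT_NAME>"
account_key = "<ACCESS_KEY>"

service = BlobServiceClient(
    account_url="https://{}.blob.core.windows.net".format(account_name),
    credential=account_key,
)
container = service.get_container_client("sparkcontainer")

for blob in container.list_blobs(name_starts_with="FlightsAndWeather/"):
    print(blob.name)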
Exercise 5: Operationalize ML scoring with Azure
Databricks and Data Factory
Duration: 20 minutes
In this exercise, you will extend the Data Factory to operationalize the scoring of data
using the previously created machine learning model within an Azure Databricks
notebook.
Task 1: Create Azure Databricks Linked Service
1. Return to, or reopen, the Author & Monitor page for your Azure Data Factory
in a web browser, navigate to the Author view, and select the pipeline.
2. Once there, expand Databricks under Activities.
3. Drag the Notebook activity onto the design surface to the side of the Copy
activity.
4. Select the Notebook activity on the design surface to display tabs containing
its properties and settings at the bottom of the screen. On the General tab,
enter BatchScore into the Name field.
5. Select the Azure Databricks tab, and select + New next to the Databricks
Linked service drop down. Here, you will configure a new linked service which
will serve as the connection to your Databricks cluster.
6. On the New Linked Service dialog, enter the following:
o Name: AzureDatabricks
o Connect via integration runtime: Leave set to Default.
o Account selection method: From Azure subscription
o Azure subscription: Choose your Azure Subscription.
o Databricks workspace: Pick your Databricks workspace to populate
the Domain automatically.
o Select cluster: Existing interactive cluster
7. Leave the form open and open your Azure Databricks workspace in another
browser tab. You will generate and retrieve the Access token here.
8. In Azure Databricks, select the Account icon in the top corner of the window,
then select User Settings.
9. Select Generate New Token under the Access Tokens tab. Enter ADF
access for the comment and leave the lifetime at 90 days. Select Generate.
10. Copy the generated token and paste it into a text editor such as Notepad
for a later step.
11. Switch back to your Azure Data Factory screen and paste the generated token
into the Access token field within the form. After a moment, select your
cluster underneath Choose from existing clusters. Select Create.
12. Switch back to Azure Databricks. Select Workspace in the menu. Select
the Exercise 5 folder, then open notebook 01 Deploy for Batch Scoring.
Examine the content, but don't run any of the cells yet. In cmd 4, you need to
replace STORAGE-ACCOUNT-NAME with the name of the blob storage
account you copied in Exercise 1.
13. Switch back to your Azure Data Factory screen. Select the Settings tab, then
browse to your Exercise 5/01 Deploy for Batch Score notebook to populate
the Notebook path field.
14. The final step is to connect the Copy activity with the Notebook activity.
Select the small green box on the side of the copy activity and drag the arrow
onto the Notebook activity on the design surface. This means that the
copy activity must complete processing and generate its files in your storage
account before the Notebook activity runs, ensuring that the files required by
the BatchScore notebook are in place at the time of execution. After making
the connection, select Publish All, then Publish.
Task 2: Trigger workflow
1. Switch back to Azure Data Factory. Select your pipeline if it is not already
opened.
2. Select Trigger, then Trigger Now located above the pipeline design surface.
3. Enter 3/1/2017 into the windowStart parameter, then select OK.
4. Select Monitor in the menu. You will be able to see your pipeline activity in
progress as well as the status of past runs.
Note: You may need to restart your Azure Databricks cluster if it has
automatically terminated due to inactivity.
Exercise 6: Summarize data using Azure Databricks
Duration: 10 minutes
In this exercise, you will prepare a summary of flight delay data using Spark SQL.
Task 1: Summarize delays by airport
1. Open your Azure Databricks workspace, expand the Exercise 6 folder and
open the final notebook called 01 Explore Data.
2. Execute each cell and follow the instructions in the notebook, which explain
each step.
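For orientation before opening the notebook, the sketch below illustrates the kind of Spark SQL aggregation the 01 Explore Data notebook builds: counting flights and 15-minute-or-longer delays per origin airport. The table and column names (flight_delays_with_airport_codes, OriginAirportCode, DepDel15) are assumptions based on the data loaded earlier in the lab; the notebook itself contains the authoritative queries.

# Illustrative only -- run the notebook for the real summary used later in Power BI.
summary = spark.sql("""
    SELECT OriginAirportCode,
           COUNT(*) AS NumFlights,
           SUM(CAST(DepDel15 AS INT)) AS NumDelays
    FROM flight_delays_with_airport_codes
    GROUP BY OriginAirportCode
    ORDER BY NumDelays DESC
""")
display(summary)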
Exercise 7: Visualizing in Power BI Desktop
Duration: 20 minutes
In this exercise, you will create visualizations in Power BI Desktop.
Task 1: Obtain the JDBC connection string to your Azure
Databricks cluster
Before you begin, you must first obtain the JDBC connection string to your Azure
Databricks cluster.
1. In Azure Databricks, go to Clusters and select your cluster.
2. On the cluster edit page, scroll down to the bottom of the page,
expand Advanced Options, then select the JDBC/ODBC tab.
3. On the JDBC/ODBC tab, copy and save the first JDBC URL.
o Construct the JDBC server address that you will use when you set up
your Spark cluster connection in Power BI Desktop.
o Take the JDBC URL and do the following:
o Replace jdbc:spark with https.
o Remove everything in the path between the port number and sql,
retaining the components indicated by the boxes in the image below.
Also remove ;AuthMech=3;UID=token;PWD=<personal-access-token> from the
end of the string.
o In our example, the server address would be:
[Link]3592075405/0615-225254-need937 (see the sketch below).
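If you prefer to script the transformation instead of editing the string by hand, the following Python sketch derives the server address, assuming the JDBC URL follows the common shape with an httpPath parameter. The hostname in the example is an illustrative placeholder; the organization and cluster IDs mirror the example above.

# Sketch: convert a Databricks JDBC URL into the HTTPS server address Power BI expects.
def jdbc_to_powerbi_server(jdbc_url):
    host_part, _, params = jdbc_url.partition(";")            # split host from ;key=value options
    host = host_part.replace("jdbc:spark://", "https://")     # swap the scheme
    host = host.rsplit("/", 1)[0]                             # drop the trailing /default
    options = dict(p.split("=", 1) for p in params.split(";") if "=" in p)
    return host + "/" + options["httpPath"]                   # keep only host:port plus the httpPath

example = ("jdbc:spark://adb-1234567890123456.7.azuredatabricks.net:443/default;"
           "transportMode=http;ssl=1;"
           "httpPath=sql/protocolv1/o/3592075405/0615-225254-need937;"
           "AuthMech=3;UID=token;PWD=<personal-access-token>")
print(jdbc_to_powerbi_server(example))
# -> https://adb-1234567890123456.7.azuredatabricks.net:443/sql/protocolv1/o/3592075405/0615-225254-need937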
Task 2: Connect to Azure Databricks using Power BI Desktop
1. If you did not already do so during the before-the-lab setup, download Power
BI Desktop from [Link]us/desktop/.
2. When Power BI Desktop starts, you will need to enter your personal
information, or Sign in if you already have an account.
3. Select Get data on the screen that is displayed next.
4. Select Spark from the list of available data sources. You may enter Spark into
the search field to find it faster.
5. Select Connect.
6. On the next screen, you will be prompted for your Spark cluster information.
7. Paste the JDBC connection string you constructed into the Server field.
8. Select the HTTP protocol.
9. Select DirectQuery for the Data Connectivity mode and select OK. This option
will offload query tasks to the Azure Databricks Spark cluster, providing near-
real time querying.
10. Enter your credentials on the next screen as follows:
o Username: token
o Password: Remember that ADF Access token we generated for the
Azure Data Factory notebook activity? Paste the same value here for
the password.
11. Select Connect.
12. In the Navigator dialog, check the box next to flight_delays_summary, and
select Load.
Task 3: Create Power BI report
1. Once the data finishes loading, you will see the fields appear on the far side of
the Power BI Desktop client window.
2. From the Visualizations area, next to Fields, select the Globe icon to add a
Map visualization to the report design surface.
3. With the Map visualization still selected, drag the OriginLatLong field to
the Location field under Visualizations. Next, drag the NumDelays field
to the Bubble Size field under Visualizations.
4. You should now see a map that looks similar to the following (resize and
zoom on your map if necessary):
5. Unselect the Map visualization by selecting the white space next to the map in
the report area.
6. From the Visualizations area, select the Stacked Column Chart icon to add a
bar chart visual to the report's design surface.
7. With the Stacked Column Chart still selected, drag the DayofMonth field and
drop it into the Axis field located under Visualizations.
8. Next, drag the NumDelays field over, and drop it into the Value field.
9. Grab the corner of the new Stacked Column Chart visual on the report design
surface and drag it out to make it as wide as the bottom of your report
design surface. It should look something like the following.
10. Unselect the Stacked Column Chart visual by selecting the white space next
to it on the design surface.
11. From the Visualizations area, select the Treemap icon to add this
visualization to the report.
12. With the Treemap visualization selected, drag the OriginAirportCode field
into the Group field under Visualizations.
13. Next, drag the NumDelays field over, and drop it into the Values field.
14. Grab the corner of the Treemap visual on the report design surface and
expand it to fill the area between the map and the side edge of the design
surface. The report should now look similar to the following.
15. You can cross-filter any of the visualizations on the report by selecting one of
the other visuals within the report, as shown below. (This may take a few
seconds to update, as the data is loaded.)
16. You can save the report by choosing Save from the File menu and entering a
name and location for the file.