DP 090T00A ENU TrainerHandbook
Official
Course
DP-090T00
Implementing a Machine
Learning Solution with
Azure Databricks
Disclaimer
Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement by Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is
not responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement by Microsoft of the site or the products contained
therein.
1 http://www.microsoft.com/trademarks
EULA III
13. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic
device that you personally own or control that meets or exceeds the hardware level specified for
the particular Microsoft Instructor-Led Courseware.
14. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led
Courseware. These classes are not advertised or promoted to the general public and class attendance is restricted to individuals employed by or contracted by the corporate customer.
15. “Trainer” means (i) an academically accredited educator engaged by a Microsoft Imagine Academy
Program Member to teach an Authorized Training Session, (ii) an academically accredited educator
validated as a Microsoft Learn for Educators – Validated Educator, and/or (iii) an MCT.
16. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and
additional supplemental content designated solely for Trainers’ use to teach a training session
using the Microsoft Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint
presentations, trainer preparation guide, train the trainer materials, Microsoft One Note packs,
classroom setup guide and Pre-release course feedback form. To clarify, Trainer Content does not
include any software, virtual hard disks or virtual machines.
2. USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is licensed on a one
copy per user basis, such that you must acquire a license for each individual that accesses or uses the
Licensed Content.
●● 2.1 Below are five separate sets of use rights. Only one set of rights applies to you.
1. If you are a Microsoft Imagine Academy (MSIA) Program Member:
1. Each license acquired on behalf of yourself may only be used to review one (1) copy of the
Microsoft Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is in digital format, you may install one (1) copy on up to three (3)
Personal Devices. You may not install the Microsoft Instructor-Led Courseware on a device
you do not own or control.
2. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User who is enrolled in the Authorized Training Session, and only immediately
prior to the commencement of the Authorized Training Session that is the subject matter
of the Microsoft Instructor-Led Courseware being provided, or
2. provide one (1) End User with the unique redemption code and instructions on how they
can access one (1) digital version of the Microsoft Instructor-Led Courseware, or
3. provide one (1) Trainer with the unique redemption code and instructions on how they
can access one (1) Trainer Content.
3. For each license you acquire, you must comply with the following:
1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure each End User attending an Authorized Training Session has their own
valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the
Authorized Training Session,
3. you will ensure that each End User provided with the hard-copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agreement in a manner that is enforceable under local law prior to their accessing the Microsoft Instructor-Led Courseware,
4. you will ensure that each Trainer teaching an Authorized Training Session has their own
valid licensed copy of the Trainer Content that is the subject of the Authorized Training
Session,
5. you will only use qualified Trainers who have in-depth knowledge of and experience with
the Microsoft technology that is the subject of the Microsoft Instructor-Led Courseware
being taught for all your Authorized Training Sessions,
6. you will only deliver a maximum of 15 hours of training per week for each Authorized
Training Session that uses a MOC title, and
7. you acknowledge that Trainers that are not MCTs will not have access to all of the trainer
resources for the Microsoft Instructor-Led Courseware.
2. If you are a Microsoft Learning Competency Member:
1. Each license acquired may only be used to review one (1) copy of the Microsoft Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is in digital format, you may install one (1) copy on up to three (3) Personal Devices.
You may not install the Microsoft Instructor-Led Courseware on a device you do not own or
control.
2. For each license you acquire on behalf of an End User or MCT, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User attending the Authorized Training Session and only immediately prior to
the commencement of the Authorized Training Session that is the subject matter of the
Microsoft Instructor-Led Courseware provided, or
2. provide one (1) End User attending the Authorized Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. provide one (1) MCT with the unique redemption code and instructions on how they can access one (1) Trainer Content.
3. For each license you acquire, you must comply with the following:
1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure that each End User attending an Authorized Training Session has their
own valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of
the Authorized Training Session,
3. you will ensure that each End User provided with a hard-copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agreement in a manner that is enforceable under local law prior to their accessing the Microsoft Instructor-Led Courseware,
4. you will ensure that each MCT teaching an Authorized Training Session has their own
valid licensed copy of the Trainer Content that is the subject of the Authorized Training
Session,
5. you will only use qualified MCTs who also hold the applicable Microsoft Certification
credential that is the subject of the MOC title being taught for all your Authorized
Training Sessions using MOC,
6. you will only provide access to the Microsoft Instructor-Led Courseware to End Users,
and
7. you will only provide access to the Trainer Content to MCTs.
3. If you are a MPN Member:
1. Each license acquired on behalf of yourself may only be used to review one (1) copy of the
Microsoft Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is in digital format, you may install one (1) copy on up to three (3)
Personal Devices. You may not install the Microsoft Instructor-Led Courseware on a device
you do not own or control.
2. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User attending the Private Training Session, and only immediately prior to the
commencement of the Private Training Session that is the subject matter of the Microsoft Instructor-Led Courseware being provided, or
2. provide one (1) End User who is attending the Private Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. provide one (1) Trainer who is teaching the Private Training Session with the unique redemption code and instructions on how they can access one (1) Trainer Content.
3. For each license you acquire, you must comply with the following:
1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure that each End User attending a Private Training Session has their own
valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the
Private Training Session,
3. you will ensure that each End User provided with a hard copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agreement in a manner that is enforceable under local law prior to their accessing the Microsoft Instructor-Led Courseware,
4. you will ensure that each Trainer teaching a Private Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Private Training Session,
5. you will only use qualified Trainers who hold the applicable Microsoft Certification
credential that is the subject of the Microsoft Instructor-Led Courseware being taught
for all your Private Training Sessions,
6. you will only use qualified MCTs who hold the applicable Microsoft Certification credential that is the subject of the MOC title being taught for all your Private Training Sessions
using MOC,
7. you will only provide access to the Microsoft Instructor-Led Courseware to End Users,
and
8. you will only provide access to the Trainer Content to Trainers.
4. If you are an End User:
For each license you acquire, you may use the Microsoft Instructor-Led Courseware solely for
your personal training use. If the Microsoft Instructor-Led Courseware is in digital format, you
may access the Microsoft Instructor-Led Courseware online using the unique redemption code
provided to you by the training provider and install and use one (1) copy of the Microsoft
Instructor-Led Courseware on up to three (3) Personal Devices. You may also print one (1) copy
of the Microsoft Instructor-Led Courseware. You may not install the Microsoft Instructor-Led
Courseware on a device you do not own or control.
5. If you are a Trainer:
1. For each license you acquire, you may install and use one (1) copy of the Trainer Content in
the form provided to you on one (1) Personal Device solely to prepare and deliver an
Authorized Training Session or Private Training Session, and install one (1) additional copy
on another Personal Device as a backup copy, which may be used only to reinstall the
Trainer Content. You may not install or use a copy of the Trainer Content on a device you do
not own or control. You may also print one (1) copy of the Trainer Content solely to prepare
for and deliver an Authorized Training Session or Private Training Session.
2. If you are an MCT, you may customize the written portions of the Trainer Content that are
logically associated with instruction of a training session in accordance with the most recent
version of the MCT agreement.
3. If you elect to exercise the foregoing rights, you agree to comply with the following: (i)
customizations may only be used for teaching Authorized Training Sessions and Private
Training Sessions, and (ii) all customizations will comply with this agreement. For clarity, any
use of “customize” refers only to changing the order of slides and content, and/or not using all the slides or content; it does not mean changing or modifying any slide or content.
●● 2.2 Separation of Components. The Licensed Content is licensed as a single unit and you may not separate its components and install them on different devices.
●● 2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights
above, you may not distribute any Licensed Content or any portion thereof (including any permitted modifications) to any third parties without the express written permission of Microsoft.
●● 2.4 Third Party Notices. The Licensed Content may include third party code that Microsoft, not the third party, licenses to you under this agreement. Notices, if any, for the third party
code are included for your information only.
●● 2.5 Additional Terms. Some Licensed Content may contain components with additional
terms, conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also apply to your use of that respective component and supplement the terms
described in this agreement.
laws and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property
rights in the Licensed Content.
6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations. You must comply with all domestic and international export laws and regulations that apply to
the Licensed Content. These laws include restrictions on destinations, end users and end use. For
additional information, see www.microsoft.com/exporting.
7. SUPPORT SERVICES. Because the Licensed Content is provided “as is”, we are not obligated to
provide support services for it.
8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you
fail to comply with the terms and conditions of this agreement. Upon termination of this agreement
for any reason, you will immediately stop all use of and delete and destroy all copies of the Licensed
Content in your possession or under your control.
9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible
for the contents of any third party sites, any links contained in third party sites, or any changes or
updates to third party sites. Microsoft is not responsible for webcasting or any other form of transmission received from any third party sites. Microsoft is providing these links to third party sites to
you only as a convenience, and the inclusion of any link does not imply an endorsement by Microsoft
of the third party site.
10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.
11. APPLICABLE LAW.
1. United States. If you acquired the Licensed Content in the United States, Washington state law
governs the interpretation of this agreement and applies to claims for breach of it, regardless of
conflict of laws principles. The laws of the state where you live govern all other claims, including
claims under state consumer protection laws, unfair competition laws, and in tort.
2. Outside the United States. If you acquired the Licensed Content in any other country, the laws of
that country apply.
12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the
laws of your country. You may also have rights with respect to the party from whom you acquired the
Licensed Content. This agreement does not change your rights under the laws of your country if the
laws of your country do not permit it to do so.
13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE AFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP TO
US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL, LOST
PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.
■■ Module 0 Welcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Welcome to the Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
■■ Module 1 Introduction to Azure Databricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Getting started with Azure Databricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Working with data in Azure Databricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
■■ Module 2 Training and Evaluating Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Preparing Data for Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Training a Machine Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
■■ Module 3 Managing Experiments and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Using MLflow to Track Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Managing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
■■ Module 4 Integrating Azure Databricks and Azure Machine Learning . . . . . . . . . . . . . . . . . . . . 45
Tracking Experiments with Azure Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Deploying Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Module 0 Welcome
Course Agenda
This course includes the following modules.
Lab Environment
This course includes hands-on activities designed to help you learn by working with Azure Databricks. To
complete the labs in this course, you will need:
●● A modern web browser - for example, Microsoft Edge.
●● The lab files for this course, which are published online at https://aka.ms/mslearn-dp090.
●● A Microsoft Azure1 subscription.
If you are taking this course with a Microsoft Learning Partner, you can use an “Azure Pass” to claim a
free temporary Azure subscription. Redeem your Azure Pass code at https://www.microsoftazurepass.com, signing in with a Microsoft account that hasn't been used to redeem an Azure Pass previously.
You can complete the labs on your own computer. In some classes, a hosted environment may be
available - check with your instructor.
1 https://azure.microsoft.com
Module 1 Introduction to Azure Databricks
1 https://docs.microsoft.com/en-us/azure/databricks/scenarios/what-is-azure-databricks
Getting started with Azure Databricks
Workspaces
A workspace is an environment for accessing all of your Azure Databricks assets:
●● It groups objects (like notebooks, libraries, experiments) into folders
●● Provides access to your data
●● Provides access to the computational resources you use (clusters, jobs)
Each user has a home folder for their notebooks and libraries.
The objects stored in the Workspace root folder are: folders, notebooks, libraries, and experiments.
To perform an action on a Workspace object, we can right-click the object and choose one of the available actions.
Clusters
A cluster is a set of computational resources on which you run your code (as notebooks or jobs). We can run ETL pipelines, machine learning, data science, and analytics workloads on a cluster.
We can create:
●● An all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
●● A job cluster to run a specific job. The cluster is terminated when the job completes.
(A job is a way of running a notebook or JAR either immediately or on a scheduled basis.)
Before we can use a cluster, we have to choose one of the available runtimes.
Databricks runtimes are the set of core components that run on Azure Databricks clusters. Azure Databricks offers several types of runtimes:
●● Databricks Runtime: includes Apache Spark, plus components and updates that optimize the usability, performance, and security of big data analytics
●● Databricks Runtime for Machine Learning: a variant that adds multiple machine learning libraries, such as TensorFlow, Keras, and PyTorch
●● Databricks Light: for jobs that don’t need the advanced performance, reliability, or autoscaling of the
Databricks Runtime
To create and configure a new cluster, we have to click on the Create Cluster button and choose our
options:
To launch the cluster, we have to click the Start button and then confirm that we want to launch it. It is recommended to wait until the cluster is started.
A cluster can be customized in many ways. If we want to make third-party code available to our notebooks, we can install a library. A cluster can be provisioned with Python, Java, Scala, or R libraries via PyPI or Maven.
Once the cluster is running, we can click Edit to change its properties. To provision the cluster with additional libraries, we can click the Libraries tab and then choose Install New.
We can pick a library, and it will then be available to use in our notebooks.
More information: for more information about provisioning clusters, see libraries2 in the Azure Databricks documentation.
2 https://docs.microsoft.com/azure/databricks/libraries/
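For Python packages, an alternative to the cluster Libraries UI is a notebook-scoped install using the %pip magic command in a notebook cell (a sketch; the package name here is only an example):

```
%pip install matplotlib
```

Libraries installed this way are scoped to the current notebook session rather than the whole cluster.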
If we have small data files on the local machine that we want to analyze with Azure Databricks, we can import them to DBFS using the UI. There are two ways to upload data to DBFS with the UI:
●● Upload files to the FileStore in the Upload Data UI.
●● Upload data to a table with the Create table UI, which is also accessible via the Import & Explore Data box on the landing page.
We may also read data on cluster nodes using Spark APIs.
We can read data imported to DBFS into Apache Spark DataFrames. For example, if you import a CSV file, you can read the data using this code:
df = spark.read.csv('/FileStore/tables/nyc_taxi.csv', header="true", inferSchema="true")
We can also read data imported to DBFS in programs running on the Spark driver node using local file APIs, via the /dbfs mount point. For example, with pandas (one example of a local file API):
import pandas as pd
df = pd.read_csv('/dbfs/FileStore/tables/nyc_taxi.csv')
Importing data
To add data, we can go to the landing page and click Import & Explore Data.
To get the data into a table, there are multiple options available:
●● Upload a local file and import the data
●● Use data already existing under DBFS
●● Mount external data sources, like Azure Storage, Azure Data Lake, and more
To create a table based on a local file, we can select Upload File to upload data from your local
machine.
Once the data is uploaded, it will be available as a table or as a mount point under the DBFS file system (/FileStore).
Databricks can create a table automatically if we click Create Table with UI.
Alternatively, we can have full control over the structure of the new table by choosing Create Table in Notebook.
Azure Databricks will generate Spark code that loads our data (and we can customize it via the Spark API):
data_mount_point = '/mnt/data'
data_file_path = '/bronze/wwi-factsale.csv'
# assumes data_storage_account_name and data_storage_account_key are set earlier
dbutils.fs.mount(
    source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
    mount_point = data_mount_point,
    extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})
display(dbutils.fs.ls("/mnt/data"))
# this path is available as dbfs:/mnt/data for Spark APIs, e.g. spark.read
# this path is available as file:/dbfs/mnt/data for regular file APIs, e.g. os.listdir
Notebooks support a shorthand — %fs magic command — for accessing the dbutils filesystem module.
Most dbutils.fs commands are available using %fs magic commands:
# List the DBFS root
%fs ls
More information: for more information about DBFS, see the Databricks File System3 in the Azure
Databricks documentation.
3 https://docs.microsoft.com/azure/databricks/data/databricks-file-system
To create a notebook, we can click Workspace, browse to the desired folder, right-click, choose Create, and then select Notebook.
We give the new notebook a name and choose a default language to be used inside its code cells. A cluster also has to be specified for running the code.
For runnable cells, the following programming languages are supported: Python, Scala, R, SQL.
You may choose the default language for the cells in a notebook. You may also override that language
later.
By hovering over the Plus button below the current cell, or by using the top-right menu options, we can change the contents of the notebook: add new cells, cut/copy/export cell contents, or run a specific cell.
We can override the default language by specifying the language magic command %<language> at the
beginning of a cell.
The supported magic commands are:
●● %python
●● %r
●● %scala
●● %sql
Notebooks also support a few auxiliary magic commands:
●● %sh: Allows you to run shell code in your notebook
●● %fs: Allows you to use dbutils filesystem commands
●● %md: Allows you to include various types of documentation, including text, images, and mathematical
formulas and equations.
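For example, in a notebook whose default language is Python, a single cell can be switched to SQL with the %sql magic (a sketch; the nyc_taxi_csv table name is assumed from an earlier data import):

```
%sql
SELECT COUNT(*) AS trips
FROM nyc_taxi_csv
```

The magic command must be the first line of the cell; everything after it is interpreted in the selected language.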
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Getting Started with Azure Databricks lab.
Working with data in Azure Databricks
We can use Spark to load table data by using the sql method:
df = spark.sql("SELECT * FROM nyc_taxi_csv")
Spark supports many different data formats, such as CSV, JSON, XML, Parquet, Avro, ORC, and more.
Dataframe size
To get the number of rows available in a dataframe we can use the count() method.
4 https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame
5 https://docs.microsoft.com/azure/databricks/spark/latest/dataframes-datasets/
df.count()
Dataframe structure
To get the schema metadata for a given dataframe we can use the printSchema() method.
Each column in a given dataframe has a name, a type and a nullable flag.
df.printSchema()
Dataframe contents
Spark has a built-in function that prints the rows of a DataFrame: show()
df.show()
df.show(100, truncate=False) # show more lines, do not truncate
By default, show will only display the first 20 rows of your DataFrame and will truncate long columns.
Additional parameters are available to override these settings.
Querying DataFrames
DataFrames allow the processing of huge amounts of data. Spark uses an optimization engine to generate optimized query plans. Data is distributed over your cluster, giving you high performance for massive amounts of data.
Spark SQL is the component that introduced DataFrames, which provide support for structured and semi-structured data.
Spark has multiple interfaces (APIs) for working with DataFrames:
●● We have seen the .sql() method, which allows us to run arbitrary SQL queries on table data.
●● Another option is to use the Spark domain-specific language for structured data manipulation, available in Scala, Java, Python, and R.
DataFrame API
The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate,
and so on) that allow you to solve common data analysis problems efficiently.
A complex operation where tables are joined, filtered, and restructured is easy to write and easy to understand, is type-safe, feels natural to those with prior SQL experience, and comes with the speed of parallel processing provided by the Spark engine.
To load or save data, use read and write:
df = spark.read.format('json').load('sample/trips.json')
df.write.format('parquet').bucketBy(100, 'year', 'month').mode("overwrite").saveAsTable('table1')
16 Module 1 Introduction to Azure Databricks
These join types are supported: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
Note that filter is an alias for where.
To apply a list of conditions to a column and return an expression, use when:
from pyspark.sql import functions as F
df.select(df.name, F.when(df.age > 4, 1).when(df.age < 3, -1).otherwise(0)).show()
The available statistics are:
●● count
●● mean
●● stddev
●● min
●● max
●● arbitrary approximate percentiles specified as a percentage (for example, 75%)
To find the correlation between two specific columns, use corr.
Currently, it only supports the Pearson correlation coefficient:
df.corr('tripDistance', 'totalAmount')
More information: for more information about the Spark API, see Dataframe API6 and the Column
API7 in the Spark documentation.
Visualizing Data
Spark has a built-in show function for printing the rows of a dataframe.
Azure Databricks adds its own display capabilities, providing various other types of visualizations out of the box via the display and displayHTML functions.
The same data we've seen above as a table can be displayed as a bar chart, pie chart, histogram, or other graph. Even maps and images can be displayed:
6 https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
7 https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column
Plot options
The following display options are available:
●● We can choose the dataframe columns to be used as axes (Keys, Values).
●● We can choose to group our series of data.
●● We can choose the aggregations to be used with our grouped data (avg, sum, count, min, max).
More information: for more information about the available visualizations, see Visualizations8 in the
Azure Databricks documentation.
8 https://docs.microsoft.com/azure/databricks/notebooks/visualizations/
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Working with Data in Azure Databricks lab.
Module Review
Knowledge Check
In this lesson, you learned how to run code in your notebooks and how to perform basic manipulation of dataframes.
Use the following review questions to check your learning.
Question 1
Alice creates a notebook on Azure Databricks to train her datasets, before using them with SparkML. Which
of the following languages are supported for doing that in a notebook?
Java
Python
C#
Question 2
Bob has to ingest data in Azure Databricks, coming from multiple datasources. He will then run some ETLs
to move the data into the datalake. Which of his datasources can be mounted under DBFS?
FTP
Azure Data Lake Store
SMB
Answers
Question 1
Alice creates a notebook on Azure Databricks to train her datasets, before using them with SparkML.
Which of the following languages are supported for doing that in a notebook?
Java
■■ Python
C#
Explanation
Azure Databricks supports the following programming languages in runnable notebook cells: Python, Scala,
SQL and R.
Question 2
Bob has to ingest data in Azure Databricks, coming from multiple datasources. He will then run some
ETLs to move the data into the datalake. Which of his datasources can be mounted under DBFS?
FTP
■■ Azure Data Lake Store
SMB
Explanation
Azure Databricks allows using several datasources: Azure Blob Storage, Azure Data Lake, Cassandra, JDBC,
Kafka, Redis, Elasticsearch, files uploaded and mounted via DBFS, as well as data integrations with various
other products and databases.
Module 2 Training and Evaluating Machine
Learning Models
Traditional programming
In traditional programming, hard-coded rules and data are the inputs used to arrive at the output: answers.
You provide a traditional program with rules and data, and it gives you results, or answers.
Machine learning
The result of training a machine learning algorithm is that the algorithm has learned the rules to map the
input data to answers.
In machine learning, you train the algorithm with data and answers, also known as labels, and the
algorithm learns the rules to map the data to their respective labels.
Data Cleaning
Big Data has become part of the lexicon of organizations worldwide, as more and more organizations
look to leverage data to drive more informed business decisions. With this evolution in business deci-
sion-making, the amount of raw data collected, along with the number and diversity of data sources, is
growing at an astounding rate. Raw data, however, is often noisy and unreliable and may contain missing
values and outliers. Using such data for Machine Learning can produce misleading results. Thus, data
cleaning of the raw data is one of the most important steps in preparing data for Machine Learning. As we discussed in the previous lesson, a machine learning algorithm learns the rules from data; thus, having clean and consistent data is an important factor influencing the predictive abilities of the underlying algorithms.
The most common type of data available for machine learning is in tabular format. Tabular data is organized into rows and columns: each row describes a single observation, and each column describes a different property of the observation. Column values can be continuous (numerical), discrete (categorical), datetime (time-series), or text. Columns that are chosen as inputs to the machine learning model are also known as model features. Data cleaning deals with data quality issues such as errors, missing values, and outliers. There are several techniques for dealing with data quality issues, and we will discuss some of the common approaches below.
Duplicate records
In some situations, you may find duplicate records in a table. The easiest solution is to drop the duplicate records.
Outliers
An outlier is defined as an observation that is significantly different from all other observations in a given column. There are several ways to identify outliers; one common approach is to compute the Z-score for an observation x:
z = (x - μ) / σ
where μ is the mean and σ is the standard deviation of the column.
You can use similar strategies as imputing null values to deal with outliers. However, it is important to
note that outliers are not necessarily invalid data and, in some situations, it is perfectly valid to retain the
outliers in your training data.
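A minimal sketch of Z-score-based outlier detection in plain Python; the values and the threshold of 2 are illustrative:

```python
import statistics

# Hypothetical column of values; the last one is a clear outlier.
values = [10.0, 12.0, 11.0, 9.0, 10.5, 11.5, 50.0]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# z = (x - mean) / stdev; flag observations whose |z| exceeds a threshold.
z_scores = [(x - mean) / stdev for x in values]
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 2]
```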
Feature Engineering
Machine learning models are only as strong as the data they are trained on. Often it is important to derive
features from existing raw data that better represent the nature of the data and thus help improve the
predictive power of the machine learning algorithms. This process of generating new predictive features
from existing raw data is commonly referred to as feature engineering.
There are certainly many valid approaches to feature engineering and some of the most popular ones,
categorized by data type, are as follows:
●● Aggregation (count, sum, average, mean, median, and the like)
●● Part-of (year of date, month of date, week of date, and the like)
●● Binning (grouping entities into bins and then applying aggregations)
●● Flagging (boolean conditions resulting in True or False)
●● Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
●● Embedding (transforming one or more categorical or text features into a new set of features, possibly
with a different cardinality)
●● Deriving by example
Feature engineering is not limited to the above list and can involve domain-knowledge-based approaches for deriving features. Let's work with an example to understand the process of feature engineering. In our example, we are working with a system that gives us weather data on an hourly basis,
and we have a column in the data that is hour of day. The hour of day column is of type integer
and it can assume any integer value in the range [0, 23]. The question is, how best to represent this
data to a machine learning algorithm that can learn its cyclical nature? One approach is to engineer a set
of new features that transforms the hour of day column using sine and cosine functions. These
derived features are plotted in the figure below for the range [0, 24]:
The cosine function provides symmetrically equal weights to corresponding AM and PM hours, and the
sine function provides symmetrically opposite weights to corresponding AM and PM hours. Both func-
tions capture the cyclical nature of hour of day.
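The transformation described above can be sketched in a few lines of plain Python; the function name is illustrative:

```python
import math

def cyclical_hour_features(hour):
    """Map an hour of day in [0, 23] to a (sine, cosine) feature pair."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# Hours 0 and 24 map to the same point, capturing the cyclical nature:
# hour 0 gives (0, 1), while hour 12 gives (0, -1).
sin_0, cos_0 = cyclical_hour_features(0)
sin_12, cos_12 = cyclical_hour_features(12)
```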
Data Scaling
Scaling numerical features is an important part of preprocessing data for machine learning. Typically, the range of values each input feature takes varies greatly between features. Many machine learning algorithms are sensitive to the magnitude of the input features; without feature scaling, higher weights might be assigned to features with larger magnitudes, irrespective of the importance of the feature to the predicted output.
There are two common approaches to scaling numerical features: (1) normalization and (2) standardization. We will discuss each of these approaches below.
Normalization
Normalization rescales the data into the range [0, 1].
For example, for each individual value, you can subtract the minimum value for that input in the training
dataset, and then divide by the range of the values in the training dataset. The range of the values is the
difference between the maximum value and the minimum value.
Standardization
Standardization rescales the data to have mean = 0 and standard deviation = 1.
For the numeric input, you first compute the mean and standard deviation using all the data available in
the training dataset. Then, for each individual input value, you scale that value by subtracting the mean
and then dividing by the standard deviation.
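Both approaches can be sketched in plain Python; the sample values are illustrative:

```python
import statistics

# Hypothetical training values for one numeric feature.
values = [2.0, 4.0, 6.0, 8.0]

# Normalization: rescale into [0, 1] using the training min and range.
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

# Standardization: rescale to mean = 0 and standard deviation = 1.
mean = statistics.mean(values)
stdev = statistics.pstdev(values)
standardized = [(x - mean) / stdev for x in values]
```

Note that the minimum, maximum, mean, and standard deviation are computed on the training data only, and the same statistics are then reused to scale validation and test data.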
Preparing Data for Machine Learning 25
Data Encoding
A common type of data that is prevalent in machine learning is called categorical data. Categorical data takes on a discrete or limited set of values. For example, a person's gender or ethnicity is considered categorical. Let's consider the following data table:
Ordinal encoding
Ordinal encoding converts categorical data into integer codes ranging from 0 to (number of categories – 1). For example, the categories Make and Color from the above table are encoded as:
Make Encoding
A 0
G 1
T 2
Color Encoding
Red 0
Green 1
Blue 2
Using the above encoding, the transformed table is shown below:
One-hot encoding
One-hot encoding is often the recommended approach; it involves transforming each categorical value into n (= number of categories) binary values, with one of them set to 1 and all others 0. For example, the
above table can be transformed as:
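The idea can be sketched in plain Python; the category values are illustrative:

```python
# A minimal sketch of one-hot encoding; the category values are illustrative.
colors = ["Red", "Green", "Blue", "Green"]
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']

def one_hot(value, categories):
    """Return n binary values with a single 1 marking the matching category."""
    return [1 if value == c else 0 for c in categories]

# Each row has exactly one 1 and (n - 1) zeros.
encoded = [one_hot(c, categories) for c in colors]
```

In practice, Spark ML provides StringIndexer and OneHotEncoder stages that perform this transformation at scale.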
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Preparing Data for Machine Learning exercises.
Training a Machine Learning Model 27
MLLib
MLLib is a legacy approach for machine learning on Apache Spark. It builds off of Spark's Resilient
Distributed Dataset1 (RDD) data structure. This data structure forms the foundation of Apache Spark,
but additional data structures on top of the RDD, such as DataFrames, have reduced the need to work
directly with RDDs.
As of Apache Spark 2.0, the library entered a maintenance mode. This means that MLLib is still available
and has not been deprecated, but there will be no new functionality added to the library. Instead,
customers are advised to move to the org.apache.spark.ml library, commonly referred to as Spark
ML.
Spark ML
Spark ML is the primary library for machine learning development in Apache Spark. It supports DataFrames in its API, versus the classic RDD approach. This makes Spark ML an easier library to work with for data scientists, as Spark DataFrames share many common ideas with DataFrames in Pandas and R.
The most confusing part about MLLib versus Spark ML is that they are both the same library. The difference is that the "classic" MLLib namespace is org.apache.spark.mllib, whereas the Spark ML namespace is org.apache.spark.ml. Whenever possible, use the Spark ML namespace when performing new data science activities.
Splitting Data
The first step involves splitting data between training and validation datasets. Doing so allows a data
scientist to train a model with a representative portion of the data, while still retaining some percentage
as a hold-out dataset. This hold-out dataset can be useful for determining whether the training model is
overfitting–that is, latching onto the peculiarities of the training dataset rather than finding generally
applicable relationships between variables.
Dataframes support a randomSplit() method which makes this process of splitting data simple.
Training a Model
Training a model relies on three key abstractions: a transformer, an estimator, and a pipeline.
1 https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds
A transformer takes a DataFrame as an input and returns a new DataFrame as an output. Transformers
are helpful for performing feature engineering and feature selection, as the result of a transformer is
another DataFrame. An example of this might be to read in a text column, map that text column into a
set of feature vectors, and output a DataFrame with the newly mapped column. Transformers will
implement a .transform() method.
An estimator takes a DataFrame as an input and returns a model, which is itself a transformer. An example of an estimator is the LinearRegression machine learning algorithm: it accepts a DataFrame and produces a Model. Estimators implement a .fit() method.
Pipelines combine estimators and transformers and themselves implement a .fit() method. This makes it easier to chain multiple algorithms by breaking the training process into a series of stages.
Validating a Model
Once a model has been trained, it becomes possible to validate its results. Spark ML includes built-in
summary statistics for models based on the algorithm of choice. Using linear regression as an example,
the model contains a summary object which includes scores such as Root Mean Square Error (RMSE),
Mean Absolute Error (MAE), and coefficient of determination (R2, pronounced R-squared). These will be
the summary measures based on the training data.
From there, with a validation dataset, it is possible to calculate summary statistics on a never-before-seen
set of data, running the model's transform() function against the validation dataset. From there, use
evaluators such as the RegressionEvaluator to calculate measures such as RMSE, MAE, and R2.
Other Frameworks
Azure Databricks supports machine learning frameworks other than Spark ML / MLLib. For example,
Azure Databricks offers support for popular libraries like TensorFlow and PyTorch.
It is possible to install these libraries directly, but the best recommendation is to use the Databricks
Runtime for Machine Learning2. This comes with a variety of machine learning libraries pre-installed,
including TensorFlow, PyTorch, Keras, and XGBoost. It also includes libraries essential for distributed
training, allowing data scientists to take advantage of the distributed nature of Apache Spark.
For libraries which do not support distributed training, it is also possible to use a single node cluster3.
For example, PyTorch4 and TensorFlow5 both support single node use.
2 https://docs.microsoft.com/azure/databricks/runtime/mlruntime
3 https://docs.microsoft.com/azure/databricks/clusters/single-node
4 https://docs.microsoft.com/azure/databricks/applications/machine-learning/train-model/pytorch#use-pytorch-on-a-single-node
5 https://docs.microsoft.com/azure/databricks/applications/machine-learning/train-model/tensorflow#use-tensorflow-on-a-single-node
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Training and Validating a Machine Learning Model exercises.
Module Review
Knowledge Check
In this lesson, you learned how to train and evaluate a machine learning model.
Use the following review questions to check your learning.
Question 1
John is looking to train his first machine learning model. One of his inputs includes the size of the T-Shirts,
with possible values of XS, S, M, L, and XL. What is the best approach John can employ to preprocess the
T-Shirt size input feature?
Standardization
One Hot Encoding
Normalization
Question 2
Which of the following is a key abstraction for model training in Spark ML?
Calculator
Iterator
Processor
Transformer
Answers
Question 1
John is looking to train his first machine learning model. One of his inputs includes the size of the
T-Shirts, with possible values of XS, S, M, L, and XL. What is the best approach John can employ to
preprocess the T-Shirt size input feature?
Standardization
■■ One Hot Encoding
Normalization
Explanation
One Hot Encoding is often the recommended approach to encode categorical features such as T-Shirt sizes.
Whereas, Standardization and Normalization are approaches to scale numerical features.
Question 2
Which of the following is a key abstraction for model training in Spark ML?
Calculator
Iterator
Processor
■■ Transformer
Explanation
Transformers, estimators, and pipelines are the three key abstractions in Spark ML. Transformers change
the shape of DataFrames, estimators convert DataFrames to objects like trained models, and pipelines
connect together chains of transformers and estimators.
Module 3 Managing Experiments and Models
MLflow Tracking
MLflow Tracking allows data scientists to work with experiments. For each run in an experiment, a data
scientist may log parameters, versions of libraries used, evaluation metrics, and generated output files
when training machine learning models.
This provides the ability to audit the results of prior model training executions.
MLflow Projects
An MLflow Project is a way of packaging up code in a manner which allows for consistent deployment
and the ability to reproduce results. MLflow supports several environments for projects, including via
Conda, Docker, and directly on a system.
MLflow Models
MLflow offers a standardized format for packaging models for distribution. This standardized model
format allows MLflow to work with models generated from several popular libraries, including scikit-
learn, Keras, MLlib, ONNX, and more. Review the MLflow Models documentation1 for information
on the full set of supported model flavors.
1 https://mlflow.org/docs/latest/models.html
Using MLflow to Track Experiments 33
From there, MLflow Models and MLflow Projects combine with the MLflow Model Registry to allow
operations team members to deploy models in the registry, serving them either through a REST API or as
part of a batch inference solution using Azure Databricks.
MLflow Terminology
There are several terms which will be important to understand when working with MLflow. Most of these
terms are fairly common in the data science space and other products, such as Azure Machine Learning,
use very similar terminology to allow for simplified cross-product development of skills. The following
sections include key terms and concepts for each MLflow product.
MLflow Tracking
MLflow Tracking is built around runs, that is, executions of code for a data science task. Each run contains several key attributes, including:
●● Parameters - Key-value pairs which represent inputs. Use parameters to track hyperparameters, that
is, inputs to functions which affect the machine learning process.
●● Metrics - Key-value pairs which represent how the model is performing. This can include evaluation
measures such as Root Mean Square Error, and metrics can be updated throughout the course of a
run. This allows a data scientist, for example, to track Root Mean Square Error for each epoch of a
neural network.
●● Artifacts - Output files. Artifacts may be stored in any format, and can include models, images, log
files, data files, or anything else which might be important for model analysis and understanding.
These runs can be combined together into experiments, which are intended to collect and organize runs.
For example, a data scientist may create an experiment to train a classifier against a particular data set.
Each run might try a different algorithm or different set of inputs. The data scientist can then review the
individual runs in order to determine which run generated the best model.
MLflow Projects
A project in MLflow is a method of packaging data science code. This allows other data scientists or
automated processes to use the code in a consistent manner.
Each project includes at least one entry point, which is a file (either .py or .sh) that is intended to act as
the starting point for project use. Projects also specify details about the environment. This includes the
specific packages (and versions of packages) used in developing the project, as new versions of packages
may include breaking changes.
MLflow Models
A model in MLflow is a directory containing an arbitrary set of files along with an MLmodel file in the
root of the directory.
MLflow allows models to be of a particular flavor, which is a descriptor of which tool or library generated
a model. This allows MLflow to work with a wide variety of modeling libraries, such as scikit-learn,
Keras, MLlib, ONNX, and many more. Each model has a signature, which describes the expected inputs
and outputs for the model.
In this case, the experiment's name will be the name of the notebook. It is possible to set an environment variable named MLFLOW_EXPERIMENT_NAME to change the name of your experiment, should you choose.
Reviewing Experiments
Inside a notebook, the Experiment menu option displays a context bar which includes information on
runs of the current experiment.
Selecting the External Link icon in the experiment run will provide additional details on a particular run.
This link will provide the information that MLflow Tracker logged, including notes, parameters, metrics,
tags, and artifacts.
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Using MLflow to Track Experiments lab.
Managing Models
Model Management Overview
Training a great model is a start to a data science project, but having a trained model that existed in a
notebook on a cluster at one point in time will not be enough. This is where model management comes
into play.
The two key steps for model management in MLflow are registration and versioning of models. With
registration, a data scientist stores the details of a model in the MLflow Model Registry, along with a
name for ease of access. Users can retrieve the model from the registry and use that model to perform
inference on new data sets. Further, it is possible to serve models on Azure Databricks or in Azure
Machine Learning, automatically generating a REST API to interact with the model.
Once a model is out in production, there is still more work to do. As models change over time, model
management becomes a process of training new candidate models, comparing them to the current version and
prior candidate models, and determining whether a candidate is worthy of becoming the next production
model. MLflow's versioning system makes this easy by labeling new versions of models and retaining
information on prior model versions automatically. This allows a data scientist to perform testing on a
variety of model versions and ensure that new models are performing better than older models.
Registering a Model
Once you have a model trained using the library of your choice, the next step is to register that model.
Registration allows MLflow to keep track of the model, retaining details on how the model performed in
training as well as the contents of the model itself.
On the run details page, select the folder which contains the model and then select Register Model.
If you have not already created the model before, select the Model drop-down list and choose + Create
New Model.
Choose an appropriate name for the model and then select Register.
At this point, model registration will occur and you will have a new model. Navigate to the Models menu to view the model. You can reference the model in code using the following method:
model = mlflow.sklearn.load_model(
    model_uri=f"models:/{model_name}/{model_version}")
Model Versioning
With machine learning, model training is not a one-time process. Instead, models will update over time.
Keeping track of these changes is possible in MLflow using versioning.
On the run details page, select the folder which contains the model and then select Register Model.
Because you have already created a model, select the Model drop-down list and choose the appropriate
model name.
After performing this transition, return to the model details page; the Stage column will contain information on the newly transitioned model version.
After performing this transition, use the following method to retrieve a model at a particular stage:
import mlflow.pyfunc

model_uri = "models:/{model_name}/{model_stage}".format(
    model_name=model_name, model_stage=model_stage)
model = mlflow.pyfunc.load_model(model_uri)
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Managing Models lab.
Module Review
Knowledge Check
In this module, you learned how to manage experiments and models using MLflow in Azure Databricks.
Use the following review questions to check your learning.
Question 1
Which of the following is the name of an MLflow component?
MLflow Framework
MLflow Tracking
MLflow Training
Question 2
An experiment is best defined as which of the following statements?
A collection of tests
A collection of runs
A collection of notebooks
Answers
Question 1
Which of the following is the name of an MLflow component?
MLflow Framework
■■ MLflow Tracking
MLflow Training
Explanation
MLflow is made up of four key components: a Model component, which provides a standard for shaping
models; a Model Registry, which allows registration and versioning of models; Projects, which package data
science code; and Tracking, which retains information on the execution of data science code.
Question 2
An experiment is best defined as which of the following statements?
A collection of tests
■■ A collection of runs
A collection of notebooks
Explanation
An experiment is defined as a collection of runs. A run is defined as the execution of code for a data science
task.
Module 4 Integrating Azure Databricks and
Azure Machine Learning
Built on the Microsoft Azure cloud platform, Azure Machine Learning enables you to manage:
●● Scalable on-demand compute for machine learning workloads.
●● Data storage and connectivity to ingest data from a wide range of sources.
●● Machine learning workflow orchestration to automate model training, deployment, and management
processes.
●● Model registration and management, so you can track multiple versions of models and the data on
which they were trained.
●● Metrics and monitoring for training experiments, datasets, and published services.
●● Model deployment for real-time and batch inferencing.
1 https://www.mlflow.org/
2 https://mlflow.org/docs/latest/quickstart.html#using-the-tracking-api
Tracking Experiments with Azure Machine Learning 47
Your model training and logging code is provided within the with block.
Next, you can use MLflow’s log_artifact() API to save model artifacts such as your Predicted vs
True curve as shown:
import matplotlib.pyplot as plt
Next, when you open the Outputs + logs tab, you will observe the model artifacts that were logged via the MLflow tracking APIs.
In summary, using the MLflow integration with Azure Machine Learning, you can run experiments in Azure Databricks and leverage the Azure Machine Learning workspace as a centralized, secure, and scalable solution for storing model training metrics and artifacts.
databricks_compute.wait_for_completion(True)
script_directory = "./scripts"
script_name = "process_data.py"
dataset_name = "nyc-taxi-dataset"
The above step defines the configuration to create a new Databricks job cluster to run the Python script. The cluster is created on the fly to run the script and is subsequently deleted after the step execution is completed.
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Running experiments in Azure Machine Learning exercises.
Deploying Models
Model Deployment Overview
In machine learning, model deployment is the process by which you integrate your trained machine learning models into a production environment so that your business or end-user applications can use the model's predictions to make decisions or gain insights into your data. The most common way to deploy a model to Azure Machine Learning from Azure Databricks is as a real-time inferencing service. Here, the term inferencing refers to the use of a trained model to make predictions on new input data on which the model has not been trained.
In Azure Machine Learning, you can create real-time inferencing solutions by deploying a model as a
real-time service hosted on a containerized platform such as Azure Kubernetes Service (AKS).
To deploy a model, you must first register it in your Azure Machine Learning workspace:
from azureml.core import Model

model = Model.register(workspace=ws,
                       model_name='nyc-taxi-fare',
                       model_path='model.pkl',  # local path
                       description='Model to predict taxi fares in NYC.')
Creating an Environment
Azure Machine Learning environments encapsulate the runtime in which your machine
learning code runs. They define Python packages, environment variables, Docker settings, and
other attributes in a declarative fashion. The code snippet below shows an example of how you can create
an environment for your deployment:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

my_env_name = "nyc-taxi-env"
myenv = Environment.get(workspace=ws, name='AzureML-Minimal').clone(my_env_name)

conda_dep = CondaDependencies()
conda_dep.add_pip_package("numpy==1.18.1")
conda_dep.add_pip_package("pandas==1.1.5")
conda_dep.add_pip_package("joblib==0.14.1")
conda_dep.add_pip_package("scikit-learn==0.24.1")
conda_dep.add_pip_package("sklearn-pandas==2.1.0")
myenv.python.conda_dependencies = conda_dep
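The deployment step later in this lesson references an inference configuration, which pairs this environment with a scoring script. As a minimal sketch, assuming a hypothetical score.py that defines the init() and run() functions expected by Azure Machine Learning:

```python
from azureml.core.model import InferenceConfig

# 'score.py' is a hypothetical entry script that loads the model in init()
# and returns predictions from run()
inference_config = InferenceConfig(entry_script='score.py',
                                   environment=myenv)
```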
The following code creates an AKS compute target on which to host the deployed service:
from azureml.core.compute import AksCompute, ComputeTarget

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)
With the compute target created, you can now define the deployment configuration, which sets the
target-specific compute specification for the containerized deployment:
from azureml.core.webservice import AksWebservice

deploy_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
The code to configure an ACI deployment is similar, except that you do not need to explicitly create an
ACI compute target, and you use the deploy_configuration method of the AciWebservice class in the
azureml.core.webservice namespace. Similarly, you can use the LocalWebservice class in the same
namespace to configure a local Docker-based service.
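For comparison, a sketch of the equivalent ACI configuration (resource sizes are assumptions) might look like this:

```python
from azureml.core.webservice import AciWebservice

# No compute target to create; ACI provisions the container on demand
deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
```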
With the compute target and deployment configuration in place, you can deploy the model:
service = Model.deploy(workspace=ws,
                       name='nyc-taxi-service',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deploy_config,
                       deployment_target=production_cluster)
service.wait_for_deployment(show_output=True)
For ACI or local services, you can omit the deployment_target parameter (or set it to None).
More Information: For more information about deploying models with Azure Machine Learning, see
Deploy models with Azure Machine Learning3 in the documentation.
Troubleshooting Deployment
There are a lot of elements to a service deployment, including the trained model, the runtime environ-
ment configuration, the scoring script, the container image, and the container host. Troubleshooting a
failed deployment, or an error when consuming a deployed service can be complex.
Note: To view the state of a service, you must use the compute-specific service type (for example,
AksWebservice) and not a generic Webservice object.
For an operational service, the state should be Healthy.
The logs include detailed information about the provisioning of the service and the requests it has
processed, and can often provide insight into the cause of unexpected errors.
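As a sketch, you could retrieve the service by name (assuming the 'nyc-taxi-service' deployment from earlier) and inspect its state and logs:

```python
from azureml.core.webservice import AksWebservice

# Retrieve the deployed service using the compute-specific service type
service = AksWebservice(name='nyc-taxi-service', workspace=ws)
print(service.state)       # 'Healthy' for an operational service
print(service.get_logs())  # provisioning and request logs
```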
As a troubleshooting technique, you can deploy a model as a service in a local Docker container:
from azureml.core.webservice import LocalWebservice

deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, 'test-svc', [model], inference_config, deployment_config)
3 https://aka.ms/AA70zfv
You can then test the locally deployed service using the SDK:
print(service.run(input_data = json_data))
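The json_data value here is a JSON-serialized input payload. As a sketch using the standard json module (the feature values below are hypothetical and must match what your scoring script expects):

```python
import json

# Hypothetical input: two observations with made-up feature values
data = {"data": [[1, 2.5, 1, 0.0],
                 [3, 10.2, 2, 1.5]]}
json_data = json.dumps(data)
```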
You can then troubleshoot runtime issues by making changes to the scoring file that is referenced in the
inference configuration, and reloading the service without redeploying it (something you can only do
with a local service):
service.reload()
print(service.run(input_data = json_data))
Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Deploying Models in Azure Machine Learning exercises.
Module Review
Knowledge Check
In this module, you learned how to integrate Azure Databricks and Azure Machine Learning.
Use the following review questions to check your learning.
Question 1
What is the correct method to log a model metric, _rmse, in MLflow?
mlflow.log("rmse", _rmse)
mlflow.log_artifact("rmse", _rmse)
mlflow.log_metric("rmse", _rmse)
Question 2
To support real-time inferencing in production applications, which is the best choice as a deployment target
for the scoring web service?
Azure Kubernetes Service (AKS)
Azure Container Instances (ACI)
Azure Machine Learning Compute Clusters
Answers
Question 1
What is the correct method to log a model metric, _rmse, in MLflow?
mlflow.log("rmse", _rmse)
mlflow.log_artifact("rmse", _rmse)
■■ mlflow.log_metric("rmse", _rmse)
Explanation
The MLflow module provides “fluent” APIs, and log_metric() is the correct method to log a model metric.
Question 2
To support real-time inferencing in production applications, which is the best choice as a deployment
target for the scoring web service?
■■ Azure Kubernetes Service (AKS)
Azure Container Instances (ACI)
Azure Machine Learning Compute Clusters
Explanation
AKS is recommended for high-scale production deployments. AKS provides fast response times and
autoscaling of the deployed service.