Microsoft Official Course

DP-090T00
Implementing a Machine Learning Solution with Azure Databricks

Disclaimer

Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations or warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement by Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is
not responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement by Microsoft of the site or the products contained
therein.

© 2019 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at http://www.microsoft.com/trademarks1 are trademarks of the
Microsoft group of companies. All other trademarks are property of their respective owners.

1 http://www.microsoft.com/trademarks
EULA

MICROSOFT LICENSE TERMS


MICROSOFT INSTRUCTOR-LED COURSEWARE
These license terms are an agreement between Microsoft Corporation (or based on where you live, one
of its affiliates) and you. Please read them. They apply to your use of the content accompanying this
agreement which includes the media on which you received it, if any. These license terms also apply to
Trainer Content and any updates and supplements for the Licensed Content unless other terms accompa-
ny those items. If so, those terms apply.
BY ACCESSING, DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS.
IF YOU DO NOT ACCEPT THEM, DO NOT ACCESS, DOWNLOAD OR USE THE LICENSED CONTENT.
If you comply with these license terms, you have the rights below for each license you acquire.
1. DEFINITIONS.
1. “Authorized Learning Center” means a Microsoft Imagine Academy (MSIA) Program Member,
Microsoft Learning Competency Member, or such other entity as Microsoft may designate from
time to time.
2. “Authorized Training Session” means the instructor-led training class using Microsoft Instruc-
tor-Led Courseware conducted by a Trainer at or through an Authorized Learning Center.
3. “Classroom Device” means one (1) dedicated, secure computer that an Authorized Learning Center
owns or controls that is located at an Authorized Learning Center’s training facilities that meets or
exceeds the hardware level specified for the particular Microsoft Instructor-Led Courseware.
4. “End User” means an individual who is (i) duly enrolled in and attending an Authorized Training
Session or Private Training Session, (ii) an employee of an MPN Member (defined below), or (iii) a
Microsoft full-time employee, a Microsoft Imagine Academy (MSIA) Program Member, or a
Microsoft Learn for Educators – Validated Educator.
5. “Licensed Content” means the content accompanying this agreement which may include the
Microsoft Instructor-Led Courseware or Trainer Content.
6. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training
session to End Users on behalf of an Authorized Learning Center or MPN Member, and (ii) current-
ly certified as a Microsoft Certified Trainer under the Microsoft Certification Program.
7. “Microsoft Instructor-Led Courseware” means the Microsoft-branded instructor-led training course
that educates IT professionals, developers, students at an academic institution, and other learners
on Microsoft technologies. A Microsoft Instructor-Led Courseware title may be branded as MOC,
Microsoft Dynamics, or Microsoft Business Group courseware.
8. “Microsoft Imagine Academy (MSIA) Program Member” means an active member of the Microsoft
Imagine Academy Program.
9. “Microsoft Learn for Educators – Validated Educator” means an educator who has been validated
through the Microsoft Learn for Educators program as an active educator at a college, university,
community college, polytechnic or K-12 institution.
10. “Microsoft Learning Competency Member” means an active member of the Microsoft Partner
Network program in good standing that currently holds the Learning Competency status.
11. “MOC” means the “Official Microsoft Learning Product” instructor-led courseware known as
Microsoft Official Course that educates IT professionals, developers, students at an academic
institution, and other learners on Microsoft technologies.
12. “MPN Member” means an active Microsoft Partner Network program member in good standing.
13. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic
device that you personally own or control that meets or exceeds the hardware level specified for
the particular Microsoft Instructor-Led Courseware.
14. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led
Courseware. These classes are not advertised or promoted to the general public and class attend-
ance is restricted to individuals employed by or contracted by the corporate customer.
15. “Trainer” means (i) an academically accredited educator engaged by a Microsoft Imagine Academy
Program Member to teach an Authorized Training Session, (ii) an academically accredited educator
validated as a Microsoft Learn for Educators – Validated Educator, and/or (iii) a MCT.
16. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and
additional supplemental content designated solely for Trainers’ use to teach a training session
using the Microsoft Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint
presentations, trainer preparation guide, train the trainer materials, Microsoft One Note packs,
classroom setup guide and Pre-release course feedback form. To clarify, Trainer Content does not
include any software, virtual hard disks or virtual machines.
2. USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is licensed on a one
copy per user basis, such that you must acquire a license for each individual that accesses or uses the
Licensed Content.
●● 2.1 Below are five separate sets of use rights. Only one set of rights applies to you.
1. If you are a Microsoft Imagine Academy (MSIA) Program Member:
1. Each license acquired on behalf of yourself may only be used to review one (1) copy of the
Microsoft Instructor-Led Courseware in the form provided to you. If the Microsoft Instruc-
tor-Led Courseware is in digital format, you may install one (1) copy on up to three (3)
Personal Devices. You may not install the Microsoft Instructor-Led Courseware on a device
you do not own or control.
2. For each license you acquire on behalf of an End User or Trainer, you may either:

1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User who is enrolled in the Authorized Training Session, and only immediately
prior to the commencement of the Authorized Training Session that is the subject matter
of the Microsoft Instructor-Led Courseware being provided, or
2. provide one (1) End User with the unique redemption code and instructions on how they
can access one (1) digital version of the Microsoft Instructor-Led Courseware, or
3. provide one (1) Trainer with the unique redemption code and instructions on how they
can access one (1) Trainer Content.
3. For each license you acquire, you must comply with the following:

1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure each End User attending an Authorized Training Session has their own
valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the
Authorized Training Session,
3. you will ensure that each End User provided with the hard-copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agree-
ment in a manner that is enforceable under local law prior to their accessing the Micro-
soft Instructor-Led Courseware,
4. you will ensure that each Trainer teaching an Authorized Training Session has their own
valid licensed copy of the Trainer Content that is the subject of the Authorized Training
Session,
5. you will only use qualified Trainers who have in-depth knowledge of and experience with
the Microsoft technology that is the subject of the Microsoft Instructor-Led Courseware
being taught for all your Authorized Training Sessions,
6. you will only deliver a maximum of 15 hours of training per week for each Authorized
Training Session that uses a MOC title, and
7. you acknowledge that Trainers that are not MCTs will not have access to all of the trainer
resources for the Microsoft Instructor-Led Courseware.
2. If you are a Microsoft Learning Competency Member:
1. Each license acquired may only be used to review one (1) copy of the Microsoft Instruc-
tor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Course-
ware is in digital format, you may install one (1) copy on up to three (3) Personal Devices.
You may not install the Microsoft Instructor-Led Courseware on a device you do not own or
control.
2. For each license you acquire on behalf of an End User or MCT, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User attending the Authorized Training Session and only immediately prior to
the commencement of the Authorized Training Session that is the subject matter of the
Microsoft Instructor-Led Courseware provided, or
2. provide one (1) End User attending the Authorized Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. you will provide one (1) MCT with the unique redemption code and instructions on how
they can access one (1) Trainer Content.
3. For each license you acquire, you must comply with the following:
1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure that each End User attending an Authorized Training Session has their
own valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of
the Authorized Training Session,
3. you will ensure that each End User provided with a hard-copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agree-
ment in a manner that is enforceable under local law prior to their accessing the Micro-
soft Instructor-Led Courseware,
4. you will ensure that each MCT teaching an Authorized Training Session has their own
valid licensed copy of the Trainer Content that is the subject of the Authorized Training
Session,
5. you will only use qualified MCTs who also hold the applicable Microsoft Certification
credential that is the subject of the MOC title being taught for all your Authorized
Training Sessions using MOC,
6. you will only provide access to the Microsoft Instructor-Led Courseware to End Users,
and
7. you will only provide access to the Trainer Content to MCTs.
3. If you are a MPN Member:
1. Each license acquired on behalf of yourself may only be used to review one (1) copy of the
Microsoft Instructor-Led Courseware in the form provided to you. If the Microsoft Instruc-
tor-Led Courseware is in digital format, you may install one (1) copy on up to three (3)
Personal Devices. You may not install the Microsoft Instructor-Led Courseware on a device
you do not own or control.
2. For each license you acquire on behalf of an End User or Trainer, you may either:

1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one
(1) End User attending the Private Training Session, and only immediately prior to the
commencement of the Private Training Session that is the subject matter of the Micro-
soft Instructor-Led Courseware being provided, or
2. provide one (1) End User who is attending the Private Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. you will provide one (1) Trainer who is teaching the Private Training Session with the
unique redemption code and instructions on how they can access one (1) Trainer
Content.
3. For each license you acquire, you must comply with the following:

1. you will only provide access to the Licensed Content to those individuals who have
acquired a valid license to the Licensed Content,
2. you will ensure that each End User attending a Private Training Session has their own
valid licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the
Private Training Session,
3. you will ensure that each End User provided with a hard copy version of the Microsoft
Instructor-Led Courseware will be presented with a copy of this agreement and each End
User will agree that their use of the Microsoft Instructor-Led Courseware will be subject
to the terms in this agreement prior to providing them with the Microsoft Instructor-Led
Courseware. Each individual will be required to denote their acceptance of this agree-
ment in a manner that is enforceable under local law prior to their accessing the Micro-
soft Instructor-Led Courseware,
4. you will ensure that each Trainer teaching a Private Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Private Training Session,
5. you will only use qualified Trainers who hold the applicable Microsoft Certification
credential that is the subject of the Microsoft Instructor-Led Courseware being taught
for all your Private Training Sessions,
6. you will only use qualified MCTs who hold the applicable Microsoft Certification creden-
tial that is the subject of the MOC title being taught for all your Private Training Sessions
using MOC,
7. you will only provide access to the Microsoft Instructor-Led Courseware to End Users,
and
8. you will only provide access to the Trainer Content to Trainers.
4. If you are an End User:
For each license you acquire, you may use the Microsoft Instructor-Led Courseware solely for
your personal training use. If the Microsoft Instructor-Led Courseware is in digital format, you
may access the Microsoft Instructor-Led Courseware online using the unique redemption code
provided to you by the training provider and install and use one (1) copy of the Microsoft
Instructor-Led Courseware on up to three (3) Personal Devices. You may also print one (1) copy
of the Microsoft Instructor-Led Courseware. You may not install the Microsoft Instructor-Led
Courseware on a device you do not own or control.
5. If you are a Trainer:
1. For each license you acquire, you may install and use one (1) copy of the Trainer Content in
the form provided to you on one (1) Personal Device solely to prepare and deliver an
Authorized Training Session or Private Training Session, and install one (1) additional copy
on another Personal Device as a backup copy, which may be used only to reinstall the
Trainer Content. You may not install or use a copy of the Trainer Content on a device you do
not own or control. You may also print one (1) copy of the Trainer Content solely to prepare
for and deliver an Authorized Training Session or Private Training Session.
2. If you are an MCT, you may customize the written portions of the Trainer Content that are
logically associated with instruction of a training session in accordance with the most recent
version of the MCT agreement.
3. If you elect to exercise the foregoing rights, you agree to comply with the following: (i)
customizations may only be used for teaching Authorized Training Sessions and Private
Training Sessions, and (ii) all customizations will comply with this agreement. For clarity, any
use of “customize” refers only to changing the order of slides and content, and/or not using
all the slides or content, it does not mean changing or modifying any slide or content.
●● 2.2 Separation of Components. The Licensed Content is licensed as a single unit and you
may not separate its components and install them on different devices.
●● 2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights
above, you may not distribute any Licensed Content or any portion thereof (including any permit-
ted modifications) to any third parties without the express written permission of Microsoft.
●● 2.4 Third Party Notices. The Licensed Content may include third party code that Micro-
soft, not the third party, licenses to you under this agreement. Notices, if any, for the third party
code are included for your information only.
●● 2.5 Additional Terms. Some Licensed Content may contain components with additional
terms, conditions, and licenses regarding its use. Any non-conflicting terms in those conditions
and licenses also apply to your use of that respective component and supplements the terms
described in this agreement.
3. LICENSED CONTENT BASED ON PRE-RELEASE TECHNOLOGY. If the Licensed Content’s subject


matter is based on a pre-release version of Microsoft technology (“Pre-release”), then in addition to
the other provisions in this agreement, these terms also apply:
1. Pre-Release Licensed Content. This Licensed Content subject matter is on the Pre-release
version of the Microsoft technology. The technology may not work the way a final version of the
technology will and we may change the technology for the final version. We also may not release a
final version. Licensed Content based on the final version of the technology may not contain the
same information as the Licensed Content based on the Pre-release version. Microsoft is under no
obligation to provide you with any further content, including any Licensed Content based on the
final version of the technology.
2. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly
or through its third party designee, you give to Microsoft without charge, the right to use, share
and commercialize your feedback in any way and for any purpose. You also give to third parties,
without charge, any patent rights needed for their products, technologies and services to use or
interface with any specific parts of a Microsoft technology, Microsoft product, or service that
includes the feedback. You will not give feedback that is subject to a license that requires Micro-
soft to license its technology, technologies, or products to third parties because we include your
feedback in them. These rights survive this agreement.
3. Pre-release Term. If you are a Microsoft Imagine Academy Program Member, Microsoft Learn-
ing Competency Member, MPN Member, Microsoft Learn for Educators – Validated Educator, or
Trainer, you will cease using all copies of the Licensed Content on the Pre-release technology upon
(i) the date which Microsoft informs you is the end date for using the Licensed Content on the
Pre-release technology, or (ii) sixty (60) days after the commercial release of the technology that is
the subject of the Licensed Content, whichever is earliest (“Pre-release term”). Upon expiration or
termination of the Pre-release term, you will irretrievably delete and destroy all copies of the
Licensed Content in your possession or under your control.
4. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some
rights to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you
more rights despite this limitation, you may use the Licensed Content only as expressly permitted in
this agreement. In doing so, you must comply with any technical limitations in the Licensed Content
that only allows you to use it in certain ways. Except as expressly permitted in this agreement, you
may not:
●● access or allow any individual to access the Licensed Content if they have not acquired a valid
license for the Licensed Content,
●● alter, remove or obscure any copyright or other protective notices (including watermarks), brand-
ing or identifications contained in the Licensed Content,
●● modify or create a derivative work of any Licensed Content,
●● publicly display, or make the Licensed Content available for others to access or use,
●● copy, print, install, sell, publish, transmit, lend, adapt, reuse, link to or post, make available or
distribute the Licensed Content to any third party,
●● work around any technical limitations in the Licensed Content, or
●● reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation.
5. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to
you in this agreement. The Licensed Content is protected by copyright and other intellectual property
laws and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property
rights in the Licensed Content.
6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regula-
tions. You must comply with all domestic and international export laws and regulations that apply to
the Licensed Content. These laws include restrictions on destinations, end users and end use. For
additional information, see www.microsoft.com/exporting.
7. SUPPORT SERVICES. Because the Licensed Content is provided “as is”, we are not obligated to
provide support services for it.
8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you
fail to comply with the terms and conditions of this agreement. Upon termination of this agreement
for any reason, you will immediately stop all use of and delete and destroy all copies of the Licensed
Content in your possession or under your control.
9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible
for the contents of any third party sites, any links contained in third party sites, or any changes or
updates to third party sites. Microsoft is not responsible for webcasting or any other form of trans-
mission received from any third party sites. Microsoft is providing these links to third party sites to
you only as a convenience, and the inclusion of any link does not imply an endorsement by Microsoft
of the third party site.
10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.
11. APPLICABLE LAW.
1. United States. If you acquired the Licensed Content in the United States, Washington state law
governs the interpretation of this agreement and applies to claims for breach of it, regardless of
conflict of laws principles. The laws of the state where you live govern all other claims, including
claims under state consumer protection laws, unfair competition laws, and in tort.
2. Outside the United States. If you acquired the Licensed Content in any other country, the laws of
that country apply.
12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the
laws of your country. You may also have rights with respect to the party from whom you acquired the
Licensed Content. This agreement does not change your rights under the laws of your country if the
laws of your country do not permit it to do so.
13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS AVAILA-
BLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE AFFILIATES GIVES NO
EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY HAVE ADDITIONAL CON-
SUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT CANNOT CHANGE. TO
THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND ITS RESPECTIVE AFFILI-
ATES EXCLUDES ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICU-
LAR PURPOSE AND NON-INFRINGEMENT.
14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP TO
US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL, LOST
PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.
This limitation applies to


●● anything related to the Licensed Content, services, content (including code) on third party Internet
sites or third-party programs; and
●● claims for breach of contract, breach of warranty, guarantee or condition, strict liability, negligence,
or other tort to the extent permitted by applicable law.
It also applies even if Microsoft knew or should have known about the possibility of the damages. The
above limitation or exclusion may not apply to you because your country may not allow the exclusion
or limitation of incidental, consequential, or other damages.
Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this
agreement are provided below in French.
Remarque : Ce le contenu sous licence étant distribué au Québec, Canada, certaines des clauses
dans ce contrat sont fournies ci-dessous en français.
EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à votre seule risque et péril. Microsoft n’accorde aucune autre
garantie expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection
dues consommateurs, que ce contrat ne peut modifier. La ou elles sont permises par le droit locale, les
garanties implicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contre-
façon sont exclues.
LIMITATION DES DOMMAGES-INTÉRÊTS ET EXCLUSION DE RESPONSABILITÉ POUR LES DOMMAG-
ES. Vous pouvez obtenir de Microsoft et de ses fournisseurs une indemnisation en cas de dommages
directs uniquement à hauteur de 5,00 $ US. Vous ne pouvez prétendre à aucune indemnisation pour les
autres dommages, y compris les dommages spéciaux, indirects ou accessoires et pertes de bénéfices.
Cette limitation concerne:
●● tout ce qui est relié au le contenu sous licence, aux services ou au contenu (y compris le code)
figurant sur des sites Internet tiers ou dans des programmes tiers; et.
●● les réclamations au titre de violation de contrat ou de garantie, ou au titre de responsabilité stricte, de
négligence ou d’une autre faute dans la limite autorisée par la loi en vigueur.
Elle s’applique également, même si Microsoft connaissait ou devrait connaître l’éventualité d’un tel
dommage. Si votre pays n’autorise pas l’exclusion ou la limitation de responsabilité pour les dommages
indirects, accessoires ou de quelque nature que ce soit, il se peut que la limitation ou l’exclusion ci-dessus
ne s’appliquera pas à votre égard.
EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits
prévus par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois
de votre pays si celles-ci ne le permettent pas.
Revised April 2019
Contents

■■ Module 0 Welcome
Welcome to the Course
■■ Module 1 Introduction to Azure Databricks
Getting started with Azure Databricks
Working with data in Azure Databricks
■■ Module 2 Training and Evaluating Machine Learning Models
Preparing Data for Machine Learning
Training a Machine Learning Model
■■ Module 3 Managing Experiments and Models
Using MLflow to Track Experiments
Managing Models
■■ Module 4 Integrating Azure Databricks and Azure Machine Learning
Tracking Experiments with Azure Machine Learning
Deploying Models
Module 0 Welcome

Welcome to the Course


Introduction
Welcome to this course on Azure Databricks.
In this course, you will learn how to use Azure Databricks for machine learning workloads in the cloud. As
you work through the material and hands-on exercises in this course, you will build on your existing data
science and machine learning knowledge and learn how to leverage cloud services to perform machine
learning at scale.
The course assumes that you are familiar with Python or Scala and have experience with training machine
learning models.
After completing the course, you will be able to:
●● Create an Azure Databricks workspace, and manage compute, data, and coding environments for
machine learning workloads
●● Prepare data and train a machine learning model using Spark ML
●● Track model details and register models with MLflow
●● Run Azure Machine Learning experiments on Azure Databricks and deploy trained models onto Azure
Kubernetes Service and Azure Container Instances using Azure Machine Learning

Course Agenda
This course includes the following modules.

Module 1: Introduction to Azure Databricks


In this module, you will discover the main concepts in Azure Databricks: you will learn how to provision
an Azure Databricks workspace, how to set up a cluster, and how to use notebooks to run your code. You
will also learn about the Spark API and its datasets, how to load data from your storage, how to manipulate
it using the Spark SQL API, and how to visualize results.

Module 2: Training and Evaluating Machine Learning Models

This module introduces the concepts of training and evaluating machine learning models using Azure
Databricks. You will prepare data for model training and then use Spark ML to train and validate a
machine learning model.

Module 3: Managing Experiments and Models


In this module, you will get started with MLflow, an open source set of components to manage machine
learning models. You will learn how to use MLflow to track model results, register models, and stage
model versions to test out model changes before moving the new model to production.

Module 4: Integrating Azure Databricks and Azure Machine Learning

In this module, you will learn how to run Azure Machine Learning experiments on Azure Databricks,
tracking results in MLflow. You will also learn how to deploy trained models on Azure Kubernetes Service
or Azure Container Instances using Azure Machine Learning.

Lab Environment
This course includes hands-on activities designed to help you learn by working with Azure Databricks. To
complete the labs in this course, you will need:
●● A modern web browser - for example, Microsoft Edge.
●● The lab files for this course, which are published online at https://aka.ms/mslearn-dp090.
●● A Microsoft Azure1 subscription.
If you are taking this course with a Microsoft Learning Partner, you can use an “Azure Pass” to claim a
free temporary Azure subscription. Redeem your Azure Pass code at https://www.microsoftazurepass.com,
signing in with a Microsoft account that hasn't been used to redeem an Azure Pass previously.
You can complete the labs on your own computer. In some classes, a hosted environment may be
available - check with your instructor.

1 https://azure.microsoft.com
Module 1 Introduction to Azure Databricks

Getting started with Azure Databricks


What is Azure Databricks?
Azure Databricks is a Microsoft analytics service that is part of the Microsoft Azure cloud platform.
It offers integration between Microsoft Azure and Databricks' implementation of Apache Spark, and it
natively integrates with Azure security and data services.
Azure Databricks runs on top of a proprietary data processing engine called Databricks Runtime, an
optimized version of Apache Spark. It can deliver up to 50x faster performance for Apache Spark workloads.
Apache Spark is the core technology. Spark is an open-source analytics engine for large-scale data
processing. It provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.
In a nutshell: Azure Databricks offers a fast, easy, and collaborative Spark-based analytics service. It is used
to accelerate big data analytics, artificial intelligence, performant data lakes, interactive data science,
machine learning, and collaboration.

The main concepts in Azure Databricks


The landing page shows the fundamental concepts used in Databricks:
●● The cluster: a set of computational resources on which we run the code
●● The workspace: groups all the Databricks elements, clusters, notebooks, data
●● The notebook: a document that contains runnable code, descriptive text and visualizations
More information: for more information about Azure Databricks, see What is Azure Databricks?1 in the
documentation.

1 https://docs.microsoft.com/en-us/azure/databricks/scenarios/what-is-azure-databricks

Workspaces and Clusters


Two of the key concepts you need to be familiar with when working with Azure Databricks are workspaces
and clusters.

Workspaces
A workspace is an environment for accessing all of your Databricks elements:
●● It groups objects (like notebooks, libraries, and experiments) into folders
●● It provides access to your data
●● It provides access to the computational resources used (clusters, jobs)

Each user has a home folder for their notebooks and libraries.
The objects stored in the Workspace root folder are: folders, notebooks, libraries, and experiments.
To perform an action on a Workspace object, we can right-click the object and choose one of the available actions.

Clusters
A cluster is a set of computational resources on which you run your code (as notebooks or jobs). We can
run ETL pipelines, or machine learning, data science, analytics workloads on the cluster.
We can create:
●● An all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
●● A job cluster to run a specific job. The cluster will be terminated when the job completes
(A job is a way of running a notebook or JAR either immediately or on a scheduled basis)
Before we can use a cluster, we have to choose one of the available runtimes.
Databricks runtimes are the set of core components that run on Azure Databricks clusters. Azure Data-
bricks offers several types of runtimes:
●● Databricks Runtime: includes Apache Spark, components and updates that optimize the usability,
performance, and security for big data analytics
●● Databricks Runtime for Machine Learning: a variant that adds multiple machine learning libraries,
TensorFlow, Keras, PyTorch
●● Databricks Light: for jobs that don’t need the advanced performance, reliability, or autoscaling of the
Databricks Runtime
To create and configure a new cluster, we have to click on the Create Cluster button and choose our
options:

The new cluster will appear in the clusters list.



To launch the cluster, we have to click the Start button and then confirm the launch. It is recommended
to wait until the cluster is started.
A cluster can be customized in many ways. If you want to make third-party code available to your
notebooks, you can install a library. Your cluster can be provisioned to use Python/Java/Scala/R libraries
via PyPI or Maven.
Once the cluster is running, we can click Edit to change its properties. If we want to provision the
cluster with additional libraries, we can click on the Libraries tab and then choose Install New.

We can pick a library and it will be available later to be used in your notebooks.
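Depending on the Databricks Runtime version in use, a Python library can also be installed for the current notebook session only, using the %pip magic command. This is a hedged sketch rather than part of the UI flow described above, and the package name (mlflow) is only an example.
# Notebook-scoped install (assumes a runtime that supports %pip)
%pip install mlflow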
More information: for more information about provisioning clusters, see libraries2 in the Azure Data-
bricks documentation.

Working with data in a workspace


An Azure Databricks database is a collection of tables.
An Azure Databricks table is a collection of structured data.
We can cache, filter, and perform any operations supported by Apache Spark DataFrames on Azure
Databricks tables.
We can query tables with Spark APIs and Spark SQL.
To access our data:
●● We can import our files to DBFS using the UI
●● We can mount and use supported datasources via DBFS
We can then use Spark or local APIs to access the data.
We will be able to use a DBFS file path in our notebook to access our data, independent of its datasource.
It is possible to import existing data or code in the workspace.

2 https://docs.microsoft.com/azure/databricks/libraries/

If we have small data files on the local machine that we want to analyze with Azure Databricks, we can
import them to DBFS using the UI. There are two ways to upload data to DBFS with the UI:
●● Upload files to the FileStore in the Upload Data UI.
●● Upload data to a table with the Create table UI, which is also accessible via the Import & Explore Data
box on the landing page.
We may also read data on cluster nodes using Spark APIs.
We can read data imported to DBFS into Apache Spark DataFrames. For example, if you import a CSV file,
you can read the data using this code:
df = spark.read.csv('/FileStore/tables/nyc_taxi.csv', header="true", inferSchema="true")

We can also read data imported to DBFS in programs running on the Spark driver node using local file
APIs, which see DBFS files under the /dbfs path. For example, using pandas:
import pandas as pd
pdf = pd.read_csv('/dbfs/FileStore/tables/nyc_taxi.csv')

Importing data
To add data, we can go to the landing page and click on Import & Explore Data.
To get the data into a table, there are multiple options available:
●● Upload a local file and import the data
●● Use data already existing under DBFS
●● Mount external datasources, like Azure Storage, Azure Data Lake, and more
To create a table based on a local file, we can select Upload File to upload data from your local
machine.

Once the data is uploaded, it will be available as a table or as a mount point under the DBFS filesystem (/FileStore).
Databricks can create a table automatically if we click on Create Table with UI.

Alternatively, we can have full control over the structure of the new table by choosing Create Table in
Notebook.
Azure Databricks will generate Spark code that loads the data (and we can customize it via the Spark API).

Using DBFS. Mounted data


Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and
available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the
following benefits:
●● It allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
●● It allows you to interact with object storage using directory and file semantics instead of storage URLs.
●● It persists files to object storage, so you won’t lose data after you terminate a cluster.
The default storage location in DBFS is known as the DBFS root.
We can use the DBFS to access:
●● Local files (previously imported). For example, the tables you imported above are available under /FileStore
●● Remote files, objects kept in separate storages, as if they were on the local file system
For example, to mount a remote Azure storage account as a DBFS folder, we can use the dbutils module:
data_storage_account_name = '<data_storage_account_name>'
data_storage_account_key = '<data_storage_account_key>'

data_mount_point = '/mnt/data'
data_file_path = '/bronze/wwi-factsale.csv'

dbutils.fs.mount(
    source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
    mount_point = data_mount_point,
    extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})

display(dbutils.fs.ls("/mnt/data"))
# this path is available as dbfs:/mnt/data for spark APIs, e.g. spark.read
# this path is available as file:/dbfs/mnt/data for regular APIs, e.g. os.listdir

Notebooks support a shorthand — %fs magic command — for accessing the dbutils filesystem module.
Most dbutils.fs commands are available using %fs magic commands:
# List the DBFS root
%fs ls

# Overwrite the file "/mnt/my-file" with the string "Hello world!"


%fs put -f "/mnt/my-file" "Hello world!"
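The same file system operations can also be called from Python code through the dbutils.fs module directly; a minimal sketch mirroring the %fs commands above:
# List the DBFS root from Python code
display(dbutils.fs.ls("/"))

# Overwrite the file "/mnt/my-file" with the string "Hello world!"
dbutils.fs.put("/mnt/my-file", "Hello world!", True)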

More information: for more information about DBFS, see the Databricks File System3 in the Azure
Databricks documentation.

Working with Notebooks


A notebook is a web-based interface to a document that contains
●● Runnable code
●● Descriptive text
●● Visualizations
A notebook is a collection of runnable cells (commands). When you use a notebook, you are primarily
developing and running cells.
Runnable cells operate on files and tables. These can be run in sequence, referring to the output of
previously run cells.

3 https://docs.microsoft.com/azure/databricks/data/databricks-file-system

To create a notebook, we can click on Workspace, browse into the desired folder, right click and choose
Create then select Notebook.

We must give the new notebook a name and choose a default language for the code cells. A cluster also
has to be specified for running the code.
For runnable cells, the following programming languages are supported: Python, Scala, R, SQL.
You may choose the default language for the cells in a notebook. You may also override that language
later.

The notebook editor opens with a first empty cell



By hovering over the Plus button below the current cell or by choosing the top-right menu options, we can
change the contents of the notebook. We may add new cells, cut/copy/export the cell contents, or run a
specific cell.
We can override the default language by specifying the language magic command %<language> at the
beginning of a cell.
The supported magic commands are:
●● %python
●● %r
●● %scala
●● %sql
Notebooks also support a few auxiliary magic commands:
●● %sh: Allows you to run shell code in your notebook
●● %fs: Allows you to use dbutils filesystem commands
●● %md: Allows you to include various types of documentation, including text, images, and mathematical
formulas and equations.
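For example, in a notebook whose default language is Python, one cell can hold formatted documentation while another runs SQL; this is a minimal sketch, assuming a table named nyc_taxi_csv created from the uploaded CSV:
%md
### Trip data exploration
This cell renders as formatted documentation.

and, in a separate cell:

%sql
SELECT passengerCount, AVG(totalAmount) AS avgAmount
FROM nyc_taxi_csv
GROUP BY passengerCount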

Lab: Getting Started with Azure Databricks


Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks to configure a cluster, create a workspace and a notebook.
This lab will cover the following exercises:
●● Exercise 1: Creating an Azure Databricks Cluster
●● Exercise 2: Provisioning an Azure Databricks Workspace
●● Exercise 3: Working with Notebooks
●● Exercise 4: Using DBFS

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Getting Started with Azure Databricks lab.

Working with data in Azure Databricks


Introduction to DataFrames
Spark provides three different APIs: RDDs, DataFrames, and Datasets. The architectural foundation is the resilient
distributed dataset (RDD). The DataFrame API was released as an abstraction on top of the RDD, followed
later by the Dataset API. We'll only use DataFrames in our notebook examples.
A DataFrame4 is equivalent to a relational table in Spark SQL.
DataFrames are distributed collections of data, organized into rows and columns. Each column in a
DataFrame has a name and an associated type.
Spark DataFrames can be created from various sources, such as csv files, json, parquet files, Hive tables,
log tables, external databases.
More information: for more information about Spark data structures, see Dataframes5 in the Azure
Databricks documentation.

Using spark to load table data


Assuming we have this data available in a table

We can use spark to load the table data by using the sql method:
df = spark.sql("SELECT * FROM nyc_taxi_csv")

Using spark to load file/dbfs data


We can also read the data from the original files we've uploaded; or indeed from any other file available
in the DBFS. The code is the same regardless of whether a file is local or in remote storage that was
mounted, thanks to DBFS mountpoints.
df = spark.read.csv('dbfs:/FileStore/tables/nyc_taxi.csv', header=True,
inferSchema=True)

Spark supports many different data formats, such as csv, json, xml, parquet, avro, orc and more.
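The reading pattern stays the same for those formats; a minimal sketch (the trips.json and trips.parquet paths are hypothetical):
df_json = spark.read.json('dbfs:/FileStore/tables/trips.json')
df_parquet = spark.read.parquet('dbfs:/FileStore/tables/trips.parquet')
# generic form, naming the format explicitly
df_csv = spark.read.format('csv').option('header', True).load('dbfs:/FileStore/tables/nyc_taxi.csv')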

Dataframe size
To get the number of rows available in a dataframe we can use the count() method.
df.count()

4 https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame
5 https://docs.microsoft.com/azure/databricks/spark/latest/dataframes-datasets/

Dataframe structure
To get the schema metadata for a given dataframe we can use the printSchema() method.
Each column in a given dataframe has a name, a type and a nullable flag.
df.printSchema()

Dataframe contents
Spark has a built-in function that allows you to print the rows inside a dataframe: show()
df.show()
df.show(100, truncate=False)  # show more lines, do not truncate

By default, show() will only display the first 20 rows of the dataframe and will truncate long columns.
Additional parameters are available to override these settings.

Querying DataFrames
DataFrames allow the processing of huge amounts of data. Spark uses an optimization engine to generate
logical query plans. Data is distributed over your cluster, so you get high performance for massive
amounts of data.
Spark SQL is the component that introduced DataFrames and provides support for structured and
semi-structured data.
Spark has multiple interfaces (APIs) for dealing with dataframes:
●● We have seen the .sql() method, which allows us to run arbitrary SQL queries on table data.
●● Another option is to use the Spark domain-specific language for structured data manipulation,
available in Scala, Java, Python, and R.
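The two interfaces can express the same query; a minimal sketch, assuming the nyc_taxi_csv table and the column names used earlier in this lesson:
# Spark SQL interface
sql_df = spark.sql("SELECT passengerCount, totalAmount FROM nyc_taxi_csv WHERE tripDistance > 10")

# Equivalent DataFrame (domain-specific language) interface
dsl_df = (spark.table("nyc_taxi_csv")
               .where("tripDistance > 10")
               .select("passengerCount", "totalAmount"))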

DataFrame API
The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate,
and so on) that allow you to solve common data analysis problems efficiently.
A complex operation where tables are joined, filtered, and restructured is easy to write, easy to understand,
type-safe, feels natural for those with prior SQL experience, and comes with the added speed of
parallel processing provided by the Spark engine.
To load or save data use read and write:
df = spark.read.format('json').load('sample/trips.json')
df.write.format('parquet').bucketBy(100, 'year', 'month').mode("overwrite").saveAsTable('table1')

To get the available data in a dataframe use select:


df.select('*')
df.select('tripDistance', 'totalAmount')

To extract the first rows use take:


df.take(15)

To order the data use the sort method:


df.sort(df.tripDistance.desc())

To combine the rows in multiple dataframes use union:


df1.union(df2)

This is equivalent to UNION ALL in SQL.


To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct().
The dataframes must have the same structure/schema.
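A minimal sketch of the SQL-style set union, assuming df1 and df2 share the same schema:
deduplicated = df1.union(df2).distinct()  # behaves like UNION (not UNION ALL) in SQL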
To add or update columns use withColumn or withColumnRenamed (withColumn expects a Column expression, for example a literal built with lit from pyspark.sql.functions):
df.withColumn('isHoliday', lit(False))
df.withColumnRenamed('isDayOff', 'isHoliday')

To use aliases for the whole dataframe or specific columns:


df.alias("myTrips")
df.select(df.passengerCount.alias("numberOfPassengers"))

To create a temporary view


df.createOrReplaceTempView("tripsView")

To aggregate on the entire DataFrame without groups use agg:


df.agg({"age": "max"})

To do more complex queries use filter, groupBy and join:


people \
    .filter(people.age > 30) \
    .join(department, people.deptId == department.id) \
    .groupBy(department.name, "gender") \
    .agg({"salary": "avg", "age": "max"})

These join types are supported: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
Note that filter is an alias for where.
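A short illustration using the people dataframe from the example above; the three forms below are equivalent:
adults_1 = people.filter(people.age > 30)
adults_2 = people.where(people.age > 30)
adults_3 = people.filter("age > 30")  # a SQL expression string also works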

To use column aggregations over windows (Window comes from pyspark.sql.window; rank and min from pyspark.sql.functions):

w = Window.partitionBy("name").orderBy("age").rowsBetween(-1, 1)
df.select(rank().over(w), min('age').over(w))

To use a list of conditions for a column and return an expression use when (here F is pyspark.sql.functions imported as F):
df.select(df.name, F.when(df.age > 4, 1).when(df.age < 3, -1).otherwise(0)).show()

To check the presence of data use isNull or isNotNull:


df.filter(df.passengerCount.isNotNull())
df.filter(df.totalAmount.isNull())

To clean the data use dropna, fillna or dropDuplicates:


df1.fillna(1)          # replace nulls with the specified value
df2.dropna()           # drop rows containing null values
df3.dropDuplicates()   # drop duplicate rows

To get statistics about the dataframe use summary or describe:


df.summary().show()
df.summary("passengerCount", "min", "25%", "75%", "max").show()
df.describe(['age']).show()

Available statistics are: count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g., 75%).
To find correlations between specific columns use corr. Currently, only the Pearson correlation coefficient is supported:
df.corr('tripDistance', 'totalAmount')

More information: for more information about the Spark API, see Dataframe API6 and the Column
API7 in the Spark documentation.

Visualizing Data
Spark has a built-in show function which allows you to print the rows in a dataframe.
Azure Databricks adds its own display capabilities, providing various other types of visualizations out of
the box through the display and displayHTML functions.
The same data we've seen above as a table can be displayed as a bar chart, pie chart, histogram, or other
graphs. Even maps or images can be displayed.

6 https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
7 https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column

Plot options
The following display options are available:
●● we can choose the dataframe columns to be used as axes (Keys, Values)
●● we can choose to group our series of data
●● we can choose the aggregations to be used with our grouped data (avg, sum, count, min, max)
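For example, a grouped aggregation can be passed to display and then rendered as a bar chart by picking the plot type, keys, and values in the plot options menu; a minimal sketch, assuming the nyc_taxi dataframe (df) loaded earlier:
from pyspark.sql import functions as F

trips_by_passengers = df.groupBy('passengerCount').agg(F.avg('totalAmount').alias('avgAmount'))
display(trips_by_passengers)  # switch the output from table to bar chart in the plot options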

More information: for more information about the available visualizations, see Visualizations8 in the
Azure Databricks documentation.

8 https://docs.microsoft.com/azure/databricks/notebooks/visualizations/

Lab: Working with Data in Azure Databricks


Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks to load your data, manipulate it, and visualize the results.
This lab will cover the following exercises:
●● Exercise 1: Loading data into a dataframe
●● Exercise 2: Querying a dataframe
●● Exercise 3: Data transformations using dataframes
●● Exercise 4: Visualizing Data

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Working with Data in Azure Databricks lab.

Module Review
Knowledge Check
In this lesson, you learned how to run code in your notebooks and how to do basic manipulation of
dataframes.
Use the following review questions to check your learning.

Question 1
Alice creates a notebook on Azure Databricks to train her datasets, before using them with SparkML. Which
of the following languages are supported for doing that in a notebook?
†† Java
†† Python
†† C#

Question 2
Bob has to ingest data in Azure Databricks, coming from multiple datasources. He will then run some ETLs
to move the data into the datalake. Which of his datasources can be mounted under DBFS?
†† FTP
†† Azure Data Lake Store
†† SMB

Answers
Question 1
Alice creates a notebook on Azure Databricks to train her datasets, before using them with SparkML.
Which of the following languages are supported for doing that in a notebook?
†† Java
■■ Python
†† C#
Explanation
Azure Databricks supports the following programming languages in runnable notebook cells: Python, Scala,
SQL and R.
Question 2
Bob has to ingest data in Azure Databricks, coming from multiple datasources. He will then run some
ETLs to move the data into the datalake. Which of his datasources can be mounted under DBFS?
†† FTP
■■ Azure Data Lake Store
†† SMB
Explanation
Azure Databricks allows using several datasources: Azure Blob Storage, Azure Data Lake, Cassandra, JDBC,
Kafka, Redis, Elasticsearch, files uploaded and mounted via DBFS, as well as data integrations with various
other products and databases.
Module 2 Training and Evaluating Machine
Learning Models

Preparing Data for Machine Learning


What is Machine Learning?
Machine learning is a data science technique used to extract patterns from data, allowing computers to
identify related data and to forecast future outcomes, behaviors, and trends.

Machine Learning as the new programming paradigm

Traditional programming
In traditional programming, the inputs of hard coded rules and data are used to arrive at the output of
answers.

You provide the traditional program with Rules and Data, and it gives your results or answers.

Machine learning
The result of training a machine learning algorithm is that the algorithm has learned the rules to map the
input data to answers.

In machine learning, you train the algorithm with data and answers, also known as labels, and the
algorithm learns the rules to map the data to their respective labels.

Data Cleaning
Big Data has become part of the lexicon of organizations worldwide, as more and more organizations
look to leverage data to drive more informed business decisions. With this evolution in business
decision-making, the amount of raw data collected, along with the number and diversity of data sources, is
growing at an astounding rate. Raw data, however, is often noisy and unreliable and may contain missing
values and outliers. Using such data for Machine Learning can produce misleading results. Thus, cleaning
the raw data is one of the most important steps in preparing data for Machine Learning. As discussed in
the previous lesson, a Machine Learning algorithm learns its rules from data, so having clean and
consistent data is an important factor in the predictive ability of the underlying algorithms.
The most common type of data available for machine learning is in tabular format, typically organized as
rows and columns. In tabular data, a row describes a single observation, and each column describes a
different property of the observation. Column values can be continuous (numerical), discrete (categorical),
datetime (time-series), or text. Columns that are chosen as inputs to the Machine Learning models are also
known as model features. Data cleaning deals with data quality issues such as errors, missing values, and
outliers. There are several techniques for dealing with data quality issues, and we will discuss some of the
common approaches below.

Imputation of null values


Null values refer to unknown or missing data as well as irrelevant responses. Strategies for dealing with
this scenario include:
●● Dropping these records: Works when you do not need to use the information for downstream
workloads.
●● Adding a placeholder (e.g. -1): Allows you to see missing data later on without violating a schema.
●● Basic imputing: Allows you to have a “best guess” of what the data could have been, often by using
the “mean” or “median” of the non-missing data for numerical columns, or the “most_frequent” value of
the non-missing data for categorical columns (a minimal sketch using Spark ML follows this list).
●● Advanced imputing: Determines the “best guess” of what the data should be by using more advanced
strategies such as clustering machine learning algorithms or oversampling techniques such as SMOTE
(Synthetic Minority Over-sampling Technique).
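As a minimal sketch of basic imputing with Spark ML, assuming a dataframe df whose numeric
totalAmount column contains nulls (the column name is illustrative):
from pyspark.sql.functions import col
from pyspark.ml.feature import Imputer

# Imputer expects double/float columns, so cast first if needed.
df_numeric = df.withColumn("totalAmount", col("totalAmount").cast("double"))

# Replace nulls with the median of the non-missing values in the column.
imputer = Imputer(inputCols=["totalAmount"],
                  outputCols=["totalAmount_imputed"],
                  strategy="median")
imputed_df = imputer.fit(df_numeric).transform(df_numeric)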

Converting data types


In some situations, columns have inconsistent data types. For example, a column can contain a
combination of numbers and numbers represented as strings, like “44.5” and “25.1”. As part of data
cleaning, you often have to convert the data in the column to its correct data type.
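As an illustration, and assuming the totalAmount column arrived as strings, the column can be cast to its
proper numeric type:
from pyspark.sql.functions import col

# Cast a string column containing numbers (for example "44.5") to a numeric type.
df_typed = df.withColumn("totalAmount", col("totalAmount").cast("double"))
df_typed.printSchema()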

Duplicate records
In some situations, you find duplicate records in the table. The easiest solution is to drop the duplicate
records.

Outliers
An outlier is defined as an observation that is significantly different from all other observations in a given
column. There are several ways to identify outliers; one common approach is to compute the Z-score
for an observation x: z = (x − mean) / stddev, where mean and stddev are the mean and standard
deviation of the column.
You can use similar strategies as imputing null values to deal with outliers. However, it is important to
note that outliers are not necessarily invalid data and, in some situations, it is perfectly valid to retain the
outliers in your training data.
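A minimal sketch of filtering on the Z-score, assuming a numeric tripDistance column and the common
(but not universal) threshold of 3:
from pyspark.sql.functions import col, mean, stddev

# Compute the column's mean and standard deviation.
stats = df.select(mean("tripDistance").alias("mu"),
                  stddev("tripDistance").alias("sigma")).first()

# Keep only the rows whose Z-score magnitude is at most 3.
df_no_outliers = df.filter(
    ((col("tripDistance") - stats["mu"]) / stats["sigma"]).between(-3, 3))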

Feature Engineering
Machine learning models are only as strong as the data they are trained on. Often it is important to derive
features from existing raw data that better represent the nature of the data and thus help improve the
predictive power of the machine learning algorithms. This process of generating new predictive features
from existing raw data is commonly referred to as feature engineering.
There are certainly many valid approaches to feature engineering and some of the most popular ones,
categorized by data type, are as follows:
●● Aggregation (count, sum, average, mean, median, and the like)
●● Part-of (year of date, month of date, week of date, and the like)
●● Binning (grouping entities into bins and then applying aggregations)
●● Flagging (boolean conditions resulting in True or False)
●● Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
●● Embedding (transforming one or more categorical or text features into a new set of features, possibly
with a different cardinality)
●● Deriving by example
Feature engineering is not limited to the above list and can also involve domain knowledge-based
approaches for deriving features. Let's work with an example to understand the process of feature
engineering. In our example, we are working with a system that gives us weather data on an hourly basis,
and we have a column in the data that is hour of day. The hour of day column is of type integer
and it can assume any integer value in the range [0, 23]. The question is, how best to represent this
data to a machine learning algorithm so that it can learn its cyclical nature? One approach is to engineer a
set of new features that transform the hour of day column using sine and cosine functions. These
derived features are plotted in the figure below for the range [0, 24]:

The cosine function provides symmetrically equal weights to corresponding AM and PM hours, and the
sine function provides symmetrically opposite weights to corresponding AM and PM hours. Both func-
tions capture the cyclical nature of hour of day.
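A sketch of deriving these cyclical features with Spark SQL functions, assuming the integer column is
named hour_of_day:
import math
from pyspark.sql.functions import col, sin, cos

# Map hour_of_day (0-23) onto the unit circle so that hour 23 sits next to hour 0.
df_features = (df
    .withColumn("hour_sin", sin(col("hour_of_day") * (2.0 * math.pi / 24)))
    .withColumn("hour_cos", cos(col("hour_of_day") * (2.0 * math.pi / 24))))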

Data Scaling
Scaling numerical features is an important part of preprocessing data for machine learning. Typically, the
range of values each input feature takes varies greatly between features. Many machine learning
algorithms are sensitive to the magnitude of the input features, so without feature scaling,
higher weights might get assigned to features with higher magnitudes irrespective of the importance of
the feature to the predicted output.
There are two common approaches to scaling numerical features: (1) Normalization and (2)
Standardization. We will discuss each of these approaches below.

Normalization
Normalization rescales the data into the range [0, 1]: x' = (x − min) / (max − min).
For each individual value, you subtract the minimum value for that input in the training
dataset, and then divide by the range of the values in the training dataset. The range of the values is the
difference between the maximum value and the minimum value.

Standardization
Standardization rescales the data to have mean = 0 and standard deviation = 1: x' = (x − mean) / stddev.
For the numeric input, you first compute the mean and standard deviation using all the data available in
the training dataset. Then, for each individual input value, you scale that value by subtracting the mean
and then dividing by the standard deviation.
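A minimal sketch of both approaches with Spark ML; the column names are illustrative, and the numeric
inputs are first assembled into a vector column because the scalers operate on vectors:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler, StandardScaler

# Assemble the numeric columns into a single vector column.
assembler = VectorAssembler(inputCols=["tripDistance", "passengerCount"], outputCol="features")
assembled_df = assembler.transform(df)

# Normalization: rescale each feature to the [0, 1] range.
min_max = MinMaxScaler(inputCol="features", outputCol="features_norm")
normalized_df = min_max.fit(assembled_df).transform(assembled_df)

# Standardization: rescale each feature to mean 0 and standard deviation 1.
standard = StandardScaler(inputCol="features", outputCol="features_std", withMean=True, withStd=True)
standardized_df = standard.fit(assembled_df).transform(assembled_df)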

Data Encoding
A common type of data that is prevalent in machine learning is called categorical data. Categorical data
implies a discrete or limited set of values. For example, a person's gender or ethnicity is considered
categorical. Let's consider the following data table:

SKU Make Color Quantity Price


908721 G Blue 789 45.33
456552 T Red 244 22.91
789921 A Green 387 25.92
872266 G Blue 154 17.56
In the table above, each row describes a single observation, and each column describes a different
property of the observation. In the table, we have two types of data: numeric data such as Quantity and
Price, and categorical data such as Make and Color. In the previous lesson we looked at examples of
how to scale numeric data types. Furthermore, it is important to note that in machine learning, we
ultimately always work with numbers or, specifically, vectors. In this context, a vector is either an array
of numbers or nested arrays of arrays of numbers. So how does one encode categorical data for the
purposes of machine learning? We will look at two common approaches for encoding categorical data:
(1) Ordinal encoding, and (2) One-hot encoding.

Ordinal encoding
Ordinal encoding converts categorical data into integer codes ranging from 0 to (number of categories
− 1). For example, the categories Make and Color from the above table are encoded as:

Make Encoding
A 0
G 1
T 2

Color Encoding
Red 0
Green 1
Blue 2
Using the above encoding, the transformed table is shown below:

SKU Make Color Quantity Price


908721 1 2 789 45.33
456552 2 0 244 22.91
789921 0 1 387 25.92
872266 1 2 154 17.56

One-hot encoding
One-hot encoding is often the recommended approach, and it involves transforming each categorical
value into n (= number of categories) binary values, with one of them 1 and all others 0. For example, the
above table can be transformed as:

SKU A G T Red Green Blue Quantity Price


908721 0 1 0 0 0 1 789 45.33
456552 0 0 1 1 0 0 244 22.91
789921 1 0 0 0 1 0 387 25.92
872266 0 1 0 0 0 1 154 17.56
One-hot encoding is often preferred over ordinal encoding because it encodes each category with
equal weight. In our example above, the ordinal encoder assigned color Green = 1 and color Blue =
2, which can imply that color Blue is twice as important as color Green. With one-hot encoding, each
color is weighted equally.
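A sketch of both encodings with Spark ML, assuming string columns named Make and Color; note that
StringIndexer assigns indices by label frequency by default, so the exact codes may differ from the tables
above:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Ordinal encoding: StringIndexer maps each category to an integer index.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in ["Make", "Color"]]

# One-hot encoding: expand each index into a binary vector.
# (On Spark 2.4, use OneHotEncoderEstimator instead of OneHotEncoder.)
encoder = OneHotEncoder(inputCols=["Make_index", "Color_index"],
                        outputCols=["Make_ohe", "Color_ohe"])

encoded_df = Pipeline(stages=indexers + [encoder]).fit(df).transform(df)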

Lab: Preparing Data for Machine Learning


Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks to prepare data for Machine Learning. This lab will cover the
following exercises:
●● Exercise 1: Handling missing data
●● Exercise 2: Feature Engineering
●● Exercise 3: Scaling Numeric features
●● Exercise 4: Encoding Categorical Features

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Preparing Data for Machine Learning exercises.

Training a Machine Learning Model


Introduction to Spark ML
Azure Databricks supports several libraries for machine learning. One key library has two approaches
which are native to Apache Spark: MLLib and Spark ML.

MLLib
MLLib is a legacy approach for machine learning on Apache Spark. It builds off of Spark's Resilient
Distributed Dataset1 (RDD) data structure. This data structure forms the foundation of Apache Spark,
but additional data structures on top of the RDD, such as DataFrames, have reduced the need to work
directly with RDDs.
As of Apache Spark 2.0, the library entered a maintenance mode. This means that MLLib is still available
and has not been deprecated, but there will be no new functionality added to the library. Instead,
customers are advised to move to the org.apache.spark.ml library, commonly referred to as Spark
ML.

Spark ML
Spark ML is the primary library for machine learning development in Apache Spark. It supports DataFrames
in its API, versus the classic RDD approach. This makes Spark ML an easier library to work with for
data scientists, as Spark DataFrames share many common ideas with DataFrames in Pandas and R.
The most confusing part about MLLib versus Spark ML is that they are the same library. The
difference is that the “classic” MLLib namespace is org.apache.spark.mllib whereas the Spark ML
namespace is org.apache.spark.ml. Whenever possible, use the Spark ML namespace when performing
new data science activities.

A Typical Training and Validation Process


The process of training and validating a machine learning model using Spark ML is fairly straightforward.
The steps are as follows.

Splitting Data
The first step involves splitting data between training and validation datasets. Doing so allows a data
scientist to train a model with a representative portion of the data, while still retaining some percentage
as a hold-out dataset. This hold-out dataset can be useful for determining whether the training model is
overfitting–that is, latching onto the peculiarities of the training dataset rather than finding generally
applicable relationships between variables.
Dataframes support a randomSplit() method which makes this process of splitting data simple.
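For example, a 70/30 split with a fixed seed so that the split is reproducible:
# Split the dataframe into training and validation sets (70% / 30%).
train_df, validation_df = df.randomSplit([0.7, 0.3], seed=42)

print("Training rows:", train_df.count(), " Validation rows:", validation_df.count())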

Training a Model
Training a model relies on three key abstractions: a transformer, an estimator, and a pipeline.

1 https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

A transformer takes a DataFrame as an input and returns a new DataFrame as an output. Transformers
are helpful for performing feature engineering and feature selection, as the result of a transformer is
another DataFrame. An example of this might be to read in a text column, map that text column into a
set of feature vectors, and output a DataFrame with the newly mapped column. Transformers will
implement a .transform() method.
An estimator takes a DataFrame as an input and returns a model, which is itself a transformer. An
example of an estimator is the LinearRegression machine learning algorithm. It accepts a DataFrame
and produces a Model. Estimators implement a .fit() method.
Pipelines combine estimators and transformers and implement a .fit() method. This makes it
easier to combine multiple algorithms by breaking the training process into a series of stages.
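A minimal sketch combining one transformer and one estimator in a pipeline; the column names are
illustrative, and train_df is assumed to come from the split described earlier:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Transformer: assemble the input columns into a single feature vector.
assembler = VectorAssembler(inputCols=["tripDistance", "passengerCount"], outputCol="features")

# Estimator: a linear regression that will learn to predict the label from the features.
lr = LinearRegression(featuresCol="features", labelCol="totalAmount")

# Pipeline: chain the stages and fit them against the training data.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)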

Validating a Model
Once a model has been trained, it becomes possible to validate its results. Spark ML includes built-in
summary statistics for models based on the algorithm of choice. Using linear regression as an example,
the model contains a summary object which includes scores such as Root Mean Square Error (RMSE),
Mean Absolute Error (MAE), and coefficient of determination (R2, pronounced R-squared). These will be
the summary measures based on the training data.
From there, with a validation dataset, it is possible to calculate summary statistics on a never-before-seen
set of data by running the model's transform() function against the validation dataset. You can then use
evaluators such as the RegressionEvaluator to calculate measures such as RMSE, MAE, and R2.
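A sketch of scoring and evaluating the hold-out data, assuming the pipeline model and validation
dataframe from the previous steps:
from pyspark.ml.evaluation import RegressionEvaluator

# Apply the trained model to the never-before-seen validation data.
predictions = model.transform(validation_df)

# Evaluate the predictions with standard regression metrics.
evaluator = RegressionEvaluator(labelCol="totalAmount", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("RMSE:", rmse, " MAE:", mae, " R2:", r2)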

Other Frameworks
Azure Databricks supports machine learning frameworks other than Spark ML / MLLib. For example,
Azure Databricks offers support for popular libraries like TensorFlow and PyTorch.
It is possible to install these libraries directly, but the best recommendation is to use the Databricks
Runtime for Machine Learning2. This comes with a variety of machine learning libraries pre-installed,
including TensorFlow, PyTorch, Keras, and XGBoost. It also includes libraries essential for distributed
training, allowing data scientists to take advantage of the distributed nature of Apache Spark.
For libraries which do not support distributed training, it is also possible to use a single node cluster3.
For example, PyTorch4 and TensorFlow5 both support single node use.

Lab: Training and Validating a Machine Learning Model

Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks to train a multivariate regression model and interpret the
results. This lab will cover the following exercises:
●● Exercise 1: Training a Model
●● Exercise 2: Validating a Model

2 https://docs.microsoft.com/azure/databricks/runtime/mlruntime
3 https://docs.microsoft.com/azure/databricks/clusters/single-node
4 https://docs.microsoft.com/azure/databricks/applications/machine-learning/train-model/pytorch#use-pytorch-on-a-single-node
5 https://docs.microsoft.com/azure/databricks/applications/machine-learning/train-model/tensorflow#use-tensorflow-on-a-single-node

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Training and Validating a Machine Learning Model exercises.

Module Review
Knowledge Check
In this lesson, you learned how to train and evaluate a machine learning model.
Use the following review questions to check your learning.

Question 1
John is looking to train his first machine learning model. One of his inputs includes the size of the T-Shirts,
with possible values of XS, S, M, L, and XL. What is the best approach John can employ to preprocess the
T-Shirt size input feature?
†† Standardization
†† One Hot Encoding
†† Normalization

Question 2
Which of the following is a key abstraction for model training in Spark ML?
†† Calculator
†† Iterator
†† Processor
†† Transformer

Answers
Question 1
John is looking to train his first machine learning model. One of his inputs includes the size of the
T-Shirts, with possible values of XS, S, M, L, and XL. What is the best approach John can employ to
preprocess the T-Shirt size input feature?
†† Standardization
■■ One Hot Encoding
†† Normalization
Explanation
One Hot Encoding is often the recommended approach to encode categorical features such as T-Shirt sizes.
Whereas, Standardization and Normalization are approaches to scale numerical features.
Question 2
Which of the following is a key abstraction for model training in Spark ML?
†† Calculator
†† Iterator
†† Processor
■■ Transformer
Explanation
Transformers, estimators, and pipelines are the three key abstractions in Spark ML. Transformers change
the shape of DataFrames, estimators convert DataFrames to objects like trained models, and pipelines
connect together chains of transformers and estimators.
Module 3 Managing Experiments and Models

Using MLflow to Track Experiments


What is MLflow?
MLflow is an open source product designed to manage the Machine Learning development lifecycle.
That is, MLflow allows data scientists to train models, register those models, deploy the models to a web
server, and manage model updates.

The Importance of MLflow


MLflow is an important part of machine learning with Azure Databricks, as it integrates key operational
processes with the Azure Databricks interface. This makes it easy for data scientists to train models and
make them available without writing a great deal of code.
As a side note, MLflow will also operate on workloads outside of Azure Databricks. The examples in this
module will all use Azure Databricks but this is not a requirement.

MLflow Product Components


There are four components to MLflow: MLflow Tracking, MLflow Projects, MLflow Models, and the
MLflow Model Registry.

MLflow Tracking
MLflow Tracking allows data scientists to work with experiments. For each run in an experiment, a data
scientist may log parameters, versions of libraries used, evaluation metrics, and generated output files
when training machine learning models.
This provides the ability to audit the results of prior model training executions.

MLflow Projects
An MLflow Project is a way of packaging up code in a manner which allows for consistent deployment
and the ability to reproduce results. MLflow supports several environments for projects, including via
Conda, Docker, and directly on a system.

MLflow Models
MLflow offers a standardized format for packaging models for distribution. This standardized model
format allows MLflow to work with models generated from several popular libraries, including scikit-
learn, Keras, MLlib, ONNX, and more. Review the MLflow Models documentation1 for information
on the full set of supported model flavors.

MLflow Model Registry


The MLflow Model Registry allows data scientists to register models in a registry.

1 https://mlflow.org/docs/latest/models.html

From there, MLflow Models and MLflow Projects combine with the MLflow Model Registry to allow
operations team members to deploy models in the registry, serving them either through a REST API or as
part of a batch inference solution using Azure Databricks.

MLflow Terminology
There are several terms which will be important to understand when working with MLflow. Most of these
terms are fairly common in the data science space and other products, such as Azure Machine Learning,
use very similar terminology to allow for simplified cross-product development of skills. The following
sections include key terms and concepts for each MLflow product.

MLflow Tracking
MLflow Tracking is built around runs, that is, executions of code for a data science task. Each run contains
several key attributes, including:
●● Parameters - Key-value pairs which represent inputs. Use parameters to track hyperparameters, that
is, inputs to functions which affect the machine learning process.
●● Metrics - Key-value pairs which represent how the model is performing. This can include evaluation
measures such as Root Mean Square Error, and metrics can be updated throughout the course of a
run. This allows a data scientist, for example, to track Root Mean Square Error for each epoch of a
neural network.
●● Artifacts - Output files. Artifacts may be stored in any format, and can include models, images, log
files, data files, or anything else which might be important for model analysis and understanding.
These runs can be combined together into experiments, which are intended to collect and organize runs.
For example, a data scientist may create an experiment to train a classifier against a particular data set.
Each run might try a different algorithm or different set of inputs. The data scientist can then review the
individual runs in order to determine which run generated the best model.

MLflow Projects
A project in MLflow is a method of packaging data science code. This allows other data scientists or
automated processes to use the code in a consistent manner.

Each project includes at least one entry point, which is a file (either .py or .sh) that is intended to act as
the starting point for project use. Projects also specify details about the environment. This includes the
specific packages (and versions of packages) used in developing the project, as new versions of packages
may include breaking changes.

MLflow Models
A model in MLflow is a directory containing an arbitrary set of files along with an MLmodel file in the
root of the directory.
MLflow allows models to be of a particular flavor, which is a descriptor of which tool or library generated
a model. This allows MLflow to work with a wide variety of modeling libraries, such as scikit-learn,
Keras, MLlib, ONNX, and many more. Each model has a signature, which describes the expected inputs
and outputs for the model.

MLflow Model Registry


The MLflow Model Registry allows a data scientist to keep track of a model from MLflow Models. In
other words, the data scientist registers a model with the Model Registry, storing details such as the
name of the model. Each registered model may have multiple versions, which allow a data scientist to
keep track of model changes over time.
It is also possible to stage models. Each model version may be in one stage, such as Staging, Production,
or Archived. Data scientists and administrators may transition a model version from one stage to
the next.

Creating and Running Experiments


MLflow experiments allow data scientists to track training runs in a collection called an experiment. This
is useful for comparing changes over time or comparing the relative performance of models with different
hyperparameter values.
Creating an experiment in Azure Databricks happens automatically when you start a run. Here is an
example of starting a run in MLflow, logging two parameters, and logging one metric:
with mlflow.start_run():
    mlflow.log_param("input1", input1)
    mlflow.log_param("input2", input2)
    # Perform operations here like model training.
    mlflow.log_metric("rmse", rmse)

In this case, the experiment's name will be the name of the notebook. It is possible to export a variable
named MLFLOW_EXPERIMENT_NAME to change the name of your experiment should you choose.
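As an illustration (the experiment path shown here is hypothetical), you can either set that environment
variable before any run starts or set the experiment explicitly by name:
import os
import mlflow

# Option 1: set the environment variable that MLflow reads for the experiment name.
os.environ["MLFLOW_EXPERIMENT_NAME"] = "/Users/someone@example.com/my-experiment"

# Option 2: set the experiment explicitly by name.
mlflow.set_experiment("/Users/someone@example.com/my-experiment")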

Reviewing Experiments
Inside a notebook, the Experiment menu option displays a context bar which includes information on
runs of the current experiment.

Selecting the External Link icon in the experiment run will provide additional details on a particular run.

This link will provide the information that MLflow Tracker logged, including notes, parameters, metrics,
tags, and artifacts.

Lab: Using MLflow to Track Experiments


Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks and MLflow to run an experiment and track the results of
different experimental tests. This lab will cover the following exercises:
●● Exercise 1: Running an experiment
●● Exercise 2: Reviewing experiment metrics

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Using MLflow to Track Experiments lab.

Managing Models
Model Management Overview
Training a great model is a start to a data science project, but having a trained model that existed in a
notebook on a cluster at one point in time will not be enough. This is where model management comes
into play.
The two key steps for model management in MLflow are registration and versioning of models. With
registration, a data scientist stores the details of a model in the MLflow Model Registry, along with a
name for ease of access. Users can retrieve the model from the registry and use that model to perform
inference on new data sets. Further, it is possible to serve models on Azure Databricks or in Azure
Machine Learning, automatically generating a REST API to interact with the model.
Once a model is out in production, there is still more work to do. As models change over time, model
management becomes a process of training new candidate models, comparing to the current version and
prior candidate models, and determining whether a candidate is worthy of becoming the next production
model. MLflow's versioning system makes this easy by labeling new versions of models and retaining
information on prior model versions automatically. This allows a data scientist to perform testing on a
variety of model versions and ensure that new models are performing better than older models.

Registering a Model
Once you have a model trained using the library of your choice, the next step is to register that model.
Registration allows MLflow to keep track of the model, retaining details on how the model performed in
training as well as the contents of the model itself.

The Registration Process


Registration is possible through the Azure Databricks UI as well as through code.

Registration through the UI


Registering a model is fairly straightforward. First, start with an experiment run.

On the run details page, select the folder which contains the model and then select Register Model.

If you have not already created the model before, select the Model drop-down list and choose + Create
New Model.

Choose an appropriate name for the model and then select Register.

At this point, model registration will occur and you will have a new model. Navigate to the Models menu
to view the model.

Registration through Code


The other method to register a model is through code. There are two ways we can do this. The first
method is to register directly from an experiment.
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

The second method is to register during a run by specifying the registered_model_name argument.



with mlflow.start_run() as run:
    mlflow.log_param("param1", 123)
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="sklearn Trained Model")

At this point, model registration will occur and you will have a new model. You can reference the model
in code using the following method:
model = mlflow.sklearn.load_model(
model_uri=f"models:/{model_name}/{model_version}")

The Importance of Model Registration


Model registration allows MLflow and Azure Databricks to keep track of models. This is important for
two reasons. First, registering a model allows you to serve the model for real-time or batch scoring. This
makes the process of using a trained model easy, as data scientists no longer need to develop application
code; the serving process builds that wrapper and exposes a REST API or a method for batch scoring
automatically.
Second, registering a model allows you to create new versions of that model over time. This gives you
the opportunity to track model changes and even perform comparisons between different historical
versions of models. This helps answer a question of whether your model changes are significant–that is,
newer models are definitely better than older models–or if the newer models are “chasing noise” and are
not actually better than their predecessors.

Model Versioning
With machine learning, model training is not a one-time process. Instead, models will update over time.
Keeping track of these changes is possible in MLflow using versioning.

The Versioning Process


Versioning a model using the Azure Databricks UI is essentially the same as the model registration
process. First, start with an experiment run.

On the run details page, select the folder which contains the model and then select Register Model.

Because you have already created a model, select the Model drop-down list and choose the appropriate
model name.

Select Register to complete model versioning.


At this point, you will have a new version of the model. Navigate to the Models menu to view the model
and its versions.

Staging Model Versions


In addition to creating versions of models, MLflow allows model versions to be in certain specified stages.
These include:
●● Production. This is a model version which is intended for deployment.
●● Staging. This is a model version which is intended for testing prior to taking over in production.
●● Archived. This is a model version which is no longer intended for use, usually because it has been
supplanted by a superior model version.
Model versions start out without a stage. There are two ways to transition a model version to a stage:
through the Azure Databricks UI or through code.

Staging Model Versions through the UI


In order to transition a model version to a stage through the Azure Databricks UI, select the version link
and in the Stage drop-down, select the new stage, either by requesting a transition or performing the
transition. Performing a transition requires one of the following permissions: Manage Staging Versions,
Manage Production Versions, or Manage. Any user with Read permissions or better may request a
transition.

After performing this transition, return to the model details page, where the Stage column will contain
information on the newly transitioned model version.

Staging Model Versions through Code


In order to transition a model version to a stage through code, use the following method:
from mlflow.tracking import MlflowClient

# client is an MlflowClient instance used to manage the Model Registry.
client = MlflowClient()
client.transition_model_version_stage(
    name=model_details.name,
    version=model_details.version,
    stage='Staging',
)

After performing this transition, use the following method to retrieve a model at a particular stage:
import mlflow.pyfunc

model_uri = "models:/{model_name}/{model_stage}".format(model_name=model_name, model_stage=model_stage)
model = mlflow.pyfunc.load_model(model_uri)

Lab: Managing Models


Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will use Azure Databricks and MLflow to manage a model. This includes registering and
serving the model through the user interface, followed by an exercise in registering, serving, and
versioning models through the Azure Databricks API. This lab will cover the following exercises:
●● Exercise 1: Register a Model using the UI

●● Exercise 2: Register a Model using the API

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Managing Models lab.

Module Review
Knowledge Check
In this module, you learned how to manage experiments and models using MLflow in Azure Databricks.
Use the following review questions to check your learning.

Question 1
Which of the following is the name of an MLflow component?
†† MLflow Framework
†† MLflow Tracking
†† MLflow Training

Question 2
An experiment is best defined as which of the following statements?
†† A collection of tests
†† A collection of runs
†† A collection of notebooks

Answers
Question 1
Which of the following is the name of an MLflow component?
†† MLflow Framework
■■ MLflow Tracking
†† MLflow Training
Explanation
MLflow is made up of four key components: a Model component, which provides a standard for shaping
models; a Model Registry, which allows registration and versioning of models; Projects, which package data
science code; and Tracking, which retains information on the execution of data science code.
Question 2
An experiment is best defined as which of the following statements?
†† A collection of tests
■■ A collection of runs
†† A collection of notebooks
Explanation
An experiment is defined as a collection of runs. A run is defined as the execution of code for a data science
task.
Module 4 Integrating Azure Databricks and
Azure Machine Learning

Tracking Experiments with Azure Machine Learning

What is Azure Machine Learning?
Azure Machine Learning is a platform for operating machine learning workloads in the cloud.

Built on the Microsoft Azure cloud platform, Azure Machine Learning enables you to manage:
●● Scalable on-demand compute for machine learning workloads.

●● Data storage and connectivity to ingest data from a wide range of sources.
●● Machine learning workflow orchestration to automate model training, deployment, and management
processes.
●● Model registration and management, so you can track multiple versions of models and the data on
which they were trained.
●● Metrics and monitoring for training experiments, datasets, and published services.
●● Model deployment for real-time and batch inferencing.

Running Azure Machine Learning Experiments on Databricks Compute
MLflow1 is an open-source library for managing the life cycle of your machine learning experiments.
MLflow Tracking2 is a component of MLflow that logs and tracks your training run metrics and model
artifacts, no matter your experiment's environment. A recommended approach for running Azure Machine
Learning experiments on an Azure Databricks cluster is to use MLflow Tracking and connect Azure
Machine Learning as the backend for MLflow experiments.
The following diagram illustrates that with MLflow Tracking, you track an experiment's run metrics and
store model artifacts in your Azure Machine Learning workspace.

Track AML Experiments in Azure Databricks


When running AML experiments in Azure Databricks, there are three key steps:
1. Configure MLflow tracking URI to use AML
2. Configure an MLflow experiment
3. Run your experiment

1 https://www.mlflow.org/
2 https://mlflow.org/docs/latest/quickstart.html#using-the-tracking-api

1. Configure MLflow tracking URI to use AML


In order to configure MLflow Tracking and connect Azure Machine Learning as the backend for MLFlow
experiments, you need to follow these steps as shown in the code snippet:
●● Get your AML workspace object
●● From your AML workspace object get the unique tracking URI address
●● Setup MLflow tracking URI to point to AML workspace
import mlflow
from azureml.core import Workspace

# Get your AML workspace
ws = Workspace.from_config()

# Get the unique tracking URI address to the AML workspace
tracking_uri = ws.get_mlflow_tracking_uri()

# Setup MLflow tracking URI to point to AML workspace
mlflow.set_tracking_uri(tracking_uri)

2. Configure an MLflow experiment


Provide a name for the MLflow experiment as shown below. Note that the same experiment name will
appear in Azure Machine Learning.
experiment_name = 'MLflow-AML-Exercise'
mlflow.set_experiment(experiment_name)

3. Run your experiment


Once the experiment is set up, you can start your training run with start_run() as shown below:
with mlflow.start_run() as run:
    ...
    ...

Your model training and logging code is provided within the with block.

Logging Azure Machine Learning experiment metrics with MLflow
In the previous unit we discussed how to set up Azure Machine Learning as the backend for MLflow
experiments. We also looked at how to start your model training on Azure Databricks as an MLflow
experiment. In this section, we will look at how to log model metrics and artifacts with the MLflow logging
API. These logged metrics and artifacts are then captured in the Azure Machine Learning workspace, which
provides a centralized, secure, and scalable location to store training metrics and artifacts.
In your MLflow experiment, once you train and evaluate your model, you can use the MLflow logging API,
mlflow.log_metric(), to start logging your model metrics as shown below:

import math
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

with mlflow.start_run() as run:
    ...
    ...
    # Make predictions on hold-out data
    y_predict = clf.predict(X_test)
    y_actual = y_test.values.flatten().tolist()

    # Evaluate and log model metrics on hold-out data
    rmse = math.sqrt(mean_squared_error(y_actual, y_predict))
    mlflow.log_metric('rmse', rmse)
    mae = mean_absolute_error(y_actual, y_predict)
    mlflow.log_metric('mae', mae)
    r2 = r2_score(y_actual, y_predict)
    mlflow.log_metric('R2 score', r2)

Next, you can use MLflow’s log_artifact() API to save model artifacts such as your Predicted vs
True curve as shown:
import matplotlib.pyplot as plt

with mlflow.start_run() as run:
    ...
    ...
    plt.scatter(y_actual, y_predict)
    plt.savefig("./outputs/results.png")
    mlflow.log_artifact("./outputs/results.png")

Reviewing experiment metrics and artifacts in Azure ML Studio
Since Azure Machine Learning is set up as the backend for MLflow experiments, you can review all the
training metrics and artifacts from within Azure Machine Learning Studio. From within the studio, navigate
to the Experiments tab and open the experiment run that corresponds to the MLflow experiment. In the
Metrics tab of the run, you will observe the model metrics that were logged via MLflow tracking APIs.

Next, when you open the Outputs + logs tab you will observe the model artifacts that were logged
via MLflow tracking APIs.

In summary, using the MLflow integration with Azure Machine Learning, you can run experiments in Azure
Databricks and leverage the Azure Machine Learning workspace as a centralized, secure, and scalable
solution for storing model training metrics and artifacts.

Running Azure Machine Learning Pipeline on Databricks Compute
Azure Machine Learning supports multiple types of compute for experimentation and training. Specifically,
you can run an Azure Machine Learning pipeline on Databricks compute.

What is an Azure Machine Learning Pipeline?


In Azure Machine Learning, a pipeline is a workflow of machine learning tasks in which each task is
implemented as a step. Steps can be arranged sequentially or in parallel, enabling you to build sophisticated
flow logic to orchestrate machine learning operations. Each step can be run on a specific compute
target, making it possible to combine different types of processing as required to achieve an overall goal.

Running Pipeline Step on Databricks Compute


Azure Machine Learning supports a specialized pipeline step called DatabricksStep that is set up to run a
notebook, script, or compiled JAR on a Databricks cluster. In order to run a pipeline step on a Databricks
cluster, you need to do the following:
●● Attach Azure Databricks Compute to Azure Machine Learning workspace
●● Define DatabricksStep in a Pipeline
●● Submit the Pipeline

Attaching Azure Databricks Compute


The following code example can be used to attach an existing Azure Databricks cluster:
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Specify a name for the compute (unique within the workspace)
compute_name = 'db_cluster'

# Define configuration for existing Azure Databricks cluster
db_workspace_name = 'db_workspace'
db_resource_group = 'db_resource_group'
db_access_token = '1234-abc-5678-defg-90...' # Get this from the Databricks workspace
db_config = DatabricksCompute.attach_configuration(resource_group=db_resource_group,
                                                   workspace_name=db_workspace_name,
                                                   access_token=db_access_token)

# Create the compute
databricks_compute = ComputeTarget.attach(ws, compute_name, db_config)
databricks_compute.wait_for_completion(True)

Defining DatabricksStep in a Pipeline


To create a pipeline, you must first define each step and then create a pipeline that includes the steps.
The specific configuration of each step depends on the step type. For example, the following code
defines a DatabricksStep step to run a Python script, process_data.py, on the attached Databricks
compute.
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DatabricksStep
# Import for PyPiLibrary; assumed to live in azureml.core.databricks in current SDK versions.
from azureml.core.databricks import PyPiLibrary

script_directory = "./scripts"
script_name = "process_data.py"

dataset_name = "nyc-taxi-dataset"

spark_conf = {"spark.databricks.delta.preview.enabled": "true"}

databricksStep = DatabricksStep(name = "process_data",
                                run_name = "process_data",
                                python_script_params=["--dataset_name", dataset_name],
                                spark_version = "7.3.x-scala2.12",
                                node_type = "Standard_DS3_v2",
                                spark_conf = spark_conf,
                                num_workers = 1,
                                python_script_name = script_name,
                                source_directory = script_directory,
                                pypi_libraries = [PyPiLibrary(package = 'scikit-learn'),
                                                  PyPiLibrary(package = 'scipy'),
                                                  PyPiLibrary(package = 'azureml-sdk'),
                                                  PyPiLibrary(package = 'azureml-dataprep[pandas]')],
                                compute_target = databricks_compute,
                                allow_reuse = False
                                )

The above step defines the configuration to create a new Databricks job cluster to run the Python script.
The cluster is created on the fly to run the script and is subsequently deleted after the step
execution is completed.

Submit the Pipeline


After defining the step, you can assign it to a pipeline, and run it as an experiment:
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

# Construct the pipeline
pipeline = Pipeline(workspace = ws, steps = [databricksStep])

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = "process-data-pipeline")
pipeline_run = experiment.submit(pipeline)
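As a follow-up sketch, you can block until the run finishes and then inspect its steps; this assumes the
pipeline_run object created above:
# Block until the pipeline run completes, streaming its output.
pipeline_run.wait_for_completion(show_output=True)

# Inspect the status of each step once the run has finished.
for step_run in pipeline_run.get_steps():
    print(step_run.name, step_run.get_status())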

Lab: Running experiments in Azure Machine Learning

Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will learn to run experiments in Azure Machine Learning from Azure Databricks. This lab
will cover the following exercises:
●● Exercise 1: Running an Azure ML experiment on Databricks
●● Exercise 2: Reviewing experiment metrics in Azure ML Studio

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Running experiments in Azure Machine Learning exercises.

Deploying Models
Model Deployment Overview
In machine learning, model deployment is the process by which you integrate your trained machine
learning models into a production environment so that your business or end-user applications can use
the model predictions to make decisions or gain insights into your data. The most common way to deploy
a model from Azure Databricks using Azure Machine Learning is as a real-time inferencing service. Here
the term inferencing refers to the use of a trained model to make predictions on new input data on which
the model has not been trained.

What is Real-Time Inferencing?


The model is deployed as part of a service that enables applications to request immediate, or real-time,
predictions for individual or small numbers of data observations.

In Azure Machine Learning, you can create real-time inferencing solutions by deploying a model as a
real-time service, hosted in a containerized platform such as Azure Kubernetes Service (AKS).

Azure ML Deployment Endpoints


After you have trained your machine learning model and evaluated it to the point where you are ready to
use it outside your own development or test environment, you need to deploy it somewhere. Azure
Machine Learning service simplifies this process. You can use the service components and tools to
register your model and deploy it to one of the available compute targets so it can be made
available as a web service in the Azure cloud, or on an IoT Edge device.

Available compute targets


You can use the following compute targets to host your web service deployment:

Compute target                            Usage                 Description
Local web service                         Testing/debug         Good for limited testing and troubleshooting.
Azure Kubernetes Service (AKS)            Real-time inference   Good for high-scale production deployments. Provides autoscaling and fast response times.
Azure Container Instances (ACI)           Testing               Good for low-scale, CPU-based workloads.
Azure Machine Learning Compute Clusters   Batch inference       Run batch scoring on serverless compute. Supports normal and low-priority VMs.
Azure IoT Edge (Preview)                  IoT module            Deploy and serve ML models on IoT devices.

Model Deployment Process


As we discussed in the previous unit, you can deploy a model to several kinds of compute target, including
local compute, an Azure Container Instance (ACI), an Azure Kubernetes Service (AKS) cluster, or an
Internet of Things (IoT) module. Azure Machine Learning uses containers as a deployment mechanism,
packaging the model and the code to use it as an image that can be deployed to a container in your
chosen compute target.
To deploy a model as an inferencing web service, you must perform the following tasks:

1. Register a trained model


After successfully training a model, you must register it in your Azure Machine Learning workspace. Your
real-time service will then be able to load the model when required.
To register a model from a local file, you can use the register method of the Model object as shown
here:
from azureml.core import Model

model = Model.register(workspace=ws,
model_name='nyc-taxi-fare',
model_path='model.pkl', # local path
description='Model to predict taxi fares in NYC.')

2. Define an Inference Configuration


The model will be deployed as a service that consists of:
●● A script to load the model and return predictions for submitted data.
●● An environment in which the script will be run.
You must therefore define the script and environment for the service.

Creating an Entry Script


Create the entry script (sometimes referred to as the scoring script) for the service as a Python (.py) file. It
must include two functions:
●● init(): Called when the service is initialized.

●● run(raw_data): Called when new data is submitted to the service.


Typically, you use the init function to load the model from the model registry, and use the run function
to generate predictions from the input data. The following example script shows this pattern:
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the registered model file and load it
    model_path = Model.get_model_path('nyc-taxi-fare')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Return the predictions as any JSON serializable format
    return predictions.tolist()

Creating an Environment
Azure Machine Learning environments are an encapsulation of the environment where your machine
learning training happens. They define Python packages, environment variables, Docker settings and
other attributes in declarative fashion. The below code snippet shows an example of how you can create
an environment for your deployment:
from azureml.core import Environment
from azureml.core.environment import CondaDependencies

my_env_name="nyc-taxi-env"
myenv = Environment.get(workspace=ws, name='AzureML-Minimal').clone(my_env_name)
conda_dep = CondaDependencies()
conda_dep.add_pip_package("numpy==1.18.1")
conda_dep.add_pip_package("pandas==1.1.5")
conda_dep.add_pip_package("joblib==0.14.1")
conda_dep.add_pip_package("scikit-learn==0.24.1")
conda_dep.add_pip_package("sklearn-pandas==2.1.0")
myenv.python.conda_dependencies=conda_dep

Combining the Script and Environment in an InferenceConfig
After creating the entry script and environment, you can combine them in an InferenceConfig for the
service like this:
from azureml.core.model import InferenceConfig

inference_config = InferenceConfig(entry_script='score.py',
                                   source_directory='.',
                                   environment=myenv)

3. Define a Deployment Configuration


Now that you have the entry script and environment, you need to configure the compute to which the
service will be deployed. If you are deploying to an AKS cluster, you must create the cluster and a
compute target for it before deploying:
from azureml.core.compute import ComputeTarget, AksCompute

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)

With the compute target created, you can now define the deployment configuration, which sets the
target-specific compute specification for the containerized deployment:
from azureml.core.webservice import AksWebservice

deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1,
memory_gb = 1)

The code to configure an ACI deployment is similar, except that you do not need to explicitly create an
ACI compute target, and you use the deploy_configuration method of the azureml.core.webservice.AciWebservice
class. Similarly, you can use the azureml.core.webservice.LocalWebservice class to configure a local
Docker-based service.

4. Deploy the Model


After all of the configuration is prepared, you can deploy the model. The easiest way to do this is to call
the deploy method of the Model class, like this:
from azureml.core.model import Model

service = Model.deploy(workspace=ws,
                       name = 'nyc-taxi-service',
                       models = [model],
                       inference_config = inference_config,
                       deployment_config = deploy_config,
                       deployment_target = production_cluster)
service.wait_for_deployment(show_output = True)

For ACI or local services, you can omit the deployment_target parameter (or set it to None).
More Information: For more information about deploying models with Azure Machine Learning, see
Deploy models with Azure Machine Learning3 in the documentation.
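Once the service reports a healthy state, a hedged sketch of consuming it over REST looks like the
following; the input payload must match whatever schema your scoring script expects, and the two
observations shown are purely illustrative:
import json
import requests

# The scoring endpoint exposed by the deployed web service.
endpoint = service.scoring_uri

# Two example observations; the feature values and their order are illustrative.
input_json = json.dumps({"data": [[2.5, 1], [7.3, 2]]})

headers = {"Content-Type": "application/json"}
# For AKS deployments with authentication enabled, also send the service key:
# headers["Authorization"] = "Bearer " + service.get_keys()[0]

response = requests.post(endpoint, data=input_json, headers=headers)
print(response.json())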

Troubleshooting Deployment
There are a lot of elements to a service deployment, including the trained model, the runtime environment
configuration, the scoring script, the container image, and the container host. Troubleshooting a
failed deployment, or an error when consuming a deployed service, can be complex.

Check the Service State


As an initial troubleshooting step, you can check the status of a service by examining its state:
from azureml.core.webservice import AksWebservice

# Get the deployed service
service = AksWebservice(name='classifier-service', workspace=ws)

# Check its state
print(service.state)

Note: To view the state of a service, you must use the compute-specific service type (for example
AksWebservice) and not a generic WebService object.
For an operational service, the state should be Healthy.

Review Service Logs


If a service is not healthy, or you are experiencing errors when using it, you can review its logs:
print(service.get_logs())

The logs include detailed information about the provisioning of the service and the requests it has
processed, and can often provide insight into the cause of unexpected errors.

Deploy to a Local Container


Deployment and runtime errors can be easier to diagnose by deploying the service as a container in a
local Docker instance, like this:
from azureml.core.webservice import LocalWebservice

deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, 'test-svc', [model], inference_config, deployment_config)

3 https://aka.ms/AA70zfv

You can then test the locally deployed service using the SDK:
print(service.run(input_data = json_data))

You can then troubleshoot runtime issues by making changes to the scoring file that is referenced in the
inference configuration, and reloading the service without redeploying it (something you can only do
with a local service):
service.reload()
print(service.run(input_data = json_data))

Lab: Deploying Models in Azure Machine Learning

Note: This lab requires an Azure subscription. If you have not already done so, redeem your Azure Pass
code to get set up with an Azure subscription.
In this lab, you will learn to train models in Azure Databricks and then deploy them with Azure Machine
Learning. This lab will cover the following exercises:
●● Exercise 1: Register a databricks-trained model in AML
●● Exercise 2: Deploy a service that uses the model
●● Exercise 3: Consume the deployed service

Instructions
You can complete this lab on your own computer, or a hosted lab environment may be available in your
class - check with your instructor.
1. Open the lab instructions at https://aka.ms/mslearn-dp090.
2. Complete the Deploying Models in Azure Machine Learning exercises.

Module Review
Knowledge Check
In this lesson, you learned how to integrate Azure Databricks and Azure Machine Learning.
Use the following review questions to check your learning.

Question 1
What is the correct method to log a model metric, _rmse, in MLflow?
†† mlflow.log("rmse", _rmse)
†† mlflow.log_artifact("rmse", _rmse)
†† mlflow.log_metric("rmse", _rmse)

Question 2
To support real-time inferencing in production applications, which is the best choice as a deployment target
for the scoring web service?
†† Azure Kubernetes Service (AKS)
†† Azure Container Instances (ACI)
†† Azure Machine Learning Compute Clusters

Answers
Question 1
What is the correct method to log a model metric, _rmse, in MLflow?
†† mlflow.log("rmse", _rmse)
†† mlflow.log_artifact("rmse", _rmse)
■■ mlflow.log_metric("rmse", _rmse)
Explanation
The MLflow module provides “fluent” APIs, and log_metric() is the correct method to log a model metric.
Question 2
To support real-time inferencing in production applications, which is the best choice as a deployment
target for the scoring web service?
■■ Azure Kubernetes Service (AKS)
†† Azure Container Instances (ACI)
†† Azure Machine Learning Compute Clusters
Explanation
AKS is recommended for high-scale production deployments. AKS provides fast response times and
autoscaling of the deployed service.
