Big Data and Machine Learning Using MATLAB
Seth DeLand & Amit Doshi
MathWorks

© 2015 The MathWorks, Inc.


1
Data Analytics

Turn large volumes of complex data into actionable information


source: Gartner

2
Customer Example: Gas Natural Fenosa
User Story
Energy Production Optimization
Opportunity
• Allocate demand among power plants to minimize generation costs

Analytics Use
• Data: Central database for historical power consumption and price data, weather forecasts, and parameters for each power plant
• Machine Learning: Develop price simulation scenarios
• Optimization: Minimize production cost

Benefit
• Reduced generation costs
• White-box solution for optimizing power generation

3
Unit Commitment
Predictive and Prescriptive Analytics
▪ Predictive Analytics: Historical Weather Data, Historical Load Data → Load Forecast
▪ Prescriptive Analytics: Load Forecast, Generator Parameters → Unit Commitment → Schedule

4
Big Data Analytics Workflow
▪ Access and Explore Data
– Files
– Databases
– Sensors
▪ Preprocess Data
– Working with Messy Data
– Data Reduction/Transformation
– Feature Extraction
▪ Develop Predictive Models
– Model Creation e.g. Machine Learning
– Parameter Optimization
– Model Validation
▪ Integrate Analytics with Systems
– Desktop Apps
– Enterprise Scale Systems
– Embedded Devices and Hardware

5
Example: Working with Big Data in MATLAB

▪ Objective: Create a model to predict the cost of a taxi ride in New York City

▪ Inputs:
– Monthly taxi ride log files
– The local data set is small (~20 MB)
– The full data set is big (~21 GB)

▪ Approach:
– Access Data
– Preprocess and explore data
– Develop and validate predictive model (linear fit)
▪ Work with a subset of the data for prototyping, then run on Spark-enabled Hadoop with the full data
– Integrate analytics into a web app
6
Example: Working with Big Data in MATLAB

7
Demo: Taxi Fare Predictor Web App

8
Big Data Analytics Workflow: Data Access and Pre-process

9
Data Access and Pre-processing – Challenges

Challenges

▪ Data aggregation
– Different sources (files, web, etc.)
– Different types (images, text, audio, etc.)

▪ Data clean up
– Poorly formatted files
– Irregularly sampled data
– Redundant data, outliers, missing data etc.

▪ Data specific processing
– Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
– Images: Image registration, morphological filtering, deblurring, etc.

▪ Dealing with out of memory data (big data)

“Data preparation accounts for about 80% of the work of data scientists” - Forbes

10
Data Analytics Workflow: Big Data Access and Pre-processing

11
Next: Access Big Data from MATLAB

▪ datastore
– Tabular text files
– Images
– Excel spreadsheets
– (SQL) Databases
– HDFS (Hadoop)
– S3 - Amazon

12
Get data in MATLAB

13
What if the data is saved in HDFS?

14
Or the data is stored in a database

15
Data Access: Summary

Business and Transactional Data
▪ Repositories – SQL, NoSQL, etc.
▪ File I/O – Text, Spreadsheet, etc.
▪ Web Sources – RESTful, JSON, etc.

Engineering, Scientific and Field Data
▪ Real-Time Sources – Sensors, GPS, etc.
▪ File I/O – Image, Audio, etc.
▪ Communication Protocols – OPC (OLE for Process Control), CAN (Controller Area Network), etc.

Diagram labels: Servers and Databases; Hardware; Software (C, Java, Fortran, Python)
16
Process data which doesn't fit into memory

17
Pre-processing Big Data

tall arrays in

▪ New data type designed for data that doesn’t fit into memory

▪ Lots of observations (hence “tall”)

▪ Looks like a normal MATLAB array


– Supports numeric types, tables, datetimes, strings, etc…
– Supports several hundred functions for basic math, stats, indexing, etc.
– Statistics and Machine Learning Toolbox support (clustering, classification, etc.)

18
tall arrays

▪ Automatically breaks data up into small “chunks” that fit in memory
▪ Tall arrays scan through the dataset one “chunk” at a time
▪ Processing code for tall arrays is the same as ordinary arrays

Diagram labels: tall array; Single Machine Memory; Process
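
A short sketch of how tall-array code mirrors ordinary array code. The fare values are hypothetical, and the tall array is built from an in-memory vector only to keep the example self-contained; with real data it would typically be created from a datastore. The call mapreducer(0) is an assumption made here so that the example runs in the local MATLAB session without a parallel pool.

```matlab
% Assumption: evaluate tall arrays in the local session (no parallel pool).
mapreducer(0);

% Hypothetical fares; with big data this would be tall(ds) for a datastore ds.
fares = tall([2.5; 7.1; 3.3; 12.0; 5.4; 9.9]);

% Same syntax as for ordinary arrays. Evaluation is deferred:
% these two lines only record the operations to perform.
longRides = fares(fares > 5);
avgLong   = mean(longRides);

% gather triggers the pass over the data and returns an in-memory value.
avgLongValue = gather(avgLong);
```

Deferring evaluation until gather lets MATLAB combine the recorded operations into as few passes over the data as possible.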

19
tall arrays

▪ With Parallel Computing Toolbox, process several “chunks” at once
▪ Can scale up to clusters with MATLAB Distributed Computing Server

Diagram labels: tall array; Single Machine Memory; Process; Cluster of Machines Memory

20
Demo: Working with Tall Arrays

21
Data Access and pre-processing – challenges and solution
Challenges
▪ Data aggregation
– Different sources (files, web, etc.)
– Different types (images, text, audio, etc.)

▪ Data clean up
– Poorly formatted files
– Irregularly sampled data
– Redundant data, outliers, missing data etc.

▪ Data specific processing
– Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
– Images: Image registration, morphological filtering, deblurring, etc.

▪ Dealing with out of memory data (big data)

1. MATLAB makes it easy to work with business and engineering data (Files, Databases, Signals, Images)
▪ Point and click tools to access variety of data sources
▪ High-performance environment for big data
▪ Built-in algorithms for data preprocessing including sensor, image, audio, video and other real-time data



22
Data Analytics Workflow: Develop Predictive Models using Big Data

23
Machine Learning
Machine learning uses data and produces a program to perform a task

Task: Human Activity Detection


Standard Approach: Computer Program
▪ Hand Written Program
– If X_acc > 0.5 then “SITTING”
– If Y_acc < 4 and Z_acc > 5 then “STANDING”
▪ Formula or Equation
– $Y_{activity} = \beta_1 X_{acc} + \beta_2 Y_{acc} + \beta_3 Z_{acc} + \dots$

Machine Learning Approach: Machine Learning Algorithm
– $model: \text{Inputs} \rightarrow \text{Outputs}$
– $model = \langle \text{Machine Learning Algorithm} \rangle(sensor\_data, activity)$
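
The contrast between the two approaches can be sketched on synthetic data. Everything below is illustrative: the accelerometer values are randomly generated, the 0.5 threshold follows the rule on the slide, and a decision tree (fitctree from Statistics and Machine Learning Toolbox) is only one possible choice of learning algorithm, not the one used in the original demo.

```matlab
rng(0);                                   % reproducible synthetic data
n = 20;

% Hypothetical accelerometer features [X_acc, Y_acc, Z_acc].
sitting  = [0.6 + 0.4*rand(n,1), 10*rand(n,2)];
standing = [0.4*rand(n,1),       10*rand(n,2)];
sensorData = [sitting; standing];
activity   = [repmat({'SITTING'}, n, 1); repmat({'STANDING'}, n, 1)];

% Standard approach: a rule written by hand.
ruleLabels = repmat({'STANDING'}, 2*n, 1);
ruleLabels(sensorData(:,1) > 0.5) = {'SITTING'};

% Machine learning approach: the algorithm derives the rule from the data.
model = fitctree(sensorData, activity);
learnedLabels = predict(model, sensorData);

% Fraction of observations labeled correctly by each approach.
ruleAccuracy    = mean(strcmp(ruleLabels, activity));
learnedAccuracy = mean(strcmp(learnedLabels, activity));
```

Both approaches label this deliberately easy data set correctly; the difference is that the learned model did not need the threshold to be supplied by hand.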

24
Consider Machine/Deep Learning When
▪ Problem is too complex for hand written rules or equations
– Because algorithms can learn complex non-linear relationships
– Examples: Speech Recognition, Object Recognition, Engine Health Monitoring

▪ Program needs to adapt with changing data
– Because algorithms can update as more data becomes available
– Examples: Weather Forecasting, Energy Load Forecasting, Stock Market Prediction

▪ Program needs to scale
– Because algorithms can learn efficiently from very large data sets
– Examples: IoT Analytics, Taxi Availability, Airline Flight Delays



25
Different Types of Learning
Machine Learning: Type of Learning and Categories of Algorithms

▪ Supervised Learning: Develop predictive model based on both input and output data
– Classification: Output is a choice between classes (True, False) (Red, Blue, Green)
– Regression: Output is a real number (temperature, stock prices)

▪ Unsupervised Learning: Discover an internal representation from input data only
– Clustering: No output - find natural groups and patterns from input data only

26
Different Types of Learning
Machine Learning: Type of Learning and Categories of Algorithms

▪ Supervised Learning: Develop predictive model based on both input and output data
– Classification: Support Vector Machines, Discriminant Analysis, Naive Bayes, Nearest Neighbor
– Regression: Linear Regression, GLM, SVR, GPR, Ensemble Methods, Decision Trees, Neural Networks

▪ Unsupervised Learning: Discover an internal representation from input data only
– Clustering: kMeans, kmedoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model

27
Machine Learning with Big Data

• Descriptive statistics (skewness, tabulate, crosstab, cov, grpstats, …)
• K-means clustering (kmeans)
• Visualization (ksdensity, binScatterPlot; histogram, histogram2)
• Dimensionality reduction (pca, pcacov, factoran)
• Linear and generalized linear regression (fitlm, fitglm)
• Discriminant analysis (fitcdiscr)
• Linear classification methods for SVM and logistic regression (fitclinear)
• Random forest ensembles of classification trees (TreeBagger)
• Naïve Bayes classification (fitcnb)
• Regularized regression (lasso)
• Prediction applied to tall arrays
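
As an illustration of one entry in this list, the sketch below fits a linear regression with fitlm directly on a tall table. The ride data is synthetic, and the coefficients 2.5 and 1.8 are made-up values chosen so the result is easy to check; mapreducer(0) is again an assumption that keeps evaluation in the local session.

```matlab
mapreducer(0);                 % assumption: evaluate tall arrays locally
rng(0);                        % reproducible noise

% Hypothetical ride data: fare is roughly 2.5 + 1.8 * distance.
distance = (1:50)';
fare = 2.5 + 1.8*distance + 0.1*randn(50,1);
rides = tall(table(distance, fare));

% fitlm accepts a tall table and a formula, as it does for in-memory tables.
model = fitlm(rides, 'fare ~ distance');

% Estimated coefficients: intercept first, then the slope for distance.
coeffs = model.Coefficients.Estimate;
```

With tall inputs fitlm returns a compact model, which keeps the coefficients and summary statistics but not the training data.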

28
Demo: Training a Machine Learning Model

29
Demo: Training a Machine Learning Model

30
Regression Learner

31
Regression Learner
App to apply advanced regression methods to your data

▪ Added to Statistics and Machine Learning Toolbox in R2017a
▪ Point and click interface – no coding required
▪ Quickly evaluate, compare and select regression models
▪ Export and share MATLAB code or trained models

32
Classification Learner
App to apply advanced classification methods to your data

▪ Added to Statistics and Machine Learning Toolbox in R2015a
▪ Point and click interface – no coding required
▪ Quickly evaluate, compare and select classification models
▪ Export and share MATLAB code or trained models

33
and Many More MATLAB Apps for Data Analytics

Distribution Fitting
System Identification
Signal Analysis

Wavelet Design and Analysis


Neural Net Fitting
Neural Net Pattern Recognition
Training Image Labeler

and many more…

34
Tuning Machine Learning Models
Get more accurate models in less time

▪ Automatically select best machine learning “features”
– Select best “features” to keep in model from over 400 candidates
– NCA: Neighborhood Component Analysis

▪ Automatically fine-tune machine learning parameters
– Hyperparameter Tuning

35
Machine Learning Hyperparameters

Hyperparameters

Tune a typical set of hyperparameters for this model

Tune all hyperparameters for this model
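
A hedged sketch of automated hyperparameter tuning from code. It assumes Statistics and Machine Learning Toolbox and uses the fisheriris sample data set with a k-nearest-neighbor classifier, neither of which comes from the slides. The value 'auto' tunes a typical subset of hyperparameters for the model type, while 'all' tunes every eligible one.

```matlab
load fisheriris                % meas (150x4 numeric), species (150x1 cellstr)
rng(1);                        % seed the random search steps

% Keep the run short and quiet; the evaluation count is an arbitrary choice.
opts = struct('ShowPlots', false, 'Verbose', 0, ...
              'MaxObjectiveEvaluations', 10);

% Bayesian optimization searches over the 'auto' set of hyperparameters
% and returns the model trained with the best values found.
model = fitcknn(meas, species, ...
    'OptimizeHyperparameters', 'auto', ...
    'HyperparameterOptimizationOptions', opts);

bestK      = model.NumNeighbors;   % tuned number of neighbors
trainError = resubLoss(model);     % misclassification rate on training data
```

The selected values can differ between runs, because the default acquisition function depends on how long each evaluation takes.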

36
Bayesian Optimization in Action

37
Big Data Analytics Workflow: Developing Predictive models

Challenges
▪ Lack of data science expertise

▪ Feature Extraction – How to transform data to best represent the system?
– Requires subject matter expertise
– No right way of designing features

▪ Feature Selection – What attributes or subset of data to use?
– Entails a lot of iteration – Trial and error
– Difficult to evaluate features

▪ Model Development
– Many different models
– Model Validation and Tuning

▪ Time required to conduct the analysis

2. MATLAB enables domain experts to do Data Science (Apps, Language)
▪ Easy to use apps
▪ Wide breadth of tools to facilitate domain specific analysis
▪ Examples/videos to get started
▪ Automatic MATLAB code generation
▪ High speed processing of large data sets

38
Back to our example: Working with Big Data in MATLAB

▪ Objective: Create a model to predict the cost of a taxi ride in New York City

▪ Inputs:
– Monthly taxi ride log files
– The local data set is small (~20 MB)
– The full data set is big (~25 GB)

▪ Approach:
– Access Data
– Preprocess and explore data
– Develop and validate predictive model (linear fit)
▪ Work with subset of data for prototyping
▪ Scale to full data set on a cluster
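
The prototyping part of this approach can be sketched on a small in-memory table. The data below is synthetic (distances, fares, and the injected bad records are all made up), and the 25% hold-out split is an arbitrary choice; the original demo uses the actual taxi ride log files.

```matlab
rng(0);                                        % reproducible example
n = 200;
distance = 0.5 + 9.5*rand(n,1);                % hypothetical trip distances
fare = 2.5 + 1.8*distance + 0.2*randn(n,1);    % hypothetical fares
fare(5)  = NaN;                                % a missing value
fare(10) = -40;                                % an obviously invalid record
rides = table(distance, fare);

% Preprocess: drop rows with missing values and non-physical fares.
rides = rmmissing(rides);
rides = rides(rides.fare > 0, :);

% Hold out 25% of the rides for validation.
numRides = height(rides);
idx      = randperm(numRides);
numTest  = round(0.25*numRides);
testSet  = rides(idx(1:numTest), :);
trainSet = rides(idx(numTest+1:end), :);

% Develop the linear fit on the training set and validate it on the rest.
model     = fitlm(trainSet, 'fare ~ distance');
predicted = predict(model, testSet);
rmse      = sqrt(mean((predicted - testSet.fare).^2));
```

Because tall arrays share this syntax, the same cleaning and fitting steps are the ones that later run against the full data set.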

39
Data Analytics Workflow: Develop Predictive Models using Big Data

40
Demo: Taxi Fare Predictor Web App

41
MATLAB Production Server

▪ Server software
– Manages packaged MATLAB programs and worker pool

▪ MATLAB Runtime libraries
– Single server can use runtimes from different releases

▪ RESTful JSON interface

▪ Lightweight client libraries
– C/C++, .NET, Python, and Java

Diagram labels: Enterprise Application with MPS Client Library; Applications/Database Servers (RESTful JSON); MATLAB Production Server: Request Broker & Program Manager, MATLAB Runtime
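
A sketch of what a client request to the RESTful JSON interface might look like. The server address, archive name, function name, and input value are all hypothetical. The request body follows the layout of a synchronous MATLAB Production Server call, with nargout for the number of outputs and rhs for the input arguments. The actual network call is left commented out because it needs a running server.

```matlab
% Hypothetical deployment details (not from the slides).
serverUrl    = 'http://localhost:9910';
archiveName  = 'taxiFare';
functionName = 'predictFare';

% Request body: number of requested outputs and the list of inputs.
request  = struct('nargout', 1, 'rhs', {{3.2}});
body     = jsonencode(request);
endpoint = [serverUrl '/' archiveName '/' functionName];

% Sending the request requires a running server instance:
% options  = weboptions('MediaType', 'application/json');
% response = webwrite(endpoint, body, options);   % outputs arrive in response.lhs
```

Any client that can send an HTTP POST with a JSON body can call the deployed function this way, which is why the interface suits web and mobile front ends.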

42
Integrate analytics with systems
3. MATLAB Analytics run anywhere

▪ Embedded Hardware: C, C++; HDL; PLC
▪ Enterprise Systems: Standalone Application; Excel Add-in; Hadoop/Spark; C/C++; Java; Python; .NET; MATLAB Production Server

Diagram labels: MATLAB Runtime
43
Product Support for Spark
From MATLAB desktop:
• Access data from HDFS
• Run “tall” functions on Spark/Hadoop using MDCS

Integrate with applications:
• Deploy MATLAB programs using “tall”
• Develop deployable applications for Spark using MATLAB API for Spark

Diagram labels: Web & Mobile Applications; Enterprise Applications; Development Tools; MATLAB Compiler; MATLAB Distributed Computing Server; MATLAB Runtime; Spark; YARN

44
Deployment Offerings

▪ Deploy “tall” programs
– Create Standalone Applications: MATLAB Compiler

▪ MATLAB API for Spark
– Create Standalone Applications: MATLAB Compiler
– Functionality beyond tall arrays
– For advanced programmers familiar with Spark
– Local install of Spark to run code in MATLAB
▪ Installed on same machine as MATLAB – single node, Linux

Note: Since the Standalone must run on a Linux Edge Node, you must compile on Linux

Diagram labels: Program using tall; Program using MATLAB API for Spark; MATLAB Compiler; Standalone Application; Edge Node; Spark; MATLAB Runtime; YARN: Data Operating System

45
Data Analytics Workflow
▪ Access and Explore Data
– Files
– Databases
– Sensors
▪ Preprocess Data
– Working with Messy Data
– Data Reduction/Transformation
– Feature Extraction
▪ Develop Predictive Models
– Model Creation e.g. Machine Learning
– Parameter Optimization
– Model Validation
▪ Integrate Analytics with Systems
– Desktop Apps
– Enterprise Scale Systems
– Embedded Devices and Hardware

1. MATLAB Analytics work with business and engineering data
2. MATLAB enables domain experts to do Data Science
3. MATLAB Analytics run anywhere

46
Resources to learn and get started
mathworks.com/machine-learning
mathworks.com/big-data
eBook

47
MathWorks Services

▪ Consulting
– Integration
– Data analysis/visualization
– Unify workflows, models, data
www.mathworks.com/services/consulting/

▪ Training
– Classroom, online, on-site
– Data Processing, Visualization, Deployment, Parallel Computing

www.mathworks.com/services/training/

48
MathWorks Training Offerings

http://www.mathworks.com/services/training/

49
Speaker Details
Email:
[email protected]
[email protected]
LinkedIn:
https://in.linkedin.com/in/amit-doshi
https://www.linkedin.com/in/seth-deland

Contact MathWorks India
Products/Training Enquiry Booth
Call: 080-6632-6000
Email: [email protected]

Your feedback is valued.


Please complete the feedback form provided to you.
50
