Machine Learning on Streaming Data with
OCI Data Flow
Sujoy Chowdhury
Product Manager, OCI Data Flow
Sivanesh Selvanataraj
Software Engineer, OCI Data Flow
May 5th, 2022
Introducing OCI Data Flow Streaming
Oracle Cloud Infrastructure Data Flow, a fully managed serverless Apache
Spark service, now supports Spark Streaming. Customers can now use OCI
Data Flow to do cloud scale ETL on their continuously produced stream data.
2
Data Lakehouse Platform on OCI
Open & flexible: analyze any database, any application, from anywhere
Data Lakehouse on OCI
Managed Open Source Data Warehouse
Data sources Data target
Data Stores
Any Database Big Data Service Autonomous Data
Data Flow Warehouse
Any BI Tool
Search Insights MySQL HeatWave
Data Definition
Any Application Object Storage
Data Movement & Discovery
Relational Data
Machine Learning
& Data Science
Any Cloud
Data Integration Data Catalog
GoldenGate
OCI Streaming
Data Science Platform Any Application
Any Events/Sensors
AI Services
Lakehouse: Cloud Scale ETL & ML using OCI Data Flow
Use all data to innovate in a lakehouse.
1. Cloud Scale ETL & ML
Cloud Scale ETL & ML
• Cloud Scale ETL at ‘landing zone’ for data warehouse
• Cloud Scale ETL at Data Lake (object store to object Raw data/ Lakehouse Insights
store) to optimize ADW stream
• Data Lake to optimize ADW storage (inexpensive, easily
accessible data store for important but rarely used data,
processed data & meta-data)
• Data Catalog harvests the lake (technical metadata) Autonomous Oracle
and applies a business context. This includes ETL OCI Data Flow ETL Data Warehouse Analytics
Cloud
data/metadata created by Spark applications.
ML
• Data Integration service for ETL orchestration,
Object Storage
workflows Data Lake
2. Stream data processing OCI Data
Catalog
Object Storage OCI Data Science
OCI AI Services
Data Lake
Oracle Machine
• Streamed logs, sensor data, CDR, clickstream that Learning
needs to be processed before ADW (ETL & ML)
3. Big Data processing using Python, Java OCI Data
• Data Engineers/ Data Scientists can use their preferred Integration
programming language for processing big data
4
Analyzing Streams with OCI Data Flow Spark Streaming
OCI Streaming
OCI Streaming
OCI Data Flow
Object Storage
Data Lake Object Storage
Data Lake
• Continuously retrieve input data, not just hourly/ daily.
• Support Kafka-compatible stream/ message queue as source and sink, in addition to batch file-
formats
• Processed data made available continuously (<1 min), not just after job completion
• Outage does not require job restart, can continue from last checkpoint
• Handle late data and watermarking: can catch up backlogged data over time
• Heavy-weight stream processing: Stream-static Joins, Stream-stream Joins, Inner Joins with optional
watermarking. Outer Joins with Watermarking, streaming Deduplication
• Machine Learning on stream data using MLLib.
Spark Streaming Machine Learning Use Cases
Manufacturing Financial Services Telecom
• Predicting RUL • Real time fraud • Churn prediction
‘Remaining Useful Life’ detection
6
Manufacturing: Predicting Remaining Useful Life (RUL) from stream data
Manufacturing
Lakehouse
OCI Streaming OCI Streaming
OCI Data Flow Autonomous
Data Warehouse
Object Storage
Data Lake
Object Storage
Data Lake
Color RUL Status, Action
Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green >60 days Healthy, no action needed.
7
Manufacturing: Primer on Predictive Maintenance
• Maintenance, repair and overhaul (MRO) of equipment is mission-critical for
Manufacturing industries
• 3 types of maintenance
• Preventive maintenance (PM) : Planned/Scheduled Maintenance
• Reactive maintenance (Run To Failure) : Maintenance after machine Runs To Failure
• Predictive maintenance (PdM) : Predict the trend and estimation when maintenance
is required
• Predictive maintenance
• Survival analysis:
• Spark ML Survival Regression Model: Accelerated Failure Model (AFT)
8
Spark ML Survival Regression Model
9
RUL Training dataset
• Turbofan Engine Degradation Simulation Data Set
A. Saxena and K. Goebel (2008 NASA Ames Prognostics
Data Repository (http://ti.arc.nasa.gov/project/prognostic-
data-repository), NASA Ames Research Center, Moffett
Field, CA
• Engine degradation Run-to-Failure simulation was carried
out using C-MAPSS (Commercial Modular Aero- Propulsion
System Simulation).
10
Trainer: OCI Data Flow application: RULSurvivalModelTrainer
11
Trainer: OCI Data Flow application: RULSurvivalModelTrainer
12
Simulator: OCI Data Flow Streaming application SensorDataSimulator
13
Simulating Sensor Data for Spark Streaming
Color RUL Status, Action
Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
14 Green >60 days Healthy, no action needed.
Predictor: OCI Data Flow streaming application RealTimeRULPredictor
15
Real time RUL prediction using OCI Data Flow Spark Streaming
16
Real time RUL prediction using OCI Data Flow Spark Streaming
Color RUL Status, Action
Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green 60 – 100 days Healthy, no action needed.
17
Demo: Predicting Remaining Useful Life (RUL) from stream data
Manufacturing
Lakehouse
OCI Streaming OCI Streaming
OCI Data Flow Autonomous
Data Warehouse
Object Storage
Data Lake
Object Storage
Data Lake
Color RUL Status, Action
Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green >60 days Healthy, no action needed.
18
Spark Streaming with OCI Data Flow: Managed Spark UI view
19
Spark Oracle Datasource for connecting to Oracle DB
20
Spark Streaming with OCI Data Flow: OCI Logging integration
21
Fully managed Spark Streaming with OCI Data Flow
Data Flow continues to be the fully managed Spark experience with zero administration overhead:
• End-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead
Logs from Spark
• Cloud native authentication: Resource principal-based authentication enables Data Flow Streaming
applications to run beyond 24 hours
• Managed automatic security patching so that customers can focus building their application, not
update/ upgrade / infra operation.
• Managed automatic run resubmission by Data Flow for additional fault tolerance on top of Spark
• Deep OCI integration with OCI Logging, OCI Metrics for simpler troubleshooting, along with other
1P OCI services
• Pay for infra only: Data Flow Streaming feature will not introduce any additional meters. Customers
will continue to pay only for the infra their Data Flow runs use.
22
Resources
Manufacturing
Lakehouse
OCI Streaming OCI Streaming
OCI Data Flow Autonomous
Data Warehouse
• RUL Survival Model
Trainer
Object Storage • Real time RUL
Data Lake Predictor
Object Storage
Data Lake
Learn more
https://docs.oracle.com/en-us/iaas/data-flow/using/spark-streaming.htm
https://docs.oracle.com/en-us/iaas/data-flow/using/spark_oracle_datasource.htm
Training Dataset: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan
Sample code https://github.com/oracle-samples/oracle-dataflow-samples
Contact
[email protected]
23