DS Architecture

The document discusses the steps involved in building an ETL pipeline including data ingestion from various sources, data processing using tools like Spark and Hive, building predictive models, and presenting results. It also provides examples of using such a pipeline for applications in industries like telematics, banking, and financial services.


ETL Pipeline Steps
Day-1- Intro
Day-2- Architecture
Historical data matters, but demand prediction is never 100% certain, so other parameters (x, y, z) also need to be taken into account.
Prediction / forecasting example:
E.g. car after-sales service: more than 1 lakh inventory items.
Spike-order prediction because of external environmental factors.
KPI (Key Performance Indicators): predictions have to be made for more than 250 KPIs.
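As a hedged illustration of forecasting one such KPI from its history plus external factors, the sketch below uses statsmodels SARIMAX with exogenous regressors; the DataFrame, column names and model order are assumptions, not anything specified in the notes.

```python
# Minimal sketch: forecast one KPI from its own history plus external factors.
# Assumes a monthly pandas DataFrame `df` with hypothetical columns
# 'orders' (the KPI) and exogenous drivers 'temperature' and 'holiday_count'.
from statsmodels.tsa.statespace.sarimax import SARIMAX

def forecast_kpi(df, kpi, exog_cols, steps=3):
    model = SARIMAX(df[kpi], exog=df[exog_cols],
                    order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
    result = model.fit(disp=False)
    # Future values of the external factors must be supplied for the horizon;
    # here the last rows are reused purely as a placeholder assumption.
    future_exog = df[exog_cols].tail(steps)
    return result.forecast(steps=steps, exog=future_exog)

# The same function could simply be looped over the 250+ KPIs mentioned above.
```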

Problem Statement
Different sources of data: Master Data Management.
Telematics data: sensors installed on the vehicle; sensor data includes position, temperature, gear shift, pressure.
Data source formats can vary: JSON, Excel, XML, CSV, etc.
Frequency: batch data or streaming data. Batch data is data from customers, suppliers, etc. received in batches, where the frequency is low. Streaming data comes from sources such as telematics sensors.
Streaming data: high-speed or real-time data.

Functional architecture layers (conceptual architecture)


1st - Data Sources: Master Data Management.
2nd - Data Ingestion Layer: data transformation (e.g. if acceleration is required and we have velocity and time data, acceleration = dv/dt; a minimal PySpark sketch follows this list), data quality, data integration; an audit log has to be maintained.
3rd - Data Processing Layer: once data is transformed, it is stored in the data lake, plus data archival.
4th - Data Storage Layer: function-specific repositories, statistical models, data virtualisation, data mining (finding relationships).
5th - Analytics Layer & Functional Repo: predictive analysis, data discovery, ad-hoc analysis; create smart data access methods to get the right information for a specific business initiative.
6th - Presentation Layer: internal/external web apps, reporting, visualisations, mobile, workflow and case management.
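A minimal PySpark sketch of the dv/dt derivation mentioned in the ingestion layer above; the column names, file paths and JSON source format are assumptions.

```python
# Minimal sketch: derive acceleration = dv/dt from velocity/time readings.
# Assumes hypothetical columns vehicle_id, ts (seconds) and velocity.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("ingest-transform").getOrCreate()
raw = spark.read.json("/raw/telematics/")          # assumed raw-layer path

w = Window.partitionBy("vehicle_id").orderBy("ts")
derived = (raw
    .withColumn("dv", F.col("velocity") - F.lag("velocity").over(w))
    .withColumn("dt", F.col("ts") - F.lag("ts").over(w))
    .withColumn("acceleration", F.col("dv") / F.col("dt")))

derived.write.mode("overwrite").parquet("/lake/telematics_derived/")  # assumed path
```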

How is the model built?


Start
Data selection
Data description - the story of what the data is all about
Data analysis - statistical and graphical data analysis
Data transformation and derivation
Selection of ML algorithm based on patterns observed in EDA
Data standardisation and normalisation
Creation of train and test data sets
Model training
Calculation of model accuracy
Hyperparameter tuning to achieve better accuracy
Saving the created model file
Deployment strategies for the model (live stream / batch / mini-batch)
Production deployment and testing
Finalising the retraining approach
Logging and monitoring
Dashboard for monitoring reports
End/Stop
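A compact sketch of the steps above using scikit-learn; the estimator, parameter grid and model file name are illustrative assumptions, not the prescribed stack from the notes.

```python
# Minimal sketch of the workflow above: split, train, evaluate, tune, save.
# Feature/target contents and the chosen model are illustrative assumptions.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_model(X, y):
    # Creation of train and test data sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Standardisation/normalisation + model training in one pipeline
    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", RandomForestClassifier(random_state=42))])

    # Hyperparameter tuning to achieve better accuracy
    grid = GridSearchCV(pipe, {"clf__n_estimators": [100, 300],
                               "clf__max_depth": [None, 10]}, cv=5)
    grid.fit(X_train, y_train)

    # Calculation of model accuracy on the held-out test set
    print("test accuracy:", grid.score(X_test, y_test))

    # Saving the created model file for deployment (live stream / batch / mini-batch)
    joblib.dump(grid.best_estimator_, "model.pkl")
    return grid.best_estimator_
```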
Day-3

Systems Involved
Technical Specification Architecture
1st - Data Sources: supply chain, customer service, 3rd-party data, proactive sensing, telematics.
2nd - Data Ingestion Layer: SAP PI, HDFS raw layer, Oozie. After ingesting data we move to the next layer.
3rd - Data Processing Layer: multiple layers are created here (HDFS processed layers); data validation has to be done (name, data type, null records and other checks) in the first layer, followed by cleansing; Spark stores the data.
4th - Data Storage Layer: sandbox, Hive, HBase, audit log.
5th - Analytical Layer & Functional Repo: models, predictive analysis (SPSS), data discovery, Python.
6th - Presentation Layer: Tableau or ...

Data on Azure Blob: from Blob to AG Copy, then from AG Copy to distributed computation; a cron job runs here.
For telematics data, the telematics operating system reads the telematics data, converts it into JSON, and then sends it to the Kafka cluster.
SAP PI (Process Integration): transfers data.
HDFS: Hadoop Distributed File System.
Oozie: an XML-based scheduler.
There is a difference between a data warehouse and a database; they are not the same.
Hive is a data warehousing solution; data flows from MDM --> DW (Hive).
PySpark is used to cleanse data, followed by data standardisation operations to bring data to uniform units.
There are multiple layers because there will be good and bad data at each stage.
After standardisation, transformation and aggregation, the data is dumped into the L3 layer.
Once the above operations are done, we apply the model; all the steps discussed yesterday are then carried out.
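A minimal PySpark sketch of the layered validation, cleansing and standardisation flow described above; the paths, column names and the unit conversion are assumptions.

```python
# Minimal sketch: validate and standardise data between HDFS layers (L1 -> L2 -> L3).
# Paths, column names and the mph -> km/h conversion are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("processing-layer").getOrCreate()

l1 = spark.read.parquet("/hdfs/l1_raw/telematics/")         # assumed raw layer

# L1 -> L2: basic validation and cleansing (null records and bad types dropped)
l2 = (l1.dropna(subset=["vehicle_id", "ts"])
        .filter(F.col("speed").cast("double").isNotNull()))
l2.write.mode("overwrite").parquet("/hdfs/l2_cleansed/telematics/")

# L2 -> L3: standardisation to uniform units, then aggregation
l3 = (l2.withColumn("speed_kmh", F.col("speed") * 1.60934)   # assumes speed in mph
        .groupBy("vehicle_id")
        .agg(F.avg("speed_kmh").alias("avg_speed_kmh"),
             F.max("engine_temp").alias("max_engine_temp")))
l3.write.mode("overwrite").parquet("/hdfs/l3_aggregated/telematics/")
```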

For continuous data, i.e. telematics data, it goes through Event Hub, then Spark, then to the different layers.
HBase allows random access to the database.
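For the random-access point, a small hedged sketch using the happybase Python client; the host, table name and column family are assumptions.

```python
# Minimal sketch: random read/write of a single row in HBase via happybase.
# Host, table name and column family 'd' are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-master-host")
table = connection.table("vehicle_status")

# Write the latest diagnosis values for one vehicle (row key = vehicle id)
table.put(b"VIN123", {b"d:battery_temp": b"41.5", b"d:charge_level": b"0.82"})

# Random access: fetch exactly that row back without scanning the table
row = table.row(b"VIN123")
print(row[b"d:battery_temp"])
```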

After deployment, we need to define the retraining approach.

Language- Python, R, Shell Script


Cloud: Azure Cloud
Platform/system: HDFS
Data warehousing: Hive, HBase
Data processing: PySpark framework
Continuous data streaming: Kafka
Scheduler: cron, Oozie
Monitoring tools: Ambari or Cloudera Manager + audit log (own dashboard)
Governance and security: SSL authentication, secure authorization
ML: customised ML, time series, deep learning
Blob: data storage unit of Azure
Distributed computation: for reducing compute time

Master-slave concept in distributed computation.


AG Copy: written in PowerShell.
Cron is a time-based job scheduler (on Unix-like systems).
Telematics industry (deals with movable, IoT-enabled devices).
Problem statements: 1st, predict device health; 2nd, driving pattern for the insurance premium; 3rd, push notifications without a pattern, or if I am not available at the location.
A telematics device is hardware.
It captures all data from the devices and generates a current status called a D-Pkg (Diagnosis Package),
e.g. battery temperature, charging level, engine data.
Based on this data, we need to predict the health of different devices.
In Europe/US a driving score is given, and the insurance premium is decided based on it.
Push notifications (message, email) without a pattern.
Context-based marketing.
NPS (Net Promoter Score) forecast.

Data source: telematics devices installed in my car. All collected data is sent to the TOS (Telematics Operating System).
The TOS converts all datasets into JSON, then sends the data to a middle tier and from there to Kafka (for streaming purposes) and to the EDB (enterprise-level database).
Spark Streaming was able to consume data from Kafka and then send the same to different databases, and here the data lake (Hadoop) was being created.
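A hedged Structured Streaming sketch of that Kafka-to-data-lake hop; the broker address, topic name and JSON schema are assumptions, and the spark-sql-kafka connector package has to be available to Spark.

```python
# Minimal sketch: consume TOS JSON messages from Kafka with Spark Structured
# Streaming and land them in the Hadoop data lake. Broker, topic and schema
# fields are assumptions; requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("telematics-stream").getOrCreate()

schema = StructType([StructField("vehicle_id", StringType()),
                     StructField("battery_temp", DoubleType()),
                     StructField("charge_level", DoubleType())])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "telematics-dpkg")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/lake/telematics/")
         .option("checkpointLocation", "/lake/_checkpoints/telematics/")
         .start())
```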

Data was collected from multiple sources like tele-sales, the e-commerce site, retail direct, retail indirect, the self-care portal and the mobile app, through the OSB (Oracle Service Bus) API gateway.
This API gateway was connected to OSM and then further to the CRM (Customer Relationship Management) system.

1. Performing diagnosis report: predict vehicle health.


Streaming data --> model built on historical data plus other local factors.
The vendor will have historical details of all the parts he is providing; for every part there will be an ID and a manufacturing date.
Banking industry
Problem statement: A/c opening; end-to-end automation of A/c opening.
Different types of accounts, like savings, current, OD, etc.
Different kinds of document submission.
Manual processes: manual data entry, manual document verification, 1-2 days for approval, multiple charges for loan processing, fast-track loan processing.
Loan default: decrease manual involvement.

A/c Opening Automation


PAN card, Aadhaar card, DL, rent agreement, gas bill.
Develop a portal with an option for the user to fill in all details and upload all documents there. The system should recognise and authenticate these documents (PAN card, Aadhaar, DL); with this automation, no agent is required to verify them.
First detect the documents, then capture all the data using OCR.
Based on document authentication, a digital debit card is generated. If there is an issue with the documents, the application is disapproved and an email alert is sent.
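A minimal sketch of the OCR capture step using pytesseract; the file name and the PAN-number pattern are illustrative assumptions.

```python
# Minimal sketch: pull text off an uploaded KYC document image with OCR,
# then look for an ID-like pattern. The regex is a rough assumption only.
import re
from PIL import Image
import pytesseract

def extract_pan_number(image_path):
    text = pytesseract.image_to_string(Image.open(image_path))
    # Assumed PAN pattern: 5 letters, 4 digits, 1 letter
    match = re.search(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", text)
    return match.group(0) if match else None

print(extract_pan_number("uploaded_pan_card.png"))   # hypothetical file name
```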

Loan Approval
Different departments even under loans: HL, PL, VL, GL, etc.
A multi-page form has to be filled, then KYC documents and salary slips, then background verification (visiting the mentioned address).
Automation of these components with less manual involvement.

Customer care Call


Conversational speech bot in place of IVR; correct pronunciation is required, and a native accent will not be recognised.
An acoustic speech model to handle native accents; annotate speech.
Acoustic frameworks: IBM Watson, Azure framework; Watson is cheaper than Azure, and an offline model is also available.
A chatbot to handle customer calls.

Customer loyalty or credibility


Cluster customers based on state, city, or a 5 km radius.
After clustering, decide on the risk involved in each cluster, then provide the info to the 3rd party who is calling the customer.

Raise flags based on geospatial location.
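A hedged sketch of the 5 km radius clustering using DBSCAN with haversine distance; the sample coordinates are made up.

```python
# Minimal sketch: cluster customers whose locations fall within ~5 km of each
# other, using DBSCAN with haversine distance. Sample coordinates are made up.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
coords_deg = np.array([[19.0760, 72.8777],    # illustrative lat/lon pairs
                       [19.0800, 72.8800],
                       [28.7041, 77.1025]])

# eps is in radians, so 5 km / earth radius expresses the 5 km threshold
db = DBSCAN(eps=5.0 / EARTH_RADIUS_KM, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords_deg))
print(labels)   # -1 marks customers outside any 5 km cluster (potential flag)
```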

Fund Management
Suppose I have to invest somewhere: risk parameters related to credit card default.
Factors like geopolitical news and market risk.

Stock market
