DS Architecture

The document discusses the steps involved in building an ETL pipeline including data ingestion from various sources, data processing using tools like Spark and Hive, building predictive models, and presenting results. It also provides examples of using such a pipeline for applications in industries like telematics, banking, and financial services.


ETL Pipeline Steps
Day-1- Intro
Day-2- Architecture
Historical data matters, but demand prediction is never 100% certain, so other parameters (x, y, z) also need to be taken into account.
Prediction / forecasting example:
E.g. car after-sales service: more than 1 lakh inventory items.
Spike-order prediction because of external environmental factors.
KPI (Key Performance Indicators): predictions have to be made for more than 250 KPIs.
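As a hedged illustration of forecasting one such KPI from its history plus external factors, the sketch below uses statsmodels SARIMAX with exogenous regressors; the DataFrame, column names and model order are assumptions, not anything specified in the notes.

```python
# Minimal sketch: forecast one KPI from its own history plus external factors.
# Assumes a monthly pandas DataFrame `df` with hypothetical columns
# 'orders' (the KPI) and exogenous drivers 'temperature' and 'holiday_count'.
from statsmodels.tsa.statespace.sarimax import SARIMAX

def forecast_kpi(df, kpi, exog_cols, steps=3):
    model = SARIMAX(df[kpi], exog=df[exog_cols],
                    order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
    result = model.fit(disp=False)
    # Future values of the external factors must be supplied for the horizon;
    # here the last rows are reused purely as a placeholder assumption.
    future_exog = df[exog_cols].tail(steps)
    return result.forecast(steps=steps, exog=future_exog)

# The same function could simply be looped over the 250+ KPIs mentioned above.
```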

Problem Statement
Different sources of data: Master Data Management.
Telematics data: sensors installed on the vehicle; sensor data includes position, temperature, gear shift, pressure.
Data source formats can vary: JSON, Excel, XML, CSV, etc.
Frequency: batch data or streaming data. Batch data is data from customers, suppliers, etc. received in batches, where the frequency is low. Streaming data comes from sources such as telematics sensors.
Streaming data: high-speed or real-time data.

Functional architecture layers (conceptual architecture)


1st - Data Sources: Master Data Management.
2nd - Data Ingestion Layer: data transformation (e.g. if acceleration is required and we have velocity and time data, acceleration = dv/dt; a minimal PySpark sketch follows this list), data quality, data integration; an audit log has to be maintained.
3rd - Data Processing Layer: once data is transformed, it is stored in the data lake, plus data archival.
4th - Data Storage Layer: function-specific repositories, statistical models, data virtualisation, data mining (finding relationships).
5th - Analytics Layer & Functional Repo: predictive analysis, data discovery, ad-hoc analysis; create smart data access methods to get the right information for a specific business initiative.
6th - Presentation Layer: internal/external web apps, reporting, visualisations, mobile, workflow and case management.
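A minimal PySpark sketch of the dv/dt derivation mentioned in the ingestion layer above; the column names, file paths and JSON source format are assumptions.

```python
# Minimal sketch: derive acceleration = dv/dt from velocity/time readings.
# Assumes hypothetical columns vehicle_id, ts (seconds) and velocity.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("ingest-transform").getOrCreate()
raw = spark.read.json("/raw/telematics/")          # assumed raw-layer path

w = Window.partitionBy("vehicle_id").orderBy("ts")
derived = (raw
    .withColumn("dv", F.col("velocity") - F.lag("velocity").over(w))
    .withColumn("dt", F.col("ts") - F.lag("ts").over(w))
    .withColumn("acceleration", F.col("dv") / F.col("dt")))

derived.write.mode("overwrite").parquet("/lake/telematics_derived/")  # assumed path
```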

How is the model built?


Start
Data selection
Data description - the story of what the data is all about
Data analysis - statistical and graphical data analysis
Data transformation and derivation
Selection of ML algorithm based on patterns observed in EDA
Data standardisation and normalisation
Creation of train and test data sets
Model training
Calculation of model accuracy
Hyperparameter tuning to achieve better accuracy
Saving the created model file
Deployment strategies for the model (live stream / batch / mini-batch)
Production deployment and testing
Finalising the retraining approach
Logging and monitoring
Dashboard for monitoring reports
End/Stop
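A compact sketch of the steps above using scikit-learn; the estimator, parameter grid and model file name are illustrative assumptions, not the prescribed stack from the notes.

```python
# Minimal sketch of the workflow above: split, train, evaluate, tune, save.
# Feature/target contents and the chosen model are illustrative assumptions.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_model(X, y):
    # Creation of train and test data sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Standardisation/normalisation + model training in one pipeline
    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", RandomForestClassifier(random_state=42))])

    # Hyperparameter tuning to achieve better accuracy
    grid = GridSearchCV(pipe, {"clf__n_estimators": [100, 300],
                               "clf__max_depth": [None, 10]}, cv=5)
    grid.fit(X_train, y_train)

    # Calculation of model accuracy on the held-out test set
    print("test accuracy:", grid.score(X_test, y_test))

    # Saving the created model file for deployment (live stream / batch / mini-batch)
    joblib.dump(grid.best_estimator_, "model.pkl")
    return grid.best_estimator_
```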
Day-3

Systems Involved
Technical Specification Architecture
1st - Data Sources: supply chain, customer service, 3rd-party data, proactive sensing, telematics.
2nd - Data Ingestion Layer: SAP PI, HDFS raw layer, Oozie. After ingesting data we move to the next layer.
3rd - Data Processing Layer: multiple layers are created here (HDFS processed layers); data validation has to be done (name, data type, null records and other checks) in the first layer, followed by cleansing; Spark stores the data.
4th - Data Storage Layer: sandbox, Hive, HBase, audit log.
5th - Analytical Layer & Functional Repo: models, predictive analysis (SPSS), data discovery, Python.
6th - Presentation Layer: Tableau or ...

Data on Azure Blob: from Blob to AG Copy, then from AG Copy to distributed computation; a cron job runs here.
For telematics data, the telematics operating system reads the telematics data, converts it into JSON, and then sends it to the Kafka cluster.
SAP PI (Process Integration): transfers data.
HDFS: Hadoop Distributed File System.
Oozie: an XML-based scheduler.
There is a difference between a data warehouse and a database; they are not the same.
Hive is a data warehousing solution; data flows from MDM --> DW (Hive).
PySpark is used to cleanse data, followed by data standardisation operations to bring data to uniform units.
There are multiple layers because there will be good and bad data at each stage.
After standardisation, transformation and aggregation, the data is dumped into the L3 layer.
Once the above operations are done, we apply the model; all the steps discussed yesterday are then carried out.
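A minimal PySpark sketch of the layered validation, cleansing and standardisation flow described above; the paths, column names and the unit conversion are assumptions.

```python
# Minimal sketch: validate and standardise data between HDFS layers (L1 -> L2 -> L3).
# Paths, column names and the mph -> km/h conversion are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("processing-layer").getOrCreate()

l1 = spark.read.parquet("/hdfs/l1_raw/telematics/")         # assumed raw layer

# L1 -> L2: basic validation and cleansing (null records and bad types dropped)
l2 = (l1.dropna(subset=["vehicle_id", "ts"])
        .filter(F.col("speed").cast("double").isNotNull()))
l2.write.mode("overwrite").parquet("/hdfs/l2_cleansed/telematics/")

# L2 -> L3: standardisation to uniform units, then aggregation
l3 = (l2.withColumn("speed_kmh", F.col("speed") * 1.60934)   # assumes speed in mph
        .groupBy("vehicle_id")
        .agg(F.avg("speed_kmh").alias("avg_speed_kmh"),
             F.max("engine_temp").alias("max_engine_temp")))
l3.write.mode("overwrite").parquet("/hdfs/l3_aggregated/telematics/")
```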

For continuous data, i.e. telematics data, it goes through Event Hub, then Spark, then to the different layers.
HBase allows random access to the database.
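For the random-access point, a small hedged sketch using the happybase Python client; the host, table name and column family are assumptions.

```python
# Minimal sketch: random read/write of a single row in HBase via happybase.
# Host, table name and column family 'd' are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-master-host")
table = connection.table("vehicle_status")

# Write the latest diagnosis values for one vehicle (row key = vehicle id)
table.put(b"VIN123", {b"d:battery_temp": b"41.5", b"d:charge_level": b"0.82"})

# Random access: fetch exactly that row back without scanning the table
row = table.row(b"VIN123")
print(row[b"d:battery_temp"])
```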

After deployment, we need to define the retraining approach.

Language- Python, R, Shell Script


Cloud: Azure Cloud
Platform/system: HDFS
Data warehousing: Hive, HBase
Data processing: PySpark framework
Continuous data streaming: Kafka
Scheduler: cron, Oozie
Monitoring tools: Ambari or Cloudera Manager + audit log (own dashboard)
Governance and security: SSL authentication, secure authorization
ML: customised ML, time series, deep learning
Blob: data storage unit of Azure
Distributed computation: for reducing compute time

Master-slave concept in distributed computation.


AG Copy: written in PowerShell.
Cron is a time-based job scheduler (on Unix-like systems).
Telematics industry (deals with movable, IoT-enabled devices).
Problem statements: 1st, predict device health; 2nd, driving pattern for the insurance premium; 3rd, push notifications without a pattern, or if I am not available at the location.
A telematics device is hardware.
It captures all data from the devices and generates a current status called a D-Pkg (Diagnosis Package),
e.g. battery temperature, charging level, engine data.
Based on this data, we need to predict the health of different devices.
In Europe/US a driving score is given, and the insurance premium is decided based on it.
Push notifications (message, email) without a pattern.
Context-based marketing.
NPS (Net Promoter Score) forecast.

Data source: telematics devices installed in my car. All collected data is sent to the TOS (Telematics Operating System).
The TOS converts all datasets into JSON, then sends the data to a middle tier and from there to Kafka (for streaming purposes) and to the EDB (enterprise-level database).
Spark Streaming was able to consume data from Kafka and then send the same to different databases, and here the data lake (Hadoop) was being created.
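A hedged Structured Streaming sketch of that Kafka-to-data-lake hop; the broker address, topic name and JSON schema are assumptions, and the spark-sql-kafka connector package has to be available to Spark.

```python
# Minimal sketch: consume TOS JSON messages from Kafka with Spark Structured
# Streaming and land them in the Hadoop data lake. Broker, topic and schema
# fields are assumptions; requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("telematics-stream").getOrCreate()

schema = StructType([StructField("vehicle_id", StringType()),
                     StructField("battery_temp", DoubleType()),
                     StructField("charge_level", DoubleType())])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "telematics-dpkg")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/lake/telematics/")
         .option("checkpointLocation", "/lake/_checkpoints/telematics/")
         .start())
```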

Data was collected from multiple sources like tele-sales, the e-commerce site, retail direct, retail indirect, the self-care portal and the mobile app, through the OSB (Oracle Service Bus) API gateway.
This API gateway was connected to OSM and then further to the CRM (Customer Relationship Management) system.

1. Performing diagnosis report: predict vehicle health.


Streaming data --> model built on historical data plus other local factors.
The vendor will have historical details of all the parts he is providing; for every part there will be an ID and a manufacturing date.
Banking industry
Problem statement: A/c opening; end-to-end automation of A/c opening.
Different types of accounts, like savings, current, OD, etc.
Different kinds of document submission.
Manual processes: manual data entry, manual document verification, 1-2 days for approval, multiple charges for loan processing, fast-track loan processing.
Loan default: decrease manual involvement.

A/c Opening Automation


PAN card, Aadhaar card, DL, rent agreement, gas bill.
Develop a portal with an option for the user to fill in all details and upload all documents there. The system should recognise and authenticate these documents (PAN card, Aadhaar, DL); with this automation, no agent is required to verify them.
First detect the documents, then capture all the data using OCR.
Based on document authentication, a digital debit card is generated. If there is an issue with the documents, the application is disapproved and an email alert is sent.
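A minimal sketch of the OCR capture step using pytesseract; the file name and the PAN-number pattern are illustrative assumptions.

```python
# Minimal sketch: pull text off an uploaded KYC document image with OCR,
# then look for an ID-like pattern. The regex is a rough assumption only.
import re
from PIL import Image
import pytesseract

def extract_pan_number(image_path):
    text = pytesseract.image_to_string(Image.open(image_path))
    # Assumed PAN pattern: 5 letters, 4 digits, 1 letter
    match = re.search(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", text)
    return match.group(0) if match else None

print(extract_pan_number("uploaded_pan_card.png"))   # hypothetical file name
```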

Loan Approval
Different departments even under loans: HL, PL, VL, GL, etc.
A multi-page form has to be filled, then KYC documents and salary slips, then background verification (visiting the mentioned address).
Automation of these components with less manual involvement.

Customer care Call


Conversational speech bot in place of IVR; correct pronunciation is required, and a native accent will not be recognised.
An acoustic speech model to handle native accents; annotate speech.
Acoustic frameworks: IBM Watson, Azure framework; Watson is cheaper than Azure, and an offline model is also available.
A chatbot to handle customer calls.

Customer loyalty or credibility


Cluster customers based on state, city, or a 5 km radius.
After clustering, decide on the risk involved in each cluster, then provide the info to the 3rd party who is calling the customer.

Raise flags based on geospatial location.
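A hedged sketch of the 5 km radius clustering using DBSCAN with haversine distance; the sample coordinates are made up.

```python
# Minimal sketch: cluster customers whose locations fall within ~5 km of each
# other, using DBSCAN with haversine distance. Sample coordinates are made up.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
coords_deg = np.array([[19.0760, 72.8777],    # illustrative lat/lon pairs
                       [19.0800, 72.8800],
                       [28.7041, 77.1025]])

# eps is in radians, so 5 km / earth radius expresses the 5 km threshold
db = DBSCAN(eps=5.0 / EARTH_RADIUS_KM, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords_deg))
print(labels)   # -1 marks customers outside any 5 km cluster (potential flag)
```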

Fund Management
Suppose I have to invest somewhere: risk parameters related to credit card default.
Factors like geopolitical news and market risk.

Stock market
