Split-Based Data-Parallel Framework for ML Training
SUBMITTED BY:
MILAN POKHREL (078BCT051)
SANAM BASTOLA (078BCT074)
SANDHYA SHRESTHA (078BCT077)
YAM NATH GURAGAIN (078BCT095)
Introduction
Training large machine learning models on a single machine is slow and limited by that machine's memory and compute. Distributed machine learning offers a solution by dividing the training workload across multiple machines, making the process faster and more scalable. This project aims to design and implement a scalable, modular, and beginner-friendly distributed machine learning training framework.
Related Theory
A Scalable Machine Learning Training Framework allows ML models to be
trained efficiently, even on very large datasets or with complex models by
distributing the work across multiple machines or processors.
Machine Learning
Distributed Computing
Distributed Machine Learning
Parameter Server Architecture
All-Reduce Communication
Fault Tolerance in Distributed Systems
Scalability in ML Training
Gradient Aggregation and Synchronization
Resource Orchestration with Docker and Kubernetes
Existing Distributed ML Frameworks
Objectives
To design a clean hub–worker architecture that coordinates round-based,
synchronous training and model aggregation using Federated Averaging (see the sketch after this list).
To support heterogeneous runtimes (CUDA, Apple MPS, CPU) with a common
training loop and minimal configuration.
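A minimal sketch of these two objectives, assuming PyTorch as the training backend: device selection across CUDA, Apple MPS, and CPU, and Federated Averaging over worker state dicts. The helper names pick_device and fed_avg are illustrative, not the framework's actual API.

# Illustrative sketch only; assumes PyTorch. pick_device and fed_avg are
# hypothetical helper names, not the framework's public API.
import torch

def pick_device() -> torch.device:
    """Choose CUDA, Apple MPS, or CPU, whichever is available on the worker."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def fed_avg(state_dicts, sample_counts):
    """Federated Averaging: weight each worker's parameters by its sample count."""
    total = sum(sample_counts)
    return {
        key: sum(sd[key].float() * (n / total)
                 for sd, n in zip(state_dicts, sample_counts))
        for key in state_dicts[0]
    }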
Literature Review
This literature review surveys current research and technologies in distributed machine learning (DML), data parallelism, synchronization strategies, and distributed training frameworks, and positions the proposed framework as a way to bridge the gaps it identifies.
Several distributed training platforms have been proposed to address the challenges of training large-scale ML models. Popular frameworks include Horovod, TensorFlow Distributed, and Ray.
Methodology
The development of the Scalable Distributed Machine Learning
Training Framework follows a structured and iterative
methodology to ensure performance, efficiency, and
extensibility. Major phases of this methodology include
requirement analysis, architecture design, implementation,
testing, and deployment. The methodology emphasizes the use
of modern ML technologies, distributed computing principles,
and modular system design.
System Architecture and Design
The system is built on a modular architecture comprising three main
components: Coordinator Node, Worker Node, and Communication Layer.
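To illustrate how the three components interact, the sketch below shows a Worker Node's view of one round, assuming the Communication Layer is plain HTTP and the Coordinator Node exposes /global and /upload endpoints; the hub URL, endpoint names, and payload format are assumptions for illustration, not the framework's fixed API.

# Worker-side sketch only; endpoint names and payload format are assumed.
import io
import requests
import torch

HUB = "http://coordinator:5000"   # coordinator address (assumed)

def run_round(model, train_one_epoch, num_samples):
    """Pull the global model, train locally, and push the update to the hub."""
    # 1. Download the current global model from the Coordinator Node.
    state = torch.load(io.BytesIO(requests.get(f"{HUB}/global").content),
                       map_location="cpu")
    model.load_state_dict(state)

    # 2. Train locally on this worker's shard of the data.
    train_one_epoch(model)

    # 3. Upload the updated weights through the Communication Layer.
    buf = io.BytesIO()
    torch.save({"state_dict": model.state_dict(),
                "num_samples": num_samples}, buf)
    requests.post(f"{HUB}/upload", data=buf.getvalue())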
Technology Stack
Algorithm Design
Data Parallelism
All-Reduce Synchronization (see the sketch after this list)
Dynamic Load Balancing
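As a generic illustration of data parallelism with all-reduce synchronization (not necessarily the framework's own communication code), the sketch below averages gradients across workers with torch.distributed after each local backward pass.

# Generic data-parallel sketch using torch.distributed. Assumes the process
# group is already initialized, e.g. dist.init_process_group("gloo").
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after the local backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over workers
            param.grad /= world_size                           # then average

# Typical use in each worker's training loop:
#   loss.backward()
#   all_reduce_gradients(model)
#   optimizer.step()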
Software Development Model
The agile software
development model is
employed, promoting
incremental improvements
and regular validation.
Requirement Analysis
Functional Requirements:
Distributed Training Management
Model Support
Monitoring and Logging
Non-Functional Requirements:
Performance
Scalability
Usability
Security
Maintainability
System Overview
Use Case Diagram
Gantt Chart
Work Completed
We implemented a barrier-synchronous, coordinator–worker distributed ML framework and tested it on HAM10000. A Flask hub orchestrates rounds; workers (CPU + one RTX 3050 GPU) train locally, upload weights, and wait at a barrier. The hub aggregates (FedAvg) and broadcasts the new global model.
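A minimal sketch of the hub side of this round protocol, assuming a Flask hub and PyTorch state dicts; the endpoint names, the in-memory round state, and EXPECTED_WORKERS are illustrative assumptions rather than the exact implementation.

# Hub-side sketch only: endpoint names, payload format, and the simple
# in-memory barrier are assumptions, not the exact implementation.
import io
import torch
from flask import Flask, request, send_file

app = Flask(__name__)
EXPECTED_WORKERS = 2      # e.g. one CPU worker and one RTX 3050 worker
round_updates = []        # (state_dict, sample_count) pairs for the current round
global_state = None       # latest aggregated global model

def fed_avg(states, counts):
    """Weighted average of worker state dicts (Federated Averaging)."""
    total = sum(counts)
    return {k: sum(sd[k].float() * (n / total) for sd, n in zip(states, counts))
            for k in states[0]}

@app.route("/upload", methods=["POST"])
def upload():
    """Workers POST locally trained weights, then poll /global (the barrier)."""
    global global_state
    payload = torch.load(io.BytesIO(request.data), map_location="cpu")
    round_updates.append((payload["state_dict"], payload["num_samples"]))
    if len(round_updates) == EXPECTED_WORKERS:   # all workers reached the barrier
        states, counts = zip(*round_updates)
        global_state = fed_avg(states, counts)   # FedAvg aggregation
        round_updates.clear()
    return {"status": "received"}

@app.route("/global", methods=["GET"])
def broadcast():
    """Workers download the new global model to start the next round."""
    buf = io.BytesIO()
    torch.save(global_state, buf)
    buf.seek(0)
    return send_file(buf, mimetype="application/octet-stream")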
Thank you!