Split Based Data Parallel Framework for ML Training

SUBMITTED BY:

MILAN POKHREL (078BCT051)
SANAM BASTOLA (078BCT074)
SANDHYA SHRESTHA (078BCT077)
YAM NATH GURAGAIN (078BCT095)
Introduction
Distributed machine learning addresses the limits of single-machine training by dividing the training workload across multiple machines, making the process faster and more scalable. This project aims to design and implement a scalable, modular, and beginner-friendly distributed machine learning training framework.
Related Theory
A Scalable Machine Learning Training Framework allows ML models to be trained efficiently, even on very large datasets or with complex models, by distributing the work across multiple machines or processors.
 Machine Learning
 Distributed Computing
 Distributed Machine Learning
 Parameter Server Architecture
 All-Reduce Communication
 Fault Tolerance in Distributed Systems
 Scalability in ML Training
 Gradient Aggregation and Synchronization (see the sketch after this list)
 Resource Orchestration with Docker and Kubernetes
 Existing Distributed ML Frameworks
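As a minimal illustration of the gradient aggregation and parameter-server ideas listed above (the function and variable names here are ours, not part of the framework), the sketch below averages gradients reported by several workers and applies one synchronous SGD update on the server:

import numpy as np

def aggregate_and_step(global_params, worker_grads, lr=0.01):
    """Average gradients from all workers (synchronous step) and
    update the global parameters, parameter-server style."""
    avg_grad = np.mean(worker_grads, axis=0)   # element-wise mean over workers
    return global_params - lr * avg_grad       # one SGD update on the server

# Example: 3 workers report gradients for a 4-dimensional parameter vector.
params = np.zeros(4)
grads = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
params = aggregate_and_step(params, grads)     # each entry becomes -0.02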
Objectives
 To design a clean hub–worker architecture that coordinates round-based, synchronous training and model aggregation using Federated Averaging.
 To support heterogeneous runtimes (CUDA, Apple MPS, CPU) with a common
training loop and minimal configuration.
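A minimal sketch of how such a common training loop might pick its runtime; the select_device helper is an assumption of ours, while the CUDA and MPS availability checks are standard PyTorch calls:

import torch

def select_device() -> torch.device:
    """Pick the best available backend: CUDA GPU, Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
model = torch.nn.Linear(10, 2).to(device)   # same training loop on any backend
batch = torch.randn(8, 10, device=device)
loss = model(batch).sum()
loss.backward()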
Literature Review
This literature review explores the current research and
technologies in distributed machine learning
(DML),data parallelism, synchronization strategies, and
distributed training frameworks. It also positions the
proposed framework as a solution to bridge these gaps.
Several distributed training platforms have been
proposed to address the challenges of training large-
scale ML models. Popular frameworks include
Horovod ,TensorFlow Distributed, and Ray.
Methodology
The development of the Scalable Distributed Machine Learning
Training Framework follows a structured and iterative
methodology to ensure performance, efficiency, and
extensibility. Major phases of this methodology include
requirement analysis, architecture design, implementation,
testing, and deployment. The methodology emphasizes the use
of modern ML technologies, distributed computing principles ,
and modular system design.
 System Architecture and Design
The system is built on a modular architecture comprising three main components: Coordinator Node, Worker Node, and Communication Layer.
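As a hedged sketch of the Communication Layer between a worker and the coordinator (the hub URL, endpoint paths, and payload format are illustrative assumptions, not the framework's actual API), a worker could serialize its local weights and exchange them with the hub over HTTP:

import io
import requests
import torch

HUB_URL = "http://hub:5000"   # assumed coordinator address and port

def upload_weights(model: torch.nn.Module, round_id: int) -> None:
    """Worker side: serialize local weights and POST them to the hub.
    The /upload endpoint name is illustrative."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    requests.post(f"{HUB_URL}/upload/{round_id}", data=buf.getvalue(),
                  headers={"Content-Type": "application/octet-stream"},
                  timeout=60)

def download_global(model: torch.nn.Module, round_id: int) -> None:
    """Worker side: fetch the aggregated global model after the barrier."""
    resp = requests.get(f"{HUB_URL}/global/{round_id}", timeout=60)
    model.load_state_dict(torch.load(io.BytesIO(resp.content)))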
Technology Stack
 Algorithm Design
 Data Parallelism
 All-Reduce Synchronization (see the sketch after this list)
 Dynamic Load Balancing
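To make the all-reduce synchronization step concrete, the following sketch (assuming torch.distributed with a process group already initialized; it is an illustration, not the framework's exact code) averages each parameter's gradient across all ranks before the optimizer step:

import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all ranks (all-reduce)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over ranks
            p.grad /= world_size                           # then average

# Inside each worker process, after dist.init_process_group("gloo"):
#   loss.backward()
#   allreduce_gradients(model)   # gradients now identical on every rank
#   optimizer.step()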
Software Development Model
The agile software development model is employed, promoting incremental improvements and regular validation.
Requirement Analysis
 Functional Requirements:
 Distributed Training Management
 Model Support
 Monitoring and Logging
 Non-Functional Requirements :
 Performance
 Scalability
 Usability
 Security
 Maintainability
System Overview
Use Case Diagram
Gantt Chart
WORK COMPLETED
We implemented a barrier-synchronous, coordinator–worker distributed ML framework and tested it on HAM10000. A Flask hub orchestrates rounds; workers (CPU plus one RTX 3050 GPU) train locally, upload weights, and wait at a barrier. The hub aggregates the weights (FedAvg) and broadcasts the new global model.
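The hub's aggregation step can be summarized by a FedAvg sketch like the one below, assuming equal weighting of workers (a real hub may instead weight by local sample counts); it averages the uploaded state dicts into the new global model:

import copy
import torch

def fedavg(worker_states):
    """Federated Averaging over worker state_dicts (equal weights)."""
    global_state = copy.deepcopy(worker_states[0])
    for key in global_state:
        stacked = torch.stack([ws[key].float() for ws in worker_states])
        global_state[key] = stacked.mean(dim=0)   # per-parameter average
    return global_state

# Hub side, once every worker has reached the barrier for this round:
#   new_global = fedavg(uploaded_states)
#   broadcast new_global to all workers for the next round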
Thank you!
