Split-Based Data-Parallel Framework for ML Training
SUBMITTED BY:
MILAN POKHREL (078BCT051)
SANAM BASTOLA (078BCT074)
SANDHYA SHRESTHA (078BCT077)
YAM NATH GURAGAIN (078BCT095)
Introduction
Training large machine learning models on a single machine is slow and limited by that machine's memory and compute. Distributed machine learning offers a solution by dividing the training workload across multiple machines, making the process faster and more scalable. This project aims to design and implement a scalable, modular, and beginner-friendly distributed machine learning training framework.
Related Theory
A Scalable Machine Learning Training Framework allows ML models to be
trained efficiently, even on very large datasets or with complex models by
distributing the work across multiple machines or processors.
Machine Learning
Distributed Computing
Distributed Machine Learning
Parameter Server Architecture
All-Reduce Communication
Fault Tolerance in Distributed Systems
Scalability in ML Training
Gradient Aggregation and Synchronization
Resource Orchestration with Docker and Kubernetes
Existing Distributed ML Frameworks
Objectives
To design a clean hub–worker architecture that coordinates round-based,
synchronous training and model aggregation using Federated Averaging (see the sketch after this list).
To support heterogeneous runtimes (CUDA, Apple MPS, CPU) with a common
training loop and minimal configuration.
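A minimal sketch of these two objectives, assuming PyTorch as the training backend: device selection across CUDA, Apple MPS, and CPU, and Federated Averaging over worker state dicts. The helper names pick_device and fed_avg are illustrative, not the framework's actual API.

# Illustrative sketch only; assumes PyTorch. pick_device and fed_avg are
# hypothetical helper names, not the framework's public API.
import torch

def pick_device() -> torch.device:
    """Choose CUDA, Apple MPS, or CPU, whichever is available on the worker."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def fed_avg(state_dicts, sample_counts):
    """Federated Averaging: weight each worker's parameters by its sample count."""
    total = sum(sample_counts)
    return {
        key: sum(sd[key].float() * (n / total)
                 for sd, n in zip(state_dicts, sample_counts))
        for key in state_dicts[0]
    }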
Literature Review
This literature review surveys current research and technologies in distributed machine learning (DML), data parallelism, synchronization strategies, and distributed training frameworks, and positions the proposed framework as a way to bridge the gaps it identifies.
Several distributed training platforms have been proposed to address the challenges of training large-scale ML models. Popular frameworks include Horovod, TensorFlow Distributed, and Ray.
Methodology
The development of the Scalable Distributed Machine Learning
Training Framework follows a structured and iterative
methodology to ensure performance, efficiency, and
extensibility. Major phases of this methodology include
requirement analysis, architecture design, implementation,
testing, and deployment. The methodology emphasizes the use
of modern ML technologies, distributed computing principles,
and modular system design.
System Architecture and Design
The system is built on a modular architecture comprising three main
components: Coordinator Node, Worker Node, and Communication Layer.
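To illustrate how the three components interact, the sketch below shows a Worker Node's view of one round, assuming the Communication Layer is plain HTTP and the Coordinator Node exposes /global and /upload endpoints; the hub URL, endpoint names, and payload format are assumptions for illustration, not the framework's fixed API.

# Worker-side sketch only; endpoint names and payload format are assumed.
import io
import requests
import torch

HUB = "http://coordinator:5000"   # coordinator address (assumed)

def run_round(model, train_one_epoch, num_samples):
    """Pull the global model, train locally, and push the update to the hub."""
    # 1. Download the current global model from the Coordinator Node.
    state = torch.load(io.BytesIO(requests.get(f"{HUB}/global").content),
                       map_location="cpu")
    model.load_state_dict(state)

    # 2. Train locally on this worker's shard of the data.
    train_one_epoch(model)

    # 3. Upload the updated weights through the Communication Layer.
    buf = io.BytesIO()
    torch.save({"state_dict": model.state_dict(),
                "num_samples": num_samples}, buf)
    requests.post(f"{HUB}/upload", data=buf.getvalue())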
Technology Stack
Algorithm Design
Data Parallelism
All-Reduce Synchronization (see the sketch after this list)
Dynamic Load Balancing
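As a generic illustration of data parallelism with all-reduce synchronization (not necessarily the framework's own communication code), the sketch below averages gradients across workers with torch.distributed after each local backward pass.

# Generic data-parallel sketch using torch.distributed. Assumes the process
# group is already initialized, e.g. dist.init_process_group("gloo").
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after the local backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over workers
            param.grad /= world_size                           # then average

# Typical use in each worker's training loop:
#   loss.backward()
#   all_reduce_gradients(model)
#   optimizer.step()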
Software Development Model
The agile software
development model is
employed, promoting
incremental improvements
and regular validation.
Requirement Analysis
Functional Requirements:
Distributed Training Management
Model Support
Monitoring and Logging
Non-Functional Requirements:
Performance
Scalability
Usability
Security
Maintainability
System Overview
Use Case Diagram
Gantt Chart
Work Completed
We implemented a barrier-synchronous, coordinator–worker distributed ML framework and tested it on HAM10000. A Flask hub orchestrates rounds; workers (CPU + one RTX 3050 GPU) train locally, upload weights, and wait at a barrier. The hub aggregates (FedAvg) and broadcasts the new global model.
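A minimal sketch of the hub side of this round protocol, assuming a Flask hub and PyTorch state dicts; the endpoint names, the in-memory round state, and EXPECTED_WORKERS are illustrative assumptions rather than the exact implementation.

# Hub-side sketch only: endpoint names, payload format, and the simple
# in-memory barrier are assumptions, not the exact implementation.
import io
import torch
from flask import Flask, request, send_file

app = Flask(__name__)
EXPECTED_WORKERS = 2      # e.g. one CPU worker and one RTX 3050 worker
round_updates = []        # (state_dict, sample_count) pairs for the current round
global_state = None       # latest aggregated global model

def fed_avg(states, counts):
    """Weighted average of worker state dicts (Federated Averaging)."""
    total = sum(counts)
    return {k: sum(sd[k].float() * (n / total) for sd, n in zip(states, counts))
            for k in states[0]}

@app.route("/upload", methods=["POST"])
def upload():
    """Workers POST locally trained weights, then poll /global (the barrier)."""
    global global_state
    payload = torch.load(io.BytesIO(request.data), map_location="cpu")
    round_updates.append((payload["state_dict"], payload["num_samples"]))
    if len(round_updates) == EXPECTED_WORKERS:   # all workers reached the barrier
        states, counts = zip(*round_updates)
        global_state = fed_avg(states, counts)   # FedAvg aggregation
        round_updates.clear()
    return {"status": "received"}

@app.route("/global", methods=["GET"])
def broadcast():
    """Workers download the new global model to start the next round."""
    buf = io.BytesIO()
    torch.save(global_state, buf)
    buf.seek(0)
    return send_file(buf, mimetype="application/octet-stream")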
Thank you!