The data matrix X and vector y are split across m nodes. Each core returns the best local split; a final aggregation selects the global best split.

Example 2: K-Nearest Neighbors (KNN)

For a query point x, compute distances to all training points:

\[ d(x, x_i) = \lVert x - x_i \rVert_2 \]

Each core processes a chunk X_j:

\[ D_j = \{\, d(x, x_i) \mid x_i \in X_j \,\} \]

The partial distance sets are then merged and the k nearest neighbors are selected.
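To make the chunked distance computation concrete, here is a minimal sketch of the idea (assuming Python with NumPy and the standard concurrent.futures process pool; the chunk count, synthetic data, and helper names such as chunk_distances are illustrative rather than taken from the paper's implementation).

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_distances(args):
    """Compute Euclidean distances from the query to one chunk X_j."""
    query, chunk = args
    return np.linalg.norm(chunk - query, axis=1)

def parallel_knn(query, X, k=5, n_workers=4):
    # Split the training matrix into chunks, one per core.
    chunks = np.array_split(X, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial = pool.map(chunk_distances, [(query, c) for c in chunks])
    # Merge the partial distance sets D_j and select the k nearest.
    distances = np.concatenate(list(partial))
    return np.argsort(distances)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 16))
    query = rng.normal(size=16)
    print(parallel_knn(query, X, k=5))
```

Because the chunks are concatenated in their original order, the returned indices refer directly to rows of X.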
C. Multithreading-Based Parallelization

Multithreading leverages the concurrent execution of multiple lightweight threads within a single core or across cores. It is lightweight compared to multicore processing. Threads share the same memory space and are suitable for tasks like I/O operations, feature-wise operations, model scoring, and shared-data computations.

Mathematical Example: Assume threads are computing gradients in parallel for a loss function L(θ):

\[ \nabla L(\theta) = \sum_{i=1}^{n} \nabla \ell(x_i, y_i, \theta) \]

Threads compute partial gradients:

\[ \nabla_j = \sum_{x_i \in X_j} \nabla \ell(x_i, y_i, \theta) \]

Total gradient:

\[ \nabla L(\theta) = \sum_{j=1}^{m} \nabla_j \]
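As a rough illustration of this partial-gradient pattern, the sketch below shards the data across a thread pool (assuming Python with NumPy and concurrent.futures; the squared-error loss, thread count, and helper names are illustrative). NumPy releases the GIL inside many of its vectorized kernels, which is what lets the threads overlap useful work here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(X_j, y_j, theta):
    """Gradient of a squared-error loss over one shard X_j (illustrative loss)."""
    residual = X_j @ theta - y_j
    return X_j.T @ residual

def threaded_gradient(X, y, theta, n_threads=4):
    # Shard the data; each thread works on one shard in shared memory.
    X_shards = np.array_split(X, n_threads)
    y_shards = np.array_split(y, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(partial_gradient, X_shards, y_shards,
                            [theta] * n_threads)
    # The total gradient is the sum of the per-thread partial gradients.
    return sum(partials)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8_000, 10)), rng.normal(size=8_000)
theta = np.zeros(10)
print(threaded_gradient(X, y, theta))
```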
Below are two examples demonstrating the application of multithreading-based parallelization in machine learning.

D. Parallel Gradient Descent

In large-scale optimization, gradient descent can be parallelized using data partitioning or model averaging. In this approach, each worker updates a local model and periodically synchronizes with a central parameter server.

Mathematical Representation

Standard Gradient Descent:

\[ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]

Parallel Gradient Descent (across m workers): each worker computes

\[ \nabla_j = \frac{1}{|X_j|} \sum_{x_i \in X_j} \nabla \ell(x_i, y_i, \theta_t), \]

and the update is then

\[ \theta_{t+1} = \theta_t - \eta \cdot \frac{1}{m} \sum_{j=1}^{m} \nabla_j. \]

Below is an example of applying parallel gradient descent to linear regression in machine learning.

Example: Linear Regression (Gradient Descent)

Loss function:

\[ L(\beta) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 \]

Gradient:

\[ \nabla L(\beta) = -\frac{1}{n} \sum_{i=1}^{n} x_i \left( y_i - x_i^\top \beta \right) \]
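A minimal sketch of this synchronous scheme, applied to the linear-regression loss above, might look as follows (assuming Python with NumPy; a local process pool stands in for distributed workers and a parameter server, and the learning rate, partition count, and synthetic data are illustrative).

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def worker_gradient(args):
    """One worker's averaged gradient of L(beta) on its partition (X_j, y_j)."""
    X_j, y_j, beta = args
    residual = y_j - X_j @ beta
    return -(X_j.T @ residual) / len(y_j)

def parallel_gd(X, y, n_workers=4, lr=0.1, steps=100):
    beta = np.zeros(X.shape[1])
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(steps):
            # Map: each worker computes its local gradient for the current beta.
            grads = list(pool.map(
                worker_gradient,
                [(Xp, yp, beta) for Xp, yp in zip(X_parts, y_parts)]))
            # Reduce: average the worker gradients and take one synchronous step.
            beta = beta - lr * np.mean(grads, axis=0)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5_000, 5))
    true_beta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_beta + 0.01 * rng.normal(size=5_000)
    print(parallel_gd(X, y))
```

The averaging step is where practical systems differ: a synchronous scheme such as this one waits for every worker, whereas asynchronous parameter servers apply gradients as they arrive.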
TABLE II: Parallelization Suitability of Algorithms by Meta-Algorithm Category

Ensemble - Boosting (AdaBoost, Gradient Boosting, HistGradientBoostingRegressor)
  MapReduce: Not ideal; sequential dependency between learners.
  Multicore: Suitable; parallel tree construction in later implementations.
  Multithreading: Limited; threads mostly help in tree-level operations.
  Gradient Descent: No; uses greedy stage-wise additive modeling, not gradient descent.

Ensemble - Bagging (Bagging, BaggingRegressor, Random Forest, ExtraTreeRegressor)
  MapReduce: Suitable; each model (tree) can be trained independently on subsets.
  Multicore: Highly suitable; models can be trained in parallel on CPU cores.
  Multithreading: Suitable; independent learners allow thread-based training.
  Gradient Descent: No.

Ensemble - Voting (VotingRegressor)
  MapReduce: Suitable; individual models can be trained and evaluated in parallel.
  Multicore: Suitable; combine predictions from multiple models using parallel threads or cores.
  Multithreading: Suitable; parallel prediction phase.
  Gradient Descent: Depends on base learners.

Tree-Based (Decision Tree, DecisionTreeRegressor, ExtraTreeRegressor)
  MapReduce: Limited; not naturally a MapReduce process, but parallel tree building is possible.
  Multicore: Suitable; tree nodes can be split in parallel.
  Multithreading: Limited; threaded operations during training or inference.
  Gradient Descent: No.

Lazy Learners / Distance-Based (k-Nearest Neighbor)
  MapReduce: Partially suitable; MapReduce can distribute distance calculations.
  Multicore: Suitable; parallelize distance computation across cores.
  Multithreading: Suitable; thread-based neighbor searching.
  Gradient Descent: No.

Probabilistic Models (Multinomial Naive Bayes)
  MapReduce: Suitable; MapReduce can calculate feature probabilities independently.
  Multicore: Suitable; conditional probabilities can be computed per core.
  Multithreading: Suitable; thread-safe in batch mode.
  Gradient Descent: No.

Neural Networks, Deep and Shallow (Neural Network, MLPRegressor, Convolutional Neural Network)
  MapReduce: Limited; gradient synchronization overhead in a distributed setting.
  Multicore: Highly suitable; parallel computation of matrix ops and batch samples.
  Multithreading: Suitable; thread-level operations in training batches.
  Gradient Descent: Yes; core training relies on stochastic or batch gradient descent.

Regularized Linear Models (Ridge, Ridge Classifier)
  MapReduce: Suitable; split data and aggregate gradients.
  Multicore: Highly suitable; vectorized matrix operations can be run across cores.
  Multithreading: Limited; single-threaded by default, but libraries support threads.
  Gradient Descent: Yes; solved via closed form or gradient descent.

Bayesian Models (BayesianRidge)
  MapReduce: Limited; MapReduce not practical due to matrix inversion.
  Multicore: Somewhat suitable; parallelize sampling or matrix ops.
  Multithreading: Some; if implemented via sampling.
  Gradient Descent: Yes; optimization via marginal likelihood or closed form.

Support Vector Machines (Support Vector Machine, SVR)
  MapReduce: Difficult; training involves solving a quadratic optimization problem.
  Multicore: Suitable; libraries like LIBSVM use multicore for kernel and matrix ops.
  Multithreading: Limited; high memory usage and kernel computation.
  Gradient Descent: Yes (via SGD or kernel approximations).

Linear Models (Linear Regression)
  MapReduce: Suitable; distribute computation of the normal equation or gradient steps.
  Multicore: Highly suitable; batch-based matrix multiplications can be parallelized.
  Multithreading: Limited; simple enough not to require heavy threading.
  Gradient Descent: Yes; solved via the normal equation or gradient descent.
Fig. 1. Sample images from the MNIST test dataset.
TABLE III: Parallelization Working Principles of Algorithms by Meta-Algorithm Category

Ensemble - Boosting (AdaBoost, Gradient Boosting, HistGradientBoostingRegressor)
  Map phase: train weak learners on data partitions.
  Reduce phase: combine learners' outputs and update weights.
  Multicore parallelism: parallelize tree building per iteration or across boosting stages.
  Multithreading: limited to tree split finding.
  Gradient descent / loss function: greedy optimization; loss $L^{(t+1)} = L^{(t)} - \eta \nabla L$, e.g., $\sum (y - F(x))^2$.

Ensemble - Bagging (Bagging, BaggingRegressor, Random Forest, ExtraTreeRegressor)
  Map phase: independently train base learners on bootstrapped samples.
  Reduce phase: aggregate predictions (majority vote or averaging).
  Multicore parallelism: independent learners trained in parallel.
  Multithreading: threaded training and prediction.
  Gradient descent / loss function: not applicable; non-iterative model.

Ensemble - Voting (VotingRegressor)
  Map phase: train individual base regressors.
  Reduce phase: combine predictions via averaging.
  Multicore parallelism: parallel model execution.
  Multithreading: parallel inference threads.
  Gradient descent / loss function: depends on base learners; typically not gradient-based.

Tree-Based (Decision Tree, DecisionTreeRegressor, ExtraTreeRegressor)
  Map phase: calculate local best splits on subsets.
  Reduce phase: merge to find the best global split.
  Multicore parallelism: find the best split per feature in parallel.
  Multithreading: parallelize candidate split search.
  Gradient descent / loss function: greedy optimization; surrogate loss given by impurity functions such as Gini or entropy.

Lazy Learners / Distance-Based (k-Nearest Neighbor)
  Map phase: compute distances between the query and training points.
  Reduce phase: select the k nearest based on aggregated results.
  Multicore parallelism: parallelize distance computations across cores.
  Multithreading: threaded neighbor ranking and voting.
  Gradient descent / loss function: no optimization; a distance metric (e.g., Euclidean) is used directly.

Probabilistic Models (Multinomial Naive Bayes)
  Map phase: estimate probabilities per class and feature.
  Reduce phase: aggregate counts into conditional probabilities.
  Multicore parallelism: parallel class-wise probability computation.
  Multithreading: thread-safe batch probability estimation.
  Gradient descent / loss function: no gradient descent; uses Bayes' rule directly, $P(C \mid X) \propto P(X \mid C)\,P(C)$.

Neural Networks, Deep and Shallow (Neural Network, MLPRegressor, CNN)
  Map phase: compute layer-wise activations on batches.
  Reduce phase: aggregate gradients across batches and update weights.
  Multicore parallelism: parallel batch matrix operations.
  Multithreading: threaded batch gradient updates.
  Gradient descent / loss function: gradient descent, $w^{(t+1)} = w^{(t)} - \eta \nabla L(w)$, e.g., MSE or cross-entropy loss.

Regularized Linear Models (Ridge, Ridge Classifier)
  Map phase: compute partial gradients per data partition.
  Reduce phase: aggregate gradients and update weights.
  Multicore parallelism: solve the closed form or use SGD per feature.
  Multithreading: threaded linear algebra operations.
  Gradient descent / loss function: gradient descent, $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$.

Bayesian Models (BayesianRidge)
  Map phase: estimate posteriors for coefficients on data splits.
  Reduce phase: combine sufficient statistics for the final posterior.
  Multicore parallelism: parallel sampling or matrix decomposition.
  Multithreading: threaded matrix inversion.
  Gradient descent / loss function: closed-form solution or EM; $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$ with a Bayesian prior.

Support Vector Machines (SVM, SVR)
  Map phase: compute the kernel matrix and support vectors per block.
  Reduce phase: merge and solve the QP problem.
  Multicore parallelism: parallel kernel matrix computation.
  Multithreading: limited parallel solver threads.
  Gradient descent / loss function: dual optimization (non-GD); approximate via SGD for the hinge loss $\max(0, 1 - y_i w^\top x_i)$.

Linear Models (Linear Regression)
  Map phase: compute $X^\top X$ and $X^\top y$ on chunks.
  Reduce phase: aggregate to solve the normal equations.
  Multicore parallelism: parallel linear algebra computation.
  Multithreading: threaded matrix operations.
  Gradient descent / loss function: $w = (X^\top X)^{-1} X^\top y$, or SGD $w^{(t+1)} = w^{(t)} - \eta \nabla L(w)$.
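To illustrate the linear-model row of Table III, here is a minimal sketch of the Map/Reduce split for the normal equations (assuming Python with NumPy; the plain loop over chunks stands in for a real Map phase on Hadoop or Spark, and the function names and synthetic data are illustrative).

```python
import numpy as np

def map_phase(chunk_X, chunk_y):
    """Map: each chunk contributes its partial X^T X and X^T y."""
    return chunk_X.T @ chunk_X, chunk_X.T @ chunk_y

def reduce_phase(partials):
    """Reduce: sum the partial statistics and solve the normal equations."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(6_000, 4))
w_true = np.array([2.0, -1.0, 0.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=6_000)

# Each iteration of this loop would run on a separate node in a real cluster.
partials = [map_phase(Xc, yc)
            for Xc, yc in zip(np.array_split(X, 3), np.array_split(y, 3))]
print(reduce_phase(partials))  # approximately w_true
```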
TABLE IV: Comparison of Speedup and Accuracy across Parallelization Techniques and Classifiers

TABLE V: Comparison of Speedup and Accuracy across Parallelization Techniques and Regressors
IV. CONCLUSION

In this work, we presented CAMP, a parallelization-aware classification framework for machine learning classifiers, tailored to inform design decisions in high-performance ML systems. By examining classifiers through the lens of execution semantics and hardware affinity, we established a taxonomy that links algorithmic structure with parallelization feasibility across three major paradigms: distributed MapReduce processing, accelerator-based gradient descent optimization, and shared-memory multicore threading.

Our analysis demonstrates that classifiers vary not only in prediction accuracy or interpretability but also in their compatibility with compute architectures. For example, ensemble methods such as bagging exhibit high thread- and process-level parallelism, whereas boosting methods are constrained by sequential dependencies. Similarly, deep neural networks align well with data-parallel training on GPUs but impose communication overhead in distributed setups without careful gradient aggregation.

We also identified key bottlenecks (e.g., kernel matrix explosion, greedy tree splits, sequential iteration) and mitigation strategies, including histogram-based learning, feature hashing, and approximate kernel methods. These observations are distilled into a set of compatibility tables and parallelization guidelines that aid ML practitioners in deploying efficient pipelines.
Future directions include extending CAMP to federated and decentralized learning systems, where communication topology and privacy constraints further affect parallelization. Additionally, exploring the intersection of CAMP with auto-parallelizing compilers, graph-based optimization frameworks (e.g., XLA, ONNX Runtime), and quantum-enhanced ML workflows offers promising research avenues for next-generation scalable AI systems.
ACKNOWLEDGMENT

This research is supported by NSERC Discovery Grant # RGPIN-2020-05422 and the Seed Grant Program at Concordia University of Edmonton. Some experiments were conducted on Google Colab, and we gratefully acknowledge its cloud infrastructure. Future experiments will be scaled on Canada's Digital Research Alliance Graham cluster (https://www.alliancecan.ca).
V. CLASSIFIERS BASED ON WORKING PRINCIPLES (FOR MAPREDUCE ADAPTATION)

Machine learning classifiers can be categorized based on their working principles, and this classification is useful when adapting them to MapReduce-based parallelization frameworks like Hadoop or Spark. Here is a structured breakdown:

1. Probabilistic Classifiers
Examples: Naive Bayes, Bayesian Networks
Principle: Use Bayes' theorem to estimate the posterior probability of classes given input features.
Parallelization Suitability: ✓ Highly parallelizable
Map: Count feature and class occurrences in parallel.
Reduce: Aggregate global counts to estimate probabilities.

2. Linear Classifiers
Examples: Logistic Regression, Perceptron, Linear SVM
Principle: Learn a linear boundary (hyperplane) between classes.
Parallelization Suitability: Moderately parallelizable
Map: Compute gradients on data partitions.
Reduce: Aggregate gradients to update weights.
Note: Requires iterative steps; use frameworks with in-memory iteration (e.g., Spark).

3. Tree-Based Classifiers
Examples: Decision Trees, Random Forest, Gradient Boosted Trees
Principle: Recursively partition data based on feature splits.
Parallelization Suitability:
• Decision Tree: Challenging to parallelize fully (needs a centralized split decision).
• Random Forest: ✓ Embarrassingly parallel (train trees independently).
• Gradient Boosted Trees: Sequential boosting limits parallelism.
Map: Train trees on bootstrapped data partitions.
Reduce: Combine trees into a forest.

4. Kernel-Based Classifiers
Examples: Support Vector Machines (SVM), Kernel Ridge
Principle: Transform data to higher dimensions using kernel functions for linear separation.
Parallelization Suitability: Difficult. Requires computation of the kernel matrix (O(n²)); not suitable for MapReduce unless approximated (e.g., random Fourier features).

5. Instance-Based Classifiers
Examples: k-Nearest Neighbors (k-NN)
Principle: Store training data and predict based on similarity to the k nearest training samples.
Parallelization Suitability: ✓ Easily parallelizable
Map: Compute distances to test points in partitions.
Reduce: Select the top-k nearest neighbors.
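To make the Map and Reduce steps of item 1 (Probabilistic Classifiers) concrete, here is a minimal sketch of count aggregation for Naive Bayes in plain Python (the partition layout, feature representation, and function names are illustrative stand-ins for a Hadoop or Spark job).

```python
from collections import Counter
from functools import reduce

def nb_map(partition):
    """Map: count (class, feature) and class occurrences within one partition."""
    counts = Counter()
    for features, label in partition:
        counts[("class", label)] += 1
        for f in features:
            counts[(label, f)] += 1
    return counts

def nb_reduce(c1, c2):
    """Reduce: merge partial counts from two partitions."""
    return c1 + c2

# Toy data: each record is (set of word features, class label).
partitions = [
    [({"win", "cash"}, "spam"), ({"meeting", "notes"}, "ham")],
    [({"cash", "prize"}, "spam"), ({"project", "notes"}, "ham")],
]
global_counts = reduce(nb_reduce, map(nb_map, partitions))
# Conditional probabilities are then estimated from the aggregated counts.
print(global_counts[("spam", "cash")], global_counts[("class", "spam")])
```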
I. CLASSIFIER CATEGORIES FOR GRADIENT-BASED OPTIMIZATION

VIII. CLASSIFIER CATEGORIZATION FOR MULTITHREADED PARALLELIZATION

Multithreading speedup depends on data size, algorithm structure, and the efficiency of the underlying BLAS implementation.

1. THREAD-FRIENDLY CLASSIFIERS (NATURALLY MULTITHREADED)

These classifiers can easily leverage multithreading, usually through internal C/C++ implementations with shared-memory concurrency. It is often faster than multiprocessing for memory-bound tasks. They are best suited for multicore CPUs with shared-memory access and lightweight thread scheduling.

2. SEMI-COMPATIBLE THREADED CLASSIFIERS

These classifiers are partially parallelizable: some parts can be multithreaded, but the algorithm's structure limits parallelism.
TABLE XIII: Thread-Friendly Classifiers and Tools Supporting Multithreading

Random Forest
  Principle: bootstrap aggregating decision trees.
  Parallelizable components: train trees in parallel threads.
  Tools: sklearn (n_jobs), joblib, OpenMP.

Gradient Boosted Trees
  Principle: additive tree models with residual learning.
  Parallelizable components: split finding, histogram building.
  Tools: XGBoost, LightGBM, CatBoost (nthread / num_threads).

Logistic Regression
  Principle: linear classification via sigmoid.
  Parallelizable components: matrix ops, gradient updates.
  Tools: liblinear, OpenBLAS, Intel MKL.

Linear SVM
  Principle: max-margin linear separator.
  Parallelizable components: matrix ops, dual optimization.
  Tools: liblinear, libsvm with multithreading.

Naive Bayes
  Principle: count-based probabilistic model.
  Parallelizable components: feature-class likelihood computation.
  Tools: vectorized in NumPy (MKL-enabled).

k-NN
  Principle: distance-based voting.
  Parallelizable components: parallel distance computation.
  Tools: sklearn, faiss, n_jobs.

Neural Networks
  Principle: layer-wise matrix operations.
  Parallelizable components: forward/backward propagation (CPU).
  Tools: PyTorch, TensorFlow with OpenMP/MKL.
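As a small usage sketch of the thread and process controls listed in Table XIII, the snippet below trains a scikit-learn random forest on all available cores via n_jobs (the dataset and parameter values are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=20_000, n_features=40, random_state=0)

# n_jobs=-1 asks scikit-learn/joblib to use every available core
# for both tree training and prediction.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```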