
CAMP: Categorizing Machine Learning Classifiers for Scalable Parallelism in Distributed and Multicore Systems
Parallelization-Aware Taxonomy Covering MapReduce, GPU Optimization, and CPU Threading Models

Baidya Nath Saha
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Pavan Sarvaiya
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Wali Mohammad Abdullah
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Md. Morshedul Islam
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Abstract—As machine learning systems scale across platforms—from multicore laptops to distributed cloud clusters—understanding how different classifiers adapt to parallel execution paradigms becomes essential. In this paper, we introduce CAMP (Classifier Adaptability Mapping for Parallelism), a unified framework to categorize machine learning classifiers based on their computational principles and their adaptability to three dominant parallelization strategies: (i) MapReduce-based batch processing (e.g., Hadoop, Spark), (ii) gradient-based optimization on accelerators (e.g., GPUs, TPUs via TensorFlow or PyTorch), and (iii) shared-memory multithreading (e.g., OpenMP, joblib, Intel MKL). We classify classifiers into parallel-friendly, semi-compatible, and thread-unfriendly categories across each paradigm, providing detailed tables of working principles, parallelizable components, and implementation tools. Our taxonomy helps practitioners choose appropriate models and frameworks that maximize performance based on system architecture and problem scale. We also highlight bottlenecks, such as kernel matrix computation in SVMs or sequential dependency in boosting, and propose strategies like feature hashing, histogram splits, and approximate methods to overcome them. This comprehensive mapping facilitates informed design choices for scalable machine learning pipelines across CPUs, GPUs, and distributed environments.

Index Terms—Machine Learning Classifiers, Parallel Computing, Distributed Systems, Multicore Processing, MapReduce, Gradient-Based Optimization.

I. INTRODUCTION

Modern machine learning (ML) workloads increasingly demand high-throughput, low-latency execution across heterogeneous compute environments—ranging from shared-memory multicore systems to distributed cloud platforms and accelerator-based clusters. While hardware capabilities continue to scale, the computational structure of ML classifiers remains a critical bottleneck in achieving scalable performance. Each classifier embodies different optimization dynamics, memory access patterns, and iterative dependencies, leading to widely varying parallelization behavior under different hardware and software execution models [1].

This paper introduces CAMP (Classifier Adaptability Mapping for Parallelism), a structured framework that categorizes ML classifiers according to their suitability for three dominant parallel computing paradigms:

1) MapReduce-based distributed batch processing (e.g., Hadoop, Apache Spark [2]),
2) Gradient-based optimization on accelerators (e.g., GPUs, TPUs via TensorFlow or PyTorch [3]),
3) Shared-memory multicore/threaded execution (e.g., OpenMP, Intel MKL, joblib [4]).

Our motivation stems from the observation that traditional ML taxonomies emphasize algorithmic properties (e.g., linear vs. nonlinear, generative vs. discriminative), but do not reflect how efficiently those algorithms map to parallel hardware.
For instance, while Random Forests exhibit embarrassingly parallel training, kernel-based SVMs suffer from O(n²) kernel matrix bottlenecks [5], making them poorly suited to distributed memory models without approximation [6].

CAMP addresses this gap by (i) analyzing classifier families based on their core computational graphs and state synchronization requirements, (ii) evaluating their compatibility with major parallelism strategies, and (iii) surveying implementation-level tooling across open-source libraries and high-performance frameworks [7]–[9].

We define three compatibility levels—parallel-friendly, semi-compatible, and thread-unfriendly—and assign classifiers accordingly across the three paradigms. Our contribution is twofold: (1) a practical taxonomy grounded in parallel system behavior rather than abstract algorithmic theory, and (2) actionable guidelines for ML engineers to select, configure, and optimize classifier models based on target architecture constraints and workload scale.

II. PARALLELIZATION OF MACHINE LEARNING ALGORITHMS

Processing large datasets is computationally expensive for traditional machine learning (ML) algorithms. Parallelization techniques are essential for scaling ML algorithms across processors, cores, and distributed nodes. This paper explores four major parallelization approaches: MapReduce, Multicore Processing, Multithreading, and Parallel Gradient Descent. Each technique is illustrated using well-known ML algorithms including Linear Regression, Decision Trees, k-Nearest Neighbors (KNN), and K-Means Clustering.

A. MapReduce-Based Parallelization

MapReduce is a distributed programming model introduced by Google for processing and generating large datasets; it is well suited to batch learning and embarrassingly parallel problems. It consists of two main functions: Map (task distribution) and Reduce (aggregation of results).

Working Principle
• Map: Divide the input data into key-value pairs and distribute tasks across nodes.
• Reduce: Aggregate the intermediate results from each node to produce the final output.

Two examples of applying the MapReduce algorithm to machine learning models are provided below.

Example 1: Linear Regression (Normal Equation)
To solve for
    β̂ = (X⊤X)^{-1} X⊤y,
the data matrix X and vector y are split across m nodes.

Map Phase:
    Map_j → (X_j⊤ X_j, X_j⊤ y_j)

Reduce Phase:
    X⊤X = Σ_{j=1}^{m} X_j⊤ X_j,    X⊤y = Σ_{j=1}^{m} X_j⊤ y_j

Then solve:
    β̂ = (Σ_j X_j⊤ X_j)^{-1} (Σ_j X_j⊤ y_j)

Example 2: K-Means Clustering
Map Phase: Assign each point to the nearest centroid:
    Assign(x_i) = argmin_k ∥x_i − µ_k∥²
Reduce Phase: Recompute cluster centroids:
    µ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i

B. Multicore-Based Parallelization

Multicore systems allow multiple cores within a CPU to execute processes in parallel, ideal for data-parallel operations. Each core processes a chunk of the data in parallel, and the results are merged.

Mathematical Example:
If we train a model using data X split across m cores:
    X = ∪_{j=1}^{m} X_j,    θ_j = f(X_j)
Each core computes model parameters θ_j, and the final model aggregates them:
    θ = (1/m) Σ_{j=1}^{m} θ_j

The following two examples illustrate how multicore-based parallelization can be applied to machine learning algorithms.

Example 1: Decision Tree Split Selection
Each core evaluates different features or split points in parallel. For feature f, compute:
    Gain(f) = Impurity(S) − Σ_v (|S_v| / |S|) · Impurity(S_v)
Each core returns the best local split; a final aggregation selects the global best split.
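The MapReduce decomposition of the normal equation in Example 1 above can be sketched in plain Python, emulating the map and reduce phases with a process pool. The chunking scheme, pool size, and synthetic data below are illustrative assumptions rather than the paper's actual setup.

import numpy as np
from multiprocessing import Pool

def map_phase(chunk):
    """Map: emit the sufficient statistics (X_j^T X_j, X_j^T y_j) for one partition."""
    Xj, yj = chunk
    return Xj.T @ Xj, Xj.T @ yj

def mapreduce_linear_regression(X, y, n_partitions=4):
    # Split the data matrix and target vector across "nodes" (here: worker processes).
    chunks = list(zip(np.array_split(X, n_partitions),
                      np.array_split(y, n_partitions)))
    with Pool(n_partitions) as pool:
        stats = pool.map(map_phase, chunks)
    # Reduce: sum the partial statistics, then solve the normal equations.
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    beta_true = np.arange(1.0, 6.0)
    y = X @ beta_true + 0.01 * rng.normal(size=10_000)
    print(mapreduce_linear_regression(X, y))  # close to [1, 2, 3, 4, 5]

The same pattern transfers to a real MapReduce engine because the map output per partition is small (a d×d matrix and a d-vector), so the reduce step is a cheap aggregation.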
Example 2: K-Nearest Neighbors (KNN)
For a query point x, compute distances to all training points:
    d(x, x_i) = ∥x − x_i∥²
Each core processes a chunk X_j:
    D_j = { d(x, x_i) | x_i ∈ X_j }
Merge distances and select the k nearest.

C. Multithreading-Based Parallelization

Multithreading leverages concurrent execution of multiple lightweight threads within a single core or across cores. It is lightweight compared to multicore processing. Threads share the same memory space and are suitable for tasks like I/O operations, feature-wise operations, model scoring, and shared-data computations.

Mathematical Example:
Assume threads are computing gradients in parallel for a loss function L(θ):
    ∇L(θ) = Σ_{i=1}^{n} ∇l(x_i, y_i, θ)
Threads compute partial gradients:
    ∇_j = Σ_{x_i ∈ X_j} ∇l(x_i, y_i, θ)
Total gradient:
    ∇L(θ) = Σ_{j=1}^{m} ∇_j

Below are two examples demonstrating the application of multithreading-based parallelization in machine learning.

Example 1: Decision Tree (Impurity Computation)
Threads compute impurity for different candidate splits:
    Impurity(S) = −Σ_c p_c log p_c,    p_c = n_c / |S|
Each thread handles a different feature or threshold.

Example 2: K-Means
Threads handle portions of data for centroid updates:
    µ_k^{(j)} = Σ_{x_i ∈ C_k^{(j)}} x_i,    |C_k^{(j)}| = count
Then, aggregate:
    µ_k = (Σ_j µ_k^{(j)}) / (Σ_j |C_k^{(j)}|)

D. Parallel Gradient Descent

In large-scale optimization, gradient descent can be parallelized using data partitioning or model averaging. In this approach, each worker updates a local model and periodically synchronizes with a central parameter server.

Mathematical Representation
Standard Gradient Descent:
    θ_{t+1} = θ_t − η ∇L(θ_t)
Parallel Gradient Descent (across m workers): Each worker computes:
    ∇_j = (1/|X_j|) Σ_{x_i ∈ X_j} ∇l(x_i, y_i, θ_t)
Then update:
    θ_{t+1} = θ_t − η · (1/m) Σ_{j=1}^{m} ∇_j

Below is an example of applying parallel gradient descent to linear regression in machine learning.

Example: Linear Regression (Gradient Descent)
Loss function:
    L(β) = (1/2n) Σ_{i=1}^{n} (y_i − x_i⊤ β)²
Gradient:
    ∇L(β) = −(1/n) Σ_{i=1}^{n} x_i (y_i − x_i⊤ β)
On m partitions:
    ∇_j = −(1/|X_j|) Σ_{x_i ∈ X_j} x_i (y_i − x_i⊤ β)
Update:
    β^{(t+1)} = β^{(t)} − η · (1/m) Σ_{j=1}^{m} ∇_j

Mini-batch SGD
Each worker does:
    β_j^{(t+1)} = β_j^{(t)} − η ∇_j
Then, model averaging:
    β^{(t+1)} = (1/m) Σ_{j=1}^{m} β_j^{(t+1)}
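A minimal NumPy sketch of the model-averaging update above, with the m workers simulated by a loop over partitions; the learning rate, partition count, and synthetic data are assumptions chosen only for illustration.

import numpy as np

def local_gradient(Xj, yj, beta):
    """Per-worker gradient of the squared loss on one partition."""
    return -(Xj.T @ (yj - Xj @ beta)) / len(yj)

def parallel_gd_averaging(X, y, n_workers=4, lr=0.1, epochs=200):
    partitions = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Each worker takes a local step from the shared parameters...
        local_betas = [beta - lr * local_gradient(Xj, yj, beta) for Xj, yj in partitions]
        # ...and the parameter server averages the local models.
        beta = np.mean(local_betas, axis=0)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 3))
y = X @ np.array([0.5, -2.0, 1.0]) + 0.05 * rng.normal(size=5_000)
print(parallel_gd_averaging(X, y))  # approaches [0.5, -2.0, 1.0]

In a real deployment the list comprehension would run on separate workers and only the averaged parameters would cross the network each round.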
E. Comparison of Parallelization Techniques for Machine Learning Algorithms

MapReduce is particularly suitable for distributed systems and large-scale batch processing tasks, such as matrix operations and K-Means clustering. Its key strength lies in the ability to distribute computation across multiple nodes efficiently. However, MapReduce introduces significant overhead in job scheduling and data shuffling, making it less effective for iterative machine learning algorithms that require frequent synchronization—such as gradient-based optimization methods.

Multicore processing, on the other hand, is ideal for CPU-bound operations that benefit from parallel execution across multiple physical cores within the same machine. It enables faster memory access and reduced latency compared to distributed frameworks. This makes multicore parallelism well-suited for tasks like decision tree construction, feature-wise computations, and distance calculations in k-Nearest Neighbors (KNN), where moderate-sized datasets can be fully loaded into memory and processed concurrently.

Multithreading offers a lightweight parallelization approach within a shared-memory environment. It excels in scenarios involving real-time inference, data preprocessing, I/O-bound workloads, and operations like entropy or information gain computations in tree-based models and feature selection. Because threads share memory, context switching is fast, although care must be taken to manage concurrency and avoid contention.

Parallel Gradient Descent, including distributed or mini-batch variants, is highly effective for iterative machine learning tasks such as training linear models, neural networks, or regression models. It allows gradients to be computed independently across data partitions, followed by synchronization steps to update model parameters. This technique combines the benefits of data parallelism and iterative convergence, making it a fundamental strategy in scalable deep learning and large-scale optimization pipelines.

The three tables respectively present: a comparative overview of parallelization techniques and their characteristics (Table I); the suitability of various machine learning algorithms for each parallelization method by meta algorithm category (Table II); and the detailed working principles of these methods across algorithm types, including MapReduce phases, core/thread-level parallelism, and gradient-based optimization (Table III).

III. EXPERIMENTAL RESULTS AND DISCUSSIONS

We conducted experiments on two real-world datasets to evaluate the effectiveness of different parallelization strategies in training machine learning models. The first dataset is the MNIST database of handwritten digits, containing 60,000 training and 10,000 test grayscale images [10]. Sample images from the MNIST test dataset are shown in Fig. 1. The second dataset is the historical climate change dataset of Alberta weather data, where air temperature is predicted from other weather parameters such as humidity, precipitation, and wind speed [11].

Table IV presents a comparative evaluation of four parallelization techniques—MapReduce, Multicore, Multithreading, and Gradient Descent—applied across multiple classifiers in terms of speedup and accuracy speedup. Here, speedup refers to the ratio of the training time of the non-parallel implementation to that of the parallel version, while accuracy speedup refers to the ratio of the classification accuracy of the parallel version to that of the baseline. A value greater than 1 in either metric indicates an improvement.

From Table IV, we observe that MapReduce achieves the highest speedup for AdaBoost (16.59×) and Gradient Boosting (4.8×), demonstrating its suitability for ensemble-based models. Multicore parallelization proves most effective for Gradient Boosting (7×), Support Vector Machines (3.53×), and CNNs (2.52×), where CPU-intensive matrix operations benefit from concurrent execution. Multithreading offers strong improvements for Random Forest (2.7×) and k Nearest Neighbor (2.59×), particularly in scenarios that involve repeated data access or shallow computation. Gradient Descent parallelization, while slightly less impactful on classifiers, shows meaningful improvement for kNN (5.24×) and Decision Trees (accuracy speedup of 1.89×), reflecting benefits of iterative parameter updates.

Table XVIII shows the corresponding results for regressors on the second dataset. Here, Gradient Descent delivers strong speedup for models like Ridge Regressor (5.6×) and Support Vector Regressor (5.33×), both of which rely on iterative optimization procedures well-suited for gradient-based acceleration. Multicore parallelism is highly effective for Random Forest Regressor (3.4×) and Histogram Gradient Boosting (2.89×), benefiting from efficient CPU utilization. Multithreading also shows consistent, though moderate, speedup across models, particularly for Extra Trees Regressor (2.75×) and Decision Tree Regressor (2.01×). Accuracy speedups across regressors generally hover around 1, with small improvements in specific configurations (e.g., Histogram Gradient Boosting with MapReduce at 1.11×), indicating that parallelization does not significantly affect regression model quality.

Overall, these results confirm that appropriate parallelization strategies can significantly accelerate both classification and regression tasks. While speedup varies by algorithm and parallelism method, most techniques maintain—if not slightly improve—predictive accuracy, highlighting their practical utility in large-scale machine learning workflows.
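Speedup as defined above (serial training time divided by parallel training time) can be estimated with a small timing harness. The model, data size, and core count below are placeholders, not the configuration used in the paper's experiments.

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=40, random_state=0)

def train_time(n_jobs):
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

t_serial = train_time(n_jobs=1)     # baseline: single core
t_parallel = train_time(n_jobs=-1)  # all available cores
print(f"speedup = {t_serial / t_parallel:.2f}x")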
TABLE I
COMPARISON OF PARALLELIZATION TECHNIQUES FOR MACHINE LEARNING ALGORITHMS

Method | Applicable Models | Granularity (task decomposition level) | Strength | Limitation
MapReduce | Linear Regression, K-Means | Coarse | Distributed aggregation | High latency
Multicore | Decision Tree, KNN | Medium | Efficient CPU utilization and low latency | Memory overhead
Multithreading | K-Means, Trees | Fine | Lightweight tasks and very low latency | Shared memory issues
Gradient Descent | Linear Models, NNs and Deep Learning | Iterative | Scalable optimization and medium latency | Sync overhead

TABLE II
PARALLELIZATION SUITABILITY OF ALGORITHMS BY META ALGORITHM CATEGORY
Meta Algorithm Algorithms MapReduce Multicore Multithreading Gradient Descent
Ensemble - Boosting AdaBoost, Gradient Boosting, Not ideal: sequential Suitable: Parallel tree Limited: Threads No: Uses greedy
HistGradientBoostingRegressor dependency between construction in later mostly help in tree- stage-wise additive
learners implementations level operations modeling, not
gradient descent
Ensemble - Bagging Bagging, BaggingRegressor, Ran- Suitable: Each model Highly suitable: Mod- Suitable: Independent No
dom Forest, ExtraTreeRegressor (tree) can be trained els can be trained in learners allow thread-
independently on sub- parallel on CPU cores based training
sets
Ensemble - Voting VotingRegressor Suitable: Individual Suitable: Combine Suitable: Parallel pre- Depends on base
models can be trained predictions from diction phase learners
and evaluated in multiple models
parallel using parallel threads
or cores
Tree-Based Decision Tree, DecisionTreeRe- Limited: Not natu- Suitable: Tree nodes Limited: Threaded No
gressor, ExtraTreeRegressor rally a MapReduce can be split in parallel operations during
process, but parallel training or inference
tree building possible
Lazy Learners / Distance-Based k Nearest Neighbor Partially suitable: Suitable: Parallelize Suitable: Thread- No
MapReduce can distance computation based neighbor
distribute distance across cores searching
calculations
Probabilistic Models Multinomial Naive Bayes Suitable: MapReduce Suitable: Conditional Suitable: Thread-safe No
to calculate feature probabilities can be in batch mode
probabilities indepen- computed per-core
dently
Neural Networks (Deep and Shallow) Neural Network, MLPRegressor, Limited: Gradient Highly suitable: Par- Suitable: Thread- Yes: Core training
Convolutional Neural Network synchronization allel computation of level ops in training relies on stochastic
overhead in matrix ops and batch batches or batch gradient
distributed setting samples descent
Regularized Linear Models Ridge, Ridge Classifier Suitable: Split data Highly suitable: Vec- Limited: Single- Yes: Solved via
and aggregate gradi- torized matrix oper- threaded by default, closed form or
ents ations can be run but libraries support gradient descent
across cores threads
Bayesian Models BayesianRidge Limited: MapReduce Somewhat suitable: Some: If Yes: Optimization via
not practical due to Parallelize sampling implemented via marginal likelihood or
matrix inversion or matrix ops sampling closed-form
Support Vector Machines Support Vector Machine, SVR Difficult: Training Suitable: Libraries Limited: High mem- Yes (in SGD or kernel
involves solving like LIBSVM use ory usage and kernel approximations)
a quadratic multicore for kernel computation
optimization problem and matrix ops
Linear Models Linear Regression Suitable: Distribute Highly suitable: Limited: Simple Yes: Solved via nor-
computation of Batch-based matrix enough not to require mal equation or gradi-
normal equation or multiplications can heavy threading ent descent
gradient steps be parallelized

The second dataset is the historical climate change dataset [11] comprising Alberta weather data, where air temperature is predicted based on other meteorological parameters such as humidity, precipitation, and wind speed. Table XVIII presents a comparative evaluation of four parallelization techniques applied across multiple regressors, in terms of speedup and mean squared error (MSE) speedup.

Fig. 1. Sample images from the MNIST test dataset.

IV. CONCLUSION

In this work, we presented CAMP, a parallelization-aware classification framework for machine learning
TABLE III
PARALLELIZATION WORKING PRINCIPLES OF ALGORITHMS BY META ALGORITHM CATEGORY
MapReduce
Meta Algorithm Algorithms Multicore Parallelism Multithreading Gradient Descent / Loss Function
Map Phase Reduce Phase
(t+1)
Ensemble - Boosting AdaBoost, Gradient Train weak learners Combine learners’ Parallelize tree build- Limited to tree split P loss: L
Greedy optimization; =
Boosting, HistGradient- on data partitions outputs and update ing per iteration or finding L(t) − η∇L, e.g., (y − F (x))2
BoostingRegressor weights across boosting stages
Ensemble - Bagging Bagging, BaggingRegres- Independently train Aggregate predictions Independent learners Threaded training and Not applicable; non-iterative model
sor, Random Forest, Ex- base learners on (majority vote or av- trained in parallel prediction
traTreeRegressor bootstrapped samples eraging)
Ensemble - Voting VotingRegressor Train individual base Combine predictions Parallel model execu- Parallel inference Depends on base learners; typically not
regressors via averaging tion threads gradient-based
Tree-Based Decision Tree, Calculate local best Merge to find best Find best split per Parallelize candidate Greedy optimization; surrogate loss: impu-
DecisionTreeRegressor, splits on subsets global split feature in parallel split search rity functions such as Gini or entropy
ExtraTreeRegressor
Lazy Learners / Distance-Based k Nearest Neighbor Compute distances Select k nearest based Parallelize distance Threaded neighbor No optimization; distance metric (e.g., Eu-
between query and on aggregated results computations across ranking and voting clidean) used directly
training points cores
Probabilistic Models Multinomial Naive Bayes Estimate probabilities Aggregate counts into Parallel class- Thread-safe batch No gradient descent; uses Bayes rule di-
per class and feature conditional probabili- wise probability probability estimation rectly P (C|X) ∝ P (X|C)P (C)
ties computation
Neural Networks (Deep and Shallow) Neural Network, MLPRe- Compute layer-wise Aggregate gradients Parallel batch matrix Threaded batch gradi- Gradient descent: w(t+1) = w(t) −
gressor, CNN activations on batches across batches and operations ent updates η∇L(w), e.g., MSE or cross-entropy loss
update weights
Regularized Linear Models Ridge, Ridge Classifier Compute partial gra- Aggregate gradients Solve closed-form or Threaded linear alge- Gradient descent: minw ||Xw − y||2 +
dients per data parti- and update weights use SGD per feature bra ops λ||w||2
tion
Bayesian Models BayesianRidge Estimate posteriors Combine sufficient Parallel sampling or Threaded matrix in- Closed-form solution or EM; minw ||Xw−
for coefficients on statistics for final matrix decomposition version y||2 + λ||w||2 with Bayesian prior
data splits posterior
Support Vector Machines SVM, SVR Compute kernel ma- Merge and solve QP Parallel kernel matrix Limited parallel Dual optimization (non-GD); approximate
trix and support vec- problem computation solver threads via SGD for hinge loss: max(0, 1 −
tors per block yi w T xi )
Linear Models Linear Regression Compute XTX and Aggregate to solve Parallel linear algebra Threaded matrix op- w = (X T X)−1 X T y, or SGD: w(t+1) =
XTy on chunks normal equations computation erations w(t) − η∇L(w)

TABLE IV
COMPARISON OF SPEEDUP AND ACCURACY SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND CLASSIFIERS

Classifier | Speedup: MapReduce, Multicore, Multithreading, Gradient Descent | Accuracy Speedup: MapReduce, Multicore, Multithreading, Gradient Descent
AdaBoost 16.59 0.99 1.02 1 0.8 1 0.89 0.9
Bagging 1.66 1.21 1.24 1 1 1 0.98 1
Convolutional Neural Network 1.5 2.52 1.02 1.03 1 1 1 1
Decision Tree 1 1.5 1.33 0.84 1.06 1 1 1.89
Gradient Boosting 4.8 7 1.8 1.2 1 1.01 0.97 1
k Nearest Neighbor 1.2 1.03 2.59 5.24 1.08 1 1 0.9
Multinomial Naive Bayes 0.8 0.7 0.9 0.85 0.9 1 1 0.9
Neural Network 0.8 0.9 1.8 0.8 1 1 1 1
Random Forest 1.2 1.5 2.7 1.1 1 1 1 1
Ridge Classifier 1.32 1.6 1.2 0.9 0.9 1 1 0.9
Support Vector Machine 2.3 3.53 1.2 1.1 1.1 1.01 1 1

TABLE V
COMPARISON OF SPEEDUP AND MEAN SQUARED ERROR SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND REGRESSORS

Regressor | Speedup: MapReduce, Multicore, Multithreading, Gradient Descent | Mean Squared Error Speedup: MapReduce, Multicore, Multithreading, Gradient Descent
Linear Regression 7.69 0.39 1.05 1 1 1 0.99 1
DecisionTree Regressor 1.15 1.52 1.49 0.0 1 1.27 1.26 1
ExtraTreeRegressor 1.16 0.74 0.79 1.02 1.4 1.26 1.2 1
HistGradientBoostingRegressor 1.9 0.51 0.41 0.0 0.99 1 1.01 1
VotingRegressor 1.70 1.82 1.06 0.14 1.01 1 1.03 0.99
MLPRegressor 0.65 1.9 0.62 2.9 0.12 1.1 0.94 1.02
SVR 0.44 1.06 1.83 0.58 1.29 1.02 1.01 1.01
BaggingRegressor 0.89 1.48 1.51 1.6 0.99 1 1.10 1.3
Ridge 0.43 0.2 0.5 0.4 0.21 1 1 1
BayesianRidge 1.14 0.77 1.07 1.1 1 0.99 1 1.01

classifiers, tailored to inform design decisions in high-performance ML systems. By examining classifiers through the lens of execution semantics and hardware affinity, we established a taxonomy that links algorithmic structure with parallelization feasibility across three major paradigms: distributed MapReduce processing, accelerator-based gradient descent optimization, and shared-memory multicore threading.

Our analysis demonstrates that classifiers vary not only in prediction accuracy or interpretability but also in their compatibility with compute architectures. For example, ensemble methods such as bagging exhibit high thread- and process-level parallelism, whereas boosting methods are constrained by sequential dependencies. Similarly, deep neural networks align well with data-parallel training on GPUs but impose communication overhead in distributed setups without careful gradient aggregation.

We also identified key bottlenecks (e.g., kernel matrix explosion, greedy tree splits, sequential iteration) and mitigation strategies, including histogram-based learning, feature hashing, and approximate kernel methods.
These observations are distilled into a set of compatibility tables and parallelization guidelines that aid ML practitioners in deploying efficient pipelines.

Future directions include extending CAMP to federated and decentralized learning systems, where communication topology and privacy constraints further affect parallelization. Additionally, exploring the intersection of CAMP with auto-parallelizing compilers, graph-based optimization frameworks (e.g., XLA, ONNX Runtime), and quantum-enhanced ML workflows offers promising research avenues for next-generation scalable AI systems.

ACKNOWLEDGMENT

This research is supported by NSERC Discovery Grant # RGPIN-2020-05422 and the Seed Grant Program at Concordia University of Edmonton. Some experiments were conducted on Google Colab, and we gratefully acknowledge its cloud infrastructure. Future experiments will be scaled on Canada's Digital Research Alliance Graham cluster (https://www.alliancecan.ca).

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, "On certain integrals of Lipschitz-Hankel type involving products of Bessel functions," Phil. Trans. Roy. Soc. London, vol. A247, pp. 529–551, April 1955.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019, https://pytorch.org/.
[4] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[5] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.
[6] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," Advances in Neural Information Processing Systems (NIPS), vol. 20, 2007.
[7] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[8] G. Ke et al., "LightGBM: A highly efficient gradient boosting decision tree," Advances in Neural Information Processing Systems (NeurIPS), 2017.
[9] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
[10] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[11] Jun 2025. [Online]. Available: https://climate.weather.gc.ca/climate_data/daily_data_e.html

V. CLASSIFIERS BASED ON WORKING PRINCIPLES (FOR MAPREDUCE ADAPTATION)

Machine learning classifiers can be categorized based on their working principles, and this classification is useful when adapting them to MapReduce-based parallelization frameworks like Hadoop or Spark. Here's a structured breakdown:

1. Probabilistic Classifiers
Examples: Naive Bayes, Bayesian Networks
Principle: Use Bayes' Theorem to estimate the posterior probability of classes given input features.
Parallelization Suitability: ✓ Highly parallelizable
Map: Count feature and class occurrences in parallel.
Reduce: Aggregate global counts to estimate probabilities.

2. Linear Classifiers
Examples: Logistic Regression, Perceptron, Linear SVM
Principle: Learn a linear boundary (hyperplane) between classes.
Parallelization Suitability: Moderately parallelizable
Map: Compute gradients on data partitions.
Reduce: Aggregate gradients to update weights.
Note: Requires iterative steps—use frameworks with in-memory iteration (e.g., Spark).

3. Tree-Based Classifiers
Examples: Decision Trees, Random Forest, Gradient Boosted Trees
Principle: Recursively partition data based on feature splits.
Parallelization Suitability:
• Decision Tree: Challenging to parallelize fully (needs centralized split decisions).
• Random Forest: ✓ Embarrassingly parallel (train trees independently).
• Gradient Boosted Trees: Sequential boosting limits parallelism.
Map: Train trees on bootstrapped data partitions.
Reduce: Combine trees into a forest.

4. Kernel-Based Classifiers
Examples: Support Vector Machines (SVM), Kernel Ridge
Principle: Transform data to higher dimensions using kernel functions for linear separation.
Parallelization Suitability: Difficult
Requires computation of the kernel matrix (O(n²)), not suitable for MapReduce unless approximated (e.g., random Fourier features).

5. Instance-Based Classifiers
Examples: k-Nearest Neighbors (k-NN)
Principle: Store training data and predict based on similarity to k nearest training samples.
Parallelization Suitability: ✓ Easily parallelizable
Map: Compute distances to test points in partitions.
Reduce: Select the top-k nearest neighbors.

6. Ensemble Methods
Examples: Bagging, Boosting, Stacking
Principle: Combine predictions of multiple base classifiers.
Parallelization Suitability:
• Bagging: ✓ Parallelizable (independent learners)
• Boosting: Sequential dependency
• Stacking: Meta-learner adds some dependency

7. Neural Network Classifiers
Examples: Multi-layer Perceptrons (MLPs), CNNs, RNNs
Principle: Learn hierarchical non-linear transformations via layers of neurons.
Parallelization Suitability: ✓ Good, but better on GPU/parameter-server setups
Map: Compute local gradients.
Reduce: Aggregate gradients to update parameters.
Challenge: Large models need efficient synchronization (e.g., asynchronous SGD, data/model parallelism).

Adapting to the MapReduce Framework

Key Considerations for MapReduce Adaptation
• Data Partitioning: Essential for balancing computation across mappers.
• Model Synchronization: Especially in iterative models like logistic regression or neural nets.
• Memory Management: Tree-based methods can become memory-intensive.
• Framework: Prefer Spark over Hadoop for iterative algorithms (due to in-memory computation).

VI. CLASSIFIER CATEGORIZATION FOR GRADIENT-BASED OPTIMIZATION-BASED PARALLELIZATION

Machine learning classifiers can also be classified based on their working principles in the context of gradient-based optimization-based parallelization, which is commonly used in modern large-scale learning (e.g., with GPUs, TPUs, or parameter servers). This is particularly relevant in frameworks like TensorFlow, PyTorch, or Horovod, where gradient computations are central.

Classifier Categories for Gradient-Based Optimization

1. Gradient-Friendly Models (Directly Optimized via Gradients)
These models use differentiable loss functions and are naturally adapted to parallelization through data-parallel or model-parallel gradient computation.
Suitability: Suitable — easily adapted using standard gradient descent and frameworks like PyTorch DDP, TensorFlow MirroredStrategy, or Horovod.

2. Weakly Gradient-Compatible Models (Approximation or Custom Gradients)
These models benefit from gradient-based optimization with some engineering effort.
Suitability: Partially suitable

3. Non-Gradient-Based Models (Incompatible with Gradient Optimization)
These models rely on discrete or statistical methods and are not suitable for gradient descent.
Suitability: Not suitable

Gradient-Based Parallelization Strategies

Summary Table

VII. CLASSIFIER CATEGORIES FOR MULTICORE PARALLELIZATION

Multicore parallelization typically leverages thread-based or process-based parallelism (e.g., via joblib, OpenMP, multiprocessing, or libraries like XGBoost and LightGBM with native multicore support).

1. EMBARRASSINGLY PARALLEL CLASSIFIERS
These classifiers naturally allow independent computations that can be easily distributed across cores with minimal synchronization overhead. They scale well on multicore CPUs with minimal communication.
Suitability: High

2. ITERATIVE OPTIMIZATION-BASED CLASSIFIERS
These classifiers use iterative updates (e.g., via gradient descent), where batch-wise operations or matrix algebra can be parallelized across cores using BLAS/LAPACK or vectorized operations. They are often limited by memory bandwidth or gradient synchronization.
Suitability: Moderate
Classifier Type Map Phase Reduce Phase Comments
Naive Bayes Count features/class co-occurrences Aggregate probabilities Very fast, ideal for MapReduce
Logistic Regression Compute gradients per partition Aggregate gradients & update weights Needs iterative job chaining
Decision Tree Calculate local splits Merge and find best global split Limited parallelism
Random Forest Train independent trees Combine all trees Very scalable
SVM (Linear) Compute hinge loss & gradients Aggregate and update Scalable if linear; kernel SVM is not
k-NN Compute distances to all training points Select k nearest Efficient with distributed storage
Neural Networks Compute gradients in batch Aggregate gradients and update Use Spark + GPU or parameter server
Gradient Boosted Trees Train weak learners sequentially Combine weak learners Partial parallelism in early steps
TABLE VI
C LASSIFIER BEHAVIOR IN M AP R EDUCE ENVIRONMENTS
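The Naive Bayes row in the table above maps naturally onto Spark's RDD API. The sketch below is a hedged illustration only: it assumes a local SparkSession and a toy dataset of (label, tokens) pairs, and it computes just the count statistics from which the conditional probabilities would be estimated.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("nb-counts").getOrCreate()
sc = spark.sparkContext

# Toy corpus: (class label, list of feature tokens).
docs = [("spam", ["win", "money", "now"]),
        ("ham",  ["meeting", "at", "noon"]),
        ("spam", ["win", "prize"])]

rdd = sc.parallelize(docs)

# Map: emit ((class, feature), 1) for every feature occurrence.
pairs = rdd.flatMap(lambda doc: [((doc[0], tok), 1) for tok in doc[1]])

# Reduce: aggregate global counts across partitions.
feature_counts = pairs.reduceByKey(lambda a, b: a + b).collectAsMap()
class_counts = rdd.map(lambda doc: (doc[0], 1)).reduceByKey(lambda a, b: a + b).collectAsMap()

print(class_counts)    # e.g., {'spam': 2, 'ham': 1}
print(feature_counts)  # e.g., {('spam', 'win'): 2, ...}
spark.stop()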

Classifier Principle Optimization Parallelization Strategy


Logistic Linear boundary us- Gradient descent Data-parallel SGD on data
Regression ing sigmoid shards
Linear SVM Max-margin linear Subgradient or Data-parallel subgradient de-
separator SGD scent
Neural Networks Nonlinear function Backpropagation Data/model parallelism with
(MLPs, CNNs, approximators gradient aggregation
RNNs)
Multinomial Logis- Multiclass softmax Cross-entropy Highly scalable in deep learn-
tic Regression classifier loss ing
Ridge / Lasso Re- Regularized linear Closed-form or Mini-batch SGD in parallel
gression models gradient-based

Classifier Principle Optimization Parallelization Strategy


Kernel SVM Kernel-based nonlin- Dual Approximate kernels (e.g.,
ear separation optimization RFF)
Boosted Trees Additive loss- Greedy tree con- Partial gradient use; boosting is
minimizing model struction sequential
CRFs Structured prediction Log-likelihood Use batched or approximate in-
gradient ference
Autoencoders / Latent/adversarial Backpropagation Unstable gradients; careful
GANs modeling handling needed

Classifier Principle Optimization Parallelization Strategy


Decision Trees Recursive binary Greedy, non- Not differentiable; split search
(CART) splits gradient
Random Forests Ensemble of trees No gradient use Parallel trees; not gradient-
based
Naive Bayes Probabilistic Closed-form Count-based parallelization
inference with (MLE)
counts
k-NN Voting by similarity No optimization Parallel distance computations
only

Strategy Type Description Tools/Examples


Data Parallelism Split data across workers, compute gra- PyTorch DDP, Tensor-
dients independently, then aggregate Flow MirroredStrategy,
Horovod
Model Parallelism Distribute model layers or parameters Transformers, large
across devices CNNs
Hybrid Parallelism Combine data and model parallelism Megatron-LM,
DeepSpeed
Asynchronous SGD Workers update shared model indepen- Parameter servers
dently (may use stale gradients)
Gradient Accumulation Simulate large batch training without Transformer training on
high memory use GPUs
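As one concrete instance of the strategies listed above, gradient accumulation can be sketched in PyTorch: several micro-batch backward passes are accumulated before a single optimizer step, simulating a larger effective batch without the extra memory. The tiny model, random data, and batch sizes are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accum_steps = 4           # one optimizer step per 4 micro-batches
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (xb, yb) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the accumulated gradient averages correctly
    loss.backward()                              # gradients accumulate in the .grad buffers
    if step % accum_steps == 0:
        optimizer.step()                         # one update per effective (large) batch
        optimizer.zero_grad()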
Working Principle Classifiers Gradient-Based
Parallelization
Linear Models Logistic Regression, SVM (linear) Fully compatible
Neural Models MLP, CNN, RNN, Transformers Fully compatible
Probabilistic Models Naive Bayes, Bayesian Nets Not gradient-based
Tree-Based Models Decision Trees, Random Forests Not differentiable
Ensemble Methods Bagging, Boosting Partially compati-
ble
Kernel Methods SVM (RBF), Kernel Ridge Only with approxi-
mations
Distance-Based Models k-NN, Clustering Not gradient-
compatible

Classifier Principle Parallelizable Components Tools


Random Forest Ensemble of decision trees Train trees independently sklearn (n_jobs), joblib,
on bootstrap samples XGBoost, LightGBM
Bagging Classifier Voting over multiple base Train each model on different sklearn.ensemble.BaggingClassifier
models data
Naive Bayes Independent feature likeli- Count frequencies per NumPy, multiprocessing
hoods class/feature
k-Nearest Neighbors Distance-based voting Distance computation over data joblib, KD-trees with parallel
chunks traversal
TABLE VII
E MBARRASSINGLY PARALLEL CLASSIFIERS WITH STRONG MULTICORE SCALABILITY

Classifier Principle Parallelizable Components Tools


Logistic Regression Sigmoid + linear classifier Gradient computation on mini- sklearn, NumPy, MKL
batches
Linear/Polynomial Max-margin linear separa- Subgradient computation, dual liblinear, libsvm with mul-
SVM tor optimization ticore
Neural Networks Backpropagation over lay- Matrix operations per layer, PyTorch (CPU mode), TensorFlow
ers mini-batch training
Multinomial Logistic Generalized softmax clas- Cross-entropy loss, softmax MKL-accelerated or thread-parallel
Regression sifier
Ridge / Lasso Regres- Regularized least squares Matrix inversion or coordinate sklearn, glmnet, OpenMP-
sion descent based libraries
TABLE VIII
G RADIENT- BASED CLASSIFIERS WITH MODERATE MULTICORE PARALLELIZATION

3. GREEDY OR RECURSIVE CLASSIFIERS
These involve sequential or recursive steps (like tree splitting or boosting), which limit parallelism but allow partial multicore speedup in some components. Best-split selection is not parallel-friendly, but candidate evaluation can be.
Suitability: Partial

4. NON-OPTIMIZED OR LAZY CLASSIFIERS
These are not easily parallelizable through gradient descent or greedy search but can parallelize evaluation or distance calculations. They require batching or low-level optimization to benefit from multiple cores.
Suitability: Low (but possible)

MULTICORE PARALLELIZATION TECHNIQUES (CPU) — SUMMARY TABLE

Classifying machine learning classifiers based on their working principles and their adaptability to multithreaded parallelization (i.e., shared-memory parallelism using threads within the same process) helps in optimizing performance on systems like laptops, CPUs, or cloud VMs with multi-core/multi-thread CPUs.

CLASSIFIER CATEGORIES FOR MULTITHREADED PARALLELIZATION

Multithreaded parallelization typically occurs within a process, leveraging CPU threads and shared memory. It is used through libraries like OpenMP, Intel MKL, BLAS, NumPy, scikit-learn, or native parallelism in frameworks like XGBoost and LightGBM.
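Within a single process, thread-based parallelism pays off mainly when the heavy work happens in C/BLAS code that releases the GIL. A hedged sketch using joblib's threading backend for chunked distance computations follows; the chunk size and random data are assumptions made only for illustration.

import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(0)
train = rng.normal(size=(50_000, 64))
query = rng.normal(size=(1, 64))

def chunk_distances(chunk):
    """Squared Euclidean distances for one chunk; NumPy releases the GIL in these ops."""
    diff = chunk - query
    return np.einsum("ij,ij->i", diff, diff)

chunks = np.array_split(train, 8)
# Threads share memory, so the training chunks are not copied between workers.
dists = np.concatenate(Parallel(n_jobs=8, prefer="threads")(
    delayed(chunk_distances)(c) for c in chunks))
k = 5
print("indices of the k nearest points:", np.argsort(dists)[:k])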
Classifier Principle Parallelizable Components Tools
Decision Tree (CART) Greedy recursive partition- Find best split per feature in sklearn, LightGBM (histogram
ing parallel split search)
Gradient Boosted Trees Sequential ensemble of Parallel tree-building, LightGBM, XGBoost with
weak learners histogram construction nthread
Extra Trees Random splits instead of Multiple trees built in parallel sklearn, joblib
best split
TABLE IX
G REEDY CLASSIFIERS WITH LIMITED PARALLELIZATION POTENTIAL

Classifier Principle Parallelizable Components Tools


k-Nearest Neighbors Lazy learning, voting Parallel distance calculations FAISS, joblib, NumPy with
batching
Rule-based Classifiers Symbolic rule extraction Rule evaluation per instance Custom parallel rule evaluation
Clustering (k-means) Partitioning based on cen- Distance computation and cen- sklearn, MiniBatchKMeans
troids troid update
TABLE X
N ON - GRADIENT CLASSIFIERS WITH LIMITED BUT POSSIBLE MULTICORE USAGE

Technique Description Python Tools


Thread-based (shared Use multithreading when operations release the GIL joblib, OpenMP, NumPy
memory) or are implemented in C/C++ (MKL)
Process-based Use multiprocessing to spawn separate processes (no multiprocessing, joblib,
GIL) Ray
Vectorized operations Use NumPy/BLAS for internal matrix operations NumPy with MKL, SciPy
Parallel for-loops For looping over models or data partitions in ensem- joblib.Parallel,
ble/classification concurrent.futures
TABLE XI
M ULTICORE CPU PARALLELIZATION STRATEGIES

Working Principle Classifiers Multicore Parallelization Suit-


ability
Ensemble (Bagging) Random Forest, Bagging Classifier High (Excellent)
Linear Models Logistic Regression, Ridge, SVM (linear) Moderate (via BLAS/MKL)
Tree-Based Models CART, XGBoost, LightGBM Partial (parallel splits, histograms)
Neural Networks (CPU) MLPs, CNNs Moderate (MKL-accelerated)
Probabilistic Models Naive Bayes High (independent stats)
Distance-Based Models k-NN Moderate (distance computation
only)
TABLE XII
M ULTICORE PARALLELIZATION SUITABILITY BY CLASSIFIER TYPE

It is often faster than multiprocessing for memory-bound tasks.

VIII. CLASSIFIER CATEGORIZATION FOR MULTITHREADED PARALLELIZATION

Multithreading speedup depends on data size, algorithm structure, and the efficiency of the underlying BLAS implementation.

1. THREAD-FRIENDLY CLASSIFIERS (NATURALLY MULTITHREADED)
These classifiers can easily leverage multithreading, usually through internal C/C++ implementations with shared-memory concurrency. They are best suited for multicore CPUs with shared-memory access and lightweight thread scheduling.

2. SEMI-COMPATIBLE THREADED CLASSIFIERS
These classifiers are partially parallelizable—some parts can be multithreaded, but the algorithm's structure limits parallelism.
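Much of the benefit in both categories above comes from the BLAS/OpenMP thread pools inside the underlying libraries, which can be capped explicitly. The sketch below uses the threadpoolctl package (an assumed extra dependency, not mentioned in the paper) around a scikit-learn fit.

from threadpoolctl import threadpool_limits
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
clf = LogisticRegression(max_iter=200)

# Cap the BLAS/OpenMP pools used internally (e.g., by MKL or OpenBLAS) to 4 threads.
with threadpool_limits(limits=4):
    clf.fit(X, y)

print("training accuracy:", clf.score(X, y))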
Classifier Principle Parallelizable Components Tools
Random Forest Bootstrap aggregating decision Train trees in parallel threads sklearn (n jobs), joblib, OpenMP
trees
Gradient Boosted Trees Additive tree models with Split-finding, histogram building XGBoost, LightGBM, CatBoost
residual learning (nthread/num threads)
Logistic Regression Linear classification via sig- Matrix ops, gradient updates liblinear, OpenBLAS, Intel MKL
moid
Linear SVM Max-margin linear separator Matrix ops, dual optimization liblinear, libsvm with multithread-
ing
Naive Bayes Count-based probabilistic Feature-class likelihood computa- Vectorized in NumPy (MKL-
model tion enabled)
k-NN Distance-based voting Parallel distance computation sklearn, faiss, n jobs
Neural Networks Layer-wise matrix operations Forward/backward propagation PyTorch, TensorFlow with
(CPU) OpenMP/MKL
TABLE XIII
T HREAD - FRIENDLY CLASSIFIERS AND TOOLS SUPPORTING MULTITHREADING

Classifier Principle Parallelizable Components Limiting Factors


Decision Tree (CART) Recursive greedy splitting Parallelize candidate split search Splits chosen sequentially
SVM (Kernel) Dual optimization with kernels Gram matrix precomputation Kernel matrix size grows as O(n²)
Ridge / Lasso Regres- Regularized linear models Matrix algebra via BLAS Coordinate descent may be sequen-
sion tial
k-Means Clustering Centroid update based on dis- Parallel distance updates Synchronization after each itera-
tances tion
TABLE XIV
PARTIALLY MULTITHREADABLE CLASSIFIERS WITH ALGORITHMIC BOTTLENECKS

3. THREAD-UNFRIENDLY / LAZY OR SEQUENTIAL MODELS
These classifiers do not benefit significantly from multithreading due to lazy evaluation, sequential dependency, or low compute-to-memory ratio. These models are better suited to GPU or distributed frameworks when scalability is required.

MULTITHREADED PARALLELIZATION TECHNIQUES — SUMMARY TABLE
Classifier Principle Limitation
k-NN (large memory) Lazy learning with brute-force search Memory bandwidth becomes the bottleneck
Boosting (e.g., Sequential model updates Each learner depends on the previous model
AdaBoost)
Rule-based Models Apply symbolic rules Rule evaluation is inexpensive and mostly sequential
Gaussian Processes Full covariance matrix computation O(n³) scaling; kernel matrix is thread-unfriendly
TABLE XV
T HREAD - UNFRIENDLY CLASSIFIERS WITH LIMITED OR NO MULTITHREADING BENEFIT

Technique Description Example Libraries


OpenMP (C/C++) Parallel loops, commonly used in C/C++ libraries liblinear, libsvm, XGBoost
Intel MKL / OpenBLAS Multithreaded linear algebra backends NumPy, SciPy, scikit-learn
Python-level threading Limited by GIL, unless using C-extensions threading, joblib (thread backend)
Scikit-learn n jobs Threads or processes depending on backend Supported in many classifiers
TensorFlow / PyTorch Internal thread pools for CPU ops Controlled via thread environment
vars
TABLE XVI
C OMMON MULTITHREADED STRATEGIES AND LIBRARIES

Working Principle Classifiers Threading Suitability


Linear Models Logistic Regression, Ridge, Excellent (BLAS-powered)
SVM (linear)
Ensemble (Bagging/Boosting) Random Forest, XGBoost High/Moderate (Forests =
good, Boosting = limited)
Neural Models (CPU) MLP, CNN, RNN Excellent (PyTorch, Ten-
sorFlow threading)
Probabilistic Models Naive Bayes Good (independent feature
computations)
Lazy Models k-NN Moderate (memory-bound,
good with FAISS)
Kernel Methods SVM (RBF), Gaussian Pro- Moderate to Low (expen-
cesses sive kernel matrix)
Rule/Tree Models Decision Tree, Rule-based Depends (Trees: partially
classifiers parallel, Rules: mostly se-
rial)
TABLE XVII
T HREADING SUITABILITY BY CLASSIFIER TYPE AND COMPUTATIONAL STRUCTURE
TABLE XVIII
COMPARISON OF SPEEDUP AND MEAN SQUARED ERROR SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND REGRESSORS

Regressor | Speedup: MapReduce, Multicore, Multithreading, Gradient Descent | Mean Squared Error Speedup: MapReduce, Multicore, Multithreading, Gradient Descent
Linear Regression
DecisionTree Regressor
ExtraTreeRegressor
HistGradientBoostingRegressor
VotingRegressor
MLPRegressor
GradientBoostingRegressor
SVR
BaggingRegressor
GaussianProcessRegressor
Ridge
BayesianRidge
RidgeCV
LarsCV
LassoLarsCV
LassoLarsIC
LassoCV
ARDRegression
SGDRegressor
LinearSVR
HuberRegressor
ElasticNetCV
RANSACRegressor
CCA
TheilSenRegressor
DecisionTreeRegressor
AdaBoostRegressor
KernelRidge
Random Forest Regressor
ElasticNet
RadiusNeighborsRegressor
DummyRegressor
