Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality
Hao Li, Gopi Krishnan Rajbahadur, and Cor-Paul Bezemer
1 INTRODUCTION
The rapidly improving capabilities of Deep Learning (DL) and Machine Learning (ML) frameworks
have been the main drivers that allow new intelligent software applications, such as self-driving
cars [27, 61] and robotic surgeons [18, 77, 82]. These intelligent software systems all contain
components that integrate one or more complex DL and/or ML algorithms. Fortunately, over
the past decade, the need for coding these ML and DL algorithms from scratch has been largely
eliminated by the availability of several mature ML frameworks and tools such as TensorFlow [1]
and PyTorch [63]. These frameworks provide developers with a high-level interface to integrate
ML functionality into their projects. Using such ML frameworks has several advantages including
∗ Hao Li and Cor-Paul Bezemer are with the Analytics of Software, GAmes And Repository Data (ASGAARD) Lab, University
of Alberta, Canada.
Authors’ addresses: Hao Li, [email protected], University of Alberta, Edmonton, AB, Canada, T6G 2R3; Gopi Krishnan
Rajbahadur, Centre for Software Excellence, Huawei Canada, Kingston, ON, Canada, K7L 1H3, gopi.krishnan.rajbahadur1@
huawei.com; Cor-Paul Bezemer, [email protected], University of Alberta, Edmonton, AB, Canada, T6G 2R3.
1 https://githut.info
2 https://github.com/tensorflow/tfjs
3 As can be seen in this GitHub issue for TensorFlow: https://github.com/tensorflow/tensorflow/issues/55476
4 https://github.com/SciSharp/TensorFlow.NET/issues/991 and https://github.com/SciSharp/TensorFlow.NET/pull/1001
Therefore, in this paper we study the impact of bindings on two important ML software quality
aspects:
• Correctness: We evaluate if models trained using different bindings for a given ML frame-
work have the same accuracy. We study (1) training accuracy, which captures the model’s
classification performance on the train set during the training process, and (2) test accuracy,
which captures the classification performance of the final trained model on the test set. In
addition, we measure whether the test accuracy is the same after loading a pre-trained model
in a binding that was not used to train the model (the cross-binding test accuracy).
• Time cost: We evaluate if models trained using different bindings for an ML framework take
similar time for training and making inferences. Bindings that produce models with a high
time cost are expensive (in terms of computational resources), which limits their applicability.
We conducted model training and model inference experiments using bindings for TensorFlow
and PyTorch in C#, Rust, Python, and JavaScript. In the model training experiments, we trained
LeNet-1, LeNet-5, VGG-16, LSTM, GRU, and BERT models on the GPU in every binding (excluding
BERT, which is only trained with the Python bindings) using the same data and, as far as possible,
the same framework configuration. In the model inference experiments, we loaded pre-trained
models and performed inference using every binding on the CPU and GPU. We do so to address
the following research questions (RQs), with RQ1 and RQ2 focusing on correctness, and RQ3 and
RQ4 focusing on time cost:
RQ1. How do the studied bindings impact the training accuracy and test accuracy of the
studied DL models?
During the training process, bindings for the same ML framework can have different training
accuracies for the same model as well as varying test accuracy values (2% difference) in the
final trained models.
RQ2. How do the studied bindings impact the cross-binding test accuracy of pre-trained
models?
The cross-binding test accuracy of the pre-trained models was not impacted by the bindings.
RQ3. How do the studied bindings impact the training time of the studied DL models?
Non-default bindings can be faster than the default Python bindings for ML frameworks. For
instance, PyTorch’s Python binding has the slowest training time for the studied models;
PyTorch’s C# binding is more than two times faster than the Python binding in training the
LeNet-5 model.
RQ4. How do the studied bindings impact the inference time of pre-trained models?
Bindings can have very different inference times for the same pre-trained model, and the
inference time of certain bindings on CPU can be faster than that of other bindings on GPU.
For example, TensorFlow’s Rust binding can perform inference faster for an LSTM model on
CPU than the JavaScript binding on GPU (73.9 vs. 177.7 seconds).
The main contributions of our paper are as follows:
(1) We are the first to study the impact of using different bindings for ML frameworks on the ML
software quality in terms of correctness and time cost.
(2) We found that using a non-default binding can help improve ML software quality (from the
time cost perspective) compared to the default Python binding of the studied frameworks in
certain tasks, while still achieving the same level of correctness.
(3) We provide a replication package [48], which consists of the implementation of the studied ML
models in the studied bindings, scripts for running the experiments, and Jupyter Notebooks
for analyzing the experiment results.
Fig. 1. Bindings use the functionality of ML frameworks via foreign function interfaces (FFIs) to train models and perform model inference.
The remainder of this paper is outlined as follows. Section 2 provides background information.
Section 3 describes the design of our study. Sections 4 and 5 present the results. Section 6 discusses
the implications of our findings. Section 7 gives an overview of related work. Section 8 outlines
threats to the validity of our study and Section 9 concludes the paper.
2 BACKGROUND
2.1 ML Frameworks
Machine learning frameworks are software libraries that provide ML techniques to developers for
the development and deployment of ML systems. Most popular ML frameworks are supported by
large companies such as Google and Facebook [4]. As shown in Figure 1, an ML framework provides
interfaces to define the structure of a model, train the defined model using a selected optimizer,
and save the trained model for later use. In addition, developers can deploy the trained models to
the production environment by loading a saved (or pre-trained) model and performing inference.
ML frameworks can load a pre-trained model using (1) the model parameters (e.g., weights and
hyperparameters) or (2) serialization. If only the model parameters are saved, developers first have
to define the model structure before they can load the stored parameters into the defined model.
When loading a serialized model, the ML framework can recreate the model from the saved file
automatically since it contains both the structure and the weights of the pre-trained model.
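To make these two loading approaches concrete, the sketch below shows both of them in PyTorch's Python binding; the tiny Net module and the file names are illustrative assumptions rather than the models studied in this paper.

```python
# A minimal sketch of the two model-loading approaches (PyTorch's Python
# binding). The tiny "Net" module and file names are illustrative only.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)  # stand-in for a real model structure

    def forward(self, x):
        return self.fc(x)

model = Net()

# (1) Loading via model parameters: the structure must be defined first,
#     then the saved weights are loaded into it.
torch.save(model.state_dict(), "net_params.pt")        # saved after training
restored = Net()
restored.load_state_dict(torch.load("net_params.pt"))

# (2) Loading via serialization: the saved file contains both the structure
#     and the weights, so the framework recreates the model automatically.
#     (On recent PyTorch versions this may require weights_only=False.)
torch.save(model, "net_serialized.pt")
restored = torch.load("net_serialized.pt")
restored.eval()  # switch to inference mode before performing inference
```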
Modern ML frameworks, such as TensorFlow and PyTorch, have been built upon a foundation that
leverages parallel processing devices like GPUs. GPUs have proven to be highly efficient for tasks
that demand parallel computation, especially in the realm of ML. Their architecture is inherently
designed to handle multiple tasks simultaneously, allowing for massive parallelism. However, one
significant characteristic of GPU computations that needs emphasis is their asynchronous nature.
When a task is dispatched to a GPU, it does not always execute immediately. Instead, it often gets
scheduled in a queue.5 Consequently, a CPU might continue with its tasks believing that a GPU job
is complete when, in fact, it has not even started. This asynchronous behaviour allows GPUs to
optimize task execution but also necessitates careful synchronization when precise timing or task
ordering is crucial.
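As an illustration of this asynchrony, the following sketch (assuming a CUDA-capable GPU and PyTorch's Python binding; the matrix sizes are arbitrary) times a matrix multiplication once right after dispatch and once after an explicit synchronization; only the second number reflects the actual GPU work.

```python
# A sketch of why synchronization matters when timing GPU work (PyTorch's
# Python binding; assumes a CUDA device is available).
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

start = time.perf_counter()
y = x @ x                          # the kernel is queued, not necessarily done
dispatched = time.perf_counter() - start

torch.cuda.synchronize()           # block until all queued GPU work completes
completed = time.perf_counter() - start

print(f"until dispatch: {dispatched:.4f}s, until completion: {completed:.4f}s")
```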
5 https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
3 STUDY DESIGN
In this section, we first describe our experimental environment and the studied datasets, models,
ML frameworks, and bindings. Then, we discuss how we evaluate the correctness and time cost in
the model training and model inference experiments. Finally, we introduce the experimental setup
of our study. Figure 2 gives an overview of our study design.
Table 1. Our studied datasets and models. (Each model is paired with a dataset for the experiments)

Dataset     #Samples (Train / Test)∗     Model          #Parameters
MNIST       60,000 / 10,000              LeNet-1        4,326
MNIST       60,000 / 10,000              LeNet-5        61,706
CIFAR-10    50,000 / 10,000              VGG-16         33,650,890
IMDb        25,000 / 25,000              LSTM           4,665,537
IMDb        25,000 / 25,000              GRU            4,250,817
SQuAD       87,599 / 10,570              BERT (base)    108,893,186
∗ The split of the training and test set is provided by the dataset.
11 https://github.com/pytorch/pytorch/releases/tag/v0.1.1
Table 2. Studied bindings for TensorFlow and PyTorch in software package ecosystems.
have recently grown in popularity as Caffe2 was merged into PyTorch in 2018 and Keras became
“the high-level API of TensorFlow 2” [41].
Table 3. Supported features of studied bindings for TensorFlow (TF) and PyTorch (PT).
• Hyperparameters. We use the same hyperparameters (e.g., the number of epochs and batch
size) and optimizers from prior research [26]. However, TensorFlow’s C# binding does not
support setting the momentum and weight decay hyperparameters for a stochastic gradient
descent (SGD) optimizer. Hence, we only set the learning rate for the SGD optimizer without
enabling momentum and weight decay when training the LeNet-1, LeNet-5, and VGG-16
models to maintain consistency across all bindings. In addition, to mitigate the risk of default
hyperparameters influencing our results, we explicitly defined all configurable parameters
and kept them the same across bindings.
• Random seed. We fix the value of the random seed across bindings when training the same
model to control the randomness.
In addition, we repeat the same training process five times for each binding with different random
seeds (that are kept consistent across bindings) to reduce the impact of seed selection on the results.
Running example. We train the LeNet-1 model in TensorFlow’s Python, C#, and JavaScript
bindings. These bindings all set the same random seed at the start of the training process. To build
up the same convolution layers of the model, we use the “Conv2D” interface in Python, “Conv2D”
in C#, and the “conv2d” interface in JavaScript. In addition, we use SGD with a learning rate of 0.05
for all three bindings to train the LeNet-1 model.
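As a rough illustration of the Python-binding side of this running example, the sketch below fixes a seed, builds a small LeNet-style network with the “Conv2D” interface, and uses SGD with a learning rate of 0.05; the exact layer configuration is an assumption and not the paper's full LeNet-1 definition.

```python
# A minimal sketch of the Python-binding part of the running example; the
# layer sizes are illustrative, not the exact LeNet-1 architecture.
import tensorflow as tf

tf.random.set_seed(42)  # the same seed is set in the C# and JavaScript bindings

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, kernel_size=5, activation="tanh",
                           input_shape=(28, 28, 1)),   # the "Conv2D" interface
    tf.keras.layers.AveragePooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# SGD with learning rate 0.05, without momentum or weight decay.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```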
Step 2 – Record the training correctness and save the model: We record the training
accuracy in each epoch for all model training experiments. After the training is completed, we
compute the trained model’s test accuracy and save the model for later use. Considering the impact
of randomness, we repeat the training process 5 times in each training experiment and analyze the
distribution of the results to draw conclusions.
Running example. While training the LeNet-1 model in PyTorch’s C# binding, we calculate
the training accuracy in each epoch and store the value. After finishing the training, we save the
trained LeNet-1 model.
Step 3 – Perform inference using the trained models and record the inference cor-
rectness: For each model inference experiment, each binding loads a pre-trained model via the
supported model loading approach(es) (as shown in Table 3) and performs inference on the test
set on both CPU and GPU. In addition, bindings for the same ML framework perform inference
for the same pre-trained model. We select the pre-trained models (which are saved in Step 2) from
TensorFlow and PyTorch’s default Python bindings since the default bindings tend to have the best
support and maintenance [47].
Running example. In TensorFlow’s Rust binding, we load the pre-trained LeNet-1 model from
TensorFlow’s default Python binding via serialization to perform model inference on the test set
and record the cross-binding test accuracy.
Step 4 – Measure and record the training time cost: Our primary focus is on measuring the
time cost of the entire training process on GPU and recording it, as shown in Procedures 1 and 4.
Due to the asynchronous nature of GPU computations (as explained in Section 2), we only keep
the code directly related to the training process in this step to ensure accurate time measurements,
excluding activities like calculating correctness metrics in each epoch (which is included in Steps 1
and 2). We also do not include the time cost of initialization processes, such as model initialization,
optimizer initialization, and initial dataset loading.
Procedure 1 within PyTorch showcases its granular control over the training process. It initiates
by setting up the model and optimizer, loading the training dataset, and iterating through the
epochs for optimizing the model weights. For each epoch, the process starts with loading a batch
of the data. Following this, forward propagation is performed to produce outputs which are used
for calculating the loss values. Lastly, backward propagation is executed to calculate the gradients
which guide the optimizer for updating the model parameters. In contrast, as demonstrated in
Procedure 4, TensorFlow offers less granularity since it encapsulates the entire training process (i.e.,
batch data loading, forward propagation, and backward propagation) within a single function to
optimize performance.
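A sketch of what this difference in granularity might look like from the Python bindings is shown below; the model, data loader, and hyperparameters are illustrative stand-ins. In TensorFlow, the same steps would be wrapped in a single model.fit(...) call.

```python
# A sketch of the Procedure 1 style training loop in PyTorch's Python binding,
# with the subactivities called out as comments; names and hyperparameters are
# illustrative stand-ins.
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=0.05):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:              # batch data loading
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)                        # forward propagation
            loss = criterion(outputs, targets)             # loss calculation
            optimizer.zero_grad()
            loss.backward()                                # backward propagation
            optimizer.step()                               # parameter update
    return model
```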
As shown in Procedure 3, the granularity control in PyTorch is particularly helpful in measuring
time costs for specific subactivities using the “cuda.synchronize()” function to facilitate synchro-
nization between the CPU and GPU. The “cuda.synchronize()” function is only available in the
Python and Rust bindings. Procedure 3 starts a timer, runs a subactivity (e.g., forward propagation),
waits for the subactivity to finish using “cuda.synchronize()”, and then computes the elapsed time.
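A sketch of this timing pattern is shown below; the timed helper name is illustrative, and the pattern assumes a CUDA device so that cuda.synchronize() has work to wait for.

```python
# A sketch of the Procedure 3 pattern: start a timer, run a subactivity,
# synchronize with the GPU, then compute the elapsed time. "subactivity" can
# be any callable, e.g. a forward-propagation step; the helper is illustrative.
import time
import torch

def timed(subactivity, *args, **kwargs):
    start = time.perf_counter()
    result = subactivity(*args, **kwargs)
    torch.cuda.synchronize()       # wait until the queued GPU work has finished
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example (assuming "model" and "inputs" exist): outputs, t = timed(model, inputs)
```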
Running example. We train the LeNet-1 model with PyTorch’s Python binding and employ
Procedure 1 to record the training time cost. In addition, we rerun the training experiment utilizing
Procedure 1 with additional synchronization steps as described in Procedure 3 to capture accurate
time costs for individual subactivities.
Step 5 – Measure and record the inference time cost: Similar to Step 4, we measure and
record the time cost of the entire inference process on both CPU and GPU following Procedures 2
and 4. For measuring the time costs of inference subactivities (i.e., batch data loading and forward
propagation), we rerun the inference experiments employing Procedure 3, but only for PyTorch’s
Python and Rust bindings on GPU.
Running example. In PyTorch’s Python binding, we use Procedure 2 to determine the inference
time cost for the pre-trained LeNet-1 model. Furthermore, we rerun the inference experiment with
additional steps from Procedure 3 to separately record time costs for batch data loading and forward
propagation.
15 https://github.com/SciSharp/TensorFlow.NET/issues/640
Table 4. Mean/Max DTW distances of training accuracy curves for bindings in training models with the same
random seed. (Highlighted numbers indicate negligible DTW distance. Py: Python; JS: JavaScript; Rs: Rust)
4 CORRECTNESS EVALUATION
Motivation. Developers can use a binding for an ML framework in their preferred programming
language to train a DL model. We want to observe if the DL models trained using a binding for
a given ML framework have the same training accuracy as the DL models trained using the ML
framework’s default Python binding (RQ1). These results can help developers understand if using
a binding will achieve the same model accuracy during training and provide the same model
performance for the final trained models.
In addition, it is important to ascertain if performing inference for these trained models using
different bindings for a given framework will impact the accuracy. Pre-trained models have been
widely used by the ML community [29, 85] and bindings can help developers to run inference with
pre-trained models in different programming languages. Importantly, in high-stakes domains such
as medical diagnosis and autonomous driving, accuracy is particularly important when decisions
are made by ML systems [62]. Even a slight drop in accuracy can trigger erroneous decisions with
serious implications. Hence, it is vital that bindings have the capability to achieve the same accuracy
for pre-trained models as with the binding they were trained with. In RQ2, we investigate the
cross-binding test accuracy of pre-trained models using the bindings for TensorFlow and PyTorch
to understand whether the pre-trained models perform as we would expect them to.
Together, the bindings’ impact on training correctness and inference correctness will enable us
to understand the impact on the correctness of the ML software quality.
RQ1: How do the studied bindings impact the training accuracy and test accuracy of the
studied DL models?
Approach. We employ both dynamic time warping (DTW) [72] for analyzing training accuracy
curves and the Mann-Whitney U test [56] for comparing the performance metrics of the final trained
models. We chose DTW due to its ability to analyze time-series data, which allows us to investigate
whether different bindings follow the same trajectory during training. DTW calculates the distance
between the training accuracy curves of the bindings (e.g., between TensorFlow’s Python and C#
binding) for training the same model. DTW is widely used as a distance measurement for time
series data since it can manage time distortion by aligning two time series before computing the
distance, which is more accurate than the Euclidean distance [15]. We normalize the calculated
DTW distances between 0 to 1 to interpret the results. A normalized DTW distance of 0 means that
the difference between the two curves is negligible.
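For illustration, the sketch below computes a DTW distance between two short accuracy curves with the standard O(n·m) dynamic program; the example curves and the normalization by curve length are our own illustrative assumptions, since the paper's exact normalization is not reproduced here.

```python
# A sketch of comparing two training-accuracy curves with DTW; the curves and
# the length-based normalization are illustrative assumptions.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

curve_a = np.array([0.62, 0.81, 0.90, 0.95, 0.97])  # e.g. Python binding
curve_b = np.array([0.55, 0.74, 0.88, 0.94, 0.96])  # e.g. C# binding
raw = dtw_distance(curve_a, curve_b)
normalized = raw / max(len(curve_a), len(curve_b))  # scale into roughly [0, 1]
print(f"raw DTW = {raw:.3f}, normalized = {normalized:.3f}")
```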
In addition, we calculate the test accuracy, F1-score, and AUC-ROC for the final trained models
to compare their classification performance. For each metric, we perform the Mann-Whitney U
test [56] separately at a significance level of 𝛼 = 0.05 to determine if the values obtained from
Fig. 3. Mean training accuracy curves of LeNet-1, LeNet-5, VGG-16, LSTM, and GRU on GPU in bindings for TensorFlow (first row) and PyTorch (second row).
different bindings are significantly different. We computed Cliff’s delta 𝑑 [53] effect size to quantify
the difference based on the following thresholds [70]:
\[
\text{Effect size} =
\begin{cases}
\text{negligible}, & \text{if } |d| \leq 0.147\\
\text{small}, & \text{if } 0.147 < |d| \leq 0.33\\
\text{medium}, & \text{if } 0.33 < |d| \leq 0.474\\
\text{large}, & \text{if } 0.474 < |d| \leq 1
\end{cases}
\tag{1}
\]
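A sketch of this comparison is given below; the two accuracy samples are made-up placeholders, and the Cliff's delta helper implements the thresholds of Equation (1).

```python
# A sketch of the statistical comparison: Mann-Whitney U test at alpha = 0.05
# plus Cliff's delta with the thresholds of Equation (1). The accuracy samples
# are illustrative placeholders, not the paper's measurements.
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def magnitude(d):
    d = abs(d)
    if d <= 0.147:
        return "negligible"
    if d <= 0.33:
        return "small"
    if d <= 0.474:
        return "medium"
    return "large"

acc_binding_a = [84.8, 84.9, 84.7, 85.0, 84.6]  # five repeated training runs
acc_binding_b = [83.8, 83.7, 83.9, 83.6, 84.0]
_, p = mannwhitneyu(acc_binding_a, acc_binding_b)
d = cliffs_delta(acc_binding_a, acc_binding_b)
print(f"p = {p:.3f}, Cliff's delta = {d:.2f} ({magnitude(d)})")
```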
Findings. Bindings can have different training accuracy curves when training DL models
under the same configuration (i.e., model structure, training data, hyperparameters, and
random seed). Table 4 reports the mean and maximum DTW distances for the training curves
between bindings. Moreover, Figure 3 presents the mean training accuracy curves of the models (out
of the five training processes) that have the best test accuracy after the last epoch. The figure
and table show that bindings can have quite different training accuracy curves according to the
DTW distance when using the same training configuration. For example, the distances between
the curves of TensorFlow’s C# binding and the other two bindings are relatively large for LeNet-1,
LeNet-5, and VGG-16 models. Another example is that all PyTorch bindings have a relatively large
distance between the curves for the RNN models compared to the distances in the CNN models. One
reason could be the differential numerical precision across programming languages. For example,
Python supports arbitrary-precision arithmetic, while languages like Rust and C# typically operate
with fixed precision. These variations in numerical precision might spawn minor differences in
mathematical computation outputs. These minor differences might accumulate over numerous
iterations during model training, resulting in variations in the final model accuracy. In contrast,
bindings can exhibit nearly the same behaviour for training some DL models; the training accuracy
curves of the LeNet models differ negligibly between TensorFlow’s Python and JavaScript bindings,
as well as between PyTorch’s bindings.
The trained models produced by certain bindings can perform worse than the models
produced by other bindings for the same ML framework. Table 5 shows that the test accuracy,
F1-score, and AUC-ROC of the trained models produced by different bindings can differ. For the
trained VGG-16 models, the Mann-Whitney U test reveals significant differences between bindings
for both frameworks in these metrics with large effect sizes. This pattern is also observed in the
trained GRU models in PyTorch’s bindings. Specifically, while the test accuracy and F1-score of the
trained LeNet-1 models have statistically significant differences between bindings for TensorFlow,
the AUC-ROC values of LeNet models in TensorFlow and PyTorch bindings are close (all rounded
Table 5. The average test accuracy (Acc), F1-score (F1), and AUC-ROC (AUC) for TensorFlow and PyTorch
bindings. (Statistically significant differences between bindings are highlighted in bold. Py: Python; JS:
JavaScript; Rs: Rust; MD: Max Diff; ES: Effect Size)
TensorFlow:
              LN1     LN5     VGG     LSTM    GRU
Acc   Py      98.8    98.9    84.8    83.7    85.0
      C#      98.6    98.9    83.8    -       -
      JS      98.8    99.0    85.6    84.2    84.7
      MD      0.2     0.1     1.9     0.6     0.3
      p       0.01    0.40    0.01    0.10    0.15
      ES      large   -       large   -       -
F1    Py      98.8    98.9    84.7    83.5    85.0
      C#      98.6    98.9    83.8    -       -
      JS      98.8    99.0    85.6    83.8    84.7
      MD      0.2     0.1     1.9     0.3     0.3
      p       0.01    0.42    0.01    0.22    0.15
      ES      large   -       large   -       -
AUC   Py      100.0   100.0   98.2    91.7    92.3
      C#      100.0   100.0   97.3    -       -
      JS      100.0   100.0   98.4    92.3    91.9
      MD      0.0     0.0     1.1     0.6     0.5
      p       0.10    0.84    0.01    0.01    0.01
      ES      -       -       large   large   large

PyTorch:
              LN1     LN5     VGG     LSTM    GRU
Acc   Py      98.8    98.9    86.2    86.5    87.9
      C#      98.8    99.0    86.2    87.3    85.5
      Rs      98.8    98.9    85.6    87.4    87.0
      MD      0.0     0.1     0.6     0.8     2.5
      p       0.68    0.31    0.03    0.10    0.01
      ES      -       -       large   -       large
F1    Py      98.8    99.0    86.3    86.7    87.9
      C#      98.8    99.0    86.1    87.2    85.1
      Rs      98.9    98.9    85.6    87.2    86.9
      MD      0.1     0.1     0.7     0.5     2.8
      p       0.06    0.01    0.01    0.10    0.01
      ES      -       large   large   -       large
AUC   Py      100.0   100.0   98.5    94.1    94.3
      C#      100.0   100.0   98.5    94.6    92.9
      Rs      100.0   100.0   98.3    94.5    93.8
      MD      0.0     0.0     0.2     0.5     1.5
      p       0.55    0.42    0.01    0.10    0.01
      ES      -       -       large   -       large
up to 100 in Table 5). Furthermore, we observed that some models produced by non-Python bindings
have higher metric values than the models produced by the default Python bindings, e.g.,
the VGG-16 model produced by TensorFlow’s JavaScript binding.
Summary of RQ1
TensorFlow and PyTorch bindings can have different training accuracy curves for training
the same DL models even when using the same configuration. In addition, the test accuracy
of the final trained models can be slightly different. Hence, developers should not assume
that all bindings offer the same level of correctness and should verify the model’s correctness
when utilizing a binding for training.
RQ2: How do the studied bindings impact the cross-binding test accuracy of pre-trained
models?
Approach. We conducted inference experiments with all bindings using pre-trained models pro-
duced by the default Python bindings for TensorFlow and PyTorch (see Figure 4). We loaded the
pre-trained models using the supported loading approach(es) and recorded the cross-binding test
accuracy on both CPU and GPU for each binding. If the cross-binding test accuracy of a pre-trained
model in a binding shows a 0% difference compared to the test accuracy when the model was initially
trained, we considered the test accuracy “reproduced” by that binding. Any non-zero difference
resulted in a “failed” mark. Since some bindings only support one way of loading models (as shown
in Table 3), we marked the result as “unsupported” if the loading approach is not supported by a
binding.
Fig. 4. All bindings load the trained models that are saved by the default Python bindings for ML frameworks.
Fig. 5. Results of reproducing the test accuracy of pre-trained models in TensorFlow and PyTorch bindings on the CPU and GPU (the results are identical). Note: the failed cases in PyTorch’s C# binding were fixed in a newer version of the binding.
Findings. The test accuracy of pre-trained models can be reproduced across bindings
in different languages for the same ML framework. Figure 5 shows that only PyTorch’s
C# binding failed to reproduce the test accuracy in the saved VGG-16, LSTM, and GRU models.
We noticed that the differences in the test accuracy in these three models are all within 1% and
the root cause of the reproduction failure is a bug that results in “eval() and train() methods not
being properly propagated to all submodules”.16 This bug prevents setting the model to evaluation
mode, hence, the dropout layers of these three models are not disabled which leads to different
cross-binding test accuracy. This bug is fixed in version 0.96.0 which does not support PyTorch 1.9.0
16 See https://github.com/dotnet/TorchSharp/pull/501 and https://github.com/dotnet/TorchSharp/issues/500
but targets version 1.10.0. In other words, the saved models can be reproduced in the newer version
of PyTorch’s C# binding. For consistency, we still use the 0.93.9 version of this binding for the other
experiments.
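The effect of this bug can be illustrated with a small sketch: unless eval() reaches every submodule, dropout remains active during inference and the outputs (and therefore the cross-binding test accuracy) become non-deterministic. The tiny module below is illustrative.

```python
# A sketch of why propagating eval() matters: with dropout still in training
# mode, repeated inference on the same input produces different outputs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.ones(1, 8)

model.train()   # dropout enabled: the two outputs below generally differ
print(model(x))
print(model(x))

model.eval()    # dropout disabled: outputs are deterministic for the same input
print(model(x))
print(model(x))
```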
Bindings can reproduce the test accuracy of pre-trained models via different loading
approaches and on different types of processing units (i.e., CPU and GPU). As shown in
Figure 5, PyTorch’s Python and Rust bindings and TensorFlow’s Python binding support both
loading via parameters and serialization, and both loading approaches can reproduce the test
accuracy of the pre-trained models. In addition, we noticed that bindings can reproduce the test
accuracy of pre-trained models on both CPU and GPU.
Summary of RQ2
TensorFlow and PyTorch bindings can perform inference using pre-trained models and
reproduce the same test accuracy as when the models were originally trained. This cor-
rectness property holds true whether model inference is performed on CPU or GPU. As a
result, developers can leverage the capabilities of pre-trained models while still being able
to use the model in their preferred language.
RQ3: How do the studied bindings impact the training time of the studied DL models?
Approach. To study the difference in training time across bindings, we performed the Mann-
Whitney U test [56] using the Bonferroni correction [74] to adjust the significance level for multiple
comparisons. Specifically, for an initial significance level of 𝛼 = 0.05, we adjusted the significance
level to 𝛼/𝑛 (where 𝑛 is the number of comparisons made) to determine whether the distributions of
the training times of the default Python bindings and the non-Python bindings, which trained the
same model for the same framework, are significantly different. For example, for the LeNet-1 model
in the TensorFlow bindings, we performed the Bonferroni-corrected Mann-Whitney U test between the
Python and C# bindings and between the Python and JavaScript bindings with an adjusted significance level
of 𝛼/2 = 0.025. We also computed Cliff’s delta 𝑑 [53] effect size to quantify the difference based on
Equation 1 in Section 4.
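A sketch of this adjustment is shown below; with two non-Python bindings compared against the Python binding, the per-comparison level becomes 0.05/2 = 0.025. The training-time samples are illustrative placeholders.

```python
# A sketch of the Bonferroni-corrected Mann-Whitney U test; the timing samples
# (seconds per training run) are illustrative placeholders.
from scipy.stats import mannwhitneyu

alpha = 0.05
python_times = [512, 509, 515, 511, 510]
comparisons = {
    "Python vs C#": [402, 399, 405, 401, 403],
    "Python vs JavaScript": [955, 962, 948, 957, 960],
}
adjusted_alpha = alpha / len(comparisons)   # 0.05 / 2 = 0.025

for name, other_times in comparisons.items():
    _, p = mannwhitneyu(python_times, other_times)
    print(f"{name}: p = {p:.4f}, significant = {p < adjusted_alpha}")
```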
Findings. Training times can differ greatly across bindings for the same ML framework.
Figure 6 shows the training time distributions on GPU for the studied models across the studied
bindings.
Fig. 6. Training time distributions (in seconds) on GPU for the LeNet-1, LeNet-5, VGG-16, LSTM, and GRU models in TensorFlow and PyTorch bindings (Python, C#, Rust, JavaScript).
The Bonferroni-corrected Mann-Whitney U test shows that the training time distributions
of the same model are all significantly different between the default Python bindings and the other
bindings for the same framework and the effect sizes are all large. In addition, the difference in
training time of bindings for the same ML framework can be very large when training certain
models. For example, the median training time of TensorFlow’s JavaScript binding for the VGG-16
model is 15 times larger than its Python binding (32,783 vs. 1,991 seconds).
PyTorch’s default Python binding has the slowest training time for the studied models.
Figure 6 shows that PyTorch’s Python binding is more than two times slower than the other two
bindings for training LeNet models. However, we note that the training time difference between
PyTorch’s Python binding and other bindings for the VGG-16, LSTM, and GRU models is relatively
small (less than 15%). In contrast, TensorFlow’s default Python binding has the fastest training time
in the studied models.
Batch data loading time affects the training cost of PyTorch’s Python binding. As shown
in Table 6, PyTorch’s Python binding has a long batch data loading time, which is notably slower
(between 4 and 14 times) than the Rust binding for all studied models. Specifically, for LeNet models,
the Python binding’s batch data loading times account for roughly 30% of the training cost, whereas
the Rust binding’s batch data loading for the same models consumes less than 10% of the training
cost. Furthermore, the Python binding consistently underperforms the Rust binding during both
forward and backward propagation phases in the studied models.
The observed variations in batch data loading times between bindings suggest that the native
speed of a programming language [59, 64, 66] is an important factor that influences the perfor-
mance of a binding. However, there could be other factors involved in the implementation of
Table 6. Time costs (in seconds) of the subactivities in the training process using PyTorch’s Python and Rust
bindings on GPU.
bindings. For example, these factors could include overheads arising from differences in data struc-
ture implementations and initialization routines. Additionally, the overhead of the marshalling
mechanism [6, 16, 89] implemented to convert data between the binding’s programming language
and the ML framework could impact efficiency. Finally, the way the binding interacts with the ML
framework’s lower-level APIs, such as those for memory management and tensor operations, could
also play a crucial role in performance differences.
Summary of RQ3
Training times for training the same DL models differ significantly between the default
Python bindings and the non-Python bindings for the same ML framework. Surprisingly,
non-Python bindings for PyTorch are even faster in training the studied models than the
default Python binding. Hence, choosing the right binding can help developers to lower the
training time cost for certain models.
RQ4: How do the studied bindings impact the inference time of pre-trained models?
Approach. We followed the same process as shown in Figure 4 and investigated the inference
time of each model on both CPU and GPU. We performed the Bonferroni-corrected Mann-Whitney
U test on the recorded inference time distributions between the default Python bindings and the
non-Python bindings, grouped by the same framework, model, and processing unit (CPU or GPU).
We also computed Cliff’s Delta effect size as described in RQ3.
Findings. The inference time of the same pre-trained model differs greatly between
the default Python bindings and the other bindings for the same ML framework. Figure 7
shows the distributions of the inference time of the pre-trained models in the studied bindings. The
results of the Bonferroni-corrected Mann-Whitney U test and Cliff’s Delta 𝑑 show that the Python
and non-Python bindings for the same ML framework have significantly different inference times
for the same model on the same processing unit (i.e., CPU and GPU) and the effect size is large,
except for the TensorFlow bindings for LSTM on CPU and for BERT on GPU where the Python
binding has similar inference time costs as the Rust binding. We observed that the default Python
bindings for TensorFlow and PyTorch do not always offer the best inference time for all studied
pre-trained models, with Rust bindings often outperforming them.
Fig. 7. Inference time distributions for pre-trained models in TensorFlow (TF) and PyTorch (PT) bindings on the CPU and GPU.
On the other hand, TensorFlow’s
C# binding has the worst performance for the studied models on both CPU and GPU, and PyTorch’s
JavaScript binding has the worst performance on CPU. Moreover, the performance gap in model
inference time can be very large, for example, TensorFlow’s Python binding is 17 times as fast as
the JavaScript binding for the GRU model on the GPU (3.35 vs. 58.32 seconds).
Inference time differences in PyTorch arise from both batch data loading and forward
propagation speed. Table 7 shows that the majority of the inference cost is allocated towards
forward propagation and the Rust binding outperforms the Python binding in this regard. As we
observed the same pattern in RQ3, the Rust binding also demonstrates faster batch data loading
times compared to the Python binding across all studied models. Although both bindings leverage
PyTorch’s computational core, which is written in C/C++ and predominantly runs computations
on GPUs, the variations in time costs can be attributed to overheads introduced by the bindings
themselves.
Table 7. Time costs (in seconds) of the subactivities in the inference process using PyTorch’s Python and Rust
bindings on GPU.
Certain bindings on the CPU may have a faster inference time than other bindings on
the GPU for the same pre-trained model. Generally, inference time for pre-trained models on
GPU outperforms CPU in bindings for both studied frameworks (as shown in Figure 7). However,
we found that for the same framework, one binding that runs inference on CPU can outperform
another binding that runs on GPU for the same pre-trained model. For example, the Rust binding
for TensorFlow is faster on CPU than the C# binding on GPU for LeNet and VGG-16 models, as
well as faster on CPU than the JavaScript binding on GPU for the GRU model. Furthermore, we noticed
that model inference with TensorFlow’s C# binding on CPU is similar in speed to or even faster than on
GPU. According to the maintainer of the C# binding, the reason could be that “there is I/O cost
underlying”17 model inference on GPU.
Certain bindings lack support for certain features, which leads to a slower inference time.
We noticed that TensorFlow’s JavaScript binding cannot load a GRU model with “reset_after=True”,18
either by loading parameters or through serialization. However, “reset_after=True” is the default
setting in the framework (and other bindings) to enable the “fast cuDNN implementation”, which
speeds up the inference of the GRU model.19 This unsupported feature can be one of the reasons
behind the large increase of GRU inference time in TensorFlow’s JavaScript binding (256.5 seconds)
compared to the inference time of the default Python binding (3.6 seconds).
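For reference, the sketch below shows the setting in question in TensorFlow's Python binding; reset_after=True is the tf.keras default and is required for the cuDNN-backed implementation, while reset_after=False would sidestep the option the JavaScript binding could not load. The layer size is illustrative.

```python
# A sketch of the GRU option discussed above (TensorFlow's Python binding);
# the number of units is illustrative.
import tensorflow as tf

gru_cudnn_compatible = tf.keras.layers.GRU(128, reset_after=True)   # default
gru_fallback = tf.keras.layers.GRU(128, reset_after=False)          # avoids the option
```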
Summary of RQ4
TensorFlow and PyTorch bindings have various inference times for the same pre-trained
models on CPU and GPU. Remarkably, the inference time of certain models in bindings
on the CPU can be faster than other bindings for the same framework on GPU. Therefore,
developers can experiment and choose the fastest binding for their usage scenario.
17 https://github.com/SciSharp/TensorFlow.NET/issues/876
18 https://github.com/tensorflow/tfjs/issues/4621
19 https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU
6 IMPLICATIONS
6.1 Implications for developers
Developers are not limited to writing their projects in Python when using an ML frame-
work. Although Python dominates the development in ML [4, 60], developers can also use bindings
in other programming languages. Our results in Section 4 show that non-default bindings for
TensorFlow and PyTorch can achieve the same inference accuracy for a pre-trained model as the default
Python binding, and sometimes even faster performance. We recommend developers use the binding
in their preferred programming language for either model training or inference if supported by the
binding. Hence, developers can save time and effort when adopting ML techniques in their projects
without having to settle for non-mature ML frameworks that might be available in the language
that their current software is programmed in. For instance, in Integration Scenario 1 of Section 1,
Anna can use the JavaScript binding to perform inference with pre-trained models provided by the
ML team.
Developers can use a binding for an ML framework which has a shorter training time
for a certain model and perform inference on the trained model in another binding
that has a shorter inference time based on task and requirements. Bindings for an ML
framework have various training times and inference times for ML models (Section 5). Hence,
developers can choose different bindings which are faster for a certain model in training and
inference respectively since the accuracy of pre-trained models can be reproduced across bindings
for the same framework (Section 4). We suggest that developers refer to an existing benchmark
like ours or conduct experiments themselves based on our replication package [48]. For example,
when using TensorFlow for LeNet models as described in Integration Scenario 3 of Section 1, Anna
can train the models using the default Python binding for TensorFlow and then run inference
for the trained model in the Rust binding with the assistance of a hired expert to save time and
computational resources, as this factor is critical in their project requirements.
Developers should perform a sanity check before using a model that was trained by a
binding other than the default Python binding. Bindings corresponding to different languages
can have different training accuracy curves while training the same model, and the final trained
model can behave differently (as discussed in Section 4). Since the Python bindings are the default
binding for most ML frameworks, these Python bindings have a larger user base and better support
than other bindings. We suggest that developers perform a sanity check on the trained model if
they are using a binding other than the default Python binding before deploying the models to the
production environment.
In resource-limited scenarios (e.g., CPU only), developers may prefer or need to use
a non-default binding for model inference. Traditionally, model inference is done using a
GPU due to the superior inference time of GPUs [7, 46]. However, GPUs are expensive and not
available in all scenarios. We found that the bindings for ML frameworks can be fast for running
inference on CPU for some pre-trained models (Section 5). Developers can use such bindings if
the production environment does not contain a GPU or the computational resource is limited. For
example, in Integration Scenario 3 of Section 1, if Anna is using PyTorch for LeNet models and
there is no GPU available in the production environment, she can use PyTorch’s Rust binding on
CPU with expert assistance. The inference time of LeNet models in the Rust binding on CPU is
faster than the default Python binding both on CPU and GPU. This is particularly beneficial for
constrained environments like the Internet-of-Things (IoT) devices (e.g., unmanned aerial vehicles)
where resource availability is often limited [3, 20, 40, 71].
7 RELATED WORK
7.1 Impact of ML frameworks on ML software correctness
Researchers have studied the correctness of ML frameworks. However, no one has studied how
bindings for those frameworks impact the correctness of the ML software that is created with them.
The study by Guo et al. [26] is the most closely related to our work. However, even though they included
several bindings in their study, their work differs from ours as they focus on the impact on ML
20 https://huggingface.co/docs/diffusers/en/training/lora
software quality of using different ML frameworks and executing ML models on different computing
devices (such as PC and various types of mobile devices). In contrast, we run our experiments on
the same device but we study the impact of various bindings on ML software quality. Hence, we
can reason about the impact of using a binding, while in Guo et al.’s study, the different devices
make this impossible.
Several others have focused on comparing the accuracy of the same model across ML frameworks.
Chirodea et al. [10] compared a CNN model that was built with TensorFlow and PyTorch and found
that these two frameworks have similar training curves but the final trained model has a lower
accuracy in PyTorch. Gevorkyan et al. [22] gave an overview of five ML frameworks and compared
the accuracy of training a neural network for the MNIST dataset. They reported that the final trained
model has a lower accuracy in TensorFlow than in other frameworks. Moreover, Elshawi et al. [17]
conducted training experiments for six ML frameworks by using the default configuration and
reported that certain frameworks have better performance than the other frameworks on the same
model (e.g., Chainer on the LSTM model).
highlighting potential threats to software correctness. Meanwhile, Grimmer [24] explored high-
performance language interoperability in multi-language runtimes. Their approach leveraged
just-in-time (JIT) compilers to optimize across language borders, enhancing the efficiency of
cross-language operations.
To the best of our knowledge, our study is the first to systematically investigate the impact of
using different language bindings on ML software quality. While Ravitch et al. [69] touched upon
type correctness in bindings, the unique challenges posed by the inherently non-deterministic
nature of ML software remain under-explored. Our work stands out as we specifically evaluate
the impact of bindings on the correctness of ML software for model training and inference across
different languages. In addition, the computationally intensive nature of ML software introduces
unique challenges when assessing time costs, especially when relying on GPUs. While time cost
is a widely used metric in the domain of FFIs and bindings, existing works do not explore its
significance within the context of ML frameworks. Our research actively fills this void, presenting
a comprehensive analysis of time costs associated with different bindings in ML software on CPUs
and GPUs.
8 THREATS TO VALIDITY
8.1 Construct validity
We use the accuracy metric to assess the correctness of TensorFlow and PyTorch bindings on model
training and inference since it is a widely used metric among researchers and developers [10, 17,
22, 26, 54]. However, other metrics may also be used to assess correctness, and the use of other metrics
could potentially change our results. For evaluating the time cost of bindings on model training,
we ran training experiments on the GPU since training DL models on CPU is time-consuming and
developers usually train DL models on GPU. The results might be different from those obtained by
measuring the time cost on CPU.
investigation may not be able to generalize to other models and datasets. Future studies should
leverage our methodology to analyze bindings for other ML frameworks using different models
and datasets.
Our analysis focused on small to medium-sized models that are widely adopted in real-world
applications. However, the implications for large-scale models, particularly frontier ML models
with billions or trillions of parameters, require further investigation. Future research should build
on our work to examine how the observed differences might persist or change at this extreme scale.
9 CONCLUSION
In this paper, we investigate the impact on ML software quality (correctness and time cost) of
using bindings for ML frameworks for DL model training and inference. We conducted model
training and model inference experiments on three CNN-based models and two RNN-based models
in TensorFlow and PyTorch bindings written in four different programming languages. The most
important findings of our study are:
• When training models, bindings for ML frameworks can have various training accuracy
curves and slightly different test accuracy values for the trained models.
• Bindings have different training times for the same model, and the default Python bindings
for ML frameworks may not have the fastest training time.
• Bindings for ML frameworks have the capabilities to reproduce the accuracy of pre-trained
models for inference.
• Bindings for ML frameworks have different inference times for the same pre-trained model
and certain models in bindings on the CPU can outperform other bindings on the GPU.
Our findings show that developers can utilize a binding to speed up the training time for an ML
model. For pre-trained models, developers can perform inference in their favoured programming
language without sacrificing accuracy, or they can choose a binding that has better inference time.
DISCLAIMER
Any opinions, findings, and conclusions, or recommendations expressed in this material are those
of the author(s) and do not reflect the views of Huawei.
ACKNOWLEDGMENTS
The work described in this paper has been supported by the ECE-Huawei Research Initiative (HERI)
at the University of Alberta.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat,
Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray,
Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016.
TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating
Systems Design and Implementation (Savannah, GA, USA) (OSDI ’16). USENIX Association, USA, 265–283.
[2] Saheed Salahudeen Abdullahi, Sun Yiming, Shamsuddeen Hassan Muhammad, Abdulrasheed Mustapha, Ah-
mad Muhammad Aminu, Abdulkadir Abdullahi, Musa Bello, and Saminu Mohammad Aliyu. 2021. Deep Sequence
Models for Text Classification Tasks. In International Conference on Electrical, Communication, and Computer Engineering
(ICECCE). 1–6. https://doi.org/10.1109/ICECCE52056.2021.9514261
[3] Andrea Albanese, Matteo Nardello, and Davide Brunelli. 2022. Low-power deep learning edge computing platform for
resource constrained lightweight compact UAVs. Sustainable Computing: Informatics and Systems 34 (2022), 100725.
https://doi.org/10.1016/j.suscom.2022.100725
[4] Houssem Ben Braiek, Foutse Khomh, and Bram Adams. 2018. The Open-Closed Principle of Modern Machine Learning
Frameworks. In Proceedings of the 15th International Conference on Mining Software Repositories (Gothenburg, Sweden)
(MSR ’18). Association for Computing Machinery, New York, NY, USA, 353–363. https://doi.org/10.1145/3196398.3196445
[5] Hudson Borges, Andre Hora, and Marco Tulio Valente. 2016. Understanding the Factors That Impact the Popularity of
GitHub Repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). 334–344.
https://doi.org/10.1109/ICSME.2016.31
[6] Camillo Bruni, Stéphane Ducasse, Igor Stasenko, and Luc Fabresse. 2013. Language-side foreign function interfaces
with nativeboost. In International Workshop on Smalltalk Technologies.
[7] Ebubekir Buber and Banu Diri. 2018. Performance Analysis and CPU vs GPU Comparison for Deep Learning. In 2018
6th International Conference on Control Engineering Information Technology (CEIT). 1–6. https://doi.org/10.1109/CEIT.2018.8751930
[8] Boyuan Chen, Mingzhi Wen, Yong Shi, Dayi Lin, Gopi Krishnan Rajbahadur, and Zhen Ming (Jack) Jiang. 2022.
Towards Training Reproducible Deep Learning Models. In Proceedings of the 44th International Conference on Software
Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 2202–
2214. https://doi.org/10.1145/3510003.3510163
[9] Junjie Chen, Yihua Liang, Qingchao Shen, Jiajun Jiang, and Shuochuan Li. 2023. Toward Understanding Deep
Learning Framework Bugs. ACM Trans. Softw. Eng. Methodol. 32, 6, Article 135 (Sept. 2023), 31 pages. https://doi.org/10.1145/3587155
[10] Mihai Cristian Chirodea, Ovidiu Constantin Novac, Cornelia Mihaela Novac, Nicu Bizon, Mihai Oproescu, and
Cornelia Emilia Gordan. 2021. Comparison of Tensorflow and PyTorch in Convolutional Neural Network-based
Applications. In 2021 13th International Conference on Electronics, Computers and Artificial Intelligence (ECAI). 1–6.
https://doi.org/10.1109/ECAI52376.2021.9515098
[11] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural
Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics
and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012
[12] Agnieszka Ciborowska and Kostadin Damevski. 2022. Fast Changeset-Based Bug Localization with BERT. In Proceedings
of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for
Computing Machinery, New York, NY, USA, 946–957. https://doi.org/10.1145/3510003.3510042
[13] Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural
Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning (Helsinki,
Finland) (ICML ’08). Association for Computing Machinery, New York, NY, USA, 160–167. https://doi.org/10.1145/1390156.1390177
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association
for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[15] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. 2008. Querying and Mining of
Time Series Data: Experimental Comparison of Representations and Distance Measures. Proc. VLDB Endow. 1, 2 (Aug.
2008), 1542–1552. https://doi.org/10.14778/1454159.1454226
[16] Anton Ekblad. 2015. Foreign Exchange at Low, Low Rates: A Lightweight FFI for Web-Targeting Haskell Dialects. In Proceedings of the 27th Symposium on the Implementation and Application of Functional Programming Languages (Koblenz, Germany) (IFL ’15). Association for Computing Machinery, New York, NY, USA, Article 2, 13 pages. https://doi.org/10.1145/2897336.2897338
[17] Radwa Elshawi, Abdul Wahab, Ahmed Barnawi, and Sherif Sakr. 2021. DLBench: a comprehensive experimental evaluation of deep learning frameworks. Cluster Computing 24, 3 (Sept. 2021), 2017–2038. https://doi.org/10.1007/s10586-021-03240-4
[18] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire
Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare. Nature medicine 25, 1
(2019), 24–29.
[19] Hongbo Fang, Hemank Lamba, James Herbsleb, and Bogdan Vasilescu. 2022. "This is Damn Slick!": Estimating the
Impact of Tweets on Open Source Project Popularity and New Contributors. In Proceedings of the 44th International
Conference on Software Engineering (ICSE ’22). 2116–2129. https://doi.org/10.1145/3510003.3510121
[20] Igor Fedorov, Ryan P Adams, Matthew Mattina, and Paul Whatmough. 2019. SpArSe: Sparse Architecture Search
for CNNs on Resource-Constrained Microcontrollers. In Advances in Neural Information Processing Systems, Vol. 32.
Curran Associates, Inc.
[21] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. 2020. Linear Mode Connectivity and
the Lottery Ticket Hypothesis. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR,
3259–3269.
[22] Migran N Gevorkyan, Anastasia V Demidova, Tatiana S Demidova, and Anton A Sobolev. 2019. Review and comparative
analysis of machine learning libraries for machine learning. Discrete and Continuous Models and Applied Computational
Science 27, 4 (Dec. 2019), 305–315. https://doi.org/10.22363/2658-4670-2019-27-4-305-315
[23] Danielle Gonzalez, Thomas Zimmermann, and Nachiappan Nagappan. 2020. The state of the ML-universe: 10 years of
artificial intelligence & machine learning software development on GitHub. In Proceedings of the 17th International
Conference on Mining Software Repositories. 431–442.
[24] Matthias Grimmer. 2014. High-Performance Language Interoperability in Multi-Language Runtimes. In Proceedings of
the Companion Publication of the 2014 ACM SIGPLAN Conference on Systems, Programming, and Applications: Software
for Humanity (Portland, Oregon, USA) (SPLASH ’14). Association for Computing Machinery, New York, NY, USA,
17–19. https://doi.org/10.1145/2660252.2660256
[25] Odd Erik Gundersen and Sigbjørn Kjensmo. 2018. State of the Art: Reproducibility in Artificial Intelligence. Proceedings
of the AAAI Conference on Artificial Intelligence 32, 1 (Apr. 2018).
[26] Qianyu Guo, Sen Chen, Xiaofei Xie, Lei Ma, Qiang Hu, Hongtao Liu, Yang Liu, Jianjun Zhao, and Xiaohong Li. 2019. An
Empirical Study Towards Characterizing Deep Learning Development and Deployment Across Different Frameworks
and Platforms. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (San
Diego, California) (ASE ’19). IEEE Press, 810–822. https://doi.org/10.1109/ASE.2019.00080
[27] Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. 2021. Deep learning for object
detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array 10 (2021), 100057.
https://doi.org/10.1016/j.array.2021.100057
[28] Junxiao Han, Shuiguang Deng, Xin Xia, Dongjing Wang, and Jianwei Yin. 2019. Characterization and Prediction of
Popular Projects on GitHub. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC),
Vol. 1. 21–26. https://doi.org/10.1109/COMPSAC.2019.00013
[29] Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang
Zhang, et al. 2021. Pre-trained models: Past, present and future. AI Open 2 (2021), 225–250.
[30] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997),
1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[31] Oumaima Hourrane, Nouhaila Idrissi, and El Habib Benlahmar. 2019. An Empirical Study of Deep Neural Networks
Models for Sentiment Classification on Movie Reviews. In 1st International Conference on Smart Systems and Data
Science (ICSSD). 1–6. https://doi.org/10.1109/ICSSD47982.2019.9003171
[32] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[33] Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Lei Ma, Mike Papadakis, and Yves Le Traon. 2022. An Empirical
Study on Data Distribution-Aware Test Selection for Deep Learning Enhancement. ACM Transactions on Software
Engineering and Methodology (Jan. 2022). https://doi.org/10.1145/3511598
[34] Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1110–1121. https://doi.org/10.1145/3377811.3380395
[35] Richard Isdahl and Odd Erik Gundersen. 2019. Out-of-the-Box Reproducibility: A Survey of Machine Learning
Platforms. In 15th International Conference on eScience (eScience). 86–95. https://doi.org/10.1109/eScience.2019.00017
[36] Md Johirul Islam, Hoan Anh Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. What Do Developers Ask About ML
Libraries? A Large-scale Study Using Stack Overflow. arXiv preprint arXiv:1906.11940 (June 2019). arXiv:1906.11940
[37] Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhableswar K. DK Panda. 2019. Performance
Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters. In IEEE International Conference
on Cluster Computing (CLUSTER). 1–11. https://doi.org/10.1109/CLUSTER.2019.8891042
[38] Li Jia, Hao Zhong, Xiaoyin Wang, Linpeng Huang, and Xuansheng Lu. 2020. An Empirical Study on Bugs Inside
TensorFlow. In Database Systems for Advanced Applications, Yunmook Nah, Bin Cui, Sang-Won Lee, Jeffrey Xu Yu,
Yang-Sae Moon, and Steven Euijong Whang (Eds.). Springer International Publishing, Cham, 604–620.
[39] Li Jia, Hao Zhong, Xiaoyin Wang, Linpeng Huang, and Xuansheng Lu. 2021. The symptoms, causes, and repairs of bugs inside a deep learning library. Journal of Systems and Software 177 (2021), 110935. https://doi.org/10.1016/j.jss.2021.110935
[40] Mohammed Jouhari, Abdulla Khalid Al-Ali, Emna Baccour, Amr Mohamed, Aiman Erbad, Mohsen Guizani, and
Mounir Hamdi. 2022. Distributed CNN Inference on Resource-Constrained UAVs for Surveillance Systems: Design and
Optimization. IEEE Internet of Things Journal 9, 2 (2022), 1227–1242. https://doi.org/10.1109/JIOT.2021.3079164
[41] Keras. 2021. About Keras. Retrieved March 28, 2022 from https://keras.io/about/
[42] Serhat Kiliçarslan and Mete Celik. 2021. RSigELU: A nonlinear activation function for deep neural networks. Expert
Systems with Applications 174 (2021), 114805. https://doi.org/10.1016/j.eswa.2021.114805
[43] Alex Krizhevsky. 2012. Learning Multiple Layers of Features from Tiny Images. Technical Report (April 2012).
[44] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc.
IEEE 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791
[45] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 1998. The MNIST database of handwritten digits. Retrieved
March 28, 2022 from http://yann.lecun.com/exdb/mnist/
[46] Feng Li, Yunming Ye, Zhaoyang Tian, and Xiaofeng Zhang. 2019. CPU versus GPU: which can perform matrix
computation faster—performance comparison for basic linear algebra subprograms. Neural Computing and Applications
31, 8 (Aug. 2019), 4353–4365. https://doi.org/10.1007/s00521-018-3354-z
[47] Hao Li and Cor-Paul Bezemer. 2022. Studying Popular Open Source Machine Learning Libraries and Their Cross-
Ecosystem Bindings. arXiv preprint arXiv:2201.07201 (Jan. 2022). https://doi.org/10.48550/ARXIV.2201.07201
[48] Hao Li, Gopi Krishnan Rajbahadur, and Cor-Paul Bezemer. 2024. The replication package of our study on bindings for
TensorFlow and PyTorch. https://github.com/asgaardlab/CmpMLBindings
[49] Xiaoyun Li, Belhal Karimi, and Ping Li. 2022. On Distributed Adaptive Optimization with Gradient Compression. In
International Conference on Learning Representations.
[50] Enlu Lin, Qiong Chen, and Xiaoming Qi. 2020. Deep reinforcement learning for imbalanced classification. Applied
Intelligence 50, 8 (Aug. 2020), 2488–2502. https://doi.org/10.1007/s10489-020-01637-z
[51] Chao Liu, Cuiyun Gao, Xin Xia, David Lo, John Grundy, and Xiaohu Yang. 2021. On the Reproducibility and Replicability
of Deep Learning in Software Engineering. ACM Transactions on Software Engineering and Methodology 31, 1, Article
15 (Oct. 2021), 46 pages. https://doi.org/10.1145/3477535
[52] Jiakun Liu, Qiao Huang, Xin Xia, Emad Shihab, David Lo, and Shanping Li. 2020. Is using deep learning frameworks free?
characterizing technical debt in deep learning frameworks. In Proceedings of the ACM/IEEE 42nd International Conference
on Software Engineering: Software Engineering in Society (ICSE-SEIS ’20). Association for Computing Machinery, New
York, NY, USA, 1–10. https://doi.org/10.1145/3377815.3381377
[53] Jeffrey D. Long, Du Feng, and Norman Cliff. 2003. Ordinal Analysis of Behavioral Data. In Handbook of Psychology, Irving B. Weiner (Ed.). John Wiley & Sons, Inc., Hoboken, NJ, USA, Chapter 25, 635–661. https://doi.org/10.1002/0471264385.wei0225
[54] Chunjie Luo, Xiwen He, Jianfeng Zhan, Lei Wang, Wanling Gao, and Jiahui Dai. 2020. Comparison and Benchmarking
of AI Models and Frameworks on Mobile Devices. arXiv preprint arXiv:2005.05085 (May 2020).
[55] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning
Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies (Portland, Oregon) (HLT ’11). Association for Computational Linguistics,
USA, 142–150.
[56] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than
the Other. Annals of Mathematical Statistics 18 (1947), 50–60.
[57] Matthew B. A. McDermott, Shirly Wang, Nikki Marinsek, Rajesh Ranganath, Luca Foschini, and Marzyeh Ghassemi.
2021. Reproducibility in machine learning for health research: Still a ways to go. Science Translational Medicine 13, 586
(2021), eabb1655. https://doi.org/10.1126/scitranslmed.abb1655
[58] Prabhat Nagarajan, Garrett Warnell, and Peter Stone. 2019. Deterministic Implementations for Reproducibility in Deep
Reinforcement Learning. In AAAI 2019 Workshop on Reproducible AI. https://doi.org/10.48550/ARXIV.1809.05676
[59] Sebastian Nanz and Carlo A. Furia. 2015. A Comparative Study of Programming Languages in Rosetta Code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 778–788. https://doi.org/10.1109/ICSE.2015.90
[60] Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet Tran, Alvaro Lopez Garcia, Ignacio Heredia, Peter Malík, and Ladislav Hluchý. 2019. Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review 52, 1 (June 2019), 77–124. https://doi.org/10.1007/s10462-018-09679-z
[61] Jianjun Ni, Yinan Chen, Yan Chen, Jinxiu Zhu, Deena Ali, and Weidong Cao. 2020. A Survey on Theories and Applications for Self-Driving Cars Based on Deep Learning Methods. Applied Sciences 10, 8 (2020). https://doi.org/10.3390/app10082749
[62] Anne-Marie Nussberger, Lan Luo, L Elisa Celis, and Molly J Crockett. 2022. Public attitudes value interpretability but
prioritize accuracy in Artificial Intelligence. Nature communications 13, 1 (2022), 5821.
[63] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison,
Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An
Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32
(NeurIPS). Curran Associates, Inc., 8024–8035.
[64] Rui Pereira, Marco Couto, Francisco Ribeiro, Rui Rua, Jácome Cunha, João Paulo Fernandes, and João Saraiva. 2017.
Energy Efficiency across Programming Languages: How Do Energy, Time, and Memory Relate?. In Proceedings of the
10th ACM SIGPLAN International Conference on Software Language Engineering (Vancouver, BC, Canada) (SLE 2017).
Association for Computing Machinery, New York, NY, USA, 256–267. https://doi.org/10.1145/3136014.3136031
[65] Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. 2020. Problems and Opportunities in Training Deep Learning Software Systems: An Analysis of Variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE ’20). Association for Computing Machinery, New York, NY, USA, 771–783. https://doi.org/10.1145/3324884.3416545
[66] L. Prechelt. 2000. An empirical comparison of seven programming languages. Computer 33, 10 (2000), 23–29.
https://doi.org/10.1109/2.876288
[67] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[68] Sebastian Raschka, Joshua Patterson, and Corey Nolet. 2020. Machine learning in Python: Main developments and
technology trends in data science, machine learning, and artificial intelligence. Information 11, 4 (2020), 193.
[69] Tristan Ravitch, Steve Jackson, Eric Aderhold, and Ben Liblit. 2009. Automatic Generation of Library Bindings
Using Static Analysis. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and
Implementation (Dublin, Ireland) (PLDI ’09). Association for Computing Machinery, New York, NY, USA, 352–362.
https://doi.org/10.1145/1542476.1542516
[70] Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. 2006. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices. In Annual Meeting of the Southern Association for Institutional Research. Citeseer, 1–51.
[71] Arish S., Sharad Sinha, and Smitha K.G. 2019. Optimization of Convolutional Neural Networks on Resource Constrained Devices. In 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 19–24. https://doi.org/10.1109/ISVLSI.2019.00013
[72] Stan Salvador and Philip Chan. 2007. FastDTW: Toward accurate dynamic time warping in linear time and space.
Intelligent Data Analysis 11, 5 (2007), 561–580.
[73] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael
Young, Jean-François Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In
Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.),
Vol. 28. Curran Associates, Inc.
[74] Juliet Popper Shaffer. 1995. Multiple hypothesis testing. Annual review of psychology 46, 1 (1995), 561–584.
[75] Kedi Shen, Yun Zhang, Lingfeng Bao, Zhiyuan Wan, Zhuorong Li, and Minghui Wu. 2023. Patchmatch: A Tool
for Locating Patches of Open Source Project Vulnerabilities. In Proceedings of the 45th International Conference
on Software Engineering: Companion Proceedings (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 175–179.
https://doi.org/10.1109/ICSE-Companion58688.2023.00049
[76] Nischal Shrestha, Colton Botta, Titus Barik, and Chris Parnin. 2020. Here We Go Again: Why Is It Difficult for
Developers to Learn Another Programming Language?. In IEEE/ACM 42nd International Conference on Software
Engineering (ICSE). 691–701.
[77] Alexey A. Shvets, Alexander Rakhlin, Alexandr A. Kalinin, and Vladimir I. Iglovikov. 2018. Automatic Instrument
Segmentation in Robot-Assisted Surgery using Deep Learning. In 2018 17th IEEE International Conference on Machine
Learning and Applications (ICMLA). 624–628. https://doi.org/10.1109/ICMLA.2018.00100
[78] Julien Siebert, Lisa Joeckel, Jens Heidrich, Adam Trendowicz, Koji Nakamichi, Kyoko Ohashi, Isao Namba, Rieko
Yamamoto, and Mikio Aoyama. 2022. Construction of a quality model for machine learning systems. Software Quality
Journal 30, 2 (June 2022), 307–335. https://doi.org/10.1007/s11219-021-09557-y
[79] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition.
In 3rd International Conference on Learning Representations (ICLR 2015).
[80] Rachael Tatman, J. Vanderplas, and Sohier Dane. 2018. A Practical Taxonomy of Reproducibility for Machine Learning
Research. In Reproducibility in Machine Learning Workshop at ICML 2018.
[81] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya
Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem
Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman
Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich,
Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton,
Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian,
Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov,
Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov,
and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
[82] Beatrice van Amsterdam, Matthew J. Clarkson, and Danail Stoyanov. 2021. Gesture Recognition in Robotic Surgery: A Review. IEEE Transactions on Biomedical Engineering 68, 6 (2021), 2021–2035. https://doi.org/10.1109/TBME.2021.3054828
[83] Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, and Jakob Grue Simonsen. 2021. On
Position Embeddings in BERT. In International Conference on Learning Representations.
[84] Kang Wang, Yong Dou, Tao Sun, Peng Qiao, and Dong Wen. 2022. An automatic learning rate decay strategy for
stochastic gradient descent optimization methods in neural networks. International Journal of Intelligent Systems
(2022). https://doi.org/10.1002/int.22883
[85] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac,
Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020.
Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 38–45.
https://doi.org/10.18653/v1/2020.emnlp-demos.6
[86] Thomas Wolter, Ann Barcomb, Dirk Riehle, and Nikolay Harutyunyan. 2023. Open Source License Inconsistencies on
GitHub. ACM Trans. Softw. Eng. Methodol. 32, 5, Article 110 (July 2023), 23 pages. https://doi.org/10.1145/3571852
[87] Xiaoya Xia, Shengyu Zhao, Xinran Zhang, Zehua Lou, Wei Wang, and Fenglin Bi. 2023. Understanding the Archived
Projects on GitHub. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).
13–24. https://doi.org/10.1109/SANER56733.2023.00012
[88] Dongpo Xu, Shengdong Zhang, Huisheng Zhang, and Danilo P. Mandic. 2021. Convergence of the RMSProp deep learning method with penalty for nonconvex optimization. Neural Networks 139 (2021), 17–23. https://doi.org/10.1016/j.neunet.2021.02.011
[89] Jeremy Yallop, David Sheets, and Anil Madhavapeddy. 2018. A modular foreign function interface. Science of Computer
Programming 164 (2018), 82–97. https://doi.org/10.1016/j.scico.2017.04.002 Special issue of selected papers from FLOPS
2016.
[90] Zhanglu Yan, Jun Zhou, and Weng-Fai Wong. 2021. Near Lossless Transfer Learning for Spiking Neural Networks.
Proceedings of the AAAI Conference on Artificial Intelligence 35, 12 (May 2021), 10577–10584.
[91] Chengran Yang, Bowen Xu, Jiakun Liu, and David Lo. 2023. TECHSUMBOT: A Stack Overflow Answer Summarization Tool for Technical Query. In Proceedings of the 45th International Conference on Software Engineering: Companion Proceedings (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 132–135. https://doi.org/10.1109/ICSE-Companion58688.2023.00040
[92] Tianyi Zhang, Cuiyun Gao, Lei Ma, Michael Lyu, and Miryung Kim. 2019. An Empirical Study of Common Challenges in
Developing Deep Learning Applications. In 2019 IEEE 30th International Symposium on Software Reliability Engineering
(ISSRE). 104–115. https://doi.org/10.1109/ISSRE.2019.00020