2022 State of Competitive ML — The Downfall of TensorFlow
The 2022 State of Competitive Machine Learning was released recently, and in it, the authors highlight the core trends of the past year. In their words, “2022 was a big year for competitive machine learning, with a total prize pool of more than $5m across all platforms”. Beyond prize money, the tool and platform trends the report uncovers are an excellent barometer of the preferences and familiarity of practitioners across the community that shapes industry and research.
Perhaps the most striking trend from the 2022 State of Competitive Machine Learning report was how far TensorFlow has fallen. Just a few years ago, it unquestionably dominated the deep learning industry, research, and community. Today it appears in only 4% of active competition submissions, in a space TensorFlow owned just five years ago. Why is this, though?
The Cost of Non-Pythonic Design
TensorFlow held the first-mover advantage as the first industry-backed entry for deep learning training and deployment. It delivered better scalability, customization, and consistency, areas where prior frameworks such as Caffe and Theano fell severely short. Adoption came seemingly overnight, and TensorFlow was close to a household name amid the hype of the early deep learning craze.
Underneath the TensorFlow name, however, were severe complexities that added a steep learning curve on top of the already difficult one for machine learning itself. Creating even a simple neural network and training loop could require days of debugging and study before further experimentation was possible. The diagnosis for these complexities? A lack of documentation, user guides, and, most importantly, Pythonic design.
The architects behind TensorFlow chose a graph-execution approach, resulting in a convoluted coding style. As engineers and researchers wrote new solutions, their code built up an invisible, global graph that could only be interacted with through a separate session object. Understanding or debugging that code was only possible with secondary software like the TensorBoard graph visualizer, introducing a further learning curve. Below are TensorFlow and PyTorch code examples, illustrating their stark differences in structure and readability.
Minimal TensorFlow Example
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API; lives under tf.compat.v1 in TF 2.x

# Define the model as a static graph
inp = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
out = tf.nn.softmax(tf.matmul(inp, W) + b)

# Initialize the variables and create a session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

# Invoke the model by feeding the placeholder through the session
test_input = np.random.rand(1, 784).astype(np.float32)
out_res = sess.run(out, feed_dict={inp: test_input})
Minimal PyTorch Example
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the model as an ordinary Python class
class ExampleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 10)

    def forward(self, x):
        return F.softmax(self.fc1(x), dim=1)

model = ExampleNetwork()

# Run an inference by calling the model like a function
test_input = torch.rand(1, 784)
out = model(test_input)
A Failure to Integrate
Looking back at PyTorch’s success from the architecture and documentation angles we discussed earlier, another piece of its strategy stands out: integrations. PyTorch has added and continually maintained many third-party integrations with other ecosystems, such as ONNX. Additionally, its solid architecture and the absence of ecosystem lock-in enabled many further products and open-source projects to build on top of PyTorch, such as PyTorch Lightning.
The TensorFlow team took the opposite approach. The focus was on a closed ecosystem with tools designed to integrate solely within the system: TensorFlow, TF Serving, TF Lite, TensorBoard, and so on. ONNX support for TensorFlow, for instance, exists only through a third-party library, with no native support or commitment, leading to persistent issues and bugs.
Ultimately, the TF team failed to realize at the time just how deep and wide the ML space would become. Successful projects today expose clean interfaces and integrations across the myriad of options in the full MLOps landscape; the game is not owning the system and every component for uniformity, but finding a core value and integrating with the rest.
What’s Next?
Poor architectural decisions led to abandonment by the community, and a monopoly-style view of ML led to a further lack of adoption by the necessary tool chains in the ML ecosystem. The TensorFlow team tried to fix all of this with the TensorFlow v2 refactor, but it was too little, too late, and it alienated the core constituency TensorFlow still held: legacy systems. Industry had long since adopted TensorFlow for its excellent deployment support and first-mover advantage, yet the move to 2.0 required significant refactors to migrate, effectively stranding those users on a no-longer-supported version, 1.15.5.
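To make the scale of that migration concrete, here is the earlier minimal model rewritten in TensorFlow 2.x Keras style, a sketch assuming a standard TF 2 install: eager execution replaces placeholders and sessions entirely.

```python
import numpy as np
import tensorflow as tf

# The same softmax model in TensorFlow 2.x: no placeholders, no sessions.
# (Illustrative sketch; layer sizes mirror the earlier 1.x example.)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Calling the model runs eagerly and returns results immediately
test_input = np.random.rand(1, 784).astype(np.float32)
out = model(test_input)
```

Every tf.placeholder and tf.Session in a 1.x codebase had to be rewritten along these lines, which is why moving to 2.0 was a refactor rather than an upgrade.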
JAX is Google’s latest entry as TensorFlow dies out everywhere except industry legacy systems. Still, how much adoption JAX will see outside of Google remains to be seen. Its architecture is much better than TensorFlow’s, but its documentation, getting-started guides, and convenience APIs leave a lot to be desired. JAX is still at the research-tool stage rather than something ready for the full ML community, but it is evolving quickly. Until then, PyTorch will continue to reign.