Proceedings of the ACM on Programming Languages
Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying well-known constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler p...
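For illustration, here is a minimal Python sketch of the core idea the abstract describes (it is not the paper's Haskell formulation; the names d_sin, d_sqr, and compose are invented here): a differentiable function is represented as a map from an input to the pair of its value and its derivative as a linear map, and programs are built by composing such maps with the chain rule, with no tape, graph, or mutation.

```python
# Minimal sketch (not the paper's implementation): a differentiable function
# is represented as  x -> (f(x), f'(x) as a linear map), and composition
# applies the chain rule with no tape, graph, variables, or mutation.
import math

def d_sin(x):
    # value and derivative-as-linear-map (here: multiplication by cos x)
    return math.sin(x), (lambda dx: math.cos(x) * dx)

def d_sqr(x):
    return x * x, (lambda dx: 2.0 * x * dx)

def compose(dg, df):
    """Chain rule: (g . f)'(x) = g'(f(x)) . f'(x)."""
    def dgf(x):
        y, f_lin = df(x)
        z, g_lin = dg(y)
        return z, (lambda dx: g_lin(f_lin(dx)))
    return dgf

d_h = compose(d_sin, d_sqr)      # h(x) = sin(x^2)
value, h_lin = d_h(2.0)
print(value, h_lin(1.0))         # h(2) and h'(2) = 2*2*cos(4)
```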
Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD) is a technique for calculating derivatives efficiently and accurately, established in fields such as computational fluid dynamics, nuclear engineering, and atmospheric sciences. Despite its advantages and use in other fields, machine learning practitioners have been little influenced by AD and make scant use of available tools. We survey the intersection of AD and machine learning, cover applications where AD has the potential to make a big impact, and report on the recent developments in the adoption of this technique. We also aim to dispel some misconceptions that we think have impeded the widespread awareness of AD within the machine learning community.
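As a concrete, hedged illustration of the technique the survey covers, a minimal forward-mode AD sketch in Python using dual numbers (the Dual class and the sin wrapper are illustrative, not from any particular tool): carrying (value, derivative) pairs through arithmetic applies the chain rule exactly, with neither symbolic manipulation nor finite differences.

```python
# Minimal forward-mode AD via dual numbers (illustrative sketch only).
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# d/dx [x * sin(x) + x] at x = 1.5, seeded with dot = 1
x = Dual(1.5, 1.0)
y = x * sin(x) + x
print(y.val, y.dot)   # value and exact derivative: sin(x) + x*cos(x) + 1
```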
ArXiv, 2015
As computational challenges in optimization and statistical inference grow ever harder, algorithms that utilize derivatives are becoming increasingly more important. The implementation of the derivatives that make these algorithms so powerful, however, is a substantial user burden and the practicality of these algorithms depends critically on tools like automatic differentiation that remove the implementation burden entirely. The Stan Math Library is a C++, reverse-mode automatic differentiation library designed to be usable, extensive and extensible, efficient, scalable, stable, portable, and redistributable in order to facilitate the construction and utilization of such algorithms. Usability is achieved through a simple direct interface and a cleanly abstracted functional interface. The extensive built-in library includes functions for matrix operations, linear algebra, differential equation solving, and most common probability functions. Extensibility derives from a straightforwa...
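To make the reverse-mode idea concrete, here is a toy tape-based sketch in Python via operator overloading; it illustrates the general technique only and is not the Stan Math Library's C++ interface (Var, tape, and backward are invented names).

```python
# Illustrative tape-based reverse-mode AD via operator overloading (toy sketch).

class Var:
    tape = []                         # global tape of recorded operations

    def __init__(self, val):
        self.val, self.grad = val, 0.0

    def __add__(self, other):
        out = Var(self.val + other.val)
        # record how to push the adjoint of `out` back to its parents
        Var.tape.append(lambda: (self._bump(out.grad), other._bump(out.grad)))
        return out

    def __mul__(self, other):
        out = Var(self.val * other.val)
        Var.tape.append(lambda: (self._bump(other.val * out.grad),
                                 other._bump(self.val * out.grad)))
        return out

    def _bump(self, g):
        self.grad += g

def backward(out):
    out.grad = 1.0
    for step in reversed(Var.tape):   # reverse sweep over the tape
        step()

x, y = Var(3.0), Var(2.0)
z = x * y + x
backward(z)
print(x.grad, y.grad)                 # dz/dx = y + 1 = 3.0, dz/dy = x = 3.0
```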
Automatic differentiation, the mechanical transformation of numeric computer programs to calculate derivatives efficiently and accurately, dates to the origin of the computer age. Reverse mode automatic differentiation both antedates and generalizes the method of backwards propagation of errors used in machine learning. Despite this, practitioners in a variety of fields, including machine learning, have been little influenced by automatic differentiation, and make scant use of available tools. Here we review the technique of automatic differentiation, describe its two main modes, and explain how it can benefit machine learning practitioners. To reach the widest possible audience our treatment assumes only elementary differential calculus, and does not assume any knowledge of linear algebra.
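The two modes can be summarized by how the chain-rule product of Jacobians is associated; for a three-stage composition f = f3 ∘ f2 ∘ f1 (notation chosen here for illustration only):

```latex
J_f(x) \;=\; J_{f_3}(x_2)\, J_{f_2}(x_1)\, J_{f_1}(x),
\qquad x_1 = f_1(x), \quad x_2 = f_2(x_1)
```

Forward mode multiplies right-to-left against a seed direction v, so a full gradient of f : R^n → R takes n passes; reverse mode multiplies left-to-right against a seed covector w^T, recovering the whole gradient in a single pass at the price of storing intermediate values.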
Proceedings of the ACM on Programming Languages, 2019
Deep learning has seen tremendous success over the past decade in computer vision, machine translation, and gameplay. This success rests crucially on gradient-descent optimization and the ability to “learn” parameters of a neural network by backpropagating observed errors. However, neural network architectures are growing increasingly sophisticated and diverse, which motivates an emerging quest for even more general forms of differentiable programming, where arbitrary parameterized computations can be trained by gradient descent. In this paper, we take a fresh look at automatic differentiation (AD) techniques, and especially aim to demystify the reverse-mode form of AD that generalizes backpropagation in neural networks. We uncover a tight connection between reverse-mode AD and delimited continuations, which permits implementing reverse-mode AD purely via operator overloading and without managing any auxiliary data structures. We further show how this formulation of AD can be fruitf...
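A rough Python sketch of the continuation-based formulation described above (with explicit callbacks standing in for Scala's delimited control operators; Num, add, mul, and grad are illustrative names): each operation runs the rest of the program as a continuation and accumulates adjoints when it returns, so the call stack plays the role of the tape and no auxiliary data structure is managed.

```python
# Reverse-mode AD in continuation-passing style (illustrative sketch).

class Num:
    def __init__(self, val):
        self.val, self.grad = val, 0.0

def add(a, b, k):
    c = Num(a.val + b.val)
    k(c)                       # run the rest of the program
    a.grad += c.grad           # then accumulate adjoints on the way back
    b.grad += c.grad

def mul(a, b, k):
    c = Num(a.val * b.val)
    k(c)
    a.grad += b.val * c.grad
    b.grad += a.val * c.grad

def grad(f, x_val):
    x = Num(x_val)
    f(x, lambda y: setattr(y, "grad", 1.0))   # seed the output adjoint
    return x.grad

# y = x * x + x  =>  dy/dx = 2x + 1
def prog(x, k):
    mul(x, x, lambda t: add(t, x, k))

print(grad(prog, 3.0))         # 7.0
```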
2010
In scientific computing, we often need to compute derivatives in various numerical methods. Automatic differentiation [1] is a method to automatically generate a program that computes derivatives from the code that evaluates the function value. With that, people can focus on the core of their scientific problems and avoid manually writing the code for derivative evaluations, which is a tedious and time-consuming job.
1998
Differentiation is one of the fundamental problems in numerical mathematics. The solution of many optimization problems and other applications requires knowledge of the gradient, the Jacobian matrix, or the Hessian matrix of a given function.
Proceedings of the 5th international conference on Supercomputing - ICS '91, 1991
The numerical methods employed in the solution of many scientific computing problems require the computation of first- or second-order derivatives of a function f : R^n → R^m. We present an approach that, given a serial C program for the computation of f(x), derives a parallel execution schedule for the computation of f and its derivatives in a completely automatic fashion. This is achieved by overloading the computation of f(x) in C++ to obtain a trace of the computations to be performed and then transforming this trace into a dataflow graph for the computation of f(x). In addition to the computation of f(x), this graph also allows us to exactly and inexpensively compute derivatives of f by the repeated use of the chain rule. Parallelism is exploited in two ways: rows or columns of derivative matrices can be computed by independent passes through the computational graph, and parallelism within the processing of this computational graph can be exploited by processing independent subgraphs concurrently. We present experimental results that show that good performance on shared-memory machines can be obtained by using a graph interpreter approach. We then present some ideas that are currently under development for improving computational granularity and for implementing parallel automatic differentiation schemes in a portable and more efficient fashion.
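The following Python sketch (a simplification, not the paper's C++/graph-interpreter system) illustrates the first source of parallelism: a hand-built trace stands in for one recorded by overloading, and each Jacobian column is obtained by an independent sweep over that trace, so the sweeps share no mutable state and could run concurrently.

```python
# Illustrative sketch: a flat trace of a computation, swept once per input
# direction to obtain one Jacobian column per (independent) sweep.

class Node:
    def __init__(self, op, parents, val):
        self.op, self.parents, self.val = op, parents, val

def trace_f(x_vals):
    """f(x0, x1) = (x0 * x1, x0 + x1): build inputs and a flat trace."""
    a, b = (Node('in', [], v) for v in x_vals)
    prod = Node('mul', [a, b], a.val * b.val)
    s = Node('add', [a, b], a.val + b.val)
    return [a, b], [prod, s], [prod, s]        # inputs, trace, outputs

def forward_sweep(inputs, trace, outputs, seed_index):
    """Propagate tangents for one input direction (one Jacobian column)."""
    dot = {id(n): 0.0 for n in inputs + trace}
    dot[id(inputs[seed_index])] = 1.0
    for n in trace:
        a, b = n.parents
        if n.op == 'mul':
            dot[id(n)] = dot[id(a)] * b.val + a.val * dot[id(b)]
        elif n.op == 'add':
            dot[id(n)] = dot[id(a)] + dot[id(b)]
    return [dot[id(o)] for o in outputs]

inputs, trace, outputs = trace_f([3.0, 2.0])
columns = [forward_sweep(inputs, trace, outputs, j) for j in range(2)]
print(columns)    # [[2.0, 1.0], [3.0, 1.0]] == columns of the Jacobian
```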
Lecture Notes in Computational Science and Engineering, 2006
Backwards calculation of derivatives, sometimes called the reverse mode, the full adjoint method, or backpropagation, has been developed and applied in many fields. This paper reviews several strands of history, advanced capabilities, and types of application, particularly those which are crucial to the development of brain-like capabilities in intelligent control and artificial intelligence.
Future Generation Computer Systems, 2005
The automatic generation of adjoints of mathematical models that are implemented as computer programs is receiving increased attention in the scientific and engineering communities. Reverse-mode automatic differentiation is of particular interest for large-scale optimization problems. It allows the computation of gradients at a small constant multiple of the cost for evaluating the objective function itself, independent of the number of input parameters. Source-to-source transformation tools apply simple differentiation rules to generate adjoint codes based on the adjoint version of every statement. In order to guarantee correctness, certain values that are computed and overwritten in the original program must be made available in the adjoint program. For their determination we introduce a static dataflow analysis called "to be recorded" analysis. Possible overestimation of this set must be kept minimal to get efficient adjoint codes. This efficiency is essential for the applicability of source-to-source transformation tools to real-world applications.
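A toy Python illustration (not the output of any source-to-source tool) of why a "to be recorded" analysis is needed: when the original code overwrites a value, the adjoint of that statement still needs the old value, so the forward sweep must push it onto a stack for the reverse sweep to restore.

```python
# Toy illustration of the "to be recorded" issue: the adjoint of  x = x * x
# needs the OLD value of x, which the original statement overwrites.

def forward(x, stack):
    stack.append(x)        # x is overwritten below and needed by the adjoint: record it
    x = x * x              # original statement: x = x * x
    y = 3.0 * x            # uses only the new x: nothing to record
    return y

def adjoint(y_bar, stack):
    # reverse the statements, restoring recorded values where needed
    x_bar = 3.0 * y_bar            # adjoint of  y = 3 * x
    x_old = stack.pop()            # restore the overwritten value
    x_bar = 2.0 * x_old * x_bar    # adjoint of  x = x * x  uses the old x
    return x_bar

stack = []
y = forward(2.0, stack)            # y = 3 * (2^2) = 12
print(adjoint(1.0, stack))         # dy/dx = 6x = 12 at x = 2
```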
HAL (Le Centre pour la Communication Scientifique Directe), 2022
Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with competing forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
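As a small, hedged illustration of the feedforward case mentioned above (the network, shapes, and use of NumPy are choices made here, not the paper's setup): the backward sweep revisits each layer once, so the gradient costs a small constant multiple of the forward evaluation regardless of the input dimension, and the value 0 used for ReLU at its kink is one admissible conservative choice.

```python
# Backpropagation through a tiny feedforward ReLU network with squared loss;
# the backward sweep mirrors the forward sweep layer by layer.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
x, target = rng.normal(size=4), 0.5

# forward pass: two layers, ReLU activation, squared loss
z1 = W1 @ x
a1 = np.maximum(z1, 0.0)
y = W2 @ a1
loss = 0.5 * (y - target) ** 2

# backward sweep: one pass back through the same layers
y_bar = y - target                  # d loss / d y
W2_bar = np.outer(y_bar, a1)        # d loss / d W2
a1_bar = W2.T @ y_bar               # d loss / d a1
z1_bar = a1_bar * (z1 > 0)          # ReLU: 0 at the kink (a conservative choice)
W1_bar = np.outer(z1_bar, x)        # d loss / d W1

print(loss, W1_bar.shape, W2_bar.shape)
```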