Skip to main content

Barak Pearlmutter

Followers

62

Following

24

Co-authors

22

Public Views

Jeffrey Siskind

Roskilde University

Torben Æ Mogensen

University of Copenhagen

University of Aarhus

Michael Sperber

Matthias Neubauer

Albert-Ludwigs-University of Freiburg

University of Utah

University of Maryland

Università degli Studi di Padova

Interests

Uploads

Papers by Barak Pearlmutter

Using Polyvariant Union-Free Flow Analysis to Compile a Higher-Order Functional-Programming Language with a First-Class Derivative Operator to Efficient Fortran-like Code

We exhibit an aggressive optimizing compiler for a functionalprogramming language which includes ... more We exhibit an aggressive optimizing compiler for a functionalprogramming language which includes a first-class forward automatic differentiation (AD) operator. The compiler's performance is competitive with FORTRAN-based systems on our numerical examples, despite the potential inefficiencies entailed by support of a functional-programming language and a first-class AD operator. These results are achieved by combining (1) a novel formulation of forward AD in terms of a reflexive mechanism that supports firstclass nestable nonstandard interpretation with (2) the migration to compile-time of the conceptually run-time nonstandard interpretation by whole-program inter-procedural flow analysis.

The oaklisp language manual and the oaklisp implementation cmide unpublished

Putting the Automatic Back into AD: Part I, What’s Wrong (CVS: 1.1)

Current implementations of automatic differentiation are far from automatic. We survey the diffic... more Current implementations of automatic differentiation are far from automatic. We survey the difficulties encountered when applying four existing AD systems, ADIFOR, TAPENADE, ADIC, and FADBAD++, to two simple tasks, minimax optimization and control of a simulated physical system, that involve taking derivatives of functions that themselves take derivatives of other functions. ADIC is not able to perform these tasks as it cannot transform its own generated code. Using FADBAD++, one cannot compute derivatives of different orders with unmodified code, as needed by these tasks. One must either manually duplicate code for the different derivative orders or write the code using templates to automate such code duplication. ADIFOR and TAPENADE are both able to perform these tasks only with significant intervention: modification of source code and manual editing of generated code. A companion paper presents a new AD system that handles both tasks without any manual intervention yet performs as well as or better than these existing systems.

Neuronal Predictions of Sparse Linear Representations

A striking feature of many sensory processing problems is that there appear to be many more neuro... more A striking feature of many sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have about 30,000 cochlear neurons, but at least a thousand times as many neurons in the auditory cortex. Such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise. We instead posit that they directly subserve computations of interest. We first review how sparse overcomplete linear representations can be used for source separation, using a particularly difficult case, the HRTF cue (the differential filtering imposed on a source by its path from its origin to the cochlea) as an example. We then explore some robust and generic predictions about neuronal representations that follow from taking sparse linear representations as a model of neuronal sensory processing.

Fusion of Functional Brain Imaging Modalities via Linear Programming

EEG/fMRI fusion algorithms attempt to construct a spatiotemporal estimate of neuronal activity us... more

Optimizing properties of the Jacobian of nonlinear feedforward systems

The implementation of oaklisp

Putting the Automatic Back into AD: Part II, Dynamic, Automatic, Nestable, and Fast (CVS: 1.1)

This paper discusses a new AD system that correctly and automatically accepts nested and dynamic ... more This paper discusses a new AD system that correctly and automatically accepts nested and dynamic use of the AD operators, without any manual intervention. The system is based on a new formulation of AD as highly generalized first-class citizens in a ń-calculus, which is briefly described. Because the ń-calculus is the basis for modern programminglanguage implementation techniques, integration of AD into the ń-calculus allows AD to be integrated into an aggressive compiler. We exhibit a research compiler which does this integration, and uses some novel analysis techniques to accept code involving free dynamic use of nested AD operators, yet performs as well as or better than the most aggressive existing AD systems.

Perturbation Confusion and Referential Transparency Correct Functional Implementation of Forward-Mode AD

It is tempting to incorporate differentiation operators into functional-programming languages. Ma... more It is tempting to incorporate differentiation operators into functional-programming languages. Making them first-class citizens, however, is an enterprise fraught with danger. We discuss a potential problem with forward-mode AD common to many AD systems, including all attempts to integrate a forward-mode AD operator into Haskell. In particular, we show how these implementations fail to preserve referential transparency, and can compute grossly incorrect results when the differentiation operator is applied to a function that itself uses that operator. The underlying cause of this problem is perturbation confusion, a failure to distinguish between distinct perturbations introduced by distinct invocations of the differentiation operator. We then discuss how perturbation confusion can be avoided.

Attention and Optimal Sensory Codes

Neuronal activity can be modulated by attention even while the sensory stimulus is held fixed. Th... more Neuronal activity can be modulated by attention even while the sensory stimulus is held fixed. This modulation implies changes in the tuning curve (or receptive field) of the neurons involved in sensory processing. We propose an information-theoretic hypothesis for the purpose of this modulation, and show using computer simulation that the similar modulation emerges in a system that is optimally encoding a sensory stimulus when the system is informed about the changing relevance of different features of the input. We present a simple model that learns a covert attention mechanism, given input patterns and tradeoff requirements. After optimization, the system gains the ability to reorganize its computational resources (or coding strategy) depending on the incoming covert attentional signal, using only threshold shifts in neurons throughout the network. The modulation of activity of the encoding units for different attentional states qualitatively matches that observed in animal selective attention experiments. Due to its generality, the model can be applied to any modality, and to any attentional goal.

Fusion of Functional Brain Imaging Modalities using L-Norms Signal Reconstruction

EEG/fMRI fusion algorithms attempt to construct a spatiotemporal estimate of neuronal activity us... more

Automatic Learning Rate Maximization in Large Adaptive Machines

Draft: Error Measures that make Outputs Probabilities

When training a backpropagation network with binary la-beled data, where the labels are generated... more When training a backpropagation network with binary la-beled data, where the labels are generated stochastically from an underlying probability associated with each pos-sible input, some error measures have the property that minimizing the error measure leads asymptotically to out-puts which correctly estimate the involved probabilities. Below, we derive a necessary and sufficient condition for an error measure to have this property, solve the condi-tion in general, and exhibit some families of such error measures.

Gradient Descent: Second-Order Momentum and Saturating Error

Batch gradient descent, Deltaw(t) = GammajdE=dw(t), converges to a minimum of quadratic form with... more Batch gradient descent, Deltaw(t) = GammajdE=dw(t), converges to a minimum of quadratic form with a time constant no better than 1 4 max= min where min and max are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term Deltaw(t) = GammajdE=dw(t) + ffDeltaw(t Gamma 1) improves this to 1 4 p max = min , although only in the batch case. Here we show that secondorder momentum, Deltaw(t) = GammajdE=dw(t) + ffDeltaw(t Gamma 1) + fi Deltaw(t Gamma 2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics. 1 INTRODUCTION Gradient descent is the bread-and-butter optimization technique in neural networks. Some people build special purpose hardware to accelerate gradient descent optimization...

Relating Egomotion and Image Evolution

By considering the dynamics of the apparent motion of a stationary object relative to a moving ob... more By considering the dynamics of the apparent motion of a stationary object relative to a moving observer, we construct a partial differential equation that relates the changes in an image to the motion of the observer. These come in two varieties: a first order system that describes the coevolution of the egocentric radial distances to objects and the visual scene, and a second order system that does not involve any distances or other geometry. The later equation leads, via the calculus of variations, to a novel technique for recovering egomotion from image sequences, a so-called visual yaw detector, which is tested on real data. For expository purposes the derivation is carried out in two dimensions, but the approach extends immediately to three. Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 1. Introduction Using a special camera mounted on the roof of a motorcar which gives a narrow 360 degree strip along the horizon, we are interested in recovering th...

Using Backpropagation with Temporal Windows to Learn the Dynamics of the CMU Direct-Drive Arm II

Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors

Advances in neural information processing systems

We propose a very simple, and well principled way o f computing the optimal step size in gradient... more We propose a very simple, and well principled way o f computing the optimal step size in gradient descent algorithms. The on-line version is very e cient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative m atrix (Hessian), which does not require to even calculate the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters.

Efficient Computation of Sparse Elements of the Inverse of a Sparse Near-Tridiagonal Matrix with Application to the Nerve Equation

Standard algorithms for computing the inverse of a tridiagonal matrix (or more generally, any Hin... more Standard algorithms for computing the inverse of a tridiagonal matrix (or more generally, any Hines matrix) compute the entire inverse, which is not sparse. For some problems, only the elements of the inverse at locations corresponding to nonzero elements in the original matrix are required. We present an algorithm that efficiently computes only these elements in O(n) time and memory. This algorithm is useful in solving discretized systems of partial differential equations that arise when computing electrical flow along a branched structure, such as a neuron's dendritic arbor. 1 Introduction The electrical parameters and connectivity in branched RC networks define a sparse matrix B which has nonzero elements only at locations that correspond to electrical connections. This sparseness can be exploited to compute efficiently the distribution of potential in such networks (Hines, 1984; Mascagni, 1989). Some applications, however, require the transfer impedance matrix, K = B Gamma1 ...

Blind Source Separation of Neuromagnetic Responses

Magnetoencephalography (MEG) is a functional brain imaging technique with millisecond temporal re... more Magnetoencephalography (MEG) is a functional brain imaging technique with millisecond temporal resolution and millimeter spatial resolution. The high temporal resolution of MEG compared to fMRI and PET (milliseconds vs. seconds and tens of seconds) makes it ideal for measuring the precise time of neuronal responses, thereby oering a powerful tool for studying temporal dynamics. We applied blind source separation (BSS) to continuous 122-channel human magnetoencephalographic data from two subjects and ve tasks. We demonstrate that without using any domain specic knowledge and without making the common assumption of single or multiple current dipole sources, BSS is capable of separating non-neuronal noise sources from neuronal responses and also of separating neuronal responses from dierent sensory modalities, and from dierent processing stages within a given modality. Key words: functional brain imaging; ICA; MEG; blind source separation. 1 Introduction The brain's neuromagnetic ...

The discriminative power of a dynamical model neuron

Using Polyvariant Union-Free Flow Analysis to Compile a Higher-Order Functional-Programming Language with a First-Class Derivative Operator to Efficient Fortran-like Code

We exhibit an aggressive optimizing compiler for a functionalprogramming language which includes ... more We exhibit an aggressive optimizing compiler for a functionalprogramming language which includes a first-class forward automatic differentiation (AD) operator. The compiler's performance is competitive with FORTRAN-based systems on our numerical examples, despite the potential inefficiencies entailed by support of a functional-programming language and a first-class AD operator. These results are achieved by combining (1) a novel formulation of forward AD in terms of a reflexive mechanism that supports firstclass nestable nonstandard interpretation with (2) the migration to compile-time of the conceptually run-time nonstandard interpretation by whole-program inter-procedural flow analysis.

The oaklisp language manual and the oaklisp implementation cmide unpublished

Putting the Automatic Back into AD: Part I, What’s Wrong (CVS: 1.1)

Current implementations of automatic differentiation are far from automatic. We survey the diffic... more Current implementations of automatic differentiation are far from automatic. We survey the difficulties encountered when applying four existing AD systems, ADIFOR, TAPENADE, ADIC, and FADBAD++, to two simple tasks, minimax optimization and control of a simulated physical system, that involve taking derivatives of functions that themselves take derivatives of other functions. ADIC is not able to perform these tasks as it cannot transform its own generated code. Using FADBAD++, one cannot compute derivatives of different orders with unmodified code, as needed by these tasks. One must either manually duplicate code for the different derivative orders or write the code using templates to automate such code duplication. ADIFOR and TAPENADE are both able to perform these tasks only with significant intervention: modification of source code and manual editing of generated code. A companion paper presents a new AD system that handles both tasks without any manual intervention yet performs as well as or better than these existing systems.

Neuronal Predictions of Sparse Linear Representations

A striking feature of many sensory processing problems is that there appear to be many more neuro... more A striking feature of many sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have about 30,000 cochlear neurons, but at least a thousand times as many neurons in the auditory cortex. Such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise. We instead posit that they directly subserve computations of interest. We first review how sparse overcomplete linear representations can be used for source separation, using a particularly difficult case, the HRTF cue (the differential filtering imposed on a source by its path from its origin to the cochlea) as an example. We then explore some robust and generic predictions about neuronal representations that follow from taking sparse linear representations as a model of neuronal sensory processing.

Fusion of Functional Brain Imaging Modalities via Linear Programming

EEG/fMRI fusion algorithms attempt to construct a spatiotemporal estimate of neuronal activity us... more

Optimizing properties of the Jacobian of nonlinear feedforward systems

The implementation of oaklisp

Putting the Automatic Back into AD: Part II, Dynamic, Automatic, Nestable, and Fast (CVS: 1.1)

This paper discusses a new AD system that correctly and automatically accepts nested and dynamic ... more This paper discusses a new AD system that correctly and automatically accepts nested and dynamic use of the AD operators, without any manual intervention. The system is based on a new formulation of AD as highly generalized first-class citizens in a ń-calculus, which is briefly described. Because the ń-calculus is the basis for modern programminglanguage implementation techniques, integration of AD into the ń-calculus allows AD to be integrated into an aggressive compiler. We exhibit a research compiler which does this integration, and uses some novel analysis techniques to accept code involving free dynamic use of nested AD operators, yet performs as well as or better than the most aggressive existing AD systems.

Perturbation Confusion and Referential Transparency Correct Functional Implementation of Forward-Mode AD

It is tempting to incorporate differentiation operators into functional-programming languages. Ma... more It is tempting to incorporate differentiation operators into functional-programming languages. Making them first-class citizens, however, is an enterprise fraught with danger. We discuss a potential problem with forward-mode AD common to many AD systems, including all attempts to integrate a forward-mode AD operator into Haskell. In particular, we show how these implementations fail to preserve referential transparency, and can compute grossly incorrect results when the differentiation operator is applied to a function that itself uses that operator. The underlying cause of this problem is perturbation confusion, a failure to distinguish between distinct perturbations introduced by distinct invocations of the differentiation operator. We then discuss how perturbation confusion can be avoided.

Attention and Optimal Sensory Codes

Neuronal activity can be modulated by attention even while the sensory stimulus is held fixed. Th... more Neuronal activity can be modulated by attention even while the sensory stimulus is held fixed. This modulation implies changes in the tuning curve (or receptive field) of the neurons involved in sensory processing. We propose an information-theoretic hypothesis for the purpose of this modulation, and show using computer simulation that the similar modulation emerges in a system that is optimally encoding a sensory stimulus when the system is informed about the changing relevance of different features of the input. We present a simple model that learns a covert attention mechanism, given input patterns and tradeoff requirements. After optimization, the system gains the ability to reorganize its computational resources (or coding strategy) depending on the incoming covert attentional signal, using only threshold shifts in neurons throughout the network. The modulation of activity of the encoding units for different attentional states qualitatively matches that observed in animal selective attention experiments. Due to its generality, the model can be applied to any modality, and to any attentional goal.

Fusion of Functional Brain Imaging Modalities using L-Norms Signal Reconstruction

EEG/fMRI fusion algorithms attempt to construct a spatiotemporal estimate of neuronal activity us... more

Automatic Learning Rate Maximization in Large Adaptive Machines

Draft: Error Measures that make Outputs Probabilities

When training a backpropagation network with binary la-beled data, where the labels are generated... more When training a backpropagation network with binary la-beled data, where the labels are generated stochastically from an underlying probability associated with each pos-sible input, some error measures have the property that minimizing the error measure leads asymptotically to out-puts which correctly estimate the involved probabilities. Below, we derive a necessary and sufficient condition for an error measure to have this property, solve the condi-tion in general, and exhibit some families of such error measures.

Gradient Descent: Second-Order Momentum and Saturating Error

Batch gradient descent, Deltaw(t) = GammajdE=dw(t), converges to a minimum of quadratic form with... more Batch gradient descent, Deltaw(t) = GammajdE=dw(t), converges to a minimum of quadratic form with a time constant no better than 1 4 max= min where min and max are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term Deltaw(t) = GammajdE=dw(t) + ffDeltaw(t Gamma 1) improves this to 1 4 p max = min , although only in the batch case. Here we show that secondorder momentum, Deltaw(t) = GammajdE=dw(t) + ffDeltaw(t Gamma 1) + fi Deltaw(t Gamma 2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics. 1 INTRODUCTION Gradient descent is the bread-and-butter optimization technique in neural networks. Some people build special purpose hardware to accelerate gradient descent optimization...

Relating Egomotion and Image Evolution

By considering the dynamics of the apparent motion of a stationary object relative to a moving ob... more By considering the dynamics of the apparent motion of a stationary object relative to a moving observer, we construct a partial differential equation that relates the changes in an image to the motion of the observer. These come in two varieties: a first order system that describes the coevolution of the egocentric radial distances to objects and the visual scene, and a second order system that does not involve any distances or other geometry. The later equation leads, via the calculus of variations, to a novel technique for recovering egomotion from image sequences, a so-called visual yaw detector, which is tested on real data. For expository purposes the derivation is carried out in two dimensions, but the approach extends immediately to three. Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 1. Introduction Using a special camera mounted on the roof of a motorcar which gives a narrow 360 degree strip along the horizon, we are interested in recovering th...

Using Backpropagation with Temporal Windows to Learn the Dynamics of the CMU Direct-Drive Arm II

Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors

Advances in neural information processing systems

We propose a very simple, and well principled way o f computing the optimal step size in gradient... more We propose a very simple, and well principled way o f computing the optimal step size in gradient descent algorithms. The on-line version is very e cient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative m atrix (Hessian), which does not require to even calculate the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters.

Efficient Computation of Sparse Elements of the Inverse of a Sparse Near-Tridiagonal Matrix with Application to the Nerve Equation

Standard algorithms for computing the inverse of a tridiagonal matrix (or more generally, any Hin... more Standard algorithms for computing the inverse of a tridiagonal matrix (or more generally, any Hines matrix) compute the entire inverse, which is not sparse. For some problems, only the elements of the inverse at locations corresponding to nonzero elements in the original matrix are required. We present an algorithm that efficiently computes only these elements in O(n) time and memory. This algorithm is useful in solving discretized systems of partial differential equations that arise when computing electrical flow along a branched structure, such as a neuron's dendritic arbor. 1 Introduction The electrical parameters and connectivity in branched RC networks define a sparse matrix B which has nonzero elements only at locations that correspond to electrical connections. This sparseness can be exploited to compute efficiently the distribution of potential in such networks (Hines, 1984; Mascagni, 1989). Some applications, however, require the transfer impedance matrix, K = B Gamma1 ...

Blind Source Separation of Neuromagnetic Responses

Magnetoencephalography (MEG) is a functional brain imaging technique with millisecond temporal re... more Magnetoencephalography (MEG) is a functional brain imaging technique with millisecond temporal resolution and millimeter spatial resolution. The high temporal resolution of MEG compared to fMRI and PET (milliseconds vs. seconds and tens of seconds) makes it ideal for measuring the precise time of neuronal responses, thereby oering a powerful tool for studying temporal dynamics. We applied blind source separation (BSS) to continuous 122-channel human magnetoencephalographic data from two subjects and ve tasks. We demonstrate that without using any domain specic knowledge and without making the common assumption of single or multiple current dipole sources, BSS is capable of separating non-neuronal noise sources from neuronal responses and also of separating neuronal responses from dierent sensory modalities, and from dierent processing stages within a given modality. Key words: functional brain imaging; ICA; MEG; blind source separation. 1 Introduction The brain's neuromagnetic ...

The discriminative power of a dynamical model neuron