Hebbian Deep Learning: SoftHebb Algorithm
Abstract
Recent approximations to backpropagation (BP) have mitigated many of
BP’s computational inefficiencies and incompatibilities with biology, but
BP requires the error signal, which is computed at a different point in time and elsewhere in the network, i.e. at the output. That makes BP non-local in space and time, which is a critical discrepancy from the locality that is generally believed to govern biological synaptic plasticity (Baldi et al., 2017). This non-locality implies further computational inefficiencies. Specifically, forward-passing variables must be memorized, which increases memory requirements (Löwe et al., 2019). Moreover, additional backward signals must be computed and propagated, which increases operations and electrical currents. It is noteworthy that these aspects limit not only future neuromorphic technologies, but even the hardware foundation of today's DL, i.e. graphics processing units (GPUs), which have their own constraints in memory and FLOPS.
Update locking. The error credited by BP to a synapse can only be computed after the
information has propagated forward and then backward through the entire network. The
weight updates are therefore time-locked to these delays (Czarnecki et al., 2017; Jaderberg
et al., 2017; Frenkel et al., 2021). This slows down learning, so that training examples cannot be presented faster than one full forward and backward propagation through the network. Besides this important practical limitation of BP for DL, it also does not
appear plausible that multiple distant neurons in the brain coordinate their processing and
learning operations with such precision in time, nor that the brain can only learn from slow
successions of training examples.
Global loss function. BP is commonly applied in the supervised setting, where humans
provide descriptive labels of training examples. This is a costly process, thus supervised
BP cannot exploit most of the available data, which is unlabelled. In addition, it does
not explain how humans or animals can learn without supervisors. As a result, significant
research effort has been dedicated to techniques for learning without labels, with increasing
success recently, especially from self-supervised learning (SSL) (Chen et al., 2020; Mitrovic
et al., 2020; Lee et al., 2021; Tomasev et al., 2022; Scherr et al., 2022). In SSL, BP can also
use certain supervisory signals generated by the model itself as a global error. Therefore,
while BP does not require labels per se, it does require top-down supervision in the form
of a global loss function. The drawback of this is that learning then becomes specialized
to the particular task that is explicitly defined by the minimization of the loss function, as
Hopfield, 2019; Grinberg et al., 2019). However, such plasticity matched the assumptions
of a hard WTA, as opposed to SoftHebb’s distributed activation, and involved additional
hyperparameters. Here we introduce a new, simple form of anti-Hebbian plasticity for soft WTA networks, which negates SoftHebb's weight update (Equation (2)) in all neurons except the maximally activated one.
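For illustration, a minimal sketch of this soft anti-Hebbian update follows. It assumes the SoftHebb rule Δw_i = y_i (x − u_i w_i) of Moraitis et al. (2021) for Equation (2); the function name, tensor shapes, and temperature argument are our own choices, not the paper's code.

```python
import torch

def soft_antihebbian_update(x, weights, lr, temperature=1.0):
    """One weight update with soft anti-Hebbian plasticity (sketch).

    x: input vector of shape (n_in,); weights: matrix of shape (n_neurons, n_in).
    Assumes SoftHebb's rule dw_i = y_i * (x - u_i * w_i); the sign flip for all
    neurons except the most active one is the anti-Hebbian term described above.
    """
    u = weights @ x                            # pre-activations u_i = w_i . x
    y = torch.softmax(u / temperature, dim=0)  # soft WTA activations
    delta = y.unsqueeze(1) * (x.unsqueeze(0) - u.unsqueeze(1) * weights)
    sign = -torch.ones_like(u)                 # anti-Hebbian (negated) by default...
    sign[torch.argmax(u)] = 1.0                # ...Hebbian only for the winner
    return weights + lr * sign.unsqueeze(1) * delta
```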
Convolutions. Towards a multilayer architecture, and to represent input information
of each layer in a more distributed manner, we used a localized representation through
convolutional kernels. The plasticity rule is readily transferable to such an architecture.
Convolution can be viewed as a data augmentation, where the inputs are no longer the
original images but are rather cropped into smaller patches that are presented to a fully
connected SoftHebb network. Convolution with weight sharing between patches is efficient
for parallel computing platforms like GPUs, but in its literal sense it is not biologically
plausible. However, this does not fundamentally affect the plausibility of convolutions,
because the weights between neurons with different localized receptive fields can become
matching through biologically plausible rules (Pogodin et al., 2021).
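Since convolution here amounts to presenting image patches to a fully connected SoftHebb layer, the patch extraction can be sketched with PyTorch's unfold; the kernel size and stride below are illustrative values, not the paper's settings.

```python
import torch.nn.functional as F

def images_to_patches(images, kernel_size=5, stride=1):
    """View a convolutional SoftHebb layer as a fully connected one applied to
    image patches. `images` has shape (B, C, H, W); the result has shape
    (B * n_patches, C * kernel_size**2), i.e. each row is one input to the
    fully connected SoftHebb network."""
    patches = F.unfold(images, kernel_size=kernel_size, stride=stride)  # (B, C*k*k, n_patches)
    patches = patches.permute(0, 2, 1)                                  # (B, n_patches, C*k*k)
    return patches.reshape(-1, patches.shape[-1])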
Alternative activations for forward propagation. In addition to the softmax involved
in the plasticity rule, different activation functions can be considered for propagation to each
subsequent layer. In biology, this dual type of activation may be implemented by multiplexing
overlapping temporal or rate codes of spiking neurons, which have been studied and modelled
extensively (Naud et al., 2008; Kayser et al., 2009; Akam and Kullmann, 2014; Herzfeld et al.,
2015; Moraitis et al., 2018; Payeur et al., 2021). We settled on a combination of the rectified polynomial unit (RePU) (Krotov and Hopfield, 2016; 2019) with the Triangle activation (Appendix A.3.1), which applies lateral inhibition by subtracting the layer's mean activity. These perform well (Coates et al., 2011; Miconi, 2021) and offer a tunable parametrization.
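A minimal sketch of one plausible reading of this forward activation, combining the Triangle lateral inhibition with a RePU exponent; the channel-wise mean and the function signature are our assumptions.

```python
import torch

def triangle_repu(u, power=1.0):
    """Triangle-style activation: lateral inhibition by subtracting the layer's
    mean activity, followed by RePU rectification with a tunable exponent.
    `u` has shape (B, C, H, W); the mean is taken over the channel dimension."""
    u = u - u.mean(dim=1, keepdim=True)   # subtract the layer's mean activity
    return torch.relu(u).pow(power)       # RePU: rectified polynomial unit
```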
Weight-norm-dependent adaptive learning rate. We introduce a per-neuron adaptive learning rate scheme that stabilizes to zero as the neuron weight vectors converge to a sphere of radius 1, and is initially large when the weight vectors' norms are large compared to 1: $\eta_i = \eta \cdot (r_i - 1)^q$, where $r_i$ is the norm of neuron $i$'s weight vector and $q$ is a power hyperparameter. This per-neuron adaptation based on the weights remains a local operation, and is reminiscent of another important adaptive learning rate scheme that is individualized per synapse, has biological and theoretical foundations, and speeds up learning (Aitchison, 2020; Aitchison et al., 2021). Ours is arguably simpler, and its relevance is that it increases robustness to hyperparameters and initializations and, combined with the Bayesian nature of SoftHebb (Section 2), speeds up learning so that a single learning epoch suffices (Section 4).
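A per-neuron sketch of this adaptive learning rate follows; the abs() guard for norms below 1 is our own assumption, since the text describes initial weight norms above 1.

```python
import torch

def adaptive_lr(weights, base_lr, q):
    """Per-neuron adaptive learning rate eta_i = eta * (r_i - 1)**q, where r_i
    is the L2 norm of neuron i's weight vector: large when the initial norms
    are far above 1, vanishing as r_i approaches 1."""
    r = weights.norm(dim=1)                   # per-neuron weight-vector norms r_i
    return base_lr * (r - 1.0).abs().pow(q)   # abs() guards norms below 1 (our assumption)
```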
Width scaling. Each new layer halves the image resolution in each dimension by a pooling
operation, while the layer width, i.e. the number of convolutional neurons, is multiplied by a
“width factor” hyperparameter. Our reported benchmarks used a factor of 4.
Figure 2: Example SoftHebb receptive fields, learned from STL-10. More in Appendix B.6.
The most crucial elements for deep representation learning are the soft competition
and the corresponding Hebbian plasticity rule that underpin SoftHebb (Figures 3 and B.2),
the similarly soft anti-Hebbian plasticity that we introduced (Fig. B.2), the convolutional
neurons, and the width scaling architecture that involves a depth-wise diminishing output
resolution (Fig. 4). The adaptive learning rate significantly speeds up the training (Fig. B.3B),
such that we only use a single unsupervised learning epoch for the deep network. The
specific activation function and its tunable parametrization are less crucial but do improve
performance (Appendix B). We arrived at this novel setup grounded in Moraitis et al. (2021) and through a literature- and intuition-guided search of possible additions.
4 Results
Summary of experimental protocol. The first layer used 96 convolutional neurons to
match related works (Fig. 1), except for our ImageNet experiments, which used 48 units. The
width of the subsequent layers was determined by the width factor (see previous section).
Unsupervised Hebbian learning received only one presentation of the training set, i.e. one epoch.
Each layer was fully trained and frozen before the next one, a common approach known as
greedy layer-wise training in such local learning schemes (Bengio et al., 2006; Tavanaei and
Maida, 2016; Löwe et al., 2019). Batch normalization (Ioffe and Szegedy, 2015) was used,
with its standard initial parameters (γ = 1, β = 0), which we did not train. Subsequently,
a linear classifier head was trained with cross-entropy loss, using dropout regularization and mini-batches of 64 examples, for 50, 50, 100, and 200 supervised epochs for MNIST, CIFAR-10, STL-10, and ImageNet respectively. We used an NVIDIA Tesla V100 32GB GPU. All details of the experimental methods are provided in Appendix A, and control experiments, including the hyperparameters' impact, in Appendix B.
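A sketch of this greedy layer-wise protocol, where each layer is trained with local updates and frozen before the next; the function and its arguments are placeholders for illustration, not the paper's code.

```python
import torch

def greedy_layerwise_train(layers, dataset, hebbian_update):
    """Sketch of greedy layer-wise training: each layer is trained with its
    local plasticity rule and frozen before the next one. `layers` is a list
    of callables and `hebbian_update(layer, x)` applies the local SoftHebb
    update; both are placeholders."""
    for layer_index, layer in enumerate(layers):
        for x in dataset:                           # one unsupervised epoch per layer
            with torch.no_grad():
                for frozen in layers[:layer_index]:
                    x = frozen(x)                   # forward through frozen, trained layers
            hebbian_update(layer, x)                # local update; no error backpropagation
    return layers
```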
Fully connected baselines. The work of Moraitis et al. (2021) presented SoftHebb
mainly through theoretical analysis. Experiments showed interesting generative and
Bayesian properties of these networks, such as high learning speed and adversarial ro-
bustness. Reported accuracies focused on fully connected single-hidden-layer networks,
showing that it was well applicable to MNIST and Fashion-MNIST datasets reaching ac-
curacies of (96.94 ± 0.15)% and (75.14 ± 0.17)%, respectively, using 2000 hidden neurons.
Starting from that work, we found that when moving to more complex datasets such as CIFAR-10 or STL-10, SoftHebb performance was not competitive. Specifically, the same shallow fully-connected network's accuracies reached (43.9 ± 0.18)% and (36.9 ± 0.19)% respectively. Compared to BP's (55.7 ± 0.13)% and (50.0 ± 0.16)%, this suggested that single-hidden-layer networks were insufficient for the extraction of meaningful features and the separation of input classes. We then stacked two such fully connected Hebbian WTA layers. This network actually performed worse, reaching (32.9 ± 0.22)% and (31.5 ± 0.20)% on these tasks.
Convolutional baselines. Recent research has applied Hebbian plasticity to convolutional hard-WTA neural networks (CNNs) (Miconi, 2021; Lagani et al., 2021; Amato et al., 2019). However, it has not achieved significant, if any, improvement through the addition of layers (Fig. 1, green curves). In our control experiments, we found that these networks with the plasticity rules from the literature do not learn helpful features, as the fixed random weights performed better than the learned ones, also in agreement with results from Miconi (2021). Indeed, we find that the features learned by such hard-WTA networks are simple Gabor-like filters in the first layer (Fig. B.5A) and in deeper ones (see also Miconi (2021)).
Figure 3: Depth-wise performance for various training setups and for untrained random weights, in 4 datasets. Number of hidden layers is indicated. (A) CIFAR-10, (B) CIFAR-100, (C) STL-10, (D) ImageNette.
A new learning regime. One way to learn more complex features is by adding an anti-
Hebbian term to the plasticity of WTA networks (Krotov and Hopfield, 2019; Grinberg
et al., 2019). Notably, this method was previously tested only with hard-WTA networks
and their associated plasticity rules and not with the recent SoftHebb model and plasticity.
In those cases, anti-Hebbian plasticity was applied to the k-th most active neuron. Here,
we introduced a new, soft type of anti-Hebbian plasticity (see Section 3). We studied its effect first by proxy of the number of “R1” features (Moraitis et al., 2021), i.e. weight vectors that, according to their norm, lie on the unit sphere. Simple Gabor-like filters emerge
without anti-Hebbian terms in the plasticity (Fig. B.5A) and are R1. Even with anti-Hebbian
plasticity, in hard WTA networks, the learned features are R1 (Krotov and Hopfield, 2019).
In the case of SoftHebb with our soft type of anti-Hebbian plasticity, we observed that less
standard, i.e. non-R1, features emerge (Fig. B.1 & B.5B). By measuring the accuracy of
the SoftHebb network while varying the temperature, we discovered a regime around τ = 1
(Fig. B.1) where R1 and non-R1 features co-exist, and accuracy is highest. This regime only
emerges with SoftHebb. For example, on CIFAR-10 with a single convolutional layer and an
added linear classifier, SoftHebb accuracy (71.10 ± 0.06)% significantly outperformed hard
WTA (62.69 ± 0.47)% and random weights (63.93 ± 0.22)%, almost reaching the accuracy
of BP (72.42 ± 0.24)% on the same two-layer network trained end-to-end.
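For reference, the R1 proxy used above can be computed directly from the weights; the tolerance below is an illustrative choice of ours, not a value from the paper.

```python
import torch

def r1_fraction(weights, tol=0.05):
    """Fraction of "R1" features, i.e. weight vectors whose L2 norm is within
    `tol` of 1 (the tolerance is an illustrative assumption)."""
    r = weights.norm(dim=1)               # per-neuron weight-vector norms
    return ((r - 1.0).abs() < tol).float().mean().item()
```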
Convergence in a single epoch. Adaptive learning rate. By studying convergence through R1 features and by experimenting with learning rates, we conceived a learning rate that adapts to the norm of each neuron's weight vector (Section 3). We found
that it also speeds up learning compared to more conventional learning schedules (Fig. B.3B)
with respect to the number of training iterations. The extent of the speed-up is such that we
only needed to present the training dataset once, before evaluating the network’s performance.
All results we report on SoftHebb are in fact after just one epoch of unsupervised learning.
The speed-up is in agreement with observations of Moraitis et al. (2021) that attributed the
speed to SoftHebb’s Bayesian nature. Moreover, we found that with our adaptive learning
rate, convergence is robust to the initial conditions, to which such unsupervised learning
schemes are usually highly sensitive (Fig. B.3A & B.3B).
The architecture's impact. The multilayer architecture uses a pooling operation of stride 2, which halves each dimension of the resolution after each layer. We stop adding layers when the output resolution becomes at most 4 × 4. The layer where this occurs depends on the original input's resolution. Thus, the multilayer network has three hidden convolutional layers for MNIST or CIFAR-10, four layers for STL-10, and five layers for ImageNet at a resolution setting of 160 × 160 px. We used four times more neurons in each layer than in the previous layer. This architecture on its own, with random initial weights, shows inconsistent increases in classification accuracy, up to a variable depth (Fig. 3). The performance increases caused by the addition of random layers seem surprising but are consistent with the literature, where random weights often outperform biologically-plausible learning rules (Miconi, 2021; Frenkel et al., 2021). Indeed, by using the more common hard-WTA approach to train such a network, performance not only deteriorated compared to random weights but also failed to increase with added depth (Fig. 3).
Figure 4: CIFAR-10 layer-wise performance of SoftHebb, for different width factors. SoftHebb enables depth-scaling when the width of deep layers scales sufficiently. Factors (1x, 2x, or 4x) indicate the layer-wise increase in the number of neurons.
Classification accuracy (see Tables 1 & 2). We report that with SoftHebb and the techniques we have described, we achieve Deep Learning with up to 5 hidden layers. For example, layer-wise accuracy increases on CIFAR-10 are visible in Fig. 1. Learning does occur, and the layer-wise accuracy improvement is not merely due to the architecture choice. That is testified, first, by the fact that the weights do change and the receptive fields that emerge are meaningful (Fig. 5). Second, and more concretely, for the end-point task of classification, accuracy improves significantly compared to the untrained random weights, and this is true in all datasets (Fig. 3). SoftHebb achieves test accuracies of (99.35 ± 0.03)%, (80.31 ± 0.14)%, (76.23 ± 0.19)%, 27.3% and (80.98 ± 0.43)% on MNIST, CIFAR-10, STL-10, the full ImageNet, and ImageNette. We also evaluated the same networks trained in a
fully label-supervised manner with end-to-end BP on all datasets. Due to BP’s resource
demands, we could not compare with BP directly on ImageNet but we did apply BP to the
ImageNette subset. The resulting accuracies are not too distant from SoftHebb’s (Fig. 3).
The BP-trained network reaches (99.45 ± 0.02)%, (83.97 ± 0.07)%, (74.51 ± 0.36)% and
(85.30 ± 0.45)% on MNIST, CIFAR-10, STL-10, and ImageNette respectively. Notably, this
performance is very competitive (see Tables 1 & 2).
Figure 5: (A, B) UMAP projections of example classes (planes, birds) after Layer 1 and Layer 4; (C, D) images and patches that best activate example neurons (p–t) of Layer 1 and Layer 4 (see also Figs. B.6 & B.7).
Table 2: STL-10 & ImageNet top-1 accuracy (%) of un- or self-supervised (blue frame) &
partly bio-plausible networks (green frame). Bold indicates the best-performing biologically-
plausible row, i.e. SoftHebb. SoftHebb’s unsupervised learning only involved 1 epoch.
Learning method | Network | STL-10 | ImageNet | Reference
SimCLR (100 epochs) | ResNet-18 | 86.3 | 30.0 | Chen et al. 2020 (our repr.)
Greedy InfoMax | ResNet-50 | 81.9 | n.a. | Löwe et al. 2019
None (Random chance) | None | 10.0 | 0.1 | Chance
Biologically plausible:
None (Random weights) | SoftHebb | 68.2 | 14.0 | Ours
Hebbian | Hard WTA | 54.8 | n.a. | Ours
SoftHebb (1 epoch) | SoftHebb | 76.2 | 27.3 | Ours
CLAPP | VGG-6 | 73.6 | n.a. | Illing et al. 2021
LPL | VGG-11 | 61.9 | n.a. | Halvagal and Zenke 2022
K-means | K-means | 74.1 | n.a. | Dundar et al. 2015
Feedback Alignment | 5-layer CNN | n.a. | 6.9 | Bartunov et al. 2018
Direct Feedback Alignment | AlexNet | n.a. | 6.2 | Crafton et al. 2019
Single Sparse DFA | AlexNet | n.a. | 2.8 | Crafton et al. 2019
5 Discussion
SoftHebb's accuracy and applicability in difficult tasks challenge several other biologically-constrained DL algorithms. Arguably it is also a highly biologically plausible and computationally efficient method, as it is free of weight transport, non-local plasticity, and time-locking of weight updates, and it is fully unsupervised. It is also founded on physiological experimental observations in cortical circuits, such as Hebbian plasticity and WTA structure.
Importantly, such Hebbian WTA networks enable non-von Neumann neuromorphic learning
chips (Qiao et al., 2015; Kreiser et al., 2017; Sebastian et al., 2020; Indiveri, 2021; Sarwat
et al., 2022a). That is an extremely efficient emerging computing technology, and SoftHebb
makes high performance with such hardware more likely. The algorithm is applicable in
tasks such as MNIST, CIFAR-10, STL-10 and even ImageNet where other algorithms with
similar goals were either not applicable or have underperformed SoftHebb (Fig. 1, Table 1,
Table 2, & Bartunov et al. (2018)). This is despite the fact that most alternatives only
address subsets of SoftHebb’s goals of efficiency and plausibility (Section 2 & Table 1). Löwe
et al. (2019) and Burstprop (Payeur et al., 2021) results on STL-10 and ImageNet are not
included in Table 2, because the ResNet-50 of Löwe et al. (2019) used standard BP through
modules of at least 15 layers, and because Payeur et al. (2021) did not report ImageNet top-1
accuracy. SoftHebb did outperform Burstprop and its successor BurstCCN (Greedy et al.,
2022) on CIFAR-10 (Table 1). Beyond neural networks, K-means has also been applied to CIFAR-10 (Coates et al., 2011), albeit without successful stacking of K-means “layers”.
From the perspective of neuroscience, our results suggest that Deep Learning up to a few
layers may be plausible in the brain not only with approximations of BP (Payeur et al.,
2021; Illing et al., 2021; Greedy et al., 2022), but also with radically different approaches.
Nevertheless, to maximize applicability, biological details such as spiking neurons were
avoided in our simulations. In a ML context, our work has important limitations that
should be noted. For example, we have tested SoftHebb only in computer vision tasks. In
addition, it is unclear how to apply SoftHebb to generic deep network architectures, because
thus far we have only used specifically width-scaled convolutional networks. Furthermore, our
deepest SoftHebb network has only 6 layers in the case of ImageNet, deeper than most bio-
plausible approaches (see Table 1), but limited. As a consequence, SoftHebb cannot compete
with the true state of the art in ML (see e.g. ResNet-50 SimCLR result in Table 2). Such
networks have been termed “very deep” (Simonyan and Zisserman, 2014) and “extremely deep” (He et al., 2016). This is distinct from the more general term “Deep Learning”, which was originally introduced for networks as shallow as ours (Hinton et al., 2006) and has continued to be used for such networks (see e.g. Frenkel et al. (2021) and Table 1); its hallmark is the hierarchical representation that appears to emerge in our study. We propose that SoftHebb warrants practical exploitation of its present advantages, research into its limitations, and a search for its possible physiological signatures in the brain.
Acknowledgments
This work was partially supported by the Science and Technology Innovation 2030 – Major
Project (Brain Science and Brain-Like Intelligence Technology) under Grant 2022ZD0208700.
The authors would like to thank Lukas Cavigelli, Renzo Andri, Édouard Carré, and the rest
of Huawei’s Von Neumann Lab, for offering compute resources. TM would like to thank
Yansong Chua, Alexander Simak, and Dmitry Toichkin for the discussions.
References
Aitchison, L. (2020). Bayesian filtering unifies adaptive and non-adaptive neural network
optimization methods. Advances in Neural Information Processing Systems, 33:18173–
18182.
Aitchison, L., Jegminat, J., Menendez, J. A., Pfister, J.-P., Pouget, A., and Latham, P. E.
(2021). Synaptic plasticity as Bayesian inference. Nature neuroscience, 24(4):565–571.
Akam, T. and Kullmann, D. M. (2014). Oscillatory multiplexing of population codes
for selective communication in the mammalian brain. Nature Reviews Neuroscience,
15(2):111–122.
Amato, G., Carrara, F., Falchi, F., Gennaro, C., and Lagani, G. (2019). Hebbian learning
meets deep convolutional neural networks. In International Conference on Image Analysis
and Processing, pages 324–334. Springer.
Baldi, P., Sadowski, P., and Lu, Z. (2017). Learning in the machine: The symmetries of the
deep learning channel. Neural Networks, 95:110–133.
Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. (2018). As-
sessing the scalability of biologically-motivated deep learning algorithms and architectures.
Advances in neural information processing systems, 31.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training
of deep networks. Advances in neural information processing systems, 19.
Binzegger, T., Douglas, R. J., and Martin, K. A. (2004). A quantitative map of the circuit
of cat primary visual cortex. Journal of Neuroscience, 24(39):8441–8453.
Binzegger, T., Douglas, R. J., and Martin, K. A. (2009). Topology and dynamics of the
canonical circuit of cat V1. Neural Networks, 22(8):1071–1078.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. (2020). Unsuper-
vised learning of visual features by contrasting cluster assignments. Advances in Neural
Information Processing Systems, 33:9912–9924.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for
contrastive learning of visual representations. In International conference on machine
learning, pages 1597–1607. PMLR.
Coates, A., Ng, A., and Lee, H. (2011). An analysis of single-layer networks in unsupervised
feature learning. In Proceedings of the fourteenth international conference on artificial
intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings.
Crafton, B., West, M., Basnet, P., Vogel, E., and Raychowdhury, A. (2019). Local learning
in rram neural networks with sparse direct feedback alignment. In 2019 IEEE/ACM
International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6.
IEEE.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203):129–132.
Czarnecki, W. M., Świrszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K.
(2017). Understanding synthetic gradients and decoupled neural interfaces. In International
Conference on Machine Learning, pages 904–912. PMLR.
Diehl, P. U. and Cook, M. (2015). Unsupervised learning of digit recognition using spike-
timing-dependent plasticity. Frontiers in computational neuroscience, 9:99.
Douglas, R. J. and Martin, K. A. (2004). Neuronal circuits of the neocortex. Annu. Rev.
Neurosci., 27:419–451.
Douglas, R. J., Martin, K. A., and Whitteridge, D. (1989). A canonical microcircuit for
neocortex. Neural computation, 1(4):480–488.
Dundar, A., Jin, J., and Culurciello, E. (2015). Convolutional clustering for unsupervised
learning. arXiv preprint arXiv:1511.06241.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. (2009). Visualizing higher-layer features
of a deep network. University of Montreal, 1341(3):1.
Feldman, D. E. (2012). The spike-timing dependence of plasticity. Neuron, 75(4):556–571.
Frenkel, C., Lefebvre, M., and Bol, D. (2021). Learning without feedback: Fixed random
learning signals allow for feedforward training of deep neural networks. Frontiers in
neuroscience, page 20.
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial
examples. arXiv preprint arXiv:1412.6572.
Greedy, W., Zhu, H. W., Pemberton, J., Mellor, J., and Costa, R. P. (2022). Single-phase
deep learning in cortico-cortical networks. arXiv preprint arXiv:2206.11769.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C.,
Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a
new approach to self-supervised learning. Advances in neural information processing
systems, 33:21271–21284.
Grinberg, L., Hopfield, J., and Krotov, D. (2019). Local unsupervised learning for image
analysis. arXiv preprint arXiv:1908.08993.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance.
Cognitive science, 11(1):23–63.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an
invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE.
Halvagal, M. S. and Zenke, F. (2022). The combination of Hebbian and predictive plasticity
learns invariant object representations in deep sensory networks. bioRxiv.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsu-
pervised visual representation learning. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 9729–9738.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778.
Herzfeld, D. J., Kojima, Y., Soetedjo, R., and Shadmehr, R. (2015). Encoding of action by
the purkinje cells of the cerebellum. Nature, 526(7573):439–442.
Hinton, G. (2022). The forward-forward algorithm: Some preliminary investigations. arXiv
preprint arXiv:2212.13345.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7):1527–1554.
Illing, B., Ventura, J., Bellec, G., and Gerstner, W. (2021). Local plasticity rules can learn
deep representations using self-supervised contrastive predictions. Advances in Neural
Information Processing Systems, 34.
Miconi, T. (2021). Multi-layer Hebbian networks with modern deep learning frameworks.
arXiv preprint arXiv:2107.01729.
Millidge, B., Tschantz, A., and Buckley, C. L. (2020). Predictive coding approximates
backprop along arbitrary computation graphs. arXiv preprint arXiv:2006.04182.
Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., and Blundell, C. (2020). Representation
learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922.
Moraitis, T., Sebastian, A., and Eleftheriou, E. (2018). Spiking neural networks enable
two-dimensional neurons and unsupervised multi-timescale learning. In 2018 International
Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
Moraitis, T., Sebastian, A., and Eleftheriou, E. (2020). Short-term synaptic plasticity
optimally models continuous environments.
Moraitis, T., Toichkin, D., Chua, Y., and Guo, Q. (2021). SoftHebb: Bayesian inference in unsupervised Hebbian soft winner-take-all networks. arXiv preprint arXiv:2107.05747.
Naud, R., Marcille, N., Clopath, C., and Gerstner, W. (2008). Firing patterns in the adaptive
exponential integrate-and-fire model. Biological cybernetics, 99(4):335–347.
Nessler, B., Pfeiffer, M., Buesing, L., and Maass, W. (2013). Bayesian computation emerges
in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS computa-
tional biology, 9(4):e1003037.
Nessler, B., Pfeiffer, M., and Maass, W. (2009). STDP enables spiking neurons to detect hidden
causes of their inputs. Advances in neural information processing systems, 22:1357–1365.
Nguyen, A., Yosinski, J., and Clune, J. (2015). Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 427–436.
Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks.
Advances in neural information processing systems, 29.
Nøkland, A. and Eidnes, L. H. (2019). Training neural networks with local error signals. In
International conference on machine learning, pages 4839–4850. PMLR.
Payeur, A., Guerguiev, J., Zenke, F., Richards, B. A., and Naud, R. (2021). Burst-dependent
synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience,
24(7):1010–1019.
Pogodin, R., Mehta, Y., Lillicrap, T. P., and Latham, P. E. (2021). Towards biologically
plausible convolutional networks. arXiv preprint arXiv:2106.13031.
Qiao, N., Mostafa, H., Corradi, F., Osswald, M., Stefanini, F., Sumislawska, D., and Indiveri,
G. (2015). A reconfigurable on-line learning spiking neuromorphic processor comprising
256 neurons and 128k synapses. Frontiers in neuroscience, 9:141.
Rauber, J., Zimmermann, R., Bethge, M., and Brendel, W. (2020). Foolbox native: Fast
adversarial attacks to benchmark the robustness of machine learning models in pytorch,
tensorflow, and jax. Journal of Open Source Software, 5(53):2607.
Rodriguez, H. G., Guo, Q., and Moraitis, T. (2022). Short-term plasticity neurons learning
to learn and forget. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and
Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning,
volume 162 of Proceedings of Machine Learning Research, pages 18704–18722. PMLR.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward
neural network. Neural networks, 2(6):459–473.
Sarwat, S. G., Kersting, B., Moraitis, T., Jonnalagadda, V. P., and Sebastian, A. (2022a).
Phase-change memtransistive synapses for mixed-plasticity neural computations. Nature
Nanotechnology, pages 1–7.
Sarwat, S. G., Moraitis, T., Wright, C. D., and Bhaskaran, H. (2022b). Chalcogenide
optomemristors for multi-factor neuromorphic computation. Nature communications,
13(1):1–9.
Scellier, B. and Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between
energy-based models and backpropagation. Frontiers in computational neuroscience, 11:24.
Scherr, F., Guo, Q., and Moraitis, T. (2022). Self-supervised learning through efference
copies.
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R., and Eleftheriou, E. (2020). Memory
devices and applications for in-memory computing. Nature nanotechnology, 15(7):529–544.
Sejnowski, T. J. (2020). The unreasonable effectiveness of deep learning in artificial intelli-
gence. Proceedings of the National Academy of Sciences, 117(48):30033–30038.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
Sjöström, P. J., Turrigiano, G. G., and Nelson, S. B. (2001). Rate, timing, and cooperativity
jointly determine cortical synaptic plasticity. Neuron, 32(6):1149–1164.
Stuhr, B. and Brauer, J. (2019). Csnns: Unsupervised, backpropagation-free convolutional
neural networks for representation learning. In 2019 18th IEEE International Conference
On Machine Learning And Applications (ICMLA), pages 1613–1620. IEEE.
Tavanaei, A. and Maida, A. S. (2016). Bio-inspired spiking convolutional neural network
using layer-wise sparse coding and STDP learning. arXiv preprint arXiv:1611.03000.
Tomasev, N., Bica, I., McWilliams, B., Buesing, L., Pascanu, R., Blundell, C., and Mitrovic,
J. (2022). Pushing the limits of self-supervised resnets: Can we outperform supervised
learning without labels on imagenet? arXiv preprint arXiv:2201.05119.
Von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate
cortex. Kybernetik, 14(2):85–100.
Determining the initial radius from the weight distribution parameters: We can
then derive the distribution parameters from the optimal initial radius using the distribution
moment calculation.
$$R_i = \mathbb{E}\left(\sqrt{\sum_{j=1}^{N} W_{ij}^2}\right) = \mathbb{E}\left(\sqrt{\sum_{j=1}^{N} w^2}\right) = \mathbb{E}\left(\sqrt{N \cdot w^2}\right) = \mathbb{E}\left(\sqrt{N} \cdot |w|\right) = \sqrt{N} \cdot \mathbb{E}(|w|) \qquad (4)$$
where $i$ is the index of a neuron, $j$ is the index of this neuron's synapses, $N$ is the number of synapses of that neuron, $\mathbb{E}(\cdot)$ is the expected value, and so $\mathbb{E}(|w|)$ is the first absolute moment of the distribution. Thus, for the normal distribution $\mathbb{E}(|w|) = \sigma\sqrt{2/\pi} \Rightarrow \sigma = R\sqrt{\pi/(2N)}$, and for a positive distribution $\mathbb{E}(|w|) = \mathbb{E}(w) = \mathrm{range}/2 \Rightarrow \mathrm{range} = 2R/\sqrt{N}$.
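For concreteness, a small sketch that inverts Equation (4) to initialize weights with a desired initial radius R; the function name and the choice of a uniform distribution for the positive case are our assumptions.

```python
import math
import torch

def init_weights_with_radius(n_neurons, n_synapses, radius, distribution="normal"):
    """Initialize a weight matrix so each neuron's expected weight-vector norm
    is approximately `radius`, following Equation (4): R = sqrt(N) * E(|w|)."""
    if distribution == "normal":
        sigma = radius * math.sqrt(math.pi / (2 * n_synapses))   # E(|w|) = sigma * sqrt(2/pi)
        return torch.randn(n_neurons, n_synapses) * sigma
    # positive (here uniform) distribution on [0, range]: E(|w|) = E(w) = range / 2
    rng = 2 * radius / math.sqrt(n_synapses)
    return torch.rand(n_neurons, n_synapses) * rng
```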
Learning rates that decay linearly with the number of training examples have been extensively
used in Hebbian learning (Krotov and Hopfield, 2019; Grinberg et al., 2019; Miconi, 2021;
Lagani et al., 2021; Amato et al., 2019). It is a simple scheduler, reaching convergence
with a sufficient amount of training example. The linear decay ties the learning rate’s value
throughout learning to the proportion of the training examples that have been seen, and
it does so uniformly across neurons. However, the weights may theoretically be able to
converge before the full training set’s presentation, and they may do so at different stages
across neurons. To address this, we tied the learning rate $\eta_i$ to each neuron $i$'s convergence.
Convergence was assessed by the norm ri of the neuron’s weights. Based on previous work
on similar learning rules (Oja, 1982; Krotov and Hopfield, 2019) including SoftHebb itself
(Moraitis et al., 2021), and on our own new observations, the convergence of the neuronal
weights is associated with a convergence to a norm of 1, at least in the case of simple learned features. Therefore, we used a learning rate that stabilizes to zero as neuron weight vectors
least. Therefore, we used a learning rate that stabilizes to zero as neuron weight vectors
converge to a sphere of radius 1, and is initially big when the weight vectors are large
compared to 1:
$\eta_i = \eta \cdot (r_i - 1)^q, \qquad (5)$
where $q$ is a power hyperparameter. Note that this adaptivity has no explicit time-dependence, but as learning proceeds towards convergence, $\eta_i$ does decay with time. To account for the
case of complex features that do not converge to a norm of 1, a time-dependence can be
added on top, e.g. as usual with a linearly decaying factor, multiplying the norm-dependent
learning rate. In our experiments we only used the norm-dependence and simply stopped
learning after the first training epoch, i.e. the first presentation of the full training set in
most cases, or earlier. In practice, our adaptive rate reaches convergence faster (Fig. B.3B)
than the linear time-dependence, by maintaining a separate learning rate for each neuron
and adapting it based on each neuron’s own convergence status.
We systematically investigated the best set of hyperparameters at each hidden layer, based
on the validation accuracy of a linear classifier trained directly on top of that hidden layer.
All grid searches were performed on three different random seeds, varying the batch sampling
and the validation set (20% of the training dataset). The classifier is a simple linear classifier
with a dropout of 0.5 and no other regularisation term. For all searches and final results, we
used 96 kernels in the first layer. Subsequent layers scaled with fw = 4 (see Appendix A.5). However, based on our observations, only the optimal temperature depends on fw.
For each added layer, grid search was performed in three stages: For the first two stages we
used square convolutional kernels with a size of 5, a max-pooling with a square kernel of size
2, and a Triangle activation with a power of 1 as the forward activation function.
Table 3: Network architecture and hyper-parameters search, and best results on CIFAR-10.
More details are provided in section A.4.
1. An initial grid search over … ∈ {10, 100, 1000}, η ∈ {0.001, 0.004, 0.008, 0.01, 0.04, 0.08, 0.12}, q ∈ {0.25, 0.5, 0.75}, and 1/τ ∈ {0.25, 0.5, 0.75, 1, 2, 5, 10}.
2. A finer grid search over 1/τ ∈ {0.15, 0.25, 0.35, 0.5, 0.6, 0.75, 0.85, 1, 1.2, 1.5, 2} and conv kernel size ∈ {3, 5, 7, 9}, using the best result from the previous search.
3. A final grid search over the pooling: pool type ∈ {AvgPooling, MaxPooling}, pool kernel size ∈ {2, 3, 4}, and the activation function: function ∈ {RePU, Triangle, Softmax}, with power ∈ {0.1, 0.35, 0.7, 1, 1.4, 2, 5, 10} (for RePU or Triangle) and τ ∈ {0.1, 0.5, 1, 2, 5, 10, 50, 100} (for softmax), using the best result from the two previous searches.
The pooling of stride 2 halves each dimension of the resolution after each layer. We stop
adding layers when the output resolution becomes at most 4 × 4. The layer where this
occurs depends on the original input’s resolution. Thus, the multilayer network has three
convolutional layers for MNIST or CIFAR-10, four layers for STL-10, and five layers for
ImageNet at a resolution setting of 160 × 160 px (Table 4).
A width factor fw characterizes the multilayer network. fw links the width of each layer to
that of the previous layer, thus determining the depth-dependent architecture. Specifically,
the number of filters #Fl in the hidden layer l is fw times the number of filters at layer l − 1:
#Fl = fw · #Fl−1 . The first hidden layer has 96 filters in order to compare with Miconi
(2021); Lagani et al. (2021); Amato et al. (2019). We then explored, using CIFAR-10, the
impact of fw on performance. We tried three different values for fw ∈ {1, 2, 4}. A value of 1
keeps the same number of filters in all layers, while a value of 4 keeps the number of features
provided to the classifier head equal to the number of features at the input layer (due to
the pooling stride of 2 in each layer). A fw bigger than four would substantially increase
the size of the network, and is impractical for deeper networks. We found that to increase performance with depth, the network needs to also grow in width (Fig. 4).
Table 4: Network architecture (all pooling layers use a stride of 2). The number of channels is also defined, e.g. conv96 means 96 channels. More details can be found in Appendix A.5.
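The following sketch illustrates why fw = 4 keeps the per-layer feature count constant: stride-2 pooling divides the spatial area by 4 while the filter count grows by fw per layer. The CIFAR-10-like default values are illustrative, and convolution padding effects are ignored.

```python
def feature_counts(first_layer_filters=96, width_factor=4, n_layers=3,
                   input_resolution=32):
    """Per-layer feature counts under the width-scaling rule #F_l = fw * #F_{l-1}
    combined with stride-2 pooling after each layer (sketch)."""
    filters, resolution, totals = first_layer_filters, input_resolution, []
    for _ in range(n_layers):
        resolution //= 2                    # stride-2 pooling halves each spatial dimension
        totals.append(filters * resolution ** 2)
        filters *= width_factor             # F_l = width_factor * F_{l-1}
    return totals                           # with width_factor=4, all entries are equal

# e.g. feature_counts() -> [24576, 24576, 24576]
```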
Each experiment was performed 4 times, with random initializations, on all datasets except
the full ImageNet where only one random seed was tried.
The optimal number of SoftHebb weight update iterations is around 5000, based on CIFAR-10 experiments. Thus, for CIFAR-10 and MNIST (50k training examples), unsupervised training was performed in one epoch with a mini-batch size of 10, and with a mini-batch size of 20 for the STL-10 unlabelled training set (100k training examples). Because of the large number of training examples, we randomly selected 10% of the ImageNet dataset, with a mini-batch size of 20.
for SoftHebb are for layers that are trained successively, meaning each SoftHebb layer was
trained, and then frozen, before the subsequent layer was trained. However, the results are
very similar for simultaneous training of all layers, where each training example updates all
layers, as it passes forward through the deep network.
The linear classifier on top uses a mini-batch size of 64 and trains for 50 epochs for MNIST and CIFAR-10, 100 epochs for STL-10, and 200 epochs for ImageNet. For all datasets, the learning rate has an initial value of 0.001 and is halved repeatedly at [20%, 35%, 50%, 60%, 70%, 80%, 90%] of the total number of epochs. Data augmentation (random cropping and
flipping) was applied for STL-10 and ImageNet.
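A sketch of this classifier schedule with PyTorch's MultiStepLR; the optimizer type (Adam), the feature dimensionality, and the epoch count shown here are placeholders, while the milestones and halving factor follow the text above.

```python
import torch
from torch import nn

classifier = nn.Sequential(nn.Dropout(0.5), nn.Linear(24576, 10))  # illustrative sizes
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)
epochs = 50
milestones = [int(epochs * f) for f in (0.20, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90)]
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.5)

for epoch in range(epochs):
    # ... one epoch of supervised cross-entropy training with mini-batches of 64 ...
    scheduler.step()   # halve the learning rate at each milestone
```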
A.6.4 Fine-tuning
In the fine-tuning experiment on STL-10, all Hebbian CNN layers were first trained with SoftHebb on the large unlabelled training set of STL-10; then, an output layer was added and the entire network was trained end-to-end with BP on the full, small labelled training subset.
We have visualized the receptive fields (RFs) of hidden layers in the network (Figure 2 and
end of Appendix B). The method that we used is activation maximization (Erhan et al.,
2009; Le et al., 2012; Goodfellow et al., 2014; Nguyen et al., 2015). Specifically, we started
from a square of random pixels, and we optimized the input through gradient descent (or
rather ascent) to maximize the activation of each neuron, under the constraint of an L2
norm of 1, i.e. projection to a unit sphere. That is then a form of projected gradient
descent (PGD), which can also be used as an adversarial attack, if a loss function rather
than the activation function is maximized. For this purpose, we modified a toolbox for
adversarial attacks, named Foolbox (Rauber et al., 2020). We show RFs that maximize
the linear response of the neurons, i.e. the total weighted input. We have tuned the step
size of the descent, and we have validated the approach (a) by verifying that the number of iterations suffices for convergence, (b) by confirming that its results at the first layer match the layer's weights, (c) by verifying that the hidden neurons are strongly active if the network is fed with inputs that match the neurons' found RFs, and (d) by checking that alternative initializations also converge to the same RF. We also tried an alternative
method that was used by Miconi (2021). Specifically, we used that paper’s available code
(https://github.com/ThomasMiconi/HebbianCNNPyTorch). We found that the RFs found
by PGD activate the neurons more than the RFs found by the alternative method. Moreover,
PGD takes into consideration pooling and activation functions, which the other method does
not. Therefore we chose to present the results from PGD. Example results are presented in
Fig. 2 as well as in extended form with more examples at the end of Appendix B.
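A minimal sketch of this activation-maximization procedure as projected gradient ascent on the unit sphere; the shapes, step size, and iteration count are illustrative, and the paper used a modified Foolbox rather than this exact code.

```python
import torch

def maximize_activation(model, neuron_index, input_shape=(3, 32, 32),
                        steps=500, step_size=0.01):
    """Find an input that maximizes one neuron's response, constrained to the
    unit L2 sphere. `model` is assumed to map an input batch to a vector of
    per-neuron linear responses."""
    x = torch.randn(1, *input_shape)
    x = x / x.norm()                          # start from random pixels on the unit sphere
    x.requires_grad_(True)
    for _ in range(steps):
        activation = model(x)[0, neuron_index]
        grad, = torch.autograd.grad(activation, x)
        with torch.no_grad():
            x += step_size * grad             # gradient *ascent* on the activation
            x /= x.norm()                     # project back onto the unit L2 sphere
    return x.detach()
```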
Figure B.2: Same as Fig. B.1, but in the 3-hidden-layer network, such that all layers use the
same temperature. R1 features are reported as a percentage over all features of all layers.
We speculate that a low initial radius is problematic (Figure B.3A) because in that regime
the balance between excitation (from the input) and inhibition (from the soft WTA) is
overly tilted towards inhibition. Regarding Figure B.3B, see also main text’s Section 3,
“Weight-norm-dependent adaptive learning rate”, and the related paragraph in Section 4.
The choice of activation function for forward propagation through SoftHebb layers is im-
portant. To study this, we compared SoftHebb’s performance on CIFAR-10 for various
activation functions. In each result, all neurons across all layers used the same activation
function hyperparameters, except in the case of Triangle. The parameters that were tried were:
These three values were tried because they were good for individual layers in previous
tuning. Asterisk indicates the best parameter value according to validation accuracy. Using
the best values produced the test accuracy results reported below, whereas the remaining
hyperparameters were not tuned to each case, but rather were the same, as found for Triangle
by the process described in Appendix A.4.
The width of the layers is rather impactful, as indicated by varying the width factor of deep
layers while keeping the first layer’s width constant (Fig. 4). To further study that impact,
we varied the first layer’s width and kept the width factor fixed to 4 (which scales all layers).
The results are presented in Fig. B.4.
Figure B.4: Performance on CIFAR-10 for varying width of the layers. First-layer width is
indicated, while subsequent layers are scaled by the width factor 4 (see Appendix A.5).
For this control experiment, we did not re-tune the hyperparameters to each width, but
rather only to the 96-neuron case. That is in contrast to Fig. 4, where hyperparameters were
tuned to each width factor.
Figure B.5: Receptive fields of the first convolutional layer’s neurons, learned from CIFAR-10
by different algorithms.
As a control, we repeated the experiment that we presented in Fig. 5, this time comparing to a network with the initial, untrained, random weights. The results are shown in Fig. B.6 and Fig. B.7.
Figure B.6: UMAP projection (similar to main text’s Fig. 5, top row) of the test set after
passing through 4 SoftHebb layers. Here, (A) from a trained and (B) from a randomly
initialized, untrained network.
Figure B.7: Images and patches that best activate 5 random neurons from SoftHebb Layers
1 and 4 (similar to main text’s Fig. 5, bottom row). Here, (A) from a trained and (B) from
a randomly initialized, untrained network.
For the method, see Appendix A.7. RFs of deeper layers are not all Gabor-like; they also include mixtures of Gabor filters and take on different shapes and textures. In addition, RFs do appear increasingly complex with depth. These results could possibly be expected
based on the RFs of the first layer, which are already more complex than the mere Gabor
filters that are learned by other Hebbian approaches, such as hard WTA (Figure B.5A).
Their mixture in subsequent layers then was unlikely to only produce Gabor filters. It is
difficult to interpret each RF precisely, but this is common in the hierarchies of deep neural
networks.
Figure B.9: 250 randomly sampled receptive fields of layer 2, learned from STL-10.
Figure B.10: 250 randomly sampled receptive fields of layer 3, learned from STL-10.
Figure B.11: 250 randomly sampled receptive fields of layer 4, learned from STL-10.