Backpropagation everywhere

Can the brain do backpropagation? - Hinton, 2016

In this talk, Hinton rebuts four arguments that neuroscientists have made against the idea that the brain learns using backpropagation.

  • Most human learning is unsupervised, without the explicit loss functions usually used in backpropagation.
Hinton argues that error signals can be derived in unsupervised contexts in many different ways: reconstructing the input signal (as autoencoders do); comparing local predictions with contextual predictions; learning a generative model (as in the wake-sleep algorithm); using a variational autoencoder; or generative adversarial learning.

  • Neurons don't send real numbers, but rather binary spikes.
Hinton argues that this is a form of regularisation which actually makes the brain more effective. Any real number can be converted into a probability of a binary firing signal; the effects of doing so are similar to those of using dropout. Such strong regularisation is necessary because the brain has around 10^14 parameters but humans only live for around 10^9 seconds, so otherwise we might massively overfit. In general, Hinton claims that using more parameters together with more regularisation is better than keeping both low, contrary to what traditional statistics would recommend; he cites ensemble methods as an example.
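To make the spike idea concrete, here's a minimal sketch in Python (NumPy assumed). The function name `spike` and the example rates are my own illustration, not something from the talk. Each real-valued rate in [0, 1] is sent as a binary spike that fires with that probability: any single time step is noisy, much like a unit being randomly dropped, but the average over many steps recovers the underlying number.

```python
import numpy as np

def spike(rates, rng):
    """Sample binary spikes: each unit fires (1.0) with probability equal to its rate."""
    return (rng.random(rates.shape) < rates).astype(float)

rng = np.random.default_rng(0)
rates = np.array([0.1, 0.5, 0.9])   # hypothetical real-valued signals in [0, 1]

# One time step: a noisy binary pattern.
one_step = spike(rates, rng)

# Averaging over many time steps approximately recovers the real numbers.
many_steps = np.stack([spike(rates, rng) for _ in range(10_000)])
estimate = many_steps.mean(axis=0)
```

Here `estimate` should land close to `rates`, while `one_step` contains only zeros and ones.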

  • Neurons can't send both feature information and training gradients.
If we want to change the weights of a forward neuron using backpropagation, we need to send back the gradients of the loss function with respect to those weights. Since neurons only transmit in one direction, we therefore need a second neuron to do this. But now we run into the "weight transport problem": how can the backwards neuron find out what the weights of the corresponding forwards neuron are, in order to calculate the corresponding gradients? Hinton apparently has a solution to this: if we think of neurons as sending a continuous signal (which, as discussed above, is approximated by binary spikes), then over short time periods the gradient of that signal can vary roughly independently from the magnitude of the signal itself. Theoretically, that gradient could be used to encode information about weights on top of the normal signal. This hypothesis is bolstered slightly by evidence that even in cases where using the actual gradient of the signal's magnitude would be useful (for example, a neuron tracking speed, whose gradient would normally be acceleration) the brain doesn't do so. The details are quite involved, so I'll leave it there and move on to the fourth objection.

  • Neurons aren't actually paired up.
Hinton said that this was the problem he considered insurmountable a few years ago. The solution to the weight transport problem above requires every forward neuron to have a corresponding backwards neuron, since backpropagation in neural networks transmits gradient information for each neuron individually. But the brain simply isn't composed of forward-backward neuron pairs. There are some backwards connections between entire brain regions, but not nearly enough to do conventional backpropagation. Recently, though, the previous consensus on this point has been overturned by a demonstration that backpropagation can in fact work without having precise correspondences between forwards and backwards neurons. This is surprising enough that it's worth looking into the original source.

Random synaptic feedback weights support error backpropagation for deep learning - Lillicrap, Cownden, Tweed, and Akerman, 2016

Normally, after we calculate the loss, we calculate the exact gradient of the loss function with respect to each weight, which requires the backward pass to use the same weights as the forward pass. This paper is concerned with cases where the backward pass can't access those weights, i.e. where we can't solve the weight transport problem. In this situation, we can still do backpropagation using feedback weights which are determined independently from the forward weights - but without being able to determine which weights are responsible for the outcome, we would expect performance to diminish drastically. The key result of this paper is that fixed, random feedback weights are actually sufficient for a neural network to achieve performance equal to that of standard backpropagation. The only requirement is that the teaching signal from the backward links "pushes the network in roughly the same direction as backprop would". When the backwards weights are fixed, this condition can still be fulfilled if the forwards weights evolve to better match the backwards weights; the authors call this "feedback alignment". This works even when 50% of the forward connections and 50% of the backward connections are removed.

Hinton suggests the intuitive reason why this works: apart from the last layer, the rest of a neural network is mainly focused on creating a good representation of the input. Even though the feedback weights are fixed, different data classes will result in different error signals along those weights, which allows the rest of the network to adapt. The authors claim that we should think of feedback alignment as being part of a spectrum between a global reward function which sends the same signal to every neuron, and a reward function exactly tailored for each neuron like standard backpropagation.
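The feedback alignment update described above can be sketched in a few lines of Python (NumPy assumed). This is a toy illustration under my own assumptions, not the authors' experimental setup: a two-layer linear network, a random linear regression task, and made-up sizes and learning rate. The only change from standard backprop is the line computing `dH`, which uses the fixed random matrix `B` where backprop would use the forward weights `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_samples = 10, 20, 5, 200

# Toy regression task: targets come from a random linear map (an assumption).
X = rng.normal(size=(n_samples, n_in))
T = X @ (rng.normal(size=(n_in, n_out)) / np.sqrt(n_in))

W0 = rng.normal(scale=0.01, size=(n_hid, n_in))   # forward weights, layer 1
W = rng.normal(scale=0.01, size=(n_out, n_hid))   # forward weights, layer 2
B = rng.uniform(-0.5, 0.5, size=(n_hid, n_out))   # fixed random feedback weights

def loss(W0, W):
    """Mean over examples of half the squared output error."""
    E = X @ W0.T @ W.T - T
    return 0.5 * np.mean(np.sum(E ** 2, axis=1))

initial_loss = loss(W0, W)
lr = 0.05
for _ in range(300):
    H = X @ W0.T          # hidden activity (linear units)
    E = H @ W.T - T       # output error
    # Standard backprop would compute dH = E @ W.
    # Feedback alignment sends the error back through B instead.
    dH = E @ B.T
    W -= lr * (E.T @ H) / n_samples
    W0 -= lr * (dH.T @ X) / n_samples
final_loss = loss(W0, W)

# Cosine between the forward weights W and the feedback weights B.T.
alignment = np.sum(W * B.T) / (np.linalg.norm(W) * np.linalg.norm(B))
```

In this sketch `B` is never updated, yet the loss should still fall, and `alignment` should come out positive: the forward weights drift towards the transpose of the feedback weights, which is the effect the paper is named after.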

How important is weight symmetry in backpropagation? - Liao, Leibo, and Poggio, 2016

This paper is similar to the one by Lillicrap et al., but has a slightly different focus. Liao et al. find that when they set all feedback weights to the sign of the corresponding forward weight (which they call sign concordance), they achieve results on par with or better than standard SGD (as long as they also apply batch normalisation and batch manhattan, as described below). Similarly good results obtain if feedback magnitudes are varied randomly with the same sign, and even if the last layer is initialised randomly and frozen with those values. Liao et al. claim that the successes of sign concordance are more robust than those from fixed random weights - for example, the latter perform very badly when the last layer is frozen, which suggests that the key to feedback alignment is co-adaptation between the last layer and previous layers. However, it's still unclear how biologically plausible sign concordance is, since it still requires the signs of forward neurons' weights to be communicated to backward neurons. Its performance when feedback connections are sparse is also untested, although presumably it will do at least as well as feedback alignment.
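As a rough illustration of the feedback choices discussed here, the following Python sketch (NumPy assumed) carries the same error signal backwards through three different sets of feedback weights for a single layer. The shapes, variable names and the cosine comparison are my own assumptions for illustration; the paper evaluates these schemes by training full networks, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 20))    # forward weights of one layer (outputs x inputs)
E = rng.normal(size=(8, 5))     # error signals at the layer's outputs, one row per example

# Three choices of feedback weights for carrying E back to the layer's inputs.
V_exact = W                                            # standard backprop
V_sign = np.sign(W)                                    # sign concordance
V_rand = np.sign(W) * rng.uniform(0.0, 1.0, W.shape)   # random magnitudes, same signs

def cosine(a, b):
    """Cosine similarity between two arrays, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

true_signal = E @ V_exact
cos_sign = cosine(true_signal, E @ V_sign)
cos_rand = cosine(true_signal, E @ V_rand)
```

Both cosines should be positive here, meaning that the sign-only feedback pushes the layer in broadly the same direction as the exact gradient would, even though all magnitude information has been thrown away.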

Liao et al. also emphasise that the success of these asymmetric backpropagation algorithms is enhanced greatly by the use of batch manhattan and batch normalisation, so let's have a look at them. When doing gradient descent, we calculate the gradient based on a group of examples, which we'll call a batch. Full gradient descent uses all the training examples as one batch; stochastic gradient descent uses batches of size one. Mini-batches are batches of intermediate size; they are a particularly useful alternative to SGD on computing platforms which can efficiently implement parallelism. In batch manhattan, after we have used feedback weights to calculate a given forward weight's gradient over a mini-batch, we update the forward weight based only on the sign of that gradient. This is discussed in more detail in (Hifny, 2013).
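The batch manhattan update can be sketched as follows in Python (NumPy assumed). The function name `batch_manhattan_step` and the numbers are hypothetical, chosen only to show the rule: sum the per-example gradients over the mini-batch, then move each weight by a fixed step in the direction given by the sign of that sum, ignoring its magnitude.

```python
import numpy as np

def batch_manhattan_step(w, per_example_grads, lr):
    """Update w using only the sign of the gradient summed over the mini-batch."""
    batch_grad = per_example_grads.sum(axis=0)
    return w - lr * np.sign(batch_grad)

w = np.array([0.5, -1.0, 2.0])
# Hypothetical gradients for a mini-batch of three examples (one row each).
grads = np.array([[ 0.2, -3.0, 0.0],
                  [ 0.1, -1.0, 0.0],
                  [-0.1, -2.0, 0.0]])
w_new = batch_manhattan_step(w, grads, lr=0.01)
```

The first weight has a small positive batch gradient and the second a large negative one, but both move by the same amount (0.01) in opposite directions; the third has zero gradient and stays where it is.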

Batch Normalisation: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Ioffe and Szegedy, 2015

(Note that this more recent paper argues against this interpretation of batch normalisation).

Batch normalisation is a bit more complicated. One problem with gradient descent is that as training occurs, the outputs of each layer shift, and so the distribution of inputs of the following layer also shifts, and subsequent layers need to adapt. This is known as internal covariate shift, and may cause nonlinearities to become saturated. While this can be mitigated by using ReLUs and small learning rates, solving it could accelerate training. Ioffe and Szegedy propose a variant of "whitening" which they call "batch normalisation". Whitening involves normalising a set of input variables so that they are uncorrelated and each have mean 0 and variance 1. This essentially transforms the covariance matrix into the identity matrix, and can be done in a number of ways. Whitening data before using it to train a neural network is standard.
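For reference, here is one of the ways whitening can be done, as a short Python sketch (NumPy assumed). I've picked ZCA whitening via an eigendecomposition of the covariance matrix; this is my choice for illustration, not a method taken from the paper, and the mixing matrix and offsets are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 5000 examples, 3 features with non-zero means.
mixing = np.array([[2.0,  0.0, 0.0],
                   [1.0,  1.0, 0.0],
                   [0.5, -0.5, 0.5]])
X = rng.normal(size=(5000, 3)) @ mixing + np.array([1.0, -2.0, 0.5])

# Centre the data, then rotate and rescale using the covariance eigendecomposition.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W_zca = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
X_white = Xc @ W_zca

# After whitening, the covariance matrix should be (numerically) the identity.
cov_white = np.cov(X_white, rowvar=False)
```

Note that this needs the full covariance matrix and its inverse square root, which is exactly the cost that becomes a problem if it has to be repeated for every layer.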

However, whitening each layer after each gradient descent step is expensive, and so the authors make two changes. Firstly, they normalise each feature independently, without decorrelating them, which avoids the need to calculate the covariance matrix or its inverse. Secondly, they use estimates for the mean and variance based on each mini-batch. Another issue is that normalisation could reduce the expressive power of the network - for example, when using sigmoid nonlinearities the normalised data might end up only in the linear section. They address this by adding learned parameters which rescale and shift the data (the shift essentially replaces the bias term).
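Putting those pieces together, the training-time forward pass looks roughly like this in Python (NumPy assumed). It's a minimal sketch: the function name, the `eps` value and the example data are my own, and it leaves out the backward pass as well as the population statistics used at inference time. Here `gamma` and `beta` stand for the learned rescaling and shifting parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise each feature (column) over the mini-batch, then rescale and shift."""
    mean = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                        # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # eps guards against division by zero
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=4.0, size=(64, 3))   # one mini-batch, 3 features

# With gamma = 1 and beta = 0, each feature ends up with mean ~0 and variance ~1.
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))

# Other values of gamma and beta let the network undo or reshape the normalisation.
y_scaled = batch_norm(x, gamma=np.full(3, 2.0), beta=np.full(3, -1.0))
```

Each column is treated separately, so no covariance matrix is ever computed, and `y_scaled` shows how the learned parameters can move the data back out of the linear region of a sigmoid if that turns out to be useful.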

Batch normalisation allows higher learning rates, gives faster convergence, and reduces the need for other forms of regularisation, such as dropout and weight regularisation. The effectiveness of batch normalisation as a regulariser is greater when training examples are shuffled so that the batches aren't the same every time they're seen.

