Why binary stochastic units train slowly

In lecture 9 of Geoffrey Hinton’s Coursera course, he talks about Alex Graves adding noise to the weights of an RNN for handwriting recognition (possibly this paper is a good reference for this?).

Hinton goes on to say that just adding noise in activations is a good regularizer, and talks about doing this in an MLP by making the units binary and stochastic on the forward pass, and then doing backprop as though we’d done the forward pass deterministically (as a usual logistic unit).

So you compute the logistic p, and treat that p as the probability of outputting a 1. In the forward pass, you make a random decision to output 1 or 0 depending on that probability p, but in the backward pass you use the real value of p.

\displaystyle p_i = \mathrm{sigmoid}\left(\sum_j w_{ij} x_j + b_i\right)

Forward pass:

\displaystyle h_i \sim \mathrm{Bernoulli}(p_i)

Backward pass:

\displaystyle \frac{\partial C}{\partial p_i}    (not \displaystyle \frac{\partial C}{\partial h_i} )

This works pretty well as a regularizer (it increases error on the training set but decreases error on the test set). However, it trains a lot slower. Yiulau asked in the lecture comments why that happens.
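The trick above can be sketched in a few lines. This is my own minimal illustration, not code from the lecture; the shapes and parameter values are made up:

```python
import math
import random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W, b):
    # Deterministic part: p_i = sigmoid(sum_j w_ij x_j + b_i)
    p = [sigmoid(sum(wij * xj for wij, xj in zip(Wi, x)) + bi)
         for Wi, bi in zip(W, b)]
    # Stochastic part: h_i = 1 with probability p_i, else 0
    h = [1.0 if random.random() < pi else 0.0 for pi in p]
    return p, h

def backward(dC_dh, p):
    # Backprop as though the unit had been a plain (deterministic)
    # sigmoid: ignore the sampling step, use sigmoid'(a) = p * (1 - p).
    return [g * pi * (1.0 - pi) for g, pi in zip(dC_dh, p)]

x = [1.0, -2.0]
W = [[0.5, 0.3], [-0.4, 0.8]]
b = [0.1, 0.0]
p, h = forward(x, W, b)          # h is binary, p is the real-valued prob
grad = backward([1.0, 1.0], p)   # pretend dC/dh_i = 1 for illustration
```

Note that `backward` never looks at `h` at all: the sampled binary value only affects the forward pass.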


There are three main reasons:

  1. We’re adding stochasticity to the gradient updates,
  2. By quantizing p to be 0 or 1, we’re losing information, and therefore have reduced capacity,
  3. The true gradient w.r.t. the activations is 0, so really we should not be able to learn anything (there should be no weight updates). But we just use the gradient as though the unit had been a sigmoid, and it seems to work pretty well.

Why is the true gradient 0? Consider \frac{\partial C}{\partial p_i} ; to get from \frac{\partial C}{\partial h_i} (from our forward pass) to \frac{\partial C}{\partial p_i} , we need to multiply by \frac{\partial h_i}{\partial p_i} . How did we get h_i from p_i again?

\displaystyle h_i = \mathbb{1}_{U < p_i}

where U is sampled from the uniform distribution on [0,1]. The derivative of this step function w.r.t. p_i is 0 almost everywhere (and undefined exactly at the jump p_i = U), so when we do the chain rule, we should be multiplying by this 0 in the second term:

\displaystyle \frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial h_i}\frac{\partial h_i}{\partial p_i}

And therefore we should never have any gradient. But we just ignore this \frac{\partial h_i}{\partial p_i} term, and we get a regularizer!
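A tiny numeric check makes the point concrete (again my own illustration, not from the lecture): for a fixed uniform draw U, h(p) is a step function of p, so its finite-difference derivative is 0 everywhere away from the jump:

```python
U = 0.42  # an arbitrary fixed sample from Uniform[0, 1]

def h(p):
    # h = 1_{U < p}: the sampling step, viewed as a function of p
    return 1.0 if U < p else 0.0

eps = 1e-6
p = 0.7  # any p not exactly at the jump p = U
fd = (h(p + eps) - h(p - eps)) / (2 * eps)
# fd is 0.0: the chain-rule factor dh/dp vanishes almost everywhere,
# so dC/dp = (dC/dh) * (dh/dp) = 0 for almost every draw of U.
```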

I suppose that this amounts to adding noise to the gradient, and I wonder if it has a specific form.

On the biological plausibility of LSTMs

This is what an LSTM unit approximately looks like:


I say approximately because it seems to vary between implementations whether the input goes through a non-linearity before reaching the gates, whether the input is fed to every gate, and so on.

That image is from this page on Eric Yuan’s blog which has a great explanation of LSTMs, as does this page of Chris Olah’s blog.
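For concreteness, here is one common formulation of the unit (just one variant among several; W, U, and b denote the usual input weights, recurrent weights, and biases, \sigma is the logistic sigmoid, and \circ is elementwise multiplication):

\displaystyle i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)

\displaystyle f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)

\displaystyle o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)

\displaystyle c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)

\displaystyle h_t = o_t \circ \tanh(c_t)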

I’m going to assume you’ve read those or something like them, and talk about the biological plausibility of LSTMs. This post began as a response to discussion in class over Thomas George’s question on this lecture page.

Just as a refresher, this is what a cartoon neuron looks like. Generally we consider that the dendrites perform a weighted sum, the cell body applies some non-linear transformation to that sum, and the output is passed along the axon to other neurons (the ‘axon ending’ in this diagram generally ‘ends’ at a synapse with the dendrites of other neurons). Image from here.


Gating in general:

I think gating is like having some connections higher in the dendritic arbor (closer to the cell body), which have a proportionally very large impact on the processing done by the cell. So you can unfold each gate into a sigmoid MLP with a single output to the central cell; an LSTM is then just a specific way of nesting/restricting connectivity between neurons, where the neurons can have different activation functions and different degrees of effect on each other.

In this view, the different LSTM variants (peepholes, coupled gates, etc.) are just different connectivity structures – in particular, peepholes allow connections between the gate-cells, and coupled gates use the same MLP for various gates.

To look at the three gates (“neuron types”) individually:

Input gate:

This is like the central cell having input from a neuron which has had inputs from the “real” input neurons x(t), and also from the “output” of the previous state h(t-1).

For example, this could be the case of a cell in the LGN (lateral geniculate nucleus) which receives information about light levels from sensory neurons in the retina, and also feedback information from the primary visual cortex.

Forget gate:

This is like the central cell having input from a neuron which looks at the same stuff the input gate does, and also has input from the central cell.

In class we talked about a differential formula for how voltage changes across the membrane:

\displaystyle \frac{dV_i}{dt} = \sum_{j} w_{ij} x_{j}(t) - V_i

where the -V_i term is like an input from the cell to itself: if you were to discretize time, the new value would depend both on the inputs and on the previous value. In other words, it makes the cell inertial, resistant to changing its value. I see this as being like giving a bias to the ‘central’ cell in this unfolded LSTM.

The above formula should actually be

\displaystyle \tau\frac{dV_i}{dt} = \sum_{j} w_{ij} x_{j}(t) - V_i

where \tau is a time constant. We talked in class about this time constant not being fixed in LSTMs, and whether or not this is biologically plausible. There are cells in the hippocampus whose spiking rates appear to modulate our perception of time, and these correlations are not fixed.
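Discretizing the equation with a forward Euler step makes the role of \tau explicit (my own sketch, with made-up step size and values): the larger \tau is, the less the voltage moves toward its input-driven target per step, i.e. the more inertial the cell is.

```python
def euler_step(V, weighted_input, tau, dt=0.1):
    # Forward Euler step for: tau * dV/dt = sum_j w_ij x_j(t) - V
    # `weighted_input` stands for the summed term sum_j w_ij x_j(t).
    dV_dt = (weighted_input - V) / tau
    return V + dt * dV_dt

# Same input and starting voltage, two different time constants:
V_fast = euler_step(0.0, 1.0, tau=1.0)   # moves 0.1 of the way to 1.0
V_slow = euler_step(0.0, 1.0, tau=10.0)  # moves only 0.01 of the way
```

A non-fixed time constant would correspond to letting `tau` itself vary with the cell’s inputs, which is roughly what the forget gate does.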

To continue the previous example, we can imagine that the forget gate is a cell in the LGN connected to the input gate cell I describe above, and which also has connections to the hippocampus and to the central cell. This paper talks about hippocampal connections to the LGN.

Output gate:

This is like the central cell having input from a neuron which looks at the same stuff as the input gate does, and attaches to the central cell not in the dendritic arbor but on the axon.

This paper talks about a region of hippocampal cells in particular having this structure (well, they observe that many axons in this area derive from dendrites, which amounts to the same thing).