Hinton goes on to say that just adding noise to the activations is a good regularizer, and talks about doing this in an MLP by making the units binary and stochastic on the forward pass, and then doing backprop as though we’d done the forward pass deterministically (as a usual logistic unit).
So you compute the logistic p, and treat that p as the probability of outputting a 1. In the forward pass, you make a random decision to output 1 or 0 depending on that probability p, but in the backward pass you use the real value of p.
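A minimal NumPy sketch of the trick (the function names and shapes are mine, just for illustration): sample a hard 0/1 output on the forward pass, but backpropagate as if the unit were an ordinary logistic unit.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(z):
    """Stochastic binary forward pass: h ~ Bernoulli(sigmoid(z))."""
    p = sigmoid(z)
    h = (rng.uniform(size=p.shape) < p).astype(float)
    return h, p

def backward(grad_h, p):
    """Backward pass that pretends the forward pass was deterministic:
    use the ordinary logistic derivative dL/dz = dL/dh * p * (1 - p),
    ignoring the sampling step entirely."""
    return grad_h * p * (1.0 - p)

z = np.array([-1.0, 0.0, 2.0])
h, p = forward(z)                       # h is in {0, 1}
grad_z = backward(np.ones_like(h), p)   # nonzero despite the hard threshold
print(h, grad_z)
```

Note the asymmetry: `h` (the sampled value) is what the next layer sees, but `p` (the deterministic logistic output) is what the gradient uses.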
This works pretty well as a regularizer (it increases error on the training set but decreases error on the test set). However, it trains a lot slower. Yiulau asked in the lecture comments why that happens.
There are three main reasons:
- We’re adding stochasticity to the gradient updates,
- By quantizing p to be 0 or 1, we’re losing information, and therefore have reduced capacity.
- The true gradient w.r.t. the activations is 0, so really we shouldn’t be able to learn anything (there should be no weight updates). But we just use the gradient as though the unit had been a sigmoid, and it seems to work pretty well.
Why is the true gradient 0? Consider the chain rule: to get from $\frac{\partial L}{\partial h}$ (from our forward pass) to $\frac{\partial L}{\partial p}$, we need to multiply by $\frac{\partial h}{\partial p}$. How did we get $h$ from $p$ again?

$$h = \begin{cases} 1 & \text{if } U < p \\ 0 & \text{otherwise} \end{cases}$$

where $U$ is sampled from the uniform distribution on $[0, 1]$. The derivative of this function w.r.t. $p$ is 0 almost everywhere (the jump at $U = p$ occurs with probability zero), so when we do the chain rule, we should be multiplying by this 0 in the second term:

$$\frac{\partial L}{\partial p} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial p} = \frac{\partial L}{\partial h} \cdot 0 = 0$$

And therefore we should never have any gradient. But we just ignore this term, and we get a regularizer!
I suppose that this amounts to adding noise to the gradient, and I wonder if it has a specific form.