Why binary stochastic units train slowly

In Lecture 9 of Geoffrey Hinton’s Coursera course, he talks about Alex Graves adding noise to the weights of an RNN for handwriting recognition (possibly this paper is a good reference for this?).

Hinton goes on to say that just adding noise to the activations is a good regularizer, and talks about doing this in an MLP by making the units binary and stochastic on the forward pass, then doing backprop as though the forward pass had been deterministic (as for an ordinary logistic unit).

So you compute the logistic p, and treat that p as the probability of outputting a 1. In the forward pass, you make a random decision to output 1 or 0 depending on that probability p, but in the backward pass you use the real value of p.

\displaystyle p_i = \mathrm{sigmoid}(W_i x + b_i)

Forward pass:

\displaystyle h_i \sim \mathrm{Bernoulli}(p_i)

Backward pass:

\displaystyle \frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial h_i} \quad (\text{i.e. we act as though } \frac{\partial h_i}{\partial p_i} = 1)
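A minimal NumPy sketch of this forward/backward scheme for a single layer (the function names and single-layer setup are my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Stochastic binary forward pass: sample h_i ~ Bernoulli(p_i)."""
    p = sigmoid(W @ x + b)
    u = rng.uniform(size=p.shape)
    h = (u < p).astype(float)      # h_i = 1_{U < p_i}
    return p, h

def backward(dC_dh, p, x):
    """Backprop as if the unit had been a plain sigmoid:
    pretend dh/dp = 1, so dC/dp = dC/dh, then chain through the logistic."""
    dC_dp = dC_dh                  # the straight-through step
    dC_dz = dC_dp * p * (1.0 - p)  # derivative of the sigmoid
    dC_dW = np.outer(dC_dz, x)
    dC_db = dC_dz
    return dC_dW, dC_db

x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
p, h = forward(x, W, b)
dC_dW, dC_db = backward(np.ones(3), p, x)
```

Note that `backward` never looks at the sampled `h` at all: the gradient flows through `p` exactly as it would for a deterministic logistic unit.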

This works pretty well as a regularizer (it increases error on the training set but decreases error on the test set). However, it trains a lot more slowly. Yiulau, in the lecture comments, asked why that happens.


There are three main reasons:

  1. We’re adding stochasticity to the gradient updates.
  2. By quantizing p to be 0 or 1, we’re losing information, and therefore have reduced capacity.
  3. The true gradient w.r.t. the activations is 0, so really we shouldn’t be able to learn anything (there should be no weight updates). But we just use the gradient as though the unit had been a deterministic sigmoid, and it seems to work pretty well.

Why is the true gradient 0? Consider \frac{\partial C}{\partial p_i}: to get from \frac{\partial C}{\partial h_i} (which the backward pass hands us) to \frac{\partial C}{\partial p_i}, we need to multiply by \frac{\partial h_i}{\partial p_i}. How did we get h_i from p_i again?

\displaystyle h_i = 1_{U<p_i}

Where U is sampled uniformly from [0,1]. The derivative of this indicator w.r.t. p_i is 0 almost everywhere (and undefined at the jump U = p_i), so when we apply the chain rule, the second term multiplies everything by 0:

\displaystyle \frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial h_i}\frac{\partial h_i}{\partial p_i}

And therefore we should never have any gradient. But we just ignore this \frac{\partial h_i}{\partial p_i} term, and we get a regularizer!
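To see the zero derivative concretely: with U held fixed, h_i is a step function of p_i, so a finite-difference check away from the jump always gives 0. A quick sketch (all names here are mine):

```python
import numpy as np

u = 0.5                     # fix the uniform draw U
h = lambda p: float(u < p)  # h = 1_{U < p}

# Finite differences away from the jump at p = u: the derivative is 0,
# because nudging p doesn't change which side of U it sits on.
eps = 1e-6
for p in (0.2, 0.7, 0.9):
    assert (h(p + eps) - h(p - eps)) / (2 * eps) == 0.0
```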

I suppose that this amounts to adding noise to the gradient, and I wonder whether that noise has a specific form.
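One way to see the noise: writing h_i = p_i + (h_i - p_i), the term h_i - p_i is zero-mean (since E[h_i] = p_i) with variance p_i(1 - p_i), so the forward pass equals the deterministic sigmoid output plus centred Bernoulli noise. A quick Monte Carlo check (the setup is mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
n = 200_000
h = (rng.uniform(size=n) < p).astype(float)
noise = h - p         # the "noise" the stochastic unit injects

print(noise.mean())   # ≈ 0: the noise is zero-mean
print(noise.var())    # ≈ p*(1-p): Bernoulli variance
```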
