# Why binary stochastic units train slowly

In lecture 9 of Geoffrey Hinton’s Coursera course, he talks about Alex Graves adding noise to the weights of an RNN for handwriting recognition (possibly this paper is a good reference for this?).

Hinton goes on to say that simply adding noise to the activations is a good regularizer, and describes doing this in an MLP by making the units binary and stochastic on the forward pass, then doing backprop as though the forward pass had been deterministic (as for a usual logistic unit).

So you compute the logistic output $p$ and treat it as the probability of the unit outputting a 1. On the forward pass, you make a random decision to output 1 or 0 according to $p$; on the backward pass, you use the real value of $p$.

$\displaystyle p_i = \mathrm{sigmoid}(w_i^\top x + b)$

Forward pass:

$\displaystyle h_i \sim \mathrm{Bernoulli}(p_i)$

Backward pass:

$\displaystyle \frac{\partial C}{\partial p_i}$    (not $\displaystyle \frac{\partial C}{\partial h_i}$)
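A minimal NumPy sketch of this scheme (the function names and shapes are my own; the lecture gives no code): sample the binary activations on the forward pass, then backprop through the sigmoid as if the forward pass had been deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(z):
    """Stochastic forward pass: compute p, then sample h_i ~ Bernoulli(p_i)."""
    p = sigmoid(z)
    h = (rng.random(p.shape) < p).astype(float)
    return p, h

def backward(p, upstream):
    """Backward pass as if the unit had been a plain logistic unit:
    treat the upstream gradient as dC/dp (ignoring that we actually
    output the sample h), and multiply by the usual sigmoid derivative
    p * (1 - p) to get dC/dz."""
    return upstream * p * (1.0 - p)

z = np.array([-2.0, 0.0, 2.0])
p, h = forward(z)
dC_dz = backward(p, np.ones_like(p))  # pretend dC/dp = 1 for illustration
```

Note that `h` never appears in `backward` at all; that is the whole trick.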

This works pretty well as a regularizer: it increases error on the training set but decreases error on the test set. However, it also makes training a lot slower. Yiulau asked in the lecture comments why that happens.

## Why?

There are two main reasons:

1. By quantizing $p$ to be 0 or 1, we’re losing information, and have therefore reduced the effective capacity of the network.
2. The true gradient w.r.t. the activations is 0, so strictly speaking there should be no weight updates at all. But we just use the gradient as though the unit had been a sigmoid, and it seems to work pretty well.

Why is the true gradient 0? Consider $\frac{\partial C}{\partial p_i}$ ; to get from $\frac{\partial C}{\partial h_i}$ (from our forward pass) to $\frac{\partial C}{\partial p_i}$, we need to multiply by $\frac{\partial h_i}{\partial p_i}$ . How did we get $h_i$ from $p_i$ again?

$\displaystyle h_i = \mathbb{1}_{U < p_i}$

Where $U$ is sampled from the uniform distribution on $[0,1]$. As a function of $p_i$ (with $U$ held fixed), this indicator is a step function: its derivative is 0 everywhere except at $p_i = U$, where it is undefined. So when we apply the chain rule, we should be multiplying by this 0 in the second term:

$\displaystyle \frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial h_i}\frac{\partial h_i}{\partial p_i}$

And therefore we should never get any gradient at all. But we simply ignore the $\frac{\partial h_i}{\partial p_i}$ term (equivalently, pretend it is 1), and we get a regularizer!
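This is easy to check numerically. Hold the uniform draw $U$ fixed and nudge $p$ by a small $\epsilon$: unless $p$ happens to sit within $\epsilon$ of $U$, the indicator doesn’t flip, so the finite-difference estimate of $\partial h/\partial p$ is exactly 0. A quick sketch (the specific numbers are arbitrary):

```python
def sample(p, u):
    """h = 1_{U < p}, with the uniform draw U held fixed."""
    return float(u < p)

u = 0.37     # a fixed draw from Uniform[0, 1]
p = 0.80     # firing probability, safely away from u
eps = 1e-6

# Central finite difference of h w.r.t. p, with U fixed:
dh_dp = (sample(p + eps, u) - sample(p - eps, u)) / (2 * eps)
# The indicator only flips when p crosses u, so this is 0 almost everywhere
# (and the derivative is undefined exactly at p = u).
```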

I suppose that this amounts to adding noise to the gradient, and I wonder whether that noise has a specific form.
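As a rough probe of that question (my own sketch, not from the lecture): for a single unit feeding a quadratic cost $C(h) = (h - 0.9)^2$, the “pretend” gradient $\frac{\partial C}{\partial h} = 2(h - 0.9)$ is a random variable. Its spread is the noise, and its mean, $2(p - 0.9)$, does not match the exact gradient of the expected cost, $\frac{d}{dp}\,\mathbb{E}[C] = (1-0.9)^2 - (0-0.9)^2$, so the estimate is not just noisy but, for nonlinear downstream costs, also biased.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.3        # firing probability of the unit
n = 100_000    # Monte Carlo samples

h = (rng.random(n) < p).astype(float)      # h ~ Bernoulli(p)

# "Pretend" gradient used by the trick: dC/dh evaluated at the sample h.
st_grads = 2.0 * (h - 0.9)

# Exact gradient of the expected cost E[C] = p*C(1) + (1-p)*C(0):
exact = (1 - 0.9) ** 2 - (0 - 0.9) ** 2    # = -0.8

noisy_mean = st_grads.mean()               # concentrates near 2*(p - 0.9) = -1.2
```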