In the previous post, we showed how powerful and effective dimension analysis is. However, with it alone you will still run into trouble when taking derivatives in backpropagation. The second rule is just as important to know.
Rule two: Stick to the chain rule.
I used to think that the chain rule is only for beginners. For example, given a scalar function such as y = σ(wx), I can figure out dy/dw = x·σ′(wx) in ten seconds. What is the point of applying the chain rule to such a simple compound function?
Unfortunately, things are not so easy in neural networks. When it comes to matrix calculus, such clever one-step derivations can soon push you into a dead end. Still taking the example above, if all the variables become matrices, we have Y = σ(HW), where H is NxD, W is DxC, and Y is NxC. If we want to know ∂L/∂W, the following equation can be quickly derived: ∂L/∂W = (∂Y/∂W)·(∂L/∂Y).
The derivative ∂L/∂Y is handed to us by the layer above, H is given by the previous layer, and we want to solve ∂Y/∂W here. If you have mastered dimension analysis, perhaps you will consider it to be a DxN matrix, so that multiplying it with the NxC matrix ∂L/∂Y would give the DxC result ∂L/∂W. After that, you will be lost and not know how to proceed from here. It is even hard to write down a draft of the derivative that “seems” to be correct. The reason is that ∂Y/∂W is not a common 2-D matrix. Instead, it is a 4-D tensor that is far out of control for a beginner.
This is a common trap for beginners: trying to figure out the final gradient in one single step, without using any intermediate variables. You are very likely to encounter high-dimensional tensors if you do so, because the derivative of one matrix with respect to another is a 4-D tensor from the very beginning. However, you will find everything within your reach if you frankly add an intermediate variable S = HW. As long as every step performs only a basic arithmetic operation, the gradients we actually pass around stay 2-D matrices, and if all the intermediate variables are matrices, we can keep using dimension analysis.
Let’s try again. We now have Y = σ(S) and S = HW. The incoming gradient dY = ∂L/∂Y is of shape NxC, and dS = ∂L/∂S is of NxC too. Taking Y = σ(S) into account, the most plausible shape of ∂Y/∂S would also be NxC, which is σ′(S) itself. In this way we know this step is actually an element-wise multiplication: dS = dY ⊙ σ′(S). Now that we have dS, the only thing to do is to figure out dW from S = HW. This is identical to the example in the previous post. It is easy to get dW = H^T·dS (and dH = dS·W^T for the previous layer), done.
With all the derivatives known, let’s look back at ∂L/∂W = (∂Y/∂W)·(∂L/∂Y) and see where it went wrong if you mistakenly consider ∂Y/∂W to be a DxN matrix. Look at the result we just derived: dW = H^T·(dY ⊙ σ′(S)). We already know dY and σ′(S), and these two matrices both have shape NxC. No single DxN matrix multiplied with dY can reproduce that element-wise product, no matter how you arrange them. The root cause is that the derivative of Y with respect to W is not a matrix, but a tensor. If we stick to the chain rule, we are able to avoid such tensors and can happily compute each gradient with dimension analysis and scalar calculus.
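The derivation above can be sanity-checked numerically. This is a minimal sketch, with sigmoid chosen as a concrete σ and a plain sum as a stand-in loss (both are my assumptions for illustration, not fixed by the text):

```python
import numpy as np

np.random.seed(0)
N, D, C = 4, 5, 3
H = np.random.randn(N, D)   # given by the previous layer
W = np.random.randn(D, C)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(W):
    # Stand-in scalar loss L = sum(Y), so dL/dY is all ones.
    return sigmoid(H.dot(W)).sum()

# Forward pass with the intermediate variable S = HW, Y = sigma(S).
S = H.dot(W)
Y = sigmoid(S)

# Backward pass, one chain-rule step at a time.
dY = np.ones_like(Y)        # dL/dY, N x C
dS = dY * Y * (1 - Y)       # element-wise: dY * sigma'(S), N x C
dW = H.T.dot(dS)            # D x C, by dimension analysis

# Numerical gradient of W via central differences.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(D):
    for j in range(C):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```

If you swap in other element-wise nonlinearities, only the dS line changes; the dW line stays the same.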
An example: Softmax
We have learned the two rules for computing gradients in neural networks. Let’s try a concrete example now: the softmax layer.
The softmax layer is usually the output layer of a network. Its forward pass is:

score = XW + b
P_i = e^{score_{i, y_i}} / Σ_j e^{score_{i, j}}
Loss = -(1/N) Σ_i log(P_i)

X is an NxD matrix, and if there are C classes, then W should be DxC and b should be 1xC (broadcast over the N rows), so score is NxC. P_i refers to the i-th data point’s predicted probability of its correct class y_i. Please refer to the CS231n course notes for more detailed information on softmax. Our goal here is to compute the derivatives of the Loss with respect to W and b, respectively. To make reading simpler, every derivative dV in the following refers to ∂Loss/∂V.
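As a reference point, here is a minimal NumPy sketch of this forward pass (the sizes and data are made up; W and b start at zero just to make the expected loss easy to reason about):

```python
import numpy as np

np.random.seed(0)
N, D, C = 4, 5, 3
X = np.random.randn(N, D)
W = np.zeros((D, C))                 # D x C
b = np.zeros(C)                      # 1 x C, broadcast over the N rows
y = np.random.randint(C, size=N)     # correct class of each data point

score = X.dot(W) + b                                      # N x C
exp_score = np.exp(score)
P_all = exp_score / exp_score.sum(axis=1, keepdims=True)  # N x C probabilities
P = P_all[range(N), y]               # probability of each correct class
loss = -np.log(P).mean()             # with W = b = 0 every class gets 1/C
```

With W = b = 0 every row of P_all is uniform, so the loss starts at log(C), a handy sanity check when debugging.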
First, we replace P_i with its definition and split the loss:

Loss = -(1/N) Σ_i ( score_{i, y_i} - log Σ_j e^{score_{i, j}} )

Don’t go too fast; let’s consider the first term and the second term one by one. We define rowsum_i = Σ_j e^{score_{i, j}} as an intermediate variable.
rowsum is the exponential sum of every row in score, an Nx1 matrix, so we have

Loss = -(1/N) Σ_i score_{i, y_i} + (1/N) Σ_i log(rowsum_i)
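This split can be checked numerically. A small sketch with made-up data:

```python
import numpy as np

np.random.seed(0)
N, C = 4, 3
score = np.random.randn(N, C)
y = np.random.randint(C, size=N)

# Original form: cross-entropy of the softmax probabilities.
P_all = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
loss1 = -np.log(P_all[range(N), y]).mean()

# Rewritten form: the two terms, using the intermediate rowsum.
rowsum = np.exp(score).sum(axis=1, keepdims=True)   # N x 1
loss2 = (-score[range(N), y] + np.log(rowsum[:, 0])).mean()
```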
Consider the first term first. Its contribution to dscore is of the same size as score. Ignoring the 1/N factor in the front for now, this contribution is all zeros, with a single -1 in every row at the position of that row’s correct class. It is better explained in Python code:
dscore = np.zeros_like(score)
dscore[range(N), y] -= 1
Then let’s compute drowsum from the second term. Since d log(rowsum_i) / d rowsum_i = 1 / rowsum_i, it is simply drowsum = 1 / rowsum (again leaving the 1/N for the end), so simple.
Now it’s time for the second term’s effect on dscore. Do not directly compute ∂rowsum/∂W: they are both matrices and that derivative is hard to compute. Rather, since score is involved, we can compute ∂rowsum/∂score and push the gradient back to dscore. You may still remember that we have computed a contribution to dscore just now, and we compute another one here, which means score is involved in two parts of the calculation of Loss. It is indeed the fact: the first time as the term score_{i, y_i}, and the second time inside rowsum. According to the sum rule of derivatives, we simply add the two contributions together.
When you think in this way, it’s not hard to compute the new contribution: dscore += drowsum combined with ∂rowsum/∂score. The left part is an NxC matrix and the known factor drowsum is Nx1, so the remaining part can be either a 1xC matrix (combined by matrix multiplication) or an NxC matrix (combined element-wise with broadcasting). This requires a bit of deeper consideration. It is not hard to reach the conclusion that it should be NxC, because every element in score affects only one element in rowsum, so there is no reason to sum anything up. A reasonable NxC matrix is exp(score) itself. In the end, we get:
dscore += drowsum * np.exp(score)  # broadcast the N x 1 drowsum over the columns
dscore /= N  # there is a 1/N in the front
We are almost done here. All that is left is computing the derivatives of W and b. I believe you can do it yourself, since this example has appeared many times in this tutorial. :)
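In case you want to check your answer afterwards, here is a sketch of that last step (dscore below is a random stand-in for the gradient computed above):

```python
import numpy as np

np.random.seed(0)
N, D, C = 4, 5, 3
X = np.random.randn(N, D)
dscore = np.random.randn(N, C)   # stand-in for the gradient computed above

# score = X.dot(W) + b, so by dimension analysis:
dW = X.T.dot(dscore)             # (D x N) . (N x C) -> D x C
db = dscore.sum(axis=0)          # b is broadcast over the N rows, so sum them up
```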
You may notice that the second contribution, drowsum * np.exp(score), is actually the full NxC probability matrix of the softmax (the matrix whose entries at the correct classes are our P_i). Using this fact can make the computation faster, but it doesn’t matter if you didn’t notice it.
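Putting all the pieces together, the whole dscore can be verified against a numerical gradient. A self-contained sketch with made-up data, spot-checking one entry of dW:

```python
import numpy as np

np.random.seed(1)
N, D, C = 6, 4, 5
X = np.random.randn(N, D)
W = np.random.randn(D, C)
b = np.random.randn(C)
y = np.random.randint(C, size=N)

def loss_fn(W, b):
    score = X.dot(W) + b
    P_all = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    return -np.log(P_all[range(N), y]).mean()

# Analytic gradient, following the two contributions derived above.
score = X.dot(W) + b
rowsum = np.exp(score).sum(axis=1, keepdims=True)   # N x 1
drowsum = 1 / rowsum
dscore = np.zeros_like(score)
dscore[range(N), y] -= 1             # first term
dscore += drowsum * np.exp(score)    # second term: the probability matrix
dscore /= N

dW = X.T.dot(dscore)
db = dscore.sum(axis=0)

# Numerical spot check on one entry of W, via central differences.
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
dW00_num = (loss_fn(Wp, b) - loss_fn(Wm, b)) / (2 * eps)
```

Each row of dscore sums to zero (probabilities sum to 1, minus the single 1 at the correct class), which is another quick sanity check.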
If you have mastered the two rules, you may find that even the gradients for Batch Normalization won’t block you anymore. Have a try at it.
I hope this helps the beginners who are struggling with gradient computation a little bit.
Thank you for reading!