# How to derive gradients in backprop without knowing matrix calculus (Part 2)

In the previous post, we showed how powerful and effective dimension analysis is. However, with it alone you will still step in trouble when taking derivatives in back propagation. The second rule is also quite important to know.

### Rule two: Stick to the chain rule.

I used to think that the chain rule is only for beginners. For example, if , I can figure out in ten seconds. What is the point of applying chain rule on such a simple compound function?

Unfortunately, things are not so easy in neural networks. When it comes to matrix calculus, such clever derivations can soon push you into a dead end. Still taking the example above, if all variables are matrices, we have . If we want to know , the following equation can be quickly derived:

The derivative of `L`

to `H`

is given by the previous layer and we want to solve here. If you have mastered dimension analysis, perhaps you will consider it to be a `DxN`

matrix. After that, you will be lost and don’t know how to proceed from here. It is even hard to give an appropriate draft of the derivative that “seems” to be correct. The reason is that is not a common 2-D matrix. Instead, it is a 4-D tensor that is far out of control for a beginner.

This is a common trap for a beginner where they try to figure out the final gradient in one single step, not using any intermediate variables. You will be very likely to encounter high dimensional tensors if you do so because the result of a matrix-matrix gradient is supposed to be a 4-D tensor from the very beginning. However, you will find everything in your reach if you frankly added an intermediate variable . Matrix-matrix gradients are 2-D matrices if a basic arithmetic operation is performed only. If all the intermediate variables are matrices, we will be able to use dimension analysis.

Let’s try again. We have

where `dS`

is of `NxC`

and `dH`

is of `NxC`

too. Taking into account, the most possible shape of would also be `NxC`

, which is itself. In this way we know this is actually an element-wise multiplication.

Now we have

, the only thing to do is to figure out

.

This is identical to the example in the previous post. It is easy to get , done.

With all the derivatives known, let’s look back to

and see where went wrong if you mistakenly consider to be a `DxN`

matrix. Look at the following equation

,

we already know that , , these two matrices have shapes `NxC`

and `DxN`

respectively. You will never be able to get a `DxN`

matrix no matter how you arrange them. The root cause is that the derivative of `H`

to `W`

is not a matrix, but a tensor. If we stick to the chain rule, we are able to avoid such tensors and can happily compute each gradient with dimension analysis and scalar calculus.

### An example: Softmax

We have learned the two rules in computing gradients in neural networks. Let’s try a concrete example now: the softmax layer.

Softmax layer is usually the output layer of a network. Its forward pass is:

Suppose `X`

is a `NxD`

matrix, and there are `C`

classes, then `W`

should be `DxC`

, and `b`

should be `1xC`

. `Pi`

refers to the i-th data’s predicted probability to be its correct class. Please refer to the course notes in CS231n to get more detailed information on softmax. Our goal here is to compute the derivatives of `Loss`

in terms of `W`

, `X`

and `b`

respectively. To make reading simpler, all the derivatives in the following part refers to .

First, we replace `P`

in `Loss`

:

Don’t go too fast, let’s consider the first term and second term one by one. We define as an intermediate variable. `rowsum`

is the exponential sum of every row in `score`

, which is a `Nx1`

matrix, so we have

Consider first. It is of the same size `NxC`

as `score`

. Ignoring the 1/N term in the front, `d(score)`

is actually all `zero`

s, with only `one`

s in every row at the position of their correct classes. It is better explained in Python code:

```
dscore = np.zeros_like(score)
dscore[range(N),y] -= 1
```

Then let’s compute , it is…, so simple.

Now it’s time for . Do not directly compute because they are both matrices and the derivative is hard to compute. Rather, since `score`

is involved we can compute . You may still remember that we have computed a `d(score)`

just now, and we compute another here, which means `score`

is involved into two parts of calculations in `Loss`

, and it is indeed the fact (the first time as , and the second time in `rowsum`

). We can add them together according to the rule of taking derivatives.

When you think in this way, it’s not hard to compute `d(scores)`

anymore:

The left part is a `NxC`

matrix, the right part has a known matrix of `Nx1`

, so the remaining part can be either `1xC`

or `NxC`

. This requires a bit of deeper consideration. It is not hard to reach the conclusion that it should be of `NxC`

because every element in `score`

affects only one element in `rowsum`

, so there is no reason to sum it up. A reasonable `NxC`

matrix is itself. At the end, we got:

```
dscore += drowsum.dot(np.exp(score))
dscore /= N # there is a 1/N in the front
```

We are almost done here. All the left is computing the derivatives of `score`

to `W`

, `X`

and `b`

, I believe you can do it yourself since this example has appeared many times in this tutorial. :)

You may notice that the result in the second part is actually `P`

itself. It can make the computation faster but it doesn’t matter if you didn’t notice it.

If you have mastered the two rules, you may find that even the gradients for Batch Normalization won’t block you off anymore. Have a try at it.

Hope this can help the beginners who are struggling with gradient computing out a little bit.

Thank you for reading!

## Leave a Comment