Matrix Calculus


This post is a summary of The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard (https://arxiv.org/abs/1802.01528).

Review: Scalar Calculus

| Rule | $f(x)$ | $f'(x)$ |
| --- | --- | --- |
| Constant | $c$ | $0$ |
| Multiplication by constant | $cx$ | $c$ |
| Power rule | $x^n$ | $nx^{n-1}$ |
| Sum rule | $f + g$ | $f' + g'$ |
| Product rule | $f \cdot g$ | $f \cdot g' + g \cdot f'$ |
| Quotient rule | $\frac{f}{g}$ | $\frac{f' \cdot g - g' \cdot f}{g^2}$ |
| Chain rule | $f(g)$ | $f'(g) \cdot g'$ |
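As a quick sanity check, here is a minimal sketch (assuming JAX is installed) that verifies the product and chain rules numerically with automatic differentiation; the functions `f` and `g` below are arbitrary choices for illustration:

```python
import jax
import jax.numpy as jnp

f = lambda x: x**3        # power rule gives f'(x) = 3x^2
g = lambda x: jnp.sin(x)  # g'(x) = cos(x)

x = 2.0

# Product rule: (f*g)' = f*g' + g*f'
lhs = jax.grad(lambda x: f(x) * g(x))(x)
rhs = f(x) * jax.grad(g)(x) + g(x) * jax.grad(f)(x)
print(jnp.allclose(lhs, rhs))  # True

# Chain rule: (f(g(x)))' = f'(g(x)) * g'(x)
lhs = jax.grad(lambda x: f(g(x)))(x)
rhs = jax.grad(f)(g(x)) * jax.grad(g)(x)
print(jnp.allclose(lhs, rhs))  # True
```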

Introduction to Vector Calculus and Partial Derivatives

Most “functions” in deep learning depend on multiple variables, and the derivative of the function with respect to each individual variable is called a partial derivative. Consider the function $f(x, y)$: its partial derivatives are $\frac{\partial f(x, y)}{\partial x}$ and $\frac{\partial f(x, y)}{\partial y}$. We would like to represent these in vector / matrix form.

For a single function, e.g. $f(x, y) = 3x^2y$, we can write the partial derivatives as

$$\frac{\partial f(x, y)}{\partial x} = 6xy, \quad \frac{\partial f(x, y)}{\partial y} = 3x^2$$

and can arrange them in vector form as the gradient

$$\nabla f = \begin{bmatrix} \frac{\partial f(x, y)}{\partial x} \\ \frac{\partial f(x, y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy \\ 3x^2 \end{bmatrix}$$
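A minimal sketch of the same computation with JAX's `jax.grad` (the sample point $(x, y) = (2, 5)$ is an arbitrary choice):

```python
import jax
import jax.numpy as jnp

def f(v):
    x, y = v
    return 3 * x**2 * y

v = jnp.array([2.0, 5.0])
print(jax.grad(f)(v))  # [6xy, 3x^2] at (2, 5) -> [60., 12.]
```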

And when we want to find the gradient of more than one function, say another function $g(x, y) = 2x + y^8$, we can write its gradient as

$$\nabla g = \begin{bmatrix} \frac{\partial g(x, y)}{\partial x} \\ \frac{\partial g(x, y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 2 \\ 8y^7 \end{bmatrix}$$

And we can organize these two gradients in matrix form:

$$J = \begin{bmatrix} \nabla f(x, y) \\ \nabla g(x, y) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x, y)}{\partial x} & \frac{\partial f(x, y)}{\partial y} \\ \frac{\partial g(x, y)}{\partial x} & \frac{\partial g(x, y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy & 3x^2 \\ 2 & 8y^7 \end{bmatrix}$$

This is called the Jacobian matrix. This particular form is called the numerator layout, and its transpose is called the denominator layout (presumably because, in the numerator layout, the numerator of each partial stays the same along a row).
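To see the numerator layout concretely, a short sketch that stacks $f$ and $g$ into one vector-valued function and asks JAX for the Jacobian (the evaluation point is an arbitrary choice):

```python
import jax
import jax.numpy as jnp

def F(v):
    x, y = v
    return jnp.array([3 * x**2 * y,    # f(x, y)
                      2 * x + y**8])   # g(x, y)

v = jnp.array([2.0, 1.0])
# jax.jacobian uses the numerator layout: row i holds the partials of output i.
print(jax.jacobian(F)(v))
# [[6xy  3x^2]   [[12. 12.]
#  [2    8y^7]] = [ 2.  8.]]  at (x, y) = (2, 1)
```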

Generalisation of the Jacobian Matrix

Instead of writing out multiple variables, we can collect them in a single vector, so $f(x_1, x_2, \ldots, x_n) = f(\mathbf{x})$. If each function $f_i$ gives an output $y_i$, then for multiple functions $f_1, f_2, \ldots, f_m$ we can write the outputs as a vector $\mathbf{y}$:

$$\begin{aligned} y_1 &= f_1(\mathbf{x}) \\ y_2 &= f_2(\mathbf{x}) \\ &\;\vdots \\ y_m &= f_m(\mathbf{x}) \end{aligned}$$

And taking the Jacobian,

$$J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \frac{\partial f_1(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \frac{\partial f_2(\mathbf{x})}{\partial x_1} & \frac{\partial f_2(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_2(\mathbf{x})}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \frac{\partial f_m(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}$$
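The general shape is worth checking once: $m$ outputs and $n$ inputs give an $m \times n$ Jacobian. A sketch, with `tanh(A @ x)` as an arbitrary stand-in for $m$ functions of $n$ variables:

```python
import jax
import jax.numpy as jnp

n, m = 4, 3
A = jnp.arange(float(m * n)).reshape(m, n)  # arbitrary fixed weights

def F(x):
    return jnp.tanh(A @ x)  # f_i(x) = tanh(a_i . x), i = 1..m

J = jax.jacobian(F)(jnp.ones(n))
print(J.shape)  # (3, 4): row i is the gradient of f_i
```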

Matrix Differential Representations

Hessian Matrix

For a scalar function $f(\mathbf{x}): \mathbb{R}^n \to \mathbb{R}$, we can write the matrix of second derivatives, the Hessian matrix, as

$$H = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}, \qquad (H_f)_{i, j} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}$$
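A short sketch computing a Hessian with `jax.hessian`; the test function $f(\mathbf{x}) = x_1^2 x_2 + x_2^3$ is an arbitrary choice, picked so the entries are easy to check by hand:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 * x[1] + x[1]**3

x = jnp.array([1.0, 2.0])
H = jax.hessian(f)(x)
print(H)
# [[2*x2  2*x1]   [[ 4.  2.]
#  [2*x1  6*x2]] = [ 2. 12.]]  at (1, 2)
print(jnp.allclose(H, H.T))  # True: mixed partials commute for smooth f
```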

Important Matrix Derivatives

| Function | Derivative |
| --- | --- |
| $\nabla_x\, c$ | $0$ |
| $\nabla_x\, (x + y)$ | $I$ |
| $\nabla_y\, (x - y)$ | $-I$ |
| $\nabla_x\, (f(x)^T g(x))$ | $(\nabla_x f(x))\, g(x) + (\nabla_x g(x))\, f(x)$ |
| $\nabla_x\, Ax$ | $A^T$ |
| $\nabla_x\, x^T A x$ | $(A^T + A)x$ |
| $\nabla_x\, g(f(x))$ | $(\nabla_x f^T)\, \nabla_f g$ |
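Two rows of this table are easy to confirm numerically. A sketch, with a random $A$ and an arbitrary $x$; note that the table uses the gradient (denominator-layout) convention, so JAX's numerator-layout Jacobian of $Ax$ comes out as $A$, the transpose of the $A^T$ listed above:

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (3, 3))
x = jnp.arange(1.0, 4.0)

# grad_x (x^T A x) = (A^T + A) x
print(jnp.allclose(jax.grad(lambda x: x @ A @ x)(x), (A.T + A) @ x))  # True

# Jacobian of Ax in numerator layout is A itself, i.e. (A^T)^T
print(jnp.allclose(jax.jacobian(lambda x: A @ x)(x), A))  # True
```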

References

https://www.youtube.com/watch?v=ny-i8_9NtHA
https://ccrma.stanford.edu/~dattorro/matrixcalc.pdf