Matrix Calculus

This post is a summary of *The Matrix Calculus You Need For Deep Learning*.
Review: Scalar Calculus
| Rule | $f(x)$ | $f'(x)$ |
| --- | --- | --- |
| Constant | $c$ | $0$ |
| Multiplication by constant | $cx$ | $c$ |
| Power rule | $x^n$ | $nx^{n-1}$ |
| Sum rule | $f + g$ | $f' + g'$ |
| Product rule | $f \cdot g$ | $f \cdot g' + g \cdot f'$ |
| Quotient rule | $\frac{f}{g}$ | $\frac{f' \cdot g - g' \cdot f}{g^2}$ |
| Chain rule | $f(g)$ | $f'(g) \cdot g'$ |
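As a quick sanity check (an addition of mine, not from the original paper), these rules can be verified symbolically with sympy:

```python
import sympy as sp

x = sp.Symbol("x")
f, g = x**3, sp.sin(x)

# Product rule: d(f*g) = f*g' + g*f'
assert sp.simplify(sp.diff(f * g, x) - (f * sp.diff(g, x) + g * sp.diff(f, x))) == 0
# Chain rule: d(sin(x^3))/dx = cos(x^3) * 3x^2
assert sp.simplify(sp.diff(sp.sin(x**3), x) - sp.cos(x**3) * 3 * x**2) == 0
print("rules check out")
```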
Introduction to Vector Calculus and Partial Derivatives
Most “functions” in deep learning are influenced by multiple variables, and the way we describe the change of the function with respect to each variable is called the partial derivative. Consider the function $f(x, y)$; its partial derivatives are $\frac{\partial f(x, y)}{\partial x}$ and $\frac{\partial f(x, y)}{\partial y}$. But we would like to represent these in vector / matrix form.
For a single function, e.g. $f(x, y) = 3x^2y$, we can write the partial derivatives as
$$
\frac{\partial f(x, y)}{\partial x} = 6xy, \quad \frac{\partial f(x, y)}{\partial y} = 3x^2
$$
and arrange them in vector form:
$$
\nabla f = \begin{bmatrix}
\frac{\partial f(x, y)}{\partial x} \\
\frac{\partial f(x, y)}{\partial y}
\end{bmatrix}
= \begin{bmatrix}
6xy \\
3x^2
\end{bmatrix}
$$
And when we want the gradient of more than one function, say another function $g(x, y) = 2x + y^8$, we can write its gradient as
$$
\nabla g = \begin{bmatrix}
\frac{\partial g(x, y)}{\partial x} \\
\frac{\partial g(x, y)}{\partial y}
\end{bmatrix}
= \begin{bmatrix}
2 \\
8y^7
\end{bmatrix}
$$
And we can organize these two gradients in matrix form:
$$
J = \begin{bmatrix}
\nabla f(x, y) \\
\nabla g(x, y)
\end{bmatrix}
= \begin{bmatrix}
\frac{\partial f(x, y)}{\partial x} & \frac{\partial f(x, y)}{\partial y} \\
\frac{\partial g(x, y)}{\partial x} & \frac{\partial g(x, y)}{\partial y}
\end{bmatrix}
= \begin{bmatrix}
6xy & 3x^2 \\
2 & 8y^7
\end{bmatrix}
$$
This is called the Jacobian matrix. This particular form is called the numerator layout, and its transpose is called the denominator layout (probably because, in the numerator layout, the numerator of the partials stays the same across each row).
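For reference, sympy's `Matrix.jacobian` builds exactly this numerator-layout Jacobian; here is a minimal sketch using the $f$ and $g$ above (example mine, not from the source paper):

```python
import sympy as sp

x, y = sp.symbols("x y")
F = sp.Matrix([3 * x**2 * y, 2 * x + y**8])  # stack f and g into one vector function

# Row i of the result is the gradient of F[i]: the numerator layout.
J = F.jacobian([x, y])
print(J)  # Matrix([[6*x*y, 3*x**2], [2, 8*y**7]])
```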
Generalisation of Jacobian Matrix
Instead of writing out multiple variables, we can collect them in a single vector, so $f(x_1, x_2, \ldots, x_i, \ldots, x_n) = f(\mathbf{x})$. And if each function $f_i$ gives an output $y_i$, then for multiple functions $f_1, f_2, \ldots, f_m$ we can write the outputs as a vector $\mathbf{y}$:
$$
y_1 = f_1(\mathbf{x}) \\
y_2 = f_2(\mathbf{x}) \\
\vdots \\
y_m = f_m(\mathbf{x})
$$
And taking the Jacobian,
$$
J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}
= \begin{bmatrix}
\nabla f_1(\mathbf{x}) \\
\nabla f_2(\mathbf{x}) \\
\vdots \\
\nabla f_m(\mathbf{x})
\end{bmatrix}
= \begin{bmatrix}
\frac{\partial f_1(\mathbf{x})}{\partial x_1} & \frac{\partial f_1(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\
\frac{\partial f_2(\mathbf{x})}{\partial x_1} & \frac{\partial f_2(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_2(\mathbf{x})}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m(\mathbf{x})}{\partial x_1} & \frac{\partial f_m(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n}
\end{bmatrix}
$$
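When no symbolic form is handy, the same $m \times n$ Jacobian can be approximated numerically. Below is a minimal central-difference sketch; `numerical_jacobian` is a hypothetical helper written for illustration, not a library routine:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m-by-n Jacobian of f: R^n -> R^m at x
    with central differences. An illustrative sketch, not part
    of the original post."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return J

# f and g from the earlier 2D example, stacked into one vector function.
F = lambda v: np.array([3 * v[0]**2 * v[1], 2 * v[0] + v[1]**8])
print(numerical_jacobian(F, [1.0, 2.0]))  # ~ [[12, 3], [2, 1024]]
```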
Hessian Matrix
For a scalar function $f(\mathbf{x}): \mathbb{R}^n \to \mathbb{R}$, we can write the second derivative, the Hessian matrix, as
$$
H = \begin{bmatrix}
\frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2}
\end{bmatrix}
$$
( H f ) i , j = ∂ 2 f ( x ) ∂ x i ∂ x j (H_f)_{i, j} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j} ( H f ) i , j = ∂ x i ∂ x j ∂ 2 f ( x )
Important Matrix Derivatives
| Function | Derivative |
| --- | --- |
| $\nabla_x\, c$ | $0$ |
| $\nabla_x\, (x + y)$ | $I$ |
| $\nabla_y\, (x - y)$ | $-I$ |
| $\nabla_x\, \big(f(x)^T g(x)\big)$ | $(\nabla_x f(x))\, g(x) + (\nabla_x g(x))\, f(x)$ |
| $\nabla_x\, Ax$ | $A^T$ |
| $\nabla_x\, x^T A x$ | $(A^T + A)x$ |
| $\nabla_x\, g(f(x))$ | $(\nabla_x f^T)\, \nabla_f g$ |
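As a numerical spot check (added here for illustration, not from the source), the rule $\nabla_x\, x^T A x = (A^T + A)x$ can be verified with central differences on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)
eps = 1e-6

# Central-difference gradient of f(x) = x^T A x, one coordinate at a time.
grad_fd = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_fd, (A.T + A) @ x, atol=1e-4))  # True
```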
References
https://www.youtube.com/watch?v=ny-i8_9NtHA
https://ccrma.stanford.edu/~dattorro/matrixcalc.pdf