Math Refresher

This review will cover the calculus and linear algebra necessary to perform the forward and backward passes of a basic neural network.

Calculus Review

Derivatives:

A derivative is a rate of change. It tells us how much one quantity changes as another is changed. If we have a function of a single variable, for example \(f(x) = x^2\), the derivative of \(f\) tells us how \(f\) changes as \(x\) changes. Geometrically, the derivative of \(f\) at a chosen value of \(x\) is the slope of the line tangent to \(f\) at that point. The chart below shows the plot of \(f(x) = x^2\) and a line tangent to \(f\) at \(x = 2.\) The slope of this tangent line equals the derivative of \(f\) at \(x = 2.\)

To calculate the derivative, take the slope over an infinitesimally small region. Remember, the slope is rise over run. So for our example \(f(x) = x^2\), the derivative of \(f\), denoted \(f'\) (or \(f'(x)\) or \(\frac{df}{dx}\)), is \(2x\):

\[f' = \lim_{h\to0} \frac{f(x + h) - f(x)}{h}\] \[= \lim_{h\to0} \frac{(x + h)^2 - x^2}{h}\] \[= \lim_{h\to0} \frac{x^2 + 2xh + h^2 - x^2}{h}\] \[= \lim_{h\to0} (2x + h)\] \[= 2x\]

Thus, \(f'(2) = 2 \cdot 2 = 4\), which is why the slope of the red line in the chart above is 4.
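The limit definition also lends itself to a quick numerical check: pick a small \(h\) and compute rise over run directly. A minimal Python sketch (the helper name `numerical_derivative` is my own, not from any library):

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) as rise over run across a tiny interval h."""
    return (f(x + h) - f(x)) / h

# f(x) = x^2, so f'(x) = 2x and the slope at x = 2 should be close to 4.
slope_at_2 = numerical_derivative(lambda x: x**2, 2.0)
```

The approximation is off by roughly \(h\) itself, matching the \(2x + h\) term in the derivation above before the limit is taken.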

If a function has multiple terms, the derivative of each term can be calculated independently and then summed. For example, if \(f(x) = x^2 + \cos(x)\), then \(f'(x) = 2x - \sin(x)\) because the derivative of \(x^2\) is \(2x\) and the derivative of \(\cos(x)\) is \(-\sin(x).\) It's a good idea to memorize the derivatives of common functions. Here's a cheat sheet.

Lastly, to calculate the derivative of a product or quotient of functions use the product rule or quotient rule:

Product rule: \[(f \cdot g)' = f' \cdot g + f \cdot g'\]

Quotient rule: \[(\frac{f}{g})' = \frac{f' \cdot g - f \cdot g'}{g^2}\]
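Both rules can be sanity-checked numerically. The sketch below, using an assumed central-difference helper `d`, verifies the product rule for \(f(x) = x^2\) and \(g(x) = \cos(x)\):

```python
import math

def d(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
f = lambda t: t**2   # f'(t) = 2t
g = math.cos         # g'(t) = -sin(t)

numeric = d(lambda t: f(t) * g(t), x)                       # derivative of the product
product_rule = 2 * x * math.cos(x) + x**2 * (-math.sin(x))  # f'g + fg'
```

The two values agree to within the accuracy of the finite-difference step.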

One minor note: Since \(f\) is a function of only one variable, it's clear that when we say "the derivative of \(f\)" we mean "the derivative of \(f\) with respect to \(x\)". As we'll see next, when the function has more than one variable it's necessary to be explicit about which variable we're taking the derivative with respect to.

Partial Derivatives:

A partial derivative of a function with more than one variable is a derivative with respect to one variable while the rest are kept constant. The most common notation is \(\frac{\partial f}{\partial x}\): the function you're taking the derivative of is in the numerator and the variable you're taking the derivative with respect to is in the denominator.

Let's use an example: \(f(x, y) = 2x^2y+3y.\) To compute \(\frac{\partial f}{\partial x}\) simply treat \(y\) as a constant: \(\frac{\partial f}{\partial x} = 4xy.\) Similarly, to compute \(\frac{\partial f}{\partial y}\) we treat x as a constant: \(\frac{\partial f}{\partial y} = 2x^2 + 3.\)
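Numerically, a partial derivative is just an ordinary derivative where only one input is nudged and the rest stay fixed. A small sketch (the helper name `partial` is illustrative):

```python
def partial(f, var_index, point, h=1e-6):
    """Estimate a partial derivative of f at `point`, nudging only one variable."""
    bumped = list(point)
    bumped[var_index] += h
    return (f(*bumped) - f(*point)) / h

f = lambda x, y: 2 * x**2 * y + 3 * y

# At (x=1, y=2): df/dx = 4xy = 8 and df/dy = 2x^2 + 3 = 5.
df_dx = partial(f, 0, (1.0, 2.0))
df_dy = partial(f, 1, (1.0, 2.0))
```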

Total Derivatives:

Unlike the partial derivative, which holds all variables but one constant, the total derivative does not assume the other variables are held constant. Again, let \(f(x, y) = 2x^2y+3y.\) We know that \(\frac{\partial f}{\partial x} = 4xy.\) But what if \(y = x^2\)? The partial derivative does not give the true rate of change of \(f\) as \(x\) changes, since it assumes \(y\) is fixed. In order to compute the total derivative, we need to take a brief detour to cover the chain rule.

The chain rule allows us to compute the derivative of a function whose inputs are also functions (also known as a composite function), such as \(f(g(x)).\) The chain rule says that if \(f\) is a function of \(g\), which is a function of \(x\), then the derivative of \(f\) with respect to \(x\) is the derivative of \(f\) with respect to \(g\) times the derivative of \(g\) with respect to \(x\): \[\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)\] As an example, let \(f(g) = g^3\) and \(g(x) = x^2.\) Then \(\frac{df}{dg} = 3g^2\) and \(\frac{dg}{dx} = 2x.\) Thus, \(\frac{df}{dx} = 3g^2 \cdot 2x = 3(x^2)^2 \cdot 2x = 6x^5.\)
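The same example can be checked numerically: differentiate the composition directly and compare against the chain-rule product. A sketch with an assumed central-difference helper `d`:

```python
def d(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

g = lambda x: x**2
f = lambda u: u**3

x = 2.0
numeric = d(lambda t: f(g(t)), x)   # derivative of f(g(x)) computed directly
chain_rule = 3 * g(x)**2 * (2 * x)  # f'(g(x)) * g'(x) = 6x^5
```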

Now let's return to our example where \(f(x, y) = 2x^2y+3y.\) To calculate the total derivative, \(\frac{df}{dx}\), when \(y = x^2\), we apply the chain rule:

\[\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{dy}{dx}\]

\(= 4xy + (2x^2 + 3) \cdot 2x\) \(= 4x^3 + 4x^3 + 6x\) (after substituting \(y = x^2\)) \(= 8x^3 + 6x\)

Note that this gives the same answer we would get if we substituted \(x^2\) for \(y\) in our original function \(f\) (giving \(f(x) = 2x^4 + 3x^2\)) and then took the derivative with respect to \(x\). Also note that this is a different answer than if we took the partial derivative with respect to \(x\), \(\frac{\partial f}{\partial x} = 4xy\), and then substituted \(x^2\) for \(y\), giving \(4x^3.\)
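A numerical check makes the distinction concrete: differentiating \(f(x, y(x))\) along the curve \(y = x^2\) matches the total derivative \(8x^3 + 6x\), not the partial derivative \(4xy\) alone. A sketch with an assumed central-difference helper `d`:

```python
def d(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x, y: 2 * x**2 * y + 3 * y
y = lambda x: x**2

x = 1.5
numeric = d(lambda t: f(t, y(t)), x)  # true rate of change of f along y = x^2
total = 8 * x**3 + 6 * x              # total derivative from the chain rule
partial_only = 4 * x * y(x)           # partial derivative alone: misses y's dependence on x
```

At \(x = 1.5\) the total derivative is \(36\), while the partial derivative alone gives only \(13.5\).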

One final note on total derivatives: Say we have \(f(x, y, z)\), and \(x(t), y(t), z(t).\) Then \[\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt}\] A common question is, "Why are we adding these terms?" Each term represents how \(f\) is affected as \(t\) is increased by an infinitesimal amount, so the total amount \(f\) is affected is simply the sum of those terms. One way to view it is that each term is a different path that the change in \(t\) flows through as it affects \(f.\)

Gradient:

The gradient of a scalar-valued function (i.e. a function that outputs a single value, as opposed to say a vector) is a vector with a number of dimensions equal to the number of variables in the function. For example, the gradient of \(f(x, y, z)\) will be a three-dimensional vector. The value of each component in the vector is the partial derivative of \(f\) with respect to a different variable (note the nabla symbol, \(\nabla\), also referred to as del, is the gradient operator): \[\nabla f = \sum_{i=1}^{n}\vec{e_i}\frac{\partial f}{\partial x_i} = (\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n})\] Here's an example where \(f\) is a function of three variables. Thus, the output is a three-dimensional vector: \[f(x, y, z) = x + y^2 + z^3\]

\(\nabla f = 1\vec{e_x} + 2y\vec{e_y} + 3z^2\vec{e_z}\) \(= (1, 2y, 3z^2)\)

The gradient tells us the direction of fastest increase in \(f\) at any point, and how fast \(f\) increases in that direction. That is, given some point, for example \((x=3, y=4, z=5)\), the gradient tells us in what direction the function increases the most and what the magnitude of that increase is at that point. Using our example point \(p = (3, 4, 5)\) and our gradient \((1, 2y, 3z^2)\), the gradient at \(p\) is \(\nabla f(p) = (1, 2 \cdot 4, 3 \cdot 5^2) = (1, 8, 75).\) This is a three-dimensional vector that points 1 unit in the \(x\) direction, 8 units in the \(y\) direction, and 75 units in the \(z\) direction, and has magnitude \(\sqrt{1^2 + 8^2 + 75^2} \approx 75.43.\)
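The gradient computation above can be sketched in Python by collecting one numerical partial derivative per input variable (the helper name `gradient` is my own):

```python
import math

def gradient(f, point, h=1e-6):
    """Estimate the gradient: one partial derivative per input variable."""
    grads = []
    for i in range(len(point)):
        bumped = list(point)
        bumped[i] += h
        grads.append((f(*bumped) - f(*point)) / h)
    return grads

f = lambda x, y, z: x + y**2 + z**3

grad = gradient(f, (3.0, 4.0, 5.0))             # analytically (1, 2y, 3z^2) = (1, 8, 75)
magnitude = math.sqrt(sum(g**2 for g in grad))  # about 75.43
```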

Jacobian:

The last calculus item to cover is the Jacobian. If \(f\) is a vector-valued function (i.e. rather than output a scalar it outputs a vector), then the Jacobian matrix of \(f\) is a matrix where each element is the first-order partial derivative of one of the dimensions of the output vector with respect to one of the input variables. Essentially this is just the gradient of a vector-valued function. When we took the gradient of a scalar-valued function we had a vector as the output. If we treat each component of the output as a scalar, take the gradient of it, and then stack each of these gradient vectors in different rows the result will be the Jacobian matrix. Example:

If we have the following vector-valued function: \[\boldsymbol{f}(x, y) = \begin{bmatrix}f_1(x, y)\\f_2(x, y)\end{bmatrix} = \begin{bmatrix}xy^2\\x^2 + y^3\end{bmatrix}\]

Then the Jacobian is:

\(\boldsymbol{J_f(x, y)}\)\( = \begin{bmatrix}\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y}\\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}\end{bmatrix}\)\( = \begin{bmatrix}y^2 & 2xy\\ 2x & 3y^2\end{bmatrix}\)

Why is this relevant? Well, in neural networks we have multiple outputs that are functions of multiple inputs. And we want to know how each of the outputs change with each of the inputs. The Jacobian tells us exactly that.
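Following the description above, a Jacobian can be sketched as the gradient of each output component stacked into a row (the helper name `jacobian` is illustrative):

```python
def jacobian(f, point, h=1e-6):
    """Stack the gradient of each output component into one row of the Jacobian."""
    base = f(*point)
    rows = [[0.0] * len(point) for _ in base]
    for j in range(len(point)):          # one column per input variable
        bumped = list(point)
        bumped[j] += h
        out = f(*bumped)
        for i in range(len(base)):       # one row per output component
            rows[i][j] = (out[i] - base[i]) / h
    return rows

f = lambda x, y: (x * y**2, x**2 + y**3)

# Analytically [[y^2, 2xy], [2x, 3y^2]] = [[9, 12], [4, 27]] at (x, y) = (2, 3).
J = jacobian(f, (2.0, 3.0))
```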

Linear Algebra Review

Dot Product:

The dot product is an operation that takes two vectors of equal length and returns a scalar. The computation is done by multiplying the vectors elementwise and then summing the results. For example, take the two vectors \(\boldsymbol{a} = (1, 2, 3)\) and \(\boldsymbol{b} = (2, -2, 4).\) The dot product of these two vectors is \(\boldsymbol{a} \cdot \boldsymbol{b} = 1\cdot2 + 2\cdot -2 + 3\cdot4 = 10.\) More generally, the dot product between two \(n\)-dimensional vectors is: \[\boldsymbol{a} \cdot \boldsymbol{b} = \sum_{i=1}^{n}a_i b_i\] Why is this useful? One reason is that it provides a convenient way to represent linear expressions. For example: \(c_1 x_1 + c_2 x_2 + c_3 x_3 = \boldsymbol{c} \cdot \boldsymbol{x}\) if \(\boldsymbol{c} = (c_1, c_2, c_3)\) and \(\boldsymbol{x} = (x_1, x_2, x_3).\)

Additionally, one commonly used property of the dot product is the fact that the dot product of two vectors equals the product of their magnitudes times the cosine of the angle between them: \[\boldsymbol{a} \cdot \boldsymbol{b} = \vert \boldsymbol{a} \vert \vert \boldsymbol{b} \vert \cos{\theta}\] This is often used to get a similarity score between two vectors, i.e. how much they point in the same direction. If we divide the dot product by the product of the vectors' magnitudes we can recover the cosine of the angle between them. Let's do that for our example above where \(\boldsymbol{a} = (1, 2, 3)\), \(\boldsymbol{b} = (2, -2, 4)\), and \(\boldsymbol{a} \cdot \boldsymbol{b} = 10\):

\[\cos{\theta} = \frac{\boldsymbol{a} \cdot \boldsymbol{b}}{\vert \boldsymbol{a} \vert \vert \boldsymbol{b} \vert}\] \[= \frac{10}{\sqrt{1^2 + 2^2 + 3^2} \sqrt{2^2 + (-2)^2 + 4^2}}\] \[\approx 0.546\]
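Both the dot product and the cosine similarity derived from it take only a few lines of plain Python (the function names here are my own):

```python
import math

def dot(a, b):
    """Multiply elementwise, then sum."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = (1, 2, 3)
b = (2, -2, 4)

d = dot(a, b)                        # 1*2 + 2*(-2) + 3*4 = 10
cos_theta = cosine_similarity(a, b)  # about 0.546
```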

Matrix Multiplication:

The dimensions of a matrix are typically specified as \(m \times n\), where \(m\) is the number of rows and \(n\) is the number of columns. There are a few key things to remember about matrix multiplication:

  1. The number of columns of the first matrix must equal the number of rows of the second matrix. Otherwise, matrix multiplication cannot be performed.
  2. The resulting matrix has dimensions \(m_A \times n_B\), where \(m_A\) is the number of rows of the first matrix and \(n_B\) is the number of columns of the second matrix.
  3. Matrix multiplication is non-commutative. That is, generally speaking, for two matrices \(\boldsymbol{A}\) and \(\boldsymbol{B}\), \(\boldsymbol{AB} \neq \boldsymbol{BA}.\)

Matrix multiplication is essentially just a collection of dot products. It takes two matrices and outputs one matrix. To compute entry \((i, j)\) of the output matrix, take the dot product of row \(i\) of matrix \(\boldsymbol{A}\) with column \(j\) of matrix \(\boldsymbol{B}\). Here's an example:

\(\boldsymbol{A} = \begin{bmatrix}1 & 2\\ 3 & 4\\ 5 & 6\end{bmatrix}\)   \(\boldsymbol{B} = \begin{bmatrix}7 & 8\\ 9 & 10\end{bmatrix}\)   \(\boldsymbol{C} = \boldsymbol{A} \cdot \boldsymbol{B}\)

\[\boldsymbol{C} = \begin{bmatrix}1\cdot7+2\cdot9 & 1\cdot8+2\cdot10\\ 3\cdot7+4\cdot9 & 3\cdot8+4\cdot10\\ 5\cdot7+6\cdot9 & 5\cdot8+6\cdot10\end{bmatrix} = \begin{bmatrix}25 & 28\\ 57 & 64\\ 89 & 100\end{bmatrix}\]

Since the shape of \(\boldsymbol{A}\) is \(3 \times 2\) and the shape of \(\boldsymbol{B}\) is \(2 \times 2\) we can perform the matrix multiplication and the resulting matrix will be of shape \(3 \times 2.\)
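The row-times-column rule translates directly into code. A minimal Python sketch with no library assumed (in practice you'd reach for `numpy`):

```python
def matmul(A, B):
    """C[i][j] is the dot product of row i of A with column j of B."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
B = [[7, 8], [9, 10]]         # 2 x 2

C = matmul(A, B)  # 3 x 2 result
```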

Matrix Transpose:

The last item to briefly cover is transposes. A transpose of a matrix is simply a swapping of the rows and columns. That is, the values in the first row become the values in the first column, the values in the second row become the values in the second column, etc. Here's an example for our matrix \(\boldsymbol{A}:\) \[\boldsymbol{A}^\intercal = \begin{bmatrix}1 & 3 & 5\\ 2 & 4 & 6\end{bmatrix}\]
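In code, a transpose is nearly a one-liner: `zip(*M)` regroups row \(i\) into column \(i\). A small sketch:

```python
def transpose(M):
    """Rows become columns: entry (i, j) moves to (j, i)."""
    return [list(row) for row in zip(*M)]

A = [[1, 2], [3, 4], [5, 6]]
A_T = transpose(A)  # [[1, 3, 5], [2, 4, 6]]
```

Transposing twice returns the original matrix, which is an easy property to check.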