Warning ⚠: This is not meant to be a rigouros explanation of automatic differentiation. These must be considered as some notes that I wrote for myself, to better understand the topic of automatic differentiation. Furthermore, these notes are far to be completed and need to be refined.

Several computational techniques (minimization algorithms, training of Deep Neural Networks, Hamiltonian MonteCarlo) requires the computation of gradients, jacobian, and hessians. While we all learnt how to compute these quantities during calculus courses, these techniques may be not well suited when dealing with numerical computations.

In the reminder of this post, I'll walk through the main techniques that can be used to compute derivatives:

Symbolic approach
Finite difference derivatives
Forward algorithmic differentiation
Backward algorithmic differentiation

Example

Let us consider the following function $f(\boldsymbol{x}):\mathbb{R}^3\rightarrow\mathbb{R}$

f(x_1,x_2, x_3)= \sin x_1 + x_1 x_2 + \exp(x_2 + x_3)

For instance, if $\boldsymbol{x}_0= (0,0,0)$ , which is the value of the gradient of $f$ evaluated at $\boldsymbol{x}_0$ ?

\nabla f(\boldsymbol{x}_0) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \frac{\partial f}{\partial x_3}\right)(\boldsymbol{x}_0) = ?

Symbolic approach

This is the classical approach that we all learnt during calculus courses: you simply have to write down the analytical derivatives and compute their values

\frac{\partial f}{\partial x_1} = \cos x_1 + x_2

\frac{\partial f}{\partial x_2} = x_1 + \exp(x_2+x_3)

\frac{\partial f}{\partial x_3} = \exp(x_2+x_3)

\nabla f(\boldsymbol{x}) = \left(\cos x_1 + x_2, x_1 + \exp(x_2+x_3), \exp(x_2+x_3)\right)(\boldsymbol{x})

\nabla f(\boldsymbol{x}_0) = \left(\cos 0 + 0, 0 + \exp(0+0), \exp(0+0)\right) = \left(1,1,1 \right)

Quite easy, isn't it? What are the advantages of this approach?

✅ High precision, since derivatives are analytical
✅ As long as you have an analytical expression for the input function, this approach works

However, this approach has several drawbacks

❌ For complicated functions, the derivative may not have a maneageable analytical expression
❌ We don't always have an analytical functions to differentiate. For instance, when dealing with Differential Equations, we often have only a numerical solution

Some of these problems can be alleviated with the next approach: finite difference derivatives.

Finite difference derivatives

The second approach replaces the limit in the derivative definition with a finite difference derivative:

f^{\prime}(x)=\lim _{\epsilon \rightarrow 0} \frac{f(x+\epsilon)-f(x)}{\epsilon}\approx \frac{f(x+\Delta x)-f(x)}{\Delta x}

If we have a function with more variables, this approach need to be extended to each of the variables involved. Let us code it in Julia!

f(x) = sin(x[1])+x[1]*x[2]+exp(x[2]+x[3])

What about the efficiency of this function?

using BenchmarkTools
@benchmark f([0,0,0])

BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  29.864 ns …  14.978 μs  ┊ GC (min … max):  0.00% … 99.63%
 Time  (median):     32.536 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   52.093 ns ± 360.890 ns  ┊ GC (mean ± σ):  16.95% ±  2.44%

  ▃██▅▄▃▂▂▁  ▁▁                         ▁▁    ▃▅▄▆▅▅▄▃▁▁       ▂
  ██████████████▇█▇█▇█▇▇▇▅▆▅▅▅▆▇▅▅▅▆▅▆▆████▇▇▅██████████▆▅▇▆▅▆ █
  29.9 ns       Histogram: log(frequency) by time      69.2 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.

Let us evaluate the finite difference derivative!^[1]

using LinearAlgebra

function  ∇f(f, x, ϵ)
    gradf = zeros(size(x))
    ϵ_matrix = ϵ * Matrix(I, length(x), length(x))
    for i in 1:length(x)
        gradf[i] = (f(x+ϵ_matrix[i,:])-f(x-ϵ_matrix[i,:]))/2ϵ
    end
    return gradf
end

gradf = ∇f(f,[0,0,0], 1e-4)

Let us see the result of the calculation!

∂f1=0.99999999833289
∂f2=1.0000000016668897
∂f3=1.0000000016668897

Nice! We have evaluated the required gradient. However, the result is imprecise: there is a truncation error! We have approximated the derivative and this gave us an imprecise result. Furthermore, the error depends on the chosen step-size. If the step-size is too big, we are not going to approximate the derivative...

gradf = ∇f(f,[0,0,0], 1e-1)

∂f1=0.9983341664682821
∂f2=1.001667500198441
∂f3=1.001667500198441

...on the other hand, a small step-size will incure on floating-precision error

gradf = ∇f(f,[0,0,0], 1e-15)

∂f1=1.0547118733938987
∂f2=1.0547118733938987
∂f3=1.0547118733938987

On the performance side,

@benchmark gradf = ∇f(f,[0,0,0], 1e-4)

BenchmarkTools.Trial: 10000 samples with 191 evaluations.
 Range (min … max):  519.921 ns … 83.841 μs  ┊ GC (min … max):  0.00% … 99.21%
 Time  (median):     548.408 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   801.199 ns ±  3.208 μs  ┊ GC (mean ± σ):  17.38% ±  4.31%

  ▄██▆▄▃▃▃▃▂▂▁▁ ▁▁                                    ▁▅▆▄▂  ▁ ▂
  █████████████████▇██▇▅▆▅▆▆▆▆▅▆▆▅▅▄▅▄▅▄▃▅▃▅▄▄▅▄▄▂▄▃▄▅████████ █
  520 ns        Histogram: log(frequency) by time       1.1 μs <

 Memory estimate: 1.28 KiB, allocs estimate: 16.

Forward algorithmic differentiation

using ForwardDiff
@benchmark ForwardDiff.gradient(f, [0,0,0])

[1.0, 1.0, 1.0]
BenchmarkTools.Trial: 10000 samples with 219 evaluations.
 Range (min … max):  338.027 ns … 66.245 μs  ┊ GC (min … max):  0.00% … 99.32%
 Time  (median):     355.895 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   466.664 ns ±  1.822 μs  ┊ GC (mean ± σ):  11.03% ±  2.81%

  ▅█▇▄▃▂▁▁  ▁▁▁    ▁▁            ▂▄▅▅▅▅▃▂                      ▂
  █████████████████████▇▇▆▆▇▆▆▆▆▇██████████▇▇▇▇▆▇▆▆▄▆▆▆▆▅▅▄▅▅▅ █
  338 ns        Histogram: log(frequency) by time       622 ns <

 Memory estimate: 512 bytes, allocs estimate: 5.

Backward algorithmic differentiation

In this part, I'll show a different approach to compute gradient: backward automatic differentiation. I'll write all steps and show them graphically. Disclaimer: I am going to be a bit tedious.

Forward pass

The first step requires the computation of the function. While computing the function, we define some intermediate variables $w_i$ and store their values, writing down the Wengert list. Let us start from the first step.

Variable	Value
$w_1$	$0$
$w_2$	$0$
$w_3$	$0$

This was an easy one: basically, we simply had to define three variables, corresponding to the three inputs. Let's move forward!

Variable	Value
$w_1$	$0$
$w_2$	$0$
$w_3$	$0$
$w_4$	$0$
$w_5$	$0$
$w_6$	$0$

As we can see, we have included some real functions ( $\sin$ , $+$ , ...) and combined the previous step.

Variable	Value
$w_1$	$0$
$w_2$	$0$
$w_3$	$0$
$w_4$	$0$
$w_5$	$0$
$w_6$	$0$
$w_7$	$1$

Let's just keep going on...

Variable	Value
$w_1$	$0$
$w_2$	$0$
$w_3$	$0$
$w_4$	$0$
$w_5$	$0$
$w_6$	$0$
$w_7$	$1$
$w_8$	$1$

...and on...

...and here we are! We have computed the Wengert list and the output of the function. While this look trivial and useless, we have evaluated al quantities required for the next step!

Backward pass

We are now in the position to compute the gradient of the function. Let us start defining the adjoint of a quantity $x$ , which is mapped in another quantity $y$ :

\bar{x} = \frac{\mathrm{d}y}{\mathrm{d} x}

This quantity and the chain rule will be the key ingredients in the forecoming calculations. Using the chain rule, we will start from the output, $y$ , and we will pull back the calculations, till we will reach the beginning of the calculation. How can we use the chain rule? The gradient che be rewritten as

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\frac{\partial y}{\partial w_8}\frac{\mathrm{d}w_8}{\mathrm{d} x}

Since we know the relation between $y$ and $w_8$ we can compute the partial derivative we added

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\frac{\mathrm{d}w_8}{\mathrm{d} x}

But now, let's get back to the graph! I'll add in green each calcuation we have been doing

Let's keep going!

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left( \frac{\partial w_8}{\partial w_4}\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\partial w_8}{\partial w_5}\frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\partial w_8}{\partial w_7}\frac{\mathrm{d}w_7}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\mathrm{d}w_7}{\mathrm{d} x} \right)

Up to now, everything has been quite easy. Let's continue with the next step!

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\partial w_7}{\partial w_6}\frac{\mathrm{d}w_6}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \exp (w_6)\frac{\mathrm{d}w_6}{\mathrm{d} x} \right)

Finally! Has we can see evaluating the partial derivative requires the value of $w_6$ . But we already know this values, since it is stored in the Wengert list!

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\mathrm{d}w_6}{\mathrm{d} x} \right)

Now, the meaning of the first step should be clearer: we have evaluated and stored all intermediate quantities. In this way, while moving back along the computationad graph, we already have all quantities required to compute the gradient!

Let's keep going!

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\frac{\partial w_4}{\partial w_1} \frac{\mathrm{d}w_1}{\mathrm{d} x}\right) + \left(\frac{\partial w_5}{\partial w_1} \frac{\mathrm{d}w_1}{\mathrm{d} x}+\frac{\partial w_5}{\partial w_2} \frac{\mathrm{d}w_2}{\mathrm{d} x}\right) + \left(\frac{\partial w_6}{\partial w_2} \frac{\mathrm{d}w_2}{\mathrm{d} x}+\frac{\partial w_6}{\partial w_3} \frac{\mathrm{d}w_3}{\mathrm{d} x}\right) \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\frac{\partial w_4}{\partial w_1}+\frac{\partial w_5}{\partial w_1} \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(\frac{\partial w_5}{\partial w_2}+\frac{\partial w_6}{\partial w_2} \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + \frac{\partial w_6}{\partial w_3} \frac{\mathrm{d}w_3}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\cos w_1+w_2 \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(w_1+1 \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + 1 \frac{\mathrm{d}w_3}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(1+0 \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(0+1 \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + 1 \frac{\mathrm{d}w_3}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_1}{\mathrm{d} x} + \frac{\mathrm{d}w_2}{\mathrm{d} x} + \frac{\mathrm{d}w_3}{\mathrm{d} x} \right)

We are almost there!

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\partial w_1}{\partial x_1}\frac{\mathrm{d}x_1}{\mathrm{d} x} + \frac{\partial w_2}{\partial x_2}\frac{\mathrm{d}x_2}{\mathrm{d} x} + \frac{\partial w_3}{\partial x_3}\frac{\mathrm{d}x_3}{\mathrm{d} x} \right)

\frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\partial w_1}{\partial x_1} + \frac{\partial w_2}{\partial x_2} + \frac{\partial w_3}{\partial x_3}\right)

We have now written the gradient of our function! We three terms, each of the proportional to a partial derivative. The coefficient multiplying each of these derivatives is the corresponding element of the gradient! Thus, we can conclude that the calculation give the same result as before!

References and Footnotes

[1]	This is not the most efficient way to code the finite difference derivative, it is just something quick and dirt to show the method. A more efficient implementation can be found in (FiniteDifference.jl)[https://github.com/JuliaDiff/FiniteDifferences.jl]

Marco Bonici