Several computational techniques (minimization algorithms, training of deep neural networks, Hamiltonian Monte Carlo) require the computation of gradients, Jacobians, and Hessians. While we all learnt how to compute these quantities in calculus courses, the pen-and-paper approach may not be well suited for numerical computations.
In the remainder of this post, I'll walk through the main techniques that can be used to compute derivatives: the symbolic approach, finite differences, forward algorithmic differentiation, and backward algorithmic differentiation.
Example
Let us consider the following function:
$$f(x_1, x_2, x_3) = \sin(x_1) + x_1 x_2 + e^{x_2 + x_3}$$
For instance, if $x = (0, 0, 0)$, what is the value of the gradient of $f$ evaluated at $x$?
Symbolic approach
This is the classical approach that we all learnt in calculus courses: you simply have to write down the analytical derivatives and compute their values.
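For the function above, the partial derivatives can be written down explicitly and evaluated at $x = (0, 0, 0)$:
$$\frac{\partial f}{\partial x_1} = \cos(x_1) + x_2 = 1, \qquad \frac{\partial f}{\partial x_2} = x_1 + e^{x_2 + x_3} = 1, \qquad \frac{\partial f}{\partial x_3} = e^{x_2 + x_3} = 1,$$
so the gradient at the origin is $(1, 1, 1)$.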
Quite easy, isn't it? What are the advantages of this approach?
✅ High precision, since derivatives are analytical
✅ As long as you have an analytical expression for the input function, this approach works
However, this approach has several drawbacks
❌ For complicated functions, the derivative may not have a manageable analytical expression
❌ We don't always have an analytical function to differentiate. For instance, when dealing with differential equations, we often only have a numerical solution
Some of these problems can be alleviated with the next approach: finite difference derivatives.
Finite difference derivatives
The second approach replaces the limit in the definition of the derivative with a finite difference. Using a central difference, for a function of a single variable we have
$$f'(x) \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}.$$
If the function has several variables, this approach needs to be applied to each variable separately, perturbing one coordinate at a time:
$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon},$$
where $e_i$ is the $i$-th unit vector. Let us code it in Julia!
f(x) = sin(x[1])+x[1]*x[2]+exp(x[2]+x[3])
What about the efficiency of this function?
using BenchmarkTools
@benchmark f([0,0,0])
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
Range (min … max): 24.403 ns … 18.088 μs ┊ GC (min … max): 0.00% … 99.68%
Time (median): 25.298 ns ┊ GC (median): 0.00%
Time (mean ± σ): 30.660 ns ± 200.373 ns ┊ GC (mean ± σ): 11.16% ± 2.21%
Memory estimate: 48 bytes, allocs estimate: 1.
Let us evaluate the finite difference derivative![1]
using LinearAlgebra
function ∇f(f, x, ϵ)
    gradf = zeros(size(x))
    # Each row of ϵ_matrix is ϵ times a unit vector, i.e. the perturbation along one coordinate
    ϵ_matrix = ϵ * Matrix(I, length(x), length(x))
    for i in 1:length(x)
        # Central difference along the i-th coordinate
        gradf[i] = (f(x + ϵ_matrix[i, :]) - f(x - ϵ_matrix[i, :])) / (2ϵ)
    end
    return gradf
end
gradf = ∇f(f,[0,0,0], 1e-4)
Let us see the result of the calculation!
∂f1=0.99999999833289
∂f2=1.0000000016668897
∂f3=1.0000000016668897
Nice! We have evaluated the required gradient. However, the result is not exact: there is a truncation error! We have approximated the derivative, and this gives an imprecise result. Furthermore, the error depends on the chosen step size. If the step size is too big, we get a poor approximation of the derivative...
gradf = ∇f(f,[0,0,0], 1e-1)
∂f1=0.9983341664682821
∂f2=1.001667500198441
∂f3=1.001667500198441
...on the other hand, a step size that is too small incurs floating-point precision errors
gradf = ∇f(f,[0,0,0], 1e-15)
∂f1=1.0547118733938987
∂f2=1.0547118733938987
∂f3=1.0547118733938987
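Before moving on to performance, we can make the trade-off between truncation error and round-off error explicit by sweeping the step size. This is just a quick sketch using the ∇f defined above; the exact value of the first component of the gradient is $\cos(0) = 1$.
for ϵ in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
    # Error on ∂f/∂x₁ at x = (0, 0, 0); the exact value is cos(0) = 1
    err = abs(∇f(f, [0, 0, 0], ϵ)[1] - 1.0)
    println("ϵ = ", ϵ, "   error ≈ ", err)
end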
On the performance side,
@benchmark gradf = ∇f(f,[0,0,0], 1e-4)
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
Range (min … max): 428.487 ns … 153.772 μs ┊ GC (min … max): 0.00% … 99.34%
Time (median): 450.940 ns ┊ GC (median): 0.00%
Time (mean ± σ): 610.754 ns ± 2.360 μs ┊ GC (mean ± σ): 17.31% ± 5.36%
Memory estimate: 1.33 KiB, allocs estimate: 32.
Forward algorithmic differentiation
A third approach is forward-mode algorithmic (automatic) differentiation: instead of approximating derivatives numerically, the derivative of every elementary operation is propagated together with its value, so the result is exact up to floating-point precision. In Julia, this approach is implemented by the ForwardDiff.jl package.
using ForwardDiff
ForwardDiff.gradient(f, [0,0,0])
[1.0, 1.0, 1.0]
@benchmark ForwardDiff.gradient(f, [0,0,0])
BenchmarkTools.Trial: 10000 samples with 204 evaluations per sample.
Range (min … max): 379.779 ns … 169.712 μs ┊ GC (min … max): 0.00% … 99.55%
Time (median): 471.615 ns ┊ GC (median): 0.00%
Time (mean ± σ): 507.406 ns ± 2.275 μs ┊ GC (mean ± σ): 8.33% ± 1.98%
Memory estimate: 416 bytes, allocs estimate: 7.
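Under the hood, ForwardDiff.jl relies on dual numbers: every quantity carries both a value and a derivative, and each elementary operation updates both at once. The following is only a minimal sketch of this idea, with a toy Dual type defined here for illustration (it is not ForwardDiff's actual implementation):
struct Dual
    val::Float64   # value of the quantity
    der::Float64   # derivative of the quantity with respect to the chosen input
end
# Propagate values and derivatives through the elementary operations used by f
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)
Base.exp(a::Dual) = Dual(exp(a.val), exp(a.val) * a.der)
# ∂f/∂x₁ at (0, 0, 0): seed the first input with derivative 1, the others with 0
x = [Dual(0.0, 1.0), Dual(0.0, 0.0), Dual(0.0, 0.0)]
f(x).der   # 1.0
Seeding a different input with derivative 1 gives the other components of the gradient; ForwardDiff follows the same idea, but with a far more general and optimized implementation.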
Backward algorithmic differentiation
In this part, I'll show a different approach to compute the gradient: backward (reverse-mode) automatic differentiation. I'll write out all the steps and show them graphically. Disclaimer: I am going to be a bit tedious.
Forward pass
The first step requires the computation of the function itself. While computing the function, we define some intermediate variables and store their values, writing down the so-called Wengert list. Let us start with the first step.
[Figure: forward pass, step 1 of the computational graph]
This was an easy one: basically, we simply had to define three variables, corresponding to the three inputs. Let's move forward!
[Figure: forward pass, step 2 of the computational graph]
As we can see, we have applied some elementary functions (such as $\sin$ and $\exp$) and combined the variables from the previous step.
[Figure: forward pass, step 3 of the computational graph]
Let's just keep going on...
[Figure: forward pass, step 4 of the computational graph]
...and on...
[Figure: forward pass, step 5 of the computational graph]
...and here we are! We have computed the Wengert list and the output of the function. While this may look trivial and useless, we have evaluated all the quantities required for the next step!
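For concreteness, here is one possible Wengert list for our function evaluated at $(0, 0, 0)$, written as plain Julia assignments. The names $w_1, \dots, w_8$ and the exact grouping of the operations are just a notational choice for this sketch; the figures above may split the steps slightly differently.
# Forward pass: one possible Wengert list for f(x) = sin(x1) + x1*x2 + exp(x2 + x3) at (0, 0, 0)
x1, x2, x3 = 0.0, 0.0, 0.0
w1 = x1          # 0.0
w2 = x2          # 0.0
w3 = x3          # 0.0
w4 = sin(w1)     # 0.0
w5 = w1 * w2     # 0.0
w6 = w2 + w3     # 0.0
w7 = exp(w6)     # 1.0
w8 = w4 + w5     # 0.0
y  = w8 + w7     # 1.0, the value of the function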
Backward pass
We are now in the position to compute the gradient of the function. Let us start by defining the adjoint of a quantity $w$ that is mapped into another quantity $y$:
$$\bar{w} = \frac{\partial y}{\partial w}.$$
This quantity and the chain rule will be the key ingredients in the forthcoming calculations. Using the chain rule, we will start from the output, $y$, and pull the calculation back until we reach the beginning, i.e. the inputs. How can we use the chain rule? The gradient can be rewritten as
$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial w}\,\frac{\partial w}{\partial x_i} = \bar{w}\,\frac{\partial w}{\partial x_i},$$
where $w$ is one of the intermediate variables of the Wengert list. Since we know the relation between $y$ and $w$, we can compute the partial derivative we have just introduced.
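For instance, with the intermediate variables of the sketch above, where $y = w_8 + w_7$ and $w_7 = e^{w_6}$, the first backward steps read
$$\bar{w}_8 = \frac{\partial y}{\partial w_8} = 1, \qquad \bar{w}_7 = \frac{\partial y}{\partial w_7} = 1, \qquad \bar{w}_6 = \bar{w}_7\,\frac{\partial w_7}{\partial w_6} = \bar{w}_7\, e^{w_6}.$$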
But now, let's get back to the graph! I'll add in green each calculation as we perform it.
[Figure: backward pass, step 1 of the computational graph]
Let's keep going!
[Figure: backward pass, step 2 of the computational graph]
Up to now, everything has been quite easy. Let's continue with the next step!
Finally! As we can see, evaluating this partial derivative requires the value of one of the intermediate variables. But we already know this value, since it is stored in the Wengert list!
Now, the meaning of the first step should be clearer: we have evaluated and stored all the intermediate quantities. In this way, while moving back along the computational graph, we already have all the quantities required to compute the gradient!
[Figure: backward pass, step 3 of the computational graph]
Let's keep going!
[Figure: backward pass, step 4 of the computational graph]
We are almost there!
[Figure: backward pass, step 5 of the computational graph]
We have now written the gradient of our function! We obtained three terms, each proportional to a partial derivative with respect to one of the inputs. The coefficient multiplying each of these derivatives is the corresponding element of the gradient! Thus, we can conclude that this calculation gives the same result as before.
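To make the whole procedure concrete, here is the backward pass written as plain Julia assignments, reusing the hypothetical $w_i$ variables from the forward-pass sketch above (wbar stands for the adjoint $\bar{w}$):
# Backward pass: accumulate the adjoints w̄ = ∂y/∂w from the output back to the inputs
ybar  = 1.0                            # ∂y/∂y
wbar8 = ybar                           # y = w8 + w7
wbar7 = ybar
wbar6 = wbar7 * exp(w6)                # w7 = exp(w6); w6 is read from the Wengert list
wbar4 = wbar8                          # w8 = w4 + w5
wbar5 = wbar8
wbar1 = wbar4 * cos(w1) + wbar5 * w2   # w4 = sin(w1), w5 = w1 * w2
wbar2 = wbar5 * w1 + wbar6             # w5 = w1 * w2, w6 = w2 + w3
wbar3 = wbar6                          # w6 = w2 + w3
(wbar1, wbar2, wbar3)                  # the gradient: (1.0, 1.0, 1.0)
A single backward sweep produces the whole gradient at once, which is why reverse mode is preferred when the function has many inputs and a single output.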
References and Footnotes
[1] | This is not the most efficient way to code the finite difference derivative; it is just something quick and dirty to show the method. A more efficient implementation can be found in [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl). |