Little by Little: Optimisation with analysis and gradients
Problems with a known objective function or an observable gradient can be solved efficiently through analysis or gradient descent/ascent.
This article will formalise the concept of optimisation and introduce optimisation with analysis and gradient descent.
This is the second part of a series on optimisation. Head to the first part to develop an intuition of optimisation problems with very little maths involved.
As mentioned in the first part, an optimisation problem refers to minimising or maximising an objective function that depends on a set of decision variables, subject to a set of constraints.
Example: A private company producing tables and chairs would try to maximise its profit (revenue minus costs, the objective function) depending on the quantities it produces (decision variables). Some constraints could involve the current capital or number of workers the firm hires.
Formalising this, an optimisation problem is:

$$\min_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \quad i = 1, \dots, m$$

or

$$\max_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \quad i = 1, \dots, m$$

where $x$ is the vector of decision variables, $f$ is the objective function, and $g_1, \dots, g_m$ are the constraint functions.
The optimisation methods at our disposal depend on the amount of information we have about a problem.
Using analysis
In the production example, if we are given both the revenue and cost functions, it may be possible to find the solution to the problem. To simplify, let’s only consider the production of tables, with the quantity produced $q_t$ as the single decision variable.
Let’s define:
- Cost function: $C(q_t) = 2q_t^2 + 200q_t + 100$
- Revenue function: $R(q_t) = p\,q_t$, where $p$ is the price per table

Deriving the profits from revenues minus costs:

$$\pi(q_t) = R(q_t) - C(q_t) = p\,q_t - 2q_t^2 - 200q_t - 100$$

Simplifying:

$$\pi(q_t) = (p - 200)\,q_t - 2q_t^2 - 100$$
We can then use the first and second derivatives to find the critical points of this function and determine whether they are maxima or minima.
The expression of the first derivative is:

$$\pi'(q_t) = p - 200 - 4q_t$$

Setting the first derivative to 0 allows us to find the critical point of the function (reached at $q_t = \frac{p - 200}{4}$).
However, a critical point could be either a minimum or maximum. To make sure that this is a maximum, we need to check if the function is concave at this critical point. We can do that by ensuring that the second derivative at this point is negative.
Checking the sign of the second derivative:

$$\pi''(q_t) = -4$$

Since $\pi''(q_t) = -4 < 0$ everywhere, the function is concave and the critical point is indeed a maximum.

Suppose the price per table is €400:

$$q_t^* = \frac{400 - 200}{4} = 50$$
Thus, the company maximises profit by producing 50 tables.
This process is highly efficient as only a few calculations are needed to get to the result. This would take less than a millisecond on a personal computer.
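The same derivation can be reproduced programmatically. Below is a minimal sketch using sympy to find and verify the critical point; it assumes the cost and revenue functions defined above, with the price fixed at €400:

```python
import sympy as sp

q = sp.symbols('q', nonnegative=True)
# Profit = revenue - cost, using the functions defined above with p = 400
profit = 400 * q - (2 * q**2 + 200 * q + 100)

# Setting the first derivative to zero gives the critical point
critical_points = sp.solve(sp.diff(profit, q), q)
print(critical_points)  # [50]

# The second derivative is negative everywhere, so this is a maximum
print(sp.diff(profit, q, 2))  # -4
```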
Observing the gradient
Besides analysis, we can use gradient descent to solve more complex problems. Let’s consider the following functions:
Revenue function: $R(q_t, q_c) = p_t\,q_t + p_c\,q_c$

Cost functions:
- For tables: $C_t(q_t) = 2q_t^2 + 200q_t + 100$
- For chairs: $C_c(q_c) = 3q_c^2 + 50q_c + 80$
- Total cost: $C(q_t, q_c) = C_t(q_t) + C_c(q_c)$

Profit function: $\pi(q_t, q_c) = R(q_t, q_c) - C(q_t, q_c)$

Let’s specify:
- Price per table: $p_t = 400$
- Price per chair: $p_c = 150$

Therefore, the profit function is:

$$\pi(q_t, q_c) = 400q_t + 150q_c - 2q_t^2 - 200q_t - 100 - 3q_c^2 - 50q_c - 80$$

Simplifying:

$$\pi(q_t, q_c) = 200q_t - 2q_t^2 + 100q_c - 3q_c^2 - 180$$
Finding the maximum analytically is already more challenging than the univariate case, though definitely doable. We will solve this using gradient descent to illustrate this different approach.
The gradient of a function (noted $\nabla f$) is the vector of its partial derivatives with respect to each of its variables. As an example, the gradient of the function $f(x, y) = x^2 + y^2$ is:

$$\nabla f(x, y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}$$

Following the opposite direction of this gradient will lead to the steepest decrease in the function. In practice, following the opposite of the gradient would mean updating the current point with the rule:

$$(x, y) \leftarrow (x, y) - \alpha \nabla f(x, y)$$

where $\alpha$ is a small step size called the learning rate. Trying to find a minimum to the function starting from the random guess $(x_0, y_0) = (5, 5)$, we apply this update repeatedly. This process will slowly converge to $(0, 0)$, the minimum of the function.

Similarly, following the direction of the gradient itself will lead to the steepest increase in the function; this variant is called gradient ascent.
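As a minimal sketch of gradient descent in action, the loop below applies the update rule to $f(x, y) = x^2 + y^2$; the starting point and learning rate are arbitrary choices for the example:

```python
import numpy as np

def grad_f(v):
    # Gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2 * v

alpha = 0.1  # learning rate
v = np.array([5.0, 5.0])  # random starting guess

for _ in range(50):
    v = v - alpha * grad_f(v)  # step against the gradient

print(v)  # very close to (0, 0), the minimum of f
```

Flipping the sign of the update (adding the gradient instead of subtracting it) turns this into gradient ascent.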
Going back to tables and chairs, the gradient vector of the profit function is the vector of its partial derivatives:

Partial derivative with respect to $q_t$:

$$\frac{\partial \pi}{\partial q_t} = 200 - 4q_t$$

Partial derivative with respect to $q_c$:

$$\frac{\partial \pi}{\partial q_c} = 100 - 6q_c$$

The gradient of $\pi$ is therefore:

$$\nabla \pi(q_t, q_c) = \begin{pmatrix} 200 - 4q_t \\ 100 - 6q_c \end{pmatrix}$$

Starting from the guess $(q_t, q_c) = (10, 10)$ with a learning rate $\alpha = 0.05$ and applying the gradient ascent update rule, the first step gives:

$$(q_t, q_c) \leftarrow (10, 10) + 0.05 \times (160, 40) = (18, 12)$$
The full search is illustrated in the animation below:
Code to generate the animation
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib.patches import FancyArrowPatch

def profit(q1, q2):
    return 200 * q1 - 2 * q1**2 + 100 * q2 - 3 * q2**2 - 180

def grad_profit(q1, q2):
    dp_dq1 = 200 - 4 * q1
    dp_dq2 = 100 - 6 * q2
    return np.array([dp_dq1, dp_dq2])

# Gradient ascent parameters
alpha = 0.05
steps = 15
history = []

# Initial guess
q1 = 10
q2 = 10

for _ in range(steps):
    gradient = grad_profit(q1, q2)
    history.append((q1, q2, gradient))
    q1 += alpha * gradient[0]
    q2 += alpha * gradient[1]

# Prepare for the animation
fig, ax = plt.subplots(figsize=(8, 6))
q1_vals = np.linspace(0, 60, 200)
q2_vals = np.linspace(0, 40, 200)
Q1, Q2 = np.meshgrid(q1_vals, q2_vals)
Pi = profit(Q1, Q2)
contours = ax.contour(Q1, Q2, Pi, levels=50, cmap="Blues")
plt.clabel(contours, inline=True, fontsize=8)
ax.set_title('Profit as a Function of Tables and Chairs')
ax.set_xlabel('Quantity of Tables ($q_t$)')
ax.set_ylabel('Quantity of Chairs ($q_c$)')

prev_arrow = None

def init():
    ax.set_xlim(0, 60)
    ax.set_ylim(0, 40)
    return []

def update(frame):
    global prev_arrow
    q1, q2, gradient = history[frame]
    # Remove previous arrow
    if prev_arrow is not None:
        prev_arrow.remove()
    # Scale gradient for arrow visualization
    norm_gradient = gradient * alpha
    dq1, dq2 = norm_gradient
    # Draw an arrow showing the gradient ascent direction
    arrow = FancyArrowPatch(
        (q1, q2), (q1 + dq1, q2 + dq2),
        arrowstyle='-|>', color='r', mutation_scale=20
    )
    ax.add_patch(arrow)
    prev_arrow = arrow
    ax.set_title(f'Step: {frame}')
    return [arrow]

ani = FuncAnimation(fig, update, frames=range(len(history)), init_func=init,
                    blit=True, interval=1000)
plt.show()
```
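As a sanity check (a short sketch reusing `history` and `profit` from the snippet above), the last point visited by the search can be compared with the analytical optimum, obtained by setting both partial derivatives to zero: $q_t = 50$ and $q_c = 100/6 \approx 16.67$.

```python
# The final point of the ascent should be close to the analytical optimum
q1_final, q2_final, _ = history[-1]
print(q1_final, q2_final)  # approaches (50, 16.67)
print(profit(q1_final, q2_final))  # approaches the maximum profit
```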
Do we even need a learning rate?
The gradient descent update rule is the following:

$$x_{n+1} = x_n - \alpha \nabla f(x_n)$$

with $\alpha$ the learning rate, a small positive number that controls the size of each step.
Why not add or subtract the gradient directly? Wouldn’t we reach the optimum faster?
Because a picture is worth a thousand words (and a gif maybe more), the animation below shows what could happen with a learning rate set to 1 (equivalent to no learning rate at all).
Code to generate the animation
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def gradient_descent(learning_rate, steps, x_initial):
    x = x_initial
    history = [x]
    for _ in range(steps):
        grad = 2 * x  # Gradient of f(x) = x^2 is f'(x) = 2x
        x = x - learning_rate * grad
        history.append(x)
    return history

x_initials = [10, 10]  # Starting points for each learning rate
steps = 20  # Number of steps
learning_rates = [0.1, 1]

# Prepare histories for both learning rates
histories = []
for lr, x_init in zip(learning_rates, x_initials):
    history = gradient_descent(lr, steps, x_init)
    histories.append(history)

# Prepare the plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
x_vals = np.linspace(-15, 15, 400)
y_vals = x_vals ** 2

# Initialize plots
lines = []
arrows = []
for ax, lr in zip(axes, learning_rates):
    ax.plot(x_vals, y_vals, label='$f(x) = x^2$')
    line, = ax.plot([], [], 'ro', label='Current Position')
    lines.append(line)
    arrows.append(None)  # Placeholder for arrows
    ax.set_title(f'Learning Rate = {lr}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_xlim(-15, 15)
    ax.set_ylim(0, 225)
    ax.legend()
    ax.grid(True)

def animate(i):
    fig.suptitle(f'Gradient Descent Step: {i}', fontsize=16)
    for idx, (line, history, ax) in enumerate(zip(lines, histories, axes)):
        if i < len(history):
            x_current = history[i]
        else:
            x_current = history[-1]
        y_current = x_current ** 2
        line.set_data([x_current], [y_current])
        # Remove previous arrow if it exists
        if arrows[idx] is not None:
            arrows[idx].remove()
        # Draw an arrow from the previous point to the current point
        if i > 0:
            x_prev = history[i - 1]
            y_prev = x_prev ** 2
            arrow = ax.annotate(
                '',
                xy=(x_current, y_current),
                xytext=(x_prev, y_prev),
                arrowprops=dict(arrowstyle='->', color='blue', lw=2),
            )
            arrows[idx] = arrow
        else:
            arrows[idx] = None  # No arrow in the first frame
    return lines + [arrow for arrow in arrows if arrow is not None]

ani = FuncAnimation(fig, animate, frames=steps + 1, interval=500, blit=True)
plt.tight_layout()
ani.save('gradient_ascent_learning_rate.gif', writer='imagemagick')
plt.show()
```
The choice of learning rate is critical in optimisation problems. Setting it too low will result in slow convergence while setting it too high could lead to no convergence at all.
There are some sophisticated ways to set the learning rate. For simple functions like the ones studied in this article, simple experimentation should do the trick. However, when optimising millions of parameters, variations of the gradient descent algorithm adjust the learning rate throughout the process. For example, the optimisation could start with a high learning rate that is gradually reduced as the algorithm converges (i.e., learning rate decay).
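As an illustration, here is a minimal sketch of exponential learning rate decay on $f(x) = x^2$; the initial rate and decay factor are arbitrary choices for the example, not a recommendation:

```python
def grad_f(x):
    return 2 * x  # gradient of f(x) = x^2

x = 10.0
alpha = 1.0  # deliberately too large at the start
decay = 0.9  # shrink the learning rate at every step

for _ in range(30):
    x = x - alpha * grad_f(x)
    alpha *= decay  # learning rate decay

print(x)  # converges, even though a fixed alpha = 1 would oscillate forever
```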
Does gradient descent find the optimum every time?
When working with convex functions, gradient descent is guaranteed to find the function’s global optimum (with a suitable learning rate). This is convenient, as the loss functions of both Linear and Logistic Regression are convex. Unfortunately, many real-world functions are not convex, such as the loss functions of complex Neural Networks.

These functions can have local optima in which the gradient descent algorithm can get stuck. See the animation below as an example.
Code to generate the animation
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def f(x):
    return x ** 2 + 10 * np.sin(x)

def grad_f(x):
    return 2 * x + 10 * np.cos(x)

def gradient_descent(learning_rate, steps, x_initial):
    x = x_initial
    history = [x]
    for _ in range(steps):
        grad = grad_f(x)
        x = x - learning_rate * grad
        history.append(x)
    return history

# Parameters
x_initials = [7.5, -7.5]  # Starting points for each test case
steps = 30  # Number of steps
learning_rates = [0.1, 0.1]  # Same learning rate for both cases

# Prepare histories for both starting points
histories = []
for lr, x_init in zip(learning_rates, x_initials):
    history = gradient_descent(lr, steps, x_init)
    histories.append(history)

# Prepare the plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
x_vals = np.linspace(-10, 10, 1000)
y_vals = f(x_vals)

# Initialize plots
lines = []
arrows = []
for idx, (ax, x_init) in enumerate(zip(axes, x_initials)):
    ax.plot(x_vals, y_vals, label=r'$f(x) = x^2 + 10 \sin(x)$')
    line, = ax.plot([], [], 'ro', label='Current Position')
    lines.append(line)
    arrows.append(None)  # Placeholder for arrows
    ax.set_title(f'Starting Point = {x_init}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_xlim(-10, 10)
    ax.set_ylim(-50, 150)
    ax.legend()
    ax.grid(True)

def animate(i):
    fig.suptitle(f'Gradient Descent Step: {i}', fontsize=16)
    artists = []
    for idx, (line, history, ax) in enumerate(zip(lines, histories, axes)):
        x_current = history[i] if i < len(history) else history[-1]
        y_current = f(x_current)
        line.set_data([x_current], [y_current])
        artists.append(line)
        # Remove previous arrow if it exists
        if arrows[idx] is not None:
            arrows[idx].remove()
        # Draw an arrow from the previous point to the current point
        if i > 0:
            x_prev = history[i - 1]
            y_prev = f(x_prev)
            arrow = ax.annotate(
                '',
                xy=(x_current, y_current),
                xytext=(x_prev, y_prev),
                arrowprops=dict(arrowstyle='->', color='blue', lw=1.5),
            )
            arrows[idx] = arrow
            artists.append(arrow)
        else:
            arrows[idx] = None  # No arrow in the first frame
    return artists

ani = FuncAnimation(fig, animate, frames=steps + 1, interval=500, blit=True)
plt.tight_layout()
plt.show()
```
In these cases, running the gradient descent algorithm from different starting points could be a good strategy.
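Below is a minimal sketch of this multi-start strategy, applied to the non-convex function from the animation above; the number of restarts and the sampling range are arbitrary choices for the example:

```python
import numpy as np

def f(x):
    return x ** 2 + 10 * np.sin(x)

def grad_f(x):
    return 2 * x + 10 * np.cos(x)

def gradient_descent(learning_rate, steps, x_initial):
    x = x_initial
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

rng = np.random.default_rng(42)
starts = rng.uniform(-10, 10, size=10)  # 10 random starting points

# Run gradient descent from every start and keep the best end point
candidates = [gradient_descent(0.1, 100, x0) for x0 in starts]
best = min(candidates, key=f)
print(best, f(best))  # close to the global minimum near x ≈ -1.31
```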
What to do when we do not have the function or gradient?
Part 1 described a set of problems referred to as “black-box” optimisation. These are problems for which we do not have the definition of the function to optimise, or for which this function is too complex to be solved analytically or is not differentiable (so there is no gradient to follow).

Looking at the problem above, it is as if we had to decide the numbers of tables and chairs to produce without having the profit function. We could simply try a quantity this month, observe profits, try another quantity next month, observe profits, and so on. The only information we have is the set of combinations we try and their outcomes.

These problems are the Computer Science equivalent of finding a needle in a haystack.
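To make this concrete, here is a naive black-box sketch for the tables-and-chairs problem: sample candidate quantities, observe the resulting profit, and keep the best combination found. The `observe_profit` function below is a hypothetical stand-in for the real-world feedback we would collect each month; in a genuine black-box setting we would not know its formula:

```python
import numpy as np

def observe_profit(q_tables, q_chairs):
    # Hypothetical stand-in: in a real black-box setting this formula is
    # unknown to us; we only observe the output for the inputs we try.
    return 200 * q_tables - 2 * q_tables**2 + 100 * q_chairs - 3 * q_chairs**2 - 180

rng = np.random.default_rng(0)
best_combo, best_profit = None, -np.inf

for _ in range(20):  # try 20 random production plans
    q_t = int(rng.integers(0, 60))
    q_c = int(rng.integers(0, 40))
    observed = observe_profit(q_t, q_c)
    if observed > best_profit:
        best_combo, best_profit = (q_t, q_c), observed

print(best_combo, best_profit)  # the best plan found, likely far from optimal
```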
PS: There will be much less maths in the next article. I promise.