Speed Up Your Gradient Descent: The Epic Quest for the Optimal Stride | by Naman Agrawal


Step Size Selection Techniques for Gradient Descent in Machine Learning

Image by stokpic from Pixabay
  1. Introduction
  2. Method 1: Constant Step Size
  3. Method 2: Exact Line Search
  4. Method 3: Backtracking Line Search
  5. Conclusion

Gradient Descent is one of the most widely used optimization techniques for training machine learning models of all kinds. It provides a reliable way to minimize the loss over the training data, especially in cases where no closed-form solution to the problem exists. More generally, consider a machine learning problem defined by a continuous and differentiable function f: Rᵈ → R (most loss functions satisfy this). The goal is to find the x* ∈ Rᵈ that minimizes the loss function:

x* = arg min f(x), x ∈ Rᵈ

Gradient Descent provides an iterative method to solve this problem. The update rule is given as follows:

x⁽ᵏ⁾ = x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)

where x⁽ᵏ⁾ denotes the value of x in the kth iteration of the algorithm, and tₖ denotes the step size (or learning rate) used in the kth iteration. The workflow of the algorithm is as follows:

  1. Define the loss function f and compute its gradient ∇f.
  2. Start with a random initial choice of x ∈ Rᵈ, call it x⁽⁰⁾ (the initial iterate).
  3. Until a stopping criterion is reached (for example, the norm of the gradient drops below a set tolerance), do the following:
    A) Determine the direction in which x should change. Under gradient descent, this is the negative of the gradient of the loss function evaluated at the previous iterate: vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
    B) Determine the step size: tₖ.
    C) Update the iterate: x⁽ᵏ⁾= x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)

That’s the whole process in a nutshell: take the current iterate, determine the direction in which it needs to change (vₖ), determine the magnitude of the change (tₖ), and update it.

Diagram of Gradient Descent [Image by Author]
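For concreteness, here is a minimal, generic sketch of this loop in Python (the function gradient_descent and the names grad_f, x0, and step_size are illustrative placeholders, not code used later in this article):

import numpy as np

def gradient_descent(grad_f, x0, step_size, tol=1e-3, max_iter=10_000):
    # Minimal sketch of the loop above: direction, step size, update
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                  # gradient at the current iterate
        if np.linalg.norm(g) <= tol:   # stopping criterion (step 3)
            break
        v = -g                         # direction of descent (step 3A)
        t = step_size                  # step size t_k (step 3B)
        x = x + t * v                  # update the iterate (step 3C)
    return x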

So, what is this article about? Our focus will be on step 3B: choosing the step size tₖ. When it comes to gradient descent, this is one of the most overlooked aspects of model development. The step size can greatly affect how quickly your algorithm converges to a solution and the accuracy of the solution it converges to. In many cases, data scientists simply set a fixed value for the step size throughout the learning process, or tune it with validation techniques. However, there are more effective ways to approach this problem. In this article, we will discuss three ways to determine the value of tₖ:

  1. Constant Step Size
  2. Exact Line Search
  3. Backtracking Line Search (Armijo’s Rule)

For each of these methods, we will discuss the theory and use it to compute the first few iterations by hand. In particular, we will use the following loss function as our running example:

f(x, y) = 2x² + 3y² − 2xy − 1

The 3D-Plot of this function is shown below:

Loss Function (3D Plot) [Image by Author]

From the figure, it is clear that the global minimum is x* = [0; 0]. In this article, we will manually compute the first few iterations and count the number of steps to convergence for each of these methods. We will also plot the trajectory of the iterates to understand how each method converges. Usually, it is easier to refer to the contour plot of the function (instead of its 3D plot) to compare the different trajectories. The contour plot of the function can be generated via the following code:

# Load Packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set(style="darkgrid")
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D
# Define Function
f = lambda x,y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Plot contour
X = np.arange(-1, 1, 0.005)
Y = np.arange(-1, 1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
cmap = plt.cm.get_cmap('viridis')
plt.contour(X,Y,Z,250, cmap=cmap)

Contour Plot of f [Image by Author generated using Python]

Let’s get started!

This method, using a constant step size, is the simplest to use and the most widely adopted for training ML models. It involves setting:

tₖ = t for all k, where t > 0 is a fixed constant.

One should be very careful in choosing the right t under this method. Although a smaller value of t may result in a more accurate solution, convergence may be slow. On the other hand, a large t makes the algorithm faster, but at the cost of accuracy. Using this method requires the user to carefully balance the trade-off between the speed of convergence and the accuracy of the solution obtained.

In practice, many data scientists use validation techniques such as hold-out validation or k-fold cross-validation to select t. This involves setting aside part of the training data (called validation data), which is used to tune the algorithm by running it for the different values that t may take. Let’s look at our example:

The first step is to compute its gradient:

∇f(x, y) = [4x − 2y; 6y − 2x]

For all the following calculations, we will take the starting point to be x⁽⁰⁾ = [1; 1]. Under this method, we set:

tₖ = t = 0.1 for all k

The first two iterations are calculated as follows:
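Starting from x⁽⁰⁾ = [1; 1], we have ∇f(1, 1) = [2; 4], so x⁽¹⁾ = [1; 1] − 0.1·[2; 4] = [0.8; 0.6]; similarly, ∇f(0.8, 0.6) = [2; 2], so x⁽²⁾ = [0.6; 0.4]. A minimal sketch (my own check, reusing the gradient formula above) reproduces these values:

import numpy as np

df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])  # gradient of f

x = np.array([1.0, 1.0])          # x^(0)
t = 0.1                           # constant step size
for k in (1, 2):
    x = x - t * df(*x)            # x^(k) = x^(k-1) - t * grad f(x^(k-1))
    print(f"x^({k}) = {x}")       # x^(1) = [0.8 0.6], x^(2) = [0.6 0.4]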

We compute the remaining iterations systematically via the following Python code:

# Define the function f(x, y)
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Define the derivative of f(x, y)
def df(x, y):
    return np.array([4*x - 2*y, 6*y - 2*x])

# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, t=0.1, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0  # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)  # Append updated x coordinate to the list
        y.append(y0)  # Append updated y coordinate to the list
        num_steps += 1  # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

In the code above, we have defined the function f, its gradient df, and the gradient descent routine grad_desc (all of which will be reused throughout the article).

Running the code above, we see that it takes 26 steps to converge. The following plot shows the trajectory of the iterates during gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]],[b[i]],marker="o",markersize=7, color="r")
    plt.plot([a[i + 1]],[b[i + 1]],marker="o",markersize=7, color="r")
    plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i],
              head_width=0, head_length=0, fc="r", ec="r", linewidth=2.0)
Contour Plot: Constant Step Size t = 0.1 [Image by Author generated using Python]

To better understand how important it is to choose the right t in this method, let’s look at the effect of increasing or decreasing t. If we reduce the value of t from 0.1 to 0.01, the number of steps to convergence increases significantly, from 26 to 295. The trajectory of the iterates for this case is shown below:

Contour Plot: Constant Step Size t = 0.01 [Image by Author generated using Python]

However, by increasing t from 0.1 to 0.2, the number of steps to convergence drops from 26 to just 11, as shown below:

Contour Plot: Constant Step Size t = 0.2 [Image by Author generated using Python]

However, it is important to note that this is not always the case. If the step size is too large, it is possible that the iterates will simply overshoot the optimum and never converge. In fact, increasing t from 0.2 to 0.3 causes the iterates to diverge, making convergence impossible. This is evident from the following plot (with t = 0.3), showing only the first 8 steps:

Contour Plot: Constant Step Size t = 0.3 [Image by Author generated using Python]

Therefore, it is clear that finding the right value of t is very important in this method, and even a small increase or decrease can greatly affect the convergence of the algorithm. A quick way to compare candidate values is shown in the sketch below; after that, let’s move on to the next method for determining t.
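As a rough comparison (reusing the grad_desc function defined above; the candidate values are my own choice, and divergent values such as t = 0.3 are left out since grad_desc has no divergence guard), the following sketch prints the number of steps for several constant step sizes:

# Compare the number of steps to convergence for a few constant step sizes
# (reuses f, df, and grad_desc defined earlier; candidate values are illustrative)
for t in [0.01, 0.05, 0.1, 0.2]:
    _, _, n = grad_desc(f, df, 1, 1, t=t)
    print(f"t = {t}: {n} steps to convergence")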

In this method, known as exact line search, we do not fix a single value of t in advance. Instead, we treat the problem of finding the best t as a 1D optimization problem of its own. In other words, at each iteration we want to find the value of t that minimizes the loss along the descent direction:

g(t) = f(x⁽ᵏ ⁻ ¹⁾ + t vₖ), and we set tₖ = arg minₜ g(t)

Notice how convenient this is! We have a multivariate optimization problem (minimizing f) that we are trying to solve using gradient descent. We already know the direction to move in at each iteration (vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)), but we need to find the right value of tₖ. In other words, the value of the next iterate depends only on the value of tₖ that we choose. Therefore, we can treat this as another (but much simpler!) optimization problem.

Thus, we choose x⁽ᵏ⁾ to be the point along the descent direction that minimizes the loss f. This often greatly reduces the number of iterations needed to converge. However, it also adds extra cost per iteration: the computation of the minimizer of g.

The first iteration is calculated as follows:
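For the first iteration (my own derivation, using the definitions above): starting from x⁽⁰⁾ = [1; 1], the descent direction is v₁ = −∇f(1, 1) = [−2; −4], so g(t) = f(1 − 2t, 1 − 4t) = 40t² − 20t + 2, which is minimized at t₁ = 0.25, giving x⁽¹⁾ = [0.5; 0]. A short sketch confirms this numerically:

# Verify the first exact-line-search step numerically
from scipy.optimize import minimize_scalar
import numpy as np

f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
v = np.array([-2.0, -4.0])                 # v_1 = -grad f(1, 1)
g = lambda t: f(1 + t*v[0], 1 + t*v[1])    # loss along the descent direction
t1 = minimize_scalar(g).x
print(t1, (1 + t1*v[0], 1 + t1*v[1]))      # ~0.25, (0.5, 0.0)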

We compute the remaining iterations systematically via the following Python code:

# Import package for 1D optimization
from scipy.optimize import minimize_scalar

def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0  # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Define the 1D function of t to be minimized
        g = lambda t: f(x0 + t*v[0], y0 + t*v[1])
        t = minimize_scalar(g).x  # Minimize g over t
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)  # Append updated x coordinate to the list
        y.append(y0)  # Append updated y coordinate to the list
        num_steps += 1  # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

As before, the code above reuses the function f and its gradient df defined earlier; only the grad_desc routine has changed, now performing an exact line search for t at every iteration.

Running the code above, we see that it takes only 10 steps to converge (a big improvement over the constant step size). The following plot shows the trajectory of the iterates during gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]],[b[i]],marker="o",markersize=7, color="r")
    plt.plot([a[i + 1]],[b[i + 1]],marker="o",markersize=7, color="r")
    plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i], head_width=0,
              head_length=0, fc="r", ec="r", linewidth=2.0)
Contour Plot: Exact Line Search [Image by Author generated using Python]

Now, let’s discuss the next method of determining t.

Backtracking is an adaptive strategy for choosing the step size. In my experience, I have found this to be one of the most effective ways to control the step size: convergence is often much faster than with a constant step size, without the cost of fully minimizing the 1D function g.

Algorithm 1: Backtracking (Armijo-Goldstein condition) [Image by Author]

In other words, we start with a large step size (which is often desirable in the early stages of the algorithm) and check whether it gives a sufficient improvement over the current iterate. If the step size is found to be too large, we shrink it by multiplying it by a fixed scalar β ∈ (0, 1). We repeat this process until a sufficient decrease in f is obtained. Specifically, starting from an initial value t̄, we choose the largest t of the form t̄βⁱ such that:

f(x⁽ᵏ ⁻ ¹⁾ − t∇ₓ f(x⁽ᵏ ⁻ ¹⁾)) ≤ f(x⁽ᵏ ⁻ ¹⁾) − σt || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||²

i.e., the decrease in f is at least σt || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||². But why this quantity? It can be shown (via a first-order Taylor expansion) that t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||² is approximately the decrease in f that a gradient step of size t promises. The additional factor σ is a practical concession: although we may not achieve the full predicted decrease of t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||², we insist on achieving at least a fraction σ of it. That is to say, we want the decrease achieved in f to be at least a fixed fraction σ of the decrease promised by the first-order Taylor approximation of f at x⁽ᵏ ⁻ ¹⁾. If the condition is not met, we shrink t by the factor β and check again. Let’s look at our example (setting t̄ = 1, σ = β = 0.5):

The first iteration is calculated by repeatedly shrinking t from t̄ = 1 until the sufficient-decrease condition above holds; likewise, subsequent iterations repeat the same check. A short numerical sketch of the first step is shown below.
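To make the backtracking search concrete, here is a sketch of the first step (my own check, with t̄ = 1 and σ = β = 0.5): the condition fails at t = 1 and t = 0.5 and is first satisfied at t = 0.25, giving x⁽¹⁾ = [0.5; 0].

import numpy as np

f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])

x0 = np.array([1.0, 1.0])          # x^(0)
v = -df(*x0)                       # descent direction at x^(0)
t, sigma, beta = 1.0, 0.5, 0.5
# Shrink t until the sufficient-decrease (Armijo) condition is satisfied
while f(*(x0 + t*v)) > f(*x0) + sigma*t*np.dot(df(*x0), v):
    print(f"t = {t}: condition not met, shrinking")
    t *= beta
print(f"accepted t = {t}, x^(1) = {x0 + t*v}")   # t = 0.25, x^(1) = [0.5, 0]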

We compute the remaining iterations systematically via the following Python code:

# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0  # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Compute the step size using Armijo line search
        t = armijo(f, df, x0, y0, v[0], v[1])
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)  # Append updated x coordinate to the list
        y.append(y0)  # Append updated y coordinate to the list
        num_steps += 1  # Increment the number of steps taken
    return x, y, num_steps

def armijo(f, df, x1, x2, v1, v2, s=0.5, b=0.5):
    t = 1
    # Perform Armijo line search until the Armijo condition is satisfied
    while (f(x1 + t*v1, x2 + t*v2) > f(x1, x2) +
           t*s*np.matmul(df(x1, x2).T, np.array([v1, v2]))):
        t = t*b  # Reduce the step size by a factor of b
    return t

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

As before, the code above reuses the function f and its gradient df defined earlier; only grad_desc has changed, now calling the armijo helper to select the step size at each iteration.

Running the code above, we see that it takes only 10 steps to converge. The following plot shows the trajectory of the iterates during gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]],[b[i]],marker="o",markersize=7, color="r")
    plt.plot([a[i + 1]],[b[i + 1]],marker="o",markersize=7, color="r")
    plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i], head_width=0,
              head_length=0, fc="r", ec="r", linewidth=2.0)
Contour Plot: Backtracking Line Search [Image by Author generated using Python]

In this article, we became familiar with some useful methods for choosing the step size in the gradient descent algorithm. In particular, we studied three main methods: Constant Step Size, which involves keeping the same step size throughout training; Exact Line Search, which involves minimizing the loss as a function of t at each iteration; and Armijo Backtracking, which involves gradually shrinking the step size until a sufficient-decrease condition is met. While these are some of the most important techniques you can use to improve your optimization, there are many others (such as setting t as a function of the number of iterations). Those are often used in more complex settings, such as Stochastic Gradient Descent. The purpose of this article was not only to introduce these methods but also to make you aware of how strongly the step size can affect your optimization. Although most of these methods were presented in the context of Gradient Descent, they can also be applied to other optimization algorithms (e.g., the Newton-Raphson method). Each of these methods has its own merits and may be preferred over the others depending on the application and algorithm.
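As an aside on the "step size as a function of the number of iterations" idea mentioned above, one common variant is a decaying schedule; a minimal illustrative sketch (my own example, with arbitrary constants) could look like this:

# Illustrative decaying step-size schedule: t_k = t0 / (1 + k * decay)
def step_size(k, t0=0.2, decay=0.01):
    return t0 / (1 + k * decay)

print([round(step_size(k), 4) for k in (0, 10, 100)])  # [0.2, 0.1818, 0.1]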

We hope you enjoyed reading this article! If you have any doubts or suggestions, reply in the comment box. Please feel free to email me.

If you liked my article and want to read more, please follow me.

Note: All images were created by the author.



