# Speed Up Your Gradient Descent: The Epic Quest for the Optimal Step Size | by Naman Agrawal

## Step Size Techniques for Gradient Descent in Machine Learning

1. Introduction
2. Method 1: Constant Step Size
3. Method 2: Exact Line Search
4. Method 3: Backtracking Line Search
5. Conclusion

For training machine learning models of almost any kind, Gradient Descent is one of the most widely used optimization techniques. It provides an efficient way of minimizing the loss function, especially in cases where no closed-form solution exists. More generally, consider a problem defined by a convex and differentiable function f: Rᵈ → R (many loss functions satisfy this). The goal is to find x* ∈ Rᵈ that minimizes the loss function:

Gradient Descent provides an iterative method to solve this problem. The update rule is given as follows:

Where x⁽ᵏ⁾ denotes the value of x in the kth iteration of the algorithm, and tₖ denotes the step size or learning rate of the model in the kth iteration. The workflow of the algorithm is given as follows:

1. Define the loss function f and compute its gradient ∇f.
2. Start with a random choice of x ∈ Rᵈ, call it x⁽⁰⁾ (the initial iterate).
3. Until a stopping criterion is met (for example, the gradient norm drops below a tolerance), do the following:
A) Determine the direction in which x should change. Under gradient descent, this is the negative of the gradient of the loss function evaluated at the current iterate: vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
B) Determine the step size, i.e., the magnitude of the change: tₖ.
C) Update the iterate: x⁽ᵏ⁾ = x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)

That’s the whole process in a nutshell: take the current iterate, determine the direction in which it needs to change (vₖ), determine the magnitude of the change (tₖ), and update it.
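The loop above can be sketched in a few lines of Python. Here is a minimal, self-contained version applied to a hypothetical one-dimensional toy loss f(x) = x² (not the example function used later in this article):

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, tol=1e-6, max_iter=10_000):
    """Generic gradient descent: x_k = x_{k-1} - t * grad(x_{k-1})."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:  # stopping criterion: small gradient
            break
        x = x - t * g                 # step of size t opposite the gradient
    return x, k

# Toy example: f(x) = x^2, so grad f(x) = 2x; the minimum is at x = 0
x_star, steps = gradient_descent(lambda x: 2 * x, x0=1.0)
```

With a constant step size t = 0.1 the iterates contract toward 0 geometrically, so `x_star` ends up very close to the true minimizer.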

So, what is this article about? Our focus will be on step 3B: finding the step size tₖ. This is one of the most overlooked aspects of gradient descent, yet the step size can greatly affect how quickly your algorithm converges to a solution and the accuracy of the solution it converges to. In many cases, data scientists simply set a fixed value for the step size throughout the learning process, or tune it with validation techniques. But there are more effective ways to approach this problem. In this article, we will discuss three ways to determine tₖ:

1. Constant Step Size
2. Exact Line Search
3. Backtracking Line Search (Armijo’s Rule)

For each of these methods, we will discuss the theory and compute the first few iterations by hand on an example. In particular, we’ll use the loss function f(x, y) = 2x² + 3y² − 2xy − 1 (defined in the code below) to illustrate:

The 3D-Plot of this function is shown below:

From the figure, it is clear that the global minimum is x* = [0; 0]. In this article, we will manually compute the first few iterations and count the number of iterations to convergence for each of these methods. We will also trace the iterate trajectory to understand how each method converges. It is usually easier to refer to the contour plot of the function (instead of its 3D plot) to compare the different trajectories. The contour plot can be generated via the following code:

```python
# Load Packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set(style="darkgrid")
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

# Define Function
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Plot contour
X = np.arange(-1, 1, 0.005)
Y = np.arange(-1, 1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
cmap = plt.cm.get_cmap('viridis')
plt.contour(X, Y, Z, 250, cmap=cmap)
```

Let’s get started!

## Method 1: Constant Step Size

This is the simplest method, and the most widely used for training ML models. It involves setting the step size to a fixed constant for every iteration: tₖ = t for all k.

One should be very careful in choosing the right t under this method. A smaller value of t may yield a more accurate solution, but convergence will be slower. On the other hand, a large t makes the algorithm faster, but at the cost of accuracy. Using this method requires the user to carefully balance the trade-off between the speed of convergence and the accuracy of the solution.
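One way to see this trade-off concretely is to compare the loss reached by several candidate step sizes after a fixed number of iterations on our example function. The sketch below is an illustration of the idea (the helper `run` and the candidate grid are my own, not from the article):

```python
import numpy as np

f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])

def run(t, x0=1.0, y0=1.0, n_iter=50):
    """Loss reached after n_iter gradient descent steps with step size t."""
    for _ in range(n_iter):
        g = df(x0, y0)
        x0, y0 = x0 - t*g[0], y0 - t*g[1]
    return f(x0, y0)

# Candidate step sizes; pick the one reaching the lowest final loss
candidates = [0.01, 0.05, 0.1, 0.2, 0.3]
best_t = min(candidates, key=run)
```

A very small t (0.01) has not come close to the minimum after 50 steps, while a too-large t (0.3) diverges; the selection lands in between.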

In practice, many data scientists use validation methods such as cross-validation or k-fold cross-validation to tune t. This involves holding out part of the training data (called validation data), which is used to pick the best-performing value among the different values that t can take. Let’s look at our example:

The first step is to compute the gradient of f: ∇f(x, y) = [4x − 2y; 6y − 2x].

For all the following calculations, we will take the starting point to be x⁽⁰⁾ = [1; 1]. Under this method, we set t = 0.1.

The first two iterations are computed by hand as follows:

We compute the remaining iterations programmatically via the following Python code:

```python
# Define the function f(x, y)
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Define the gradient of f(x, y)
def df(x, y):
    return np.array([4*x - 2*y, 6*y - 2*x])

# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, t=0.1, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)    # Compute the direction of descent
        x0 = x0 + t*v[0]   # Update x coordinate
        y0 = y0 + t*v[1]   # Update y coordinate
        x.append(x0)       # Append updated x coordinate to the list
        y.append(y0)       # Append updated y coordinate to the list
        num_steps += 1     # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

In the code above, we have defined the following functions (which will be reused throughout): f, the loss function; df, its gradient; and grad_desc, the gradient descent routine.

Running the code above, we see that it takes 26 steps to converge. The following plot shows the iterate trajectory during the descent:

```python
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker="o", markersize=7, color="r")
    plt.plot([a[i + 1]], [b[i + 1]], marker="o", markersize=7, color="r")
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc="r", ec="r", linewidth=2.0)
```

Contour Plot: Constant Step Size t = 0.1 [Image by Author generated using Python]

To better understand how important it is to choose the right t in this method, let’s look at the effect of increasing or decreasing t. If we reduce the value of t from 0.1 to 0.01, the number of steps to convergence increases significantly from 26 to 295. The iterate trajectory for this case is shown below: Contour Plot: Constant Step Size t = 0.01 [Image by Author generated using Python]

However, by increasing t from 0.1 to 0.2, the number of steps to convergence drops from 26 to just 11, as shown below: Contour Plot: Constant Step Size t = 0.2 [Image by Author generated using Python]

However, it is important to note that this is not always the case. If the step size is too large, the iterates may simply overshoot the minimum and never converge. In fact, increasing t from 0.2 to 0.3 causes the iterates to diverge, making convergence impossible. This is evident from the following trajectory (with t = 0.3), shown for the first 8 steps only: Contour Plot: Constant Step Size t = 0.3 [Image by Author generated using Python]
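The contrast between t = 0.2 and t = 0.3 is easy to verify numerically. Here is a quick sketch, capping the run at 8 steps as in the plot (the helper `f_after` is my own, not code from the article):

```python
import numpy as np

f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])

def f_after(t, n_steps, x0=1.0, y0=1.0):
    """Loss value after n_steps of gradient descent with step size t."""
    for _ in range(n_steps):
        g = df(x0, y0)
        x0, y0 = x0 - t*g[0], y0 - t*g[1]
    return f(x0, y0)

diverging = f_after(0.3, 8)   # loss grows above its starting value f(1, 1) = 2
converging = f_after(0.2, 8)  # loss approaches the minimum value of -1
```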

Therefore, it is clear that finding the right value of t is very important in this method, and even a small increase or decrease can greatly affect the convergence of the algorithm. Now, let’s discuss the next method of determining t.

## Method 2: Exact Line Search

In this method, we do not fix t to a single value ahead of time. Instead, we view the problem of finding the optimal t as a 1D optimization problem: at each iteration, we find the t that minimizes the loss along the descent direction, g(t) = f(x⁽ᵏ ⁻ ¹⁾ + t vₖ).

Notice how elegant this is! We have a multivariate optimization problem (minimizing f) that we are trying to solve using gradient descent. We already know the best direction for our iterative update (vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)); we only need to find the right magnitude tₖ. In other words, the value of the next iterate depends only on the value of tₖ that we choose. Therefore, we treat this as another (but much simpler!) one-dimensional optimization problem.

Thus, we choose x⁽ᵏ⁾ to be the iterate that minimizes the loss f along the descent direction. This usually reduces the number of iterations significantly. However, it also adds extra cost per iteration: the computation of the minimizer of g.
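For a quadratic loss like ours, the minimizer of g actually has a closed form. Writing f(x) = ½xᵀAx − 1 with A = [[4, −2], [−2, 6]] (so that ½xᵀAx = 2x² + 3y² − 2xy), the optimal step along the steepest-descent direction with g = ∇f(x) is t* = gᵀg / gᵀAg. The sketch below is my own derivation check, not code from the article; it compares the closed form against the numerical 1D minimization used later:

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[4.0, -2.0], [-2.0, 6.0]])  # f(x) = 0.5 x^T A x - 1
grad = lambda x: A @ x                    # gradient of the quadratic
f = lambda x: 0.5 * x @ A @ x - 1

def exact_step(x):
    """Closed-form minimizer of g(t) = f(x - t*grad(x)) for a quadratic f."""
    g = grad(x)
    return (g @ g) / (g @ A @ g)

# Compare with numerical 1D minimization at the starting point (1, 1)
x0 = np.array([1.0, 1.0])
g0 = grad(x0)
t_closed = exact_step(x0)                             # = 20 / 80 = 0.25
t_numeric = minimize_scalar(lambda t: f(x0 - t * g0)).x
```

In general the 1D problem has no closed form, which is why the article resorts to a numerical scalar minimizer.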

The first two iterations are computed by hand as follows:

We compute the remaining iterations programmatically via the following Python code:

```python
# Import package for 1D Optimization
from scipy.optimize import minimize_scalar

def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Define the 1D objective g(t) for the line search
        g = lambda t: f(x0 + t*v[0], y0 + t*v[1])
        t = minimize_scalar(g).x  # Minimize g over t
        x0 = x0 + t*v[0]   # Update x coordinate
        y0 = y0 + t*v[1]   # Update y coordinate
        x.append(x0)       # Append updated x coordinate to the list
        y.append(y0)       # Append updated y coordinate to the list
        num_steps += 1     # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

As before, the code reuses the functions f and df defined earlier; only grad_desc has changed.

Running the code above, we see that it takes only 10 steps to converge (a big improvement over the constant step size). The following plot shows the iterate trajectory during the descent:

```python
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker="o", markersize=7, color="r")
    plt.plot([a[i + 1]], [b[i + 1]], marker="o", markersize=7, color="r")
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc="r", ec="r", linewidth=2.0)
```

Contour Plot: Exact Line Search [Image by Author generated using Python]

Now, let’s discuss the next method of determining t.

## Method 3: Backtracking Line Search (Armijo’s Rule)

Backtracking is an adaptive strategy for choosing the step size. In my experience, it is one of the most effective ways to speed up gradient descent: convergence is often much faster than with a constant step size, without the cost of exactly minimizing the 1D function g.

The idea is as follows: we start with a large step size (which is often desirable in the early stages of the algorithm) and check whether it yields a sufficient decrease in the loss. If the step size is found to be too large, we shrink it by multiplying by a fixed scalar β ∈ (0, 1). We repeat this process until a sufficient decrease in f is obtained. Specifically, we choose the largest t such that:

i.e., the decrease in f is at least σt || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||². But why this particular quantity? It can be shown mathematically (via the first-order Taylor expansion) that t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||² is the decrease in f predicted by the linear approximation for the update made in this iteration. The additional σ is a relaxation: since we generally cannot achieve the full predicted decrease t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||², we only ask for a fraction of it, controlled by the scale factor σ. That is to say, we want the decrease achieved in f to be at least a fixed fraction σ of the decrease promised by the first-order Taylor approximation of f at x⁽ᵏ ⁻ ¹⁾. If the condition is not met, we shrink t by the factor β. Let’s look at our example (setting t̄ = 1, σ = β = 0.5):

The first two iterations are computed by hand as follows:

Likewise,
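The manual backtracking above can be sanity-checked with a few lines of Python. This is a standalone sketch of the Armijo loop for the first iteration at x⁽⁰⁾ = (1, 1) with t̄ = 1 and σ = β = 0.5 (the helper `armijo_step` is my own name for it):

```python
import numpy as np

f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])

def armijo_step(x0, y0, sigma=0.5, beta=0.5, t=1.0):
    """Backtrack from t = 1 until the Armijo sufficient-decrease condition holds."""
    g = df(x0, y0)
    v = -g  # steepest-descent direction
    # Shrink t while the achieved decrease is less than sigma * t * ||grad f||^2
    while f(x0 + t*v[0], y0 + t*v[1]) > f(x0, y0) + sigma * t * (g @ v):
        t *= beta
    return t

t1 = armijo_step(1.0, 1.0)
```

Starting from t = 1, the condition fails at t = 1 and t = 0.5, and is first satisfied at t = 0.25, matching the hand computation.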

We compute the remaining iterations programmatically via the following Python code:

```python
# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Compute the step size using Armijo line search
        t = armijo(f, df, x0, y0, v[0], v[1])
        x0 = x0 + t*v[0]   # Update x coordinate
        y0 = y0 + t*v[1]   # Update y coordinate
        x.append(x0)       # Append updated x coordinate to the list
        y.append(y0)       # Append updated y coordinate to the list
        num_steps += 1     # Increment the number of steps taken
    return x, y, num_steps

def armijo(f, df, x1, x2, v1, v2, s=0.5, b=0.5):
    t = 1
    # Perform Armijo line search until the Armijo condition is satisfied
    while (f(x1 + t*v1, x2 + t*v2) >
           f(x1, x2) + t*s*np.matmul(df(x1, x2).T, np.array([v1, v2]))):
        t = t*b  # Reduce the step size by a factor of b
    return t

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")

As before, the code reuses the functions f and df defined earlier; the new helper armijo implements the backtracking search.

Running the code above, we see that it takes only 10 steps to converge. The following plot shows the iterate trajectory during the descent:

```python
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker="o", markersize=7, color="r")
    plt.plot([a[i + 1]], [b[i + 1]], marker="o", markersize=7, color="r")
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc="r", ec="r", linewidth=2.0)
```