Understanding gradient descent is a critical part of learning machine learning. Although gradient descent is not the preferred method for optimization, it is similar to the more advanced methods so it’s still valuable to learn how it works.
Before we get into the programming part, we have to go through some math. No, don’t worry! It’s only 1 minute. And it’s only early high school level.
This is the second post in a three part series on fast.ai lesson 2. You can read the first part here and the third part here. This post describes what gradient descent is and how to implement in Pytorch.
1 minute of high school math
Virtually all algebra courses start with the following formula:
This gives us a straight line that could look something like this:
As you can see, using different values for changes the slope and using different values for changes the height of the line (with respect to the plane).
Now look at the following graph:
You can probably spot a relation in the data. Our goal is now to get a computer to do the same by tweaking the values of and to fit a line through these points.
We replace the with . This is to ease the matrix multiplication later on (don’t worry, Pytorch does it for us). For those of you who will continue to explore the math behind machine learning (you should), the is the bias factor.
It’s as easy as that!
Before we start implementing the code, there’s one more thing you need to know: a tensor is an array consisting of other arrays that all have the same size. There’s a lot more to this that you don’t need to know to implement gradient descent. In fact, there’s an entire branch of mathematics dedicated to tensors and manipulating them: linear algebra.
Start by setting
n to , the size of our training set:
n = 100
Then we create
n points. They all have coordinates :
x = torch.ones(n,2)
Next, we distribute the points on the x axis:
a (the tensor for slope and bias) and multiply it by
a = tensor(3.,2) y = x @ a
Note that we use the
@operator to do the multiplication step. The reason for this is that there are special rules for multiplying tensors. If you’re curious about this, I encourage you to search for “matrix multiplication” and learn about how it works.
If we now plot the line (the values of
y), we get this neat line.
Because it would be too easy for our model to fit a line through this dataset, we apply some random noise to make it more interesting.
y += torch.rand(n) # create a tensor with n elements which all have random values.
This looks like a pretty good dataset for our model to train on.
Before we start training, we need a way to measure how good our model performs on each iteration. The way we do that is by drawing lines from each of our points to the fitted line. Then we take the distance and by convention we square it. Then we take the average of all squared distances and this is the cost (mean squared error) for our model.
The cost for a machine learning model can be computed in multiple manners. The mean squared error technique is just an example that happens to work well for this particular problem.
Because this may sound a little complicated, let’s think through a couple of examples: if the points our all on the line, the distance would be 0 and therefore the square distance automatically is 0 as well. Taking the average yields, of course, 0. This is a correct error because our model fits all of the points. When the line is not at all like the training set, the squared distances are all high, and the average values is gonna be big. This is correct, because the model is not good at fitting the dataset.
In nearly all machine learning algorithms, we start by guessing a random value and then we evaluate it. We could set a guess ourselves like or have the computer make a guess. The latter is used more often because for larger models, it becomes impossible for humans to enter a value for every parameter.
a_guess = tensor(-1., 1)
After computing y values,
y_guess, corresponding to these value of
a_guess, we can inspect our guess by plotting it:
Not a very good guess. Let’s compute the error for it.
def mean_squared_error(y, y_guess): distances = (y_guess - y) ** 2 # a tensor return distances.mean()
You should see a value of about , this might differ slightly because of the random creation of the dataset. Anyway, the error is quite high.
Implementing stochastic gradient descent
Now that we have a way to measure the performance of our model, we can ask Pytorch to make us a model.
First, tell Pytorch
a is not the average tensor, it’s a set of parameters that we would like to change in order to decrease the error.
a = nn.Parameter(a)
A simplified description of how gradient descent works is it looks at the current error, it computes the gradient, (a step in an n-dimensional plane), (how to compute the gradient is a topic for a future blog post) and takes that step. Every step we take, we get a little closer to the best possible model (ignoring local minima - topic for another post as well). We multiply the size of the step by the learning rate. In order words, the size of the learning rate adjusts how big the steps we take are. Before entering 10000 for the learning rate, note that too big of a step might cause we miss the perfect model.
The code for one iteration of this algorithm (= epoch, computing a step and taking it) looks like this:
def update(): lr = 1e-1 y_guess = x @ a loss = mean_squared_error(y, y_guess) loss.backward() with torch.no_grad(): a.sub_(lr * a.grad) a.grad.zero_() # clean step memorization
Now for a certain number of epochs, we perform the update:
num_epochs = 100 for _ in range(num_epochs): update()
This gives us a very neat result:
An animated image of our model learning (source):
The bigger picture
OK, great. We’re able to fit a line through some random points. How would I now use this to make predictions?
Learning how to fit a line was just an example of how a computer is able to learn about a dataset that can be easily visualized which is the reason it is used so much in ML education
Making predictions is not far off. Except for the x and y coordinates, we got tens, hundreds of even thousands or more axises (dimensions) that we want to fit a line through.
The gradient descent algorithm we used is a so-called optimiser. It optimises the parameters according the an error rate. Gradient descent is a relatively slow technique that’s not used much in industry. More advanced optimizers are Conjugate Descent, BFGS, L-BFGS and Newton’s optimization method. Optimization is a huge part of machine learning so I encourage you to learn more about by looking at the wikipedia page for example.
In this blogpost you learned about the principles that make up much of machine learning.
You can find the code for this project on GitHub. Don’t forget to leave a star if you liked it! That would mean a lot to me.
A huge thanks to Sam Miserendino for proofreading this post!