Gradient descent, how neural networks learn | Deep learning, chapter 2

Last video I laid out the structure of a neural network. I’ll give a quick recap here just so that it’s fresh in our minds, and then I have two main goals for this video. The first is to introduce the idea of gradient descent, which underlies not only how neural networks learn, but how a lot of other machine learning works as well. Then after that we’re going to dig in a little more to how this particular network performs, and what those hidden layers of neurons end up actually looking for.

As a reminder, our goal here is the classic example of handwritten digit recognition, the hello world of neural networks. These digits are rendered on a 28 by 28 pixel grid, each pixel with some grayscale value between 0 and 1. Those are what determine the activations of 784 neurons in the input layer of the network. Then the activation for each neuron in the following layers is based on a weighted sum of all the activations in the previous layer, plus some special number called a bias. Then you compose that sum with some other function, like the sigmoid squishification or a ReLU, the way that I walked through last video. In total, given the somewhat arbitrary choice of two hidden layers here with 16 neurons each, the network has about 13,000 weights and biases that we can adjust, and it’s these values that determine what exactly the network actually does. Then what we mean when we say that this network classifies a given digit is that the brightest of those 10 neurons in the final layer corresponds to that digit. And remember, the motivation that we had in mind here for the layered structure was that maybe the second layer could pick up on the edges, the third layer might pick up on patterns like loops and lines, and the last one could just piece together those patterns to recognize digits.

So here, we learn how the network learns. What we want is an algorithm where you can show this network a whole bunch of training data, which comes in the form of a bunch of different images of
handwritten digits, along with labels for what they’re supposed to be, and it’ll adjust those 13,000 weights and biases so as to improve its performance on the training data. Hopefully this layered structure will mean that what it learns generalizes to images beyond that training data. And the way we test that is that after you train the network, you show it more labeled data that it’s never seen before, and you see how accurately it classifies those new images. Fortunately for us, and what makes this such a common example to start with, is that the good people behind the MNIST database have put together a collection of tens of thousands of handwritten digit images, each one labeled with the number that it’s supposed to be.

As provocative as it is to describe a machine as learning, once you actually see how it works, it feels a lot less like some crazy sci-fi premise and a lot more like, well, a calculus exercise. I mean, basically it comes down to finding the minimum of a certain function.

Remember, conceptually we’re thinking of each neuron as being connected to all of the neurons in the previous layer, and the weights in the weighted sum defining its activation are kind of like the strengths of those connections. And the bias is some indication of whether that neuron tends to be active or inactive. To start things off, we’re just gonna initialize all of those weights and biases totally randomly. Needless to say, this network is going to perform pretty horribly on a given training example, since it’s just doing something random. For example, you feed in this image of a 3 and the output layer just looks like a mess. So what you do is you define a cost function, a way of telling the computer: “No! Bad computer!
That output should have activations which are zero for most neurons, but one for this neuron. What you gave me is utter trash.” To say that a little more mathematically, what you do is add up the squares of the differences between each of those trash output activations and the value that you want them to have, and this is what we’ll call the cost of a single training example. Notice this sum is small when the network confidently classifies the image correctly, but it’s large when the network seems like it doesn’t really know what it’s doing. So then what you do is consider the average cost over all of the tens of thousands of training examples at your disposal. This average cost is our measure for how lousy the network is and how bad the computer should feel.

And that’s a complicated thing. Remember how the network itself was basically a function, one that takes in 784 numbers as inputs, the pixel values, and spits out ten numbers as its output, and in a sense it’s parameterized by all these weights and biases? Well, the cost function is a layer of complexity on top of that. It takes as its input those thirteen thousand or so weights and biases, and it spits out a single number describing how bad those weights and biases are, and the way it’s defined depends on the network’s behavior over all the tens of thousands of pieces of training data. That’s a lot to think about.

But just telling the computer what a crappy job it’s doing isn’t very helpful. You want to tell it how to change those weights and biases so that it gets better. To make it easier, rather than struggling to imagine a function with 13,000 inputs, just imagine a simple function that has one number as an input and one number as an output. How do you find an input that minimizes the value of this function?
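
Going back to the cost defined a moment ago, here is a minimal sketch of it in Python. The helper names (`example_cost`, `one_hot`, `average_cost`) and the activation values are made up for illustration; they are not taken from the video or from any library.

```python
# Hypothetical helpers; plain Python lists stand in for the 10 output activations.

def example_cost(output_activations, desired_activations):
    """Cost of one training example: the sum of squared differences."""
    return sum((a - y) ** 2 for a, y in zip(output_activations, desired_activations))


def one_hot(digit, size=10):
    """Desired output: 1.0 for the labeled digit, 0.0 for every other neuron."""
    return [1.0 if i == digit else 0.0 for i in range(size)]


def average_cost(outputs, labels):
    """Average of the per-example costs over a set of examples."""
    costs = [example_cost(out, one_hot(label)) for out, label in zip(outputs, labels)]
    return sum(costs) / len(costs)


# Made-up activations for two images that are both labeled 3.
confident = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
confused = [0.4, 0.6, 0.2, 0.3, 0.9, 0.5, 0.1, 0.7, 0.8, 0.3]

print(example_cost(confident, one_hot(3)))  # small: the output matches the label
print(example_cost(confused, one_hot(3)))   # large: the output is a mess
print(average_cost([confident, confused], [3, 3]))
```

In the real setup the outputs would come from running the network on each training image, so the average cost ends up depending on every weight and bias.
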
Calculus students will know that you can sometimes figure out that minimum explicitly, but that’s not always feasible for really complicated functions, certainly not in the thirteen-thousand-input version of this situation for our crazy complicated neural network cost function. A more flexible tactic is to start at any old input and figure out which direction you should step to make that output lower. Specifically, if you can figure out the slope of the function where you are, then shift to the left if that slope is positive, and shift the input to the right if that slope is negative. If you do this repeatedly, at each point checking the new slope and taking the appropriate step, you’re gonna approach some local minimum of the function. The image you might have in mind here is a ball rolling down a hill. Notice, even for this really simplified single-input function, there are many possible valleys that you might land in, depending on which random input you start at, and there’s no guarantee that the local minimum you land in is going to be the smallest possible value of the cost function. That’s going to carry over to our neural network case as well. I also want you to notice how if you make your step sizes proportional to the slope, then when the slope is flattening out towards the minimum, your steps get smaller and smaller, and that kind of helps you from overshooting.

Bumping up the complexity a bit, imagine instead a function with two inputs and one output. You might think of the input space as the xy-plane, and the cost function as being graphed as a surface above it. Now instead of asking about the slope of the function, you have to ask which direction you should step in this input space so as to decrease the output of the function most quickly. In other words, what’s the downhill direction?
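
Returning to the single-input case for a moment, the slope-following recipe can be sketched in a few lines of Python. The function `f`, the step scale of 0.01, and the two starting points are arbitrary choices for illustration, picked so that the curve has two valleys.

```python
def f(x):
    # Illustrative single-input function with two valleys (not the network's cost).
    return x**4 - 4 * x**2 + x


def slope(x):
    # Derivative of f, worked out by hand.
    return 4 * x**3 - 8 * x + 1


def descend(x, step_scale=0.01, steps=1000):
    """Repeatedly step against the slope, with step size proportional to the slope."""
    for _ in range(steps):
        x -= step_scale * slope(x)  # positive slope: move left; negative slope: move right
    return x


left = descend(-2.0)   # starts on the left, rolls into the left valley
right = descend(2.0)   # starts on the right, rolls into the right valley

print(left, f(left))
print(right, f(right))
```

The two runs settle in different valleys, and the one on the right is shallower, which is the "no guarantee" point from above: where you end up depends on where you start.
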
And again, it’s helpful to think of a ball rolling down that hill. Those of you familiar with multivariable calculus will know that the gradient of a function gives you the direction of steepest ascent, basically, which direction you should step to increase the function most quickly. Naturally enough, taking the negative of that gradient gives you the direction to step that decreases the function most quickly. Even more than that, the length of this gradient vector is actually an indication for just how steep that steepest slope is. Now if you’re unfamiliar with multivariable calculus and you want to learn more, check out some of the work that I did for Khan Academy on the topic. Honestly, though, all that matters for you and me right now is that in principle there exists a way to compute this vector, this vector that tells you what the downhill direction is and how steep it is. You’ll be okay if that’s all you know and you’re not rock solid on the details, because if you can get that, the algorithm for minimizing the function is to compute this gradient direction, then take a small step downhill, and just repeat that over and over.

It’s the same basic idea for a function that has 13,000 inputs instead of two inputs. Imagine organizing all 13,000 weights and biases of our network into a giant column vector. The negative gradient of the cost function is just a vector. It’s some direction inside this insanely huge input space that tells you which nudges to all of those numbers are going to cause the most rapid decrease to the cost function. And of course, with our specially designed cost function, changing the weights and biases to decrease it means making the output of the network on each piece of training data look less like a random array of ten values and more like an actual decision that we want it to make. It’s important to remember, this cost function involves an average over all of the training data, so if you minimize it, it means it’s a better performance on all of those samples.
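
The "compute the gradient, take a small step downhill, repeat" loop can be sketched for any number of inputs. This is a rough illustration under stated assumptions: the gradient here is estimated with finite differences purely to keep the example short, which is not how a neural network computes it, and the `bowl` function, step scale, and starting point are made up.

```python
def numerical_gradient(f, v, h=1e-6):
    """Estimate the gradient of f at v with central differences (illustration only)."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        up[i] += h
        down = list(v)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def gradient_descent(f, v, step_scale=0.1, steps=200):
    """Repeatedly nudge v by a small multiple of the negative gradient."""
    v = list(v)
    for _ in range(steps):
        grad = numerical_gradient(f, v)
        v = [vi - step_scale * gi for vi, gi in zip(v, grad)]
    return v


def bowl(v):
    # Illustrative two-input function whose lowest point is at (1, -2).
    return (v[0] - 1) ** 2 + 2 * (v[1] + 2) ** 2


result = gradient_descent(bowl, [5.0, 5.0])
print(result)  # ends up close to [1, -2]
```

Nothing in `gradient_descent` depends on the vector having two entries, so the same loop would apply to a vector of 13,000 weights and biases, given an efficient way to get the gradient.
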
The algorithm for computing this gradient efficiently, which is effectively the heart of how a neural network learns, is called backpropagation, and it’s what I’m going to be talking about next video. There, I really want to take the time to walk through what exactly happens to each weight and each bias for a given piece of training data, trying to give an intuitive feel for what’s happening beyond the pile of relevant calculus and formulas. Right here, right now, the main thing I want you to know, independent of implementation details, is that what we mean when we talk about a network learning is that it’s just minimizing a cost function. And notice, one consequence of that is that it’s important for this cost function to have a nice smooth output, so that we can find a local minimum by taking little steps downhill. This is why, by the way, artificial neurons have continuously ranging activations, rather than simply being active or inactive in a binary way, the way that biological neurons are.

This process of repeatedly nudging an input of a function by some multiple of the negative gradient is called gradient descent. It’s a way to converge towards some local minimum of a cost function, basically a valley in this graph. I’m still showing the picture of a function with two inputs, of course, because nudges in a thirteen-thousand-dimensional input space are a little hard to wrap your mind around, but there is actually a nice non-spatial way to think about this. Each component of the negative gradient tells us two things. The sign, of course, tells us whether the corresponding component of the input vector should be nudged up or down. But importantly, the relative magnitudes of all these components kind of tell you which changes matter more.

You see, in our network, an adjustment to one of the weights might have a much greater impact on the cost function than the adjustment to some other weight. Some of these connections just matter more for our training data. So a way that you can think about
this gradient vector of our mind-warpingly massive cost function is that it encodes the relative importance of each weight and bias, that is, which of these changes is going to carry the most bang for your buck. This really is just another way of thinking about direction. To take a simpler example, if you have some function with two variables as an input, and you compute that its gradient at some particular point comes out as (3, 1), then on the one hand you can interpret that as saying that when you’re standing at that input, moving along this direction increases the function most quickly, that when you graph the function above the plane of input points, that vector is what’s giving you the straight uphill direction. But another way to read that is to say that changes to this first variable have three times the importance as changes to the second variable, that at least in the neighborhood of the relevant input, nudging the x-value carries a lot more bang for your buck.

All right, let’s zoom out and sum up where we are so far. The network itself is this function with 784 inputs and 10 outputs, defined in terms of all of these weighted sums. The cost function is a layer of complexity on top of that. It takes the 13,000 weights and biases as inputs and spits out a single measure of lousiness based on the training examples. And the gradient of the cost function is one more layer of complexity still. It tells us what nudges to all of these weights and biases cause the fastest change to the value of the cost function, which you might interpret as saying which changes to which weights matter the most.

So, when you initialize the network with random weights and biases, and adjust them many times based on this gradient descent process, how well does it actually perform on images that it’s never seen before?
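
Before getting to the results, the (3, 1) reading of the gradient can be checked numerically. The function below is a made-up example chosen only because its gradient at the point (1, 0) works out to (3, 1); the nudge size is likewise arbitrary.

```python
def f(x, y):
    # Illustrative function: its gradient is (3x, 1), which equals (3, 1) at the point (1, 0).
    return 1.5 * x**2 + y


x0, y0 = 1.0, 0.0
nudge = 0.001

# Change in the output from nudging each input on its own by the same small amount.
change_from_x = f(x0 + nudge, y0) - f(x0, y0)
change_from_y = f(x0, y0 + nudge) - f(x0, y0)

ratio = change_from_x / change_from_y
print(ratio)  # close to 3 for a small nudge
```

The ratio is only approximately 3, and only near that point, which matches the "at least in the neighborhood of the relevant input" caveat.
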
Well, the one that I’ve described here, with the two hidden layers of sixteen neurons each, chosen mostly for aesthetic reasons, is not bad. It classifies about 96 percent of the new images that it sees correctly. And honestly, if you look at some of the examples that it messes up on, you kind of feel compelled to cut it a little slack. Now if you play around with the hidden layer structure and make a couple tweaks, you can get this up to 98%, and that’s pretty good. It’s not the best. You can certainly get better performance by getting more sophisticated than this plain vanilla network, but given how daunting the initial task is, I just think there’s something incredible about any network doing this well on images that it’s never seen before, given that we never specifically told it what patterns to look for.

Originally, the way that I motivated this structure was by describing a hope that we might have, that the second layer might pick up on little edges, that the third layer would piece together those edges to recognize loops and longer lines, and that those might be pieced together to recognize digits. So is this what our network is actually doing? Well, for this one at least, not at all. Remember how last video we looked at how the weights of the connections from all of the neurons in the first layer to a given neuron in the second layer can be visualized as a given pixel pattern that that second layer neuron is picking up on? Well, when we actually do that for the weights associated with these transitions from the first layer to the next, the patterns aren’t isolated little edges here and there.
They look, well, almost random, just with some very loose patterns in the middle there. It would seem that in the unfathomably large 13,000-dimensional space of possible weights and biases, our network found itself a happy little local minimum that, despite successfully classifying most images, doesn’t exactly pick up on the patterns that we might have hoped for. And to really drive this point home, watch what happens when you input a random image. If the system was smart, you might expect it to feel uncertain, maybe not really activating any of those 10 output neurons or activating them all evenly, but instead it confidently gives you some nonsense answer, as if it feels as sure that this random noise is a 5 as it does that an actual image of a 5 is a 5. Phrased differently, even if this network can recognize digits pretty well, it has no idea how to draw them.

A lot of this is because it’s such a tightly constrained training setup. I mean, put yourself in the network’s shoes here. From its point of view, the entire universe consists of nothing but clearly defined unmoving digits centered in a tiny grid, and its cost function just never gave it any incentive to be anything but utterly confident in its decisions.

So if this is the image of what those second layer neurons are really doing, you might wonder why I would introduce this network with the motivation of picking up on edges and patterns. I mean, that’s just not at all what it ends up doing. Well, this is not meant to be our end goal, but instead a starting point. Frankly, this is old technology, the kind researched in the 80s and 90s, and you do need to understand it before you can understand more detailed modern variants, and it clearly is capable of solving some interesting problems. But the more you dig into what those hidden layers are really doing, the less intelligent it seems.

Shifting the focus for a moment from how networks learn to how you learn, that’ll only happen if you engage actively with the material here somehow. One
pretty simple thing that I want you to do is just pause right now and think deeply for a moment about what changes you might make to this system and how it perceives images if you wanted it to better pick up on things like edges and patterns. But better than that, to actually engage with the material, I highly recommend the book by Michael Nielsen on deep learning and neural networks. In it, you can find the code and the data to download and play with for this exact example, and the book will walk you through step by step what that code is doing. What’s awesome is that this book is free and publicly available, so if you do get something out of it, consider joining me in making a donation towards Nielsen’s efforts. I’ve also linked a couple other resources that I like a lot in the description, including the phenomenal and beautiful blog post by Chris Olah and the articles in Distill.

To close things off here for the last few minutes, I want to jump back into a snippet of the interview that I had with Lisha Li. You might remember her from the last video. She did her PhD work in deep learning, and in this little snippet she talks about two recent papers that really dig into how some of the more modern image recognition networks are actually learning. Just to set up where we were in the conversation, the first paper took one of these particularly deep neural networks that’s really good at image recognition, and instead of training it on a properly labeled dataset, it shuffled all of the labels around before training. Obviously the testing accuracy here was going to be no better than random, since everything’s just randomly labeled, but it was still able to achieve the same training accuracy as you would on a properly labeled dataset. Basically, the millions of weights for this particular network were enough for it to just memorize the random data, which kind of raises the question of whether minimizing this cost function actually corresponds to any sort of structure in the image?
Or is it just, you know, memorizing the entire dataset of what the correct classification is? And so, you know, half a year later at ICML this year, there was not exactly a rebuttal paper, but a paper that addressed some aspects, like, hey, actually these networks are doing something a little bit smarter than that. If you look at that accuracy curve, if you were just training on a random dataset, that curve sort of went down very, you know, very slowly, in almost kind of a linear fashion, so you’re really struggling to find that local minima of possible, you know, the right weights that would get you that accuracy. Whereas if you’re actually training on a structured dataset, one that has the right labels, you know, you fiddle around a little bit in the beginning, but then you kind of drop very fast to get to that accuracy level, and so in some sense it was easier to find that local maxima. And so what was also interesting about that is it brings into light another paper from actually a couple of years ago, which has a lot more simplifications about the network layers, but one of the results was saying how, if you look at the optimization landscape, the local minima that these networks tend to learn are actually of equal quality. So in some sense, if your dataset is structured, you should be able to find that much more easily.

My thanks, as always, to those of you supporting on Patreon. I’ve said before just what a game-changer Patreon is, but these videos really would not be possible without you. I also want to give a special
thanks to the VC firm Amplify Partners, in their support of these initial videos in the series. They focus on very early stage machine learning and AI companies, and I feel pretty confident in the probabilities that some of you watching this, and even more likely some of the people that you know, are right now in the early stages of getting such a company off the ground. The Amplify folks would love to hear from any such founders, and they even set up an email address just for this video that you can reach out to them through: 3blue1brown@amplifypartners.com

100 Replies to “Gradient descent, how neural networks learn | Deep learning, chapter 2”

  1. Part 3 will be on backpropagation. I had originally planned to include it here, but the more I wanted to dig into a proper walk-through for what it's really doing, the more deserving it became of its own video. Stay tuned!

  2. We better be watchful of how we treat computers while teaching them. Scolding it as "Bad computer" today for its wrong output may cost us heavily tomorrow in case it decides to take revenge.

  3. When you feed random noise and get an answer from the network, it may seem like nonsense. But in an experiment conducted at MIT a few years back, people were shown thousands of noise images and asked whether an image felt like, say, a car. A simple yes/no answer. The subjects did not know why they made their choices, but they made them anyway. Afterwards, when researchers took the average of all pixel values from all the images about which participants said "yes, this reminds me of a car", they saw a vague, blurry silhouette of a car. The same goes for other things. We don't identify cars by attributes of cars, we identify them by a representation of "carness" in our minds, which is statistically learned over the years. Probably it's the same case with neural networks. It has a representation of fiveness and the random noise is statistically similar to it. You can google "Random Image Experiment Reveals The Building Blocks of Human Imagination" for further information on the experiment.

  4. if you don't give your network any 'copout' answer, how do you expect it to say 'I don't know about this' if you feed it a dadaist image? should it just give 0.1 weight to each digit?

  5. So how is the gradient of the cost function found? Is the cost function actually differentiable? Or is each partial derivative found numerically?

  6. Wow, more than 2 times less views than the 1st part. That's sad 🙁
    I'm now re-watching whole thing again and it is beautiful. Thanks 3Blue1Brown for this masterpiece.

  7. One thing I'm constantly confused by is what exactly a "weight" is.
    Let's say I have a pixel in the upper right corner. A second layer neuron detects loops in the bottom. Does weight refer to "how much that pixel lights up?" or "How important is this pixel in the detection of this loop?"
    The pixel could light up to 1, but its importance within the loop detection could be 0.
    The pixel could be 0, but its importance within the loop detection could be high.
    I understand that each neuron has a bias (which essentially changes how much harder or easier the weights have to work for that next layer neuron to fire), and the connection between layer 1 neuron and layer 2 neuron is itself the weight, but what exactly weight means is throwing me off.

  8. Once I graduate and start working, I’m gonna send you the money I owe you for watching all these videos. I’m doing BSEE for control systems so hopefully it works out.

  9. 1. What if you trained it with a target neuron that means, "This is not a digit," or that had a "confidence" target.
    Would that work? Would it be impossibly hard to train?
    2. What if you trained the network yourself, to look for exactly those things you were hoping it would find, and then — kind of manually teach it?

  10. Why are your values slightly off at 4:16? For example, the first number is (0.43-0)² = 0,1849, not 0,1863.
    Every number is slightly off in your example.

  11. 15:19 seems interesting, just like you have to train your own (biological) NN to draw a human face, although you have seen millions of them

  12. the average male brain has 86 billion neurons. the human body is made up of 37 trillion cells.
    neurons are made with multiple dendrites and multiple axon terminals.
    does deep learning use opthamalogy theory to tell a computer how bad it's doing?
    is it possible to embed an AI chatbot inside of an AI to give it an internal monologue in order to simulate thinking in words?
    the thoughts could then be displayed on a screen. making their thoughts visible would make AI less scary for People.

  13. I'm sorry MIT and Harvard but this is what TEACHING is supposed to be like. I watched some Stanford/Harvard/MIT classes and no one comes close to this

  14. Thank you 3Blue1Brown for these videos, they're extremely instructive!
    However, there's one point that I didn't understand: why does the gradient descent algorithm struggle to find local minima with swapped labels? If anyone could answer my question I'd be very grateful.
    Thank you.

  15. @18:10
    … Nah, looks pretty much what I see on the media nowadays. Everything and everyone improperly labeled as seen fit. Lol.

  16. If I want to start doing deep learning and AI, should I start learning calculus? and if so, where should I start?

  17. Your animations are so damn intuitive and amazing. You have developed one of the best teaching strategies! Thanks a lot.

  18. Fun Fact:
    If the whole equation for the cost function were to be written on paper in a straight line, with all of the variables plugged in, it would be approximately 115,696,496 characters long. If each character, on average, was 6 mm long, then the whole equation written out would be 694.496 km (431.343 miles) long! That's almost long enough to go from NYC to Chicago! And a computer can solve that equation in less than a second!

    i did the math

  19. I wish someone had introduced this to me at a young age back in the 90s. I had no idea neural networks have existed for so long

  20. I had always been daunted by neural networks due to the sheer complexity of the equations revolving around them, but this video series did the magic. I'm so extremely thankful to you and glad I found this that I wanna support this channel. I'm an undergrad now but I'll definitely be supporting the channel on Patreon when I start earning

  21. This is the best introductory video on neural network I have seen so far. Thank you so much for these fantastic videos!

  22. Because the sigmoid function has a range of (0,1) doesn't adding more nodes tend to saturate the output? Everything will always be at 1, unless the weights handle that in learning idk

  23. i wonder what would happen, if you put in a 1 for the outputneuron and calculated all the way back to the pixelgrid using the weights…

  24. Thanks I love How you explain, it's really clear, better than any other video. Thank you ! I'm gonna study Deep Learning, so I begin to learn by myself

  25. 18:04 Can anyone explain what they're talking about at this timestamp, re. mixing up the labels? Why would the network care about that? Doesn't it just mean that it'll identify everything with whatever false label you used in training?

  26. Really well done. I had never looked into neural nets but make sense as just an expansion of control system theory I learned about many years ago during my master's in EE. Will be fun to play with.

  27. 96% correct, what happens when the 4% incorrect means accident (autonomous cars, Tesla). If the model is 4% wrong and this causes a fatal accident you will be 100% dead 🙂

  28. The Korean subtitles are such a mess that it would almost be better not to have them at all. They just confuse what this video is saying. I think they need an update.

  29. Instead of squaring the difference between correct answers and given answers, why would it not be more accurate to take an absolute value to get the positive value?

  30. How I'd improve the program:
    1. At first I'd also give it some trash to input and train it to say that it is a trash (so all end neurones would be zeros)
    2. At second, if I wanted it to work as it is meant to be, namely through finding edges and shapes, I'd make separate programs, like one is searching for edges, and second is searching for shapes of these edges, and third is searching for digits of these shapes.

  31. Your videos with such wonderful LaTeX animations are just as high level as a BBC awarded documentaries. Very impressive to say the least.

  32. Why not take modulus of the difference between the desired output and the actual output instead of squaring it. Wouldn't squaring magnify the big differences and minimize the small differences instead of keeping it linear. Taking the modulus feels more intuitive but maybe this is a case where something works better just because.

  33. Hey there. Thanks for the videos, I am taking your course on multivariable calculus on Khan right now. Quick question, what software do you use for making those vids? Especially how do you visualize functions and how do you write freely on them. Thanks.

  34. Your videos have been key in teaching just enough of the topic to my non-technical client to allow them to see past the marketing speak of ML vendors that either don't have real ML, or have garbage ML. This has actually led to savings in real dollars and spared us several years of getting wrong products and having them fail us. Thank you!

  35. What if, instead of using 10 neurons (one for each digit), you use 11, with the new one being a "not a digit" neuron? Wouldn't that "solve" the random input?
