
Download "Backpropagation explained | Part 5 - What puts the "back" in backprop?"

Video tags

Keras
deep learning
machine learning
artificial neural network
neural network
neural net
AI
artificial intelligence
Theano
Tensorflow
tutorial
Python
supervised learning
unsupervised learning
Sequential model
transfer learning
image classification
convolutional neural network
CNN
categorical crossentropy
relu
activation function
stochastic gradient descent
educational
education
fine-tune
data augmentation
autoencoders
clustering
batch normalization
Subtitles

  • Russian
Transcript

00:00:00
Hey, what's going on, everyone? In this video we'll see the math that explains how backpropagation works backwards through a neural network. So let's get to it.

[Music]
00:00:20
Alright, we've seen how to calculate the gradient of the loss function using backpropagation in the previous video. We haven't yet seen, though, where the backwards movement comes into play that we talked about when we discussed the intuition for backprop. So now we're going to build on the knowledge that we've already developed to understand what puts the "back" in backpropagation. The explanation we'll give for this will be math-based, so we're first going to start out by exploring the motivation needed for us to understand the calculations that we'll be working through. We'll then jump right into the calculations, which, we'll see, are actually quite similar to ones we've worked through already in the previous video. After we've got the math down, we'll then bring everything together to achieve the mind-blowing realization for how these calculations are mathematically done in a backwards fashion. Alright, let's begin.
00:01:16
We left off from our last video by seeing how we can calculate the gradient of the loss function with respect to any weight in the network. When we went through the process for showing how that was calculated, recall that we worked with this single weight in the output layer of the network, and then generalized the result we obtained by saying that this same process could be applied for all other weights in the network. So, for this particular weight, we saw that the derivative of the loss with respect to this weight was equal to this.
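(The transcript refers to an on-screen equation here. As a sketch of what it refers to, in the notation this series uses, with C_0 for the loss on a single sample, a for activation outputs, z for weighted inputs, and assuming the output-layer weight from the previous video was w_{12}^{(L)}, the weight connecting node 2 in layer L-1 to node 1 in layer L, that result reads:

    \frac{\partial C_0}{\partial w_{12}^{(L)}} = \frac{\partial C_0}{\partial a_1^{(L)}} \cdot \frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} \cdot \frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}} )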
00:01:46
Now, what would happen if we chose to work with a weight that's not in the output layer, like this weight here, for example? Well, using the formula we obtained for calculating the gradient of the loss, we see that the gradient of the loss with respect to this particular weight is equal to this. So check it out: it looks just like the equation we used for the previous weight we were working with. The only difference is that the superscripts are different, because now we're working with a weight in the third layer, which we are denoting as L-1, and the subscripts are different as well, because we're working with the weight that connects the second node in the second layer to the second node in the third layer.
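(Written out in the same notation, and assuming this weight is denoted w_{22}^{(L-1)}, connecting node 2 in layer L-2 to node 2 in layer L-1, the equation being referred to is:

    \frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \frac{\partial C_0}{\partial a_2^{(L-1)}} \cdot \frac{\partial a_2^{(L-1)}}{\partial z_2^{(L-1)}} \cdot \frac{\partial z_2^{(L-1)}}{\partial w_{22}^{(L-1)}} )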
00:02:28
So, given this is the same formula, then we should just be able to calculate it in the exact same way we did for the previous weight we worked with in the last video, right? Well, not so fast. So yes, this is the same formula, and in fact, the second and third terms here on the right-hand side will be calculated using this same exact approach as we used before. This first term, though, the derivative of the loss with respect to this one activation output, that's actually going to require a different approach for us to calculate it.
00:02:59
Let's think about why. When we calculated the derivative of the loss with respect to a weight in the output layer, we saw that this first term is the derivative of the loss with respect to the activation output for a node in the output layer. Well, as we've talked about before, the loss is a direct function of the activation output of all of the nodes in the output layer, because the loss is the sum of the squared errors between the actual labels of the data and the activation output of the nodes in the output layer.
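(In symbols, writing y_j for the label corresponding to output node j, the loss for a single sample is:

    C_0 = \sum_{j} \left( a_j^{(L)} - y_j \right)^2

with the sum taken over the nodes j in the output layer L.)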
00:03:29
Okay, so when we calculate the derivative of the loss with respect to a weight in layer L-1, for example, this first term is the derivative of the loss with respect to the activation output for node 2, not in the output layer L, but in layer L-1. And unlike the activation output for the nodes in the output layer, the loss is not a direct function of this output. See, if we look at where this activation output is within the network, and then look at where the loss is calculated at the end of the network, we can see that this output is not being passed directly to the loss. So we need to understand how to calculate this term, and that's going to be our focus for now.
00:04:13
So, if you need to, go ahead and pause the video here and go back and watch the previous video where we calculated the first term in this equation to see the approach we took. Then you can compare that to the approach we're going to take to calculate this first term in this equation now. Because the second and third terms on the right-hand side are calculated in the exact same manner as we've seen before, we're not going to cover those here. We're just going to focus on how to calculate this term, and then we'll combine the results from all the terms to see the final result.
00:04:45
Alright, at this point, go ahead and admit it: you're thinking to yourself, "I clicked on this video to see how backprop works backwards. What the heck does any of this so far have to do with the backwards movement of backpropagation?" I hear you. We are getting there, so stick with me. We have to go through this math first and see what it's doing, and then once we see that, we'll be able to clearly see the whole point of the backwards movement. So let's go ahead and jump into the calculations.
00:05:13
Alright, time to get set up. We're going to show how we can calculate the derivative of the loss function with respect to the activation output for any node that's not in the output layer. We're going to work with a single activation output to illustrate this. Particularly, we'll be working with the activation output for node 2 in layer L-1, and that's denoted as this term, and the partial derivative of the loss with respect to this activation output is denoted as this. Now, as we discussed a few moments ago, observe that for each node j in the output layer L, the loss depends on the activation output from each of these nodes.
00:05:51
output from each of these nodes
00:05:54
okay now the activation output for each
00:05:57
of these nodes depends on the input to
00:05:59
each of these nodes and in turn the
00:06:03
input to each of these nodes depends on
00:06:05
the weights connected to each of these
00:06:07
nodes from the previous layer L minus 1
00:06:10
as well as the activation outputs from
00:06:12
the previous layer
00:06:14
so given this we can see how the input
00:06:17
to each node in the output layer is
00:06:19
dependent on the activation output that
00:06:21
we've chosen to work with the activation
00:06:24
output for node 2 in layer L minus 1 so
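(Concretely, and ignoring any bias terms for simplicity, the input to node j in the output layer is the weighted sum

    z_j^{(L)} = \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)}

taken over the nodes k in layer L-1, so every z_j^{(L)} contains the term w_{j2}^{(L)} a_2^{(L-1)}.)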
00:06:28
So again, using similar logic to what we used in our previous video, we can see from these dependencies that the loss function is actually a composition of functions, and so to calculate the derivative of the loss with respect to the activation output we're working with, we'll need to use the chain rule, which tells us that this derivative is equal to the product of the derivatives of the composed function, and we're expressing that here. So this says that the derivative of the loss with respect to the activation output for node 2 in layer L-1 is equal to this: the sum, for each node j in the output layer L, of the derivative of the loss with respect to the activation output for node j, times the derivative of the activation output for node j with respect to the input for node j, times the derivative of the input for node j with respect to the activation output for node 2 in layer L-1. Now let's scroll a little bit.
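(Written out in the series' notation, that chain rule expansion is:

    \frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j} \left( \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}} \right)

with the sum running over the nodes j in the output layer L.)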
00:07:28
Actually, this equation looks almost identical to the equation we obtained in the last video for the derivative of the loss with respect to a given weight. Recall that this previous derivative with respect to a given weight we worked with was expressed as this. So just eyeballing the general likeness between these two equations, we see that the only differences are: one, the presence of this summation operation in our new equation, and two, the last term on the right-hand side differs.
00:08:00
The reason for the summation here is due to the fact that a change in one activation output in the previous layer is going to affect the input for each node j in the following layer L, so we need to sum up these effects. Now, we can see that the first and second terms on the right-hand side of the equation are the same as the first and second terms in the last equation, with regards to weight 1,2 in the output layer, when j equals 1. So since we've already gone through the work to find how to calculate these two derivatives in the last video, we won't do it again here. We're only going to focus on breaking down this third term, and then we'll combine all terms to see the final result.
00:08:41
Alright, so let's jump into how to calculate the third term from the equation we just looked at. This third term is the derivative of the input to any node j in the output layer L with respect to the activation output for node 2 in layer L-1. We know, for each node j in layer L, that the input is equal to the weighted sum of the activation outputs from the previous layer L-1, so then we can substitute this sum in for z_j in our derivative here. Now let's expand this sum. Then, due to the linearity of the summation operation, we can pull the derivative operator through to each term, since the derivative of a sum is equal to the sum of the derivatives.
00:09:28
So we're taking the derivatives of each of these terms with respect to a_2, but actually, we can see that only one of these terms contains a_2. So then, when we take the derivative of any of these other terms that don't contain a_2, they'll just all evaluate to zero. Now, taking the derivative of this one single term that does contain a_2, we apply the power rule to get the result. So this result says that the input for any node j in the output layer L will respond to a change in the activation output for node 2 in layer L-1 by an amount equal to the weight connecting node 2 in layer L-1 to node j in layer L.
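(That calculation, in symbols:

    \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}} = \frac{\partial}{\partial a_2^{(L-1)}} \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)} = w_{j2}^{(L)}

since every term in the sum with k \neq 2 differentiates to zero.)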
00:10:12
Alright, let's take this result and combine it with our other terms to see what we get as the total result for the derivative of the loss with respect to this activation output.
00:10:23
Alright, so we have our original equation here for the derivative of the loss with respect to the activation output that we've chosen to work with, and from our previous video, we already know what these first two terms evaluate to, so I've gone ahead and put those results in here. And we just saw what the result for this third term was, so we have that result here.
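(Putting those pieces together, assuming the squared-error loss written earlier and writing the activation function as g, so that the second term is g'(z_j^{(L)}), the full result is:

    \frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j} \left( 2\left(a_j^{(L)} - y_j\right) \cdot g'\!\left(z_j^{(L)}\right) \cdot w_{j2}^{(L)} \right) )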
00:10:47
Okay, so we've got this full result. Now, what was it that we wanted to do with it again? Oh yeah, now we can use this result to calculate the gradient of the loss with respect to any weight connected to node 2 in layer L-1, like the one we showed at the start of this video, weight 2,2 for example, with the following equation.
00:11:07
The result we just obtained for the derivative of the loss with respect to the activation output for node 2 in layer L-1 can then be substituted for the first term in this equation, and then, as mentioned earlier, the second and third terms are calculated using the exact same approach we took for those terms in the previous video.
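(Carrying out that substitution gives, again as a sketch in the same notation:

    \frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \left[ \sum_{j} 2\left(a_j^{(L)} - y_j\right) g'\!\left(z_j^{(L)}\right) w_{j2}^{(L)} \right] \cdot g'\!\left(z_2^{(L-1)}\right) \cdot a_2^{(L-2)}

where the bracketed sum is the piece propagated backwards from the output layer, and the last two factors are the second and third terms calculated just as in the previous video.)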
00:11:28
So notice we've used the chain rule twice now, with one of those times being nested inside the other. We first used the chain rule to obtain the result for this entire derivative of the loss with respect to this one weight, and then we used it again to calculate the first term within this derivative, which itself was the derivative of the loss with respect to the activation output. The results from each of these derivatives using the chain rule depended on derivatives with respect to components that reside later in the network. So, for the weight we worked with in the last video, for example, to calculate the gradient of the loss with respect to it, we needed derivatives that depended on the activation output and the input for this node. Then, to calculate the gradient of the loss with respect to the weight we just worked with in this video, we needed derivatives that depended on this input and this activation output, and as we saw, the derivative that depended on this activation output needed the derivatives that depended on all of the activation outputs and all of the inputs for these nodes. So essentially, we're needing to calculate derivatives that depend on components later in the network first, and then use these derivatives in our calculations for the gradient of the loss with respect to weights that come earlier in the network. We achieve this by repeatedly applying the chain rule in a backwards fashion.
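To make that backwards flow concrete, here is a minimal NumPy sketch (not the code from this series) of the idea, assuming a fully connected network with sigmoid activations and the squared-error loss used above. The output-layer quantities are computed first, and dC_da is then pushed one layer back at a time, which is exactly the summation over j derived earlier:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(weights, x, y):
        """weights[l] has shape (nodes in layer l+1, nodes in layer l)."""
        # Forward pass: store every layer's input z and activation output a.
        a, activations, zs = x, [x], []
        for W in weights:
            z = W @ a
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)

        grads = [None] * len(weights)
        # The loss is a direct function only of the output layer's activations.
        dC_da = 2.0 * (activations[-1] - y)
        for l in reversed(range(len(weights))):
            dC_dz = dC_da * sigmoid_prime(zs[l])        # chain through the activation function
            grads[l] = np.outer(dC_dz, activations[l])  # dC/dw for every weight in this layer
            dC_da = weights[l].T @ dC_dz                # sum over j of w_j2 * dC/dz_j, one layer back
        return grads

    # Tiny example: 2 inputs -> 3 hidden nodes -> 2 outputs
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
    grads = backprop(weights, x=np.array([0.5, -0.2]), y=np.array([1.0, 0.0]))
    print([g.shape for g in grads])

The line that sets dC_da from weights[l].T is the "backwards movement": the derivative with respect to each earlier activation output is built from derivatives already computed for the layer after it.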
00:13:05
Whoo! Alright, now we know what puts the "back" in backprop. After watching this video, along with the earlier videos on backprop that precede this one, you should now have a full understanding of what backprop is all about. If you made it through all of these and you think you have a grip on this stuff, then cheers, mate, I'm glad you stuck around till the end. Now, we've gone over a lot of math in this video, as well as the last several, so if you have any questions, let's have a discussion in the comments. Also, I'd love to hear what you think in general about this series of videos on backprop. Was it helpful in developing your understanding? How was it following the math? I'd really like to know what you're thinking. Thanks for watching. See you next time.

[Music]

Description:

Let's see the math that explains how backpropagation works backwards through a neural network. We've seen how to calculate the gradient of the loss function using backpropagation in the previous video. We haven't yet seen though where the backwards movement comes into play that we talked about when we discussed the intuition for backprop. So now, we're going to build on the knowledge that we've already developed to understand what exactly puts the back in backpropagation. The explanation we'll give for this will be math-based, so we're first going to start out by exploring the motivation needed for us to understand the calculations we'll be working through. We'll then jump right into the calculations, which, we'll see, are actually quite similar to ones we've worked through in the previous video. After we've got the math down, we'll then bring everything together to achieve the mind-blowing realization for how these calculations are mathematically done in a backwards fashion.

🕒🦎 VIDEO SECTIONS 🦎🕒
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:43 Agenda
01:13 Calculations - Derivative of the loss with respect to activation outputs
13:06 Summary
13:40 Collective Intelligence and the DEEPLIZARD HIVEMIND

💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
👋 Hey, we're Chris and Mandy, the creators of deeplizard!
👉 Check out the website for more learning material:
🔗 https://deeplizard.com/
💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
🔗 https://deeplizard.com/resources
🧠 Support collective intelligence, join the deeplizard hivemind:
🔗 https://deeplizard.com/hivemind
🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
🔗 https://www.qualialife.com/shop?rfsn=6488344.d171c6
👀 CHECK OUT OUR VLOG:
🔗 https://www.youtube.com/deeplizardvlog

❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li

🚀 Boost collective intelligence by sharing this video on social media!

👀 Follow deeplizard:
Our vlog: https://www.youtube.com/deeplizardvlog
Facebook: https://www.facebook.com/unsupportedbrowser
Instagram: https://www.facebook.com/unsupportedbrowser
Twitter: https://twitter.com/deeplizard
Patreon: https://www.patreon.com/deeplizard
YouTube: https://www.youtube.com/deeplizard

🎓 Deep Learning with deeplizard:
Deep Learning Dictionary - https://deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - https://deeplizard.com/course/dlcpailzrd
Learn TensorFlow - https://deeplizard.com/course/tfcpailzrd
Learn PyTorch - https://deeplizard.com/course/ptcpailzrd
Natural Language Processing - https://deeplizard.com/course/txtcpailzrd
Reinforcement Learning - https://deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - https://deeplizard.com/course/gacpailzrd

🎓 Other Courses:
DL Fundamentals Classic - https://deeplizard.com/learn/video/gZmobeGL0Yg
Deep Learning Deployment - https://deeplizard.com/learn/video/SI1hVGvbbZ4
Data Science - https://deeplizard.com/learn/video/d11chG7Z-xk
Trading - https://deeplizard.com/learn/video/ZpfCK_uHL9Y

🛒 Check out products deeplizard recommends on Amazon:
🔗 https://www.amazon.com/shop/deeplizard

🎵 deeplizard uses music by Kevin MacLeod
🔗 https://www.youtube.com/channel/UCSZXFhRIx6b0dFX3xS8L1yQ

❤️ Please use the knowledge gained from deeplizard content for good, not evil.
