
Download "Backpropagation explained | Part 5 - What puts the "back" in backprop?"

Video tags

Keras
deep learning
machine learning
artificial neural network
neural network
neural net
AI
artificial intelligence
Theano
Tensorflow
tutorial
Python
supervised learning
unsupervised learning
Sequential model
transfer learning
image classification
convolutional neural network
CNN
categorical crossentropy
relu
activation function
stochastic gradient descent
educational
education
fine-tune
data augmentation
autoencoders
clustering
batch normalization
Subtitles

  • Russian
Transcript

00:00:00
Hey, what's going on, everyone? In this video we'll see the math that explains how backpropagation works backwards through a neural network. So let's get to it.

[Music]
00:00:20
Alright, we've seen how to calculate the gradient of the loss function using backpropagation in the previous video. We haven't yet seen, though, where the backwards movement comes into play that we talked about when we discussed the intuition for backprop. So now we're going to build on the knowledge that we've already developed to understand what puts the "back" in backpropagation. The explanation we'll give for this will be math-based, so we're first going to start out by exploring the motivation needed for us to understand the calculations that we'll be working through. We'll then jump right into the calculations, which, we'll see, are actually quite similar to ones we've worked through already in the previous video. After we've got the math down, we'll then bring everything together to achieve the mind-blowing realization for how these calculations are mathematically done in a backwards fashion. Alright, let's begin.
00:01:16
We left off from our last video by seeing how we can calculate the gradient of the loss function with respect to any weight in the network. When we went through the process for showing how that was calculated, recall that we worked with this single weight in the output layer of the network, and then generalized the result we obtained by saying that this same process could be applied for all other weights in the network. So, for this particular weight, we saw that the derivative of the loss with respect to this weight was equal to this.
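(The transcript refers to an on-screen equation here. As a sketch of what it refers to, in the notation this series uses, with C_0 for the loss on a single sample, a for activation outputs, z for weighted inputs, and assuming the output-layer weight from the previous video was w_{12}^{(L)}, the weight connecting node 2 in layer L-1 to node 1 in layer L, that result reads:

    \frac{\partial C_0}{\partial w_{12}^{(L)}} = \frac{\partial C_0}{\partial a_1^{(L)}} \cdot \frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} \cdot \frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}} )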
00:01:46
Now, what would happen if we chose to work with a weight that's not in the output layer, like this weight here, for example? Well, using the formula we obtained for calculating the gradient of the loss, we see that the gradient of the loss with respect to this particular weight is equal to this. So check it out: it looks just like the equation we used for the previous weight we were working with. The only difference is that the superscripts are different, because now we're working with a weight in the third layer, which we are denoting as L-1, and the subscripts are different as well, because we're working with the weight that connects the second node in the second layer to the second node in the third layer.
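(Written out in the same notation, and assuming this weight is denoted w_{22}^{(L-1)}, connecting node 2 in layer L-2 to node 2 in layer L-1, the equation being referred to is:

    \frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \frac{\partial C_0}{\partial a_2^{(L-1)}} \cdot \frac{\partial a_2^{(L-1)}}{\partial z_2^{(L-1)}} \cdot \frac{\partial z_2^{(L-1)}}{\partial w_{22}^{(L-1)}} )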
00:02:28
So, given this is the same formula, then we should just be able to calculate it in the exact same way we did for the previous weight we worked with in the last video, right? Well, not so fast. So yes, this is the same formula, and in fact, the second and third terms here on the right-hand side will be calculated using this same exact approach as we used before. This first term, though, the derivative of the loss with respect to this one activation output, that's actually going to require a different approach for us to calculate it.
00:02:59
Let's think about why. When we calculated the derivative of the loss with respect to a weight in the output layer, we saw that this first term is the derivative of the loss with respect to the activation output for a node in the output layer. Well, as we've talked about before, the loss is a direct function of the activation output of all of the nodes in the output layer, because the loss is the sum of the squared errors between the actual labels of the data and the activation output of the nodes in the output layer.
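(In symbols, writing y_j for the label corresponding to output node j, the loss for a single sample is:

    C_0 = \sum_{j} \left( a_j^{(L)} - y_j \right)^2

with the sum taken over the nodes j in the output layer L.)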
00:03:29
Okay, so when we calculate the derivative of the loss with respect to a weight in layer L-1, for example, this first term is the derivative of the loss with respect to the activation output for node 2, not in the output layer L, but in layer L-1. And unlike the activation output for the nodes in the output layer, the loss is not a direct function of this output. See, if we look at where this activation output is within the network, and then look at where the loss is calculated at the end of the network, we can see that this output is not being passed directly to the loss. So we need to understand how to calculate this term, and that's going to be our focus for now.
00:04:13
So, if you need to, go ahead and pause the video here and go back and watch the previous video where we calculated the first term in this equation to see the approach we took. Then you can compare that to the approach we're going to take to calculate this first term in this equation now. Because the second and third terms on the right-hand side are calculated in the exact same manner as we've seen before, we're not going to cover those here. We're just going to focus on how to calculate this term, and then we'll combine the results from all the terms to see the final result.
00:04:45
Alright, at this point, go ahead and admit it: you're thinking to yourself, "I clicked on this video to see how backprop works backwards. What the heck does any of this so far have to do with the backwards movement of backpropagation?" I hear you. We are getting there, so stick with me. We have to go through this math first and see what it's doing, and then once we see that, we'll be able to clearly see the whole point of the backwards movement. So let's go ahead and jump into the calculations.
00:05:13
Alright, time to get set up. We're going to show how we can calculate the derivative of the loss function with respect to the activation output for any node that's not in the output layer. We're going to work with a single activation output to illustrate this. Particularly, we'll be working with the activation output for node 2 in layer L-1, and that's denoted as this term, and the partial derivative of the loss with respect to this activation output is denoted as this. Now, as we discussed a few moments ago, observe that for each node j in the output layer L, the loss depends on the activation output from each of these nodes.
00:05:51
output from each of these nodes
00:05:54
okay now the activation output for each
00:05:57
of these nodes depends on the input to
00:05:59
each of these nodes and in turn the
00:06:03
input to each of these nodes depends on
00:06:05
the weights connected to each of these
00:06:07
nodes from the previous layer L minus 1
00:06:10
as well as the activation outputs from
00:06:12
the previous layer
00:06:14
so given this we can see how the input
00:06:17
to each node in the output layer is
00:06:19
dependent on the activation output that
00:06:21
we've chosen to work with the activation
00:06:24
output for node 2 in layer L minus 1 so
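(Concretely, and ignoring any bias terms for simplicity, the input to node j in the output layer is the weighted sum

    z_j^{(L)} = \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)}

taken over the nodes k in layer L-1, so every z_j^{(L)} contains the term w_{j2}^{(L)} a_2^{(L-1)}.)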
00:06:28
So again, using similar logic to what we used in our previous video, we can see from these dependencies that the loss function is actually a composition of functions, and so to calculate the derivative of the loss with respect to the activation output we're working with, we'll need to use the chain rule, which tells us that this derivative is equal to the product of the derivatives of the composed function, and we're expressing that here. So this says that the derivative of the loss with respect to the activation output for node 2 in layer L-1 is equal to this: the sum, for each node j in the output layer L, of the derivative of the loss with respect to the activation output for node j, times the derivative of the activation output for node j with respect to the input for node j, times the derivative of the input for node j with respect to the activation output for node 2 in layer L-1. Now let's scroll a little bit.
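(Written out in the series' notation, that chain rule expansion is:

    \frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j} \left( \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}} \right)

with the sum running over the nodes j in the output layer L.)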
00:07:28
Actually, this equation looks almost identical to the equation we obtained in the last video for the derivative of the loss with respect to a given weight. Recall that this previous derivative with respect to a given weight we worked with was expressed as this. So just eyeballing the general likeness between these two equations, we see that the only differences are: one, the presence of this summation operation in our new equation, and two, the last term on the right-hand side differs.
00:08:00
The reason for the summation here is due to the fact that a change in one activation output in the previous layer is going to affect the input for each node j in the following layer L, so we need to sum up these effects. Now, we can see that the first and second terms on the right-hand side of the equation are the same as the first and second terms in the last equation, with regards to weight 1,2 in the output layer, when j equals 1. So since we've already gone through the work to find how to calculate these two derivatives in the last video, we won't do it again here. We're only going to focus on breaking down this third term, and then we'll combine all terms to see the final result.
00:08:41
Alright, so let's jump into how to calculate the third term from the equation we just looked at. This third term is the derivative of the input to any node j in the output layer L with respect to the activation output for node 2 in layer L-1. We know, for each node j in layer L, that the input is equal to the weighted sum of the activation outputs from the previous layer L-1, so then we can substitute this sum in for z_j in our derivative here. Now let's expand this sum. Then, due to the linearity of the summation operation, we can pull the derivative operator through to each term, since the derivative of a sum is equal to the sum of the derivatives.
00:09:28
So we're taking the derivatives of each of these terms with respect to a_2, but actually, we can see that only one of these terms contains a_2. So then, when we take the derivative of any of these other terms that don't contain a_2, they'll just all evaluate to zero. Now, taking the derivative of this one single term that does contain a_2, we apply the power rule to get the result. So this result says that the input for any node j in the output layer L will respond to a change in the activation output for node 2 in layer L-1 by an amount equal to the weight connecting node 2 in layer L-1 to node j in layer L.
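(That calculation, in symbols:

    \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}} = \frac{\partial}{\partial a_2^{(L-1)}} \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)} = w_{j2}^{(L)}

since every term in the sum with k \neq 2 differentiates to zero.)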
00:10:12
Alright, let's take this result and combine it with our other terms to see what we get as the total result for the derivative of the loss with respect to this activation output.
00:10:23
Alright, so we have our original equation here for the derivative of the loss with respect to the activation output that we've chosen to work with, and from our previous video, we already know what these first two terms evaluate to, so I've gone ahead and put those results in here. And we just saw what the result for this third term was, so we have that result here.
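(Putting those pieces together, assuming the squared-error loss written earlier and writing the activation function as g, so that the second term is g'(z_j^{(L)}), the full result is:

    \frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j} \left( 2\left(a_j^{(L)} - y_j\right) \cdot g'\!\left(z_j^{(L)}\right) \cdot w_{j2}^{(L)} \right) )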
00:10:47
Okay, so we've got this full result. Now, what was it that we wanted to do with it again? Oh yeah, now we can use this result to calculate the gradient of the loss with respect to any weight connected to node 2 in layer L-1, like the one we showed at the start of this video, weight 2,2 for example, with the following equation.
00:11:07
The result we just obtained for the derivative of the loss with respect to the activation output for node 2 in layer L-1 can then be substituted for the first term in this equation, and then, as mentioned earlier, the second and third terms are calculated using the exact same approach we took for those terms in the previous video.
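(Carrying out that substitution gives, again as a sketch in the same notation:

    \frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \left[ \sum_{j} 2\left(a_j^{(L)} - y_j\right) g'\!\left(z_j^{(L)}\right) w_{j2}^{(L)} \right] \cdot g'\!\left(z_2^{(L-1)}\right) \cdot a_2^{(L-2)}

where the bracketed sum is the piece propagated backwards from the output layer, and the last two factors are the second and third terms calculated just as in the previous video.)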
00:11:28
So notice we've used the chain rule twice now, with one of those times being nested inside the other. We first used the chain rule to obtain the result for this entire derivative of the loss with respect to this one weight, and then we used it again to calculate the first term within this derivative, which itself was the derivative of the loss with respect to the activation output. The results from each of these derivatives using the chain rule depended on derivatives with respect to components that reside later in the network. So, for the weight we worked with in the last video, for example, to calculate the gradient of the loss with respect to it, we needed derivatives that depended on the activation output and the input for this node. Then, to calculate the gradient of the loss with respect to the weight we just worked with in this video, we needed derivatives that depended on this input and this activation output, and as we saw, the derivative that depended on this activation output needed the derivatives that depended on all of the activation outputs and all of the inputs for these nodes. So essentially, we're needing to calculate derivatives that depend on components later in the network first, and then use these derivatives in our calculations for the gradient of the loss with respect to weights that come earlier in the network. We achieve this by repeatedly applying the chain rule in a backwards fashion.
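To make that backwards flow concrete, here is a minimal NumPy sketch (not the code from this series) of the idea, assuming a fully connected network with sigmoid activations and the squared-error loss used above. The output-layer quantities are computed first, and dC_da is then pushed one layer back at a time, which is exactly the summation over j derived earlier:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(weights, x, y):
        """weights[l] has shape (nodes in layer l+1, nodes in layer l)."""
        # Forward pass: store every layer's input z and activation output a.
        a, activations, zs = x, [x], []
        for W in weights:
            z = W @ a
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)

        grads = [None] * len(weights)
        # The loss is a direct function only of the output layer's activations.
        dC_da = 2.0 * (activations[-1] - y)
        for l in reversed(range(len(weights))):
            dC_dz = dC_da * sigmoid_prime(zs[l])        # chain through the activation function
            grads[l] = np.outer(dC_dz, activations[l])  # dC/dw for every weight in this layer
            dC_da = weights[l].T @ dC_dz                # sum over j of w_j2 * dC/dz_j, one layer back
        return grads

    # Tiny example: 2 inputs -> 3 hidden nodes -> 2 outputs
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
    grads = backprop(weights, x=np.array([0.5, -0.2]), y=np.array([1.0, 0.0]))
    print([g.shape for g in grads])

The line that sets dC_da from weights[l].T is the "backwards movement": the derivative with respect to each earlier activation output is built from derivatives already computed for the layer after it.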
00:13:05
Whoo! Alright, now we know what puts the "back" in backprop. After watching this video, along with the earlier videos on backprop that precede this one, you should now have a full understanding of what backprop is all about. If you made it through all of these and you think you have a grip on this stuff, then cheers, mate, I'm glad you stuck around till the end. Now, we've gone over a lot of math in this video, as well as the last several, so if you have any questions, let's have a discussion in the comments. Also, I'd love to hear what you think in general about this series of videos on backprop. Was it helpful in developing your understanding? How was it following the math? I'd really like to know what you're thinking. Thanks for watching. See you next time.

[Music]

Description:

Let's see the math that explains how backpropagation works backwards through a neural network. We've seen how to calculate the gradient of the loss function using backpropagation in the previous video. We haven't yet seen though where the backwards movement comes into play that we talked about when we discussed the intuition for backprop. So now, we're going to build on the knowledge that we've already developed to understand what exactly puts the back in backpropagation. The explanation we'll give for this will be math-based, so we're first going to start out by exploring the motivation needed for us to understand the calculations we'll be working through. We'll then jump right into the calculations, which, we'll see, are actually quite similar to ones we've worked through in the previous video. After we've got the math down, we'll then bring everything together to achieve the mind-blowing realization for how these calculations are mathematically done in a backwards fashion.

🕒🦎 VIDEO SECTIONS 🦎🕒
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:43 Agenda
01:13 Calculations - Derivative of the loss with respect to activation outputs
13:06 Summary
13:40 Collective Intelligence and the DEEPLIZARD HIVEMIND

💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
👋 Hey, we're Chris and Mandy, the creators of deeplizard!
👉 Check out the website for more learning material:
🔗 https://deeplizard.com/
💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
🔗 https://deeplizard.com/resources
🧠 Support collective intelligence, join the deeplizard hivemind:
🔗 https://deeplizard.com/hivemind
🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
🔗 https://www.qualialife.com/shop?rfsn=6488344.d171c6
👀 CHECK OUT OUR VLOG:
🔗 https://www.youtube.com/deeplizardvlog

❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li

🚀 Boost collective intelligence by sharing this video on social media!

👀 Follow deeplizard:
Our vlog: https://www.youtube.com/deeplizardvlog
Facebook: https://www.facebook.com/unsupportedbrowser
Instagram: https://www.facebook.com/unsupportedbrowser
Twitter: https://twitter.com/deeplizard
Patreon: https://www.patreon.com/deeplizard
YouTube: https://www.youtube.com/deeplizard

🎓 Deep Learning with deeplizard:
Deep Learning Dictionary - https://deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - https://deeplizard.com/course/dlcpailzrd
Learn TensorFlow - https://deeplizard.com/course/tfcpailzrd
Learn PyTorch - https://deeplizard.com/course/ptcpailzrd
Natural Language Processing - https://deeplizard.com/course/txtcpailzrd
Reinforcement Learning - https://deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - https://deeplizard.com/course/gacpailzrd

🎓 Other Courses:
DL Fundamentals Classic - https://deeplizard.com/learn/video/gZmobeGL0Yg
Deep Learning Deployment - https://deeplizard.com/learn/video/SI1hVGvbbZ4
Data Science - https://deeplizard.com/learn/video/d11chG7Z-xk
Trading - https://deeplizard.com/learn/video/ZpfCK_uHL9Y

🛒 Check out products deeplizard recommends on Amazon:
🔗 https://www.amazon.com/shop/deeplizard

🎵 deeplizard uses music by Kevin MacLeod
🔗 https://www.youtube.com/channel/UCSZXFhRIx6b0dFX3xS8L1yQ

❤️ Please use the knowledge gained from deeplizard content for good, not evil.
