"videoThumbnail DeepMind x UCL | Deep Learning Lectures | 2/12 |  Neural Networks Foundations
Table of contents
0:00
Intro
0:09
Biological Intuition
0:16
The Big Picture
0:17
Single Layer Networks
0:17
Sigmoid
0:23
Softmax
0:26
Linear Models
0:28
Solution
0:28
Puzzle View
0:28
Potential Solution
0:32
Playgrounds
0:34
Universal Approximator
0:37
Intuition
0:40
Going deeper
0:40
Models
0:41
Rectified Linear Unit
0:44
Intuition Behind Deep Learning
0:45
Mathematical Properties of Intuition
0:48
Computational Graphs
0:52
Linear Algebra 101
0:53
Gradient Descent 101
0:55
Gradient Descent API
0:56
Free Layers
Video tags
Artificial Intelligence
AI
Deep Learning
Lecture
DeepMind
UCL
Machine Learning
Neural Networks
Subtitles
00:00:06
all right thank you for this lovely
00:00:08
introduction and
00:00:12
I mentioned today we'll go
00:00:14
the foundations of neural networks the
00:00:18
talk itself will last around 90 minutes
00:00:20
so I would ask you to wait with any
00:00:24
questions till the end of the lecture
00:00:26
we'll have a separate slot to address
00:00:29
this and I'll also hang around for a
00:00:32
while after the lecture if you would
00:00:33
prefer to ask some in person lecture
00:00:37
itself will be structured as follows
00:00:38
there'll be six sections I will start
00:00:41
with a basic overview trying to convince
00:00:43
you that there is a point in learning
00:00:45
about neural nets what is the actual
00:00:48
motivation to be studying this specific
00:00:51
branch of research in the second part
00:00:54
which is the main meat of the story
00:00:56
we'll go through both history of neural
00:00:59
nets their properties the way they are
00:01:00
defined and their inner workings mostly
00:01:03
in order to gather good and deep
00:01:06
intuition about them so that you are
00:01:09
both prepared for future more technical
00:01:11
more in-depth lectures in this series as
00:01:14
well as ready to simply work with these
00:01:16
models in your own research or work
00:01:19
equipped with these we'll be able to
00:01:22
dive into learning itself since even the
00:01:24
best model is useless without actually
00:01:27
knowing how to set all the knobs and
00:01:29
weights inside them after that we'll
00:01:32
fill in a few gaps in terms of these
00:01:34
pieces of the puzzles of the puzzle that
00:01:37
will be building as the talk progresses
00:01:40
and we'll finish with some practical
00:01:42
issues or practical guidelines to
00:01:46
actually deal with most common problems
00:01:48
in training neural nets if time permits
00:01:51
we'll also have a bonus slide or two on
00:01:54
what I like to call multiplicative
00:01:56
interactions so this is what will be in
00:02:02
the lecture there quite a few things
00:02:03
that could be included in a lecture
00:02:06
called foundations of neural nets
00:02:09
that's not going to be part of this talk
00:02:11
and I'd like to split these into three
00:02:14
branches first is what I refer to as
00:02:17
old-school neural nets not to suggest
00:02:20
that the neural nets that we are working
00:02:21
with these days are not old or they
00:02:24
don't go back like 70 years but rather
00:02:26
to
00:02:27
make sure that you see that this is a
00:02:29
really wide field with quite powerful
00:02:32
methods that were once really common and
00:02:34
important for the field that are not
00:02:37
that popular anymore but it's still
00:02:39
possible that they will come back and
00:02:41
it's valuable to learn about things like
00:02:43
restricted Boltzmann machines deep
00:02:45
belief networks hopfield net kohonen
00:02:46
maps so I'm just gonna leave these a
00:02:48
sort of keywords for further reading if
00:02:51
you want to really dive deep the second
00:02:55
part is biologically plausible
00:02:57
neural nets where the goal really is to
00:03:00
replicate the inner workings of human
00:03:02
brain so you have physical simulators or
00:03:04
spiking neural nets and these two
00:03:06
branches encoded in red here will not
00:03:10
share that much common ground with the
00:03:14
talk right they are still in neural
00:03:16
network land but they don't necessarily
00:03:17
follow the same the same design
00:03:19
principles on the other hand the third
00:03:21
branch called other because of lack of
00:03:25
better name do share a lot of
00:03:28
similarities and even though we won't
00:03:29
explicitly talk about capsule networks
00:03:31
graph networks or neural differential
00:03:34
equations what you'll learn today the
00:03:36
high level ideas motivations and overall
00:03:39
scheme directly applies to all of these
00:03:41
they simply are somewhat beyond the
00:03:45
scope of the series and the ones in
00:03:47
green like convolutional neural networks
00:03:49
recurrent neural networks are simply not
00:03:51
part of this lecture but will come in
00:03:53
weeks to come for example in Sander's talk
00:03:55
and others yep so why are we learning
00:04:00
about neural Nets quite a few examples
00:04:03
were already given a week ago I just
00:04:05
want to stress a few the first one being
00:04:08
computer vision in general most of the
00:04:10
modern solutions applications in
00:04:14
computer vision do use some form of
00:04:16
neural network based processing these
00:04:19
are not just hypothetical objects you
00:04:22
know things that are great for
00:04:23
mathematical analysis or for research
00:04:24
purposes
00:04:25
there are many actual commercial
00:04:27
applications products that use neural
00:04:30
networks on daily basis pretty much in
00:04:32
every smartphone you can find at least
00:04:34
one neural net these days the second one
00:04:37
is natural language processing
00:04:40
text synthesis
00:04:41
think of great recent results from OpenAI
00:04:43
and their GPT-2 model as well as
00:04:46
commercial results with building wavenet
00:04:49
based text generation into a Google
00:04:52
assistant if you own one finally
00:04:55
control that doesn't just allow us to
00:04:59
create AI for things like Go chess
00:05:03
Starcraft for games or simulations in
00:05:05
general but it's actually being used in
00:05:08
products like self-driving cars so what
00:05:11
made all this possible what started the
00:05:15
deep learning revolution what are the
00:05:16
fundamentals that neural networks
00:05:19
really benefited from
00:05:21
the first one is compute and I want to
00:05:25
make sure that you understand that there
00:05:27
are two sides to this story it's not
00:05:30
just that computers got faster they were
00:05:32
always getting faster what specifically
00:05:35
happened in recent years is that
00:05:38
specific kind of hardware compute namely
00:05:40
GPUs graphical processing units that
00:05:42
were designed for games got really
00:05:45
useful for machine learning right so
00:05:47
this is the first thing on one hand we
00:05:49
got hardware that's just much faster
00:05:51
but it's not generally faster it won't
00:05:53
make your sequential programs run faster
00:05:56
it is faster with respect to very
00:05:58
specific operations and neural networks
00:06:01
happen to use exactly these operations
00:06:03
we will reinforce this point in further
00:06:06
part of the lecture but think about
00:06:08
matrix multiplications as this core
00:06:10
element other machine learning
00:06:12
techniques that don't rely on matrix
00:06:14
multiplications would not benefit that
00:06:16
much from this exponential growth in
00:06:18
compute that came from GPUs and these
00:06:20
days from TPUs the second is data and the
00:06:23
same argument applies you have various
00:06:26
methods of learning some of which scale
00:06:29
very well with data some scale badly
00:06:32
your computational complexity goes through
00:06:33
the roof you don't benefit from pushing
00:06:35
more and more data so we have again two
00:06:38
phases to this story a there's much more
00:06:40
data available just because of the Internet
00:06:43
the Internet of Things and various other
00:06:44
things and on the other end you have
00:06:47
models that really are data hungry and
00:06:50
actually improve as amount of data
00:06:52
increases and finally
00:06:55
and finally the modularity of the system
00:06:59
itself the fact that deep learning is
00:07:01
not that well defined field of study
00:07:04
it's more of a mental picture of the
00:07:07
high-level idea of these modular blocks
00:07:10
that can be arranged in various ways and
00:07:13
I want to try to sell to you intuition
00:07:16
of viewing deep learning as this sort of
00:07:18
puzzle where all we are doing as
00:07:20
researchers is building these small
00:07:23
blocks that can be interconnected in
00:07:26
various ways so that they jointly
00:07:29
process data to use quotes from recent
00:07:35
Turing award winner Professor Yann
00:07:37
LeCun deep learning is constructing
00:07:39
networks of parameterized functional
00:07:41
modules and training them from examples
00:07:43
using gradient based optimization
00:07:46
there's this core idea that we're
00:07:48
working with is an extremely modular
00:07:51
system we are not just defining one
00:07:53
model that we are trying to you know
00:07:55
apply to various domains we're defining a
00:07:57
language to build them that relies on
00:08:00
very simple basic principles and these
00:08:04
basic principles in a single such a node
00:08:06
or piece of a puzzle is really these two
00:08:09
properties each of them needs to know
00:08:12
given an input given data what to
00:08:15
output there are simple computational
00:08:17
units that compute one thing they take
00:08:19
an average they multiply things they
00:08:22
exponentiate them things like this and they
00:08:25
have also the other mode of operations
00:08:28
if they knew how the output of their
00:08:32
computation should change they should be
00:08:34
able to tell their input how to change
00:08:37
accordingly right so if I tell this node
00:08:39
your output should be higher it should
00:08:41
know how to change inputs if you are a
00:08:44
more mathematically oriented you'll
00:08:46
quickly find the analogy to
00:08:48
differentiation right then this is
00:08:50
pretty much the underlying mathematical
00:08:52
assumption for this
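A minimal Python sketch of the two modes of operation described above, assuming only that a node is some differentiable function; the class and values are illustrative, not from the lecture:

```python
import numpy as np

class ScaleNode:
    """Toy puzzle piece: computes y = c * x and can pass change signals back."""
    def __init__(self, c):
        self.c = c

    def forward(self, x):
        # mode 1: given an input, produce an output
        return self.c * x

    def backward(self, grad_output):
        # mode 2: told how the output should change, report how the
        # input should change (here just the chain rule for y = c * x)
        return self.c * grad_output

node = ScaleNode(3.0)
print(node.forward(np.array([1.0, 2.0])))    # [3. 6.]
print(node.backward(np.array([1.0, 1.0])))   # [3. 3.]
```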
00:08:54
so we will usually work with differentiable
00:08:57
objects and this is also how Professor Yann
00:08:59
LeCun qualifies it this is not necessarily
00:09:01
a strict requirement that you will see
00:09:03
through this lecture that in practice
00:09:05
people will put things that are kind of
00:09:07
differentiable and
00:09:09
as a practitioner you know that people just
00:09:10
put everything into deep nets not
00:09:13
necessarily caring about the full
00:09:14
mathematical reasoning behind it
00:09:17
so given this really high-level view let
00:09:21
us go for some fundamentals and as usual
00:09:25
let's start with some biological
00:09:26
intuition so in every neural network
00:09:30
lecture you need to see a neuron so I
00:09:32
drew one for you that's I know over
00:09:35
simplified so if you have biological
00:09:37
background forgive me for being very
00:09:39
naive but I wanted to capture just very
00:09:41
basic properties and so people have been
00:09:44
studying in neurobiology how real
00:09:46
neurons look like and one really
00:09:48
high-level view is that these are just
00:09:49
small cells that have multiple
00:09:52
dendrites which are inputs from other
00:09:55
neurons through which they accumulate
00:09:58
their spikes their activity there is
00:10:01
some simple computation in the soma in
00:10:05
the cell body and then there is a single
00:10:07
axon where the output is being produced
00:10:11
human brain is composed of billions of
00:10:14
these things connected to many many
00:10:15
others and you can see that this kind of
00:10:18
looks like a complex distributed
00:10:20
computation system right we have these
00:10:23
neurons connected to many others each of
00:10:25
them represents a very simple
00:10:27
computation on its own and what people
00:10:30
notice is that some of these connections
00:10:33
inhibit your activity so if the other
00:10:35
neuron is active you aren't some excite
00:10:37
so if they are active you are excited as
00:10:40
well and of course that many other
00:10:41
properties like there's a state right
00:10:43
these cells live through time the output
00:10:46
spikes through time each time you'll see a
00:10:48
slide like this with yellow box this is
00:10:51
a reference to further reading I won't
00:10:54
go into too much depth on various topics
00:10:58
but if you want to read more these are
00:11:01
nice references for example the
00:11:03
Hodgkin-Huxley model is a nice read if
00:11:07
you are somewhere between neurobiology
00:11:09
and mathematics so this is an intuition
00:11:12
and just intuition what people did with
00:11:15
that and by people I mean McCulloch and
00:11:18
Pitts is to look at this
00:11:21
and ask themselves what seems to be the
00:11:24
main set of neurophysiological
00:11:26
observations that we need to replicate
00:11:28
it's important to stress that this is
00:11:30
not a model the model that they proposed
00:11:32
that was trying to replicate all of the
00:11:34
dynamics this is not you know an
00:11:37
artificial simulation of a neuron is
00:11:39
just something vaguely inspired by some
00:11:42
properties of real neurons and these
00:11:45
properties are selected in green there
00:11:48
are things that will be easy to compose
00:11:50
because you will see the point of this in a
00:11:52
second you have blue inputs that are multiplied
00:11:56
by some weights W there are just real
00:12:02
numbers attached to each input and then
00:12:04
they are summed so it's like a weighted
00:12:07
sum of your inputs and we also have a
00:12:09
parameter weight B also referred to as a
00:12:13
bias which added gives the output of
00:12:17
our neuron you can see that this is
00:12:18
something you could easily compose there
00:12:20
are real numbers as inputs real numbers
00:12:22
as outputs it represents simple
00:12:24
computation
00:12:25
well it's literally a weighted average you
00:12:27
can't get much more basic than that maybe
00:12:29
slightly more basic it also has this
00:12:31
property of inhibiting or exciting if W
00:12:34
is negative you inhibit if it's positive
00:12:36
you excite but they left out quite a few
00:12:40
properties for example this is a
00:12:42
stateless model right you can compute it
00:12:44
many many times and the output is
00:12:45
exactly the same if you were to take a
00:12:47
real neuron and put the same action
00:12:50
potentials in it's spiking might change
00:12:52
through time because it is a living
00:12:54
thing that has a state physical state
00:12:56
and also outputs real values rather than
00:12:58
spikes through time just because the
00:13:00
time dimension got completely removed
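As a rough numpy sketch (illustrative numbers, not from the slides), the McCulloch-Pitts style unit just described is simply a weighted sum of real-valued inputs plus a bias:

```python
import numpy as np

def neuron(x, w, b):
    """Single artificial neuron: dot(w, x) + b, stateless and real-valued."""
    return np.dot(w, x) + b

x = np.array([0.5, -1.0, 2.0])   # inputs arriving from other neurons
w = np.array([1.0, -2.0, 0.5])   # negative weights inhibit, positive excite
b = 0.1                          # bias
print(neuron(x, w, b))           # 3.6
```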
00:13:02
and also to set some notation each time
00:13:05
something is blue in equations this
00:13:07
means this is an input if something is
00:13:08
red this is a parameter weight something
00:13:10
that you would usually train in your in
00:13:14
your model and the same applies to the
00:13:15
schemes themselves so what it means
00:13:22
intuitively what the weighted average
00:13:24
does at least my personal favourite
00:13:27
intuition is that it defines a
00:13:32
linear or affine projection
00:13:34
of your data so you can imagine that
00:13:36
this horizontal line is one such neuron
00:13:38
we've W that is just zero one and B
00:13:43
equals zero and then what will happen is
00:13:45
all the data that you have so your axis
00:13:46
would be perpendicularly projected onto
00:13:51
this line and you'd get this mess
00:13:53
everything would be on top of each other
00:13:55
if you had a different W say the sorry
00:13:58
vertical line before it was horizontal
00:14:00
the vertical line then there would be
00:14:02
nicely separated groups right because
00:14:04
you just collapsed them if it was a
00:14:06
diagonal line then things would be
00:14:08
slightly slightly separated as at the
00:14:12
bottom part of this so when we define
00:14:18
something like this we can start
00:14:20
composing them and the most natural or
00:14:23
the first mode of composition is to make
00:14:26
a layer out of these neurons so you can
00:14:29
see the idea is just to take each such
00:14:31
neuron put them next to each other and
00:14:33
what we gain from this mostly we gain a
00:14:37
lot of efficiency in terms of computing
00:14:39
because now the equation simplifies to a
00:14:42
simple affine transformation with W
00:14:44
being a matrix of the weights that are
00:14:47
in between our inputs and outputs X
00:14:51
being a vectorized input right so just
00:14:53
gather all the inputs for that as a
00:14:55
vector why is it important well
00:14:57
multiplication of two matrices in a
00:15:00
naive fashion is cubic but you probably
00:15:02
know from Algorithms 101 or 102
00:15:05
or whatever that you can go down like
00:15:07
2.7 by being slightly smart about how
00:15:10
you multiply things by basically using a
00:15:11
divide-and-conquer kind of methods and
00:15:13
furthermore this is something that fits
00:15:16
the GPU paradigm extremely well right so
00:15:19
this is one of these things that just
00:15:21
matches exactly what was already there
00:15:22
hardware wise and as such could benefit
00:15:25
from this huge boost in compute
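A minimal sketch of the point above, with made-up sizes: stacking units side by side turns the layer into one affine map y = Wx + b, exactly the kind of matrix multiplication GPUs are fast at:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 8            # illustrative sizes

W = rng.normal(size=(n_out, n_in))      # one row of weights per output unit
b = np.zeros(n_out)                     # one bias per output unit
X = rng.normal(size=(batch, n_in))      # a batch of vectorized inputs

Y = X @ W.T + b                         # every unit, every sample, one matmul
print(Y.shape)                          # (8, 3)
```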
00:15:28
there's also a lot of small caveats in the
00:15:30
neural network land in terms of
00:15:32
naming conventions so each object will
00:15:34
have from one to five names and I'm
00:15:37
deeply sorry for us as a community for
00:15:39
doing this the main reason is many of
00:15:42
these things were independently
00:15:43
developed by various groups of
00:15:45
researchers and
00:15:46
unification never happened some of these
00:15:49
names are more common than others for
00:15:50
example this is usually called linear
00:15:53
layer even though mathematician would
00:15:55
probably cry and say no it's fine it's
00:15:57
not linear there's a bias this doesn't
00:15:59
satisfy seniority constraints neurons
00:16:01
will be often called units so if I say
00:16:03
unit or neuron I just use these
00:16:06
interchangeably and parameters and
00:16:08
weights are also the same object so you
00:16:14
might ask isn't this just linear
00:16:15
regression like equation looks exactly
00:16:17
like statistics around the world as in
00:16:20
your regression model and to some extent
00:16:23
yes you're right it is exactly the same
00:16:25
predictive model but what's important is
00:16:27
to have in mind our big picture yes we
00:16:29
start small but our end goal is to
00:16:32
produce these highly composable
00:16:34
functions and if you are happy with
00:16:37
composing many linear regression models
00:16:38
on top of each other especially
00:16:39
multinomial regression models then you
00:16:42
can view it like this the language that
00:16:44
the neural network community prefers
00:16:47
is to think about these as neurons or
00:16:49
collections of neurons that talk to each
00:16:51
other because this is our end goal yes
00:16:53
this very beginning of our puzzle could
00:16:56
be something that's known in literature
00:16:58
under different names but what's really
00:17:00
important is that we view them as this
00:17:02
single composable pieces that can be
00:17:05
arranged in any way and much of research
00:17:07
is about composing them in a smart way
00:17:10
so that you get a new quality out of it
00:17:14
but let's view these simple models as
00:17:17
neural networks first so we'll start with
00:17:19
a single layer neural network just so we
00:17:22
can gradually see what is being brought
00:17:25
to the table with each extension with
00:17:27
each added module so what we defined
00:17:30
right now is what we're gonna define
00:17:33
right now can be expressed more or less
00:17:35
like this we have data it will go
00:17:37
through the linear module then there
00:17:39
will be some extra node that we are
00:17:40
going to define
00:17:41
then there is gonna
00:17:43
be a loss
00:17:44
which is also gonna be connected to a
00:17:47
target we are missing these two so let's
00:17:50
define what can be used there and let's
00:17:53
start with the first one which is often
00:17:55
called an activation function or a
00:17:57
non-linearity this is
00:18:00
an object that is usually used to induce
00:18:04
more complex models if you had many
00:18:07
linear models many affine models and you
00:18:09
compose them it's very easy to prove
00:18:11
composition of linear is linear
00:18:13
composition of affine things is affine
00:18:15
you would not really bring anything to
00:18:17
the table you need to add something that
00:18:19
bends the space in a more funky way so
00:18:23
one way of doing this or historically
00:18:24
one of the first ones is to use sigmoid
00:18:27
activation function which you can view
00:18:29
as a squashing of a real line to the
00:18:32
zero one interval we often will refer to
00:18:34
things like this as producing
00:18:36
probability estimates or probability
00:18:38
distributions and while there exists a
00:18:41
probabilistic interpretation of this
00:18:42
sort of model what this usually means in
00:18:44
ml community is that it simply outputs
00:18:46
things between 0 & 1 or that they sum to 1
00:18:49
okay so now let's not be too strict when
00:18:51
we say probability estimate here it
00:18:54
might mean something as simple as being
00:18:55
in the correct range the nice thing is
00:18:58
it also has very simple derivatives just
00:19:00
to refer to this differentiability that
00:19:02
we're talking about here
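A small sketch of the sigmoid and its derivative (standard definitions assumed, not code from the lecture), which also previews the saturation issue discussed next:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # simple closed-form derivative

for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid(z), sigmoid_grad(z))
# at -10 and +10 the derivative is ~0.000045: saturated, almost no signal
# at 0 the derivative is 0.25: the informative region
```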
00:19:04
but there are also caveats that make it slightly less
00:19:07
useful as you will see in the grand
00:19:09
scheme of things one is that because it
00:19:11
saturates right as you go to plus or
00:19:12
minus infinity it approaches 1 or 0
00:19:15
respectively this means that the partial
00:19:18
derivatives vanish right the gradient
00:19:20
far far to the right will be pretty much
00:19:23
zero because your function is flat so
00:19:25
the gradient is pretty much zero the
00:19:26
same applies in minus infinity so once
00:19:28
you are in this specific point if you
00:19:31
view gradient magnitude as amount of
00:19:33
information that you are getting to
00:19:35
adjust your model then functions like
00:19:38
this won't work that well once you
00:19:40
saturate you won't be taught how to
00:19:42
adjust your weights anymore but this is what
00:19:48
we are gonna use at least initially so
00:19:50
we plug in sigmoid on top of our linear
00:19:52
model and the only thing we are missing
00:19:54
is a loss and the most commonly used one
00:19:58
for the simplest possible task which is
00:20:00
going to be binary classification
00:20:01
meaning that our targets are either 0 or
00:20:05
1 something is either false or true
00:20:07
something is a face or not something is
00:20:10
a dog or not just this sort of products
00:20:13
then the most common loss function which
00:20:16
should be a two argument function that
00:20:19
returns a scalar so it accepts in this
00:20:22
notation P our prediction T our target
00:20:25
and it's supposed to output a single
00:20:27
scalar a real value such that smaller
00:20:30
loss means better model being closer to
00:20:34
the correct prediction and cross-entropy
00:20:37
which has at least three names being
00:20:39
negative log likelihood logistic loss
00:20:41
and probably many others
00:20:48
gives us the negation of the logarithm of the
00:20:51
probability of correct classification
00:20:53
which is exactly what you care about in
00:20:56
classification at least usually
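A hedged sketch (illustrative numbers) of the binary cross-entropy just described: minus the log of the probability assigned to the correct class, averaged over samples; the clip is a naive guard against the log(0) instability mentioned below:

```python
import numpy as np

def binary_cross_entropy(p, t, eps=1e-12):
    """p: predicted probability of class 1, t: target in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)       # crude protection against log(0)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

p = np.array([0.9, 0.2, 0.7])            # sigmoid outputs
t = np.array([1.0, 0.0, 1.0])            # targets
print(binary_cross_entropy(p, t))        # ~0.228
```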
00:20:58
it's also nicely composable with the sigmoid
00:21:01
function which we will go back to towards the
00:21:04
end of the lecture showing how this
00:21:06
specific composition removes two
00:21:08
numerical instabilities at once because
00:21:11
on its own unfortunately it is quite
00:21:13
numerically unstable so given these
00:21:18
three things we can compose them and
00:21:20
have the simplest possible neural
00:21:22
classifier we have data it goes through
00:21:24
linear model goes for sigmoid goes for
00:21:25
cross-entropy
00:21:26
attaches targets this is what you would
00:21:29
know from statistics as a logistic
00:21:31
regression and again the fact that we
00:21:33
we are defining a well-known model from
00:21:35
a different branch of science is fine
00:21:37
because we won't stop here this is just
00:21:40
to gain intuition what we can already
00:21:41
achieve what we can already achieve in
00:21:43
practice is we can separate data that's
00:21:46
labeled with well two possible labelings
00:21:49
true or false
00:21:50
zero and one as long as you can put a
00:21:53
line or a hyperplane in a hyper in a
00:21:56
higher dimension that completely
00:21:57
separates these two datasets so in the
00:22:00
example you see red and blue you see
00:22:03
that the more vertical line can separate
00:22:07
these datasets pretty perfectly and it
00:22:10
will have a very low loss very low cross
00:22:13
entropy loss the important property of
00:22:15
the specific loss and I would say 95% of
00:22:21
all the losses in machine learning is
00:22:23
that they are additive with respect to
00:22:26
samples so the loss that you can see at the
00:22:28
lower end decomposes additively over
00:22:33
samples so there is a small function l
00:22:35
that we just defined over each sample and
00:22:38
now t with i in the superscript is the
00:22:41
i-th target can be expressed as a sum of
00:22:44
these this specific property relates to
00:22:48
the data aspect of deep learning
00:22:50
revolution losses that have this form
00:22:55
undergo very specific decomposition and
00:22:58
can be trained with what is going to be
00:23:00
introduced a stochastic gradient descent
00:23:01
and can simply scale very well with big
00:23:05
datasets and unfortunately as we just
00:23:10
discussed this is still slightly
00:23:13
numerically unstable so what happens
00:23:16
when we have more than one sorry more
00:23:17
than two classes then we usually define
00:23:20
what's called the softmax which is as a
00:23:22
name suggests a smooth version of the
00:23:25
maximum operation you take an exponent
00:23:26
of your input and just normalize divide
00:23:30
by the sum of exponents you can see this
00:23:33
was sum to one everything is
00:23:35
non-negative because well exponents by
00:23:36
definition I know not negative so we
00:23:39
produce probability estimates in the
00:23:41
sense that the output lies on the
00:23:43
simplex and it can be seen as a strict
00:23:46
multi-dimensional generalization of the
00:23:48
sigmoid so it's not a different thing it's
00:23:50
just a strict generalization if you take
00:23:52
a single x and a 0 and compute the softmax
00:23:55
of it then the first argument of the
00:23:58
output will be sigmoid of x and the second
00:24:00
one minus sigmoid of x all right so it's simply
00:24:03
a way to go beyond two classes but have
00:24:06
very very similar mathematical
00:24:08
formulation and it's by far the most
00:24:11
commonly used final activation in
00:24:13
classification problems when number of
00:24:15
classes is bigger than two it still
00:24:19
has the same issues for obvious reasons
00:24:21
it's a generalization so it cannot remove
00:24:22
issues but the nice thing is now we can
00:24:25
just substitute the piece of the puzzle
00:24:28
that we defined before take away the
00:24:29
sigmoid
00:24:30
now just put softmax in its place and
00:24:32
exactly the same reasoning and mechanics
00:24:36
that would work before apply now right
00:24:39
so we use exactly the same loss function
00:24:41
apart from the fact that it's summing over
00:24:44
all the classes and now we can separate
00:24:46
still linearly of course more than two
00:24:50
colors say class zero one and two
00:24:54
which is equivalent to multinomial
00:24:57
logistic regression if you went for some
00:24:59
statistical courses and the combination
00:25:03
of the softmax and the cross entropy as
00:25:05
I mentioned before becomes numerically
00:25:08
stable because of this specific
00:25:12
decomposition and there will be also a
00:25:14
more in-depth version towards the end of
00:25:16
this lecture
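An illustrative sketch of softmax and of the joint softmax-plus-cross-entropy computation via the usual log-sum-exp shift; this is one standard way the numerical instabilities mentioned here get removed, not necessarily the exact derivation shown later in the lecture:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)        # shift for stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_cross_entropy(logits, target_index):
    # -log softmax(logits)[target] without forming tiny probabilities
    z = logits - np.max(logits)
    return np.log(np.sum(np.exp(z))) - z[target_index]

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits))                                # non-negative, sums to 1
print(softmax_cross_entropy(logits, target_index=0))  # ~0.241
```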
00:25:19
the only thing that it doesn't do very well is
00:25:21
that it doesn't scale that well with the number of classes all
00:25:23
right so one thing that you might want
00:25:24
is to be able to select one class
00:25:26
specifically just say one just say zero
00:25:29
and of course with equation like soft
00:25:31
max has you can't represent ones or
00:25:34
zeros you can get arbitrarily close but
00:25:36
never exactly one or zero and there are
00:25:38
nice other solutions to this like sparse
00:25:41
max module for example and also it
00:25:45
doesn't scale that well with K it will
00:25:47
work well if K number of classes is say
00:25:49
in hundreds if it's in hundreds of
00:25:52
thousands you might need to look for
00:25:54
some slightly different piece of the
00:25:56
puzzle the nice news is you can
00:25:58
literally just swap them and they will
00:26:00
start scaling up so why are we even
00:26:03
talking about these simple things so
00:26:04
apart from the fact that they become
00:26:06
pieces of the bigger puzzle it's also
00:26:08
because they just work and you might be
00:26:10
surprised that the linear models are
00:26:12
useful but they really are if you look
00:26:15
at this very well-known MNIST dataset
00:26:17
of just handwritten digits and try to
00:26:19
build a linear model that classifies
00:26:21
which digit it is based on pixels you
00:26:24
might get slightly surprising result of
00:26:26
somewhat around 92 percent of test
00:26:28
accuracy that's pretty good for
00:26:30
something that just takes you know
00:26:31
pixels and computes a weighted average
00:26:33
and that's all it does and in one of the
00:26:37
intuitions behind it is we usually keep
00:26:40
thinking about these models in like 1d
00:26:43
2d 3d and yes in 2d there are not that
00:26:45
many positions of objects that a line
00:26:48
can shatter in 3d not that many positions where a
00:26:50
hyperplane can separate in 100,000
00:26:53
dimensions hyperplanes of
00:26:58
corresponding size can actually shatter
00:27:01
a lot of possible labelings so as you
00:27:05
get higher dimensionality you can
00:27:07
actually deal with them pretty well even
00:27:10
within your models furthermore in
00:27:12
commercial applications a big chunk of
00:27:14
them actually use linear models in
00:27:16
natural language processing for many
00:27:17
years the most successful model was
00:27:19
nothing else but MaxEnt the maximum entropy
00:27:22
classifier which is a fourth name for
00:27:25
logistic regression so why don't we stop
00:27:28
here right we could stop the lecture
00:27:30
here but obviously we are interested in
00:27:31
something slightly more complex like AI
00:27:34
for chess or for Go and for this we know
00:27:37
that the linear model I mean we know
00:27:39
empirically linear models are just not
00:27:42
powerful enough but before we go that
00:27:46
far ahead maybe let's focus on something
00:27:49
that's the simplest thing that linear
00:27:51
models cannot do and it's going to be a
00:27:53
very well-known XOR problem where we
00:27:56
have two dimensional data sets and on
00:27:58
the diagonal one class on the other
00:28:00
diagonal the second class you can
00:28:02
quickly iterate in your head over all
00:28:04
possible lines right not a single line
00:28:06
has red dots on one side blue on the
00:28:08
other elbow we need something more
00:28:10
powerful so our solution is going to be
00:28:13
to introduce a hitter later so now we're
00:28:15
going to look into two layer neural
00:28:17
networks that in our puzzle view look
00:28:21
like this we have a theta goes to linear
00:28:23
go through sigmoid goes for another
00:28:24
linear goes to soft max cross-entropy
00:28:27
target as you can see we already have
00:28:29
all the pieces we just well we are just
00:28:31
connecting them differently that's all
00:28:33
we are doing and I want to now convince
00:28:35
you that we're adding qualitatively more
00:28:39
than just you know adding dimensions or
00:28:42
something like this so let's start with
00:28:44
the potential solution how can we solve
00:28:46
this if we had just two hidden neurons
00:28:49
and a sigmoid activation function so we
00:28:52
have our data set and for simplicity of
00:28:54
visualization I'm going to recolor them
00:28:56
so that we have four different colors
00:28:59
we have blue red green and pink just so
00:29:02
you see where the projections end up
00:29:04
just remember that we want to separate
00:29:06
one diagonal from the other and that two
00:29:09
hidden neurons are going to be these two
00:29:11
projection lines so the top one is
00:29:14
oriented downwards which means that
00:29:17
we're going to be projecting in such a
00:29:19
way that the blue class will end up on
00:29:21
the right hand side pink on the left
00:29:24
green and red in the middle so somehow I
00:29:29
miss order these two sides so this is
00:29:32
how it's going to look like if you look
00:29:34
at the right hand side you have a
00:29:36
projection of this top line right blue
00:29:38
on the right because everything is
00:29:39
flipped sorry I should have grabbed I
00:29:43
guess ping on the left green and red
00:29:46
compost on top of each other the second
00:29:49
line is pretty symmetrically oriented
00:29:52
and there you can see blue data set or
00:29:55
blue blob projected on the left hand
00:29:57
side pink projected on the right and
00:29:59
green and red again superimposed on each
00:30:03
other right this is all we did through
00:30:05
two lines and just projected everything
00:30:08
onto them these are the weights and
00:30:10
biases at the bottom that would
00:30:13
corresponds to this projection now we
00:30:16
add sigmoid all that Sigma it does is it
00:30:19
squashes right instead of being a
00:30:21
identifing it nonlinear discourses so we
00:30:25
squash these two plots on the sides and
00:30:29
recompose them as a two dimensional
00:30:30
object right we have now on x-axis the
00:30:34
sum x axis we have the first projection
00:30:38
just for sigmoid and this is why it
00:30:40
became extreme the blue things ended up
00:30:43
being very sickly in in one and
00:30:45
everything else went to zero maybe
00:30:48
slightly boomerang G here and the second
00:30:51
neuron this projection after squashing
00:30:54
for Sigma it became y-axis you can see
00:30:56
now pink one got separated everything
00:30:58
else got boomerang ly squashed the nice
00:31:02
thing about this maybe doesn't look that
00:31:04
nice but what it allows us to do is now
00:31:07
draw a line that's going to separate all
00:31:09
the blue and pink things from
00:31:11
everything else and this was our goal
00:31:13
right so if I now project on this line
00:31:15
or equivalently if I were to put the
00:31:19
decision boundary here it would separate
00:31:21
exactly what I wanted right so the Blues
00:31:24
and things were supposed to be one class
00:31:26
and the remaining two colors were
00:31:28
supposed to be the other so I can just
00:31:30
project them put the boundary and if you
00:31:34
now look into the input space we ended
00:31:36
up with this discontinuous
00:31:37
classification or the chasm of sorts in
00:31:41
the middle we came one class and a
00:31:44
reminder became the other right just
00:31:47
going through the internals layer by
00:31:48
layer how the neural network with a
00:31:50
single hidden layer would operate all it
00:31:53
really did was to use this hidden layer
00:31:57
to rotate and then slightly bent the
00:32:02
input space right of the signal you can
00:32:04
think about this as kind of bending or
00:32:06
squishing which as a topological
00:32:09
transformation allows the purely linear
00:32:12
model on top of it to solve the original
00:32:14
problem right so it prepared pre-process
00:32:17
the data such that it became linearly
00:32:20
separable and you just needed two hidden
00:32:22
neurons to do this even though the
00:32:25
problem was not that complex it is a
00:32:27
qualitative change in what we are in
00:32:30
what we can do so what if something is
00:32:36
slightly more complex let's imagine we
00:32:38
want to separate a circle from a
00:32:39
doughnut then tuners won't be enough you
00:32:42
can prove it's not enough then that are
00:32:44
just too complex but six neurons are
00:32:46
doing just fine and at this point I
00:32:48
would like to advertise to you this
00:32:51
great tool by Daniel's Milk of and
00:32:53
others called playgrounds under
00:32:57
playgrounds don't test your folder org
00:32:58
or you can just play with this sort of
00:33:00
simple classification problems you can
00:33:02
pick one of the datasets you can add
00:33:04
hidden layers add neurons at the top you
00:33:07
can select activation function to be
00:33:08
sigmoid to follow what we just talked
00:33:10
about and if you select classification
00:33:12
it will just attach the sigmoid path
00:33:15
cross-entropy as the loss if he'd run
00:33:19
and you get the solution which separates
00:33:23
our data quite nicely
00:33:25
see the loss going down as expected and
00:33:28
arguably this is the easiest and most
00:33:30
important way of learning about you know
00:33:33
that's playing with them actually
00:33:35
interact with them it's really hard to
00:33:37
gain intuitions by just studying their
00:33:40
mathematical properties unless you are a
00:33:43
person with really great imagination I
00:33:46
personally need to fly with stuff to
00:33:48
understand so I'm just trying to share
00:33:50
this sort of lesson that I learned so
00:33:55
what makes it possible for neural nets
00:33:58
to learn arbitrary shapes
00:34:01
I mean arguably donut is not that
00:34:03
complex of a shape but believe me if I
00:34:05
were to draw a dragon
00:34:06
it would also do just fine and the
00:34:09
brilliant result arguably the most
00:34:12
important theoretical results in
00:34:13
neonates is world of C Benko from late
00:34:16
80s where he proved that neural networks
00:34:21
are what he called universe approximator
00:34:23
'he's using slightly more technical
00:34:25
language what it actually means is if
00:34:27
you get if you take any continuous
00:34:29
function from a hypercube right so your
00:34:31
inputs are in between 0 and 1 and have D
00:34:34
dimensions and your function is
00:34:36
continuous so relatively smooth and
00:34:38
output a single scalar value a single
00:34:40
number then there exists neural network
00:34:45
with one hidden layer with sigmoids that
00:34:48
will get an epsilon error at most
00:34:50
epsilon error and this is true for every
00:34:53
positive Epsilon
00:34:54
so you pick an error say one in minus 20
00:34:56
there will exist and you on that satisfy
00:34:59
this constraint you can pick one in
00:35:01
minus 100 and they will exist one that
00:35:04
satisfies it so one could ask what if I
00:35:07
pick epsilon equals zero then answer is
00:35:09
no it can only approximate it cannot
00:35:12
represent so you won't ever be able to
00:35:15
represent most of the continuous
00:35:16
functions but you can get really close
00:35:19
to them at the cost of using potentially
00:35:21
huge exponentially growing models with
00:35:25
respect to input dimensions it shows
00:35:28
that neural networks are really
00:35:29
extremely expressive they can do a lot
00:35:31
of stuff what it doesn't tell us though
00:35:33
is how on earth would we learn them it's
00:35:36
an existential proof right if you
00:35:38
and through proper mathematical
00:35:40
mathematical training you know that
00:35:42
there are two types of proofs right
00:35:43
there either constructive or the
00:35:45
essential arguably reconstructive ones
00:35:46
are more interesting they provide you
00:35:48
with insights how to solve problems the
00:35:50
essential ones are these tricky funky
00:35:53
thing we'll just say it's impossible for
00:35:55
this to be false and this is this kind
00:35:57
of proof that subhankar provided you
00:35:59
just show you just showed that there is
00:36:01
no way for this not to hold there was no
00:36:04
constructive methods of finding weights
00:36:07
of the specific network in this prove
00:36:10
that he right since then we actually had
00:36:12
more constructive versions furthermore
00:36:16
the size can grow exponentially what
00:36:18
subhankar attributed this brilliant
00:36:20
property to was the sigmoid function
00:36:22
that this quashing this smooth beautiful
00:36:25
squashing is what gives you this
00:36:27
generality it wasn't long since
00:36:31
hardening show that actually what
00:36:32
matters is this more of a neural network
00:36:34
structure but you don't need sigmoid
00:36:36
activation function you can actually get
00:36:38
pretty much take pretty much anything as
00:36:40
long as it's not degenerate and what's
00:36:42
he meant by non degenerate is that it's
00:36:45
not constant bounded and continues all
00:36:47
right so you can take a sine wave you
00:36:48
can pretty much get any squiggle as long
00:36:51
as you squiggle at least a bit so things
00:36:53
are non constant and they are bounded so
00:36:56
they cannot go to infinities so it shows
00:36:58
that this extreme potential of
00:37:00
representing or approximating functions
00:37:02
relies on these F and transformations
00:37:04
being stacked on top of each other with
00:37:07
some notion of non-linearity in between
00:37:09
them still without telling you how to
00:37:11
train them is just as in principle the
00:37:13
annual networks that are doing all this
00:37:15
stuff we just don't know how to find
00:37:17
them so to give you some intuition and
00:37:19
to be precise this is going to be an
00:37:21
intuition behind the property not behind
00:37:23
the proof the true proof relies on
00:37:25
showing the displace define been around
00:37:27
that world is a dense set in these set
00:37:30
of continuous functions instead we are
00:37:32
gonna rely on intuition why
00:37:33
approximating functions with sigmoid
00:37:36
based networks should be possible by
00:37:39
proof by picture so let's imagine that
00:37:42
we have this sort of mountain ridge
00:37:45
that's our target function and to our
00:37:48
disposal is only our sigmoid activation
00:37:50
and
00:37:51
they're so of course I can represent
00:37:53
function like this right I'll just take
00:37:55
a positive W and then negative B so it
00:37:58
shifts a bit to the right it doesn't
00:38:00
matter that much because I'm gonna also
00:38:02
get the symmetrical one where W is
00:38:04
negative and B is positive right so I
00:38:06
have two sigmoids
00:38:07
then if we take an average it should
00:38:10
look like a bump and you probably see
00:38:13
where I'm going with this it's gonna
00:38:14
rely on a very similar argument to how
00:38:17
integration works I just want to have
00:38:19
enough bumps so that after adding them
00:38:22
they will correspond to the target
00:38:24
function of interest so let's take three
00:38:27
of them and they just differ in terms of
00:38:30
biases that I've chosen so I'm using six
00:38:33
hidden neurons right two for each bump
00:38:35
and now in the layer that follows the
00:38:39
final classification layer now we
00:38:41
regression layer I'm just gonna mix them
00:38:44
first wait half second one third one and
00:38:47
a half and after adding these three
00:38:49
bumps with weights I end up with the
00:38:51
approximation of the original shape of
00:38:54
course it's not perfect as we just
00:38:56
learned we are never going to be able to
00:38:58
represent functions exactly with
00:39:01
sigmoids but we can get really close
00:39:02
right then this really close the epsilon
00:39:05
is what's missing here I only used 60 10
00:39:08
euros got some error if you want to
00:39:10
squash the error further you just keep
00:39:12
adding bumps now I need a bump here to
00:39:15
resolve this issue
00:39:17
I need a tiny bump somewhere around here
00:39:19
I need a tiny bump here and you just
00:39:21
keep adding and adding and eventually
00:39:24
you'll get as close as you want you
00:39:25
won't ever get it exactly right what is
00:39:28
it gonna go in the right direction so
00:39:30
you can ask okay it's 1d usually things
00:39:33
in one they are just completely
00:39:35
different story then k dimensional case
00:39:37
is there an equivalent construction at
00:39:39
least for 2d and the answer is positive
00:39:42
and you've seen this seven ish slides
00:39:45
before it's this one when we saw a donut
00:39:48
it is nothing but bump in 2d right if
00:39:53
you think about the blue class as a
00:39:55
positive one the one that it's supposed
00:39:56
to get one as the output this is
00:39:59
essentially a 2d bump its saw the
00:40:00
perfect Gaussian right
00:40:02
could do a better job but even with this
00:40:04
sort of bumps we could compose enough of
00:40:07
them to represent any 2d function and
00:40:10
you can see how things starts to grow
00:40:12
exponentially alright we just needed two
00:40:14
neurons to represent bump in 1d now we
00:40:16
need six for 2d and you can imagine that
00:40:18
for KD is gonna be horrible but in
00:40:20
principle possible and this is what
00:40:23
drives this sort of universe
00:40:25
approximation theorem building blocks so
00:40:30
let's finally go deeper since we said
00:40:34
that things are in principle possible in
00:40:37
a shallow land there needs to be
00:40:38
something qualitatively different about
00:40:40
going deeper versus going wider so the
00:40:44
kind of models we are going to be
00:40:46
working with will look more or less like
00:40:47
this there's data those through linear
00:40:49
some node linear nodes in your no linear
00:40:51
node and eventually a loss attached to
00:40:55
our targets what we are missing here is
00:40:58
what is going to be this special node in
00:41:01
between that as advertised before it's
00:41:03
not going to be a sigmoid and the answer
00:41:06
to this is the value unit rectifier
00:41:10
rectified linear units again quite a few
00:41:12
names but essentially what it is is a
00:41:15
point wise maximum between axis between
00:41:18
inputs and a0 all it does is checks
00:41:21
whether the input signal is positive if
00:41:23
so it acts as an identity otherwise it
00:41:27
just flattens it sets it to zero and
00:41:29
that's all why is it interesting well
00:41:32
from say a practical perspective because
00:41:35
it is the most commonly used activation
00:41:38
these days that just works across the
00:41:40
board in a wide variety of practical
00:41:43
application starting from computer
00:41:44
vision and even reinforcement learning
00:41:46
it still introduced and only it still
00:41:49
introduces nonlinear behavior like no
00:41:52
one can claim that this is a linear
00:41:53
function right with the hinge but at the
00:41:55
same time it's kind of linear in the
00:41:57
sense that it's piecewise linear so all
00:42:00
it can do if you were to use it may be
00:42:02
on different layers is to cut your input
00:42:05
space into polyhedra so with the linear
00:42:09
transformations it could cut it into
00:42:11
multiple ones and in each sub subspace
00:42:15
such part it can define an affine
00:42:18
transformation right because they're
00:42:19
just two possibilities and either
00:42:21
identity I'm just cutting you off so in
00:42:24
each of these pieces you have a
00:42:27
hyperplane and in each piece might be a
00:42:30
different hyperplane but the overall
00:42:32
function is really piecewise linear in
00:42:34
1d it would be just a composition of
00:42:37
lines in 2d of planes that are you know
00:42:41
just changing their angles and in KD
00:42:43
well K minus 1 dimensional hyper planes
00:42:45
that are oriented in a funky way the
00:42:49
nice thing is derivatives no longer
00:42:51
vanish there either one when you're in
00:42:53
the positive line our zero otherwise
00:42:56
I mean arguably this was already
00:42:58
vanished before we started the bad thing
00:43:01
is the data neurons can no cure so
00:43:03
imagine that you're all your activities
00:43:05
are negative then going through such
00:43:07
neuron will just be a function
00:43:09
constantly equal to zero which is
00:43:11
completely useless so you need to pay
00:43:14
maybe more attention to the way you
00:43:16
initialize your model and maybe one
00:43:18
extra thing to keep track of to just see
00:43:20
how many dead units you have because it
00:43:22
might be a nice debugging signal if you
00:43:24
did something wrong and also technically
00:43:26
the structure is not differentiable at 0
00:43:29
and the reason why people usually don't
00:43:32
occur is that from probablistic
00:43:35
perspective this is a zero measure set
00:43:36
you will never actually hit zero you
00:43:39
could hand waves and say well the
00:43:41
underlying mathematical model is
00:43:43
actually smooth around zero I just never
00:43:45
hit it so I never care if he wants to
00:43:48
pursue more politically grounded
00:43:50
analysis you can just substitute it with
00:43:52
a smooth version which is logarithm 1
00:43:54
plus minus X this is the dotted line
00:43:56
here that has the same limiting
00:43:58
behaviors but is fully smooth around
00:43:59
zero and you can also just use slightly
00:44:01
different reasoning when you don't talk
00:44:03
about gradients but different objects
00:44:05
we've seen are properties that are just
00:44:07
fine
00:44:08
with single points of non
00:44:09
differentiability so we can now stack
00:44:12
these things together and we have our
00:44:14
typical deep learning model that you
00:44:16
would see in every book on deep learning
00:44:18
linear array linearly and the intuition
00:44:22
behind depths the people had from the
00:44:23
very beginning especially in terms of
00:44:25
computer vision was that each
00:44:27
year we'll be some sort of more and more
00:44:29
abstract feature extraction module so
00:44:32
let's imagine that these are pixels that
00:44:34
come as D as the input then you can
00:44:37
imagine that the first layer will detect
00:44:39
some sort of lines and corners and this
00:44:42
is this is what the what each of the
00:44:45
neurons will represent whether there is
00:44:46
a specific line like horizontal line or
00:44:48
vertical wires of magnets once you have
00:44:51
this sort of representation the next
00:44:52
layer could compose these and represent
00:44:54
shapes like squiggles or something
00:44:57
slightly more complex why do you have
00:44:58
these shapes the next layer could
00:45:00
compose them and represent things like
00:45:01
ears and noses and things like this and
00:45:04
then once you have this sort of
00:45:06
representation maybe you can tell
00:45:08
whether it's a dog or not maybe some
00:45:09
number of years of or existence of ears
00:45:11
in the first place but this is a very
00:45:13
high level intuition and awhile
00:45:14
confirmed in practice this is
00:45:17
necessarily that visible from the mouth
00:45:18
and a really nice result from sorry I
00:45:24
cannot pronounce French but Montu Farman
00:45:28
from Guido and Pascal would show and
00:45:33
Benjy is to show a mathematical
00:45:36
properties that somewhat encode this
00:45:40
high-level intuition and is a provable
00:45:41
statement so one thing is that when we
00:45:44
talked about these linear regions that
00:45:46
are created by Rayleigh networks what
00:45:48
you can show is as you keep adding
00:45:50
layers rather than neurons the number of
00:45:54
trunks in which you are dividing your
00:45:57
input space grows exponentially with
00:45:59
depth and only polynomially with going
00:46:02
wider which shows you that there isn't
00:46:04
simply an enormous reason to go deeper
00:46:08
rather than wider right exponential
00:46:10
growth simply will escape any polynomial
00:46:13
growth sooner or later and with the
00:46:15
scale at which were working these days
00:46:16
it escaped a long time ago the other
00:46:20
thing is if you believe in this high
00:46:23
level idea of learning from savable
00:46:25
times from statistical learning theory
00:46:27
that the principle of learning is to
00:46:30
encounter some underlying structure in
00:46:33
data right we get some training data set
00:46:35
which is some number of samples we build
00:46:37
a model and we expect it to work really
00:46:39
well on the data which comes from the same
00:46:41
distribution but is essentially a
00:46:43
different set how can this be done well
00:46:45
only if you learned if you discovered
00:46:47
some principles behind the data and the
00:46:50
output space and one such or a few such
00:46:53
things can be mathematically defined as
00:46:56
finding regularities symmetries in your
00:47:00
input space and what raelia networks can
00:47:03
be seen as is a method to keep folding
00:47:07
your input space on top of each other
00:47:09
which has two effects one of course if
00:47:14
you keep folding space you have more
00:47:16
when I say fold space I mean that the
00:47:17
points that end up on top of each other
00:47:19
are treated the same so whatever I build
00:47:21
on top of it will have exactly the same
00:47:23
output values for both points that got
00:47:26
folded so you can see why things will
00:47:27
grow exponentially right you fold the
00:47:29
paper once you have two things on top of
00:47:31
each other then four then eight it's
00:47:32
kind of how this proof is built it's
00:47:35
really beautiful I really recommend
00:47:36
reading this paper and beautiful
00:47:38
pictures as well and the second thing is
00:47:41
this is also the way to represent
00:47:42
symmetries if your data if your input
00:47:44
space is symmetric the easiest way to
00:47:47
learn that this symmetry is important is
00:47:49
by folding this space in half if the
00:47:52
symmetries more complex as represented
00:47:54
in this beautiful butterfly ish I don't
00:47:57
know shape you might need to fold in
00:48:00
this extra way so that all the red
00:48:02
points that are quite swirled end up
00:48:06
being mapped onto this one single
00:48:09
slightly curved shape and this gives you
00:48:13
this sort of generalization you discover
00:48:14
the structure if you could of course
00:48:16
learn it that only depth can give you if
00:48:21
you were to build much wider model you
00:48:23
need exponentially many neurons to
00:48:25
represent exactly the same invariance
00:48:27
exactly the same transformation which is
00:48:29
really nice mathematical insight into
00:48:31
this while depth really matters so
00:48:35
people believe this I mean of course
00:48:37
people were using depth before just
00:48:39
because they saw they seen better
00:48:42
results they didn't need necessarily
00:48:43
mathematical explanation for that so
00:48:46
let's focus on this simple model that we
00:48:49
just defined we have three neural
00:48:51
network
00:48:52
sorry three hidden layers in our neural
00:48:54
network linear ReLU linear ReLU and so and
00:48:57
so forth and now we'll go from our
00:49:00
puzzle view that was a nice high level
00:49:03
intuition into something extremely
00:49:04
similar to what's actually used in
00:49:07
pretty much every machine learning
00:49:09
library underneath which is called a
00:49:11
computational graph so it's a graph
00:49:14
which represents this sort of relations
00:49:17
of what talks to what I'm going to use
00:49:20
the same color coding so again blue
00:49:22
things inputs so this is my input X this
00:49:25
is gonna be a target orange is going to
00:49:28
be our loss the Reds are gonna be weight
00:49:32
parameters so the reason why some of you
00:49:38
might have noticed when I was talking
00:49:40
about linear layer I treated both
00:49:42
weights and XS as input to the function
00:49:45
when I was writing f of x, W, b, I was
00:49:49
not really discriminating between
00:49:50
weights and inputs apart from giving
00:49:52
them the color for easier readability is
00:49:55
because in practice it really doesn't
00:49:58
matter there's no difference between a
00:50:00
weight or an input into a node in a
00:50:02
computational graph and this alone gives
00:50:05
you a huge flexibility if you want to do
00:50:07
really funky stuff like maybe weights of
00:50:11
your network are gonna be generated by
00:50:12
another neural network, it fully fits
00:50:16
this paradigm because all you're gonna
00:50:17
do is you're gonna substitute one of
00:50:20
these red boxes that would normally be a
00:50:22
weight with yet another network and it
00:50:24
just fits the same paradigm and we'll go
00:50:27
for some examples in a second to be more
00:50:29
precise we have this graph that
00:50:31
represents computational graph for a
00:50:33
three-layer neural net with ReLUs at a
00:50:36
high level of abstraction, omitting captions
00:50:38
because they are not necessary for this
00:50:40
story they don't have to be linear you
00:50:43
can have side tracks or skip
00:50:45
connections there is nothing stopping
00:50:47
you from saying okay output from this
00:50:49
layer is actually going to be connected to yet another
00:50:51
layer that is also parameterized by
00:50:53
something else
00:50:54
and then they go back and merge maybe
00:50:57
through mean operation concatenation
00:50:59
operation — there are many ways to merge two
00:51:01
signals
00:51:02
there is nothing stopping us from having
00:51:05
many losses and they don't even have to
00:51:07
be at the end of the graph; we might have
00:51:08
a loss attached directly to weights that
00:51:10
will act as the penalty for weights
00:51:12
becoming too large for example or maybe
00:51:14
satisfying some specific constraints: maybe
00:51:16
we want them to be lying on the sphere
00:51:18
and we're going to penalize the model
00:51:20
for not doing so our losses don't even
00:51:24
need to be the last things in the
00:51:25
computational graph you can have a
00:51:26
neural network that has a loss at the
00:51:29
end, and this loss feeds its
00:51:31
value back into later parts of the neural network,
00:51:34
and this is the actual output that you
00:51:36
care about eventually you can also do a
00:51:38
lot of sharing so the same input can be
00:51:41
plugged into multiple parts of your net
00:51:43
in skip connection fashion you can share
00:51:48
weights of your model and sharing
00:51:50
weights in this computational graph
00:51:52
perspective is nothing but connecting
00:51:55
one node to many places. This is an
00:51:58
extremely flexible language that allows
00:52:00
this really modular development and
00:52:02
arguably it actually helped
00:52:05
researchers find new techniques because
00:52:08
the engineering advancement of
00:52:09
computational graph frameworks frees
00:52:12
us from saying oh, weights and inputs
00:52:14
are qualitatively different things;
00:52:16
engineers came and said no, from my
00:52:19
perspective they are exactly the same,
00:52:20
and the research followed: people
00:52:22
started plugging crazy things together
00:52:23
and ended up with really powerful things
00:52:26
like hyper networks so how do we learn
00:52:29
in all these models and the answer is
00:52:31
surprisingly simple you just need basic
00:52:33
linear algebra 101. To just recap: gradients
00:52:36
and jacobians I hope everyone knows what
00:52:39
they are if not in very short words if
00:52:41
we have a function that goes from D
00:52:43
dimensional space to a scalar space like
00:52:45
R, then the gradient is nothing but the
00:52:47
vector of partial derivatives: the i-th
00:52:50
dimension is the partial derivative of
00:52:51
this function with respect to the i-th input.
00:52:54
What's the gradient, at a high level of
00:52:57
abstraction just the direction in which
00:52:58
the function grows the most and minus
00:53:00
gradient is the direction in which it
00:53:02
decreases the most
00:53:04
The Jacobian is nothing but the K-dimensional
00:53:06
generalization: if you have K outputs,
00:53:08
it is a matrix where entry (i, j) is the partial
00:53:11
derivative of the i-th output with respect to
00:53:14
the j-th input. Nothing else, a very basic thing.
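As a minimal NumPy sketch of these two objects (the functions, matrix and numbers below are made up purely for illustration, with a finite-difference check):

import numpy as np

# f: R^d -> R, f(x) = sum(x**2); its gradient is the vector 2*x.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2.0 * x

# g: R^d -> R^k, g(x) = A @ x; its Jacobian is simply the matrix A,
# since entry (i, j) is d g_i / d x_j.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

def g(x):
    return A @ x

x = np.array([0.5, -1.0, 2.0])
eps = 1e-6

# Finite-difference check of the gradient of f.
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(fd_grad, grad_f(x), atol=1e-4)

# Finite-difference check of the Jacobian of g (a 2x3 matrix here).
fd_jac = np.stack([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                   for e in np.eye(3)], axis=1)
assert np.allclose(fd_jac, A, atol=1e-4)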
00:53:17
the nice thing about these things is
00:53:19
they can be analytically computed for
00:53:21
many of the functions that we use, and
00:53:24
then the gradient descent technique that
00:53:26
is numerical methods 101 so surprisingly
00:53:30
deep learning uses a lot of very basic
00:53:33
components but from across the board of
00:53:36
mathematics and just composes it in a
00:53:38
very nice way an idea behind gradient
00:53:40
descent is extremely simple we can view
00:53:43
this as a sort of physical simulation where
00:53:44
you have your function or loss landscape
00:53:47
you just pick an initial point and
00:53:49
imagine that is a ball that keeps
00:53:51
rolling down the hill until it hits a
00:53:53
stable point or it just cannot locally
00:53:56
minimize your loss anymore so you just
00:53:58
at each iteration take your current
00:54:00
point subtract learning rate at time T
00:54:02
times the gradient in this specific
00:54:05
point and this is going to guarantee
00:54:06
convergence to the local minimum under
00:54:09
some minor assumptions on the
00:54:11
smoothness of the function so it needs
00:54:13
to be smooth for it to actually converge
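A minimal sketch of this update rule, assuming a made-up quadratic loss and a hand-picked constant learning rate; the stochastic version would simply replace the full gradient with a minibatch estimate:

import numpy as np

# Minimize f(theta) = ||theta - target||^2 with plain gradient descent:
# theta_{t+1} = theta_t - lr_t * grad f(theta_t).
target = np.array([3.0, -2.0])

def grad(theta):
    return 2.0 * (theta - target)   # analytic gradient of the quadratic

theta = np.zeros(2)                 # initial point: "drop the ball here"
lr = 0.1                            # learning rate, kept constant here
for t in range(100):
    theta = theta - lr * grad(theta)

print(theta)                        # converges towards [3., -2.]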
00:54:14
and it has this nice property that I
00:54:16
was referring to before: because the
00:54:18
gradient of the sum is sum of the
00:54:20
gradients, you can show that analogous
00:54:23
properties hold for the stochastic
00:54:24
version where you don't sum over all
00:54:26
examples you just take a subset and keep
00:54:29
repeating this. This will still converge
00:54:31
under some assumptions bounding
00:54:33
basically the noise, or the variance,
00:54:37
of this estimator and the important
00:54:40
thing is this choice of the learning
00:54:42
rate unfortunately matters like quite a
00:54:44
few other parameters in machine learning
00:54:46
community and there have been quite a
00:54:48
few other optimizers that were
00:54:49
developed on top of gradient descent one
00:54:51
of which became a sort of golden
00:54:53
standard like step zero that you always
00:54:56
start with, which is called Adam, and when
00:54:59
we go to practical issues I will say
00:55:01
this yet again if you're just starting
00:55:04
with some model, just use Adam; if you're
00:55:06
even thinking about the optimizer, this is
00:55:08
just a good starting rule.
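For reference, a minimal NumPy sketch of the Adam update as usually written (first and second moment estimates with bias correction); in practice you would call your library's built-in optimizer rather than re-implementing it:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running moment estimates, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: keep m and v as zero-initialised arrays of theta's shape and call
# adam_step once per minibatch with the current gradient.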
00:55:11
in principle you can apply gradient
00:55:14
descent to non smooth functions and a
00:55:16
lot of stuff in deep learning is kind of
00:55:18
non smooth and people still apply it but
00:55:20
the consequence is you will lose your
00:55:22
convergence guarantees, so the fact that
00:55:24
your loss doesn't decrease anymore might
00:55:27
as well be
00:55:28
that you just did something you were not
00:55:29
supposed to be doing: maybe you
00:55:31
provided a node without a well-defined
00:55:33
gradient or you define the wrong
00:55:34
gradient, you put a stop-gradient in the
00:55:36
wrong place, and so on — then things might
00:55:38
stop converging so what do we need from
00:55:43
the perspective of our nodes so that we can
00:55:45
apply gradient descent directly to the
00:55:47
computational graph right because we
00:55:49
have this computational graph representation
00:55:50
for everything that we talked about and
00:55:52
the only API that we need to follow is
00:55:55
very similar to the one we talked before
00:55:58
we need forward pass given X given input
00:56:01
what is the output and also we need a
00:56:04
backward pass: what is, basically, the
00:56:06
Jacobian with respect to your inputs. For
00:56:10
computational efficiency we don't
00:56:12
necessarily compute the full Jacobian, but
00:56:13
rather the product between the Jacobian and the
00:56:16
gradient of the loss that you eventually
00:56:17
care about, and this is going to be the
00:56:19
information we're gonna send back through
00:56:21
the network.
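A hedged sketch of this two-method API (the class and method names are made up, not those of any particular library): backward takes the incoming gradient with respect to the node's output and returns the gradient with respect to its input, i.e. a Jacobian-vector product computed without ever materialising the Jacobian.

import numpy as np

class Square:
    """Toy node f(x) = x**2, following the forward/backward API."""

    def forward(self, x):
        self.x = x                      # cache what backward will need
        return x ** 2

    def backward(self, grad_output):
        # The Jacobian of an elementwise square is diag(2x), so the
        # Jacobian-vector product is just an elementwise multiplication.
        return 2.0 * self.x * grad_output

node = Square()
y = node.forward(np.array([1.0, -3.0]))
grad_x = node.backward(np.ones(2))      # pretend dLoss/dy = [1, 1]
print(grad_x)                           # [ 2., -6.]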
00:56:24
So let's be more precise: with our computational graph with three layers
00:56:28
we have this sort of gradient descent
00:56:30
algorithm, we have our parameters, the thetas,
00:56:32
and we want to unify these views somehow
00:56:36
right so I need to know what theta is
00:56:37
and how to compute the gradient so let's
00:56:41
start with making theta appear: one
00:56:43
view that you might use is that there
00:56:46
actually is an extra node called theta
00:56:48
and all these parameters these w's B's
00:56:50
that I need for every layer is just
00:56:52
slicing and reshaping of this one huge
00:56:56
theta right so imagine there was this
00:56:57
huge vector theta and I'm just saying
00:57:00
the first W whatever its shape is is
00:57:02
first K dimensions I just take them
00:57:04
reshape this is a well-defined
00:57:06
differentiable operation, right, and the
00:57:09
gradient of the reshaping is reshaping
00:57:11
of the gradient kind of thing so I can
00:57:13
have one theta and then the only
00:57:15
question is how to compute the gradient
00:57:17
and the whole math behind it is really
00:57:19
the chain rule: the derivative of a composition of
00:57:22
functions decomposes with respect to the
00:57:26
inner nodes so if you have F composed
00:57:28
with G and you try to compute the
00:57:30
partial derivative of the output with
00:57:31
respect to the input you can as well
00:57:33
compute the partial derivative of the
00:57:35
output with respect to this inner node G
00:57:38
let's multiply it by the partial
00:57:40
derivative of G with respect to X and if
00:57:43
G happens to be multi-dimensional if
00:57:45
there are many outputs then from matrix
00:57:48
calculus you know that the analogous
00:57:51
object requires you to simply sum over
00:57:54
all these paths.
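Written out, the two rules above are just the multivariate chain rule, informally:

\frac{\partial (f \circ g)}{\partial x}
  = \sum_{k} \frac{\partial f}{\partial g_k} \, \frac{\partial g_k}{\partial x},
\qquad
\frac{\partial L}{\partial \theta}
  = \sum_{\text{paths } p \,:\, \theta \to L} \;\; \prod_{(u \to v) \in p} \frac{\partial v}{\partial u}.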
00:57:57
So what does this mean from the perspective of the computational
00:57:59
graph well let's take a look at one path
00:58:01
so we have the dependence of our loss
00:58:03
node on our weight node, which now became
00:58:06
an input change to blue because as we
00:58:08
discussed before there is literally no
00:58:10
difference between these two and it's
00:58:12
going through this so now all we are
00:58:15
going to do is apply the first rule
00:58:17
we're going to take the final loss and
00:58:19
ask it, okay, tell us what the gradient is.
00:58:22
We are now in the
00:58:24
node that needs to know given how the
00:58:27
outputs needs to change which is already
00:58:29
told to us by this node how it needs to
00:58:32
adjust its inputs which is this Jacobian
00:58:35
times the partial derivative of rest of
00:58:37
the loss with respect to our output so
00:58:40
we can send it back, and we already have
00:58:43
dL with respect to whatever the name of this node is.
00:58:46
the previous node has the same property
00:58:49
right it's being told your outputs needs
00:58:52
to change in these directions and
00:58:53
internally it knows — and by "it knows" I
00:58:56
mean we can compute this Jacobian how to
00:58:58
adjust its inputs so that its outputs
00:59:00
change in the same direction and you go
00:59:03
through this whole graph backwards, da da
00:59:05
da, until you hit your theta, and this is
00:59:07
using just this rule the only problem is
00:59:10
there is a bit more than one path for this
00:59:13
network, there are way more dependencies, but
00:59:15
this is where the other one comes into
00:59:17
place we will just need to sum over all
00:59:19
the paths that connect these two nodes
00:59:22
there might be exponentially many paths
00:59:25
but because they reuse computation the
00:59:28
whole algorithm is fully linear right
00:59:30
because we only go through each node
00:59:32
once computing up till here is
00:59:34
deterministic and then we can in
00:59:36
parallel also compute these two paths
00:59:38
until they meet again. So we have a linear
00:59:41
algorithm that backprops through the
00:59:43
whole thing you can ask couldn't I just
00:59:45
do it by hand for going through all the
00:59:47
equations of course you could but it
00:59:49
would be at the very least quadratic
00:59:51
if you do it naively this is just a
00:59:53
computational trick to make everything
00:59:54
linear and fit into this really generic
00:59:56
scheme that allows you to do all this
00:59:58
funky stuff including all the modern
01:00:01
different architectures representing
01:00:02
everything as computational graphs just
01:00:05
allows you to stop thinking about this
01:00:07
and you can see this shift in research
01:00:09
papers as well
01:00:10
Up to like 2005-ish, you'd see in each paper
01:00:14
from machine learning a section on the
01:00:16
gradient of the loss, where people would
01:00:18
define some specific model and then
01:00:20
there will be a section where they say
01:00:22
oh I sat down and wrote down all the
01:00:24
partial derivatives this is what you
01:00:26
need to plug in to learn my model and
01:00:28
since then this has disappeared; no one ever
01:00:31
writes this, they just say: I use TensorFlow, PyTorch,
01:00:34
Keras, or your favourite library. It's a
01:00:38
good
01:00:39
thing — it moved the field forward: instead of
01:00:42
postdocs, you know, spending a month
01:00:44
deriving everything by hand they spent
01:00:46
five seconds
01:00:47
calling gradients. So let's reimagine
01:00:49
these few modules that we introduced as
01:00:52
computational graphs we have our linear
01:00:54
module as we talked before is just a
01:00:57
function with three arguments it is
01:00:59
basically a dot product between X and W
01:01:01
we add B and what we need to define is
01:01:05
this backward computation with respect
01:01:07
to each of the inputs no matter if it's
01:01:09
an actual input blue thing or a weight
01:01:11
as we discussed before and for X and W
01:01:15
themselves the situation is symmetric we
01:01:18
essentially, for X, we just multiply by
01:01:20
W the errors that are coming from the
01:01:22
future — I mean from further down the
01:01:25
graph, not from the future — and for the W
01:01:27
is just the same situation but with X's
01:01:30
right because the dot product is pretty
01:01:32
symmetric operation itself and the
01:01:33
update for the biases is just the
01:01:36
identity since they are just added at
01:01:38
the end, so you can adjust them very
01:01:41
easily.
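Putting those three formulas into the same forward/backward shape as before — a hedged sketch assuming a batch of row-vector inputs:

import numpy as np

class Linear:
    """f(x, W, b) = x @ W + b for a batch of row vectors x of shape (n, d_in)."""

    def __init__(self, d_in, d_out):
        self.W = 0.1 * np.random.randn(d_in, d_out)
        self.b = np.zeros(d_out)

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad_output):          # grad_output has shape (n, d_out)
        self.grad_W = self.x.T @ grad_output   # errors multiplied by the inputs
        self.grad_b = grad_output.sum(axis=0)  # bias: the gradient passes through the identity
        return grad_output @ self.W.T          # errors multiplied by the weights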
01:01:44
The nice thing to note is that all these
01:01:46
operations in the backward graph
01:01:47
are also basic algebra, and as such
01:01:50
they could be a computational graph
01:01:53
themselves and this is what happens in
01:01:55
many of these libraries when you call TF
01:01:57
gradients for example or something this
01:01:59
the backward computation will be added
01:02:02
to your graph there will be a new chunk
01:02:03
of your graph
01:02:04
that represents the backwards graph and
01:02:07
what's cool about this is now you can go
01:02:09
really crazy and say: I want higher-
01:02:12
order derivatives, I want to backprop through
01:02:14
backprop, and all you need to do is just
01:02:16
grab the node that corresponds to this
01:02:18
computation that was done for you just
01:02:20
call it again and again and again and
01:02:22
just get this really really powerful
01:02:24
differentiation technique until your GPU
01:02:27
RAM dies, right. The ReLU
01:02:30
itself is
01:02:32
super simple: in the forward pass you
01:02:34
have the maximum of zero and X; in the
01:02:36
backward pass you end up with a masking
01:02:38
method so if the specific neuron was
01:02:42
active when I say active I mean it was
01:02:44
positive and the ReLU just passed it
01:02:45
through then you just pass the gradients
01:02:47
through as well and if it was inactive
01:02:50
meaning it was negative it hit zero then
01:02:53
of course gradients coming back need to
01:02:55
be zeroed as well because we don't know
01:02:57
how to adjust them, right? Locally, from the
01:02:59
ReLU's perspective, if you are in the
01:03:01
zero land if you make an infinitely
01:03:03
small step you are still in zero land
01:03:04
let's forget about the actual zero, because
01:03:06
this one is tricky.
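The corresponding sketch for the ReLU, where the backward pass is exactly the masking just described:

import numpy as np

class ReLU:
    def forward(self, x):
        self.mask = x > 0             # which units were active (strictly positive)
        return np.maximum(0.0, x)

    def backward(self, grad_output):
        # Pass gradients through the active units, zero them out elsewhere.
        return grad_output * self.mask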
01:03:11
Softmax is also relatively simple —
01:03:13
maybe its gradient is slightly fancier
01:03:14
because there's an exponentiation,
01:03:15
summation, division — but it's the same
01:03:18
principle right and you can also derive
01:03:21
the corresponding partial derivative
01:03:23
which is the backward pass and it's
01:03:26
essentially the difference between the
01:03:28
incoming gradients and the output and
01:03:31
you can see that these things might blow
01:03:34
up, right: softmax itself, if x_j is very
01:03:37
big then exponent will just overflow
01:03:40
whatever is the numerical precision of
01:03:42
your computer and as such is rarely used
01:03:45
in such a form
01:03:46
it's either composed with something
01:03:48
that squashes it back to a reasonable
01:03:50
scale, or one does some tricks like taking
01:03:52
a minimum of XJ and say 50 so that you
01:03:56
lose parts of say mathematical beauty of
01:04:00
this but at least things will not blow
01:04:02
up to infinities and now if you look at
01:04:06
the cross entropy is also very simple to
01:04:10
vectorize, and from its partial derivatives
01:04:13
we can now see why things get messy
01:04:16
computationally: you divide by P, and
01:04:17
dividing by small numbers as you know
01:04:19
from computer science basics can again
01:04:22
overflow so it's something that on its
01:04:24
own is not safe to do again you could
01:04:27
hack things around but the nicer
01:04:29
solutions and the nice thing about
01:04:31
viewing all these things jointly inputs
01:04:34
weights targets whatever as the same
01:04:38
objects with exactly the same paradigm
01:04:40
exactly the same model that we use to
01:04:43
say well these are these pictures of
01:04:45
dogs and cats right and these are the
01:04:46
targets; what is a set of weights
01:04:49
for this model to maximize the
01:04:51
probability of this labeling can also
01:04:53
ask the question given this neural
01:04:55
network what is the most probable
01:04:57
labeling of these pictures so that this
01:05:00
neural network is going to be happy
01:05:02
about it its loss is going to be low by
01:05:05
simply attaching our gradient descent
01:05:07
technique instead of to the theta that
01:05:11
we can attach directly to T right and as
01:05:14
long as these things are properly
01:05:16
defined in your library, this is going to work.
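A hedged sketch of that idea: freeze the weights and attach gradient descent to the input instead of to theta (the tiny linear classifier, the finite-difference gradients and the numbers below are made up to keep the sketch short):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # frozen, "already trained" weights (made up)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss(x, target_class):
    p = softmax(x @ W)                # a simple linear classifier
    return -np.log(p[target_class])

# Gradient descent on the INPUT x so that the fixed network assigns high
# probability to class 2 -- same machinery, just a different leaf node.
x = np.zeros(4)
target_class = 2
eps, lr = 1e-5, 0.5
for _ in range(200):
    g = np.array([(loss(x + eps * e, target_class) -
                   loss(x - eps * e, target_class)) / (2 * eps)
                  for e in np.eye(4)])
    x = x - lr * g

print(softmax(x @ W))                 # probability mass concentrates on class 2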
01:05:19
And now you can see why you would
01:05:22
compose softmax and the cross-entropy
01:05:24
because now the backward pass simplifies
01:05:26
dramatically: instead of all this
01:05:29
nastiness, division by small numbers, etc., you
01:05:32
just get the partial derivative of the
01:05:35
loss with respect to inputs as a
01:05:38
difference between targets and your
01:05:40
inputs as simple as that all the
01:05:42
numerical instabilities gone you can of
01:05:45
course still learn labels and partial
01:05:48
derivative is relatively okay.
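A hedged sketch of the fused, numerically stable combination: compute the cross-entropy through log-sum-exp rather than through explicit probabilities, and use the simplified backward pass — probabilities minus targets — directly on the logits:

import numpy as np

def softmax_cross_entropy(logits, targets):
    """logits: (n, k) raw scores; targets: (n, k) one-hot (or soft) labels.

    Returns (loss, grad_logits). No explicit division by small probabilities,
    no exponentiation of large numbers.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # safe against overflow
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -(targets * log_probs).sum(axis=1).mean()
    probs = np.exp(log_probs)
    grad_logits = (probs - targets) / logits.shape[0]       # the simple difference
    return loss, grad_logits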
01:05:52
This is one of the main reasons why, when using
01:05:55
machine learning libraries like Keras,
01:05:57
TensorFlow, and many others, you'll
01:06:00
encounter this cross-entropy jungle you
01:06:02
see ten functions that are called
01:06:04
cross-entropy something: like sparse
01:06:06
cross-entropy with logits,
01:06:07
cross-entropy with softmax, I don't
01:06:09
know what else. The reason is
01:06:12
because each of these operations on its
01:06:14
own is numerically unstable and people
01:06:17
wanted to provide you with a solution
01:06:19
that is numerically stable they just
01:06:21
took literally every single combination
01:06:23
gave it a name and each of these
01:06:25
combinations is implemented in a way
01:06:27
that is numerically stable,
01:06:28
and all you need to do is to have this
01:06:30
lookup table which combination you want
01:06:32
to use and pick the right
01:06:34
name right but underneath they're always
01:06:37
just composing cross-entropy with either
01:06:39
sigmoid or soft max or something like
01:06:41
this and it's exactly this problem that
01:06:43
they are avoiding if you want to do
01:06:45
things by hand feel free but don't be
01:06:48
surprised if even on MNIST from time
01:06:49
to time you'll see an infinity in your
01:06:51
loss it's just the beauty of finite
01:06:54
precision arithmetic in the continuous
01:06:56
land so let's go back to our example
01:07:00
right it was this small puzzle piece now
01:07:03
we can explicitly label each of the
01:07:06
nodes: so we have our Xs, they go through a dot
01:07:09
product with weights, biases are added,
01:07:11
then there is a ReLU; we do it quite a few
01:07:13
times at some point at this point we
01:07:16
have probability estimates and this is
01:07:18
the output of our model even though our
01:07:20
loss is computed later this is also one
01:07:23
of the things I was mentioning before
01:07:24
right the output or the special nodes
01:07:27
don't have to be at the end they might
01:07:28
be branching from the middle we can
01:07:31
replace this with theta
01:07:32
and slicing, and apply
01:07:37
our gradient descent and maybe to
01:07:40
surprise some of you this is literally
01:07:43
how training of most of the deep neural
01:07:46
nets look like in supervised way
01:07:49
reinforcement learning slightly
01:07:51
different story but it's this underlying
01:07:54
principle that allows you to work with
01:07:56
any kind of neural network it doesn't
01:07:59
have to be this linear structure all of
01:08:01
these funky things that I was trying to
01:08:03
to portray ten slides ago rely on
01:08:06
exactly the same principle and you use
01:08:08
exactly the same rules you just keep
01:08:09
composing around the same algorithm and
01:08:11
you get an optimization method that's
01:08:14
going to converge to some local minimum
01:08:16
not necessary perfect model but is going
01:08:18
to learn something so there are a few
01:08:21
things that we omitted from this that
01:08:24
are still interesting pieces of the
01:08:25
puzzle one such thing is taking a
01:08:28
maximum so imagine that one of your
01:08:30
nodes wants to take a maximum you have a
01:08:32
competition between your inputs and
01:08:34
only the maximal one is going to be
01:08:36
selected then the backwards pass of this
01:08:40
operation
01:08:41
is nothing but gating: again, it's
01:08:43
gonna be passing through the gradients
01:08:46
only if this specific dimension was
01:08:49
the maximal one and just zeroing out
01:08:51
everything else you can see that this
01:08:53
will not learn how to select things but
01:08:55
at least it will tell the maximal thing
01:08:57
how to adjust under the condition that
01:08:59
it got selected.
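A hedged sketch of that gating behaviour for a max over a vector (a simplified stand-in for, say, a max-pooling unit):

import numpy as np

class Max:
    """Takes the maximum over a 1-D input; backward routes the gradient
    only to the winning entry and zeroes out everything else."""

    def forward(self, x):
        self.argmax = np.argmax(x)
        self.size = x.shape[0]
        return x[self.argmax]

    def backward(self, grad_output):       # grad_output is a scalar here
        grad_x = np.zeros(self.size)
        grad_x[self.argmax] = grad_output
        return grad_x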
01:09:01
All right, so there is this notion of non-smooth things that
01:09:03
don't necessarily guarantee convergence
01:09:05
in a mathematical sense but they are
01:09:08
commonly used and you'll see in say
01:09:09
Sander's talk on convolutional neural networks
01:09:12
that they are part of the max pooling
01:09:14
layers you can have funky things like
01:09:17
conditional execution like five
01:09:19
different computations and then a one-hot
01:09:22
layer that tells you which of these
01:09:24
branches to select. Now, if it was
01:09:26
one hot encoded then selection can be
01:09:28
viewed as just point wise multiplication
01:09:30
right by one we multiply and then the
01:09:33
backward pass is just gonna be again
01:09:35
gated in the same way but if it was not
01:09:38
a one-hot encoding but rather the output
01:09:40
of the softmax right of something
01:09:42
parametrized then looking at the
01:09:44
backward pass with respect to the gating
01:09:46
allows you to literally learn the
01:09:48
conditioning you can learn which branch
01:09:50
of execution to go through as long as
01:09:53
you smoothly mix between them using
01:09:55
softmax
01:09:56
and this is the high-level principle, or
01:09:58
high-level idea, behind modern attention
01:10:00
models; they essentially do this.
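A hedged sketch of that smooth mixing: instead of a hard one-hot choice, blend the branch outputs with softmax weights, so the gating itself becomes differentiable and learnable (the two branches and the numbers are made up):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Two hypothetical "branches" of computation on the same input.
branches = [lambda x: x ** 2,
            lambda x: np.sin(x)]

def gated(x, gate_logits):
    weights = softmax(gate_logits)           # smooth, differentiable selection
    outputs = np.stack([b(x) for b in branches])
    return (weights[:, None] * outputs).sum(axis=0)

x = np.linspace(0.0, 1.0, 5)
print(gated(x, np.array([2.0, -1.0])))       # mostly the x**2 branch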
01:10:04
To give you a trivial example of other
01:10:06
losses: you had cross-entropy, but of
01:10:08
course many problems in real life are
01:10:09
not classification if it's a regression
01:10:13
so your outputs are just real numbers
01:10:14
then the L2 quadratic loss, or one of at
01:10:18
least ten other names for this quantity
01:10:19
which is just a square norm of a
01:10:22
difference between targets and your
01:10:24
prediction can also be seen as a
01:10:25
computational graph and the backward
01:10:27
pass is again nothing but the difference
01:10:29
between what you predicted and what you
01:10:32
wanted and there's this nice duality as
01:10:34
you can see from the backwards
01:10:35
perspective that looks exactly the same
01:10:37
as in the case of the cross-entropy composed with
01:10:39
the softmax, which also provides you with
01:10:41
some intuitions into how these things
01:10:43
are related.
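For completeness, the same duality as a tiny sketch (using the common 1/2 factor so that the backward pass is exactly the difference):

import numpy as np

def l2_loss(prediction, target):
    """Returns (loss, grad_prediction) for 0.5 * ||prediction - target||^2."""
    diff = prediction - target
    return 0.5 * np.sum(diff ** 2), diff   # gradient: predicted minus wanted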
01:10:46
So let's quickly go through some practical issues, given that
01:10:48
we know kind of what we're working with
01:10:50
and the first one is the well-known
01:10:53
problem of overfitting and regularization. So from
01:10:56
statistical learning theory so we are
01:10:58
still, or again, going back to, say, Vapnik
01:11:01
and way before him — we know that in this
01:11:04
situation where we just have some
01:11:06
training set which is a finite set of
01:11:08
data that we're building our model on
01:11:11
top minimizing error on it which we're
01:11:15
going to call training error or training
01:11:16
risk is not necessarily what we care
01:11:18
about what we care about is how our
01:11:20
model is going to behave in a while it
01:11:22
what's going to happen if I take a test
01:11:24
sample that kind of looks the same but
01:11:26
it's a different dog than the one that I
01:11:27
saw in train that's what we are going to
01:11:30
call test risk, or test error, and it's a
01:11:33
provable statement that there is this
01:11:34
relation between complexity of your
01:11:37
model and the behavior between these two
01:11:40
errors as your model gets more and more
01:11:42
complex — and by complex I mean more
01:11:44
capable of representing more and more
01:11:47
crazy functions or being able to just
01:11:49
store and store more information then
01:11:52
your training error has to go down not
01:11:54
in terms of any learning methods but
01:11:56
just in terms of existence of parameters
01:11:58
that realize it — think universal
01:12:00
approximation theorem right it says
01:12:02
literally that but at the same time as
01:12:04
things get more complex and bigger,
01:12:06
your test risk initially goes down
01:12:09
because you are just getting better at
01:12:11
representing the underlying structure
01:12:13
but eventually the worst case scenario
01:12:15
is actually going to go up because you
01:12:17
might as well represent things in a very
01:12:19
bad way, for example by enumerating all
01:12:22
the training examples and outputting
01:12:24
exactly what's expected from you zero
01:12:26
training error, but this represents
01:12:28
horrible generalization power and this
01:12:31
sort of curve you'll see in pretty much
01:12:33
any machine learning book till 2016 ish
01:12:37
when people started discovering
01:12:38
something new that will go through in a
01:12:42
second but even if you just look at this
01:12:45
example you notice that there is some
01:12:47
reason to keep things simple and so
01:12:50
people developed many regularization
01:12:51
techniques such as LP regularization
01:12:53
where you attach one of these extra
01:12:56
losses directly to weights that we
01:12:57
talked about before which is just LP
01:13:00
norm — like the L2 quadratic norm or L1 or
01:13:02
something like this to each of your
01:13:04
weights so that your weights are small
01:13:07
and you can prove — again, a guarantee — that
01:13:08
if weights are small the function cannot
01:13:10
be too complex so you are restricting
01:13:12
yourself to the left-hand side of this
01:13:13
graph you can do drop out where some
01:13:15
neurons are randomly deactivated again
01:13:18
much harder to represent complex things
01:13:20
you can add noise to your data you can
01:13:22
stop early or you can use various
01:13:24
notions of normalization that will be
01:13:26
talked through in the next lecture but
01:13:29
that's all in this worst-case scenario.
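As a minimal, hedged illustration of the first two techniques above — an L2 penalty attached directly to the weights, and (inverted) dropout on activations:

import numpy as np

def l2_penalty(weights, lam=1e-4):
    """Extra loss term attached to the weights themselves: lam * sum ||W||^2."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly deactivate a fraction p of units (inverted dropout)."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)   # rescale so the expected value is unchanged

# total_loss = data_loss + l2_penalty([W1, W2, ...]); dropout is inserted
# between layers during training only.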
01:13:31
what people recently discovered or
01:13:33
recently started working on is how this
01:13:36
relates to our deep neural networks that
01:13:39
don't have hundreds of parameters they
01:13:40
have billions of parameters and yet
01:13:42
somehow they don't really over fit as
01:13:44
easily as you would expect so the new
01:13:47
version of this picture emerge that's
01:13:49
currently referred to as double descent
01:13:51
where you have this phase change but yes
01:13:55
things get worse as you get more and
01:13:57
more complex model but eventually you
01:13:59
hit this magical boundary of over
01:14:01
parameterization where you have so many
01:14:03
parameters that, even though in theory
01:14:05
you could do things in a very nasty way
01:14:08
like by enumerating examples, because of
01:14:11
the learning methods that we are using
01:14:12
you never will
01:14:14
you start to behave kind of like a
01:14:16
Gaussian process and as you keep
01:14:18
increasing number of parameters you
01:14:20
actually end up with the simplest
01:14:21
solutions being found first rather than
01:14:24
the more complex ones and so the curve
01:14:26
descends again and it has been proven by
01:14:30
Belkin et al. under some constraints
01:14:33
and shown in simple examples then it was
01:14:36
also reinforced with cool work from
01:14:41
Preetum Nakkiran
01:14:42
and colleagues at OpenAI, where they
01:14:49
showed that this holds for deep big
01:14:51
models that we care about so one could
01:14:54
ask well does it mean we don't care
01:14:55
about regularization anymore, you just make
01:14:57
models bigger and the answer is well not
01:15:01
exactly
01:15:02
it's both true that as you increase the
01:15:04
model, which you can see on the x-axis,
01:15:05
your test loss, after
01:15:09
rapidly increasing, keeps decreasing all
01:15:11
the time but adding regularization can
01:15:14
just keep the whole curve lower so here
01:15:18
as you go through curves from top to
01:15:19
bottom
01:15:20
it's just more and more regularization
01:15:21
being added so what it means how it
01:15:24
relates to this theory of complexity
01:15:27
what that mostly means is that model
01:15:29
complexity is way more than just the number of
01:15:33
parameters and this is a local minimum
01:15:35
like research local minimum people were
01:15:36
in for quite a while where they thought
01:15:38
well, your neural network is huge, surely it
01:15:41
is not going to generalize well, because
01:15:43
your Vapnik–Chervonenkis bounds are
01:15:44
infinite you're doomed and it seems not
01:15:47
to be the case the complexity of the
01:15:49
model strongly relies on the way we
01:15:53
train and as a result you are still kind
01:15:58
of in this regime where things can
01:16:01
get worse and you do need to regularize
01:16:03
but adding more parameters is also a way
01:16:06
to get better results slightly
01:16:09
counterintuitive and only applies if you
01:16:11
keep using gradient descent not some
01:16:13
nasty way okay
01:16:15
so just a few things there's a lot of
01:16:18
stuff that can go wrong when you train a
01:16:19
neural net, and it can be a harsh
01:16:24
experience initially so first of all if
01:16:27
you haven't tried don't get discouraged
01:16:29
initially nothing works and it's
01:16:31
something we all went through and there
01:16:33
is nothing to solve it apart from
01:16:36
practice just playing with this will
01:16:38
eventually get you there there's a
01:16:41
brilliant blog posts from Andrew karpati
01:16:45
and I'm referring to here and also a few
01:16:48
points that I like to keep in mind each
01:16:50
time I train neural networks first of
01:16:53
all that initialization really matters
01:16:55
all the theory that was built and the
01:16:57
practical results if you initialize your
01:16:59
network badly it won't learn and you can
01:17:01
prove it won't work won't learn well
01:17:04
what you should start with always is to
01:17:07
try to overfit: if you're
01:17:09
introducing a new model especially you
01:17:11
need to try to overfit on some small
01:17:13
data sample if you can't over fit almost
01:17:16
surely you made a mistake unless for
01:17:18
some reason your model doesn't work for
01:17:19
small sample sizes then obviously just
01:17:21
ignore what I just said
01:17:24
you should always monitor training loss
01:17:26
I know sounds obvious but quite a few
01:17:29
people just assume that loss will go
01:17:30
down
01:17:31
because gradient descent guarantees it;
01:17:32
without monitoring it you will never
01:17:34
know if you are in the right spot
01:17:37
especially given that many of our models
01:17:39
are not differentiable and as such the
01:17:41
loss doesn't have to go down so if it's
01:17:42
not going down you might want to
01:17:44
reconsider using these non-differentiable
01:17:45
units. Even more important is something that
01:17:48
people apparently stopped doing in deep
01:17:50
learning on a daily basis it's
01:17:52
monitoring norms of your weights norms
01:17:54
going to infinity is something to be
01:17:57
worried about and if it's not making
01:17:59
your job crash right now, it eventually
01:18:01
will once you leave it running for a few
01:18:04
days, and then you'll regret not
01:18:06
monitoring it earlier. Another
01:18:10
thing is adding shape asserts. All the
01:18:12
modern deep learning libraries
01:18:15
are great and have brilliant features,
01:18:16
one of which is automatic broadcasting
01:18:19
they take a column vector we take a row
01:18:20
vector you add them you get the matrix
01:18:23
very useful unless this is not what you
01:18:25
wanted to do: you just wanted two vectors
01:18:27
and you ended up with a matrix. If the next
01:18:30
operation is taking a maximum or taking
01:18:32
the average you won't notice right
01:18:34
afterwards there's just a scalar
01:18:35
everything looks fine but your learning
01:18:37
will be really crazy and you can try to
01:18:40
train a linear regression and just by
01:18:42
mistake transpose targets and you'll see
01:18:45
how badly linear regression can behave
01:18:47
by just one liner that throws no
01:18:49
exceptions and your loss will go down it
01:18:52
just won't be the model that you're
01:18:53
expecting the only way that I know about
01:18:55
to resolve this is to add shape asserts
01:18:57
everywhere each time you add an
01:18:59
operation we just write down an assert
01:19:02
like literally low-level engineering
01:19:03
thing to make sure that the shape is
01:19:05
exactly what you expect otherwise you
01:19:07
might run into issues.
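A hedged sketch of exactly that failure mode and the one-line asserts that catch it (the shapes and numbers are hypothetical):

import numpy as np

n = 100
x = np.random.randn(n, 1)
y_true = 3.0 * x[:, 0] + 1.0          # shape (n,), a flat vector of targets
y_pred = x @ np.array([[2.9]]) + 1.0  # shape (n, 1), a column of predictions

# Silent broadcasting bug: (n,) - (n, 1) broadcasts to an (n, n) matrix,
# and the final mean() hides it behind an innocent-looking scalar.
bad_loss = np.mean((y_true - y_pred) ** 2)

# The cheap fix: assert the shapes you expect before reducing.
assert y_pred.shape == (n, 1), y_pred.shape
y_pred = y_pred[:, 0]
assert y_pred.shape == y_true.shape, (y_pred.shape, y_true.shape)
good_loss = np.mean((y_true - y_pred) ** 2)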
01:19:09
Things that we mentioned before:
01:19:10
use Adam as your starting point, just
01:19:12
because 3e-4 is the magical
01:19:16
learning rate; it works in 99% of deep
01:19:18
learning models for unknown reasons to
01:19:20
everyone
01:19:21
finally it's very tempting to change
01:19:24
five things at a time because you feel
01:19:25
like you have so many good ideas and
01:19:27
don't get me wrong you probably do but
01:19:29
if you change all of them at once
01:19:31
you will regret it afterwards when you
01:19:33
struggle with debugging and or credit
01:19:36
assignment of what actually improves in
01:19:38
your model and the reviewers won't be
01:19:40
happy either
01:19:41
when your ablation just skips five steps at once
01:19:44
so given a few last minutes before the
01:19:48
questions, I wanted to spend, say, three-
01:19:51
ish minutes on the bonus thing on
01:19:54
multiplicative interactions so I was
01:19:56
trying to convince you through this
01:19:58
lecture that neural networks are really
01:20:01
powerful and I hope I succeeded they are
01:20:05
very powerful but I want to ask this may
01:20:08
be a funny question what is one thing
01:20:10
that these multi-layer networks — where we
01:20:12
just have a linear then an activation
01:20:14
function, say sigmoid or ReLU, stacked
01:20:16
on top of each other definitely cannot
01:20:18
do? Well, there may be many answers, right; they
01:20:21
can't do a lot of stuff but one trivial
01:20:24
thing they can't do is they can't
01:20:25
multiply there's just no way for them to
01:20:29
multiply two numbers given us inputs
01:20:32
again you might be slightly confused we
01:20:35
just talked about the universal
01:20:36
approximation theorem but what I'm
01:20:37
referring to is representing
01:20:39
multiplication we can approximate
01:20:42
multiplication to any precision but they
01:20:44
can never actually represent the
01:20:46
function that multiplies so no matter
01:20:48
how big your data set is going to be no
01:20:51
matter how deep your network is going to
01:20:52
be, if you train it to multiply two
01:20:54
numbers I can always find two new
01:20:57
numbers on which it is going to miserably
01:20:59
fail, and by miserably I mean get an
01:21:01
arbitrarily big error. Maybe my numbers are
01:21:03
going to be huge doesn't matter there is
01:21:06
something special about multiplication
01:21:08
that I would like you to note. What's
01:21:10
special about it? For example,
01:21:12
conditional execution relies on
01:21:14
multiplying something between 0 and 1
01:21:17
and something else many things in your
01:21:19
life can be represented as
01:21:21
multiplication for example computing distance between
01:21:23
two points relies on being able to
01:21:25
compute a dot product plus norms and
01:21:28
things like this so it's quite useful to
01:21:30
have this sort of operation yet stacking
01:21:33
even infinitely many yes infinitely many
01:21:36
layers would not help and one way to
01:21:40
resolve it is a sort of unit that just
01:21:43
implements multiplicative interactions;
01:21:45
one way to formalize it is as follows
01:21:47
you have a tensor W you take your inputs
01:21:50
through this you can see this as a
01:21:52
Mahalanobis-like dot product, if you went
01:21:54
through that part of algebra; then
01:21:57
you have the matrix projections of the
01:21:58
remaining things, and you just add the bias.
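One way to write such a unit down concretely — a hedged sketch; the exact parameterisation used in the paper mentioned below may differ in its details:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_out = 3, 4, 2

W3 = rng.normal(size=(d_out, d_z, d_x))   # 3-D tensor: the multiplicative part
U = rng.normal(size=(d_out, d_z))         # linear projection of z
V = rng.normal(size=(d_out, d_x))         # linear projection of x
b = np.zeros(d_out)

def multiplicative_unit(x, z):
    # out_k = z^T W3[k] x + (U z)_k + (V x)_k + b_k
    bilinear = np.einsum('kzx,z,x->k', W3, z, x)
    return bilinear + U @ z + V @ x + b

# With W3 set appropriately this layer represents a dot product (and hence
# multiplication) exactly, which a plain linear+ReLU stack can only approximate.
print(multiplicative_unit(rng.normal(size=d_x), rng.normal(size=d_z)))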
01:22:01
so if you just look at the approximation
01:22:04
things if you were to say compute a dot
01:22:06
product and you do it with a normal
01:22:09
neural net of linears and ReLUs, then
01:22:11
you have exponentially many
01:22:13
parameters needed to approximate this to
01:22:16
that zero point one error I believe I
01:22:18
used here with respect to the
01:22:19
dimensionality of the input there is a
01:22:21
very steep exponential growth just
01:22:24
approximate and there is still gonna be
01:22:26
this problem that you don't generalize
01:22:27
but even approximation requires huge
01:22:30
amounts of parameters while using model
01:22:33
like this explicitly has a linear growth
01:22:35
and has a guarantee right once you hit
01:22:37
the dot product which can be represented
01:22:40
exactly with this module you will
01:22:42
generalize everywhere there's a nice
01:22:44
work from Siddhant Jayakumar et al. at this
01:22:48
year's ICLR if you want to dig deeper,
01:22:51
but I want to just stress there is a
01:22:53
qualitative difference between
01:22:54
approximation and representation and in
01:22:58
some sense sends you home with this
01:22:59
take-home message, which is: if you want
01:23:02
to do research in this sort of
01:23:04
fundamental building blocks of neural
01:23:06
networks please try not to focus on
01:23:09
improving things like marginally
01:23:12
improving things the neural networks
01:23:13
already do very well if we already have
01:23:16
this piece of a puzzle polishing it I
01:23:18
mean is an improvement but it's really
01:23:20
not what's cool about this field of
01:23:23
study and this is not where the biggest
01:23:24
gains both for you scientifically as
01:23:27
well as for the community lies was the
01:23:30
biggest game is identifying what neural
01:23:32
networks cannot do or cannot guaranty
01:23:34
think about maybe you might want a
01:23:37
module that's guaranteed to be convex or
01:23:39
quasi-convex or some other funky
01:23:42
mathematical property that you are
01:23:43
personally interested in and propose a
01:23:45
module that does that I guarantee you
01:23:47
that will be much better experience for
01:23:52
you and much better result for all of us
01:23:54
and with that I'm going to finish so
01:23:57
thank you
01:23:59
you

Description:

Neural networks are the models responsible for the deep learning revolution since 2006, but their foundations go as far back as the 1960s. In this lecture DeepMind Research Scientist Wojciech Czarnecki goes through the basics of how these models operate, learn and solve problems. He also introduces various terminology/naming conventions to prepare attendees for further, more advanced talks. Finally, he briefly touches upon more research-oriented directions of neural network design and development.

Download the slides here: https://storage.googleapis.com/deepmind-media/UCLxDeepMind_2020/L2%20-%20UCLxDeepMind%20DL2020.pdf

Find out more about how DeepMind increases access to science here: https://deepmind.google/about/

Speaker Bio: Wojciech Czarnecki is a Research Scientist at DeepMind. He obtained his PhD from the Jagiellonian University in Cracow, during which he worked on the intersection of machine learning, information theory and cheminformatics. Since joining DeepMind in 2016, Wojciech has been mainly working on deep reinforcement learning, with a focus on multi-agent systems, such as the recent Capture the Flag project or AlphaStar, the first AI to reach the highest league of human players in a widespread professional esport without simplification of the game.

About the lecture series: The Deep Learning Lecture Series is a collaboration between DeepMind and the UCL Centre for Artificial Intelligence. Over the past decade, Deep Learning has evolved as the leading artificial intelligence paradigm, providing us with the ability to learn complex functions from raw data at unprecedented accuracy and scale. Deep Learning has been applied to problems in object recognition, speech recognition, speech synthesis, forecasting, scientific computing, control and many more. The resulting applications are touching all of our lives in areas such as healthcare and medical research, human-computer interaction, communication, transport, conservation, manufacturing and many other fields of human endeavour. In recognition of this huge impact, the 2019 Turing Award, the highest honour in computing, was awarded to pioneers of Deep Learning. In this lecture series, leading research scientists from DeepMind deliver 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks via advanced ideas around memory, attention, and generative modelling to the important topic of responsible innovation.
