"videoThumbnail DeepMind x UCL | Deep Learning Lectures | 2/12 |  Neural Networks Foundations
Table of contents
0:00
Intro
0:09
Biological Intuition
0:16
The Big Picture
0:17
Single Layer Networks
0:17
Sigmoid
0:23
Softmax
0:26
Linear Models
0:28
Solution
0:28
Puzzle View
0:28
Potential Solution
0:32
Playgrounds
0:34
Universal Approximator
0:37
Intuition
0:40
Going deeper
0:40
Models
0:41
Rectified Linear Unit
0:44
Intuition Behind Deep Learning
0:45
Mathematical Properties of Intuition
0:48
Computational Graphs
0:52
Linear Algebra 101
0:53
Gradient Descent 101
0:55
Gradient Descent API
0:56
Free Layers
Video tags
Artificial Intelligence
AI
Deep Learning
Lecture
DeepMind
UCL
Machine Learning
Neural Networks
Subtitles
00:00:06
all right thank you for this lovely
00:00:08
introduction and
00:00:12
I mentioned today we'll go
00:00:14
the foundations of neural networks the
00:00:18
talk itself will last around 90 minutes
00:00:20
so I would ask you to wait with any
00:00:24
questions till the end of the lecture
00:00:26
we'll have a separate slot to address
00:00:29
this and I'll also hang around for a
00:00:32
while after the lecture if you would
00:00:33
prefer to ask some in person lecture
00:00:37
itself will be structured as follows
00:00:38
there'll be six sections I will start
00:00:41
with a basic overview trying to convince
00:00:43
you that there is a point in learning
00:00:45
about neural nets what is the actual
00:00:48
motivation to be studying this specific
00:00:51
branch of research in the second part
00:00:54
which is the main meat of the story
00:00:56
we'll go through both history of neural
00:00:59
nets their properties the way they are
00:01:00
defined and their inner workings mostly
00:01:03
in order to gather good and deep
00:01:06
intuition about them so that you are
00:01:09
both prepared for future more technical
00:01:11
more in-depth lectures in this series as
00:01:14
well as ready to simply work with these
00:01:16
models in your own research or work
00:01:19
equipped with these we'll be able to
00:01:22
dive into learning itself since even the
00:01:24
best model is useless without actually
00:01:27
knowing how to set all the knobs and
00:01:29
weights inside them after that we'll
00:01:32
fill in a few gaps in terms of these
00:01:34
pieces of the puzzles of the puzzle that
00:01:37
will be building as the talk progresses
00:01:40
and we'll finish with some practical
00:01:42
issues or practical guidelines to
00:01:46
actually deal with most common problems
00:01:48
in training neural nets if time permits
00:01:51
we'll also have a bonus slide or two on
00:01:54
what I like to call multiplicative
00:01:56
interactions so this is what will be in
00:02:02
the lecture there quite a few things
00:02:03
that could be included in a lecture
00:02:06
called foundations of neural nets
00:02:09
that's not going to be part of this talk
00:02:11
and I'd like to split these into three
00:02:14
branches first is what I refer to as
00:02:17
old-school neural nets not to suggest
00:02:20
that the neural nets that we are working
00:02:21
with these days are not old or they
00:02:24
don't go back like 70 years but rather
00:02:26
to
00:02:27
make sure that you see that this is a
00:02:29
really wide field with quite powerful
00:02:32
methods that were once really common and
00:02:34
important for the field that are not
00:02:37
that popular anymore but it's still
00:02:39
possible that they will come back and
00:02:41
it's valuable to learn about things like
00:02:43
restricted Boltzmann machines deep
00:02:45
belief networks hopfield net kohonen
00:02:46
maps so I'm just gonna leave these a
00:02:48
sort of keywords for further reading if
00:02:51
you want to really dive deep the second
00:02:55
part is biologically plausible
00:02:57
neural nets where the goal really is to
00:03:00
replicate the inner workings of human
00:03:02
brain so you have physical simulators or
00:03:04
spiking neural nets and these two
00:03:06
branches encoded in red here will not
00:03:10
share that much common ground with the
00:03:14
talk right they are still in neural
00:03:16
network land but they don't necessarily
00:03:17
follow the same the same design
00:03:19
principles on the other hand the third
00:03:21
branch called other because of lack of
00:03:25
better name do share a lot of
00:03:28
similarities and even though we won't
00:03:29
explicitly talk about capsule networks
00:03:31
graph networks or neural differential
00:03:34
equations what you'll learn today the
00:03:36
high level ideas motivations and overall
00:03:39
scheme directly applies to all of these
00:03:41
they simply are somewhat beyond the
00:03:45
scope of the series and the ones in
00:03:47
green like convolutional neural networks
00:03:49
recurrent neural networks are simply not
00:03:51
part of this lecture but will come in
00:03:53
weeks to come for example in Sander's talk
00:03:55
and others yep so why are we learning
00:04:00
about neural Nets quite a few examples
00:04:03
were already given a week ago I just
00:04:05
want to stress a few the first one being
00:04:08
computer vision in general most of the
00:04:10
modern solutions applications in
00:04:14
computer vision do use some form of
00:04:16
neural network based processing these
00:04:19
are not just hypothetical objects you
00:04:22
know things that are great for
00:04:23
mathematical analysis or for research
00:04:24
purposes
00:04:25
there are many actual commercial
00:04:27
applications products that use neural
00:04:30
networks on daily basis pretty much in
00:04:32
every smartphone you can find at least
00:04:34
one neural net these days the second one
00:04:37
is natural language processing
00:04:40
text synthesis
00:04:41
think of great recent results from OpenAI
00:04:43
and their GPT-2 model as well as
00:04:46
commercial results with building wavenet
00:04:49
based text generation into a Google
00:04:52
assistant if you own one finally
00:04:55
control that doesn't just allow us to
00:04:59
create AI for things like Go chess
00:05:03
Starcraft for games or simulations in
00:05:05
general but it's actually being used in
00:05:08
products like self-driving cars so what
00:05:11
made all this possible what started the
00:05:15
deep learning revolution what are the
00:05:16
fundamentals that neural networks
00:05:19
really benefited from
00:05:21
the first one is compute and I want to
00:05:25
make sure that you understand that there
00:05:27
are two sides to this story it's not
00:05:30
just that computers got faster they were
00:05:32
always getting faster what specifically
00:05:35
happened in recent years is that
00:05:38
specific kind of hardware compute namely
00:05:40
GPUs graphical processing units that
00:05:42
were designed for games got really
00:05:45
useful for machine learning right so
00:05:47
this is the first thing on one hand we
00:05:49
got hardware that's just much faster
00:05:51
but it's not generally faster it won't
00:05:53
make your sequential programs run faster
00:05:56
it is faster with respect to very
00:05:58
specific operations and neural networks
00:06:01
happen to use exactly these operations
00:06:03
we will reinforce this point in further
00:06:06
part of the lecture but think about
00:06:08
matrix multiplications as this core
00:06:10
element other machine learning
00:06:12
techniques that don't rely on matrix
00:06:14
multiplications would not benefit that
00:06:16
much from this exponential growth in
00:06:18
compute that came from GPUs and these
00:06:20
days from TPUs the second is data and the
00:06:23
same argument applies you have various
00:06:26
methods of learning some of which scale
00:06:29
very well with data some scale badly
00:06:32
your computational complexity goes through
00:06:33
the roof you don't benefit from pushing
00:06:35
more and more data so we have again two
00:06:38
phases to this story a there's much more
00:06:40
data available just because of the Internet
00:06:43
the Internet of Things and various other
00:06:44
things and on the other end you have
00:06:47
models that really are data hungry and
00:06:50
actually improve as amount of data
00:06:52
increases and finally
00:06:55
and finally the modularity of the system
00:06:59
itself the fact that deep learning is
00:07:01
not that well defined field of study
00:07:04
it's more of a mental picture of the
00:07:07
high-level idea of these modular blocks
00:07:10
that can be arranged in various ways and
00:07:13
I want to try to sell to you intuition
00:07:16
of viewing deep learning as this sort of
00:07:18
puzzle where all we are doing as
00:07:20
researchers is building these small
00:07:23
blocks that can be interconnected in
00:07:26
various ways so that they jointly
00:07:29
process data to use quotes from recent
00:07:35
Turing award winner Professor Yann
00:07:37
LeCun deep learning is constructing
00:07:39
networks of parameterized functional
00:07:41
modules and training them from examples
00:07:43
using gradient based optimization
00:07:46
there's this core idea that we're
00:07:48
working with is an extremely modular
00:07:51
system we are not just defining one
00:07:53
model that we are trying to you know
00:07:55
apply to various domains we're defining a
00:07:57
language to build them that relies on
00:08:00
very simple basic principles and these
00:08:04
basic principles in a single such a node
00:08:06
or piece of a puzzle is really these two
00:08:09
properties each of them needs to know
00:08:12
given an input given data what to
00:08:15
output there are simple computational
00:08:17
units that compute one thing they take
00:08:19
an average they multiply things they
00:08:22
exponentiate them things like this and they
00:08:25
have also the other mode of operations
00:08:28
if they knew how the output of their
00:08:32
computation should change they should be
00:08:34
able to tell their input how to change
00:08:37
accordingly right so if I tell this node
00:08:39
your output should be higher it should
00:08:41
know how to change inputs if you are a
00:08:44
more mathematically oriented you'll
00:08:46
quickly find the analogy to
00:08:48
differentiation right then this is
00:08:50
pretty much the underlying mathematical
00:08:52
assumption for this
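A minimal Python sketch of the two modes of operation described above, assuming only that a node is some differentiable function; the class and values are illustrative, not from the lecture:

```python
import numpy as np

class ScaleNode:
    """Toy puzzle piece: computes y = c * x and can pass change signals back."""
    def __init__(self, c):
        self.c = c

    def forward(self, x):
        # mode 1: given an input, produce an output
        return self.c * x

    def backward(self, grad_output):
        # mode 2: told how the output should change, report how the
        # input should change (here just the chain rule for y = c * x)
        return self.c * grad_output

node = ScaleNode(3.0)
print(node.forward(np.array([1.0, 2.0])))    # [3. 6.]
print(node.backward(np.array([1.0, 1.0])))   # [3. 3.]
```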
00:08:54
so we will usually work with differentiable
00:08:57
objects and this is also how Professor Yann
00:08:59
LeCun qualifies it this is not necessarily
00:09:01
a strict requirement that you will see
00:09:03
through this lecture that in practice
00:09:05
people will put things that are kind of
00:09:07
differentiable and
00:09:09
as a practitioner you know that people just
00:09:10
put everything into deep nets not
00:09:13
necessarily caring about the full
00:09:14
mathematical reasoning behind it
00:09:17
so given this really high-level view let
00:09:21
us go for some fundamentals and as usual
00:09:25
let's start with some biological
00:09:26
intuition so in every neural network
00:09:30
lecture you need to see a neuron so I
00:09:32
drew one for you that's I know over
00:09:35
simplified so if you have biological
00:09:37
background forgive me for being very
00:09:39
naive but I wanted to capture just very
00:09:41
basic properties and so people have been
00:09:44
studying in neurobiology how real
00:09:46
neurons look like and one really
00:09:48
high-level view is that these are just
00:09:49
small cells that have multiple
00:09:52
dendrites which are inputs from other
00:09:55
neurons through which they accumulate
00:09:58
their spikes their activity there is
00:10:01
some simple computation in the soma in
00:10:05
the cell body and then there is a single
00:10:07
axon where the output is being produced
00:10:11
human brain is composed of billions of
00:10:14
these things connected to many many
00:10:15
others and you can see that this kind of
00:10:18
looks like a complex distributed
00:10:20
computation system right we have these
00:10:23
neurons connected to many others each of
00:10:25
them represents a very simple
00:10:27
computation on its own and what people
00:10:30
notice is that some of these connections
00:10:33
inhibit your activity so if the other
00:10:35
neuron is active you aren't some excite
00:10:37
so if they are active you are excited as
00:10:40
well and of course that many other
00:10:41
properties like there's a state right
00:10:43
these cells live through time the output
00:10:46
spikes through time each time you'll see a
00:10:48
slide like this with yellow box this is
00:10:51
a reference to further reading I won't
00:10:54
go into too much depth on various topics
00:10:58
but if you want to read more these are
00:11:01
nice references for example the
00:11:03
Hodgkin-Huxley model is a nice read if
00:11:07
you are somewhere between neurobiology
00:11:09
and mathematics so this is an intuition
00:11:12
and just intuition what people did with
00:11:15
that and by people I mean McCulloch and
00:11:18
Pitts is to look at this
00:11:21
and ask themselves what seems to be the
00:11:24
main set of neurophysiological
00:11:26
observations that we need to replicate
00:11:28
it's important to stress that this is
00:11:30
not a model the model that they proposed
00:11:32
that was trying to replicate all of the
00:11:34
dynamics this is not you know an
00:11:37
artificial simulation of a neuron is
00:11:39
just something vaguely inspired by some
00:11:42
properties of real neurons and these
00:11:45
properties are selected in green there
00:11:48
are things that will be easy to compose
00:11:50
because you will see the point of this in a
00:11:52
second you have blue inputs that are multiplied
00:11:56
by some weights W there are just real
00:12:02
numbers attached to each input and then
00:12:04
they are summed so it's like a weighted
00:12:07
sum of your inputs and we also have a
00:12:09
parameter weight B also referred to as a
00:12:13
bias which added gives the output of
00:12:17
our neuron you can see that this is
00:12:18
something you could easily compose there
00:12:20
are real numbers as inputs real numbers
00:12:22
as outputs it represents simple
00:12:24
computation
00:12:25
well it's literally a weighted average you
00:12:27
can't get much more basic than that maybe
00:12:29
slightly more basic it also has this
00:12:31
property of inhibiting or exciting if W
00:12:34
is negative you inhibit if it's positive
00:12:36
you excite but they left out quite a few
00:12:40
properties for example this is a
00:12:42
stateless model right you can compute it
00:12:44
many many times and the output is
00:12:45
exactly the same if you were to take a
00:12:47
real neuron and put the same action
00:12:50
potentials in it's spiking might change
00:12:52
through time because it is a living
00:12:54
thing that has a state physical state
00:12:56
and also outputs real values rather than
00:12:58
spikes through time just because the
00:13:00
time dimension got completely removed
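As a rough numpy sketch (illustrative numbers, not from the slides), the McCulloch-Pitts style unit just described is simply a weighted sum of real-valued inputs plus a bias:

```python
import numpy as np

def neuron(x, w, b):
    """Single artificial neuron: dot(w, x) + b, stateless and real-valued."""
    return np.dot(w, x) + b

x = np.array([0.5, -1.0, 2.0])   # inputs arriving from other neurons
w = np.array([1.0, -2.0, 0.5])   # negative weights inhibit, positive excite
b = 0.1                          # bias
print(neuron(x, w, b))           # 3.6
```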
00:13:02
and also to set some notation each time
00:13:05
something is blue in equations this
00:13:07
means this is an input if something is
00:13:08
red this is a parameter weight something
00:13:10
that you would usually train in your in
00:13:14
your model and the same applies to the
00:13:15
schemes themselves so what it means
00:13:22
intuitively what the weighted average
00:13:24
does at least my personal favourite
00:13:27
intuition is that it defines a
00:13:32
linear or affine projection
00:13:34
of your data so you can imagine that
00:13:36
this horizontal line is one such neuron
00:13:38
we've W that is just zero one and B
00:13:43
equals zero and then what will happen is
00:13:45
all the data that you have so your axis
00:13:46
would be perpendicularly projected onto
00:13:51
this line and you'd get this mess
00:13:53
everything would be on top of each other
00:13:55
if you had a different W say the sorry
00:13:58
vertical line before it was horizontal
00:14:00
the vertical line then there would be
00:14:02
nicely separated groups right because
00:14:04
you just collapsed them if it was a
00:14:06
diagonal line then things would be
00:14:08
slightly slightly separated as at the
00:14:12
bottom part of this so when we define
00:14:18
something like this we can start
00:14:20
composing them and the most natural or
00:14:23
the first mode of composition is to make
00:14:26
a layer out of these neurons so you can
00:14:29
see the idea is just to take each such
00:14:31
neuron put them next to each other and
00:14:33
what we gain from this mostly we gain a
00:14:37
lot of efficiency in terms of computing
00:14:39
because now the equation simplifies to a
00:14:42
simple affine transformation with W
00:14:44
being a matrix of the weights that are
00:14:47
in between our inputs and outputs X
00:14:51
being a vectorized input right so just
00:14:53
gather all the inputs for that as a
00:14:55
vector why is it important well
00:14:57
multiplication of two matrices in a
00:15:00
naive fashion is cubic but you probably
00:15:02
know from Algorithms 101 or 102
00:15:05
or whatever that you can go down like
00:15:07
2.7 by being slightly smart about how
00:15:10
you multiply things by basically using a
00:15:11
divide-and-conquer kind of methods and
00:15:13
furthermore this is something that fits
00:15:16
the GPU paradigm extremely well right so
00:15:19
this is one of these things that just
00:15:21
matches exactly what was already there
00:15:22
hardware wise and as such could benefit
00:15:25
from this huge boost in compute
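A minimal sketch of the point above, with made-up sizes: stacking units side by side turns the layer into one affine map y = Wx + b, exactly the kind of matrix multiplication GPUs are fast at:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 8            # illustrative sizes

W = rng.normal(size=(n_out, n_in))      # one row of weights per output unit
b = np.zeros(n_out)                     # one bias per output unit
X = rng.normal(size=(batch, n_in))      # a batch of vectorized inputs

Y = X @ W.T + b                         # every unit, every sample, one matmul
print(Y.shape)                          # (8, 3)
```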
00:15:28
there's also a lot of small caveats in the
00:15:30
neural network land in terms of
00:15:32
naming conventions so each object will
00:15:34
have from one to five names and I'm
00:15:37
deeply sorry for us as a community for
00:15:39
doing this the main reason is many of
00:15:42
these things were independently
00:15:43
developed by various groups of
00:15:45
researchers and
00:15:46
unification never happened some of these
00:15:49
names are more common than others for
00:15:50
example this is usually called linear
00:15:53
layer even though mathematician would
00:15:55
probably cry and say no it's fine it's
00:15:57
not linear there's a bias this doesn't
00:15:59
satisfy seniority constraints neurons
00:16:01
will be often called units so if I say
00:16:03
unit or neuron I just use these
00:16:06
interchangeably and parameters and
00:16:08
weights are also the same object so you
00:16:14
might ask isn't this just linear
00:16:15
regression like equation looks exactly
00:16:17
like statistics around the world as in
00:16:20
your regression model and to some extent
00:16:23
yes you're right it is exactly the same
00:16:25
predictive model but what's important is
00:16:27
to have in mind our big picture yes we
00:16:29
start small but our end goal is to
00:16:32
produce these highly composable
00:16:34
functions and if you are happy with
00:16:37
composing many linear regression models
00:16:38
on top of each other especially
00:16:39
multinomial regression models then you
00:16:42
can view it like this the language that
00:16:44
the neural network community prefers
00:16:47
is to think about these as neurons or
00:16:49
collections of neurons that talk to each
00:16:51
other because this is our end goal yes
00:16:53
this very beginning of our puzzle could
00:16:56
be something that's known in literature
00:16:58
under different names but what's really
00:17:00
important is that we view them as this
00:17:02
single composable pieces that can be
00:17:05
arranged in any way and much of research
00:17:07
is about composing them in a smart way
00:17:10
so that you get a new quality out of it
00:17:14
but let's view these simple models as
00:17:17
neural networks first so we'll start with
00:17:19
a single layer neural network just so we
00:17:22
can gradually see what is being brought
00:17:25
to the table with each extension with
00:17:27
each added module so what we defined
00:17:30
right now is what we're gonna define
00:17:33
right now can be expressed more or less
00:17:35
like this we have data it will go
00:17:37
through the linear module then there
00:17:39
will be some extra node that we are
00:17:40
going to define
00:17:41
then there is gonna
00:17:43
be a loss
00:17:44
which is also gonna be connected to a
00:17:47
target we are missing these two so let's
00:17:50
define what can be used there and let's
00:17:53
start with the first one which is often
00:17:55
called an activation function or a
00:17:57
non-linearity this is
00:18:00
an object that is usually used to induce
00:18:04
more complex models if you had many
00:18:07
linear models many affine models and you
00:18:09
compose them it's very easy to prove
00:18:11
composition of linear is linear
00:18:13
composition of affine things is affine
00:18:15
you would not really bring anything to
00:18:17
the table you need to add something that
00:18:19
bends the space in a more funky way so
00:18:23
one way of doing this or historically
00:18:24
one of the first ones is to use sigmoid
00:18:27
activation function which you can view
00:18:29
as a squashing of a real line to the
00:18:32
zero one interval we often will refer to
00:18:34
things like this as producing
00:18:36
probability estimates or probability
00:18:38
distributions and while there exists a
00:18:41
probabilistic interpretation of this
00:18:42
sort of model what this usually means in
00:18:44
ml community is that it simply outputs
00:18:46
things between 0 & 1 or that they sum to 1
00:18:49
okay so now let's not be too strict when
00:18:51
we say probability estimate here it
00:18:54
might mean something as simple as being
00:18:55
in the correct range the nice thing is
00:18:58
it also has very simple derivatives just
00:19:00
to refer to this differentiability that
00:19:02
we're talking about here
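A small sketch of the sigmoid and its derivative (standard definitions assumed, not code from the lecture), which also previews the saturation issue discussed next:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # simple closed-form derivative

for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid(z), sigmoid_grad(z))
# at -10 and +10 the derivative is ~0.000045: saturated, almost no signal
# at 0 the derivative is 0.25: the informative region
```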
00:19:04
but there are also caveats that make it slightly less
00:19:07
useful as you will see in the grand
00:19:09
scheme of things one is that because it
00:19:11
saturates right as you go to plus or
00:19:12
minus infinity it approaches 1 or 0
00:19:15
respectively this means that the partial
00:19:18
derivatives vanish right the gradient
00:19:20
far far to the right will be pretty much
00:19:23
zero because your function is flat so
00:19:25
the gradient is pretty much zero the
00:19:26
same applies in minus infinity so once
00:19:28
you are in this specific point if you
00:19:31
view gradient magnitude as amount of
00:19:33
information that you are getting to
00:19:35
adjust your model then functions like
00:19:38
this won't work that well once you
00:19:40
saturate you won't be taught how to
00:19:42
adjust your weights anymore but this is what
00:19:48
we are gonna use at least initially so
00:19:50
we plug in sigmoid on top of our linear
00:19:52
model and the only thing we are missing
00:19:54
is a loss and the most commonly used one
00:19:58
for the simplest possible task which is
00:20:00
going to be binary classification
00:20:01
meaning that our targets are either 0 or
00:20:05
1 something is either false or true
00:20:07
something is a face or not something is
00:20:10
a dog or not just this sort of products
00:20:13
then the most common loss function which
00:20:16
should be a two argument function that
00:20:19
returns a scalar so it accepts in this
00:20:22
notation P our prediction T our target
00:20:25
and it's supposed to output a single
00:20:27
scalar a real value such that smaller
00:20:30
loss means better model being closer to
00:20:34
the correct prediction and cross-entropy
00:20:37
which has at least three names being
00:20:39
negative log likelihood logistic loss
00:20:41
and probably many others
00:20:48
gives us the negation of the logarithm of the
00:20:51
probability of correct classification
00:20:53
which is exactly what you care about in
00:20:56
classification at least usually
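A hedged sketch (illustrative numbers) of the binary cross-entropy just described: minus the log of the probability assigned to the correct class, averaged over samples; the clip is a naive guard against the log(0) instability mentioned below:

```python
import numpy as np

def binary_cross_entropy(p, t, eps=1e-12):
    """p: predicted probability of class 1, t: target in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)       # crude protection against log(0)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

p = np.array([0.9, 0.2, 0.7])            # sigmoid outputs
t = np.array([1.0, 0.0, 1.0])            # targets
print(binary_cross_entropy(p, t))        # ~0.228
```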
00:20:58
it's also nicely composable with the sigmoid
00:21:01
function which we will go back to towards the
00:21:04
end of the lecture showing how this
00:21:06
specific composition removes two
00:21:08
numerical instabilities at once because
00:21:11
on its own unfortunately it is quite
00:21:13
numerically unstable so given these
00:21:18
three things we can compose them and
00:21:20
have the simplest possible neural
00:21:22
classifier we have data it goes through
00:21:24
linear model goes for sigmoid goes for
00:21:25
cross-entropy
00:21:26
attaches targets this is what you would
00:21:29
know from statistics as a logistic
00:21:31
regression and again the fact that we
00:21:33
we are defining a well-known model from
00:21:35
a different branch of science is fine
00:21:37
because we won't stop here this is just
00:21:40
to gain intuition what we can already
00:21:41
achieve what we can already achieve in
00:21:43
practice is we can separate data that's
00:21:46
labeled with well two possible labelings
00:21:49
true or false
00:21:50
zero and one as long as you can put a
00:21:53
line or a hyperplane in a hyper in a
00:21:56
higher dimension that completely
00:21:57
separates these two datasets so in the
00:22:00
example you see red and blue you see
00:22:03
that the more vertical line can separate
00:22:07
these datasets pretty perfectly and it
00:22:10
will have a very low loss very low cross
00:22:13
entropy loss the important property of
00:22:15
the specific loss and I would say 95% of
00:22:21
all the losses in machine learning is
00:22:23
that they are additive with respect to
00:22:26
samples so the loss that you can see at the
00:22:28
lower end decomposes additively over
00:22:33
samples so there is a small function l
00:22:35
that we just defined over each sample and
00:22:38
now t with i in the superscript is the
00:22:41
i-th target can be expressed as a sum of
00:22:44
these this specific property relates to
00:22:48
the data aspect of deep learning
00:22:50
revolution losses that have this form
00:22:55
undergo very specific decomposition and
00:22:58
can be trained with what is going to be
00:23:00
introduced a stochastic gradient descent
00:23:01
and can simply scale very well with big
00:23:05
datasets and unfortunately as we just
00:23:10
discussed this is still slightly
00:23:13
numerically unstable so what happens
00:23:16
when we have more than one sorry more
00:23:17
than two classes then we usually define
00:23:20
what's called the softmax which is as a
00:23:22
name suggests a smooth version of the
00:23:25
maximum operation you take an exponent
00:23:26
of your input and just normalize divide
00:23:30
by the sum of exponents you can see this
00:23:33
was sum to one everything is
00:23:35
non-negative because well exponents by
00:23:36
definition I know not negative so we
00:23:39
produce probability estimates in the
00:23:41
sense that the output lies on the
00:23:43
simplex and it can be seen as a strict
00:23:46
multi-dimensional generalization of the
00:23:48
sigmoid so it's not a different thing it's
00:23:50
just a strict generalization if you take
00:23:52
a single x and a 0 and compute the softmax
00:23:55
of it then the first argument of the
00:23:58
output will be sigmoid of x and the second
00:24:00
one minus sigmoid of x all right so it's simply
00:24:03
a way to go beyond two classes but have
00:24:06
very very similar mathematical
00:24:08
formulation and it's by far the most
00:24:11
commonly used final activation in
00:24:13
classification problems when number of
00:24:15
classes is bigger than two it still
00:24:19
has the same issues for obvious reasons
00:24:21
it's a generalization so it cannot remove
00:24:22
issues but the nice thing is now we can
00:24:25
just substitute the piece of the puzzle
00:24:28
that we defined before take away the
00:24:29
sigmoid
00:24:30
now just put softmax in its place and
00:24:32
exactly the same reasoning and mechanics
00:24:36
that would work before apply now right
00:24:39
so we use exactly the same loss function
00:24:41
apart from the fact that it's summing over
00:24:44
all the classes and now we can separate
00:24:46
still linearly of course more than two
00:24:50
colors say class zero one and two
00:24:54
which is equivalent to multinomial
00:24:57
logistic regression if you went for some
00:24:59
statistical courses and the combination
00:25:03
of the softmax and the cross entropy as
00:25:05
I mentioned before becomes numerically
00:25:08
stable because of this specific
00:25:12
decomposition and there will be also a
00:25:14
more in-depth version towards the end of
00:25:16
this lecture
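An illustrative sketch of softmax and of the joint softmax-plus-cross-entropy computation via the usual log-sum-exp shift; this is one standard way the numerical instabilities mentioned here get removed, not necessarily the exact derivation shown later in the lecture:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)        # shift for stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_cross_entropy(logits, target_index):
    # -log softmax(logits)[target] without forming tiny probabilities
    z = logits - np.max(logits)
    return np.log(np.sum(np.exp(z))) - z[target_index]

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits))                                # non-negative, sums to 1
print(softmax_cross_entropy(logits, target_index=0))  # ~0.241
```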
00:25:19
the only thing that it doesn't do very well is
00:25:21
that it doesn't scale that well with the number of classes all
00:25:23
right so one thing that you might want
00:25:24
is to be able to select one class
00:25:26
specifically just say one just say zero
00:25:29
and of course with equation like soft
00:25:31
max has you can't represent ones or
00:25:34
zeros you can get arbitrarily close but
00:25:36
never exactly one or zero and there are
00:25:38
nice other solutions to this like sparse
00:25:41
max module for example and also it
00:25:45
doesn't scale that well with K it will
00:25:47
work well if K number of classes is say
00:25:49
in hundreds if it's in hundreds of
00:25:52
thousands you might need to look for
00:25:54
some slightly different piece of the
00:25:56
puzzle the nice news is you can
00:25:58
literally just swap them and they will
00:26:00
start scaling up so why are we even
00:26:03
talking about these simple things so
00:26:04
apart from the fact that they become
00:26:06
pieces of the bigger puzzle it's also
00:26:08
because they just work and you might be
00:26:10
surprised that the linear models are
00:26:12
useful but they really are if you look
00:26:15
at this very well-known MNIST dataset
00:26:17
of just handwritten digits and try to
00:26:19
build a linear model that classifies
00:26:21
which digit it is based on pixels you
00:26:24
might get slightly surprising result of
00:26:26
somewhat around 92 percent of test
00:26:28
accuracy that's pretty good for
00:26:30
something that just takes you know
00:26:31
pixels and computes a weighted average
00:26:33
and that's all it does and in one of the
00:26:37
intuitions behind it is we usually keep
00:26:40
thinking about these models in like 1d
00:26:43
2d 3d and yes in 2d there are not that
00:26:45
many positions of objects that a line
00:26:48
can shatter in 3d not that many positions where a
00:26:50
hyperplane can separate in 100,000
00:26:53
dimensions hyperplanes of
00:26:58
corresponding size can actually shatter
00:27:01
a lot of possible labelings so as you
00:27:05
get higher dimensionality you can
00:27:07
actually deal with them pretty well even
00:27:10
within your models furthermore in
00:27:12
commercial applications a big chunk of
00:27:14
them actually use linear models in
00:27:16
natural language processing for many
00:27:17
years the most successful model was
00:27:19
nothing else but MaxEnt the maximum entropy
00:27:22
classifier which is a fourth name for
00:27:25
logistic regression so why don't we stop
00:27:28
here right we could stop the lecture
00:27:30
here but obviously we are interested in
00:27:31
something slightly more complex like AI
00:27:34
for chess or for Go and for this we know
00:27:37
that the linear model I mean we know
00:27:39
empirically linear models are just not
00:27:42
powerful enough but before we go that
00:27:46
far ahead maybe let's focus on something
00:27:49
that's the simplest thing that linear
00:27:51
models cannot do and it's going to be a
00:27:53
very well-known XOR problem where we
00:27:56
have two dimensional data sets and on
00:27:58
the diagonal one class on the other
00:28:00
diagonal the second class you can
00:28:02
quickly iterate in your head over all
00:28:04
possible lines right not a single line
00:28:06
has red dots on one side blue on the
00:28:08
other elbow we need something more
00:28:10
powerful so our solution is going to be
00:28:13
to introduce a hitter later so now we're
00:28:15
going to look into two layer neural
00:28:17
networks that in our puzzle view look
00:28:21
like this we have a theta goes to linear
00:28:23
go through sigmoid goes for another
00:28:24
linear goes to soft max cross-entropy
00:28:27
target as you can see we already have
00:28:29
all the pieces we just well we are just
00:28:31
connecting them differently that's all
00:28:33
we are doing and I want to now convince
00:28:35
you that we're adding qualitatively more
00:28:39
than just you know adding dimensions or
00:28:42
something like this so let's start with
00:28:44
the potential solution how can we solve
00:28:46
this if we had just two hidden neurons
00:28:49
and a sigmoid activation function so we
00:28:52
have our data set and for simplicity of
00:28:54
visualization I'm going to recolor them
00:28:56
so that we have four different colors
00:28:59
we have blue red green and pink just so
00:29:02
you see where the projections end up
00:29:04
just remember that we want to separate
00:29:06
one diagonal from the other and that two
00:29:09
hidden neurons are going to be these two
00:29:11
projection lines so the top one is
00:29:14
oriented downwards which means that
00:29:17
we're going to be projecting in such a
00:29:19
way that the blue class will end up on
00:29:21
the right hand side pink on the left
00:29:24
green and red in the middle so somehow I
00:29:29
miss order these two sides so this is
00:29:32
how it's going to look like if you look
00:29:34
at the right hand side you have a
00:29:36
projection of this top line right blue
00:29:38
on the right because everything is
00:29:39
flipped sorry I should have grabbed I
00:29:43
guess ping on the left green and red
00:29:46
compost on top of each other the second
00:29:49
line is pretty symmetrically oriented
00:29:52
and there you can see blue data set or
00:29:55
blue blob projected on the left hand
00:29:57
side pink projected on the right and
00:29:59
green and red again superimposed on each
00:30:03
other right this is all we did through
00:30:05
two lines and just projected everything
00:30:08
onto them these are the weights and
00:30:10
biases at the bottom that would
00:30:13
corresponds to this projection now we
00:30:16
add sigmoid all that Sigma it does is it
00:30:19
squashes right instead of being a
00:30:21
identifing it nonlinear discourses so we
00:30:25
squash these two plots on the sides and
00:30:29
recompose them as a two dimensional
00:30:30
object right we have now on x-axis the
00:30:34
sum x axis we have the first projection
00:30:38
just for sigmoid and this is why it
00:30:40
became extreme the blue things ended up
00:30:43
being very sickly in in one and
00:30:45
everything else went to zero maybe
00:30:48
slightly boomerang G here and the second
00:30:51
neuron this projection after squashing
00:30:54
for Sigma it became y-axis you can see
00:30:56
now pink one got separated everything
00:30:58
else got boomerang ly squashed the nice
00:31:02
thing about this maybe doesn't look that
00:31:04
nice but what it allows us to do is now
00:31:07
draw a line that's going to separate all
00:31:09
the blue and pink things from
00:31:11
everything else and this was our goal
00:31:13
right so if I now project on this line
00:31:15
or equivalently if I were to put the
00:31:19
decision boundary here it would separate
00:31:21
exactly what I wanted right so the Blues
00:31:24
and things were supposed to be one class
00:31:26
and the remaining two colors were
00:31:28
supposed to be the other so I can just
00:31:30
project them put the boundary and if you
00:31:34
now look into the input space we ended
00:31:36
up with this discontinuous
00:31:37
classification or the chasm of sorts in
00:31:41
the middle we came one class and a
00:31:44
reminder became the other right just
00:31:47
going through the internals layer by
00:31:48
layer how the neural network with a
00:31:50
single hidden layer would operate all it
00:31:53
really did was to use this hidden layer
00:31:57
to rotate and then slightly bent the
00:32:02
input space right of the signal you can
00:32:04
think about this as kind of bending or
00:32:06
squishing which as a topological
00:32:09
transformation allows the purely linear
00:32:12
model on top of it to solve the original
00:32:14
problem right so it prepared pre-process
00:32:17
the data such that it became linearly
00:32:20
separable and you just needed two hidden
00:32:22
neurons to do this even though the
00:32:25
problem was not that complex it is a
00:32:27
qualitative change in what we are in
00:32:30
what we can do so what if something is
00:32:36
slightly more complex let's imagine we
00:32:38
want to separate a circle from a
00:32:39
doughnut then tuners won't be enough you
00:32:42
can prove it's not enough then that are
00:32:44
just too complex but six neurons are
00:32:46
doing just fine and at this point I
00:32:48
would like to advertise to you this
00:32:51
great tool by Daniel's Milk of and
00:32:53
others called playgrounds under
00:32:57
playgrounds don't test your folder org
00:32:58
or you can just play with this sort of
00:33:00
simple classification problems you can
00:33:02
pick one of the datasets you can add
00:33:04
hidden layers add neurons at the top you
00:33:07
can select activation function to be
00:33:08
sigmoid to follow what we just talked
00:33:10
about and if you select classification
00:33:12
it will just attach the sigmoid path
00:33:15
cross-entropy as the loss if he'd run
00:33:19
and you get the solution which separates
00:33:23
our data quite nicely
00:33:25
see the loss going down as expected and
00:33:28
arguably this is the easiest and most
00:33:30
important way of learning about you know
00:33:33
that's playing with them actually
00:33:35
interact with them it's really hard to
00:33:37
gain intuitions by just studying their
00:33:40
mathematical properties unless you are a
00:33:43
person with really great imagination I
00:33:46
personally need to fly with stuff to
00:33:48
understand so I'm just trying to share
00:33:50
this sort of lesson that I learned so
00:33:55
what makes it possible for neural nets
00:33:58
to learn arbitrary shapes
00:34:01
I mean arguably donut is not that
00:34:03
complex of a shape but believe me if I
00:34:05
were to draw a dragon
00:34:06
it would also do just fine and the
00:34:09
brilliant result arguably the most
00:34:12
important theoretical results in
00:34:13
neonates is world of C Benko from late
00:34:16
80s where he proved that neural networks
00:34:21
are what he called universe approximator
00:34:23
'he's using slightly more technical
00:34:25
language what it actually means is if
00:34:27
you get if you take any continuous
00:34:29
function from a hypercube right so your
00:34:31
inputs are in between 0 and 1 and have D
00:34:34
dimensions and your function is
00:34:36
continuous so relatively smooth and
00:34:38
output a single scalar value a single
00:34:40
number then there exists neural network
00:34:45
with one hidden layer with sigmoids that
00:34:48
will get an epsilon error at most
00:34:50
epsilon error and this is true for every
00:34:53
positive Epsilon
00:34:54
so you pick an error say one in minus 20
00:34:56
there will exist and you on that satisfy
00:34:59
this constraint you can pick one in
00:35:01
minus 100 and they will exist one that
00:35:04
satisfies it so one could ask what if I
00:35:07
pick epsilon equals zero then answer is
00:35:09
no it can only approximate it cannot
00:35:12
represent so you won't ever be able to
00:35:15
represent most of the continuous
00:35:16
functions but you can get really close
00:35:19
to them at the cost of using potentially
00:35:21
huge exponentially growing models with
00:35:25
respect to input dimensions it shows
00:35:28
that neural networks are really
00:35:29
extremely expressive they can do a lot
00:35:31
of stuff what it doesn't tell us though
00:35:33
is how on earth would we learn them it's
00:35:36
an existential proof right if you
00:35:38
and through proper mathematical
00:35:40
mathematical training you know that
00:35:42
there are two types of proofs right
00:35:43
there either constructive or the
00:35:45
essential arguably reconstructive ones
00:35:46
are more interesting they provide you
00:35:48
with insights how to solve problems the
00:35:50
essential ones are these tricky funky
00:35:53
thing we'll just say it's impossible for
00:35:55
this to be false and this is this kind
00:35:57
of proof that subhankar provided you
00:35:59
just show you just showed that there is
00:36:01
no way for this not to hold there was no
00:36:04
constructive methods of finding weights
00:36:07
of the specific network in this prove
00:36:10
that he right since then we actually had
00:36:12
more constructive versions furthermore
00:36:16
the size can grow exponentially what
00:36:18
subhankar attributed this brilliant
00:36:20
property to was the sigmoid function
00:36:22
that this quashing this smooth beautiful
00:36:25
squashing is what gives you this
00:36:27
generality it wasn't long since
00:36:31
hardening show that actually what
00:36:32
matters is this more of a neural network
00:36:34
structure but you don't need sigmoid
00:36:36
activation function you can actually get
00:36:38
pretty much take pretty much anything as
00:36:40
long as it's not degenerate and what's
00:36:42
he meant by non degenerate is that it's
00:36:45
not constant bounded and continues all
00:36:47
right so you can take a sine wave you
00:36:48
can pretty much get any squiggle as long
00:36:51
as you squiggle at least a bit so things
00:36:53
are non constant and they are bounded so
00:36:56
they cannot go to infinities so it shows
00:36:58
that this extreme potential of
00:37:00
representing or approximating functions
00:37:02
relies on these F and transformations
00:37:04
being stacked on top of each other with
00:37:07
some notion of non-linearity in between
00:37:09
them still without telling you how to
00:37:11
train them is just as in principle the
00:37:13
annual networks that are doing all this
00:37:15
stuff we just don't know how to find
00:37:17
them so to give you some intuition and
00:37:19
to be precise this is going to be an
00:37:21
intuition behind the property not behind
00:37:23
the proof the true proof relies on
00:37:25
showing the displace define been around
00:37:27
that world is a dense set in these set
00:37:30
of continuous functions instead we are
00:37:32
gonna rely on intuition why
00:37:33
approximating functions with sigmoid
00:37:36
based networks should be possible by
00:37:39
proof by picture so let's imagine that
00:37:42
we have this sort of mountain ridge
00:37:45
that's our target function and to our
00:37:48
disposal is only our sigmoid activation
00:37:50
and
00:37:51
they're so of course I can represent
00:37:53
function like this right I'll just take
00:37:55
a positive W and then negative B so it
00:37:58
shifts a bit to the right it doesn't
00:38:00
matter that much because I'm gonna also
00:38:02
get the symmetrical one where W is
00:38:04
negative and B is positive right so I
00:38:06
have two sigmoids
00:38:07
then if we take an average it should
00:38:10
look like a bump and you probably see
00:38:13
where I'm going with this it's gonna
00:38:14
rely on a very similar argument to how
00:38:17
integration works I just want to have
00:38:19
enough bumps so that after adding them
00:38:22
they will correspond to the target
00:38:24
function of interest so let's take three
00:38:27
of them and they just differ in terms of
00:38:30
biases that I've chosen so I'm using six
00:38:33
hidden neurons right two for each bump
00:38:35
and now in the layer that follows the
00:38:39
final classification layer now we
00:38:41
regression layer I'm just gonna mix them
00:38:44
first wait half second one third one and
00:38:47
a half and after adding these three
00:38:49
bumps with weights I end up with the
00:38:51
approximation of the original shape of
00:38:54
course it's not perfect as we just
00:38:56
learned we are never going to be able to
00:38:58
represent functions exactly with
00:39:01
sigmoids but we can get really close
00:39:02
right then this really close the epsilon
00:39:05
is what's missing here I only used 60 10
00:39:08
euros got some error if you want to
00:39:10
squash the error further you just keep
00:39:12
adding bumps now I need a bump here to
00:39:15
resolve this issue
00:39:17
I need a tiny bump somewhere around here
00:39:19
I need a tiny bump here and you just
00:39:21
keep adding and adding and eventually
00:39:24
you'll get as close as you want you
00:39:25
won't ever get it exactly right what is
00:39:28
it gonna go in the right direction so
00:39:30
you can ask okay it's 1d usually things
00:39:33
in one they are just completely
00:39:35
different story then k dimensional case
00:39:37
is there an equivalent construction at
00:39:39
least for 2d and the answer is positive
00:39:42
and you've seen this seven ish slides
00:39:45
before it's this one when we saw a donut
00:39:48
it is nothing but bump in 2d right if
00:39:53
you think about the blue class as a
00:39:55
positive one the one that it's supposed
00:39:56
to get one as the output this is
00:39:59
essentially a 2d bump its saw the
00:40:00
perfect Gaussian right
00:40:02
could do a better job but even with this
00:40:04
sort of bumps we could compose enough of
00:40:07
them to represent any 2d function and
00:40:10
you can see how things starts to grow
00:40:12
exponentially alright we just needed two
00:40:14
neurons to represent bump in 1d now we
00:40:16
need six for 2d and you can imagine that
00:40:18
for KD is gonna be horrible but in
00:40:20
principle possible and this is what
00:40:23
drives this sort of universe
00:40:25
approximation theorem building blocks so
00:40:30
let's finally go deeper since we said
00:40:34
that things are in principle possible in
00:40:37
a shallow land there needs to be
00:40:38
something qualitatively different about
00:40:40
going deeper versus going wider so the
00:40:44
kind of models we are going to be
00:40:46
working with will look more or less like
00:40:47
this there's data those through linear
00:40:49
some node linear nodes in your no linear
00:40:51
node and eventually a loss attached to
00:40:55
our targets what we are missing here is
00:40:58
what is going to be this special node in
00:41:01
between that as advertised before it's
00:41:03
not going to be a sigmoid and the answer
00:41:06
to this is the value unit rectifier
00:41:10
rectified linear units again quite a few
00:41:12
names but essentially what it is is a
00:41:15
point wise maximum between axis between
00:41:18
inputs and a0 all it does is checks
00:41:21
whether the input signal is positive if
00:41:23
so it acts as an identity otherwise it
00:41:27
just flattens it sets it to zero and
00:41:29
that's all why is it interesting well
00:41:32
from say a practical perspective because
00:41:35
it is the most commonly used activation
00:41:38
these days that just works across the
00:41:40
board in a wide variety of practical
00:41:43
application starting from computer
00:41:44
vision and even reinforcement learning
00:41:46
it still introduced and only it still
00:41:49
introduces nonlinear behavior like no
00:41:52
one can claim that this is a linear
00:41:53
function right with the hinge but at the
00:41:55
same time it's kind of linear in the
00:41:57
sense that it's piecewise linear so all
00:42:00
it can do if you were to use it may be
00:42:02
on different layers is to cut your input
00:42:05
space into polyhedra so with the linear
00:42:09
transformations it could cut it into
00:42:11
multiple ones and in each sub subspace
00:42:15
such part it can define an affine
00:42:18
transformation right because they're
00:42:19
just two possibilities and either
00:42:21
identity I'm just cutting you off so in
00:42:24
each of these pieces you have a
00:42:27
hyperplane and in each piece might be a
00:42:30
different hyperplane but the overall
00:42:32
function is really piecewise linear in
00:42:34
1d it would be just a composition of
00:42:37
lines in 2d of planes that are you know
00:42:41
just changing their angles and in KD
00:42:43
well K minus 1 dimensional hyper planes
00:42:45
that are oriented in a funky way the
00:42:49
nice thing is derivatives no longer
00:42:51
vanish there either one when you're in
00:42:53
the positive line our zero otherwise
00:42:56
I mean arguably this was already
00:42:58
vanished before we started the bad thing
00:43:01
is the data neurons can no cure so
00:43:03
imagine that you're all your activities
00:43:05
are negative then going through such
00:43:07
neuron will just be a function
00:43:09
constantly equal to zero which is
00:43:11
completely useless so you need to pay
00:43:14
maybe more attention to the way you
00:43:16
initialize your model and maybe one
00:43:18
extra thing to keep track of to just see
00:43:20
how many dead units you have because it
00:43:22
might be a nice debugging signal if you
00:43:24
did something wrong and also technically
00:43:26
the structure is not differentiable at 0
00:43:29
and the reason why people usually don't
00:43:32
occur is that from probablistic
00:43:35
perspective this is a zero measure set
00:43:36
you will never actually hit zero you
00:43:39
could hand waves and say well the
00:43:41
underlying mathematical model is
00:43:43
actually smooth around zero I just never
00:43:45
hit it so I never care if he wants to
00:43:48
pursue more politically grounded
00:43:50
analysis you can just substitute it with
00:43:52
a smooth version which is logarithm 1
00:43:54
plus minus X this is the dotted line
00:43:56
here that has the same limiting
00:43:58
behaviors but is fully smooth around
00:43:59
zero and you can also just use slightly
00:44:01
different reasoning when you don't talk
00:44:03
about gradients but different objects
00:44:05
we've seen are properties that are just
00:44:07
fine
00:44:08
with single points of non
00:44:09
differentiability so we can now stack
00:44:12
these things together and we have our
00:44:14
typical deep learning model that you
00:44:16
would see in every book on deep learning
00:44:18
linear array linearly and the intuition
00:44:22
behind depths the people had from the
00:44:23
very beginning especially in terms of
00:44:25
computer vision was that each
00:44:27
year we'll be some sort of more and more
00:44:29
abstract feature extraction module so
00:44:32
let's imagine that these are pixels that
00:44:34
come as D as the input then you can
00:44:37
imagine that the first layer will detect
00:44:39
some sort of lines and corners and this
00:44:42
is this is what the what each of the
00:44:45
neurons will represent whether there is
00:44:46
a specific line like horizontal line or
00:44:48
vertical wires of magnets once you have
00:44:51
this sort of representation the next
00:44:52
layer could compose these and represent
00:44:54
shapes like squiggles or something
00:44:57
slightly more complex why do you have
00:44:58
these shapes the next layer could
00:45:00
compose them and represent things like
00:45:01
ears and noses and things like this and
00:45:04
then once you have this sort of
00:45:06
representation maybe you can tell
00:45:08
whether it's a dog or not maybe some
00:45:09
number of years of or existence of ears
00:45:11
in the first place but this is a very
00:45:13
high level intuition and awhile
00:45:14
confirmed in practice this is
00:45:17
necessarily that visible from the mouth
00:45:18
and a really nice result from sorry I
00:45:24
cannot pronounce French but Montu Farman
00:45:28
from Guido and Pascal would show and
00:45:33
Benjy is to show a mathematical
00:45:36
properties that somewhat encode this
00:45:40
high-level intuition and is a provable
00:45:41
statement so one thing is that when we
00:45:44
talked about these linear regions that
00:45:46
are created by Rayleigh networks what
00:45:48
you can show is as you keep adding
00:45:50
layers rather than neurons the number of
00:45:54
trunks in which you are dividing your
00:45:57
input space grows exponentially with
00:45:59
depth and only polynomially with going
00:46:02
wider which shows you that there isn't
00:46:04
simply an enormous reason to go deeper
00:46:08
rather than wider right exponential
00:46:10
growth simply will escape any polynomial
00:46:13
growth sooner or later and with the
00:46:15
scale at which were working these days
00:46:16
it escaped a long time ago the other
00:46:20
thing is if you believe in this high
00:46:23
level idea of learning from savable
00:46:25
times from statistical learning theory
00:46:27
that the principle of learning is to
00:46:30
encounter some underlying structure in
00:46:33
data right we get some training data set
00:46:35
which is some number of samples we build
00:46:37
a model and we expect it to work really
00:46:39
well on the data which comes from the same
00:46:41
distribution but is essentially a
00:46:43
different set how can this be done well
00:46:45
only if you learned if you discovered
00:46:47
some principles behind the data and the
00:46:50
output space and one such or a few such
00:46:53
things can be mathematically defined as
00:46:56
finding regularities symmetries in your
00:47:00
input space and what raelia networks can
00:47:03
be seen as is a method to keep folding
00:47:07
your input space on top of each other
00:47:09
which has two effects one of course if
00:47:14
you keep folding space you have more
00:47:16
when I say fold space I mean that the
00:47:17
points that end up on top of each other
00:47:19
are treated the same so whatever I build
00:47:21
on top of it will have exactly the same
00:47:23
output values for both points that got
00:47:26
folded so you can see why things will
00:47:27
grow exponentially right you fold the
00:47:29
paper once you have two things on top of
00:47:31
each other then four then eight it's
00:47:32
kind of how this proof is built it's
00:47:35
really beautiful I really recommend
00:47:36
reading this paper and beautiful
00:47:38
pictures as well and the second thing is
00:47:41
this is also the way to represent
00:47:42
symmetries if your data if your input
00:47:44
space is symmetric the easiest way to
00:47:47
learn that this symmetry is important is
00:47:49
by folding this space in half if the
00:47:52
symmetries more complex as represented
00:47:54
in this beautiful butterfly ish I don't
00:47:57
know shape you might need to fold in
00:48:00
this extra way so that all the red
00:48:02
points that are quite swirled end up
00:48:06
being mapped onto this one single
00:48:09
slightly curved shape and this gives you
00:48:13
this sort of generalization you discover
00:48:14
the structure if you could of course
00:48:16
learn it that only depth can give you if
00:48:21
you were to build much wider model you
00:48:23
need exponentially many neurons to
00:48:25
represent exactly the same invariance
00:48:27
exactly the same transformation which is
00:48:29
really nice mathematical insight into
00:48:31
this while depth really matters so
00:48:35
people believe this I mean of course
00:48:37
people were using depth before just
00:48:39
because they saw they seen better
00:48:42
results they didn't need necessarily
00:48:43
mathematical explanation for that so
00:48:46
let's focus on this simple model that we
00:48:49
just defined we have three neural
00:48:51
network
00:48:52
sorry three hidden layers in our neural
00:48:54
network linear ReLU linear ReLU and so and
00:48:57
so forth and now we'll go from our
00:49:00
puzzle view that was a nice high level
00:49:03
intuition into something extremely
00:49:04
similar to what's actually used in
00:49:07
pretty much every machine learning
00:49:09
library underneath which is called a
00:49:11
computational graph so it's a graph
00:49:14
which represents this sort of relations
00:49:17
of what talks to what I'm going to use
00:49:20
the same color coding so again blue
00:49:22
things inputs so this is my input X this
00:49:25
is gonna be a target orange is going to
00:49:28
be our loss the Reds are gonna be weight
00:49:32
parameters so the reason why some of you
00:49:38
might have noticed when I was talking
00:49:40
about linear layer I treated both
00:49:42
weights and XS as input to the function
00:49:45
when I was writing f of x, W, b, I was
00:49:49
not really discriminating between
00:49:50
weights and inputs apart from giving
00:49:52
them the color for easier readability is
00:49:55
because in practice it really doesn't
00:49:58
matter there's no difference between a
00:50:00
weight or an input into a node in a
00:50:02
computational graph and this alone gives
00:50:05
you a huge flexibility if you want to do
00:50:07
really funky stuff like maybe weights of
00:50:11
your network are gonna be generated by
00:50:12
another neural network, it fully fits
00:50:16
this paradigm because all you're gonna
00:50:17
do is you're gonna substitute one of
00:50:20
these red boxes that would normally be a
00:50:22
weight with yet another network and it
00:50:24
just fits the same paradigm and we'll go
00:50:27
for some examples in a second to be more
00:50:29
precise we have this graph that
00:50:31
represents computational graph for a
00:50:33
three-layer neural net with ReLUs at a
00:50:36
high level of abstraction, omitting captions
00:50:38
because they are not necessary for this
00:50:40
story they don't have to be linear you
00:50:43
can have side tracks or skip
00:50:45
connections there is nothing stopping
00:50:47
you from saying okay output from this
00:50:49
layer is actually going to be connected to yet another
00:50:51
layer that is also parameterized by
00:50:53
something else
00:50:54
and then they go back and merge maybe
00:50:57
through mean operation concatenation
00:50:59
operation — there are many ways to merge two
00:51:01
signals
00:51:02
there is nothing stopping us from having
00:51:05
many losses and they don't even have to
00:51:07
be at the end of the graph; we might have
00:51:08
a loss attached directly to weights that
00:51:10
will act as the penalty for weights
00:51:12
becoming too large for example or maybe
00:51:14
satisfying some specific constraints: maybe
00:51:16
we want them to be lying on the sphere
00:51:18
and we're going to penalize the model
00:51:20
for not doing so our losses don't even
00:51:24
need to be the last things in the
00:51:25
computational graph you can have a
00:51:26
neural network that has a loss at the
00:51:29
end, and this loss feeds its
00:51:31
value back into later parts of the neural network,
00:51:34
and this is the actual output that you
00:51:36
care about eventually you can also do a
00:51:38
lot of sharing so the same input can be
00:51:41
plugged into multiple parts of your net
00:51:43
in skip connection fashion you can share
00:51:48
weights of your model and sharing
00:51:50
weights in this computational graph
00:51:52
perspective is nothing but connecting
00:51:55
one node to many places. This is an
00:51:58
extremely flexible language that allows
00:52:00
this really modular development and
00:52:02
arguably it actually helped
00:52:05
researchers find new techniques because
00:52:08
the engineering advancement of
00:52:09
computational graph frameworks frees
00:52:12
us from saying oh, weights and inputs
00:52:14
are qualitatively different things;
00:52:16
engineers came and said no, from my
00:52:19
perspective they are exactly the same,
00:52:20
and the research followed: people
00:52:22
started plugging crazy things together
00:52:23
and ended up with really powerful things
00:52:26
like hyper networks so how do we learn
00:52:29
in all these models and the answer is
00:52:31
surprisingly simple you just need basic
00:52:33
linear algebra 101. To just recap: gradients
00:52:36
and jacobians I hope everyone knows what
00:52:39
they are if not in very short words if
00:52:41
we have a function that goes from D
00:52:43
dimensional space to a scalar space like
00:52:45
R, then the gradient is nothing but the
00:52:47
vector of partial derivatives: the i-th
00:52:50
dimension is the partial derivative of
00:52:51
this function with respect to the i-th input.
00:52:54
What's the gradient, at a high level of
00:52:57
abstraction just the direction in which
00:52:58
the function grows the most and minus
00:53:00
gradient is the direction in which it
00:53:02
decreases the most
00:53:04
The Jacobian is nothing but the K-dimensional
00:53:06
generalization: if you have K outputs,
00:53:08
it is a matrix where entry (i, j) is the partial
00:53:11
derivative of the i-th output with respect to
00:53:14
the j-th input. Nothing else, a very basic thing.
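As a minimal NumPy sketch of these two objects (the functions, matrix and numbers below are made up purely for illustration, with a finite-difference check):

import numpy as np

# f: R^d -> R, f(x) = sum(x**2); its gradient is the vector 2*x.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2.0 * x

# g: R^d -> R^k, g(x) = A @ x; its Jacobian is simply the matrix A,
# since entry (i, j) is d g_i / d x_j.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

def g(x):
    return A @ x

x = np.array([0.5, -1.0, 2.0])
eps = 1e-6

# Finite-difference check of the gradient of f.
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(fd_grad, grad_f(x), atol=1e-4)

# Finite-difference check of the Jacobian of g (a 2x3 matrix here).
fd_jac = np.stack([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                   for e in np.eye(3)], axis=1)
assert np.allclose(fd_jac, A, atol=1e-4)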
00:53:17
the nice thing about these things is
00:53:19
they can be analytically computed for
00:53:21
many of the functions that we use, and
00:53:24
then the gradient descent technique that
00:53:26
is numerical methods 101 so surprisingly
00:53:30
deep learning uses a lot of very basic
00:53:33
components but from across the board of
00:53:36
mathematics and just composes it in a
00:53:38
very nice way an idea behind gradient
00:53:40
descent is extremely simple we can view
00:53:43
this as a sort of physical simulation where
00:53:44
you have your function or loss landscape
00:53:47
you just pick an initial point and
00:53:49
imagine that is a ball that keeps
00:53:51
rolling down the hill until it hits a
00:53:53
stable point or it just cannot locally
00:53:56
minimize your loss anymore so you just
00:53:58
at each iteration take your current
00:54:00
point subtract learning rate at time T
00:54:02
times the gradient in this specific
00:54:05
point and this is going to guarantee
00:54:06
convergence to the local minimum under
00:54:09
some minor assumptions on the
00:54:11
smoothness of the function so it needs
00:54:13
to be smooth for it to actually converge
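A minimal sketch of this update rule, assuming a made-up quadratic loss and a hand-picked constant learning rate; the stochastic version would simply replace the full gradient with a minibatch estimate:

import numpy as np

# Minimize f(theta) = ||theta - target||^2 with plain gradient descent:
# theta_{t+1} = theta_t - lr_t * grad f(theta_t).
target = np.array([3.0, -2.0])

def grad(theta):
    return 2.0 * (theta - target)   # analytic gradient of the quadratic

theta = np.zeros(2)                 # initial point: "drop the ball here"
lr = 0.1                            # learning rate, kept constant here
for t in range(100):
    theta = theta - lr * grad(theta)

print(theta)                        # converges towards [3., -2.]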
00:54:14
and it has this nice property that I
00:54:16
was referring to before: because the
00:54:18
gradient of the sum is sum of the
00:54:20
gradients, you can show that analogous
00:54:23
properties hold for the stochastic
00:54:24
version where you don't sum over all
00:54:26
examples you just take a subset and keep
00:54:29
repeating this. This will still converge
00:54:31
under some assumptions bounding
00:54:33
basically the noise, or the variance,
00:54:37
of this estimator and the important
00:54:40
thing is this choice of the learning
00:54:42
rate unfortunately matters like quite a
00:54:44
few other parameters in machine learning
00:54:46
community and there have been quite a
00:54:48
few other optimizers that were
00:54:49
developed on top of gradient descent one
00:54:51
of which became a sort of golden
00:54:53
standard like step zero that you always
00:54:56
start with, which is called Adam, and when
00:54:59
we go to practical issues I will say
00:55:01
this yet again if you're just starting
00:55:04
with some model, just use Adam; if you're
00:55:06
even thinking about the optimizer, this is
00:55:08
just a good starting rule.
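For reference, a minimal NumPy sketch of the Adam update as usually written (first and second moment estimates with bias correction); in practice you would call your library's built-in optimizer rather than re-implementing it:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running moment estimates, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: keep m and v as zero-initialised arrays of theta's shape and call
# adam_step once per minibatch with the current gradient.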
00:55:11
in principle you can apply gradient
00:55:14
descent to non smooth functions and a
00:55:16
lot of stuff in deep learning is kind of
00:55:18
non smooth and people still apply it but
00:55:20
the consequence is you will lose your
00:55:22
convergence guarantees, so the fact that
00:55:24
your loss doesn't decrease anymore might
00:55:27
as well be
00:55:28
that you just did something you were not
00:55:29
supposed to be doing: maybe you
00:55:31
provided a node without a well-defined
00:55:33
gradient or you define the wrong
00:55:34
gradient, you put a stop-gradient in the
00:55:36
wrong place, and so on — then things might
00:55:38
stop converging so what do we need from
00:55:43
the perspective of our nodes so that we can
00:55:45
apply gradient descent directly to the
00:55:47
computational graph right because we
00:55:49
have this computational graph representation
00:55:50
for everything that we talked about and
00:55:52
the only API that we need to follow is
00:55:55
very similar to the one we talked before
00:55:58
we need forward pass given X given input
00:56:01
what is the output and also we need a
00:56:04
backward pass: what is, basically, the
00:56:06
Jacobian with respect to your inputs. For
00:56:10
computational efficiency we don't
00:56:12
necessarily compute the full Jacobian, but
00:56:13
rather the product between the Jacobian and the
00:56:16
gradient of the loss that you eventually
00:56:17
care about, and this is going to be the
00:56:19
information we're gonna send back through
00:56:21
the network.
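A hedged sketch of this two-method API (the class and method names are made up, not those of any particular library): backward takes the incoming gradient with respect to the node's output and returns the gradient with respect to its input, i.e. a Jacobian-vector product computed without ever materialising the Jacobian.

import numpy as np

class Square:
    """Toy node f(x) = x**2, following the forward/backward API."""

    def forward(self, x):
        self.x = x                      # cache what backward will need
        return x ** 2

    def backward(self, grad_output):
        # The Jacobian of an elementwise square is diag(2x), so the
        # Jacobian-vector product is just an elementwise multiplication.
        return 2.0 * self.x * grad_output

node = Square()
y = node.forward(np.array([1.0, -3.0]))
grad_x = node.backward(np.ones(2))      # pretend dLoss/dy = [1, 1]
print(grad_x)                           # [ 2., -6.]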
00:56:24
So let's be more precise: with our computational graph with three layers
00:56:28
we have this sort of gradient descent
00:56:30
algorithm, we have our parameters, the thetas,
00:56:32
and we want to unify these views somehow
00:56:36
right so I need to know what theta is
00:56:37
and how to compute the gradient so let's
00:56:41
start with making theta appear: one
00:56:43
view that you might use is that there
00:56:46
actually is an extra node called theta
00:56:48
and all these parameters these w's B's
00:56:50
that I need for every layer is just
00:56:52
slicing and reshaping of this one huge
00:56:56
theta right so imagine there was this
00:56:57
huge vector theta and I'm just saying
00:57:00
the first W whatever its shape is is
00:57:02
first K dimensions I just take them
00:57:04
reshape this is a well-defined
00:57:06
differentiable operation, right, and the
00:57:09
gradient of the reshaping is reshaping
00:57:11
of the gradient kind of thing so I can
00:57:13
have one theta and then the only
00:57:15
question is how to compute the gradient
00:57:17
and the whole math behind it is really
00:57:19
the chain rule: the derivative of a composition of
00:57:22
functions decomposes with respect to the
00:57:26
inner nodes so if you have F composed
00:57:28
with G and you try to compute the
00:57:30
partial derivative of the output with
00:57:31
respect to the input you can as well
00:57:33
compute the partial derivative of the
00:57:35
output with respect to this inner node G
00:57:38
let's multiply it by the partial
00:57:40
derivative of G with respect to X and if
00:57:43
G happens to be multi-dimensional if
00:57:45
there are many outputs then from matrix
00:57:48
calculus you know that the analogous
00:57:51
object requires you to simply sum over
00:57:54
all these paths.
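Written out, the two rules above are just the multivariate chain rule, informally:

\frac{\partial (f \circ g)}{\partial x}
  = \sum_{k} \frac{\partial f}{\partial g_k} \, \frac{\partial g_k}{\partial x},
\qquad
\frac{\partial L}{\partial \theta}
  = \sum_{\text{paths } p \,:\, \theta \to L} \;\; \prod_{(u \to v) \in p} \frac{\partial v}{\partial u}.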
00:57:57
So what does this mean from the perspective of the computational
00:57:59
graph well let's take a look at one path
00:58:01
so we have the dependence of our loss
00:58:03
node on our weight node, which now became
00:58:06
an input change to blue because as we
00:58:08
discussed before there is literally no
00:58:10
difference between these two and it's
00:58:12
going through this so now all we are
00:58:15
going to do is apply the first rule
00:58:17
we're going to take the final loss and
00:58:19
ask it, okay, tell us what the gradient is.
00:58:22
We are now in the
00:58:24
node that needs to know given how the
00:58:27
outputs needs to change which is already
00:58:29
told to us by this node how it needs to
00:58:32
adjust its inputs which is this Jacobian
00:58:35
times the partial derivative of rest of
00:58:37
the loss with respect to our output so
00:58:40
we can send it back, and we already have
00:58:43
dL with respect to whatever the name of this node is.
00:58:46
the previous node has the same property
00:58:49
right it's being told your outputs needs
00:58:52
to change in these directions and
00:58:53
internally it knows — and by "it knows" I
00:58:56
mean we can compute this Jacobian how to
00:58:58
adjust its inputs so that its outputs
00:59:00
change in the same direction and you go
00:59:03
through this whole graph backwards, da da
00:59:05
da, until you hit your theta, and this is
00:59:07
using just this rule the only problem is
00:59:10
there is a bit more than one path for this
00:59:13
network, there are way more dependencies, but
00:59:15
this is where the other one comes into
00:59:17
place we will just need to sum over all
00:59:19
the paths that connect these two nodes
00:59:22
there might be exponentially many paths
00:59:25
but because they reuse computation the
00:59:28
whole algorithm is fully linear right
00:59:30
because we only go through each node
00:59:32
once computing up till here is
00:59:34
deterministic and then we can in
00:59:36
parallel also compute these two paths
00:59:38
until they meet again. So we have a linear
00:59:41
algorithm that backprops through the
00:59:43
whole thing you can ask couldn't I just
00:59:45
do it by hand for going through all the
00:59:47
equations of course you could but it
00:59:49
would be at the very least quadratic
00:59:51
if you do it naively this is just a
00:59:53
computational trick to make everything
00:59:54
linear and fit into this really generic
00:59:56
scheme that allows you to do all this
00:59:58
funky stuff including all the modern
01:00:01
different architectures representing
01:00:02
everything as computational graphs just
01:00:05
allows you to stop thinking about this
01:00:07
and you can see this shift in research
01:00:09
papers as well
01:00:10
Up to like 2005-ish, you'd see in each paper
01:00:14
from machine learning a section on the
01:00:16
gradient of the loss, where people would
01:00:18
define some specific model and then
01:00:20
there will be a section where they say
01:00:22
oh I sat down and wrote down all the
01:00:24
partial derivatives this is what you
01:00:26
need to plug in to learn my model and
01:00:28
since then this has disappeared; no one ever
01:00:31
writes this, they just say: I use TensorFlow, PyTorch,
01:00:34
Keras, or your favourite library. It's a
01:00:38
good
01:00:39
thing — it moved the field forward: instead of
01:00:42
postdocs, you know, spending a month
01:00:44
deriving everything by hand they spent
01:00:46
five seconds
01:00:47
calling gradients. So let's reimagine
01:00:49
these few modules that we introduced as
01:00:52
computational graphs we have our linear
01:00:54
module as we talked before is just a
01:00:57
function with three arguments it is
01:00:59
basically a dot product between X and W
01:01:01
we add B and what we need to define is
01:01:05
this backward computation with respect
01:01:07
to each of the inputs no matter if it's
01:01:09
an actual input blue thing or a weight
01:01:11
as we discussed before and for X and W
01:01:15
themselves the situation is symmetric we
01:01:18
essentially, for X, we just multiply by
01:01:20
W the errors that are coming from the
01:01:22
future — I mean from further down the
01:01:25
graph, not from the future — and for the W
01:01:27
is just the same situation but with X's
01:01:30
right because the dot product is pretty
01:01:32
symmetric operation itself and the
01:01:33
update for the biases is just the
01:01:36
identity since they are just added at
01:01:38
the end, so you can adjust them very
01:01:41
easily.
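Putting those three formulas into the same forward/backward shape as before — a hedged sketch assuming a batch of row-vector inputs:

import numpy as np

class Linear:
    """f(x, W, b) = x @ W + b for a batch of row vectors x of shape (n, d_in)."""

    def __init__(self, d_in, d_out):
        self.W = 0.1 * np.random.randn(d_in, d_out)
        self.b = np.zeros(d_out)

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad_output):          # grad_output has shape (n, d_out)
        self.grad_W = self.x.T @ grad_output   # errors multiplied by the inputs
        self.grad_b = grad_output.sum(axis=0)  # bias: the gradient passes through the identity
        return grad_output @ self.W.T          # errors multiplied by the weights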
01:01:44
The nice thing to note is that all these
01:01:46
operations in the backward graph
01:01:47
are also basic algebra, and as such
01:01:50
they could be a computational graph
01:01:53
themselves and this is what happens in
01:01:55
many of these libraries when you call TF
01:01:57
gradients for example or something this
01:01:59
the backward computation will be added
01:02:02
to your graph there will be a new chunk
01:02:03
of your graph
01:02:04
that represents the backwards graph and
01:02:07
what's cool about this is now you can go
01:02:09
really crazy and say: I want higher-
01:02:12
order derivatives, I want to backprop through
01:02:14
backprop, and all you need to do is just
01:02:16
grab the node that corresponds to this
01:02:18
computation that was done for you just
01:02:20
call it again and again and again and
01:02:22
just get this really really powerful
01:02:24
differentiation technique until your GPU
01:02:27
RAM dies, right. The ReLU
01:02:30
itself is
01:02:32
super simple: in the forward pass you
01:02:34
have the maximum of zero and X; in the
01:02:36
backward pass you end up with a masking
01:02:38
method so if the specific neuron was
01:02:42
active when I say active I mean it was
01:02:44
positive and the ReLU just passed it
01:02:45
through then you just pass the gradients
01:02:47
through as well and if it was inactive
01:02:50
meaning it was negative it hit zero then
01:02:53
of course gradients coming back need to
01:02:55
be zeroed as well because we don't know
01:02:57
how to adjust them, right? Locally, from the
01:02:59
ReLU's perspective, if you are in the
01:03:01
zero land if you make an infinitely
01:03:03
small step you are still in zero land
01:03:04
let's forget about the actual zero, because
01:03:06
this one is tricky.
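The corresponding sketch for the ReLU, where the backward pass is exactly the masking just described:

import numpy as np

class ReLU:
    def forward(self, x):
        self.mask = x > 0             # which units were active (strictly positive)
        return np.maximum(0.0, x)

    def backward(self, grad_output):
        # Pass gradients through the active units, zero them out elsewhere.
        return grad_output * self.mask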
01:03:11
Softmax is also relatively simple —
01:03:13
maybe its gradient is slightly fancier
01:03:14
because there's an exponentiation,
01:03:15
summation, division — but it's the same
01:03:18
principle right and you can also derive
01:03:21
the corresponding partial derivative
01:03:23
which is the backward pass and it's
01:03:26
essentially the difference between the
01:03:28
incoming gradients and the output and
01:03:31
you can see that these things might blow
01:03:34
up, right: softmax itself, if x_j is very
01:03:37
big then exponent will just overflow
01:03:40
whatever is the numerical precision of
01:03:42
your computer and as such is rarely used
01:03:45
in such a form
01:03:46
it's either composed with something
01:03:48
that squashes it back to a reasonable
01:03:50
scale, or one does some tricks like taking
01:03:52
a minimum of XJ and say 50 so that you
01:03:56
lose parts of say mathematical beauty of
01:04:00
this but at least things will not blow
01:04:02
up to infinities and now if you look at
01:04:06
the cross entropy is also very simple to
01:04:10
vectorize, and from its partial derivatives
01:04:13
we can now see why things get messy
01:04:16
computationally: you divide by P, and
01:04:17
dividing by small numbers as you know
01:04:19
from computer science basics can again
01:04:22
overflow so it's something that on its
01:04:24
own is not safe to do again you could
01:04:27
hack things around but the nicer
01:04:29
solutions and the nice thing about
01:04:31
viewing all these things jointly inputs
01:04:34
weights targets whatever as the same
01:04:38
objects with exactly the same paradigm
01:04:40
exactly the same model that we use to
01:04:43
say well these are these pictures of
01:04:45
dogs and cats right and these are the
01:04:46
targets; what is a set of weights
01:04:49
for this model to maximize the
01:04:51
probability of this labeling can also
01:04:53
ask the question given this neural
01:04:55
network what is the most probable
01:04:57
labeling of these pictures so that this
01:05:00
neural network is going to be happy
01:05:02
about it its loss is going to be low by
01:05:05
simply attaching our gradient descent
01:05:07
technique instead of to the theta that
01:05:11
we can attach directly to T right and as
01:05:14
long as these things are properly
01:05:16
defined in your library, this is going to work.
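A hedged sketch of that idea: freeze the weights and attach gradient descent to the input instead of to theta (the tiny linear classifier, the finite-difference gradients and the numbers below are made up to keep the sketch short):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # frozen, "already trained" weights (made up)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss(x, target_class):
    p = softmax(x @ W)                # a simple linear classifier
    return -np.log(p[target_class])

# Gradient descent on the INPUT x so that the fixed network assigns high
# probability to class 2 -- same machinery, just a different leaf node.
x = np.zeros(4)
target_class = 2
eps, lr = 1e-5, 0.5
for _ in range(200):
    g = np.array([(loss(x + eps * e, target_class) -
                   loss(x - eps * e, target_class)) / (2 * eps)
                  for e in np.eye(4)])
    x = x - lr * g

print(softmax(x @ W))                 # probability mass concentrates on class 2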
01:05:19
And now you can see why you would
01:05:22
compose softmax and the cross-entropy
01:05:24
because now the backward pass simplifies
01:05:26
dramatically: instead of all this
01:05:29
nastiness, division by small numbers, etc., you
01:05:32
just get the partial derivative of the
01:05:35
loss with respect to inputs as a
01:05:38
difference between targets and your
01:05:40
inputs as simple as that all the
01:05:42
numerical instabilities gone you can of
01:05:45
course still learn labels and partial
01:05:48
derivative is relatively okay.
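A hedged sketch of the fused, numerically stable combination: compute the cross-entropy through log-sum-exp rather than through explicit probabilities, and use the simplified backward pass — probabilities minus targets — directly on the logits:

import numpy as np

def softmax_cross_entropy(logits, targets):
    """logits: (n, k) raw scores; targets: (n, k) one-hot (or soft) labels.

    Returns (loss, grad_logits). No explicit division by small probabilities,
    no exponentiation of large numbers.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # safe against overflow
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -(targets * log_probs).sum(axis=1).mean()
    probs = np.exp(log_probs)
    grad_logits = (probs - targets) / logits.shape[0]       # the simple difference
    return loss, grad_logits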
01:05:52
This is one of the main reasons why, when using
01:05:55
machine learning libraries like Keras,
01:05:57
TensorFlow, and many others, you'll
01:06:00
encounter this cross-entropy jungle you
01:06:02
see ten functions that are called
01:06:04
cross-entropy something: like sparse
01:06:06
cross-entropy with logits,
01:06:07
cross-entropy with softmax, I don't
01:06:09
know what else. The reason is
01:06:12
because each of these operations on its
01:06:14
own is numerically unstable and people
01:06:17
wanted to provide you with a solution
01:06:19
that is numerically stable they just
01:06:21
took literally every single combination
01:06:23
gave it a name and each of these
01:06:25
combinations is implemented in a way
01:06:27
that is numerically stable,
01:06:28
and all you need to do is to have this
01:06:30
lookup table which combination you want
01:06:32
to use and pick the right
01:06:34
name right but underneath they're always
01:06:37
just composing cross-entropy with either
01:06:39
sigmoid or soft max or something like
01:06:41
this and it's exactly this problem that
01:06:43
they are avoiding if you want to do
01:06:45
things by hand feel free but don't be
01:06:48
surprised if even on MNIST from time
01:06:49
to time you'll see an infinity in your
01:06:51
loss it's just the beauty of finite
01:06:54
precision arithmetic in the continuous
01:06:56
land so let's go back to our example
01:07:00
right it was this small puzzle piece now
01:07:03
we can explicitly label each of the
01:07:06
nodes: so we have our Xs, they go through a dot
01:07:09
product with weights, biases are added,
01:07:11
then there is a ReLU; we do it quite a few
01:07:13
times at some point at this point we
01:07:16
have probability estimates and this is
01:07:18
the output of our model even though our
01:07:20
loss is computed later this is also one
01:07:23
of the things I was mentioning before
01:07:24
right the output or the special nodes
01:07:27
don't have to be at the end they might
01:07:28
be branching from the middle we can
01:07:31
replace this with theta
01:07:32
and slicing, and apply
01:07:37
our gradient descent and maybe to
01:07:40
surprise some of you this is literally
01:07:43
how training of most of the deep neural
01:07:46
nets look like in supervised way
01:07:49
reinforcement learning slightly
01:07:51
different story but it's this underlying
01:07:54
principle that allows you to work with
01:07:56
any kind of neural network it doesn't
01:07:59
have to be this linear structure all of
01:08:01
these funky things that I was trying to
01:08:03
to portray ten slides ago rely on
01:08:06
exactly the same principle and you use
01:08:08
exactly the same rules you just keep
01:08:09
composing around the same algorithm and
01:08:11
you get an optimization method that's
01:08:14
going to converge to some local minimum
01:08:16
not necessary perfect model but is going
01:08:18
to learn something so there are a few
01:08:21
things that we omitted from this that
01:08:24
are still interesting pieces of the
01:08:25
puzzle one such thing is taking a
01:08:28
maximum so imagine that one of your
01:08:30
nodes wants to take a maximum you have a
01:08:32
competition between your inputs and
01:08:34
only the maximal one is going to be
01:08:36
selected then the backwards pass of this
01:08:40
operation
01:08:41
is nothing but gating: again, it's
01:08:43
gonna be passing through the gradients
01:08:46
only if this specific dimension was
01:08:49
the maximal one and just zeroing out
01:08:51
everything else you can see that this
01:08:53
will not learn how to select things but
01:08:55
at least it will tell the maximal thing
01:08:57
how to adjust under the condition that
01:08:59
it got selected.
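A hedged sketch of that gating behaviour for a max over a vector (a simplified stand-in for, say, a max-pooling unit):

import numpy as np

class Max:
    """Takes the maximum over a 1-D input; backward routes the gradient
    only to the winning entry and zeroes out everything else."""

    def forward(self, x):
        self.argmax = np.argmax(x)
        self.size = x.shape[0]
        return x[self.argmax]

    def backward(self, grad_output):       # grad_output is a scalar here
        grad_x = np.zeros(self.size)
        grad_x[self.argmax] = grad_output
        return grad_x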
01:09:01
All right, so there is this notion of non-smooth things that
01:09:03
don't necessarily guarantee convergence
01:09:05
in a mathematical sense but they are
01:09:08
commonly used and you'll see in say
01:09:09
Sander's talk on convolutional neural networks
01:09:12
that they are part of the max pooling
01:09:14
layers you can have funky things like
01:09:17
conditional execution like five
01:09:19
different computations and then a one-hot
01:09:22
layer that tells you which of these
01:09:24
branches to select. Now, if it was
01:09:26
one hot encoded then selection can be
01:09:28
viewed as just point wise multiplication
01:09:30
right by one we multiply and then the
01:09:33
backward pass is just gonna be again
01:09:35
gated in the same way but if it was not
01:09:38
a one-hot encoding but rather the output
01:09:40
of the softmax right of something
01:09:42
parametrized then looking at the
01:09:44
backward pass with respect to the gating
01:09:46
allows you to literally learn the
01:09:48
conditioning you can learn which branch
01:09:50
of execution to go through as long as
01:09:53
you smoothly mix between them using
01:09:55
softmax
01:09:56
and this is the high-level principle, or
01:09:58
high-level idea, behind modern attention
01:10:00
models; they essentially do this.
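A hedged sketch of that smooth mixing: instead of a hard one-hot choice, blend the branch outputs with softmax weights, so the gating itself becomes differentiable and learnable (the two branches and the numbers are made up):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Two hypothetical "branches" of computation on the same input.
branches = [lambda x: x ** 2,
            lambda x: np.sin(x)]

def gated(x, gate_logits):
    weights = softmax(gate_logits)           # smooth, differentiable selection
    outputs = np.stack([b(x) for b in branches])
    return (weights[:, None] * outputs).sum(axis=0)

x = np.linspace(0.0, 1.0, 5)
print(gated(x, np.array([2.0, -1.0])))       # mostly the x**2 branch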
01:10:04
To give you a trivial example of other
01:10:06
losses: you had cross-entropy, but of
01:10:08
course many problems in real life are
01:10:09
not classification if it's a regression
01:10:13
so your outputs are just real numbers
01:10:14
then the L2 quadratic loss, or one of at
01:10:18
least ten other names for this quantity
01:10:19
which is just a square norm of a
01:10:22
difference between targets and your
01:10:24
prediction can also be seen as a
01:10:25
computational graph and the backward
01:10:27
pass is again nothing but the difference
01:10:29
between what you predicted and what you
01:10:32
wanted and there's this nice duality as
01:10:34
you can see from the backwards
01:10:35
perspective that looks exactly the same
01:10:37
as in the case of the cross-entropy composed with
01:10:39
the softmax, which also provides you with
01:10:41
some intuitions into how these things
01:10:43
are related.
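For completeness, the same duality as a tiny sketch (using the common 1/2 factor so that the backward pass is exactly the difference):

import numpy as np

def l2_loss(prediction, target):
    """Returns (loss, grad_prediction) for 0.5 * ||prediction - target||^2."""
    diff = prediction - target
    return 0.5 * np.sum(diff ** 2), diff   # gradient: predicted minus wanted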
01:10:46
So let's quickly go through some practical issues, given that
01:10:48
we know kind of what we're working with
01:10:50
and the first one is the well-known
01:10:53
problem of overfitting and regularization. So from
01:10:56
statistical learning theory so we are
01:10:58
still, or again, going back to, say, Vapnik
01:11:01
and way before him — we know that in this
01:11:04
situation where we just have some
01:11:06
training set which is a finite set of
01:11:08
data that we're building our model on
01:11:11
top minimizing error on it which we're
01:11:15
going to call training error or training
01:11:16
risk is not necessarily what we care
01:11:18
about what we care about is how our
01:11:20
model is going to behave in a while it
01:11:22
what's going to happen if I take a test
01:11:24
sample that kind of looks the same but
01:11:26
it's a different dog than the one that I
01:11:27
saw in train that's what we are going to
01:11:30
call test risk, or test error, and it's a
01:11:33
provable statement that there is this
01:11:34
relation between complexity of your
01:11:37
model and the behavior between these two
01:11:40
errors as your model gets more and more
01:11:42
complex — and by complex I mean more
01:11:44
capable of representing more and more
01:11:47
crazy functions or being able to just
01:11:49
store and store more information then
01:11:52
your training error has to go down not
01:11:54
in terms of any learning methods but
01:11:56
just in terms of existence of parameters
01:11:58
that realize it — think universal
01:12:00
approximation theorem right it says
01:12:02
literally that but at the same time as
01:12:04
things get more complex and bigger,
01:12:06
your test risk initially goes down
01:12:09
because you are just getting better at
01:12:11
representing the underlying structure
01:12:13
but eventually the worst case scenario
01:12:15
is actually going to go up because you
01:12:17
might as well represent things in a very
01:12:19
bad way, for example by enumerating all
01:12:22
the training examples and outputting
01:12:24
exactly what's expected from you zero
01:12:26
training error, but this represents
01:12:28
horrible generalization power and this
01:12:31
sort of curve you'll see in pretty much
01:12:33
any machine learning book till 2016 ish
01:12:37
when people started discovering
01:12:38
something new that will go through in a
01:12:42
second but even if you just look at this
01:12:45
example you notice that there is some
01:12:47
reason to keep things simple and so
01:12:50
people developed many regularization
01:12:51
techniques such as LP regularization
01:12:53
where you attach one of these extra
01:12:56
losses directly to weights that we
01:12:57
talked about before which is just LP
01:13:00
norm — like the L2 quadratic norm or L1 or
01:13:02
something like this to each of your
01:13:04
weights so that your weights are small
01:13:07
and you can prove — again, a guarantee — that
01:13:08
if weights are small the function cannot
01:13:10
be too complex so you are restricting
01:13:12
yourself to the left-hand side of this
01:13:13
graph you can do drop out where some
01:13:15
neurons are randomly deactivated again
01:13:18
much harder to represent complex things
01:13:20
you can add noise to your data you can
01:13:22
stop early or you can use various
01:13:24
notions of normalization that will be
01:13:26
talked through in the next lecture but
01:13:29
that's all in this worst-case scenario.
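As a minimal, hedged illustration of the first two techniques above — an L2 penalty attached directly to the weights, and (inverted) dropout on activations:

import numpy as np

def l2_penalty(weights, lam=1e-4):
    """Extra loss term attached to the weights themselves: lam * sum ||W||^2."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly deactivate a fraction p of units (inverted dropout)."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)   # rescale so the expected value is unchanged

# total_loss = data_loss + l2_penalty([W1, W2, ...]); dropout is inserted
# between layers during training only.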
01:13:31
what people recently discovered or
01:13:33
recently started working on is how this
01:13:36
relates to our deep neural networks that
01:13:39
don't have hundreds of parameters they
01:13:40
have billions of parameters and yet
01:13:42
somehow they don't really over fit as
01:13:44
easily as you would expect so the new
01:13:47
version of this picture emerge that's
01:13:49
currently referred to as double descent
01:13:51
where you have this phase change but yes
01:13:55
things get worse as you get more and
01:13:57
more complex model but eventually you
01:13:59
hit this magical boundary of over
01:14:01
parameterization where you have so many
01:14:03
parameters that, even though in theory
01:14:05
you could do things in a very nasty way
01:14:08
like by enumerating examples, because of
01:14:11
the learning methods that we are using
01:14:12
you never will
01:14:14
you start to behave kind of like a
01:14:16
Gaussian process and as you keep
01:14:18
increasing number of parameters you
01:14:20
actually end up with the simplest
01:14:21
solutions being found first rather than
01:14:24
the more complex ones and so the curve
01:14:26
descends again and it has been proven by
01:14:30
Belkin et al. under some constraints
01:14:33
and shown in simple examples then it was
01:14:36
also reinforced with cool work from
01:14:41
Preetum Nakkiran
01:14:42
and colleagues at OpenAI, where they
01:14:49
showed that this holds for deep big
01:14:51
models that we care about so one could
01:14:54
ask well does it mean we don't care
01:14:55
about regularization anymore, you just make
01:14:57
models bigger and the answer is well not
01:15:01
exactly
01:15:02
it's both true that as you increase the
01:15:04
model, which you can see on the x-axis,
01:15:05
your test loss, after
01:15:09
rapidly increasing, keeps decreasing all
01:15:11
the time but adding regularization can
01:15:14
just keep the whole curve lower so here
01:15:18
as you go through curves from top to
01:15:19
bottom
01:15:20
it's just more and more regularization
01:15:21
being added so what it means how it
01:15:24
relates to this theory of complexity
01:15:27
what that mostly means is that model
01:15:29
complexity is way more than just the number of
01:15:33
parameters and this is a local minimum
01:15:35
like research local minimum people were
01:15:36
in for quite a while where they thought
01:15:38
well, your neural network is huge, surely it
01:15:41
is not going to generalize well, because
01:15:43
your Vapnik–Chervonenkis bounds are
01:15:44
infinite you're doomed and it seems not
01:15:47
to be the case the complexity of the
01:15:49
model strongly relies on the way we
01:15:53
train and as a result you are still kind
01:15:58
of in this regime where things can
01:16:01
get worse and you do need to regularize
01:16:03
but adding more parameters is also a way
01:16:06
to get better results slightly
01:16:09
counterintuitive and only applies if you
01:16:11
keep using gradient descent not some
01:16:13
nasty way okay
01:16:15
so just a few things there's a lot of
01:16:18
stuff that can go wrong when you train a
01:16:19
neural net, and it can be a harsh
01:16:24
experience initially so first of all if
01:16:27
you haven't tried don't get discouraged
01:16:29
initially nothing works and it's
01:16:31
something we all went through and there
01:16:33
is nothing to solve it apart from
01:16:36
practice just playing with this will
01:16:38
eventually get you there there's a
01:16:41
brilliant blog posts from Andrew karpati
01:16:45
and I'm referring to here and also a few
01:16:48
points that I like to keep in mind each
01:16:50
time I train neural networks first of
01:16:53
all that initialization really matters
01:16:55
all the theory that was built and the
01:16:57
practical results if you initialize your
01:16:59
network badly it won't learn and you can
01:17:01
prove it won't work won't learn well
01:17:04
what you should start with always is to
01:17:07
try to overfit: if you're
01:17:09
introducing a new model especially you
01:17:11
need to try to overfit on some small
01:17:13
data sample if you can't over fit almost
01:17:16
surely you made a mistake unless for
01:17:18
some reason your model doesn't work for
01:17:19
small sample sizes then obviously just
01:17:21
ignore what I just said
01:17:24
you should always monitor training loss
01:17:26
I know sounds obvious but quite a few
01:17:29
people just assume that loss will go
01:17:30
down
01:17:31
because gradient descent guarantees it;
01:17:32
without monitoring it you will never
01:17:34
know if you are in the right spot
01:17:37
especially given that many of our models
01:17:39
are not differentiable and as such the
01:17:41
loss doesn't have to go down so if it's
01:17:42
not going down you might want to
01:17:44
reconsider using these non-differentiable
01:17:45
units. Even more important is something that
01:17:48
people apparently stopped doing in deep
01:17:50
learning on a daily basis it's
01:17:52
monitoring norms of your weights norms
01:17:54
going to infinity is something to be
01:17:57
worried about and if it's not making
01:17:59
your job crash right now, it eventually
01:18:01
will once you leave it running for a few
01:18:04
days, and then you'll regret not
01:18:06
monitoring it earlier. Another
01:18:10
thing is adding shape asserts. All the
01:18:12
modern deep learning libraries
01:18:15
are great and have brilliant features,
01:18:16
one of which is automatic broadcasting
01:18:19
they take a column vector we take a row
01:18:20
vector you add them you get the matrix
01:18:23
very useful unless this is not what you
01:18:25
wanted to do: you just wanted two vectors
01:18:27
and you ended up with a matrix. If the next
01:18:30
operation is taking a maximum or taking
01:18:32
the average you won't notice right
01:18:34
afterwards there's just a scalar
01:18:35
everything looks fine but your learning
01:18:37
will be really crazy and you can try to
01:18:40
train a linear regression and just by
01:18:42
mistake transpose targets and you'll see
01:18:45
how badly linear regression can behave
01:18:47
by just one liner that throws no
01:18:49
exceptions and your loss will go down it
01:18:52
just won't be the model that you're
01:18:53
expecting the only way that I know about
01:18:55
to resolve this is to add shape asserts
01:18:57
everywhere each time you add an
01:18:59
operation we just write down an assert
01:19:02
like literally low-level engineering
01:19:03
thing to make sure that the shape is
01:19:05
exactly what you expect otherwise you
01:19:07
might run into issues.
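A hedged sketch of exactly that failure mode and the one-line asserts that catch it (the shapes and numbers are hypothetical):

import numpy as np

n = 100
x = np.random.randn(n, 1)
y_true = 3.0 * x[:, 0] + 1.0          # shape (n,), a flat vector of targets
y_pred = x @ np.array([[2.9]]) + 1.0  # shape (n, 1), a column of predictions

# Silent broadcasting bug: (n,) - (n, 1) broadcasts to an (n, n) matrix,
# and the final mean() hides it behind an innocent-looking scalar.
bad_loss = np.mean((y_true - y_pred) ** 2)

# The cheap fix: assert the shapes you expect before reducing.
assert y_pred.shape == (n, 1), y_pred.shape
y_pred = y_pred[:, 0]
assert y_pred.shape == y_true.shape, (y_pred.shape, y_true.shape)
good_loss = np.mean((y_true - y_pred) ** 2)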
01:19:09
Things that we mentioned before:
01:19:10
use Adam as your starting point, just
01:19:12
because 3e-4 is the magical
01:19:16
learning rate; it works in 99% of deep
01:19:18
learning models for unknown reasons to
01:19:20
everyone
01:19:21
finally it's very tempting to change
01:19:24
five things at a time because you feel
01:19:25
like you have so many good ideas and
01:19:27
don't get me wrong you probably do but
01:19:29
if you change all of them at once
01:19:31
you will regret it afterwards when you
01:19:33
struggle with debugging and or credit
01:19:36
assignment of what actually improves in
01:19:38
your model and the reviewers won't be
01:19:40
happy either
01:19:41
when your ablation just skips five steps at once
01:19:44
so given a few last minutes before the
01:19:48
questions, I wanted to spend, say, three-
01:19:51
ish minutes on the bonus thing on
01:19:54
multiplicative interactions so I was
01:19:56
trying to convince you through this
01:19:58
lecture that neural networks are really
01:20:01
powerful and I hope I succeeded they are
01:20:05
very powerful but I want to ask this may
01:20:08
be a funny question what is one thing
01:20:10
that these multi-layer networks — where we
01:20:12
just have a linear then an activation
01:20:14
function, say sigmoid or ReLU, stacked
01:20:16
on top of each other definitely cannot
01:20:18
do? Well, there may be many answers, right; they
01:20:21
can't do a lot of stuff but one trivial
01:20:24
thing they can't do is they can't
01:20:25
multiply there's just no way for them to
01:20:29
multiply two numbers given us inputs
01:20:32
again you might be slightly confused we
01:20:35
just talked about the universal
01:20:36
approximation theorem but what I'm
01:20:37
referring to is representing
01:20:39
multiplication we can approximate
01:20:42
multiplication to any precision but they
01:20:44
can never actually represent the
01:20:46
function that multiplies so no matter
01:20:48
how big your data set is going to be no
01:20:51
matter how deep your network is going to
01:20:52
be, if you train it to multiply two
01:20:54
numbers I can always find two new
01:20:57
numbers on which it is going to miserably
01:20:59
fail, and by miserably I mean get an
01:21:01
arbitrarily big error. Maybe my numbers are
01:21:03
going to be huge doesn't matter there is
01:21:06
something special about multiplication
01:21:08
that I would like you to note. What's
01:21:10
special about it? For example,
01:21:12
conditional execution relies on
01:21:14
multiplying something between 0 and 1
01:21:17
and something else many things in your
01:21:19
life can be represented as
01:21:21
multiplication for example computing distance between
01:21:23
two points relies on being able to
01:21:25
compute a dot product plus norms and
01:21:28
things like this so it's quite useful to
01:21:30
have this sort of operation yet stacking
01:21:33
even infinitely many yes infinitely many
01:21:36
layers would not help and one way to
01:21:40
resolve it is a sort of unit that just
01:21:43
implements multiplicative interactions;
01:21:45
one way to formalize it is as follows
01:21:47
you have a tensor W you take your inputs
01:21:50
through this you can see this as a
01:21:52
Mahalanobis-like dot product, if you went
01:21:54
through that part of algebra; then
01:21:57
you have the matrix projections of the
01:21:58
remaining things, and you just add the bias.
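One way to write such a unit down concretely — a hedged sketch; the exact parameterisation used in the paper mentioned below may differ in its details:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_out = 3, 4, 2

W3 = rng.normal(size=(d_out, d_z, d_x))   # 3-D tensor: the multiplicative part
U = rng.normal(size=(d_out, d_z))         # linear projection of z
V = rng.normal(size=(d_out, d_x))         # linear projection of x
b = np.zeros(d_out)

def multiplicative_unit(x, z):
    # out_k = z^T W3[k] x + (U z)_k + (V x)_k + b_k
    bilinear = np.einsum('kzx,z,x->k', W3, z, x)
    return bilinear + U @ z + V @ x + b

# With W3 set appropriately this layer represents a dot product (and hence
# multiplication) exactly, which a plain linear+ReLU stack can only approximate.
print(multiplicative_unit(rng.normal(size=d_x), rng.normal(size=d_z)))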
01:22:01
so if you just look at the approximation
01:22:04
things if you were to say compute a dot
01:22:06
product and you do it with a normal
01:22:09
neural net of linears and ReLUs, then
01:22:11
you have exponentially many
01:22:13
parameters needed to approximate this to
01:22:16
that zero point one error I believe I
01:22:18
used here with respect to the
01:22:19
dimensionality of the input there is a
01:22:21
very steep exponential growth just
01:22:24
approximate and there is still gonna be
01:22:26
this problem that you don't generalize
01:22:27
but even approximation requires huge
01:22:30
amounts of parameters while using model
01:22:33
like this explicitly has a linear growth
01:22:35
and has a guarantee right once you hit
01:22:37
the dot product which can be represented
01:22:40
exactly with this module you will
01:22:42
generalize everywhere there's a nice
01:22:44
work from Siddhant Jayakumar et al. at this
01:22:48
year's ICLR if you want to dig deeper,
01:22:51
but I want to just stress there is a
01:22:53
qualitative difference between
01:22:54
approximation and representation and in
01:22:58
some sense sends you home with this
01:22:59
take-home message, which is: if you want
01:23:02
to do research in this sort of
01:23:04
fundamental building blocks of neural
01:23:06
networks please try not to focus on
01:23:09
improving things like marginally
01:23:12
improving things the neural networks
01:23:13
already do very well if we already have
01:23:16
this piece of a puzzle polishing it I
01:23:18
mean is an improvement but it's really
01:23:20
not what's cool about this field of
01:23:23
study and this is not where the biggest
01:23:24
gains both for you scientifically as
01:23:27
well as for the community lies was the
01:23:30
biggest game is identifying what neural
01:23:32
networks cannot do or cannot guaranty
01:23:34
think about maybe you might want a
01:23:37
module that's guaranteed to be convex or
01:23:39
quasi-convex or some other funky
01:23:42
mathematical property that you are
01:23:43
personally interested in and propose a
01:23:45
module that does that I guarantee you
01:23:47
that will be much better experience for
01:23:52
you and much better result for all of us
01:23:54
and with that I'm going to finish so
01:23:57
thank you
01:23:59
you

Description:

Neural networks are the models responsible for the deep learning revolution since 2006, but their foundations go as far back as the 1960s. In this lecture DeepMind Research Scientist Wojciech Czarnecki goes through the basics of how these models operate, learn and solve problems. He also introduces various terminology/naming conventions to prepare attendees for further, more advanced talks. Finally, he briefly touches upon more research-oriented directions of neural network design and development.

Download the slides here: https://storage.googleapis.com/deepmind-media/UCLxDeepMind_2020/L2%20-%20UCLxDeepMind%20DL2020.pdf

Find out more about how DeepMind increases access to science here: https://deepmind.google/about/

Speaker Bio: Wojciech Czarnecki is a Research Scientist at DeepMind. He obtained his PhD from the Jagiellonian University in Cracow, during which he worked on the intersection of machine learning, information theory and cheminformatics. Since joining DeepMind in 2016, Wojciech has been mainly working on deep reinforcement learning, with a focus on multi-agent systems, such as the recent Capture the Flag project or AlphaStar, the first AI to reach the highest league of human players in a widespread professional esport without simplification of the game.

About the lecture series: The Deep Learning Lecture Series is a collaboration between DeepMind and the UCL Centre for Artificial Intelligence. Over the past decade, Deep Learning has evolved as the leading artificial intelligence paradigm, providing us with the ability to learn complex functions from raw data at unprecedented accuracy and scale. Deep Learning has been applied to problems in object recognition, speech recognition, speech synthesis, forecasting, scientific computing, control and many more. The resulting applications are touching all of our lives in areas such as healthcare and medical research, human-computer interaction, communication, transport, conservation, manufacturing and many other fields of human endeavour. In recognition of this huge impact, the 2019 Turing Award, the highest honour in computing, was awarded to pioneers of Deep Learning. In this lecture series, leading research scientists from DeepMind deliver 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks via advanced ideas around memory, attention, and generative modelling to the important topic of responsible innovation.
