
Data Science: a credit scoring example / A lesson on building an ML model in Python

Table of contents
0:00
The credit scoring task
0:25
What to do before building the model
0:44
Data loading and preliminary analysis
4:07
The main trick of EDA! How to do EDA properly
5:29
Working through hypotheses
6:13
Analyzing the target variable / class imbalance
7:11
First hypothesis: age distribution by target (seaborn), normalizing the data
8:11
Second hypothesis: age distribution by education / boxplot
10:10
Feature correlation
10:30
Third hypothesis: salary analysis by target / education
12:33
Feature engineering: how to do it, what new features are possible, handling datetime features, log transforms
15:07
Building the ML model. Stage 1: the baseline (Logistic Regression)
16:46
How to interpret and use the precision, recall and ROC-AUC metrics
17:41
Plotting the ROC curve
18:03
Model parameter tuning with GridSearch
18:40
Comparing results on the ROC-AUC plot / analyzing the metrics
19:32
Analyzing important features after training the model
20:02
Using the shap library to analyze important features / interpreting the results
22:30
Logistic regression coefficients
23:27
Comparing important features across classes (visualizing the differences)
Similar videos from our catalog

What are neural networks? FOR BEGINNERS / About IT / GeekBrains
6:54
Channel: GeekBrains

How convolutional neural networks work | #13 Neural networks in Python
19:37
Channel: selfedu

I analyzed comments WITH A NEURAL NETWORK. And here is what I learned
35:57
Channel: Onigiri

The difference between Artificial Intelligence, Machine Learning and Deep Learning
12:22
Channel: Etudarium

Outlier detection methods | Webinar by Yan Pile | karpov.courses
1:49:00
Channel: karpov.courses

How to train a neural network with PyTorch / Perceptron / Activation functions
19:24
Channel: miracl6

Demo lecture by Igor Shnurenko, "The birth of artificial intelligence from the spirit of mathematics and poetry"
11:10
Channel: AntiTuring (Igor Shnurenko's channel)

Overfitting: what it is and how to avoid it, stopping criteria for training | #5 Neural networks in Python
8:52
Channel: selfedu

Piknik, "U shamana tri ruki" (neural-network video) | Eng subs
3:27
Channel: nikborovik

How to prepare data for training a neural network? An intensive on Python, neural networks and Bitcoin
1:58:23
Channel: Skillbox Programming
Video tags

data scientist
data science
career in tech
jobs in big data
data science interview
self-development
silicon valley
programming basics
how to pass interviews
analyst
Yandex
data analysis
datascience
career in data science
Sysml
data analyst
ods
open data science
machine learning
miracl6
PyMagic
deep learning
perceptron
neural networks
pytorch
activation function
sigmoid
relu
python
Subtitles

  • Russian (the transcript below is translated into English)
00:00:01
In this video we will build a machine learning model for a credit scoring task. We will learn how to approach data analysis and exploratory data analysis correctly, build the machine learning model itself, tune its parameters, and draw conclusions from the data. Hi everyone, my name is Anastasia Nikulina. I have been working in data science for more than four years, and I also teach in this area.
00:00:24
Before building your machine learning model you need to do some preliminary work: discussing the task with the business, gathering requirements, and collecting the data. Today we will skip these stages, since each of them is a huge but interesting topic for a separate video.
00:00:41
Let's get down to the task itself. I open my pre-prepared notebook; everything is as usual: I import all the libraries I need and fix the random seed. You can always add the remaining imports as you go. Let's see what our task actually is. We have data on each client: education level, gender, age, car ownership, income and so on, and the first thing you should always do is look at your data. The key point is that we need to predict the loan default flag. What does that mean? You come to a bank to take out a loan; the bank needs to understand whether it should issue the loan or not. If not, you are a bad borrower for it; if yes, you are a good one. So it is important for us to build a model that predicts exactly this.
First, we always look at the size of the dataset. What's notable here is that the dataset is small, a toy one; but if you have more fields, if you already work somewhere in a bank, there is nothing wrong with that: the very same tools and the very same approaches are used, and for a larger number of rows and features something may be slightly modified, but the approaches are the same.
Then we always look at what data types we have and whether there are gaps anywhere. For example, the education feature has missing values. You can check, say, the percentage: it is roughly between 0 and 4 percent, so we don't have that many. You can fill the gaps with the mode, or simply with a placeholder word such as "unknown" if your algorithms are tree-based. You can also look at the unique values here. I'll say right away that I tried to decipher the education codes, but honestly it is not certain; if you have other interpretations, you can write them in the comments under this video. As I already said, I fill it all with the mode, since there are so few gaps.
The next steps, which you should in principle definitely do even before exploratory data analysis: look at the basic statistics. For the numerical data you can see what averages we have; the average salary in this case is about 41,000. Then you can look at the maximum and minimum salary, at the age, and at the other client features. The client ID can be removed here; in general it is better never to include identifiers in the analysis. You can also look at the number of unique values; I'll explain why we need this. Education is displayed as numeric, that is, essentially the numbers 1, 2, 3, and it is actually unclear whether they express an ordered degree or not, so I decided to cast it all to the object type. And of course, look at the basic statistics for the categorical variables as well: which values are most frequent, how many unique values we have in each feature, and so on.
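The preliminary checks described above (missing-value percentages, filling gaps with the mode, casting an ambiguous numeric code to the object type) can be sketched in pandas roughly like this. The tiny dataframe and its column names are illustrative stand-ins, not the actual Kaggle sf-dst-scoring columns:

```python
import pandas as pd

# Toy stand-in for the scoring dataset; column names are illustrative.
df = pd.DataFrame({
    "education": ["SCH", "GRD", None, "SCH", "UGR", None, "SCH", "GRD"],
    "age": [25, 34, 41, 29, 52, 38, 23, 47],
    "income": [30000, 45000, 52000, 28000, 61000, 39000, 25000, 55000],
})

# Share of missing values per column, in percent (the 0-4% check from the video)
missing_pct = df.isna().mean() * 100

# Fill the gaps in the categorical feature with the mode, as in the video
df["education"] = df["education"].fillna(df["education"].mode()[0])

# A numeric code without a clear order is better treated as categorical,
# so describe() reports frequencies instead of means
df["education"] = df["education"].astype("object")

print(missing_pct)
print(df["education"].value_counts())
```

On the real data you would run the same steps on the full dataframe returned by `pd.read_csv`.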
00:04:09
Here is the most important point of EDA that you should understand; even some practitioners get it wrong. Before building your exploratory data analysis, you must go to the business, discuss it all, discuss your task, and formulate several hypotheses. In this particular notebook there are not enough hypotheses, so as homework take it and try to add at least five, better ten, more hypotheses yourself; in real work you would do some of this together with the business and add some on your own. What do we do next with these hypotheses? We build the exploratory analysis around them. You don't need to throw together a hundred charts that are completely incomprehensible, unstructured, where it's unclear what follows from what and where the logic is; especially since what I often see now is exactly this kind of mess, where you look at a million graphs and can't figure out what is what. To avoid repeating this, so that everything is logically structured and you can always answer a business question later, we write down the hypotheses and then add our charts, statistics and so on under them. If you realize afterwards that something is still unexplored, we add more hypotheses and iterate.
00:05:30
For example, we assume that the age of good borrowers will be higher, so to speak: if we compare ages 20 and 30, a person who is 30 will most likely repay our loan sooner than someone who is 20, because they are more settled. Similarly with education: the better a person's education, say completed higher education, the more likely they are to repay us; and if, on the contrary, a person has completed nothing beyond school, God forbid only 9th grade, it's clear that for us this is a potential risk. These are examples of the kind of hypotheses you can additionally draw up.
00:06:13
The next thing you should always look at is your target. What am I doing here? I simply normalize the counts so that I can see the exact percentages. I could of course look at it quantitatively, but for me that actually says little about the picture, so it is always better to normalize where needed. So let's look at our target variable. If you have a classification task, look at this class percentage ratio; if you have a regression task, then look at the distribution of your target variable, and if it has, say, large outliers, is far from normal, is multimodal and so on, then of course you need to decide how to transform your target. Here we see quite a strong class imbalance, so in the future we must of course take this into account both when choosing evaluation metrics and when building the model itself.
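The normalized look at the target described here is essentially a one-liner in pandas; the toy series below just mimics a strong 90/10 imbalance like the one in the video:

```python
import pandas as pd

# Illustrative target with a strong class imbalance, like the default flag
target = pd.Series([0] * 90 + [1] * 10, name="default")

# Absolute counts say little; normalize=True gives the percentage view
balance = target.value_counts(normalize=True)
print(balance)
```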
00:07:10
Next, remember our hypothesis about age. Of course I take all my age data and again normalize, this time by class size, because if I don't normalize by the class sizes here, everything will stay quantitative and I will never see the relationship. I use the seaborn library everywhere, and in principle these extra lines simply increase the font scale of my text along the various axes, the tick labels, and of course the title. Look: our hypothesis was partly confirmed, but in fact the difference between the curves is not large, so we can always print the mean and median as well. For the means we see that there really is a difference of about one to two years; and if we compare the modes directly, we can notice that the difference is as much as 5 years.
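Normalizing within each class, and printing the per-class mean and median, might look like this in pandas (toy data; the column names are illustrative, and the video draws the normalized distributions with seaborn rather than printing them):

```python
import pandas as pd

# Toy data: age and default flag
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 40, 23, 28, 36, 44, 50],
    "default": [1, 1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Bin ages and normalize *within each class*, so the minority class
# is directly comparable with the majority class
df["age_bin"] = pd.cut(df["age"], bins=[20, 30, 40, 50])
dist = (df.groupby("default")["age_bin"]
          .value_counts(normalize=True)
          .rename("share"))
print(dist)

# Means and medians per class, as printed in the video
print(df.groupby("default")["age"].agg(["mean", "median"]))
```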
00:08:09
The next hypothesis is related to education. Let's first look in general at the age distribution for each education level. The degree holders, whom I nicknamed "academics", stand out: it's quite noticeable that their age is shifted to the right, towards larger values. You can also look at our medians and quantiles so as not to orient yourself by the chart alone; here the boxplots let us see the distribution as a whole, the outliers, and the central values. You can read everything in detail in the notebook itself, so we don't get stuck on this now.
Now, attention, the most interesting part: let's break it all down by age, by education, and by our default flag at once. We see a very interesting picture here, especially for the academics: the age of bad borrowers is slightly higher than that of good ones, but what's surprising is that the spread of age values for good borrowers is quite large. Most likely these are just minor deviations that are not critically important for the model. You can also print the exact numbers.
The next chart is more interesting. Here I again normalize, I repeat, by class size: by the number of objects with default equal to zero and by the number with default equal to one. What do we see? We see that this education feature will most likely influence our model more. On the other hand, the more people with completed higher education we meet, the more likely, as I understand it, they are to get a loan in the first place and, of course, to return it, which is good for the bank.
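The numbers behind the boxplots (medians and quartiles per education level and default flag) can be pulled out with a single groupby. The education codes below are made up for illustration:

```python
import pandas as pd

# Illustrative slice: age by education level and default flag
df = pd.DataFrame({
    "education": ["ACD", "ACD", "ACD", "SCH", "SCH", "SCH", "GRD", "GRD"],
    "default":   [0,     1,     0,     0,     1,     1,     0,     1],
    "age":       [48,    52,    45,    24,    22,    27,    33,    30],
})

# The statistics the boxplots visualize: median and quartiles per group,
# so you don't orient yourself by the chart alone
summary = (df.groupby(["education", "default"])["age"]
             .describe()[["25%", "50%", "75%"]])
print(summary)
```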
00:10:07
And of course you can always additionally look at the correlation between the various numerical features. In this case we actually don't see anything really interesting; there is one inverse relationship, but I don't think it will have much impact on our model.
00:10:29
Next comes the analysis of salaries. First, let's look at the distribution of salary for bad versus good borrowers. It is very hard to see anything here, because the spread is huge and, secondly, the feature itself is of course far from normally distributed. If we try to look at the boxplots we see, let's say, a similar picture, and it's quite difficult to evaluate anything. In this particular case I'd rather look at the values themselves, and we see that the difference in the means is somewhere around 10 thousand, which is quite significant. This tells us that salary can of course also potentially influence our model. Further, since our variable, or more precisely our feature, is not normally distributed, let's try to log-transform it. I will also try to break it all down into different categories from the point of view of education; plus, of course, it is all displayed here in an interesting chart of distributions, that is, what income distribution we get depending on education. What we observe, again, is interesting: among the academics, and "ASP" here, I'll say right away, means postgraduate students, this distribution again shifts upward. So potentially, of course, there is a suspicion that more good borrowers fall into these categories; but most likely, as it seems to me, the value of this feature itself will matter more.
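A minimal sketch of the log transform for a skewed salary-like feature, with a before/after skewness check; the data is synthetic (log-normal), standing in for the real salary column:

```python
import numpy as np
import pandas as pd

# Heavily right-skewed "salary" feature (synthetic, log-normal-like)
rng = np.random.default_rng(0)
salary = pd.Series(np.exp(rng.normal(10, 1, size=1000)))

# log1p handles zeros safely and pulls the long right tail in,
# as done for salary (and the other skewed features) in the video
salary_log = np.log1p(salary)

# Skewness should drop sharply after the transform
print(salary.skew(), salary_log.skew())
```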
00:12:33
The next stage is feature engineering. I highly recommend it, especially in a banking environment; even in telecom you can pull out various lags, that is, the previous value, the value for, say, the last month, or some averages over several months or by season. Let's see what we can do here. First of all, I would build distributions for my numerical data. I said earlier that salary confused me a little, and in this grid of charts I see that it's not alone: age and the number of credit-bureau requests are not normally distributed either, so of course it's better to try to normalize them with a log transform too. After that you can see what all this gave us: we see that everything has become more or less symmetric, so in this case it will be somewhat easier for the model to cope with these distributions.
Next, generating new features from dates. Guys, if you don't have a time series, then you don't need to break the date down into everything; in particular, don't take the year. Think about it: if you add the year, then imagine you had the years 2000 through 2020 and then 2021 appears. Your model simply can't handle it; it doesn't understand what this new year value is. So it is better to make features such as the month, the season, whether it is a working day or a day off, and much more. The main thing is not to tie yourself to values that, if you feed new data as input, would be new to your model. You can also make various average incomes: taking the region rating into account, taking age into account, taking the bureau score into account. Try to make at least three to five additional features yourself; maybe they will help you, maybe not, and this also all needs to be tested with and without them.
Well, and I save my categorical columns so that later I can dummy-encode them, and just in case I look again to check that I don't have a gap anywhere and what data types I have. Then we proceed to the modeling section.
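The date features recommended here (month, season, weekday/weekend instead of the raw year) could be derived like this; the `app_date` column name is an assumption for illustration, not necessarily the one in the dataset:

```python
import pandas as pd

# A datetime column like an application date
df = pd.DataFrame({"app_date": pd.to_datetime([
    "2020-01-15", "2020-04-03", "2020-07-20", "2020-11-29",
])})

# Don't extract the raw year (unseen years break the model);
# month, season and weekday/weekend generalize to new data
df["month"] = df["app_date"].dt.month
df["weekday"] = df["app_date"].dt.dayofweek        # 0 = Monday
df["is_weekend"] = (df["weekday"] >= 5).astype(int)
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
df["season"] = df["month"].map(season_of_month)
print(df)
```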
00:15:09
Now the modeling section, the most interesting part. As you can see, the exploratory data analysis before it can take a very large amount of time, so you should always take this into account and budget your time when you discuss with the business, for example, how much time you need. I binarize (one-hot encode) my categorical features, because today I will use logistic regression: since we have banking data, since this is credit scoring, this model suits us, considering that we can interpret it. Then you and I split our data into train and test; it is absolutely necessary to set stratify, especially since you have an imbalance here. Otherwise, if you don't set it, the ratio can skew somewhere: the minority class will be smaller in one split and larger in another, and the class balance won't be preserved.
The next stage is the baseline. What is a baseline? It is your model without any parameter tuning, as is. The only thing I added is class_weight: it automatically detects where we have an imbalance and adapts to it. You can also try oversampling and undersampling, but honestly they don't give any kind of wow effect. Of course you can test them, but based on my experience, plain oversampling just stupidly duplicates objects, and in principle the algorithms themselves cope well without it. We train the model, and for ROC-AUC be sure to remember to pass the predicted probabilities. Then let's look at our metric values.
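A compact sketch of the baseline stage as described: a stratified split, logistic regression with class_weight, and metrics computed from predicted probabilities. Synthetic data stands in for the one-hot encoded features of the real notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# Synthetic imbalanced stand-in for the scoring data (~10% defaults)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Stratified split keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Baseline: no tuning, only class_weight to compensate the imbalance
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# For ROC-AUC, pass probabilities, not hard labels
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)
auc = roc_auc_score(y_test, proba)
print("roc-auc:", auc)
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```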
00:16:46
Here is a very important point for the business; look, this needs to be discussed with them. If our mistake when issuing a loan costs us dearly, for example we gave a loan to a bad borrower and he did not return it, and this is a very expensive operation, then of course in that case it is better to focus on the recall metric. If we do not suffer such direct losses, and our losses are greater when we refuse a loan to a good person, that is, the more loans we issued, the more we earned, then it is better to look at the precision metric. Plus, of course, in this case we always look at ROC-AUC alongside our other metrics, to compare with the baseline and with other models: it simply shows, let's say, how well you predict class 1. I append the metrics to a separate dataframe so that I can compare it all later; I prepared this dataframe in advance.
00:17:48
To build my ROC curve I also display the ROC-AUC value. In principle, at least it is already above 0.5; with 0.5 our model would be purely random, so this is already encouraging.
00:18:05
The next process is the selection of our parameters. Here it is even desirable to take several different values per parameter right away. I warn you that the search can run for a long time, an hour, two hours, depending on the model and on your computer's hardware. So, taking this into account, I have already found the best parameters in advance and wrote them into a separate dictionary; now I simply pass them to the input of my logistic regression and train my new model, making sure to output the metrics.
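The parameter search itself, which the video skips by caching its result in a dictionary, would look roughly like this with GridSearchCV (small synthetic data so it runs in seconds; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Several values per parameter; on real data this can run for hours,
# which is why the notebook stores the found best parameters in a dict
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_)

# ...and later the best parameters are simply reused without re-searching
best_model = LogisticRegression(class_weight="balanced", max_iter=1000,
                                **search.best_params_).fit(X, y)
```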
00:18:39
Here I want to output the ROC curve that I had in the previous step, and the new one will be in green. By the way, in principle they practically lie on the same line, which means the tuning did not give our model that much. We can look at the metrics in the table; I have also added extra highlighting here, and our metrics have risen a little, which matters to us: recall rose quite well, precision not so much. Since we, for example, want to focus on precision, we can say it would be worth adding some additional parameters here and playing around with it. And of course, for your own practice you can also look at other models and compare them: trees, for example, or boosting. I think they would work even better.
00:19:33
The next stage, after you have chosen your model, in comparison with the baseline and in comparison with other models, is called the analysis of important features. Today we will look at the shap library: it is quite interesting and useful, but of course it is not necessary to always use it, and in particular it can be very hard for the business to read, so you will have to make some additional effort or footnotes so that they understand it. In a moment you will see why.
00:20:01
Since we have a linear model, I use the LinearExplainer: I pass in the model itself and my training data, then of course feed the test data to the explainer and display the summary plot itself. Let's understand, first of all, how it all works: this library computes feature importance via Shapley values, which are calculated separately for each feature; there are statistical techniques behind it. Now, how to read this plot. Red means the value of the feature is high, blue means it is low. The horizontal axis, as it were, points at our target: in this direction, closer to one, and on this side, closer to zero, where zero is our good borrowers and one is our bad borrowers. And by the way, the features are arranged in order of decreasing importance.
The first, most important feature is the credit-bureau score: the bureau, guys, assigns each client a credit score. Reading the plot, it turns out that the higher (redder) this feature is, the higher the probability that the borrower will default; the lower the bureau score, the more likely it is that the borrower is good and will repay the loan. What's interesting is that education landed in the top 2, exactly what I told you about: the higher the value of this education dummy, when it equals one, the higher the probability that the borrower is bad, which is exactly the preliminary conclusion we drew; this was shown by age, this was shown by education. And the lower it is, when it equals 0, the borrower tends to be good.
The region rating is also very interesting: the lower the rating of the region, the higher the probability that the borrower is bad, and the higher the value of this indicator, the higher the probability that the borrower is good. Apparently the region is also indicative of whether a borrower is good or bad. We also got a semi-interesting month feature, and you can examine the other indicators in the same way.
00:22:29
And by dumping the coefficients of your logistic regression, you can look at their top as well; I saved it into a separate dataframe. What's interesting is that the top two features coincide with shap, but if you look carefully, in fourth position there is a different education type. They may differ in places because the approach to computing feature importance is slightly different; this is normal, but here you need to keep in mind that these weights come from the model itself. So you can, in principle, combine these two views of the important features and use them to orient yourself further. By the way, I just got curious about the bureau score values: we see that even the median score for good borrowers turns out to be lower than for bad ones, which fits. So you can basically combine all of this and look at it together.
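Dumping the logistic regression coefficients into a dataframe sorted by absolute weight, as described, might look like this; the data and feature names are synthetic:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small model and dump its coefficients, sorted by absolute weight
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
feature_names = [f"f{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = (pd.DataFrame({"feature": feature_names,
                       "weight": model.coef_[0]})
           .assign(abs_weight=lambda d: d["weight"].abs())
           .sort_values("abs_weight", ascending=False))
print(coefs)
```

Note the caveat from the video: these weights come from the model itself, so their ranking can legitimately differ from the Shapley-value ranking.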
00:23:28
You can also look at the numerical characteristics. I always highly recommend: when you have found the important features, just take them from the dataset and compute, for example, for the numerical features the various means and quantiles, to look at the exact differences; these charts can be useful to you later. If we are talking about categorical variables, you can likewise look at how often particular values occur, which is also quite interesting: firstly, for your own understanding of the data, and of course to help the business, because not all processes, let's say, consist of a machine learning model. Somewhere this particular information can help in forming, for example, various business rules, and it will be quite interesting for them too.
That's all. I really hope you liked this breakdown of a data science problem, so if you would like some other new breakdowns and other tasks, be sure to write comments under this video. See you soon, everyone, bye!

Description:

Author's Data Science course for beginners: https://pymagic.ru/
Code on Boosty: https://boosty.to/miracl6/posts/96b1fe98-d4ea-455c-b1a1-670a2b90668b?share=post_link
We break down a Data Science credit scoring problem using a logistic regression model. We learn how to approach exploratory data analysis (EDA) properly, how to train an ML model, and how to interpret the results for the business.
New VK group: https://vk.com/pymagic
Dataset: https://www.kaggle.com/c/sf-dst-scoring
Timecodes:
00:00 The credit scoring task
00:25 What to do before building the model
00:44 Data loading and preliminary analysis
04:07 The main trick of EDA! How to do EDA properly
05:29 Working through hypotheses
06:13 Analyzing the target variable / class imbalance
07:11 First hypothesis: age distribution by target (seaborn), normalizing the data
08:11 Second hypothesis: age distribution by education / boxplot
10:10 Feature correlation
10:30 Third hypothesis: salary analysis by target / education
12:33 Feature engineering: how to do it, what new features are possible, handling datetime features, log transforms
15:07 Building the ML model. Stage 1: the baseline (Logistic Regression)
16:46 How to interpret and use the precision, recall and ROC-AUC metrics
17:41 Plotting the ROC curve
18:03 Model parameter tuning with GridSearch
18:40 Comparing results on the ROC-AUC plot / analyzing the metrics
19:32 Analyzing important features after training the model
20:02 Using the shap library to analyze important features / interpreting the results
22:30 Logistic regression coefficients
23:27 Comparing important features across classes (visualizing the differences)
Instagram*: https://www.facebook.com/unsupportedbrowser
VK group: https://vk.com/pymagic
Telegram: https://t.me/pymagic
*Meta is an organization whose activities are banned in the Russian Federation

* — If the video is playing in a new tab, go to it, then right-click on the video and select "Save video as..."
** — Link intended for online playback in specialized players

Questions about downloading video

How can I download "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python" video?

  • The http://unidownloader.com/ website is the simplest way to download a video or a separate audio track if you want to avoid installing programs and extensions.

  • The UDL Helper extension is a convenient button seamlessly integrated into the YouTube, Instagram and OK.ru sites for fast content downloads.

  • The UDL Client program (for Windows) is the most powerful solution: it supports more than 900 websites, social networks and video hosting sites, as well as any video quality available in the source.

  • UDL Lite is a convenient way to access the website from your mobile device. With it, you can easily download videos directly to your smartphone.

Which format of "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python" video should I choose?

  • The best quality formats are FullHD (1080p), 2K (1440p), 4K (2160p) and 8K (4320p). The higher the resolution of your screen, the higher the video quality should be. However, there are other factors to consider: download speed, amount of free space, and device performance during playback.

Why does my computer freeze when loading a "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python" video?

  • The browser/computer should not freeze completely! If this happens, please report it along with a link to the video. Sometimes a video cannot be downloaded directly in a suitable format, so we have added the ability to convert the file to the desired format. In some cases, this process may place a heavy load on your computer's resources.

How can I download "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python" video to my phone?

  • You can download a video to your smartphone using the website or the UDL Lite PWA application. It is also possible to send a download link via QR code using the UDL Helper extension.

How can I download an audio track (music) to MP3 from "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python"?

  • The most convenient way is to use the UDL Client program, which supports converting video to MP3 format. In some cases, MP3 can also be downloaded through the UDL Helper extension.

How can I save a frame from the video "Data Science пример задачи кредитного скоринга / Урок построения модели ML на python"?

  • This feature is available in the UDL Helper extension. Make sure that "Show the video snapshot button" is checked in the settings. A camera icon should appear in the lower right corner of the player to the left of the "Settings" icon. When you click on it, the current frame from the video will be saved to your computer in JPEG format.

What's the price of all this stuff?

  • It costs nothing. Our services are completely free for all users. There are no PRO subscriptions and no restrictions on the number or maximum length of downloaded videos.