
"I Made a REAL JARVIS! | Siri and Alice Are No Longer Needed :3" (original title: "Я сделал НАСТОЯЩЕГО ДЖАРВИСА! | Siri и Алиса больше не нужны :3")

Table of contents

0:00
Start (the idea)
0:30
How it will work
1:10
Kesha, open source
1:45
First modification (removing the delay)
2:30
How sound works
3:50
How a neural network works with sound
5:00
How to improve the code?
5:45
Wake Word Detection (activation phrase)
6:40
First test of the wake word modification
7:05
How Picovoice and Vosk work
8:50
First test of commands
9:25
Second modification (command chaining)
10:05
Test of the second modification
10:40
Real benefit from Jarvis
12:25
Field test of Jarvis
13:25
Hooking up ChatGPT
14:50
Test of Jarvis's artificial brain
15:15
Speech synthesis via VoiceMod (don't laugh)
16:15
Test of synthesis via VoiceMod
17:15
Summary
Video tags

python
python for beginners
python lessons
voice assistant
siri
kortana
jarvis
iron man
voice recognition
python django
lesson
for beginners
alice
tony stark
python tutorial
learn python
python tutorial for beginners
neural networks
ai
artificial intelligence
chat gpt
chat gpt 4
speech synthesis
ai plays games
made a game
real jarvis
neural network
howdy
ho
howdy ho
Subtitles

00:00:02
Jarvis, the voice assistant from the film
00:00:04
Iron Man. And yes, I know the idea
00:00:08
is far from new: after the film came out,
00:00:10
just about everyone tried to build the
00:00:12
same assistant. However, all those
00:00:14
programmers ran into the same
00:00:16
problem that once faced
00:00:17
Tony Stark's father: they were
00:00:20
limited by the technologies of their time.
00:00:29
But we live in a wonderful future
00:00:33
where neural networks are highly developed. So I
00:00:36
sat and thought: why not combine the
00:00:38
most powerful neural networks into a single whole and
00:00:41
thus create a real copy of
00:00:44
Jarvis from the movie? Moreover, I want him
00:00:46
to be able to do something besides
00:00:49
dumb commands like "open the browser"
00:00:52
or "change the volume". For this I want to
00:00:54
build a brain into him, namely the
00:00:57
ChatGPT neural network, along with speech synthesis
00:01:00
that sounds as close as possible to a human.
00:01:02
And of course we will also make it possible
00:01:05
to add custom commands, so that our
00:01:08
Jarvis is not just a toy but actually
00:01:10
helps solve real tasks. I even
00:01:13
know where and on what to test him. Well,
00:01:17
those who have been watching the channel for a long time know that I have
00:01:19
already made a voice assistant,
00:01:21
Kesha. We will take it as a basis, or rather,
00:01:25
we will modify Kesha's code. And I will say
00:01:28
right away that this will be an open source project; I will not
00:01:31
sell it or ask for money for it.
00:01:33
Anyone who wants can
00:01:35
download it and modify it however they like.
00:01:37
I think that is the right thing to do, and you will all
00:01:40
agree with that. In short, let's
00:01:43
get started.
00:01:44
The first thing I did, of course, was
00:01:47
take Kesha off the shelf and blow the dust off it, or
00:01:50
rather, I downloaded the project from GitHub locally
00:01:53
and opened it in PyCharm. To be honest,
00:01:55
in all this time I have returned to
00:01:58
Kesha only once, namely to
00:02:00
add the Silero TTS neural network,
00:02:03
which synthesizes speech and does it, I
00:02:05
must say, as close as possible to a
00:02:07
human. Well, the first modification
00:02:09
I decided to make to Kesha is to almost
00:02:12
completely remove any delay in
00:02:15
recognizing commands, so that as soon as you
00:02:18
say a command it is executed at that same moment
00:02:20
and you do not have to wait
00:02:22
3-4 seconds while the
00:02:25
assistant's sluggish engine figures out what you want from it.
00:02:27
For this there are two algorithms and,
00:02:31
accordingly, two types of neural networks:
00:02:33
VAD and wake word. So that you understand what
00:02:37
these are and why they are needed,
00:02:39
let me first quickly explain something. The
00:02:41
fact is that neural networks do not
00:02:44
understand what sound is and cannot
00:02:46
hear. All that neural networks understand is
00:02:49
numbers. But how do we convert the sound that
00:02:53
we hear into numbers on a computer screen? The
00:02:55
answer is very simple: vibrations. From
00:02:59
ninth-grade physics we all know that the
00:03:01
amplitude of sound vibrations is directly
00:03:03
related to loudness. We
00:03:06
will not go into detail now about what
00:03:08
membranes are and how sound is captured;
00:03:10
we all have a microphone, and that is
00:03:13
enough for us. We are more interested in
00:03:15
how to express all this in code. I think everyone has at least
00:03:18
once in their life seen such a thing
00:03:20
as an audio waveform; roughly speaking, it is a
00:03:23
graphical representation of an audio signal.
00:03:25
As you can see, time runs along one axis, and along the
00:03:28
Y axis is the loudness of the sound,
00:03:31
or its amplitude. The amplitude of
00:03:34
any audio signal can be computed
00:03:36
knowing only its sampling rate, and
00:03:39
then it can easily be converted to a
00:03:42
logarithmic form or, put simply, to a set of
00:03:44
zeros and ones that a neural network can and
00:03:46
will work with. Brilliant.
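The idea that sound is just a sequence of numbers can be sketched in a few lines of Python. This is an illustration only, not code from the project: the 16 kHz sampling rate, the 440 Hz test tone and all function names are assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed; a common rate for speech)


def sine_wave(freq_hz, duration_s, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return float samples in [-1.0, 1.0] for a pure tone."""
    count = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(count)]


def to_pcm16(samples):
    """Quantize float samples to signed 16-bit integers, clamping to range."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


def rms(samples):
    """Root-mean-square level: a single number summarizing loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def to_decibels(level, reference=1.0):
    """Express a positive level on a logarithmic (decibel) scale."""
    return 20 * math.log10(level / reference)


tone = sine_wave(440, 0.01)   # 10 ms of a 440 Hz tone
pcm = to_pcm16(tone)          # the integers a recognizer would actually receive
print(len(pcm), pcm[:5])
print(round(to_decibels(rms(tone)), 1), "dB relative to full scale")
```

A real microphone stream has the same shape: a long list of integers, usually processed in short chunks.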
00:03:49
Neural networks that can
00:03:52
convert sound into text do so
00:03:54
precisely with the help of these zeros and ones.
00:03:57
You say something, and a
00:03:59
certain segment of sound is recorded from the microphone, say
00:04:02
1 second. This sound is converted into a
00:04:05
digital representation, and then this
00:04:08
data is fed into a pre-trained
00:04:10
neural network that outputs
00:04:13
words and even entire sentences. However, the
00:04:16
big problem here is
00:04:18
that there are a great many words.
00:04:20
The Russian language, for example, now has about 22,000 words,
00:04:23
and the
00:04:27
neural network must recognize them all.
00:04:30
Believe me, this is far from an easy task
00:04:32
even for ultra-modern computers.
00:04:35
Just so you understand, even the
00:04:38
fastest speech recognition network for
00:04:40
Russian that I know of does it
00:04:43
in 500 milliseconds, or half a second.
00:04:45
Add the pronunciation time on top, plus
00:04:48
idle time, plus processing, and you
00:04:50
get a delay of at least 3-4
00:04:53
seconds before the speech turns into
00:04:55
a command and the command is
00:04:57
executed. This, it seems, is the
00:04:59
main problem of all
00:05:01
existing Jarvis implementations: they are all
00:05:04
slow and have a large delay.
00:05:07
So let's think about what we can
00:05:09
do to fix this.
00:05:12
Perhaps the first thing that comes to mind is to
00:05:15
stop listening to silence, because
00:05:17
obviously silence contains no words and there is
00:05:20
nothing to recognize in it. This is why
00:05:22
algorithms such as Voice
00:05:24
Activity Detection exist:
00:05:26
they let you separate silence and noise from
00:05:30
actual speech that can be
00:05:31
recognized. But even so, this does not solve the
00:05:34
problem completely, since the algorithms do not
00:05:37
always understand where the noise is and where the speech is.
00:05:40
If I mumble into the microphone, is that
00:05:42
noise or speech? Nothing is clear, but it is very
00:05:45
interesting. So a
00:05:47
smarter approach was invented, called
00:05:50
wake word detection, which
00:05:52
lets you pick out the activation phrase from the whole audio stream,
00:05:55
and this is exactly
00:05:57
what we need. See for yourself: the
00:06:00
activation phrase is always one
00:06:03
or at most 2-3 words that a person
00:06:05
pronounces very quickly. For example, for Siri
00:06:08
on an iPhone the activation phrase is "Hello,
00:06:11
Siri"; for Yandex Alice it seems to be just
00:06:13
the word "Alice", and so on. As you
00:06:17
understand, it is hundreds of times easier to teach a
00:06:19
neural network to quickly recognize just
00:06:22
one word than to recognize 22,000 words
00:06:25
and then check along the way whether the
00:06:28
spoken word is the
00:06:29
activation word. Moreover, you will
00:06:31
be surprised, but to detect the activation
00:06:34
phrase in the whole audio stream we
00:06:36
only need segments of 30 ms. That is, if
00:06:39
anything, 33 times shorter than a second. I even
00:06:43
managed to test this in code.
00:06:46
As the neural network I took Picovoice
00:06:48
wake word detection, and now when I say the
00:06:52
keyword "Jarvis", our program
00:06:54
reacts with lightning speed.
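To make the 30 ms figure concrete, here is a minimal sketch of cutting an audio stream into such segments and scanning them one by one. It does not use the Picovoice API: `is_wake_frame` is a hypothetical stand-in for whatever engine scores each frame, and the 16 kHz sampling rate is an assumption.

```python
SAMPLE_RATE = 16_000  # Hz (assumed)
FRAME_MS = 30         # segment length mentioned in the video


def frame_length(sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Number of samples in one frame."""
    return sample_rate * frame_ms // 1000


def split_into_frames(samples, size):
    """Yield consecutive full frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def detect_wake_word(frames, is_wake_frame):
    """Return the index of the first frame the detector flags, or None."""
    for index, frame in enumerate(frames):
        if is_wake_frame(frame):
            return index
    return None


# One second of silence with a loud burst placed in the sixth frame.
size = frame_length()
stream = [0] * SAMPLE_RATE
stream[5 * size] = 20_000

frames = list(split_into_frames(stream, size))
loud = lambda frame: max(abs(s) for s in frame) > 10_000  # toy detector
print(size, len(frames), detect_wake_word(frames, loud))
```

Because each frame is so short, the detector gets a chance to react about 33 times per second, which is where the near-instant response comes from.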
00:06:58
Jarvis
00:07:00
Jarvis
00:07:04
Jarvis
00:07:06
And yes, as you noticed, I lifted clips of
00:07:10
Jarvis's voice from the film. It seems to me that
00:07:12
this is much more impressive. Well, then
00:07:15
I began to rewrite Kesha's code, which
00:07:18
from now on will be Jarvis.
00:07:20
In general, the activation phrase can be
00:07:22
anything, and the sounds that our
00:07:25
assistant plays can also be anything, so
00:07:27
in principle this is not a problem. What
00:07:30
I want to draw your attention to is
00:07:33
how I combined Picovoice with Vosk. If
00:07:35
anything, Vosk is a neural network that
00:07:37
recognizes Russian speech. Both
00:07:40
networks always use the same
00:07:42
microphone for recording, but the first one to
00:07:45
run is always Picovoice, which catches the
00:07:47
activation phrase in 30-millisecond
00:07:49
segments. As soon as the phrase is found,
00:07:52
the playsound library
00:07:54
plays the appropriate Jarvis phrase, or
00:07:57
rather one of several, so that there is
00:07:58
some variety, as if Jarvis
00:08:00
says "Yes, sir", "I'm listening", and so on. During this
00:08:03
time, by the way, microphone recording
00:08:05
stops, so that if Jarvis
00:08:08
says something through external
00:08:09
speakers, he does not
00:08:12
accidentally record himself and then
00:08:14
try to listen to himself. Immediately after
00:08:16
this, recording is turned on again, and
00:08:19
here Vosk kicks in, which records
00:08:21
much larger segments and tries to
00:08:24
recognize a command in them. When the command has been
00:08:26
spoken, Vosk runs inference on it and
00:08:28
passes it on for processing. In the
00:08:31
handler functions, the Levenshtein distance algorithm is used,
00:08:33
which helps
00:08:36
us recognize the command even if it
00:08:38
was not said entirely accurately. In the
00:08:40
end, if one of the
00:08:43
existing commands was spoken, Jarvis
00:08:45
executes it and reports success.
00:08:47
If not, he says something like
00:08:49
"What do you want from me?"
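The fuzzy matching step can be illustrated with a small, self-contained sketch. This is not the project's handler code: the command list, the `max_ratio` threshold and the function names are assumptions, and the edit distance is implemented by hand rather than taken from a library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


COMMANDS = ["open browser", "volume up", "game mode"]  # hypothetical commands


def match_command(heard, commands=COMMANDS, max_ratio=0.34):
    """Return the closest known command, or None if nothing is close enough."""
    heard = heard.strip().lower()
    best = min(commands, key=lambda command: levenshtein(heard, command))
    distance = levenshtein(heard, best)
    if distance <= max_ratio * max(len(heard), len(best)):
        return best
    return None


print(match_command("open browsr"))          # slightly misrecognized command
print(match_command("what is the weather"))  # no command is close enough
```

Returning `None` for distant input corresponds to the fallback reply in the video, where Jarvis asks what is wanted of him.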
00:09:11
[music]
00:09:19
Jarvis, well done.
00:09:24
As for me, it is already good, but
00:09:28
there is definitely room to improve. The first upgrade
00:09:30
that needs to be made is to remove
00:09:32
the need to constantly
00:09:34
say the word "Jarvis" before each command. It is
00:09:36
annoying, and besides, it makes it
00:09:38
impossible to chain commands. To that
00:09:40
end I slightly rewrote the listening logic
00:09:43
in the code, and now when we say
00:09:45
"Jarvis", he responds and listens for a command.
00:09:47
After the command is spoken, he
00:09:50
will
00:09:53
keep listening for further commands for about 10-15 seconds, and only
00:09:55
if we do not say another
00:09:57
command in that time will he ask us to
00:09:59
say the word "Jarvis" again. It seems to me this
00:10:02
is much more convenient. Well, here is
00:10:04
how the improved version actually works.
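One way to model this chaining behaviour is a small session object that keeps a deadline. The sketch below is an assumption about how such logic could look, not the code shown in the video: the class name, the 15-second default and the choice to restart the window after every command are all illustrative.

```python
import time


class ListeningSession:
    """Tracks whether commands are accepted without repeating the wake word."""

    def __init__(self, window_s=15.0, clock=time.monotonic):
        self.window_s = window_s
        self._clock = clock
        self._deadline = None  # None means "waiting for the wake word"

    def on_wake_word(self):
        """Open the listening window when the activation phrase is heard."""
        self._deadline = self._clock() + self.window_s

    def on_command(self):
        """Restart the window after a command handled while it was open."""
        if self.is_open():
            self._deadline = self._clock() + self.window_s

    def is_open(self):
        """True while follow-up commands are accepted without the wake word."""
        return self._deadline is not None and self._clock() < self._deadline


# Simulated timeline with a fake clock, so the example runs instantly.
now = [0.0]
session = ListeningSession(window_s=15, clock=lambda: now[0])
session.on_wake_word()   # t = 0: "Jarvis"
now[0] = 10
session.on_command()     # t = 10: a command restarts the window
now[0] = 20
print(session.is_open()) # still inside the restarted window
now[0] = 36
print(session.is_open()) # window expired, wake word needed again
```

Passing the clock in as a parameter keeps the timing logic testable without real waiting.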
00:10:07
[music]
00:10:23
Well done.
00:10:26
Now it works much better, but there is
00:10:32
still no practical benefit from Jarvis at all.
00:10:34
"Open the browser", "change the volume": these are the
00:10:37
dumbest commands, which nobody
00:10:39
needs. Jarvis should be useful,
00:10:43
so I decided to teach him
00:10:44
to automate some actions on the
00:10:46
computer. The simplest thing I came up with
00:10:48
is to make my life easier when I
00:10:52
want to play games on the TV.
00:10:53
Look, the thing is, this here is my
00:10:57
computer, but I do not like playing games
00:10:59
while sitting at the computer; it really
00:11:01
annoys me. By the way, write in
00:11:03
the comments if you also
00:11:05
prefer to play on a big
00:11:07
comfortable TV instead. But in order to
00:11:10
do this, you first have to go through a whole quest:
00:11:13
turn on the TV, then on the computer
00:11:15
switch the active monitor to the
00:11:17
TV, then, still standing at the computer
00:11:19
since the mouse is here,
00:11:21
switch the audio output with the mouse so that the sound
00:11:24
goes to the TV and not to the computer, and
00:11:26
finally launch Steam Big Picture, after
00:11:28
which you can finally pick up the
00:11:31
gamepad and start playing. As you can
00:11:34
understand, doing all this every time is
00:11:36
far from the most pleasant process, so
00:11:39
let Jarvis do it for me; then there
00:11:41
will be at least some real benefit from him.
00:11:43
For this I have already written a
00:11:46
small script in AutoHotkey that
00:11:49
does all this without my involvement, and
00:11:52
Jarvis will run this script.
00:11:55
You might ask: so what is the point of
00:11:57
Jarvis if he just runs a
00:11:59
script? The trick is that I do
00:12:03
not need to do anything at all with Jarvis. I will just
00:12:05
say something like
00:12:07
"game mode", and he
00:12:09
will do everything himself and wish me good luck in the game. And when
00:12:11
I finish playing, again I will just
00:12:14
need to say "Jarvis,
00:12:16
return to work mode" or
00:12:18
"Jarvis, put everything back as it was", and he will
00:12:21
do everything again. For this I do not
00:12:24
have to pick up the mouse at all,
00:12:26
just say the right command out loud. In
00:12:28
short, it is easier to show than to explain.
00:12:31
Look how awesome it works in
00:12:33
practice.
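Binding a spoken phrase to an external script can be sketched as a simple lookup table. Everything specific here is hypothetical: the phrase names, the `.ahk` file names and the assumption that `AutoHotkey.exe` is on the PATH are placeholders, not details from the video.

```python
import subprocess

# Hypothetical mapping from a recognized phrase to the program to launch.
ACTIONS = {
    "game mode": ["AutoHotkey.exe", "game_mode.ahk"],
    "work mode": ["AutoHotkey.exe", "work_mode.ahk"],
}


def run_action(phrase, actions=ACTIONS, runner=subprocess.run):
    """Launch the script bound to `phrase`; return False if none is bound."""
    argv = actions.get(phrase)
    if argv is None:
        return False
    runner(argv, check=False)
    return True


def dry_run(argv, **kwargs):
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(argv))


print(run_action("game mode", runner=dry_run))
print(run_action("make coffee", runner=dry_run))
```

Swapping `dry_run` for the default `subprocess.run` would actually start the script, so the voice layer only has to produce the right phrase.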
00:12:50
The engine is turned off.
00:13:05
Jarvis
00:13:06
Well done.
00:13:11
Jarvis
00:13:14
Go back to the computer.
00:13:20
[music]
00:13:28
And now imagine that Jarvis
00:13:31
is always listening for commands, and whenever I
00:13:34
want to play games I will
00:13:37
just say so
00:13:39
and my voice assistant will do everything for
00:13:42
me. That is damn cool.
00:13:50
Wow. But that is not all. I decided to go
00:13:54
further and connect a
00:13:56
real artificial brain to Jarvis, namely
00:13:59
to hook ChatGPT up to him. Besides,
00:14:03
I had already worked with the ChatGPT API, and
00:14:06
it causes no problems,
00:14:08
except that Russian is not officially
00:14:10
supported by the developers, so
00:14:12
in the code I have to translate the recorded speech
00:14:15
into English, then
00:14:17
feed the request in English to ChatGPT,
00:14:19
and only then, when
00:14:21
the answer comes back, translate it back into
00:14:24
Russian and actually play it. As you
00:14:26
probably understand, all this takes quite a lot of
00:14:29
time, and on top of that,
00:14:32
speech synthesis also takes time. So answers to
00:14:35
arbitrary questions currently
00:14:37
take Jarvis about 5-7 seconds. Yes, I understand
00:14:40
that is slow, and I even have ideas on how to
00:14:43
speed it all up at
00:14:45
least twofold, but I will do that
00:14:47
later. For now let's see what
00:14:50
we got. Jarvis,
00:14:54
tell me, what is 2 + 2?
00:15:02
Two plus two is four.
00:15:05
Tell me how many colors are in the rainbow.
00:15:11
The rainbow has 7 colors: red, orange,
00:15:15
yellow, green, blue, indigo and purple.
00:15:18
By the way, for speech synthesis I use the
00:15:21
same Silero TTS neural network. It
00:15:23
generates a voice as
00:15:25
close as possible to a human one, which is
00:15:27
exactly what we need. But that is not
00:15:30
all. I decided to go even further and
00:15:33
try to make the synthesized speech sound the
00:15:35
same as Jarvis's, or at
00:15:38
least make the answer sound as if
00:15:40
it were spoken by a robot. Only let's
00:15:43
agree right now that you will not
00:15:45
laugh, because all I came up with
00:15:47
is to use the Voicemod program. If
00:15:50
anyone does not know, this is a very popular
00:15:52
program for changing your voice that is
00:15:54
usually used in Discord or
00:15:56
various online games. In the end it
00:15:58
turned out that I connected Jarvis
00:16:01
with GPT, whose answers are synthesized by
00:16:04
Silero TTS, whose voice is filtered through
00:16:07
Voicemod, and running this whole
00:16:10
seemingly dubious scheme
00:16:12
involves four neural networks at once. Sounds crazy, but
00:16:15
it works.
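The chain described above (translate to English, ask the chat model, translate back, synthesize) is a sequence of stages whose delays add up. The sketch below shows only that structure, with per-stage timing. The stage functions are stand-ins that tag the text; they are not real calls to ChatGPT, a translation service or Silero TTS, and all names are assumptions.

```python
import time


def run_pipeline(text, stages, clock=time.perf_counter):
    """Pass `text` through each (name, function) stage in order, timing each."""
    timings = {}
    value = text
    for name, stage in stages:
        started = clock()
        value = stage(value)
        timings[name] = clock() - started
    return value, timings


# Stand-in stages: each one just wraps its input so the order is visible.
STAGES = [
    ("to_english", lambda text: f"en({text})"),
    ("chat", lambda text: f"answer({text})"),
    ("to_russian", lambda text: f"ru({text})"),
    ("synthesize", lambda text: f"audio({text})"),
]

result, timings = run_pipeline("question", STAGES)
print(result)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

With real services in place of the stand-ins, the per-stage timings would show which step contributes most to the 5-7 second delay mentioned in the video.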
00:16:19
[music]
00:16:20
Jarvis,
00:16:23
tell me who a samurai is.
00:16:31
A samurai was a well-trained warrior of
00:16:34
feudal Japan who followed
00:16:36
strict
00:16:38
they were known for their courage and
00:16:41
skill.
00:16:45
Tell
00:16:46
me a joke. What
00:16:52
choice said when
00:16:58
Say, that is not funny at all.
00:17:04
I understand that this is not funny to you.
00:17:11
You're stupid.
00:17:18
Well, that is what we got in the end.
00:17:21
Of course, Jarvis can still
00:17:23
play music, save tracks,
00:17:25
open the browser and do all that
00:17:27
nonsense. But the most important thing is that he has
00:17:30
intelligence: he can answer questions and
00:17:33
even do something useful. The entire
00:17:35
source code of Jarvis I will, of course, upload
00:17:38
to GitHub; I will leave a link in the description
00:17:40
under the video. You can poke around in it yourself and
00:17:43
maybe even actually use it. At
00:17:45
least I will use it.
00:17:46
The main thing is not to forget that this is still a
00:17:49
very, very rough prototype, and you certainly should not expect
00:17:52
it to discover a new chemical
00:17:54
element, although who knows.
00:17:57
Whether I go on to turn it into a
00:18:00
really convenient and extensible
00:18:02
program will depend on you
00:18:05
and on your support. So leave
00:18:07
your royal likes on this episode, and if
00:18:10
it gets at least 20,000 likes, I
00:18:13
promise to release a second part about
00:18:15
Jarvis, where I will try to make
00:18:17
working software that you can simply
00:18:20
download and use, perhaps even
00:18:22
on a phone too, though that is not certain. And of
00:18:25
course I will also fix all the shortcomings and
00:18:28
bugs. And do not forget to subscribe to the
00:18:30
channel and turn on the notification bell
00:18:32
so you do not miss new episodes on the
00:18:34
channel. And of course write comments;
00:18:36
all this helps promote the video on
00:18:39
YouTube. So, guys, I hope you don't
00:18:41
fail in the rest. Good luck, and always
00:18:44
remember: programming can and should
00:18:47
be interesting.
00:18:52
[music]

Description:

It never happened before, and here we go again!) I made a real Jarvis by combining ChatGPT + SileroTTS + Voicemod.
Source code: I will publish it in our Telegram channel - https://t.me/howdyho_official (subscribe ;)
Our Telegram: https://t.me/howdyho_official
Our VK: https://www.vk.com/howdyho_net
Collaboration: https://vk.com/topic-84392011_33285530
Music provided by YouTube Audio Library.
