I Made a REAL JARVIS! | Siri and Alice Are No Longer Needed :3 (original title: "Я сделал НАСТОЯЩЕГО ДЖАРВИСА! | Siri и Алиса больше не нужны :3")


0:00 Start (the idea)
0:30 How it will work
1:10 Kesha, Open Source
1:45 First modification (removing the delay)
2:30 How sound works
3:50 How a neural network works with sound
5:00 How to improve the code?
5:45 Wake Word Detection (the activation phrase)
6:40 First test of the Wake Word modification
7:05 How Picovoice and Vosk work
8:50 First test of commands
9:25 Second modification (command chaining)
10:05 Test of the second modification
10:40 Real usefulness of Jarvis
12:25 Field test of Jarvis
13:25 Hooking up ChatGPT
14:50 Test of Jarvis's artificial brain
15:15 Speech synthesis through VoiceMod (don't laugh)
16:15 Test of synthesis through VoiceMod
17:15 Summary



00:00:02

Jarvis, the voice assistant from the film Iron Man. Yes, I know the idea is far from new: after the film came out, only the lazy didn't try to create the same assistant. However, all those programmers ran into the same problem that Tony Stark's father once faced: they were limited by the technologies of their time.

00:00:29

We, however, live in a wonderful future where neural networks are highly developed. So I sat and thought: why not combine the most powerful neural networks into a single whole and thus create a real copy of Jarvis from the movie? Moreover, I want him to be able to do something besides moronic commands like "open the browser" or "change the volume". For this I want to build a brain into him, namely the ChatGPT neural network, with speech synthesis as close as possible to a human voice. And of course we will also make it possible to add custom commands, so that our Jarvis is not just a toy but actually helps solve some problems. I even know where and on what to test it.

00:01:17

Those who have been watching the channel for a long time know that I have already made a voice assistant, Kesha. We will take it as a basis, or rather, we will modify Kesha's code. I will say right away that this will be an Open Source project: I will not sell it or ask for money for it. Anyone who wants can download it and modify it however they like. I think this is the right thing to do, and you will all agree with it. In short, let's get started.

00:01:44

The first thing I did, of course, was take Kesha off the shelf and blow the dust off it, or rather, I downloaded the project from GitHub locally and opened it in PyCharm. To be honest, in all this time I returned to Kesha only once, namely to add the Silero TTS neural network, which synthesizes speech and does it, I must say, as close to a human as possible.

00:02:07

The first modification I decided to make to Kesha is to almost completely remove any delay in recognizing commands, so that a command is executed the moment you say it and you don't have to wait 3-4 seconds until the sluggish assistant figures out what you want from it. For this there are two algorithms and, accordingly, two types of neural networks: VAD and wake word. So that you understand what they are and why they are needed, let me first quickly explain something.

00:02:41

The fact is that neural networks do not understand what sound is and cannot hear. All that neural networks understand is numbers. But how do we convert the sound we hear into numbers on a computer screen? The answer is very simple: vibrations. From 9th-grade physics we all know that the amplitude of sound vibrations is directly related to its volume. We will not go into detail about what membranes are and how sound is captured: we all have a microphone, and that is enough for us. We are more interested in how to write all this down in code.

00:03:15

I think everyone has seen a sound oscillogram (waveform) at least once in their life. Roughly speaking, it is a graphical representation of an audio signal. As you can see, along the time scale, the Y axis shows the loudness of the sound, that is, its amplitude. So the amplitude of any audio signal can be calculated knowing only its sampling frequency, and then it can easily be converted to a logarithmic form or, put simply, to a set of zeros and ones that a neural network can and will work with. Brilliant.
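To make the "sound is just numbers" idea concrete, here is a minimal sketch. It assumes 16-bit PCM samples at a 16 kHz sampling rate (both values are illustrative assumptions, not taken from the video) and synthesizes a tone instead of reading a real microphone. All function names are hypothetical.

```python
import math
import struct

SAMPLE_RATE = 16_000  # samples per second (assumed for illustration)


def sine_wave(freq_hz, seconds, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a list of 16-bit integer samples for a pure tone."""
    n = int(seconds * sample_rate)
    peak = int(amplitude * 32767)
    return [round(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
            for i in range(n)]


def to_pcm_bytes(samples):
    """Pack samples as little-endian signed 16-bit integers (raw PCM)."""
    return struct.pack(f"<{len(samples)}h", *samples)


def from_pcm_bytes(data):
    """Unpack raw PCM bytes back into a list of integers."""
    return list(struct.unpack(f"<{len(data) // 2}h", data))


def rms(samples):
    """Root-mean-square amplitude: a common measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dbfs(samples):
    """Loudness on a logarithmic scale, relative to full scale (0 dBFS)."""
    return 20 * math.log10(rms(samples) / 32767)


samples = sine_wave(440, 1.0)   # one second of a 440 Hz tone
raw = to_pcm_bytes(samples)     # the bytes a recording would contain
```

One second of audio becomes 16,000 integers (32,000 bytes), and those integers are what a speech model ultimately consumes, usually after further feature extraction.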

00:03:49

Neural networks that convert sound into text do it precisely with the help of these zeros and ones. You say something, and a certain segment of sound is recorded from the microphone, say 1 second. This sound is converted into a digital representation, and then the data is fed into a pre-trained neural network that produces words and even entire sentences at the output.

00:04:13

However, the big problem here is that there are a lot of words. There are now about 22,000 words in the Russian language, and the neural network must recognize all of them. Believe me, this is far from an easy task even for ultra-modern computers. Just so you understand: even the fastest speech recognition neural network for Russian that I know of does it in 500 milliseconds, that is, half a second. Add the pronunciation time, plus idle time, plus processing, and you get a delay of at least 3-4 seconds before speech turns into a command and the command is executed. This seems to be the main problem of all existing implementations of Jarvis: they are all slow and have a large delay.

00:05:07

So let's think about what we can do to fix this. Perhaps the first thing that comes to mind is to stop listening to silence, because obviously there are no words in silence and there is nothing to recognize there. This is why algorithms such as Voice Activity Detection exist: they separate silence and noise from actual speech that can be recognized. But even so, this does not solve the problem completely, since the algorithms do not always understand where the noise is and where the speech is.
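The simplest form of voice activity detection can be sketched as an energy threshold per frame. This is only an illustration of the idea, not the detector used in the video: the frame length, the threshold, and the synthetic input are all assumptions, and a fixed threshold is exactly the kind of approach that confuses loud noise with speech.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_frames(samples, frame_len):
    """Yield consecutive full frames of frame_len samples."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


def detect_voice(samples, frame_len=480, threshold=500.0):
    """Return one boolean per frame: True where energy exceeds the threshold."""
    return [frame_rms(f) > threshold for f in split_frames(samples, frame_len)]


# Synthetic input: a quiet noise floor, then a loud tone, then quiet again.
quiet = [((i * 37) % 21) - 10 for i in range(960)]  # values in [-10, 10]
loud = [round(8000 * math.sin(2 * math.pi * 200 * i / 16_000))
        for i in range(960)]
flags = detect_voice(quiet + loud + quiet)
```

Only the frames flagged `True` would be passed on to the expensive speech recognizer; the rest are skipped.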

00:05:40

If I mumble into the microphone, is that noise or speech? Nothing is clear, but it is very interesting. So a smarter approach was invented, called wake word detection, which makes it possible to pick out the activation phrase from the entire audio stream right away. And this is exactly what we need.

00:05:57

See for yourself: the activation phrase is always one, or at most 2-3, words that a person says very quickly. For example, for Siri on the iPhone the activation phrase is "Hey Siri"; for Yandex Alice it seems to be just the word "Alice"; and so on. As you understand, it is hundreds of times easier to teach a neural network to quickly recognize just one word than to recognize 22,000 words and then check along the way whether the spoken word is the activation word. Moreover, you will be surprised, but to detect the activation phrase in the entire audio stream we only need segments of 30 ms. That is, by the way, 33 times shorter than a second. I even managed to test this in code.
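The arithmetic behind the 30 ms figure can be checked with a short sketch. The 16 kHz sampling rate is an assumption for illustration; the 30 ms segment length is the value mentioned in the video. A real wake word engine dictates its own frame length, so treat these numbers as an example only.

```python
SAMPLE_RATE = 16_000   # assumed sampling rate, Hz
FRAME_MS = 30          # segment length mentioned in the video


def samples_per_frame(sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """How many samples fit into one segment."""
    return sample_rate * frame_ms // 1000


def frames_per_second(frame_ms=FRAME_MS):
    """How many segments fit into one second."""
    return 1000 / frame_ms


def iter_frames(samples, frame_len):
    """Yield consecutive full frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


one_second = [0] * SAMPLE_RATE
frames = list(iter_frames(one_second, samples_per_frame()))
```

Under these assumptions each segment holds 480 samples, and one second of audio yields 33 full segments, which matches the "33 times shorter than a second" remark.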

00:06:46

As the neural network I took Picovoice wake word detection. And now, when I say the keyword "Jarvis", our program reacts with lightning speed.

00:06:58

Jarvis.

00:07:00

Jarvis.

00:07:04

Jarvis.

00:07:06

And yes, as you noticed, I swiped clips of Jarvis's voice from the film, and it seems to me this makes it much more impressive. Then I began rewriting the code of Kesha, who from now on will be called Jarvis. In general, the activation phrase can be anything, and the sounds our assistant plays can also be anything, so in principle this is not a problem.

00:07:27

What I want to draw your attention to is how I combined Picovoice with Vosk. Vosk, by the way, is a neural network that recognizes Russian speech. Both neural networks always use the same microphone for recording. Picovoice always works first, catching the activation phrase in 30-millisecond segments. As soon as the phrase is found, the playsound library starts playing the appropriate Jarvis phrase, or rather one of several, for some variety: Jarvis says "Yes, sir", "I'm listening", and so on.

00:08:03

During this time, by the way, microphone recording stops, so that when Jarvis says something through the external speakers he does not accidentally record himself and try to listen to himself. Immediately after that, recording is turned back on, and this is where Vosk kicks in: it records much larger segments and tries to recognize a command in them. When a command is spoken, Vosk runs inference on it and passes the result to the handler functions for processing. There the Levenshtein distance algorithm is used, which helps recognize a command even if it was not said entirely precisely.

00:08:40

In the end, if one of the existing commands was spoken, Jarvis executes it and reports success. If not, he says something like "What exactly do you need from me?"
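The fuzzy matching step described above can be sketched with a plain Levenshtein distance. This is not the project's actual code: the command table, handler names, and the 0.7 similarity threshold are hypothetical, chosen only to show how an imprecisely recognized phrase can still be mapped to a command.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalize the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


# Hypothetical command table: spoken phrase -> handler name.
COMMANDS = {
    "open browser": "open_browser",
    "game mode": "game_mode",
    "work mode": "work_mode",
}


def match_command(heard, min_score=0.7):
    """Return the best-matching handler name, or None if nothing is close."""
    heard = heard.lower().strip()
    best = max(COMMANDS, key=lambda phrase: similarity(heard, phrase))
    return COMMANDS[best] if similarity(heard, best) >= min_score else None
```

A `None` result corresponds to the case where the assistant answers that it did not understand the request.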

00:09:11

[music]

00:09:19

Jarvis, you're doing well.

00:09:24

As for me, it is already good, but there is definitely room for improvement. The first upgrade that needs to be made is to remove the need to say the word "Jarvis" before every command. It is annoying, and besides, this way you cannot chain commands. To fix this I slightly rewrote the listening logic in the code. Now, when we say "Jarvis", he responds and listens for a command. After a command is spoken, he keeps listening for subsequent commands for about 10-15 seconds, and only if we do not say another command in that time will he ask us to say the word "Jarvis" again. It seems to me this is much more convenient. And this is how the improved version works.
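The chaining behavior described above amounts to a small timer-based state machine. The sketch below is one possible way to model it, not the author's implementation: the class name, method names, and the 10-second default window are assumptions, and the clock is injectable so the logic can be exercised without waiting.

```python
import time


class ListeningSession:
    """Tracks whether commands are still accepted without the wake word."""

    def __init__(self, window_seconds=10.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock
        self._deadline = None  # None means "asleep, waiting for the wake word"

    def on_wake_word(self):
        """The wake word opens a fresh listening window."""
        self._deadline = self._clock() + self.window_seconds

    def on_command(self):
        """Call after a command is handled; extends an open window."""
        if self.is_listening():
            self._deadline = self._clock() + self.window_seconds

    def is_listening(self):
        """True while the window is open and has not yet expired."""
        return self._deadline is not None and self._clock() < self._deadline
```

In a main loop, audio would go to the wake word detector while `is_listening()` is false and to the speech recognizer while it is true.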

00:10:07

[music]

00:10:23

you’re great,

00:10:26

Now it works much better, but there is still absolutely no practical benefit from Jarvis. Opening the browser, changing the volume: these are the dumbest commands, which nobody needs. Jarvis should be useful, so I decided to teach him to automate some actions on the computer. The simplest thing I came up with is to make my life easier when I want to play games on the TV.

00:10:53

Look, the thing is that this is my computer, but I do not like playing games while sitting at the computer; it really puts me off. By the way, write in the comments if you are the same. I prefer to play on my big comfortable TV instead. But to do that, I first have to go through a whole quest: turn on the TV; on the computer, switch the active monitor to the TV; then, still standing at the computer because the mouse is here, redirect the sound with the mouse so that it goes to the TV and not to the computer; and finally launch Steam Big Picture. Only after that can I finally pick up the gamepad and start playing. As you understand, doing all this every time is far from the most pleasant process, so let Jarvis do it for me. Then there will be at least some real benefit from him.

00:11:43

For this I have already written a small AutoHotkey script that does all of this without my participation, and Jarvis will run this script. You might ask: what is the point of Jarvis, then, if he just runs a script? The point is that with Jarvis I do not need to do anything at all. I will just say something like "game mode", and he will do everything himself and wish me good luck in the game. And when I finish playing, I will again just need to say "Jarvis, return to work mode", or "Jarvis, put everything back as it was", and he will do everything again. I do not have to pick up the mouse at all; I just say the desired command out loud. In short, it is easier to show than to explain. Look how awesome it actually works.

00:12:50

the engine is turned off

00:13:05

Jarvis

00:13:06

you're doing great

00:13:11

Jarvis

00:13:14

Go back to the computer

00:13:20

[music]

00:13:28

And now imagine that Jarvis is always listening for commands, and whenever I want to play games I just say so and my voice assistant does everything for me. This is damn cool.

00:13:50

But that is not all. I decided to go further and connect a real artificial brain to Jarvis, namely ChatGPT. Besides, I had already worked with the ChatGPT API, and it causes no problems, except that Russian is not officially supported by the developers. So in the code I have to translate the recorded speech into English, then feed the English request to ChatGPT, and only then, when the answer comes back, translate it back into Russian and actually play it.

00:14:26

As you probably understand, all this takes a lot of time, and on top of that, speech synthesis also eats time. Therefore answers to arbitrary questions currently take Jarvis about 5-7 seconds. Yes, I understand this is slow, and I even have ideas on how to speed it all up at least twofold, but I will do that later. For now, let's see what we got.

00:14:50

Jarvis.

00:14:54

Tell me, what is 2 + 2?

00:15:02

Two plus two is four.

00:15:05

Tell me, how many colors are in the rainbow?

00:15:11

There are 7 colors in the rainbow: red, orange, yellow, green, blue, indigo, and purple.
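The question-answering flow described earlier (translate the Russian request into English, query the language model, translate the answer back, synthesize speech) can be sketched as a chain of four stages with per-stage timing. Everything here is a hypothetical stand-in: the stage functions are fakes so the sketch runs offline, and no real translation, ChatGPT, or TTS API is called.

```python
import time
from typing import Callable


def answer_question(
    question_ru: str,
    translate: Callable[[str, str, str], str],
    ask_llm: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[bytes, dict[str, float]]:
    """Run the four stages in order and record how long each one takes."""
    timings: dict[str, float] = {}

    def timed(name, func, *args):
        start = time.perf_counter()
        result = func(*args)
        timings[name] = time.perf_counter() - start
        return result

    question_en = timed("ru->en", translate, question_ru, "ru", "en")
    answer_en = timed("llm", ask_llm, question_en)
    answer_ru = timed("en->ru", translate, answer_en, "en", "ru")
    audio = timed("tts", synthesize, answer_ru)
    return audio, timings


# Stand-in stages; real ones would call external services.
FAKE_DICTIONARY = {
    ("ru", "en", "сколько будет 2 + 2"): "what is 2 + 2",
    ("en", "ru", "2 + 2 is 4"): "2 + 2 будет 4",
}


def fake_translate(text, source, target):
    return FAKE_DICTIONARY[(source, target, text)]


def fake_llm(prompt):
    return "2 + 2 is 4" if prompt == "what is 2 + 2" else "I do not know"


def fake_tts(text):
    return text.encode("utf-8")  # pretend these bytes are audio


audio, timings = answer_question(
    "сколько будет 2 + 2", fake_translate, fake_llm, fake_tts
)
```

Because the stages run strictly one after another, their latencies add up, which is why the total answer time reaches several seconds; the `timings` dictionary shows which stage would be worth speeding up first.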

00:15:18

By the way, for speech synthesis I use the same Silero TTS neural network. It generates a voice as close as possible to a human one, which is exactly what we need. But that is not all. I decided to go even further and try to make the synthesized speech sound the same as Jarvis's, or at least make the answer sound as if it were spoken by a robot.

00:15:40

Only let's agree right now that you will not laugh, because all I came up with is to use the Voicemod program. If anyone does not know, it is a very popular voice-changing program that is usually used in Discord or various online games. In the end it turned out that I connected Jarvis with GPT, whose answers are synthesized by Silero TTS, whose voice is filtered through Voicemod. Four neural networks at once are involved in running this seemingly dubious scheme. Sounds crazy, but it works.

00:16:19

[music]

00:16:20

Jarvis

00:16:23

Tell me who a samurai is.

00:16:31

The samurai was a well-trained warrior of feudal Japan who followed strict ... They were known for their courage and skill.

00:16:45

Tell

00:16:46

some joke what

00:16:52

choice said when

00:16:58

say it’s not funny at all

00:17:04

I understand that this is not funny to you

00:17:11

You’re stupid

00:17:18

Well, that is what we got in the end. Of course, Jarvis can still play music, save tracks, open the browser, and do all that nonsense. But the most important thing is that he has intelligence: he can answer questions and even do something useful. And of course I will upload the entire source code of Jarvis to GitHub and leave a link in the description under the video. You can poke around in it yourself and maybe even really use it. At least I will use it. The main thing is not to forget that this is still a very, very crude prototype, and you certainly should not expect it to discover a new chemical element. Although, who knows.

00:17:57

Whether I go on to turn it into a truly convenient and extensible program will depend on you and your support. So leave your royal likes on this episode, and if it gets at least 20,000 likes, I promise to release a second part about Jarvis, where I will try to make working software that you can simply download and use, perhaps even for the phone too, but that is not certain. And of course I will also fix all the shortcomings and bugs.

00:18:28

And do not forget to subscribe to the channel and turn on the notification bell so you will not miss new episodes, and of course write comments: all of this helps promote the video on YouTube. So, guys, I hope you enjoyed it. Good luck, and always remember: programming can and should be interesting.

00:18:52

[music]

Description:

It has never happened before, and here we go again! I made a real Jarvis by combining ChatGPT + SileroTTS + Voicemod. Source code: I will publish it in our Telegram channel, https://t.me/howdyho_official. Subscribe ;) Our Telegram: https://t.me/howdyho_official Our VK: https://www.vk.com/howdyho_net Collaboration: https://vk.com/topic-84392011_33285530 Music provided by YouTube Audio Library.
