Lots of math and no magic: how voice assistants actually work

Zuckerberg's assistant and the "Samantha" operating system: two types of voice systems

I work in a lab

I teach two blocks of subjects at the university: one is related to applied machine learning and artificial intelligence. The second deals with search and its varieties. This is the area that I will try to talk about today.

There are two ways to look at what a voice assistant is. The first is to imagine that you have a virtual butler. For example, about five years ago Mark Zuckerberg built a smart assistant for his home and called it "Jarvis". It could let people into the house, open and close doors and curtains, and turn on the lights. Other examples of such assistants are "Alexa" and "Alice": they live in a device and are meant to make everyday life easier. They can control the oven, the washing machine, the vacuum cleaner and so on.

Another way to look at assistants is as an interface. In the film "Her" there was an operating system called "Samantha"; in the Russian dub she had the same voice as "Alice" from Yandex. She acted as an interface for controlling the operating system and was not designed as a household assistant. Apple takes this approach with Siri, Microsoft with Cortana, and Google with Google Assistant.

How do they work?

All assistants are built on a very similar principle. The first thing an assistant needs to do is hear the voice. This happens on the user's device, a mobile phone or a smart speaker. The user says "Alice", "Alexa" or "OK Google". After these magic words, the device starts recording the user's voice. Recording continues until the user falls silent or the device gets tired of waiting for silence. After that, the data is sent to the server of the company that provides the service.
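The "record until silence or timeout" rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the energy threshold and the frame limits are all hypothetical, and real devices use far more robust voice-activity detection.

```python
def record_until_silence(frames, silence_threshold=0.01,
                         max_silent_frames=3, max_frames=50):
    """Collect audio frames until the speaker goes quiet or the time limit is hit.

    All parameter values are illustrative assumptions, not real device settings.
    """
    recorded = []
    silent_run = 0
    for frame in frames:
        if len(recorded) >= max_frames:
            break  # the device is "tired of waiting"
        recorded.append(frame)
        # Mean squared amplitude is a crude measure of loudness.
        energy = sum(sample * sample for sample in frame) / len(frame)
        silent_run = silent_run + 1 if energy < silence_threshold else 0
        if silent_run >= max_silent_frames:
            break  # the user has fallen silent
    return recorded


# Four loud frames followed by silence: recording stops after three quiet frames.
frames = [[0.5, -0.5]] * 4 + [[0.0, 0.0]] * 10
recorded = record_until_silence(frames)
print(len(recorded))  # prints 7
```

Whatever ends up in `recorded` is what would be sent to the server for recognition.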

This is where the magic begins. The first operation is speech-to-text conversion. Everyone speaks differently, so how do we convert this to text? Then comes what we use voice assistants for: the provision of a service. This can be any operation that is available online, such as buying tickets or booking a table in a restaurant. The only question is how to provide a user-friendly interface. Without it, the device is just a chatterbox.

After the service has been called, the results need to be returned to the user, and for this they have to be packaged properly. Most likely the result will be text, a search result from a web page, a song, or a value computed by a calculator. The data is converted back to speech and sent to the client.

Speech to text

We communicate through speech, and the voice is a vibration of the surrounding air. These vibrations reach the eardrum, which pushes three small bones: the malleus (hammer), incus (anvil) and stapes (stirrup). Those, in turn, set in motion an organ called the cochlea (Latin for "snail"). We inherited the cochlea from fish; it is filled with fluid and contains hair cells that oscillate along with the fluid. The outer hair cells amplify the vibrations in the fluid and pass them on to the inner hair cells, which produce an electrical impulse. This impulse is transmitted to the brain.

Moreover, hair cells in different parts of the cochlea respond to different frequencies. High frequencies are picked up at the base of the cochlea, medium frequencies in the middle, and low frequencies near the apex, at the center of the spiral.

How can we make a machine perceive sound in the same way, not as a raw signal but as a set of frequencies? The answer was given by the French mathematician Jean-Baptiste Joseph Fourier, who lived at the turn of the 18th and 19th centuries. He proposed a mathematical transformation that does the same thing as the ear: it takes a raw signal and decomposes it into frequency components.
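To make the idea concrete, here is a minimal sketch of a discrete Fourier transform written directly from its definition. The test signal (two sine waves at 5 Hz and 12 Hz) and the sample rate are assumptions chosen for illustration; production systems use a fast Fourier transform over short overlapping windows of audio rather than this slow direct sum.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return the magnitude of each frequency component of a real signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):  # frequencies up to half the sample rate
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(coeff) / n)
    return magnitudes


# One second of "sound" sampled at 64 Hz: a loud 5 Hz tone plus a quieter 12 Hz tone.
sample_rate = 64
signal = [math.sin(2 * math.pi * 5 * t / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
          for t in range(sample_rate)]

mags = dft_magnitudes(signal)
# With one second of data, component k corresponds to k Hz.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(strongest))  # prints [5, 12]
```

The raw signal is just 64 numbers, but the transform reveals that it consists of exactly two tones, which is the kind of representation the ear passes on to the brain.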

What do we do with the frequency components? We can map a spectral representation to a phoneme, that is, convert speech into phonemes. Phonemes can then be converted into letters more or less easily by an algorithm. In other words, we can get a word from its phonetic representation.

But all of this is inexact. Some phonemes differ only slightly, and the transition from one sound to another can sound different depending on the context. These context-dependent units are called senones, and there are about 10 thousand of them. With so many of them, the task of identifying words becomes much harder.

Fighting errors

How do researchers deal with errors? The answer was given by the Russian mathematician Andrei Markov, who lived at the turn of the 19th and 20th centuries. He developed a theory describing processes in which each step follows from the previous one. His theory later gave rise to hidden Markov models, one of the first ways to correct errors of this kind.
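A toy example shows how a hidden Markov model can repair a recognition error. The sketch below uses the standard Viterbi algorithm to find the most likely sequence of hidden states; the two vowel states and all probabilities are invented for illustration and are not taken from any real recognizer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in state s, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                for prev in states
            )
        best.append(column)
    # Walk backwards from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: a vowel usually lasts several audio frames (0.9),
# and the acoustic guess for a single frame is right 70% of the time.
states = ["a", "o"]
start_p = {"a": 0.5, "o": 0.5}
trans_p = {"a": {"a": 0.9, "o": 0.1}, "o": {"a": 0.1, "o": 0.9}}
emit_p = {"a": {"a": 0.7, "o": 0.3}, "o": {"a": 0.3, "o": 0.7}}

heard = ["a", "a", "o", "a", "a"]  # the middle frame was misheard
decoded = viterbi(heard, states, start_p, trans_p, emit_p)
print(decoded)  # prints ['a', 'a', 'a', 'a', 'a']
```

Because switching vowels for a single frame is unlikely under the model, the isolated "o" is treated as noise and corrected.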

For example, when a person speaks indistinctly, has an accent or mispronounces a word, there is a mathematical mechanism that makes it possible to reconstruct, with high accuracy, what the person meant. After all, people also make mistakes and still understand each other, which means we have an error-correction mechanism in our heads.

But a text representation is not enough: a computer works with numbers. How do we get them? Noam Chomsky hypothesized that the brain has a structure, present from birth, that helps us learn natural languages quickly. Throughout his life Chomsky has been building and refining a model that describes the patterns common to all languages, whether Russian, English or Chinese.

On the slide is a Chomsky grammar. It is about the same thing that is done in Russian language lessons when a sentence is parsed into its parts. There are nouns, adjectives, subjects, predicates, verb phrases; all of this is formalized and can be shown to a machine. This structure is easily represented as numbers.

The machine can understand what the subject of a sentence is and what action to take. For example, if the client says "Alice, turn on some music", then "turn on" is the action and "music" is the object the action is applied to. "Alice" will understand the client and start acting.
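The split into action and object can be illustrated with a deliberately naive sketch. It is a keyword matcher, not a grammar-based parser of the kind described above, and the action list, filler words and function name are all hypothetical.

```python
# Hypothetical vocabulary, for illustration only.
KNOWN_ACTIONS = ("turn on", "turn off", "play", "stop")
FILLER_WORDS = {"some", "the", "a", "please"}


def parse_command(utterance, wake_word="alice"):
    """Split a command into an action and the object it applies to."""
    words = utterance.lower().strip(" .!?").replace(",", " ").split()
    if words and words[0] == wake_word:
        words = words[1:]  # drop the wake word
    phrase = " ".join(words)
    # Try longer actions first so that a longer phrase is never shadowed by a shorter one.
    for action in sorted(KNOWN_ACTIONS, key=len, reverse=True):
        if phrase == action or phrase.startswith(action + " "):
            rest = phrase[len(action):].split()
            obj = " ".join(w for w in rest if w not in FILLER_WORDS)
            return {"action": action, "object": obj or None}
    return None  # the command was not understood


print(parse_command("Alice, turn on some music"))
# prints {'action': 'turn on', 'object': 'music'}
```

A real assistant would obtain the same two slots from a full syntactic analysis, which also copes with word order and phrasing that a fixed list cannot anticipate.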

But words themselves are just sequences of letters, so how do we understand their meaning? There are words that are close in meaning, for example two different verbs that both mean "play": will the device understand that they are the same thing? The answer was given by the American linguist Leonard Bloomfield. At the beginning of the twentieth century he proposed a theory in which the meaning of a word is determined by the contexts in which the word occurs. Look at the slide and think about what word could replace the three dots.

My answer would be an elephant, but when I ask students, they say it might be a rhinoceros or even a giraffe. In general, though, we understand that this is a large animal that lives in Africa and can be angry. If we combine all these contexts, we get a semantic description of the object without using the word itself.

But if we digitize this description, we get tens of thousands of numbers. The American mathematician Gene Golub worked out how to reduce that count significantly. Instead of tens of thousands of numbers, each word is described by a much shorter list of numbers called a vector. This vector can be used to measure how close or far apart words are in meaning, that is, their semantic relatedness. This is how one can tell that two different words for "play" mean about the same thing.

Now there are tools where you can enter words and see how they are distributed on a map of meanings. For example, the words "giraffe", "elephant" and "rhinoceros" are grouped together in the space of meanings. These methods have evolved and are now much more advanced.
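Closeness in meaning is commonly measured as the cosine of the angle between two word vectors. The sketch below uses made-up three-number vectors purely for illustration; real word vectors have hundreds of components and are learned from large text collections.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means a similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, invented for this example.
vectors = {
    "elephant": [0.9, 0.8, 0.1],
    "rhinoceros": [0.8, 0.9, 0.2],
    "music": [0.1, 0.0, 0.9],
}

animals = cosine_similarity(vectors["elephant"], vectors["rhinoceros"])
unrelated = cosine_similarity(vectors["elephant"], vectors["music"])
print(round(animals, 2), round(unrelated, 2))
```

The two animals score close to 1, while "elephant" and "music" score much lower, which is what "grouped together in the space of meanings" amounts to numerically.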

We have represented words and sentences as structures and words as meanings, all in the form of numbers. What's next?

Services

Each service has hundreds of thousands, millions or billions of objects. For web search, that means hundreds of billions of pages and tens of billions of images. For music streaming, millions of songs.

One of the first approaches to indexing data is building binary search trees. The same idea is used with paper dictionaries: you open the book in the middle; if you have gone past the word you want, you flip back, and if you haven't reached it yet, you flip forward. In 1962 the Soviet mathematicians Georgy Adelson-Velsky and Evgeniy Landis came up with a data structure, now known as the AVL tree, that keeps itself balanced so that lookups stay fast.
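The dictionary analogy corresponds to binary search over sorted data. The sketch below searches a sorted list rather than a tree, and it is not an AVL tree implementation; the word list is just an example. It also counts the steps to show how few are needed.

```python
def binary_search(sorted_words, target):
    """Return (index, steps); index is -1 if the word is not in the list."""
    low, high = 0, len(sorted_words) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2  # open the "dictionary" in the middle
        if sorted_words[mid] == target:
            return mid, steps
        if sorted_words[mid] < target:
            low = mid + 1   # not reached the word yet: go forward
        else:
            high = mid - 1  # gone past the word: go back
    return -1, steps


words = ["alexa", "alice", "cortana", "jarvis", "samantha", "siri"]
print(binary_search(words, "siri"))    # prints (5, 3)
print(binary_search(words, "google"))  # prints (-1, 3)
```

Each step halves the remaining range, so a million sorted entries need only about twenty comparisons. A self-balancing tree keeps the same guarantee while items are being inserted and deleted.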

This approach only works on linear data, such as numbers or words. What about multidimensional data, when we want to search for something on a map or in three-dimensional space? For that, structures such as k-d trees were invented; they cope perfectly with search in three-dimensional space. But they stop working well for modern tasks, where a text is described by hundreds of numbers.
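A k-d tree can be sketched compactly: each level of the tree splits the points along one coordinate, and the search skips whole branches that cannot contain a closer point. The sample points are arbitrary, and this minimal version is meant only to show the idea.

```python
def build_kdtree(points, depth=0):
    """Recursively split the points along alternating coordinate axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or squared_distance(node["point"], target) < squared_distance(best, target):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < squared_distance(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3, 1), (5, 4, 7), (9, 6, 2), (4, 7, 9), (8, 1, 5), (7, 2, 6)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2, 4)))  # prints (8, 1, 5)
```

In two or three dimensions the pruning test discards most of the tree. With hundreds of coordinates it almost never succeeds, so the search degrades toward checking every point, which is the limitation mentioned above.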

Building on theoretical work from the late twentieth century, Erik Bernhardsson proposed a development of search trees called Annoy, which delivers good search quality on huge collections. It works across the entire vast Spotify catalog, a wonderful result that was obtained only five years ago.

There are also other approaches. For example, the social psychologist Stanley Milgram, known for bizarre and sometimes inhumane experiments, gave rise to the theory of six handshakes: that any two people on Earth are connected through no more than six handshakes. To test this, he asked people to get a letter to a stranger. They had to pick, among their acquaintances, someone who might know that person and pass the letter on. It turned out that it took six letters to reach the target. The experiment was criticized, but it was repeated in the 2000s and the results were confirmed.

This is an amazing property, which in mathematics came to be known as the "small world" graph. Russian scientists, the group of Yuri Malkov, proposed an interesting algorithm based on it. They used it to find anything anywhere. The nodes in this graph are no longer people but documents.

In this graph, the path between any pair of objects is short, so users can find what they need very quickly. This data structure is now used by many companies in Russia and abroad, including Facebook, Mail.ru and Yandex. It is an excellent mathematical model that has changed not only search and recommendation services but also voice assistants.
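Search in such a graph resembles Milgram's letters: at every step, move to the neighbor that is closest to the target. The sketch below is a simplified greedy walk on a tiny hand-built graph, not the algorithm published by Malkov's group; the node names, coordinates and links are invented for illustration.

```python
import math


def greedy_search(graph, vectors, query, start):
    """Hop to the neighbor closest to the query until no neighbor is closer."""
    current = start
    path = [current]
    while True:
        best_neighbor = min(graph[current],
                            key=lambda n: math.dist(vectors[n], query))
        if math.dist(vectors[best_neighbor], query) >= math.dist(vectors[current], query):
            return current, path  # no neighbor improves on the current node
        current = best_neighbor
        path.append(current)


# Hypothetical documents placed on a line; "a" and "e" share a long-range link.
vectors = {"a": (0, 0), "b": (1, 0), "c": (2, 0),
           "d": (3, 0), "e": (4, 0), "f": (5, 0)}
graph = {
    "a": ["b", "e"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "a"],
    "f": ["e"],
}

found, path = greedy_search(graph, vectors, query=(4.8, 0), start="a")
print(found, path)  # prints f ['a', 'e', 'f']
```

The long-range link from "a" to "e" lets the search reach the answer in two hops instead of five. A few such shortcuts are what give a small-world graph its short paths.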
