Library Technology – Reviews, Tips, Giveaways, Freeware

How ASR Works

Posted in Infographics by Techtiplib on Tuesday, January 27th, 2015

Here is a quick rundown of how automatic speech recognition software works, brought to you by the speech software specialists at West Interactive. Before we get into the details, note that there is also a much more detailed post/infographic that explains the entire process more thoroughly.

Automatic speech recognition, or ASR for short, is the software built into devices that can recognize and respond to speech. You’ve probably seen it in dictation software on PCs and, most famously, in Siri on certain iPhone models.

ASR is a complex area of software development, but its basic operation follows this process:

[Infographic: ASR]

How Your Device Talks to You:

Speech recognition by computer starts when you say something to a device, such as “what’s the weather forecast?”

From there, the device captures what you said as a digital sound wave, filters background noise out of it and normalizes its volume.
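
To make the cleanup step more concrete, here is a minimal sketch in Python. It is only a stand-in: real ASR front ends use far more advanced filtering, and the function names, the 0.05 threshold and the 0.9 target peak are all assumptions made up for this illustration. It treats the sound wave as a list of samples between -1.0 and 1.0, silences anything very quiet, then scales what is left to a standard loudness.

```python
def noise_gate(samples, threshold=0.05):
    """Zero out samples quieter than the threshold (a crude noise filter)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def clean(samples, threshold=0.05, target_peak=0.9):
    """Apply the noise gate first, then normalize the volume."""
    return peak_normalize(noise_gate(samples, threshold), target_peak)


# Hypothetical recording: faint hiss around a quiet spoken sound.
recording = [0.01, -0.02, 0.3, -0.45, 0.02]
cleaned = clean(recording)
```

After cleaning, the faint hiss samples are zero and the loudest remaining sample sits at the target peak, so every recording reaches the next stage at a similar volume.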

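After this, the ASR system breaks the cleaned-up sound wave down into what are called phonemes, which are the building-block sounds of a given language. English, for example, is usually counted as having about 44 of them; the source gives 33 for French and 49 for Italian, although such counts vary with the dialect and the method of analysis.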

Each phoneme is like a link in a chain, and ASR software uses statistical analysis to estimate which phonemes are likely to come next. This is how it assembles phonemes into words, and later words into sentences, from what you say.
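
One simple way to picture this chain-link guessing is a bigram model, which counts how often one phoneme follows another. The sketch below is a toy under stated assumptions, not how any particular ASR product works: the training words and the ARPAbet-style phoneme symbols are invented for the example, and production systems rely on much richer statistical models.

```python
from collections import Counter, defaultdict


def train_bigrams(sequences):
    """Count how often each phoneme is followed by each other phoneme."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_phoneme_probs(counts, prev):
    """Turn the raw counts after `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # phoneme never seen in training
    return {p: c / total for p, c in followers.items()}


# Toy training data (illustrative symbols): weather, wet, west, way.
training = [
    ["W", "EH", "DH", "ER"],
    ["W", "EH", "T"],
    ["W", "EH", "S", "T"],
    ["W", "EY"],
]
model = train_bigrams(training)
probs = next_phoneme_probs(model, "W")
```

In this toy data, "EH" follows "W" in three of the four words, so the model rates it as three times more likely than "EY". A real system makes the same kind of bet, only with vastly more data.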

This process can be done via directed dialog conversations or through what are called natural language conversations.

The first, directed dialog, is normally found in automated phone banking systems, in which a computer voice asks you questions and responds to specific choices from a menu. It is the simplest form of ASR system.
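
A directed dialog can be sketched as little more than a lookup table: the system only understands the handful of words it asked for. The menu wording and the `respond` function below are hypothetical, made up for this example rather than taken from any real phone banking system.

```python
# Hypothetical menu for a phone banking line.
MENU = {
    "prompt": "Say 'balance', 'transfer', or 'agent'.",
    "choices": {
        "balance": "Your balance is being read out.",
        "transfer": "Starting a transfer.",
        "agent": "Connecting you to an agent.",
    },
}


def respond(menu, utterance):
    """Answer only if the recognized word is one of the menu choices."""
    word = utterance.strip().lower()
    if word in menu["choices"]:
        return menu["choices"][word]
    # Anything off-menu simply repeats the prompt.
    return "Sorry, I did not understand. " + menu["prompt"]


reply = respond(MENU, " Balance ")
```

Because the expected answers are so few, the recognizer has very little guessing to do, which is why this style is the easiest to build.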

Natural language, on the other hand, is much more sophisticated and involves you actually interacting with your automatic voice software in a way that simulates real human conversation.

Siri is a very good example of a natural language system at work. It functions by picking up contextual clues from the word combinations in your phrases in order to form an appropriate response. For example, if you say the word “weather”, you could just as easily be saying “whether”, but because you also say “forecast”, the ASR system decides that you’re probably asking for the weather forecast instead of referring to a conditional possibility.
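
The weather/whether decision can be illustrated with a tiny scoring routine: each candidate word gets a point for every nearby word that tends to appear with it. This is a deliberately simplified sketch, and the `CONTEXT_WORDS` lists are invented for the example; a real natural language system learns such associations from large amounts of data.

```python
# Hypothetical context lists: words that tend to appear near each candidate.
CONTEXT_WORDS = {
    "weather": {"forecast", "rain", "sunny", "temperature", "today"},
    "whether": {"or", "not", "if", "decide", "know"},
}


def pick_homophone(candidates, other_words):
    """Choose the candidate that shares the most context words with the phrase."""
    words = {w.lower() for w in other_words}
    scores = {c: len(CONTEXT_WORDS.get(c, set()) & words) for c in candidates}
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=lambda c: scores[c])


choice = pick_homophone(["weather", "whether"], ["what's", "the", "forecast"])
```

Here the word "forecast" tips the score toward "weather". Swap the phrase for something like "I do not know it is true or not" and the same routine picks "whether" instead.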

This guesswork is necessary because an average natural language ASR system’s 60,000-word vocabulary allows 216 trillion possible word combinations (60,000 × 60,000 × 60,000) for every three words spoken to it in sequence. The system uses contextual guessing among mutually compatible tagged keywords to rule out most of those trillions of potential combinations.
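
The 216 trillion figure is simply the vocabulary size raised to the power of the number of words in the sequence, which is easy to check. The second calculation below is purely an assumption for illustration: it supposes that context narrows each position to 50 plausible words, a number chosen only to show how quickly the search shrinks.

```python
def sequence_count(vocabulary_size, words_in_sequence):
    """Number of ordered word sequences of the given length."""
    return vocabulary_size ** words_in_sequence


# Every three-word sequence from a 60,000-word vocabulary.
full_search = sequence_count(60_000, 3)  # 216,000,000,000,000

# Hypothetical: context leaves only 50 plausible words per position.
pruned_search = sequence_count(50, 3)  # 125,000
```

Even a rough contextual guess turns an impossible search into a small one, which is the whole point of the guesswork.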

The Learning Process and Tuning Tests

ASR systems can also be improved by a process of “tuning” or, in the case of more sophisticated software, be made to slowly “learn” from the conversations they have with human users.

The tuning process is more mechanical. Programmers and linguists periodically review the conversation logs of a given piece of ASR software, looking for new words and phrases that the people who interact with it are using more often. They then add these words and phrases to the software’s growing dictionary.
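
Part of that log review could be helped along by a small script that lists words the software does not yet know, ranked by how often they come up. The sketch below is a hypothetical helper, not a tool from any ASR vendor: the log lines, the dictionary and the `min_count` cutoff are all made up for the example, and a human would still decide which words to add.

```python
import re
from collections import Counter


def frequent_unknown_words(log_lines, dictionary, min_count=2):
    """List words missing from the dictionary that appear at least min_count times."""
    counts = Counter()
    for line in log_lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in dictionary:
                counts[word] += 1
    return [(w, c) for w, c in counts.most_common() if c >= min_count]


# Hypothetical conversation log and current dictionary.
logs = [
    "play the new podcast",
    "find a podcast about weather",
    "play some music",
]
known = {"play", "the", "new", "find", "a", "about", "weather", "some", "music"}
candidates = frequent_unknown_words(logs, known)
```

In this sample, "podcast" is the only unknown word that shows up more than once, so it is the one a reviewer would be asked to consider.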

Active learning, on the other hand, is a much more sophisticated process by which the software itself learns from the repeated conversations it has with humans. By examining its own past interactions with humans regularly, the software autonomously learns to understand the context of words it previously didn’t have in its internal dictionary and to grasp new or unusual phrases better.

In essence, the above is how ASR software works and how our most commonly used voice recognition systems learn to speak to us more effectively.

Source: West Interactive

