
Evolution of Speech Recognition: From Audrey to AI Integration

In 1952, Bell Labs introduced "Audrey," a speech recognition system designed to capture spoken digits. It was the first system of its kind, and it could recognize only single-digit numbers spoken by a single voice.

Ten years later, IBM introduced "Shoebox," which took this a step further by recognizing sixteen spoken words. "Shoebox" understood the digits zero through nine plus command words such as "plus" and "minus," making it the first basic voice calculator.

Throughout this guide, The Jmor Connection, Inc. will explain the difference between traditional speech recognition and newer AI-powered versions, and trace how both have evolved.

Distinguishing Voice Recognition from Speech Recognition

Voice recognition uses technology for the sole purpose of identifying an individual by their specific voice pattern. Speech recognition uses technology to turn spoken words into typed text that can later be edited. Many people confuse the two because they seem similar, but they are distinctly different.

DARPA's Funding and the Evolution of Speech Recognition

In the early 1970s, DARPA (the Defense Advanced Research Projects Agency) funded the SUR (Speech Understanding Research) program. One of SUR's products was the Harpy speech recognition system from Carnegie Mellon.

Harpy could recognize sentences using a vocabulary of 1,011 words, roughly the vocabulary of a three-year-old. In the 1980s, the HMM (Hidden Markov Model), a statistical approach to modeling speech, drove the development of ASR (Automatic Speech Recognition). This allowed IBM to offer speech-to-text tools and beta-test transcription.
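To give a feel for the HMM approach, here is a tiny Viterbi decoder, the classic algorithm for finding the most likely sequence of hidden states (such as phonemes) behind a sequence of observations. The states, symbols, and probabilities below are invented toy values, not taken from any real recognizer.

```python
# Toy Viterbi decoder for a Hidden Markov Model. All states, symbols,
# and probabilities here are illustrative only.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace the most probable final state back to the start
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        last = back[t][last]
        path.insert(0, last)
    return path

# Two toy "phoneme" states emitting two acoustic symbols
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

print(viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p))
# ['S1', 'S2', 'S2']
```

Real 1980s recognizers chained thousands of such states and trained the probabilities from recorded speech, but the decoding idea is the same.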

Commercial Speech Recognition Journey: Tangora to Dragon NaturallySpeaking

IBM's Tangora, when adequately trained, could recognize and type 20,000 English words, but it was not yet ready for commercial use. Consumer-level ASR development continued, and in 1990, Dragon Dictate was announced as the first commercial speech recognition software.

Dragon Dictate sold for $9,000 (roughly $18,890 in 2021 dollars) until 1997, when Dragon NaturallySpeaking was released. Unfortunately, users had to pause between words, or Dragon NaturallySpeaking would get confused and misinterpret the speech.

AT&T's VRCP and Mike Cohen's Google Impact

In 1992, AT&T introduced Bell Labs' VRCP (Voice Recognition Call Processing) system, which went on to handle about 1.2 billion voice transactions per year.

However, the main innovations in voice processing came from Mike Cohen, who was brought on at Google in 2003 to lead the creation of the company's speech-to-text offering.

Voice Search Evolution and Podcast Transcription

Google announced Google Voice Search in 2007; however, the service harvested speech data from millions of networked users to use as training data. Around 2010, Apple's Siri and Microsoft's Cortana introduced their own flavors of voice search.

Today, many podcasters use ASR to transcribe their shows; by late 2020, transcription was largely automated, with humans stepping in only to make minor corrections.

Quick Read: How does robotic process automation (RPA) work?

ASR Accuracy and Accessibility

Do you think Knight Industries reviewed Michael and KITT's conversations for accuracy? With the amount of R&D that has gone into ASR, the technology is no longer available only to large corporations but also to small companies and even individuals.

Today, people use transcription services to take notes during video conference calls, transcribe books, and more. Voice recognition and speech recognition both use technology, one to identify a specific voice and the other to turn speech into text, but neither can understand the human meaning of what is said.

Demystifying NLP (Natural Language Processing)

NLP (Natural Language Processing) refers to a specific functionality of AI that allows computers to understand text and spoken words much as humans do.

NLP combines computational linguistics with statistical ML (Machine Learning) and DL (Deep Learning) to give computers the ability to process and understand human language. Note that NLP can also translate typed text or spoken words into another language.

AI-Powered Voice Assistants: Siri, Alexa, and Cortana in Action

Today, Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana use AI-powered speech recognition to answer questions.

For example, you may want to compose an e-mail or a text on your iPhone by pressing the microphone button. As you talk, the system listens to what you are saying, types the words out, and makes corrections after it analyzes the context of the phrases you speak.

Also Read: Four Types of AI

IBM's Watson: Beyond Speech-to-Text

IBM’s Watson Speech to Text technology lets you speak while it quickly and accurately transcribes your voice in many languages. Today, Watson’s integrated AI system can do much more than convert text to speech or speech to text.

Watson uses a neural network built from layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node links to others and has a weight and a threshold assigned to it; if a node's output registers higher than its threshold value, the node activates.

Upon activation, the node sends data to the next layer; otherwise, no data is passed along. All neural networks need training data to learn and to improve their accuracy over time.
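The node-and-threshold behavior described above can be sketched as a single forward pass through one layer. The inputs, weights, and thresholds below are arbitrary illustrative numbers, not values from Watson.

```python
# One layer of threshold nodes: each node computes a weighted sum of its
# inputs and "activates" (passes data on) only if the sum exceeds its
# threshold. All numbers are arbitrary illustrative values.

def layer_forward(inputs, weights, thresholds):
    outputs = []
    for node_weights, threshold in zip(weights, thresholds):
        total = sum(x * w for x, w in zip(inputs, node_weights))
        # Node activates only above its threshold; otherwise it passes 0
        outputs.append(total if total > threshold else 0.0)
    return outputs

inputs = [0.5, 0.8]                  # e.g., two acoustic features
weights = [[0.9, 0.4], [0.2, 0.3]]   # one weight vector per node
thresholds = [0.5, 0.5]

print(layer_forward(inputs, weights, thresholds))
```

Here the first node's weighted sum (0.77) clears its threshold and fires, while the second (0.34) does not, so only the first passes data to the next layer. Training adjusts the weights so the right nodes fire for the right inputs.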

Profanity Filtering and Speaker Diarization

Did you know that Watson has a word-spotting and filtering module that can quickly locate inappropriate words and serve as a profanity filter? It can even handle what is called speaker diarization, recognizing who is speaking, and can distinguish up to six different speakers.
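A word-spotting profanity filter of the kind described can be sketched in a few lines. The block list and masking style below are invented placeholders, not Watson's actual filtering module.

```python
import re

# Minimal word-spotting profanity filter: scans a transcript for words
# on a block list and masks them. The block list is a placeholder only.
BLOCKED = {"darn", "heck"}

def filter_transcript(text):
    def mask(match):
        word = match.group(0)
        return word[0] + "*" * (len(word) - 1) if word.lower() in BLOCKED else word
    return re.sub(r"[A-Za-z']+", mask, text)

print(filter_transcript("Well darn, that heck of a call dropped"))
# Well d***, that h*** of a call dropped
```

Production systems work on recognized word hypotheses with confidence scores rather than plain text, but the spot-and-mask idea is the same.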

AI vs. Traditional Systems in Speech Recognition

If a speech recognition system uses a classical approach without a neural network, processing may take hours instead of minutes. Nor can a traditional system learn and optimize itself the way a neural network can.

Unfortunately, bad actors are out there, and did you know that with just a few seconds of your voice, they can have an AI system generate almost anything they want you to say? So if you get a call from someone you don't recognize, or from a withheld number, don't speak first.

A simple example: an attacker calls your bank and plays your cloned voice to its ASR phone system; the representative then sees the call as verified and starts giving out information about your account.

Speech Recognition Trends and Challenges: The Path Ahead

The speech recognition market is expected to grow by 16.8% between 2021 and 2026, reaching an estimated $27.16 billion. Beyond raw accuracy, we still have a way to go in ensuring that gender, age, languages, dialects, accents, and non-native speakers are correctly recognized.

A significant challenge currently impacting speech recognition technology is background noise. Did you know you can speak many of the commands you may want your Apple phone to carry out?

Hey Siri…
Set a timer for 2 minutes.
Call (say a name or number).
Text (say a name or number).
Send e-mail to…
Read my messages.
Open (app).
Take a picture.
Set an alarm for 3 PM.
Turn off all alarms.
What time is it?
What's today's date?
Tell me about the weather.
How is the traffic today?
How much is gas right now?
Tell me a synonym for…
What is the definition of…
When is sunrise?
What time is it in (say a city and location)?
How many calories are in…
What's in the news?

The Evolution of Speech Recognition: From Digits to Contextual Understanding

Thus, from 1952 until today, we have come a long way: from recognizing single digits to AI that can understand the context of phrases. Speech recognition may be a viable option to help enhance your business's services as well as its profits, but remember to keep the human touch in there.

Every time I call the major wireless provider whose name falls toward the end of the alphabet, it seems they work harder and harder to make their system less human-friendly, more like a gate designed to avoid connecting me to a live person.

Related: The use of AI in Farming 

Check out all of my Fantastic Content at