ASR: How Does It Work? The New Generation of ASR Transcription

Automatic speech recognition (ASR) technology is having a great impact on the world. This technology is already transforming the way students learn, employees work and society functions. ASR is also creating opportunities to assist specific communities of individuals, such as those navigating life or their studies with disabilities.

While ASR is a valuable tool that many people are using in their day-to-day lives, not everyone understands how it works or why it’s so useful. Misconceptions about the role of ASR and its capabilities persist. Delve deeper into the ways this technology works, and how ASR is supporting people with disabilities while simultaneously improving efficiency and saving time for millions of professionals.

Table of Contents:

What is ASR?
How does ASR transcription work?
What is ASR used for?
How does Verbit’s ASR work specifically?
How is the accuracy of ASR measured?

What is ASR?

An automatic speech recognition system involves voice recognition software that processes human speech and turns it into text. While many people are only now learning the capabilities of these types of tools, engineers and researchers have spent decades working to build such systems. In fact, the first attempts to create speech recognition tools date back to 1952. At that time, three Bell Labs researchers built a system called “Audrey” for single-speaker digit recognition.

The capabilities of today’s ASR far exceed those of its predecessors because innovations in artificial intelligence allow engineers to develop sophisticated software that responds to human voices. Modern systems can even differentiate speakers, accents and more.

Advanced versions of ASR transcription technologies now incorporate what is known as Natural Language Processing (NLP). These capture real conversations between people and use machine intelligence to process them. Still, the results will vary when it comes to ASR transcription. Many factors influence the accuracy provided by ASR, including speaker volume, background noise, the quality of the involved recording equipment and more.

How does ASR transcription work?

From the user’s perspective, setting up ASR and capturing a recording is easy. Essentially, the process works as follows:

An individual or a group speaks, and the ASR software detects this speech.
The device then creates a wave file of the words it hears.
The wave file is cleaned to delete background noise and normalize the volume.
The software then breaks down and analyzes the filtered wave file in sequences.
The automatic speech recognition software analyzes these sequences and employs statistical probability to determine the whole words. Next, it works them into complete sentences.
Some technology providers’ ASR service includes editing by professional human transcribers. Adding this layer to the process helps correct any errors to achieve greater accuracy.
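The cleanup and segmentation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular ASR product works: the function names, the 16 kHz sample rate, the 25 ms frames with a 10 ms step and the energy threshold are all hypothetical choices made for this example. Real systems use far more sophisticated noise reduction than the crude silence gate shown here.

```python
import math

def normalize_volume(samples, target_peak=0.9):
    """Scale the signal so its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def split_into_frames(samples, frame_size, hop_size):
    """Cut the signal into overlapping, fixed-length frames."""
    last_start = len(samples) - frame_size
    return [samples[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]

def frame_energy(frame):
    """Average power of one frame; low values suggest silence or faint noise."""
    return sum(s * s for s in frame) / len(frame)

# Example input: one second of a quiet 440 Hz tone standing in for recorded speech.
SAMPLE_RATE = 16_000  # assumed sample rate, in samples per second
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]

cleaned = normalize_volume(tone)
# 400 samples = 25 ms per frame, 160 samples = 10 ms step (assumed values).
frames = split_into_frames(cleaned, frame_size=400, hop_size=160)
# Crude silence gate: keep only frames whose energy clears an assumed threshold.
speech_frames = [f for f in frames if frame_energy(f) > 0.01]
```

Each frame in speech_frames is the kind of short sequence that the recognition step would then analyze to work out which words were spoken.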


What is ASR used for?

A variety of industries use ASR for many different purposes. For instance, ASR technology is becoming a standard tool for professionals in higher education, legal, finance, government, health care and media. In all these fields, conversations are continuous and it’s often necessary to capture word-for-word records. Here are some examples of ASR use cases in different industries.

Legal: In legal proceedings, it’s often crucial to capture every word that a witness or other involved party states. Also, there’s currently a shortage of court reporters, making it challenging to carry out this important step. Digital transcription and the ability to scale are key solutions that ASR technology offers those in this industry.
Higher education: ASR captions and transcriptions allow universities to support students navigating hearing loss or other disabilities in classrooms. They can also serve students who are non-native speakers, who commute or who have varying learning needs. For instance, students with ADHD often focus better when they have access to captions.
Health care: Doctors are using ASR to transcribe notes from meetings with patients or document steps during surgeries.
Media: Media production companies use ASR to provide live captions and media transcription for all the content they produce, as they must according to the FCC (Federal Communications Commission) and other guidelines.
Corporate: Companies use ASR captioning and transcription to provide more accessible training materials and create inclusive environments for employees with differing needs.

What are the advantages of automatic speech recognition vs. traditional transcription?

Beyond offsetting the growing shortage of skilled traditional transcribers, ASR machines can make captioning and transcription more efficient. The technology can differentiate between voices in conversations, lectures, meetings and proceedings to provide an understanding of who said what. Speaker differentiation is helpful because interruptions are common in conversations with multiple participants.

Users can upload hundreds of related documents, including books, articles and more into the ASR machine to train it to get smarter. The technology can absorb this plethora of information faster than a human can. It can then begin recognizing different accents, dialects and terminology more accurately.

However, the ideal format involves using human intelligence to fact-check results that the artificial intelligence produces. This editing step is particularly important when the ASR is supporting accessibility initiatives where guidelines and laws require near-perfect accuracy.

Additional benefits include:

Improved information sharing with more data
Better access to data for those who need captions or transcripts because of a disability
The ability to provide automatic transcription and captions for audio and video files to give immediate access to students, employees and consumers
Improved efficiencies that allow companies, such as legal agencies, to scale their operations and provide more services to more clients quickly
Easier documentation and hands-free note taking to help students and professionals
Efficient improvements to accuracy


How does Verbit’s ASR work specifically?

Verbit’s ASR machine works to provide captions and transcriptions for both live and recorded audio and video. It uses adaptive algorithms and three models that inform the ASR machine’s ability to perform precisely.

An acoustic model reduces background noise and echoes to cancel out factors that reduce the audio quality. This model also identifies speakers.
A linguistic model identifies specific terminology, recognizes different accents and dialects and differentiates between speakers.
A contextual events model incorporates current events, news, and relevant updates. By doing so, the technology incorporates new terms that enter the public dialogue.

Verbit’s automatic speech recognition system works live, or users can choose to upload completed recordings. After the user uploads those files, the proprietary speech-to-text engine gets to work.

Achieving accuracy is highly important to Verbit and its clients. In fact, laws like the Americans with Disabilities Act often require higher levels of accuracy from our clients. To accommodate this need, Verbit takes the process one step further by using two skilled human transcribers per project to edit and review the ASR’s results. Once the process is complete, users can download the file immediately in the format of their choice.

How is the accuracy of ASR measured?

ASR alone isn’t always accurate, and its accuracy varies greatly based on several factors, including how much training went into developing the system. As a result, some ASR systems perform much better than others. The metric used to measure the accuracy of ASR is called the word error rate (WER).

The WER counts three categories of errors: substitutions, deletions and insertions.

Substitutions: This happens when the ASR replaces the correct word with an incorrect one. For example, a speaker says, “Don’t make a fuss,” and the ASR writes, “Don’t make a bus.” Advanced AI takes the context into consideration to reduce these types of errors.

Deletions: A deletion is when the ASR leaves out a word. Omitting a word can change the meaning and make for a confusing transcription. Just consider the difference between “She did not complete the task” and “She did complete the task.”

Insertions: Sometimes, ASR will include words that the speaker did not say. Maybe the speaker said, “We’re ahead of schedule,” but the ASR transcribes, “We’re too ahead of schedule.” In this case, maybe another speaker, background noise or another issue led to the extra word.
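To illustrate how context can reduce substitution errors such as “fuss” versus “bus,” here is a toy Python sketch. It assumes the recognizer proposes several candidate transcripts, each with a score based on the audio alone, and then adds a score for how plausible the wording is. Everything in it is hypothetical: the Candidate class, the function names and especially the scores, which are invented for this example and not taken from any real system.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    acoustic_score: float  # log-probability from the audio alone (invented values)

# Toy "language model": log-probabilities for three-word phrases.
# These numbers are made up purely for illustration.
PHRASE_LOGPROB = {
    "make a fuss": -2.0,
    "make a bus": -9.0,
}
UNKNOWN_LOGPROB = -12.0  # assumed penalty for phrases the toy model has not seen

def context_score(text):
    """Sum the toy log-probabilities of every three-word phrase in the text."""
    words = text.lower().split()
    phrases = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    return sum(PHRASE_LOGPROB.get(p, UNKNOWN_LOGPROB) for p in phrases)

def pick_best(candidates, context_weight=1.0):
    """Choose the candidate with the best combined audio and context score."""
    return max(
        candidates,
        key=lambda c: c.acoustic_score + context_weight * context_score(c.text),
    )

candidates = [
    Candidate("don't make a bus", acoustic_score=-1.0),   # sounds slightly closer
    Candidate("don't make a fuss", acoustic_score=-1.5),  # makes more sense
]

best_with_context = pick_best(candidates)
best_audio_only = pick_best(candidates, context_weight=0.0)
```

With these invented scores, the audio alone slightly favors “bus,” while adding the context score tips the choice to “fuss.”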

Calculating the WER means dividing the number of errors by the total number of words in the reference, that is, the words the speaker actually said. If there are 100 words in the sample and 20 errors, the WER is 0.2, or 20%. ASR can produce transcripts with impressively low WERs. However, many variables impact accuracy.
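The calculation can be sketched in Python as follows. This is a minimal example, assuming words are compared after lowercasing and splitting on whitespace; production scoring tools usually normalize punctuation and numbers as well. The function names count_errors and word_error_rate are hypothetical. The code uses a standard edit-distance alignment to find the smallest number of substitutions, deletions and insertions that turn what was said into what was transcribed.

```python
def count_errors(reference, hypothesis):
    """Return (errors, substitutions, deletions, insertions) for the best alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] describes the best alignment of the reference words seen so far
    # with the first j hypothesis words. With no reference words, all j are insertions.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]  # no hypothesis words yet: i deletions
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diagonal = prev[j - 1]  # words match, no new error
            else:
                e, s, d, n = prev[j - 1]
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = prev[j]
            deletion = (e + 1, s, d + 1, n)  # reference word left out
            e, s, d, n = curr[j - 1]
            insertion = (e + 1, s, d, n + 1)  # extra word in the transcript
            curr.append(min(diagonal, deletion, insertion))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Errors divided by the number of words in the reference."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return count_errors(reference, hypothesis)[0] / ref_len

# The three examples from this article, one error each.
substitution = count_errors("Don't make a fuss", "Don't make a bus")
deletion = count_errors("She did not complete the task", "She did complete the task")
insertion = count_errors("We're ahead of schedule", "We're too ahead of schedule")
```

Note that even a single wrong word in a four-word sentence produces a WER of 0.25, which is why WER is normally measured over much longer samples.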

When ASR transcribes poor-quality audio, speakers with heavy accents, recordings that include unusual niche language or other challenging material, the transcript will likely have a higher, and therefore worse, WER. In real-world scenarios, background noise or speakers who stand too far from or too close to a microphone can impact the ability of ASR to produce quality results.

Training the AI to handle these issues can reduce errors, but the best way to provide high quality is to have humans edit the results. When it comes to accessibility, adding this layer is often necessary to provide an equitable experience.

Automatic speech recognition technology is now expected and evolving

Consumers and professionals now expect to reap the benefits that automatic speech recognition offers. The days of jotting down notes by hand, figuring out which button turns the lights on and rushing home after forgetting to lock the door are gone. You can now complete all of these tasks with your voice. Additionally, these features will become more secure as the technology learns to distinguish between voices.

ASR software and ASR transcription services will only continue to disrupt the way we function in our classrooms, workplaces and homes. With more efficiencies and use cases, this technology will continue to evolve to best serve those who rely on it.

Verbit’s mature ASR is supporting universities, businesses and other organizations worldwide. Reach out to us today to learn how our accessibility solutions are helping create more inclusive environments and new opportunities for people with disabilities.
