What is Speech to Text Software?

Speech to text software bills itself as the catch-all solution to transcription: the cheap, easy, accurate and fast transcript that you’ve been searching for. But is it as good as the hype? What is ‘speech to text’ software anyway?

In a nutshell, speech to text software, or automatic speech recognition (ASR) software, or voice to text software, is a computer program that uses linguistic algorithms to sort auditory signals and transform that information into words using Unicode characters. Yikes!

Put more simply, speech to text software ‘listens’ to audio and delivers an editable, verbatim transcript.

A large number of automatic transcription service providers operate online. Most offer price points that look very attractive to anyone familiar with human transcription services, averaging around £0.10 per minute of recorded audio, and some are even free.

Most promise accuracy rates of around 90%-95%. This, however, applies only to ‘clean’ recordings, a caveat that is critical to understand when deciding whether ASR software meets your transcription needs.

Before you get overly excited and ditch your allocated transcription budget in favour of speech to text software, it is worth getting better acquainted with this technology. Here is a quick rundown regarding the truth about speech to text software, and how it stacks up against traditional human transcription services.

How Does Speech to Text Software Work?

There are multiple steps involved in converting speech into text. When you talk, you create a series of vibrations. These are converted into digital data by an analogue-to-digital converter (ADC).

The ADC completes this conversion by sampling the sound: it takes frequent, very precise measurements of the sound wave. The system then filters the signal to pick out the sounds that are relevant and to distinguish between frequencies. The speed of the speech is also adjusted and the volume normalised to a constant level.
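To make the sampling and volume steps more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular ASR product works: the 16,000 samples-per-second rate, the 16-bit range and the function names are all assumptions chosen for the example, and a pure tone stands in for a real voice.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate; real systems vary


def sample_tone(frequency_hz: float, duration_s: float, amplitude: float) -> list[float]:
    """Measure a continuous sine wave at regular intervals, as an ADC would."""
    count = round(SAMPLE_RATE_HZ * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE_HZ)
        for n in range(count)
    ]


def normalise_peak(samples: list[float], target_peak: float = 1.0) -> list[float]:
    """Scale the samples so the loudest one sits at a constant level."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    return [s * target_peak / peak for s in samples]


def quantise_16bit(samples: list[float]) -> list[int]:
    """Round each measurement (in the range -1.0 to 1.0) to a 16-bit integer."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


# A quiet 220 Hz tone lasting one hundredth of a second (hypothetical input).
quiet_speech = sample_tone(frequency_hz=220.0, duration_s=0.01, amplitude=0.2)
digital = quantise_16bit(normalise_peak(quiet_speech))
```

Here the quiet tone yields 160 measurements, and after normalisation the loudest one fills the full 16-bit range regardless of how softly the original was ‘spoken’.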

The next stage involves segmenting the signal into hundredths or thousandths of a second and matching these segments to phonemes (a phoneme is a unit of sound that distinguishes one word from another in a particular language). There are over 40 phonemes in the English language. Each phoneme is then evaluated in relation to the phonemes around it, and the system runs the resulting sequence through a complex mathematical model that compares it to known words, phrases and sentences.

Using machine learning, the system then produces text based on what the person most probably said. This is presented either as a chunk of text (a text file) or as a computer command.
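The segmenting and ‘most probable’ steps can be sketched in a few lines of Python. This is a toy under stated assumptions rather than a real recogniser: the 10 millisecond frame length, the two candidate phrases and their scores are all invented for illustration, whereas real systems learn such scores from large amounts of training data.

```python
FRAME_MS = 10  # assumed frame length: one hundredth of a second


def split_into_frames(samples: list[int], sample_rate_hz: int) -> list[list[int]]:
    """Cut the sample stream into consecutive fixed-length frames."""
    frame_len = sample_rate_hz * FRAME_MS // 1000
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# Hypothetical scores for two candidate transcripts of the same audio:
# "acoustic" is how well the candidate matches the sounds, and "language"
# is how likely it is as ordinary English. The numbers are made up.
CANDIDATES = {
    "recognise speech": {"acoustic": 0.40, "language": 0.30},
    "wreck a nice beach": {"acoustic": 0.45, "language": 0.02},
}


def most_probable(candidates: dict[str, dict[str, float]]) -> str:
    """Pick the candidate with the highest combined probability."""
    return max(
        candidates,
        key=lambda text: candidates[text]["acoustic"] * candidates[text]["language"],
    )


# One second of placeholder samples at an assumed 16,000 samples per second.
frames = split_into_frames(list(range(16_000)), sample_rate_hz=16_000)
best = most_probable(CANDIDATES)
```

With these invented numbers, the second candidate matches the sounds slightly better, but the first wins overall because it is far more plausible as English. That trade-off is why the same system can do well on everyday phrases and badly on unusual ones.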

ASR/Speech to Text Software: the Good, the Bad and the Ugly

ASR may seem like a brilliant option on the surface. But, if you delve deeper, there can be issues, particularly with certain types of recording. When comparing ASR and human-based transcription services, it’s wise to explore the good, the bad, and the downright ugly.

Speech to Text Software: The Good

The most significant advantages offered by ASR are speed and cost. Automatic speech recognition (ASR) produces rapid results, and can even offer a real-time service in some cases. The associated price tag is also considerably lower than human services.

Some services charge per minute. Others have a set subscription fee. Subscription-based services generally cap the total amount of audio you are allowed to upload per month. No matter how you are charged, you can expect to pay around £0.07-£0.10 per minute of audio for an automatic transcription service.

A few services, however, are free, although paying for access to transcription software is likely to get you slightly better results. Now, let’s get into some of the problems with speech to text software.

Speech to Text Software: The Bad

One major limitation of automated speech recognition technology is that it can produce only verbatim text. In the absence of a human, the system is only capable of transcribing what is there. This means that you could end up with a transcript which is awkward to read.

When you speak, it’s very common to pause, to make noises such as ‘erm’, and to stumble on certain spoken words. A verbatim text will include everything on the recording. Human services can clean this up and deliver a much more readable transcript that still retains all of the detail and accuracy of the original recording. 
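As a rough illustration of the simplest part of that clean-up, the Python sketch below strips a handful of filler sounds from a verbatim line. The filler list and the function name are assumptions made for this example, and a human transcriber does a great deal more than this, such as resolving false starts and repairing punctuation.

```python
import re

FILLERS = ("erm", "um", "uh", "er")  # assumed list; extend to suit your speakers

# Match a whole filler word, an optional trailing comma or full stop,
# and any whitespace that follows it.
_FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def remove_fillers(verbatim: str) -> str:
    """Return the text with filler sounds removed and spacing tidied."""
    cleaned = _FILLER_PATTERN.sub("", verbatim)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


verbatim = "So, erm, the report is um nearly finished."
cleaned = remove_fillers(verbatim)
print(cleaned)  # So, the report is nearly finished.
```

A pattern like this only removes sounds it has been told about, and it leaves stumbles and repeated words untouched.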

Speech to Text Software: The Ugly

The most concerning aspect of ASR is its accuracy. Whatever the advertised figures for clean audio, even the best speech to text software rarely achieves accuracy rates over 80% on everyday recordings, which often means that you have to spend time and effort making corrections and improvements.
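It helps to see what an accuracy figure like 80% means in practice. Providers do not always say how they measure accuracy, so the Python sketch below assumes one common approach: count the word-level substitutions, insertions and deletions needed to turn the software’s output into the correct transcript (the word error rate), and treat accuracy as one minus that rate. The example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "please send the signed contract to our finance team today"
hypothesis = "please send the sign contract to our finance team"

wer = word_error_rate(reference, hypothesis)
accuracy = 1 - wer
print(f"Word error rate: {wer:.0%}, accuracy: {accuracy:.0%}")
```

In this example the software gets one word wrong and drops another out of ten, which works out at 80% accuracy. At that rate, roughly one word in five needs fixing by hand.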

If there are ‘complicating’ factors, ASR can produce unintelligible results. To get a usable transcript from a speech to text service, you need ‘clean’ audio recordings. That means a high-quality recording of people speaking slowly, one at a time, without strong accents and with little to no background noise.

ASR may also struggle with specialised language or find it challenging to identify brand names and industry-specific jargon. Human transcription services will often let you provide a glossary of terms to avoid such complications, or can pair you with a transcriber with experience in the relevant field. ASR software can be trained over time for specific industries or subjects, but this takes time and is unlikely to be what you receive out of the box.

How ASR Stacks up Against Human-Based Transcription Services

There are several key differences between speech to text software and human-based transcription services.


Cost is an important factor for many people, and human transcription services are significantly more expensive than ASR. Some ASR services are free, but most charge around £0.10 per minute. In contrast, human services usually have a fee of approximately £2 per minute. Lower rates may be available for long turnaround times. But, even if you can wait a week for your transcript, you won’t be able to get a human-based service as cheaply as speech to text software.
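To put those rates side by side, here is a small Python calculation. The per-minute rates are the approximate figures quoted above; the 60-minute recording length and the function name are assumptions chosen for illustration.

```python
ASR_RATE_GBP_PER_MIN = 0.10    # "around £0.10 per minute", as quoted above
HUMAN_RATE_GBP_PER_MIN = 2.00  # "approximately £2 per minute", as quoted above


def transcription_cost(minutes: float, rate_gbp_per_min: float) -> float:
    """Return the cost in pounds, rounded to the nearest penny."""
    return round(minutes * rate_gbp_per_min, 2)


interview_minutes = 60  # hypothetical one-hour recording

asr_cost = transcription_cost(interview_minutes, ASR_RATE_GBP_PER_MIN)
human_cost = transcription_cost(interview_minutes, HUMAN_RATE_GBP_PER_MIN)
print(f"ASR: GBP {asr_cost:.2f}, human: GBP {human_cost:.2f}")
```

For a one-hour recording that is about £6 for ASR against about £120 for a human service, before counting any time you spend correcting the automatic transcript yourself.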


The timeframe in which human services operate is much longer than that of ASR. In most cases, human services offer a turnaround of 12-24 hours, with many providing a delivery time guarantee. ASR is much faster: it produces transcripts within seconds. If you need a human-based transcription urgently, you’ll likely be charged a premium.

Options and Versatility

With ASR, your only option is a verbatim transcript, and only if the speech recognition software is up to the task from an accuracy perspective. Human-based services offer a much broader spectrum of options, including verbatim and Detailed Notes. The verbatim option from most human-based transcription services will still correct errors and remove pauses, ‘ums’ and ‘errs’ to deliver a version that is much easier to read (unless you request to have all the detail left in). Detailed Notes go a step further to provide a more concise transcript. This can include summarising questions and removing off-topic chit-chat and pleasantries.

Confidence and Quality

When you invest in human-based transcription services, you enjoy greater confidence in the quality of the product. Human services have quality control guarantees and generally deliver 99%+ accuracy rates, only failing to do so if the audio is completely indecipherable.

Transcripts will be proofread, so you don’t need to devote your own time to checking the text or making changes. If you use ASR, you may find that you have to spend valuable time combing through the text looking for mistakes, fixing garbled text and removing words and unwanted sounds.

Summary: Speech to Text Delivers a Budget Solution, But is Not a Direct Replacement For Human Services

Speech to text software offers an attractive budget solution for those looking for transcription services in a hurry. But it is not yet capable of matching the quality and accuracy of human-based transcription services.

Because ASR is cheap, and sometimes even free, it can be worth experimenting with to see what kinds of results you can achieve. By trying different options, you can determine what kind of sound quality is required to produce intelligible results.

The speed and price of ASR are undoubtedly appealing, but there are flaws. ASR-produced texts are sometimes unintelligible. Speech to text software only produces verbatim transcripts, and accuracy rates are always significantly lower than those of human-based services.

To achieve a good-quality transcription with ASR, you need to invest in making a high-quality recording. But, if you want a range of options, an accurate transcription, and unrivalled attention to detail, you will need to invest in a human-based service.

You have been reading a guide to speech to text software and how it compares to human-based transcription services. If you’re not sure whether ASR or human-based transcription services are right for you, use our handy infographic to find out.


Take Note

Take Note is a UK-based transcription service with world-class customer support alongside the highest standards of security and ethics. We deliver a comprehensive range of transcription services including Audio and Video Transcription, Video Captions and On-Site Note Taking.