How to get the best results from Google Speech to Text APIs

brucebookman

A step-by-step guide to producing the highest-value outcomes for speech transcription using Google Speech-to-Text

Google Cloud Speech-to-Text (STT) enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes 120 languages and variants to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or prerecorded audio, using Google’s machine learning technology. (source: Google.com)

TL;DR

  1. Decide the measure(s) to employ for evaluating STT
  2. Apply best practices to gather 20 hours of audio
  3. Produce ground truth for the audio
  4. Create a testing matrix
  5. Run tests and evaluate results

1 - Use case determines measurements

Not every speech transcription project is focused on transcription accuracy. Word Error Rate (WER) may be an important measure for some applications while being of little value in others.
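For reference, WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to turn the transcription hypothesis into the reference transcript, divided by the number of reference words (N): WER = (S + D + I) / N. Below is a minimal Python sketch of this calculation using the standard dynamic-programming edit distance; it is an illustration, not the simple_wer.py tool mentioned later.

def wer(reference, hypothesis):
    # Word-level edit distance: d[i][j] is the minimum number of
    # substitutions, deletions, and insertions needed to turn the
    # first j hypothesis words into the first i reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("meet" for "meat") in a four-word reference: WER = 0.25
print(wer("i would like meat", "i would like meet"))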

WER is important when every word counts. Use cases include transcribing audio books, earnings calls, immigration interviews, legal proceedings, and others.

WER may not be the best measure in cases where the exact words are less critical. Providing raw text to an AI labeling algorithm is a good example; extracting pricing data from a retail environment is another.

Before diving deep into the options that Google STT provides, it is critical to define the use case and know what measures are going to be used in the evaluation process.

2 - Follow best practices and gather audio

It is critical to consult the Google Cloud Speech-to-Text Best Practices. The most important points are summarized below:

  • Capture audio with a sampling rate of 16,000 Hz or higher.
  • Use a lossless codec such as FLAC or LINEAR16 to record and transmit audio; avoid MP3, OGG_OPUS, and other lossy codecs (see the config sketch after this list)
  • Position the microphone as close to the source audio as possible
  • Reduce background noise as much as possible
  • Listen to some sample audio. It should sound clear, without distortion or unexpected noise
  • Volume of voices should be loud enough for a human to pick up what is being said
  • Record different speakers on separate audio channels
  • Do not use Text-to-Speech output or other synthetic audio for STT. The machine learning models for STT are trained on human speech, and results for synthetic speech can be poor
  • Read Google Cloud Speech to Text Best Practices
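Several of these recommendations map directly onto fields in the recognition request. Below is a sketch of a config that follows them; the field names come from the REST RecognitionConfig reference, and the bucket and file names are placeholders:

"config": {
    "encoding": "FLAC",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",
    "audioChannelCount": 2,
    "enableSeparateRecognitionPerChannel": true
},
"audio": {
    "uri": "gs://your-bucket/your-audio.flac"
}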

Evaluating STT requires experimentation, and Google recommends at least 20 hours of audio data. Simply put, the more audio data the better: 20 hours is the minimum for statistically valid tests, and testing with less widens the error range of the results.

3 - Produce ground truth data

Data representing the desired outcome, such as labels or human transcriptions, is required for comparison against STT-derived results.

For cases where WER is critical, the ground truth is a text file containing the human transcription of each audio file.

If the intention is to generate labels from transcriptions, a set of labels per audio file will be needed to compare labels generated by humans and labels generated from the API transcriptions.

4 - Build test matrix

The testing matrix will include one or more models, one or more language codes, speech adaptation phrase hints, and a set of boost values.

Example test matrix

Model      | Language code | Alternative languages | Phrase hints used (y/n) | Boost value
phone call | en-US         | en-GB, en-AU          | n                       | 0
enhanced   | en-US         | en-GB, en-AU          | n                       | 0
phone call | en-US         | en-GB, en-AU          | y                       | 10
enhanced   | en-US         | en-GB, en-AU          | y                       | 10
phone call | en-US         | en-GB, en-AU          | y                       | 20
enhanced   | en-US         | en-GB, en-AU          | y                       | 20
phone call | en-US         | NONE                  | n                       | 0
enhanced   | en-US         | NONE                  | n                       | 0
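If the matrix grows beyond a handful of rows, it can be generated programmatically rather than by hand. A minimal sketch is below; the values mirror the table above, the dictionary keys are illustrative rather than API field names, and itertools.product enumerates every combination, so prune rows you do not intend to test:

import itertools

models = ["phone call", "enhanced"]
alternatives = [["en-GB", "en-AU"], []]
adaptation = [(False, 0), (True, 10), (True, 20)]  # (phrase hints used, boost)

test_matrix = [
    {"model": model, "languageCode": "en-US", "alternativeLanguages": alts,
     "phraseHints": hints, "boost": boost}
    for model, alts, (hints, boost) in itertools.product(models, alternatives, adaptation)
]
for test in test_matrix:
    print(test)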

Match models to use case

Google STT offers a variety of features that can help improve transcription. Create a list of features to evaluate during testing.

For US English, consider the enhanced phone call model

Review Selecting Models

The phone call model offered by Google STT is trained on IVR (interactive voice response) applications and phone-call-quality audio. An enhanced model is available at an extra cost.

Google creates and improves enhanced models based upon data collected through data logging; if you opt in to data logging, your audio may be used to help improve these models.

There is currently only one enhanced model, which is used for processing phone call audio. When you request an enhanced model for your phone call transcription requests, you can receive higher quality results. Using the improved phone call model, Cloud Speech-to-Text can more accurately recognize speech captured from phone audio and therefore produce more accurate transcription of the audio data.

As part of the testing matrix, apply both the standard phone call model and the enhanced phone call model to help determine whether the extra cost provides a benefit.
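In the request body, the model and the enhanced flag are selected with the model and useEnhanced fields. A sketch:

"config": {
    "languageCode": "en-US",
    "model": "phone_call",
    "useEnhanced": true
}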

Languages other than US English

If you intend to use Google STT for languages other than US English, consult Supported features by language. The standard phone_call, enhanced phone_call, and video models are only available for en-US; therefore, only the default and command_and_search models are available for other languages.

Your audio may contain a mixture of languages, in which case including alternative language codes can improve transcription accuracy. Consult Detecting language spoken automatically.
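Alternative languages are supplied through the alternativeLanguageCodes field (part of the v1p1beta1 API at the time of writing, with up to three alternatives allowed). A sketch:

"config": {
    "languageCode": "en-US",
    "alternativeLanguageCodes": ["en-GB", "en-AU"]
}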

Speech adaptation phrases

Verticals such as health care, banking, retail, technology, and construction may benefit from applying speech adaptation. Google STT models are trained on millions of hours of speech and are therefore not targeted at any one use case.

When you send a transcription request to Cloud Speech-to-Text, you can include a list of phrases to act as “hints” to Cloud Speech-to-Text. Providing these hints, a technique called speech adaptation, helps the Speech-to-Text API recognize the specified phrases in your audio data. Thus, if your source audio includes a speaker saying “meet” frequently and you specify the phrase “meat” as a speech adaptation hint, the Speech-to-Text API is more likely to transcribe the word as “meat” rather than “meet.”

Review the ground truth data gathered and extract a list of words and phrases. For health care, this list might include medical terms or drug names; for retail, it might include product names. The generated list should be complete enough to cover the test audio, but it does not need to be the all-inclusive list you would build for a deployed application. See the classes section below for more details on certain kinds of phrases.
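As one way to seed this list, the ground truth transcripts can be mined for frequent domain terms. A minimal sketch is below; the stop-word set and top_k cutoff are arbitrary choices for illustration, and any generated list deserves human review:

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for", "with"}

def candidate_phrases(transcripts, top_k=50):
    # Count non-stop-word tokens across all ground truth transcripts and
    # return the most frequent ones as candidate phrase hints.
    counts = Counter()
    for text in transcripts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_k)]

print(candidate_phrases([
    "Place an order for a burger with cheese and pickles",
    "One pizza with extra cheese and mayo",
]))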

Speech adaptation boost

To help determine the best boost value to apply to phrases, include a set of boost values in the test matrix. Boost values that are too high can over-bias transcription and produce undesirable results, so testing a range of boost values is key.

To amplify the effect of speech adaptation, you can use boost-based adaptation. With boost adaptation, you provide a relative value to bias Cloud Speech-to-Text towards a speech adaptation for transcription.

Higher boost values can result in fewer false negatives (cases where the utterance occurred in the audio but wasn’t recognized by Cloud Speech-to-Text). However, boost can also increase the likelihood of false positives, that is, cases where the utterance doesn’t occur in the audio data but appears in the transcription. For best results, experiment with an initial value and adjust up or down as needed.

Speech adaptation classes

Mailing and shipping addresses, phone numbers, dates, times, and numbers are all examples of classes. If these are important to your transcription project, include classes in the STT request body. Consult Improving accuracy with speech adaptation.
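Classes are referenced inside phrases as $-prefixed class tokens. A sketch using the $FULLPHONENUM token, one of the class tokens listed in the speech adaptation documentation (check the docs for the currently supported set):

"config": {
    "languageCode": "en-US",
    "speechContexts": [{
        "phrases": ["my phone number is $FULLPHONENUM"]
    }]
}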

Testing speech adaptation

Since boost values can apply to single phrases, it is possible to build a complex list of phrases and boost options. However, for the initial testing phases, use a single phrase list and boost value.

Use this:

"config": {
    "speechContexts": [{
        "phrases": ["Place an order", "pizza", "french fries", "burger", "cheese", "pickles", "mustard", "mayo"],
        "boost": 10
    }]
}

Rather than this:

"config": {
    "speechContexts": [{
        "phrases": ["Place an order"],
        "boost": 10
    },
    {
        "phrases": ["pizza"],
        "boost": 30
    }]
}

Although the above is valid, this level of complexity is best reserved for fine-tuning rather than initial testing.

Run the tests

With a testing matrix in place, run each test and measure the difference between the ground truth and the STT-derived outcome. For cases where WER is the key metric, the human transcript (the reference) can be scored against the API transcript (the hypothesis); the simple_wer.py tool can be used for this. The resulting WER for each test will help determine whether further testing is required, or help identify a configuration to put into production.
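Putting the pieces together, a sketch of the evaluation loop is below. Here run_stt is a placeholder for whatever sends your requests and returns a transcript, wer is the scoring function sketched earlier, and references maps each audio file name to its ground truth transcript; all three names are illustrative:

def evaluate(test_matrix, references, run_stt, wer):
    # Score every matrix configuration over the audio set and return
    # the configurations sorted by mean WER, lowest (best) first.
    results = []
    for test in test_matrix:
        scores = [wer(ref_text, run_stt(audio_name, test))
                  for audio_name, ref_text in references.items()]
        results.append((test, sum(scores) / len(scores)))
    return sorted(results, key=lambda pair: pair[1])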

For labels, the API transcripts would be run through your labeling algorithm. After deriving labels from STT, compare them to expected results.

Written on November 18, 2019