6
votes

I started using the Google Speech API to transcribe audio.

The audio being transcribed contains many numbers spoken one after the other.

E.g. 273 298

But the transcription comes back as 270-3298.

My guess is that it is interpreting it as some sort of phone number.

What I want is unparsed output, e.g. "two seventy three two ninety eight", which I can parse on my own.

Is there a setting or support for this kind of thing?

thanks

4
Are you requesting more than one alternative? If so, do any of the others get the transcription correct? - brandall
I get 10 alternatives, and all of them have the number formatted as a phone number. - Moshe Rayman
I'm having a similar problem. The application asks users to enter a 9-digit card number, but Google thinks the user is trying to say a phone number, so it pads the result with an extra digit at the end or even in the middle of the number. - Sam
Try IBM's speech recognition service, which provides a "smart_format" option that controls whether it returns the original transcript or a "formatted" one. - dy.octa

4 Answers

5
votes

So I had this exact same problem and I think we found a solution. If you're using English as input, switch to en-PH just when working with numbers. Google will then not format the result as a U.S. phone number or try to stick an extra digit in there.
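One way to apply this in client code is to pick the language code per request. This is a minimal sketch, assuming a REST-style request body with a languageCode field; the helper name, the expect_digits flag, and en-US as the default are all hypothetical choices for illustration:

```python
def build_recognition_config(expect_digits, default_language="en-US"):
    """Return a request config dict, switching to en-PH for number-heavy audio."""
    # en-PH is the workaround suggested in this answer; whether it avoids
    # phone-number formatting for your audio should be checked empirically.
    language = "en-PH" if expect_digits else default_language
    return {"languageCode": language, "maxAlternatives": 1}


print(build_recognition_config(expect_digits=True))
```

Keeping the switch in one helper makes it easy to fall back to the default language for ordinary speech.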

2
votes

Try passing a speech context with some phrase hints. How to use it is documented here: https://cloud.google.com/speech/docs/basics#phrase-hints

Give it the spelled out numbers that you want recognized.

"speech_context": {
  "phrases": ["zero", "one", "two", ... "nine", "ten", "eleven", ... "twenty", "thirty", ..., "ninety"]
}

This isn't guaranteed to work, but it may help.
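The phrase list can be generated rather than typed out by hand. The sketch below assumes the camelCase field names of the v1 REST request (speechContexts, languageCode), which differ from the speech_context spelling in the snippet above; the helper names are hypothetical:

```python
import json

# Spelled-out number words used as phrase hints.
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def build_speech_context():
    """Return one speech context holding every number word as a phrase hint."""
    return {"phrases": UNITS + TENS}


def build_config(language_code="en-US"):
    """Return a config dict carrying the phrase hints (field names are assumed)."""
    return {
        "languageCode": language_code,
        "speechContexts": [build_speech_context()],
    }


print(json.dumps(build_config(), indent=2))
```

As the answer notes, hints only bias recognition; they do not switch off number formatting.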

1
votes

For the record, I tried blambert's solution above and, unfortunately, it doesn't work. I recently posted another question asking whether anyone has found a way to defeat this behavior, as it is preventing me from implementing a transcription service that I had planned.

0
votes

Have you tried Google Speech customClass?

There are class tokens you can use to tell the API that you are expecting a different kind of number rather than a phone number.

For instance, if you choose OOV_CLASS_AM_RADIO_FREQUENCY, you tell the API to interpret numbers like this:

  • "twelve twenty" --> 1220
  • "seven hundred and thirty" --> 730

Probably (I haven't confirmed this) the API uses the FULLPHONENUM class by default for numbers:

  • "one eight hundred five five five four oh oh one" --> +1-800-555-4001
  • "seven one eight five five five six one oh one" --> 718-555-6101
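A class token is passed as a phrase hint like any other phrase. The sketch below is an assumption-heavy illustration: the helper name is hypothetical, and the leading "$" reflects my understanding of how class tokens are written in phrase hints, so check it against the current documentation:

```python
def build_class_token_context(token="OOV_CLASS_AM_RADIO_FREQUENCY"):
    """Return a speech context whose only phrase is the given class token."""
    # Assumption: class tokens are written with a leading "$" in phrase hints.
    if not token.startswith("$"):
        token = "$" + token
    return {"phrases": [token]}


print(build_class_token_context())
```

Whether this actually overrides the phone-number formatting is untested here.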