Why is so much text missing from my Google Speech API results?

Question

I have already spent a day trying to work out the best practice for using the Google Speech API.

This is my last attempt. I use an online source here so that we all have the same audio. The other requirement is ffmpeg, to convert the MP3 to a format the Google API accepts.

Audio information:

Singer: Adele
Song: Chasing Pavements
Possible language: en-GB (Adele's origin) or en-US
Sample rate: 44100 Hz
Channels: stereo (2 channels)
Format: MP3

What I did:

Tried both FLAC and WAV
Tried both the original sample rate (44100) and 16000
Always used mono (1 channel)
Tried both en-GB and en-US

The output I want: text alignment. That is a secondary target, though, because right now I'm focusing on why so much of the transcribed text is missing.

Note: run it from bash/cmd.

Script: basic synchronous transcription, transcrib.php

<?php
set_time_limit(300); //5min
//google speech php library
require __DIR__ . '/vendor/autoload.php';

# Imports the Google Cloud client library
use Google\Cloud\Speech\SpeechClient;
//use Google\Cloud\Storage\StorageClient;
use Google\Cloud\Core\ExponentialBackoff;


//json credential path
$google_json_credential = 'cloud-f7cd1957f36a.json';
putenv("GOOGLE_APPLICATION_CREDENTIALS=$google_json_credential"); 
# Your Google Cloud Platform project ID
$projectId = 'cloud-178108';
//$languageCode = 'en-US'; // not good (too many misses)
$languageCode = 'en-GB'; //adele country

$oldFile = "test.mp3";
//flac or wav??
$typeFile = 'wav';
$sampleRate = 16000;

// Compare with ===; a single = assigns 'wav' and the FLAC branch never runs
if ($typeFile === 'wav') {
    $newFile = "test.wav";
    $encoding='LINEAR16';
    $ffmpeg_command = "ffmpeg -i $oldFile -acodec pcm_s16le -ar $sampleRate -ac 1 $newFile -y";
}else{
    $newFile = "test.flac";
    $encoding='FLAC';
    $ffmpeg_command = "ffmpeg -i $oldFile -c:a flac -ar $sampleRate -ac 1 $newFile -y";
}

//download file
//original audio info: adele - chasing pavements, stereo (2 channel) 44100Hz mp3
$rawFile = file_get_contents("http://www.karaokebuilder.com/pix/toolkit/sam01.mp3");
//save file
file_put_contents($oldFile, $rawFile);

//convert to google cloud format using ffmpeg
shell_exec($ffmpeg_command);

# The audio file's encoding and sample rate
$options = [
    'encoding' => $encoding,
    'sampleRateHertz' => $sampleRate,
    'enableWordTimeOffsets' => true,
];

// Create the speech client
$speech = new SpeechClient([
    'projectId' => $projectId,
    'languageCode' => $languageCode,
]);

// Make the API call
$results = $speech->recognize(
    fopen($newFile, 'r'),
    $options
);

// Print the results
foreach ($results as $result) {
    $alternative = $result->alternatives()[0];
    printf('Transcript: %s' . PHP_EOL, $alternative['transcript']);
    print_r($result->alternatives());
}

Result:

en-US:

wav: even if it leads nowhere [confidence: 0.86799717]
flac: even if it leads nowhere [confidence: 0.92401636]

en-GB:

wav: happy birthday balloons delivered Leeds Norway [confidence: 0.4939031] 
flac: happy birthday balloons delivered Leeds Norway [confidence: 0.5762244]

Expected:

Should I give up
Or should I just keep chasing pavements?
Even if it leads nowhere
Or would it be a waste?
Even If I knew my place should I leave it there?
Should I give up
Or should I just keep chasing pavements?
Even if it leads nowhere

If you compare the result with the expected text, you will see that not only is a lot of text missing, but some of it is also transcribed wrongly.

To be honest, I don't know whether the machine (Google Cloud) can hear my converted audio clearly or not, but I am trying to send the best converted audio I can.

Did I miss something in my script? Or am I not converting the audio correctly?

oakinlaja · Accepted Answer · 2018-03-06T19:59:02

Reviewing your script, it seems your code is written correctly, in line with https://cloud.google.com/speech/docs/reference/libraries#using_the_client_library.

Also, the fact that a few words were picked up shows that the Google Cloud Speech API does receive your converted audio. Although the Speech API can handle noisy audio and recognizes over 110 languages and variants, I believe the problem with music files comes down to constraints on how the speech recognizer works. I think you should test with simple (non-music) audio files.
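
To rule out the conversion step before blaming the recognizer, one option is to read the header of the converted WAV file and confirm it matches what the script sends (LINEAR16, mono, 16000 Hz). This is only a minimal sketch: `readWavHeader` is a hypothetical helper, not part of the Google client library, and it assumes ffmpeg wrote a canonical header with the `fmt ` chunk directly after `RIFF`/`WAVE`.

```php
<?php
// Hypothetical helper (not part of google/cloud-speech): parses the start of
// a RIFF/WAVE file and returns the fields that must match the request options.
// Assumption: the "fmt " chunk immediately follows the 12-byte RIFF header.
function readWavHeader(string $header): array
{
    if (strlen($header) < 36
        || substr($header, 0, 4) !== 'RIFF'
        || substr($header, 8, 4) !== 'WAVE'
        || substr($header, 12, 4) !== 'fmt ') {
        throw new InvalidArgumentException('Not a canonical RIFF/WAVE header');
    }

    // v = unsigned 16-bit little-endian, V = unsigned 32-bit little-endian
    $fmt = unpack(
        'vaudioFormat/vchannels/VsampleRate/VbyteRate/vblockAlign/vbitsPerSample',
        substr($header, 20, 16)
    );

    return [
        'pcm'           => $fmt['audioFormat'] === 1, // 1 means uncompressed PCM
        'channels'      => $fmt['channels'],
        'sampleRate'    => $fmt['sampleRate'],
        'bitsPerSample' => $fmt['bitsPerSample'],
    ];
}
```

With the script from the question, it could be called as `readWavHeader(file_get_contents('test.wav', false, null, 0, 44))` right after the ffmpeg step. If the result is not PCM, 1 channel, 16000 Hz and 16 bits, the conversion is the problem; if it is, the music itself is the more likely cause.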
