1
votes

I have a sample .webm file recorded with MediaRecorder in Chrome. When I use the Google Speech Java client to transcribe the video, it returns an empty transcription. Here is my code:

SpeechSettings settings = null;
Path path = Paths.get("D:\\scrap\\gcp_test.webm");
byte[] content = null;
try {
    content = Files.readAllBytes(path);
    settings = SpeechSettings.newBuilder().setCredentialsProvider(credentialsProvider).build();
} catch (IOException e1) {
    throw new IllegalStateException(e1);
}

try (SpeechClient speech = SpeechClient.create(settings)) {
    // Builds the request from the bytes of the local .webm file
    RecognitionConfig config = RecognitionConfig.newBuilder()
                    .setEncoding(AudioEncoding.LINEAR16)
                    .setLanguageCode("en-US")
                    .setUseEnhanced(true)
                    .setModel("video")
                    .setEnableAutomaticPunctuation(true)
                    .setSampleRateHertz(48000)
                    .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    // RecognitionAudio audio = RecognitionAudio.newBuilder().setUri("gs://xxxx/gcp_test.webm") .build();

    // Use blocking call for getting audio transcript
    RecognizeResponse response = speech.recognize(config, audio);
    List<SpeechRecognitionResult> results = response.getResultsList();

    for (SpeechRecognitionResult result : results) {
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcription: %s%n", alternative.getTranscript());
    }
} catch (Exception e) {
    e.printStackTrace();
    System.err.println(e.getMessage());
}

If I take the same file, go to https://cloud.google.com/speech-to-text/ and upload it in the demo section, it works fine and shows a transcription. I have no idea what is going wrong here. I inspected the request sent by the demo, and this is what it looks like:

[Image: request parameters sent by the demo]

I am sending exactly the same set of parameters, but that didn't work. I also tried uploading the file to Cloud Storage, but that gave the same result (no transcription).

2 Answers

0
votes

After some trial and error (and looking at the JavaScript samples), I solved the issue. The audio sent to the API should be in FLAC format. I was sending the video file (WebM) as-is to Google Cloud, with the encoding declared as LINEAR16. The demo on the site extracts the audio stream using the JavaScript Audio API and then sends the data in base64 format to make it work.
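A quick way to confirm which kind of file is about to be uploaded is to look at its first bytes. The sketch below is only an illustration: the class and method names (AudioContainerSniffer, detect) are made up for this example. It relies on two well-known signatures: WebM/Matroska files start with the EBML magic number 1A 45 DF A3, and native FLAC streams start with the ASCII marker "fLaC". A FLAC file that has been prefixed with extra tags would not be recognised by this simple check.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Google client library.
final class AudioContainerSniffer {
    // EBML magic number that starts Matroska/WebM files.
    private static final byte[] EBML_MAGIC = {0x1A, 0x45, (byte) 0xDF, (byte) 0xA3};
    // Native FLAC streams start with the ASCII marker "fLaC".
    private static final byte[] FLAC_MAGIC = "fLaC".getBytes(StandardCharsets.US_ASCII);

    // Returns "flac", "webm-or-matroska", or "unknown" based on the leading bytes.
    static String detect(byte[] content) {
        if (startsWith(content, FLAC_MAGIC)) {
            return "flac";
        }
        if (startsWith(content, EBML_MAGIC)) {
            return "webm-or-matroska";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of the file to inspect as the first argument.
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(detect(content));
    }
}
```

If this reports "webm-or-matroska" for the bytes being sent while the request declares another encoding, the file most likely still needs its audio extracted first.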

Here are the steps I followed to get the output.

  1. Use FFmpeg to extract the audio stream from the WebM file into FLAC format.

    ffmpeg -i sample.webm -vn -acodec flac sample.flac

  2. Make the extracted file available either through Cloud Storage or by sending it as a ByteString.

  3. Set the appropriate model when calling the Speech API (for English the video model works, while for French command_and_search does). I don't have a logical explanation for this; I found it by trial and error with the demo on the Google Cloud site.
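If the extraction in step 1 needs to happen from Java rather than from a shell, it can be sketched with ProcessBuilder. This is only a sketch under assumptions: it assumes the ffmpeg binary is on the PATH, and the class and method names (FlacExtractor, buildCommand, extract) are invented for this example. The arguments are the same ones used in the command above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that shells out to ffmpeg (assumed to be on the PATH).
final class FlacExtractor {
    // Builds: ffmpeg -i <input> -vn -acodec flac <output>
    static List<String> buildCommand(Path input, Path output) {
        return Arrays.asList(
                "ffmpeg",
                "-i", input.toString(),
                "-vn",              // drop the video stream
                "-acodec", "flac",  // encode the audio as FLAC
                output.toString());
    }

    // Runs ffmpeg and returns its exit code (0 means success).
    static int extract(Path input, Path output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, output))
                .inheritIO() // show ffmpeg's own output in the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = extract(Paths.get("sample.webm"), Paths.get("sample.flac"));
        if (exitCode != 0) {
            throw new IllegalStateException("ffmpeg exited with code " + exitCode);
        }
    }
}
```

The resulting sample.flac can then be read with Files.readAllBytes and wrapped in a ByteString, or uploaded to Cloud Storage, as described in step 2.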

0
votes

I got results with a FLAC-encoded file.

The sample code below prints the transcription and each word with its timestamps:

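// Imports assume the v1 package of the google-cloud-speech client.
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.speech.v1.LongRunningRecognizeMetadata;
import com.google.cloud.speech.v1.LongRunningRecognizeResponse;
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.cloud.speech.v1.WordInfo;
import java.util.List;

public class SpeechToTextSample {

    public static void main(String... args) throws Exception {
        try (SpeechClient speechClient = SpeechClient.create()) {
            String gcsUriFlac = "gs://yourfile.flac";

            // Declare the audio as FLAC and ask for per-word time offsets
            RecognitionConfig config =
                RecognitionConfig.newBuilder()
                    .setEncoding(AudioEncoding.FLAC)
                    .setEnableWordTimeOffsets(true)
                    .setLanguageCode("en-US")
                    .build();

            // Reference the file in Cloud Storage (for large files)
            RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUriFlac).build();

            // Start asynchronous speech recognition on the audio file
            OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
                speechClient.longRunningRecognizeAsync(config, audio);

            // Poll until the long-running operation has finished
            while (!response.isDone()) {
                System.out.println("Waiting for response...");
                Thread.sleep(1000);
            }

            List<SpeechRecognitionResult> results = response.get().getResultsList();

            for (SpeechRecognitionResult result : results) {
                // Use the first (most likely) alternative of each result
                SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
                System.out.printf("Transcription: %s%n", alternative.getTranscript());
                for (WordInfo wordInfo : alternative.getWordsList()) {
                    System.out.println(wordInfo.getWord());
                    // Print start and end offsets, truncated to tenths of a second
                    System.out.printf(
                        "\t%s.%s sec - %s.%s sec%n",
                        wordInfo.getStartTime().getSeconds(),
                        wordInfo.getStartTime().getNanos() / 100000000,
                        wordInfo.getEndTime().getSeconds(),
                        wordInfo.getEndTime().getNanos() / 100000000);
                }
            }
        }
    }
}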
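The timestamp arithmetic in the sample is easy to misread, so here it is in isolation. Each offset arrives as whole seconds plus a nanosecond remainder, and dividing the nanoseconds by 100000000 keeps only the tenths digit (truncated, not rounded). The class and method names below (WordOffsetFormatter, format) are made up for this illustration.

```java
// Hypothetical helper that mirrors the seconds/nanos formatting used in the sample.
final class WordOffsetFormatter {
    private static final int NANOS_PER_TENTH_OF_SECOND = 100_000_000;

    // Formats an offset as "<seconds>.<tenths>", truncating anything below a tenth.
    static String format(long seconds, int nanos) {
        return seconds + "." + (nanos / NANOS_PER_TENTH_OF_SECOND);
    }

    public static void main(String[] args) {
        System.out.println(format(1, 500_000_000)); // prints "1.5"
        System.out.println(format(2, 50_000_000));  // prints "2.0" (2.05 s is truncated)
    }
}
```

If finer resolution is needed, the divisor can be made smaller, but the fractional part then has to be zero-padded to keep the output correct.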

GCP supports many languages; I used "en-US" in my example. Please refer to the linked documentation for the list of supported languages.