I'm looking at using Google Cloud Speech to transcribe long-form narrated audio files, and I need to know the start time of each phrase in the audio file. Is there a way to do this with Google Cloud Speech?
I'm currently working with transcribe_async.py.
2 Answers
This is not possible with Google Cloud Speech. If that information is important to you, you may need to look at other ASR systems. I know that offline, non-hosted ASR systems like Kaldi and CMU Sphinx will give you this information. I don't know whether any hosted ASR systems provide it.
You can get approximate start and end times for each word (measured from the beginning of the audio track) by setting the enableWordTimeOffsets option to True: https://cloud.google.com/speech/docs/async-time-offsets.
Beware that the start time of the first word of the transcript is always 0 and that, as far as I know, each word's start time corresponds to the previous word's end time, even when there is a pause between them.
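To get from word offsets to phrase start times, here is a rough sketch. It assumes the JSON response has a "results" list where each result's first alternative carries a "words" list with "startTime" strings such as "1.300s", and it treats each result as one phrase, which is my own simplification. The helper names (parse_offset, phrase_start_times) and the sample data are made up for illustration; they are not part of the API.

```python
def parse_offset(offset):
    """Convert a duration string such as "1.300s" to seconds as a float."""
    return float(offset.rstrip("s"))


def phrase_start_times(response):
    """Return a (start_seconds, transcript) pair for each result.

    Assumption: each entry in "results" is treated as one phrase, and its
    start is the startTime of the first word in the first alternative.
    """
    phrases = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        words = best.get("words", [])
        if not words:
            # No word offsets here (enableWordTimeOffsets was probably off).
            continue
        start = parse_offset(words[0].get("startTime", "0s"))
        phrases.append((start, best.get("transcript", "").strip()))
    return phrases


# Made-up response data, only to show the assumed shape.
sample_response = {
    "results": [
        {"alternatives": [{
            "transcript": "hello there",
            "words": [
                {"startTime": "0s", "endTime": "0.700s", "word": "hello"},
                {"startTime": "0.700s", "endTime": "1.200s", "word": "there"},
            ],
        }]},
        {"alternatives": [{
            "transcript": " how are you",
            "words": [
                {"startTime": "4.500s", "endTime": "4.800s", "word": "how"},
                {"startTime": "4.800s", "endTime": "5s", "word": "are"},
                {"startTime": "5s", "endTime": "5.400s", "word": "you"},
            ],
        }]},
    ]
}

for start, text in phrase_start_times(sample_response):
    print(f"{start:7.2f}s  {text}")
```

Because word start times seem to follow directly from the previous word's end time, splitting phrases on gaps between words is unlikely to work, which is why this sketch relies on the result boundaries instead.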