I'm working with some recorded audio files, and I have the transcript of what is being said. The problem is that the audio is in Egyptian Arabic, so the recognition accuracy is not great. What I need is to give the API the transcript containing the correct text and then force-align the speech to that text; in other words, get the timestamps of each word of the text in the audio. Is there a way to do that?
1 Answer
Speech-to-Text is based on machine learning algorithms, and their training depends on the amount of data they are fed; therefore, some languages may have better accuracy than others. If you are using Arabic, you should try tweaking the parameters of the API request.
In addition, if you want the timestamps, the API has an option for returning word time offsets: enable the "enableWordTimeOffsets" parameter in the request configuration. The response will then include the "startTime" and "endTime" of every word along with the whole transcript, like the following:
{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "okay so what am I doing here...(etc)...",
            "confidence": 0.96596134,
            "words": [
              {
                "startTime": "1.400s",
                "endTime": "1.800s",
                "word": "okay"
              },
              {
                "startTime": "1.800s",
                "endTime": "2.300s",
                "word": "so"
              },
              {
                "startTime": "2.300s",
                "endTime": "2.400s",
                "word": "what"
              },
              {
                "startTime": "2.400s",
                "endTime": "2.600s",
                "word": "am"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.600s",
                "word": "I"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.700s",
                "word": "doing"
              },
              {
                "startTime": "2.700s",
                "endTime": "3s",
                "word": "here"
              },
              {
                "startTime": "3s",
                "endTime": "3.300s",
                "word": "why"
              },
              {
                "startTime": "3.300s",
                "endTime": "3.400s",
                "word": "am"
              },
              {
                "startTime": "3.400s",
                "endTime": "3.500s",
                "word": "I"
              },
              {
                "startTime": "3.500s",
                "endTime": "3.500s",
                "word": "here"
              },
              ...
            ]
          }
        ]
      },
      {
        "alternatives": [
          {
            "transcript": "so so what am I doing here...(etc)...",
            "confidence": 0.9642093
          }
        ]
      }
    ]
  }
}
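To work with these offsets in code, something like the following might help. It is only a minimal sketch, assuming the finished operation has the shape shown above and has already been loaded into a Python dict. The helper names (`build_request`, `parse_offset`, `word_timestamps`), the `gs://` URI, and the `ar-EG` language code are illustrative assumptions, not part of the original answer.

```python
import json


def build_request(audio_uri, language_code="ar-EG"):
    """Build a request body that asks for per-word time offsets.

    The URI and language code are placeholders; adjust them to your data.
    """
    return {
        "config": {
            "languageCode": language_code,
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": audio_uri},
    }


def parse_offset(value):
    """Convert a duration string such as '1.400s' or '3s' to float seconds."""
    return float(value.rstrip("s"))


def word_timestamps(operation):
    """Return (word, start, end) tuples from the top alternative of each result."""
    rows = []
    for result in operation.get("response", {}).get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Only some alternatives carry a "words" list, so default to empty.
        for item in alternatives[0].get("words", []):
            rows.append(
                (
                    item["word"],
                    parse_offset(item["startTime"]),
                    parse_offset(item["endTime"]),
                )
            )
    return rows


# A trimmed-down copy of the sample response, for illustration only.
sample = json.loads("""
{
  "done": true,
  "response": {
    "results": [
      {"alternatives": [{"transcript": "okay so",
                         "words": [
                           {"startTime": "1.400s", "endTime": "1.800s", "word": "okay"},
                           {"startTime": "1.800s", "endTime": "2.300s", "word": "so"}
                         ]}]},
      {"alternatives": [{"transcript": "so so"}]}
    ]
  }
}
""")

# Hypothetical bucket path, shown only to illustrate the request shape.
request_body = build_request("gs://my-bucket/recording.flac")

for word, start, end in word_timestamps(sample):
    print(f"{start:6.2f} {end:6.2f}  {word}")
```

Note that the words returned here come from the recognizer's own transcript, so mapping them onto your reference transcript would still be a separate step.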