Is Speech-to-Text voice training data sampled at 48 kHz still good for improving recognition of 16 kHz speech?

Question

We are training our Azure Cognitive Services Custom Speech model using data recorded in .wav (RIFF) format at 16-bit, 16 kHz, as per the documentation.

However, we have obtained a dataset of speech recorded at 48 kHz and encoded as MP3. Speech Studio seems able to train the service on this data without problems. Will training on the higher sample rate only help when recognising streamed audio that is also at the higher rate, or does the sample rate not matter?

GiftA-MSFT · Accepted Answer · 2020-09-11T16:31:01

A higher sample rate like the one you describe is desirable for audio quality, but it generally won't influence speech recognition. As long as your audio meets the minimum audio format requirements, speech recognition should work just fine.
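If you want to confirm which of your WAV files already match the 16-bit, 16 kHz format mentioned in the question, a rough sketch using only Python's standard `wave` module might look like the following. The helper names (`describe_wav`, `matches_target`, `make_silent_wav`) are hypothetical and not part of any Azure SDK, and the `wave` module reads PCM WAV only, so the MP3 files would have to be decoded with a separate tool first.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_target(source, rate_hz=16000, bits=16):
    """True if the WAV has the given sample rate and bit depth.

    The defaults are the 16 kHz / 16-bit values from the question,
    not values read from any service.
    """
    rate, depth, _channels = describe_wav(source)
    return rate == rate_hz and depth == bits


def make_silent_wav(rate_hz, bits=16, channels=1, seconds=0.1):
    """Build a short silent WAV in memory, purely for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate_hz)
        n_frames = int(rate_hz * seconds)
        wav.writeframes(b"\x00" * n_frames * channels * (bits // 8))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for rate in (16000, 48000):
        sample = make_silent_wav(rate)
        print(rate, describe_wav(sample))
        sample.seek(0)
        print("matches 16 kHz / 16-bit:", matches_target(sample))
```

In practice you would pass real file paths to `matches_target` instead of the in-memory samples; files that do not match are not necessarily unusable, as the answer above notes, they simply differ from the format you are already training on.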
