2
votes

Is there any way to algorithmically determine audio quality from a .wav or .mp3 file?

Basically, I have users with diverse recording setups (they are all over the world and I have no control over their equipment) recording audio to mp3/wav files. The software should then determine whether their setup is okay. (Tragically, for some reason they cannot make this determination just by listening to their own recordings, so we occasionally get recordings that are basically impossible to understand because of low volume or high noise.)

I was doing a volume check to make sure the microphone level was okay; unfortunately this misses cases where the volume is high but the clarity is low. I'm wondering if there is some kind of standard scan I can do (ideally in Python) that detects when there is a lot of background noise.

I realize one possible solution is to ask them to record total silence and then compare to the spoken recording and consider the audio "bad" if the volume of the "silent" recording is too close to the volume of the spoken recording. But that depends on getting a good sample from the speaker both times, which may or may not be something I can depend on.
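For reference, the silence-versus-speech comparison described above could be sketched roughly like this. It assumes both recordings are already decoded into NumPy float arrays scaled to [-1, 1]; the function names and the 15 dB threshold are placeholders made up for illustration, not established values.

```python
import numpy as np

def rms_db(samples):
    """RMS level in dB relative to full scale, for float samples in [-1, 1]."""
    samples = np.asarray(samples, dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    # Guard against log10(0) for a perfectly silent (all-zero) recording.
    return 20.0 * np.log10(max(rms, 1e-12))

def speech_to_silence_db(speech, silence):
    """How many dB louder the spoken recording is than the 'silent' one."""
    return rms_db(speech) - rms_db(silence)

def is_too_noisy(speech, silence, min_gap_db=15.0):
    # min_gap_db is an assumed threshold; tune it on known good/bad recordings.
    return speech_to_silence_db(speech, silence) < min_gap_db
```

As noted, this only works if the "silent" sample really is silent and the spoken sample really contains speech.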

So I'm wondering if instead there's just a way to scan through an audio file (these would be ~10 seconds long) and recognize whether the sound file is "noisy" or clear.


3 Answers

2
votes

I am building an API that aims to detect various kinds of bad audio. You can use this API to compute an overall score and also give specific recommendations to people on how to improve their sound quality. Have a look:
https://www.tinydrop.tech/documentation/#loudness-detection

1
votes

It all depends on what your quality problems are, which is not 100% clear from your question, but here are some suggestions:

In the case where volume is high and clarity is low, I'm guessing the problem is that the user has the input gain set too high. After the recording, you can simply check for distortion (for example, runs of samples clipped at full scale). Even better, you can use Automatic Gain Control (AGC) during recording to prevent this from happening in the first place.
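One simple way to check for distortion after the fact is to count samples sitting at or near full scale. This is only a sketch: it assumes float samples scaled to [-1, 1], and the 0.99 threshold and the function name are arbitrary choices for illustration. It catches hard clipping only, not softer kinds of distortion.

```python
import numpy as np

def clipped_fraction(samples, threshold=0.99):
    """Fraction of samples whose magnitude is at or above the threshold."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size == 0:
        return 0.0
    # A cleanly recorded signal should almost never touch full scale.
    return float(np.mean(np.abs(samples) >= threshold))
```

A recording where more than a small fraction of samples are clipped (the exact cutoff would need tuning on real data) is probably overdriven.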

In the case of too much noise, I'm assuming the issue is that the speaker is too far from the mic. In this case Steve's suggestion might work, but to make it really work, you'd need to do a ton of work comparing sample recordings and developing statistics to see how you can discriminate. In practice, I think this is too much work. A simpler alternative that I think is more likely to work (though not guaranteed) would be to create an envelope of your signal, build a histogram from that, and see how the histogram compares to those of existing good and bad recordings. If we are talking about speech only, you could divide the signal into three frequency bands (with a time-domain filter, not an FFT) to give you an idea of how much is noise (the high and low bands) and how much is sound you care about (the center band).
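Here is a rough sketch of both ideas using NumPy and SciPy. Everything specific in it is an assumption on my part: the 20 ms smoothing window, the -80 dB floor, the 300 Hz and 3400 Hz band edges (a common telephone-speech range), the filter order, and the function names. It expects float samples in [-1, 1] and a sample rate comfortably above twice the upper band edge.

```python
import numpy as np
from scipy import signal

def envelope_histogram(samples, fs, window_s=0.02, bins=30, floor_db=-80.0):
    """Normalized histogram of the smoothed amplitude envelope, in dBFS."""
    samples = np.asarray(samples, dtype=np.float64)
    win = max(1, int(window_s * fs))
    # Rectify, then smooth with a moving average to get a crude envelope.
    envelope = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    env_db = 20.0 * np.log10(np.maximum(envelope, 10 ** (floor_db / 20.0)))
    env_db = np.clip(env_db, floor_db, 0.0)
    counts, edges = np.histogram(env_db, bins=bins, range=(floor_db, 0.0))
    return counts / counts.sum(), edges

def band_energy_fractions(samples, fs, low_hz=300.0, high_hz=3400.0, order=4):
    """Share of energy below, inside, and above the assumed speech band."""
    samples = np.asarray(samples, dtype=np.float64)
    # Time-domain Butterworth filters (second-order sections), not an FFT.
    filters = {
        "low": signal.butter(order, low_hz, btype="lowpass", fs=fs, output="sos"),
        "mid": signal.butter(order, [low_hz, high_hz], btype="bandpass",
                             fs=fs, output="sos"),
        "high": signal.butter(order, high_hz, btype="highpass", fs=fs, output="sos"),
    }
    energy = {name: float(np.sum(signal.sosfilt(sos, samples) ** 2))
              for name, sos in filters.items()}
    total = sum(energy.values()) or 1.0  # avoid dividing by zero on silence
    return {name: e / total for name, e in energy.items()}
```

The histogram would then be compared against histograms from recordings you already know are good or bad, and a low "mid" fraction would hint that much of the energy is outside the speech band.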

Again, though, I would use an AGC during recording, and if the AGC finds it needs to set the input gain too high, it's probably a bad recording.

0
votes

Not quite my field, but I suspect that if you compute a spectrum (with a Fourier transform, maybe) and compare "good" and "noisy" recordings, you will find that the noise contributes a level across the spectrum that is higher in the bad recordings than in the good ones. Take a look at the signal processing section of SciPy (scipy.signal); it can probably help.
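Something along these lines might be a starting point. It uses `scipy.signal.welch` to estimate the power spectrum of each recording and reports the median difference in dB across frequency bins. The helper names and the 1024-sample segment length are my own choices, and it assumes both recordings are float arrays at the same sample rate with roughly comparable speech levels.

```python
import numpy as np
from scipy import signal

def psd_db(samples, fs, nperseg=1024):
    """Welch power spectral density estimate, in dB."""
    samples = np.asarray(samples, dtype=np.float64)
    freqs, pxx = signal.welch(samples, fs=fs, nperseg=nperseg)
    return freqs, 10.0 * np.log10(np.maximum(pxx, 1e-20))

def median_level_difference_db(candidate, reference, fs, nperseg=1024):
    """Median over frequency of (candidate PSD - reference PSD), in dB.

    A clearly positive value suggests the candidate has a higher broadband
    level, which may indicate more noise than the known-good reference.
    """
    _, cand_db = psd_db(candidate, fs, nperseg)
    _, ref_db = psd_db(reference, fs, nperseg)
    return float(np.median(cand_db - ref_db))
```

The median is used so that a few strong speech peaks do not dominate the comparison, but that is a judgment call rather than a standard method.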