The situation
I am using Voice Activity Detection (VAD) from WebRTC via WebRTC-VAD, a Python adapter. The example implementation from the GitHub repo uses Python's wave module to read PCM data from files. Note that, according to the comments, the module only works with mono audio and a sampling rate of 8000, 16000 or 32000 Hz.
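For reference, the basic API looks roughly like this (a minimal sketch assuming the webrtcvad package; it accepts 16-bit mono PCM bytes in frames of 10, 20 or 30 ms):

import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness from 0 (least) to 3 (most aggressive)
rate = 16000            # mono; 8000, 16000 or 32000 Hz
frame = b'\x00\x00' * int(rate * 30 / 1000)  # 30 ms of silence as 16-bit PCM bytes
print(vad.is_speech(frame, rate))            # pure silence should yield False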
What I want to do
Read audio data from arbitrary audio files (MP3 and WAV files) with different sampling rates, convert them into the PCM representation that WebRTC-VAD uses, apply WebRTC-VAD to detect voice activity, and finally process the result by producing NumPy arrays again from the PCM data, because those are easiest to work with in Librosa.
My problem
The WebRTC-VAD module only works correctly with data read through the wave module, which returns PCM data as bytes objects. It does not work when fed NumPy arrays such as those obtained with librosa.load(...). I have not found a way to convert between the two representations.
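To illustrate the mismatch (a minimal sketch; the file name is made up):

import librosa

# librosa yields a float32 NumPy array with samples in [-1.0, 1.0] ...
y, sr = librosa.load('speech.wav', sr=16000, mono=True)
print(type(y), y.dtype)  # <class 'numpy.ndarray'> float32

# ... but WebRTC-VAD expects 16-bit PCM packed into a bytes object,
# so passing y to vad.is_speech(...) does not work.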
What I have done so far
I have written the following functions to read audio data from audio files and automatically convert them:
Generic function to read/convert any audio data with Librosa (--> returns a NumPy array):
import librosa

def read_audio(file_path, sample_rate=None, mono=False):
    return librosa.load(file_path, sr=sample_rate, mono=mono)
Functions to read arbitrary data as PCM data (--> returns bytes):
import wave
from os import remove

import soundfile as sf

def read_audio_vad(file_path):
    # resample to 16 kHz mono with Librosa, then round-trip through a
    # temporary PCM_16 WAV file to obtain the bytes representation
    audio, rate = librosa.load(file_path, sr=16000, mono=True)
    tmp_file = 'tmp.wav'
    sf.write(tmp_file, audio, rate, subtype='PCM_16')
    audio, rate = read_pcm16_wave(tmp_file)
    remove(tmp_file)
    return audio, rate

def read_pcm16_wave(file_path):
    with wave.open(file_path, 'rb') as wf:
        sample_rate = wf.getframerate()
        pcm_data = wf.readframes(wf.getnframes())
    return pcm_data, sample_rate
As you can see, I am making a detour by reading/converting the audio data with Librosa first. This is needed so that I can read MP3 files or WAV files with arbitrary encodings and automatically resample them to 16 kHz mono with Librosa. I then write to a temporary file. Before deleting the file, I read its contents back, this time using the wave module, which gives me the PCM data.
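Used like this (hypothetical file name), it returns raw bytes rather than an ndarray:

pcm_data, sample_rate = read_audio_vad('speech.mp3')
print(type(pcm_data), sample_rate)  # <class 'bytes'> 16000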
I now have the following code to extract the voice activity and produce NumPy arrays:
def webrtc_voice(audio, rate):
    voiced_frames = webrtc_split(audio, rate)
    tmp_file = 'tmp.wav'
    for frames in voiced_frames:
        voice_audio = b''.join([f.bytes for f in frames])
        write_pcm16_wave(tmp_file, voice_audio, rate)
        voice_audio, rate = read_audio(tmp_file)
        remove(tmp_file)
        # frame timestamps/durations are in seconds (see generate_frames)
        start_time = frames[0].timestamp
        end_time = frames[-1].timestamp + frames[-1].duration
        start_frame = int(round(start_time * rate))
        end_frame = int(round(end_time * rate))
        yield voice_audio, rate, start_frame, end_frame

def write_pcm16_wave(path, audio, sample_rate):
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(audio)
As you can see, I am again taking a detour over a temporary file: I first write the PCM data, then read the temporary file back with Librosa to get a NumPy array. The webrtc_split function is taken from the example implementation with only a few minor changes. For completeness' sake I am posting it here:
import collections

from webrtcvad import Vad

def webrtc_split(audio, rate, aggressiveness=3, frame_duration_ms=30, padding_duration_ms=300):
    vad = Vad(aggressiveness)
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False

    voiced_frames = []
    for frame in generate_frames(audio, rate):
        is_speech = vad.is_speech(frame.bytes, rate)

        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                for f, s in ring_buffer:
                    voiced_frames.append(f)
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield voiced_frames
                ring_buffer.clear()
                voiced_frames = []

    if voiced_frames:
        yield voiced_frames
class Frame(object):
    """
    Object holding the audio signal of a fixed time interval (30 ms) inside a longer audio signal.
    """

    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration
def generate_frames(audio, sample_rate, frame_duration_ms=30):
    # frame length in bytes: 2 bytes per sample for 16-bit PCM
    frame_length = int(sample_rate * frame_duration_ms / 1000) * 2
    offset = 0
    timestamp = 0.0
    duration = float(frame_length) / (2 * sample_rate)  # frame duration in seconds
    while offset + frame_length < len(audio):
        yield Frame(audio[offset:offset + frame_length], timestamp, duration)
        timestamp += duration
        offset += frame_length
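For context, a hypothetical end-to-end run (file name made up) looks like this:

pcm_data, sample_rate = read_audio_vad('speech.mp3')
for voice_audio, rate, start_frame, end_frame in webrtc_voice(pcm_data, sample_rate):
    duration = (end_frame - start_frame) / rate
    print('voiced segment of %.2fs at samples %d-%d' % (duration, start_frame, end_frame))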
My question
My implementation, which writes/reads temporary files with the wave module and reads/writes those files with Librosa to get NumPy arrays, seems overly complicated to me. However, despite spending a whole day on the matter, I have not found a way to convert directly between the two encodings. I admit I don't fully understand all the details of PCM and WAVE files, the impact of using 16/24/32-bit PCM data, or endianness. I hope my explanations above are detailed enough without being too long. Is there an easier way to convert between the two representations in memory?