3 votes

I am trying to combine speech recognition and speaker diarization techniques to identify how many speakers are present in a conversation and which speaker said what.

For this I am using CMU Sphinx and LIUM Speaker Diarization.

I am able to run these two tools separately i.e. I can run Sphinx 4 and get text output from audio and run LIUM toolkit and get audio segments.

Now I want to combine these two and get output something like below :

s0: this is my first sentence.
s1: this is my reply.
s2: i do not know what you are talking about

Does anyone know how to combine these two toolkits?


2 Answers

5 votes

Run the diarization tool to get segment times for each speaker. The output looks like this:

file1 1 16105 217 M S U S9_file1
file1 1 16322 1908 M S U S9_file1
file2 1 18232 603 M S U S9_file2

The numbers like 16105 and 217 are the segment start and segment length. Parse the text output and store the times in an array.
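A minimal parsing sketch in Python, assuming the standard LIUM .seg column order (show, channel, start, length, gender, band, environment, speaker) and that start/length are counted in 10 ms frames; the file name is only a placeholder:

# Parse LIUM .seg output into a list of segments.
def parse_seg(path):
    segments = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;"):   # skip comments and blank lines
                continue
            fields = line.split()
            segments.append({
                "show": fields[0],
                "start": int(fields[2]),    # start time in 10 ms frames
                "length": int(fields[3]),   # duration in 10 ms frames
                "speaker": fields[7],
            })
    return segments

segments = parse_seg("file1.seg")   # placeholder path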

Then split the original audio into segments using those times.
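One way to do the splitting, sketched here with pydub (an assumption, not something the answer prescribes), converting the 10 ms frame counts to milliseconds:

import os
from pydub import AudioSegment

# Cut the original recording into one WAV clip per diarization segment.
def split_audio(wav_path, segments, out_dir="clips"):
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_wav(wav_path)
    clips = []
    for i, seg in enumerate(segments):
        start_ms = seg["start"] * 10            # 10 ms frames -> milliseconds
        end_ms = start_ms + seg["length"] * 10
        clip_path = os.path.join(out_dir, "%03d_%s.wav" % (i, seg["speaker"]))
        audio[start_ms:end_ms].export(clip_path, format="wav")
        clips.append((seg["speaker"], clip_path))
    return clips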

Process each segment separately with Sphinx4 and display the transcription.
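The answer refers to Sphinx4, which is Java; as a rough Python stand-in, here is a sketch using pocketsphinx (the Python binding for CMU Sphinx) to transcribe each clip produced above and print speaker-labelled lines. It assumes pocketsphinx 5.x, where Decoder() loads the bundled US English model, and clips that are 16 kHz, 16-bit, mono PCM WAV:

from pocketsphinx import Decoder

# Transcribe each clip and print "speaker: text" lines.
def transcribe_clips(clips):
    decoder = Decoder()                 # pocketsphinx 5.x: default US English model
    for speaker, clip_path in clips:
        with open(clip_path, "rb") as f:
            f.read(44)                  # crude skip of the WAV header
            pcm = f.read()
        decoder.start_utt()
        decoder.process_raw(pcm, False, True)
        decoder.end_utt()
        hyp = decoder.hyp()
        print("%s: %s" % (speaker, hyp.hypstr if hyp else ""))

transcribe_clips(split_audio("file1.wav", parse_seg("file1.seg")))

Chained together with the earlier sketches, this produces speaker-labelled lines like the s0 / s1 output the question asks for.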

Optionally, run speaker adaptation on each speaker's segments and process each segment again with the speaker-adapted model.

0 votes

If you are able to go back and change the recording setup, you could record each speaker on a separate channel and then analyse each channel individually. This is a common approach in phone-call analysis.

You can achieve this with Google Speech-to-Text (using the Python client), enabling separate recognition per channel (enable_separate_recognition_per_channel=True) and speaker diarization (enable_speaker_diarization=True).
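A minimal sketch of that configuration with the google-cloud-speech Python client, assuming a short 16 kHz stereo LINEAR16 WAV with one speaker per channel (the file name is a placeholder):

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

with open("stereo_call.wav", "rb") as f:      # placeholder stereo recording
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,  # one result per channel
    enable_speaker_diarization=True,               # tag words by speaker
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print("channel %d: %s" % (result.channel_tag,
                              result.alternatives[0].transcript))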