Speaker Diarization when using Python Speech Recognition

Question

Is there an option to diarize the output when using the speech_recognition package in Python?

I would appreciate any advice on how to do this, or on whether it is possible at all.

I would also appreciate advice on writing the result to a text file, with line breaks between each new speaker.

import speech_recognition as sr
from os import path

# The WAV file is expected to sit next to this script.
audio_file = path.join(path.dirname(path.realpath(__file__)), "RobertP.wav")

r = sr.Recognizer()
with sr.AudioFile(audio_file) as source:
    audio = r.record(source)

try:
    txt = r.recognize_google(audio, show_all=True)
except (sr.UnknownValueError, sr.RequestError) as e:
    print("Didn't work:", e)
    txt = ""  # keep txt defined so the write below does not fail

# Write the raw recognition result to a text file.
with open("tester.txt", "w") as f:
    f.write(str(txt))

Note: apologies for my novice-ness.

itroulli · Accepted Answer · 2019-12-10T09:47:55

Speaker diarization is currently in beta in the Google Speech-to-Text API, and the feature is described in the API documentation. The output can be handled in many ways. The following is an example (based on a Medium article):

import io

def transcribe_file_with_diarization(speech_file):
    """Transcribe the given audio file synchronously with diarization."""

    from google.cloud import speech_v1p1beta1 as speech
    client = speech.SpeechClient()

    with io.open(speech_file, 'rb') as audio_file:
        content = audio_file.read()
    audio = {"content": content}

    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16
    sample_rate_hertz=48000
    language_code='en-US'
    enable_speaker_diarization=True
    enable_automatic_punctuation=True
    diarization_speaker_count=4

    config = {
        "encoding": encoding,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "enable_speaker_diarization": enable_speaker_diarization,
        "enable_automatic_punctuation": enable_automatic_punctuation,
        # Optional:
        "diarization_speaker_count": diarization_speaker_count
    }

    print('Waiting for operation to complete...')
    response = client.recognize(config=config, audio=audio)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:

    result = response.results[-1]
    words_info = result.alternatives[0].words

    speaker1_transcript = ""
    speaker2_transcript = ""
    speaker3_transcript = ""
    speaker4_transcript = ""

    # Append each word to the transcript of the speaker it is tagged with:
    for word_info in words_info:
        if word_info.speaker_tag == 1:
            speaker1_transcript = speaker1_transcript + word_info.word + ' '
        elif word_info.speaker_tag == 2:
            speaker2_transcript = speaker2_transcript + word_info.word + ' '
        elif word_info.speaker_tag == 3:
            speaker3_transcript = speaker3_transcript + word_info.word + ' '
        elif word_info.speaker_tag == 4:
            speaker4_transcript = speaker4_transcript + word_info.word + ' '

    # Printing out the output:
    print("speaker1: '{}'".format(speaker1_transcript))
    print("speaker2: '{}'".format(speaker2_transcript))
    print("speaker3: '{}'".format(speaker3_transcript))
    print("speaker4: '{}'".format(speaker4_transcript))
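
For the second part of the question (a text file with a break between each new speaker), one possible approach is to group consecutive words that share a speaker tag into "turns" and join the turns with a blank line. The sketch below is only an illustration: the helper names group_speaker_turns and format_turns are made up for this example, the sample data stands in for the (word, speaker_tag) pairs you would build from words_info above, and it assumes a blank line is the separator you want.

```python
from itertools import groupby
from operator import itemgetter


def group_speaker_turns(tagged_words):
    """Collapse (word, speaker_tag) pairs into consecutive speaker turns.

    Hypothetical helper: a new turn starts whenever the tag changes.
    """
    turns = []
    for tag, group in groupby(tagged_words, key=itemgetter(1)):
        turns.append((tag, " ".join(word for word, _ in group)))
    return turns


def format_turns(turns):
    """Render one paragraph per turn, separated by a blank line."""
    body = "\n\n".join("Speaker {}: {}".format(tag, text) for tag, text in turns)
    return body + "\n"


# Made-up sample data. With the API response above you would instead build
# the list as: [(w.word, w.speaker_tag) for w in words_info]
sample = [("hello", 1), ("there", 1), ("hi", 2),
          ("how", 1), ("are", 1), ("you", 1)]

turns = group_speaker_turns(sample)
text = format_turns(turns)
print(text)
```

To save the result, write the string to a file, for example with open("tester.txt", "w", encoding="utf-8") as f: f.write(text). Grouping by consecutive tag rather than collecting everything per speaker keeps the order of the conversation intact.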
