Convert mediarecorder blobs to a type that google speech to text can transcribe

Question

I am making an app where the user browser records the user speaking and sends it to the server which then passes it on to the Google speech to the text interface. I am using mediaRecorder to get 1-second blobs which are sent to a server. On the server-side, I send these blobs over to the Google speech to the text interface. However, I am getting an empty transcriptions.

I know what the issue is. Mediarecorder's default Mime Type id audio/WebM codec=opus, which is not accepted by google's speech to text API. After doing some research, I realize I need to use ffmpeg to convert blobs to LInear16. However, ffmpeg only accepts audio FILES and I want to be able to convert BLOBS. Then I can send the resulting converted blobs over to the API interface.

server.js

wsserver.on('connection', socket => {
    console.log("Listening on port 3002")
    audio = {
        content: null
    }
  socket.on('message',function(message){
        // const buffer = new Int16Array(message, 0, Math.floor(data.byteLength / 2));
        // console.log(`received from a client: ${new Uint8Array(message)}`);
        // console.log(message);
        audio.content = message.toString('base64')
        console.log(audio.content);
        livetranscriber.createRequest(audio).then(request => {
            livetranscriber.recognizeStream(request);
        });


  });
});

livetranscriber

module.exports = {
    createRequest: function(audio){
        const encoding = 'LINEAR16';
const sampleRateHertz = 16000;
const languageCode = 'en-US';
        return new Promise((resolve, reject, err) =>{
            if (err){
                reject(err)
            }
            else{
                const request = {
                    audio: audio,
                    config: {
                      encoding: encoding,
                      sampleRateHertz: sampleRateHertz,
                      languageCode: languageCode,
                    },
                    interimResults: false, // If you want interim results, set this to true
                  };
                  resolve(request);
            }
        });

    },
    recognizeStream: async function(request){
        const [response] = await client.recognize(request)
        const transcription = response.results
            .map(result => result.alternatives[0].transcript)
            .join('\n');
        console.log(`Transcription: ${transcription}`);
        // console.log(message);
        // message.pipe(recognizeStream);
    },

}

client

 recorder.ondataavailable = function(e) {
            console.log('Data', e.data);

            var ws = new WebSocket('ws://localhost:3002/websocket');
            ws.onopen = function() {
              console.log("opening connection");

              // const stream = websocketStream(ws)
              // const duplex = WebSocket.createWebSocketStream(ws, { encoding: 'utf8' });
              var blob = new Blob(e, { 'type' : 'audio/wav; base64' });
              ws.send(blob.data);
              // e.data).pipe(stream); 
              // console.log(e.data);
              console.log("Sent the message")
            };

            // chunks.push(e.data);
            // socket.emit('data', e.data);
        }

bitnahian bitnahian · Accepted Answer · 2021-01-05T00:07:42

I wrote a similar script several years ago. However, I used a JS frontend and a Python backend instead of NodeJS. I remember using a sox transformer to transform the audio input into to an output that the Google Speech API could use.

Perhaps this might be useful for you.

https://github.com/bitnahian/speech-transcriptor/blob/9f186e5416566aa8a6959fc1363d2e398b902822/app.py#L27

TLDR: Converted from a .wav format to .raw format using ffmpeg and sox.

Convert mediarecorder blobs to a type that google speech to text can transcribe

1 Answers