0 votes

I have developed an application for streaming speech recognition in C++ using two services: another vendor's API and the IBM Watson Speech to Text API.

In both programs, I am using the same file, which contains this audio:

several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday

This file is 641,680 bytes in size, and I am sending chunks of at most 100,000 bytes at a time to the Speech to Text servers.

Now, with the other API the whole file is recognized as one utterance. With the IBM Watson API it is not. Here is what I have done:

  • Connect to IBM Watson web server (Speech to text API)
  • Send start frame {"action":"start","content-type":"audio/mulaw;rate=8000"}
  • Send binary 100,000 bytes
  • Send stop frame {"action":"stop"}
  • ...Repeat the binary and stop frames until the last byte.

The IBM Watson Speech to Text API could only recognize the chunks individually, e.g.:

several tornadoes touch down
a line of severe thunder
swept through Colorado
Sunday

This appears to be the output of the individual chunks: words that fall across a chunk boundary (for example, "thunderstorms" here, which ends one chunk and begins the next) are recognized incorrectly or dropped.

What am I doing wrong?

EDIT (I am using C++ with the Boost.Beast library for the WebSocket interface):

//Do the websocket handshake 
void IbmWebsocketSession::on_ssl_handshake(beast::error_code ec) {

    auto mToken = mSttServiceObject->GetToken(); // Get the authentication token

    //Complete the websocket handshake and call back the "send_start" function
    mWebSocket.async_handshake_ex(mHost, mUrlEndpoint, [mToken](request_type& reqHead) {reqHead.insert(http::field::authorization,mToken);},
            bind(&IbmWebsocketSession::send_start, shared_from_this(), placeholders::_1));
}

//Send the start frame
void IbmWebsocketSession::send_start(beast::error_code ec) {

    //Send the START_FRAME and call back the "read_resp" function to receive the "state: listening" message
    mWebSocket.async_write(net::buffer(START_FRAME),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

//Send the binary data
void IbmWebsocketSession::send_binary(beast::error_code ec) {

    streamsize bytes_read = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size()); //Gets the binary data chunks from a file (which is being written at run time)

    // Send binary data
    if (bytes_read > mcMinsize) {  //Minimum size defined by IBM  is 100 bytes.
                                   // If chunk size is greater than 100 bytes, then send the data and then callback "send_stop" function
        mWebSocket.binary(true);

        /**********************************************************************
         *  Wait a second before writing the next chunk.
         **********************************************************************/
        this_thread::sleep_for(chrono::seconds(1));

        mWebSocket.async_write(net::buffer(&chunk[0], bytes_read),
                bind(&IbmWebsocketSession::send_stop, shared_from_this(), placeholders::_1));
    } else {                     //If chunk size is less than 100 bytes, then DO NOT send the data; just call the "send_stop" function
        shared_from_this()->send_stop(ec);
    }

}

void IbmWebsocketSession::send_stop(beast::error_code ec) {

    mWebSocket.binary(false);
    /*****************************************************************
     * Send the Stop message
     *****************************************************************/
    mWebSocket.async_write(net::buffer(mTextStop),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

void IbmWebsocketSession::read_resp(beast::error_code ec, size_t bytes_transferred) {
    boost::ignore_unused(bytes_transferred);
    if (mWebSocket.is_open()) {
        // Read the websocket response and call back the "display_buffer" function
        mWebSocket.async_read(mBuffer, bind(&IbmWebsocketSession::display_buffer, shared_from_this(), placeholders::_1));
    } else {
        cerr << "Error: " << ec.message() << endl;
    }

}

void IbmWebsocketSession::display_buffer(beast::error_code ec) {

    /*****************************************************************
     * Get the buffer into stringstream
     *****************************************************************/
    msWebsocketResponse << beast::buffers(mBuffer.data());

    mResponseTranscriptIBM = ParseTranscript(); //Parse the response transcript

    mBuffer.consume(mBuffer.size()); //Clear the websocket buffer

    if ("Listening" == mResponseTranscriptIBM && true != mSttServiceObject->IsGstFileWriteDone()) { // IsGstFileWriteDone -> checks if the user has stopped speaking
        shared_from_this()->send_binary(ec);
    } else {
        shared_from_this()->close_websocket(ec, 0);
    }
}
show the code you have used - it's better than just words – data_henrik
@data_henrik Well sure! I can share the code, but I don't think this is a coding issue. My guess is that either this is how IBM's API behaves or I am doing something wrong logically. I will have to make some changes to follow my organization's policies, so the editing may take some time. – RC0993

2 Answers

0 votes

IBM Watson Speech to Text has several APIs to transmit audio and receive transcribed text. Based on your description, you seem to be using the WebSocket interface.

For the WebSocket interface, you open the connection, send the start message, then send the individual chunks of data, and, once everything has been transmitted, stop the recognition request.

Judging from your description and code, you are starting and stopping a recognition request for each chunk. Send the stop message only after the last chunk.

I recommend taking a look at the API documentation, which contains samples in several languages. The Node.js sample shows how to register for events. There are also examples on GitHub, like this WebSocket API with Python, and another one that shows the chunking.
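
A minimal sketch of that flow, reusing the names from your code (send_binary, send_stop, mFilestream, chunk): after each write, chain back into send_binary instead of sending a stop frame. The end-of-file check here is an assumption, and the sketch is untested:

//Send all binary chunks back to back; the stop frame is sent only once, at the end
void IbmWebsocketSession::send_binary(beast::error_code ec) {

    streamsize bytes_read = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size());

    if (bytes_read > 0) {   //More audio available: write it, then call back into send_binary
        mWebSocket.binary(true);
        mWebSocket.async_write(net::buffer(&chunk[0], bytes_read),
                bind(&IbmWebsocketSession::send_binary, shared_from_this(), placeholders::_1));
    } else {                //No more audio: send {"action":"stop"} exactly once
        shared_from_this()->send_stop(ec);
    }
}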

0 votes

@data_henrik is correct, the flow is wrong. It should be: START FRAME >> binary data >> binary data >> binary data >> ... >> STOP FRAME

You only need to send the {"action":"stop"} message when there are no more audio chunks to send.
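
A synchronous sketch of that order, using the names from the question's code (START_FRAME, mTextStop, mFilestream, chunk); error handling and the async plumbing are omitted for clarity:

// START once, then every chunk, then STOP once
mWebSocket.write(net::buffer(START_FRAME));          // {"action":"start","content-type":"audio/mulaw;rate=8000"}

while (streamsize n = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size())) {
    mWebSocket.binary(true);
    mWebSocket.write(net::buffer(&chunk[0], n));     // binary frames only, no stop in between
}

mWebSocket.binary(false);
mWebSocket.write(net::buffer(mTextStop));            // {"action":"stop"} once, at the very end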