1 vote

I'm working on a personal project involving microphones in my apartment that I can issue verbal commands to. To accomplish this, I've been using the Microsoft Speech API, specifically SpeechRecognitionEngine from System.Speech.Recognition in C#. I construct a grammar as follows:

// validCommands is a Choices object containing all valid command strings
// recognizer is a SpeechRecognitionEngine
GrammarBuilder builder = new GrammarBuilder(recognitionSystemName);
builder.Append(validCommands);
recognizer.SetInputToDefaultAudioDevice();
recognizer.LoadGrammar(new Grammar(builder));
recognizer.RecognizeAsync(RecognizeMode.Multiple);

// etc ...

This works pretty well when I actually give it a command; it hasn't misidentified one of my commands yet. Unfortunately, it also tends to pick up random talking as commands! I've tried to ameliorate this by prefacing the command Choices object with a "name" (recognitionSystemName) that I address the system as. Oddly, this doesn't seem to help. Since I'm restricting it to a set of predetermined command phrases, I would have thought it could detect that speech wasn't any of those strings. My best guess is that it's assuming all sound is a command and picking the best match from the command set. Any advice on improving this system so that it no longer triggers on conversation not directed at it would be very helpful.

Edit: I've moved the name recognizer to a separate SpeechRecognitionEngine, but the accuracy is awful. Here's a bit of test code I wrote to examine the accuracy:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Speech.Recognition;

namespace RecognitionAccuracyTest
{
    class RecognitionAccuracyTest
    {
        static int recogcount;
        [STAThread]
        static void Main()
        {
            recogcount = 0;
            System.Console.WriteLine("Beginning speech recognition accuracy test.");

            SpeechRecognitionEngine recognizer;
            recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder("Octavian")));
            recognizer.SpeechHypothesized += new EventHandler<SpeechHypothesizedEventArgs>(recognizer_SpeechHypothesized);
            recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);

            // Keep the process alive while recognition runs asynchronously.
            Console.ReadLine();
        }

        static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            System.Console.WriteLine("Recognized @ " + e.Result.Confidence);
            try
            {
                // Save the audio that triggered this recognition so it can
                // be compared with what was actually said.
                if (e.Result.Audio != null)
                {
                    using (var stream = new System.IO.FileStream("audio" + ++recogcount + ".wav", System.IO.FileMode.Create))
                    {
                        e.Result.Audio.WriteToWaveStream(stream);
                    }
                }
            }
            catch (Exception) { }
        }

        static void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            System.Console.WriteLine("Hypothesized @ " + e.Result.Confidence);
        }
    }
}

If the name is "Octavian", it recognizes stuff like "Octopus", "Octagon", "Volkswagen", and "Wow, really?". I can clearly hear the difference in the associated audio clips. Any ideas on making this not awful would be great.

So how did you end up solving your problem? I am not sure the marked answer really solves it. Can you share anything on what made it better? I seem to be in a similar situation where the recognizer is just recognizing too much, AND with a high confidence rate. – darbid
The marked answer is a marginal improvement. What I did to make it much better was to switch to SRGS grammars and have a <dictation> element as the first item. Then, when I get a result, I compare the first word with my system's name. If it doesn't match, I discard the result. Sometimes I have to repeat myself, but I've virtually eliminated false positives by doing this. – Octavianus
Thank you for taking the time. By <dictation> do you mean something like the MS examples here? msdn.microsoft.com/en-us/library/ms723634%28v=vs.85%29.aspx – darbid
I am pretty sure that my question here is also another alternative approach to solving this issue: stackoverflow.com/questions/18821566/… – darbid
I used the SrgsRuleRef class and SrgsDocument objects to generate my grammars. SrgsRuleRef.Dictation represents the dictation element. Unfortunately, documentation on this is nonexistent, so I haven't figured out how to limit it to one word of dictation, but it seems to work fairly well regardless (rough sketch below). – Octavianus
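
A minimal sketch of the SrgsDocument approach described in these comments, with placeholder command phrases (the exact grammar from the project isn't shown here):

using System;
using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

// Root rule: free dictation first (intended to capture the system's name),
// then one of the predetermined command phrases.
SrgsRule root = new SrgsRule("command");
root.Add(SrgsRuleRef.Dictation);
root.Add(new SrgsOneOf("turn on the lights", "turn off the lights")); // placeholders

SrgsDocument doc = new SrgsDocument();
doc.Rules.Add(root);
doc.Root = root;

SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine(
    new System.Globalization.CultureInfo("en-US"));
recognizer.SetInputToDefaultAudioDevice();
recognizer.LoadGrammar(new Grammar(doc));
recognizer.SpeechRecognized += (s, e) =>
{
    // Discard results whose first word isn't the system's name.
    string[] words = e.Result.Text.Split(' ');
    if (words.Length < 2 ||
        !words[0].Equals("Octavian", StringComparison.OrdinalIgnoreCase))
        return;
    Console.WriteLine("Command: " + string.Join(" ", words, 1, words.Length - 1));
};
recognizer.RecognizeAsync(RecognizeMode.Multiple);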

4 Answers

2 votes

Let me make sure I understand: you want a phrase that marks commands as directed to the system, like "Butler" or "Siri". So you'd say "Butler, turn on the TV". You can build this into your grammar.

Here is an example of a simple grammar that requires an opening phrase before it recognizes a command. It uses semantic results to help you understand what was said. In this case the user must say "Open", "Please open", or "Can you open".

    private Grammar CreateTestGrammar()
    {
        // item
        Choices item = new Choices();
        SemanticResultValue itemSRV;
        itemSRV = new SemanticResultValue("I E", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("explorer", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("firefox", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("mozilla", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("chrome", "chrome");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("google chrome", "chrome");
        item.Add(itemSRV);
        SemanticResultKey itemSemKey = new SemanticResultKey("item", item);

        // wrap the semantic key in a GrammarBuilder
        GrammarBuilder gb = new GrammarBuilder();
        gb.Append(itemSemKey);

        // now build the complete pattern:
        // preamble ("Can you open" / "Open" / "Please open") + item
        GrammarBuilder itemRequest = new GrammarBuilder();
        itemRequest.Append(new Choices("Can you open", "Open", "Please open"));

        itemRequest.Append(gb);

        Grammar TestGrammar = new Grammar(itemRequest);
        return TestGrammar;
    }

You can then process the speech with something like:

RecognitionResult result = myRecognizer.Recognize();

and check for semantic results like:

if(result.Semantics.ContainsKey("item"))
{
   string s = (string)result.Semantics["item"].Value;
}
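
Putting that together, a hypothetical end-to-end usage (the recognizer setup and the confidence threshold are illustrative, not part of the original answer):

SpeechRecognitionEngine myRecognizer = new SpeechRecognitionEngine(
    new System.Globalization.CultureInfo("en-US"));
myRecognizer.SetInputToDefaultAudioDevice();
myRecognizer.LoadGrammar(CreateTestGrammar());

// Recognize() blocks until the engine matches the grammar (or gives up).
RecognitionResult result = myRecognizer.Recognize();
if (result != null &&
    result.Confidence >= 0.85f &&          // illustrative threshold
    result.Semantics.ContainsKey("item"))
{
    string item = (string)result.Semantics["item"].Value;
    Console.WriteLine("Open request for: " + item);
}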
1 vote

I have the same problem. I'm using the Microsoft Speech Platform, so accuracy etc. could be a little different.

I'm using "Claire" as a wake-up command, but it's true that it recognizes different words as "Claire" too. The problem is that the engine hears you speak and searches for the closest match.

I haven't found a really good solution to this. You could try filtering the recognized speech by the Confidence field, but that's not very reliable with my chosen recognition engine. I just throw every word that I want to recognize into one big SRGS.xml and set the repeat value to "0-". I only accept the recognized sentence if "Claire" is the first word. This solution is not what I want, as it doesn't work as well as I'd wish, but it's still a small improvement.

I'm currently busy with it, and I will post more info as I progress.

EDIT 1: As a comment to what Dims says: it's possible in an SRGS grammar to add a "GARBAGE" rule. You might want to look into that. http://www.w3.org/TR/speech-grammar/
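
For illustration, a minimal SRGS grammar using the spec's special GARBAGE rule (the wake word and commands are placeholders, not the grammar from my project):

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="command"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="command" scope="public">
    <!-- GARBAGE matches arbitrary speech, so surrounding chatter has
         somewhere to go instead of being forced onto a command word -->
    <item repeat="0-1"><ruleref special="GARBAGE"/></item>
    <item>Claire</item>
    <one-of>
      <item>turn on the lights</item>
      <item>turn off the lights</item>
    </one-of>
  </rule>
</grammar>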

0 votes

In principle, you need to update either the grammar or the dictionary to include "empty" or "anything" entries.
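
One way to approximate an "anything" entry with System.Speech, as a sketch (the grammar name "garbage" is arbitrary): load a wildcard grammar alongside the command grammar, then discard results it produced.

// Catch-all grammar: AppendWildcard() matches arbitrary speech, giving the
// engine an alternative to force-matching chatter onto a command phrase.
GrammarBuilder catchAll = new GrammarBuilder();
catchAll.AppendWildcard();
Grammar garbageGrammar = new Grammar(catchAll) { Name = "garbage" };
recognizer.LoadGrammar(garbageGrammar);

// In the SpeechRecognized handler, drop anything the catch-all matched:
// if (e.Result.Grammar.Name == "garbage") return;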

0 votes

Is it possible that you just need to run UnloadAllGrammars() prior to creating/loading the grammar that you want to use?
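
For what it's worth, a minimal illustration (builder is the GrammarBuilder from the question):

recognizer.UnloadAllGrammars();               // drop any previously loaded grammars
recognizer.LoadGrammar(new Grammar(builder)); // then load only the command grammar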