I'm working on a personal project involving microphones in my apartment that I can issue verbal commands to. To accomplish this, I've been using the Microsoft Speech API, specifically the SpeechRecognitionEngine class from System.Speech.Recognition in C#. I construct a grammar as follows:
// validCommands is a Choices object containing all valid command strings
// recognizer is a SpeechRecognitionEngine
// recognitionSystemName is the "name" string I address the system by
GrammarBuilder builder = new GrammarBuilder(recognitionSystemName);
builder.Append(validCommands);
recognizer.SetInputToDefaultAudioDevice();
recognizer.LoadGrammar(new Grammar(builder));
recognizer.RecognizeAsync(RecognizeMode.Multiple);
// etc ...
This works pretty well when I actually give it a command; it hasn't misidentified one of my commands yet. Unfortunately, it also tends to pick up random conversation as commands. I've tried to mitigate this by prefacing the command Choices object with a "name" (recognitionSystemName) that I address the system by, but oddly that doesn't seem to help. Since I'm restricting it to a set of predetermined command phrases, I would have expected it to detect when speech doesn't match any of the strings. My best guess is that it assumes all sound is a command and picks the best match from the command set. Any advice on improving this system so that it no longer triggers on conversation not directed at it would be very helpful.
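One mitigation I've started sketching out (no idea yet whether it's enough) is to ignore results below a confidence cutoff and to watch the SpeechRecognitionRejected event to see what the engine is already throwing away. Assuming the recognizer from the snippet above, and with 0.85 being an arbitrary number I picked rather than anything documented:
// Only act on matches the engine is fairly confident about.
// The 0.85 cutoff is a guess; it needs tuning against real audio.
recognizer.SpeechRecognized += (sender, e) =>
{
    if (e.Result.Confidence < 0.85f)
    {
        Console.WriteLine("Ignoring low-confidence match: " + e.Result.Text + " @ " + e.Result.Confidence);
        return;
    }
    Console.WriteLine("Accepted command: " + e.Result.Text);
    // dispatch the command here
};
// Raised when nothing in the grammar fits well; handy for seeing how much
// background chatter the engine is already discarding on its own.
recognizer.SpeechRecognitionRejected += (sender, e) =>
{
    Console.WriteLine("Rejected: " + e.Result.Text + " @ " + e.Result.Confidence);
};
I'm not sure a plain threshold will be enough on its own, which is what led to the separate name recognizer below.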
Edit: I've moved the name recognizer to a separate SpeechRecognitionEngine, but the accuracy is awful. Here's a bit of test code I wrote to examine the accuracy:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Speech.Recognition;

namespace RecognitionAccuracyTest
{
    class RecognitionAccuracyTest
    {
        static int recogcount;

        [STAThread]
        static void Main()
        {
            recogcount = 0;
            Console.WriteLine("Beginning speech recognition accuracy test.");

            // Recognizer loaded with a single-phrase grammar: just the name "Octavian".
            SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder("Octavian")));
            recognizer.SpeechHypothesized += recognizer_SpeechHypothesized;
            recognizer.SpeechRecognized += recognizer_SpeechRecognized;
            recognizer.RecognizeAsync(RecognizeMode.Multiple);

            // Keep the process alive while recognition runs asynchronously.
            while (true) ;
        }

        static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            Console.WriteLine("Recognized @ " + e.Result.Confidence);
            try
            {
                // Save the audio that triggered the recognition so I can listen to it later.
                if (e.Result.Audio != null)
                {
                    System.IO.FileStream stream = new System.IO.FileStream("audio" + ++recogcount + ".wav", System.IO.FileMode.Create);
                    e.Result.Audio.WriteToWaveStream(stream);
                    stream.Close();
                }
            }
            catch (Exception) { }
        }

        static void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            Console.WriteLine("Hypothesized @ " + e.Result.Confidence);
        }
    }
}
If the name is "Octavian", it recognizes stuff like "Octopus", "Octagon", "Volkswagen", and "Wow, really?". I can clearly hear the difference in the associated audio clips. Any ideas on making this not awful would be great.
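One idea I'm planning to try next, based on my guess above that the engine forces every sound onto the closest grammar entry: load a DictationGrammar alongside the keyword grammar so arbitrary speech has somewhere else to go, and only react when the result actually came from the keyword grammar with decent confidence. This is just a sketch of what I have in mind, not something I've verified works; the grammar names and the 0.85 cutoff are values I made up:
using System;
using System.Speech.Recognition;

class KeywordWithGarbageModel
{
    [STAThread]
    static void Main()
    {
        var recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
        recognizer.SetInputToDefaultAudioDevice();

        // The only phrase I actually care about: the system's name.
        var keywordGrammar = new Grammar(new GrammarBuilder("Octavian")) { Name = "keyword" };
        recognizer.LoadGrammar(keywordGrammar);

        // Free dictation acts as a "garbage model": random conversation should
        // match here instead of being shoehorned into "Octavian".
        recognizer.LoadGrammar(new DictationGrammar() { Name = "dictation" });

        recognizer.SpeechRecognized += (sender, e) =>
        {
            // Only treat it as the name if the keyword grammar produced the match
            // and the confidence is reasonably high (0.85 is an arbitrary guess).
            if (e.Result.Grammar.Name == "keyword" && e.Result.Confidence >= 0.85f)
                Console.WriteLine("Heard the name @ " + e.Result.Confidence);
            else
                Console.WriteLine("Ignored: \"" + e.Result.Text + "\" via " + e.Result.Grammar.Name);
        };

        recognizer.RecognizeAsync(RecognizeMode.Multiple);
        Console.ReadLine();
    }
}
Would something like that be the right direction, or is there a better-supported way to do keyword spotting with System.Speech?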