
I'm trying to use the SpeechRecognizer with a custom Grammar to handle the following pattern:

"Can you open {item}?" where {item} uses DictationGrammar.

I'm using the speech engine built into Vista and .NET 4.0.

I would like to be able to get the confidences for the SemanticValues returned. See example below.


If I simply use "recognizer.LoadGrammar( new DictationGrammar() )", I can browse through e.Result.Alternates and view the confidence values of each alternate. That works if DictationGrammar is at the top level.
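Simplified, that looks something like this (using the shared recognizer):

// using System.Speech.Recognition;
SpeechRecognizer recognizer = new SpeechRecognizer();
recognizer.LoadGrammar(new DictationGrammar());
recognizer.SpeechRecognized += (s, e) =>
{
    // at the top level, each alternate carries its own confidence
    foreach (RecognizedPhrase alternate in e.Result.Alternates)
        Console.WriteLine("{0} {1:F2}", alternate.Text, alternate.Confidence);
};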

Made up example:

  • Can you open Firefox? .95
  • Can you open Fairfax? .93
  • Can you open file fax? .72
  • Can you pen Firefox? .85
  • Can you pin Fairfax? .63

But if I build a grammar that looks for "Can you open {semanticValue Key='item' GrammarBuilder=new DictationGrammar()}?", then I get this:

  • Can you open Firefox? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
  • Can you open Fairfax? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
  • Can you open file fax? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
  • Can you pen Firefox? .85 - Semantics = null
  • Can you pin Fairfax? .63 - Semantics = null

The .91 shows me how confident it is that the pattern "Can you open {item}?" was matched, but it doesn't distinguish any further.

However, if I then look at each alternate's Semantics.Where( s => s.Key == "item" ) and view their Confidence, I get this:

  • Firefox 1.0
  • Fairfax 1.0
  • file fax 1.0

Which doesn't help me much.

What I really want is something like this when I view the Confidence of the matching SemanticValues:

  • Firefox .95
  • Fairfax .93
  • file fax .85

It seems like it should work that way...

Am I doing something wrong? Is there even a way to do that within the Speech framework?


I'm hoping there's some inbuilt mechanism so that I can do it the "right" way.

As for another approach that will probably work...

  1. Use the SemanticValue approach to match on the pattern
  2. For anything that matches on that pattern, extract the raw Audio for {item} (use RecognitionResult.Words and RecognitionResult.GetAudioForWordRange)
  3. Run the raw audio for {item} through a SpeechRecognizer with the DictationGrammar to get the Confidence

... but that's more processing than I really want to do.
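Sketched out, that fallback might look roughly like this (the word-index handling is made up; it assumes the {item} words are the tail of the phrase):

// using System.IO; using System.Speech.Recognition;
private float GetItemConfidence(RecognitionResult result, int firstItemWordIndex)
{
    // clip out just the audio that covers the {item} words
    RecognizedAudio itemAudio = result.GetAudioForWordRange(
        result.Words[firstItemWordIndex],
        result.Words[result.Words.Count - 1]);

    using (MemoryStream stream = new MemoryStream())
    using (SpeechRecognitionEngine dictation = new SpeechRecognitionEngine())
    {
        itemAudio.WriteToWaveStream(stream);
        stream.Position = 0;

        // re-recognize just the {item} audio with dictation at the top level,
        // where the result carries a real confidence value
        dictation.LoadGrammar(new DictationGrammar());
        dictation.SetInputToWaveStream(stream);

        RecognitionResult itemResult = dictation.Recognize();
        return itemResult != null ? itemResult.Confidence : 0f;
    }
}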

How many different items do you want to support? Is it a known, finite list? Using the dictation grammar means that you don't know what someone might want to open, so you are preparing for them to say anything: "Can you open Foobar", "Can you open Ice Cream". The recognizer can't give you semantic analysis on the {item} because the dictation grammar doesn't define any. You need to build a grammar for identifying the {item} and add semantic mapping info to it using SemanticResultValue(). – Michael Levy

Except that freeform response is exactly what I want. I'm building an interactive system that recognizes what it "knows" and what it doesn't, and then prompts for clarification for unknowns. E.g., "Can you open Chrome?" "I'm sorry, I don't know how to open Chrome. Can you tell me how to open it?" "Click on the start menu, programs, Google, then launch Google Chrome." – Jonathan Mitchem

Then I think you can't use the SemanticValue from the recognizer. I believe it requires that explicit Choices be defined in the grammar. I think you have to do your own semantic analysis and classification to figure out what the user said. Treat the output of the recognizer as a text string, as if the user had typed "Chrome", and run it through something else to figure out what that text means. You may want to look at mallet.cs.umass.edu. – Michael Levy

Or, you could use GrammarBuilder and dynamically build a new grammar after each new item is learned. You can keep a file of the names of things you know how to open and what they mean, and on startup always read that file and build a new grammar that includes the SemanticResultValue() entries you've created dynamically. (Just a thought.) – Michael Levy
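A minimal sketch of that last idea, assuming a made-up "spokenForm|semanticValue" file format (all names here are illustrative):

// using System.IO; using System.Speech.Recognition;
private Grammar BuildLearnedGrammar(string path)
{
    Choices item = new Choices();
    // each line maps a spoken form to a semantic value, e.g. "google chrome|chrome"
    foreach (string line in File.ReadAllLines(path))
    {
        string[] parts = line.Split('|');
        item.Add(new SemanticResultValue(parts[0], parts[1]));
    }

    GrammarBuilder gb = new GrammarBuilder();
    gb.Append(new Choices("Can you open", "Open", "Please open"));
    gb.Append(new SemanticResultKey("item", item));
    return new Grammar(gb);
}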

1 Answer


I think a dictation grammar only does transcription: it converts speech to text without extracting semantic meaning, because by definition a dictation grammar supports all words and has no clues about your specific semantic mapping. You need to use a custom grammar to extract semantic meaning. If you supply an SRGS grammar, or build one in code or with the Speech Server tools, you can specify semantic mappings for certain words and phrases. Then the recognizer can extract semantic meaning and give you a semantic confidence.

You should be able to get a Confidence value from the recognizer for the overall recognition; try System.Speech.Recognition.RecognitionResult.Confidence.

The help file that comes with the Microsoft Server Speech Platform 10.2 SDK has more details (this is the Microsoft.Speech API for server applications, which is very similar to the System.Speech API for client applications). See http://www.microsoft.com/downloads/en/details.aspx?FamilyID=1b1604d3-4f66-4241-9a21-90a294a5c9a4 or the Microsoft.Speech documentation at http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.semanticvalue(v=office.13).aspx

For the SemanticValue class it says:

All Speech platform-based recognition engines provide valid instances of SemanticValue for all recognized output, even phrases with no explicit semantic structure.

The SemanticValue instance for a phrase is obtained using the Semantics property on the RecognizedPhrase object (or objects which inherit from it, such as RecognitionResult).

SemanticValue objects obtained for recognized phrases without semantic structure are characterized by:

  • Having no children (Count is 0)
  • A Value property that is null
  • An artificial confidence level of 1.0 (returned by Confidence)

Typically, applications create instances of SemanticValue indirectly, adding them to Grammar objects by using SemanticResultValue and SemanticResultKey instances in conjunction with Choices and GrammarBuilder objects.

Direct construction of a SemanticValue is useful during the creation of strongly typed grammars.

When you use the SemanticValue features in the grammar, you are typically trying to map different phrases to a single meaning. In your case, the phrases "I.E" and "Internet Explorer" should both map to the same semantic meaning. That "artificial confidence level of 1.0" above is also exactly what you were seeing on your dictation-backed {item} values: dictated output has no semantic structure, so it always reports 1.0. You set up Choices in your grammar to cover each phrase that can map to a specific meaning. Here is a simple WinForms example:

// requires a reference to System.Speech and: using System.Speech.Recognition;
private void btnTest_Click(object sender, EventArgs e)
{
    SpeechRecognitionEngine myRecognizer = new SpeechRecognitionEngine();

    Grammar testGrammar = CreateTestGrammar();
    myRecognizer.LoadGrammar(testGrammar);

    // use microphone
    try
    {
        myRecognizer.SetInputToDefaultAudioDevice();
        WriteTextOutput("");
        RecognitionResult result = myRecognizer.Recognize();

        // Recognize() returns null if nothing was recognized in time
        if (result != null && result.Semantics.ContainsKey("item"))
        {
            string item = result.Semantics["item"].Value.ToString();
            float confidence = result.Semantics["item"].Confidence;
            WriteTextOutput(String.Format("Item is '{0}' with confidence {1}.", item, confidence));
        }
    }
    catch (InvalidOperationException exception)
    {
        WriteTextOutput(String.Format("Could not recognize input from default audio device. Is a microphone or sound card available?\r\n{0} - {1}.", exception.Source, exception.Message));
        myRecognizer.UnloadAllGrammars();
    }
}

private Grammar CreateTestGrammar()
{
    // map each spoken phrase to a semantic value for the item
    Choices item = new Choices();
    item.Add(new SemanticResultValue("I E", "explorer"));
    item.Add(new SemanticResultValue("explorer", "explorer"));
    item.Add(new SemanticResultValue("firefox", "firefox"));
    item.Add(new SemanticResultValue("mozilla", "firefox"));
    item.Add(new SemanticResultValue("chrome", "chrome"));
    item.Add(new SemanticResultValue("google chrome", "chrome"));
    SemanticResultKey itemSemKey = new SemanticResultKey("item", item);

    // wrap the keyed choices in a GrammarBuilder...
    GrammarBuilder gb = new GrammarBuilder();
    gb.Append(itemSemKey);

    // ...now build the complete pattern
    GrammarBuilder itemRequest = new GrammarBuilder();
    // preamble: "Can you open" / "Open" / "Please open"
    itemRequest.Append(new Choices("Can you open", "Open", "Please open"));

    // the {item} part is optional (0 or 1 occurrences)
    itemRequest.Append(gb, 0, 1);

    Grammar testGrammar = new Grammar(itemRequest);
    return testGrammar;
}
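With explicit Choices in place, you should also be able to inspect the alternates' semantic confidences, which is closer to what you were originally after. An untested sketch against the result above:

foreach (RecognizedPhrase alternate in result.Alternates)
{
    // each alternate's "item" value should now carry its own confidence
    if (alternate.Semantics.ContainsKey("item"))
    {
        WriteTextOutput(String.Format("{0} {1}",
            alternate.Semantics["item"].Value,
            alternate.Semantics["item"].Confidence));
    }
}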