Firefox and the Web Speech API

Speech synthesis and recognition are powerful tools to have available on computers, and they have become quite widespread in recent years: look at tools like Cortana, Dictation and Siri on popular modern OSes, and accessibility tools like screen readers.

But what about the Web? Being able to issue voice commands directly to a web page, and have the browser read text content aloud, would be very useful.

Fortunately, some intelligent folk have been at work on this. The Web Speech API has been around for quite some time; the spec was written around 2014 and has seen no significant changes since. As of late 2015, Firefox (44+ behind a pref, and Firefox OS 2.5+) has implemented Web Speech, with Chrome support available too!

In this article we’ll explore how this API works, and what kind of fun you can already have.

How does it work?

You might be thinking “functionality like Speech Synthesis is pretty complex to implement.” Well, you’d be right. Browsers tend to use the speech services available on the operating system by default, so for example you’ll be using the Mac Speech service when accessing speech synthesis on Firefox or Chrome for OS X.

The recognition and synthesis parts of the Web Speech API sit in the same spec, but operate independently of one another. There is nothing to stop you from implementing an app that recognizes a spoken voice command and then speaks it back to the user, but apart from that their functionality is separate.

Each one has a series of interfaces defining their functionality, at the center of which sits a controller interface, called (predictably) SpeechRecognition and SpeechSynthesis respectively. In the coming sections we’ll explore how to use these interfaces to build up speech-enabled apps.
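
For instance, the “recognize a command, then speak it back” idea mentioned above boils down to very little code. Here is a minimal sketch (assuming a browser and configuration where recognition is actually enabled; the prefixed fallback for Chrome is covered later in the article):

// Use the unprefixed constructor where available, falling back to Chrome’s webkit-prefixed version.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

var recognition = new SpeechRecognition();

recognition.onresult = function(event) {
  // Take the top transcript for the first result...
  var phrase = event.results[0][0].transcript;
  // ...and speak it straight back to the user.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(phrase));
};

// Start listening (browsers will typically ask for microphone permission here).
recognition.start();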

Browser support in more detail

As mentioned above, the two browsers that have implemented Web Speech so far are Firefox and Chrome. Chrome/Chrome mobile have supported synthesis and recognition since version 33, the latter with webkit prefixes.

Firefox on the other hand has support for both parts of the API without prefixes, although there are some things to bear in mind:

  • Even though recognition is implemented in Gecko, it is not currently usable on desktop/Android because the UX/UI that lets users grant an app permission to use it is not yet implemented.
  • To use the recognition and synthesis parts of the spec in Firefox (desktop/Android), you’ll need to enable the media.webspeech.recognition.enable and media.webspeech.synth.enabled flags in about:config.
  • In Firefox OS, for an app to use speech recognition it needs to be privileged, and include the audio-capture and speech-recognition permissions (see here for a suitable manifest example, or the rough sketch after this list).
  • Firefox does not currently support the continuous property.
  • The onnomatch event handler is currently of limited use: it doesn’t fire, because the speech recognition engine Gecko has integrated, Pocketsphinx, does not support a confidence measure for each recognition. So it doesn’t report back “sorry, that’s none of the above”; instead it says “of the choices you gave me, this looks the best”.
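
For reference, the permissions part of a Firefox OS manifest.webapp for such an app might look roughly like this (a sketch based on the permission names above rather than the demo’s actual manifest; the description strings are placeholders):

{
  "name": "Speech color changer",
  "type": "privileged",
  "permissions": {
    "audio-capture": {
      "description": "Needed to capture audio from the microphone"
    },
    "speech-recognition": {
      "description": "Needed to interpret spoken commands"
    }
  }
}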

Note: Chrome does not appear to deal with specific grammars; instead it just returns all results, and you can deal with them as you want. This is because Chrome’s server-side speech recognition has more processing power available than the client-side solution Firefox uses. There are advantages to each approach.

Demos

We have written two simple demos to allow you to try out speech recognition and synthesis: Speech color changer and Speak easy synthesis. You can find both of these on GitHub.

To run them live:

Speech Recognition

Let’s look quickly at the JavaScript powering the Speech color changer demo.

Chrome support

As mentioned earlier, Chrome currently supports speech recognition with prefixed properties, so we start our code with this, to make sure each browser gets fed the right object (nom nom.)

// Use the unprefixed constructors where available, falling back to Chrome’s webkit-prefixed versions.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
var SpeechRecognitionEvent = window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;

The grammar

The next line defines the grammar we want our app to recognize:

var grammar = '#JSGF V1.0; grammar colors; public <color> = aqua | azure | beige | bisque | black | [LOTS MORE COLOURS] ;';

The grammar format used is JSpeech Grammar Format (JSGF).
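
That one-liner packs three things together: the JSGF version header, a grammar name, and a single public rule listing the words we want recognized. One way to see the structure more clearly is to build the same kind of string up in JavaScript (an illustrative rewrite with just a handful of the colours, not the demo’s actual code):

// The words we want the <color> rule to match.
var colorWords = ['aqua', 'azure', 'beige', 'bisque', 'black'];

// Header + grammar name + one public rule, with alternatives separated by '|'.
var grammar = '#JSGF V1.0; grammar colors; public <color> = ' + colorWords.join(' | ') + ' ;';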

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

var recognition = new SpeechRecognition();
var speechRecognitionList = new SpeechGrammarList();

We add our grammar to the list using the SpeechGrammarList.addFromString() method. Its parameters are the grammar we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (which can be from 0 to 1 inclusive). The added grammar is available in the list as a SpeechGrammar object instance.

speechRecognitionList.addFromString(grammar, 1);

We then add the SpeechGrammarList to the speech recognition instance by setting it as the value of the SpeechRecognition.grammars property.
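
In code (continuing with the recognition and speechRecognitionList variables defined above), that is a single assignment:

recognition.grammars = speechRecognitionList;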

Starting the speech recognition

Now we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start().

var diagnostic = document.querySelector('.output');
var bg = document.querySelector('html');

document.body.onclick = function() {
  recognition.start();
  console.log('Ready to receive a color command.');
}

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition event handlers list.) The most common one you’ll probably use is SpeechRecognition.onresult, which is fired once a successful result is received:

recognition.onresult = function(event) {
  var color = event.results[0][0].transcript;
  diagnostic.textContent = 'Result received: ' + color + '.';
  bg.style.backgroundColor = color;
  console.log('Confidence: ' + event.results[0][0].confidence);
}

The second line here is a bit complex-looking, so let’s explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing one or more SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0.

Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays; the second [0] therefore returns the SpeechRecognitionAlternative at position 0. Its transcript property then gives us a string containing the recognized result; we set the background color to that color, and report the recognized color as a diagnostic message in the UI.
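
Note that nothing guarantees the transcript will be one of the colours in our grammar (especially on Chrome, as noted earlier), so you may want to validate it before using it. A small illustrative variation on the handler above (not part of the demo itself, and reusing the recognition, diagnostic and bg variables already defined) could look like this:

// Only act on transcripts that appear in an allowed list of colour names.
var allowedColors = ['aqua', 'azure', 'beige', 'bisque', 'black'];

recognition.onresult = function(event) {
  var color = event.results[0][0].transcript.trim().toLowerCase();
  if (allowedColors.indexOf(color) !== -1) {
    diagnostic.textContent = 'Result received: ' + color + '.';
    bg.style.backgroundColor = color;
  } else {
    diagnostic.textContent = 'Sorry, "' + color + '" is not a colour I know.';
  }
};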

You can find more detail about this demo on MDN.

Speech Synthesis

Now let’s quickly review how the Speak easy synthesis demo works.

Setting variables

First of all, we capture a reference to Window.speechSynthesis. This is the API’s entry point: it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis. We also create an empty array to store the available system voices (see the next step.)

var synth = window.speechSynthesis;

  ...

var voices = [];

Populating the select element

To populate the <select> element with the different voice options the device has available, we’ve written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list; for each voice we create an <option> element and set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name), the language of the voice (grabbed from SpeechSynthesisVoice.lang), and the string “-- DEFAULT” if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true.)

function populateVoiceList() {
  voices = synth.getVoices();

  for(var i = 0; i < voices.length ; i++) {
    var option = document.createElement('option');
    option.textContent = voices[i].name + ' (' + voices[i].lang + ')';

    if(voices[i].default) {
      option.textContent += ' -- DEFAULT';
    }

    option.setAttribute('data-lang', voices[i].lang);
    option.setAttribute('data-name', voices[i].name);
    voiceSelect.appendChild(option);
  }
}

When we come to run the function, we call it immediately, and also set it as the value of the onvoiceschanged handler where that handler exists. This is because Firefox doesn’t support SpeechSynthesis.onvoiceschanged, and will just return a list of voices when SpeechSynthesis.getVoices() is invoked. With Chrome, however, you have to wait for the event to fire before populating the list, hence the if statement seen below.

populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = populateVoiceList;
}

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter/Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak(), passing it the SpeechSynthesisUtterance instance as a parameter.

inputForm.onsubmit = function(event) {

  event.preventDefault();

  var utterThis = new SpeechSynthesisUtterance(inputTxt.value);
  var selectedOption = voiceSelect.selectedOptions[0].getAttribute('data-name');
  for(var i = 0; i < voices.length ; i++) {
    if(voices[i].name === selectedOption) {
      utterThis.voice = voices[i];
    }
  }
  utterThis.pitch = pitch.value;
  utterThis.rate = rate.value;
  synth.speak(utterThis);

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

inputTxt.blur();
}

You can find more detail about this demo on MDN.


3 thoughts on “Firefox and the Web Speech API”

  1. Chris Mills

    Apologies — I shoulda been clearer on the prefs. I’ve added a line about the prefs you need to enable in the notes near the top. Recognition won’t work at the moment in desktop or Android.

  2. vince

    The demos don’t work on Firefox Android.

  3. Brett Zamir

    It is great to see momentum building on this. Is there a rough ETA on desktop support?
