Web Technologies
CLIENT SIDE ESSENTIALS
JavaScript Objects and Functions – jQuery – Accessing DOM Elements using JavaScript
and jQuery Objects – JavaScript Event Handling – XML DOM – AJAX-Enabled Rich
Internet Applications with XML and JSON – Dynamic Access and Manipulation of Web
Pages using JavaScript and jQuery – Web Speech API – Speech Synthesis Markup
Language.
Web Speech API
• The Web Speech API provides two distinct areas of functionality — speech recognition and speech synthesis (also known as
text-to-speech, or TTS).
• Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech
recognition service
• When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further
actions can be initiated as a result.
• The Web Speech API has a main controller interface for this — SpeechRecognition
• SpeechRecognition interface of the Web Speech API is the controller interface for the recognition service; this also
handles the SpeechRecognitionEvent sent from the recognition service.
Properties
SpeechRecognition also inherits properties from its parent interface, EventTarget.
SpeechRecognition.grammars – Returns and sets a collection of SpeechGrammar objects that represent the grammars that
will be understood by the current SpeechRecognition.
SpeechRecognition.lang – Returns and sets the language of the current SpeechRecognition. If not specified, this defaults to
the HTML lang attribute value, or the user agent's language setting if that isn't set either.
SpeechRecognition.continuous – Controls whether continuous results are returned for each recognition, or only a single
result. Defaults to single (false).
SpeechRecognition.interimResults – Controls whether interim results should be returned (true) or not (false). Interim
results are results that are not yet final.
Web Speech API
Methods
SpeechRecognition also inherits methods from its parent interface, EventTarget.
SpeechRecognition.abort() – Stops the speech recognition service from listening to incoming audio, and doesn't attempt to return
a SpeechRecognitionResult.
SpeechRecognition.start() – Starts the speech recognition service listening to incoming audio with intent to recognize grammars
associated with the current SpeechRecognition.
SpeechRecognition.stop() – Stops the speech recognition service from listening to incoming audio, and attempts to return a
SpeechRecognitionResult using the audio captured so far.
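A minimal sketch of these properties and methods in use, assuming a browser that exposes the unprefixed SpeechRecognition constructor (Chrome needs the webkitSpeechRecognition prefix, covered in the steps below):
// Sketch: configure a recognizer with the properties above, then control it with the methods.
const recognition = new SpeechRecognition();
recognition.lang = "en-US";           // language to recognize
recognition.continuous = false;       // return a single result, then stop
recognition.interimResults = true;    // also report non-final results
recognition.onresult = (event) => {
  console.log(event.results[0][0].transcript);
};
recognition.start();                  // begin listening
// recognition.stop();  stop listening and return a result from the audio captured so far
// recognition.abort(); stop listening without returning a result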
Steps:
Adding the script tag after the body tag makes sure that the script file is loaded after all the elements have been loaded into the
DOM, which aids performance.
<script src="./speechRecognition.js"></script>
Check whether the webkitSpeechRecognition class is available in the window object:
if ("webkitSpeechRecognition" in window) {
// Speech Recognition Stuff goes here
} else {
console.log("Speech Recognition Not Available")
}
Speech recognition is currently limited to Chrome for Desktop and Android — Chrome has supported it since around version 33,
but with prefixed interfaces, so you need to include prefixed versions of them, e.g. webkitSpeechRecognition.
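A common way to handle the prefix (a sketch, not part of the original steps) is to alias whichever constructor the browser exposes:
// Use the standard constructor if present, otherwise fall back to the webkit-prefixed one.
const SpeechRecognitionCtor = window.SpeechRecognition || window.webkitSpeechRecognition;
if (SpeechRecognitionCtor) {
  const recognition = new SpeechRecognitionCtor();
  // configure and start recognition here
} else {
  console.log("Speech Recognition Not Available");
}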
Web Speech API
Steps:
Initialization: create a webkitSpeechRecognition object.
let speechRecognition = new webkitSpeechRecognition();
Properties:
Continuous listening
The speech recognition object can either stop listening after the user stops speaking, or it can keep listening until the user stops it.
If you only want to recognize a phrase or a word, you can set this to false. Here, let's set it to true.
speechRecognition.continuous = true;
Interim results
Interim results are results that are not yet final.
If you enable this property, the speechRecognition object will also return the interim results along with the final results.
Let's set it to true. E.g., speechRecognition.interimResults = true;
Language
This is the language that the user will speak in. You need to use locale codes to set this property. Note: not all languages are available in this feature yet.
Here, use the language that the user has chosen from the Dialect select menu: take that <select> element's value and assign it to the lang property.
E.g., speechRecognition.lang = document.querySelector("#select_dialect").value;
The querySelector() method returns the first element that matches a CSS selector.
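Putting the initialization steps together, a minimal sketch assuming the page contains a <select> element with the id select_dialect:
// Create the recognizer and set the properties described above.
let speechRecognition = new webkitSpeechRecognition();
speechRecognition.continuous = true;       // keep listening until stopped
speechRecognition.interimResults = true;   // report interim (non-final) results too
speechRecognition.lang = document.querySelector("#select_dialect").value;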
Web Speech API
Steps:
Events & callbacks
onStart
This event is triggered when speech recognition is started by the user.
Pass a callback function that will show on the webpage that the speech recognition instance is listening.
There is a <p> element with an ID called status that says Listening.... It has been hidden by setting the display property of
the element to none using CSS.
Set it to display: block when the speech recognition starts.
speechRecognition.onstart = () => {
  document.querySelector("#status").style.display = "block";
};
display: block displays an element as a block element (like <p>): it starts on a new line and takes up the whole width.
E.g., a regular function expression:
var add = function(x, y) {
  return x + y;
};
// Arrow function example (arrow functions allow us to write shorter function syntax)
let add = (x, y) => x + y;
Web Speech API
Steps:
Events & callbacks
onEnd
This event is triggered when the speech recognition is ended by the user.
Pass a callback function that will hide the status <p> element in the webpage.
Set it to display: none when the speech recognition ends.
speechRecognition.onend = () => {
  document.querySelector("#status").style.display = "none";
};
onError
This event is triggered when there is some sort of error in speech recognition. Let’s pass a callback function that
will hide the status <p> element in the webpage.
Set it to display: none when an error occurs.
speechRecognition.onerror = () => {
  document.querySelector("#status").style.display = "none";
};
Web Speech API
Steps:
Events & callbacks
onResult
This event is triggered when the speechRecognition object has some results from the recognition.
It will contain the final results and interim results.
Pass a callback function that will set the results into the respective <span> elements inside the transcript box.
This is the HTML code for the transcript box on the web page. The interim results span is colored in a different
color to differentiate between the interim results and the final results.
<div class="p-3" style="border: 1px solid gray; height: 300px; border-radius: 8px;">
<span id="final" class="text-light"></span>
<span id="interim" class="text-secondary"></span>
</div>
Web Speech API
Steps:
Events & callbacks
onResult
The result event will pass an event object to the callback function.
This object will contain the results in the form of an array.
Each element in the array will have a property called isFinal denoting whether that item is an interim result or a final result.
declare a variable for the final transcript outside the callback function and a variable for the interim transcript inside the callback function.
let final_transcript = "";
speechRecognition.onresult = (event) => {
  // Create the interim transcript string locally because we don't want it to persist like the final transcript
  let interim_transcript = "";
};
Build a string from the results array. We should run it through a loop and add the result item to the final transcript if the result item is final. If
not, we should add it to the interim results string.
// Loop through the results from the speech recognition object.
for (let i = event.resultIndex; i < event.results.length; ++i) {
  // If the result item is final, add it to the final transcript, else add it to the interim transcript
  if (event.results[i].isFinal) {
    final_transcript += event.results[i][0].transcript;
  } else {
    interim_transcript += event.results[i][0].transcript;
  }
}
Web Speech API
Steps:
Events & callbacks
onResult
update the DOM with the transcript values.
document.querySelector("#final").innerHTML = final_transcript;
document.querySelector("#interim").innerHTML = interim_transcript;
Start/Stop recognition
Finally, start and stop the recognition.
We need to set the onclick property of the start and stop buttons to start and stop the speech recognition.
document.querySelector("#start").onclick = () => {
speechRecognition.start();
};
document.querySelector("#stop").onclick = () => {
speechRecognition.stop();
Web Speech API
Speech synthesis (aka text-to-speech, or TTS) involves synthesising text contained within an app into
speech, and playing it out of a device's speaker or audio output connection.
The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-
related interfaces for representing the text to be synthesised (known as utterances), the voices to be used for the
utterance, etc.
The demo includes a set of form controls for entering text to be synthesised, and setting the pitch, rate, and voice to use
when the text is uttered.
HTML and CSS
The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some
simple controls. The <select> element is initially empty, but is populated with <option>s via JavaScript
Setting variables
First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture
a reference to Window.speechSynthesis. This is the API's entry point — it returns an instance of SpeechSynthesis,
the controller interface for web speech synthesis.
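A minimal sketch of this setup; the element IDs (txt, voice-select, pitch, rate) are illustrative, not fixed by the API:
// Capture the SpeechSynthesis controller and the UI elements we need later.
const synth = window.speechSynthesis;
const inputForm = document.querySelector("form");
const inputTxt = document.querySelector("#txt");             // text to speak
const voiceSelect = document.querySelector("#voice-select"); // <select> of voices
const pitch = document.querySelector("#pitch");              // range input for pitch
const rate = document.querySelector("#rate");                // range input for rate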
Web Speech API
Populating the select element
We populate the <select> element with the different voice options the device has available.
The populateVoiceList() function first invokes SpeechSynthesis.getVoices(), which returns a list of all the available
voices, represented by SpeechSynthesisVoice objects.
The getVoices() method of the SpeechSynthesis interface returns a list of SpeechSynthesisVoice objects representing
all the available voices on the current device.
We then loop through this list — for each voice we create an <option> element, set its text content to display
the name of the voice (grabbed from SpeechSynthesisVoice.name)
The name read-only property of the SpeechSynthesisVoice interface returns a human-readable name that
represents the voice.
the language of the voice (grabbed from SpeechSynthesisVoice.lang). The lang read-only property of the
SpeechSynthesisVoice interface returns a BCP 47 language tag indicating the language of the voice.
and -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if
SpeechSynthesisVoice.default returns true).
The default read-only property of the SpeechSynthesisVoice interface returns a boolean value indicating whether
the voice is the default voice for the current app (true) or not (false).
We also create data- attributes for each option, containing the name and language of the associated voice, so we
can grab them easily later on, and then append the options as children of the <select>.
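A sketch of populateVoiceList(), using the synth and voiceSelect references from the setup above:
function populateVoiceList() {
  const voices = synth.getVoices();
  voiceSelect.innerHTML = "";                         // clear any previous options
  for (const voice of voices) {
    const option = document.createElement("option");
    option.textContent = `${voice.name} (${voice.lang})`;
    if (voice.default) {
      option.textContent += " -- DEFAULT";            // mark the synthesis engine's default voice
    }
    // Store the name and language so the voice can be looked up again later.
    option.setAttribute("data-name", voice.name);
    option.setAttribute("data-lang", voice.lang);
    voiceSelect.appendChild(option);
  }
}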
Web Speech API
Firefox doesn't support SpeechSynthesis.onvoiceschanged, and will just return a list of voices when
SpeechSynthesis.getVoices() is fired.
With Chrome however, you have to wait for the event to fire before populating the list, hence the if statement
seen below.
populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
speechSynthesis.onvoiceschanged = populateVoiceList;
}
Speaking the entered text
We create an event handler to start speaking the text entered into the text field.
We use an onsubmit handler on the form so that the action happens when Play is pressed. We first create a new
SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.
The SpeechSynthesisUtterance() constructor of the SpeechSynthesisUtterance interface returns a new
SpeechSynthesisUtterance object instance.
Syntax
var utterThis = new SpeechSynthesisUtterance(text);
Parameter: text, a DOMString containing the text that will be synthesized when the utterance is spoken.
Web Speech API
Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return
the currently selected <option> element.
We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches
this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice
property.
The SpeechSynthesisVoice interface of the Web Speech API represents a voice that the system supports. Every
SpeechSynthesisVoice has its own relative speech service including information about language, name and URI.
voice property of the SpeechSynthesisUtterance interface gets and sets the voice that will be used to speak the
utterance.
Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the
relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken
by invoking SpeechSynthesis.speak(), passing it the SpeechSynthesisUtterance instance as a parameter.
The pitch property of the SpeechSynthesisUtterance interface gets and sets the pitch at which the utterance will be
spoken.
If unset, a default value of 1 will be used.
The rate property of the SpeechSynthesisUtterance interface gets and sets the speed at which the utterance will be
spoken.
If unset, a default value of 1 will be used.
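Putting this together, a sketch of the submit handler, using the inputForm, inputTxt, voiceSelect, pitch, and rate references assumed earlier:
inputForm.onsubmit = (event) => {
  event.preventDefault();                              // keep the form from reloading the page
  const utterThis = new SpeechSynthesisUtterance(inputTxt.value);
  // Find the SpeechSynthesisVoice whose name matches the selected option's data-name attribute.
  const selectedName = voiceSelect.selectedOptions[0].getAttribute("data-name");
  for (const voice of synth.getVoices()) {
    if (voice.name === selectedName) {
      utterThis.voice = voice;
    }
  }
  // Apply the pitch and rate from the range inputs, then speak the utterance.
  utterThis.pitch = pitch.value;
  utterThis.rate = rate.value;
  synth.speak(utterThis);
};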
Web Speech API
Speech Synthesis Markup Language (SSML)
Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into
synthesized speech using the Text-to-Speech service.
Compared to plain text, SSML allows developers to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the Text-to-Speech output.
Normal punctuation, such as pausing after a period or using the correct intonation when a sentence ends with a question mark, is handled
automatically.
SSML elements
Each SSML document is created with SSML elements (or tags). These elements are used to adjust pitch, prosody, volume, and more.
speak is the root element, and is required for all SSML documents.
The speak element contains important information, such as version, language, and the markup vocabulary definition.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string"></speak>
Choose a voice for Text-to-Speech
The voice element is required. It is used to specify the voice that is used for Text-to-Speech.
<voice name="string">
This text will get converted into synthesized speech.
</voice>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-ChristopherNeural">
This is the text that is spoken.
</voice>
</speak>
https://cloud.google.com/text-to-speech/docs/ssml
Speech Synthesis Markup Language (SSML)
Use multiple voices
Within the speak element, you can specify multiple voices for Text-to-Speech output. These voices can be in different languages. For each voice, the
text must be wrapped in a voice element.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
Good morning!
</voice>
<voice name="en-US-ChristopherNeural">
Good morning to you too Jenny!
</voice>
</speak>
You can adjust the speaking style to express different emotions like cheerfulness, empathy, and calm, or optimize the voice for different scenarios like
customer service, newscasting, and voice assistant, using the mstts:express-as element. This is an optional element unique to the Speech service.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-AriaNeural">
<mstts:express-as style="cheerful">
That'd be just amazing!
</mstts:express-as>
</voice>
</speak>
Speech Synthesis Markup Language (SSML)
Adjust speaking languages
Adjust speaking languages for neural voices. Enable one voice to speak different languages fluently (like English, Spanish, and Chinese) using the <lang
xml:lang> element.
This is an optional element unique to the Speech service. Without this element, the voice will speak its primary language.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyMultilingualNeural">
I am looking forward to the exciting things.
<lang xml:lang="es-MX">
Estoy deseando que lleguen las cosas emocionantes.
</lang>
<lang xml:lang="de-DE">
Ich freue mich auf die spannenden Dinge.
</lang>
</voice>
</speak>
Add or remove a break/pause
Use the break element to insert pauses (or breaks) between words, or prevent pauses automatically added by the Text-to-Speech service.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-ChristopherNeural">
Welcome to Microsoft Cognitive Services <break time="100ms" /> Text-to-Speech API.
</voice>
</speak>
Speech Synthesis Markup Language (SSML)
Add silence
Use the mstts:silence element to insert pauses before or after text, or between two adjacent sentences.
The type attribute specifies where the silence is added:
Leading – at the beginning of the text
Tailing – at the end of the text
Sentenceboundary – between adjacent sentences
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-ChristopherNeural">
<mstts:silence type="Sentenceboundary" value="200ms"/>
If we’re home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their
schooling at the same time.
</voice>
</speak>
Specify paragraphs and sentences
The p and s elements are used to denote paragraphs and sentences, respectively. In the absence of these elements, the Text-to-
Speech service automatically determines the structure of the SSML document.
<p></p>
<s></s>
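For example, paragraphs and sentences can be marked up explicitly inside the voice element (a sketch reusing the voice from the earlier examples):
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-ChristopherNeural">
    <p>
      <s>This is the first sentence of the paragraph.</s>
      <s>This is the second sentence.</s>
    </p>
  </voice>
</speak>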
Speech Synthesis Markup Language (SSML)
Use phonemes to improve pronunciation
The phoneme element (with its ph attribute) is used for phonetic pronunciation in SSML documents. The phoneme element can only contain
text, no other elements.
Adjust prosody
The prosody element is used to specify changes to pitch, contour, range, rate, duration, and volume for the Text-
to-Speech output. The prosody element can contain text and the following elements: audio, break, p, phoneme,
prosody, say-as, sub, and s.
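For example, a sketch combining both elements (the IPA string and the prosody attribute values are illustrative):
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-ChristopherNeural">
    You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
    <prosody rate="slow" pitch="high" volume="loud">
      This sentence is spoken slowly, at a higher pitch, and louder than the default.
    </prosody>
  </voice>
</speak>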