This post is part of the series on Text-to-Speech (TTS) for eLearning written by Dr. Joel Harband and edited by me. The other posts are: Text-to-Speech Overview and NLP Quality, Digital Signal Processor and Text-to-Speech, Using Text-to-Speech in an eLearning Course, Text-to-Speech eLearning Tools - Integrated Products, and, seemingly the most popular of the series so far, Text-to-Speech vs Human Narration for eLearning.
One concern raised in comments throughout the series has been the quality of Text-to-Speech (TTS) voices and whether it is suitable for eLearning. The previous post partly addressed this issue. In this post we’ll take a different cut at it by looking at how authors can use punctuation and markup language with TTS voices to bring out the meaning of the text more accurately and to make the voices more interesting. Using these techniques, a voice can be made similar enough to human narration to hold a learner’s interest through an entire eLearning course, with a retention rate equivalent to that of a human voice.
Value and Concern Around Voice-Over
Before we jump into this specific topic, let’s look back at some of the specifics from last month’s Big Question - Voice Over in eLearning. Here’s a very quick summary of some of the responses regarding the added learning value of a voice-over as opposed to plain screen text:
- Audio provides an additional channel of information which the brain can process in parallel with the visual information [Kapp].
- A voice should not just read screen text [Kapp] but can optionally be supported by running subtitles at the bottom of the slide as in Captivate and Speech-Over [Joel].
- A great deal more information per slide can be transferred with voice than with plain text. One minute of speech is equivalent to 125 words – which would crowd the slide considerably [Joel].
- A lively and interesting voice can motivate learning and increase retention. [Mike Harrison]
- A voice can often express the intended meaning more accurately than plain text by changing speed, volume, and pitch, emphasizing words, and pausing for emphasis [Mike Harrison]. (This is the prosody that we discussed in the first post.) For example: “HE reads well” vs. “He READS well” vs. “He reads WELL.”
It’s these last two points that relate closely to this topic. Ultimately, we would like the voice (human or TTS) to be lively and interesting, help increase motivation and learning, and convey the meaning more accurately.
Some of the concern around the use of Text-to-Speech Voices in eLearning is whether you can achieve that level of voice use.
Making the Author into a Voice Talent
Today’s post aims to show that with state-of-the-art tools that simplify the use of markup language, like Speech-Over Professional, TTS voices can easily be made interesting as well as prosody-accurate (the last two points above).
The concept presented here is a bit of a change in thinking:
An author together with a TTS voice is equivalent to a voice talent!
While it handles the grammar quite well, the TTS voice by itself cannot know the nuances and emphases (prosody) needed to bring out the intended meaning of a sentence, and will produce a compromise prosody. Authors need to fill the gap. Some people in the world of TTS call them “text authors”; throughout this post we will refer to them simply as “authors,” since they are likely also the course authors. Authors know what the voice should sound like, and they use punctuation and markup language to make the TTS voice achieve the intended meaning and clarity, as well as to enliven it.
In some ways this is not that new for people who have worked with voice talent before. If you’ve ever sat through a recording session, you’ve listened to what was being said and often corrected the phrasing, pronunciation, pacing, and other aspects of how the voice talent handled the script you had written. There is an equivalent operation when dealing with TTS voices: you need to listen to the results and make corrections. Of course, as we pointed out in Using Text-to-Speech in an eLearning Course, the effort to make changes is likely substantially less.
The Basics
Let’s see an example of what we are talking about. Here is a clip of the TTS voice Heather reading Elizabeth Barrett Browning’s poem “How Do I Love Thee?”, produced with Speech-Over Professional.
How Do I Love Thee?
How do I love thee? Let me count the ways.
I love thee to the depth and breadth and height
My soul can reach, when feeling out of sight
For the ends of Being and ideal Grace.
I love thee to the level of every day's
Most quiet need, by sun and candlelight.
I love thee freely, as men strive for Right;
I love thee purely, as they turn from Praise.
I love thee with a passion put to use
In my old griefs, and with my childhood's faith.
I love thee with a love I seemed to lose
With my lost saints. I love thee with the breath,
Smiles, tears, of all my life! and, if God choose,
I shall but love thee better after death.
As you listen, notice a few simple uses of punctuation and markup language, applied with Speech-Over Professional’s SAPI editor, that improve on how the default voice would have read this.
The Speech-Over SAPI editor lets authors apply markup language quickly and accurately with simple text symbols, which are as easy to use as ordinary punctuation. The symbols used in this example are the em-dash (—), which inserts a 0.5-second silent delay, and the right and left arrows (⊳, ⊲), which decrease and increase the voice speed by one unit.
Listen to the effect of ordinary punctuation on the voice in the example:
- The question mark is obvious - Heather expresses it very nicely.
- The colon after "Let me count the ways:" gives a feeling of expectation for what’s to come. Putting a comma or period there would not give the same flow. Colons are generally used to introduce sequences to good effect.
- Commas are used to give phrasing and resolve ambiguous sentences. They are a powerful tool and can be used more often than strict grammar would require.
Listen also to the effect of the markup language:
- A delay (—) was placed between “How do I love thee” and “Let me count the ways” to express a slight hesitation for thought and then again after “Let me count the ways” to further hesitate for thought before stating the reasons.
- Delays are also inserted throughout to introduce the hesitations that make the voice more realistic.
- The decrease and increase in speed for groups of words give them a slight accent and emphasis. For example, the words “I love thee”, “most quiet need”, etc. have a speed decrease before them and a return to normal speed afterwards, giving them a slight accent, depth, and emotional content. The amount of accent is controlled by the amount of speed reduction: two units (⊳⊳) or one (⊳). A similar effect can be achieved with the emphasis tag (!!).
Incidentally, Heather’s slight natural Southern accent is there because her voice was built from recordings of a real Southern speaker!
Now let’s look at these concepts in more detail.
Using Punctuation
The judicious use of punctuation goes a long way towards making the voices more expressive and precise, especially the comma and the colon.
Adding punctuation incrementally makes the prosody of a sentence progressively clearer. In our experience, the really good voices like Paul and Heather do quite well on their own most of the time, needing only well-placed commas, colons, and silent delays.
Mark-Up Language
As we mentioned in the first post, many “small” innovations are needed to make text-to-speech useful and practical. The most important of these is the Microsoft Speech Application Programming Interface (SAPI), a programming standard for Windows. SAPI standardizes the way authors control TTS voices: starting and stopping the voice, controlling its speed, volume, and pitch, and shaping its flow with silent delays. Manufacturers of SAPI-standard voices implement the SAPI controls in the voice software, and developers of speech applications program SAPI controls into their applications to let the user control any SAPI-standard voice.
To control the properties and flow of the voice, SAPI provides an XML markup language, also called speech tags, which is added to the input text to tell the voice processor what actions to take when converting the text to speech.
Some examples:
1. Volume - The Volume tag controls the volume of a voice on a scale of 0 to 100. The voice will change volume at the point it encounters the tag.
This text should be spoken at volume level 100.
<volume level="50">
This text should be spoken at volume level fifty.
</volume>
2. Rate - The Rate tag controls the rate (speed) of a voice on a scale of -10 to 10. The voice will change speed at the point it encounters the tag.
This text should be spoken at rate 0.
<rate absspeed="3"> This text should be spoken at rate 3.
<rate absspeed="-3"> This text should be spoken at rate -3.
</rate> </rate>
The Pitch tag works the same as the Rate tag.
3. Emphasis - The Emph tag instructs the voice to emphasize a word or section of text.
<emph> boo </emph>!
Use the Emph tag to determine the prosody of an ambiguous sentence, for example the one referred to in the first post.
4. Silence - The Silence tag inserts a specified number of milliseconds of silence into the output audio stream.
Five hundred milliseconds of silence <silence msec="500"/> just occurred.
This is a very important tag for the naturalness of the voice.
5. Pronounce - The Pron tag inserts a specified pronunciation using the SYM phonetic language. Here is “Hello world” in SYM.
<pron sym="h eh 1 l ow & w er 1 l d "/>
This tag lets you instruct the voice how to say highly technical words and company slogans. See the first post for an example.
6. Part of speech - The PartOfSp tag lets you specify a word’s part of speech to resolve ambiguous pronunciations.
Notes:
· Not all voices implement all the tags; for example, Heather does not implement the Emph tag.
· The NeoSpeech voices in Captivate do not use the SAPI tags but rather a proprietary markup language, VTML. Speech-Over works with SAPI-standard voices only.
· For more information about SAPI and its markup language, download the SAPI documentation (sapi.chm) from Microsoft.
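To make the tags above concrete, here is a minimal sketch in Python of helpers that build SAPI-style markup strings. The helper names are our own invention for illustration; only the tag syntax follows the SAPI examples above, and the resulting string would still need to be fed to a SAPI-standard voice to be heard.

```python
from xml.sax.saxutils import escape

# Minimal helpers (hypothetical names) that wrap text in the SAPI
# tags described above.
def volume(text, level):
    """Speak text at a given volume, on SAPI's 0 to 100 scale."""
    return f'<volume level="{level}">{escape(text)}</volume>'

def rate(text, absspeed):
    """Speak text at an absolute rate, on SAPI's -10 to 10 scale."""
    return f'<rate absspeed="{absspeed}">{escape(text)}</rate>'

def emph(text):
    """Ask the voice to emphasize text (not every voice supports this)."""
    return f'<emph>{escape(text)}</emph>'

def silence(msec):
    """Insert msec milliseconds of silence into the audio stream."""
    return f'<silence msec="{msec}"/>'

# The opening of the poem, marked up the way the post describes:
# a pause for thought, then a slightly slowed second phrase.
markup = "How do I love thee? " + silence(500) + rate("Let me count the ways.", -1)
print(markup)
```

Escaping the text matters because characters like & or < in the input would otherwise be read as markup by the voice processor.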
Automating the markup language – SAPI editor
Clearly, having to type or even paste these XML tags into the input text is time-consuming and error-prone. This is another case where a small innovation is called for: as discussed above, Speech-Over Professional has a SAPI editor that represents XML tags with simple text symbols, which makes it easy and error-resistant to insert and manipulate speech tags in the input text. Speech-Over Professional also automates the Pron tag with a pronunciation lexicon to which you can add highly technical terms and company slogans.
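As a rough illustration of what such an editor does internally, here is a sketch that translates the text symbols described earlier into SAPI tags: an em-dash becomes a half-second silence, and runs of ⊳ or ⊲ become relative rate changes. The symbol set comes from this post, but the exact mapping Speech-Over uses is not public, so this is purely our assumption.

```python
import re

def symbols_to_sapi(text):
    """Hypothetical translation of editor symbols into SAPI speech tags."""
    # Em-dash: insert a 0.5-second silent delay.
    text = text.replace("\u2014", '<silence msec="500"/>')

    # Runs of right arrows (U+22B3) slow the voice; runs of left
    # arrows (U+22B2) speed it back up. Each arrow is one unit of
    # relative rate change, using the Rate tag's "speed" attribute.
    def rate_tag(match):
        run = match.group(0)
        sign = -1 if run[0] == "\u22b3" else 1
        return f'<rate speed="{sign * len(run)}"/>'

    return re.sub("\u22b3+|\u22b2+", rate_tag, text)

# "⊳⊳most quiet need⊲⊲" becomes a slow-down, the phrase, a speed-up.
print(symbols_to_sapi("\u22b3\u22b3most quiet need\u22b2\u22b2"))
```

Representing each speed change as a standalone tag mirrors how SAPI rate changes take effect at the point they are encountered and persist until changed again.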
Bottom Line
You may be thinking that some of the cost savings you get from using TTS rather than human voice talent are lost in this effort, and that’s true. However, the rework is still substantially less. Again, the best comparison is going through a recording session with a script: that process is very similar to what you end up doing with punctuation and markup to get the TTS voice to sound much better for eLearning.
For me personally, this is still not the same quality as a good voice talent, but it definitely costs less, and costs MUCH less when content changes. It’s a good balance in many situations.