VoiceXML: Letting People Talk to Your HTTP Server through the Telephone (ArsDigita Systems Journal)

Submitted on: 2001-03-05
Last updated: 2001-03-05

Eve Andersson is a co-founder of ArsDigita Corporation and one of the authors of the ArsDigita Community System. She has degrees from Caltech and U.C. Berkeley and has built dozens of popular web sites on the public internet. Personal web site: www.eveandersson.com.
In every computing era, programmers have been responsible for writing the fundamental application logic. During the desktop application era (1980s), the attention given to this logic was generally dwarfed by that given to the user interface, event handling, and graphics code that a programming team needed to write to get a computer program into the hands of users. Result: very little innovation at the individual level; most widely used computer programs were written by large companies.
During the Web era (1990s), the user interface and graphics were rendered by the Web browser, e.g., Netscape Navigator or Microsoft Internet Explorer. Programmers were able to deliver a complete system to end-users after writing only the application logic and some simple HTML specifying the user interface behavior. Result: a revolution in innovation, with most Web services written in a few months by a handful of people.
Suppose that you'd observed that telephones are much more common and portable than personal computers and Web browsers. Furthermore, you'd noticed that almost everyone can use a telephone, whereas many consumers have little patience for the complexities of the PC. Thus, you'd want to make your information system accessible to a user with only a telephone. How would you have done it? In the 1980s, you'd have rented a telephone line, bought a big specialized box to recognize utterances, bought another specialized box to talk to the user, and parked those boxes right next to the main server for your application. In the 1990s, you'd have rented a telephone line, bought specialized software, and parked a standard computer running that software next to the server running your application. Result in both decades: very little innovation, with only the largest organizations offering voice/telephone interfaces to their information systems.
With the advent of today's voice browsers, the coming years promise to be a period of tremendous innovation in the development of telephone-accessible Internet applications. With a Web service, you operate the HTTP server and run the application; someone else runs the browser. The idea of the voice browser is the same. You operate a server and the application. Someone else, perhaps the phone company, runs the telephone lines and voice browser.
Bottom line: voice browsers allow you to build telephone voice applications with nothing more than an HTTP server. From this, great innovation shall spring.
Illustration

One weekend in February 2001, Tracy Adams, one of my ArsDigita co-founders, called me from her cell phone. She had just flown into Los Angeles and wanted to know the telephone number and address of our Los Angeles office, as well as the direct number for one of the employees. I pointed a Web browser to our intranet, looked up the info, and read it aloud to Tracy.
Feeling inspired, I spent a few hours creating a VoiceXML application: the ArsDigita Telephone Directory, accessible from any telephone in the world. You call up and say which office or employee you're looking for. After searching through some pre-existing Oracle database tables, it tells you the phone numbers and addresses you want.
Next time Tracy arrives confused in a foreign city, she won't have to rely on me being at my desk.
What is VoiceXML?
VoiceXML, or VXML, is a markup language like HTML. The difference: HTML is rendered by your web browser to format content and user-input forms; VXML is rendered by a voice browser. Your application can speak to the user via synthesized speech or by pre-recorded audio files. Your software can receive input from the user via speech or by the tones from their telephone keypad. If you've ever built a web application, you're ready to get started with your phone application.

How to make your content telephone-accessible
As in the old days, you can still rent a telephone line and run commercial voice recognition software and text-to-speech (TTS) conversion software. However, the most interesting aspect of the VXML revolution is that you need not actually do so. There are free VXML gateways, such as Tellme (http://www.tellme.com) and VoiceGenie (http://www.voicegenie.com). These take VXML pages from your web server and read them to your user. If your application needs input from the user, the gateway will interpret the incoming response and pass that response to your server in a way that your software can understand.

You use a web form to configure the gateway with the URL of your application, and it will associate a telephone number with it. In the case of Tellme, your users call 1-800-555-TELL, dial your 5-digit extension, and now they're talking to your application.

VoiceXML basics
The format of a VXML document is simple. Here's how to say "Hello, World" to your visitors:
<vxml>
  <form>
    <block>
       <audio>Hello, World</audio>
       
       <goto next="_home"/>
    </block>
  </form>
</vxml>
Every opening tag (e.g., <vxml>) has to be closed, either with a closing tag like </vxml>, or with a slash (/) as at the end of the singleton goto tag.
The <vxml> tag specifies that this is a VXML document. Within that is a <form>, which can either be an interactive element -- requesting input from the user -- or informational. You can have as many forms as you want within a VXML document. A <block> is a container for your executables, meaning that all your tags that make your application do something, such as <audio> and <goto>, can be clumped together inside of a block. <audio>text</audio> will read the text with a TTS converter, whereas <audio src="wav_file_URL"/> will play a pre-recorded .wav audio file. <goto> can point to another URL, another form within the same VXML doc, or _home, meaning the application is finished.
Here's an example that accepts user input and behaves differently depending on what the user says:
<vxml>
  <form id="animal_questionnaire">
    <field name="favorite_animal">
      <prompt>
        <audio>Which do you like better, dogs or cats?</audio>
      </prompt>
      <grammar>
        <![CDATA[
          [
            [dogs hounds puppies (hound dogs)] {<option "dogs">}
            [cats kitties kittens] {<option "cats">}
          ]
        ]]>
      </grammar>
    </field>

    <filled>
      <result name="dogs">
        <goto next="#popular_dog_facts"/>
      </result>
      <result name="cats">
        <goto next="psychological_evaluation.cgi?affliction={favorite_animal}"/>
      </result>
    </filled>

    <nomatch>
      <audio>I'm sorry, I didn't understand what you said.</audio>
      <reprompt/>
    </nomatch>

    <noinput>
      <audio>I'm sorry, I didn't hear you.</audio>
      <reprompt/>
    </noinput>
  </form>
</vxml>
In this example, we've created a variable called favorite_animal using the <field> tag. After we've prompted the user for a response, we have to specify what the user is allowed to answer by defining a grammar. In our grammar, if the user says "dogs," "hounds," "puppies," or "hound dogs," the value of favorite_animal becomes "dogs." If they respond "cats," "kitties," or "kittens," favorite_animal will be set to "cats."
That's all there is to getting user input. Now we can use the value of their response in our program. In this example, if their answer is "dogs," they will be sent to a form named "popular_dog_facts" within the same VXML document. If they answer "cats," they will be sent to a different URL, psychological_evaluation.cgi. Putting curly braces around a variable name references the value of the variable, so the query string sent to psychological_evaluation.cgi will be affliction=cats.
That's the gist of VXML. Excellent reference material can be found on the Tellme developers' site, http://studio.tellme.com/, including a VXML reference, a grammar reference, a code library, and a library of reusable grammars.

Case Study 1: Building a Pi Reciter Using Tellme

You can sign up for a free developer account at studio.tellme.com. With a developer account, you get your own Tellme telephone extension that you can point to the URL of your VXML application. Having an account also gives you access to Tellme's handy utilities such as the Scratchpad -- a web form where you type in some VXML (e.g., the "Hello, World" example), have its syntax checked, and then call a phone number to hear how it turned out. Tellme provides a very gentle slope into the VXML world.
The first step to building the Pi Reciter is to create a page asking the user how many digits of pi they want to hear. This is a very straightforward VXML document, similar to the example above (source code: http://www.arsdigita.com/asj/vxml/pi-index.vxml.txt). It is convenient to write your VXML in the Tellme Scratchpad first, test it, and then move it over to your web server.

Try out the Pi Reciter!
Call 1-800-555-TELL.
At the main menu, speak the word "Extensions."
Enter extension 58874.

The next VXML page is a little more complicated because it has to be dynamically generated based on user input (how many digits of pi they want to hear).
Your program needs to:
1. understand the user input (n_digits)
2. generate n digits of pi
3. write out a string containing:
<vxml>
 ...
   <audio>3.14159...[the nth digit]</audio>
 ...
</vxml>
Since form variables are passed in exactly the same manner whether you're making a VXML application or a web application, step 1, understanding the user input, is no problem. In my case, I already had step 2, generating the digits of pi, covered: I have more digits of pi stored on my hard drive than I know what to do with. Step 3, writing out the VXML, seems like it would be very straightforward, but it turns out there are a few subtleties here.
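The three steps can be sketched in a few lines. This is a minimal illustration in Python, not the Tcl source linked from this article; the names parse_n_digits and naive_pi_page, the default of 10 digits, and the choice to count digits after the decimal point are all assumptions made for the example. It deliberately ignores the subtleties described in the next two sections.

```python
from urllib.parse import parse_qs

# First 50 decimal digits of pi; a real service would read more from a file.
PI_DECIMALS = "14159265358979323846264338327950288419716939937510"


def parse_n_digits(query_string, default=10):
    """Step 1: read n_digits from the query string, clamped to what we have."""
    values = parse_qs(query_string).get("n_digits", [])
    try:
        n = int(values[0])
    except (IndexError, ValueError):
        # Missing or non-numeric input falls back to the (assumed) default.
        n = default
    return max(1, min(n, len(PI_DECIMALS)))


def naive_pi_page(query_string):
    """Steps 2 and 3: pick the digits and wrap them in a VXML page."""
    n = parse_n_digits(query_string)
    digits = "3." + PI_DECIMALS[:n]
    return (
        "<vxml>\n"
        "  <form>\n"
        "    <block>\n"
        f"      <audio>{digits}</audio>\n"
        '      <goto next="_home"/>\n'
        "    </block>\n"
        "  </form>\n"
        "</vxml>\n"
    )
```

Calling naive_pi_page("n_digits=5") yields a page whose audio tag contains 3.14159.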

256-character word length limit in audio tags

Although I didn't see anything about this in the VXML documentation, it turns out that Tellme ignores <audio> tags if there are overly long words (i.e., long strings of digits) between them. The experimentally-derived character limit is 256.
The solution: break up the audio into multiple tags. If someone wants 500 digits of pi, give it to them in two 250-digit chunks.
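A sketch of the chunking, again in Python for illustration only. The function names are made up for this example, and the chunk size of 250 simply follows the suggestion above to stay under the observed 256-character limit.

```python
def chunk_digits(digits, size=250):
    """Split a digit string into pieces of at most `size` characters."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def audio_tags(digits, size=250):
    """Emit one <audio> tag per chunk so no single word exceeds the limit."""
    return "\n".join(
        f"<audio>{chunk}</audio>" for chunk in chunk_digits(digits, size)
    )
```

A 500-digit string produces two tags of 250 digits each; a 501-digit string produces a third tag holding the single leftover digit.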

Funny number pronunciation

The TTS translator used by Tellme can be rather clever with its pronunciation. Unfortunately, sometimes this backfires, as when it pronounces a string of digits in pi, say 3238462, as "three million, two hundred thirty-eight thousand, four hundred sixty-two." The solution: put spaces between the digits.
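One way to apply the spacing trick to any text before it goes into an audio tag is sketched below. The helper name space_digit_runs is hypothetical, and the rule of only touching runs of two or more digits is a choice made for this example.

```python
import re


def space_digit_runs(text):
    """Insert spaces between the digits of every run of two or more digits."""
    return re.sub(r"\d{2,}", lambda match: " ".join(match.group()), text)
```

With this helper, "3238462" becomes "3 2 3 8 4 6 2", which a TTS engine should then read digit by digit.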

Tip
Tellme and VoiceGenie don't care what Content-Type you use when writing out your content, so you might as well choose text/plain instead of application/x-vxml. It makes debugging easier because you can view VXML source using a web browser.

Source code: http://www.arsdigita.com/asj/the-digits.tcl.txt (coded to the AOLserver Tcl API because that's what I had running on my development server, but the code can be easily translated to run in Perl, VB, or Java).

Case Study 2: ArsDigita Telephone Directory (including some reusable code)

The user experience:

1. Joe Employee calls up 1-800-555-TELL and dials the ArsDigita Directory extension.
2. For security, he is asked to dial or say the passcode before he can go any further.
3. Joe spells out a few letters of an office name or a person's last name using his keypad.
4. He hears a list of matches (pulled from ArsDigita's intranet database), and chooses the one he wants.
5. An automated voice reads the person or office's contact info to him.
6. Joe can go back to Step 3 if he wants to hear someone else's contact info.

The source code:

http://www.arsdigita.com/asj/vxml/index.vxml.txt
http://www.arsdigita.com/asj/vxml/passcode-check.tcl.txt
http://www.arsdigita.com/asj/vxml/name-search-input.tcl.txt
http://www.arsdigita.com/asj/vxml/name-search-results.tcl.txt
http://www.arsdigita.com/asj/vxml/one.tcl.txt
shared procedures: http://www.arsdigita.com/asj/vxml/vxml-defs.tcl.txt
PL/SQL procedures to load into Oracle: http://www.arsdigita.com/asj/vxml/vxml.sql.txt

This code can be used as-is if you are running the ArsDigita Community System Intranet Module (http://www.arsdigita.com/products/modules). The ArsDigita Community System (ACS) is a free, open-source platform enabling ecommerce, enterprise coordination, and education. The Intranet Module allows you to manage employees by keeping track of salaries, benefits, assignments and reviews, and manage customers and projects by keeping track of resource allocation, schedules, tasks, deadlines and status.
Regardless of whether you are running the ACS, it should be easy for you to adapt the concepts and logic to your programming environment. The shared Tcl procedures and PL/SQL procedures will be immediately useful for any application using Oracle or AOLserver.

HTML-encoded characters

If you are using the same database to serve web content and voice content, you have to be aware that many of the character strings common in HTML are illegal in VXML, for example, &aacute; (á), &ccedil; (ç), and &ouml; (ö). Your employee Carl Bjørnsen may enter his name with an &oslash; (ø) so that it will render correctly in a web browser, which is fine; just make sure you translate it to the corresponding iso8859-1 code (&#248;) when you generate your VXML, or your voice application will crash and burn.
vxml-defs.tcl.txt contains a procedure, vxml_convert_illegal_characters, which will take care of the HTML/iso8859-1 conversion for you, while also doing the necessary conversion of & to &amp;, < to &lt;, and > to &gt;. Any user-entered data that is going to be embedded in VXML should be filtered through this procedure first.
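For readers not on AOLserver, the same kind of filter can be approximated as follows. This is a sketch in Python, not a port of vxml_convert_illegal_characters; the function name to_vxml_text is made up, and the approach (decode every HTML character reference first, then re-encode) is an assumption about what a reasonable filter would do.

```python
import html


def to_vxml_text(raw):
    """Turn HTML-flavoured text into text that is safe inside a VXML element."""
    # Resolve named and numeric HTML references first, e.g. &oslash; becomes
    # the character itself and &amp; becomes a bare ampersand.
    text = html.unescape(raw)
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ord(ch) > 127:
            # Non-ASCII characters are written as numeric references.
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)
```

Here to_vxml_text("Carl Bj&oslash;rnsen") returns "Carl Bj&#248;rnsen", and markup characters in user-entered text are escaped rather than passed through.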

Practical limitations on grammars

When people want to look up a name in the directory application, they have to enter the first few letters of the name using their keypad. E.g., for "Adams," you would push 23267, or some subset thereof. I thought it would be much more convenient if people could use their voice to spell out the name they were looking for: "A D A M S."
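On the server side, matching keyed-in digits against names comes down to translating each name into its keypad digits and comparing prefixes. The sketch below is an illustration in Python rather than the directory's actual Tcl and PL/SQL code; the function names are hypothetical, and the keypad layout with Q on 7 and Z on 9 is an assumption, since some telephones omit those letters.

```python
# Standard letter groups on a telephone keypad (assumed layout).
KEYPAD = {
    "abc": "2", "def": "3", "ghi": "4", "jkl": "5",
    "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9",
}
LETTER_TO_DIGIT = {
    letter: digit for letters, digit in KEYPAD.items() for letter in letters
}


def name_to_keys(name):
    """Translate a name into keypad digits, skipping unmapped characters."""
    return "".join(
        LETTER_TO_DIGIT[ch] for ch in name.lower() if ch in LETTER_TO_DIGIT
    )


def find_matches(keyed_digits, names):
    """Return the names whose keypad spelling starts with the keyed digits."""
    return [n for n in names if name_to_keys(n).startswith(keyed_digits)]
```

name_to_keys("Adams") gives "23267", so a caller who keys in 232 would be offered Adams but not Andersson.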
To accept keypad digits from the user, you can just use Tellme's pre-defined grammar TM_DTMF_DigitString. This listens for an arbitrary number of keypad tones. But there is no pre-defined grammar that listens for an arbitrary number of spoken letters. So I decided to create my own grammar that would accept all combinations of letters up to 4 letters in length. I wrote a little script that generated all such combinations and saved them in a file:
eves-first-grammar.gsl:
[
  [a] {<option "a">}
  [b] {<option "b">}
  [c] {<option "c">}
  ...
  [z] {<option "z">}
  [(a a)] {<option "aa">}
  [(a b)] {<option "ab">}
  [(a c)] {<option "ac">}
  ...
  [(z z)] {<option "zz">}
  [(a a a)] {<option "aaa">}
  [(a a b)] {<option "aab">}
  [(a a c)] {<option "aac">}
  ...
  [(z z z)] {<option "zzz">}
  [(a a a a)] {<option "aaaa">}
  [(a a a b)] {<option "aaab">}
  [(a a a c)] {<option "aaac">}
  ...
  [(z z z z)] {<option "zzzz">}
]
I referenced this grammar from within my VXML (<grammar src="eves-first-grammar.gsl"/>), and ran my application. The Tellme voice browser ground to a halt and then crashed. Apparently it's not good to reference a 450,000-element grammar from within a VXML file. A little testing showed that you can't have a grammar with more than about 10,000 elements in it, otherwise the Tellme voice browser will crash.
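A generator along these lines is easy to write. The version below is a Python sketch, not the original script; the function names are made up, and the output format simply mirrors the listing above. Counting every combination of one to four letters gives 26 + 676 + 17,576 + 456,976 = 475,254 entries, the same order of magnitude as the figure quoted above.

```python
from itertools import product
from string import ascii_lowercase


def grammar_entries(max_len):
    """Yield one grammar line per letter combination of length 1..max_len."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            spoken = " ".join(letters)
            # Single letters stand alone; longer sequences are parenthesized.
            phrase = spoken if length == 1 else f"({spoken})"
            value = "".join(letters)
            yield f'  [{phrase}] {{<option "{value}">}}'


def entry_count(max_len, alphabet_size=26):
    """Number of entries the grammar will contain."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def write_grammar(path, max_len):
    """Write the whole grammar, wrapped in the outer brackets, to a file."""
    with open(path, "w") as out:
        out.write("[\n")
        for line in grammar_entries(max_len):
            out.write(line + "\n")
        out.write("]\n")
```

Checking entry_count before writing the file would have flagged the problem early: entry_count(4) is 475254, far beyond what the gateway tolerated.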
More experimentation showed that it's not reasonable to have more than a few (10? 15?) elements in your grammar, otherwise the gateway's voice recognition facilities become extremely inaccurate. And even with only 10 elements, it still makes many more mistakes than a human would.
My advice: use spoken grammars when there are only a couple items to choose from (Yes/No, Dogs/Cats, Coke/Pepsi). Otherwise, have your user key in their choice.

More funny pronunciation

We have a medical doctor on our staff at ArsDigita (on our sales staff). His name, as stored in the intranet database, is: Harry Greenspun, MD.
The TTS converter, as clever as always with its pronunciation, reads his name as: "Harry Greenspun, Maryland." I could have fixed it by adding spaces between the M and the D, just like the trick with the digits of pi, but since he actually lives in Maryland, I didn't bother.
SSL

Make sure that the VXML gateway you use supports SSL. Both Tellme and VoiceGenie do; just point them to an application URL beginning with https.

With Tellme, all form variable values, including user-entered passwords, appear in the query string

Since query strings show up as an extension of the URL, you typically don't want to have sensitive data present in the query string. One reason is that URLs are captured in log files, which are not often heavily protected. You don't want to have a line like this in your access log:
64.14.68.215 - - [25/Feb/2001:19:10:43 -0500] \
"GET /login.tcl?username=eveander&password=alexisgood HTTP/1.0" \
403 413 "" "Mozilla/1.01 [en] (Win95; I)"
In the HTML world, you can avoid the query string entirely by submitting forms via method="post" instead of method="get". But what happens when you try to do this using Tellme?
VXML code snippet:
<submit next="passcode-check.tcl" method="post"/>
Corresponding line in the access log:
64.41.140.74 - - [25/Feb/2001:19:13:01 -0500] \
"POST /tellme/passcode-check.tcl?passcode=1111 HTTP/1.0" \
200 0 "" "Tellme/1.0 (I; en-US)"
Bottom line: if you're going to collect sensitive customer data using Tellme, protect your server log. (Note: VoiceGenie does not have this problem.)
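Besides restricting access to the log, one additional precaution is to scrub sensitive values from log lines before they are archived or shared. The following Python sketch is only a suggestion and not something the gateways or AOLserver provide; the parameter names in SENSITIVE and the [REDACTED] placeholder are assumptions chosen to match the log lines above.

```python
import re

# Query-string parameters whose values should never be kept (assumed names).
SENSITIVE = ("passcode", "password")

# Match name=value only when it directly follows "?" or "&" in a URL.
_PATTERN = re.compile(
    r'(?<=[?&])(' + "|".join(SENSITIVE) + r')=[^&\s"]*'
)


def redact(line):
    """Replace the value of each sensitive parameter in an access-log line."""
    return _PATTERN.sub(r"\1=[REDACTED]", line)
```

Applied to the Tellme request above, the logged path becomes /tellme/passcode-check.tcl?passcode=[REDACTED], while other parameters such as username are left alone.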

VoiceXML gateway comparison

Which VoiceXML gateway do you choose? Two publicly accessible gateways I found on the Web were Tellme and VoiceGenie:

Security
Tellme: Good. Serves pages via SSL. Unfortunately, sensitive data can appear in the access log, so you have to protect it.
VoiceGenie: Great. Serves pages via SSL. No sensitive data in the access log as long as forms are submitted via post.

Convenience for users
Tellme: OK. Toll-free telephone number. The downside is that your users have to listen to two advertisements per call, and there is no international telephone number. Users have to go through too many menus to reach your extension.
VoiceGenie: Good. Non-toll-free telephone number, accessible by everyone.

Text-to-speech capabilities
Tellme: Great. Does an impressive job of pronunciation and phrasing.
VoiceGenie: Good. Pronunciation is good, but the voice inflection is odd.

Documentation and libraries
Tellme: Great. Extensive well-written documentation. Extensive grammar and code libraries.
VoiceGenie: Good. Good documentation and code library. Small grammar library.

Development environment
Tellme: Good. Convenient utilities like the Scratchpad for testing code. Can run real-time debugger. Unfortunately, this debugger is necessary because the error messages sent upon application failure are no use ("Failure for unknown reasons" is the most common message).
VoiceGenie: Good. Helpful error messages upon application failure. Up to 20 telephone extensions per login account.

Support
Tellme: Great. Received a helpful response to my email request within an hour.
VoiceGenie: Great. Received a helpful response to my email request within minutes, complete with VXML code.

Switching between gateways

Switching between gateways is non-trivial, so make your choice carefully. Different gateways require slightly different VXML syntax, and they provide different pre-defined grammars, on which you will find yourself relying.
Beyond VXML

Will all voice applications be VXML applications? The current syntax of VXML is geared at producing a user experience of navigating through hierarchical menus. State-of-the-art research is moving beyond this towards conversational systems in which any utterance makes sense at any time and where context is carried from exchange to exchange. For example, you can call the MIT Laboratory for Computer Science's server at 1-888-573-8255:

You: Will it rain tomorrow in Boston?
JUPITER: To my knowledge, the forecast calls for no rain tomorrow in Boston.
You: What about Detroit?
JUPITER: To my knowledge, the forecast calls for no rain tomorrow in Detroit.
You: I'm interested in floods.
JUPITER: Flood warnings have been issued for Louisiana and Mississippi.
You: Will it be sunny in Phoenix?
...
Notice how the system, more fully described at http://www.sls.lcs.mit.edu/sls/whatwedo/applications/jupiter.html, assumed that you were still interested in rain when asking about Detroit: the context carried over from the Boston question.
In the long run, as these more natural conversational technologies are perfected, the syntax of VXML will have to grow to accommodate the full power of speech interpreters or be eclipsed by another standard.
More

VoiceXML gateways:

Tellme (http://studio.tellme.com/)
VoiceGenie (http://developer.voicegenie.com/)
Voxeo (http://www.voxeo.com/)
BeVocal Cafe (http://cafe.bevocal.com/)

Related links:

VoiceXML Forum (http://www.voicexml.org/)
ArsDigita Community System (http://www.arsdigita.com/products/)

Source code:

http://www.arsdigita.com/asj/vxml/pi-index.vxml.txt
http://www.arsdigita.com/asj/the-digits.tcl.txt
http://www.arsdigita.com/asj/vxml/index.vxml.txt
http://www.arsdigita.com/asj/vxml/passcode-check.tcl.txt
http://www.arsdigita.com/asj/vxml/name-search-input.tcl.txt
http://www.arsdigita.com/asj/vxml/name-search-results.tcl.txt
http://www.arsdigita.com/asj/vxml/one.tcl.txt
shared procedures: http://www.arsdigita.com/asj/vxml/vxml-defs.tcl.txt
PL/SQL procedures to load into Oracle: http://www.arsdigita.com/asj/vxml/vxml.sql.txt

Credits:

Graphics: Mina Reimer (http://www.mina.net/)
Inspiration: Philip Greenspun's 1-day Internet Applications Course: (http://philip.greenspun.com/teaching/one-day-internet)

asj-editors@arsdigita.com

Reader's Comments

Ah, excellent. While my cries for a simple "latest articles" list on the ASJ homepage have gone unheard, as long as every new ASJ article is announced on Slashdot, I'll be sure to find them!
Happy pi day!

-- Luke Francl, March 14, 2001

The problem with recognition accuracy for letter sequences isn't due to the number of possibilities, but due to their confusability. To a speech recognizer, the letters "B", "C" and "D" sound a lot alike (though not as much alike as "D" and "T", or "F" and "S"). In Nuance's grammar format, you can even represent all the options efficiently, as in:

(Digit ?(Digit ?(Digit ?Digit)))
Digit [ zero one two three four five six seven eight nine ]

You'll find that it'll compile, and do a pretty good job at recognizing, especially compared to an alphabet grammar of the same size. VoiceGenie's site provides a similar example of a four digit number grammar with semantics.
In real life, it's no problem matching sixteen digit credit card numbers because the initial sequences are constrained (they code the type of card), the length is constrained, and there's a checksum, all of which can be used to reject illegal hypotheses. All of this information is used in the SpeechWorks' Dialog Modules. I don't know if you can do this with Tellme, which only gives you limited control at the grammar level.

-- Bob Carpenter, March 14, 2001

There seems to be a severe drawback in VXML if one must use 8859-1 encoding instead of Unicode/UTF-8, as described in the XML definitions. As an additional hurdle, the lack of code-independent entities to render non-basic Latin characters (see the 'HTML-encoded characters' section) may result in problems. There are still various systems with various character code tables used to write source code, and these entities are a necessary way to overcome several limitations. XML should allow the definition of such entities. On comparing with the VXML 1.0 specification, it seems that these limitations are not part of the spec, but rather of the rendering (audio) software.

-- Hans Franke, March 14, 2001

Tellme need to work on their grammar expression regarding the repetition of a sequence of letters ( A-A-A-A ) if this really isn't something currently supported. If you can recognize one 'A', how hard can it be to recognize two ?
I guess the platform is Version 1.0 as much as VXML.
Regarding accuracy in recognition, if users are prepared to learn it, Alpha, Bravo, Charlie, Delta... would be a solution. Depends on how important accuracy is and how motivated the users are.

-- Ed McGuigan, March 15, 2001

Wow..where has this been hiding .... I have been developing voiceXML apps for the ACS for about 6 months, and have been preaching voiceXML to anyone at aD that cared to listen. I am amazed that this much has been done... by the way ... forget TELLME and go direct to VOXEO.com .. you can get your own number (in fact they gave me a free toll free number to develop with) and don't have to live with the drivel from Tellme.
We are developing an app for a construction company which provides workforce training and testing by phone...

-- Allan Regenbaum, March 16, 2001

Further .. I spent considerable time with Jerry Asher working on the ability to speak to Tellme, and pass the resulting .wav file to the ACS for storage as a file or BLOB in Oracle. (The blob is not the preferred method as it would use a db connection for the entire duration that the file is being listened to). The Tellme platform does not pass attachments correctly (as discovered by Jerry) so we had to work around it... if anyone is interested in the workaround, let me know .
This offers significant capability for ACS modules to participate in the process of sending voice files !!! voice-e-mail etc...
The way we used it is to allow an instructor to call the ACS to leave verbal instructions for students such that students arriving at /pvt/home would be able to listen to the instructor and then use the survey system to answer questions in response etc etc..
We also did some online testing stuff where we presented multiple choice questions to students and recorded results from the telephone keypad and wrote these to Oracle..
The possibilities are endless..

-- Allan Regenbaum, March 16, 2001

Please also check out the BeVocal Cafe for VoiceXML development -> http://cafe.bevocal.com We have several unique tools and resources... For example, tuned grammars and professional audio for US street names and addresses. Also publicly traded companies for stocks, airports, airlines, and more... We also have tags that support speaker verification for secure transactions, etc... Also, attached is an interesting report that recently rated the top VoiceXML environments...
Attachment: CTLabs.pdf

-- bryan michael, March 17, 2001

I'm quite impressed with VoiceXML. It took a weekend to create a voice interface to http://www.tcgreens.org/ which can be accessed by (1) dialing 1-800-555-8355 (555-TELL) and (2) at the "main menu" dialing 1-48247 (1-4TCGP). It reminded me of the thrill I had when I first tried Mosaic.

-- Paul Houle, June 4, 2001

VoiceXML is great but it does have limitations... however, there are companies out there building the framework for the next generation of Voice Web Services... that enable dialtone 2.0 to really happen. Bevocal, Tellme, VoiceGenie, Hey Anita, Telara, Motorola, and many more are great starters for people interested in this infant industry. Wait till next year, you'll be amazed at what new things people will be able to do with just the sound of their voice --- startrek, here we come :)

-- Brian graham, July 5, 2001
