Voice Portals and Multimodal Technology
Innovative speech applications benefit companies and their customers
DAVID SHADBOLT

The evolution of speech technology and the establishment of open voice standards have led to the development of voice portals, to voice-enabled user interfaces on desktops and mobile devices, and to the use of speech technology in applications as diverse as mobile workforce support and air traffic control simulation training programs.

In customer support, speech technology is the latest driving force for improvement because while multichannel marketing may have increased point-of-sale opportunities for companies, a company still has to keep the customer smiling. In a recent global survey by Genesys Telecommunications Laboratories, 56% of respondents rated customer service as more important than the product itself, which was rated second in importance at 28%. After a bad customer experience, 63% of consumers will stop using a company's product or service, and 100% of those between 18 and 25, where brand loyalty seems non-existent, would do so. Conversely, 76% of customers say they would buy from a company based on a positive call-center experience.

Speech technology gives customers the opportunity to interrupt operator voice menus by speaking naturally to go directly to the service required. This raises satisfaction levels among consumers who often find interactive voice response (IVR) systems confusing, frustrating and time consuming. According to speech technology vendor Nuance, a survey conducted in the year 2000 showed that 80% of customers preferred speech to touch-tone keypads, and 84% of them rated speech interaction equal to or better than Web service.

A voice recognition process works as follows: The speaker speaks into the phone, which captures the analog signal and converts it to a digital signal. The speech recognition engine converts the digital signal to phonemes (the smallest segments of speech). The specific application then processes the words in its grammar that match those phonemes.
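The last two steps can be sketched in a few lines of Python. This is a toy illustration only, assuming the engine has already produced a phoneme sequence; the phoneme symbols, the LEXICON and GRAMMAR tables and the function names are all hypothetical, and real recognition engines rely on statistical acoustic and language models rather than exact lookups.

```python
# Toy sketch: map a phoneme sequence to words, then match the words
# against an application grammar. All names and data are hypothetical.

# Pronunciation lexicon: phoneme sequence -> word.
LEXICON = {
    ("B", "AE", "L", "AH", "N", "S"): "balance",
    ("P", "EY", "M", "AH", "N", "T"): "payment",
    ("B", "IH", "L"): "bill",
    ("HH", "IH", "S", "T", "ER", "IY"): "history",
}

# Application grammar: word sequence -> action the application runs.
GRAMMAR = {
    ("balance",): "account_balance",
    ("payment",): "make_payment",
    ("bill", "history"): "bill_history",
}


def phonemes_to_words(phonemes):
    """Greedily split a phoneme sequence into words, longest match first."""
    words = []
    i = 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            chunk = tuple(phonemes[i:i + length])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])
                i += length
                break
        else:
            return None  # no lexicon entry starts at this position
    return words


def recognize(phonemes):
    """Return the grammar action for the phonemes, or None if nothing matches."""
    words = phonemes_to_words(phonemes)
    if words is None:
        return None
    return GRAMMAR.get(tuple(words))


if __name__ == "__main__":
    spoken = ["B", "IH", "L", "HH", "IH", "S", "T", "ER", "IY"]
    print(recognize(spoken))
```

Because the grammar is limited to the phrases the application expects, anything outside it simply returns no match, which is why a well-chosen grammar matters as much as the recognition engine itself.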

Developing a voice portal gives corporate management the unusual experience of not only improving customer satisfaction but also reducing call center costs. IBM, which has the WebSphere Voice Server, estimates that voice recognition reduces customer service costs at call centers by 10 to 30 cents per call on normal costs of $5 to $10, while Genesys claims its Voice Portal to be as much as 40% less costly to buy than IVR systems. Genesys also claims that a standard implementation of 200 ports in a 200-agent corporate contact center can expect a return on investment in as little as eight months.

Verizon set out to provide a voice portal for its 230,000 employees, many of them mobile or without access at work to the intranet (eWeb.verizon.com) containing the corporate directory and other information. The eWeb Voice Portal has proven very successful, says project lead Lu Shen. "We wanted to give employees anytime, anywhere access without incurring the cost of additional operators. It's now receiving 114,000 calls per month, with a 5.4 minute average call length. It has also won Verizon's most prestigious award, the Verizon Excellence Award."

Following its success with its intranet, Verizon is developing a nationwide customer-facing application to serve its 38 million customers, millions of whom do not have Internet access. "By giving them convenient access to their Verizon accounts," says Shen, "we will reduce the number of person-to-person calls, which has a huge cost benefit. Fifteen features are either being supported or will soon be supported. These include bill summary, payment, bill history, popular services and products, and pricing information."

One leading European technical and IT outsourcer, twenty4help, has implemented Genesys' Voice Portal. The company employs 1,700 people who handle more than 850,000 phone and Web-based interactions per month in 15 different languages at five European customer contact centers. It provides technical support for major software manufacturers, with 85% of the support conducted by phone and the balance via fax or Web-based contact such as e-mail, text chat and Web collaboration. Providing comprehensive technical support 24 hours a day, seven days a week demands a successful integration of hardware and software solutions. Ralf Rottman, the European IT manager for twenty4help, says, "Because we are acting as outsourcing partners, we have to fight to decrease support cost but increase service availability and service quality. Speech-driven self-service applications offer customers the flexibility of moving from a self-service to an agent-assisted transaction as needed. Voice Portal is not only fast and easy to install, but also easy to configure. In addition, it can be combined with existing Web databases and applications through VoiceXML." Recently expanded language support for French, Italian, German and Spanish further enables global capabilities for voice self-service.

Programming and management of voice response applications are aided by Voice Extensible Markup Language (VoiceXML), which was released publicly by the VoiceXML Forum in 2000. This markup language is used to create audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and dual-tone multifrequency (DTMF) key input, telephony features such as call transfer, and mixed-initiative conversations. As an application of XML, VoiceXML supports Unicode. Its "xml:lang" attribute provides a mechanism for precise control of the input and output languages, including the ability to interpret input in a language different from the output language(s).
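To make the language control concrete, here is a minimal sketch of a VoiceXML 2.0 dialog that prompts in French but listens for English words. The dialog itself (form name, prompts and grammar items) is invented for illustration and is not taken from any of the deployments described here; the Python wrapper only checks that the markup is well formed and reads back the "xml:lang" values using the standard library.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical dialog: document default is en-US, the prompts override it
# with fr-FR, and the inline grammar accepts spoken English.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="account_menu">
    <field name="choice">
      <prompt xml:lang="fr-FR">Dites bill ou payment.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="choice"
               type="application/srgs+xml">
        <rule id="choice">
          <one-of>
            <item>bill</item>
            <item>payment</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt xml:lang="fr-FR">Vous avez choisi <value expr="choice"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""


def languages_by_element(document):
    """Parse the document and report the language set on key elements."""
    root = ET.fromstring(document.encode("utf-8"))
    result = {"default": root.get(XML_LANG)}
    for name in ("prompt", "grammar"):
        element = root.find(f".//{{{VXML_NS}}}{name}")
        result[name] = element.get(XML_LANG)
    return result


if __name__ == "__main__":
    print(languages_by_element(DOCUMENT))
```

The output language travels with each prompt and the input language with each grammar, so a single dialog can mix them without changing the document default.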

Multimodality

Consumers who want feature-rich handheld devices that incorporate computer screens and advanced data and messaging applications, as well as "anytime, anywhere" access, have found the small graphical user interface frustrating. This has hindered market growth. VoiceXML offers manufacturers the possibility of partially overcoming size limitations by adding speech input, in addition to the keyboard, keypad, mouse and/or stylus, as a way of accessing applications and Web services, and by adding output received as synthesized speech, as well as audio, plain text, motion video and/or graphics. Access through this more versatile user interface via PCs, telephones, tablet PCs and wireless personal digital assistants is termed "multimodal." It required a new set of open application standards to make it viable.

One answer is Speech Application Language Tags (SALT). The SALT Forum, a standards-setting body that has Cisco, Intel and Microsoft among its founding members, describes the standards on its Web site as "a lightweight set of extensions to existing markup languages, in particular HTML and XHTML." IBM, Motorola, Opera Software ASA and others developed another standard. IBM Pervasive Computing's multimodal-product manager Igor Jablakov says, "We had known from looking at our customers' investments in various technologies that data and voice would merge at some point, but it needed a computing language that would allow programmers to write code for these multimodal applications. Our working group arrived at XHTML+Voice (X+V), a standards-compliant multimodal markup language that uses combinations of XHTML and VoiceXML."

Concerns over one standard eventually dominating the market may lead some companies to hedge their bets by building applications to support both SALT and X+V; but both standards enable developers to use existing skills, thereby reducing the time required to build Web applications for voice, browser and new multimodal devices. Speech technology vendors offer developer toolkits. IBM provides its IBM Multimodal Toolkit, an integrated development environment built on the Eclipse framework. The toolkit includes a multimodal editor in which developers can write both XHTML and VoiceXML in the same application, reusable blocks of X+V code and a simulator to test the applications.

Royal Philips Electronics in the Netherlands has released Speech SDK 3.1, a customizable software development kit for creating professional applications that benefit from the features of the Philips speech recognition technology. Armin Scheuer, Philips media relations manager, says, "The new version includes the world's first VoiceXML engine to support STT functionality in addition to the existing dialog capabilities. With SDK 3.1, developers can implement the full capabilities of the recognition engine directly into their application software using any standard software development through the PSP C/C++ API interface."

Embedded Solutions

The small footprint of current speech technology has made it more practical for manufacturers to embed voice into their own products, particularly consumer products, services and supporting systems that deliver information, communications and entertainment to on-vehicle and mobile devices. In-vehicle services called telematics allow users to obtain customized services such as driving directions, emergency roadside assistance, personalized news, sports and weather information, as well as access to e-mail and other productivity tools. An organization such as Starbucks, McDonald's or the 24-hour copy service Kinko's can link the in-car experience with a channel for selling its products and services. The Kelsey Group predicts US and European spending in this area will exceed $6.4 billion by 2006. Voice interaction adds a new customer service dimension to this sector.

Automobile manufacturers have already begun incorporating speech technology into new models. Honda is using IBM's Embedded ViaVoice in its navigation technology in select 2003 Honda Accord models. The application has a vocabulary of 150 English-language commands and recognizes a range of accents. Drivers can ask for directions and hear responses over their existing car audio systems. To get directions, the driver uses the "talk" button located on the steering wheel. Jablakov points to another example. "Daimler is looking at integrating speech within the 2004 minivan. It will allow customers to use mobile handsets that use Bluetooth technology (an open technology specification for short-range radio links between mobile PCs, 'smart' devices and other portable machines). A user can simply bring a handset into the vehicle, and it will integrate with the on-vehicle navigation. Voice access gives the driver a hands-free, eye-free way of obtaining e-mail, having it read, saved or deleted; book travel and even obtain information from his or her 401(k)."

Global Languages & Cultures, a multiservice translation company, has translated voice messages in on-vehicle global positioning devices as well as voice for general telephone services. Project manager Francesco Carbonari explains, "We translate material from the client's script files, record into the target languages at the studio and then send the files back to the client. The client creates a voice database based on the splicing of the material. To avoid problems later, we need to know in advance what voice talent is required. Is it a male or female? And if male, what for them is a good male voice? This can change according to the culture. In Japanese culture, a good female voice sounds light and somewhat passive, with a low, soft tone, which wouldn't fit for an American audience or other countries."

Carbonari continues: "If there is a problem with the final recording, it's a serious issue. We need to fix the translation, find out where the problem began, and if it's truly a problem, locate the voice talent, book the studio (it could be in Europe) and record again. The same costly process results if we have saved files in the wrong format, as a WAV, for example, rather than in MP3. We try to lower client costs as much as possible, for example, with enumeration which in the case of driving directions can be thousands of numbers. To avoid recording all the numbers from zero to a thousand and everything in between, we buy prerecorded language number sets from 1 to 1,000 so that the client avoids paying studio time."

Server-based Offering

Companies can voice-enable their Web sites, intranets, databases and other applications. Baltimore-based investment firm T. Rowe Price sought to handle more calls from retirement plan participants and improve customer satisfaction. The company first had positive results from the use of IBM WebSphere Voice Response for touch-tone IVR systems, followed by IBM WebSphere Voice Server (available in 14 languages) with speech recognition allowing callers to speak selections by name. The company took the next step when IBM introduced Natural Language Understanding technology. Participants can use phrases such as "I'd like my balance, please" or "What funds are in my plan?" instead of being limited to specific numeric keypad functions. In addition, plan participants can interrupt the system, change their minds, ask for different information or execute tasks in any order they choose. The company's management finds that the system's more natural dialogues allow it to handle more transactions per hour and even support customers who are still using rotary dial "pulse" telephones and are unable to access IVR touch-tone systems.
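The difference between fixed keypad options and free phrasing can be sketched with a toy keyword spotter. This is only an illustration of the idea, not how IBM's Natural Language Understanding technology works; the intent names, keyword sets and function name are hypothetical.

```python
import re

# Hypothetical intents and the keywords that trigger them.
INTENT_KEYWORDS = {
    "get_balance": {"balance"},
    "list_funds": {"fund", "funds"},
}


def detect_intent(utterance):
    """Return the first intent whose keywords appear in the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None  # fall back to a menu or an agent


if __name__ == "__main__":
    for phrase in ("I'd like my balance, please", "What funds are in my plan?"):
        print(phrase, "->", detect_intent(phrase))
```

Even this crude version ignores the filler words around the request, which is what frees callers from memorizing menu positions.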

Mobile Workforce

The convenience of voice can save time for mobile workforces. Newport Wireless, based in Irvine, California, has a product line called NewportWorks that includes integrated wireless solutions for residential real estate agents, designed to increase the productivity and revenue of mobile workforces. Within this line is anytimeMLS, a solution based on IBM WebSphere Voice Server technology whose voice interface lets residential real estate agents request and/or retrieve information via mobile phone. During the initial market trial, the test group estimated that they could sell an additional three to five properties per year if they had wireless access to the multiple listing service (MLS). In addition to MLS database access, the solution offers short text message or voice call alerts to mobile phones and lets users forward MLS information to customers by e-mail directly from their phones. In the near future, personal contact management tools (address book and calendar) will become available.

Other Applications

The evolution in speech technology also appears to be good news for training environments and defense departments. Adacel Inc. provides professional training products and services to the air transportation industry. It has licensed third-party voice recognition and rVoice text-to-speech technology from Rhetorical for integration into its MaxSim software used in air traffic control training programs. Simulated scenarios will cover a wide range of situations from equipment failures to emergency landings.

The United States Defense Advanced Research Projects Agency, together with the Defense Department's Language and Speech Exploitation Resources initiative, "launched two speech programs in 2002: Babylon, which aims to develop a portable speech-to-speech translation device, and the Effective, Affordable, Reusable Speech-to-Text project, for turning speech recordings into searchable digital text. The programs aim to develop and deliver improved speech transcription and translation capabilities to intelligence analysts and the military." (Ed McKenna, "Listen Up," Federal Computer Week, September 30, 2002.)

One of the goals of the Effective, Affordable, Reusable Speech-to-Text (EARS) project is to reduce the word error rate to 10% and improve foreign language speech transcriptions. "Another program requirement," according to Elizabeth Shriberg, one of the researchers in the program, "is the ability to extend this technology to other languages, starting with Arabic and Mandarin Chinese."

Babylon's goals, according to program director Kristin Precoda, include building a two-way translator, "a sort of bilingual phraselator. Questions can be asked in English, while the foreign speakers will have a limited range of answers they can say in their language for translation into English."

Two-way translation is a user benefit that IBM has moved toward. Brian Garr, program manager of Voice and Translation Servers for IBM Pervasive Computing, points out that by combining IBM's machine translation technology and ViaVoice's desktop product, "ViaVoice Translator allows users to enter text in one language and have it returned in another language (in English, French, Italian, German and Spanish) either as text or read out as speech using a Compaq/HP iPAQ PocketPC handheld device."

Innovative uses for speech applications will continue as more companies see the advantages both for internal use and for their customers. Jablakov says, "We see Web browsers that support X+V, as well as device manufacturers adding special features to hardware as their developers create great applications that are both visual and voice oriented. Basically it's following the natural evolution when different opportunities arise as a function of approved open standards."

David Shadbolt is a research editor for MultiLingual Computing & Technology.