Can the Use of Multiple Synthetic Voices Increase Persuasion?

An Experimental Test of the Multiple Source Effect

Kwan Min Lee & Clifford Nass

Department of Communication

Stanford University

Stanford, CA 94305-2050, USA

+1 650-723-5499

kmlee,

Jennifer Lai

IBM Corporation/ T.J. Watson Research Center

30 Saw Mill River Road

Hawthorne, NY. 10532, USA

+1 914-784-6515

Abstract

This study examines whether the use of multiple synthetic voices increases persuasion in testimonial advertising in the same way that the use of multiple human sources does. Participants (N = 40) heard five positive reviews of a book, read either by five different synthetic voices or by a single synthetic voice. Results showed that participants hearing multiple synthetic voices evaluated the reviewed books more positively, predicted more favorable public reactions to the books, and felt greater social presence of the virtual sources. These results support the idea that paralinguistic cues in synthetic voices influence people’s imagination of virtual speakers. The observed social responses were not eliminated when participants were explicitly exposed to the creation of various synthetic voices prior to the main experiment.

Keywords

Synthetic speech, persuasion, multiple source effect, speech systems.

Introduction

These days buying products on the Internet is not an uncommon experience for many Internet users. More than half of all Internet users (51.7%) have purchased online [18]. One reason cited for shopping on the Internet, in addition to convenience, is the availability of information about goods and services. A form of product information that is becoming more common is reviews of the particular product or service by other consumers. While today websites rely on textual presentation of the reviews, we can foresee that these will soon be available for the telephone-based web shoppers. The question arises as to what will be the ideal presentation for this information given an auditory presentation rather than a visual one.

Many forms of mediated communication such as newspapers, radio, films, TV, and computers utilize disembodied language. Disembodied language is “language that is not being produced by an actual speaker at the moment it is being interpreted” [2]. Because disembodied language is so abundant in everyday life and most people understand it with no difficulties, we tend to think that its interpretation is a trivial process that does not require further investigation.

However, Clark [2] argues that the ability to understand disembodied language is a remarkable process that only humans can handle, because imagination plays a key role in the interpretation. Through imagination, people visualize virtual speakers who have written or spoken sentences in disembodied language. In this sense, disembodied language is no more than a representation of embodied language produced by virtual speakers. It is understood in the same way we understand natural embodied language (e.g., face-to-face conversation).

There are two forms of disembodied language: written language and pre-recorded human speech[1] [2]. The two forms are clearly distinct not only because their modalities are different (visual vs. auditory), but also because they yield two psychologically different imagination mechanisms. For written language, the imagination of virtual speakers is based on an internal visualization process occurring when people read a text. If a writer is unknown to a reader, linguistic cues (e.g., word selection, verb-adjective ratio) in the writing become the basis for the imagination [7]. For pre-recorded speech, both paralinguistic cues (e.g., loudness, pitch, and speech rate) and linguistic cues affect the imagining of a virtual speaker. Paralinguistic cues are judged immediately because their judgment is evolutionarily hard-wired [10]. Immediate identification of enemies or friends had such a profound impact on survival that humans evolved to judge paralinguistic cues immediately after the start of an utterance [3]. Paralinguistic cues become more salient and influential when imagining an unknown speaker [11, 15]. For example, a listener is more inclined to imagine a speaker as an extroverted person if the speech contains paralinguistic cues such as loudness, higher pitch, and faster speech rate, which are typical of extroverted people [11, 17].

When a computer synthesizes a voice and produces disembodied speech, the speech is ontologically problematic not only because it is disembodied language but also because it is a voice from a source that is clearly not human. Even the best text-to-speech (TTS) systems have not yet achieved the quality and prosody of natural human speech. Thus one could argue that linguistic cues in synthesized speech should be the sole basis for listeners’ imagination of the virtual speaker, because the voice is synthesized from a predetermined algorithm and is disconnected from any source of the speech. In other words, the obviously synthetic nature of synthesized speech should make paralinguistic aspects of speech irrelevant to listeners’ imagination of virtual speakers (since the virtual speaker is a machine, not a human).

From another viewpoint, however, paralinguistic cues in synthesized speech should be as influential and powerful as those in human speech, because human brains have not evolved to respond to synthesized speech differently from real speech (for a similar claim, see [10]). Consequently, we can expect that paralinguistic cues will also be influential in people’s imagination of virtual speakers, even though they no longer have any value for survival.

MULTIPLE SOURCE EFFECT

One of the most powerful strategies for increasing the persuasive potential of a message is to use multiple sources. People utilize multiple sources both in embodied and disembodied communication situations. For example, it is very common to use a number of speakers supporting or opposing a candidate in political rallies. Attorneys try to present as many witnesses as possible before the jury in order to persuade them [4]. Multiple sources are also heavily utilized in mediated communication such as advertising; it is very common to present multiple endorsers of a product.

The effect of multiple sources on persuasion has been well documented in social psychology [4, 5] and in marketing [8, 9]. Based on the elaboration likelihood model of persuasion [13, 14], Harkins and Petty theorized and tested the mechanism of the multiple source effect in the context of attitude change. Moore and Reardon [8] retested the theory in a print advertising context. Their results are summarized as follows:

  1. Multiple sources have more persuasive power than a single source if each source provides a convincing argument.
  2. Multiple sources have more persuasive power than a single source if each source provides a different argument.
  3. Multiple sources have more persuasive power than a single source if each source is perceived as an independent source.
  4. Actual exposure to multiple sources and multiple arguments has more persuasive power than mere knowledge of the existence of multiple sources and multiple arguments.

Paradoxically, the heaviest users of the multiple source technique are new media designers. In electronic commerce, programmers and designers rely heavily on multiple sources in order to promote their products on their websites. While traditional media (e.g., magazines and newspapers) often employ celebrities for product endorsements, e-commerce sites rely more heavily on testimonials from "people-like-you." There are three main reasons for this. First, the interactive nature of the web makes it very easy for consumers to provide their comments about a product, and for designers to save and present the expressed opinions. Secondly, e-retailers often don't have deep enough pockets to cover the high cost of hiring celebrities. Even when they can afford to hire celebrities, it is impractical to promote the wide variety of products sold by a single site. Lastly, the credibility of celebrity endorsers could be questioned, because modern consumers are keenly aware that celebrities endorse products primarily for money [19].

As a result, it is common for e-commerce sites to provide consumer reviews about a product. It is believed that, when evaluating an unknown product, consumers draw on the comments of other customers who have already experienced it. The more positive reviews a product receives from multiple sources, the more likely consumers are to evaluate the product positively [8].

In traditional GUI-based e-commerce sites, only linguistic cues are utilized to manifest multiple sources. That is, designers present multiple reviews from multiple sources by sorting each review under a real name ("Tom from New York") or an anonymous name with a little geographical information ("a customer from Palo Alto, CA"). Users also recognize multiple sources by noticing different linguistic styles (e.g., content, grammar, word choice) in each review.

As technologies such as TTS and speech recognition become more mature and business environments become increasingly mobile thanks to the widespread use of cell phones, new media designers can manipulate both linguistic and paralinguistic cues to maximize the multiple source effect. It is a simple matter for voice user interface (VUI) designers to manipulate the parameters of a TTS system to affect the paralinguistic cues of the voice. While there is no reason to doubt that the use of multiple prerecorded human voices will increase the impact of the multiple source effect, it is unclear whether the same holds for multiple synthetic voices. The lack of understanding of people's social responses to multiple synthetic voices presents a problem to designers of new technology.

The two experiments described in this paper attempt to address the research question of whether the use of multiple synthetic voices engenders the multiple source effect. Additionally, since the multiple source effect depends on whether users perceive each source as independent, the findings would be informative for determining whether paralinguistic cues in synthetic speech can influence a listener’s image of the virtual speaker.

SINGLE TTS VOICE vs. MULTIPLE TTS VOICES

In order to provide empirical answers to these issues, we performed two different experiments. In the first experiment, we tested whether the use of different synthetic voices contributes to the imagination of multiple sources. To understand the process behind the multiple source effect, we used the concept of social presence. Social presence is defined as the simulation of another intelligence [1]. In other words, it is the sense that another intelligent being co-exists and is interacting with one in a virtual or imagined environment [6]. If people have a greater feeling of social presence when they hear multiple synthetic voices (vs. hearing just a single synthetic voice), it would support the proposition that paralinguistic cues in synthetic speech influence the imagination of virtual speakers.

We ran both experiments in a realistic testimonial advertising context. More specifically, we tested whether the narration of different testimonials using different synthetic voices (one voice for each testimonial) has a greater persuasive impact than the narration of those same testimonials with a single synthetic voice.

The experiment was executed in the context of a book-buying web site that presents customers with reviews of a book. The experimental website listed three different books, all on the same web page. The visual interface was similar to the one used by Amazon.com. The page included the text for the titles, the authors' names, and pictures of the books. Instead of having customer reviews for the book in text, there was a link to an audio (.wav) file. Clicking on the link would play the reviews. Each book had five different testimonial reviews, all of them positive.

Hypotheses

Our hypothesis was that people would feel a greater social presence of different sources when they heard book reviews read by multiple synthetic voices than when they heard them read by a single synthetic voice. This was based on the social presence literature [1] and the media equation [16].

If the use of multiple synthetic voices positively affects the imagination of multiple independent sources, then one should observe a stronger persuasion effect when using multiple synthetic voices. Thus, we further hypothesized that people would evaluate books more positively when the reviews were presented with five different synthesized voices.

Because people can recognize the persuasive power of entities that are present, we hypothesized that people would expect others’ evaluation of books to be more positive when they heard reviews via multiple synthetic voices than via a single synthetic voice [see 12 for a distinction between the judgment of one’s own opinion and the assessment of others’ opinions].

Finally, we expected that the multiple source effect would influence not only the evaluation of the reviewed books but also the judgment of the experimental website, as the source of the reviews. That is, with the same visual interface, a website providing multiple synthetic voices would be evaluated more positively and be regarded as more credible than one providing a single synthetic voice.

Method

Participants

A total of 40 adults participated in the experiment. Participants were randomly assigned to either the single synthetic voice condition or the multiple synthetic voices condition. Gender was fairly evenly balanced across conditions (9 men & 11 women). All participants signed informed consent forms and were debriefed at the end of the experiment session.
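The random assignment described above can be sketched as follows. This is a minimal illustration and not the authors' actual procedure: it assumes equal group sizes, and the function name `assign_conditions`, the condition labels, and the seed are hypothetical.

```python
import random


def assign_conditions(participant_ids,
                      conditions=("single_voice", "multiple_voices"),
                      seed=None):
    """Randomly assign participants to conditions.

    Assumption: groups are of equal size (the paper does not state this).
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    if len(ids) % len(conditions) != 0:
        raise ValueError("participants must divide evenly among conditions")
    rng.shuffle(ids)  # random order, then split into equal blocks
    per_group = len(ids) // len(conditions)
    return {pid: conditions[i // per_group] for i, pid in enumerate(ids)}


# Hypothetical participant IDs 1..40; the seed only makes the sketch repeatable.
assignment = assign_conditions(range(1, 41), seed=0)
```

Shuffling first and then splitting into blocks guarantees equal group sizes, which independent coin flips per participant would not.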

Procedure

The experiment was a two-group between-subjects design, with sets of five customer reviews per book as a repeated factor. Participants logged on to the experimental website for their condition and provided their responses through mouse clicks, in exactly the same way as they would normally do in everyday web surfing. All participants used the Internet Explorer 4.0 (or higher) browser in order to ensure the same graphic environment across conditions. As noted earlier, each book review page consisted of a picture of the book, a title, author names, and a .wav file of synthesized speech. The customer reviews were edited versions of actual customer reviews on the Amazon.com site. The books and their authors were selected based on low sales so that the participants would not be familiar with the books. Lack of familiarity was verified by a question asked at the end of the experiment. All books were fiction to avoid bias based on users’ general knowledge about various topics.

Sample reviews for a book used in the experiment:

  1. I loved this book. I stayed up all night reading it because I cared so much about the people, I could not put it down. I haven't done that since I was a child. After a lifetime of reading, I'm aware how rare this kind of experience is. And the reason? Kent Haruf's honesty, skill, and compassion as a writer.
  2. The pace of the story mimics that of the small town it takes place in. The characters are richly drawn, but not caricatures. Not a lot happens, but I believe that's the point. There is no urgency to the story, and I liked it that way. The author, like James Lee Burke, has an affection for beauty of the land. Reading this book is like taking a stroll down the main street of Holt County, in which the story takes place. Highly recommended.

Below the icon for the audio file, there was a questionnaire regarding the book being reviewed and the review itself. Subsequent book reviews and questionnaires were placed sequentially on the web page. In both conditions there was only one audio file. With the exception of the number of voices used in the audio file, the visual layout, textual information, and book review content were identical across conditions. After hearing the reviews for all three books, the participants were presented with a final set of questions with regard to their evaluation of the website and their experience. Finally, all participants were debriefed and thanked.

Manipulation

The CSLU Toolkit and Bell Labs' TTS engine were used to produce five different synthesized voices. Three voice types were created with the CSLU Toolkit and two with the TTS engine from Bell Labs. The different voices were created by changing voice parameters (e.g., fundamental frequency, speech rate, pitch range) in each TTS engine. Pre-tests were conducted with six independent coders to ensure that each voice was distinctive enough to be separately identified. In the one-voice condition, one of the five voices was randomly assigned to each participant. In the multiple voice condition, the order of presentation of the five voices was balanced to control for possible order effects.
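One way to realize the voice assignment described above is sketched below. The paper does not say which balancing scheme was used, so the cyclic rotation (a Latin square) is an assumption, and the voice labels and function names are hypothetical.

```python
import random

# Hypothetical labels for the five synthesized voices.
VOICES = ["voice_1", "voice_2", "voice_3", "voice_4", "voice_5"]


def rotated_orders(voices):
    """Cyclic Latin square: every voice occupies every position exactly once."""
    n = len(voices)
    return [[voices[(start + k) % n] for k in range(n)] for start in range(n)]


def voice_plan(participant_index, condition, rng):
    """Return the voice used for each of the five reviews of a book."""
    if condition == "single":
        # One randomly chosen voice reads all five reviews.
        return [rng.choice(VOICES)] * len(VOICES)
    if condition == "multiple":
        # Assumed scheme: participants cycle through the rotated orders.
        orders = rotated_orders(VOICES)
        return orders[participant_index % len(orders)]
    raise ValueError(f"unknown condition: {condition}")


rng = random.Random(0)  # seed chosen only to make the sketch repeatable
orders = rotated_orders(VOICES)
plan_multiple = voice_plan(3, "multiple", rng)
plan_single = voice_plan(3, "single", rng)
```

A cyclic rotation balances the position of each voice but not which voice precedes which; a fully counterbalanced design would need more orders.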

Measures

All dependent measures were derived from questionnaire data. Participants used radio buttons to indicate their responses. Each question was answered with an independent, ten-point Likert scale.

Four questions concerning the participant’s personal opinion of the book were asked for each of the three reviewed books:

  1. How likely would you be to recommend this book to your friends?
  2. How much would you enjoy reading this book?
  3. How would you judge the quality of this book?
  4. How likely would you be to buy this book if you were going to buy a novel?

The answers to these questions were combined to create a “personal opinion” index.
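A sketch of how such an index might be computed is given below. The paper says only that the answers were "combined", so averaging is an assumption, as are the 1 to 10 scale endpoints, the function name, and the ratings, which are invented for illustration.

```python
from statistics import mean


def personal_opinion_index(ratings_by_book):
    """Average the four item ratings per book, then average across books.

    Assumptions: averaging is the combination rule, and the ten-point
    scale runs from 1 to 10.
    """
    per_book = {}
    for book, ratings in ratings_by_book.items():
        if len(ratings) != 4:
            raise ValueError(f"{book}: expected four item ratings")
        if any(not 1 <= r <= 10 for r in ratings):
            raise ValueError(f"{book}: ratings must lie between 1 and 10")
        per_book[book] = mean(ratings)
    return per_book, mean(per_book.values())


# Invented ratings for one hypothetical participant (not data from the study).
example = {
    "book_a": [7, 8, 6, 7],
    "book_b": [5, 6, 5, 4],
    "book_c": [9, 8, 9, 10],
}
per_book, overall = personal_opinion_index(example)
```

With these invented ratings, the per-book scores are 7, 5, and 9, and the overall index is 7.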