by Fred Hapgood


Oct 15, 2001

In the early ’90s we stuck a fork, editorially speaking, in speech recognition and declared it done. In three articles on the technology (February 1991, April 15, 1992 and Nov. 1, 1993), each more enthusiastic than the last, we said that speech recognition was about to hit the market, and that the market was going to notice when it did.

The reason for our excitement was that the technical problems that had bottled the technology up in labs for decades were finally being addressed. The most basic issue had been associating specific vocalizations with specific phonemes. (Phonemes are the basic units of speech, such as the “wuh” sound in “one.”) Making the associations required compiling huge databases of how the more than 40 English-language phonemes are spoken by people of different ages, sexes and linguistic cultures, and under different phone-line conditions. Developers then had to write programs that could find the degree of fit between a given user’s vocalization and one of those samples.
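
To make the "degree of fit" idea concrete, here is a minimal sketch in Python. It is not how any particular product works: it assumes each vocalization has already been reduced to a short list of numbers, and the sample values, the `SAMPLES` table and the `best_fit` function are all invented for illustration. The fit is measured as plain straight-line distance, which is only one of many possible measures.

```python
import math

# Hypothetical reference samples: phoneme label -> stored feature vectors.
# Real systems use far richer acoustic features; these numbers are made up.
SAMPLES = {
    "w": [(1.0, 2.0), (1.2, 1.8)],
    "b": [(4.0, 0.5), (3.8, 0.7)],
    "s": [(0.2, 5.0), (0.4, 4.6)],
}

def best_fit(vocalization):
    """Return (phoneme, distance) for the closest stored sample."""
    best = None
    for phoneme, vectors in SAMPLES.items():
        for vector in vectors:
            # Smaller distance means a better fit (an assumed measure).
            d = math.dist(vocalization, vector)
            if best is None or d < best[1]:
                best = (phoneme, d)
    return best

phoneme, score = best_fit((1.1, 1.9))
print(phoneme, round(score, 3))
```

The more speakers and line conditions the stored samples cover, the more likely it is that some sample sits close to a new user's voice, which is why the databases had to be so large.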

Once software recognizes a phoneme, the sound has to be assigned to a meaning (unless the product in question is a simple dictation engine for directly recording and transcribing speech). That meant building more databases, this time of all the ways humans might express meanings of interest, such as yes, sure, right, correct, yep, yup, yeah, uh-huh, fine, OK, affirmative, good and so on. Since the culture was throwing off new expressions constantly (whatever, no doubt, word), the programs also needed to be easy to update in the field. Then grammar algorithms needed to be written and hardware developed that was fast enough to do all that computation in real time, yet cheap enough for ordinary businesses to buy.
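
A rough sketch of such a meaning database, again in Python, might look like the following. The `PHRASES` table, `interpret` and `add_expression` are hypothetical names, and the entries under "no" are assumptions added only so the table has a second meaning. The point is that new slang can be added while the program is running, without touching the lookup logic.

```python
# Hypothetical phrase table: meaning -> expressions that signal it.
PHRASES = {
    "yes": {"yes", "sure", "right", "correct", "yep", "yup", "yeah",
            "uh-huh", "fine", "ok", "affirmative", "good"},
    "no": {"no", "nope", "negative"},  # assumed entries, not from the article
}

def add_expression(meaning, expression):
    """Register a new expression in the field, without changing any code."""
    PHRASES.setdefault(meaning, set()).add(expression.strip().lower())

def interpret(utterance):
    """Return the meaning of an utterance, or None if it is not in the table."""
    text = utterance.strip().lower()
    for meaning, expressions in PHRASES.items():
        if text in expressions:
            return meaning
    return None

print(interpret("Yup"))       # a known expression
print(interpret("word"))      # not yet in the table
add_expression("yes", "word")
print(interpret("word"))      # recognized after the update
```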

By the early ’90s all those pieces were falling into place or were close to doing so, and we felt the implications were significant. A wide range of enterprise functions, from order entry to customer support to incident and inspection reporting, were about to get cheaper and easier to perform. Computation and telephony were going to merge. We were all going to get out of voice mail jail. “A decade from now perhaps speech recognition will be as ubiquitous as voice-messaging is today,” we concluded.

While considerable progress has been made since then, we are not there yet. Were we wrong in our judgment of what had been achieved technically? Probably not. As a volunteer subject in a university speech recognition project, I can testify firsthand that the technology was indeed about where we claimed.

It turned out, however, that speech recognition is only partly a technical problem. Recognition implies a conversation, and conversations make sense only in the context of relationships. When humans enter relationships they immediately impose a structure of assumptions and expectations. Is the person smart? Knowledgeable? Nice? Lazy? Snobbish? That structure controls the interaction. If a comprehension problem comes up during a conversation with a smart person we assume we are at fault and take on the responsibility of working it out. We do the same if we think our respondent is not too bright but basically nice. On the other hand, if we think the other party is lazy, doesn’t care or worse, is trying to manipulate us, we behave very differently.

Those relationship issues are just as important when talking with machines as with people; even more so, since most users were and are uncertain about how to talk to software. “Suppose you said you wanted to go to Boston and you heard the reply, ‘I don’t understand,’” says William Meisel, president of TMA Associates, a speech recognition consultancy in Tarzana, Calif. “This was a common response at the time. But what didn’t [the computer] understand? Was it your pronunciation? Usage? The logical thread? You didn’t know.”

What you did know was that the program refused to give help when you needed it. This refusal became a cue in and of itself, a sign that the machine planned to shift all the work of the conversation onto the user. Humans reacted to that the same way they would have in a conversation, with resentment and irritation. They raised their voices and sounded out words as if they were speaking to a child. Their voices became stressed. They changed their pitch. They started to swear. This would confuse the program even more, until eventually users hung up with a bang.

Today the industry understands the importance of giving the user as much help as possible, Meisel concludes, which might mean building another database (this time of the most common errors) and writing prompts that suggest specific solutions to problems. For example, the computer could ask, “Do you want Austin or Boston?” This does more than locate the problem as a pronunciation issue; it reassures the user that the program has the smarts to understand the situation and is willing to help the speaker solve it, which in turn makes users more disposed to working with the program.
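
One way to picture that error database is as a table of words that are easily mistaken for one another. The Python sketch below is an illustration under stated assumptions, not a description of any vendor's product: it assumes the recognizer reports a best guess along with a confidence score between 0 and 1, and the `CONFUSABLE` table, the `clarification_prompt` function and the 0.8 threshold are all invented.

```python
# Hypothetical table of city names that are often mistaken for one another.
CONFUSABLE = {
    "austin": "boston",
    "boston": "austin",
}

def clarification_prompt(best_guess, confidence, threshold=0.8):
    """Return a targeted question, or None if the guess can be trusted."""
    if confidence >= threshold:
        return None
    rival = CONFUSABLE.get(best_guess.lower())
    if rival is None:
        # No known confusion, so fall back to a general but still polite prompt.
        return "Sorry, which city did you want?"
    first, second = sorted([best_guess.title(), rival.title()])
    return f"Do you want {first} or {second}?"

print(clarification_prompt("Boston", 0.55))
```

The contrast with a bare "I don't understand" is that the question itself tells the user what went wrong and how to fix it.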

Mike Phillips, CTO of SpeechWorks International, a speech recognition product and services vendor in Boston, offers many other examples. Answering a question with “unauthorized request” might work on a webpage, for instance, but in the context of speech it communicates a haughty indifference. A better answer would be, “I’m sorry, my supervisor doesn’t allow me to make that transaction” in a sympathetic, you-and-me-against-the-world tone of voice. The old way of saying that a database is down was simply, “That database is down.” That might work as an error message onscreen, but in the context of speech a better way might be to sigh and say, “I’m sorry, the system is giving me a lot of trouble right now.”

Speech recognition systems used to ask people to “Wait for the prompt to finish before speaking.” That made the technology easier to implement, but it communicated a snippy insistence on privilege and hierarchy that annoyed users. Today most speech recognition software is “barge-in enabled,” which means that speech recognition programs defer to users whenever they interrupt. “The point is to keep assuring the user that the system is on her side,” Meisel says.

During the past few years the underlying technology has continued to improve. (Meisel estimates that the error rate in phoneme recognition falls by about 30 percent a year.) The technology is now in the peculiar position of outrunning expectations, says Phillips. Good speech recognition is perfectly capable of handling a complete sentence, such as “I want to take the red-eye from Boston to Austin a week from Wednesday,” but most users still want a highly structured interaction that prompts for each element of the transaction. Phillips’s hope is that as their experience with speech recognition applications grows, users will relax and conversations will get more ambitious and wide-ranging. But whatever happens, the programs will always be very, very nice.
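
The difference between the two styles of interaction can be sketched in a few lines of Python. This is a toy under loud assumptions: it handles only the origin and destination (not the date or the flight type), it relies on a simple "from X to Y" pattern rather than real language understanding, and `SLOTS`, `parse` and `next_prompt` are hypothetical names.

```python
import re

# The pieces of the transaction this toy example knows about (an assumption).
SLOTS = ("origin", "destination")

def parse(utterance):
    """Pull whichever slots can be found out of a single utterance."""
    found = {}
    match = re.search(r"\bfrom (\w+) to (\w+)", utterance, re.IGNORECASE)
    if match:
        found["origin"] = match.group(1)
        found["destination"] = match.group(2)
    return found

def next_prompt(filled):
    """Return a question for the first missing slot, or None when done."""
    for slot in SLOTS:
        if slot not in filled:
            return f"What is your {slot} city?"
    return None

sentence = "I want to take the red-eye from Boston to Austin a week from Wednesday"
print(parse(sentence))    # both slots filled from one sentence
print(next_prompt({}))    # the structured style asks for one element at a time
```

A system built this way can serve both kinds of user: the one who says everything at once is never prompted, and the one who prefers to be led gets one question per element.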