Najjar, L. J., Ockerman, J. J., & Thompson, J. C. (1998). User interface design guidelines for speech recognition applications. Paper presented at IEEE VRAIS 98 workshop - Interfaces for wearable computers, Atlanta, GA.




User Interface Design Guidelines for Speech Recognition Applications

Lawrence J. Najjar, Jennifer J. Ockerman, and J. Christopher Thompson
Georgia Tech Research Institute
Multimedia Information in Mobile Environments Laboratory
GTRI/EOEML
575 14th Street
IPST Building
Atlanta, GA 30332-0823 USA
Telephone: (404) 894-3412
Facsimile: (404) 894-8051
gt4708d@prism.gatech.edu, jojo@chmsr.gatech.edu, chris.thompson@gtri.gatech.edu

Abstract

Computers make marked improvements each year. They get faster, more powerful, and cheaper. However, one area where computer improvement is surprisingly slow is the user interface. The windows style of user interface is over a decade old now (e.g., Morgan, Williams, & Lemmons, 1983; Yasaki, 1983; Fuerst, 1985; Johnson et al., 1989) and the next generation of user interface is overdue.

What will the user interface of the future be like? Since it offers a more natural, familiar way to communicate, some people (e.g., Freed, 1997; Newsome, 1997; Slater, 1997) believe that the next generation of computer user interfaces will involve speech recognition. Instead of rolling a mouse around and clicking buttons, we will simply tell our computers what to do. This prediction may be accurate. Several reasonably-priced speech recognition products are now available for personal computers (e.g., Brisbin, 1997; Feibus, 1997; Girard & Dillon, 1997; Triverio, 1997).

However, this new technology presents major, new user interface design challenges. For example, how do we improve the likelihood that a spoken input will be understood by the computer? How do we design a speech-driven user interface that is easy to learn and use?

This paper describes speech recognition user interface design guidelines that are based on a review of the available literature and on the authors’ experience building these interfaces (e.g., Najjar et al., 1996; Ockerman, Najjar, & Thompson, 1996; Najjar, Thompson, & Ockerman, 1997). The purpose of the guidelines is to increase the effectiveness of speech-driven user interfaces. Since every speech recognition application is different, these guidelines are somewhat general. The needs of a particular application’s users should always override the guidelines.

Introduction

A speech-driven user interface is affected by both software and hardware, so the guidelines that follow are organized into three major categories: general, software, and hardware.

General

The following three guidelines describe, in general terms, when and how to develop a speech recognition application.

Use speech when the user’s eyes and hands are busy or when the user is moving.

Some user situations are better suited to speech recognition input than others. Situations that might benefit from speech recognition input include those in which the user’s eyes and hands are busy (e.g., emergency room doctors, package sorters) or in which the user is mobile (e.g., food processing plant quality assurance inspectors; Ockerman, Najjar, & Thompson, 1996; Najjar, Thompson, & Ockerman, 1997) (Simpson et al., 1985; Jones, Frankish, & Hapeshi, 1992; Helander, 1993; Peckham, 1994). Situations that might not be good matches for speech input include those in which the user is stationary and not alone (e.g., the user sits at a desk with other workers in hearing range) or in which the work environment is extremely loud (e.g., aircraft repair workers; Siegel & Bauer, 1997).

Train the speech recognition system in the user’s work environment.

To significantly improve recognition accuracy, train the system on the user’s voice in the same sound environment in which the user will work (Simpson et al., 1982; National Research Council, 1984; Waterworth, 1984; Mane et al., 1985, 1996). This technique increases the likelihood that the trainee will speak in a style similar to the one used when working (Schwab, Ball, & Lively, 1994), allows the recognition system to adapt to ambient noise (Schwab, Ball, & Lively, 1994), and increases the degree of match between the speech input expected by the recognition system and that used when the user is working. The representativeness of the training voice input and the system recognition accuracy may also be improved by training the repetitions of each spoken command in a random order, rather than a sequential order (Poock, 1981).
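
The random-order training idea can be sketched in a few lines of Python. This is only an illustration, not the procedure used by Poock (1981) or by any particular recognizer: the function name training_prompts, the seed parameter, and the example commands are all assumptions made for the sketch.

```python
import random


def training_prompts(commands, repetitions, seed=None):
    """Return each command `repetitions` times, in a shuffled order.

    Hypothetical helper: a real recognizer would supply its own
    enrollment routine; this only decides the prompting order.
    """
    prompts = [command for command in commands for _ in range(repetitions)]
    rng = random.Random(seed)  # a fixed seed makes a session reproducible
    rng.shuffle(prompts)
    return prompts


if __name__ == "__main__":
    for prompt in training_prompts(["Next page", "Previous page", "Quit"], 3, seed=1):
        print("Please say:", prompt)
```

Shuffling the whole list, rather than prompting for all repetitions of one command before moving to the next, keeps the user from settling into a rhythm on any single command.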

Iteratively evaluate and re-design the speech recognition application.

To improve the speech recognition user interface, have representative users perform typical tasks, preferably in the target work environment. Measure the users’ performance using metrics such as speed of performance, rate of errors, subjective satisfaction, time to learn, and retention over time (Damper, 1993, citing Shneiderman, 1987). Identify and improve trouble spots, then repeat the process until the performance measurements are satisfactory (Mane et al., 1996).
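
As a rough sketch of how two of these metrics might be tracked from one design iteration to the next, the Python fragment below summarizes a set of test trials and compares the summary with target values. The Trial record, its field names, and the target thresholds are hypothetical; the guidelines do not prescribe specific values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    seconds: float  # time the user needed to complete the task
    inputs: int     # speech inputs attempted during the task
    errors: int     # inputs that were misrecognized or rejected


def summarize(trials):
    """Compute mean task time and overall recognition error rate."""
    total_inputs = sum(t.inputs for t in trials)
    total_errors = sum(t.errors for t in trials)
    return {
        "mean_seconds": mean(t.seconds for t in trials),
        "error_rate": total_errors / total_inputs if total_inputs else 0.0,
    }


def meets_targets(summary, max_seconds, max_error_rate):
    """True when this iteration is good enough to stop re-designing."""
    return (summary["mean_seconds"] <= max_seconds
            and summary["error_rate"] <= max_error_rate)
```

Subjective satisfaction, time to learn, and retention would need their own instruments (e.g., questionnaires and follow-up sessions) and are left out of the sketch.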

Software

The following 10 guidelines are related to the software portion of the speech recognition user interface.

Keep the number of words in the speech recognition vocabulary small.

There are several reasons for limiting the number of words that need to be recognized by the system (Simpson et al., 1985; Jones, Frankish, & Hapeshi, 1992; Helander, 1993; Peckham, 1994). Since there are fewer input candidates for the system to examine, a small vocabulary may improve recognition speed and accuracy (e.g., Simpson et al., 1985). Also, a smaller vocabulary can simplify the user interface for the user and usually shortens the time it takes to train a speaker-dependent system on the user’s voice.

Keep each speech input short.

Relatively short speech inputs (e.g., one three-syllable word or two two-syllable words) shorten the time it takes to train the user’s voice and can allow the user to enter speech inputs quickly and easily (e.g., Baber & Stammers, 1989). However, to maintain recognition accuracy, these short inputs need to sound distinctly different from each other. This need for discriminability may force you to use longer speech inputs in your recognition vocabulary.

Use speech inputs that sound distinctly different from each other.

Speech inputs that sound different from each other improve the system’s recognition accuracy, making the system more pleasant and productive for the user. For example, use ‘Try again’ with ‘Correct’, but don’t use ‘Incorrect’ with ‘Correct’. Use ‘Make new entry’ with ‘Delete’, but don’t use ‘Create’ with ‘Delete’. The need for input distinctiveness may require you to use longer speech inputs (e.g., words with more syllables or several words as a phrase) (Schwab, Ball, & Lively, 1994).
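
A candidate vocabulary can be screened for similar inputs before any user testing. The Python sketch below is a crude illustration that compares spellings with the standard library’s difflib module. Spelling similarity is only a stand-in for acoustic similarity, and the function name confusable_pairs and the 0.7 threshold are assumptions. A more faithful check would compare phoneme sequences or use confusion data from the recognizer itself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(vocabulary, threshold=0.7):
    """Return pairs of inputs whose spellings are suspiciously similar.

    Assumption: similar spelling hints at similar sound. This is a
    rough proxy, not a measure of acoustic confusability.
    """
    pairs = []
    for first, second in combinations(vocabulary, 2):
        score = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if score >= threshold:
            pairs.append((first, second, score))
    return pairs


if __name__ == "__main__":
    for first, second, score in confusable_pairs(["Correct", "Incorrect", "Try again"]):
        print(f"Review: '{first}' and '{second}' (similarity {score:.2f})")
```

With the default threshold the sketch flags ‘Correct’ with ‘Incorrect’ but does not flag ‘Create’ with ‘Delete’, which shows why a spelling check can only supplement, and not replace, listening tests with the actual recognizer.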

Provide immediate feedback for every speech input.

To provide a sense of control and to assure the user that the system is working, the system should give some kind of immediate, obvious, and consistent feedback for every speech input (Poock, Martin, & Roland, 1983; Schurick, Williges, & Maynard, 1985; Simpson et al., 1985). If the user’s eyes are not occupied and the system’s user interface includes a computer display, consider making an obvious change to the display after every user input. For example, when the user says ‘Next page’, immediately make the system go to the next page in the application. Do not simply display the entry that was recognized by the speech system. This kind of feedback is often not necessary and can slow down the user (Mountford et al., 1982; Murray et al., 1993).

You can use auditory feedback to help get the user’s attention. If the environment is not noisy and the system interface includes an earphone or speaker, consider presenting a short beep for unrecognized inputs or user interface errors. Also, consider providing a distinct beep when the user needs to confirm the deletion of information (Murray, Jones, & Frankish, 1996). This technique reduces the chance that the user will unintentionally delete information. As with display feedback, avoid repeating back via sound each word or phrase recognized by the speech system. This auditory feedback technique can slow down the user (Jones, Frankish, & Hapeshi, 1992; Karis & Dobroth, 1994) and actually increase errors (Schurick et al., 1985, citing Martin & Welch, 1980; Hapeshi, Hudson, & Jones, 1988). Don’t use auditory feedback in a way that annoys users (Jones, Hapeshi, & Frankish, 1989). For example, don’t present a beep each time the system is ready for the next speech input. Over time, users find this kind of prompt very annoying (e.g., Schwab, Ball, & Lively, 1994). Finally, don’t use auditory feedback when it may disturb other people.

Keep the user interface simple.

A well-designed, simple user interface is easy to learn and use. There are many specific ways to simplify a speech recognition user interface. Try to design the user interface so that its operation is obvious. For example, use speech commands that match user expectations (Waterworth, 1984). As much as possible, use the same speech commands throughout the application (e.g., ‘Next Page’, ‘Previous Page’, ‘Main Menu’, ‘Quit’). Display only the commands (e.g., ‘buttons’ at the bottom of the screen) that are currently available. Minimize the need for the user to navigate through the application. Do this by putting related information together and designing the application so it automatically performs required navigation commands. For example, the authors developed a speech recognition application that automatically moves the cursor to the next field when the current field gets filled and automatically displays the next page when the last field on the current page gets filled. Also, use items such as screen and section titles so that users always know where they are in the application.
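
The automatic navigation described above can be sketched as a small data-entry form. The Python class below is a hypothetical reconstruction for illustration only; it is not the code of the authors’ application, and the class, method, and field names are assumptions.

```python
class Form:
    """Pages of named fields; the cursor advances on its own as fields fill."""

    def __init__(self, pages):
        self.pages = pages  # list of pages, each a list of field names
        self.values = {}
        self.page = 0
        self.field = 0

    def current_field(self):
        return self.pages[self.page][self.field]

    def enter(self, value):
        """Store a recognized value, then move to the next field or page."""
        self.values[self.current_field()] = value
        if self.field + 1 < len(self.pages[self.page]):
            self.field += 1   # more fields remain on this page
        elif self.page + 1 < len(self.pages):
            self.page += 1    # last field on the page was filled
            self.field = 0
        # otherwise stay on the final field of the final page


if __name__ == "__main__":
    form = Form([["Lot number", "Weight"], ["Inspector"]])
    for spoken in ["42", "3.5", "Lee"]:
        print("Filling", form.current_field(), "with", spoken)
        form.enter(spoken)
```

Because every entry moves the cursor, the user never has to say a navigation command during normal data entry; commands such as ‘Previous field’ are needed only for corrections.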

Make error correction intuitive.

Design the speech recognition user interface so that error correction is simple and obvious. Make it easy for users to go back to correct recognition errors (Waterworth, 1984; Simpson et al., 1985). Although the approach is not necessarily effective, users prefer to correct recognition errors by repeating the speech entry rather than using the system’s error correction techniques (Baber, Stammers, & Usher, 1990; Leggett & Baber, 1992). Try to accommodate this preference when you develop the error correction function. Since the ‘undo’ command can be confusing (e.g., ‘Can you undo an undo?’), avoid providing a function to ‘undo’ the last command. Users may not be sure what is being undone and how many previously entered commands they can undo. Instead, use simple, common, explicit commands such as ‘Previous field’ or ‘Previous page’ to allow users to go back and correct a recognition error. If the recognition system is confused between two possible matches, consider providing both to the user and letting the user select the correct match (Peckham, 1994). Finally, make the user confirm risky entries such as the deletion of important information.
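
Two of these suggestions, offering both candidates when the recognizer is unsure and confirming risky entries, can be sketched as a single decision function. The Python fragment below assumes that the recognizer returns scored candidate phrases; the function name interpret, the score margin of 0.1, and the contents of RISKY_COMMANDS are all hypothetical.

```python
RISKY_COMMANDS = {"Delete record"}  # hypothetical list of destructive commands


def interpret(candidates, margin=0.1):
    """Decide what to do with scored recognition candidates.

    `candidates` is a list of (phrase, score) pairs in which a higher
    score means a better match. Returns an (action, phrases) pair.
    """
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    if not ranked:
        return ("reject", [])
    best_phrase, best_score = ranked[0]
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        # Too close to call: show both and let the user pick.
        return ("choose", [best_phrase, ranked[1][0]])
    if best_phrase in RISKY_COMMANDS:
        return ("confirm", [best_phrase])
    return ("accept", [best_phrase])
```

For instance, two candidates scored 0.90 and 0.85 would be presented to the user as a choice, while a clearly recognized ‘Delete record’ would still require confirmation before anything is removed.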

Avoid modes.

A mode is a system state that changes the effect of a user action, so that the same action has different effects in different modes (Mayhew, 1992). A very simple example of a mode is the keyboard ‘caps lock’ mode. The same user action (such as typing a letter) has a different effect (the letter appears capitalized) than it has in the typical keyboard mode (the letter appears in lower case). For speech recognition applications, don’t require the user to enter a ‘data input mode’ when entering information and a ‘command mode’ when entering commands. Users often get confused about which mode they are in (e.g., a user may try to enter a command when the system is in the ‘data input mode’ or a user may try to enter data when the system is in the ‘command mode’) (e.g., Norman, 1981, 1983; Thimbleby, 1982).

Don’t use speech to position objects.

Speech recognition is very inefficient for moving objects, such as a cursor, on a computer display (Jones, Hapeshi, & Frankish, 1989; Murray et al., 1993; Peckham, 1994). This type of user interface can force users to enter speech inputs such as, ‘Right’, ‘Right’, ‘Right’, ‘Up’, ‘Up’, ‘Up’. Using speech recognition to position objects is extremely slow and annoys users (Murray et al., 1993). This means that you should avoid using the familiar point-and-click type of user interface that is common on mouse-driven, desktop applications. Instead, use a command-based user interface.

Use a command-based user interface.

Because it forces the user to enter repetitive and annoying sequences of verbal positioning commands, avoid the common point-and-click, graphical, usually mouse-driven style of user interface. Instead, go back to the older command-based style of user interface, which usually relied on function keys, but leave out the physical function keys. Consider displaying function key buttons on the bottom of the computer screen. Label each button with a speech input that is currently available (e.g., ‘Next Page’). The user can see the commands that are currently available, and you can change a button label to inverse video when the user enters that speech command. Command-based systems like this one generally allow experienced users to perform their tasks more quickly than prompt-based, menu, or question-answering systems (Peckham, 1994).
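
A command-based screen of this kind can be sketched as a table that maps each currently available speech input to an action. The Python class below is a minimal illustration under assumed names (Screen, button_labels, handle); how the buttons are actually drawn and how inverse video is produced depend on the display toolkit and are not shown.

```python
class Screen:
    """One screen of a command-based interface (hypothetical structure)."""

    def __init__(self, title, commands):
        self.title = title
        self.commands = commands  # spoken phrase -> handler function
        self.highlighted = None   # label to draw in inverse video

    def button_labels(self):
        """Labels to display along the bottom of the screen."""
        return list(self.commands)

    def handle(self, phrase):
        """Run an available command; ignore anything that is not offered."""
        handler = self.commands.get(phrase)
        if handler is None:
            return False  # caller may give error feedback, e.g., a beep
        self.highlighted = phrase
        handler()
        return True


if __name__ == "__main__":
    screen = Screen("Inspection", {
        "Next Page": lambda: print("Showing next page"),
        "Main Menu": lambda: print("Showing main menu"),
    })
    print("Available:", screen.button_labels())
    screen.handle("Next Page")
```

Because the button labels and the recognizer’s active commands come from the same table, the display cannot advertise a command that the screen does not accept.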

Allow users to quickly and easily turn the speech recognizer off and on.

For example, design the user interface so the user says ‘Stop listening’ to turn off the speech recognizer and ‘Activate listening’ to turn it back on. This feature reduces unwanted entries when the user talks to colleagues.
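
One way to sketch this feature is as a small filter that sits between the recognizer and the rest of the application. The class name ListeningGate and its filter method are hypothetical; only the two spoken phrases come from the example above.

```python
class ListeningGate:
    """Pass recognized phrases through only while listening is active."""

    STOP = "Stop listening"
    START = "Activate listening"

    def __init__(self):
        self.listening = True

    def filter(self, phrase):
        """Return the phrase to act on, or None if it should be ignored."""
        if phrase == self.STOP:
            self.listening = False
            return None
        if phrase == self.START:
            self.listening = True
            return None
        return phrase if self.listening else None
```

While the gate is closed, the only phrase that has any effect is ‘Activate listening’, so conversation with colleagues cannot produce entries.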

Hardware

The following four guidelines are related to the hardware portion of the speech recognition user interface.

Use a highly directional, noise-canceling microphone.

A highly directional microphone only accepts inputs from a very narrow direction, such as the user’s mouth. Noise canceling identifies and eliminates constant noises in the sound environment, such as the hum of a large machine. A highly directional, noise-canceling microphone minimizes irrelevant inputs from the user’s environment (e.g., machines, conversations) and increases speech recognition accuracy (e.g., Peckham, 1994). Also, it is helpful to use an adjustable microphone that allows users to position the microphone close to their mouths. The authors used a highly directional, noise-canceling microphone to successfully test a speech-driven system in a very loud, 90 dB noise environment.

Consider using headphones or an earphone (versus a speaker) for auditory feedback.

The sound from headphones or an earphone is less likely to disturb nearby people than the sound from a speaker. Headphones can also reduce the amount of interfering noise from the environment (e.g., Peckham, 1994). One disadvantage of this hardware option is that it requires you to purchase and maintain an additional piece of equipment.

Use full duplex audio.

Full duplex audio is a combination of hardware and software that allows the user to make a speech input while the system plays audio, such as a sound or narration. This feature allows the user to interrupt and control the speech recognition application (e.g., Mane et al., 1996). Full duplex audio also speeds up system operation and reduces user annoyance from waiting for auditory information to finish playing.

Consider providing a back-up input technique to speech.

The accuracy rates of current speech recognition technologies are far from perfect. Usually, as the number of words that have to be recognized increases, the inputs become less distinct, and the recognition accuracy decreases. Speaker-dependent speech recognition systems have higher accuracy, but these systems require each user to train the systems on his or her voice (e.g., Simpson et al., 1985). When speech recognition fails, users may not be able to perform their tasks. To keep users productive, consider providing another way to control the computer. One back-up command entry technique is to provide a cursor positioning and object selection device such as a mouse or track ball. Mobile users can use one of several small, pocket-sized, hand-held cursor positioning and selection devices. To use this back-up input technique, design your user interface so that users can make all inputs by selecting objects (such as function key buttons on the bottom of the screen). For numeric or unrestricted text entry, you may need to allow the user to display a keyboard with selectable keys.

Conclusion

Speech recognition is a promising way for users to control computer applications, especially when the users’ eyes and hands are busy or the users are mobile. However, this style of user interface presents significant, new design challenges in the areas of recognition accuracy and ease of use. To improve the chances that your speech recognition application will be effective and easy to use, the guidelines in this paper offer concrete, practical suggestions that are based on the results of empirical studies and the authors’ extensive experience.

References

Baber, C., & Stammers, R. B. (1989). Is it natural to talk to computers? An experiment using the wizard of Oz technique. In E. D. Megaw (ed.) Contemporary Ergonomics 1989, (Taylor & Francis, London), 234-239.

Baber, C., Stammers, R. B., & Usher, D. M. (1990). Error correction requirements in automatic speech recognition. In E. J. Lovesey (ed.) Contemporary Ergonomics 1990, (Taylor & Francis, London), 454-459.

Brisbin, S. (1997, March). Talking back to your Mac. MacUser, 13, 2, 127-129.

Damper, R. I. (1993). Speech as an interface medium: How can it best be used? In C. Baber and J. M. Noyes (eds.) Interactive Speech Technology, (Taylor & Francis, London), 59-71.

Feibus, A. (1997, July 21). Its master's voice. InformationWeek, 640, 55-68.

Freed, L. (1997, March 25). Future user interfaces. PC Magazine, 16, 6, 206-207.

Fuerst, I. (1985, March). Broken windows. DATAMATION, 31, 5, 46-52.

Girard, K., & Dillon, N. (1997, August 11). Market grows for voice applications. Computerworld, 31, 32, 55-56.

Hapeshi, K., Hudson, S., & Jones, D. M. (1988). Voice data-entry feedback and short-term memory. In E. D. Megaw (ed.) Contemporary Ergonomics 1988, (Taylor & Francis, London), 105-110.

Helander, M. G. (1993). Foreword. In C. Baber and J. M. Noyes (eds.) Interactive Speech Technology, (Taylor & Francis, London), ix-xii.

Johnson, J., Roberts, T. L., Verplank, W., Smith, D. C., Irby, C. H., Beard, M., & Mackey, K. (1989). The Xerox Star: A retrospective. COMPUTER, 22, 9, 11-29.

Jones, D., Hapeshi, K., & Frankish, C. (1989). Design guidelines for speech recognition interfaces. Applied Ergonomics, 20, 47-52.

Jones, D. M., Frankish, C. R., & Hapeshi, K. (1992). Automatic speech recognition in practice. Behaviour & Information Technology, 11, 109-122.

Karis, D., & Dobroth, K. M. (1994). Psychological and human factors issues in the design of speech recognition systems. In A. Syrdal, R. Bennett, and S. Greenspan (eds.) Applied Speech Technology, (CRC, Boca Raton, FL), 359-388.

Leggett, A., & Baber, C. (1992). Optimising the recognition of digits in automatic speech recognition through the use of ‘minimal dialogue.’ In E. J. Lovesey (ed.) Contemporary Ergonomics 1992, (Taylor & Francis, London), 545-550.

Mane, A., Boyce, S., Karis, D., & Yankelovich, N. (1996). Designing the user interface for speech recognition applications: A CHI 96 workshop. SIGCHI Bulletin, 28, 4, 29-34.

Martin, T. B., & Welch, J. R. (1980). Practical speech recognizers and some performance effectiveness. In W. A. Lea (ed.) Trends in Speech Recognition, (Prentice-Hall, Upper Saddle River, NJ), 39-98.

Mayhew, D. J. (1992). Principles and guidelines in software user interface design (Prentice-Hall, Upper Saddle River, NJ).

Morgan, C., Williams, G., & Lemmons, P. (1983). An interview with Wayne Rosing, Bruce Daniels, and Larry Tesler. Byte, February, 8, 2, 90-114.

Mountford, S. J., North, R. A., Metz, S. V., & Warner, N. (1982). Methodology for exploring voice-interactive avionics tasks: Optimizing interactive dialogues. In Proceedings of the Human Factors Society 27th Annual Meeting, (Human Factors Society, Santa Monica, CA) 207-211.

Murray, A. C., Jones, D. M., & Frankish, C. R. (1996). Dialogue design in speech-mediated data entry: The role of syntactic constraints and feedback. International Journal of Human-Computer Studies, 45, 263-286.

Murray, I. R., Newell, A. F., Arnott, J. L., & Cairns, A. Y. (1993). Listening typewriters in use: Some practical studies. In C. Baber and J. M. Noyes (eds.) Interactive Speech Technology: Human Factors Issues in the Application of Speech Input/Output to Computers, (Taylor & Francis, London), 99-107.

Najjar, L. J., Ockerman, J. J., Thompson, J. C., & Treanor, C. J. (1996). Building a demonstration multimedia electronic performance support system. In P. Carson and F. Makedon (eds.), Educational Multimedia and Hypermedia 1996 (June 17 - June 22), (Association for the Advancement of Computing in Education, Charlottesville, VA), 794.

Najjar, L. J., Thompson, J. C., & Ockerman, J. J. (1997). A wearable computer for quality assurance inspectors in a food processing plant. In L. Bass and A. Pentland (eds.), Digest of Papers: First International Symposium on Wearable Computers (October 13 - 14), (IEEE Computer Society, Los Alamitos, CA), 163-164.

National Research Council, Committee on Computerized Speech Recognition Technologies (1984). Automatic speech recognition in severe environments, (National Research Council, Commission on Engineering and Technical Systems, Washington, DC).

Newsome, C. (1997). Speech therapy for slow typists. PC User, 16 October, 26.

Norman, D. A. (1981). Categorizations of action slips. Psychological Review, 88, 1, 1-15.

Norman, D. A. (1983). Design rules based on analyses of human error. Communications of the ACM, 26, 4, 254-258.

Ockerman, J. J., Najjar, L. J., & Thompson, J. C. (1996). Factory automation support technology (FAST). In D. Adelson and E. Domeshek (eds.), International Conference on the Learning Sciences, 1996, (July 25 - 27) (Association for the Advancement of Computing in Education, Charlottesville, VA), 567.

Peckham, J. (1994). Behavioral aspects of speech technology: Industrial systems. In A. Syrdal, R. Bennett, and S. Greenspan (eds.) Applied Speech Technology, (CRC, Boca Raton, FL), 469-486.

Poock, G. K. (1981). To train randomly or all at once. That is the question. In Proceedings of the Voice Data Entry Systems Applications Conference (Lockheed Missiles & Space, Sunnyvale, CA).

Poock, G. K., Martin, B. J., & Roland, E. F. (1983). The Effect of Feedback to Users of Voice Recognition Equipment (NPS55-83-003), (Naval Postgraduate School, Monterey, CA).

Shneiderman, B. (1987). Designing the User Interface: Strategies for Effective Human-Computer Interaction, (Addison-Wesley, Reading, MA).

Schurick, J. M., Williges, B. H., & Maynard, J. F. (1985). User feedback requirements with automatic speech recognition. Ergonomics, 28, 1543-1555.

Schwab, E. C., Ball, C. A., & Lively, B. L. (1994). Human factors contributions to the development of a speech recognition cellular telephone. In A. Syrdal, R. Bennett, and S. Greenspan (eds.) Applied Speech Technology, (CRC, Boca Raton, FL), 445-454.

Siegel, J., & Bauer, M. (1997). A field usability evaluation of a wearable system. In L. Bass and A. Pentland (eds.), Digest of Papers: First International Symposium on Wearable Computers (October 13 - 14), (IEEE Computer Society, Los Alamitos, CA), 18-22.

Simpson, C. A., Coler, C. R., & Huff, E. M. (1982). Human factors of voice I/O for aircraft cockpit controls and displays. In D. Pallett (ed.), Proceedings of the Workshop on Standardization for Speech I/O Technology (March 18 - 19), (National Bureau of Standards, Gaithersburg, MD), 159-166.

Simpson, C. A., McCauley, M. E., Roland, E. F., Ruth, J. C., & Williges, B. H. (1985). System design for speech recognition and generation. Human Factors, 27, 115-141.

Slater, M. (1997). User interfaces: beyond keyboards. Microprocessor Report, 23 June, 11, 15.

Thimbleby, H. (1982). Character level ambiguity: Consequences for user interface design. International Journal of Man-Machine Studies, 16, 211-225.

Triverio, J. (1997). Speak out: Today's software takes dictation well. Home PC, June, 4, 131-133.

Waterworth, J. A. (1984). Speech communication: How to use it. In A. Monk (ed.) Fundamentals of Human-Computer Interaction, (Academic Press, London), 221-236.

Yasaki, E. K. (1983). Lisa is Apple’s star. DATAMATION, February, 29, 2, 40-42.

Bionotes

Lawrence J. Najjar is a Research Scientist in the Multimedia Information in Mobile Environments Laboratory, Georgia Tech Research Institute, GTRI/EOEML, 575 14th Street, IPST Building, Atlanta, GA 30332-0823 USA. He holds a PhD in engineering psychology from the Georgia Institute of Technology and is a highly experienced user interface designer. Many of his publications are available at http://mime1.gtri.gatech.edu/imb/people/larry.html.

Jennifer J. Ockerman is a Graduate Research Assistant in the Multimedia Information in Mobile Environments Laboratory, Georgia Tech Research Institute, GTRI/EOEML, 575 14th Street, IPST Building, Atlanta, GA 30332-0823 USA. She is a PhD student in industrial engineering at the Georgia Institute of Technology. She is interested in computer systems that aid and guide operators’ actions and decision making with electronic media.

J. Christopher Thompson is a Senior Research Engineer in the Multimedia Information in Mobile Environments Laboratory, Georgia Tech Research Institute, GTRI/EOEML, 575 14th Street, IPST Building, Atlanta, GA 30332-0823 USA. He is a PhD candidate in instructional technology at Georgia State University. He is interested in the use of wearable computers for data collection and performance support in food processing facilities.