Robert S. Tannen
tannerrs@email.uc.edu
InfoLogicon Technical Services, Dayton, Ohio, USA
Department of Psychology, University of Cincinnati, Cincinnati, Ohio, USA
Abstract
As richer visual web content is delivered, fewer desktop and attentional resources are available for the display of background browser processes (e.g., data transfer rate). A proposed solution is to enable users to aurally monitor background processes while attending to visually displayed web content. This approach is aimed at implementing global usability by stepping away from language-specific alphanumeric data and culturally limited icons and sounds, towards more abstract, but still informative, display elements. These can be derived through the use of simple perceptual cues. Symmetry, which has been an effective cue in graphical configural displays, may be utilized in the design of auditory configural displays of dynamic information. This approach may serve as the basis for cross-cultural auditory interface design.
The Utility of Sound
There has been relatively little research on the development of auditory interfaces for the Internet. For example, the recently published "Human Factors and Web Development" (Forsythe, Grose, & Ratner, 1998) contains only two short paragraphs on the implementation of audio. Typically, the use of sound has been reserved as a solution for visually impaired users, rather than as an enhancement for the sighted population, but several emerging issues suggest the potential utility of sound in interface design, particularly non-speech delivery.
Constraints on Space
Advances in web browser interfaces can have significant effects on the relative salience of different types of information. As better graphics and video products emerge, the "geography" of displays is being dominated by web content, potentially at the expense of more mundane browser interface elements that provide useful information about background processes (download rate, connection status, etc.). As with television, which browsers are arguably attempting to emulate, there is a paucity of dynamic, continuously available information about background processes, beyond the content itself. This is adequate for television’s highly reliable, serial channel delivery, but may be lacking for multi-window browsing, FTPing and similar multitasking. The relatively high instability of web browsing, coupled with the relatively low bandwidth dedicated to monitoring those background processes, makes it difficult for users to diagnose, and even avoid, problems when they arise: Is the source of a communication lapse local or wide-area? Is this page still downloading or has my browser crashed? In fact, a hallmark of advanced technology is that people are often removed, both spatially and behaviorally, from the components of the systems that they are using.
Successful interface designs generally enhance the performance of systems that are working properly, but this tends to distance users from what’s going on "behind the scenes". This separation can cause people to fall "out of the loop" when unexpected circumstances or problems arise (Norman, 1988). Users may come to rely on surface features of the interface (e.g., content continuity) as true indications of the system state, rather than on the underlying causes. As richer primary content is delivered visually, there is less space available for secondary feedback. A potential solution is to move the delivery of non-content information off the screen and into the auditory domain. But the utility of auditory display is not based solely on the limitations of screen sizes.
Constraints on Visual Attention
As information is delivered in more engaging formats such as video and interactive animation, in addition to text, users may fixate on a particular part of the display. When a user’s visual attention is focused on a particular area, he or she is unable to attend to visual information beyond the field of view. Much like immersion in any highly visual activity such as reading or watching television, the web can induce a kind of "tunnel vision". A significant difference between web browsing and these more typical activities is the relative unreliability of the components: networks, operating systems, software, variability in user skill, etc. Given the dependency of information delivery on these subsystems, one might consider browsing a simplified form of process monitoring. Traditionally, process monitoring has involved complex systems (air traffic control, power plant management) where operators must remain vigilant of system behavior. Similarly, web browsing can be viewed as comprising two subtasks: content acquisition (primary) and background process monitoring (secondary). Obviously, the danger of a nuclear power plant failing is not equivalent to that of a personal computer crash, but until reliability on par with telephony is achieved, monitoring remains a necessary secondary activity. Some of the lessons learned in the design of process monitoring systems can apply to web interface design. For example, people’s omnidirectional listening ability makes auditory signals, in particular, excellent for re-orienting attention towards a pressing matter. There are other characteristics that merit the use of sound as an important interface tool.
Properties of Sound
Mountford and Gaver (1990) eloquently suggested that "sound exists in time and over space, vision exists in space and over time." In other words, information delivered via sound tends to indicate changes over time, but can be picked up over a wide range of spatial locations. Visual displays are usually less transitory, but can only be perceived at specific locations in space. This is an oversimplification of the characteristics of acoustically and optically conveyed information, but it points out that auditory displays differ from visual displays in several important ways. The relevance of this to web interface design can be readily seen from the previously described situations. Sound is omnidirectional and alerting, suggesting that it can be used to provide information over and above both visual display and visual attention limitations. This is not to suggest that sound be used in place of visual display; rather, it should enrich the visual experience. Acoustical media can carry information that is distinct from optical media in both form and content, and an interface designer should choose the form and modality of a display to fit the context of the information and the user environment. One particular case where this may be useful is in the presentation of background process information such as downloading, as sound is good at displaying changes over time. For example, we are all familiar with the "modem-handshaking song" whose changes in pitch and timbre aurally confirm the status of a log-on. I propose an extension of this to other aspects of web browsing, providing continuous but unobtrusive feedback, while allowing users to attend to visually displayed primary content. Furthermore, I see this as the basis for auditory development guidelines.
Given that the use of sound is relatively limited, there is an excellent opportunity to consider the Human Factors issues in sound portrayal on the web, and particularly to focus on the design of sounds that are globally usable.
Human Factors Issues in the Design of Auditory Displays
The design of displays can be driven by a number of usability issues, but I have chosen to focus on perceptual saliency and global usability. These are interrelated in that many of the same human factors principles that make an effective display can be extrapolated across cultures. The following section will discuss some of the research on auditory display. This will guide the selection of sounds and sound ensembles towards creating effective interfaces: interfaces which make information salient and, perhaps more significantly, make dynamic relationships easier to comprehend.
Two Ways of Listening
Experts in auditory display (Kramer, 1994) recognize a continuum of sounds ranging from audification to sonification. Audification is similar to the concept of icons in visual displays, in that the acoustical structure of the sound parallels that of the represented event, such as a spoken "You have mail!" indicating the receipt of a new e-mail message. That is, these are typically familiar, short-duration copies of real-world acoustical events. At the other end, sonification refers to the use of sound as a representation aid to acoustically map information for events that do not typically have a relevant acoustic component. Examples of sonification include sonar pings that convey distance information about objects in the environment. Here the sounds are the means of information conveyance, but are not meaningful in and of themselves. Rather, it is the relations (in the case of sonar, temporal) among the sounds that provide information to a listener. Note that "realistic" sounds could be used for the purpose of sonification, but they risk becoming unrecognizable as they are varied to reflect changes in the data. In a nutshell: in audification the structure of the sound contains the information, and in sonification the relation among sounds conveys information. This means that, in the case of sonification, changing the specific sounds does not change the information, and even simple tones may be used.
An analysis of how people actively listen for non-speech information can play a large role in determining the acoustical characteristics that can be used to transform information into sound. Gaver’s work on everyday listening (1993a,b) provides a unique theoretical perspective on this issue. Gaver makes a categorical, but malleable, distinction between musical listening and everyday listening. The former describes attending to the acoustical components of an event, and the latter refers to perceiving events conveyed by the acoustical components.
These are not definitive delineations; they refer to the kind of information that a listener is trying to obtain, and not specifically to the source of sounds he or she is hearing. People may attend to certain acoustical features for musical listening, and ignore those features while attending to others for everyday listening. This duality can easily be illustrated. Imagine sitting in a concert hall listening to a performance. If you attend to the relations between the sounds (patterns in the melody, changes in tempo, and such), this is considered musical listening. On the other hand, if the sounds are used to obtain information about the event, i.e., perceiving that a harpsichord is being played at a certain distance, in a hall of a given size, etc., then that is everyday listening. In everyday listening the sounds themselves are important - one kind of tone can convey distinct information from another about the listener’s world. In musical listening it is not the particular sounds themselves which are essential, but the information that is conveyed by the spatio-temporal pattern and arrangement of sounds, which is therefore not language- or culture-specific. Spatio-temporal patterns are meaningful changes in the position of objects in space (real or virtual) over time. In other words, the message conveyed by the choice and arrangement of the notes will remain invariant even though the piece could be played on different instruments, in different environments, etc. In a nutshell then, there is sound as information (everyday listening) and information as sound (musical listening).
The Trouble with Icons and Earcons
A key decision in designing an auditory display is determining the sounds which will be used to portray information. Developing for global usability means selecting sounds that are not culturally constrained. Hansen (1994) warned against the use of semantically loaded icons in graphical displays. He was referring to components of the interface that resemble real-world objects and have some feature, or features, that are analogous to those objects, but do not afford all of the actions that one might expect. For example, the "Trash Can" icon on a PC desktop is generally recognizable as a place where unwanted items may be thrown away. In that respect it functions as one would expect a trash can to. A real trash can would have a myriad of other uses (affordances) such as holding things down, generating noise, impeding entrance, etc. On the desktop it does not have these uses, and even its operational function is context specific - you can throw out a file, but dragging a disk to the trash ejects it (fortunately), rather than erasing it. Of course this all assumes that you have some reasonable understanding of what a trash can is, or trash for that matter. It is reasonable to suppose that the same kind of affordance expectancies exist with sounds. For example, what good would a foghorn warning signal be to someone who lives in the desert?
Empirical work on auditory display design has shown some of the problems with using sounds that are based on real acoustic events. The ARKola simulation (Gaver, Smith, and O’Shea, 1991) was an evaluation of an auditory icon-based display for process/manufacturing monitoring and control. The authors discussed some of the difficulties that arose out of using semantically loaded symbology: "First, the urgency of the alarm sounds did not always appropriately reflect the urgency of the situations causing them.
For example, the breaking bottle sound was so compelling semantically and acoustically that partners sometimes rushed to stop the sound without understanding its underlying cause or at the expense of ignoring other more serious problems." Another related difficulty was that the use of realistic sounds meant that when a process failed to operate it simply became silent, so operators were not alerted to the system malfunction.
Difficulties in using metaphoric or iconic sounds become more apparent when considering the limitations of their use in representing change in dynamic processes, such as data transfer. The display of system processes requires the generation of sounds that correspond to the relations among those processes. Many of those processes involve complex and/or abstract interactions, making it difficult, if not impossible, to generate realistic iconic sounds. Additionally, certain characteristics of these sounds will vary as a function of dynamic system states. For example, changes in the pitch of an engine sound could be used to indicate changes in CPU use, but relatively small changes in pitch can easily make the engine noise unrecognizable as an engine (although potentially still useful as an indicator). Varying sonic parameters limits the identifying features of sounds. Changes in frequency and temporal rate can make sounds that were readily identifiable under normal operating conditions cacophonous at system extrema, when they are most useful. Finally, the interactions of the sounds might also impede identification and, instead of producing emergent sonic characteristics, would lead to confusion.
The solution to many of these problems is to use abstract or simple tones to convey information. Such sounds do not have particular cultural connotations, and this will avoid the problem of mistaken affordances, as sounds will not represent anything other than their designed function.
Additionally, abstract sounds can be specifically tailored to achieve both acoustical and perceptual harmony for the particular system or function to be represented. It is not a great leap to see the similarities between audification and everyday listening, and between sonification and musical listening. Audification and sonification are descriptive of the display processes, while everyday and musical listening are perceptual-level categories, describing what kind of information the listener is getting out of the sounds. One is capable of extracting either kind of information from either kind of display, but that information may not be optimal for the user’s task-specific goals. In the case of presenting information about background processes we want the listener to hear changes over time, and emergent patterns. This is a dynamic process best conveyed over time, rather than by static icons. Using sonification also avoids the problems of cultural literacy that can limit the usability of icons. Therefore, monitoring dynamic background processes, such as downloading, is akin to musical listening. And clearly we are able to perceive complex patterns among the relations of sounds even when the sounds themselves are different.
Developing a Dynamic Auditory Display
Given the research, an auditory display aimed at presenting information flow should emphasize musical listening. Sonification should be used to combine abstract sounds into informative displays. This will allow for the display of dynamic changes in information exchange that are not constrained by culture-specific sounds. But what should be the form of such a display, and how can multiple sonic elements be arranged into a meaningful and perceptually salient presentation? A promising answer to these questions lies in work done on visual displays.
Graphical Configural Displays
The web browser is a promising area in which to implement global usability that is not constrained by cultural specificity. This means stepping away from language-based alphanumeric data and culturally limited images and icons, towards more general, but still informative, display elements. These can be derived through the study of perceptual invariants, environmental cues that are universal and directly perceivable. Spatio-temporal patterns such as symmetry are readily perceived, and have been used effectively in the design of graphical configural displays. These displays take advantage of our natural pattern-recognition abilities by optically representing higher-order variables as emergent properties of the graphical layout of lower-order variables (Bennett, Toms, & Woods, 1993). An elementary example of a graphical configural display comes out of the relationship between the two hands of a clock. The relation between the hands (angle) is a readily perceived indicator of the time that does not rely on numeric symbology. Similarly, information about when to change gears in an automobile is an emergent feature that arises out of the angle described by the tachometer and speedometer needles (although sound and other sensory information can play a significant role). In this case, changes in the relationship between the individual elements (needles) over time can be perceived holistically to indicate changes in the behavior of the system. While these examples may be coincidental, interface designers have been purposefully developing configural displays that graphically present part-whole system information, without relying directly on a particular language for information acquisition.
Auditory Display of Background Processes
Earlier I brought up Mountford and Gaver’s (1990) articulation concerning sound existing in time and over space, versus vision existing in space and over time. As sound is predisposed to communicate changes over time, the auditory display would be suited to indicating the dynamic flow of information. Observation leads one to believe that there are spatial metaphors inherent in browser interfaces. Information is DOWNloaded from offsite to one’s desktop, and then UPloaded back to the outside world. Web pages load, and hence read, from top to bottom, with content roughly going from general (titles, headers) to specific (body text, links) along that descent. This spatial metaphor for the organization of information makes sense as it resembles the spatial arrangement of similar artifacts such as books, and of course, other computer interfaces. Consistency will be achieved by developing a spatial metaphor for the auditory display that corresponds with users’ expectations and learned behaviors. I propose a hypothetical auditory display that follows this vertical metaphor. The form of this auditory display is a cascade: the direction and rate of information flow between the web and the desktop is aurally presented as rising and falling "streams" of information. Clearly, trying to simulate a waterfall-like sound would go against the principles discussed earlier, so short mid-frequency (~500 Hz) pulses can be generated instead. These pulses are spatialized using 3D sound processing technology, so that when information is being downloaded to the user’s desktop, the pulses appear to originate from a point above the desktop display (and vice versa for uploading). The positions of consecutive pulses fall sequentially closer to the computer, to give the impression that data is falling from above and landing in the user’s information space. This is consistent with the visual desktop metaphor of pages loading from top to bottom.
Direction (uploading or downloading) is displayed by the perception of sound sources moving away from or towards the desktop along a vertical axis. Rate of information flow is a time-based function: the rate of change in the position of sound pulses. Communication lags can be differentiated by spatial location. For example, if the pulse rate decreases as the signals approach the desktop level, this is an indication that there is a local problem (e.g., a busy CPU). If the pulses are relatively slow all the way down, one can assume that the lag is caused by high network traffic.
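The mapping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation of the proposed display: it computes when each pulse would sound and at what apparent elevation, and leaves the actual 3D sound rendering aside. The 45-degree starting elevation, the 0.5-second threshold for a "slow" gap, and the names cascade and classify_lag are hypothetical choices made here for the example; only the ~500 Hz pulse frequency and the vertical metaphor come from the text.

```python
from dataclasses import dataclass
from typing import List

PULSE_HZ = 500.0          # mid-frequency pulse named in the text
TOP_ELEVATION_DEG = 45.0  # assumed elevation of the "web" end of the cascade


@dataclass
class Pulse:
    onset_s: float        # time at which the pulse starts
    elevation_deg: float  # apparent height above the desktop (0 = desktop level)
    freq_hz: float = PULSE_HZ


def cascade(direction: str, gaps_s: List[float]) -> List[Pulse]:
    """Build one cascade; gaps_s holds the time between consecutive pulses."""
    if direction not in ("down", "up"):
        raise ValueError("direction must be 'down' or 'up'")
    n = len(gaps_s) + 1
    if n < 3:
        raise ValueError("need at least three pulses")
    pulses, onset = [], 0.0
    for i in range(n):
        frac = i / (n - 1)
        # Downloads fall from the top to the desktop; uploads rise.
        elev = TOP_ELEVATION_DEG * ((1 - frac) if direction == "down" else frac)
        pulses.append(Pulse(onset, elev))
        if i < len(gaps_s):
            onset += gaps_s[i]
    return pulses


def classify_lag(pulses: List[Pulse], slow_gap_s: float = 0.5) -> str:
    """Guess the source of a lag from where in the cascade the pulses slow."""
    near, far = [], []
    for a, b in zip(pulses, pulses[1:]):
        gap = b.onset_s - a.onset_s
        height = (a.elevation_deg + b.elevation_deg) / 2
        (near if height < TOP_ELEVATION_DEG / 2 else far).append(gap)
    near_slow = sum(near) / len(near) >= slow_gap_s
    far_slow = sum(far) / len(far) >= slow_gap_s
    if near_slow and far_slow:
        return "network"  # slow all the way down
    if near_slow:
        return "local"    # slows only near the desktop
    if far_slow:
        return "unclear"  # a case the text does not describe
    return "none"


if __name__ == "__main__":
    stalled = cascade("down", [0.1, 0.1, 0.8, 0.9])
    for p in stalled:
        print(f"{p.onset_s:4.1f} s  {p.elevation_deg:5.2f} deg  {p.freq_hz:.0f} Hz")
    print(classify_lag(stalled))
```

A listener would make the same judgment perceptually, of course; the classifier is included only to make explicit which part of the pulse pattern carries the diagnostic information.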
Implementation and Limitations
The omnidirectional nature of sound makes it ideal for presenting information to users in a fixed or limited field of view situation. Auditory display is well suited for the presentation of changes over time, and this was used to represent the time-dependent, dynamic process of information conveyance. Although aural perception is not as acute as its visual counterpart when it comes to localizing objects in space, auditory localization is sufficient (within 5 degrees) for the separation of components over the vertical space. Audition is well suited to displaying temporal dynamics and relations, and this was applied here to the representation of the overall rate of data flow. The relatively simple display that I have proposed is a good starting point to assess some of the issues concerning the development of auditory displays for web browsers. One area that should be discussed is the hardware requirements for implementing such an interface. It should be noted that, at the current level of auditory technology, the features of this display would require a computer with digital signal processing (DSP)/convolving capabilities, and that the user wear earphones to control the delivery of the acoustic signal to each ear, as is necessary for sound spatialization. One of the major issues to address in the development of auditory displays is their impact on, and interaction with, other sounds, particularly speech and relevant background signals. The use of headphones may be a mixed blessing, because while it prevents the sounds of the display from disturbing others nearby, it can also cut off communication. Eventually, sound could be broadcast from speakers at a comfortable loudness level, turning the display into a background noise, like the air-conditioning - always present, but never noticed until a meaningful change occurs.
One potential application for such a display system would be for a network manager, allowing him or her to track multiple data lines simultaneously without having to visually attend to a display. It would make sense to evaluate this auditory configural display in the context of an applied setting, much as was done in the ARKola simulation (Gaver, Smith, and O’Shea, 1991). A comparison with an auditory icon-based approach would be useful in measuring the utility of configural versus icon displays for audition, particularly across cultures. The major issues that would need to be examined by such a simulation include the salience of the auditory signals, users’ ability to learn the mappings, and ultimately, the efficacy of the interface. The bottom line is whether such a display can significantly improve situational awareness and, in doing so, enhance usability, both within and across cultures.
References
Bennett, K.B., Toms, M.L., & Woods, D.D. (1993). Emergent features and graphical elements: Designing more effective configural displays. Human Factors, 35(1), 71-97.
Forsythe, C., Grose, E., & Ratner, J. (1998). Human factors and web development (p. 93). Mahwah, NJ: Lawrence Erlbaum Associates.
Gaver, W.W. (1993a). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5, 1-29.
Gaver, W.W. (1993b). How do we hear in the world?: Explorations in ecological acoustics. Ecological Psychology, 5, 285-314.
Gaver, W.W., Smith, R.B., & O’Shea, T. (1991). Effective sounds in complex systems: The ARKola simulation. Proceedings of CHI ’91, pp. 85-90.
Hansen, J.P. (1994). Representation of system invariants by optical invariants in configural displays for process control. In P.A. Hancock, J.M. Flach, J.K. Caird, & K.J. Vicente (Eds.), Local applications in the ecology of human-machine systems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kramer, G. (1994). An introduction to auditory display. In G. Kramer (Ed.), Auditory display: Sonification, audification, and auditory interfaces. Reading, MA: Addison-Wesley.
Mountford, S.J., & Gaver, W.W. (1990). Talking and listening to computers. In B. Laurel (Ed.), The art of human-computer interface design (p. 322). Reading, MA: Addison-Wesley.
Norman, D.A. (1988). The psychology of everyday things. New York: Basic Books.