
Evaluating Interaction: Where to Go?

Today, many research communities face the problem of how to evaluate interaction. To name a few:

In all of these cases, the difficulties are similar: a multi-dimensional interface space, the need to factor in (or out) user variability, and a lack of definition of the task space (or of task accomplishment metrics).

A Possible Solution Path

For all of these tasks, we can abstract away from the specifics of real users, actual GUIs and specific tasks to a level of interaction primitives: a basic set of activities that occur during most kinds of interaction. These might include:

The next step is to build some models of prototypical collaborative tasks using such primitives. For example, the characterization of the use of a collaborative environment to schedule a meeting:

Let's take another example: a briefing in a collaborative environment. This would require:

Again, these activities can be broken down into participation, broadcast, response, acknowledgement, etc., but now in a generally synchronous environment. A good environment would allow these things to happen with minimal disruption to the briefing, maximal bandwidth for the briefing, the ability to support and control interaction (what happens if two people ask questions at the same time?), etc.

Suppose we can collect data to characterize these tasks at this level of abstraction. Can we then write a basic "collaborative briefing script" that simulates a collaborative briefing at the level of abstract events, such that it can be "enacted" on a variety of systems? This would have two advantages: 1) it factors out human variability, since each system would be benchmarked against the same script; and 2) it provides "real-time" evaluation, since the interaction would be on a session basis and so would take no longer than the actual session (an hour for the briefing; for meeting scheduling, real time could be sped up, as long as asynchronous communication is demonstrated).

The same measurements would be made for each scripted event, and the same outcome metrics reported. For example: system A could not support "brief and point"; system B could not support interruptions; system C allowed interruptions, but the questioner "drowned out" the speaker, so part of the briefing was lost; system D had so much delay (5 seconds) between the sending of a question and its receipt that the context was lost; etc.
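To make the idea of a script enacted against several systems more concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the proposal: the names ScriptedEvent, SystemProfile and enact, the primitive labels, and the 2-second delay threshold are all hypothetical. It shows only the shape of the approach, in which every system is run against the same list of abstract events and the same outcome is recorded for each event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptedEvent:
    """One abstract event in the script (hypothetical structure)."""
    time_s: float    # seconds from the start of the session
    actor: str       # role in the script, not a real user
    primitive: str   # interaction primitive, e.g. "broadcast"


# A toy "collaborative briefing script". The two questions at t=120
# stand for two people asking at the same time.
BRIEFING_SCRIPT = [
    ScriptedEvent(0.0, "briefer", "broadcast"),
    ScriptedEvent(60.0, "briefer", "brief_and_point"),
    ScriptedEvent(120.0, "listener_1", "question"),
    ScriptedEvent(120.0, "listener_2", "question"),
    ScriptedEvent(130.0, "briefer", "response"),
]


@dataclass
class SystemProfile:
    """Stand-in for a system under test (assumed, highly simplified)."""
    name: str
    supported: set   # primitives the system can carry out
    delay_s: float   # delay between sending and receipt, in seconds


def enact(script, system, max_delay_s=2.0):
    """Run every scripted event against one system and record an outcome.

    max_delay_s is an assumed threshold, not a value from the text.
    """
    report = []
    for event in script:
        if event.primitive not in system.supported:
            outcome = "unsupported"
        elif system.delay_s > max_delay_s:
            outcome = "too_slow"
        else:
            outcome = "ok"
        report.append((event.time_s, event.actor, event.primitive, outcome))
    return report


if __name__ == "__main__":
    system_a = SystemProfile("A", {"broadcast", "question", "response"}, 0.5)
    for row in enact(BRIEFING_SCRIPT, system_a):
        print(system_a.name, row)
```

Because the script is plain data, the same list can be replayed against each system and the reports compared row by row. A real harness would measure outcomes from the running system rather than from a declared profile.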

The scripts may initially be enacted by a set of live testers, each following their part in the script and testing multiple systems. In the longer term, it might be possible (though not necessarily advisable) to build "participatons": automated agents that could "push the buttons" required for system testing. Each system would require its own participaton, but the development of such testing scripts and agents would enable system developers to test their own systems iteratively during development.
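One way to picture "each system would require its own participaton" is as a per-system adapter behind a shared interface, so that the script itself never changes. The sketch below is an assumption made for illustration only: the Participaton base class, the ToySystemParticipaton subclass, its button names, and run_script are hypothetical and do not describe any existing system.

```python
from abc import ABC, abstractmethod


class Participaton(ABC):
    """System-specific agent that turns abstract primitives into actions."""

    @abstractmethod
    def perform(self, actor, primitive):
        """Carry out one primitive; return True if the system supports it."""


class ToySystemParticipaton(Participaton):
    """Adapter for an imaginary system; the button names are made up."""

    BUTTONS = {"broadcast": "start_share", "question": "raise_hand"}

    def __init__(self):
        self.log = []  # record of the buttons "pushed", in order

    def perform(self, actor, primitive):
        button = self.BUTTONS.get(primitive)
        if button is None:
            return False  # this system has no way to do the primitive
        self.log.append((actor, button))
        return True


def run_script(script, participaton):
    """Replay (actor, primitive) pairs through one system's participaton."""
    return [
        (actor, primitive, participaton.perform(actor, primitive))
        for actor, primitive in script
    ]


if __name__ == "__main__":
    script = [
        ("briefer", "broadcast"),
        ("listener_1", "question"),
        ("briefer", "brief_and_point"),
    ]
    agent = ToySystemParticipaton()
    for result in run_script(script, agent):
        print(result)
```

Keeping the script separate from the adapter is the design choice that matters here: a developer testing a new system would write only a new Participaton subclass and reuse the existing scripts unchanged.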

Even if we cannot build fully automatic agents to simulate human participation, asking these questions pushes us to examine the sources of variability in these tasks. The scenario-based method that we have outlined in this document provides a framework for experimentation in this direction, even while we explore the more conventional means of scenario-based evaluation.



Charles Sheppard
Wed Aug 27 17:05:29 EDT 1997