Next: Bibliography
Up: No Title
Previous: Measures and Metrics
Today, many research communities face the problem of how to evaluate
interaction. To name a few:
- Spoken language systems can be evaluated only on ``pre-recorded''
  speech input, with database tuples as the response. This
  methodology is effective because all systems receive identical
  input, which factors out subject variability, but it comes at the
  expense of interaction: there is no user in the loop, and you
  cannot talk back to a pre-recorded message.
- Interactive information retrieval has also, to date, failed to
  define an evaluation paradigm because we do not really understand
  how to compare user A interacting with system X with user B
  interacting with system Y.
- Interface evaluation is limited either to high-level qualitative
  results (from, e.g., a cognitive walk-through) or to expensive
  laboratory and field evaluations involving significant numbers of
  subjects.
- Collaborative environments offer an even bigger challenge:
  evaluating interaction among multiple participants, who may differ
  in rank, preferred mode of usage, familiarity with the system, etc.
In all of these cases, the difficulties are similar: a multi-dimensional
interface space, the need to factor in (or out) user variability, and
the lack of a defined task space (or of task accomplishment metrics).
For all of these tasks, we can abstract away from the specifics of
real users, actual GUIs and specific tasks to a level of interaction
primitives: a basic set of activities that occur during most kinds of
interaction. These might include:
- Signaling readiness to participate (joining a session),
- Signaling end of participation (leaving a session),
- Communicating to participants,
- Responding to communication from participant(s),
- Signaling successful communication (acknowledgement),
- Signaling failure to communicate (error condition), and
- Attempting to interrupt.
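As a minimal sketch, the primitives listed above could be written down as
an enumeration plus an event record. The names `Primitive` and `Event` and
the fields of the record are assumptions made for illustration; the text
does not prescribe any particular representation.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Primitive(Enum):
    """Interaction primitives, one member per item in the list above."""
    JOIN = auto()         # signaling readiness to participate
    LEAVE = auto()        # signaling end of participation
    COMMUNICATE = auto()  # communicating to participants
    RESPOND = auto()      # responding to communication from participant(s)
    ACKNOWLEDGE = auto()  # signaling successful communication
    ERROR = auto()        # signaling failure to communicate
    INTERRUPT = auto()    # attempting to interrupt


@dataclass(frozen=True)
class Event:
    """One occurrence of a primitive (hypothetical record layout)."""
    time: float                   # seconds since the session started
    actor: str                    # participant who performed the primitive
    primitive: Primitive
    target: Optional[str] = None  # None is taken to mean "all participants"
    payload: str = ""


# A short session expressed purely in terms of primitives.
session = [
    Event(0.0, "alice", Primitive.JOIN),
    Event(1.5, "bob", Primitive.JOIN),
    Event(3.0, "alice", Primitive.COMMUNICATE, payload="agenda"),
    Event(4.2, "bob", Primitive.ACKNOWLEDGE, target="alice"),
]
```

Recording sessions in this form would let the same log be compared across
users, interfaces and tasks, which is the point of abstracting to primitives.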
The next step is to build some models of prototypical collaborative tasks
using such primitives. For example, the use of a collaborative environment
to schedule a meeting can be characterized as follows:
- A participant sends either a schedule or a suggested time.
- A participant acknowledges agreement or conflict, or suggests
  alternative(s).
- This process iterates until someone identifies a consensus
  or a plurality.
This activity is made up of iterated communications, acknowledgements,
responses, etc. It is normally asynchronous, so interruptions do not
really enter into it. A successful collaborative system must nevertheless
support these asynchronous communications. Obviously, in this case, e-mail
would work fine (and may be the baseline system to beat). However,
the way in which a user's on-line schedule can be shipped around and
made viewable may decrease the amount of time any one user has to
spend, reduce the number of iterations needed, and/or improve the
quality of the solution.
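The contrast between the e-mail baseline and a shared on-line schedule can
be sketched as follows. This is an illustrative simplification, not a model
taken from the text: the function names, the integer time slots and the
rule that the proposer simply tries its free slots in order are all
assumptions, and counter-proposals are left out.

```python
def schedule_by_messages(free_slots, proposer, max_rounds=10):
    """Propose/acknowledge rounds, as in an e-mail exchange.

    free_slots maps each participant to the set of slots they can attend.
    Returns (agreed_slot_or_None, rounds_used, messages_sent).
    """
    candidates = sorted(free_slots[proposer])[:max_rounds]
    others = [p for p in free_slots if p != proposer]
    messages = 0
    for round_no, slot in enumerate(candidates, start=1):
        messages += len(others)  # proposer sends the suggested time
        replies = [slot in free_slots[p] for p in others]
        messages += len(others)  # each reply: agreement or conflict
        if all(replies):
            return slot, round_no, messages
    return None, len(candidates), messages


def schedule_by_shared_view(free_slots):
    """With every schedule viewable, consensus is a single intersection."""
    common = set.intersection(*free_slots.values())
    return min(common) if common else None


# Slots are hours of the day; values are made up for the example.
free = {"ann": {10, 14, 16}, "bo": {14, 16}, "cy": {16, 17}}
print(schedule_by_messages(free, "ann"))  # (16, 3, 12)
print(schedule_by_shared_view(free))      # 16
```

Both approaches reach the same slot here, but the message-based one needs
three rounds and twelve messages. Iterations and per-user effort are the
quantities the text suggests a collaborative system might reduce.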
Let's take another example: a briefing in a collaborative environment.
This would require:
- Participants entering a session at a fixed time,
- Briefer able to broadcast briefing materials, well synchronized
  in case of multiple media (e.g., talk and point to viewgraph),
- Participants able to join late, leave early,
- Participants able to interrupt and ask questions,
- Briefer able to answer questions,
- Participant able to enter after the briefing and replay it.
Again, these activities can be broken down into participation, broadcast,
response, acknowledgement, etc., but now in a generally synchronous
environment. A good environment would allow these things to happen
with minimal disruption to the briefing, maximal bandwidth for the
briefing, the ability to support and control interaction (what happens
if two people ask questions at the same time?), etc.
Suppose we can collect data to characterize these tasks at this level
of abstraction. Can we then write a basic ``collaborative briefing
script'' that simulates a collaborative briefing at this level of
abstract event, such that it can be ``enacted'' on a variety of systems?
This would have the advantages of 1) factoring out human variability -
each system would be benchmarked against the same script; and 2) providing
``real-time'' evaluation - the interaction would be on a session basis,
so it would not take longer than the actual session (an hour for the
briefing; for meeting scheduling, real time could be sped up, as
long as asynchronous communication is demonstrated). The same
measurements would be made for each scripted event, and the same
outcome metrics reported. For example: system A could not support
``brief and point''; system B could not support interruptions; system C
allowed interruptions, but the questioner ``drowned out'' the speaker,
so part of the briefing was lost; system D had so much delay (5 seconds)
between the sending of a question and its receipt that the context
was lost; etc.
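One way to picture such a script is as a fixed list of timed steps that is
replayed against each system, with the same measurement taken at every
step. The sketch below is hypothetical throughout: the `Step` record, the
idea that a system adapter returns a delivery delay (or `None` when the
action is unsupported), the 2-second threshold and the two toy systems are
all assumptions chosen to mirror the example outcomes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    at: float    # scripted time, in seconds from session start
    actor: str   # role that performs the step
    action: str  # e.g. "join", "broadcast", "interrupt"


BRIEFING_SCRIPT = [
    Step(0.0, "briefer", "join"),
    Step(0.0, "listener", "join"),
    Step(5.0, "briefer", "broadcast"),
    Step(20.0, "listener", "interrupt"),
    Step(25.0, "briefer", "respond"),
    Step(60.0, "listener", "leave"),
]


def enact(script, system, max_delay=2.0):
    """Replay every step on one system and record the same outcome metric.

    `system` is a callable returning the delivery delay in seconds,
    or None if the system cannot perform the action at all.
    """
    report = []
    for step in script:
        delay = system(step)
        if delay is None:
            outcome = "unsupported"
        elif delay > max_delay:
            outcome = "too slow"
        else:
            outcome = "ok"
        report.append((step.action, step.actor, outcome))
    return report


def system_b(step):
    """Toy stand-in: cannot support interruptions."""
    return None if step.action == "interrupt" else 0.1


def system_d(step):
    """Toy stand-in: 5-second delay when a question is sent."""
    return 5.0 if step.action == "interrupt" else 0.1


for name, system in [("B", system_b), ("D", system_d)]:
    problems = [r for r in enact(BRIEFING_SCRIPT, system) if r[2] != "ok"]
    print(name, problems)
```

Because every system sees the identical script, differences in the reports
reflect the systems rather than the participants.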
The scripts may be initially enacted by a set of live
testers, each following their part in the script, testing multiple
systems. Longer term, it might be possible (though not necessarily
advisable) to build ``participatons'' or automated agents that could ``push
the buttons'' required for system testing. Each system would require
its own participaton, but the development of such testing scripts and
agents would enable system developers to iteratively test their own
systems during development.
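A rough sketch of how per-system participatons might share one interface
is given below. The class and function names are invented for
illustration, and the logging agent merely stands in for code that would
actually drive a real system's controls.

```python
from abc import ABC, abstractmethod


class Participaton(ABC):
    """Common interface; each system under test supplies its own subclass."""

    def __init__(self, role):
        self.role = role

    @abstractmethod
    def perform(self, action, payload=""):
        """Carry out one scripted action; return True on success."""


class LoggingParticipaton(Participaton):
    """Stand-in agent that records actions instead of pushing real buttons."""

    def __init__(self, role, supported):
        super().__init__(role)
        self.supported = set(supported)
        self.log = []

    def perform(self, action, payload=""):
        ok = action in self.supported
        self.log.append((self.role, action, ok))
        return ok


def play_part(agent, script):
    """Give the agent only the script lines that belong to its role."""
    return [agent.perform(action, payload)
            for role, action, payload in script
            if role == agent.role]


script = [
    ("briefer", "join", ""),
    ("listener", "join", ""),
    ("briefer", "broadcast", "slide 1"),
    ("listener", "interrupt", "question"),
]
listener = LoggingParticipaton("listener", supported={"join", "leave"})
results = play_part(listener, script)
print(results)  # [True, False]
```

Keeping the script separate from the agent means the same script can be
played by live testers first and by automated agents later.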
Even if we cannot build fully automatic agents to simulate human
participation, asking these questions pushes us to examine the
sources of variability in these tasks. The scenario-based method that
we have outlined in this document provides a framework for experimentation
in this direction, even while we explore the more conventional means
of scenario-based evaluation.
Charles Sheppard
Wed Aug 27 17:05:29 EDT 1997