In the preceding sections, we presented a framework for classifying collaborative systems and an approach to evaluation that employs scenarios. Here we enumerate the metrics and measures introduced in Section 3 and discuss techniques for gathering the metrics.
Metrics are indicators of system, user, and group performance that can be observed, singly or collectively, while executing scenarios. Metrics – such as time, length of turn, and other countable events – are directly measurable and can often be collected automatically.
Measures can be taken at each of the four levels of the collaborative framework. Measures are derived from interpretations of one or more metrics. For example, task completion time (a requirement-level measure) is based on start and end time metrics. For asynchronous tasks, it may be useful to distinguish time on task from elapsed time. A measure can also be a combination of interpreted metrics and other measures. A complicated measure, like efficiency, is partially derived from the interpretation of metrics such as time, as well as user ratings and tool usage. In addition, measures of system breakdown (taken at the service or technology levels) contribute to efficiency.
A simple way to distinguish metrics from measures is this: a metric is an observable value, while a measure assigns meaning to that value by applying human judgment.
As another example of the metric and measure relationship, consider the turn overlap metric. Overlap of speaker A by speaker B can be counted if the start time of speaker B is greater than the start time of speaker A but less than the end time of speaker A. Further interpretation is required to determine if a particular occurrence of turn overlap is an attempt to gain the floor (interruption in communication) or a back-channel response indicating attentiveness (support, grounding). This might be an interesting measure at the capability level, to understand the ease and effectiveness of multi-person conversational interaction.
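The counting rule for turn overlap can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original methodology: the `Turn` record, its field names, and the use of seconds as the time unit are assumptions made for the example. The code only produces the countable metric; deciding whether each overlap is an interruption or a back-channel response still requires human interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    # Hypothetical record layout; times are seconds from session start.
    speaker: str
    start: float
    end: float


def count_overlaps(turns):
    """Return (A, B) pairs where B starts after A starts but before A ends."""
    ordered = sorted(turns, key=lambda t: t.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # later turns start even later, so none can overlap a
            if b.speaker != a.speaker and a.start < b.start < a.end:
                overlaps.append((a, b))
    return overlaps


session = [
    Turn("A", 0.0, 5.0),
    Turn("B", 3.0, 4.0),   # starts while A is still speaking
    Turn("A", 6.0, 8.0),
    Turn("C", 7.5, 9.0),   # starts while A is still speaking
]
overlap_pairs = count_overlaps(session)
```

Each returned pair could then be labeled by a human coder, which is the step that turns the metric into a capability-level measure.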
The metrics and measures described here are relevant to laboratory experiments. The goal of the Evaluation Working Group is to define inexpensive evaluation technology that can be reproduced in a laboratory. Hence, they do not address organizational impact or other measures that require fielding production-quality systems.
Experiments involving human subjects, or sets of subjects, are expensive and time-consuming, both in terms of collection and analysis. In many cases, the measures must be developed and validated before they can be applied with confidence. Despite these difficulties, we describe options for evaluating collaborative systems at all four levels of the framework discussed in Section 3.
We begin with an overview of the methods, metrics, and measures, and then present the measures found at each level of the framework. Metrics are introduced first, along with the methods used to collect them, because they are the components of measures. Next we discuss measures, before turning to automated logging techniques that may aid data collection and analysis.
Figure 2 illustrates the relationship of data collection methods, metrics, measures, and human judgment. The diagram also shows how the measures of the four levels of the collaborative framework are nested, emphasizing the mappings between these levels. As illustrated at the bottom of the diagram, we use data collection methods, such as a logging tool or videotaping, to gather metrics on a system (the next level up). Metrics provide the raw input necessary for refining measures for requirements, capabilities, services, and technologies. Human judgments, from both experts and users, are associated with all four levels.
Figure 2. Overview of the Levels of Measures
Metrics based on human judgment also support measures for making judgments about the system. Questions to both experts and users of the system can provide valuable data points in the evaluation of a system.
The following table of metrics comprises the observable data elements that were identified as useful indicators of system and group performance. A single metric is just a number (e.g., number of turns) or a set of numbers (e.g., start time and end time). Metrics can also be used in combinations (e.g., the repair activities metric is partially built up from the number of undos and the number of cancels and is the deviation from some established ‘right path’). For each metric, we present in Table 8 definitions and examples of ways to capture the metric. Where applicable, the metrics are enumerated or are broken down into finer granularity.
We attempt to provide some guidelines for gathering each metric. It is important for the evaluator to record how each metric was observed so that comparisons can be made with data collected from multiple experiments. Methods used to gather each metric are listed in Section 5.2.1.
Table 8. Metrics
Metric
Definition
Examples
Countables
Any items that can be counted, once a clear definition has been developed. The difficulty with countables is in ensuring that the definitions are consistent and that items are counted in the same manner.
• Turns (spoken utterances, e-mail messages, typed lines, etc.)
• Moves (a subtype of turn, e.g., turns taken in a game)
• Steps (number of high-level steps to produce result; number of mouse clicks or button presses towards accomplishing a task)
• Trials (number of aborted attempts and successful attempts to reach goal)
• Ideas generated
• Responses (replies to e-mail message; responses to a question)
• Cancel button presses (Note that there could be different reasons for this: 1) the user made a mistake, or 2) the user changed his/her mind)
Length of turn
The length or size of any type of turn. A turn is a single unit of communication or activity.
• Length of spoken utterance = end time (of utterance) - start time
• Length of e-mail message = number of lines
- or - = number of sentences
- or - = number of words
- or - = size in bytes
• Length of turn in chat session = number of sentences
- or - = number of words
• Length of move (subtype of turn) = time at end of move - time at start of move
Task completion
Whether or not a task is completed.
Yes or no (completion or non-completion). In some cases, the degree of completion can be measured.
Time
Time, as a metric, supports a number of measures. Base metrics for start and end time can be interpreted to support various measures. Time can be measured with respect to an individual, as a sum of all individual times, and/or as the longest individual time in the group.
• Overall task execution time: the interval from task beginning to task end. This is the total time it takes the collective group to complete execution of the task.
• Task time: the time spent on the actual task. This does not include transition time or time spent doing non-task-related activities.
• Transition time: the amount of time it takes the group to transition from one task to another. This includes setup time.
• Other time: determined by the following formula: other time = overall time - (task time + transition time)
• Repair activities time: the time spent going down a wrong path in addition to the time spent repairing those actions. (See the definition of repair activities below.) In asynchronous tasks, repair activities time = overall time - task time - transition time - idle time.
Preparation cost metrics
These metrics include the monetary amount of a system and the learning time for individual users.
• Dollar amount.
• Learning time.
• Learning cost = labor cost * time.
Expert judgments
Questions posed to experts
Yes/no/to what degree/quality questions relating to:
• Scalability
• Security
• Interoperability
• Collaboration management measures
• Communication
• Collaborative object support measures
• Transition measure
• Usability
• Product quality
• Task outcome
• The set of tools used to accomplish the task (the efficiency of the tools used)
User ratings
Questions posed to users
• Task outcome
– Product quality
• User satisfaction
– Satisfaction with the group process
– Satisfaction with task outcome or final solution
– Satisfaction with an individual's participation
– Satisfaction with the group participation
• Participation of:
– An individual
– The group
• Efficiency
– Efficiency of the system
• Consensus on:
– The solution
– The task outcome
• Awareness of:
– Other participants
– Objects
– Actions
• Communication
– Whether communication was possible
– The goodness of the communication
– Ability to get floor control
– Ability to ask a question / make a response
• Grounding
– Establishing common understanding with other participants
– Understanding what other participants were talking about
• Transition
– Smoothness of the transition
• Usability
– Standard user interface evaluation and usability questions
Tool usage
The frequency and distribution of tools used to accomplish a particular task. Also the way in which tools are used to accomplish a particular task.
• Can be measured as the deviation from a pre-determined “correct” tool usage established by experts. Experts can rate the use of sets of tools for each subtask, and tool usage can be measured as the deviation from those ratings.
Turn overlap
The overlap in communication by two or more people.
• Occurs when the start time of a turn happens before the end time of a previous turn. Turn overlaps can be counted and categorized into interruptions, backchannel communications, etc.
Repair activities
Include all errors, following down a wrong path, and all actions performed to correct those errors. Repair activities also include not knowing what to do and seeking help. In order to have this metric, there must be an established ‘right path’, or a list of possible ‘right paths’. Experts determine the ‘right paths’.
• The repair activities metric is made up of the number of undos and the number of cancels, as well as repetitive or useless keystrokes.
Conversational constructs
A general category of metrics that includes semantic and grammatical content of communication. Topic segmentation and labeling is the ability to segment dialog into multiple, distinct topics with each segment labeled as to its topic. A reference is the use of a grammatical construction (e.g., a definite noun phrase or a pronoun) to refer to a preceding word or group of words.
Given topic segmentation, we can measure:
• Topic mention – the number of times a topic is mentioned
• Distance of topic from a given turn
• Number of supportive responses (backchannelling, explicit agreement, etc.)
References can be counted by their occurrence, type, and distribution in a transcript. The use of pronouns can indicate group inclusion (you, we) or exclusion (I).
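As an illustration of counting references in a transcript, the sketch below tallies the pronouns that Table 8 associates with group inclusion (you, we) and exclusion (I), per speaker. The transcript layout (a list of speaker and text pairs), the function name, and the simple tokenization are assumptions made for this example. A real analysis would need a more careful treatment of contractions and other pronoun forms.

```python
import re
from collections import Counter

INCLUSIVE = {"you", "we"}  # group inclusion, as listed in Table 8
EXCLUSIVE = {"i"}          # group exclusion, as listed in Table 8


def pronoun_counts(transcript):
    """Count inclusive and exclusive pronouns per speaker.

    transcript: iterable of (speaker, text) pairs (hypothetical layout).
    Contractions such as "I'll" are kept as single tokens and therefore
    are not counted by this simple sketch.
    """
    counts = {}
    for speaker, text in transcript:
        tally = counts.setdefault(speaker, Counter())
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in INCLUSIVE:
                tally["inclusive"] += 1
            elif word in EXCLUSIVE:
                tally["exclusive"] += 1
    return counts


transcript = [
    ("ann", "I think we should start."),
    ("bob", "We agree, you go first."),
    ("ann", "I will."),
]
pronoun_tally = pronoun_counts(transcript)
```

The resulting counts are still only a metric; whether a given ratio signals inclusion or exclusion in a particular group is a matter of interpretation.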
The following is a list of methods, or data-collection instruments, used for
gathering the metrics described above:
Many of the metrics can be gathered by a variety of tools or methods. For example,
countables can be obtained by studying logs, observations, audio transcripts, and
video recordings. (Automated methods of collecting data are discussed in Section 5.4.)
Each method offers potential opportunities not available by other means but also has
inherent limitations. For this reason, we suggest using multiple methods to obtain
each metric. The possibilities are enumerated in Table 9 below.
Table 9. Table of Metrics vs. Data Collection Methods
Metrics Logs Questionnaires
Audio Video
x
x
x
x
Expert judgments
x
x
x
Length of turns
x
x
x
x
Turn overlap
x
x
Resource costs
x
x
x
x
x
Task completion
x
x
Time
x
x
x
x
Tool usage
x
x
x
User ratings
x
Conversational constructs
x
x
x
x
Repair activities
x
x
x
x
x
The framework described in Section 3 of this document divides the collaborative
system design space into four levels: requirement, capability, service, and
technology. Measures are associated with each level. For example, participation is a
requirement level measure correlated with various metrics: the total number of turns
per participant, the total number of turns per group, and user ratings. By contrast,
usability is a technology level measure.
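The countable part of the participation measure can be sketched as follows. The input format (one speaker label per turn) and the per-participant share of turns are assumptions added for illustration; the text above names only the per-participant and per-group turn totals, which would be combined with user ratings to form the measure.

```python
from collections import Counter


def participation(turn_speakers):
    """Summarize turn counts for a session.

    turn_speakers: list with one speaker label per turn (hypothetical input).
    Returns (turns per participant, total turns for the group,
    share of turns per participant). The share is an illustrative addition.
    """
    per_participant = Counter(turn_speakers)
    total = sum(per_participant.values())
    shares = {}
    if total:
        shares = {name: n / total for name, n in per_participant.items()}
    return per_participant, total, shares


per_participant, group_total, shares = participation(["ann", "bob", "ann", "cy"])
```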
For each measure, we present:
The “building blocks” of a measure are metrics and other measures. A building block
can be a combination of metrics or arithmetic formulas involving metrics. Note that
many of the components of each measure are examples; exact components may vary
depending upon the specific situation. Where available, we have included references to
supporting research. For some measures, we have also included metrics and measures
that might show some indirect relevance.
In general, all measures can be applied to all tasks. However, some measures may
have little or no relevance for a particular instantiation of a task type. For
example, measuring participation is usually not important to a dissemination of
information task (Type 9), but is often very important in a brainstorming and
creativity task (Type 2). With each measure, we include a list of what we believe to
be the applicable task types that the measure helps to evaluate. The task types are
enumerated and explained in Section 3 of this document.
The requirement level measures are summarized in Table 10 and include definitions
and components. Capability level measures evaluate general capabilities of
collaborative systems (see Section 3), and are summarized in Table 11. Service level
measures evaluate the services provided by a collaborative system; they are summarized
in Table 12. Finally, technology level measures aid in evaluating the implementation
of a collaborative system. Example technology level measures can be found in Table
13.
Table 10. Example Requirement Level Measures
Requirement Definition Metric/Measure Components
Task outcome
Measure of the state of a particular task. A set of artifacts is produced during task execution (e.g., documents, ideas, solutions, defined processes).
- countables (number of generated artifacts)
- task completion (yes/no)
- expert judgments of product quality
- user ratings of product quality
Cost
Measure of time invested in the system and the resources consumed in executing an activity
- preparation cost measures (monetary amount and learning time)
- countables (number of turns)
- length of turns
- execution time
User satisfaction
Subjective measure of satisfaction with respect to the four aspects of group work.
- user ratings; e.g., satisfaction with: group process, task outcome, individual’s participation, group participation
Scalability
Measure of a system’s accommodation for larger or smaller group size.
- time to complete particular tasks versus number of users
- resources needed to complete particular task versus number of users
- expert judgments - yes/no, to what degree
Security
Measure of the protection of information and resources
- expert judgments (yes/no to a list of features, to what degree)
Interoperability
Measure of how well system components work together, sharing functionality and resources
- expert judgments
- tool usage
Participation
Measure of an individual’s involvement in a group activity
- countables (e.g., number of sentences, number of floor turns)
- user ratings
Efficiency
Measure of group and system effectiveness and productivity
- percent efficiency = (task time - repair activities time) / execution time
- user ratings
- tool usage
- breakdown
Consensus
Measure of general agreement or group unity in outcome
- user ratings
- grounding
Table 11. Example Capability Level Measures
Capability Definition Metric/Measure Components
Awareness
Degree of “having ... realization, perception, or knowledge” (Webster) of surroundings and events.
- user ratings
- conversational constructs
Collaboration management
Measures assess support for coordinating collaboration; e.g., floor control, agenda support, document access control, etc.
- expert judgments
Communication (human to human)
Measure of the exchange of information in any of the following forms: verbal (spoken or written), visual, physical
- countables (number of turns per participant)
- turn overlap (simple overlap and interruptions)
- expert judgments
- user ratings (goodness of communication, getting floor control, getting other participants’ attention, ability to interrupt)
Grounding
Measure of how well common understanding is established.
- user ratings (e.g., reaching common understanding with other participants)
- countables (number of turns, length of turns)
- turn overlap
- conversational constructs
Collaborative object support
Measure that assesses support for collaborative objects; applied to shared workspace, object manipulation and management, etc.
- expert judgments
- tool usage (optimal set of tools used?)
Task focus
Measures the ability to concentrate on the task at hand.
- task focus = (overall time - transition time - other time) / overall time
Transition measures
Assesses support for transitions; used to evaluate collaboration startup, summarization, playback, archiving, object exporting and importing, distribution of objects, translation between modalities, meeting notification, etc.
- expert judgments
- user ratings (e.g., flow of transitions between tasks)
- conversational constructs
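The formula-based components in Tables 8, 10, and 11 (other time, percent efficiency, and task focus) can be worked through with a small Python sketch. The function names and sample values are hypothetical, all times are assumed to be in the same unit, and "execution time" in the efficiency formula is assumed here to be the overall task execution time. The ratios are returned as fractions, which can be multiplied by 100 to express a percentage.

```python
def other_time(overall, task, transition):
    """other time = overall time - (task time + transition time)"""
    return overall - (task + transition)


def percent_efficiency(task, repair, execution):
    """(task time - repair activities time) / execution time, as a fraction."""
    if execution <= 0:
        raise ValueError("execution time must be positive")
    return (task - repair) / execution


def task_focus(overall, transition, other):
    """(overall time - transition time - other time) / overall time."""
    if overall <= 0:
        raise ValueError("overall time must be positive")
    return (overall - transition - other) / overall


# Hypothetical session, in minutes: 60 overall, 40 on task,
# 10 in transitions, 5 of the task time spent on repair activities.
overall, task, transition, repair = 60.0, 40.0, 10.0, 5.0
other = other_time(overall, task, transition)
efficiency = percent_efficiency(task, repair, overall)
focus = task_focus(overall, transition, other)
```

Because other time is defined as overall time minus task time and transition time, the task focus formula reduces to task time divided by overall time, which the sample values confirm.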
Table 12. Example Service Level Measures
Service Definition Metric/Measure Components
Breakdown
Measures how often the user has to rationalize a problem experienced. Breakdowns can occur in communication, in coordination, in the system components, etc.
- conversational constructs
- repair activities
Tool usage
The degree to which the optimal tools are used for a particular task.
- tool usage (which tools were used and how often?)
- expert judgments (were the right tools used?)
Table 13. Example Technology Level Measures
Technology Definition Metric/Measure Components
Usability
Evaluates the ease, accessibility, and intuitiveness of the specific graphical user interfaces of the system tools and components. Since usability evaluation is done on specific user interfaces, the usability measures are realized at the technology level although the component measures are also based, in part, on measures taken at different levels of the framework.
- expert judgments (standard user interface evaluation questions)
- user ratings (standard user interface evaluation questions)
- tool usage (which tools were used and how often? Was the set of optimal tools used?)
- repair activities
- breakdown (defined at the service level)
- awareness (defined at the capability level)
Specific technology
- tool usage (which tools were used and how often?)
- expert judgments (were the right tools used?)
We often need to collect data about running systems. Since videotaping every session can be difficult, automatic logging can be important in developing easy, repeatable evaluation scenarios. A multi-modal logging capability can support these efforts.
The MITRE Multi-Modal Logger (MML) is an example of this class of logging tool.
MITRE’s MML can be used to log multi-modal data while collaborative systems are
running. This data can be fine-grained (individual X or window events, for instance)
or coarse-grained (a record of which windows the user interacts with); automatically
gathered (via instrumentation) or manually created (via post-hoc annotation). This
data can be gathered at the level of the physical or virtual device (for example,
window events or the output of an audio device); at the level of the interface (for
example, a record of menu selections made and the content of text entry fields); or at
the level of the application (a record of actions taken, such as a retrieval of
information or a command issued). This information can also be in a variety of
modalities, such as text, images, and audio.
Since this information will typically be gathered for multiple users and multiple
interactions with the system in question, the notion of a “trial” or “session” is
supported. In addition, each trial might require information to be gathered from
multiple components simultaneously (for example, when a speech recognizer is used in
conjunction with an independent multi-modal system). Therefore, the MML also supports
sharing each trial among multiple components, potentially running on different hosts.
MITRE's MML provides an application programming interface (API) for instrumenting
existing applications. It also provides a set of tools for reviewing and annotating
data collected via instrumentation.
MITRE’s MML offers a solution to the question of what granularity and levels of
data may be collected. The instrumenter inserts whatever logger calls are desired into
the source code and is thus in complete control of where, in the code, the log entries
are generated, how many are generated, and what data types are assigned to them. In
general, one can instrument any application for which the source code is available.
MITRE has also developed a log review and annotation tool to distribute with the
logger. It allows users to view the data that has been logged for a given session. One
may view all the data or select a subset by application, data type and/or timestamp.
The data is displayed along a scrollable timeline, with the data sorted into streams
by application and data type. This tool may also be used to add post-hoc annotations
to the data. There is also a replay facility that allows the reviewer to replay the
logged interaction in approximately real time.
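To make the review step concrete, the sketch below shows one way logged entries could be filtered by application, data type, and timestamp, and then grouped into streams for a timeline display. It is not the MML API, which is not specified in this document: the `LogEntry` record, its fields, and both function names are hypothetical stand-ins for whatever the logger actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical record layout, not the MML data model.
    timestamp: float   # seconds from the start of the session
    application: str
    data_type: str
    payload: str


def select(entries, application=None, data_type=None, start=None, end=None):
    """Return the entries matching every filter that is given, in time order."""
    chosen = [
        e for e in entries
        if (application is None or e.application == application)
        and (data_type is None or e.data_type == data_type)
        and (start is None or e.timestamp >= start)
        and (end is None or e.timestamp <= end)
    ]
    return sorted(chosen, key=lambda e: e.timestamp)


def streams(entries):
    """Group entries into time-ordered streams keyed by (application, data type)."""
    grouped = {}
    for e in sorted(entries, key=lambda e: e.timestamp):
        grouped.setdefault((e.application, e.data_type), []).append(e)
    return grouped


session_log = [
    LogEntry(2.0, "whiteboard", "menu_selection", "Edit > Undo"),
    LogEntry(0.5, "chat", "text", "hello"),
    LogEntry(1.0, "whiteboard", "window_event", "focus"),
    LogEntry(3.0, "chat", "text", "ready?"),
]
chat_text = select(session_log, application="chat", data_type="text")
early = select(session_log, end=1.0)
by_stream = streams(session_log)
```

Counts taken from such a selection (for example, the number of undo selections in a stream) are the kind of countable metrics that Table 9 lists as obtainable from logs.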
The MITRE MML tool suite and documentation are available for downloading at
http://www.mitre.org/technology/logger .
5.2.1 Data Collection Methods
Observations
5.3 Measures
5.4 Data Collection Methods: Logging