In the preceding sections, we presented a framework for classifying collaborative systems and an approach to evaluation that employs scenarios. Here we enumerate the measures and metrics introduced in Section 3 and discuss methods for gathering the metrics.
The measures and metrics described here are relevant to laboratory experiments. The goal of the Evaluation Working Group is to define inexpensive evaluation technology that can be reproduced in a laboratory. Hence, the measures will not address organizational impact and other measures that require fielding production-quality systems.
Experiments involving human subjects, or sets of subjects, are expensive and time-consuming, in terms of both collection and analysis. In many cases, the measures must be developed and validated before they can be applied with confidence. Despite these difficulties, we lay out here some options for evaluating collaborative systems at all four levels of the framework discussed in Section 3.
We begin with a graphical overview of the methods, metrics and measures, and then we present the measures found at each level of the framework. Metrics will be introduced in the context of measures and then described further in their own section. Finally, we will describe the methods used to collect the metrics we have identified.
The diagram below illustrates the relationship of data collection methods, metrics, measures, and human judgment. The diagram also shows how the measures map onto the four levels of the collaborative framework.
Figure: Overview of the levels of measures
At the bottom, we use data collection methods, such as a logging tool or videotaping, to gather metrics on a system. Metrics are indicators of system, user, and group performance that can be observed, singly or collectively, while executing scenarios. Metrics such as time, length of turn, and other countable events are directly measurable and can be collected automatically.
Measures can be taken at each of the four levels of the collaborative framework. Measures are derived from interpretations of one or more metrics. For example, task completion time (a requirement-level measure) is based on start and end time metrics. A measure can also be a combination of interpreted metrics and other measures. A complicated measure, like efficiency, is partially derived from the interpretation of metrics like time, user ratings, and tool usage. In addition, measures of system breakdown (taken at the service level) contribute to efficiency.
A simple way to distinguish between metrics and measures is the following: a metric is an observable value, while a measure assigns meaning to that value by applying human judgment.
Human judgment is important as humans themselves can be a source of data. Questions to both experts and users of the system can provide valuable data points in the evaluation of a system.
We now shift perspective and discuss the measures, metrics, and methods in greater detail, from the top down.
The framework described in Section 3 of this document divides the collaborative system design space into four levels: requirement, capability, service, and technology. Evaluations may be conducted at all of these levels; that is, we have identified measures associated with each level. For example, participation is a requirement-level measure correlated with various metrics: the total number of turns per participant, the total number of turns per group, and user ratings. By contrast, usability is a technology-level measure.
For each measure, we present:
The ``building blocks'' of a measure are metrics and other measures. A building block can itself be a combination of metrics or an arithmetic formula involving metrics. It is important to note that many of the metric and measure components listed for each measure are hypotheses on the part of the authors. Where available, we have included references to supporting research. For some measures, we have also included additional metrics and measures that might show some indirect relevance. Metrics are detailed and discussed more fully in subsection 5.3.
In general, all measures can be applied to all tasks. However, some measures may have little or no relevance for a particular instantiation of a task type. For example, measuring participation may not be important to a dissemination of information task (Type 9), whereas it appears to be very important in a brainstorming and creativity task (Type 2). With each measure, we include a list of what we believe to be the applicable task types that the measure helps to evaluate. The task types are enumerated and explained in section 3 of this document.
When discussing the requirement and capability level measures, we will also address the four aspects of group work: work task, transition, social protocol, and group characteristics (see section 3).
The requirement level measures include:
Task outcome is a measure of the state of a particular task. There is a set of artifacts, produced during execution of the task, such as documents, ideas, solutions, and defined processes.
Aspects of group work evaluated
Metric and measure components
See the different task types in Section 3 for specific metrics and measures in addition to the general ones listed below.
Additional metrics and measures that may be relevant or correlated
Associated task types
Cost is the measure of time invested in the system and the resources consumed in executing an activity.
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
User satisfaction is a subjective measure of satisfaction with respect to the four aspects of group work.
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Scalability is the measure of how well a system accommodates larger or smaller group sizes.
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Security is a measure of the protection of information and resources.
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Interoperability is a measure of how well system components work together, sharing functionality and resources.
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Participation is the measure of an individual's involvement in a group activity.
Aspects of group work evaluated
Metric and measure components
Tsai (1977) has found that participation measures involving three different definitions of a unit act are highly correlated:
Group participation can be calculated by the number of contributing participants divided by the number of total participants.
Equality of participation is 100% if each group member has the same value of participation.
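The group participation ratio described above can be sketched as follows. This is only an illustration: the function names are ours, and we assume that a "contributing" participant is one with at least one recorded turn. The equality check covers only the 100% case stated in the text and does not attempt to grade partial equality.

```python
from collections import Counter


def turns_per_participant(turns, participants):
    """Count turns for every participant, including those with no turns."""
    counts = Counter(turns)
    return {p: counts.get(p, 0) for p in participants}


def group_participation(turn_counts):
    """Number of contributing participants divided by total participants.

    Assumption: a participant contributes if he or she took at least one turn.
    """
    if not turn_counts:
        raise ValueError("need at least one participant")
    contributing = sum(1 for n in turn_counts.values() if n > 0)
    return contributing / len(turn_counts)


def participation_is_equal(turn_counts):
    """True when every member has the same participation value (100% equality)."""
    return len(set(turn_counts.values())) <= 1


# Hypothetical session: each entry names the participant who took a turn.
participants = ["ann", "bob", "cy", "dee"]
turns = ["ann", "bob", "ann", "cy", "bob", "ann"]

counts = turns_per_participant(turns, participants)
print(counts)
print(group_participation(counts))
print(participation_is_equal(counts))
```

In this invented session three of the four participants take at least one turn, so the ratio is 3/4.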
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Less-associated task types
Efficiency is a measure of group and system effectiveness and productivity.
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Consensus is the measure of general agreement or group unity in outcome.
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Capability level measures evaluate general capabilities of collaborative systems (see Section 3). The capability measures include the following:
Awareness is the degree of ``having ... realization, perception, or knowledge'' (Webster) of surroundings and events. For example, awareness of:
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
The set of collaboration management measures assesses support for coordinating collaboration.
These measures are used to evaluate the following (non-exhaustive) list of capabilities:
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Communication is a measure of the exchange of information in any of the following forms:
Aspects of group work evaluated
Metric and measure components
General questions about:
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Grounding is a measure of how well common understanding is established.
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task type
This set of measures assesses support for collaborative objects.
These measures are used to evaluate the following (non-exhaustive) list of capabilities:
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Task focus measures the ability to concentrate on the task at hand.
Aspects of group work evaluated
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
This set of measures assesses support for transitions.
These measures are used to evaluate the following (non-exhaustive) list of capabilities:
Aspects of group work evaluated
Metric and measure components
General questions about:
Additional metrics and measures that may be relevant or correlated
Associated task types
Service level measures evaluate the services provided by a collaborative system. This type of measure includes:
Breakdown measures how often the user has to stop and rationalize a problem that he or she has experienced. Breakdowns can occur in communication, in coordination, in the system components, etc.
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Tool usage is the degree to which the optimal tools are used for a particular task.
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
Measures on the lowest level of the framework, the technology level, evaluate the implementation of a collaborative system. The technology consists of software or hardware components, interfaces, and component connections. The measures include:
Usability measures evaluate the ease, accessibility, and intuitiveness of the specific graphical user interfaces of the system tools and components. Because usability evaluation is performed on specific graphical user interfaces, the usability measures are realized at the technology level, although their components are also based, in part, on measures taken at other levels of the framework.
Metric and measure components
Additional metrics and measures that may be relevant or correlated
Associated task types
(EWG contributors are preparing a list of pointers to industry standards)
The following list of metrics comprises the observable data elements that we have identified as useful indicators of system and group performance. A single metric is just a number (e.g., number of turns) or a set of numbers (e.g., start time and end time).
Metrics can also be used in combination. For example, the repair activities metric is built up in part from the number of undos and the number of cancels, and it reflects deviation from some established `right path'.
Metrics are associated with an interpretation in order to build measures. For example, the measure of task completion time is the interpreted end time of a task minus the interpreted start time of the task.
As another example of the relationship between metrics and measures, consider the turn overlap metric. Overlap of speaker A by speaker B can be counted if the start time of speaker B is greater than the start time of speaker A but less than the end time of speaker A. Further interpretation is required to determine whether a particular occurrence of turn overlap is an attempt to gain the floor (interruption in communication) or a back-channel response indicating attentiveness (support, grounding).
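The counting rule for turn overlap can be sketched as follows. The `Turn` record and the sample times are hypothetical; the comparison itself follows the rule stated above. We additionally assume that only turns by different speakers are compared. The sketch produces a count only and leaves the interpretation (interruption versus back-channel) to the evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float  # seconds from session start
    end: float


def overlaps(a, b):
    """True if b overlaps a: b starts after a starts but before a ends."""
    return a.speaker != b.speaker and a.start < b.start < a.end


def count_overlaps(turns):
    """Count ordered pairs (a, b) in which turn b overlaps turn a."""
    return sum(1 for a in turns for b in turns if overlaps(a, b))


# Hypothetical transcript timings.
turns = [
    Turn("A", 0.0, 5.0),
    Turn("B", 4.0, 7.0),   # starts while A is still speaking
    Turn("A", 7.5, 9.0),
    Turn("C", 8.0, 8.5),   # starts while A is still speaking
]

print(count_overlaps(turns))
```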
For each metric, we present:
Where applicable, the metrics are enumerated or they are broken down into finer granularity.
We attempt to provide some guidelines for gathering each metric. It is important for the evaluator to record how each metric was observed so that comparisons can be made with data collected from multiple experiments. Methods used to gather each metric are listed at the end of this section.
Some metrics are:
Countables are any items that can be counted, once a clear definition has been developed. The difficulty with countables is in ensuring that the definitions are consistent and that items are counted in the same manner.
Examples of countables:
Length of turns refers to the length or size of any type of turn. A turn is a single unit of communication or activity.
For example, depending on the type of turn, its length may be expressed as a number of sentences, a number of words, or a size in bytes.
Task completion refers to whether or not a task is completed. In some cases, the degree of completion can be measured.
Time, as a metric, supports a number of measures. There are base metrics for start and end time which can be interpreted to support the list of measures that follows. Time can be measured with respect to an individual, as a sum of all individual times, and/or as the longest individual time in the group.
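As an illustration of how the base start and end time metrics might be combined, the sketch below computes a time for each individual, the sum of all individual times, and the longest individual time in the group. The function names and sample values are hypothetical, and times are assumed to be in seconds from the start of the session.

```python
def elapsed(start, end):
    """Interpreted end time minus interpreted start time."""
    if end < start:
        raise ValueError("end time precedes start time")
    return end - start


def summarize_times(times_by_user):
    """Summarize (start, end) pairs per user in the three ways named above."""
    per_user = {user: elapsed(s, e) for user, (s, e) in times_by_user.items()}
    return {
        "individual": per_user,
        "sum": sum(per_user.values()),
        "longest": max(per_user.values()),
    }


# Hypothetical start and end times for three group members.
times_by_user = {"ann": (0, 120), "bob": (10, 95), "cy": (5, 140)}

summary = summarize_times(times_by_user)
print(summary)
```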
Overall task execution time is the interval from task beginning to task end. This is the total time it takes the collective group to complete execution of the task.
Task time is the time spent on the actual task. This does not include transition time or time spent doing non-task-related activities.
Transition time is the amount of time it takes the group to transition from one task to another. This includes set up time.
Other time is determined by the following formula:
other time = overall time - (task time + transition time)
Repair activities time is the time spent going down a wrong path in addition to the time spent repairing those actions. (See the definition of repair activities below.)
In asynchronous tasks, repair activities time = overall time - task time - transition time - idle time
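The two formulas above amount to simple arithmetic on the time metrics. A minimal sketch follows, with hypothetical function names and sample values in seconds. The check for a negative result is our addition, intended to catch inconsistent inputs.

```python
def other_time(overall, task, transition):
    """other time = overall time - (task time + transition time)"""
    result = overall - (task + transition)
    if result < 0:
        raise ValueError("component times exceed overall time")
    return result


def repair_time_async(overall, task, transition, idle):
    """Asynchronous tasks: overall - task - transition - idle."""
    result = overall - task - transition - idle
    if result < 0:
        raise ValueError("component times exceed overall time")
    return result


# Hypothetical one-hour session, all values in seconds.
overall, task, transition, idle = 3600, 2400, 600, 450

print(other_time(overall, task, transition))
print(repair_time_async(overall, task, transition, idle))
```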
Time may also be broken down into finer granularity, if desired. For example, time spent on each type of object manipulation can be used as a metric.
Preparation cost metrics include the monetary cost of a system and the learning time for individual users.
Expert judgments are questions posed to experts.
General questions about:
User ratings are questions posed to users.
These include general questions about:
Tool usage is the frequency and distribution of tools used to accomplish a particular task.
Tool usage is also the way in which tools are used to accomplish a particular task. This can be measured as the deviation from a pre-determined correct tool usage established by experts. Experts can rate the use of sets of tools for each subtask, and tool usage can be measured as the deviation from those ratings.
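One possible way to operationalize both senses of tool usage is sketched below. It rests on several assumptions of ours. Tool use is recorded as (subtask, tool) pairs. Experts supply a set of acceptable tools for each subtask. Deviation is taken to be the fraction of tool uses that fall outside the expert set. Other definitions of deviation, such as graded expert ratings, are equally plausible.

```python
from collections import Counter


def tool_frequency(events):
    """Relative frequency of each tool over all recorded uses."""
    counts = Counter(tool for _, tool in events)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


def usage_deviation(events, expert_tools):
    """Fraction of tool uses outside the expert-approved set for the subtask."""
    if not events:
        raise ValueError("no tool usage recorded")
    off_path = sum(
        1 for subtask, tool in events
        if tool not in expert_tools.get(subtask, set())
    )
    return off_path / len(events)


# Hypothetical observations and expert judgments.
events = [
    ("draft", "editor"),
    ("draft", "chat"),
    ("review", "whiteboard"),
    ("review", "editor"),
]
expert_tools = {"draft": {"editor"}, "review": {"editor", "chat"}}

print(tool_frequency(events))
print(usage_deviation(events, expert_tools))
```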
Turn overlap is the overlap in communication by two or more people.
Turn overlap occurs when the start time of a turn happens before the end time of a previous turn. Turn overlaps can be counted.
Repair activities include all errors, going down a wrong path, and all actions performed to correct those errors. Repair activities also include not knowing what to do and seeking help. In order to have this metric, there must be an established `right path', or a list of possible `right paths'. Experts determine the `right paths'. The repair activities metric is made up of the number of undos and the number of cancels, as well as repetitive or useless keystrokes.
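The countable part of this metric, the number of undos and cancels, could be extracted from an event log along the following lines. The event names are hypothetical and would depend on how the application is instrumented. Detecting repetitive or useless keystrokes, or deviation from an expert `right path', requires further interpretation and is not attempted here.

```python
from collections import Counter

# Assumed event labels; a real log would use whatever names the
# instrumented application emits.
REPAIR_EVENTS = ("undo", "cancel")


def repair_counts(events):
    """Count undo and cancel events, reporting zero for any that are absent."""
    counts = Counter(e for e in events if e in REPAIR_EVENTS)
    return {name: counts.get(name, 0) for name in REPAIR_EVENTS}


# Hypothetical sequence of logged user actions.
events = ["open", "type", "undo", "type", "cancel", "undo", "save"]

counts = repair_counts(events)
print(counts)
print(sum(counts.values()))
```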
Conversational constructs is a general category of metrics that includes semantic and grammatical content of communication.
Given topic segmentation, we can measure:
The following is a list of methods, or data-collection instruments, used for gathering the metrics described above:
Many of the metrics can be gathered by a variety of tools or methods. Each method offers potential opportunities not available by other means but also has inherent limitations. For this reason, we suggest using multiple methods to obtain each metric. The possibilities are enumerated in the matrix below.
For example, countables can be measured by studying logs, observations, audio transcripts, and video recordings.
In our evaluation of many of the above items, the need frequently arises to collect data about running systems. Because videotaping every session can be cumbersome and costly, the ability to automatically log many aspects of the interaction will be important in developing easy, repeatable evaluation suites. To this end, we require a multi-modal logging capability to support our evaluation efforts.
A second motivation for logging is that logged data can be used for feedback and training of adaptive systems; however, this aspect will not be explored further here.
A limitation of the logging approach to data collection is that we can only log what happens on the workstation, or other actions the workstation is ``aware'' of.
To support our evaluation efforts, MITRE is providing a tool which can be used to log multi-modal data while collaborative systems are running. This data can be fine-grained (individual X events, for instance) or coarse-grained (a record of which windows the user interacts with); automatically gathered (via instrumentation) or manually created (via post-hoc annotation); and at the level of the physical or virtual device (say, X events or the output of /dev/audio), the interface (say, a record of menu selections made and the content of text entry fields), or the application (a record of actions taken, such as a retrieval of information or a command issued). This information can also be in a variety of modalities, such as text, images, and audio.
Since this information will typically be gathered for multiple users and multiple interactions with the system in question, the notion of a ``trial'' or ``session'' is supported. In addition, each trial might require information to be gathered from multiple components simultaneously (say, when a speech recognizer is used in conjunction with an independent multi-modal system). Therefore, sharing each trial among multiple components, potentially running on different hosts, is also supported.
MITRE's multi-modal logger provides an easy-to-use API for instrumenting existing applications; it embodies a database structure which groups data points by application, and applications by session; it supports the typing of data points via MIME types; and it provides a set of tools for reviewing and annotating data collected via instrumentation.
MITRE's logger offers a simple solution to the question of what granularities and levels of data may be collected: the instrumenter inserts whatever logger calls are desired into the source code and is thus in complete control of where, in the code, the log entries are generated, how many are generated, and what data types are assigned to them. Anything that one has access to in the source code can be logged.
In most cases, of course, access to the source code is required. However, this is not always true. For instance, if we want to log windowing events, tools such as X-Runner, which ``instruments'' X applications by substituting its own version of the Xt library, will soon generate windowing events for consumption by other applications (such as our logger). In this case, we only need to link against the new Xt library, rather than needing access to the source code for a full compilation.
The logger is available in a number of programming languages. While the API is written natively in C (and thus also available in C++), MITRE has already created bindings for Python and for the LambdaMOO programming language used in our Collaborative Virtual Workspace; bindings for Tcl are also planned.
The logger supports both mSQL and ``flat'' databases. mSQL is a lightweight relational database implementation which is free for many types of non-commercial use. If mSQL is not available, the ``flat'' option can be used, which causes the logger system to create and maintain its own database in a group of flat binary files.
Log records are grouped by a session name, and within a session, by application name. A number of different applications may log to the same session. Time stamps are automatically included in the log records.
Each log record sent to the database must contain a data type, which may be associated with a MIME type. The data type describes the data, and the MIME type allows other applications to use it in intelligent ways, such as playing audio or video files correctly.
The logger system keeps a table of predefined data types, to which users may add (or delete) types. A data type includes a name, an optional associated MIME type, and a flag indicating whether the actual data or a pointer to a file containing it will be stored in the database.
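The record structure described above can be pictured with the following in-memory sketch. It is not MITRE's API. Every class, method, and field name here is hypothetical and chosen only to mirror the description: a table of data types, an optional MIME type, a flag for storing a file pointer instead of the data, automatic time stamps, and grouping by session and then by application.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DataType:
    name: str
    mime_type: Optional[str] = None   # optional associated MIME type
    by_reference: bool = False        # True: payload is a path to a file


@dataclass
class LogRecord:
    data_type: DataType
    payload: str
    timestamp: float = field(default_factory=time.time)  # added automatically


class SessionLog:
    """Toy stand-in for the database: session -> application -> records."""

    def __init__(self):
        self.types = {}
        self.sessions = {}

    def define_type(self, data_type):
        self.types[data_type.name] = data_type

    def delete_type(self, name):
        del self.types[name]

    def log(self, session, application, type_name, payload):
        # Every record must carry a known data type (KeyError otherwise).
        record = LogRecord(self.types[type_name], payload)
        apps = self.sessions.setdefault(session, {})
        apps.setdefault(application, []).append(record)
        return record


log = SessionLog()
log.define_type(DataType("menu_selection", "text/plain"))
log.define_type(DataType("speech", "audio/basic", by_reference=True))

# Two applications logging to the same session.
log.log("trial-1", "whiteboard", "menu_selection", "File>Save")
log.log("trial-1", "recognizer", "speech", "/tmp/utterance1.au")
print(sorted(log.sessions["trial-1"]))
```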
MITRE is also developing a log review and annotation tool to distribute with the logger. It allows users to view the data that has been logged for a given session. One may view all the data or select a subset by application, datatype and/or timestamp. The data is displayed along a scrollable timeline, with the data sorted into streams by application and datatype.
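The selection and stream grouping described above might look like the following sketch. It does not reproduce the review tool itself. Records are modeled as plain dictionaries with hypothetical keys, and the function names are ours.

```python
from collections import defaultdict


def select(records, application=None, data_type=None, start=None, end=None):
    """Return records matching every filter supplied, ordered by time stamp."""
    chosen = [
        r for r in records
        if (application is None or r["application"] == application)
        and (data_type is None or r["data_type"] == data_type)
        and (start is None or r["timestamp"] >= start)
        and (end is None or r["timestamp"] <= end)
    ]
    return sorted(chosen, key=lambda r: r["timestamp"])


def streams(records):
    """Sort records into time-ordered streams keyed by (application, data type)."""
    grouped = defaultdict(list)
    for r in sorted(records, key=lambda r: r["timestamp"]):
        grouped[(r["application"], r["data_type"])].append(r)
    return dict(grouped)


# Hypothetical contents of one logged session.
records = [
    {"application": "whiteboard", "data_type": "x_event", "timestamp": 12.0},
    {"application": "chat", "data_type": "text", "timestamp": 3.5},
    {"application": "chat", "data_type": "text", "timestamp": 20.0},
    {"application": "chat", "data_type": "audio", "timestamp": 8.0},
]

print([r["timestamp"] for r in select(records, application="chat", data_type="text")])
print(sorted(streams(records)))
```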
This tool may also be used to add post-hoc annotations to the data, linking sets of points to user-produced text. The annotations will also appear as datapoints in the log session, and may themselves have further annotations applied to them.
There is also a replay facility which will allow the reviewer to replay the logged interaction in approximately real time. Currently an initial implementation exists, which will be improved and revised as users begin to provide feedback.