Section 5

Metrics and Measures

5.1 Introduction

In the preceding sections, we presented a framework for classifying collaborative systems and an approach to evaluation that employs scenarios. Here we enumerate the metrics and measures introduced in Section 3 and discuss techniques for gathering the metrics.

Metrics are indicators of system, user, and group performance that can be observed, singly or collectively, while executing scenarios. Metrics – such as time, length of turn, and other countable events – are directly measurable and can often be collected automatically.

Measures can be taken at each of the four levels of the collaborative framework. Measures are derived from interpretations of one or more metrics. For example, task completion time (a requirement-level measure) is based on start and end time metrics. For asynchronous tasks, it may be useful to distinguish time on task from elapsed time. A measure can also be a combination of interpreted metrics and other measures. A complicated measure, like efficiency, is partially derived from the interpretation of metrics such as time, as well as user ratings and tool usage. In addition, measures of system breakdown (taken at the service or technology levels) contribute to efficiency.
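The distinction between elapsed time and time on task can be made concrete with a minimal sketch. It assumes that an asynchronous task is recorded as a list of non-overlapping work intervals, each a pair of start and end time metrics; the function names and the sample timestamps are illustrative assumptions, not part of the framework.

```python
from datetime import datetime, timedelta

def elapsed_time(start: datetime, end: datetime) -> timedelta:
    """Elapsed time: interval from the start time metric to the end time metric."""
    return end - start

def time_on_task(work_intervals: list[tuple[datetime, datetime]]) -> timedelta:
    """Time on task: summed duration of the intervals actually spent working.

    Assumes the intervals do not overlap (e.g., one participant's sessions).
    """
    return sum((end - start for start, end in work_intervals), timedelta())

# Hypothetical asynchronous task: two work sessions on consecutive days.
sessions = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 45)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),
]
elapsed = elapsed_time(sessions[0][0], sessions[-1][1])
on_task = time_on_task(sessions)
print("elapsed:", elapsed)        # spans both days, including idle time
print("time on task:", on_task)   # only the two work sessions
```

Here the elapsed time is more than a day while the time on task is 75 minutes, which is why the two are worth reporting separately for asynchronous work.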

A simple way to distinguish between metrics and measures is this: a metric is an observable value, while a measure attaches meaning to that value by applying human judgment.

As another example of the metric and measure relationship, consider the turn overlap metric. Overlap of speaker A by speaker B can be counted if the start time of speaker B is greater than the start time of speaker A but less than the end time of speaker A. Further interpretation is required to determine if a particular occurrence of turn overlap is an attempt to gain the floor (interruption in communication) or a back-channel response indicating attentiveness (support, grounding). This might be an interesting measure at the capability level, to understand the ease and effectiveness of multi-person conversational interaction.
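The counting rule for turn overlap can be sketched as follows. The `Turn` record, the use of seconds as the time unit, and the requirement that the two turns belong to different speakers are assumptions made for illustration. The sketch only counts overlaps; deciding whether an overlap is an interruption or a back-channel response still requires human interpretation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float  # seconds from session start (assumed unit)
    end: float

def overlaps(a: Turn, b: Turn) -> bool:
    """True if b overlaps a: b starts after a starts but before a ends."""
    return a.speaker != b.speaker and a.start < b.start < a.end

def count_overlaps(turns: list[Turn]) -> int:
    """Count ordered pairs (a, b) of turns in which b overlaps a."""
    return sum(overlaps(a, b) for a in turns for b in turns)

turns = [
    Turn("A", 0.0, 5.0),
    Turn("B", 4.2, 6.0),   # starts while A is still speaking
    Turn("A", 6.5, 9.0),
    Turn("B", 9.0, 10.0),  # starts exactly when A ends: not counted
]
print(count_overlaps(turns))  # 1
```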

The metrics and measures described here are relevant to laboratory experiments. The goal of the Evaluation Working Group is to define inexpensive evaluation technology that can be reproduced in a laboratory. Hence, they do not address organizational impact or other measures that require fielding production-quality systems.

Experiments involving human subjects, or sets of subjects, are expensive and time-consuming, in both data collection and analysis. In many cases, the measures must be developed and validated before they can be applied with confidence. Despite these difficulties, we describe options for evaluating collaborative systems at all four levels of the framework discussed in Section 3.

We begin with an overview of the methods, metrics, and measures, and then present the measures found at each level of the framework. Metrics, together with the methods used to collect them, are introduced first because they are components of measures. Next we discuss measures, before turning to automated logging techniques that may aid data collection and analysis.

Figure 2 illustrates the relationship among data collection methods, metrics, measures, and human judgment. The diagram also shows how the measures of the four levels of the collaborative framework are nested, emphasizing the mappings between these levels. As illustrated at the bottom of the diagram, we use data collection methods, such as a logging tool or videotaping, to gather metrics on a system (the next level up). Metrics provide the raw input necessary for refining measures for requirements, capabilities, services, and technologies. Human judgments, from both experts and users, are associated with all four levels.

 
Figure 2. Overview of the Levels of Measures

 

Metrics based on human judgment also support measures of the system. Questions posed to both experts and users of the system can provide valuable data points in an evaluation.

5.2 Metrics

The following table of metrics comprises the observable data elements that were identified as useful indicators of system and group performance. A single metric is just a number (e.g., number of turns) or a set of numbers (e.g., start time and end time). Metrics can also be used in combination (e.g., the repair activities metric is partially built up from the number of undos and the number of cancels, and is measured as the deviation from some established ‘right path’). Table 8 presents a definition of each metric and examples of ways to capture it. Where applicable, the metrics are enumerated or broken down into finer granularity.

We attempt to provide some guidelines for gathering each metric. It is important for the evaluator to record how each metric was observed so that comparisons can be made with data collected from multiple experiments. Methods used to gather each metric are listed in Section 5.2.1.

 

Table 8.  Metrics

Metric

Definition

Examples

Countables

Any items that can be counted, once a clear definition has been developed. The difficulty with countables is in ensuring that the definitions are consistent and that items are counted in the same manner.

 

  Turns (spoken utterances, e-mail messages, typed lines, etc.)

  Moves (a subtype of turn, e.g., turns taken in a game)

  Steps (number of high-level steps to produce result; number of mouse clicks or button presses towards accomplishing a task)

  Trials (number of aborted attempts and successful attempts to reach goal)

  Ideas generated

  Responses (replies to e-mail message; responses to a question)

  Cancel button presses (Note that there could be different reasons for this: 1) the user made a mistake, or 2) the user changed his/her mind)

Length of turn

The length or size of any type of turn. A turn is a single unit of communication or activity.

 

Length of spoken utterance = end time (of utterance) - start time

Length of e-mail message = number of lines, number of sentences, number of words, or size in bytes

Length of turn in chat session = number of sentences or number of words

Length of move (subtype of turn) = time at end of move - time at start of move

Task completion

Whether or not a task is completed.

Yes or no (completion or non-completion). In some cases, the degree of completion can be measured.

Time

Time, as a metric, supports a number of measures. Base metrics for start and end time can be interpreted to support various measures. Time can be measured with respect to an individual, as a sum of all individual times, and/or as the longest individual time in the group.

 

  Overall task execution time: the interval from task beginning to task end. This is the total time it takes the collective group to complete execution of the task.

  Task time: the time spent on the actual task. This does not include transition time or time spent doing non-task-related activities.

  Transition time: the amount of time it takes the group to transition from one task to another. This includes set up time.

  Other time: determined by the following formula: other time = overall time - (task time + transition time)

  Repair activities time: the time spent going down a wrong path in addition to the time spent repairing those actions. (See the definition of repair activities below.) In asynchronous tasks, repair activities time = overall time – task time – transition time – idle time.

Preparation cost metrics

These metrics include the monetary amount of a system and the learning time for individual users.

  Dollar amount.

  Learning time.

  Learning cost = labor cost * time.

Expert judgments

Questions posed to experts

Yes/no/to what degree/quality questions relating to:

Scalability

Security

Interoperability

Collaboration management measures

Communication

Collaborative object support measures

Transition measure

       usability

       product quality

       task outcome

       the set of tools used to accomplish the task (the efficiency of the tools used)

User ratings

Questions posed to users

  Task outcome

      Product quality

  User satisfaction

      Satisfaction with the group process

      Satisfaction with task outcome or final solution

      Satisfaction with an individual's participation

      Satisfaction with the group participation

  Participation of:

      An individual

      The group

  Efficiency

      Efficiency of the system

  Consensus on:

      The solution

      The task outcome

  Awareness of:

      Other participants

      Objects

      Actions

  Communication

      Whether communication was possible

      The goodness of the communication

      Ability to get floor control

      Ability to ask a question / make a response

  Grounding

      Establishing common understanding with other participants

      Understanding what other participants were talking about

  Transition

      Smoothness of the transition

  Usability

      Standard user interface evaluation and usability questions

Tool usage

The frequency and distribution of tools used to accomplish a particular task.  Also the way in which tools are used to accomplish a particular task.

Can be measured as the deviation from a pre-determined “correct” tool usage established by experts. Experts can rate the use of sets of tools for each subtask, and tool usage can be measured as the deviation from those ratings.

Turn overlap

The overlap in communication by two or more people.

Occurs when the start time of a turn happens before the end time of a previous turn.  Turn overlaps can be counted and categorized into interruptions, backchannel communications, etc.

Repair activities

Include all errors, following down a wrong path, and all actions performed to correct those errors. Repair activities also include not knowing what to do and seeking help. In order to have this metric, there must be an established ‘right path’, or a list of possible ‘right paths’. Experts determine the ‘right paths’.

• The repair activities metric is made up of the number of undos and the number of cancels, as well as repetitive or useless keystrokes.

Conversational constructs

A general category of metrics that includes semantic and grammatical content of communication.  Topic segmentation and labeling is the ability to segment dialog into multiple, distinct topics with each segment labeled as to its topic. A reference is the use of a grammatical construction (e.g., a definite noun phrase or a pronoun) to refer to a preceding word or group of words.

Given topic segmentation, we can measure:

Topic mention – the number of times a topic is mentioned

Distance of topic from a given turn

Number of supportive responses (backchannelling, explicit agreement, etc.)

 

References can be counted by their occurrence, type, and distribution in a transcript. The use of pronouns can indicate group inclusion (you, we) or exclusion (I).
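As an illustration of counting references in a transcript, the following minimal sketch tallies pronouns that may indicate group inclusion or exclusion. The pronoun sets contain only the examples given above ("you", "we", "I"); the tokenization rule and the function name are assumptions, and a real analysis would need a fuller pronoun list and handling of contractions such as "I'm".

```python
import re
from collections import Counter

# Pronoun sets limited to the examples named in Table 8 (an assumption;
# extend as needed for a real transcript analysis).
INCLUSIVE = {"you", "we"}
EXCLUSIVE = {"i"}

def pronoun_counts(transcript: str) -> Counter:
    """Count inclusive and exclusive pronouns in a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter()
    for word in words:
        if word in INCLUSIVE:
            counts["inclusive"] += 1
        elif word in EXCLUSIVE:
            counts["exclusive"] += 1
    return counts

sample = "I think we should merge the drafts. Do you agree? We can start now."
counts = pronoun_counts(sample)
print(counts["inclusive"], counts["exclusive"])  # 3 1
```

The counts are metrics only; whether a high proportion of inclusive pronouns actually reflects group cohesion is a matter of interpretation at the measure level.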

 

5.2.1 Data Collection Methods

The following methods, or data-collection instruments, are used for gathering the metrics described above: logs, observations, questionnaires, audio recordings, and video recordings.

Many of the metrics can be gathered by a variety of tools or methods. For example, countables can be obtained by studying logs, observations, audio transcripts, and video recordings. (Automated methods of collecting data are discussed in Section 5.4.) Each method offers potential opportunities not available by other means but also has inherent limitations. For this reason, we suggest using multiple methods to obtain each metric. The possibilities are enumerated in Table 9 below.

 

Table 9.  Metrics vs. Data Collection Methods

(x = the method can be used to gather the metric; - = not marked)

Metric                      Logs   Observations   Questionnaires   Audio   Video
Countables                  x      x              -                x       x
Expert judgments            -      x              x                -       x
Length of turns             x      x              -                x       x
Turn overlap                x      -              -                x       -
Resource costs              x      x              x                x       x
Task completion             x      x              -                -       -
Time                        x      x              -                x       x
Tool usage                  x      x              x                -       x
User ratings                -      -              x                -       -
Conversational constructs   x      x              -                x       x
Repair activities           x      x              x                x       x
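The recommendation to use multiple methods for each metric can be checked against an experiment plan with a small sketch. The dictionary below is a transcription of Table 9 as read from its layout, so the cell assignments should be verified against the original table; the function names and the sample plan are illustrative assumptions.

```python
# Transcription of Table 9 (to be checked against the original table):
# each metric maps to the methods marked as able to gather it.
TABLE_9 = {
    "countables": {"logs", "observations", "audio", "video"},
    "expert judgments": {"observations", "questionnaires", "video"},
    "length of turns": {"logs", "observations", "audio", "video"},
    "turn overlap": {"logs", "audio"},
    "resource costs": {"logs", "observations", "questionnaires", "audio", "video"},
    "task completion": {"logs", "observations"},
    "time": {"logs", "observations", "audio", "video"},
    "tool usage": {"logs", "observations", "questionnaires", "video"},
    "user ratings": {"questionnaires"},
    "conversational constructs": {"logs", "observations", "audio", "video"},
    "repair activities": {"logs", "observations", "questionnaires", "audio", "video"},
}

def coverage(planned: set[str]) -> dict[str, set[str]]:
    """Map each metric to the planned methods that can gather it."""
    return {metric: methods & planned for metric, methods in TABLE_9.items()}

def weakly_covered(planned: set[str]) -> list[str]:
    """Metrics that fewer than two of the planned methods can gather."""
    return sorted(m for m, found in coverage(planned).items() if len(found) < 2)

# Hypothetical plan that uses only logging and questionnaires.
print(weakly_covered({"logs", "questionnaires"}))
```

Under this reading of the table, a plan limited to logs and questionnaires leaves most metrics with a single source, which suggests adding observation or recording where the budget allows.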

 

 

5.3 Measures

The framework described in Section 3 of this document divides the collaborative system design space into four levels: requirement, capability, service, and technology. Measures are associated with each level. For example, participation is a requirement level measure correlated with various metrics: the total number of turns per participant, the total number of turns per group, and user ratings. By contrast, usability is a technology level measure.
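One of the metrics behind the participation measure, the number of turns per participant relative to the group total, can be sketched as follows. The input format (one speaker label per turn) and the function name are assumptions for illustration; user ratings would still be needed to complete the measure.

```python
from collections import Counter

def participation_shares(turn_speakers: list[str]) -> dict[str, float]:
    """Fraction of the group's turns taken by each participant."""
    counts = Counter(turn_speakers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {speaker: n / total for speaker, n in counts.items()}

# Hypothetical sequence of turns in a session, one label per turn.
turn_log = ["ann", "bob", "ann", "cy", "ann", "bob", "ann", "ann"]
shares = participation_shares(turn_log)
print(shares)  # ann took 5 of 8 turns, bob 2 of 8, cy 1 of 8
```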

For each measure, we present:

  1. A definition
  2. Building blocks (metrics and other measures)
  3. Associated task types

The “building blocks” of a measure are metrics and other measures. A building block can be a combination of metrics or arithmetic formulas involving metrics. Note that many of the components of each measure are examples; exact components may vary depending upon the specific situation. Where available, we have included references to supporting research. For some measures, we have also included metrics and measures that might show some indirect relevance.

In general, all measures can be applied to all tasks. However, some measures may have little or no relevance for a particular instantiation of a task type. For example, measuring participation is usually not important to a dissemination of information task (Type 9), but is often very important in a brainstorming and creativity task (Type 2). With each measure, we include a list of what we believe to be the applicable task types that the measure helps to evaluate. The task types are enumerated and explained in Section 3 of this document.

The requirement level measures are summarized in Table 10 and include definitions and components. Capability level measures evaluate general capabilities of collaborative systems (see Section 3), and are summarized in Table 11. Service level measures evaluate the services provided by a collaborative system; they are summarized in Table 12. Finally, technology level measures aid in evaluating the implementation of a collaborative system. Example technology level measures can be found in Table 13.

 

Table 10.  Example Requirement Level Measures

Requirement

Definition

Metric/Measure Components

Task outcome

Measure of the state of a particular task. A set of artifacts is produced during task execution (e.g., documents, ideas, solutions, defined processes).

- Countables (number of generated artifacts);

- task completion (yes/no)

- expert judgments of product quality

- user ratings of product quality

Cost

Measure of time invested in the system and the resources consumed in executing an activity

- preparation cost measures (monetary amount and learning time)

- countables (number of turns)

- length of turns

- execution time

User satisfaction

Subjective measure of satisfaction with respect to the four aspects of group work.

- user ratings; e.g., satisfaction with: group process, task outcome, individual’s participation, group participation

Scalability

Measure of a system’s accommodation for larger or smaller group size.

- time to complete particular tasks versus number of users

- resources needed to complete particular task versus number of users

- expert judgments - yes/no, to what degree

Security

Measure of the protection of information and resources

- expert judgments (yes/no to a list of features, to what degree)

Interoperability

Measure of how well system components work together, sharing functionality and resources

- expert judgments

- tool usage

Participation

Measure of an individual’s involvement in a group activity

- countables (e.g., number of sentences, number of floor turns)

- user ratings

Efficiency

Measure of group and system effectiveness and productivity

- percent efficiency = (task time - repair activities time) / execution time

- user ratings

- tool usage

- breakdown

Consensus

Measure of general agreement or group unity in outcome

- user ratings

- grounding
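The time-based component of the efficiency measure in Table 10 can be computed as in the sketch below. Scaling the ratio by 100 to report a percentage, the choice of minutes as the unit, and the function name are assumptions; user ratings, tool usage, and breakdown are the remaining components and are not modeled here.

```python
def percent_efficiency(task_time: float, repair_time: float,
                       execution_time: float) -> float:
    """(task time - repair activities time) / execution time, as a percentage.

    All three arguments must use the same time unit.
    """
    if execution_time <= 0:
        raise ValueError("execution time must be positive")
    return 100.0 * (task_time - repair_time) / execution_time

# Hypothetical session: 60 minutes overall, 40 on task, 10 of those on repairs.
print(percent_efficiency(task_time=40, repair_time=10, execution_time=60))  # 50.0
```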

 

Table 11.  Example Capability Level Measures

Capability

Definition

Metric/Measure Components

Awareness

Degree of “having ... realization, perception, or knowledge” (Webster) of surroundings and events.

- user ratings

- conversational constructs

Collaboration management

Measures assess support for coordinating collaboration; e.g., floor control, agenda support, document access control, etc.

- expert judgments

Communication (human to human)

Measure of the exchange of information in any of the following forms:  verbal (spoken or written), visual, physical

- countables (number of turns per participant)

- turn overlap (simple overlap and interruptions)

- expert judgments

- user ratings (goodness of communication, getting floor control, getting other participants’ attention, ability to interrupt)

Grounding

Measure of how well common understanding is established.

- user ratings (e.g., reaching common understanding with other participants)

- countables (number of turns, length of turns)

- turn overlap

- conversational constructs

Collaborative object support

Measure that assesses support for collaborative objects; applied to shared workspace, object manipulation and management, etc.

- expert judgments

- tool usage (optimal set of tools used?)

Task focus

Measures the ability to concentrate on the task at hand.

- task focus = (overall time - transition time - other time) / overall time

Transition measures

Assesses support for transitions; used to evaluate collaboration startup, summarization, playback, archiving, object exporting and importing, distribution of objects, translation between modalities, meeting notification, etc.

- expert judgments

- user ratings (e.g., flow of transitions between tasks)

- conversational constructs
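The task focus formula in Table 11 can be worked through with the time metrics of Table 8. The sketch below assumes all times share one unit and uses hypothetical values. Note that if other time is computed from the Table 8 formula, other time = overall time - (task time + transition time), then the numerator reduces to task time, so task focus equals task time divided by overall time.

```python
import math

def other_time(overall: float, task: float, transition: float) -> float:
    """Other time = overall time - (task time + transition time), per Table 8."""
    return overall - (task + transition)

def task_focus(overall: float, transition: float, other: float) -> float:
    """Task focus = (overall time - transition time - other time) / overall time."""
    if overall <= 0:
        raise ValueError("overall time must be positive")
    return (overall - transition - other) / overall

# Hypothetical session, in minutes.
overall, task, transition = 60, 42, 6
other = other_time(overall, task, transition)   # 12
focus = task_focus(overall, transition, other)  # 42 / 60
print(other, focus)
assert math.isclose(focus, task / overall)
```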

 

 

Table 12.  Example Service Level Measures

Service

Definition

Metric/Measure Components

Breakdown

Measures how often the user has to rationalize a problem experienced.  Breakdowns can occur in communication, in coordination, in the system components, etc.

- conversational constructs

- repair activities

 

 

Tool usage

The degree to which the optimal tools are used for a particular task.

-  tool usage (which tools were used and how often?)

- expert judgments (were the right tools used?)

 

 

Table 13.  Example Technology Level Measures

Technology

Definition

Metric/Measure Components

Usability

Evaluates the ease, accessibility, and intuitiveness of the specific graphical user interfaces of the system tools and components.  Since usability evaluation is done on specific user interfaces, the usability measures are realized at the technology level although the component measures are also based, in part, on measures taken at different levels of the framework.

- expert judgments (standard user interface evaluation questions)

- user ratings (standard user interface evaluation questions)

- tool usage (which tools were used and how often?  Was the set of optimal tools used?)

- repair activities

- breakdown (defined at the service level)

- awareness (defined at the capability level)

 

Specific technology

 

- tool usage (which tools were used and how often?)

- expert judgments (were the right tools used?)

 

5.4 Data Collection Methods: Logging

We often need to collect data about running systems. Since videotaping every session can be difficult, automatic logging can be important in developing easy, repeatable evaluation scenarios. A multi-modal logging capability can support these efforts.

The MITRE Multi-Modal Logger (MML) is an example of this class of logging tool. MITRE’s MML can be used to log multi-modal data while collaborative systems are running. This data can be fine-grained (individual X or window events, for instance) or coarse-grained (a record of which windows the user interacts with); automatically gathered (via instrumentation) or manually created (via post-hoc annotation). This data can be gathered at the level of the physical or virtual device (for example, window events or the output of an audio device); at the level of the interface (for example, a record of menu selections made and the content of text entry fields); or at the level of the application (a record of actions taken, such as a retrieval of information or a command issued). This information can also be in a variety of modalities, such as text, images, and audio.

Since this information will typically be gathered for multiple users and multiple interactions with the system in question, the notion of a “trial” or “session” is supported. In addition, each trial might require information to be gathered from multiple components simultaneously (for example, when a speech recognizer is used in conjunction with an independent multi-modal system). Therefore, the MML also supports sharing each trial among multiple components, potentially running on different hosts.

MITRE's MML provides an application programming interface (API) for instrumenting existing applications. It also provides a set of tools for reviewing and annotating data collected via instrumentation.

MITRE’s MML offers a solution to the question of what granularity and levels of data may be collected. The instrumenter inserts whatever logger calls are desired into the source code and is thus in complete control of where, in the code, the log entries are generated, how many are generated, and what data types are assigned to them. In general, one can instrument any application for which the source code is available.
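The general idea of instrumentation, in which the instrumenter decides where log entries are generated and what data type each carries, can be sketched with a small in-memory logger. This is not the MML API: the class names `LogEntry` and `SessionLog`, the `log` and `dump` methods, and the entry fields are all hypothetical and chosen only to illustrate the approach.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class LogEntry:
    session: str        # trial or session identifier
    application: str    # component that produced the entry
    data_type: str      # label chosen by the instrumenter
    payload: dict
    timestamp: float = field(default_factory=time.time)

class SessionLog:
    """Hypothetical in-memory logger for one component in one session."""

    def __init__(self, session: str, application: str) -> None:
        self.session = session
        self.application = application
        self.entries: list[LogEntry] = []

    def log(self, data_type: str, **payload) -> LogEntry:
        entry = LogEntry(self.session, self.application, data_type, payload)
        self.entries.append(entry)
        return entry

    def dump(self) -> str:
        """Serialize the entries as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.entries)

# Instrumented application code: the logger calls are placed by hand.
session_log = SessionLog(session="trial-01", application="whiteboard")

def on_menu_select(item: str) -> None:
    session_log.log("menu_selection", item=item)  # interface-level entry
    # ... the application's normal handling of the menu item would follow

on_menu_select("undo")
session_log.log("window_event", kind="focus", window="chat")  # device-level entry
print(session_log.dump())
```

Because the calls are inserted by hand, the granularity of the log is whatever the instrumenter chooses, from individual window events to application-level actions.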

MITRE has also developed a log review and annotation tool to distribute with the logger. It allows users to view the data that has been logged for a given session. One may view all the data or select a subset by application, data type and/or timestamp. The data is displayed along a scrollable timeline, with the data sorted into streams by application and data type. This tool may also be used to add post-hoc annotations to the data. There is also a replay facility that allows the reviewer to replay the logged interaction in approximately real time.
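The review operations described above, selecting a subset by application, data type, or timestamp and sorting the data into streams, can be approximated on generic log records. The record fields and function names below are assumptions for illustration and do not reflect the format or interface of the MITRE tool.

```python
from collections import defaultdict

# Hypothetical logged entries; the field names are assumptions.
entries = [
    {"t": 0.5, "app": "chat", "type": "text", "data": "hello"},
    {"t": 1.2, "app": "whiteboard", "type": "menu_selection", "data": "undo"},
    {"t": 2.0, "app": "chat", "type": "text", "data": "ready?"},
    {"t": 3.4, "app": "whiteboard", "type": "window_event", "data": "focus"},
]

def select(entries, app=None, data_type=None, t_min=None, t_max=None):
    """Return the entries that match every filter that is not None."""
    return [
        e for e in entries
        if (app is None or e["app"] == app)
        and (data_type is None or e["type"] == data_type)
        and (t_min is None or e["t"] >= t_min)
        and (t_max is None or e["t"] <= t_max)
    ]

def streams(entries):
    """Group entries into time-ordered streams keyed by (application, data type)."""
    grouped = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["t"]):
        grouped[(e["app"], e["type"])].append(e)
    return dict(grouped)

chat_early = select(entries, app="chat", t_max=1.0)
by_stream = streams(entries)
for key, stream in by_stream.items():
    print(key, [e["data"] for e in stream])
```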

The MITRE MML tool suite and documentation are available for downloading at http://www.mitre.org/technology/logger .