Common Industry Format for Usability Test Reports

Version 1.1, October 28, 1999

Produced by the Industry Usability Reporting project: www.nist.gov/iusr
If you have any comments or questions about this document, please contact: iusr@nist.gov


Contents

Preface
Purpose and Objectives
Audience
Scope
Relationship to existing standards
Report Format Description
Title Page
Executive Summary
Introduction
Full Product Description
Test Objectives
Method
Participants
Context of Product Use in the Test
Tasks
Test Facility
Participant’s Computing Environment
Test Administrator Tools
Experimental Design
Procedure
Participant General Instructions
Participant Task Instructions
Usability Metrics
Effectiveness
Efficiency
Satisfaction
Results
Data Analysis
Data Scoring v
Data Reduction v
Statistical Analysis v
Presentation of the Results
Performance Results v
Satisfaction Results v
Appendices
References
Appendix A – Guidance for CIF preparation
Appendix B – Checklist
Appendix C – Sample Report
Appendix D – Glossary
Appendix E – Template


Preface

Purpose and Objectives

The overall purpose of the Common Industry Format (CIF) for Usability Test Reports is to promote incorporation of usability as part of the procurement decision-making process for interactive products. Examples of such decisions include purchasing, upgrading and automating. It provides a common format for human factors engineers and usability professionals in supplier companies to report the methods and results of usability tests to customer organizations.

Audience

The CIF is meant to be used by usability professionals within supplier organizations to generate reports that can be used by customer organizations. The CIF is also meant to be used by customer organizations to verify that a particular report is CIF-compliant. The Usability Test Report itself is intended for two types of readers:

Human factors or other usability professionals in customer organizations who are evaluating both the technical merit of usability tests and the usability of the products.

Other technical professionals and managers who are using the test results to make business decisions.

The Method and Results sections are aimed at the first audience. These sections describe the test methodology and results in technical detail suitable for replication, and also support application of test data to questions about the product’s expected costs and benefits. Understanding and interpreting these sections fully requires a technical background in human factors or usability engineering. The second audience is directed to the Introduction, which provides summary information for non-usability professionals and managers. The Introduction may also be of general interest to other computing professionals.

Scope

Trial use of the CIF report format will occur during a Pilot Study; for further information on the Pilot Study, see http://www.nist.gov/iusr/documents/WhitePaper.html. The report format assumes sound practice (e.g., refs. 8 & 9) has been followed in the design and execution of the test. Summative usability testing is recommended: the format is intended to support clear and thorough reporting of both the methods and the results of any empirical test, so test procedures that produce measures summarizing usability should be used. Some usability evaluation methods, such as formative tests, are intended to identify problems rather than produce measures; the format is not currently structured to support the results of such methods. The common format covers the minimum information that should be reported; suppliers may choose to include more. Although the format could be extended for wider use with products such as hardware with user interfaces, these are not covered at this time. These issues will likely be addressed as experience is gained in the Pilot Study.

Relationship to existing standards

This document is not formally related to standards-making efforts but has been informed by existing standards such as Annex C of ISO 13407, ISO 9241-11, and ISO 14598-5. It is consistent with major portions of these documents but more limited in scope.


Report Format Description

The format should be used as a generalized template. All the sections are reported according to agreement between the customer organization, the product supplier, and any third-party test organization where applicable.

Elements of the CIF are either ‘Mandatory’ or ‘Recommended’ and are marked ‘v’ and ‘w’, respectively, in the text.

Appendix A presents guidance for preparing a CIF report. Appendix B provides a checklist that can be used to ensure inclusion of required and recommended information. Appendix C contains a sample report that illustrates how the report format can be used. A glossary defining terminology used in the report format description is provided in Appendix D. Appendix E contains a Word template for report production.

Title Page

This section contains lines for

v identifying the report as a Common Industry Format (CIF) document; state the CIF version and contact information (i.e., ‘Comments and questions: iusr@nist.gov’)

v naming the product and version that was tested

v who led the test

v when the test was conducted

v the date the report was prepared

v who prepared the report

v contact information (telephone, email and street address) for an individual or individuals who can answer questions about the test to support validation and replication.

Executive Summary

This section provides a high level overview of the test. This section should begin on a new page and should end with a page break to facilitate its use as a stand-alone summary. The intent of this section is to provide information for procurement decision-makers in customer organizations. These people may not read the technical body of this document but are interested in:

v the identity and a description of the product

v a summary of the method(s) of the test, including the number and type of participants and their tasks

v results expressed as mean scores or other suitable measures of central tendency

w the reason for and nature of the test

w a tabular summary of performance results.

If differences between values or products are claimed, the probability that the difference could have occurred by chance should be stated.

Introduction

Full Product Description

v This section identifies the formal product name and release or version. It describes what parts of the product were evaluated. This section should also specify:

v the user population for which the product is intended

w any groups with special needs

w a brief description of the environment in which it should be used

w the type of user work that is supported by the product

Test Objectives

v This section describes all of the objectives for the test and any areas of specific interest. Possible objectives include testing user performance of work tasks and subjective satisfaction in using the product. This section should include:

v The functions and components of the product with which the user directly and indirectly interacted in this test.

w If the product component or functionality that was tested is a subset of the total product, explain the reason for focusing on the subset.

Method

This is the first key technical section. It must provide sufficient information to allow an independent tester to replicate the procedure used in testing.

Participants

This section describes the users who participated in the test in terms of demographics, professional experience, computing experience and special needs. This description must be sufficiently informative to replicate the study with a similar sample of participants. If there are any known differences between the participant sample and the user population, they should be noted here, e.g., actual users would attend a training course whereas test subjects were untrained.  Participants should not be from the same organization as the testing or supplier organization. Great care should be exercised when reporting differences between demographic groups on usability metrics.

A general description should include important facts such as:

v The total number of participants tested. A minimum of 8 per cell (segment) is recommended [10].

v Segmentation of user groups tested (if more than one user group was tested). Example: novice and expert programmers.

v The key characteristics and capabilities expected of the user groups being evaluated.

v How participants were selected and whether they had the essential characteristics and capabilities.

w Whether the participant sample included representatives of groups with special needs such as: the young, the elderly or those with physical or mental disabilities.

A table specifying the characteristics and capabilities of the participants tested should include a row for each participant, and a column for each characteristic. Characteristics should be chosen to be relevant to the product’s usability; they should allow a customer to determine how similar the participants were to the customers’ user population; and they must be complete enough so that an essentially similar group of participants can be recruited. The table below is an example; the characteristics that are shown are typical but may not necessarily cover every type of testing situation.

|    | Gender | Age | Education | Occupation / role | Professional Experience | Computer Experience | Product Experience |
|----|--------|-----|-----------|-------------------|-------------------------|---------------------|--------------------|
| P1 |        |     |           |                   |                         |                     |                    |
| P2 |        |     |           |                   |                         |                     |                    |
| Pn |        |     |           |                   |                         |                     |                    |

For ‘Gender’, indicate male or female.

For ‘Age’, state the chronological age of the participant, or indicate membership in an age range (e.g. 25-45) or age category (e.g. under 18, over 65) if the exact age is not known.

For ‘Education’, state the number of years of completed formal education (e.g., in the US a high school graduate would have 12 years of education and a college graduate 16 years).

For ‘Occupation/role’, describe the user’s job role when using the product. Use the role title if known.

For ‘Professional experience’, give the amount of time the user has been performing in the role.

For ‘Computer experience’, describe relevant background such as how much experience the user has with the platform or operating system, and/or the product domain. This may be more extensive than one column.

For ‘Product experience’ indicate the type and duration of any prior experience with the product or with similar products.

Context of Product Use in the Test

This section describes the scenarios and conditions under which the tests were performed, the tasks that were part of the evaluation, the platform on which the application was run, and the specific configuration operated by the test participants. Any known differences between the evaluated context and the expected context of use should be noted in the corresponding subsection.

Tasks

A thorough description of the tasks that were performed by the participants is critical to the face validity of the test.

v Describe the task scenarios for testing.

v Explain why these tasks were selected (e.g. the most frequent tasks, the most troublesome tasks).

v Describe the source of these tasks (e.g. observation of customers using similar products, product marketing specifications).

v Include any task data given to the participants.

v Include any completion or performance criteria established for each task.

Test Facility

This section refers to the physical description of the test facility.

w Describe the setting, and type of space in which the evaluation was conducted (e.g., usability lab, cubicle office, meeting room, home office, home family room, manufacturing floor).

w Detail any relevant features or circumstances which could affect the quality of the results, such as video and audio recording equipment, one-way mirrors, or automatic data collection equipment.

Participant’s Computing Environment v

This section should include all the detail required to replicate and validate the test. It should include appropriate configuration detail on the participant’s computer, including hardware model, operating system version, and any required libraries or settings. If the product uses a web browser, then the browser should be identified along with its version and the name and version of any relevant plug-ins.

Display Devices v If the product has a screen-based visual interface, the screen size, monitor resolution, and color setting (number of colors) must be detailed. If the product has a print-based visual interface, the media size and print resolution must be detailed. If visual interface elements can vary in size, specify the size(s) used in the test. This factor is particularly relevant for fonts.

Audio Devices w If the product has an audio interface, specify relevant settings or values for the audio bits, volume, etc.

Manual Input Devices w If the product requires a manual input device (e.g., keyboard, mouse, joystick), specify the make and model of the devices used in the test.

Test Administrator Tools

v If a standard questionnaire was used, describe or specify it here. Include customized questionnaires in an appendix.

w Describe any hardware or software used to control the test or to record data.

Experimental Design

v Describe the logical design of the test. Define the independent variables and control variables. Briefly describe the measures for which data were recorded for each set of conditions.

Procedure

This section details the test protocol.

v Give operational definitions of measures and any presented independent variables or control variables. Describe any time limits on tasks, and any policies and procedures for training, coaching, assistance, interventions or responding to questions.

w Include the sequence of events from greeting the participants to dismissing them.

w Include details concerning non-disclosure agreements, form completion, warm-ups, pre-task training, and debriefing.

w Verify that the participants knew and understood their rights as human subjects [1].

w Specify the steps that the evaluation team followed to execute the test sessions and record data.

w Specify how many people interacted with the participants during the test sessions and briefly describe their roles.

w State whether other individuals were present in the test environment and their roles.

w State whether participants were paid or otherwise compensated.

Participant General Instructions

v Include here or in an appendix all instructions given to the participants (except the actual task instructions, which are given in the Participant Task Instructions section).

v Include instructions on how participants were to interact with any other persons present, including how they were to ask for assistance and interact with other participants, if applicable.

Participant Task Instructions

v This section should summarize the task instructions. Put the exact task instructions in an appendix.

Usability Metrics

v Explain which measures were used for each category of usability metric: effectiveness, efficiency and satisfaction. Conceptual descriptions and examples of the metrics are given below.

Effectiveness

Effectiveness relates the goals of using the product to the accuracy and completeness with which these goals can be achieved. It does not take account of how the goals were achieved, only the extent to which they were achieved. Common measures of effectiveness include percent task completion, frequency of errors, frequency of assists to the participant from the testers, and frequency of accesses to help or documentation by the participants during the tasks.

Completion Rate

The results must include the percentage of participants who completely and correctly achieve each task goal. If goals can be partially achieved (e.g., by incomplete or sub-optimum results) then it may also be useful to report the average goal achievement, scored on a scale of 0 to 100% based on specified criteria related to the value of a partial result. For example, a spell-checking task might involve identifying and correcting 10 spelling errors and the completion rate might be calculated based on the percent of errors corrected. Another method for calculating completion rate is weighting; e.g., spelling errors in the title page of the document are judged to be twice as important as errors in the main body of text. The rationale for choosing a particular method of partial goal analysis should be stated, if such results are included in the report.

Note: The unassisted completion rate (i.e. the rate achieved without intervention from the testers) should be reported as well as the assisted rate (i.e. the rate achieved with tester intervention) where these two metrics differ.
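As an illustration, unassisted and assisted completion rates for one task might be computed as follows. This is a minimal sketch; the record layout and all data values are invented for the example and are not prescribed by the CIF.

```python
# Hypothetical sketch: unassisted vs. assisted completion rates for one task.
# The record layout and data below are invented for illustration only.

def completion_rates(results):
    """Return (unassisted %, assisted %) task completion across participants."""
    n = len(results)
    unassisted = sum(1 for r in results if r["completed"] and r["assists"] == 0)
    assisted = sum(1 for r in results if r["completed"])  # with or without help
    return 100.0 * unassisted / n, 100.0 * assisted / n

task_a = [
    {"completed": True,  "assists": 0},
    {"completed": True,  "assists": 2},  # succeeded only after tester assists
    {"completed": False, "assists": 1},
    {"completed": True,  "assists": 0},
]

unassisted_rate, assisted_rate = completion_rates(task_a)
print(unassisted_rate, assisted_rate)  # 50.0 75.0
```

When the two rates differ, both would be reported, together with the number and type of assists given.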

Errors

Errors are instances where test participants did not complete the task successfully, or had to attempt portions of the task more than once. It is recommended that scoring of data include classifying errors according to some taxonomy, such as in [2].

Assists

When participants cannot proceed on a task, the test administrator sometimes gives direct procedural help in order to allow the test to proceed. This type of tester intervention is called an assist for the purposes of this report. If it is necessary to provide participants with assists, efficiency and effectiveness metrics must be determined for both unassisted and assisted conditions. For example, if a participant received an assist on Task A, that participant should not be included among those successfully completing the task when calculating the unassisted completion rate for that task. However, if the participant went on to complete the task successfully following the assist, he or she could be included in the assisted completion rate for Task A. When assists are allowed or provided, the number and type of assists must be included as part of the test results.

In some usability tests, participants are instructed to use support tools such as online help or documentation, which are part of the product, when they cannot complete tasks on their own. Accesses to product features which provide information and help are not considered assists for the purposes of this report. It may, however, be desirable to report the frequency of accesses to different product support features, especially if they factor into participants’ ability to use products independently.

Efficiency

Efficiency relates the level of effectiveness achieved to the quantity of resources expended. It is generally assessed by the mean time taken to complete a task, but may also relate to other resources (e.g. total cost of usage).

Task time

The results must include the mean time taken to complete each task, together with the range and standard deviation of times across participants. Sometimes a more detailed breakdown is appropriate; for instance, the time that users spent looking for or obtaining help (from documentation, the help system, or calls to the help desk). This time should also be included in the total time on task.

Completion Rate / Mean Time-On-Task w

The measure Completion Rate / Mean Time-On-Task is the core measure of efficiency. It expresses the percentage of users who were successful (or the percentage goal achievement) per unit of time. A very efficient product yields a high percentage of successful users in a small amount of time. This allows customers to compare fast but error-prone interfaces (e.g., command lines with wildcards to delete files) with slow but easy interfaces (e.g., using a mouse and keyboard to drag each file to the trash).
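As an illustration with invented numbers (not from any real test), the comparison might look like:

```python
# Hypothetical sketch with invented numbers: completion rate per unit time
# as an efficiency measure comparing two interface styles.

def efficiency(completion_rate_pct, mean_time_on_task_min):
    """Percentage of successful users (or goal achievement) per minute."""
    return completion_rate_pct / mean_time_on_task_min

# A fast but error-prone interface vs. a slower, easier one:
command_line = efficiency(60.0, 2.0)   # 30.0 percent of users succeed per minute
drag_to_trash = efficiency(90.0, 6.0)  # 15.0 percent of users succeed per minute
print(command_line, drag_to_trash)
```

On these invented numbers the command line is the more efficient interface despite its lower completion rate, which is exactly the trade-off the measure is designed to expose.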

Note: Effectiveness and efficiency results must be reported, even when they are difficult to interpret within the specified context of use. In this case, the report must specify why the supplier does not consider the metrics meaningful. For example, suppose that the context of use for the product includes real time, open-ended interaction between close associates. In this case, Time-On-Task may not be meaningfully interpreted as a measure of efficiency, because for many users, time spent on this task is “time well spent”.

Satisfaction

Satisfaction describes a user’s subjective response when using the product. User satisfaction may be an important correlate of motivation to use a product and may affect performance in some cases. Questionnaires to measure satisfaction and associated attitudes are commonly built using Likert and semantic differential scales.

A variety of instruments are available for measuring user satisfaction of software interactive products, and many companies create their own. Whether an external, standardized instrument is used or a customized instrument is created, it is suggested that subjective rating dimensions such as Satisfaction, Usefulness, and Ease of Use be considered for inclusion, as these will be of general interest to customer organizations.

A number of widely used questionnaires are available, including the ASQ [5], CUSI [6], PSSUQ [6], QUIS [3], SUMI [4], and SUS [7]. While each offers a unique perspective on subjective measures of product usability, most include measurements of Satisfaction, Usefulness, and Ease of Use.

Suppliers may choose to use validated published satisfaction measures or may submit satisfaction metrics they have developed themselves.

Results

This is the second major technical section of the report. It includes a description of how the data were scored, reduced, and analyzed. It provides the major findings in quantitative formats.

Data Analysis

Data Scoring v

The method by which the data collected were scored should be described in sufficient detail to allow replication of the data scoring methods by another organization if the test is repeated. Particular items that should be addressed include the exclusion of outliers, categorization of error data, and criteria for scoring assisted or unassisted completion.

Data Reduction v

The method by which the data were reduced should be described in sufficient detail to allow replication of the data reduction methods by another organization if the test is repeated. Particular items that should be addressed include how data were collapsed across tasks or task categories.

Statistical Analysis v

The method by which the data were analyzed should be described in sufficient detail to allow replication of the data analysis methods by another organization if the test is repeated. Particular items that should be addressed include statistical procedures (e.g. transformation of the data) and tests (e.g. t-tests, F tests and statistical significance of differences between groups). Scores that are reported as means must include the standard deviation and optionally the standard error of the mean.
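As a minimal sketch (the task times below are invented for illustration), a mean reported with its standard deviation and optional standard error of the mean could be computed with only the standard library:

```python
# Illustrative sketch: reporting a mean task time with its standard
# deviation and (optionally) the standard error of the mean.
import math
import statistics

task_times_min = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.3]  # one per participant

mean = statistics.mean(task_times_min)
sd = statistics.stdev(task_times_min)        # sample standard deviation
sem = sd / math.sqrt(len(task_times_min))    # standard error of the mean

print(f"mean = {mean:.2f} min, sd = {sd:.2f}, sem = {sem:.2f}")
```

Whatever tooling is used, the same scores, transformations and tests should be described precisely enough that another organization can reproduce them.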

Presentation of the Results

v Effectiveness, Efficiency and Satisfaction results must always be reported.

Both tabular and graphical presentations of results should be included. Various graphical formats are effective in describing usability data at a glance. Examples are included in the Sample Test Report in Appendix C. Bar graphs are useful for describing subjective data such as that gleaned from Likert scales. A variety of plots can be used effectively to show comparisons of expert benchmark times for a product vs. the mean participant performance time. The data may be accompanied by a brief explanation of the results but detailed interpretation is discouraged.

Performance Results v

It is recommended that efficiency and effectiveness results be tabulated across participants on a per unit task basis. A table of results may be presented for groups of related tasks (e.g. all program creation tasks in one group, all debugging tasks in another group) where this is more efficient and makes sense.  If a unit task has sub-tasks, then the sub-tasks may be reported in summary form for the unit task. For example, if a unit task is to identify all the misspelled words on a page, then the results may be summarized as a percent of misspellings found. Finally, a summary table showing total mean task times and completion rates across all tasks should be presented. Testers should report additional tables of metrics if they are relevant to the product’s design and a particular application area.


Task A

| User #             | Unassisted Task Effectiveness (% Complete) | Assisted Task Effectiveness (% Complete) | Task Time (min) | Effectiveness / Mean Time-On-Task | Errors | Assists |
|--------------------|--------------------------------------------|------------------------------------------|-----------------|-----------------------------------|--------|---------|
| 1                  |                                            |                                          |                 |                                   |        |         |
| 2                  |                                            |                                          |                 |                                   |        |         |
| N                  |                                            |                                          |                 |                                   |        |         |
| Mean               |                                            |                                          |                 |                                   |        |         |
| Standard Deviation |                                            |                                          |                 |                                   |        |         |
| Min                |                                            |                                          |                 |                                   |        |         |
| Max                |                                            |                                          |                 |                                   |        |         |
  

Summary

| User #             | Total Unassisted Task Effectiveness (% Complete) | Total Assisted Task Effectiveness (% Complete) | Total Task Time (min) | Effectiveness / Mean Time-On-Task | Total Errors | Total Assists |
|--------------------|--------------------------------------------------|------------------------------------------------|-----------------------|-----------------------------------|--------------|---------------|
| 1                  |                                                  |                                                |                       |                                   |              |               |
| 2                  |                                                  |                                                |                       |                                   |              |               |
| N                  |                                                  |                                                |                       |                                   |              |               |
| Mean               |                                                  |                                                |                       |                                   |              |               |
| Standard Deviation |                                                  |                                                |                       |                                   |              |               |
| Min                |                                                  |                                                |                       |                                   |              |               |
| Max                |                                                  |                                                |                       |                                   |              |               |

Satisfaction Results v

Data from satisfaction questionnaires can be summarized in a manner similar to that described above for performance data. Each column should represent a single measurement scale.

Satisfaction

| User #             | Scale 1 | Scale 2 | Scale 3 | Scale N |
|--------------------|---------|---------|---------|---------|
| 1                  |         |         |         |         |
| 2                  |         |         |         |         |
| N                  |         |         |         |         |
| Mean               |         |         |         |         |
| Standard Deviation |         |         |         |         |
| Min                |         |         |         |         |
| Max                |         |         |         |         |
 

Appendices

Custom questionnaires, Participant General Instructions and Participant Task Instructions are appropriately submitted as appendices. Release Notes should be placed in a separate appendix; these include any information arising since the test was run that the supplier would like to include because it might explain or update the test results (e.g. if the UI design has been revised since the test).

 

References

1. American Psychological Association (1982). Ethical Principles in the Conduct of Research with Human Participants.

2. Norman, D. A. (1983). Design Rules Based on Analyses of Human Error. Communications of the ACM, 26(4), 254-258.

3. Chin, J. P., Diehl, V. A., and Norman, K. (1988). Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of ACM CHI ’88 (Washington, D.C.), 213-218.

4. Kirakowski, J. (1996). The software usability measurement inventory: Background and usage. In Jordan, P., Thomas, B., and Weerdmeester, B. (Eds.), Usability Evaluation in Industry. UK: Taylor and Francis.

5. Lewis, J. R. (1991). Psychometric Evaluation of an After-Scenario Questionnaire for Computer Usability Studies: the ASQ. SIGCHI Bulletin, 23(1), 78-81.

6. Lewis, J. R. (1995). IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction, 7, 57-78.

7. Brooke, J. (1996). SUS: A “quick and dirty” usability scale. In Usability Evaluation in Industry. UK: Taylor and Francis.

8. Rubin, J. (1994). Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. New York: John Wiley & Sons.

9. Dumas, J. & Redish, G. (1993). A Practical Guide to Usability Testing. New Jersey: Ablex Publishing Corp.

10. Nielsen, J. & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM CHI ’93 Conference on Human Factors in Computing Systems, 206-213.


Appendix A – Guidance for CIF preparation

This document provides some prescriptive guidance on creating a usability report for the CIF.

Appendix B – Checklist

This document is a checklist for the CIF.

Appendix C – Sample Report

This large-scale report includes all the elements of the CIF and also illustrates how to include items that are not specified in the CIF.

Appendix D – Glossary

This document contains a list of terms used in the CIF and definitions.

Appendix E – Template

This document is a text-processing template to facilitate production of the CIF by suppliers.