Using a simulated user to explore human robot interfaces
D. VAN ROOY, FRANK E. RITTER Member, IEEE, and R. ST.-AMANT, Member, IEEE
Abstract
Human-robot interfaces (HRI) can be difficult to use. We examine urban search and rescue (USR) robots as an example. We present a theory of their use based on a simulated user written in the ACT-R cognitive modeling language. The model, using a simulated eye and hand, interacts directly with an unmodified, simple tele-operation task: maneuvering in an environment while avoiding other moving objects. The model user also performs a secondary task. In addition to describing the knowledge the human operator must have, and which aspects of the task will be difficult for the operator, the model makes quantitative predictions about how the speed of the robot influences the quality of navigation and performance on the secondary task. These results are examples of the types of output available from a model user. Because the model interacts with the USR simulator using only the screen bitmap, it should be widely applicable to testing other simulators and actual robots. The model already suggests why human-robot interfaces are difficult to use and where they can be improved.
Index Terms--
cognitive model, ACT-R, human-robot interfaces
I. INTRODUCTION
In the future, robots may become completely autonomous and act largely independently. However, such a level of independence has not yet been achieved and is in some cases simply undesirable. Many of the tasks that robots face today, such as exploration, reconnaissance, and surveillance, will continue to require supervision [1]. Furthermore, people often do not have enough confidence in a completely autonomous robot to let it operate independently. The degree to which robots become integrated into our society will thus depend largely on their ability to communicate with humans in understandable and friendly modalities [2].
Despite its importance, a general theory of human-robot interface use seems to be lacking. Many human-robot interfaces do not even respect the most fundamental HCI principles. In this paper, we present the beginnings of a theory that indicates the issues that make human-robot interfaces difficult to use. Concurrently, we present a quantitative tool in the form of a simulated user that can be used to identify problems associated with human-robot interface use. Specifically, we introduce a methodology in which a cognitive model autonomously exercises human-robot interfaces, indicating ways to improve the interface and laying bare problems that can serve as starting points for a general theory of human-robot interface use.
One of the reasons that there does not seem to be a general theory of human-robot interface use is the complexity of the task domain, which is reflected in the diversity of types of human-robot interaction. An application that illustrates this well is robot-assisted Urban Search and Rescue (USR). USR involves the detection and rescue of victims from urban structures such as collapsed buildings. Because of the extreme physical and perceptual demands of USR, these applications are usually mixed-initiative human-robot interactions, in which a human operator and a robot interact in some manner to produce adequate performance [3]. This means that it might be optimal for the robot to exhibit a fair amount of autonomy in some situations, for instance, in navigating a confined space using its own sensors. Other situations, however, might require human intervention: an operator may have to assist in freeing a robot because its sensors do not provide enough information for autonomous recovery [3]. Yet other operations, some only imagined, such as providing medication to trapped survivors, will legally require a human in the loop. This illustrates how, as robot autonomy increases, the role of the operator will often shift from control to monitoring and diagnosis [1].
There are several reasons why principles from HCI are missing from many human-robot systems. First of all, the task domain of human-robot systems is more complex and diverse than that of most commercial software, making it very hard to meet the needs of diverse users or to come up with a general metaphor. Furthermore, these systems are typically more expensive than regular commercial software packages. At the same time, they are not built as often as regular software, and when they are built, it is usually not by people trained in HCI. Currently, USR robots are directly driven by operators; as the robots become more autonomous, these problems will become more complex. What is needed is a way to test and improve these interfaces.
II. USING A SIMULATED USER TO EXPLORE HUMAN-ROBOT INTERFACES
In this section, we introduce a cross-platform architecture in which a cognitive model simulates user performance. Specifically, we introduce a simulated user, consisting of a cognitive model and a pair of simulated eyes and hands, that can be applied to sample human-robot interfaces (or, with additional knowledge, to any other interface). Ultimately, the intention is to provide a quantitative tool to guide the design process of human-robot interfaces. This tool will enable designers to apply psychological theories in real time, providing a simulated user that acts like a real user and interacts with the same interface.
A cognitive model forms the cognition of our simulated user. A cognitive model is a theory of human cognition realized as a running computer program. It produces human-like performance in that it takes time, commits errors, deploys strategies, and learns. It provides a means of applying cognitive psychology data and theory to HCI problems in real time and in an interactive environment [4-6]. We have developed a system, consisting of the cognitive architecture ACT-R [7] and a suite of simulated eyes and hands called Segman [8], that can be applied to virtually any type of interface running on any operating system. We will begin by describing the parts that make up the system and then provide a demonstration. Subsequently, we discuss how this system can be applied as a simulated user to explore human-robot interaction, and how it supports explanations of users' behavior and evaluation of interfaces.
A. The ACT-R architecture
The ACT-R architecture integrates theories of cognition [7], visual attention [9], and motor movement [10]. It has been applied successfully to higher-level cognitive phenomena, such as modeling scientific reasoning [11], differences in working memory [12], and skill acquisition [13], to name but a few. Recently it has been applied successfully to a number of HCI issues [14], [15], [6]. ACT-R makes a distinction between two types of long-term knowledge, declarative and procedural. Declarative knowledge is factual and holds information like "2 + 3 = 5" or "George Bush is the president of the USA". The basic units of declarative knowledge are chunks, which are schema-like structures, effectively forming a propositional network. Procedural knowledge consists of production rules that encode skills and take the form of condition-action pairs. Production rules correspond to specific goals or sub-goals, and mainly retrieve and change declarative knowledge.
Besides the symbolic procedural and declarative components, ACT-R also has a sub-symbolic component that determines the use of the symbolic knowledge. Each symbolic construct, be it a production or a chunk, has sub-symbolic parameters associated with it that reflect its past use. In this way, the system keeps track of the usefulness of the symbolic information. Which information is currently available in the declarative memory module is partly determined by the odds that a particular piece of information will be used in that context.
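As an illustration, this bookkeeping can be sketched with ACT-R's base-level learning equation, B_i = ln(Σ_j t_j^-d), where t_j is the time since the j-th use of a chunk and d is a decay parameter. The equation is standard ACT-R; the particular use times below are invented for illustration.

```python
import math

def base_level_activation(use_times, now, d=0.5):
    """Base-level activation of a chunk: B_i = ln(sum_j (now - t_j)^-d).

    use_times: past moments (in seconds) at which the chunk was
    created or retrieved; d: decay parameter (ACT-R's conventional
    default is 0.5). Recent and frequent use yields higher
    activation, hence faster and more reliable retrieval.
    """
    return math.log(sum((now - t) ** -d for t in use_times))

# A chunk used often and recently is more active than a stale one.
recent = base_level_activation([90.0, 95.0, 99.0], now=100.0)
stale = base_level_activation([10.0, 20.0, 30.0], now=100.0)
```

Because activation decays with time but grows with each use, knowledge that has been useful in the recent past is the knowledge most readily available, which is the sense in which availability tracks the odds of use.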
An important aspect of the ACT-R architecture is that models created in it predict human behavior qualitatively and quantitatively: each covert step of cognition (production firing, retrieval from declarative memory, procedural knowledge application) and each overt action (clicking the mouse, moving visual attention) has a latency associated with it that is based on psychological theories and data. For instance, taking a cognitive action, firing a production rule, takes 50 ms (modulated by other factors such as practice), and the time needed to move the mouse is calculated using Fitts' law (e.g., [16]). In this way, the system provides a way to apply psychological knowledge in real time.
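These latencies compose into simple time predictions. The sketch below assumes the 50-ms production firing time just mentioned and a commonly used Fitts' law formulation, T = b * log2(d/w + 0.5) with a minimum movement time; the coefficient values are illustrative defaults, and ACT-R/PM's actual parameters may be configured differently.

```python
import math

PRODUCTION_TIME = 0.050  # 50 ms per production firing (ACT-R default)

def fitts_time(distance, width, b=0.1, min_time=0.1):
    """Aimed-movement time under a Fitts' law variant often used in
    cognitive modeling: T = max(min_time, b * log2(distance/width + 0.5)).
    All times in seconds; b = 0.1 s/bit is an illustrative default."""
    return max(min_time, b * math.log2(distance / width + 0.5))

# Predicted time for one production firing followed by a mouse
# movement of 200 pixels to a 20-pixel-wide button.
total = PRODUCTION_TIME + fitts_time(200, 20)
```

Summing such step latencies over a whole task is what lets the model make quantitative predictions about, for example, how fast an operator can respond to an obstacle on screen.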
B. The perceptual-motor buffers
A schematic of the current implementation of the theory, ACT-R 5.0 (act.psy.cmu.edu/ACT-R_5.0), is shown in Figure 1. At the heart of the architecture is a production system, which represents central cognition and interacts with a number of buffers. These buffers represent the information that the system is currently acting on: the Goal buffer contains the present goal of the system, the Declarative buffer contains the declarative knowledge that is currently available, and the perceptual and motor buffers indicate the state of the perceptual and motor modules (busy or free) and their contents. The communication between central cognition and the buffers is regulated by production rules. As mentioned, production rules are condition-action pairs: the first part of a production rule, the condition side, typically tests whether certain declarative knowledge (in the form of a chunk) is present in a certain buffer. The second part, the action side, then sends a request to a buffer to change the current goal, retrieve knowledge from a store such as declarative memory, or perform some action.
The perceptual and motor buffers allow the model to “look” at an interface and manipulate objects in that interface. The perceptual buffer builds a representation of the display in which each object is represented by a feature. Productions can send commands to the perceptual buffer to direct attention to an object on the screen and create a chunk in declarative memory that represents that object and its location on the screen. The production system can then send commands, initiated by a production rule, to the motor buffer to manipulate these objects.
Central cognition and the various buffers run in parallel with one another, but each of the perceptual and motor buffers is serial (with a few rare exceptions) and can only contain one chunk of information. This means that the production system might retrieve a chunk from declarative memory while the perceptual buffer shifts attention and the motor buffer moves the mouse. We will mainly concentrate on the motor and perceptual buffers, which are most relevant for our purpose.
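The recognize-act cycle over buffers can be caricatured in Python as follows. The buffer names follow the text, but the specific rule (steer away from an obstacle) and its encoding as plain dictionaries and lambdas are hypothetical simplifications of ACT-R's pattern matching.

```python
def run_cycle(buffers, productions):
    """One recognize-act cycle: find a production whose condition
    matches the current buffer contents, then apply its action.

    buffers: dict mapping buffer name -> chunk (a dict) or None.
    productions: list of (condition, action) pairs, mirroring
    ACT-R's condition-action structure."""
    for condition, action in productions:
        if condition(buffers):
            action(buffers)
            return True
    return False  # no production matched; the model halts

# Hypothetical rule: if the goal is to drive and an obstacle feature
# sits in the perceptual buffer, request a steering action by
# placing a chunk in the motor buffer.
buffers = {
    "goal": {"task": "drive"},
    "perceptual": {"object": "obstacle", "side": "left"},
    "motor": None,
}
productions = [(
    lambda b: (b["goal"]["task"] == "drive"
               and b["perceptual"] is not None
               and b["perceptual"]["object"] == "obstacle"),
    lambda b: b.__setitem__("motor", {"action": "steer",
                                      "direction": "right"}),
)]
fired = run_cycle(buffers, productions)
```

In the real architecture the matched production also takes 50 ms to fire, and the motor module would then be busy until the requested movement completes; the sketch omits all timing.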
C. Segman and ACT-R 5
ACT-R 5 in its current release (act.psy.cmu.edu) interacts with interfaces using a perceptual-motor extension (ACT-R/PM). ACT-R/PM [15] includes tools for creating interfaces, and for annotating existing interfaces, in Macintosh Common Lisp so that models can see and interact with objects in the interface. This allows most models to interact in some way with most interfaces written in that language, and allows all models to interact with all interfaces written with the special tools.
For our simulations, we developed a more general version of ACT-R/PM, which provides ACT-R 5 direct access to an interface, thus removing the need for a specific interface creation tool. This is done by extending ACT-R/PM with the Segman suite (www.csc.ncsu.edu/faculty/stamant/cognitive-modeling.html).
As Figure 1 shows, Segman [8] takes pixel-level input from the screen (i.e., the screen bitmap), runs the bitmap through image-processing algorithms, and builds a structured representation of the screen. This representation is then passed to ACT-R through the ACT-R/PM theory of visual perception (i.e., the perceptual buffer). ACT-R/PM moderates what is visible and how long it takes to see and recognize objects. Segman can also generate mouse and keyboard inputs to manipulate objects on the screen. This functionality is called through the ACT-R/PM theory of motor output, but we have extended the output results to work with any Windows interface. This is done by creating very primitive events (click icon, select button, etc.), which are implemented as functions at the operating-system level. As such, they are indistinguishable from human-generated events. Currently, we have a fully functional system that runs under Windows 98 and 2000.
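As a rough sketch of the image-processing step, the toy function below groups same-colored pixels of a bitmap into "features" with bounding boxes, one simple stand-in for the kind of structured screen representation Segman builds; Segman's actual algorithms are considerably more sophisticated, and the tiny bitmap here is invented for illustration.

```python
def extract_features(bitmap, background=0):
    """Group 4-connected, same-colored, non-background pixels into
    features, each summarized by its color and bounding box
    (min_x, min_y, max_x, max_y). bitmap: list of rows of color ints."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    features = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or bitmap[y][x] == background:
                continue
            # Flood-fill one connected region of uniform color.
            color, stack, cells = bitmap[y][x], [(x, y)], []
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                cells.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and not seen[ny][nx]
                            and bitmap[ny][nx] == color):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            xs = [c[0] for c in cells]
            ys = [c[1] for c in cells]
            features.append({"color": color,
                             "box": (min(xs), min(ys), max(xs), max(ys))})
    return features

# Two blobs on a tiny "screen" become two objects the model can
# attend to and locate.
screen = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
]
objects = extract_features(screen)
```

Each resulting feature plays the role of a screen object: attention can be directed to its location, and a chunk describing it can be deposited in declarative memory.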
III. THE MODEL OF ROBOT DRIVING
We will now describe an implementation of our system called DUMAS (pronounced [doo ‘maa], see also smartAHS [17]), which stands for Driver User Model in ACT-R & Segman. DUMAS drives a car in a Java-implemented game, which was downloaded from www.theebest.com/games/3ddriver/3ddriver.shtml. For the simulations reported below, no changes were made to the game.
We chose the 3D driver game for several reasons. First, it has a direct interface: the operator steers the car using the keyboard. This perspective is often referred to as "inside-out" driving, because the operator feels as if she is inside the vehicle looking out, and it is a common arrangement for vehicle and robot tele-operation [1]. Second, driving is a prototypical example of real-time, interactive decision making in an interactive environment [18] [14], and as such is comparable to many tele-operated robot tasks. Third, the source code is extensible, which means that aspects of the environment (e.g., slow or fast driving) and of the interface (e.g., bigger or smaller buttons) can be manipulated in a controlled fashion. Because the code is Java, this can be done on multiple platforms. Finally, because we did not write the game, it helps to show the generality of this approach.
Models of driving have been targets of research for decades (the analysis of Gibson and Crooks in 1938 provides one of the earliest examples [19]; see Bellet and Tattegrain-Veste [20] for a concise historical overview from a cognitive ergonomics perspective). The hierarchical risk model of van der Molen and Botticher is a representative example of recent models [21]. Driving can be seen as structured into strategic, tactical, and operational levels. Moving up the hierarchy, each level describes an increasingly abstract set of behaviors that govern choices at the level below it. At the strategic level, planning activity takes place, such as the choice of route and travel speed. At the tactical level, decisions encompass more concrete, situation-dependent actions, such as lane changing, passing, and so forth. The operational level describes skilled but routine activities, such as steering and acceleration.
The different levels of abstraction represent different demands on the cognitive, perceptual, and motor abilities of the driver. For example, feedback from assistive technology such as ABS or power steering is provided at the operational level through haptic channels, often imperceptibly. Feedback for travel speed, in contrast, requires some cognitive activity at the strategic level, to interpret speedometer readings. If the feedback channels from these different activities were reversed (e.g., if the driver had to interpret a numerical value to determine power steering assist), their usability would be seriously impaired. Many task domains in HRI, in particular urban search and rescue, share this layered structure.
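The three-level structure described above can be illustrated with a toy controller in which each level constrains the one below it; all names, speeds, and decision rules here are hypothetical and serve only to make the hierarchy concrete.

```python
class Driver:
    """Toy illustration of the strategic/tactical/operational
    hierarchy: each level's output constrains the choices
    available to the level below it."""

    def strategic(self):
        # Planning: choice of route and target travel speed.
        return {"route": "main-street", "target_speed": 50}

    def tactical(self, plan, traffic):
        # Situation-dependent decisions, e.g. whether to pass a
        # slower lead vehicle, made within the strategic plan.
        if traffic["lead_speed"] < plan["target_speed"]:
            maneuver = "pass"
        else:
            maneuver = "follow"
        return {"maneuver": maneuver, **plan}

    def operational(self, decision):
        # Skilled, routine control: steering and acceleration
        # realizing the tactical decision.
        if decision["maneuver"] == "pass":
            return {"steer": "left", "accelerate": True}
        return {"steer": "straight", "accelerate": False}

driver = Driver()
plan = driver.strategic()
decision = driver.tactical(plan, traffic={"lead_speed": 40})
controls = driver.operational(decision)
```

The layering matters for interface design: feedback that belongs at the operational level (e.g., steering resistance) should not demand strategic-level interpretation, and vice versa, which is the usability point made above.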