Paul Werbicki
Synthetic Agent Vision in Complex 3D Environments
CPSC 699 – Thesis Proposal
Introduction
In the development of agents with human-level behaviors, great importance is placed on the environment and the ability of the agent to interact with it. The challenge is to create an environment that provides an agent with the ability to perceive things as a human would. As stated in (Blumberg, 1996), “believable behavior begins with believable perception.” Arguably, of all the senses humans use to perceive their environment, vision is the most important, providing more sensory information, more quickly and efficiently, than any other. The same holds true for agents operating in a complex 3D environment, our closest approximation of the “real world”. Agents rely heavily on their architecture to provide this sensory information. The agent architecture represents the environment that an agent operates in and acts like an operating system, executing the agent program and managing input/output with the environment. Many architecture implementations of complex 3D environments (virtual realities) exist that provide synthetic vision to an agent (also referred to as a synthetic character or a virtual human). Each implementation relies on the interaction of the agent program with the architecture, and trade-offs are made depending on certain design goals (such as autonomy and performance).
This paper proposes criteria for developing agent architectures that provide human-based perceptions, such as vision, for complex 3D environments. It then proposes a method for implementing a synthetic vision agent architecture that meets these criteria, allowing the future development of agents with human-level behavior. Focus is placed on developing an ideal agent architecture that imposes desirable constraints on agent programs. What constitutes an ideal solution is subjective, judged relative to the proposed criteria in conjunction with the end design goals of the complete system. In the proposed system the overall design goal is to allow an agent, based on simulated human senses, to perform human actions with reasonable reaction times. The only restrictions are those naturally imposed by an accurate simulation of human vision.
Background
In order to discuss human-level agent architectures it is important to first provide a common understanding of agent terminology. Agents consist of two distinct parts: the agent program and the agent architecture. The interaction of these two parts is important to developing criteria for evaluating the agent architecture portion.
Agent Programs
An agent, short for software agent, is a piece of software that is able to operate in place of another entity, in our case most likely a human. Formally, we define an agent as anything that perceives its environment through sensors and acts upon that environment through effectors (Russell and Norvig, 1995). This is a very broad definition that includes humans, animals and robots as well as software agent programs. In the case of a human agent, information is perceived using the eyes, ears, nose and other sensors, and actions are performed using the hands, feet, mouth and other body parts that affect the surrounding environment. For a software agent, inputs comprise the information perceived and outputs provide the effects that should be performed as a consequence.
Software agents are composed of an agent program and an agent architecture. The agent program is the software that operates on input perceptions, mapping them to output actions. This mapping is performed based on domain knowledge, acquired knowledge and defined or learned behavior. Domain knowledge is specific information about the environment the agent operates in. It usually takes the form of cause-and-effect rules that allow a certain level of predictability to be assumed based on actions taken. As an agent program operates in its environment, it acquires further knowledge as that environment changes. This knowledge is based on interaction with the environment through perceptions and actions. Using domain knowledge and acquired knowledge, an agent program is able to decide on the action to take from its own understanding and experiences.
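The percept-to-action mapping just described can be sketched as a minimal interface. All class, field and percept names below are illustrative assumptions, not part of any existing architecture:

```python
from dataclasses import dataclass

@dataclass
class Percept:
    """A single piece of perception information delivered by the architecture."""
    kind: str   # e.g. "object_seen", "sound_heard"
    data: dict  # high-level details (object name, position, ...)

class AgentProgram:
    """Maps input percepts to output actions using domain and acquired knowledge."""
    def __init__(self, domain_knowledge: dict):
        self.domain_knowledge = domain_knowledge        # built-in cause-and-effect rules
        self.acquired_knowledge: list[Percept] = []     # experience gathered at run time

    def step(self, percepts: list[Percept]) -> list[str]:
        self.acquired_knowledge.extend(percepts)        # remember what was perceived
        actions = []
        for p in percepts:
            # consult domain knowledge for a rule matching this kind of percept
            action = self.domain_knowledge.get(p.kind)
            if action is not None:
                actions.append(action)
        return actions
```

For example, an agent constructed with the rule `{"object_seen": "approach"}` would respond to an `object_seen` percept with an `approach` action, while also accumulating every percept as acquired knowledge for later decisions.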
Agent programs are responsible for exhibiting intelligent behavior because they are the place where perceptions are mapped to actions. They are used to perform research on topics such as cognitive modeling, perception and reasoning, motion planning, memory and learning. Inside the agent program, Artificial Intelligence algorithms and techniques are used to implement domain knowledge and collect acquired knowledge of the environment in which the agent operates. Depending on the methods used, an agent will behave in a certain way in each situation in the environment, which opens the possibility of personality, emotions and unpredictable behavior, all active research areas. These higher-level concepts, however, are heavily dependent upon the perceptions provided to the agent program; providing those perceptions is a necessary task of the agent architecture.
Agent Architectures
The agent architecture (often simply called the architecture) is the operating system that executes an agent program. The architecture is responsible not only for executing the agent program, but also for sensing information from the environment to provide to it as percepts and for applying its effectors to the environment. In general, anything sent to the agent program can be called the percepts (perception information) and any feedback to the agent architecture can be called the effectors (environmental actions). There is a tight coupling between an agent program and the agent architecture, as an agent is programmed specifically for the environment it operates in. If the interface mechanism is common among agent architectures it is possible to reuse agents in another environment, although their suitability is questionable.
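The architecture's responsibilities can be pictured as a sense-think-act loop. The sketch below is a toy illustration under assumed names (nothing here corresponds to a real engine API); it shows the architecture, not the agent program, mediating all input and output:

```python
class ToyEnvironment:
    """Minimal stand-in for a 3D world: hands out percepts, records actions."""
    def __init__(self):
        self.applied_actions = []

    def sense(self):
        # a real architecture would return filtered, human-like percepts here
        return ["door_ahead"]

    def apply(self, action):
        self.applied_actions.append(action)

class Architecture:
    """Executes the agent program and mediates all I/O with the environment."""
    def __init__(self, environment, agent_program):
        self.environment = environment
        self.agent_program = agent_program

    def tick(self):
        percepts = self.environment.sense()      # 1. sense on the agent's behalf
        actions = self.agent_program(percepts)   # 2. execute the agent program
        for action in actions:                   # 3. apply effectors to the world
            self.environment.apply(action)

# A trivial agent program: open any door it perceives.
def reflex_agent(percepts):
    return ["open_door" for p in percepts if p == "door_ahead"]
```

Running one `tick` with `reflex_agent` causes the environment to record an `open_door` action; the agent program itself never touches the environment directly.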
Examples of existing agent architectures include desktop operating systems (system monitoring agents), the World Wide Web (web spiders), and modern computer games (artificial opponents). Since the definition of an agent can apply to so many things it is necessary to restrict the scope when discussing agents, agent programs and agent architectures. In the context of human-level agents our scope is restricted to agent programs and architectures that involve 3D environments. The best examples of these are computer games. Computer games closely simulate the real world providing virtual 3D environments in which human players can compete against or cooperate with artificial opponents (agents) to enhance their game play experience. Computer game engines use algorithms and data structures developed in Computer Graphics research to render 3D environments to the computer screen in real-time. The image supplied to the user allows them to perceive a realistic simulation of an environment that exists only on the hard-drive and in the memory of their personal computer.
The 3D environments of modern computer games provide a richness and complexity that blurs the line between the artificial and the real world. Agent programs that use computer games as their agent architecture have available to them the visualization of an agent’s operation, with perceptions that roughly simulate human senses and the ability to interact and change their environment. These agent architectures, however, are limited to the commercial goals of the computer game developer, to provide artificial opponents that give the illusion of human-like behavior. Still, computer games provide an excellent starting point for developing an ideal architecture for human-level agents.
Criteria for Evaluating Human-Level Agent Architectures
There are many different approaches to evaluating an architecture that supports agent programs exhibiting human-level behavior. Implementation details such as the complexity of the environment, the interface between agent programs and the architecture, the language used and the resources required make direct evaluations using simple benchmarks difficult. Evaluations may also be made based on the specific architecture domain (in this case, a complex 3D environment), the features supported by the architecture or the tasks an agent is able to complete.
The most useful approach may be to establish design goals based on the type of agent research being performed. Existing work on design goals for autonomous synthetic characters in computer games (Laird, 2000) outlines many behavior and performance capabilities that are required for research into agents of this nature. Often though, the criteria are based on a combination of agent programs and architectures and do not facilitate evaluating the architecture separately. Given that an agent is comprised of an agent program and an agent architecture, and that an agent program is completely dependent upon the architecture, it should be possible to evaluate the architecture alone for the suitability of developing agents with human-level behavior.
The design goals of a human-level agent architecture are in fact the software requirements of an agent program that operates in this complex 3D environment. It is convenient to present and discuss these in the form of a requirements list that can be used to evaluate existing or future architectures. The requirements that follow are proposed, based on what an ideal agent architecture should provide in order to develop agents with human-level behavior.
Promote the autonomy of agents. An agent that has complete knowledge of its environment (where to move, the location of objects, etc.) lacks autonomy, since it does not require any perception of its environment. An architecture that is able to represent different environments with different, easily modified layouts makes it very difficult to create usable agents with complete built-in knowledge of their environment. For agents to function well in different environments it is important that they have the ability to acquire knowledge about each environment and determine their behavior based on their experience. By providing tools to create and modify environments, agents can be tested in a wide variety of situations, helping to give them autonomous behavior.
Provide a strict interface contract between the agent program and the agent architecture. Implementing a 3D environment requires data structures, often called a scene database, that represent the geometry of the world, the objects within it, their locations and their motion trajectories. Agents are often provided with complete access to the scene database and are allowed to interrogate the environment directly for information. This leads to the possibility of knowing more information than a human would be able to know. Omniscience in the long run does not lead to agents that exhibit believable behavior.
An agent program must be able to act based only on the information it is provided by the architecture through its senses. It is the job of the architecture to provide a proper information flow so that it is not necessary to interrogate the scene database for more information. Enforcing a strict contract between the architecture and the agent program also makes it easier to achieve other requirements such as sensory honesty and scalability.
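One way such a contract might be enforced, sketched here with hypothetical names, is to hide the scene database behind a narrow percept interface, so that filtered percepts are the only view of the world an agent program ever receives:

```python
class SceneDatabase:
    """Full world state. Under a strict contract the agent program never
    receives a reference to this object."""
    def __init__(self, objects):
        self.objects = objects  # {name: (x, y)}

class PerceptInterface:
    """The only handle given to an agent program: it exposes filtered
    percepts, never the raw scene database."""
    def __init__(self, scene_db, visible_filter):
        self._scene_db = scene_db       # private to the architecture
        self._visible = visible_filter  # e.g. a view-frustum or range test

    def visible_objects(self):
        # reveal only objects that pass the architecture's visibility test
        return [name for name, pos in self._scene_db.objects.items()
                if self._visible(pos)]
```

With a scene containing a nearby chair and a distant table, and a filter admitting only positions within range, `visible_objects()` would return just the chair; the table, although present in the scene database, never reaches the agent program.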
Support human-based sensors such as vision, audition and touch. A human-level agent architecture that does not simulate human senses would not be very useful. However, the techniques used to implement those senses vary greatly in their ability to provide perception information to the agent. The goal of any human-level agent architecture is to allow the development of believable agents. Implementing artificial vision, in which the mechanisms of the eye are simulated, may provide an agent with vision, but the agent may be unable to use the information in creating human behavior. The delicate balance between providing information and being able to use that information makes this an active research area. Ultimately, the better the sensory perceptions, the better the behavior of the agent will be.
Enforce sensory honesty of agent programs. Agents that are only able to perceive the same information as a human are said to exhibit a property called sensory honesty. In computer games it is often the case that agents are able to see through walls and possess a 360° viewing angle. This gives them more information than they would be privy to if they were human players. This unfairness in perceptions causes agents to stand out from human players, decreasing their believability. By enforcing sensory honesty an agent program has no choice but to use only the information available to it. An agent that previously accessed the scene database directly may initially be at a disadvantage against human players; however, this shortcoming will most likely cause it to exhibit more human characteristics in the eyes of the human players.
Support human-based effectors such as movement, object manipulation and voice/speech. The behavior of an agent is portrayed through how it affects its environment. In the same way that realistic perceptions are necessary for believable behavior, so are the actions carried out based on that behavior. In this case actions should be executed realistically by the architecture. For example, an agent should not be able to perform an instant 180° turn, or instantaneously respond to a change in the environment. Actions take time to be carried out, and this is enforceable at the architecture level.
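Architecture-level enforcement of realistic actions can be illustrated with turning. In this sketch (a simplified illustration, not drawn from any particular engine) the architecture clamps each rotation to a maximum turn rate, so an "instant 180° turn" is impossible no matter what the agent program requests:

```python
def turn_toward(current_heading, target_heading, max_turn_rate, dt):
    """Rotate toward target_heading (degrees), but never faster than
    max_turn_rate (degrees per second) over a time step of dt seconds."""
    # shortest signed angular difference, in [-180, 180)
    diff = (target_heading - current_heading + 180.0) % 360.0 - 180.0
    max_step = max_turn_rate * dt
    step = max(-max_step, min(max_step, diff))  # clamp to the allowed turn rate
    return (current_heading + step) % 360.0
```

An agent requesting a 90° turn with a 90°/s limit covers only 45° in half a second; the architecture, not the agent program, decides how fast the body can actually move.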
Allow movement restricted only by the laws of the environment (physics, gravity, etc.). Environments should be complex enough to allow such things as bridges, where agents may pass over and under each other at the same time. Although this may seem trivial, some virtual environments are restricted by their underlying implementation to flat rooms with only variable ceiling and floor heights as their greatest level of complexity. Complex 3D environments contain arbitrary geometry that allows complex movement bound only by the laws of the environment, such as physics (an agent cannot move through walls) and gravity (it cannot fly).
Provide for macro and micro interaction in the environment. Macro interaction is the coarse movement of an agent around an environment, in which the body of the agent is treated as a single object. At a micro level it should be possible to close in on the agent and observe the movement of various body parts, recognize facial expressions (very important for expressing emotion) and observe interactions between an object and the hand of an agent. This ability to observe different levels of interaction with the environment provides opportunities for the agent itself to exhibit different levels of behavior.
Easily scale to large complex environments. An agent with human senses should only be able to perceive information within a particular area of influence. In the case of vision this is a viewing frustum, usually 90°, that extends out from the agent in the viewing direction. In the case of audition it is a sphere surrounding the agent, within which sound degrades with distance until it can no longer be heard. By providing the agent with only that information, the agent simply processes what it is given. The architecture removes all objects that are occluded and all sounds outside the hearing radius of the agent, greatly reducing its processing time. The agent is in a sense unaware of the size of the environment, which allows the system to scale to larger environments.
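The two culling tests above can be sketched in 2D. The field-of-view angle, range limit and loudness model below are illustrative assumptions (a 90° frustum here is simplified to a cone, and occlusion is ignored):

```python
import math

def in_view(agent_pos, agent_heading, target_pos, fov_deg=90.0, max_dist=50.0):
    """True if target falls inside the agent's 2D viewing cone and range;
    the architecture culls everything else before the agent perceives it."""
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return True   # the agent's own position is trivially "seen"
    if dist > max_dist:
        return False  # beyond visual range
    angle_to_target = math.degrees(math.atan2(dy, dx))
    diff = (angle_to_target - agent_heading + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0

def audible(agent_pos, sound_pos, loudness, hearing_threshold=0.05):
    """Inverse-square style falloff: sounds that degrade below the hearing
    threshold are culled before reaching the agent."""
    dx = sound_pos[0] - agent_pos[0]
    dy = sound_pos[1] - agent_pos[1]
    d2 = dx * dx + dy * dy
    return loudness / (1.0 + d2) >= hearing_threshold
```

Because the agent receives only what these tests admit, its processing cost depends on what is nearby and in view, not on the total size of the environment.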
Provide high-level information versus a raw unprocessed view of the sensor data. When humans use their nose, they use it to recognize a particular scent. Unless the scent is unknown, the work involved in recognizing it does not even enter the mind. An architecture can possess information about its environment beyond simply the geometry of the 3D world. Shapes such as chairs, tables and stairs; locations like the kitchen or the bedroom; the sound of a clock ticking and the smell of fresh bread, if present in the environment, can all be stored in the scene database. This higher-level information is more useful to an agent than the raw polygons of the 3D environment, or the frequencies of the sound. Although useful in its own way, raw information requires an agent to perform more work, because it must first analyze, then recognize, what information has been provided. Given that the agent should not be able to access the scene database for any information, it would have to perform these computations itself before it could use the information properly. By providing higher-level information, the architecture spares the agent from spending time recognizing what it perceives.
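The distinction can be made concrete with a percept type. In this hypothetical sketch the architecture delivers a recognized label, a named location and a distance, rather than polygons or audio samples, so the agent can reason immediately:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HighLevelPercept:
    """What the architecture hands the agent: a recognized shape, a named
    place and a distance, drawn from the scene database rather than
    raw geometry or sound frequencies."""
    label: str       # semantic category, e.g. "chair"
    location: str    # named area, e.g. "kitchen"
    distance: float  # metres from the agent

def describe(percept: HighLevelPercept) -> str:
    # the agent can use the percept directly; no recognition step is needed
    return f"a {percept.label} in the {percept.location}, {percept.distance:.1f} m away"
```

A percept such as `HighLevelPercept("chair", "kitchen", 2.0)` is immediately meaningful, whereas the equivalent raw data (hundreds of polygons) would first have to be classified by the agent itself.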
Balance operations of the environment with operations of the agent. In the case of a computer game as an architecture, the agent is given precious little time to perform its operations. The game engine uses most of the computational time for other tasks such as object culling, real-time rendering and collision detection. Ideally, agents should be given enough time to perform whatever operations they need, while the environment is given enough time to manage the inputs and outputs of agents and update the visualization. By supporting human senses and providing higher-level information to the agent it is possible to off-load some work, such as collision detection and object avoidance, from the architecture to the agent. Tradeoffs like this help balance resources between the architecture and the agents.
Synthetic Vision
In proposing architecture criteria for agents with human-level behaviors, the requirements are generalized to allow any human sense to be implemented. Of all the human senses, vision is the most used and, from an Artificial Intelligence viewpoint, the most researched. Synthetic vision is the term used to describe human vision applied to agents. It differs from artificial vision, which attempts to simulate the mechanics of the eye to precisely reproduce vision. Synthetic vision is a reasonable estimate of what is seen, providing a list of visible objects in a scene to an agent. This avoids the high costs of artificial vision and, because the information about the environment is already known, bypasses issues such as distance, recognition and noise faced by robotics and other types of physical agents.
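A minimal sketch of this "list of visible objects" idea, in 2D and under assumed parameters (a 90° cone standing in for the frustum, a fixed range, and no occlusion test, which a full implementation would add), might look like:

```python
import math

def synthetic_vision(agent_pos, agent_heading, scene_objects,
                     fov_deg=90.0, max_dist=50.0):
    """Return the (name, distance) pairs of scene objects the agent can
    currently see, nearest first. Occlusion is omitted for brevity; a full
    implementation would also discard objects hidden behind geometry."""
    visible = []
    for name, (x, y) in scene_objects.items():
        dx, dy = x - agent_pos[0], y - agent_pos[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0 or dist > max_dist:
            continue  # cull objects at the agent's own position or out of range
        diff = (math.degrees(math.atan2(dy, dx))
                - agent_heading + 180.0) % 360.0 - 180.0
        if abs(diff) <= fov_deg / 2.0:
            visible.append((name, dist))
    visible.sort(key=lambda item: item[1])  # nearest first
    return visible
```

Because the environment's object positions are already known to the architecture, the visible list comes directly from geometry tests; no image processing, recognition or noise handling is required, which is precisely the cost synthetic vision avoids relative to artificial vision.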