Outline for GDC speech

Intro:

The face is the pinnacle of human expression. It’s no accident that movies use close-ups in emotionally intense moments. When the beautiful woman kisses the leading man, we’re not looking at her $1,200 Gucci stilettos. It’s always the face, because the face conveys that most elusive essence of humanity that we all understand, but have a hard time quantifying. And this is what makes it so difficult, and so crucial, for the face to be re-created well in the realm of digital art. If you get it wrong, everybody can tell. If you get it right, nobody really notices. We see other people through their faces: their movements, their expressions, their nuances. So when we are breathing life into a character for a game, we must create a face that conveys the essence of that character. Otherwise, it becomes mechanical and unappealing.

As game developers, we have only just begun to use some of the tools that the film industry has had at its disposal for a decade. But as our technology increases exponentially, so will the potential to create great art, and to generate memorable characters that, like their comic-book predecessors, will continue to live on for years to come. But we can only accomplish that (along with some really healthy sales figures) by understanding what makes a great character, and how we can use every bit of technology available to us toward that goal. The next generation of hardware is right around the corner, and we are going to get some tools soon that I think will allow us to aspire to these goals, but ultimately it all comes down to how we use these tools and for what purpose. It’s far too easy, and I speak from personal experience, to pursue technology as an end in itself, without stopping to consider what we really want to do with it. And what we want to do is use it to create great characters for fun games.

This balance between art and technology is essential in face animation, because it is such an elusive thing to capture, and there are so many new ways to approach it. My goal here today is to share my experience and ideas with you, so that everyone can get a good idea of what is involved, and how to find the best solution to make the most expressive, dramatic, and gripping characters you possibly can.

The Facts:

I break face animation into four separate stages:

  1. Modeling
  2. Deforming
  3. Rigging
  4. Animating

Each one of these stages must be separate from the others, so that your entire face animation solution can be applied to many models, your rig can drive any method of deformation, and the animation can be applied to any rig in your character set. This is a modular concept, so in our pipeline we can have as many characters as we want, each deformed any way we want, and we can take the same animation data, apply it to each character, and still get an individual performance tailored to that character. How do we do this? How can one single animation file be applied to any face, and yet have each face look completely different? The key is to make one aspect of your setup controlled by another, but not dependent on it. The animation drives the rig, which is a collection of attributes that drive the deformers, which in turn control the deformation of the model. Therefore, the deformers’ behavior can be set differently for the same attribute. So an “Angry” attribute could look totally different on two different faces, but the animation on that attribute could be the same. I have a graph which illustrates this relationship (visual).


(Visual: Animation data → Driven Keys → Deformation Nodes)
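
To make this concrete, here is a minimal Maya Python sketch of that wiring, where one shared “Angry” attribute drives a different driven-key response on each character. The node names (faceCtrl, brow joints) and the value ranges are placeholders, not a prescribed convention:

    # Minimal sketch: a rig attribute drives a bone through Set Driven Keys.
    import maya.cmds as cmds

    def wire_angry_attr(control, brow_joint, full_angry_rotation):
        """Create an 'Angry' attribute on the control locator and let it drive a
        brow joint. full_angry_rotation is where this particular character's brow
        lands at full value, so the same animation curve on 'Angry' produces a
        different result on each character."""
        if not cmds.attributeQuery('Angry', node=control, exists=True):
            cmds.addAttr(control, longName='Angry', attributeType='double',
                         min=0, max=10, defaultValue=0, keyable=True)
        driver = control + '.Angry'
        driven = brow_joint + '.rotateZ'
        # Driven keys: 0 = rest pose, 10 = this character's full "angry" brow.
        cmds.setDrivenKeyframe(driven, currentDriver=driver, driverValue=0, value=0)
        cmds.setDrivenKeyframe(driven, currentDriver=driver,
                               driverValue=10, value=full_angry_rotation)

    # Same attribute and the same animation data, two different characters:
    # wire_angry_attr('hero_faceCtrl',  'hero_brow_L_joint',  -18)
    # wire_angry_attr('troll_faceCtrl', 'troll_brow_L_joint', -35)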


Modeling is vital as a foundation. The face model, regardless of your poly limit, should be built to give the best results, which means taking a look at the anatomy of the human face (this applies even if the character is not human; see Finding Nemo). I don’t have a lot of time, so I am not going to go into a detailed description of how to model a face, but I will discuss what makes a good face model for animation. The face is an amazingly complex structure, with many overlapping muscles and mechanisms, but the most important thing when evaluating a model for suitability is the construction of the edge loops (or contours, if using NURBS). The loops should be constructed based on how the muscles of the face operate. The eyes and mouth are surrounded by circular muscles, so the edge loops should run in rings radiating out from the center of the lips and the eye sockets, while other areas, such as the forehead, should be laid out in straighter rows. There should also be allowances for the folds in the skin, like eye creases and smile lines. I know that as game developers we often don’t get the poly counts we would like for our models, especially in a football game where you have 30-40 models on the field at once. But even with low poly counts the models can be set up to deform well.

Deforming is the method with which you are going to make your model move. This can be Blend Shapes, or Bones, or Clusters, or any other proprietary setup you currently have in your pipeline. My personal preference, in a perfect world, is to use Bones in conjunction with Blend Shapes, with the Bones driving the bulk of the deformation and Blend Shapes reserved for accents like fleshy folds and wrinkles. I know that blend shapes are the standard method of working with face animation, but for a lot of reasons they really aren’t feasible for a game developer. However, a lot of people working with face animation consider bones to be an inferior compromise, made necessary by limited resources, so they don’t spend a lot of time developing a bone structure that reflects the face muscles accurately. I believe that bones offer a tremendous advantage for face animation, and the title of this presentation makes my preference and my intent plain, but I came to this conclusion through my own experience and experimentation.

Blend Shapes:

  • Mesh Dependent – as we all know, models change over the course of development, and re-sculpting 50-100 blend shapes every time a mesh changes is very time consuming, especially when you have to do it for every character.
  • Interpolation – Blend shapes have a disadvantage when used for face animation: the vertices only move in a linear fashion, which makes controlling the motion impossible without a tween shape for every shape involved. That multiplies the number of shapes needed without really fixing the problem. My first face animation job was to create a solution we could use across multiple characters in a game (this eventually became the basis for my whole approach). The first thing I did was individually sculpt each blend shape target based on some photo reference I had. After that, I tried some simple test animations. The results were terrible. Everything looked slightly off, as if it were not moving right; the lips moved strangely and the jaw seemed to be swimming instead of opening. I spent some time trying to improve it with tween shapes, but it only helped marginally. After a weekend off, I came back to work and looked at it again. Then it dawned on me that while the blend shapes looked accurate, they didn’t account for how the face got from point A to point B. The vertices were arbitrarily finding their way into position (by linear interpolation) instead of moving in an anatomically correct manner. I thought about it for a while, and decided to scrap the blend shapes and try bones instead. In creating the bone structure, I tried to mimic the face muscle actions as best I could (more on this later). The results were dramatically better. It was at this point that I realized that what we are really seeing is the MOTION of the face, not the end shapes. Even if the bones didn’t match the target shapes quite as perfectly as the blend shapes did, it looked much better because it moved the way it should (the sketch after this list puts rough numbers on the difference).
  • Memory Issues – While this may depend on your game engine and platform, I believe I can make the general statement that blend shapes are more memory intensive than bones. If you have 10 characters in a scene, and you need a minimum of 50 blend shapes for each character, then you have 500 face meshes that need to be stored and evaluated in each scene. That’s a lot, even on a next-gen system.
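
To put rough numbers on the interpolation problem, here is a small standalone Python sketch comparing where a chin vertex sits halfway through a 30-degree jaw opening: lerped between two shapes (blend shape style) versus rotated about a jaw pivot (bone style). The coordinates are made up; it is only an illustration:

    # Toy comparison: linear blend shape interpolation vs. rotation about a jaw pivot.
    import math

    def lerp(p0, p1, t):
        return tuple(a + (b - a) * t for a, b in zip(p0, p1))

    def rotate_about(p, pivot, angle_deg):
        a = math.radians(angle_deg)
        x, y = p[0] - pivot[0], p[1] - pivot[1]
        return (pivot[0] + x * math.cos(a) - y * math.sin(a),
                pivot[1] + x * math.sin(a) + y * math.cos(a))

    chin_closed = (0.0, 6.0)   # chin vertex, 2-D side view, made-up units
    jaw_pivot   = (4.0, 0.0)   # pivot roughly back at the jaw hinge
    chin_open   = rotate_about(chin_closed, jaw_pivot, 30)   # the "open" target shape

    print("lerp at 50% :", lerp(chin_closed, chin_open, 0.5))
    print("rotate 15deg:", rotate_about(chin_closed, jaw_pivot, 15))
    # The lerped vertex ends up inside the arc (closer to the pivot than it should be),
    # so mid-motion the chin compresses instead of swinging: the "swimming" jaw above.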

Bones:

  • Deformation Control – Bones have a definite disadvantage compared to blend shapes when it comes to deforming the mesh exactly. To get the same level of detail and hit a desired target shape, a lot of bones need to be used, and even then the shapes will never look quite as exact as you may want them to. To get the proper level of deformation you need to set up the bones very carefully, especially if you are limited to using only rotations.
  • Motion Control – What you lose in deformation control with bones, you more than make up for in motion control. As stated before, bones give you far finer control over the motion of the face than blend shapes do. A bone can be set up to approximate the mechanism of a face muscle, and with spline IK it can even mimic the arc of flesh and muscle sliding over bone, such as in the brow area. The motion of the face trumps the final shape in my book, because the motion is what the eye reads.
  • Game Engine – Even though game engines are all different, almost every one I have had experience with can read bones. Blend shapes, on the other hand, are not as universal.

My own conclusion, from five years of experience, is that blend shapes as the primary mechanism of deformation are more trouble than they are worth, and they don’t provide superior results to a well-constructed bone system. Now, a very dense model, separated into sections based on the muscular structure of the face, with individual blend shapes for each muscle movement in each section, will look really good (think Gollum). The problem is that in a game pipeline this is simply infeasible. Gollum had over one thousand blend shapes, and a whole team of people working on him. Chances are your resources are not quite that lavish. One way to leverage blend shapes is to build a system that uses them for details that bones can’t achieve, and this is a great way to get things such as wrinkles in the forehead and folds in the skin (animated bump maps are another option when dealing with a lower-polygon model).
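
As an illustration of that layered approach, here is a rough Maya Python sketch that adds a wrinkle blend shape in front of the skin deformation and drives it from the same rig attribute that raises the brows. The names (head_mesh, head_browWrinkle_tgt, faceCtrl.BrowsUp) are placeholders:

    # Layer an accent blend shape (brow wrinkles) on top of a bone-skinned head.
    import maya.cmds as cmds

    # frontOfChain puts the blend shape before the skinCluster in the deformation
    # order, so the bones still do the bulk of the work afterwards.
    bs_node = cmds.blendShape('head_browWrinkle_tgt', 'head_mesh', frontOfChain=True)[0]

    # Drive the wrinkle weight from the rig attribute that raises the brows,
    # so the fold only appears as the bones approach the raised pose.
    cmds.setDrivenKeyframe(bs_node + '.head_browWrinkle_tgt',
                           currentDriver='faceCtrl.BrowsUp', driverValue=0, value=0.0)
    cmds.setDrivenKeyframe(bs_node + '.head_browWrinkle_tgt',
                           currentDriver='faceCtrl.BrowsUp', driverValue=10, value=1.0)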

Bone Setups

In creating a bone structure for your face, the first thing to take into account is your limitations. I have four face animation bone setups, based on different levels of detail and available resources. These are:

  1. Simple 4-bone setup: for in-game animation
  2. Rotation-only 20-bone setup: for game engines limited to bone rotations
  3. X-form, rotate, and scale 20-bone setup: for engines that allow x-forms and scales
  4. Spline IK 50-bone setup: for close-ups and cinematic-level detail

The beauty of the system I’m demonstrating is that the same four setups can be controlled with one animation file, even though they are totally different.

I came up with the idea of using spline IK after looking at custom face animation solutions from different companies and seeing how they were set up, how the bone structure was developed, and why. After reading an article about how simulating the skin sliding over bone was the best way to achieve realism, I started thinking about how to re-create this with a bone system. I tried a few experiments, and then stumbled across the spline IK solution. What really got me interested was seeing that when you transform a joint in a spline IK chain, it only travels along the curve. This gives you a customizable arc that follows the underlying bone structure of the face, and it approximates the action of muscles like the frontalis (brow) and the zygomaticus. These muscles, unlike the orbicularis oculi (eye) or the orbicularis oris (mouth), actually pull up or push down. You can see the motion demonstrated here. Combined with some accent blend shapes, I think you can get a really fleshy and anatomically accurate model of the face structure. You can see how the pivots of the joints are set well back to provide a shallower arc of motion, especially in the eyes. Since I had to use only one bone for both eyes, it had to be rotated with a shallow arc to work.
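
Here is a bare-bones Maya Python sketch of the brow idea, assuming placeholder names and positions: a short joint chain plus a curve shaped like the brow ridge, tied together with a spline IK handle so the chain conforms to that arc rather than moving in straight lines:

    # Spline IK brow sketch: the curve stands in for the skull surface under the brow.
    import maya.cmds as cmds

    brow_positions = [(-2.0, 0.0, 0.5), (-1.0, 0.3, 0.9), (0.0, 0.4, 1.0),
                      (1.0, 0.3, 0.9), (2.0, 0.0, 0.5)]

    # A curve roughly following the brow ridge; in practice you would sculpt this to the model.
    brow_curve = cmds.curve(degree=3, point=brow_positions)

    # A joint chain spanning the brow, one joint per control region.
    cmds.select(clear=True)
    brow_joints = [cmds.joint(position=p) for p in brow_positions]

    # The spline IK solver keeps the chain conforming to the existing curve,
    # so the joints move through the arc of the brow instead of in a straight line.
    cmds.ikHandle(startJoint=brow_joints[0], endEffector=brow_joints[-1],
                  solver='ikSplineSolver', curve=brow_curve, createCurve=False)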

The less detailed bone systems were devised for in-game use on PS2, where we had limits on the number of bones and on which channels we could use. I have one 21-bone setup using only rotations, and one 21-bone setup using rotations, transforms, and scales. The low-res face setup uses only four bones with rotations, as you can see in the image. Even with four bones you can get full eyebrow, eye, eyelid, and jaw movement that looks reasonable from a distance.

Rigging

The rig is the centerpiece of the face animation system. The rig itself is a simple setup, really, but powerful nonetheless. It is based on a collection of attributes created on a few locators next to the head. These attributes feed values to the bones and blend shapes via Set Driven Keys in Maya. Attributes can be added as needed, and each joint can take as many inputs as you need. The attributes are set up to mimic Blend Shape channels, so that any animation that would normally go into a blend shape can be used. One added plus of this system is that you can tailor each rig to the individual character model by changing both the skin weights and the driven key curves. This means that even though the rig is the same for each character, and the animation is the same, every character will give a unique performance from the same data. This is especially useful for a game that uses the same dialogue for different characters.

My method for setting up the rig is based on a layering principle, mostly set up for hand animation, but it can really be configured any way you want. Layering in face animation is a similar concept to NLA for characters: I have one set of attributes for each element of the face, and all of them can be mixed and matched just like Blend Shape sliders (a rough sketch of the locator setup follows the list below).

  1. Visemes
  2. Emotions Upper
  3. Emotions Lower
  4. Characteristics (individual controls)
  5. Eye Controls
  6. Head Control (optional)
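
As a rough illustration of how these layers could be laid out, here is a Maya Python sketch that builds one locator per layer, each carrying 0-10 attributes that behave like Blend Shape channels. The layer and attribute names are examples only, not a fixed convention:

    # Build one control locator per animation layer, with keyable 0-10 attributes
    # that can later be wired to bones and blend shapes with Set Driven Keys.
    import maya.cmds as cmds

    FACE_LAYERS = {
        'Visemes':         ['AA', 'EE', 'OO', 'FV', 'MBP'],
        'EmotionsUpper':   ['Angry', 'Surprised', 'Sad'],
        'EmotionsLower':   ['Smile', 'Frown', 'Sneer'],
        'Characteristics': ['BrowRaise_L', 'BrowRaise_R', 'MouthSlide'],
        'EyeControls':     ['Blink', 'Squint_L', 'Squint_R'],
    }

    def build_face_controls(prefix='face'):
        locators = {}
        for layer, attrs in FACE_LAYERS.items():
            loc = cmds.spaceLocator(name='{0}_{1}_ctrl'.format(prefix, layer))[0]
            for attr in attrs:
                cmds.addAttr(loc, longName=attr, attributeType='double',
                             min=0, max=10, defaultValue=0, keyable=True)
            locators[layer] = loc
        return locators

    # build_face_controls()  # -> {'Visemes': 'face_Visemes_ctrl', ...}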

The Visemes contain all of the lip sync attributes; these control the mouth shapes. The Upper Emotions are broad-stroke attributes that control everything outside of the mouth; I generally use these for strong emotions or quick mood swings. The Lower Emotions are similar but control the mouth only, so there is some control over any driven key conflicts with the Viseme curves; they also let me make more subtle changes in emotional state while the character is speaking. The Characteristics are individual controls for more subtle motions, and they also provide several asymmetry controls, like moving the mouth from side to side.

The Eye controls are set up to track the eyes, and the eyelids are wired into this control set so that they automatically rotate and give the illusion that the eyes are rolling in the socket. The eyes actually track two points: one tied to the head motion, and one that is independent. They are weighted mostly toward the independent eye tracker, and slightly toward the head-controlled tracker, so that moving the head creates an illusion of focus. The eyes shift slightly when the head is moved with the Head Control, even though they are still looking at the same spot, which mimics the real-life mechanism of focus. I did this because it looks more realistic than a single eye tracker, where the eyes would stay locked on exactly the same spot no matter how the head moves.
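
One possible way to realize the two-tracker eye setup in Maya is a weighted aim constraint. This sketch assumes an eye joint plus two placeholder locators, one parented under the head (so it inherits head motion) and one left independent; the weights are just starting values:

    # Aim the eye mostly at the independent tracker and slightly at the
    # head-space tracker, so moving the head shifts the gaze a little.
    import maya.cmds as cmds

    def setup_eye_tracking(eye_joint, independent_target, head_target,
                           independent_weight=0.8, head_weight=0.2):
        constraint = cmds.aimConstraint(independent_target, head_target, eye_joint,
                                        aimVector=(0, 0, 1), upVector=(0, 1, 0),
                                        maintainOffset=True)[0]
        # Re-weight each target after creation.
        cmds.aimConstraint(independent_target, eye_joint, edit=True,
                           weight=independent_weight)
        cmds.aimConstraint(head_target, eye_joint, edit=True, weight=head_weight)
        return constraint

    # setup_eye_tracking('eye_L_joint', 'eyeTracker_world_loc', 'eyeTracker_head_loc')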

A big advantage of this rig is that animation data can come from any source and still be channeled into the rig, without being baked directly onto the bones. This gives you the opportunity to edit it if you want. I’ll talk about animation in a moment, but you can see how this rig could easily fit into a pipeline where your data could be coming from any source, or multiple sources, and it would always end up as just an anim file on your rig, which could then be shared across any other rigged character.

Animation

As I said previously, the animation data can come from basically any source and still slot into this face animation pipeline. So far I have worked with hand animation, motion capture, audio recognition, puppeteering, and video tracking. All of them have good and bad points, but all of them can be fed directly into this face animation solution. My personal preference is a combination of mocap and hand animation; my second choice would be hand animation alone. But regardless of my own preferences, all of these methods will work, as long as you can feed the data into the attributes you have set up. Here is a list of the different methods and how they work: