There Is an Extensive Literature Dedicated to Modeling Temporal Sequences in the Vision

There is an extensive literature on modeling temporal sequences in the vision community; comprehensive surveys can be found in [3,12,13,14]. We review the methods most relevant to our work, focusing mainly on motion models such as trajectory models and graphical models.

Trajectory models learn the trajectories of different motion classes, and an input motion trajectory is matched against these class-specific trajectories. Alon et al. [1] presented a dynamic space-time warping (DSTW) algorithm that aligns a pair of query and model gestures in both space and time, and evaluated it on hand-signed digits. Li and Greenspan [2] proposed a two-phase approach for recognizing multi-scale gestures: Dynamic Time Warping (DTW) first eliminates significantly different gesture models, and Mutual Information (MI) is then applied to match the remaining models. Both techniques are invariant to the time scale of motion sequences, but neither learns the structure shared between gesture classes.
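
To make the trajectory-matching idea concrete, the following is a minimal sketch of classic one-dimensional DTW used as a nearest-template classifier. It is not the DSTW algorithm of [1] nor the two-phase method of [2]; the function names, the absolute-difference local cost, and the toy templates are assumptions chosen for illustration.

```python
import math


def dtw_distance(query, model):
    """Accumulated cost of the best monotonic alignment of two 1-D sequences.

    Assumption: the local cost is the absolute difference between samples.
    """
    n, m = len(query), len(model)
    # cost[i][j] = best cost of aligning query[:i] with model[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - model[j - 1])
            # Extend the cheapest of: repeat a model sample, repeat a query
            # sample, or advance both sequences together.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def classify(query, templates):
    """Return the label of the class template closest to the query under DTW."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical class-specific trajectories (one template per class).
templates = {"rise": [0, 1, 2, 3], "fall": [3, 2, 1, 0]}

# A slower version of the "rise" motion still matches its template exactly,
# which illustrates the invariance to time scale.
label = classify([0, 0, 1, 2, 2, 3], templates)
```

Because each class is matched against its own template independently, nothing in this scheme captures structure that is common to several gesture classes.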

Apart from trajectory models, graphical models have also been used extensively to model temporal sequences. Directed graphical models, such as Hidden Markov Models (HMMs) [4] and their many extensions, have been used successfully to recognize arm gestures [5] and a number of sign languages [6,7]. Another class of directed graphical models, Maximum Entropy Markov Models (MEMMs) [8], has been used for tasks such as word recognition, part-of-speech tagging, text segmentation and information extraction. The advantage of MEMMs is that they can model arbitrary features of observation sequences and can therefore accommodate overlapping features. Undirected graphical models have also been used. Sminchisescu et al. [9] applied Conditional Random Fields (CRFs) [15] to classify human motion activities (e.g. walking, jumping); their model can also discriminate between subtle motion styles such as normal walk and wander walk. Wang et al. [10] demonstrated the use of Hidden Conditional Random Fields (HCRFs) [16,17] to classify head and arm gestures. In this framework, a common structure is shared between gesture classes. All of these graphical models, however, require a large number of training examples.
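
As an illustration of how a generative directed model is typically used for recognition, the sketch below scores a discrete observation sequence under one HMM per class with the forward algorithm and picks the class with the highest likelihood. The two-state models, their parameters, and the class names are hypothetical; real systems such as those in [5,6,7] learn these parameters from many training sequences.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | HMM) computed with the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    states = range(len(start))
    # Initialisation: joint probability of the first symbol and each state.
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    # Recursion: propagate through the transitions, then weight by emission.
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
            for s in states
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the class whose HMM assigns the highest likelihood to obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical left-to-right models over two symbols (0 and 1).
# "wave" tends to emit 0 first and 1 later; "reverse" does the opposite.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "wave": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "reverse": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}

predicted = classify([0, 0, 1, 1], models)
```

Each class here has its own independently parameterised model, so every class needs enough examples to estimate its parameters, which is one way to see the training-data requirement noted above.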

One technique that does not fit neatly into the framework of either trajectory or graphical models was introduced by Efros et al. [11], who proposed a motion descriptor based on smoothed and aggregated optical flow measurements over a spatio-temporal volume centered on a moving figure. These descriptors, which are robust to misalignment of the video frames, are used in a nearest-neighbor querying framework for classification. The descriptors were used for classification at the frame level only, but could potentially be used to learn temporal models that share a common set of descriptors.

[1] J. Alon, V. Athitsos, Q. Yuan and S. Sclaroff. Simultaneous Localization and Recognition of Dynamic Hand Gestures. IEEE Workshop on Motion and Video Computing 2005.

[2] H. Li and M. Greenspan. Multi-scale Gesture Recognition from Time-varying Contours. ICCV 2005.

[3] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction. In PAMI, volume 19, pages 677–695, 1997.

[4] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. of the IEEE, volume 77, pages 257–286, 2002.

[5] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In CVPR, 1996.

[6] M. Assan and K. Groebel. Video-based sign language recognition using hidden Markov models. In Int’l Gest Wksp: Gest. and Sign Lang., 1997.

[7] T. Starner and A. Pentland. Real-time ASL recognition from video using hidden Markov models. In ISCV, 1995.

[8] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000.

[9] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Conditional models for contextual human motion recognition. In Int’l Conf. on Computer Vision, 2005.

[10] S. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian and T. Darrell. Hidden Conditional Random Fields for Gesture Recognition. CVPR 2006.

[11] A. Efros, A. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. ICCV 2003.

[12] Cedras, C. & M. Shah (1995), Motion-Based Recognition: A Survey, IVC, 13(2):129-155.

[13] D.M. Gavrila, "The visual analysis of human movement: a survey", Computer Vision and Image Understanding, vol. 73, no. 1, 1999, 82-98.

[14] J.K. Aggarwal and Q. Cai, "Human motion analysis: a review", Computer Vision and Image Understanding, vol. 73, no. 3, 1999, 428-440.

[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labelling sequence data. In ICML, 2001.

[16] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS, 2004.

[17] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt. Hidden conditional random fields for phone classification. In INTERSPEECH, 2005.