CLEAR 2007 Evaluation Plan

Head Pose Estimation

Editor:

Michael Voit ()

Version 0.2

February 14th, 2007


© CHIL

Contents

0 Label file description
0.1 3D label files
0.2 Head Pose Groundtruth Labels
0.3 Notes on general labelling rate
1 Task and Metric Description
1.1 Head Pose Estimation
1.1.1 Description of the Task
1.1.2 Database Requirements
1.1.3 Output file format
1.1.4 Metrics
1.1.4.1 Definition of metrics
2 Submission protocol
2.1 Result submission
2.1.1 File naming convention
2.1.2 Structure of the archive
2.1.3 System description
2.1.4 Submission procedure


0 Label file description

0.1 3D label files

For the head pose estimation task, it was agreed to provide manual annotations of the respective head bounding box in order to avoid an implicit inclusion of a head detection subtask. The head bounding box was defined as the smallest rectangle around the skull. For each camera view, one label file per person is provided, which generally follows this structure:

<TimeStamp> <x> <y> <width> <height>

Note that in order to provide an annotation in a particular video frame, the person must be visible. In frames where the person leaves the camera’s view, the values are set to -1.
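As an illustration, the following Python sketch reads such a bounding box label file into a dictionary; the integer timestamp format and the function and file names are assumptions, not part of the evaluation specification:

# Minimal sketch: parse a head bounding box label file.
# Each line: <TimeStamp> <x> <y> <width> <height>; all values are -1
# in frames where the person is not visible (timestamps assumed to be integers).

def read_bbox_labels(path):
    boxes = {}  # timestamp -> (x, y, width, height), or None if not visible
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip empty or malformed lines
            timestamp = int(parts[0])
            x, y, w, h = (int(v) for v in parts[1:])
            boxes[timestamp] = None if x == -1 else (x, y, w, h)
    return boxes

# Hypothetical usage:
# labels = read_bbox_labels("person01_cam1_bbox.label")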

0.2 Head Pose Groundtruth Labels

Every recorded person was to wear a magnetic motion sensor (Flock of Birds, Ascension Technology Corporation) in order to allow capturing his or her true head orientation. The sensor captured at ~30 Hz, and its capture files follow the same general structure as the manual head bounding box annotation files described above:

<TimeStamp> <pan> <tilt> <roll>

For training, these label files are provided to every participant. For evaluation, they are used only for scoring.

The order of the rotations is: first pan (around the head’s z-axis), then tilt (around the rotated head’s y-axis), then roll (around the rotated head’s x-axis).
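For illustration, a small Python/NumPy sketch of this composition is given below. It assumes right-handed axes and standard counter-clockwise rotation matrices; these sign conventions go beyond what is specified above.

import numpy as np

# Sketch: build a head rotation matrix from pan, tilt, roll (in degrees),
# applied as intrinsic rotations: pan about z, tilt about the rotated y,
# roll about the twice-rotated x. Sign conventions are an assumption.

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def head_rotation(pan_deg, tilt_deg, roll_deg):
    pan, tilt, roll = np.radians([pan_deg, tilt_deg, roll_deg])
    # Intrinsic z-y'-x'' composition: matrices multiply in application order.
    return rot_z(pan) @ rot_y(tilt) @ rot_x(roll)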

0.3 Notes on general labelling rate

In general, it has been agreed to produce manual annotations of the head bounding box only for every 5th frame. This reduces the effort needed to label long sequences of video. Also, evaluation only happens at these discrete timestamps where annotations are available. Every system may however also produce hypotheses between these distinct frames.

1 Task and Metric Description

1.1 Head Pose Estimation

Leading partner: UKA-ISL
Contact: Michael Voit ()

1.1.1 Description of the Task

The task is to estimate people’s head orientation within a multi-view environment. It is up to the participants to either estimate head pose from within a single camera view or to use all camera views in order to produce a more stable, joint hypothesis. The estimate has to be given with respect to the original sensor’s coordinate frame.

15 people have been recorded in the smart-room at Universität Karlsruhe, all wearing a magnetic motion sensor to annotate their true head orientation. Four cameras in the upper corners of the room were used to capture the whole scene, providing ~3 minutes of video material at 15 frames per second per person. The participants were advised to stand still but to move their heads in all possible directions; due to the long recording time, repetitions occur.

For training, 10 videos, including annotations of the head bounding boxes and the original groundtruth information about the true head pose, are distributed. For evaluation, 5 videos along with the head bounding box annotations are provided. The groundtruth information will be used for scoring. People appearing in the training set do not appear in the evaluation set.

Since manual annotations of the head bounding box only happened every 5th frame of the videos, only hypotheses corresponding to these timestamps are going to be scored. Estimates in between these frames will be ignored.

1.1.2 Database Requirements

The database contents required by the visual head pose estimation task are:

- The video streams from the four fixed corner cameras of the room.
- The intrinsic and extrinsic calibration information for each camera.
- A detailed description of the room layout showing, e.g., the room dimensions and the position and orientation of the room coordinate frame.
- The label files containing the manually annotated head bounding boxes.
- The groundtruth information about the respective person’s true head orientation. (During the evaluation run, this information is not distributed but only used for scoring.)

1.1.3 Output file format

To ease the evaluation process, the resulting hypothesis files should follow the format of the label files described above. The suggested structure is as follows:

<TimeStamp> <hypo pan> <hypo tilt> <hypo roll>

In frames where no estimation could be done, the respective hypotheses need to be set to -1.

Please note that only hypotheses given with respect to the sensor’s coordinate frame are to be submitted. Further, since multiple camera views are also available for the test set, the reference will always be camera 1: hypotheses are to be submitted only for timestamps where bounding boxes are defined in camera 1’s annotations.
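A minimal sketch of how a system might write such a hypothesis file, emitting one line per timestamp with a defined bounding box in camera 1, is given below. The helper names (write_hypotheses, estimate_pose) and the bounding box dictionary format are hypothetical; cam1_boxes is assumed to map timestamps to a box or None, as in the Section 0.1 sketch.

# Sketch: write a hypothesis file <TimeStamp> <pan> <tilt> <roll>,
# restricted to timestamps where camera 1 has a defined bounding box.
# estimate_pose() is a hypothetical placeholder for the actual system.

def write_hypotheses(cam1_boxes, estimate_pose, out_path):
    with open(out_path, "w") as out:
        for timestamp in sorted(cam1_boxes):
            if cam1_boxes[timestamp] is None:
                continue  # no bounding box defined in camera 1
            pose = estimate_pose(timestamp)  # (pan, tilt, roll) or None
            if pose is None:
                pan = tilt = roll = -1  # no estimate for this frame
            else:
                pan, tilt, roll = pose
            out.write(f"{timestamp} {pan} {tilt} {roll}\n")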

1.1.4 Metrics

It has been agreed to evaluate with the following metrics:

1.1.4.1 Definition of metrics

Number of metrics: 2

List of metrics:
  • Mean Pan / Tilt / Roll
  • Angular Distance

Metric name: Mean Pan / Tilt / Roll [°]

Description: For all frames i (1..n) where head bounding boxes are defined in the label files, the average error between hypothesis and ground truth is computed separately for pan, tilt and roll.
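A plausible formulation of this mean error for pan (and analogously for tilt and roll), assuming the per-frame error is the absolute difference between the hypothesized and the ground truth angle, is:

\bar{e}_{\mathrm{pan}} = \frac{1}{n} \sum_{i=1}^{n} \left| \widehat{\mathrm{pan}}_i - \mathrm{pan}_i \right|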

For frames i where no hypothesis is given, the largest error is to be used (a difference of 360°).
Metric name: Angular Distance

Description: For all frames i (1..n) where head bounding boxes are defined in the label files, the mean angular difference between the estimated line of sight v1 and the annotated line of sight v2 is computed.
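Per frame, this angular difference is the angle between the two direction vectors:

\theta_i = \arccos\!\left( \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \right)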

The line of sight is built from the three estimated components pan, tilt and roll and needs to be given with respect to the general room coordinate system.
For frames where no hypothesis is given for the person’s head orientation, the largest possible error is to be used (that is, a single angular distance of 180° in these frames).
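As an illustration, the following Python/NumPy sketch computes both metrics over the scored frames. It assumes the line of sight is the head’s forward axis under the rotation order given in Section 0.2 (so it depends only on pan and tilt) and that the per-angle error is a plain absolute difference; both assumptions go beyond what is specified above.

import numpy as np

# Sketch: compute the two metrics over scored frames. Angles in degrees;
# a missing hypothesis is marked as None.

def line_of_sight(pan_deg, tilt_deg):
    # Forward head axis after pan (about z) and tilt (about rotated y);
    # roll leaves the forward axis unchanged under this assumption.
    pan, tilt = np.radians([pan_deg, tilt_deg])
    return np.array([np.cos(pan) * np.cos(tilt),
                     np.sin(pan) * np.cos(tilt),
                     -np.sin(tilt)])

def mean_pan_tilt_roll_error(hypos, truths):
    errs = [(360.0, 360.0, 360.0) if h is None            # largest error for missing frames
            else tuple(abs(a - b) for a, b in zip(h, t))
            for h, t in zip(hypos, truths)]
    return np.mean(errs, axis=0)  # (mean pan, mean tilt, mean roll) error

def mean_angular_distance(hypos, truths):
    dists = []
    for h, t in zip(hypos, truths):
        if h is None:
            dists.append(180.0)  # largest possible angular distance
            continue
        v1, v2 = line_of_sight(h[0], h[1]), line_of_sight(t[0], t[1])
        cosang = np.clip(np.dot(v1, v2), -1.0, 1.0)
        dists.append(np.degrees(np.arccos(cosang)))
    return float(np.mean(dists))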

2 Submission protocol

2.1 Result submission

2.1.1 File naming convention

An experiment is identified by the following name:

EXP-ID := <SITE>_<TASK>_<DATA>_<SYSTEM>

Where:

<SITE> = <Abbreviation for the institution making the submission>

<TASK> = HEADPOSE

<DATA> = EVAL07

<SYSTEM> = PRIMARY | CONTRAST-XXX | CONTRAST-YYY
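For example, a primary submission from the leading partner site would be identified as follows (site abbreviation shown purely for illustration):

UKA-ISL_HEADPOSE_EVAL07_PRIMARY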

2.1.2 Structure of the archive

One result file has to be submitted for each video in the evaluation dataset (5 in total).

For each experiment, the submitted archive should contain the following structure:

<EXP-ID>/<EXP-ID>.TXT

<EXP-ID>/<RESULT_FILE_1>.<TASK>

<EXP-ID>/<RESULT_FILE_2>.<TASK>

.

.

.

<EXP-ID>/<RESULT_FILE_N>.<TASK>

Each result file <RESULT_FILE_N> is a UNIX text file with the name of the task as extension.

The archive should be a tgz, tar or a zip file.
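For illustration, a primary submission from a hypothetical site ABC, with hypothetical result file names, could thus be packaged as:

ABC_HEADPOSE_EVAL07_PRIMARY/ABC_HEADPOSE_EVAL07_PRIMARY.TXT
ABC_HEADPOSE_EVAL07_PRIMARY/SEMINAR_1.HEADPOSE
...
ABC_HEADPOSE_EVAL07_PRIMARY/SEMINAR_5.HEADPOSE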

2.1.3 System description

For each experiment, a one-page system description <EXP-ID>.TXT must be provided, describing the data used, the approaches (algorithms), the configuration, the processing time, etc.

2.1.4 Submission procedure

The system results must be uploaded to the following FTP server:

ftp://141.3.25.39/clear

Username and password: anonymous (anonymous FTP login).

The deadline for submissions is:

• March 28th, 23h59 GMT-5, for Head Pose Estimation
