Sensing, Analyzing, Interpreting, and Responding Musically
to Expressive Gesture
by
Teresa Marrin
Thesis Proposal for the degree of
Doctor of Philosophy
at the
Massachusetts Institute of Technology
January 1999
Thesis Advisor______Rosalind Picard
Associate Professor of Media Technology
NEC Career Development Professor
of Computers and Communications
Thesis Advisor______Tod Machover
Associate Professor of Music and Media
Media Arts and Sciences
Thesis Reader______David Wessel
Professor
Center for New Music and Audio Technologies
University of California at Berkeley
Thesis Reader______

Abstract
This thesis will address the problem of sensing, analyzing, interpreting,
and accompanying expressive gestures in real-time with a flexible and
responsive musical environment. It presents a unique wearable device for
measuring such gestures, analyses of data collected from several orchestral
conductors, and interpretation of the primary features found in the data.
The thesis will culminate with a comprehensive series of "Etudes" (studies)
written for the instrument, which map salient features of the gestures to
variations on musical structures in real-time. This collection of "Etudes"
will form a toolkit of expressive utilities which can be combined in
various ways to create dramatic and effective live performances.
Introduction
During the past twenty years there have been tremendous innovations in the
development of interfaces and methods for performing live music with
computers. However, many of these (with a few notable exceptions) have not
been adopted by performing musicians. In my opinion, this is due to three
basic problems: first, these interfaces do not sample their input data fast
enough or with enough degrees of freedom to match the speed and complexity
of human movement; second, the nature of the sensing environment constrains
the range and style of movement that the performer can make; and third, the
software that maps the inputs to musical outputs is not powerful enough to
respond appropriately to the structure, quality, and character of the
gestures. For these reasons, very few
musicians have used computer technology to replace or enhance the
capabilities of their violins, guitars, and conducting batons.
However, there is a very strong case to be made for the need for new
instruments; music literacy rates are declining, as are attendance at and
support for our "classical" music forms. Just as the mass production of the
piano did wonders for music in Western culture during the nineteenth
century, it is possible that a new means of creating music might attract a
large segment of the technologically savvy populace of the twenty-first
century. It could also change and update our notion of art music; while
novel instruments could captivate the imaginations of the vast masses of
amateurs, they might also inspire performing artists to develop new forms
for the stage which preserve a place for "music for its own sake" in our
culture. These new instruments could leverage available technologies to do
more than just play notes; they could generate and vary complex patterns,
perform higher-level functions like conducting, and even produce graphics,
lighting, and special effects. But in order for the
performing arts community to consider these possibilities, it must have
instruments which are at the very least as expressive as the traditional,
mechanical ones have been.
Another reason to improve the capabilities of our new instruments is to
keep open the possibility for the continuance of our tradition of live
performing arts; in this age of easily-available recorded music it is rare
to experience music which is unique to the moment in which you hear it.
Live performances reflect the state of the artists, the instruments, the
audience, and the venue, all of which interact in complex ways and
sometimes produce surprising and exciting results. From the performer’s
perspective, the thing that makes live performances most powerfully
expressive (aside from technical accuracy) is the range of interpretive
variation in the music. Techniques for creating this variation involve
subtle control over aspects such as timing, volume, timbre, accents, and
articulation -- sometimes implemented on many levels simultaneously.
Musicians intentionally apply these techniques in the form of time-varying
modulations on the structures in the music in order to express feelings and
dramatic ideas -- some of which are pre-rehearsed and some of which change
based on their own moods and whims.
In order for skilled performers to intentionally modulate these musical
lines in real-time, they require musical instruments that are not only
sensitive to subtle variations in input but can be used to control multiple
modulation streams in real-time; these instruments must also be repeatable
and deterministic in their output. What often happens with the replacement
of traditional mechanical instruments with sensor-based interfaces is that
the many dimensions in the input stream are reduced in the transduction
process, and are effectively projected down to minimal axes of control.
While the reduction of dimensionality in an input stream is often good for
automatic recognition and other engineering tasks, it is not always ideal
for music, where subtlety in the ‘microstructural variation’ provides much
of the interesting content for the performer and the listener. The other
major problem with most sensor-based systems is a significant disconnect
between the way a gesture looks and the way the music sounds; this makes it
difficult for an audience to understand or respond to the performance. This
problem must be solved before sensor-based instruments can be widely
adopted.
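To make the cost of this projection concrete, the short Python sketch below
(used purely for illustration; the eight-channel stream is synthetic, not
jacket data) measures how much of a multichannel gesture signal's variance
survives when it is projected down to two axes of control -- the discarded
remainder is precisely the kind of microstructural variation that matters
musically.

```python
# Illustrative sketch only: how much of a multichannel gesture stream survives
# a projection down to a few "axes of control"?  (Hypothetical data and channel
# count; not part of the Conductor's Jacket software.)
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 3000, 8          # e.g. eight sensor lines sampled over time
latent = rng.standard_normal((n_samples, 3))
mixing = rng.standard_normal((3, n_channels))
stream = latent @ mixing + 0.3 * rng.standard_normal((n_samples, n_channels))

# Principal component analysis via the singular value decomposition.
centered = stream - stream.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
variance = singular_values ** 2 / (n_samples - 1)
explained = variance / variance.sum()

kept = explained[:2].sum()
print(f"Variance captured by a 2-axis projection: {kept:.1%}")
print(f"Variance (microstructure) discarded:      {1 - kept:.1%}")
```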

Proposed Approach
This thesis will present novel methods for sensing, analyzing,
interpreting, and accompanying expressive gestures in real-time with
flexible and responsive music. To this end, I designed and built a wearable
system of sensors which accurately measure such gestures, and used this
system to gather a great deal of data from orchestral conductors in real
situations. I have also embarked upon the final phases of this work, in
which I have analyzed and interpreted the data and started to build a
real-time music performance system. This system will recognize individual
features in a set of gestures and employ numerous software mappings,
organized into "Etudes" (studies), which will form a toolkit of expressive
utilities that can be combined in various ways to create dramatic and
effective performances.
In the process, I will attempt to overcome the problems of brittle,
unnatural, overly-constrained and unsatisfying mappings between gesture and
sound which are frequently encountered by performers of technology-mediated
music. I think that the enormous engineering challenges faced in designing
robust real-time systems have dissuaded many from going the extra distance
to build truly responsive and adaptive systems. I am nevertheless convinced
that such systems are possible; there are some very promising techniques
from pattern recognition which, if made to run in real-time, could prove to
be very powerful. With my training in both engineering and musical
performance, I believe I have the skill-set to bridge the gap between these
two fields and thereby improve upon the state of the art.
This thesis will consist of four interwoven project phases, to be followed
by the dissertation and defense. The first phase, which is nearly complete,
consisted of building several versions of the interface and sensor hardware
and using them to collect data from numerous subjects. The second phase,
which has begun, is to analyze the data for significant features and find
useful filtering, segmentation, and recognition algorithms for exposing the
underlying structural detail in those features. The third phase will focus
on interpreting the results of the analysis phase and making decisions
about which features are most salient and meaningful. The fourth phase will
be the development of numerous "Etudes" which elucidate the set of features
by responding to each one individually with music that reflects its
structure and character.
My approach to this problem is unique for several reasons, most notably
because I have gone to great lengths to construct a careful, quantitative
study of how conductors express musical ideas through gesture.
Even though my final goal is not to create a system for recognizing
conducting gestures per se, I chose to do this because I suspected that
gesture-based instruments might be more appropriate for higher-level
musical control than they would be for triggering or initiating individual
notes. Another way in which my approach differs from others is in the
wearable nature of the interface and its integration into clothing. This is
because I thought that a new instrument should have a relationship to (and
a dependence on) the form of the human body; it should in some sense feel
natural to play. I wanted to create an instrument which is gestural in the
sense that it encourages the player to move kinesthetically and to leverage
the musical affordances of the body -- I wanted an interface which would
allow the physical frame to move.
This thesis is well-placed at the Media Lab because it draws upon knowledge
from several disciplines within music and engineering that are strongly
represented in the faculty here. This work will require me to exercise
skills and knowledge in software design, analog and digital circuit design,
digital signal processing, pattern recognition, feature detection, gesture
recognition, musical composition, instrumental performance, sound design,
and emotion theory. It is also driven by a strong practical application and
has implications for work in other disciplines. If I succeed with the
feature-based recognition of gestures and can apply them to music in such a
way that the music reflects the perceived character of the gesture, this
project will contribute significantly not only to the state of the art in
the computer music community, but also to many real-time gesture-based
systems. Although it may resemble in some ways the work that I proposed to
do in my master's thesis, this is a completely new and larger project; it
is, in many ways, my answer to the obstacles I encountered when attempting
to complete that work.
Expected Results
The results of this thesis project will be presented in the form of a
wireless jacket interface, a large-scale analysis paper (comprising at
least one chapter in the dissertation), a set of interpretive decisions
about the most meaningful features, and a large collection of Etudes which
can be performed individually or in combination with each other. The work
will be completed in four phases, which will overlap with each other --
that is, I'll do a mini-analysis project followed by a synthesis project
with the same feature, make an etude to demonstrate it, and move on. The
details of each phase are explained below.
The focus throughout this project will be to discover which features in
different gestures are significant or meaningful, and then to find ways to
synthesize music that reflects that meaning.
Phase I – Sensing and Data Collection
The first phase, which is now nearly complete, consists of hardware
development, sensing, and data collection. It began with investigations
into sensors and data acquisition methods, after which I developed a robust
architecture for data collection and several versions of a wearable device
for recording physiological and motion data. Then I ran a series of studies
in which I used the device to gather data from a range of active musicians,
including six professional and student conductors.
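For illustration only, the sketch below shows one way a single frame of such
physiological and motion data might be represented in software; the channel
names (beyond the EMG and respiration signals discussed later) and the
sample rate are hypothetical, and the actual acquisition ran under LabVIEW
rather than Python.

```python
# Purely illustrative sketch of one frame of jacket data as a Python record.
# Channel names and the 330 Hz sample rate are assumptions for illustration;
# the actual system acquired data under LabVIEW with its own channel layout.
from dataclasses import dataclass

SAMPLE_RATE_HZ = 330  # assumed acquisition rate, for illustration only

@dataclass
class JacketSample:
    t: float                # seconds since the start of the recording
    emg_right_bicep: float  # muscle tension (EMG), discussed in the analysis phase
    emg_left_bicep: float
    respiration: float      # chest expansion, also discussed in the analysis phase
    heart_rate: float       # further physiological channels are hypothetical here
    skin_conductance: float

def frame_at(index: int, channels: tuple) -> JacketSample:
    """Package one multiplexed reading from the sensor bus into a record."""
    return JacketSample(index / SAMPLE_RATE_HZ, *channels)

# Example: the 100th reading from the (hypothetical) sensor bus.
sample = frame_at(100, (0.12, 0.08, 0.55, 72.0, 3.1))
print(sample)
```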
What has been completed?
- Working jacket hardware in multiple copies: three versions of the jacket,
  two sets of sensors, and two data acquisition boards on two machines
  running the LabVIEW development environment
- A large study of six conductor subjects (three students and three
  professionals) in real rehearsal and performance situations, including a
  live performance of the Boston Pops Orchestra in Symphony Hall
- Collection of numerous useful data files from that study
- A paper co-authored with Professor Picard in the Proceedings of the
  Colloquium for Musical Informatics, entitled "The 'Conductor's Jacket': A
  Device for Recording Expressive Musical Gestures"
- The design of a wireless version of the jacket, incorporating a radio
  link with a desktop PC
- The design of a unique bus architecture to distribute power, ground, and
  signal lines throughout the body
What is ongoing:
- The wireless version of the jacket, which is nearly complete; this has
  involved extensive research into radio frequencies and power consumption,
  the building of an eight-line analog radio transmitter, and the
  integration of that transmitter with the wearable network of sensors
- Final costume design and fitting of a jacket for me to wear; in the
  meantime I am using an earlier version of the jacket
Things which might be good to do if time permits:
- Collect data from my performances with the jacket to compare with
  traditional conductor data
- Integrate speakers and computation directly into the jacket, so that the
  whole unit is self-contained
Phase II – Signal Analysis
This second phase, which has begun, consists of processing and analyzing
the data that was collected from the conductors. There are essentially four
ways to do this: visualization, filtering, segmentation, and automatic
recognition. The visualization, which is nearly completed for two of the
subjects, has involved watching the video and the data simultaneously and
picking out the significant features by eye. The filtering will consist of
running a series of algorithms on the data to make it more usable and
expose its underlying structure; a list of the proposed filters is given in
Appendix A. Segmentation will consist of using algorithms to pick out the
places where the data is richest, such as conducting vs. non-conducting,
informative vs. non-informative gestures, and beginnings and endings of
pieces. The automatic recognition task will involve the implementation of a
real-time feature detection system using either cluster-weighted models,
hidden Markov models, or a hierarchical mixture of experts.
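As a purely illustrative sketch of the segmentation step described above,
the Python code below labels windows of a single (synthetic) EMG channel as
conducting or non-conducting by thresholding their short-term energy; the
window length, threshold, and signal are assumptions, not the algorithm I
will necessarily use.

```python
# Illustrative segmentation sketch: label each window of an EMG channel as
# "conducting" or "non-conducting" by thresholding its short-term energy.
# Window size, threshold, and the synthetic signal are assumptions for
# illustration; the thesis will evaluate real segmentation algorithms.
import numpy as np

def segment_by_energy(signal, fs, window_s=0.5, threshold=0.05):
    """Return (start_time, end_time, label) triples for consecutive windows."""
    hop = int(window_s * fs)
    segments = []
    for start in range(0, len(signal) - hop + 1, hop):
        window = signal[start:start + hop]
        energy = float(np.mean(window ** 2))
        label = "conducting" if energy > threshold else "non-conducting"
        segments.append((start / fs, (start + hop) / fs, label))
    return segments

if __name__ == "__main__":
    fs = 250  # assumed sample rate in Hz
    t = np.arange(0, 20, 1 / fs)
    # Synthetic EMG-like signal: active bursts between 5 s and 15 s.
    activity = ((t > 5) & (t < 15)).astype(float)
    emg = 0.4 * activity * np.random.default_rng(1).standard_normal(t.size)
    for start, end, label in segment_by_energy(emg, fs):
        print(f"{start:5.1f}-{end:5.1f} s  {label}")
```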
Progress in this phase will be driven by the need for detection algorithms
for the etudes; I plan to pick one feature at a time, do an analysis of it,
and then go on to write an etude for it. The final deliverables from this
phase will be a chapter in the dissertation in which I focus on a few data
segments and compare the success of different methods along various axes.
Appendix A contains a list of the comparisons that I think will be most
interesting.
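To illustrate what one of the candidate recognition methods could look like
in code, the sketch below trains one small hidden Markov model per gesture
class (using the third-party hmmlearn package) and classifies an unseen
segment by maximum likelihood. The gesture classes, features, and synthetic
data are all assumptions for illustration; the thesis will compare this
family of methods against cluster-weighted models and a hierarchical
mixture of experts.

```python
# Illustrative recognition sketch: classify a gesture segment by which of
# several per-class hidden Markov models assigns it the highest likelihood.
# Uses the third-party hmmlearn package and synthetic training data; the
# feature set, model sizes, and gesture classes are assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)

def synth(mean, n=40):
    """Synthetic two-feature observation sequence for one gesture example."""
    return mean + 0.2 * rng.standard_normal((n, 2))

# Train one small HMM per (hypothetical) gesture class on example sequences.
classes = {"beat": 0.0, "cutoff": 1.0}
models = {}
for name, mean in classes.items():
    examples = [synth(mean) for _ in range(5)]
    X = np.concatenate(examples)
    lengths = [len(e) for e in examples]
    m = GaussianHMM(n_components=3, covariance_type="diag",
                    n_iter=25, random_state=0)
    m.fit(X, lengths)
    models[name] = m

# Classify an unseen segment by maximum log-likelihood.
test = synth(1.0)
scores = {name: m.score(test) for name, m in models.items()}
print(max(scores, key=scores.get), scores)
```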
What has been done:
- Formatting and printing of all conductor data
- Visualization analysis of all conductor data
- A paper co-authored with Professor Picard in the Proceedings of the 1998
  International Computer Music Conference: "Analysis of Affective Musical
  Expression with the Conductor's Jacket"
- Particular segments chosen for focus
- Cross-comparisons made between subjects, within subjects, within data
  segments, across styles of music, etc.
- Compilation of forty points on musical features found in the data
- Compilation of twelve points on formatting, timing, and filtering of the
  data
- A filter built in Matlab and successfully used to remove a major source
  of noise from one subject’s data without compromising its integrity (a
  sketch of one possible such filter appears after this list)
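For illustration, the Python/SciPy sketch below shows one plausible form
such a noise-removal filter could take: a notch centered on an assumed
60 Hz interference component, applied forward and backward so that the
timing of the data is preserved. The sample rate, noise frequency, and
synthetic signal are assumptions; this is not the actual Matlab filter.

```python
# Illustrative noise-removal sketch: notch out an assumed 60 Hz interference
# component from a physiological channel without shifting the signal in time.
# Sample rate, noise frequency, and data are assumptions for illustration.
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 250.0        # assumed sample rate (Hz)
noise_hz = 60.0   # assumed powerline interference frequency
quality = 30.0    # notch sharpness

# Synthetic "EMG" trace: slow activity plus 60 Hz hum.
t = np.arange(0, 10, 1 / fs)
clean = 0.5 * np.sin(2 * np.pi * 1.5 * t)
noisy = clean + 0.3 * np.sin(2 * np.pi * noise_hz * t)

# Design the notch and apply it with zero-phase (forward-backward) filtering,
# which preserves the timing relationships that matter for gesture analysis.
b, a = iirnotch(noise_hz, quality, fs=fs)
filtered = filtfilt(b, a, noisy)

residual = np.mean((filtered - clean) ** 2) / np.mean((noisy - clean) ** 2)
print(f"fraction of interference power remaining: {residual:.3f}")
```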
What remains to be done:
- Cross-correlation between certain signals, particularly respiration and
  bicep EMG, within subjects (a sketch of this computation follows this
  list)
- Construction of an automatic recognition project on at least one segment
- Detection of the forty features I have uncovered, using methods from DSP
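As a purely illustrative sketch of the cross-correlation mentioned in the
first item above, the code below estimates the lag at which a (synthetic)
bicep EMG envelope is most strongly correlated with a (synthetic)
respiration signal; the sample rate and signals are assumptions, not real
jacket data.

```python
# Illustrative cross-correlation sketch: at what lag is the bicep EMG envelope
# most strongly correlated with respiration?  Synthetic signals and an assumed
# sample rate stand in for real jacket data.
import numpy as np

fs = 50.0  # assumed sample rate (Hz) after downsampling / envelope extraction
t = np.arange(0, 60, 1 / fs)

respiration = np.sin(2 * np.pi * 0.25 * t)          # about 15 breaths per minute
emg_envelope = np.roll(respiration, int(0.8 * fs))  # EMG lags the breath by 0.8 s
emg_envelope += 0.2 * np.random.default_rng(3).standard_normal(t.size)

def best_lag(x, y, fs):
    """Lag (seconds) of y relative to x with the highest normalized correlation."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    corr = np.correlate(y, x, mode="full") / len(x)
    lags = np.arange(-len(x) + 1, len(x))
    return lags[np.argmax(corr)] / fs, corr.max()

lag_s, peak = best_lag(respiration, emg_envelope, fs)
print(f"EMG envelope lags respiration by {lag_s:.2f} s (peak correlation {peak:.2f})")
```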
It might be good to:
- Call a session with experts from music and signal processing to look at
  my data and hypothesize methods for real-time analysis and processing
Phase III – Interpretation
This phase, while not necessarily requiring much time or rigor, may be the
crucial piece of all this work. During this phase I will make my choices
about what is most significant in the high-level detail that I have teased
out from the data in the analysis phase. This will be my opportunity to
decide what the features "mean," what importance they have, and which ones
are most useful and practical to implement. More explicitly, I hope to use
this phase to identify the most significant time-varying features in the
set of gestures that I’ve analyzed. The results may just be a series of
personal hunches and value judgements, but they will be based on a huge
amount of experience working with the data. This phase will be ongoing
during the completion of phases I and II.
As part of this work, it might be good to:
- Interview conductors
- Talk to those who have experience using or composing for electronic
  instruments, and find out which aspects of expression they wish they
  could control and which expressive features they wish worked better
- Revisit the original analysis and build a new feature-detection system to
  recognize some of the more interesting features in the signals
Phase IV – Synthesis
This phase is the culmination of the whole project; it will be the
synthesis of all the work leading up to it. It will consist of the
development of a toolkit for translating gesture to sound, written as a
series of Etudes for the jacket. I intend for the primitives to be written
on a low-enough level that they are useful for creating a broad range of
musical styles; it will not be a toolkit for orchestral conducting, but
rather for almost any gesture. The test of whether or not I am successful
will be the effectiveness of the Etudes in reflecting the character and
quality of each gesture.
The Etudes have been planned so that they fall into three different
categories, in order of complexity: Technical, Gestural, and
Emotional/Dramatic. The Technical etudes will be straightforward mappings
between sensor values and musical structures, while the Gestural Etudes
will use more sophisticated algorithms to recognize complete gestures. The
Emotional etudes have not been defined yet but several possibilities have
been considered. All three are discussed below.
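To make the idea of a Technical etude concrete, here is a minimal Python
sketch of one such straightforward mapping: a smoothed sensor value controls
loudness and tempo. The smoothing constant, value ranges, and choice of
musical parameters are assumptions for illustration; the actual Etudes will
each define their own mappings.

```python
# Minimal sketch of a "Technical" etude-style mapping: a smoothed sensor value
# directly controls loudness and tempo.  Ranges, smoothing constant, and the
# choice of parameters are assumptions for illustration only.
class SensorToMusicMap:
    def __init__(self, smoothing=0.9, tempo_range=(60.0, 160.0)):
        self.smoothing = smoothing          # simple one-pole low-pass on the input
        self.tempo_min, self.tempo_max = tempo_range
        self.level = 0.0                    # smoothed sensor value in [0, 1]

    def update(self, raw_value):
        """Feed one normalized sensor reading; return (midi_velocity, tempo_bpm)."""
        raw_value = min(max(raw_value, 0.0), 1.0)
        self.level = self.smoothing * self.level + (1.0 - self.smoothing) * raw_value
        velocity = int(round(self.level * 127))
        tempo = self.tempo_min + self.level * (self.tempo_max - self.tempo_min)
        return velocity, tempo

if __name__ == "__main__":
    mapping = SensorToMusicMap()
    # A crescendo-like rise in (hypothetical) bicep EMG tension.
    for reading in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]:
        print(mapping.update(reading))
```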