Use of Prototypical Expressions Inevitably Obscures Individual Differences

Facial Expressions and Young Children’s Ability to Decode Their Meaning

Katharine Bailey,

Christine Merrell and Peter Tymms

CEM Centre,

Durham University.

Paper presented at the British Educational Research Association Annual Conference, University of Warwick, 6-9 September 2006

Abstract

This paper describes the development of an assessment of children’s ability to decode facial expressions. The assessment is computer-delivered and initially intended for use with young children aged 4 to 6 years. The analysis of early trials is encouraging although further refinement of the scale is required.

Contact:

Introduction

As humans we are highly skilled in decoding facial expressions. This skill suggests that the process has an evolutionary function, which would be explained by our need to identify emotional states in others in order to function successfully in a complex social environment. Expressions that in our evolutionary past must surely have been fairly basic now serve a higher level of social functioning: they allow us to convey and interpret honest emotional states, but they can also be controlled for social manipulation. Through the display and detection of facial expressions, a basic level of communication can be achieved between two individuals even if they do not speak a common language. Infants are able to differentiate between positive and negative facial expressions from around 10 to 12 months of age, by which time they are also displaying social-referencing behaviour. The early onset of this ability suggests that it has a genetic basis (Ekman and Oster, 1979).

Since Darwin’s ‘The Expression of the Emotions in Man and Animals’ (1872), there has been growing interest in the role of emotion within cognition and, more specifically, in the ability to convey and detect emotion in others. The most prevalent research approach revolves around the identification of a set of basic emotions. The number and range of emotions investigated vary (see Power and Dalgleish, 1997, for a comparison) but tend to converge on a set of six basic emotions: fear, happiness, sadness, disgust, anger and surprise (Ekman and Friesen, 1975; Izard, 1971, 1994). Evidence for basic emotions comes from cross-cultural studies (see Ekman, 1999, for a review of this work) and from work with blind children (Medicus et al., 1994).

The ability to detect emotions is fundamental to a wider area termed ‘emotional intelligence’. The exact definition of this term is still a matter for debate, it being a relatively new concept. A widely accepted current definition is:

The capacity to accurately perceive emotions.
The capacity to use emotions to facilitate thinking.
The capacity to understand emotional meanings.
The capacity to manage emotions. (Salovey and Mayer, 1990)

From an educational perspective, communication, both verbal and non-verbal, is an important area of development for young children. QCA curriculum guidance for the Foundation Stage (QCA, 2006) emphasises the importance of successful emotional development in giving pupils the best opportunity for success in life, including academic attainment. Although this is a reasonable position, it is not clear why it should be the case. Is it because we use the same cognitive mechanisms to process emotional material as we do to learn? Or is it because emotional information used in social-referencing behaviour works as an independent process, allowing us to interact more successfully with both peers and adults and so facilitating the learning process?

If emotional intelligence is related to wider success, including performance in school, and a test can be shown to identify individual differences in children’s ability to detect emotional expressions, this will have implications for their education. With this in mind, the CEM Centre, Durham University, has recently started to develop a computer-delivered assessment known as PIPS-Faces. PIPS-Faces aims to assess the ability of young children, aged 4, 5 and 6 years, to decode emotions conveyed through facial expressions. This paper reports the first stages in an attempt to create an objective, computer-delivered assessment of facial expression decoding, and examines the extent to which it is a reliable and useful way of identifying individual differences in that ability. The analysis in this paper is exploratory; the eventual aim is to facilitate the early identification of children at risk of later negative outcomes.

Measures

PIPS-Faces is a computer-delivered assessment which was administered in April and May 2006. Children aged 4 – 6 years (Reception and Year 1) were presented with an on-screen cartoon portraying a character in particular situations such as hammering his/her hand instead of a nail. The face of the character is missing and children are asked to choose a face of appropriate expression from a choice of four options. The style of the faces is taken from the work of Herman Chernoff (Pickover, 2001). Some example items are shown below:

Twenty cartoons were drawn, each depicting a situation that would elicit one of the six basic emotions: happiness, sadness, anger, fear, surprise and disgust. Faces were omitted from the cartoons. Alongside each cartoon, four faces were shown, each displaying one of the six expressions. One of the faces had the expression most suitable for the emotion depicted in the cartoon. The other three expressions were taken from the remaining five emotional expressions. In order to discriminate ability further, two versions of each facial expression were used, one more subtle than the ‘standard’ expression. Each item appeared in four forms: a standard target among standard distracters, a standard target among subtle distracters, a subtle target among standard distracters and a subtle target among subtle distracters. The items were presented in a randomised order that was the same for all participants.
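The crossing of target and distracter subtlety described above can be sketched in Python. This is a minimal illustration only, not the authors' item-generation procedure: the function name `item_variants` and the cartoon label are hypothetical.

```python
from itertools import product

# The two versions of each expression named in the text.
LEVELS = ("standard", "subtle")

def item_variants(cartoon_id):
    """Return the four target/distracter combinations for one cartoon."""
    return [
        {"cartoon": cartoon_id, "target": target, "distracters": distracters}
        for target, distracters in product(LEVELS, LEVELS)
    ]

# "hammer_on_hand" is a made-up label for illustration.
for variant in item_variants("hammer_on_hand"):
    print(variant)
```

Each cartoon yields four forms, one for every pairing of target version with distracter version.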

A small-scale pilot of 68 items was conducted with 23 children aged 4 and 5 years. Analysis of these data looked promising, and a further 10 new items were added to make 78.

The table below shows the number of items assessing each emotion.

Emotion / Number of Items
Happy / 12
Sad / 19
Angry / 16
Fear / 13
Disgust / 9
Surprise / 9
Total / 78

Although the pilot test had good item reliability (Cronbach’s alpha = 0.88), some of the items had poor discrimination. This might have been a consequence of the small sample, so more items of the affected emotion types were added for the next, larger trial. This presented the problem of the assessment being too long for young children to complete in a single session. Therefore, in order to reduce the assessment period to approximately 10 – 15 minutes while trialling the additional new items, all children were presented with an anchor test comprising 28 items weighted in the following way: 2 items portraying happy scenarios, 9 sad, 6 angry, 3 fear, 4 disgust and 4 surprise. Each child was then randomly presented with one of four sub-sets, each comprising 10 items and including at least one item for each of the six emotions. In that way it was possible to collect data on a large number of items yet keep each child’s assessment time reasonably short.
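The anchor-plus-sub-set design described above can be sketched as follows. The sizes (28 anchor items, four sub-sets of 10) come from the text; the item labels, function names and the fixed random seed are assumptions made purely for illustration and do not reflect how the PIPS-Faces software assigns items.

```python
import random

ANCHOR_SIZE = 28   # items seen by every child (from the text)
N_SUBSETS = 4      # number of sub-sets (from the text)
SUBSET_SIZE = 10   # items per sub-set (from the text)

def build_forms(anchor_items, subsets):
    """Combine the common anchor items with each sub-set: one form per sub-set."""
    return [list(anchor_items) + list(subset) for subset in subsets]

def assign_form(forms, rng):
    """Pick one form at random for a child."""
    return rng.choice(forms)

# Hypothetical item labels, for illustration only.
anchor = [f"A{i}" for i in range(1, ANCHOR_SIZE + 1)]
subsets = [
    [f"S{k}_{i}" for i in range(1, SUBSET_SIZE + 1)]
    for k in range(1, N_SUBSETS + 1)
]
forms = build_forms(anchor, subsets)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
child_form = assign_form(forms, rng)
print(len(child_form))  # 38 items per child
```

Because every form shares the same anchor items, responses to the different sub-sets can be placed on a common scale.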

Sample

The sample for this paper comprised 61 children from 4 different schools (28 children in Year 1 and 33 children in the Reception year). Details of the sample are outlined in the table below:

School / Reception / Year 1
1 / 27 (12 boys, 15 girls) / 14 (6 boys, 8 girls)
2 / 4 (1 boy, 3 girls) / 4 (3 boys, 1 girl)
3 / 2 (1 boy, 1 girl) / 9 (5 boys, 4 girls)
4 / 0 / 1 (1 girl)

The children were selected by their class teachers to represent a range of academic ability and their parents had consented for them to participate in the trial of the new PIPS-Faces assessment.

In order to see whether the sample of children were typical for their age in terms of ability and achievement, three different assessments were compared against nationally representative norms. The three assessments were the PIPS On-entry Baseline and Follow-up, conducted at the start and end of the Reception year respectively, and the PIPS End of Year 1 Assessment (for more information about the PIPS assessments, see The scores of the children in this sample are reported in the table below. PIPS scores are standardised on nationally representative samples of children completing the same assessment at the same time of year. The mean score of the national sample is 50 and the standard deviation is 10.

Assessment / Study Sample Mean / Standard Deviation
PIPS On-entry Baseline Total score / 56.0 / 9.9
PIPS On-entry Baseline Follow-up Total score / 66.5 / 7.1
PIPS End of Year 1 Mathematics / 53.2 / 8.8
PIPS End of Year 1 Reading / 57.1 / 10.4
PIPS End of Year 1 English vocabulary / 55.5 / 9.8
PIPS End of Year 1 Non-verbal Ability / 52.7 / 7.9

The mean total score of the children in the Reception group for this study was higher than the national average at the start of the year. During the Reception year those particular children moved further ahead compared with the national sample and their mean score was one and a half standard deviations higher than the national average.
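Because PIPS scores are standardised to a national mean of 50 and a standard deviation of 10, the distance of the sample means from the national average can be expressed in standard deviation units. The short sketch below uses the two Reception means from the table; the function name is hypothetical.

```python
NATIONAL_MEAN = 50.0  # from the text
NATIONAL_SD = 10.0    # from the text

def sd_units_from_national_mean(score):
    """Distance of a standardised score from the national mean, in SD units."""
    return (score - NATIONAL_MEAN) / NATIONAL_SD

print(sd_units_from_national_mean(56.0))  # On-entry Baseline: 0.6
print(sd_units_from_national_mean(66.5))  # Follow-up: 1.65
```

The follow-up value of about 1.65 corresponds to the roughly one and a half standard deviations reported above.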

The Year 1 children also scored above the national average, although only slightly so for mathematics and non-verbal ability.

Although the sample in this study was small, it is still important to consider its position in relation to the national population. The children scored above average but were not an extremely unusual group.

Results and Analysis

Rasch measurement was used to estimate the relative difficulties of the items for the entire PIPS-Faces assessment (anchor test and sub-tests). (For an explanation of Rasch measurement, see Bond and Fox, 2001.) The Logit interval scale enables direct comparisons between items and children to be made. WINSTEPS was used for the Rasch analysis (Linacre, 2005). Children from both year groups were analysed together as this would enable a comparison between age and test performance to be made.
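For readers unfamiliar with the logit scale, the dichotomous Rasch model gives the probability of a correct response as a function of the difference between a child's ability and an item's difficulty. The sketch below shows that standard formula only; it is not a reproduction of the WINSTEPS estimation procedure, and the function name is hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A child whose ability equals the item difficulty has an even chance of success.
print(rasch_probability(0.0, 0.0))  # 0.5
```

A child one logit above an item's difficulty has a higher probability of success, and a child one logit below has a correspondingly lower one.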

Initial analysis showed that 20 of the 78 items were either misfitting or poorly correlated with the rest of the scale. Misfitting items measure something other than the main factor assessed in the test and lower the reliability of the scale, so they were removed from the subsequent analyses. The remaining items in the refined scale were 9 items portraying a scene that would be associated with happiness, 14 sad, 13 angry, 9 fear, 6 disgust and 7 surprise. Three of the excluded items were from the anchor test that was presented to all children and the other excluded items were from the four sub-tests. The internal reliabilities of this refined scale of 58 items were 0.78 (item) and 0.56 (person). Both figures are lower than is needed for a high quality assessment, so the scale was refined further by removing more items that correlated poorly with the rest of the scale. This left a total of 43 items, 20 of which were in the anchor test presented to all children. The internal reliabilities changed to 0.72 (item) and 0.68 (person): the item reliability was slightly lower than before but the person reliability had improved. The table below shows details of the remaining items.

Emotion / Anchor test: Number of Items / Sub-tests: Number of Items
Happy / 2 / 6
Sad / 5 / 4
Angry / 5 / 6
Fear / 3 / 3
Disgust / 3 / 2
Surprise / 2 / 2
Total / 20 / 23
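The idea of screening items by how well they correlate with the rest of the scale can be illustrated with a simple item-rest correlation. This is a sketch under stated assumptions, not the statistic reported by WINSTEPS: it assumes complete 0/1 response data (whereas the trial used sub-sets, so not every child saw every item), and the function names and the tiny response matrix are made up for illustration.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5

def item_rest_correlations(responses):
    """Correlate each item with the total score on all the other items.

    responses: one row per child, each row a list of 0/1 item scores.
    """
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        correlations.append(pearson(item, rest))
    return correlations

# Made-up responses for four children on three items.
example = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
print(item_rest_correlations(example))
```

Items with a low or negative value would be candidates for review or removal.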

The percentage of correct answers for each item in the refined scale is shown in the appendix.

The figure below is produced in WINSTEPS and shows the relative difficulty of the items and the relative abilities of the pupils. The anchor items have been highlighted. These were completed by all pupils and are therefore less susceptible to the effect of small sample size. The items are displayed on the right side of the scale and the distribution of pupils’ abilities on the left. The higher the Logit value, the more difficult the item or the more able the pupil. The letter ‘M’ denotes the mean, ‘S’ is one standard deviation from the mean and ‘T’ is two standard deviations. On the figure below, the mean pupil score is the same as the mean item difficulty, indicating that the assessment was of appropriate difficulty for the sample of children.

The emotions overlap in difficulty, and no emotion appeared to be consistently easier to identify than the others.

PERSONS MAP OF ITEMS

  3         +
            |  Q77Angry
  2      X  +T
            |  Q5Happy
         X  |
            |  Q71Angry
         X  |
            |  Q31Angry
      XXXX  |
         X  |  Q2Angry Q8Angry
  1         +S Q17Sad
      XXXX S|
         X  |  Q20Disgust Q3Disgust
        XX  |  Q1Sad
        XX  |  Q54Disgust
        XX  |  Q60Disgust Q78Surprise Q9Disgust
      XXXX  |  Q12Surprise Q14Angry Q19Angry Q25Angry
        XX  |  Q34Sad
      XXXX  |
        XX  |  Q49Fear Q68Sad Q73Sad Q74Angry
  0     XX M+M Q53Angry Q72Sad
       XXX  |  Q15Fear Q30Sad
       XXX  |  Q18Sad
     XXXXX  |
        XX  |
         X  |  Q65Angry
         X  |
         X  |  Q27Fear Q33Happy Q66Fear Q67Happy
         X  |  Q10Fear Q50Happy
       XXX S|
 -1    XXX  +S
         X  |  Q6Surprise Q64Sad
            |  Q22Happy
            |  Q40Surprise
         X  |  Q4Fear
        XX  |
         X T|
            |  Q28Happy
            |  Q62Happy
 -2         +T
         X  |
            |  Q45Happy
 -3         +

Comparisons Between Groups

WINSTEPS generates a Rasch measurement score for each child, which gave a measure of ability on the PIPS-Faces test. The children in Year 1 had significantly higher scores than the children in Reception by 0.55 logits (p≤0.01), meaning that they were more able to identify emotions within the context of the PIPS-Faces assessment. The girls had higher scores than the boys by 0.08 logits but that difference was not significant. There was no interaction between year-group and sex.
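One way to read a difference in logits is as a ratio of odds: under the Rasch model, a difference of d logits in ability multiplies the odds of answering any given item correctly by exp(d). The sketch below applies that conversion to the two reported differences; it is an interpretive aid, not part of the original analysis, and the function name is hypothetical.

```python
import math

def odds_ratio_from_logits(logit_difference):
    """Convert a difference in logits into a multiplicative change in odds."""
    return math.exp(logit_difference)

print(round(odds_ratio_from_logits(0.55), 2))  # Year 1 vs Reception: 1.73
print(round(odds_ratio_from_logits(0.08), 2))  # girls vs boys: 1.08
```

On this reading, the Year 1 advantage corresponds to odds of success roughly 1.7 times those of the Reception children on any given item.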

Discussion

The intention was to develop an objective, computer-delivered assessment to identify individual differences in pupils’ ability to decode facial expressions. Following trials in four schools, the assessment was refined to give a scale of 43 items with an item reliability of 0.72 and a person reliability of 0.68 (derived from Rasch measurement).

The Rasch measurement item map indicated that no single emotion appeared to be easier to decode than the others, which would be expected in the light of work pointing to a set of six basic emotions. Factor analysis also suggested the presence of a single factor for the whole scale rather than a separate factor for each emotion. It is interesting that, after removing poorly correlating items, only five and four items remained for disgust and surprise respectively. The other disgust and surprise items performed poorly, for reasons that are not yet clear.

The results of this initial study are promising for the development of a discriminating assessment. The mean ability of the pupils matched the mean difficulty of the test items and the distribution of item difficulties suggests that the test is covering the range of ability. Added to the moderately strong item and person reliability figures, the modified scale has provided a foundation on which to build a reliable assessment.

These initial trials were limited to 62 pupils, which casts some uncertainty on the findings; further trials are needed in order to gather more reliable data.

In the development of any assessment there is a tension between needing many items to obtain meaningful data and keeping the assessment short. The assessment is intended, at least initially, for children aged 4 to 6, which limits the number of items that can be presented. The initial trials consisting of 68 items proved very difficult to administer as the children lost interest. Even with the number of items reduced to 38, children’s attention began to wander towards the end of the test. A solution to this would be to produce a computer-adaptive test, which would enable us to deliver the minimum number of items needed for a reliable measure on each child. To do this, more items will need to be developed.
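The core step of a computer-adaptive test can be sketched as choosing, at each point, the unused item that is most informative at the child's current ability estimate. Under the Rasch model this is the item whose difficulty lies closest to that estimate. The sketch below is a generic illustration of that selection rule, not a design for PIPS-Faces; the item bank, item names and function names are hypothetical.

```python
import math

def item_information(ability, difficulty):
    """Fisher information of a Rasch item at a given ability: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

def next_item(ability, difficulties, administered):
    """Return the name of the most informative item not yet administered."""
    candidates = [name for name in difficulties if name not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda name: item_information(ability, difficulties[name]))

# Hypothetical item bank: item name -> difficulty in logits.
bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}

print(next_item(0.3, bank, administered=set()))       # medium
print(next_item(0.3, bank, administered={"medium"}))  # hard
```

A full adaptive test would also re-estimate ability after each response and apply a stopping rule, both of which are omitted here.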

Future development, then, will require an examination of the items shown to be misfitting, or not correlating, with the rest of the assessment. These items will be changed and further items will be added. Ideally, an adaptive assessment requires hundreds of items covering the full range of ability, and exactly how many items need to be presented to each pupil to get a reliable measure (0.9 or higher) is a matter for consideration. Adding new items within the framework used to date would involve new scenarios being drawn but is limited to the six basic emotions. The use of prototypical expressions inevitably obscures individual differences so, although we used two versions of each expression, standard and subtle, new items could be added with different extremes of each expression to cover the full range of ability. This could also be achieved by distorting the image, reducing the size of the image or increasing/limiting the speed of presentation.
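As a rough guide to the scale of the task, the Spearman-Brown prophecy formula estimates how much longer a fixed-length test would need to be to reach a target reliability. Applying it here is an assumption: it treats the person reliability of 0.68 on 43 items as a classical reliability, assumes new items behave like the existing ones, and does not model the efficiency gains of adaptive delivery. The function name is hypothetical.

```python
import math

def lengthening_factor(current_reliability, target_reliability):
    """Spearman-Brown factor by which test length must be multiplied."""
    return (target_reliability * (1.0 - current_reliability)) / (
        current_reliability * (1.0 - target_reliability)
    )

factor = lengthening_factor(0.68, 0.90)  # figures taken from the text
print(round(factor, 2))        # 4.24
print(math.ceil(43 * factor))  # 183 items of comparable quality
```

Under these assumptions a fixed test would need roughly four times as many items, which is far too long for this age group and illustrates why an adaptive approach with a large item bank is attractive.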

Further validation of the test should be carried out and one way to do this is to administer, to the same pupils, another test measuring emotional intelligence. Another way would be to observe the children’s behaviour, in particular how they relate to, and communicate with, their peers and adults in a range of situations.

In conclusion, initial findings suggest that an assessment such as PIPS-Faces is able to detect individual differences in ability to decode facial expressions. But more items are needed to produce a reliable test that can discriminate well with a minimum of items. A new version would need a substantially larger trial and a measure of concurrent validity.

References

Bond, T. G. & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the social sciences. New Jersey: Lawrence Erlbaum Associates.

Darwin, C. (1998, first published 1872) The Expression of the Emotions in Man and Animals (3rd edn), London, Harper Collins.

Ekman, P. (1973) ‘Universal facial expressions in emotion’, Studia Psychologica vol.15, pp.443-55.

Ekman, P. and Friesen, W. V. (1975) Unmasking the face. Englewood Cliffs, NJ: Prentice Hall.

Ekman, P. and Oster, H. (1979) ‘Facial expressions of emotion’, Annual Review of Psychology, vol.30, pp.527-54.