Independent review to determine the quality of the Diamond dataset

Authors: Neil Smith
Date: 10th August 2017
Prepared for: Creative Diversity Network

At NatCen Social Research we believe that social research has the power to make life better. By really understanding the complexity of people’s lives and what they think about the issues that affect them, we give the public a powerful and influential role in shaping decisions and services that can make a difference to everyone. And as an independent, not for profit organisation we’re able to put all our time and energy into delivering social research that works for society.

NatCen Social Research
35 Northampton Square
London EC1V 0AX
T 020 7250 1866
www.natcen.ac.uk
A Company Limited by Guarantee
Registered in England No.4392418.
A Charity registered in England and Wales (1091768) and Scotland (SC038454)
This project was carried out in compliance with ISO20252

1  Introduction

Diamond has been developed by the Creative Diversity Network (CDN) as a mechanism for collecting and reporting diversity data. Diamond enables production companies to facilitate the collection of consistent diversity data from programme contributors (actual contributions), and to monitor diversity as portrayed on-screen (perceived contributions).

The over-arching aim of the Diamond project is to provide industry-wide data to answer two key questions:

• Who is on UK TV?

• Who makes UK TV?

This will allow us to see whether the industry represents the UK both on- and off-screen.

Although there have been a number of monitoring reports and employment surveys of the creative industries in previous years, Diamond is a ground-breaking project due to its innovative cross-industry approach, in which competing broadcasters collect and publish diversity data together. Given the new methodologies and technologies deployed in the collection of this data, the CDN has commissioned NatCen Social Research to review the strengths and weaknesses of the data collected since August 2016, in advance of publication of the first annual report due in August 2017.

This advice is intended to give the CDN an overview of the reliability of the Diamond data, evaluate the analysis and interpretation of the emerging trends in the data, and provide an assessment of the data security and privacy risks associated with the collection and publication of information on personal and protected characteristics of the individuals represented within the Diamond dataset.

2  Aims

As agreed with the Creative Diversity Network, the aims of this review were to:

·  Ascertain the strengths and weaknesses of the dataset, and advise CDN as to the likely quality and accuracy of the data being collected

·  Compare CDN with other known (publicly available) data sets of a similar nature in terms of sample size, expected return rates, uniqueness, and likely accuracy.

·  Provide high level data privacy/data protection advice, identifying privacy risks associated with CDN fixed reports extraction and encryption sharing processes and the intention to publish data from them, and recommend action mitigating against them.

·  Review proposed publication with a view to ensuring that individuals are not identifiable from the data we want to publish and that there are no other data privacy issues.

·  Review the analysis of the data ensuring that the analysis follows general good practice.

·  Comment on the draft report, highlighting any risks and poor analysis, and recommend improvements with regard to the way the data is analysed and what is included in the report.

·  Recommend to CDN how this review should inform their analysis and presentation of the data, indicating how this might affect confidence in conclusions drawn from the data at this current time.

3  Review

The first draft report of 17th July contained a comprehensive set of analyses, investigating the diversity of contributions, analysed separately by on or off screen presence. The latest report, 10th August, which is due for publication contains information on the characteristics of the contributions only. Given that later reports will likely include the full set of analyses, this quality review will appraise the more detailed cut of the data, 17th July, though the findings apply equally to the report of 10th August.

3.1  Strengths and weaknesses of the Diamond dataset

3.1.1  Contributions may be confounded by contributors

The Diamond data comprises 5904 individual contributors making a total of around 80,804 contributions to the industry in the previous year. The relationship between the contributor and contributions is a considerable strength, and potential weakness, of the data depending on how the data is presented.

For instance, not all contributors are equal in the number of their respective contributions – one individual with a given set of personal characteristics may contribute many times more than another person with a different personal profile. The data’s strength is that it is well powered to detect whether the contribution rate varies across a range of diversity profiles, and therefore potentially identify inequitable patterns of employment within the industry. In this instance it can determine whether a particular set of personal characteristics are under- or over-represented in terms of their contribution to the industry.

However, the potential drawback of the inconsistent relationship between contributor and contribution is that reporting contributions only, as per the intended report, might hide such inequities. For example, is the high proportion of on-screen BAME contributions a consequence of a small number of individuals appearing multiple times, or is it due to a large number of individuals each making a relatively small number of contributions?

This raises the broader issue of whether we are trying to define and identify representativeness or impact. If we think of the outcome in terms of contributions/impact, the BAME profile looks very positive; however, there is the possibility that in terms of actual numbers of contributors/individuals, BAME people are under-represented.

Overall, the contributor/contribution relationship does not undermine the data specifically; however, care must be exercised during the reporting process, so that the implications of this relationship are articulated and the results are interpreted with this caveat in mind.
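
The distinction between counting contributions and counting contributors can be illustrated with a minimal sketch. The records below are entirely hypothetical (they are not Diamond data), and the function names are illustrative only; the sketch assumes each contributor belongs to a single group.

```python
# Hypothetical records, one per contribution: (contributor_id, group).
# These values are invented for illustration and are not Diamond data.
contributions = [
    ("c1", "BAME"), ("c1", "BAME"), ("c1", "BAME"), ("c1", "BAME"),
    ("c2", "White"), ("c3", "White"), ("c4", "White"), ("c5", "White"),
]

def share_by_contribution(records, group):
    """Proportion of all contributions made by members of `group`."""
    return sum(1 for _, g in records if g == group) / len(records)

def share_by_contributor(records, group):
    """Proportion of distinct contributors who belong to `group`."""
    people = dict(records)  # contributor_id -> group (assumes one group per person)
    return sum(1 for g in people.values() if g == group) / len(people)

contribution_share = share_by_contribution(contributions, "BAME")
contributor_share = share_by_contributor(contributions, "BAME")
print(f"{contribution_share:.0%} of contributions, {contributor_share:.0%} of contributors")
```

In this invented example one individual accounts for half of all contributions but only one in five contributors, which is the kind of divergence a contributions-only report could mask.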

3.1.2  Self-selection and bias

The Diamond response rate was 24.3%, which is a considerable achievement given that the 2012 Creative Skillset census, which collected data on many of the same indicators of diversity, yielded a 4% response from over 20,000 companies[1]. Furthermore, the 2014 workforce survey resulted in a 2.3% response rate. Nonetheless, the Diamond response rate is low in absolute terms and is undoubtedly subject to response bias, whereby there is a systematic underlying reason why some respondents completed their return whereas others did not. For example, it is possible that those who were comfortable with the diversity of their workforce were more likely to respond, whereas those with a homogeneous workforce were less likely to do so; the result would be an over-estimation of the level of diversity within the industry. This is a particular risk to the Diamond data as the achieved sample was not drawn from a pre-specified sample with known characteristics. Had a sampling frame been pre-specified, we would know the characteristics of the organisations that did and didn’t respond, and would be able to weight our final estimates during analysis to obtain representative estimates.

Such data limitations are commonplace, but the presence of some data is far more useful to researchers and end-users than no data at all. This limitation (self-selection bias) ought to be conveyed within the report.
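
To show what weighting against a known sampling frame would involve, the following is a minimal sketch of a post-stratified estimate. It assumes a hypothetical frame split into two strata by company size; the strata, counts and response values are invented for illustration and do not come from Diamond or any other survey.

```python
# Hypothetical sampling frame: number of companies in each stratum.
frame_counts = {"large": 100, "small": 900}

# Hypothetical responses: a diversity proportion reported by each responding company.
# Large companies are over-represented among respondents relative to the frame.
responses = {
    "large": [0.20, 0.24, 0.22, 0.18],
    "small": [0.10, 0.14],
}

def unweighted_mean(responses):
    """Simple mean over all respondents, ignoring who did not respond."""
    values = [v for stratum_values in responses.values() for v in stratum_values]
    return sum(values) / len(values)

def poststratified_mean(responses, frame_counts):
    """Mean of stratum means, each weighted by the stratum's share of the frame."""
    total = sum(frame_counts.values())
    estimate = 0.0
    for stratum, values in responses.items():
        stratum_mean = sum(values) / len(values)
        estimate += (frame_counts[stratum] / total) * stratum_mean
    return estimate

print(round(unweighted_mean(responses), 3))
print(round(poststratified_mean(responses, frame_counts), 3))
```

In this invented case the unweighted figure is higher than the weighted one because the more diverse stratum responded at a higher rate. Without a frame, as with Diamond, no such correction is available, which is why the limitation needs to be stated.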

3.2  Comparability to previous data

Comparisons to previous data are essential in order to validate the findings of the Diamond data. There is a considerable risk (outlined above) that this data is inherently biased and unrepresentative of the creative industry workforce due to the self-selection of participants into the Diamond database. Cross-referencing the diversity profiles of the Diamond data with the profiles of the 2012 creative industries census, or the data available in the 2014 workforce survey[2] provides an insight into how much confidence we can place in the representativeness of the Diamond data. Only once it has been demonstrated that the Diamond data is representative of the industry can further comparisons to the national profile of diversity be made. Showing that Diamond is broadly in line with previous data demonstrates that it is likely to be a reliable estimate of the workforce characteristics.

Fine-level data from the 2012 creative industry census does not appear to be publicly available. However, the 2014 data is available and has been compared with Diamond data (taken from the draft report) as follows.

Table 3.1: Comparison of the prevalence of diversity indicators between the Diamond dataset and the 2014 workforce survey

Indicator / Diamond[3] % (all channels, all times, off- and on-screen combined) / 2014 Workforce Survey %
Gender (% Female) / 52.0 / 49.0
Transgender identity / 1.0 / 1.0
Age >50 years / 18.5 / 16.0
Ethnic origin (% BAME) / 16.1 / 5.4[4]
Sexual orientation (% LGB) / 11.4 / 7.0
Disability / 5.6 / 5.0

There are three main differences. First, Diamond is capturing a considerably higher proportion of BAME contributions to the creative industries. Given that the 2012 census figure of 5.4% is particularly low, and considerably different to the profile of the working age population, it is likely that Diamond is providing a more accurate estimate of the BAME population.

Second, Diamond is reporting a higher prevalence (11.4%) of LGB than the 2014 survey. This prevalence is nearly double that of the national estimate of 6.4% (for England)[5]. Diamond response rates for this item show that sexual orientation was not widely withheld by contributors, meaning there is unlikely to be a social desirability bias or other non-reporting effect at play. An alternative explanation is that LGB contributors are systematically more willing to contribute to the Diamond data collection; it is difficult to assess at this point how likely this is. The final explanation is that LGB are over-represented within the data as a consequence of a high proportion of LGB individuals in the workplace. Furthermore, the higher proportion in Diamond compared to the workforce survey may be due to the improved data collection process leading to the higher response rate for this item.

Lastly, the age profile of Diamond is older than the workforce survey, though the Diamond estimate lies closer to patterns observed in the general working population[6].

Overall, the Diamond data is broadly comparable to the 2014 survey and the 2012 census. Although each of these datasets is similarly compromised by non-response, the general comparability between the three does provide some assurance that they are capturing the demographic profiles of the workplace to the same extent, increasing our confidence that the Diamond estimates are reliable.

A further strength is that Diamond intends to be an on-going data collection exercise. This means that even if baseline proportions are not true representations of the population of the broadcasters, estimates of relative change year-on-year will be reliable, especially if no major methodological changes take place in data collection.

3.3  Data security and privacy

The data privacy risk assessment undertaken by Lewis Silkin (LS) is comprehensive; NatCen is grateful to CDN for supplying this documentation and consequently avoiding a considerable duplication of effort.

We have a number of points to add to the existing advice.

The introduction of the GDPR in May 2018 will result in large-scale changes to the ways that data are collected and stored. There is a strong likelihood that some of the data collection procedures, especially concerning consent, will need revision. However, be aware that the GDPR is still being consulted on and, as previously advised, there is no certainty at this point as to how far this will affect future data collection procedures.

The guidance notes refer to anonymisation, but there is little detail on what this means and how it is done. Whilst potential participants only need to know the basic details, it would be helpful to know further information on how individuals are anonymised, where the original records are stored, and who has access to both records.

As per the example described by LS, it should not be possible to specify a set of characteristics and return a cell count of fewer than ten individuals. This is the current threshold applied by the ONS to the analysis of secure data. The suppression of low cell counts (<10) could be automated; alternatively, those individuals running analyses ought to receive training in statistical disclosure control to understand the various techniques for suppressing disclosive data[7]. CDN has confirmed that thresholds are applied across all tables. However, analysts ought to follow additional procedures to prevent “differencing” between two reports. Disclosure by “differencing” occurs when comparison of two or more tables reveals information that is not available from any single table.
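
The two controls described above, threshold suppression and a check for differencing, can be sketched as follows. This is an illustration under stated assumptions, not a description of CDN's actual process: the function names, the table layout (a mapping from cell label to count) and all counts are hypothetical, and the differencing check covers only the simple case of two tables sharing the same cell labels.

```python
THRESHOLD = 10  # minimum publishable cell count, as described in the text

def suppress_small_cells(table, threshold=THRESHOLD):
    """Replace any count below `threshold` with None (shown as suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in table.items()}

def differencing_risks(table_a, table_b, threshold=THRESHOLD):
    """Cells where subtracting two published counts gives a small non-zero count."""
    risks = {}
    for cell in table_a.keys() & table_b.keys():
        a, b = table_a[cell], table_b[cell]
        if a is None or b is None:
            continue  # a suppressed cell cannot be differenced directly
        diff = abs(a - b)
        if 0 < diff < threshold:
            risks[cell] = diff
    return risks

# Hypothetical counts (not Diamond data): the second table is a subset of the first.
all_genres = {"female": 120, "male": 95, "other": 4}
excluding_drama = {"female": 117, "male": 60, "other": 4}

published_a = suppress_small_cells(all_genres)
published_b = suppress_small_cells(excluding_drama)
print(published_a)
print(differencing_risks(published_a, published_b))
```

Here each table passes the threshold rule on its own, yet subtracting the "female" cells reveals a count of three for the excluded genre. A production check would also need to handle zero differences, marginal totals and tables with differing breakdowns.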

There is no discussion of the possibility of sharing the data with approved third parties for further research, or whether the data can be linked to other administrative datasets. Data sharing is a matter for the CDN to decide and to put in place the relevant protocols to prevent the identification of individuals. These protocols ought to be independently reviewed before data sharing takes place. For both the sharing and linkage of data, the contributors will need to consent to the sharing and linkage of their data when they submit their personal details.

3.4  Analysis and interpretation

We have provided detailed comments on the draft report as of 17th July 2017 and 10th August 2017. Most of the comments are practical and signpost to where additional details are required in the text, or where the presentation of results could be made simpler. A minority of comments raise broader questions about the nature of the research and what its aims are. These will be discussed here in more detail.

Though the 2017 report will not present results by peak time and any time, or by on and off screen, further reports ought to exercise caution when comparing differences between these groups of people. Where possible, statistical tests should be carried out to ensure that differences between two groups are unlikely to be due to chance and that they do in fact represent real and significant differences between the two groups.
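
One common test for a difference between two groups is a two-proportion z-test, sketched below using only the Python standard library. The counts are hypothetical and the function name is illustrative. The test assumes independent observations; because Diamond contributions are clustered within contributors, a simple test of this kind applied to contributions is likely to understate the uncertainty, so it should be read as a starting point rather than a recommended method.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 180 of 1,000 in one group versus 150 of 1,000 in another.
z, p = two_proportion_z_test(180, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone, under the independence assumption noted above.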

A further research question compares the demographic characteristics of the workforce of the creative industries with the national population. Care should be taken when comparing the age profile of Diamond contributors to national estimates. On screen contributors are likely to cover the entire age range represented by national estimates, whereas the off screen age profile will more closely resemble the national working age profile. The CDN have acknowledged these differences and intend to account for job roles during comparative analysis.