The Office for National Statistics – Big Data Project

Keywords: big data, pilot projects, official statistics

1.Introduction

The amount of data that is generally available is growing exponentially, and data are being made available faster than ever. The variety of data available for analysis has also increased: data now arrive in many formats, including audio, video, computer logs, purchase transactions, sensor readings and social networking sites, as well as through traditional modes. These changes have led to the big data phenomenon: large, often unstructured datasets that are potentially available in real time.

Like many other National Statistics Institutes (NSIs), the Office for National Statistics (ONS) in the UK recognises the importance of understanding the impact that big data may have on our statistical processes and outputs. A 15-month Big Data Project, due to complete at the end of March 2015, has been established to investigate the benefits and challenges of using big data and associated technologies within official statistics.

This paper provides an overview of the approach taken to establish and undertake the project as well as emerging findings and proposed next steps.

2.Methods

The high level aims of the ONS Big Data Project are to:

-investigate the potential advantages that big data provides for official statistics, understand the challenges of using these sources, and establish an ONS policy on big data and a longer-term strategy that incorporates ONS’s position within Government and internationally in this field; and

-make recommendations on the best way to support the ONS strategy on big data beyond the life of this project.

The approach adopted to work towards these aims was to establish a small cross-disciplinary team (around 9 individuals) of statisticians, methodologists and IT experts. The project was arranged into four workstreams, as described below:

  • Management and Strategy - an overarching workstream that brings together the outputs/outcomes/knowledge from the other workstreams to develop policy/strategy as well as providing necessary project support.
  • Stakeholder Engagement - the activities within this workstream aim to acquire data/tools/technologies, learn from others’ experience, develop knowledge and skills, coordinate efforts, develop partnerships or understand concerns around the use of big data within official statistics.
  • Analysis and Infrastructure - to understand and demonstrate the potential for using big data within official statistics through four specific case studies. A key tool for this work is the ONS Innovation Lab, a 'sandpit' environment where alternative tools, technologies and methods can be explored.
  • Communication - to share what we have learnt more widely.

3.Results

A summary of the progress made within the ONS Big Data Project under each of the four workstreams is provided below.

3.1.Management and Strategy

A key deliverable from this workstream has been an ONS Big Data Policy. The main challenge around the use of big data within government is to maximise benefits to the public while protecting the privacy of individuals. This policy combines a set of principles designed to deal with the new aspects of the use of big data (such as legal and ethical issues associated with commercial data and data sourced from the web) with long-standing principles, to create a comprehensive policy with supporting guidance.

In addition, the emerging findings from the project so far will be developed into a formal ONS Big Data Strategy.

3.2.Stakeholder Engagement

Five groups of key external stakeholders have been identified for the project. These are listed below with a summary of key activities to date.
- International - The ONS Big Data team have contributed to the UNECE international collaboration project focused on big data and to a European Statistical System (ESS) taskforce on big data and official statistics. Both projects provide opportunities to move forward thinking on cross-cutting issues in a collaborative way.
- Academia - Throughout the project we have engaged with overarching bodies within academia (such as the Royal Statistical Society and the Economic and Social Research Council) as well as individual institutions. The aims are to work collaboratively with, or commission, those with big data expertise to undertake research related to official statistics; to enhance our understanding of the skills required to work within data science; and potentially to attract graduates to the ONS.
- Private Sector - We have engaged with a number of private sector companies to purchase or acquire their data for use within our pilot projects. In addition, we have met with ‘big data’ companies to share our experiences and tap into their expertise.
- Government - The ONS Big Data team are contributing to cross-Government initiatives in the UK aimed at bringing together big data expertise across departments and professions. In addition, we have liaised directly with colleagues in departments that have specific expertise or access to data in support of our pilot work.
- Privacy groups - Throughout the project we have liaised with a number of privacy groups, who were positive about this engagement and provided advice on handling, communications and policy in this area, and on future directions for the pilots.

3.3.Analysis and Infrastructure

3.3.1.Prices pilot

Web scrapers are software tools for extracting data from web pages. The Consumer Price Index (CPI) and the Retail Price Index (RPI) are key economic indicators produced by ONS. Web scraping could allow ONS to collect prices for some goods and services automatically rather than by physically visiting stores. Prototype web scrapers have been developed for three online supermarket chains and are automatically collecting prices for selected basket items each day. A high level review of the methodological implications of using these data for price statistics has been undertaken. The pilot team has also had useful discussions with MySupermarket.com (a price comparison website) about purchasing daily price quote data for analysis within the pilot.
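As an illustration of the kind of extraction a price scraper performs, the following is a minimal sketch using only the Python standard library. It is not the ONS prototype: the HTML snippet, the class names `name` and `price`, and the function `scrape_prices` are all hypothetical, and a real scraper would also need to fetch pages, respect each site's terms of use and cope with changing page layouts.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup; real supermarket pages will differ.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">White bread 800g</span><span class="price">&pound;1.10</span></li>
  <li class="product"><span class="name">Semi-skimmed milk 2L</span><span class="price">&pound;0.89</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect (name, price) pairs from <span class="name"> / <span class="price">."""

    def __init__(self):
        super().__init__()
        self._field = None   # field held by the <span> currently open, if any
        self._current = {}   # partially built price quote
        self.quotes = []     # completed (name, price_in_pounds) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the spans of interest
        self._current[self._field] = data.strip()
        self._field = None
        if "name" in self._current and "price" in self._current:
            # Strip the pound sign (U+00A3) before converting to a number.
            price = float(self._current["price"].lstrip("\u00a3"))
            self.quotes.append((self._current["name"], price))
            self._current = {}


def scrape_prices(html):
    """Return a list of (item name, price in pounds) found in the HTML text."""
    parser = PriceParser()
    parser.feed(html)
    return parser.quotes


quotes = scrape_prices(SAMPLE_HTML)
```

Run daily against the same basket items, output of this shape would build up a time series of price quotes per item, which is the raw material a price index calculation would start from.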

3.3.2.Twitter pilot

Twitter provides open source tools for accessing tweets as well as an option for users to identify their current location. This means that ‘tweets’ from a subset of users can be tied to specific locations over time. This data can then be used to track mobility patterns. The primary aim of this research is to determine whether this geo-located data from Twitter can provide fresh insights into internal migration within England and Wales.

A fully tested harvesting application that can collect, store and format the required data has been deployed, and a total of 38.9 million geo-located tweets were collected between 10 April and 30 June. One of the main challenges of this pilot is how to make sense of the large amount of data being collected. Good progress has been made in developing a method that groups these data into clusters and classifies each as either a valid cluster or noise points. Where a cluster is valid, the method also assesses whether it indicates a location of some significance, such as a person’s home or their place of work or study.
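The paper does not name the clustering method used. One common way to separate dense groups of points from isolated noise points is density-based clustering, and the sketch below implements a small DBSCAN-style routine in plain Python purely as an illustration. The coordinates, the 0.2 km radius and the minimum of three points are assumptions chosen for the example, not values from the pilot.

```python
import math

NOISE = -1  # label given to points that belong to no cluster


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps_km of point i, including i itself.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


# Invented tweet locations for one user: two dense groups and one stray point.
TWEETS = [
    (51.5000, -0.1200), (51.5003, -0.1202), (51.4998, -0.1199), (51.5001, -0.1203),
    (51.5200, -0.0900), (51.5202, -0.0898), (51.5199, -0.0901),
    (53.4800, -2.2400),
]
labels = dbscan(TWEETS, eps_km=0.2, min_pts=3)
```

Deciding whether a valid cluster is a home or a place of work or study would need further information, such as the times of day at which the tweets were sent. That step is not attempted here.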

3.3.3.Smartmeter Pilot

A smart meter is an electronic device that records and stores consumption of electricity, gas or water at frequent intervals. UK policy aims to put electricity and gas smart meters in every home in England by 2020[1]. Smart meter electricity usage data is attractive to statistical organisations because it allows investigation at low levels of geography, with high timeliness and almost complete coverage of homes.

The focus for this research is to assess whether these data could identify unoccupied households. This might have the potential to create efficiencies within a census or survey operation. Samples of data from pilot exercises have been taken, and analysis has been undertaken to understand the data and to help identify methods of analysing it.
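The paper does not describe the analysis methods under consideration. As a hedged illustration of one simple possibility, the sketch below flags a household as possibly unoccupied when its total electricity use stays below a threshold on every day of the observation window. The half-hourly reading interval, the 1.5 kWh threshold, the function names and the sample readings are all assumptions made for the example.

```python
def daily_totals(readings, per_day=48):
    """Sum interval readings (kWh) into whole-day totals.

    Assumes `per_day` readings per day (48 for half-hourly data); a trailing
    partial day is dropped.
    """
    whole_days = len(readings) // per_day
    return [sum(readings[d * per_day:(d + 1) * per_day])
            for d in range(whole_days)]


def possibly_unoccupied(readings, max_daily_kwh=1.5, per_day=48):
    """True if every whole day in the window uses at most `max_daily_kwh`.

    The threshold is a hypothetical value standing in for background load
    (for example a fridge left running); it is not taken from the pilot.
    """
    totals = daily_totals(readings, per_day)
    return bool(totals) and all(t <= max_daily_kwh for t in totals)


# Invented week-long series: a steady household and a near-idle one.
OCCUPIED = [0.2] * 48 * 7   # about 9.6 kWh per day
EMPTY = [0.02] * 48 * 7     # about 0.96 kWh per day
```

A rule of this kind would need validating against known occupancy, since low use can also reflect a holiday, a very efficient home or a faulty meter.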

3.3.4.Mobile Phone Pilot

The ONS is interested in location data generated through mobile phone usage to inform on population flows, for example the number of people who travel from area A to area B. Due to the ethical and privacy concerns around accessing this data, we presented our proposals to a number of privacy groups. Although wary of the acquisition of individual level data, the groups were supportive of the use of aggregated data. We have therefore held meetings with the large mobile network operators to brief them on the research and enquire about their interest in providing data for research. A procurement exercise will follow to acquire aggregated data for comparison with 2011 Census travel to work flows.
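To make the planned comparison concrete, the sketch below shows one simple way aggregated origin-destination counts from two sources could be compared: convert each to shares of total flow and take the difference for each area pair. This is an assumption about how such a comparison might be set up, not a description of the ONS approach, and the area codes and counts are invented.

```python
def flow_shares(flows):
    """Convert {(origin, destination): count} into shares of the total flow."""
    total = sum(flows.values())
    return {od: n / total for od, n in flows.items()}


def share_differences(mobile, census):
    """Mobile share minus census share for every origin-destination pair.

    Pairs missing from one source are treated as having a share of zero.
    """
    m, c = flow_shares(mobile), flow_shares(census)
    return {od: m.get(od, 0.0) - c.get(od, 0.0)
            for od in sorted(set(m) | set(c))}


# Invented aggregated counts for three area pairs (not real data).
CENSUS = {("A", "B"): 600, ("A", "C"): 300, ("B", "C"): 100}
MOBILE = {("A", "B"): 1000, ("A", "C"): 800, ("B", "C"): 200}

diffs = share_differences(MOBILE, CENSUS)
```

Working with shares rather than raw counts means the two sources do not need to cover the same number of people, which matters because a single operator sees only its own subscribers.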

3.4.Communication

Throughout the project the ONS Big Data team have undertaken a number of activities to communicate our plans, progress and results. Presentations have been given at conferences. A discussion group has been established for informal communications and an external webpage has been developed and progress reports posted on a quarterly basis. In external communications the aim has been for transparency around plans and progress whilst emphasising that any research will be in line with data protection/security procedures and that our research will consider ethical issues.

4.Conclusions

The key conclusion drawn from this initial investigation is that there are real, tangible benefits in the use of big data and associated technologies within official statistics, for example to create efficiencies and improve quality, produce new or complementary outputs, improve operational processes and respond to challenges.

We have also started to demonstrate that perceived challenges can be overcome. There are technical challenges around the use of big data, but the Innovation Lab is providing an environment where we can test open source software and new tools and technologies. A key statistical challenge around the use of most big data sources is bias, so traditional statistical methods must not be forgotten in the ‘big data hype’. NSIs therefore have a critical role to play in supporting the use of big data and associated technologies across Government.

There are legal and ethical challenges associated with the use of big data within official statistics, but we have engaged with privacy groups, are planning research into public attitudes to these issues and are proposing the establishment of an Ethical Committee to manage these risks.

Many big data sources are produced by commercial organisations, which raises new challenges associated with partnerships, procurement and branding. We have started to consider these issues through our pilot work.

Capability is a key challenge for all organisations embarking on the use of big data and associated technologies. We have developed skills in this area by creating a cross-disciplinary team, investing in training as well as learning ‘on the job’, creating time, space and environments that encourage innovation, and working collaboratively with big data experts across Government, from NSIs and from other sectors.

What has become clear through this initial phase of the ONS Big Data Project is that long-term investment is needed to continue this work, realise the benefits and overcome the challenges. The case is currently being made to secure funding for the ONS Big Data Project to continue and expand.


[1] Wales and Northern Ireland have similar policies.