Next Generation Data Management Architecture in Statistical Organisation

Paul Grooten, Matjaž Jug, Robbert Renssen (Statistics Netherlands)

Keywords: enterprise architecture, data management, data virtualization, metadata management, semantic technology.

1.  Introduction

1.1.  Drivers for new data management architecture

21st century societies are faced with increasingly complex challenges, and the pace of change is faster than ever before. Policy and decision makers, researchers, entrepreneurs and the general public need high-quality information. This has put strong demand on statistical organisations to provide a more complete picture of phenomena and to do this faster. At the same time we see the rise of new (big) data sources and the rapid evolution of new methods and technologies needed to manage existing as well as new types of data in a much more effective and efficient way. The work described in this paper is part of the data management modernisation programme at Statistics Netherlands, which aims to increase the discoverability and accessibility of internally stored datasets, improve the possibilities for phenomenon-based data analysis and increase the flexibility to add, process and analyse new data sources, including big data sources.

1.2.  Current situation

Statistics Netherlands performs direct data collection and has been extensively using secondary data sources, including administrative data (government registers) as well as emerging big data (for example road sensors and web-scraped data), for regular statistical production and ad-hoc analysis. This results in a wide variety of input data sources and formats. The data architecture is underpinned by a “steady states” approach with four conceptual data stores for input, micro, macro and publishable data; however, the datasets needed for the process/analysis phase are physically stored and managed in central data stores[2] that are functionally similar to a Data Warehouse / Data Marts, and in many other data stores and files within siloed survey processing platforms and systems.

As a result, many statistical datasets are described with their own local data taxonomy and metadata vocabulary and are stored in various formats using different storage techniques, e.g. file-based storage using CSV or fixed width, relational database tables, etc. In order to make these datasets easy to access and use, Statistics Netherlands wishes to implement a new concept, named a ‘Datameer’.

1.3.  Vision

With the introduction of the ‘Datameer’ concept, a first step is taken towards a new Business Architecture in which data has a central role and the processes are positioned around the data: a data-driven architecture [1]. This approach is schematically presented in Figure 1.

The English translation of ‘Datameer’ is ‘Data Lake’; however, the Statistics Netherlands Data Lake is defined rather differently and is a much broader concept than the “Repository for storing massive amounts of unstructured data, usually within a Hadoop infrastructure” [2] by which Data Lakes are often described. It corresponds more closely to the term ‘Logical Data Warehouse’, introduced in 2011 by Mark Beyer and further defined by Gartner [3]:

“This architecture will include and even expand the enterprise data warehouse, but will add semantic data abstraction and distributed processing. It will be fed by content and data mining to document data assets in metadata. And, it will monitor its own performance and provide that information first to manual administration, but then grow toward dynamic provisioning and evaluation of performance against service level expectations. This is important. This is big. This is NOT buzz. This is real.”

Figure 1. Statistics Netherlands Data Lake Vision scheme

2.  Methods

2.1.  Approach

The SN Data Lake vision is not only about the new data architecture. It encompasses many other areas, for example business processes, data governance and skills; it amounts to a new paradigm. So how do we define, investigate and develop a new data management paradigm? Where do we start, and how do we define priorities?

We approached this problem by combining an analysis of the main use cases identified at several workshops with statistical users (demand side) with a description of the new or improved business capabilities required to fulfil these use cases (supply side). The Data Lake investigation was the first project at Statistics Netherlands in which a capability model was used to help clarify the project scope and define priority areas for further investigation and research.

Specific goals from an end-user perspective:

·  Enable more phenomenon-based output (a phenomenon is a striking event that you want to explain with data)

·  Enable more current and coherent statistics

·  Stimulate the reuse of data

·  Accelerate the statistical processes

·  Grow and stimulate access to a large number of existing and new data sources

·  Provide faster response and output to requests from external clients

·  Accelerate the design process around collecting and storing data

2.2.  Priority areas for Proof of the Concept

User stories from the first phase of the programme informed the selection of areas important for implementing the new architectural vision and delivering its goals.

The resulting priority areas were therefore the capabilities that make it possible for data, stored in a distributed way, to be transformed and integrated in a logical layer, presented as building blocks (virtual datasets) and provisioned to users and applications via a standard interface. The Proof of Concept for the Data Lake therefore focused on a multi-layered data architecture, data virtualisation technology and a semantic metadata model.

3.  Results

3.1.  Architecture

Key capabilities that are required to achieve the goals described in the Data Lake vision (see Appendix 1) have been used to design a layered scheme that defines the new Data Lake architecture and consists of the following logical layers (a minimal sketch of how the layers might interact is given after the list):

1.  Data Sources layer provides storage for the Data Lake data. This can be implemented using different types of database management systems, for example SQL, NoSQL or Hadoop, but also various file formats or access to a web service.

2.  Transformation Layer enables data transformations, for example the capability to transform data formats and the capability to create new data sources from already attached data sources. In the proposed data virtualization solution, each dataset in the Data Sources layer has a representative in the Transformation Layer (a so-called base dataset). There is no data to be found in this layer, only the definitions of virtual datasets with their associated mappings / transformations.

3.  Provisioning Layer makes the logical data structure available to users and systems using open standard interfaces and protocols. In comparison with the datasets in the Transformation Layer, redundancy in the provided data is allowed (certain objects will be included in multiple datasets, and one can find data in de-normalised form). The datasets are designed on the basis of a phenomenon (or topic). Data can be provided as a dataset or as a web service, which makes it possible to include the Data Lake in a Service Oriented Architecture (SOA).

4.  Consumer Layer consists of the tools and systems that are used for processing or analysis of the data. They communicate with the Provisioning Layer to get access to the chosen logical datasets. Data Preparation, Data Analytics and Data Visualisation are the main sub-domains of this layer, but we can also include here interfaces for automated system-to-system data access and the presentation of data in web interfaces.
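To make the separation of responsibilities between these layers more concrete, the following minimal Python sketch shows how a base dataset (Transformation Layer) and a phenomenon-oriented dataset (Provisioning Layer) could be expressed purely as definitions over a physical source, with data being read only when a consumer requests it. All names (road_sensor_counts, traffic_intensity_by_region, etc.) are hypothetical; the sketch is illustrative only and is not the PoC implementation.

```python
# Minimal sketch (not the actual PoC code): virtual datasets are definitions,
# not copies of data; rows are read from the source only on request.
import pandas as pd

class SourceDataset:
    """Data Sources layer: a physical dataset, here a CSV file."""
    def __init__(self, path):
        self.path = path
    def read(self) -> pd.DataFrame:
        return pd.read_csv(self.path)

class VirtualDataset:
    """Transformation/Provisioning layers: only a name and a transformation
    over other datasets are stored -- no data is materialised here."""
    def __init__(self, name, inputs, transform):
        self.name = name
        self.inputs = inputs          # underlying source/virtual datasets
        self.transform = transform    # function: list of DataFrames -> DataFrame
    def read(self) -> pd.DataFrame:   # evaluated lazily, per request
        return self.transform([ds.read() for ds in self.inputs])

# Data Sources layer: hypothetical road-sensor counts stored as a file.
sensors = SourceDataset("road_sensor_counts.csv")

# Transformation layer: base dataset with harmonised column names.
base_sensors = VirtualDataset(
    "base_road_sensors", [sensors],
    lambda dfs: dfs[0].rename(columns={"cnt": "vehicle_count"}))

# Provisioning layer: phenomenon-oriented dataset (traffic intensity per region).
traffic_by_region = VirtualDataset(
    "traffic_intensity_by_region", [base_sensors],
    lambda dfs: dfs[0].groupby("region", as_index=False)["vehicle_count"].sum())

# Consumer layer: an analyst or system requests the dataset; only now is data read.
# print(traffic_by_region.read())
```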

These layers are accompanied by the following cross-cutting layers:

1.  Corporate metadata management: the content structure and the semantic meaning of statistical data must be recorded. Furthermore, it must be possible to trace where data comes from and what path (for example transformations, linkage) the data has taken before reaching its current value (data lineage); a minimal illustration of such a lineage record is sketched after this list.

2.  Data governance: existing statistical datasets are already stored somewhere, and these datasets therefore do not need to be obtained again; however, we need capabilities that will enable us to manage these assets enterprise-wide.

3.  Security and Authorisation.
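As a minimal illustration of the lineage aspect of corporate metadata management, the sketch below records, in a hypothetical structure, where a dataset comes from and which transformation steps it has undergone; the dataset and system names are made up for the example and do not reflect the SN metadata model.

```python
# Minimal sketch (hypothetical structure): recording where a dataset came from
# and which transformations it has undergone (data lineage).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LineageStep:
    operation: str        # e.g. "format conversion", "linkage", "aggregation"
    inputs: list          # names of the datasets this step consumed
    performed_on: date

@dataclass
class DatasetLineage:
    dataset: str
    source_system: str
    steps: list = field(default_factory=list)

lineage = DatasetLineage(
    dataset="traffic_intensity_by_region",
    source_system="road sensor platform")
lineage.steps.append(
    LineageStep("format conversion", ["road_sensor_counts.csv"], date(2016, 11, 1)))
lineage.steps.append(
    LineageStep("aggregation by region", ["base_road_sensors"], date(2016, 11, 2)))

# The path of the data can now be traced back from the provisioned dataset
# to its original source.
for step in lineage.steps:
    print(step.operation, "<-", step.inputs)
```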

A schematic view of these layers is presented in Figure 2. For information on how the capabilities fit the proposed architecture, please see Appendix 1.

Figure 2. Data Lake Architecture

This layered architecture was first used in the design of the Unit base, which is derived from the Statistical Business Register (SBR) with the aim of providing data to interactive users and systems (more details in [1]). The project confirmed the feasibility of the new architecture, and additional PoC projects are now testing the key technologies that need to be combined to prove that the architecture can be efficiently implemented at a large scale: Metadata Management and Data Virtualisation.

3.2.  Data virtualisation

The data virtualisation PoC tested the use of this technology on existing data sources, including the central data store DSC. The big advantage of this technology is that there is no need to copy data into a new structure, and therefore no effort is needed to load and unload data during maintenance. Changes in the data model can be applied quickly, with minimal impact and a very short time-to-market. The owner of the data stays fully in control, which improves the willingness of data providers to share their data. Furthermore, the virtual datasets can be fully customised to the customer's needs (no limitations concerning storage requirements and actuality). As the virtual datasets consist only of metadata, versioning of the datasets is also very easy. During a transition period multiple versions of the same dataset can exist, giving the receiving systems more time to change their model accordingly. By positioning the Data Lake as a single data platform, future capabilities can focus on this platform; it is not necessary to implement these new capabilities on all existing (legacy) data sources.
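Because virtual datasets consist only of metadata, keeping several versions available at the same time is inexpensive. The sketch below illustrates this idea with two view definitions over the same (hypothetical) physical table; it assumes nothing about the actual virtualisation product used in the PoC.

```python
# Minimal sketch (hypothetical names): two versions of the same virtual dataset
# exist side by side as view definitions only -- no data is copied.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE business_register (be_id TEXT, sbi_code TEXT, employees INTEGER)")
con.execute("INSERT INTO business_register VALUES ('BE001', '4711', 12), ('BE002', '6201', 250)")

# Version 1: the dataset as consumers currently know it.
con.execute("""CREATE VIEW units_v1 AS
               SELECT be_id, sbi_code, employees FROM business_register""")

# Version 2: a changed model (size class instead of raw employee count); systems
# can migrate from v1 to v2 within an agreed transition period.
con.execute("""CREATE VIEW units_v2 AS
               SELECT be_id, sbi_code,
                      CASE WHEN employees < 50 THEN 'small' ELSE 'large' END AS size_class
               FROM business_register""")

print(con.execute("SELECT * FROM units_v1").fetchall())
print(con.execute("SELECT * FROM units_v2").fetchall())
```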

3.3.  Metadata management

The purpose of developing the metadata model and applying it in the PoC was to show how metadata could help users find the right data, understand its semantic characteristics and use this information in data integration, data analysis and other statistical activities.

The model is based on representing the characteristics that describe a statistical dataset, and the relationships between datasets, as a graph using formal semantics theory. A full description of the model is beyond the scope of this paper and can be found in [4].

Work on the metadata model has shown that it is possible to design such a graph for any statistical dataset and therefore to create a map of semantically linked data. The PoC also showed that this map can be stored in a graph database, and further work is underway to understand how this metadata can be captured and used in conjunction with data virtualisation.
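As a much simplified illustration of the idea (and explicitly not the model of [4]), the sketch below expresses dataset characteristics and relationships as subject-predicate-object statements that could later be loaded into a graph database; all dataset and classification names are hypothetical.

```python
# Minimal sketch (not the model of [4]): dataset characteristics and
# relationships expressed as subject-predicate-object triples.
triples = [
    ("dataset:units_v2",  "describes_population", "population:enterprises_NL"),
    ("dataset:units_v2",  "has_variable",         "variable:sbi_code"),
    ("dataset:units_v2",  "derived_from",         "register:SBR"),
    ("dataset:jobs_2016", "describes_population", "population:employees_NL"),
    ("dataset:jobs_2016", "links_to",             "dataset:units_v2"),
    ("variable:sbi_code", "defined_by",           "classification:SBI_2008"),
]

def related_to(node, graph):
    """Follow outgoing edges: which datasets, variables or classifications
    does this node refer to?"""
    return [(p, o) for s, p, o in graph if s == node]

# A user exploring the phenomenon 'employment' can discover that jobs_2016
# links to units_v2, which in turn is derived from the SBR.
print(related_to("dataset:jobs_2016", triples))
```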

4.  Conclusions

4.1.  Benefits of new architecture

The work to date has identified the following benefits of the next generation data management architecture:

·  Increased accessibility of available datasets physically stored in different locations and multiple formats, without the costly data migration and standardisation required to centralise data in a single physical data store;

·  Logical separation between the data sources layer and the data consumer layer allows flexibility, for example adding (registering) new data sources while protecting analytical and processing systems from changes in the data sources layer;

·  Decoupling of the different layers provides a mature versioning capability for every dataset, meaning a dataset can have multiple versions, enabling a controlled change management process without time pressure on dataset-related applications (a timeframe can be provided for the transition from version x to version x+1);

·  The data transformation layer can be used to define transformations and other functionality (for example confidentialisation) and make them available across all users/systems, reducing development time and fostering reuse;

·  Different tools and applications in the data provisioning layer have access to predefined (virtual) datasets, transformations and other common functions (data versioning, data quality, security, confidentiality) without the need to copy datasets and replicate these functions in each individual tool or application;

·  Faster phenomenon-based analysis and reporting, based on semantic relationships that help users find and understand relevant data faster and reduce time to market;

·  Improved query performance (especially for file-based data sources) because of the query optimisation and caching methods of the data virtualisation technology in the transformation and provisioning layers;

·  The possibility to add new types of data sources (unstructured, high volume, etc.) that require different storage, have various levels of data quality, etc.

4.2.  Challenges

While research and the implementation of PoCs and selected projects have already confirmed the feasibility and benefits of the new generation Data Lake architecture, we have also identified areas that will require additional work.

Data Governance: the new architectural approach and its underpinning technologies and models alone are not enough: with the possibility to provision virtual datasets in a simple and expedient way, we have to make sure that we apply appropriate governance mechanisms.

Security: security and confidentiality controls are often based on physical datasets and their location. With the new architecture there is the possibility to apply security at two levels: the data sources layer and the provisioning layer. In combination with technological possibilities such as audit trails and more centralised monitoring, we have the potential to increase security; however, we need to carefully investigate the best approach and change policies accordingly.

Confidentiality control: similarly, it will be possible to design virtual aggregated datasets directly based on micro data; however, using this approach in some areas, for example for external researchers, would require the implementation of confidentiality on the fly to replace manual control of outputs.
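One possible form of such on-the-fly confidentiality control is a simple minimum-frequency rule applied at the moment an aggregate is computed from micro data. The sketch below is illustrative only and does not reflect the actual disclosure control rules of Statistics Netherlands.

```python
# Minimal sketch (illustrative threshold rule, not SN's actual disclosure rules):
# aggregates computed on the fly are suppressed when too few units contribute.
import pandas as pd

micro = pd.DataFrame({
    "region":   ["North", "North", "South", "South", "South", "East"],
    "turnover": [120, 90, 300, 250, 410, 75],
})

MIN_UNITS = 3  # hypothetical minimum number of contributing units per cell

agg = micro.groupby("region").agg(units=("turnover", "size"),
                                  total_turnover=("turnover", "sum")).reset_index()

# Suppress cells that would disclose information about too few units.
agg.loc[agg["units"] < MIN_UNITS, "total_turnover"] = None

print(agg)
```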

Metadata enrichment and linking to the new semantic metadata model: this is perhaps the most difficult area. The capture and management of metadata has always been a time-consuming and not very popular task (particularly as the benefits of metadata are often accrued by data users and not by the data owners / producers who have to do most of the work). Technology offers some new possibilities to automate the extraction of technical metadata from data sources; however, we still do not have all the answers on how to efficiently and effectively define and maintain semantic and other added-value metadata.
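Technical metadata such as column names, inferred types and completeness can, for instance, be harvested automatically from file-based sources; the sketch below illustrates the idea with a hypothetical file, while the semantic enrichment still has to be supplied by people.

```python
# Minimal sketch: automatic harvesting of technical metadata from a CSV source.
# The semantic meaning of each column still has to be added by a person.
import pandas as pd

def harvest_technical_metadata(path, sample_rows=1000):
    """Read a sample of the file and record column names, inferred types and
    basic completeness figures."""
    sample = pd.read_csv(path, nrows=sample_rows)
    return [
        {
            "column": col,
            "inferred_type": str(dtype),
            "missing_share": float(sample[col].isna().mean()),
        }
        for col, dtype in sample.dtypes.items()
    ]

# Hypothetical file name; in practice this would run against registered sources.
# print(harvest_technical_metadata("road_sensor_counts.csv"))
```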

4.3.  Next steps

Work is currently progressing in several directions.

Work on a PoC that will connect all elements of the new architecture and provide the complete picture is scheduled to be finished by the end of March. This will give answers to the remaining questions and perhaps generate some additional ones.

The proposed data architecture is also being tested with new use cases, with particular emphasis on use with big data sources and externally stored data.

A business case for implementation is being developed, with the aim of better understanding the return on investment and the implementation scenarios.

References

[1]  I. Salemink, A Unit-base derived from the Statistical Business Register as springboard to a Data-lake, 25th Meeting of the Wiesbaden Group on Business Registers – International Roundtable on Business Survey Frames (2016).