Basic Requirements for the Data Analysis System of the DRG Restart Project

Table of contents

Document summary

Project goal

Project submitter

Document purpose

Product description

System assumptions

Fundamental data sources

Logical components of the system

Basic requirements for the designed solution

Possible solution examples

Use cases

Requirement specification

Appendices

Appendix A: Definitions, Acronyms and Abbreviations

Document summary

Project goal

The goal of the project is to define a comprehensive platform architecture – a solution that will allow data in the DRG Restart project to be processed, integrated, validated, analyzed and reported. The result of this project will be a set of system proposals, provided by interested vendors, describing the selected architecture, its benefits and limitations, as well as pricing and schedule offers. This system specification project should be completed by the end of 2015.

The goal of this project is not a complete design of the whole data warehouse, the particular data integration processes, data models, reports, etc., but the overall architecture: its components, modules and tools. The DWH process design itself will be implemented by the submitter on its own.

Project submitter

This project is submitted by the Institute of Health Information and Statistics of the Czech Republic (IHIS). Contact person: Milan Blaha, PhD (CIO, project management), +420 601 392 841.

Document purpose

The purpose of this document is to define the basic concepts and requirements for the data analysis and data mining system of the DRG Restart project. The detailed specification of the system will arise during the system specification project.

The audience of this document is the wide community of IT vendors developing public healthcare DWH solutions worldwide. It should be used as a base document for discussing the submitter's expectations and the vendors' possibilities, and as a specification for the vendors' solution offers.

Product description

System assumptions

The central data warehouse (DWH) will predominantly process structured administrative medical data originating from public healthcare providers and medical insurance companies. Its yearly increment will be hundreds of millions of records (approx. 300–400 million), which corresponds to tens of gigabytes of input data of the lowest granularity. Together with historical and aggregated data, we expect billions (10^9) of records and terabytes of data to be processed and analyzed.
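As a back-of-the-envelope consistency check of these figures, the sizing arithmetic can be sketched as follows. The bytes-per-record value, the retention horizon and the DWH overhead multiplier are assumptions chosen for this sketch, not measured values:

# Back-of-the-envelope sizing check for the stated data volumes.
# Bytes per record, retention horizon and the DWH overhead multiplier
# are assumptions for this sketch, not measured values.

RECORDS_PER_YEAR = 400_000_000  # upper bound of the stated 300-400 million
BYTES_PER_RECORD = 100          # assumed average size of a lowest-granularity record
YEARS_OF_HISTORY = 10           # assumed retention horizon
DWH_OVERHEAD = 8                # assumed multiplier for staging, indexes and aggregates

yearly_input_gb = RECORDS_PER_YEAR * BYTES_PER_RECORD / 1e9
total_records = RECORDS_PER_YEAR * YEARS_OF_HISTORY
total_volume_tb = total_records * BYTES_PER_RECORD * DWH_OVERHEAD / 1e12

print(f"yearly input:       ~{yearly_input_gb:.0f} GB")  # ~40 GB  -> "tens of gigabytes"
print(f"historical records: ~{total_records:,}")         # ~4,000,000,000 -> "billions"
print(f"historical volume:  ~{total_volume_tb:.1f} TB")  # ~3.2 TB -> "terabytes"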

The system will be focused on a small group of advanced users, on the order of units or low tens of them. Deployment for hundreds or more inexperienced users is not expected, at least for the first few years. Another specific assumption is the expectation of batch imports: new data will arrive on a quarterly basis at the beginning, and this may later be accelerated to monthly updates. Online updates are not expected at present because of the character of the data. High availability of the system is also not required, as the system will not be used for online operation management but rather for long-term quality assurance and for cost-effectiveness evaluation and optimization analyses.

Fundamental data sources

The following data sources will form the backbone of the data processed in the designed system (a purely illustrative record sketch follows the list):

1)  National registry of healthcare services covered by public health insurance

·  Production data of all healthcare providers on services covered by public health insurance

·  Data on the organizational, technical and personnel assurance of these services

·  Data on payments for these services addressed to individual providers

2)  Registry of the data from network of reference hospital healthcare providers

·  Production data on provided healthcare services, as submitted to the public health insurance companies

·  Data on the organizational, technical and personnel assurance of these services

·  Data on expenses spent on the provided care

3)  Developed classification systems and rules (the classification system of hospital procedures, the hierarchical classification of healthcare production items, rules for compiling hospitalization cases, rules defining the DRG classification, etc.)

4)  National and international lists and classifications (ICD, TNM, DASTA, VZP, SUKL, …)

5)  Other registries maintained by IHIS and other sources
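Purely to illustrate the kind of lowest-granularity records these sources produce, a hypothetical record structure is sketched below. All field names are invented for this sketch and do not reflect the actual registry schemas:

# Hypothetical structure of one lowest-granularity production record.
# All field names are illustrative; the real registry schemas differ.
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass(frozen=True)
class ProductionRecord:
    provider_id: str          # healthcare provider identifier
    insurer_id: str           # health insurance company identifier
    case_id: str              # hospitalization case identifier
    procedure_code: str       # code from the hospital procedure classification
    diagnosis_icd10: str      # principal diagnosis (ICD, 10th revision)
    service_date: date        # date the service was provided
    amount_invoiced: Decimal  # payment claimed from the insurer

@dataclass(frozen=True)
class CostRecord:
    # Expense data reported only by the reference hospital network (source 2)
    case_id: str
    cost_category: str        # e.g. personnel, material, overhead
    amount_spent: Decimal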

Logical components of the system

The designed system has to be composed of the following logical components, which ensure the required capabilities and functions:

Component / Capabilities and functions
Data integration subsystem / Visual transform design, processing data from files, databases, web services, other components of this system
DWH store / Data warehouse database for efficient storage and retrieval of data and metadata
BI OLAP / Dimensional modelling, OLAP cubes, pivot tables, multidimensional querying, self-service analyses
Advanced DM / Data mining tools for advanced analyses (statistical, business rules engine, regression models and classifications, clustering, associative models and pattern matching, …)
Reporting / Online predefined or self-service analyses, export to text, pdf, word, excel, powerpoint, images, …
Specialized components / System management tools, Metadata management tools, QA tools, …

The real technical solution does not need to correspond exactly to these components, but it has to fulfill their capabilities.

Picture 1: Schema of the logical components of the DWH
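To make the intended separation of concerns concrete, the logical components can be thought of as interfaces along the following lines. This is a minimal sketch of the contracts between components; all names and signatures are invented for illustration and are not a prescribed API:

# Minimal sketch of the logical components as interfaces.
# Names and signatures are illustrative, not a prescribed API.
from abc import ABC, abstractmethod
from typing import Any, Iterable, Mapping

Record = Mapping[str, Any]

class DataIntegration(ABC):
    """DI subsystem: processes data from files, databases and web services."""
    @abstractmethod
    def run_job(self, job_name: str, sources: Iterable[str]) -> Iterable[Record]: ...

class DwhStore(ABC):
    """DWH store: persists data together with the metadata describing it."""
    @abstractmethod
    def load(self, table: str, records: Iterable[Record]) -> None: ...
    @abstractmethod
    def query(self, query_text: str) -> Iterable[Record]: ...

class OlapEngine(ABC):
    """BI OLAP: dimensional models, cubes and multidimensional querying."""
    @abstractmethod
    def query_cube(self, cube: str, dimensions: list[str],
                   measures: list[str]) -> Iterable[Record]: ...

class MiningEngine(ABC):
    """Advanced DM: statistical models, clustering, rule engines, ..."""
    @abstractmethod
    def fit(self, model_type: str, records: Iterable[Record]) -> Any: ...

class Reporting(ABC):
    """Reporting: predefined or self-service outputs in various formats."""
    @abstractmethod
    def render(self, report: str, data: Iterable[Record], fmt: str = "pdf") -> bytes: ...

The interoperability requirement below (requirement 3) is precisely the demand that contracts of this kind exist between components, even when the components come from different vendors.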

Basic requirements for the designed solution

The following list summarizes the general requirements for the system, without exact specification.

1)  Complexity: All the basic elements of the BI/DM process are covered – acquisition, processing and integration of data, its storage, basic as well as advanced analyses, and report generation.

2)  Scalability: The system's performance can be easily boosted by adding new (preferably independent) hardware and licenses at a reasonable price.

3)  Modularity, openness and interoperability: Particular process components could be implemented with various specialized tools from different vendors. These components will be integrated via specified interfaces according to exact requirements. They have to cooperate on several levels – e.g. the results of some advanced data mining tasks could be used as input for the data integration process; metadata describing the data integration process will be stored in the DWH and further analyzed; etc.

4)  Exchangeability: Thanks to the previous requirements, some of the solution components could be exchanged for alternatives during the system's lifecycle if those alternatives address changing expectations better. The expenses of the change itself (migration of data and metadata, integration with other components, etc.) must not be too high. One special requirement is to allow a solution based on open-source and/or free tools for non-profit and educational purposes, compatible with the standard DWH system as much as possible.

5)  Extensibility: The system could be easily extended with additional tools and components that provide users with new functions and capabilities not included at the beginning.

6)  Quality Assurance: A tool for designing, validating and improving the data and metadata quality process has to be supported. It is necessary to support validation and to ensure the completeness, consistency and recency of the processed data, as well as to enable validation of the whole process (metadata) of data transformations, from acquisition up to final reporting (a minimal validation sketch follows this list).

7)  Security: The system will be used to process very sensitive personal health data, so it has to be secured against all external and internal threats. It has to support secure authentication and authorization as well as secure storage and communication. It has to allow logging and auditing of both execution and read operations. As this system will be available to a limited number of advanced users, user access rights will be set at the database, table or column level; row- or cell-level access rights do not have to be supported. The system has to be operated on the submitter's local servers; cloud or outsourced solutions are forbidden due to legal limitations.

8)  Simplicity: Both the particular components and the system as a whole have to be very simple to use and manage. Stability of the solution is crucial.

9)  Metadata and data versioning, archiving and backup: There has to be support for tools that allow version control, a development cycle (GIT or others) and parallel team collaboration on all processes, jobs (data flows), DB schemas, etc. The system has to allow committing and reverting changes in tasks, comparing changes in data and metadata, etc.

10)  Performance requirements: The system should be designed for a few concurrent users, batch processing of source data and complex data mining analyses. The complete data integration process for a quarterly data increment should take no more than tens or a few hundreds of minutes. OLAP analysis reports have to respond within at most a few seconds. Extensive parallelization of the processing is expected – the hardware architecture will initially be based on a single, multi-threaded server (up to many tens of threads). Further enhancement to a distributed platform is also expected.
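As a minimal illustration of requirement 6, a QA gate might validate the completeness and consistency of a quarterly increment before it is loaded. The concrete checks, field names and threshold below are assumptions for the sketch:

# Minimal sketch of a QA gate for a quarterly batch (requirement 6).
# The concrete checks, field names and threshold are assumptions.
from collections.abc import Iterable, Mapping
from typing import Any

Record = Mapping[str, Any]

REQUIRED_FIELDS = ("provider_id", "case_id", "procedure_code", "service_date")

def validate_batch(records: Iterable[Record], expected_min: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems: list[str] = []
    seen = 0
    for i, rec in enumerate(records):
        seen += 1
        missing = [f for f in REQUIRED_FIELDS if not rec.get(f)]
        if missing:
            problems.append(f"record {i}: missing fields {missing}")
    # Completeness check: a suspiciously small batch may indicate a failed extract.
    if seen < expected_min:
        problems.append(f"batch has {seen} records, expected at least {expected_min}")
    return problems

# Usage: reject the increment (and log the audit trail) if problems are found.
issues = validate_batch([{"provider_id": "P1", "case_id": "C1",
                          "procedure_code": "X", "service_date": "2015-03-31"}],
                        expected_min=1)
assert issues == []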

Possible solution examples

The following table describes several possible component sets from which the designed system could be composed. These solutions were compiled without detailed knowledge of the commercial products. The table is far from complete and is introduced here only to illustrate and demonstrate the basic system concepts. Particular components could be combined or exchanged for others according to functional, performance and financial expectations and limits.

Solution \ Component / DI / DB / BI OLAP / Advanced DM / Reporting
Pentaho community / Data Integration (Kettle) / MonetDB / BA Platform OLAP (Mondrian) / DM (Weka) / Reporting
Mixed community / Talend Open Studio / InfoBright ICE / Palo / R / JasperReports Server
IBM / InfoSphere DataStage / DB2 / Cognos TM1 / SPSS Modeler / Cognos BI
Oracle / Data Integrator / Database Standard / BI Server / ? / BI Publisher
Hewlett-Packard / ? / Vertica / ? / ? / ?
Microsoft / SSIS / SQL Server BI ed. / SSAS BI ed. / ? / SSRS
SAS / Data Management / ? / Enterprise BI server / Enterprise Miner / Visual Analytics

Use cases

The use case scenarios will be defined during the course of the project.

Requirement specification

The detailed requirement specification will be defined during the course of the project.

Appendices

Appendix A: Definitions, Acronyms and Abbreviations

The following table explains the terms and abbreviations mentioned in the text.

Term / Description
BI / Business Intelligence
DASTA / DAta exchange STAndard in Czech healthcare system
DB / Database system (RDBMS, columnar store, in-memory, distributed, ...)
DI / Data integration process
DM / Data Mining
DRG / Diagnosis related group
DRG Restart / Long-term complex project restructuring the payment methodology for inpatient care in the Czech healthcare system
DWH / Data Warehouse
ETL / Extract, Transform, Load Process
GIT / Open source version control system
IHIS / Institute of Health Information and Statistics of the Czech Republic
ICD / International classification of diseases, 10th revision
OLAP / OnLine Analytical Processing
QA / Quality Assurance
SUKL / State Institute for Drug Control of the Czech Republic
TNM / Tumor staging classification (Tumor, Node, Metastasis)
VZP / General Health Insurance Company of the Czech Republic