CDISC MDR
Business Requirements Specification
CDISC Metadata Repository
Enrichment & integration of CDISC Data Standards toward semantic interoperability
Business Requirements
Prepared by the CDISC MDR Team
Notice to Readers
This is a description of the CMDR business requirements, based on the stakeholder analysis provided in another document. A related document with appendices contains additional information that was considered during the establishment of the requirements.
(NOTE: this document is NOT yet complete, as a few stakeholder analyses still need to be done)
Revision History
Date / Version / Summary of Changes
Nov 08 / 0.4 / First version distributed to the BRS sub-stream as a basis for discussion
Dec 08 / 1.0 / First consolidated version of the requirements, based on a partial stakeholder analysis (provided in a separate document)


Table of Contents

1. Introduction: what is the problem we want to solve with CMDR?

2. Definitions

2.1 Glossary

2.2 Difference between concept and variable

2.2.1 Storyboard

2.2.2 Description of differences between concept and variable

3. Requirements

3.1 Scope & criticality of requirements

3.2 Information content

3.2.1 Variable definition

3.2.2 Concept description

3.2.3 Code lists/terminologies

3.3 Release Management & version control

3.4 Information upload

3.5 Information entry

3.6 Information retrieval (search and access)

3.7 Information extraction & downloads

3.8 System availability & performance - miscellaneous

4. Acknowledgment

4.1 Participants to the requirement stream

4.2 Participants to other activities related to CMDR

1.  Introduction: what is the problem we want to solve with CMDR?

This document is the Business Requirements Specification (BRS) for the CDISC Metadata Repository (C-MDR). It documents an initial, mutual understanding among several pharmaceutical companies about the current business situation and future business requirements.

Today, each pharma company maintains its own internal data dictionary, describing variables (both collected and derived) with their related code lists (also known as value sets). For larger companies, these dictionaries can exceed 25,000-30,000 variables. A number of problems are common across pharma companies:

·  Although most data dictionaries are well maintained by dedicated staff for clinical trial and pharmacovigilance information, other areas such as pre-clinical, Health Economics & Outcomes Research (HEOR) and epidemiological studies are far less organized. Ongoing maintenance of existing data dictionaries is very costly.

·  Company data dictionaries very rarely support semantic interoperability, because variables are defined inconsistently and redundantly.

·  There is a major problem with data re-use, particularly when data are used outside of their primary purpose. The exchange of data within and across companies for in/out licensing requires major mapping work. In their overall cost of services, CROs need to include the cost of mapping for each company-specific standard.

·  There is increasing pressure to decrease costs while increasing the number of New Chemical Entities (NCEs).

·  Regulatory requirements call for comparative profiles of safety and cost-efficacy.

Given these problems, company-specific data dictionaries are no longer sustainable. A cross-industry standard “data dictionary” or Metadata Repository is needed – one that is enriched to support full semantic interoperability for clinical research (and preferably EHR integration).

To date, most CDISC data standards have been developed with little coordination across standards. Most companies implement and use CDISC standards in different ways due to insufficient and/or inconsistent definition of data elements and their meaning. The CDISC Terminology Team and other teams (e.g. CDASH, SDTM) have spent significant time over the past couple of years harmonizing standards (and associated code lists) to common data element definitions.
This is an extremely valuable first step, but it remains limited to single variables and does not take cross-variable constraints into account. Controlled terminologies alone are not enough to define activities and observations unambiguously; constraints across controlled terminologies for specific activities also need to be captured. For instance, only a few measurement units are applicable to blood pressure. Furthermore, context information may be needed to ensure correct interpretation, for instance a reference range in a specific population.
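
As an illustration only, the difference between a per-variable code list and a cross-variable constraint can be sketched as follows. The observation codes (SYSBP, DIABP, WEIGHT), the unit lists and the function names are assumptions made for this sketch; they are not CDISC controlled terminology or a proposed C-MDR design.

```python
# Hypothetical constraint table: which units are permitted for which observation.
# Codes and unit lists are illustrative assumptions, not CDISC terminology.
ALLOWED_UNITS = {
    "SYSBP": {"mmHg", "kPa"},
    "DIABP": {"mmHg", "kPa"},
    "WEIGHT": {"kg", "g", "lb"},
}

# A single flat unit code list, as a per-variable terminology would define it.
UNIT_CODELIST = set().union(*ALLOWED_UNITS.values())


def unit_in_codelist(unit: str) -> bool:
    """Single-variable check: is the unit a member of the unit code list?"""
    return unit in UNIT_CODELIST


def unit_valid_for(test_code: str, unit: str) -> bool:
    """Cross-variable check: is the unit permitted for this specific observation?"""
    return unit in ALLOWED_UNITS.get(test_code, set())
```

In this sketch, "kg" passes the single-variable check because it is a valid unit in general, yet it fails the cross-variable check when paired with a blood pressure observation. That gap is what harmonized code lists alone do not close.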

There is an industry need for increased efficiency across the data chain from collection through submission, and therefore the consistency of current and future CDISC standards is critically important. This consistency becomes even more important as CDISC seeks alignment with other standards such as those developed by HL7 and ISO. BRIDG represents a step in the right direction; however, BRIDG does not cover all variables that can be used to collect, derive and submit data. Also, BRIDG is agnostic of controlled terminology code lists.

The C-MDR is intended to be complementary to BRIDG and CDISC terminology activities. It is a “cross industry enriched data dictionary” that is comprehensive, electronic and publicly accessible, with unambiguous concepts and variables for safety, efficacy and other areas related to protocol-driven research. Its purpose is to:

1.  Provide a semantically correct, public and re-usable set of standard definitions valid across the industry. The content of the C-MDR must be accessible to and usable by each company to support drug development from data collection to submission, while enabling seamless data exchange with external partners (CROs, in/out licensing).

2.  Support data integration and data pooling within each organization across large databases coming from different sources; this applies to JANUS as well as to databases within pharma companies.

3.  Support semantically consistent data exchange across organizations by enabling implementation of fully interoperable messages (compliant with HL7 methodology or messages such as CDISC ODM) where there is a consistent definition of data elements across different messages within the clinical research community.

4.  Enable integration of clinical research and clinical care by supporting easier access to EHR information.

As stated by the FDA in The Sentinel Initiative: National Strategy for Monitoring Medical Product Safety (May 2008): “Standardizing data elements and terminologies is a critical component to any attempt to achieve a modern electronic approach to monitoring medical product performance.”

2.  Definitions

2.1  Glossary

Terms / Description / Source
Terminology / A standardized, finite set of terms (e.g., picklists, ICD9 codes) that denote patient findings, circumstances, events, and interventions.
Code list, Valid value set / Codelist: finite list of codes and their meanings that represent the only allowed values for a data item. A codelist is one type of controlled vocabulary.
Controlled vocabulary: a finite set of values that represent the only allowed values for a data item. These values may be codes, text or numeric. / Clinical Research Glossary V4.0, CDISC, Applied Clinical Trials (actmagazine.com), Dec 2005
Variable / 1.  Any quantity that varies; any attribute, phenomenon or event that can have different qualitative or quantitative values. There is usually a form of metadata that goes with the variable: a variable definition that describes what is varying, and a value for the variable.
2.  In SDTM, “variables” are used to describe observations. Such describing variables have roles that determine the type of information conveyed by the variable about each observation and how it can be used.
Variable is an enveloping term that includes specific subtypes used in clinical research.
·  A “study variable” is a term used in trial design to denote a variable to be captured on the CRF.
·  An “assessment” is a study variable pertaining to the status of a subject (i.e. it involves evaluation or judgment).
·  An “endpoint” is a variable that pertains to the trial objectives (safety and efficacy).
An item is a representation of a clinical variable, fact, concept or instruction in a manner suitable for communication, interpretation or processing by humans or by automated means. NOTE: Items are collected together to form item groups. / Clinical Research Glossary V4.0, CDISC, Applied Clinical Trials (actmagazine.com), Dec 2005
Data element / 1. For XML, an item of data provided in a mark up mode to allow machine processing. [FDA - GL/IEEE]
2. Smallest unit of information in a transaction.
NOTE: The mark up or tagging facilitates document indexing, search and retrieval, and provides standard conventions for insertion of codes. [Center for Advancement of Clinical Research]
Note: as data element is an overloaded term, we will limit its usage in this document. / Clinical Research Glossary V4.0, CDISC, Applied Clinical Trials (actmagazine.com), Dec 2005
Ontology / 1. A rigorous and exhaustive organization of some knowledge domain that is usually hierarchical and contains all the relevant entities and their relations. (Princeton University, http://wordnet.princeton.edu/perl/webwn?s=ontology)
2. A common vocabulary for describing the concepts that exist in an area of knowledge and the relationships that exist between them. An ontology allows for a more detailed specification of the relationships in a domain than is the case with a thesaurus or taxonomy. ... (Univ. of Toronto, http://plc.fis.utoronto.ca/tgdemo/Glossary.asp)
3. A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. In biomedicine, such ontologies typically specify the meanings and hierarchical relationships among terms and concepts in a domain. (cordis.europa.eu/ist/ka1/administrations/publications/glossary.htm)

2.2  Difference between concept and variable

2.2.1  Storyboard[1]

A project team is developing a Clinical Development Plan (CDP) for an antihypertensive compound and the first study team has been identified to develop the first study in that program.
As part of the development of the CDP, the team discusses what primary clinical measures they want in the program: efficacy (primary endpoints), safety (primary endpoints), economic evaluation (defining what they would like to be able to demonstrate), genetic information (maybe just defining that they want samples to be taken for subsequent analysis). They make decisions and document these.
Following on from the approval of the CDP, the clinical and statistical members of the study team start to flesh out the requirements. They augment the CDP endpoints with secondary endpoints, modify the primary endpoints if necessary, flesh out the economic evaluation and agree to analyze the genetic samples.
In all the above, there is no detail ... just the use of terms like blood pressure, ECG, Labs, ambulatory blood pressure monitoring, hospitalization costs, health survey. In these discussions, no-one will be talking about variables.
When all this is tied down (duration of treatment, visit schedule, etc.), these high-level terms need to be made real, moving from talk of measures to talk of individual CRF fields[2]:

·  which blood pressure measurements (Supine or standing? After exercise or not? Does it matter which arm is measured? Does it matter who takes the measurement, nurse or doctor?),

·  which ECG parameters,

·  which lab parameters, etc.

Throughout this process, people need to use unambiguous terms (both high level and detailed).

·  Until the discussion moves from talk of measures to talk of individual CRF fields, we should talk about concepts.

·  At the point where we must have a precise description of how specific instances of a concept will be measured or observed, and values of that observation recorded, we need to define a variable. A variable must have a unique name/identifier and sufficient attributes defined to ensure no ambiguity when references are made to data values. Attributes include, but are not limited to data type (numeric, character, etc.), length, permissible values (e.g. permissible range or a code list), measurement methodology, etc.

·  When we need to talk about individual CRF fields in the context of data collection, we should talk at the variable level, but still in the context of the concepts.

·  When we start to analyze the data (particularly after the initial reporting of the study), we are again interested in the variables in the context of concepts.
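
To make the attribute list above more concrete, a variable definition could be represented roughly as in the sketch below. The class name, the field names and the example values (a length of 8, a range of 0 to 400) are assumptions chosen for illustration; this document does not prescribe a data structure, and the variable names are borrowed from the blood pressure example used elsewhere in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class VariableDefinition:
    """Illustrative variable definition; all field names are assumptions."""
    name: str                      # unique name/identifier
    concept: str                   # the concept that gives the variable its meaning
    data_type: str                 # "numeric" or "character"
    length: int                    # maximum length of a character value
    permissible_range: Optional[Tuple[float, float]] = None
    code_list: Optional[FrozenSet[str]] = None

    def accepts(self, value) -> bool:
        """Return True if the value satisfies the declared attributes."""
        if self.data_type == "numeric":
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if self.permissible_range is not None:
                low, high = self.permissible_range
                return low <= value <= high
            return True
        if not isinstance(value, str) or len(value) > self.length:
            return False
        return self.code_list is None or value in self.code_list


# Hypothetical definitions; the range and code list are made up for the example.
sys_bp_val = VariableDefinition(
    name="SYS_BP_val", concept="Systolic blood pressure",
    data_type="numeric", length=8, permissible_range=(0, 400),
)
sys_bp_unit = VariableDefinition(
    name="SYS_BP_unit", concept="Systolic blood pressure",
    data_type="character", length=8, code_list=frozenset({"mmHg", "kPa"}),
)
```

With these definitions, a reference to a data value can be checked against the variable it belongs to: `sys_bp_val.accepts(120)` holds, while `sys_bp_unit.accepts("kg")` does not.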

2.2.2  Description of differences between concept and variable

Concept / Variable
Abstract idea, used by human beings (scientists, who are often computer agnostic) / Used by a computer system as the representation of a concept
System independent (i.e. could be managed on paper) / Constrained by system development/implementation (as a computer system needs a more formal language than a human being)[3]
Can be complex or can be simple
·  A complex concept is composed of other concepts (complex or simple) through a structured organizational framework or ontology
·  A simple concept may be linked to many different complex concepts
A simple concept must be linked to all variables capturing the meaning related to the concept in a computer-tractable way
(e.g. “Systolic blood pressure”[4] is linked to SYS_BP_val, SYS_BP_unit, SYS_BP_range) / Must be linked to a concept to ensure meaning is totally unambiguous
Unambiguous, with context information, as it relates to other concepts (information model) / May not exist without being linked to one specific simple concept, which provides unambiguous semantics[5]
Unique for a specific meaning, with unique name/identifier and defining attributes / Ideally a unique variable for a specific meaning, with unique name/identifier and uniquely defining attributes, uniquely linked to a simple concept
(i.e. we would have SYS_BP_unit and DIA_BP_unit as 2 different variables, as they represent 2 different meanings)
In practice, due to legacy databases, there will be many different variables with the same meaning or many of the same variables with different meanings; in that case it is recommended to have a “gold standard” unique variable name within the CDISC MDR.
As with any conversion to a new system/standard/version, each company will have to decide whether and how to implement the variables in their company. However, whenever there is an exchange of information between companies, the “gold standard” unique variable name (and definitions) will be used.
Based on ISO 21090 abstract data type (e.g. Physical Quantity (PQ), Interval of time specification (IVLTS), ..) / Based on basic data type (e.g. integer, real, Boolean, char, …)
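
The linking rules in the table above can be sketched as a small lookup, shown here for illustration only. SYS_BP_val, SYS_BP_unit, SYS_BP_range and DIA_BP_unit come from the table; DIA_BP_val, DIA_BP_range and the legacy names (SBP, SYSTOLIC, DBP) are hypothetical, as are the function names. This is a sketch under those assumptions, not a proposed C-MDR design.

```python
# Each simple concept is linked to the "gold standard" variables that capture it.
CONCEPT_VARIABLES = {
    "Systolic blood pressure": ["SYS_BP_val", "SYS_BP_unit", "SYS_BP_range"],
    # DIA_BP_val and DIA_BP_range are assumed by analogy with the SYS_BP_* names.
    "Diastolic blood pressure": ["DIA_BP_val", "DIA_BP_unit", "DIA_BP_range"],
}

# Hypothetical legacy (company-specific) names mapped to gold-standard names.
LEGACY_TO_GOLD = {
    "SBP": "SYS_BP_val",
    "SYSTOLIC": "SYS_BP_val",
    "DBP": "DIA_BP_val",
}


def to_gold_standard(name: str) -> str:
    """Translate a legacy variable name; gold-standard names pass through unchanged."""
    return LEGACY_TO_GOLD.get(name, name)


def concept_of(variable: str) -> str:
    """Return the one simple concept a variable is linked to.

    Raises ValueError if the variable is linked to no concept or to several,
    since a variable may not exist without exactly one such link.
    """
    owners = [c for c, names in CONCEPT_VARIABLES.items() if variable in names]
    if len(owners) != 1:
        raise ValueError(f"{variable!r} is linked to {len(owners)} concepts, expected 1")
    return owners[0]
```

For an exchange between companies, a sender would first translate its own names with `to_gold_standard` and then rely on `concept_of` to recover the meaning, so that two legacy names such as SBP and SYSTOLIC resolve to the same variable and the same concept.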

3.  Requirements

3.1  Scope & criticality of requirements

Different stakeholders may share the same needs. This section integrates these needs into a consistent set across stakeholders.

Each requirement is provided with

·  The list of stakeholders for which this requirement is relevant; the following acronyms are used

o  PH_SC = Protocol author/scientist

o  PH_DM = Data manager/eCRF developer

o  PH_ST = Biostatistician