This analysis was submitted by John Brega and Jane Diefenbach of PharmaStat. It is based on methods they have developed through their eCTD submissions consulting practice.

We call the Data Guide document either tabulations-data-guide.pdf or analysis-data-guide.pdf, depending on which type of database it describes. In the submission it lives with the data, just like the define.pdf or define.xml. We did not call it a Reviewer’s Guide because that causes a name collision with the Reviewer’s Guide document typically included in Module 1, which is a guide to the entire submission and mainly the work of the Clinical and Regulatory groups.

We have provided links to examples of both the Tabulations Data Guide and the Analysis Data Guide. We have also described and provided an example of a Guide to Analysis Programs. The examples make it easier to visualize what we’re describing below in the analysis of the issues, but the issues can be addressed by many different styles of document. It is important to understand the issues and the rationales for the document contents to fully understand the examples.

Data Guide for an SDTM Database

Click Here for a generic example of an SDTM tabulations-data-guide.pdf document.

Click Here for a more detailed example of a Data Guide document for a public domain TB submission.

For an SDTM database the document has six sections:

1.  Intro to the study and database

We provide one to three paragraphs at the top of the document, starting with the name and title of the study, followed by a few general observations about it such as whether the database includes screen failures, whether the study is ongoing, whether it is the pivotal study, which studies are included if it’s an ISS or ISE database, etc.

2.  Table of Contents and bookmark panel

3.  Where to Find Key Data

One of the difficulties reviewers have, even with SDTM data, is figuring out where to find the basic data they need to review, and which datasets they can ignore because they contain only administrative data, like lab sample dates that are already reconciled with the LB dataset. We typically put most or all of the following subsections under this header:

·  Demographics and Compliance

·  Exposure to Study Treatment

·  Subject Disposition

·  Safety

·  Efficacy

·  Pharmacokinetics/Pharmacodynamics

·  Trial Design Model Datasets

Each subsection gives the names and titles of the datasets relevant to that category of data. For example, if only Drug Accountability data were collected for the study, it would mention that DA (Drug Accountability) is the primary source for exposure data, and that the EX (Exposure) domain is derived from the DA data. The Efficacy section might describe which custom domains address which study endpoints. For a phase I study it might simply say that no efficacy data were collected. There could be other subcategories as needed. For example, a Thorough QT study might have a section to identify the ECG and PK data, describe their relationships, indicate where to find the means of the replicates, etc.

4.  Overview of Custom Domains

We typically don’t give descriptions of the standard domains, but we find that the custom domains can be very obscure, limited as we are to two-character dataset names and 40-character dataset labels. We provide a description of each custom domain. The description is usually just a short paragraph describing the dataset structure and purpose, what kind of data it contains, how it relates to other datasets, and how to identify the subjects or records of particular interest. If it contains only administrative data we would mention that.

5.  Datasets Not Submitted

It frequently happens that a study ends up with no data collected for some domains. Since the SDTM rule is not to submit empty datasets, the database can have holes that make it appear that the sponsor was asleep at the switch. This most frequently happens in phase I healthy-volunteer studies where, for example, there may not be any subjects with IE criteria not met, or any AEs. You’d kind of wonder if there weren’t an AE dataset, wouldn’t you? For later-phase studies there might be a collection instrument for pregnancies, but none were reported. And so forth.

6.  Derived Datasets

Sometimes SDTM datasets are completely derived, and in these cases it’s important to describe the derivation methods and the assumptions that are made. For example, if EX is derived from DA, you often end up with EX records that represent an average daily dose for each “continuous dosing interval” since you don’t have a record of each dose taken. It’s important to know this if you want to associate AEs and other data with specific cumulative dose levels, for example.
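The shape of such a derivation can be sketched briefly. The records, field names, and product strength below are hypothetical, and the logic is only one plausible reading of the averaging rule described above, not the authors' actual method:

```python
from datetime import date

# Hypothetical drug accountability (DA) records: one row per dispensing
# visit, with the quantity dispensed and the quantity later returned.
# Field names are illustrative, not actual SDTM variable names.
da_records = [
    {"dispensed": date(2023, 1, 1), "returned": date(2023, 1, 15),
     "tablets_dispensed": 30, "tablets_returned": 2},
    {"dispensed": date(2023, 1, 15), "returned": date(2023, 1, 29),
     "tablets_dispensed": 30, "tablets_returned": 6},
]

DOSE_PER_TABLET_MG = 10  # assumed product strength

def derive_ex(records):
    """Collapse each continuous dosing interval to one EX-style record
    carrying the average daily dose, since no record of each individual
    dose exists."""
    ex = []
    for rec in records:
        days = (rec["returned"] - rec["dispensed"]).days
        taken = rec["tablets_dispensed"] - rec["tablets_returned"]
        ex.append({
            "start": rec["dispensed"],
            "end": rec["returned"],
            "avg_daily_dose_mg": taken * DOSE_PER_TABLET_MG / days,
        })
    return ex

for row in derive_ex(da_records):
    print(row["start"], row["end"], round(row["avg_daily_dose_mg"], 1))
```

A reviewer who reads only the EX records would see a flat 20 mg/day for the first interval and never know whether the subject actually dosed evenly; that assumption is exactly what the Data Guide needs to spell out.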

If a DV (Protocol Deviations) dataset is submitted it is essential for the reviewer to know what categories of deviations or violations were included in the dataset, and what rules or data sources were used to identify them. If you get a dataset with a handful of trivial deviations and no explanation, you have no idea whether those were all the deviations there were, or whether they were the only ones collected on a CRF, for example.

We put the descriptions of derivations in their own section because they can sometimes be lengthy and are a different kind of narrative from the descriptions of the datasets, which are focused on how to use and interpret them.

Data Guide for an Analysis Database

Click Here for a generic example of an ADaM analysis-data-guide.pdf document.

We use a similar template for other types of databases and adjust it to address the reviewer’s needs. We treat ADaM documentation the same as documentation for any other legacy analysis database. The first three sections above are included and perform the same function. Section 4 is an overview of all datasets, since ADSL is the only standard analysis domain. One thing we include in the description of ADSL is the list of the “Core” variables that are merged onto all the other analysis datasets.

There is no need for section 5 (Datasets not Submitted). Section 6 we typically roll up into the dataset overviews if the material is not too voluminous. This would include dataset-level and row-level descriptions of the structure, derivation and usage of the data. Occasionally a study will involve a large amount of complex derivation. In these cases we add another section on the end called Computational Methods. We have seen these sections get as large as 70 – 80 pages. A handbook for scoring a complex questionnaire is an example of a multipage derivation description that might be included in its entirety for the benefit of the reviewer.
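To give a flavor of what a Computational Methods section documents, here is a toy questionnaire-scoring sketch. The item count, scoring range, and prorating rule for missing items are assumptions for illustration, not taken from any real instrument:

```python
def domain_score(item_responses, n_items, min_answered_frac=0.5):
    """Score one questionnaire domain.

    Items are scored 0-4; the domain score is the item total, prorated
    for missing items provided at least half the items were answered.
    Returns None when too few items were answered to score the domain.
    (Illustrative rule only; real instruments specify their own.)
    """
    answered = [r for r in item_responses if r is not None]
    if len(answered) < n_items * min_answered_frac:
        return None  # too much missing data to score this domain
    # Prorate: scale the observed total up to the full item count.
    return sum(answered) * n_items / len(answered)

# A subject answered 4 of 5 items in a hypothetical domain.
print(domain_score([3, 2, None, 4, 1], n_items=5))  # prints 12.5
```

Even a rule this small raises the questions a reviewer will ask (why 50%? why prorate rather than impute?), which is why lengthy scoring handbooks sometimes belong in the guide in their entirety.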

Data Guide for an “Item 11” Legacy Database

The template we use for legacy “Item 11” database documentation is essentially the same as for an analysis database. In particular we include a description of every dataset since there are no standard datasets with commonly understood purposes and structures. Since this type of data is typically less derived than SDTM, a Computational Methods section is rarely needed.

Analysis Program Guide

Besides the define.xml/define.pdf, data guides and blankcrf.pdf, if a submission includes analysis programs we also include an analysis-program-guide.pdf document. This is because the names of sponsor SAS programs do not usefully describe their purpose, and without a guide to the program set a reviewer has little hope of making good use of them. The purpose of the guide is to identify every component of the submission that was produced by a program (e.g., analysis datasets, tables and figures), the program that produced it, and the inputs the program used. We put this document with the analysis data and related documentation because that’s where reviewers look for documentation. There is an obvious case to be made for putting it with the programs, especially if it were an expected document.

Click Here for a generic example of an analysis-program-guide.pdf document.

We build the program guide as a spreadsheet which we then convert to a pdf. It has four tabs:

1.  Index Page

This tab serves as a table of contents for the other three tabs.

2.  Analysis Dataset Programs

This tab has an entry for each submitted analysis dataset. It specifies the dataset name and description (label), the program that creates it, and the complete list of inputs. Everything except the dataset label is a hyperlink to the named file.

3.  Table and Figure Programs

This tab has an entry for each table and figure included in the Clinical Study Report, though not necessarily the in-text tables. We do not include listings, as their programs do not typically perform interesting derivations. The tab is ordered by table/figure number. It specifies the table or figure number, the title of the table or figure, the program that created it, and the complete list of inputs to the program. This table can be repetitive if many tables are produced by a single program, but it makes it very easy for a reviewer to identify a program of interest by looking up the table or figure number.

4.  Macros

Frequently the analysis programs will employ a suite of SAS macros to perform common functions. There may be many such macros used, but usually only a few are of interest to a reviewer. This tab identifies the macros of interest. We define these to be macros that perform some substantive derivation that contributes to results in an analysis dataset, table or figure. We exclude utility macros that perform formatting, printing or common data manipulation functions. Out of a possible 70 – 100 macros used by the analysis programs we rarely submit more than about eight.

The tab has an entry for each submitted macro. It lists the macro name, a short description of its function, and the complete list of programs that call the macro. Everything except the description is a hyperlink to the referenced program file.
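A caller list like this can be compiled mechanically rather than by hand. The sketch below scans program text for %macro-name( invocations with a regular expression; the file names, program contents, and macro names are all made up for illustration:

```python
import re

# Hypothetical program sources keyed by file name.
programs = {
    "adsl.sas": "%mergesupp(dm); %derive_trtdur(adsl);",
    "t_ae.sas": "%aecount(adae, soc); %aecount(adae, pt);",
    "f_km.sas": "%derive_trtdur(adsl); proc lifetest; run;",
}

# Macros judged substantive enough to list (an assumption here).
submitted_macros = ["mergesupp", "derive_trtdur", "aecount"]

def callers(macro, sources):
    """Return the sorted list of program files that invoke %macro(...)."""
    pattern = re.compile(r"%" + re.escape(macro) + r"\s*\(", re.IGNORECASE)
    return sorted(name for name, text in sources.items()
                  if pattern.search(text))

for m in submitted_macros:
    print(m, "->", ", ".join(callers(m, programs)))
```

A simple scan like this misses indirect invocations (macro calls built at run time), so the generated list still needs a manual check before it goes into the guide.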