2016-03-081 2:40 PM

Proposal III for V2 HL7 genomics reporting lite aligned with a possible FHIR version (updated 2016 03 01)

Background

In our January 2016 proposal presented in Orlando, we focused exclusively on a V2 solution. Using the FHIR clinical report and observation resources, we can develop one approach that will work for both V2 and FHIR, and we are in the process of building a FHIR equivalent.

The National Center for Biotechnology Information (NCBI) divides genetic variations into two worlds: simple variations, those that span segments of less than 50 nucleotides, and larger variations (often called structural, or copy number, variations) that affect DNA segments of 50 Kb or larger.

The smaller ones are stored in dbSNP and the larger ones in dbVar. The smaller (dbSNP-sized) variations represent the vast majority (98 or 99%) of the variations reported to public genomic databases to date. They have been studied for much longer than copy number variants, more is known about their clinical consequences, and their specification is much more precise; NCBI and the European Bioinformatics Institute (EBI) assign reference identifiers to those smaller variations. The precise borders of most structural variations are known only to an approximation (at best today, to within 100k bases), and such variations cannot be assigned a reference identifier. NCBI’s dbVar database includes outer, inner, and exact (when known) boundaries (bigger than a bread box, smaller than a house), as well as the length (which may be known independently from the boundaries), for specifying the window in which the edge locations of such variations exist. Reports typically just give a boundary without defining the kind of boundary.

This is all preamble to argue for moving with due haste to craft an approach to the simple variants without waiting to find a perfect solution for the copy number (structural) variants (though we do propose some structure for them). However, we hasten to add that the really important difference between these two is the degree to which the precise position and kind of the variant is known, not the size.

As you review this proposal, remember that the structured observations that we are adding to the narrative report are intended for computer consumption and that the narrative report (which is aimed solely at human readers) will be retained in the message as another observation. This specification is intended to extend what is now being delivered in narrative to electronic medical record or other clinical database systems to enable retrievals and decision support. So do not focus too much on the human readability of the extra content intended for computer consumption. Also, for simplicity’s sake, this proposal presents one approach that could work for single and multiple gene studies.

Note: this version differs in two ways. First, we have accommodated the variant ID and the possibility that a variant can include more than one allele, which adds one more level of nesting to the simple variant specification. Second, we have revised the specification for structural (copy number) variants a bit to make it fit better with current reporting practices, and we want to discuss it.

There will be three or perhaps four kinds of observations for reporting a genomic variation. For now, we will consider four types:

1) A set of observations that set up and say things about the overall study.

2) An observation or set of observations (component observations in FHIR) reporting a simple variation (less than 50 nucleotides). If a study detects multiple variations, this construct will repeat for each variation detected. And if a variant includes more than one allele, the allele specification will repeat beneath the variant (a new level of nesting).

3) A set of observations that carry information describing the region(s) of interest, which will by necessity be different for targeted mutation analyses versus full sequencing studies.

4) A set of observations for specifying structural (copy number) variations.

The list of variables does not yet distinguish which ones are required and which are optional or conditional. We are working on a future section of the report that would specify which variables are required, depending upon the kind of variant being reported and the clinical circumstances.

  1. Observations that apply to the overall study

We assume that the overall report would be named as such reports typically are by reporting services (e.g. ASPA gene mutation analysis in Blood or Tissue by Molecular genetics method, Ashkenazi 7 gene targeted mutation analysis), and that most of them would have LOINC codes and names.

Table 1. Observations that may apply to the overall study. (See additional details in Appendix A.)

LOINC code / Name / Data type / Repeat / Answer list/ coding system
53577-3 / Reason for study / CWE / 0..1 / text
51967-8 / Genetic disease(s) assessed / CWE / 0..* / If we can allow many coding systems, these would include SNOMED, ICD-9, ICD-10, and a modified version of NCBI’s MedGen, which includes only diseases but covers most genetic diseases as well as ordinary diseases.
48002-0 / Genomic Source Class † / CNE / 0..1 / Somatic, germline, fetal, etc.
48018-6 / Gene(s) studied / CWE / 1..* / HGNC. The ID would be the HGNC Gene ID (a number); we believe there is already an HL7 coding system and OID. You can see what will be in the coding system by trying out

51969-4 / Full Mutation analysis report / CWE / 1..1
36908-2 / Gene mutations targeted † / Record as HGVS
51968-6 / Genetic disease analysis overall interpretation † / CWE / 1..1 / Positive, Negative, Inconclusive, Failure. (NOTE: At present, the LOINC term has an answer list defined for the current HL7 v2 implementation guide. Should be reviewed to see if it still fits well.) You can see what will be in the coding system in:

*LOINC code and name to be defined. Dummy codes and draft names appear in the table.

† Variables that would be pushed for inclusion in all reports
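
As a rough illustration of how the Table 1 observations might travel in V2, the sketch below assembles a few OBX segments. The helper names (`cwe`, `obx`), the sample values, and the placeholder answer and gene codes are assumptions made for illustration; only the LOINC codes and data types come from Table 1, and the sketch does not handle HL7 escape sequences.

```python
def cwe(code, text, system):
    """Join coded-value components (identifier^text^coding system), dropping empty trailing parts."""
    return "^".join([code, text, system]).rstrip("^")


def obx(set_id, value_type, loinc_code, name, value, sub_id=""):
    """Build fields OBX-1 through OBX-5 of one pipe-delimited OBX segment."""
    identifier = f"{loinc_code}^{name}^LN"  # OBX-3: code^text^coding system
    return "|".join(["OBX", str(set_id), value_type, identifier, sub_id, value])


# (data type, LOINC code, name, value); every value below is a made-up placeholder.
rows = [
    ("CWE", "53577-3", "Reason for study", cwe("", "Carrier screening", "")),
    ("CNE", "48002-0", "Genomic Source Class", cwe("SRC-PLACEHOLDER", "Germline", "LN")),
    ("CWE", "48018-6", "Gene(s) studied", cwe("HGNC-ID-PLACEHOLDER", "ASPA", "HGNC")),
    ("CWE", "51968-6", "Genetic disease analysis overall interpretation",
     cwe("ANS-PLACEHOLDER", "Negative", "LN")),
]

segments = [obx(i, *row) for i, row in enumerate(rows, start=1)]
for segment in segments:
    print(segment)
```

A real message would replace the placeholder identifiers with codes from the answer lists and coding systems named in the table.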

Some may argue for inclusion of other “overall report” variables and any that the committee agrees to would be easy to add.

  2. Specification of small genetic variations (the kind that are reported in dbSNP)

2.1 Overview: In HL7 V2 at least one OBX would be required for each such variation. In FHIR one observation would also be required for each variation. We propose to depend upon the ClinVar variant ID (this is a change from the V2 III proposal) as the primary coding system for individual small variations. We have used a temporary code X1230 and the name ClinVar variant for this observation. When a variant ID exists, the record in the ClinVar table tells you “everything.” But we would envision a set of related observations (and LOINC codes), using sub-IDs in V2 and analogous FHIR component observations, to break out key content of the variant into its four constituent parts: RefSeq, Gene symbol, DNA change in HGVS syntax (g. or c. depending on the variant), and Amino acid change in HGVS p. syntax, and more. When a variant ID exists, one could populate other fields of interest (as desired), such as the cytogenetic position, the Ref variant, and the Alt variant, directly from the ClinVar variant record. When a variant ID does not exist (because the variant has not been registered), the separate fields of interest would have to be populated “by hand” but could be supported by look-ups to appropriate coding systems, as we will demonstrate with an existing demo form. (See Figures 1 and 2.)


Figure 1: JavaScript on-the-fly web form that demonstrates most of the attributes defined in this specification. If you enter a variant ID that exists in the ClinVar files, it auto-populates many of the subsidiary fields. This figure shows how a variant with two alleles might be reported (see the nesting of alleles under the variant). The example is concocted from two independent variants because there are so few examples of multiple-allele variants in ClinVar to date.

To anticipate, Tables 2 and 3 list the potential variables (LOINC codes) that could be used to report genomic results for simple variations. In contrast to the previous proposal, we propose that the variant ID record carry only the variant ID; we will present separate fields for carrying its constituents and repeat those fields under the allele when the variant has many alleles.

With our new proposal the record would look the same whether the parts of the name were pulled directly from the variant record or had to be hand entered as would be needed when a variant ID did not exist.
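
The point that the record looks the same in both cases can be sketched as follows. This is a minimal illustration, not the demo form's actual logic: the function name, the field names, the in-memory `variant_table`, and every identifier and value in it are hypothetical stand-ins for a ClinVar look-up.

```python
COMPONENT_FIELDS = ("refseq", "gene", "dna_change", "amino_acid_change")

# Hypothetical stand-in for the ClinVar variant table (all values are placeholders).
variant_table = {
    "VARIANT-ID-1": {
        "refseq": "NM_PLACEHOLDER.1",
        "gene": "GENE1",
        "dna_change": "c.100A>G",
        "amino_acid_change": "p.Thr34Ala",
    }
}


def build_variant_record(variant_id, table, manual_entry=None):
    """Fill the component fields from the table when the ID is registered, else from hand entry."""
    registered = table.get(variant_id) if variant_id else None
    source = registered if registered is not None else (manual_entry or {})
    record = {"variant_id": variant_id if registered is not None else None}
    for field in COMPONENT_FIELDS:
        record[field] = source.get(field)
    return record


from_table = build_variant_record("VARIANT-ID-1", variant_table)
by_hand = build_variant_record(
    None,
    variant_table,
    manual_entry={"refseq": "NM_PLACEHOLDER.2", "gene": "GENE2", "dna_change": "c.5C>T"},
)
print(from_table)
print(by_hand)
```

Both records carry the same set of fields; only the variant ID and any fields the hand entry omitted are empty in the second.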

Figure 2: JavaScript- on the fly web form that shows almost the same content as Figure 1, but with a table structure for the entry of variant attributes. You can play with the look-ups on both of these forms (and some others) at The look up for the variants table returns the autocomplete entries as a table. When you pick one it preloads many of the other fields (in both tables). It does not now pre-load the related diseases or the cytogenetic location. That is in process.

We show an example of the new approach in Figure 3 and the list of fields that could be included in Table 2. The example shows some additional items such as the clinical significance and cytogenetic location. In FHIR there would be one primary observation per distinct variant shown in Figure 3. The observations connected by dot notation (OBX-4) in V2 would be component observations of the primary observation in FHIR.
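
To make the dot-notation linkage concrete, here is a small sketch that flattens a variant with nested alleles into rows carrying an OBX-4 style sub-ID and then regroups them per variant, which is roughly what a FHIR converter would do to form component observations. The numbering scheme (variant `1`, its alleles `1.1`, `1.2`) is an assumption, as are the function names and all sample values; the LOINC codes and the temporary code X1230 are taken from Table 2.

```python
def flatten(variants):
    """Turn nested variants into (sub_id, code, value) rows linked by dotted sub-IDs."""
    rows = []
    for v_idx, variant in enumerate(variants, start=1):
        rows.append((str(v_idx), "X1230", variant["variant_id"]))
        for a_idx, allele in enumerate(variant["alleles"], start=1):
            sub_id = f"{v_idx}.{a_idx}"  # allele nested under its variant
            for code, value in allele.items():
                rows.append((sub_id, code, value))
    return rows


def group_by_variant(rows):
    """Regroup rows under the leading part of the sub-ID (one group per primary observation)."""
    grouped = {}
    for sub_id, code, value in rows:
        grouped.setdefault(sub_id.split(".")[0], []).append((sub_id, code, value))
    return grouped


# One hypothetical variant with two alleles; all values are placeholders.
variants = [
    {
        "variant_id": "VARIANT-ID-1",
        "alleles": [
            {"48013-7": "NM_PLACEHOLDER.1", "48018-6": "GENE1", "41103-3": "c.100A>G"},
            {"48013-7": "NM_PLACEHOLDER.1", "48018-6": "GENE1", "41103-3": "c.250C>T"},
        ],
    }
]

rows = flatten(variants)
grouped = group_by_variant(rows)
for row in rows:
    print(row)
```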


2.2 Machinery for connecting additional information about the variation to its primary observation.

Table 2 summarizes all of the observations that could be tied to a single variant. These are divided into four groups: 1) the primary observation (when a variant ID exists; black font), 2) the components of the ClinVar variant names (red font), 3) other potential attributes of the variant (blue font), and 4) alternative codes for the variant in question (green font).

They would have the same LOINC codes and cardinality and links to answer lists and/or coding systems in both systems, V2 and FHIR.

Table 2. Variables that are used to describe a single variation. The whole set repeats for each variation that might be associated with the primary reporting structure.

LOINC code / Name / Data type / Repeat / Answer list/ coding system
Variables that repeat per variant
Variant ID if available carries the information and points to the information in the next four attributes
X1230 / ClinVar variant / CWE / 0..1 / The code is the variant ID. You can see what will be in the coding system at:
Type in a gene symbol, RefSeq or known HGVS mutation
Break out of the internal components of the variant name. Required for recording these details when the variant has not been registered and a variant ID is not available. These seven observations could repeat under one variant ID if the variant includes multiple alleles.
48013-7 / RefSeq / CWE / 1..1 / NCBI RefSeq

48018-6 / Gene / CNE / 0..1 / HGNC
*†[1]
41103-3 / DNA change(s) / CNE / 0..1 / HGVS (c. or g. syntax)
Need to decide whether to allow or disallow the [ ] syntax that would obviate the need for repeated values. Could generate for genotypes and haplotypes from the variant record, but it would not mix well with the breakdown which follows.
48005-3 / Amino acid change(s) / CNE / 0..1 / HGVS (p. syntax) (ditto yellow highlight above)
Alternative to HGVS: if HGVS is restricted to changes within the RefSeq, we may not need the following two
69547-8 / Reference Allele / ST / 0..1 / Comment: current LOINC name is Reference nucleotide
X0029 / Allele loc / NM
69551-0 / Alternate allele / ST / 0..1 / Comment: current LOINC name is Variable nucleotide
Other attributes of the variation
53034-5 / Allelic State / CNE / 0..1 / LOINC list (e.g. Heteroplasmic, Homoplasmic, Homozygous, Heterozygous, Hemizygous). (Need to deal with source somehow (mother, father) when known.)
53037-8 / Clinical significance / CNE / 0..1 / Benign, Likely benign, Uncertain significance, Likely pathogenic, Pathogenic
X0007 / Cytogenetic location / CNE / 0..1 / If we can get a comprehensive list of such locations, we could create a coding system. If not comprehensive, might still make a coding system in a CWE.
X0020 / Associated disease / CWE / 0..1* / From MedGen, using UMLS codes (for now)
Different codes for the same variation:
X134* / dbSNP RS ID / CNE / 0..1 / URL to be provided
X136* / CIGAR code / CNE / ? / CIGAR syntax. Not sure about its role in clinical reporting.
X137* / COSMIC mutation ID / CWE / 0..1 / COSMIC. Maybe should be CNE.

*LOINC code and name to be defined. Dummy codes and draft names appear in the table.

† Variable that would be pushed for inclusion in all reports
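
Because V2 and FHIR are meant to share the same cardinalities, a receiver could check one allele's observations against the Repeat column of Table 2. The sketch below does this for a subset of rows; the function names are hypothetical, the rows with unsettled cardinality are left out, and the required versus optional status of each variable is still under discussion, so the `spec` shown is only a transcription of the draft table.

```python
from collections import Counter


def parse_cardinality(text):
    """Parse 'low..high' into (low, high); high is None when unbounded ('*')."""
    low, high = text.split("..")
    return int(low), (None if high == "*" else int(high))


def violations(observed_codes, spec):
    """List codes whose occurrence count falls outside the cardinality in spec."""
    counts = Counter(observed_codes)
    problems = []
    for code, cardinality in spec.items():
        low, high = parse_cardinality(cardinality)
        found = counts[code]
        if found < low or (high is not None and found > high):
            problems.append(f"{code}: found {found}, expected {cardinality}")
    return problems


# Repeat column for a few Table 2 rows (draft values).
spec = {
    "48013-7": "1..1",  # RefSeq
    "48018-6": "0..1",  # Gene
    "41103-3": "0..1",  # DNA change(s)
    "48005-3": "0..1",  # Amino acid change(s)
}

ok = violations(["48013-7", "48018-6", "41103-3"], spec)
bad = violations(["48018-6", "48018-6"], spec)
print(ok)
print(bad)
```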

  3. Defining the region of interest (variations that could have been found)

Genetic reports need to include information about the region, or mutations, studied to fully define what they could have found (i.e. to bound the meaning of a study). A negative study that looked at 135 cystic fibrosis mutations has a different meaning from one that looked at 35 mutations. Having this information included in the report also allows laboratories to expand somewhat what they test for under a given test name, which they already tend to do. Two different cases exist. For targeted mutations, the report must list the mutations looked for. For sequencing studies, the report must include the genomic ranges sequenced. We propose LOINC codes for each case.
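
The way the targeted list bounds the meaning of a negative result can be sketched in a few lines. The function name, the three status labels, and the mutation strings are illustrative assumptions, not part of the proposal.

```python
def mutation_status(mutation, targeted, found):
    """Classify a mutation relative to what the study looked for and what it found."""
    if mutation in found:
        return "detected"
    if mutation in targeted:
        return "not detected"  # a true negative: the study looked and did not find it
    return "not assessed"      # the study says nothing about this mutation


# Hypothetical targeted panel and result (placeholder HGVS-like strings).
targeted = {"c.1A>G", "c.2C>T"}
found = {"c.2C>T"}

for mutation in ("c.1A>G", "c.2C>T", "c.3G>A"):
    print(mutation, mutation_status(mutation, targeted, found))
```

Without the targeted list in the message, a receiving system could not tell the second outcome from the third.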

3.1 Targeted mutation analyses: Required observation for defining mutations targeted

These will mostly be probe-based studies, but at least some sequencing studies only examine a fixed set of mutations,[2] and chip-based studies often focus on a subset of the mutations that could be discerned from the chip analysis. The listing of the targeted mutations could use a similar approach to what we propose for reporting the mutations found.

The simplest way to do this would be to record all of the RefSeq, gene, and HGVS mutation as the value of an observation (i.e. by “name” tied to the variant ID in ClinVar). This would be the same as our previous proposal for reporting mutations (which has now been modified). The examples for the mutations found and the mutations targeted all come from a real report. Note that some of them are specified in the amino acid change notation.

3.1.1 Two alternate ways to report targeted mutations (skip gray shaded because not crucial to the discussion).

An alternative (see Figure 4b) to the above approach is to report the RefSeq and gene in one observation and tie it to a set of observations that report mutations as variant ID and HGVS, with the RefSeq and gene implied by the “parent observation.” This will be a bit more compact when hundreds of mutations are sought, but adds complexity.

A third option (a modification of the second) would be to have only two observations per RefSeq and gene and store all of the mutations targeted in the value of the second observation (see Figure 4c).

Committee, please give feedback on which should be preferred.
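
To help compare the three options, the sketch below counts the observations each would generate for a given set of targets. It reflects only our reading of the descriptions above (one observation per fully named mutation; one parent observation per RefSeq and gene plus one per mutation; two observations per RefSeq and gene); the function name and the sample targets are hypothetical.

```python
def observation_counts(targets_by_gene):
    """Count observations needed under each of the three reporting options."""
    n_genes = len(targets_by_gene)
    n_mutations = sum(len(mutations) for mutations in targets_by_gene.values())
    return {
        "first (full name per mutation)": n_mutations,
        "second (parent plus one per mutation)": n_genes + n_mutations,
        "third (two per RefSeq and gene)": 2 * n_genes,
    }


# Placeholder targets: two genes with three and two targeted mutations.
targets = {
    "GENE1": ["c.1A>G", "c.2C>T", "c.3G>A"],
    "GENE2": ["c.10del", "c.20dup"],
}

counts = observation_counts(targets)
for option, count in counts.items():
    print(option, count)
```

The second option uses more observations than the first but shorter values, since the RefSeq and gene are not repeated in each one; the third keeps the count fixed per gene regardless of how many mutations are targeted.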

3.2 Specifying the region of interest in a full mutation analysis as specified ranges of the Reference Sequence

For full mutation analyses, the region of interest would require one observation to identify the reference sequence and gene (chromosome) and another observation, repeated, to identify the exact ranges of that RefSeq that were sequenced.

Table 3 shows proposed LOINC codes for representing the region(s) of interest for full (sequenced) mutation analyses. In the example shown in Figure 5a, we dedicate one observation per range, but think this information could also be sent as one observation with the ranges separated by repeat delimiters (as shown in Figure 5b). (The difference between Figure 5a and Figure 5b is a bit like the difference between 4b and 4c.) Note that the ranges in the example are concocted. We realize that most copy number studies would reference a genomic (NG-) or chromosome reference sequence (NC-) and will build more appropriate examples in a future draft.

Table 3. Variables that might be associated with the reporting of the region of interest for full mutation (sequencing) analyses

LOINC code / Name / Data type / Repeat / Answer list/ coding system
X002 / Genomic reference sequence † / CNE / 1..1
X12312* / Range Sequenced † / SN / 1..1 / Numeric range
X12313* / Portions of genome sequenced narrative / TX / 0..1

*LOINC code and name to be defined. Dummy codes and draft names appear in the table.

† Variable that would be pushed for inclusion in all reports


However, few if any laboratories ever report the degree of detail called for by the second row in Table 3 in their clinical reports. Most give only a narrative description such as “all coding regions and appropriate flanking regions for the genes studied.” Because of this, we propose a LOINC code and observation (the last row in Table 3) for carrying that narrative.
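
For laboratories that do send explicit ranges, the two encodings discussed above (one observation per range, or one observation with repeats) might look like the sketch below. It assumes the HL7 V2 SN form comparator^number^separator^number with the comparator left empty, the default `~` repetition separator, and a simple counter in OBX-4; the function names and the ranges are made up, and X12312 is the dummy code from Table 3.

```python
REPEAT = "~"  # default HL7 V2 repetition separator


def sn_range(start, end):
    """SN value for a range: empty comparator, first number, '-' separator, second number."""
    return f"^{start}^-^{end}"


def one_obx_per_range(ranges):
    """One OBX segment per sequenced range, distinguished by a counter in OBX-4."""
    return [
        f"OBX|{i}|SN|X12312^Range Sequenced^LN|{i}|{sn_range(start, end)}"
        for i, (start, end) in enumerate(ranges, start=1)
    ]


def one_obx_with_repeats(ranges):
    """A single OBX segment whose OBX-5 repeats once per sequenced range."""
    value = REPEAT.join(sn_range(start, end) for start, end in ranges)
    return f"OBX|1|SN|X12312^Range Sequenced^LN||{value}"


ranges = [(100, 200), (300, 450)]  # concocted ranges
per_range = one_obx_per_range(ranges)
repeated = one_obx_with_repeats(ranges)
print("\n".join(per_range))
print(repeated)
```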

  4. Observations for dealing with structural (copy number) variations

These large variations are more difficult to define than the small ones, because most studies report only approximate positions for the beginning and end of structural (copy number) variations. NCBI uses 7 numeric fields to describe submissions about structural variations, and assigns an accession number to each submission, but does not assign reference numbers because of the uncertainty of the position of the edges. So, structural variants defined with identical values for many of the 7 fields cannot be equated. This is just to emphasize that these structural variations occupy a messier world than the simple variations.
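
The reason such variants cannot be equated can be sketched with a small data structure. The seven field names below are our assumption, based on the outer, inner, and exact boundaries plus length described in the Background; they are not taken from a dbVar schema, and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructuralVariant:
    """Seven optional numeric fields; most submissions fill in only some of them."""
    outer_start: Optional[int] = None
    inner_start: Optional[int] = None
    start: Optional[int] = None       # exact start, when known
    stop: Optional[int] = None        # exact stop, when known
    inner_stop: Optional[int] = None
    outer_stop: Optional[int] = None
    length: Optional[int] = None

    def has_exact_boundaries(self):
        return self.start is not None and self.stop is not None


def provably_same(a, b):
    """True only when both variants have exact boundaries and those boundaries match."""
    return (
        a.has_exact_boundaries()
        and b.has_exact_boundaries()
        and (a.start, a.stop) == (b.start, b.stop)
    )


# Two submissions with identical approximate windows but no exact edges.
first = StructuralVariant(outer_start=1000, inner_start=1500, inner_stop=9000, outer_stop=9800)
second = StructuralVariant(outer_start=1000, inner_start=1500, inner_stop=9000, outer_stop=9800)

print(first == second)               # field values agree
print(provably_same(first, second))  # yet the true edges may differ
```

The two records compare equal field by field, but since the real edges lie somewhere inside the windows, nothing establishes that they describe the same variation.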

The following example illustrates some of the options we have for describing structural variants (the kind that live in dbVar) in HL7. We will show examples in V2, and are in the process of building FHIR models that will be very parallel.