NLP SHARP Project Plan
SHARP
PROJECT PLAN
Version 2.7
02-09-2011
Project 2: Natural language processing for the clinical narrative
Created by:
Children's Hospital Boston and Harvard Medical School: Guergana Savova
Mayo Clinic: Jay Doughty
MIT: Peter Szolovits
Seattle Group Health: David Carrell
State University of New York at Albany: Ozlem Uzuner
University of California, San Diego: Wendy Chapman
University of Colorado: Martha Palmer, Jim Martin, Wayne Ward
University of Utah/Intermountain Health Care: Peter Haug
Revision History
Date / Version / Description / Author
7/19/2010 / 1.0 / Charter started / Jay Doughty
Guergana Savova
7/26/2010 / 1.1 / Content added for tasks 1.4.1, 1.4.2, 1.4.3, 1.4.4 and 1.4.6.
SUNY pre-existing materials added.
Jeff Ferraro will send content for task 1.4.5 by EOD 7/27/2010 / Jay Doughty
Guergana Savova
7/30/2010 / 1.2 / Content added for tasks 1.4.5, 1.4.10, 1.4.8
Henk Harkema added to task 1.4.1 / Guergana Savova
8/1/2010 / 1.2 / Content added for task 1.4.12, 1.4.7
Task 1.3.4 added / Guergana Savova
8/2/2010 / 1.2 / Removed task 1.4.9 (Structured data)
New task added 1.4.9 (Development of deidentification tool) / Guergana Savova
8/3/2010 / 1.3 / Feedback from Martha Palmer added / Guergana Savova
8/9/2010 / 1.3 / Content added for task 1.4.11, 1.4.14, 1.4.15, 1.4.16 / Guergana Savova
8/12/2010 / 1.4 / Content added for tasks 1.5.1, 1.5.2, 1.6.2, 1.5.5, 1.6.5
Methods tasks for NER and relation discovery broken down into three separate tasks
Updated content from David Carrell for 1.4.1 / Guergana Savova
8/16/2010 / 1.4 / Content added for tasks under specific aim 1 and 2 / Guergana Savova
8/17/2010 / 1.5 / Content added for tasks 1.4.17, 1.4.18 / Guergana Savova
8/23/2010 / 1.5 / Content added for tasks 1.5.3, 1.5.4, 1.5.6, 1.5.7
8/24/2010 / 1.6 / Content added for tasks 1.5.8, 1.6.1,
8/30/10 / 1.7 / Content added for tasks 1.6.5, 1.6.7
Prototype release added to 1.5.5 and 1.5.8 (populating OrderMedAmb CEM)
New task added: 1.4.19 (Release schedule) / Guergana Savova
8/31/2010 / 1.8 / Added Sunghwan Sohn to participate in task 1.5.1
9/13/2010 / 1.9 / Added Peter Haug’s subtasks to task 1.4.3 / Jay Doughty
9/16/2010 / 2.0 / Added Calvin Beebe to tasks 1.3.4, 1.4.11, 1.4.13, 1.4.15 and 1.4.16 / Jay Doughty
9/17/2010 / 2.1 / Noted that tasks 1.4.1 and 1.4.3 are now combined. Changed dates for tasks 1.4.4 and 1.4.8 / Jay Doughty
9/17/2010 / 2.1 / Changed task 1.3.4 to reflect that it is being owned by the infrastructure team / Jay Doughty
9/24/2010 / 2.2 / Changed description of task 1.4.1 / Jay Doughty
9/29/2010 / 2.3 / Changed task owner of the combined 1.4.1/1.4.3 from Peter Haug to David Carrell / Jay Doughty
10/5/2010 / 2.4 / Various / Guergana Savova
10/13/2010 / 2.5 / Added endorsements and changed name of 1.4.12 task / Jay Doughty
10/14/2010 / 2.5 / Changed name of task 1.4.9 / Jay Doughty
2/9/2011 / 2.7 / Added Stephen Wu to tasks 1.5.8 and 1.6.8 / Jay Doughty
1Project Objectives
The overarching goal of this project is to develop enabling technologies for high-throughput phenotype extraction from clinical free text. We will also explore combining clinical data extracted from medical reports with the already-structured data in our data repositories to support outcomes research and the development of knowledge useful in clinical decision support. Our focus is NLP and Information Extraction (IE), defined as the transformation of unstructured free text into structured representations. We propose to research and implement modular solutions for the discovery of key components to be used in a wide variety of use cases: comparative effectiveness (outcomes) research, clinical research, translational research, and the science of healthcare delivery. Specifically, our efforts focus on methodologies for discovering clinical events and the semantic relations between them. The discovered entities and relations will then populate templated data structures informed by conventions and standards in the biomedical informatics and general standards communities.
- Specific Aim 1: Clinical event discovery from the clinical narrative consisting of (1) defining a set of clinical events and a set of attributes to be discovered, (2) identifying standards to serve as templates for attribute/value pairs, (3) creating a "gold standard" through the development of annotation schema, guidelines, and annotation flow, and evaluating the quality of the gold standard, (4) identifying relevant controlled vocabularies and ontologies for broad clinical event coverage, (5) developing and evaluating a methodology for clinical event discovery and template population, (6) extending Mayo Clinic's clinical Text Analysis and Knowledge Extraction System (cTAKES) information model, and implementing best-practice solutions for clinical event discovery.
- Specific Aim 2: Relation discovery among the clinical events discovered in Aim 1 consisting of (1) defining a set of relevant relations, (2) identifying standards-based information models for templated normalization, (3) creating a gold standard through the development of an annotation schema, guidelines, and annotation flow, and evaluating the quality of the gold standard, (4) developing and evaluating methods for relation discovery and template population, (5) implementing high-throughput scalable phenotype extraction solutions as annotators in cTAKES and UIMA-AS, either within an institution’s local network or as a cloud-based deployment integrated with the institution’s virtual private network.
SUMMARY: Advancing semantic language processing of the clinical narrative.
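To make the target of the two aims concrete, the following is a minimal sketch of what "structured representations" of clinical events (Aim 1) and relations among them (Aim 2) could look like. It is an illustration only: the class names, field names, semantic type labels, the "treats" relation, and the sample sentence are all hypothetical and are not taken from cTAKES, the CEMs, or any standard named in this plan.

```python
from dataclasses import dataclass, field


@dataclass
class ClinicalEvent:
    """One discovered clinical event with attribute/value pairs (hypothetical schema)."""
    event_id: str
    text: str            # the span as it appears in the note
    begin: int           # character offset where the span starts
    end: int             # character offset just past the span
    semantic_type: str   # hypothetical label, e.g. "Drug" or "SignSymptom"
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A semantic relation between two discovered events (hypothetical schema)."""
    relation_type: str
    source_id: str
    target_id: str


def find_span(note: str, phrase: str) -> tuple:
    """Return (begin, end) character offsets of the first occurrence of phrase."""
    begin = note.index(phrase)
    return begin, begin + len(phrase)


# Invented example sentence; the annotations below are entered by hand,
# not produced by any NLP system.
note = "Patient denies chest pain. Started aspirin for headache."

events = [
    ClinicalEvent("e1", "chest pain", *find_span(note, "chest pain"),
                  "SignSymptom", {"negated": True}),
    ClinicalEvent("e2", "aspirin", *find_span(note, "aspirin"), "Drug"),
    ClinicalEvent("e3", "headache", *find_span(note, "headache"),
                  "SignSymptom", {"negated": False}),
]
relations = [Relation("treats", "e2", "e3")]

for event in events:
    print(event.event_id, event.semantic_type,
          note[event.begin:event.end], event.attributes)
```

Keeping character offsets alongside each event ties every structured value back to its source text, which is what allows annotations to be compared against a gold standard.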
1.1Project Scope
1.1.1In Scope (see specific tasks for schedule)
1.1.2Out of Scope
- Temporal relations
- Inference, such as inferred relations, inferred attributes, and inference from NLP output
1.2Pre-existing materials
This section lists all pre-existing materials that the sites will make available as open source for the duration of the project.
1.2.1Children's Hospital Boston and Harvard Medical School
1.2.2Mayo
- cTAKES, smoking status classifier, rule-based relation extractor
- annotated corpora
- NEs: regression set (drugs, diseases/disorders, signs/symptoms, procedures), colon cancer notes (histologies, anatomical sites), Neuroradiology notes (tumor progression relevant terms)
- Regression set (160 clinical notes, different from the notes from the linguistic corpus): annotated for NEs (drugs, diseases, procedures, signs/symptoms) and coreference/anaphoric relations (identity, set/subset, part/whole) for semantic types People, Anatomical site, Disease/Syndrome, Sign/Symptom, Procedure, Lab or Test Result, Indicator, Reagent, or Diagnostic Aid, Organ or Tissue Function.
- Coreference: in addition to the regression set notes, we have annotated 100 Pitt notes for coreference/anaphoric relations (identity, set/subset, part/whole) for semantic types People, Anatomical site, Disease/Syndrome, Sign/Symptom, Procedure, Lab or Test Result, Indicator, Reagent, or Diagnostic Aid, Organ or Tissue Function.
- Colon cancer pathology notes: 302 notes, NEs (anatomical sites, histology), higher level templates, coreference. See Coden et al., 2009 for a full description
- Neuroradiology notes: about 800 notes, NEs, timelines, tumor status
- Linguistic corpus: 160 notes, approx. 100K tokens, POS tags, chunks, sentence boundaries. Of note, this corpus is different from the regression set.
- Colon cancer clinical notes/oncology notes: 150 notes, NEs, templates, timelines. Annotations followed the schema described in Coden et al., 2009
1.2.3MIT
1.2.4Seattle Group Health (SGH)
1.2.5State University of New York at Albany (SUNY Albany). Pre-existing material lists joint material between SUNY/MIT/i2b2
- Code:
- Goldstein and Uzuner, ICD-9 coder based on the Cincinnati Computational Medicine Center (CMC) challenge. Covers all diseases and ICD-9-CM codes in the CMC data.
- Goldstein and Uzuner, Kappa calculator for n annotators.
- Sibanda, Uzuner, Szolovits, Automatic de-identifier (not easy to share).
- He, Uzuner, Automatic re-identifier. Offsets dates within a record, keeps co-reference of names of patients, doctors, hospitals, protects formatting of names, e.g., first name initial with a last name is kept as a first name initial and last name.
- Sibanda, Uzuner, Szolovits, CaRE system for concepts, assertions, and relations (not easy to share). Concepts include diseases, symptoms, treatments, tests, results, practitioners, addictive substances. Assertions include negation and uncertainty. Relations relate concepts to each other. 22 relations total.
- He, Uzuner, Szolovits, Co-reference resolution (not easy to share) for noun phrases. Excludes pronoun resolution. Includes diseases, symptoms, tests, treatments, practitioners.
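As an illustration of what a kappa calculator for n annotators computes, here is a minimal sketch of Fleiss' kappa. This is an assumption: the plan does not say which multi-annotator kappa variant the Goldstein and Uzuner tool implements, and the function name, input layout, and sample labels below are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Every item must carry the same number of labels, and that number must be >= 2.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two annotators")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of labels")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: for each item, the fraction of annotator pairs that agree.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Only one category was ever used; treat this as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical record-level labels from three annotators for four records.
sample = [
    ["present", "present", "present"],
    ["absent", "absent", "uncertain"],
    ["absent", "absent", "absent"],
    ["present", "uncertain", "present"],
]
print(round(fleiss_kappa(sample), 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement.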
- Corpora:
- De-identification corpus 1: De-identified and re-identified with surrogates, part of MIMIC corpus, 50 discharge summaries. Surrogate PHI include realistic PHI, ambiguous PHI, and out of vocabulary PHI. Details in: Uzuner, Ö., Sibanda, T., Luo, Y., Szolovits, P. (2008) “A De-identifier for Medical Discharge Summaries”. International Journal Artificial Intelligence in Medicine. January 2008; 42(1): 13-35.
- De-identification challenge corpus: Partners Healthcare i2b2 challenge corpus. 889 discharge summaries manually scrubbed and populated with realistic, ambiguous, and out of vocabulary PHI. Details in: Uzuner Ö, Luo Y, Szolovits P. (2007) “Evaluating the State-of-the-Art in Automatic De-identification”. Journal of the American Medical Informatics Association. September 2007; 14(5):550-563.
- Smoking challenge corpus: Partners Healthcare, subset of De-identification challenge corpus, annotated by pulmonologists for smoking status. Includes temporal distinction between current and past smoker. Details in: Uzuner, Ö., Goldstein, I., Luo, Y., Kohane, I. (2008) “Identifying Patient Smoking Status from Medical Discharge Records”. Journal of the American Medical Informatics Association. January 2008; 15(1): 14-24.
- Obesity challenge corpus: Partners Healthcare, 1243 discharge summaries annotated for sixteen diseases, at the record level, marking each disease as present, absent, uncertain, or unmentioned. One record per patient. Details in: Uzuner, Ö. (2009). “Recognizing Obesity and Co-morbidities in Sparse Data”. Journal of the American Medical Informatics Association. July 2009; 16(4): 561-570.
- Medication challenge corpus: Further annotations on a subset of the obesity challenge corpus for medications administered to the patient and their administration details, e.g., route of administration, frequency, duration, reason. Details in:
Uzuner, Ö., Solti, I., Xia, F., Cadag, E. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. Forthcoming.
Uzuner, Ö., Solti, I., Cadag, E. Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. Forthcoming.
- Relation challenge corpus: a mix of Partners Healthcare, UPitt, and MIMIC II records annotated for concepts, assertions, and relations. Approximately 800 records.
1.2.6University of California, San Diego
ONYX – natural language processor for semantic analysis
Topaz modules for cTAKES
- ConText – assigns values to properties of Existence, Historicity, Experiencer, General/Conditional
1.2.7University of Colorado
CLEAR-TK
VerbNet
Jubilee Annotation Tool
Cornerstone Annotation Tool
Both open source projects on Google Code:
1.2.8University of Utah/Intermountain Health Care (IHC)
1.2.9Other
- Informatics for Integrating Biology and the Bedside (i2b2): NLP corpus data annotated for patient smoking status at the document level (no text spans), obesity and its comorbidities, medications and their attributes (instance level annotations linked to text spans), i2b2 abstract data model. The i2b2 organizers agree to make the corpus and model available to us.
- University of Pittsburgh: 100 clinical notes annotated for coreferring clinical named entities of 10 types and anaphoric relations (identity, set/subset, part/whole). Wendy Chapman agrees to make the notes available to us.
- Penn Treebank corpus as distributed by the Linguistic Data Consortium (LDC)
- Brown corpus
- GENIA corpus
- PropBank
- VerbNet
- UMLS
- Bill Long’s program for LOINC code assignment
1.3Organizational tasks
1.3.1Detailed project plan (this document)
Task description: A detailed project plan will be jointly created and agreed upon by the teams. Each task, along with its description, assumptions, dependencies, completion criteria, responsible parties and schedule, will be spelled out. The plan will include a detailed communications plan (see section 7).
Assumptions: none
Dependencies: none
Completion criteria: an agreed-upon document outlining the detailed project plan (this document)
Participants: all NLP co-investigators
Responsible party (ies): Jay Doughty and project lead
Collaborating SHARP 4 projects: all
Schedule: August 31, 2010
1.3.2Administrative prerequisites
Task description: active IRBs at xxxx, Data Use agreement (if necessary).
Data use agreement with i2b2 and Pittsburgh for using their data. This task is at the program level.
Assumptions: any collaborating site potentially contributing with data should apply for IRB as early as possible.
Dependencies: none
Completion criteria: active IRB that allows de-identified data to be shared with investigators. Data use agreement with i2b2 and Pittsburgh
Participants: site PIs and project manager
Responsible party (ies): Jay Doughty, site PIs, Lacey Hart.
Collaborating SHARP 4 projects: all. This task is at the program level.
Schedule: July, 2010 – December, 2010
1.3.3Quarterly reports and annual progress report
Task description: Submission of the quarterly reports due xxxx. Submission of the annual progress report due xxxx
Assumptions: none
Dependencies: none
Completion criteria: submitted reports
Participants: sub-award PIs and project lead
Responsible parties: Jay Doughty
Collaborating SHARP 4 projects: none
Schedule: as scheduled for the reports
1.3.4Coordinate a common data and code repository with the Infrastructure team
Task description: The Infrastructure team will define a SHARP 4 wide code and data repository. That will be communicated to the NLP team.
Assumptions: Definition of the SHARP 4 wide code and data repository by the Infrastructure team.
Dependencies: same as Assumptions
Completion criteria:
Participants: Guergana Savova, Vinod Kaggal
Responsible parties: Jay Doughty, Vinod Kaggal, Infrastructure Team
Collaborating SHARP 4 projects: Infrastructure team. This is a task at the program level.
Schedule: Sept, 2010 – Dec, 2011
1.3.5Deposit open-source pre-existing material into the central repository
Task description: The Infrastructure team will define a SHARP 4 wide code and data repository. Sites will submit their pre-existing material to the project’s code repository.
Assumptions: Definition of the SHARP 4 wide code and data repository by the Infrastructure team.
Dependencies: 1.3.4
Completion criteria: pre-existing code deposited in the repository
Participants: all sites
Responsible parties: site PIs, Vinod Kaggal
Collaborating SHARP 4 projects:
Schedule: Nov, 2010 – Jan, 2011
1.4General technical tasks
This section outlines the tasks that are general and non-specific to the two specific aims.
1.4.1Phenotype extraction use case definitions (combined with task 1.4.3)
Task description: Identification of candidate domain-specific use cases (e.g., peripheral artery disease diagnoses, colonoscopy procedure quality metrics) is a multi-project responsibility being led by Project 3: High Throughput Phenotyping (HTP).
1.4.2Define the users
Task description:
- Define types of potential users of the software, e.g. programmers, assemblers of UIMA components, end users, etc. and the minimum level of requirements for each user.
- This will inform the system architecture and involvement of UIMA engineers, e.g. for GUI development.
Note:
- Consideration of the potential types of users of SHARP NLP technologies raises the question of how and where the technologies are deployed. Alternative deployment modalities may impose very different technical requirements on the local users of the systems.
Assumptions:
- Text corpora remain local to the generating/owning institution and will be processed within the local firewall, though efforts will be made to enlarge the “local firewall” to include “virtual private networks” leveraging cloud-based computing technologies.
- Regardless of deployment modality local users must be able to conduct model re-/training exercises using local clinical text and perform validation studies of algorithm performance.
- Earliest and near-term users of clinical NLP technologies may differ from subsequent and longer-term users (as the algorithms mature and the deployment modalities evolve).
Dependencies:
- Who the user is depends in part on the use case because, for example, the simplest NLP algorithms will involve different types of technologies with different skill requirements compared to more complex algorithms.
Completion criteria (“deliverables”):
- A document defining the types of users and their respective skill levels, including a summary like Table X (below) cross-tabulating users/skill levels with NLP algorithms of varying complexity.
- A list of institutions where likely users are based, with corresponding lists of algorithms they might implement. At a minimum this list will be illustrative; ideally it will be comprehensive.
Participants: David Carrell and 1.4.1 group
Responsible party (ies): David Carrell
Collaborating SHARP 4 projects:
- Project 1: Normalization services (NS), because NS introduces approaches that require particular technical skills [SOMEONE CONFIRM THIS].
- Project 3: High throughput phenotyping (HTP), because HTP requires technologies and approaches to system implementation that require particular technical skills [SOMEONE CONFIRM THIS].
Schedule:
- Defining users (and resolving related upstream questions) should be done as soon as possible because it has important implications for many aspects of SHARP Area 4 work.
- Nov, 2010 – Jan, 2011
Types of Potential Users
Users of SHARP NLP algorithms and their minimum skill levels are listed below. Note that by “algorithm user” we mean those who implement and use the algorithms, associated documentation, validation studies from the original development work, descriptions of re-training and validation procedures for local implementers, etc.
- Research-oriented NLP programmers with broad NLP experience in mature NLP environments, who meet the following minimum requirements:
- Familiar with NLP concepts
- Familiar with NLP tools and resources (e.g., open-source POS taggers, ontologies)
- Java expertise
- Ability to “wrap” software components from other languages to make them available in a Java environment
- Text manipulation languages (e.g., Python, Perl, Ruby)
- Familiar with UIMA/cTAKES
- Novice/applied NLP programmers in research settings with little NLP experience and newly-deployed “out of the box” NLP systems, who meet the following minimum requirements:
- Java expertise
- Text manipulation languages (e.g., Python, Perl, Ruby)
- Familiarity with local clinical text repositories
- Ability to extract, transform, and load (ETL) local text
- Fluency with SQL and relational database usage
- Ability to execute algorithm training and algorithm-specific validation studies
- UIMA experts
- Ability to “performance tune” UIMA and/or algorithms
- Other?
- Researchers from academic, government, or industry settings …
- Ability to evaluate high-level descriptions of algorithms and algorithm validation studies for the purpose of determining the utility/propriety of algorithms for specific research purposes
- Others?
Table X. Skill levels of user types (rows) by NLP algorithm complexity level (columns)
User type / Complexity level 1 / Complexity level 2 / Complexity level 3
User type A / A-1 skills / A-2 skills / A-3 skills
User type B / B-1 skills / B-2 skills / B-3 skills
User type C / C-1 skills / C-2 skills / C-3 skills
1.4.3Clinical Element model (CEM) and application to NLP tasks (combined with task 1.4.1)