Knowledge-Based Reading for Biomedical Texts

Scott E. Fahlman, PI
Research Professor, LTI

Ravi Starzl
Assistant Research Professor, LTI

Proposed Work: We propose to continue the work already started in the Scone Research Group of CMU/LTI on Natural-Language Understanding (NLU). By "understanding" we mean going all the way from text to an internal conceptual representation that can support reasoning, question answering, making predictions, spotting inconsistencies, and so on. The proposed work directly addresses all of the issues mentioned in the "Reading" theme in the Foundation's call for proposals.

At the core of our proposed system is the Scone knowledge-base system, which has been developed by Fahlman's research group in LTI over the past ten years. Scone offers a number of advantages for reading and understanding, including a clean separation of concepts from names, default reasoning with exceptions, and the ability to model multiple overlapping world-models in the same knowledge base. This multiple world-model capability allows us to model the state of the world at different times, the differing claims made in various articles, "what if" scenarios, and many other things. Scone is fast and scalable, able to give real-time responses even as the knowledge base grows to several million entities and statements.
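To make these capabilities concrete, here is a minimal illustrative sketch in Python. It is not Scone's actual API; the classes, the bird/penguin example, and the hypothetical article are invented for the example. It shows only how default properties with exceptions and multiple overlapping world-models ("contexts") over a shared knowledge base can interact.

    class Concept:
        """A KB node with an is-a parent and default properties."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent        # is-a link used for inheritance
            self.defaults = {}          # default properties, overridable below

    class Context:
        """A world-model: local assertions layered over an optional parent context."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.facts = {}             # (concept name, property) -> value

        def lookup(self, concept, prop):
            # Context-local assertions (and those of enclosing contexts) win ...
            ctx = self
            while ctx is not None:
                if (concept.name, prop) in ctx.facts:
                    return ctx.facts[(concept.name, prop)]
                ctx = ctx.parent
            # ... otherwise fall back to defaults inherited up the is-a chain.
            node = concept
            while node is not None:
                if prop in node.defaults:
                    return node.defaults[prop]
                node = node.parent
            return None

    # Shared background knowledge: birds fly by default; penguins are an exception.
    bird = Concept("bird")
    bird.defaults["can-fly"] = True
    penguin = Concept("penguin", parent=bird)
    penguin.defaults["can-fly"] = False

    # Two overlapping world-models over one KB: the real world and the claims
    # made in a particular (hypothetical) article.
    real_world = Context("real-world")
    article = Context("claims-of-article", parent=real_world)
    article.facts[("penguin", "can-fly")] = True

    print(real_world.lookup(penguin, "can-fly"))   # False (default, with exception)
    print(article.lookup(penguin, "can-fly"))      # True (overridden in that context)

Scone implements these mechanisms far more generally and efficiently; the point of the sketch is only to show how context-local assertions and inheritable defaults fit together.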

Tightly integrated with Scone in our reading system is a natural-language understanding system based on a form of construction grammar. A construction is a pattern representing some fragment of meaning, such as (X kicked Y), where X is constrained to be an animal and Y to be a physical object. When faced with text such as "John kicked Mary", these constraints would be tested using Scone's background knowledge, and in this case a match would occur. Attached to the construction is a formula for creating the appropriate knowledge structure in Scone's memory. Once that structure is created, we can sanity-test it to see whether it makes sense in context. The advantage of this approach is that it disambiguates and tests the knowledge as it goes, consulting the KB's background knowledge early and often. In effect, the syntactic and semantic parts of the system engage in a conversation, as opposed to the rigid, mostly one-way processing pipelines common in current NLP systems. This tight integration and cooperation of meaning, background knowledge, and syntactic analysis is innovative and potentially revolutionary for work in NLU. Constructions are also well suited to odd or idiomatic grammatical forms, to incomplete grammars, and to language that is not strictly grammatical.
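The following small Python sketch, again purely illustrative and not our implementation, shows the flavor of a construction: a meaning-bearing pattern whose slots carry semantic type constraints that are checked against background knowledge before any knowledge structure is built. The toy lexicon, the is-a hierarchy, and the single "kicked" construction are invented for the example.

    # Toy background knowledge: word -> semantic type, plus an is-a hierarchy.
    ISA = {"person": "animal", "animal": "physical-object", "ball": "physical-object"}
    LEXICON = {"John": "person", "Mary": "person", "Fido": "animal",
               "honesty": "abstraction"}

    def is_a(type_, ancestor):
        """Walk the is-a chain to test a semantic type constraint."""
        while type_ is not None:
            if type_ == ancestor:
                return True
            type_ = ISA.get(type_)
        return False

    def match_kicked(tokens):
        """Construction (X kicked Y): X must be an animal, Y a physical object.
        On success, return the knowledge structure to add; otherwise None."""
        if len(tokens) != 3 or tokens[1] != "kicked":
            return None
        x_type = LEXICON.get(tokens[0])
        y_type = LEXICON.get(tokens[2])
        if is_a(x_type, "animal") and is_a(y_type, "physical-object"):
            # The formula attached to the construction: build the KB statement.
            return {"relation": "kick", "agent": tokens[0], "patient": tokens[2]}
        return None

    print(match_kicked("John kicked Mary".split()))     # matches: both constraints hold
    print(match_kicked("John kicked honesty".split()))  # rejected: 'honesty' is not physical

In the real system the constraint checks and the resulting structures live in the Scone KB itself, so each candidate match can be sanity-tested against everything else the system knows.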

Target Domain: The proposed NLU work could be done in any number of domains. Our goal is to produce a portable, open-source tool that can easily be adapted to new domains by adding new vocabulary and background knowledge. For development, we need a specific domain that is challenging yet bounded in scope. We propose to work on reading, understanding, and organizing biomedical texts and research papers, specifically in the area of immunology. (Starzl already has considerable knowledge of this field, and of the available resources, from his dissertation work.)

This is part of a long-term goal to use knowledge-based NLU to help organize, digest, and navigate the overwhelming amount of knowledge created every year in biomedical fields. To highlight just one challenging problem, it is quite common for several papers to appear describing a new protein or gene, with each research team giving it a different name and describing somewhat different aspects of the new entity. The reader (human or software) must then decide whether these papers refer to the same entity or to different entities, or whether that cannot be determined from the literature at hand and the available background knowledge. This domain is a good workbench for developing our reading system, and we believe that any contribution to solving problems like this can be of great benefit to biomedical research and applications.
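As a hypothetical illustration of that three-way decision, the sketch below compares the attributes two papers report for a newly described protein and answers "same", "different", or "undetermined". The attributes and the evidence threshold are invented for the example; a real system would draw on far richer background knowledge.

    def same_entity(desc_a, desc_b):
        """Compare two attribute dictionaries extracted from different papers."""
        shared = set(desc_a) & set(desc_b)
        if not shared:
            return "undetermined"            # no overlapping evidence at all
        if any(desc_a[k] != desc_b[k] for k in shared):
            return "different"               # a directly conflicting attribute
        if len(shared) >= 2:
            return "same"                    # enough agreeing evidence
        return "undetermined"                # agreement, but too little of it

    paper_1 = {"species": "mouse", "approx-kDa": 42, "binds": "IL-2"}
    paper_2 = {"species": "mouse", "binds": "IL-2"}    # different name, same facts
    paper_3 = {"species": "human", "approx-kDa": 55}

    print(same_entity(paper_1, paper_2))   # "same"
    print(same_entity(paper_1, paper_3))   # "different"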

Need: The call for proposals states that the Foundation wants to fund both "exceptional young scientists" and "more established researchers with ambitious, high-risk ideas" who have difficulty funding those ideas through traditional channels. We have one of each: Starzl is just starting his faculty career, and Fahlman is a well-established AI researcher, a full professor at a top AI school, and a Fellow of the AAAI.

It has been very hard to fund this knowledge-based research in the current climate. Statistical learning and big data are dominant at present and absorb most of the available funding; within symbolic knowledge representation and reasoning, proponents of first-order logic and the Semantic Web are dominant. In the Scone group we have laid a good foundation for the work proposed here, but have done so on a shoestring, with minimal outside support. For most of the last five years, Fahlman has had difficulty funding his own salary as a Research Professor, and in this period the Scone group has never been able to support more than one research-funded graduate student at any given time. This grant would therefore make a very large difference in the rate at which we can pursue these ideas.
