Question #2 Entity Representation (Notation): Chemicals, Genes, Proteins, Etc. (Stephanie)

I have put together a possible set of questions and readers, based on input from Alex, Stephanie and myself. I put questions in DEFINITE if at least two of us said “Should” include. We combined #4 and #8. I discarded my question on open databases, which nobody (including me) voted for :). You are welcome to advocate for including or excluding questions. In particular, I think the two main options to consider would be

· Swapping in Q1 for Q2

· Swapping in Q7 for Q9

I’d like to hear from Javed and Diane. I’ll update this when I do. Again, we’re meeting Friday morning at 8:30am here at Manning Hall in room 214. Before then, it would be good for Alex and Diane to mesh their questions #4 and #8 together into one question, and for Alex to flesh out Q3.

Thanks, Brad

DEFINITE

Question #2 Entity representation (notation): chemicals, genes, proteins, etc. (Stephanie)

Readers: Stephanie, Brad

Chemicals, genes, proteins and similar entities can be represented in a number of forms, ranging from in-line text through graphical and hybrid forms to 3-D animations. In part, these forms have arisen because of the variety of uses for chemical representations. Some are well suited for human interpretation or for printing in journal articles; others are useful for computer manipulation such as searching and matching. Some are used for communication with non-specialists (e.g., aspirin); others are designed to hide information, for example, in a patent application.

In natural language, ambiguity adds to our richness and creativity of expression, but can also lead to confusion (intentional or not). We can consult a dictionary or thesaurus to choose a word that is appropriate to the context, from Felis domesticus (for scientific discourse) to kitty (for conversation with a child). These tools thus support some level of translation among terms, but as with much translation between natural languages, information may be lost along the way.

A. What are the issues involved in translation (or conversion) among different forms of representations for chemicals? In your discussion, you could consider affordances for use, information loss or gain, and policy or legal issues, but you don't need to limit yourself to these ideas.

B. Returning to natural language translation, one model of machine translation is the interlingua model. In this theoretical model, there is a language-independent representation of meaning. All languages can be translated into interlingua, and all can be translated from it, with no loss of meaning. <diagram, if it would be useful> Would the interlingua model be useful in chemical representation? Why or why not? What existing representation, in your opinion, comes closest to an interlingua? Describe the interlingual features it has, and those it lacks.

I'd be willing to be first or second reader on this, if it's included.

Question #5 Named Entity Extraction (Brad)

Readers: Brad, Alex

Your proposed work utilizes MeSH-defined terms (chemical names) assigned to articles by expert human Medline indexers. Compare and contrast discovery of chemical names via MeSH terms with algorithmic discovery of chemical names from full text (or abstracts) via NLM’s MetaMap program and Zimmerman’s ProMiner system. When you compare and contrast them, discuss the assumptions (requirements of the methods), cost, performance, scalability (to hundreds of millions of articles), strengths and weaknesses of each of the three methods. Motivate why you believe using MeSH provides competitive advantages, or different types of results, compared with using MetaMap or ProMiner for named entity extraction of chemicals. (I like this because it goes to the core of her argument: whether using human-annotated information or automatic discovery from text is the best solution in the long term.)

Question #9 (Diane)

Readers: Diane, Stephanie

In many cases, scientific literature is more structured than other forms of writing. Are there ways to take advantage of this reduced diversity to improve natural language processing techniques? Are there techniques that have shown themselves to work better in such an environment? (I like this as a discussion topic too, although I think it would have to be fleshed out a fair bit more.)

I think we could work with this. I'd be willing to be second reader on it.

COMBINATION OF #4 and #8

Readers: Javed, Diane

Question #4 (Javed)

What is the significance of token extraction as it relates to detection of entities critical to biomedicine? What are some of the state-of-the-art approaches that have been developed for entity detection? Discuss their computational advantages and disadvantages. Why is it necessary to supplement mining of textual content with other sources of evidence, such as entity associations generated by techniques such as BLAST or microarray analysis? Provide some concrete examples. What are the potential ways an integrated mining approach could be developed to improve upon techniques that rely on a single source of evidence?

My concern with this is knowing when she's answered it well. How would she (or the readers) know what a complete answer would look like? If combined with #8, perhaps it could be refined a bit to better define its scope.

Question #8 (Diane)

Extracting information from databases: (a) As you put data into databases, there are a large number of data mining techniques that can be put to use. Are there any of these techniques that will be helpful in your work? (b) Also, are there fields within the already stored information, such as chemical compound structure, that would in themselves provide useful information? (I think this overlaps with, and could potentially be combined with, Question #4.)

Yes, combining with 4 might work well.

Question #3 (Alex)

Readers: Alex, Javed

Discuss the potential of Swanson's literature-based discovery model (ABC) in drug research as applied to data sources beyond textual ones. Discuss the integration (in the context of both hypothesis generation and validation) of textual and non-textual data sources in drug discovery, including both primary effects and undesired (e.g., toxic) side effects. Do you want to prompt with some specific types of non-text data sources (microarrays, etc.)?

I think this is a good question, perhaps with some refinement.

POSSIBLE

Question #1 Evaluating LBD (Stephanie)

In your literature review, you discuss the difficulty of evaluating and validating the results of literature-based discovery (LBD). The problem is similar to that of evaluating information retrieval systems based on relevance of retrieved items to the initial question. In IR, relevance can be viewed from the user perspective, and thus evaluation must involve real users with real information needs. On the other hand, TREC-style evaluation provides the large collections and defined tasks (and results) that allow for more uniform evaluation and comparison of system performance. In general, this is viewed as a reasonable compromise that has advanced IR technology.

The ultimate validation of an instance of LBD is its confirmation by actual experiment, and even more stringently, a demonstration that it is interesting and useful. Short of that, methods such as partitioning the literature by date, and seeking confirmation in the newer literature of discoveries made from the older literature, have been used.

1. Identify and discuss two limitations and two advantages of current methods of validation of LBD. In your answer, you might consider aspects such as availability of data, generalizability of results, and "power of persuasion", that is, ability to convince skeptics that LBD is a legitimate means of discovery. (My question was for her to detail limitations and advantages of her proposed evaluation method, and of what she thought were the best three other methods described in papers in her lit review, and then contrast the four methods.)

Your version of part 1 would be fine, Brad, but rather than working with 4 methods (including hers), what about 3? I'd rather she work on depth than breadth, especially if the question includes part 2.

2. Many research communities have adopted the TREC model of evaluation: creating a large collection of data and setting specific tasks for research systems.

a) Discuss the viability of the TREC model for LBD. Include in your discussion consideration of the limitations and advantages from part 1.

b) What does establishing such an effort require on the part of LBD researchers?

I'd be willing to be first or second reader on this.

Question #7 (Diane)

You point out in your review that there is some questionable quality in the databases that you are using and the quality of any result is heavily dependent on the quality of its inputs. It is not practical to assume that all data is properly curated, and even trying to select the best databases will not necessarily work over time, as the quality may change. The question therefore is: As you use data from these databases, what techniques are available that will be more tolerant of incorrect data? In statistics, this is referred to as the problem of mislabeled data; in experimental science, it is the question of dealing with outliers. You should be able to assume that the information in the text sources is correct, as those are well reviewed. It is the massively collected databases that are the question. (Good question, but maybe not as closely tied to Nancy’s work, since she is mainly capitalizing on human-indexed MeSH terms assigned to articles as opposed to the massively collected databases.)

But I like this question because the problem of data quality is central to any large collection, especially when working with a variety of mining techniques.

DISCARD

Question #6 Chemical Databases and Open Science (Brad)

Discuss the status of PubChem. How widely is it currently used, and by what communities/groups? For what purposes? Compare the functionality of PubChem with CAS. Be sure to include data, functions, or services that CAS provides that PubChem does not currently (and vice versa). If scientists were to stop using CAS tomorrow, and replace all usage of chemical names in their lab work and paper writing with PubChem names, what effects would this have? Be thorough, and include not just technical and workflow issues, but social, community, sharing, and other effects. (I like this as a discussion topic, but maybe not central to her work.)

I agree. I don't think it's as salient as the others.