EHCO: Encyclopedia of Hepatocellular Carcinoma Genes Online

Abstract

1. Introduction

Hepatocellular carcinoma (HCC), a primary malignant tumor of the liver, is one of the most frequent malignant neoplasms. Its incidence varies greatly with geographical location, sex, and ethnic background, and it is especially prevalent among Asian populations. Chronic infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), ingestion of food contaminated with chemical carcinogens, and consumption of alcoholic beverages are major risk factors. Because HCC is a leading cause of cancer death worldwide, research on its causes, diagnosis, and treatment continues into the post-genomic era.

The emergence of genomic technologies has resulted in a rapid increase in the number of potential targets for HCC diagnosis and treatment. Microarrays enable HCC research to succeed where traditional methods have faltered. Because they can measure large numbers of mRNA expression profiles simultaneously, microarrays are used to discover potential markers of HCC development, predict disease recurrence, and identify specific genes related to HCC.

Yet the research environment is far from perfect. Information on HCC-related gene sets is scattered: it resides in separate laboratories, websites, databases, and publications; the same gene may appear under different IDs, names, or aliases; and a gene may be presented with only part of its complete annotation. To obtain the whole picture of a gene, one must visit many sources to collect its genomic location, sequence, homologs, pathways, protein-protein interactions, related diseases, microarray studies, and so on. There have been several attempts to integrate cross-site information, but the results lack rich annotation and are therefore insufficient for supporting HCC research. GeneWebEx, a software package that mines web-based biomolecular databanks and queries its collected data, lacks the friendliness of a web-based tool. GeneAround, a GO-based gene annotation databank, presents information only in text format. GENA, which extracts information from articles by natural language processing (NLP) and supports automatic gene extraction and lookup of gene full names, symbols, and synonyms, has little annotation related to HCC.

All this calls for the establishment of an infrastructure that can collect the scattered annotations, present them in a user-friendly way, and allow viewers to participate actively in its making.

In this study, we join the paradigm shift toward providing web-based services and publishing organized information via the Semantic Web. Softbots, or software agents, are implemented to collect scattered gene annotations, either by mining data sources directly or by querying publicly accessible databases. The focus is on designing an information-harvesting infrastructure with a flexible storage and presentation system that can develop into a content management environment supporting both human-human and human-computer interaction. The result is EHCO, an integrated biological information portal for efficient information sharing and extensive aggregation of research-related topics. EHCO demonstrates how HCC-related research can be gathered and shared among collaborators.

In the following sections, we will describe the methods used to integrate different types of data into EHCO.

2. Materials and Methods

2.1 Collecting HCC-related gene sets

The foundation of an HCC-related information database is, of course, the set of genes that have been reported to be related to HCC. To offer a comprehensive web service, EHCO aims to provide structured information on the relationship between these genes and HCC, as reported in the literature, when users query a candidate gene. Because the amount of biomedical literature available on the web is increasing rapidly, manual information extraction alone is not always feasible. Different collection methods are therefore used for different data sources, as described in Sections 2.1.1 to 2.1.3.

2.1.1 Import HCC-related gene sets from published large-scale studies

HCC-related gene sets were incorporated manually from the Stanford Microarray Database (SMD HCC-1648) [2] and other related publications [3, 4]. The gene sets resulting from these studies are invaluable to HCC researchers because they are experimentally validated.

2.1.2 Text-mining for HCC-related genes in the literature

The text-mining method consists of the following steps:

(1) Acquire HCC-related literature from PubMed using “hepatocellular carcinoma” as the keyword.

(2) Obtain the latest approved human gene nomenclature from the HUGO Gene Nomenclature Committee.

(3) Scan the title and abstract of each HCC-related article for HUGO-approved gene names, symbols, and aliases. We wrote a Perl program for this purpose, which yielded a list of genes that are possibly related to HCC.

(4) Have the candidate gene list verified by experts in biotechnology.
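
The name-matching step above can be illustrated with a minimal sketch. The study used a Perl program that is not reproduced here; the Python version below is only an approximation under stated assumptions. The three-gene dictionary GENE_TERMS and the function names build_patterns and find_candidate_genes are hypothetical, and a real run would load the full nomenclature table instead.

```python
import re

# Hypothetical excerpt of a nomenclature table: approved symbol -> names and aliases.
# A real run would load the complete approved nomenclature instead.
GENE_TERMS = {
    "AFP": ["AFP", "alpha-fetoprotein"],
    "ALB": ["ALB", "albumin"],
    "TP53": ["TP53", "p53", "tumor protein p53"],
}


def build_patterns(gene_terms):
    """Compile one case-insensitive pattern per approved symbol."""
    patterns = {}
    for symbol, terms in gene_terms.items():
        # Longest terms first so that full names win over short aliases.
        alternatives = sorted((re.escape(t) for t in terms), key=len, reverse=True)
        # The lookarounds stop a short symbol from matching inside a longer token.
        patterns[symbol] = re.compile(
            r"(?<![A-Za-z0-9])(?:" + "|".join(alternatives) + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
    return patterns


def find_candidate_genes(title, abstract, patterns):
    """Return the sorted approved symbols mentioned in the title or abstract."""
    text = f"{title} {abstract}"
    return sorted(symbol for symbol, pattern in patterns.items() if pattern.search(text))


if __name__ == "__main__":
    patterns = build_patterns(GENE_TERMS)
    print(find_candidate_genes(
        "Serum alpha-fetoprotein in hepatocellular carcinoma",
        "Albumin mRNA was also measured.",
        patterns,
    ))
```

Plain dictionary matching of this kind tends to over-generate, because short symbols and aliases can coincide with ordinary words or abbreviations. That is one reason the candidate list still needs expert verification.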

2.1.3 Gene collection by reading published literature

HCC-related genes were also identified by manually reading the published literature.

2.2 Handling the annotation

Once related gene sets have been collected into EHCO, the annotation handler takes over. First, softbots, or intelligent software agents, harvest gene annotations from various web resources. Next, weblinks are established, and gene-disease relationships are identified through natural language processing. Protein-protein interaction networks are then predicted. Finally, a presentation engine integrates all of this information into a single user-friendly page view.

2.2.1 Harvesting annotation through softbots

A softbot is an intelligent software robot that acts on behalf of the user to achieve certain goals. Given a set of resources (online websites, databases, or documents), a softbot extracts the information requested of it. In our study, individual softbots are used to mine different targets.
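
One way to picture the "one softbot per target" arrangement is the sketch below. It is not the EHCO implementation. The Softbot class, the harvest_all function, and the canned demo source are all hypothetical, and the fetch step is injected as a plain function so that the example runs without any network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Softbot:
    """One agent per data source: fetch a raw record, then parse it into fields."""
    source: str
    fetch: Callable[[str], str]             # gene symbol -> raw text from the source
    parse: Callable[[str], Dict[str, str]]  # raw text -> annotation fields

    def harvest(self, gene: str) -> Dict[str, str]:
        fields = self.parse(self.fetch(gene))
        # Prefix every field with its origin so that annotations stay traceable.
        return {f"{self.source}.{key}": value for key, value in fields.items()}


def harvest_all(gene: str, softbots: List[Softbot]) -> Dict[str, str]:
    """Merge the annotations harvested by every softbot for one gene."""
    merged: Dict[str, str] = {}
    for bot in softbots:
        merged.update(bot.harvest(gene))
    return merged


# Stand-in source: a canned record instead of a live website or database.
def fake_fetch(gene: str) -> str:
    return f"symbol={gene};note=example"


def parse_key_value(raw: str) -> Dict[str, str]:
    return dict(item.split("=", 1) for item in raw.split(";") if item)


demo_bot = Softbot(source="demo", fetch=fake_fetch, parse=parse_key_value)

if __name__ == "__main__":
    print(harvest_all("ALB", [demo_bot]))
```

Keeping fetching and parsing separate means a new data source only needs its own pair of functions, while the merging logic stays unchanged.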

?????????????????????????????????????????????????????????????

2.2.2 Establishing weblinks

Hyperlinks to UniGene, SwissProt, OMIM, GeneCards, GO, PubMed, and other important bioinformatics websites were collected into the EHCO database.
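
As an illustration only, weblinks of this kind can be generated from stored identifiers with URL templates. The sketch below covers just two resources, and the URL patterns are present-day ones assumed for the example, not necessarily the links stored in EHCO. LINK_TEMPLATES and make_weblinks are hypothetical names.

```python
from urllib.parse import quote

# Assumed present-day URL patterns; other resources would be added the same way.
LINK_TEMPLATES = {
    "PubMed": "https://pubmed.ncbi.nlm.nih.gov/{id}/",
    "OMIM": "https://www.omim.org/entry/{id}",
}


def make_weblinks(identifiers):
    """Map {resource name: identifier} to {resource name: URL}.

    Resources without a known template are skipped rather than guessed.
    """
    links = {}
    for resource, identifier in identifiers.items():
        template = LINK_TEMPLATES.get(resource)
        if template is None:
            continue
        # Percent-encode the identifier so that it is safe inside a URL path.
        links[resource] = template.format(id=quote(str(identifier), safe=""))
    return links


if __name__ == "__main__":
    print(make_weblinks({"PubMed": 10632334}))
```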

2.2.3 Information retrieval by natural language processing

To elucidate the relationship between the collected genes and HCC, this study applied a natural language processing (NLP) technique to extract information from the literature. The following sample text from PubMed (PMID 10632334) contains relations of interest and illustrates the idea of automatic information extraction:

“…Using semiquantitative reverse transcription - PCR for alpha - fetoprotein (afp) and albumin (alb) mRNAs, we measured the mass of malignant and nontumor hepatocytes in 53 peripheral blood samples collected preoperatively, intraoperatively, and postoperatively from 13 HCC patients …In 100% (23 of 23) of HCC and adenoma patients, alb mRNA levels increased 10 - 10(6)- fold intraoperatively and then markedly declined within 8 weeks after operation …”

As this text shows, a tool that extracts information on a given topic must accomplish two tasks: (1) Named Entity Recognition (NER), which recognizes biomedical named entities (NEs), e.g., afp, alb, mRNA, and HCC; and (2) Named Entity Relation Recognition (NERR), which recognizes relations of interest between NEs, e.g., between HCC and mRNA level.

Most biomedical named entities have no standardized nomenclature; they may appear as long compound terms (e.g., hepatocellular carcinoma) or as short abbreviations (e.g., HCC), and their symbols and spellings may vary. To handle this NER problem, an NE list was defined. The list contained the following NEs: gene, protein, mRNA, serum, hepatitis B virus (HBV), hepatitis C virus (HCV), methylation, liver regeneration, HCC, cirrhosis, fibrosis, and necrosis.
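
A list-based recognizer of this kind can be sketched as follows. The twelve NE categories are taken from the list above. The surface variants attached to each category, and the names NE_VARIANTS and recognize_entities, are illustrative assumptions rather than the lexicon actually used in the study.

```python
import re

# Categories come from the NE list in the text; the variants are assumed examples.
NE_VARIANTS = {
    "gene": ["gene", "genes"],
    "protein": ["protein", "proteins"],
    "mRNA": ["mRNA", "mRNAs"],
    "serum": ["serum", "sera"],
    "HBV": ["HBV", "hepatitis B virus"],
    "HCV": ["HCV", "hepatitis C virus"],
    "methylation": ["methylation", "methylated"],
    "liver regeneration": ["liver regeneration"],
    "HCC": ["HCC", "hepatocellular carcinoma"],
    "cirrhosis": ["cirrhosis", "cirrhotic"],
    "fibrosis": ["fibrosis", "fibrotic"],
    "necrosis": ["necrosis", "necrotic"],
}


def recognize_entities(text):
    """Return {NE category: [matched surface forms]} for one piece of text."""
    found = {}
    for category, variants in NE_VARIANTS.items():
        # Longest variant first, so "mRNAs" is preferred over "mRNA".
        ordered = sorted(variants, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b",
            re.IGNORECASE,
        )
        hits = [match.group(0) for match in pattern.finditer(text)]
        if hits:
            found[category] = hits
    return found


if __name__ == "__main__":
    sample = ("Using RT-PCR for alpha - fetoprotein (afp) and albumin (alb) "
              "mRNAs in 13 HCC patients")
    print(recognize_entities(sample))
```

The word boundaries keep "protein" from matching inside "fetoprotein", which shows why simple substring search is not sufficient even for a short NE list.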

Once the NER problem was tackled, we proceeded to investigate the relations between NEs. Relations are usually expressed in various verbal forms, including active voice, passive voice, nominalizations, and gerunds; some are expressed as adjectives or adverbs. The complex sentence structure of published articles complicates the task further. This study used a natural language parser and template-based methods to address this problem.
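
The template-based part of this approach can be illustrated with a deliberately small sketch. It uses regular-expression templates only and no parser, so it is a simplification of the method described above. The trigger words, the window sizes, and the names TEMPLATES and extract_relations are assumptions made for the example.

```python
import re

# Entities whose expression level is of interest, and assumed trigger-word stems.
TARGETS = r"\b(?P<target>genes?|proteins?|mRNAs?|serum)\b"
UP = r"(?:increas|elevat|upregulat|overexpress)\w*"
DOWN = r"(?:decreas|declin|reduc|downregulat)\w*"


def _templates(direction):
    return [
        # Active or passive voice: "mRNA levels increased", "serum AFP was reduced".
        re.compile(TARGETS + r"(?:\s+\w+){0,3}?\s+\b" + direction, re.IGNORECASE),
        # Nominalization: "an increase in albumin mRNA", "reduction of serum ...".
        re.compile(r"\b" + direction + r"\s+(?:in|of)\s+(?:\w+\s+){0,2}?" + TARGETS,
                   re.IGNORECASE),
    ]


TEMPLATES = {"+": _templates(UP), "-": _templates(DOWN)}


def extract_relations(sentence):
    """Return a set of (entity as written, '+' or '-') pairs found in one sentence."""
    relations = set()
    for sign, patterns in TEMPLATES.items():
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                relations.add((match.group("target"), sign))
    return relations


if __name__ == "__main__":
    print(extract_relations("alb mRNA levels increased intraoperatively"))
    print(extract_relations("An increase in albumin mRNA was observed."))
```

In the longer sample sentence, the verb "declined" is too far from "mRNA" to fall within the template window, so only the increase is detected. Long-range cases like this are where a parser is needed in addition to templates.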

Our gene-HCC knowledge base system consisted of the following steps (see Fig. 1):

(1) Document retrieval/filtering (DR/DF): Documents addressing gene-HCC relations were automatically retrieved from PubMed by searching for keyword combinations of the form (gene symbol or aliases) and (hepatocellular carcinoma). The documents were then downloaded.

(2) Biological information extraction (BioIE): The NER and NERR systems described above were applied in this step. The NER system detects whether any of the NEs in the NE list appear in the paper abstracts; if so, the NE is marked as present in the result table. The NERR system then detects whether increased or decreased expression was reported for genes, proteins, mRNAs, and serum.
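
The two steps above can be sketched as a query builder and a result-row builder. This is an assumption-laden illustration, not the system shown in Fig. 1. The exact query syntax sent to PubMed is not given in the text, the cell values "present", "+", and "-" are chosen for the example, and build_pubmed_query and build_result_row are hypothetical names.

```python
NE_COLUMNS = [
    "gene", "protein", "mRNA", "serum", "HBV", "HCV",
    "methylation", "liver regeneration", "HCC", "cirrhosis", "fibrosis", "necrosis",
]
# Only these entities can carry an increased/decreased expression mark.
EXPRESSION_COLUMNS = {"gene", "protein", "mRNA", "serum"}


def build_pubmed_query(symbol, aliases=()):
    """Step (1): combine a gene symbol and its aliases with the HCC keyword."""
    terms = [symbol, *aliases]
    gene_clause = " OR ".join(f'"{term}"' for term in terms)
    return f'({gene_clause}) AND ("hepatocellular carcinoma")'


def build_result_row(pmid, entities_found, relations):
    """Step (2): one result-table row per abstract.

    entities_found: set of NE names detected by NER.
    relations: dict mapping an NE name to '+' (increased) or '-' (decreased).
    Assumption: an expression mark, when available, replaces the presence mark.
    """
    row = {"PubMed ID": pmid}
    for column in NE_COLUMNS:
        if column in EXPRESSION_COLUMNS and column in relations:
            row[column] = relations[column]
        elif column in entities_found:
            row[column] = "present"
        else:
            row[column] = ""
    return row


if __name__ == "__main__":
    print(build_pubmed_query("ALB", ["albumin"]))
    print(build_result_row("00000000", {"mRNA", "HCC"}, {"mRNA": "+"}))
```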

Table 1 demonstrates the result table using the sample text from PubMed.

2.2.4 Identifying protein-protein interactions

??????????????????????????????????????????????

2.3 Storing and Presenting the annotation

2.3.1 Annotation Engine

Collected annotations are enrolled for storage over HTTP. The annotation engine parses each enrolled annotation string, but the annotation is not published on the EHCO website until a content manager commits it. Once committed, it goes into the storage service and is processed by the presentation engine.
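
The enroll-then-commit workflow can be pictured with the following sketch. The format of the annotation strings is not specified in the text, so form-encoded key=value pairs are assumed here. The class and method names are hypothetical, and the two lists stand in for the real pending queue and storage service.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import parse_qsl


class Status(Enum):
    ENROLLED = "enrolled"    # parsed, waiting for a content manager
    COMMITTED = "committed"  # approved and handed to the storage service


@dataclass
class Annotation:
    fields: dict
    status: Status = Status.ENROLLED


class AnnotationEngine:
    def __init__(self):
        self.pending = []  # enrolled but not yet visible on the website
        self.storage = []  # committed, available to the presentation engine

    def enroll(self, annotation_string):
        """Parse an assumed form-encoded string such as 'gene=ALB&note=example'."""
        annotation = Annotation(fields=dict(parse_qsl(annotation_string)))
        self.pending.append(annotation)
        return annotation

    def commit(self, annotation):
        """Called on behalf of a content manager to approve one annotation."""
        self.pending.remove(annotation)
        annotation.status = Status.COMMITTED
        self.storage.append(annotation)

    def published(self):
        """Only committed annotations are ever returned for display."""
        return [annotation.fields for annotation in self.storage]


if __name__ == "__main__":
    engine = AnnotationEngine()
    item = engine.enroll("gene=ALB&note=example")
    print(engine.published())
    engine.commit(item)
    print(engine.published())
```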

2.3.2 Presentation Engine

The presentation engine decides what annotations are shown, where, and how they are organized. A template named GeneInfo customizes the appearance of the webpages. Each annotation entity is assigned category, class, and rank properties so that the manager can easily adjust the annotation content. The presentation engine also adopts a wiki mechanism to support a more advanced commenting system: with the wiki, website users can freely create and edit annotation pages through any web browser.
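
How the category, class, and rank properties might drive page layout is sketched below. The text does not define these properties precisely, so the interpretation here is an assumption: category selects the page section, class describes the kind of content, and rank orders entities within a section. AnnotationEntity and layout are hypothetical names.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class AnnotationEntity:
    label: str
    category: str  # assumed meaning: page section the entity belongs to
    cls: str       # assumed meaning: kind of content (text, link, table, ...)
    rank: int      # assumed meaning: lower rank is shown first in its section


def layout(entities):
    """Group entities by category and order each group by rank."""
    ordered = sorted(entities, key=lambda e: (e.category, e.rank))
    return {
        category: [entity.label for entity in group]
        for category, group in groupby(ordered, key=lambda e: e.category)
    }


if __name__ == "__main__":
    page = layout([
        AnnotationEntity("PubMed links", "Literature", "link", 2),
        AnnotationEntity("Gene symbol", "Summary", "text", 1),
        AnnotationEntity("Gene-HCC relations", "Literature", "table", 1),
    ])
    print(page)
```

Under this reading, a manager can move an annotation to another section or reorder it by editing a single property, without touching the template.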

3. Results and Discussion

3.1 The architecture of EHCO

What distinguishes EHCO is that registered users can share information through it. Beyond traditional browsing activities (reading webpages, downloading software, keyword searching), users are encouraged to contribute their own work to EHCO: they can send comments, submit papers, or edit webpages. In short, EHCO is an online community for HCC researchers around the world.

The architecture of EHCO, depicted in Fig. 3, is a content management system with workflow support. Registered users have their own personalized workspaces. By changing the access-rights policy of an object, users can decide whether to publish their work openly. Revision and commenting are encouraged to enhance content quality.

3.2 HCC-related genes collection

?????????????????????????????????????????(Place the intersection diagram here and state how many genes were collected)

3.3 Protein-protein interaction results

?????????????????????????????????????????(Pick one gene as an example)

3.4 Pathway results

????????????????????????????????????????????????

3.5 Gene-HCC relation results

Table 2 shows the number of HCC-related genes extracted from the PubMed literature. Table 3 shows the number of occurrences of the defined NEs in this literature.

Fig. 1. Gene-HCC relation knowledge base system

Table 1: Information Extraction Results of Gene alb and PubMed ID 10632334

PubMed ID / Gene / Protein / mRNA / Serum / HBV / HCV
10632334 /  / -
Methylation / Liver Regeneration / HCC / Cirrhosis / Fibrosis / Necrosis
 / 

: appear +: increased expression -: decreased expression

Table 2: Corpus

Number of genes / Number of papers / Number of sentences
1017 / 10072 / 102968

Table 3: Information extraction results (measured in the number of occurrences)

gene / protein / mRNA / serum / HBV / HCV
14229 / 446 / 684 / 3836 / 1365 / 1031
methylation / liver regeneration / HCC / cirrhosis / fibrosis / necrosis
377 / 146 / 14287 / 3439 / 421 / 1112