IVC Database Manual 2.1

7/05/05

InfoVis Lab

School of Library and Information Science

Contributors:

Katy Börner+, Weimao Ke+, Ketan Mane+, Elijah Wright+ & Gavin LaRowe

Indiana University

Bloomington, IN 47405, USA

Email:

www: http://iv.slis.indiana.edu

Table of Contents

Introduction

Brief History

Target Users & Information Needs

Hardware

- UITS Research System

- Production System

- Mirror System

- Development System

Software

- UITS Research System

- Production System

- Mirror System

- Development System

Data Sets

- Source

- Description

- Composition

- Statistics

- Acquisition

- Updates

- Errata

- System Information

- UITS Research System

- Production System

- Development System

- Notes

Database

- Users

- Access

- Rights

- System Information

- Structure

- Database Schema

- Interfaces

- On-line Interface

- Users

- Login

- New User Registration

- Search

- Download Data

- Download Matrix

- On-line Interface Administration

- Modifying the Web Interface

- Backend Interface

- PostgreSQL Information

- Manual

- Books

Future Work

Acknowledgements

Appendices

- System Notes

- Sample Queries

- PostgreSQL Information

- Names & Contact Information

References

Index

Introduction

Today, humanity's scholarly knowledge is stored in an ever-growing number of papers, books, emails, and other formats. No human or machine can process this enormous amount of data, and hence much of this knowledge is reinvented, duplicated across the sciences, or simply lost after a short period of time. To survive as a species, however, we will need to preserve our planet or find other means to sustain life as we know it. Beyond mere survival, we should aim to enable all human beings to live healthy, productive, and fulfilling lives. Meeting these challenges requires the design of effective cyberinfrastructures that provide access to humanity's collective knowledge and support the 'global brain' that is emerging on this planet [1, 3].

The IVC database of scholarly work aims to serve researchers and practitioners interested in the large-scale analysis, modeling, and visualization of scholarly data sets. A particular focus of this database is to support the generation of knowledge domain visualizations (KDVs) [2, 6] based on large amounts of publication, patent, and grant data. KDVs use sophisticated data analysis and visualization techniques to objectively identify major research areas, experts, institutions, grants, papers, journals, etc., in a domain of interest. They can be used to gain an overview of a knowledge domain: its homogeneity, import-export factors, and relative speed; to track the emergence and evolution of topics; or to help identify the most productive as well as the most novel research areas.

A sample map of the 'Melanoma' research area is given in Figure 1. It shows 53,804 unique Medline papers, 299 unique genes retrieved from the Entrez Gene database, and 367 unique proteins from the Universal Protein Resource (UniProt), all related to melanoma. Cosine similarity based on the co-occurrence of MeSH terms was used to lay out the papers and their associated genes and proteins spatially (see [4]). Figure 1 shows the main research areas covered by melanoma research over the last 40 years. Major research areas were labeled by hand after exploration of the map using VxInsight [5]. The research areas can be grouped into applied medical sciences (left side) and basic molecular sciences (right side). Interestingly, papers in the applied-science portions of the map are less numerous than their molecular-science counterparts.

Fig. 1. Snapshot of an interactive map showing melanoma-related papers, genes, and proteins.
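For reference, cosine similarity treats each paper as a vector of MeSH term weights and measures the angle between two such vectors. In the formula below, w_{t,i} denotes the weight of MeSH term t for document i; the exact weighting scheme used in [4] may differ from this simple form:

    sim(d_i, d_j) = \frac{\sum_t w_{t,i} \, w_{t,j}}{\sqrt{\sum_t w_{t,i}^2} \sqrt{\sum_t w_{t,j}^2}}

Papers sharing many MeSH terms thus receive a similarity near 1 and are placed close together in the layout.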

Benefits of KDVs include reducing visual search time, revealing hidden relations, displaying data sets from several perspectives simultaneously, facilitating hypothesis formulation, serving as an effective means of communication, and prompting users to think about document data in new ways. This knowledge is of interest not only to funding agencies but also to companies, researchers, and society at large.

The creation of high-quality KDVs requires access to high-quality scholarly data sets. High quality here refers to comprehensive coverage of the scientific discipline(s) of interest and high accuracy of the data elements, e.g., no misspellings, omissions, or missing or extraneous citation links. Many KDV studies require access to multiple data sets, either across domains, e.g., to determine the citation interaction between education and psychology, or across data types, e.g., to identify the influence of grant funding on paper and citation output. Unfortunately, today's scholarly data sets are kept in data silos that are hardly interlinked [8]. For example, NSF provides access to NSF data, US patent data is distributed by the United States Patent and Trademark Office (USPTO), and Medline data is available via the National Library of Medicine. Yet many patents and grants cite papers, and a single person may simultaneously be an author, grant awardee, and patent holder.

In short, the IVC database is a warehouse of scholarly works that interlinks and integrates diverse databases. One major focus of this database is the identification of the data integration and provenance issues that need to be addressed when supporting access to diverse homogeneous and heterogeneous publication, patent, and grant data sets.

Brief History

Based on a (name & type of grant here) awarded to Katy Börner in (date of grant here), the IVC database was initiated in the summer of 2004. With the above mission in mind, various data sets from the NSF, NIH, Medline, PNAS, et al. were collected both manually and automatically over the Internet. Once the data sets were collected, a small team of developers and database administrators set about verifying and organizing the data and developing an Oracle version of this database. Over the course of development during the summer and fall of 2004, many issues and problems with the Oracle system became readily apparent, in addition to issues of provenance, data sharing, and licensing, as well as the need for a new database administrator. A new database administrator was brought on board in the spring of 2005, at which point the decision was made to move from the proprietary Oracle-based platform to an Open Source alternative, namely PostgreSQL.

Development on the new database commenced in February 2005. A development version of the database is now available for the majority of the data sets. Mirroring and backup services will come on-line in July. Work may also commence during the fall of 2005 on conversion of the on-line interface for the new PostgreSQL database. The production system has a projected release date of early August 2005.

Target Users & Information Needs

This database will serve researchers and practitioners interested in the large-scale analysis, modeling, and visualization of scholarly databases. A particular focus is supporting the generation of knowledge domain visualizations. Correspondingly, the database provides access to complete local copies of major data sets (dependent upon sharing agreements, etc.). Databases that are freely redistributable and/or Open Source are preferred. For more information regarding the database, please see: http://iv.slis.indiana.edu/db.

Hardware

The database is currently composed of four systems, each serving a different purpose: (1) the UITS Research system; (2) the Production system; (3) the Mirror system; and (4) the Development system.

UITS Research System

We hope that the UITS Research system will serve as the public interface to the database in the future. Currently, we are in discussions with UITS regarding both hardware setup and personnel. There is currently no infrastructure at UITS to support a PostgreSQL implementation of the database. Once those details have been worked out, information about the UITS hardware setup for the database will be posted here. If support for PostgreSQL is not forthcoming, the UITS Research system will most likely be used as a storage repository and backup for some of the data sets.

Production System

The Production system currently serves as the public interface for the database. It is a Sun v480 with four 1.2GHz UltraSPARC IIIcu CPUs, 32GB of memory, two 73GB internal disks, and 10TB of external Fibre Channel disk in four Sun 3511FC disk trays (roughly forty 250GB SATA disks). Currently, 2.5TB is allocated for physical storage of the database, with another 2.5TB allocated for ancillary storage of raw and scratch data.

Mirror System

The Mirror system will serve as a network of both hot and cold 'mirrors' providing exact replicas of the database at a given point in time. The purpose of the mirrors is to provide redundancy and failover in both a local and a distributed context. Cold mirrors will provide a snapshot of the database from a daily or weekly dump; hot mirrors will track the production database live as it is updated or modified.
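As a sketch of how a cold-mirror snapshot might be taken (the database name, paths, and schedule below are illustrative, not the actual production configuration), a daily compressed dump via cron and its restore on a mirror host would look roughly like:

    # hypothetical crontab entry: compressed dump of the database at 2 a.m. daily
    0 2 * * * /usr/local/pgsql/bin/pg_dump -Fc ivc > /db/backups/ivc-`date +\%Y\%m\%d`.dmp

    # restoring a snapshot on a cold mirror (pg_restore ships with PostgreSQL)
    pg_restore -d ivc /db/backups/ivc-20050715.dmp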

As of 7/15/05, a hot mirror will reside at the School of Library and Information Science. This mirror will be composed of a Sun v490 with four 1.033GHz UltraSPARC IV CPUs, 16GB of memory, two 73GB internal disks, and a 4.8TB X-RAID external disk. Various other geographically distributed cold mirrors will be established soon; details will be posted here in the coming months.

Development System

The Development system is currently housed on the production machine (details above). Once the details regarding UITS support for PostgreSQL are worked out, all production will eventually move to the UITS Research system, at which point the v480 will become the sole development system. If support for PostgreSQL is not forthcoming, the Development system may remain on the machine where it currently resides.

Software

As with the hardware, the software configuration spans the same four systems: (1) the UITS Research system; (2) the Production system; (3) the Mirror system; and (4) the Development system.

Once UITS support for PostgreSQL has been worked out, we will document the operating system and other software details of the UITS Research system here. The Production system runs on Solaris 10, as do the local mirror and the Development system. Details regarding the various distributed mirrors will be posted here as they become available.

Regardless of system type, the database software used is PostgreSQL (http://www.postgresql.org), an Open Source object-relational database management system released under a flexible BSD-style license. It offers an alternative to other open-source database systems (such as MySQL and Firebird) as well as to proprietary systems such as Oracle, Sybase, IBM's DB2, and Microsoft SQL Server. We are currently running version 8.0.

The On-line Interface (http://NWB.uits.indiana.edu:8201/) is a servlet/JSP-based solution running on Tomcat 4.1 and accessible from any Web browser. Various backend clients also exist for interacting with the database, though some users may prefer the client that ships with PostgreSQL, namely psql. Given the proper credentials, which client one uses to access the database is largely a matter of preference.
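For example, a backend psql session might be opened as follows (the host, database, user, and table names here are placeholders rather than actual production values):

    # connect to the database server; psql prompts for a password if one is required
    psql -h nwb.uits.indiana.edu -d ivc -U jdoe

    # once connected, \dt lists tables and ordinary SQL works, e.g.:
    #   SELECT count(*) FROM medline_citation;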

Because the majority of this work has been done in an academic setting where turnover is high, discussions have begun about converting the On-line Interface to a simpler technology such as PHP, Perl, or Python. Such a conversion, if undertaken, may occur sometime during the fall of 2005.

For on-line and backend access and usage information, see the Database section below.

Data Sets

At the core of the IVC database are its numerous data sets. Below, we list all of the currently available data sets along with details regarding each, including: (1) source; (2) purpose; (3) composition; (4) statistics; (5) acquisition details; (6) updates; (7) errata; (8) system information (i.e., where the data lives); and (9) notes regarding interaction with the data sets, development and load scripts, and so on.

General Guidelines

Considerable thought went into the processes by which both system administrators and developers interact with the data sets. The directory structures set up for the Production and Development systems should be used at all times; each directory and file has a specific purpose. Following the protocol below for the directory structure and files should cut down on confusion and provide a lasting foundation for further development, which is particularly important given the constant turnover in this academic environment.

The raw data directory contains all of the raw data files as downloaded, imported, or ftp'd to the server; it holds all of the raw data that exists in the database for a particular data set. Files in the raw data directory should be copied to a corresponding scratch directory and processed there. Never modify anything in a raw data directory.

The scratch directory contains copies of all the files from the raw data directory corresponding to a given data set. This directory should be used for all processing/cleaning/development of scripts for a given data set.
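A typical first step looks something like the following (the paths and file names are illustrative; the actual layout on the server may differ):

    # copy (never move) raw files into the matching scratch area before touching them
    cp -p /db/raw/medline/*.xml.gz /db/scratch/medline/
    cd /db/scratch/medline && gunzip *.xml.gz    # all processing happens here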

The load files directory contains the final versions of the processed raw data that was or will be loaded into the database. Usually, these are tab-delimited (\t) files with a newline (\n) record separator. If you need to reload a particular data set, or are trying to track down a problem with a data set that was loaded, check this directory. Again, the files in this directory should never be modified unless there is good reason to do so (e.g., reloading the data set).
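Loading such a file is straightforward because PostgreSQL's COPY command defaults to exactly this format: tab-delimited fields and newline-terminated records. A sketch, with illustrative database, table, and file names:

    # bulk-load a prepared tab-delimited file (run as a database superuser,
    # since COPY ... FROM reads the file on the server side)
    psql -d ivc -c "COPY medline_citation FROM '/db/load/medline/citation.tab';"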

If XSLT was used to transform raw data files that are in XML, any and all associated files, such as stylesheets and parsers, should go into the xslt directory. If you don't see any files in the scripts directory for a given data set, it is probably the case that XSLT was used to process the raw data files into a tab-delimited format, which was then placed into the load files directory.
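For example, a transformation of this kind could be run with xsltproc, the libxslt command-line processor (the stylesheet and file names below are illustrative):

    # apply a stylesheet that emits tab-delimited rows, writing straight to the load files directory
    xsltproc medline_to_tab.xsl medline05n0001.xml > /db/load/medline/citation.tab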

The scripts directory contains all of the scripts used in processing/preparing the raw data for import into the database. Put all of your scripts in this directory and be sure to document what they do in the NOTES file.

The NOTES file contains all information regarding any processes, scripts, cron jobs, stylesheets, or other artifacts associated with the preparation of the raw data files for import into the database. It serves as a log and comment file for changes to, development of, and loading of any raw data into the database. All documentation regarding development for a particular data set should go into the NOTES file.
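No fixed entry format is mandated, but a dated, signed entry like the following keeps the log useful (the user and file names below are hypothetical):

    # append a dated entry to a data set's NOTES file
    cat >> NOTES <<'EOF'
    2005-07-05 (jdoe): Reprocessed the 2004 raw XML with medline_to_tab.xsl;
    regenerated citation.tab in the load files directory and reloaded the table.
    EOF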