Programmer’s Manual for the InfoVis CyberInfraStructure Database Search System

By

Mohd Saiful Nizam Bahari

Stephanie Gato

Todd Holloway

Qian Wang

1. Introduction 4

2. Oracle Database 4

2.1 Schema Design Rationale 4

2.2 Complete Schema 6

2.3 Query Optimization 19

2.4 Loading Data in the DB 22

2.4.1 SQL*Loader Scripts 22

3 Web Application 23

3.1 ER Diagram 23

3.2 Table View Diagram 25

3.3 Module Architecture 27

3.4 Java Docs 28

3.4.1 Class SearchSessionManager 28

3.4.2 Class ProcessLogin 43

3.4.3 Class downloadHistory 49

3.4.4 Class downloadStats 52

3.4.5 Class downloadUsers 55

3.4.6 Class Update 59

3.4.7 Class AdminConfirmPending 65

3.4.8 Class AdminGetHistory 69

3.4.9 Class AdminGetProfile 71

3.4.10 Class connectDB 73

3.4.11 Class getParameters 75

3.4.12 Class ConnectManager 78

3.4.13 Class DownloadManager 79

3.4.14 Class HistoryManager 83

3.4.15 Class registerPage 86

3.4.16 Class UserManager 94

3.4.17 Class dlError 99

3.4.18 Class Error 102

3.4.19 Class regError 104

3.4.20 Class searchError 107

3.4.21 Class Format 110

3.4.22 Class History 113

3.4.23 Class Search 114

3.4.24 Class Dictionary 136

4 State of the IUIV Database System 139

5 Adding New Collections 141

6 Adding New Matrix Algorithms 142

Appendix A: SQL*Loader Scripts 143

Script for COS 143

Script for NIH 148

Script for NSF 151

Script for Medline 178

Glossary 280

Index 280

1. Introduction

Contained in this manual are those materials we feel will help a programmer quickly get up to speed on the state of the system. The design decisions made for how to set up the database are extremely important to understand.

2. Oracle Database

2.1 Schema Design Rationale

The tables exist for two reasons: (1) to provide a container for several collections of documents and (2) to support a web-based tool for searching. As such, each table generally either models a user level entity (history, query, user, etc) or a document level entity keyword, author, document, nsf, etc). Because the system does not only exist to support the web-interface, but also models aspects of the documents searchable only via the command-line, we choose to slightly break the relational model. Each document is divided across several tables, two of which are logically very similar. What we mean by this is that many portions of a document (title, text, abstract) that are searchable via the web tool are placed into a collective document_table. The portions of the document that are not searchable via the web and are usually specific to the collection from which the document came are placed into a collection-specific document table nsf_table, medline_table, etc). This separation should make it less painful to add new document collections, with unique fields, to the database and without invalidating the web tool.

The tablespace design involves a fair amount of guesswork, usually overestimation. The estimates are based on the current number of documents involved for the five current collections:

Medline: 21081852
NIH: 1043843
NSF: 163291
Citeseer ???
COS data: 39858


We have 11 locally managed tablespaces:
NIH_Tablespace 2000M
NSF_Tablespace 1000M
Citeseer_Tablespace 1000M (?)
COS_Tablespace 500M
Medline_Tablespace 1000M
Document_Table_Tablespace 20000M
User_Base_Tablespace 100M
Document_Base_Tablespace 1000M
User_Index_Tablespace 10M
Document_Index_Tablespace 100M
Abstract_Index_Tablespace 100M


The first five support the collection specific information, whose size is quite understandable. The document_table has its own tablespace, as it will be much larger than all other tables as it contains the text and abstract fields for all documents having these items. The remaining tables will be in one of two tablespaces--one whose size is governed by the size of the user base, the other whose size is governed by the total number of documents. All indices will likewise be grouped into one of two categories, those whose size is governed by the size of the user base, and those whose size is governed by the total number of documents. The only exception is our Oracle Text index, abstract, which has its own tablespace. The only other design decisions of note are that the collection specific tables have mostly fields that are large varchar2's in order that SQL*Loader may easily be used with highly variable data. We have indices for anything we can foresee being involved in where clauses. Lastly, our only large objects are text and abstract, both clobs.

2.2 Complete Schema

Tables, single column indices, and triggers are listed below.

/////IUIV//////////////////////////////////////////////////////////////

/////////////////////// Tables and Indexes ///////////////////////////

/////////////////////////////////////////////////////////////////////

create table user_table (

username varchar2(255) constraint user_table_username_nn not null

constraint user_table_username_pk primary key,

type varchar2(255) default 'student',

priv varchar2(255) default 'unregistered',

first_name varchar2(255) constraint i_u_user_table_firstName_nn not null,

last_name varchar2(255) constraint i_u_user_table_lastName_nn not null,

not_used_since timestamp constraint user_table_not_used_since_nn not null,

date_joined timestamp constraint user_table_date_joined_nn not null,

email varchar2(255) constraint user_table_email_nn not null,

address varchar2(255) null,

zipcode varchar2(255) null,

city varchar2(255) null,

state varchar2(4) null,

country varchar2(255) null,

homepage varchar2(255) null,

expertise varchar2(255) null,

planned_use varchar2(255) null,

agreement_1 varchar2(2) default 'f',

agreement_2 varchar2(2) default 'f',

organization varchar2(255) null,

position varchar2(255) null,

password varchar2(255) null,

status varchar2(255) default 'unconfirmed',

non_iu varchar2(2) default 't')

tablespace user_base_tablespace;

create unique index user_password_index

on user_table(password);

tablespace user_index_tablespace;

create table keyword_table (

keyword_id number constraint keyword_table_keyword_pk primary key,

name varchar2(255) constraint keyword_table_name_un unique);

tablespace document_base_tablespace;

create table belongs_to(

institution_id integer,

author_id integer,

note varchar2(100),

constraint belongs_to_author_fk foreign key (author_id) references author_table (author_id),

constraint belongs_to_institution_fk foreign key (institution_id) references institution_table (institution_id));

tablespace document_base_tablespace;

create table contains_keyword (

keyword_id number,

doc_id number,

comment varchar2(100),

constraint contains_keyword_keyword_id_fk foreign key (keyword_id) references keyword_table(keyword_id),

constraint contains_keyword_doc_id_fk foreign key (doc_id) references document_table(doc_id));

tablespace document_base_tablespace;

create table query_table (

query_id number constraint query_table_query_id_nn not null

constraint query_table_query_id_pk primary key,

author varchar2(255) null,

title varchar2(255) null,

source varchar2(255) null,

abstract varchar2(255) null,

keyword varchar2(255) null,

min_year varchar2(255) null,

max_year varchar(255) null,

label varchar2(1000) null,

collection varchar2(255) null,

date_entered varchar2(16) not null,

time_entered varchar2(16) not null)

tablespace user_base_tablespace;

create index query_timestamp_index

on search_result_table(time_stamp);

tablespace user_index_tablespace;

create index query_label_index

on search_result_table(label);

tablespace user_index_tablespace;

create index query_date_entered_index

on search_result_table(date_entered);

tablespace user_index_tablespace;

create index query_time_entered_index

on search_result_table(time_entered);

tablespace user_index_tablespace;

create table history_of (

username varchar2(255) constraint history_of_username_nn not null,

doc_id number constraint history_of_result_id_nn not null,

query_id number constraint history_of_query_id_nn not null,

constraint history_of_u_fk foreign key (username) references user_table (username),

constraint history_of_q_fk foreign key (query_id) references query_table (query_id),

constraint history_of_r_fk foreign key (doc_id) references document_table (doc_id),

constraint history_of_pk primary key(username, doc_id, query_id));

tablespace user_base_tablespace;

create table document_table(

doc_id number constraint document_table_doc_id_nn not null

constraint document_table_doc_id_pk primary key,

date_published date null,

title varchar2(2550) null,

journal_name varchar2(2550) null,

text clob null,

type varchar2(100) default 'unknown',

date_entered timestamp constraint document_table_date_entered_nn not null,

collection varchar2(100) default 'misc',

main_url varchar2(1000) null,

abstract clob null);

tablespace document_table_tablespace;

create index document_date_published_index

on document_table(date_published);

tablespace document_index_tablespace;

create index document_title_index

on document_table(title);

tablespace document_index_tablespace;

create index document_collection_index

on document_table(collection);

tablespace document_index_tablespace;

create index document_dtc_index

on document_table(doc_id, title, collection);

tablespace document_index_tablespace;

create index document_dc_index

on document_table(doc_id, collection);

tablespace document_index_tablespace;

create index document_dca_index

on document_table(doc_id, collection);

tablespace document_index_tablespace;

create index document_dtc_index

on document_table(doc_id, title, collection);

tablespace document_index_tablespace;

create index document_main_url_index

on document_table(main_url);

tablespace document_index_tablespace;

create index document_main_abstract_index

on document_table(abstract);

tablespace document_index_tablespace;

create table citeseer_table(

citeseer_id number constraint citeseer_table_id_pk primary key,

source_id number,

start_date date,

end_date date,

last_date date,

amount number)

tablespace citeseer_tablespace;

create index citeseer_source_index

on citeseer_table(source_id)

tablespace document_index_tablespace;

create index citeseer_start_date_index

on citeseer_table(start_date)

tablespace document_index_tablespace;

create index citeseer_end_date_index

on citeseer_table(end_date)

tablespace document_index_tablespace;

create index citeseer_last_date_index

on citeseer_table(last_date)

tablespace document_index_tablespace;

create index citeseer_amount_index

on citeseer_table(amount)

tablespace document_index_tablespace;

create table nsf_grant_table(

nsf_id number constraint nsf_grant_table_nsf_id primary key,

type varchar2(4000),

nsf_org varchar2(4000),

latest_amendment_date varchar2(4000),

filename varchar2(4000),

phone_number varchar2(4000),

award_number varchar2(4000),

instr varchar2(4000),

program_manager varchar2(4000),

start_date varchar2(4000),

expires varchar2(4000),

expected_total_amount varchar2(4000),

investigator varchar2(4000),

sponsor varchar2(4000),

nsf_program varchar2(4000),

fld_application varchar2(4000),

program_ref varchar2(4000),

doc_id number,

constraint nsf_doc_id_fk foreign key (doc_id) references document_table(doc_id));

tablespace nsf_tablespace;

create index nsf_file_index

on nsf_grant_table(filename);

tablespace document_index_tablespace;

create index nsf_instr_index

on nsf_grant_table(instr);

tablespace document_index_tablespace;

create index nsf_program_manager_index

on nsf_grant_table(program_manager);

tablespace document_index_tablespace;

create index nsf_investigator_index

on nsf_grant_table(investigator);

tablespace document_index_tablespace;

create index nsf_sponser_index

on nsf_grant_table(sponsor);

tablespace document_index_tablespace;

create index nsf_fld_application_index

on nsf_grant_table(fld_application);

tablespace document_index_tablespace;

create index nsf_nsf_program_index

on nsf_grant_table(nsf_program);

tablespace document_index_tablespace;

create table cos_table(

cos_id number constraint cos_table_cos_id primary key,

accn varchar2(255),

sponsor_list varchar2(1000),

sponsor_type varchar2(1000),

deadline varchar2(1000),

deadline_note varchar2(4000),

upper_amount varchar2(1000),

eligibility varchar2(4000),

citizenship varchar2(1000),

activity_location varchar2(1000),

requirements varchar2(1000),

contact_name varchar2(1000),

contact_address varchar2(1000),

contact_city varchar2(1000),

contact_state varchar2(1000),

contact_zipcode varchar2(255),

contact_country varchar2(1000),

contact_phone varchar2(255),

contact_fax varchar2(255),

contact_email varchar2(1000),

last_updated varchar2(1000),

bookmark_url varchar2(1000),

funding_type varchar2(1000),

date_published varchar2(1000),

doc_id number,

constraint cos_doc_id_fk foreign key (doc_id) references document_table(doc_id));

tablespace cos_tablespace;

create index cos_accn_index

on cos_table(accn);

tablespace document_index_tablespace;

create index cos_sponsor_list_index

on cos_table(sponsor_list);

tablespace document_index_tablespace;

create index cos_deadline_note_index

on cos_table(deadline_note);

tablespace document_index_tablespace;

create index cos_eligibility_index

on cos_table(eligibility);

tablespace document_index_tablespace;

create index cos_citizenship_index

on cos_table(citizenship);

tablespace document_index_tablespace;

create index cos_activity_location_index

on cos_table(activity_location);

tablespace document_index_tablespace;

create index cos_requirements_index

on cos_table(requirements);

tablespace document_index_tablespace;

create index cos_contact_name_index

on cos_table(contact_name);

tablespace document_index_tablespace;

create index cos_contact_address_index

on cos_table(contact_address);

tablespace document_index_tablespace;

create index cos_contact_city_index

on cos_table(contact_city);

tablespace document_index_tablespace;

create index cos_contact_state_index

on cos_table(contact_state);

tablespace document_index_tablespace;

create index cos_contact_country_index

on cos_table(contact_country);

tablespace document_index_tablespace;

create index cos_contact_email_index

on cos_table(contact_email);

tablespace document_index_tablespace;

create index cos_las_updated_index

on cos_table(last_updated);

tablespace document_index_tablespace;

create index cos_bookmark_url_index

on cos_table(bookmark_url);

tablespace document_index_tablespace;

create index cos_funding_type_index

on cos_table(funding_type);

tablespace document_index_tablespace;

create table medline_table(

medline_id number constraint medline_table_medline_id primary key,

original_medline_id varchar2(2500),

pmid varchar2(2550),

date_created varchar2(2550),

date_completed varchar2(2500),

issn varchar2(2550),

volume varchar2(2550),

issue varchar2(2550),

pub_date varchar2(2000),

affiliation varchar2(2000),

full_date_published varchar2(2000),

pgn varchar2(2000),

publication_type varchar2(2000),

language varchar2(2500),

country varchar2(2550),

ta varchar2(1000),

nlm_unique_id varchar2(2500),

chemical_list varchar2(4000),

citation_subset varchar2(4000),

mesh_heading varchar2(2000),

doc_id number,

constraint medline_doc_id_fk foreign key (doc_id) references document_table(doc_id));

tablespace medline_tablespace;

create index medline_language_index

on medline_table(language);

tablespace document_index_tablespace;

create index medline_country_index

on medline_table(country);

tablespace document_index_tablespace;

create index medline_ta_index

on medline_table(ta);

tablespace document_index_tablespace;

create index medline_nim_unique_id_index

on medline_table(nlm_unique_id);

tablespace document_index_tablespace;

create index medline_chemical_list_index

on medline_table(chemical_list);

tablespace document_index_tablespace;

create index medline_citation_subst_index

on medline_table(citation_subset);

tablespace document_index_tablespace;

create index medline_mesh_heading_index

on medline_table(mesh_heading);

tablespace document_index_tablespace;

create table nih_table(

nih_id integer constraint nih_table_nih_id primary key,

grant_number varchar2(4000),

pi_title varchar2(4000),

thesaurus_terms varchar2(4000),

fiscal_year varchar2(4000),

department varchar2(4000),

project_start varchar2(4000),

project_end varchar(4000),

icd varchar2(4000),

irg varchar2(4000),

doc_id integer,

constraint nih_doc_id_fk foreign key (doc_id) references document_table(doc_id));

tablespace nih_tablespace;

create index grant_number_index

on nih_table(grant_number);

tablespace document_index_tablespace;

create index nih_pi_title_index

on nih_table(pi_title);

tablespace document_index_tablespace;

create index nih_thesaurus_terms_index

on nih_table(thesaurus_terms);

tablespace document_index_tablespace;

create index nih_department_index

on nih_table(department);

tablespace document_index_tablespace;

create index nih_icd_index

on nih_table(icd);

tablespace document_index_tablespace;

create index nih_irg_index

on nih_table(irg);

tablespace document_index_tablespace;

create table author_table(

author_id number constraint author_table_author_id_nn not null

constraint author_table_author_id_pk primary key,

last_name varchar2(1000) constraint author_table_last_name_nn not null,

middle_name varchar2(1000) null,

first_name varchar2(1000) null,

pi_status varchar2(1000) null,

email_address varchar2(1000) null);

tablespace document_base_tablespace;

create index author_last_name_index

on author_table(last_name);

tablespace document_index_tablespace;

create index author_middle_name_index

on author_table(middle_name);

tablespace document_index_tablespace;

create index author_first_name_index

on author_table(first_name);

tablespace document_index_tablespace;

create index author_fml_name_index

on author_table(first_name, middle_name, last_name);

tablespace document_index_tablespace;

create index author_ifml_name_index

on author_table(author_id, first_name, middle_name, last_name);

tablespace document_index_tablespace;

create table institution_table(

institution_id number constraint institution_table_nn not null

constraint institution_table_pk primary key,

name varchar2(2000) constraint institution_table_name_nn not null,

address varchar2(2000) null,

city varchar2(2000) null,

state varchar2(500) null,

zipcode varchar2(500) null,

country varchar2(500) null,

department varchar2(1000) null);

tablespace document_base_tablespace;

create table dataset_statistics(

dataset_statistics_id number constraint dataset_statistics_pk primary key,

dataset_name varchar2(400),

dataset_size number,

permissions varchar2(400),

num_docs number,

last_updated_date date);

create index institution_name_index

on institution_table(name);

tablespace document_index_tablespace;

create table created_by (

doc_id number,

author_id number,

note varchar2(100),

constraint created_by_authorid_fk foreign key (author_id) references author_table (author_id),

constraint created_by_document_id_fk foreign key (doc_id) references document_table (doc_id));

tablespace document_index_tablespace;

create index created_by_author_id_index

on created_by(author_id);

tablespace document_index_tablespace;

create index created_by_doc_id_index

on created_by(doc_id);

tablespace document_index_tablespace;

create index created_by_da_id_index

on created_by(doc_id, author_id);

tablespace document_index_tablespace;

create table awards (

award_id number constraint awards_award_id_nn not null

constraint awards_award_id_pk primary key,

type varchar2(50) default 'unknown',

institution_id number,

doc_id number,

note varchar2(100),

foreign key (institution_id) references institution_table (institution_id),

foreign key (doc_id) references document_table (doc_id));

tablespace document_base_tablespace;

//////////////////////////////////////////////////////////////////////

/////////////////////// Oracle Text //////////////////////////////////

//////////////////////////////////////////////////////////////////////

create index document_abstract_index on document_table(abstract)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore')

tablespace abstract_index_tablespace;

create index document_text_index on document_table(text)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore')

tablespace text_index_tablespace;

execute CTX_DDL.SYNC_INDEX('document_abstract_index');

execute CTX_DDL.SYNC_INDEX('document_text_index');

ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');

ctx_ddl.set_attribute('mywordlist','INDEX_PREFIX','YES');

ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH',3);

ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', 4);

ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX', 'YES');

//////////////////////////////////////////////////////////////////////

/////////////////////// Triggers /////////////////////////////////////

//////////////////////////////////////////////////////////////////////

-- Triggers for the document_table.

create or replace sequence doc_id_seq

start with 1

increment by 1

nomaxvalue;

create or replace trigger doc_id_trigger

before insert on document_table

for each row

begin

select doc_id_seq.nextval into :new.doc_id from dual;

end;

/

create or replace trigger date_entered_trigger

before insert on document_table

for each row

begin

:new.date_entered := SYSTIMESTAMP;

-- insert into date_entered values (SYSTIMESTAMP);

end;

/

create or replace trigger query_id_date_trigger

before insert or update on query_table

for each row

begin

select to_char(SYSDATE, 'DD-MON-YYYY') into :new.date_entered from dual;

select to_char(SYSDATE, 'HH:MI.SS') into :new.time_entered from dual;

end;

/

create sequence nsf_id_seq

start with 1

increment by 1

nomaxvalue;

create trigger nsf_id_trigger

before insert on nsf_grant_table

for each row

begin

select nsf_id_seq.nextval into :new.nsf_id from dual;

end;

/

create sequence nih_id_seq

start with 1

increment by 1

nomaxvalue;

create trigger nih_id_trigger

before insert on nih_table

for each row

begin

select nih_id_seq.nextval into :new.nih_id from dual;

end;

/

create sequence cos_id_seq

start with 1

increment by 1

nomaxvalue;

create trigger cos_id_trigger

before insert on cos_table

for each row

begin

select cos_id_seq.nextval into :new.cos_id from dual;

end;

/

create sequence medline_id_seq

start with 1

increment by 1

nomaxvalue;

create trigger medline_id_trigger

before insert on medline_table

for each row

begin

select medline_id_seq.nextval into :new.medline_id from dual;

end;

/

-- Triggers for the keyword_table.

create or replace sequence keyword_id_seq

start with 1

increment by 1

nomaxvalue;

create or replace trigger keyword_id_trigger

before insert on keyword_table

for each row

begin

select keyword_id_seq.nextval into :new.keyword_id from dual;

end;

/

-- Triggers for the query_table.