CIS526: Homework 1

Assigned: Sep 07, 2004

Due: in class, Tue, Sep 14, 2004

Homework Policy

  • All assignments are INDIVIDUAL! You may discuss the problems with your colleagues, but you must solve the homework by yourself. Please acknowledge all sources you use in the homework (papers, code or ideas from someone else).

  • Assignments should be submitted in class on the day they are due. No credit is given for assignments submitted later, unless you have a medical problem.

Problems

In this homework, you will answer a number of questions about two data sets. You can use any software or programming language you find suitable, including WordPad, MS Excel, MS Access, MS SQL Server, Matlab, Visual Basic, C, C++, Java, or any combination of them. Matlab is preferred and should be used as extensively as possible.

Problem 1:

Download mtuberculosis_fasta.txt. This text file contains a large number of protein sequences together with basic information about each one. The format is called FASTA (for more information, type ‘fasta format’ in Google). To help Matlab users, a Matlab file, readFASTA.m, is provided for reading FASTA files.
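For those not using Matlab, the following is a minimal sketch of a FASTA reader in Python. It assumes the common FASTA layout: each record starts with a header line beginning with ‘>’, followed by one or more lines of sequence letters. The function name parse_fasta and the toy input are made up for illustration; they are not taken from readFASTA.m or from the homework file.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return a list of (header, sequence) pairs from FASTA-formatted lines.

    Assumption: a header line starts with '>' and every following
    non-empty line, up to the next header, belongs to that sequence.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input, made up for illustration (not from mtuberculosis_fasta.txt).
toy_lines = [
    ">seq1 hypothetical protein",
    "MKTA",
    "SSP",
    ">seq2 made-up example",
    "PPSA",
]
toy_records = parse_fasta(toy_lines)
```

To read the real file, one could pass an open file object instead of the toy list, for example parse_fasta(open("mtuberculosis_fasta.txt")).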

Answer the following questions:

  1. What is the total number of sequences listed in the file?
  2. What is the size of the alphabet used to represent the sequences?
  3. How many sequences contain ‘hypothetical protein’ in their header lines?
  4. What is the fraction of the letter A among all sequences listed in the file?
  5. What fraction of the sequences has length above 150?
  6. How many sequences contain at least one occurrence of the subsequence ‘SSP’?
  7. How many sequences contain both the ‘SSP’ and ‘PPS’ subsequences?
  8. What is the average length of sequences whose emb name (listed in the header of each sequence) starts with ‘CAA’?
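One possible way to organize the computations for the questions above is sketched below in Python, working on a list of (header, sequence) pairs. Several things here are assumptions rather than facts about the file: ‘subsequence’ is read as a contiguous substring, text matching is case-sensitive, and the emb name is assumed to follow an ‘emb’ field in a ‘|’-delimited header. The names emb_accession and summarize, and the two toy records, are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (header, sequence)


def emb_accession(header: str) -> Optional[str]:
    """Return the field after 'emb' in a '|'-delimited header, if any.

    Assumption: headers look like 'gi|...|emb|CAA.....|description'.
    """
    parts = header.split("|")
    for i, part in enumerate(parts[:-1]):
        if part.strip() == "emb":
            return parts[i + 1].strip()
    return None


def summarize(records: List[Record]) -> Dict[str, float]:
    """Compute the eight quantities asked for in Problem 1."""
    if not records:
        raise ValueError("no records to summarize")
    seqs = [seq for _, seq in records]
    total_letters = sum(len(s) for s in seqs)
    caa_lengths = [
        len(seq)
        for header, seq in records
        if (emb_accession(header) or "").startswith("CAA")
    ]
    return {
        "n_sequences": len(records),
        "alphabet_size": len(set("".join(seqs))),
        "n_hypothetical": sum("hypothetical protein" in h for h, _ in records),
        "fraction_A": (
            sum(s.count("A") for s in seqs) / total_letters
            if total_letters else 0.0
        ),
        "fraction_longer_150": sum(len(s) > 150 for s in seqs) / len(seqs),
        "n_with_SSP": sum("SSP" in s for s in seqs),
        "n_with_SSP_and_PPS": sum("SSP" in s and "PPS" in s for s in seqs),
        "mean_length_CAA": (
            sum(caa_lengths) / len(caa_lengths)
            if caa_lengths else float("nan")
        ),
    }


# Toy records, made up for illustration (not from the homework file).
toy_records = [
    ("gi|1|emb|CAA00001.1| hypothetical protein (made-up)", "ASSPPS"),
    ("gi|2|emb|CAB00002.1| toy entry", "AAC"),
]
toy_summary = summarize(toy_records)
```

If the real headers use a different layout, only emb_accession needs to change; the remaining counts depend only on the header text and the sequence strings.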

Problem 2:

Download the movie.txt file. This text file contains summary information about a number of movies, and it can be easily loaded into MS Excel. The following is a description of its content:

Movie (movie.txt) provides descriptive information about each movie:

ID: Number -- primary key

Name: Text

PR_URL: Text -- URL of studio PR site

IMDb_URL: Text -- URL of Internet Movie Database entry

Theater_Status: Text -- either "old" or "current"

Theater_Release: Date/Time

Video_Status: Text -- either "old" or "current"

Video_Release: Date/Time

Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No

IMDb URLs are provided by courtesy of Internet Movie Database.

The theater and video status and release dates were (approximately) correct in the San Francisco Bay Area as of September 15, 1997, when EachMovie was terminated.

Answer the following questions:

  1. How many action movies are listed in the file?
  2. How many action BUT NOT comedy movies are listed?
  3. How many family movies were released on a Wednesday?
  4. Group the movies by the first letter of their title. Calculate the fraction of action movies in each group.
  5. Based on the previous answer, do you think the first letter of a movie title could be used as a predictor of its genre?
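The counting and grouping steps in the questions above could be sketched in Python as follows. This is only an illustration under stated assumptions: the rows are assumed to be already loaded into dictionaries keyed by the field names from the description, the Yes/No fields are assumed to be converted to booleans, and ‘released’ is taken to mean Theater_Release (the question does not say which release date is meant). The function names and the three toy rows are made up; they are not from movie.txt.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List

# Toy rows, made up for illustration (not from movie.txt).
movies: List[dict] = [
    {"Name": "Alpha", "Action": True, "Comedy": False, "Family": False,
     "Theater_Release": date(1997, 9, 12)},
    {"Name": "Avalanche", "Action": False, "Comedy": False, "Family": True,
     "Theater_Release": date(1997, 9, 10)},
    {"Name": "Bravo", "Action": True, "Comedy": True, "Family": False,
     "Theater_Release": date(1997, 9, 11)},
]


def count_action(rows: List[dict]) -> int:
    return sum(1 for m in rows if m["Action"])


def count_action_not_comedy(rows: List[dict]) -> int:
    return sum(1 for m in rows if m["Action"] and not m["Comedy"])


def count_family_on_wednesday(rows: List[dict]) -> int:
    # date.weekday() numbers Monday as 0, so Wednesday is 2.
    return sum(
        1 for m in rows
        if m["Family"] and m["Theater_Release"].weekday() == 2
    )


def action_fraction_by_first_letter(rows: List[dict]) -> Dict[str, float]:
    """Map each first letter (upper-cased) to the fraction of action movies."""
    totals: Dict[str, int] = defaultdict(int)
    actions: Dict[str, int] = defaultdict(int)
    for m in rows:
        letter = m["Name"].strip()[:1].upper()
        totals[letter] += 1
        if m["Action"]:
            actions[letter] += 1
    return {letter: actions[letter] / totals[letter] for letter in sorted(totals)}


fractions = action_fraction_by_first_letter(movies)
```

When interpreting the per-letter fractions, it may help to also report the number of movies in each group, since a fraction based on only a few titles says little.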

Deliverables: For both problems, provide the solutions, an explanation of the techniques used to find each answer, and a listing of the code used (if applicable).

Good Luck!!