Final Project
CS/ECE/ME 539
Professor Hu
UW-Madison
MWF 1:20p
David A. Gerasimow
The Design and Implementation of a Dynamic Data MLP to Predict Motion Picture Revenue
Table of Contents
Introduction: Preface, Past Research, Improvements Over Past Research / 3
Initial Data Collection / 4
Data Collection Improvements, Data Encoding / 5
Pre-analysis of Data, Development of the Dynamic Data Neural Network, Step 1 of the UpdateWizard: Downloading, Step 2 of the UpdateWizard: Updating / 6
Step 3 of the UpdateWizard: Creating Training and Testing Files, Development of the MLP using Dynamic Data / 7
Using the Dynamic Data MLP, Choice 1 of moviesbp.m, Choice 2 of moviesbp.m / 9
Figure 1: DataExtractor Screenshot / 10
Figure 2: DataConcatenator Screenshot / 11
Figure 3: DataConverter Screenshot: Films Removed From Data File / 12
Figure 4: DataConverter Screenshot: Films to be Updated / 13
Figure 5: Results of preanalysis.m / 14
Figure 6: UpdateWizard Screenshot – Step 1, Figure 7: UpdateWizard Screenshot – Step 2 / 17
Figure 8: UpdateWizard Screenshot – Step 3, Figure 9: NewMovie Screenshot / 18
Discussion of Results / 19
Bibliography / 20
VB Source Code / 21
Note to Grader: This report is over twenty pages because I was unsure whether the grader would be able to read and run Visual Basic 6.0 source code, so the source code is included in print.
Introduction
Preface
For the last century, film has been one of the American public’s favorite entertainment media. Large production companies often spend hundreds of millions of dollars to create a single film. However, the amount spent on a film seems to have little bearing on its success. The Blair Witch Project, for instance, was made for under one million dollars, but it earned over twenty-nine million dollars in its first weekend at the box office. On the other hand, Waterworld, starring superstar Kevin Costner, cost roughly one hundred seventy-five million dollars to produce but made back less than half of that amount in domestic box office revenue.
Predicting how much a movie will earn in opening-weekend box office revenue is notoriously difficult. Many aspects of a movie are subjective, and public taste changes quickly and unpredictably. A mathematical model that predicts how much a film will make would allow production companies to maximize profit and skip film development projects that would hurt their profit margins.
Past Research
In the fall 2001 semester of CS/ECE/ME 539, a student attempted to predict the opening weekend box office revenue of a given film using an artificial neural network. He argued that a film’s total gross can be predicted accurately from its opening weekend because the two are proportional to each other: a film with a huge opening weekend is likely to earn a lot of money in the long run. His logic is sound, and I use it again in this project.
The inputs to his network are a film’s characteristics, such as genre, rating, and runtime. Despite his thorough work, his project has deficiencies, and this project aims to improve on it in two ways. First, it produces higher correct classification rates. Second, it allows future users to update the data files easily, so the neural network accumulates more and more training data over time. A major component of this report is the development of what I call a dynamic data neural network, whose training data is updated automatically each week. Instead of the project ending with the semester, future classes will be able to update its results easily, and the network’s correct classification rates should improve over time.
Improvements over Past Research
As an avid film ‘buff,’ I independently came up with the idea of applying neural networks to film box office revenue, but I was disappointed to find that it had been done before. As such, I set out to improve upon the previous results. By adding more features and more feature vectors, I hoped to obtain better results. Moreover, I wrote an UpdateWizard that can automatically and repeatedly update the data files. Every week, the top grossing films are listed at www.boxofficeguru.com. The UpdateWizard downloads the list automatically and updates the data file. Finally, it creates new training and testing data files for use with MATLAB and the dynamic data neural network.
Initial Data Collection
Data pertaining to box office revenue is plentiful on the internet. As such, I sought out the source with the most input features and the greatest reliability. After a thorough search, I decided to use the information found at www.boxofficeguru.com. Its .html files already list all the films since 1989 that have grossed more than fifteen million dollars in their first weekend. Additional data is posted for each film: its opening date, number of theatres at opening, distributor, number of days in the opening weekend, and, most importantly, the exact amount it grossed in its opening weekend.
Unfortunately, the data is not in a format that a program can read easily. Consequently, using Microsoft Visual Basic 6.0 Professional Edition, I wrote a Windows application called DataExtractor (dataextractor.exe) to parse the information out of the *.html files. For this portion of the data collection, which is performed only once, I manually downloaded the data files and renamed them 35plus.htm, 25to35.htm, 20to25.htm, 17to20.htm, and 15to17.htm. Running these five files through the DataExtractor creates five readable files: 35plus_1.txt, 25to35_1.txt, 20to25_1.txt, 17to20_1.txt, and 15to17_1.txt.
A screenshot of the DataExtractor is found on page 10 (Fig. 1).
The source code for the DataExtractor is found on pages 21+.
After the DataExtractor has parsed the information, the five output files (35plus_1.txt, 25to35_1.txt, 20to25_1.txt, 17to20_1.txt, and 15to17_1.txt) need to be concatenated. Again, using Visual Basic 6.0, I developed another Windows application called DataConcatenator (dataconcatenator.exe). It takes the five aforementioned output files as inputs and creates a single file called concatenated_data.txt.
A screenshot of the DataConcatenator is found on page 11 (Fig. 2).
The source code for the DataConcatenator is found on pages 21+.
Now that a single, readable file existed, more input features needed to be added. To avoid reentering data unnecessarily, I wrote another Windows application called DataConverter (dataconverter.exe). It takes two input files: data.txt and concatenated_data.txt. data.txt contains the data used by the student in the earlier project, while concatenated_data.txt contains the updated film information produced by the DataExtractor and DataConcatenator. The DataConverter compares these two files. If a film from data.txt did not gross over fifteen million dollars in its first weekend, it is removed from the data file. Otherwise, its data is copied into mydata.txt. Films that have been released since data.txt was created (in 2001) are identified and listed in the file titlestoupdate.txt.
The DataConverter displays the films that did not gross over fifteen-million dollars as well as movies that need to be updated (i.e., released since 2001).
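The comparison described above can be sketched in a few lines. This is an illustrative Python sketch, not the Visual Basic source: it assumes each file has already been reduced to a list of film titles (the actual layouts of data.txt and concatenated_data.txt are not described here), the function name is hypothetical, and it treats any title present in the new list but absent from the old one as a film that needs updating.

```python
def compare_film_lists(old_titles, current_titles):
    """Partition titles the way the DataConverter is described as doing.

    old_titles:     titles from the earlier data file (data.txt in the text)
    current_titles: titles that grossed over fifteen million dollars in
                    their opening weekend (concatenated_data.txt in the text)
    Assumption: titles are spelled identically in both lists.
    """
    old = set(old_titles)
    current = set(current_titles)

    # Films from the old file that still qualify (copied to mydata.txt).
    kept = [t for t in old_titles if t in current]
    # Films from the old file that fall below the revenue threshold.
    removed = [t for t in old_titles if t not in current]
    # Films not present in the old file (listed in titlestoupdate.txt).
    to_update = [t for t in current_titles if t not in old]
    return kept, removed, to_update
```

For example, with the placeholder titles "Film A", "Film B", and "Film C" in the old list and "Film A", "Film C", and "Film D" in the new one, "Film B" is removed and "Film D" is flagged for updating.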
Several screenshots of the DataConverter in action are found on pages 12-13 (fig. 3 and fig. 4). The source code for the DataConverter can be found on pages 21+.
Once these three programs have run (DataExtractor, DataConcatenator, DataConverter), a list of film titles that need to be updated has been created and stored under the filename titlestoupdate.txt. With this list, I looked up every film in the file at www.imdb.com. This website offers more information than www.boxofficeguru.com: a film’s genre, rating, runtime, color/black & white/animated status, and sequel status are all listed. I looked up each movie individually and entered the data in a Microsoft Word document.
Data Collection Improvements
Here, I made some improvements over the last project done on box office revenue. First, I eliminated his use of the IMDB user ratings as an input feature. These ratings are irrelevant here because audiences cannot know whether a movie is good before its opening weekend. While this data would be important in a neural network that predicts total gross revenue, it is not useful in predicting opening weekend gross revenue. In addition, I removed the input feature that encodes the day of the month on which the film was released. Because a given day of the week does not correspond to any specific day of the month, this feature was effectively random and therefore of little use in developing the MLP.
I also added two input features. First, from general observation of the film industry, sequels tend to do well: the audience knows what to expect, and a film studio usually releases a sequel only if the original did well. Whether or not a film is a sequel is therefore an important factor in its opening weekend revenue. Second, animated films tend to do very well, so whether or not a film is animated also has a significant impact on its opening weekend revenue. The addition of these two input features increased the correct classification rates of the multi-layer perceptron.
Data Encoding
movies.txt is the final data file produced by the procedures described above. Many features of a film are not numerical, so I created an encoding scheme that makes the non-numerical data fields usable by the multi-layer perceptron.
Genre / Code
Action / 20
Comedy / 21
Drama / 22
Family / 23
Horror / 24
Mystery / 25
Animation / 26
Romance / 27
Sci-Fi / 28
Thriller / 29
Western / 30

Rating / Code
G / -5
PG / -4
PG-13 / -3
R / -2

Distributor / Code
Sony / 1
Universal / 2
Warner Brothers / 3
Fox / 4
New Line / 5
Buena Vista / 6
Paramount / 7
MGM/United Artists / 8
MGM / 9
DreamWorks / 10
Miramax / 11
TriStar / 12
Columbia / 13
Artisan / 14
Polygram / 15
USA Films / 16
Orion / 17
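The encoding scheme above amounts to three lookup tables. The following Python sketch is only an illustration of that idea, not the project's Visual Basic code; the codes are copied from the table, while the function name and the order of the three values in the returned list are assumptions, and the remaining features (runtime, theatres, and so on) are omitted.

```python
# Codes copied from the encoding table in the text.
GENRE_CODES = {
    "Action": 20, "Comedy": 21, "Drama": 22, "Family": 23,
    "Horror": 24, "Mystery": 25, "Animation": 26, "Romance": 27,
    "Sci-Fi": 28, "Thriller": 29, "Western": 30,
}
RATING_CODES = {"G": -5, "PG": -4, "PG-13": -3, "R": -2}
DISTRIBUTOR_CODES = {
    "Sony": 1, "Universal": 2, "Warner Brothers": 3, "Fox": 4,
    "New Line": 5, "Buena Vista": 6, "Paramount": 7,
    "MGM/United Artists": 8, "MGM": 9, "DreamWorks": 10,
    "Miramax": 11, "TriStar": 12, "Columbia": 13, "Artisan": 14,
    "Polygram": 15, "USA Films": 16, "Orion": 17,
}


def encode_film(genre, rating, distributor):
    """Return [genre code, rating code, distributor code].

    The field order is hypothetical. A KeyError is raised for any value
    that does not appear in the table.
    """
    return [
        GENRE_CODES[genre],
        RATING_CODES[rating],
        DISTRIBUTOR_CODES[distributor],
    ]
```

A PG-13 horror film distributed by New Line would be encoded as 24, -3, and 5 under this scheme.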
Pre-analysis of the Data
Before developing the MLP, it is helpful to examine the data thoroughly. As such, I wrote preanalysis.m in MATLAB to assist me in this task. It produces graphs showing how many films have each characteristic. It also computes the means and standard deviations of the inputs where applicable.
The graphs produced by preanalysis.m can be found on pages 14-16 (fig. 5).
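The kind of summary that preanalysis.m is described as producing can be sketched as follows. This is a Python illustration rather than the MATLAB script itself; the function name, the dictionary-based film records, and the field names are hypothetical, plotting is omitted, and the sample standard deviation (dividing by N-1) is assumed.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(films, categorical_fields, numeric_fields):
    """Count films per category and compute mean/std of numeric inputs.

    films: list of dicts, one per film (hypothetical record layout).
    Returns (counts, stats) where counts[field] is a Counter of values
    and stats[field] is a (mean, sample standard deviation) pair.
    """
    counts = {
        field: Counter(film[field] for film in films)
        for field in categorical_fields
    }
    stats = {}
    for field in numeric_fields:
        values = [film[field] for film in films]
        stats[field] = (mean(values), stdev(values))
    return counts, stats
```

The counts correspond to the bars in a per-characteristic graph, and the statistics only make sense for inputs such as runtime or number of theatres, not for the encoded categorical fields.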
Development of the Dynamic Data Neural Network
A major component of this project is the development of what I call a dynamic data neural network: the training and testing data used by the MLP change continually as new films are released. This is an improvement over networks trained on a fixed data set, including the one previously designed to tackle the opening weekend box office revenue problem.
In Visual Basic 6.0, I developed a Windows application called the UpdateWizard (updatewizard.exe). This program performs all the steps necessary to update the data. Consequently, as time goes on, the MLP, which is developed later in this report, will change and improve as its training data is updated.
Step 1 of the UpdateWizard: Downloading
The UpdateWizard begins by downloading the most up-to-date data files from www.boxofficeguru.com. The program contacts the server and downloads five files: open35+.htm, open25-35.htm, open20-25.htm, open17-20.htm, and open15-17.htm. The files are processed and concatenated using methods similar to those found in the DataExtractor and DataConverter. After the files have been downloaded, processed, and linked, the updated data is compared to the current data file (movies.txt). Films that are new since the last update are presented to the user.
A screenshot of the UpdateWizard in step 1 can be found on page 17 (fig. 6).
Step 2 of the UpdateWizard: Updating
In this step of the UpdateWizard, the user enters the information for the films that are new since the last update. This information can be found at www.imdb.com. After all the films have been updated, they are added to the data file movies.txt. The data is now up-to-date. If no updates are available, this step is skipped, and the UpdateWizard proceeds directly to step 3.
A screenshot of the UpdateWizard in step 2 can be found on page 17 (fig. 7).
Step 3 of the UpdateWizard: Creating Training and Testing Files
In the third and final step of the UpdateWizard, training and testing files are created. The user has two options in this step. First, the user decides how many classes the data will be partitioned into. Second, he or she selects the date at which the testing file begins: the training file consists of all the films released before that date, and the testing file consists of all the films released on or after it.
User Options in Step 3
Training File Options (i.e., classification scheme):
1. 2 Classes – Class 1: 15m-22.5m, Class 2: 22.5m+
2. 4 Classes – Class 1: 15m-18.5m, Class 2: 18.5m-23m, Class 3: 23m-32m, Class 4: 32m+
3. 5 Classes – Class 1: 15m-17m, Class 2: 17m-20m, Class 3: 20m-25m, Class 4: 25m-35m, Class 5: 35m+
Testing File Options (i.e., training and testing data separation):
Begin Testing File on January 1st of 2001, 2002, or 2003.
Output File Description
training_X_YYYY.txt and testing_X_YYYY.txt where X is the classification scheme and YYYY is the year at which the testing file begins.
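The two choices in step 3 can be sketched as a class lookup and a date split. This Python sketch is illustrative and is not the UpdateWizard's Visual Basic code: the function names and the "opened" field are hypothetical, and it assumes that a gross falling exactly on a class boundary belongs to the higher class, which the text does not specify.

```python
from bisect import bisect_right
from datetime import date

# Lower bounds, in millions of dollars, of classes 2..N for each scheme
# listed in the text (keyed by the number of classes).
CLASS_BOUNDARIES = {
    2: [22.5],
    4: [18.5, 23.0, 32.0],
    5: [17.0, 20.0, 25.0, 35.0],
}


def revenue_class(gross_millions, n_classes):
    """Return the 1-based class of an opening-weekend gross (in millions).

    Assumption: a gross equal to a boundary goes to the higher class.
    """
    return bisect_right(CLASS_BOUNDARIES[n_classes], gross_millions) + 1


def split_by_year(films, test_year):
    """Split films into (training, testing) at January 1st of test_year.

    films: list of dicts with a datetime.date under the hypothetical
    key "opened". Films opening on January 1st go to the testing set.
    """
    cutoff = date(test_year, 1, 1)
    training = [f for f in films if f["opened"] < cutoff]
    testing = [f for f in films if f["opened"] >= cutoff]
    return training, testing
```

Under the five-class scheme, for instance, a film that opened with 16 million dollars falls in Class 1 and one that opened with 40 million dollars falls in Class 5.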
A screenshot of the UpdateWizard in step 3 can be found on page 18 (fig. 8).