Updating Missing Values of Traffic Counts: Factor Approaches, Time Series Analysis versus Genetically Designed Regression

and Neural Network Models

Ming Zhong

Doctoral Student

Faculty of Engineering, University of Regina

Regina, SK, Canada, S4S 0A2

Phone: (902) 496-8152

Fax: (902) 420-5035

Email:

Pawan Lingras*

Professor

Dept. of Mathematics and Computing Science

Saint Mary’s University

Halifax, NS, Canada, B3H 3C3

Phone: (902) 420-5798

Fax: (902) 420-5035

Email:

And Satish Sharma

Professor

Faculty of Engineering, University of Regina

Regina, SK, Canada, S4S 0A2

Phone: (306) 585-4553

Fax: (306) 585-4855

Email:

*Corresponding Author: Pawan Lingras


Ming Zhong, Pawan Lingras, and Satish Sharma

ABSTRACT: The principle of Base Data Integrity, addressed by both the American Association of State Highway and Transportation Officials (AASHTO) and the American Society for Testing and Materials (ASTM), recommends that missing values not be imputed in the base data. However, updating missing values may be necessary in data analysis and helpful in establishing more cost-effective traffic data programs. The analyses applied to data sets from two highway agencies show that, on average, over 50% of permanent traffic counts (PTCs) have missing values. It is difficult to eliminate such a significant portion of data from the analysis. A literature review indicates that the limited existing research uses factor or autoregressive integrated moving average (ARIMA) models for predicting missing values. Factor-based models tend to be less accurate, and ARIMA models use only the historical data. In this study, genetically designed neural network and regression models, factor models, and ARIMA models were developed to update pseudo-missing values of six PTCs from Alberta, Canada. Both short-term prediction models and models based on data from before and after the failure were developed. Factor models were used as benchmark models. It was found that genetically designed regression models based on data from before and after the failure gave the most accurate results. Average errors for refined models were lower than 1%, and the 95th percentile errors were below 2% for counts with stable patterns. Even for counts with relatively unstable patterns, average errors were lower than 3% in most cases, and the 95th percentile errors were consistently below 9%. ARIMA models and genetically designed neural network models also outperformed the benchmark factor models. It is believed that the models proposed in this study would be helpful to highway agencies in their traffic data programs.

Key words: Missing values, Traffic counts, Genetic algorithms, Time delay neural network, Locally weighted regression, Autoregressive integrated moving average

INTRODUCTION

Highway agencies commit a significant portion of their resources to data collection, summarization, and analysis (Sharma et al. 1996). The data is used in planning, design, control, operation, and management of traffic and highway facilities. However, the presence of missing values makes the data analysis difficult. Without proper imputation methods, traffic counts with missing values are usually discarded and new counts have to be retaken.

This study analyzed missing values in data sets from two highway agencies in North America. The first data set was from Alberta Transportation, and the other was from the Minnesota Department of Transportation (MnDOT). In Alberta, over seven years, more than half of the total counts have missing values. In some years the percentage is as high as 70% to 90%. One year of data from MnDOT shows more than 40% of counts having missing values. Williams et al. (1998) applied seasonal ARIMA and exponential smoothing models to predict short-term traffic for two study sites on an urban freeway near Washington, D.C. It was reported that approximately 20 percent of the data in the development and test sets of their study were missing. Ramsey and Hayden (1994) introduced a computer program, AutoCounts, used by the Countywide Traffic and Accident Data Unit (TADU) in England to process automatic traffic count data. It was found that infill models had to be used to estimate average flows for more than 50 percent of the months in many years at a study site.

There are increasing concerns about data imputation and Base Data Integrity. The principle of Base Data Integrity is an important theme discussed in both the American Society for Testing and Materials (ASTM) Standard Practice E1442, Highway Traffic Monitoring Standards (America 1991), and the American Association of State Highway and Transportation Officials (AASHTO) Guidelines for Traffic Data Programs (America 1992). The principle says that traffic measurements must be retained without modification or adjustment, and that missing values should not be imputed in the base data. However, this does not prohibit imputing data at the analysis stage. In some cases, traffic counts with missing values may be the only data available for a certain purpose, and data imputation is necessary for further analysis. In accordance with the principle of Truth-in-Data, the AASHTO Guidelines (America 1992) also recommend that highway agencies document their procedures for editing traffic data.

For traffic counts with missing values, highway agencies usually either retake the counts or estimate the missing values; the latter is known as data imputation. Since retaking counts is sometimes impossible due to limited resources and time, imputing the data has become a popular practice (Albright 1991a). For example, it was reported that many highway agencies in the United States estimated missing values for their traffic counts (New Mexico 1990). In Europe, highway authorities in the Netherlands, France, and the United Kingdom all used computer programs for data validation routines; missing or invalid data was usually replaced with historical data from the same site during the same period (FHWA 1997). The experience with data from Alberta Transportation also indicates that the agency used data imputation before 1995: the replaced values of missing data were marked with minus signs for some years. Imputing data with reasonable accuracy may help establish more cost-effective traffic data programs. The analysis of the Alberta data also shows that a significant percentage (varying from 10% to 44% from year to year) of traffic counts have missing data for several successive days or months. Usually these PTCs cannot be used to calculate annual average daily traffic (AADT) or design hourly volume (DHV) because of the missing data. Such PTCs may be used as seasonal traffic counts (STCs) or short-period traffic counts (SPTCs), or simply discarded by highway agencies. However, the information contained in these PTCs is certainly greater than that from STCs and SPTCs. If missing data from PTCs can be accurately updated, further analysis could be based on AADT or DHV.

A review of the literature indicates that little research has been done on missing values. Most methods used by transportation practitioners were simple factor approaches or moving average regression analyses (New Mexico 1990; FHWA 1997). Two studies (Ahmed and Cook 1979; Nihan and Holmesland 1980) from the United States used Box-Jenkins techniques to predict short-term traffic for urban freeways. The models showed reasonable accuracy and can be used to update missing values for traffic counts. Models developed by Nihan and Holmesland (1980) were able to predict average weekday volumes for two months in which the entire monthly data was missing. A group of scholars at the University of Leeds, England, tried to model outliers and missing values in traffic count time series by employing exponentially weighted moving average, autocorrelation-based influence function, and autoregressive integrated moving average (ARIMA) models (Clark 1992; Redfern et al. 1993; Watson et al. 1993). It was found that ARIMA models outperformed the other models in detecting missing values and outliers.

In this study, factor approaches, time series analysis, and genetically designed neural network and regression models are tested on six permanent traffic counts (PTCs) from Alberta, Canada, to investigate their ability to update missing values. This study also compares models based on historical data with models based on data from before and after the failure. The six PTCs belong to different groups based on their trip purpose and trip length distributions. The experiments presented in this paper illustrate how to use the proposed techniques to update missing values of these PTCs. The techniques used in this study could be applied not only to permanent traffic volume counts, but also to seasonal or short-term traffic volume counts, vehicle classification counts, weight counts, and speed counts.

LITERATURE REVIEW

There is a significant amount of research related to missing values (Little and Rubin 1987; Bole et al. 1990; Beveridge 1992; Wright 1993; Gupta and Lam 1996; Singh and Harmancioglu 1996). However, limited research is available on how missing data are handled by transportation practitioners. Southworth et al. (1989) introduced a system called RTMAS for urban population evacuations in times of threat. One subroutine of this system is AUTOBOX, which applies a Box-Jenkins time series model to hourly or daily traffic count data. AUTOBOX allows complete autoregressive integrated moving average (ARIMA) modeling. The example in their study clearly showed that the proposed ARIMA model was good at detecting unusual traffic profiles and at predicting hourly counts. They used the past five days' data to predict the 24 hourly volumes of the same day of the following week. It was found that 22 of the hourly volumes were within the 95% confidence interval of the observed counts. The other two were detected as outliers caused by an evacuation response to the threat of Hurricane Elena. Such a system can also be used to predict missing values for traffic counts.

In 1990, the New Mexico State Highway and Transportation Department conducted a survey of traffic monitoring practice (New Mexico 1990) in the United States. It showed that when portable devices failed, 13 states used some procedure to estimate the missing values and complete the data set. When permanent devices failed, 23 states employed some procedure to estimate the missing values (Albright 1991b). Various methods were used for this purpose. For example, in Alabama, if fewer than 6 hours are missing, the data are estimated using the previous year or other data from the month; if more than 6 hours are missing, the day is voided. In Delaware, estimates of missing values are based on a straight line using the data from the months before and after the failure. Most of these methods apply simple factors to historical data to estimate missing values. In Kentucky, a computer program was used to estimate and fill in the blanks (New Mexico 1990). In 1997, the Federal Highway Administration (FHWA) conducted a study of traffic monitoring programs and technologies in Europe (FHWA 1997). It was reported that highway agencies in the Netherlands, France, and the United Kingdom used computer programs for data validation routines. For example, a software system called INTENS was used in the Netherlands for data analysis and validation. The software used a “smart” linear interpolation process between locations from which data were available to estimate missing traffic volumes. In France, a system called MELODIE was used for data validation; validation was conducted visually by the system operator, and invalid data were replaced with the previous month’s data. Several data validation schemes were used in the United Kingdom. One of them was used by the Central Transport Group (CTG) to validate permanent recorder data: invalid data were replaced with data extracted from the valid data of the previous week collected at that site. No research has been found assessing the accuracy of such imputations.
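The Delaware practice described above amounts to straight-line (linear) interpolation between the last valid observation before the failure and the first valid observation after it. A minimal sketch, using hypothetical daily volumes:

```python
def straight_line_fill(before, after, n_missing):
    """Fill n_missing values on a straight line between the last valid
    count before a failure and the first valid count after it."""
    step = (after - before) / (n_missing + 1)
    return [before + step * (i + 1) for i in range(n_missing)]

# Hypothetical daily volumes: 1000 before a 3-day failure, 1200 after
print(straight_line_fill(1000, 1200, 3))  # → [1050.0, 1100.0, 1150.0]
```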

A series of studies (Clark 1992; Redfern et al. 1993; Watson et al. 1993) was carried out by a group of scholars at the University of Leeds, England, in the early 1990s. Redfern et al. (1993) tested four types of models on four traffic time series supplied by the Department of Transportation (DOT) in London. These models were: exponentially weighted moving average, autocorrelation-based influence function, ARIMA using large residuals, and ARIMA using the Tsay likelihood ratio diagnostics. It was reported that the estimation of replacement values for both extreme and missing values was most efficiently done using the parametric ARIMA(1,0,0)(0,1,1)7 model (a seasonal model with a period of 7 days). However, it was also reported that the estimated replacements of the missing values showed considerable variation (Redfern et al. 1993). The study also mentioned concerns about Base Data Integrity.

A survey of practical solutions used by consultancies and local authorities in England (Redfern et al. 1993) reported two broad categories of solutions: “by-eye” methods and computerized packages (Redfern et al. 1993). Most automated practical solutions to patching were based upon simple, moving, or exponentially weighted moving averages, or their variants. For example, the DOT in London employed an exponentially weighted moving average model to update missing values. The process involved validating new traffic count data against old data from the same site collected at the same time over the previous weeks. The following equation was used to estimate missing or rejected data, x̂_{t,s}, at time t:

x̂_{t,s} = α x_{t-1,s} + α(1-α) x_{t-2,s} + … + α(1-α)^{n-1} x_{t-n,s}    (1)

where x_{t-1,s}, x_{t-2,s}, …, x_{t-n,s} represent the observations for that particular site and vehicle category at the same times for weeks 1, 2, …, n before the current observation, and α is a constant such that 0 < α < 1. A value of 0.7 was typically used for the parameter α.

The Countywide Traffic and Accident Data Unit (TADU) used AutoCounts to validate collected data and to infill missing values from automatic traffic counts (Ramsey and Hayden 1994). Within AutoCounts, the agency needs monthly five- and seven-day flow averages for trend and yearly analysis. Usually these statistics can be obtained directly from validated data that have been flagged as typical. However, when there are no typical data, the infill model is applied. The model estimates weekly flows and starts with a seasonal profile in which all weeks are considered to be equal. Then, considering the data in ascending order of age by year, the profile is modified each year. As a starting point, the previous year’s profile is calculated as follows:

f_{w'} = f_w × (FW_{w'} / FW_w)    (2)

The model is applied on a week-to-week basis for w = 1 to 53, with w' = w + 1. Here FW_w is the actual weekly flow for week w; f_w is the estimated weekly seasonal factor for week w; f_42 (w = 42), mid-October, is always 1.0. The model is applied iteratively, either a maximum of 50 times or until no further improvement in fit is achieved. The output of the process is a full 53-week flow profile for the year under consideration. No evaluations were made of the accuracy of such models (Ramsey and Hayden 1994).
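A week-to-week update of this form telescopes into a ratio of weekly flows, with the profile renormalized so that the anchor week's factor equals 1.0. A minimal sketch under that assumption (the function name and the short flow series are hypothetical, and the real AutoCounts model additionally iterates against flagged typical data):

```python
def seasonal_profile(weekly_flows, anchor_week=42):
    """Build a weekly seasonal-factor profile from actual weekly flows
    by chaining f_{w+1} = f_w * FW_{w+1} / FW_w, then renormalizing
    so the factor for the anchor week (default week 42) is 1.0."""
    f = [1.0]
    for w in range(1, len(weekly_flows)):
        f.append(f[-1] * weekly_flows[w] / weekly_flows[w - 1])
    anchor = f[anchor_week - 1]
    return [x / anchor for x in f]

# Hypothetical 4-week series, anchored at week 2
print(seasonal_profile([100.0, 200.0, 100.0, 50.0], anchor_week=2))
# → [0.5, 1.0, 0.5, 0.25]
```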

REVIEW OF TECHNIQUES

This section provides a brief review of factor approaches, time series analysis, regression analysis, neural networks, and genetic algorithms used in the present study.

Factor Approaches

Factor approaches may be the most popular data imputation and prediction methods. They usually involve developing a set of factors from a historical data set and then applying these factors to new data for prediction. For example, a set of hourly factors (HF), daily factors (DF), and monthly factors (MF) can be developed from permanent traffic count data. Traffic parameters such as AADT and DHV could then be estimated by applying these factors to short-period traffic counts (Garber and Hoel 1999). The virtue of such methods is their simplicity; however, the results are usually less accurate than those of more sophisticated models.
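As an illustration of the factor approach, the sketch below expands a single 24-hour short count into an AADT estimate. The factor definitions (ratios of the annual average to the day-of-week and monthly averages at a nearby permanent counter) and all numbers are hypothetical:

```python
def estimate_aadt(day_volume, daily_factor, monthly_factor):
    """Expand one 24-hour count to an AADT estimate by applying a
    day-of-week factor and a monthly (seasonal) factor, each assumed
    to be an AADT / average-day-volume ratio from a permanent counter."""
    return day_volume * daily_factor * monthly_factor

# Hypothetical: a Tuesday count taken in July; Tuesdays run 5% above
# the annual average and July runs 20% above, so both factors are < 1
print(round(estimate_aadt(12600, 1 / 1.05, 1 / 1.20)))  # → 10000
```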

Time Series Analysis using ARIMA

A time series is a chronological sequence of observations on a particular variable. Time series data are often examined in hopes of discovering a historical pattern that can be exploited in the forecast. Time series modeling is based on the assumption that the historical values of a variable provide an indication of its value in the future (Box and Jenkins 1970).

Many techniques are available for modeling univariate time series, such as exponential smoothing, the Holt-Winters procedure, and the Box-Jenkins procedure. Simple exponential smoothing should only be used for non-seasonal time series showing little visible trend, but it may easily be generalized to deal with time series containing trend and seasonal variation; the resulting procedure is usually referred to as the Holt-Winters procedure. The Box-Jenkins procedure is the most popular tool for time series analysis. The procedure builds an autoregressive integrated moving average (ARIMA) model using the Box-Jenkins methodology, in which both autoregressive and moving average components are considered. Such a model is called an integrated model because the stationary model that is fitted to the differenced data has to be summed, or integrated, to provide a model for the non-stationary data (Chatfield 1989). The general autoregressive integrated moving average process is of the form:

Given the d-times differenced series

W_t = ∇^d X_t = (1 - B)^d X_t    (3)

the general ARIMA(p, d, q) process is

W_t = φ_1 W_{t-1} + … + φ_p W_{t-p} + Z_t + θ_1 Z_{t-1} + … + θ_q Z_{t-q}    (4)

where X_t is the observed series, B is the backward shift operator, d is the order of differencing, φ_1, …, φ_p are the autoregressive parameters, θ_1, …, θ_q are the moving average parameters, and Z_t is a purely random (white noise) process.
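The differencing and integration steps that give an ARIMA model its name can be sketched directly: the operator ∇ = (1 − B) is applied d times to make the series stationary, and the fitted model's output must be cumulatively summed (integrated) to return to the original scale. A minimal sketch with a hypothetical series:

```python
def difference(x, d=1):
    """Apply the differencing operator (1 - B)^d to a series."""
    for _ in range(d):
        x = [x[i] - x[i - 1] for i in range(1, len(x))]
    return x

def integrate(w, x0):
    """Invert one level of differencing: cumulative sum anchored at x0."""
    out = [x0]
    for v in w:
        out.append(out[-1] + v)
    return out

x = [3, 5, 9, 15, 23]              # non-stationary (quadratic trend)
print(difference(x, d=1))          # → [2, 4, 6, 8]
print(difference(x, d=2))          # → [2, 2, 2] (now stationary)
print(integrate(difference(x), x[0]) == x)  # → True
```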