ONLINE Appendix 1: Supervised Machine Classification of Tasks
The training and testing datasets are random samples drawn from all applications and contain 700 and 300 tasks, respectively. I used the RTextTools package in R for supervised classification. The package comes with eight standard algorithms, as shown in Table A1.
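For reference, a minimal sketch of this setup with RTextTools follows; the data frame name `tasks` and its columns `text` and `label` are illustrative placeholders, not the actual variable names:

    library(RTextTools)

    # `tasks` is an illustrative data frame with one row per task:
    # `text` holds the application text, `label` the hand-coded category.
    # Rows 1-700 form the training set; rows 701-1000 the testing set.
    doc_matrix <- create_matrix(tasks$text, language = "english",
                                removeNumbers = TRUE, removeStopwords = TRUE,
                                stemWords = TRUE)
    container <- create_container(doc_matrix, tasks$label,
                                  trainSize = 1:700, testSize = 701:1000,
                                  virgin = FALSE)  # test labels are known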
To decide which algorithms to use for my classification, I first conducted 4-fold cross-validation on my data. I subdivided the training dataset into four subsets, usually referred to as ‘folds’. For each fold, I trained a model using all of the data outside the fold and tested the model on the fold. The first column of Table A1 reports each algorithm's mean accuracy across the four folds.
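This step can be reproduced with the package's cross_validate() function, which also prints per-fold accuracies; a sketch reusing the container object from the previous snippet:

    # Run 4-fold cross-validation for each of the eight algorithms and
    # collect the mean out-of-sample accuracy (Table A1, first column).
    algorithms <- c("BAGGING", "BOOSTING", "GLMNET", "MAXENT",
                    "SLDA", "SVM", "RF", "TREE")
    cv_mean <- sapply(algorithms, function(alg)
      cross_validate(container, nfold = 4, algorithm = alg)$meanAccuracy)
    print(cv_mean)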
I also computed the precision, recall, and F-score, and calculated accuracy from the confusion matrix. Accuracy measures the percentage of cases that are correctly labeled. Recall measures the proportion of positive cases that are correctly identified. Precision gauges the proportion of predicted positive cases that are correct. The F-score is the harmonic mean of precision and recall. Based on these statistics, I decided to use five of the algorithms (shaded in grey) to classify the cases. The final classification was based on the consensus among these five algorithms.
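These statistics and the consensus labels come from the package's analytics functions. A sketch follows; because the grey shading in Table A1 does not survive in this text, the five algorithm names below are a placeholder to be replaced with the shaded ones:

    # Train the five retained algorithms and classify the testing set.
    chosen  <- c("BAGGING", "BOOSTING", "GLMNET", "MAXENT", "SVM")  # placeholder
    models  <- train_models(container, algorithms = chosen)
    results <- classify_models(container, models)

    # Precision, recall, and F-score by algorithm, plus the consensus
    # label agreed on by the five algorithms for each document.
    analytics <- create_analytics(container, results)
    summary(analytics)
    consensus <- analytics@document_summary$CONSENSUS_CODE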
Table A1. Statistics from Eight Algorithms
Algorithm / Mean accuracy from 4-fold cross-validation / Precision / Recall / F-score / Accuracy from confusion matrix (%)
BAGGING / 0.74 / 0.73 / 0.56 / 0.58 / 77
BOOSTING / 0.82 / 0.73 / 0.59 / 0.61 / 76
GLMNET / 0.61 / 0.71 / 0.62 / 0.64 / 78
MAXENT / 0.38 / 0.69 / 0.66 / 0.66 / 82
SLDA / 0.42 / 0.14 / 0.14 / 0.09 / 9
SVM / 0.38 / 0.62 / 0.67 / 0.64 / 76
RF / 0.77 / 0.69 / 0.51 / 0.55 / 68
TREE / 0.64 / 0.52 / 0.41 / 0.41 / 42
Table A2 presents the final confusion matrix, a contingency table that displays the number of correctly and incorrectly classified cases. The rows show the hand-coded (“actual”) classification; the columns show the machine-coded (“predicted”) classification. The overall accuracy rate, computed by dividing the number of correctly classified cases by the total number of cases, is 81%; a code sketch of this tabulation follows Table A2.
Note that category 1 (single-family home) is the largest category. It also has one of the highest accuracy rates: out of 122 cases, 121 were correctly classified, and the one remaining case was mistakenly classified as “2” (apartment complex), a similar category. For category 10 (land use changes), all five cases were correctly identified. Because retaining walls can appear in many contexts, category 5 (retaining wall) had the highest error rate: out of the nine cases, only one was correctly identified; the others were classified as “11” (public works), “1” (single-family home), “2” (apartment complex), or “3” (dock/pier).
The small sample sizes in many categories and the complexity of some tasks present challenges for classification. Category 11 (public works) has the most false positives: six cases that should have been classified as category “2” (apartment complex) and seven cases that belonged to category “7” (flood management system) were mistaken for category 11.
Table A2: Confusion Matrix
Assigned classification (“Actual”) / Machine-coded classification (“Predicted”): 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / Row %
1 / 121 / 1 / 99%
2 / 2 / 32 / 1 / 2 / 6 / 74%
3 / 1 / 18 / 1 / 90%
4 / 1 / 7 / 1 / 1 / 1 / 64%
5 / 2 / 1 / 1 / 1 / 4 / 11%
6 / 1 / 1 / 18 / 1 / 2 / 78%
7 / 1 / 1 / 13 / 1 / 7 / 57%
8 / 1 / 2 / 1 / 50%
9 / 2 / 1 / 3 / 17%
10 / 5 / 100%
11 / 1 / 2 / 1 / 2 / 24 / 80%
12 / 3 / 1 / 25%
Column % / 93% / 86% / 86% / 100% / 50% / 95% / 65% / 50% / 100% / 71% / 47% / 100% / 93%
Note: Cells that are left blank have no cases. Shaded diagonal cells indicate the number of correctly classified cases. Cells above the diagonal indicate false negatives; cells below the diagonal indicate false positives.
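For reference, the confusion matrix and the accuracy figures above can be tabulated from the hand-coded and consensus labels in base R; a sketch, assuming the document_summary columns from the earlier snippet and that every category appears in both margins:

    # Cross-tabulate hand-coded ("actual") rows against machine-coded
    # ("predicted") columns for the 300 testing cases.
    actual    <- analytics@document_summary$MANUAL_CODE
    predicted <- analytics@document_summary$CONSENSUS_CODE
    confusion <- table(actual, predicted)

    # Row % = correctly classified share of each hand-coded category;
    # overall accuracy = share of all cases on the diagonal (81% here).
    row_pct  <- diag(confusion) / rowSums(confusion)
    accuracy <- sum(diag(confusion)) / sum(confusion)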
ONLINE Appendix 2. Logistic Regression Estimates and Standard Errors for Machine- and Manual-Coded Datasets
Note: The dependent variable is application approval (1 = approved; 0 = otherwise). The graph shows that the point estimates from the two subsets are close, illustrating that cases coded by machine produce results comparable to those coded by researchers. Since there are far more machine-coded than manual-coded cases, the former have smaller variances and standard errors.
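A sketch of this comparison follows; the names `cases`, `approved`, `machine_coded`, and the covariates `x1` and `x2` are illustrative stand-ins for the actual specification in the main text:

    # Fit the same logistic regression separately on the machine-coded and
    # manual-coded subsets, then compare point estimates and standard errors.
    fit_machine <- glm(approved ~ x1 + x2, family = binomial,
                       data = subset(cases, machine_coded == 1))
    fit_manual  <- glm(approved ~ x1 + x2, family = binomial,
                       data = subset(cases, machine_coded == 0))
    summary(fit_machine)$coefficients
    summary(fit_manual)$coefficients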