1 Introduction
Boosting algorithms [16, 4, 5, 2, 17, 7, 15, 6]
have become very successful in machine learning. In this paper, we provide an empirical evaluation of
four tree-based boosting algorithms for multiclass classification: mart [6], abc-mart [11], robust logitboost [13], and abc-logitboost [12], on a wide range of datasets. Abc-boost [11], where "abc" stands for adaptive base class, is a recent idea for improving multiclass classification; both abc-mart [11] and abc-logitboost [12] are specific implementations of abc-boost. Although the experiments in [11, 12] were reasonable, we believe a more thorough study is necessary. Most datasets used in [11, 12] are (very) small. While those datasets (e.g., pendigits, zipcode) are still popular in machine learning research papers, they may be too small to be practically meaningful. Nowadays, applications with millions of training samples are not uncommon, for example, in search engines [14].
It would also be interesting to compare these four tree-based boosting algorithms with other popular learning methods such as support vector machines (SVM) and deep learning. A recent study [9] (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) conducted a thorough empirical comparison of many learning algorithms, including SVM, neural nets, and deep learning. The authors of [9] maintain a Web site from which one can download the datasets and compare the test misclassification errors.
In this paper, we provide extensive experimental results using mart, abc-mart, robust logitboost, and abc-logitboost on the datasets used in [9], plus other publicly available datasets. One interesting dataset is the UCI Poker dataset. By private communication with C.J. Lin (the author of LibSVM), we learned that SVM could only achieve a relatively low classification accuracy on this dataset. Interestingly, all four boosting algorithms can easily achieve much higher accuracies.
We try to make this paper self-contained by providing a detailed introduction to abc-mart, robust logitboost, and abc-logitboost in the next section.
2 LogitBoost, Mart, ABC-Mart, Robust LogitBoost, and ABC-LogitBoost
We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is the $i$th feature vector, and $y_i \in \{0, 1, 2, \ldots, K-1\}$ is the $i$th class label, where $K \geq 3$ in multiclass classification.
Both logitboost [7] and mart (multiple additive regression trees) [6] can be viewed as generalizations of logistic regression, which assumes class probabilities $p_{i,k}$ of the form

$$p_{i,k} = \Pr\left(y_i = k \mid \mathbf{x}_i\right) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}. \qquad (1)$$
While traditional logistic regression assumes $F_{i,k}(\mathbf{x}_i) = \beta^{\mathsf{T}}\mathbf{x}_i$, logitboost and mart adopt the flexible "additive model," which is a function of $M$ terms:

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (2)$$

where $h(\mathbf{x}; \mathbf{a}_m)$, the base learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are learned from the data by maximum likelihood, which is equivalent to minimizing the negative log-likelihood loss
$$L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}, \qquad (3)$$

where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.
For identifiability, the constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$, i.e., the sum-to-zero constraint, is routinely adopted [7, 6, 19, 10, 18, 21, 20].
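To make the setup concrete, the following sketch (a minimal illustration we add here, not the authors' implementation) computes the class probabilities in (1) and the negative log-likelihood loss (3) from a matrix of function values.

    import numpy as np

    def class_probabilities(F):
        """Compute p[i, k] from function values F[i, k] via the multiclass logit in Eq. (1)."""
        F = F - F.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
        expF = np.exp(F)
        return expF / expF.sum(axis=1, keepdims=True)

    def neg_log_likelihood(F, y):
        """Negative log-likelihood loss of Eq. (3); y[i] is the class label in {0, ..., K-1}."""
        p = class_probabilities(F)
        return -np.log(p[np.arange(len(y)), y]).sum()

    # Toy usage: N = 3 samples, K = 4 classes; with all F = 0 the loss is N * log(K).
    F = np.zeros((3, 4))
    y = np.array([0, 2, 3])
    print(neg_log_likelihood(F, y))  # 3 * log(4), approximately 4.159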
2.1 Logitboost
As described in Alg. 1, [7] builds the additive model (2) by a greedy stage-wise procedure, using a second-order (diagonal) approximation, which requires knowing the first two derivatives of the loss function (3) with respect to the function values $F_{i,k}$. [7] obtained:

$$\frac{\partial L_i}{\partial F_{i,k}} = -\left(r_{i,k} - p_{i,k}\right), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}\left(1 - p_{i,k}\right). \qquad (4)$$
These derivatives can be obtained by assuming no relations among $F_{i,0}$ to $F_{i,K-1}$. However, [7] used the sum-to-zero constraint throughout the paper and provided an alternative explanation: [7] derived (4) by conditioning on a "base class" and noticed that the resultant derivatives are independent of the choice of the base.
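As a quick sanity check on (4), one can compare the analytic first derivative against a finite difference of the loss (3); the short sketch below (purely illustrative, with arbitrary function values) does this for a single sample.

    import numpy as np

    def loss_i(F_i, y_i):
        """Per-sample negative log-likelihood L_i from Eq. (3)."""
        p = np.exp(F_i - F_i.max())
        p /= p.sum()
        return -np.log(p[y_i])

    F_i, y_i, k = np.array([0.3, -1.2, 0.5, 0.0]), 2, 1   # one sample, K = 4 classes
    p = np.exp(F_i - F_i.max()); p /= p.sum()
    r = np.zeros(4); r[y_i] = 1.0

    analytic = -(r[k] - p[k])                      # first derivative in Eq. (4)
    eps = 1e-5
    F_plus, F_minus = F_i.copy(), F_i.copy()
    F_plus[k] += eps; F_minus[k] -= eps
    numeric = (loss_i(F_plus, y_i) - loss_i(F_minus, y_i)) / (2 * eps)
    print(analytic, numeric)                       # the two values agree closely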
At each stage, logitboost fits an individual regression function separately for each class. This is analogous to the popular individualized regression approach in multinomial logistic regression, which is known [3, 1] to result in loss of statistical efficiency, compared to the full (conditional) maximum likelihood approach.
On the other hand, in order to use trees as the base learner, the diagonal approximation appears to be necessary, at least from a practical perspective.
2.2 Adaptive Base Class Boost (ABC-Boost)
[11] derived the derivatives of the loss function (3) under the sum-to-zero constraint, i.e., by substituting $F_{i,0} = -\sum_{k \neq 0} F_{i,k}$. Without loss of generality, we can assume that class 0 is the base class. For any $k \neq 0$,

$$\frac{\partial L_i}{\partial F_{i,k}} = \left(r_{i,0} - p_{i,0}\right) - \left(r_{i,k} - p_{i,k}\right), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}\left(1 - p_{i,0}\right) + p_{i,k}\left(1 - p_{i,k}\right) + 2 p_{i,0} p_{i,k}. \qquad (5)$$
The base class must be identified at each boosting iteration during training. [11] suggested an exhaustive procedure to adaptively find the best base class to minimize the training loss (3) at each iteration.
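A minimal sketch of this exhaustive strategy follows (our illustration; the helper names fit_one_iteration and training_loss are placeholders for the tree-fitting step and the loss (3), not functions from the authors' code):

    def abc_boost_iteration(F, y, K, fit_one_iteration, training_loss):
        """Try every class as the base class for the current boosting iteration
        and keep the choice with the smallest training loss."""
        best_loss, best_F, best_base = float("inf"), None, None
        for b in range(K):
            # fit K-1 trees (one per non-base class) using the derivatives in Eq. (5)
            F_candidate = fit_one_iteration(F, y, base=b)
            loss = training_loss(F_candidate, y)   # Eq. (3) on the training set
            if loss < best_loss:
                best_loss, best_F, best_base = loss, F_candidate, b
        return best_F, best_base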
2.3 Robust LogitBoost
The mart paper [6] and a recent (2008) discussion paper [8] commented that logitboost (Alg. 1) can be numerically unstable. In fact, the logitboost paper [7] suggested some "crucial implementation protections" on page 17 of [7]:

In Line 5 of Alg. 1, compute the response $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).

Bound the response by $z_{max}$. The performance is not sensitive to the particular value of $z_{max}$, as long as it falls in a reasonable range (e.g., $z_{max} \in [2, 4]$).
Note that the above operations are applied to each individual sample. The goal is to ensure that the response $z_{i,k}$ is never too large. On the other hand, one would hope to use larger responses in order to better capture the data variation. Consequently, this thresholding operation occurs very frequently, and part of the useful information is expected to be lost.
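The following sketch (our rendering of the protections quoted above; the default value of $z_{max}$ and the small probability floor are assumptions) shows the per-sample response and weight computation.

    import numpy as np

    def logitboost_response(r, p, z_max=4.0, p_floor=1e-10):
        """Per-sample response z and weight w for logitboost, with the protections
        suggested in [7]: special-case r = 1 / r = 0 and clip |z| at z_max."""
        p = np.clip(p, p_floor, 1.0 - p_floor)
        w = p * (1.0 - p)                               # second derivative, Eq. (4)
        z = np.where(r == 1, 1.0 / p, -1.0 / (1.0 - p))
        return np.clip(z, -z_max, z_max), w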
The next subsection explains that, if implemented carefully, logitboost is almost identical to mart; the only difference is the tree-splitting criterion.
2.4 Tree-Splitting Criterion Using Second-Order Information
Consider $N$ weights $w_i$ and $N$ response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered according to the sorted order of the corresponding feature values. The tree-splitting procedure is to find the index $s$, $1 \leq s < N$, such that the weighted mean square error (MSE) is reduced the most if the split occurs at $s$. That is, we seek the $s$ to maximize

$$Gain(s) = MSE(T) - \left[ MSE(L) + MSE(R) \right],$$

where $MSE(T) = \sum_{i=1}^{N} w_i \left(z_i - \bar{z}\right)^2$, $MSE(L) = \sum_{i=1}^{s} w_i \left(z_i - \bar{z}_L\right)^2$, $MSE(R) = \sum_{i=s+1}^{N} w_i \left(z_i - \bar{z}_R\right)^2$, and $\bar{z}$, $\bar{z}_L$, $\bar{z}_R$ are the corresponding weighted means. After simplification, one can obtain

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} w_i z_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} w_i z_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} w_i z_i\right]^2}{\sum_{i=1}^{N} w_i}.$$

Plugging in $z_i = \frac{r_i - p_i}{p_i(1 - p_i)}$ and $w_i = p_i(1 - p_i)$ yields

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} \left(r_i - p_i\right)\right]^2}{\sum_{i=1}^{s} p_i\left(1 - p_i\right)} + \frac{\left[\sum_{i=s+1}^{N} \left(r_i - p_i\right)\right]^2}{\sum_{i=s+1}^{N} p_i\left(1 - p_i\right)} - \frac{\left[\sum_{i=1}^{N} \left(r_i - p_i\right)\right]^2}{\sum_{i=1}^{N} p_i\left(1 - p_i\right)}.$$

Because the computations involve $\sum \left(r_i - p_i\right)$ as a group, this procedure is numerically stable.

In comparison, mart [6] only uses the first-order information to construct the trees, i.e.,

$$MartGain(s) = \frac{1}{s}\left[\sum_{i=1}^{s} \left(r_i - p_i\right)\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} \left(r_i - p_i\right)\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} \left(r_i - p_i\right)\right]^2.$$
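A compact sketch of this second-order split search (our illustration, assuming the first- and second-derivative terms g and h are already sorted by the candidate feature and that all entries of h are strictly positive):

    import numpy as np

    def best_split_second_order(g, h):
        """Find the split index maximizing the second-order gain above.
        g[i] = r_i - p_i and h[i] = p_i * (1 - p_i), sorted by the feature value.
        Returns (s, gain), where the left node contains positions 0..s."""
        G, H = np.cumsum(g), np.cumsum(h)
        G_tot, H_tot = G[-1], H[-1]
        best_s, best_gain = -1, -np.inf
        for s in range(len(g) - 1):                 # split between positions s and s+1
            gain = (G[s] ** 2) / H[s] \
                 + ((G_tot - G[s]) ** 2) / (H_tot - H[s]) \
                 - (G_tot ** 2) / H_tot
            if gain > best_gain:
                best_s, best_gain = s, gain
        return best_s, best_gain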
2.5 Adaptive Base Class Logitboost (ABC-LogitBoost)
The abc-boost [11] algorithm consists of two key components:

1. Using the sum-to-zero constraint on the loss function, formulate the boosting algorithm for only $K - 1$ classes, treating one class as the base class.

2. At each boosting iteration, adaptively select the base class according to the training loss. [11] suggested an exhaustive search strategy.

Abc-mart combines this adaptive base class strategy with mart, while abc-logitboost combines it with (robust) logitboost.
2.6 Main Parameters
Alg. 2 and Alg. 3 have three parameters ($J$, $\nu$, and $M$), to which the performance is in general not very sensitive, as long as they fall within a reasonable range. This is a significant advantage in practice.
The number of terminal nodes, $J$, determines the capacity of the base learner. [6] suggested $J = 6$. [7, 21] commented that $J > 10$ is unlikely to be necessary. In our experience, for large datasets (or moderate datasets in high dimensions), $J = 20$ is often a reasonable choice; also see [14] for more examples.
The shrinkage, $\nu$, should be large enough to make sufficient progress at each step and small enough to avoid over-fitting. [6] suggested $\nu \leq 0.1$. Normally, $\nu = 0.1$ is used.
The number of boosting iterations, $M$, is largely determined by the affordable computing time. A commonly regarded merit of boosting is that, on many datasets, over-fitting can be largely avoided for reasonable $J$, $\nu$, and $M$.
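For concreteness, here is a hedged sketch of where the three parameters enter a generic stage-wise boosting loop (our own illustration; fit_tree stands for a regression-tree routine using the Sec. 2.4 criterion, class_probabilities is the helper from the earlier sketch, and the default values are merely typical settings discussed above):

    import numpy as np

    def boost(X, y, K, fit_tree, J=20, nu=0.1, M=1000):
        """Generic boosting loop: J terminal nodes per tree, shrinkage nu,
        and M boosting iterations."""
        N = X.shape[0]
        F = np.zeros((N, K))                                   # function values F[i, k]
        for m in range(M):
            p = class_probabilities(F)                         # Eq. (1)
            for k in range(K):
                r_k = (y == k).astype(float)
                g, h = r_k - p[:, k], p[:, k] * (1 - p[:, k])  # derivatives, Eq. (4)
                tree = fit_tree(X, g, h, num_leaves=J)         # split by the Sec. 2.4 gain
                F[:, k] += nu * tree.predict(X)                # shrunken additive update
        return F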
3 Datasets
Table 1 lists the datasets used in our study. [11, 12] provided experiments on several other (small) datasets.
dataset  K  # training  # test  # features  

Covertype290k  7  290506  290506  54 
Covertype145k  7  145253  290506  54 
Poker525k  10  525010  500000  25 
Poker275k  10  275010  500000  25 
Poker150k  10  150010  500000  25 
Poker100k  10  100010  500000  25 
Poker25kT1  10  25010  500000  25 
Poker25kT2  10  25010  500000  25 
Mnist10k  10  10000  60000  784 
MBasic  10  12000  50000  784 
MRotate  10  12000  50000  784 
MImage  10  12000  50000  784 
MRand  10  12000  50000  784 
MRotImg  10  12000  50000  784 
MNoise1  10  10000  2000  784 
MNoise2  10  10000  2000  784 
MNoise3  10  10000  2000  784 
MNoise4  10  10000  2000  784 
MNoise5  10  10000  2000  784 
MNoise6  10  10000  2000  784 
Letter15k  26  15000  5000  16 
Letter4k  26  4000  16000  16 
Letter2k  26  2000  18000  16 
3.1 Covertype
The original UCI Covertype dataset is fairly large, with 581,012 samples. To generate Covertype290k, we randomly split the original data into two halves, one half for training and the other half for testing. For Covertype145k, we randomly select one half from the training set of Covertype290k and keep the same test set.
3.2 Poker
The UCI Poker dataset originally used only 25,010 samples for training and 1,000,000 samples for testing. Since the test set is very large, we randomly divide it equally into two parts (I and II). Poker25kT1 uses the original training set for training and Part I of the original test set for testing. Poker25kT2 uses the original training set for training and Part II of the original test set for testing. This way, Poker25kT1 can use the test set of Poker25kT2 for validation, and Poker25kT2 can use the test set of Poker25kT1 for validation. As the two test sets are still very large, this treatment should provide reliable results.
Since the original training set (about 25k samples) is too small compared to the size of the test set, we enlarge the training set to form Poker525k, Poker275k, Poker150k, and Poker100k. All four enlarged training datasets use the same test set as Poker25kT2 (i.e., Part II of the original test set). The training set of Poker525k contains the original (25k) training set plus Part I of the original test set. Similarly, the training set of Poker275k / Poker150k / Poker100k contains the original training set plus 250k / 125k / 75k samples from Part I of the original test set.
The original Poker dataset provides 10 features, 5 "suit" features and 5 "rank" features. While the "ranks" are naturally ordinal, it appears reasonable to treat the "suits" as nominal features. By private communication, R. Cattral, the donor of the Poker data, suggested that we treat the "suits" as nominal. C.J. Lin also kindly told us that the performance of SVM was not affected by whether the "suits" were treated as nominal or ordinal. In our experiments, we choose to treat the "suits" as nominal features; hence the total number of features becomes 25 after expanding each "suit" feature into 4 binary features.
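A small sketch of this expansion (our illustration; the assumed column layout, 5 suit columns followed by 5 rank columns, is for demonstration only), showing how 10 raw features become 25:

    import numpy as np

    def expand_poker_features(X_raw):
        """X_raw has 10 columns: 5 'suit' columns (values 1-4) and 5 'rank' columns.
        Each suit column is one-hot encoded into 4 binary columns (nominal treatment),
        while the rank columns are kept ordinal, giving 5 * 4 + 5 = 25 features."""
        suits, ranks = X_raw[:, :5], X_raw[:, 5:]
        onehot = [(suits == v).astype(float) for v in (1, 2, 3, 4)]
        return np.hstack(onehot + [ranks])

    # toy usage: two hands, suit columns first and then rank columns
    X = np.array([[1, 2, 3, 4, 1, 10, 11, 12, 13, 1],
                  [4, 4, 4, 4, 4,  2,  3,  4,  5, 6]])
    print(expand_poker_features(X).shape)  # (2, 25)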
3.3 Mnist
While the original Mnist dataset is extremely popular, it is known to be too easy [9]. Originally, Mnist used 60000 samples for training and 10000 samples for testing.
Mnist10k uses the original (10000) test set for training and the original (60000) training set for testing. This creates a more challenging task.
3.4 Mnist with Many Variations
[9] (www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) created a variety of much more difficult datasets by adding various background (correlated) noise, background images, rotations, etc., to the original Mnist dataset. We shorten the notation for the generated datasets to MBasic, MRotate, MImage, MRand, MRotImg, and MNoise1 to MNoise6.
By private communications with D. Erhan, one of the authors of [9], we learned that the sizes of the training sets actually vary depending on the learning algorithm. For some methods such as SVM, they retrained the algorithms using all 12000 training samples after choosing the best parameters; for other methods, they used 10000 samples for training. In our experiments, we use 12000 training samples for MBasic, MRotate, MImage, MRand, and MRotImg; and we use 10000 training samples for MNoise1 to MNoise6.
Note that the datasets MNoise1 to MNoise6 have merely 2000 test samples each. By private communications with D. Erhan, we understand this was because [9] did not mean to compare the statistical significance of the test errors for those six datasets.
3.5 Letter
The UCI Letter dataset has 20000 samples in total. In our experiments, Letter4k (Letter2k) uses the last 4000 (2000) samples for training and the rest for testing. The purpose is to demonstrate the performance of the algorithms using only small training sets.
We also include Letter15k, which is one of the standard partitions of the Letter dataset, by using 15000 samples for training and 5000 samples for testing.
4 Summary of Experiment Results
We simply use logitboost (or even logit in the plots) to denote robust logitboost.
Table 2 summarizes the test misclassification errors. For all datasets except Poker25kT1 and Poker25kT2, we report the test errors with the tree size $J = 20$ and shrinkage $\nu = 0.1$. For Poker25kT1 and Poker25kT2, we use a different combination of $J$ and $\nu$ (see Sec. 5.2). We report more detailed experiment results in Sec. 5.
For Covertype290k, Poker525k, Poker275k, Poker150k, and Poker100k, as they are fairly large, we only train a limited number of boosting iterations. For all other datasets, we train until the maximum number of iterations is reached or the training loss (3) is close to the machine accuracy. Since we do not notice obvious over-fitting on those datasets, we simply report the test errors at the last iteration.
Dataset  mart  abcmart  logitboost  abclogitboost  # test 

Covertype290k  11350  10454  10765  9727  290506 
Covertype145k  15767  14665  14928  13986  290506 
Poker525k  7061  2424  2704  1736  500000 
Poker275k  15404  3679  6533  2727  500000 
Poker150k  22289  12340  16163  5104  500000 
Poker100k  27871  21293  25715  13707  500000 
Poker25kT1  43575  34879  46789  37345  500000 
Poker25kT2  42935  34326  46600  36731  500000 
Mnist10k  2815  2440  2381  2102  60000 
MBasic  2058  1843  1723  1602  50000 
MRotate  7674  6634  6813  5959  50000 
MImage  5821  4727  4703  4268  50000 
MRand  6577  5300  5020  4725  50000 
MRotImg  24912  23072  22962  22343  50000 
MNoise1  305  245  267  234  2000 
MNoise2  325  262  270  237  2000 
MNoise3  310  264  277  238  2000 
MNoise4  308  243  256  238  2000 
MNoise5  294  244  242  227  2000 
MNoise6  279  224  226  201  2000 
Letter15k  155  125  139  109  5000 
Letter4k  1370  1149  1252  1055  16000 
Letter2k  2482  2220  2309  2034  18000 
4.1 P-Values
Table 3 summarizes the following four types of P-values:

P1: for testing whether abc-mart has significantly lower error rates than mart.

P2: for testing whether (robust) logitboost has significantly lower error rates than mart.

P3: for testing whether abc-logitboost has significantly lower error rates than abc-mart.

P4: for testing whether abc-logitboost has significantly lower error rates than (robust) logitboost.
The P-values are computed using binomial distributions and their normal approximations. Recall that if a random variable $Z \sim Binomial(N, p)$, then the probability parameter $p$ can be estimated by $\hat{p} = Z/N$, and the variance of $Z$ can be estimated by $N\hat{p}(1-\hat{p})$. The P-values can then be computed using the normal approximation of the binomial distribution. Note that the test sets for MNoise1 to MNoise6 are very small because [9] originally did not intend to compare the statistical significance of the test errors on those six datasets. We compute their P-values anyway.
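A hedged sketch of this computation (our reading of the normal-approximation test, comparing two error counts on test sets of the same size; not necessarily the exact formula used for Table 3):

    from math import sqrt
    from statistics import NormalDist

    def p_value(errors_a, errors_b, n_test):
        """One-sided P-value for testing whether algorithm B has a lower error
        rate than algorithm A, using the normal approximation to Binomial(n, p)."""
        p_a, p_b = errors_a / n_test, errors_b / n_test
        var = n_test * p_a * (1 - p_a) + n_test * p_b * (1 - p_b)  # variance of the difference in error counts
        z = (errors_a - errors_b) / sqrt(var)
        return 1.0 - NormalDist().cdf(z)

    # Example with the Mnist10k entries of Table 2: mart (2815 errors) vs. abc-mart (2440 errors) on 60000 test samples.
    print(p_value(2815, 2440, 60000))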
Dataset  P1  P2  P3  P4

Covertype290k  
Covertype145k  
Poker525k  0  0  0  0 
Poker275k  0  0  0  0 
Poker150k  0  0  0  0 
Poker100k  0  0  0  0 
Poker25kT1  0  —  —  0 
Poker25kT2  0  —  —  0 
Mnist10k  
MBasic  0.0164  
MRotate  
MImage  
MRand  
MRotImg  
MNoise1  0.0574  
MNoise2  0.0024  0.0072  0.1158  0.0583 
MNoise3  0.0190  0.0701  0.1073  0.0327 
MNoise4  0.0014  0.0090  0.4040  0.1935 
MNoise5  0.0102  0.0079  0.2021  0.2305 
MNoise6  0.0043  0.0058  0.1189  0.1002 
Letter15k  0.0345  0.1718  0.1449  0.0268 
Letter4k  0.019  
Letter2k  0.001  

The results demonstrate that abc-logitboost and abc-mart considerably outperform logitboost and mart, respectively. In addition, except for Poker25kT1 and Poker25kT2, we observe that abc-logitboost outperforms abc-mart, and logitboost outperforms mart.
4.2 Comparisons with SVM and Deep Learning
For UCI Poker, we know that SVM could only achieve a relatively high error rate (by private communication with C.J. Lin). In comparison, all four algorithms, mart, abc-mart, (robust) logitboost, and abc-logitboost, achieve much smaller error rates (below 10%) on Poker25kT1 and Poker25kT2.
Figure 1 provides the comparisons on the six (correlated) noise datasets: MNoise1 to MNoise6. Table 4 compares the error rates on MBasic, MRotate, MImage, MRand, and MRotImg.
Method  MBasic  MRotate  MImage  MRand  MRotImg

SVM-RBF
SVM-POLY
NNET
DBN-3
SAA-3
DBN-1
mart  4.12%  15.35%  11.64%  13.15%  49.82%
abc-mart  3.69%  13.27%  9.45%  10.60%  46.14%
logitboost  3.45%  13.63%  9.41%  10.04%  45.92%
abc-logitboost  3.20%  11.92%  8.54%  9.45%  44.69%
4.3 Performance vs. Boosting Iterations
5 More Detailed Experiment Results
Ideally, we would like to demonstrate that, with any reasonable choice of the parameters $J$ and $\nu$, abc-mart and abc-logitboost will always improve mart and logitboost, respectively. This is indeed the case on the datasets we have experimented with. In this section, we provide the detailed experiment results on Mnist10k, Poker25kT1, Poker25kT2, Letter4k, and Letter2k.
5.1 Detailed Experiment Results on Mnist10k
For this dataset, we experiment with every combination of $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$. We train the four boosting algorithms until the training loss (3) is close to the machine accuracy, to exhaust the capacity of the learner so that we can provide a reliable comparison, up to a prespecified maximum number of iterations.
Table 5 presents the test misclassification errors and Table 6 presents the P-values. Figures 5, 6, and 7 provide the test misclassification errors for all boosting iterations.

Each cell contains a pair of test misclassification errors, for tree size $J$ (rows) and shrinkage $\nu$ (columns).

mart / abc-mart:

       ν=0.04      ν=0.06      ν=0.08      ν=0.1
J=4    3356 3060   3329 3019   3318 2855   3326 2794
J=6    3185 2760   3093 2626   3129 2656   3217 2590
J=8    3049 2558   3054 2555   3054 2534   3035 2577
J=10   3020 2547   2973 2521   2990 2520   2978 2506
J=12   2927 2498   2917 2457   2945 2488   2907 2490
J=14   2925 2487   2901 2471   2877 2470   2884 2454
J=16   2899 2478   2893 2452   2873 2465   2860 2451
J=18   2857 2469   2880 2460   2870 2437   2855 2454
J=20   2833 2441   2834 2448   2834 2444   2815 2440
J=24   2840 2447   2827 2431   2801 2427   2784 2455
J=30   2826 2457   2822 2443   2828 2470   2807 2450
J=40   2837 2482   2809 2440   2836 2447   2782 2506
J=50   2813 2502   2826 2459   2824 2469   2786 2499

logitboost / abc-logitboost:

       ν=0.04      ν=0.06      ν=0.08      ν=0.1
J=4    2936 2630   2970 2600   2980 2535   3017 2522
J=6    2710 2263   2693 2252   2710 2226   2711 2223
J=8    2599 2159   2619 2138   2589 2120   2597 2143
J=10   2553 2122   2527 2118   2516 2091   2500 2097
J=12   2472 2084   2468 2090   2468 2090   2464 2095
J=14   2451 2083   2420 2094   2432 2063   2419 2050
J=16   2424 2111   2437 2114   2393 2097   2395 2082
J=18   2399 2088   2402 2087   2389 2088   2380 2097
J=20   2388 2128   2414 2112   2411 2095   2381 2102
J=24   2442 2174   2415 2147   2417 2129   2419 2138
J=30   2468 2235   2434 2237   2423 2221   2449 2177
J=40   2551 2310   2509 2284   2518 2257   2531 2260
J=50   2612 2353   2622 2359   2579 2332   2570 2341
[Table 6: P-values (P1, P2, P3, P4) for each combination of $J$ and $\nu$ on Mnist10k.]
The experiment results illustrate that the performances of all four algorithms are stable over a wide range of tree sizes $J$. The shrinkage parameter $\nu$ does not affect the test performance much, although smaller values of $\nu$ result in more boosting iterations (before the training losses reach the machine accuracy).
We further randomly divide the test set of Mnist10k (60000 test samples) equally into two parts (I and II). We then test the algorithms on Part I (using the same training results). We name this "new" dataset Mnist10kT1. The purpose of this experiment is to further demonstrate the stability of the algorithms.
Table 7 presents the test misclassification errors of Mnist10kT1. Compared to Table 5, the misclassification errors of Mnist10kT1 are roughly 50% of the misclassification errors of Mnist10k for all $J$ and $\nu$, as expected since the test set is half the size. This helps establish that our experiment results on Mnist10k provide a very reliable comparison.

Same format as Table 5: tree size $J$ in rows, shrinkage $\nu$ in columns.

mart / abc-mart:

       ν=0.04      ν=0.06      ν=0.08      ν=0.1
J=4    1682 1514   1668 1505   1666 1416   1663 1380
J=6    1573 1382   1523 1320   1533 1329   1582 1288
J=8    1501 1263   1515 1257   1523 1250   1491 1279
J=10   1492 1270   1457 1248   1470 1239   1459 1236
J=12   1432 1244   1427 1234   1444 1228   1436 1227
J=14   1424 1237   1420 1231   1407 1223   1419 1212
J=16   1430 1226   1426 1224   1411 1223   1418 1204
J=18   1400 1222   1413 1218   1390 1210   1404 1211
J=20   1398 1213   1381 1205   1388 1213   1382 1198
J=24   1402 1221   1366 1201   1372 1199   1346 1205
J=30   1384 1211   1374 1208   1368 1224   1366 1205
J=40   1397 1244   1375 1220   1397 1222   1365 1246
J=50   1371 1239   1380 1221   1382 1223   1362 1242

logitboost / abc-logitboost:

       ν=0.04      ν=0.06      ν=0.08      ν=0.1
J=4    1419 1299   1449 1281   1446 1251   1460 1244
J=6    1313 1111   1313 1114   1326 1101   1317 1097
J=8    1278 1058   1287 1050   1270 1036   1262 1058
J=10   1252 1061   1244 1057   1237 1040   1229 1041
J=12   1224 1020   1219 1049   1217 1053   1224 1047
J=14   1213 1038   1207 1050   1201 1039   1198 1026
J=16   1185 1050   1205 1058   1189 1044   1178 1041
J=18   1186 1048   1184 1038   1184 1046   1167 1056
J=20   1185 1077   1199 1063   1183 1042   1184 1045
J=24   1208 1095   1196 1083   1191 1064   1194 1068
J=30   1225 1113   1201 1117   1190 1113   1211 1087
J=40   1254 1159   1247 1145   1248 1127   1249 1127
J=50   1292 1177   1284 1174   1275 1161   1276 1176
5.2 Detailed Experiment Results on Poker25kT1 and Poker25kT2
Recall the original UCI Poker dataset used 25010 samples for training and 1000000 samples for testing. To provide a reliable comparison (and validation), we form two datasets Poker25kT1 and Poker25kT2 by equally dividing the original test set into two parts (I and II). Both use the same training set. Poker25kT1 uses Part I of the original test set for testing and Poker25kT2 uses Part II for testing.
Table 8 and Table 9 present the test misclassification errors for a range of $J$ and $\nu$ values. Comparing these two tables, we can see that the corresponding entries are very close to each other, which again verifies that the four boosting algorithms provide reliable results on this dataset.
For most combinations of $J$ and $\nu$, all four algorithms achieve small error rates. For both Poker25kT1 and Poker25kT2, the lowest test errors are attained at the same combination of $J$ and $\nu$. Unlike Mnist10k, the test errors, especially those of mart and logitboost, are slightly sensitive to the parameters.
Note that when $J$ is small (and $\nu$ is small), the allotted number of training steps would not be sufficient in this case.
mart  abcmart  

145880 90323  132526 67417  124283 49403  113985 42126 
71628 38017  59046 36839  48064 35467  43573 34879  
64090 39220  53400 37112  47360 36407  44131 35777  
60456 39661  52464 38547  47203 36990  46351 36647  