Random Forest Classification

Exploration of the Random Forest Classifier

For the prediction of tree species, a Random Forest classifier was used. Random Forest is a widely used machine learning algorithm based on an ensemble of relatively simple decision tree models. At each node of a tree, the data are split on the basis of variables in the predictor dataset. While Random Forest can also be used for regression problems, it is applied here for classification. The terminal node of each individual tree assigns a class, and a majority vote over all trees grown in the forest then decides the final class of a given observation.
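As a minimal sketch of this setup, the following Python snippet trains a Random Forest on a pixel table using scikit-learn. The data and feature dimensions are placeholders, not the actual Caldern dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Placeholder data: 1,000 pixels, 3 predictors, 4 tree species
X = rng.normal(size=(1000, 3))
y = rng.choice(["Beech", "Oak", "Spruce", "Dgl. Spruce"], size=1000)

# 500 decision trees; each tree votes on a class and the majority wins
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

print(rf.predict(X[:5]))  # majority-vote class for the first five pixels
```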

Here, the model was trained using forward feature selection on variables derived from an RGB image of the University Forest in Caldern. The following predictors proved most useful for modelling tree species; the gain in Kappa score from adding each variable to the model is given alongside, and a sketch of the selection procedure follows the table.

| Predictor | Kappa gain |
|---|---|
| Red Band | - |
| Color Index of Vegetation Extraction (CIVE) | 0.178 |
| First PCA Sobel Filter (window 15) | 0.055 |
| Second PCA Mean Filter (window 5) | 0.067 |
| First PCA Gauss Filter (window 5) | 0.011 |
| Triangular Greenness Index (TGI) | 0.013 |
| Second PCA Sobel Filter (window 15) | 0.011 |
| First PCA Mean Filter (window 5) | 0.009 |
| First PCA Sobel Filter (window 21) | 0.009 |
| Second PCA Laplacian of Gaussian Filter (window 15) | 0.008 |
| First PCA Gauss Filter (window 15) | 0.004 |
| Blue Band | 0.002 |
| Green Leaf Index (GLI) | 0.003 |
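The forward feature selection works greedily: at each step, the candidate variable with the highest cross-validated Kappa is added to the model. A minimal Python sketch of this logic is shown below; the predictor matrix, labels, and variable names are placeholders, and the original workflow's exact stopping rule and cross-validation setup may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

def forward_feature_selection(X, y, names, cv=5):
    """Greedily add the predictor with the largest cross-validated Kappa."""
    kappa = make_scorer(cohen_kappa_score)
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    best = -np.inf
    while remaining:
        scores = []
        for j in remaining:
            rf = RandomForestClassifier(n_estimators=100, random_state=0)
            s = cross_val_score(rf, X[:, selected + [j]], y,
                                scoring=kappa, cv=cv).mean()
            scores.append((s, j))
        s, j = max(scores)
        if s <= best:  # stop once no candidate improves the Kappa score
            break
        best = s
        selected.append(j)
        remaining.remove(j)
        history.append((names[j], s))  # cumulative Kappa; gains are successive differences
    return history

# Placeholder demo data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = rng.choice(["Beech", "Oak", "Spruce", "Dgl. Spruce"], 500)
print(forward_feature_selection(X, y, [f"var_{i}" for i in range(6)]))
```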

The code for calculating different vegetation indices based on RGB can be found here. To reduce the dimensionality of the predictor dataset, we applied a principal component analysis (PCA). The first two principal components combined explain over 80% of the variance of all the vegetation indices in the predictor dataset. This way, spatial filters could be applied solely to the principal components, which dramatically reduced the computation time of the forward feature selection owing to the smaller number of predictors.
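To illustrate the preprocessing, the snippet below computes three of the RGB indices from the table (using common formulations from the literature, which may differ slightly from the ones actually used), reduces the index stack to two principal components, and applies the kinds of spatial filters named above. The RGB bands are random placeholders:

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter, gaussian_filter, gaussian_laplace
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
r, g, b = rng.random((3, 200, 200))  # placeholder RGB bands, scaled 0-1

eps = 1e-9
gli = (2 * g - r - b) / (2 * g + r + b + eps)      # Green Leaf Index
cive = 0.441 * r - 0.811 * g + 0.385 * b + 18.787  # Color Index of Vegetation Extraction
tgi = g - 0.39 * r - 0.61 * b                      # Triangular Greenness Index (simplified)

# PCA across all index layers, one sample per pixel
stack = np.stack([gli, cive, tgi], axis=-1).reshape(-1, 3)
pca = PCA(n_components=2).fit(stack)
print(pca.explained_variance_ratio_.sum())  # >0.8 for the real index stack; placeholder data will differ

pc1 = pca.transform(stack)[:, 0].reshape(200, 200)

# Spatial filters of the kind used as predictors; the table's window sizes
# map only loosely onto scipy's kernel and sigma parameters
pc1_sobel = sobel(pc1)                        # Sobel edge filter (fixed 3x3 kernel in scipy)
pc1_mean5 = uniform_filter(pc1, size=5)       # mean filter, window 5
pc1_gauss5 = gaussian_filter(pc1, sigma=5)    # Gaussian smoothing
pc1_log15 = gaussian_laplace(pc1, sigma=15)   # Laplacian of Gaussian
```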

The forward feature selection chose 13 variables to model the tree species (see above). The Kappa score was chosen as the evaluation metric, and the final internal Kappa was 0.37. According to the common interpretation of the Kappa score, this corresponds to only a fair level of agreement:

| Score | Level of Agreement |
|---|---|
| < 0 | Less than chance |
| 0.00 - 0.20 | Slight |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Substantial |
| 0.81 - 1.00 | Almost perfect |
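For reference, Cohen's Kappa corrects the observed agreement for the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement (the overall accuracy) and $p_e$ is the agreement expected from the class marginals alone. A Kappa of 0.37 therefore means the model performs clearly better than chance, but is far from perfect agreement.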

The final variables chosen for classification and their contribution to the increase in the Kappa score are displayed below:

Figure: Kappa scores for added variables

This relatively low Kappa score might result from an inadequate sampling strategy. Here, 400 trees, 100 for each species, were polygonised in QGIS and assigned to a tree species based on visual interpretation of the RGB image as well as an additional shapefile indicating forest sections and the dominant species within them. Pixels were extracted via the IDs of the individual polygons, yielding a training sample of 8,000 pixels (20 pixels from each tree). The remaining pixels were used to calculate an independent Kappa value of 0.26, which also corresponds to a fair level of agreement. A sketch of this split follows.
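A hedged sketch of the train/test split and the external Kappa, using a placeholder pixel table (the column names tree_id, species, and the predictors are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Placeholder extraction result: 400 polygons with 50 pixels each
pixels = pd.DataFrame({
    "tree_id": np.repeat(np.arange(400), 50),
    "species": np.repeat(rng.choice(["Beech", "Oak", "Spruce", "Dgl. Spruce"], 400), 50),
    "pred_1": rng.normal(size=20000),
    "pred_2": rng.normal(size=20000),
})

# 20 pixels per tree -> 8,000 training pixels; the rest validates independently
train = pixels.groupby("tree_id").sample(n=20, random_state=0)
test = pixels.drop(train.index)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(train[["pred_1", "pred_2"]], train["species"])

# External Kappa on the held-out pixels (the internal 0.37 came from
# cross-validation during feature selection, which is not reproduced here)
print(cohen_kappa_score(test["species"], rf.predict(test[["pred_1", "pred_2"]])))
```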

Figure: Internal and external Kappa scores (n = 8,000 and 651,041, respectively)

The low level of agreement reflects confusion between species. The largest component of the error stems from oak pixels being classified as beech. A large proportion of pixels from the Douglas Spruce class is also wrongly classified as beech. Because this error is substantial, systematically overestimating the beech class while underestimating the others, the model needs improvement before it can classify tree species accurately, especially regarding the differentiation of beech from the other species. A different sampling design could prove beneficial, whereas it seems unlikely that a higher level of agreement could be achieved with more or different predictors based on RGB images alone. Structural predictors derived from LiDAR data, in contrast, could possibly yield higher Kappa scores.

Figure: Alluvial plot of the classification error

Class-specific accuracy values are as follows:

| Class | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value | Balanced Accuracy |
|---|---|---|---|---|---|
| Beech | 0.68 | 0.57 | 0.45 | 0.78 | 0.63 |
| Dgl. Spruce | 0.28 | 0.94 | 0.53 | 0.83 | 0.61 |
| Oak | 0.45 | 0.77 | 0.44 | 0.78 | 0.61 |
| Spruce | 0.35 | 0.97 | 0.72 | 0.88 | 0.66 |
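These one-vs-rest metrics can be derived directly from the confusion matrix. A small sketch of the computation (metric names match the table; the labels and predictions are placeholders):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def class_metrics(y_true, y_pred, labels):
    """Per-class sensitivity, specificity, PPV, NPV and balanced accuracy (one vs. rest)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = cm.sum() - tp - fn - fp
        sens = tp / (tp + fn)  # sensitivity (recall)
        spec = tn / (tn + fp)  # specificity
        ppv = tp / (tp + fp)   # positive predictive value (precision)
        npv = tn / (tn + fn)   # negative predictive value
        print(f"{label}: sens={sens:.2f} spec={spec:.2f} ppv={ppv:.2f} "
              f"npv={npv:.2f} bal_acc={(sens + spec) / 2:.2f}")

# Placeholder usage with random labels
rng = np.random.default_rng(1)
labels = ["Beech", "Dgl. Spruce", "Oak", "Spruce"]
class_metrics(rng.choice(labels, 1000), rng.choice(labels, 1000), labels)
```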