Random Decision Forest’s Interaction Affordances

(This post responds to an assignment for MAS.S62 Interactive Machine Learning at the MIT Media Lab to analyze the input and output channels of a machine learning algorithm for their potential as affordances for interaction.)

When examined for its interaction affordances, the Random Decision Forest algorithm (Breiman 2001) distinguishes itself from other machine learning algorithms in its potential for transparency. Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how that state was derived from the training data.

In this analysis, I discuss five outputs that are available from a Random Decision Forest and ways they could be used to provide interface or visualization options for a lay user of such a classifier. I also describe one input that could be similarly useful.

(For each output and input, I provide a link to the corresponding function in the OpenCV Random Decision Forest implementation. Other implementations should also provide similar access.)

Output: variable importance

In addition to returning the classification result, most Random Decision Forest implementations can also provide a measure of how important each variable in the feature vector was to the result. These importance scores are calculated by perturbing (for example, randomly permuting) the values of each variable one at a time and measuring the corresponding increase in the misclassification rate.

Presenting this data to the user in the form of a table, ranked list, or textual description could aid in feature selection and also help improve user understanding of the underlying data.

OpenCV’s implementation: CvRTrees::get_var_importance()
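
As a rough sketch of what such an interface might consume (using scikit-learn's RandomForestClassifier and its feature_importances_ attribute, mentioned in the comments below, rather than the OpenCV call above, and the iris dataset purely as stand-in training data):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(data.data, data.target)

    # One importance score per feature-vector variable; higher means the
    # variable contributed more to the forest's decisions.
    ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, score in ranked:
        print(f"{name}: {score:.3f}")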

Output: proximity between any two samples

A trained Random Decision Forest can calculate the proximity between any two samples in the training set. Proximity is the fraction of trees in the ensemble in which the two samples end up in the same leaf node.

This proximity data could be presented to the user of an interactive machine learning system both to improve the user’s understanding of the current state of training and to suggest additional labeled samples that would significantly improve classification. By calculating the proximity of every pair of samples in the training set (or a large subset of them), a system could produce a navigable visualization of the existing training samples that would help the user identify mislabeled samples, craft useful additional samples, and understand the causes of the system’s predictions.

OpenCV’s implementation: CvRTrees::get_proximity()
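
scikit-learn does not expose a proximity call directly, but the same quantity can be computed from the leaf indices returned by its apply() method. A rough sketch, again on stand-in iris data:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # apply() returns the leaf index each sample reaches in every tree:
    # an array of shape (n_samples, n_trees).
    leaves = forest.apply(X)

    # Proximity of samples i and j = fraction of trees in which they share a leaf.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

    # e.g. the training samples the forest considers most similar to sample 0
    # (the first index will be sample 0 itself, with proximity 1.0)
    print(np.argsort(proximity[0])[::-1][:6])

A matrix like this could then be embedded in two dimensions (with multidimensional scaling, for example) to produce the kind of navigable map of the training samples described above.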

Output: prediction confidence

Due to the ensemble structure of a Random Decision Forest, the classifier can calculate a confidence score for each of its predictions. The score is simply the proportion of decision trees in the forest that voted for the winning classification of the given sample.

This confidence could be presented to a user in several ways. A user could set a confidence threshold below which predictions are ignored; the system could prompt the user for additional labeled samples whenever the confidence is too low; or the confidence could be reflected in the visual presentation of the prediction (size, color, etc.) so that the user can take it into consideration.

OpenCV’s implementation: CvRTrees::predict_prob() (Note: OpenCV’s implementation only works on binary classification problems.)
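
A sketch of a simple confidence gate using scikit-learn's predict_proba() (which, unlike OpenCV's predict_prob(), also handles more than two classes); note that scikit-learn averages the trees' probability estimates rather than counting raw votes, but the score plays the same role. The 0.8 threshold is just an illustrative value:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Per-class scores for each sample; the winning class's score is the confidence.
    probabilities = forest.predict_proba(X[:5])
    confidence = probabilities.max(axis=1)

    THRESHOLD = 0.8  # below this, defer to the user instead of trusting the prediction
    for pred, conf in zip(forest.predict(X[:5]), confidence):
        label = pred if conf >= THRESHOLD else "ask the user for a label"
        print(label, f"(confidence {conf:.2f})")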

Output: individual decision trees

Since Random Decision Forest is usually implemented on top of a simpler decision tree classifier, many implementations provide direct access to the individual decision trees that make up the ensemble.

With access to the individual decision trees, an application could provide the user with a comprehensive visualization of the forest’s operation, including the error rate of each individual tree and the variable on which each tree split at each node. This visualization could aid in feature selection and in an in-depth evaluation and exploration of the quality of the training set.

OpenCV’s implementation: CvRTrees::get_tree()
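
A sketch of how those per-tree views might be pulled out of scikit-learn, whose estimators_ attribute exposes the fitted trees and whose export_text() helper renders a single tree's splits as text:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_text

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=10, random_state=0)
    forest.fit(data.data, data.target)

    # estimators_ holds the individual fitted decision trees in the ensemble.
    for i, tree in enumerate(forest.estimators_):
        print(f"tree {i}: depth {tree.get_depth()}, {tree.get_n_leaves()} leaves")

    # A full textual rendering of one tree's splits, suitable for display.
    print(export_text(forest.estimators_[0], feature_names=data.feature_names))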

Output: training error

Unlike many other machine learning methods, Random Decision Forests store their training samples internally as they construct their decision trees, so they can evaluate their own training error once training is complete. On classification problems, this error is calculated as the percentage of misclassified training samples; on regression problems, it is the mean squared error.

This error metric is simple enough that it could be shown to an end user as a basic form of feedback on the current state of training quality. However, without other metrics, it risks encouraging the user to work towards overfitting the training set.

OpenCV’s implementation: CvRTrees::get_train_error() (Note: OpenCV’s implementation only works on classification problems.)
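
A sketch of the same measurement in scikit-learn, where score() on the training data gives training accuracy; the out-of-bag estimate shown alongside is one way to soften the overfitting concern mentioned above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # Fraction of misclassified training samples.
    print(f"training error: {1.0 - forest.score(X, y):.3f}")

    # Each tree can also be tested on the samples it never saw during bagging,
    # giving a less optimistic (out-of-bag) error estimate.
    oob_forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                        random_state=0).fit(X, y)
    print(f"out-of-bag error: {1.0 - oob_forest.oob_score_:.3f}")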

Input: max number of trees in the forest

The most important user-facing input to a Random Decision Forest is the maximum number of trees allowed in the forest. Up to the point of diminishing returns, this parameter is essentially a proxy for the trade-off between training time and result quality.

This could be presented to the user as a slider, allowing them to choose faster training or better results throughout the process of interactively improving a classifier.

OpenCV’s implementation: the max_num_of_trees_in_the_forest parameter of CvRTParams, which sets the forest’s termination criteria.
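
A sketch of how a "number of trees" slider might drive scikit-learn's equivalent parameter, n_estimators (mentioned in the comments below); the warm_start option keeps the existing trees and only grows the new ones each time the slider moves up, which suits the interactive loop described here:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(n_estimators=10, warm_start=True,
                                    random_state=0).fit(X, y)

    for slider_value in (50, 100, 200):    # successive positions of a hypothetical slider
        forest.set_params(n_estimators=slider_value)
        forest.fit(X, y)                    # trains only the newly added trees
        print(slider_value, "trees, training accuracy:", forest.score(X, y))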


2 Responses to Random Decision Forest’s Interaction Affordances

  1. Joe McCarthy says:

    Interesting collection of potentially visualizable aspects of Random Forests. While these seem like they would be very useful to someone with sufficient background in the use of machine learning tools, I wonder how many of them would be easily understandable to the uninitiated. Seems like a fruitful area for collaboration between ML and HCI researchers.

    I have not used OpenCV, but I’ve recently been working with Scikit-Learn’s open source Python-based implementation of several machine learning algorithms, including Random Forest Classifier. FWIW, 2 of the 5 output elements mentioned above, and the input element, are currently supported by this implementation; I suspect the other elements may currently be hidden in the code, but would have to dig around further to see whether / how they might be exposed.

    Output: variable importance
    sklearn attribute: feature_importances_

    Output: prediction confidence
    sklearn method: predict_proba(X)

    Input: Max number of trees in the forest
    sklearn parameter: n_estimators

  2. greg says:

    Thanks for the comment, Joe, and especially for posting the SciKit functions. As mentioned at the top of the post, I wrote this as part of a class I’m in at the MIT Media Lab on Interactive Machine Learning. IML is a pretty new field and it sits exactly at the intersection between ML and HCI you’re describing. I decided to post my work publicly despite knowing that it’s technical enough to be difficult to access for most readers in the hopes that it would at least be useful to some. At this point, the obstacles to even the most-basic uses of machine learning are still quite high for most people (something I’m working on through my efforts to add wrappers for OpenCV’s machine learning functions to my OpenCV library for Processing).
