GestuRe: An Interactive Machine Learning System for Recognizing Hand Gestures

(This post describes a project I completed for MAS.S62 Interactive Machine Learning at the MIT Media Lab. The assignment was to create a simple interactive machine learning system that gave the user visibility into the learning algorithm.)

GestuRe is a mixed-initiative interactive machine learning system for recognizing hand gestures. It attempts to give the user visibility into the classifier’s prediction confidence and control of the conditions under which the system actively requests labeled gestures when its predictions are uncertain.

Training object or gesture recognition systems is often a tedious and uncertain process. The slow loop from gathering images to training a model to testing the classifier separates the process of selecting training samples from the situation in which the system is actually used.

In building such systems, I’ve frequently been frustrated by the inability to add corrective samples at the moment a mistaken classification occurs. I’ve also struggled to feel confident in the state of the classifier at any point in the training process. This system attempts to address both problems, producing a recognition system that is fluid to train and whose reliability can be understood and improved.

GestuRe creates a feature vector from an input image from a live camera using Histogram of Oriented Gradients [Dalal & Triggs 2005]. The user selects a class label for these input images and the system then trains a Support Vector Machine-based classifier on these labeled samples [Vapnik 1995].

The system then displays to the user the prediction likelihood for each submitted class as well as the current classification. Further, the system shows the user all of the training samples captured for each class. Note: everywhere in the interface that the system presents a class to the user, it uses one of the training images to represent that class rather than text. This makes the interface easier to comprehend when the user’s attention is split between its output and their own image in the live video feed.

The user submits labeled samples in two phases. First they establish the classes by submitting a series of images for each distinct gesture they want the system to detect. Then, after initial training, the system begins classifying the live gestures presented by the user and displaying its results. In this phase, whenever the system’s confidence in its top prediction (as represented by the gap in probability between the most likely class and the second most likely) falls below a user-defined threshold, the system prompts the user for additional training samples.

This prompt consists of a modal interface which presents the user with a snapshot of the gesture that caused the low confidence prediction. Alongside this snapshot, the system presents a series of images representing each known class. The user selects the correct class, creating a new labeled sample and the system retrains the classifier, hopefully increasing prediction confidence.

This modal active learning mode places high demands on the user, risking the danger of the user feeling like they’re being treated as an oracle. To alleviate this feeling, GestuRe gives the user a series of parameters to control the conditions under which it prompts them for a labeled sample. First amongst these is a “confidence threshold”, which determines the minimum probability gap between the top two classes that will trigger a request for a label. A lower confidence threshold results in fewer requests but more incorrect predictions. A high threshold results in more persistent requests for labels but the eventual training of a higher quality classifier.

Second, the user can control how long the system will endure low-confidence predictions before requesting a sample. Since prediction probabilities fluctuate rapidly with live video input, the confidence threshold alone would trigger active learning too frequently, even on a well-trained classifier, simply do to the variations in the video input and, especially, the ambiguous states as the user moves their hands between well-defined gestures. The “time before ask” slider allows the user to determine the number of sequential below-threshold predictions before the system will prompt for a labeled sample. The system also displays a progress bar so the user can get a sense for when the system’s predictions are below the confidence threshold and how close its coming to prompting for more labeled samples.

Finally, the system allows the user to turn active training off altogether. This mode is especially useful when adding a new gesture to the system by submitting a batch of samples. Also, it allows the user to experience the current quality of the system without being interrupted for new labels.

GestuRe could be further improved in two ways. First, it would help to show the user a visualization of the histogram of oriented gradients representation that is actually used in classification. This would help them identify noisy scenes, variable hand position, and other factors that were contributing to low classification confidence. Secondly, it would help to identify which classes needed additional clarifying samples. Possibly performing offline cross-validation on the saved samples in the background could help determine if the model had lower accuracy or precision for any particular class.

Finally, I look forward to testing GestuRe on other types of recognition problems beyond hand gestures. In addition to my past work with object recognition mentioned above, during development I discovered that GestuRe can be used to classify facial expressions as well as this video, using an early version of the interface, demonstrates:

Interactive Machine Learning with Funny Faces from Greg Borenstein on Vimeo.

I implemented GestuRe using the OpenCV computer vision library, libsvm, and the Processing creative coding framework. In particular, I used OpenCV for Processing, my own OpenCV wrapper library, as well as PSVM, my libsvm wrapper. (While OpenCV includes implementations of a series of machine-learning algorithms, its SVM implementation is based on an older version of libsvm which performs significantly worse with the same settings and data.)

GestuRe’s source code is available on Github here: atduskgreg/gestuRe.

This entry was posted in Art. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *