Gesture-aware Interactive Machine Teaching with In-situ Object Annotations

The University of Tokyo
UIST 2022

Our IMT system can annotate the object of interest in real time when the user performs teaching using gestures.


Interactive Machine Teaching (IMT) systems allow non-experts to easily create Machine Learning (ML) models. However, existing vision-based IMT systems either ignore annotations on the objects of interest or require users to annotate in a post-hoc manner. Without annotations on objects, the model may misinterpret the objects using unrelated features. Post-hoc annotations cause additional workload, which diminishes the usability of the overall model-building process. In this paper, we develop LookHere, which integrates in-situ object annotations into vision-based IMT. LookHere exploits users' deictic gestures to segment the objects of interest in real time. This segmentation information can additionally be used for training. To achieve reliable performance of this object segmentation, we utilize our custom dataset called HuTics, comprising 2040 front-facing images of deictic gestures toward various objects by 170 people. The quantitative results of our user study showed that participants were 16.3 times faster in creating a model with our system compared to a standard IMT system with a post-hoc annotation process, while achieving comparable accuracy. Additionally, models created by our system showed a significant accuracy improvement ($\Delta mIoU=0.466$) in segmenting the objects of interest compared to those without annotations.
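The segmentation improvement above is reported as a gain in mean Intersection-over-Union (mIoU). As a reference for the metric (not code from the paper), a minimal sketch of mIoU over label masks, averaging per-class IoU and skipping classes absent from both prediction and ground truth, might look like:

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union across classes.

    pred, gt: integer label masks of the same shape.
    Classes absent from both masks are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class c appears in neither mask
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```

For a binary object-vs-background task (`num_classes=2`), a perfect segmentation yields an mIoU of 1.0, so a gain of 0.466 represents a large fraction of the attainable range.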



Models trained with standard IMT systems may misinterpret the objects of interest by relying on unrelated features.

The user can clarify the object they want to teach through post-hoc annotation, but this process is time-consuming.


Our system, LookHere, achieves in-situ annotation while the user performs teaching.


HuTics is a dataset of human deictic gestures that includes 2040 images collected from 170 people. It covers four kinds of deictic gestures toward objects: exhibiting, pointing, presenting, and touching. Note that we recruited human labelers only to annotate the objects of interest; the human hands (and arms) were predicted using [this project]. Therefore, the hand annotations in the dataset are not always correct.






@misc{zhou2022gestureaware,
    doi = {10.48550/ARXIV.2208.01211},
    url = {},
    author = {Zhou, Zhongyi and Yatani, Koji},
    title = {Gesture-aware Interactive Machine Teaching with In-situ Object Annotations},
    publisher = {arXiv},
    year = {2022}
}

@inproceedings{zhou2021enhancing,
    author = {Zhou, Zhongyi and Yatani, Koji},
    title = {Enhancing Model Assessment in Vision-Based Interactive Machine Teaching through Real-Time Saliency Map Visualization},
    year = {2021},
    isbn = {9781450386555},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {},
    doi = {10.1145/3474349.3480194},
    pages = {112--114},
    numpages = {3},
    keywords = {Visualization, Saliency Map, Interactive Machine Teaching},
    location = {Virtual Event, USA},
    series = {UIST '21}
}