
Meet BootsTAP: An Effective Method for Leveraging Large-Scale, Unlabeled Data to Improve TAP (Tracking-Any-Point) Performance

Feb 22, 2024

In the past few years, generalist AI systems have made remarkable progress in computer vision and natural language processing and are now used in many real-world settings, such as robotics, video generation, and 3D asset creation, where their capabilities bring better efficiency and an enhanced user experience. However, their usefulness in these settings is limited by the models' lack of physical and spatial reasoning.

To address this issue, researchers at Google DeepMind have introduced BootsTAP, a method that enables precise representation of motion in videos and has already shown impressive results in robotics, video generation, and video editing. In the underlying Tracking-Any-Point (TAP) task, a model takes a video and a set of query points as input and returns the tracked positions of those points in the remaining video frames.
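To make the task concrete, here is a minimal sketch of a point-tracking interface of the kind described above; the `track_points` function, its tensor shapes, and the frozen-point "prediction" are illustrative assumptions rather than the authors' actual API.

```python
import numpy as np

def track_points(video: np.ndarray, query_points: np.ndarray):
    """Toy stand-in for a TAP model.

    video:        (num_frames, height, width, 3) RGB frames.
    query_points: (num_queries, 3) rows of (query_frame, y, x).

    Returns:
        tracks:  (num_queries, num_frames, 2) predicted (y, x) per frame.
        visible: (num_queries, num_frames) visibility flags.
    """
    num_frames = video.shape[0]
    num_queries = query_points.shape[0]
    # A real model would predict motion; here every point simply stays put.
    tracks = np.tile(query_points[:, None, 1:3], (1, num_frames, 1))
    visible = np.ones((num_queries, num_frames), dtype=bool)
    return tracks, visible

# Example: a 24-frame clip with two query points given in frame 0.
video = np.zeros((24, 256, 256, 3), dtype=np.uint8)
queries = np.array([[0, 128.0, 64.0], [0, 40.0, 200.0]])
tracks, visible = track_points(video, queries)
print(tracks.shape, visible.shape)  # (2, 24, 2) (2, 24)
```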

Point tracking is a highly general task and serves as a rich source of information about the motion of objects over long time spans. Unlike previous state-of-the-art methods that relied on synthetic data, this work leverages large-scale, unlabeled real-world videos to improve point tracking, using self-consistency as the supervisory signal.

BootsTAP uses a teacher-student setup, with both models initialized by training on a synthetic dataset. The teacher takes an unlabeled video as input, and its predictions act as pseudo-ground truth for the student. A second copy of the video is fed to the student after applying affine transformations, resampling the frames to a lower resolution, and adding JPEG corruption. The student's predictions are then transformed back to the original coordinate space, and a self-supervised loss is computed against the teacher's pseudo labels.
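Conceptually, one self-supervised training step can be sketched as follows; the `DummyTracker` stand-in model, the single downscaling augmentation, and the Huber loss are simplifying assumptions used only to illustrate the teacher-student consistency idea, not the exact BootsTAP recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyTracker(nn.Module):
    """Stand-in for a TAP model: shifts every query point by a learned
    offset in all frames (only here to make the sketch runnable)."""
    def __init__(self):
        super().__init__()
        self.offset = nn.Parameter(torch.randn(2))

    def forward(self, video, queries):
        # video: (T, C, H, W); queries: (N, 2) as (y, x) in the first frame.
        num_frames = video.shape[0]
        return queries[:, None, :].expand(-1, num_frames, -1) + self.offset

def self_supervised_step(teacher, student, video, queries, scale=0.8):
    """One bootstrapping step on an unlabeled clip (simplified sketch)."""
    with torch.no_grad():
        pseudo_tracks = teacher(video, queries)      # pseudo-ground truth, (N, T, 2)

    # Corrupt a second copy of the video: here just a uniform downscale,
    # standing in for the paper's affine transforms, lower-resolution
    # resampling, and JPEG corruption.
    aug_video = F.interpolate(video, scale_factor=scale, mode="bilinear",
                              align_corners=False)
    aug_queries = queries * scale                    # queries in augmented coordinates

    student_tracks = student(aug_video, aug_queries)

    # Map the student's predictions back to the original coordinate space
    # before comparing them with the teacher's pseudo labels.
    loss = F.huber_loss(student_tracks / scale, pseudo_tracks)
    return loss

teacher, student = DummyTracker(), DummyTracker()
video = torch.rand(8, 3, 256, 256)                   # an 8-frame RGB clip
queries = torch.tensor([[128.0, 64.0], [40.0, 200.0]])
loss = self_supervised_step(teacher, student, video, queries)
loss.backward()                                      # gradients reach only the student
print(float(loss))
```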

Subsequently, the teacher's weights are updated as an exponential moving average of the student's, which keeps the teacher a slowly changing, smoothed copy whose predictions tend to be more accurate than the student's. When this scheme is applied to real-world videos, the results show a significant improvement over the previous state of the art across the entire TAP-Vid benchmark.
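The teacher update itself is just an exponential moving average over the student's parameters; a minimal sketch might look like the following, where the decay value of 0.99 is an illustrative assumption rather than the paper's setting.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """teacher_param <- decay * teacher_param + (1 - decay) * student_param.

    A high decay keeps the teacher a slowly moving, smoothed copy of the
    student, which stabilizes the pseudo labels it produces."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
    # Buffers (e.g., normalization statistics) are usually copied directly.
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)
```

In a training loop this would typically be called right after each optimizer step on the student, e.g. `ema_update(teacher, student)`.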

For quantitative evaluation, the researchers used datasets such as TAP-Vid-Kinetics, TAP-Vid-DAVIS, and RoboTAP, which contain real-world videos drawn from the Kinetics-700-2020 validation set, the DAVIS 2017 validation set, and robotics manipulation recordings, among other sources. When compared against other methods such as CoTracker, TAPIR, and TAP-Net, their approach consistently outperformed them on the TAP-Vid benchmark across different metrics, improving both occlusion-prediction accuracy and localization accuracy.
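For context, benchmarks like TAP-Vid score trackers with metrics of this kind; the sketch below computes a thresholded position accuracy and a simple occlusion-prediction accuracy, with the threshold values chosen for illustration rather than taken from the benchmark's evaluation code.

```python
import numpy as np

def position_accuracy(pred_tracks, gt_tracks, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points predicted within each pixel threshold,
    averaged over the thresholds."""
    dist = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (N, T) pixel errors
    return float(np.mean([np.mean(dist[gt_visible] < t) for t in thresholds]))

def occlusion_accuracy(pred_visible, gt_visible):
    """Fraction of (point, frame) pairs whose visibility flag is correct."""
    return float(np.mean(pred_visible == gt_visible))

# Tiny example: 2 points tracked over 4 frames, each off by 1.5 px per axis.
gt = np.zeros((2, 4, 2))
pred = gt + 1.5                            # ~2.12 px error, so thresholds 4/8/16 pass
vis = np.ones((2, 4), dtype=bool)
print(position_accuracy(pred, gt, vis))    # 0.6
print(occlusion_accuracy(vis, vis))        # 1.0
```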

In conclusion, the authors of this paper introduced an effective method for improving TAP performance. Although their work has some limitations, such as computationally expensive training and no elegant way of handling duplicated objects, it still outperforms many previous methods across different metrics and demonstrates the capability of self-supervised learning to bridge the sim-to-real gap.


Check out the Paper. All credit for this research goes to the researchers of this project.

