Imagine if we could study the movement of objects in videos by tracking their position and orientation, and how different points on each object move. This information would be useful for making inferences about the 3D properties, physical properties, and interactions of various objects.
So, what is the most basic step toward this goal?
Suppose we can determine the position of some marked points in every video frame. We can then study the motion of an object by attaching these points to it and following how they move.
This is what the authors of the paper are trying to do. The problem statement looks like this: “How can we follow the movement of a point in a long video?”
Although some attention has been paid to this problem, there is still a need for a dataset or benchmark for evaluation. Some existing methods solve the problem partially. The most popular ones, bounding-box and segmentation tracking algorithms, provide limited information about the deformation and rotation of objects. Optical flow models can track a point on any surface, but only over a few frames, and they are also limited in their ability to estimate occlusion (see Fig. 2 for other methods). This work can be seen as an extension of optical flow to longer videos with occlusion estimation.
In this paper, DeepMind researchers also introduce a companion benchmark, TAP-Vid (TAP stands for Track Any Point), which is composed of real-world videos with precise human annotations of point tracks and synthetic videos with perfect ground-truth point tracks. The benchmark is created using a new semi-automatic crowdsourcing pipeline that uses optical flow to estimate tracks over shorter time periods. The annotators correct the optical flow estimates, which lets them focus on the more difficult sections of the video.
Tracking points in a synthetic environment is quite simple. Even though humans are quite good at tracking points, obtaining ground-truth tracks for real-world video is tedious and time-consuming because objects and cameras tend to move in complex ways. This may be one of the reasons why this problem has received so little attention.
Let’s talk about how the dataset is created…
The benchmark contains real and synthetic videos. Approximately 1,189 real YouTube videos from the Kinetics dataset and 30 videos from the DAVIS evaluation dataset, with approximately 25 points per video, were annotated by a small group of annotators over multiple rounds of proofreading and correction. A total of four datasets are created: TAP-Vid-Kinetics, TAP-Vid-DAVIS, TAP-Vid-Kubric (synthetic), and TAP-Vid-RGB-Stacking (synthetic).
The model is trained only on the synthetic Kubric dataset. The other three datasets are used for evaluation and testing, since transfer from a synthetic dataset to a real dataset is more representative of moving from seen to unseen environments.
Now, back to the baselines and the proposed point-tracking algorithm. Extensions of COTR, VFS, and RAFT are used as baselines for comparison with the proposed TAP-Net. These algorithms struggle with occlusion, deformation of objects, and transfer from a synthetic environment to a real environment, respectively.
The approach of TAP-Net is inspired by cost volumes. First, a grid of features is computed for the given video. Next, the features at the query point are compared to the features at every other location in the video, and a cost volume is calculated (see Fig. 4). After that, a simple neural network is applied to the cost volume for each frame, predicting the occlusion logit and the location of the point in that frame (see Fig. 5). Since the same model performs two different tasks, the loss function is simply a weighted combination of two losses: a Huber loss for position regression and a standard cross-entropy loss for occlusion classification.
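The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`cost_volume`, `soft_argmax`, `tap_loss`), the dot-product similarity, the plain soft-argmax standing in for the small neural network, the Huber loss applied to the Euclidean distance, the masking of the position loss on occluded frames, and the `occ_weight` parameter are all choices made for the example.

```python
import numpy as np


def cost_volume(query_feat, video_feats):
    """Compare one query feature with every cell of the video feature grid.

    query_feat:  (C,) feature sampled at the query point.
    video_feats: (T, H, W, C) feature grid for the whole video.
    Returns a (T, H, W) array of dot-product similarities (assumed metric).
    """
    return np.einsum("thwc,c->thw", video_feats, query_feat)


def soft_argmax(cost):
    """Convert each frame's (H, W) cost map into an expected (x, y) position.

    Hypothetical stand-in for the small network that reads the cost volume.
    """
    T, H, W = cost.shape
    flat = cost.reshape(T, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(flat)
    prob = (prob / prob.sum(axis=1, keepdims=True)).reshape(T, H, W)
    xs = (prob.sum(axis=1) * np.arange(W)).sum(axis=1)  # expected column
    ys = (prob.sum(axis=2) * np.arange(H)).sum(axis=1)  # expected row
    return np.stack([xs, ys], axis=-1)  # (T, 2)


def huber(dist, delta=1.0):
    """Huber penalty: quadratic below delta, linear above it."""
    clipped = np.minimum(dist, delta)
    return 0.5 * clipped**2 + delta * (dist - clipped)


def tap_loss(pred_xy, occ_logit, true_xy, true_occ, occ_weight=1.0):
    """Weighted sum of a position loss and an occlusion loss.

    pred_xy, true_xy: (T, 2) positions; occ_logit, true_occ: (T,).
    true_occ is 1.0 where the point is occluded, 0.0 where it is visible.
    Assumption: the position loss is ignored on occluded frames.
    """
    dist = np.linalg.norm(pred_xy - true_xy, axis=-1)
    position = ((1.0 - true_occ) * huber(dist)).mean()
    # Sigmoid cross-entropy on logits, written in a numerically stable form.
    occlusion = (np.maximum(occ_logit, 0.0) - occ_logit * true_occ
                 + np.log1p(np.exp(-np.abs(occ_logit)))).mean()
    return position + occ_weight * occlusion


# Usage with random features, just to show the shapes involved.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16, 16, 32))      # T=8 frames, 16x16 grid
query = feats[0, 4, 7]                         # query point in frame 0
tracks = soft_argmax(cost_volume(query, feats))
print(tracks.shape)
```

In this sketch, a sharper peak in a frame's cost map pulls the predicted position closer to the best-matching grid cell, and `occ_weight` controls how much the occlusion term counts relative to the position term.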
TAP-Net outperforms all the baseline methods on all four datasets. You can see the results in Table 1.
Advances in TAP should prove very useful for robotic object manipulation and for certain applications of reinforcement learning. Even though TAP-Net outperforms all the baseline methods, it still has some limitations. Neither the benchmark nor TAP-Net can handle liquids or transparent objects, and the annotations for real-world videos cannot be guaranteed to be 100% correct.
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.
Vineet Kumar is an intern consultant at MarktechPost. He is currently pursuing his BS from Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast. He is passionate about research and the latest advances in deep learning, computer vision and related fields.