Today, AI finds applications in almost every field imaginable, transforming how we work by streamlining processes and improving efficiency. Its capabilities could be extended further by advances in understanding human skill, which would enable applications such as virtual coaching, robotics, and even social networking. This research paper focuses on equipping AI systems with better human skill comprehension.
Capturing human skill requires both egocentric (first-person) and exocentric (third-person) viewpoints, as well as a synergy between the two, since learning often means mapping others' behavior onto our own. Existing datasets cannot realize this potential: ego-exo datasets are few, small in scale, and often lack synchronization across cameras. To tackle this, researchers at Meta have introduced Ego-Exo4D, a foundational dataset that is multimodal, multiview, and large scale, with diverse scenes captured in multiple cities worldwide.
Both viewpoints are sometimes necessary for full comprehension; a chef, for example, might explain the equipment from a third-person perspective while showing their hand movements from a first-person perspective. To support this goal of better human skill understanding, each Ego-Exo4D sequence pairs a first-person view with multiple exocentric views, and the researchers have ensured that all views are time-synchronized. The multiview data was collected with an ego-exo camera rig that covers both close-body shots and full-body poses, as illustrated in the sketch below.
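Because every sequence pairs one egocentric stream with several time-synchronized exocentric streams, downstream code typically needs to line up frames across cameras on a shared clock. The following Python sketch is a minimal, hypothetical illustration of that idea; it is not the official Ego-Exo4D loader, and the Frame structure, file paths, and timestamps are assumptions made purely for illustration.

```python
# Minimal sketch (not the official Ego-Exo4D API): align one egocentric
# stream with several exocentric streams by nearest capture timestamp.
# Field names, paths, and timestamps below are illustrative only.
from dataclasses import dataclass
import bisect

@dataclass
class Frame:
    timestamp: float  # seconds on a shared capture clock
    path: str         # path to the decoded frame on disk

def nearest_frame(frames: list[Frame], t: float) -> Frame:
    """Return the frame whose timestamp is closest to t (frames sorted by time)."""
    times = [f.timestamp for f in frames]
    i = bisect.bisect_left(times, t)
    candidates = frames[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda f: abs(f.timestamp - t))

def synchronized_views(ego: list[Frame], exo_views: dict[str, list[Frame]]):
    """Yield (ego_frame, {camera_name: nearest exo_frame}) pairs, one per ego frame."""
    for ego_frame in ego:
        yield ego_frame, {
            cam: nearest_frame(frames, ego_frame.timestamp)
            for cam, frames in exo_views.items()
        }

# Toy usage with two frames per stream:
ego = [Frame(0.000, "ego/0000.jpg"), Frame(0.033, "ego/0001.jpg")]
exo = {"cam1": [Frame(0.001, "cam1/0000.jpg"), Frame(0.034, "cam1/0001.jpg")]}
for e, views in synchronized_views(ego, exo):
    print(e.path, {k: v.path for k, v in views.items()})
```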
Ego-Exo4D focuses on skilled human activities, capturing body pose and interaction with objects across diverse domains such as cooking and bike repair. Unlike prior efforts that record in lab environments, the data was captured in authentic settings. For data collection, the researchers recruited more than 800 participants and ensured robust privacy and ethics standards were followed.
All videos in the dataset are paired with time-indexed language: camera wearers narrate their own actions, third-person annotators describe every camera shot, and experts critique the camera wearer's performance, which sets the dataset apart from others. Furthermore, because ego-exo training data has been scarce, egocentric perception of skilled activities still poses major research problems. To address this, the researchers devised a set of foundational benchmarks designed to give the community a starting point to build on, organized into four task families: relation, recognition, proficiency, and ego-pose. A sketch of how such time-indexed annotations might be represented follows below.
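To make the idea of time-indexed language concrete, here is a small hypothetical sketch of how the three kinds of commentary (wearer narration, third-person description, expert critique) could be stored and queried by time window. The field names, labels, and example text are assumptions for illustration, not the actual Ego-Exo4D annotation schema.

```python
# Hypothetical representation of time-indexed language annotations;
# not the actual Ego-Exo4D annotation format.
from dataclasses import dataclass

@dataclass
class Annotation:
    start: float  # start time in seconds within the take
    end: float    # end time in seconds within the take
    kind: str     # "wearer_narration" | "third_person_description" | "expert_commentary"
    text: str

def annotations_in_window(annos: list[Annotation], t0: float, t1: float) -> list[Annotation]:
    """Return all annotations overlapping the time window [t0, t1]."""
    return [a for a in annos if a.start < t1 and a.end > t0]

annos = [
    Annotation(12.0, 15.5, "wearer_narration", "I whisk the eggs in the bowl."),
    Annotation(11.8, 16.0, "third_person_description", "The cook whisks eggs at the counter."),
    Annotation(12.0, 18.0, "expert_commentary", "The whisking could be faster for a smoother mix."),
]
print([a.kind for a in annotations_in_window(annos, 13.0, 14.0)])
```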
In conclusion, Ego-Exo4D is a comprehensive dataset of unprecedented scale covering skilled human activities across domains. It is a first-of-its-kind dataset that bridges the gaps left by its predecessors. It applies to many problems, such as activity recognition, body pose estimation, and AI coaching, and the researchers believe it will be a driving force behind research in multimodal activity understanding, ego-exo learning, and beyond.
Check out the Paper, Project, and Reference Article. All credit for this research goes to the researchers of this project.