
Meet JARVIS-1: Open-World Multi-Task Agents with Memory-Augmented Multimodal Language Models

Nov 18, 2023

A team of researchers from Peking University, UCLA, the Beijing University of Posts and Telecommunications, and the Beijing Institute for General Artificial Intelligence introduces JARVIS-1, a multimodal agent designed for open-world tasks in Minecraft. Leveraging pre-trained multimodal language models, JARVIS-1 interprets visual observations and human instructions, generating sophisticated plans for embodied control. 

JARVIS-1 combines multimodal input with language models for planning and control. Built on pre-trained multimodal language models, it pairs the planner with a multimodal memory, so plans draw on both pre-trained knowledge and the agent's own in-game experiences. JARVIS-1 achieves nearly perfect performance across more than 200 diverse tasks and notably excels at the challenging long-horizon diamond pickaxe task, improving the completion rate by up to five times. The study emphasizes the significance of multimodal memory in enhancing agent autonomy and general intelligence in open-world scenarios.
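To make the planning loop concrete, here is a minimal sketch of memory-augmented planning. Everything in it is hypothetical: `Episode`, `MultimodalMemory`, and `build_planning_prompt` are illustrative names rather than JARVIS-1's actual interfaces, and the word-overlap retrieval is a toy stand-in for whatever multimodal matching the real system performs.

```python
# Hypothetical sketch of memory-augmented planning (names are
# illustrative, not JARVIS-1's real API).
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str            # e.g. "craft a stone pickaxe"
    observation: str     # text summary of the visual scene
    plan: list[str]      # short-horizon instructions that were executed
    success: bool

@dataclass
class MultimodalMemory:
    episodes: list = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3):
        # Toy relevance score: word overlap between task descriptions.
        # A real system would match multimodal representations of the
        # task and the current observation.
        def score(e):
            return len(set(task.split()) & set(e.task.split()))
        hits = [e for e in self.episodes if e.success]
        return sorted(hits, key=score, reverse=True)[:k]

def build_planning_prompt(task, observation, exemplars):
    # In-context exemplars from memory steer the language model toward
    # plans that have worked before.
    lines = ["You control a Minecraft agent. Write a step-by-step plan."]
    for e in exemplars:
        lines.append(f"Past task: {e.task} -> plan: {'; '.join(e.plan)}")
    lines.append(f"Observation: {observation}")
    lines.append(f"Task: {task}\nPlan:")
    return "\n".join(lines)
```

A planner would send this prompt to a language model and parse the returned steps into short-horizon instructions for the low-level controller.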

The research addresses the challenge of building sophisticated agents for complex tasks in open-world environments, where existing approaches struggle with multimodal data, long-term planning, and lifelong learning. The proposed JARVIS-1 agent, built on pre-trained multimodal language models, excels at Minecraft tasks: it achieves nearly perfect performance on over 200 tasks and significantly improves completion of the long-horizon diamond pickaxe task. The agent also learns autonomously, evolving with minimal external intervention, a step toward generally capable artificial intelligence.

JARVIS-1, designed on pre-trained multimodal language models, combines visual and textual inputs to generate plans, and its multimodal memory integrates pre-trained knowledge with in-game experiences for planning. Existing approaches use a hierarchical goal-execution architecture with large language models as high-level planners. JARVIS-1 is evaluated on 200 tasks from the Minecraft Universe Benchmark; the evaluation reveals that diamond-related tasks remain challenging, largely because the controller imperfectly executes short-horizon text instructions.
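That hierarchical pattern can be sketched in a few lines: a high-level planner decomposes the task into short-horizon text instructions, and a goal-conditioned controller attempts each one, with replanning on failure. The code below is a hedged approximation; `plan_task` and `Controller` are placeholder names, and the fixed decomposition merely stands in for real language-model output.

```python
# Hedged sketch of hierarchical goal execution with replanning.
# `plan_task` and `Controller` are placeholders, not the paper's code.

def plan_task(task: str, observation: str) -> list[str]:
    # Stand-in for the LLM planner: a fixed decomposition used purely
    # for illustration.
    return ["chop logs", "craft planks", "craft a crafting table",
            "craft a wooden pickaxe", "mine cobblestone",
            "craft a stone pickaxe"]

class Controller:
    def execute(self, instruction: str) -> bool:
        # A real controller is a goal-conditioned policy acting in the
        # environment; here every short-horizon instruction "succeeds".
        print(f"executing: {instruction}")
        return True

def run(task: str, observation: str, max_replans: int = 3) -> bool:
    controller = Controller()
    for _ in range(max_replans):
        steps = plan_task(task, observation)
        # all() stops at the first failed instruction.
        if all(controller.execute(step) for step in steps):
            return True
        # On failure, query the planner again before giving up.
    return False

if __name__ == "__main__":
    run("craft a stone pickaxe", "plains biome, trees nearby")
```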

JARVIS-1’s multimodal memory fosters self-improvement, and the agent outperforms other instruction-following agents in general intelligence and autonomy. It surpasses DEPS without memory on challenging tasks, nearly tripling the success rate on diamond-related tasks. The study underscores the importance of refining plan generation so plans are easier to execute and of strengthening the controller’s ability to follow instructions, particularly in diamond-related tasks.
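The self-improvement that the memory enables can be pictured as a write-back cycle: attempt a task, then record the plan and its outcome so that future retrieval favors plans that actually worked. The function below is a minimal, self-contained sketch under that assumption, with `plan_fn` and `execute_fn` as hypothetical stand-ins for the planner and controller.

```python
# Minimal sketch of memory write-back for self-improvement.
# `plan_fn` and `execute_fn` are hypothetical stand-ins.

def self_improve_step(memory: list, task: str, plan_fn, execute_fn) -> bool:
    """One lifelong-learning step: plan with retrieved successes,
    execute, and store the outcome back into memory."""
    # Condition the planner on recent successful episodes.
    exemplars = [e for e in memory if e["success"]][-3:]
    plan = plan_fn(task, exemplars)
    success = execute_fn(plan)
    # Record the attempt; over time, retrieval draws on a growing pool
    # of the agent's own experience, with no external intervention.
    memory.append({"task": task, "plan": plan, "success": success})
    return success
```

Repeated across many tasks, a loop like this lets the agent's planning improve from its own experience without updating the underlying model weights.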

JARVIS-1, an open-world agent built on pre-trained multimodal language models, is proficient in multimodal perception, plan generation, and embodied control within the Minecraft universe. Incorporating multimodal memory enhances decision-making by leveraging pre-trained knowledge and real-time experiences. JARVIS-1 substantially increases completion rates for tasks like the long-horizon diamond pickaxe, exceeding previous records by up to five times. This breakthrough sets the stage for future developments in versatile and adaptable agents within complex virtual environments.

For further research, the authors suggest enhancing plan generation so that plans are easier to execute and improving the controller’s ability to follow instructions in diamond-related tasks. They also propose boosting decision-making in open-world scenarios through multimodal memory and real-time experiences, expanding JARVIS-1’s capabilities to a broader range of Minecraft tasks, and adapting the agent to other virtual environments. The study encourages continuous improvement through lifelong learning, fostering self-improvement and the development of greater general intelligence and autonomy in JARVIS-1.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



[Source: AI Techpark]
