The paper addresses the challenge of evaluating the tool-use capabilities of large language models (LLMs) in real-world scenarios. Existing benchmarks often fall short because they rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, none of which capture the complexity of real-world problem-solving.
Current evaluation methodologies typically rely on synthetic benchmarks. AI-generated queries and single-step tasks are simpler and more predictable than the multifaceted problems encountered in everyday use, and the dummy tools used in these evaluations give little indication of how well an LLM can interact with actual software and services.
A team of researchers from Shanghai Jiao Tong University and Shanghai AI Laboratory proposes the General Tool Agents (GTA) benchmark to bridge this gap. The benchmark is designed to assess LLMs' tool-use capabilities in real-world situations more accurately. It features human-written queries with implicit tool-use requirements, real deployed tools spanning several categories (perception, operation, logic, and creativity), and multimodal inputs that closely mimic real-world contexts. This setup provides a more comprehensive and realistic evaluation of an LLM's ability to plan and execute complex tasks with various tools.
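To make this structure concrete, here is a minimal sketch of what a GTA-style task record might look like. The field names, tool names, and example values below are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical GTA-style task record (illustrative only; not the benchmark's real schema).
task = {
    "query": "How much would it cost to buy three of the items shown in the image?",
    "files": ["receipt.jpg"],                  # multimodal input accompanying the query
    "reference_toolchain": [                   # ground-truth steps used for step-by-step scoring
        {"tool": "OCR",        "args": {"image": "receipt.jpg"}},     # perception tool
        {"tool": "Calculator", "args": {"expression": "12.5 * 3"}},   # logic tool
    ],
    "answer": "37.5",
}
```

Note that the query never names the tools explicitly; the model has to infer from the question and the attached image which tools are needed and in what order.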
The GTA benchmark comprises 229 real-world tasks that require the use of various tools. Each task involves multiple steps and requires the LLM to reason about and plan which tools to use and in what order. Evaluation is carried out in two modes: step-by-step and end-to-end. In the step-by-step mode, the LLM is given the initial steps of a reference toolchain and must predict the next action. This mode measures fine-grained tool-use capability without actually executing the tools, allowing a detailed comparison of the model's output against the reference steps.
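The following sketch shows how step-by-step scoring might be computed against such a reference toolchain. It assumes the task record shown earlier and a hypothetical `predict_next_step` function standing in for the model under test; neither reflects the benchmark's actual harness.

```python
def step_accuracy(task, predict_next_step):
    """Score a model's next-action predictions against each reference step.

    `predict_next_step(query, history)` is a hypothetical stand-in for the LLM:
    given the query and the reference steps seen so far, it returns a dict with
    "tool" and "args" keys. Returns (tool_accuracy, arg_accuracy) for this task.
    """
    ref = task["reference_toolchain"]
    tool_hits, arg_hits = 0, 0
    for i, gold in enumerate(ref):
        # The model sees the reference prefix, not its own previous predictions.
        pred = predict_next_step(task["query"], history=ref[:i])
        if pred.get("tool") == gold["tool"]:
            tool_hits += 1
            # Arguments only count when the right tool was selected.
            if pred.get("args") == gold["args"]:
                arg_hits += 1
    n = len(ref)
    return tool_hits / n, arg_hits / n
```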
In the end-to-end mode, the LLM calls the tools and attempts to solve the problem on its own, with each step depending on the previous ones. This mode reflects the model's actual task-execution performance. The researchers report several metrics: instruction-following accuracy (InstAcc), tool selection accuracy (ToolAcc), argument accuracy (ArgAcc), and summary accuracy (SummAcc) in the step-by-step mode, and answer accuracy (AnsAcc) in the end-to-end mode.
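An end-to-end run, by contrast, lets the model drive the toolchain itself and is scored only on the final answer. The loop below is a simplified sketch under the assumption of a hypothetical `call_model` function that returns either a tool call or a final answer, and a `tools` registry of callable tools; it illustrates the idea rather than the benchmark's real implementation.

```python
def run_end_to_end(task, call_model, tools, max_steps=10):
    """Let the model choose and call tools until it produces a final answer.

    `call_model(query, files, history)` is a hypothetical stand-in for the agent:
    it returns either {"tool": name, "args": {...}} or {"answer": text}.
    `tools` maps tool names to Python callables. Answer accuracy (AnsAcc) then
    reduces to checking the final answer against the reference.
    """
    history = []
    for _ in range(max_steps):
        action = call_model(task["query"], task.get("files", []), history)
        if "answer" in action:                  # the model decided it is done
            return action["answer"] == task["answer"]
        tool = tools[action["tool"]]            # look up the real deployed tool
        result = tool(**action["args"])         # execute it with the model's arguments
        history.append({"action": action, "result": result})  # feed the observation back
    return False                                # ran out of steps without a final answer
```

Because each call depends on the result of the previous one, an early mistake in tool selection or arguments can derail the whole chain, which is why end-to-end accuracy tends to be much lower than step-level metrics.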
The results show that real-world tasks remain a significant challenge for current LLMs. The best-performing models, GPT-4 and GPT-4o, correctly solved fewer than 50% of the tasks, while other models achieved less than 25% accuracy. The results also point to clear room for improvement: among open-source models, Qwen-72B achieved the highest accuracy, suggesting that with further advances LLMs can better meet the demands of real-world scenarios.
The GTA benchmark effectively exposes the shortcomings of current LLMs in handling real-world tool-use tasks. By utilizing human-written queries, real deployed tools, and multimodal inputs, the benchmark provides a more accurate and comprehensive evaluation of LLMs’ capabilities. The findings underscore the pressing need for further advancements in the development of general-purpose tool agents. This benchmark sets a new standard for evaluating LLMs and will serve as a crucial guide for future research aimed at enhancing their tool-use proficiency.