Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces ThorV2, a novel architecture that significantly enhances LLMs’ function calling abilities. We develop a comprehensive benchmark focused on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic. Our results demonstrate that ThorV2 outperforms existing models in accuracy, reliability, latency, and cost efficiency for both single and multi-API calling tasks. We also show that ThorV2 is far more reliable than traditional models and scales better to multi-step tasks. Our work suggests that significantly smaller LLMs can achieve more accurate function calling than today’s best-performing models. These advancements have significant implications for the development of more capable AI assistants and the broader application of LLMs in real-world scenarios.
Introduction. Large Language Models (LLMs) have revolutionized natural language processing and artificial intelligence, demonstrating remarkable capabilities across a wide range of tasks [1, 2]. However, their economic impact has been somewhat limited, particularly in domains requiring precise interaction with external tools and APIs. A key challenge in this area is the task of function calling, where LLMs must accurately interpret user queries and translate them into appropriate API calls. The underwhelming performance of LLMs in function calling has had tangible consequences. Much-anticipated AI gadgets like the Rabbit R1 and the Humane AI Pin have faced criticism due to their inability to reliably fulfill user tasks. OpenAI’s GPT Store has similarly met with a lukewarm response. We also note that while chatbots and coding assistants can increase productivity, almost no jobs have been completely taken over by AI as of September 2024 [3]. This highlights a critical gap between the theoretical capabilities of LLMs and their practical performance in the real world.
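To make the function-calling task concrete, the following minimal sketch shows a user query, the structured call a model would be expected to emit for it, and a check that the call is well-formed. The tool names (`create_contact`, `update_deal_stage`), their parameters, and the `validate_call` helper are hypothetical and chosen for illustration only; they are not HubSpot's actual API and not the ThorV2 implementation.

```python
import json

# Hypothetical tool schema: function name -> required and optional argument names.
# These names are illustrative assumptions, not a real CRM API.
TOOLS = {
    "create_contact": {
        "required": {"email"},
        "optional": {"first_name", "last_name"},
    },
    "update_deal_stage": {
        "required": {"deal_id", "stage"},
        "optional": set(),
    },
}


def validate_call(raw_output: str) -> dict:
    """Parse a model's JSON output and check it against the tool schema.

    Raises ValueError if the output is not valid JSON, names an unknown
    function, omits a required argument, or includes an unexpected one.
    """
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc

    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name!r}")

    supplied = set(call.get("arguments", {}))
    spec = TOOLS[name]
    missing = spec["required"] - supplied
    unexpected = supplied - spec["required"] - spec["optional"]
    if missing:
        raise ValueError(f"missing required arguments: {sorted(missing)}")
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return call


# User query: "Add jane@example.com as a new contact named Jane."
# A correct model output for this query under the schema above:
model_output = (
    '{"name": "create_contact", '
    '"arguments": {"email": "jane@example.com", "first_name": "Jane"}}'
)

call = validate_call(model_output)
print(call["name"], call["arguments"])
```

A call that passes this check is only structurally valid. Whether it selects the right function and argument values for the user's intent is a separate question, and that is what accuracy on a function-calling benchmark measures.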
Discussion / Conclusion. The demonstrated superiority of ThorV2 over leading commercial alternatives has significant implications for businesses and end-users. The high accuracy (90.1% for single API calls and 96.55% for multi-API calls) translates to a more reliable user experience, reducing frustration arising from task failures or data mishandling. The low latency of ThorV2 (2.29 seconds for single API calls and 3.55 seconds for multi-API calls) enables near-instantaneous task execution, enhancing user satisfaction and productivity. This speed advantage becomes even more pronounced in complex, multi-step tasks, where ThorV2’s sublinear latency scaling shines. ThorV2’s perfect reliability score (100%) is particularly noteworthy. This consistency allows users to develop trust in the system over time, since they can rely on it to behave the same way each time they request a task. From a development perspective, this reliability makes the system much easier to engineer, maintain, and improve. The cost-effectiveness of ThorV2 ($1.60 per 1000 queries for single API calls) makes it an attractive option for businesses looking to implement AI assistants at scale.
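As a rough illustration of what the quoted price implies at scale, the sketch below converts the $1.60 per 1000 queries figure for single API calls into a per-query cost and a projected total. The volume of 50,000 queries is a hypothetical assumption chosen for the example, not a figure from the benchmark.

```python
from decimal import Decimal

# Price quoted in the text for single API calls.
COST_PER_1000_QUERIES = Decimal("1.60")

# Hypothetical workload, assumed for illustration only.
HYPOTHETICAL_MONTHLY_QUERIES = 50_000


def cost_per_query() -> Decimal:
    """Cost of one single-API-call query in dollars."""
    return COST_PER_1000_QUERIES / 1000


def projected_cost(num_queries: int) -> Decimal:
    """Total cost in dollars for num_queries single-API-call queries."""
    return cost_per_query() * num_queries


print(f"per query: ${cost_per_query()}")
print(f"{HYPOTHETICAL_MONTHLY_QUERIES} queries: "
      f"${projected_cost(HYPOTHETICAL_MONTHLY_QUERIES):.2f}")
```

At the quoted rate each query costs $0.0016, so the assumed 50,000 queries would come to $80.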