Tech At Bloomberg

Bloomberg’s AI engineers introduce an improved agent tool-calling methodology at ACL 2025

July 27, 2025

During the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) this week in Vienna, Austria, researchers from Bloomberg’s AI Engineering group in London are showcasing their expertise in large language models (LLMs) and tool-based agentic AI with their paper “A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents.”

In the paper, which is published in “Findings of the Association for Computational Linguistics: ACL 2025,” Bin Wu, a Bloomberg Data Science Ph.D. Fellow and Ph.D. student at University College London, Edgar Meij, Head of AI Platforms in Bloomberg’s AI Engineering group, and Emine Yilmaz, a professor and EPSRC Fellow at University College London’s Department of Computer Science – where she also leads the Web Intelligence Group at the UCL Centre for Artificial Intelligence – demonstrate the crucial role of the instructions provided in agent prompts and tool descriptions – collectively referred to as context. Incomplete or suboptimal context significantly increases the number of tool calls an LLM must make to generate an adequate response, leading to computational overhead. They propose a new methodology for automatically improving agent prompts and tool descriptions, and demonstrate that it substantially reduces the number of tool calls the LLM agent needs to make.

In addition, two members of Bloomberg’s AI Strategy & Research team in the company’s CTO Office – Sebastian Gehrmann, Head of Responsible AI, and Enrico Santus, Principal Technical Strategist for Human-AI Interaction and Academic Engagement – are two of the organizers of the fourth iteration of the Generation, Evaluation & Metrics Workshop (GEM2), which will be held as part of ACL 2025 on July 31, 2025. In light of the broad accessibility of LLMs, this workshop will serve as a forum for researchers and practitioners from both the natural language processing and machine learning communities to come together and explore approaches and research directions for addressing broader natural language generation (NLG) challenges – in particular, the evaluation of model-generated outputs. While these advanced models can generate fluent text, ensuring the usefulness, quality, and fairness of their output is essential to help bridge the gap between research and real-world applications.

We asked the paper’s lead author and one of the workshop organizers to explain why their work is notable in advancing the state-of-the-art with regard to LLMs and agentic AI:


Wednesday, July 30, 2025

Session 12: IP-Posters (Findings Posters – In-Person 4)
11:00-12:30 CEST

A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents
Bin Wu (Centre for Artificial Intelligence, University College London), Edgar Meij (Bloomberg), Emine Yilmaz (Centre for Artificial Intelligence, University College London)

Click to read “A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents,” published in “Findings of the Association for Computational Linguistics” at ACL 2025.

Please summarize your research. Why are your results notable?

Bin Wu: This research proposes a joint optimization framework that aims to improve the efficiency of tool-augmented LLM agents by systematically refining both agent instructions and tool descriptions. Traditional approaches have either focused on enhancing tool use effectiveness through reasoning strategies – like chain-of-thought (CoT) or tree-of-thoughts (ToT) prompting – or optimized only a single aspect (i.e., either the instructions or the tool documentation). However, these prior methods incur high computational costs and often overlook efficiency, particularly under conditions where context is incomplete.

Our proposed framework introduces a three-stage process (sketched in code below):

  • Feedback Generator: Evaluates the effectiveness and efficiency of the agent’s tool calls.
  • Suggestion Coordinator: Produces separate improvement suggestions for the agent prompt and the tool descriptions.
  • Context Refiner: Applies these suggestions to update the context in a stable, scalable way.
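To make the loop concrete, here is a minimal sketch of how such a three-stage refinement pipeline could be wired together. Every name in it (llm_call, AgentContext, the prompts) is an assumption made for illustration; the paper’s actual implementation may differ.

```python
# Illustrative sketch of a three-stage context-refinement loop.
# All names and prompts are hypothetical, not the paper's implementation.
from dataclasses import dataclass

def llm_call(prompt: str) -> str:
    """Placeholder for a call to whichever LLM backend is available."""
    raise NotImplementedError

@dataclass
class AgentContext:
    instructions: str        # the agent prompt
    tool_descriptions: dict  # tool name -> natural-language description

def feedback_generator(context: AgentContext, trajectory: list) -> str:
    """Stage 1: judge how effective and efficient the recorded tool calls were."""
    return llm_call(
        "Assess whether the task was solved and whether any tool calls were redundant.\n"
        f"Instructions: {context.instructions}\n"
        f"Tools: {context.tool_descriptions}\n"
        f"Trajectory: {trajectory}"
    )

def suggestion_coordinator(feedback: str) -> tuple:
    """Stage 2: turn the feedback into separate suggestions for the prompt and the tool descriptions."""
    prompt_suggestion = llm_call(f"Suggest edits to the agent instructions only, given:\n{feedback}")
    tool_suggestion = llm_call(f"Suggest edits to the tool descriptions only, given:\n{feedback}")
    return prompt_suggestion, tool_suggestion

def context_refiner(context: AgentContext, prompt_suggestion: str, tool_suggestion: str) -> AgentContext:
    """Stage 3: apply the suggestions to produce an updated, more complete context."""
    new_instructions = llm_call(
        "Rewrite these instructions, applying the suggestion.\n"
        f"Instructions: {context.instructions}\nSuggestion: {prompt_suggestion}"
    )
    new_tools = {
        name: llm_call("Rewrite this tool description, applying the suggestion.\n"
                       f"Description: {desc}\nSuggestion: {tool_suggestion}")
        for name, desc in context.tool_descriptions.items()
    }
    return AgentContext(new_instructions, new_tools)
```

In such a loop, the refined context is handed back to the agent and the three stages repeat, either for a fixed number of iterations or until a validation score stops improving.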

Notable results include the following:

  1. We show that incomplete context forces LLMs to make more tool calls to generate an adequate response.
  2. We demonstrate up to a 70% reduction in required tool calls on StableToolBench and 47% fewer redundant calls on RestBench, while maintaining or improving pass rates.

Why is it important to optimize context to improve the efficiency of agentic tool calling?

In practice, incomplete context is very common. Agent instructions are typically written by hand through extensive trial and error, and tool descriptions are likewise authored by humans, for whom it is especially difficult to accurately capture the behavior of complex tools. As our empirical analysis reveals, incomplete context is one of the main sources of computational overhead. When agentic LLMs use tools end to end, optimizing the context is therefore essential to improving their efficiency.

How does your research advance the state-of-the-art in the field of agentic/generative AI?

This work advances the field in the following key ways:

  • Joint optimization of context: Most prior research improved either the agent prompt or the tool descriptions, but not both simultaneously. This study is the first to propose joint, automated optimization of both, acknowledging their interaction and combined effect on agent performance.
  • Verbalized optimization pipeline: Instead of relying on resource-intensive model fine-tuning, we introduce a training-free, text-based optimization framework. It uses the LLMs themselves to produce feedback and improvements, making it scalable and applicable to closed-source models and resource-constrained environments.
  • New evaluation metric – CAPR: The introduction of the Cost-Aware Pass Rate (CAPR) is a significant contribution. Unlike traditional metrics focused solely on effectiveness, CAPR incorporates computational cost, thereby aligning better with real-world requirements for efficient and cost-effective AI agents (a simple illustration follows this list).
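As a rough illustration of the intuition behind a cost-aware metric, the toy function below discounts each solved task by the number of tool calls it consumed. This is a hypothetical scoring scheme written for this post; the paper’s exact definition of CAPR may differ.

```python
def cost_aware_pass_rate(results, budget=10):
    """Toy cost-aware pass rate (illustrative only, not the paper's exact CAPR).

    `results` is a list of (passed, num_tool_calls) tuples and `budget` is a
    hypothetical reference number of tool calls: solved tasks contribute more
    when they use fewer calls, and unsolved tasks contribute nothing.
    """
    if not results:
        return 0.0
    scores = [
        max(0.0, 1.0 - num_calls / budget) if passed else 0.0
        for passed, num_calls in results
    ]
    return sum(scores) / len(scores)

# Example: three tasks, two solved with different tool-call counts.
print(cost_aware_pass_rate([(True, 2), (True, 8), (False, 5)], budget=10))  # ~0.33
```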

Were there any surprising or unexpected outcomes from your research?

Yes, several findings were unexpected and noteworthy:

  • Incomplete context hampers efficiency more than effectiveness. While incomplete context degrades performance as expected, our experiments revealed that it particularly worsens efficiency, not just effectiveness. Agents still solved tasks, but needed far more tool calls to do so, highlighting a hidden cost that was overlooked in prior research.
  • Tool descriptions also play a large role. Contrary to the common assumption that agent instructions are the dominant factor, jointly optimized tool descriptions yield far greater efficiency gains than instruction improvements alone.
  • Verbalized optimization can overfit. Iterative context refinement sometimes led to overfitting, where additional iterations increased the required tool calls and degraded performance. This mirrors overfitting in traditional machine learning and suggests the need for regularization techniques in verbalized optimization (a simple early-stopping sketch follows this list).
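One simple regularization strategy, sketched below under our own assumptions rather than taken from the paper, is to evaluate each refined context on held-out tasks and stop iterating once the score stops improving:

```python
def refine_with_early_stopping(context, refine_step, evaluate, max_iters=10, patience=2):
    """Iteratively refine the context, stopping when a held-out score stops improving.

    `refine_step(context)` returns an updated context and `evaluate(context)` returns a
    validation score (e.g., a cost-aware pass rate). Purely illustrative.
    """
    best_context, best_score = context, evaluate(context)
    stale = 0
    for _ in range(max_iters):
        context = refine_step(context)
        score = evaluate(context)
        if score > best_score:
            best_context, best_score, stale = context, score, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` rounds: stop early
                break
    return best_context
```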

Read more about Bloomberg’s agentic AI infrastructure here.


Thursday, July 31, 2025

GEM2 Workshop: Generation, Evaluation & Metrics
Sebastian Gehrmann (Bloomberg), Gabriel Stanovsky (Hebrew University of Jerusalem), Simon Mille (Dublin City University), Enrico Santus (Bloomberg), Miruna Clinciu (Heriot Watt University), Kaustubh Dhole (Emory University), Yotam Perlitz (IBM Research), Rotem Dror (University of Haifa), Itay Itzhak (Hebrew University of Jerusalem), Ofir Arviv (IBM Research), Eliya Habba (Hebrew University of Jerusalem), Michal Shmueli Scheuer (IBM Research), João Sedoc (New York University) and Oyvind Tafjord (Allen Institute for Artificial Intelligence)

Please explain the goal of this workshop. Why are you helping to organize it?

Enrico Santus: This is the fourth edition of the Generation, Evaluation & Metrics Workshop (GEM). My colleague, Sebastian Gehrmann, originally started it in 2020, when the evaluation of generated text was first becoming critically important. Now that GenAI is ubiquitous, GEM has grown into one of the largest workshops held at any NLP conference, and we couldn’t be more excited to help lead it together with the outstanding organizing team.

As GenAI is increasingly used for high-impact applications – from healthcare to robotics and finance – the stakes for evaluation have never been higher. Yet, many of today’s benchmarks are brittle, hard to reproduce, or fail to reflect real-world complexity. That’s why we believe GEM2 will help shift the field toward more meaningful, efficient, and robust evaluation practices.

This year, more than 86 scientific publications will be presented at GEM2, alongside three keynotes and a panel. Moreover, for the second year, we have also built a space where industry and academia can meet each other through a dedicated Industrial Track. That conversation will be catalyzed in a panel with leading voices from DeepMind, Contextual AI, and aiXplain, during which the speakers will share what it means to evaluate generative models in real-world production environments.

How do you expect or hope that this workshop will help advance the state-of-the-art in terms of the evaluation of LLMs?

We hope GEM2 helps change how our community thinks about evaluation. Right now, much of the focus in LLM benchmarking is on leaderboards, but they don’t tell the full story. Models are sensitive to prompting, few-shot formatting, and even punctuation. Reproducibility is a challenge, and many current metrics don’t reflect how models behave under pressure or in production. GEM2 encourages the field to go deeper, to explore robustness, fairness, instruction-following variance, and real-world generalization.

We’re incredibly fortunate to have three invited speakers who each bring powerful perspectives:

  • Barbara Plank (LMU Munich) will present new work on ambiguity, inconsistency, and flawed reasoning in LLMs.
  • Leshem Choshen (MIT-IBM) will dive into underexplored frontiers, like pretraining evaluation, tinyBenchmarks, multicultural benchmarking, and the risks of data contamination.
  • Ehud Reiter (University of Aberdeen) will challenge us to go beyond metrics and focus on real-world impact.

Most importantly, GEM2 is about community. Over the past four years, the GEM community has grown into a vibrant global network, bringing together hundreds of contributors from across continents, disciplines, and institutions. Through their work, the GEM community is shaping the future of NLP evaluation, and we are excited to be among its hosts.