This month, we’re pausing our regular programming to bring you a special edition of The Holistic AI Brief.
This edition distills what we learned about the future of enterprise AI from the Great Agent Hack 2025, a hackathon co-hosted by Holistic AI and University College London's Centre for Digital Innovation (UCL CDI) that brought together more than 200 builders and researchers from leading universities, labs, and industry. With 51 project submissions across three challenge tracks, the event offered a rare look at the speed, direction, and maturity of emerging agentic systems — and what that evolution means for governance, safety, and enterprise deployment. As one judge described the experience:
“I walked out of the Great Agent Hack 2025 with a serious intellectual hangover — the feeling you get when brilliant people make you rethink what ‘possible’ looks like.”
— César Ortega Quintero, Expert Data Scientist, MAPFRE Open Innovation
Details on the challenge tracks and winning teams are available at:
Here, we focus on the strategic signals shaping enterprise adoption of agentic systems.
Across submissions and judging discussions, four key themes emerged:

AI observability extends classic logging, metrics, and traces to the full AI stack— data, model behavior, and infrastructure. It continuously captures signals such as input distributions, outputs, explanations, latency, costs, and security events so teams can understand and control system behavior over time. At the Great Agent Hack, teams converged on tracing tools, dashboards, workflow visualizations, and step-level introspection, underscoring how essential visibility has become to building and operating AI systems.
Why it matters: In production, AI models drift, upstream data changes, and user behavior evolves. Observability allows teams to detect these shifts early, rather than learning about failures through incidents. Organizations with mature observability dramatically reduce detection time and downtime, which directly lowers outages, poor decisions, and customer‑visible errors. And by logging prompts, responses, access patterns, and performance, observability provides the evidence needed for compliance and accountability.
Across submissions, teams constructed coordinated pipelines where specialized agents planned, reviewed, verified, and executed tasks together. This approach mirrors real enterprise workflows, where different roles contribute distinct expertise to complete a process. While multi-agent designs are not universal across all AI applications, their consistent use at the Great Agent Hack reflects a growing shift toward systems that distribute responsibilities across multiple interacting components.
Why it matters: As AI systems adopt more multi-agent patterns, governance must expand beyond supervising a single model and begin addressing the behavior of multiple agents interacting with one another. Oversight needs to track how decisions flow across agents, where coordination can break down, and how failures emerge through interaction rather than in isolation. Effective governance must account for the dynamics of the entire system, not just individual endpoints.
Teams developed automated probing tools, behavioral stress tests, and structured evaluation capabilities to uncover failure modes and measure robustness. The emphasis on systematic testing —rather than ad hoc prompts — reflects a growing recognition that agentic systems require deeper, more repeatable methods for exposing vulnerabilities.
Why it matters: Agentic AI systems behave differently from single-step LLMs: they plan, iterate, call external tools, and interact with other agents. These behaviors create new types of failure modes that only appear when the full system is exercised under pressure. Continuous, system-level red-teaming allows teams to surface hidden risks, validate safeguards, and understand how agents respond to edge cases before those issues impact customers or operations. Advanced testing frameworks are becoming essential for deploying agentic AI safely and predictably.
Several teams showed that small and mid-sized models — such as Gemma-3 4B, Llama-3.1 8B, Claude Haiku, and Nova Lite — can perform extremely well when paired with strong agent design and selective use of larger models like Claude Sonnet for more complex steps. This pattern reflects a broader shift toward model-mixing: using smaller models for routine tasks and reserving larger models for high-stakes reasoning.
Why it matters: Enterprises are unlikely to rely on a single model. Instead, they will use a diverse model portfolio across their IT landscape, matching the right model to the right task to optimize cost, latency, and reliability. Well-structured smaller models offer predictable cost profiles and stable performance under load, while larger models remain essential for deeper reasoning or complex decision-making.
The field is moving quickly, and these trends are likely key indicators of where enterprise builders and governance leaders should focus : observability, multi-agent oversight, advanced testing, and efficient system design.
As Naqash Tahir, Executive Director, R&D and Investments at PGIM’s RealAssetX, noted: “Advancing trustworthy and transparent AI is essential for the future of our industry. We’re proud to support the Great Agent Hack 2025, where collaboration and innovation drive meaningful progress in agentic AI.”


Get a demo
Get a demo