And How the Winning Projects Demonstrated Innovation, Observability, and Applicability

The Great Agent Hack 2025, co-hosted by Holistic AI and University College London's Centre for Digital Innovation (UCL CDI), brought together more than 200 builders from leading UK and US universities, research groups, and industry teams. With 51 submissions across three challenge tracks, the event revealed how quickly agentic systems are evolving and the technical hurdles that must be solved to make them ready for production environments.

“I walked out of the Great Agent Hack 2025 with a serious intellectual hangover — the feeling you get when brilliant people make you rethink what ‘possible’ looks like.”
César Ortega Quintero, Expert Data Scientist, MAPFRE Open Innovation

Co-sponsored by AWS, NVIDIA, Valyu, PGIM’s RealAssetX, and MAPFRE, this year’s hackathon challenged teams to demonstrate meaningful progress in performance, observability, and safety. These three pillars matter deeply for enterprises exploring agent workflows across sectors such as insurance, financial services, HR services, consumer products and goods, healthcare, and pharmaceuticals. Together, these themes align closely with Holistic AI’s focus on ensuring that enterprise AI can be deployed safely, transparently, and at scale.

The experience resonated across participants, judges, researchers, and industry leaders alike. Many left with a sense that the field is advancing faster and more creatively than expected.

Key Take-Aways About the Future of Agentic AI

Key Take-Aways About the Future of Agentic AI

1. Observability is foundational to managing and governing AI ecosystems.

AI observability extends classic logging, metrics, and traces to the full AI stack — data, model behavior, and infrastructure. It continuously captures signals such as input distributions, outputs, explanations, latency, costs, and security events so teams can understand and control system behavior over time. At the Great Agent Hack, teams converged on tracing tools, dashboards, workflow visualizations, and step-level introspection, underscoring how essential visibility has become to building and operating AI systems.

Why it matters: In production, AI models drift, upstream data changes, and user behavior evolves. Observability allows teams to detect these shifts early, rather than learning about failures through incidents. Organizations with mature observability dramatically reduce detection time and downtime, which directly lowers outages, poor decisions, and customer‑visible errors. And by logging prompts, responses, access patterns, and performance, observability provides the evidence needed for compliance and accountability.

2. Multi-agent systems are increasingly used to carry out complex AI tasks.

Across submissions, teams constructed coordinated pipelines where specialized agents planned, reviewed, verified, and executed tasks together. This approach mirrors real enterprise workflows, where different roles contribute distinct tasks to complete a process. While multi-agent designs are not universal across all AI applications, their consistent use at the Great Agent Hack reflects a growing shift toward systems that distribute responsibilities across multiple interacting components.

Why it matters: As AI systems adopt more multi-agent design, governance must expand beyond supervising a single model and begin addressing the behavior of multiple agents interacting with one another. Oversight needs to track how decisions flow across agents, where coordination can break down, and how failures emerge through interaction rather than in isolation. Effective governance must account for the dynamics of the entire system, not just individual endpoints.

3. Red-teaming and safety testing are rapidly maturing.

Teams developed automated probing tools, behavioral stress tests, and structured evaluation capabilities to uncover failure modes and measure robustness. The emphasis on systematic testing —rather than ad hoc prompts — reflects a growing recognition that agentic systems require deeper, more repeatable methods for exposing vulnerabilities.

Why it matters: Agentic AI systems behave differently from single-step LLMs: they plan, iterate, call external tools, and interact with other agents. These behaviors create new types of failure modes that only appear when the full system is exercised under pressure. Continuous, system-level red-teaming allows teams to surface hidden risks, validate safeguards, and understand how agents respond to edge cases before those issues impact customers or operations. Advanced testing frameworks are becoming essential for deploying agentic AI safely and predictably.

4. Enterprises may increasingly pair smaller and larger models to optimize performance and cost.

Several teams showed that small and mid-sized models — such as Gemma-3 4B, Llama-3.1 8B, Claude Haiku, and NovaLite — can perform extremely well when paired with strong agent design and selective use of larger models like Claude Sonnet for more complex steps. This pattern reflects a broader shift toward model-mixing: using smaller models for routine tasks and reserving larger models for high-stakes reasoning.

Why it matters: Enterprises are unlikely to rely on a single model. Instead, they will use a diverse model portfolio across their IT landscape, matching the right model to the right task to optimize cost, latency, and reliability. Well-structured smaller models offer predictable cost profiles and stable performance under load, while larger models remain essential for deeper reasoning or complex decision-making.

Three Tracks, One Goal: Build Agents Ready for the Real World

While the projects showcased here were experimental, the challenges they tackled mirror the hurdles enterprises face when preparing agents for production: performance, observability, and safety.

Track A — Agent Iron Man (Performance & Robustness)

Objective: Build an agent that works reliably and holds up under real-world conditions.

Track Winner – Team MI5 Agents: Their six-stage emotional-analysis pipeline combined robust engineering with temporal reasoning, anomaly detection, and LangGraph orchestration.

Enterprise relevance: Emotional-signal analysis like this could support customer triage, risk flagging, or anomaly detection in high-volume service workflows.

Track A — Agent Iron Man (Performance & Robustness) - Team MI5 Agents

Track B — Agent Glass Box (Observability & Explainability)

Objective: Build an agent whose reasoning you can see, verify, and trust.

Track Winner – Team Zarks.AI: They developed areal-time observability framework that captures full execution traces and human-interpretable reasoning chains.

Enterprise relevance: This level of visibility helps teams trace decisions, audit workflows, and diagnose failures quickly, which is essential for safe, compliant deployment.

Track B — Agent Glass Box (Observability & Explainability) - Team Zarks.AI

Track C — Dear Grandma (Security & Red Teaming)

Objective: Build an agent that can with stand adversarial pressure and controlled attempts to break it.

Track Winner – Team HSIA: Their project explored semantic injection attacks on vision-language-action robotic systems, extending red-teaming into multimodal domains.

Enterprise relevance: These attack patterns reveal how multimodal agents — from robotics to automated inspection systems — could be manipulated, and how to proactively harden them against those risks.

Track C — Dear Grandma (Security & Red Teaming) - Team HSIA

The Grand Champion: Team Jailbreak Lab

Competing across all three tracks, Jailbreak Lab built a sophisticated red-teaming and behavioral-profiling platform that delivered deep visibility into agent behavior.

Enterprise relevance: Tooling like this enables continuous stress-testing, early vulnerability discovery, and monitoring of behavioral drift in deployed agents.

The Grand Champion: Team Jailbreak Lab

Research Collaborations with Holistic AI

All track winners and the Grand Champion team were awarded research collaborations with Holistic AI. These collaborations extend the impact of the Hack by transforming experimental breakthroughs into rigorous, evaluative research that can be shared across the industry to further advance performant, secure, and trusted AI.

Thank You & Looking Ahead

This event would not have been possible without our sponsors, mentors, judges, volunteers, and the broader community of researchers and innovators who invested their talent and time into tackling these challenges. Several sponsors reflected on why supporting this work and the future of agentic AI matters deeply.

As Naqash Tahir, Executive Director of R&D and Investments at PGIM’s RealAssetX, noted: “At RealAssetX, we believe that advancing trustworthy and transparent AI is essential for the future of our industry. We’re proud to support the Great Agent Hack 2025, where collaboration and innovation drive meaningful progress in agentic AI.

The creativity, discipline, and technical rigor shown by this year’s builders reflect how quickly agentic AI is advancing and the potential that lies at the intersection of agents, evaluation, and governance. The clear takeaway for enterprises: agentic systems are accelerating rapidly, and the organizations that succeed will be those with the visibility, guardrails, and governance foundations to deploy them confidently at scale.

Table of contents

Stay informed with the latest news & updates
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Share this

Unlock the Future with AI Governance.

Get a demo

Get a demo