Agents
Agent evaluation is difficult: content explodes, traces run long, and there are many points of failure. Traditional methods evaluate static checkpoints and do not properly account for long-term agent trajectories.

Areas of Experience
Our team's and annotators' expertise spans Data Science and Computer Science disciplines, and we can evaluate across concepts, usage, and actions.

Percival
Percival, our AI debugger, detects 20+ failure modes in agentic traces and suggests optimizations for agentic systems.

TRAIL
The team has launched TRAIL, the first benchmark for agentic reasoning and trace evaluation, covering 20+ failure types with human-labeled execution paths, on which SOTA models score below 11%.

Weaviate
Leveraging Patronus AI's Percival to Accelerate Complex AI Agent Development

Nova AI
Using Patronus AI's Percival to Auto-Optimize AI Agents for Code Generation

Our Approach
From groundbreaking research to hands-on debugging, we’ve built the tools and expertise to make agent evaluation faster, more accurate, and more actionable.
What We Provide
Task Completion
Ensuring that the agent completes the task according to instructions
Delegation Policies
Testing the agent’s ability to delegate and coordinate tasks appropriately
Control Flow Execution
Checking that the order of operations is correct
Replays
Reproducing agent behavior for further probing and iteration
Tool Use
Testing for appropriate tool usage when completing the task
Path Finding
Checking the agent's reasoning behind its plan of action

Standard Product
Current platform offerings, including evaluators, experiments, logs, and traces, to get you up and running immediately
Custom Product
Collaborate on the creation of industry-grade guardrails (LLM-as-a-Judge), benchmarks, or RL environments to evaluate with more granularity
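
As a rough illustration of the LLM-as-a-Judge pattern mentioned above, the sketch below grades an agent's output against its instruction. The prompt and function names are hypothetical and are not part of any Patronus AI API; any LLM client can be plugged in as the judge.

# Minimal, illustrative LLM-as-a-Judge sketch. All names are hypothetical
# and not tied to any specific product; plug in your own LLM client.

JUDGE_PROMPT = """You are grading an AI agent's output.
Instruction: {instruction}
Agent output: {output}
Did the agent complete the task as instructed?
Answer PASS or FAIL, followed by a one-line reason."""


def judge_task_completion(instruction: str, output: str, call_llm) -> dict:
    """Ask a judge model whether the agent's output satisfies the instruction.

    `call_llm` is any callable that takes a prompt string and returns the
    judge model's text response (e.g. a thin wrapper around your LLM SDK).
    """
    verdict = call_llm(JUDGE_PROMPT.format(instruction=instruction, output=output))
    return {
        "passed": verdict.strip().upper().startswith("PASS"),
        "rationale": verdict.strip(),
    }


if __name__ == "__main__":
    # Stubbed judge, just to show the shape of the result.
    fake_judge = lambda prompt: "PASS - the summary covers all three requested points."
    print(judge_task_completion(
        "Summarize the report in three bullet points.",
        "- revenue grew 12%\n- churn fell\n- two new markets opened",
        fake_judge,
    ))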
