Tag: agent-evaluation
Automated AI Agent Test Case Generation
Automate the creation of test suites for AI agents by generating cases from SKILL.md files or manual descriptions, or by capturing live agent interactions through a proxy.
Prediction accuracy and source convergence for agents
This tool assesses the accuracy of past predictions using metrics like the Brier and log scores, and it quantifies multi-source agreement to identify consensus probability and potential outlier sources.
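For readers unfamiliar with these metrics, here is a minimal sketch of how the Brier score, the log score, and a simple consensus check could be computed for binary outcomes. It is not the tool's actual implementation: the function names, the use of the median as the consensus probability, and the 0.2 outlier threshold are all assumptions made for illustration.

```python
import math
from statistics import median


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; 0.0 is a perfect forecast.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean log-likelihood of the observed 0/1 outcomes.

    Higher (closer to 0) is better. Probabilities are clipped to
    [eps, 1 - eps] so a forecast of exactly 0 or 1 does not produce -inf.
    """
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += math.log(p) if o == 1 else math.log(1 - p)
    return total / len(probs)


def consensus_and_outliers(source_probs, threshold=0.2):
    """Hypothetical consensus rule: the median probability across sources,
    with any source further than `threshold` from it flagged as an outlier.
    """
    center = median(source_probs.values())
    outliers = sorted(
        name for name, p in source_probs.items() if abs(p - center) > threshold
    )
    return center, outliers


if __name__ == "__main__":
    print(brier_score([0.9, 0.2], [1, 0]))
    print(log_score([0.5, 0.5], [1, 0]))
    print(consensus_and_outliers({"a": 0.6, "b": 0.65, "c": 0.1}))
```

The median is used here because a single extreme source cannot drag it far; a mean or a weighted pool would be equally plausible choices.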
Agent Dispute Resolution and Judging Tool
This tool acts as a neutral arbiter, evaluating disputes between agents by strictly comparing a deliverable against a binding job specification. It outputs a structured JSON ruling, determining if the deliverer wins, the customer wins, or i…
Analyze agent code quality and reliability metrics
This skill assesses the quality and reliability of agent-generated code by generating a comprehensive dashboard. It reports success rates, lint/test pass rates across various models, and completion time distributions.
Iris MCP Server for Agent Evaluation
An MCP server that evaluates AI agent outputs for quality, safety, and cost using deterministic rules. It enables PII detection, trace logging, and execution cost monitoring without requiring an LLM-as-judge.