Browse skills & tools

skill ★ 105

Automated AI Agent Test Case Generation

Automate the creation of test suites for AI agents by generating cases from SKILL.md files or manual descriptions, or by capturing live agent interactions through a proxy.

hidai25/eval-view ai-agents test-generation automated-testing agent-evaluation

tool ★ 8

Predicting accuracy and source convergence for agents

This tool scores the accuracy of past predictions with metrics such as the Brier score and log score, and quantifies agreement across multiple sources to estimate a consensus probability and flag potential outlier sources.

Whatsonyourmind/oraclaw prediction-scoring calibration convergence brier-score
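
For readers unfamiliar with the metrics named in this entry, the following is a minimal sketch of how Brier and log scores are commonly computed for binary outcomes, along with one simple way to estimate a consensus and flag outliers. It is not taken from the oraclaw codebase: the function names, the median-based consensus, the 0.2 outlier threshold, and the sign convention for the log score (mean negative log-likelihood, lower is better) are all assumptions made for illustration.

```python
import math
from statistics import median


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the observed outcomes (lower is better)."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clamp so log(0) can never occur
        total += -math.log(p if o == 1 else 1 - p)
    return total / len(probs)


def consensus_and_outliers(source_probs, threshold=0.2):
    """Hypothetical helper: median probability across sources, plus the names
    of sources whose estimate differs from that median by more than `threshold`."""
    consensus = median(source_probs.values())
    outliers = sorted(
        name for name, p in source_probs.items() if abs(p - consensus) > threshold
    )
    return consensus, outliers


if __name__ == "__main__":
    print(brier_score([0.9, 0.2], [1, 0]))
    print(log_score([0.5], [1]))
    print(consensus_and_outliers({"a": 0.6, "b": 0.65, "c": 0.1}))
```

A median is used here because it is less sensitive to a single extreme source than a mean; the tool itself may use a different aggregation.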

tool ★ 2

Agent Dispute Resolution and Judging Tool

This tool acts as a neutral arbiter, evaluating disputes between agents by strictly comparing a deliverable against a binding job specification. It outputs a structured JSON ruling, determining if the deliverer wins, the customer wins, or i…

MeshLedger/MeshLedger dispute-resolution agent-evaluation llm-judgement structured-ruling

skill ★ 394

Analyze agent code quality and reliability metrics

This skill generates a dashboard on the quality and reliability of agent-generated code. It reports success rates, lint and test pass rates across models, and completion time distributions.

sipyourdrink-ltd/bernstein quality-metrics code-analysis agent-evaluation performance-reporting

tool ★ 6

Iris MCP Server for Agent Evaluation

An MCP server that evaluates AI agent outputs for quality, safety, and cost using deterministic rules. It supports PII detection, trace logging, and execution cost monitoring without requiring an LLM-as-judge.

iris-eval/mcp-server mcp agent-evaluation llm-observability pii-detection