Tag: llm-evaluation
Comprehensive AI Data and Model Quality Evaluation
Dingo provides a comprehensive framework for evaluating data, models, and applications using both deterministic rule-based checks and advanced LLM-based metrics. It supports complex evaluations like RAG faithfulness and 3H assessment via CL…
Adversarial Code Review for Pipeline Hardening
This skill facilitates the systematic hardening of BFCL training and evaluation pipelines through iterative adversarial review rounds. It employs an external LLM to identify potential bugs, which are then verified against the codebase to en…
Redteam Plugin and Grader Development Standards
Provides standardised protocols for developing redteam plugins and graders, including XML tag requirements, rubric structures, and attack template configurations.
Skill Creator for Agentic Workflows
An agentic skill for the end-to-end development of new skills, covering drafting, quantitative evaluation, and iterative refinement. It assists in capturing user intent, authoring structured SKILL.md files, and optimising descriptions for i…
AI Agent Regression Testing with EvalView
Detects regressions in AI agent behaviour by comparing current outputs and tool calls against golden baselines. It flags changed outputs, altered tool usage, and significant score drops.
Comprehensive AI Data and Model Quality Evaluator
Dingo provides a comprehensive framework for evaluating data and AI outputs using both deterministic rule-based checks and advanced LLM-based metrics. It supports complex workflows, including RAG evaluation and autonomous fact-checking, via…
Anthropic Cookbook Notebook Auditor
Audits Anthropic Cookbook notebooks against a specific style guide and rubric to ensure high-quality technical content and pedagogical effectiveness. It incorporates automated checks for hardcoded secrets and evaluates narrative, code, and …