Tag: llm-evaluation

tool ★ 700

Comprehensive AI Data and Model Quality Evaluation

Dingo provides a comprehensive framework for evaluating data, models, and applications using both deterministic rule-based checks and advanced LLM-based metrics. It supports complex evaluations like RAG faithfulness and 3H assessment via CL…

MigoXLab/dingo data-quality llm-evaluation model-testing rag-metrics
skill ★ 137

Adversarial Code Review for Pipeline Hardening

This skill systematically hardens BFCL training and evaluation pipelines through iterative adversarial review rounds. It employs an external LLM to identify potential bugs, which are then verified against the codebase to en…

dcostenco/prism-coder adversarial-review code-review pipeline-hardening llm-evaluation
skill ★ 21,403

Redteam Plugin and Grader Development Standards

Provides standardised protocols for developing redteam plugins and graders, including XML tag requirements, rubric structures, and attack template configurations.

promptfoo/promptfoo red-teaming plugin-development promptfoo llm-evaluation
skill ★ 1,072

Skill Creator for Agentic Workflows

An agentic skill for the end-to-end development of new skills, covering drafting, quantitative evaluation, and iterative refinement. It assists in capturing user intent, authoring structured SKILL.md files, and optimising descriptions for i…

Onelevenvy/flock skill-development agentic-workflows llm-evaluation prompt-engineering
skill ★ 105

AI Agent Regression Testing with EvalView

Detects regressions in AI agent behaviour by comparing current outputs and tool calls against golden baselines. It flags changed outputs, altered tool usage, and significant score drops.

hidai25/eval-view regression-testing ai-agents evalview llm-evaluation
tool

Comprehensive AI Data and Model Quality Evaluator

Dingo provides a comprehensive framework for evaluating data and AI outputs using both deterministic rule-based checks and advanced LLM-based metrics. It supports complex workflows, including RAG evaluation and autonomous fact-checking, via…

DataEval/dingo data-quality llm-evaluation rule-based rag-metrics
skill ★ 43,146

Anthropic Cookbook Notebook Auditor

Audits Anthropic Cookbook notebooks against a specific style guide and rubric to ensure high-quality technical content and pedagogical effectiveness. It incorporates automated checks for hardcoded secrets and evaluates narrative, code, and …

anthropics/claude-cookbooks notebook-auditing code-review anthropic-cookbooks quality-assurance