Tag: evaluation
Structured Skill for Metric Benchmarking and Logging
This skill defines and extracts comprehensive comparison metrics and baseline values from existing analysis notebooks. It appends a structured Phase 3 benchmark entry to the experiment's log.json, ensuring all necessary metrics are recorded…
Reproducible Redteaming for Prompt Security QA
This tool facilitates reproducible redteaming evaluations of prompts, allowing developers to run, inspect, and triage security scan results. It supports stable evaluation of existing test artifacts or full regeneration, enabling focused rer…
Authoring and Running Promptfoo Evaluation Suites
This skill guides developers through authoring comprehensive promptfoo evaluation suites for robust regression testing and quality assurance. It covers defining prompts, structuring test cases, implementing various assertions, and validatin…
Creating and Managing Promptfoo Evaluation Suites
This skill guides the creation and maintenance of comprehensive promptfoo evaluation suites, enabling rigorous QA for non-redteam coverage, regression testing, and new matrix development. It details structuring configs, writing prompts, sel…
Evaluating Politeness of Pull Request Comments
This skill assesses the tone and politeness of a pull request review comment. It judges communication quality only and deliberately excludes technical correctness and code risk assessment.
Full Lifecycle ML Model Development and Deployment
This skill covers the full machine learning lifecycle, from model training and fine-tuning (PyTorch/Transformers) to deployment pipelines. It specialises in building robust RAG systems, optimizing inference, and ensuring rigorous eva…
Skill Creation and Iterative Improvement Agent
This skill guides users through the entire lifecycle of developing agentic skills. It assists with drafting, running quantitative evaluations, analysing benchmarks, and iteratively refining the skill description for optimal triggering accur…
Agent Skill Authoring and Evaluation
Design, refine, and audit reusable agent skills by creating structured SKILL.md files and evaluating trigger precision. The process includes auditing skill collections for redundancy and designing testable behaviour through evaluation promp…
MCP Server Evaluation Creator
Provides a structured methodology for generating complex, multi-hop Q&A pairs to benchmark the effectiveness of MCP servers through verifiable tool-use evaluations.
Skill Creation and Iterative Improvement Agent
This skill guides users through the entire lifecycle of agentic skill development, from initial intent capture and drafting to rigorous testing and optimization. It facilitates benchmarking, variance analysis, and refining skill description…