Tag: evaluation

skill ★ 7,851

Structured Skill for Metric Benchmarking and Logging

This skill defines and extracts comprehensive comparison metrics and baseline values from existing analysis notebooks. It appends a structured Phase 3 benchmark entry to the experiment's log.json, ensuring all necessary metrics are recorded…

Upsonic/Upsonic benchmark metrics evaluation json-logging

tool ★ 21,403

Reproducible Redteaming for Prompt Security QA

This tool facilitates reproducible redteaming evaluations of prompts, allowing developers to run, inspect, and triage security scan results. It supports stable evaluation of existing test artifacts or full regeneration, enabling focused rer…

promptfoo/promptfoo redteaming prompt-security qa llm

skill ★ 21,403

Authoring and Running Promptfoo Evaluation Suites

This skill guides developers through authoring comprehensive promptfoo evaluation suites for robust regression testing and quality assurance. It covers defining prompts, structuring test cases, implementing various assertions, and validatin…

promptfoo/promptfoo promptfoo evaluation qa regression-testing

skill ★ 21,403

Creating and managing promptfoo evaluation suites

This skill guides the creation and maintenance of comprehensive promptfoo evaluation suites, enabling rigorous QA for non-redteam coverage, regression testing, and new matrix development. It details structuring configs, writing prompts, sel…

promptfoo/promptfoo promptfoo evaluation qa llm

skill ★ 17,312

Evaluates politeness of pull request comments

This skill assesses the tone and politeness of a pull request review comment. It is designed to judge the communication quality, deliberately excluding technical correctness or code risk assessment.

topoteretes/cognee pr-review comment-analysis tone-detection evaluation

skill ★ 394

Full Lifecycle ML Model Development and Deployment

This skill enables the full lifecycle of machine learning, covering model training, fine-tuning (PyTorch/Transformers), and deployment pipelines. It specialises in building robust RAG systems, optimizing inference, and ensuring rigorous eva…

sipyourdrink-ltd/bernstein ml-engineering model-training rag-pipelines inference-optimization

skill ★ 1

Skill Creation and Iterative Improvement Agent

This skill guides users through the entire lifecycle of developing agentic skills. It assists with drafting, running quantitative evaluations, analysing benchmarks, and iteratively refining the skill description for optimal triggering accur…

Sowiedu/Edict skill-creation agent-development skill-optimization evaluation

skill

Agent Skill Authoring and Evaluation

Design, refine, and audit reusable agent skills by creating structured SKILL.md files and evaluating trigger precision. The process includes auditing skill collections for redundancy and designing testable behaviour through evaluation promp…

TencentCloudBase/CloudBase-AI-ToolKit agent-skills skill-authoring prompt-engineering evaluation

skill ★ 4

MCP Server Evaluation Creator

Provides a structured methodology for generating complex, multi-hop Q&A pairs to benchmark the effectiveness of MCP servers through verifiable tool-use evaluations.

jmrplens/gitlab-mcp-server mcp evaluation benchmarking llm-testing

skill ★ 136,096

Skill Creation and Iterative Improvement Agent

This skill guides users through the entire lifecycle of agentic skill development, from initial intent capture and drafting to rigorous testing and optimization. It facilitates benchmarking, variance analysis, and refining skill description…

anthropics/skills skill-creation agent-development llm-engineering skill-optimization