Browse skills & tools

skill ★ 7,851

Reproduce Research Methods and Benchmark Metrics

This skill automates implementing a new research method in a structured Jupyter notebook, ensuring reproducibility by reusing existing data splits and dependencies. It systematically records all required metrics and imple…

Upsonic/Upsonic research-reproducibility benchmarking jupyter-notebook data-science

skill ★ 7,851

Systematic research method implementation and benchmarking

This skill automates the systematic implementation of novel research methods in a structured Jupyter notebook. It ensures fair benchmarking by using existing data splits and logging all dependencies and metrics into a structured JSON log …

Upsonic/Upsonic research-implementation benchmarking jupyter-notebook data-science

skill ★ 7,851

Automated experiment benchmarking and metric extraction

Defines comparison metrics and extracts baseline values from notebook outputs to record them in a structured JSON log for downstream evaluation.

Upsonic/Upsonic benchmarking experiment-tracking machine-learning metrics-extraction

skill ★ 8

Structured Multi-Alternative Comparison

A systematic framework for evaluating multiple alternatives using consistent criteria, a comparison matrix, and evidence-based decision recommendations.

n24q02m/wet-mcp decision-making comparison-matrix evaluation-framework structured-analysis

skill ★ 22

Optimize RAG chunk size and retrieval settings

This skill systematically sweeps chunk sizes and retrieval modes (keyword, hybrid, rerank) to benchmark and determine the optimal configuration for your specific corpus. It reports standard metrics like nDCG and MRR against a golden…

nicholasglazer/gnosis-mcp rag retrieval tuning chunk-size

skill ★ 372,633

OpenClaw Test Performance Benchmarking and Optimization

This skill provides a systematic workflow for diagnosing and optimizing the runtime performance of OpenClaw's test suite and plugin ecosystem. It guides developers through establishing performance baselines, collecting detailed metrics (e.g…

openclaw/openclaw openclaw performance testing benchmarking

skill

Experiment Benchmarking and Metric Extraction

Defines comparison metrics and extracts baseline values from notebook outputs to update experiment logs for downstream evaluation.

Upsonic/gpt-computer-assistant benchmarking experiment-tracking metrics-extraction data-science

skill

Automated Experiment Benchmarking Skill

Defines comparison metrics and extracts baseline values from notebook outputs to record them in a structured JSON log for downstream evaluation.

Upsonic/gpt-computer-assistant benchmarking experiment-tracking machine-learning metrics-extraction

skill ★ 4

MCP Server Evaluation Creator

Provides a structured methodology for generating complex, multi-hop Q&A pairs to benchmark the effectiveness of MCP servers through verifiable tool-use evaluations.

jmrplens/gitlab-mcp-server mcp evaluation benchmarking llm-testing

skill ★ 136,096

Skill Creation and Iterative Improvement Agent

This skill guides users through the entire lifecycle of agentic skill development, from initial intent capture and drafting to rigorous testing and optimization. It facilitates benchmarking, variance analysis, and refining skill description…

anthropics/skills skill-creation agent-development llm-engineering skill-optimization