Tag: benchmarking
Reproduce Research Methods and Benchmark Metrics
This skill automates implementing a new research method in a structured Jupyter notebook, ensuring reproducibility by reusing existing data splits and dependencies. It systematically records all required metrics and imple…
Systematic research method implementation and benchmarking
This skill automates the systematic implementation of novel research methods into a structured Jupyter notebook. It ensures fair benchmarking by using existing data splits and logging all dependencies and metrics into a structured JSON log …
Automated experiment benchmarking and metric extraction
Defines comparison metrics and extracts baseline values from notebook outputs to record them in a structured JSON log for downstream evaluation.
Structured Multi-Alternative Comparison
A systematic framework for evaluating multiple alternatives using consistent criteria, a comparison matrix, and evidence-based decision recommendations.
Optimize RAG chunk size and retrieval settings
This skill systematically sweeps various chunk sizes and retrieval modes (keyword, hybrid, rerank) to benchmark and determine the optimal configuration for your specific corpus. It reports standard metrics like nDCG and MRR against a golden…
OpenClaw Test Performance Benchmarking and Optimization
This skill provides a systematic workflow for diagnosing and optimizing the runtime performance of OpenClaw's test suite and plugin ecosystem. It guides developers through establishing performance baselines, collecting detailed metrics (e.g…
Experiment Benchmarking and Metric Extraction
Defines comparison metrics and extracts baseline values from notebook outputs to update experiment logs for downstream evaluation.
Automated Experiment Benchmarking Skill
Defines comparison metrics and extracts baseline values from notebook outputs to record them in a structured JSON log for downstream evaluation.
MCP Server Evaluation Creator
Provides a structured methodology for generating complex, multi-hop Q&A pairs to benchmark the effectiveness of MCP servers through verifiable tool-use evaluations.
Skill Creation and Iterative Improvement Agent
This skill guides users through the entire lifecycle of agentic skill development, from initial intent capture and drafting to rigorous testing and optimization. It facilitates benchmarking, variance analysis, and refining skill description…
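The RAG entry above names nDCG and MRR as its reported metrics. As a rough illustration of what those two numbers measure, here is a minimal sketch in Python. It is not taken from any of the listed skills: the function names are hypothetical, each query's results are assumed to be given as a list of relevance grades in ranked order, and the nDCG ideal ordering is computed only from the retrieved list, which is a simplification.

```python
import math
from typing import Sequence


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average the reciprocal rank over all queries (0.0 for an empty set)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering of the same list.

    Assumption: the ideal ranking is built from the retrieved results only,
    not from every relevant document in the corpus.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Two hypothetical queries; 1 marks a relevant chunk, 0 an irrelevant one.
    runs = [[1, 0, 0], [0, 1, 0]]
    print("MRR:", mean_reciprocal_rank(runs))
    print("nDCG@3:", [ndcg(r, 3) for r in runs])
```

A sweep over chunk sizes or retrieval modes would compute these per configuration against the same labeled query set and compare the averages.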