skills-bench: a benchmark for ai agent skills evaluation
SkillsBench is a gym-style evaluation framework that measures how well AI agent Skills work in practice, enabling quantitative assessment of correctness, efficiency, and robustness.
introducing SkillsBench 1.0
an anthropic research project
task resolution success rate for top agents and models on skills-bench@1.0
api-integration-pipeline
Build a data pipeline that fetches data from multiple REST APIs, transforms JSON responses, handles pagination and rate limiting, and outputs a consolidated CSV report with error handling and retry logic.
data-analysis-report
Analyze a dataset containing sales data, compute statistical metrics, identify trends and anomalies, generate visualizations, and produce a comprehensive PDF report with executive summary and recommendations.
document-generator
Create a document generation system that takes structured data input and produces formatted Word documents using templates, with support for tables, images, headers/footers, and dynamic content insertion.
file-processing-suite
Build a file processing utility that watches directories for new files, extracts text from PDFs and images using OCR, categorizes content, and organizes files into structured folders based on content analysis.
git-workflow-automation
Create a Git workflow automation tool that manages branch creation, handles merge conflicts, generates changelogs from commit history, enforces conventional commits, and integrates with CI/CD pipelines.
cross-platform-build-config
Configure a cross-platform build system using CMake that compiles a C++ project for Windows, macOS, and Linux, handles dependencies via vcpkg, and generates installers for each platform.
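The page reports a task resolution success rate but does not say how it is computed. As a minimal sketch, assuming the rate is the number of resolved trials divided by the number of attempted trials, it could be calculated per task and overall as below. The function names, the record layout, and the trial outcomes are hypothetical and invented for illustration only. They are not SkillsBench results or its API.

```python
from collections import defaultdict


def success_rate(outcomes):
    """Return the fraction of trials that were resolved.

    `outcomes` is a list of booleans, one per trial (assumed layout).
    """
    if not outcomes:
        raise ValueError("no trials recorded")
    return sum(outcomes) / len(outcomes)


def per_task_rates(trials):
    """Group (task_id, resolved) pairs by task and compute each task's rate."""
    grouped = defaultdict(list)
    for task_id, resolved in trials:
        grouped[task_id].append(resolved)
    return {task: success_rate(results) for task, results in grouped.items()}


# Hypothetical trial records: placeholder outcomes, not benchmark data.
trials = [
    ("api-integration-pipeline", True),
    ("api-integration-pipeline", False),
    ("data-analysis-report", True),
    ("data-analysis-report", True),
]

rates = per_task_rates(trials)
overall = success_rate([resolved for _, resolved in trials])
```

Averaging over all trials weights each task by how many times it was attempted. Averaging the per-task rates instead would weight every task equally. The source does not indicate which convention applies.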