skills-bench: a benchmark for AI agent skills evaluation

SkillsBench is a gym-style evaluation framework that measures how well AI agent Skills work in practice, enabling quantitative assessment of correctness, efficiency, and robustness.

introducing SkillsBench 1.0: read our launch announcement ↗

an Anthropic research project

agent performance

task resolution success rate for top agents and models on skills-bench@1.0

Claude Code (Opus 4.5): 87.2%
Claude Code (Sonnet 4): 82.1%
Cursor (GPT-5): 79.4%
Windsurf (Mixed): 76.8%
Aider (Sonnet 4): 74.2%
Cline (Opus 4): 71.5%
Continue (GPT-4.5): 68.9%
Copilot (Mixed): 65.3%
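
The page does not say how the success rate is computed. As a rough sketch only, assuming it is the share of attempted tasks that an agent resolves, the calculation could look like the following. The `TaskResult` type, the `success_rate` function, and the pass/fail values are hypothetical and are not taken from SkillsBench itself; only the task names come from the examples on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent run on one task."""
    task_id: str
    resolved: bool


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks marked as resolved.

    Assumption: every task counts equally and there is no partial credit.
    """
    if not results:
        raise ValueError("no task results to score")
    resolved = sum(1 for r in results if r.resolved)
    return 100.0 * resolved / len(results)


# Illustrative outcomes only; these are not real benchmark results.
results = [
    TaskResult("api-integration-pipeline", True),
    TaskResult("data-analysis-report", True),
    TaskResult("document-generator", False),
    TaskResult("git-workflow-automation", True),
]

print(f"{success_rate(results):.1f}%")  # prints "75.0%"
```

A real harness might weight tasks by difficulty or average over repeated runs; this sketch deliberately does neither.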
skills-bench task examples

api-integration-pipeline

data-processing · medium

Build a data pipeline that fetches data from multiple REST APIs, transforms JSON responses, handles pagination and rate limiting, and outputs a consolidated CSV report with error handling and retry logic.

data-analysis-report

data-science · medium

Analyze a dataset containing sales data, compute statistical metrics, identify trends and anomalies, generate visualizations, and produce a comprehensive PDF report with executive summary and recommendations.

document-generator

automation · easy

Create a document generation system that takes structured data input and produces formatted Word documents using templates, with support for tables, images, headers/footers, and dynamic content insertion.

file-processing-suite

file-operations · medium

Build a file processing utility that watches directories for new files, extracts text from PDFs and images using OCR, categorizes content, and organizes files into structured folders based on content analysis.

git-workflow-automation

devops · hard

Create a Git workflow automation tool that manages branch creation, handles merge conflicts, generates changelogs from commit history, enforces conventional commits, and integrates with CI/CD pipelines.

cross-platform-build-config

build-systems · hard

Configure a cross-platform build system using CMake that compiles a C++ project for Windows, macOS, and Linux, handles dependencies via vcpkg, and generates installers for each platform.
