Evaluating AI Agent Skills

SkillsBench: Benchmarking How Well Skills Work Across Diverse Tasks

SkillsBench is the first evaluation framework that measures how well skills work, and the first dataset that measures how effectively models use skills on expert-curated tasks across diverse, high-GDP-value domains.

How SkillsBench Works

SkillsBench evaluates AI agents across three abstraction layers, mirroring how traditional computing systems are structured.

Skills ↔ Applications

Agent Harness ↔ Operating Systems

Models ↔ CPUs

Skills Layer

Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.

Agent Harness Layer

The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.

Models Layer

The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.

Agent Performance

Pass rates across 7 agent–model configurations on SkillsBench (84 tasks, 5 trials per task).

#	Agent	Model	Without Skills	With Skills	Δ (pp)
1	Gemini CLI	Gemini 3 Flash	31.3%	48.7%	+17.4
2	Claude Code	Opus 4.5	22.0%	45.3%	+23.3
3	Codex	GPT-5.2	30.6%	44.7%	+14.1
4	Claude Code	Opus 4.6	30.6%	44.5%	+13.9
5	Gemini CLI	Gemini 3 Pro	27.6%	41.2%	+13.6
6	Claude Code	Sonnet 4.5	17.3%	31.8%	+14.5
7	Claude Code	Haiku 4.5	11.0%	27.7%	+16.7

The leaderboard also reports confidence intervals and normalized gain for each row.
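
The Δ column is the difference between the two pass rates in percentage points. The page does not define normalized gain, so the sketch below assumes a common definition: the share of the remaining headroom (100% minus the baseline) that skills recover. The function names are illustrative, not part of SkillsBench.

```python
# Pass rates (percent) copied from the leaderboard table:
# (agent, model, without skills, with skills).
RESULTS = [
    ("Gemini CLI", "Gemini 3 Flash", 31.3, 48.7),
    ("Claude Code", "Opus 4.5", 22.0, 45.3),
    ("Codex", "GPT-5.2", 30.6, 44.7),
    ("Claude Code", "Opus 4.6", 30.6, 44.5),
    ("Gemini CLI", "Gemini 3 Pro", 27.6, 41.2),
    ("Claude Code", "Sonnet 4.5", 17.3, 31.8),
    ("Claude Code", "Haiku 4.5", 11.0, 27.7),
]


def absolute_gain(without: float, with_skills: float) -> float:
    """Difference in percentage points, rounded like the Δ column."""
    return round(with_skills - without, 1)


def normalized_gain(without: float, with_skills: float) -> float:
    """ASSUMED definition: gain divided by the headroom left above the baseline."""
    return (with_skills - without) / (100.0 - without)


for agent, model, without, with_skills in RESULTS:
    delta = absolute_gain(without, with_skills)
    gain = normalized_gain(without, with_skills)
    print(f"{agent} / {model}: delta={delta:+.1f} pp, normalized gain={gain:.3f}")
```

Under this assumed definition, a configuration with a low baseline needs a larger absolute improvement to reach the same normalized gain as one that starts higher.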

84 tasks · 5 trials per task · 95% CIs
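
The page does not say how the 95% confidence intervals are computed. As one plausible reading, the sketch below averages each task's pass fraction over its trials and applies a normal-approximation interval across tasks. The function name, the choice of interval, and the example counts are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def pass_rate_with_ci(passes_per_task, trials_per_task=5, z=1.96):
    """Return (mean pass rate, lower bound, upper bound) across tasks.

    passes_per_task: number of passing trials for each task.
    The interval is a normal approximation over per-task pass fractions
    (an ASSUMED method), clipped to the valid range [0, 1].
    """
    rates = [passes / trials_per_task for passes in passes_per_task]
    centre = mean(rates)
    half_width = z * stdev(rates) / math.sqrt(len(rates))
    return centre, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical pass counts for six tasks with five trials each.
rate, lower, upper = pass_rate_with_ci([5, 0, 3, 1, 0, 2])
print(f"pass rate {rate:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

Treating the task as the unit of analysis avoids counting the five trials of one task as independent observations, which would make the interval look narrower than the data support.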

Task Registry

3d-scan-calc

engineering

hard

You need to calculate the mass of a 3D printed part. The input (`/root/scan_data.stl`) is a binary STL, but the 2-byte "Attribute Byte Count" at the end of each triangle record is being used to store the **Material ID** of the object.

#binary-parsing

#algorithm

#3d-geometry
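
To illustrate the record layout this task relies on, here is a minimal sketch of reading a binary STL: an 80-byte header, a little-endian uint32 triangle count, then 50-byte records of twelve float32 values (normal plus three vertices) and the 2-byte attribute field, read here as the Material ID. Grouping signed volume by material is an illustrative choice that is only meaningful when each material's triangles form a closed surface. It is not the task's reference solution, and converting volume to mass needs a density table that this excerpt does not show.

```python
import struct
from collections import defaultdict

HEADER_SIZE = 80               # binary STL header, ignored here
RECORD = struct.Struct("<12fH")  # normal + 3 vertices + attribute field (50 bytes)


def signed_tetra_volume(v0, v1, v2):
    """Signed volume of the tetrahedron formed by a triangle and the origin."""
    cross = (
        v1[1] * v2[2] - v1[2] * v2[1],
        v1[2] * v2[0] - v1[0] * v2[2],
        v1[0] * v2[1] - v1[1] * v2[0],
    )
    return (v0[0] * cross[0] + v0[1] * cross[1] + v0[2] * cross[2]) / 6.0


def volume_by_material(data: bytes) -> dict:
    """Sum enclosed volume per Material ID stored in the attribute field."""
    (count,) = struct.unpack_from("<I", data, HEADER_SIZE)
    volumes = defaultdict(float)
    offset = HEADER_SIZE + 4
    for _ in range(count):
        fields = RECORD.unpack_from(data, offset)
        v0, v1, v2 = fields[3:6], fields[6:9], fields[9:12]
        material_id = fields[12]
        volumes[material_id] += signed_tetra_volume(v0, v1, v2)
        offset += RECORD.size
    # Orientation only affects the sign, so report magnitudes.
    return {material: abs(volume) for material, volume in volumes.items()}
```

A caller would pass the file contents, for example `volume_by_material(open("/root/scan_data.stl", "rb").read())`.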

adaptive-cruise-control

control-systems

medium

You need to implement an Adaptive Cruise Control (ACC) simulation that maintains the set speed (30m/s) when no vehicles are detected ahead, and automatically adjusts speed to maintain a safe following distance when a vehicle is detected ahead. The targets are: speed rise time <10s, speed overshoot <

#python

#pid-control

#simulation

azure-bgp-oscillation-route-leak

bgp-route

medium

Detect BGP route oscillation and leaks in an Azure Virtual WAN network topology and evaluate which solutions can fix them.

#bgp

#oscillation

#azure

citation-check

research

medium

You are helping a research team verify the integrity of their bibliography before submitting a paper. The team suspects that some citations in their BibTeX file may be fake or hallucinated.

#citation

#bibtex

#academic

civ6-adjacency-optimizer

games

hard

Sid Meier's Civilization 6 District Adjacency Bonus Optimizer

#optimization

#constraint-satisfaction

#game-mechanics

court-form-filling

document-processing

easy

Fill the California Small Claims Court form at `/root/sc100-blank.pdf` based on the case description below, and save the filled one to `/root/sc100-filled.pdf`. Only fill in the necessary fields and leave the court-filled, optional fields or fields not mentioned in the case description below empty.

#pdf

#form-filling

#legal

crystallographic-wyckoff-position-analysis

materials_science

medium

You are a materials science researcher. You want to analyze the crystal structure of a few samples. You are given a few CIF files generated by SHELX through X-ray Diffraction.

#crystallography

#pymatgen

#sympy

dapt-intrusion-detection

security

hard

You’re given `packets.pcap` (subset of DAPT2020 traffic). Compute the stats and fill in only the `value` column in `/root/network_stats.csv`. Lines starting with `#` are comments—leave them.

#security

#network

#pcap

data-to-d3

Data Visualization

medium

Please use D3.js (v6) to visualize input data stored at `/root/data/stock-descriptions.csv` and `/root/data/indiv-stock/` for company stock details and individual stock price histories. Please return the output as a single-page web app at `/root/output/index.html`. I should be able to open it in a w

#d3.js

#data-visualization

#interactive-charts

dialogue-parser

game

easy

You will implement a dialogue parser that converts a given text file into a structured JSON graph. You will be given a text file `/app/script.txt`, and output a validated JSON graph `/app/dialogue.json` and visualization `/app/dialogue.dot`. You should implement a function `def parse_script(text: st

#game development

#parsing

dynamic-object-aware-egomotion

video-analysis

medium

Given a video file /root/input.mp4, help me to analyze the camera motion (egomotion) and detect dynamic objects in this video. You need to give me 2 files: 1. A json file named /root/pred_instructions.json, it should record the mapping frame intervals to motion labels (valid labels: Stay, Dolly In,

#video

#egomotion

#spatial-reasoning

earthquake-phase-association

seismology

hard

You have some earthquake traces stored at `/root/data/wave.mseed`, and station information stored at `/root/data/stations.csv`.

#science

#earth-science

#seismology

earthquake-plate-calculation

geophysics

medium

You are a geospatial analyst with expertise in plate tectonics and earthquake analysis. I want to find the earthquake that occurred furthest from the Pacific plate boundary within the Pacific plate itself. Use GeoPandas projections. Output the result to `/root/answer.json` as a JSON file with the fo

#geophysics

#spatial-analysis

#gis

econ-detrending-correlation

economics

medium

In macroeconomics, understanding business cycle dynamics requires separating the trend component (long-term growth) from the cyclical component (short-term fluctuations) of economic time series. This task involves applying standard detrending techniques to analyze how consumption and investment move

#economics

#statistics

#data-analysis

energy-ac-optimal-power-flow

energy

medium

As an operator-planner in an Independent System Operator, you create a base case for tomorrow's peak hour each morning in order to establish the voltage profile and verify that the day-ahead market schedule is AC-feasible. In order to do this, you should use the model described in `math-model.md` to

#energy

#system-operation

#ac-optimal-power-flow

energy-market-pricing

energy

hard

You’re a market analyst at a regional transmission organization. The system managers have noticed that there was a peculiar price increase at multiple buses in yesterday’s day-ahead market. They suspect that one transmission constraint is causing the congestion problem and would like you to carry ou

#energy

#market-pricing

#optimization

enterprise-information-search

enterprise-search

hard

I want you to help me retrieve information over enterprise data.

#information-retrieval

#question-answering

#heterogeneous-data

exceltable-in-ppt

Office Operation

medium

You are working on a pptx file `/root/input.pptx` for financial reporting. The file contains an embedded Excel table of currency rates, where each value corresponds to the exchange rate from the row currency to the column currency and some cells contain formulas. There is a text box next to the tabl

#excel

#pptx

#table-update

exoplanet-detection-period

astronomy

medium

You are provided with a lightcurve obtained with the TESS space telescope at `/root/data/tess_lc.txt`, with the columns: Time (MJD), Normalized flux, Quality flag (0 means good data), Flux uncertainty.

#science

#astronomy

#planetary-science
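
As a small illustration of the first step this task implies, the sketch below parses light-curve rows in the stated column order (time, normalized flux, quality flag, flux uncertainty) and keeps only rows whose quality flag is 0. The whitespace-separated layout, the `#` comment handling, and the sample values are assumptions; the real file may differ.

```python
def load_good_points(text: str):
    """Return (time, flux, flux_err) tuples for rows with quality flag 0.

    ASSUMES whitespace-separated columns in the order:
    time (MJD), normalized flux, quality flag, flux uncertainty.
    """
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment/header lines
        time, flux, quality, flux_err = line.split()[:4]
        if int(float(quality)) == 0:
            points.append((float(time), float(flux), float(flux_err)))
    return points


# Hypothetical sample rows, not taken from the task data.
SAMPLE = """\
# time flux quality flux_err
58000.00 1.0002 0 0.0005
58000.02 0.9871 4 0.0005
58000.04 0.9998 0 0.0006
"""
print(load_good_points(SAMPLE))
```

Filtering on the quality flag before any period search keeps flagged cadences from producing spurious dips.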

financial-modeling-qa

financial modeling

hard

Please analyze a large-scale data file and answer a financial data analysis question.

#data analysis

#pdf parsing

#financial knowledge