SkillsBench 1.1: Agent Skills Benchmark Release
SkillsBench 1.1 updates the Agent Skills benchmark to 87 native BenchFlow task.md packages across 8 domains. The paper reports 18 model-harness configurations; the public leaderboard currently tracks 24, including previous results. In the paper's aggregate, curated Skills raise the mean resolution rate from 33.9% to 50.5% (+16.6 percentage points).
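
The headline figure is an absolute difference in percentage points, not a relative improvement. A minimal sketch of the arithmetic follows, using only the two aggregate rates quoted above; the function names are hypothetical, and the relative gain is derived here for illustration and is not a number reported in the paper.

```python
def point_gain(baseline_pct: float, with_skills_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(with_skills_pct - baseline_pct, 1)


def relative_gain(baseline_pct: float, with_skills_pct: float) -> float:
    """Change relative to the baseline rate, as a percent, rounded to one decimal."""
    return round(100 * (with_skills_pct - baseline_pct) / baseline_pct, 1)


# Aggregate mean resolution rates quoted in the release summary.
baseline = 33.9      # without curated Skills
with_skills = 50.5   # with curated Skills

print(point_gain(baseline, with_skills))     # 16.6 (percentage points)
print(relative_gain(baseline, with_skills))  # 49.0 (percent over baseline, derived)
```

Keeping the two measures separate avoids reading "+16.6" as a 16.6% relative improvement.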