Teach an AI agent how to fish for information and it can feed itself with data. Tell an AI agent to figure things out on its own and it may make things worse.
AI agents are machine learning models (e.g. Claude Opus 4.6) that have access to other software through a CLI harness (e.g. Claude Code) and operate in an iterative loop. These agents can be instructed to handle various tasks, some of which may not be covered in their training data.
When lacking the appropriate training, software agents can be given access to new “skills,” which are essentially added reference material to impart domain-specific capabilities. “Skills” in this context refer to instructions, metadata, and other resources like scripts and templates that agents load to obtain procedural knowledge.
For example, an AI agent could be instructed how to process PDFs with a skill that consists of markdown text, code, libraries, and reference material about APIs. While the agent might have some idea how to do this from its training data, it should perform better with more specific guidance.
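To get a sense of what that looks like in practice, Anthropic's published skills format is a useful reference: a skill is a folder containing a SKILL.md file – YAML frontmatter that tells the agent when to load it, followed by plain-language instructions – plus any scripts or reference files those instructions point to. The sketch below is purely illustrative; the file names and library choice are hypothetical rather than drawn from the study.

```
---
name: pdf-processing
description: Extract text and form fields from PDF files. Use when a task
  involves reading, filling, or summarizing PDF documents.
---

Use the pypdf library for text extraction. For scanned pages, fall back to
OCR via scripts/ocr_page.py. Form-field naming conventions and API notes
live in reference.md.
```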
Yet according to a recent study, SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, asking an agent to develop that skill on its own will end in disappointment. The “intelligence” part of artificial intelligence is somewhat overstated.
At least that’s the case with large language models (LLMs) at inference time – when the trained model is being used as opposed to during the training process.
A new benchmark
Certain forms of machine learning, deep learning among them, can be applied in ways that let neural network models improve their own performance at domain-specific tasks like video games – but that improvement happens during training, not at inference time.
The explosion of AI agents – Claude Code from Anthropic, Gemini CLI from Google, and Codex CLI from OpenAI – has led to the rapid development of skills to augment what the agents can do. Skill directories are proliferating like weeds. And given how OpenClaw agents have been teaching each other in the Moltbook automated community network, it seems well past time to figure out how good a job they do at it.
To date, there’s been no common way to see whether these skills deliver what they promise. So a team of 40 (!) computer scientists, affiliated with companies like Amazon, BenchFlow, ByteDance, Foxconn, and Zennity, and various universities, including Carnegie Mellon, Stanford, UC Berkeley, and Oxford, set out to develop a benchmark test to evaluate how agent skills augment performance during inference.
The authors, led by Xiangyi Li, founder of agent measurement startup BenchFlow, developed a test they dubbed SkillsBench, and described their findings in the above-mentioned preprint paper.
The researchers looked at seven agent-model setups across 84 tasks, for a total of 7,308 trajectories – each one a single agent’s attempt at solving a single task under a specific skills condition. Three conditions were tested: no skills, curated skills, and self-generated skills.
On average, agents using curated skills – ones designed by people – completed tasks at a rate 16.2 percentage points higher than their no-skill counterparts, though with high variance.
One example cited in the study is a flood-risk analysis task. Agents without skills didn’t apply the appropriate statistical math, and so achieved a pass rate of only 2.9 percent. With a curated skill that told the agent to use the Pearson Type III probability distribution and follow the standard USGS methodology, and that spelled out details like which scipy functions to call and how to interpret their parameters, the agent’s pass rate jumped to 80 percent.
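For a feel of what such a skill steers the agent toward, here is a minimal sketch of a USGS-style log-Pearson Type III flood-frequency calculation using scipy – purely illustrative, with made-up peak-flow numbers, and not taken from the benchmark’s actual task or skill files.

```python
# Sketch of a log-Pearson Type III flood-frequency estimate (illustrative
# only -- the peak-flow values below are invented for the example).
import numpy as np
from scipy import stats

# Hypothetical annual peak streamflows (cubic feet per second)
peaks = np.array([1200, 980, 1500, 2100, 870, 1320, 1750, 990, 1600, 1410])

# USGS-style flood frequency analysis works on the log10 of the peaks
log_peaks = np.log10(peaks)
mean, std = log_peaks.mean(), log_peaks.std(ddof=1)
skew = stats.skew(log_peaks, bias=False)

# 100-year flood = 99th percentile of the fitted Pearson Type III
# distribution in log space, converted back to flow units
q100_log = stats.pearson3.ppf(0.99, skew, loc=mean, scale=std)
q100 = 10 ** q100_log
print(f"Estimated 100-year flood: {q100:,.0f} cfs")
```

None of these choices – log-transforming the peaks, estimating the skew, knowing which scipy distribution to call – is obvious to an agent that hasn’t been pointed at the USGS guidance, which is exactly the gap the curated skill fills.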
When analyzed in terms of specific knowledge domains, curating healthcare (+51.9 percentage points) and manufacturing (+41.9 percentage points) skills helped AI agents the most, while curating skills related to mathematics (+6.0 percentage points) and software engineering (+4.5 percentage points) provided smaller gains. The authors explain this by observing that domains requiring specialized knowledge tend to be underrepresented in training data. So it makes sense for humans to augment agents working on tasks in those domains.
And when doing so, less is more – skills with only a few (2-3) modules performed better than massive data dumps.
That applies to model scale too – curated skills help smaller models punch above their weight class in terms of task completion. Anthropic’s Claude Haiku 4.5 model with skills (27.7 percent) outperformed Haiku 4.5 without skills (11 percent) and also Claude Opus 4.5 without skills (22 percent).
When it came time to get agents to teach themselves skills, the study authors directed them to
- analyze the requirements, domain knowledge, and APIs the task calls for;
- write 1-5 modular skill documents to solve the task;
- save each skill as a markdown file; and
- then solve the task using the generated reference material.
Agents that tried this did worse than if they hadn’t tried at all.
“Self-generated skills provide negligible or negative benefit (–1.3 percentage points average), demonstrating that effective skills require human-curated domain expertise,” the authors state.
For now at least, the AI revolution will not be fully automated – the machines still need human teachers to set them on the right path. ®

