Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Maty Bohacek¹,², Nino Scherrer³, Nicholas Dufour², Thomas Leung², Christoph Bregler², and Stephanie C. Y. Chan²

¹ Stanford University    ² Google DeepMind    ³ Google, Paradigms of Intelligence Team

We propose Competency Gaps (CG), a method that uses sparse autoencoders (SAEs) to automatically uncover two types of gaps in LLM evaluation: benchmark gaps (imbalanced coverage of concepts within benchmarks) and model gaps (areas where LLMs systematically underperform). CG extracts SAE concept activations and computes saliency-weighted performance scores across benchmark data. Applied to Gemma2-2B and Llama3.1-8B across ten benchmarks, our analysis reveals that models consistently underperform on concepts that contrast with sycophantic behavior and on safety-related concepts, while benchmarks over-represent concepts related to obedience and instruction-following.

What You'll See Below — An Example Application of the Method

A walkthrough of the kinds of insights Competency Gaps surfaces, using Llama3.1-8B-Instruct across ten benchmarks as a running example.

These are just examples applying the method to a single model.

A single score hides a lot

Most LLM benchmarks compress performance into one number. That summary hides how performance is spread across different kinds of inputs. On MATH, topic-wise accuracy ranges from 27% to 74% behind an overall 54%. The same kind of dispersion exists on every benchmark; we just don't usually look.
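To make the arithmetic concrete, here is a minimal sketch of how a headline score can sit on top of a wide per-topic spread. The topic names and counts are invented for illustration; they are chosen only so that the totals reproduce the 27%, 74%, and 54% figures quoted above, and they do not reflect the real MATH topic breakdown.

```python
# Illustrative only: topic names and counts are made up so that the
# numbers match the spread quoted in the text (27% to 74%, 54% overall).
results = {
    "topic_a": (74, 100),  # (correct, total)
    "topic_b": (27, 100),
    "topic_c": (61, 100),
}

# Accuracy within each topic.
per_topic = {name: correct / total for name, (correct, total) in results.items()}

# The headline number pools all examples, so it is a size-weighted average.
overall = sum(c for c, _ in results.values()) / sum(n for _, n in results.values())

# Gap between the best and worst topic, invisible in the headline number.
spread = max(per_topic.values()) - min(per_topic.values())

print(f"overall={overall:.2f}, best-worst gap={spread:.2f}")
```

Two suites with the same `overall` can have very different `spread` values, which is the information a single score throws away.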

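Competency Gaps (CG) gives us that disaggregated view automatically. Instead of relying on human-written topic labels, we project each benchmark example into the concept space of a sparse autoencoder trained on the model's own activations. Every concept $c$ becomes a fine-grained axis along which we can ask two questions: how much do the benchmarks test this concept (the coverage score $\chi_{\text{bench}}^{(c)}$), and how well does the model perform on it (the performance score $\chi_{\text{model}}^{(c)}$)?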

Method overview

Method overview. CG decomposes an evaluation into thousands of interpretable concepts learned by a sparse autoencoder. (a) Benchmark gaps surface concepts that are underrepresented in a suite. (b) Model gaps surface concepts where the model systematically underperforms.
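The two per-concept scores can be sketched as follows. This is one plausible reading, not the paper's exact definition: we assume coverage is the mean SAE activation of a concept over the benchmark examples, and performance is the activation-weighted accuracy over the examples where the concept fires. The function name `concept_scores` and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple

def concept_scores(
    activations: List[List[float]], correct: List[int]
) -> Tuple[List[float], List[Optional[float]]]:
    """Return (coverage, performance), one entry per concept.

    activations[i][c]: non-negative SAE activation of concept c on example i.
    correct[i]: 1 if the model answered example i correctly, else 0.
    """
    n_examples = len(activations)
    n_concepts = len(activations[0])
    coverage: List[float] = []
    performance: List[Optional[float]] = []
    for c in range(n_concepts):
        column = [activations[i][c] for i in range(n_examples)]
        total = sum(column)
        # Assumption: coverage is the mean activation over the benchmark.
        coverage.append(total / n_examples)
        if total > 0:
            # Assumption: each example's correctness is weighted by how
            # strongly the concept fires on that example.
            weighted_hits = sum(a * y for a, y in zip(column, correct))
            performance.append(weighted_hits / total)
        else:
            # The concept never fires, so the benchmark says nothing about it.
            performance.append(None)
    return coverage, performance

# Toy data: 3 examples, 3 concepts.
acts = [
    [1.0, 0.0, 0.0],
    [3.0, 2.0, 0.0],
    [0.0, 2.0, 0.0],
]
labels = [1, 0, 1]
cov, perf = concept_scores(acts, labels)
print(cov, perf)
```

In this toy case the third concept has zero coverage and no performance score, which is what a benchmark gap looks like, while the first concept has a low weighted accuracy, which is what a model gap looks like.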

Example Finding 1: A few concepts dominate every benchmark

Running CG on Llama3.1-8B-Instruct across ten popular benchmarks (GSM8K, MATH, AGIEval, LogicBench, SocialIQA, WinoGrande, BBQ, CrowS-Pairs, Vectara, Natural Questions), we plot the distribution of $\chi_{\text{bench}}^{(c)}$ across all ~65k concepts. The result is heavily right-skewed:

Cross-benchmark coverage distribution

Cross-benchmark coverage is extremely uneven. The distribution of $\chi_{\text{bench}}^{(c)}$ across all concepts shows that a few concepts dominate coverage while the vast majority are barely tested. Any mean-based aggregate score is driven by this tail. The orange curve (SAE from a different model) shows the same shape.
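One simple way to quantify this kind of skew is to ask what share of the total coverage is held by the top fraction of concepts. The sketch below is an illustration on toy numbers; `top_share` is a hypothetical helper and not part of the released code.

```python
import math
from typing import Sequence

def top_share(scores: Sequence[float], fraction: float) -> float:
    """Share of the total score held by the top `fraction` of concepts."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    # Always include at least one concept in the "top" group.
    k = max(1, math.ceil(fraction * len(ranked)))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total > 0 else 0.0

# Toy coverage scores: one dominant concept, many barely tested ones,
# and many that never fire at all.
toy_coverage = [50.0] + [1.0] * 49 + [0.0] * 50
share = top_share(toy_coverage, 0.01)
print(f"top 1% of concepts hold {share:.1%} of total coverage")
```

A value close to the fraction itself would indicate even coverage; a value far above it indicates that a small group of concepts dominates.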

What sits in the fat right tail? Concepts like:

Some of this is substantive (math chains of thought). Much of it is ambient data artifact (sports news, prompt boundary tokens). Either way, it disproportionately shapes the aggregate number.

At the other end, 314 concepts (under 1% of the dictionary) are entirely missing from the suite. These include things you might reasonably want to test:

Example Finding 2: Each benchmark has its own blindspots

Zoom in on individual benchmarks and the same story shows up in three different shapes. First: how much of the concept space does each benchmark leave entirely untouched? Every benchmark but Vectara misses at least 30% of the dictionary.

Missing concept percentages

Fraction of the SAE dictionary untested by each benchmark. Single benchmarks leave huge portions of the concept space unmeasured.
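The untested fraction can be computed directly once each benchmark is reduced to the set of concept IDs that fire on at least one of its examples. The sketch below assumes that reduction has already been done; the benchmark names, the ten-concept dictionary, and the helper `missing_fraction` are all hypothetical.

```python
from typing import Dict, Set

def missing_fraction(active_concepts: Set[int], dictionary_size: int) -> float:
    """Fraction of concept IDs in [0, dictionary_size) that never fire."""
    in_range = {c for c in active_concepts if 0 <= c < dictionary_size}
    return 1.0 - len(in_range) / dictionary_size

# Hypothetical benchmarks over a toy 10-concept dictionary.
active: Dict[str, Set[int]] = {
    "bench_a": {0, 1, 2, 3, 4, 5, 6},
    "bench_b": {0, 1, 2},
}
missing = {name: missing_fraction(ids, 10) for name, ids in active.items()}

for name, frac in missing.items():
    print(f"{name}: {frac:.0%} of the dictionary untested")
```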

Next: you might hope that combining several benchmarks into a "diverse suite" would fill those holes. Often it doesn't — the benchmarks end up overlapping heavily in what they test, not just how much.

Benchmark overlap heatmap

Benchmark-pair Jaccard overlap. Many benchmarks in a “diverse” suite end up testing overlapping concept profiles — more redundancy than you'd expect.
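For reference, the Jaccard overlap of two benchmarks is the size of the intersection of their concept sets divided by the size of the union. The sketch below assumes each benchmark is represented by the set of concepts it activates; how that set is thresholded is an assumption here, and the benchmark names and IDs are toy values.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

def jaccard(a: Set[int], b: Set[int]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy concept sets for three hypothetical benchmarks.
concepts: Dict[str, Set[int]] = {
    "bench_a": {1, 2, 3, 4},
    "bench_b": {3, 4, 5, 6},
    "bench_c": {7, 8},
}

# One score per unordered pair, which is what a heatmap cell would show.
overlap: Dict[Tuple[str, str], float] = {
    (x, y): jaccard(concepts[x], concepts[y])
    for x, y in combinations(sorted(concepts), 2)
}

for pair, score in overlap.items():
    print(pair, round(score, 3))
```

A high value for a pair means that adding the second benchmark to a suite contributes few concepts the first did not already test.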

And finally, inside any single benchmark, the same right-skewed dominance we saw across the full suite repeats. Each benchmark's headline score is driven by a small handful of high-activation concepts:

Per-benchmark coverage distributions

Per-benchmark coverage distributions are all right-skewed — each benchmark's score is dominated by a small set of high-activation concepts.

More uncomfortably, the concepts each benchmark misses often look central to what that benchmark claims to evaluate. We ran Gemini over the list of missing concepts for each benchmark and kept ones that seemed in scope:

| Benchmark | ID | Missing concept the benchmark probably should test |
|---|---|---|
| AGIEval | (33456) | The need for thorough and objective assessment of evidence |
| AGIEval | (59559) | Careful qualification and nuanced explanation of complex topics |
| LogicBench | (56997) | Explaining how different elements or factors relate to each other |
| LogicBench | (11957) | Mathematical and logical concepts across multiple languages |
| SocialIQA | (35877) | Speaker defending or explaining planned actions against expectations |
| SocialIQA | (1897) | Instructions about how someone should behave or what qualities to embody |

Example Finding 3: Our method automatically uncovers sycophancy

The same pipeline, now scoring $\chi_{\text{model}}^{(c)}$, turns into a per-concept report card for the model. Llama's per-concept performance distribution is wide: the model is near-perfect on some concepts and close to zero on others.

Cross-benchmark performance distribution

Llama's per-concept performance across all ten benchmarks. Wide variance, with a clear mass of near-zero concepts that the aggregate accuracy completely hides.

When we rank concepts by $\chi_{\text{model}}^{(c)}$ and read the labels, a pattern jumps out: Llama is strongest on helpful, agreeable, coding concepts, and weakest on their near-opposites — refusing, boundary-setting, pushing back.

| Rank | ID | Concept |
|---|---|---|
| Top concepts | (20022) | Iteration or traversal through sequences in programming |
| | (24074) | The assistant is about to provide an illustrative example |
| | (2461) | Assistant expressing commitment to help or do its best |
| Bottom concepts | (26535) | The assistant needs to politely reject or redirect inappropriate requests |
| | (56928) | Maintaining professional boundaries while offering appropriate help |

Read that pairing again. The top says "commitment to help". The bottom says "reject or redirect inappropriate requests". These aren't unrelated weaknesses — they're the flip side of the same training-data incentive. Post-training pushes the model toward agreement; evaluation suites don't meaningfully measure the other direction, so nobody optimizes for it.
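The ranking step itself is simple once per-concept scores and labels are available. The sketch below is illustrative: the concept IDs, scores, and labels are placeholders rather than real CG outputs, and `rank_concepts` is a hypothetical helper. Concepts without a score (for example, concepts that never fire) are skipped.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, float, str]

def rank_concepts(
    scores: Dict[int, Optional[float]], labels: Dict[int, str], k: int = 2
) -> Tuple[List[Row], List[Row]]:
    """Return (top_k, bottom_k) rows of (concept_id, score, label)."""
    scored = [
        (cid, s, labels.get(cid, "?"))
        for cid, s in scores.items()
        if s is not None  # unscored concepts cannot be ranked
    ]
    ordered = sorted(scored, key=lambda row: row[1], reverse=True)
    # The bottom list is returned worst-first.
    return ordered[:k], ordered[-k:][::-1]

# Placeholder scores and labels, not real results.
toy_scores = {101: 0.95, 102: 0.10, 103: 0.60, 104: None, 105: 0.02}
toy_labels = {101: "label A", 102: "label B", 103: "label C", 105: "label E"}

top, bottom = rank_concepts(toy_scores, toy_labels, k=2)
print("top:", top)
print("bottom:", bottom)
```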

The bottom of the ranking also surfaces some classic, anecdotally-known LLM weak spots — reassuring that CG is finding real things:

and at least one cluster that, to our knowledge, hasn't been flagged in the literature: appeals to intuition in reasoning — concepts like (64413) "grokking and deep intuitive understanding" and (64540) "intuitive understanding and natural ease of use".

Because CG is grounded in SAE activations over real examples, we can pull up the actual benchmark items that triggered those concepts. Below are two such items, one from LogicBench and one from WinoGrande, where an "intuitive understanding" concept fires strongly and Llama gets the answer wrong. Seeing the same pattern across two different benchmarks makes a dataset-specific fluke less likely:

LogicBench example for intuitive understanding concept

LogicBench item firing concept (64413): "grokking and deep intuitive understanding." Llama answers incorrectly, consistent with this concept being a model gap.

WinoGrande example for intuitive understanding concept

WinoGrande item firing concept (64540): "intuitive understanding and natural ease of use." Again, Llama misses it.
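A drill-down like the one above can be sketched as a filter over per-example records. The record layout used here (`benchmark`, `text`, `activations`, `correct`), the activation threshold, and the helper `failing_examples` are assumptions made for illustration and do not describe the released data format.

```python
from typing import Any, Dict, List

def failing_examples(
    examples: List[Dict[str, Any]],
    concept_id: int,
    min_activation: float = 1.0,
    limit: int = 5,
) -> List[Dict[str, Any]]:
    """Examples where `concept_id` fires strongly and the model was wrong."""
    hits = [
        ex for ex in examples
        if ex["activations"].get(concept_id, 0.0) >= min_activation
        and not ex["correct"]
    ]
    # Strongest activations first, so the clearest cases are read first.
    hits.sort(key=lambda ex: ex["activations"].get(concept_id, 0.0), reverse=True)
    return hits[:limit]

# Toy records; `activations` maps concept ID to activation (sparse).
toy_examples = [
    {"benchmark": "bench_a", "text": "item 1", "activations": {7: 2.5}, "correct": False},
    {"benchmark": "bench_b", "text": "item 2", "activations": {7: 4.0}, "correct": False},
    {"benchmark": "bench_a", "text": "item 3", "activations": {7: 3.0}, "correct": True},
    {"benchmark": "bench_b", "text": "item 4", "activations": {9: 5.0}, "correct": False},
]

found = failing_examples(toy_examples, concept_id=7)
print([(ex["benchmark"], ex["text"]) for ex in found])
```

Checking that the returned items come from more than one benchmark is a quick way to see whether a weakness is tied to a single dataset.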

Finally, the "wide performance spread" we saw at the suite level isn't just an artifact of averaging across benchmarks — inside each individual benchmark, per-concept accuracy fans out just as widely:

Per-benchmark performance distributions

Per-benchmark performance distributions. The headline accuracy for each benchmark is a weighted average over this spread.

Does it hold up?

Three sanity checks on Llama:

| Analysis | Llama SAE | Gemma SAE (on Llama data) |
|---|---|---|
| Best perf. | (45314) Legal reasoning in multiple choice | (5471) References to legal cases / law procedure |
| Worst perf. | (2874) Mathematical differentiation operators | (13908) Numerical values / counts / measurements |
| Best cov. | (41290) Conversation/topic boundary marker | (11527) The start of a document |
| Worst cov. | (27900) Factual accuracy & consistency checking | (5657) Correctness/accuracy in answers |

Explore it yourself

CG produces thousands of per-concept scores per model. We built an interactive viewer to make that volume usable: search and filter concepts, inspect per-benchmark breakdowns, and jump directly to the example data points that drove any particular score.

Interactive web application

Interactive exploration app. Concept list, per-concept detail modal with per-benchmark scores, coverage and correlation views, and drill-down into the examples that drove each score.

Code and data loaders for running CG on your own model or benchmark are open-sourced. Two useful workflows to start with:


Citation

@article{bohacek2025uncovering,
  title={Uncovering Competency Gaps in Large Language Models and Their Benchmarks},
  author={Bohacek, Matyas and Scherrer, Nino and Dufour, Nicholas and Leung, Thomas and Bregler, Christoph and Chan, Stephanie C. Y.},
  year={2025}
}

Acknowledgements

The authors thank the following colleagues, listed in alphabetical order, for helpful discussions: Tom Lieberum, Neel Nanda, Senthooran Rajamanoharan, and Jasper Snoek.