Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Matyas Bohacek 1,2, Nino Scherrer 2, Nicholas Dufour 2, Thomas Leung 2, Christoph Bregler 2, and Stephanie C. Y. Chan 2

1 Stanford University    2 Google DeepMind

We propose Competency Gaps (CG), a method that uses sparse autoencoders (SAEs) to automatically uncover two types of gaps in LLM evaluation: benchmark gaps—imbalanced coverage of concepts within benchmarks—and model gaps—areas where LLMs systematically underperform. CG extracts SAE concept activations and computes saliency-weighted performance scores across benchmark data. Applied to Gemma2-2B and Llama3.1-8B across ten benchmarks, our analysis reveals that models consistently underperform on concepts contrasting with sycophantic behaviors and safety-related concepts, while benchmarks over-represent concepts related to obedience and instruction-following.

Paper    Code    Demo   
Key Takeaways

Uniform aggregation of benchmark scores into single metrics obscures important sub-trends and model weaknesses. For example, the MATH benchmark shows topic-wise accuracy scores ranging from 27% to 74% despite an overall score of 54%. Current semantic annotations of benchmark data are coarse-grained, manually curated, and difficult to scale.
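The masking effect above can be reproduced with a toy calculation. The per-topic numbers below are hypothetical (chosen to mirror the MATH example, assuming three equally sized topics), not the benchmark's actual breakdown:

```python
# Hypothetical per-topic accuracies on three equally sized topics, illustrating
# how averaging into a single score hides a wide topic-level spread.
topic_acc = {"algebra": 0.74, "geometry": 0.55, "number_theory": 0.27}
topic_n = {"algebra": 500, "geometry": 500, "number_theory": 500}

overall = sum(topic_acc[t] * topic_n[t] for t in topic_acc) / sum(topic_n.values())
spread = max(topic_acc.values()) - min(topic_acc.values())
print(f"overall accuracy: {overall:.2f}")   # one aggregate number
print(f"topic-level spread: {spread:.2f}")  # gap between best and worst topic
```

The aggregate lands in the low 50s while the best and worst topics differ by nearly 50 points, which is exactly the sub-trend a single metric obscures.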

To address this, we propose Competency Gaps (CG), a method that leverages sparse autoencoders (SAEs) to systematically uncover gaps in both benchmarks and models. Rather than relying on human-defined categories, CG extracts concepts directly from the model's internal representations, providing a granular, unsupervised view of evaluation coverage and model performance.
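The concept-extraction step can be sketched as a standard SAE encoder applied to a model's hidden activations. This is a minimal ReLU-SAE sketch; the actual SAE architecture and the encoder parameters `W_enc` and `b_enc` are assumptions, not the paper's specification:

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Map a hidden activation vector h to sparse, nonnegative concept activations.

    Minimal ReLU-SAE encoder sketch; the paper's exact SAE architecture
    may differ (e.g., in activation function or normalization).
    """
    return np.maximum(0.0, h @ W_enc + b_enc)

# Toy example: 2-dim hidden state, 3-concept dictionary.
h = np.array([1.0, -0.5])
W_enc = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
b_enc = np.zeros(3)
acts = sae_encode(h, W_enc, b_enc)  # most entries zero: a sparse concept code
```

Each nonzero entry of `acts` marks a dictionary concept the model's internal state expresses on this input, which is what replaces human-defined categories.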

Method

Our method introduces metrics for both benchmark and model evaluation grounded in SAE concept activations. For each concept $c$ in the SAE dictionary, we compute two quantities: a coverage score capturing how strongly $c$ is activated across a benchmark's examples, and a saliency-weighted performance score capturing how well the model handles the examples that activate $c$.

Applied together, these metrics reveal which concepts benchmarks fail to evaluate (benchmark gaps) and which concepts models fail to handle correctly (model gaps).
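A minimal sketch of a per-concept saliency-weighted performance score, assuming per-example SAE activations and binary correctness labels are available (the weighting below is one plausible choice, not the paper's exact formula):

```python
import numpy as np

def concept_scores(activations, correct):
    """Per-concept saliency and saliency-weighted performance.

    activations: (n_examples, n_concepts) nonnegative SAE concept activations
    correct: (n_examples,) 1.0 if the model answered the example correctly, else 0.0
    """
    acts = np.asarray(activations, dtype=float)
    ok = np.asarray(correct, dtype=float)
    saliency = acts.sum(axis=0)       # total activation mass per concept (coverage proxy)
    weighted = acts.T @ ok            # activation mass on correctly answered examples
    # Saliency-weighted accuracy; NaN for concepts that never fire.
    perf = np.divide(weighted, saliency,
                     out=np.full_like(saliency, np.nan),
                     where=saliency > 0)
    return saliency, perf
```

Concepts with low saliency across the suite point to benchmark gaps; concepts with low `perf` despite healthy saliency point to model gaps.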

Experimental Setup

We analyze two popular open-source instruction-tuned models, Gemma2-2B and Llama3.1-8B, across ten benchmarks spanning five evaluation categories:

Benchmark Gaps

Cross-benchmark concept coverage exhibits a strong left skew: most concepts have low coverage across the benchmark suite. Our analysis reveals:
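One simple way to quantify cross-benchmark coverage is the fraction of benchmarks in which a concept fires at all. The sketch below assumes a list of per-benchmark activation matrices, which is an illustrative data layout rather than the paper's exact format:

```python
import numpy as np

def concept_coverage(acts_per_benchmark):
    """Fraction of benchmarks in which each concept fires at least once.

    acts_per_benchmark: list of (n_examples, n_concepts) activation arrays,
    one array per benchmark.
    """
    present = np.stack([(np.asarray(a) > 0).any(axis=0) for a in acts_per_benchmark])
    return present.mean(axis=0)  # per-concept coverage in [0, 1]
```

Plotting a histogram of these coverage values would show the skew described above: most concepts cluster near zero coverage, with only a small set covered by many benchmarks.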

Model Gaps

The model gap analysis reveals a striking pattern in where LLMs excel and where they struggle:

Best-Performing Concepts

Worst-Performing Concepts

A key insight is that models excel at sycophantic behaviors but consistently struggle with their opposites: rejection, boundary-setting, and pushback. This points to a systematic bias in how models are trained and evaluated.

Robustness

We validate the robustness of our method through several analyses:

Interactive Exploration Tool

We release an open-source interactive web application for exploring competency gaps across concepts, benchmarks, and models. The tool features a searchable concept overview with expandable details, per-benchmark performance visualization, example data points showing high and low performance, and cross-benchmark correlation analysis.

To run the competency gaps analysis on your own model and generate the same visualizations, see our open-source codebase.

Downstream Applications

The CG method enables several practical workflows:

Citation

@article{bohacek2025uncovering,
  title={Uncovering Competency Gaps in Large Language Models and Their Benchmarks},
  author={Bohacek, Matyas and Scherrer, Nino and Dufour, Nicholas and Leung, Thomas and Bregler, Christoph and Chan, Stephanie C. Y.},
  year={2025}
}