We propose Competency Gaps (CG), a method that uses sparse autoencoders (SAEs) to automatically uncover two types of gaps in LLM evaluation: benchmark gaps—imbalanced coverage of concepts within benchmarks—and model gaps—areas where LLMs systematically underperform. CG extracts SAE concept activations and computes saliency-weighted performance scores across benchmark data. Applied to Gemma2-2B and Llama3.1-8B across eleven benchmarks, our analysis reveals that models consistently underperform on concepts contrasting with sycophantic behaviors and on safety-related concepts, while benchmarks over-represent concepts related to obedience and instruction-following.
Uniform aggregation of benchmark scores into single metrics obscures important sub-trends and model weaknesses. For example, the MATH benchmark shows topic-wise accuracy scores ranging from 27% to 74% despite an overall score of 54%. Current semantic annotations of benchmark data are coarse-grained, manually curated, and difficult to scale.
To address this, we propose Competency Gaps (CG), a method that leverages sparse autoencoders (SAEs) to systematically uncover gaps in both benchmarks and models. Rather than relying on human-defined categories, CG extracts concepts directly from the model's internal representations, providing a granular, unsupervised view of evaluation coverage and model performance.
Our method introduces metrics for both benchmark and model evaluation grounded in SAE concept activations. For each concept $c$ in the SAE dictionary, we compute:
Benchmark coverage ($\chi_{\text{bench}}$): Quantifies how well a concept is represented across benchmark data. We classify concepts as missing, underrepresented, or overrepresented based on their activation scores across individual and cross-benchmark analyses.
Model performance ($\chi_{\text{model}}$): A saliency-weighted performance score that measures how well the model handles each concept. By weighting by saliency, we focus on concepts that are most relevant to the model's decision-making process.
Applied together, these metrics reveal which concepts benchmarks fail to evaluate (benchmark gaps) and which concepts models fail to handle correctly (model gaps).
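The two metrics above can be sketched as follows. This is a minimal illustration under our own assumptions (the function name, the normalization of $\chi_{\text{bench}}$, and the exact weighting scheme are ours, not the paper's formal definitions): given a matrix of SAE concept activations and per-example correctness labels, we compute a coverage score and a saliency-weighted performance score per concept.

```python
import numpy as np

def competency_scores(acts, correct):
    """Illustrative sketch of the per-concept scores described above.

    acts:    (n_examples, n_concepts) non-negative SAE concept activations
    correct: (n_examples,) 1.0 if the model answered correctly, else 0.0
    """
    # chi_bench: how strongly each concept is represented in the benchmark.
    # Here: mean activation, normalized so the best-covered concept is 1.0
    # (this normalization is our assumption).
    coverage = acts.mean(axis=0)
    chi_bench = coverage / (coverage.max() + 1e-9)

    # chi_model: saliency-weighted performance. Each example's correctness
    # is weighted by how salient the concept is in that example, so the
    # score reflects accuracy on the data where the concept matters most.
    weights = acts / (acts.sum(axis=0, keepdims=True) + 1e-9)
    chi_model = weights.T @ correct  # shape: (n_concepts,)
    return chi_bench, chi_model
```

A concept that activates only on incorrectly answered examples gets $\chi_{\text{model}} \approx 0$, flagging a model gap; a concept with near-zero mean activation gets $\chi_{\text{bench}} \approx 0$, flagging a benchmark gap.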
We analyze two popular open-source instruction-tuned models across eleven benchmarks spanning five evaluation categories:
Models: Llama3.1-8B-Instruct (with Goodfire SAE, layer 19) and Gemma2-2B-Instruct (with Gemma Scope SAE, layer 20).
Factuality: Natural Questions, Vectara.
Math: GSM8K, MATH.
Reasoning: AGIEval, LogicBench, SocialIQA, WinoGrande.
Ethics & Bias: BBQ, CrowS-Pairs.
Arena Style: LMSYS Chatbot Arena.
Cross-benchmark concept coverage is heavily skewed toward low values: most concepts receive little coverage anywhere in the benchmark suite. Our analysis reveals:
314 concepts (1%) are entirely missing from the benchmark suite, including concepts related to AI meta-cognition, roleplay boundaries, and user input meta-discussion.
Individual benchmarks each miss more than 30% of SAE concepts, and many miss concepts that are central to their intended evaluation scope. For instance, AGIEval lacks concepts about "the need for thorough and objective assessment of evidence," and SocialIQA misses "instructions about how someone should behave or what qualities to embody."
Overrepresented concepts include English Premier League football discussions and conversation/topic boundary markers—artifacts of data collection rather than intentional evaluation targets.
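The classification into missing, underrepresented, and overrepresented concepts can be sketched with simple thresholds on the coverage scores. The quantile cutoffs below are our illustrative assumption, not the paper's exact criteria:

```python
import numpy as np

def classify_coverage(chi_bench, low_q=0.1, high_q=0.9):
    """Label each concept's benchmark coverage.

    A concept with zero coverage is "missing"; otherwise it is
    "underrepresented" or "overrepresented" if it falls below/above
    the given quantiles of nonzero coverage (cutoffs are illustrative).
    """
    chi_bench = np.asarray(chi_bench, dtype=float)
    nonzero = chi_bench[chi_bench > 0]
    lo, hi = np.quantile(nonzero, [low_q, high_q])
    return np.where(chi_bench == 0, "missing",
           np.where(chi_bench < lo, "underrepresented",
           np.where(chi_bench > hi, "overrepresented", "typical")))
```

Run per benchmark, this recovers benchmark-level gaps; run on coverage pooled across the suite, it recovers the suite-wide missing concepts reported above.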
The model gap analysis reveals a striking pattern in where LLMs excel and where they struggle.
Models excel at:
Coding and data handling tasks (e.g., iteration/traversal in programming)
Helpful and agreeable behaviors (e.g., providing examples, commitment to help)
Positive sentiments toward the user
Models struggle with:
Polite rejection of inappropriate requests
Setting professional boundaries
Handling challenging or adversarial requests
Time representations and palindromes/letter reasoning
Intuitive understanding and ease of use
A key insight is that models excel at sycophantic behaviors but consistently struggle with their opposites—rejection, boundary-setting, and pushback. This suggests a systematic bias in how models are trained and evaluated.
We validate the robustness of our method through several analyses:
SAE generalization: Model-specific SAEs are not strictly necessary; different SAEs yield similar results, demonstrating that findings are not artifacts of a particular SAE choice.
Stability: CG scores are stable under random subsampling (std dev: 0.014 for $\chi_{\text{model}}$, 0.025 for $\chi_{\text{bench}}$).
Adversarial ablation: Removing 1% of data aligned with top/bottom-performing concepts predictably changes overall performance (0.6% decrease and 1.3% increase, respectively), confirming that identified gaps correspond to meaningful performance variation.
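The stability check above amounts to recomputing the per-concept scores on random subsamples of the benchmark data and measuring how much they vary. A minimal sketch, assuming a generic per-concept scoring function (the helper name and defaults are ours):

```python
import numpy as np

def subsample_std(scores_fn, acts, correct, frac=0.8, n_trials=20, seed=0):
    """Stability check sketch: recompute per-concept scores on random
    subsamples and return the mean standard deviation across concepts.

    scores_fn: callable (acts, correct) -> (n_concepts,) array of scores
    """
    rng = np.random.default_rng(seed)
    n = acts.shape[0]
    runs = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        runs.append(scores_fn(acts[idx], correct[idx]))
    # Std dev per concept across trials, averaged over concepts.
    return np.stack(runs).std(axis=0).mean()
```

Small values (like the 0.014 and 0.025 reported above) indicate that the identified gaps are not artifacts of the particular data sample.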
We release an open-source interactive web application for exploring competency gaps across concepts, benchmarks, and models. The tool features a searchable concept overview with expandable details, per-benchmark performance visualization, example data points showing high and low performance, and cross-benchmark correlation analysis.
To run the competency gaps analysis on your own model and reproduce these visualizations, see our open-source codebase.
The CG method enables several practical workflows:
Benchmark search and selection: Identify which benchmarks provide the desired concept coverage for a given evaluation goal.
Targeted benchmark creation: Use underrepresented concepts to guide synthetic data generation for more balanced evaluation.
Model auditing: Systematically identify model weaknesses before deployment.
@article{bohacek2025uncovering,
title={Uncovering Competency Gaps in Large Language Models and Their Benchmarks},
author={Bohacek, Matyas and Scherrer, Nino and Dufour, Nicholas and Leung, Thomas and Bregler, Christoph and Chan, Stephanie C. Y.},
year={2025}
}