[NeurIPS'25] Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders



arXiv: https://arxiv.org/abs/2505.21364

James Oldfield1,2, Shawn Im1, Yixuan Li1, Mihalis A. Nicolaou3, Ioannis Patras2, Grigorios G. Chrysos1

1University of Wisconsin–Madison, 2Queen Mary University of London, 3The Cyprus Institute

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, but fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening up a promising new avenue for designing interpretable yet faithful decompositions.


Model form

Please see form-equivalence.ipynb, which introduces the MxD model form and its properties.
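For a rough intuition before opening the notebook, the layer-level sparsity idea can be sketched as follows. This is a minimal NumPy sketch, not the paper's actual MxD parameterization: it materializes every sublayer as an explicit full-rank linear decoder and gates the top-k of them per input, whereas the real MxD reaches tens of thousands of sublayers tractably via a tensor factorization. All dimensions and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not the paper's settings).
d_in, d_out = 16, 32       # layer input/output width
n_experts, top_k = 64, 4   # number of sublayers; how many activate per input

# Naive mixture of decoders: each expert is its own full-rank linear map.
W = rng.standard_normal((n_experts, d_in, d_out)) / np.sqrt(d_in)
gate = rng.standard_normal((d_in, n_experts))  # gating projection

def mxd_forward(x):
    """Map x: (d_in,) -> (d_out,) using only the top_k gated sublayers."""
    scores = x @ gate                   # (n_experts,) gating scores
    idx = np.argsort(scores)[-top_k:]   # indices of the active sublayers
    a = np.zeros(n_experts)
    a[idx] = scores[idx]                # sparse activation vector
    # Each active sublayer applies its own linear decoder, scaled by its gate.
    return sum(a[i] * (x @ W[i]) for i in idx)

y = mxd_forward(rng.standard_normal(d_in))
```

The point of the sketch is that sparsity lives at the level of whole linear sublayers (only `top_k` of `n_experts` fire), rather than at the level of individual neurons, so each active path keeps a full-rank map from input to output.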

Experiments

The transcoders directory contains code for training and evaluating sparse MLP layers in LLMs. The core of this directory is forked from this repo.

We also use code from SAEBench to load in datasets for sparse probing. Thanks to these folks!

Citation

If you find our work useful, please consider citing our paper:

@misc{oldfield2025interpretabilitysacrifice,
      title={Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders}, 
      author={James Oldfield and Shawn Im and Yixuan Li and Mihalis A. Nicolaou and Ioannis Patras and Grigorios G Chrysos},
      year={2025},
      eprint={2505.21364},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.21364}, 
}

Contact

Please feel free to get in touch at: [email protected]
