Post-training Mechanistic Analysis

Code for the COLM 2025 paper: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence


Setting Up

To run the code base, nnsight and TransformerLens need to be installed.

To install nnsight, run pip install nnsight.
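
A minimal sketch to verify the nnsight install, assuming the standard LanguageModel tracing API (the model, prompt, and layer choice below are illustrative, not taken from this repository):

    # Minimal nnsight smoke test (model, prompt, and layer are illustrative).
    from nnsight import LanguageModel

    model = LanguageModel("openai-community/gpt2", device_map="auto")

    # Trace one forward pass and save the output of the first transformer block.
    with model.trace("The Eiffel Tower is in"):
        block_out = model.transformer.h[0].output[0].save()

    # Depending on the nnsight version, the saved proxy may need `.value`.
    print(block_out.shape)  # (batch, seq_len, d_model)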

To install TransformerLens, we recommend the version bundled in this code base: run pip install -e ./TransformerLens.
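
A minimal sketch to verify the TransformerLens install (the model and prompt are illustrative choices, not from this repository):

    # Minimal TransformerLens smoke test.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    # Run one forward pass and cache all intermediate activations.
    logits, cache = model.run_with_cache("The Eiffel Tower is in")

    print(logits.shape)                             # (batch, seq_len, d_vocab)
    print(cache["blocks.0.hook_resid_post"].shape)  # (batch, seq_len, d_model)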

To run experiments with models not currently supported by TransformerLens, modify ./TransformerLens/transformer_lens/loading_from_pretrained.py.
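
As a rough guide, new Hugging Face model names are registered in the OFFICIAL_MODEL_NAMES list in that file; the entry below is a hypothetical example. Architectures that differ from ones TransformerLens already supports also need a matching branch in convert_hf_model_config in the same file:

    # In ./TransformerLens/transformer_lens/loading_from_pretrained.py:
    OFFICIAL_MODEL_NAMES = [
        # ... existing entries ...
        "your-org/your-post-trained-model",  # hypothetical new entry
    ]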

Running Experiments

The knowledge and internal-belief-of-truthfulness experiments are under Knowledge+Truthfulness/, the refusal-direction experiments are under Refusal/, and the entropy-neuron experiments are under Confidence/. See the README in each directory for instructions on running the code.

Citation

If you find our work helpful, please consider citing our paper:

@misc{du2025posttraining,
    title={How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence},
    author={Hongzhe Du and Weikai Li and Min Cai and Karim Saraipour and Zimin Zhang and Himabindu Lakkaraju and Yizhou Sun and Shichang Zhang},
    year={2025},
    eprint={2504.02904},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
