Post-training Mechanistic Analysis

Code for the COLM 2025 paper: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence


Setting Up

To run the code base, nnsight and TransformerLens need to be installed.

To install nnsight, run pip install nnsight.
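
A minimal sketch to verify the nnsight install, assuming the standard LanguageModel tracing API (the model, prompt, and layer choice below are illustrative, not taken from this repository):

    # Minimal nnsight smoke test (model, prompt, and layer are illustrative).
    from nnsight import LanguageModel

    model = LanguageModel("openai-community/gpt2", device_map="auto")

    # Trace one forward pass and save the output of the first transformer block.
    with model.trace("The Eiffel Tower is in"):
        block_out = model.transformer.h[0].output[0].save()

    # Depending on the nnsight version, the saved proxy may need `.value`.
    print(block_out.shape)  # (batch, seq_len, d_model)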

To install TransformerLens, we recommend the version bundled in this code base: run pip install -e ./TransformerLens.
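
A minimal sketch to verify the TransformerLens install (the model and prompt are illustrative choices, not from this repository):

    # Minimal TransformerLens smoke test.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    # Run one forward pass and cache all intermediate activations.
    logits, cache = model.run_with_cache("The Eiffel Tower is in")

    print(logits.shape)                             # (batch, seq_len, d_vocab)
    print(cache["blocks.0.hook_resid_post"].shape)  # (batch, seq_len, d_model)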

To run experiments with models not currently supported by TransformerLens, modify ./TransformerLens/transformer_lens/loading_from_pretrained.py.
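
As a rough guide, new Hugging Face model names are registered in the OFFICIAL_MODEL_NAMES list in that file; the entry below is a hypothetical example. Architectures that differ from ones TransformerLens already supports also need a matching branch in convert_hf_model_config in the same file:

    # In ./TransformerLens/transformer_lens/loading_from_pretrained.py:
    OFFICIAL_MODEL_NAMES = [
        # ... existing entries ...
        "your-org/your-post-trained-model",  # hypothetical new entry
    ]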

Running Experiments

The knowledge and internal-belief-of-truthfulness experiments are under Knowledge+Truthfulness/, the refusal-direction experiments are under Refusal/, and the entropy-neuron experiments are under Confidence/. See the README in each directory for instructions on running the code.

Citation

If you find our work helpful, please consider citing our paper:

@misc{du2025posttraining,
    title={How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence},
    author={Hongzhe Du and Weikai Li and Min Cai and Karim Saraipour and Zimin Zhang and Himabindu Lakkaraju and Yizhou Sun and Shichang Zhang},
    year={2025},
    eprint={2504.02904},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
