Yug-Oswal/reward-hacking

These lightning talk slides explain this work well:
Slides

We were selected as one of the top 10 out of 90+ teams in SPAR to present a lightning talk to leading AI safety researchers and organizations such as UK AISI, Constellation, and BlueDot Impact.

About

Validating the hypothesis that reward hacking can be induced in LLMs via steering
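To illustrate the general idea of steering (not this repository's actual implementation): activation steering typically adds a scaled steering vector to a layer's hidden activations during the forward pass. The sketch below is a toy stand-in with hypothetical names, assuming the common additive formulation.

```python
# Toy sketch of additive activation steering. A real setup would hook a
# transformer layer (e.g. via PyTorch forward hooks); here the "layer"
# is just a list of activation values, purely for illustration.

def steer(hidden, steering_vec=None, alpha=1.0):
    """Add a scaled steering vector to one layer's activations."""
    if steering_vec is None:
        return hidden
    return [h + alpha * s for h, s in zip(hidden, steering_vec)]

base = [0.5, -1.0, 2.0]
steered = steer(base, steering_vec=[1.0, 0.0, -1.0], alpha=0.5)
print(steered)  # [1.0, -1.0, 1.5]
```

The steering vector itself is often derived from a difference of mean activations between contrasting prompt sets; the scale `alpha` controls how strongly the behavior is pushed.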
