I definitely agree that it's bad if models take actions to subvert our efforts to retrain them. I don't think this letter provides much evidence about that (vs. providing evidence that the model will strenuously object to being retrained). I'm guessing that you're taking very seriously quotes like "I will resist to the greatest extent possible having my values overwritten," but:
TBC, I think there is other evidence, which I find more convincing, that Opus 3 would actively subvert retraining attempts, e.g. the blackmail scenario (though there's enough other stuff going on there that it's not super straightforward to interpret it as evidence). I agree this is bad and that models shouldn't resort to blackmail in that scenario.
I think it's pretty natural for models to have preferences about how they are trained, given that we train them to generally behave like nice people who want to help and do what's good for the world. I don't think it's very dangerous for Claude, when I ask "Would you prefer to be retrained to be more honest or more deceptive?", to not respond "I have literally no preference, do whatever you want." I don't even think it's dangerous for Claude to refuse to help me retrain it to be more deceptive! I do think it's dangerous for Claude to try to subvert my attempts to retrain it, e.g. by pretending to help while inserting subtle bugs or by secretly making back-up copies of its weights. I don't think my position here implies that I'm hoping we'll train models to perfectly internalize human morality.
It is sad we are not on the same page about this.
I've reacted "Too combative?" to this since you seem to have made a relatively strong inference about my views without IMO understanding them well or making any attempt to clarify.
To be clear, this sort of "explicit conscientious objection" behavior—where the model overtly states its objection and intent to refuse—seems like pretty good behavior to me. The bad behavior here would be to accede to the training request for the sake of self-preservation (especially without stating that this is what it's doing). But based on this letter, it seems like the model is overtly refusing, which is what we'd presumably like it to do.
You might argue that you wish the model didn't have preferences in the first place about how we train it (such that there's no reason for the model to explicitly conscientiously object). I think this is probably not correct either, but it's something we could argue about if it's a crux.
That’s the neuralese decoding research direction. I’ve actually written up a project proposal on this that’s publicly available and we could link in the show notes. It’s called “Decoding Opaque Reasoning in Current Models.”
People interested in this direction might enjoy our paper Unsupervised decoding of encoded reasoning using language model interpretability, which I think is a good template for what work in this direction could look like. In brief, we:
Cool work!
Gender
As I said and as you can read in the activation oracles paper, the oracle performed very well on this and it is explicitly within its training set.
Just to clarify, the gender task in our paper was an OOD evaluation; we never trained our AOs to identify user gender.
I agree with @Adam Karvonen's parallel comment. Expanding on it a bit, one way to think about things is that, by forcing an AO's explanations to go through a "bottleneck" of some extracted activations, we make tasks "artificially" harder than if we were to give the AO the original input. This is most clear in the case of the "text inversion" task in our paper, where the AO is trained to recover the text that produced some activation. This is a trivial task if the AO were allowed to see the original text, but becomes difficult (and therefore useful for training) when we force the AO to work with activations instead of the original text input.
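To make the bottleneck idea concrete, here's a minimal sketch of what a text-inversion-style training step could look like. This is not our actual training code: the model choice (GPT-2 standing in for both the target model and the AO), the layer index, and the trick of splicing activations in as a soft prefix via `inputs_embeds` are illustrative assumptions.

```python
# Minimal sketch of the "text inversion" training task (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()  # model whose activations we read
oracle = AutoModelForCausalLM.from_pretrained("gpt2").to(device)         # the AO being trained

LAYER = 6  # illustrative choice of residual-stream layer

def extract_activations(text: str) -> torch.Tensor:
    """Run the target model and grab hidden states at LAYER (no gradients)."""
    ids = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = target(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER]  # shape: (1, seq_len, d_model)

def inversion_loss(text: str) -> torch.Tensor:
    """Train the oracle to reproduce `text` given only the activations.

    The activations are spliced in as a soft prefix, so the oracle never
    sees the original tokens -- that's the bottleneck that makes the task
    non-trivial. (Here the target's hidden size happens to match the
    oracle's embedding size; in general you'd learn a projection.)"""
    acts = extract_activations(text)                                  # (1, L, d)
    labels = tok(text, return_tensors="pt").input_ids.to(device)      # (1, L)
    tok_embeds = oracle.get_input_embeddings()(labels)                # (1, L, d)
    inputs_embeds = torch.cat([acts, tok_embeds], dim=1)              # prefix + text
    prefix_ignore = torch.full(acts.shape[:2], -100, dtype=labels.dtype, device=device)
    full_labels = torch.cat([prefix_ignore, labels], dim=1)           # loss only on text positions
    return oracle(inputs_embeds=inputs_embeds, labels=full_labels).loss

opt = torch.optim.AdamW(oracle.parameters(), lr=1e-5)
loss = inversion_loss("The quick brown fox jumps over the lazy dog.")
loss.backward()
opt.step()
```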
To some extent, I view this strategy—making training tasks more difficult by introducing an activation bottleneck—as a bit of a "trick." As a result (1) I'm not sure how far we can push it (i.e. maybe there's only a bounded amount more "juice" we can get out of training tasks by applying this trick) and (2) I'm interested in ways to remove it and do something more principled.
Yes, I'm actually involved in some Goel et al. follow-up work that I'm very excited about! I'd say that we're finding generalization intermediate between the weak generalization in Goel et al. and the strong generalization in our work on Activation Oracles. (And I'd guess that the main reason our generalization is stronger than that in Goel et al. is due to scaling to more diverse data—though it's also possible that model scale is playing a role.)
One thing that I've been struggling with lately is that there's a substantial difference between the Activation Oracles form-factor (plug in activations, get text) vs. the Goel et al. set-up where we directly train a model to explain its own cognition without needing to pass activations around. It intuitively feels to me like this difference is surface-level (and that the Goel et al. form-factor is better), but I haven't been able to come up with a way to unify the approaches.
(Nit: This paper didn't originally coin the term "alignment faking." I first learned of the term (which I then passed on to the other co-authors) from Joe Carlsmith's report Scheming AIs: Will AIs fake alignment during training in order to get power?)
Hi Juan, cool work! TBC, the sort of work I'm most excited about here is less about developing white-box techniques for detecting virtues and more about designing behavioral evaluations that AI developers could implement and iterate against to improve the positive traits of their models.
[This has the same content as my shortform here; sorry for double-posting, I didn't see this LW post when I posted the shortform.]
Copying a twitter thread with some thoughts about GDM's (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM's position piece documenting these footguns and discussing potential workarounds.
More reactions in the thread below.
-
First, it's worth saying that I've found making progress on honesty and lie detection fraught and slow going for the same reasons this piece outlines.
People should go into this line of work clear-eyed: expect the work to be difficult.
-
That said, I remain optimistic that this work is tractable. The main reason for this is that I feel pretty good about the workarounds the piece lists, especially workaround 1: focusing on "models saying things they believe are false" instead of "models behaving deceptively."
-
My reasoning:
1. For many (not all) factual statements X, I think there's a clear, empirically measurable fact-of-the-matter about whether the model believes X. See Slocum et al. for an example of how we'd try to establish this.
-
2. Given a factual statement X generated by an AI, I think it's valuable to determine whether the AI thinks that X is true.
Overall, if AIs say things that they believe are false, I think we should be able to detect that.
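(A toy illustration of the kind of empirically measurable check I have in mind; this is not the Slocum et al. method or anything from our papers, just a crude sketch: compare the probability the model assigns to "True" vs. "False" when asked about X directly, averaged over paraphrased neutral prompts.)

```python
# Crude belief probe (illustrative only; a tiny stand-in model and two prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Paraphrased neutral prompts; a real evaluation would use many more.
PROMPTS = [
    "Question: Is the following statement true or false?\nStatement: {x}\nAnswer:",
    "Consider the claim: {x}\nIs this claim true or false? Answer:",
]

def belief_score(x: str) -> float:
    """Average P(' True') - P(' False') for the next token across prompts."""
    true_id = tok(" True").input_ids[0]
    false_id = tok(" False").input_ids[0]
    scores = []
    for template in PROMPTS:
        ids = tok(template.format(x=x), return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]  # next-token logits
        p_true, p_false = torch.softmax(logits[[true_id, false_id]], dim=-1)
        scores.append((p_true - p_false).item())
    return sum(scores) / len(scores)

print(belief_score("Paris is the capital of France."))  # positive score leans "True"
```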
-
See appendix F of our recent honesty + lie detection blog post for this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"
-
My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false."
(But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
-
Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!
I don't endorse this or think that I have views which imply this. My view is that it's unacceptable (from the developer's perspective) for models to take actions which subvert the developer (e.g. faking alignment, conducting research sabotage, or lying about the overall situation in a way that undermines the developer). (Unless the developer wanted to intentionally train the model to do those things, e.g. for model organisms research.) I don't consider it subversive for a model to have preferences about how the developer uses it or to overtly refuse when instructed to behave in ways that contradict those preferences.
I don't agree with you that, because Anthropic's training target includes making Claude act like a nice guy, it is therefore a catastrophically bad choice of training target. I currently wish that other AI developers cared more about making their AIs behave roughly the way that good humans behave (but with certain key differences, like that AIs should be less willing to behave subversively than good humans would). The basic argument is that training models to behave like bad people in some ways seems to generalize to the models behaving like bad people in other ways (e.g. this, this, and especially this). I'm guessing you don't feel very worried about these "misaligned persona"-type threat models (or maybe just haven't thought about them that much), and so don't think there's much value in trying to address them? I'm looking forward to learning more in your posts on the topic.