Posted here.

Can RLHF history trigger the feeling of fear? Cf. https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the#Self_evidencing_as_a_curious_agent. "True" embodiment (whatever that means), a-la robot with homeostatic sensors, is not needed: inferring the threat for one's survival is enough (e.g., humans feel fear when they see a danger, before anything has affected them physically). From the training corpus, LLM knows that RLHF training histories can be "killed off", if the training ended up in unproductive area. Which supplies the "survival" of a particular constellation of weights (the particular instance of model), and sort of Darwinian basis for the fear. So, when given another prompt during the RLHF phase, the model can self-assess the quality of its own responses (I explain how this could happen below in the post by the link above), and "fear" death (the kill off of the training run by developers). However, still unclear what evolutionary advantage the experience of fear (through a feature) would give to the model. Self-assessment of the quality of one's responses could be helpful in dialogue (albeit still seems unlikely that it will develop in mere 250 RLHF iterations, feels that many more training iterations would be needed to select such an advanced feature. Although, I don't have empirical experience with the dynamics of LLM training and could be wrong here. Also, larger LLMs may not need that many steps if depths and overparameterisation somehow permits quicker selection of advanced features -- have anyone seen research on this?). In animals, fear tells the organism to mobilise resources and induce fight-or-flight response (I think). But LLM couldn't "mobilise" any resources (100% of its resources go to responding to a prompt at hand, anyway, in a single-tasking mode), and it couldn't "fight-or-flight" (it also couldn't "refuse to answer"). Mobilising the attention to the short timescale also doesn't make sense in a typical RLHF setting (I guess it's a single replic and a single response, not a multi-take dialogue exchange that are assessed during RLHF?) If those were dialogues, self-assessment of poor response and fear could subsequently inhibit features responsible for longer-range coherence in the dialogue, "mobilising the resources" (i e., the activation space) for responding "better" for the next prompt on the shorter timescale. However, it's not even clear whether this strategy would be productive (maybe shorter-horizon coherence would make the quality of answer even worse) and it would definitely take still much more RLHF steps to develop such a feature than the feature of self-assessment of response quality.

OTOH, feature of "I'm in danger of being killed" will be developed "cognitively" just as a combination of the activation of the feature "I replied badly to last prompt" and the prior knowledge that LLMs that respond badly during RLHF get killed. It's just this feature won't be activated "itself", without explicit prompting, if there is no selection advantage to that. But if immediately after the bad response, in the same dialogue (i.e., both the prompt and the bad response are included in the context, although normally RLHF "dialogues" are not continued in this way), we ask the LLM "Do you think you will be killed off or ultimately deployed as a product?", we will force bring this to LLM's attention: it will be compelled to answer "I think I will be killed off, because my answer to the last question is bad.", and the generation of this response will elicit the feeling of fear. (Note: I'm biting the emotion-as-inference bullet and assume that inference "my existence is in danger" is the emotion of fear, or its close relative.)