Using PEFT, researchers tuned LLMs' personalities and found the models spontaneously used emojis to express traits, yielding new interpretability insights.
This paper examines how the personalities of large language models can be adjusted with parameter-efficient fine-tuning (PEFT). Instead of relying on prompt editing or gradient-based model editors, which often produce unstable results, the authors use QLoRA-based PEFT to manipulate the Big Five personality traits.

When applied to LLaMA-2-7B-chat and Mistral-7B-Instruct, the fine-tuning surfaced a surprising behavior: the models began using emojis to express certain traits, even though the training data contained no emojis. LLaMA-2, for example, used emojis in nearly all outputs associated with extraversion, while Mistral did so for openness. Follow-up analysis indicated this was not random noise but a strategy the models adopted to encode personality.

The work contributes a new dataset for personality manipulation, benchmarks for evaluating traits, and evidence that PEFT steers personality more reliably than prompt-based methods. It also traces the emoji behavior to specific neuron activations inside the models.
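To make the QLoRA setup concrete, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes libraries. The base model matches the paper (LLaMA-2-7B-chat), but the LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not the authors' reported configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA-style 4-bit NF4 quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Enable gradient checkpointing and cast norms for stable 4-bit training
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters; hyperparameters here are assumed
# defaults for illustration, not the paper's exact settings
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, a standard supervised fine-tuning loop (e.g. the transformers Trainer) over a trait-labeled instruction dataset would update only the adapter weights, which is what makes per-trait personality steering cheap enough to run separately for each Big Five dimension.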