The paper shows LLMs can be attacked through reasoning steps and offers defenses that monitor inference to improve safety and reliability.
This paper studies how large language models can be vulnerable to attacks that target their internal reasoning process. The authors show that adversaries can subtly manipulate the steps a model takes to reach an answer, leading to harmful or misleading outputs even when the final response appears well-formed. To address this, they propose a defense framework that monitors and evaluates the model’s reasoning traces during inference. Their experiments demonstrate that this approach significantly improves robustness, reducing the success of such attacks across a range of tasks. The work underscores the need to secure not just the outputs of LLMs but also their internal decision-making, ensuring that reasoning remains trustworthy and aligned with intended goals.
Get a demo
Get a demo