This paper studies how large language models can be vulnerable to attacks that target their internal reasoning process. The authors show that adversaries can subtly manipulate the steps a model takes to reach an answer, leading to harmful or misleading outputs even when the final response appears well-formed. To address this, they propose a defense framework that monitors and evaluates the model’s reasoning traces during inference. Their experiments demonstrate that this approach significantly improves robustness, reducing the success of such attacks across a range of tasks. The work underscores the need to secure not just the outputs of LLMs but also their internal decision-making, ensuring that reasoning remains trustworthy and aligned with intended goals.


Get a demo
Get a demo