CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

The paper introduces CorrSteer, which selects sparse autoencoder features by correlating their activations with task outcomes and uses them to steer LLMs, improving both task performance and safety.

This paper studies how sparse autoencoder (SAE) features can be used to steer large language models toward better task performance and safer behavior. The authors propose CorrSteer, which selects SAE features by correlating their per-sample activations with task outcomes on the model's own generations, then derives steering coefficients for the selected features and applies them at inference time. Because selection relies only on correlations computed from inference-time activations, the approach avoids hand-curated contrastive datasets and scales across tasks. Experiments show that steering with correlation-selected features improves accuracy and safety on question answering, bias mitigation, jailbreak prevention, and reasoning benchmarks, and that the selected features are semantically interpretable. The work positions correlation-based feature selection as a practical bridge between SAE interpretability and reliable model control.
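As a rough illustration of the selection step, the sketch below ranks SAE features by the Pearson correlation between their per-sample activations and a binary task-outcome label, then forms a steering vector from the top features' decoder directions. This is a minimal sketch with synthetic stand-in arrays; the variable names, shapes, and the coefficient rule (average activation signed by correlation) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Stand-in data. In practice, acts[i, j] would be the activation of SAE
# feature j on generated sample i (e.g., averaged over tokens), correct[i]
# a 0/1 task-outcome label, and decoder the SAE's feature decoder matrix.
rng = np.random.default_rng(0)
n_samples, n_features, d_model = 1000, 4096, 256
acts = np.abs(rng.normal(size=(n_samples, n_features)))     # SAE activations (synthetic)
correct = rng.integers(0, 2, size=n_samples).astype(float)  # task labels (synthetic)
decoder = rng.normal(size=(n_features, d_model))             # decoder directions (synthetic)

def select_features_by_correlation(acts, labels, k=8):
    """Rank SAE features by |Pearson correlation| between their
    per-sample activation and the task outcome; return the top k."""
    a = (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-8)
    y = (labels - labels.mean()) / (labels.std() + 1e-8)
    corr = a.T @ y / len(y)                  # per-feature correlation with success
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr[top]

def steering_vector(acts, decoder, features, corrs):
    """Combine the selected features' decoder directions, each scaled by its
    average activation and signed by the direction of its correlation."""
    coeffs = np.sign(corrs) * acts[:, features].mean(axis=0)
    return coeffs @ decoder[features]         # shape: (d_model,)

top, corrs = select_features_by_correlation(acts, correct, k=8)
v = steering_vector(acts, decoder, top, corrs)
# At inference time, v (times a strength hyperparameter) would be added to the
# model's residual stream at the SAE's layer to steer generations.
print(top, np.linalg.norm(v))
```

With binary correctness labels, the Pearson statistic used here reduces to the point-biserial correlation; any per-sample task metric could be substituted for the 0/1 labels.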
