CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

This paper introduces CorrSteer, a method for steering large language models using features from sparse autoencoders (SAEs). Rather than relying on hand-curated contrastive datasets, CorrSteer selects steering features by correlating SAE feature activations with task outcomes on the model's own generated outputs, so feature selection uses only inference-time signals. The selected features' directions are then applied during generation to steer the model. The authors show that this correlation-based selection improves both task performance and safety-related behavior, and that the chosen features are interpretable, supporting correlation-driven feature selection as a scalable, automated approach to steering LLMs.
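As a rough illustration of the correlation-based selection idea, the sketch below scores each SAE feature by the Pearson correlation between its per-sample activation and a binary correctness label, then combines the top features' decoder directions into a steering vector. This is a minimal toy sketch, not the paper's implementation: the function names, the use of mean activations, and the sign-weighted sum over decoder rows are all illustrative assumptions.

```python
import numpy as np

def select_steering_features(acts, correct, k=2):
    """Pick the k SAE features whose activations correlate most strongly
    (by absolute Pearson r) with per-sample task correctness.
    acts: (n_samples, n_features) mean SAE activations per sample.
    correct: (n_samples,) binary correctness labels."""
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=float)
    a = acts - acts.mean(axis=0)
    c = correct - correct.mean()
    denom = np.sqrt((a ** 2).sum(axis=0) * (c ** 2).sum())
    # Guard against zero-variance features or labels.
    r = np.where(denom > 0, (a * c[:, None]).sum(axis=0) / denom, 0.0)
    top = np.argsort(-np.abs(r))[:k]
    return top, r[top]

def build_steering_vector(decoder, top, r, scale=1.0):
    """Sum the selected features' decoder directions, weighted by their
    correlation sign and magnitude (an illustrative weighting choice)."""
    return scale * (r[:, None] * decoder[top]).sum(axis=0)

# Toy data: feature 0 tracks correctness, feature 1 is pure noise.
rng = np.random.default_rng(0)
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
acts = np.stack(
    [correct + 0.05 * rng.standard_normal(8),  # correlated feature
     rng.standard_normal(8)],                  # noise feature
    axis=1,
)
decoder = np.eye(2)  # toy decoder: one residual-stream direction per feature

top, r = select_steering_features(acts, correct, k=1)
vec = build_steering_vector(decoder, top, r)
```

In an actual pipeline the steering vector would be added to the residual stream at a chosen layer during generation; here it simply demonstrates that the correlated feature, not the noise feature, is selected.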
