CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

The paper introduces CorrSteer, which selects sparse autoencoder features by correlating their activations with task outcomes and uses them to steer LLMs, improving both task performance and safety.

This paper studies how sparse autoencoder (SAE) features can be used to steer large language models toward better task performance and safer behavior. The authors propose CorrSteer, which selects SAE features by correlating their per-sample activations with task outcomes on the model's own generations, then derives steering coefficients for the selected features and applies them at inference time. Because selection relies only on correlations computed from inference-time activations, the approach avoids hand-curated contrastive datasets and scales across tasks. Experiments show that steering with correlation-selected features improves accuracy and safety on question answering, bias mitigation, jailbreak prevention, and reasoning benchmarks, and that the selected features are semantically interpretable. The work positions correlation-based feature selection as a practical bridge between SAE interpretability and reliable model control.
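As a rough illustration of the selection step, the sketch below ranks SAE features by the Pearson correlation between their per-sample activations and a binary task-outcome label, then forms a steering vector from the top features' decoder directions. This is a minimal sketch with synthetic stand-in arrays; the variable names, shapes, and the coefficient rule (average activation signed by correlation) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Stand-in data. In practice, acts[i, j] would be the activation of SAE
# feature j on generated sample i (e.g., averaged over tokens), correct[i]
# a 0/1 task-outcome label, and decoder the SAE's feature decoder matrix.
rng = np.random.default_rng(0)
n_samples, n_features, d_model = 1000, 4096, 256
acts = np.abs(rng.normal(size=(n_samples, n_features)))     # SAE activations (synthetic)
correct = rng.integers(0, 2, size=n_samples).astype(float)  # task labels (synthetic)
decoder = rng.normal(size=(n_features, d_model))             # decoder directions (synthetic)

def select_features_by_correlation(acts, labels, k=8):
    """Rank SAE features by |Pearson correlation| between their
    per-sample activation and the task outcome; return the top k."""
    a = (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-8)
    y = (labels - labels.mean()) / (labels.std() + 1e-8)
    corr = a.T @ y / len(y)                  # per-feature correlation with success
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr[top]

def steering_vector(acts, decoder, features, corrs):
    """Combine the selected features' decoder directions, each scaled by its
    average activation and signed by the direction of its correlation."""
    coeffs = np.sign(corrs) * acts[:, features].mean(axis=0)
    return coeffs @ decoder[features]         # shape: (d_model,)

top, corrs = select_features_by_correlation(acts, correct, k=8)
v = steering_vector(acts, decoder, top, corrs)
# At inference time, v (times a strength hyperparameter) would be added to the
# model's residual stream at the SAE's layer to steer generations.
print(top, np.linalg.norm(v))
```

With binary correctness labels, the Pearson statistic used here reduces to the point-biserial correlation; any per-sample task metric could be substituted for the 0/1 labels.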
