Large Language Models (LLMs), such as OpenAI’s o1-series, have demonstrated compelling capabilities on complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in CoT reasoning traces, which not only increases inference latency but also harms model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Our analysis further reveals that excessive reflection and transition thoughts are strongly correlated with failure cases, and that these thought categories exhibit clear separation in the latent space. Based on these findings, we introduce SEAL (Steerable Reasoning Calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while delivering significant efficiency gains. SEAL consists of an offline stage that extracts a reasoning steering vector in the latent space, followed by on-the-fly calibration of the reasoning trace through representation intervention using that steering vector. Notably, the steering vector exhibits strong transferability across tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%.
Disclaimer: This blog post was automatically generated by AI and may contain inaccuracies.
Large Language Models like OpenAI’s o1-series have shown impressive reasoning capabilities through extended Chain-of-Thought (CoT) mechanisms. However, our research reveals a critical problem: substantial redundancy in the reasoning traces that hurts both accuracy and efficiency.
Figure: Overview of our SEAL framework showing offline extraction and online intervention stages
We discovered that current CoT reasoning suffers from significant issues:
Recent studies show that LLMs often determine the correct final answer early in the reasoning process but continue generating excessive and redundant thought sequences. This inefficient reasoning can even degrade final performance as models become trapped in redundant verification loops.
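As a rough illustration (not the measurement protocol used in the paper), one way to quantify this redundancy is to check how much of a trace is generated after the final answer first appears; the function name and the literal string matching below are simplifying assumptions:

```python
# Sketch: estimate how much of a CoT trace is produced *after* the final
# answer first appears. `tokenizer` can be any HuggingFace tokenizer; matching
# the answer as a literal substring is a simplifying assumption.
def redundancy_after_first_answer(trace: str, final_answer: str, tokenizer) -> float:
    """Fraction of reasoning tokens generated after the answer is first reached."""
    pos = trace.find(final_answer)
    if pos == -1:
        return 0.0  # answer never appears verbatim, so nothing to measure
    total_tokens = len(tokenizer.encode(trace))
    useful_tokens = len(tokenizer.encode(trace[: pos + len(final_answer)]))
    return 1.0 - useful_tokens / total_tokens
```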
Our systematic analysis categorizes LLM internal reasoning into three distinct thought types: execution thoughts, which directly advance the solution; reflection thoughts, which pause to verify or second-guess earlier steps; and transition thoughts, which switch to an alternative solution path.
Figure: Example showing decomposition of reasoning into different thought types
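As a rough sketch of what such a decomposition might look like in code, the snippet below splits a trace into thoughts at paragraph breaks and labels each one with keyword heuristics; the cue lists and the segmentation rule are illustrative assumptions, not the paper’s exact criteria:

```python
# Sketch: segment a reasoning trace into thoughts and label each one.
# The keyword cues below are illustrative assumptions, not the paper's
# exact definitions of the three thought types.
REFLECTION_CUES = ("wait", "let me check", "let me verify", "hmm")
TRANSITION_CUES = ("alternatively", "another approach", "instead,")

def classify_thought(thought: str) -> str:
    text = thought.lower()
    if any(cue in text for cue in REFLECTION_CUES):
        return "reflection"   # re-checking or second-guessing earlier steps
    if any(cue in text for cue in TRANSITION_CUES):
        return "transition"   # switching to a different solution path
    return "execution"        # directly advancing the solution

def segment_thoughts(trace: str) -> list[tuple[str, str]]:
    """Split on blank lines and return (label, thought_text) pairs."""
    thoughts = [p.strip() for p in trace.split("\n\n") if p.strip()]
    return [(classify_thought(t), t) for t in thoughts]
```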
Figure: Statistics showing thought distribution in correct vs incorrect samples
Key Findings from Our Analysis:
- Excessive reflection and transition thoughts are strongly correlated with failure cases: incorrect samples contain far more of them than correct ones.
- Models often reach the correct answer early in the trace, then continue into redundant verification that can talk them out of it.
Figure: t-SNE visualization showing clear separation of thought types in latent space
Our latent space analysis reveals a crucial insight: the three thought types form clearly separated clusters in the model’s hidden representations, which means reflection and transition thoughts can be identified, and suppressed, directly in the latent space.
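A minimal sketch of this kind of analysis is shown below: each thought is embedded by mean-pooling the hidden states of one decoder layer, and the embeddings are projected with t-SNE. The model name, layer index, and pooling choice are illustrative assumptions, not necessarily the settings used for the figure above:

```python
# Sketch: embed each thought with mean-pooled hidden states from one layer,
# then project to 2D with t-SNE. Layer index and pooling are assumptions.
import torch
from sklearn.manifold import TSNE
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
LAYER = 20  # assumed mid-to-late layer; the layer choice is studied in the ablations

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)

@torch.no_grad()
def thought_embedding(thought: str) -> torch.Tensor:
    inputs = tok(thought, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool over tokens

def tsne_of_thoughts(labeled_thoughts):
    """labeled_thoughts: (label, text) pairs, e.g. from segment_thoughts above."""
    embs = torch.stack([thought_embedding(t) for _, t in labeled_thoughts])
    return TSNE(n_components=2, perplexity=5).fit_transform(embs.float().numpy())
```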
We introduce SEAL (Steerable Reasoning Calibration), a training-free approach that addresses these inefficiencies through a two-stage process:
1. Offline extraction: collect latent representations of execution, reflection, and transition thoughts and derive a reasoning steering vector from them.
2. On-the-fly calibration: during decoding, intervene on the model’s representations with the steering vector to calibrate the reasoning trace, with no fine-tuning required.
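A minimal sketch of how the two stages could be wired together with HuggingFace transformers is shown below. It assumes the steering vector is the difference between the mean latent representation of execution thoughts and that of reflection/transition thoughts at a single layer, and that during decoding the vector is simply added, scaled by a strength alpha, to that layer’s output. The function names, the hook-based mechanism, and the `model.model.layers` module path (LLaMA/Qwen-style models) are illustrative assumptions; see the released code for the exact implementation.

```python
# Sketch of SEAL's two stages (an illustrative reconstruction, not the official code).
import torch

# --- Stage 1 (offline): extract a reasoning steering vector -----------------
def extract_steering_vector(exec_embs: torch.Tensor,
                            reflect_embs: torch.Tensor) -> torch.Tensor:
    """Direction from reflection/transition thoughts toward execution thoughts.

    exec_embs / reflect_embs: (n, d_model) per-thought latent representations
    collected at the steering layer, e.g. with `thought_embedding` above.
    """
    return exec_embs.mean(dim=0) - reflect_embs.mean(dim=0)

# --- Stage 2 (online): intervene on hidden states while decoding ------------
def install_steering_hook(model, layer_idx: int, vec: torch.Tensor, alpha: float = 1.0):
    """Add alpha * vec to the chosen decoder layer's output at every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    # Assumes a LLaMA/Qwen-style layout: decoder layers live in model.model.layers.
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: handle = install_steering_hook(model, LAYER, steering_vec, alpha=1.0)
#        model.generate(...)   # reasoning is calibrated on the fly
#        handle.remove()       # restore the unmodified model
```

Because the intervention happens purely at inference time, no weights are updated and the hook can be removed at any point, which is what makes the approach training-free.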
Models Tested: DeepSeek-R1-Distill (1.5B, 7B), QwQ-32B-Preview
Benchmarks: Math500, GSM8K, LiveCodeBench
Figure: Comparison showing SEAL’s superior performance over logit penalty methods
SEAL demonstrates significant improvements across multiple models and benchmarks:
Math500 Results:
| Model | Method | Accuracy (%) | Tokens | Hard Accuracy (%) | Hard Tokens |
|---|---|---|---|---|---|
| R1-Distill-1.5B | Base | 67.0 | 4526 | 54.2 | 5737 |
| R1-Distill-1.5B | SEAL | 76.6 (+9.6) | 3340 | 63.7 (+9.5) | 4552 |
| R1-Distill-7B | Base | 85.8 | 3389 | 79.8 | 4176 |
| R1-Distill-7B | SEAL | 89.4 (+3.6) | 2661 | 84.0 (+4.2) | 3365 |
Cross-Domain Generalization:
| Task | Model | Base Acc. (%) | SEAL Acc. (%) | Token Reduction (%) |
|---|---|---|---|---|
| GSM8K | R1-7B | 88.0 | 88.4 (+0.4) | 28.9 |
| LiveCodeBench | R1-7B | 44.5 | 51.7 (+7.2) | 12.9 |
Limitation of Logit Penalty: it operates on individual surface tokens (e.g., “wait”, “alternatively”) rather than at the conceptual level.
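For contrast, a token-level logit-penalty baseline might look like the sketch below, built on HuggingFace’s `LogitsProcessor` interface; the penalized word list and the penalty value are illustrative assumptions:

```python
# Sketch of a token-level logit-penalty baseline (for contrast with SEAL).
import torch
from transformers import LogitsProcessor

class ReflectionTokenPenalty(LogitsProcessor):
    def __init__(self, tokenizer, words=("wait", "alternatively"), penalty=5.0):
        self.penalty = penalty
        # Collect single-token ids for each word and a few surface variants;
        # multi-token paraphrases are simply missed by this approach.
        ids = set()
        for w in words:
            for variant in (w, w.capitalize(), " " + w, " " + w.capitalize()):
                toks = tokenizer.encode(variant, add_special_tokens=False)
                if len(toks) == 1:
                    ids.add(toks[0])
        self.ids = torch.tensor(sorted(ids), dtype=torch.long)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        scores[:, self.ids] -= self.penalty  # down-weight the trigger tokens only
        return scores
```

Passed to `model.generate` via `logits_processor=LogitsProcessorList([...])`, this only discourages the exact trigger tokens: a paraphrase such as “hold on, let me re-check” is untouched, whereas steering in the latent space targets the underlying reflection behavior itself.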
SEAL’s Advantage: by steering in the latent space, SEAL suppresses reflection and transition thoughts at the conceptual level, regardless of how they happen to be phrased at the token level.
Figure: Ablation study showing optimal steering layers
Figure: SEAL significantly reduces sequence length for incorrect samples
Key Efficiency Metrics: SEAL reduces reasoning tokens by 11.8% to 50.4% across models and benchmarks, with the largest savings on incorrect samples that would otherwise spiral into redundant verification.
Figure: Example showing how excessive reflection leads to incorrect answers despite finding the correct solution multiple times
Case Study: In this Math500 example, the model reaches the correct solution multiple times, but excessive reflection thoughts cause it to second-guess that solution and ultimately commit to an incorrect final answer.
SEAL’s Solution: By reducing excessive reflection thoughts, SEAL helps models stick with their correct initial reasoning.
SEAL proves that less can indeed be more in LLM reasoning. By intelligently calibrating the reasoning process, we achieve better accuracy with significantly fewer computational resources, making advanced reasoning more accessible and efficient.
Code Available: Our implementation is publicly available on GitHub, enabling researchers and practitioners to easily apply SEAL to their own models and tasks.