FluxAttention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Quantong Qiu1 Zhiyi Hong1 Yi Yang1 Haitian Wang1 Kebin Liu2 Qingqing Dang2 Juntao Li1* Min Zhang1
1School of Computer Science and Technology, Soochow University 2Baidu Inc, China * Corresponding author: ljt@suda.edu.cn
2.8×

Prefill Speedup

FluxAttention achieves up to a 2.8× speedup in the prefill stage through layer-wise routing between Full Attention and Sparse Attention.

2.0×

Decode Speedup

FluxAttention achieves up to 2.0× speed improvement in the decode stage while preserving high-fidelity retrieval.

12h

Efficient Training

The framework is parameter-efficient and requires only 12 hours of training on 8×A800 GPUs.

Abstract

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks.

Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce FluxAttention, a context-aware framework that dynamically optimizes attention computation at the layer level.

By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that FluxAttention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to 2.8× and 2.0× in the prefill and decode stages, respectively.

Key Insight

Why Static or Head-Level Sparsity Is Not Enough

Context-Aware Routing Should Happen at Layer Granularity

FluxAttention uses a lightweight Layer Router to adaptively select Full Attention or Sparse Attention per layer, preserving retrieval quality while improving practical efficiency.

FluxAttention motivation: task-dependent sparsity sensitivity and decode efficiency comparison
Impact of sparsity on model quality and decode efficiency: static sparsity can trigger task-specific performance collapse, while layer-level routing provides better practical speedup than head-level routing.

Static Ratios Are Brittle

Task requirements vary: retrieval-intensive tasks need denser token interaction, while context-holistic tasks can tolerate higher sparsity.

Layer-Level Routing Helps

The router predicts layer-wise attention mode from context, enabling dynamic computation allocation without per-head fragmentation.

Hardware Efficiency Matters

Uniform layer-level workloads reduce synchronization stalls and better translate theoretical FLOP savings into wall-clock decode gains.

Method

FluxAttention: Layer Router for Hybrid Attention

FluxAttention integrates a lightweight Layer Router into frozen pretrained LLMs and learns soft routing with Gumbel-Softmax during training, then discretizes to hard routing at inference.

FluxAttention method overview with layer router and hybrid attention routing
Method overview. The router evaluates prompt context and routes each layer to FA or SA. Training updates only the router while freezing backbone LLM parameters.
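The layer-level routing idea above can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's released implementation: the class name `LayerRouter`, the mean-pooling of prompt hidden states, and the single linear projection are all hypothetical choices that merely show how a lightweight router could emit one FA/SA decision per transformer layer.

```python
import numpy as np

class LayerRouter:
    """Hypothetical sketch of a lightweight Layer Router (illustrative only).

    It pools the prompt's hidden states into a single context vector and
    projects it to one pair of FA/SA logits per transformer layer.
    """

    def __init__(self, hidden_dim: int, num_layers: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One linear projection: pooled context -> 2 logits per layer.
        self.weight = rng.normal(0.0, 0.02, size=(hidden_dim, num_layers * 2))
        self.num_layers = num_layers

    def __call__(self, hidden_states: np.ndarray) -> np.ndarray:
        # hidden_states: (seq_len, hidden_dim); mean-pool over the prompt.
        pooled = hidden_states.mean(axis=0)
        logits = pooled @ self.weight
        return logits.reshape(self.num_layers, 2)  # [:, 0]=FA, [:, 1]=SA

router = LayerRouter(hidden_dim=64, num_layers=8)
logits = router(np.ones((128, 64)))
modes = logits.argmax(axis=-1)  # 0 -> Full Attention, 1 -> Sparse Attention
```

Because the decision is made once per layer rather than per head, every layer runs a single uniform attention kernel, which is what preserves contiguous memory access.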

Layer Router

Routes each layer to Full Attention or Sparse Attention according to the current context and retrieval demand.

Soft-to-Hard Routing

Uses differentiable Gumbel-Softmax in training and deterministic hard routing in inference for practical deployment.
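The soft-to-hard scheme can be illustrated with a generic Gumbel-Softmax sketch. This is standard Gumbel-Softmax machinery under assumed details (temperature `tau`, mean-pooled logits from the router), not the paper's exact code: training uses a differentiable soft mixture over FA/SA, while inference drops the noise and takes a deterministic argmax.

```python
import numpy as np

def route(logits: np.ndarray, tau: float = 1.0,
          training: bool = True, rng=None) -> np.ndarray:
    """Per-layer FA/SA routing weights (illustrative sketch).

    logits: (num_layers, 2) router outputs, [:, 0]=FA, [:, 1]=SA.
    Training: Gumbel-Softmax -> soft, differentiable mixture weights.
    Inference: deterministic hard routing via argmax (one-hot rows).
    """
    if not training:
        return np.eye(logits.shape[-1])[logits.argmax(axis=-1)]
    rng = rng or np.random.default_rng(0)
    # Sample Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1).
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    z = (logits + g) / tau
    z = np.exp(z - z.max(axis=-1, keepdims=True))  # stable softmax
    return z / z.sum(axis=-1, keepdims=True)

per_layer_logits = np.array([[2.0, -1.0],   # layer strongly prefers FA
                             [0.1, 0.3]])   # layer mildly prefers SA
soft = route(per_layer_logits, tau=0.5)          # training-time mixture
hard = route(per_layer_logits, training=False)   # inference-time one-hot
```

Lowering `tau` during training sharpens the soft weights toward one-hot, so the discretization gap at inference stays small.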

Parameter Efficiency

The router can be trained efficiently on frozen pretrained LLMs, requiring only 12 hours on 8×A800 GPUs.

Results

Better Speed-Performance Trade-Offs Across Benchmarks

FluxAttention speedup summary in prefill and decode stages
Main claim from the paper: FluxAttention achieves up to 2.8× prefill speedup and 2.0× decode speedup while maintaining strong task performance.

Dynamic Task Adaptation

The router adjusts layer sparsity by context, improving robustness across retrieval-intensive and context-holistic tasks.

Real Decode Gains

Layer-level routing avoids head-level synchronization long-tail and delivers stronger decode acceleration in practice.

Fast Adaptation Cost

Parameter-efficient tuning converges in about 12 hours on 8×A800 GPUs with frozen backbone weights.

FluxAttention detailed results on efficiency and performance trade-offs
Dynamic routing patterns show that the model learns to allocate attention modes according to task demands, with more FA layers for retrieval-intensive tasks and more SA layers for context-holistic tasks.
FluxAttention detailed results on efficiency and performance trade-offs
Detailed comparisons highlight robust quality under adaptive sparsity and practical acceleration benefits from layer-level routing.

Experimental Validation

Context-Aware Routing Improves Long-Context Inference

FluxAttention is evaluated on multiple long-context and mathematical reasoning benchmarks, where it delivers better speed-performance trade-offs than baseline models.

FluxAttention benchmark validation across long-context and reasoning tasks
Benchmark validation demonstrates that FluxAttention improves the performance-efficiency trade-off across long-context and mathematical reasoning settings.

BibTeX

@misc{qiu2026fluxattentioncontextawarehybrid,
  title={FluxAttention: Context-Aware Hybrid Attention for Efficient LLMs Inference},
  author={Quantong Qiu and Zhiyi Hong and Yi Yang and Haitian Wang and Kebin Liu and Qingqing Dang and Juntao Li and Min Zhang},
  year={2026},
  eprint={2604.07394},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.07394}
}