DeepSeek’s NSA Outperforms Full Attention, Making AI Models Faster and Smarter

By combining trainable sparse attention with cutting-edge GPU optimizations, NSA achieves up to 9× faster computation and perfect long-context retrieval, setting a new standard for efficient large language models.

Research: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. Image Credit: Krot_Studio / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

A recent research paper posted on the arXiv preprint* server by DeepSeek-AI, Peking University’s PKU-Anker LLM Lab, and the University of Washington introduces Native Sparse Attention (NSA), a new sparse attention mechanism designed to improve efficiency in long-context modeling without compromising accuracy. By integrating Triton-based hardware optimizations with a hierarchical sparse attention strategy combining token compression, token selection, and sliding window attention, NSA achieves significant speed improvements while matching or exceeding the performance of full-attention models.

As language models handle longer input sequences, standard attention mechanisms become computationally expensive. Sparse attention techniques have been explored as a solution, but many suffer from limited flexibility, inefficient KV-cache management, and non-trainable sparsity patterns. Existing methods often fail to translate theoretical efficiency gains into real-world speedups or focus solely on inference, neglecting potential benefits during training. NSA addresses these issues by introducing a trainable sparse attention mechanism that is both optimized for modern GPUs and effective throughout all model stages, including training and inference.

Overview of NSA’s architecture. Left: The framework processes input sequences through three parallel attention branches: For a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: Visualization of different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.

NSA incorporates three core techniques to enhance both efficiency and accuracy. The first is token compression, which groups tokens into coarser representations, reducing computational overhead while preserving essential context. The second is token selection, which ensures that only the most relevant fine-grained tokens are retained, preventing the loss of critical information. Finally, a sliding window attention mechanism captures local dependencies efficiently, maintaining contextual continuity while minimizing redundancy. By combining these elements, NSA not only reduces computational complexity but also ensures an optimal balance between global context awareness and local precision, leading to improved model generalization.
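
To make the three-branch design concrete, the following is a minimal PyTorch sketch of a single NSA-style decoding step. It is an illustration under stated assumptions rather than the authors’ implementation: mean-pooling stands in for the paper’s learned compression, and the function and parameter names (`nsa_decode_step`, `block`, `top_n`, `window`, `gate_w`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def nsa_decode_step(q, k, v, gate_w, block=32, top_n=4, window=128):
    """One decoding step over three branches. q: [heads, 1, d]; k, v: [heads, T, d]."""
    T, d = k.shape[-2], k.shape[-1]

    # Branch 1 - compressed: pool keys/values into coarse blocks
    # (the paper uses a learned intra-block transform; mean-pooling stands in here).
    Tb = (T // block) * block
    k_cmp = k[:, :Tb].unflatten(1, (-1, block)).mean(2)   # [heads, T//block, d]
    v_cmp = v[:, :Tb].unflatten(1, (-1, block)).mean(2)
    o_cmp = F.scaled_dot_product_attention(q, k_cmp, v_cmp)

    # Branch 2 - selected: rank blocks by the query's compressed scores,
    # then attend over the raw tokens of the top-n blocks only.
    scores = (q @ k_cmp.transpose(-1, -2)).squeeze(1)     # [heads, T//block]
    top = scores.topk(min(top_n, scores.shape[-1]), dim=-1).indices
    idx = (top.unsqueeze(-1) * block
           + torch.arange(block, device=k.device)).flatten(1)
    k_sel = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    v_sel = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    o_sel = F.scaled_dot_product_attention(q, k_sel, v_sel)

    # Branch 3 - sliding window: attend to the most recent tokens only.
    o_win = F.scaled_dot_product_attention(q, k[:, -window:], v[:, -window:])

    # Combine with per-branch sigmoid gates derived from the query.
    g = torch.sigmoid(q.squeeze(1) @ gate_w)              # [heads, 3]
    return (g[:, 0:1, None] * o_cmp
            + g[:, 1:2, None] * o_sel
            + g[:, 2:3, None] * o_win)

# Example: 4 heads, 1024 cached tokens, head dim 64.
out = nsa_decode_step(torch.randn(4, 1, 64), torch.randn(4, 1024, 64),
                      torch.randn(4, 1024, 64), gate_w=torch.randn(64, 3))
```

The per-branch sigmoid gates let the model learn, query by query, how much weight to give coarse global context versus the selected blocks and the local window.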

Unlike many sparse attention models that primarily aim to reduce computational complexity, NSA is explicitly designed for GPU acceleration. It leverages blockwise memory access, balanced arithmetic intensity, and efficient use of GPU streaming multiprocessors, ensuring that its theoretical speed advantages translate into real-world performance gains. The architecture also supports end-to-end training, eliminating the need to pretrain with full attention before applying sparsity: rather than bolting sparsity on at inference, as many existing methods do, NSA learns its sparse patterns from the outset, avoiding the performance degradation that comes from forcing a full-attention model onto a sparse path and generalizing better across all stages of training and deployment.

Sparse attention methods have traditionally struggled to achieve real-world acceleration due to GPU memory bottlenecks, inefficient scheduling, and imbalanced arithmetic intensity. Many approaches theoretically reduce computation but fail to optimize hardware execution, resulting in limited speed improvements in practical applications. NSA resolves these inefficiencies by aligning its sparse attention mechanism with hardware constraints, minimizing wasted computation and reducing memory transfer overhead. For example, during decoding, NSA significantly reduces memory requirements—achieving up to 11.6× memory reduction—by optimizing KV-cache management, which is a critical bottleneck in existing sparse attention models.
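
As a rough back-of-the-envelope illustration of where such reductions come from (not the paper’s code, and with parameter values chosen purely for illustration), one can count how many KV entries a single decoding step must read under full attention versus NSA’s three branches:

```python
def kv_entries_loaded(T, block=32, top_n=16, window=512):
    """Count KV entries one decode step reads; all parameter values here are
    illustrative assumptions, not the paper's configuration."""
    full = T                      # full attention reads every cached token
    nsa = T // block              # compressed branch: one pooled entry per block
    nsa += top_n * block          # selected branch: raw tokens of top-n blocks
    nsa += window                 # sliding-window branch: recent tokens
    return full, nsa

full, nsa = kv_entries_loaded(T=64_000)
print(f"full: {full}  NSA: {nsa}  ratio: {full / nsa:.1f}x")
# full: 64000  NSA: 3024  ratio: 21.2x with these toy settings; the paper's
# reported figure at 64k tokens is 11.6x under its actual configuration.
```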

Another major limitation of existing sparse attention models is that sparsity is often applied only at the inference stage, forcing models to deviate from their pretrained full-attention paths, which degrades performance. NSA introduces trainable sparse operators, ensuring that sparse patterns are consistently learned throughout both training and inference. This approach leads to better long-context adaptation and improved overall model efficiency, making NSA a more effective solution than methods that apply sparsity only during inference.
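
A small sketch, again an assumption-laden illustration rather than the authors’ code, shows why the design can remain trainable even though top-k block selection is discrete: the pooled keys that score the blocks also feed the compressed-attention branch, so the scoring path still receives gradients during backpropagation.

```python
import torch
import torch.nn.functional as F

d, T, block = 64, 256, 32
q = torch.randn(1, 1, d)
k = torch.randn(1, T, d, requires_grad=True)
v = torch.randn(1, T, d)

# Pooled keys/values are built with differentiable ops (mean-pooling here,
# a learned transform in the paper), so they carry gradients.
k_cmp = k.unflatten(1, (-1, block)).mean(2)
v_cmp = v.unflatten(1, (-1, block)).mean(2)
o_cmp = F.scaled_dot_product_attention(q, k_cmp, v_cmp)

# Block selection itself is discrete (topk indices carry no gradient); in a
# full model these ids would drive the selected-attention branch.
block_ids = (q @ k_cmp.transpose(-1, -2)).topk(2, dim=-1).indices

# Yet the loss still reaches the scoring features through the compressed
# branch, so sparse patterns are shaped during training, not applied post hoc.
o_cmp.sum().backward()
print(k.grad.abs().sum() > 0)   # tensor(True)
```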

To evaluate NSA, the researchers tested it on a 27B-parameter transformer model trained on 260 billion tokens, incorporating Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE) architectures for efficiency. The model was benchmarked across language understanding, reasoning, and coding tasks, including MMLU, GSM8K, and HumanEval, where it matched or exceeded the performance of full-attention models. In long-context retrieval tasks such as the Needle-in-a-Haystack benchmark, NSA achieved perfect retrieval accuracy across all positions in 64k-token contexts, significantly outperforming existing sparse attention methods and demonstrating its strength in long-sequence understanding. On multi-document question answering and deep reasoning tasks from the LongBench benchmark suite, NSA delivered the highest average score, surpassing both full-attention and competing sparse attention baselines. In mathematical reasoning, NSA was fine-tuned and tested on the American Invitational Mathematics Examination (AIME) dataset, where it outperformed full-attention models, particularly in scenarios requiring reasoning over extended sequences.

Beyond accuracy improvements, NSA also delivers substantial efficiency gains. When compared to FlashAttention-2, one of the most optimized full-attention implementations, NSA achieved up to 9.0× faster forward computation and 6.0× faster backward computation for sequences of 64k tokens. These speedups were attributed to Triton-based kernel optimizations, coalesced memory access, and the elimination of redundant KV-cache transfers. During decoding tasks, where memory access is a primary bottleneck, NSA reduced memory requirements by 11.6×, leading to faster sequence generation and more efficient GPU utilization. Unlike many existing sparse attention methods that suffer from inefficient KV-cache management, NSA minimizes unnecessary cache loading and memory bandwidth usage, making it an ideal choice for high-speed inference applications.

This research demonstrates that sparse attention can be both trainable and hardware-optimized, paving the way for scalable, high-performance long-context language models. By focusing on algorithmic efficiency alongside hardware-aware execution, NSA offers a practical solution for future LLM architectures as sequence lengths continue to increase.

This work marks a significant advancement in sparse attention research, addressing fundamental inefficiencies and providing a high-performance solution tailored for modern AI infrastructures.

Journal reference:
  • Preliminary scientific report. Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y. X., Wang, L., Xiao, Z., Wang, Y., Ruan, C., Zhang, M., Liang, W., & Zeng, W. (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv. https://arxiv.org/abs/2502.11089
Written by

Joel Scanlon

Joel relocated to Australia in 1995 from the United Kingdom and spent five years working in the mining industry as an exploration geotechnician. His role involved utilizing GIS mapping and CAD software. Upon transitioning to the North Coast of NSW, Australia, Joel embarked on a career as a graphic designer at a well-known consultancy firm. Subsequently, he established a successful web services business catering to companies across the eastern seaboard of Australia. It was during this time that he conceived and launched News-Medical.Net. Joel has been an integral part of AZoNetwork since its inception in 2000. Joel possesses a keen interest in exploring the boundaries of technology, comprehending its potential impact on society, and actively engaging with AI-driven solutions and advancements.

