
Conversation

@sudhakarsingh27
Collaborator

Description

FusedAttention has supported "right"-side sliding window attention for some time now. This PR adds support for SWA with (left, right) windows to the FusedAttention backend in TE.
(changes cherry-picked from original PR: #1369)
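For background, the core of the feature is diagonal alignment: a causal (or sliding-window) mask can anchor its diagonal to the top-left or the bottom-right corner of the attention matrix, which only matters when seqlen_q != seqlen_k. A minimal illustration (our own sketch, not TE code):

```python
def causal_mask(seqlen_q, seqlen_k, bottom_right_diagonal):
    """Return a seqlen_q x seqlen_k boolean matrix; True = may attend.

    With bottom-right alignment the diagonal ends at the bottom-right
    corner, so query i may see keys j <= i + (seqlen_k - seqlen_q).
    With top-left alignment query i may see keys j <= i.
    """
    offset = seqlen_k - seqlen_q if bottom_right_diagonal else 0
    return [[j <= i + offset for j in range(seqlen_k)]
            for i in range(seqlen_q)]

# For seqlen_q=2, seqlen_k=4 the two alignments differ:
# bottom-right: [[T, T, T, F], [T, T, T, T]]
# top-left:     [[T, F, F, F], [T, T, F, F]]
```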

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

transformer_engine

  • common

    • fused_attn
      • fused_attn.cpp
        • add bottom_right_diagonal parameter to the API
        • edit the filters to allow sliding window configs to pick the arbitrary-seqlen fused attention backend
      • fused_attn_f16_arbitrary_seqlen.cu: add bottom_right_diagonal parameter to the API
      • fused_attn_fp8.cu: add bottom_right_diagonal parameter to the FADescriptor_v1 API
      • utils.h: add bottom_right_diagonal parameter to FADescriptor_v1 API
  • pytorch

    • transformer.py
      • plumb bottom_right_diagonal through the call stack: TransformerLayer --> SelfAttention/CrossAttention
    • attention
      • dot_product_attention
        • backends.py:
          • UnfusedDotProductAttention
            • add bottom_right_diagonal parameter to the forward API
              • why is it not used in the forward?
                • bottom_right_alignment is used in the ALiBi call instead; perhaps this should be corrected
          • FusedAttn custom module
            • add bottom_right_diagonal parameter to the forward API
          • FusedAttention module
            • plumb bottom_right_diagonal through the call stack
        • dot_product_attention.py
          • DotProductAttention
            • Plumb bottom_right_diagonal through the call stack
            • Add calculation of bottom_right_diagonal if it's None
        • utils.py
          • AttentionParams
            • add bottom_right_diagonal field
          • get_attention_backend
            • update sliding window filter section
            • update attention bias filter section
      • multi_head_attention.py
        • Add bottom_right_diagonal to forward API and call
        • Add calculation of bottom_right_diagonal if it's None
    • cpp_extensions
      • fused_attn.py
        • plumb bottom_right_diagonal in fused_attn_fwd/fused_attn_bwd
    • csrc
      • extension
        • attention.cpp
          • plumb bottom_right_diagonal through the call stack: fused_attn_fwd --> nvte_fused_attn_fwd
          • same as above for bwd
      • extensions.h
        • add bottom_right_diagonal to fused_attn_fwd and fused_attn_bwd API definitions

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…IA#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps
Contributor

greptile-apps bot commented Dec 4, 2025

Greptile Summary

This PR adds support for sliding window attention (SWA) with configurable left and right window sizes to the FusedAttention backend. The implementation plumbs a new bottom_right_diagonal parameter through the entire stack from PyTorch and JAX frontends to the C++/CUDA backend, enabling control over diagonal alignment for sliding window masks.

Critical Issues Found:

  • Variable name bugs in utils.py (lines 911, 938): Sets use_flash_attention = False instead of use_flash_attention_2 = False, incorrectly disabling all FlashAttention backends instead of just FlashAttention 2 for specific cross-attention scenarios
  • FP8 implementation incomplete: The fused_attn_fp8_fwd_impl_v1 and fused_attn_fp8_bwd_impl_v1 functions hardcode bottom_right_diagonal to true instead of accepting it as a parameter, preventing FP8 users from configuring this feature

Positive aspects:

  • F16 arbitrary sequence length backend properly implements the parameter
  • JAX implementation correctly plumbs the parameter through FFI bindings
  • Backend selection filters updated to handle new sliding window configurations
  • Test coverage added for sliding window attention

Confidence Score: 2/5

  • This PR has critical bugs that will cause incorrect behavior in production
  • Two critical variable name bugs in utils.py will disable all FlashAttention backends when only FlashAttention 2 should be disabled. FP8 attention implementation hardcodes bottom_right_diagonal instead of making it configurable, limiting the feature's usefulness for FP8 workloads.
  • Pay close attention to transformer_engine/pytorch/attention/dot_product_attention/utils.py (variable name bugs) and transformer_engine/common/fused_attn/fused_attn_fp8.cu (hardcoded parameters)

Important Files Changed

Filename Overview
transformer_engine/pytorch/attention/dot_product_attention/utils.py Variable name bug causes incorrect backend selection for sliding window and ALiBi attention
transformer_engine/common/fused_attn/fused_attn_fp8.cu Hardcoded bottom_right_diagonal values bypass parameter passing mechanism for FP8 attention
transformer_engine/common/fused_attn/fused_attn.cpp Added bottom_right_diagonal parameter plumbing through API, updated backend selection filters

Sequence Diagram

sequenceDiagram
    participant User
    participant PyTorch/JAX Frontend
    participant CPP Extensions
    participant Fused Attn Backend
    participant cuDNN

    User->>PyTorch/JAX Frontend: Call attention with window_size_left/right
    PyTorch/JAX Frontend->>PyTorch/JAX Frontend: Calculate bottom_right_diagonal from mask_type
    PyTorch/JAX Frontend->>CPP Extensions: Pass bottom_right_diagonal parameter
    CPP Extensions->>Fused Attn Backend: Forward to nvte_fused_attn_fwd/bwd
    
    alt F16 Arbitrary SeqLen Backend
        Fused Attn Backend->>Fused Attn Backend: Use bottom_right_diagonal parameter
        Fused Attn Backend->>cuDNN: Pass to FADescriptor_v1
        cuDNN-->>Fused Attn Backend: Execute with correct alignment
    else FP8 Backend
        Fused Attn Backend->>Fused Attn Backend: Hardcode bottom_right_diagonal=true
        Note over Fused Attn Backend: BUG: Ignores parameter
        Fused Attn Backend->>cuDNN: Pass hardcoded true to FADescriptor_v1
        cuDNN-->>Fused Attn Backend: Always uses bottom-right alignment
    end
    
    Fused Attn Backend-->>CPP Extensions: Return attention output
    CPP Extensions-->>PyTorch/JAX Frontend: Return result
    PyTorch/JAX Frontend-->>User: Return attention output

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1281 (link)

    logic: Trailing comma creates single-element tuple instead of boolean - should this be just bottom_right_alignment = attn_mask_type not in ["causal", "padding_causal"]?

  2. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1482 (link)

    style: Uses hardcoded mask type check instead of the new bottom_right_diagonal parameter for ALiBi alignment. Should this use bottom_right_diagonal parameter for consistency?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
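The trailing-comma pitfall flagged in comment 1 can be reproduced in isolation:

```python
# A trailing comma turns the assignment into a one-element tuple.
flag = "causal" not in ["causal", "padding_causal"],  # note the comma

assert isinstance(flag, tuple)
assert flag == (False,)
assert bool(flag) is True  # a non-empty tuple is truthy, masking the bug

fixed = "causal" not in ["causal", "padding_causal"]  # comma removed
assert fixed is False
```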

15 files reviewed, 8 comments


Comment on lines +777 to +783
if self_attn_mask_type in {"causal", "padding_causal"}:
    bottom_right_diagonal = False
if bottom_right_diagonal is None or self_attn_mask_type in {
    "causal_bottom_right",
    "padding_causal_bottom_right",
}:
    bottom_right_diagonal = True
Contributor

logic: Logic overrides the instance variable even when explicitly set in forward call - should preserve user's explicit choice. Should the mask type check override an explicitly passed bottom_right_diagonal parameter, or only apply when it's None?
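One way to restructure so an explicit value survives — a sketch with a helper name of our own choosing, not the PR's code:

```python
def resolve_bottom_right_diagonal(bottom_right_diagonal, attn_mask_type):
    """Derive the diagonal alignment only when the caller left it None."""
    if bottom_right_diagonal is not None:
        return bottom_right_diagonal  # preserve the user's explicit choice
    # Default by mask type: top-left diagonal for causal/padding_causal,
    # bottom-right for everything else (including *_bottom_right masks).
    return attn_mask_type not in {"causal", "padding_causal"}
```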

Comment on lines +787 to +793
if enc_dec_attn_mask_type in {"causal", "padding_causal"}:
    enc_dec_bottom_right_diagonal = False
if enc_dec_bottom_right_diagonal is None or enc_dec_attn_mask_type in {
    "causal_bottom_right",
    "padding_causal_bottom_right",
}:
    enc_dec_bottom_right_diagonal = True
Contributor

logic: Same logic issue as above - mask type check overrides explicit parameter values

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

15 files reviewed, no comments


Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (4)

  1. transformer_engine/pytorch/transformer.py, line 777-783 (link)

    logic: Conditional logic overwrites explicitly passed parameter. When self_attn_mask_type in {"causal", "padding_causal"} at line 777, this unconditionally sets bottom_right_diagonal = False, even if the user explicitly passed a different value in the forward call. The check at line 779 (if bottom_right_diagonal is None) becomes unreachable in those cases.

    Consider restructuring to only apply defaults when bottom_right_diagonal is None:

  2. transformer_engine/pytorch/transformer.py, line 787-793 (link)

    logic: Same conditional logic issue for encoder-decoder attention. The mask type check at line 787 unconditionally overwrites enc_dec_bottom_right_diagonal, ignoring explicitly passed values.

  3. transformer_engine/pytorch/attention/dot_product_attention/utils.py, line 920 (link)

    logic: Variable name mismatch: setting use_flash_attention instead of use_flash_attention_2

  4. transformer_engine/pytorch/attention/dot_product_attention/utils.py, line 947 (link)

    logic: Variable name mismatch: setting use_flash_attention instead of use_flash_attention_2

15 files reviewed, 4 comments


@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

1 similar comment
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1283 (link)

    syntax: Tuple has trailing comma causing it to be a single-element tuple instead of boolean

15 files reviewed, 1 comment


@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

Contributor

@greptile-apps greptile-apps bot left a comment

15 files reviewed, no comments


Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

Contributor

@greptile-apps greptile-apps bot left a comment

17 files reviewed, 3 comments


in [i + seqlen_k - seqlen_q - window_size[0], i + seqlen_k - seqlen_q
+ window_size[1]] inclusive. Special cases (-1, -1) and (-1, 0) mean no sliding
window and causal mask specifically.
bottom_right_diagonal: bool, default = True
Contributor

style: Documentation inconsistency: forward function documents default as None but backward documents True

Suggested change
bottom_right_diagonal: bool, default = True
bottom_right_diagonal: bool, default = None
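The window rule quoted in the docstring above can be written out as a small predicate (a purely illustrative helper of our own, not TE code):

```python
def in_window(i, j, seqlen_q, seqlen_k, window_size):
    """True iff query position i may attend key position j."""
    w0, w1 = window_size
    if (w0, w1) == (-1, -1):   # special case: no sliding window at all
        return True
    if (w0, w1) == (-1, 0):    # special case: plain causal mask
        return j <= i + seqlen_k - seqlen_q
    # General case: j must fall in the inclusive band around the
    # bottom-right-aligned diagonal i + seqlen_k - seqlen_q.
    lo = i + seqlen_k - seqlen_q - w0
    hi = i + seqlen_k - seqlen_q + w1
    return lo <= j <= hi
```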

0,
0,
true,
true,
Contributor

logic: Critical: Hardcoded parameter instead of proper API parameter

This hardcoded true value should be replaced with a proper bottom_right_diagonal parameter passed from the function signature. The current implementation bypasses the intended parameter passing mechanism.

Issue: The function fused_attn_fp8_fwd_impl_v1 should accept bottom_right_diagonal as a parameter and pass it here, rather than hardcoding true. This breaks the configurability that the PR is supposed to add. Should this hardcoded true be replaced with a bottom_right_diagonal parameter that gets passed through the function signature?

0,
0,
true,
false,
Contributor

logic: Critical: Hardcoded parameter instead of proper API parameter

Similar to the forward pass, this hardcoded false value should be replaced with a proper bottom_right_diagonal parameter passed from the function signature.

Issue: The function fused_attn_fp8_bwd_impl_v1 should accept bottom_right_diagonal as a parameter and pass it here. The hardcoded false suggests different behavior for backward pass, but this should still be configurable. Should this hardcoded false be replaced with a bottom_right_diagonal parameter, and why is it different from the forward pass value?

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

Contributor

@greptile-apps greptile-apps bot left a comment

17 files reviewed, 4 comments


"Disabling FlashAttention as it only supports sliding window with bottom right"
" diagonal alignment for cross-attention"
)
use_flash_attention = False
Contributor

logic: Wrong variable name - should be use_flash_attention_2 = False not use_flash_attention = False. This causes all FlashAttention backends to be incorrectly disabled for sliding window with cross-attention when only FlashAttention 2 should be disabled.

Suggested change
use_flash_attention = False
use_flash_attention_2 = False

"Disabling FlashAttention as it only supports ALiBi with bottom right diagonal"
" alignment for cross-attention"
)
use_flash_attention = False
Contributor

logic: Wrong variable name - should be use_flash_attention_2 = False not use_flash_attention = False. This causes all FlashAttention backends to be incorrectly disabled for ALiBi with cross-attention when only FlashAttention 2 should be disabled.

Suggested change
use_flash_attention = False
use_flash_attention_2 = False

NVTE_Softmax_Type::NVTE_VANILLA_SOFTMAX,
0,
0,
true,
Contributor

logic: Hardcoded bottom_right_diagonal value bypasses the parameter passing mechanism. The FP8 forward function needs to accept bottom_right_diagonal as a parameter and pass it here instead of hardcoding true. Should FP8 attention support configurable bottom_right_diagonal, or is there a technical reason it must always be true?

NVTE_Softmax_Type::NVTE_VANILLA_SOFTMAX,
0,
0,
true,
Contributor

logic: Hardcoded bottom_right_diagonal value in backward pass. Should accept this as a parameter and pass it through instead of hardcoding true.


@cyanguwa
Collaborator

You may need to add the extra argument bottom_right_diagonal to the nvte_xxx C API calls in Jax too, otherwise TE-Jax won't build. Have a look at PR 2584 maybe, where the deterministic arg was added, so files such as jax/csrc/extensions/attention.cpp need to be changed as well. Thanks!

@sudhakarsingh27
Collaborator Author

/te-ci L1

Contributor

@greptile-apps greptile-apps bot left a comment

20 files reviewed, 4 comments


"Disabling FlashAttention as it only supports sliding window with bottom right"
" diagonal alignment for cross-attention"
)
use_flash_attention = False
Contributor

syntax: Wrong variable - should be use_flash_attention_2 = False not use_flash_attention = False. This disables all FlashAttention backends when only FlashAttention 2 should be disabled for sliding window with cross-attention.

Suggested change
use_flash_attention = False
use_flash_attention_2 = False

"Disabling FlashAttention as it only supports ALiBi with bottom right diagonal"
" alignment for cross-attention"
)
use_flash_attention = False
Contributor

syntax: Wrong variable - should be use_flash_attention_2 = False not use_flash_attention = False. This disables all FlashAttention backends when only FlashAttention 2 should be disabled for ALiBi with cross-attention.

Suggested change
use_flash_attention = False
use_flash_attention_2 = False

Comment on lines 1709 to +1710
true,
true,
Contributor

logic: Hardcoded bottom_right_diagonal to true bypasses the configurable parameter mechanism. The fused_attn_fp8_fwd_impl_v1 function should accept bottom_right_diagonal as a parameter and pass it here instead of hardcoding. Is there a technical reason FP8 attention must always use bottom_right_diagonal=true, or should this be configurable?

NVTE_Softmax_Type::NVTE_VANILLA_SOFTMAX,
0,
0,
true,
Contributor

logic: Hardcoded bottom_right_diagonal to true in backward pass. Should accept this as a parameter for consistency with the forward pass configuration.


@sudhakarsingh27
Collaborator Author

/te-ci L1
