CPU Optimizations for FP8 #2559

vthumbe1503 · 2026-01-05T18:17:02Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

vthumbe1503 · 2026-01-06T13:03:13Z

/te-ci L1 pytorch

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-01-07T17:23:30Z

/te-ci L1 pytorch

greptile-apps · 2026-01-07T17:48:47Z

Greptile Summary

This PR implements CPU-side optimizations for FP8 operations to reduce overhead in the hot path.

Key optimizations:

Cached function results: nvte_is_non_tn_fp8_gemm_supported() is now called once per function instead of multiple times
Symbol caching: CUDA driver symbol lookups are cached with thread-safe mutex protection
Property caching: dtype, requires_grad, shape, and is_cuda properties on quantized tensors now use cached values to avoid expensive PyObject lookups
Direct Python C API: Tensor construction bypasses pybind11 overhead by using PyDict_New()/PyObject_Call() directly
Optimized tensor allocation: Uses at::native::empty_cuda directly to avoid redundant device guard overhead
Upfront attribute access: requires_grad and dtype are cached at the start of operations instead of accessed repeatedly
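
The symbol-caching idea in the list above can be sketched in Python as a rough analog of the `std::mutex` plus `std::unordered_map` pattern the summary describes. This is not the code in `cuda_driver.h`; the names `get_symbol`, `_expensive_lookup`, and `example_symbol` are hypothetical and exist only for illustration.

```python
import threading

# Hypothetical cache: symbol name -> resolved handle.
_symbol_cache = {}
_cache_lock = threading.Lock()
lookup_count = 0  # counts how often the slow path actually runs


def _expensive_lookup(name):
    """Stand-in for a costly driver symbol lookup (assumption, not a real API)."""
    global lookup_count
    lookup_count += 1
    return f"<handle for {name}>"


def get_symbol(name):
    """Return the cached handle, resolving it at most once per name."""
    with _cache_lock:
        if name not in _symbol_cache:
            _symbol_cache[name] = _expensive_lookup(name)
        return _symbol_cache[name]


first = get_symbol("example_symbol")
second = get_symbol("example_symbol")  # served from the cache
```

Holding the lock for the whole check-and-insert keeps the sketch simple; whether the real implementation uses finer-grained locking is not stated in the summary.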

Architecture notes:

The QuantizedTensor class now has lazy initialization fallbacks for _dtype and _requires_grad to handle tensors created through alternate paths (unpickling, FSDP, etc.)
The stride parameter was added to QuantizedTensor.__new__ to allow C++ code to pass pre-computed strides
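
A minimal sketch of the lazy initialization fallback described above, assuming the cached attribute may be missing when an object is created without going through the normal constructor. The class `CachedAttrTensor` and the helper `_slow_dtype_lookup` are hypothetical stand-ins, not the actual `QuantizedTensor` implementation.

```python
class CachedAttrTensor:
    """Hypothetical wrapper that caches dtype at construction time."""

    def __init__(self, dtype=None):
        if dtype is not None:
            self._dtype = dtype

    def _slow_dtype_lookup(self):
        # Stand-in for the expensive lookup the cache is meant to avoid.
        self.slow_lookups = getattr(self, "slow_lookups", 0) + 1
        return "float32"

    @property
    def dtype(self):
        try:
            return self._dtype  # fast path: cached value
        except AttributeError:
            # Fallback for objects built through an alternate path
            # (for example, __new__ without __init__).
            self._dtype = self._slow_dtype_lookup()
            return self._dtype


# Alternate construction path: __init__ never runs, so _dtype is absent.
t = CachedAttrTensor.__new__(CachedAttrTensor)
t.dtype  # triggers the fallback once
t.dtype  # now served from the cache

u = CachedAttrTensor(dtype="bfloat16")  # normal path: no fallback needed
```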

Confidence Score: 4/5

This PR is safe to merge with low risk - optimizations are well-structured and maintain existing semantics
Score reflects well-implemented CPU optimizations with proper thread safety (mutex, call_once) and lazy initialization fallbacks for edge cases. The C API usage patterns are standard and the logic changes preserve original semantics. Minor concerns around exception safety in quantizer.cpp don't affect normal execution paths.
transformer_engine/pytorch/csrc/quantizer.cpp has exception safety concerns but they only manifest during Python exceptions which would already be fatal

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/cuda_driver.h | Adds thread-safe symbol caching using `std::mutex` and `std::unordered_map` to avoid repeated symbol lookups. Properly synchronized. |
| transformer_engine/pytorch/csrc/quantizer.cpp | Bypasses pybind11 overhead with direct Python C API calls. Adds `empty_cuda` wrapper and caches `nvte_is_non_tn_fp8_gemm_supported()`. Several patterns could leak memory on exceptions. |
| transformer_engine/pytorch/module/linear.py | Caches `requires_grad` properties upfront to avoid repeated lookups. Logic change from `requires_grad()` helper to explicit OR is semantically equivalent. |
| transformer_engine/pytorch/quantized_tensor.py | Adds cached `dtype` and `requires_grad` properties with lazy initialization fallback. Adds optional `stride` parameter to `__new__`. Has proper fallback for alternate construction paths. |
| transformer_engine/pytorch/tensor/float8_tensor.py | Adds optimized `shape` and `is_cuda` properties that avoid expensive PyObject lookups. Includes proper error handling for edge cases. |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Linear as _Linear.forward
    participant QTensor as QuantizedTensor
    participant Quantizer as Float8Quantizer (C++)
    participant PyAPI as Python C API
    
    Note over Linear: CPU Optimization: Cache requires_grad upfront
    User->>Linear: forward(inp, weight, bias)
    Linear->>QTensor: inp.requires_grad (cached)
    QTensor-->>Linear: _requires_grad (no PyObject lookup)
    
    Note over Quantizer: CPU Optimization: Direct C API calls
    Linear->>Quantizer: create_tensor()
    Quantizer->>Quantizer: nvte_is_non_tn_fp8_gemm_supported() (cached once)
    Quantizer->>PyAPI: PyDict_New(), PyObject_Call()
    Note right of PyAPI: Bypasses pybind11 overhead
    PyAPI-->>Quantizer: Float8Tensor instance
    
    Note over QTensor: CPU Optimization: Cached properties
    Linear->>QTensor: tensor.shape
    QTensor-->>Linear: _data.shape (direct access)

greptile-apps

Additional Comments (3)

transformer_engine/pytorch/csrc/util.cpp, line 18-20 (link)

logic: Critical logical error: || should be &&. This condition will always be true, since a value cannot simultaneously be both scaling modes, causing the function to always return nullopt for valid inputs.
transformer_engine/pytorch/quantized_tensor.py, line 373-393 (link)

style: commented-out code for requires_grad caching optimization - consider removing dead code entirely. Is this code planned to be implemented later or should it be removed?

transformer_engine/pytorch/module/linear.py, line 484 (link)

logic: Logical error: this condition should use OR (`or`), not AND (`and`). The original logic checked whether ANY tensor requires gradients for FP8 handling, but the new condition only activates when ALL three require gradients, including bias, which may be None.

Should the FP8 condition check if any tensor requires gradients (OR logic) rather than all tensors (AND logic)?
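
The difference between the two conditions can be sketched with plain booleans. The function names below are hypothetical, and the flags stand in for the cached `requires_grad` values; this is an illustration of the logic, not the code in `linear.py`.

```python
def should_update_fp8_state_any(fp8, inp_rg, weight_rg, bias_rg):
    """OR logic: act if ANY tensor requires grad (the original behavior per the review)."""
    return fp8 and (inp_rg or weight_rg or bias_rg)


def should_update_fp8_state_all(fp8, inp_rg, weight_rg, bias_rg):
    """AND logic: act only if ALL tensors require grad (the flagged behavior)."""
    return fp8 and (inp_rg and weight_rg and bias_rg)


# A layer without bias: the bias flag is False because there is no tensor.
bias = None
bias_requires_grad = bias is not None and bias.requires_grad

with_or = should_update_fp8_state_any(True, True, True, bias_requires_grad)
with_and = should_update_fp8_state_all(True, True, True, bias_requires_grad)
```

With a missing bias, `with_or` is `True` while `with_and` is `False`, which is the behavioral change the comment describes.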

_{10 files reviewed, 3 comments}

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-01-07T17:59:34Z

/te-ci L1 pytorch

greptile-apps

Greptile Overview

Greptile Summary

This PR implements CPU-side performance optimizations for FP8 operations by caching frequently accessed attributes and reducing redundant function calls. The optimizations target expensive PyObject attribute lookups on custom tensor types and repeated C++ function calls.

Key Changes:

Caches requires_grad, dtype, shape, and is_cuda attribute accesses to avoid expensive PyObject lookups on custom tensors
Reorders attribute checks in get_tensor_device() to prioritize internal quantized tensor attributes
Makes num_devices static in nvte_is_non_tn_fp8_gemm_supported() to cache device count
Stores GEMM support check results in local variables to avoid redundant function calls

Critical Issues Found:

Variable redeclaration error in cublaslt_gemm.cu (line 224) will prevent compilation
Logic bug in linear.py (line 484) changes FP8 state management from OR logic to AND logic, breaking functionality when bias is None or doesn't require grad

Confidence Score: 0/5

This PR cannot be merged due to compilation error and critical logic bug
Two critical issues prevent merging: (1) C++ compilation will fail due to variable redeclaration at line 224 of cublaslt_gemm.cu, and (2) logic bug at line 484 of linear.py breaks FP8 state management by requiring all three tensors to have requires_grad=True instead of any one of them
Pay close attention to transformer_engine/common/gemm/cublaslt_gemm.cu (compilation error) and transformer_engine/pytorch/module/linear.py (logic bug)

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| transformer_engine/common/gemm/cublaslt_gemm.cu | 1/5 | Caches function call result to reduce overhead, but contains variable redeclaration error that will cause compilation failure |
| transformer_engine/common/transformer_engine.cpp | 5/5 | Makes `num_devices` static to avoid redundant calls to `cuda::num_devices()` - valid optimization |
| transformer_engine/pytorch/module/linear.py | 0/5 | Caches `requires_grad` checks for performance, but contains critical logic bug at line 484 that changes FP8 state management behavior |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Linear as Linear Module
    participant Quantizer as Quantizer/QuantizedTensor
    participant GEMM as GEMM Operations
    participant CPP as C++ Extensions

    Note over Linear,CPP: Performance Optimization Flow
    
    User->>Linear: forward(input, weight, bias)
    
    Note over Linear: Cache requires_grad checks
    Linear->>Linear: inp_requires_grad = inp.requires_grad<br/>weight_requires_grad = weight.requires_grad<br/>bias_requires_grad = bias.requires_grad
    
    Linear->>Quantizer: Check if quantized tensor
    alt QuantizedTensor
        Note over Quantizer: Use cached dtype property
        Quantizer->>Quantizer: return self._dtype
        Note over Quantizer: Use cached shape/is_cuda
        Quantizer->>Quantizer: return self._data.shape
    else Regular Tensor
        Quantizer->>Linear: Standard attribute access
    end
    
    Linear->>CPP: get_tensor_device(tensor)
    Note over CPP: Reordered attribute checks
    CPP->>CPP: Check _rowwise_data first<br/>Check _columnwise_data<br/>Check device last
    CPP-->>Linear: device_index
    
    Linear->>GEMM: Configure GEMM parameters
    Note over GEMM: Cache nvte_is_non_tn_fp8_gemm_supported
    GEMM->>CPP: nvte_is_non_tn_fp8_gemm_supported()
    Note over CPP: Static num_devices cached
    CPP-->>GEMM: support_flag
    GEMM->>GEMM: Store in local variable
    
    GEMM->>GEMM: Execute optimized GEMM
    GEMM-->>Linear: output
    
    Note over Linear: FP8 State Management
    alt FP8 enabled and requires_grad check
        Linear->>Linear: Update FP8 tensors<br/>based on cached flags
    end
    
    Linear-->>User: output

greptile-apps · 2026-01-07T18:02:31Z

Additional Comments (2)

transformer_engine/common/gemm/cublaslt_gemm.cu
variable redeclared in same scope - already declared at line 132

    // int is_nvte_non_tn_fp8_gemm_supported already declared at line 132

transformer_engine/pytorch/module/linear.py
logic change from original requires_grad(inp, weight, bias) which returns True if ANY tensor requires grad. New code requires ALL THREE to be True, breaking FP8 state management when bias is None or doesn't require grad

            if ctx.fp8 and (inp_requires_grad or weight_requires_grad or bias_requires_grad):

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-01-11T19:04:36Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-11T19:04:45Z

/te-ci L1 pytorch

greptile-apps

_{8 files reviewed, 8 comments}

greptile-apps · 2026-01-11T19:09:51Z

transformer_engine/common/gemm/cublaslt_gemm.cu

    ret.A_scale_inv = A.scale_inv.dptr;
    ret.lda = is_A_transposed ? k : m;
-    if (!nvte_is_non_tn_fp8_gemm_supported() && !is_A_transposed) {
+    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();


Variable is_nvte_non_tn_fp8_gemm_supported is redeclared in the same scope - it was already declared at line 224 for the B matrix configuration. This causes a compilation error or shadowing issue.

The variable should only be declared once at the beginning of the function scope and reused for both A and B matrix configurations.

Suggested change

-    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();
+    // Move this declaration before line 125 (Configure A matrix)
+    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();

greptile-apps · 2026-01-11T19:09:52Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
+                                     PyTuple_New(0), kwargs);


Memory leak: PyTuple_New(0) creates a new Python tuple object with a reference count of 1, but this reference is never decremented. The tuple is passed directly to PyObject_Call which doesn't steal the reference, so it must be explicitly released.

This same issue occurs at lines 186, 398, 417, 697, 718, 1039, 1058, 1346, and 1367.

Suggested change

-    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
-                                     PyTuple_New(0), kwargs);
+    PyObject* args = PyTuple_New(0);
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
+                                     args, kwargs);
+    Py_DECREF(args);
+    Py_DECREF(kwargs);

greptile-apps · 2026-01-11T19:09:54Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "fp8_dtype", py::cast(this->dtype).ptr());
+    PyDict_SetItemString(kwargs, "data_transpose", transpose_py.ptr());
+    PyDict_SetItemString(kwargs, "quantizer", this->quantizer.ptr());
+    PyDict_SetItemString(kwargs, "device", py::cast(device).ptr());


Invalid parameter: The Float8TensorStorage.__new__ method does not accept a device parameter. Looking at the Python class definition in float8_tensor_storage.py, the __new__ signature only expects: data, fp8_scale_inv, fp8_dtype, data_transpose, and quantizer.

This will cause a runtime error when trying to create Float8TensorStorage instances. This issue also occurs at lines 184, 396, 415, and in similar patterns for MXFP8 and other quantizers.

Suggested change

-    PyDict_SetItemString(kwargs, "device", py::cast(device).ptr());
+    // Remove this line - device parameter is not accepted
+    // PyDict_SetItemString(kwargs, "device", py::cast(device).ptr());

greptile-apps · 2026-01-11T19:09:57Z

transformer_engine/pytorch/csrc/extensions/pybind.cpp

 PyTypeObject *NVFP4TensorPythonClass = nullptr;
 PyTypeObject *NVFP4TensorStoragePythonClass = nullptr;
 PyTypeObject *NVFP4QuantizerClass = nullptr;
+bool is_extension_initialized = false;


Race condition: The is_extension_initialized flag is not thread-safe. If init_extension() is called concurrently from multiple threads, both threads could pass the check at line 105 before either sets the flag at line 110, causing the initialization code to run multiple times.

The original implementation using null-checks on the individual class pointers was safer because Python's module import system provides thread safety. Consider using std::once_flag and std::call_once for proper thread-safe initialization:

Suggested change

-bool is_extension_initialized = false;
+bool is_extension_initialized = false;
+std::once_flag extension_init_flag;

And in init_extension():

    void init_extension() {
      std::call_once(extension_init_flag, []() {
        init_float8_extension();
        init_mxfp8_extension();
        init_float8blockwise_extension();
        init_nvfp4_extensions();
      });
    }

greptile-apps · 2026-01-11T19:09:58Z

transformer_engine/common/transformer_engine.cpp

+  static int num_devices = transformer_engine::cuda::num_devices();
  static std::vector<int> cache(num_devices, -1);
  static std::vector<std::once_flag> flags(num_devices);


Potential edge case: The num_devices is captured once as a static variable, so if the number of devices changes at runtime (e.g., with CUDA MPS or device enumeration changes), the cache and flags vectors could be undersized, leading to out-of-bounds access at line 968.

While dynamic device changes are rare, consider adding a bounds check:

Suggested change

-  static int num_devices = transformer_engine::cuda::num_devices();
-  static std::vector<int> cache(num_devices, -1);
-  static std::vector<std::once_flag> flags(num_devices);
+  static int num_devices = transformer_engine::cuda::num_devices();
+  static std::vector<int> cache(num_devices, -1);
+  static std::vector<std::once_flag> flags(num_devices);
+  int device_id = transformer_engine::cuda::current_device();
+  NVTE_CHECK(device_id < num_devices, "Device ID exceeds cached device count");

greptile-apps

_{4 files reviewed, 4 comments}

greptile-apps · 2026-01-11T19:10:24Z

transformer_engine/common/gemm/cublaslt_gemm.cu

    ret.B_scale_inv = B.scale_inv.dptr;
    ret.ldb = is_B_transposed ? n : k;
-    if (!nvte_is_non_tn_fp8_gemm_supported() && is_B_transposed) {
+    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();


Variable is_nvte_non_tn_fp8_gemm_supported is redeclared in the same function scope (already declared at line 132). In C++, this is a compilation error. The variable should be declared once at the function scope level before both the A matrix and B matrix configuration blocks.

Suggested change

-    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();
+    // Move declaration to line 104 (after GemmParam ret;) to avoid redeclaration

greptile-apps · 2026-01-11T19:10:25Z

transformer_engine/pytorch/csrc/quantizer.cpp

+  py::object scale_inv_py = py::cast(scale_inv);
  // Initialize scale-inverse tensor
  if (!scale_inv) {
    scale_inv = at::reciprocal(scale);
  }


Critical bug: scale_inv_py is created on line 146 by dereferencing scale_inv (which is std::optional<at::Tensor>), but scale_inv is only initialized on lines 148-150 if it's empty. When scale_inv is not provided to this function, line 146 will dereference an empty optional, causing undefined behavior or a crash.

The fix is to move line 146 after the initialization block:

Suggested change

-  py::object scale_inv_py = py::cast(scale_inv);
-  // Initialize scale-inverse tensor
-  if (!scale_inv) {
-    scale_inv = at::reciprocal(scale);
-  }
+  py::object transpose_py = with_transpose ? py::cast(*transpose) : py::none();
+  // Initialize scale-inverse tensor
+  if (!scale_inv) {
+    scale_inv = at::reciprocal(scale);
+  }
+  py::object scale_inv_py = py::cast(*scale_inv);

greptile-apps · 2026-01-11T19:10:26Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "device", py::cast(device).ptr());
+
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
+                                     PyTuple_New(0), kwargs);


Memory leak: PyTuple_New(0) creates a new tuple with reference count 1, but PyObject_Call does not steal the reference. The tuple must be stored and explicitly DECREF'd after the call. This pattern is repeated 12 times throughout this file (including lines 166, 186, 398, 417, 697, 718, 1039, 1058, 1346, 1367).

Suggested change

-                                     PyTuple_New(0), kwargs);
+    PyObject* args = PyTuple_New(0);
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
+                                     args, kwargs);
+    Py_DECREF(args);

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

greptile-apps

_{13 files reviewed, 13 comments}

greptile-apps · 2026-01-11T19:16:57Z

transformer_engine/common/gemm/cublaslt_gemm.cu

    ret.B_scale_inv = B.scale_inv.dptr;
    ret.ldb = is_B_transposed ? n : k;
-    if (!nvte_is_non_tn_fp8_gemm_supported() && is_B_transposed) {
+    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();


[P0] Variable is_nvte_non_tn_fp8_gemm_supported is already declared at line 132. Declaring it again here causes a compilation error. Remove this declaration and reuse the variable from line 132.

Suggested change

-    int is_nvte_non_tn_fp8_gemm_supported = nvte_is_non_tn_fp8_gemm_supported();
+    // Reuse is_nvte_non_tn_fp8_gemm_supported from line 132

greptile-apps · 2026-01-11T19:16:58Z

transformer_engine/pytorch/csrc/quantizer.cpp

  }
  py::object transpose_py = with_transpose ? py::cast(*transpose) : py::none();
-
+  py::object transpose_py = with_transpose ? py::cast(*transpose) : py::none();


[P0] Use-before-initialization bug: scale_inv_py is created from scale_inv before scale_inv is initialized. When the scale_inv parameter is empty, this dereferences an empty std::optional before the initialization block (lines 148-150) runs, causing undefined behavior. Move this line after the scale_inv initialization.

Suggested change

-  py::object transpose_py = with_transpose ? py::cast(*transpose) : py::none();
+  // Move scale_inv_py creation after initialization (line 150)

greptile-apps · 2026-01-11T19:16:59Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "quantizer", this->quantizer.ptr());
+    PyDict_SetItemString(kwargs, "device", py::cast(device).ptr());
+
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),


[P0] Memory leak: PyTuple_New(0) creates a new Python object with refcount 1, but it's never decremented. This leaks memory on every tensor creation. Add Py_DECREF for the tuple:

Suggested change

-    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass),
+    PyObject* args = PyTuple_New(0);
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8TensorStoragePythonClass), args, kwargs);
+    Py_DECREF(args);

greptile-apps · 2026-01-11T19:17:05Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "is_2D_scaled", py::cast(block_scaling_dim == 2).ptr());
+    PyDict_SetItemString(kwargs, "data_format", py::cast(data_format).ptr());
+
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(Float8BlockwiseQTensorPythonClass),


[P0] Memory leak: PyTuple_New(0) is not decremented (same pattern as line 166).

greptile-apps · 2026-01-11T19:17:06Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "fp8_dtype", py::cast(this->dtype).ptr());
+    PyDict_SetItemString(kwargs, "quantizer", this->quantizer.ptr());
+
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(MXFP8TensorStoragePythonClass),


[P0] Memory leak: PyTuple_New(0) is not decremented (same pattern as line 166).

greptile-apps · 2026-01-11T19:17:08Z

transformer_engine/pytorch/csrc/quantizer.cpp

+    PyDict_SetItemString(kwargs, "fp4_dtype", py::cast(this->dtype).ptr());
+    PyDict_SetItemString(kwargs, "quantizer", this->quantizer.ptr());
+
+    PyObject* result = PyObject_Call(reinterpret_cast<PyObject*>(NVFP4TensorStoragePythonClass),


[P0] Memory leak: PyTuple_New(0) is not decremented (same pattern as line 166).

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps

_{12 files reviewed, 1 comment}

transformer_engine/common/util/cuda_driver.h

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps

_{12 files reviewed, 1 comment}

greptile-apps · 2026-01-18T15:44:32Z

transformer_engine/pytorch/tensor/float8_tensor.py

+            return self._data.shape
+        if self._transpose is not None:
+            transpose_shape = self._transpose.shape
+            return tuple(transpose_shape[1:]) + (transpose_shape[0],)


style: Returns tuple instead of torch.Size when computed from transpose, while self._data.shape returns torch.Size. Consider wrapping with torch.Size() for type consistency:

Suggested change

-            return tuple(transpose_shape[1:]) + (transpose_shape[0],)
+            return torch.Size(tuple(transpose_shape[1:]) + (transpose_shape[0],))
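
The shape reconstruction in this hunk can be sketched without torch, assuming (as the expression implies) that the transposed buffer stores the last logical dimension first. The helper name `shape_from_transpose` is hypothetical.

```python
def shape_from_transpose(transpose_shape):
    """Recover the logical shape from a transpose whose first dim is the logical last dim."""
    # Move the leading dimension to the end: (d_last, d0, d1, ...) -> (d0, d1, ..., d_last).
    return tuple(transpose_shape[1:]) + (transpose_shape[0],)


# A logical (2, 4, 8) tensor would have a transposed buffer of shape (8, 2, 4).
recovered = shape_from_transpose((8, 2, 4))
```

Because `torch.Size` is a subclass of `tuple`, wrapping the result as suggested keeps it comparable with plain tuples while matching the type returned by `self._data.shape`.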

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

vthumbe1503 · 2026-01-20T01:53:55Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-20T02:07:00Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-20T02:08:35Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-20T03:21:31Z

/te-ci L1 pytorch

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…ormerEngine into cpu_fp8_optimizations

vthumbe1503 · 2026-01-20T11:01:45Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-20T11:03:49Z

/te-ci pytorch

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-01-20T16:38:45Z

/te-ci pytorch

for more information, see https://pre-commit.ci

vthumbe1503 · 2026-01-20T17:26:15Z

/te-ci pytorch

… at::empty Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-01-20T18:11:49Z

/te-ci pytorch

vthumbe1503 · 2026-01-20T22:00:16Z

/te-ci L1 pytorch

vthumbe1503 · 2026-01-20T22:42:15Z

/te-ci pytorch

vthumbe1503 and others added 2 commits January 5, 2026 18:11

add all the optimizations

93ee022

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

06338bc

for more information, see https://pre-commit.ci

vthumbe1503 added the cpu_overhead label Jan 6, 2026

vthumbe1503 and others added 4 commits January 6, 2026 12:34

requires_grad optimization

50de9cd

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

5fee841

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'main' into cpu_fp8_optimizations

4c79ac7

[pre-commit.ci] auto fixes from pre-commit.com hooks

62b88e1

for more information, see https://pre-commit.ci

vthumbe1503 added 3 commits January 7, 2026 17:19

test if commenting out requires_grad works

99494d7

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

b157f85

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'main' into cpu_fp8_optimizations

2a7b627

vthumbe1503 marked this pull request as ready for review January 7, 2026 17:22

greptile-apps bot reviewed Jan 7, 2026

vthumbe1503 added 2 commits January 7, 2026 17:58

fix minor bug

b61a6a8

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

938651e

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

greptile-apps bot reviewed Jan 7, 2026

fix ci

88dfdbd

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'main' into cpu_fp8_optimizations

1526eea

greptile-apps bot reviewed Jan 11, 2026

vthumbe1503 and others added 3 commits January 11, 2026 19:12

missed a bug

5809dcc

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

b3bd748

…ormerEngine into cpu_fp8_optimizations Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Update transformer_engine/pytorch/csrc/quantizer.cpp

30fecf2

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

greptile-apps bot reviewed Jan 11, 2026

vthumbe1503 and others added 2 commits January 18, 2026 06:48

Merge branch 'main' into cpu_fp8_optimizations

ca177ae

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1538fd9

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Jan 18, 2026

vthumbe1503 and others added 3 commits January 18, 2026 15:38

address review comment from greptile

710b581

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

fix merge conflixt

8a57a75

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7e4f093

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Jan 18, 2026

vthumbe1503 added 2 commits January 19, 2026 22:19

address review comment + stride optimization

8604b69

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

de44954

…ormerEngine into cpu_fp8_optimizations

vthumbe1503 and others added 2 commits January 19, 2026 22:25

address linter issue

cc50745

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f2e9a5d

for more information, see https://pre-commit.ci

vthumbe1503 added 2 commits January 20, 2026 11:00

minor lint

0d75c3e

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'cpu_fp8_optimizations' of github.com:vthumbe1503/Transf…

3d9f673

…ormerEngine into cpu_fp8_optimizations

fix ci bug

53e8e4e

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

pre-commit-ci bot and others added 2 commits January 20, 2026 16:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

c746abd

for more information, see https://pre-commit.ci

Merge branch 'main' into cpu_fp8_optimizations

9c922f5

another optimization to do at::native::empty_cuda directly instead of…

88b782a

… at::empty Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'main' into cpu_fp8_optimizations

5562cbe
