
Conversation

@yeahdongcn
Collaborator

Description

This PR adds caching optimizations to reduce the overhead of torchada's runtime patching for frequently accessed attributes.

Changes

  • Add attribute caching to _CudaModuleWrapper for torch.cuda.* access (the caching pattern is sketched after this list)
  • Add attribute caching to _CudartWrapper for torch.cuda.cudart() function lookups
  • Add attribute caching to _CDLLWrapper for ctypes function name translation
  • Cache torch.backends.cuda.is_built() result (constant at runtime)
  • Add string translation cache to _translate_device() for common device strings
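
A minimal sketch of the caching pattern behind these changes, assuming a `__getattr__`-based proxy. The class and function names follow the list above, but the bodies, the `lru_cache` usage, and the cuda-to-musa string mapping are illustrative, not the PR's actual code:

```python
import functools

import torch


class _CudaModuleWrapper:
    """Illustrative proxy that memoizes resolved attributes of a wrapped module."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._attr_cache = {}

    def __getattr__(self, name):
        # __getattr__ only fires for names not found on the wrapper itself,
        # so the _wrapped/_attr_cache lookups below never recurse.
        try:
            return self._attr_cache[name]          # fast path: dict hit
        except KeyError:
            value = getattr(self._wrapped, name)   # slow path: resolve once
            self._attr_cache[name] = value
            return value


# is_built() is constant at runtime, so compute it at most once.
@functools.lru_cache(maxsize=1)
def _cuda_is_built():
    return torch.backends.cuda.is_built()


# Memoize common device-string translations ("cuda", "cuda:0", ...).
@functools.lru_cache(maxsize=None)
def _translate_device(device_str: str) -> str:
    return device_str.replace("cuda", "musa", 1)  # illustrative mapping
```

Note that a cache hit still pays one dict lookup inside `__getattr__`, which is consistent with the cached timings in the table below landing around 100-150 ns rather than at plain attribute-access speed.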

Performance Improvements

| Operation | Before | After | Speedup |
|---|---|---|---|
| torch.cuda.Stream (attr) | 842 ns | 131 ns | 6.4x |
| torch.cuda.device_count() | 855 ns | 147 ns | 5.8x |
| cudart.cudaHostRegister | 385 ns | 84 ns | 4.6x |
| torch.backends.cuda.is_built() | 301 ns | 154 ns | 2.0x |

All patched operations now complete in under 700 ns, which is negligible compared to typical GPU kernel launch times (5,000-20,000 ns).
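
For reference, a micro-benchmark along these lines reproduces the kind of numbers in the table; the snippet is illustrative, and absolute timings depend on the machine and PyTorch build:

```python
import timeit

import torch

N = 1_000_000

# Attribute access through the (patched) torch.cuda module.
t = timeit.timeit(lambda: torch.cuda.Stream, number=N)
print(f"torch.cuda.Stream (attr): {t / N * 1e9:.0f} ns")

# Cached constant lookup.
t = timeit.timeit(torch.backends.cuda.is_built, number=N)
print(f"torch.backends.cuda.is_built(): {t / N * 1e9:.0f} ns")
```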

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn yeahdongcn requested a review from yafengio January 29, 2026 09:06
@yeahdongcn
Collaborator Author

Also tested together with SGLang; everything works as expected.

@yeahdongcn yeahdongcn merged commit 09d0e02 into main Jan 29, 2026