34 changes: 33 additions & 1 deletion AGENTS.md
@@ -68,7 +68,7 @@ pytest tests/ --tb=short

**Test Markers** (defined in `conftest.py`):
- `@pytest.mark.musa` - Requires MUSA platform
- `@pytest.mark.cuda` - Requires CUDA platform
- `@pytest.mark.gpu` - Requires any GPU
- `@pytest.mark.slow` - Slow tests

@@ -138,6 +138,38 @@ import uuid
lib_name = f"test_lib_{uuid.uuid4().hex[:8]}"
```

## Performance Benchmarking

torchada uses aggressive caching to minimize runtime overhead. Performance is tracked across versions.

**Benchmark files**:
- `benchmarks/benchmark_overhead.py` - Benchmark script
- `benchmarks/benchmark_history.json` - Historical results
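
The history file can be inspected with a few lines of Python. This helper is a sketch (the name `latest_means` and the default path are illustrative); it relies on the `results`/`operations`/`mean_ns` fields of the JSON schema committed in this PR:

```python
import json

# Hypothetical helper: return {operation: mean_ns} for the newest
# recorded run in benchmarks/benchmark_history.json.
def latest_means(path="benchmarks/benchmark_history.json"):
    with open(path) as f:
        history = json.load(f)
    latest = history["results"][-1]
    return {op: stats["mean_ns"] for op, stats in latest["operations"].items()}
```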

**Running benchmarks**:
```bash
# Run benchmarks (print only)
docker exec -w /ws yeahdongcn1 python benchmarks/benchmark_overhead.py

# Run and save results to history (do this before releasing new versions)
docker exec -w /ws yeahdongcn1 python benchmarks/benchmark_overhead.py --save
```

**Performance targets**:
- Fast operations (<200ns): `torch.cuda.device_count()`, `torch.cuda.Stream`, `torch.cuda.Event`, `_translate_device()`, `torch.backends.cuda.is_built()`
- Medium operations (200-800ns): Operations with inherent costs (runtime calls, object creation) that cannot be optimized further

**When to run benchmarks**:
1. After adding new patches that affect hot paths
2. Before releasing a new version (use `--save` to record results)
3. When optimizing existing patches

**Optimization techniques used**:
- Attribute caching in `__dict__` to bypass `__getattr__` on subsequent accesses
- Platform check caching (global variable `_is_musa_platform_cached`)
- String translation caching (`_device_str_cache`)
- Closure variable caching for wrapper functions
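
The first technique can be sketched in isolation. This is a toy proxy, not torchada's actual class: the expensive resolution in `__getattr__` runs once, after which Python finds the attribute in the instance `__dict__` and never calls `__getattr__` for that name again.

```python
# Toy illustration of __dict__ attribute caching (not torchada's real
# implementation).
class CachingProxy:
    def __getattr__(self, name):
        value = self._resolve(name)   # expensive lookup, first access only
        self.__dict__[name] = value   # cache on the instance
        return value

    def _resolve(self, name):
        return f"patched:{name}"

proxy = CachingProxy()
proxy.Stream  # resolved via __getattr__
proxy.Stream  # served directly from __dict__, no __getattr__ call
```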

## Security Considerations

- All patches are applied at import time via `apply_patches()`
18 changes: 17 additions & 1 deletion README.md
@@ -180,6 +180,22 @@ def is_musa():
return hasattr(torch.version, 'musa') and torch.version.musa is not None
```

## Performance

torchada uses aggressive caching to minimize runtime overhead. The most frequently called operations complete in under 200 nanoseconds:

| Operation | Overhead |
|-----------|----------|
| `torch.cuda.device_count()` | ~140ns |
| `torch.cuda.Stream` (attribute access) | ~130ns |
| `torch.cuda.Event` (attribute access) | ~130ns |
| `_translate_device('cuda')` | ~140ns |
| `torch.backends.cuda.is_built()` | ~155ns |

For comparison, a typical GPU kernel launch takes 5,000-20,000ns. The patching overhead is negligible for real-world applications.

Operations with inherent costs (runtime calls, object creation) take 300-600ns but cannot be optimized further without changing behavior.
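
Per-call overheads at this scale can be reproduced with `timeit`. The snippet below is illustrative — the dictionary lookup stands in for whichever operation you want to time:

```python
import timeit

# Run a cheap operation many times and report nanoseconds per call.
n = 1_000_000
seconds = timeit.timeit("cache.get('cuda:0')",
                        setup="cache = {'cuda:0': 'musa:0'}",
                        number=n)
print(f"{seconds / n * 1e9:.0f} ns per call")
```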

## Known Limitation

**Device type string comparisons fail on MUSA:**
@@ -238,7 +254,7 @@ See `src/torchada/_mapping.py` for the complete mapping table (380+ mappings).

```
# pyproject.toml or requirements.txt
-torchada>=0.1.26
+torchada>=0.1.27
```

### Step 2: Conditional Import
18 changes: 17 additions & 1 deletion README_CN.md
@@ -180,6 +180,22 @@ def is_musa():
return hasattr(torch.version, 'musa') and torch.version.musa is not None
```

## Performance

torchada uses aggressive caching to minimize runtime overhead. The most frequently called operations complete in under 200 nanoseconds:

| Operation | Overhead |
|-----------|----------|
| `torch.cuda.device_count()` | ~140ns |
| `torch.cuda.Stream` (attribute access) | ~130ns |
| `torch.cuda.Event` (attribute access) | ~130ns |
| `_translate_device('cuda')` | ~140ns |
| `torch.backends.cuda.is_built()` | ~155ns |

For comparison, a typical GPU kernel launch takes 5,000-20,000ns. The patching overhead is negligible for real-world applications.

Operations with inherent costs (runtime calls, object creation) take 300-600ns but cannot be optimized further without changing behavior.

## Known Limitation

**Device type string comparisons fail on MUSA:**
@@ -238,7 +254,7 @@ if torchada.is_gpu_device(device):  # works on both CUDA and MUSA

```
# pyproject.toml or requirements.txt
-torchada>=0.1.26
+torchada>=0.1.27
```

### Step 2: Conditional Import
85 changes: 85 additions & 0 deletions benchmarks/benchmark_history.json
@@ -0,0 +1,85 @@
{
"schema_version": 1,
"description": "Historical benchmark results for torchada performance tracking",
"results": [
{
"version": "0.1.27",
"date": "2026-01-29",
"platform": "MUSA",
"pytorch_version": "2.7.1",
"torch_musa_version": "2.7.1+5ee0a64",
"operations": {
"torch.cuda.device_count()": {
"mean_ns": 138,
"median_ns": 136,
"min_ns": 125
},
"torch.cuda.current_device()": {
"mean_ns": 428,
"median_ns": 423,
"min_ns": 391
},
"torch.cuda.is_available() [NOT redirected]": {
"mean_ns": 512,
"median_ns": 508,
"min_ns": 465
},
"torch.cuda.Stream (attr)": {
"mean_ns": 123,
"median_ns": 121,
"min_ns": 112
},
"torch.cuda.Event (attr)": {
"mean_ns": 124,
"median_ns": 122,
"min_ns": 113
},
"cudart.cudaHostRegister (attr)": {
"mean_ns": 81,
"median_ns": 80,
"min_ns": 74
},
"torch.device('cuda')": {
"mean_ns": 595,
"median_ns": 592,
"min_ns": 543
},
"torch.device('cuda:0')": {
"mean_ns": 616,
"median_ns": 615,
"min_ns": 556
},
"torch.device('cuda', 0)": {
"mean_ns": 612,
"median_ns": 609,
"min_ns": 558
},
"cpu_tensor.is_cuda (property)": {
"mean_ns": 343,
"median_ns": 337,
"min_ns": 310
},
"_translate_device('cuda')": {
"mean_ns": 142,
"median_ns": 139,
"min_ns": 122
},
"_translate_device('cuda:0')": {
"mean_ns": 142,
"median_ns": 139,
"min_ns": 125
},
"torch.backends.cuda.is_built()": {
"mean_ns": 160,
"median_ns": 159,
"min_ns": 142
}
},
"summary": {
"fast_ops_count": 7,
"medium_ops_count": 6,
"notes": ""
}
}
]
}