34 changes: 33 additions & 1 deletion AGENTS.md
@@ -68,7 +68,7 @@ pytest tests/ --tb=short

**Test Markers** (defined in `conftest.py`):
- `@pytest.mark.musa` - Requires MUSA platform
- `@pytest.mark.cuda` - Requires CUDA platform
- `@pytest.mark.gpu` - Requires any GPU
- `@pytest.mark.slow` - Slow tests

@@ -138,6 +138,38 @@ import uuid
lib_name = f"test_lib_{uuid.uuid4().hex[:8]}"
```

## Performance Benchmarking

torchada uses aggressive caching to minimize runtime overhead. Performance is tracked across versions.

**Benchmark files**:
- `benchmarks/benchmark_overhead.py` - Benchmark script
- `benchmarks/benchmark_history.json` - Historical results
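
The history file can be inspected with a few lines of Python. This helper is a sketch (the name `latest_means` and the default path are illustrative); it relies on the `results`/`operations`/`mean_ns` fields of the JSON schema committed in this PR:

```python
import json

# Hypothetical helper: return {operation: mean_ns} for the newest
# recorded run in benchmarks/benchmark_history.json.
def latest_means(path="benchmarks/benchmark_history.json"):
    with open(path) as f:
        history = json.load(f)
    latest = history["results"][-1]
    return {op: stats["mean_ns"] for op, stats in latest["operations"].items()}
```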

**Running benchmarks**:
```bash
# Run benchmarks (print only)
docker exec -w /ws yeahdongcn1 python benchmarks/benchmark_overhead.py

# Run and save results to history (do this before releasing new versions)
docker exec -w /ws yeahdongcn1 python benchmarks/benchmark_overhead.py --save
```

**Performance targets**:
- Fast operations (<200ns): `torch.cuda.device_count()`, `torch.cuda.Stream`, `torch.cuda.Event`, `_translate_device()`, `torch.backends.cuda.is_built()`
- Medium operations (200-800ns): Operations with inherent costs (runtime calls, object creation) that cannot be optimized further

**When to run benchmarks**:
1. After adding new patches that affect hot paths
2. Before releasing a new version (use `--save` to record results)
3. When optimizing existing patches

**Optimization techniques used**:
- Attribute caching in `__dict__` to bypass `__getattr__` on subsequent accesses
- Platform check caching (global variable `_is_musa_platform_cached`)
- String translation caching (`_device_str_cache`)
- Closure variable caching for wrapper functions
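
The first technique can be sketched in isolation. This is a toy proxy, not torchada's actual class: the expensive resolution in `__getattr__` runs once, after which Python finds the attribute in the instance `__dict__` and never calls `__getattr__` for that name again.

```python
# Toy illustration of __dict__ attribute caching (not torchada's real
# implementation).
class CachingProxy:
    def __getattr__(self, name):
        value = self._resolve(name)   # expensive lookup, first access only
        self.__dict__[name] = value   # cache on the instance
        return value

    def _resolve(self, name):
        return f"patched:{name}"

proxy = CachingProxy()
proxy.Stream  # resolved via __getattr__
proxy.Stream  # served directly from __dict__, no __getattr__ call
```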

## Security Considerations

- All patches are applied at import time via `apply_patches()`
18 changes: 17 additions & 1 deletion README.md
@@ -180,6 +180,22 @@ def is_musa():
return hasattr(torch.version, 'musa') and torch.version.musa is not None
```

## Performance

torchada uses aggressive caching to minimize runtime overhead. The most frequently called operations complete in under 200 nanoseconds:

| Operation | Overhead |
|-----------|----------|
| `torch.cuda.device_count()` | ~140ns |
| `torch.cuda.Stream` (attribute access) | ~130ns |
| `torch.cuda.Event` (attribute access) | ~130ns |
| `_translate_device('cuda')` | ~140ns |
| `torch.backends.cuda.is_built()` | ~155ns |

For comparison, a typical GPU kernel launch takes 5,000-20,000ns. The patching overhead is negligible for real-world applications.

Operations with inherent costs (runtime calls, object creation) take 300-600ns but cannot be optimized further without changing behavior.
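
Per-call overheads at this scale can be reproduced with `timeit`. The snippet below is illustrative — the dictionary lookup stands in for whichever operation you want to time:

```python
import timeit

# Run a cheap operation many times and report nanoseconds per call.
n = 1_000_000
seconds = timeit.timeit("cache.get('cuda:0')",
                        setup="cache = {'cuda:0': 'musa:0'}",
                        number=n)
print(f"{seconds / n * 1e9:.0f} ns per call")
```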

## Known Limitation

**Device type string comparisons fail on MUSA:**
@@ -238,7 +254,7 @@ See `src/torchada/_mapping.py` for the complete mapping table (380+ mappings).

```
# pyproject.toml or requirements.txt
-torchada>=0.1.26
+torchada>=0.1.27
```

### Step 2: Conditional Import
18 changes: 17 additions & 1 deletion README_CN.md
@@ -180,6 +180,22 @@ def is_musa():
return hasattr(torch.version, 'musa') and torch.version.musa is not None
```

## Performance

torchada uses aggressive caching to minimize runtime overhead. The most frequently called operations complete in under 200 nanoseconds:

| Operation | Overhead |
|-----------|----------|
| `torch.cuda.device_count()` | ~140ns |
| `torch.cuda.Stream` (attribute access) | ~130ns |
| `torch.cuda.Event` (attribute access) | ~130ns |
| `_translate_device('cuda')` | ~140ns |
| `torch.backends.cuda.is_built()` | ~155ns |

For comparison, a typical GPU kernel launch takes 5,000-20,000ns. The patching overhead is negligible for real-world applications.

Operations with inherent costs (runtime calls, object creation) take 300-600ns but cannot be optimized further without changing behavior.

## Known Limitation

**Device type string comparisons fail on MUSA:**
@@ -238,7 +254,7 @@ if torchada.is_gpu_device(device):  # works on both CUDA and MUSA

```
# pyproject.toml or requirements.txt
-torchada>=0.1.26
+torchada>=0.1.27
```

### Step 2: Conditional Import
85 changes: 85 additions & 0 deletions benchmarks/benchmark_history.json
@@ -0,0 +1,85 @@
{
"schema_version": 1,
"description": "Historical benchmark results for torchada performance tracking",
"results": [
{
"version": "0.1.27",
"date": "2026-01-29",
"platform": "MUSA",
"pytorch_version": "2.7.1",
"torch_musa_version": "2.7.1+5ee0a64",
"operations": {
"torch.cuda.device_count()": {
"mean_ns": 138,
"median_ns": 136,
"min_ns": 125
},
"torch.cuda.current_device()": {
"mean_ns": 428,
"median_ns": 423,
"min_ns": 391
},
"torch.cuda.is_available() [NOT redirected]": {
"mean_ns": 512,
"median_ns": 508,
"min_ns": 465
},
"torch.cuda.Stream (attr)": {
"mean_ns": 123,
"median_ns": 121,
"min_ns": 112
},
"torch.cuda.Event (attr)": {
"mean_ns": 124,
"median_ns": 122,
"min_ns": 113
},
"cudart.cudaHostRegister (attr)": {
"mean_ns": 81,
"median_ns": 80,
"min_ns": 74
},
"torch.device('cuda')": {
"mean_ns": 595,
"median_ns": 592,
"min_ns": 543
},
"torch.device('cuda:0')": {
"mean_ns": 616,
"median_ns": 615,
"min_ns": 556
},
"torch.device('cuda', 0)": {
"mean_ns": 612,
"median_ns": 609,
"min_ns": 558
},
"cpu_tensor.is_cuda (property)": {
"mean_ns": 343,
"median_ns": 337,
"min_ns": 310
},
"_translate_device('cuda')": {
"mean_ns": 142,
"median_ns": 139,
"min_ns": 122
},
"_translate_device('cuda:0')": {
"mean_ns": 142,
"median_ns": 139,
"min_ns": 125
},
"torch.backends.cuda.is_built()": {
"mean_ns": 160,
"median_ns": 159,
"min_ns": 142
}
},
"summary": {
"fast_ops_count": 7,
"medium_ops_count": 6,
"notes": ""
}
}
]
}