
Conversation

@luohua13 luohua13 (Contributor) commented Jan 19, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive installation guide for NVIDIA DRA Driver on ACP clusters with verification steps
    • Added configuration instructions for enabling CDI support in containerd
    • Added deployment guide for enabling Dynamic Resource Allocation in Kubernetes 1.32–1.33
    • Added introductory documentation covering DRA concepts and architecture
    • Added how-to guides for GPU workload deployment using ResourceClaimTemplates


@coderabbitai coderabbitai bot commented Jan 19, 2026

Walkthrough

This PR adds comprehensive documentation for the Alauda Build of NVIDIA DRA Driver for GPUs, including introduction, installation guide, and how-to guides for enabling CDI in containerd and DRA API groups in Kubernetes clusters.

Changes

Cohort / File(s) Summary
Core Documentation Pages
docs/en/infrastructure_management/device_management/pgpu_dra/index.mdx, docs/en/infrastructure_management/device_management/pgpu_dra/intro.mdx
Introduces the main landing page and a DRA concept overview explaining fine-grained device selection based on device attributes and capabilities
Installation & Setup
docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
Comprehensive installation guide covering prerequisites (NVIDIA driver, Kubernetes, CDI, DRA API groups), step-by-step installation procedures, verification steps, workload deployment using ResourceClaimTemplates, and sample YAML examples
How-To Guides
docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx, docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx, docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx
Provides the how-to index page, CDI enablement instructions for containerd v1.7.x (config.toml modifications and verification steps), and DRA API group enablement for Kubernetes 1.32–1.33 with feature-gate configuration

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • zhaomingkun1030
  • typhoonzero

Poem

🐰 A GPU tale now written with care,
DRA guides floating through the air,
From CDI to Kubernetes bliss,
Installation steps we couldn't miss! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Add dra' is vague and non-descriptive. While it references a real part of the changeset (DRA documentation), it lacks meaningful context about what DRA is or what aspect is being added. Consider expanding the title to be more descriptive, such as 'Add documentation for NVIDIA DRA Driver installation and configuration' to clarify the scope and purpose of the changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In
`@docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx`:
- Around line 1-34: The verification step uses an unreliable grep on journalctl
("journalctl -u containerd | grep \"EnableCDI:true\""); replace it with a
structured-journal query to reliably find CDI logs by message content: update
the verification command to use "journalctl -u containerd.service -o json | jq
'select(.MESSAGE|test(\"EnableCDI\";\"i\"))'" (or equivalent json+jq filter) so
log entries are filtered case-insensitively and from structured output; modify
the line that contains the original grep command in the document to this new
command.

In
`@docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx`:
- Line 45: Fix the grammatical error in the sentence that reads "For kubelet,
edit `/var/lib/kubelet/config.yaml` on the all nodes:" by removing the extra
article — change it to "For kubelet, edit `/var/lib/kubelet/config.yaml` on all
nodes:" so the phrase "on all nodes" is used correctly.
- Line 12: Replace the incorrect phrase "On the all master nodes" with the
corrected phrase "On all master nodes" in the document (look for the exact
string "On the all master nodes" in the content and update it to "On all master
nodes").
- Around line 19-21: The three separate flags for runtime-config are incorrect;
replace the multiple --runtime-config lines with a single comma-separated flag
like --runtime-config=resource.k8s.io/v1beta1,resource.k8s.io/v1beta2 and keep
the --feature-gates=DynamicResourceAllocation=true line as-is so the controller
sees both API versions via one --runtime-config entry.

In `@docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx`:
- Line 22: Replace the phrase "Prefer to [Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
with "Refer to [Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
in the docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
content so the wording is corrected; locate the exact string "Prefer to
[Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
and change "Prefer to" to "Refer to".
- Line 19: Update the phrasing in the sentence that currently reads "Prefer to
[Installation guide of Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)" to use
"Refer to" instead of "Prefer to" so it reads "Refer to [Installation guide of
Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)"; edit the
text in docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
replacing the exact phrase "Prefer to [Installation guide of Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)" with the
corrected wording.
- Around line 170-172: Replace the incorrect kubectl command string "kubectl
logs pod -n dra-gpu-workload -f" with the proper syntax: use the pod name first
then the namespace flag (or omit the namespace if using default). Update the
line so it reads either "kubectl logs dra-gpu-workload -f" (since the manifest
uses the default namespace) or "kubectl logs dra-gpu-workload -n default -f" to
be explicit.
🧹 Nitpick comments (1)
docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx (1)

3-6: Consider providing a Chinese translation for the zh title.

The zh (Chinese) title is currently set to "How To" which is English. For proper localization, consider using a Chinese translation such as "操作指南" or "使用指南".

💡 Suggested fix
 i18n:
-title:
-en: How To
-zh: How To
+  title:
+    en: How To
+    zh: 操作指南

Comment on lines +1 to +34
---
weight: 20
---

# Enable CDI in Containerd

CDI (Container Device Interface) provides a standard mechanism for device vendors to describe what is required to provide access to a specific resource such as a GPU beyond a simple device name.

CDI support is enabled by default in containerd version 2.0 and later. Earlier versions, starting from 1.7.0, support for this feature requires manual activation.

## Steps to Enable CDI in containerd v1.7.x

1. Update containerd configuration.
Edit the configuration file:
```bash
vi /etc/containerd/config.toml
```
Add or modify the following section:
```toml
[plugins."io.containerd.grpc.v1.cri"]
enable_cdi = true
```
2. Restart containerd.
```bash
systemctl restart containerd
systemctl status containerd
```
Ensure the service is running correctly.

3. Verify CDI is Enabled.
```bash
journalctl -u containerd | grep "EnableCDI:true"
```
Wait a moment, if there are logs, it means the setup was successful.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

containerd CDI EnableCDI log output format journalctl

💡 Result:

If you enable CDI in containerd (plugins."io.containerd.grpc.v1.cri" enable_cdi = true), containerd logs remain in the system journal (systemd/journald). To view containerd/CDI-related messages:

  • Restart containerd after enabling CDI in /etc/containerd/config.toml. [1]
  • Show containerd service logs (human): sudo journalctl -u containerd.service -f. [2]
  • Show full structured entries (JSON) to see fields (MESSAGE, SYSLOG_IDENTIFIER, _PID, etc.): sudo journalctl -u containerd.service -o json. [2]
  • Filter for CDI-specific text (e.g., “cdi” or “EnableCDI”): sudo journalctl -u containerd.service -o json | jq 'select(.MESSAGE|test("cdi";"i"))'. [2]

Notes:

  • containerd’s CDI config keys: enable_cdi and cdi_spec_dirs (e.g. /etc/cdi, /var/run/cdi). Restart required. [1]
  • Use -f to follow live logs, -b to limit to current boot, and -o json (or -o json-pretty) to get machine-readable output suitable for parsing. [2]

Sources:
[1] containerd / CDI configuration (enable_cdi, cdi_spec_dirs).
[2] journald / journalctl usage and -o json output.



LGTM! Clear and well-structured CDI enablement guide.

The documentation is well-organized with clear steps for enabling CDI in containerd v1.7.x. However, improve the verification command on line 32: instead of journalctl -u containerd | grep "EnableCDI:true", use the more robust approach journalctl -u containerd.service -o json | jq 'select(.MESSAGE|test("EnableCDI";"i"))' to reliably filter CDI-related logs from structured journal output.

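The recommended structured-journal check can be sketched as follows. The two JSON records below are illustrative stand-ins for real journald output (on an actual node you would pipe `journalctl -u containerd.service -o json` instead), and `jq` must be installed:

```shell
# Real node: journalctl -u containerd.service -o json | jq 'select(.MESSAGE|test("EnableCDI";"i"))'
# Demo with fake journald JSON records so the filter's behavior is visible:
printf '%s\n' \
  '{"MESSAGE":"starting containerd","_PID":"812"}' \
  '{"MESSAGE":"CRI config: EnableCDI:true CDISpecDirs:[/etc/cdi /var/run/cdi]","_PID":"812"}' |
  jq -c 'select(.MESSAGE|test("EnableCDI";"i")) | .MESSAGE'
# Only the record whose MESSAGE matches "EnableCDI" (case-insensitive) is printed.
```

The `test(...; "i")` regex flag is what makes this more robust than a literal `grep "EnableCDI:true"`, which silently misses casing or formatting variations in the log line.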


## Steps to Enable DRA in Kubernetes 1.32–1.33

On the all master nodes:

⚠️ Potential issue | 🟡 Minor

Fix grammatical error.

"On the all master nodes" should be "On all master nodes".

📝 Suggested fix
-On the all master nodes:
+On all master nodes:

Comment on lines +19 to +21
- --feature-gates=DynamicResourceAllocation=true # required
- --runtime-config=resource.k8s.io/v1beta1 # required
- --runtime-config=resource.k8s.io/v1beta2 # required

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

kubernetes kube-apiserver --runtime-config flag syntax multiple API groups

💡 Result:

Use comma-separated key=value pairs; later entries override earlier ones. Keys are either core v1 ("v1") or group/version ("group/version"), or special keys like api/all, api/ga, api/beta, api/alpha. Examples:

  • Enable apps/v1 and batch/v1: --runtime-config=apps/v1=true,batch/v1=true.
  • Disable an API then re-enable v1 only: --runtime-config=api/all=false,api/v1=true.
  • Disable a specific resource version: --runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/jobs=false.

See kube-apiserver flag reference and "Enable or Disable a Kubernetes API" docs for details. [1][2]

Sources
[1] kube-apiserver --runtime-config docs. [2] Enable or Disable a Kubernetes API (kubernetes.io).


🌐 Web query:

kube-apiserver runtime-config comma separated values multiple

💡 Result:

Yes — --runtime-config accepts multiple comma-separated key[=value] entries. Use API paths (e.g. group/version) or special keys (api/all, api/legacy); later entries override earlier ones. Example:
--runtime-config=api/all=false,api/v1=true,batch/v2alpha1,extensions/v1beta1/deployments=false. [1][2]

Sources:
[1] Kubernetes: Enable Or Disable A Kubernetes API.
[2] kube-apiserver command-line reference.


Consolidate --runtime-config flags into a single comma-separated entry.

Per Kubernetes documentation, --runtime-config accepts multiple API groups only when comma-separated within a single flag. The current configuration uses multiple separate flags, which will not work as intended.

Correct syntax
         - --feature-gates=DynamicResourceAllocation=true # required
-        - --runtime-config=resource.k8s.io/v1beta1 # required
-        - --runtime-config=resource.k8s.io/v1beta2 # required
+        - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true # required
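Applied to a kubeadm-style static pod manifest (the path `/etc/kubernetes/manifests/kube-apiserver.yaml` is assumed here), the consolidated flag would look roughly like this sketch; kubelet restarts the static pod automatically when the file changes:

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout assumed).
# Both API versions share ONE comma-separated --runtime-config flag.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --feature-gates=DynamicResourceAllocation=true
        - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true
        # ... other flags unchanged
```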

# ... other flags
```

4. For kubelet, edit `/var/lib/kubelet/config.yaml` on the all nodes:

⚠️ Potential issue | 🟡 Minor

Fix grammatical error.

"on the all nodes" should be "on all nodes".

📝 Suggested fix
-4. For kubelet, edit `/var/lib/kubelet/config.yaml` on the all nodes:
+4. For kubelet, edit `/var/lib/kubelet/config.yaml` on all nodes:
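For reference, a minimal sketch of the kubelet side. Only the `featureGates` stanza is shown; the other fields of `/var/lib/kubelet/config.yaml` are elided, and kubelet must be restarted afterwards:

```yaml
# Fragment of /var/lib/kubelet/config.yaml on every node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
# Then: systemctl restart kubelet
```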

## Procedure

### Installing Nvidia driver in your gpu node
Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)

⚠️ Potential issue | 🟡 Minor

Fix phrasing: "Prefer to" → "Refer to".

"Prefer to" is grammatically incorrect in this context.

📝 Suggested fix
-Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
+Refer to the [NVIDIA Official Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)

Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)

### Installing Nvidia Container Runtime
Prefer to [Installation guide of Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

⚠️ Potential issue | 🟡 Minor

Fix phrasing: "Prefer to" → "Refer to".

Same issue as above.

📝 Suggested fix
-Prefer to [Installation guide of Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+Refer to the [NVIDIA Container Toolkit Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

Comment on lines +170 to +172
```bash
kubectl logs pod -n dra-gpu-workload -f
```

⚠️ Potential issue | 🔴 Critical

Incorrect kubectl logs command syntax.

The command kubectl logs pod -n dra-gpu-workload is incorrect. The -n flag specifies the namespace, not the pod name. Since the pod is named dra-gpu-workload and no namespace is specified in the manifest (defaults to default), the command should be:

🐛 Proposed fix
-    kubectl logs pod -n dra-gpu-workload -f
+    kubectl logs dra-gpu-workload -f

@cloudflare-workers-and-pages

Deploying alauda-ai with Cloudflare Pages

Latest commit: cb08806
Status: ✅  Deploy successful!
Preview URL: https://93ced307.alauda-ai.pages.dev
Branch Preview URL: https://add-dra.alauda-ai.pages.dev
