Add dra #81

base: master

Conversation
Walkthrough

This PR adds comprehensive documentation for the Alauda Build of NVIDIA DRA Driver for GPUs, including an introduction, an installation guide, and how-to guides for enabling CDI in containerd and DRA API groups in Kubernetes clusters.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In
`@docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx`:
- Around line 1-34: The verification step uses an unreliable grep on journalctl
("journalctl -u containerd | grep \"EnableCDI:true\""); replace it with a
structured-journal query to reliably find CDI logs by message content: update
the verification command to use "journalctl -u containerd.service -o json | jq
'select(.MESSAGE|test(\"EnableCDI\";\"i\"))'" (or equivalent json+jq filter) so
log entries are filtered case-insensitively and from structured output; modify
the line that contains the original grep command in the document to this new
command.
In
`@docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx`:
- Line 45: Fix the grammatical error in the sentence that reads "For kubelet,
edit `/var/lib/kubelet/config.yaml` on the all nodes:" by removing the extra
article — change it to "For kubelet, edit `/var/lib/kubelet/config.yaml` on all
nodes:" so the phrase "on all nodes" is used correctly.
- Line 12: Replace the incorrect phrase "On the all master nodes" with the
corrected phrase "On all master nodes" in the document (look for the exact
string "On the all master nodes" in the content and update it to "On all master
nodes").
- Around line 19-21: The three separate flags for runtime-config are incorrect;
replace the multiple --runtime-config lines with a single comma-separated flag
like --runtime-config=resource.k8s.io/v1beta1,resource.k8s.io/v1beta2 and keep
the --feature-gates=DynamicResourceAllocation=true line as-is so the controller
sees both API versions via one --runtime-config entry.
In `@docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx`:
- Line 22: Replace the phrase "Prefer to [Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
with "Refer to [Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
in the docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
content so the wording is corrected; locate the exact string "Prefer to
[Installation guide of Nvidia Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)"
and change "Prefer to" to "Refer to".
- Line 19: Update the phrasing in the sentence that currently reads "Prefer to
[Installation guide of Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)" to use
"Refer to" instead of "Prefer to" so it reads "Refer to [Installation guide of
Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)"; edit the
text in docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
replacing the exact phrase "Prefer to [Installation guide of Nvidia Official
website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)" with the
corrected wording.
- Around line 170-172: Replace the incorrect kubectl command string "kubectl
logs pod -n dra-gpu-workload -f" with the proper syntax: use the pod name first
then the namespace flag (or omit the namespace if using default). Update the
line so it reads either "kubectl logs dra-gpu-workload -f" (since the manifest
uses the default namespace) or "kubectl logs dra-gpu-workload -n default -f" to
be explicit.
🧹 Nitpick comments (1)
docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx (1)
3-6: Consider providing a Chinese translation for the `zh` title. The `zh` (Chinese) title is currently set to "How To", which is English. For proper localization, consider using a Chinese translation such as "操作指南" or "使用指南".

💡 Suggested fix

```diff
 i18n:
   title:
     en: How To
-    zh: How To
+    zh: 操作指南
```
Reviewed file excerpt (`cdi_enable_containerd.mdx`):

````mdx
---
weight: 20
---

# Enable CDI in Containerd

CDI (Container Device Interface) provides a standard mechanism for device vendors to describe what is required to provide access to a specific resource such as a GPU beyond a simple device name.

CDI support is enabled by default in containerd version 2.0 and later. In earlier versions, starting from 1.7.0, this feature requires manual activation.

## Steps to Enable CDI in containerd v1.7.x

1. Update containerd configuration.
   Edit the configuration file:
   ```bash
   vi /etc/containerd/config.toml
   ```
   Add or modify the following section:
   ```toml
   [plugins."io.containerd.grpc.v1.cri"]
     enable_cdi = true
   ```
2. Restart containerd.
   ```bash
   systemctl restart containerd
   systemctl status containerd
   ```
   Ensure the service is running correctly.
3. Verify CDI is enabled.
   ```bash
   journalctl -u containerd | grep "EnableCDI:true"
   ```
   Wait a moment; if matching log entries appear, the setup was successful.
````
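Before restarting containerd, the `enable_cdi` key can also be sanity-checked straight from the shell. A minimal sketch — it writes a sample config under `/tmp` purely for illustration; on a real node you would grep `/etc/containerd/config.toml` directly:

```shell
# Create a sample config under /tmp for illustration only;
# on a real node, check /etc/containerd/config.toml instead.
cat > /tmp/sample-containerd-config.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
EOF

# Tolerate optional whitespace around '=' when checking the key.
if grep -Eq 'enable_cdi[[:space:]]*=[[:space:]]*true' /tmp/sample-containerd-config.toml; then
  echo "enable_cdi is set"
else
  echo "enable_cdi is missing" >&2
fi
```

A zero exit status from the `grep` confirms the key is present before you pay the cost of a service restart.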
🧩 Analysis chain
🌐 Web query:
containerd CDI EnableCDI log output format journalctl
💡 Result:
If you enable CDI in containerd (plugins."io.containerd.grpc.v1.cri" enable_cdi = true), containerd logs remain in the system journal (systemd/journald). To view containerd/CDI-related messages:
- Restart containerd after enabling CDI in /etc/containerd/config.toml. [1]
- Show containerd service logs (human): sudo journalctl -u containerd.service -f. [2]
- Show full structured entries (JSON) to see fields (MESSAGE, SYSLOG_IDENTIFIER, _PID, etc.): sudo journalctl -u containerd.service -o json. [2]
- Filter for CDI-specific text (e.g., “cdi” or “EnableCDI”): sudo journalctl -u containerd.service -o json | jq 'select(.MESSAGE|test("cdi";"i"))'. [2]
Notes:
- containerd’s CDI config keys: enable_cdi and cdi_spec_dirs (e.g. /etc/cdi, /var/run/cdi). Restart required. [1]
- Use -f to follow live logs, -b to limit to current boot, and -o json (or -o json-pretty) to get machine-readable output suitable for parsing. [2]
Sources:
[1] containerd / CDI configuration (enable_cdi, cdi_spec_dirs).
[2] journald / journalctl usage and -o json output.
🌐 Web query:
containerd version 1.7.0 CDI enable_cdi log messages
💡 Result:
I can help — do you want (pick one):
- an explanation of the containerd 1.7.0 enable_cdi configuration and how it behaves (where to set enable_cdi and cdi_spec_dirs)?
- typical containerd log messages produced when CDI is enabled/used?
- help diagnosing specific log lines you’re seeing (paste the exact log lines)?
Background: containerd exposes enable_cdi and cdi_spec_dirs in the CRI plugin config (plugins."io.containerd.grpc.v1.cri"). Set enable_cdi = true and list CDI spec dirs (e.g. /etc/cdi, /var/run/cdi) and restart containerd. [1][2].
LGTM! Clear and well-structured CDI enablement guide.
The documentation is well-organized with clear steps for enabling CDI in containerd v1.7.x. However, improve the verification command on line 32: instead of journalctl -u containerd | grep "EnableCDI:true", use the more robust approach journalctl -u containerd.service -o json | jq 'select(.MESSAGE|test("EnableCDI";"i"))' to reliably filter CDI-related logs from structured journal output.
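The suggested filter can be tried offline against hand-made JSON lines before running it on a live node. In this sketch the `MESSAGE` strings are fabricated for illustration; real entries come from `journalctl -u containerd.service -o json`, and `jq` must be installed:

```shell
# Two fake journal entries; only the one mentioning EnableCDI
# should survive the case-insensitive jq filter.
printf '%s\n' \
  '{"MESSAGE":"starting containerd"}' \
  '{"MESSAGE":"CRI config: EnableCDI:true"}' |
  jq -c 'select(.MESSAGE|test("EnableCDI";"i"))'
# prints: {"MESSAGE":"CRI config: EnableCDI:true"}
```

The `"i"` flag to `test` makes the match case-insensitive, so variations in containerd's log formatting across versions are less likely to slip past the check.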
Reviewed file excerpt (`k8s_dra_enable.mdx`):

```mdx
## Steps to Enable DRA in Kubernetes 1.32–1.33

On the all master nodes:
```
Fix grammatical error.
"On the all master nodes" should be "On all master nodes".
📝 Suggested fix

```diff
-On the all master nodes:
+On all master nodes:
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
    - --feature-gates=DynamicResourceAllocation=true # required
    - --runtime-config=resource.k8s.io/v1beta1 # required
    - --runtime-config=resource.k8s.io/v1beta2 # required
```
🧩 Analysis chain
🌐 Web query:
kubernetes kube-apiserver --runtime-config flag syntax multiple API groups
💡 Result:
Use comma-separated key=value pairs; later entries override earlier ones. Keys are either core v1 ("v1") or group/version ("group/version"), or special keys like api/all, api/ga, api/beta, api/alpha. Examples:
- Enable apps/v1 and batch/v1: --runtime-config=apps/v1=true,batch/v1=true.
- Disable an API then re-enable v1 only: --runtime-config=api/all=false,api/v1=true.
- Disable a specific resource version: --runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/jobs=false.
See kube-apiserver flag reference and "Enable or Disable a Kubernetes API" docs for details. [1][2]
Sources
[1] kube-apiserver --runtime-config docs. [2] Enable or Disable a Kubernetes API (kubernetes.io).
🌐 Web query:
kube-apiserver runtime-config comma separated values multiple
💡 Result:
Yes — --runtime-config accepts multiple comma-separated key[=value] entries. Use API paths (e.g. group/version) or special keys (api/all, api/legacy); later entries override earlier ones. Example:
--runtime-config=api/all=false,api/v1=true,batch/v2alpha1,extensions/v1beta1/deployments=false. [1][2]
Sources:
[1] Kubernetes: Enable Or Disable A Kubernetes API.
[2] kube-apiserver command-line reference.
Consolidate --runtime-config flags into a single comma-separated entry.
Per Kubernetes documentation, --runtime-config accepts multiple API groups only when comma-separated within a single flag. The current configuration uses multiple separate flags, which will not work as intended.
Correct syntax

```diff
 - --feature-gates=DynamicResourceAllocation=true # required
-- --runtime-config=resource.k8s.io/v1beta1 # required
-- --runtime-config=resource.k8s.io/v1beta2 # required
+- --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true # required
```
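In a kubeadm-style static pod manifest, the consolidated flag would sit alongside the feature gate roughly as follows. This is a sketch only: the manifest path and the surrounding fields are assumptions, not taken from the PR.

```yaml
# Sketch of /etc/kubernetes/manifests/kube-apiserver.yaml (abridged)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --feature-gates=DynamicResourceAllocation=true
        - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true
        # ... other flags
```

Because `--runtime-config` takes comma-separated key=value pairs within a single flag, repeating the flag would leave only the last occurrence in effect.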
```yaml
    # ... other flags
```

4. For kubelet, edit `/var/lib/kubelet/config.yaml` on the all nodes:
Fix grammatical error.
"on the all nodes" should be "on all nodes".
📝 Suggested fix

```diff
-4. For kubelet, edit `/var/lib/kubelet/config.yaml` on the all nodes:
+4. For kubelet, edit `/var/lib/kubelet/config.yaml` on all nodes:
```
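For the kubelet side referenced above, the change is a feature gate in the KubeletConfiguration file. A sketch of the fragment — the surrounding fields are standard KubeletConfiguration boilerplate, and only the `featureGates` stanza is the point here:

```yaml
# /var/lib/kubelet/config.yaml (fragment)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
```

After editing, the kubelet must be restarted (e.g. `systemctl restart kubelet`) for the gate to take effect.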
Reviewed file excerpt (`install.mdx`):

```mdx
## Procedure

### Installing Nvidia driver in your gpu node
Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
```
Fix phrasing: "Prefer to" → "Refer to".
"Prefer to" is grammatically incorrect in this context.
📝 Suggested fix

```diff
-Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
+Refer to the [NVIDIA Official Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
```
```mdx
Prefer to [Installation guide of Nvidia Official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)

### Installing Nvidia Container Runtime
Prefer to [Installation guide of Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
```
Fix phrasing: "Prefer to" → "Refer to".
Same issue as above.
📝 Suggested fix

```diff
-Prefer to [Installation guide of Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+Refer to the [NVIDIA Container Toolkit Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
```
```bash
kubectl logs pod -n dra-gpu-workload -f
```
Incorrect kubectl logs command syntax.
The command kubectl logs pod -n dra-gpu-workload is incorrect. The -n flag specifies the namespace, not the pod name. Since the pod is named dra-gpu-workload and no namespace is specified in the manifest (defaults to default), the command should be:
🐛 Proposed fix

```diff
- kubectl logs pod -n dra-gpu-workload -f
+ kubectl logs dra-gpu-workload -f
```
Deploying alauda-ai

- Latest commit: cb08806
- Status: ✅ Deploy successful!
- Preview URL: https://93ced307.alauda-ai.pages.dev
- Branch Preview URL: https://add-dra.alauda-ai.pages.dev