From 11c8223ba8968d41840da19968b7a9b320d4e9bd Mon Sep 17 00:00:00 2001
From: Jeremy Berchtold
Date: Thu, 15 Jan 2026 09:47:21 -0800
Subject: [PATCH 1/4] expand troubleshooting docs

Signed-off-by: Jeremy Berchtold
---
 README.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/README.rst b/README.rst
index 55be0e583f..9cc5577626 100644
--- a/README.rst
+++ b/README.rst
@@ -308,6 +308,30 @@ Troubleshooting
       cd transformer_engine
       pip install -v -v -v --no-build-isolation .
 
+5. **Problems when using UV or virtual environments:**
+
+   * **Symptoms:** Cannot import ``transformer_engine``
+   * **Solution:** Ensure your UV environment is active and that you have used ``uv pip install --no-build-isolation `` instead of a regular pip install to your system environment.
+
+   * **Symptoms:** Errors at runtime with ``CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED``
+   * **Solution:** This can occur when TE is built against the container's system installation of cuDNN, but pip packages inside the virtual environment pull in pip packages for ``nvidia-cudnn-cu12/cu13``. To resolve this, when building TE from source please specify the following environment variables to point to the cuDNN in your virtual environment.
+
+   ```bash
+   export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn
+   export CUDNN_HOME=$CUDNN_PATH
+   export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$LD_LIBRARY_PATH
+   ```
+
+   * **Symptoms:** Regular TE installs work correctly, but UV wheel builds fail at runtime.
+   * **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` to ensure build isolation is active and that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.
+
+**JAX-specific Common Issues and Solutions:**
+
+1. **FFI Issues:**
+
+   * **Symptoms:** ``No registered implementation for custom call to for platform CUDA``
+   * **Solution:** Ensure ``--no-build-isolation`` is used during installation. If pre-building wheels, ensure that the wheel is both built and installed with ``--no-build-isolation``. See "Problems when using UV" above if using UV.
+
 .. troubleshooting-end-marker-do-not-remove
 
 Breaking Changes

From b2b79ab1f01ab2ef18137527487ea6fcb6700ade Mon Sep 17 00:00:00 2001
From: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Date: Thu, 15 Jan 2026 09:53:01 -0800
Subject: [PATCH 2/4] Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
---
 README.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.rst b/README.rst
index 9cc5577626..59c2f46352 100644
--- a/README.rst
+++ b/README.rst
@@ -323,7 +323,7 @@ Troubleshooting
    ```
 
    * **Symptoms:** Regular TE installs work correctly, but UV wheel builds fail at runtime.
-   * **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` to ensure build isolation is active and that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.
+   * **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` for verbose output to verify that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.
 
 **JAX-specific Common Issues and Solutions:**
 

From c6000d7061d559dab8f526220c5e0c7f0bf3db2c Mon Sep 17 00:00:00 2001
From: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Date: Thu, 15 Jan 2026 09:53:08 -0800
Subject: [PATCH 3/4] Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
---
 README.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.rst b/README.rst
index 59c2f46352..392fb63f87 100644
--- a/README.rst
+++ b/README.rst
@@ -316,11 +316,12 @@ Troubleshooting
    * **Symptoms:** Errors at runtime with ``CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED``
    * **Solution:** This can occur when TE is built against the container's system installation of cuDNN, but pip packages inside the virtual environment pull in pip packages for ``nvidia-cudnn-cu12/cu13``. To resolve this, when building TE from source please specify the following environment variables to point to the cuDNN in your virtual environment.
 
-   ```bash
-   export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn
-   export CUDNN_HOME=$CUDNN_PATH
-   export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$LD_LIBRARY_PATH
-   ```
+
+   .. code-block:: bash
+
+     export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn
+     export CUDNN_HOME=$CUDNN_PATH
+     export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$LD_LIBRARY_PATH
 
    * **Symptoms:** Regular TE installs work correctly, but UV wheel builds fail at runtime.
    * **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` for verbose output to verify that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.

From bfaad088fc84f2659ab0db55870f7bdf82ccafb5 Mon Sep 17 00:00:00 2001
From: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Date: Thu, 15 Jan 2026 10:00:14 -0800
Subject: [PATCH 4/4] Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
---
 README.rst | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/README.rst b/README.rst
index 392fb63f87..211964e7e1 100644
--- a/README.rst
+++ b/README.rst
@@ -308,11 +308,15 @@ Troubleshooting
       cd transformer_engine
       pip install -v -v -v --no-build-isolation .
 
-5. **Problems when using UV or virtual environments:**
+**Problems using UV or Virtual Environments:**
+
+1. **Import Error:**
 
    * **Symptoms:** Cannot import ``transformer_engine``
    * **Solution:** Ensure your UV environment is active and that you have used ``uv pip install --no-build-isolation `` instead of a regular pip install to your system environment.
 
+2. **cuDNN Sublibrary Loading Failed:**
+
    * **Symptoms:** Errors at runtime with ``CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED``
    * **Solution:** This can occur when TE is built against the container's system installation of cuDNN, but pip packages inside the virtual environment pull in pip packages for ``nvidia-cudnn-cu12/cu13``. To resolve this, when building TE from source please specify the following environment variables to point to the cuDNN in your virtual environment.
 
@@ -323,7 +327,9 @@ Troubleshooting
      export CUDNN_HOME=$CUDNN_PATH
      export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$LD_LIBRARY_PATH
 
-   * **Symptoms:** Regular TE installs work correctly, but UV wheel builds fail at runtime.
+3. **Building Wheels:**
+
+   * **Symptoms:** Regular TE installs work correctly but UV wheel builds fail at runtime.
    * **Solution:** Ensure that ``uv build --wheel --no-build-isolation -v`` is used during the wheel build as well as the pip installation of the wheel. Use ``-v`` for verbose output to verify that TE is not pulling in a mismatching version of PyTorch or JAX that differs from the UV environment's version.
 
 **JAX-specific Common Issues and Solutions:**
@@ -331,7 +337,7 @@ Troubleshooting
 1. **FFI Issues:**
 
    * **Symptoms:** ``No registered implementation for custom call to for platform CUDA``
-   * **Solution:** Ensure ``--no-build-isolation`` is used during installation. If pre-building wheels, ensure that the wheel is both built and installed with ``--no-build-isolation``. See "Problems when using UV" above if using UV.
+   * **Solution:** Ensure ``--no-build-isolation`` is used during installation. If pre-building wheels, ensure that the wheel is both built and installed with ``--no-build-isolation``. See "Problems using UV or Virtual Environments" above if using UV.
 
 .. troubleshooting-end-marker-do-not-remove
 
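
Note on the cuDNN hunk: the exported ``CUDNN_PATH`` hard-codes ``.venv`` and ``python3.12``, so it only matches environments with that exact layout. A minimal sketch of deriving the same three variables from the active interpreter is below. The helper names ``cudnn_env`` and ``as_exports`` are hypothetical and not part of TE or this patch, and the sketch assumes the pip cuDNN package is laid out as ``site-packages/nvidia/cudnn/lib``, which is the layout the patch itself implies.

```python
import os
import shlex
import sysconfig
from pathlib import Path


def cudnn_env(site_packages=None, current_ld_path=""):
    """Build the three variables from the patch for a given site-packages.

    Hypothetical helper: assumes cuDNN was installed by pip under
    <site-packages>/nvidia/cudnn, as in the README snippet.
    """
    # Default to the running interpreter's platform-specific site-packages.
    root = Path(site_packages or sysconfig.get_path("platlib"))
    cudnn = root / "nvidia" / "cudnn"
    if not (cudnn / "lib").is_dir():
        raise FileNotFoundError(f"no pip-installed cuDNN found under {cudnn}")

    # Prepend the venv's cuDNN lib directory, keeping any existing entries.
    ld_parts = [str(cudnn / "lib")]
    if current_ld_path:
        ld_parts.append(current_ld_path)

    return {
        "CUDNN_PATH": str(cudnn),
        "CUDNN_HOME": str(cudnn),
        "LD_LIBRARY_PATH": os.pathsep.join(ld_parts),
    }


def as_exports(env):
    """Render the mapping as shell `export` lines, quoting each value."""
    return "\n".join(f"export {key}={shlex.quote(value)}" for key, value in env.items())
```

Run with the environment's own Python, ``print(as_exports(cudnn_env(current_ld_path=os.environ.get("LD_LIBRARY_PATH", ""))))`` prints lines that can be evaluated in the shell before building TE from source. Whether the README should carry something like this, or simply note that the ``python3.12`` component varies, is a judgment call for the author.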