-
Notifications
You must be signed in to change notification settings - Fork 80
Support uint8 data type for Allreduce #736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds uint8 support for Allreduce by introducing mscclpp::DataType::UINT8, wiring NCCL/MSCCLPP dtype conversions, and ensuring NVLS paths are avoided for uint8 due to lack of byte-level reduction support.
Changes:
- Add
UINT8tomscclpp::DataTypeand related vector aliases. - Add
uint8element/vector reduction support in allreduce device helpers and dispatcher. - Disable/guard NVLS allreduce implementations and selection logic for
uint8.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/ext/nccl/nccl.cu |
Disables NVLS algorithm selection for UINT8 allreduce requests. |
src/ext/nccl/datatype_conversion.hpp |
Maps ncclUint8 to mscclpp::DataType::UINT8 and updates dtype sizing. |
src/ext/collectives/include/allreduce/common.hpp |
Adds uint8 reduction helpers and enables UINT8 dispatch in allreduce kernels. |
src/ext/collectives/allreduce/allreduce_nvls_with_copy_2.cu |
Explicitly rejects NVLS-with-copy2 for uint8. |
src/ext/collectives/allreduce/allreduce_nvls_with_copy.cu |
Explicitly rejects NVLS-with-copy for uint8. |
src/ext/collectives/allreduce/allreduce_nvls.cu |
Explicitly rejects NVLS for uint8. |
src/core/include/execution_kernel.hpp |
Adds uint8 vector add support and blocks MULTI_LOAD_REDUCE_STORE for uint8. |
src/core/executor/execution_kernel.cu |
Adds CUDA executor kernel launch support for DataType::UINT8. |
include/mscclpp/gpu_data_types.hpp |
Adds DataType::UINT8 and u8x* vector type aliases. |
Comments suppressed due to low confidence (1)
include/mscclpp/gpu_data_types.hpp:71
- Inserting
UINT8in the middle ofenum class DataTypechanges the underlying numeric values of existing enumerators (e.g.,FLOAT16shifts). IfDataTypecrosses any ABI boundaries (public API, serialization, IPC), this is a breaking change. Consider assigning explicit values to preserve existing ones, or appendUINT8at the end to keep prior values stable.
enum class DataType {
INT32, // 32-bit signed integer.
UINT32, // 32-bit unsigned integer.
UINT8, // 8-bit unsigned integer.
FLOAT16, // IEEE 754 half precision.
FLOAT32, // IEEE 754 single precision.
BFLOAT16, // bfloat16 precision.
FP8_E4M3, // FP8 with E4M3 layout.
FP8_E5M2, // FP8 with E5M2 layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Support uint8 data type for Allreduce.
Current limitation: uint8 is not supported for NVLS.
Performance results with RCCL-test with MSCCLPP on MI300X:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)