Support uint8 data type for Allreduce #736

seagater · 2026-01-26T15:16:47Z

Support uint8 data type for Allreduce.
Current limitation: uint8 is not supported for NVLS.

Performance results from rccl-tests with MSCCLPP on MI300X:

# size (B)	count (elements)	type	redop	root	out-of-place time (us)	out-of-place algbw (GB/s)	out-of-place busbw (GB/s)	in-place time (us)	in-place algbw (GB/s)	in-place busbw (GB/s)

1024	512	half	sum	-1	5.39	0.19	0.33	5.45	0.19	0.33
2048	1024	half	sum	-1	5.53	0.37	0.65	5.63	0.36	0.64
4096	2048	half	sum	-1	5.55	0.74	1.29	5.56	0.74	1.29
8192	4096	half	sum	-1	5.8	1.41	2.47	5.84	1.4	2.46
16384	8192	half	sum	-1	6.57	2.49	4.36	6.56	2.5	4.37
32768	16384	half	sum	-1	8.02	4.09	7.15	8.06	4.07	7.11
65536	32768	half	sum	-1	8.77	7.47	13.07	8.82	7.43	13
131072	65536	half	sum	-1	9.61	13.64	23.87	9.78	13.4	23.45
262144	131072	half	sum	-1	11.68	22.44	39.27	12.1	21.67	37.93
524288	262144	half	sum	-1	13.77	38.08	66.64	13.87	37.79	66.13
1048576	524288	half	sum	-1	19.11	54.87	96.03	19.27	54.42	95.24
2097152	1048576	half	sum	-1	24.1	87	152.26	24.24	86.52	151.41
4194304	2097152	half	sum	-1	37.16	112.87	197.52	37.44	112.03	196.06
8388608	4194304	half	sum	-1	61.53	136.33	238.58	61.68	135.99	237.99
16777216	8388608	half	sum	-1	108.8	154.22	269.88	109.2	153.6	268.79
33554432	16777216	half	sum	-1	197.8	169.68	296.94	198.6	168.92	295.61
67108864	33554432	half	sum	-1	384.6	174.51	305.39	385.1	174.27	304.98
134217728	67108864	half	sum	-1	754.1	177.99	311.48	754.9	177.78	311.12
268435456	134217728	half	sum	-1	1491.8	179.94	314.89	1493.2	179.77	314.6
536870912	268435456	half	sum	-1	2979.6	180.18	315.31	2983.9	179.92	314.87

# size (B)	count (elements)	type	redop	root	out-of-place time (us)	out-of-place algbw (GB/s)	out-of-place busbw (GB/s)	in-place time (us)	in-place algbw (GB/s)	in-place busbw (GB/s)

1024	1024	fp8_e4m3	sum	-1	5.4	0.19	0.33	5.45	0.19	0.33
2048	2048	fp8_e4m3	sum	-1	5.5	0.37	0.65	5.6	0.37	0.64
4096	4096	fp8_e4m3	sum	-1	5.61	0.73	1.28	5.68	0.72	1.26
8192	8192	fp8_e4m3	sum	-1	5.96	1.38	2.41	5.98	1.37	2.4
16384	16384	fp8_e4m3	sum	-1	6.49	2.52	4.42	6.58	2.49	4.36
32768	32768	fp8_e4m3	sum	-1	8.09	4.05	7.09	8.15	4.02	7.03
65536	65536	fp8_e4m3	sum	-1	8.58	7.64	13.37	8.7	7.53	13.18
131072	131072	fp8_e4m3	sum	-1	9.44	13.88	24.29	9.62	13.63	23.85
262144	262144	fp8_e4m3	sum	-1	10.12	25.9	45.32	10.37	25.27	44.22
524288	524288	fp8_e4m3	sum	-1	13.73	38.19	66.82	13.89	37.74	66.04
1048576	1048576	fp8_e4m3	sum	-1	18.66	56.2	98.34	18.92	55.41	96.97
2097152	2097152	fp8_e4m3	sum	-1	24.54	85.46	149.56	24.63	85.16	149.03
4194304	4194304	fp8_e4m3	sum	-1	37.79	110.98	194.21	38.05	110.22	192.88
8388608	8388608	fp8_e4m3	sum	-1	62.22	134.82	235.94	62.63	133.94	234.4
16777216	16777216	fp8_e4m3	sum	-1	109.9	152.62	267.09	110.4	151.9	265.83
33554432	33554432	fp8_e4m3	sum	-1	201.1	166.82	291.94	202.3	165.84	290.22
67108864	67108864	fp8_e4m3	sum	-1	390	172.06	301.11	390.2	171.99	300.99
134217728	134217728	fp8_e4m3	sum	-1	763.9	175.7	307.47	764.2	175.62	307.34
268435456	268435456	fp8_e4m3	sum	-1	1509.5	177.83	311.2	1510.1	177.76	311.08
536870912	536870912	fp8_e4m3	sum	-1	3010.2	178.35	312.11	3014.2	178.11	311.7

# size (B)	count (elements)	type	redop	root	out-of-place time (us)	out-of-place algbw (GB/s)	out-of-place busbw (GB/s)	in-place time (us)	in-place algbw (GB/s)	in-place busbw (GB/s)

1024	1024	fp8_e5m2	sum	-1	5.41	0.19	0.33	5.44	0.19	0.33
2048	2048	fp8_e5m2	sum	-1	5.5	0.37	0.65	5.67	0.36	0.63
4096	4096	fp8_e5m2	sum	-1	5.61	0.73	1.28	5.69	0.72	1.26
8192	8192	fp8_e5m2	sum	-1	5.96	1.37	2.4	6	1.36	2.39
16384	16384	fp8_e5m2	sum	-1	6.63	2.47	4.32	6.59	2.49	4.35
32768	32768	fp8_e5m2	sum	-1	8.07	4.06	7.1	8.16	4.02	7.03
65536	65536	fp8_e5m2	sum	-1	8.62	7.61	13.31	8.73	7.51	13.14
131072	131072	fp8_e5m2	sum	-1	9.43	13.9	24.33	9.6	13.66	23.9
262144	262144	fp8_e5m2	sum	-1	10.11	25.94	45.39	10.38	25.26	44.21
524288	524288	fp8_e5m2	sum	-1	13.73	38.19	66.84	13.87	37.79	66.13
1048576	1048576	fp8_e5m2	sum	-1	18.65	56.22	98.39	18.93	55.38	96.92
2097152	2097152	fp8_e5m2	sum	-1	24.54	85.47	149.57	24.63	85.16	149.03
4194304	4194304	fp8_e5m2	sum	-1	37.84	110.83	193.96	38.01	110.36	193.12
8388608	8388608	fp8_e5m2	sum	-1	62.32	134.61	235.58	62.55	134.12	234.71
16777216	16777216	fp8_e5m2	sum	-1	110	152.58	267.01	110.3	152.12	266.21
33554432	33554432	fp8_e5m2	sum	-1	201.1	166.9	292.07	201.8	166.26	290.96
67108864	67108864	fp8_e5m2	sum	-1	390	172.07	301.12	390.5	171.87	300.78
134217728	134217728	fp8_e5m2	sum	-1	763.9	175.69	307.46	764.5	175.56	307.23
268435456	268435456	fp8_e5m2	sum	-1	1509.4	177.84	311.22	1509.8	177.8	311.14
536870912	536870912	fp8_e5m2	sum	-1	3013	178.18	311.82	3018	177.89	311.31

# size (B)	count (elements)	type	redop	root	out-of-place time (us)	out-of-place algbw (GB/s)	out-of-place busbw (GB/s)	in-place time (us)	in-place algbw (GB/s)	in-place busbw (GB/s)

1024	1024	uint8	sum	-1	5.46	0.19	0.33	5.46	0.19	0.33
2048	2048	uint8	sum	-1	5.54	0.37	0.65	5.63	0.36	0.64
4096	4096	uint8	sum	-1	5.61	0.73	1.28	5.63	0.73	1.27
8192	8192	uint8	sum	-1	5.9	1.39	2.43	5.9	1.39	2.43
16384	16384	uint8	sum	-1	6.6	2.48	4.35	6.64	2.47	4.32
32768	32768	uint8	sum	-1	8.99	3.65	6.38	8.99	3.64	6.38
65536	65536	uint8	sum	-1	9.44	6.94	12.15	9.58	6.84	11.98
131072	131072	uint8	sum	-1	11.72	11.18	19.57	11.83	11.08	19.4
262144	262144	uint8	sum	-1	12.29	21.32	37.31	12.45	21.05	36.84
524288	524288	uint8	sum	-1	13.87	37.8	66.15	13.93	37.64	65.88
1048576	1048576	uint8	sum	-1	19.11	54.88	96.04	19.3	54.33	95.08
2097152	2097152	uint8	sum	-1	24.38	86.01	150.51	24.52	85.53	149.67
4194304	4194304	uint8	sum	-1	37.52	111.78	195.61	37.76	111.08	194.39
8388608	8388608	uint8	sum	-1	62.4	134.44	235.26	62.56	134.1	234.67
16777216	16777216	uint8	sum	-1	110.2	152.22	266.39	110.3	152.04	266.08
33554432	33554432	uint8	sum	-1	199.8	167.94	293.9	197.5	169.88	297.29
67108864	67108864	uint8	sum	-1	386.3	173.73	304.03	378.4	177.37	310.39
134217728	134217728	uint8	sum	-1	758	177.07	309.87	741.1	181.12	316.95
268435456	268435456	uint8	sum	-1	1500.1	178.95	313.16	1466.2	183.09	320.4
536870912	536870912	uint8	sum	-1	2991.7	179.45	314.04	2924.8	183.56	321.23
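As a reading aid for the tables above, the two bandwidth columns can be recomputed from size and time. This is a minimal sketch, not part of the PR: the function names are made up for illustration, and the rank count of 8 is an assumption inferred from the busbw/algbw ratio of 1.75 in the rows, since the PR text does not state it. The allreduce scaling factor 2 * (n - 1) / n is the one used by nccl-tests.

```cpp
#include <cassert>
#include <cstddef>

// Algorithm bandwidth in GB/s: bytes divided by elapsed time (1 GB = 1e9 B).
constexpr double algBwGBps(std::size_t bytes, double timeUs) {
  return static_cast<double>(bytes) / (timeUs * 1e-6) / 1e9;
}

// Allreduce bus bandwidth: algbw scaled by 2 * (n - 1) / n for n ranks.
constexpr double allreduceBusBwGBps(double algBw, int nRanks) {
  return algBw * 2.0 * (nRanks - 1) / nRanks;
}

// Recompute the largest out-of-place uint8 row (536870912 B in 2991.7 us).
// nRanks = 8 is an assumption inferred from the busbw/algbw ratio of 1.75.
inline void checkLargestUint8Row() {
  const double algBw = algBwGBps(536870912, 2991.7);
  const double busBw = allreduceBusBwGBps(algBw, 8);
  assert(algBw > 179.44 && algBw < 179.46);  // table reports 179.45
  assert(busBw > 314.03 && busBw < 314.05);  // table reports 314.04
}
```

Calling `checkLargestUint8Row()` passes against the reported row, which suggests the uint8 figures follow the same accounting as the half and fp8 runs.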


Copilot

Pull request overview

Adds uint8 support for Allreduce by introducing mscclpp::DataType::UINT8, wiring NCCL/MSCCLPP dtype conversions, and ensuring NVLS paths are avoided for uint8 due to lack of byte-level reduction support.

Changes:

Add UINT8 to mscclpp::DataType and related vector aliases.
Add uint8 element/vector reduction support in allreduce device helpers and dispatcher.
Disable/guard NVLS allreduce implementations and selection logic for uint8.
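The changes listed above can be pictured with a small host-side sketch. Everything here is hypothetical: `DType`, `Algo`, `elementSize`, and `selectAllreduceAlgo` are stand-in names for illustration and are not the mscclpp or NCCL declarations. The sketch only shows the shape of the idea: uint8 is a 1-byte element type, and algorithm selection steers uint8 requests away from NVLS.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; not the real mscclpp::DataType or algorithm names.
enum class DType { FLOAT16, FP8_E4M3, UINT8 };
enum class Algo { Nvls, Fallback };

// Size in bytes of one element of the given type.
inline std::size_t elementSize(DType t) {
  switch (t) {
    case DType::FLOAT16:
      return 2;
    case DType::FP8_E4M3:
    case DType::UINT8:
      return 1;
  }
  return 0;  // Unreachable for valid enumerators.
}

// Choose NVLS only when it is available and the dtype is not UINT8.
inline Algo selectAllreduceAlgo(DType t, bool nvlsAvailable) {
  if (t == DType::UINT8) return Algo::Fallback;  // NVLS path is skipped for uint8.
  return nvlsAvailable ? Algo::Nvls : Algo::Fallback;
}
```

With this guard, `selectAllreduceAlgo(DType::UINT8, true)` returns `Algo::Fallback` even on hardware where NVLS is available, while other types keep their existing selection.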

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


File	Description
`src/ext/nccl/nccl.cu`	Disables NVLS algorithm selection for `UINT8` allreduce requests.
`src/ext/nccl/datatype_conversion.hpp`	Maps `ncclUint8` to `mscclpp::DataType::UINT8` and updates dtype sizing.
`src/ext/collectives/include/allreduce/common.hpp`	Adds `uint8` reduction helpers and enables `UINT8` dispatch in allreduce kernels.
`src/ext/collectives/allreduce/allreduce_nvls_with_copy_2.cu`	Explicitly rejects NVLS-with-copy2 for `uint8`.
`src/ext/collectives/allreduce/allreduce_nvls_with_copy.cu`	Explicitly rejects NVLS-with-copy for `uint8`.
`src/ext/collectives/allreduce/allreduce_nvls.cu`	Explicitly rejects NVLS for `uint8`.
`src/core/include/execution_kernel.hpp`	Adds `uint8` vector add support and blocks `MULTI_LOAD_REDUCE_STORE` for `uint8`.
`src/core/executor/execution_kernel.cu`	Adds CUDA executor kernel launch support for `DataType::UINT8`.
`include/mscclpp/gpu_data_types.hpp`	Adds `DataType::UINT8` and `u8x*` vector type aliases.

Comments suppressed due to low confidence (1)

include/mscclpp/gpu_data_types.hpp:71

Inserting UINT8 in the middle of enum class DataType changes the underlying numeric values of existing enumerators (e.g., FLOAT16 shifts). If DataType crosses any ABI boundaries (public API, serialization, IPC), this is a breaking change. Consider assigning explicit values to preserve existing ones, or append UINT8 at the end to keep prior values stable.

enum class DataType {
  INT32,     // 32-bit signed integer.
  UINT32,    // 32-bit unsigned integer.
  UINT8,     // 8-bit unsigned integer.
  FLOAT16,   // IEEE 754 half precision.
  FLOAT32,   // IEEE 754 single precision.
  BFLOAT16,  // bfloat16 precision.
  FP8_E4M3,  // FP8 with E4M3 layout.
  FP8_E5M2,  // FP8 with E5M2 layout.
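The concern in this comment can be checked with a reduced example. The enums below are simplified stand-ins, not the real `mscclpp::DataType`; they only show how an enumerator inserted in the middle shifts the implicit values of those after it, while appending or assigning explicit values keeps the existing values stable.

```cpp
#include <cassert>

// Simplified stand-ins for the enum before and after the change.
enum class Before { INT32, UINT32, FLOAT16 };
enum class InsertedInMiddle { INT32, UINT32, UINT8, FLOAT16 };
enum class Appended { INT32, UINT32, FLOAT16, UINT8 };
// Explicit values allow any declaration order without renumbering.
enum class ExplicitValues { INT32 = 0, UINT32 = 1, UINT8 = 3, FLOAT16 = 2 };

// Underlying integer value of an enumerator.
template <typename E>
constexpr int raw(E e) {
  return static_cast<int>(e);
}

static_assert(raw(Before::FLOAT16) == 2, "original value");
static_assert(raw(InsertedInMiddle::FLOAT16) == 3, "shifted by the insertion");
static_assert(raw(Appended::FLOAT16) == 2, "unchanged when appending");
static_assert(raw(ExplicitValues::FLOAT16) == 2, "unchanged with explicit values");
```

Whether the shift matters depends on whether the numeric values are ever stored or exchanged between separately built components, which the comment raises as a question rather than a confirmed problem.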

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Qinghua Zhou and others added 5 commits January 16, 2026 04:07

Add uint8 data type for allreduce

032c40f

Add uint8 in execution kernel; Skip uint8 for nvls adapter

314fbb2

Skip uint8 for nvls

a47f3c7

Merge main branch with uint8 support

b62daa5

Move the min_elements<unit8_t> template specialization below the general template declaration.

ff617db

seagater requested review from Binyang2014, caiomcbr, chhwang, Copilot and mahdiehghazim January 26, 2026 15:16

Copilot started reviewing on behalf of seagater January 26, 2026 15:17

Copilot AI reviewed Jan 26, 2026

src/ext/nccl/datatype_conversion.hpp (resolved)

src/core/include/execution_kernel.hpp (resolved)

src/core/include/execution_kernel.hpp (resolved)

Copilot AI reviewed Jan 26, 2026

Binyang2014 reviewed Jan 26, 2026

src/ext/collectives/include/allreduce/common.hpp (outdated, resolved)

Qinghua Zhou added 3 commits January 27, 2026 17:19

Add __vadd4 and __vminu4 for uint8; Add uint8 in execution_kernel; Add uint8 in mscclppToNcclDataType

0b829c3

Merge branch 'main' into qinghuazhou/uint8_support

280e088

Optimize cal_uint8x4_sum and cal_uint8x4_min for HIP platform

550a04b
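The commits above mention `__vadd4` and `__vminu4`, the CUDA intrinsics that operate on four uint8 lanes packed into one 32-bit word. The following is a portable host-side sketch of what such per-byte operations compute, assuming wrap-around (modulo 256) addition as `__vadd4` provides. The function names `add4` and `minu4` are made up here, and this is not the PR's `cal_uint8x4_sum` or `cal_uint8x4_min` code.

```cpp
#include <cassert>
#include <cstdint>

// Per-byte wrapping add of four packed uint8 lanes.
// The low 7 bits of each lane are added normally; the top bit of each lane
// is then combined with XOR so no carry crosses into the neighboring lane.
inline std::uint32_t add4(std::uint32_t a, std::uint32_t b) {
  const std::uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
  return low ^ ((a ^ b) & 0x80808080u);
}

// Per-byte unsigned minimum of four packed uint8 lanes (simple loop form).
inline std::uint32_t minu4(std::uint32_t a, std::uint32_t b) {
  std::uint32_t result = 0;
  for (int lane = 0; lane < 4; ++lane) {
    const int shift = 8 * lane;
    const std::uint32_t x = (a >> shift) & 0xffu;
    const std::uint32_t y = (b >> shift) & 0xffu;
    result |= (x < y ? x : y) << shift;
  }
  return result;
}
```

For example, `add4(0x01FF7F80, 0x01010180)` yields `0x02008000`: the lanes holding `0xFF + 0x01` and `0x80 + 0x80` each wrap to `0x00` without disturbing their neighbors.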
