Summary
The vendor image includes a working dual-NPU “peer” model format + runtime path (e.g., deepseek_7b_1k_int8_peer) that loads across both NPUs and allocates from both MMZ pools. However, there is no public tooling or end-to-end documentation to create peer/composite models from upstream weights (or from single-NPU .model artifacts). Please publish the model conversion / compilation toolchain (or at minimum the peer/composite build step + format docs) so external users can generate peer builds for other LLMs and actually utilize both NPUs.
Why this matters
- EIC7702 boards expose two NPUs; using only one is a major capability loss.
- Peer mode enables larger models (or larger context / headroom) by distributing artifacts across both NPU memory pools instead of forcing a single NPU carveout to be huge.
- This unblocks integration work with higher-level serving stacks (OpenAI-compatible endpoints, OpenWebUI/OpenWebLLM frontends, etc.) because folks can build the required artifacts themselves.
Evidence that peer mode works (current vendor image)
Peer model artifacts exist
- `/opt/eswin/sample-code/npu_sample/qwen_sample/models/deepseek_7b_1k_int8_peer/`
- Peer-specific shards exist (example names):
  - `lm_npu_b1_d0.model` / `lm_npu_b1_d1.model`
  - `modified_block_0_npu_b1_d0.model` / `modified_block_0_npu_b1_d1.model`
- `du -sh` footprint is similar to the non-peer build (expected for “split across dies” rather than “duplicate everything”).
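For reference, this is inspectable directly on the vendor image (the shard names shown are the examples above; other files elided):

```sh
# Peer model directory shipped on the vendor image.
ls /opt/eswin/sample-code/npu_sample/qwen_sample/models/deepseek_7b_1k_int8_peer/
# lm_npu_b1_d0.model
# lm_npu_b1_d1.model
# modified_block_0_npu_b1_d0.model
# modified_block_0_npu_b1_d1.model
# ...

# Total footprint is comparable to the non-peer build.
du -sh /opt/eswin/sample-code/npu_sample/qwen_sample/models/deepseek_7b_1k_int8_peer/
```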
Runtime opens both NPUs during peer inference
`sudo fuser -v /dev/npu0 /dev/npu1` shows the same PID has both devices open during peer inference.
Both MMZ pools are actively allocated in peer inference
`cat /proc/eswin/vb` reports both pools in use with roughly symmetric free space remaining.
- Example observed values (per pool):
  - Total: `0x17ffff000` ≈ 6.00 GiB
  - Free: `0x589e2000` ≈ 1.38 GiB
- This implies ~4.62 GiB used on each pool (consistent with a dual-NPU split).
(If you want exact logs/screenshots, I can attach them; the above is reproducible on the current vendor image.)
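For convenience, the whole check condenses to a few commands (the exact `es_qwen2` invocation is omitted here; use the arguments documented with the sample):

```sh
# Start the vendor peer sample in the background first, e.g.:
#   /opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2 ... &
# then, while inference is running:
sudo fuser -v /dev/npu0 /dev/npu1   # expect the same PID holding both devices
cat /proc/eswin/vb                  # expect both pools in use, roughly symmetric Free
```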
Documentation indicates peer/composite is a supported concept (but the build path is missing)
The ENNP docs from the https://github.com/eswincomputing/ebc7702-dev-board-ubuntu repo explicitly describe dual-device / dual-die execution and “composite model” runtime support, and they also describe an offline toolchain. The missing piece is a publicly available, externally usable workflow for generating peer/composite LLM artifacts.
1) Dual-die inference and “composite model” runtime support is documented
From ENNP User Manual v1.3, §7 “Dual-Die Inference”:
- Describes inference on a “dual-die architecture” and notes pipeline inference can run in parallel on both dies, with guidance to minimize cross-die interaction.
- References runtime concepts/APIs such as:
  - `ES_NPU_GetNumDevices(...)`
  - `ES_NPU_SetDevice(...)`
  - `ES_NPU_LoadCompositeModel(...)`
From ENNP Developer Manual v0.9.4, the documented NPU API includes:
- `ES_NPU_GetNumDevices` / `ES_NPU_SetDevice`
- `ES_NPU_LoadCompositeModel(...)` and the associated `NPU_COMPOSITE_MODEL_INFO_S` structure (suggesting multi-device model groupings are supported at the runtime/API level)
So peer/composite across multiple devices is documented as a supported mode — but the public model-build pipeline for it is missing.
2) An offline model toolchain is documented (but not publicly distributed in a usable way)
From ENNP User Manual v1.3, §3.3–3.4 and §6.5:
- Offline tools are named:
- EsQuant (quantization tool)
- EsAAC (model compiler)
- EsGoldenDataGen
- EsSimulator
- The docs also state the EsNNTools folder includes a Docker image containing the toolchain and show example usage of running EsAAC inside a container.
This strongly suggests the toolchain exists, but it is not currently published/linked in a way external users can obtain and use to build peer/composite LLM artifacts.
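Based on §6.5, the intended usage presumably resembles the sketch below; the image name, tag, mount points, and `EsAAC` flags here are placeholders, not the documented ones (the manual's exact invocation is authoritative):

```sh
# Hypothetical EsNNTools invocation -- image name/tag and EsAAC flags are
# illustrative placeholders only.
docker run --rm -v "$PWD/work:/work" esnntools:<tag> \
  EsAAC --input /work/model.onnx --output /work/model.model
```

If that image were pushed to a public registry (or attached to a release), that alone would close most of the gap.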
Documentation discoverability / fragmentation (request)
The dual-die/composite references were not discoverable from the main “Framework” repo/docs for the DC-ROMA/Framework workstream, and required digging through a separate vendor dev-board repository and zip bundle to find. Please:
- Provide a single canonical documentation location for ENNP + model toolchain docs, and
- Link it clearly from the DC-DeepComputing “Framework” repo (and/or vendor image README), including peer/composite model guidance and the toolchain distribution.
What I’m requesting (in order of usefulness)
1) Toolchain to build peer/composite models
Either source-available or binary release is fine, but it needs to be usable externally. Concretely, publish whatever is required from the documented tool suite (e.g., the EsNNTools Docker image(s) or equivalent packages) plus the missing peer/composite build steps.
Specifically:
- A documented way to convert an upstream model into ESWIN `.model` artifacts (docs suggest ONNX is a supported input; document accepted formats and constraints).
- The peer/composite build path that produces the multi-device shard set, for example:
  - `*_d0.model` / `*_d1.model` (or equivalent die/device shard naming)
  - rules for what is duplicated vs. partitioned (e.g., embeddings)
  - the correct multi-device config schema (for `es_qwen2` and/or direct ENNP runtime usage)
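To make the request concrete, here is a strawman of the missing stage. Every tool name, flag, and config key below is an invented placeholder for whatever the real toolchain defines; nothing here is documented today:

```sh
# Strawman peer/composite build pipeline -- purely illustrative.

# 1) Compile upstream weights (e.g., ONNX) into single-device artifacts.
EsAAC --input model.onnx --output build/model.model

# 2) Split/package into per-die shards (the undocumented step this issue
#    is about), producing *_d0.model / *_d1.model plus metadata.
#    "es_peer_split" is a hypothetical name for that missing tool.
es_peer_split --input build/model.model --devices 2 --output build/peer/

# 3) Emit a multi-device config consumable by es_qwen2 / the ENNP runtime
#    (keys below are guesses at the kind of schema that needs documenting).
cat > build/peer/config.json <<'EOF'
{
  "devices": [0, 1],
  "shards": ["lm_npu_b1_d0.model", "lm_npu_b1_d1.model"]
}
EOF
```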
2) Documentation for peer/composite model format + partitioning rules
At minimum, publish:
- How a composite/peer model package is structured on disk (directory layout, required metadata, naming rules)
- The partitioning strategy (layer split policy; what must co-reside; what can be independent)
- Constraints/limits (supported families, quantization requirements, context limits, memory headroom expectations)
3) A reproducible reference example
Provide one small reference that takes a known model and produces a peer/composite build:
- Inputs: upstream weights (or non-peer `.model` outputs)
- Output: peer/composite shards + config + a command to run
- Verification steps: demonstrate both `/dev/npu0` and `/dev/npu1` are used and both MMZ pools allocate
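A simple way to demonstrate the MMZ side is to snapshot `/proc/eswin/vb` before and during inference and compare (this assumes the per-pool Total/Free layout quoted above):

```sh
# Snapshot MMZ pool state before and during inference, then diff.
cat /proc/eswin/vb > /tmp/vb_before
# (start the peer-build inference in the background here, then:)
sleep 30
cat /proc/eswin/vb > /tmp/vb_during
diff /tmp/vb_before /tmp/vb_during   # Free should drop in BOTH pools
```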
4) Optional but very helpful: runtime knobs
If peer/composite mode can be enabled without regenerating artifacts (or with minimal metadata changes), document:
- environment variables / flags
- how device selection/binding works (NPU0 vs NPU1) and how allocations map to `mmz_nid_0_part_0` / `mmz_nid_1_part_0`
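For illustration only: if binding is runtime-controllable, usage could look like the line below. The variable name is invented here to show the kind of knob being requested, not a documented setting:

```sh
# Hypothetical selector -- no such variable is documented today.
ES_NPU_DEVICES=0,1 /opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2
```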
Acceptance criteria (so we know it’s “done”)
- An external user can take a supported upstream model (or a non-peer `.model` output), follow documented steps, and produce a peer/composite model package.
- The resulting build:
  - opens both `/dev/npu0` and `/dev/npu1` during inference
  - allocates from both MMZ pools
  - runs inference successfully (even if performance varies by model)
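The device-handle part of these criteria can be checked mechanically; a minimal sketch, run while inference is active (`fuser` prints the holding PIDs on stdout):

```sh
#!/bin/sh
# Minimal acceptance check for the dual-device criterion.
pids0=$(sudo fuser /dev/npu0 2>/dev/null)
pids1=$(sudo fuser /dev/npu1 2>/dev/null)
if [ -n "$pids0" ] && [ -n "$pids1" ]; then
  echo "PASS: both /dev/npu0 and /dev/npu1 are held (PIDs:$pids0 /$pids1)"
else
  echo "FAIL: one or both NPU devices are not open"
fi
```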
Environment
- Board: DC-ROMA II / EIC7702 (FML13V03)
- OS image: Ubuntu 24.04 LTS v1.0.15019
- Kernel: `Linux roma 6.6.92-eic7x-2025.07 #2025.09.26.03.45+ SMP Fri Sep 26 03:53:01 UTC 2025 riscv64`
- Sample runtime: `/opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2`
Thanks — publishing the peer/composite build toolchain (or even just the missing “split + package” stage plus format docs) would dramatically increase the value of the platform and enable moving beyond sample models.