Request: publish toolchain/docs to build “peer” (dual-NPU) models for EIC7702 / DC-ROMA II #48

Summary

The vendor image includes a working dual-NPU “peer” model format + runtime path (e.g., deepseek_7b_1k_int8_peer) that loads across both NPUs and allocates from both MMZ pools. However, there is no public tooling or end-to-end documentation to create peer/composite models from upstream weights (or from single-NPU .model artifacts). Please publish the model conversion / compilation toolchain (or at minimum the peer/composite build step + format docs) so external users can generate peer builds for other LLMs and actually utilize both NPUs.

Why this matters

  • EIC7702 boards expose two NPUs; using only one is a major capability loss.
  • Peer mode enables larger models (or larger context / headroom) by distributing artifacts across both NPU memory pools instead of forcing a single NPU carveout to be huge.
  • This unblocks integration work with higher-level serving stacks (OpenAI-compatible endpoints, OpenWebUI/OpenWebLLM frontends, etc.) because folks can build the required artifacts themselves.

Evidence that peer mode works (current vendor image)

Peer model artifacts exist

  • /opt/eswin/sample-code/npu_sample/qwen_sample/models/deepseek_7b_1k_int8_peer/
  • Peer-specific shards exist (example names):
    • lm_npu_b1_d0.model / lm_npu_b1_d1.model
    • modified_block_0_npu_b1_d0.model / modified_block_0_npu_b1_d1.model
  • The du -sh footprint is similar to the non-peer build, as expected for “split across dies” rather than “duplicate everything”.

Runtime opens both NPUs during peer inference

  • sudo fuser -v /dev/npu0 /dev/npu1 shows the same PID has both devices open during peer inference.

Both MMZ pools are actively allocated in peer inference

  • cat /proc/eswin/vb reports both pools in use with roughly symmetric free space remaining.
    • Example observed values (per pool):
      • Total: 0x17ffff000 (6.00 GiB)
      • Free: 0x589e2000 (1.38 GiB)
    • This implies ~4.62 GiB in use on each pool (6.00 - 1.38 = 4.62 GiB), consistent with a dual-NPU split.

(If you want exact logs/screenshots, I can attach them; the above is reproducible on the current vendor image.)

Documentation indicates peer/composite is a supported concept (but the build path is missing)

The ENNP docs from the https://github.com/eswincomputing/ebc7702-dev-board-ubuntu repo explicitly describe dual-device / dual-die execution and “composite model” runtime support, and they also describe an offline toolchain. The missing piece is a publicly available, externally usable workflow for generating peer/composite LLM artifacts.

1) Dual-die inference and “composite model” runtime support is documented

From ENNP User Manual v1.3, §7 “Dual-Die Inference”:

  • Describes inference on a “dual-die architecture” and notes pipeline inference can run in parallel on both dies, with guidance to minimize cross-die interaction.
  • References runtime concepts/APIs such as:
    • ES_NPU_GetNumDevices(...)
    • ES_NPU_SetDevice(...)
    • ES_NPU_LoadCompositeModel(...)

From ENNP Developer Manual v0.9.4, the NPU API is documented including:

  • ES_NPU_GetNumDevices / ES_NPU_SetDevice
  • ES_NPU_LoadCompositeModel(...) and the associated NPU_COMPOSITE_MODEL_INFO_S structure (suggesting multi-device model groupings are supported at the runtime/API level).

So peer/composite execution across multiple devices is documented as a supported mode; what is missing is the public model-build pipeline that produces such artifacts. A sketch of what the documented API surface implies follows below.
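For illustration, here is a guess at what loading a composite model across both dies might look like, based only on the API and struct names cited above. Every signature, field, and stub body is an assumption standing in for the unpublished SDK headers and runtime library; the stubs exist only so the sketch compiles standalone:

```c
/*
 * Sketch only: the function and struct names are taken from the ENNP
 * manuals (User Manual v1.3 §7, Developer Manual v0.9.4); the signatures,
 * fields, and stub bodies below are assumptions.
 */
#include <stdio.h>

typedef int ES_S32;

/* Guessed layout: a composite model groups one shard per die. */
typedef struct {
    ES_S32      numShards;
    const char *shardPath[2];   /* e.g. the observed *_d0.model / *_d1.model */
    ES_S32      deviceId[2];    /* die each shard should load onto */
} NPU_COMPOSITE_MODEL_INFO_S;

/* Stub implementations so this compiles standalone; the real ones live in
 * the ENNP runtime this issue asks to be published. */
ES_S32 ES_NPU_GetNumDevices(ES_S32 *count) { *count = 2; return 0; }
ES_S32 ES_NPU_SetDevice(ES_S32 devId) { (void)devId; return 0; }
ES_S32 ES_NPU_LoadCompositeModel(const NPU_COMPOSITE_MODEL_INFO_S *info,
                                 ES_S32 *modelId)
{
    (void)info;
    *modelId = 0;
    return 0;
}

int main(void)
{
    ES_S32 devices = 0;
    if (ES_NPU_GetNumDevices(&devices) != 0 || devices < 2) {
        fprintf(stderr, "expected 2 NPU dies, found %d\n", devices);
        return 1;
    }

    /* How ES_NPU_SetDevice interacts with composite loading is one of the
     * undocumented details; shown here as a plausible default. */
    ES_NPU_SetDevice(0);

    NPU_COMPOSITE_MODEL_INFO_S info = {
        .numShards = 2,
        .shardPath = { "lm_npu_b1_d0.model", "lm_npu_b1_d1.model" },
        .deviceId  = { 0, 1 },
    };

    ES_S32 modelId = -1;
    if (ES_NPU_LoadCompositeModel(&info, &modelId) != 0) {
        fprintf(stderr, "composite load failed\n");
        return 1;
    }
    /* ...create contexts and run inference spanning both dies... */
    printf("composite model %d loaded across %d dies\n", modelId, devices);
    return 0;
}
```

If the real structure differs, even a header dump plus one worked example in the docs would remove most of the guesswork here.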

2) An offline model toolchain is documented (but not publicly distributed in a usable way)

From ENNP User Manual v1.3, §3.3–3.4 and §6.5:

  • Offline tools are named:
    • EsQuant (quantization tool)
    • EsAAC (model compiler)
    • EsGoldenDataGen
    • EsSimulator
  • The docs also state that the EsNNTools folder includes a Docker image containing the toolchain, and they show example usage of running EsAAC inside a container.

This strongly suggests the toolchain exists, but it is not currently published/linked in a way external users can obtain and use to build peer/composite LLM artifacts.

Documentation discoverability / fragmentation (request)

The dual-die/composite references were not discoverable from the main “Framework” repo/docs for the DC-ROMA/Framework workstream, and required digging through a separate vendor dev-board repository and zip bundle to find. Please:

  • Provide a single canonical documentation location for ENNP + model toolchain docs, and
  • Link it clearly from the DC-DeepComputing “Framework” repo (and/or vendor image README), including peer/composite model guidance and the toolchain distribution.

What I’m requesting (in order of usefulness)

1) Toolchain to build peer/composite models

Either source-available or binary release is fine, but it needs to be usable externally. Concretely, publish whatever is required from the documented tool suite (e.g., the EsNNTools Docker image(s) or equivalent packages) plus the missing peer/composite build steps.

Specifically:

  • A documented way to convert an upstream model into ESWIN .model artifacts (docs suggest ONNX is a supported input; document accepted formats and constraints).
  • The peer/composite build path that produces the multi-device shard set (examples):
    • *_d0.model / *_d1.model (or equivalent die/device shard naming)
    • rules for what is duplicated vs partitioned (e.g., embeddings)
    • the correct multi-device config schema (for es_qwen2 and/or direct ENNP runtime usage)

2) Documentation for peer/composite model format + partitioning rules

At minimum, publish:

  • How a composite/peer model package is structured on disk (directory layout, required metadata, naming rules)
  • The partitioning strategy (layer split policy; what must co-reside; what can be independent)
  • Constraints/limits (supported families, quantization requirements, context limits, memory headroom expectations)
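
For concreteness, the observed peer package on the vendor image looks roughly like this. The shard names are real; the annotations and the metadata line are guesses, and documenting exactly this structure is the request:

```
deepseek_7b_1k_int8_peer/
├── lm_npu_b1_d0.model                  # per-die shard pair (die 0 / die 1)
├── lm_npu_b1_d1.model
├── modified_block_0_npu_b1_d0.model    # per-block shard pair (guessed role)
├── modified_block_0_npu_b1_d1.model
├── ...                                 # remaining block shards (illustrative)
└── (config/metadata?)                  # required files/layout undocumented
```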

3) A reproducible reference example

Provide one small reference that takes a known model and produces a peer/composite build:

  • Inputs: upstream weights (or non-peer .model outputs)
  • Output: peer/composite shards + config + a command to run
  • Verification steps: demonstrate that both /dev/npu0 and /dev/npu1 are used and that both MMZ pools allocate (a sketch of such a check follows this list)
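
As a concrete notion of “verification”, a minimal check could look like the following helper, which reports which PIDs hold the NPU device nodes open (a tiny fuser) and dumps /proc/eswin/vb so both pools' usage can be compared. The paths are those observed on the vendor image; everything else is illustrative:

```c
/* Minimal dual-NPU verification sketch for the vendor image's paths
 * (/dev/npu0, /dev/npu1, /proc/eswin/vb). Run as root during peer
 * inference; the exact format of /proc/eswin/vb is vendor-defined. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Scan /proc/<pid>/fd for open handles on the given device node. */
static void pids_holding(const char *dev)
{
    DIR *proc = opendir("/proc");
    struct dirent *p;
    printf("%s held by:", dev);
    while (proc && (p = readdir(proc))) {
        char fddir[PATH_MAX];
        if (atoi(p->d_name) <= 0)
            continue;                       /* not a PID entry */
        snprintf(fddir, sizeof fddir, "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);
        struct dirent *f;
        while (fds && (f = readdir(fds))) {
            char link[PATH_MAX], target[PATH_MAX];
            ssize_t n;
            snprintf(link, sizeof link, "%s/%s", fddir, f->d_name);
            n = readlink(link, target, sizeof target - 1);
            if (n > 0) {
                target[n] = '\0';
                if (strcmp(target, dev) == 0) {
                    printf(" %s", p->d_name);
                    break;
                }
            }
        }
        if (fds)
            closedir(fds);
    }
    if (proc)
        closedir(proc);
    printf("\n");
}

int main(void)
{
    pids_holding("/dev/npu0");
    pids_holding("/dev/npu1");

    /* Dump MMZ pool stats; a peer model should show roughly symmetric
     * in-use sizes on both pools (~4.6 GiB each in the observation above). */
    FILE *vb = fopen("/proc/eswin/vb", "r");
    if (!vb) {
        perror("/proc/eswin/vb");
        return 1;
    }
    char line[512];
    while (fgets(line, sizeof line, vb))
        fputs(line, stdout);
    fclose(vb);
    return 0;
}
```

During peer inference this should show the same PID holding both devices, matching the fuser evidence above.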

4) Optional but very helpful: runtime knobs

If peer/composite mode can be enabled without regenerating artifacts (or with minimal metadata changes), document:

  • environment variables / flags
  • how device selection/binding works (NPU0 vs NPU1) and how allocations map to mmz_nid_0_part_0 / mmz_nid_1_part_0

Acceptance criteria (so we know it’s “done”)

  • External user can take a supported upstream model (or non-peer .model output), follow documented steps, and produce a peer/composite model package.
  • The resulting build:
    • opens both /dev/npu0 and /dev/npu1 during inference
    • allocates from both MMZ pools
    • runs inference successfully (even if performance varies by model)

Environment

  • Board: DC-ROMA II / EIC7702 (FML13V03)
  • OS image: Ubuntu 24.04 LTS v1.0.15019
  • Kernel: Linux roma 6.6.92-eic7x-2025.07 #2025.09.26.03.45+ SMP Fri Sep 26 03:53:01 UTC 2025 riscv64
  • Sample runtime: /opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2

Thanks — publishing the peer/composite build toolchain (or even just the missing “split + package” stage plus format docs) would dramatically increase the value of the platform and enable moving beyond sample models.
