Request: publish toolchain/docs to build “peer” (dual-NPU) models for EIC7702 / DC-ROMA II #48

Summary

The vendor image includes a working dual-NPU “peer” model format + runtime path (e.g., deepseek_7b_1k_int8_peer) that loads across both NPUs and allocates from both MMZ pools. However, there is no public tooling or end-to-end documentation to create peer/composite models from upstream weights (or from single-NPU .model artifacts). Please publish the model conversion / compilation toolchain (or at minimum the peer/composite build step + format docs) so external users can generate peer builds for other LLMs and actually utilize both NPUs.

Why this matters

  • EIC7702 boards expose two NPUs; using only one is a major capability loss.
  • Peer mode enables larger models (or larger context / headroom) by distributing artifacts across both NPU memory pools instead of forcing a single NPU carveout to be huge.
  • This unblocks integration work with higher-level serving stacks (OpenAI-compatible endpoints, OpenWebUI/OpenWebLLM frontends, etc.) because folks can build the required artifacts themselves.

Evidence that peer mode works (current vendor image)

Peer model artifacts exist

  • /opt/eswin/sample-code/npu_sample/qwen_sample/models/deepseek_7b_1k_int8_peer/
  • Peer-specific shards exist (example names):
    • lm_npu_b1_d0.model / lm_npu_b1_d1.model
    • modified_block_0_npu_b1_d0.model / modified_block_0_npu_b1_d1.model
  • The du -sh footprint is similar to the non-peer build, as expected for “split across dies” rather than “duplicate everything”.

Runtime opens both NPUs during peer inference

  • sudo fuser -v /dev/npu0 /dev/npu1 shows the same PID has both devices open during peer inference.

Both MMZ pools are actively allocated in peer inference

  • cat /proc/eswin/vb reports both pools in use with roughly symmetric free space remaining.
    • Example observed values (per pool):
      • Total: 0x17ffff000 (6.00 GiB)
      • Free: 0x589e2000 (1.38 GiB)
    • This implies ~4.62 GiB in use on each pool (6.00 - 1.38 = 4.62 GiB), consistent with a dual-NPU split.

(If you want exact logs/screenshots, I can attach them; the above is reproducible on the current vendor image.)

Documentation indicates peer/composite is a supported concept (but the build path is missing)

The ENNP docs from the https://github.com/eswincomputing/ebc7702-dev-board-ubuntu repo explicitly describe dual-device / dual-die execution and “composite model” runtime support, and they also describe an offline toolchain. The missing piece is a publicly available, externally usable workflow for generating peer/composite LLM artifacts.

1) Dual-die inference and “composite model” runtime support is documented

From ENNP User Manual v1.3, §7 “Dual-Die Inference”:

  • Describes inference on a “dual-die architecture” and notes pipeline inference can run in parallel on both dies, with guidance to minimize cross-die interaction.
  • References runtime concepts/APIs such as:
    • ES_NPU_GetNumDevices(...)
    • ES_NPU_SetDevice(...)
    • ES_NPU_LoadCompositeModel(...)

From ENNP Developer Manual v0.9.4, the NPU API is documented including:

  • ES_NPU_GetNumDevices / ES_NPU_SetDevice
  • ES_NPU_LoadCompositeModel(...) and the associated NPU_COMPOSITE_MODEL_INFO_S structure (suggesting multi-device model groupings are supported at the runtime/API level).

So peer/composite execution across multiple devices is documented as a supported mode; what is missing is the public model-build pipeline that produces such artifacts. A sketch of what the documented API surface implies follows below.
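For illustration, here is a guess at what loading a composite model across both dies might look like, based only on the API and struct names cited above. Every signature, field, and stub body is an assumption standing in for the unpublished SDK headers and runtime library; the stubs exist only so the sketch compiles standalone:

```c
/*
 * Sketch only: the function and struct names are taken from the ENNP
 * manuals (User Manual v1.3 §7, Developer Manual v0.9.4); the signatures,
 * fields, and stub bodies below are assumptions.
 */
#include <stdio.h>

typedef int ES_S32;

/* Guessed layout: a composite model groups one shard per die. */
typedef struct {
    ES_S32      numShards;
    const char *shardPath[2];   /* e.g. the observed *_d0.model / *_d1.model */
    ES_S32      deviceId[2];    /* die each shard should load onto */
} NPU_COMPOSITE_MODEL_INFO_S;

/* Stub implementations so this compiles standalone; the real ones live in
 * the ENNP runtime this issue asks to be published. */
ES_S32 ES_NPU_GetNumDevices(ES_S32 *count) { *count = 2; return 0; }
ES_S32 ES_NPU_SetDevice(ES_S32 devId) { (void)devId; return 0; }
ES_S32 ES_NPU_LoadCompositeModel(const NPU_COMPOSITE_MODEL_INFO_S *info,
                                 ES_S32 *modelId)
{
    (void)info;
    *modelId = 0;
    return 0;
}

int main(void)
{
    ES_S32 devices = 0;
    if (ES_NPU_GetNumDevices(&devices) != 0 || devices < 2) {
        fprintf(stderr, "expected 2 NPU dies, found %d\n", devices);
        return 1;
    }

    /* How ES_NPU_SetDevice interacts with composite loading is one of the
     * undocumented details; shown here as a plausible default. */
    ES_NPU_SetDevice(0);

    NPU_COMPOSITE_MODEL_INFO_S info = {
        .numShards = 2,
        .shardPath = { "lm_npu_b1_d0.model", "lm_npu_b1_d1.model" },
        .deviceId  = { 0, 1 },
    };

    ES_S32 modelId = -1;
    if (ES_NPU_LoadCompositeModel(&info, &modelId) != 0) {
        fprintf(stderr, "composite load failed\n");
        return 1;
    }
    /* ...create contexts and run inference spanning both dies... */
    printf("composite model %d loaded across %d dies\n", modelId, devices);
    return 0;
}
```

If the real structure differs, even a header dump plus one worked example in the docs would remove most of the guesswork here.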

2) An offline model toolchain is documented (but not publicly distributed in a usable way)

From ENNP User Manual v1.3, §3.3–3.4 and §6.5:

  • Offline tools are named:
    • EsQuant (quantization tool)
    • EsAAC (model compiler)
    • EsGoldenDataGen
    • EsSimulator
  • The docs also state that the EsNNTools folder includes a Docker image containing the toolchain, and they show example usage of running EsAAC inside a container.

This strongly suggests the toolchain exists, but it is not currently published/linked in a way external users can obtain and use to build peer/composite LLM artifacts.

Documentation discoverability / fragmentation (request)

The dual-die/composite references were not discoverable from the main “Framework” repo/docs for the DC-ROMA/Framework workstream, and required digging through a separate vendor dev-board repository and zip bundle to find. Please:

  • Provide a single canonical documentation location for ENNP + model toolchain docs, and
  • Link it clearly from the DC-DeepComputing “Framework” repo (and/or vendor image README), including peer/composite model guidance and the toolchain distribution.

What I’m requesting (in order of usefulness)

1) Toolchain to build peer/composite models

Either source-available or binary release is fine, but it needs to be usable externally. Concretely, publish whatever is required from the documented tool suite (e.g., the EsNNTools Docker image(s) or equivalent packages) plus the missing peer/composite build steps.

Specifically:

  • A documented way to convert an upstream model into ESWIN .model artifacts (docs suggest ONNX is a supported input; document accepted formats and constraints).
  • The peer/composite build path that produces the multi-device shard set (examples):
    • *_d0.model / *_d1.model (or equivalent die/device shard naming)
    • rules for what is duplicated vs partitioned (e.g., embeddings)
    • the correct multi-device config schema (for es_qwen2 and/or direct ENNP runtime usage)

2) Documentation for peer/composite model format + partitioning rules

At minimum, publish:

  • How a composite/peer model package is structured on disk (directory layout, required metadata, naming rules)
  • The partitioning strategy (layer split policy; what must co-reside; what can be independent)
  • Constraints/limits (supported families, quantization requirements, context limits, memory headroom expectations)
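
For concreteness, the observed peer package on the vendor image looks roughly like this. The shard names are real; the annotations and the metadata line are guesses, and documenting exactly this structure is the request:

```
deepseek_7b_1k_int8_peer/
├── lm_npu_b1_d0.model                  # per-die shard pair (die 0 / die 1)
├── lm_npu_b1_d1.model
├── modified_block_0_npu_b1_d0.model    # per-block shard pair (guessed role)
├── modified_block_0_npu_b1_d1.model
├── ...                                 # remaining block shards (illustrative)
└── (config/metadata?)                  # required files/layout undocumented
```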

3) A reproducible reference example

Provide one small reference that takes a known model and produces a peer/composite build:

  • Inputs: upstream weights (or non-peer .model outputs)
  • Output: peer/composite shards + config + a command to run
  • Verification steps: demonstrate that both /dev/npu0 and /dev/npu1 are used and that both MMZ pools allocate (a sketch of such a check follows this list)
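
As a concrete notion of “verification”, a minimal check could look like the following helper, which reports which PIDs hold the NPU device nodes open (a tiny fuser) and dumps /proc/eswin/vb so both pools' usage can be compared. The paths are those observed on the vendor image; everything else is illustrative:

```c
/* Minimal dual-NPU verification sketch for the vendor image's paths
 * (/dev/npu0, /dev/npu1, /proc/eswin/vb). Run as root during peer
 * inference; the exact format of /proc/eswin/vb is vendor-defined. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Scan /proc/<pid>/fd for open handles on the given device node. */
static void pids_holding(const char *dev)
{
    DIR *proc = opendir("/proc");
    struct dirent *p;
    printf("%s held by:", dev);
    while (proc && (p = readdir(proc))) {
        char fddir[PATH_MAX];
        if (atoi(p->d_name) <= 0)
            continue;                       /* not a PID entry */
        snprintf(fddir, sizeof fddir, "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);
        struct dirent *f;
        while (fds && (f = readdir(fds))) {
            char link[PATH_MAX], target[PATH_MAX];
            ssize_t n;
            snprintf(link, sizeof link, "%s/%s", fddir, f->d_name);
            n = readlink(link, target, sizeof target - 1);
            if (n > 0) {
                target[n] = '\0';
                if (strcmp(target, dev) == 0) {
                    printf(" %s", p->d_name);
                    break;
                }
            }
        }
        if (fds)
            closedir(fds);
    }
    if (proc)
        closedir(proc);
    printf("\n");
}

int main(void)
{
    pids_holding("/dev/npu0");
    pids_holding("/dev/npu1");

    /* Dump MMZ pool stats; a peer model should show roughly symmetric
     * in-use sizes on both pools (~4.6 GiB each in the observation above). */
    FILE *vb = fopen("/proc/eswin/vb", "r");
    if (!vb) {
        perror("/proc/eswin/vb");
        return 1;
    }
    char line[512];
    while (fgets(line, sizeof line, vb))
        fputs(line, stdout);
    fclose(vb);
    return 0;
}
```

During peer inference this should show the same PID holding both devices, matching the fuser evidence above.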

4) Optional but very helpful: runtime knobs

If peer/composite mode can be enabled without regenerating artifacts (or with minimal metadata changes), document:

  • environment variables / flags
  • how device selection/binding works (NPU0 vs NPU1) and how allocations map to mmz_nid_0_part_0 / mmz_nid_1_part_0

Acceptance criteria (so we know it’s “done”)

  • External user can take a supported upstream model (or non-peer .model output), follow documented steps, and produce a peer/composite model package.
  • The resulting build:
    • opens both /dev/npu0 and /dev/npu1 during inference
    • allocates from both MMZ pools
    • runs inference successfully (even if performance varies by model)

Environment

  • Board: DC-ROMA II / EIC7702 (FML13V03)
  • OS image: Ubuntu 24.04 LTS v1.0.15019
  • Kernel: Linux roma 6.6.92-eic7x-2025.07 #2025.09.26.03.45+ SMP Fri Sep 26 03:53:01 UTC 2025 riscv64
  • Sample runtime: /opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2

Thanks — publishing the peer/composite build toolchain (or even just the missing “split + package” stage plus format docs) would dramatically increase the value of the platform and enable moving beyond sample models.
