From 7eff7b656a8a2b55aa6c827760786b7545a0a381 Mon Sep 17 00:00:00 2001
From: mfw78
Date: Mon, 3 Mar 2025 09:56:38 +0000
Subject: [PATCH 1/5] feat(swip-26): strictly typed chunk system

---
 SWIPs/swip-26.md | 173 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 SWIPs/swip-26.md

diff --git a/SWIPs/swip-26.md b/SWIPs/swip-26.md
new file mode 100644
index 0000000..3e10dd8
--- /dev/null
+++ b/SWIPs/swip-26.md
@@ -0,0 +1,173 @@
+---
+swip: 26
+title: Standardised Chunk Type Framework
+status: Draft
+type: Standards Track
+category: Core
+author: mfw78 (@mfw78)
+created: 2025-03-03
+---
+
+## Simple Summary
+This SWIP introduces a standardised framework for defining chunk types in Swarm, improving security and interoperability through consistent type identification and validation.
+
+## Abstract
+This SWIP proposes a standardised framework for defining and processing chunk types in Swarm. By creating a formal type system for chunks, including content-addressed chunks (CAC) and single-owner chunks (SOC), we improve security, interoperability, and maintainability across the Swarm ecosystem. The proposal defines a structured approach to chunk identification, versioning, and validation without modifying the wire protocol. Key innovations include fixed-length type-specific headers, deterministic address calculation, and formalised validation rules.
+
+## Motivation
+Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lacks a standardised type system. This creates several issues:
+
+1. **Ambiguous Processing**: Without explicit type information, chunk processing depends on implicit detection methods, leading to potential security vulnerabilities.
+
+2. **Limited Extensibility**: Adding new chunk types requires changes to core validation logic, making it difficult to evolve the system.
+
+3. **Inconsistent Validation**: Chunk validation logic is spread across multiple components, leading to potential inconsistencies.
+
+4. **Type-Safety Gaps**: Without formal type definitions, runtime type errors can occur when processing chunks.
+
+A standardised chunk type framework would address these issues by providing a consistent, extensible system for defining, identifying, and validating different chunk types.
+
+## Specification
+
+### Core Concepts
+
+#### 1. Chunk Structure
+
+A standardised chunk shall conceptually consist of:
+
+1. **Header**: Metadata describing the chunk and its contents
+   - Common Header: Information common to all chunk types (type, version)
+   - Type-Specific Header: Additional fields specific to the chunk type
+2. **Payload**: The actual chunk data
+
+The chunk's address is not part of the chunk itself but is deterministically derived from the chunk's contents based on its type.
+
+#### 2. Common Chunk Header
+
+The common chunk header shall contain:
+
+1. **Type**: The chunk type identifier (1 byte)
+2. **Version**: The chunk format version (1 byte)
+
+| Type ID | Name | Description |
+|---------|------|-------------|
+| 0x00 | CAC | Content-addressed chunk |
+| 0x01 | SOC | Single-owner chunk |
+| 0x02-0xFF | Reserved | Reserved for future chunk types |
+
+#### 3. Fixed-Length Type-Specific Headers
+
+All type-specific headers MUST be of fixed length for their respective chunk types. This ensures that at a wire-level, the maximum size of a chunk is always known and predictable, based on the first 2 bytes (type and version).
+
+Example header sizes:
+- CAC header: 10 bytes (2 bytes common header + 8 bytes span)
+- SOC header: 99 bytes (2 bytes common header + 32 bytes ID + 65 bytes signature)
+
+### Address Calculation
+
+The address of a chunk shall be deterministically calculated based on its type, version, and contents. We define the general address calculation function as:
+
+$$\text{Address} = f_{\text{type}}(\text{header}, \text{payload})$$
+
+Where $f_{\text{type}}$ is the type-specific address calculation function.
+
+#### Generic Address Derivation Function
+
+For any chunk type, the address derivation function can be formally defined as:
+
+$$f_{\text{type}}(\text{header}, \text{payload}) = \mathcal{H}(g_{\text{type}}(\text{header}, \text{payload}))$$
+
+Where:
+- $\mathcal{H}$ is a cryptographic hash function (i.e. `keccak256`)
+- $g_{\text{type}}$ is a type-specific data preparation function
+
+Different chunk types will implement specific derivation functions based on their requirements.
+
+### Chunk Type Specifications
+
+The Swarm Specifications shall define the standardised format for each chunk type. Adding a new chunk type to the specifications requires:
+
+1. Assignment of a unique type identifier
+2. Definition of fixed-length type-specific header structure
+3. Definition of payload structure
+4. Specification of address calculation function $f_{\text{type}}$
+5. Specification of validation requirements
+
+These specifications ensure that all implementations handle chunks consistently and securely across the Swarm ecosystem.
+
+### Type Processing
+
+The chunk processing logic shall:
+
+1. Receive the chunk type and version information from the wire protocol
+2. Use the type and version to determine the expected fixed-length type-specific header size as defined in the Swarm Specifications
+3. Verify that the received header matches the expected size for the given type
+4. Fail fast if the header is malformed or incomplete
+5. Extract the type-specific header fields
+6. Calculate the chunk address using the type-specific address calculation function
+7. Apply type-specific validation rules
+8. Process the payload according to type-specific structure
+
+This approach allows for early validation of chunk integrity based on protocol-level type information, reducing parsing errors and simplifying processing logic.
+
+## Rationale
+
+The proposed standardised chunk type framework addresses several key issues in the current implementation:
+
+1. **Type Ambiguity**: By explicitly encoding chunk types in the header, we eliminate ambiguity in chunk processing, enhancing security and reliability.
+
+2. **Extensibility**: The formal specifications allow for future chunk types to be added in a standardised way without modifying core validation logic.
+
+3. **Validation Consistency**: Centralising validation rules in the specifications ensures consistent enforcement across components and implementations.
+
+4. **Memory Efficiency**: Fixed-length headers enable predictable memory allocation and reduce fragmentation.
+
+5. **Parsing Efficiency**: Type-specific parsing paths reduce the need for speculative parsing, improving performance.
+
+The design choices prioritise:
+- Security through explicit typing and validation
+- Efficiency through predictable memory allocation and fail-fast validation
+- Extensibility through the standardised specification system
+- Backward compatibility with existing chunk types
+
+## Backwards Compatibility
+
+This proposal maintains backward compatibility by:
+
+1. Preserving existing chunk address calculation methods for current chunk types
+2. Supporting current chunk formats with version 0 of each type
+3. Allowing for gradual adoption of the type system
+4. Providing a conversion layer between legacy and new chunk formats
+
+## Test Cases
+
+Test cases should include:
+
+1. **Header Validation**: Tests that verify correct parsing of type-specific headers for different chunk types
+2. **Address Calculation**: Tests that confirm proper address derivation for each chunk type
+3. **Size Verification**: Tests that ensure fixed-length headers meet their size requirements
+4. **Malformed Input**: Tests that verify proper rejection of malformed chunks
+5. **Version Handling**: Tests for correct processing of different versions of the same chunk type
+
+## Implementation
+
+Implementation will proceed in phases:
+
+1. Formalise the chunk type specifications for CAC and SOC in the Swarm Specifications
+2. Implement type-aware chunk processing in the node software
+3. Add validation framework for existing chunk types based on the specifications
+4. Develop compatibility layer for processing legacy chunks
+
+## Security Considerations
+
+The standardised chunk type framework improves security through:
+
+1. **Explicit Type Checking**: Reduces the risk of type confusion attacks
+2. **Fixed-Length Headers**: Prevents buffer overflow attacks
+3. **Early Validation**: Enables fail-fast behaviour for malformed chunks
+4. **Deterministic Addressing**: Ensures consistent and secure chunk addressing
+5. **Versioned Security**: Allows security improvements via version updates
+
+## Copyright Waiver
+
+Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

From d427fc67ef70e6939264321f508a452c8532109f Mon Sep 17 00:00:00 2001
From: mfw78
Date: Mon, 3 Mar 2025 10:09:45 +0000
Subject: [PATCH 2/5] chore(swip-26): add flowchart

---
 SWIPs/swip-26.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/SWIPs/swip-26.md b/SWIPs/swip-26.md
index 3e10dd8..b4ca4d2 100644
--- a/SWIPs/swip-26.md
+++ b/SWIPs/swip-26.md
@@ -110,6 +110,24 @@ The chunk processing logic shall:
 
 This approach allows for early validation of chunk integrity based on protocol-level type information, reducing parsing errors and simplifying processing logic.
 
+#### Flowchart
+
+The flowchart below illustrates the processing steps for a chunk:
+
+```mermaid
+flowchart TD
+    Start[Receive chunk via wire protocol] --> A[Protobuf decodes chunk type, version, header, and payload]
+    A --> B{Header size matches expected size for type?}
+    B -->|No| C[Fail: Invalid header size]
+    B -->|Yes| D[Extract type-specific header fields]
+    D --> E[Calculate chunk address using type-specific function]
+    E --> F{Validate chunk content}
+    F -->|Invalid| G[Fail: Invalid chunk content]
+    F -->|Valid| H[Process payload according to type-specific structure]
+    H --> I[Pass processed chunk to appropriate protocol handler]
+    I --> End[Protocol-specific processing]
+```
+
 ## Rationale
 
 The proposed standardised chunk type framework addresses several key issues in the current implementation:

From 945322294449e1878de2168e6ec1901d198adb60 Mon Sep 17 00:00:00 2001
From: mfw78 <53399572+mfw78@users.noreply.github.com>
Date: Mon, 5 May 2025 07:16:31 +0000
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: significance
---
 SWIPs/swip-26.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/SWIPs/swip-26.md b/SWIPs/swip-26.md
index b4ca4d2..ffc3301 100644
--- a/SWIPs/swip-26.md
+++ b/SWIPs/swip-26.md
@@ -12,10 +12,10 @@ created: 2025-03-03
 This SWIP introduces a standardised framework for defining chunk types in Swarm, improving security and interoperability through consistent type identification and validation.
 
 ## Abstract
-This SWIP proposes a standardised framework for defining and processing chunk types in Swarm. By creating a formal type system for chunks, including content-addressed chunks (CAC) and single-owner chunks (SOC), we improve security, interoperability, and maintainability across the Swarm ecosystem. The proposal defines a structured approach to chunk identification, versioning, and validation without modifying the wire protocol. Key innovations include fixed-length type-specific headers, deterministic address calculation, and formalised validation rules.
+This SWIP proposes a standardised framework for defining and processing chunk types in Swarm. By creating a formal type system for chunks, including content-addressed chunks (CAC) and single-owner chunks (SOC), we improve security, interoperability, and maintainability across the Swarm ecosystem. The proposal defines a structured approach to chunk identification, versioning, and validation. The key innovation is the formal definition of fixed-length type-specific headers to be delivered alongside chunks and formally documenting address determination and payload validation rules.
 
 ## Motivation
-Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lacks a standardised type system. This creates several issues:
+Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lack standardised headers. This creates several issues:
 
 1. **Ambiguous Processing**: Without explicit type information, chunk processing depends on implicit detection methods, leading to potential security vulnerabilities.
 
From 234ffb9266963646b5c85bfe75228e670391e59e Mon Sep 17 00:00:00 2001
From: mfw78
Date: Tue, 20 Jan 2026 22:05:10 +0000
Subject: [PATCH 4/5] feat(swip-26): add wire protocol representation for typed chunks

- Define Chunk protobuf message with type, version, and payload fields
- Specify that all protocol messages referencing chunks MUST use the Chunk message type instead of raw bytes
- Add Delivery message example for pushsync/pullsync integration
- Include migration path for backward compatibility
- Fix minor grammar and style issues
---
 SWIPs/swip-26.md | 53 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/SWIPs/swip-26.md b/SWIPs/swip-26.md
index ffc3301..fc5b19a 100644
--- a/SWIPs/swip-26.md
+++ b/SWIPs/swip-26.md
@@ -15,7 +15,7 @@ This SWIP introduces a standardised framework for defining chunk types in Swarm,
 This SWIP proposes a standardised framework for defining and processing chunk types in Swarm. By creating a formal type system for chunks, including content-addressed chunks (CAC) and single-owner chunks (SOC), we improve security, interoperability, and maintainability across the Swarm ecosystem. The proposal defines a structured approach to chunk identification, versioning, and validation. The key innovation is the formal definition of fixed-length type-specific headers to be delivered alongside chunks and formally documenting address determination and payload validation rules.
 
 ## Motivation
-Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lack standardised headers. This creates several issues:
+Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lacks standardised headers. This creates several issues:
 
 1. **Ambiguous Processing**: Without explicit type information, chunk processing depends on implicit detection methods, leading to potential security vulnerabilities.
 
@@ -78,7 +78,7 @@ For any chunk type, the address derivation function can be formally defined as:
 $$f_{\text{type}}(\text{header}, \text{payload}) = \mathcal{H}(g_{\text{type}}(\text{header}, \text{payload}))$$
 
 Where:
-- $\mathcal{H}$ is a cryptographic hash function (i.e. `keccak256`)
+- $\mathcal{H}$ is a cryptographic hash function (e.g. `keccak256`)
 - $g_{\text{type}}$ is a type-specific data preparation function
 
 Different chunk types will implement specific derivation functions based on their requirements.
@@ -128,6 +128,55 @@ flowchart TD
     I --> End[Protocol-specific processing]
 ```
 
+### Wire Protocol Representation
+
+To enable typed chunks at the wire level, the following Protocol Buffer definitions shall be used:
+
+#### Chunk Message
+
+```protobuf
+message Chunk {
+  uint32 type = 1;     // Chunk type identifier (see type table)
+  uint32 version = 2;  // Chunk format version
+  bytes payload = 3;   // Type-specific header + chunk data
+}
+```
+
+The `payload` field contains the concatenation of the type-specific header and the chunk data. Based on the `type` and `version` fields, the receiver can determine the fixed-length type-specific header size and extract it from the beginning of the payload.
+
+For example:
+- **CAC (type=0, version=0)**: `payload` = span (8 bytes) || BMT chunk data
+- **SOC (type=0x01, version=0)**: `payload` = ID (32 bytes) || signature (65 bytes) || wrapped chunk data
+
+#### Integration with Existing Protocols
+
+All protocol buffer definitions that reference chunk data MUST use the typed `Chunk` message instead of raw `bytes`. This ensures consistent type information is available at the wire level across all protocols.
+
+For example, the `Delivery` message used by pushsync and pullsync protocols shall be updated:
+
+```protobuf
+message Delivery {
+  bytes address = 1;
+  Chunk data = 2;
+  bytes stamp = 3;
+}
+```
+
+This pattern applies universally: any protocol message that transmits chunk content MUST embed the `Chunk` message type, ensuring:
+
+1. Chunk type and version are always available at the wire level
+2. Recipients can determine the expected type-specific header size
+3. Address calculation and validation can be performed using type-specific rules
+4. Consistent handling across all protocols that deal with chunks
+
+#### Migration Path
+
+During the transition period, implementations should:
+
+1. Accept both legacy `Delivery` messages (with raw bytes) and new typed `Delivery` messages
+2. When receiving legacy messages, attempt heuristic type detection for backward compatibility
+3. When sending, prefer the new typed format if the peer supports it
+
 ## Rationale
 
 The proposed standardised chunk type framework addresses several key issues in the current implementation:

From a71c816f1497f432eba29030495d0c930db14249 Mon Sep 17 00:00:00 2001
From: mfw78
Date: Tue, 20 Jan 2026 22:08:26 +0000
Subject: [PATCH 5/5] chore(swip-26): update migration path to breaking change

- Remove backward compatibility with legacy messages
- Specify lazy determination and population of type information for existing localstore data
---
 SWIPs/swip-26.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/SWIPs/swip-26.md b/SWIPs/swip-26.md
index fc5b19a..57c2e20 100644
--- a/SWIPs/swip-26.md
+++ b/SWIPs/swip-26.md
@@ -171,11 +171,13 @@ This pattern applies universally: any protocol message that transmits chunk cont
 
 #### Migration Path
 
-During the transition period, implementations should:
+This specification represents a breaking change; implementations MUST use the typed `Chunk` message for all wire protocol communications.
 
-1. Accept both legacy `Delivery` messages (with raw bytes) and new typed `Delivery` messages
-2. When receiving legacy messages, attempt heuristic type detection for backward compatibility
-3. When sending, prefer the new typed format if the peer supports it
+For existing data in the localstore that lacks type information, implementations should:
+
+1. Determine the chunk type heuristically upon access (e.g. by examining the chunk structure)
+2. Lazily populate the type information in the localstore when chunks are retrieved
+3. Avoid a large upfront migration by only updating type metadata as chunks are accessed
 
 ## Rationale
 
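
Reviewer note on the "Wire Protocol Representation" section added in PATCH 4/5: the receiver-side step, where `type` and `version` select a fixed header size that is then cut from the front of `payload`, might be sketched as below. This is only an illustration under stated assumptions, not the proposed implementation. The names `HEADER_SIZES`, `ParsedChunk` and `split_payload` are hypothetical, and the sizes are taken from the CAC and SOC payload layouts listed in that section.

```python
from dataclasses import dataclass

# Hypothetical lookup table: type-specific header size in bytes, keyed by
# (type, version). Sizes follow the payload layouts given in the SWIP text:
# CAC = 8-byte span; SOC = 32-byte ID + 65-byte signature.
HEADER_SIZES = {
    (0x00, 0): 8,
    (0x01, 0): 32 + 65,
}


@dataclass(frozen=True)
class ParsedChunk:
    chunk_type: int
    version: int
    header: bytes  # fixed-length type-specific header
    data: bytes    # remaining chunk data


def split_payload(chunk_type: int, version: int, payload: bytes) -> ParsedChunk:
    """Split a typed payload into its fixed-length header and chunk data."""
    size = HEADER_SIZES.get((chunk_type, version))
    if size is None:
        # Fail fast on type/version pairs that have no specification.
        raise ValueError(f"unknown chunk type/version: {chunk_type:#04x}/{version}")
    if len(payload) < size:
        # Fail fast when the payload cannot hold the full header.
        raise ValueError(f"payload too short: need {size} header bytes, got {len(payload)}")
    return ParsedChunk(chunk_type, version, payload[:size], payload[size:])


# Usage: a CAC payload with an 8-byte span followed by the data. The
# little-endian span encoding is an assumption made for this example.
cac = split_payload(0x00, 0, (5).to_bytes(8, "little") + b"hello")
```

Address calculation and type-specific validation would follow this step and are deliberately left out of the sketch.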