add swip-26: strictly typed chunk system #67
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,242 @@ | ||
| --- | ||
| swip: 26 | ||
| title: Standardised Chunk Type Framework | ||
| status: Draft | ||
| type: Standards Track | ||
| category: Core | ||
| author: mfw78 (@mfw78) | ||
| created: 2025-03-03 | ||
| --- | ||
|
|
||
| ## Simple Summary | ||
| This SWIP introduces a standardised framework for defining chunk types in Swarm, improving security and interoperability through consistent type identification and validation. | ||
|
|
||
| ## Abstract | ||
| This SWIP proposes a standardised framework for defining and processing chunk types in Swarm. By creating a formal type system for chunks, including content-addressed chunks (CAC) and single-owner chunks (SOC), we improve security, interoperability, and maintainability across the Swarm ecosystem. The proposal defines a structured approach to chunk identification, versioning, and validation. The key innovation is the formal definition of fixed-length type-specific headers delivered alongside chunks, together with formal documentation of the address determination and payload validation rules. | ||
|
|
||
| ## Motivation | ||
| Swarm's storage layer is built around chunks as the fundamental unit of data. Currently, the system supports multiple chunk types, but lacks standardised headers. This creates several issues: | ||
|
|
||
| 1. **Ambiguous Processing**: Without explicit type information, chunk processing depends on implicit detection methods, leading to potential security vulnerabilities. | ||
|
|
||
| 2. **Limited Extensibility**: Adding new chunk types requires changes to core validation logic, making it difficult to evolve the system. | ||
|
||
|
|
||
| 3. **Inconsistent Validation**: Chunk validation logic is spread across multiple components, leading to potential inconsistencies. | ||
|
|
||
| 4. **Type-Safety Gaps**: Without formal type definitions, runtime type errors can occur when processing chunks. | ||
|
|
||
| A standardised chunk type framework would address these issues by providing a consistent, extensible system for defining, identifying, and validating different chunk types. | ||
|
|
||
| ## Specification | ||
|
|
||
| ### Core Concepts | ||
|
|
||
| #### 1. Chunk Structure | ||
|
|
||
| A standardised chunk shall conceptually consist of: | ||
|
|
||
| 1. **Header**: Metadata describing the chunk and its contents | ||
| - Common Header: Information common to all chunk types (type, version) | ||
| - Type-Specific Header: Additional fields specific to the chunk type | ||
| 2. **Payload**: The actual chunk data | ||
|
|
||
| The chunk's address is not part of the chunk itself but is deterministically derived from the chunk's contents based on its type. | ||
|
|
||
| #### 2. Common Chunk Header | ||
|
|
||
| The common chunk header shall contain: | ||
|
|
||
| 1. **Type**: The chunk type identifier (1 byte) | ||
| 2. **Version**: The chunk format version (1 byte) | ||
|
|
||
| | Type ID | Name | Description | | ||
| |---------|------|-------------| | ||
| | 0x00 | CAC | Content-addressed chunk | | ||
| | 0x01 | SOC | Single-owner chunk | | ||
| | 0x02-0xFF | Reserved | Reserved for future chunk types | | ||
|
|
||
| #### 3. Fixed-Length Type-Specific Headers | ||
|
|
||
| All type-specific headers MUST be of fixed length for their respective chunk types. This ensures that, at the wire level, the maximum size of a chunk is always known and predictable from the first 2 bytes (type and version). | ||
|
|
||
| Example header sizes: | ||
| - CAC header: 10 bytes (2 bytes common header + 8 bytes span) | ||
| - SOC header: 99 bytes (2 bytes common header + 32 bytes ID + 65 bytes signature) | ||
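As a rough illustration of the fixed-length rule, the header sizes above can be kept in a lookup keyed by type and version. This is only a sketch: the names `COMMON_HEADER_SIZE`, `TYPE_SPECIFIC_HEADER_SIZES`, and `header_size` are hypothetical and are not defined by this SWIP.

```python
# Hypothetical registry of fixed header sizes, keyed by (type, version).
# Only the two example sizes listed above are filled in.
COMMON_HEADER_SIZE = 2  # type (1 byte) + version (1 byte)

TYPE_SPECIFIC_HEADER_SIZES = {
    (0x00, 0): 8,        # CAC: span
    (0x01, 0): 32 + 65,  # SOC: ID + signature
}


def header_size(chunk_type: int, version: int) -> int:
    """Return the full header size (common + type-specific) in bytes."""
    key = (chunk_type, version)
    if key not in TYPE_SPECIFIC_HEADER_SIZES:
        # Unknown or reserved type/version: the size cannot be determined.
        raise ValueError(f"unknown chunk type/version: {chunk_type:#04x}/{version}")
    return COMMON_HEADER_SIZE + TYPE_SPECIFIC_HEADER_SIZES[key]
```

Under these assumptions, `header_size(0x00, 0)` gives 10 and `header_size(0x01, 0)` gives 99, matching the example sizes; a reserved type ID raises an error rather than guessing a size.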
|
|
||
| ### Address Calculation | ||
|
|
||
| The address of a chunk shall be deterministically calculated based on its type, version, and contents. We define the general address calculation function as: | ||
|
|
||
| $$\text{Address} = f_{\text{type}}(\text{header}, \text{payload})$$ | ||
|
|
||
| Where $f_{\text{type}}$ is the type-specific address calculation function. | ||
|
|
||
| #### Generic Address Derivation Function | ||
|
|
||
| For any chunk type, the address derivation function can be formally defined as: | ||
|
|
||
| $$f_{\text{type}}(\text{header}, \text{payload}) = \mathcal{H}(g_{\text{type}}(\text{header}, \text{payload}))$$ | ||
|
|
||
| Where: | ||
| - $\mathcal{H}$ is a cryptographic hash function (e.g. `keccak256`) | ||
| - $g_{\text{type}}$ is a type-specific data preparation function | ||
|
|
||
| Different chunk types will implement specific derivation functions based on their requirements. | ||
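The composition of $\mathcal{H}$ and $g_{\text{type}}$ can be sketched as follows. Two things here are assumptions made purely for illustration: `stand_in_hash` uses SHA3-256 from the Python standard library, which is not `keccak256` and produces different digests, and `g_concat` is a placeholder preparation function, not the preparation function of any real chunk type.

```python
import hashlib
from typing import Callable

PrepareFn = Callable[[bytes, bytes], bytes]
HashFn = Callable[[bytes], bytes]


def stand_in_hash(data: bytes) -> bytes:
    # ASSUMPTION: SHA3-256 stands in for H. It is NOT keccak256
    # (the padding differs), so real Swarm addresses will not match.
    return hashlib.sha3_256(data).digest()


def g_concat(header: bytes, payload: bytes) -> bytes:
    # Placeholder g_type: plain concatenation of header and payload.
    # Real chunk types define their own preparation step.
    return header + payload


def derive_address(g: PrepareFn, header: bytes, payload: bytes,
                   h: HashFn = stand_in_hash) -> bytes:
    """Compute f_type(header, payload) = H(g_type(header, payload))."""
    return h(g(header, payload))
```

Passing `g` and `h` as parameters keeps the generic shape of the formula visible: a new chunk type would only need to supply its own preparation function.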
|
|
||
| ### Chunk Type Specifications | ||
|
|
||
| The Swarm Specifications shall define the standardised format for each chunk type. Adding a new chunk type to the specifications requires: | ||
|
|
||
| 1. Assignment of a unique type identifier | ||
| 2. Definition of fixed-length type-specific header structure | ||
| 3. Definition of payload structure | ||
| 4. Specification of address calculation function $f_{\text{type}}$ | ||
| 5. Specification of validation requirements | ||
|
|
||
| These specifications ensure that all implementations handle chunks consistently and securely across the Swarm ecosystem. | ||
|
|
||
| ### Type Processing | ||
|
|
||
| The chunk processing logic shall: | ||
|
|
||
| 1. Receive the chunk type and version information from the wire protocol | ||
| 2. Use the type and version to determine the expected fixed-length type-specific header size as defined in the Swarm Specifications | ||
| 3. Verify that the received header matches the expected size for the given type | ||
| 4. Fail fast if the header is malformed or incomplete | ||
| 5. Extract the type-specific header fields | ||
| 6. Calculate the chunk address using the type-specific address calculation function | ||
| 7. Apply type-specific validation rules | ||
| 8. Process the payload according to type-specific structure | ||
|
|
||
| This approach allows for early validation of chunk integrity based on protocol-level type information, reducing parsing errors and simplifying processing logic. | ||
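The processing steps listed above might look roughly like this in code. Everything named here (`ChunkSpec`, `SPECS`, `process`, the toy address and validation functions) is hypothetical; the toy functions are stand-ins and do not reproduce the real CAC address or validation rules, and SHA3-256 is used only because it ships with Python (it is not `keccak256`).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ChunkSpec:
    header_size: int                          # fixed type-specific header length
    address: Callable[[bytes, bytes], bytes]  # f_type(header, data)
    validate: Callable[[bytes, bytes], bool]  # type-specific validation rules


def toy_address(header: bytes, data: bytes) -> bytes:
    # Stand-in for f_type: SHA3-256 (not keccak256) over header || data.
    return hashlib.sha3_256(header + data).digest()


def toy_validate(header: bytes, data: bytes) -> bool:
    # Toy rule for illustration: the 8-byte header, read as a
    # little-endian integer, must equal the data length.
    return int.from_bytes(header, "little") == len(data)


# Hypothetical registry keyed by (type, version).
SPECS: Dict[Tuple[int, int], ChunkSpec] = {
    (0x00, 0): ChunkSpec(8, toy_address, toy_validate),
}


def process(chunk_type: int, version: int, payload: bytes) -> Tuple[bytes, bytes]:
    """Return (address, data) or raise ValueError as early as possible."""
    spec = SPECS.get((chunk_type, version))          # steps 1-2
    if spec is None:
        raise ValueError("unknown chunk type or version")
    if len(payload) < spec.header_size:              # steps 3-4: fail fast
        raise ValueError("header is incomplete")
    header = payload[:spec.header_size]              # step 5
    data = payload[spec.header_size:]
    addr = spec.address(header, data)                # step 6
    if not spec.validate(header, data):              # step 7
        raise ValueError("chunk failed type-specific validation")
    return addr, data                                # step 8: hand off payload
```

Because the header size is known from the type and version alone, the length check runs before any field is parsed or any hash is computed.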
|
|
||
| #### Flowchart | ||
|
|
||
| The flowchart below illustrates the processing steps for a chunk: | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| Start[Receive chunk via wire protocol] --> A[Protobuf decodes chunk type, version, and payload] | ||
| A --> B{Header size matches expected size for type?} | ||
| B -->|No| C[Fail: Invalid header size] | ||
| B -->|Yes| D[Extract type-specific header fields] | ||
| D --> E[Calculate chunk address using type-specific function] | ||
| E --> F{Validate chunk content} | ||
| F -->|Invalid| G[Fail: Invalid chunk content] | ||
| F -->|Valid| H[Process payload according to type-specific structure] | ||
| H --> I[Pass processed chunk to appropriate protocol handler] | ||
| I --> End[Protocol-specific processing] | ||
| ``` | ||
|
|
||
| ### Wire Protocol Representation | ||
|
|
||
| To enable typed chunks at the wire level, the following Protocol Buffer definitions shall be used: | ||
|
|
||
| #### Chunk Message | ||
|
|
||
| ```protobuf | ||
| message Chunk { | ||
| uint32 type = 1; // Chunk type identifier (see type table) | ||
| uint32 version = 2; // Chunk format version | ||
| bytes payload = 3; // Type-specific header + chunk data | ||
| } | ||
| ``` | ||
|
|
||
| The `payload` field contains the concatenation of the type-specific header and the chunk data. Based on the `type` and `version` fields, the receiver can determine the fixed-length type-specific header size and extract it from the beginning of the payload. | ||
|
|
||
| For example: | ||
| - **CAC (type=0x00, version=0)**: `payload` = span (8 bytes) || BMT chunk data | ||
| - **SOC (type=0x01, version=0)**: `payload` = ID (32 bytes) || signature (65 bytes) || wrapped chunk data | ||
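A minimal sketch of how a receiver could split the `payload` field for the two examples above, using the stated field lengths. The function and type names are hypothetical, and the fields are left as raw bytes because this section does not specify how they are interpreted.

```python
from typing import NamedTuple, Tuple

SPAN_SIZE = 8        # CAC span
SOC_ID_SIZE = 32     # SOC ID
SOC_SIG_SIZE = 65    # SOC signature


class SocParts(NamedTuple):
    identifier: bytes
    signature: bytes
    wrapped: bytes


def split_cac_payload(payload: bytes) -> Tuple[bytes, bytes]:
    """Split a CAC payload into (span, chunk data)."""
    if len(payload) < SPAN_SIZE:
        raise ValueError("CAC payload shorter than its 8-byte header")
    return payload[:SPAN_SIZE], payload[SPAN_SIZE:]


def split_soc_payload(payload: bytes) -> SocParts:
    """Split a SOC payload into ID, signature, and wrapped chunk data."""
    header_len = SOC_ID_SIZE + SOC_SIG_SIZE  # 97 bytes
    if len(payload) < header_len:
        raise ValueError("SOC payload shorter than its 97-byte header")
    return SocParts(
        identifier=payload[:SOC_ID_SIZE],
        signature=payload[SOC_ID_SIZE:header_len],
        wrapped=payload[header_len:],
    )
```

Note that the 2-byte common header does not appear in `payload`: in the `Chunk` message, type and version travel as separate fields.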
|
|
||
| #### Integration with Existing Protocols | ||
|
|
||
| All protocol buffer definitions that reference chunk data MUST use the typed `Chunk` message instead of raw `bytes`. This ensures consistent type information is available at the wire level across all protocols. | ||
|
|
||
| For example, the `Delivery` message used by pushsync and pullsync protocols shall be updated: | ||
|
|
||
| ```protobuf | ||
| message Delivery { | ||
| bytes address = 1; | ||
| Chunk data = 2; | ||
| bytes stamp = 3; | ||
| } | ||
| ``` | ||
|
|
||
| This pattern applies universally: any protocol message that transmits chunk content MUST embed the `Chunk` message type, ensuring: | ||
|
|
||
| 1. Chunk type and version are always available at the wire level | ||
| 2. Recipients can determine the expected type-specific header size | ||
| 3. Address calculation and validation can be performed using type-specific rules | ||
| 4. Consistent handling across all protocols that deal with chunks | ||
|
|
||
| #### Migration Path | ||
|
|
||
| This specification represents a breaking change; implementations MUST use the typed `Chunk` message for all wire protocol communications. | ||
|
|
||
| For existing data in the localstore that lacks type information, implementations should: | ||
|
|
||
| 1. Determine the chunk type heuristically upon access (e.g. by examining the chunk structure) | ||
| 2. Lazily populate the type information in the localstore when chunks are retrieved | ||
| 3. Avoid a large upfront migration by only updating type metadata as chunks are accessed | ||
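The lazy approach described in these steps could be sketched as below. `LazyTypedStore` is a hypothetical in-memory stand-in for the localstore, and the detection heuristic is deliberately left as a caller-supplied function, since this SWIP does not define one. Assigning version 0 follows the statement elsewhere in this proposal that current chunk formats map to version 0 of each type.

```python
from typing import Callable, Dict, Tuple

# Caller-supplied heuristic: (address, data) -> chunk type ID.
DetectFn = Callable[[bytes, bytes], int]


class LazyTypedStore:
    """Toy in-memory store that fills in type metadata on first access."""

    def __init__(self, detect: DetectFn) -> None:
        self._data: Dict[bytes, bytes] = {}
        self._meta: Dict[bytes, Tuple[int, int]] = {}
        self._detect = detect
        self.detections = 0  # how many times the heuristic actually ran

    def put_legacy(self, address: bytes, data: bytes) -> None:
        # Legacy entries are stored without any type information.
        self._data[address] = data

    def get(self, address: bytes) -> Tuple[int, int, bytes]:
        """Return (type, version, data), detecting the type only once."""
        data = self._data[address]
        meta = self._meta.get(address)
        if meta is None:
            self.detections += 1
            # Existing chunk formats are treated as version 0 of their type.
            meta = (self._detect(address, data), 0)
            self._meta[address] = meta
        return meta[0], meta[1], data
```

With this shape, chunks that are never read are never migrated, and a chunk that is read repeatedly pays the detection cost only on its first retrieval.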
|
|
||
| ## Rationale | ||
|
|
||
| The proposed standardised chunk type framework addresses several key issues in the current implementation: | ||
|
|
||
| 1. **Type Ambiguity**: By explicitly encoding chunk types in the header, we eliminate ambiguity in chunk processing, enhancing security and reliability. | ||
|
|
||
| 2. **Extensibility**: The formal specifications allow for future chunk types to be added in a standardised way without modifying core validation logic. | ||
|
||
|
|
||
| 3. **Validation Consistency**: Centralising validation rules in the specifications ensures consistent enforcement across components and implementations. | ||
|
|
||
| 4. **Memory Efficiency**: Fixed-length headers enable predictable memory allocation and reduce fragmentation. | ||
|
|
||
| 5. **Parsing Efficiency**: Type-specific parsing paths reduce the need for speculative parsing, improving performance. | ||
|
|
||
| The design choices prioritise: | ||
| - Security through explicit typing and validation | ||
| - Efficiency through predictable memory allocation and fail-fast validation | ||
| - Extensibility through the standardised specification system | ||
| - Backward compatibility with existing chunk types | ||
|
|
||
| ## Backwards Compatibility | ||
|
|
||
| Although the typed wire format is a breaking protocol change (see Migration Path), this proposal preserves compatibility with existing chunks by: | ||
|
|
||
| 1. Preserving existing chunk address calculation methods for current chunk types | ||
| 2. Supporting current chunk formats with version 0 of each type | ||
| 3. Allowing for gradual adoption of the type system | ||
|
Member
for me, first, it seems like it needs a breaking change in the base protocol to handle type headers.
Author
A localstore migration can be used in order to both assign type, and version numbers to chunks contained within the localstore, assigning version 0 to the respective chunk types.
Member
i think it is fine for a client to opine methodology around when to determine chunk types in the first instance as far as the protocol is concerned, i believe protobuf fields are optional, but sending headers should become mandatory once the swarm has had time to adjust. in this way the change could be made with less disruption. let's pin down our approach. thank you for raising this @nugaon |
||
| 4. Providing a conversion layer between legacy and new chunk formats | ||
|
|
||
| ## Test Cases | ||
|
|
||
| Test cases should include: | ||
|
|
||
| 1. **Header Validation**: Tests that verify correct parsing of type-specific headers for different chunk types | ||
| 2. **Address Calculation**: Tests that confirm proper address derivation for each chunk type | ||
| 3. **Size Verification**: Tests that ensure fixed-length headers meet their size requirements | ||
| 4. **Malformed Input**: Tests that verify proper rejection of malformed chunks | ||
| 5. **Version Handling**: Tests for correct processing of different versions of the same chunk type | ||
|
|
||
| ## Implementation | ||
|
|
||
| Implementation will proceed in phases: | ||
|
|
||
| 1. Formalise the chunk type specifications for CAC and SOC in the Swarm Specifications | ||
| 2. Implement type-aware chunk processing in the node software | ||
| 3. Add validation framework for existing chunk types based on the specifications | ||
| 4. Develop compatibility layer for processing legacy chunks | ||
|
|
||
| ## Security Considerations | ||
|
|
||
| The standardised chunk type framework improves security through: | ||
|
|
||
| 1. **Explicit Type Checking**: Reduces the risk of type confusion attacks | ||
| 2. **Fixed-Length Headers**: Reduces the risk of buffer over-reads and overflows, since header bounds are known before parsing | ||
| 3. **Early Validation**: Enables fail-fast behaviour for malformed chunks | ||
| 4. **Deterministic Addressing**: Ensures consistent and secure chunk addressing | ||
| 5. **Versioned Security**: Allows security improvements via version updates | ||
|
|
||
| ## Copyright Waiver | ||
|
|
||
| Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). | ||