Conversation

@andygrove andygrove commented Jan 22, 2026

Summary

  • Renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck`
  • Changes the default from `false` to `true` (safety check enabled by default)
  • Removes ByteType from the safety check, leaving only ShortType subject to fallback

Why ByteType is Safe

ByteType columns are always safe for native execution because:

  1. Parquet type mapping: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType.

  2. UINT_8 maps to ShortType: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).

  3. Truncation preserves signed values: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.
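
The three points above can be checked numerically. The following is a minimal Python sketch of two's-complement behavior, not Comet or Spark code:

```python
def as_signed8(raw: int) -> int:
    """Interpret the low 8 bits of an integer as a signed two's-complement byte."""
    b = raw & 0xFF
    return b - 256 if b >= 128 else b

# Signed INT8 values survive a round trip through a wider representation:
# truncating the 32-bit two's-complement form back to 8 bits recovers the value.
for v in (-128, -1, 0, 127):
    widened = v & 0xFFFFFFFF
    assert as_signed8(widened) == v

# Unsigned UINT_8 values 128-255 do NOT fit in a signed byte, which is
# why Spark promotes them to ShortType rather than ByteType:
assert as_signed8(200) == -56  # 200 reinterpreted as a signed byte flips sign
```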

Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

  1. Ambiguous origin: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible).

  2. Different reader behavior: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.

  3. No metadata available: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns.
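
The reader discrepancy in point 2 can be illustrated with a sketch of the two interpretations of the same physical byte. These helper functions are hypothetical models of the behavior described above, not the actual reader implementations:

```python
def spark_reads_uint8_as_signed(raw: int) -> int:
    """Model Spark's behavior: the UINT_8 logical type is ignored, so the
    byte is interpreted as signed INT8 before landing in the ShortType column."""
    b = raw & 0xFF
    return b - 256 if b >= 128 else b

def arrow_reads_uint8_as_unsigned(raw: int) -> int:
    """Model Arrow-based readers: the UINT_8 logical type is honored,
    so the byte keeps its unsigned value."""
    return raw & 0xFF

# Values 0-127 agree under both interpretations:
assert spark_reads_uint8_as_signed(42) == arrow_reads_uint8_as_unsigned(42) == 42

# Values 128-255 diverge: same physical byte, different results.
assert arrow_reads_uint8_as_unsigned(200) == 200
assert spark_reads_uint8_as_signed(200) == -56
```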

Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`.
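
For example, in a PySpark session (the session setup here is illustrative and assumes Comet is already configured on the classpath):

```python
from pyspark.sql import SparkSession

# Disable the fallback only if you are certain that no Parquet input
# contains unsigned UINT_8 columns.
spark = (
    SparkSession.builder
    .config("spark.comet.scan.unsignedSmallIntSafetyCheck", "false")
    .getOrCreate()
)
```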

🤖 Generated with Claude Code

@andygrove andygrove marked this pull request as ready for review January 22, 2026 01:55
@andygrove andygrove marked this pull request as draft January 22, 2026 02:18
andygrove and others added 5 commits January 22, 2026 07:34
- Use local `root_op` variable instead of unwrapping `exec_context.root_op`
- Replace `is_some()` + `unwrap()` pattern with `if let Some(...)`

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove modified the milestone: 0.13.0 Jan 22, 2026