Conversation

@andygrove andygrove commented Jan 22, 2026

Summary

  • Renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck`
  • Changes the default from `false` to `true` (safety check enabled by default)
  • Removes ByteType from the safety check, leaving only ShortType subject to fallback

Why ByteType is Safe

ByteType columns are always safe for native execution because:

  1. Parquet type mapping: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType.

  2. UINT_8 maps to ShortType: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).

  3. Truncation preserves signed values: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.
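
The three points above can be checked numerically. The following is a minimal Python sketch of two's-complement behavior, not Comet or Spark code:

```python
def as_signed8(raw: int) -> int:
    """Interpret the low 8 bits of an integer as a signed two's-complement byte."""
    b = raw & 0xFF
    return b - 256 if b >= 128 else b

# Signed INT8 values survive a round trip through a wider representation:
# truncating the 32-bit two's-complement form back to 8 bits recovers the value.
for v in (-128, -1, 0, 127):
    widened = v & 0xFFFFFFFF
    assert as_signed8(widened) == v

# Unsigned UINT_8 values 128-255 do NOT fit in a signed byte, which is
# why Spark promotes them to ShortType rather than ByteType:
assert as_signed8(200) == -56  # 200 reinterpreted as a signed byte flips sign
```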

Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

  1. Ambiguous origin: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible).

  2. Different reader behavior: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.

  3. No metadata available: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns.
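
The reader discrepancy in point 2 can be illustrated with a sketch of the two interpretations of the same physical byte. These helper functions are hypothetical models of the behavior described above, not the actual reader implementations:

```python
def spark_reads_uint8_as_signed(raw: int) -> int:
    """Model Spark's behavior: the UINT_8 logical type is ignored, so the
    byte is interpreted as signed INT8 before landing in the ShortType column."""
    b = raw & 0xFF
    return b - 256 if b >= 128 else b

def arrow_reads_uint8_as_unsigned(raw: int) -> int:
    """Model Arrow-based readers: the UINT_8 logical type is honored,
    so the byte keeps its unsigned value."""
    return raw & 0xFF

# Values 0-127 agree under both interpretations:
assert spark_reads_uint8_as_signed(42) == arrow_reads_uint8_as_unsigned(42) == 42

# Values 128-255 diverge: same physical byte, different results.
assert arrow_reads_uint8_as_unsigned(200) == 200
assert spark_reads_uint8_as_signed(200) == -56
```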

Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`.
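
For example, in a PySpark session (the session setup here is illustrative and assumes Comet is already configured on the classpath):

```python
from pyspark.sql import SparkSession

# Disable the fallback only if you are certain that no Parquet input
# contains unsigned UINT_8 columns.
spark = (
    SparkSession.builder
    .config("spark.comet.scan.unsignedSmallIntSafetyCheck", "false")
    .getOrCreate()
)
```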

🤖 Generated with Claude Code

@andygrove andygrove marked this pull request as ready for review January 22, 2026 01:55
@andygrove andygrove marked this pull request as draft January 22, 2026 02:18
andygrove and others added 5 commits January 22, 2026 07:34
- Use local `root_op` variable instead of unwrapping `exec_context.root_op`
- Replace `is_some()` + `unwrap()` pattern with `if let Some(...)`

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove modified the milestone: 0.13.0 Jan 22, 2026