@andygrove andygrove commented Jan 20, 2026

Summary

Optimizes primitive type array iteration in native shuffle by using bulk operations instead of per-element iteration.

Key optimizations:

  • Non-nullable path: Uses append_slice() for optimal memcpy-style copy
  • Nullable path: Uses pointer iteration with efficient null bitset reading (reads 64 bits at a time)

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp
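
To make the nullable path above more concrete, here is a minimal sketch of reading a null bitset one 64-bit word at a time. It is not the code from this PR: `to_options` is a hypothetical helper, `Vec<Option<i32>>` stands in for an Arrow builder, and the layout (bit `i % 64` of word `i / 64` set when element `i` is null) is an assumption for illustration.

```rust
/// Pairs each value with its null flag, loading the flags for 64 elements
/// with a single word read instead of one bitset lookup per element.
/// Assumed layout: bit `i % 64` of `null_words[i / 64]` is 1 when element `i` is null.
fn to_options(values: &[i32], null_words: &[u64]) -> Vec<Option<i32>> {
    assert!(null_words.len() >= values.len().div_ceil(64));
    let mut out = Vec::with_capacity(values.len());
    for (word_idx, chunk) in values.chunks(64).enumerate() {
        // One load covers the null flags of up to 64 elements.
        let word = null_words[word_idx];
        if word == 0 {
            // Fast path: no nulls in this group, so copy every value as valid.
            out.extend(chunk.iter().copied().map(Some));
        } else {
            for (bit, &value) in chunk.iter().enumerate() {
                let is_null = (word >> bit) & 1 == 1;
                out.push(if is_null { None } else { Some(value) });
            }
        }
    }
    out
}

fn main() {
    // 70 elements need two words; elements 1, 3, and 65 are marked null.
    let values: Vec<i32> = (0..70).collect();
    let out = to_options(&values, &[0b1010, 0b10]);
    assert_eq!(out[1], None);
    assert_eq!(out[65], None);
    assert_eq!(out[64], Some(64));
    println!("{} elements, {} nulls", out.len(), out.iter().filter(|v| v.is_none()).count());
}
```

Even with word-at-a-time reads, each element in a group that contains nulls still needs its own branch, which may be one reason the nullable rows in the benchmark show much smaller gains than the non-nullable ones.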

Benchmark Results

Benchmark for converting SparkUnsafeArray (10K elements) to Arrow array:

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | 9.3x |
| i32/with_nulls | 22.49µs | 16.21µs | 1.39x |
| i64/no_nulls | 6.15µs | 1.22µs | 5x |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | 6.6x |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | ~6µs* | 0.66µs | ~9x |
| timestamp/no_nulls | ~6µs* | 1.21µs | ~5x |

*Baseline estimated from similar types
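
For reference, the Speedup column is the baseline time divided by the optimized time. The short sketch below recomputes it for the measured rows; `speedup` is a helper name introduced here for illustration, and the timings are copied from the table above.

```rust
/// Speedup factor: how many times faster the optimized run is than the baseline.
fn speedup(baseline_us: f64, optimized_us: f64) -> f64 {
    baseline_us / optimized_us
}

fn main() {
    // (benchmark, baseline in µs, optimized in µs), taken from the table.
    let rows = [
        ("i32/no_nulls", 6.08, 0.65),
        ("i32/with_nulls", 22.49, 16.21),
        ("i64/no_nulls", 6.15, 1.22),
        ("i64/with_nulls", 16.41, 16.41),
        ("f64/no_nulls", 8.05, 1.22),
        ("f64/with_nulls", 16.52, 16.21),
    ];
    for (name, baseline, optimized) in rows {
        println!("{name}: {:.2}x", speedup(baseline, optimized));
    }
}
```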

Why such dramatic improvement for non-nullable?

The original code appended elements one by one using index-based access:

for idx in 0..array.get_num_elements() {
    builder.append_value(array.get_int(idx));  // get_int does: offset + idx * 4
}

The optimized code uses slice-based bulk append:

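// SAFETY: element_offset must be the address of num_elements contiguous,
// initialized, properly aligned i32 values that outlive the slice.
let slice = unsafe {
    std::slice::from_raw_parts(self.element_offset as *const i32, num_elements)
};
builder.append_slice(slice);  // Single memcpy-style operation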
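
The two approaches can be compared in a self-contained form. The sketch below uses only the standard library: `Vec<i32>` and `extend_from_slice` stand in for the Arrow builder and its `append_slice`, and `copy_per_element` and `copy_bulk` are hypothetical names, so treat it as an illustration of the pattern rather than the code in this PR.

```rust
/// Per-element copy: one indexed read and one push per value.
fn copy_per_element(src: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(src.len());
    for idx in 0..src.len() {
        out.push(src[idx]);
    }
    out
}

/// Bulk copy: view the raw memory as a slice, then append it in one call.
///
/// # Safety
/// `addr` must point to `len` contiguous, initialized, properly aligned
/// `i32` values that remain valid for the duration of the call.
unsafe fn copy_bulk(addr: *const i32, len: usize) -> Vec<i32> {
    let slice = unsafe { std::slice::from_raw_parts(addr, len) };
    let mut out = Vec::with_capacity(len);
    // Copies the whole slice at once, which for `i32` is a plain memory copy.
    out.extend_from_slice(slice);
    out
}

fn main() {
    let src: Vec<i32> = (0..10_000).collect();
    let per_element = copy_per_element(&src);
    // The pointer and length come from a live Vec, so the safety contract holds.
    let bulk = unsafe { copy_bulk(src.as_ptr(), src.len()) };
    assert_eq!(per_element, bulk);
    println!("copied {} elements both ways", bulk.len());
}
```

The bulk version is only sound when the pointer is aligned for the element type and the memory really holds `len` values; the per-element version trades speed for not needing that guarantee.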

Test plan

  • All Rust tests pass (118 tests)
  • Native shuffle test suite passes (16 tests)
  • Clippy clean
  • Added dedicated benchmark (native/core/benches/array_conversion.rs)

🤖 Generated with Claude Code

@andygrove andygrove force-pushed the shuffle-optimization branch from 1f7ae01 to 4743c23 Compare January 20, 2026 18:21
@andygrove andygrove changed the title from "perf: optimize shuffle array element iteration with pointer arithmetic" to "perf: optimize shuffle array element iteration with slice-based append" Jan 20, 2026
Use bulk-append methods for primitive types in SparkUnsafeArray:
- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove force-pushed the shuffle-optimization branch from 4743c23 to e32dd52 Compare January 20, 2026 18:31
andygrove and others added 3 commits January 20, 2026 11:32
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[inline] attribute on functions with loops iterating over thousands
of elements provides no benefit - the function call overhead is negligible
compared to loop body execution, and inlining large functions causes
instruction cache pressure.

Keep #[inline] only on small helper functions:
- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)
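
As an illustration of that split, the sketch below puts `#[inline]` on a tiny per-element helper and leaves it off the function that loops over the whole array. `NullBits` and `count_nulls` are hypothetical names, and the bit layout (bit `idx % 64` of word `idx / 64` set for a null) is assumed for the example.

```rust
/// Hypothetical view over a null bitset stored as 64-bit words.
struct NullBits<'a> {
    words: &'a [u64],
}

impl<'a> NullBits<'a> {
    /// Tiny and called once per element, so an inlining hint is reasonable.
    #[inline]
    fn is_null_at(&self, idx: usize) -> bool {
        (self.words[idx / 64] >> (idx % 64)) & 1 == 1
    }

    /// Loops over every element: the single call is amortized over the
    /// loop body, so this function carries no #[inline] attribute.
    fn count_nulls(&self, num_elements: usize) -> usize {
        (0..num_elements).filter(|&idx| self.is_null_at(idx)).count()
    }
}

fn main() {
    // Elements 1, 3, and 65 are marked null.
    let words = [0b1010u64, 0b10u64];
    let bits = NullBits { words: &words };
    println!("nulls in 70 elements: {}", bits.count_nulls(70));
}
```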

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused ArrayBuilder import
- Use div_ceil() instead of manual implementation
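
For context on the `div_ceil()` change, the sketch below contrasts the manual ceiling-division idiom with the standard-library method on unsigned integers. The function names and the choice of 64 (the number of null bits per word) are assumptions for illustration; the commit does not show the expression that was replaced.

```rust
/// Manual ceiling division: add (divisor - 1) before dividing.
fn words_manual(num_elements: usize) -> usize {
    (num_elements + 63) / 64
}

/// Same result using the standard-library method on unsigned integers.
fn words_div_ceil(num_elements: usize) -> usize {
    num_elements.div_ceil(64)
}

fn main() {
    // Both forms agree, including at the word boundaries.
    for n in [0usize, 1, 63, 64, 65, 10_000] {
        assert_eq!(words_manual(n), words_div_ceil(n));
    }
    println!("10_000 elements need {} null-bitset words", words_div_ceil(10_000));
}
```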

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove
Member Author

@sqlbenchmark run tpch

@codecov-commenter

codecov-commenter commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.02%. Comparing base (f09f8af) to head (fe54548).
⚠️ Report is 858 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3222      +/-   ##
============================================
+ Coverage     56.12%   60.02%   +3.89%     
- Complexity      976     1429     +453     
============================================
  Files           119      170      +51     
  Lines         11743    15746    +4003     
  Branches       2251     2602     +351     
============================================
+ Hits           6591     9451    +2860     
- Misses         4012     4976     +964     
- Partials       1140     1319     +179     

@andygrove
Member Author

@sqlbenchmark run tpch

@sqlbenchmark

Comet TPC-H Benchmark Results

Commit: fe54548 - fix: address clippy warnings in benchmark
Scale Factor: SF100
Iterations: 1

Query Times

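| Query | Time (s) | Query | Time (s) |
|-------|----------|-------|----------|
| Q1 | 10.84 | Q12 | 6.83 |
| Q2 | 5.88 | Q13 | 6.76 |
| Q3 | 9.72 | Q14 | 3.56 |
| Q4 | 11.25 | Q15 | 7.27 |
| Q5 | 19.02 | Q16 | 4.94 |
| Q6 | 2.50 | Q17 | 32.00 |
| Q7 | 11.88 | Q18 | 33.60 |
| Q8 | 24.26 | Q19 | 6.73 |
| Q9 | 37.46 | Q20 | 6.87 |
| Q10 | 10.51 | Q21 | 45.81 |
| Q11 | 4.26 | Q22 | 5.03 |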

Total Time: 306.97 seconds

Spark Configuration
| Setting | Value |
|---------|-------|
| Spark Master | local[*] |
| Driver Memory | 32G |
| Driver Cores | 8 |
| Executor Memory | 32G |
| Executor Cores | 8 |
| Off-Heap Enabled | true |
| Off-Heap Size | 24g |
| Shuffle Manager | CometShuffleManager |
| Comet Replace SMJ | true |

Automated benchmark run by dfbench

@sqlbenchmark

Comet TPC-H Benchmark Results

Commit: fe54548 - fix: address clippy warnings in benchmark
Scale Factor: SF100
Iterations: 1

Query Times

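| Query | Time (s) | Query | Time (s) |
|-------|----------|-------|----------|
| Q1 | 10.75 | Q12 | 6.77 |
| Q2 | 5.99 | Q13 | 6.88 |
| Q3 | 9.64 | Q14 | 3.56 |
| Q4 | 11.48 | Q15 | 7.15 |
| Q5 | 19.00 | Q16 | 4.98 |
| Q6 | 2.59 | Q17 | 31.85 |
| Q7 | 11.91 | Q18 | 33.46 |
| Q8 | 24.44 | Q19 | 6.69 |
| Q9 | 37.66 | Q20 | 6.57 |
| Q10 | 10.38 | Q21 | 45.80 |
| Q11 | 4.30 | Q22 | 4.88 |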

Total Time: 306.72 seconds

Spark Configuration
| Setting | Value |
|---------|-------|
| Spark Master | local[*] |
| Driver Memory | 32G |
| Driver Cores | 8 |
| Executor Memory | 32G |
| Executor Cores | 8 |
| Off-Heap Enabled | true |
| Off-Heap Size | 24g |
| Shuffle Manager | CometShuffleManager |
| Comet Replace SMJ | true |

Automated benchmark run by dfbench

@sqlbenchmark

Comet TPC-H Benchmark Results

Commit: fe54548 - fix: address clippy warnings in benchmark
Scale Factor: SF100
Iterations: 1

Query Times

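| Query | Time (s) | Query | Time (s) |
|-------|----------|-------|----------|
| Q1 | 10.77 | Q12 | 6.80 |
| Q2 | 5.86 | Q13 | 6.81 |
| Q3 | 9.62 | Q14 | 3.52 |
| Q4 | 11.69 | Q15 | 7.17 |
| Q5 | 18.82 | Q16 | 4.49 |
| Q6 | 2.53 | Q17 | 31.85 |
| Q7 | 11.90 | Q18 | 33.30 |
| Q8 | 24.14 | Q19 | 6.76 |
| Q9 | 37.42 | Q20 | 6.54 |
| Q10 | 10.38 | Q21 | 46.16 |
| Q11 | 4.26 | Q22 | 4.98 |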

Total Time: 305.78 seconds

Spark Configuration
| Setting | Value |
|---------|-------|
| Spark Master | local[*] |
| Driver Memory | 32G |
| Driver Cores | 8 |
| Executor Memory | 32G |
| Executor Cores | 8 |
| Off-Heap Enabled | true |
| Off-Heap Size | 24g |
| Shuffle Manager | CometShuffleManager |
| Comet Replace SMJ | true |

Automated benchmark run by dfbench

@andygrove andygrove marked this pull request as ready for review January 20, 2026 20:35
@andygrove
Member Author

Note that I did not expect any improvement in TPC-H, since it does not use arrays.

@apache apache deleted a comment from sqlbenchmark Jan 20, 2026
@andygrove andygrove modified the milestone: 0.13.0 Jan 22, 2026