perf: optimize shuffle array element iteration with slice-based append #3222
base: main
Conversation
Force-pushed from 1f7ae01 to 4743c23
Use bulk-append methods for primitive types in SparkUnsafeArray:

- Non-nullable path uses append_slice() for an optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 4743c23 to e32dd52
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[inline] attribute on functions with loops iterating over thousands of elements provides no benefit: the function call overhead is negligible compared to the loop body execution, and inlining large functions causes instruction cache pressure. Keep #[inline] only on small helper functions:

- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
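For context, the surviving `#[inline]` helpers are roughly this shape; the signatures and the header formula are assumptions based on Spark's UnsafeArrayData layout, not the exact Comet code:

```rust
// Assumed shapes for the helpers that keep #[inline]. Spark's
// UnsafeArrayData header is an 8-byte element count followed by a
// word-aligned null bitset (one bit per element, rounded to 8 bytes).
#[inline]
const fn get_header_portion_in_bytes(num_elements: i64) -> i64 {
    8 + ((num_elements + 63) / 64) * 8
}

#[inline]
fn is_null_at(null_bitset_ptr: *const u8, index: usize) -> bool {
    // Single-bit read on a hot path: small enough that inlining pays off.
    unsafe { (*null_bitset_ptr.add(index / 8) >> (index % 8)) & 1 == 1 }
}
```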
- Remove unused ArrayBuilder import
- Use div_ceil() instead of a manual implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
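The `div_ceil()` change is the standard-library replacement (Rust 1.73+) for hand-rolled rounding-up division; a minimal sketch, with the function and variable names assumed:

```rust
// Sketch of the cleanup: bitset size in 64-bit words, rounded up.
fn null_bitset_words(num_elements: usize) -> usize {
    // Manual version this replaces: (num_elements + 63) / 64
    num_elements.div_ceil(64) // same result, clearer intent
}
```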
@sqlbenchmark run tpch

1 similar comment

@sqlbenchmark run tpch
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##               main    #3222      +/-   ##
============================================
+ Coverage     56.12%   60.02%     +3.89%
- Complexity      976     1429       +453
============================================
  Files           119      170        +51
  Lines         11743    15746      +4003
  Branches       2251     2602       +351
============================================
+ Hits           6591     9451      +2860
- Misses         4012     4976       +964
- Partials       1140     1319       +179
```
@sqlbenchmark run tpch
Comet TPC-H Benchmark Results

Total Time: 306.97 seconds

Automated benchmark run by dfbench
Comet TPC-H Benchmark Results

Total Time: 306.72 seconds

Automated benchmark run by dfbench
Comet TPC-H Benchmark Results

Total Time: 305.78 seconds

Automated benchmark run by dfbench
Note that I did not expect any improvements in TPC-H, since it does not use arrays.
Summary
Optimizes primitive-type array iteration in native shuffle by using bulk operations instead of per-element iteration.
Key optimizations:

- Non-nullable path uses append_slice() for an optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp
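For the nullable path, the idea is roughly the following. This is a sketch only: `values_ptr`, `null_bitset_ptr`, and the byte-level bit layout are assumptions, not the actual Comet API.

```rust
use arrow::array::Int32Builder;

// Illustrative nullable path: walk the packed values by pointer and read
// one bit per element from the null bitset (byte i / 8, bit i % 8,
// assuming a little-endian word layout like Spark's UnsafeArrayData).
unsafe fn append_nullable_i32(
    values_ptr: *const i32,     // packed element region
    null_bitset_ptr: *const u8, // null bitset region
    len: usize,
    builder: &mut Int32Builder,
) {
    for i in 0..len {
        let is_null = (*null_bitset_ptr.add(i / 8) >> (i % 8)) & 1 == 1;
        if is_null {
            builder.append_null();
        } else {
            builder.append_value(*values_ptr.add(i));
        }
    }
}
```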
Benchmark Results

Benchmark for converting a SparkUnsafeArray (10K elements) to an Arrow array:

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | -* | 0.66µs | ~9x |
| timestamp/no_nulls | -* | 1.21µs | ~5x |

*Baseline estimated from similar types
Why such a dramatic improvement for the non-nullable path?
The original code appended elements one by one using index-based access:
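A sketch of that baseline, with `num_elements()` and `get_i32()` standing in for the actual SparkUnsafeArray accessors:

```rust
use arrow::array::Int32Builder;

// Illustrative baseline: one offset computation, one accessor call, and
// one builder call per element.
fn append_per_element(array: &SparkUnsafeArray, builder: &mut Int32Builder) {
    for i in 0..array.num_elements() {
        builder.append_value(array.get_i32(i));
    }
}
```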
The optimized code uses slice-based bulk append:
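A sketch of the slice-based version for the non-nullable path, where `element_ptr` is an assumed accessor for the packed data region:

```rust
use arrow::array::Int32Builder;

// Illustrative bulk path: view the packed i32 region as a slice and let
// the builder extend itself in a single call.
unsafe fn append_bulk(element_ptr: *const i32, len: usize, builder: &mut Int32Builder) {
    let values = std::slice::from_raw_parts(element_ptr, len);
    builder.append_slice(values);
}
```

A single `append_slice()` amortizes the builder's capacity checks across the whole array and lets the copy compile down to a memcpy-style operation, which is where the ~5-9x gains on the non-nullable benchmarks come from.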
Test plan

- New microbenchmark (native/core/benches/array_conversion.rs)

🤖 Generated with Claude Code
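Assuming the file above registers a Criterion bench target of the same name, it can presumably be run locally with:

```sh
cargo bench --bench array_conversion
```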