Enable LargeListArray support in Parquet reader schema validation #513

@callmepandey

Summary

Follow-up to #502. The data conversion layer now supports LargeListArray (64-bit offsets) via ProjectRecordBatch, but the Parquet reader's schema validation still rejects LARGE_LIST types. Additionally, the reader needs to expose Arrow's list_type property to allow users to request LargeListArray output.

Problem

  1. ValidateParquetSchemaEvolution in parquet_schema_util.cc:177-180 only accepts ::arrow::Type::LIST:
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST) {
    return {};
  }
  break;
  2. Arrow's Parquet reader defaults to Type::LIST output. Without exposing ArrowReaderProperties::set_list_type(), users cannot request LargeListArray output.

Proposed Solution

1. Update schema validation to accept both list types

case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST ||
      arrow_type->id() == ::arrow::Type::LARGE_LIST) {
    return {};
  }
  break;
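
To make the intended behavior concrete, here is a minimal standalone sketch of the relaxed check. It does not use the real iceberg or Arrow headers: IcebergTypeId, ArrowTypeId, and ValidateListCompat are hypothetical stand-ins for TypeId, ::arrow::Type, and the relevant branch of ValidateParquetSchemaEvolution, and the string return value is only an assumption standing in for the project's actual status type.

```cpp
#include <string>

// Hypothetical stand-ins for iceberg::TypeId and ::arrow::Type::type so the
// sketch compiles without either library.
enum class IcebergTypeId { kInt, kList };
enum class ArrowTypeId { INT32, LIST, LARGE_LIST };

// Returns an empty string when the Arrow type is an acceptable physical
// representation of the Iceberg type, or an error message otherwise.
std::string ValidateListCompat(IcebergTypeId expected, ArrowTypeId actual) {
  switch (expected) {
    case IcebergTypeId::kList:
      // Accept both offset widths: LIST (32-bit) and LARGE_LIST (64-bit).
      if (actual == ArrowTypeId::LIST || actual == ArrowTypeId::LARGE_LIST) {
        return {};
      }
      break;
    case IcebergTypeId::kInt:
      if (actual == ArrowTypeId::INT32) {
        return {};
      }
      break;
  }
  return "incompatible Arrow type for Iceberg field";
}
```

With this shape, a kList field passes validation for either list type, while any non-list Arrow type still falls through to the error path.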

2. Add kListType to ReaderProperties

Expose a property to configure the Arrow list type preference.

3. Pass through to Arrow reader

In ParquetReader::Impl::Open(), call arrow_reader_properties.set_list_type() with the configured value.
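
The pass-through could look roughly like the sketch below. Everything here is a stand-in so the example compiles on its own: FakeArrowReaderProperties mirrors only the set_list_type() setter named above, FakeReaderProperties is a simplified key/value map rather than the real ReaderProperties, and the key string and the "large_list" value are hypothetical names chosen for illustration.

```cpp
#include <map>
#include <string>

// Stand-in for the two Arrow list type ids relevant here.
enum class ListKind { kList, kLargeList };

// Stand-in for the Arrow reader properties object; only the setter named in
// this issue is modeled. The default matches Arrow's default of LIST.
struct FakeArrowReaderProperties {
  void set_list_type(ListKind kind) { list_type_ = kind; }
  ListKind list_type() const { return list_type_; }

 private:
  ListKind list_type_ = ListKind::kList;
};

// Simplified stand-in for iceberg's ReaderProperties.
struct FakeReaderProperties {
  std::map<std::string, std::string> entries;
};

// Hypothetical property key; the real name is up to the implementation.
const char kListTypeKey[] = "read.arrow.list-type";

// Reads the configured preference, defaulting to the 32-bit LIST type.
ListKind ResolveListKind(const FakeReaderProperties& props) {
  auto it = props.entries.find(kListTypeKey);
  if (it != props.entries.end() && it->second == "large_list") {
    return ListKind::kLargeList;
  }
  return ListKind::kList;
}

// What Open() would do before constructing the Arrow file reader.
FakeArrowReaderProperties MakeArrowProperties(const FakeReaderProperties& props) {
  FakeArrowReaderProperties arrow_props;
  arrow_props.set_list_type(ResolveListKind(props));
  return arrow_props;
}
```

Defaulting to LIST when the property is unset keeps existing callers on the current behavior; only users who opt in would receive LargeListArray output.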

Why This Is Safe

  1. Iceberg's ListType doesn't distinguish between LIST and LARGE_LIST
  2. The projection layer (ProjectRecordBatch) already handles both via templated ProjectListArrayImpl<>
  3. Both represent the same logical "list" concept and differ only in offset width (32-bit for LIST, 64-bit for LARGE_LIST)
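
As an illustration of why one template can serve both list types, the sketch below models a list array as an offsets buffer plus a flat values buffer, parameterized on the offset type. SimpleListArray and ListAt are hypothetical names and are not the actual ProjectListArrayImpl<> code; they only show that element access is identical once the offset width is a template parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for the Arrow list layout: list i spans
// values[offsets[i], offsets[i + 1]).
template <typename OffsetT>
struct SimpleListArray {
  std::vector<OffsetT> offsets;
  std::vector<int> values;
};

using SimpleList = SimpleListArray<std::int32_t>;       // like LIST
using SimpleLargeList = SimpleListArray<std::int64_t>;  // like LARGE_LIST

// Returns a copy of the i-th list. The logic is the same for both widths.
template <typename OffsetT>
std::vector<int> ListAt(const SimpleListArray<OffsetT>& array, std::size_t i) {
  const auto begin = static_cast<std::ptrdiff_t>(array.offsets[i]);
  const auto end = static_cast<std::ptrdiff_t>(array.offsets[i + 1]);
  return std::vector<int>(array.values.begin() + begin,
                          array.values.begin() + end);
}
```

Given the same offsets and values, ListAt returns the same elements for SimpleList and SimpleLargeList, which is the property the projection layer relies on.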

Files to Change

  • src/iceberg/parquet/parquet_schema_util.cc - Update ValidateParquetSchemaEvolution
  • src/iceberg/parquet/parquet_reader.cc - Pass list_type to ArrowReaderProperties
  • src/iceberg/reader.h - Add kListType to ReaderProperties
  • src/iceberg/test/parquet_test.cc - Add integration tests

Related
