Conversation

@daniel-adam-tfs (Contributor) commented Dec 8, 2025

Rationale for this change

Fixing the decryption of V2 pages.

What changes are included in this PR?

This PR fixes the decryption of V2 pages.

Are these changes tested?

Yes, tests cover various scenarios for both V1 and V2 page encryption and decryption.

Are there any user-facing changes?

No.

@daniel-adam-tfs (Contributor Author) commented Dec 8, 2025

@zeroshade OK, so I've generated tools/pyarrow_encrypted_uniform.parquet using tools/write_encrypted_parquet.py and tools/arrowgo_encrypted_uniform.parquet using TestEncryptFile. Very likely that I'm doing something wrong, but let me walk you through what I have.

In the Python code I need an instance of FileEncryptionProperties to pass to write_table, but the only API I was able to find in the PyArrow lib for producing one is CryptoFactory.file_encryption_properties, and for that I need a KMS. I've created a mock implementation, which just base64-encodes and -decodes the input, and used it to generate pyarrow_encrypted_uniform.parquet.
When I try to read this generated file in TestDecryptFile, it panics in the StringKeyIDRetriever.GetKey call.

// retriever contents: encryption.StringKeyIDRetriever{"footer_key": "0123456789012345"}
func (s StringKeyIDRetriever) GetKey(keyMetadata []byte) string {
	k, ok := s[*(*string)(unsafe.Pointer(&keyMetadata))]
	if !ok {
		panic(fmt.Errorf("parquet: key missing for id %s", keyMetadata))
	}
	return k
}
keyMetadata={\"keyMaterialType\":\"PKMT1\",\"internalStorage\":true,\"isFooterKey\":true,\"kmsInstanceID\":\"DEFAULT\",\"kmsInstanceURL\":\"DEFAULT\",\"masterKeyID\":\"footer_key\",\"wrappedDEK\":\"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=\",\"doubleWrapping\":true,\"keyEncryptionKeyID\":\"7pmHfFBvnjd2Wbf218WOMQ==\",\"wrappedKEK\":\"1eX3O2IHHTkAnuIXbbIQRA==\"}"

The whole keyMetadata value is used as a key in the retriever map, which obviously doesn't work. I'm guessing that I should implement a custom KeyIDRetriever that JSON-decodes the retrieved metadata and returns the key for the masterKeyID value? Something like the sketch below.
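
(A hypothetical, untested sketch of what I mean; the struct only covers the masterKeyID field:)

import "encoding/json"

// Hypothetical: JSON-decode the key metadata, pull out masterKeyID, and
// look the actual key up in a map keyed by master key ID.
type jsonKeyIDRetriever map[string]string

func (j jsonKeyIDRetriever) GetKey(keyMetadata []byte) string {
	var meta struct {
		MasterKeyID string `json:"masterKeyID"`
	}
	if err := json.Unmarshal(keyMetadata, &meta); err != nil {
		panic(err)
	}
	return j[meta.MasterKeyID]
}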

@daniel-adam-tfs (Contributor Author)

The other way around: when I try to run tools/read_encrypted_parquet.py, which should decrypt the arrow-go-generated file arrowgo_encrypted_uniform.parquet, I get this error:

Reading arrowgo_encrypted_uniform.parquet
Traceback (most recent call last):
  File "/daniel-adam-tfs/arrow-go/tools/./read_encrypted_parquet.py", line 27, in <module>
    with pq.ParquetFile(
         ~~~~~~~~~~~~~~^
            input_file,
            ^^^^^^^^^^^
            decryption_properties=decryption_properties) as f:
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 328, in __init__
    self.reader.open(
    ~~~~~~~~~~~~~~~~^
        source, use_memory_map=memory_map,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
        arrow_extensions_enabled=arrow_extensions_enabled,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "pyarrow/_parquet.pyx", line 1656, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Failed to parse key metadata footer_key

Seems that it expects the key metadata to be written differently?

@zeroshade (Member)

Sorry for the delay here, I was on vacation all last week.

I see what the issue is here. What pyarrow exposes is the high-level API for using a KMS to manage the keys by wrapping them. It generates random bytes to use as the key and then wraps that key via the wrap_key and unwrap_key functions of the KMS implementation.

If we look at your mock example:

import base64

import pyarrow.parquet.encryption as pe

class MockKmsClient(pe.KmsClient):
    def __init__(self, kms_connection_configuration):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

Comparing that with the unit tests for pyarrow parquet encryption in the Arrow repo, I was able to figure out what it is doing. Essentially, the key_bytes are the random bytes that actually got used to encrypt the data. Those bytes are the key that Go will need for decrypting. In your example with this MockKmsClient you're storing the key bytes (base64-encoded) directly in the metadata, so the following should work on the Go side for decrypting:

import (
	"encoding/base64"
	"encoding/json"
)

type metadataKeyRetriever struct{}

func (metadataKeyRetriever) GetKey(keyMetadata []byte) string {
	var keyMeta struct {
		WrappedDEK string `json:"wrappedDEK"`
	}

	if err := json.Unmarshal(keyMetadata, &keyMeta); err != nil {
		panic(err)
	}

	// With the base64 MockKmsClient, "unwrapping" is just decoding the
	// wrapped DEK stored in the key metadata.
	byts, err := base64.StdEncoding.DecodeString(keyMeta.WrappedDEK)
	if err != nil {
		panic(err)
	}

	return string(byts)
}

Another option might be to combine the key_bytes with the master key bytes so that the key itself isn't stored quite so directly in the metadata (or use some external service). I was able to get the above to work, so think of it like this:

For Python -

  • wrap_key takes the key ID (master_key_identifier) and the actual random bytes used as the encryption key (key_bytes), and outputs what goes into the key metadata. These key_bytes are what Go needs, since we currently only expose the low-level API.
  • unwrap_key takes the result of wrap_key plus the key ID (master_key_identifier) and returns the actual key to use for decrypting. Going the other direction, Go would need to output the properly formatted JSON blob that pyarrow expects as the key metadata, so that the Python unwrap_key can recover the actual key_bytes from it (see the sketch after this list).
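
As a rough illustration of that Go-side blob, here is a minimal sketch assuming the base64 MockKmsClient above; the struct and helper are invented for this example, and in practice the full field set from the metadata dump earlier (keyMaterialType, doubleWrapping, and so on) would be needed for pyarrow's key-material parser to accept it:

import (
	"encoding/base64"
	"encoding/json"
)

// Hypothetical sketch: emit key metadata in a JSON shape the mock KMS can
// reverse. Only two fields are shown; pyarrow's parser expects more.
type mockKeyMetadata struct {
	MasterKeyID string `json:"masterKeyID"`
	WrappedDEK  string `json:"wrappedDEK"`
}

func makeKeyMetadata(masterKeyID string, dek []byte) ([]byte, error) {
	// "Wrapping" with the base64 mock KMS is just encoding the DEK bytes.
	return json.Marshal(mockKeyMetadata{
		MasterKeyID: masterKeyID,
		WrappedDEK:  base64.StdEncoding.EncodeToString(dek),
	})
}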

Does this help?

@daniel-adam-tfs (Contributor Author) commented Jan 12, 2026

Back from the holidays and vacation. Let me check if I can work out how to make this work with your notes.

EDIT 1: I was able to make the Python -> Go direction work. Let's see if I can make it work the other way as well; I think I just need to write the correct metadata for pyarrow.

EDIT 2: Yes, when I copy the metadata format, Go -> Python works as well.

@daniel-adam-tfs daniel-adam-tfs force-pushed the bugfix/566-fix-encryption branch from 6c718f4 to 38b8fac on January 12, 2026 14:16
@daniel-adam-tfs (Contributor Author) commented Jan 12, 2026

I've generated 8 files, 4 using arrow-go and 4 using PyArrow, with these configurations:

  1. V1 Data Page, uncompressed
  2. V1 Data Page, snappy compressed
  3. V2 Data Page, uncompressed
  4. V2 Data Page, snappy compressed

PyArrow successfully reads all 8 of them. For arrow-go, the V2 + uncompressed file fails for both generators; that is the issue we discovered in the related memory PR. Interestingly, though, configuration 4 generated by PyArrow also fails. I'll investigate tomorrow.

I'm also thinking about maybe creating a test docker container to test PyArrow <-> arrow-go compatibility. Or are there some other integration tests done in some other way that I missed?

@daniel-adam-tfs (Contributor Author)

OK, I have it working - with V1 or V2 pages, with or without compression, with or without definition levels, with or without repetition levels. I'll clean it up and create tests tomorrow, and we can get this thing merged.

@zeroshade (Member)

I'm also thinking about maybe creating a test docker container to test PyArrow <-> arrow-go compatibility. Or are there some other integration tests done in some other way that I missed?

Currently the integration tests simply confirm that arrow-go can read the known encrypted files from the github.com/apache/parquet-testing repo, and then confirm the round trip by writing and reading back locally written versions of those files. They don't do any cross-language testing. The assumption is that if we confirm proper reading of the known good files, and then confirm we can write and read new versions of the same files, that is sufficient as long as all the implementations do the same.

It might be worthwhile to have more explicit cross-implementation testing of the encryption files though.

Extract separate decryptV2 and decompressV2 methods to properly handle
decryption and decompression of V2 data pages. Previously, V2 encrypted
pages were not being decrypted before decompression, leading to failures.

The fix ensures that:
- Encrypted V2 pages are decrypted before decompression
- Uncompressed encrypted V2 pages are properly handled
- Level bytes (definition/repetition) are correctly preserved in both paths

Expand encryption test coverage to validate the fix across:
- Multiple compression types (snappy, uncompressed)
- Both data page versions (V1, V2)
- All encryption configurations (uniform, column-only, column+footer)
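
To make that ordering concrete, here is a minimal sketch of the corrected V2 flow; the function and parameter names are illustrative, not the actual arrow-go internals:

// Decrypt the whole page body first; only then decompress the values
// section, keeping the (never-compressed) level bytes verbatim.
func processV2PageBody(decrypt, decompress func([]byte) ([]byte, error),
	raw []byte, levelsByteLen int, compressed bool) ([]byte, error) {

	body := raw
	if decrypt != nil {
		var err error
		if body, err = decrypt(raw); err != nil {
			return nil, err
		}
	}
	if !compressed {
		return body, nil
	}
	values, err := decompress(body[levelsByteLen:])
	if err != nil {
		return nil, err
	}
	return append(append([]byte{}, body[:levelsByteLen]...), values...), nil
}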
@daniel-adam-tfs daniel-adam-tfs force-pushed the bugfix/566-fix-encryption branch from 361041b to 1c2dfea on January 15, 2026 12:09
  // write the int64 column, each row repeats twice
  int64Writer := nextColumn().(*file.Int64ColumnChunkWriter)
- for i := 0; i < 2*en.rowsPerRG; i++ {
+ for i := 0; i < en.rowsPerRG; i++ {
@daniel-adam-tfs (Contributor Author):

I was getting "batch writing for V2 data pages must start at a row boundary" errors for V2 pages with the original version.

  p.dataPageHdrV2.Statistics = getDummyStats(statsSize, true)
  p.dataPageHdrV2.NumValues = nrows
- p.WriteDataPageHeaderV2(1024, 20, 10)
+ p.WriteDataPageHeaderV2(1024, 0, 0)
@daniel-adam-tfs (Contributor Author):

The original version of serializedPageReader.Next() ignored the error returned by io.ReadFull. Was that deliberate? I've added the error checking, and now p.pageReader.Next() returns false if the compressed and uncompressed sizes are not 0, so I modified those values here.
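
For reference, the checked form is just this (a minimal sketch; p.r and buf are as in the existing reader code):

// Propagate short reads and EOF instead of silently continuing with a
// partially filled buffer.
if _, err := io.ReadFull(p.r, buf.Bytes()); err != nil {
	p.err = err
	return false
}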

@daniel-adam-tfs (Contributor Author) commented Jan 15, 2026

I'm also thinking about maybe creating a test docker container to test PyArrow <-> arrow-go compatibility. Or are there some other integration tests done in some other way that I missed?

Currently the integration tests simply confirm that arrow-go can read the known encrypted files from the github.com/apache/parquet-testing repo, and then confirm the round trip by writing and reading back locally written versions of those files. They don't do any cross-language testing. The assumption is that if we confirm proper reading of the known good files, and then confirm we can write and read new versions of the same files, that is sufficient as long as all the implementations do the same.

It might be worthwhile to have more explicit cross-implementation testing of the encryption files though.

I've modified the tests, and extended the round-trip files.

I'm thinking that a cross-implementation verification would still be useful. Using Python seems like the easiest option, but I'm not sure how to deal with the KMS API that is used in PyArrow. You figured out how to make it work with arrow-go, but is that API guaranteed to be stable?

@daniel-adam-tfs daniel-adam-tfs changed the title from "Add TestEncryptFile and TestDecryptFile tests" to "Fix decryption of V2 data pages" on Jan 15, 2026
@daniel-adam-tfs daniel-adam-tfs marked this pull request as ready for review January 16, 2026 10:44
@daniel-adam-tfs (Contributor Author)

@zeroshade I ran the failing tests and execution never passes through the changed code. I'd guess there is some race condition in those tests, but it is unrelated to the code changes here.

Comment on lines +858 to 868
  if p.cryptoCtx.DataDecryptor != nil {
  	if err := p.readV2Encrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
  		p.err = err
  		return false
  	}
  } else {
- 	io.ReadFull(p.r, buf.Bytes())
+ 	if err := p.readV2Unencrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
+ 		p.err = err
+ 		return false
+ 	}
  }
@zeroshade (Member):

So the issue is simply that we're not passing uncompressed V2 pages through the decryptor, right?

Wouldn't a simpler solution here just be to add the decryption check to the non-compressed branch?

I.e., since the call to decompress already decrypts the data, we just need to replace the io.ReadFull(p.r, buf.Bytes()) with code that checks whether we need to decrypt and, if so, decrypts here; otherwise just leave the io.ReadFull, right?
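
Something along these lines, as a rough sketch of that shape (the Decrypt call here is an assumption for illustration, not the actual arrow-go decryptor signature):

// Keep the io.ReadFull, then decrypt in place when a data decryptor is
// configured. The Decrypt([]byte) []byte call shape is assumed.
if _, err := io.ReadFull(p.r, buf.Bytes()); err != nil {
	p.err = err
	return false
}
data := buf.Bytes()
if p.cryptoCtx.DataDecryptor != nil {
	data = p.cryptoCtx.DataDecryptor.Decrypt(data)
}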
