Fix decryption of V2 data pages #596
Conversation
@zeroshade OK, so I've generated tools/pyarrow_encrypted_uniform.parquet using tools/write_encrypted_parquet.py and tools/arrowgo_encrypted_uniform.parquet using TestEncryptFile. Very likely I'm doing something wrong, but let me walk you through what I have. In the Python code I need an instance of FileEncryptionProperties to pass to write_table, but the only API I was able to find in PyArrow is CryptoFactory.file_encryption_properties, and for that I need a KMS. I've created a mock implementation, which just base64-encodes and -decodes the input, and used it to generate pyarrow_encrypted_uniform.parquet.

On the Go side I'm using an encryption.StringKeyIDRetriever:

```go
encryption.StringKeyIDRetriever{"footer_key": "0123456789012345"}

func (s StringKeyIDRetriever) GetKey(keyMetadata []byte) string {
	k, ok := s[*(*string)(unsafe.Pointer(&keyMetadata))]
	if !ok {
		panic(fmt.Errorf("parquet: key missing for id %s", keyMetadata))
	}
	return k
}
```

But when reading the PyArrow-generated file, the key metadata handed to GetKey looks like this:

```
keyMetadata={"keyMaterialType":"PKMT1","internalStorage":true,"isFooterKey":true,"kmsInstanceID":"DEFAULT","kmsInstanceURL":"DEFAULT","masterKeyID":"footer_key","wrappedDEK":"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=","doubleWrapping":true,"keyEncryptionKeyID":"7pmHfFBvnjd2Wbf218WOMQ==","wrappedKEK":"1eX3O2IHHTkAnuIXbbIQRA=="}
```

The whole keyMetadata value is used as a key in the retriever map, which obviously doesn't work. I'm guessing that I should implement a custom KeyIDRetriever, json-decode the retrieved metadata, and return the right key from it.
The other way around: when I try to run tools/read_encrypted_parquet.py, which should decrypt the arrow-go generated file arrowgo_encrypted_uniform.parquet, I get this error:

```
Reading arrowgo_encrypted_uniform.parquet
Traceback (most recent call last):
File "/daniel-adam-tfs/arrow-go/tools/./read_encrypted_parquet.py", line 27, in <module>
with pq.ParquetFile(
~~~~~~~~~~~~~~^
input_file,
^^^^^^^^^^^
decryption_properties=decryption_properties) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 328, in __init__
self.reader.open(
~~~~~~~~~~~~~~~~^
source, use_memory_map=memory_map,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<8 lines>...
arrow_extensions_enabled=arrow_extensions_enabled,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "pyarrow/_parquet.pyx", line 1656, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Failed to parse key metadata footer_key
```

Seems that it expects the key metadata to be written differently?
Sorry for the delay here, I was on vacation all last week. I see what the issue is. What is exposed via pyarrow is the high-level API for using a KMS to manage the keys by wrapping them. It will generate random bytes to use for the data encryption key and then wrap that key using the KMS client you provide. If we look at your mock example:

```python
class MockKmsClient(pe.KmsClient):
    def __init__(self, kms_connection_configuration):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)
```

and compare that with the unit tests from the Arrow repo for pyarrow parquet encryption, I was able to figure out what it is doing. Essentially, the key metadata is a JSON document whose wrappedDEK field holds the (mock-wrapped) data encryption key, so on the Go side a retriever like this works:

```go
import (
	"encoding/base64"
	"encoding/json"
)

type metadataKeyRetriever struct{}

func (metadataKeyRetriever) GetKey(keyMetadata []byte) string {
	var keyMeta struct {
		WrappedDEK string `json:"wrappedDEK"`
	}
	if err := json.Unmarshal(keyMetadata, &keyMeta); err != nil {
		panic(err)
	}
	byts, err := base64.StdEncoding.DecodeString(keyMeta.WrappedDEK)
	if err != nil {
		panic(err)
	}
	return string(byts)
}
```

Another option might be to manipulate the …

For Python: …

Does this help?
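For completeness, a rough sketch of wiring such a retriever into the arrow-go reader. The option and field names here (NewFileDecryptionProperties, WithKeyRetriever, FileDecryptProps, OpenParquetFile, WithReadProps) and the v18 module path are assumptions that may differ across versions, and metadataKeyRetriever is the type from the snippet above:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

func main() {
	// Assumed wiring: hand the JSON-aware retriever (metadataKeyRetriever
	// from the snippet above) to the decryption properties and attach those
	// to the reader properties used to open the file.
	decryptProps := parquet.NewFileDecryptionProperties(
		parquet.WithKeyRetriever(metadataKeyRetriever{}),
	)

	readProps := parquet.NewReaderProperties(memory.DefaultAllocator)
	readProps.FileDecryptProps = decryptProps

	rdr, err := file.OpenParquetFile("tools/pyarrow_encrypted_uniform.parquet", false,
		file.WithReadProps(readProps))
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	fmt.Println("row groups:", rdr.NumRowGroups())
}
```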
Back from the holidays and vacation. Let me check if I can work out how to make this work with your notes.

EDIT 1: I was able to make the Python -> Go direction work. Let's see if I can make it work the other way as well; I just need to write the correct metadata for PyArrow, I think.

EDIT 2: Yes, when I copy the metadata format, the Go -> Python direction works as well.
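For the Go -> PyArrow direction, the gist is emitting key metadata in the same JSON shape shown earlier. Below is a stdlib-only sketch of building that blob; the struct and helper names are made up for illustration, the field names are copied from the observed metadata, the "wrapping" mimics the base64 mock KMS rather than a real one, and single wrapping is assumed (the PyArrow sample used double wrapping with a KEK), so treat it as a starting point rather than the exact format PyArrow requires:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// pyarrowKeyMaterial mirrors the JSON key metadata observed in the
// PyArrow-written file above; only the fields shown there are included.
type pyarrowKeyMaterial struct {
	KeyMaterialType string `json:"keyMaterialType"`
	InternalStorage bool   `json:"internalStorage"`
	IsFooterKey     bool   `json:"isFooterKey"`
	KMSInstanceID   string `json:"kmsInstanceID"`
	KMSInstanceURL  string `json:"kmsInstanceURL"`
	MasterKeyID     string `json:"masterKeyID"`
	WrappedDEK      string `json:"wrappedDEK"`
	DoubleWrapping  bool   `json:"doubleWrapping"`
}

// buildKeyMetadata wraps the DEK the same way the mock KMS does (plain
// base64) and packs it into PyArrow-style JSON. A real setup would wrap
// the DEK with an actual KMS instead.
func buildKeyMetadata(dek []byte, masterKeyID string, isFooter bool) (string, error) {
	m := pyarrowKeyMaterial{
		KeyMaterialType: "PKMT1",
		InternalStorage: true,
		IsFooterKey:     isFooter,
		KMSInstanceID:   "DEFAULT",
		KMSInstanceURL:  "DEFAULT",
		MasterKeyID:     masterKeyID,
		WrappedDEK:      base64.StdEncoding.EncodeToString(dek),
		DoubleWrapping:  false,
	}
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	meta, _ := buildKeyMetadata([]byte("0123456789012345"), "footer_key", true)
	fmt.Println(meta) // pass this as the footer key metadata when writing from Go
}
```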
Force-pushed from 6c718f4 to 38b8fac.
I've generated 8 files, 4 using arrow-go and 4 using PyArrow, with this configuration: …

PyArrow successfully reads all 8 of them. For arrow-go, the V2 + uncompressed case fails in both versions; that is the issue we discovered in the related memory PR. Interestingly, though, (4) generated by PyArrow also fails. I'll investigate tomorrow. I'm also thinking about creating a test Docker container to test PyArrow <-> arrow-go compatibility. Or are there integration tests done some other way that I missed?
OK, I have it working - with V1 or V2 pages, with or without compression, with or without definition levels, with or without repetition levels. I'll clean it up and create tests tomorrow, and then we can get this thing merged.
Currently the integration tests simply confirm that arrow-go can read the known encrypted files from the github.com/apache/parquet-testing repo, and then confirm a round trip by writing and reading locally written versions of those files. It doesn't do any cross-language testing. The assumption is that if we confirm proper reading of the known-good files, and confirm we can write and read new versions of the same files, that is sufficient as long as all the implementations do the same. It might be worthwhile to have more explicit cross-implementation testing of the encrypted files, though.
Extract separate decryptV2 and decompressV2 methods to properly handle decryption and decompression of V2 data pages. Previously, V2 encrypted pages were not being decrypted before decompression, leading to failures.

The fix ensures that:
- Encrypted V2 pages are decrypted before decompression
- Uncompressed encrypted V2 pages are properly handled
- Level bytes (definition/repetition) are correctly preserved in both paths

Expand encryption test coverage to validate the fix across:
- Multiple compression types (snappy, uncompressed)
- Both data page versions (V1, V2)
- All encryption configurations (uniform, column-only, column+footer)
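As an illustration of the ordering this enforces (not the PR's actual code), here is a self-contained sketch: decrypt the whole V2 page buffer first when a data decryptor is configured, then decompress only the bytes after the repetition/definition levels, since V2 level bytes are stored uncompressed. The Decryptor and Codec interfaces are stand-ins for the real arrow-go types:

```go
package main

import "fmt"

// Stand-in interfaces; the real arrow-go decryptor/codec types differ.
type Decryptor interface {
	Decrypt(ciphertext []byte) ([]byte, error)
}

type Codec interface {
	Decode(dst, src []byte) ([]byte, error)
}

// readDataPageV2 sketches the order of operations the fix enforces:
// decrypt first (if needed), then decompress everything after the level
// bytes, keeping the levels themselves untouched.
func readDataPageV2(raw []byte, levelsByteLen, uncompressedSize int, isCompressed bool,
	decryptor Decryptor, codec Codec) ([]byte, error) {

	if decryptor != nil {
		// Encrypted V2 pages must be decrypted before any decompression.
		decrypted, err := decryptor.Decrypt(raw)
		if err != nil {
			return nil, err
		}
		raw = decrypted
	}

	if !isCompressed || codec == nil {
		// Uncompressed (possibly encrypted) page: nothing more to do.
		return raw, nil
	}

	// Rep/def levels are never compressed in V2; preserve them as-is and
	// decompress only the values section that follows.
	out := make([]byte, 0, uncompressedSize)
	out = append(out, raw[:levelsByteLen]...)

	values, err := codec.Decode(make([]byte, uncompressedSize-levelsByteLen), raw[levelsByteLen:])
	if err != nil {
		return nil, err
	}
	return append(out, values...), nil
}

func main() {
	// A plaintext, uncompressed page passes straight through.
	page := []byte{0x01, 0x02, 0x03}
	got, _ := readDataPageV2(page, 1, len(page), false, nil, nil)
	fmt.Println(got)
}
```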
Force-pushed from 361041b to 1c2dfea.
```diff
 	// write the int64 column, each row repeats twice
 	int64Writer := nextColumn().(*file.Int64ColumnChunkWriter)
-	for i := 0; i < 2*en.rowsPerRG; i++ {
+	for i := 0; i < en.rowsPerRG; i++ {
```
I was getting "batch writing for V2 data pages must start at a row boundary" errors for V2 pages with the original version.
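For context on the row-boundary requirement, a hedged sketch of writing a repeated column one row per batch, so each WriteBatch call starts with repetition level 0 (a new row) as V2 data pages require; the WriteBatch signature, module path, and the test's exact structure are assumptions:

```go
package enctest

import (
	"github.com/apache/arrow-go/v18/parquet/file"
)

// writeRepeatedRows writes a repeated int64 column one row per batch.
// Each row carries two values: repetition level 0 starts the row and 1
// continues it, so every WriteBatch call begins at a row boundary.
func writeRepeatedRows(w *file.Int64ColumnChunkWriter, rows int) error {
	for i := 0; i < rows; i++ {
		values := []int64{int64(i), int64(i)}
		defLevels := []int16{1, 1}
		repLevels := []int16{0, 1}
		if _, err := w.WriteBatch(values, defLevels, repLevels); err != nil {
			return err
		}
	}
	return nil
}
```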
```diff
 	p.dataPageHdrV2.Statistics = getDummyStats(statsSize, true)
 	p.dataPageHdrV2.NumValues = nrows
-	p.WriteDataPageHeaderV2(1024, 20, 10)
+	p.WriteDataPageHeaderV2(1024, 0, 0)
```
The original version of serializedPageReader.Next() didn't check the error from io.ReadFull. Was that deliberate? I've added the error checking, and now p.pageReader.Next() returns false if the compressed and uncompressed values are not 0, so I modified those.
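For reference, the checked io.ReadFull pattern being added looks like this (a minimal stdlib-only illustration, not the PR's exact code):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	r := bytes.NewReader([]byte("abc"))
	buf := make([]byte, 8)

	// Unchecked, io.ReadFull(r, buf) would silently leave buf short here.
	// Checked, ErrUnexpectedEOF surfaces when fewer bytes are available.
	if _, err := io.ReadFull(r, buf); err != nil {
		fmt.Println("read failed:", err) // read failed: unexpected EOF
	}
}
```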
I've modified the tests and extended the round-trip files. I still think a cross-implementation verification would be useful. Using Python seems like the easiest option, but I'm not sure how to deal with the KMS API that PyArrow uses. You figured out how to make it work with arrow-go, but is that API guaranteed to be stable?
@zeroshade I ran the failing tests and they don't exercise the changed code at all. I'm guessing there is some race condition in the tests, but it is unrelated to the code changes here.
```diff
 	if p.cryptoCtx.DataDecryptor != nil {
 		if err := p.readV2Encrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
 			p.err = err
 			return false
 		}
 	} else {
-		io.ReadFull(p.r, buf.Bytes())
+		if err := p.readV2Unencrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
+			p.err = err
+			return false
+		}
 	}
```
So the issue is simply that we're not passing uncompressed v2 through the decryptor right?
Wouldn't a simpler solution here just be to add the decryption check to the non-compressed branch?
i.e. since the call to decompress will already decrypt the data, we just need to replace the io.ReadFull(p.r, buf.Bytes()) with code to check if we need to decrypt and then decrypt it here, otherwise just leave the io.ReadFull right?
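A sketch of the simpler shape suggested here: keep io.ReadFull for the plaintext case and branch only when a decryptor is present. Decryptor is a stand-in interface and readUncompressedV2 is a hypothetical helper name, not the PR's API:

```go
package pagereader

import "io"

// Decryptor is a stand-in for the real data decryptor type.
type Decryptor interface {
	Decrypt(ciphertext []byte) ([]byte, error)
}

// readUncompressedV2 is a hypothetical helper sketching the suggestion:
// in the uncompressed branch, either decrypt the raw page bytes or fall
// back to the plain io.ReadFull.
func readUncompressedV2(r io.Reader, lenCompressed int, dec Decryptor, buf []byte) error {
	if dec == nil {
		// Plaintext page: same as before, with the error now checked.
		_, err := io.ReadFull(r, buf)
		return err
	}

	// Encrypted page: read the ciphertext, then decrypt into the buffer.
	ciphertext := make([]byte, lenCompressed)
	if _, err := io.ReadFull(r, ciphertext); err != nil {
		return err
	}
	plain, err := dec.Decrypt(ciphertext)
	if err != nil {
		return err
	}
	copy(buf, plain)
	return nil
}
```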
Rationale for this change
Fixing the decryption of V2 pages.
What changes are included in this PR?
This PR fixes the decryption of V2 pages.
Are these changes tested?
Yes, tests cover various scenarios for both V1 and V2 page encryption and decryption.
Are there any user-facing changes?
No.