Fix decryption of V2 data pages #596
Conversation
@zeroshade OK, so I've generated tools/pyarrow_encrypted_uniform.parquet using tools/write_encrypted_parquet.py and tools/arrowgo_encrypted_uniform.parquet using TestEncryptFile. Very likely I'm doing something wrong, but let me walk you through what I have. In the Python code I need an instance of FileEncryptionProperties to pass to write_table, but the only API I was able to find in PyArrow is CryptoFactory.file_encryption_properties, and for that I need a KMS. I've created a mock implementation, which just base64-encodes and -decodes the input, and used it to generate pyarrow_encrypted_uniform.parquet.

On the Go side I'm using an encryption.StringKeyIDRetriever:

```go
encryption.StringKeyIDRetriever{"footer_key": "0123456789012345"}

func (s StringKeyIDRetriever) GetKey(keyMetadata []byte) string {
	k, ok := s[*(*string)(unsafe.Pointer(&keyMetadata))]
	if !ok {
		panic(fmt.Errorf("parquet: key missing for id %s", keyMetadata))
	}
	return k
}
```

But when reading the PyArrow-generated file, the key metadata handed to GetKey looks like this:

```
keyMetadata={"keyMaterialType":"PKMT1","internalStorage":true,"isFooterKey":true,"kmsInstanceID":"DEFAULT","kmsInstanceURL":"DEFAULT","masterKeyID":"footer_key","wrappedDEK":"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=","doubleWrapping":true,"keyEncryptionKeyID":"7pmHfFBvnjd2Wbf218WOMQ==","wrappedKEK":"1eX3O2IHHTkAnuIXbbIQRA=="}
```

The whole keyMetadata value is used as a key in the retriever map, which obviously doesn't work. I'm guessing that I should implement a custom KeyIDRetriever, json-decode the retrieved metadata, and return the right key from it.
The other way around: when I try to run tools/read_encrypted_parquet.py, which should decrypt the arrow-go generated file arrowgo_encrypted_uniform.parquet, I get this error:

```
Reading arrowgo_encrypted_uniform.parquet
Traceback (most recent call last):
File "/daniel-adam-tfs/arrow-go/tools/./read_encrypted_parquet.py", line 27, in <module>
with pq.ParquetFile(
~~~~~~~~~~~~~~^
input_file,
^^^^^^^^^^^
decryption_properties=decryption_properties) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 328, in __init__
self.reader.open(
~~~~~~~~~~~~~~~~^
source, use_memory_map=memory_map,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<8 lines>...
arrow_extensions_enabled=arrow_extensions_enabled,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "pyarrow/_parquet.pyx", line 1656, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Failed to parse key metadata footer_key
```

Seems that it expects the key metadata to be written differently?
Sorry for the delay here, I was on vacation all last week. I see what the issue is. What is exposed via pyarrow is the high-level API for using a KMS to manage the keys by wrapping them. It will generate random bytes to use for the data encryption key and then wrap that key using the KMS client you provide. If we look at your mock example:

```python
class MockKmsClient(pe.KmsClient):
    def __init__(self, kms_connection_configuration):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)
```

and compare that with the unit tests from the Arrow repo for pyarrow parquet encryption, I was able to figure out what it is doing. Essentially, the key metadata is a JSON document whose wrappedDEK field holds the (mock-wrapped) data encryption key, so on the Go side a retriever like this works:

```go
import (
	"encoding/base64"
	"encoding/json"
)

type metadataKeyRetriever struct{}

func (metadataKeyRetriever) GetKey(keyMetadata []byte) string {
	var keyMeta struct {
		WrappedDEK string `json:"wrappedDEK"`
	}
	if err := json.Unmarshal(keyMetadata, &keyMeta); err != nil {
		panic(err)
	}
	byts, err := base64.StdEncoding.DecodeString(keyMeta.WrappedDEK)
	if err != nil {
		panic(err)
	}
	return string(byts)
}
```

Another option might be to manipulate the …

For Python: …

Does this help?
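For completeness, a rough sketch of wiring such a retriever into the arrow-go reader. The option and field names here (NewFileDecryptionProperties, WithKeyRetriever, FileDecryptProps, OpenParquetFile, WithReadProps) and the v18 module path are assumptions that may differ across versions, and metadataKeyRetriever is the type from the snippet above:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

func main() {
	// Assumed wiring: hand the JSON-aware retriever (metadataKeyRetriever
	// from the snippet above) to the decryption properties and attach those
	// to the reader properties used to open the file.
	decryptProps := parquet.NewFileDecryptionProperties(
		parquet.WithKeyRetriever(metadataKeyRetriever{}),
	)

	readProps := parquet.NewReaderProperties(memory.DefaultAllocator)
	readProps.FileDecryptProps = decryptProps

	rdr, err := file.OpenParquetFile("tools/pyarrow_encrypted_uniform.parquet", false,
		file.WithReadProps(readProps))
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	fmt.Println("row groups:", rdr.NumRowGroups())
}
```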
Back from the holidays and vacation. Let me check if I can work out how to make this work with your notes.

EDIT 1: I was able to make the Python -> Go direction work. Let's see if I can make it work the other way as well; I just need to write the correct metadata for PyArrow, I think.

EDIT 2: Yes, when I copy the metadata format, the Go -> Python direction works as well.
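For the Go -> PyArrow direction, the gist is emitting key metadata in the same JSON shape shown earlier. Below is a stdlib-only sketch of building that blob; the struct and helper names are made up for illustration, the field names are copied from the observed metadata, the "wrapping" mimics the base64 mock KMS rather than a real one, and single wrapping is assumed (the PyArrow sample used double wrapping with a KEK), so treat it as a starting point rather than the exact format PyArrow requires:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// pyarrowKeyMaterial mirrors the JSON key metadata observed in the
// PyArrow-written file above; only the fields shown there are included.
type pyarrowKeyMaterial struct {
	KeyMaterialType string `json:"keyMaterialType"`
	InternalStorage bool   `json:"internalStorage"`
	IsFooterKey     bool   `json:"isFooterKey"`
	KMSInstanceID   string `json:"kmsInstanceID"`
	KMSInstanceURL  string `json:"kmsInstanceURL"`
	MasterKeyID     string `json:"masterKeyID"`
	WrappedDEK      string `json:"wrappedDEK"`
	DoubleWrapping  bool   `json:"doubleWrapping"`
}

// buildKeyMetadata wraps the DEK the same way the mock KMS does (plain
// base64) and packs it into PyArrow-style JSON. A real setup would wrap
// the DEK with an actual KMS instead.
func buildKeyMetadata(dek []byte, masterKeyID string, isFooter bool) (string, error) {
	m := pyarrowKeyMaterial{
		KeyMaterialType: "PKMT1",
		InternalStorage: true,
		IsFooterKey:     isFooter,
		KMSInstanceID:   "DEFAULT",
		KMSInstanceURL:  "DEFAULT",
		MasterKeyID:     masterKeyID,
		WrappedDEK:      base64.StdEncoding.EncodeToString(dek),
		DoubleWrapping:  false,
	}
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	meta, _ := buildKeyMetadata([]byte("0123456789012345"), "footer_key", true)
	fmt.Println(meta) // pass this as the footer key metadata when writing from Go
}
```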
Force-pushed from 6c718f4 to 38b8fac.
I've generated 8 files, 4 using arrow-go and 4 using PyArrow, with this configuration: …

PyArrow successfully reads all 8 of them. For arrow-go, the V2 + uncompressed case fails in both versions; that is the issue we discovered in the related memory PR. Interestingly, though, (4) generated by PyArrow also fails. I'll investigate tomorrow. I'm also thinking about creating a test Docker container to test PyArrow <-> arrow-go compatibility. Or are there integration tests done some other way that I missed?
OK, I have it working - with V1 or V2 pages, with or without compression, with or without definition levels, with or without repetition levels. I'll clean it up and create tests tomorrow, and then we can get this thing merged.
Currently the integration tests simply confirm that arrow-go can read the known encrypted files from the github.com/apache/parquet-testing repo, and then confirm a round trip by writing and reading locally written versions of those files. It doesn't do any cross-language testing. The assumption is that if we confirm proper reading of the known-good files, and confirm we can write and read new versions of the same files, that is sufficient as long as all the implementations do the same. It might be worthwhile to have more explicit cross-implementation testing of the encrypted files, though.
Extract separate decryptV2 and decompressV2 methods to properly handle decryption and decompression of V2 data pages. Previously, V2 encrypted pages were not being decrypted before decompression, leading to failures.

The fix ensures that:
- Encrypted V2 pages are decrypted before decompression
- Uncompressed encrypted V2 pages are properly handled
- Level bytes (definition/repetition) are correctly preserved in both paths

Expand encryption test coverage to validate the fix across:
- Multiple compression types (snappy, uncompressed)
- Both data page versions (V1, V2)
- All encryption configurations (uniform, column-only, column+footer)
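As an illustration of the ordering this enforces (not the PR's actual code), here is a self-contained sketch: decrypt the whole V2 page buffer first when a data decryptor is configured, then decompress only the bytes after the repetition/definition levels, since V2 level bytes are stored uncompressed. The Decryptor and Codec interfaces are stand-ins for the real arrow-go types:

```go
package main

import "fmt"

// Stand-in interfaces; the real arrow-go decryptor/codec types differ.
type Decryptor interface {
	Decrypt(ciphertext []byte) ([]byte, error)
}

type Codec interface {
	Decode(dst, src []byte) ([]byte, error)
}

// readDataPageV2 sketches the order of operations the fix enforces:
// decrypt first (if needed), then decompress everything after the level
// bytes, keeping the levels themselves untouched.
func readDataPageV2(raw []byte, levelsByteLen, uncompressedSize int, isCompressed bool,
	decryptor Decryptor, codec Codec) ([]byte, error) {

	if decryptor != nil {
		// Encrypted V2 pages must be decrypted before any decompression.
		decrypted, err := decryptor.Decrypt(raw)
		if err != nil {
			return nil, err
		}
		raw = decrypted
	}

	if !isCompressed || codec == nil {
		// Uncompressed (possibly encrypted) page: nothing more to do.
		return raw, nil
	}

	// Rep/def levels are never compressed in V2; preserve them as-is and
	// decompress only the values section that follows.
	out := make([]byte, 0, uncompressedSize)
	out = append(out, raw[:levelsByteLen]...)

	values, err := codec.Decode(make([]byte, uncompressedSize-levelsByteLen), raw[levelsByteLen:])
	if err != nil {
		return nil, err
	}
	return append(out, values...), nil
}

func main() {
	// A plaintext, uncompressed page passes straight through.
	page := []byte{0x01, 0x02, 0x03}
	got, _ := readDataPageV2(page, 1, len(page), false, nil, nil)
	fmt.Println(got)
}
```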
Force-pushed from 361041b to 1c2dfea.
```diff
 	// write the int64 column, each row repeats twice
 	int64Writer := nextColumn().(*file.Int64ColumnChunkWriter)
-	for i := 0; i < 2*en.rowsPerRG; i++ {
+	for i := 0; i < en.rowsPerRG; i++ {
```
I was getting "batch writing for V2 data pages must start at a row boundary" errors for V2 pages with the original version.
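For context on the row-boundary requirement, a hedged sketch of writing a repeated column one row per batch, so each WriteBatch call starts with repetition level 0 (a new row) as V2 data pages require; the WriteBatch signature, module path, and the test's exact structure are assumptions:

```go
package enctest

import (
	"github.com/apache/arrow-go/v18/parquet/file"
)

// writeRepeatedRows writes a repeated int64 column one row per batch.
// Each row carries two values: repetition level 0 starts the row and 1
// continues it, so every WriteBatch call begins at a row boundary.
func writeRepeatedRows(w *file.Int64ColumnChunkWriter, rows int) error {
	for i := 0; i < rows; i++ {
		values := []int64{int64(i), int64(i)}
		defLevels := []int16{1, 1}
		repLevels := []int16{0, 1}
		if _, err := w.WriteBatch(values, defLevels, repLevels); err != nil {
			return err
		}
	}
	return nil
}
```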
```diff
 	p.dataPageHdrV2.Statistics = getDummyStats(statsSize, true)
 	p.dataPageHdrV2.NumValues = nrows
-	p.WriteDataPageHeaderV2(1024, 20, 10)
+	p.WriteDataPageHeaderV2(1024, 0, 0)
```
The original version of serializedPageReader.Next() didn't check the error from io.ReadFull. Was that deliberate? I've added the error checking, and now p.pageReader.Next() returns false if the compressed and uncompressed values are not 0, so I modified those.
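For reference, the checked io.ReadFull pattern being added looks like this (a minimal stdlib-only illustration, not the PR's exact code):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	r := bytes.NewReader([]byte("abc"))
	buf := make([]byte, 8)

	// Unchecked, io.ReadFull(r, buf) would silently leave buf short here.
	// Checked, ErrUnexpectedEOF surfaces when fewer bytes are available.
	if _, err := io.ReadFull(r, buf); err != nil {
		fmt.Println("read failed:", err) // read failed: unexpected EOF
	}
}
```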
I've modified the tests and extended the round-trip files. I still think a cross-implementation verification would be useful. Using Python seems like the easiest option, but I'm not sure how to deal with the KMS API that PyArrow uses. You figured out how to make it work with arrow-go, but is that API guaranteed to be stable?
@zeroshade I ran the failing tests and they don't exercise the changed code at all. I'm guessing there is some race condition in the tests, but it is unrelated to the code changes here.
```diff
 	if p.cryptoCtx.DataDecryptor != nil {
 		if err := p.readV2Encrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
 			p.err = err
 			return false
 		}
 	} else {
-		io.ReadFull(p.r, buf.Bytes())
+		if err := p.readV2Unencrypted(p.r, lenCompressed, levelsBytelen, compressed, buf.Bytes()); err != nil {
+			p.err = err
+			return false
+		}
 	}
```
So the issue is simply that we're not passing uncompressed v2 through the decryptor right?
Wouldn't a simpler solution here just be to add the decryption check to the non-compressed branch?
i.e. since the call to decompress will already decrypt the data, we just need to replace the io.ReadFull(p.r, buf.Bytes()) with code to check if we need to decrypt and then decrypt it here, otherwise just leave the io.ReadFull right?
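A sketch of the simpler shape suggested here: keep io.ReadFull for the plaintext case and branch only when a decryptor is present. Decryptor is a stand-in interface and readUncompressedV2 is a hypothetical helper name, not the PR's API:

```go
package pagereader

import "io"

// Decryptor is a stand-in for the real data decryptor type.
type Decryptor interface {
	Decrypt(ciphertext []byte) ([]byte, error)
}

// readUncompressedV2 is a hypothetical helper sketching the suggestion:
// in the uncompressed branch, either decrypt the raw page bytes or fall
// back to the plain io.ReadFull.
func readUncompressedV2(r io.Reader, lenCompressed int, dec Decryptor, buf []byte) error {
	if dec == nil {
		// Plaintext page: same as before, with the error now checked.
		_, err := io.ReadFull(r, buf)
		return err
	}

	// Encrypted page: read the ciphertext, then decrypt into the buffer.
	ciphertext := make([]byte, lenCompressed)
	if _, err := io.ReadFull(r, ciphertext); err != nil {
		return err
	}
	plain, err := dec.Decrypt(ciphertext)
	if err != nil {
		return err
	}
	copy(buf, plain)
	return nil
}
```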
Rationale for this change
Fixing the decryption of V2 pages.
What changes are included in this PR?
This PR fixes the decryption of V2 pages.
Are these changes tested?
Yes, tests cover various scenarios for both V1 and V2 page encryption and decryption.
Are there any user-facing changes?
No.