Skip to content

feat(encryption) [9/N] Read encrypted parquet data-files#2584

Open
xanderbailey wants to merge 4 commits into
apache:mainfrom
xanderbailey:xb/encryption_decrypt_parquet
Open

feat(encryption) [9/N] Read encrypted parquet data-files#2584
xanderbailey wants to merge 4 commits into
apache:mainfrom
xanderbailey:xb/encryption_decrypt_parquet

Conversation

@xanderbailey

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

We're currently not able to decrypt PME encrypted files, all that's needed for this PR is to wire through the key_metadata from the manifest entries into the scan tasks and then configure the parquet reader with the correct encryption key and aad.

It's important to note that no encryption manager is needed here since the key_metadata on the manifest files contain the plaintext keys.

What changes are included in this PR?

Are these changes tested?

Tests follow the pattern on manually writing an encrypted parquet file using the raw arrow-rs writer and then using the FileScanTask and iceberg reader to read that tmp file.

I looked at using the parquet testing repo but the encrypted files there all use column based encryption which isn't supported by the spec. I'll separately track adding some files there that we can use here but the testing strategy here is consistent with how Java does things so I think this should be sufficient for us to continue.

@xanderbailey xanderbailey changed the title feat(encryption) [8/N] Add decryption support for parquet data-files feat(encryption) [8/N] Read encrypted parquet data-files Jun 4, 2026
@xanderbailey xanderbailey changed the title feat(encryption) [8/N] Read encrypted parquet data-files feat(encryption) [9/N] Read encrypted parquet data-files Jun 4, 2026

@mbutrovich mbutrovich left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xanderbailey for driving the encryption effort! I took a first pass through the PR. Overall this is a clean, well-scoped slice: the key_metadata plumbing is consistent across the data path and both delete-file loaders, and the missing-key / wrong-key tests are a nice touch. I left a few small inline comments; the only substantive thread is the data-file key-resolution seam, but maybe I should bring this up at the RFC-level?

The reader resolves keys by decoding the key metadata as StandardKeyMetadata and pulling out a plaintext DEK (left an inline note at the decode site with specifics). That is exactly right for the reference format, but the spec defines key_metadata as "implementation-specific" (field 131), and StandardKeyMetadata is the reference implementation's choice rather than a spec mandate. So the hardcoded decode means a scheme that, say, stores a wrapped DEK plus a key id and resolves it from a KMS at read time cannot plug in. Would it be worth keeping the StandardKeyMetadata path as the default but exposing the resolution step as a trait, so other implementations can supply their own? Iceberg Java does something similar: EncryptionManager is an interface with StandardEncryptionManager as the spec-compliant default, selected per table. Two things to note if that is interesting: the hook likely needs to be async (KMS-resolved keys need a round-trip, and arrow-rs's KeyRetriever is synchronous), and the reader would need a handle to the table-level KeyManagementClient that #2650 wires up, which today stops at the Table and does not reach the scan path. The RFC (#2183) currently makes EncryptionManager a concrete struct, and its pluggable seams sit at the KEK/KMS-wrap layer, so this feels like a question for that discussion rather than this PR. Mostly raising it so the concrete decode path does not become hard to retrofit as the series builds on it.

One other small tangent is that you mention:

the encrypted files there all use column based encryption which isn't supported by the spec

parquet-testing does also ship a uniform_encryption.parquet.encrypted that uses a single file key, which matches the spec model (arrow-rs reads it with just a footer key). Hand-writing the fixtures still seems reasonable (it gives control over the StandardKeyMetadata wrapping and AAD prefix), but if iceberg-rust ever vendors that data the committed uniform file could double as a cross-implementation interop fixture.

CC @ggershinsky as well to see if he has any comments.


/// Builds `ArrowReaderOptions`, adding `FileDecryptionProperties` when
/// key metadata is present for Parquet Modular Encryption.
fn build_arrow_reader_options(key_metadata: Option<&[u8]>) -> Result<ArrowReaderOptions> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolution is hardcoded to StandardKeyMetadata::decode(km) -> plaintext DEK (line 456-458). Worth noting the spec defines key_metadata as "implementation-specific key metadata for encryption" (field 131), and StandardKeyMetadata is the reference implementation's format rather than a spec-mandated one. So baking in that single decode path means non-standard key-management schemes (for example, where key_metadata is a key reference resolved from a KMS at read time rather than an embedded DEK) cannot plug in.

Would it be worth keeping the StandardKeyMetadata path as the default but exposing the resolution step as a trait, so other implementations can supply their own? Roughly:

#[async_trait]
pub trait FileDecryptionHandler: Send + Sync {
    async fn file_decryption_properties(&self, key_metadata: &[u8]) -> Result<FileDecryptionProperties>;
}

One reason to make it async: a scheme that resolves the DEK from a KMS at read time (where key_metadata carries a wrapped DEK plus a key id, not a plaintext DEK) needs a network round-trip during resolution. arrow-rs has a lower-level parquet::encryption::decrypt::KeyRetriever, but its retrieve_key(&self, key_metadata) -> Result<Vec<u8>> is synchronous, so it cannot perform an async unwrap inline. The general path is therefore to resolve the key (async) and hand arrow-rs an explicit-key FileDecryptionProperties via builder(dek) (exactly what this PR already does at line 457), with the default StandardKeyMetadata impl just being the synchronous no-KMS case.

Iceberg Java structures it this way: EncryptionManager is an interface, with StandardEncryptionManager as the spec-compliant default, selected per table and pluggable via the encryption.kms-impl catalog property. Worth noting the seam there is at the manager / decrypt level, not only at the KMS client, because an alternative scheme may differ in both the key_metadata layout and the unwrap step, not just the KMS backend.

Is a hook like this already planned for a later PR, or intentionally deferred? Mainly want to avoid the concrete decode path becoming hard to retrofit once the rest of the series builds on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is very interesting, I opened a thread https://github.com/apache/iceberg/pull/16527/changes and https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:[DISCUSS]%20Specifying%20the%20encryption%20key%20metadata%20formats%20for%20cross-implementation%20interop to actually close this gap in the spec. It would appear to me to be the case that adding specific implementations here would beak a table's ability to interop with different engines / clients and hence I have suggested closing that gap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude an I riffed for a bit to see what this might look like https://github.com/xanderbailey/iceberg-rust/pull/2/changes and it's not a crazy change to add in but I'm disinclined to add it into this PR since it changes the scope of this PR by quite a bit. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
#[builder(default)]
pub key_metadata: Option<Box<[u8]>>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileScanTask derives Serialize, so with StandardKeyMetadata the plaintext DEK serializes into the plan when a task ships to a worker. skip_serializing_if (line 125) only covers the None case. Is that an acceptable trust boundary here, or worth a field doc note? (A pluggable key-resolution hook in the reader would also address it, since the task would then carry only a key reference rather than the DEK.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark

Has BaseFileScanTask link which has this which contains raw dek metadata. It's a very good flag though so I'll add a comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! There are other places in Spark and/or Comet where we serialize credentials and the only thing we can do is suggest that users enable secure communication channels (SSL, encryption, etc.).

assert_eq!(err.kind(), crate::ErrorKind::Unexpected);
let err_str = format!("{err}");
assert!(
err_str.contains("encrypted footer"),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to update this if parquet ever words that error differently. Nothing worth changing, just noting it as a future maintenance spot.

assert_eq!(result.len(), 1);
}

fn write_encrypted_parquet(path: &str, batch: &arrow_array::RecordBatch, key: &[u8]) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_encrypted_parquet is copy-pasted, and the delete-loader copy drops the aad_prefix param. Share one helper so the delete path also gets AAD coverage?

fn build_arrow_reader_options(key_metadata: Option<&[u8]>) -> Result<ArrowReaderOptions> {
match key_metadata {
Some(km) => {
let standard_key_metadata = StandardKeyMetadata::decode(km)?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StandardKeyMetadata.file_length is parsed but unused; the size used for reading comes from task.file_size_in_bytes. That matches Java, where fileLength() has no consumer on the native-encryption path (it is for AGS1 stream decryption, not PME data files), so this looks correct. Not worth asserting file_length == file_size_in_bytes: file_length is optional and the spec does not guarantee the two are equal.

Some(km) => {
let standard_key_metadata = StandardKeyMetadata::decode(km)?;
let mut builder = FileDecryptionProperties::builder(
standard_key_metadata.encryption_key().as_bytes().to_vec(),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decoded DEK is passed straight to FileDecryptionProperties::builder(key). A malformed key currently surfaces as arrow-rs's generic build/decrypt error. Would an explicit check that the key is a valid AES length (16/24/32 bytes), returning a clear iceberg::Error, be worth adding? That is a real invariant with a better message than the downstream failure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO: Extract name_mapping from table metadata property "schema.name-mapping.default"
.with_name_mapping(None)
.with_case_sensitive(self.case_sensitive)
.with_key_metadata(self.manifest_entry.data_file.key_metadata().map(Box::from))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map(Box::from) copies the key-metadata slice into a fresh allocation for every file on the scan path. Fine if intended, just flagging it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata here is very small (max 32 byte keys + aad) so it's a pretty small cost and hard to avoid with the current setup unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants