feat(encryption) [9/N] Read encrypted parquet data-files by xanderbailey · Pull Request #2584 · apache/iceberg-rust

xanderbailey · 2026-06-04T08:56:15Z

Which issue does this PR close?

Working towards Support client-side encryption #2034

We're currently not able to decrypt PME encrypted files, all that's needed for this PR is to wire through the key_metadata from the manifest entries into the scan tasks and then configure the parquet reader with the correct encryption key and aad.

It's important to note that no encryption manager is needed here since the key_metadata on the manifest files contain the plaintext keys.

What changes are included in this PR?

Are these changes tested?

Tests follow the pattern on manually writing an encrypted parquet file using the raw arrow-rs writer and then using the FileScanTask and iceberg reader to read that tmp file.

I looked at using the parquet testing repo but the encrypted files there all use column based encryption which isn't supported by the spec. I'll separately track adding some files there that we can use here but the testing strategy here is consistent with how Java does things so I think this should be sufficient for us to continue.

mbutrovich

Thanks @xanderbailey for driving the encryption effort! I took a first pass through the PR. Overall this is a clean, well-scoped slice: the key_metadata plumbing is consistent across the data path and both delete-file loaders, and the missing-key / wrong-key tests are a nice touch. I left a few small inline comments; the only substantive thread is the data-file key-resolution seam, but maybe I should bring this up at the RFC-level?

The reader resolves keys by decoding the key metadata as StandardKeyMetadata and pulling out a plaintext DEK (left an inline note at the decode site with specifics). That is exactly right for the reference format, but the spec defines key_metadata as "implementation-specific" (field 131), and StandardKeyMetadata is the reference implementation's choice rather than a spec mandate. So the hardcoded decode means a scheme that, say, stores a wrapped DEK plus a key id and resolves it from a KMS at read time cannot plug in. Would it be worth keeping the StandardKeyMetadata path as the default but exposing the resolution step as a trait, so other implementations can supply their own? Iceberg Java does something similar: EncryptionManager is an interface with StandardEncryptionManager as the spec-compliant default, selected per table. Two things to note if that is interesting: the hook likely needs to be async (KMS-resolved keys need a round-trip, and arrow-rs's KeyRetriever is synchronous), and the reader would need a handle to the table-level KeyManagementClient that #2650 wires up, which today stops at the Table and does not reach the scan path. The RFC (#2183) currently makes EncryptionManager a concrete struct, and its pluggable seams sit at the KEK/KMS-wrap layer, so this feels like a question for that discussion rather than this PR. Mostly raising it so the concrete decode path does not become hard to retrofit as the series builds on it.

One other small tangent is that you mention:

the encrypted files there all use column based encryption which isn't supported by the spec

parquet-testing does also ship a uniform_encryption.parquet.encrypted that uses a single file key, which matches the spec model (arrow-rs reads it with just a footer key). Hand-writing the fixtures still seems reasonable (it gives control over the StandardKeyMetadata wrapping and AAD prefix), but if iceberg-rust ever vendors that data the committed uniform file could double as a cross-implementation interop fixture.

CC @ggershinsky as well to see if he has any comments.

mbutrovich · 2026-06-15T15:19:40Z

+
+    /// Builds `ArrowReaderOptions`, adding `FileDecryptionProperties` when
+    /// key metadata is present for Parquet Modular Encryption.
+    fn build_arrow_reader_options(key_metadata: Option<&[u8]>) -> Result<ArrowReaderOptions> {


Resolution is hardcoded to StandardKeyMetadata::decode(km) -> plaintext DEK (line 456-458). Worth noting the spec defines key_metadata as "implementation-specific key metadata for encryption" (field 131), and StandardKeyMetadata is the reference implementation's format rather than a spec-mandated one. So baking in that single decode path means non-standard key-management schemes (for example, where key_metadata is a key reference resolved from a KMS at read time rather than an embedded DEK) cannot plug in.

Would it be worth keeping the StandardKeyMetadata path as the default but exposing the resolution step as a trait, so other implementations can supply their own? Roughly:

#[async_trait] pub trait FileDecryptionHandler: Send + Sync { async fn file_decryption_properties(&self, key_metadata: &[u8]) -> Result<FileDecryptionProperties>; }

One reason to make it async: a scheme that resolves the DEK from a KMS at read time (where key_metadata carries a wrapped DEK plus a key id, not a plaintext DEK) needs a network round-trip during resolution. arrow-rs has a lower-level parquet::encryption::decrypt::KeyRetriever, but its retrieve_key(&self, key_metadata) -> Result<Vec<u8>> is synchronous, so it cannot perform an async unwrap inline. The general path is therefore to resolve the key (async) and hand arrow-rs an explicit-key FileDecryptionProperties via builder(dek) (exactly what this PR already does at line 457), with the default StandardKeyMetadata impl just being the synchronous no-KMS case.

Iceberg Java structures it this way: EncryptionManager is an interface, with StandardEncryptionManager as the spec-compliant default, selected per table and pluggable via the encryption.kms-impl catalog property. Worth noting the seam there is at the manager / decrypt level, not only at the KMS client, because an alternative scheme may differ in both the key_metadata layout and the unwrap step, not just the KMS backend.

Is a hook like this already planned for a later PR, or intentionally deferred? Mainly want to avoid the concrete decode path becoming hard to retrofit once the rest of the series builds on it.

Ah this is very interesting, I opened a thread https://github.com/apache/iceberg/pull/16527/changes and https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:[DISCUSS]%20Specifying%20the%20encryption%20key%20metadata%20formats%20for%20cross-implementation%20interop to actually close this gap in the spec. It would appear to me to be the case that adding specific implementations here would beak a table's ability to interop with different engines / clients and hence I have suggested closing that gap.

Claude an I riffed for a bit to see what this might look like https://github.com/xanderbailey/iceberg-rust/pull/2/changes and it's not a crazy change to add in but I'm disinclined to add it into this PR since it changes the scope of this PR by quite a bit. WDYT?

It also doesn't look like java allows for this kind of plug-ability (yet) https://github.com/apache/iceberg/blob/8d0508308cd068778897c06d3a802740ed4305a6/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L796-L805

mbutrovich · 2026-06-15T15:20:19Z

+    #[serde(default)]
+    #[serde(skip_serializing_if = "Option::is_none")]
+    #[builder(default)]
+    pub key_metadata: Option<Box<[u8]>>,


FileScanTask derives Serialize, so with StandardKeyMetadata the plaintext DEK serializes into the plan when a task ships to a worker. skip_serializing_if (line 125) only covers the None case. Is that an acceptable trust boundary here, or worth a field doc note? (A pluggable key-resolution hook in the reader would also address it, since the task would then carry only a key reference rather than the DEK.)

Spark

Has BaseFileScanTask link which has this which contains raw dek metadata. It's a very good flag though so I'll add a comment.

add docs for trust boundry

Thank you! There are other places in Spark and/or Comet where we serialize credentials and the only thing we can do is suggest that users enable secure communication channels (SSL, encryption, etc.).

mbutrovich · 2026-06-15T15:22:24Z

+        assert_eq!(err.kind(), crate::ErrorKind::Unexpected);
+        let err_str = format!("{err}");
+        assert!(
+            err_str.contains("encrypted footer"),


We would need to update this if parquet ever words that error differently. Nothing worth changing, just noting it as a future maintenance spot.

mbutrovich · 2026-06-15T15:22:46Z

        assert_eq!(result.len(), 1);
    }
+
+    fn write_encrypted_parquet(path: &str, batch: &arrow_array::RecordBatch, key: &[u8]) {


write_encrypted_parquet is copy-pasted, and the delete-loader copy drops the aad_prefix param. Share one helper so the delete path also gets AAD coverage?

mbutrovich · 2026-06-15T15:38:21Z

+    fn build_arrow_reader_options(key_metadata: Option<&[u8]>) -> Result<ArrowReaderOptions> {
+        match key_metadata {
+            Some(km) => {
+                let standard_key_metadata = StandardKeyMetadata::decode(km)?;


StandardKeyMetadata.file_length is parsed but unused; the size used for reading comes from task.file_size_in_bytes. That matches Java, where fileLength() has no consumer on the native-encryption path (it is for AGS1 stream decryption, not PME data files), so this looks correct. Not worth asserting file_length == file_size_in_bytes: file_length is optional and the spec does not guarantee the two are equal.

mbutrovich · 2026-06-15T15:39:03Z

+            Some(km) => {
+                let standard_key_metadata = StandardKeyMetadata::decode(km)?;
+                let mut builder = FileDecryptionProperties::builder(
+                    standard_key_metadata.encryption_key().as_bytes().to_vec(),


The decoded DEK is passed straight to FileDecryptionProperties::builder(key). A malformed key currently surfaces as arrow-rs's generic build/decrypt error. Would an explicit check that the key is a valid AES length (16/24/32 bytes), returning a clear iceberg::Error, be worth adding? That is a real invariant with a better message than the downstream failure.

mbutrovich · 2026-06-15T15:39:35Z

            // TODO: Extract name_mapping from table metadata property "schema.name-mapping.default"
            .with_name_mapping(None)
            .with_case_sensitive(self.case_sensitive)
+            .with_key_metadata(self.manifest_entry.data_file.key_metadata().map(Box::from))


map(Box::from) copies the key-metadata slice into a fresh allocation for every file on the scan path. Fine if intended, just flagging it.

The metadata here is very small (max 32 byte keys + aad) so it's a pretty small cost and hard to avoid with the current setup unfortunately.

xanderbailey changed the title ~~feat(encryption) [8/N] Add decryption support for parquet data-files~~ feat(encryption) [8/N] Read encrypted parquet data-files Jun 4, 2026

xanderbailey changed the title ~~feat(encryption) [8/N] Read encrypted parquet data-files~~ feat(encryption) [9/N] Read encrypted parquet data-files Jun 4, 2026

mbutrovich reviewed Jun 15, 2026

View reviewed changes

xanderbailey added 4 commits June 15, 2026 21:53

Add decryption support for parquet data-files

dc53c73

pub api

f062473

check invalid key length

fa8a4c9

add docs for trust boundry

51f84d2

xanderbailey force-pushed the xb/encryption_decrypt_parquet branch from b02c281 to 51f84d2 Compare June 15, 2026 20:53

xanderbailey mentioned this pull request Jun 15, 2026

DRAFT allow non standard key metadata in parquet xanderbailey/iceberg-rust#2

Draft

Conversation

xanderbailey commented Jun 4, 2026

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

Uh oh!

mbutrovich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants