Skip to content

refactor(amber): stop hardcoding S3 in REST catalog init#4988

Open
mengw15 wants to merge 11 commits into
apache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog
Open

refactor(amber): stop hardcoding S3 in REST catalog init#4988
mengw15 wants to merge 11 commits into
apache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog

Conversation

@mengw15
Copy link
Copy Markdown
Contributor

@mengw15 mengw15 commented May 8, 2026

What changes were proposed in this PR?

Stop hardcoding s3.endpoint, s3.region, s3.path-style-access, s3.access-key-id and s3.secret-access-key at REST-catalog init in both IcebergUtil.createRestCatalog (Scala) and iceberg_utils.create_rest_catalog (Python). Both helpers now pass only warehouse + catalog uri (and on the Scala side the FileIO impl hint).

Why: When a Lakekeeper warehouse is created, its S3 settings (endpoint, region, credentials, path-style) are registered against that warehouse on the server. At catalog init the client only needs warehouse + uri — Lakekeeper resolves the S3 config from the warehouse record and serves it back. The hardcoded StorageConfig.s3* values on the client were redundant, and forcing them everywhere also pinned every warehouse to the single system bucket. Removing them lets each warehouse own its own storage settings.

StorageConfig.s3* itself is kept — pytexera/storage/large_binary_manager.py still uses it for the non-Iceberg texera-large-binaries bucket (R UDF large-binary support), which is out of scope.

Any related issues, documentation, discussions?

Closes #4987

How was this PR tested?

  • sbt "WorkflowCore/compile" — passes; verifies no other Scala caller depends on the removed properties.
  • Python edits parse cleanly via ast.parse; the only caller (iceberg_catalog_instance.py) is updated to match the new create_rest_catalog signature.

End-to-end verification (warehouse with its own S3 settings → REST catalog opened with only warehouse + uri → table round-trip) requires a running Lakekeeper, which CI doesn't have today. #4276 (draft) wires Lakekeeper into CI; once that lands I'll add the integration test on top of it.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

@github-actions github-actions Bot added python refactor Refactor the code common labels May 8, 2026
@mengw15 mengw15 requested a review from bobbai00 May 8, 2026 20:50
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.66%. Comparing base (62d4489) to head (2ebe567).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4988      +/-   ##
============================================
+ Coverage     42.64%   42.66%   +0.02%     
- Complexity     2188     2189       +1     
============================================
  Files          1045     1045              
  Lines         39876    39870       -6     
  Branches       4205     4205              
============================================
+ Hits          17004    17010       +6     
+ Misses        21811    21800      -11     
+ Partials       1061     1060       -1     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from 62d4489
amber 43.35% <100.00%> (+0.05%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 33.85% <ø> (ø) Carriedforward from 62d4489
python 88.90% <ø> (ø)
workflow-compiling-service 47.72% <ø> (ø) Carriedforward from 62d4489

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 9, 2026

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

Thanks for the questions.

Why the change is needed. When a Lakekeeper warehouse is created, the S3 settings (endpoint, region, credentials, path-style, etc.) are already registered against that warehouse on the server side. At REST-catalog init the client only needs the warehouse identifier and uri — Lakekeeper resolves and serves the S3 config from the warehouse record. The previously hardcoded s3.* properties from StorageConfig were therefore redundant on the client; deleting them lets each warehouse own its own storage settings instead of all warehouses being forced onto the system bucket. I'll tighten the PR description to say just this.

About tests. End-to-end verification needs a running Lakekeeper, which CI doesn't have yet. #4276 (draft) adds Lakekeeper to CI; once that lands I'll layer an integration test on top of it that creates a warehouse with its own S3 settings, opens a REST catalog with only warehouse + uri, and round-trips a table.

mengw15 and others added 3 commits May 10, 2026 09:29
Align test_iceberg_rest_catalog_integration.py with create_rest_catalog's
new signature after S3 settings stopped being passed at catalog init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Yicong-Huang
Copy link
Copy Markdown
Contributor

now as #4276 is merged, can we hook up with new tests?

@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 11, 2026

now as #4276 is merged, can we hook up with new tests?

With #4276 merged, which brought over the Lakekeeper CI job and the two integration tests.

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yicong-Huang commented May 11, 2026

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

sg. thanks! let's also make sure the coverage is filled, this can make sure your changes in this PR are actually being tested in the CI.

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.75%. Comparing base (14f8be4) to head (b7b3f84).

Files with missing lines Patch % Lines
...ala/org/apache/texera/amber/util/IcebergUtil.scala 0.00% 3 Missing ⚠️

See more #4988 (comment)

mengw15 and others added 2 commits May 11, 2026 10:45
Amber integration job runs without jacoco, so IcebergRestCatalogIntegrationSpec
does not register on codecov. Add a unit test that drives createRestCatalog
far enough to construct the property Map; .initialize then throws because no
Lakekeeper is up in unit-test scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 11, 2026

These two tests are testing the createRestCatalog in Scala and python. so this PR is covered end-to-end. With CI passed, I think we can confirm that this change works.

sg. thanks! let's also make sure the coverage is filled, this can make sure your changes in this PR are actually being tested in the CI.

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.✅ Project coverage is 42.75%. Comparing base (14f8be4) to head (b7b3f84).

Files with missing lines Patch % Lines
...ala/org/apache/texera/amber/util/IcebergUtil.scala 0.00% 3 Missing ⚠️
See more #4988 (comment)

It seems like the amber-integration job doesn't upload to Codecov. Added a mock unit test in IcebergUtilSpec to satisfy the patch number; it just drives createRestCatalog until .initialize throws (no Lakekeeper in unit-test scope). Real coverage still comes from the integration test.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yes integration test is for now designed not to alter coverage report: we rely on unit tests.

mengw15 and others added 2 commits May 11, 2026 18:01
Tightens the previous coverage-only test: instead of intercepting any
Exception, assert RESTException specifically. The property Map is built
before .initialize, so a RESTException from either an unreachable
Lakekeeper or a missing warehouse confirms the Map composition is sound
and the failure is server-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg REST catalog should not hardcode S3 properties at init

3 participants