feat: sandbox runtime and capability policy by AngeloDanducci · Pull Request #1171 · generative-computing/mellea

AngeloDanducci · 2026-05-27T14:01:14Z

Pull Request

Issue

Fixes #1021

Description

Allow more granular permissions to be used during sandboxing via capability policy.

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Attribution

AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

Component
Requirement
Sampling Strategy
Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

github-actions · 2026-05-27T14:01:29Z

This comment is managed by a bot. Editing it is fine — checking off boxes, adding notes — but please leave the HTML comment marker on the first line alone, otherwise checklist updates will break.

Requirement PR Checklist

Use this checklist when adding or modifying requirements in mellea/stdlib/requirements/.

Base Class

Extends appropriate base class:
- Requirement - standard requirement
- ALoraRequirement - uses specialized Intrinsic/Adapter for generation-based validation

Validation Logic

validation_fn defined (if using Python-based validation)
- re-usable functionality within the validation_fn should be separated out into mellea/stdlib/tools/
validate returns a ValidationResult with
- a thunk and context if using a backend to generate
- a specific reason and score when possible

Integration

Requirement exported in mellea/stdlib/requirements/__init__.py or, if you are adding a library of requirements, from your sub-module

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

psschwei · 2026-05-28T12:43:42Z

General question: if I wanted to test this out, how would I use it?

planetf1

Really clean redesign — the tier model is a big improvement over the old boolean flags, CapabilityPolicy with its honest ENFORCED_* separation is a nice touch, and make_execution_environment is exactly the right API shape. A few things below need fixing before merge; the rest are suggestions or noticed-in-passing nits.
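To make the tier and ENFORCED_* idea concrete, here is a minimal sketch of how a tier enum can replace boolean flags while stating which policy fields each tier can actually enforce. Everything in it is an assumption for illustration: the tier members, the `ENFORCED_BY_TIER` map, `SketchPolicy` and `unenforced_fields` are hypothetical names, not the PR's real API.

```python
from dataclasses import dataclass, fields
from enum import Enum


class ExecutionTier(Enum):
    # Hypothetical tier names; the PR's real members may differ.
    STATIC = "static"
    LOCAL = "local"
    SANDBOX = "sandbox"


# Hypothetical map: which policy fields each tier can actually enforce.
ENFORCED_BY_TIER = {
    ExecutionTier.STATIC: frozenset(),
    ExecutionTier.LOCAL: frozenset({"timeout"}),
    ExecutionTier.SANDBOX: frozenset({"timeout", "allow_network"}),
}


@dataclass(frozen=True)
class SketchPolicy:
    # Stand-in for a capability policy; field names are assumptions.
    timeout: int = 30
    allow_network: bool = False


def unenforced_fields(policy: SketchPolicy, tier: ExecutionTier) -> list[str]:
    """Return the policy fields the chosen tier cannot enforce, sorted by name."""
    enforced = ENFORCED_BY_TIER[tier]
    return sorted(f.name for f in fields(policy) if f.name not in enforced)


if __name__ == "__main__":
    for tier in ExecutionTier:
        print(tier.name, unenforced_fields(SketchPolicy(), tier))
```

A single enum value selects one well-defined behaviour, whereas independent booleans allow contradictory combinations. Listing the unenforced fields per tier lets a caller warn about them instead of silently ignoring them.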

AngeloDanducci · 2026-05-29T06:11:34Z

Thanks for the feedback @planetf1, I've addressed it all in the most recent commit, along with a small change to help with E2E.

@psschwei, if you want to test this, you can try something like the snippet below, assuming you have docker/colima/podman running and the sandbox extra installed via uv sync:

from pathlib import Path
from mellea.stdlib.tools import LLMSandboxEnvironment, CapabilityPolicy

policy = CapabilityPolicy(
    timeout=60,
    artifact_export_paths=[Path("/output/result.txt")],
)

with LLMSandboxEnvironment(policy=policy) as env:
    result = env.execute("""
import pathlib
pathlib.Path('/output').mkdir(exist_ok=True)
with open('/output/result.txt', 'w') as f:
    f.write('hello from the sandbox')
""")
    print("success:", result.success)
    print("artifacts:", result.artifacts)
    if result.artifacts:
        print("content:", result.artifacts[0].path.read_text())

You should see the artifact exported from the sandbox, as allowed by the policy.
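For intuition about what an export allowlist like `artifact_export_paths` has to decide, here is a rough standalone sketch of a path check. The thread does not show how the real implementation matches paths, so the matching rule (exact path or anything under an allowed directory) and the `is_exportable` helper are assumptions.

```python
from pathlib import PurePosixPath


def is_exportable(candidate: str, allowed: list[PurePosixPath]) -> bool:
    """True if candidate equals an allowed path or lies under an allowed directory."""
    path = PurePosixPath(candidate)
    # Pure paths never touch the filesystem, so ".." is not resolved;
    # reject it outright rather than guess where it points.
    if ".." in path.parts:
        return False
    return any(path == entry or entry in path.parents for entry in allowed)


if __name__ == "__main__":
    allowlist = [PurePosixPath("/output/result.txt")]
    for name in ("/output/result.txt", "/etc/passwd", "/output/../etc/passwd"):
        print(name, is_exportable(name, allowlist))
```

Using `PurePosixPath` keeps the check independent of the host OS, which matters when the paths describe the container's filesystem rather than the caller's.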

planetf1 · 2026-05-29T07:32:12Z

Checked through all the fixes from my earlier review — everything looks good. The timeout regression, leak, legacy int shim, container ID fallback, unconditional warning, and truncation edge case are all addressed correctly.
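One plausible reading of a "legacy int shim" is a compatibility path that still accepts a bare integer timeout where a policy object is now expected. The sketch below shows that pattern under that assumption only; `SketchPolicy` and `coerce_policy` are hypothetical names and the PR's actual shim may look quite different.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPolicy:
    # Stand-in for a capability policy; only the timeout matters here.
    timeout: int = 30


def coerce_policy(value: object) -> SketchPolicy:
    """Accept a policy, or a legacy bare-int timeout, and return a policy."""
    if isinstance(value, SketchPolicy):
        return value
    # bool is a subclass of int, so rule it out before the int branch.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected SketchPolicy or int, got {type(value).__name__}")
    warnings.warn(
        "passing a bare int timeout is deprecated; pass a policy instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return SketchPolicy(timeout=value)


if __name__ == "__main__":
    with warnings.catch_warnings():
        warnings.simplefilter("always")
        print(coerce_policy(45))
```

Keeping the conversion in one helper means old call sites keep working while the deprecation warning points at the caller through `stacklevel=2`.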

I also ran some broader variations of your code snippet (capability matrix coverage, static import blocking, one-shot warning, local interpreter, artifact export) and they all behaved as expected. There was a Docker socket timeout on a couple of the sandbox tests but that was down to my local environment rather than anything in the code.
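For readers unfamiliar with static import blocking, a minimal sketch of the general technique follows: parse the source with `ast` and flag imports of disallowed modules before anything runs. The blocklist contents and the `blocked_imports` helper are assumptions for illustration and are not taken from the PR.

```python
import ast

# Hypothetical blocklist; the PR's real list is not shown in this thread.
BLOCKED = frozenset({"os", "subprocess", "socket"})


def blocked_imports(source: str, blocked: frozenset[str] = BLOCKED) -> list[str]:
    """Return the blocked top-level modules imported anywhere in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)


if __name__ == "__main__":
    print(blocked_imports("import os.path\nfrom subprocess import run\nimport json"))
```

A check like this only sees literal import statements. Dynamic imports such as `__import__("os")` or `importlib.import_module` pass through it, which is one reason to pair static checks with a container boundary.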

Just the one test failure to sort (comment above) and this is good to go.

AngeloDanducci requested a review from a team as a code owner May 27, 2026 14:01

AngeloDanducci requested review from markstur, planetf1 and psschwei May 27, 2026 14:01

github-actions Bot added the enhancement New feature or request label May 27, 2026

AngeloDanducci added 4 commits May 27, 2026 17:00

first pass at sandbox and capability policy

9b08974

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

sandbox policy and tier cleanup

824327a

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

replace ad-hoc execution flags with executiontier and capabilitypolicy

d8e56ea

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

meet docstring quality gate

620a3c7

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci force-pushed the ad-1021 branch from 7a236c5 to 620a3c7 Compare May 27, 2026 21:02

planetf1 requested changes May 28, 2026

planetf1 previously requested changes May 28, 2026

address review feedback

56df72b

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

planetf1 reviewed May 29, 2026

Comment thread test/stdlib/requirements/test_reqlib_python.py

update compability matrix tier test

c86dd80

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci enabled auto-merge May 29, 2026 14:58

AngeloDanducci requested a review from planetf1 May 29, 2026 14:58
