Skip to content

feat(deployments): scaffold plugin API and registry (AIRCORE-755)#280

Open
tylersbray wants to merge 3 commits into
mainfrom
755-deployments-plugin-scaffold-the-plugin-entities-api-backend/tbray
Open

feat(deployments): scaffold plugin API and registry (AIRCORE-755)#280
tylersbray wants to merge 3 commits into
mainfrom
755-deployments-plugin-scaffold-the-plugin-entities-api-backend/tbray

Conversation

@tylersbray

@tylersbray tylersbray commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Scaffolds nemo-deployments-plugin as the substrate-agnostic contract layer for AIRCORE-755:

  • Entities: DeploymentConfig, Deployment, Volume plus RFC supporting types
  • API v1: CRUD for configs/deployments/volumes at /apis/deployments/v1/workspaces/{workspace}/...
  • Reconciler hooks: bulk status_in list filter; controller-only PUT .../status (service-principal gated via X-NMP-Principal-Id)
  • Backend contract: DeploymentBackend ABC + BackendStatusUpdate / VolumeStatusUpdate
  • Executor registry: named-instance pattern; plugin starts cleanly with zero backends registered
  • Validation: prerequisite cycle detection (paginated config graph)

No substrate backends or reconcile controller in this PR — those are AIRCORE-756/757/758.

Test plan

  • uv run pytest plugins/nemo-deployments/tests/unit -v (31 tests)
  • uv run ruff check plugins/nemo-deployments
  • uv run ty check plugins/nemo-deployments
  • Pre-commit hooks pass (including no-nmp-common-in-plugins)
  • Manual smoke: nemo services run → CRUD against /apis/deployments/v1/... (follow-on if desired)

Summary by CodeRabbit

  • New Features

    • Adds Deployments, Deployment Configs (with prerequisite-cycle validation), Volumes, and status update APIs under /apis/deployments/v1/workspaces/{workspace}
    • Introduces executor registry support for multiple backend executors
    • Status endpoints require service-principal authentication (writes gated)
  • Documentation

    • Added plugin README with setup, API base paths, and usage examples
  • Tests

    • Extensive unit tests covering APIs, validation, registry, and service startup

Add nemo-deployments-plugin as the substrate-agnostic contract layer for
AIRCORE-755: DeploymentConfig/Deployment/Volume entities, v1 CRUD routes,
controller status endpoints, DeploymentBackend ABC, and named executor
registry. Plugin starts with zero backends; reconciler and docker/k8s
backends land in follow-on tickets.

Signed-off-by: Tyler Bray <tbray@nvidia.com>
@tylersbray tylersbray requested review from a team as code owners June 11, 2026 19:14
@github-actions github-actions Bot added the feat label Jun 11, 2026
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py Fixed
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 19018/25211 75.4% 61.4%
Integration Tests 11013/23983 45.9% 20.4%

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c6f2fe1-813a-44ca-b2da-eef9e749acf1

📥 Commits

Reviewing files that changed from the base of the PR and between 994fe24 and f6dbc92.

📒 Files selected for processing (2)
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/base.py
  • plugins/nemo-deployments/tests/unit/test_registry.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/nemo-deployments/tests/unit/test_registry.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/base.py

📝 Walkthrough

Walkthrough

Adds a NeMo Deployments plugin scaffold: types and entity schemas, API v1 CRUD endpoints (deployment-configs, deployments, volumes, status), backend abstraction and executor registry, prerequisite-cycle validation, service wiring, README and comprehensive unit tests.

Changes

NeMo Deployments Plugin

Layer / File(s) Summary
Constants, types, and configuration models
plugins/nemo-deployments/src/nemo_deployments_plugin/constants.py, plugins/nemo-deployments/src/nemo_deployments_plugin/types.py, plugins/nemo-deployments/src/nemo_deployments_plugin/config.py
Defines entity type constants, deployment/volume status literals, Endpoint model, and DeploymentsConfig with executor entries and port-range validation.
Entity and API schemas
plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py, plugins/nemo-deployments/src/nemo_deployments_plugin/schema.py
Defines DeploymentConfig/Deployment/Volume NemoEntity schemas with container/probe/affinity primitives, backend config variants, and OpenAPI request/response/filter models.
Backend abstraction and executor registry
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/base.py, plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
Implements DeploymentBackend ABC, status/log projection types, ExecutorSpec, ExecutorRegistry, BACKEND_CLASSES mapping, unique-name/default validation, and rollback on partial initialization.
Prerequisite validation
plugins/nemo-deployments/src/nemo_deployments_plugin/validation.py
DFS-based prerequisite cycle detection and helpers to extract/build prerequisite maps from existing configs.
API dependencies and auth
plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/dependencies.py
Re-exports entity client dependency and adds require_service_principal guard that rejects non-service principals with HTTP 403 for controller status endpoints.
CRUD endpoints: configs and deployments
plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployment_configs.py, plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployments.py
Implements POST/GET/DELETE for deployment-configs with prerequisite-cycle validation; POST/GET/DELETE for deployments with config existence checks, PENDING init, status_in filtering, and error mapping.
CRUD endpoints: volumes
plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/volumes.py
Implements POST/GET/DELETE for volumes with PENDING status defaults, pagination, and filter support.
Status controller endpoints
plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/status.py
Controller-only PUT endpoints to update deployment and volume status fields (status, message, error, endpoints, exit code), protected by service-principal dependency.
Service registration
plugins/nemo-deployments/src/nemo_deployments_plugin/service.py, plugins/nemo-deployments/pyproject.toml, pyproject.toml (root)
Defines DeploymentsService registering routers under /apis/deployments/v1/workspaces, initializes ExecutorRegistry from DeploymentsConfig on startup, shuts down registry, and wires plugin into root workspace/enablement and package entry point.
Tests: helpers, entities and config
plugins/nemo-deployments/tests/unit/helpers.py, test_entities.py, test_config.py
Provides test factories and verifies entity defaults (Deployment status/desired_state, Volume size), container shape, prerequisite detection, and port-range validation.
Tests: prerequisite validation
plugins/nemo-deployments/tests/unit/test_prerequisite_validation.py
Tests acyclic, self-referential, and multi-node prerequisite cycle detection.
Tests: CRUD APIs
plugins/nemo-deployments/tests/unit/test_api_deployment_configs.py, test_api_deployments.py, test_api_volumes.py
Tests creation/list/delete flows, prerequisite cycles, config lookups, status initialization, status_in filtering, and conflict handling.
Tests: status controller
plugins/nemo-deployments/tests/unit/test_api_status.py
Tests service-principal authorization (rejects user/on-behalf-of with 403) and successful status updates for deployment/volume.
Tests: registry and service
plugins/nemo-deployments/tests/unit/test_registry.py, test_service_startup.py
Tests empty/named/default executor resolution, unknown backend type errors, rollback on partial init, distinct backend instances per config, and service route mounting.
Documentation and pytest config
plugins/nemo-deployments/README.md, pytest.ini
Adds plugin README (scope, API base path, sentinel workspace '-' for bulk queries, follow-on work) and pytest test discovery paths for plugin sources/tests.

Suggested Reviewers

  • mckornfield
  • anastasia-nesterenko
  • gabwow
  • albcui
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title directly summarizes the main change: scaffolding the deployments plugin API and registry, matching the primary changeset focus.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 755-deployments-plugin-scaffold-the-plugin-entities-api-backend/tbray

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (3)
plugins/nemo-deployments/README.md (1)

3-4: 💤 Low value

Clarify "deployment lifecycle" with concrete scope.

Replace the vague term "deployment lifecycle" with specific capabilities: e.g., "Create, read, update, and delete deployment configurations and active deployments; manage volumes; track deployment status."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/README.md` around lines 3 - 4, Update the README
description to replace the vague phrase "deployment lifecycle" with a concrete
list of supported capabilities: state that the plugin provides
create/read/update/delete (CRUD) for deployment configurations and active
deployments, volume management, deployment status tracking, entity schemas, CRUD
APIs, a DeploymentBackend abstract base class, and an executor registry; ensure
the one-line summary in the README (top paragraph) explicitly enumerates these
capabilities so readers immediately see the plugin's scope.

Source: Coding guidelines

plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py (1)

243-247: ⚡ Quick win

No-op validator is misleading.

This @model_validator does not validate anything and always returns self, so it implies an invariant that is not enforced. Convert this to a plain code comment/doc note, or implement an actual check where both Deployment and its DeploymentConfig.restart_policy are available.

Suggested cleanup
-    `@model_validator`(mode="after")
-    def _document_status_restart_policy_consistency(self) -> Deployment:
-        """Document restart_policy vs terminal status expectations for the reconciler."""
-        # Reconciler enforces: Never → SUCCEEDED terminal; Always/OnFailure → READY while running.
-        return self
+    # Restart-policy/terminal-status consistency is enforced by the reconciler
+    # when DeploymentConfig is resolved for this Deployment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py` around
lines 243 - 247, The _document_status_restart_policy_consistency method is a
no-op model_validator that misleads readers; either remove the `@model_validator`
and convert the docstring into a plain class/module comment, or implement an
actual validation that accesses self.status and
self.deployment_config.restart_policy (or DeploymentConfig.restart_policy) and
raises a ValueError/ValidationError when the pair is inconsistent (e.g.,
restart_policy == "Never" requires terminal SUCCEEDED state, restart_policy in
{"Always","OnFailure"} requires non-terminal/READY while running). Update or
remove the decorator accordingly and keep the descriptive text explaining the
reconciler expectations.
plugins/nemo-deployments/tests/unit/test_registry.py (1)

35-53: ⚡ Quick win

Use concrete return types in _StubBackend instead of Any.

These methods can return concrete LogResult/VolumeStatusUpdate types directly; keeping Any here weakens contract checking in tests.

As per coding guidelines, "Always prefer concrete type hints over string-based ones; do not import types under TYPE_CHECKING, instead prefer regular imports when possible."

Proposed diff
 from typing import Any
 
 import pytest
-from nemo_deployments_plugin.backends.abc import BackendStatusUpdate, DeploymentBackend
+from nemo_deployments_plugin.backends.abc import (
+    BackendStatusUpdate,
+    DeploymentBackend,
+    LogResult,
+    VolumeStatusUpdate,
+)
@@
-    async def get_logs(self, **kwargs: Any) -> Any:
-        from nemo_deployments_plugin.backends.abc import LogResult
-
+    async def get_logs(self, **kwargs: Any) -> LogResult:
         return LogResult(lines=[])
@@
-    async def create_volume(self, **kwargs: Any) -> Any:
-        from nemo_deployments_plugin.backends.abc import VolumeStatusUpdate
-
+    async def create_volume(self, **kwargs: Any) -> VolumeStatusUpdate:
         return VolumeStatusUpdate(status="PENDING")
@@
-    async def read_volume_status(self, **kwargs: Any) -> Any:
-        from nemo_deployments_plugin.backends.abc import VolumeStatusUpdate
-
+    async def read_volume_status(self, **kwargs: Any) -> VolumeStatusUpdate:
         return VolumeStatusUpdate(status="BOUND")
@@
-    async def delete_volume(self, workspace: str, name: str) -> Any:
-        from nemo_deployments_plugin.backends.abc import VolumeStatusUpdate
-
+    async def delete_volume(self, workspace: str, name: str) -> VolumeStatusUpdate:
         return VolumeStatusUpdate(status="RELEASED")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/test_registry.py` around lines 35 - 53,
Update _StubBackend methods to use concrete return types instead of Any: change
signatures of get_logs to return nemo_deployments_plugin.backends.abc.LogResult
and create_volume/read_volume_status/delete_volume to return
nemo_deployments_plugin.backends.abc.VolumeStatusUpdate; add regular imports for
LogResult and VolumeStatusUpdate at top of the test file (do not use
TYPE_CHECKING or string types) and return the concrete instances as already
constructed in the method bodies so the type hints match the returned objects.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/nemo-deployments/pyproject.toml`:
- Around line 7-11: The dependencies list in pyproject.toml for the
nemo-deployments plugin is missing fastapi, causing runtime ModuleNotFoundError
when plugin API modules import it; update the [dependencies] block (the
dependencies = [...] list in pyproject.toml for the nemo-deployments package) to
include "fastapi" (optionally with a suitable version constraint, e.g. >=0.95)
so fastapi is installed at runtime alongside "nemo-platform",
"nemo-platform-plugin", and "pydantic>=2.10.6".

In `@plugins/nemo-deployments/README.md`:
- Around line 1-5: Add a top-level "Prerequisites" section before the existing
"# NeMo Deployments Plugin" content that lists required software and setup steps
(e.g., supported Python version, required package manager such as uv, NeMo
Platform runtime availability/configuration, any required environment variables
or credentials, and optional tooling like Docker/uv CLI if applicable); update
README.md to include this section title "Prerequisites" and concise bullet
points so readers know what must be installed or configured before continuing.
- Around line 15-18: Add a "Next Steps" section at the end of
plugins/nemo-deployments/README.md titled "Next Steps" that provides cross-links
to the plugin's API reference, the backend implementation guide (reference
AIRCORE-756, AIRCORE-757, AIRCORE-758), and any other related plugin
documentation; place it immediately after the setup/test commands shown (the uv
sync / pytest block), include one-line descriptions for each link (what the
reader will find there), and ensure links are relative or to the canonical docs
site consistent with other plugin READMEs.
- Around line 1-18: The README mixes reference, how-to, and explanatory content;
split it into a single-quadrant page or reorganize with progressive disclosure.
Choose either: (A) convert this README into a HOW-TO "Set up and test the
plugin" that keeps the uv sync/pytest commands and a brief 30s summary, then
move the API base path and architecture/Backend/DeploymentBackend ABC details
into separate REFERENCE and EXPLANATION pages; or (B) expand this README into
layered sections (Layer 1: 30s summary of purpose/value and intended audience;
Layer 2: 3–5min core concepts and mention of DeploymentBackend and executor
registry; Layer 3: 10min+ API endpoints with the
`/apis/deployments/v1/workspaces/{workspace}/...` example and cross-workspace
sentinel `-`; Layer 4: links to separate reference pages for full API spec and
schema). Update headings accordingly (e.g., "How to set up and test",
"Concepts", "API reference", "Further reading") and move details about the
DeploymentBackend ABC and executor registry into dedicated files or sections to
avoid mixing quadrants.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/dependencies.py`:
- Around line 17-24: The code currently lets X-NMP-Principal-On-Behalf-Of
override X-NMP-Principal-Id via _effective_principal_id, allowing spoofing;
change require_service_principal to read the principal directly from
request.headers.get(_PRINCIPAL_ID_HEADER, "") (or alter _effective_principal_id
to not consult _ON_BEHALF_OF_HEADER) and perform the startswith("service:")
check against that trustworthy value (use the symbols _PRINCIPAL_ID_HEADER,
_ON_BEHALF_OF_HEADER, _effective_principal_id, and require_service_principal to
locate and update the logic).

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployments.py`:
- Around line 140-143: The update call to entity_client.update(deployment) can
raise NemoEntityConflictError during concurrent writes; modify the try/except
around that call to also catch NemoEntityConflictError (in addition to
NemoEntityNotFoundError) and return an HTTP 409 conflict response (matching the
status-update endpoints) with a concise message including the deployment name,
while keeping the existing logger.info for NemoEntityNotFoundError; locate the
entity_client.update(deployment) call and add the new except
NemoEntityConflictError branch to produce the 409 response.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py`:
- Around line 56-62: The registry construction loop in the classmethod that
builds executors is not failure-atomic: if one backend instantiation
(classes[spec.backend](sdk, spec.config)) raises, previously-created backend
instances remain running. Wrap the executor creation loop in a try/except, and
on any exception iterate over the already-created executors (executors.values())
and call their shutdown/close method (e.g., shutdown() or close(), whichever the
backend implements) inside its own try/except to swallow secondary errors, then
re-raise the original exception; keep the existing UnknownBackendTypeError check
and final return of cls(executors, default_executor=default_executor).

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/config.py`:
- Around line 32-33: Add a validation step to the config model that defines
port_range_start and port_range_end to enforce 1 <= port_range_start <= 65535, 1
<= port_range_end <= 65535 and port_range_start <= port_range_end at load time;
implement this as a Pydantic validator (e.g., a `@root_validator` or two
`@validator` methods) on the config class that contains the port_range_start and
port_range_end fields so it raises a clear ValueError when bounds or ordering
are invalid, preventing downstream runtime failures.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/schema.py`:
- Line 77: The restart_policy field in the schema currently uses a loose type
(restart_policy: str | None = None); change it to use the RestartPolicy enum
type (e.g., restart_policy: RestartPolicy | None = None) so only valid policies
are accepted; import or reference the RestartPolicy enum (or class) used
elsewhere in the repo and update any related validation or type annotations in
the same module (schema.py) to use RestartPolicy instead of raw str.

In `@plugins/nemo-deployments/tests/unit/test_api_volumes.py`:
- Around line 32-39: The test should assert the object passed into
mock_entity_client.create to ensure the route sets Volume.status="PENDING"
before persisting; in test_create_volume_201 capture the awaited call to
mock_entity_client.create (e.g., inspect mock_entity_client.create.await_args or
use assert_awaited_once and read call args) and assert the passed volume object
has status == "PENDING" (and optionally name/size match the request) rather than
only asserting the mocked return value.

---

Nitpick comments:
In `@plugins/nemo-deployments/README.md`:
- Around line 3-4: Update the README description to replace the vague phrase
"deployment lifecycle" with a concrete list of supported capabilities: state
that the plugin provides create/read/update/delete (CRUD) for deployment
configurations and active deployments, volume management, deployment status
tracking, entity schemas, CRUD APIs, a DeploymentBackend abstract base class,
and an executor registry; ensure the one-line summary in the README (top
paragraph) explicitly enumerates these capabilities so readers immediately see
the plugin's scope.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py`:
- Around line 243-247: The _document_status_restart_policy_consistency method is
a no-op model_validator that misleads readers; either remove the
`@model_validator` and convert the docstring into a plain class/module comment, or
implement an actual validation that accesses self.status and
self.deployment_config.restart_policy (or DeploymentConfig.restart_policy) and
raises a ValueError/ValidationError when the pair is inconsistent (e.g.,
restart_policy == "Never" requires terminal SUCCEEDED state, restart_policy in
{"Always","OnFailure"} requires non-terminal/READY while running). Update or
remove the decorator accordingly and keep the descriptive text explaining the
reconciler expectations.

In `@plugins/nemo-deployments/tests/unit/test_registry.py`:
- Around line 35-53: Update _StubBackend methods to use concrete return types
instead of Any: change signatures of get_logs to return
nemo_deployments_plugin.backends.abc.LogResult and
create_volume/read_volume_status/delete_volume to return
nemo_deployments_plugin.backends.abc.VolumeStatusUpdate; add regular imports for
LogResult and VolumeStatusUpdate at top of the test file (do not use
TYPE_CHECKING or string types) and return the concrete instances as already
constructed in the method bodies so the type hints match the returned objects.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 13375584-28cd-4917-9bf5-7df481d974cb

📥 Commits

Reviewing files that changed from the base of the PR and between c2dd51d and 212e31e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (25)
  • plugins/nemo-deployments/README.md
  • plugins/nemo-deployments/pyproject.toml
  • plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/dependencies.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployment_configs.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployments.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/status.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/volumes.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/abc.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/config.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/constants.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/entities.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/schema.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/service.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/validation.py
  • plugins/nemo-deployments/tests/unit/helpers.py
  • plugins/nemo-deployments/tests/unit/test_api_deployment_configs.py
  • plugins/nemo-deployments/tests/unit/test_api_deployments.py
  • plugins/nemo-deployments/tests/unit/test_api_status.py
  • plugins/nemo-deployments/tests/unit/test_api_volumes.py
  • plugins/nemo-deployments/tests/unit/test_entities.py
  • plugins/nemo-deployments/tests/unit/test_prerequisite_validation.py
  • plugins/nemo-deployments/tests/unit/test_registry.py
  • plugins/nemo-deployments/tests/unit/test_service_startup.py
  • pyproject.toml

Comment thread plugins/nemo-deployments/pyproject.toml
Comment thread plugins/nemo-deployments/README.md Outdated
Comment on lines +1 to +18
# NeMo Deployments Plugin

Substrate-agnostic deployment lifecycle for the NeMo Platform. This plugin provides
entity schemas, CRUD APIs, a `DeploymentBackend` ABC, and an executor registry.
Backend implementations and the reconcile controller are delivered in follow-on tickets.

## API base path

`/apis/deployments/v1/workspaces/{workspace}/...`

Cross-workspace bulk queries use the entity-store sentinel workspace ``-``:

``GET /apis/deployments/v1/workspaces/-/deployments?status_in=pending,starting``

```bash
uv sync
uv run pytest plugins/nemo-deployments/tests/unit -v
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

README violates Diataxis structure and progressive-disclosure guidelines.

This page mixes three Diataxis quadrants (REFERENCE for API path, HOW-TO for commands, EXPLANATION for plugin overview) into 18 lines. Per guidelines, "Each documentation page should fit ONE Diataxis quadrant; do not mix tutorials with reference tables or how-tos with architecture explanations."

Restructure by either narrowing to one quadrant (e.g., HOW-TO: "Set up and test the plugin") and moving API/architecture details to separate pages, or expanding into a proper multi-section structure with progressive disclosure: Layer 1 (30s: what/who/value) → Layer 2 (3-5min: core concepts and architecture) → Layer 3 (10min+: API endpoints and advanced features) → Layer 4 (separate reference pages for complete specs).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/README.md` around lines 1 - 18, The README mixes
reference, how-to, and explanatory content; split it into a single-quadrant page
or reorganize with progressive disclosure. Choose either: (A) convert this
README into a HOW-TO "Set up and test the plugin" that keeps the uv sync/pytest
commands and a brief 30s summary, then move the API base path and
architecture/Backend/DeploymentBackend ABC details into separate REFERENCE and
EXPLANATION pages; or (B) expand this README into layered sections (Layer 1: 30s
summary of purpose/value and intended audience; Layer 2: 3–5min core concepts
and mention of DeploymentBackend and executor registry; Layer 3: 10min+ API
endpoints with the `/apis/deployments/v1/workspaces/{workspace}/...` example and
cross-workspace sentinel `-`; Layer 4: links to separate reference pages for
full API spec and schema). Update headings accordingly (e.g., "How to set up and
test", "Concepts", "API reference", "Further reading") and move details about
the DeploymentBackend ABC and executor registry into dedicated files or sections
to avoid mixing quadrants.

Source: Coding guidelines

Comment thread plugins/nemo-deployments/README.md
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/dependencies.py Outdated
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/api/v1/deployments.py Outdated
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py Outdated
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/config.py
Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/schema.py Outdated
Comment on lines +32 to +39
def test_create_volume_201(client: TestClient, mock_entity_client: AsyncMock) -> None:
mock_entity_client.create.return_value = make_volume()
resp = client.post(
"/apis/deployments/v1/workspaces/default/volumes",
json={"name": "vol1", "size": "5Gi"},
)
assert resp.status_code == 201
assert resp.json()["status"] == "PENDING"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the object passed to create, not just the mocked response.

This test can still pass if the route stops setting Volume.status="PENDING" before persistence. Assert the awaited create(...) argument to lock the contract.

Suggested patch
 def test_create_volume_201(client: TestClient, mock_entity_client: AsyncMock) -> None:
     mock_entity_client.create.return_value = make_volume()
     resp = client.post(
         "/apis/deployments/v1/workspaces/default/volumes",
         json={"name": "vol1", "size": "5Gi"},
     )
     assert resp.status_code == 201
     assert resp.json()["status"] == "PENDING"
+    mock_entity_client.create.assert_awaited_once()
+    created = mock_entity_client.create.await_args.args[0]
+    assert created.status == "PENDING"
+    assert created.workspace == "default"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_create_volume_201(client: TestClient, mock_entity_client: AsyncMock) -> None:
mock_entity_client.create.return_value = make_volume()
resp = client.post(
"/apis/deployments/v1/workspaces/default/volumes",
json={"name": "vol1", "size": "5Gi"},
)
assert resp.status_code == 201
assert resp.json()["status"] == "PENDING"
def test_create_volume_201(client: TestClient, mock_entity_client: AsyncMock) -> None:
mock_entity_client.create.return_value = make_volume()
resp = client.post(
"/apis/deployments/v1/workspaces/default/volumes",
json={"name": "vol1", "size": "5Gi"},
)
assert resp.status_code == 201
assert resp.json()["status"] == "PENDING"
mock_entity_client.create.assert_awaited_once()
created = mock_entity_client.create.await_args.args[0]
assert created.status == "PENDING"
assert created.workspace == "default"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/test_api_volumes.py` around lines 32 -
39, The test should assert the object passed into mock_entity_client.create to
ensure the route sets Volume.status="PENDING" before persisting; in
test_create_volume_201 capture the awaited call to mock_entity_client.create
(e.g., inspect mock_entity_client.create.await_args or use assert_awaited_once
and read call args) and assert the passed volume object has status == "PENDING"
(and optionally name/size match the request) rather than only asserting the
mocked return value.

Rename backends/abc.py to base.py and extract shared types to break the
stdlib abc shadowing cycle. Fix auth to use Principal-Id only, add registry
rollback and port-range validation, and wire pytest pythonpath for helpers.

Signed-off-by: Tyler Bray <tbray@nvidia.com>
@tylersbray tylersbray changed the title feat(deployments): scaffold plugin entities, API, and executor registry (AIRCORE-755) feat(deployments): scaffold plugin API and registry (AIRCORE-755) Jun 11, 2026
Comment thread plugins/nemo-deployments/tests/unit/test_registry.py Fixed
Use import abc to avoid self-import false positive and fail backends via
init() hook so DeploymentBackend.__init__ is exercised in tests.

Signed-off-by: Tyler Bray <tbray@nvidia.com>
@tylersbray tylersbray requested a review from benmccown June 11, 2026 20:15
tylersbray added a commit that referenced this pull request Jun 12, 2026
Introduce DeploymentsController with deployment/volume reconcilers,
prerequisite gating, drift recovery, and orphan cleanup on top of the
755 plugin scaffold. Stacks on PR #280 (AIRCORE-755).

AIRCORE-758

Signed-off-by: Tyler Bray <tbray@nvidia.com>

@benmccown benmccown left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor stuff that an agent should be able to address without too much issue. Nice work so far.

Substrate-agnostic deployment lifecycle for the NeMo Platform. This plugin provides
entity schemas, CRUD APIs, a `DeploymentBackend` ABC, and an executor registry.

**Scope (this ticket):** scaffold only — entity types, v1 CRUD routes, backend contract,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can probs drop everything from this Scope line through to the end of the file, save for the Tests section.

After the work is done and backends are ready, I'm sure an agent can cook up a good README that's helpful and doesn't have transient info in it.


class CreateDeploymentRequest(BaseModel):
name: str
deployment_config_name: str

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have lifted this one to one from the RFC, but I'm wondering aloud, should we just go with deployment_config so that we could in theory support workspace/name or name (assume same workspace)? Maybe there's a world where we want to support deploying a deployment config from workspace A into workspace B? Just keeps our options open.

endpoints: list[Endpoint] = Field(default_factory=list)
exit_code: int | None = None
error_details: dict[str, Any] | None = None
status_history: list[StatusEvent] | None = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need status_history on the API request? The entity itself should keep a status_history field for us, but it shouldn't be on the requester to preserve/maintain that history field.

class UpdateDeploymentStatusRequest(BaseModel):
status: DeploymentStatus
status_message: str = ""
endpoints: list[Endpoint] = Field(default_factory=list)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does endpoints need to exist on this request object? Wouldn't it be something on the top level Deployment entity?

prerequisites: list[Prerequisite] = Field(default_factory=list)
drift_recovery: DriftRecoveryPolicy = Field(default_factory=DriftRecoveryPolicy, alias="driftRecovery")
labels: dict[str, str] = Field(default_factory=dict)
backend_config: DeploymentBackendConfig = Field(default_factory=DeploymentBackendConfig, alias="backendConfig")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is straight from the RFC, but now I'm that guy and thinking about one alternative out loud:

Wdyt about executor_config? I hadn't thought about the "Executor (instance) vs. Service Backend (class)" disambiguation much when writing the RFC. The class could become DeploymentBackendConfig -> DeploymentExecutorConfig. I feel like it's potentially clearer?

No strong feelings here, we can keep it as is as well.

volumes,
)

prefix = "/v1/workspaces/{workspace}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and all API routes anywhere in the PR) should be /v2/ and not /v1/.

@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think these types could all just go in schemas.py but it doesn't really matter, tomatoe/tomatoe.

default=None,
description="Fallback executor when Deployment.executor is unset.",
)
port_range_start: int = Field(default=9000, description="Default Docker port range start.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the port range start/end go in the docker service/executor config? Cause k8s executors won't use these, right?

from pydantic import BaseModel, Field


class CreateDeploymentConfigRequest(BaseModel):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be in this PR, but we should probs think through Update APIs at some point (which fields are mutable, maybe implement a naive form of entity versioning copying the established pattern from models service).

entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> None:
try:
await entity_client.delete(DeploymentConfig, name=name, workspace=workspace)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we ensure that there are no existing Deployments that reference the config and 4xx with a helpful message if so? Same goes for Volumes (if they're referenced by deployments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants