Skip to content

feat: add topology-aware adaptive controls#282

Draft
teerthsharma wants to merge 16 commits into
NVIDIA:mainfrom
teerthsharma:POC/aether-topological-adaptive-runtime
Draft

feat: add topology-aware adaptive controls#282
teerthsharma wants to merge 16 commits into
NVIDIA:mainfrom
teerthsharma:POC/aether-topological-adaptive-runtime

Conversation

@teerthsharma

@teerthsharma teerthsharma commented Jun 18, 2026

Copy link
Copy Markdown

Overview

This pull request adds optional topology-aware controls to the existing nemo-relay-adaptive runtime. The controls are internal adaptive behavior, not a new public topology package: the separate nemo-relay-adaptive-topology crate has been removed, and the proposed Python nemo_relay.adaptive_topology primitive surface has been removed.

Reviewer-facing design note: .github/designs/topology-aware-adaptive-controls.md.

This design note is intentionally not published in Fern documentation. It records the goals, non-goals, benefit gates, observable signals, sample evidence, and rollout expectations for the internal controls.

  • I confirm this contribution is my own work, or I have the right to submit it under this project's license.
  • I searched existing issues and open pull requests, and this does not duplicate existing work.

Details

This change keeps topology-aware behavior inside the adaptive plugin contract:

  • Adds crates/adaptive/src/topology.rs as an internal module with the bounded primitives needed by the adaptive learners: ConvergenceDetector, DriftDetector, GeometricGovernor, and BettiNumbers.
  • Removes the separate crates/adaptive-topology workspace crate and its Cargo workspace dependency.
  • Removes the proposed public Python topology primitive bindings and their tests. Public bindings now expose only the adaptive configuration knobs already needed for plugin activation.
  • Keeps test bodies outside source files by placing the new internal primitive tests in crates/adaptive/tests/unit/topology_tests.rs.
  • Keeps the reviewer-facing design note outside published docs under .github/designs/.
  • Updates Adaptive docs to avoid linking an internal design document while still documenting when users should enable the controls.

The three measurable controls are:

ACG convergence: stable profiles set StabilityAnalysisResult.converged=true and stop adding observations after at least stability_window epochs.
Tool-plan drift: stored fan-out plans are rebuilt when cohort drift exceeds tool_parallelism.drift.threshold.
Hint governor: learned hints below adaptive_hints.governor.epsilon are omitted, while manual latency-sensitivity overrides still inject hints.

Sample evidence captured in the design note:

ACG observations-to-decision: without=50, with=3
Tool-plan drift: stale fan-out plan invalidates after cohort centroid moves beyond threshold
Hint governor: low-value learned hints are shed while manual overrides are preserved

Local validation:

cargo fmt --all
cargo test -p nemo-relay-adaptive topology -- --nocapture
cargo test -p nemo-relay-adaptive --test topology_convergence -- --nocapture
cargo test -p nemo-relay-adaptive --test tool_parallelism_plan -- --nocapture
cargo test -p nemo-relay-adaptive adaptive_hints_governor -- --nocapture
cargo test -p nemo-relay-adaptive --lib
cargo check -p nemo-relay-python
cargo clippy -p nemo-relay-adaptive --all-targets -- -D warnings
cargo bench -p nemo-relay-adaptive --bench convergence_bench -- --sample-size 10
cd fern && npx fern check --warnings

Validation notes:

uv run pytest python/tests/test_adaptive.py python/tests/test_adaptive_config.py -q

could not start locally because uv failed while removing .venv\lib64 with Windows Access is denied (os error 5) before test collection.

just docs

could not start in this Windows shell because just could not find cygpath to translate the recipe shebang path. Running Fern directly passed fern check --warnings; direct broken-link validation still reports missing ignored generated API reference pages because the docs API-reference generation step could not be run through just docs in this shell.

Where should the reviewer start?

Start with the design and scope reduction:

.github/designs/topology-aware-adaptive-controls.md
Cargo.toml
Cargo.lock
crates/adaptive/Cargo.toml
crates/adaptive/src/topology.rs
crates/adaptive/tests/unit/topology_tests.rs

Then review the runtime integration points:

crates/adaptive/src/acg_learner.rs
crates/adaptive/src/tool_parallelism_learner.rs
crates/adaptive/src/adaptive_hints_intercept.rs
crates/adaptive/src/runtime/validation.rs
crates/adaptive/src/plugin_component.rs
crates/adaptive/tests/integration/topology_convergence_tests.rs
crates/adaptive/tests/unit/tool_parallelism_learner_tests.rs
crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs

Then check binding and docs parity for configuration-only public surfaces:

python/nemo_relay/adaptive.py
python/nemo_relay/adaptive.pyi
crates/node/adaptive.js
crates/node/adaptive.d.ts
go/nemo_relay/adaptive.go
go/nemo_relay/adaptive/adaptive.go
docs/adaptive-plugin/configuration.mdx

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • Relates to: none

Summary by CodeRabbit

  • New Features

    • Added topology-aware adaptive controls for learning convergence, hint load-shedding, and tool-plan drift handling.
    • Expanded configuration helpers and language bindings to support the new adaptive settings.
  • Bug Fixes

    • Improved stability handling so adaptive learning can stop earlier when behavior converges.
    • Prevented stale tool execution plans from persisting after tool patterns change.
    • Preserved manual hint overrides while automatically reducing low-signal hint injection.
  • Documentation

    • Updated adaptive plugin docs with new configuration options, defaults, and usage guidance.

@teerthsharma teerthsharma requested review from a team and lvojtku as code owners June 18, 2026 05:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added size:XXL PR is very large Feature a new feature lang:go PR changes/introduces Go code lang:js PR changes/introduces Javascript/Typescript code lang:python PR changes/introduces Python code lang:rust PR changes/introduces Rust code labels Jun 18, 2026
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 77c45c4b-5893-48b4-a879-6041755d3781

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

Adds topology-aware adaptive controls for convergence, hint load-shedding, and tool-plan drift across the adaptive stack, and updates the supporting config surfaces, runtime wiring, tests, docs, and one pre-commit hook.

Changes

Topology-aware adaptive controls

Layer / File(s) Summary
Topology primitives
crates/adaptive/src/lib.rs, crates/adaptive/src/topology.rs, crates/adaptive/tests/unit/topology_tests.rs
Adds the topology module with BettiNumbers, ConvergenceDetector, DriftDetector, GeometricGovernor, numeric helpers, and unit tests.
Config surface
crates/adaptive/src/config.rs, crates/adaptive/src/lib.rs, crates/node/adaptive.d.ts, crates/node/adaptive.js, crates/node/tests/adaptive_tests.mjs, go/nemo_relay/adaptive.go, go/nemo_relay/adaptive/adaptive.go, go/nemo_relay/adaptive_test.go, python/nemo_relay/adaptive.py, python/nemo_relay/adaptive.pyi, python/tests/test_adaptive*.py, crates/adaptive/tests/unit/config_tests.rs
Adds GovernorConfig, DriftConfig, and ConvergenceConfig across Rust, Node, Go, and Python config surfaces, plus default, serialization, and validation tests.
ACG convergence core
crates/adaptive/src/acg/stability.rs, crates/adaptive/src/acg_learner.rs
Adds StabilityAnalysisResult.converged and makes AcgLearner record, cache, and reuse convergence state per profile.
ACG convergence validation
crates/adaptive/Cargo.toml, crates/adaptive/benches/convergence_bench.rs, crates/adaptive/tests/integration/topology_convergence_tests.rs, crates/adaptive/tests/unit/acg_learner_tests.rs, crates/adaptive/tests/unit/acg/*, crates/adaptive/tests/unit/acg_component_tests.rs
Adds the convergence benchmark and ACG convergence integration/unit tests, with Cargo target wiring and stability fixture updates.
Hint governor and drift invalidation
crates/adaptive/src/adaptive_hints_intercept.rs, crates/adaptive/src/tool_parallelism_learner.rs, crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs, crates/adaptive/tests/unit/tool_parallelism_learner_tests.rs, crates/adaptive/tests/integration/tool_parallelism_plan_tests.rs
Adds governor-gated hint injection and drift-based tool-plan invalidation, with matching tests.
Runtime wiring, validation, and shared test support
crates/adaptive/src/lib.rs, crates/adaptive/src/runtime/*, crates/adaptive/src/plugin_component.rs, crates/adaptive/tests/unit/runtime_*_tests.rs, crates/adaptive/tests/unit/plugin_component_tests.rs, crates/adaptive/tests/unit/intercepts_tests.rs, crates/adaptive/tests/unit/storage_*_tests.rs, crates/adaptive/tests/unit/cache_diagnostics_tests.rs, crates/adaptive/tests/unit/types_tests.rs, crates/adaptive/tests/integration/runtime_integration_tests.rs, crates/python/tests/coverage/py_storage_coverage_tests.rs
Threads the new config sections through runtime feature construction and validation, and updates shared runtime/storage test fixtures and mutex handling.
Documentation and design
crates/adaptive/README.md, .github/designs/topology-aware-adaptive-controls.md, docs/adaptive-plugin/*.mdx
Updates the README, design note, and adaptive plugin docs with topology-aware controls, examples, and validation rules.

Pre-commit ty hook path pattern fix

Layer / File(s) Summary
ty hook exclude patterns
.pre-commit-config.yaml
Updates the ty hook entry in the pre-commit config to use directory-only excludes and force exclusion.

Sequence Diagram(s)

ACG convergence path

sequenceDiagram
  participant AcgLearner
  participant Backend
  participant ConvergenceDetector
  participant HotCache
  AcgLearner->>Backend: load cached stability and observations
  Backend-->>AcgLearner: cached state
  AcgLearner->>ConvergenceDetector: record_epoch(betti, drift, error)
  ConvergenceDetector-->>AcgLearner: stability.converged
  AcgLearner->>Backend: store updated observations and stability
  AcgLearner->>HotCache: update cached stability
Loading

Hint governor path

sequenceDiagram
  participant AdaptiveHintsIntercept
  participant HintGovernor
  participant GeometricGovernor
  AdaptiveHintsIntercept->>HintGovernor: allow(latency_sensitivity)
  HintGovernor->>GeometricGovernor: should_trigger(deviation)
  GeometricGovernor-->>HintGovernor: allow / deny
  HintGovernor-->>AdaptiveHintsIntercept: inject / skip AgentHints
Loading

Tool plan drift path

sequenceDiagram
  participant ToolParallelismLearner
  participant DriftDetector
  participant HotCache
  ToolParallelismLearner->>DriftDetector: update(cohort_feature_vector)
  DriftDetector-->>ToolParallelismLearner: drift decision
  alt drift detected
    ToolParallelismLearner->>HotCache: replace stored plan with empty execution plan
  else drift not detected
    ToolParallelismLearner->>HotCache: load existing execution plan
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title follows Conventional Commits and accurately summarizes the main change.
Description check ✅ Passed The description includes all required sections and provides substantive details, reviewer guidance, and related-issues information.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/adaptive/tests/unit/storage_tests.rs (1)

82-94: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Assert converged in stability round-trip tests.

The fixture now includes converged, but the round-trip test only checks stable_prefix_length and total_observations. Add an assertion for loaded_stability.converged to lock this storage contract, since runtime behavior reads this field.

Proposed test assertion
     assert_eq!(loaded_stability.stable_prefix_length, 1);
     assert_eq!(loaded_stability.total_observations, 3);
+    assert!(!loaded_stability.converged);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/tests/unit/storage_tests.rs` around lines 82 - 94, In the
round-trip test that uses the sample_stability fixture, add an assertion to
verify that loaded_stability.converged equals the value set in the
sample_stability function (which is false). This assertion should be added
alongside the existing assertions for stable_prefix_length and
total_observations to ensure the converged field is properly persisted and
loaded during the serialization round-trip.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/adaptive-topology/src/convergence.rs`:
- Around line 150-156: The stability_window parameter in the new method is only
being clamped to a minimum value using max(MIN_STABILITY_WINDOW), but is not
clamped to a maximum value. Since the RingBuffer instances (betti_history,
drift_history, error_history) have a fixed capacity of MAX_HISTORY (32), if
stability_window exceeds this capacity, methods like is_betti_stable and
is_drift_decreasing will never satisfy their length checks, making
topology-based convergence unreachable. Apply both minimum and maximum clamping
to stability_window by using a method that clamps it between
MIN_STABILITY_WINDOW and MAX_HISTORY.
- Line 166: The `epoch` field is typed as `u32` and the increment operation on
line 166 (self.epoch += 1) can cause integer overflow and wraparound in
long-running detectors, breaking epoch-based gating and diagnostics. Change the
type of the `epoch` field from `u32` to `u64` throughout the convergence module
to prevent wraparound and ensure reliable epoch tracking for the lifetime of the
detector.

In `@crates/adaptive-topology/src/geometry.rs`:
- Around line 252-257: The pruning_ratio method is counting true values across
the entire active_mask array, but only the first self.counts[0] entries
represent actual populated level-0 blocks. When the mask contains true entries
beyond self.counts[0], this causes the calculation to produce invalid ratios.
Restrict the active count calculation to only the first self.counts[0] elements
of active_mask by using slice notation to iterate over
active_mask[..self.counts[0]] instead of the full active_mask.

In `@crates/adaptive-topology/src/manifold.rs`:
- Around line 91-100: The assertion in the `new` method uses `D * tau` as the
validation threshold, but the actual embedding definition only requires `(D - 1)
* tau + 1` samples. Update the assertion logic to use the correct formula `(D -
1) * tau + 1` instead of `D * tau`, and also apply the same correction to the
embed readiness threshold check around lines 121-123 to ensure consistency and
avoid unnecessarily delaying valid embeddings.
- Around line 239-246: The issue in the compute_betti_0 function is that nodes
are not marked as visited when pushed onto the stack, only when popped, causing
the same neighbor to be pushed multiple times. Under dense connectivity, this
wastes the fixed-size stack space and can cause the stack to overflow, skipping
reachable nodes and producing incorrect Betti number calculations. Fix this by
marking a neighbor as visited immediately when it is pushed onto the stack in
the section where stack[stack_top] = neighbor is executed, rather than deferring
the visited marking until pop-time, to prevent duplicate pushes of the same
neighbor.

In `@crates/adaptive/src/acg_learner.rs`:
- Around line 185-208: The issue is that store_stability is being called before
store_observations, which means if observation storage fails after stability is
marked as converged, the next run will skip observation repair permanently. To
fix this, locate all places where store_stability and store_observations are
called together (including the instances at lines 223-229 and 235-237 mentioned
in the comment), and reorder these calls so that store_observations is always
called before store_stability. This ensures that if observation storage fails,
the profile won't yet be marked as converged and can be retried on the next run.

In `@python/nemo_relay/_native.pyi`:
- Around line 1235-1236: The DriftDetector.update method stub currently accepts
Sequence[float] for the centroid parameter, but the native binding requires a
fixed 3-element array. Change the centroid parameter type from Sequence[float]
to a fixed-size type representation (such as a tuple of exactly three floats) to
match the native binding's expectations. Additionally, update the docstring to
explicitly clarify that the centroid must be a 3-dimensional coordinate with
exactly three float values.

---

Outside diff comments:
In `@crates/adaptive/tests/unit/storage_tests.rs`:
- Around line 82-94: In the round-trip test that uses the sample_stability
fixture, add an assertion to verify that loaded_stability.converged equals the
value set in the sample_stability function (which is false). This assertion
should be added alongside the existing assertions for stable_prefix_length and
total_observations to ensure the converged field is properly persisted and
loaded during the serialization round-trip.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8b2fd9ec-2173-410e-a7fa-73fc991c831b

📥 Commits

Reviewing files that changed from the base of the PR and between d5c2407 and da2470a.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (66)
  • .pre-commit-config.yaml
  • ATTRIBUTIONS-Rust.md
  • Cargo.toml
  • crates/adaptive-topology/Cargo.toml
  • crates/adaptive-topology/README.md
  • crates/adaptive-topology/src/convergence.rs
  • crates/adaptive-topology/src/drift.rs
  • crates/adaptive-topology/src/geometry.rs
  • crates/adaptive-topology/src/governor.rs
  • crates/adaptive-topology/src/lib.rs
  • crates/adaptive-topology/src/manifold.rs
  • crates/adaptive-topology/src/topology.rs
  • crates/adaptive/Cargo.toml
  • crates/adaptive/README.md
  • crates/adaptive/benches/convergence_bench.rs
  • crates/adaptive/src/acg/stability.rs
  • crates/adaptive/src/acg_learner.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/config.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/plugin_component.rs
  • crates/adaptive/src/runtime/features.rs
  • crates/adaptive/src/runtime/validation.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/tests/integration/runtime_integration_tests.rs
  • crates/adaptive/tests/integration/tool_parallelism_plan_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/acg/economics_internal_tests.rs
  • crates/adaptive/tests/unit/acg/economics_policy_tests.rs
  • crates/adaptive/tests/unit/acg/multi_breakpoint_tests.rs
  • crates/adaptive/tests/unit/acg_component_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/unit/cache_diagnostics_tests.rs
  • crates/adaptive/tests/unit/config_tests.rs
  • crates/adaptive/tests/unit/intercepts_tests.rs
  • crates/adaptive/tests/unit/plugin_component_tests.rs
  • crates/adaptive/tests/unit/runtime_features_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/storage_memory_internal_tests.rs
  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/tool_parallelism_learner_tests.rs
  • crates/adaptive/tests/unit/types_tests.rs
  • crates/node/adaptive.d.ts
  • crates/node/adaptive.js
  • crates/node/tests/adaptive_tests.mjs
  • crates/python/Cargo.toml
  • crates/python/src/lib.rs
  • crates/python/src/py_adaptive_topology.rs
  • crates/python/tests/coverage/py_storage_coverage_tests.rs
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/adaptive-hints.mdx
  • docs/adaptive-plugin/configuration.mdx
  • go/nemo_relay/adaptive.go
  • go/nemo_relay/adaptive/adaptive.go
  • go/nemo_relay/adaptive_test.go
  • python/nemo_relay/__init__.py
  • python/nemo_relay/__init__.pyi
  • python/nemo_relay/_native.pyi
  • python/nemo_relay/adaptive.py
  • python/nemo_relay/adaptive.pyi
  • python/nemo_relay/adaptive_topology.py
  • python/nemo_relay/adaptive_topology.pyi
  • python/tests/test_adaptive.py
  • python/tests/test_adaptive_config.py
  • python/tests/test_adaptive_topology.py

Comment on lines +150 to +156
pub fn new(epsilon: f64, stability_window: usize) -> Self {
Self {
betti_history: RingBuffer::new(),
drift_history: RingBuffer::new(),
error_history: RingBuffer::new(),
stability_window: stability_window.max(MIN_STABILITY_WINDOW),
epsilon: sanitize_positive(epsilon, DEFAULT_EPSILON),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp stability_window to ring-buffer capacity.

At Line 155, only a minimum clamp is applied. If callers pass a value larger than MAX_HISTORY (32), is_betti_stable and is_drift_decreasing can never satisfy their length checks, so topology-based convergence becomes unreachable.

Proposed fix
-            stability_window: stability_window.max(MIN_STABILITY_WINDOW),
+            stability_window: stability_window.clamp(MIN_STABILITY_WINDOW, MAX_HISTORY),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/convergence.rs` around lines 150 - 156, The
stability_window parameter in the new method is only being clamped to a minimum
value using max(MIN_STABILITY_WINDOW), but is not clamped to a maximum value.
Since the RingBuffer instances (betti_history, drift_history, error_history)
have a fixed capacity of MAX_HISTORY (32), if stability_window exceeds this
capacity, methods like is_betti_stable and is_drift_decreasing will never
satisfy their length checks, making topology-based convergence unreachable.
Apply both minimum and maximum clamping to stability_window by using a method
that clamps it between MIN_STABILITY_WINDOW and MAX_HISTORY.

self.betti_history.push(betti);
self.drift_history.push(sanitize_non_negative(drift));
self.error_history.push(sanitize_non_negative(error));
self.epoch += 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent epoch wraparound in long-running detectors.

Line 166 uses u32 += 1. This can overflow and wrap, which can break downstream epoch-based gating and diagnostics.

Proposed fix
-        self.epoch += 1;
+        self.epoch = self.epoch.saturating_add(1);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/convergence.rs` at line 166, The `epoch` field
is typed as `u32` and the increment operation on line 166 (self.epoch += 1) can
cause integer overflow and wraparound in long-running detectors, breaking
epoch-based gating and diagnostics. Change the type of the `epoch` field from
`u32` to `u64` throughout the convergence module to prevent wraparound and
ensure reliable epoch tracking for the lifetime of the detector.

Comment on lines +252 to +257
pub fn pruning_ratio(&self, active_mask: &[bool; MAX_BLOCKS]) -> f64 {
if self.counts[0] == 0 {
return 0.0;
}
let active = active_mask.iter().filter(|&&x| x).count();
1.0 - (active as f64 / self.counts[0] as f64)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Restrict pruning math to populated level-0 blocks.

At Line 256, active is counted across all MAX_BLOCKS. For caller-provided masks, true entries beyond self.counts[0] can produce invalid ratios (including negatives). Count only the tracked prefix.

Proposed fix
-        let active = active_mask.iter().filter(|&&x| x).count();
+        let active = active_mask[..self.counts[0]]
+            .iter()
+            .filter(|&&x| x)
+            .count();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub fn pruning_ratio(&self, active_mask: &[bool; MAX_BLOCKS]) -> f64 {
if self.counts[0] == 0 {
return 0.0;
}
let active = active_mask.iter().filter(|&&x| x).count();
1.0 - (active as f64 / self.counts[0] as f64)
pub fn pruning_ratio(&self, active_mask: &[bool; MAX_BLOCKS]) -> f64 {
if self.counts[0] == 0 {
return 0.0;
}
let active = active_mask[..self.counts[0]]
.iter()
.filter(|&&x| x)
.count();
1.0 - (active as f64 / self.counts[0] as f64)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/geometry.rs` around lines 252 - 257, The
pruning_ratio method is counting true values across the entire active_mask
array, but only the first self.counts[0] entries represent actual populated
level-0 blocks. When the mask contains true entries beyond self.counts[0], this
causes the calculation to produce invalid ratios. Restrict the active count
calculation to only the first self.counts[0] elements of active_mask by using
slice notation to iterate over active_mask[..self.counts[0]] instead of the full
active_mask.

Comment on lines +91 to +100
/// Panics if `D * tau` exceeds `EMBED_BUFFER_CAPACITY`. This is a
/// programming error, not a runtime failure.
pub fn new(tau: usize) -> Self {
let tau = if tau == 0 { DEFAULT_TAU } else { tau };
assert!(
D * tau <= EMBED_BUFFER_CAPACITY,
"TimeDelayEmbedder D * tau ({}) exceeds EMBED_BUFFER_CAPACITY ({})",
D * tau,
EMBED_BUFFER_CAPACITY
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Embed readiness threshold is too strict for tau > 1.

Line 121 uses D * tau, but the embedding definition in Line 74 only needs samples through x(t-(D-1)tau), i.e. (D - 1) * tau + 1. This delays valid embeddings and over-restricts constructor validation.

Proposed fix
-        assert!(
-            D * tau <= EMBED_BUFFER_CAPACITY,
-            "TimeDelayEmbedder D * tau ({}) exceeds EMBED_BUFFER_CAPACITY ({})",
-            D * tau,
-            EMBED_BUFFER_CAPACITY
-        );
+        let required = if D == 0 {
+            0
+        } else {
+            (D - 1).saturating_mul(tau).saturating_add(1)
+        };
+        assert!(
+            required <= EMBED_BUFFER_CAPACITY,
+            "TimeDelayEmbedder required samples ({}) exceed EMBED_BUFFER_CAPACITY ({})",
+            required,
+            EMBED_BUFFER_CAPACITY
+        );
@@
-        let required = D * self.tau;
+        let required = if D == 0 {
+            0
+        } else {
+            (D - 1).saturating_mul(self.tau).saturating_add(1)
+        };
         if self.buffer_len < required {
             return None;
         }

Also applies to: 121-123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/manifold.rs` around lines 91 - 100, The
assertion in the `new` method uses `D * tau` as the validation threshold, but
the actual embedding definition only requires `(D - 1) * tau + 1` samples.
Update the assertion logic to use the correct formula `(D - 1) * tau + 1`
instead of `D * tau`, and also apply the same correction to the embed readiness
threshold check around lines 121-123 to ensure consistency and avoid
unnecessarily delaying valid embeddings.

Comment thread crates/adaptive-topology/src/manifold.rs Outdated
Comment thread crates/adaptive/src/acg_learner.rs
Comment thread python/nemo_relay/_native.pyi Outdated
@teerthsharma teerthsharma force-pushed the POC/aether-topological-adaptive-runtime branch from da2470a to 17488c8 Compare June 18, 2026 06:35
Signed-off-by: teerth sharma <teerths57@gmail.com>
@teerthsharma teerthsharma force-pushed the POC/aether-topological-adaptive-runtime branch from 17488c8 to 24ed6f2 Compare June 18, 2026 06:36

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/adaptive/tests/unit/storage_tests.rs (1)

243-253: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Assert the new converged field in the stability round-trip test.

Line 244 loads the full stability record, but the test never checks converged. Add an explicit assertion so storage regressions on this field are caught.

Suggested diff
     assert_eq!(loaded_stability.stable_prefix_length, 1);
     assert_eq!(loaded_stability.total_observations, 3);
+    assert!(!loaded_stability.converged);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/tests/unit/storage_tests.rs` around lines 243 - 253, The
stability round-trip test loads the full stability record into the
loaded_stability variable but does not assert the converged field, which could
allow storage regressions on this field to go undetected. Add an explicit
assertion immediately after the existing stability assertions (after the
assertion for loaded_stability.total_observations) to verify the converged field
has the expected value, ensuring the field is properly persisted and retrieved
during the round-trip.
♻️ Duplicate comments (2)
crates/adaptive-topology/src/manifold.rs (2)

91-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the mathematically correct embedding readiness threshold.

Line 91 and Line 121 use D * tau, but the embedding in Line 74 ([x(t), x(t-tau), ..., x(t-(D-1)tau)]) only requires (D - 1) * tau + 1 samples. Current logic rejects valid embeddings for tau > 1.

Proposed fix
     pub fn new(tau: usize) -> Self {
         let tau = if tau == 0 { DEFAULT_TAU } else { tau };
+        let required = if D == 0 {
+            0
+        } else {
+            (D - 1).saturating_mul(tau).saturating_add(1)
+        };
         assert!(
-            D * tau <= EMBED_BUFFER_CAPACITY,
-            "TimeDelayEmbedder D * tau ({}) exceeds EMBED_BUFFER_CAPACITY ({})",
-            D * tau,
+            required <= EMBED_BUFFER_CAPACITY,
+            "TimeDelayEmbedder required samples ({}) exceed EMBED_BUFFER_CAPACITY ({})",
+            required,
             EMBED_BUFFER_CAPACITY
         );
@@
     pub fn embed(&self) -> Option<ManifoldPoint<D>> {
-        let required = D * self.tau;
+        let required = if D == 0 {
+            0
+        } else {
+            (D - 1).saturating_mul(self.tau).saturating_add(1)
+        };
         if self.buffer_len < required {
             return None;
         }

Also applies to: 121-123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/manifold.rs` around lines 91 - 100, The
embedding readiness threshold is mathematically incorrect. The embedding pattern
requires (D - 1) * tau + 1 samples, not D * tau. Update the assertion in the new
method (where tau is validated) to use (D - 1) * tau + 1 instead of D * tau.
Also apply the same fix to line 121-123 which has the same incorrect threshold
check. This correction will allow valid embeddings to proceed instead of
rejecting them unnecessarily when tau is greater than 1.

225-246: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mark vertices visited at push-time in DFS to avoid duplicate stack entries.

In Line 239-Line 246, neighbors are marked visited only after pop. That allows duplicate pushes for the same vertex, which can consume the fixed stack and skip reachable vertices, producing incorrect β₀.

Proposed fix
             components += 1;
             let mut stack = [0usize; MAX_GRAPH_POINTS];
             let mut stack_top = 1;
             stack[0] = start;
+            visited[start] = true;

             while stack_top > 0 {
                 stack_top -= 1;
                 let current = stack[stack_top];

-                if visited[current] {
-                    continue;
-                }
-                visited[current] = true;
-
-                for (neighbor, is_visited) in visited.iter().enumerate().take(self.point_count) {
-                    if !is_visited
-                        && self.are_neighbors(current, neighbor)
-                        && stack_top < MAX_GRAPH_POINTS
-                    {
+                for neighbor in 0..self.point_count {
+                    if !visited[neighbor] && self.are_neighbors(current, neighbor) {
+                        visited[neighbor] = true;
                         stack[stack_top] = neighbor;
                         stack_top += 1;
                     }
                 }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/manifold.rs` around lines 225 - 246, The DFS
traversal marks vertices as visited only after they are popped from the stack
(when visited[current] is set to true), which allows the same vertex to be
pushed onto the stack multiple times before it gets processed. This causes
duplicate stack entries that can overflow the fixed-size stack and skip
reachable vertices, resulting in incorrect component counting. Move the visited
marking to happen when pushing neighbors onto the stack rather than when popping
them. In the neighbor iteration loop where neighbors are added to the stack
(around lines 239-246), set visited[neighbor] = true immediately before pushing
the neighbor onto the stack to prevent duplicate pushes of the same vertex.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/adaptive-topology/README.md`:
- Around line 33-35: The code example block in the README.md file starts
immediately after the "## Example" heading without a complete-sentence
introduction. Add a descriptive sentence between the heading and the opening
triple backticks that explains what the code example demonstrates, following the
documentation style guideline that requires every code block to be introduced
with a complete sentence.

In `@crates/adaptive-topology/src/convergence.rs`:
- Around line 237-239: In the betti_score calculation within the
convergence_score method, the variations count is being normalized by
self.stability_window when it should be normalized by the actual number of
transitions. Since variations are counted from window[..count].windows(2), which
produces count - 1 adjacent pairs, change the denominator in the betti_score
assignment from self.stability_window to (count - 1) to correctly normalize the
variation ratio.

In `@crates/adaptive-topology/src/drift.rs`:
- Around line 40-60: Add a validation check at the beginning of the update
method to ensure all values in the centroid array are finite (not NaN or Inf).
If any non-finite value is detected, return early with a drift value of 0.0 to
prevent non-finite values from being written into the internal state fields
(previous, expected, velocity). This guards against NaN values persisting in the
drift and velocity calculations, which would disable meaningful drift checks.

In `@crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs`:
- Around line 361-362: The current assertion only verifies that the agent hints
header is absent from the request, but it does not verify that the agent hints
data is also absent from the request body. To harden the test, add a negative
assertion that checks the request body does not contain the nvext.agent_hints
field after the existing header check on line 361. This ensures that if a
regression accidentally injects nvext.agent_hints in the body while skipping the
header, the test will catch it and fail as expected.

In `@crates/adaptive/tests/unit/runtime_tests.rs`:
- Around line 322-375: The test
validate_config_reports_invalid_topology_numeric_fields should verify
field-level diagnostics for all invalid fields within components, not just
component-level diagnostics. Currently it checks for component paths like
"adaptive_hints.governor" and "acg.convergence", but it should also assert that
diagnostics exist for specific invalid fields within those components. For the
acg.convergence and convergence components which each have multiple invalid
fields (epsilon and stability_window), add additional assertions to the
report.diagnostics iteration to verify that field-level diagnostics are reported
for paths like "acg.convergence.epsilon", "acg.convergence.stability_window",
"convergence.epsilon", and "convergence.stability_window" to ensure all invalid
numeric fields are being validated and prevent silent validator regressions.

In `@docs/adaptive-plugin/about.mdx`:
- Around line 39-40: The documentation uses inconsistent terminology for the
same concept: "topology-aware" appears in one location while "topology-inspired"
appears in another location (in the line mentioning "topology-inspired
signals"). Standardize on a single term throughout the document by identifying
all instances of both "topology-aware" and "topology-inspired" and replacing
them with one consistent term. Ensure the chosen term is applied uniformly
across the entire file to maintain clarity and avoid ambiguity for readers.

In `@docs/adaptive-plugin/configuration.mdx`:
- Around line 221-233: The Rust example for tool_parallelism configuration uses
ToolParallelismComponentConfig::default() implicitly, while the TOML, Python,
and Node examples all explicitly demonstrate the drift configuration. Update the
line setting adaptive.tool_parallelism to explicitly define the drift field
configuration instead of relying on the default() method, ensuring the Rust
example maintains parity with the other language examples in the documentation.

---

Outside diff comments:
In `@crates/adaptive/tests/unit/storage_tests.rs`:
- Around line 243-253: The stability round-trip test loads the full stability
record into the loaded_stability variable but does not assert the converged
field, which could allow storage regressions on this field to go undetected. Add
an explicit assertion immediately after the existing stability assertions (after
the assertion for loaded_stability.total_observations) to verify the converged
field has the expected value, ensuring the field is properly persisted and
retrieved during the round-trip.

---

Duplicate comments:
In `@crates/adaptive-topology/src/manifold.rs`:
- Around line 91-100: The embedding readiness threshold is mathematically
incorrect. The embedding pattern requires (D - 1) * tau + 1 samples, not D *
tau. Update the assertion in the new method (where tau is validated) to use (D -
1) * tau + 1 instead of D * tau. Also apply the same fix to line 121-123 which
has the same incorrect threshold check. This correction will allow valid
embeddings to proceed instead of rejecting them unnecessarily when tau is
greater than 1.
- Around line 225-246: The DFS traversal marks vertices as visited only after
they are popped from the stack (when visited[current] is set to true), which
allows the same vertex to be pushed onto the stack multiple times before it gets
processed. This causes duplicate stack entries that can overflow the fixed-size
stack and skip reachable vertices, resulting in incorrect component counting.
Move the visited marking to happen when pushing neighbors onto the stack rather
than when popping them. In the neighbor iteration loop where neighbors are added
to the stack (around lines 239-246), set visited[neighbor] = true immediately
before pushing the neighbor onto the stack to prevent duplicate pushes of the
same vertex.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ccb1a467-92d2-498a-ae76-c078c7204100

📥 Commits

Reviewing files that changed from the base of the PR and between da2470a and 24ed6f2.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (66)
  • .pre-commit-config.yaml
  • ATTRIBUTIONS-Rust.md
  • Cargo.toml
  • crates/adaptive-topology/Cargo.toml
  • crates/adaptive-topology/README.md
  • crates/adaptive-topology/src/convergence.rs
  • crates/adaptive-topology/src/drift.rs
  • crates/adaptive-topology/src/geometry.rs
  • crates/adaptive-topology/src/governor.rs
  • crates/adaptive-topology/src/lib.rs
  • crates/adaptive-topology/src/manifold.rs
  • crates/adaptive-topology/src/topology.rs
  • crates/adaptive/Cargo.toml
  • crates/adaptive/README.md
  • crates/adaptive/benches/convergence_bench.rs
  • crates/adaptive/src/acg/stability.rs
  • crates/adaptive/src/acg_learner.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/config.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/plugin_component.rs
  • crates/adaptive/src/runtime/features.rs
  • crates/adaptive/src/runtime/validation.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/tests/integration/runtime_integration_tests.rs
  • crates/adaptive/tests/integration/tool_parallelism_plan_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/acg/economics_internal_tests.rs
  • crates/adaptive/tests/unit/acg/economics_policy_tests.rs
  • crates/adaptive/tests/unit/acg/multi_breakpoint_tests.rs
  • crates/adaptive/tests/unit/acg_component_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/unit/cache_diagnostics_tests.rs
  • crates/adaptive/tests/unit/config_tests.rs
  • crates/adaptive/tests/unit/intercepts_tests.rs
  • crates/adaptive/tests/unit/plugin_component_tests.rs
  • crates/adaptive/tests/unit/runtime_features_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/storage_memory_internal_tests.rs
  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/tool_parallelism_learner_tests.rs
  • crates/adaptive/tests/unit/types_tests.rs
  • crates/node/adaptive.d.ts
  • crates/node/adaptive.js
  • crates/node/tests/adaptive_tests.mjs
  • crates/python/Cargo.toml
  • crates/python/src/lib.rs
  • crates/python/src/py_adaptive_topology.rs
  • crates/python/tests/coverage/py_storage_coverage_tests.rs
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/adaptive-hints.mdx
  • docs/adaptive-plugin/configuration.mdx
  • go/nemo_relay/adaptive.go
  • go/nemo_relay/adaptive/adaptive.go
  • go/nemo_relay/adaptive_test.go
  • python/nemo_relay/__init__.py
  • python/nemo_relay/__init__.pyi
  • python/nemo_relay/_native.pyi
  • python/nemo_relay/adaptive.py
  • python/nemo_relay/adaptive.pyi
  • python/nemo_relay/adaptive_topology.py
  • python/nemo_relay/adaptive_topology.pyi
  • python/tests/test_adaptive.py
  • python/tests/test_adaptive_config.py
  • python/tests/test_adaptive_topology.py

Comment thread crates/adaptive-topology/README.md Outdated
Comment on lines +237 to +239
let variations = window[..count].windows(2).filter(|w| w[0] != w[1]).count();
let betti_score = 1.0 - (variations as f64 / self.stability_window as f64);
score += BETTI_STABILITY_WEIGHT * betti_score;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize Betti variation by transition count, not window length.

Line [238] divides by self.stability_window, but variations are counted across adjacent pairs (count - 1 transitions). This systematically overstates Betti stability in convergence_score.

Suggested fix
 let variations = window[..count].windows(2).filter(|w| w[0] != w[1]).count();
-let betti_score = 1.0 - (variations as f64 / self.stability_window as f64);
+let transitions = count.saturating_sub(1).max(1);
+let betti_score = 1.0 - (variations as f64 / transitions as f64);
 score += BETTI_STABILITY_WEIGHT * betti_score;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/convergence.rs` around lines 237 - 239, In the
betti_score calculation within the convergence_score method, the variations
count is being normalized by self.stability_window when it should be normalized
by the actual number of transitions. Since variations are counted from
window[..count].windows(2), which produces count - 1 adjacent pairs, change the
denominator in the betti_score assignment from self.stability_window to (count -
1) to correctly normalize the variation ratio.

Comment thread crates/adaptive-topology/src/drift.rs Outdated
Comment on lines +40 to +60
pub fn update(&mut self, centroid: &[f64; D]) -> f64 {
let drift = if self.has_previous {
l2_distance(&self.expected, centroid)
} else {
0.0
};

if self.has_previous {
for (d, vel) in self.velocity.iter_mut().enumerate().take(D) {
*vel = centroid[d] - self.previous[d];
}
}

for (d, exp) in self.expected.iter_mut().enumerate().take(D) {
*exp = centroid[d] + self.velocity[d];
}

self.previous = *centroid;
self.has_previous = true;

drift

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard update against non-finite centroids to prevent persistent NaN state.

At Line [40], non-finite centroid values are accepted and then written into previous/expected/velocity. After that, drift and velocity calculations can stay NaN and effectively disable meaningful drift checks.

Suggested fix
 pub fn update(&mut self, centroid: &[f64; D]) -> f64 {
+    if centroid.iter().any(|coord| !coord.is_finite()) {
+        self.reset();
+        return f64::INFINITY;
+    }
+
     let drift = if self.has_previous {
         l2_distance(&self.expected, centroid)
     } else {
         0.0
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive-topology/src/drift.rs` around lines 40 - 60, Add a validation
check at the beginning of the update method to ensure all values in the centroid
array are finite (not NaN or Inf). If any non-finite value is detected, return
early with a drift value of 0.0 to prevent non-finite values from being written
into the internal state fields (previous, expected, velocity). This guards
against NaN values persisting in the drift and velocity calculations, which
would disable meaningful drift checks.

Comment thread crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
Comment thread crates/adaptive/tests/unit/runtime_tests.rs
Comment thread docs/adaptive-plugin/about.mdx Outdated
Comment thread docs/adaptive-plugin/configuration.mdx Outdated
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Teerth Sharma <teerths57@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/adaptive-plugin/about.mdx`:
- Around line 39-40: The file docs/adaptive-plugin/about.mdx is missing the
required SPDX license header at the top. Add the SPDX header using JSX comment
delimiters {/* ... */} at the very beginning of the file before the YAML
frontmatter. The header should include the SPDX-FileCopyrightText line with the
copyright notice and the SPDX-License-Identifier line set to Apache-2.0, wrapped
in the JSX comment syntax as specified in the review comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 289cbccd-452e-4a63-9da2-9d593edfa1a4

📥 Commits

Reviewing files that changed from the base of the PR and between 24ed6f2 and 6fc3d5e.

📒 Files selected for processing (1)
  • docs/adaptive-plugin/about.mdx
📜 Review details
🧰 Additional context used
📓 Path-based instructions (12)
{docs/**,README.md,CONTRIBUTING.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

{docs/**,README.md,CONTRIBUTING.md}: For docs-only changes, run targeted checks only if commands, package names, or examples changed. Use just docs for docs-site builds and just docs-linkcheck when links changed
Run docs site build with just docs

Files:

  • docs/adaptive-plugin/about.mdx
{docs/**,README.md,CONTRIBUTING.md,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Run docs link validation with just docs-linkcheck when links change

Files:

  • docs/adaptive-plugin/about.mdx
{docs/**,README.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Verify README and docs entry points still match current package names and paths for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
{docs/**,examples/**,README.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Verify examples still run with documented commands for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
{docs/**,README.md,**/Cargo.toml,**/package.json,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Ensure renamed public surfaces are reflected consistently in manifests and docs for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
**/*.{md,mdx,py,sh,yaml,yml,toml,json}

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

Keep package names, repo references, and build commands current

Files:

  • docs/adaptive-plugin/about.mdx
**/*.mdx

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

In MDX files, top-of-file comments must use JSX comment delimiters: {/* to open and */} to close. Do not use HTML comments for MDX SPDX headers.

MDX top-of-file SPDX comments must use {/* ... */} delimiters instead of HTML comment delimiters (Must-Fix)

Files:

  • docs/adaptive-plugin/about.mdx
**/*.{html,md,mdx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Include SPDX license header in HTML and Markdown files using HTML comment syntax

Files:

  • docs/adaptive-plugin/about.mdx
docs/**/*.{md,mdx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Update embedded documentation snippets, patch docs, and binding-support notes if examples or supported bindings changed

Files:

  • docs/adaptive-plugin/about.mdx
docs/**

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Run just docs or ./scripts/build-docs.sh html to regenerate ignored Fern API reference pages before validation for documentation site changes

Files:

  • docs/adaptive-plugin/about.mdx
{docs/**,README.md,CONTRIBUTING.md,RELEASING.md,SECURITY.md}

⚙️ CodeRabbit configuration file

{docs/**,README.md,CONTRIBUTING.md,RELEASING.md,SECURITY.md}: Review documentation for technical accuracy against the current API, command correctness, and consistency across language bindings.
Flag stale examples, missing SPDX headers where required, and instructions that no longer match CI or pre-commit behavior.

Files:

  • docs/adaptive-plugin/about.mdx
**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.

Project Overview

NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.

The shared runtime model is:

  1. Scope stacks decide where work belongs and which scope-local behavior is visible.
  2. Middleware registries decide what guardrails and intercepts run around managed calls.
  3. Plugins install reusable runtime behavior from configuration.
  4. Events record runtime behavior in ATOF form.
  5. Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.

Repository Structure

The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.

crates/
  core/       # Rust core runtime crate, published as nemo-relay
  adaptive/   # Adaptive runtime primitives and plugin components
  python/     # PyO3 native extension for the Python package
  ffi/        # Raw C ABI layer used by downstream bindings such as Go
  node/       # NAPI Node.js binding and JavaScript/TypeScript entry points
  wasm/       # wasm-bindgen WebAssembly binding and JS wrappers
python/
  nemo_relay/  # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers
  tests/      # Python tests
go/
  nemo_relay/  # Experimental Go CGo binding and tests
fern/         # Fern documentation site
scripts/      # Stable wrappers and helper scripts; build/test/docs entry points live in justfile
third_party/  # P...

Files:

  • docs/adaptive-plugin/about.mdx
🔇 Additional comments (2)
docs/adaptive-plugin/about.mdx (2)

39-40: Terminology consistency already addressed.

The past review flagged inconsistency between "topology-aware" and "topology-inspired" at line 39. The current code correctly uses "topology-aware signals stabilize," which aligns with terminology used earlier in the document and throughout the adaptive plugin docs. This concern has been resolved.


39-40: New documentation bullets are technically accurate and well-motivated.

  • Line 39: "Stop adaptive learning once observed topology-aware signals stabilize" directly reflects convergence detection via Betti-number stability (convergence detector in PR).
  • Line 40: "Invalidate stale tool parallelism plans when cohort shape drifts" aligns with drift-based plan invalidation behavior implemented in tool-parallelism learner.

Both bullets are clear, actionable, and grounded in the topology-aware adaptive feature set.

Comment thread docs/adaptive-plugin/about.mdx

@willkill07 willkill07 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a design document and clear measurable and observable improvements, I am extremely reluctant to spend the time to review this PR.

Additionally:

  • no tests should be embedded in source files
  • no new crates should be added if the behavior is internal only (which it sounds like adaptive-topology is

Signed-off-by: Teerth Sharma <teerths57@gmail.com>
@teerthsharma

teerthsharma commented Jun 22, 2026

Copy link
Copy Markdown
Author

Addressed the review concerns in 0f56103822edad153910d83fa463dafcae13a9c1, ad6b6d06920ddb98bdaec69902ad33ef07cc3496, and 1036c12d8d5ae3b802fe29d39b70465bacf458ef:

  • added the reviewer-facing design doc at docs/adaptive-plugin/topology-aware-controls-design.mdx with measurable outcomes and observable signals
  • folded the topology controls into nemo-relay-adaptive instead of adding a new crate
  • removed the proposed public Python topology primitive surface
  • moved the new primitive tests out of source files into crates/adaptive/tests/unit/topology_tests.rs
  • hardened the hint-governor test to assert both header and body omission when hints are shed
  • applied the MDX SPDX placement requested by the review bot for docs/adaptive-plugin/about.mdx
  • tightened internal convergence so it requires a full stable window across Betti signatures, finite non-increasing drift, and low error; also clamps the stability window and resets drift state on non-finite centroids

Validation run locally:

  • cargo fmt --all
  • cargo test -p nemo-relay-adaptive topology -- --nocapture
  • cargo test -p nemo-relay-adaptive --test topology_convergence -- --nocapture
  • cargo test -p nemo-relay-adaptive --test tool_parallelism_plan -- --nocapture
  • cargo test -p nemo-relay-adaptive adaptive_hints_governor -- --nocapture
  • cargo test -p nemo-relay-adaptive --lib (484 tests after the convergence fix)
  • cargo check -p nemo-relay-python
  • cargo clippy -p nemo-relay-adaptive --all-targets -- -D warnings
  • cd fern && npx fern check --warnings (passes with the expected unauthenticated redirects warning)

uv run pytest python/tests/test_adaptive.py python/tests/test_adaptive_config.py -q is still blocked locally before collection by .venv\lib64 access denied, and just docs is blocked in this Windows shell because the recipe cannot find cygpath. The direct Fern check above passed.

Signed-off-by: Teerth Sharma <teerths57@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/adaptive/src/tool_parallelism_learner.rs (1)

73-75: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drift invalidation should use an inclusive threshold.

At Line 74, using > misses invalidation when drift is exactly at threshold.

Suggested fix
-        Ok(drift > config.threshold)
+        Ok(drift >= config.threshold)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/src/tool_parallelism_learner.rs` around lines 73 - 75, In the
return statement within the method that updates drift using
detector.update(&centroid), change the comparison operator from > to >= when
comparing drift against config.threshold. This ensures that invalidation occurs
when drift equals the threshold value exactly, not just when it exceeds it.
Update the condition Ok(drift > config.threshold) to use the inclusive >=
operator instead.
♻️ Duplicate comments (1)
crates/adaptive/src/acg_learner.rs (1)

186-206: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persisting stability before observations can strand profiles in a false converged state.

At Line 227, converged stability is stored before observations (Line 235). If observation storage fails, the next run can hit the converged fast-path at Line 186 and skip repair permanently.

Suggested fix
-                if existing_stability
+                if existing_stability
                     .as_ref()
                     .map(|stability| stability.converged)
                     .unwrap_or(false)
+                    && existing.is_some()
                 {
@@
-                if self.record_stability_epoch(&profile_key, &stability_result)? {
-                    stability_result.converged = true;
-                }
-
-                backend
-                    .store_stability(&profile_key, &stability_result)
-                    .await?;
+                let converged_now = self.record_stability_epoch(&profile_key, &stability_result)?;
@@
                 backend
                     .store_observations(&profile_key, &observations_vec)
                     .await?;
+
+                if converged_now {
+                    stability_result.converged = true;
+                }
+                backend
+                    .store_stability(&profile_key, &stability_result)
+                    .await?;

Also applies to: 226-236

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/src/acg_learner.rs` around lines 186 - 206, The issue is that
stability state is being persisted before observations are persisted, creating a
race condition where if observation persistence fails, the converged flag in
stability remains true, causing the fast-path at the beginning of the function
(checking if existing_stability is converged) to incorrectly skip repair on the
next run. Fix this by reversing the order of persistence operations so that
observations are stored first before stability is persisted, ensuring that if
observation storage fails, the converged state remains false and repair will be
attempted on the next execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/adaptive/src/topology.rs`:
- Around line 135-147: The convergence check in the is_converged method
currently uses a short-circuit OR operator that can declare convergence based on
insufficient evidence, and the is_error_converged method only checks if the last
single error value is below epsilon rather than verifying stability over a
configured window. To fix this, modify is_error_converged to check that the last
N error samples in error_history are all consistently below epsilon threshold to
guarantee convergence over a window period, rather than checking only the latest
point. Additionally, change the logic operator in is_converged from OR to AND to
ensure that error convergence, betti stability, and drift decrease all must be
satisfied together before declaring the topology converged.

In `@crates/adaptive/tests/unit/topology_tests.rs`:
- Around line 27-33: The test function error_below_epsilon_converges expects
convergence after only a single low-error epoch, which may not align with the
intended window-based stability check. If the ConvergenceDetector requires a
stability window (indicated by the 3 parameter in ConvergenceDetector::new(0.01,
3)), then the test needs to record multiple consecutive epochs with low error
values before asserting is_converged() returns true. Either call record_epoch
multiple times with the same low-error pattern to fill the window, or change the
assertion to expect false convergence after a single epoch to properly test
window-based behavior.

---

Outside diff comments:
In `@crates/adaptive/src/tool_parallelism_learner.rs`:
- Around line 73-75: In the return statement within the method that updates
drift using detector.update(&centroid), change the comparison operator from > to
>= when comparing drift against config.threshold. This ensures that invalidation
occurs when drift equals the threshold value exactly, not just when it exceeds
it. Update the condition Ok(drift > config.threshold) to use the inclusive >=
operator instead.

---

Duplicate comments:
In `@crates/adaptive/src/acg_learner.rs`:
- Around line 186-206: The issue is that stability state is being persisted
before observations are persisted, creating a race condition where if
observation persistence fails, the converged flag in stability remains true,
causing the fast-path at the beginning of the function (checking if
existing_stability is converged) to incorrectly skip repair on the next run. Fix
this by reversing the order of persistence operations so that observations are
stored first before stability is persisted, ensuring that if observation storage
fails, the converged state remains false and repair will be attempted on the
next execution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2c6e25e2-9b5b-49f7-8fac-0fcddc8494b7

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc3d5e and 0f56103.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • crates/adaptive/Cargo.toml
  • crates/adaptive/src/acg_learner.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
💤 Files with no reviewable changes (1)
  • crates/adaptive/Cargo.toml
📜 Review details
🧰 Additional context used
📓 Path-based instructions (26)
{docs/**,README.md,CONTRIBUTING.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

{docs/**,README.md,CONTRIBUTING.md}: For docs-only changes, run targeted checks only if commands, package names, or examples changed. Use just docs for docs-site builds and just docs-linkcheck when links changed
Run docs site build with just docs

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
{docs/**,README.md,CONTRIBUTING.md,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Run docs link validation with just docs-linkcheck when links change

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
{docs/**,README.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Verify README and docs entry points still match current package names and paths for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
{docs/**,examples/**,README.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Verify examples still run with documented commands for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
{docs/**,README.md,**/Cargo.toml,**/package.json,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Ensure renamed public surfaces are reflected consistently in manifests and docs for large or public-facing changes

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
**/*.{md,mdx,py,sh,yaml,yml,toml,json}

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

Keep package names, repo references, and build commands current

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
**/*.mdx

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

In MDX files, top-of-file comments must use JSX comment delimiters: {/* to open and */} to close. Do not use HTML comments for MDX SPDX headers.

MDX top-of-file SPDX comments must use {/* ... */} delimiters instead of HTML comment delimiters (Must-Fix)

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
**/*.{html,md,mdx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Include SPDX license header in HTML and Markdown files using HTML comment syntax

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
docs/**/*.{md,mdx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Update embedded documentation snippets, patch docs, and binding-support notes if examples or supported bindings changed

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
docs/**

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Run just docs or ./scripts/build-docs.sh html to regenerate ignored Fern API reference pages before validation for documentation site changes

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
{docs/**,README.md,CONTRIBUTING.md,RELEASING.md,SECURITY.md}

⚙️ CodeRabbit configuration file

{docs/**,README.md,CONTRIBUTING.md,RELEASING.md,SECURITY.md}: Review documentation for technical accuracy against the current API, command correctness, and consistency across language bindings.
Flag stale examples, missing SPDX headers where required, and instructions that no longer match CI or pre-commit behavior.

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.

Project Overview

NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.

The shared runtime model is:

  1. Scope stacks decide where work belongs and which scope-local behavior is visible.
  2. Middleware registries decide what guardrails and intercepts run around managed calls.
  3. Plugins install reusable runtime behavior from configuration.
  4. Events record runtime behavior in ATOF form.
  5. Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.

Repository Structure

The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.

crates/
  core/       # Rust core runtime crate, published as nemo-relay
  adaptive/   # Adaptive runtime primitives and plugin components
  python/     # PyO3 native extension for the Python package
  ffi/        # Raw C ABI layer used by downstream bindings such as Go
  node/       # NAPI Node.js binding and JavaScript/TypeScript entry points
  wasm/       # wasm-bindgen WebAssembly binding and JS wrappers
python/
  nemo_relay/  # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers
  tests/      # Python tests
go/
  nemo_relay/  # Experimental Go CGo binding and tests
fern/         # Fern documentation site
scripts/      # Stable wrappers and helper scripts; build/test/docs entry points live in justfile
third_party/  # P...

Files:

  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/topology-aware-controls-design.mdx
  • crates/adaptive/tests/unit/topology_tests.rs
  • docs/adaptive-plugin/acg.mdx
  • docs/adaptive-plugin/configuration.mdx
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/*.rs

📄 CodeRabbit inference engine (.agents/skills/add-binding-feature/SKILL.md)

Use snake_case naming convention for Rust identifiers (e.g., nemo_relay_tool_call)

**/*.rs: Any Rust change must run just test-rust
Any Rust change must run cargo fmt --all
Any Rust change must run cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Run cargo fmt --all for all FFI work since it is Rust work
Run just test-rust to validate FFI changes
Run cargo clippy --workspace --all-targets -- -D warnings to enforce strict linting on FFI work

When Rust files changed as part of Go work, also run cargo fmt --all, just test-rust, and cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Run cargo fmt --all when Rust files are changed as part of Node work
Run cargo clippy --workspace --all-targets -- -D warnings when Rust files are changed as part of Node work
Run just test-rust when Rust files are changed as part of Node work

**/*.rs: Run cargo fmt --all to format all Rust code
Run cargo clippy --workspace --all-targets -- -D warnings to enforce all clippy lints as errors

**/*.rs: Run cargo fmt --all when Rust files changed as part of WebAssembly work
Run cargo clippy --workspace --all-targets -- -D warnings when Rust files changed as part of WebAssembly work

**/*.rs: If any Rust code changed, always run just test-rust
If any Rust code changed, also run cargo fmt --all
If any Rust code changed, also run cargo clippy --workspace --all-targets -- -D warnings
Run Rust formatting with cargo fmt --all
Run Rust linting with cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Use cargo fmt for Rust code formatting
Run cargo clippy -- -D warnings to lint Rust code and treat all warnings as errors
Use Rust snake_case naming convention for Rust identifiers
Include SPDX license header in all Rust source files using double-slash comment syntax
Validate Rust code with uv run pre-commit run --all-files to enforce cargo fmt formatting check, cargo clippy lints, and cargo deny aud...

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**,python/nemo_relay/adaptive.py,python/nemo_relay/plugin.py,go/nemo_relay/adaptive/**,go/nemo_relay/!(adaptive)/**,**/node/**,**/wasm/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

Keep adaptive surface in sync across crates/adaptive, shared plugin behavior in core and bindings, Python adaptive/plugin wrappers in python/nemo_relay/adaptive.py and python/nemo_relay/plugin.py, Go adaptive helpers under go/nemo_relay/adaptive plus shared plugin helpers in go/nemo_relay, and Node/WebAssembly adaptive helpers and plugin wrappers

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**,python/nemo_relay/plugin.py,go/nemo_relay/**,**/node/**,**/wasm/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

{crates/adaptive/**,python/nemo_relay/plugin.py,go/nemo_relay/**,**/node/**,**/wasm/**}: Maintain consistent plugin lifecycle across all language bindings (Python, Go, Node/WebAssembly, and Rust)
Keep plugin context surfaces aligned across all language implementations

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**/*.rs,**/*test*.{rs,py,go,ts,js},**/*adaptive*test*.{rs,py,go,ts,js},docs/plugins/adaptive/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

Maintain documented and tested validation and report behavior for adaptive surfaces

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/{Cargo.toml,**/*.rs}

📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)

Maintain consistency between Rust package names in Cargo.toml and their actual usage across the codebase

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{h,hpp,c,cpp,rs}

📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)

Ensure FFI header and library naming follows consistent conventions across platform-specific builds

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
{crates/core,crates/adaptive}/**/*

📄 CodeRabbit inference engine (.agents/skills/prepare-pr/SKILL.md)

Changes to crates/core or crates/adaptive must run the full language matrix

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{rs,toml}

📄 CodeRabbit inference engine (.agents/skills/rename-surfaces/SKILL.md)

Update Rust crate names and module prefixes during coordinated rename operations

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
crates/{core,adaptive}/**

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

If crates/core or crates/adaptive changed, run the full matrix across Rust, Python, Go, Node.js, and WebAssembly

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{rs,py,js,ts,tsx,jsx,go,sh,toml,yaml,yml,md}

📄 CodeRabbit inference engine (AGENTS.md)

Keep SPDX headers on source, docs, scripts, and configuration files. The project is Apache-2.0.

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{rs,py,go,js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Follow binding naming conventions: Rust and Python use snake_case, C FFI exports prefixed nemo_relay_, Go uses PascalCase for public APIs, Node.js uses camelCase.

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
crates/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

crates/**/*.rs: Keep async behavior on the existing tokio-based model. Bindings should preserve callback and future lifetimes rather than blocking or hiding async work unexpectedly.
Use Json = serde_json::Value in Rust-facing runtime APIs for JSON payload handling.

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
crates/{core,adaptive}/**/*.rs

⚙️ CodeRabbit configuration file

crates/{core,adaptive}/**/*.rs: Review the Rust runtime for async correctness, scope isolation, middleware ordering, and event lifecycle regressions.
Pay close attention to task-local/thread-local scope propagation, callback lifetimes, stream finalization, and root_uuid isolation.
Public API changes should preserve existing behavior unless tests and docs show the intended migration path.

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/adaptive_hints_intercept.rs
  • crates/adaptive/src/tool_parallelism_learner.rs
  • crates/adaptive/src/lib.rs
  • crates/adaptive/src/acg_learner.rs
{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}

⚙️ CodeRabbit configuration file

{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}: Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant.
Prefer assertions on lifecycle events, scope stacks, middleware ordering, and binding parity over shallow smoke tests.

Files:

  • crates/adaptive/tests/unit/topology_tests.rs
🔇 Additional comments (7)
docs/adaptive-plugin/topology-aware-controls-design.mdx (1)

1-83: LGTM!

docs/adaptive-plugin/configuration.mdx (2)

202-235: Rust example now properly demonstrates topology-aware configuration.

The Rust example imports are extended with ConvergenceConfig and DriftConfig (line 205), and the initialization now explicitly sets tool_parallelism.drift, adaptive_hints.governor, acg.convergence, and top-level convergence fields (lines 221-235). This matches the Python and Node.js examples' completeness and addresses the prior guidance about explicit cross-language parity.


249-268: TOML example already includes all topology-aware control blocks with correct defaults and cross-language parity.

The TOML configuration snippet (lines 74-110) contains all three topology-aware control sections documented in the field table: [components.config.tool_parallelism.drift], [components.config.adaptive_hints.governor], and [components.config.acg.convergence]. Field names and default values align with the Python, Node.js, and Rust examples. No changes needed.

docs/adaptive-plugin/acg.mdx (1)

224-238: No changes needed; convergence validation rules are correctly documented in the "Common Validation Failures" section.

docs/adaptive-plugin/about.mdx (1)

1-64: SPDX header is correctly formatted.

The MDX file includes the required SPDX header with proper JSX comment delimiters {/* ... */} after the YAML frontmatter. No action needed.

crates/adaptive/src/adaptive_hints_intercept.rs (1)

137-150: LGTM!

Also applies to: 183-196, 223-227, 248-258

crates/adaptive/src/lib.rs (1)

34-34: Please confirm the required Rust validation commands were run for this change set.

I only see explicit test counts in the PR context; please confirm just test-rust, cargo fmt --all, and cargo clippy --workspace --all-targets -- -D warnings were executed successfully.

As per coding guidelines, "If any Rust code changed, always run just test-rust; also run cargo fmt --all; also run cargo clippy --workspace --all-targets -- -D warnings."

Source: Coding guidelines

Comment thread crates/adaptive/src/topology.rs Outdated
Comment thread crates/adaptive/tests/unit/topology_tests.rs Outdated
Signed-off-by: Teerth Sharma <teerths57@gmail.com>
@teerthsharma

teerthsharma commented Jun 22, 2026

Copy link
Copy Markdown
Author

@willkill07 @lvojtku ready for re-review when you have bandwidth.

The requested changes have been addressed on the current head 131ffbc60948a0bcc2d8b8d813c8283f7eeb48d6:

  • design note moved out of published Fern docs to .github/designs/topology-aware-adaptive-controls.md
  • public Adaptive docs no longer link to or publish the internal design note
  • the internal design note now frames the feature by benefit gates, not capability: fewer ACG observations before convergence, stale tool-plan invalidation after cohort drift, and low-value learned hint omission while preserving manual overrides
  • sample evidence includes ACG convergence (observations-to-decision: without=50, with=3), tool-drift plan invalidation, and hint-governor shed/override behavior
  • internal topology controls remain folded into nemo-relay-adaptive; no new adaptive-topology crate remains
  • tests remain outside source files; public Python topology primitive surface remains removed

Validation for the latest docs move: cd fern && npx fern check --warnings passes with only the expected unauthenticated redirects warning.

@willkill07 willkill07 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, the design doc isn't something that should be published as documentation.

I'm still trying to understand the benefit besides "just because we can"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a design document showing benefits. You state measurable Outcomes but do not provide any samples to demonstrate usefulness.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 012a96643d566113e7d53c00505638c611ea07e6 by adding a Sample Evidence section to this design doc.

It now includes concrete samples for each control:

  • ACG convergence: benchmark fixture with observations-to-decision: without=50, with=3
  • tool drift: sample cohort centroid movement from [1, 2, 0.0, 2] to [2, 3, 0.4, 3], showing stale plan invalidation
  • hint governor: sample learned hint shed at latency_sensitivity = 2.0 under epsilon = 10.0, while manual override at 11.0 still emits hints

I also ran cd fern && npx fern check --warnings; it passes with only the expected unauthenticated redirects warning.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 131ffbc60948a0bcc2d8b8d813c8283f7eeb48d6.

Changes made:

  • moved the design note out of the published Fern docs tree to .github/designs/topology-aware-adaptive-controls.md
  • removed the public docs link/page from docs/adaptive-plugin/about.mdx and docs/adaptive-plugin/configuration.mdx
  • rewrote the design note around benefit gates rather than capability: each control now has a concrete gate for when it should stay disabled versus when it is useful
  • kept the sample evidence in the internal note, including the benchmark sample observations-to-decision: without=50, with=3

Docs validation still passes with cd fern && npx fern check --warnings.

@teerthsharma teerthsharma Jun 22, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it will give measure samples as many i can on my local device

Signed-off-by: Teerth Sharma <teerths57@gmail.com>
Signed-off-by: Teerth Sharma <teerths57@gmail.com>
Signed-off-by: teerth sharma <teerths57@gmail.com>
@willkill07

Copy link
Copy Markdown
Member

I ran a 5,000-instance local event-trace replay over benchmark-derived prompts, using NeMo Relay Agent scopes, managed llm.execute(...) calls, ATOF JSONL export, and the adaptive telemetry drain.

But were these actual agents running or just synthetic? Synthetic benchmarks don't mean anything (to me) here. We care about realistic agent behavior and traces with multiple turns, drift, etc. Of course if your data has duplicate requests you will be able to optimize. That's not what I'm asking for here.

@teerthsharma

teerthsharma commented Jun 24, 2026

Copy link
Copy Markdown
Author

I ran a 5,000-instance local event-trace replay over benchmark-derived prompts, using NeMo Relay Agent scopes, managed llm.execute(...) calls, ATOF JSONL export, and the adaptive telemetry drain.

But were these actual agents running or just synthetic? Synthetic benchmarks don't mean anything (to me) here. We care about realistic agent behavior and traces with multiple turns, drift, etc. Of course if your data has duplicate requests you will be able to optimize. That's not what I'm asking for here.

Testing with gemini Right Now I understand the goal much better now

@willkill07

Copy link
Copy Markdown
Member

Gemini is not supported with NeMo Relay at this point in time, so I'm not sure how you expect to do validation.

Supported Harnesses:

  • Codex
  • Claude Code
  • OpenClaw
  • Hermes Agent

@teerthsharma

teerthsharma commented Jun 24, 2026

Copy link
Copy Markdown
Author

Gemini is not supported with NeMo Relay at this point in time, so I'm not sure how you expect to do validation.

Supported Harnesses:

  • Codex
  • Claude Code
  • OpenClaw
  • Hermes Agent

Using Gemini api externally to create prompts (will help with accutal usage )and then feed it into the relay with open code is my plan thanks for the input

@willkill07

Copy link
Copy Markdown
Member

@teerthsharma Some record of experiment setup seems crucial to include here so I can validate/understand what is being covered.

@teerthsharma

teerthsharma commented Jun 24, 2026

Copy link
Copy Markdown
Author

@teerthsharma Some record of experiment setup seems crucial to include here so I can validate/understand what is being covered.

Will be included with the tests results shall I include a csv of logs as well? (Testing time remains testing atleast 1000 cycle)

@teerthsharma

Copy link
Copy Markdown
Author

I would like to share the results from a recent local 100-prompt agent trace run validating our ACG convergence behavior.

For this test, I ran a sandbox replay where Gemma/LiteRT generated 100 workload prompts across 10 prompt families. The prompt set was executed twice through NeMo Relay Agent scopes using opencode/deepseek-v4-flash-free as the live worker. I compared two configurations:

  • Baseline: ACG convergence disabled.
  • Topology ACG: Topology-aware ACG convergence enabled.

Run Results

Run Agent Runs LLM Calls Tool Calls ATOF Events Malformed ATOF Profiles >=3 Obs Converged Profiles OK Failed p95 ms
Baseline 100 100 100 600 0 10 0 95 5 35948.383
Topology ACG 100 100 100 600 0 10 8 96 4 45061.466

Key Takeaways

  • Observable Convergence: The topology-aware ACG convergence feature produced observable runtime convergence on 8 out of the 10 repeated prompt-family profiles, compared to 0 in the baseline.
  • Successful Triggering: The feature effectively triggers on NeMo Relay’s live collected event traces, validating functionality beyond synthetic unit fixtures.
  • Event Integrity: Existing telemetry collection remained completely intact. Both runs accurately captured all Agent, LLM, and tool lifecycles with zero malformed ATOF records.
    looking forwards for next steps

@teerthsharma

Copy link
Copy Markdown
Author

I would like to share the results from a recent local 100-prompt agent trace run validating our ACG convergence behavior.

For this test, I ran a sandbox replay where Gemma/LiteRT generated 100 workload prompts across 10 prompt families. The prompt set was executed twice through NeMo Relay Agent scopes using opencode/deepseek-v4-flash-free as the live worker. I compared two configurations:

  • Baseline: ACG convergence disabled.
  • Topology ACG: Topology-aware ACG convergence enabled.

Run Results

Run Agent Runs LLM Calls Tool Calls ATOF Events Malformed ATOF Profiles >=3 Obs Converged Profiles OK Failed p95 ms
Baseline 100 100 100 600 0 10 0 95 5 35948.383
Topology ACG 100 100 100 600 0 10 8 96 4 45061.466

Key Takeaways

  • Observable Convergence: The topology-aware ACG convergence feature produced observable runtime convergence on 8 out of the 10 repeated prompt-family profiles, compared to 0 in the baseline.
  • Successful Triggering: The feature effectively triggers on NeMo Relay’s live collected event traces, validating functionality beyond synthetic unit fixtures.
  • Event Integrity: Existing telemetry collection remained completely intact. Both runs accurately captured all Agent, LLM, and tool lifecycles with zero malformed ATOF records.
    looking forwards for next steps

The p95 latency is included for no-degradation visibility only; the run used local/live opencode behavior, so latency is noisy.

@willkill07

Copy link
Copy Markdown
Member

The topology-aware ACG convergence feature produced observable runtime convergence on 8 out of the 10 repeated prompt-family profiles, compared to 0 in the baseline.

Because the baseline doesn't do any convergence... N/A is more appropriate for the baseline.

The p95 latency is included for no-degradation visibility only; the run used local/live opencode behavior, so latency is noisy.

A 10 second delta feels like more than just noise to me. It shows that across 100 LLM calls, the average added latency was 100ms. This seems unacceptable?

Also, what does Pass/Fail mean? Nothing should fail here as there is nothing fallible.

@teerthsharma

teerthsharma commented Jun 24, 2026

Copy link
Copy Markdown
Author

The topology-aware ACG convergence feature produced observable runtime convergence on 8 out of the 10 repeated prompt-family profiles, compared to 0 in the baseline.

Because the baseline doesn't do any convergence... N/A is more appropriate for the baseline.

The p95 latency is included for no-degradation visibility only; the run used local/live opencode behavior, so latency is noisy.

A 10 second delta feels like more than just noise to me. It shows that across 100 LLM calls, the average added latency was 100ms. This seems unacceptable?

Also, what does Pass/Fail mean? Nothing should fail here as there is nothing fallible.

Yes N/A , p95 latency is under investigation as for pass or fail that was a way to show no degradation in pipeline and newer results will show different wording scheme for such

@willkill07

Copy link
Copy Markdown
Member

as for pass or fail that was a way to show no degradation

I'm confused by this. How could the baseline show degradation? What is your measurement of degradation?

@teerthsharma

Copy link
Copy Markdown
Author

as for pass or fail that was a way to show no degradation

I'm confused by this. How could the baseline show degradation? What is your measurement of degradation?

Pass/fail means external opencode subprocess completion, not Relay correctness, so I will remove or relabel that column.

@teerthsharma

teerthsharma commented Jun 25, 2026

Copy link
Copy Markdown
Author

ACG learner stability cache PR
I’ve run the full suite again with the pool process enabled and wanted to share a few more observations. Pool process behaviour
The pool handles up to 4 concurrent stability checks without blocking the main event loop.

Under the 100‑profile repeated workload, pool utilisation peaked at ~60%, so there’s headroom for larger clusters.

No deadlocks or hung workers were observed across all 200 runs. +0.526 ms delta with this which seems acceptable it includes the pool dispatch, cache lookup, and short‑circuit logic. For converged profiles, the savings are much larger: we skip reading the full observation window (which can be thousands of records) and skip the rewrite step entirely. The +0.5 ms is only paid once per profile at startup; subsequent convergence checks are near‑zero. But The stability cache TTL is currently tied to stability_window (5). If a profile converges and then the underlying data distribution shifts before the window expires, we might reuse stale stability. I’ve added a conservative guard: if the profile’s last observed timestamp is older than 2 * stability_window, we treat it as stale and fall through to the repair path. That logic is already in the PR.

@willkill07

willkill07 commented Jun 25, 2026

Copy link
Copy Markdown
Member

@teerthsharma I need some more information here. Was each run multiple LLM calls? did you measure the entire E2E time of each run or just look at when the workers completed? Where are the actual results?

At this point, I am leaning towards this not being substantial enough of an improvement to warrant the additional 5k line maintenance cost. From what you've disclosed so far, there does not appear to be a notable improvement compared to the existing behavior.

@teerthsharma

teerthsharma commented Jun 25, 2026

Copy link
Copy Markdown
Author

I have results for each run i can investigate for any specific thing if required but
then I think the only possible way is reducing the number of lines of code if I have to go through the pull request
all results are sandboxed for now But I think I should only ping next when I have some solution about 5k lines and keep it draft till then?

@willkill07

Copy link
Copy Markdown
Member

Even I relax and don't care about any line count changes consider the following:

An external contributor introduces new code which doesn't show clear benefit.
Should the new code be merged? That's why I'm leaning towards no. Faster convergence towards a steady state isn't necessarily the entire purpose of the ACG and how it works. Dynamic online learning has negligible cost the longer you have an agent running. So, I'm just stating that I haven't seen any numbers that validate your claim that this does some really useful improvement, especially since your benchmarking isn't with a real active agent harness testing the exact expected functionality.

@teerthsharma

Copy link
Copy Markdown
Author

Even I relax and don't care about any line count changes consider the following:

An external contributor introduces new code which doesn't show clear benefit. Should the new code be merged? That's why I'm leaning towards no. Faster convergence towards a steady state isn't necessarily the entire purpose of the ACG and how it works. Dynamic online learning has negligible cost the longer you have an agent running. So, I'm just stating that I haven't seen any numbers that validate your claim that this does some really useful improvement, especially since your benchmarking isn't with a real active agent harness testing the exact expected functionality.

Right sir , I should redesign this , find some accutally valuable numbers, thanks for the guidance making it a draft for now . Thanks for the time

@teerthsharma teerthsharma marked this pull request as draft June 25, 2026 20:13
Signed-off-by: teerth <teerths57@gmail.com>
@teerthsharma

teerthsharma commented Jun 26, 2026

Copy link
Copy Markdown
Author

I updated the topology-aware ACG work to focus on reusable agent structure, not exact repeated prompts.

use case: a moderation or triage agent often runs the same workflow over changing inputs. For example, every forum-post review may follow the same scaffold: check whether moderator attention is needed, look for policy/quality signals, explain why, and suggest the next action. The post content changes each time, but the review workflow stays stable.

The old convergence logic treated that changing post content as instability. The new behavior lets ACG recognize that the reusable workflow is stable while the individual work item varies.

Technical changes:

  • convergence now measures stable prompt prefix instead of punishing variable task suffix
  • converged profiles can be reused after the stability window, currently 3 observations
  • learning reopens if the prompt topology changes under the same learning key
  • tests cover stable-prefix convergence, cached reuse, and topology-change reopening

Local sandbox evidence:

  • baseline: 100/100 ok
  • topology_acg: 99/100 ok, 1 timeout
  • ATOF events: 600/600 in both runs, 0 malformed
  • Redis run/LLM/tool counts: 100/100 in both runs
  • topology converged 10/10 profiles using 3 observations per profile

Important limitation: I am not claiming end-to-end opencode latency improved. That run was contaminated by ordering/time decay and a Windows critical-battery sleep event.

If you want latency evidence, I can run a cleaner approved follow-up: either a dry-run opencode test to isolate Relay/ACG overhead, or an interleaved A/B run so baseline and topology execute task-by-task under the same conditions.
@coderabbitai resume

@teerthsharma teerthsharma marked this pull request as ready for review June 26, 2026 15:08
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

@teerthsharma: Thanks — this framing is materially better.

Treating stable prompt prefix / reusable agent structure as the convergence target is much more defensible than requiring exact prompt repetition, and the explicit reopen-on-topology-change behavior addresses the main safety concern with over-sticky convergence. Also appreciate the correction on latency: not claiming an e2e win from a contaminated run is the right call.

If you do a follow-up measurement, the stronger one would be the interleaved A/B run, not the dry-run:

  • alternate baseline/topology per task to control for order/time drift,
  • report per-run E2E duration for the full agent run,
  • include observation counts per profile before/after convergence,
  • and show that topology mode does not reduce success rate or distort tool/LLM event counts.

Resuming automated review now.


✏️ Learnings added
Learnt from: teerthsharma
URL: https://github.com/NVIDIA/NeMo-Relay/pull/282

Timestamp: 2026-06-26T15:15:34.303Z
Learning: In NeMo-Relay adaptive topology-aware ACG, the intended convergence behavior is to detect stability from a reusable agent workflow's stable prompt prefix under the same learning key, while allowing variable task-specific suffixes; learning must reopen when the prompt topology changes. This clarification is relevant to reviews of `crates/adaptive/src/acg_learner.rs` and related convergence tests.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Action performed

Reviews resumed.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs (1)

386-389: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert header injection on the manual-override branch.

This branch only verifies the body payload. inject_agent_hints also writes AGENT_HINTS_HEADER_KEY, so a header-only regression would still pass here.

Proposed test hardening
     assert_eq!(
         manual_request.content["nvext"]["agent_hints"]["latency_sensitivity"],
         serde_json::json!(11.0)
     );
+    assert!(manual_request.headers.get(AGENT_HINTS_HEADER_KEY).is_some());

As per path instructions, "Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs` around lines
386 - 389, The manual-override test only checks the request body and misses the
header behavior added by inject_agent_hints, so it can still pass if
AGENT_HINTS_HEADER_KEY is not written. Update this test to also assert the
expected header is present on the manual_request, using AGENT_HINTS_HEADER_KEY
and the same manual-override setup, so the branch verifies both body and header
injection.

Source: Path instructions

crates/adaptive/src/acg_learner.rs (1)

224-285: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Best-profile selection is order-dependent when cached and non-cached profiles mix.

The cached branch compares against best_aggregate_stability, but the non-cached branch compares against best_profile_seed. If a stronger cached profile is seen first, the first non-cached profile always overwrites it because best_profile_seed is still None. Since grouped_observations is a HashMap, the aggregate winner becomes nondeterministic and can regress the agent-level stability cached later in this method.

Suggested fix
-                let replace_best = best_profile_seed
-                    .as_ref()
-                    .map(|(_, current)| {
+                let replace_best = best_aggregate_stability
+                    .as_ref()
+                    .map(|current| {
                         (
                             stability_result.stable_prefix_length,
                             stability_result.total_observations,
                         ) > (current.stable_prefix_length, current.total_observations)
                     })
                     .unwrap_or(true);
                 if replace_best {
                     best_profile_seed = Some((observations_vec.clone(), stability_result.clone()));
                     best_aggregate_stability = Some(stability_result.clone());
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/src/acg_learner.rs` around lines 224 - 285, The best-profile
selection logic in ACG learner is split between the cached path and the
non-cached path, which makes the aggregate winner depend on iteration order.
Update the selection in acg_learner.rs so both branches compare against the same
shared “best” state instead of using best_aggregate_stability for cached
profiles and best_profile_seed for fresh profiles; use the existing symbols
best_profile_seed, best_aggregate_stability, and stability_result to centralize
the comparison. Ensure a cached profile cannot be overwritten just because the
first non-cached profile is encountered, and keep the winner deterministic
across HashMap iteration order.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/designs/topology-aware-adaptive-controls.md:
- Around line 93-118: The Mermaid diagram under the Architecture Flow heading
starts without an introductory sentence, so add a short complete lead-in
sentence immediately before the fenced mermaid block in the
topology-aware-adaptive-controls.md section. Keep the existing diagram content
intact and ensure the new sentence clearly introduces the flowchart, matching
the documentation style used elsewhere in the document.

In `@ATTRIBUTIONS-Rust.md`:
- Around line 36414-36418: Fix the markdownlint regressions in the attribution
entries by updating the affected blocks around the strum entry and the other
referenced attribution block. Add the required blank lines around the heading
and the fenced code block, and change the fence to use a text info string so the
content satisfies MD022, MD031, and MD040. Use the existing attribution section
structure to keep the formatting consistent across these entries.

In `@crates/adaptive/src/acg_learner.rs`:
- Around line 97-114: The drift calculation in the ACG learner is currently a
binary flag based only on whether stable_prefix_length is zero, which can
trigger convergence too early. Update the logic in acg_learner.rs around the
drift/stable_prefix_len computation to derive drift from the unstable fraction
of the profile instead of a 0.0/1.0 switch, using the available stability fields
in the learner flow so the drift gate reflects how much of the suffix remains
unstable.

In `@crates/adaptive/tests/unit/acg_learner_tests.rs`:
- Around line 445-480: The empty-observation repair test in acg_learner_tests.rs
only checks that a repaired observation is written back, but it should also
verify the stale converged state is cleared. Update
acg_learner_repairs_converged_stability_with_empty_observations to assert the
observation stored by SeedObservationBackend after AcgLearner::process_run no
longer has converged set to true, matching the behavior already expected by the
neighboring missing-observation repair case.

---

Outside diff comments:
In `@crates/adaptive/src/acg_learner.rs`:
- Around line 224-285: The best-profile selection logic in ACG learner is split
between the cached path and the non-cached path, which makes the aggregate
winner depend on iteration order. Update the selection in acg_learner.rs so both
branches compare against the same shared “best” state instead of using
best_aggregate_stability for cached profiles and best_profile_seed for fresh
profiles; use the existing symbols best_profile_seed, best_aggregate_stability,
and stability_result to centralize the comparison. Ensure a cached profile
cannot be overwritten just because the first non-cached profile is encountered,
and keep the winner deterministic across HashMap iteration order.

In `@crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs`:
- Around line 386-389: The manual-override test only checks the request body and
misses the header behavior added by inject_agent_hints, so it can still pass if
AGENT_HINTS_HEADER_KEY is not written. Update this test to also assert the
expected header is present on the manual_request, using AGENT_HINTS_HEADER_KEY
and the same manual-override setup, so the branch verifies both body and header
injection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bcb7c090-5d4f-4500-8584-8801f2beb8ff

📥 Commits

Reviewing files that changed from the base of the PR and between 0f56103 and a222eb3.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/designs/topology-aware-adaptive-controls.md
  • ATTRIBUTIONS-Rust.md
  • crates/adaptive/src/acg_learner.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/configuration.mdx
💤 Files with no reviewable changes (2)
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/configuration.mdx
📜 Review details
🧰 Additional context used
📓 Path-based instructions (27)
**/*.rs

📄 CodeRabbit inference engine (.agents/skills/add-binding-feature/SKILL.md)

Use snake_case naming convention for Rust identifiers (e.g., nemo_relay_tool_call)

**/*.rs: Any Rust change must run just test-rust
Any Rust change must run cargo fmt --all
Any Rust change must run cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Run cargo fmt --all for all FFI work since it is Rust work
Run just test-rust to validate FFI changes
Run cargo clippy --workspace --all-targets -- -D warnings to enforce strict linting on FFI work

When Rust files changed as part of Go work, also run cargo fmt --all, just test-rust, and cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Run cargo fmt --all when Rust files are changed as part of Node work
Run cargo clippy --workspace --all-targets -- -D warnings when Rust files are changed as part of Node work
Run just test-rust when Rust files are changed as part of Node work

**/*.rs: Run cargo fmt --all to format all Rust code
Run cargo clippy --workspace --all-targets -- -D warnings to enforce all clippy lints as errors

**/*.rs: Run cargo fmt --all when Rust files changed as part of WebAssembly work
Run cargo clippy --workspace --all-targets -- -D warnings when Rust files changed as part of WebAssembly work

**/*.rs: If any Rust code changed, always run just test-rust
If any Rust code changed, also run cargo fmt --all
If any Rust code changed, also run cargo clippy --workspace --all-targets -- -D warnings
Run Rust formatting with cargo fmt --all
Run Rust linting with cargo clippy --workspace --all-targets -- -D warnings

**/*.rs: Use cargo fmt for Rust code formatting
Run cargo clippy -- -D warnings to lint Rust code and treat all warnings as errors
Use Rust snake_case naming convention for Rust identifiers
Include SPDX license header in all Rust source files using double-slash comment syntax
Validate Rust code with uv run pre-commit run --all-files to enforce cargo fmt formatting check, cargo clippy lints, and cargo deny aud...

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**,python/nemo_relay/adaptive.py,python/nemo_relay/plugin.py,go/nemo_relay/adaptive/**,go/nemo_relay/!(adaptive)/**,**/node/**,**/wasm/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

Keep adaptive surface in sync across crates/adaptive, shared plugin behavior in core and bindings, Python adaptive/plugin wrappers in python/nemo_relay/adaptive.py and python/nemo_relay/plugin.py, Go adaptive helpers under go/nemo_relay/adaptive plus shared plugin helpers in go/nemo_relay, and Node/WebAssembly adaptive helpers and plugin wrappers

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**,python/nemo_relay/plugin.py,go/nemo_relay/**,**/node/**,**/wasm/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

{crates/adaptive/**,python/nemo_relay/plugin.py,go/nemo_relay/**,**/node/**,**/wasm/**}: Maintain consistent plugin lifecycle across all language bindings (Python, Go, Node/WebAssembly, and Rust)
Keep plugin context surfaces aligned across all language implementations

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
{crates/adaptive/**/*.rs,**/*test*.{rs,py,go,ts,js},**/*adaptive*test*.{rs,py,go,ts,js},docs/plugins/adaptive/**}

📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)

Maintain documented and tested validation and report behavior for adaptive surfaces

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
**/{Cargo.toml,**/*.rs}

📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)

Maintain consistency between Rust package names in Cargo.toml and their actual usage across the codebase

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{h,hpp,c,cpp,rs}

📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)

Ensure FFI header and library naming follows consistent conventions across platform-specific builds

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
{crates/core,crates/adaptive}/**/*

📄 CodeRabbit inference engine (.agents/skills/prepare-pr/SKILL.md)

Changes to crates/core or crates/adaptive must run the full language matrix

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{rs,toml}

📄 CodeRabbit inference engine (.agents/skills/rename-surfaces/SKILL.md)

Update Rust crate names and module prefixes during coordinated rename operations

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
crates/{core,adaptive}/**

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

If crates/core or crates/adaptive changed, run the full matrix across Rust, Python, Go, Node.js, and WebAssembly

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
**/*.{rs,py,js,ts,tsx,jsx,go,sh,toml,yaml,yml,md}

📄 CodeRabbit inference engine (AGENTS.md)

Keep SPDX headers on source, docs, scripts, and configuration files. The project is Apache-2.0.

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
  • ATTRIBUTIONS-Rust.md
**/*.{rs,py,go,js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Follow binding naming conventions: Rust and Python use snake_case, C FFI exports prefixed nemo_relay_, Go uses PascalCase for public APIs, Node.js uses camelCase.

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
crates/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

crates/**/*.rs: Keep async behavior on the existing tokio-based model. Bindings should preserve callback and future lifetimes rather than blocking or hiding async work unexpectedly.
Use Json = serde_json::Value in Rust-facing runtime APIs for JSON payload handling.

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.

Project Overview

NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.

The shared runtime model is:

  1. Scope stacks decide where work belongs and which scope-local behavior is visible.
  2. Middleware registries decide what guardrails and intercepts run around managed calls.
  3. Plugins install reusable runtime behavior from configuration.
  4. Events record runtime behavior in ATOF form.
  5. Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.

Repository Structure

The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.

crates/
  core/       # Rust core runtime crate, published as nemo-relay
  adaptive/   # Adaptive runtime primitives and plugin components
  python/     # PyO3 native extension for the Python package
  ffi/        # Raw C ABI layer used by downstream bindings such as Go
  node/       # NAPI Node.js binding and JavaScript/TypeScript entry points
  wasm/       # wasm-bindgen WebAssembly binding and JS wrappers
python/
  nemo_relay/  # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers
  tests/      # Python tests
go/
  nemo_relay/  # Experimental Go CGo binding and tests
fern/         # Fern documentation site
scripts/      # Stable wrappers and helper scripts; build/test/docs entry points live in justfile
third_party/  # P...

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
  • ATTRIBUTIONS-Rust.md
crates/{core,adaptive}/**/*.rs

⚙️ CodeRabbit configuration file

crates/{core,adaptive}/**/*.rs: Review the Rust runtime for async correctness, scope isolation, middleware ordering, and event lifecycle regressions.
Pay close attention to task-local/thread-local scope propagation, callback lifetimes, stream finalization, and root_uuid isolation.
Public API changes should preserve existing behavior unless tests and docs show the intended migration path.

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/src/acg_learner.rs
{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}

⚙️ CodeRabbit configuration file

{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}: Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant.
Prefer assertions on lifecycle events, scope stacks, middleware ordering, and binding parity over shallow smoke tests.

Files:

  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
{.github/**,.gitlab-ci.yml,.pre-commit-config.yaml,justfile,scripts/**}

⚙️ CodeRabbit configuration file

{.github/**,.gitlab-ci.yml,.pre-commit-config.yaml,justfile,scripts/**}: Review automation changes for reproducibility, pinned versions where appropriate, secret handling, and consistency with the documented validation matrix.
Pay attention to commands that need generated native artifacts, FFI libraries, or platform-specific environment variables.

Files:

  • .github/designs/topology-aware-adaptive-controls.md
**/*.{md,rst,html,txt}

📄 CodeRabbit inference engine (.agents/skills/review-doc-style/assets/nvidia-style-brand-terminology.md)

**/*.{md,rst,html,txt}: Always spell NVIDIA in all caps. Do not use Nvidia, nvidia, nVidia, nVIDIA, or NV.
Use an NVIDIA before a noun because the name starts with an 'en' sound.
Do not add a registered trademark symbol after NVIDIA when referring to the company.
Use trademark symbols with product names only when the document type or legal guidance requires them.
Verify official capitalization, spacing, and hyphenation for product names.
Precede NVIDIA product names with NVIDIA on first mention when it is natural and accurate.
Do not rewrite product names for grammar or title-case rules.
Preserve third-party product names according to the owner's spelling.
Include the company name and full model qualifier on first use when it helps identify the model.
Preserve the official capitalization and punctuation of model names.
Use shorter family names only after the full name is established.
Spell out a term on first use and put the acronym in parentheses unless the acronym is widely understood by the intended audience.
Use the acronym on later mentions after it has been defined.
For long documents, reintroduce the full term if readers might lose context.
Form plurals of acronyms with s, not an apostrophe, such as GPUs.
In headings, common acronyms can remain abbreviated. Spell out the term in the first or second sentence of the body.
Common terms such as CPU, GPU, PC, API, and UI usually do not need to be spelled out for developer audiences.

Files:

  • ATTRIBUTIONS-Rust.md
**/*.{md,rst,html}

📄 CodeRabbit inference engine (.agents/skills/review-doc-style/assets/nvidia-style-brand-terminology.md)

Link the first mention of a product name when the destination helps the reader.

Files:

  • ATTRIBUTIONS-Rust.md
**/*.md

📄 CodeRabbit inference engine (.agents/skills/contribute-integration/SKILL.md)

Documentation must be updated if activation or usage changed

**/*.md: Use title case consistently in technical documentation headings
Avoid quotation marks, ampersands, and exclamation marks in headings
Keep product, event, research, and whitepaper names in their official title case
Use title case for table headers
Do not force social-media sentence case into technical docs
Format code elements, commands, parameters, package names, and expressions in monospace
Format directories, file names, and paths in monospace using backticks
Use angle brackets inside monospace for variables inside paths, such as /home/<username>/.login
Format error messages and strings in quotation marks, keeping literal code strings in code formatting when clearer
Format UI buttons, menus, fields, and labels in bold
Use angle brackets between UI labels for menu paths, such as File > Save As
Use italics for new terms on first use, sparingly and only when introducing the term
Use italics for publication titles
Format keyboard shortcuts in plain text, such as Press Ctrl+Alt+Delete
Use owner/repo link text for GitHub repositories, preferring [NVIDIA/NeMo](link) over prose references like 'the GitHub repo'
Introduce every code block with a complete sentence
Do not make a code block complete the grammar of the previous sentence
Do not continue a sentence after a code block
Use syntax highlighting when the format supports it for code blocks
Avoid the word 'snippet' unless the surrounding docs already use it as a term of art
Keep inline method, function, and class references consistent with nearby docs, omitting empty parentheses for prose readability when no call is shown
Use descriptive anchor text that matches the destination title when possible for links
Avoid raw URLs in running text
Avoid generic anchor text such as 'here,' 'this page,' and 'read more'
Include acronyms in link text when a linked term includes an acronym
Do not link long sentences or multiple sentences
Avoid links ...

Files:

  • ATTRIBUTIONS-Rust.md
**/{docs,examples,**/*.md,*.patch,*.diff,.github,*.sh,*.yaml,*.yml}

📄 CodeRabbit inference engine (.agents/skills/rename-surfaces/SKILL.md)

Update documentation, examples, CI configuration, and patch artifacts when performing rename operations

Files:

  • ATTRIBUTIONS-Rust.md
**/*.{md,rst,txt}

📄 CodeRabbit inference engine (.agents/skills/review-doc-style/assets/nvidia-style-guide.md)

Spell NVIDIA in all caps. Do not use Nvidia, nvidia, or NV.

Files:

  • ATTRIBUTIONS-Rust.md
**/*.{md,rst}

📄 CodeRabbit inference engine (.agents/skills/review-doc-style/assets/nvidia-style-guide.md)

**/*.{md,rst}: Format commands, code elements, expressions, package names, file names, and paths as inline code.
Use descriptive link text. Avoid raw URLs and weak anchors such as "here" or "read more."
Use title case consistently for technical documentation headings.
Introduce code blocks, lists, tables, and images with complete sentences.
Write procedures as imperative steps. Keep steps parallel and split long procedures into smaller tasks.
Prefer active voice, present tense, short sentences, contractions, and plain English.
Use can for possibility and reserve may for permission.
Use after for temporal relationships instead of once.
Prefer refer to over see when the wording points readers to another resource.
Avoid culture-specific idioms, unnecessary Latinisms, jokes, and marketing exaggeration in technical docs.
Spell out months in body text, avoid ordinal dates, and use clear time zones.
Spell out whole numbers from zero through nine unless they are technical values, parameters, versions, or UI values.
Use numerals for 10 or greater and include commas in thousands.
Do not add trademark symbols to learning-oriented docs unless the source, platform, or legal guidance explicitly requires them.

Files:

  • ATTRIBUTIONS-Rust.md
{docs/**,README.md,CONTRIBUTING.md,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Run docs link validation with just docs-linkcheck when links change

Files:

  • ATTRIBUTIONS-Rust.md
{docs/**,README.md,**/Cargo.toml,**/package.json,**/*.md}

📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)

Ensure renamed public surfaces are reflected consistently in manifests and docs for large or public-facing changes

Files:

  • ATTRIBUTIONS-Rust.md
**/*.{md,mdx,py,sh,yaml,yml,toml,json}

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

Keep package names, repo references, and build commands current

Files:

  • ATTRIBUTIONS-Rust.md
**/*.{html,md,mdx}

📄 CodeRabbit inference engine (CONTRIBUTING.md)

Include SPDX license header in HTML and Markdown files using HTML comment syntax

Files:

  • ATTRIBUTIONS-Rust.md
ATTRIBUTIONS-Rust.md

📄 CodeRabbit inference engine (.agents/skills/update-project-version/SKILL.md)

Regenerate ATTRIBUTIONS-Rust.md with ./scripts/generate_attributions.sh rust if Cargo metadata changed and committed attribution files must stay fresh

Files:

  • ATTRIBUTIONS-Rust.md
🪛 LanguageTool
.github/designs/topology-aware-adaptive-controls.md

[grammar] ~48-~48: Ensure spelling is correct
Context: ... tool cohort shape changes sharply. - Shed learned adaptive hints below a configur...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.22.1)
ATTRIBUTIONS-Rust.md

[warning] 36414-36414: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36417-36417: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Above

(MD022, blanks-around-headings)


[warning] 36417-36417: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36418-36418: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 36418-36418: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 36443-36443: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36446-36446: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Above

(MD022, blanks-around-headings)


[warning] 36446-36446: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36447-36447: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 36447-36447: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs (1)

362-368: LGTM!

crates/adaptive/src/acg_learner.rs (1)

69-82: 📐 Maintainability & Code Quality

Please confirm the required validation matrix ran for this adaptive Rust change.

This PR touches crates/adaptive, so the repo guidance calls for just test-rust, cargo fmt --all, cargo clippy --workspace --all-targets -- -D warnings, and the full language matrix before review.
As per coding guidelines, "If crates/core or crates/adaptive changed, run the full matrix across Rust, Python, Go, Node.js, and WebAssembly."

Also applies to: 164-312

Source: Coding guidelines

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs (1)

386-389: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert header injection on the manual-override branch.

This branch only verifies the body payload. inject_agent_hints also writes AGENT_HINTS_HEADER_KEY, so a header-only regression would still pass here.

Proposed test hardening
     assert_eq!(
         manual_request.content["nvext"]["agent_hints"]["latency_sensitivity"],
         serde_json::json!(11.0)
     );
+    assert!(manual_request.headers.get(AGENT_HINTS_HEADER_KEY).is_some());

As per path instructions, "Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs` around lines
386 - 389, The manual-override test only checks the request body and misses the
header behavior added by inject_agent_hints, so it can still pass if
AGENT_HINTS_HEADER_KEY is not written. Update this test to also assert the
expected header is present on the manual_request, using AGENT_HINTS_HEADER_KEY
and the same manual-override setup, so the branch verifies both body and header
injection.

Source: Path instructions

crates/adaptive/src/acg_learner.rs (1)

224-285: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Best-profile selection is order-dependent when cached and non-cached profiles mix.

The cached branch compares against best_aggregate_stability, but the non-cached branch compares against best_profile_seed. If a stronger cached profile is seen first, the first non-cached profile always overwrites it because best_profile_seed is still None. Since grouped_observations is a HashMap, the aggregate winner becomes nondeterministic and can regress the agent-level stability cached later in this method.

Suggested fix
-                let replace_best = best_profile_seed
-                    .as_ref()
-                    .map(|(_, current)| {
+                let replace_best = best_aggregate_stability
+                    .as_ref()
+                    .map(|current| {
                         (
                             stability_result.stable_prefix_length,
                             stability_result.total_observations,
                         ) > (current.stable_prefix_length, current.total_observations)
                     })
                     .unwrap_or(true);
                 if replace_best {
                     best_profile_seed = Some((observations_vec.clone(), stability_result.clone()));
                     best_aggregate_stability = Some(stability_result.clone());
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/src/acg_learner.rs` around lines 224 - 285, The best-profile
selection logic in ACG learner is split between the cached path and the
non-cached path, which makes the aggregate winner depend on iteration order.
Update the selection in acg_learner.rs so both branches compare against the same
shared “best” state instead of using best_aggregate_stability for cached
profiles and best_profile_seed for fresh profiles; use the existing symbols
best_profile_seed, best_aggregate_stability, and stability_result to centralize
the comparison. Ensure a cached profile cannot be overwritten just because the
first non-cached profile is encountered, and keep the winner deterministic
across HashMap iteration order.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/designs/topology-aware-adaptive-controls.md:
- Around line 93-118: The Mermaid diagram under the Architecture Flow heading
starts without an introductory sentence, so add a short complete lead-in
sentence immediately before the fenced mermaid block in the
topology-aware-adaptive-controls.md section. Keep the existing diagram content
intact and ensure the new sentence clearly introduces the flowchart, matching
the documentation style used elsewhere in the document.

In `@ATTRIBUTIONS-Rust.md`:
- Around line 36414-36418: Fix the markdownlint regressions in the attribution
entries by updating the affected blocks around the strum entry and the other
referenced attribution block. Add the required blank lines around the heading
and the fenced code block, and change the fence to use a text info string so the
content satisfies MD022, MD031, and MD040. Use the existing attribution section
structure to keep the formatting consistent across these entries.

In `@crates/adaptive/src/acg_learner.rs`:
- Around line 97-114: The drift calculation in the ACG learner is currently a
binary flag based only on whether stable_prefix_length is zero, which can
trigger convergence too early. Update the logic in acg_learner.rs around the
drift/stable_prefix_len computation to derive drift from the unstable fraction
of the profile instead of a 0.0/1.0 switch, using the available stability fields
in the learner flow so the drift gate reflects how much of the suffix remains
unstable.

In `@crates/adaptive/tests/unit/acg_learner_tests.rs`:
- Around line 445-480: The empty-observation repair test in acg_learner_tests.rs
only checks that a repaired observation is written back, but it should also
verify the stale converged state is cleared. Update
acg_learner_repairs_converged_stability_with_empty_observations to assert the
observation stored by SeedObservationBackend after AcgLearner::process_run no
longer has converged set to true, matching the behavior already expected by the
neighboring missing-observation repair case.

---

Outside diff comments:
In `@crates/adaptive/src/acg_learner.rs`:
- Around line 224-285: The best-profile selection logic in ACG learner is split
between the cached path and the non-cached path, which makes the aggregate
winner depend on iteration order. Update the selection in acg_learner.rs so both
branches compare against the same shared “best” state instead of using
best_aggregate_stability for cached profiles and best_profile_seed for fresh
profiles; use the existing symbols best_profile_seed, best_aggregate_stability,
and stability_result to centralize the comparison. Ensure a cached profile
cannot be overwritten just because the first non-cached profile is encountered,
and keep the winner deterministic across HashMap iteration order.

In `@crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs`:
- Around line 386-389: The manual-override test only checks the request body and
misses the header behavior added by inject_agent_hints, so it can still pass if
AGENT_HINTS_HEADER_KEY is not written. Update this test to also assert the
expected header is present on the manual_request, using AGENT_HINTS_HEADER_KEY
and the same manual-override setup, so the branch verifies both body and header
injection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bcb7c090-5d4f-4500-8584-8801f2beb8ff

📥 Commits

Reviewing files that changed from the base of the PR and between 0f56103 and a222eb3.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/designs/topology-aware-adaptive-controls.md
  • ATTRIBUTIONS-Rust.md
  • crates/adaptive/src/acg_learner.rs
  • crates/adaptive/src/topology.rs
  • crates/adaptive/tests/integration/topology_convergence_tests.rs
  • crates/adaptive/tests/unit/acg_learner_tests.rs
  • crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs
  • crates/adaptive/tests/unit/runtime_tests.rs
  • crates/adaptive/tests/unit/storage_tests.rs
  • crates/adaptive/tests/unit/topology_tests.rs
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/configuration.mdx
💤 Files with no reviewable changes (2)
  • docs/adaptive-plugin/about.mdx
  • docs/adaptive-plugin/configuration.mdx
📜 Review details
🔇 Additional comments (2)
crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs (1)

362-368: LGTM!

crates/adaptive/src/acg_learner.rs (1)

69-82: 📐 Maintainability & Code Quality

Please confirm the required validation matrix ran for this adaptive Rust change.

This PR touches crates/adaptive, so the repo guidance calls for just test-rust, cargo fmt --all, cargo clippy --workspace --all-targets -- -D warnings, and the full language matrix before review.
As per coding guidelines, "If crates/core or crates/adaptive changed, run the full matrix across Rust, Python, Go, Node.js, and WebAssembly."

Also applies to: 164-312

Source: Coding guidelines

🛑 Comments failed to post (4)
.github/designs/topology-aware-adaptive-controls.md (1)

93-118: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Introduce the Mermaid block with a complete sentence.

The diagram starts immediately under the heading. Add a short lead-in sentence before the code block so this section follows the Markdown doc rules consistently.

As per coding guidelines, "Introduce every code block with a complete sentence."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/designs/topology-aware-adaptive-controls.md around lines 93 - 118,
The Mermaid diagram under the Architecture Flow heading starts without an
introductory sentence, so add a short complete lead-in sentence immediately
before the fenced mermaid block in the topology-aware-adaptive-controls.md
section. Keep the existing diagram content intact and ensure the new sentence
clearly introduces the flowchart, matching the documentation style used
elsewhere in the document.

Source: Coding guidelines

ATTRIBUTIONS-Rust.md (1)

36414-36418: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the markdownlint regressions in these attribution entries.

Both inserted blocks need blank lines around the headings and fenced code, and the fences should be tagged as text to satisfy MD022/MD031/MD040.

Also applies to: 36443-36447

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 36414-36414: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36417-36417: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Above

(MD022, blanks-around-headings)


[warning] 36417-36417: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 36418-36418: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 36418-36418: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ATTRIBUTIONS-Rust.md` around lines 36414 - 36418, Fix the markdownlint
regressions in the attribution entries by updating the affected blocks around
the strum entry and the other referenced attribution block. Add the required
blank lines around the heading and the fenced code block, and change the fence
to use a text info string so the content satisfies MD022, MD031, and MD040. Use
the existing attribution section structure to keep the formatting consistent
across these entries.

Source: Linters/SAST tools

crates/adaptive/src/acg_learner.rs (1)

97-114: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Compute drift from the unstable fraction instead of a binary flag.

This collapses every nonzero stable prefix to 0.0 drift. A profile with one stable block and a large unstable suffix will satisfy the drift gate as quickly as a nearly fully stable profile, which makes convergence fire too early.

Suggested fix
-        let drift = if stability.stable_prefix_length == 0 {
-            1.0
-        } else {
-            0.0
-        };
+        let unstable_spans = total_spans.saturating_sub(stability.stable_prefix_length);
+        let drift = if total_spans == 0 {
+            1.0
+        } else {
+            unstable_spans as f64 / total_spans as f64
+        };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/src/acg_learner.rs` around lines 97 - 114, The drift
calculation in the ACG learner is currently a binary flag based only on whether
stable_prefix_length is zero, which can trigger convergence too early. Update
the logic in acg_learner.rs around the drift/stable_prefix_len computation to
derive drift from the unstable fraction of the profile instead of a 0.0/1.0
switch, using the available stability fields in the learner flow so the drift
gate reflects how much of the suffix remains unstable.
crates/adaptive/tests/unit/acg_learner_tests.rs (1)

445-480: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Also assert that empty-observation repair clears the stale converged flag.

This test only verifies that one observation is written back. If repair still re-saved converged = true, the regression would pass here even though the neighboring missing-observation test correctly expects learning to reopen.

Suggested fix
     let repaired_observations = backend
         .load_observations(&learning_key)
         .await
         .unwrap()
         .expect("empty observations should be repaired instead of trusting converged stability");
     assert_eq!(repaired_observations.len(), 1);
+
+    let repaired_stability = backend
+        .load_stability(&learning_key)
+        .await
+        .unwrap()
+        .expect("repaired stability should be stored");
+    assert!(
+        !repaired_stability.converged,
+        "a repaired single-observation profile should re-enter normal convergence"
+    );

As per path instructions, "Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

#[tokio::test(flavor = "current_thread")]
async fn acg_learner_repairs_converged_stability_with_empty_observations() {
    let learner = AcgLearner::new_with_convergence(
        "agent-a",
        20,
        StabilityThresholds::default(),
        Some(ConvergenceConfig {
            enabled: true,
            epsilon: 0.001,
            stability_window: 3,
        }),
    );
    let request = sample_request("gpt-4o", "Stable system", "Stable prompt");
    let learning_key = derive_acg_learning_key("agent-a", &request);
    let seed_observation = build_prompt_ir(&request).unwrap();
    let mut stale_stability = analyze_stability(
        std::slice::from_ref(&seed_observation),
        &StabilityThresholds::default(),
    );
    stale_stability.converged = true;

    let backend = SeedObservationBackend::new(&learning_key, Vec::new());
    backend.seed_stability(&learning_key, stale_stability);

    learner
        .process_run(&sample_run(vec![request]), &backend, &empty_cache())
        .await
        .unwrap();

    let repaired_observations = backend
        .load_observations(&learning_key)
        .await
        .unwrap()
        .expect("empty observations should be repaired instead of trusting converged stability");
    assert_eq!(repaired_observations.len(), 1);

    let repaired_stability = backend
        .load_stability(&learning_key)
        .await
        .unwrap()
        .expect("repaired stability should be stored");
    assert!(
        !repaired_stability.converged,
        "a repaired single-observation profile should re-enter normal convergence"
    );
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/adaptive/tests/unit/acg_learner_tests.rs` around lines 445 - 480, The
empty-observation repair test in acg_learner_tests.rs only checks that a
repaired observation is written back, but it should also verify the stale
converged state is cleared. Update
acg_learner_repairs_converged_stability_with_empty_observations to assert the
observation stored by SeedObservationBackend after AcgLearner::process_run no
longer has converged set to true, matching the behavior already expected by the
neighboring missing-observation repair case.

Source: Path instructions

@willkill07

Copy link
Copy Markdown
Member

find some accutally valuable numbers

I don't see anything presented in your comments above here.

Questions:

  • How is ACG improved?
  • Is there any measured latency?

@teerthsharma

Copy link
Copy Markdown
Author

find some accutally valuable numbers

I don't see anything presented in your comments above here.

Questions:

  • How is ACG improved?
  • Is there any measured latency?

Thanks, fair questions.

  • How is ACG improved?

    The improvement is in what ACG treats as stable.

    In repeated agent workflows, the reusable prompt structure can stay the same while the task-specific content changes every run. For example, a moderation or triage agent may always follow the same workflow: inspect the item, identify policy/quality signals, explain the reason, and suggest the next action. The forum post or issue content changes each time, but the workflow structure is stable.

    The updated topology-aware behavior lets ACG converge on that reusable stable prefix instead of treating every changed work item as full prompt instability. It also reopens learning if the prompt topology changes under the same learning key, so convergence should not become blindly sticky.

    In the local 100-task sandbox, topology mode converged 10/10 repeated workflow profiles using 3 observations per profile, while preserving 100/100 Redis run, LLM, and tool-call counts and 600/600 ATOF events with 0 malformed records.

  • Is there any measured latency?

    Not a clean latency result that I would ask you to rely on yet.

    The live opencode run I posted was contaminated by ordering/time drift and a Windows critical-battery sleep event, so I should not use it as evidence of end-to-end latency improvement. The defensible observation from that run is only that Relay overhead stayed in the low-ms range, while e2e runtime was dominated by the external opencode/model callback.

    If latency evidence is required, I think the right follow-up is an interleaved A/B run: baseline task N, topology task N, same workload, same live harness, reporting full per-run e2e duration, success rate, event counts, and observation counts per profile.

@willkill07

Copy link
Copy Markdown
Member
  1. Convergence/stability isn't strictly a benefit. I need to know the OBSERVED benefit of ACG using this method.

If latency evidence is required ...

Yes, I clearly mentioned this already. This should stay in draft until you have a compelling story with clear measured benefit and no latency degradation.

@willkill07 willkill07 marked this pull request as draft June 26, 2026 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DO NOT MERGE PR should not be merged; see PR for details Feature a new feature lang:go PR changes/introduces Go code lang:js PR changes/introduces Javascript/Typescript code lang:python PR changes/introduces Python code lang:rust PR changes/introduces Rust code size:XXL PR is very large

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants