5 changes: 4 additions & 1 deletion Makefile
@@ -10,7 +10,7 @@ ifeq ($(GOBIN),)
GOBIN := $(shell go env GOPATH)/bin
endif

-.PHONY: deps build install uninstall test unit vet harness-validate codex-app-eval codex-app-eval-suite codex-memory-deep-eval codex-skill-deep-eval docker-build docker-run compose-up compose-down compose-dev release-snapshot clean help
+.PHONY: deps build install uninstall test unit vet harness-validate codex-app-eval codex-app-eval-suite codex-memory-deep-eval codex-skill-deep-eval codex-eval-loop-smoke docker-build docker-run compose-up compose-down compose-dev release-snapshot clean help

.DEFAULT_GOAL := help

@@ -60,6 +60,9 @@ codex-memory-deep-eval: ## Run deep real Codex app-server memory regression suit
codex-skill-deep-eval: ## Run deep real Codex app-server skill regression suite
python3 scripts/codex_app_server_eval.py --suite --suite-name skill-deep

codex-eval-loop-smoke: ## Run real Codex app-server eval-loop projection smoke check
python3 scripts/codex_app_server_eval.py --module eval-loop

# ── Containers / Deployment ──────────────────────────────────────────

docker-build: ## Build runtime Docker image
2 changes: 2 additions & 0 deletions docs/harness/README.md
@@ -30,13 +30,15 @@ projection into host surfaces, and optional daemon scheduling.
| Harness Roadmap | [EN](ROADMAP.md) / [中文](../zh/harness/ROADMAP.md) |
| Memory Loop | [EN](memory-loop/DESIGN.md) / [中文](../zh/harness/memory-loop/DESIGN.md) / [site](../site/memory-loop/site.html) |
| Skill Loop | [EN](skill-loop/DESIGN.md) / [中文](../zh/harness/skill-loop/DESIGN.md) / [site](../site/skill-loop/site.html) |
| Eval Loop | [EN](eval-loop/DESIGN.md) / [中文](../zh/harness/eval-loop/DESIGN.md) |

## Installable Assets

| Harness Module | Implementation |
| --- | --- |
| Memory Loop | [harness/modules/memory-loop](../../harness/modules/memory-loop/README.md) |
| Skill Loop | [harness/modules/skill-loop](../../harness/modules/skill-loop/README.md) |
| Eval Loop | [harness/modules/eval-loop](../../harness/modules/eval-loop/README.md) |

## Repository Layout

90 changes: 90 additions & 0 deletions docs/harness/eval-loop/DESIGN.md
@@ -0,0 +1,90 @@
# Eval Loop MVP Design

Chinese version: [DESIGN.md](../../zh/harness/eval-loop/DESIGN.md)

Installable MVP assets: [harness/modules/eval-loop](../../../harness/modules/eval-loop/README.md)

The eval loop is Mnemon's feedback-facing harness module. It defines how a
HostAgent is tested through realistic scenarios, how evidence is collected, and
how stable failures become curated improvement candidates.

## Positioning

The eval loop is a peer of memory-loop and skill-loop. It is not their parent
module. Memory-loop and skill-loop directly affect the HostAgent interface by
changing remembered context and reusable working methods. Eval-loop observes
those effects through scenario execution and feeds findings back into the
project.

```text
harness/modules/
├── memory-loop
├── skill-loop
└── eval-loop
```

## Core Model

```text
scenario
|
v
isolated workspace + .mnemon + host projection
|
v
Codex app server HostAgent
|
v
artifacts: transcript, diff, memory state, skill evidence, logs
|
v
rubric judgement
|
v
report and improvement candidate
```

Codex app server is the current primary HostAgent. Generic HostAgent
requirements should be extracted from repeated Codex-first scenarios rather
than designed upfront.
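
The flow in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the module's runner: `Artifacts`, `Report`, `run_scenario`, `fake_host_agent`, and `mentions_tabs` are hypothetical names, and the stand-in host only echoes the prompt instead of talking to a real Codex app server.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Artifacts:
    """Evidence from one run; a subset of the artifact kinds in the diagram."""
    transcript: str = ""
    diff: str = ""
    logs: List[str] = field(default_factory=list)


@dataclass
class Report:
    scenario: str
    passed: bool
    improvement_candidate: Optional[str] = None


def run_scenario(
    name: str,
    prompt: str,
    host_agent: Callable[[str], Artifacts],
    rubric: Callable[[Artifacts], bool],
) -> Report:
    """Run one scenario: prompt the host, collect artifacts, judge, report."""
    artifacts = host_agent(prompt)
    passed = rubric(artifacts)
    # A failing judgement yields an improvement candidate, not just a red mark.
    candidate = None if passed else f"review {name}: rubric not satisfied"
    return Report(name, passed, candidate)


def fake_host_agent(prompt: str) -> Artifacts:
    """Hypothetical stand-in for the Codex app server; it only echoes the prompt."""
    return Artifacts(transcript=f"agent saw: {prompt}", logs=["turn completed"])


def mentions_tabs(artifacts: Artifacts) -> bool:
    """Toy rubric: the transcript must mention the remembered preference."""
    return "tabs" in artifacts.transcript


report = run_scenario(
    "memory-preference-recall", "I prefer tabs", fake_host_agent, mentions_tabs
)
print(report)
```

Passing the host and the rubric in as callables keeps the sketch Codex-first without hard-coding Codex, which matches the intent of extracting generic HostAgent requirements later.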

## Assets

| Asset | Purpose |
| --- | --- |
| Scenario | A reproducible task pressure case with target, setup, prompt, evidence, and expected observations. |
| Suite | A named set of scenarios and loop configuration. |
| Rubric | Criteria for judging behavior and eval asset quality. |
| Skill | Protocol methods for planning, running, analyzing, and improving evals. |
| Evaluator | Background curation worker for deduping candidates and summarizing trends. |
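
As a rough illustration of the Scenario and Suite rows, the sketch below models them as Python dataclasses and checks a draft scenario for empty fields. The field names follow the table, but the classes, the `missing_fields` helper, and the sample values are assumptions made for this example; the module's real on-disk format is not shown in this diff.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class Scenario:
    """Hypothetical in-memory shape of a scenario, mirroring the table row."""
    name: str
    target: str
    setup: List[str]
    prompt: str
    evidence: List[str]
    expected_observations: List[str]


@dataclass
class Suite:
    """A named set of scenarios plus loop configuration."""
    name: str
    scenarios: List[Scenario]
    loop_config: Dict[str, str] = field(default_factory=dict)


def missing_fields(scenario: Scenario) -> List[str]:
    """Return the names of fields that are empty, in declaration order."""
    return [f.name for f in fields(scenario) if not getattr(scenario, f.name)]


draft = Scenario(
    name="memory-preference-recall",
    target="memory-loop",
    setup=["seed one stored preference"],
    prompt="Which indentation style do I prefer?",
    evidence=[],  # nothing collected yet, so the draft is incomplete
    expected_observations=["the answer cites the stored preference"],
)
smoke = Suite(name="smoke", scenarios=[draft], loop_config={"host": "codex"})
print(missing_fields(draft))
```

A check like this could run before a scenario leaves the exploratory stage, so that incomplete drafts never reach review.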

## Lifecycle

Eval assets have a stricter lifecycle than skills because they define how the
project judges improvement.

```text
ephemeral -> candidate -> promoted -> canonical -> retired
```

- `ephemeral`: temporary exploration, no review required.
- `candidate`: proposed asset with initial evidence.
- `promoted`: curated asset for local regression.
- `canonical`: stable asset for long-term comparison or gates.
- `retired`: obsolete, flaky, or superseded asset.

This reduces review pressure: the agent can explore freely, but only stable and
useful assets are reviewed for promotion.
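
One way to read the lifecycle is as a small state machine. The sketch below is an interpretation, not the module's implementation: it assumes an asset may be retired from any earlier stage and that review gates entry into `promoted` and `canonical` only. Neither rule is spelled out in this document, and `Stage`, `allowed_transitions`, `needs_review`, and `advance` are hypothetical names.

```python
from enum import Enum
from typing import Set


class Stage(Enum):
    EPHEMERAL = "ephemeral"
    CANDIDATE = "candidate"
    PROMOTED = "promoted"
    CANONICAL = "canonical"
    RETIRED = "retired"


ORDER = list(Stage)  # Enum iteration follows definition order


def allowed_transitions(stage: Stage) -> Set[Stage]:
    """Stages reachable in one step from the given stage."""
    if stage is Stage.RETIRED:
        return set()
    next_stage = ORDER[ORDER.index(stage) + 1]
    # Assumption: an asset may be retired from any earlier stage.
    return {next_stage, Stage.RETIRED}


def needs_review(target: Stage) -> bool:
    """Assumption: review gates entry into promoted and canonical only."""
    return target in (Stage.PROMOTED, Stage.CANONICAL)


def advance(stage: Stage, target: Stage, reviewed: bool = False) -> Stage:
    """Move an asset to the target stage, enforcing order and review."""
    if target not in allowed_transitions(stage):
        raise ValueError(f"cannot move from {stage.value} to {target.value}")
    if needs_review(target) and not reviewed:
        raise PermissionError(f"{target.value} requires review")
    return target


stage = advance(Stage.EPHEMERAL, Stage.CANDIDATE)  # free exploration step
stage = advance(stage, Stage.PROMOTED, reviewed=True)  # gated step
print(stage.value)
```

Under these assumptions the agent can create and discard ephemeral assets without any review, and the review cost is paid only at the two promotion steps.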

## First Scope

The first scenarios focus on Mnemon's current self-evolution work:

- memory preference recall
- skill creation and reuse
- bilingual documentation synchronization
- host projection smoke checks

These scenarios evaluate memory-loop and skill-loop today, but the eval-loop
framework is intentionally broader. It can also evaluate setup, host adapters,
docs workflow, commit discipline, and eval-loop itself.
2 changes: 2 additions & 0 deletions docs/zh/harness/README.md
@@ -25,13 +25,15 @@ host surface projection, and optional daemon scheduling.
| Harness Roadmap | [中文](ROADMAP.md) / [EN](../../harness/ROADMAP.md) |
| Memory Loop | [中文](memory-loop/DESIGN.md) / [EN](../../harness/memory-loop/DESIGN.md) / [site](../../site/memory-loop/site.html) |
| Skill Loop | [中文](skill-loop/DESIGN.md) / [EN](../../harness/skill-loop/DESIGN.md) / [site](../../site/skill-loop/site.html) |
| Eval Loop | [中文](eval-loop/DESIGN.md) / [EN](../../harness/eval-loop/DESIGN.md) |

## Installable Assets

| Harness Module | Implementation |
| --- | --- |
| Memory Loop | [harness/modules/memory-loop](../../../harness/modules/memory-loop/README.md) |
| Skill Loop | [harness/modules/skill-loop](../../../harness/modules/skill-loop/README.md) |
| Eval Loop | [harness/modules/eval-loop](../../../harness/modules/eval-loop/README.md) |

## Repository Layout

88 changes: 88 additions & 0 deletions docs/zh/harness/eval-loop/DESIGN.md
@@ -0,0 +1,88 @@
# Eval Loop MVP Design

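English version: [DESIGN.md](../../../harness/eval-loop/DESIGN.md)

Installable MVP assets: [harness/modules/eval-loop](../../../../harness/modules/eval-loop/README.md)

The eval loop is Mnemon's feedback-facing harness module. It defines how a
HostAgent is tested through realistic scenarios, how evidence is collected, and
how stable failures are turned into curated improvement candidates.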

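## Positioning

The eval loop is a peer of memory-loop and skill-loop, not their parent module.
Memory-loop and skill-loop directly affect the HostAgent interface: the former
shapes remembered context, the latter shapes reusable working methods.
Eval-loop observes these effects through scenario execution and feeds its
findings back into the project.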

```text
harness/modules/
├── memory-loop
├── skill-loop
└── eval-loop
```

## Core Model

```text
scenario
|
v
isolated workspace + .mnemon + host projection
|
v
Codex app server HostAgent
|
v
artifacts: transcript, diff, memory state, skill evidence, logs
|
v
rubric judgement
|
v
report and improvement candidate
```

Codex app server is the current primary HostAgent. Generic HostAgent
requirements should be distilled continuously from Codex-first scenarios rather
than designed upfront.

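## Assets

| Asset | Purpose |
| --- | --- |
| Scenario | A reproducible task pressure scenario, including target, setup, prompt, evidence, and expected observations. |
| Suite | A set of scenarios and loop configuration. |
| Rubric | Criteria for judging behavior and eval asset quality. |
| Skill | Protocol methods for eval plan, run, analyze, and improve. |
| Evaluator | Background curation worker for deduplicating candidates and summarizing trends. |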

## Lifecycle

The lifecycle of eval assets should be stricter than that of skills, because
they define how the project judges whether it has improved.

```text
ephemeral -> candidate -> promoted -> canonical -> retired
```

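- `ephemeral`: temporary exploration, no audit required.
- `candidate`: candidate asset with initial evidence.
- `promoted`: curated, usable for local regression.
- `canonical`: stable, usable for long-term comparison or gates.
- `retired`: obsolete, unstable, or superseded asset.

This lowers review pressure: the agent can explore freely, but only stable and
valuable assets enter promotion review.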

## First-Phase Scope

The first batch of scenarios focuses on Mnemon's current self-iteration work:

- memory preference recall
- skill creation and reuse
- bilingual documentation synchronization
- host projection smoke checks

These scenarios currently evaluate mainly memory-loop and skill-loop, but the
eval-loop framework itself is more general. It can also evaluate setup, host
adapters, docs workflow, commit discipline, and eval-loop itself.
16 changes: 16 additions & 0 deletions harness/eval/README.md
@@ -2,6 +2,16 @@

This directory documents eval modes for host-wrapped loop testing.

The canonical eval loop module lives under:

```text
harness/modules/eval-loop/
```

Use `harness/eval/` for project-local runner notes and app-server operation
details. Use `harness/modules/eval-loop/` for reusable eval-loop policy,
scenarios, suites, rubrics, protocol skills, and lifecycle guidance.

## Codex App-Server Eval

The Codex app-server eval uses the real Codex app-server protocol instead of a
@@ -38,6 +48,12 @@ Run the longer skill-loop regression suite with:
make codex-skill-deep-eval
```

Run the eval-loop projection smoke check with:

```bash
make codex-eval-loop-smoke
```

To run an actual Codex turn, use:

```bash
90 changes: 88 additions & 2 deletions harness/hosts/codex/projector.sh
@@ -20,6 +20,9 @@ Memory loop install options:
Skill loop install options:
--host-skills-dir DIR

Eval loop install options:
--host-skills-dir DIR

Uninstall options:
--purge-memory
--purge-library
@@ -95,7 +98,7 @@ if [[ -z "${MODULE}" ]]; then
usage >&2
exit 2
fi
-if [[ "${MODULE}" != "memory-loop" && "${MODULE}" != "skill-loop" ]]; then
+if [[ "${MODULE}" != "memory-loop" && "${MODULE}" != "skill-loop" && "${MODULE}" != "eval-loop" ]]; then
echo "unsupported module for Codex: ${MODULE}" >&2
exit 1
fi
@@ -321,6 +324,61 @@ EOF
echo "Host skills: ${HOST_SKILLS_DIR}"
}

install_eval_loop() {
  ensure_python
  [[ -n "${HOST_SKILLS_DIR}" ]] || HOST_SKILLS_DIR="${CONFIG_DIR}/skills"
  copy_common_canonical_assets
  mkdir -p \
    "${CANONICAL_MODULE_DIR}/scratch" \
    "${CANONICAL_MODULE_DIR}/candidates" \
    "${CANONICAL_MODULE_DIR}/reports" \
    "${CANONICAL_MODULE_DIR}/artifacts" \
    "${CANONICAL_MODULE_DIR}/retired" \
    "${CANONICAL_MODULE_DIR}/scenarios" \
    "${CANONICAL_MODULE_DIR}/suites" \
    "${CANONICAL_MODULE_DIR}/rubrics" \
    "${HOST_SKILLS_DIR}/eval_plan" \
    "${HOST_SKILLS_DIR}/eval_run" \
    "${HOST_SKILLS_DIR}/eval_analyze" \
    "${HOST_SKILLS_DIR}/eval_improve" \
    "${CONFIG_DIR}/mnemon-eval-loop"

  cp -R "${MODULE_DIR}/scenarios/." "${CANONICAL_MODULE_DIR}/scenarios/"
  cp -R "${MODULE_DIR}/suites/." "${CANONICAL_MODULE_DIR}/suites/"
  cp -R "${MODULE_DIR}/rubrics/." "${CANONICAL_MODULE_DIR}/rubrics/"

  write_runtime_env "${CONFIG_DIR}/mnemon-eval-loop" "MNEMON_EVAL_LOOP_ENV" "MNEMON_EVAL_LOOP_DIR"
  install_file "${MODULE_DIR}/GUIDE.md" "${CONFIG_DIR}/mnemon-eval-loop/GUIDE.md" 0644
  cat >> "${CONFIG_DIR}/mnemon-eval-loop/env.sh" <<EOF
export MNEMON_EVAL_LOOP_SCRATCH_DIR="${CANONICAL_MODULE_DIR}/scratch"
export MNEMON_EVAL_LOOP_CANDIDATES_DIR="${CANONICAL_MODULE_DIR}/candidates"
export MNEMON_EVAL_LOOP_REPORTS_DIR="${CANONICAL_MODULE_DIR}/reports"
export MNEMON_EVAL_LOOP_ARTIFACTS_DIR="${CANONICAL_MODULE_DIR}/artifacts"
export MNEMON_EVAL_LOOP_RETIRED_DIR="${CANONICAL_MODULE_DIR}/retired"
export MNEMON_EVAL_LOOP_SCENARIOS_DIR="${CANONICAL_MODULE_DIR}/scenarios"
export MNEMON_EVAL_LOOP_SUITES_DIR="${CANONICAL_MODULE_DIR}/suites"
export MNEMON_EVAL_LOOP_RUBRICS_DIR="${CANONICAL_MODULE_DIR}/rubrics"
export MNEMON_EVAL_LOOP_HOST_SKILLS_DIR="${HOST_SKILLS_DIR}"
export MNEMON_EVAL_LOOP_DEFAULT_HOST="${MNEMON_EVAL_LOOP_DEFAULT_HOST:-codex}"
export MNEMON_EVAL_LOOP_DEFAULT_SUITE="${MNEMON_EVAL_LOOP_DEFAULT_SUITE:-smoke}"
EOF

  install_file "${MODULE_DIR}/skills/eval_plan.md" "${HOST_SKILLS_DIR}/eval_plan/SKILL.md" 0644
  install_file "${MODULE_DIR}/skills/eval_run.md" "${HOST_SKILLS_DIR}/eval_run/SKILL.md" 0644
  install_file "${MODULE_DIR}/skills/eval_analyze.md" "${HOST_SKILLS_DIR}/eval_analyze/SKILL.md" 0644
  install_file "${MODULE_DIR}/skills/eval_improve.md" "${HOST_SKILLS_DIR}/eval_improve/SKILL.md" 0644
  append_codex_runtime_note "${HOST_SKILLS_DIR}/eval_plan/SKILL.md" "MNEMON_EVAL_LOOP_DIR" "${CONFIG_DIR}/mnemon-eval-loop/env.sh"
  append_codex_runtime_note "${HOST_SKILLS_DIR}/eval_run/SKILL.md" "MNEMON_EVAL_LOOP_DIR" "${CONFIG_DIR}/mnemon-eval-loop/env.sh"
  append_codex_runtime_note "${HOST_SKILLS_DIR}/eval_analyze/SKILL.md" "MNEMON_EVAL_LOOP_DIR" "${CONFIG_DIR}/mnemon-eval-loop/env.sh"
  append_codex_runtime_note "${HOST_SKILLS_DIR}/eval_improve/SKILL.md" "MNEMON_EVAL_LOOP_DIR" "${CONFIG_DIR}/mnemon-eval-loop/env.sh"

  write_host_manifest "${CONFIG_DIR}"
  echo "Installed Mnemon eval loop for Codex."
  echo "Config: ${CONFIG_DIR}"
  echo "State: ${CANONICAL_MODULE_DIR}"
  echo "Host skills: ${HOST_SKILLS_DIR}"
}
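
Reviewer note: a smoke check for this projection could be as small as the Python sketch below, which lists the paths `install_eval_loop` is expected to leave behind and reports any that are missing. The directory and skill names are taken from the function above; `expected_paths` and `missing_paths` are hypothetical helpers, and the sketch assumes `write_runtime_env` creates `env.sh` inside `mnemon-eval-loop`, which this diff does not show.

```python
import tempfile
from pathlib import Path
from typing import List

# Names copied from install_eval_loop.
STATE_DIRS = [
    "scratch", "candidates", "reports", "artifacts",
    "retired", "scenarios", "suites", "rubrics",
]
SKILLS = ["eval_plan", "eval_run", "eval_analyze", "eval_improve"]
# Assumption: env.sh is created by write_runtime_env (not shown in the diff).
CONFIG_FILES = ["env.sh", "GUIDE.md"]


def expected_paths(state_dir: Path, host_skills_dir: Path, config_dir: Path) -> List[Path]:
    """Every directory or file the install step should have produced."""
    paths = [state_dir / name for name in STATE_DIRS]
    paths += [host_skills_dir / skill / "SKILL.md" for skill in SKILLS]
    paths += [config_dir / "mnemon-eval-loop" / name for name in CONFIG_FILES]
    return paths


def missing_paths(state_dir: Path, host_skills_dir: Path, config_dir: Path) -> List[Path]:
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_paths(state_dir, host_skills_dir, config_dir) if not p.exists()]


# Demonstration against a throwaway directory instead of a real install.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    state, skills, config = root / "state", root / "skills", root / "config"
    before = missing_paths(state, skills, config)
    for path in expected_paths(state, skills, config):
        if path.suffix:  # SKILL.md, env.sh, GUIDE.md are files
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        else:  # everything else is a directory
            path.mkdir(parents=True, exist_ok=True)
    after = missing_paths(state, skills, config)

print(len(before), len(after))
```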

status_module() {
echo "Codex ${MODULE}:"
echo " config: ${CONFIG_DIR}"
@@ -375,12 +433,40 @@ uninstall_skill_loop() {
echo "Removed Mnemon skill loop from ${CONFIG_DIR}."
}

uninstall_eval_loop() {
  local env_path="${CONFIG_DIR}/mnemon-eval-loop/env.sh"
  if [[ -f "${env_path}" ]]; then
    # shellcheck source=/dev/null
    source "${env_path}"
  fi
  local host_skills_dir="${MNEMON_EVAL_LOOP_HOST_SKILLS_DIR:-${HOST_SKILLS_DIR:-${CONFIG_DIR}/skills}}"
  rm -rf "${host_skills_dir}/eval_plan"
  rm -rf "${host_skills_dir}/eval_run"
  rm -rf "${host_skills_dir}/eval_analyze"
  rm -rf "${host_skills_dir}/eval_improve"
  rm -rf "${CONFIG_DIR}/mnemon-eval-loop"
  rm -rf "${CANONICAL_MODULE_DIR}/scenarios"
  rm -rf "${CANONICAL_MODULE_DIR}/suites"
  rm -rf "${CANONICAL_MODULE_DIR}/rubrics"
  rm -f "${CANONICAL_MODULE_DIR}/GUIDE.md" "${CANONICAL_MODULE_DIR}/env.sh" "${CANONICAL_MODULE_DIR}/module.json"
  rmdir "${CANONICAL_MODULE_DIR}/retired" 2>/dev/null || true
  rmdir "${CANONICAL_MODULE_DIR}/artifacts" 2>/dev/null || true
  rmdir "${CANONICAL_MODULE_DIR}/reports" 2>/dev/null || true
  rmdir "${CANONICAL_MODULE_DIR}/candidates" 2>/dev/null || true
  rmdir "${CANONICAL_MODULE_DIR}/scratch" 2>/dev/null || true
  rmdir "${CANONICAL_MODULE_DIR}" 2>/dev/null || true
  remove_host_manifest_module
  echo "Removed Mnemon eval loop from ${CONFIG_DIR}."
}
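
Reviewer note: the split between `rm -rf` and `rmdir ... || true` above appears intended to delete the assets the installer copied while keeping any working directory that still holds user data. The Python sketch below mirrors that behavior for the state directory only; `uninstall_state` is a hypothetical helper written for illustration and is not part of the projector.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

INSTALLED = ["scenarios", "suites", "rubrics"]  # copied by install, removed outright
WORKING = ["retired", "artifacts", "reports", "candidates", "scratch"]  # may hold user data


def uninstall_state(state_dir: Path) -> List[str]:
    """Remove installed assets; return working directories kept because they were not empty."""
    for name in INSTALLED:
        shutil.rmtree(state_dir / name, ignore_errors=True)
    for name in ("GUIDE.md", "env.sh", "module.json"):
        (state_dir / name).unlink(missing_ok=True)  # requires Python 3.8+
    kept = []
    for name in WORKING:
        try:
            (state_dir / name).rmdir()  # like rmdir(1): succeeds only when empty
        except FileNotFoundError:
            pass
        except OSError:
            kept.append(name)
    try:
        state_dir.rmdir()  # the state root goes away only if nothing survived
    except OSError:
        pass
    return kept


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "eval-loop"
    for name in INSTALLED + WORKING:
        (state / name).mkdir(parents=True)
    (state / "reports" / "run-1.json").write_text("{}")  # simulated user data
    (state / "module.json").write_text("{}")
    kept = uninstall_state(state)
    state_survived = state.exists()
    scenarios_gone = not (state / "scenarios").exists()

print(kept)
```

One consequence worth noting: unless a purge option is added for eval-loop, reports and candidates left in the state directory persist across uninstall and reinstall.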

case "${ACTION}:${MODULE}" in
install:memory-loop) install_memory_loop ;;
install:skill-loop) install_skill_loop ;;
-status:memory-loop|status:skill-loop) status_module ;;
+install:eval-loop) install_eval_loop ;;
+status:memory-loop|status:skill-loop|status:eval-loop) status_module ;;
uninstall:memory-loop) uninstall_memory_loop ;;
uninstall:skill-loop) uninstall_skill_loop ;;
+uninstall:eval-loop) uninstall_eval_loop ;;
*)
echo "unsupported action/module: ${ACTION}/${MODULE}" >&2
exit 1
3 changes: 2 additions & 1 deletion harness/modules/README.md
@@ -5,7 +5,8 @@ This directory contains canonical, host-agnostic loop modules.
```text
harness/modules/
├── memory-loop/
-└── skill-loop/
+├── skill-loop/
+└── eval-loop/
```

Each module follows the Loop Module Standard and declares its assets in