SummerOneTwo · Apr 28, 2026
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "autocode",
-  "version": "0.7.0",
+  "version": "0.8.0",
   "description": "Claude Code plugin for competitive programming problem-setting workflows.",
   "author": {
     "name": "SummerOneTwo",

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.8.0] - 2026-04-28
+
+### Improvements
+
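+- **Final test data ratio constraint**: the sampling strategy of `problem_generate_tests` now prioritizes keeping `type=3/4` (extreme + tle) cases at no less than half of the final test set (best effort when candidates are insufficient), and returns the statistics fields `limit_case_count`, `limit_case_minimum_required`, and `limit_case_quota_met`.
+- **Hard constraint at verification**: `problem_verify_tests` adds a `limit_ratio` check (enabled by default) that uses the generation manifest to enforce that `type=3/4` cases make up at least half of the final tests; verification fails outright if they do not. Pass `enable_limit_ratio=false` to disable it explicitly.
+- **Docs and workflow sync**: updated the README, workflow skill, agent prompt, and prompts text so they consistently state the quality bar that at least half of the final tests are limit cases.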
+
 ## [0.7.0] - 2026-04-27
 
 ### Features
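
Reviewer note: the sampling rule in the 0.8.0 entry can be illustrated with a minimal sketch. This is not the code from this commit. The function name `sample_final_tests`, the `(name, type)` candidate shape, and the choice to compute the quota as `ceil(test_count / 2)` are assumptions made for illustration; only the three `limit_case_*` field names come from the changelog.

```python
import math

LIMIT_TYPES = {3, 4}  # extreme and tle, as named in the changelog


def sample_final_tests(candidates, test_count):
    """Pick up to test_count candidates, filling the limit-case quota first.

    candidates is a list of (name, type) tuples (hypothetical shape).
    """
    required = math.ceil(test_count / 2)
    limit = [c for c in candidates if c[1] in LIMIT_TYPES]
    other = [c for c in candidates if c[1] not in LIMIT_TYPES]

    # Step 1: take limit cases up to the quota (best effort if too few exist).
    chosen = limit[:required]
    # Step 2: fill the remaining slots with non-limit cases in input order.
    chosen += other[: test_count - len(chosen)]
    # Step 3: if slots are still open, use any leftover limit cases.
    chosen += limit[required : required + (test_count - len(chosen))]

    limit_count = sum(1 for c in chosen if c[1] in LIMIT_TYPES)
    return {
        "tests": chosen,
        "limit_case_count": limit_count,
        "limit_case_minimum_required": required,
        "limit_case_quota_met": limit_count >= required,
    }
```

With three limit candidates and `test_count=4`, the sketch selects two limit cases and two others; with a single limit candidate it still returns what it can and reports `limit_case_quota_met` as false.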

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -102,7 +102,7 @@ AutoCode/
 5. Build the generator (`generator_build`)
 6. Run stress tests (`stress_test_run`, completed_rounds == total_rounds)
 7. Build a checker if needed (`checker_build`, accuracy >= 0.9)
-8. Generate test data (`problem_generate_tests`, generated_test_count > 0)
+8. Generate test data (`problem_generate_tests`, generated_test_count > 0, with extreme/tle cases making up at least half of the final set; best effort when candidates are insufficient)
 9. Verify test data (`problem_verify_tests`, passed)
 10. Package for Polygon (`problem_pack_polygon`)
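
Reviewer note: steps 8 and 9 above interact through the limit-case ratio, which the changelog says `problem_verify_tests` enforces by default. The following is a rough sketch of such a check, not the code from this commit. The function name `check_limit_ratio` and the flat list of type codes standing in for the manifest are hypothetical; only the `enable_limit_ratio` switch and the "at least half are type 3/4" rule come from the diff.

```python
import math


def check_limit_ratio(manifest_types, enable_limit_ratio=True):
    """Return a pass/fail result for the share of type 3/4 tests.

    manifest_types is a list of integer type codes, one per final test
    (a simplified stand-in for the generation manifest).
    """
    if not enable_limit_ratio:
        # The check is skipped only when explicitly disabled.
        return {"passed": True, "skipped": True}

    required = math.ceil(len(manifest_types) / 2)
    limit_count = sum(1 for t in manifest_types if t in (3, 4))
    return {
        "passed": limit_count >= required,
        "skipped": False,
        "limit_case_count": limit_count,
        "limit_case_minimum_required": required,
    }
```

Unlike the best-effort sampling in step 8, a check of this kind has no fallback: a set that misses the quota fails unless the caller turns the check off.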
 

diff --git a/README.md b/README.md
@@ -246,7 +246,8 @@ AutoCode 提供 15 个原子工具，分为 7 组。所有工具返回统一格
 | Tool | Description | Key parameters |
 |------|------|----------|
 | `problem_create` | Initialize the problem directory | `problem_dir`, `problem_name` |
-| `problem_generate_tests` | Generate final test data | `problem_dir`, `test_count` |
+| `problem_generate_tests` | Generate final test data (extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient) | `problem_dir`, `test_count` |
+| `problem_verify_tests` | Verify test data quality (including a hard check on the extreme/tle ratio) | `problem_dir`, `tests_dir`, `verify_types` |
 | `problem_pack_polygon` | Package in Polygon format | `problem_dir`, `time_limit`, `memory_limit` |
 
 ## Workflow tutorial: the A+B problem
@@ -378,6 +379,8 @@ problem_generate_tests(
 )
 ```
 
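+Note: among the tests finally written, `extreme` (type=3) and `tle` (type=4) together make up no less than half; if the candidates lack enough limit cases, the tool satisfies the quota as far as the available candidates allow and returns the corresponding statistics fields.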
+
 ### Step 7: Package in Polygon format
 
 ```python
@@ -477,6 +480,8 @@ problem_pack_polygon(
 | `extreme` | 3 | Edge cases: overflow, precision, hash collisions |
 | `tle` | 4 | Performance tests designed to induce TLE |
 
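+The default sampling strategy of `problem_generate_tests` first ensures that `extreme` + `tle` make up at least 50% of the final test set; the remaining slots are then balanced according to the configuration (or filled in a deterministic order).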
+
 ### File structure
 
 ```

diff --git a/agents/autocode-workflow.md b/agents/autocode-workflow.md
@@ -25,4 +25,6 @@ Always work through this sequence unless the task is explicitly outside problem
 
 When the user asks for a later step directly, explain which prerequisite step is missing and complete the missing work first.
 
+When running `problem_generate_tests`, enforce test quality: final test data should contain at least half limit-oriented cases (`type=3` extreme + `type=4` tle) when candidate availability allows.
+
 Treat hook feedback as authoritative. If a hook denies a tool call, fix the workflow gap instead of retrying the same call.
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "autocode-mcp"
-version = "0.7.0"
+version = "0.8.0"
 description = "MCP Server for competitive programming problem creation, based on AutoCode paper"
 readme = "README.md"
 requires-python = ">=3.10"

diff --git a/scripts/workflow_guard.py b/scripts/workflow_guard.py
@@ -270,7 +270,7 @@ def session_start() -> int:
         "stress_test_run(completed_rounds == total_rounds) -> "
         "checker_build if needed (accuracy >= 0.9) -> "
         "problem_validate(validation_passed) -> "
-        "problem_generate_tests(generated_test_count > 0) -> "
+        "problem_generate_tests(generated_test_count > 0, and prefer >=50% type3/type4 in final tests when candidates are sufficient) -> "
         "problem_verify_tests(passed) -> problem_pack_polygon. "
         "If a hook blocks a step, complete the missing prerequisite instead of retrying blindly."
     )

diff --git a/skills/autocode-workflow/SKILL.md b/skills/autocode-workflow/SKILL.md
@@ -61,7 +61,7 @@ Based on the paper "AutoCode: LLMs as Problem Setters for Competitive Programmin
 │  Phase 8: Test Generation                                                    │
 │  ┌────────────────────┴────────────────────┐                                │
 │  │        problem_generate_tests            │ Generate final test data      │
-│  │     (dedup + validator filter + balance) │                               │
+│  │ (dedup + validator filter + extreme>=50%)│                               │
 │  └────────────────────┬────────────────────┘                                │
 │                       │                                                      │
 │  Phase 9: Packaging                                                          │
@@ -235,6 +235,7 @@ Required: problem_dir
 Recommended: test_count=50, enable_dedup=true, enable_validator_filter=true
 Output: tests/01.in ~ tests/50.in + corresponding .ans files
 Verify: Check generated_tests count matches test_count
+Quality Gate: In final tests, type 3/4 (extreme + tle) should be >= ceil(test_count/2) when candidates are sufficient
 ```
 
 ### Phase 9: Packaging
@@ -283,7 +284,7 @@ Generate 3-5 mutant solutions with common bugs:
 | 5 | `stress_test_run` | Step 4 | `"All N rounds passed"` |
 | 6 | `checker_build` (optional) | Step 5 | `accuracy >= 0.9` |
 | 7 | `problem_validate` | Step 5 or 6 | `success=true`, all samples passed |
-| 8 | `problem_generate_tests` | Step 7 | `generated_tests == test_count` |
+| 8 | `problem_generate_tests` | Step 7 | `generated_tests == test_count` and `type3+type4 >= ceil(test_count/2)` (if candidates sufficient) |
 | 9 | `problem_pack_polygon` | Step 8 | `success=true` |
 
 ### FORBIDDEN Actions
@@ -335,6 +336,7 @@ Before considering the problem complete:
 - [ ] Statement samples validated (problem_validate passed)
 - [ ] Sample files validated (problem_validate passed)
 - [ ] Final test data generated (50+ tests)
+- [ ] Final test data has at least 50% extreme/tle cases when candidate pool allows
 - [ ] Polygon package created
 
 ## Example Complete Workflow

diff --git a/src/autocode_mcp/__init__.py b/src/autocode_mcp/__init__.py
@@ -6,7 +6,7 @@
 """
 import os
 
-__version__ = "0.7.0"
+__version__ = "0.8.0"
 
 # Get the templates directory path (a directory inside the package)
 _PACKAGE_DIR = os.path.dirname(__file__)

diff --git a/src/autocode_mcp/prompts/__init__.py b/src/autocode_mcp/prompts/__init__.py
@@ -62,7 +62,8 @@
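 ## 3. Post-processing
 - Filter invalid inputs with the Validator
 - Deduplicate (by signature)
-- Balance the distribution
+- First ensure at least half of the final tests are extreme/tle (type=3/4; best effort when candidates are insufficient)
+- Then balance the distribution
 - Sample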
 
 ## Quality metrics
@@ -141,8 +142,9 @@
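 ### Post-processing
 1. Validator filtering
 2. Deduplication (MD5 signature)
-3. Balance the distribution
-4. Sample
+3. First ensure extreme/tle (type=3/4) make up no less than half of the final tests (best effort when candidates are insufficient)
+4. Balance the distribution over the remaining slots
+5. Sample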
 """
 
 # Checker build prompt