llm-as-judge

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution. Use when working on "architecture", "system design", "complex feature", "security review", "production deployment", financial/trading systems, or when stuck after 3+ attempts. NOT for simple edits, config changes, or routine tasks.

Author: admin | Source: ClawHub
Version: V 1.2.0
Security check: passed


# LLM-as-Judge

**Core principle:** Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

## Activation Criteria

**Use this pattern when:**

- Architecture or system design decisions
- Multi-file changes affecting >5 files or >500 LOC
- Security-critical code (auth, payments, crypto/DeFi)
- Financial/trading systems (market making, quant strategies)
- Planning documents that will drive weeks of work
- Stuck after 3+ failed attempts on the same problem

**Skip when:**

- Simple edits, config tweaks, bug fixes with an obvious cause
- Documentation updates
- Single-file changes under 100 LOC
- Tasks where self-review is sufficient

## The Pattern

```
Executor (Model A) → Output → Judge (Model B) → Verdict → Action
```

**Verdicts:** APPROVE | REVISE (with specific feedback) | REJECT (restart)

## Model Pairing

Use a different provider than the executor to avoid shared blind spots:

- **Executor: Claude** → Judge: `kimi`, `grok`, or `gemini-pro`
- **Executor: Kimi/Gemini** → Judge: `opus`
- **Principle:** Different provider, similar capability tier

## Judge Prompt Templates

### Plan/Architecture Review

See `references/judge-prompts.md` for full templates covering:

- Plan completeness, feasibility, risk, testing strategy
- Architecture review with scoring (0-10 per dimension)
- Code review checklist (correctness, design, safety, maintainability)

## Integration Points

- **With adversarial review:** This IS the formalized version of "spawn a separate model to review"
- **With planning-protocol:** Judge reviews the plan before the Execute phase
- **With coding workflows:** Code → cross-model review → fix findings → test → build → push

## Quick Decision

```
Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)
```

## Gotchas

- **Same provider defeats the purpose** — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
- **Vague judge output is useless** — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions plus specific actionable items, even when approving.
- **Judge scope creep** — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
- **Approval rate drift** — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target a 60-70% approval rate.
- **Don't judge trivial tasks** — A 50-line CSS fix doesn't need cross-model review. Apply the activation criteria in this skill strictly.
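The verdict contract above (a mandatory APPROVE / REVISE / REJECT line plus specific, actionable feedback items) can be sketched as a small prompt builder and parser. This is a minimal illustration, not part of the skill's files; `build_judge_prompt`, `parse_verdict`, and `Verdict` are hypothetical names introduced here for the example.

```python
# Sketch of the judge-side contract: force a structured verdict, then parse it.
# All names here are illustrative, not defined by the llm-as-judge skill itself.
import re
from dataclasses import dataclass


@dataclass
class Verdict:
    decision: str        # "APPROVE" | "REVISE" | "REJECT"
    feedback: list[str]  # specific actionable items from the judge


def build_judge_prompt(artifact: str, kind: str = "plan") -> str:
    """Build a review prompt that constrains the judge to a verdict,
    scored dimensions, and feedback bullets (no rewritten solution)."""
    return (
        f"You are reviewing this {kind}. Do NOT rewrite it.\n"
        "Score each dimension 0-10: correctness, design, safety, maintainability.\n"
        "Then output exactly one line 'VERDICT: APPROVE|REVISE|REJECT',\n"
        "followed by '- ' bullets of specific, actionable feedback.\n\n"
        f"---\n{artifact}\n---"
    )


def parse_verdict(judge_output: str) -> Verdict:
    """Extract the verdict line and feedback bullets from the judge's reply.
    Falls back to REVISE when no verdict line is found, so an unparseable
    review never silently counts as approval."""
    match = re.search(r"VERDICT:\s*(APPROVE|REVISE|REJECT)", judge_output)
    decision = match.group(1) if match else "REVISE"
    feedback = [
        line.strip()[2:].strip()
        for line in judge_output.splitlines()
        if line.strip().startswith("- ")
    ]
    return Verdict(decision, feedback)
```

Defaulting to REVISE on a missing verdict line is one way to handle the "vague judge output" gotcha: a reply without the required structure is treated as a request for another pass rather than an approval.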

Tags

skill ai

Install via Conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Option 1: install SkillHub and the skill

Help me install SkillHub and the llm-as-judge-1776296368 skill

Option 2: set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then install the llm-as-judge-1776296368 skill

Install via Command Line

skillhub install llm-as-judge-1776296368

Download Zip Package

⬇ Download llm-as-judge v1.2.0

File size: 3.7 KB | Published: 2026-04-16 18:36

v1.2.0 (latest) 2026-04-16 18:36
Remove project-specific references (QuantFlow, internal agent names). Fully generic and framework-agnostic. Activation criteria, model pairing, and gotchas unchanged.
