Benchmark Model Provider
Use this skill to help users choose the most suitable model for their own daily workflow instead of giving generic "best model" advice: which model researches better, writes more reliable reports, codes better, runs cheaper or faster, or is worth adopting long term. It does not answer from gut feel; it builds a benchmark around the user's real tasks, preserves raw results, then scores, reranks, and produces clear, reviewable, shareable reports.
Treat the benchmark as a personal decision framework:
- derive the benchmark from the user's real work
- keep the run auditable
- preserve raw outputs for reranking
- generate outputs that can be reviewed, shared, and published cleanly
What this skill is for
People often ask questions like:
- Which model is smarter?
- Which model is cheaper to run daily?
- Which model gives deeper, more useful answers for my job?
- Should I use a local model or a service model?
This skill exists to answer those questions with a repeatable benchmark process, not with vague preferences.
Core operating flow
1. Collect benchmark context:
   - purpose
   - domain
   - usage frequency
2. Build or select a benchmark spec with 5–10 domain-specific questions (a rough spec sketch follows this list).
3. List currently available providers/models from trusted local OpenClaw context when allowed.
4. Ask whether the user wants to use the current list or add more models.
5. Verify every user-supplied model before running; if the name does not match, ask again or suggest the closest valid model id.
6. Run each model independently on the same benchmark set.
7. Preserve raw outputs and metrics so the run can be audited and reranked later.
8. Score results across quality, depth, cost, and speed metrics.
9. Build reports in markdown / HTML / PDF.
10. Optionally suggest simple ways to publish the generated HTML report (Vercel, Netlify, Cloudflare Pages, GitHub Pages) if the user wants a shareable link.
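To make the flow concrete, here is a minimal sketch of what a generated benchmark spec might contain. The field names and values are illustrative assumptions; the authoritative structure is defined in `references/benchmark-schema.md`.

```python
# Illustrative only: field names are assumptions, not the authoritative
# schema (see references/benchmark-schema.md for the real structure).
benchmark_spec = {
    "version": 1,
    "context": {                      # collected in step 1
        "purpose": "daily research and report writing",  # hypothetical value
        "domain": "market analysis",                      # hypothetical value
        "usage_frequency": "daily",
    },
    "mode": "prompt_only",            # default benchmark mode
    "execution_strategy": "sequential",
    "models": [                       # confirmed with the user in steps 3-5
        {"provider": "provider-x", "id": "model-a"},      # placeholder ids
        {"provider": "provider-y", "id": "model-b"},
    ],
    "questions": [                    # 5-10 domain-specific questions (step 2)
        "Summarize the key risks in this quarterly report.",
        "Draft an outline for a competitor analysis.",
    ],
}
```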
Default decisions
| Area | Default |
|---|---|
| Benchmark mode | `prompt_only` |
| Overall scoring | quality + depth + cost |
| Speed handling | measured and reported, excluded from the default overall score |
| Execution strategy | `sequential` unless orchestration is needed |
| Web publish target | no built-in publish; suggest Vercel / Netlify / Cloudflare Pages / GitHub Pages |
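As a rough illustration of the scoring default, the sketch below shows speed measured but excluded from the overall score. The specific weight values are assumptions; the real weights and normalization rules live in `references/scoring-rubric.md`.

```python
# Sketch of the default overall score: quality + depth + cost, speed excluded.
# The weight values below are illustrative assumptions, not the real rubric.
DEFAULT_WEIGHTS = {"quality": 0.4, "depth": 0.3, "cost": 0.3, "speed": 0.0}

def overall_score(normalized: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over metric scores already normalized to a 0-1 range."""
    return sum(w * normalized.get(name, 0.0) for name, w in weights.items())
```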
Workflow rules
Benchmark input rules
- Default to `prompt_only` unless the user explicitly wants `agent_context` (both modes are sketched after this list).
- In `prompt_only`, send only the raw prompt.
- Do not inject extra context, memory, few-shot examples, or hidden scaffolding in `prompt_only` mode.
- In `agent_context`, use one fixed shared system/context layer for all compared models and record it in metadata.
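A minimal sketch of these two modes, assuming an OpenAI-style messages payload (the function name and payload shape are illustrative, not the skill's actual code):

```python
# Sketch only: assumes an OpenAI-style "messages" payload; the real request
# format depends on the endpoint configured in the benchmark spec.
def build_messages(prompt: str, mode: str, shared_context: str = "") -> list:
    if mode == "prompt_only":
        # Raw prompt only: no extra context, memory, few-shot examples,
        # or hidden scaffolding.
        return [{"role": "user", "content": prompt}]
    if mode == "agent_context":
        # One fixed shared system/context layer, identical for every compared
        # model; record the layer itself in the run metadata.
        return [
            {"role": "system", "content": shared_context},
            {"role": "user", "content": prompt},
        ]
    raise ValueError(f"unknown benchmark mode: {mode}")
```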
Execution rules
- Support both `sequential` and `subagent_orchestrated` execution strategies.
- Allow bounded parallel execution for subagents (for example `--max-parallel 4`) when the endpoint can tolerate it.
- Treat `rerank` as a first-class operation; do not rerun models when only the scoring formula changes.
- Report progress at every major step so the user never feels the process is hanging.
- During batch execution, surface a clear update whenever one agent/model finishes.
- Normalize model ids before calling the endpoint when the provider catalog exposes raw model ids but the user/runtime spec may contain provider-prefixed names (see the sketch after this list).
- If the endpoint returns a naming/provider mismatch error, explain the mismatch clearly instead of leaving only a raw 502/unknown-provider error.
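The sketch below illustrates the id-normalization and bounded-parallelism rules. The `provider/model` prefix convention and the `run_one` helper are assumptions introduced for illustration, not the skill's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def normalize_model_id(spec_model_id: str) -> str:
    # Assumption: specs may carry provider-prefixed names such as
    # "provider-x/model-a" while the endpoint catalog expects raw "model-a".
    return spec_model_id.split("/", 1)[-1]

def run_all(models, run_one, max_parallel: int = 4) -> dict:
    """Bounded parallel execution, mirroring the --max-parallel idea.
    run_one is a hypothetical callable that benchmarks a single model id."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(run_one, normalize_model_id(m)): m for m in models}
        for done in as_completed(futures):
            model = futures[done]
            results[model] = done.result()
            print(f"finished: {model}")  # progress update per finished model
    return results
```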
Output rules
- Mark every estimated metric clearly (see the sketch after this list).
- Rewrite reports/landing pages to the newest snapshot.
- Do not append patch fragments to stale output.
- Reports should include: ranking table, cost table, executive summary, overall assessment, recommended model selection, and full answer details.
- Default the report language to the user's current conversation language.
- Only switch the report language when the user explicitly asks for a different language or a bilingual output.
- PDF output must use Unicode-capable fonts so Vietnamese, Chinese, and multilingual content render correctly.
- Multilingual support means the renderer can display multiple languages correctly; it does not mean the skill should arbitrarily change the report language.
- Ask before delivering externally via Vercel or other web publishing.
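One way to honor the "mark every estimated metric" rule is to carry an explicit flag on each metric and render estimates with a visible marker. The field names below are illustrative assumptions, not the real metrics schema.

```python
# Illustrative metric record; field names are assumptions, not the real schema.
metric = {
    "name": "cost_usd",
    "value": 0.0123,
    "estimated": True,                 # set when provider usage was missing
    "source": "token-count estimate",
}

def format_metric(m: dict) -> str:
    """Prefix estimated values so reports never present them as exact."""
    marker = "~" if m.get("estimated") else ""
    return f"{marker}{m['value']} ({m['name']})"

print(format_metric(metric))  # -> ~0.0123 (cost_usd)
```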
Safety and trust boundary
This skill may perform network I/O depending on how the benchmark spec is configured.
Safe-by-design intent
- Example specs should use placeholder endpoints, not a private hardcoded runtime.
- The user should supply only trusted API endpoints and credentials.
- Publishing should happen only when the user explicitly wants delivery.
Important runtime notes
- `run_benchmark.py` sends prompts to the `base_url` configured in the benchmark spec (a rough sketch follows this list).
- This skill does not publish to Vercel/Netlify/Cloudflare/GitHub automatically. It only generates local HTML/PDF artifacts.
- If you want a shareable link, publish the generated HTML folder using one of these services: Vercel, Netlify, Cloudflare Pages, or GitHub Pages.
- Only run the skill with endpoints, tokens, and outputs you trust.
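For orientation, here is a minimal sketch of the kind of request `run_benchmark.py` makes against the configured `base_url`. It assumes an OpenAI-compatible chat-completions endpoint and a `BENCHMARK_API_KEY` environment variable; both are assumptions, so check `references/runtime-safety.md`, `references/environment-vars.md`, and the script itself for the real behavior.

```python
import os
import requests

def send_prompt(base_url: str, model_id: str, prompt: str) -> str:
    # Assumptions: OpenAI-compatible endpoint path and payload, and a
    # BENCHMARK_API_KEY variable; the real base_url comes from the spec.
    response = requests.post(
        f"{base_url.rstrip('/')}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['BENCHMARK_API_KEY']}"},
        json={"model": model_id,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```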
For detailed runtime assumptions, read:
- `references/runtime-safety.md`
- `references/environment-vars.md`
- `references/pricing-sources.md`
What to read
Read only what you need:
- `references/initial-project-spec.md` — authoritative design baseline
- `references/benchmark-schema.md` — benchmark spec structure, run artifacts, file layout
- `references/scoring-rubric.md` — scoring model, normalization rules, default weights
- `references/pricing-sources.md` — pricing precedence and estimation policy
- `references/execution-modes.md` — benchmark modes, execution strategies, operational modes
- `references/output-modes.md` — delivery choices, publish rules, progress feedback rules
- `references/runtime-safety.md` — trust boundaries, network behavior, safe usage guidance
- `references/environment-vars.md` — expected environment variables and dependency notes
- `examples/*.yaml` — benchmark context templates and ready-made examples in multiple languages
Scripts
| Script | Purpose |
|---|---|
| `scripts/build_benchmark_spec.py` | Build a benchmark spec from benchmark context |
| `scripts/run_benchmark.py` | Execute benchmark runs and write raw outputs/metrics |
| `scripts/estimate_tokens.py` | Estimate token counts when provider usage is missing |
| `scripts/resolve_pricing.py` | Resolve pricing sources and compute estimated/official pricing |
| `scripts/score_models.py` | Combine raw metrics and rubric scores into rankings |
| `scripts/build_report.py` | Build markdown, HTML, and PDF report artifacts |
| `scripts/publish_report.py` | No deployment automation; export/copy the PDF and print suggested static hosting options (Vercel/Netlify/Cloudflare Pages/GitHub Pages) |
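The table implies a rough pipeline order. The sketch below chains the scripts in that order; every flag and file path shown is an assumption made for illustration, so check each script's actual interface before relying on it.

```python
import subprocess

# Pipeline order implied by the table above. Flags and paths are assumptions.
steps = [
    ["python", "scripts/build_benchmark_spec.py", "--context", "context.yaml"],
    ["python", "scripts/run_benchmark.py", "--spec", "benchmark-spec.yaml"],
    ["python", "scripts/estimate_tokens.py", "--run-dir", "runs/latest"],
    ["python", "scripts/resolve_pricing.py", "--run-dir", "runs/latest"],
    ["python", "scripts/score_models.py", "--run-dir", "runs/latest"],
    ["python", "scripts/build_report.py", "--run-dir", "runs/latest"],
]
for step in steps:
    subprocess.run(step, check=True)  # stop on the first failing step
```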
Output contract
Try to produce these artifacts whenever possible:
- versioned benchmark spec
- raw per-model answer files
- raw metrics JSON
- score breakdown JSON
- markdown summary report
- HTML landing page
- PDF output when requested
- publish result metadata when delivery occurs