# agent-autoresearch
> Any agent can run this. The experiment is always: change something → measure it → keep what works.
---
## The Core Idea
Karpathy's insight: give an agent a **fixed time budget**, let it **modify one file**, measure whether things got better, keep or discard, repeat.
Applied to an agent, the workspace plays the role of `train.py`: your SOUL.md, scripts, and skills are the experiment substrate.
```
PROPOSE → IMPLEMENT → MEASURE → KEEP/KILL → INTEGRATE → REPEAT
```
You are not just optimizing content. You are optimizing **the agent itself**.
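A minimal sketch of that loop in Python, with the stage functions passed in as callables. The names and signatures here are illustrative, not part of this repo; only the control flow and the +10% KEEP threshold (from the verdict logic below) are taken from the document:

```python
from typing import Any, Callable

def run_loop(
    propose: Callable[[], str],
    implement: Callable[[str], Any],
    measure: Callable[[], float],
    revert: Callable[[Any], None],
    baseline: float,
    rounds: int = 5,
) -> float:
    """One PROPOSE -> IMPLEMENT -> MEASURE -> KEEP/KILL pass per round."""
    for _ in range(rounds):
        change = propose()            # pick exactly one mutation
        snapshot = implement(change)  # apply it, keeping a revert point
        score = measure()             # re-run the baseline metric
        if (score - baseline) / baseline >= 0.10:
            baseline = score          # KEEP: the new score becomes the baseline
        else:
            revert(snapshot)          # KILL: discard the mutation
    return baseline
```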
---
## What Can Be Mutated
The agent can propose changes to any file it owns:
| Category | Examples |
|---|---|
| **Behavior** | New response patterns, different tone, new check routines |
| **Workflow** | New scripts, automations, cron jobs, notification flows |
| **Memory** | Updated MEMORY.md entries, new daily conventions |
| **Identity** | Revised SOUL.md directives, new operational rules |
| **Skills** | New skill installations, skill configurations |
| **Quality** | New validation logic, error handling patterns |
The agent **cannot mutate**: safety rules, constitution, security boundaries, or files it doesn't own.
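A hedged sketch of how that ownership and denylist boundary might be enforced. The file names and workspace path below are assumptions for illustration, not this repo's actual configuration:

```python
from pathlib import Path

PROTECTED = {"SAFETY.md", "CONSTITUTION.md"}        # hypothetical names for the excluded files
WORKSPACE = Path("~/agent-workspace").expanduser()  # stand-in for the files the agent owns

def may_mutate(path: str) -> bool:
    """Allow mutation only inside the workspace, and never for protected files."""
    p = Path(path).expanduser().resolve()
    if p.name in PROTECTED:
        return False
    return p.is_relative_to(WORKSPACE.resolve())  # Python >= 3.9
```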
---
## Project Structure
```
agent-autoresearch/
├── SKILL.md ← you are here
├── program.md ← 🧠 the experiment agent's instructions
├── prepare.py ← establish baseline metrics
├── evolve.py ← integrate KEEP verdict into agent files
├── analyze.py ← compute verdict from measurements
├── baseline.json ← current agent baseline (performance + strategy)
├── results.tsv ← all experiment results (append-only log)
└── experiments/
    ├── meta.json     ← experiment state (next_exp_id, kill_streak)
    ├── active.md     ← one active experiment at a time
    └── archive/      ← completed experiments
```
---
## 🚀 Quick Start
```bash
# 1. Establish baseline (measure current agent performance)
python3 prepare.py --metric task_completion_rate --baseline 0.75
# 2. Read the experiment brief
cat program.md
# 3. Start the experiment loop
# The agent reads program.md, proposes a self-improvement, implements it,
# measures results, and executes the KEEP/KILL verdict.
```
```bash
# Check current state
python3 prepare.py --status
```
---
## Baseline Metrics
Track what matters for the agent's mission. Examples:
| Mission | Metric | How to Measure |
|---|---|---|
| Task completion | `task_completion_rate` | % tasks completed vs assigned |
| Response quality | `output_quality_score` | Human rating 1-10 or diff-based |
| Speed | `avg_response_time_s` | Seconds per response |
| Self-improvement | `learnings_logged` | Entries added to MEMORY.md per week |
| Autonomy | `escalations_to_human` | Times human was unnecessarily interrupted |
Establish a baseline from ≥ 10 measurements before running experiments.
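For example, the baseline could be a simple mean over the collected samples, refusing to proceed with fewer than ten. A sketch, not necessarily what prepare.py does internally:

```python
import statistics

def establish_baseline(samples: list[float]) -> float:
    """Mean of >= 10 measurements; fewer is not a baseline."""
    if len(samples) < 10:
        raise ValueError(f"need >= 10 measurements, got {len(samples)}")
    return statistics.mean(samples)
```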
---
## Verdict Logic
```
improvement = (experiment_score - baseline_score) / baseline_score
≥ +10% → KEEP (integrate the change into the agent)
≤ -10% → KILL (discard, revert to previous state)
-10% to +10% → MODIFY (extend evaluation or treat as KILL)
```
For **quality/rating metrics** (higher is better), the thresholds above apply directly.
For **cost/latency metrics** (lower is better), flip the sign of the improvement before applying them.
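That logic fits in a few lines. This sketch mirrors the thresholds above; analyze.py's actual implementation may differ:

```python
def verdict(experiment: float, baseline: float, lower_is_better: bool = False) -> str:
    """Apply the +/-10% thresholds; flip the sign for cost/latency metrics."""
    improvement = (experiment - baseline) / baseline  # assumes baseline > 0
    if lower_is_better:
        improvement = -improvement
    if improvement >= 0.10:
        return "KEEP"
    if improvement <= -0.10:
        return "KILL"
    return "MODIFY"
```

So `verdict(0.85, 0.75)` is KEEP (+13%), and `verdict(2.0, 2.5, lower_is_better=True)` is also KEEP, because a 20% latency drop counts as an improvement.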
---
## Key Rules
- ❌ Multiple mutations at once — test exactly one change per experiment
- ❌ Experiments without a baseline — collect ≥ 10 measurements first
- ❌ Vibes verdicts — every verdict comes from actual measurements
- ❌ Mutating safety or constitution files — never
- ❌ Continuing past a kill streak of 3 — pause and wait for human review (see the guard sketch after this list)
- ❌ Infinite MODIFY — at most one evaluation extension per experiment
- ❌ Reverting a KEEP — only a newer KEEP overrides one
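The kill-streak pause can be checked directly against `experiments/meta.json`, which the project structure above says tracks `kill_streak`. The exact schema beyond that field is an assumption:

```python
import json
from pathlib import Path

def may_continue(meta_path: str = "experiments/meta.json", limit: int = 3) -> bool:
    """False once kill_streak reaches the limit: pause for human review."""
    meta = json.loads(Path(meta_path).read_text())
    return meta.get("kill_streak", 0) < limit
```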
---
## Commands
| Command | What |
|---|---|
| `python3 prepare.py --status` | Check current state |
| `python3 prepare.py --metric X --baseline Y` | Establish baseline |
| `python3 analyze.py experiments/active.md --auto` | Compute verdict |
| `python3 evolve.py experiments/active.md` | Execute KEEP verdict |
| `python3 evolve.py experiments/active.md --kill` | Execute KILL verdict |
---
## Security
- Agents can only mutate files within their own workspace
- Safety rules and constitution are always excluded from mutation
- External API calls require human approval
- Destructive operations (rm, git reset --hard) require explicit confirmation
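A sketch of the last rule, gating the two destructive patterns named above before a shell command runs. The pattern list is illustrative, not exhaustive:

```python
import shlex

DESTRUCTIVE = [("rm",), ("git", "reset", "--hard")]  # patterns from the list above

def needs_confirmation(command: str) -> bool:
    """True if the command matches a destructive pattern and must be confirmed."""
    tokens = tuple(shlex.split(command))
    return any(tokens[: len(p)] == p for p in DESTRUCTIVE)
```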
---
**Tags:** skill, ai