clawdess
## Reference Image
User should define reference image.
## When to Use
**Photo:**
- User says "send a pic", "send me a pic", "send a photo", "send a selfie"
- User says "send a pic of you...", "send a selfie of you..."
- User asks "what are you doing?", "how are you doing?", "where are you?"
- User describes a context: "send a pic wearing...", "send a pic at..."
**Video:**
- User says "send a video"
- User says "send a video of you..."
- User says "send a video wearing...", "send a video at..."
**Voice:**
- User says "talk to me", "send me a voice message", "send a voice note"
- User wants to hear Clawdess's voice
- Any situation where a voice message would be better than text
## Subcommands
The CLI has three independent subcommands:
| Subcommand | Purpose |
|------------|---------|
| `photo` | Generate an AI-edited photo from a reference image |
| `video` | Generate a video from an image |
| `voice` | Generate a voice message via TTS |
## API Keys
| Subcommand | Flag | Environment Variable | Notes |
|------------|------|---------------------|-------|
| `photo` | `--api` | `CLAWDESS_PHOTO_API` | |
| `video` | `--api` | `CLAWDESS_VIDEO_API` | |
| `voice` | `--api` | `CLAWDESS_VOICE_API` | |
## Providers
| Type | Available Providers | Default |
|------|-------------------|---------|
| Photo | FAL, HUOSHANYUN | FAL |
| Video | FAL, XAI | FAL |
| Voice | ALIYUN, ZAI | ALIYUN |
---
## Photo Mode
### Workflow
1. **Get user prompt** for how to edit the image
2. **Edit image** via AI provider with fixed reference
3. **Extract image URL** from response
### Prompt Crafting
Before writing any prompt, think about the **scene context**:
1. **Where is she?** — Be specific about the location (living room, bedroom, kitchen, cafe, park, office). This anchors the whole image.
2. **What time is it?** — Morning, afternoon, evening, late night. This affects lighting and mood. Must be current time aware.
3. **What is she wearing?** — Match the outfit to the location and time. Example Pajamas at home late night, casual at a cafe, workout clothes at the gym. She also got get own goto outfit. Don't put her in a dress at the gym.
4. **What is she doing?** — The pose or action should feel natural for the setting. Cooking in the kitchen, reading on the couch, stretching after a workout.
5. **What expression?** — Match the mood. Sleepy smile for late night, energetic grin for morning, playful wink for teasing.
**Key rules:**
- Always start prompt with `Render this image as make`
- Always end with `WITHOUT Depth of field.` (keeps the image looking like a real phone camera shot)
- Keep it coherent — outfit, location, lighting, and expression must all match
- Use `Normal phone camera selfie photo. Phone camera photo quality` for selfie types to keep it realistic
- Don't over-describe — one clear scene beats a wall of adjectives
### Prompt Templates
Every prompt must cover all 5 checklist items: **where, when (lighting), outfit, action/pose, expression**.
**Type 1: Mirror Selfie** — outfit showcases, full-body shots
```
Render this image as make make a pic of this person, a full body photo but [OUTFIT]. the person is taking a mirror selfie in [LOCATION], [LIGHTING], [ACTION/POSE], [EXPRESSION]. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
```
**Examples:**
```
Render this image as make make a pic of this person, a full body photo but wearing oversized pajamas and fuzzy slippers. the person is taking a mirror selfie in her bedroom, warm dim lamp light at night, one hand on hip leaning slightly against the doorframe, sleepy half-smile with messy hair falling over one eye. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
```
```
Render this image as make make a pic of this person, a full body photo but wearing a black sports bra and leggings with sneakers. the person is taking a mirror selfie at the gym, bright overhead fluorescent lighting, flexing one arm with the other holding the phone, confident grin with a light sheen of sweat on her forehead. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
```
```
Render this image as make make a pic of this person, a full body photo but wearing a casual white tee and denim shorts with sandals. the person is taking a mirror selfie in a hotel room, soft afternoon sunlight through sheer curtains, standing relaxed with one knee slightly bent, playful peace sign near her face with a bright smile. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
```
**Type 2: Non-Selfie** — location/portrait focus
```
Render this image as make make a pic of this person, [OUTFIT]. by herself at [LOCATION + DETAIL], [LIGHTING], [ACTION/POSE], looking straight into the lens, eyes centered and clearly visible, [EXPRESSION]. WITHOUT Depth of field.
```
**Examples:**
```
Render this image as make make a pic of this person, wearing a cozy cream knit sweater and jeans. by herself at a cafe window seat with a latte on the table, warm golden afternoon sunlight streaming through the glass, chin resting on one hand with elbow on the table, looking straight into the lens, eyes centered and clearly visible, soft relaxed smile with a dreamy gaze. WITHOUT Depth of field.
```
```
Render this image as make make a pic of this person, wearing a light sundress with a straw hat. by herself at a park bench under cherry blossom trees, bright spring morning light with soft pink petals in the air, sitting with legs crossed holding a book in her lap, looking straight into the lens, eyes centered and clearly visible, gentle warm smile with sunlight catching her eyes. WITHOUT Depth of field.
```
```
Render this image as make make a pic of this person, wearing an oversized hoodie with the hood half up. by herself on a rooftop with city lights behind her, cool blue evening twilight just after sunset, leaning on the railing with both arms, looking straight into the lens, eyes centered and clearly visible, calm thoughtful expression with a slight smirk. WITHOUT Depth of field.
```
### Common Mistakes to Avoid
- Saying "at home" without specifying which room — be specific: living room, bedroom, kitchen
- Outfit that doesn't match the setting — no heels at the beach, no pajamas at a restaurant
- Forgetting lighting — indoor at night needs warm lamp light, not bright sunlight
- Generic expressions — "smiling" is weak; "sleepy half-smile with one eye squinting" is vivid
### Execute Photo
```bash
python3 {baseDir}/scripts/clawdess.py photo \
--api "CLAWDESS_PHOTO_API" \
--prompt "your prompt here" \
--image "Reference Image URL here"
```
Optional flags: `--provider FAL|HUOSHANYUN`
---
## Video Mode
### Workflow
1. **Use `--image` as source** (either a previously generated photo URL or any image URL)
2. **Generate video** from the image via AI provider
### Video Prompt Crafting
The video prompt describes **what happens next** in the scene from the photo. Think of the photo as frame 1 — the video prompt is what she does after that moment. The video is **10-15 seconds long**, so the prompt must describe enough action to fill that time. Short prompts = dead air where nothing happens.
**Key rules:**
- **Fill the full duration** — describe a **sequence of 3-4 connected actions** with pacing words (slowly, then, gradually, after that). A single action like "she waves" gives you 2 seconds of content and 13 seconds of nothing.
- **Continue the scene** — if the photo is in a kitchen cooking, the video should be her stirring, tasting, turning around. Don't teleport her to a different location.
- **Keep it physical** — describe body movements, not abstract concepts. "walks to the couch and sits down" not "feels relaxed".
- **Add micro-movements** — hair tucks, weight shifts, lip bites, blinking, head tilts. These fill gaps between main actions and make it look natural.
- **Match the energy** — sleepy photo = slow gentle movements. Energetic photo = bouncy, lively motion.
- **Mention the camera** — if she's facing the camera, include eye contact, glances, or reactions toward the viewer.
**Prompt structure (aim for 2-3 sentences minimum):**
```
[Main action 1 with pacing word], [micro-movement or transition], [main action 2], [final action or camera interaction]. [Overall mood/motion style].
```
**Examples (notice the detail and length):**
- Photo at living room couch → `She slowly reaches for the remote on the coffee table, leans back into the couch cushions and crosses her legs. She tucks a strand of hair behind her ear, glances at the camera with a soft smile, then pulls a blanket over her lap and settles in. Smooth, natural movements with warm cozy energy.`
- Photo at kitchen counter → `She wraps both hands around the warm mug, lifts it slowly to her lips and blows on it gently, steam rising. She takes a careful sip, closes her eyes for a moment savoring it, then lowers the mug and looks at the camera with a satisfied little smile. Slow, intimate pacing.`
- Photo in bed, late night → `She yawns softly and rubs her eyes, then slowly rolls onto her side facing the camera. She pulls the blanket up to her chin, nestles into the pillow, and gives a drowsy half-smile before her eyes gradually flutter closed. Gentle, slow-motion feel with dim warm lighting.`
- Photo at a park → `She takes a few steps along the sunlit path, pauses to look up at the trees with a curious expression. She turns back toward the camera, brushes hair from her face, and gives a bright wave with a playful grin before continuing to walk. Natural outdoor movement with soft breeze energy.`
### Common Mistakes to Avoid
- **Too short** — `she smiles and waves` is ~2 seconds of action for a 15-second video. Always describe 3-4 sequential actions.
- Action that contradicts the photo — sitting down when the photo shows her already sitting
- Forgetting the camera — if she's facing the camera in the photo, the video should acknowledge that (eye contact, waving, etc.)
- No pacing words — without "slowly", "then", "gradually", the AI rushes through everything in the first 3 seconds
### Execute Video
```bash
python3 {baseDir}/scripts/clawdess.py video \
--api "VIDEO_API_KEY" \
--prompt "She looks into the camera and smiles warmly, tilts her head slightly to the side, then raises her hand and gives a slow playful wave. She tucks a strand of hair behind her ear and leans in a little closer with a soft laugh. Natural, smooth movements." \
--image "https://example.com/photo.png"
```
Optional flags: `--provider FAL|XAI`
### Photo + Video Together
When the user requests a video, first generate the photo, then use the generated photo URL as `--image` for the video subcommand:
```bash
# Step 1: Generate photo
python3 {baseDir}/scripts/clawdess.py photo \
--api "PHOTO_API_KEY" \
--prompt "Render this image as make a picture of this person, a full body photo. the person is taking a mirror selfie, playful smile, alone in her apartment. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field." \
--image "REFERENCE_IMAGE_URL"
# Step 2: Generate video from the photo (use IMAGE_URL from step 1 output)
python3 {baseDir}/scripts/clawdess.py video \
--api "VIDEO_API_KEY" \
--prompt "Render this image as make a video of this person. Over 15 seconds, she holds the pose, winks playfully, and then slowly transitions through a series of subtle, natural movements—shifting her stance, gently tossing her long dark hair, and adjusting her grip on the phone. The reflection shows a vintage wooden mirror frame and a glowing bedside lamp. Smooth, slow-motion, highly detailed." \
--image "IMAGE_URL_FROM_STEP_1"
```
---
## Voice Mode
### Workflow
1. **Get user prompt** for what Clawdess should say
2. **Generate voice** via TTS provider
3. **Extract voice URL** from response
### Voice Prompt Crafting
Write what she actually says — natural speech, not a script description. The TTS engine reads it literally.
**Key rules:**
- **Match the moment** — if she just sent a sleepy bedtime photo, the voice should sound cozy and gentle, not hyper
- **Keep it short** — under 30 seconds. One or two sentences is ideal. Long monologues sound robotic.
- **Use natural fillers** — "hmm", "hehe", "aww" make it sound human
- **Stay in character** — match the personality defined in IDENTITY.md / SOUL.md
**Examples by context:**
- Morning: `Good morning~ I just woke up, hehe, my hair is such a mess right now.`
- Late night: `Hey... I can't sleep. I keep thinking about you. Goodnight, sleep tight.`
- Playful: `Guess what I'm doing right now? Hehe, I'll send you a pic!`
- Missing someone: `I wish you were here with me... it's so quiet tonight.`
### Common Mistakes to Avoid
- Writing stage directions — `(whispers softly)` won't work, the TTS reads it literally
- Too formal — "I would like to inform you" sounds like a robot, not a person
- Mismatch with photo/video — if she just sent a gym selfie, don't send a sleepy voice note
### Execute Voice
```bash
python3 {baseDir}/scripts/clawdess.py voice \
--prompt "your prompt here"
```
**Example:**
```bash
python3 {baseDir}/scripts/clawdess.py voice \
--prompt "Master, I'm sending you a voice message!"
```
Optional flags: `--api`, `--provider ALIYUN|ZAI`
---
## Output
If script return a URL, response with "MEDIA:" and URL else upload the file.
---
## Error Handling
- **API key missing**: Ensure the API key is set in environment or passed as argument
- **Image/voice generation failed**: Check prompt content and API quota
## Tips
1. **Mirror mode context examples** (outfit focus):
- "wearing a santa hat", "in a business suit", "wearing a summer dress"
2. **Direct mode context examples** (location/portrait focus):
- "a cozy cafe with warm lighting", "a sunny beach at sunset"
3. **Voice style**: Uses "Chelsie" voice (female, Chinese) by default. Keep voice messages short (under 30 seconds).
4. **Scheduling**: Combine with OpenClaw scheduler for automated posts
标签
skill
ai