`monitor-experiment`¶

Pack: ARIS skills

Category: drafting

Field: —

License: MIT

Updated: 2026-05-18

Stages: paper-drafting


Monitor Experiment Results¶

Monitor: $ARGUMENTS

Workflow¶

Step 1: Check What's Running¶

SSH server:

Bash

ssh <server> "screen -ls"

Vast.ai instance (read ssh_host, ssh_port from vast-instances.json):

Bash

ssh -p <PORT> root@<HOST> "screen -ls"

Also check vast.ai instance status:

Bash

vastai show instances
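As a minimal sketch of the lookup described above, the snippet below builds the `screen -ls` command for each instance. The layout of vast-instances.json is an assumption (a JSON list of records carrying `ssh_host` and `ssh_port`), and the host and port in the sample are placeholders.

```python
import json
import shlex

def screen_ls_commands(instances_json: str) -> list[str]:
    """Build one `ssh ... 'screen -ls'` command per instance record."""
    # Assumed layout: a JSON list of objects with ssh_host and ssh_port keys.
    instances = json.loads(instances_json)
    commands = []
    for inst in instances:
        host = shlex.quote(str(inst["ssh_host"]))
        port = int(inst["ssh_port"])
        commands.append(f"ssh -p {port} root@{host} {shlex.quote('screen -ls')}")
    return commands

# Placeholder record, not a real instance.
sample = '[{"ssh_host": "ssh4.vast.ai", "ssh_port": 12345}]'
for command in screen_ls_commands(sample):
    print(command)
```

If the real file nests the records under another key or uses different field names, only the lookup lines need to change.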

Modal (when gpu: modal in CLAUDE.md):

Bash

modal app list         # List running/recent apps
modal app logs <app>   # Stream logs from a running app

Modal apps terminate automatically when done, so an app that no longer appears as running in the list has already finished. Check results via modal volume ls <volume> or local output.

Step 2: Collect Output from Each Screen¶

For each screen session, capture the last N lines (50 in this example):

Bash

ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"

If hardcopy fails, check for log files or tee output.
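To loop over "each screen session", the names first have to be pulled out of the `screen -ls` output from Step 1. The sketch below shows one way to do that. The regular expression assumes the usual `<pid>.<name> ... (Attached|Detached)` line shape, and the sample output is illustrative rather than captured from a real server.

```python
import re

# Matches lines such as "\t12345.baseline\t(Detached)", with or without a date column.
SESSION_RE = re.compile(r"^\s*(\d+)\.(\S+)\s.*\((Attached|Detached)\)\s*$")

def parse_screen_ls(output: str) -> list[dict]:
    """Extract pid, name, and state for every session listed by `screen -ls`."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_RE.match(line)
        if match:
            sessions.append({
                "pid": int(match.group(1)),
                "name": match.group(2),
                "state": match.group(3),
            })
    return sessions

# Illustrative output; session names are hypothetical.
sample = (
    "There are screens on:\n"
    "\t12345.baseline\t(05/18/2026 10:00:00 AM)\t(Detached)\n"
    "\t12399.method_a\t(Detached)\n"
    "2 Sockets in /run/screen/S-root.\n"
)
for session in parse_screen_ls(sample):
    print(session["name"], session["state"])
```

Each extracted name can then be substituted for `<name>` in the hardcopy command above.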

Step 3: Check for JSON Result Files¶

Bash

ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"

If JSON results exist, fetch and parse them:

Bash

ssh <server> "cat <results_dir>/<latest>.json"
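Once a result file has been fetched, parsing can be as small as the sketch below. The result schema is not specified here, so the sketch assumes a flat JSON object and keeps only its top-level numeric fields. The key names in the sample are hypothetical.

```python
import json

def numeric_metrics(result_json: str) -> dict[str, float]:
    """Return the top-level numeric fields of a result file as floats."""
    data = json.loads(result_json)
    return {
        key: float(value)
        for key, value in data.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

# Hypothetical result file content.
sample = '{"experiment": "method_a", "eval/loss": 2.31, "eval/ppl": 10.07, "converged": true}'
metrics = numeric_metrics(sample)
print(metrics)
```

Nested result files would need a flattening step first; that is left out to keep the sketch short.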

Step 3.5: Pull W&B Metrics (when `wandb: true` in CLAUDE.md)¶

Skip this step entirely if wandb is not set or is false in CLAUDE.md.

Pull training curves and metrics from Weights & Biases via Python API:

Bash

# List recent runs in the project
ssh <server> "python3 -c \"
import wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', per_page=10)
for r in runs:
    print(f'{r.id}  {r.state}  {r.name}  {r.summary.get(\"eval/loss\", \"N/A\")}')
\""

# Pull specific metrics from a run (last 10 logged rows)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
# scan_history walks the whole run; rows missing any listed key are skipped
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""

# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""

What to extract:

- Training loss curve: is it converging, diverging, or plateauing?
- Eval metrics: loss, PPL, accuracy at the latest checkpoint
- Learning rate: is the schedule behaving as expected?
- GPU memory: any OOM risk?
- Run status: running, finished, or crashed?
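The converging / diverging / plateauing question can be answered with a rough first-pass check on the pulled loss values. The sketch below is a simple heuristic, not part of the W&B API: it compares the mean of the earlier half of a window with the mean of the later half. The function name and the 1% tolerance are assumptions chosen for illustration.

```python
import math

def loss_trend(losses, rel_tol=0.01):
    """Classify a window of loss values by comparing its two halves."""
    values = [float(v) for v in losses if v is not None]
    if len(values) < 4:
        return "insufficient data"
    # Any NaN or inf in the window is treated as divergence.
    if any(math.isnan(v) or math.isinf(v) for v in values):
        return "diverging"
    half = len(values) // 2
    early = sum(values[:half]) / half
    late = sum(values[half:]) / (len(values) - half)
    # Relative change of the later mean with respect to the earlier mean.
    change = (late - early) / abs(early) if early != 0 else late - early
    if change > rel_tol:
        return "diverging"
    if change < -rel_tol:
        return "converging"
    return "plateauing"

print(loss_trend([3.0, 2.5, 2.1, 1.9]))
```

A noisy curve can fool a two-halves comparison, so the label is best reported alongside the raw numbers rather than instead of them.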

W&B dashboard link (include in summary for user):

Text Only

https://wandb.ai/<entity>/<project>/runs/<run_id>

This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time.

Step 4: Summarize Results¶

Present results in a comparison table:

Text Only

| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |
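A table in this shape can be generated from collected results, for example with the sketch below. The input format (experiment name mapped to a metric value and a status string) is an assumption, and the numbers in the sample are placeholders, not real results.

```python
def comparison_table(results: dict, baseline: str = "Baseline") -> str:
    """Render results as a Markdown table with a delta column against the baseline."""
    base_value, _ = results[baseline]
    rows = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|-----------|--------|-------------------|--------|",
    ]
    for name, (value, status) in results.items():
        # The baseline row has no delta; every other row shows a signed difference.
        delta = "-" if name == baseline else f"{value - base_value:+.2f}"
        rows.append(f"| {name} | {value:.2f} | {delta} | {status} |")
    return "\n".join(rows)

# Placeholder values for illustration only.
sample = {"Baseline": (71.20, "done"), "Method A": (72.85, "running")}
print(comparison_table(sample))
```

Carrying the status per row keeps still-running experiments visibly separate from finished ones.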

Step 5: Interpret¶

Compare against known baselines
Flag unexpected results (negative delta, NaN, divergence)
Suggest next steps based on findings
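Whether a negative delta is bad depends on the metric: lower is better for loss and PPL, higher is better for accuracy. The sketch below makes that direction explicit when flagging results. The function name and flag wording are illustrative assumptions.

```python
import math

def flag_result(value, baseline, higher_is_better=True):
    """Return a list of warning flags for one metric compared with its baseline."""
    if value is None or math.isnan(value) or math.isinf(value):
        return ["non-finite metric (NaN/inf)"]
    delta = value - baseline
    # Normalize so that a positive number always means an improvement.
    improvement = delta if higher_is_better else -delta
    flags = []
    if improvement < 0:
        flags.append("worse than baseline")
    return flags

print(flag_result(70.0, 71.2))                          # accuracy below baseline
print(flag_result(2.1, 2.3, higher_is_better=False))    # loss below baseline
```

A flag is a prompt to check the training logs, not a conclusion in itself.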

Step 6: Feishu Notification (if configured)¶

After results are collected, check ~/.claude/feishu.json:

- Send an experiment_done notification containing the results summary table and the delta vs baseline.
- If the config is absent or its mode is "off", skip this step entirely (no-op).
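The skip condition can be sketched as a small guard. Only the file path and the `mode` field come from the text above. Treating an unparseable file as absent, and treating any mode other than "off" as enabled, are assumptions of this sketch.

```python
import json
from pathlib import Path

def feishu_enabled(config_path=Path.home() / ".claude" / "feishu.json") -> bool:
    """Return False when the config is absent or its mode is "off"."""
    path = Path(config_path)
    if not path.is_file():
        return False
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # Assumption: a config that cannot be parsed is treated as absent.
        return False
    # Assumption: only an explicit "off" disables notifications.
    return isinstance(config, dict) and config.get("mode") != "off"

if feishu_enabled():
    print("send experiment_done notification")
else:
    print("skip notification")
```

The notification payload itself is not shown because its format is not described here.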

Key Rules¶

Always show raw numbers before interpretation
Compare against the correct baseline (same config)
Note if experiments are still running (check progress bars, iteration counts)
If results look wrong, check training logs for errors before concluding
Vast.ai cost awareness: When monitoring vast.ai instances, report the running cost (hours * $/hr from vast-instances.json). If all experiments on an instance are done, remind the user to run /vast-gpu destroy <instance_id> to stop billing
Modal cost awareness: Modal auto-scales to zero — no idle billing. When reporting results from Modal runs, note the actual execution time and estimated cost (time * $/hr from the GPU tier used). No cleanup action needed
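Both cost rules reduce to hours multiplied by the hourly rate. A minimal sketch of that arithmetic follows. The start time and rate are passed in directly because the field names inside vast-instances.json are not specified here, and the sample values are placeholders.

```python
from datetime import datetime, timezone

def running_cost(start: datetime, now: datetime, dollars_per_hour: float) -> float:
    """Return elapsed hours multiplied by the hourly rate, rounded to cents."""
    hours = (now - start).total_seconds() / 3600
    return round(hours * dollars_per_hour, 2)

# Placeholder values: a 6.5 hour run at $0.40/hr.
start = datetime(2026, 5, 18, 8, 0, tzinfo=timezone.utc)
now = datetime(2026, 5, 18, 14, 30, tzinfo=timezone.utc)
print(f"${running_cost(start, now, 0.40):.2f}")
```

For Modal, the same function applies with the actual execution window and the rate of the GPU tier used.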
