evomath-tao¶
Mathematical derivation skill in the spirit of Tao-style proof reasoning
formal-modelingEvoMath (Tao-style)¶
EvoMath is a lightweight proof workflow for contest-style mathematical reasoning. Its job is to produce a rigorous proof, a verified counterexample, a useful partial result, or a clear handoff. Keep the process small; do not run a heavy audit pipeline by default.
Methodology Anchor — Terence Tao's Research-Math Practice¶
This skill operationalizes the way Terence Tao approaches research mathematics:
- Compute small cases first (Kepler before Newton) — build intuition from data before reaching for theory.
- Try the standard toolbox broadly before going deep — most hard problems crack to a standard technique; the few that don't only reveal which after several have failed.
- Hold rigor and intuition together (post-rigorous mathematics) — trust intuition, but verify every step. "It feels right" is a hypothesis, not a proof.
- Atomize when stuck — decompose into independently checkable sub-claims. A clean map of proved / conjectured / open beats a polished but shaky narrative.
- Stay honest about what isn't proved — distinguish PROVED / VERIFIED_NUMERICALLY / CONJECTURED / HANDED_OFF. When blocked, name the precise gap.
- Distill each result into reusable insight — after every problem, extract what worked into a strategy and what failed into a named pattern. Mathematical maturity is accumulated meta-insight.
Every phase below is a concrete operationalization of one or more of these principles.
Operating Rules¶
- Use Markdown notes for handoff between steps. Do not require JSON/YAML unless a script explicitly asks for it.
- Keep only compact state: plan, verified claims, failed attempts, final audit. Do not pass long failed derivations into later prompts.
- Prefer a few independent proof attempts over one long derivation.
- Numerical verification is NOT a proof step (math-olympiad rule). Checking a claim on n=1..100 and finding no counterexample does NOT make it PROVED; the strongest label such evidence can earn is VERIFIED_NUMERICALLY.
- Exact arithmetic can refute; approximate numerics only suggest.
- A proof is final only after an adversarial check of the clean proof.
- Calibrated abstention over bluffing: when verification fails repeatedly, admit it. Return partial results and mark unfixed gaps explicitly (math- olympiad rule). Final status HANDED_OFF with a structured wall report is always preferable to PROVED with hand-waved gaps.
- Every final answer must include a visible
final-status: ...line. - Use TodoWrite to drive the workflow. Each step is one todo; you cannot mark a
todo completed unless the corresponding
.mdfile passes its validator.
If filesystem access is available, create a Markdown workspace with:
If filesystem access is not available, keep the same Markdown sections inline in the conversation. In that case run the validators by mentally checking the same required fields the script checks — the discipline is the same.
Fast Exit¶
Do not use EvoMath for single calculations, definition lookups, symbolic manipulation, or answer-only requests with no proof obligation. Give the direct answer instead. No TodoWrite list is needed for a Fast Exit.
If the statement has a blocking ambiguity that changes truth value, ask one specific clarification question before solving.
Execution Protocol (TodoWrite + Validation)¶
For any problem that passes the Fast Exit Gate, follow this protocol.
1. Create the 5-step todo list¶
Before doing any solving work, call TodoWrite with these five items in this order. Each item names its primary reference file:
- Plan Briefly — read
references/intake-checklist.mdfor type classification, ambiguity handling, goal types. - Try Candidates — read
references/angles-by-type.mdfor technique ideas if you are out of angles for this problem type. - Assemble — read
references/output-formats.mdif you need formatting conventions or LaTeX templates. - Audit — read
references/grading-taxonomy.mdfor issue classes and severity rules. Readreferences/phase-4-audit.mdonly if the user requests strict multi-reviewer audit. - Reflect — read
references/claim-memory.mdonly when deep reflection is triggered (see "Deep Reflection Triggers" below).
2. Per-step discipline¶
For each step in order:
- Mark the todo
in_progressbefore reading the reference or writing output. - Read the referenced file(s) if and only if you need them for this step.
- Produce the corresponding
.mdoutput (plan.md, candidates.md, audit.md, final.md sections, etc.). - Run the validator before marking the todo completed:
python skills/evomath-tao/scripts/evomath_workspace.py validate-phase <N> --dir .evomath/current
- If validation FAILS, the todo stays
in_progress. Revise the.mdfile based on the printed failure messages and re-run the validator. Do not mark completed until the validator exits 0.
3. PROVED gate¶
If your final-status is PROVED, you MUST additionally run:
This verifies that the 10-item PROVED Self-Check Checklist (see
references/output-formats.md) is present in final.md with all boxes ticked.
If this fails, downgrade final-status to CONJECTURED or HANDED_OFF and revise
the answer.
4. Deep Reflection Triggers¶
Step 5 (Reflect) has two modes:
- Light reflection (default): three lines — successful pattern, failed pattern to avoid, whether memory was written.
- Deep reflection (triggered when any of the following hold):
- Step 2 required 3+ revision rounds for any candidate
- Step 4 identified a fatal flaw before the final repair
- A new winning technique appeared that is not yet in any L2 strategy entry
- The user explicitly asks for self-evolution or cross-problem learning
- final-status is HANDED_OFF with a recurring failure-mode
In deep reflection mode, run the full ESE/IVE protocols described in
references/claim-memory.md and update L2/L3 memory in
.evomath/session-memory.md.
5. Fallback when filesystem is unavailable¶
If you cannot run scripts, keep the same TodoWrite discipline:
- Still create the 5-item list and march through it.
- Still keep the same .md sections inline in the conversation.
- Substitute mental validation for the script call — check the same required
fields the validator would check.
Workflow¶
1. Plan Briefly¶
Write a short Markdown plan:
- Problem type: algebra, geometry, number theory, combinatorics, analysis, or other.
- Goal: prove, refute, find example, or audit proof.
- Strategy: one sentence.
- Subgoals: at most five bullets.
For "determine all" problems, include both:
- existence/construction
- impossibility/exclusion
For a simple problem, one root subgoal is enough.
2. Try Candidates¶
Mode dispatch (decide before generating):
- If Phase 0
goal = find-numeric-answer(AIME-style: answer is a single number, no proof required) → AIME mode: generate 5–7 short candidate answers using varied approaches (small-case enumeration, modular invariants, algebraic manipulation, generating functions, brute-force code). Take majority vote across candidates. Verify the top two by substitution into the original problem. Skip the rest of the proof workflow; output the numeric answer withfinal-status: PROVEDonly when both top candidates agree AND substitution checks pass. - Otherwise → Proof mode: continue below.
Proof mode — per-candidate 5-round internal loop:
For each active subgoal, try up to four genuinely different candidate routes. Each candidate is itself the product of a 5-round internal mini-process — not a one-shot generation:
- Solve — produce a proof attempt using reasoning only. No tool use during this round (no calculator, sympy, Lean, web). Pure pencil-and-paper.
- Self-improve — refine the attempt for clarity, fix obvious gaps.
- Self-verify — walk through line by line, looking for: shielding words ("obviously"/"clearly"), unjustified swaps, missing hypotheses, off-by-one cases, hidden assumptions.
- Correct — fix issues found in step 3.
- Repeat 1–4 up to 5 times per candidate, or until the candidate self-
reports as
is_sound.
Each candidate's internal-rounds count is recorded so Phase 4 can see how much self-revision was needed. A candidate that needs 5 rounds is more likely to be borderline than one that's sound in 1.
Record candidates in this Markdown table:
| Candidate | Idea | Internal rounds | Verdict | Issue or reason |
|---|---|---|---|---|
| C1 | 1–5 | sound / repair / fail |
Judging a candidate:
sound: enough to use as a verified claim.repair: promising but missing a local step; revise at most twice outside the 5-round internal loop (so total max revisions = 5 internal + 2 external).fail: wrong, circular, too weak, or repeats a known dead end.
Computation discipline (math-olympiad VERBATIM rule): during the Solve round, no tool calls. Computation is allowed in Phase 1 (Empirical) and in Phase 5 Deep Mode, NEVER during Phase 2 Solve. This protects against ritualized "I called sympy so it must be right" reasoning.
When a candidate is sound, add it to Proof Artifact / Verified Claims with a one-paragraph proof summary. When a route fails, add one line to Negative Attempts so it is not repeated.
Use references/angles-by-type.md only when you are out of ideas for a problem
type. Do not load it by default.
3. Assemble¶
Turn the accepted claims into a clean proof or refutation.
Rules:
- State the original claim.
- Present the final argument only; omit failed attempts.
- Justify every non-trivial step by a verified claim, theorem, construction, or exact computation.
- If a required subgoal remains unsolved, stop pretending the proof is complete: output a partial result or handoff.
4. Audit¶
Audit only the clean proof, not the exploration notes.
Check:
- The proof proves the original statement, not a weaker restatement.
- All cases, boundary values, degeneracies, and quantifiers are handled.
- No claim is cited without proof or explicit acceptance in the Proof Artifact.
- Refutations use an exactly verified counterexample.
Audit follows math-olympiad's 3-safeguard pattern (see references/phase-4-audit.md):
1. Verifier context isolation (strip thinking traces before review).
2. Asymmetric voting (4 HOLDS to confirm; 2 HOLE FOUND to refute).
3. Pigeonhole exit (stop launching reviewers after threshold).
Apply the named-pattern screen (P4 / P5 / P6 / P18 / P40 / P41 in grading-taxonomy.md) plus the counterexample-first rule before any PROVED award.
For ordinary use, one careful local audit is enough. Use the heavier
references/phase-4-audit.md protocol only when the user asks for strict
multi-reviewer audit or when the result is high-risk.
5. Reflect¶
After the final answer, add a short reflection note. It should be compact:
- successful pattern, if any
- failed pattern to avoid
- whether memory was written
Per-problem memory resets at the next problem. Cross-problem memory is optional:
write only compact strategy/failure summaries to .evomath/session-memory.json
when file access is available. If not written, say memory-persisted: false.
Status Labels¶
Use exactly one:
| Status | Meaning |
|---|---|
PROVED |
Complete proof of the original statement passed Safeguards 1–3 audit. The strongest label EvoMath awards. |
REFUTED |
Exact counterexample or contradiction proof found |
VERIFIED_NUMERICALLY |
Finite exact checks only; no general proof. Numerical evidence is NOT a proof — this label exists to record empirical support honestly. |
CONJECTURED |
Strong partial evidence or partial proof, but incomplete |
HANDED_OFF |
Stopped with a precise remaining gap or user question |
Exactly one of these five labels must appear in every final answer. See references/confidence-rules.md for award conditions and promotion rules.
Final Answer Shape¶
Use Markdown, not a rigid schema:
final-status: PROVED
### Result
<answer>
### Proof / Report
<clean proof, refutation, audit report, partial result, or handoff>
### Audit
- audits-run:
- critical-issues:
- remaining-gaps:
### Proof Artifact
- verified claims:
- negative attempts:
### Reflection
- memory-persisted:
- storage-location:
- proposed-memory-updates:
For proof-audit requests, lead with findings ordered by severity, then give the verdict and suggested repair.
Optional Helpers¶
scripts/evomath_workspace.py:initcreates the five Markdown state files.checkverifiesfinal.mdhas a validfinal-status:line.validate-phase Nvalidates the .md output for step N (1..5). Use this before marking the corresponding todo completed. Add--strictto also re-validate all prior steps.validate-provedwhen final-status is PROVED, verifies the 10-item Self-Check Checklist is fully ticked.references/angles-by-type.md: technique ideas by problem type.references/grading-taxonomy.md: detailed flaw taxonomy for audits.references/phase-4-audit.md: heavier independent-review protocol.references/output-formats.md: optional Markdown/LaTeX formatting details + the PROVED Self-Check Checklist template.references/claim-memory.md: three-layer memory architecture and ESE/IVE reflection protocols. Read only in deep-reflection mode.references/model-tier.md: per-tier parameter table (Haiku / Sonnet / Opus). Read once at the start of Phase 0 to set K, internal-rounds, audit passes, and abstain thresholds for the active model.