When Precision = 1.0 Was Actually Wrong

My first metric couldn't tell the difference between "minimal changes" and "smart changes"

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#dbeafe','primaryTextColor':'#1e40af','primaryBorderColor':'#3b82f6','lineColor':'#64748b','secondaryColor':'#f0fdf4','tertiaryColor':'#fef2f2','fontSize':'14px'}}}%%
graph TB
    Start["📝 Instruction: Delete Section 2"]
    Start --> Branch1["⬅️ My Wrong Expectation"]
    Start --> Branch2["➡️ Agent's Smart Behavior"]

    Branch1 --> Change1["❌ Delete Section 2"]
    Change1 --> Score1["✅ Precision = 1.0"]

    Branch2 --> Change2["❌ Delete Section 2"]
    Change2 --> Change3["📋 Update TOC"]
    Change3 --> Change4["🔢 Renumber 3→2"]
    Change4 --> Change5["🔢 Renumber 4→3"]
    Change5 --> Change6["🔗 Fix cross-refs"]
    Change6 --> Score2["❌ Precision = 0.0<br/>Penalized for 4 'extra' edits"]

    style Start fill:#e0e7ff,stroke:#4f46e5,stroke-width:3px
    style Branch1 fill:#d1fae5,stroke:#10b981,stroke-width:2px
    style Branch2 fill:#fee2e2,stroke:#ef4444,stroke-width:2px
    style Change1 fill:#f0fdf4,stroke:#10b981
    style Change2 fill:#fef2f2,stroke:#ef4444
    style Change3 fill:#fef2f2,stroke:#ef4444
    style Change4 fill:#fef2f2,stroke:#ef4444
    style Change5 fill:#fef2f2,stroke:#ef4444
    style Change6 fill:#fef2f2,stroke:#ef4444
    style Score1 fill:#d1fae5,stroke:#10b981,stroke-width:3px
    style Score2 fill:#fee2e2,stroke:#ef4444,stroke-width:3px
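
To make the scoring in the diagram concrete, here is a minimal sketch in Python. The exact definition of the original metric is not given here, so everything in it is an assumption: the edit labels, the function names `set_precision` and `strict_precision`, and the idea that edits are compared as sets of labels. It contrasts a textbook set-based precision with an all-or-nothing variant, which is one reading that would produce the 1.0 and 0.0 scores shown.

```python
# Hypothetical edit labels; the real metric may represent edits differently.
EXPECTED = {"delete_section_2"}

MINIMAL_AGENT = {"delete_section_2"}
SMART_AGENT = {
    "delete_section_2",
    "update_toc",
    "renumber_3_to_2",
    "renumber_4_to_3",
    "fix_cross_refs",
}


def set_precision(expected: set, actual: set) -> float:
    """Textbook precision: share of the agent's edits that were expected."""
    if not actual:
        return 0.0
    return len(expected & actual) / len(actual)


def strict_precision(expected: set, actual: set) -> float:
    """All-or-nothing variant (assumed): any unexpected edit scores 0.0."""
    if not actual:
        return 0.0
    return 1.0 if actual <= expected else 0.0


for name, edits in [("minimal", MINIMAL_AGENT), ("smart", SMART_AGENT)]:
    extra = sorted(edits - EXPECTED)
    print(name, set_precision(EXPECTED, edits), strict_precision(EXPECTED, edits), extra)
```

Under either variant the agent that only deletes Section 2 scores 1.0, while the agent that also fixes the TOC, numbering, and cross-references is marked down for its four extra edits. Neither function has any way to tell that those extras were necessary consequences of the instruction.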