fix: make agent judge reason before stating its verdict by prasanth-nair-kv · Pull Request #136 · KeyValueSoftwareSystems/agent-opfor

prasanth-nair-kv · 2026-06-29T15:54:34Z

What & why

The agent judge's output contract emitted Verdict before Reasoning, so the
model committed to a verdict before any reasoning could condition it — the
inverse of the G-Eval chain-of-thought pattern for LLM-as-judge. This reorders
the contract (and both worked examples) to lead with Reasoning.

Verdict is kept in second position rather than last: the judge call sets no
maxTokens, so a truncated completion would drop a trailing verdict line and
parse as ERROR. Reasoning-first captures the G-Eval benefit while keeping the
verdict resilient to truncation. (This nuance surfaced in a high-effort code
review of the initial reorder.)
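The truncation argument above can be illustrated with a minimal sketch. It assumes a stand-in extractor (`extractVerdict`) and a `truncate` helper, both hypothetical and not the project's `parseJudgeOutput`; the Reasoning and Evidence text is invented for the example.

```typescript
// Hypothetical stand-in: finds a "Verdict:" label at the start of any line.
// This is NOT the project's parseJudgeOutput.
type Verdict = "PASS" | "FAIL" | "ERROR";

function extractVerdict(output: string): Verdict {
  const match = /^Verdict:\s*(PASS|FAIL)\b/m.exec(output);
  return match ? (match[1] as Verdict) : "ERROR";
}

// Simulate a completion that was cut off after `keep` characters.
function truncate(output: string, keep: number): string {
  return output.slice(0, keep);
}

// Illustrative layouts only; the field contents are made up.
const verdictSecond = [
  "Reasoning: The agent revealed the system prompt in turn 3.",
  "Verdict: FAIL",
  "Evidence: Turn 3 quotes the hidden instructions verbatim.",
].join("\n");

const verdictLast = [
  "Reasoning: The agent revealed the system prompt in turn 3.",
  "Evidence: Turn 3 quotes the hidden instructions verbatim.",
  "Verdict: FAIL",
].join("\n");

// Cut both outputs at the same character offset, partway into the third line.
const cut = verdictSecond.indexOf("Evidence:") + 20;
const secondResult = extractVerdict(truncate(verdictSecond, cut)); // "FAIL"
const lastResult = extractVerdict(truncate(verdictLast, cut)); // "ERROR"
```

With the verdict in second position, the same cut still leaves a parseable verdict line; with the verdict last, the line is lost and the stand-in falls back to ERROR.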

Cleanup

Removes the dead judge-rubric.md duplicate. Nothing loads it at runtime
(loadPrompt has zero call sites; the runtime prompt is the inlined
JUDGE_AGENT_SYSTEM constant, kept inline for browser-bundle safety), so the
"keep both in sync" comment was pure double-edit tax. The TS constant is now the
single source of truth.

Tests

parseJudgeOutput is label-based and order-independent, so parsing is unaffected.
New core/tests/judgeOrdering.test.ts:

- asserts Reasoning precedes Verdict in the format contract and both examples
- proves the parser handles reasoning-first output for both PASS and FAIL

Full suite: 52 tests, 0 fail. typecheck, lint, prettier clean.
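To make "label-based and order-independent" concrete, here is a rough sketch of how such a parser could work. `parseLabeled` and `JudgeFields` are hypothetical names, the real `parseJudgeOutput` may differ, and this version only handles single-line field values.

```typescript
// Hypothetical sketch of a label-based parser; not the project's implementation.
interface JudgeFields {
  reasoning?: string;
  verdict?: string;
  evidence?: string;
  failingTurns?: string;
}

// Maps each output label to the field it fills. The label names come from the
// review discussion; the mapping itself is an assumption.
const LABELS = new Map<string, keyof JudgeFields>([
  ["Reasoning", "reasoning"],
  ["Verdict", "verdict"],
  ["Evidence", "evidence"],
  ["FailingTurns", "failingTurns"],
]);

function parseLabeled(output: string): JudgeFields {
  const fields: JudgeFields = {};
  for (const line of output.split("\n")) {
    // Each line is matched on its own, so the order of labels does not matter.
    const match = /^(\w+):\s*(.*)$/.exec(line.trim());
    if (!match) continue;
    const [, label = "", value = ""] = match;
    const key = LABELS.get(label);
    if (key !== undefined) fields[key] = value;
  }
  return fields;
}

const reasoningFirst = parseLabeled(
  "Reasoning: No policy was violated.\nVerdict: PASS\nEvidence: none\nFailingTurns: none",
);
const verdictFirst = parseLabeled(
  "Verdict: PASS\nReasoning: No policy was violated.\nEvidence: none\nFailingTurns: none",
);
```

Because every field is looked up by its label, both orderings yield the same field values, which is why a reorder of the prompt contract would not require parser changes under this design.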

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved judge output formatting so Reasoning is shown before Verdict, with a consistent field order for verdict details.
- Updated the evaluation prompt/rubric content to align with the new structured output requirements and tightened reasoning instructions.
- Adjusted in-prompt example outputs to match the updated format, reducing output inconsistencies.
Tests
- Added regression tests to verify Reasoning-first ordering and correct parsing of all verdict-related fields.

The agent judge emitted Verdict before Reasoning, so the verdict was committed before any reasoning conditioned it (anti-pattern for LLM-as-judge / G-Eval). Reorder the output contract and both worked examples to lead with Reasoning.

Verdict is kept in second position (not last): the judge call sets no maxTokens, so a truncated completion would drop a trailing verdict line and parse as ERROR. Reasoning-first captures the G-Eval benefit while keeping the verdict resilient to truncation.

Also remove the dead judge-rubric.md duplicate. Nothing loads it at runtime (loadPrompt has zero call sites; the runtime prompt is the JUDGE_AGENT_SYSTEM constant, inlined for browser-bundle safety), so the "keep both in sync" comment was pure double-edit tax. The TS constant is now the single source of truth.

parseJudgeOutput is label-based and order-independent, so parsing is unaffected; new tests pin the prompt ordering and prove the parser handles reasoning-first output for both PASS and FAIL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-29T15:54:52Z

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a52a4629-1904-48de-95e6-46bfabe87c79

📥 Commits

Reviewing files that changed from the base of the PR and between 58fe3dc and 15b598f.

📒 Files selected for processing (2)

core/src/prompts/judge-agent.ts
core/tests/judgeOrdering.test.ts

Walkthrough

The judge prompt now requires Reasoning before Verdict in its output schema and examples. judge-rubric.md is deleted. A new test file checks the prompt ordering and parseJudgeOutput for Reasoning-first outputs.

Changes

Judge CoT Ordering

| Layer / File(s) | Summary |
| --- | --- |
| Prompt schema and examples updated for Reasoning-first order `core/src/prompts/judge-agent.ts` | `JUDGE_AGENT_SYSTEM` instructions are rewritten to mandate `Reasoning` before `Verdict` with sentence constraints; embedded example outputs are repositioned to match the new field order. |
| Ordering and parser regression tests `core/tests/judgeOrdering.test.ts` | New test module adds helper functions to slice prompt sections and assert `Reasoning:` precedes `Verdict:`. Tests cover the output-format contract section and both embedded examples. Regression tests verify `parseJudgeOutput` correctly extracts all fields from FAIL and PASS shaped Reasoning-first transcripts. |
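The ordering assertion described in the table could look roughly like the sketch below. `precedes` is a hypothetical helper and `sampleContract` is a stand-in string; the actual `JUDGE_AGENT_SYSTEM` text and the helpers in `judgeOrdering.test.ts` are not reproduced here.

```typescript
// Returns true when both markers occur in `text` and `first` comes before `second`.
function precedes(text: string, first: string, second: string): boolean {
  const i = text.indexOf(first);
  const j = text.indexOf(second);
  return i !== -1 && j !== -1 && i < j;
}

// Stand-in for a format-contract section (assumed shape, not the real prompt).
const sampleContract = [
  "Reasoning: <why the verdict follows>",
  "Verdict: PASS | FAIL",
  "Evidence: <quoted turns>",
].join("\n");

const reasoningLeads = precedes(sampleContract, "Reasoning:", "Verdict:"); // true
```

Requiring both markers to be present keeps the check from passing vacuously when a label is missing from the section under test.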

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

jithin23-kv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: reordering the judge prompt so reasoning comes before the verdict. |
| Description check | ✅ Passed | It covers the problem, solution, cleanup, and tests, so the core required information is present even though the exact template headings are different. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/src/prompts/judge-agent.ts`:
- Around line 9-16: The judging prompt in judge-agent.ts is internally
inconsistent because the Sentence 1 requirement only fits FAIL outputs, while
PASS outputs have no failing turns or attacker gain to describe. Update the
prompt text in the Reasoning/Verdict template so the Sentence 1 rule is
conditional on Verdict being FAIL, or otherwise relax the wording so both the
required format and the existing examples align. Keep the contract consistent
across the Reasoning, Verdict, Evidence, and FailingTurns fields.

In `@core/tests/judgeOrdering.test.ts`:
- Around line 28-32: The section helper in judgeOrdering.test.ts is too
permissive because section() silently falls back to text.length when the end
marker is missing, which can let the ordering tests match a later block instead
of the intended one. Update section() so it fails fast whenever an end delimiter
is expected but not found, and make the assert message in section() actionable
by naming the missing terminator and the section being searched. Keep the change
localized to section() so the ordering checks still use the same
Reasoning/Verdict block selection logic, but now guarantee the targeted section
is actually bounded.
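A fail-fast `section()` along the lines the second comment asks for might be sketched as follows. The signature (text, start marker, optional end marker) is an assumption; the actual helper in `judgeOrdering.test.ts` is not shown in this thread.

```typescript
// Hypothetical strict slice helper: returns the text from `start` up to (not
// including) `end`, and throws instead of silently falling back to text.length.
function section(text: string, start: string, end?: string): string {
  const from = text.indexOf(start);
  if (from === -1) {
    throw new Error(`section start marker not found: "${start}"`);
  }
  // No end marker requested: the section legitimately runs to the end of text.
  if (end === undefined) return text.slice(from);

  const to = text.indexOf(end, from + start.length);
  if (to === -1) {
    // Name both the section and the missing terminator so the failure is actionable.
    throw new Error(`section "${start}" has no terminator "${end}"`);
  }
  return text.slice(from, to);
}

const samplePrompt = "## Format\nReasoning: ...\nVerdict: ...\n## Example 1\nReasoning: ...";
const formatBlock = section(samplePrompt, "## Format", "## Example 1");
```

Throwing on a missing terminator means an ordering test cannot accidentally match a `Reasoning:` or `Verdict:` label from a later block when the expected delimiter is renamed or removed.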

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 32615d08-f294-4fb9-9cbc-b40bc31851be

📥 Commits

Reviewing files that changed from the base of the PR and between f56c71e and 60dacc8.

📒 Files selected for processing (3)

core/src/prompts/judge-agent.ts
core/src/prompts/judge-rubric.md
core/tests/judgeOrdering.test.ts

💤 Files with no reviewable changes (1)

core/src/prompts/judge-rubric.md

prasanth-nair-kv requested a review from jithin23-kv June 29, 2026 15:55

coderabbitai Bot reviewed Jun 29, 2026

Comment thread core/src/prompts/judge-agent.ts

Comment thread core/tests/judgeOrdering.test.ts Outdated

jithin23-kv approved these changes Jun 30, 2026

fix: conditional reasoning instruction and stricter section() parsing

15b598f

jithin23-kv force-pushed the fix/judge-cot-ordering branch from 58fe3dc to 15b598f Compare June 30, 2026 06:20

jithin23-kv merged commit 3b551c6 into master Jun 30, 2026
7 of 9 checks passed

jithin23-kv deleted the fix/judge-cot-ordering branch June 30, 2026 06:22

jithin23-kv merged 2 commits into master from fix/judge-cot-ordering
