DEV Community

# evaluation

Posts

Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge

1
Comments
4 min read
Your Model-as-Judge Doesn't Belong in the Hot Path

1
Comments
9 min read
Evaluating Large Language Models: The Pitfall of Overfitting in RAG

Comments
2 min read
Evaluating Large Language Models: The Overfitting Problem

Comments
2 min read
Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production

Comments
5 min read
Our Quality Scores Were Precise, Useless, and Identical

1
Comments 1
8 min read
Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

1
Comments 2
5 min read
Your Agents Are Fine. The Handoff Between Them Isn't.

2
Comments 1
5 min read
Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

2
Comments 1
7 min read
Stop Asking 'Is GAI Here' — Ask 'At What Layer'

1
Comments
3 min read
Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

1
Comments 1
6 min read
Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

3
Comments
6 min read
I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

Comments 1
4 min read
An LLM benchmark is only useful for as long as it's hard

2
Comments
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

2
Comments
11 min read