DEV Community
#evaluation
Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge
Saurav Bhattacharya
Jun 29
#ai #agents #evaluation #typescript
1 reaction
4 min read

Your Model-as-Judge Doesn't Belong in the Hot Path
Saurav Bhattacharya
Jun 28
#ai #agents #evaluation #observability
1 reaction
9 min read

Evaluating Large Language Models: The Pitfall of Overfitting in RAG
Tanishq Soni
Jun 28
#llm #evaluation #overfitting #rag
2 min read

Evaluating Large Language Models: The Overfitting Problem
Tanishq Soni
Jun 28
#llm #evaluation #overfitting #rag
2 min read

Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production
Abdul Rehman
Jun 27
#ai #evaluation #production #llm
5 min read

Our Quality Scores Were Precise, Useless, and Identical
Alex @ Vibe Agent Making
Jun 24
#engineering #management #evaluation #codequality
1 reaction
1 comment
8 min read

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production
Saurav Bhattacharya
Jun 27
#ai #evaluation #observability #testing
1 reaction
2 comments
5 min read

Your Agents Are Fine. The Handoff Between Them Isn't.
Saurav Bhattacharya
Jun 26
#ai #agents #evaluation #observability
2 reactions
1 comment
5 min read

Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems
Saurav Bhattacharya
Jun 20
#ai #agents #observability #evaluation
2 reactions
1 comment
7 min read

Stop Asking 'Is GAI Here' — Ask 'At What Layer'
keeper
Jun 19
#ai #gai #framework #evaluation
1 reaction
3 min read

Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It
Saurav Bhattacharya
Jun 20
#ai #agents #evaluation #observability
1 reaction
1 comment
6 min read

Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output
Saurav Bhattacharya
Jun 19
#ai #agents #evaluation #observability
3 reactions
6 min read

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.
Maya Andersson
Jun 25
#machinelearning #llm #evaluation #mlops
1 comment
4 min read

An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.
Saurav Bhattacharya
Jun 9
#ai #agents #safety #evaluation
2 reactions
11 min read