Evaluation

Saurav Bhattacharya

Jun 29

Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge

#ai #agents #evaluation #typescript

4 min read

Saurav Bhattacharya

Jun 28

Your Model-as-Judge Doesn't Belong in the Hot Path

#ai #agents #evaluation #observability

9 min read

Tanishq Soni

Jun 28

Evaluating Large Language Models: The Pitfall of Overfitting in RAG

#llm #evaluation #overfitting #rag

2 min read

Tanishq Soni

Jun 28

Evaluating Large Language Models: The Overfitting Problem

#llm #evaluation #overfitting #rag

2 min read

Abdul Rehman

Jun 27

Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production

#ai #evaluation #production #llm

5 min read

Alex @ Vibe Agent Making

Jun 24

Our Quality Scores Were Precise, Useless, and Identical

#engineering #management #evaluation #codequality

8 min read

Saurav Bhattacharya

Jun 27

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

#ai #evaluation #observability #testing

5 min read

Saurav Bhattacharya

Jun 26

Your Agents Are Fine. The Handoff Between Them Isn't.

#ai #agents #evaluation #observability

5 min read

Saurav Bhattacharya

Jun 20

Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

#ai #agents #observability #evaluation

7 min read

keeper

Jun 19

Stop Asking 'Is GAI Here' — Ask 'At What Layer'

#ai #gai #framework #evaluation

3 min read

Saurav Bhattacharya

Jun 20

Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

#ai #agents #evaluation #observability

6 min read

Saurav Bhattacharya

Jun 19

Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

#ai #agents #evaluation #observability

6 min read

Maya Andersson

Jun 25

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

#machinelearning #llm #evaluation #mlops

4 min read

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

10 min read

Saurav Bhattacharya

Jun 9

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

#ai #agents #safety #evaluation

11 min read

DEV Community
