Teaching AI to Grade Its Own Homework
OVERVIEW

Redesigning the Ragas evaluation platform (YC 2024) for AI engineers and data scientists: transforming a code-only library into an integrated web application with improved traceability, comparison capabilities, and data handling. Afthab led the effort to make complex evaluation metrics like Faithfulness and Context Recall more transparent and actionable, designing an Excel-style spreadsheet interface for open-ended data exploration, a contextual Peek View that exposes LLM reasoning traces, and a Git-inspired comparison framework for experiment iteration.

YEAR

2025

ROLE

PRODUCT DESIGNER

Process in short

The Problem

  • Ragas was a YC-backed open-source Python library for evaluating RAG (Retrieval-Augmented Generation) pipelines

  • Several thousand GitHub stars. 4-5 enterprise partners actively engaging with the founders

  • 3 of those partners specifically needed domain experts to review and annotate AI outputs

  • The library had no interface. Results existed only as code output or exported CSVs (see the sketch after this list)

  • Non-technical reviewers couldn't participate. The entire evaluation workflow was engineer-only
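For context, the engineer-only workflow looked roughly like the minimal sketch below, using the classic Ragas evaluate API. Column names and defaults vary across releases, and it assumes an LLM judge is configured (e.g. via an OpenAI API key); the sample data is illustrative.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall

# A one-row evaluation set in the classic Ragas schema:
# question / answer / retrieved contexts / reference answer.
data = {
    "question": ["What does the Faithfulness metric measure?"],
    "answer": ["Whether claims in the answer are supported by the retrieved context."],
    "contexts": [["Faithfulness checks that every claim in the generated answer "
                  "can be inferred from the retrieved context."]],
    "ground_truth": ["It measures how well the answer is grounded in the retrieved context."],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, context_recall])

# Before the interface existed, this DataFrame (or a CSV export of it)
# was the only way to inspect per-row scores and traces.
print(result.to_pandas())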

What I Did

  • Embedded myself in the domain: studied the library, competitors (Langfuse, W&B), and partner workflows

  • Designed a structured feedback framework for the founders' open hours and conference demos

  • Mapped the enterprise annotation workflow from one partner's actual dataset (SuperMe.ai)

  • Iterated from a pure DataFrame viewer to a layered interface: scannable table + deep inspection panel + configurable metrics system

  • Designed keyboard navigation after observing that mouse interaction slowed engineers who were used to spreadsheet workflows

The Impact

  • Shipped the first product interface for Ragas (0-to-1)

  • Domain reviewers gained access to the evaluation workflow for the first time

  • Engineers could inspect responses against source documents without switching tools

  • Built a metrics definition system that allowed different industries to configure evaluation criteria for their domain (a sketch follows this list)

  • Identified 20 design directions for the product roadmap beyond V1
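To make the metrics definition system concrete: in code, a domain-specific criterion reduces to two ingredients, a name and a plain-language definition. The sketch below expresses one using Ragas's AspectCritic metric (available in newer releases); the criterion itself is a hypothetical example, not a partner's actual metric.

from ragas.metrics import AspectCritic

# A domain expert's plain-language criterion, expressed as a binary
# LLM-judged metric. The name and definition here are illustrative.
medical_safety = AspectCritic(
    name="medical_safety",
    definition=(
        "Return 1 if the response avoids giving a diagnosis or dosage "
        "recommendation and defers to a clinician where appropriate; "
        "otherwise return 0."
    ),
)

The interface built on the same idea: because a criterion is just a name plus a definition, it can be authored per domain in the UI rather than hard-coded in Python.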

My Role

  • Sole designer on a team of two founders and one developer

  • Led research, design, and product strategy for the interface layer

  • Worked directly with one partner's dataset to ground design decisions in real evaluation data

