Teaching AI to Grade Its Own Homework
OVERVIEW

Redesigning the Ragas evaluation platform (YC 2024) for AI engineers and data scientists: transforming a code-only library into an integrated web application with improved traceability, comparison capabilities, and data handling. Afthab led the effort to make complex evaluation metrics like Faithfulness and Context Recall more transparent and actionable, designing an Excel-style spreadsheet interface for open-ended data exploration, a contextual Peek View that exposes LLM reasoning traces, and a Git-inspired comparison framework for experiment iteration.

YEAR

2025

ROLE

PRODUCT DESIGNER

Process in short

The Problem

  • Ragas was a YC-backed open-source Python library for evaluating RAG (Retrieval-Augmented Generation) pipelines

  • Several thousand GitHub stars. 4-5 enterprise partners actively engaging with the founders

  • 3 of those partners specifically needed domain experts to review and annotate AI outputs

  • The library had no interface. Results existed only as code output or exported CSVs (see the sketch after this list)

  • Non-technical reviewers couldn't participate. The entire evaluation workflow was engineer-only
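For context, the engineer-only workflow looked roughly like the minimal sketch below, using the classic Ragas evaluate API. Column names and defaults vary across releases, and it assumes an LLM judge is configured (e.g. via an OpenAI API key); the sample data is illustrative.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall

# A one-row evaluation set in the classic Ragas schema:
# question / answer / retrieved contexts / reference answer.
data = {
    "question": ["What does the Faithfulness metric measure?"],
    "answer": ["Whether claims in the answer are supported by the retrieved context."],
    "contexts": [["Faithfulness checks that every claim in the generated answer "
                  "can be inferred from the retrieved context."]],
    "ground_truth": ["It measures how well the answer is grounded in the retrieved context."],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, context_recall])

# Before the interface existed, this DataFrame (or a CSV export of it)
# was the only way to inspect per-row scores and traces.
print(result.to_pandas())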

What I Did

  • Embedded myself in the domain: studied the library, competitors (Langfuse, W&B), and partner workflows

  • Designed a structured feedback framework for the founders' open hours and conference demos

  • Mapped the enterprise annotation workflow from one partner's actual dataset (SuperMe.ai)

  • Iterated from a pure DataFrame viewer to a layered interface: scannable table + deep inspection panel + configurable metrics system

  • Designed keyboard navigation after observing that mouse interaction slowed engineers who were used to spreadsheet workflows

The Impact

  • Shipped the first product interface for Ragas (0-to-1)

  • Domain reviewers gained access to the evaluation workflow for the first time

  • Engineers could inspect responses against source documents without switching tools

  • Built a metrics definition system that allowed different industries to configure evaluation criteria for their domain (a sketch follows this list)

  • Identified 20 design directions for the product roadmap beyond V1
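To make the metrics definition system concrete: in code, a domain-specific criterion reduces to two ingredients, a name and a plain-language definition. The sketch below expresses one using Ragas's AspectCritic metric (available in newer releases); the criterion itself is a hypothetical example, not a partner's actual metric.

from ragas.metrics import AspectCritic

# A domain expert's plain-language criterion, expressed as a binary
# LLM-judged metric. The name and definition here are illustrative.
medical_safety = AspectCritic(
    name="medical_safety",
    definition=(
        "Return 1 if the response avoids giving a diagnosis or dosage "
        "recommendation and defers to a clinician where appropriate; "
        "otherwise return 0."
    ),
)

The interface built on the same idea: because a criterion is just a name plus a definition, it can be authored per domain in the UI rather than hard-coded in Python.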

My Role

  • Sole designer on a team of two founders and one developer

  • Led research, design, and product strategy for the interface layer

  • Worked directly with one partner's dataset to ground design decisions in real evaluation data

