
Google Docs for AI evaluation

Get feedback from the people who know “good” to ship better AI products.

One shareable link. No account. Comments become prompt edits, automatically.

Free in beta · BYOK · No per-seat billing

Trusted by

Honestly? Nobody yet. You could be the first logo on this row.

Free while we're in beta. Say hi and we'll set you up personally.

Who gives you feedback

The people you actually want feedback from.

The people who know what “good” looks like. One link, no account, comments in the browser.

General Counsel
Gets a link in her inbox, flags two responses from her phone at lunch, and closes the tab.
VP Marketing
Rates brand voice on five responses without seeing which model or which draft — feedback lands without bias.
Head of CX
Grades support replies against policy. Comments route back as prompt edits the engineer can ship.

Getting human feedback on AI is broken.

  • You asked legal to review three responses. They said “looks fine” in Slack and moved on.

  • You sent a Google Doc to marketing. Feedback scattered across 12 comment threads.

  • You launched anyway. And hoped the brand voice held up.

From prompt to production. Three steps, one link.

Draft, get blind reviewer feedback, and apply concrete prompt edits — without leaving a tab open for your reviewers.

1 Set up

Draft a prompt, pick the models.

Write your system message, mark the variables, and choose a few models or temperatures to compare side-by-side.

New review · customer support tone · v3
System message system.md

You are a customer support agent. Be friendly and professional. Respond clearly to the user’s message and resolve their issue.

Variables
{{customer_message}} {{account_tier}}
Responses to compare 3 of 5
A GPT-4o
t=0.7

OpenAI

B Claude Haiku 4
t=0.7

Anthropic

C Gemini 2.5 Pro
t=1.0

Google

3 test cases · BYOK · Start review
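If you are doing this by hand today, step 1 is the part you usually script yourself: fill one prompt template with a test case, fan it out across a few model and temperature pairs, and collect the outputs for review. Below is a rough sketch of that manual version. The OpenAI client usage is real, but the model names, variant labels, and review structure are illustrative assumptions, not Blind Bench's API.

```python
# Sketch of the manual workflow the "Set up" step replaces: fill one prompt
# template with test-case variables, then fan it out across a few
# model/temperature pairs. Model names and the review structure here are
# illustrative only.
from openai import OpenAI

client = OpenAI()  # BYOK: reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a customer support agent. Be friendly and professional. "
    "Respond clearly to the user's message and resolve their issue."
)

test_case = {
    "customer_message": "I was charged twice for my subscription this month.",
    "account_tier": "pro",
}

# (blind label, model, temperature) for each variant to compare
variants = [("A", "gpt-4o", 0.7), ("B", "gpt-4o-mini", 1.0)]

outputs = {}
for label, model, temperature in variants:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": (
                f"Customer message: {test_case['customer_message']}\n"
                f"Account tier: {test_case['account_tier']}"
            )},
        ],
    )
    outputs[label] = response.choices[0].message.content

# Reviewers should only ever see `outputs` keyed by blind labels (A, B, ...),
# never the model names or temperatures behind them.
print(outputs)
```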
2 Share

Send one link. No login.

Reviewers see outputs blind — no model names, no version numbers — and highlight the lines they like or hate right in the browser.

Customer message

“Hi, I just noticed I was charged twice for my subscription this month — $49 on March 3rd and again on March 5th. I’ve been a customer for two years and this is really frustrating. Can you help me get this sorted out?”

Blind review · Which response is better?
2 outputs · Blind mode · Click the response you’d send

One blind test. Blind Bench makes it the default for every prompt.

Get started
3 Apply

Turn feedback into prompt edits.

The optimizer reads every annotation and proposes a new version, with each change cited to the comment that motivated it.

Optimizer · v3 → v4
v3 Previous system message

You are a helpful customer support agent. Answer the user’s question.

Reviewer annotations from v3 run
A Output A

“Reads like a corporate form letter.”

on "Dear valued customer, I appreciate your inquiry."

B Output B

“Too robotic — "I understand your concern" is filler.”

on "I understand your concern and will assist you."

v4 Proposed system message

You are a customer support agent. Be friendly and professional. Write like a helpful coworker, not a corporate robot. Avoid formal openers (“Dear valued customer”) and form-letter phrases (“I understand your concern”). Get to the point and be warm.

Why these changes

Both outputs drifted on tone. Output A went corporate; Output B defaulted to empty filler. The rewrite adds explicit prohibitions and a positive anchor (“helpful coworker, not a corporate robot”).
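The general pattern behind this step is simple enough to sketch: hand a model the previous system message plus the reviewers' annotations and ask for a revision that cites each change back to the comment that motivated it. The snippet below illustrates that pattern with the OpenAI SDK; it is an approximation of the idea under assumed data shapes, not Blind Bench's actual optimizer.

```python
# Sketch of the optimizer pattern: previous prompt + reviewer annotations in,
# revised prompt with per-change citations out. Illustrative only.
from openai import OpenAI

client = OpenAI()

previous_prompt = (
    "You are a helpful customer support agent. Answer the user's question."
)

annotations = [
    {"output": "A", "quote": "Dear valued customer, I appreciate your inquiry.",
     "comment": "Reads like a corporate form letter."},
    {"output": "B", "quote": "I understand your concern and will assist you.",
     "comment": "Too robotic; 'I understand your concern' is filler."},
]

notes = "\n".join(
    f"- Output {a['output']}, on \"{a['quote']}\": {a['comment']}"
    for a in annotations
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.2,
    messages=[
        {"role": "system", "content": (
            "You revise system prompts. Return the new prompt, then a short "
            "'Why these changes' section citing the annotation behind each change."
        )},
        {"role": "user", "content": (
            f"Previous system message:\n{previous_prompt}\n\n"
            f"Reviewer annotations:\n{notes}"
        )},
    ],
)

print(response.choices[0].message.content)
```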

For the engineer who ships

Stop chasing sign-off in Slack.

You wrote the prompt. You need honest feedback from the people who know “good” — without scheduling a meeting or waiting on a thread. Send one link. Get structured, blind, traceable feedback back as prompt edits. BYOK, no per-seat billing.

What Blind Bench does that others don’t.

Other tools log prompts or collect scores. None of them close the loop. Blind Bench routes every reviewer comment back into the prompt — the next version is built from the feedback that drove it.

Prompt management & tracing

Langfuse, PromptLayer, PostHog, Braintrust

What went wrong
Built for developers. The reviewers whose judgment matters — PMs, legal, marketing — never open them, so feedback never reaches the prompt.
Blind Bench
Pairs with them and closes the loop: reviewer comments become the next prompt version.

Eval platforms

Promptfoo, LangSmith

What went wrong
Reviewers see which variant is yours, and scores sit in a dashboard instead of driving the next edit.
Blind Bench
Blind by default. Every annotation maps to a concrete prompt change, cited to the comment.

Google Docs / Slack

or email threads

What went wrong
No structure, no blinding, no path from a comment to a prompt change.
Blind Bench
Structured, blind, and traceable — every comment lands in the next iteration.
Soon

Integrations: bring your traces from Langfuse, PostHog, or PromptLayer straight into a review.

Get the feedback that actually makes your AI agent better.

Five minutes to set up. One shareable link. BYOK.

Get started