AI in Production: Building AskVerdict from Zero to Launch

How I went from idea to shipping an AI-powered verdict engine — model selection, prompt engineering, hallucination handling, and the cost lessons that humbled me.

8 min read
Gagan Deep Singh

Founder | GLINR Studios


Everyone's building AI features. Few are talking honestly about what it actually takes to ship one that doesn't embarrass you in production. AskVerdict is my AI-powered verdict engine — it takes a question, surfaces relevant context, and generates a structured verdict with reasoning and confidence. Here's the uncensored build log.

The Idea and Why It Was Harder Than I Thought

The core premise of AskVerdict: give users a structured, cited answer to a question rather than a wall of text. Think less "chat with a document" and more "here is the verdict, here is the reasoning, here is the evidence, here is the confidence level." Structured output as a first-class concern from day one.

I thought the hard part would be the AI integration. It wasn't. The hard part was evaluation — knowing whether the system was actually good, consistently, across the input distribution you care about. More on that later.

Choosing the Model Stack

I evaluated GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. My criteria:

  • Structured output reliability — JSON mode / function calling consistency
  • Context window — needed at least 128K for document-heavy queries
  • Latency — anything over 8 seconds feels broken to users
  • Cost per call — multiplied by expected volume

GPT-4o won on structured output consistency. The JSON mode is genuinely reliable — I can count on my VerdictOutput schema coming back correctly formatted 99.8% of the time. Claude was more nuanced in its reasoning but less predictable about schema compliance without aggressive prompting. Gemini had the best context window but the slowest p95 latency at the time of testing.

For the embedding layer I used OpenAI's text-embedding-3-small. It's cheap, fast, and the quality difference versus text-embedding-3-large didn't justify 5x the cost for my use case.

Prompt Engineering: What Actually Works

My first prompt was three paragraphs of natural language instructions. It worked about 70% of the time. The other 30% produced verdicts that were technically correct but structurally inconsistent, or that hallucinated citations.

The breakthrough was treating the prompt like a type contract. Every output field gets a definition, a format, and an example. Constraints are listed explicitly, not implied:

const VERDICT_SYSTEM_PROMPT = `
You are a verdict engine. Given a question and context documents,
output a structured verdict in the following JSON format:
 
{
  "verdict": "string — one clear declarative sentence",
  "confidence": "HIGH | MEDIUM | LOW",
  "reasoning": "string — 2-4 sentences explaining the verdict",
  "supporting_evidence": ["array of direct quotes from context"],
  "caveats": ["array of important limitations or counterpoints"],
  "source_ids": ["array of document IDs referenced"]
}
 
Rules:
- verdict MUST be a single sentence
- supporting_evidence MUST quote directly from provided context
- If context is insufficient, set confidence to LOW and say so in reasoning
- NEVER fabricate citations not present in context
- source_ids MUST only reference IDs from the provided document list
`.trim();

The explicit NEVER fabricate instruction made a measurable difference. Before adding it, hallucinated citations appeared in roughly 4% of responses. After: under 0.5%.
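A prompt written as a type contract pairs naturally with a runtime check on the way back in. In production you'd likely reach for a schema library like Zod; here is a minimal dependency-free sketch of validating the model's output against the `VerdictOutput` contract above (the helper names are illustrative, not AskVerdict's actual code):

```typescript
// Sketch: runtime validation of the VerdictOutput contract defined in the
// system prompt. A library like Zod does this more robustly; this is a
// dependency-free outline of the same idea.
type Confidence = 'HIGH' | 'MEDIUM' | 'LOW';

interface VerdictOutput {
  verdict: string;
  confidence: Confidence;
  reasoning: string;
  supporting_evidence: string[];
  caveats: string[];
  source_ids: string[];
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every(item => typeof item === 'string');
}

function parseVerdict(raw: string): VerdictOutput | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON — caller should retry or degrade gracefully
  }
  const confidenceOk = ['HIGH', 'MEDIUM', 'LOW'].includes(data?.confidence);
  if (
    typeof data?.verdict === 'string' &&
    confidenceOk &&
    typeof data?.reasoning === 'string' &&
    isStringArray(data?.supporting_evidence) &&
    isStringArray(data?.caveats) &&
    isStringArray(data?.source_ids)
  ) {
    return data as VerdictOutput;
  }
  return null; // structurally invalid — never show a half-formed verdict
}
```

Rejecting malformed output at the boundary means the rest of the pipeline can trust the shape of every verdict it touches.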

Handling Hallucinations in Production

Hallucinations aren't a bug you can patch. They're a probabilistic property of the model you have to design around. My mitigation strategy has three layers:

1. Grounding with retrieved context. AskVerdict uses RAG (Retrieval-Augmented Generation). The model only answers based on documents I provide. No grounding documents, no verdict — the system returns a structured "insufficient context" response rather than guessing.
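The "no grounding, no verdict" rule can be enforced as a gate in front of the model call. A sketch, with illustrative names and an illustrative 0.75 relevance threshold (not AskVerdict's actual values):

```typescript
// Sketch: refuse to generate when retrieval comes back weak. The threshold
// and type names are illustrative stand-ins.
interface RetrievedDoc {
  id: string;
  content: string;
  score: number; // retrieval similarity, 0..1
}

interface InsufficientContext {
  verdict: 'INSUFFICIENT_CONTEXT';
  confidence: 'LOW';
  reasoning: string;
}

function gateRetrieval(
  docs: RetrievedDoc[],
  minScore = 0.75
): RetrievedDoc[] | InsufficientContext {
  const grounded = docs.filter(d => d.score >= minScore);
  if (grounded.length === 0) {
    // No usable grounding: return a structured refusal instead of letting
    // the model guess.
    return {
      verdict: 'INSUFFICIENT_CONTEXT',
      confidence: 'LOW',
      reasoning: 'No retrieved documents met the relevance threshold.',
    };
  }
  return grounded; // safe to build the prompt from these
}
```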

2. Citation verification. After the model responds, I run a verification pass that checks every entry in supporting_evidence against the source documents using fuzzy string matching. Evidence that doesn't appear verbatim (or near-verbatim) in the source gets flagged and stripped before the response reaches the user.

// Helpers assumed from elsewhere in the codebase:
//   embed(text)                — returns the text's embedding vector
//   cosineSimilarity(a, b)     — cosine similarity between two vectors
//   extractBestMatch(q, text)  — fuzzy search for the closest substring to q
function verifyEvidence(
  evidence: string[],
  sourceDocs: Document[]
): { verified: string[]; flagged: string[] } {
  const allText = sourceDocs.map(d => d.content).join(' ');

  return evidence.reduce(
    (acc, quote) => {
      // Compare the quote against the closest passage in the source text.
      const similarity = cosineSimilarity(
        embed(quote),
        embed(extractBestMatch(quote, allText))
      );
      if (similarity > 0.88) {
        acc.verified.push(quote);
      } else {
        acc.flagged.push(quote); // stripped before the response ships
      }
      return acc;
    },
    { verified: [] as string[], flagged: [] as string[] }
  );
}

3. Confidence calibration UI. I display confidence levels prominently and with honest language. LOW confidence verdicts show a banner: "The available context is limited — treat this verdict as a starting point, not a conclusion." Users appreciate honesty more than false authority.

Building the Evaluation Pipeline

This is where most AI projects cut corners and regret it. You cannot know if your system is improving without a rigorous eval pipeline.

I built a dataset of 200 test cases: questions with associated context documents and human-written "ground truth" verdicts. Each test case is tagged by domain (legal, technical, general knowledge) and difficulty (easy/medium/hard).

The eval runs on every prompt change and every model upgrade:

  • Verdict accuracy: semantic similarity between generated verdict and ground truth (using embedding cosine similarity, threshold 0.85)
  • Citation precision: what fraction of cited evidence is actually present in source docs
  • Schema compliance: does the output match the Zod schema exactly
  • Confidence calibration: do HIGH confidence verdicts actually have higher accuracy than LOW ones

That last metric caught a subtle regression when I switched model versions — accuracy dropped but the model became more confident. A dangerous combination that the eval surfaced before it hit production.

Cost Optimization: The Bill That Humbled Me

My first month in production with naive prompting cost more than I expected. The three biggest savings:

Token budgeting on context. I was passing entire documents into context. The average document was 8,000 tokens. Passing five documents per query at GPT-4o pricing adds up fast. I switched to chunked retrieval — 512-token chunks, top-5 by semantic similarity — and cut average input tokens by 73%.
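The chunk-then-select flow can be sketched in a few lines. Real token counting would use the model's tokenizer (e.g. tiktoken); whitespace-split words are a stand-in here, and the function names are illustrative:

```typescript
// Sketch: fixed-size chunking approximated by whitespace tokens, then top-K
// selection by a relevance score (e.g. embedding similarity to the query).
function chunkDocument(text: string, chunkSize = 512): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkSize) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}

// Keep only the K most relevant chunks instead of whole documents.
function topK<T>(items: T[], score: (x: T) => number, k = 5): T[] {
  return [...items].sort((a, b) => score(b) - score(a)).slice(0, k);
}
```

Five 512-token chunks instead of five 8,000-token documents is the entire trick: same retrieval interface, a fraction of the input tokens.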

Caching identical queries. AskVerdict has clusters of similar questions (people phrase the same thing differently). I compute a normalized query embedding and cache verdicts by approximate nearest neighbor — same question gets the cached response if asked within 24 hours. Cache hit rate: ~31%. That's 31% of calls that cost zero.
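The cache lookup reduces to "is there a recent entry whose query embedding is close enough to mine?" A sketch with a linear scan standing in for a real ANN index, and an illustrative 0.95 similarity threshold:

```typescript
// Sketch: approximate-match cache keyed by query embedding. The similarity
// threshold and TTL are illustrative; a real deployment would use an ANN
// index (e.g. HNSW) instead of a linear scan.
interface CacheEntry {
  embedding: number[];
  verdict: string;
  createdAt: number; // epoch ms
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function lookupCache(
  queryEmbedding: number[],
  cache: CacheEntry[],
  now: number,
  ttlMs = 24 * 60 * 60 * 1000,
  threshold = 0.95
): string | null {
  for (const entry of cache) {
    if (now - entry.createdAt > ttlMs) continue; // expired
    if (cosine(queryEmbedding, entry.embedding) >= threshold) {
      return entry.verdict; // hit: zero-cost response
    }
  }
  return null; // miss: fall through to the model
}
```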

Tiered model routing. Simple factual queries (high-confidence retrieval match, short context) go to gpt-4o-mini. Complex, ambiguous queries with long context go to gpt-4o. The router is a small classifier that costs essentially nothing. 60% of queries route to mini. Cost reduction: approximately 40% without measurable quality degradation on the routed queries.
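The post describes the router as a small classifier; the decision it makes can be illustrated with a heuristic stand-in. The thresholds below are invented for the sketch, not AskVerdict's actual values:

```typescript
// Sketch: route cheap queries to the small model. The real router is a
// trained classifier; this heuristic only illustrates the decision shape.
interface RoutedQuery {
  retrievalScore: number; // best retrieval match, 0..1
  contextTokens: number;  // total context size for this query
}

function routeModel(q: RoutedQuery): 'gpt-4o-mini' | 'gpt-4o' {
  // Simple factual query: strong retrieval match and short context.
  const simpleFactual = q.retrievalScore >= 0.85 && q.contextTokens <= 2000;
  return simpleFactual ? 'gpt-4o-mini' : 'gpt-4o';
}
```

Because the router runs before the expensive call, a misroute costs at most one answer's quality, never extra spend.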

UX for AI Responses

Streaming is non-negotiable for any response that takes longer than about 2 seconds. Users tolerate a long response if they see progress. A white screen for 6 seconds, then a wall of text, feels broken.

I stream the verdict components in a defined order: verdict sentence first (so users get the answer immediately), then confidence, then reasoning, then evidence. The most important information arrives first — same principle as a newspaper lede.

One thing that surprised me: users engage more with LOW confidence verdicts than HIGH ones. I think it's because the caveats give them something to explore. High confidence verdicts feel final; low confidence verdicts feel like the start of a conversation.

Deployment on Vercel

AskVerdict's frontend and API routes both live on Vercel. The AI calls happen in Next.js API routes using the Vercel AI SDK, which handles streaming responses cleanly with streamText:

import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
 
export async function POST(req: Request) {
  const { question, contextDocs } = await req.json();
 
  const result = await streamText({
    model: openai('gpt-4o'),
    system: VERDICT_SYSTEM_PROMPT,
    messages: buildMessages(question, contextDocs),
    maxTokens: 1024,
  });
 
  return result.toDataStreamResponse();
}

Edge runtime for the streaming routes brought p50 latency down from 420ms to 180ms for the connection establishment phase. The model call itself is what it is, but shaving 240ms off the time-to-first-token meaningfully improves perceived responsiveness.
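In the App Router, opting a route into the Edge runtime is a one-line segment config export in the route file (the path below is illustrative):

```typescript
// In the route file (e.g. app/api/verdict/route.ts): Next.js route segment
// config that moves this handler to the Edge runtime, so the stream's
// connection is established from a location near the user.
export const runtime = 'edge';
```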

What I'd Do Differently

Eval pipeline first, features second. I built features for three weeks before I had any systematic way to measure quality. I was flying blind. Two days spent on the eval harness at the start would have saved weeks of guessing.

Don't underestimate the chunking strategy. How you split documents for RAG affects quality more than model choice. I tried fixed-size chunks, sentence-boundary chunks, and semantic chunks. Semantic chunking (grouping sentences by topic coherence) improved retrieval precision by ~15%.
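Semantic chunking can be sketched as: walk the sentences in order and start a new chunk wherever adjacent-sentence similarity drops below a threshold. Embeddings are passed in precomputed here, and the 0.8 threshold is illustrative:

```typescript
// Sketch: group adjacent sentences into chunks while their embeddings stay
// similar; split where similarity drops (a likely topic shift).
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function semanticChunks(
  sentences: string[],
  embeddings: number[][], // one embedding per sentence, precomputed
  threshold = 0.8
): string[] {
  if (sentences.length === 0) return [];
  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosineSim(embeddings[i - 1], embeddings[i]) >= threshold) {
      current.push(sentences[i]); // same topic: extend the chunk
    } else {
      chunks.push(current.join(' ')); // topic shift: close the chunk
      current = [sentences[i]];
    }
  }
  chunks.push(current.join(' '));
  return chunks;
}
```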

Rate limiting and abuse detection on day one. AI APIs are expensive. On the second day after launch someone ran 400 queries in an hour. I had no rate limiting. That was a learning moment.
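The core of a per-user rate limit is a sliding window of recent request timestamps. A minimal in-memory sketch — note that on a serverless platform instance memory isn't shared, so production would back this with something like Redis; this shows the logic only:

```typescript
// Sketch: in-memory sliding-window rate limiter per user. Illustrative only:
// serverless instances don't share memory, so real deployments use a shared
// store (e.g. Redis) for the timestamp windows.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  allow(userId: string, now: number): boolean {
    // Keep only timestamps still inside the window.
    const recent = (this.hits.get(userId) ?? []).filter(
      t => now - t < this.windowMs
    );
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false; // over budget: respond with 429
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```

With a limit of, say, 60 queries per hour per user, the 400-queries-in-an-hour incident becomes a stream of 429s instead of a surprise bill.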

The gap between "AI that works in a demo" and "AI that works reliably in production for real users" is enormous. The demo is easy. The evaluation, the cost management, the hallucination mitigation, the graceful degradation — that's the actual engineering work. And it's worth doing right.

