When One LLM Isn't Enough
On LLM Control
- Schema Design
- When One LLM Isn't Enough
- Shipping in Production
Co-authored with Sarthak Mahajan, who contributed significantly through his research on extreme speech classification and LLM analysis of extreme speech content.
In Part 1, we covered schema design - simple schemas, detailed descriptions, breaking down scoring, constrained decoding. That gets you 80% of the way there.
This post is about the other 20%. The architectures and patterns you reach for when a single model call with a good schema still isn’t cutting it.
Justifications Before Answer
This matters most for non-thinking LLMs, but it applies to all of them.
Do NOT ask for the answer first and the explanation second. LLMs are auto-regressive - what comes before conditions what comes after, and the tokens immediately preceding an output matter more than anything a paragraph away.
Always ask for the explanation first, then the answer. This forces the LLM to “think through” the problem rather than plopping down an answer before its hidden states are attuned to the reasoning. That failure mode can bite even thinking models!
In practice: put a reasoning: str field before your answer: int field in the schema. The model is forced to serialize its thinking into tokens that then condition the final output. It’s a poor man’s chain-of-thought baked into the schema itself. Works surprisingly well even on smaller models.
```python
class ReasonedClassification(BaseModel):
    # reasoning BEFORE answer - this is the key
    reasoning: str = Field(
        description="Step-by-step reasoning about why this code "
        "might or might not have a bug"
    )
    has_bug: bool
    bug_line: int = Field(
        description="Line number of the bug, or 0 if no bug"
    )
```

The model writes its reasoning first, which conditions the has_bug and bug_line tokens. In testing, this correctly identified a subtle off-by-one bug in a binary search implementation (low = mid instead of low = mid + 1).
Structured Thinking
Sometimes you want LLMs to think through step-by-step before they give you an answer - and “justification before answer” isn’t granular enough.
Use structured outputs with multiple steps. Have a justification for each step, then the action taken at that step. Repeat for as many steps as the task needs.
For LLMs with no “thinking mode,” you can just use structured outputs and a field for “thinking.” A simple thinking: str field placed before your actual output fields goes further than you’d expect.
The key insight: you’re not asking the model to “think harder.” You’re giving it token-space to reason in, which mechanically changes what the next tokens look like. Auto-regressive models can only reason through what they’ve already written.
```python
class ThinkingStep(BaseModel):
    step_description: str
    observation: str
    conclusion: str

class StructuredAnalysis(BaseModel):
    thinking: str = Field(
        description="Initial high-level thoughts"
    )
    step_1: ThinkingStep = Field(
        description="Check the claim's source"
    )
    step_2: ThinkingStep = Field(
        description="Check for logical consistency"
    )
    step_3: ThinkingStep = Field(
        description="Check for missing context"
    )
    final_verdict: bool
    confidence: int = Field(description="1-100")
```

Fixed fields (step_1, step_2, step_3) instead of a list. The model knows exactly how many steps to produce, and each step has structure.
Writer-Critic Model
Use a pair of LLMs: one generates responses, the other grades them, and they iterate toward a final answer. Set a quality threshold, and stop only when the threshold is met or a max iteration cap is hit.
I’ve used this for creative text generation, email generation - tasks that require quite a bit of authenticity and subjective rating. Got one LLM to behave as the copywriter. Got another LLM to behave as the recipient reading the email based on public LinkedIn profile info.
The “critic” doesn’t need to be a smart model. It just needs to be good at scoring against criteria - which, as we covered in Part 1, is easier when you break the scoring schema into sub-fields.
Set a max iteration cap though. Without one, you’ll burn tokens watching two models argue in circles.
```python
class EmailDraft(BaseModel):
    subject: str
    body: str = Field(description="2-4 sentences")

class EmailCritique(BaseModel):
    clarity_score: int = Field(description="1-10")
    tone_score: int = Field(description="1-10")
    actionability_score: int = Field(description="1-10")
    feedback: str

MAX_ITERATIONS = 3
THRESHOLD = 8

for iteration in range(MAX_ITERATIONS):
    # Writer generates
    draft = call_llm(writer_prompt, EmailDraft)

    # Critic scores
    critique = call_llm(draft_as_input, EmailCritique)

    avg = (critique.clarity_score + critique.tone_score
           + critique.actionability_score) / 3
    if avg >= THRESHOLD:
        break

    # Feed critique back to writer for next iteration
    writer_prompt += f"\nFeedback: {critique.feedback}"
```

Best-of-N Sampling
Choose a non-zero temperature. Generate N outputs. Then, in a second round, have an LLM rank them all and pick the best one.
The core insight: it’s easier for LLMs to verify than to generate correct content. You’re exploiting this asymmetry. Generating a perfect answer on the first try is hard. Picking the best one out of five is comparatively easy.
This pairs well with the confidence scoring from Part 1 - you can have the ranking model assign confidence scores to each candidate and pick programmatically rather than trusting the model’s “this one is best” judgment.
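As a sketch of that programmatic pick: score each candidate separately (Part 1-style confidence scoring), then take the argmax yourself instead of asking the ranker for its favorite. The candidate strings and scores here are illustrative stand-ins for per-candidate LLM outputs.

```python
# Programmatic selection: one confidence score per candidate
# (e.g. from a scoring schema), then argmax - no "pick the best"
# judgment call delegated to the model.

def pick_best(candidates: list[str], scores: list[int]) -> str:
    # max() keeps the first maximum, so ties break toward earlier candidates
    best_i = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_i]
```

The ranking variant below does the same job in one call, at the cost of trusting the model's comparison.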
```python
class RankedCandidate(BaseModel):
    best_index: int = Field(description="0-based index of best")
    reasoning: str

# Generate N candidates at high temperature
candidates = []
for _ in range(5):
    response = client.beta.chat.completions.parse(
        model="gpt-4.1-nano",
        messages=[...],
        response_format=Headline,
        temperature=0.9,
    )
    candidates.append(response.choices[0].message.parsed)

# Rank at temp=0
numbered = "\n".join(
    f"{i}: {c.headline}" for i, c in enumerate(candidates)
)
result = client.beta.chat.completions.parse(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": numbered}],
    response_format=RankedCandidate,
    temperature=0,
)
best = candidates[result.choices[0].message.parsed.best_index]
```

Freeform Thinking Models + Formatting Models
Structured outputs can genuinely constrain a model’s ability to think. Some really smart models have a hard time following structured output reliably - DeepSeek R1 and other Chinese OSS models, for example. You can burn $$$ on inference only to have structured output generation fail midway, forcing you to pay for the expensive generation all over again.
The fix: run the smart model freeform, and use a sh*tty model - one that’s good at JSON (e.g. Gemini 2.0 Flash, or anything fine-tuned for structured output generation) - to format the result as JSON.
Allow a “release valve” of “invalid answer” or “error” JSON output to accommodate the freeform model responding unexpectedly. The formatter model doesn’t need to be smart - it just needs to be reliable at JSON. That’s a much easier job.
This is arguably the most cost-effective pattern on this list. You pay premium rates for thinking and bargain-bin rates for formatting. Each model does what it’s best at.
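A minimal sketch of the split, assuming hypothetical call_smart_model / call_formatter wrappers (placeholders for your actual inference clients, not real APIs). The formatter's instructions carry the "error" release valve:

```python
import json

# call_smart_model / call_formatter are hypothetical placeholders
# for your inference clients - not real library APIs.

FORMAT_INSTRUCTIONS = (
    'Convert the analysis below to JSON: '
    '{"sentiment": "positive|negative|neutral", "confidence": <1-100>}. '
    'If the analysis does not fit the schema, return '
    '{"error": "<short reason>"} instead.'
)

def think_then_format(question: str, call_smart_model, call_formatter) -> dict:
    # Stage 1: the expensive model reasons freeform, unconstrained by a schema
    freeform = call_smart_model(question)
    # Stage 2: a cheap, JSON-reliable model squeezes it into the schema
    raw = call_formatter(FORMAT_INSTRUCTIONS + "\n\n" + freeform)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # The formatter itself failed - treat it as the release valve firing
        return {"error": "formatter emitted invalid JSON"}
    # If the formatter returned {"error": ...}, that's the valve; handle upstream
    return parsed
```

The caller checks for an "error" key and decides whether to retry, fall back, or surface the failure.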
Prompt Engineering
Use Anthropic’s prompt generator - it’ll iteratively improve your prompt. Tune the thorough prompt it generates instead of composing prompts from scratch yourself.
Assign an appropriate ROLE to the LLM - not just a task. It helps you elicit the information you want more reliably. You want to nudge the internal hidden states toward the region of latent space that best serves your answer. Because this is inherently a black box, your best bet is to get there by invoking a specific role.
“You are an expert medical coder with 15 years of experience” activates a different region of the model’s capabilities than “classify this medical record.” Same task. Very different outputs.
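As a sketch, the role goes in the system message and the bare task in the user message. The role text, task wording, and helper name here are illustrative, not a prescribed prompt:

```python
# Illustrative role assignment - the role text and task are examples.
ROLE = (
    "You are an expert medical coder with 15 years of experience "
    "reading clinical notes."
)

def build_messages(record: str) -> list[dict]:
    # System message sets the role; user message carries only the task
    return [
        {"role": "system", "content": ROLE},
        {"role": "user", "content": f"Classify this medical record:\n{record}"},
    ]
```

Swapping only the ROLE string lets you A/B test personas against the same task without touching the rest of the pipeline.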
Pressure Release Valve
Get the LLM to generate “feedback” - “what can the author of the prompt do better next time to help you give a better response?” - so you can figure out how to improve the prompt over the long term.
Found this on some YC podcast and it’s been gold. Add a prompt_feedback: str field to your schema. The model will tell you when your instructions are ambiguous, when fields contradict each other, when it didn’t have enough context. Track this feedback in observability tools like LangSmith and you’ve got a continuous improvement loop - the model is doing prompt engineering for you.
```python
class AnalysisWithFeedback(BaseModel):
    reasoning: str
    sentiment: str
    confidence: int = Field(description="1-100")
    prompt_feedback: str = Field(
        description="Feedback for the prompt author: what could "
        "they do better next time to help you give a more "
        "accurate response? Be specific about ambiguities, "
        "missing context, or contradictions."
    )
```

On a deliberately ambiguous input like “The product is fine,” the model returned: “The prompt provides minimal information, making it difficult to determine a strong sentiment. For future reviews, more detailed feedback about specific aspects of the product would help.” That’s the model doing prompt engineering for you.
Up Next
At this point you’ve got solid schemas and smart multi-model architectures. But none of it matters if you can’t ship it reliably.
Part 3: Shipping LLMs in Production - temperature tuning, graceful degradation, observability, token caching, jailbreak detection, and all the infra you need to not get paged at 3am.
Post 2 of 3 in On LLM Control