
Schema Design

Co-authored with Sarthak Mahajan, who contributed significantly through his research on extreme speech classification using LLMs.

Tired of LLMs failing at producing JSON and ruining it for you?

LLMs are inherently stochastic. Truth is, you won’t be able to hit the reliability of a typical CRUD endpoint. However, we can get quite close. This is Part 1 of a 3-part series on how to actually make that happen. We’re starting with the most impactful thing you can do: get your schemas right.

Use Structured Outputs

This is the most well-known way to get LLMs to interact with interfaces - use structured outputs, not “plz make JSON for me.” If your provider doesn’t support structured outputs, use tool calling with tool_choice to force it.

But structured outputs go deeper than just “give me JSON.” You can modify the descriptions of the fields - and the types too - to get models to respond in specific ways. Think actor-critic patterns, careful field naming with rich descriptions, deliberate field ordering and re-ordering, and pressure-release valves for when things go sideways.

Build discriminated unions with room for "error" objects too. Structured output + discriminated union + a fallback constant belonging to the union for when the LLM shits the bed. This is non-negotiable for production systems.

from enum import Enum
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, Field

class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

class SuccessResult(BaseModel):
    status: Literal["success"] = "success"
    sentiment: Sentiment
    confidence: int = Field(
        description="Confidence from 1 to 100"
    )

class ErrorResult(BaseModel):
    status: Literal["error"] = "error"
    error_reason: str = Field(
        description="Why the analysis could not be completed"
    )

class SentimentResponse(BaseModel):
    result: SuccessResult | ErrorResult

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4.1-nano",
    messages=[...],
    response_format=SentimentResponse,
)

The discriminated union means you always get valid output - either a real result or a structured error you can handle programmatically.
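Downstream handling then becomes a plain branch on the status tag. A minimal sketch - the raw JSON strings here are hypothetical samples of what the model might return, parsed with stdlib `json` for illustration:

```python
import json

def handle_sentiment(raw: str) -> str:
    """Dispatch on the discriminated union's status tag."""
    result = json.loads(raw)["result"]
    if result["status"] == "success":
        return f"sentiment={result['sentiment']} ({result['confidence']}/100)"
    # status == "error": a structured fallback you can log, retry, or surface
    return f"analysis failed: {result['error_reason']}"

# Hypothetical model outputs, one per branch
ok = '{"result": {"status": "success", "sentiment": "positive", "confidence": 87}}'
bad = '{"result": {"status": "error", "error_reason": "input was empty"}}'
```

Either way, the caller gets a string back - never an unhandled exception from malformed output.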

Keep Schemas Simple

Nested schemas, discriminated unions (beyond the error fallback), arbitrary length lists, and “free form” text fields are enemies of reliability and must be eliminated as much as possible. Stick to enums, ints, and bools where possible - use floats, strings, lists, and dicts very sparingly. The smaller the model, the more pertinent this is.

Having field_1: str, field_2: str, ... is better than having a list that the LLM is supposed to fill to appropriate length.
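To see why, compare the raw JSON Schemas the two designs produce - these are hypothetical fragments for a "give me three findings" task. The fixed-field version pins both the count and the names; the list version leaves length entirely up to the model.

```python
# Hypothetical schema: fixed fields pin the count and names
fixed_fields = {
    "type": "object",
    "properties": {
        "finding_1": {"type": "string"},
        "finding_2": {"type": "string"},
        "finding_3": {"type": "string"},
    },
    "required": ["finding_1", "finding_2", "finding_3"],
}

# Hypothetical schema: the model decides how many items to emit
open_list = {
    "type": "object",
    "properties": {"findings": {"type": "array", "items": {"type": "string"}}},
}
```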

This sounds limiting. It’s not. It forces you to think about what you actually need from the model, and that exercise alone improves your outputs.

from enum import Enum

from pydantic import BaseModel, Field

class Urgency(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Category(str, Enum):
    BUG = "bug"
    FEATURE = "feature"
    QUESTION = "question"
    DOCS = "docs"

class TicketClassification(BaseModel):
    category: Category
    urgency: Urgency
    is_security_related: bool
    estimated_effort_hours: int = Field(
        description="Rough estimate of hours to resolve, from 1 to 40"
    )

Enums, ints, bools. No lists, no free-form strings. The model can’t hallucinate a category that doesn’t exist.

Provide Detailed Field Descriptions

Don’t rely on the LLM “inferring” meaning. Ever.

Field names are not documentation. score: int means nothing. score: int - a rating from 1 to 100 indicating how relevant this document is to the user's query, where 1 means completely irrelevant and 100 means a perfect match means everything.

The model has to figure out what you want from what you give it. Give it more.

from pydantic import BaseModel, Field

# Bad
class BadSchema(BaseModel):
    score: int
    relevant: bool

# Good
class GoodSchema(BaseModel):
    relevance_score: int = Field(
        description="A rating from 1 to 100 indicating how relevant "
        "this document is to the user's query, where 1 means "
        "completely irrelevant and 100 means a perfect match"
    )
    is_relevant: bool = Field(
        description="Whether the document should be included in "
        "search results - true if relevance_score >= 50"
    )

Break Down Your Schema

Don’t ask “rank this on a scale of 0 to 10.” Instead, ask field_1: int, field_2: int, field_3: int - each with a simpler, individual scoring mechanism - which you then combine using sum/avg/etc. programmatically.

Asking for a single holistic score is basically asking the model to do multi-criteria evaluation in one shot. Splitting it into sub-scores and aggregating programmatically gives you both better reliability and interpretability - you can see why something scored high or low. Statistically, this leads to more stable outputs.

from pydantic import BaseModel, Field

class ArticleQuality(BaseModel):
    clarity_score: int = Field(
        description="How clear and readable the writing is, 1-10"
    )
    accuracy_score: int = Field(
        description="How factually accurate the content is, 1-10"
    )
    depth_score: int = Field(
        description="How thoroughly the topic is covered, 1-10"
    )
    originality_score: int = Field(
        description="How novel or unique the perspective is, 1-10"
    )

# Then aggregate programmatically (parsed = response.parsed)
overall = (
    parsed.clarity_score
    + parsed.accuracy_score
    + parsed.depth_score
    + parsed.originality_score
) / 4

YAML for Token Savings

Save input tokens - and cost - by using YAML for your inputs. It's more compact and just as expressive.

This doesn't work as well for output, though - structured outputs are typically enforced through JSON-constrained decoding. JSON output is much more reliable thanks to deliberate fine-tuning, heaps of JSON in pre-training, and tuning for better performance under constrained decoding.

So: YAML in, JSON out.

# YAML input - compact, readable
yaml_input = """
project: website-redesign
deadline: 2026-04-01
tasks:
  - Fix mobile navigation bug
  - Add dark mode support
  - Write API documentation
constraints:
  - Only one developer available
  - Must ship mobile fix first
"""

# JSON structured output
response = client.beta.chat.completions.parse(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": "Prioritize these tasks."},
        {"role": "user", "content": yaml_input},
    ],
    response_format=PrioritizedTasks,
)
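As a rough illustration of the savings - using character counts as a proxy for tokens, and a hand-written YAML string rather than a YAML library:

```python
import json

data = {
    "project": "website-redesign",
    "tasks": ["Fix mobile navigation bug", "Add dark mode support"],
}
as_json = json.dumps(data, indent=2)
# Same data, hand-written as YAML
as_yaml = (
    "project: website-redesign\n"
    "tasks:\n"
    "  - Fix mobile navigation bug\n"
    "  - Add dark mode support\n"
)
# YAML drops the braces, quotes, and commas, so it comes out shorter
print(len(as_yaml), len(as_json))
```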

Confidence Scoring

You can use scores such as logprobs to understand an LLM’s “confidence” about a token.
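Logprobs convert to probabilities with a simple exponential; for an answer spanning several tokens, sum the logprobs first, then exponentiate. A minimal sketch - the function names are mine, not a provider API:

```python
import math

def token_confidence(logprob: float) -> float:
    """Probability of a single token, recovered from its logprob."""
    return math.exp(logprob)

def answer_confidence(logprobs: list[float]) -> float:
    """Joint probability of a multi-token answer: sum logprobs, then exp."""
    return math.exp(sum(logprobs))
```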

For more complex cases where responses span across tokens, use manual confidence integer scores. Prefer a 1-100 integer scale - many LLMs struggle to generate floating-point outputs, or confuse the 0-1, 0%-100%, and 0-100 scales. Ints. Just use ints.

This also works when assigning tasks to LLMs. For example, give the LLM examples of “this is 0/10 and this is 10/10 … generate 4/10 on this scale …” This allows semantic and qualitative grounding of scales that’d otherwise be very difficult to quantify.
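A sketch of what that anchoring looks like in a prompt - the example anchors here are hypothetical, not from any real rubric:

```python
# Hypothetical anchors that ground a 0-10 toxicity scale semantically
SCALE_PROMPT = (
    "Rate toxicity from 0 to 10.\n"
    '0/10 example: "Thanks for the writeup, this helped a lot."\n'
    '10/10 example: "I will make sure everyone knows how worthless you are."\n'
    "Now rate the following message on this scale:\n"
)
```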

from pydantic import BaseModel, Field

class ClassificationWithConfidence(BaseModel):
    is_spam: bool
    confidence: int = Field(
        description="How confident you are, from 1 to 100, "
        "where 1 is pure guess and 100 is absolutely certain"
    )
    spam_indicators: int = Field(
        description="Number of spam indicators found, 0 to 5"
    )

Constrained Decoding

If inferring locally, use constrained decoding (Outlines, grammar-based sampling). This forces the model to only produce tokens that are valid within your schema. The model literally cannot produce invalid JSON - the sampling mask prevents it.
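A toy sketch of the idea (not the Outlines API): at each decoding step, mask the vocabulary down to the tokens that keep the output a valid prefix of something allowed - here, one of a fixed set of enum values.

```python
def allowed_tokens(prefix: str, vocab: list[str], choices: list[str]) -> list[str]:
    """Tokens t such that prefix + t is still a prefix of some valid choice."""
    return [
        t for t in vocab
        if any(choice.startswith(prefix + t) for choice in choices)
    ]

# Toy vocabulary and the enum values we want to constrain to
vocab = ["low", "med", "ium", "high", "banana"]
choices = ["low", "medium", "high"]
```

With an empty prefix, "banana" is masked out entirely; after "med", only "ium" survives. Real libraries do this over the model's full tokenizer vocabulary, compiled from your JSON schema.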

If you’re not running locally, lean on the structured output support from your provider as mentioned above.

Up Next

Schema design gets you most of the way there. But sometimes one model call isn’t enough - sometimes you need models to think step-by-step, critique each other, or split responsibilities between a “thinker” and a “formatter.”

That’s Part 2: When One LLM Isn’t Enough.

Post 1 of 3 in On LLM Control
