When you first build a multi-agent system, the initial architectural focus is usually on logic: getting the Researcher Agent to talk to the Writer Agent, and having the Reviewer Agent approve the output. You watch the terminal logs scroll by as your autonomous agents converse and solve complex problems. It feels like magic.

Then, you look at your Azure OpenAI billing dashboard at the end of the week, and the magic immediately fades. You are burning through tokens at an alarming rate, and upon closer inspection, you are paying for your AI agents to be polite to each other.

In this deep dive, I’m going to share the exact architectural changes I made to a production LangGraph orchestration system that reduced our daily token consumption by over 80,000 tokens—slashing our LLM costs without degrading the intelligence of the system.

The Problem: Conversational Filler in Agentic Systems

Most developers build multi-agent systems using chat-completion endpoints (like gpt-4o) that are inherently fine-tuned for conversational, human-like interaction. When you instruct Agent A to pass a summary to Agent B, the default behavior of the LLM is to wrap the data in pleasantries and conversational filler.

Consider a standard handoff. A Researcher Agent might output:

“Here is the summary of the financial data you requested. As you can see, Q3 revenue was up by 15%, but operating costs increased significantly due to the cloud migration. Let me know if you need any further analysis before you write the final report!”

This is extremely inefficient for LLM-to-LLM communication. Agent B (the Writer) does not care about the pleasantries. It only needs the raw data. Furthermore, because multi-agent orchestrators like LangGraph pass the entire message history into the context window for every subsequent turn, that conversational filler gets re-processed, re-tokenized, and billed on every single hop of the graph.
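To see how filler compounds, consider a toy model (the numbers here are illustrative assumptions, not measurements from the production system): if every hop re-sends the full message history, each token of filler is billed once per remaining hop, not once.

```python
# Toy model of cumulative input tokens in a linear multi-agent chain
# where the full message history is re-sent on every hop.
# All token counts are illustrative assumptions.

def total_input_tokens(hops: int, tokens_per_message: int) -> int:
    """Each hop re-reads all messages produced so far, then appends its own."""
    history = 0
    total = 0
    for _ in range(hops):
        total += history               # agent re-processes the existing history
        history += tokens_per_message  # then appends its own message
    return total

# A 10-hop chain: 100-token structured payloads vs. the same payloads
# plus ~40 tokens of conversational filler each.
lean = total_input_tokens(10, 100)     # 4,500 input tokens
chatty = total_input_tokens(10, 140)   # 6,300 input tokens
print(chatty - lean)                   # 1,800 extra tokens from filler alone
```

Just 40 filler tokens per message costs 1,800 extra input tokens over a 10-hop chain in this model, before counting the output tokens spent generating the filler in the first place.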

The Fix: Forcing Structured JSON Interfaces

The solution is to fundamentally change how agents perceive their communication channels. Agents should not “chat” with each other; they should invoke APIs. By forcing agents to output strict JSON schemas and completely stripping conversational dialogue, you instantly compress the token payload.

Implementation: Pydantic & OpenAI Structured Outputs

Using OpenAI’s Native Structured Outputs (or LangChain’s with_structured_output), you can define a rigid Pydantic model for the handoff between agents:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class ResearcherOutput(BaseModel):
    revenue_growth_q3: float = Field(description="Percentage growth in Q3")
    key_cost_drivers: list[str] = Field(description="List of primary expenses")
    requires_follow_up: bool = Field(description="Flag if data is incomplete")


llm = ChatOpenAI(model="gpt-4o")

# Bind the schema to the LLM
structured_llm = llm.with_structured_output(ResearcherOutput)

# The agent now outputs raw JSON, no filler:
# {"revenue_growth_q3": 15.0, "key_cost_drivers": ["cloud migration"], "requires_follow_up": false}
```

The Architectural Impact: Why This Saves Tokens

Switching from conversational text to strict JSON payloads yields three compounding benefits across your orchestration graph:

  1. Reduced Output Tokens: The agent no longer generates the 20-40 tokens of “Here is the data…” or “Let me know if…”. Over hundreds of loops, this alone saves thousands of output tokens (which are typically more expensive than input tokens).
  2. Massive Input Token Compression: Because the message history is smaller, every subsequent agent in the chain inherits a dramatically lighter context window. If Agent C needs to review the work of Agent A and Agent B, it only parses pure data, avoiding the token inflation of reading a simulated conversation.
  3. Deterministic Routing: You no longer need to use an LLM to parse intent (e.g., asking an LLM “Did the researcher find everything?”). Because the output is a rigid schema with boolean flags like requires_follow_up, you can use standard Python if/else statements in your LangGraph routing nodes, completely bypassing LLM inference for control flow.
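In LangGraph terms, the deterministic-routing point means the conditional edge can be a plain function over the structured state. A sketch under assumed node names (`researcher` and `writer` are hypothetical, not from the article's graph):

```python
# A deterministic routing function for a LangGraph conditional edge.
# No LLM call: the decision reads a boolean from the structured payload.
def route_after_research(state: dict) -> str:
    output = state["researcher_output"]  # a ResearcherOutput-style dict
    if output["requires_follow_up"]:
        return "researcher"  # loop back for more data
    return "writer"          # hand off to the Writer agent

# Wired into the graph (sketch):
# graph.add_conditional_edges(
#     "researcher",
#     route_after_research,
#     {"researcher": "researcher", "writer": "writer"},
# )
```

Every routing decision made this way costs zero tokens, which is where much of the control-flow savings described below comes from.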

The Results: From 110k to 30k Tokens Daily

By enforcing strict JSON handoffs and replacing LLM-based intent parsing with deterministic Python routing logic on the structured payloads, the daily token burn of the workflow plummeted from ~110,000 tokens to under 30,000 tokens.

The agents didn’t get dumber—in fact, they became more reliable because the structured data prevented hallucinated context drift. When building multi-agent systems, always remember: your agents are microservices, not colleagues. Make them communicate via APIs.

Related Reading: Once your agents are communicating efficiently, you need to ensure their state is stored durably. Check out my architectural comparison on Managing State in Multi-Agent Workflows: Redis vs Cosmos DB, and learn how to orchestrate them at scale in Orchestrating AI Agents: LangGraph vs Azure AI Agents.
