
The Adaptive Workflow Gap: Why Production AI Needs Pressure-Aware Routing

What the Baros prototype reveals about the missing layer in agent orchestration

When LLM latency spikes 3-10x during peak hours, static workflows fail catastrophically rather than degrading gracefully. This essay examines the gap between what production AI needs and what current orchestration tools provide. Drawing on Deerfield Green’s prototype gallery — Baros, Chremata, Kitsune, and Pelagos — we analyze where rigid Directed Acyclic Graphs fail and what pressure-aware routing would require. The agent revolution won’t arrive through better prompts. It will arrive through infrastructure that acknowledges uncertainty instead of pretending it doesn’t exist. Teams building agent-based systems need to understand when adaptive routing pays off, when it’s over-engineering, and what infrastructure must exist before it becomes viable.

The Rigidity Problem in Current Orchestration

Most AI workflow tools ship with a comfortable lie: that you can map complex reasoning onto a fixed graph and expect consistent results. LangChain chains treat agent workflows as Directed Acyclic Graphs: nodes connected by edges, execution flowing in one direction, no cycles, no surprises [6]. LangGraph adds cycles and conditional edges, but its graph is still defined up front, and its routing functions branch on workflow state rather than on live system load. This works beautifully until load varies, models stall, or a single slow API call cascades into a timeout that takes down the whole pipeline.

Chremata’s architecture exposes this tension clearly. Its earnings-transcript NLP pipeline chains six stages (ingestion, cleaning, subjective labeling, objective labeling, NER, and classification), each of which depends on the previous stage completing successfully [4]. If the LLM-based subjective labeling stage slows under load, everything behind it waits. There’s no bypass, no alternative path, no pressure valve. The pipeline either completes or fails.
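To make the rigidity concrete, here is a minimal Python sketch of a strictly sequential six-stage pipeline. The stage names follow the Chremata description above, but the `Stage` class, the simulated latencies, and the per-stage budgets are hypothetical; nothing here reflects Chremata’s actual code. The point is structural: when one stage blows its budget, upstream work is discarded and nothing downstream runs.

```python
from dataclasses import dataclass
from typing import Callable


class PipelineTimeout(Exception):
    """Raised when a stage's (simulated) latency exceeds its budget."""


@dataclass
class Stage:
    name: str
    latency_s: float   # simulated latency for this run (hypothetical numbers)
    budget_s: float    # per-stage timeout
    fn: Callable[[dict], dict] = lambda doc: doc  # placeholder transformation


def run_static(stages: list, doc: dict) -> dict:
    """Run stages strictly in order; any stage over budget aborts the whole run."""
    elapsed = 0.0
    for stage in stages:
        if stage.latency_s > stage.budget_s:
            # No bypass and no fallback: upstream work is discarded, downstream never runs.
            raise PipelineTimeout(
                f"{stage.name} exceeded {stage.budget_s}s budget "
                f"after {elapsed}s of completed work"
            )
        doc = stage.fn(doc)
        elapsed += stage.latency_s
        doc.setdefault("completed", []).append(stage.name)
    return doc


STAGE_NAMES = ["ingestion", "cleaning", "subjective_labeling",
               "objective_labeling", "ner", "classification"]

# Normal load: every stage finishes within budget.
normal = [Stage(n, latency_s=1.0, budget_s=5.0) for n in STAGE_NAMES]
print(run_static(normal, {})["completed"])

# Peak load: only the LLM labeling stage slows down (1s -> 6s).
spiked = [Stage(n, 6.0 if n == "subjective_labeling" else 1.0, 5.0) for n in STAGE_NAMES]
try:
    run_static(spiked, {})
except PipelineTimeout as err:
    print("pipeline failed:", err)
```

With the simulated 6-second spike against a 5-second budget, the run fails after two stages of completed work, and the three stages after the labeling step never execute.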

Production teams report LLM latency spikes of 3-10x during peak hours, and static DAGs cannot detect or adapt to these conditions [19]. You can add retries and timeouts, but those treat symptoms, not the underlying rigidity. The workflow assumes every step will behave identically regardless of context, load, or external conditions. That assumption held when agents ran locally on single documents. It breaks when you’re processing thousands of transcripts, curating RLHF datasets at scale, or fusing multiple real-time data sources [8].

The question isn’t whether adaptive routing matters. It’s whether the engineering complexity pays off.

Inside Baros: Architecture and Methodology

Baros measures geopolitical pressure — not workflow pressure [13]. The name derives from the Greek baros (weight, pressure), continuing the nautical naming convention shared with Pelagos (open sea) and Pharos (lighthouse) [10]. It functions as a barometer for global risk, fusing prediction market sentiment with conflict indicators to signal escalating tension before crises erupt.

The same principle — detecting stress before failure — is what production AI orchestration lacks. Baros fuses multiple signal sources (prediction markets, conflict databases, shipping data) into a composite index. A pressure-aware workflow router would fuse latency metrics, queue depth, and error rates into a routing decision.
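A minimal sketch of that fusion step follows, assuming three per-node signals: recent p95 latency relative to a baseline, queue depth, and error rate. The normalization ranges, weights, and both thresholds are illustrative assumptions, not values taken from Baros or any other Deerfield Green prototype; a weighted sum is simply the most basic way to combine normalized signals into one index.

```python
from dataclasses import dataclass


@dataclass
class NodeSignals:
    p95_latency_s: float    # recent 95th-percentile latency at this node
    baseline_p95_s: float   # p95 under normal load
    queue_depth: int        # requests currently waiting on this node
    queue_capacity: int     # depth treated as fully saturated
    error_rate: float       # fraction of recent calls that failed (0..1)


# Hypothetical weights; real values would be tuned against measured incidents.
WEIGHTS = {"latency": 0.5, "queue": 0.3, "errors": 0.2}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def pressure_index(s: NodeSignals) -> float:
    """Fuse three normalized signals into a single 0..1 pressure score."""
    # 1x baseline maps to 0.0 and 10x maps to 1.0, spanning the 3-10x spike range.
    latency = clamp01((s.p95_latency_s / s.baseline_p95_s - 1.0) / 9.0)
    queue = clamp01(s.queue_depth / s.queue_capacity)
    errors = clamp01(s.error_rate / 0.10)  # assume a 10% error rate is saturation
    return (WEIGHTS["latency"] * latency
            + WEIGHTS["queue"] * queue
            + WEIGHTS["errors"] * errors)


def route(s: NodeSignals, degrade_at: float = 0.4, bypass_at: float = 0.7) -> str:
    """Map the pressure score to one of three hypothetical routes."""
    p = pressure_index(s)
    if p >= bypass_at:
        return "bypass"    # e.g. serve a cached result or skip a non-critical stage
    if p >= degrade_at:
        return "degrade"   # e.g. switch to a faster model
    return "primary"


calm = NodeSignals(1.0, 1.0, queue_depth=5, queue_capacity=100, error_rate=0.0)
busy = NodeSignals(5.0, 1.0, queue_depth=50, queue_capacity=100, error_rate=0.02)
print(route(calm), route(busy))  # primary degrade
```

Keeping the score in a single 0 to 1 range reduces the routing policy to a pair of thresholds, which is easy to tune and to log. The cost is that a weighted sum can hide one saturated signal behind two calm ones; taking the maximum of the normalized signals is a stricter alternative.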

A pressure-aware routing system requires three new layers that no current prototype implements. First, real-time latency monitoring at each node. Not just end-to-end tracing, but per-stage metrics that detect degradation as it happens. Kitsune integrates Langfuse Cloud for trace observation, but these traces record what happened, not what’s happening [2]. Pressure-aware routing needs streaming metrics, not post-mortem analysis.
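Per-stage streaming metrics can be as simple as a bounded window of recent latencies per stage, queried on every routing decision. The sketch below is a hypothetical in-process monitor, not Langfuse’s API or any Kitsune component. It uses a nearest-rank percentile and flags a stage as degraded when its live p95 exceeds a multiple of its baseline p95 (3x by default, an assumed threshold matching the low end of the spike range cited earlier).

```python
import math
from collections import defaultdict, deque


class StageLatencyMonitor:
    """Rolling per-stage latency window (a hypothetical in-process design)."""

    def __init__(self, window: int = 200):
        # Each stage keeps only its most recent `window` samples.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, stage: str, latency_s: float) -> None:
        self.samples[stage].append(latency_s)

    def percentile(self, stage: str, q: float) -> float:
        """Nearest-rank percentile (q in 0..100) over the current window."""
        data = sorted(self.samples[stage])
        if not data:
            raise ValueError(f"no samples recorded for stage {stage!r}")
        rank = max(1, math.ceil(q * len(data) / 100))
        return data[rank - 1]

    def is_degraded(self, stage: str, baseline_p95_s: float, ratio: float = 3.0) -> bool:
        """True when the live p95 exceeds `ratio` times the stage's baseline p95."""
        return self.percentile(stage, 95) > ratio * baseline_p95_s


monitor = StageLatencyMonitor(window=100)
for _ in range(100):
    monitor.record("subjective_labeling", 0.5)      # normal load
print(monitor.is_degraded("subjective_labeling", baseline_p95_s=0.5))  # False
for _ in range(20):
    monitor.record("subjective_labeling", 3.0)      # peak-hour slowdown
print(monitor.is_degraded("subjective_labeling", baseline_p95_s=0.5))  # True
```

Because the window is bounded, a slowdown registers as soon as slow calls make up more than 5% of recent samples, and it ages out once normal latencies refill the window.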

Second, threshold-based circuit breakers that trigger alternative paths. When latency exceeds a defined threshold at any node, the system automatically routes around the bottleneck by using cached results, switching to faster models, or skipping non-critical stages. Chremata’s LangGraph-based subjective labeling stage could implement this, falling back to cached labels when OpenAI latency spikes [6].
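The circuit-breaker pattern itself is well established; the sketch below adapts it to latency as well as errors. It is a hypothetical, single-threaded illustration: `primary` stands in for a live LLM call and `fallback` for a cached-label lookup, and neither corresponds to Chremata’s real interfaces. It also measures latency after the primary call returns rather than cancelling a slow call; a production version would enforce a hard timeout.

```python
import time
from typing import Callable, Optional


class LatencyCircuitBreaker:
    """Opens after `max_strikes` consecutive slow or failed primary calls (a sketch)."""

    def __init__(self, threshold_s: float, max_strikes: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold_s = threshold_s
        self.max_strikes = max_strikes
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.strikes = 0
        self.opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: let the next call probe the primary; one more strike re-opens.
            self.opened_at = None
            self.strikes = self.max_strikes - 1
            return False
        return True

    def _strike(self) -> None:
        self.strikes += 1
        if self.strikes >= self.max_strikes:
            self.opened_at = self.clock()

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self._is_open():
            return fallback()                  # route around the bottleneck
        start = self.clock()
        try:
            result = primary()
        except Exception:
            self._strike()
            return fallback()
        if self.clock() - start > self.threshold_s:
            self._strike()                     # a slow success still counts against the primary
        else:
            self.strikes = 0
        return result                          # this sketch keeps a slow result once it arrives


# Fake clock so the example runs instantly and deterministically.
now = [0.0]
breaker = LatencyCircuitBreaker(threshold_s=2.0, max_strikes=2, cooldown_s=10.0,
                                clock=lambda: now[0])


def slow_llm_label() -> str:      # hypothetical stand-in for an LLM call under load
    now[0] += 5.0
    return "llm-label"


def cached_label() -> str:        # hypothetical stand-in for a cache lookup
    return "cached-label"


print([breaker.call(slow_llm_label, cached_label) for _ in range(3)])
# ['llm-label', 'llm-label', 'cached-label']
```

After the cooldown the breaker lets one probe call through (the half-open state): a fast response closes it, while one more slow response re-opens it immediately.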

Third, fallback validation to ensure degraded modes still produce acceptable outputs. This is the hardest piece. You need to verify that a fast, approximate result meets quality thresholds before accepting it. Kitsune’s quality gates enforce field validation and duplicate detection, but they halt on failure rather than degrading gracefully [11].
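One way to implement fallback validation is to run the degraded result through explicit acceptance checks before the router commits to it. The sketch assumes a sentence-level labeling output; the `LabelResult` shape, the allowed label set, and the coverage and confidence thresholds are all hypothetical. Unlike a halting gate, it returns a verdict with reasons, so the caller can decide whether to accept, retry on the primary path, or escalate.

```python
from dataclasses import dataclass, field

ALLOWED_LABELS = {"positive", "negative", "neutral"}   # hypothetical label set


@dataclass
class LabelResult:
    labels: dict        # sentence id -> label
    confidence: dict    # sentence id -> model confidence in 0..1
    source: str         # e.g. "primary", "fast_model", "cache"


@dataclass
class Verdict:
    accepted: bool
    reasons: list = field(default_factory=list)


def validate_fallback(result: LabelResult, expected_ids: set,
                      min_coverage: float = 0.95, min_confidence: float = 0.6) -> Verdict:
    """Accept a degraded result only if it clears explicit quality thresholds."""
    reasons = []
    covered = expected_ids & set(result.labels)
    coverage = len(covered) / len(expected_ids) if expected_ids else 1.0
    if coverage < min_coverage:
        reasons.append(f"coverage {coverage:.0%} below {min_coverage:.0%}")
    invalid = [i for i in covered if result.labels[i] not in ALLOWED_LABELS]
    if invalid:
        reasons.append(f"{len(invalid)} label(s) outside the allowed set")
    weak = [i for i in covered if result.confidence.get(i, 0.0) < min_confidence]
    if weak:
        reasons.append(f"{len(weak)} label(s) below confidence {min_confidence}")
    return Verdict(accepted=not reasons, reasons=reasons)


ids = {"s1", "s2", "s3", "s4"}
cached = LabelResult({"s1": "positive", "s2": "neutral", "s3": "negative", "s4": "neutral"},
                     {i: 0.9 for i in ids}, source="cache")
fast = LabelResult({"s1": "positive", "s2": "mixed"}, {"s1": 0.9, "s2": 0.3}, source="fast_model")
print(validate_fallback(cached, ids))
print(validate_fallback(fast, ids))
```

The cached result passes every check, while the fast-model result fails on coverage, an unknown label, and low confidence, so a router using this check would not ship it as if it were a full result.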

Baros’s name implies this capability even if its implementation doesn’t deliver it. The metaphor captures what’s missing: production AI needs instruments that signal stress before systems fail.

Performance Under Load: Empirical Findings

Without a working Baros routing implementation, we can’t measure its specific latency or cost improvements. But the existing prototypes offer proxy data on where static pipelines strain under load.

Kitsune’s quality gates reveal one bottleneck pattern: validation checks that enforce field presence, response length limits, and duplicate detection must run before datasets can be registered [11]. If the invalid ratio exceeds 1%, the entire batch fails. This all-or-nothing approach works for curated training data but would cripple a high-throughput inference pipeline where some failures are acceptable if overall throughput remains high.
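The contrast between all-or-nothing and graceful gating fits in a short sketch. The checks mirror the three described above (field presence, response length, duplicates), but the record schema, the length limit, and the `quarantine` mode with its 20% hard ceiling are assumptions for illustration, not Kitsune’s implementation. Only the 1% halt threshold comes from the text.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("prompt", "response")   # hypothetical schema
MAX_RESPONSE_CHARS = 4000                   # hypothetical length limit


def record_errors(rec: dict, seen: set) -> list:
    """Field presence, length limit, and exact-duplicate checks for one record."""
    errors = [f"missing {f}" for f in REQUIRED_FIELDS if not rec.get(f)]
    if len(rec.get("response") or "") > MAX_RESPONSE_CHARS:
        errors.append("response too long")
    key = (rec.get("prompt"), rec.get("response"))
    if key in seen:
        errors.append("duplicate")
    seen.add(key)
    return errors


@dataclass
class GateResult:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)   # flagged for manual review
    halted: bool = False


def gate(batch: list, mode: str = "halt", max_invalid_ratio: float = 0.01,
         hard_ceiling: float = 0.20) -> GateResult:
    """'halt' rejects the batch above max_invalid_ratio; 'quarantine' keeps valid records."""
    seen: set = set()
    result = GateResult()
    for rec in batch:
        if record_errors(rec, seen):
            result.quarantined.append(rec)
        else:
            result.accepted.append(rec)
    ratio = len(result.quarantined) / len(batch) if batch else 0.0
    limit = max_invalid_ratio if mode == "halt" else hard_ceiling
    if ratio > limit:
        # Nothing from this batch is registered.
        return GateResult(quarantined=result.quarantined, halted=True)
    return result


batch = [{"prompt": f"p{i}", "response": f"r{i}"} for i in range(98)]
batch += [{"prompt": "p0", "response": "r0"}, {"prompt": "p1", "response": "r1"}]  # duplicates
strict, lenient = gate(batch, "halt"), gate(batch, "quarantine")
print(strict.halted, len(strict.accepted))                               # True 0
print(lenient.halted, len(lenient.accepted), len(lenient.quarantined))   # False 98 2
```

Two duplicate records (a 2% invalid ratio) are enough to halt the batch in `halt` mode, while `quarantine` mode registers the 98 valid records and routes the duplicates to review. The hard ceiling keeps quarantine mode from quietly accepting a mostly broken batch, which usually signals an upstream fault rather than a few bad records.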

Chremata’s dependency chain shows another pattern. The pipeline requires transcripts, financial data, and LLM APIs to all succeed before producing final labels [9]. A single FMP API rate limit or OpenAI timeout blocks the entire stage. There’s no degraded mode, no partial completion, no fallback to cached or approximate results.
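A degraded mode can be expressed as a per-step policy: try the primary path, fall back to a cached or approximate result if one exists, skip the step if it is non-critical, and fail only when a critical step has no fallback. The sketch below is hypothetical; `Step`, the stand-in fetch functions, and the criticality flags are illustrative assumptions rather than Chremata’s actual stages or API clients.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]
    critical: bool = True
    fallback: Optional[Callable[[dict], dict]] = None   # e.g. cached or approximate result


def run_with_degraded_mode(steps: list, ctx: dict) -> tuple:
    """Run steps in order, recording per-step status instead of stopping at the first error."""
    status = {}
    for step in steps:
        try:
            ctx = step.run(ctx)
        except Exception as exc:               # rate limit, timeout, malformed response, ...
            if step.fallback is not None:
                ctx = step.fallback(ctx)       # a failing fallback propagates its own error
                status[step.name] = "fallback"
            elif not step.critical:
                status[step.name] = "skipped"
            else:
                raise RuntimeError(f"critical step {step.name!r} failed") from exc
        else:
            status[step.name] = "ok"
    return ctx, status


# Hypothetical stand-ins for data sources and an LLM call.
def fetch_transcript(ctx): return {**ctx, "transcript": "raw text"}
def fetch_financials(ctx): raise TimeoutError("rate limited")
def cached_financials(ctx): return {**ctx, "financials": "last cached quarter"}
def llm_enrichment(ctx): raise TimeoutError("LLM timeout")


steps = [
    Step("transcript", fetch_transcript),
    Step("financials", fetch_financials, fallback=cached_financials),
    Step("enrichment", llm_enrichment, critical=False),
]
ctx, status = run_with_degraded_mode(steps, {})
print(status)  # {'transcript': 'ok', 'financials': 'fallback', 'enrichment': 'skipped'}
```

The returned status map is what makes partial completion safe: downstream consumers can see which fields came from a fallback or were skipped, instead of mistaking degraded output for a full run.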

External research on agent latency confirms the scope of the problem. Sparkco’s analysis identifies prompt engineering, model optimization, and hardware utilization as the three primary levers for latency reduction — but notes that orchestration overhead often negates these gains [19]. LangChain users report difficulty isolating whether slowdowns come from the LLM provider or the framework itself, suggesting the orchestration layer adds opaque latency [23].

The empirical signal is clear: static pipelines work until they don’t. When they fail, they fail catastrophically rather than degrading gracefully. A pressure-aware system would detect the bottleneck and route around it — using cached results, skipping non-critical stages, or switching to faster models when latency spikes.

The Cost of Adaptivity: Trade-offs and Complexity

Building adaptive routing isn’t free. Every decision point adds latency. Every fallback path multiplies test scenarios. Every dynamic choice makes debugging harder because the same input can produce different execution traces depending on system state.
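The growth in test scenarios is easy to quantify. If each of Chremata’s six stages gains alternative routes, the router can produce any combination of them; the sketch below enumerates those combinations with `itertools.product`. The route options are hypothetical; only the stage list comes from the text.

```python
from itertools import product

STAGES = ["ingestion", "cleaning", "subjective_labeling",
          "objective_labeling", "ner", "classification"]


def execution_paths(options: dict) -> list:
    """Every combination of per-stage routes the router could choose."""
    return list(product(*(options[s] for s in STAGES)))


static = {s: ["primary"] for s in STAGES}
one_fallback = {s: ["primary", "fallback"] for s in STAGES}
three_way = {s: ["primary", "fast_model", "cache"] for s in STAGES}

print(len(execution_paths(static)))        # 1
print(len(execution_paths(one_fallback)))  # 64
print(len(execution_paths(three_way)))     # 729
```

Not every combination is reachable in practice, but each reachable one is a distinct execution trace to test, and one more thing to rule in or out when debugging.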

Kitsune’s architecture shows the baseline complexity of a production pipeline even without adaptive routing. It integrates Fireworks AI for training, Langfuse Cloud for trace observation, and Novita/Ollama for inference — three external services with independent failure modes [2]. Adding dynamic routing means monitoring all three, detecting degradation, and making routing decisions in real-time. That’s observability infrastructure on top of orchestration infrastructure.

The learning curve compounds this. Engineers building on LangGraph already face a steep climb — understanding nodes, edges, state management, and checkpointing [6]. Add pressure-aware routing and you need metrics collection, threshold tuning, and circuit breaker logic. The cognitive load shifts from “does this workflow produce correct results” to “does this workflow produce correct results under all possible load conditions.”

There’s also the risk of over-engineering. Most workflows don’t need adaptive routing. Chremata processes quarterly earnings calls — a batch workload with predictable timing and no real-time latency requirements [4]. Kitsune curates training datasets asynchronously [8]. For these use cases, static pipelines with good error handling and retry logic may be sufficient.

The trade-off calculation is specific: adaptive routing pays off when latency variance is high, partial completion is acceptable, and the cost of failure exceeds the cost of complexity. That’s a narrow band — but it’s the band where production AI systems live or die.

Implications for the Production Landscape

If pressure-aware routing represents the next evolution in AI orchestration, teams need to prepare differently than they would for incremental framework improvements. This isn’t a LangChain 2.0 upgrade; it’s an architectural shift from fixed execution paths to paths chosen at runtime from system state.

Standardization matters. Dynamic orchestration research emphasizes context-aware action sequencing, where systems autonomously manage actions based on outcomes and environmental signals [14]. But without common abstractions, every team builds its own router. Deerfield Green’s internal orchestration framework documentation distinguishes between static workflows and dynamic agents, suggesting both patterns have a place [16]. The question is when to use which.

Tooling support will determine adoption speed. Users already report difficulty profiling whether latency comes from the LLM provider or the framework [23]. Adaptive routing needs better observability: not just traces, but pressure metrics, routing decisions, and fallback outcomes. Without this, teams can’t tune thresholds or validate that adaptivity is helping rather than hurting.
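At minimum, that observability means recording every routing decision alongside its outcome. The sketch below assumes a hypothetical `RoutingEvent` record serialized as JSON lines and aggregates two per-stage numbers: how often the router left the primary path, and how often those adaptive outputs passed validation. The field names are assumptions, not a LangGraph or Langfuse schema.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class RoutingEvent:
    stage: str
    pressure: float    # score that drove the decision
    route: str         # "primary", "degrade", or "bypass"
    accepted: bool     # did the output pass validation?
    latency_s: float


def emit(event: RoutingEvent) -> str:
    """Serialize one decision as a JSON line (a real system would ship it to a metrics store)."""
    return json.dumps(asdict(event), sort_keys=True)


def summarize(lines: list) -> dict:
    """Per stage: share of calls routed off the primary path, and how many of those passed validation."""
    totals, adaptive, adaptive_ok = Counter(), Counter(), Counter()
    for line in lines:
        e = RoutingEvent(**json.loads(line))
        totals[e.stage] += 1
        if e.route != "primary":
            adaptive[e.stage] += 1
            adaptive_ok[e.stage] += int(e.accepted)
    return {
        stage: {
            "adaptive_rate": adaptive[stage] / n,
            "adaptive_accept_rate": adaptive_ok[stage] / adaptive[stage] if adaptive[stage] else None,
        }
        for stage, n in totals.items()
    }


log = [emit(RoutingEvent("ner", 0.1, "primary", True, 0.4)),
       emit(RoutingEvent("ner", 0.8, "bypass", False, 0.05)),
       emit(RoutingEvent("ner", 0.5, "degrade", True, 0.2)),
       emit(RoutingEvent("ner", 0.2, "primary", True, 0.5))]
print(summarize(log))  # {'ner': {'adaptive_rate': 0.5, 'adaptive_accept_rate': 0.5}}
```

A stage with a high adaptive rate but a low acceptance rate points to thresholds that trip too eagerly or a fallback that is too weak; a stage that never leaves the primary path may not need adaptive routing at all.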

Specific Deerfield Green assets could serve as testbeds for adaptive routing extensions. Chremata’s LangGraph-based subjective labeling stage [6] is the natural candidate: it already calls external LLM APIs and could fall back to cached labels or faster models when latency exceeds thresholds. Kitsune’s phased pipeline [8] could route around failed validation stages by flagging records for manual review rather than halting the entire batch. Pelagos’s fusion of prediction markets with shipping data uses the same multi-signal pattern that pressure-aware routing would need, applied to risk assessment rather than workflow execution [10].

The Baros naming convention (measuring pressure, like a barometer forecasting storms) captures the right intuition even if the implementation doesn’t match [13]. Production AI needs instruments that signal stress before systems fail. Fusing multiple signals surfaces risk while it’s still far from port, whether that risk is geopolitical or infrastructural [10][12].

Teams should start measuring now: latency distributions (not averages), failure modes by stage, and the cost of retries versus fallbacks. When tail latency runs more than 3x the median and partial completion is acceptable, adaptive routing moves from interesting to necessary.
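Here is a small sketch of what measuring distributions rather than averages looks like, using only the standard library’s `statistics` module. It treats the ratio of p99 to median latency as one way to operationalize the 3x rule; that choice of percentiles and the sample data are assumptions.

```python
import statistics


def latency_profile(samples_s: list) -> dict:
    """Summarize latency by percentiles, not just the mean."""
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {"mean": statistics.fmean(samples_s),
            "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def adaptive_routing_warranted(samples_s: list, partial_ok: bool,
                               tail_ratio: float = 3.0) -> bool:
    """Heuristic: p99 more than `tail_ratio` times the median, and partial results acceptable."""
    prof = latency_profile(samples_s)
    return partial_ok and prof["p99"] / prof["p50"] > tail_ratio


steady = [1.0] * 100
spiky = [1.0] * 95 + [6.0] * 5          # 5% of calls hit a 6x peak-hour slowdown
print(latency_profile(spiky))            # {'mean': 1.25, 'p50': 1.0, 'p95': 1.25, 'p99': 6.0}
print(adaptive_routing_warranted(steady, partial_ok=True))   # False
print(adaptive_routing_warranted(spiky, partial_ok=True))    # True
print(adaptive_routing_warranted(spiky, partial_ok=False))   # False
```

In the spiky sample the mean and even the p95 sit at 1.25 s, so an average-based dashboard would show only a mild slowdown; p99 is what exposes the 6x tail. The second condition matters just as much: the same distribution with `partial_ok=False` still argues for a static pipeline with retries.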

The agent revolution isn’t arriving through better prompts or larger models. It’s arriving through infrastructure that acknowledges uncertainty instead of pretending it doesn’t exist. Static DAGs work when the world behaves deterministically. Production AI lives in a world where APIs throttle, models stall, and load spikes without warning [19].

Baros, as implemented, measures geopolitical pressure rather than workflow pressure. But the metaphor holds: falling barometric pressure signals an incoming storm, and rising latency variance signals an incoming failure [13]. The teams that survive scale won’t be those with the most sophisticated agents. They’ll be those with systems that detect pressure early and route around it before the storm hits.

Start with measurement. Add adaptivity only where variance justifies complexity. And remember that the best routing decision is sometimes knowing when not to route at all — when a static pipeline, well-monitored and properly sized, remains the right tool. The future of AI infrastructure isn’t purely adaptive or purely static. It’s knowing which parts of your system need to bend and which should remain rigid [14][16].


References