Engineering blog

Built in public.
Written for engineers.

Deep technical writing on AI agents in production, LLM infrastructure, observability, and reliability. No fluff — only what we have learned building the runtime.

Reliability · Infrastructure · June 1, 2026 · 9 min read

The case for multi-provider LLM fallback chains

Every major LLM provider goes down: Anthropic, OpenAI, Google, all of them. If your agent is hardcoded to one model, every provider incident becomes your incident. Here is how fallback chains, circuit breakers, and latency-based routing keep agents running through an outage.

Production · Infrastructure · May 4, 2026 · 10 min read

Why AI agents fail in production and how to fix it

Most teams get their agent working in a Claude Code session in an afternoon. Then it hits production and breaks silently, expensively, in ways nobody anticipated. Eight failure modes, eight fixes — including agentic loops, prompt injection, and context blowout.

Coming soon

Step-level observability for AI agents — what you actually need

Week 3 · AI agent observability · Trace logging · Queryable runs

Building a retry engine for LLM agents the right way

Week 4 · Exponential backoff · Checkpointing · Idempotency
