A Measurement Thesis

Agents pick what goes into their own context window. When they pick well, they get better. When they pick badly, they waste your money. Nobody measures which is happening.

I built a five-layer agent architecture called qlawbox to fix that. It treats context engineering as a reinforcement learning problem: the agent samples from learned distributions to decide what belongs in context, gets feedback, and updates. Over time, the distributions converge on what actually helps.

The stack is unfinished, but it runs, it's principled, and it's observable. Everything below is open source.

qlawbox

Answering questions about agent behavior requires measuring agents in realistic, messy environments. qlawbox is a five-layer architecture built to do just this.

Each layer functions independently. When composed, they form a closed feedback loop: its behavior can be observed, its internal dynamics traced, and interventions on it measured.

01 · Knowledge + Learning — qortex (projects rules into context)
02 · Observability — qortex-observe (instruments the core)
03 · Runtime — vindler + bilrost (executes and dispatches)
04 · Nervous System — cadence (routes state drift)
05 · Interoception — interoception (affect signals)
     ↑ feedback: external signals, RMR, tokens, convergence

Each layer emits structured events. The feedback loop returns observations to the learning layer. Everything is traceable.

01

Knowledge + Learning — qortex

qortex frames context engineering as a learning problem. System prompt components, such as tools and other injected context, are modeled as arms of a contextual bandit. Per query, the agent Thompson-samples each arm's Beta posterior to decide what belongs in context; feedback then updates those posteriors.
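
The mechanism can be sketched in a few lines. This is an illustrative Beta-Bernoulli Thompson Sampling loop, not qortex's actual API; `ContextArm` and `select_context` are hypothetical names.

```python
import random

class ContextArm:
    """One system-prompt component (tool, rule, snippet) as a bandit arm.
    Illustrative only; qortex's real arm state lives in its stores."""
    def __init__(self, name):
        self.name = name
        self.alpha = 1.0  # prior successes
        self.beta = 1.0   # prior failures

    def sample(self):
        # Thompson Sampling: draw a plausible success rate from the posterior
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Bernoulli feedback: 1 = the component helped, 0 = it did not
        self.alpha += reward
        self.beta += 1 - reward

def select_context(arms, budget):
    """Include the `budget` arms with the highest posterior draws."""
    return sorted(arms, key=lambda a: a.sample(), reverse=True)[:budget]
```

Exploration falls out for free: an arm with little feedback has a wide posterior, so it still occasionally wins a draw and earns an observation.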

Beneath the bandit sits an experimental causal knowledge graph with typed edges, generated online via qortex-ingest. Feedback credit propagates backward through ancestor concepts, decaying by hop distance and edge strength.
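
The decay rule can be sketched as a breadth-first walk toward ancestors, with each hop multiplying credit by a decay factor and the edge's strength. The graph shape and parameter names here are assumptions; qortex's real traversal runs over Memgraph.

```python
def propagate_credit(graph, start, reward, decay=0.5, max_hops=3):
    """Propagate feedback credit backward through ancestor concepts.
    `graph` maps node -> list of (parent, edge_strength) pairs.
    Credit decays multiplicatively per hop and by edge strength."""
    credits = {start: reward}
    frontier = [(start, reward)]
    for _ in range(max_hops):
        next_frontier = []
        for node, credit in frontier:
            for parent, strength in graph.get(node, []):
                share = credit * decay * strength
                credits[parent] = credits.get(parent, 0.0) + share
                next_frontier.append((parent, share))
        frontier = next_frontier
    return credits
```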

qortex-online handles real-time indexing into Memgraph. Preliminary results suggest positive impact on retrieval performance with minimal overhead.

qortex is now also a network service. qortex serve starts a Starlette ASGI server exposing 35+ REST endpoints with HMAC-SHA256 auth and replay protection. Postgres backends (pgvector for embeddings, dedicated stores for Thompson Sampling arm states and interoception factors) replace the local SQLite defaults. Any language that speaks HTTP can now consume the full knowledge and learning stack.
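
A minimal sketch of the auth pattern: HMAC-SHA256 over a canonical string, with a timestamp and single-use nonce for replay protection. Header names and the canonical string here are assumptions, not qortex serve's actual wire format.

```python
import hashlib
import hmac
import secrets
import time

def sign_request(secret: bytes, method: str, path: str, body: bytes):
    """Client side: produce headers for an HMAC-authenticated request."""
    timestamp = str(int(time.time()))
    nonce = secrets.token_hex(16)  # replay protection: server rejects reuse
    canonical = b"\n".join([method.encode(), path.encode(),
                            timestamp.encode(), nonce.encode(), body])
    signature = hmac.new(secret, canonical, hashlib.sha256).hexdigest()
    return {"X-Timestamp": timestamp, "X-Nonce": nonce, "X-Signature": signature}

def verify(secret: bytes, method, path, body, headers, seen_nonces, max_skew=300):
    """Server side: reject stale timestamps and repeated nonces, then
    recompute the signature and compare in constant time."""
    if abs(time.time() - int(headers["X-Timestamp"])) > max_skew:
        return False
    if headers["X-Nonce"] in seen_nonces:
        return False
    canonical = b"\n".join([method.encode(), path.encode(),
                            headers["X-Timestamp"].encode(),
                            headers["X-Nonce"].encode(), body])
    expected = hmac.new(secret, canonical, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Signature"]):
        return False
    seen_nonces.add(headers["X-Nonce"])
    return True
```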

Seven framework adapters ship, each passing its framework's own test suite: CrewAI, Agno, AutoGen, LangChain, LangChain.js, Mastra, Vindler (fork: OpenClaw).

02

Observability — qortex-observe

Every selection, observation, and posterior update in the learning layer emits a structured event. qortex-observe routes those events to three export paths:

  • JSONL file sinks for offline replay
  • OpenTelemetry spans for distributed tracing via Jaeger
  • 48 Prometheus counters/gauges/histograms for real-time Grafana dashboards

MCP tracing middleware wraps each tool call with distributed trace context, so a single query can be traced from inbound message through graph traversal, vector retrieval, posterior sampling, and credit propagation.
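
The context that rides along with each call is the W3C Trace Context `traceparent` header. A sketch of its shape, independent of any tracing library (the helper names are hypothetical):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C `traceparent` header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def child_span(traceparent):
    """A downstream tool call keeps the trace id but mints a fresh span id;
    returns the new header plus the parent span id for the span link."""
    version, trace_id, parent_span, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}", parent_span
```

Because every hop preserves the trace id, Jaeger can stitch the retrieval, sampling, and credit-propagation spans back into one tree.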

03

Runtime — vindler + bilrost

vindler is a hardened fork of OpenClaw, rewired as the qlawbox agent runtime.

The fork replaces OpenClaw's memory layer with qortex and instruments every tool call and retrieval path through qortex-observe.

bilrost (PyPI) locks the runtime inside a Lima VM with OverlayFS for filesystem isolation and UFW NACLs for network policy. Network access follows a network-as-needed policy: tools declare their requirements, and bilrost opens only those ports for the duration of the call.
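
The declare-then-scope pattern can be sketched as a context manager. Everything here is hypothetical (`ToolSpec`, `network_scope`, the in-memory firewall stand-in); bilrost's real manifest format and UFW integration may look quite different.

```python
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Hypothetical shape of a tool's network declaration."""
    name: str
    ports: tuple = ()  # (host, port) pairs the tool may reach

class FakeFirewall:
    """In-memory stand-in for UFW rules, for illustration."""
    def __init__(self):
        self.open = set()
    def allow(self, host, port):
        self.open.add((host, port))
    def deny(self, host, port):
        self.open.discard((host, port))

@contextmanager
def network_scope(firewall, tool: ToolSpec):
    """Open only the declared ports, and close them when the call ends,
    even if the tool raises."""
    for host, port in tool.ports:
        firewall.allow(host, port)
    try:
        yield
    finally:
        for host, port in tool.ports:
            firewall.deny(host, port)
```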

Vindler also grew a LinWheel extension: 17 MCP tools for LinkedIn content operations, covering content processing, drafting, post management, voice profiles, and visual generation. All tools are opt-in (declared optional: true) and only activate when explicitly allowlisted. An approval gate (post_approve) is required before anything reaches post_schedule.

The codenames are Norse. In the Eddas, Bifrost (also spelled Bilrost) is the flaming rainbow bridge situated above the world-tree Yggdrasil, binding Miðgarð to Ásgarð. The only path in, it burns anything unwelcome that tries to cross. Vindler is Heimdallr, the guardian of heaven's gates.

04

Nervous System — cadence

Typed signal bus for ambient agent behavior.

Rather than waiting for prompts, agents subscribe to event streams and act when conditions are met. cadence classifies, prioritizes, and dispatches signals to the appropriate handler.
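
A minimal sketch of that dispatch model, assuming a type-keyed registry with priority ordering; this is in the spirit of cadence, not its actual API.

```python
from collections import defaultdict

class SignalBus:
    """Typed signal bus: handlers subscribe per signal type, and
    dispatch runs the highest-priority subscribers first."""
    def __init__(self):
        self.handlers = defaultdict(list)  # type -> [(priority, handler)]

    def subscribe(self, signal_type, handler, priority=0):
        self.handlers[signal_type].append((priority, handler))
        # Stable sort: ties keep subscription order
        self.handlers[signal_type].sort(key=lambda h: -h[0])

    def dispatch(self, signal_type, payload):
        """Deliver the payload to every subscriber of this type."""
        return [handler(payload) for _, handler in self.handlers[signal_type]]
```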

Internal state signals from the interoception layer route alongside external events through the same bus, so the agent's response to its own dynamics is handled by the same dispatch infrastructure as its response to the outside world.

cadence is intended to interoperate with interoception, triggering spontaneous "speech" in response to shifts in internal state and predictions thereof. This is very early stage and highly speculative, but a fundamental part of the long-term research horizon.

Docs ↗
05

Interoception — interoception

An analog of biological homeostasis.

The agent monitors its own conservation quantities — posterior stability, convergence rate, reward variance — and flags violations as affect signals: confusion, surprise, dissonance, novelty. Each signal maps to an observable property of the learning dynamics.
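
One way to make that mapping concrete: derive affect labels from a sliding window of rewards. The vocabulary and thresholds below are placeholders, chosen only to illustrate "observable property → affect signal".

```python
import statistics

def affect_signals(rewards, window=20, surprise_z=3.0, var_ceiling=0.2):
    """Map learning dynamics onto affect labels (illustrative mapping):
    - 'surprise': the latest reward is far from the recent mean
    - 'confusion': reward variance stays high (posteriors not settling)"""
    recent = rewards[-window:]
    signals = []
    if len(recent) >= 2:
        mean = statistics.fmean(recent[:-1])
        stdev = statistics.pstdev(recent[:-1]) or 1e-9  # avoid divide-by-zero
        if abs(recent[-1] - mean) / stdev > surprise_z:
            signals.append("surprise")
        if statistics.pvariance(recent) > var_ceiling:
            signals.append("confusion")
    return signals
```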

This is the most speculative layer. For now, it serves as an architectural placeholder for experimenting with internal signals — pointing cadence's ambient awareness inward, so agents can generate speech based on their perception of internal state and predictions about where that state is heading.

Like cadence, this is very early stage and highly speculative, but it represents a fundamental piece of the long-term research direction: agents that respond not only to the world, but to themselves.

Docs ↗

qortex-observe

48 Prometheus metrics feed Grafana dashboards that surface posterior dynamics: watch the agent learn, catch when it's learning the wrong things, intervene with tunable reward signals. Jaeger traces latencies from message ingestion through retrieval, concept extraction, and learning subsystem calls. Memgraph Lab provides live inspection of the knowledge graph as it's constructed.

Anthropic's own research on measuring agent autonomy identifies this as essential: post-deployment monitoring that reveals how agents actually behave in practice, not just what they're capable of in controlled evaluations. Pre-deployment testing cannot surface the dynamics that emerge from sustained agent-user interaction. The infrastructure to observe those dynamics must be built deliberately.

Grafana dashboard: bandit arm convergence, posterior means separating over a 2-hour window

Beta posteriors updating in real time. Selection rates, observation outcomes, poisoning detection.

Jaeger trace: online_index.embed spanning 20.81ms, 8 spans down to memgraph.add_edge and cypher.execute

End-to-end trace from embed through graph queries, memgraph.add_edge, and cypher.execute spans.

Memgraph Lab: screenshot pending

Live causal knowledge graph. Concepts, typed edges, credit propagation paths.

Grafana: Knowledge Graph Growth dashboard with Edge Lifecycle, KG Crystallization, and cumulative graph growth

KG Crystallization: edges enter a buffer and graduate to the persistent graph once confidence exceeds the promotion threshold.
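
The buffer-and-graduate rule can be sketched directly. The edge tuple shape and function name are assumptions; the real promotion rule lives in qortex's ingest pipeline.

```python
def crystallize(buffer, graph, threshold=0.8):
    """Promote buffered edges whose confidence meets the threshold
    into the persistent graph; return the edges still incubating.
    `buffer` holds (src, relation, dst, confidence) tuples."""
    remaining = []
    for src, relation, dst, confidence in buffer:
        if confidence >= threshold:
            graph.add((src, relation, dst))
        else:
            remaining.append((src, relation, dst, confidence))
    return remaining
```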

Jaeger trace: MCP tools/call qortex_learning_observe, 31.53ms, fastmcp provider

MCP tool call trace: fastmcp provider, LocalProvider type, full span attributes including session ID.

Where the stack meets real users

Three applications run on qlawbox; each generates signal for a different aspect of the research program.

LinWheel

autonomous publishing

Now a Vindler extension with 17 MCP tools spanning the full content lifecycle: analyze, reshape, refine, split, draft, bundle, manage posts, generate images and carousels, and manage voice profiles. All tools are opt-in; an approval gate is required before anything gets scheduled.

The test: can qlawbox drive a content pipeline from signal detection through scheduling and direct publishing, without human editing?

Swae OS

federated intelligence

Federated GraphQL microservice architecture. Three services behind a Hive gateway: user management, workout/habit tracking, and a hybrid RAG knowledge service. qortex plugs in behind the federation and provides structured retrieval across independent services.

The AI coaching layer is architecture-ready. When active, qortex provides the knowledge backend: structured retrieval across workout data, habit patterns, and journal entries through one federated graph.

Interlinear

adaptive learning

Language tutor with a 5-category error taxonomy (mechanical, lexical, morphological, syntactic, semantic). When you write puellam instead of puella, it identifies a case error and explains why nominative was expected. Course generation from any source text. CEFR-calibrated per grammatical concept.

The error taxonomy feeds Thompson Sampling: which correction strategies produce retention, and which produce frustration. This is qlawbox as an adaptive learning backend. Medium to long term, Interlinear splits out into microservices and lives behind the Swae federated gateway.

Hypotheses under investigation

The questions above frame the research direction. Below, each one gets a mechanism in the stack and something concrete to measure.

H1

Can agents learn to optimize their own context?

The bandit selects which system prompt components enter context. Over time, components that correlate with positive outcomes get promoted; those that don't get downweighted.

If the mechanism works, token consumption should decrease while task completion quality holds steady or improves.

Measuring: repeated mistake rate (RMR), total tokens per task, arm convergence speed (how many observations before posteriors stabilize).

H2

How faithfully do agents reason in practice?

The causal knowledge graph provides structured context that flat retrieval cannot. The question is whether access to typed relationships and domain rules changes agent behavior in measurable ways. This should include not just retrieval quality, but downstream reasoning and task completion.

Measuring: retrieval quality (P@5, R@5, nDCG@5) with graph-enhanced vs. flat retrieval, same tasks, same agent, same embeddings. Also tracking whether structurally relevant concepts surface that cosine similarity misses. Early adapter results.
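
The three retrieval metrics are standard and easy to pin down; here they are with binary relevance (the usual setting for P@k and R@k, and a common simplification for nDCG):

```python
import math

def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance nDCG: log2-discounted gain, normalized by
    the ideal ordering's DCG."""
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

Comparing graph-enhanced vs. flat retrieval then reduces to running both pipelines over the same queries and comparing these three numbers per query.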

H3

What does post-deployment monitoring look like?

qortex-observe emits structured events for every selection, observation, and posterior update. Grafana dashboards surface posterior dynamics. Jaeger traces end-to-end latency.

The question is whether this instrumentation is sufficient to detect when the agent is learning the wrong things — and whether tunable interventions (reward signals, arm resets, exploration rate adjustment) can correct course.

Measuring: time-to-detection for reward misspecification, intervention response latency, and whether operators can meaningfully steer agent learning through the observability surface. The measurement approaches here are themselves a research question.

From posteriors to geometry

Beta posteriors sit on a statistical manifold with curvature and symmetry. In physics, symmetries imply conservation laws. The question is whether that structure exists here, too. If it does, the learning layer produces quantities that should be conserved, and conservation violations become exactly the signals that interoception needs to monitor.
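
The manifold structure, at least, is standard information geometry rather than speculation: the Beta family carries the Fisher information metric in closed form, with \(\psi'\) the trigamma function. Whether its symmetries yield conserved quantities for this particular learning loop is exactly the open question.

```latex
g(\alpha,\beta) =
\begin{pmatrix}
\psi'(\alpha) - \psi'(\alpha+\beta) & -\,\psi'(\alpha+\beta) \\
-\,\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta)
\end{pmatrix}
```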

Sketching it through would be elegant. But showing it true would be surprising: for now this is pure, fun speculation. With luck, the agenda laid out above represents a step toward being able to show it false — to build tools that can silence the speculation, one way or the other.

Why Knowledge Graphs ↗ Causal Reasoning ↗ Geometry of Learning ↗

Where things stand

Layer · Project — Stage — Note
01 · qortex — Functional — Beta-Bernoulli Thompson Sampling, causal knowledge graph, 7 framework adapters, REST API, Postgres stores, HMAC auth
02 · qortex-observe — Functional — JSONL sinks, OpenTelemetry/Jaeger, Prometheus/Grafana, MCP tracing middleware
03 · vindler + bilrost — Functional — Lima VM, OverlayFS, UFW NACLs, network-as-needed tool dispatch, LinWheel extension (17 content tools), on PyPI
04 · cadence — Functional — Typed signal bus, ambient dispatch, interoception interop planned
05 · interoception — Early — Architectural placeholder; internal state signals, affect vocabulary, speculative

Everything here is work in progress. Directions, not conclusions.