research(llm): BaRP preference-conditioned bandit routing — runtime cost/quality trade-off dial (arXiv:2510.07429) #2415

@bug-ops

Finding

BaRP: Bandit Routing with Preference-Tunable Trade-offs (arXiv:2510.07429)

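BaRP trains its router under the partial-feedback (bandit) conditions of real deployment, where only the outcome of the chosen model is observed. Operators can dial the performance–cost trade-off at test time without retraining. The paper reports that BaRP outperforms offline routers by at least 12.46%.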

Applicability to Zeph

Zeph's LinUCB bandit router (zeph-llm/src/router/bandit.rs, PR #2390/#2230) is purely accuracy-focused. BaRP's preference-conditioned extension would allow operators to specify cost vs. quality weight at runtime — e.g., "prefer cheaper models in this session".

Proposed design:

[llm.router.bandit]
cost_weight = 0.3   # 0.0 = pure quality, 1.0 = pure cost
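
A minimal sketch of how the proposed key could be represented and range-checked on the Rust side. `BanditConfig`, its `validate` method, and the 0.0 default are hypothetical names and choices made for illustration; the actual config types in zeph-llm may look different.

```rust
/// Hypothetical mirror of the proposed `[llm.router.bandit]` table.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BanditConfig {
    /// 0.0 = pure quality, 1.0 = pure cost (semantics taken from the proposal).
    pub cost_weight: f64,
}

impl Default for BanditConfig {
    fn default() -> Self {
        // Assumption: default to 0.0 so existing deployments keep the
        // current accuracy-only behavior unless they opt in.
        Self { cost_weight: 0.0 }
    }
}

impl BanditConfig {
    /// Rejects NaN, infinities, and values outside [0.0, 1.0].
    pub fn validate(self) -> Result<Self, String> {
        if self.cost_weight.is_finite() && (0.0..=1.0).contains(&self.cost_weight) {
            Ok(self)
        } else {
            Err(format!(
                "cost_weight must be in [0.0, 1.0], got {}",
                self.cost_weight
            ))
        }
    }
}
```

Rejecting out-of-range values at load time, rather than clamping silently, keeps a typo such as `cost_weight = 3` from quietly turning into a cost-only router.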

The LinUCB selection rule subtracts cost_weight * cost_penalty(provider) from each arm's UCB score (written provider_cost_estimate(arm) in the sketch below), making expensive providers less attractive when cost_weight is high.
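
Written out, the cost term acts as a penalty on each arm's score, which is the reading consistent with the implementation sketch. The formulas below assume the standard disjoint LinUCB score; the symbols $\lambda$ (for `cost_weight`) and $c_a$ (for the provider cost estimate, assumed normalized to $[0, 1]$) are introduced here for illustration and do not come from the Zeph code.

```latex
% Assumed standard disjoint LinUCB score for arm a with context x:
\[
  \mathrm{UCB}_a(x) = x^\top \hat{\theta}_a + \alpha \sqrt{x^\top A_a^{-1} x}
\]
% Proposed cost-adjusted score, with \lambda = cost_weight:
\[
  \widetilde{\mathrm{UCB}}_a(x) = \mathrm{UCB}_a(x) - \lambda\, c_a,
  \qquad
  a^\ast = \arg\max_a \widetilde{\mathrm{UCB}}_a(x)
\]
% Worked example with \lambda = 0.3:
%   expensive arm: 0.90 - 0.3 \cdot 1.0 = 0.60
%   cheap arm:     0.80 - 0.3 \cdot 0.1 = 0.77   (selected)
```

Under this subtractive form, $\lambda = 1.0$ weights the cost penalty equally with quality rather than ignoring quality. If the "1.0 = pure cost" semantics are meant literally, a convex blend such as $(1 - \lambda)\,\mathrm{UCB}_a(x) - \lambda\, c_a$ would be one way to get them. That is a suggestion, not something the proposal specifies.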

Implementation sketch

// In LinUCB arm selection:
let adjusted_ucb = quality_ucb - config.cost_weight * provider_cost_estimate(arm);

This is a minimal change to the existing bandit implementation and directly maps to the [cost] tracking already in Zeph.
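
A slightly fuller, self-contained sketch of the selection step may help when reviewing the change. `ArmScore`, `adjusted_ucb`, and `select_arm` are hypothetical names, and the sketch assumes each arm's LinUCB score and a cost estimate normalized to [0, 1] are computed elsewhere. It does not reflect the actual structure of bandit.rs.

```rust
/// Per-arm inputs to selection. Both fields are assumed to be
/// precomputed by the existing router and cost tracking.
#[derive(Debug, Clone, Copy)]
pub struct ArmScore {
    /// LinUCB score: mean reward estimate plus exploration bonus.
    pub quality_ucb: f64,
    /// Hypothetical provider cost estimate, assumed normalized to [0, 1].
    pub cost: f64,
}

/// Cost-adjusted score: the quality UCB minus a weighted cost penalty.
pub fn adjusted_ucb(arm: ArmScore, cost_weight: f64) -> f64 {
    arm.quality_ucb - cost_weight * arm.cost
}

/// Returns the index of the arm with the highest adjusted score,
/// or `None` if there are no arms with a comparable score.
/// Arms whose score is NaN are skipped; on exact ties the
/// last maximal arm wins (the behavior of `Iterator::max_by`).
pub fn select_arm(arms: &[ArmScore], cost_weight: f64) -> Option<usize> {
    arms.iter()
        .enumerate()
        .map(|(i, arm)| (i, adjusted_ucb(*arm, cost_weight)))
        .filter(|(_, score)| !score.is_nan())
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

With `cost_weight = 0.0` this reduces to plain UCB selection, so current behavior is preserved. Raising the weight shifts selection toward cheaper arms only when their quality scores are close enough for the penalty to change the ranking.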

Priority

P2 — extends existing LinUCB infrastructure with a high-value config knob; small implementation surface.

Source

  • arXiv:2510.07429 — BaRP: Bandit Routing with Preference-Tunable Performance–Cost Trade-offs

Labels

  • P2: High value, medium complexity
  • llm: zeph-llm crate (Ollama, Claude)
  • research: Research-driven improvement
