research(llm): BaRP preference-conditioned bandit routing — runtime cost/quality trade-off dial (arXiv:2510.07429) #2415
Description
Finding
BaRP: Bandit Routing with Preference-Tunable Trade-offs (arXiv:2510.07429)
BaRP trains the router as a partial-feedback bandit, matching real deployment where only the chosen model's outcome is observed. The operator can dial the performance–cost trade-off at test time without retraining. The paper reports outperforming offline routers by at least 12.46%.
Applicability to Zeph
Zeph's LinUCB bandit router (zeph-llm/src/router/bandit.rs, PR #2390/#2230) is purely accuracy-focused. BaRP's preference-conditioned extension would allow operators to specify cost vs. quality weight at runtime — e.g., "prefer cheaper models in this session".
Proposed design:
[llm.router.bandit]
cost_weight = 0.3 # 0.0 = pure quality, 1.0 = pure cost

The LinUCB arm score subtracts cost_weight * cost_penalty(provider) from the quality UCB (estimated reward plus exploration bonus), making expensive providers less attractive when cost_weight is high.
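For cost_weight to be meaningful, the cost penalty probably needs to live on a scale comparable to the quality reward. The following is a minimal sketch of one way to do that, assuming rewards lie in [0, 1]; `cost_penalties` is a hypothetical helper, not an existing Zeph function, and min-max normalization is only one possible choice.

```rust
/// Hypothetical helper (not part of Zeph): min-max normalize raw per-provider
/// cost estimates into [0, 1] so the penalty is comparable to a quality
/// reward that is assumed to lie in [0, 1].
fn cost_penalties(costs: &[f64]) -> Vec<f64> {
    let min = costs.iter().copied().fold(f64::INFINITY, f64::min);
    let max = costs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    costs
        .iter()
        // If all providers cost the same (or the slice is empty), no
        // provider is penalized relative to another.
        .map(|&c| if range > 0.0 { (c - min) / range } else { 0.0 })
        .collect()
}
```

With raw costs of 1.0, 3.0, and 5.0, the cheapest provider gets penalty 0.0 and the most expensive gets 1.0, so cost_weight directly bounds how much quality an operator is willing to give up.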
Implementation sketch
// In LinUCB arm selection:
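let adjusted_ucb = quality_ucb - config.cost_weight * provider_cost_estimate(arm);

Here provider_cost_estimate(arm) plays the role of cost_penalty(provider) from the proposed design. This is a minimal change to the existing bandit implementation and maps directly onto the [cost] tracking already in Zeph.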
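The one-line adjustment can be expanded into a self-contained selection routine. This is only a sketch under stated assumptions: `ArmEstimate` and `select_arm` are hypothetical names, the actual LinUCB state in zeph-llm/src/router/bandit.rs will differ, and the cost estimate is assumed to be pre-normalized to [0, 1].

```rust
/// Hypothetical per-arm summary; the real LinUCB state in Zeph will differ.
struct ArmEstimate {
    /// Estimated reward plus exploration bonus, as the router computes today.
    quality_ucb: f64,
    /// Cost estimate for this provider, assumed normalized to [0, 1].
    cost_estimate: f64,
}

/// Returns the index of the arm with the highest cost-adjusted UCB,
/// or `None` when there are no arms.
fn select_arm(arms: &[ArmEstimate], cost_weight: f64) -> Option<usize> {
    // Keep the weight inside the documented 0.0..=1.0 range.
    let w = cost_weight.clamp(0.0, 1.0);
    arms.iter()
        .enumerate()
        .map(|(i, a)| (i, a.quality_ucb - w * a.cost_estimate))
        .max_by(|x, y| x.1.total_cmp(&y.1))
        .map(|(i, _)| i)
}
```

With two arms scoring (0.9 quality, 1.0 cost) and (0.8 quality, 0.1 cost), a cost_weight of 0.0 selects the first arm, while 0.3 selects the second (0.60 vs. 0.77). Note that under this subtractive form a cost_weight of 1.0 still includes the quality term; if 1.0 is meant to be literally "pure cost", a convex blend such as (1 - w) * quality_ucb - w * cost_estimate might be closer to the config comment.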
Priority
P2 — extends existing LinUCB infrastructure with a high-value config knob; small implementation surface.
Source
- arXiv:2510.07429 — BaRP: Bandit Routing with Preference-Tunable Performance–Cost Trade-offs