RouterLLM

Link : https://arxiv.org/pdf/2406.18665v

Demo : https://routerllm.streamlit.app/

Notes

Not all LLMs are created equal.
Broadly speaking, large models tend to be more capable but come at a higher cost, while smaller models tend to be less capable but cheaper to serve. The heterogeneous landscape presents a dilemma in the practical deployment of LLMs in real-world applications.
LLM routing is a promising solution to this problem.

Routing is tough

How to optimize the quality of model responses while minimizing cost.
The router needs to infer the intent, complexity, and domain of an incoming query.
It also needs to understand the candidate models’ capabilities.
The router model itself needs to be economical, fast, and adaptive to the evolving model landscape.

Problem

More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective.

$N$ different LLM models $M = \{M_{1}, \dots, M_{N}\}$
Each model $M_{i} : Q \to A$ is abstracted as a function mapping a query to an answer
The router $R : Q \times M^{N} \to \{1, \dots, N\}$ is an $N$-way classifier that picks which model answers the query

$a = M_{R(q)}(q)$
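
A minimal sketch of this formulation, assuming each model is just a Python callable from a query string to an answer string. The toy models and the length-based router are hypothetical stand-ins for illustration, not anything from the paper, and indices are 0-based here rather than $1, \dots, N$.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]                    # M_i : Q -> A
Router = Callable[[str, Sequence[Model]], int]  # R : Q x M^N -> model index


def answer(query: str, models: Sequence[Model], router: Router) -> str:
    """Compute a = M_{R(q)}(q): ask the router for an index, then call that model."""
    index = router(query, models)
    if not 0 <= index < len(models):
        raise ValueError(f"router returned invalid index {index}")
    return models[index](query)


# Hypothetical stand-ins, for illustration only.
def small_model(query: str) -> str:
    return f"[small] {query}"


def large_model(query: str) -> str:
    return f"[large] {query}"


def length_router(query: str, models: Sequence[Model]) -> int:
    # Toy heuristic: send long queries to the last (assumed strongest) model.
    return len(models) - 1 if len(query.split()) > 8 else 0
```

A real router would replace `length_router` with a learned classifier; the dispatch step stays the same.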

The challenge of routing involves achieving an optimal equilibrium between increasing response quality and reducing cost.

For preference data:

$D_{pref} = \{(q, l_{i, j}) \mid q \in Q, \; i, j \in N, \; l_{i, j} \in L\}$

where $q$ is a query,
$l_{i, j}$ is a label giving the outcome of comparing the quality of $M_{i}$ and $M_{j}$ on $q$, and
$L = \{win_{M_{i}}, tie, win_{M_{j}}\}$
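
One way to hold such records in code, as a sketch. The names `Outcome`, `PreferenceRecord`, and `label_counts`, and the model names in the usage example, are made up for illustration; the released dataset's actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    """The label set L = {win_{M_i}, tie, win_{M_j}}."""
    WIN_I = "win_M_i"
    TIE = "tie"
    WIN_J = "win_M_j"


@dataclass(frozen=True)
class PreferenceRecord:
    """One element (q, l_{i,j}) of D_pref, plus the two models compared."""
    query: str
    model_i: str
    model_j: str
    label: Outcome


def label_counts(records: Iterable[PreferenceRecord]) -> Counter:
    """Count how often each outcome appears in the data."""
    return Counter(record.label for record in records)
```

For example:

```python
records = [
    PreferenceRecord("What is 2 + 2?", "model-a", "model-b", Outcome.TIE),
    PreferenceRecord("Prove Fermat's little theorem.", "model-a", "model-b", Outcome.WIN_I),
]
print(label_counts(records))
```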

Solution

Principled framework for learning a binary routing function.

$R_{bin}^{α} : Q \to \{0, 1\}$

The function routes between $M_{weak}$ and $M_{strong}$ and is learned from preference data.

$R_{bin}^{α}$ is defined using two components.

Win prediction model: the probability that $M_{strong}$ wins, i.e. $P_{θ} (win_{M_{strong}} \mid q)$. The parameters $θ$ are learned by maximizing the log-likelihood of the preference labels:
$\max_{θ} \sum_{(q, l_{i, j}) \in D_{pref}} \log P_{θ} (l_{i, j} \mid q)$
By learning the winning probability on preference data, the authors capture the strengths and weaknesses of both models on various kinds of queries.
Cost threshold: $α \in [0, 1]$, which converts the winning probability into a routing decision between $M_{strong}$ and $M_{weak}$.
$R_{bin}^{α} (q) = \begin{cases} 0 \ (\text{i.e., } M_{weak}) & \text{if } P (win_{M_{strong}} \mid q) < α, \\ 1 \ (\text{i.e., } M_{strong}) & \text{otherwise.} \end{cases}$
$α$ controls the quality/cost trade-off: a higher threshold imposes a stricter cost constraint, because more queries fall below it and are sent to $M_{weak}$.

The final response is the router’s response, i.e. $M_{R_{bin}^{α} (q)} (q)$: the response generated by either the weak or the strong model, depending on the router’s decision.
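
The thresholding step can be sketched as below. The win-probability function is passed in as a plain callable; `toy_win_prob` is a made-up stand-in for the learned $P_{θ}$, used only to make the example runnable.

```python
from typing import Callable


def make_binary_router(
    win_prob: Callable[[str], float], alpha: float
) -> Callable[[str], int]:
    """Build R_bin^alpha: return 0 (weak) if P(strong wins | q) < alpha, else 1 (strong)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    def route(query: str) -> int:
        return 0 if win_prob(query) < alpha else 1

    return route


def toy_win_prob(query: str) -> float:
    # Hypothetical stand-in for the learned model: longer queries are
    # assumed to favour the strong model. Not from the paper.
    return min(1.0, len(query.split()) / 10)


query = "one two three four five six seven"       # toy_win_prob gives 0.7
print(make_binary_router(toy_win_prob, 0.5)(query))  # 1: routed to the strong model
print(make_binary_router(toy_win_prob, 0.9)(query))  # 0: stricter threshold, weak model
```

Raising `alpha` from 0.5 to 0.9 flips the same query from the strong model to the weak one, which is the cost constraint tightening.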

Preference Data

The training data is 80k battles from the Chatbot Arena platform, plus some data augmentation. See the paper for more information.

Routing Approaches

(i’m not very clear on this)

Methods for learning the win prediction model from preference data.

Similarity-weighted (SW) ranking: the authors adopt a Bradley-Terry model here. Given a user query $q$, compute a weight $ω_{i} = γ^{1 + S (q, q_{i})}$ for each query $q_{i}$ in the train set based on its similarity to $q$.

$S (q, q_{i}) = \frac{ϵ \cdot ϵ_{i}}{\| ϵ \| \| ϵ_{i} \| \cdot \max_{1 \leq s \leq | D_{pref} |} \frac{ϵ_{i} \cdot ϵ_{s}}{\| ϵ_{i} \| \| ϵ_{s} \|}}$

$ϵ$ denotes a query embedding.
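
A rough sketch of the weight computation only, in plain Python. It covers $ω_{i}$ and $S$ as written above, not the Bradley-Terry fit that uses the weights. The default `gamma` value and the tiny 2-d embeddings are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_weights(
    query_emb: Vector, train_embs: Sequence[Vector], gamma: float = 10.0
) -> List[float]:
    """Return omega_i = gamma ** (1 + S(q, q_i)) for every training query."""
    weights = []
    for emb_i in train_embs:
        # Normalizer from the formula: the largest cosine similarity between
        # q_i and any training query. Taken literally the max includes s = i,
        # so it equals 1 up to rounding.
        max_sim = max(cosine(emb_i, emb_s) for emb_s in train_embs)
        s = cosine(query_emb, emb_i) / max_sim
        weights.append(gamma ** (1.0 + s))
    return weights


train = [[1.0, 0.0], [0.0, 1.0]]
print(similarity_weights([1.0, 0.0], train))  # identical query weighted far above the orthogonal one
```

Training queries that look like the incoming query get exponentially larger weights, so they dominate the ranking fit for that query.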

Matrix factorization: borrowed from recommendation systems, where it is used to capture the low-rank structure of user-item interactions; here it is trained on the preference data.
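
To make the idea concrete, here is a sketch of the forward pass under a strong simplifying assumption: each model has a learned vector, the score of a model on a query is a plain dot product with the query embedding, and the win probability is a sigmoid of the score difference (Bradley-Terry style). The paper's actual scoring function may be more elaborate; the vectors below are invented.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def score(model_vec: Vector, query_emb: Vector) -> float:
    """Assumed scoring function s(M, q): dot product of model and query vectors."""
    return sum(m * q for m, q in zip(model_vec, query_emb))


def win_probability(model_i_vec: Vector, model_j_vec: Vector, query_emb: Vector) -> float:
    """P(M_i beats M_j | q) = sigmoid(s(M_i, q) - s(M_j, q))."""
    margin = score(model_i_vec, query_emb) - score(model_j_vec, query_emb)
    return 1.0 / (1.0 + math.exp(-margin))


# Invented 2-d vectors, for illustration only.
strong_vec = [1.0, 0.5]
weak_vec = [0.2, 0.1]
query_emb = [1.0, 1.0]
print(win_probability(strong_vec, weak_vec, query_emb))
```

In training, the model vectors would be fitted so that these probabilities match the observed preference labels.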

BERT classifier: A standard text classification method based on the BERT-base architecture.

Causal LLM classifier: Instruction-following paradigm, i.e. the input is an instruction prompt containing the user query, and the model outputs the win probability in a next-token prediction fashion instead of using a separate classification head.
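
One way to read "next-token prediction fashion", as a sketch: take the LLM's next-token logits after the prompt, keep only the tokens that stand for the comparison labels, and apply a softmax over those. The label token names and logit values below are hypothetical, and how the paper maps tokens to labels is not covered here.

```python
import math
from typing import Dict, Sequence


def label_probabilities(
    next_token_logits: Dict[str, float], label_tokens: Sequence[str]
) -> Dict[str, float]:
    """Softmax over the logits of the label tokens only, ignoring the rest of the vocabulary."""
    logits = [next_token_logits[token] for token in label_tokens]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {token: e / total for token, e in zip(label_tokens, exps)}


# Hypothetical logits and label tokens, for illustration only.
logits = {"<strong_wins>": 2.0, "<tie>": 0.0, "<weak_wins>": 0.0, "the": 5.0}
labels = ["<strong_wins>", "<tie>", "<weak_wins>"]
probs = label_probabilities(logits, labels)
print(probs["<strong_wins>"])  # used as the win probability for the strong model
```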

tldr.

For both the matrix factorization router and the similarity-weighted ranking router, the authors used text-embedding-3-small to embed the input query. They performed full-parameter fine-tuning on both the BERT and the causal LLM classifiers.

Contribution of paper

Formulate the LLM routing problem to explore the trade-off between cost and response quality.
Router training framework based on human preference data and augmentation techniques.
Open-source code and preference data

Similar works

LLM-BLENDER employs an ensemble framework that calls multiple LLMs at inference and uses a router model to select the best response.
Frugal-GPT employs an LLM cascade, sequentially querying LLMs until a reliable response is found.
Hybrid-LLM is similar to this paper in framework but differs in three key ways: it uses synthetic preference labels derived via BARTScore, relies on a single BERT-based router architecture, and limits evaluation to in-domain generalization.

næhāl blog