25 min read · 3 hour project · ~$100 · Medium
Overview
If you’re training LLMs, building AI apps, or developing agents, you’ll need some way to evaluate their performance. This can happen at different points in the development lifecycle: during initial development, in production to detect regressions or drift, or when upgrading to newer models and/or configurations. Because the outputs of LLMs, AI apps, and agents are typically subjective or open-ended, there are essentially only three ways performance can be measured:
- Human labeling or user feedback
- LLM-as-a-judge
- ✨ Vibes ✨
The Goal
We’re really going to nerd out on this one: our task today is to determine which open-source model family is best at creating ELI5 explanations of paper abstracts from the pre-print server Arxiv. ELI5 stands for “Explain Like I’m 5”, popularized by the now-defunct ELI5 dataset from Facebook. This task may seem trivial, but it’s a great demonstration of LLM-as-a-judge for a few reasons:
- It requires high-level world knowledge and an understanding of complex technical concepts
- It requires the ability to map a high-level concept to a low-level explanation and communicate it effectively
- It’s extremely subjective, so there is no realistic way a ground-truth label set can be created
Getting the data
We’ll grab the current snapshot of the Arxiv metadata from Kaggle and sample 100,000 rows.
Explore the data
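A minimal sketch of this step with pandas, assuming the snapshot has been downloaded from Kaggle as the standard arxiv-metadata-oai-snapshot.json file:

```python
import pandas as pd

# The Kaggle snapshot ships as JSON Lines; the file name below assumes the
# standard download of the Arxiv metadata dataset.
df = pd.read_json("arxiv-metadata-oai-snapshot.json", lines=True)

# Keep only what we need and sample 100,000 abstracts.
sample = (
    df[["id", "title", "abstract"]]
    .sample(n=100_000, random_state=42)
    .reset_index(drop=True)
)
print(sample.head())
```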

Generating the explanations
As of this writing, we serve four model “families” on the Sutro platform: Llama, Qwen, Gemma, and GPT-OSS. Generally speaking, any of these models should be able to generate the explanations, but the larger models should be more adept at evaluating them, since larger models generally contain more world knowledge and are better at understanding complex technical concepts. So we’ll use smaller or medium-sized variants of each model family for the explanations, and their larger counterparts for the evaluations. It’s incredibly simple to generate the explanations using the Sutro SDK. As a sanity check, we’ll run the first 1,000 rows through each model using prototyping (p0) jobs.
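Here’s a rough sketch of what the prototyping step looks like. The prompt wording and the SDK call shown here are illustrative assumptions rather than the documented Sutro API, but the shape is the same: one ELI5 prompt per abstract, submitted as a p0 job for each model family.

```python
import sutro as so  # the method names used below are assumptions, not documented API

ELI5_PROMPT = (
    "Explain the following paper abstract like I'm 5 years old, "
    "in a few short, simple sentences.\n\nAbstract:\n"
)

# Smaller or medium-sized variants used to generate the explanations.
generator_models = ["llama-3.1-8b", "qwen-3-14b", "gemma-3-12b-it", "gpt-oss-20b"]

# Sanity check: only the first 1,000 rows, run as prototyping (p0) jobs.
subset = sample.iloc[:1000]
prompts = (ELI5_PROMPT + subset["abstract"]).tolist()

for model in generator_models:
    # Placeholder for the Sutro SDK's batch inference call.
    results = so.infer(prompts, model=model, job_priority=0)
```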

Input Abstract:
We report an experimental design where the position and resonance frequency of the Nitrogen Vacancy (NV) in a diamond are correlated with the room temperature. A simple model trained on the interpolated correlation data predicts both quantities. The predictive tracking of the NV's location enables continuous operation of the NV quantum computer under ambient conditions for a week without recalibration.
Output ELI5 Explanation:
We did a test to see how a tiny thing in a diamond moves and makes a special sound when the room is warm. We made a smart guess that tells us where it is and how it sounds. This lets a tiny computer in the diamond keep working all week without having to fix it.
Scaling to the full 100,000-row sample is as simple as removing the slicing of the dataframe and setting the job priority to 1.
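Continuing the sketch above (with the same caveat that the SDK call is illustrative):

```python
# Full 100,000-row sample this time, with the job priority bumped from p0 to 1.
prompts = (ELI5_PROMPT + sample["abstract"]).tolist()

for model in generator_models:
    results = so.infer(prompts, model=model, job_priority=1)
```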
| Model | Duration | Output Tokens | Cost |
| --- | --- | --- | --- |
| Llama 3.1 8B | 35 minutes | 16.3M | $1.03 |
| Qwen 3 14B | 48 minutes | 11.9M | $5.76 |
| Gemma 3 12B IT | 35 minutes | 18.8M | $9.16 |
| GPT-OSS 20B | 30 minutes | 25.8M | $1.03 |
Evaluating the explanations
As mentioned earlier, it’s likely that larger models will be better at evaluating the explanations, because they contain more world knowledge and have more free parameters with which to reason. In some cases, you may have a single “trusted” model that you want to use for all evaluations. But in this case, how do we know which model is the best judge? And how do we know the larger model in a given family won’t be biased toward its smaller sibling due to similarities in their underlying training data? To combat these biases, we’ll use a larger model from each of the four families to evaluate the explanations produced by the smaller models of the other families. Consequently, each smaller model is evaluated by three larger models from outside its own family. In traditional ML, this is known as ensemble modeling: using the responses of multiple models to make a single prediction.
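To make the setup concrete, here’s a rough sketch of how the judging jobs could be laid out. The larger judge variants, the scoring prompt, the eli5_<model> column names, and the so.infer call are all assumptions for illustration:

```python
import sutro as so  # method names are assumptions, as in the generation sketch

JUDGE_PROMPT = (
    "You are grading an ELI5 explanation of a paper abstract. "
    "Score how accurate and how simple the explanation is, from 0 to 100. "
    "Respond with only the integer score.\n\n"
    "Abstract:\n{abstract}\n\nExplanation:\n{explanation}"
)

# Hypothetical small -> large pairings per family; the larger judge variants
# are assumptions, not necessarily the exact models used.
families = {
    "llama": ("llama-3.1-8b", "llama-3.1-70b"),
    "qwen": ("qwen-3-14b", "qwen-3-235b"),
    "gemma": ("gemma-3-12b-it", "gemma-3-27b-it"),
    "gpt-oss": ("gpt-oss-20b", "gpt-oss-120b"),
}

for judge_family, (_, judge_model) in families.items():
    for family, (small_model, _) in families.items():
        if family == judge_family:
            continue  # a judge never scores its own family's output
        prompts = [
            JUDGE_PROMPT.format(abstract=a, explanation=e)
            for a, e in zip(sample["abstract"], sample[f"eli5_{small_model}"])
        ]
        scores = so.infer(prompts, model=judge_model, job_priority=1)
        sample[f"score_{small_model}_by_{judge_family}"] = scores
```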
Pull down the results and append them to the original dataframe
Plot the results in a heatmap
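A sketch of the plotting step, assuming the judge scores have been merged back into the dataframe under the hypothetical score_<model>_by_<judge> columns from the sketch above:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

models = ["llama-3.1-8b", "qwen-3-14b", "gemma-3-12b-it", "gpt-oss-20b"]
judges = ["llama", "qwen", "gemma", "gpt-oss"]

# Mean score per (judge family, generator model). A judge never scores its
# own family, so those cells stay NaN and render as blanks.
heat = pd.DataFrame(index=judges, columns=models, dtype=float)
for j in judges:
    for m in models:
        col = f"score_{m}_by_{j}"  # hypothetical column naming from the judging sketch
        if col in sample.columns:
            heat.loc[j, m] = sample[col].mean()

plt.figure(figsize=(8, 5))
sns.heatmap(heat, annot=True, fmt=".1f", cmap="viridis")
plt.xlabel("Generator model")
plt.ylabel("Judge family")
plt.title("Mean ELI5 score (0-100) by judge")
plt.tight_layout()
plt.show()
```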

Using relative evaluations
In our previous evals, we used larger models to evaluate the ELI5 explanations in isolation: we showed the evaluator model just the abstract and a single explanation, and asked it to score the explanation on a scale of 0 to 100. But since we’re trying to understand which model is best for our task, it makes more sense to compare relative performance directly. This is where relative evaluations come in: we’ll now show each evaluator all three ELI5 explanations from the other families at once, and ask it to rank them from best to worst.
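A rough sketch of what the ranking prompt might look like (the wording, the A/B/C labeling scheme, and the hypothetical eli5_<model> columns are all assumptions):

```python
RANKING_PROMPT = (
    "Here is a paper abstract and three ELI5 explanations of it, labeled A, B, and C. "
    "Rank the explanations from best to worst on accuracy and simplicity. "
    "Respond with only the ranking, e.g. 'B > A > C'.\n\n"
    "Abstract:\n{abstract}\n\n"
    "Explanation A:\n{a}\n\nExplanation B:\n{b}\n\nExplanation C:\n{c}"
)

# Example: build the prompt for one row, with the three candidates drawn from
# the families the judge doesn't belong to (here, a Llama judge).
row = sample.iloc[0]
prompt = RANKING_PROMPT.format(
    abstract=row["abstract"],
    a=row["eli5_qwen-3-14b"],
    b=row["eli5_gemma-3-12b-it"],
    c=row["eli5_gpt-oss-20b"],
)
```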
Pull down the results and append them to the original dataframe
Create the win rate percentage matrix
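One way to compute the win-rate matrix, assuming the parsed rankings live in a long-format dataframe called rankings with one row per implied pairwise comparison and columns winner and loser (the dataframe name and columns are assumptions):

```python
import pandas as pd

models = ["llama-3.1-8b", "qwen-3-14b", "gemma-3-12b-it", "gpt-oss-20b"]

# counts.loc[i, j] = number of times model i ranked above model j.
counts = rankings.groupby(["winner", "loser"]).size().unstack(fill_value=0)
counts = counts.reindex(index=models, columns=models, fill_value=0)

# Win rate of the row model against the column model, in percent.
total = counts + counts.T
win_rate = counts.div(total.replace(0, float("nan"))) * 100
print(win_rate.round(1))
```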

Use the Elo rating system to determine the best model
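A minimal Elo implementation over the same pairwise comparisons: start every model at 1500, then apply the standard logistic update (K = 32) for each win/loss. The rankings dataframe is the same assumption as above.

```python
import pandas as pd

K = 32  # standard Elo update factor
models = ["llama-3.1-8b", "qwen-3-14b", "gemma-3-12b-it", "gpt-oss-20b"]
ratings = {m: 1500.0 for m in models}
record = {m: {"wins": 0, "losses": 0, "matches": 0} for m in models}

# One Elo update per implied pairwise comparison, in order.
for winner, loser in zip(rankings["winner"], rankings["loser"]):
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected_win)
    ratings[loser] -= K * (1 - expected_win)
    record[winner]["wins"] += 1
    record[loser]["losses"] += 1
    record[winner]["matches"] += 1
    record[loser]["matches"] += 1

elo = pd.DataFrame(record).T
elo["elo"] = pd.Series(ratings)
print(elo.sort_values("elo", ascending=False))
```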
| Model | Elo | Wins | Losses | Matches |
| --- | --- | --- | --- | --- |
| gpt-oss-20b | 1629.77 | 425974 | 174028 | 600001 |
| qwen-3-14b | 1554.52 | 351422 | 248580 | 600001 |
| gemma-3-12b-it | 1523.23 | 319396 | 280606 | 600003 |
| llama-3.1-8b | 1292.47 | 103212 | 496790 | 600001 |
Conclusion
We just demonstrated how you can use Sutro to run model task evals. Our approach avoided the need for human labelers or online human feedback altogether, instead using an ensemble of LLMs to evaluate our task and cancel out the biases of any single model. We bootstrapped the evaluation process entirely offline:
- in just a few hours
- on 100,000 samples, across 4 models and 16 evaluation jobs (but could be scaled to millions of samples and billions of tokens)
- for around 100 dollars
- without the need for any custom infrastructure or GPU setup
- using only the Sutro SDK and open-source Python tools