Core Workflow
A Sutro Function is built through an iterative loop: you create the function, Sutro generates predictions, you label the highest-impact cases, and Sutro optimizes the prompt to match your preferences. Each iteration refines the function further.
1. Create the function
You provide:
- A task definition: a plain-language description of what you want the function to do. This becomes the prompt the model receives, so be specific about the decision criteria. You write this yourself; it should reflect how you’d explain the task to a new team member.
- A dataset: a representative sample of your production data (CSV, JSONL, JSON, or Parquet). We find around 1,000 rows works for many tasks, but large or diverse datasets may need more examples (e.g., 5,000-10,000).
- A task type: LLM-as-a-judge, binary classification, single-label classification, multi-label classification, or structured extraction.
- Labels or schema: for judge and classification tasks, the set of labels to choose from; for extraction, the output schema defining which fields to extract. We recommend a Pydantic- or Zod-based schema, but JSON is accepted as well.
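As a rough illustration of how these four inputs fit together, the sketch below bundles them into one object and checks that the required pieces are present. `FunctionSpec`, its field names, the task-type strings, and `tickets.csv` are all hypothetical names chosen for this example; none of them is Sutro's actual API.

```python
from dataclasses import dataclass, field

# Task types listed above. The string identifiers are invented for this sketch.
TASK_TYPES = {
    "llm_judge",
    "binary_classification",
    "single_label_classification",
    "multi_label_classification",
    "structured_extraction",
}


@dataclass
class FunctionSpec:
    """Hypothetical container for the four inputs; not a Sutro class."""

    task_definition: str
    dataset_path: str
    task_type: str
    labels: list[str] = field(default_factory=list)
    schema: dict[str, str] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the spec looks complete."""
        found = []
        if not self.task_definition.strip():
            found.append("task definition is empty")
        if not self.dataset_path.lower().endswith((".csv", ".jsonl", ".json", ".parquet")):
            found.append("dataset must be CSV, JSONL, JSON, or Parquet")
        if self.task_type not in TASK_TYPES:
            found.append(f"unknown task type: {self.task_type}")
        elif self.task_type == "structured_extraction":
            if not self.schema:
                found.append("extraction tasks need an output schema")
        elif not self.labels:
            found.append("judge and classification tasks need labels")
        return found


# Example: a binary classification task over a hypothetical ticket export.
spec = FunctionSpec(
    task_definition=(
        "Decide whether a support ticket must be escalated to a human agent. "
        "Escalate when the customer reports data loss or a billing error."
    ),
    dataset_path="tickets.csv",
    task_type="binary_classification",
    labels=["escalate", "no_escalate"],
)
print(spec.problems())
```

Writing the task definition with explicit decision criteria, as in the example, mirrors the advice above to describe the task as you would to a new team member.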
2. Predict
Sutro runs your task definition against the dataset. An ensemble of models processes every row independently, and Sutro analyzes where they agree and disagree. The rows where they disagree are the cases where your preferences matter most.
3. Label
Sutro surfaces two sets of items for your review:
- Low-confidence items: the cases with the most model disagreement. These are the highest-value labels you can provide, because they’re the cases where the right decision is most ambiguous.
- High-confidence items: cases where the models strongly agree. These are shown so you can catch cases where the consensus algorithm is confidently wrong.
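To make the idea of ensemble agreement concrete, here is a minimal sketch that scores each row by the share of models voting for the majority label, then picks the least and most agreed-upon rows for review. Sutro's actual consensus algorithm is not described here, so the majority-vote scoring, the function names, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def agreement(votes):
    """Return (majority_label, fraction of models that chose it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def review_queues(predictions, k=1):
    """Return the k lowest-agreement and k highest-agreement rows.

    Each entry is (row_id, majority_label, agreement_score).
    """
    scored = sorted(
        ((row_id, *agreement(votes)) for row_id, votes in predictions.items()),
        key=lambda item: item[2],
    )
    low_confidence = scored[:k]
    high_confidence = scored[-k:][::-1]  # most-agreed row first
    return low_confidence, high_confidence


# Made-up votes from a three-model ensemble.
predictions = {
    "row-1": ["spam", "spam", "spam"],
    "row-2": ["spam", "ham", "spam"],
    "row-3": ["ham", "spam", "other"],
    "row-4": ["ham", "ham", "ham"],
}

low, high = review_queues(predictions, k=1)
print("label first:", low)
print("spot-check:", high)
```

With this toy data, `row-3` is surfaced for labeling because all three models disagree, while a unanimous row is surfaced as a spot check on the consensus.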
4. Optimize
Once you’ve finished labeling, Sutro searches for a better prompt. The prompt optimizer tries variations of your task definition, scores each one against a subset of your accumulated labels, and keeps the version that best matches your preferences. It also ensures the chosen prompt scores well across a diverse set of edge cases. After optimization completes, you’ll see:
- The new prompt: the optimized task definition, which you can review and compare against the previous version in a diff view.
- A validation score: how well the optimized prompt agrees with your labels on a held-out validation split.
The iteration loop
Each iteration builds on the last. Your labels accumulate across iterations, so the optimizer always has the full history of your preferences to work with. This means:
- Early iterations tend to produce the biggest gains, as the function learns your core preferences.
- Later iterations refine edge cases and improve consistency on harder examples.
- You can stop whenever the function meets your needs. Most tasks converge in 2-4 iterations.
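The loop as a whole can be pictured with the sketch below, in which each iteration adds a batch of labels to a running list, re-optimizes on the full list, and stops once the score reaches a target. The function names, the `target` threshold, and the fake scoring rule are illustrative assumptions; in practice you decide when the function meets your needs.

```python
def run_iterations(label_batches, optimize_step, target=0.9, max_iterations=4):
    """Accumulate labels across iterations and stop once the score is good enough.

    Returns a list of (iteration, total_labels, score) tuples.
    """
    all_labels = []
    history = []
    for iteration, batch in enumerate(label_batches, start=1):
        all_labels.extend(batch)           # labels accumulate; nothing is discarded
        score = optimize_step(all_labels)  # the optimizer sees the full history
        history.append((iteration, len(all_labels), score))
        if score >= target or iteration >= max_iterations:
            break
    return history


def fake_optimize_step(labels):
    """Placeholder for a real optimization run; the score just grows with label count."""
    return min(1.0, len(labels) / 100)


# Four batches of 40 placeholder labels each.
batches = [[("example text", "label")] * 40 for _ in range(4)]

history = run_iterations(batches, fake_optimize_step)
for iteration, total, score in history:
    print(f"iteration {iteration}: {total} labels, score {score}")
```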