Core Workflow

A Sutro Function is built through an iterative loop: you create the function, Sutro generates predictions, you label the highest-impact cases, and Sutro optimizes the prompt to match your preferences. Each iteration refines the function further.

1. Create the function

You provide:
  • A task definition — a plain-language description of what you want the function to do. This becomes the prompt the model receives, so be specific about the decision criteria. You write this yourself; it should reflect how you’d explain the task to a new team member.
  • A dataset — a representative sample of your production data (CSV, JSONL, JSON, or Parquet). We find around 1,000 rows works for many tasks, but larger or more diverse datasets may need more examples (e.g. 5-10k).
  • A task type — LLM-as-a-judge, binary classification, single-label classification, multi-label classification, or structured extraction.
  • Labels or schema — for judge and classification tasks, the set of labels to choose from. For extraction, the output schema defining which fields to extract. We recommend a Pydantic- or Zod-based schema, but JSON is accepted as well.
You can also configure optional settings, such as which model to target during optimization or whether to enable web search.
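
As a rough illustration of the inputs listed above, the sketch below gathers them into a plain Python dataclass and checks that they are consistent with the chosen task type. `FunctionSpec`, `TASK_TYPES`, and `sample_rows` are hypothetical names used only for this example; they are not part of the Sutro SDK, and the validation rules are assumptions drawn from the descriptions above.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

# Task types named in this guide; the identifier strings are hypothetical.
TASK_TYPES = {
    "llm_as_a_judge",
    "binary_classification",
    "single_label_classification",
    "multi_label_classification",
    "structured_extraction",
}


@dataclass
class FunctionSpec:
    """Hypothetical container for the four inputs (not a Sutro SDK type)."""
    task_definition: str
    task_type: str
    rows: list
    labels: list = field(default_factory=list)  # judge / classification tasks
    schema: Optional[dict] = None               # structured extraction tasks

    def validate(self) -> None:
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {self.task_type}")
        if not self.task_definition.strip():
            raise ValueError("task definition must not be empty")
        if self.task_type == "structured_extraction":
            if not self.schema:
                raise ValueError("extraction tasks need an output schema")
        elif not self.labels:
            raise ValueError("judge and classification tasks need labels")


def sample_rows(rows, n=1000, seed=0):
    """Draw up to n rows uniformly at random, reproducibly."""
    rows = list(rows)
    if len(rows) <= n:
        return rows
    return random.Random(seed).sample(rows, n)
```

Sampling with a fixed seed keeps the dataset stable between runs, which makes it easier to compare iterations on the same rows.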

2. Predict

Sutro runs your task definition against the dataset. An ensemble of models processes every row independently, and Sutro analyzes where they agree and disagree. The rows where the models disagree are the cases where your preferences matter most.
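
To make the idea of ensemble agreement concrete, here is a minimal sketch that scores each row by the share of models voting for the most common label. Majority voting is an assumption made for illustration; this guide does not specify how Sutro actually computes consensus, and `consensus` and the sample data are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Return (majority_label, agreement) for one row's ensemble predictions.

    agreement is the fraction of models that voted for the majority label.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical ensemble output: one list of model predictions per row id.
ensemble_output = {
    "row-1": ["spam", "spam", "spam"],
    "row-2": ["spam", "not_spam", "spam"],
    "row-3": ["spam", "not_spam", "unsure"],
}

scored = {row_id: consensus(preds) for row_id, preds in ensemble_output.items()}
```

In this toy data, `row-1` has full agreement while `row-3` has none beyond a single vote, so `row-3` is the kind of row where a human label carries the most information.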

3. Label

Sutro surfaces two sets of items for your review:
  • Low-confidence items — the cases with the most model disagreement. These are the highest-value labels you can provide, because they are the cases where the right decision is most ambiguous.
  • High-confidence items — cases where the models strongly agree. These are shown so you can catch cases where the consensus algorithm is confidently wrong.
For each item, you can see the input (text, image, or PDF) and the consensus-chosen prediction, and select your label. You can optionally write a justification explaining your reasoning. Justifications are especially valuable because they convey nuance, opinions, or motivations that the label alone might not capture. You can also configure a held-out set: a fixed slice of your data that’s evaluated every iteration so you can track accuracy over time.
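
The selection described above can be sketched as follows, assuming each row already has an agreement score between 0 and 1. `review_queue` picks the least and most agreed-upon rows, and `in_held_out` assigns a fixed slice of rows to a held-out set by hashing the row id so the slice stays the same every iteration. Both functions and their parameters are hypothetical; Sutro's own selection and splitting logic may differ.

```python
import hashlib


def review_queue(agreement, k=2):
    """Pick the k lowest- and k highest-agreement row ids for labeling."""
    # Sort by agreement, breaking ties by row id so the order is deterministic.
    ranked = sorted(agreement, key=lambda row_id: (agreement[row_id], row_id))
    low = ranked[:k]
    high = [row_id for row_id in reversed(ranked) if row_id not in low][:k]
    return low, high


def in_held_out(row_id, fraction=0.1):
    """Deterministically assign a row to the held-out set by hashing its id."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < fraction
```

Hashing the row id, rather than drawing random numbers, means a row's membership in the held-out set does not change when new data is added later.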

4. Optimize

Once you’ve finished labeling, Sutro searches for a better prompt. The prompt optimizer tries different variations of your task definition, scores each one against a subset of your accumulated labels, and keeps the version that best matches your preferences. It also ensures the chosen prompt scores well across a diverse set of edge cases. After optimization completes, you’ll see:
  • The new prompt — the optimized task definition, which you can review and compare against the previous version via a diff view.
  • A validation score — how well the optimized prompt agrees with your labels on a held-out validation split.
You approve the prompt (or edit it if needed), and then start the next iteration.
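
The select-then-validate pattern behind this step can be sketched in a few lines. This is a deliberately simplified illustration, not Sutro's optimizer: candidate prompts are supplied by hand rather than generated, and `keyword_predict` is a toy stand-in for a model call. All names here are hypothetical.

```python
def accuracy(predict, prompt, examples):
    """Fraction of (text, label) examples where predict(prompt, text) matches."""
    hits = sum(predict(prompt, text) == label for text, label in examples)
    return hits / len(examples)


def optimize(predict, candidates, train, validation):
    """Keep the candidate that best matches the training labels, then
    score it on labels that played no part in the selection."""
    best = max(candidates, key=lambda p: accuracy(predict, p, train))
    return best, accuracy(predict, best, validation)


def keyword_predict(prompt, text):
    """Toy 'model': flags text containing any comma-separated keyword."""
    keywords = [k.strip() for k in prompt.lower().split(",")]
    return "spam" if any(k in text.lower() for k in keywords) else "ok"


train = [
    ("Win a free prize", "spam"),
    ("Lunch at noon?", "ok"),
    ("Free crypto now", "spam"),
    ("You won a prize", "spam"),
]
validation = [("Claim your prize", "spam"), ("Meeting notes", "ok")]

best, score = optimize(
    keyword_predict, ["free", "free, prize", "noon"], train, validation
)
```

Scoring the winner on a separate validation split is what keeps the reported number honest: a prompt selected on the same labels it is scored against would look better than it really is.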

The iteration loop

Each iteration builds on the last. Your labels accumulate across iterations, so the optimizer always has the full history of your preferences to work with. This means:
  • Early iterations tend to produce the biggest gains, as the function learns your core preferences.
  • Later iterations refine edge cases and improve consistency on harder examples.
  • You can stop whenever the function meets your needs. Most tasks converge in 2-4 iterations.
Because labels persist, you can also come back later to add new production data that has drifted, swap to a different model, or re-optimize, without losing the work you’ve already done.
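
One way to picture label accumulation and a stopping rule is sketched below. It assumes labels are keyed by row id, that a newer label for the same row replaces the older one, and that you stop once the held-out score improves by less than a small threshold. These rules and the names `merge_labels` and `has_converged` are assumptions for illustration, not documented Sutro behavior.

```python
def merge_labels(history, new_labels):
    """Return the accumulated labels; a newer label for the same row id wins."""
    merged = dict(history)
    merged.update(new_labels)
    return merged


def has_converged(scores, min_gain=0.01):
    """True when the latest iteration improved the score by less than min_gain."""
    return len(scores) >= 2 and scores[-1] - scores[-2] < min_gain


# Labels from two iterations; the second revises the label for "r1".
labels = merge_labels({}, {"r1": "spam"})
labels = merge_labels(labels, {"r1": "ok", "r2": "ok"})
```

Because `merge_labels` returns a new dictionary instead of mutating its input, earlier snapshots of the label history remain available for comparison.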

What’s next

Once the function is performing well, you can deploy it and invoke it by name through the batch inference SDK or API.