Designing Your Task

The most common question when getting started with Sutro Functions is: “how should I structure this?” The right task design makes the difference between a function that converges quickly and one that struggles. This page covers the key decisions and common patterns.

Start narrow

A Sutro Function works best when it does one thing well. If your business problem involves multiple decisions, break it into separate functions rather than forcing a single function to handle everything.

Example: you want to process inbound support tickets by categorizing them, assessing urgency, and extracting key details. That’s three functions:

A single-label classifier for ticket category
A judge for urgency (e.g. low / medium / high / critical)
A structured extractor for key details (customer name, product, issue summary)

Each function gets its own labels, its own optimization cycle, and its own deployment. When one needs updating, you don’t risk breaking the others.
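The separation described above can be sketched in plain Python. This is a minimal illustration, not the Sutro API: `process_ticket`, `TicketResult`, and the three `toy_*` callables are hypothetical stand-ins for three independently deployed functions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: each callable represents one separately deployed function.
# These names and signatures are illustrative assumptions, not part of any real API.
Classifier = Callable[[str], str]
Extractor = Callable[[str], dict]


@dataclass
class TicketResult:
    category: str
    urgency: str
    details: dict


def process_ticket(
    text: str,
    categorize: Classifier,
    judge_urgency: Classifier,
    extract_details: Extractor,
) -> TicketResult:
    """Run three independent functions on the same ticket and combine the outputs."""
    return TicketResult(
        category=categorize(text),
        urgency=judge_urgency(text),
        details=extract_details(text),
    )


# Toy keyword-based placeholders so the sketch runs end to end.
def toy_categorize(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "technical"


def toy_urgency(text: str) -> str:
    return "high" if "urgent" in text.lower() else "low"


def toy_extract(text: str) -> dict:
    return {"issue_summary": text[:40]}


result = process_ticket(
    "URGENT: my invoice is wrong", toy_categorize, toy_urgency, toy_extract
)
```

Because each callable is passed in separately, any one of them can be swapped for a newer version without touching the other two.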

Choosing a task type

LLM-as-a-judge

Use for evaluative judgments where you’re assessing quality, correctness, or adherence to criteria.

“How well did the agent answer this question?” → good | acceptable | poor
“Does this summary accurately reflect the source?” → pass | fail

Judge tasks are a natural fit for building evals for agents and LLM pipelines. The function encodes your evaluation criteria so you can run consistent evals at scale.

Binary classification

Use when the decision is yes/no, true/false, or pass/fail.

“Is this lead qualified?”
“Does this document contain PII?”
“Did the agent follow the escalation policy?”

Binary classification is the simplest task type and tends to converge fastest. When in doubt, start here.

Single-label classification

Use when each input gets exactly one label from a set you define.

“What category is this support ticket?” → billing / technical / account / feature_request
“What is the sentiment of this review?” → positive / neutral / negative

For classification tasks, smaller label sets are easier for the model to learn, but larger sets (15+) can work well too. What matters most is that your labels are clearly defined, mutually exclusive, and complete — every possible outcome should be covered, including null-type outcomes like “unknown” or “not applicable.”
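One way to keep a label set closed and complete on the consuming side is to model it as an enumeration with an explicit null-type member. The sketch below is an assumption-laden illustration: the `TicketCategory` enum reuses the ticket labels from the example above, adds a hypothetical `unknown` label, and `parse_label` is a hypothetical helper.

```python
from enum import Enum


class TicketCategory(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    FEATURE_REQUEST = "feature_request"
    UNKNOWN = "unknown"  # null-type outcome, added here as an assumption


def parse_label(raw: str) -> TicketCategory:
    """Map a raw output string onto the closed label set, falling back to UNKNOWN."""
    cleaned = raw.strip().lower()
    try:
        return TicketCategory(cleaned)
    except ValueError:
        # Anything outside the defined set is treated as the null-type outcome.
        return TicketCategory.UNKNOWN
```

For example, `parse_label(" Billing ")` resolves to `TicketCategory.BILLING`, while an unexpected value such as `"shipping"` resolves to `TicketCategory.UNKNOWN` instead of raising.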

Multi-label classification

Use when each input can be assigned multiple labels simultaneously.

“What topics does this article cover?” → politics, technology, and healthcare
“What compliance issues are present?” → data_retention, access_control, and encryption
“What actions is the user attempting to perform in this turn?” → appointment, prescription, lab_test, referral, insurance, billing, symptom_check

Structured extraction

Use when you need to pull specific fields out of unstructured text.

Extract invoice fields: vendor, amount, date, line items
Extract contact info: name, title, company, email
Extract legal clause details: clause type, parties, obligations

Extractive tasks work best when the answer exists in the input text. Abstractive tasks (where the model needs to generate new or lengthy text) are a weaker fit.
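A quick way to tell whether a task is extractive is to check that each extracted value appears verbatim in the input. The sketch below assumes a hypothetical `ContactInfo` schema based on the contact-info example above; `ungrounded_fields` is an illustrative helper, and the sample text is invented for the demonstration.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ContactInfo:
    name: Optional[str] = None
    title: Optional[str] = None
    company: Optional[str] = None
    email: Optional[str] = None


def ungrounded_fields(source_text: str, extracted: ContactInfo) -> List[str]:
    """Return names of fields whose value does not appear verbatim in the source."""
    haystack = source_text.lower()
    return [
        field
        for field, value in asdict(extracted).items()
        if value is not None and value.lower() not in haystack
    ]


# Invented sample input for illustration only.
text = "Please contact Dana Ruiz, VP of Operations at Acme Corp (dana@acme.example)."
grounded = ContactInfo(
    name="Dana Ruiz",
    title="VP of Operations",
    company="Acme Corp",
    email="dana@acme.example",
)
# "Acme Corporation" is a rewording, so it is not found verbatim in the text.
reworded = ContactInfo(name="Dana Ruiz", company="Acme Corporation")
```

If many of your expected labels would fail a check like this, the task is probably drifting toward abstractive generation.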

Writing a good task definition

Your task definition is the seed prompt the model receives. Describe the task and its details simply but completely; don’t overthink it. The iteration process will work backward from your labels to a strong ruleset.

Useful patterns

DAG of Functions

Use a broad classifier to route inputs, then apply specialized functions to each category.

Example: Classify documents into type (contract / invoice / correspondence), then run separate extraction functions for each type. The contract extractor pulls parties and obligations; the invoice extractor pulls amounts and dates.
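The routing pattern can be sketched as a dictionary dispatch. Everything here is a hypothetical stand-in: `route_and_extract`, `toy_classify`, and the placeholder extractors illustrate the shape of the pipeline, not how deployed functions are actually invoked.

```python
from typing import Callable, Dict

Extractor = Callable[[str], dict]


def route_and_extract(
    document: str,
    classify: Callable[[str], str],
    extractors: Dict[str, Extractor],
) -> dict:
    """Classify the document, then run the extractor registered for that type."""
    doc_type = classify(document)
    extractor = extractors.get(doc_type)
    if extractor is None:
        # No specialized function for this type; return the route only.
        return {"type": doc_type, "fields": {}}
    return {"type": doc_type, "fields": extractor(document)}


# Toy placeholders so the sketch runs; real stages would be separate functions.
def toy_classify(document: str) -> str:
    lowered = document.lower()
    if "invoice" in lowered:
        return "invoice"
    if "agreement" in lowered:
        return "contract"
    return "correspondence"


extractors: Dict[str, Extractor] = {
    "contract": lambda doc: {"parties": None, "obligations": None},
    "invoice": lambda doc: {"amounts": None, "dates": None},
}

routed = route_and_extract("Invoice #12, total due on receipt", toy_classify, extractors)
```

Types without a registered extractor, such as correspondence here, pass through with an empty `fields` dictionary, so adding a new branch later only means registering one more entry.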

Binary decomposition

When a multi-label problem is hard to optimize, break it into independent binary classifiers — one per label. Each function asks a single yes/no question and converges independently.
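As a rough sketch of the decomposition, multi-label examples can be converted into one yes/no dataset per label, and the independent answers merged back afterward. `decompose` and `recombine` are hypothetical helpers, and the sample texts are invented; the label names come from the compliance example earlier on this page.

```python
from typing import Dict, List, Set, Tuple


def decompose(
    examples: List[Tuple[str, Set[str]]], labels: List[str]
) -> Dict[str, List[Tuple[str, bool]]]:
    """Turn multi-label examples into one yes/no dataset per label."""
    return {
        label: [(text, label in assigned) for text, assigned in examples]
        for label in labels
    }


def recombine(binary_outputs: Dict[str, bool]) -> Set[str]:
    """Merge independent yes/no answers back into a multi-label prediction."""
    return {label for label, is_present in binary_outputs.items() if is_present}


labels = ["data_retention", "access_control", "encryption"]
examples = [
    ("Logs are kept forever and stored in plaintext.", {"data_retention", "encryption"}),
    ("Any employee can read customer records.", {"access_control"}),
]

per_label = decompose(examples, labels)
merged = recombine({"data_retention": True, "access_control": False, "encryption": True})
```

Each entry in `per_label` can then be labeled and optimized on its own, which is what lets the binary functions converge independently.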

Common mistakes

Task too broad. “Analyze this customer interaction and determine next steps” is doing too many things. Break it down.
Ambiguous labels. If your labelers would disagree on what “medium priority” means, the function will too. Define your labels precisely in the task definition.
Too many labels. Every additional label increases the labeling burden and slows convergence. Start with fewer labels when possible.
Abstractive extraction. Asking the function to “summarize the key points” is abstractive. Asking it to “extract the stated deadline and responsible party” is extractive. The latter works much better in Sutro.
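To check for ambiguous labels before building a function, one option is to have two people label the same sample and measure their agreement. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic; it is a suggested diagnostic, not something this page prescribes, and the sample labels are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two labelers on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both label sequences must be non-empty and equal in length")
    n = len(a)
    # Fraction of items where the two labelers chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, given each labeler's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented sample labels for illustration only.
labeler_1 = ["high", "low", "high", "low"]
labeler_2 = ["high", "high", "low", "low"]
kappa = cohens_kappa(labeler_1, labeler_2)
```

A kappa near 1 suggests the label definitions are clear; a value near 0, as in this sample, means the labelers agree no more often than chance and the definitions likely need tightening.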
