Running batch inference
Parameters:
- `data` (Union[List, pd.DataFrame, pl.DataFrame, str]): The data to run inference on.
- `model` (list, str, optional): The model(s) to use for inference. Default is "llama-3.1-8b". Can accept a list of models, in which case the inference will be run in parallel for each model with `stay_attached=False`.
- `column` (str, optional): The column name to use for inference. Required if data is a DataFrame or file path.
- `output_column` (str, optional): The column name to store the inference results in if the input is a DataFrame. Defaults to "inference_result".
- `job_priority` (int, optional): The priority of the job. Default is 0.
- `output_schema` (Union[Dict[str, Any], BaseModel], optional): A structured schema for the output. Can be either a dictionary representing a JSON schema or a pydantic BaseModel. Defaults to None.
- `system_prompt` (str, optional): A system prompt to add to all inputs. This allows you to define the behavior of the model. Defaults to None.
- `sampling_params` (dict, optional): A dictionary of sampling parameters to use for the inference. Defaults to None, which uses the default sampling parameters.
- `random_seed_per_input` (bool, optional): If True, a random seed will be generated for each input. This is useful for diversity in outputs. Defaults to False.
- `dry_run` (bool, optional): If True, return cost estimates instead of running inference. Default is False.
- `stay_attached` (bool, optional): If True, the SDK will stay attached to the job and update you on the status and results as they become available. Defaults to True for priority 0 jobs and False for priority 1 jobs.
- `truncate_rows` (bool, optional): If True, any rows whose token count exceeds the context window of the selected model will be truncated to the maximum length that fits within the context window. Defaults to False.
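As an illustration of the `output_schema` parameter, a plain JSON-schema dict might look like the following. The field names here are invented for the example, not part of the SDK:

```python
# A JSON-schema dict of the kind output_schema accepts (the dictionary form;
# a pydantic BaseModel is the alternative). Fields are illustrative only.
sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "negative", "neutral"],
        },
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}
```

A schema like this constrains each model output to a JSON object with exactly these typed fields.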
Monitoring job status
Attaching to a job is equivalent to setting `stay_attached=True` when calling `infer(...)`.
This method connects to a running job and displays live progress updates, including the number of rows processed and token statistics. It shows a progress bar with real-time updates until the job completes.
Parameters:
- `job_id` (str): The ID of the job to attach to.
Returns: None
Job Status Behavior:
- RUNNING: Streams progress updates with a live progress bar and job statistics
- SUCCEEDED: Notifies that the job already completed and suggests using `sutro jobs results`
- FAILED: Displays a failure message and exits
- CANCELLED: Displays a cancellation message and exits
Example:
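Since attaching requires a live job, the following self-contained sketch mimics the behavior described above with a fake status feed. All names here (`fake_status_feed`, `attach_sketch`) are stand-ins for illustration, not the SDK's API:

```python
def fake_status_feed():
    """Stand-in for a job's status stream: a few RUNNING updates, then SUCCEEDED."""
    yield {"state": "RUNNING", "rows_done": 25, "rows_total": 100}
    yield {"state": "RUNNING", "rows_done": 80, "rows_total": 100}
    yield {"state": "SUCCEEDED", "rows_done": 100, "rows_total": 100}

def attach_sketch(status_feed):
    """Mimic attaching to a job: print progress until a terminal state, return it."""
    for status in status_feed:
        if status["state"] == "RUNNING":
            pct = 100 * status["rows_done"] // status["rows_total"]
            print(f"[{pct:3d}%] {status['rows_done']}/{status['rows_total']} rows processed")
        else:
            print(f"Job finished: {status['state']}")
            return status["state"]

final_state = attach_sketch(fake_status_feed())
```

In the real SDK the status feed comes from the job identified by `job_id`, and terminal states other than SUCCEEDED (FAILED, CANCELLED) cause the method to print a message and exit.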
Note: This method is ideal for monitoring long-running jobs interactively. For programmatic use cases where you don't want live progress updates, use the simpler `await_job_completion()` instead.
Await Job Completion
`await_job_completion` is best for this use case, and should only be used when not using the `stay_attached` parameter of `infer(...)` or the `attach(...)` function.
Waits for a job to complete and returns its results upon successful completion.
This method polls the job status every 5 seconds (printing it each time) until the job completes, fails, is cancelled, or the timeout is reached.
Parameters:
- `job_id` (str): The ID of the job to await.
- `timeout` (Optional[int]): Maximum time in seconds to wait for job completion. Defaults to 7200 (2 hours).
Returns: list | None: The results of the job if it completes successfully, or None if the job fails, is cancelled, or encounters an error.
Job Status Outcomes:
- SUCCEEDED: Returns the job results
- FAILED: Returns None
- CANCELLED: Returns None
Example:
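Since the real call needs a live job, this self-contained sketch illustrates the documented contract instead: poll on an interval until a terminal state or timeout, returning results only on SUCCEEDED. Here `get_status` is a hypothetical stand-in for the SDK's status call, not its actual API:

```python
import time

def await_job_completion_sketch(get_status, poll_interval=5, timeout=7200):
    """Poll get_status() until SUCCEEDED/FAILED/CANCELLED or timeout.

    Returns the results list on success, None otherwise -- mirroring the
    documented return contract of await_job_completion.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state, job_results = get_status()
        print(f"Job status: {state}")
        if state == "SUCCEEDED":
            return job_results
        if state in ("FAILED", "CANCELLED"):
            return None
        time.sleep(poll_interval)
    return None  # timed out

# Fake status source: reports RUNNING twice, then SUCCEEDED with results.
states = iter([
    ("RUNNING", None),
    ("RUNNING", None),
    ("SUCCEEDED", ["result-1", "result-2"]),
])
results = await_job_completion_sketch(lambda: next(states), poll_interval=0)
```

With the real SDK you would pass a `job_id` (and optionally a `timeout`) rather than a status callable, and the 5-second poll interval is fixed.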