Running batch inference

infer(self, data, model='llama-3.1-8b', column=None, output_column='inference_result', job_priority=0, output_schema=None, system_prompt=None, sampling_params=None, random_seed_per_input=False, dry_run=False, stay_attached=None, truncate_rows=False)
Run LLM inference on a large list, table, dataframe, or file.

Parameters:

  • data (Union[List, pd.DataFrame, pl.DataFrame, str]): The data to run inference on.
  • model (Union[str, List[str]], optional): The model(s) to use for inference. Defaults to “llama-3.1-8b”. If a list of models is given, inference runs in parallel for each model, with stay_attached=False.
  • column (str, optional): The column name to use for inference. Required if data is a DataFrame or file path.
  • output_column (str, optional): The column name to store the inference results in if input is a DataFrame. Defaults to “inference_result”.
  • job_priority (int, optional): The priority of the job. Default is 0.
  • output_schema (Union[Dict[str, Any], BaseModel], optional): A structured schema for the output. Can be either a dictionary representing a JSON schema or a pydantic BaseModel. Defaults to None.
  • system_prompt (str, optional): A system prompt to add to all inputs. This allows you to define the behavior of the model. Defaults to None.
  • sampling_params (dict, optional): A dictionary of sampling parameters to use for the inference. Defaults to None, which uses the default sampling parameters.
  • random_seed_per_input (bool, optional): If True, a random seed will be generated for each input. This is useful for diversity in outputs. Defaults to False.
  • dry_run (bool, optional): If True, return cost estimates instead of running inference. Default is False.
  • stay_attached (bool, optional): If True, the SDK will stay attached to the job and update you on the status and results as they become available. Default behavior is True for priority 0 jobs, and False for priority 1 jobs.
  • truncate_rows (bool, optional): If True, any rows that have a token count exceeding the context window length of the selected model will be truncated to the max length that will fit within the context window. Defaults to False.
Returns: Union[List, pd.DataFrame, pl.DataFrame, str]: The results of the inference, or the job ID if the SDK does not stay attached.
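A minimal sketch of calling infer(...) on a DataFrame. The client here is a stub so the example runs offline; in real use, initialize the Sutro client as described in the SDK's setup documentation. The stub's return shapes (the dry-run estimate dict in particular) are assumptions, not the documented API.

```python
import pandas as pd

# Stubbed client so this sketch runs offline; substitute a real,
# initialized Sutro client in practice.
class StubClient:
    def infer(self, data, model="llama-3.1-8b", column=None,
              output_column="inference_result", system_prompt=None,
              dry_run=False, **kwargs):
        if dry_run:
            # Placeholder shape for a cost estimate (assumption).
            return {"estimated_cost_usd": 0.0}
        out = data.copy()
        out[output_column] = ["<model output>"] * len(data)
        return out

client = StubClient()
df = pd.DataFrame({"review": ["Great product!", "Arrived broken."]})

# DataFrame input requires `column`; results are written to `output_column`.
result = client.infer(
    df,
    column="review",
    output_column="sentiment",
    system_prompt="Classify each review as positive or negative.",
)
print(result.columns.tolist())  # ['review', 'sentiment']

# With dry_run=True, infer returns a cost estimate instead of running the job.
estimate = client.infer(df, column="review", dry_run=True)
```

Note the split of responsibilities: `column` selects the input text, while `output_column` only matters for DataFrame inputs, where results come back as a new column alongside the original data.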

Monitoring job status

attach(self, job_id: str)
Attach to an existing job and stream its progress in real time. This is equivalent to setting stay_attached=True when calling infer(...).
This method connects to a running job and displays live progress updates, including the number of rows processed and token statistics. It shows a progress bar with real-time updates until the job completes.
Parameters:
  • job_id (str): The ID of the job to attach to
Returns: None
Job Status Behavior:
  • RUNNING: Streams progress updates with a live progress bar and job statistics
  • SUCCEEDED: Notifies that the job already completed and suggests using sutro jobs results
  • FAILED: Displays failure message and exits
  • CANCELLED: Displays cancellation message and exits
Example:
# Attach to a running job to monitor its progress
sutro.attach("job_12345")
# Progress bar will display:
# Progress: 45%[████████████████                ] 450/1000 [00:32<00:45] Input tokens processed: 12500, Tokens generated: 8300, Total tokens/s: 325.4
Note: This method is ideal for monitoring long-running jobs interactively. For programmatic use cases where you don’t want live progress updates, use the simpler await_job_completion() instead.
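A common detach-and-reattach pattern, sketched with a stub client so it runs offline: submit a detached job via infer(...), keep the returned job ID, and attach to it later. The stubbed job ID and print output are illustrative assumptions.

```python
# Stubbed client so this sketch runs offline; substitute a real,
# initialized Sutro client in practice.
class StubClient:
    def infer(self, data, column=None, job_priority=0, **kwargs):
        # Per the docs above, priority 1 jobs detach by default,
        # and a detached submission returns the job ID.
        return "job_12345"

    def attach(self, job_id):
        print(f"Attached to {job_id}; streaming progress...")

client = StubClient()

# Submit a detached priority-1 job and hold on to the job ID.
job_id = client.infer(["Summarize this document."], job_priority=1)

# Reattach later, whenever interactive monitoring is convenient.
client.attach(job_id)
```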

Awaiting job completion

await_job_completion(self, job_id: str, timeout: int | None = 7200) → list | None
When deployed as part of a pipeline (Dagster, Airflow, etc.), you might not want to watch the job's progress as it happens. await_job_completion is best for this use case; use it only when you are not using the stay_attached parameter of infer(...) or the attach(...) function.
Waits for a job to complete and returns its results upon successful completion.
This method polls the job status every 5 seconds (and prints it out) until the job completes, fails, is cancelled, or the timeout is reached.
Parameters:
  • job_id (str): The ID of the job to await.
  • timeout (Optional[int]): Maximum time in seconds to wait for job completion. Defaults to 7200 (2 hours).
Returns: list | None: The results of the job if it completes successfully, or None if the job fails, is cancelled, or encounters an error.
Job Status Outcomes:
  • SUCCEEDED: Returns the job results
  • FAILED: Returns None
  • CANCELLED: Returns None
Example:
results = client.await_job_completion("job_12345", timeout=3600)
# Job status is RUNNING for job-f9102252-ae2f-4d61-a879-a657e314f2e0
if results:
    print(f"Job completed with {len(results)} results")

Getting quotas

get_quotas(self)
Get your current quotas.
Returns: list: A list of quotas, one for each priority level, each containing row_quota and token_quota.
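A short sketch of reading quotas, again with a stub client so it runs offline. The docs above only specify that each entry contains row_quota and token_quota; the job_priority field and the numeric values in the stub are assumptions.

```python
# Stubbed client so this sketch runs offline; substitute a real,
# initialized Sutro client in practice. Field names beyond row_quota
# and token_quota, and all values, are illustrative assumptions.
class StubClient:
    def get_quotas(self):
        return [
            {"job_priority": 0, "row_quota": 1_000_000, "token_quota": 500_000_000},
            {"job_priority": 1, "row_quota": 10_000_000, "token_quota": 5_000_000_000},
        ]

client = StubClient()
quotas = client.get_quotas()

# One entry per priority level.
for quota in quotas:
    print(f"priority {quota['job_priority']}: "
          f"rows={quota['row_quota']:,}, tokens={quota['token_quota']:,}")
```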