Creating a Batch Inference Job

POST https://api.sutro.sh/batch-inference

Run batch inference on an array of inputs.

Parameters:
  • inputs (list, required) – The list of inputs to run inference on.

  • model (str, optional, default=llama-3.1-8b) – The model ID to use for inference. See Available Models for the full list.

  • system_prompt (str, optional, default=None) – A system prompt to use for the inference. Use this parameter to provide consistent, task-specific instructions to the model. See System Prompts for more information.

  • json_schema (object, optional, default=None) – If supplied, a JSON schema that the output must adhere to. Must follow the json-schema.org specification. See Structured Outputs for more information.

  • sampling_params (object, optional, default=None) – If supplied, a dictionary of sampling parameters to use for the inference. See Sampling Parameters for more information.

  • job_priority (int, optional, default=0) – The priority of the job. Currently, only priorities 0 and 1 are supported. See Job Priority for more information.

  • dryrun (boolean, optional, default=False) – If True, the API will return cost estimates instead of running inference. See Cost Estimates for more information.

  • random_seed_per_input (boolean, optional, default=False) – If True, a random seed will be generated for each input. This is useful for producing more diverse outputs.

  • truncate_rows (boolean, optional, default=False) – If True, any rows whose token count exceeds the selected model's context window will be truncated to the maximum length that fits within the context window.

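The parameters above map onto fields of the JSON request body. The sketch below (Python) assembles an illustrative payload; the specific input strings, schema fields, and sampling keys (e.g. temperature) are assumptions for demonstration only. See Structured Outputs and Sampling Parameters for the authoritative options.

```python
# Illustrative request payload; values are examples, not requirements.
payload = {
    "inputs": [
        "Summarize: The quick brown fox jumps over the lazy dog.",
        "Summarize: A journey of a thousand miles begins with a single step.",
    ],
    "model": "llama-3.1-8b",  # default model
    "system_prompt": "You are a concise summarizer. Reply in one sentence.",
    "json_schema": {  # must follow the json-schema.org specification
        "type": "object",
        "properties": {"summary": {"type": "string"}},
        "required": ["summary"],
    },
    "sampling_params": {"temperature": 0.7},  # assumed key; see Sampling Parameters
    "job_priority": 0,
    "dryrun": False,
}
```
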
Request Headers:
  • Accept – application/json

Returns:

A job_id that can be used to poll for the status and results of the job.
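Below is a minimal sketch of submitting the job with Python's requests library and reading back the job_id. Only the endpoint, the Accept header, and the job_id return value come from this page; the bearer-token Authorization header and the exact response shape are assumptions shown for illustration.

```python
import requests

# Assumed bearer-token auth; substitute your actual credential scheme.
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Accept": "application/json",
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.sutro.sh/batch-inference",
    json=payload,  # request body as constructed above
    headers=headers,
    timeout=30,
)
response.raise_for_status()

# Use the returned job_id to poll for the job's status and results.
job_id = response.json()["job_id"]
print(f"Submitted batch inference job: {job_id}")
```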