Sampling parameters control the token sampling process, allowing for fine-grained customization of the model’s output.

How to Use Sampling Parameters

To use custom sampling parameters, pass a dictionary of your desired parameters to the sampling_params argument in your SDK call. Each model has a set of default sampling parameters that are recommended by the model creator for best performance. When you provide your own dictionary, it is merged with these defaults, and any values you specify will always take precedence.

Example of Overriding Defaults

Let’s assume the model’s default parameters are:
  • temperature: 0.6
  • top_k: 20
  • max_tokens: 32768
If you want a more creative response (higher temperature) and a shorter output, you can provide just those specific overrides:
import sutro as so

# We only specify the parameters we want to change.
sampling_overrides = {
    "temperature": 0.9,
    "max_tokens": 512
}

results = so.infer(
    inputs=...,
    sampling_params=sampling_overrides
)
The final parameters used by the model for this request will be a combination of your overrides and the defaults:
  • temperature: 0.9
  • max_tokens: 512
  • top_k: 20
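The merge behaves like a shallow Python dictionary merge in which user-supplied keys take precedence. A quick sketch that reproduces the final parameters above (the default values are the illustrative ones from this example, not any model's actual defaults):

```python
# Illustrative model defaults from the example above.
defaults = {"temperature": 0.6, "top_k": 20, "max_tokens": 32768}

# User-supplied overrides.
overrides = {"temperature": 0.9, "max_tokens": 512}

# Shallow merge: keys in `overrides` win over keys in `defaults`.
final_params = {**defaults, **overrides}

print(final_params)
# {'temperature': 0.9, 'top_k': 20, 'max_tokens': 512}
```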

Parameter Reference

We support any vLLM-compatible sampling parameters. See the vLLM sampling parameters class (vllm.SamplingParams) for the complete list of valid parameters. Note: we do not set a default for every parameter in vllm.SamplingParams; when a parameter has no default here, the value falls back to the vLLM default.
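This layered fallback (your overrides, then our model defaults, then the vLLM defaults) can be sketched as a chain of dictionary merges. All values below are illustrative; consult vllm.SamplingParams for the library's actual defaults:

```python
# Illustrative values only -- not the real vLLM or model defaults.
vllm_defaults = {"temperature": 1.0, "top_p": 1.0, "presence_penalty": 0.0}
model_defaults = {"temperature": 0.6, "top_k": 20}  # set per model family
user_overrides = {"presence_penalty": 0.5}

# Later dicts take precedence: user > model > vLLM.
resolved = {**vllm_defaults, **model_defaults, **user_overrides}

print(resolved["presence_penalty"])  # 0.5  (user override)
print(resolved["temperature"])       # 0.6  (model default)
print(resolved["top_p"])             # 1.0  (vLLM fallback)
```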

Default Parameters by Model Family

Different model families have different recommended default parameters. The base configurations are listed below. Each configuration follows the model creator's recommended settings, e.g. Qwen3 14B > Sampling Parameters.

Llama Family

  • temperature: 0.75
  • top_p: 1
  • max_tokens: 4096
  • repetition_penalty: 1.0

Qwen 3 Family

The Qwen family has different defaults depending on whether the model is used for standard generation (Non-Thinking) or for tasks that require reasoning (Thinking).

Non-Thinking Defaults
  • temperature: 0.7
  • top_p: 0.8
  • top_k: 20
  • max_tokens: 4096
  • repetition_penalty: 1.0

Thinking Defaults
  • temperature: 0.6
  • top_p: 0.95
  • top_k: 20
  • max_tokens: 4096
  • repetition_penalty: 1.0
Certain Qwen Mixture-of-Experts (MoE) models use a higher max_tokens default. They are as follows:

Defaults to 16,384:
  • qwen-3-30b-a3b
  • qwen-3-235b-a22b

Defaults to 32,768:
  • qwen-3-30b-a3b-thinking
  • qwen-3-235b-a22b-thinking
This allows for sufficient length and robustness in the model’s reasoning process.
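These per-model max_tokens defaults can be expressed as a simple lookup table. The MoE model IDs come from the list above; the helper name and the non-MoE model ID in the usage example are hypothetical:

```python
# Per-model max_tokens overrides for the Qwen MoE models listed above;
# all other Qwen 3 models fall back to the family default of 4096.
MOE_MAX_TOKENS = {
    "qwen-3-30b-a3b": 16_384,
    "qwen-3-235b-a22b": 16_384,
    "qwen-3-30b-a3b-thinking": 32_768,
    "qwen-3-235b-a22b-thinking": 32_768,
}

def default_max_tokens(model_id: str) -> int:
    """Return the default max_tokens for a given Qwen 3 model ID."""
    return MOE_MAX_TOKENS.get(model_id, 4096)

print(default_max_tokens("qwen-3-235b-a22b-thinking"))  # 32768
print(default_max_tokens("qwen-3-14b"))                 # 4096 (non-MoE fallback)
```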

Gemma Family

  • temperature: 0.95
  • top_p: 0.95
  • top_k: 64
  • max_tokens: 4096
  • repetition_penalty: 1.0