Sampling parameters control the token sampling process, allowing for fine-grained customization of the model’s output.
How to Use Sampling Parameters
To use custom sampling parameters, pass a dictionary of your desired parameters to the sampling_params argument in your SDK call.
Each model has a set of default sampling parameters that are recommended by the model creator for best performance. When you provide your own dictionary, it is merged with these defaults, and any values you specify will always take precedence.
Example of Overriding Defaults
Let’s assume the model’s default parameters are:
temperature: 0.6
top_k: 20
max_tokens: 32768
If you want a more creative response (higher temperature) and a shorter output, you can provide just those specific overrides:
```python
import sutro as so

# We only specify the parameters we want to change.
sampling_overrides = {
    "temperature": 0.9,
    "max_tokens": 512
}

results = so.infer(
    inputs=...,
    sampling_params=sampling_overrides
)
```
The final parameters used by the model for this request will be a combination of your overrides and the defaults:
temperature: 0.9
max_tokens: 512
top_k: 20
Parameter Reference
We support any vLLM-compatible sampling parameters. Please refer to the vLLM SamplingParams class for a complete list of valid parameters.
Note: we do not set defaults for every parameter in vllm.SamplingParams; for any parameter we leave unset, the value falls back to the vLLM default.
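This fallback behaves like a three-layer lookup: user overrides first, then our per-model defaults, then vLLM's own defaults. A sketch using the standard library's ChainMap (the specific values here are illustrative, not the real vLLM defaults):

```python
from collections import ChainMap

# Illustrative layers only: earlier maps shadow later ones.
vllm_defaults = {"presence_penalty": 0.0, "temperature": 1.0, "top_p": 1.0}
model_defaults = {"temperature": 0.6, "top_p": 0.95}
user_overrides = {"presence_penalty": 0.5}

effective = ChainMap(user_overrides, model_defaults, vllm_defaults)
print(effective["presence_penalty"])  # from the user's overrides
print(effective["temperature"])       # from the model defaults
print(effective["top_p"])             # also from the model defaults
```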
Default Parameters by Model Family
Different model families have different recommended default parameters. Here is a reference for the base configurations. Each configuration is based on the given lab's recommended settings, e.g. Qwen3 14B > Sampling Parameters.
| Parameter | Default Value |
| --- | --- |
| temperature | 0.75 |
| top_p | 1 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
The Qwen family has different defaults depending on whether the model is used for standard generation (Non-Thinking) or for tasks that require reasoning (Thinking).
Non-Thinking Defaults
| Parameter | Default Value |
| --- | --- |
| temperature | 0.7 |
| top_p | 0.8 |
| top_k | 20 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
Thinking Defaults
| Parameter | Default Value |
| --- | --- |
| temperature | 0.6 |
| top_p | 0.95 |
| top_k | 20 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
Certain Qwen Mixture-of-Experts (MoE) models use a higher max_tokens default. They are as follows:
Defaults to 16,384:
qwen-3-30b-a3b
qwen-3-235b-a22b
Defaults to 32,768:
qwen-3-30b-a3b-thinking
qwen-3-235b-a22b-thinking
This allows for sufficient length and robustness in the model’s reasoning process.
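The per-model exceptions above amount to a lookup table keyed by model slug, falling back to the family default of 4096. A hypothetical sketch (the helper name and structure are illustrative, not part of the SDK):

```python
# MoE models with larger max_tokens defaults, per the lists above.
MOE_MAX_TOKENS = {
    "qwen-3-30b-a3b": 16_384,
    "qwen-3-235b-a22b": 16_384,
    "qwen-3-30b-a3b-thinking": 32_768,
    "qwen-3-235b-a22b-thinking": 32_768,
}

def default_max_tokens(model: str, fallback: int = 4096) -> int:
    """Return the MoE-specific max_tokens default, or the family fallback."""
    return MOE_MAX_TOKENS.get(model, fallback)

print(default_max_tokens("qwen-3-235b-a22b-thinking"))  # 32768
print(default_max_tokens("qwen-3-14b"))                 # 4096
```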
| Parameter | Default Value |
| --- | --- |
| temperature | 0.95 |
| top_p | 0.95 |
| top_k | 64 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |