Sampling parameters control the token sampling process, allowing for fine-grained customization of the model’s output.
How to Use Sampling Parameters
To use custom sampling parameters, pass a dictionary of your desired parameters to the sampling_params argument in your SDK call.
Each model has a set of default sampling parameters that are recommended by the model creator for best performance. When you provide your own dictionary, it is merged with these defaults, and any values you specify will always take precedence.
Example of Overriding Defaults
Let’s assume the model’s default parameters are:
temperature: 0.6
top_k: 20
max_tokens: 32768
If you want a more creative response (higher temperature) and a shorter output, you can provide just those specific overrides:
```python
import sutro as so

# We only specify the parameters we want to change.
sampling_overrides = {
    "temperature": 0.9,
    "max_tokens": 512
}

results = so.infer(
    inputs=...,
    sampling_params=sampling_overrides
)
```
The final parameters used by the model for this request will be a combination of your overrides and the defaults:
temperature: 0.9
max_tokens: 512
top_k: 20
Parameter Reference
We support any vLLM-compatible sampling parameters. Please refer to the vLLM SamplingParams class for a complete list of valid parameters.
Note: we do not set defaults for every parameter in vllm.SamplingParams; for any parameter we leave unset, the value falls back to the vLLM default.
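This fallback behaves like a three-layer lookup: user overrides first, then our per-model defaults, then vLLM's own defaults. A sketch using the standard library's ChainMap (the specific values here are illustrative, not the real vLLM defaults):

```python
from collections import ChainMap

# Illustrative layers only: earlier maps shadow later ones.
vllm_defaults = {"presence_penalty": 0.0, "temperature": 1.0, "top_p": 1.0}
model_defaults = {"temperature": 0.6, "top_p": 0.95}
user_overrides = {"presence_penalty": 0.5}

effective = ChainMap(user_overrides, model_defaults, vllm_defaults)
print(effective["presence_penalty"])  # from the user's overrides
print(effective["temperature"])       # from the model defaults
print(effective["top_p"])             # also from the model defaults
```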
Default Parameters by Model Family
Different model families have different recommended default parameters. Here is a reference for the base configurations. Each configuration is based on the given lab's recommended settings, e.g. Qwen3 14B > Sampling Parameters.
| Parameter | Default Value |
| --- | --- |
| temperature | 0.75 |
| top_p | 1 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
The Qwen family has different defaults depending on whether the model is used for standard generation (Non-Thinking) or for tasks that require reasoning (Thinking).
Non-Thinking Defaults
| Parameter | Default Value |
| --- | --- |
| temperature | 0.7 |
| top_p | 0.8 |
| top_k | 20 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
Thinking Defaults
| Parameter | Default Value |
| --- | --- |
| temperature | 0.6 |
| top_p | 0.95 |
| top_k | 20 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |
Certain Qwen Mixture-of-Experts (MoE) models use a higher max_tokens default. They are as follows:
Defaults to 16,384:
qwen-3-30b-a3b
qwen-3-235b-a22b
Defaults to 32,768:
qwen-3-30b-a3b-thinking
qwen-3-235b-a22b-thinking
This allows for sufficient length and robustness in the model’s reasoning process.
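The per-model exceptions above amount to a lookup table keyed by model slug, falling back to the family default of 4096. A hypothetical sketch (the helper name and structure are illustrative, not part of the SDK):

```python
# MoE models with larger max_tokens defaults, per the lists above.
MOE_MAX_TOKENS = {
    "qwen-3-30b-a3b": 16_384,
    "qwen-3-235b-a22b": 16_384,
    "qwen-3-30b-a3b-thinking": 32_768,
    "qwen-3-235b-a22b-thinking": 32_768,
}

def default_max_tokens(model: str, fallback: int = 4096) -> int:
    """Return the MoE-specific max_tokens default, or the family fallback."""
    return MOE_MAX_TOKENS.get(model, fallback)

print(default_max_tokens("qwen-3-235b-a22b-thinking"))  # 32768
print(default_max_tokens("qwen-3-14b"))                 # 4096
```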
| Parameter | Default Value |
| --- | --- |
| temperature | 0.95 |
| top_p | 0.95 |
| top_k | 64 |
| max_tokens | 4096 |
| repetition_penalty | 1.0 |