# Sampling Parameters

Sampling parameters control how the model selects each output token, which lets you tune properties of the output such as randomness and length.
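To make the effect of these parameters concrete, here is a simplified, illustrative sketch of how `temperature`, `top_k`, and `top_p` reshape a next-token distribution. The function name `filter_token_distribution` is hypothetical, and this is not the inference engine's actual implementation; it only shows the general idea under the assumption that `temperature` is greater than zero.

```python
import math


def filter_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Hypothetical helper: return {token_index: probability} after filtering."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")

    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax

    # Rank tokens from most to least likely.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

    # top_k keeps only the k most likely tokens (0 means "no limit" here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches p.
    total = sum(weights[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_token_distribution(logits, temperature=0.7, top_k=20, top_p=0.8))
```

A token is then drawn at random from the returned distribution, so tighter filters and lower temperatures make the output more predictable.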

## How to Use Sampling Parameters

To use custom sampling parameters, pass a dictionary of your desired parameters to the `sampling_params` argument in your SDK call.

Each model has a set of default sampling parameters that are recommended by the model creator for best performance. When you provide your own dictionary, **it is merged with these defaults**, and any values you specify will **always take precedence**.

### Example of Overriding Defaults

Let's assume the model's default parameters are:

* `temperature`: 0.6
* `top_k`: 20
* `max_tokens`: 32768

If you want a more creative response (higher temperature) and a shorter output, you can provide just those specific overrides:

```python
import sutro as so

# We only specify the parameters we want to change.
sampling_overrides = {
    "temperature": 0.9,
    "max_tokens": 512
}

results = so.infer(
    inputs=...,
    sampling_params=sampling_overrides
)
```

The final parameters used by the model for this request will be a combination of your overrides and the defaults:

* `temperature`: **0.9**
* `max_tokens`: **512**
* `top_k`: **20**
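The merge described above behaves like a plain dictionary update in which your values win. The following is a minimal sketch of that behavior, not the SDK's internal code; `merge_sampling_params` is a hypothetical helper introduced only for illustration.

```python
def merge_sampling_params(defaults, overrides=None):
    """Hypothetical helper: overlay user overrides on top of model defaults."""
    # Later keys win, so anything in `overrides` replaces the default value.
    return {**defaults, **(overrides or {})}


# Assumed defaults from the example above.
model_defaults = {"temperature": 0.6, "top_k": 20, "max_tokens": 32768}
sampling_overrides = {"temperature": 0.9, "max_tokens": 512}

final_params = merge_sampling_params(model_defaults, sampling_overrides)
print(final_params)
# {'temperature': 0.9, 'top_k': 20, 'max_tokens': 512}
```

Because the merge builds a new dictionary, the defaults themselves are left unchanged for the next request.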

## Parameter Reference

We support any vLLM-compatible sampling parameter. See the [vLLM `SamplingParams` class](https://docs.vllm.ai/en/stable/api/vllm/sampling_params.html#vllm.sampling_params.SamplingParams) for the complete list of valid parameters.

Note: We do not set a default for every parameter in `vllm.SamplingParams`. For any parameter without a model default, the vLLM default is used.
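One way to picture the resulting lookup order is as three layers: your overrides, then the model defaults, then the vLLM defaults. The sketch below uses `collections.ChainMap` to illustrate this; it is not how the service is implemented. The `VLLM_DEFAULTS` dictionary is an assumed, partial subset of vLLM's defaults (check the vLLM reference for your version), and `MODEL_DEFAULTS` copies the Llama family values listed later on this page.

```python
from collections import ChainMap

# Assumption: a small, illustrative subset of vLLM's SamplingParams defaults.
VLLM_DEFAULTS = {
    "temperature": 1.0,
    "top_p": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
}

# Llama family defaults from this page.
MODEL_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 1,
    "max_tokens": 4096,
    "repetition_penalty": 1.0,
}

user_overrides = {"temperature": 0.2}

# ChainMap searches its mappings left to right and returns the first match.
resolved = ChainMap(user_overrides, MODEL_DEFAULTS, VLLM_DEFAULTS)

print(resolved["temperature"])       # from the user override
print(resolved["max_tokens"])        # from the model defaults
print(resolved["presence_penalty"])  # falls back to the vLLM default
```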

## Default Parameters by Model Family

Different model families have different recommended default parameters; the base configurations are listed below. Each configuration follows the settings recommended by the lab that released the model, e.g. the Sampling Parameters guidance in the [Qwen3 14B best practices](https://huggingface.co/Qwen/Qwen3-14B#best-practices).

### [Llama Family](https://huggingface.co/meta-llama)

| Parameter            | Default Value |
| -------------------- | ------------- |
| `temperature`        | `0.75`        |
| `top_p`              | `1`           |
| `max_tokens`         | `4096`        |
| `repetition_penalty` | `1.0`         |

### [Qwen 3 Family](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f)

The Qwen 3 family has different defaults depending on whether the model is used for standard generation (`Non-Thinking`) or for tasks that require reasoning (`Thinking`).

**Non-Thinking Defaults**

| Parameter            | Default Value |
| -------------------- | ------------- |
| `temperature`        | `0.7`         |
| `top_p`              | `0.8`         |
| `top_k`              | `20`          |
| `max_tokens`         | `4096`        |
| `repetition_penalty` | `1.0`         |

**Thinking Defaults**

| Parameter            | Default Value |
| -------------------- | ------------- |
| `temperature`        | `0.6`         |
| `top_p`              | `0.95`        |
| `top_k`              | `20`          |
| `max_tokens`         | `4096`        |
| `repetition_penalty` | `1.0`         |

<Note>
  Certain Qwen Mixture-of-Experts (MoE) models use a higher `max_tokens` default:

  Defaults to 16,384:

  * `qwen-3-30b-a3b`
  * `qwen-3-235b-a22b`

  Defaults to 32,768:

  * `qwen-3-30b-a3b-thinking`
  * `qwen-3-235b-a22b-thinking`

  The higher limits allow for sufficient length and robustness in the model's reasoning process.
</Note>
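The Qwen 3 defaults and the MoE exceptions above can be summarized as a small lookup. This is an illustrative sketch only: `qwen3_defaults` is a hypothetical helper, and it assumes that thinking variants can be recognized by a `-thinking` suffix in the model name, as in the identifiers listed in the note.

```python
QWEN3_NON_THINKING = {
    "temperature": 0.7, "top_p": 0.8, "top_k": 20,
    "max_tokens": 4096, "repetition_penalty": 1.0,
}
QWEN3_THINKING = {
    "temperature": 0.6, "top_p": 0.95, "top_k": 20,
    "max_tokens": 4096, "repetition_penalty": 1.0,
}

# MoE models with a higher max_tokens default, as listed in the note.
MAX_TOKENS_EXCEPTIONS = {
    "qwen-3-30b-a3b": 16384,
    "qwen-3-235b-a22b": 16384,
    "qwen-3-30b-a3b-thinking": 32768,
    "qwen-3-235b-a22b-thinking": 32768,
}


def qwen3_defaults(model):
    """Hypothetical helper: look up the documented defaults for a Qwen 3 model."""
    # Assumption: thinking variants are identified by their name suffix.
    base = QWEN3_THINKING if model.endswith("-thinking") else QWEN3_NON_THINKING
    params = dict(base)  # copy so the shared tables are never mutated
    if model in MAX_TOKENS_EXCEPTIONS:
        params["max_tokens"] = MAX_TOKENS_EXCEPTIONS[model]
    return params


print(qwen3_defaults("qwen-3-235b-a22b-thinking")["max_tokens"])  # 32768
print(qwen3_defaults("qwen-3-14b")["max_tokens"])  # hypothetical identifier; 4096
```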

### [Gemma Family](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d)

| Parameter            | Default Value |
| -------------------- | ------------- |
| `temperature`        | `0.95`        |
| `top_p`              | `0.95`        |
| `top_k`              | `64`          |
| `max_tokens`         | `4096`        |
| `repetition_penalty` | `1.0`         |
