# Deployment and Autotuning

Helion's autotuner explores a large search space, which is time-consuming, so production deployments should generate autotuned configs **ahead of time**. Run the autotuner on a development workstation or a dedicated tuning box that mirrors your target GPU/accelerator. Check tuned configs into your repository alongside the kernel, or package them as data files and load them with `helion.Config.load` (see {doc}`api/config`). This keeps production kernel startup fast and deterministic, while also giving you explicit control over when autotuning happens.

If you don't specify pre-tuned configs, Helion autotunes on the first call for each specialization key. This is convenient for experimentation, but not ideal for production, since the first call pays a large tuning cost. Helion writes successful tuning results to an on-disk cache (overridable with `HELION_CACHE_DIR`, skippable with `HELION_SKIP_CACHE`, see {doc}`api/settings`) so repeated runs on the same machine can reuse prior configs. For more on caching, see {py:class}`~helion.autotuner.local_cache.LocalAutotuneCache` and {py:class}`~helion.autotuner.local_cache.StrictLocalAutotuneCache`.

The rest of this document covers strategies for pre-tuning and deploying tuned configs, which is the recommended approach for production workloads.

## Run Autotuning Jobs

The simplest way to launch autotuning is straight through the kernel call:

```python
import torch, helion


@helion.kernel()
def my_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    ...


example_inputs = (
    torch.randn(1048576, device="cuda"),
    torch.randn(1048576, device="cuda"),
)

# The first call triggers autotuning, which is cached for future calls,
# and prints the best config found.
my_kernel(*example_inputs)
```

Set `HELION_FORCE_AUTOTUNE=1` to re-run tuning even when cached configs exist (documented in {doc}`api/settings`).

Call `my_kernel.autotune(example_inputs)` explicitly to separate tuning from execution (see {doc}`api/kernel`). `autotune()` returns the best config found, which you can save for later use. Tune against multiple sizes by invoking `autotune` once per representative input set, for example:

```python
datasets = {
    "s": (
        torch.randn(2**16, device="cuda"),
        torch.randn(2**16, device="cuda"),
    ),
    "m": (
        torch.randn(2**20, device="cuda"),
        torch.randn(2**20, device="cuda"),
    ),
    "l": (
        torch.randn(2**24, device="cuda"),
        torch.randn(2**24, device="cuda"),
    ),
}

for tag, args in datasets.items():
    config = my_kernel.autotune(args)
    config.save(f"configs/my_kernel_{tag}.json")
```

### Direct Control Over Autotuners

When you need more control, construct autotuners manually. {py:class}`~helion.autotuner.pattern_search.PatternSearch` is the default autotuner:

```python
from helion.autotuner import PatternSearch

bound = my_kernel.bind(example_inputs)
tuner = PatternSearch(
    bound,
    example_inputs,
    # Double the defaults to explore more candidates:
    initial_population=200,  # Default is 100.
    copies=10,               # Default is 5.
    max_generations=40,      # Default is 20.
)
best_config = tuner.autotune()
best_config.save("configs/my_kernel.json")
```

- Adjust `initial_population`, `copies`, or `max_generations` to trade tuning time against coverage, or try different search algorithms.
- Use different input tuples to produce multiple saved configs (`my_kernel_large.json`, `my_kernel_fp8.json`, etc.).
- Tuning runs can be seeded with `HELION_AUTOTUNE_RANDOM_SEED` if you need more reproducibility; see {doc}`api/settings`. Note that the seed only affects which configs are tried, not the timing results.
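These environment variables can be set in the shell that launches the tuning job, or from Python before any Helion kernels are defined or called. A minimal sketch (the cache path and seed value are illustrative):

```python
import os

# Persist tuned configs somewhere that survives the job (path is an example).
os.environ["HELION_CACHE_DIR"] = "/shared/helion-cache"
# Make the search order reproducible across runs.
os.environ["HELION_AUTOTUNE_RANDOM_SEED"] = "0"
# Uncomment to ignore previously cached configs and re-tune from scratch:
# os.environ["HELION_FORCE_AUTOTUNE"] = "1"

import helion  # configure the environment before defining or calling kernels
```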
## Deploy a Single Config

If one configuration wins for every production call, bake it into the decorator:

```python
best = helion.Config.load("configs/my_kernel.json")


@helion.kernel(config=best)
def my_kernel(x, y):
    ...
```

The supplied `config` applies to **all** argument shapes, dtypes, and devices that hit this kernel. This is ideal for workloads with a single critical path, or when you manage routing externally. `helion.Config.save` / `load` make it easy to copy configs between machines; details live in {doc}`api/config`. You can also copy and paste the config directly from the autotuner output.

## Deploy Multiple Configs

When you expect variability, supply a small list of candidates:

```python
candidate_configs = [
    helion.Config.load("configs/my_kernel_small.json"),
    helion.Config.load("configs/my_kernel_large.json"),
]


@helion.kernel(configs=candidate_configs, static_shapes=True)
def my_kernel(x, y):
    ...
```

Helion performs a lightweight benchmark (similar to Triton's autotune) the first time each specialization key is seen, running each provided config and selecting the fastest.

A key detail here is controlling the specialization key, which determines when to re-benchmark. Options include:

- **Default (`static_shapes=True`):** Helion shape-specializes on the exact shape/stride signature, rerunning the selection whenever those shapes differ. This delivers the best per-shape performance, but requires all calls to match the example shapes exactly.
- **`static_shapes=False`:** switches to bucketed dynamic shapes. Helion reuses results as long as tensor dtypes and device types stay constant. Shape changes only trigger a re-selection when a dimension size crosses the buckets `{0, 1, ≥2}`. Helion also tracks whether any tensor exceeds the `torch.int32` indexing limit (more than `2**31 - 1` elements) and will automatically regenerate code with 64-bit indexing in that case. Use this mode when you need one compiled kernel to handle many input sizes, and pin `@helion.kernel(..., index_dtype=torch.int64)` if large tensors are the norm so you avoid an extra specialization boundary.
- **Custom keys:** pass `key=` to group calls however you like. The custom key applies in addition to the specialization above. As an example, you could trigger re-selection with power-of-two bucketing, as shown below:

```python
@helion.kernel(
    configs=candidate_configs,
    key=lambda x, y: helion.next_power_of_2(x.numel()),
    static_shapes=False,
)
def my_kernel(x, y):
    ...
```
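With this key, calls whose sizes round up to the same power of two share one config selection, while a new bucket triggers a fresh lightweight benchmark. A hedged sketch of the resulting behavior (sizes are illustrative):

```python
# 1_000_000 and 900_000 both round up to 2**20, so these calls share one selection.
my_kernel(torch.randn(1_000_000, device="cuda"), torch.randn(1_000_000, device="cuda"))
my_kernel(torch.randn(900_000, device="cuda"), torch.randn(900_000, device="cuda"))

# 16_000_000 rounds up to 2**24, a new key, so the candidates are benchmarked again.
my_kernel(torch.randn(16_000_000, device="cuda"), torch.randn(16_000_000, device="cuda"))
```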
See {doc}`api/kernel` for the full decorator reference.

## Selective Shape Specialization

The `static_shapes` setting is all-or-nothing: either every dimension is specialized (`static_shapes=True`) or dimensions are bucketed dynamically (`static_shapes=False`). Sometimes you want finer control - specializing only specific dimensions while keeping others dynamic. Helion provides two APIs for selective shape specialization:

| API | Location | Effect |
|-----|----------|--------|
| `hl.specialize()` | Inside kernel | Dimension always specialized for all calls |
| `torch._dynamo.mark_static()` | Outside kernel | Dimension specialized only for marked tensors |

### `hl.specialize()` - Internal Specialization

Use {func}`~helion.language.specialize` inside the kernel to make specific dimensions compile-time constants. This applies to **every call** to the kernel:

```python
import torch

import helion
import helion.language as hl


@helion.kernel(static_shapes=False)
def rms_norm_fwd(
    x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5
) -> torch.Tensor:
    m, n = x.size()
    hl.specialize(n)  # hidden dimension becomes a compile-time constant
    out = torch.empty_like(x)
    for tile_m in hl.tile(m):
        x_tile = x[tile_m, :].to(torch.float32)
        x_squared = x_tile * x_tile
        mean_x_squared = torch.mean(x_squared, dim=-1)
        inv_rms = torch.rsqrt(mean_x_squared + eps)
        normalized = x_tile * inv_rms[:, None]
        out[tile_m, :] = (normalized * weight[:].to(torch.float32)).to(out.dtype)
    return out


# Every call specializes on n - different hidden sizes mean different cache entries.
weight_4096 = torch.randn([4096], device="cuda")
weight_2048 = torch.randn([2048], device="cuda")
result1 = rms_norm_fwd(torch.randn([2048, 4096], device="cuda"), weight_4096)  # compiles for n=4096
result2 = rms_norm_fwd(torch.randn([1024, 4096], device="cuda"), weight_4096)  # reuses n=4096
result3 = rms_norm_fwd(torch.randn([2048, 2048], device="cuda"), weight_2048)  # compiles for n=2048
```

Use `hl.specialize()` when a dimension is performance-critical and you want it specialized regardless of how the kernel is called.

### `torch._dynamo.mark_static()` - External Specialization

Use `torch._dynamo.mark_static()` **before** calling the kernel to specialize dimensions on specific tensors. This is useful when you want the **same kernel** to serve both dynamic and specialized code paths:

```python
@helion.kernel(static_shapes=False)
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k2, n = y.size()
    out = torch.empty([m, n], device=x.device, dtype=x.dtype)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc.to(x.dtype)
    return out


# Dynamic call - all dimensions remain symbolic.
m, k, n = 512, 256, 128  # example sizes
x_dyn = torch.randn([m, k], device="cuda", dtype=torch.float16)
y_dyn = torch.randn([k, n], device="cuda", dtype=torch.float16)
result = matmul(x_dyn, y_dyn)

# Specialized call - mark specific dimensions as compile-time constants.
x_opt = torch.randn([64, 128], device="cuda", dtype=torch.float16)
y_opt = torch.randn([128, 56], device="cuda", dtype=torch.float16)
torch._dynamo.mark_static(x_opt, [0, -1])  # specialize dims 0 and -1 (M and K)
torch._dynamo.mark_static(y_opt, 1)  # specialize dim 1 (N)
result = matmul(x_opt, y_opt)  # generates code with 64, 128, 56 as constants
```

This pattern enables a **single kernel definition** to serve both:

- Fully dynamic fallback paths (for rare edge-case shapes)
- Optimized hot paths (with shape constants baked into the generated code)

### Combining Both APIs

The two APIs form a **union** - you can use `hl.specialize()` for dimensions that should always be specialized, and `mark_static()` for additional per-call specialization:

```python
@helion.kernel(static_shapes=False)
def fn(x: torch.Tensor) -> torch.Tensor:
    hl.specialize(x.size(0))  # dim 0 always specialized (internal)
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        out[tile] = x[tile] * 2
    return out


# mark_static on dim 1 combines with hl.specialize on dim 0.
x = torch.randn([320, 640], device="cuda")
torch._dynamo.mark_static(x, -1)  # specialize dim 1 (external)
result = fn(x)  # both 320 and 640 become constants
```

### Cache Behavior

Each unique combination of specialized dimension values creates a separate cache entry:

- Unspecialized calls share one dynamic cache entry
- Calls with `mark_static()` create entries keyed by the specialized values
- Different specialized values (e.g., `[64, 128]` vs `[48, 96]`) create separate entries
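As an illustration, under the `matmul` kernel above (with `static_shapes=False`) the call pattern below would populate the cache as described; the shapes are just examples:

```python
# Unmarked calls share one dynamic cache entry regardless of size.
matmul(torch.randn([64, 128], device="cuda", dtype=torch.float16),
       torch.randn([128, 56], device="cuda", dtype=torch.float16))
matmul(torch.randn([48, 96], device="cuda", dtype=torch.float16),
       torch.randn([96, 32], device="cuda", dtype=torch.float16))  # reuses the dynamic entry

# Marked calls create entries keyed by the specialized dimension values.
x1 = torch.randn([64, 128], device="cuda", dtype=torch.float16)
y1 = torch.randn([128, 56], device="cuda", dtype=torch.float16)
torch._dynamo.mark_static(x1, [0, -1])
torch._dynamo.mark_static(y1, 1)
matmul(x1, y1)  # entry for the specialized values 64, 128, 56

x2 = torch.randn([48, 96], device="cuda", dtype=torch.float16)
y2 = torch.randn([96, 32], device="cuda", dtype=torch.float16)
torch._dynamo.mark_static(x2, [0, -1])
torch._dynamo.mark_static(y2, 1)
matmul(x2, y2)  # different values (48, 96, 32) create a separate entry
```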
## Advanced Manual Deployment

Some teams prefer to skip all runtime selection, using Helion only as an ahead-of-time compiler. For this use case we provide `Kernel.bind` and `BoundKernel.compile_config`, enabling wrapper patterns that let you implement bespoke routing logic. For example, to route based on input size:

```python
bound = my_kernel.bind(example_inputs)
small_cfg = helion.Config.load("configs/my_kernel_small.json")
large_cfg = helion.Config.load("configs/my_kernel_large.json")
small_run = bound.compile_config(small_cfg)  # Returns a callable.
large_run = bound.compile_config(large_cfg)


def routed_my_kernel(x, y):
    runner = small_run if x.numel() <= 2**16 else large_run
    return runner(x, y)
```

`Kernel.bind` produces a `BoundKernel` tied to sample input types. You can pre-compile as many configs as you need using `BoundKernel.compile_config`.

**Warning:** `kernel.bind()` specializes, and the result will only work with the same input types you passed.

- With `static_shapes=True` (the default) the bound kernel only works for the exact shape/stride signature of the example inputs. The generated code has shapes baked in, which often provides a performance boost.
- With `static_shapes=False` it specializes on the input dtypes, device types, and whether each dynamic dimension falls into the 0, 1, or ≥2 bucket. Python types are also specialized. For dimensions that can vary across those buckets, supply representative inputs with sizes ≥2 to avoid excessive specialization. Just like the autotuning flow above, Helion records whether any tensor crosses the int32 indexing limit when `static_shapes=False`; explicitly set `index_dtype=torch.int64` if your deployment commonly exceeds that threshold, to avoid recompilation.

If you need to support multiple input types, bind multiple times with representative inputs.

Alternatively, you can export Triton source with `bound.to_triton_code(small_cfg)` to drop Helion from your serving environment altogether, embedding the generated kernel in a custom runtime. The Triton kernels could then be compiled down to PTX/cubins to further remove Python from the critical path, but the details are beyond the scope of this document.
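For instance, the generated source might be written out as part of a build step (a minimal sketch, assuming `to_triton_code` returns the Triton source as a string; file names are illustrative):

```python
# Emit standalone Triton source for each pre-tuned config.
for name, cfg in {"small": small_cfg, "large": large_cfg}.items():
    triton_source = bound.to_triton_code(cfg)
    with open(f"my_kernel_{name}_triton.py", "w") as f:
        f.write(triton_source)
```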