Deployment and Autotuning#
Helion’s autotuner explores a large search space, which is time-consuming, so production deployments should generate
autotuned configs ahead of time. Run the autotuner on a development
workstation or a dedicated tuning box that mirrors your target
GPU/accelerator. Check tuned configs into your repository alongside the kernel,
or package them as data files and load them with helion.Config.load
(see Config). This keeps production kernel startup fast and
deterministic, while also giving explicit control over when autotuning
happens.
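For example, here is a minimal sketch of the packaged-data-file approach, assuming the tuned JSON ships alongside the module that defines the kernel (the configs/ directory name here is hypothetical):
from pathlib import Path

import helion

# Hypothetical layout: the tuned config is packaged next to this module as a data file.
best = helion.Config.load(str(Path(__file__).parent / "configs" / "my_kernel.json"))
The loaded config can then be passed to the kernel decorator, as covered under Deploy a Single Config below.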
If you don’t specify pre-tuned configs, Helion will autotune on the
first call for each specialization key. This is convenient for
experimentation, but not ideal for production since the first call
pays a large tuning cost. Helion writes successful tuning results to
an on-disk cache (overridable with HELION_CACHE_DIR, skippable
with HELION_SKIP_CACHE, see Settings) so repeated
runs on the same machine can reuse prior configs. For more on
caching see LocalAutotuneCache
and StrictLocalAutotuneCache.
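As a rough sketch, the cache can be steered through these environment variables before the kernel’s first call (the specific values shown are assumptions; see Settings for the exact semantics):
import os

# Assumption: HELION_CACHE_DIR takes a directory path and HELION_SKIP_CACHE
# treats "1" as enabled; set them before the kernel is first invoked.
os.environ["HELION_CACHE_DIR"] = "/var/cache/helion"  # hypothetical location
# os.environ["HELION_SKIP_CACHE"] = "1"               # disable cache reuse entirely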
The rest of this document covers strategies for pre-tuning and deploying tuned configs, which is the recommended approach for production workloads.
Run Autotuning Jobs#
The simplest way to launch autotuning is directly through the kernel call:
import torch
import helion

@helion.kernel()
def my_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    ...

example_inputs = (
    torch.randn(1048576, device="cuda"),
    torch.randn(1048576, device="cuda"),
)

# First call triggers autotuning, which is cached for future calls,
# and prints the best config found.
my_kernel(*example_inputs)
Set HELION_FORCE_AUTOTUNE=1 to re-run tuning even when cached configs
exist (documented in Settings).
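A minimal sketch, assuming the variable treats "1" as enabled and is read when the kernel is called:
import os

os.environ["HELION_FORCE_AUTOTUNE"] = "1"  # assumption: "1" enables forced re-tuning
my_kernel(*example_inputs)                 # re-tunes even though a cached config exists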
Call my_kernel.autotune(example_inputs) explicitly to separate
tuning from execution (see Kernel).
autotune() returns the best config found, which you can save for
later use. To tune against multiple sizes, invoke autotune once per
representative input set, for example:
datasets = {
    "s": (
        torch.randn(2**16, device="cuda"),
        torch.randn(2**16, device="cuda"),
    ),
    "m": (
        torch.randn(2**20, device="cuda"),
        torch.randn(2**20, device="cuda"),
    ),
    "l": (
        torch.randn(2**24, device="cuda"),
        torch.randn(2**24, device="cuda"),
    ),
}

for tag, args in datasets.items():
    config = my_kernel.autotune(args)
    config.save(f"configs/my_kernel_{tag}.json")
Direct Control Over Autotuners#
When you need more control, construct autotuners
manually. PatternSearch is the default
autotuner:
from helion.autotuner import PatternSearch

bound = my_kernel.bind(example_inputs)
tuner = PatternSearch(
    bound,
    example_inputs,
    # Double the defaults to explore more candidates:
    initial_population=200,  # Default is 100.
    copies=10,  # Default is 5.
    max_generations=40,  # Default is 20.
)
best_config = tuner.autotune()
best_config.save("configs/my_kernel.json")
- Adjust initial_population, copies, or max_generations to trade tuning time versus coverage, or try different search algorithms.
- Use different input tuples to produce multiple saved configs (my_kernel_large.json, my_kernel_fp8.json, etc.), as in the sketch after this list.
- Tuning runs can be seeded with HELION_AUTOTUNE_RANDOM_SEED if you need more reproducibility; see Settings. Note this only affects which configs are tried, not the timing results.
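A rough sketch of that multi-config workflow, reusing PatternSearch per input tuple (the file name and seed value are illustrative, and the exact format HELION_AUTOTUNE_RANDOM_SEED accepts is an assumption):
import os

# Assumption: the seed is read as an integer-valued string.
os.environ["HELION_AUTOTUNE_RANDOM_SEED"] = "0"

large_inputs = (
    torch.randn(2**24, device="cuda"),
    torch.randn(2**24, device="cuda"),
)
large_tuner = PatternSearch(my_kernel.bind(large_inputs), large_inputs)
large_tuner.autotune().save("configs/my_kernel_large.json")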
Deploy a Single Config#
If one configuration wins for every production call, bake it into the decorator:
best = helion.Config.load("configs/my_kernel.json")

@helion.kernel(config=best)
def my_kernel(x, y):
    ...
The supplied config applies to all argument shapes, dtypes, and
devices that hit this kernel. This is ideal for workloads with a single
critical path or when you manage routing externally. helion.Config.save
/ load make it easy to copy configs between machines; details live
in Config. You can also copy and paste the config from the
autotuner’s printed output.
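As a sketch of the copy-and-paste route, a config can be constructed inline; the fields below (block_sizes, num_warps, num_stages) are an assumed illustration of what the autotuner might print, not your kernel’s actual best config:
# Illustrative values only; paste the fields your autotuner run actually reports.
pasted = helion.Config(block_sizes=[1024], num_warps=4, num_stages=3)

@helion.kernel(config=pasted)
def my_kernel(x, y):
    ...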
Deploy Multiple Configs#
When you expect variability, supply a small list of candidates:
candidate_configs = [
    helion.Config.load("configs/my_kernel_small.json"),
    helion.Config.load("configs/my_kernel_large.json"),
]

@helion.kernel(configs=candidate_configs, static_shapes=True)
def my_kernel(x, y):
    ...
Helion performs a lightweight benchmark (similar to Triton’s autotune) the first time each specialization key is seen, running each provided config and selecting the fastest.
A key detail here is controlling the specialization key, which determines when to re-benchmark. Options include:
- Default (static_shapes=True): Helion shape-specializes on the exact shape/stride signature, rerunning the selection whenever those shapes differ. This delivers the best per-shape performance but requires all calls to match the example shapes exactly.
- static_shapes=False: switch to bucketed dynamic shapes. Helion reuses results as long as tensor dtypes and device types stay constant. Shape changes only trigger a re-selection when a dimension size crosses the buckets {0, 1, ≥2}. Use this when you need one compiled kernel to handle many input sizes.
- Custom keys: pass key= to group calls however you like. This custom key is in addition to the above.
As an example, you could trigger re-benchmarking with power-of-two bucketing:
@helion.kernel(
    configs=candidate_configs,
    key=lambda x, y: helion.next_power_of_2(x.numel()),
    static_shapes=False,
)
def my_kernel(x, y):
    ...
See Kernel for the full decorator reference.
Advanced Manual Deployment#
Some teams prefer to skip all runtime selection, using Helion only as
an ahead-of-time compiler. For this use case we provide Kernel.bind
and BoundKernel.compile_config, enabling wrapper patterns that let
you implement bespoke routing logic. For example, to route based on
input size:
bound = my_kernel.bind(example_inputs)

small_cfg = helion.Config.load("configs/my_kernel_small.json")
large_cfg = helion.Config.load("configs/my_kernel_large.json")

small_run = bound.compile_config(small_cfg)  # Returns a callable
large_run = bound.compile_config(large_cfg)

def routed_my_kernel(x, y):
    runner = small_run if x.numel() <= 2**16 else large_run
    return runner(x, y)
Kernel.bind produces a BoundKernel tied to sample
input types. You can pre-compile as many configs as you need using
BoundKernel.compile_config. Warning: kernel.bind() specializes,
and the result will only work with the same input types you passed.
- With static_shapes=True (the default) the bound kernel only works for the exact shape/stride signature of the example inputs. The generated code has shapes baked in, which often provides a performance boost.
- With static_shapes=False it specializes on the input dtypes, device types, and whether each dynamic dimension falls into the 0, 1, or ≥2 bucket. Python types are also specialized. For dimensions that can vary across those buckets, supply representative inputs ≥2 to avoid excessive specialization.
If you need to support multiple input types, bind multiple times with representative inputs.
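A minimal sketch of that pattern, assuming the same kernel serves both float16 and float32 inputs (the dtype-based routing and config reuse here are illustrative):
fp32_inputs = example_inputs  # assumed float32 examples
fp16_inputs = tuple(t.to(torch.float16) for t in example_inputs)

cfg = helion.Config.load("configs/my_kernel.json")
run_fp32 = my_kernel.bind(fp32_inputs).compile_config(cfg)
run_fp16 = my_kernel.bind(fp16_inputs).compile_config(cfg)

def dtype_routed_my_kernel(x, y):
    # Each bound kernel only accepts the input types it was bound with.
    return (run_fp16 if x.dtype == torch.float16 else run_fp32)(x, y)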
Alternatively, you can export Triton source with
bound.to_triton_code(small_cfg) to drop Helion from your serving
environment altogether, embedding the generated kernel in a custom
runtime. The Triton kernels could then be compiled down into PTX/cubins
to further remove Python from the critical path, but details on this
are beyond the scope of this document.
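For example, a small sketch that writes the generated source to a file, assuming to_triton_code returns the Triton source as a string:
from pathlib import Path

triton_src = bound.to_triton_code(small_cfg)  # assumption: returns Python source text
Path("my_kernel_triton.py").write_text(triton_src)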