Autotuner Module#

The helion.autotuner module provides automatic optimization of kernel configurations.

Autotuning effort can be adjusted via helion.Settings.autotune_effort, which configures how much each algorithm explores ("none" disables autotuning, "quick" runs a smaller search, "full" uses the full search budget). Users may still override individual autotuning parameters if they need finer control.

Configuration Classes#

Config#

class helion.runtime.config.Config(*, block_sizes=None, num_threads=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, num_sm_multiplier=None, maxnreg=None, indexing=None, advanced_controls_file=None, **kwargs)[source]#
Parameters:
__init__(*, block_sizes=None, num_threads=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, num_sm_multiplier=None, maxnreg=None, indexing=None, advanced_controls_file=None, **kwargs)[source]#

Initialize a Config object.

Parameters:
  • block_sizes (list[int] | None) – Controls tile sizes for hl.tile invocations.

  • num_threads (list[int] | int | None) – Target thread count per axis (backend-specific).

  • loop_orders (list[list[int]] | None) – Permutes iteration order of tiles.

  • l2_groupings (list[int] | None) – Reorders program IDs for L2 cache locality.

  • reduction_loops (list[int | None] | None) – Configures reduction loop behavior.

  • range_unroll_factors (list[int] | None) – Loop unroll factors for tl.range calls.

  • range_warp_specializes (list[bool | None] | None) – Warp specialization for tl.range calls.

  • range_num_stages (list[int] | None) – Number of stages for tl.range calls.

  • range_multi_buffers (list[bool | None] | None) – Controls disallow_acc_multi_buffer for tl.range calls.

  • range_flattens (list[bool | None] | None) – Controls flatten parameter for tl.range calls.

  • static_ranges (list[bool] | None) – Whether to use tl.static_range instead of tl.range.

  • load_eviction_policies (list[Literal['', 'first', 'last']] | None) – Eviction policies for load operations (“”, “first”, “last”).

  • num_warps (int | None) – Number of warps per block.

  • num_stages (int | None) – Number of stages for software pipelining.

  • pid_type (Optional[Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']]) – Program ID type strategy (“flat”, “xyz”, “persistent_blocked”, “persistent_interleaved”).

  • num_sm_multiplier (Optional[Literal[1, 2, 4, 8]]) – Multiplier for the number of SMs in persistent kernels (1, 2, 4, 8). Controls multi-occupancy by launching N * num_sms thread blocks instead of just num_sms.

  • maxnreg (Optional[Literal[32, 64, 128, 256]]) – Maximum number of registers per thread (None, 32, 64, 128, 256). Lower values allow higher occupancy but may hurt performance. Used with persistent kernels to ensure multi-occupancy can be achieved.

  • indexing (Union[Literal['pointer', 'tensor_descriptor', 'block_ptr'], list[Literal['pointer', 'tensor_descriptor', 'block_ptr']], None]) –

    Indexing strategy for load and store operations. Can be:

    • A single strategy string, applied to all loads/stores: indexing=“block_ptr” (backward compatible)

    • A list of strategies, one per load/store operation (must specify all): indexing=[“pointer”, “block_ptr”, “tensor_descriptor”]

    • Empty/omitted: all loads/stores default to “pointer”

    Valid strategies: “pointer”, “tensor_descriptor”, “block_ptr”

  • advanced_controls_file (str | None) – Path to a PTXAS control file applied during compilation, or empty string for none.

  • **kwargs (object) – Additional user-defined configuration parameters.

  • flatten_loops (list[bool] | None) –

config: dict[str, object]#
to_json()[source]#

Convert the config to a JSON string.

Return type:

str

classmethod from_dict(config_dict)[source]#

Create a Config from a plain dictionary.

Parameters:

config_dict (Mapping[str, object]) –

Return type:

Config

classmethod from_json(json_str)[source]#

Create a Config object from a JSON string.

Parameters:

json_str (str) –

Return type:

Config

minimize(config_spec)[source]#

Return a new Config with values matching effective defaults removed.

This produces a minimal config representation by removing any values that match what the config_spec would use as defaults.

Parameters:

config_spec (object) – The ConfigSpec that defines the defaults for this kernel.

Return type:

Config

Returns:

A new Config with default values removed.

save(path)[source]#

Save the config to a JSON file.

Parameters:

path (str | Path) –

Return type:

None

classmethod load(path)[source]#

Load a config from a JSON file.

Parameters:

path (str | Path) –

Return type:

Config

property block_sizes: list[int]#
property loop_orders: list[list[int]]#
property num_threads: list[int]#
property flatten_loops: list[bool]#
property reduction_loops: list[int | None]#
property num_warps: int#
property num_stages: int#
property l2_groupings: list[int]#
property pid_type: Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']#
property num_sm_multiplier: int#
property maxnreg: int | None#
property range_unroll_factors: list[int]#
property advanced_controls_file: str#
property range_warp_specializes: list[bool | None]#
property range_num_stages: list[int]#
property range_multi_buffers: list[bool | None]#
property range_flattens: list[bool | None]#
property static_ranges: list[bool]#
property load_eviction_policies: list[Literal['', 'first', 'last']]#
property indexing: Literal['pointer', 'tensor_descriptor', 'block_ptr'] | list[Literal['pointer', 'tensor_descriptor', 'block_ptr']]#

Search Algorithms#

The autotuner supports multiple search strategies:

LFBO Tree Search (Default)#

LFBOTreeSearch is the default autotuner. It extends LFBO Pattern Search with tree-guided neighbor generation, using greedy decision tree traversal to focus search on parameters the surrogate model has identified as important.

class helion.autotuner.surrogate_pattern_search.LFBOTreeSearch(kernel, args, *, num_neighbors=200, frac_selected=0.1, radius=2, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

Bases: LFBOPatternSearch

LFBO Tree Search: Likelihood-Free Bayesian Optimization with tree-guided neighbor generation.

This algorithm uses a Random Forest classifier as a surrogate model to both select which configurations to benchmark and to guide the generation of new candidate configurations via greedy decision tree traversal.

Algorithm Overview:
  1. Generate an initial population (random or default) and benchmark all configurations

  2. Fit a Random Forest classifier to predict “good” vs “bad” configurations:
     • Configs with performance < quantile threshold are labeled “good” (class 1)
     • Configs with performance >= quantile threshold are labeled “bad” (class 0)
     • Weighted classification emphasizes configs that are much better than the threshold

  3. For the first generation, generate neighbors via random perturbation since the surrogate is not yet fitted

  4. For subsequent generations, generate neighbors via greedy tree traversal:

     a. For each of num_neighbors trials:

        • Pick a random decision tree from the Random Forest

        • Trace the decision path for the current best config through that tree

        • Extract the configuration parameters used in the tree’s split decisions

        • For each parameter on the path, greedily optimize it:
          • Generate pattern_neighbors within the configured radius

          • Score candidates using the single tree’s predicted probability

          • Accept the best value (ties broken randomly) and incrementally update the encoded representation

        • Keep the result only if it differs from the base configuration

     b. Score candidates using the full ensemble’s predicted probability with a diversity-aware similarity penalty, then select top candidates

  5. Benchmark selected candidates, retrain the classifier on all observed data

The tree-guided traversal focuses search on parameters the surrogate has identified as important (those used in tree splits). Using a single tree per trial (rather than the full ensemble) introduces diversity since different trees may emphasize different parameters.

References:
  • Song, J., et al. (2022). “A General Recipe for Likelihood-free Bayesian Optimization.”
  • Mišić, Velibor V. (2020). “Optimization of Tree Ensembles.” Operations Research 68(5): 1605–1624.
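The good/bad labeling rule in step 2 can be sketched without Helion. This is a toy illustration of the quantile cutoff, not the library's implementation:

```python
# Toy sketch of the quantile labeling rule: timings at or below the
# `quantile` cutoff are labeled "good" (class 1), the rest "bad" (class 0).
def label_configs(perfs: list[float], quantile: float = 0.1) -> list[int]:
    ordered = sorted(perfs)
    cutoff_index = max(0, int(len(ordered) * quantile) - 1)
    cutoff = ordered[cutoff_index]
    return [1 if p <= cutoff else 0 for p in perfs]

# With quantile=0.1, only the fastest of these ten timings is "good".
perfs = [1.0, 0.5, 2.0, 0.4, 3.0, 0.9, 1.5, 0.45, 2.5, 1.2]
labels = label_configs(perfs)
```

Lower quantile values make the "good" class more selective, which is exactly the trade-off described for the quantile parameter below.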

Parameters:
  • kernel (_AutotunableKernel) – The kernel to be autotuned.

  • args (Sequence[object]) – The arguments to be passed to the kernel during benchmarking.

  • initial_population (int) – Number of random configurations in initial population. Default from PATTERN_SEARCH_DEFAULTS. Ignored when using DEFAULT strategy.

  • copies (int) – Number of top configurations to run pattern search from. Default from PATTERN_SEARCH_DEFAULTS.

  • max_generations (int) – Maximum number of search iterations per copy. Default from PATTERN_SEARCH_DEFAULTS.

  • min_improvement_delta (float) – Early stopping threshold. Search stops if the relative improvement abs(best/current - 1) < min_improvement_delta. Default: 0.001 (0.1% improvement threshold).

  • frac_selected (float) – Fraction of generated neighbors to actually benchmark, after filtering by classifier score. Range: (0, 1]. Lower values reduce benchmarking cost but may miss good configurations. Default: 0.1.

  • num_neighbors (int) – Number of greedy tree traversal trials to run per generation. Each trial picks a random tree, traces its decision path, and greedily optimizes parameters along that path. Default: 200.

  • radius (int) – Maximum perturbation distance when generating pattern neighbors for each parameter during tree traversal. For power-of-two parameters, this is the max change in log2 space. For other parameters, this limits the neighborhood size. Default: 2.

  • quantile (float) – Threshold for labeling configs as “good” (class 1) vs “bad” (class 0). Configs with performance below this quantile are labeled as good. Range: (0, 1). Lower values create a more selective definition of “good”. Default: 0.1 (top 10% are considered good).

  • patience (int) – Number of generations without improvement before stopping the search copy. Default: 1.

  • similarity_penalty (float) – Penalty for selecting points that are similar to points already selected in the batch. Default: 1.0.

  • initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_DEFAULT starts from only the default configuration. Can be overridden by HELION_AUTOTUNER_INITIAL_POPULATION env var (“from_random” or “from_default”).

  • compile_timeout_lower_bound (float) –

  • compile_timeout_quantile (float) –
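As noted above, the initial-population strategy can also be switched through the environment, without code changes:

```shell
# Start the search from the default config rather than a random population.
export HELION_AUTOTUNER_INITIAL_POPULATION=from_default
```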

__init__(kernel, args, *, num_neighbors=200, frac_selected=0.1, radius=2, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

Create an LFBOTreeSearch autotuner.

Parameters:
  • kernel (_AutotunableKernel) – The kernel to be autotuned.

  • args (Sequence[object]) – The arguments to be passed to the kernel.

  • initial_population (int) – The number of random configurations to generate for the initial population. When using FROM_DEFAULT strategy, this is ignored (always 1).

  • copies (int) – Count of top Configs to run pattern search on.

  • max_generations (int) – The maximum number of generations to run.

  • min_improvement_delta (float) – Relative stop threshold; stop if abs(best/current - 1) < this.

  • initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_DEFAULT starts from only the default configuration. Can be overridden by HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.

  • compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.

  • compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.

  • num_neighbors (int) –

  • frac_selected (float) –

  • radius (int) –

  • quantile (float) –

  • patience (int) –

  • similarity_penalty (float) –

DE Surrogate Hybrid#

Differential Evolution with Surrogate-Assisted Selection (DE-SAS).

This hybrid approach combines the robust exploration of Differential Evolution with the sample efficiency of surrogate models. It is designed to outperform standard DE by making smarter decisions about which candidates to evaluate.

Key idea:
  • Use DE’s mutation/crossover to generate candidates (good exploration)
  • Use a Random Forest surrogate to predict which candidates are promising
  • Only evaluate the most promising candidates (sample efficiency)
  • Periodically re-fit the surrogate model
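The selection step above can be sketched with a toy surrogate (illustrative only; the names here are not Helion's API):

```python
import random

# Toy surrogate-assisted selection: generate candidate_ratio times as
# many candidates as there are evaluation slots, score them with a cheap
# surrogate (lower predicted time is better), and benchmark only the
# most promising few.
def select_for_benchmark(candidates, surrogate_score, slots):
    ranked = sorted(candidates, key=surrogate_score)
    return ranked[:slots]

random.seed(0)
candidates = [random.uniform(0.0, 10.0) for _ in range(12)]  # 3x over 4 slots
chosen = select_for_benchmark(candidates, surrogate_score=lambda c: c, slots=4)
```

The expensive benchmarking budget is then spent only on `chosen`, which is where the claimed sample-efficiency gain comes from.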

This is inspired by recent work on surrogate-assisted evolutionary algorithms, which have shown 2-5× speedups over standard EAs on expensive optimization problems.

References:
  • Jin, Y. (2011). “Surrogate-assisted evolutionary computation: Recent advances and future challenges.”
  • Sun, C., et al. (2019). “A surrogate-assisted DE with an adaptive local search.”

Author: Francisco Geiman Thiesen
Date: 2025-11-05

class helion.autotuner.de_surrogate_hybrid.DESurrogateHybrid(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, surrogate_threshold=100, candidate_ratio=3, refit_frequency=5, n_estimators=50, min_improvement_delta=0.001, patience=3, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

Hybrid Differential Evolution with Surrogate-Assisted Selection.

This algorithm uses DE for exploration but adds a surrogate model to intelligently select which candidates to actually evaluate, avoiding wasting evaluations on poor candidates.

Parameters:
  • kernel (_AutotunableKernel) – The bound kernel to tune

  • args (Sequence[object]) – Arguments for the kernel

  • population_size (int) – Size of the DE population (default: 40)

  • max_generations (int) – Maximum number of generations (default: 40)

  • crossover_rate (float) – Crossover probability (default: 0.8)

  • surrogate_threshold (int) – Use surrogate after this many evaluations (default: 100)

  • candidate_ratio (int) – Generate this many candidates per population slot; only the most promising are evaluated (default: 3)

  • refit_frequency (int) – Refit surrogate every N generations (default: 5)

  • n_estimators (int) – Number of trees in Random Forest (default: 50)

  • min_improvement_delta (float) – Relative improvement threshold for early stopping. Default: 0.001 (0.1%). Early stopping enabled by default.

  • patience (int) – Number of generations without improvement before stopping. Default: 3. Early stopping enabled by default.

  • initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_DEFAULT starts from the default configuration. Can be overridden by HELION_AUTOTUNER_INITIAL_POPULATION env var. If not set via env var and None is passed, defaults to FROM_RANDOM.

  • compile_timeout_lower_bound (float) –

  • compile_timeout_quantile (float) –

__init__(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, surrogate_threshold=100, candidate_ratio=3, refit_frequency=5, n_estimators=50, min_improvement_delta=0.001, patience=3, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

Create a DESurrogateHybrid autotuner.

Parameters:
  • kernel (_AutotunableKernel) – The kernel to be autotuned.

  • args (Sequence[object]) – The arguments to be passed to the kernel.

  • population_size (int) – The size of the population.

  • max_generations (int) – The maximum number of generations to run.

  • crossover_rate (float) – The crossover rate for mutation.

  • immediate_update – Whether to update population immediately after each evaluation.

  • min_improvement_delta (float) – Relative improvement threshold for early stopping. Default: 0.001 (0.1%).

  • patience (int) – Number of generations without improvement before stopping. Default: 3.

  • initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_DEFAULT starts from the default configuration (repeated). FROM_BEST_AVAILABLE uses best configs from prior runs, fills remainder randomly. Can be overridden by HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.

  • compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.

  • compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.

  • surrogate_threshold (int) –

  • candidate_ratio (int) –

  • refit_frequency (int) –

  • n_estimators (int) –

Differential Evolution#

class helion.autotuner.differential_evolution.DifferentialEvolutionSearch(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

A search strategy that uses differential evolution to find the best config.

Parameters:
__init__(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#

Create a DifferentialEvolutionSearch autotuner.

Parameters:
  • kernel (_AutotunableKernel) – The kernel to be autotuned.

  • args (Sequence[object]) – The arguments to be passed to the kernel.

  • population_size (int) – The size of the population.

  • max_generations (int) – The maximum number of generations to run.

  • crossover_rate (float) – The crossover rate for mutation.

  • immediate_update (bool | None) – Whether to update population immediately after each evaluation.

  • min_improvement_delta (float | None) – Relative improvement threshold for early stopping. If None (default), early stopping is disabled.

  • patience (int | None) – Number of generations without improvement before stopping. If None (default), early stopping is disabled.

  • initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_DEFAULT starts from the default configuration (repeated). FROM_BEST_AVAILABLE uses best configs from prior runs, fills remainder randomly. Can be overridden by HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.

  • compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.

  • compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.

mutate(x_index)[source]#
Parameters:

x_index (int) –

Return type:

list[object]

initial_two_generations()[source]#
Return type:

None

iter_candidates()[source]#
Return type:

Iterator[tuple[int, PopulationMember]]

evolve_population()[source]#
Return type:

int

check_early_stopping()[source]#

Check if early stopping criteria are met and update state.

This method updates best_perf_history and generations_without_improvement, and returns whether the optimization should stop.

Return type:

bool

Returns:

True if optimization should stop early, False otherwise.
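The stopping rule can be sketched as follows (a toy illustration of the documented criterion, not Helion's exact implementation): stop once `patience` consecutive generations fail to improve the best performance by more than the relative threshold `min_improvement_delta`.

```python
# Toy early-stopping check over a history of best-performance values,
# one entry per generation (lower is better).
def should_stop(best_history, min_improvement_delta=0.001, patience=3):
    stale = 0
    for prev, cur in zip(best_history, best_history[1:]):
        if abs(cur / prev - 1) < min_improvement_delta:
            stale += 1
            if stale >= patience:
                return True
        else:
            stale = 0
    return False

# Best time per generation stalls after generation 1, tripping patience=3.
stalled = should_stop([10.0, 8.0, 7.9999, 7.9999, 7.9999])
```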

Local Cache#

helion.autotuner.local_cache.get_helion_cache_dir()[source]#

Return the root directory for all Helion caches.

Return type:

Path

helion.autotuner.local_cache.helion_triton_cache_dir(device_index)[source]#

Return per-device Triton cache directory under Helion’s cache root.

Parameters:

device_index (int) –

Return type:

str

class helion.autotuner.local_cache.SavedBestConfig(hardware, specialization_key, config, config_spec_hash, flat_config)[source]#

A parsed cache entry from a .best_config file.

Parameters:
hardware: str#
specialization_key: str#
config: Config#
config_spec_hash: str#
flat_config: tuple[object, ...] | None#
to_mutable_flat_config()[source]#

Return the stored flat_config as a mutable list.

Return type:

list[object]

__init__(hardware, specialization_key, config, config_spec_hash, flat_config)#
Parameters:
helion.autotuner.local_cache.iter_cache_entries(cache_path, *, max_scan=None)[source]#

Yield parsed cache entries from cache_path, newest first.

Corrupt or unparsable files are skipped with a warning.

Parameters:
Return type:

Iterator[SavedBestConfig]

class helion.autotuner.local_cache.LocalAutotuneCache(autotuner)[source]#

This class implements the local autotune cache, storing the best-config artifact on the local file system, by default under torch’s cache directory or at a user-specified HELION_CACHE_DIR directory. It uses the LooseAutotuneCacheKey implementation for the cache key, which accounts for device and source-code properties but not for library-level code changes such as Triton, Helion, or PyTorch. Use StrictLocalAutotuneCache to also account for these properties.

Parameters:

autotuner (BaseSearch) –
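The cache location can be redirected with the HELION_CACHE_DIR environment variable mentioned above:

```shell
# Store best-config artifacts under a custom directory instead of
# torch's default cache location.
export HELION_CACHE_DIR=/tmp/helion-cache
```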

__init__(autotuner)[source]#
Parameters:

autotuner (BaseSearch) –

get()[source]#
Return type:

Config | None

put(config)[source]#
Parameters:

config (Config) –

Return type:

None

class helion.autotuner.local_cache.StrictLocalAutotuneCache(autotuner)[source]#

Stricter implementation of the local autotune cache, which also takes into account library-level code changes such as Triton, Helion, or PyTorch.

Parameters:

autotuner (BaseSearch) –