Autotuner Module#
The helion.autotuner module provides automatic optimization of kernel configurations.
Autotuning effort can be adjusted via helion.Settings.autotune_effort, which configures how much each algorithm explores ("none" disables autotuning, "quick" runs a smaller search, "full" uses the full search budget). Users may still override individual autotuning parameters if they need finer control.
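For example, the effort level can be set per kernel. A minimal sketch, assuming Settings keywords such as autotune_effort are forwarded through the helion.kernel decorator:

```python
import torch
import helion
import helion.language as hl

# Sketch: autotune_effort is a Settings field; passing it as a decorator
# keyword is assumed here.
@helion.kernel(autotune_effort="quick")
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```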
Configuration Classes#
Config#
- class helion.runtime.config.Config(*, block_sizes=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, indexing=None, **kwargs)[source]#
- Parameters:
  - load_eviction_policies (list[Literal['', 'first', 'last']] | None)
  - pid_type (Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved'] | None)
  - indexing (Literal['pointer', 'tensor_descriptor', 'block_ptr'] | list[Literal['pointer', 'tensor_descriptor', 'block_ptr']] | None)
  - kwargs (object)
- __init__(*, block_sizes=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, indexing=None, **kwargs)[source]#
Initialize a Config object.
- Parameters:
  - block_sizes (list[int] | None) – Controls tile sizes for hl.tile invocations.
  - loop_orders (list[list[int]] | None) – Permutes the iteration order of tiles.
  - l2_groupings (list[int] | None) – Reorders program IDs for L2 cache locality.
  - reduction_loops (list[int | None] | None) – Configures reduction loop behavior.
  - range_unroll_factors (list[int] | None) – Loop unroll factors for tl.range calls.
  - range_warp_specializes (list[bool | None] | None) – Warp specialization for tl.range calls.
  - range_num_stages (list[int] | None) – Number of stages for tl.range calls.
  - range_multi_buffers (list[bool | None] | None) – Controls disallow_acc_multi_buffer for tl.range calls.
  - range_flattens (list[bool | None] | None) – Controls the flatten parameter for tl.range calls.
  - static_ranges (list[bool] | None) – Whether to use tl.static_range instead of tl.range.
  - load_eviction_policies (list[Literal['', 'first', 'last']] | None) – Eviction policies for load operations (“”, “first”, “last”).
  - num_stages (int | None) – Number of stages for software pipelining.
  - pid_type (Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved'] | None) – Program ID type strategy (“flat”, “xyz”, “persistent_blocked”, “persistent_interleaved”).
  - indexing (Literal['pointer', 'tensor_descriptor', 'block_ptr'] | list[Literal['pointer', 'tensor_descriptor', 'block_ptr']] | None) – Indexing strategy for load and store operations. Can be:
    - a single strategy string, applied to all loads/stores: indexing="block_ptr" (backward compatible);
    - a list of strategies, one per load/store operation (must specify all): indexing=["pointer", "block_ptr", "tensor_descriptor"];
    - empty/omitted, in which case all loads/stores default to "pointer".
    Valid strategies: “pointer”, “tensor_descriptor”, “block_ptr”.
  - **kwargs (object) – Additional user-defined configuration parameters.
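A sketch of constructing an explicit Config from the parameters documented above; the values are illustrative rather than tuned, and passing the result via @helion.kernel(config=...) is assumed to pin the kernel to this configuration:

```python
from helion.runtime.config import Config

config = Config(
    block_sizes=[64, 32],       # tile sizes for the hl.tile loops
    loop_orders=[[1, 0]],       # permute the iteration order of the first tile loop
    num_warps=8,
    num_stages=4,
    pid_type="persistent_blocked",
    indexing="block_ptr",       # one strategy applied to all loads/stores
)
```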
Search Algorithms#
The autotuner supports multiple search strategies:
Pattern Search#
- class helion.autotuner.pattern_search.InitialPopulationStrategy(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Strategy for generating the initial population for search algorithms.
- FROM_RANDOM = 'from_random'#
Generate a random population of configurations.
- FROM_DEFAULT = 'from_default'#
Start from only the default configuration.
- class helion.autotuner.pattern_search.PatternSearch(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, initial_population_strategy=None)[source]#
Search that explores single-parameter perturbations around the current best.
- __init__(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, initial_population_strategy=None)[source]#
Create a PatternSearch autotuner.
- Parameters:
  - kernel (BoundKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - initial_population (int) – The number of random configurations to generate for the initial population. Ignored when using the FROM_DEFAULT strategy (always 1).
  - copies (int) – The number of top Configs to run pattern search on.
  - max_generations (int) – The maximum number of generations to run.
  - min_improvement_delta (float) – Relative stopping threshold; the search stops when abs(best/current - 1) < this value.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs; FROM_DEFAULT starts from only the default configuration. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION environment variable (handled in default_autotuner_fn). Defaults to FROM_RANDOM when None.
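A usage sketch, assuming the add kernel defined earlier, that Kernel.bind(args) returns the BoundKernel these searches expect, and that autotune() is the entry point that runs the search and returns the best Config:

```python
import torch
from helion.autotuner.pattern_search import InitialPopulationStrategy, PatternSearch

args = (torch.randn(1024, device="cuda"), torch.randn(1024, device="cuda"))
bound = add.bind(args)  # assumed: binds the decorated kernel to these arguments

search = PatternSearch(
    bound,
    args,
    initial_population=50,
    copies=3,
    initial_population_strategy=InitialPopulationStrategy.FROM_DEFAULT,
)
best_config = search.autotune()  # assumed entry point
```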
LFBO Pattern Search#
- class helion.autotuner.surrogate_pattern_search.LFBOPatternSearch(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, frac_selected=0.1, num_neighbors=300, radius=2, quantile=0.1, patience=1, initial_population_strategy=None)[source]#
Likelihood-Free Bayesian Optimization (LFBO) Pattern Search.
This algorithm enhances PatternSearch by using a Random Forest classifier as a surrogate model to select which configurations to benchmark, reducing the number of kernel compilations and runs needed to find optimal configurations.
- Algorithm Overview:
  1. Generate an initial population (random or default) and benchmark all of its configurations.
  2. Fit a Random Forest classifier to predict “good” vs. “bad” configurations:
     - configs with performance below the quantile threshold are labeled “good” (class 1);
     - configs with performance at or above the quantile threshold are labeled “bad” (class 0);
     - weighted classification emphasizes configs that are much better than the threshold.
  3. For each generation:
     - generate random neighbors around the current best configurations;
     - score all neighbors using the classifier’s predicted probability of being “good”;
     - benchmark only the top frac_selected fraction of neighbors;
     - retrain the classifier on all observed data (not incrementally);
     - update search trajectories based on the new results.
The weighted classification model learns to identify configs that maximize the expected improvement over the current best config. Because the method is based on classification rather than on fitting a surrogate to the raw config performances, it can also learn from configs that time out or produce unacceptable accuracy (see the sketch below).
References: Song, J., et al. (2022). “A General Recipe for Likelihood-free Bayesian Optimization.”
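The quantile labeling and improvement weighting can be sketched as follows. This is a standalone illustration, not Helion’s internal code; the numpy and scikit-learn usage is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_surrogate(features: np.ndarray, perfs: np.ndarray, quantile: float = 0.1):
    """Fit a classifier that scores configs by their chance of beating tau."""
    tau = np.quantile(perfs, quantile)   # performance threshold (lower is better)
    labels = (perfs < tau).astype(int)   # class 1 = "good", class 0 = "bad"
    # Weight "good" configs by how far they beat the threshold, so the model
    # emphasizes configs that are much better than tau.
    weights = np.where(labels == 1, tau - perfs, 1.0)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features, labels, sample_weight=weights)
    return clf  # clf.predict_proba(neighbors)[:, 1] ranks candidate configs
```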
- Parameters:
  - kernel (BoundKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel during benchmarking.
  - initial_population (int) – Number of random configurations in the initial population. Default from PATTERN_SEARCH_DEFAULTS. Ignored when using the FROM_DEFAULT strategy.
  - copies (int) – Number of top configurations to run pattern search from. Default from PATTERN_SEARCH_DEFAULTS.
  - max_generations (int) – Maximum number of search iterations per copy. Default from PATTERN_SEARCH_DEFAULTS.
  - min_improvement_delta (float) – Early stopping threshold. The search stops if the relative improvement abs(best/current - 1) < min_improvement_delta. Default: 0.001 (0.1% improvement threshold).
  - frac_selected (float) – Fraction of generated neighbors to actually benchmark, after filtering by classifier score. Range: (0, 1]. Lower values reduce benchmarking cost but may miss good configurations. Default: 0.1.
  - num_neighbors (int) – Number of random neighbor configurations to generate around each search point per generation. Default: 300.
  - radius (int) – Maximum perturbation distance in configuration space. For power-of-two parameters, this is the maximum change in log2 space. For other parameters, this limits how many parameters can be changed. Default: 2.
  - quantile (float) – Threshold for labeling configs as “good” (class 1) vs. “bad” (class 0). Configs with performance below this quantile are labeled good. Range: (0, 1). Lower values create a more selective definition of “good”. Default: 0.1 (the top 10% are considered good).
  - patience (int) – Number of generations without improvement before stopping the search copy. Default: 1.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs; FROM_DEFAULT starts from only the default configuration. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION environment variable (“from_random” or “from_default”).
- __init__(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, frac_selected=0.1, num_neighbors=300, radius=2, quantile=0.1, patience=1, initial_population_strategy=None)[source]#
Create an LFBOPatternSearch autotuner.
- Parameters:
  - kernel (BoundKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - initial_population (int) – The number of random configurations to generate for the initial population. Ignored when using the FROM_DEFAULT strategy (always 1).
  - copies (int) – The number of top Configs to run pattern search on.
  - max_generations (int) – The maximum number of generations to run.
  - min_improvement_delta (float) – Relative stopping threshold; the search stops when abs(best/current - 1) < this value.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs; FROM_DEFAULT starts from only the default configuration. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION environment variable (handled in default_autotuner_fn). Defaults to FROM_RANDOM when None.
  - frac_selected (float), num_neighbors (int), radius (int), quantile (float), patience (int) – See the class-level parameter descriptions above.
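A usage sketch, reusing bound and args from the PatternSearch example and the same assumed autotune() entry point:

```python
from helion.autotuner.surrogate_pattern_search import LFBOPatternSearch

search = LFBOPatternSearch(
    bound,
    args,
    num_neighbors=300,
    frac_selected=0.1,  # benchmark only the top 10% of classifier-scored neighbors
    quantile=0.1,
    patience=1,
)
best_config = search.autotune()  # assumed entry point
```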
Differential Evolution#
- class helion.autotuner.differential_evolution.DifferentialEvolutionSearch(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None)[source]#
A search strategy that uses differential evolution to find the best config.
- __init__(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None)[source]#
Create a DifferentialEvolutionSearch autotuner.
- Parameters:
  - kernel (BoundKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - population_size (int) – The size of the population.
  - max_generations (int) – The maximum number of generations to run.
  - crossover_rate (float) – The crossover rate for mutation.
  - immediate_update (bool | None) – Whether to update the population immediately after each evaluation.
  - min_improvement_delta (float | None) – Relative improvement threshold for early stopping. If None (the default), early stopping is disabled.
  - patience (int | None) – Number of generations without improvement before stopping. If None (the default), early stopping is disabled.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population; FROM_DEFAULT starts from the default configuration (repeated). Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION environment variable (handled in default_autotuner_fn). Defaults to FROM_RANDOM when None.
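A usage sketch with early stopping enabled (both min_improvement_delta and patience must be set), under the same assumptions as the examples above:

```python
from helion.autotuner.differential_evolution import DifferentialEvolutionSearch

search = DifferentialEvolutionSearch(
    bound,
    args,
    population_size=40,
    max_generations=40,
    crossover_rate=0.8,
    min_improvement_delta=0.001,  # enable early stopping...
    patience=3,                   # ...after 3 generations without improvement
)
best_config = search.autotune()  # assumed entry point
```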
Random Search#
- class helion.autotuner.random_search.RandomSearch(kernel, args, count=1000)[source]#
Implements a random search algorithm for kernel autotuning.
This class generates a specified number of random configurations for a given kernel and evaluates their performance.
- Inherits from:
  FiniteSearch: a base class for finite configuration searches.
- kernel#
  The kernel to be tuned.
  - Type: BoundKernel
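A usage sketch under the same assumptions:

```python
from helion.autotuner.random_search import RandomSearch

# Evaluate 200 random configurations and keep the fastest one.
best_config = RandomSearch(bound, args, count=200).autotune()  # assumed entry point
```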
Finite Search#
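FiniteSearch is the base class for searches that evaluate a fixed, pre-enumerated set of candidate configurations; RandomSearch above inherits from it.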
Local Cache#
- class helion.autotuner.local_cache.LocalAutotuneCache(autotuner)[source]#
This class implements the local autotune cache, storing the best-config artifact on the local file system, by default in torch’s cache directory or in a user-specified directory given by HELION_CACHE_DIR. It uses the LooseAutotuneCacheKey implementation for the cache key, which takes into account device and source-code properties but does not account for library-level code changes in Triton, Helion, or PyTorch. Use StrictLocalAutotuneCache to also consider these properties.
- Parameters:
  - autotuner (BaseSearch)
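A usage sketch that wraps a search with the cache; the HELION_CACHE_DIR override and the shared autotune() entry point are assumptions:

```python
import os
from helion.autotuner.local_cache import LocalAutotuneCache
from helion.autotuner.pattern_search import PatternSearch

os.environ["HELION_CACHE_DIR"] = "/tmp/helion_cache"  # optional: override the default location
cached = LocalAutotuneCache(PatternSearch(bound, args))
best_config = cached.autotune()  # assumed: reuses the stored config on later runs with the same key
```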