Autotuner Module#
The helion.autotuner module provides automatic optimization of kernel configurations.
Autotuning effort can be adjusted via helion.Settings.autotune_effort, which configures how much each algorithm explores ("none" disables autotuning, "quick" runs a smaller search, "full" uses the full search budget). Users may still override individual autotuning parameters if they need finer control.
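For example, effort can be dialed down per kernel. This is a minimal sketch, assuming helion.kernel forwards Settings fields such as autotune_effort as keyword arguments:

```python
import torch
import helion
import helion.language as hl

# Sketch: request the smaller "quick" search budget for this kernel
# (assumes Settings fields can be passed through helion.kernel).
@helion.kernel(autotune_effort="quick")
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2
    return out
```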
Configuration Classes#
Config#
- class helion.runtime.config.Config(*, block_sizes=None, num_threads=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, num_sm_multiplier=None, maxnreg=None, indexing=None, atomic_indexing=None, advanced_controls_file=None, epilogue_subtile=None, **kwargs)[source]#
- Parameters:
  - load_eviction_policies (list[Literal['', 'first', 'last']] | None) –
  - pid_type (Optional[Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']]) –
  - indexing (Union[Literal['pointer', 'tensor_descriptor', 'block_ptr'], list[Literal['pointer', 'tensor_descriptor', 'block_ptr']], None]) –
  - atomic_indexing (Union[Literal['pointer', 'tensor_descriptor', 'block_ptr'], list[Literal['pointer', 'tensor_descriptor', 'block_ptr']], None]) –
  - kwargs (object) –
- __init__(*, block_sizes=None, num_threads=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, load_eviction_policies=None, num_warps=None, num_stages=None, pid_type=None, num_sm_multiplier=None, maxnreg=None, indexing=None, atomic_indexing=None, advanced_controls_file=None, epilogue_subtile=None, **kwargs)[source]#
Initialize a Config object.
- Parameters:
  - block_sizes (list[int] | None) – Controls tile sizes for hl.tile invocations.
  - num_threads (list[int] | int | None) – Target thread count per axis (backend-specific).
  - loop_orders (list[list[int]] | None) – Permutes iteration order of tiles.
  - l2_groupings (list[int] | None) – Reorders program IDs for L2 cache locality.
  - reduction_loops (list[int | None] | None) – Configures reduction loop behavior.
  - range_unroll_factors (list[int] | None) – Loop unroll factors for tl.range calls.
  - range_warp_specializes (list[bool | None] | None) – Warp specialization for tl.range calls.
  - range_num_stages (list[int] | None) – Number of stages for tl.range calls.
  - range_multi_buffers (list[bool | None] | None) – Controls disallow_acc_multi_buffer for tl.range calls.
  - range_flattens (list[bool | None] | None) – Controls the flatten parameter for tl.range calls.
  - static_ranges (list[bool] | None) – Whether to use tl.static_range instead of tl.range.
  - load_eviction_policies (list[Literal['', 'first', 'last']] | None) – Eviction policies for load operations ("", "first", "last").
  - num_stages (int | None) – Number of stages for software pipelining.
  - pid_type (Optional[Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']]) – Program ID type strategy ("flat", "xyz", "persistent_blocked", "persistent_interleaved").
  - num_sm_multiplier (Optional[Literal[1, 2, 4, 8]]) – Multiplier for the number of SMs in persistent kernels (1, 2, 4, 8). Controls multi-occupancy by launching N * num_sms thread blocks instead of just num_sms.
  - maxnreg (Optional[Literal[32, 64, 128, 256]]) – Maximum number of registers per thread (None, 32, 64, 128, 256). Lower values allow higher occupancy but may hurt performance. Used with persistent kernels to ensure multi-occupancy can be achieved.
  - indexing (Union[Literal['pointer', 'tensor_descriptor', 'block_ptr'], list[Literal['pointer', 'tensor_descriptor', 'block_ptr']], None]) – Indexing strategy for load and store operations. Can be a single strategy string applied to all loads/stores (e.g. indexing="block_ptr", backward compatible); a list of strategies, one per load/store operation, which must specify all of them (e.g. indexing=["pointer", "block_ptr", "tensor_descriptor"]); or empty/omitted, in which case all loads/stores default to "pointer". Valid strategies: "pointer", "tensor_descriptor", "block_ptr".
  - atomic_indexing (Union[Literal['pointer', 'tensor_descriptor', 'block_ptr'], list[Literal['pointer', 'tensor_descriptor', 'block_ptr']], None]) – Indexing strategy for atomic operations (e.g., hl.atomic_add). Same format as indexing (a single string or a list per atomic op). Defaults to "pointer" when omitted.
  - advanced_controls_file (str | None) – Path to a PTXAS control file applied during compilation, or empty string for none.
  - epilogue_subtile (int | None) – Split factor for the epilogue (post-matmul pointwise + store) along the N dimension. None = disabled (default); valid values are 2 and 4.
  - **kwargs (object) – Additional user-defined configuration parameters.
- minimize(config_spec)[source]#
Return a new Config with values matching effective defaults removed.
This produces a minimal config representation by removing any values that match what the config_spec would use as defaults.
- property indexing: Literal['pointer', 'tensor_descriptor', 'block_ptr'] | list[Literal['pointer', 'tensor_descriptor', 'block_ptr']]#
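A config is usually produced by the autotuner, but it can also be written by hand. This is a minimal sketch with an illustrative elementwise kernel; passing config= to helion.kernel to skip autotuning follows Helion's usual API, but verify the keyword against your version:

```python
import torch
import helion
import helion.language as hl

# Hand-written config: names match the Config parameters above.
config = helion.Config(
    block_sizes=[64, 64],  # tile sizes for the hl.tile loops
    num_warps=8,
    num_stages=3,
    indexing="block_ptr",  # one strategy applied to all loads/stores
)

@helion.kernel(config=config)  # explicit config: autotuning is skipped
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```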
Search Algorithms#
The autotuner supports multiple search strategies:
Pattern Search#
- class helion.autotuner.pattern_search.InitialPopulationStrategy(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Strategy for generating the initial population for search algorithms.
- FROM_RANDOM = 'from_random'#
Generate a random population of configurations.
- FROM_BEST_AVAILABLE = 'from_best_available'#
Start from default config plus up to 20 best matching cached configs from previous runs.
- class helion.autotuner.pattern_search.PatternSearch(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, initial_population_strategy=None, best_available_pad_random=True, num_neighbors_cap=-1, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Search that explores single-parameter perturbations around the current best.
- Parameters:
  - kernel (_AutotunableKernel) –
  - initial_population (int) –
  - copies (int) –
  - max_generations (int) –
  - min_improvement_delta (float) –
  - initial_population_strategy (InitialPopulationStrategy | None) –
  - best_available_pad_random (bool) –
  - num_neighbors_cap (int) –
  - finishing_rounds (int) –
  - compile_timeout_lower_bound (float) –
  - compile_timeout_quantile (float) –
- __init__(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, initial_population_strategy=None, best_available_pad_random=True, num_neighbors_cap=-1, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Create a PatternSearch autotuner.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - initial_population (int) – The number of random configurations to generate for the initial population.
  - copies (int) – Count of top Configs to run pattern search on.
  - max_generations (int) – The maximum number of generations to run.
  - min_improvement_delta (float) – Relative stop threshold; stop if abs(best/current - 1) < this.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) – When True and using FROM_BEST_AVAILABLE, pad the cached configs with random configs to reach initial_population size. When False, use only the default and cached configs (no random padding).
  - num_neighbors_cap (int) – Maximum number of neighbors to explore per generation. -1 means no cap. Set HELION_CAP_AUTOTUNE_NUM_NEIGHBORS=N to override.
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.
  - compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.
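A sketch of driving the search directly; bound_kernel and args are assumed to come from an existing @helion.kernel invocation, and the autotune() entry point is assumed from the common base-search interface:

```python
from helion.autotuner.pattern_search import (
    InitialPopulationStrategy,
    PatternSearch,
)

# Pattern search seeded from previously cached configs when available.
search = PatternSearch(
    bound_kernel,           # assumed: a bound Helion kernel
    args,                   # assumed: example inputs for benchmarking
    initial_population=50,  # smaller seed population than the default 100
    copies=3,               # refine the top 3 configs
    initial_population_strategy=InitialPopulationStrategy.FROM_BEST_AVAILABLE,
)
best_config = search.autotune()  # assumed base-search entry point
```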
LFBO Pattern Search#
- class helion.autotuner.surrogate_pattern_search.LFBOPatternSearch(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, frac_selected=0.1, num_neighbors=300, radius=2, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, best_available_pad_random=True, num_neighbors_cap=-1, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Batch Likelihood-Free Bayesian Optimization (LFBO) Pattern Search.
This algorithm enhances PatternSearch by using a Random Forest classifier as a surrogate model to select which configurations to benchmark, reducing the number of kernel compilations and runs needed to find optimal configurations. It imposes a similarity penalty to encourage diverse config selection.
- Algorithm Overview:
  1. Generate an initial population (random or default) and benchmark all configurations.
  2. Fit a Random Forest classifier to predict "good" vs "bad" configurations:
     - Configs with performance < quantile threshold are labeled "good" (class 1).
     - Configs with performance >= quantile threshold are labeled "bad" (class 0).
     - Weighted classification emphasizes configs that are much better than the threshold.
  3. For each generation:
     - Generate random neighbors around the current best configurations.
     - Score all neighbors using the classifier's predicted probability of being "good".
     - Penalize points that are similar to previously selected points.
     - Select points to benchmark via sequential greedy optimization.
     - Retrain the classifier on all observed data (not incrementally).
     - Update search trajectories based on new results.
The weighted classification model learns to identify which configs maximize expected improvement over the current best config. Unlike fitting a surrogate to the raw config performances, this classification-based approach can also learn from configs that time out or have unacceptable accuracy.
References: - Song, J., et al. (2022). "A General Recipe for Likelihood-free Bayesian Optimization."
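The labeling step (step 2 above) can be sketched as follows. This is illustrative only: perf holds measured runtimes (lower is better), inf marks failed configs, and the weighting scheme is an assumption, not the module's exact formula:

```python
import numpy as np

# Illustrative LFBO-style labeling: configs faster than the performance
# quantile become class 1 ("good"); weights grow with the margin below
# the threshold. Failed configs (perf = inf) stay as negatives.
def label_and_weight(perf: np.ndarray, quantile: float = 0.1):
    tau = np.quantile(perf[np.isfinite(perf)], quantile)
    labels = (perf < tau).astype(int)  # 1 = good, 0 = bad
    margin = np.where(np.isfinite(perf), tau - perf, 0.0)
    weights = np.where(labels == 1, margin, 1.0)
    return labels, weights
```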
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel during benchmarking.
  - initial_population (int) – Number of random configurations in the initial population. Default from PATTERN_SEARCH_DEFAULTS. Ignored when using the DEFAULT strategy.
  - copies (int) – Number of top configurations to run pattern search from. Default from PATTERN_SEARCH_DEFAULTS.
  - max_generations (int) – Maximum number of search iterations per copy. Default from PATTERN_SEARCH_DEFAULTS.
  - min_improvement_delta (float) – Early stopping threshold. Search stops if the relative improvement abs(best/current - 1) < min_improvement_delta. Default: 0.001 (0.1% improvement threshold).
  - frac_selected (float) – Fraction of generated neighbors to actually benchmark, after filtering by classifier score. Range: (0, 1]. Lower values reduce benchmarking cost but may miss good configurations. Default: 0.1.
  - num_neighbors (int) – Number of random neighbor configurations to generate around each search point per generation. Default: 300.
  - radius (int) – Maximum perturbation distance in configuration space. For power-of-two parameters, this is the max change in log2 space. For other parameters, this limits how many parameters can be changed. Default: 2.
  - quantile (float) – Threshold for labeling configs as "good" (class 1) vs "bad" (class 0). Configs with performance below this quantile are labeled good. Range: (0, 1). Lower values create a more selective definition of "good". Default: 0.1 (the top 10% are considered good).
  - patience (int) – Number of generations without improvement before stopping the search copy. Default: 1.
  - similarity_penalty (float) – Penalty for selecting points that are similar to points already selected in the batch. Default: 1.0.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var.
  - best_available_pad_random (bool) –
  - num_neighbors_cap (int) –
  - finishing_rounds (int) –
  - compile_timeout_lower_bound (float) –
  - compile_timeout_quantile (float) –
- __init__(kernel, args, *, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, frac_selected=0.1, num_neighbors=300, radius=2, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, best_available_pad_random=True, num_neighbors_cap=-1, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Create a PatternSearch autotuner.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - initial_population (int) – The number of random configurations to generate for the initial population.
  - copies (int) – Count of top Configs to run pattern search on.
  - max_generations (int) – The maximum number of generations to run.
  - min_improvement_delta (float) – Relative stop threshold; stop if abs(best/current - 1) < this.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) – When True and using FROM_BEST_AVAILABLE, pad the cached configs with random configs to reach initial_population size. When False, use only the default and cached configs (no random padding).
  - num_neighbors_cap (int) – Maximum number of neighbors to explore per generation. -1 means no cap. Set HELION_CAP_AUTOTUNE_NUM_NEIGHBORS=N to override.
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.
  - compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.
  - frac_selected (float) –
  - num_neighbors (int) –
  - radius (int) –
  - quantile (float) –
  - patience (int) –
  - similarity_penalty (float) –
- classmethod get_kwargs_from_profile(profile, settings)[source]#
Retrieve extra kwargs from the effort profile for the autotuner.
- seed_training_data(results)[source]#
Pre-populate the surrogate’s training set with externally-benchmarked configs.
Useful when an outer loop (e.g. a hybrid LLM+LFBO search) has already benchmarked configs and wants the LFBO surrogate to learn from them rather than starting from scratch. Failed configs (perf=inf) are kept since the surrogate’s binary classifier learns from negatives too.
- compute_leaf_similarity(surrogate, X_test)[source]#
Compute a pairwise similarity matrix using leaf-node co-occurrence.
For a RandomForest, two samples are similar if they land in the same leaf nodes across trees. This is the Jaccard similarity of their leaf assignments.
- Parameters:
  - surrogate (RandomForestClassifier) – Fitted RandomForestClassifier.
  - X_test (ndarray) – Test samples (n_samples, n_features).
- Returns:
  similarity_matrix – An (n_samples, n_samples) matrix where entry [i, j] is the fraction of trees in which samples i and j land in the same leaf.
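The underlying computation can be sketched with scikit-learn's apply API; the function and variable names here are illustrative, not the module's actual code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative leaf co-occurrence similarity for a fitted forest:
# forest.apply(X) returns, per sample, the leaf index it reaches in
# each tree; entry [i, j] below is the fraction of trees where
# samples i and j share a leaf.
def leaf_similarity(forest: RandomForestClassifier, X: np.ndarray) -> np.ndarray:
    leaves = forest.apply(X)  # shape: (n_samples, n_trees)
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
```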
LFBO Tree Search (Default)#
LFBOTreeSearch is the default autotuner.
It extends LFBO Pattern Search with tree-guided neighbor generation, using greedy decision tree
traversal to focus search on parameters the surrogate model has identified as important.
- class helion.autotuner.surrogate_pattern_search.LFBOTreeSearch(kernel, args, *, num_neighbors=200, frac_selected=0.1, radius=2, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Bases: LFBOPatternSearch
LFBO Tree Search: Likelihood-Free Bayesian Optimization with tree-guided neighbor generation.
This algorithm uses a Random Forest classifier as a surrogate model to both select which configurations to benchmark and to guide the generation of new candidate configurations via greedy decision tree traversal.
- Algorithm Overview:
  1. Generate an initial population (random or default) and benchmark all configurations.
  2. Fit a Random Forest classifier to predict "good" vs "bad" configurations:
     - Configs with performance < quantile threshold are labeled "good" (class 1).
     - Configs with performance >= quantile threshold are labeled "bad" (class 0).
     - Weighted classification emphasizes configs that are much better than the threshold.
  3. For the first generation, generate neighbors via random perturbation, since the surrogate is not yet fitted.
  4. For subsequent generations, generate neighbors via greedy tree traversal. For each of num_neighbors trials:
     - Pick a random decision tree from the Random Forest.
     - Trace the decision path for the current best config through that tree.
     - Extract the configuration parameters used in the tree's split decisions.
     - For each parameter on the path, greedily optimize it: generate pattern neighbors within the configured radius, score candidates using the single tree's predicted probability, accept the best value (ties broken randomly), and incrementally update the encoded representation.
     - Keep the result only if it differs from the base configuration.
  5. Score candidates using the full ensemble's predicted probability with a diversity-aware similarity penalty, then select the top candidates.
  6. Benchmark the selected candidates and retrain the classifier on all observed data.
The tree-guided traversal focuses search on parameters the surrogate has identified as important (those used in tree splits). Using a single tree per trial (rather than the full ensemble) introduces diversity, since different trees may emphasize different parameters.
References: - Song, J., et al. (2022). "A General Recipe for Likelihood-free Bayesian Optimization." - Mišić, Velibor V. (2020). "Optimization of Tree Ensembles." Operations Research 68(5): 1605-1624.
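The per-trial path tracing can be sketched against scikit-learn's tree internals; this is illustrative, not the module's actual implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative greedy-traversal helper: walk one tree's decision path
# for an encoded config x and collect the feature (parameter) indices
# used at each split. These are the parameters the tree deems important.
def path_features(forest: RandomForestClassifier, x: np.ndarray,
                  rng: np.random.Generator) -> list[int]:
    tree = forest.estimators_[rng.integers(len(forest.estimators_))].tree_
    node, feats = 0, []
    while tree.children_left[node] != tree.children_right[node]:  # internal node
        feats.append(int(tree.feature[node]))
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]
        else:
            node = tree.children_right[node]
    return feats
```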
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel during benchmarking.
  - initial_population (int) – Number of random configurations in the initial population. Default from PATTERN_SEARCH_DEFAULTS. Ignored when using the DEFAULT strategy.
  - copies (int) – Number of top configurations to run pattern search from. Default from PATTERN_SEARCH_DEFAULTS.
  - max_generations (int) – Maximum number of search iterations per copy. Default from PATTERN_SEARCH_DEFAULTS.
  - min_improvement_delta (float) – Early stopping threshold. Search stops if the relative improvement abs(best/current - 1) < min_improvement_delta. Default: 0.001 (0.1% improvement threshold).
  - frac_selected (float) – Fraction of generated neighbors to actually benchmark, after filtering by classifier score. Range: (0, 1]. Lower values reduce benchmarking cost but may miss good configurations. Default: 0.1.
  - num_neighbors (int) – Number of greedy tree-traversal trials to run per generation. Each trial picks a random tree, traces its decision path, and greedily optimizes parameters along that path. Default: 200.
  - radius (int) – Maximum perturbation distance when generating pattern neighbors for each parameter during tree traversal. For power-of-two parameters, this is the max change in log2 space. For other parameters, this limits the neighborhood size. Default: 2.
  - quantile (float) – Threshold for labeling configs as "good" (class 1) vs "bad" (class 0). Configs with performance below this quantile are labeled good. Range: (0, 1). Lower values create a more selective definition of "good". Default: 0.1 (the top 10% are considered good).
  - patience (int) – Number of generations without improvement before stopping the search copy. Default: 1.
  - similarity_penalty (float) – Penalty for selecting points that are similar to points already selected in the batch. Default: 1.0.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var.
  - best_available_pad_random (bool) –
  - finishing_rounds (int) –
  - compile_timeout_lower_bound (float) –
  - compile_timeout_quantile (float) –
- __init__(kernel, args, *, num_neighbors=200, frac_selected=0.1, radius=2, initial_population=100, copies=5, max_generations=20, min_improvement_delta=0.001, quantile=0.1, patience=1, similarity_penalty=1.0, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Create a PatternSearch autotuner.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - initial_population (int) – The number of random configurations to generate for the initial population.
  - copies (int) – Count of top Configs to run pattern search on.
  - max_generations (int) – The maximum number of generations to run.
  - min_improvement_delta (float) – Relative stop threshold; stop if abs(best/current - 1) < this.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates initial_population random configs. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) – When True and using FROM_BEST_AVAILABLE, pad the cached configs with random configs to reach initial_population size. When False, use only the default and cached configs (no random padding).
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.
  - compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.
  - num_neighbors (int) –
  - frac_selected (float) –
  - radius (int) –
  - quantile (float) –
  - patience (int) –
  - similarity_penalty (float) –
LLM-Guided Search#
LLMGuidedSearch uses a large language model to
iteratively propose kernel configurations. It sends the kernel source, config space, GPU
hardware info, and benchmark results to the LLM, which suggests promising configurations
across multiple refinement rounds.
Search for autotune configs by iteratively querying an LLM.
High-level flow:
1. Initialize the prompt context from the kernel, config space, and default config, so the first LLM call sees both the workload description and the available tuning knobs.
2. Round 0 launches the first LLM call immediately, then benchmarks the default config plus a few random seed configs while that request is in flight.
3. When the round-0 LLM response arrives, the search benchmarks its new unique configs and folds those results into the running set of top configs.
4. The top configs are then rebenchmarked before the next prompt is built, so each later LLM round sees the latest stabilized timings instead of only one-shot measurements.
5. Later rounds repeat a synchronous cycle: build a prompt from the latest search state, query the LLM, benchmark new configs, then rebenchmark the strongest configs.
6. The final returned config comes from the best rebenchmarked config, not from an unverified one-shot LLM suggestion.
The implementation keeps config parsing, workload analysis, prompting, transport, and search orchestration separate:
- configs.py parses and validates sparse configs from LLM responses.
- workload.py analyzes the kernel and hardware for prompt context.
- feedback.py summarizes benchmark results for prompts.
- prompting.py builds the actual prompt text.
- transport.py handles provider I/O.
- This file owns the round-by-round search state machine.
- helion.autotuner.llm_search.guided_search_kwargs_from_config(config, settings)[source]#
Merge LLM config defaults with the supported HELION_LLM_* overrides.
- helion.autotuner.llm_search.guided_search_kwargs_from_profile(profile, settings)[source]#
Merge effort-profile defaults with the supported HELION_LLM_* overrides.
- class helion.autotuner.llm_search.LLMGuidedSearch(kernel, args, *, provider=None, model='gpt-5-2', configs_per_round=15, max_rounds=4, initial_random_configs=10, finishing_rounds=0, min_improvement_delta=0.005, api_base=None, api_key=None, request_timeout_s=120.0, compile_timeout_s=None)[source]#
LLM-Guided autotuner that uses a language model to suggest kernel configurations.
Instead of random or evolutionary search, this strategy uses an LLM to propose configurations based on:
- The kernel's source code and structure
- The configuration space (parameter types, ranges)
- GPU hardware information
- Benchmark results from previous rounds (iterative refinement)
The search overlaps only the initial round-0 request with seed benchmarking. After that, refinement rounds are synchronous: each round asks the LLM for a batch of configs, benchmarks them, rebenchmarks the strongest configs, and only then builds the next prompt.
Common providers (OpenAI Responses, Anthropic Messages, and compatible proxies) work via direct HTTP without extra dependencies.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – Arguments passed to the kernel during benchmarking.
  - provider (str | None) – Optional explicit provider override. Use this when a proxy serves a model family behind a different API shape than its name implies.
  - model (str) – LLM model name (e.g. "gpt-5-2", "claude-haiku-4.5", "claude-3-5-haiku-latest"). Can also be set via HELION_LLM_MODEL.
  - configs_per_round (int) – Number of configs to request from the LLM per round.
  - max_rounds (int) – Total number of LLM query rounds, including the initial suggestion round. max_rounds=1 means one LLM call total.
  - initial_random_configs (int) – Number of random configs to add alongside LLM suggestions in the first round, for diversity.
  - finishing_rounds (int) – Number of finishing rounds to simplify the best config.
  - api_base (str | None) – Optional custom API base URL for the LLM provider.
  - api_key (str | None) – Optional API key. Defaults to the provider's env var (e.g. OPENAI_API_KEY).
  - compile_timeout_s (int | None) – Optional compile-time cap applied only while the LLM search benchmarks its exploratory configs.
  - min_improvement_delta (float) –
  - request_timeout_s (float) –
- __init__(kernel, args, *, provider=None, model='gpt-5-2', configs_per_round=15, max_rounds=4, initial_random_configs=10, finishing_rounds=0, min_improvement_delta=0.005, api_base=None, api_key=None, request_timeout_s=120.0, compile_timeout_s=None)[source]#
Initialize the PopulationBasedSearch object.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be tuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - model (str) –
  - configs_per_round (int) –
  - max_rounds (int) –
  - initial_random_configs (int) –
  - min_improvement_delta (float) –
  - request_timeout_s (float) –
LLM Environment Variables#
| Variable | Default | Description |
|---|---|---|
| `HELION_LLM_PROVIDER` | (inferred from model name) | LLM provider (e.g. `anthropic`) |
| `HELION_LLM_MODEL` | `gpt-5-2` | Model name (e.g. `claude-haiku-4.5`) |
| `HELION_LLM_API_KEY` | provider env var (e.g. `OPENAI_API_KEY`) | API key |

Additional `HELION_LLM_*` variables configure a custom API base URL, the compile timeout (seconds) for LLM-proposed configs, a custom CA bundle path (for corporate proxies that do TLS inspection), a client certificate path, and a client key path (for proxies requiring mutual TLS).
The proxy/TLS variables are only needed in corporate environments where a proxy intercepts HTTPS traffic. Most users connecting directly to the LLM API can ignore them.
LLM-Seeded Search (Hybrid)#
LLMSeededSearch is a two-stage hybrid approach:
- Stage 1 (LLM): Run LLM-guided search for a configurable number of rounds to find good initial configs.
- Stage 2 (Surrogate): Run a non-LLM search algorithm (default: LFBOTreeSearch), seeded with the best LLM config and trained on all LLM benchmark results.
This combines the LLM's ability to make informed initial guesses with the surrogate model's efficient local search. LLMSeededLFBOTreeSearch is a convenience subclass that locks stage 2 to LFBOTreeSearch.
Run a two-stage hybrid autotuner that seeds a local search with an LLM pass.
High-level flow:
1. Run LLMGuidedSearch for llm_max_rounds rounds and keep its best config. The hybrid defaults to 1 LLM round.
2. Run a second-stage non-LLM search, LFBOTreeSearch by default.
3. If the second stage supports best-available seeding, force FROM_BEST_AVAILABLE and inject the LLM best config so stage 2 can refine it instead of starting cold.
4. Report per-stage timing and config-count metrics, plus aggregated hybrid totals.
Setting llm_max_rounds=0 skips the LLM stage and runs only the second stage.
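A sketch of constructing the hybrid directly; bound_kernel and args are assumed inputs, and autotune() is assumed from the base-search interface:

```python
from helion.autotuner.llm_seeded_lfbo import LLMSeededLFBOTreeSearch

# One LLM seeding round, then LFBOTreeSearch refinement; stage-2
# options travel through second_stage_kwargs as documented below.
search = LLMSeededLFBOTreeSearch(
    bound_kernel,                       # assumed: a bound Helion kernel
    args,                               # assumed: benchmarking inputs
    llm_max_rounds=1,
    second_stage_kwargs={"copies": 3},  # forwarded to LFBOTreeSearch
)
best_config = search.autotune()         # assumed entry point
```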
- class helion.autotuner.llm_seeded_lfbo.LLMSeededSearch(kernel, args, *, second_stage_algorithm=None, second_stage_kwargs=None, best_available_pad_random=False, llm_provider=None, llm_model='gpt-5-2', llm_configs_per_round=15, llm_max_rounds=1, llm_initial_random_configs=10, llm_compile_timeout_s=15, llm_api_base=None, llm_api_key=None, llm_request_timeout_s=120.0)[source]#
Generic hybrid autotuner that seeds a second-stage search with LLM proposals.
The algorithm runs in two stages:
1. Run LLMGuidedSearch for llm_max_rounds rounds and capture its best config in memory.
2. Run the configured second-stage search algorithm. If the algorithm supports best-available seeding, it is switched to FROM_BEST_AVAILABLE so it can start from the LLM seed config.
Setting llm_max_rounds=0 disables the seed stage and runs only the second-stage search.
- Parameters:
- default_second_stage_algorithm = 'LFBOTreeSearch'#
- allow_second_stage_env_override = True#
- __init__(kernel, args, *, second_stage_algorithm=None, second_stage_kwargs=None, best_available_pad_random=False, llm_provider=None, llm_model='gpt-5-2', llm_configs_per_round=15, llm_max_rounds=1, llm_initial_random_configs=10, llm_compile_timeout_s=15, llm_api_base=None, llm_api_key=None, llm_request_timeout_s=120.0)[source]#
Initialize the BaseSearch object.
- Parameters:
- hybrid_stage_breakdown: dict[str, object] | None#
- class helion.autotuner.llm_seeded_lfbo.LLMSeededLFBOTreeSearch(kernel, args, *, second_stage_kwargs=None, best_available_pad_random=False, llm_provider=None, llm_model='gpt-5-2', llm_configs_per_round=15, llm_max_rounds=1, llm_initial_random_configs=10, llm_compile_timeout_s=15, llm_api_base=None, llm_api_key=None, llm_request_timeout_s=120.0)[source]#
Convenience wrapper for the common LLM-seeded LFBO tree search pipeline.
LFBO-specific stage-2 settings should be passed through second_stage_kwargs.
- Parameters:
- allow_second_stage_env_override = False#
- classmethod get_kwargs_from_profile(profile, settings)[source]#
Drop the explicit stage-2 algorithm knob from the LFBO convenience API.
- __init__(kernel, args, *, second_stage_kwargs=None, best_available_pad_random=False, llm_provider=None, llm_model='gpt-5-2', llm_configs_per_round=15, llm_max_rounds=1, llm_initial_random_configs=10, llm_compile_timeout_s=15, llm_api_base=None, llm_api_key=None, llm_request_timeout_s=120.0)[source]#
Initialize the BaseSearch object.
- Parameters:
Hybrid Environment Variables#
Two additional environment variables control the hybrid: one overrides the second-stage search algorithm (default: LFBOTreeSearch), and one overrides the number of LLM rounds in stage 1 (the default is effort-dependent).
To use the LLM-guided autotuner, set the HELION_AUTOTUNER environment variable:

```bash
# Pure LLM-guided search
export HELION_AUTOTUNER=LLMGuidedSearch

# LLM-seeded hybrid (recommended)
export HELION_AUTOTUNER=LLMSeededLFBOTreeSearch
```
Example: Using Claude as the LLM provider#
```bash
export HELION_AUTOTUNER=LLMSeededLFBOTreeSearch
export HELION_LLM_PROVIDER=anthropic
export HELION_LLM_MODEL=claude-opus-4-7
export HELION_LLM_API_KEY=your-key-here
```
Then run your kernel as usual — the autotuner will use Claude to propose initial configs before handing off to the surrogate-based search:
```python
out = matmul(torch.randn([2048, 2048], device="cuda"),
             torch.randn([2048, 2048], device="cuda"))
```
DE Surrogate Hybrid#
Differential Evolution with Surrogate-Assisted Selection (DE-SAS).
This hybrid approach combines the robust exploration of Differential Evolution with the sample efficiency of surrogate models. It’s designed to beat standard DE by making smarter decisions about which candidates to evaluate.
Key idea:
- Use DE's mutation/crossover to generate candidates (good exploration).
- Use a Random Forest surrogate to predict which candidates are promising.
- Only evaluate the most promising candidates (sample efficiency).
- Periodically re-fit the surrogate model.
This is inspired by recent work on surrogate-assisted evolutionary algorithms, which have shown 2-5× speedups over standard EAs on expensive optimization problems.
References: - Jin, Y. (2011). “Surrogate-assisted evolutionary computation: Recent advances and future challenges.” - Sun, C., et al. (2019). “A surrogate-assisted DE with an adaptive local search”
Author: Francisco Geiman Thiesen Date: 2025-11-05
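The surrogate-assisted selection step can be sketched as follows; the names are illustrative, and the surrogate here is assumed to be a regressor over encoded configs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative DE-SAS selection: generate candidate_ratio times as many
# DE candidates as there are population slots, then keep only those the
# surrogate predicts to be fastest (lower predicted runtime is better).
def select_promising(candidates: np.ndarray,
                     surrogate: RandomForestRegressor,
                     n_slots: int) -> np.ndarray:
    predicted = surrogate.predict(candidates)
    keep = np.argsort(predicted)[:n_slots]
    return candidates[keep]
```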
- class helion.autotuner.de_surrogate_hybrid.DESurrogateHybrid(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, surrogate_threshold=100, candidate_ratio=3, refit_frequency=5, n_estimators=50, min_improvement_delta=0.001, patience=3, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Hybrid Differential Evolution with Surrogate-Assisted Selection.
This algorithm uses DE for exploration but adds a surrogate model to intelligently select which candidates to actually evaluate, avoiding wasting evaluations on poor candidates.
- Parameters:
  - kernel (_AutotunableKernel) – The bound kernel to tune.
  - population_size (int) – Size of the DE population (default: 40).
  - max_generations (int) – Maximum number of generations (default: 40).
  - crossover_rate (float) – Crossover probability (default: 0.8).
  - surrogate_threshold (int) – Use the surrogate after this many evaluations (default: 100).
  - candidate_ratio (int) – Number of candidates to generate per population slot (default: 3).
  - refit_frequency (int) – Refit the surrogate every N generations (default: 5).
  - n_estimators (int) – Number of trees in the Random Forest (default: 50).
  - min_improvement_delta (float) – Relative improvement threshold for early stopping. Default: 0.001 (0.1%). Early stopping is enabled by default.
  - patience (int) – Number of generations without improvement before stopping. Default: 3. Early stopping is enabled by default.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var. If not set via env var and None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) –
  - finishing_rounds (int) –
  - compile_timeout_lower_bound (float) –
  - compile_timeout_quantile (float) –
- __init__(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, surrogate_threshold=100, candidate_ratio=3, refit_frequency=5, n_estimators=50, min_improvement_delta=0.001, patience=3, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Create a DifferentialEvolutionSearch autotuner.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - population_size (int) – The size of the population.
  - max_generations (int) – The maximum number of generations to run.
  - crossover_rate (float) – The crossover rate for mutation.
  - immediate_update – Whether to update the population immediately after each evaluation.
  - min_improvement_delta (float) – Relative improvement threshold for early stopping. If None (default), early stopping is disabled.
  - patience (int) – Number of generations without improvement before stopping. If None (default), early stopping is disabled.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) – When True and using FROM_BEST_AVAILABLE, pad the cached configs with random configs to reach 2x population size. When False, use only the default and cached configs (no random padding).
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.
  - compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.
  - surrogate_threshold (int) –
  - candidate_ratio (int) –
  - refit_frequency (int) –
  - n_estimators (int) –
Differential Evolution#
- class helion.autotuner.differential_evolution.DifferentialEvolutionSearch(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
A search strategy that uses differential evolution to find the best config.
- Parameters:
  - kernel (_AutotunableKernel) –
  - population_size (int) –
  - max_generations (int) –
  - crossover_rate (float) –
  - initial_population_strategy (InitialPopulationStrategy | None) –
  - best_available_pad_random (bool) –
  - finishing_rounds (int) –
  - compile_timeout_lower_bound (float) –
  - compile_timeout_quantile (float) –
- __init__(kernel, args, population_size=40, max_generations=40, crossover_rate=0.8, immediate_update=None, min_improvement_delta=None, patience=None, initial_population_strategy=None, best_available_pad_random=True, finishing_rounds=0, compile_timeout_lower_bound=30.0, compile_timeout_quantile=0.9)[source]#
Create a DifferentialEvolutionSearch autotuner.
- Parameters:
  - kernel (_AutotunableKernel) – The kernel to be autotuned.
  - args (Sequence[object]) – The arguments to be passed to the kernel.
  - population_size (int) – The size of the population.
  - max_generations (int) – The maximum number of generations to run.
  - crossover_rate (float) – The crossover rate for mutation.
  - immediate_update (bool | None) – Whether to update the population immediately after each evaluation.
  - min_improvement_delta (float | None) – Relative improvement threshold for early stopping. If None (default), early stopping is disabled.
  - patience (int | None) – Number of generations without improvement before stopping. If None (default), early stopping is disabled.
  - initial_population_strategy (InitialPopulationStrategy | None) – Strategy for generating the initial population. FROM_RANDOM generates a random population. FROM_BEST_AVAILABLE uses cached configs from prior runs, and fills the remainder with random configs when best_available_pad_random is True. Can be overridden by the HELION_AUTOTUNER_INITIAL_POPULATION env var (handled in default_autotuner_fn). If None is passed, defaults to FROM_RANDOM.
  - best_available_pad_random (bool) – When True and using FROM_BEST_AVAILABLE, pad the cached configs with random configs to reach 2x population size. When False, use only the default and cached configs (no random padding).
  - finishing_rounds (int) – Number of finishing rounds to run after the main search.
  - compile_timeout_lower_bound (float) – Lower bound for adaptive compile timeout in seconds.
  - compile_timeout_quantile (float) – Quantile of compile times to use for adaptive timeout.
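Assuming the same HELION_AUTOTUNER mechanism shown in the LLM section above, differential evolution can be selected directly:

```bash
export HELION_AUTOTUNER=DifferentialEvolutionSearch
```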
Random Search#
- class helion.autotuner.random_search.RandomSearch(kernel, args, count=1000)[source]#
Implements a random search algorithm for kernel autotuning.
This class generates a specified number of random configurations for a given kernel and evaluates their performance.
- Inherits from:
FiniteSearch: A base class for finite configuration searches.
- kernel#
The kernel to be tuned (any _AutotunableKernel).
- args#
The arguments to be passed to the kernel.
- count#
The number of random configurations to generate.
Finite Search#
- class helion.autotuner.finite_search.FiniteSearch(kernel, args, configs=None)[source]#
Search over a given list of configs, returning the best one.
This strategy is similar to triton.autotune, and is the default if you specify helion.kernel(configs=[…]).
- Parameters:
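For example, a hand-picked config list can be benchmarked directly; this is a minimal sketch with an illustrative kernel body:

```python
import torch
import helion
import helion.language as hl

# Supplying explicit configs triggers FiniteSearch, which benchmarks
# each config on first use and keeps the fastest.
@helion.kernel(configs=[
    helion.Config(block_sizes=[64, 64], num_warps=4),
    helion.Config(block_sizes=[128, 64], num_warps=8),
])
def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2.0
    return out
```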
Local Cache#
- helion.autotuner.local_cache.get_helion_cache_dir()[source]#
Return the root directory for all Helion caches.
- Return type:
- helion.autotuner.local_cache.helion_triton_cache_dir(device_index)[source]#
Return per-device Triton cache directory under Helion’s cache root.
- class helion.autotuner.local_cache.SavedBestConfig(hardware, specialization_key, config, config_spec_hash, flat_config)[source]#
A parsed cache entry from a .best_config file.
- Parameters:
- helion.autotuner.local_cache.iter_cache_entries(cache_path, *, max_scan=None)[source]#
Yield parsed cache entries from cache_path, newest first.
Corrupt or unparsable files are skipped with a warning.
- Parameters:
- Return type:
- class helion.autotuner.local_cache.LocalAutotuneCache(autotuner)[source]#
This class implements the local autotune cache, storing the best-config artifact on the local file system, by default under torch's cache directory or at a user-specified HELION_CACHE_DIR directory. It uses the LooseAutotuneCacheKey implementation for the cache key, which takes into account device and source-code properties but does not account for library-level code changes in Triton, Helion, or PyTorch. Use StrictLocalAutotuneCache to also consider these properties.
- Parameters:
autotuner (
BaseSearch) –
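For example, the cache location can be redirected to a shared directory:

```bash
export HELION_CACHE_DIR=/path/to/shared/cache
```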