Language Module
The helion.language module contains the core DSL constructs for writing GPU kernels.
Loop Constructs
tile()
- helion.language.tile(begin_or_end, end_or_none=None, /, block_size=None)[source]
Break up an iteration space defined by a size or sequence of sizes into tiles.
The generated tiles can flatten the iteration space into the product of the sizes, perform multidimensional tiling, swizzle the indices for cache locality, reorder dimensions, etc. The only invariant is that every index in the range of the given sizes is covered exactly once.
The exact tiling strategy is determined by a Config object, typically created through autotuning.
If used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.
The key difference from grid() is that tile gives you Tile objects that load a slice of elements, while grid gives you scalar integer indices. It is recommended to use tile in most cases, since it allows more choices in autotuning.
- Parameters:
- begin_or_end (int | Tensor | Sequence[int | Tensor]) – If 2+ positional args are provided, the start of the iteration space. Otherwise, the end of the iteration space.
- end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If 2+ positional args are provided, the end of the iteration space.
- block_size (object) – Fixed block size (overrides autotuning), or None for an autotuned size.
- Returns:
Iterator over tile objects
- Return type:
Iterator[Tile] or Iterator[Sequence[Tile]]
Examples
One dimensional tiling:
@helion.kernel
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        # tile processes multiple elements at once
        result[tile] = x[tile] + y[tile]
    return result
Multi-dimensional tiling:
@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc
    return out
Fixed block size:
@helion.kernel
def process_with_fixed_block(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0), block_size=64):
        # Process with fixed block size of 64
        result[tile] = x[tile] * 2
    return result
Using tile properties:
@helion.kernel
def tile_info_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros([x.size(0)], dtype=x.dtype, device=x.device)
    for tile in hl.tile(x.size(0)):
        # Access tile properties
        start = tile.begin
        end = tile.end
        size = tile.block_size
        indices = tile.index  # [start, start+1, ..., end-1]
        # Use in computation
        result[tile] = x[tile] + indices
    return result
See also
- grid(): For explicit control over the launch grid
- tile_index(): For getting tile indices
- register_block_size(): For registering block sizes
Note
Similar to range() with multiple forms:
- tile(end) iterates 0 to end-1, autotuned block_size
- tile(begin, end) iterates begin to end-1, autotuned block_size
- tile(begin, end, block_size) iterates begin to end-1, fixed block_size
- tile(end, block_size=block_size) iterates 0 to end-1, fixed block_size
Block sizes can be registered for autotuning explicitly with register_block_size() and passed as the block_size argument if one needs two loops to use the same block size. Passing block_size=None is equivalent to calling register_block_size().
Use tile in most cases. Use grid when you need explicit control over the launch grid.
The tile() function is the primary way to create parallel loops in Helion kernels. It provides several key features:
Tiling Strategies: The exact tiling strategy is determined by a Config object, typically created through autotuning. This allows for:
- Multidimensional tiling
- Index swizzling for cache locality
- Dimension reordering
- Flattening of iteration spaces
Usage Patterns:
# Simple 1D tiling
for tile in hl.tile(1000):
    # tile.begin, tile.end, tile.block_size are available
    # Load entire tile (not just first element)
    data = tensor[tile]  # or hl.load(tensor, [tile]) for explicit loading

# 2D tiling
for tile_i, tile_j in hl.tile([height, width]):
    # Each tile represents a portion of the 2D space
    pass

# With explicit begin/end/block_size
for tile in hl.tile(0, 1000, block_size=64):
    pass
Grid vs Loop Behavior:
- When used at the top level of a kernel function, tile() becomes the grid of the kernel (parallel blocks)
- When used nested inside another loop, it becomes a sequential loop within each block, as sketched below
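The following minimal sketch (the scale_rows kernel is hypothetical, not part of the library) combines both behaviors: the top-level hl.grid becomes the launch grid with one block per row, while the nested hl.tile becomes a sequential loop over column tiles within each block.

@helion.kernel
def scale_rows(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for row in hl.grid(x.size(0)):       # top level: becomes the launch grid (scalar row index)
        for tile in hl.tile(x.size(1)):  # nested: sequential loop over column tiles
            out[row, tile] = x[row, tile] * 2
    return out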
grid()
- helion.language.grid(begin_or_end, end_or_none=None, /, step=None)[source]
Iterate over individual indices of the given iteration space.
The key difference from tile() is that grid gives you scalar integer indices (torch.SymInt), while tile gives you Tile objects that load a slice of elements. Use tile in most cases. Use grid when you need explicit control over the launch grid or when processing one element at a time.
Semantics are equivalent to:
for i in hl.tile(...):
    # i is a Tile object, accesses multiple elements
    data = tensor[i]  # loads slice of elements (1D tensor)
vs:
for i in hl.grid(...):
    # i is a scalar index, accesses single element
    data = tensor[i]  # loads single element (0D scalar)
When used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.
- Parameters:
- begin_or_end (int | Tensor | Sequence[int | Tensor]) – If 2+ positional args are provided, the start of the iteration space. Otherwise, the end of the iteration space.
- end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If 2+ positional args are provided, the end of the iteration space.
- step (int | Tensor | Sequence[int | Tensor] | None) – Step size for iteration (default: 1)
- Returns:
Iterator over scalar indices
- Return type:
Iterator[torch.SymInt] or Iterator[Sequence[torch.SymInt]]
See also
- tile(): For processing multiple elements at once
- tile_index(): For getting tile indices
- arange(): For creating index sequences
Note
Similar to range() with multiple forms:
- grid(end) iterates from 0 to end-1, step 1
- grid(begin, end) iterates from begin to end-1, step 1
- grid(begin, end, step) iterates from begin to end-1, given step
- grid(end, step=step) iterates from 0 to end-1, given step
Use tile in most cases. Use grid when you need explicit control over the launch grid.
The grid() function iterates over individual indices rather than tiles. It is equivalent to tile(size, block_size=1) but returns scalar indices instead of Tile objects.
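As an illustration, here is a minimal sketch (the copy_elements kernel is hypothetical, not part of the library) that uses grid() to process one element per index:

@helion.kernel
def copy_elements(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for i in hl.grid(x.size(0)):
        # i is a scalar torch.SymInt index; x[i] loads a single element
        out[i] = x[i]
    return out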
static_range()
- helion.language.static_range(begin_or_end, end_or_none=None, /, step=1)[source]
Create a range that gets unrolled at compile time by iterating over constant integer values.
This function is similar to Python’s built-in range(), but it generates a sequence of integer constants that triggers loop unrolling behavior in Helion kernels. The loop is completely unrolled at compile time, with each iteration becoming separate instructions in the generated code.
- Parameters:
- begin_or_end (int) – If 2+ positional args are provided, the start of the range. Otherwise, the end of the range.
- end_or_none (int | None) – If 2+ positional args are provided, the end of the range.
- step (int) – Step size for iteration (default: 1)
- Returns:
Iterator over constant integer values
- Return type:
Iterator[int]
Examples
Simple unrolled loop:
@helion.kernel
def unrolled_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        acc = torch.zeros([tile], dtype=x.dtype, device=x.device)
        # This loop gets completely unrolled
        for i in hl.static_range(3):
            acc += x[tile] * i
        result[tile] = acc
    return result
Range with start and step:
@helion.kernel
def kernel_stepped_unroll(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        acc = torch.zeros([tile], dtype=x.dtype, device=x.device)
        # Unroll loop from 2 to 8 with step 2: [2, 4, 6]
        for i in hl.static_range(2, 8, 2):
            acc += x[tile] * i
        result[tile] = acc
    return result
Note
Only constant integer values are supported
The range must be small enough to avoid compilation timeouts
Each iteration becomes separate instructions in the generated Triton code
Use for small, fixed iteration counts where unrolling is beneficial
static_range() behaves like a compile-time range for small loops: the loop body is fully unrolled, with each iteration emitted as separate instructions in the generated code.
Memory Operations
load()
- helion.language.load(tensor, index, extra_mask=None)[source]
Load a value from a tensor using a list of indices.
This function is equivalent to tensor[index] but allows setting extra_mask= to mask elements beyond the default masking based on the hl.tile range.
- Parameters:
- tensor (Tensor) – The tensor to load from
- index (list[object]) – The indices to use for loading. Can include tiles.
- extra_mask (Tensor | None) – Additional boolean mask applied on top of the default tile-based masking
- Returns:
The loaded value
- Return type:
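Example

A minimal sketch (the masked_copy kernel and its limit argument are hypothetical) showing extra_mask= on top of the default tile-boundary masking:

@helion.kernel
def masked_copy(x: torch.Tensor, limit: int) -> torch.Tensor:
    out = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        # Equivalent to x[tile], plus an additional mask: only elements with
        # index < limit are loaded; masked-off elements read as the fill value (typically 0).
        vals = hl.load(x, [tile], extra_mask=(tile.index < limit))
        out[tile] = vals
    return out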
store()
- helion.language.store(tensor, index, value, extra_mask=None)[source]
Store a value to a tensor using a list of indices.
This function is equivalent to tensor[index] = value but allows setting extra_mask= to mask elements beyond the default masking based on the hl.tile range.
- Parameters:
- tensor (Tensor) – The tensor to store into
- index (list[object]) – The indices to use for storing. Can include tiles.
- value (Tensor) – The value(s) to store
- extra_mask (Tensor | None) – Additional boolean mask applied on top of the default tile-based masking
- Return type:
- Returns:
None
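Example

A minimal sketch (the masked_store kernel and its limit argument are hypothetical) showing extra_mask= on a store; elements that fail the mask are left untouched in the output:

@helion.kernel
def masked_store(x: torch.Tensor, out: torch.Tensor, limit: int) -> torch.Tensor:
    for tile in hl.tile(x.size(0)):
        # Equivalent to out[tile] = x[tile] * 2, plus an additional mask on top
        # of the default tile-boundary masking.
        hl.store(out, [tile], x[tile] * 2, extra_mask=(tile.index < limit))
    return out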
atomic_add()
- helion.language.atomic_add(target, index, value, sem='relaxed')[source]
Atomically add a value to a target tensor.
Performs an atomic read-modify-write that adds value to target[index]. This is safe for concurrent access from multiple threads/blocks.
- Parameters:
- target (Tensor) – Tensor to update.
- index (list[object]) – Indices selecting elements to update. Can include tiles.
- value (Tensor | float) – Value(s) to add.
- sem (str) – Memory ordering semantics. One of "relaxed", "acquire", "release", "acq_rel". Defaults to "relaxed".
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
Example
@helion.kernel
def global_sum(x: torch.Tensor, result: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(x.size(0)):
        hl.atomic_add(result, [0], x[tile].sum())
    return result
Notes
Use for race-free accumulation across parallel execution.
Stronger memory-ordering semantics (e.g., "acq_rel") may reduce performance.
atomic_and()
- helion.language.atomic_and(target, index, value, sem='relaxed')[source]
Atomically apply bitwise AND with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
atomic_or()
- helion.language.atomic_or(target, index, value, sem='relaxed')[source]
Atomically apply bitwise OR with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
atomic_xor()
- helion.language.atomic_xor(target, index, value, sem='relaxed')[source]
Atomically apply bitwise XOR with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
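Example

An illustrative sketch (the toggle_parity kernel is hypothetical), in the same spirit as the atomic_add example above: each tile XORs the parity of its nonzero count into a shared one-element flag, which is race-free across blocks.

@helion.kernel
def toggle_parity(x: torch.Tensor, parity: torch.Tensor) -> torch.Tensor:
    # parity: 1-element int32 tensor initialized to 0; bit 0 ends up set iff
    # the total number of nonzero elements in x is odd.
    for tile in hl.tile(x.size(0)):
        count = (x[tile] != 0).to(torch.int32).sum()
        hl.atomic_xor(parity, [0], count % 2)  # XOR with 1 flips bit 0; XOR with 0 is a no-op
    return parity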
atomic_xchg()
atomic_max()
atomic_min()
atomic_cas()
- helion.language.atomic_cas(target, index, expected, value, sem='relaxed')[source]
Atomically compare-and-swap a value at target[index].
If the current value equals expected, writes value. Otherwise leaves memory unchanged.
- Parameters:
- target (Tensor) – Tensor to update.
- index (list[object]) – Indices selecting elements to update. Can include tiles.
- expected (Tensor | float | bool) – Expected current value(s) used for comparison.
- value (Tensor | float | bool) – New value(s) to write if the comparison succeeds.
- sem (str) – Memory ordering semantics. One of "relaxed", "acquire", "release", "acq_rel". Defaults to "relaxed".
- Returns:
The previous value(s) stored at target[index] before the compare-and-swap.
- Return type:
Note
Triton CAS doesn’t support a masked form; our generated code uses an unmasked CAS and relies on index masking to avoid OOB.
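Example

A minimal sketch (the try_claim kernel is hypothetical): whichever block reaches the compare-and-swap first flips a shared flag from 0 to 1, and every later attempt leaves it unchanged.

@helion.kernel
def try_claim(x: torch.Tensor, flag: torch.Tensor) -> torch.Tensor:
    # flag: 1-element int32 tensor initialized to 0 ("unclaimed").
    for tile in hl.tile(x.size(0)):
        hl.atomic_cas(flag, [0], 0, 1)  # expected=0, new value=1
    return flag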
Inline Assembly
inline_asm_elementwise()
- helion.language.inline_asm_elementwise(asm, constraints, args, dtype, is_pure, pack)[source]
Execute inline assembly over a tensor. Essentially, this is a map operation where the function applied is inline assembly.
The input tensors args are implicitly broadcasted to the same shape. dtype can be a tuple of types, in which case the output is a tuple of tensors.
Each invocation of the inline asm processes pack elements at a time. Exactly which set of inputs a block receives is unspecified. Input elements of size less than 4 bytes are packed into 4-byte registers.
This op does not support empty dtype – the inline asm must return at least one tensor, even if you don’t need it. You can work around this by returning a dummy tensor of arbitrary type; it shouldn’t cost you anything if you don’t use it.
- Parameters:
- asm (str) – Assembly to run. Must match the target's assembly format.
- constraints (str) – Asm constraints in LLVM format.
- args (Sequence[Tensor]) – The input tensors, whose values are passed to the asm block.
- dtype (Union[dtype, Sequence[dtype]]) – The element type(s) of the returned tensor(s).
- is_pure (bool) – If true, the compiler assumes the asm block has no side-effects.
- pack (int) – The number of elements to be processed by one instance of inline assembly.
- Return type:
- Returns:
one tensor or a tuple of tensors of the given dtypes
Executes target-specific inline assembly on elements of one or more tensors with broadcasting and optional packed processing.
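Example

A hedged sketch (the shift_kernel below is hypothetical and assumes an NVIDIA target, since the asm string is PTX) applying a funnel-shift instruction elementwise to two int32 tensors:

@helion.kernel
def shift_kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a and b are int32 tensors of the same shape.
    out = torch.empty_like(a)
    for tile in hl.tile(a.size(0)):
        # $0 is the output register; $1 and $2 are the two inputs.
        shifted = hl.inline_asm_elementwise(
            "shf.l.wrap.b32 $0, $1, $2, 7;",
            "=r,r,r",
            [a[tile], b[tile]],
            torch.int32,
            is_pure=True,
            pack=1,
        )
        out[tile] = shifted
    return out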
Tensor Creation
zeros()
- helion.language.zeros(shape, dtype=torch.float32, device=None)[source]
Return a device-tensor filled with zeros.
Equivalent to hl.full(shape, 0.0 if dtype.is_floating_point else 0, dtype=dtype).
Note
Only use within hl.tile() loops for creating local tensors. For output tensor creation, use torch.zeros() with proper device placement.
- Parameters:
- Returns:
A device tensor of the given shape and dtype filled with zeros
- Return type:
Examples
@helion.kernel
def process_kernel(input: torch.Tensor) -> torch.Tensor:
    result = torch.empty_like(input)
    for tile in hl.tile(input.size(0)):
        buffer = hl.zeros([tile], dtype=input.dtype)  # Local buffer
        buffer += input[tile]  # Add input values to buffer
        result[tile] = buffer
    return result
full()
- helion.language.full(shape, value, dtype=torch.float32, device=None)[source]
Create a device-tensor filled with a specified value.
Note
Only use within hl.tile() loops for creating local tensors. For output tensor creation, use torch.full() with proper device placement.
- Parameters:
- Returns:
A device tensor of the given shape and dtype filled with value
- Return type:
Examples
@helion.kernel
def process_kernel(input: torch.Tensor) -> torch.Tensor:
    result = torch.empty_like(input)
    for tile in hl.tile(input.size(0)):
        # Create local buffer filled with initial value
        buffer = hl.full([tile], 0.0, dtype=input.dtype)
        buffer += input[tile]  # Add input values to buffer
        result[tile] = buffer
    return result
arange()
See arange() for details.
Tunable Parameters
register_block_size()
- helion.language.register_block_size(min_or_max, max_or_none=None, /)[source]
Explicitly register a block size that should be autotuned and can be used for allocations and inside hl.tile(…, block_size=…).
This is useful if you have two loops where you want them to share a block size, or if you need to allocate a kernel tensor before the hl.tile() loop.
- The signature can be one of:
hl.register_block_size(max)
hl.register_block_size(min, max)
Where min and max are integers that control the range of block sizes searched by the autotuner. max may be a symbolic shape, but min must be a constant integer.
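Example

A minimal sketch (the two_pass kernel is hypothetical) where two loops share one autotuned block size registered up front:

@helion.kernel
def two_pass(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    block = hl.register_block_size(x.size(0))  # one autotuned block size, shared by both loops
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = x[tile] * 2
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = out[tile] + 1
    return out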
register_tunable()
register_reduction_dim()
See register_reduction_dim() for details.
Tile Operations
Tile Class
- class helion.language.Tile(block_id)[source]
This class should not be instantiated directly; it is the result of hl.tile(…) and represents a single tile of the iteration space.
Tiles can be used as indices into tensors, e.g. tensor[tile]. Tiles can also be used as sizes for allocations, e.g. torch.empty([tile]). There are also properties such as tile.index, tile.begin, tile.end, tile.id, and tile.block_size that can be used to retrieve various information about the tile.
Masking is implicit for tiles: if the final tile is smaller than the block size, loading that tile will only load the valid elements, and reduction operations know to ignore the invalid elements.
- Parameters:
block_id (int)
The Tile class represents a portion of an iteration space with the following key attributes:
- begin: Starting indices of the tile
- end: Ending indices of the tile
- block_size: Size of the tile in each dimension
View Operations
subscript()
- helion.language.subscript(tensor, index)[source]
Equivalent to tensor[index] where tensor is a kernel-tensor (not a host-tensor).
Can be used to add dimensions to the tensor, e.g. tensor[None, :] or tensor[:, None].
- Parameters:
- Returns:
The indexed tensor with potentially modified dimensions
- Return type:
Examples
@helion.kernel
def broadcast_multiply(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x has shape (N,), y has shape (M,)
    result = torch.empty(
        [x.size(0), y.size(0)], dtype=x.dtype, device=x.device
    )
    for tile_i, tile_j in hl.tile([x.size(0), y.size(0)]):
        # Get tile data
        x_tile = x[tile_i]
        y_tile = y[tile_j]
        # Make x broadcastable: (tile_size, 1)
        # same as hl.subscript(x_tile, [slice(None), None])
        x_expanded = x_tile[:, None]
        # Make y broadcastable: (1, tile_size)
        # same as hl.subscript(y_tile, [None, slice(None)])
        y_expanded = y_tile[None, :]
        result[tile_i, tile_j] = x_expanded * y_expanded
    return result
Note
- Only supports None and : (slice(None)) indexing
- Used for reshaping kernel tensors by adding dimensions
- Prefer direct indexing syntax when possible: tensor[None, :]
- Does not support integer indexing or slicing with start/stop
StackTensor
StackTensor class
- class helion.language.StackTensor(tensor_like: torch.Tensor, dev_ptrs: torch.Tensor)[source]
This class should not be instantiated directly. It is the result of hl.stacktensor_like(…). It represents a batch of tensors with the same properties (shape, dtype, and stride) that reside at different memory locations, virtually stacked together.
StackTensor provides a way to perform parallel memory accesses to multiple tensors with a single subscript operation.
Core Concept:
Instead of performing separate memory operations on each tensor individually, StackTensor allows you to broadcast a single memory operation (hl.load, hl.store, hl.atomic_add, hl.signal, hl.wait etc.) to multiple tensor buffers in parallel. This is particularly useful for batch processing scenarios where the same operation needs to be applied to multiple tensors.
Memory Operation Behavior:
Loads: When you index into a StackTensor (e.g., stack_tensor[i]), it performs the same indexing operation on all underlying tensor buffers and returns a new tensor where the results are stacked according to the shape of dev_ptrs.
Stores: When you assign to a StackTensor (e.g., stack_tensor[i] = value), the value tensor is "unstacked": each slice of the value tensor is written to the respective underlying tensor buffer. This is the reverse operation of loading (e.g., value[j] is written to tensor_j[i]).
Shape Semantics:
The StackTensor’s shape is dev_ptrs.shape + tensor_like.shape, where:
dev_ptrs.shape becomes the stacking dimensions
tensor_like.shape represents the shape of each individual tensor
- tensor_like: Tensor
A template host tensor that defines the shape, dtype, and other properties for all tensors in the stack group. Must be a host tensor (created outside of the device loop).
stacktensor_like
- helion.language.stacktensor_like(tensor_like, dev_ptrs)[source]
Creates a StackTensor from a tensor of data pointers (dev_ptrs) that point to tensors with identical properties residing at different memory locations.
This function creates a StackTensor that allows you to broadcast memory operations to multiple tensor buffers in parallel.
Must be called inside a helion kernel with dev_ptrs as a device tensor and tensor_like as a host tensor.
- Parameters:
- tensor_like (Tensor) – A template host tensor that defines the shape, dtype, and other properties that each buffer in the stack group should have. Must be a host tensor.
- dev_ptrs (Tensor) – A tensor containing device pointers (memory addresses) to data buffers. Must be of dtype torch.uint64 and must be a device tensor.
Examples
Basic Load Operation:
@helion.kernel
def stack_load(dev_ptrs: torch.Tensor, example: torch.Tensor):
    for tile in hl.tile(example.size(0)):
        ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
        stack_tensor = hl.stacktensor_like(example, ptr_tile)
        # Load from all tensors simultaneously
        data = stack_tensor[tile]  # Shape: [num_tensors, tile_size]
    return data
Store Operation:
@helion.kernel
def stack_store(
    dev_ptrs: torch.Tensor, example: torch.Tensor, values: torch.Tensor
):
    ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
    stack_tensor = hl.stacktensor_like(example, ptr_tile)
    # Store values of shape [num_tensors, N] to all tensors in parallel
    stack_tensor[:] = values  # slice values[i, :] goes to tensor i
Usage Setup:
# Create list of tensors to process
tensor_list = [torch.randn(16, device="cuda") for _ in range(4)]
tensor_ptrs = torch.as_tensor(
    [p.data_ptr() for p in tensor_list], dtype=torch.uint64, device="cuda"
)
result = stack_load(tensor_ptrs, tensor_list[0])
- Return type:
- Returns:
A StackTensor object that broadcasts memory operations to all data buffers pointed to by dev_ptrs.
Reduction Operations
reduce()
See reduce() for details.
Scan Operations
associative_scan()
See associative_scan() for details.
cumsum()
See cumsum() for details.
cumprod()
See cumprod() for details.
tile_index()
- helion.language.tile_index(tile)[source]
Retrieve the index (a 1D tensor containing offsets) of the given tile. This can also be written as: tile.index.
Example usage:
@helion.kernel
def arange(length: int, device: torch.device) -> torch.Tensor:
    out = torch.empty(length, dtype=torch.int32, device=device)
    for tile in hl.tile(length):
        out[tile] = tile.index
    return out
- Parameters:
tile (TileInterface)
- Return type:
tile_begin()
tile_end()
- helion.language.tile_end(tile)[source]
Retrieve the end offset of the given tile. For every tile except the last, this is equivalent to tile.begin + tile.block_size. For the last tile, this is the end offset passed to hl.tile(). This can also be written as: tile.end.
- Parameters:
tile (TileInterface)
- Return type:
tile_block_size()
tile_id()
Synchronization
signal()
- helion.language.signal(signal_pad, index=None, signal=1, wait_for=None, scope='gpu', hasPreviousMemAccess=True)[source]
Set global memory barriers.
Sets global memory barriers to the specified value. If wait_for is not None, it waits for the barriers to be cleared before setting.
- Parameters:
- signal_pad (Tensor | StackTensor) – Tensor of global memory barriers to set
- index (list[object] | None) – Indices to index into the signal_pad tensor
- signal (int) – The value to send
- wait_for (int | None) – The value to wait for before sending the signal
- scope (str) – The scope of the lock (default: 'gpu')
- hasPreviousMemAccess (bool) – Whether the signal is preceded by a memory access (default: True)
- Return type:
- Returns:
The old value of the global memory barriers before the update.
wait()
- helion.language.wait(signal_pad, index=None, signal=1, update=None, scope='gpu', hasSubsequentMemAccess=True)[source]
Wait for global memory barriers.
Spins on global memory barriers until the signal value is observed on all barriers.
- Parameters:
- signal_pad (Tensor | StackTensor) – Tensor of global memory barriers to wait on
- index (list[object] | None) – Indices to index into the signal_pad tensor
- signal (int) – The value to wait for
- update (int | None) – Atomically update the signal_pad tensor with this value once the signal is observed (default: None)
- scope (str) – The scope of the lock (default: 'gpu')
- hasSubsequentMemAccess (bool) – Whether the wait is followed by a subsequent memory access (default: True)
- Return type:
- Returns:
None
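Example

A hedged sketch (both kernels and the per-tile flags layout are hypothetical): the producer writes an output tile and then raises the corresponding flag with hl.signal; the consumer spins on that flag with hl.wait before reading, so it never observes partially written data. This assumes both loops use the same fixed block size so tile ids line up, and that flags (one int32 slot per tile) is zero-initialized before the producer is launched.

@helion.kernel
def producer(x: torch.Tensor, out: torch.Tensor, flags: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(x.size(0), block_size=64):
        out[tile] = x[tile] * 2
        hl.signal(flags, [tile.id], signal=1)  # publish: this tile of `out` is ready
    return out

@helion.kernel
def consumer(out: torch.Tensor, flags: torch.Tensor, result: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(out.size(0), block_size=64):
        hl.wait(flags, [tile.id], signal=1)  # spin until the producer's flag is set
        result[tile] = out[tile] + 1
    return result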
Utilities
device_print()
See device_print() for details.
Constexpr Operations
constexpr()
See constexpr for details.
specialize()
See specialize() for details.
Matrix Operations
dot()
See dot() for details.