Language Module
The helion.language module contains the core DSL constructs for writing GPU kernels.
Loop Constructs
tile()
- helion.language.tile(begin_or_end, end_or_none=None, /, block_size=None)[source]
Break up an iteration space defined by a size or sequence of sizes into tiles.
The generated tiles can flatten the iteration space into the product of the sizes, perform multidimensional tiling, swizzle the indices for cache locality, reorder dimensions, etc. The only invariant is that every index in the range of the given sizes is covered exactly once.
The exact tiling strategy is determined by a Config object, typically created through autotuning.
If used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.
The key difference from grid() is that tile gives you Tile objects that load a slice of elements, while grid gives you scalar integer indices. It is recommended to use tile in most cases, since it allows more choices in autotuning.
- Parameters:
- begin_or_end (int | Tensor | Sequence[int | Tensor]) – If 2+ positional args are provided, the start of the iteration space. Otherwise, the end of the iteration space.
- end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If 2+ positional args are provided, the end of the iteration space.
- block_size (object) – Fixed block size (overrides autotuning), or None for an autotuned size.
- Returns:
Iterator over tile objects
- Return type:
Iterator[Tile] or Iterator[Sequence[Tile]]
Examples
One dimensional tiling:
@helion.kernel
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        # tile processes multiple elements at once
        result[tile] = x[tile] + y[tile]
    return result
Multi-dimensional tiling:
@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc
    return out
Fixed block size:
@helion.kernel
def process_with_fixed_block(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0), block_size=64):
        # Process with fixed block size of 64
        result[tile] = x[tile] * 2
    return result
Using tile properties:
@helion.kernel
def tile_info_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros([x.size(0)], dtype=x.dtype, device=x.device)
    for tile in hl.tile(x.size(0)):
        # Access tile properties
        start = tile.begin
        end = tile.end
        size = tile.block_size
        indices = tile.index  # [start, start+1, ..., end-1]
        # Use in computation
        result[tile] = x[tile] + indices
    return result
See also
- grid(): For explicit control over the launch grid
- tile_index(): For getting tile indices
- register_block_size(): For registering block sizes
Note
Similar to range() with multiple forms:
- tile(end) iterates 0 to end-1, autotuned block_size
- tile(begin, end) iterates begin to end-1, autotuned block_size
- tile(begin, end, block_size) iterates begin to end-1, fixed block_size
- tile(end, block_size=block_size) iterates 0 to end-1, fixed block_size
Block sizes can be registered for autotuning explicitly with register_block_size() and passed as the block_size argument if one needs two loops to use the same block size. Passing block_size=None is equivalent to calling register_block_size().
Use tile in most cases. Use grid when you need explicit control over the launch grid.
The tile() function is the primary way to create parallel loops in Helion kernels. It provides several key features:
Tiling Strategies: The exact tiling strategy is determined by a Config object, typically created through autotuning. This allows for:
- Multidimensional tiling
- Index swizzling for cache locality
- Dimension reordering
- Flattening of iteration spaces
Usage Patterns:
# Simple 1D tiling
for tile in hl.tile(1000):
    # tile.begin, tile.end, tile.block_size are available
    # Load entire tile (not just first element)
    data = tensor[tile]  # or hl.load(tensor, [tile]) for explicit loading

# 2D tiling
for tile_i, tile_j in hl.tile([height, width]):
    # Each tile represents a portion of the 2D space
    pass

# With explicit begin/end/block_size
for tile in hl.tile(0, 1000, block_size=64):
    pass
Grid vs Loop Behavior:
- When used at the top level of a kernel function, tile() becomes the grid of the kernel (parallel blocks)
- When used nested inside another loop, it becomes a sequential loop within each block, as sketched below
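The following minimal sketch (the scale_rows kernel is hypothetical, not part of the library) combines both behaviors: the top-level hl.grid becomes the launch grid with one block per row, while the nested hl.tile becomes a sequential loop over column tiles within each block.

@helion.kernel
def scale_rows(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for row in hl.grid(x.size(0)):       # top level: becomes the launch grid (scalar row index)
        for tile in hl.tile(x.size(1)):  # nested: sequential loop over column tiles
            out[row, tile] = x[row, tile] * 2
    return out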
grid()
- helion.language.grid(begin_or_end, end_or_none=None, /, step=None)[source]
Iterate over individual indices of the given iteration space.
The key difference from tile() is that grid gives you scalar integer indices (torch.SymInt), while tile gives you Tile objects that load a slice of elements. Use tile in most cases. Use grid when you need explicit control over the launch grid or when processing one element at a time.
Semantics are equivalent to:
for i in hl.tile(...):
    # i is a Tile object, accesses multiple elements
    data = tensor[i]  # loads slice of elements (1D tensor)
vs:
for i in hl.grid(...):
    # i is a scalar index, accesses single element
    data = tensor[i]  # loads single element (0D scalar)
When used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.
- Parameters:
- begin_or_end (int | Tensor | Sequence[int | Tensor]) – If 2+ positional args are provided, the start of the iteration space. Otherwise, the end of the iteration space.
- end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If 2+ positional args are provided, the end of the iteration space.
- step (int | Tensor | Sequence[int | Tensor] | None) – Step size for iteration (default: 1)
- Returns:
Iterator over scalar indices
- Return type:
Iterator[torch.SymInt] or Iterator[Sequence[torch.SymInt]]
See also
- tile(): For processing multiple elements at once
- tile_index(): For getting tile indices
- arange(): For creating index sequences
Note
Similar to range() with multiple forms:
- grid(end) iterates from 0 to end-1, step 1
- grid(begin, end) iterates from begin to end-1, step 1
- grid(begin, end, step) iterates from begin to end-1, given step
- grid(end, step=step) iterates from 0 to end-1, given step
Use tile in most cases. Use grid when you need explicit control over the launch grid.
The grid() function iterates over individual indices rather than tiles. It is equivalent to tile(size, block_size=1) but returns scalar indices instead of Tile objects.
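As an illustration, here is a minimal sketch (the copy_elements kernel is hypothetical, not part of the library) that uses grid() to process one element per index:

@helion.kernel
def copy_elements(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for i in hl.grid(x.size(0)):
        # i is a scalar torch.SymInt index; x[i] loads a single element
        out[i] = x[i]
    return out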
static_range()
- helion.language.static_range(begin_or_end, end_or_none=None, /, step=1)[source]
Create a range that gets unrolled at compile time by iterating over constant integer values.
This function is similar to Python’s built-in range(), but it generates a sequence of integer constants that triggers loop unrolling behavior in Helion kernels. The loop is completely unrolled at compile time, with each iteration becoming separate instructions in the generated code.
- Parameters:
- begin_or_end (int) – If 2+ positional args are provided, the start of the range. Otherwise, the end of the range.
- end_or_none (int | None) – If 2+ positional args are provided, the end of the range.
- step (int) – Step size for iteration (default: 1)
- Returns:
Iterator over constant integer values
- Return type:
Iterator[int]
Examples
Simple unrolled loop:
@helion.kernel
def unrolled_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        acc = torch.zeros([tile], dtype=x.dtype, device=x.device)
        # This loop gets completely unrolled
        for i in hl.static_range(3):
            acc += x[tile] * i
        result[tile] = acc
    return result
Range with start and step:
@helion.kernel
def kernel_stepped_unroll(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        acc = torch.zeros([tile], dtype=x.dtype, device=x.device)
        # Unroll loop from 2 to 8 with step 2: [2, 4, 6]
        for i in hl.static_range(2, 8, 2):
            acc += x[tile] * i
        result[tile] = acc
    return result
Note
Only constant integer values are supported
The range must be small enough to avoid compilation timeouts
Each iteration becomes separate instructions in the generated Triton code
Use for small, fixed iteration counts where unrolling is beneficial
static_range() behaves like a compile-time range for small loops: the loop body is fully unrolled, with each iteration emitted as separate instructions in the generated code.
Memory Operations
load()
- helion.language.load(tensor, index, extra_mask=None)[source]
Load a value from a tensor using a list of indices.
This function is equivalent to tensor[index] but allows setting extra_mask= to mask elements beyond the default masking based on the hl.tile range.
- Parameters:
- tensor (Tensor) – The tensor to load from
- index (list[object]) – The indices to use for loading. Can include tiles.
- extra_mask (Tensor | None) – Additional boolean mask applied on top of the default tile-based masking
- Returns:
The loaded value
- Return type:
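Example

A minimal sketch (the masked_copy kernel and its limit argument are hypothetical) showing extra_mask= on top of the default tile-boundary masking:

@helion.kernel
def masked_copy(x: torch.Tensor, limit: int) -> torch.Tensor:
    out = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        # Equivalent to x[tile], plus an additional mask: only elements with
        # index < limit are loaded; masked-off elements read as the fill value (typically 0).
        vals = hl.load(x, [tile], extra_mask=(tile.index < limit))
        out[tile] = vals
    return out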
store()
- helion.language.store(tensor, index, value, extra_mask=None)[source]
Store a value to a tensor using a list of indices.
This function is equivalent to tensor[index] = value but allows setting extra_mask= to mask elements beyond the default masking based on the hl.tile range.
- Parameters:
- tensor (Tensor) – The tensor to store into
- index (list[object]) – The indices to use for storing. Can include tiles.
- value (Tensor) – The value(s) to store
- extra_mask (Tensor | None) – Additional boolean mask applied on top of the default tile-based masking
- Return type:
- Returns:
None
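Example

A minimal sketch (the masked_store kernel and its limit argument are hypothetical) showing extra_mask= on a store; elements that fail the mask are left untouched in the output:

@helion.kernel
def masked_store(x: torch.Tensor, out: torch.Tensor, limit: int) -> torch.Tensor:
    for tile in hl.tile(x.size(0)):
        # Equivalent to out[tile] = x[tile] * 2, plus an additional mask on top
        # of the default tile-boundary masking.
        hl.store(out, [tile], x[tile] * 2, extra_mask=(tile.index < limit))
    return out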
atomic_add()
- helion.language.atomic_add(target, index, value, sem='relaxed')[source]
Atomically add a value to a target tensor.
Performs an atomic read-modify-write that adds value to target[index]. This is safe for concurrent access from multiple threads/blocks.
- Parameters:
- target (Tensor) – Tensor to update.
- index (list[object]) – Indices selecting elements to update. Can include tiles.
- value (Tensor | float) – Value(s) to add.
- sem (str) – Memory ordering semantics. One of "relaxed", "acquire", "release", "acq_rel". Defaults to "relaxed".
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
Example
@helion.kernel
def global_sum(x: torch.Tensor, result: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(x.size(0)):
        hl.atomic_add(result, [0], x[tile].sum())
    return result
Notes
Use for race-free accumulation across parallel execution.
Stronger memory-ordering semantics (e.g., "acq_rel") may reduce performance.
atomic_and()
- helion.language.atomic_and(target, index, value, sem='relaxed')[source]
Atomically apply bitwise AND with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
atomic_or()
- helion.language.atomic_or(target, index, value, sem='relaxed')[source]
Atomically apply bitwise OR with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
atomic_xor()
- helion.language.atomic_xor(target, index, value, sem='relaxed')[source]
Atomically apply bitwise XOR with value to target[index].
- Parameters:
- Returns:
The previous value(s) stored at target[index] before the update.
- Return type:
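Example

An illustrative sketch (the toggle_parity kernel is hypothetical), in the same spirit as the atomic_add example above: each tile XORs the parity of its nonzero count into a shared one-element flag, which is race-free across blocks.

@helion.kernel
def toggle_parity(x: torch.Tensor, parity: torch.Tensor) -> torch.Tensor:
    # parity: 1-element int32 tensor initialized to 0; bit 0 ends up set iff
    # the total number of nonzero elements in x is odd.
    for tile in hl.tile(x.size(0)):
        count = (x[tile] != 0).to(torch.int32).sum()
        hl.atomic_xor(parity, [0], count % 2)  # XOR with 1 flips bit 0; XOR with 0 is a no-op
    return parity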
atomic_xchg()
atomic_max()
atomic_min()
atomic_cas()
- helion.language.atomic_cas(target, index, expected, value, sem='relaxed')[source]
Atomically compare-and-swap a value at target[index].
If the current value equals expected, writes value. Otherwise leaves memory unchanged.
- Parameters:
- target (Tensor) – Tensor to update.
- index (list[object]) – Indices selecting elements to update. Can include tiles.
- expected (Tensor | float | bool) – Expected current value(s) used for comparison.
- value (Tensor | float | bool) – New value(s) to write if the comparison succeeds.
- sem (str) – Memory ordering semantics. One of "relaxed", "acquire", "release", "acq_rel". Defaults to "relaxed".
- Returns:
The previous value(s) stored at target[index] before the compare-and-swap.
- Return type:
Note
Triton CAS doesn’t support a masked form; our generated code uses an unmasked CAS and relies on index masking to avoid OOB.
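Example

A minimal sketch (the try_claim kernel is hypothetical): whichever block reaches the compare-and-swap first flips a shared flag from 0 to 1, and every later attempt leaves it unchanged.

@helion.kernel
def try_claim(x: torch.Tensor, flag: torch.Tensor) -> torch.Tensor:
    # flag: 1-element int32 tensor initialized to 0 ("unclaimed").
    for tile in hl.tile(x.size(0)):
        hl.atomic_cas(flag, [0], 0, 1)  # expected=0, new value=1
    return flag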
Inline Assembly
inline_asm_elementwise()
- helion.language.inline_asm_elementwise(asm, constraints, args, dtype, is_pure, pack)[source]
Execute inline assembly over a tensor. Essentially, this is a map operation where the function applied is inline assembly.
The input tensors args are implicitly broadcasted to the same shape. dtype can be a tuple of types, in which case the output is a tuple of tensors.
Each invocation of the inline asm processes pack elements at a time. Exactly which set of inputs a block receives is unspecified. Input elements of size less than 4 bytes are packed into 4-byte registers.
This op does not support empty dtype – the inline asm must return at least one tensor, even if you don’t need it. You can work around this by returning a dummy tensor of arbitrary type; it shouldn’t cost you anything if you don’t use it.
- Parameters:
- asm (str) – Assembly to run. Must match the target's assembly format.
- constraints (str) – Asm constraints in LLVM format.
- args (Sequence[Tensor]) – The input tensors, whose values are passed to the asm block.
- dtype (Union[dtype, Sequence[dtype]]) – The element type(s) of the returned tensor(s).
- is_pure (bool) – If true, the compiler assumes the asm block has no side-effects.
- pack (int) – The number of elements to be processed by one instance of inline assembly.
- Return type:
- Returns:
one tensor or a tuple of tensors of the given dtypes
Executes target-specific inline assembly on elements of one or more tensors with broadcasting and optional packed processing.
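Example

A hedged sketch (the shift_kernel below is hypothetical and assumes an NVIDIA target, since the asm string is PTX) applying a funnel-shift instruction elementwise to two int32 tensors:

@helion.kernel
def shift_kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a and b are int32 tensors of the same shape.
    out = torch.empty_like(a)
    for tile in hl.tile(a.size(0)):
        # $0 is the output register; $1 and $2 are the two inputs.
        shifted = hl.inline_asm_elementwise(
            "shf.l.wrap.b32 $0, $1, $2, 7;",
            "=r,r,r",
            [a[tile], b[tile]],
            torch.int32,
            is_pure=True,
            pack=1,
        )
        out[tile] = shifted
    return out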
Tensor Creation
zeros()
- helion.language.zeros(shape, dtype=torch.float32, device=None)[source]
Return a device-tensor filled with zeros.
Equivalent to hl.full(shape, 0.0 if dtype.is_floating_point else 0, dtype=dtype).
Note
Only use within hl.tile() loops for creating local tensors. For output tensor creation, use torch.zeros() with proper device placement.
- Parameters:
- Returns:
A device tensor of the given shape and dtype filled with zeros
- Return type:
Examples
@helion.kernel
def process_kernel(input: torch.Tensor) -> torch.Tensor:
    result = torch.empty_like(input)
    for tile in hl.tile(input.size(0)):
        buffer = hl.zeros([tile], dtype=input.dtype)  # Local buffer
        buffer += input[tile]  # Add input values to buffer
        result[tile] = buffer
    return result
full()
- helion.language.full(shape, value, dtype=torch.float32, device=None)[source]
Create a device-tensor filled with a specified value.
Note
Only use within hl.tile() loops for creating local tensors. For output tensor creation, use torch.full() with proper device placement.
- Parameters:
- Returns:
A device tensor of the given shape and dtype filled with value
- Return type:
Examples
@helion.kernel
def process_kernel(input: torch.Tensor) -> torch.Tensor:
    result = torch.empty_like(input)
    for tile in hl.tile(input.size(0)):
        # Create local buffer filled with initial value
        buffer = hl.full([tile], 0.0, dtype=input.dtype)
        buffer += input[tile]  # Add input values to buffer
        result[tile] = buffer
    return result
arange()
See arange() for details.
Tunable Parameters
register_block_size()
- helion.language.register_block_size(min_or_max, max_or_none=None, /)[source]
Explicitly register a block size that should be autotuned and can be used for allocations and inside hl.tile(…, block_size=…).
This is useful if you have two loops where you want them to share a block size, or if you need to allocate a kernel tensor before the hl.tile() loop.
- The signature can be one of:
hl.register_block_size(max)
hl.register_block_size(min, max)
Where min and max are integers that control the range of block sizes searched by the autotuner. max may be a symbolic shape, but min must be a constant integer.
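Example

A minimal sketch (the two_pass kernel is hypothetical) where two loops share one autotuned block size registered up front:

@helion.kernel
def two_pass(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    block = hl.register_block_size(x.size(0))  # one autotuned block size, shared by both loops
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = x[tile] * 2
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = out[tile] + 1
    return out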
register_tunable()
register_reduction_dim()
See register_reduction_dim() for details.
Tile Operations
Tile Class
- class helion.language.Tile(block_id)[source]
This class should not be instantiated directly; it is the result of hl.tile(…) and represents a single tile of the iteration space.
Tiles can be used as indices into tensors, e.g. tensor[tile]. Tiles can also be used as sizes for allocations, e.g. torch.empty([tile]). There are also properties such as tile.index, tile.begin, tile.end, tile.id, and tile.block_size that can be used to retrieve various information about the tile.
Masking is implicit for tiles: if the final tile is smaller than the block size, loading that tile will only load the valid elements, and reduction operations know to ignore the invalid elements.
- Parameters:
block_id (int)
The Tile class represents a portion of an iteration space with the following key attributes:
- begin: Starting indices of the tile
- end: Ending indices of the tile
- block_size: Size of the tile in each dimension
View Operations
subscript()
- helion.language.subscript(tensor, index)[source]
Equivalent to tensor[index] where tensor is a kernel-tensor (not a host-tensor).
Can be used to add dimensions to the tensor, e.g. tensor[None, :] or tensor[:, None].
- Parameters:
- Returns:
The indexed tensor with potentially modified dimensions
- Return type:
Examples
@helion.kernel
def broadcast_multiply(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x has shape (N,), y has shape (M,)
    result = torch.empty(
        [x.size(0), y.size(0)], dtype=x.dtype, device=x.device
    )
    for tile_i, tile_j in hl.tile([x.size(0), y.size(0)]):
        # Get tile data
        x_tile = x[tile_i]
        y_tile = y[tile_j]
        # Make x broadcastable: (tile_size, 1)
        # same as hl.subscript(x_tile, [slice(None), None])
        x_expanded = x_tile[:, None]
        # Make y broadcastable: (1, tile_size)
        # same as hl.subscript(y_tile, [None, slice(None)])
        y_expanded = y_tile[None, :]
        result[tile_i, tile_j] = x_expanded * y_expanded
    return result
Note
- Only supports None and : (slice(None)) indexing
- Used for reshaping kernel tensors by adding dimensions
- Prefer direct indexing syntax when possible: tensor[None, :]
- Does not support integer indexing or slicing with start/stop
StackTensor
StackTensor class
- class helion.language.StackTensor(tensor_like: torch.Tensor, dev_ptrs: torch.Tensor)[source]
This class should not be instantiated directly. It is the result of hl.stacktensor_like(…). It represents a batch of tensors with the same properties (shape, dtype, and stride) that reside at different memory locations, virtually stacked together.
StackTensor provides a way to perform parallel memory accesses to multiple tensors with a single subscript operation.
Core Concept:
Instead of performing separate memory operations on each tensor individually, StackTensor allows you to broadcast a single memory operation (hl.load, hl.store, hl.atomic_add, hl.signal, hl.wait etc.) to multiple tensor buffers in parallel. This is particularly useful for batch processing scenarios where the same operation needs to be applied to multiple tensors.
Memory Operation Behavior:
Loads: When you index into a StackTensor (e.g., stack_tensor[i]), it performs the same indexing operation on all underlying tensor buffers and returns a new tensor where the results are stacked according to the shape of dev_ptrs.
Stores: When you assign to a StackTensor (e.g., stack_tensor[i] = value), the value tensor is "unstacked": each slice of the value tensor is written to the respective underlying tensor buffer. This is the reverse operation of loading (e.g., value[j] is written to tensor_j[i]).
Shape Semantics:
The StackTensor’s shape is dev_ptrs.shape + tensor_like.shape, where:
dev_ptrs.shape becomes the stacking dimensions
tensor_like.shape represents the shape of each individual tensor
- tensor_like: Tensor
A template host tensor that defines the shape, dtype, and other properties for all tensors in the stack group. Must be a host tensor (created outside of the device loop).
stacktensor_like
- helion.language.stacktensor_like(tensor_like, dev_ptrs)[source]
Creates a StackTensor from a tensor of data pointers (dev_ptrs) that point to tensors with identical properties residing at different memory locations.
This function creates a StackTensor that allows you to broadcast memory operations to multiple tensor buffers in parallel.
Must be called inside a helion kernel with dev_ptrs as a device tensor and tensor_like as a host tensor.
- Parameters:
- tensor_like (Tensor) – A template host tensor that defines the shape, dtype, and other properties that each buffer in the stack group should have. Must be a host tensor.
- dev_ptrs (Tensor) – A tensor containing device pointers (memory addresses) to data buffers. Must be of dtype torch.uint64 and must be a device tensor.
Examples
Basic Load Operation:
@helion.kernel
def stack_load(dev_ptrs: torch.Tensor, example: torch.Tensor):
    for tile in hl.tile(example.size(0)):
        ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
        stack_tensor = hl.stacktensor_like(example, ptr_tile)
        # Load from all tensors simultaneously
        data = stack_tensor[tile]  # Shape: [num_tensors, tile_size]
    return data
Store Operation:
@helion.kernel
def stack_store(
    dev_ptrs: torch.Tensor, example: torch.Tensor, values: torch.Tensor
):
    ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
    stack_tensor = hl.stacktensor_like(example, ptr_tile)
    # Store values of shape [num_tensors, N] to all tensors in parallel
    stack_tensor[:] = values  # slice values[i, :] goes to tensor i
Usage Setup:
# Create list of tensors to process
tensor_list = [torch.randn(16, device="cuda") for _ in range(4)]
tensor_ptrs = torch.as_tensor(
    [p.data_ptr() for p in tensor_list], dtype=torch.uint64, device="cuda"
)
result = stack_load(tensor_ptrs, tensor_list[0])
- Return type:
- Returns:
A StackTensor object that broadcasts memory operations to all data buffers pointed to by dev_ptrs.
Reduction Operations
reduce()
See reduce() for details.
Scan Operations
associative_scan()
See associative_scan() for details.
cumsum()
See cumsum() for details.
cumprod()
See cumprod() for details.
tile_index()
- helion.language.tile_index(tile)[source]
Retrieve the index (a 1D tensor containing offsets) of the given tile. This can also be written as: tile.index.
Example usage:
@helion.kernel
def arange(length: int, device: torch.device) -> torch.Tensor:
    out = torch.empty(length, dtype=torch.int32, device=device)
    for tile in hl.tile(length):
        out[tile] = tile.index
    return out
- Parameters:
tile (TileInterface)
- Return type:
tile_begin()
tile_end()
- helion.language.tile_end(tile)[source]
Retrieve the end offset of the given tile. For every tile except the last, this is equivalent to tile.begin + tile.block_size. For the last tile, this is the end offset passed to hl.tile(). This can also be written as: tile.end.
- Parameters:
tile (TileInterface)
- Return type:
tile_block_size()
tile_id()
Synchronization
signal()
- helion.language.signal(signal_pad, index=None, signal=1, wait_for=None, scope='gpu', hasPreviousMemAccess=True)[source]
Set global memory barriers.
Sets global memory barriers to the specified value. If wait_for is not None, it waits for the barriers to be cleared before setting.
- Parameters:
- signal_pad (Tensor | StackTensor) – Tensor of global memory barriers to set
- index (list[object] | None) – Indices to index into the signal_pad tensor
- signal (int) – The value to send
- wait_for (int | None) – The value to wait for before sending the signal
- scope (str) – The scope of the lock (default: 'gpu')
- hasPreviousMemAccess (bool) – Whether the signal is preceded by a memory access (default: True)
- Return type:
- Returns:
The old value of the global memory barriers before the update.
wait()
- helion.language.wait(signal_pad, index=None, signal=1, update=None, scope='gpu', hasSubsequentMemAccess=True)[source]
Wait for global memory barriers.
Spins on global memory barriers until the signal value is observed on all barriers.
- Parameters:
- signal_pad (Tensor | StackTensor) – Tensor of global memory barriers to wait on
- index (list[object] | None) – Indices to index into the signal_pad tensor
- signal (int) – The value to wait for
- update (int | None) – Atomically update the signal_pad tensor with this value once the signal is observed (default: None)
- scope (str) – The scope of the lock (default: 'gpu')
- hasSubsequentMemAccess (bool) – Whether the wait is followed by a subsequent memory access (default: True)
- Return type:
- Returns:
None
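Example

A hedged sketch (both kernels and the per-tile flags layout are hypothetical): the producer writes an output tile and then raises the corresponding flag with hl.signal; the consumer spins on that flag with hl.wait before reading, so it never observes partially written data. This assumes both loops use the same fixed block size so tile ids line up, and that flags (one int32 slot per tile) is zero-initialized before the producer is launched.

@helion.kernel
def producer(x: torch.Tensor, out: torch.Tensor, flags: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(x.size(0), block_size=64):
        out[tile] = x[tile] * 2
        hl.signal(flags, [tile.id], signal=1)  # publish: this tile of `out` is ready
    return out

@helion.kernel
def consumer(out: torch.Tensor, flags: torch.Tensor, result: torch.Tensor) -> torch.Tensor:
    for tile in hl.tile(out.size(0), block_size=64):
        hl.wait(flags, [tile.id], signal=1)  # spin until the producer's flag is set
        result[tile] = out[tile] + 1
    return result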
Utilities
device_print()
See device_print() for details.
Constexpr Operations
constexpr()
See constexpr for details.
specialize()
See specialize() for details.
Matrix Operations
dot()
See dot() for details.