`tile`	Break up an iteration space defined by a size or sequence of sizes into tiles.
`grid`	Iterate over individual indices of the given iteration space.
`static_range`	Create a range that gets unrolled at compile time by iterating over constant integer values.
`load`	Load a value from a tensor using a list of indices.
`store`	Store a value to a tensor using a list of indices.
`atomic_add`	Atomically add a value to a target tensor.
`atomic_and`	Atomically apply bitwise AND with `value` to `target[index]`.
`atomic_or`	Atomically apply bitwise OR with `value` to `target[index]`.
`atomic_xor`	Atomically apply bitwise XOR with `value` to `target[index]`.
`atomic_xchg`	Atomically exchange (set) a value at `target[index]`.
`atomic_max`	Atomically update `target[index]` with the maximum of current value and `value`.
`atomic_min`	Atomically update `target[index]` with the minimum of current value and `value`.
`atomic_cas`	Atomically compare-and-swap a value at `target[index]`.
`device_print`	Print values from device code.
`signal`	Set global memory barriers.
`wait`	Wait for global memory barriers.
`stacktensor_like`	Create a StackTensor from a tensor of data pointers (dev_ptrs) that point to like tensors residing at different memory locations.
`zeros`	Return a device-tensor filled with zeros.
`full`	Create a device-tensor filled with a specified value.
`arange`	Same as torch.arange(), but defaults to same device as the current kernel.
`subscript`	Equivalent to tensor[index] where tensor is a kernel-tensor (not a host-tensor).
`split`	Split the last dimension of a tensor with size two into two separate tensors.
`join`	Join two tensors along a new minor dimension.
`reduce`	Applies a reduction operation along a specified dimension or all dimensions.
`associative_scan`	Applies an associative scan operation along a specified dimension.
`cumsum`	Compute the cumulative sum along a specified dimension.
`cumprod`	Compute the cumulative product along a specified dimension.
`dot`	Performs a matrix multiplication of tensors with support for multiple dtypes.
`inline_asm_elementwise`	Execute inline assembly over a tensor.
`inline_triton`	Inline a raw Triton snippet inside a Helion kernel.
`register_block_size`	Explicitly register a block size that should be autotuned and can be used for allocations and inside hl.tile(..., block_size=...).
`register_tunable`	Register a tunable parameter for autotuning.
`constexpr`	alias of `ConstExpr`
`specialize`	Turn dynamic shapes into compile-time constants.
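
The scan entries in the table above (`associative_scan`, `cumsum`, `cumprod`) can be read as one operation with different combine functions. The sketch below is a plain-Python model of that relationship, not Helion code: `associative_scan_model` is a hypothetical helper, it works on Python lists rather than device tensors, and it assumes an inclusive scan (as in `torch.cumsum`). Helion's actual signatures are not shown in this table and may differ.

```python
import operator


def associative_scan_model(combine, values):
    """Inclusive scan: out[i] = combine(out[i - 1], values[i]).

    Hypothetical model for illustration only; `combine` is assumed to be
    associative, which is what lets a real backend evaluate it in parallel.
    """
    out = []
    for v in values:
        # The first element is passed through; later ones fold in the running result.
        out.append(v if not out else combine(out[-1], v))
    return out


data = [1, 2, 3, 4]
cumsum_result = associative_scan_model(operator.add, data)   # running sum
cumprod_result = associative_scan_model(operator.mul, data)  # running product
running_max = associative_scan_model(max, [3, 1, 4, 1, 5])   # running maximum
```

Under this model, `cumsum` and `cumprod` are the scan specialized to addition and multiplication, and any other associative function (such as `max`) fits the same pattern.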

Language Classes

`Tile`	This class should not be instantiated directly, it is the result of hl.tile(...) and represents a single tile of the iteration space.
`StackTensor`	This class should not be instantiated directly.

Tile Helpers

`tile_index`	Retrieve the index (a 1D tensor containing offsets) of the given tile.
`tile_begin`	Retrieve the start offset of the given tile.
`tile_end`	Retrieve the end offset of the given tile.
`tile_block_size`	Retrieve the block size of a given tile, usually set by the autotuner.
`tile_id`	Retrieve tile_id of a given tile or list of tiles.
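
To make the relationship between the tile helpers concrete, here is a plain-Python model of a one-dimensional tiling. It is a sketch under stated assumptions, not Helion's implementation: `TileModel` and `make_tiles` are hypothetical names, the end offset is assumed to be exclusive and clamped to the extent, and the tile id is assumed to be the begin offset divided by the block size. Real tiles are produced by `hl.tile(...)` inside a kernel and their index is a device tensor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileModel:
    """Hypothetical stand-in for one tile of a 1D iteration space."""

    begin: int       # start offset (models tile_begin)
    end: int         # exclusive end offset, clamped to the extent (models tile_end)
    block_size: int  # nominal tile width (models tile_block_size)

    @property
    def id(self):
        # Position of this tile in the sequence of tiles (models tile_id).
        return self.begin // self.block_size

    @property
    def index(self):
        # Offsets covered by this tile (models tile_index, as a list here).
        return list(range(self.begin, self.end))


def make_tiles(extent, block_size):
    """Split range(extent) into consecutive tiles of at most block_size."""
    return [
        TileModel(b, min(b + block_size, extent), block_size)
        for b in range(0, extent, block_size)
    ]


tiles = make_tiles(extent=10, block_size=4)
for t in tiles:
    print(t.id, t.begin, t.end, t.index)
```

In this model the last tile is partial: it keeps the nominal block size of 4 but covers only two offsets, which is the case a real kernel has to guard against when the extent is not a multiple of the block size.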