helion.language.stacktensor_like

helion.language.stacktensor_like(tensor_like, dev_ptrs)[source]

Creates a StackTensor from a tensor of data pointers (dev_ptrs), where each pointer references a tensor with the same shape and dtype as tensor_like, residing at a different memory location.

This function creates a StackTensor that allows you to broadcast memory operations to multiple tensor buffers in parallel.

Must be called inside a Helion kernel, with dev_ptrs as a device tensor and tensor_like as a host tensor.

Parameters:
  • tensor_like (Tensor) – A template host tensor that defines the shape, dtype, and other properties that each buffer in the stack group should have. Must be a host tensor.

  • dev_ptrs (Tensor) – A tensor containing device pointers (memory addresses) to data buffers. Must be of dtype torch.uint64 and must be a device tensor.

Examples

Basic Load Operation:

@helion.kernel
def stack_load(dev_ptrs: torch.Tensor, example: torch.Tensor):
    for tile in hl.tile(example.size(0)):
        ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
        stack_tensor = hl.stacktensor_like(example, ptr_tile)
        # Load from all tensors simultaneously
        data = stack_tensor[tile]  # Shape: [num_tensors, tile_size]
    return data
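The broadcast-load semantics can be modeled outside Helion: indexing a StackTensor with a tile is equivalent to reading the same tile from every underlying buffer and stacking the results along a new leading axis. A minimal NumPy sketch of that behavior (the buffer list and helper function are illustrative stand-ins, not part of the Helion API):

```python
import numpy as np

# Stand-ins for the device buffers that each dev_ptr would reference.
buffers = [np.arange(16, dtype=np.float32) + 100 * i for i in range(4)]

def stack_load_model(buffers, tile):
    # Read the same tile from every buffer and stack the results along
    # a new leading axis, mirroring data = stack_tensor[tile].
    return np.stack([buf[tile] for buf in buffers])

out = stack_load_model(buffers, slice(0, 8))
print(out.shape)  # (4, 8): [num_tensors, tile_size]
```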

Store Operation:

@helion.kernel
def stack_store(
    dev_ptrs: torch.Tensor, example: torch.Tensor, values: torch.Tensor
):
    ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
    stack_tensor = hl.stacktensor_like(example, ptr_tile)

    # Store values of shape [num_tensors, N] to all tensors in parallel
    stack_tensor[:] = values  # slice values[i, :] goes to tensor i
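The store direction can be modeled the same way: slice i of values is written into buffer i. A hypothetical NumPy sketch of the intended data movement (again a stand-in, not the Helion API):

```python
import numpy as np

# Stand-ins for three device buffers of length 4 behind the dev_ptrs.
buffers = [np.zeros(4, dtype=np.float32) for _ in range(3)]
values = np.arange(12, dtype=np.float32).reshape(3, 4)

# Mirror stack_tensor[:] = values: slice values[i, :] goes to buffer i.
for i, buf in enumerate(buffers):
    buf[:] = values[i]
```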

Usage Setup:

# Create list of tensors to process
tensor_list = [torch.randn(16, device="cuda") for _ in range(4)]
tensor_ptrs = torch.as_tensor(
    [p.data_ptr() for p in tensor_list], dtype=torch.uint64, device="cuda"
)
result = stack_load(tensor_ptrs, tensor_list[0])
Return type:
  StackTensor

Returns:
  A StackTensor object that broadcasts memory operations to all data buffers pointed to by dev_ptrs.