helion.language.stacktensor_like
- helion.language.stacktensor_like(tensor_like, dev_ptrs)
Creates a StackTensor from a tensor of data pointers (dev_ptrs), each pointing to a tensor of identical shape and dtype residing at a different memory location.
This function creates a StackTensor that allows you to broadcast memory operations to multiple tensor buffers in parallel.
Must be called inside a helion kernel with dev_ptrs as a device tensor and tensor_like as a host tensor.
- Parameters:
  - tensor_like (Tensor) – A template host tensor that defines the shape, dtype, and other properties that each buffer in the stack group should have. Must be a host tensor.
  - dev_ptrs (Tensor) – A tensor containing device pointers (memory addresses) to data buffers. Must be of dtype torch.uint64 and must be a device tensor.
Examples
Basic Load Operation:
```python
import torch
import helion
import helion.language as hl


@helion.kernel
def stack_load(dev_ptrs: torch.Tensor, example: torch.Tensor) -> torch.Tensor:
    out = torch.empty(
        [dev_ptrs.size(0), example.size(0)], dtype=example.dtype, device=example.device
    )
    for tile in hl.tile(example.size(0)):
        ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
        stack_tensor = hl.stacktensor_like(example, ptr_tile)
        # Load from all tensors simultaneously
        out[:, tile] = stack_tensor[tile]  # Shape: [num_tensors, tile_size]
    return out
```
Store Operation:
```python
@helion.kernel
def stack_store(
    dev_ptrs: torch.Tensor, example: torch.Tensor, values: torch.Tensor
):
    ptr_tile = dev_ptrs[:]  # Shape: [num_tensors]
    stack_tensor = hl.stacktensor_like(example, ptr_tile)
    # Store values of shape [num_tensors, N] to all tensors in parallel
    stack_tensor[:] = values  # slice values[i, :] goes to tensor i
```
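Host-side setup for the store kernel above might look like the following sketch; the verification loop encodes the assumption, stated in the comments above, that row i of values lands in the i-th buffer:

```python
tensor_list = [torch.empty(16, device="cuda") for _ in range(4)]
tensor_ptrs = torch.as_tensor(
    [t.data_ptr() for t in tensor_list], dtype=torch.uint64, device="cuda"
)
values = torch.randn(4, 16, device="cuda")  # one row per destination buffer
stack_store(tensor_ptrs, tensor_list[0], values)

# Each destination buffer should now hold its row of `values`
# (an assumption based on the comments in the store example).
for i, t in enumerate(tensor_list):
    torch.testing.assert_close(t, values[i])
```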
Usage Setup:
```python
# Create list of tensors to process
tensor_list = [torch.randn(16, device="cuda") for _ in range(4)]
tensor_ptrs = torch.as_tensor(
    [p.data_ptr() for p in tensor_list], dtype=torch.uint64, device="cuda"
)
result = stack_load(tensor_ptrs, tensor_list[0])
```
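Assuming stack_load fills one output row per buffer, as in the load example above, the result can be sanity-checked against an ordinary torch.stack:

```python
# Row i of the result should equal tensor_list[i].
expected = torch.stack(tensor_list, dim=0)  # Shape: [num_tensors, 16]
torch.testing.assert_close(result, expected)
```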
- Return type: StackTensor
- Returns: A StackTensor object that broadcasts memory operations to all data buffers pointed to by dev_ptrs.
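Conceptually, indexing a StackTensor performs the same read on every underlying buffer and stacks the results along a new leading dimension of size num_tensors. The following pure-PyTorch sketch illustrates that semantics for a 1-D load; the emulate_stack_load helper is hypothetical, for illustration only, and is not part of the Helion API:

```python
import torch


def emulate_stack_load(
    tensors: list[torch.Tensor], start: int, length: int
) -> torch.Tensor:
    # Read the same [start : start + length] slice from every buffer and
    # stack the reads along a new leading dimension: [num_tensors, length].
    return torch.stack([t[start : start + length] for t in tensors], dim=0)


buffers = [torch.randn(16) for _ in range(4)]
tile = emulate_stack_load(buffers, 0, 8)
assert tile.shape == (4, 8)
```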