I’m currently trying to apply tensorization in TVM to replace a data-loading loop that involves padding, but I’m running into a problem with non-divisible tile dimensions.
My original input tensor has shape [1, 48, 32, 32] (NCHW). After padding (e.g., 1-pixel padding on the spatial dimensions), the padded tensor becomes [1, 48, 34, 34]. I then tile the spatial dimensions into 16 × 16 blocks, which, because of the padding, requires loading 18 × 18 elements per tile (to include the halo region needed by the convolution).
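For reference, here is a minimal sketch that reproduces the shapes involved. The `te.compute` padding below is a simplification standing in for my real workload, not the actual kernel:

```python
import tvm
from tvm import te

N, C, H, W = 1, 48, 32, 32   # input shape (NCHW)
PAD, TILE = 1, 16            # 1-pixel spatial padding, 16x16 spatial tiles

data = te.placeholder((N, C, H, W), name="data")
# Zero-pad H and W by 1 on each side: [1, 48, 32, 32] -> [1, 48, 34, 34]
pad_temp = te.compute(
    (N, C, H + 2 * PAD, W + 2 * PAD),
    lambda n, c, h, w: tvm.tir.if_then_else(
        tvm.tir.all(h >= PAD, h < H + PAD, w >= PAD, w < W + PAD),
        data[n, c, h - PAD, w - PAD],
        tvm.tir.const(0.0, data.dtype),
    ),
    name="pad_temp",
)
# Each 16x16 spatial tile must read a (TILE + 2 * PAD) = 18x18 halo window
# from pad_temp; the windows overlap (stride 16, width 18) over a 34-wide axis.
```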
In the TIR schedule, I have a block like this:
pad_temp_shared_gemm_i = T.alloc_buffer((1, 48, 34, 34), scope="shared_gemm_i")
for nn, ff_0, xx_0, yy_0 in T.grid(1, 4, 2, 2):
    for ax0, ax1, ax2 in T.grid(48, 18, 18):
        with T.block("pad_temp_shared_gemm_i"):
            v0 = T.axis.spatial(1, 0)
            v1 = T.axis.spatial(48, ax0)
            v2 = T.axis.spatial(34, yy_0 * 16 + ax1)
            v3 = T.axis.spatial(34, xx_0 * 16 + ax2)
            T.reads(pad_temp[v0, v1, v2, v3])
            T.writes(pad_temp_shared_gemm_i[v0, v1, v2, v3])
            pad_temp_shared_gemm_i[v0, v1, v2, v3] = pad_temp[v0, v1, v2, v3]
I want to tensorize the inner loops (ax0, ax1, ax2 with extents [48, 18, 18]) into an external DMA operation (via call_extern). However, TVM’s tensorize or blockize mechanism fails because the tile size (18) does not evenly divide the padded dimension (34): 34 / 18 is not an integer, so the loop bounds are misaligned and TVM cannot form the regular, tensorizable block pattern it expects.
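To make the intent concrete, this is roughly what I am trying to do; the intrinsic name `dma_load_18x18` and the extern symbol `my_dma_copy` are placeholders for my actual DMA primitive, and the 18 × 18 float32 shape is just the per-tile window from above:

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin

# Placeholder 18x18 copy intrinsic; "my_dma_copy" stands in for the real DMA routine.
@T.prim_func
def dma_load_desc(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (18, 18), "float32", offset_factor=1)
    B = T.match_buffer(b, (18, 18), "float32", offset_factor=1, scope="shared_gemm_i")
    with T.block("root"):
        T.reads(A[0:18, 0:18])
        T.writes(B[0:18, 0:18])
        for i, j in T.grid(18, 18):
            with T.block("copy"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj]

@T.prim_func
def dma_load_impl(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (18, 18), "float32", offset_factor=1)
    B = T.match_buffer(b, (18, 18), "float32", offset_factor=1, scope="shared_gemm_i")
    with T.block("root"):
        T.reads(A[0:18, 0:18])
        T.writes(B[0:18, 0:18])
        # Hand the whole tile to the external DMA engine in one call.
        T.evaluate(T.call_extern(
            "int32", "my_dma_copy",
            B.access_ptr("w"), A.access_ptr("r"), 18 * 18,
        ))

TensorIntrin.register("dma_load_18x18", dma_load_desc, dma_load_impl)

# The attempt that fails (loop/block names as in the TIR printout above):
# sch = tvm.tir.Schedule(mod)
# blk = sch.get_block("pad_temp_shared_gemm_i")
# ax0, ax1, ax2 = sch.get_loops(blk)[-3:]
# sch.tensorize(ax1, "dma_load_18x18")  # rejected: 18 does not divide the 34-wide iter domain
```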
How can I work around this limitation while still using tensorization for DMA offloading?