I’m currently trying to apply tensorization in TVM to replace a data-loading loop that involves padding, but I’m running into a problem with non-divisible tile dimensions.
My original input tensor has shape [1, 48, 32, 32] (NCHW). After padding (e.g., 1-pixel padding on the spatial dimensions), the padded tensor becomes [1, 48, 34, 34]. I then tile the spatial dimensions into 16 × 16 blocks, which, because of the padding, requires loading 18 × 18 elements per tile (to include the halo region needed by the convolution).
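For reference, here is a minimal sketch that reproduces the shapes involved. The `te.compute` padding below is a simplification standing in for my real workload, not the actual kernel:

```python
import tvm
from tvm import te

N, C, H, W = 1, 48, 32, 32   # input shape (NCHW)
PAD, TILE = 1, 16            # 1-pixel spatial padding, 16x16 spatial tiles

data = te.placeholder((N, C, H, W), name="data")
# Zero-pad H and W by 1 on each side: [1, 48, 32, 32] -> [1, 48, 34, 34]
pad_temp = te.compute(
    (N, C, H + 2 * PAD, W + 2 * PAD),
    lambda n, c, h, w: tvm.tir.if_then_else(
        tvm.tir.all(h >= PAD, h < H + PAD, w >= PAD, w < W + PAD),
        data[n, c, h - PAD, w - PAD],
        tvm.tir.const(0.0, data.dtype),
    ),
    name="pad_temp",
)
# Each 16x16 spatial tile must read a (TILE + 2 * PAD) = 18x18 halo window
# from pad_temp; the windows overlap (stride 16, width 18) over a 34-wide axis.
```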
In the TIR schedule, I have a block like this:
pad_temp_shared_gemm_i = T.alloc_buffer((1, 48, 34, 34), scope="shared_gemm_i")
for nn, ff_0, xx_0, yy_0 in T.grid(1, 4, 2, 2):
    for ax0, ax1, ax2 in T.grid(48, 18, 18):
        with T.block("pad_temp_shared_gemm_i"):
            v0 = T.axis.spatial(1, 0)
            v1 = T.axis.spatial(48, ax0)
            v2 = T.axis.spatial(34, yy_0 * 16 + ax1)
            v3 = T.axis.spatial(34, xx_0 * 16 + ax2)
            T.reads(pad_temp[v0, v1, v2, v3])
            T.writes(pad_temp_shared_gemm_i[v0, v1, v2, v3])
            pad_temp_shared_gemm_i[v0, v1, v2, v3] = pad_temp[v0, v1, v2, v3]
I want to tensorize the inner loops (ax0, ax1, ax2 with extents [48, 18, 18]) into an external DMA operation (via call_extern). However, TVM’s tensorize or blockize mechanism fails because the tile size (18) does not evenly divide the padded dimension (34): 34 / 18 is not an integer, so the loop bounds are misaligned and TVM cannot form the regular, tensorizable block pattern it expects.
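To make the intent concrete, this is roughly what I am trying to do; the intrinsic name `dma_load_18x18` and the extern symbol `my_dma_copy` are placeholders for my actual DMA primitive, and the 18 × 18 float32 shape is just the per-tile window from above:

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin

# Placeholder 18x18 copy intrinsic; "my_dma_copy" stands in for the real DMA routine.
@T.prim_func
def dma_load_desc(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (18, 18), "float32", offset_factor=1)
    B = T.match_buffer(b, (18, 18), "float32", offset_factor=1, scope="shared_gemm_i")
    with T.block("root"):
        T.reads(A[0:18, 0:18])
        T.writes(B[0:18, 0:18])
        for i, j in T.grid(18, 18):
            with T.block("copy"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj]

@T.prim_func
def dma_load_impl(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (18, 18), "float32", offset_factor=1)
    B = T.match_buffer(b, (18, 18), "float32", offset_factor=1, scope="shared_gemm_i")
    with T.block("root"):
        T.reads(A[0:18, 0:18])
        T.writes(B[0:18, 0:18])
        # Hand the whole tile to the external DMA engine in one call.
        T.evaluate(T.call_extern(
            "int32", "my_dma_copy",
            B.access_ptr("w"), A.access_ptr("r"), 18 * 18,
        ))

TensorIntrin.register("dma_load_18x18", dma_load_desc, dma_load_impl)

# The attempt that fails (loop/block names as in the TIR printout above):
# sch = tvm.tir.Schedule(mod)
# blk = sch.get_block("pad_temp_shared_gemm_i")
# ax0, ax1, ax2 = sch.get_loops(blk)[-3:]
# sch.tensorize(ax1, "dma_load_18x18")  # rejected: 18 does not divide the 34-wide iter domain
```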
How can I work around this limitation while still using tensorization for DMA offloading?