Thread STFT magnitude and vectorize bilinear interpolation

## kernel_conv.cpp — Conv1D & Signal Ops

| Change | Impact |
|--------|--------|
| **Thread `cactus_stft_magnitude_f16`** — add `parallel_for_2d` over `(N, num_fft_bins)` | High — currently single-threaded, bottleneck for Silero VAD |
| **Vectorize `cactus_bilinear_interpolation_f16`** — the innermost `embed_dim` loop is pure scalar; replace with 8-wide `vld1q_f16` + `vfmaq_f16` | Medium — bottleneck for vision encoders |
| **Improve depthwise conv gather** — for `dilation == 1`, the input slice is contiguous and can use `vld1q_f16` directly instead of scalar gather into a stack array | Medium |
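
The threading change in the first row can be sketched as follows. This is a minimal illustration, not the repository's code: `parallel_for_2d_sketch` is a hypothetical stand-in for the existing `parallel_for_2d` helper (whose real signature may differ), `stft_magnitude_sketch` is an invented name, `float` stands in for f16 so the example compiles on any host, and the real and imaginary parts are assumed to be stored as separate row-major `(N, num_fft_bins)` planes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the repo's parallel_for_2d; the real helper may differ.
// Flattens the (rows x cols) index space and gives each worker one contiguous chunk.
inline void parallel_for_2d_sketch(
    std::size_t rows, std::size_t cols,
    const std::function<void(std::size_t, std::size_t)>& body,
    unsigned num_threads = std::thread::hardware_concurrency()) {
    const std::size_t total = rows * cols;
    if (total == 0) return;
    // hardware_concurrency() may return 0, so clamp the worker count to at least 1.
    const std::size_t workers =
        std::max<std::size_t>(1, std::min<std::size_t>(num_threads, total));
    const std::size_t chunk = (total + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(total, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([=, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i / cols, i % cols);
        });
    }
    for (auto& t : pool) t.join();
}

// Assumed layout: re and im are row-major (N, num_fft_bins) planes.
// Each (n, k) output depends only on its own inputs, so workers never write
// to the same element and no locking is needed.
inline void stft_magnitude_sketch(const std::vector<float>& re,
                                  const std::vector<float>& im,
                                  std::vector<float>& mag,
                                  std::size_t N, std::size_t num_fft_bins,
                                  unsigned num_threads) {
    assert(re.size() == N * num_fft_bins && im.size() == re.size());
    mag.resize(re.size());
    parallel_for_2d_sketch(
        N, num_fft_bins,
        [&](std::size_t n, std::size_t k) {
            const std::size_t idx = n * num_fft_bins + k;
            mag[idx] = std::sqrt(re[idx] * re[idx] + im[idx] * im[idx]);
        },
        num_threads);
}
```

Giving each worker a contiguous chunk of the flattened index space keeps its writes adjacent in memory, which should limit false sharing between threads. Whether the existing `parallel_for_2d` partitions work the same way would need to be checked against the helper itself.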

Issue #406