kernel_conv.cpp: Conv1D & Signal Ops

- Thread `cactus_stft_magnitude_f16`: add `parallel_for_2d` over `(N, num_fft_bins)`. High: currently single-threaded, a bottleneck for Silero VAD.
- Vectorize `cactus_bilinear_interpolation_f16`: the innermost `embed_dim` loop is pure scalar; replace it with 8-wide `vld1q_f16` + `vfmaq_f16`. Medium: a bottleneck for vision encoders.
- Improve depthwise conv gather: for `dilation == 1`, the input slice is contiguous and can use `vld1q_f16` directly instead of a scalar gather into a stack array.
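The threading item could look roughly like the sketch below. It is not the project's code: `parallel_for_2d_sketch` is a hypothetical stand-in for the real `parallel_for_2d` (whose signature is not shown here), the data is `float` rather than f16, and the per-bin work is reduced to taking the magnitude of precomputed real and imaginary parts. The point is only that each `(n, bin)` pair writes its own output element, so the 2D index space can be split across threads without locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for parallel_for_2d: flattens the (rows x cols) index
// space and hands each worker thread one contiguous chunk of it.
inline void parallel_for_2d_sketch(
    std::size_t rows, std::size_t cols,
    const std::function<void(std::size_t, std::size_t)>& body,
    unsigned num_threads) {
    const std::size_t total = rows * cols;
    if (total == 0) return;
    if (num_threads == 0) num_threads = 1;
    num_threads = static_cast<unsigned>(
        std::min<std::size_t>(num_threads, total));
    const std::size_t chunk = (total + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(total, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([begin, end, cols, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i / cols, i % cols);
        });
    }
    for (auto& w : workers) w.join();
}

// Assumed layout: re/im/mag are row-major (N, num_fft_bins) buffers.
inline void stft_magnitude_sketch(const std::vector<float>& re,
                                  const std::vector<float>& im,
                                  std::vector<float>& mag,
                                  std::size_t N, std::size_t num_fft_bins,
                                  unsigned num_threads) {
    mag.resize(N * num_fft_bins);
    parallel_for_2d_sketch(N, num_fft_bins,
        [&](std::size_t n, std::size_t k) {
            const std::size_t idx = n * num_fft_bins + k;
            // Each (n, k) owns exactly one output slot: no synchronization.
            mag[idx] = std::sqrt(re[idx] * re[idx] + im[idx] * im[idx]);
        },
        num_threads);
}
```

Chunking the flattened index space keeps each thread on a contiguous output range; the real helper may partition differently.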
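For the bilinear interpolation item, a minimal sketch of an 8-wide inner loop follows. The function names, the four-corner pointer layout, and the weight arguments are assumptions for illustration, not the actual `cactus_bilinear_interpolation_f16` signature. The NEON path is compiled only where FP16 vector arithmetic is available; elsewhere `half_t` falls back to `float` so the sketch still builds and the scalar loop handles everything.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

#if defined(__ARM_NEON) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
#include <arm_neon.h>
using half_t = float16_t;
#define SKETCH_HAS_NEON_F16 1
#else
using half_t = float;  // portable stand-in so the sketch builds off-ARM
#define SKETCH_HAS_NEON_F16 0
#endif

// Scalar reference: blend the four neighbouring embeddings per channel.
inline void blend4_scalar(const half_t* p00, const half_t* p01,
                          const half_t* p10, const half_t* p11,
                          float w00, float w01, float w10, float w11,
                          half_t* out, std::size_t embed_dim) {
    for (std::size_t d = 0; d < embed_dim; ++d) {
        out[d] = static_cast<half_t>(
            w00 * static_cast<float>(p00[d]) + w01 * static_cast<float>(p01[d]) +
            w10 * static_cast<float>(p10[d]) + w11 * static_cast<float>(p11[d]));
    }
}

// Vectorized variant: 8 channels per iteration, scalar loop for the tail.
inline void blend4_vectorized(const half_t* p00, const half_t* p01,
                              const half_t* p10, const half_t* p11,
                              float w00, float w01, float w10, float w11,
                              half_t* out, std::size_t embed_dim) {
    std::size_t d = 0;
#if SKETCH_HAS_NEON_F16
    const float16x8_t v00 = vdupq_n_f16(static_cast<float16_t>(w00));
    const float16x8_t v01 = vdupq_n_f16(static_cast<float16_t>(w01));
    const float16x8_t v10 = vdupq_n_f16(static_cast<float16_t>(w10));
    const float16x8_t v11 = vdupq_n_f16(static_cast<float16_t>(w11));
    for (; d + 8 <= embed_dim; d += 8) {
        float16x8_t acc = vmulq_f16(vld1q_f16(p00 + d), v00);
        acc = vfmaq_f16(acc, vld1q_f16(p01 + d), v01);  // acc += p01 * w01
        acc = vfmaq_f16(acc, vld1q_f16(p10 + d), v10);
        acc = vfmaq_f16(acc, vld1q_f16(p11 + d), v11);
        vst1q_f16(out + d, acc);
    }
#endif
    // Remaining channels (or all of them when NEON f16 is unavailable).
    blend4_scalar(p00 + d, p01 + d, p10 + d, p11 + d,
                  w00, w01, w10, w11, out + d, embed_dim - d);
}
```

The vector path accumulates in f16, so its results can differ from the scalar reference in the last bits; any comparison between the two should use a tolerance.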
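The depthwise conv item comes down to choosing between two ways of reading the kernel taps. The sketch below illustrates that choice in plain `float` code; `depthwise_tap_sum`, `dot_taps`, and the `kMaxTaps` bound are hypothetical names, and the real kernel would use `vld1q_f16` loads in place of the scalar dot product. With `dilation == 1` the taps sit next to each other in memory, so the input pointer can be passed straight through with no copy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product over K contiguous taps. In the real kernel this is where
// contiguous vector loads would replace the scalar loop.
inline float dot_taps(const float* x, const float* w, std::size_t K) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) acc += x[k] * w[k];
    return acc;
}

// One output sample of a depthwise 1D convolution for a single channel.
// `start` is the index of the first tap in `input`.
inline float depthwise_tap_sum(const float* input, const float* weights,
                               std::size_t start, std::size_t K,
                               std::size_t dilation) {
    if (dilation == 1) {
        // Fast path: taps are input[start .. start + K - 1], already contiguous.
        return dot_taps(input + start, weights, K);
    }
    // General path: gather strided taps into a small stack buffer first.
    constexpr std::size_t kMaxTaps = 16;  // hypothetical upper bound on K
    assert(K <= kMaxTaps);
    float gathered[kMaxTaps];
    for (std::size_t k = 0; k < K; ++k) {
        gathered[k] = input[start + k * dilation];
    }
    return dot_taps(gathered, weights, K);
}
```

Both paths compute the same sum; the fast path only skips the gather.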