It seems that the BilateralFilter CUDA kernel fails silently, returning all zeros, on inputs with more than 16 channels. Here is an example of the 2D case:
# 16 channels
BilateralFilter.apply(torch.randn(10, 16, 60, 60).cuda().contiguous()).std()
tensor(1.0003, device='cuda:0')
# 17 channels
BilateralFilter.apply(torch.randn(10, 17, 60, 60).cuda().contiguous()).std()
tensor(0., device='cuda:0')
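
To check whether 17 is really the first failing channel count, the comparison above could be turned into a small sweep. This is only a sketch: `first_silent_failure` and `run` are hypothetical names introduced here, and the helper just looks for the smallest channel count whose output standard deviation is exactly zero.

```python
def first_silent_failure(run, channel_counts):
    """Return the smallest channel count for which run(c) is exactly 0.0, else None.

    `run` is a hypothetical callable that takes a channel count and returns the
    standard deviation of the filter output as a Python float.
    """
    for c in sorted(channel_counts):
        if run(c) == 0.0:
            return c
    return None


# Possible usage on a CUDA machine, assuming torch and BilateralFilter are
# imported as in the snippet above (not executed here):
#
# run = lambda c: BilateralFilter.apply(
#     torch.randn(10, c, 60, 60).cuda().contiguous()
# ).std().item()
# print(first_silent_failure(run, range(1, 33)))
```

If the observation above holds for every channel count, the sweep would report 17. Any other value would suggest the limit depends on something besides the channel count.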