Following up on #6845: I believe one reason the du implementation here is slower is that the print_stat() function writes each small piece of output directly to stdout() as soon as it is produced, over and over.
For a command like du, which prints heavily to stdout(), we should write into a BufWriter wrapped around stdout() instead, so that data only reaches stdout() when the buffer is flushed (when it is full or can't hold the next string). This way we reduce both the number of write syscalls and the number of times we lock stdout().
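Here is a minimal sketch of what I mean. It is not the actual du code: `write_entries` and the sample entries are hypothetical stand-ins for whatever print_stat() emits. The idea is just to lock stdout() once, wrap the lock in a BufWriter, and have the printing code write to any `Write` implementor.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical stand-in for du's per-entry printing.
/// Writes one "<size>\t<path>" line per entry to any writer.
fn write_entries<W: Write>(out: &mut W, entries: &[(u64, &str)]) -> io::Result<()> {
    for (size, path) in entries {
        // Each writeln! goes into the writer we were given; with a
        // BufWriter this lands in memory until the buffer is flushed.
        writeln!(out, "{size}\t{path}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Made-up sample data, only for illustration.
    let entries: [(u64, &str); 3] = [(4, "./src"), (8, "./target"), (12, ".")];

    // Lock stdout once and buffer all writes behind that single lock.
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());

    write_entries(&mut out, &entries)?;

    // Flush explicitly so write errors are reported; BufWriter also
    // flushes on drop, but errors at that point are ignored.
    out.flush()
}
```

Taking a generic `W: Write` also means the printing code can be tested against a `Vec<u8>` without touching the real stdout().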