At the moment, Forward and Backward are executed sequentially, in layer ID order. In principle, all Forward / Backward steps at the same depth in the DAG do not depend on each other and could be executed in parallel.
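
To make "same depth" concrete, here is a minimal sketch of how layers could be grouped into depth levels. It assumes the model is described as a mapping from layer ID to the IDs of the layers it consumes; `group_layers_by_depth` and the toy `dag` are hypothetical names for illustration, not part of the existing code.

```python
from collections import defaultdict


def group_layers_by_depth(inputs):
    """Group layer IDs by depth in the DAG.

    `inputs` maps each layer ID to the list of layer IDs it consumes.
    Depth is the length of the longest path from any source layer, so a
    layer never shares a level with one of its own inputs.
    """
    depth = {}

    def depth_of(layer):
        # Memoized longest-path depth; source layers get depth 0.
        if layer not in depth:
            depth[layer] = 1 + max(
                (depth_of(parent) for parent in inputs[layer]), default=-1
            )
        return depth[layer]

    levels = defaultdict(list)
    for layer in inputs:
        levels[depth_of(layer)].append(layer)
    return [sorted(levels[d]) for d in sorted(levels)]


# Hypothetical diamond-shaped model: 0 feeds 1 and 2, which both feed 3.
dag = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [3]}

forward_levels = group_layers_by_depth(dag)
# Backward can walk the same levels in reverse: every consumer of a layer
# sits at a strictly greater depth, so its gradient is ready first.
backward_levels = forward_levels[::-1]

print(forward_levels)   # [[0], [1, 2], [3], [4]]
print(backward_levels)  # [[4], [3], [1, 2], [0]]
```

Layers 1 and 2 land in the same level, so their Forward steps could run concurrently. For Backward, both would accumulate gradients into the output of layer 0, so a parallel version would need to make that accumulation safe (for example by reducing per-consumer gradient buffers afterwards).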
In DAG models where a single layer's operations do not saturate the host / device, this should improve performance.
As I understand it, this would be done with batched cuBLAS calls and CUDA streams, so that the kernels for all layers at a given depth in the model can execute in parallel.
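
As a rough illustration of the scheduling pattern only, the sketch below runs each depth level concurrently with a barrier between levels. It uses a Python thread pool as a host-side stand-in and does not call cuBLAS or CUDA; on a GPU the analogous structure would presumably be one stream per layer within a level, with synchronization between levels. The names `run_levels` and `forward`, and the toy model, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_levels(levels, step, max_workers=4):
    """Call step(layer) for every layer, one depth level at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in levels:
            # Consuming the map results waits for every task in this level
            # (and re-raises any exception) before the next level starts,
            # which acts as the barrier between depths.
            list(pool.map(step, level))


# Hypothetical toy model: each layer outputs 1 + the sum of its inputs.
inputs = {0: [], 1: [0], 2: [0], 3: [1, 2]}
levels = [[0], [1, 2], [3]]  # layers grouped by depth, assumed precomputed
outputs = {}


def forward(layer):
    # Each layer writes only its own entry, so same-level tasks do not conflict.
    outputs[layer] = 1 + sum(outputs[parent] for parent in inputs[layer])


run_levels(levels, forward)
print(outputs[3])  # 5
```

The barrier per level is the simplest correct schedule; a finer-grained version could start a layer as soon as its own inputs are done, at the cost of tracking per-layer dependencies (events, in CUDA terms).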