If I run just the astropy.units tests with:
python setup.py test -P units --coverage
On Python 2 it runs in:
============= 2739 passed, 15 skipped, 3 xfailed in 115.31 seconds =============
while on Python 3 it runs in:
============= 1658 passed, 3 xfailed in 335.95 seconds =============
Note that Python 3 runs fewer tests (because it doesn't run the tests with unicode_literals), and even so the run takes about 3 times as long, which works out to nearly 5 times as long per test. I'm not sure whether the bottleneck is in our functionality or in the coverage measurement itself.
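For reference, here is a quick sketch of the arithmetic behind that comparison, using only the numbers from the two summaries above. The variable names are just for illustration, and counting skipped and xfailed tests in the totals is an assumption on my part (they probably cost very little time).

```python
# Figures copied from the two pytest summary lines above.
py2_tests = 2739 + 15 + 3    # passed + skipped + xfailed (assumed to all count)
py2_seconds = 115.31
py3_tests = 1658 + 3         # passed + xfailed
py3_seconds = 335.95

# Ratio of total wall-clock time, Python 3 over Python 2.
wall_ratio = py3_seconds / py2_seconds

# Ratio of average time per test, which accounts for Python 3 running fewer tests.
per_test_ratio = (py3_seconds / py3_tests) / (py2_seconds / py2_tests)

print("wall-clock ratio: %.2f" % wall_ratio)
print("per-test ratio:   %.2f" % per_test_ratio)
```

This gives a wall-clock ratio of about 2.9 and a per-test ratio of about 4.8, so the slowdown is larger than the raw timings suggest.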