Performance of fromfile on Python 3 #13319

@apontzen

Description

numpy.fromfile is drastically inefficient for small reads on Python 3: in the benchmark below, 10,000 small reads take roughly 20 times longer than the same calls on Python 2. I believe this is because of changes made in response to #4118, which keep the file position in sync despite the IO buffering in Python 3.
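To make the "in sync" requirement concrete, here is a minimal sketch (Python 3 only) of the observable behavior: after each small np.fromfile call, the Python file object's own offset reflects exactly the bytes consumed. The helper name positions_after_reads and the file name sync_demo.bin are made up for illustration, and the sketch shows only the outcome, not how numpy achieves it internally.

```python
import os
import tempfile

import numpy as np


def positions_after_reads(num_reads=3, count=10):
    """Return the Python-level file offset after each small fromfile call.

    Hypothetical helper for illustration; not part of numpy.
    """
    data = np.arange(100, dtype=np.float64)
    offsets = []
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sync_demo.bin")
        data.tofile(path)
        with open(path, "rb") as f:
            for _ in range(num_reads):
                # Each call reads `count` float64 items (8 bytes each).
                np.fromfile(f, dtype=np.float64, count=count)
                # The buffered Python file object reports the matching offset.
                offsets.append(f.tell())
    return offsets


print(positions_after_reads())
```

If the offsets advance by count * 8 bytes per call, the Python-level file object and numpy's read agree on the position after every read, which is the guarantee that presumably carries the per-call cost.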

A naive pure Python implementation of fromfile shows that better performance is available, even on Python 3.x. Is it possible to improve the performance of numpy.fromfile to match such a reference implementation?

Reproducing code example:

import numpy as np
import contextlib
import time
import sys
import os

@contextlib.contextmanager
def timeme(label):
    start = time.time()
    yield
    end = time.time()
    diff = (end-start)*1000
    print("%s: %.1f ms"%(label.rjust(45), diff))

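def py_fromfile(f, dtype, num):
    # Pure Python reference: allocate the array, then fill it in place
    # through the file object's readinto.
    buf = np.empty(num, dtype)
    f.readinto(buf)
    return buf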

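def make_test_file():
    # 10**6 float64 values (8 MB); the timed reads consume the first 4 MB
    # of the file as 10**6 float32 values.
    test_data = np.random.uniform(size=10**6)
    with open('testdata', 'wb') as f:
        test_data.tofile(f)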

def remove_test_file():
    os.remove("testdata")

def time_read(num_chunks, reading_method):
    py_version = sys.version_info
    reading_name = reading_method.__name__
    test_label = 'Read in %d chunk(s), %s, py%d.%d'%(num_chunks, reading_name,
                                                 py_version[0], py_version[1])
    make_test_file()

    with timeme(test_label):
        with open('testdata', 'rb') as f:
            for i in range(num_chunks):
                reading_method(f, np.float32, 10**6//num_chunks)

    remove_test_file()

time_read(1, np.fromfile)
time_read(10000, np.fromfile)
time_read(10000, py_fromfile)

Running with any recent version of numpy on Python 3.7 and 2.7 respectively gives results along the following lines:

          Read in 1 chunk(s), fromfile, py3.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py3.7: 128.0 ms
   Read in 10000 chunk(s), py_fromfile, py3.7: 16.3 ms

          Read in 1 chunk(s), fromfile, py2.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py2.7: 6.6 ms
   Read in 10000 chunk(s), py_fromfile, py2.7: 20.6 ms

Thus, on Python 2.7, numpy.fromfile is efficient (about three times faster than the pure Python implementation), but on Python 3.7 (and other 3.x) numpy.fromfile is roughly eight times slower than the pure Python implementation. This suggests a better implementation for fromfile may be possible.
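One caveat about using py_fromfile as a reference: it ignores the return value of readinto, so a read that reaches end of file leaves uninitialized values at the end of the array. A minimal sketch of a variant that trims the result follows. The name py_fromfile_checked is hypothetical, and the sketch assumes f is a binary file object that supports readinto.

```python
import io

import numpy as np


def py_fromfile_checked(f, dtype, num):
    """readinto-based reader that trims the result on a short read.

    Hypothetical helper for illustration; assumes `f` is a binary file
    object that supports readinto.
    """
    buf = np.empty(num, dtype)
    # readinto returns the number of bytes written into buf; treat a None
    # result (possible on non-blocking raw streams) as zero bytes.
    nbytes = f.readinto(buf) or 0
    # Keep only the whole items that were actually filled.
    return buf[: nbytes // buf.itemsize]


# Usage: five float32 values in memory, read in chunks of three.
stream = io.BytesIO(np.arange(5, dtype=np.float32).tobytes())
first = py_fromfile_checked(stream, np.float32, 3)
second = py_fromfile_checked(stream, np.float32, 3)
print(len(first), len(second))
```

The extra slice is cheap next to the read itself, so this should not change the timing comparison much, though that has not been measured here.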

Numpy/Python version information:

This can be reproduced on all recent versions of numpy as far as I can tell. However, the stats given above are from the following two setups:

python 2:

1.16.2 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 16:24:02) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

python 3:

1.16.2 3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
