Lock on bcolz #1033

@thequackdaddy

Description

Hello,

Still playing with this package.

I have a question about the use of threads/locks on dataframe imports and bcolz.

Take this section of code:

def locked_df_from_ctable(*args, **kwargs):
    with lock:
        result = dataframe_from_ctable(*args, **kwargs)
    return result

I think what's happening here is that the lock prevents the dask threads from reading more than one bcolz file at a time. In my particular case, I have about 7 million records in a bcolz dataset. When I removed the lock (i.e., deleted the with lock: statement and un-indented result = ...), the processing time was cut from around 90 seconds to 50 seconds. CPU usage never exceeded 50%, so I presume the bottleneck was hard-drive read speed.
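To illustrate the effect I'm describing, here is a minimal sketch (not dask's actual code) of how a shared threading.Lock serializes work that a thread pool could otherwise overlap. The time.sleep call is a hypothetical stand-in for a per-chunk read:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def read_chunk_locked(i):
    # With the lock, only one "read" runs at a time, just like
    # the locked_df_from_ctable wrapper above.
    with lock:
        time.sleep(0.05)  # hypothetical stand-in for a bcolz chunk read
    return i

def read_chunk_unlocked(i):
    # Without the lock, threads overlap their waiting on I/O.
    time.sleep(0.05)
    return i

def timed(fn):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fn, range(8)))
    return time.perf_counter() - start

t_locked = timed(read_chunk_locked)      # ~8 * 0.05 s: fully serialized
t_unlocked = timed(read_chunk_unlocked)  # ~2 * 0.05 s: 4 threads overlap
```

With the lock, the 8 simulated reads run one after another; without it, the 4 workers overlap and total wall time drops, which matches the speedup I saw.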

So what's the purpose of this lock? Is it needed for some legacy packages? I ask because I'd like to submit a PR to remove it. In my test I was able to just drop it and got the same result.

Thanks.
