Hello,
Still playing with this package.
I have a question about the use of threads/locks on dataframe imports and bcolz.
Take this section of code:
def locked_df_from_ctable(*args, **kwargs):
    with lock:
        result = dataframe_from_ctable(*args, **kwargs)
    return result
I think what's happening here is that the lock prevents the dask threads from reading more than one bcolz file at a time. In my particular case, I have a bcolz dataset of about 7 million records. When I got rid of the lock (i.e., removed the with lock statement and un-indented the result = ... line), the processing time dropped from around 90 seconds to 50 seconds. CPU usage never exceeded 50%, so I presume the bottleneck was hard drive read speed.
So what's the purpose of this lock? Is it necessary in some legacy packages? I ask because I'd like to make a PR to get rid of it. In my test, I was able to just drop it and I got the same result.
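For reference, here is a minimal sketch of the kind of comparison described above. It does not use bcolz or dask: `fake_read_chunk` is a hypothetical stand-in for `dataframe_from_ctable` that just sleeps to mimic waiting on disk, and the chunk count, thread count, and delay are made-up values. It only shows how a shared lock serializes otherwise concurrent reads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()


def fake_read_chunk(i, delay=0.05):
    # Hypothetical stand-in for dataframe_from_ctable:
    # sleeps to mimic an I/O wait, then returns a dummy value.
    time.sleep(delay)
    return i * 2


def locked_read_chunk(i):
    # Same shape as locked_df_from_ctable: one reader at a time.
    with lock:
        return fake_read_chunk(i)


def timed_map(func, n_chunks=8, n_threads=4):
    # Run func over all chunks on a thread pool and time the whole batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(func, range(n_chunks)))
    return results, time.perf_counter() - start


locked_results, locked_time = timed_map(locked_read_chunk)
unlocked_results, unlocked_time = timed_map(fake_read_chunk)

print(f"locked:   {locked_time:.2f}s")
print(f"unlocked: {unlocked_time:.2f}s")
print("same results:", locked_results == unlocked_results)
```

Getting identical results here says nothing about whether concurrent bcolz reads are actually thread-safe, which is the open question; the sketch only illustrates why removing the lock can shorten wall-clock time when the work is I/O-bound.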
Thanks.