4135 loadimage and saveimage using network sources by karwojan · Pull Request #4136 · Project-MONAI/MONAI

karwojan · 2022-04-14T11:09:24Z

Description

It would be very useful to load/save images directly from/to network storage (like S3 or Google Cloud Storage) using pure monai transforms. I have a pretty simple idea how to implement such a feature adding single extra argument to LoadImage and SaveImage transforms.

We could treat input and output paths as pointing to the network sources instead of local files, because classic storage services like S3 use filesystem-like chierarcical structure of object keys. LoadImage transform could accept a Callable, which download object from network storage and save to temporary files, which is then used by readers and any other regular LoadImage transform logic. Similarly SaveImage writer could optionally save data to temporary file, which is then uploaded to network storage.

Status

Work in progress

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
In-line docstrings updated.
Documentation updated, tested make html command in the docs/ folder.

Nic-Ma · 2022-04-15T06:10:18Z

Hi @karwojan ,

Thanks for your feature request.
I would suggest to add a new StreamReader(or other name) reader inheriting from MONAI ImageReader.
Then use it as the reader arg of LoadImage:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/transforms/io/array.py#L107
Let's try to keep components simple and independent.
What do you think?
CC @ericspod @wyli @rijobro

Thanks.

karwojan · 2022-04-19T07:35:33Z

Hi @karwojan ,

Thanks for your feature request. I would suggest to add a new StreamReader(or other name) reader inheriting from MONAI ImageReader. Then use it as the reader arg of LoadImage: https://github.com/Project-MONAI/MONAI/blob/dev/monai/transforms/io/array.py#L107 Let's try to keep components simple and independent. What do you think? CC @ericspod @wyli @rijobro

Thanks.

Hi @Nic-Ma
Thanks for your feedback. In fact my first idea was to create such a separated reader. But I think that this is not an exact responsibility of ImageReader subclass, which should take care of decoding particular data format (file type). If I want to use readers I should create wrapper for each existing subclass of ImageReader because any of these file types could be accessible via network. This is way I put it in LoadImage class. Or alternatively I could create a reader, which aggregates another reader(s) but it starts to look like repeating LoadImage implementation. Maybe there should be separated transform (something like DownloadImage) for that?

Nic-Ma · 2022-04-19T09:53:54Z

Hi @karwojan ,

Thanks for your analysis.
Maybe I misunderstood your idea, I thought you want to download some data array from network streaming, then use it from memory directly.
Seems you want to download the image format files to the disk first?
If so, I agree maybe you can create a new download transform that is only responsible for downloading, then you still can load images and do other operations. Is it what you expected?
And please note that, transforms usually execute in the multi-processing environment of DataLoader or multi-thread environment of CacheDataset, please make sure your downloading logic is process-safe and thread-safe.
What do you think?
Please feel free to correct me if I misunderstood your proposal.

Thanks.

ericspod · 2022-04-19T16:08:58Z

I think I understand the issues @karwojan is addressing with this PR. The alternative solutions to this is to change what the readers/writers do or provide a wrapper that goes around the readers/writers that you would provide, but those require changing a lot of how LoadImage/SaveImage are created. This solution is much cleaner and provides an optional argument that's easy to fill with some existing function. Another choice is to provide wrappers for LoadImage and SaveImage which does the downloading and then passes along the temporary file paths instead of the originals.

One problem I see however is that the location the temporary files are stored in isn't controllable and varies by platform. Linux platforms may be configured to mount /tmp (the default location for temporary files) using tempfs which is a ramdisk, memory can quickly fill up if you keep all these temporary files open. Windows will do something different and provides temporary file storage on disk, macOS I think does something different still.

So this doesn't really provide enough control over what is done with files when loaded from remote sources and is perhaps less efficient by downloading files and keeping them rather than streaming data directly to readers. The alternative thing to do is mount the remote storage as a local file system and read from it with a cached dataset type to minimise how much loading is done, this would just use the existing code without modification. We can adopt the change proposed here and hope this doesn't become a problem, but if we provide wrappers for LoadImage and SaveImage which allow configuration of how temporary data is stored, this could help us route around this problem if it comes up.

karwojan · 2022-04-22T14:19:21Z

Sorry for the long response time. I've changed my implementation according to @ericspod suggestion. I've created separated DownloadImage and UploadImage transforms, which wrap up LoadImage and SaveImage transforms respectively. I agree that it is better to separate this responsibility.

Nic-Ma · 2022-04-22T14:48:26Z

Hi @karwojan ,

Thanks for your update.
I have a question about the API design: can we simplify the logic to decouple DownloadImage and LoadImage? UploadImage and SaveImage?
So we don't need to inherit LoadImage, DownloadImage returns the temp filepath, then we compose 2 transforms:
transforms = [DownLoadImaged(keys=["image"]), LoadImaged(keys=["image"])]
@ericspod What do you think?

Please feel free to correct me if I misunderstood the proposal.

Thanks.

karwojan · 2022-04-25T07:44:46Z

Hi @Nic-Ma ,
It seems to be a really good idea how to make this API simpler. But the problem is with temporary files living time. For now, DownloadImage takes care of removing temporary files - all temporary files are removed when leaving DownloadImage.call method scope. If these temporary files still exist out of DownloadImage transform who and when should take care of removing them?

…aveimage-using-network-sources

Nic-Ma · 2022-04-25T09:28:14Z

Hi @karwojan ,

Thanks for the detailed explantion.
Actually, I think DownloadImage transform should not delete the temp files once loaded, because we may repeatedly load the files for every epoch during training, I think we should not download the same files in every epoch.
So maybe just provide expected directory to save the loaded images and manually delete later instead of the temp folder?
BTW, you can leverage the API download_and_extract to download and extract:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/utils.py#L272

What do you think?

Thanks.

karwojan · 2022-04-25T11:01:45Z

Hi @Nic-Ma ,
I was thinking about a little different use case while creating this transform. In the situation you describe (when files should be downloaded just once) there is no special need to add new transform, because you can download data in separated script, then process it as always and do not care if they come from some network storage or not. But sometimes it is necessary to do not store whole dataset locally (because of not enough disk space for some BIG dataset in your computing machine) and it is not a problem to download data every time (because of fast connection to fast storage). In such a situation we want to download data every time and it should be placed as a regular step in training loop - this is why I need new transform for that.

ericspod · 2022-04-26T12:48:36Z

I think the issue @Nic-Ma is getting at is the use of temporary files. What your implementation does currently is download the file to a temporary location which is then read by a reader, the file then is deleted at some point in the future. This can cause files to be downloaded multiple times which has a cost. NamedTemporaryFile is being used as a context manager here so when it gets deleted isn't deterministic, however I would suggest changing it anyway so that data can be downloaded to a given location with filenames that are guaranteed unique somehow so that files that are already downloaded are not done so again. If a location isn't given then a temporary file should be used, in that case there is the lifetime problem if using two transforms where the file could get deleted between transform calls.

Nic-Ma · 2022-04-26T15:10:46Z

Hi @karwojan @ericspod ,

Thanks for the discussion.
Talking about this use case, I feel maybe inheriting from LoadImage is the best option we can do so far.
2 minor suggestions:

Maybe you can support both temp dir and user specified dir, for temp dir, use content provider to automatically delete.
No need to list all the args of the base LoadImage transform, use kwargs instead.

What do you think?

Thanks in advance.

ericspod · 2022-04-26T15:31:20Z

Providing a storage directory path to keep images is the way to go, if this is None then use temporary files. For the arguments to LoadImage/SafeImage I would provide them in full rather than using args/kwargs, it's clearer and easier to understand from someone looking at the documentation what the API is, though it is wordy writing everything out. Python could use a mechanism to specify arguments being inherited from overridden methods like __init__.

…urces

karwojan · 2022-04-29T10:34:03Z

Okay, maybe the problem is that I haven't explained my idea clearly at the beggining.

First of all, I think that it could be very useful to load data dynamically from network storage during training, not from local drive. In fact, I need such a functionality in my project, because I have no enough space on local disk to store whole dataset. I have also very fast network storage and very fast connection with it. I mean that I have to download data in each training loop and this is not a problem. It does not seem to be uncommon scenario because of VERY BIG datasets, which we have to deal with in today's world.

Secondly, If I found a way to load data using nibabel or itk directly from bytes array I wouldn't use temporary files. These temporary files are just a workaround how to load data from network storage using nibabel and itk. I don't want to store any data in local hard drive.

Finally, regarding the implementation, temporary files are removed after scope of the method is executed, because NamedTemporaryFile variable's reference is removed and all its resources are automatically free. I think that this behaviour is deterministic.

Nic-Ma · 2022-04-29T11:15:21Z

Hi @karwojan ,

Thanks for the great discussion.
I think we aligned on the proposal. Could you please help update the PR according to your idea?

Thanks in advance.

…aveimage-using-network-sources

…sing-network-sources' into 4135-loadimage-and-saveimage-using-network-sources

j.karwowski added 2 commits April 14, 2022 10:54

Implement network downloader in LoadImage transform

45aab8d

Implement network uploader in SaveImage transform

c319158

Nic-Ma requested a review from ericspod April 15, 2022 06:10

Create separated DownloadImage and UploadImage transforms

fd987af

Merge remote-tracking branch 'upstream/dev' into 4135-loadimage-and-s…

dd8f4f3

…aveimage-using-network-sources

Merge branch 'dev' into 4135-loadimage-and-saveimage-using-network-so…

0b14080

…urces

Jan Karwowski added 2 commits May 18, 2022 08:51

Merge remote-tracking branch 'upstream/dev' into 4135-loadimage-and-s…

5069823

…aveimage-using-network-sources

Merge remote-tracking branch 'karwojan/4135-loadimage-and-saveimage-u…

e3f185e

…sing-network-sources' into 4135-loadimage-and-saveimage-using-network-sources

karwojan closed this Aug 2, 2022

karwojan deleted the 4135-loadimage-and-saveimage-using-network-sources branch August 2, 2022 11:39

dzenanz mentioned this pull request Oct 24, 2022

LoadImage and SaveImage using network source (S3 or Google Cloud Storage) #4135

Closed

Conversation

karwojan commented Apr 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Status

Types of changes

Uh oh!

Nic-Ma commented Apr 15, 2022

Uh oh!

karwojan commented Apr 19, 2022

Uh oh!

Nic-Ma commented Apr 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ericspod commented Apr 19, 2022

Uh oh!

karwojan commented Apr 22, 2022

Uh oh!

Nic-Ma commented Apr 22, 2022

Uh oh!

karwojan commented Apr 25, 2022

Uh oh!

Nic-Ma commented Apr 25, 2022

Uh oh!

karwojan commented Apr 25, 2022

Uh oh!

ericspod commented Apr 26, 2022

Uh oh!

Nic-Ma commented Apr 26, 2022

Uh oh!

ericspod commented Apr 26, 2022

Uh oh!

karwojan commented Apr 29, 2022

Uh oh!

Nic-Ma commented Apr 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

karwojan commented Apr 14, 2022 •

edited

Loading

Nic-Ma commented Apr 19, 2022 •

edited

Loading