
5390 Simplify cache content to only list type#5398

Merged
wyli merged 26 commits into Project-MONAI:dev from Nic-Ma:5390-remove-dict-cache
Oct 31, 2022

Conversation

@Nic-Ma
Contributor

@Nic-Ma Nic-Ma commented Oct 25, 2022

Fixes #5390 .

Description

This PR simplifies the `hash key` caching mode in `CacheDataset` to use a `list` as the cache type, the same as the regular cache.

Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.
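For illustration, the dedup-then-list idea behind this change can be sketched outside MONAI. This is a minimal hypothetical stand-in (`pickle_hashing` and `build_list_cache` are not MONAI APIs): deduplicate items by hash key, cache results in a plain list, and keep the hash keys in a parallel list.

```python
import hashlib
import pickle

def pickle_hashing(item):
    # hash an item by pickling it (simplified stand-in for MONAI's
    # pickle-based hash functions; for illustration only)
    return hashlib.md5(pickle.dumps(item, protocol=4)).hexdigest()

def build_list_cache(data, cache_num, transform=lambda x: x):
    # deduplicate by hash key, then store transformed items in a plain list;
    # hash keys are kept in a parallel list in the same order
    mapping = {pickle_hashing(v): v for v in data}
    hash_keys = list(mapping)[:cache_num]
    cache = [transform(v) for v in list(mapping.values())[:cache_num]]
    return hash_keys, cache

# one duplicate item: it hashes to the same key and is cached only once
data = [{"image": "a.nii"}, {"image": "b.nii"}, {"image": "a.nii"}]
keys, cache = build_list_cache(data, cache_num=2)
```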

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 25, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 25, 2022

/build

@myron
Collaborator

myron commented Oct 25, 2022

@Nic-Ma, I do appreciate you putting time into creating this PR so quickly, but this does not address #5390. That issue proposes removing caching by key completely, since it is not necessary and there are no use cases for keeping it.

We should probably first agree on whether we're removing caching_by_key completely or not.

@Nic-Ma Nic-Ma requested a review from myron October 26, 2022 11:03
@Nic-Ma Nic-Ma requested a review from myron October 28, 2022 08:07
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/build

@myron
Collaborator

myron commented Oct 28, 2022

It looks better and works well; I tested it. I'd still suggest explicitly passing `data` to `_fill_cache()` instead of modifying `self.data` twice. It would be more explicit and more compact. You can use the functions below:

```python
def set_data(self, data: Sequence):
    """
    Set the input data and run deterministic transforms to generate cache content.
    Note: should call this func after an entire epoch and must set `persistent_workers=False`
    in PyTorch DataLoader, because it needs to create new worker processes based on new
    generated cache content.
    """
    self.data = data
    self._cache = []
    self._hash_keys = []
    self.cache_num = min(int(self.set_num), int(len(self.data) * self.set_rate), len(self.data))

    if self.cache_num <= 0:
        return

    if self.hash_as_key:
        # only compute cache for the unique items of the dataset
        mapping = {self.hash_func(v): v for v in data}
        data = list(mapping.values())[: self.cache_num]
        self._hash_keys = list(mapping)[: self.cache_num]

    self._cache = self._fill_cache(data)

def _fill_cache(self, data) -> List:
    if self.cache_num <= 0:
        return []
    if self.progress and not has_tqdm:
        warnings.warn("tqdm is not installed, will not show the caching progress bar.")
    with ThreadPool(self.num_workers) as p:
        if self.progress and has_tqdm:
            return list(
                tqdm(
                    p.imap(self._load_cache_item, data),
                    total=len(data),
                    desc="Loading dataset",
                )
            )
        return list(p.imap(self._load_cache_item, data))

def _load_cache_item(self, item):
    """
    Args:
        item: element from the input data sequence.
    """
    for _transform in self.transform.transforms:  # type: ignore
        # execute all the deterministic transforms
        if isinstance(_transform, Randomizable) or not isinstance(_transform, Transform):
            break
        _xform = deepcopy(_transform) if isinstance(_transform, ThreadUnsafe) else _transform
        item = apply_transform(_xform, item)
    if self.as_contiguous:
        item = convert_to_contiguous(item, memory_format=torch.contiguous_format)
    return item
```
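With a list cache plus a parallel key list, hash-keyed retrieval no longer needs a dict. A minimal standalone sketch of the lookup side (the `ListKeyCache` class is hypothetical, not the actual MONAI implementation):

```python
class ListKeyCache:
    # minimal sketch: a list-based cache with a parallel list of hash keys,
    # mirroring the idea of keeping `_cache` and `_hash_keys` in the same order
    def __init__(self, keys, cache):
        self._hash_keys = keys  # hash key of each cached item, in cache order
        self._cache = cache     # cached (pre-transformed) items

    def lookup(self, key):
        # translate a hash key into a list index, replacing the old dict lookup
        if key in self._hash_keys:
            return self._cache[self._hash_keys.index(key)]
        return None  # key not cached

c = ListKeyCache(["k1", "k2"], ["item1", "item2"])
```

Note that `list.index` is a linear scan; for the small `cache_num` values typical here that is usually acceptable, which is part of why a separate dict cache adds little.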

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

Hi @myron ,

Thanks for your suggestion.
I updated the PR to avoid changing `self.data`. I kept the `_compute_cache` function because of a minor issue in your code: we should compute `self.cache_num` based on `data` instead of `self.data`.

Thanks.

This reverts commit 13801ee.

Signed-off-by: Nic Ma <[email protected]>
@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from e2a9277 to 7fcaa86 Compare October 28, 2022 15:30
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

Hi @myron ,

During deeper testing, I found that if the `data` in `p.imap(self._load_cache_item, data)` is a complicated iterable object (like another `Dataset` object), this multi-threaded `imap` may slice the items incorrectly. So I reverted the change; can we try to optimize the `self.data` logic later, when we have a clearer idea? I think that's out of the scope of this PR.

Thanks in advance.

@myron
Collaborator

myron commented Oct 30, 2022

> Hi @myron ,
>
> During deep tests, I found that if the data in p.imap(self._load_cache_item, data) is complicated iterable object (like another Dataset object), this multi-thread imap may do wrong slicing for the items. So I reverted the change, can we try to optimize the self.data logic later when we got more clear idea? I think that's not in the scope of this PR.
>
> Thanks in advance.

Thank you for the reply. So if `data` is a complicated iterable and `p.imap` does not handle its slicing, perhaps we can compute a list of non-repeated data indices and iterate over them. It seems like a more robust solution, and we don't need a workaround that changes `self.data` temporarily. Please see below. If this solution does not work for any reason, I'm okay with your PR as is. Thank you.

```python
def set_data(self, data: Sequence):
    """
    Set the input data and run deterministic transforms to generate cache content.
    Note: should call this func after an entire epoch and must set `persistent_workers=False`
    in PyTorch DataLoader, because it needs to create new worker processes based on new
    generated cache content.
    """
    self.data = data
    self._cache = []
    self._hash_keys = []
    self.cache_num = min(int(self.set_num), int(len(self.data) * self.set_rate), len(self.data))

    if self.cache_num <= 0:
        return

    if self.hash_as_key:
        # only compute cache for the unique items of the dataset
        mapping = {self.hash_func(data[i]): i for i in range(len(data))}
        self._hash_keys = list(mapping)[: self.cache_num]
        indices = list(mapping.values())[: self.cache_num]
    else:
        indices = list(range(self.cache_num))

    self._cache = self._fill_cache(indices)

def _fill_cache(self, indices=None) -> List:
    if self.cache_num <= 0:
        return []

    if indices is None:
        indices = list(range(self.cache_num))

    if self.progress and not has_tqdm:
        warnings.warn("tqdm is not installed, will not show the caching progress bar.")
    # note: _load_cache_item now receives an index into self.data
    # rather than the item itself
    with ThreadPool(self.num_workers) as p:
        if self.progress and has_tqdm:
            return list(
                tqdm(
                    p.imap(self._load_cache_item, indices),
                    total=len(indices),
                    desc="Loading dataset",
                )
            )
        return list(p.imap(self._load_cache_item, indices))
```
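The index-based proposal above can be exercised end to end with a small standalone sketch. All names here are hypothetical, and the doubling in `_load_cache_item` is just a stand-in for the real deterministic transform chain; the point is that the thread pool only ever receives plain `int` indices.

```python
from multiprocessing.pool import ThreadPool

class IndexCacheSketch:
    # hypothetical sketch of index-based cache filling: workers receive plain
    # int indices, so imap never slices the underlying data object directly
    def __init__(self, data, cache_num, num_workers=2):
        self.data = data
        self.cache_num = cache_num
        self.num_workers = num_workers

    def _load_cache_item(self, idx):
        # the loader takes an index into self.data rather than the item itself
        item = self.data[idx]
        return item * 2  # stand-in for the deterministic transform chain

    def fill(self, indices=None):
        if indices is None:
            indices = list(range(self.cache_num))
        with ThreadPool(self.num_workers) as p:
            # imap yields results in input order, so the cache list
            # lines up with the index list
            return list(p.imap(self._load_cache_item, indices))

cache = IndexCacheSketch([1, 2, 3, 4], cache_num=3).fill()
```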

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from dfa3bfa to 82f106f Compare October 31, 2022 03:26
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from e2e2294 to aa162ef Compare October 31, 2022 03:56
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

Hi @myron ,

Thanks for your suggestion.
I have updated the PR according to your proposal, could you please help review it again?

Thanks in advance.

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/build

@wyli wyli merged commit 766f21f into Project-MONAI:dev Oct 31, 2022
KumoLiu pushed a commit that referenced this pull request Nov 2, 2022
yiheng-wang-nv pushed a commit to yiheng-wang-nv/MONAI that referenced this pull request Nov 2, 2022


Development

Successfully merging this pull request may close these issues.

Remove hashing by key from CacheDataset (hash_as_key)

4 participants