
5390 Simplify cache content to only list type#5398

Merged
wyli merged 26 commits into Project-MONAI:dev from Nic-Ma:5390-remove-dict-cache
Oct 31, 2022

Conversation

@Nic-Ma
Contributor

@Nic-Ma Nic-Ma commented Oct 25, 2022

Fixes #5390 .

Description

This PR simplifies the `hash key` caching mode in `CacheDataset` to use a `list` as the cache type, the same as the regular cache.

Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.
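For illustration, the dedup-then-list idea behind this change can be sketched outside MONAI. This is a minimal hypothetical stand-in (`pickle_hashing` and `build_list_cache` are not MONAI APIs): deduplicate items by hash key, cache results in a plain list, and keep the hash keys in a parallel list.

```python
import hashlib
import pickle

def pickle_hashing(item):
    # hash an item by pickling it (simplified stand-in for MONAI's
    # pickle-based hash functions; for illustration only)
    return hashlib.md5(pickle.dumps(item, protocol=4)).hexdigest()

def build_list_cache(data, cache_num, transform=lambda x: x):
    # deduplicate by hash key, then store transformed items in a plain list;
    # hash keys are kept in a parallel list in the same order
    mapping = {pickle_hashing(v): v for v in data}
    hash_keys = list(mapping)[:cache_num]
    cache = [transform(v) for v in list(mapping.values())[:cache_num]]
    return hash_keys, cache

# one duplicate item: it hashes to the same key and is cached only once
data = [{"image": "a.nii"}, {"image": "b.nii"}, {"image": "a.nii"}]
keys, cache = build_list_cache(data, cache_num=2)
```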

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 25, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 25, 2022

/build

@myron
Collaborator

myron commented Oct 25, 2022

@Nic-Ma, I do appreciate you putting time into creating this PR so quickly, but this does not address #5390. That issue proposes removing caching by key completely, since it is not necessary and there are no use cases for keeping it.

We should probably first agree on whether we're removing caching_by_key completely or not.

@Nic-Ma Nic-Ma requested a review from myron October 26, 2022 11:03
@Nic-Ma Nic-Ma requested a review from myron October 28, 2022 08:07
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/build

@myron
Collaborator

myron commented Oct 28, 2022

It looks better and works well; I tested it. I'd still suggest explicitly passing `data` to `_fill_cache()` instead of modifying `self.data` twice. It would be more explicit and more compact. You can use the functions below:

```python
def set_data(self, data: Sequence):
    """
    Set the input data and run deterministic transforms to generate cache content.
    Note: should call this func after an entire epoch and must set `persistent_workers=False`
    in PyTorch DataLoader, because it needs to create new worker processes based on new
    generated cache content.
    """
    self.data = data
    self._cache = []
    self._hash_keys = []
    self.cache_num = min(int(self.set_num), int(len(self.data) * self.set_rate), len(self.data))

    if self.cache_num <= 0:
        return

    if self.hash_as_key:
        # only compute cache for the unique items of the dataset
        mapping = {self.hash_func(v): v for v in data}
        data = list(mapping.values())[: self.cache_num]
        self._hash_keys = list(mapping)[: self.cache_num]

    self._cache = self._fill_cache(data)

def _fill_cache(self, data) -> List:
    if self.cache_num <= 0:
        return []
    if self.progress and not has_tqdm:
        warnings.warn("tqdm is not installed, will not show the caching progress bar.")
    with ThreadPool(self.num_workers) as p:
        if self.progress and has_tqdm:
            return list(
                tqdm(
                    p.imap(self._load_cache_item, data),
                    total=len(data),
                    desc="Loading dataset",
                )
            )
        return list(p.imap(self._load_cache_item, data))

def _load_cache_item(self, item):
    """
    Args:
        item: element from the input data sequence.
    """
    for _transform in self.transform.transforms:  # type: ignore
        # execute all the deterministic transforms
        if isinstance(_transform, Randomizable) or not isinstance(_transform, Transform):
            break
        _xform = deepcopy(_transform) if isinstance(_transform, ThreadUnsafe) else _transform
        item = apply_transform(_xform, item)
    if self.as_contiguous:
        item = convert_to_contiguous(item, memory_format=torch.contiguous_format)
    return item
```
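With a list cache plus a parallel key list, hash-keyed retrieval no longer needs a dict. A minimal standalone sketch of the lookup side (the `ListKeyCache` class is hypothetical, not the actual MONAI implementation):

```python
class ListKeyCache:
    # minimal sketch: a list-based cache with a parallel list of hash keys,
    # mirroring the idea of keeping `_cache` and `_hash_keys` in the same order
    def __init__(self, keys, cache):
        self._hash_keys = keys  # hash key of each cached item, in cache order
        self._cache = cache     # cached (pre-transformed) items

    def lookup(self, key):
        # translate a hash key into a list index, replacing the old dict lookup
        if key in self._hash_keys:
            return self._cache[self._hash_keys.index(key)]
        return None  # key not cached

c = ListKeyCache(["k1", "k2"], ["item1", "item2"])
```

Note that `list.index` is a linear scan; for the small `cache_num` values typical here that is usually acceptable, which is part of why a separate dict cache adds little.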

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

Hi @myron ,

Thanks for your suggestion.
I updated the PR to avoid changing `self.data`. I kept the `_compute_cache` function because of a minor issue in your code: we should compute `self.cache_num` based on `data` instead of `self.data`.

Thanks.

This reverts commit 13801ee.

Signed-off-by: Nic Ma <[email protected]>
@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from e2a9277 to 7fcaa86 Compare October 28, 2022 15:30
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 28, 2022

Hi @myron ,

During deeper testing, I found that if the `data` in `p.imap(self._load_cache_item, data)` is a complicated iterable object (like another `Dataset` object), this multi-threaded `imap` may slice the items incorrectly. So I reverted the change; can we try to optimize the `self.data` logic later, when we have a clearer idea? I think that's out of the scope of this PR.

Thanks in advance.

@myron
Collaborator

myron commented Oct 30, 2022

> Hi @myron ,
>
> During deep tests, I found that if the data in p.imap(self._load_cache_item, data) is complicated iterable object (like another Dataset object), this multi-thread imap may do wrong slicing for the items. So I reverted the change, can we try to optimize the self.data logic later when we got more clear idea? I think that's not in the scope of this PR.
>
> Thanks in advance.

Thank you for the reply. So if `data` is a complicated iterable and `p.imap` does not handle its slicing, perhaps we can compute a list of non-repeated data indices and iterate over them. It seems like a more robust solution, and we don't need a workaround that changes `self.data` temporarily. Please see below. If this solution does not work for any reason, I'm okay with your PR as is. Thank you.

```python
def set_data(self, data: Sequence):
    """
    Set the input data and run deterministic transforms to generate cache content.
    Note: should call this func after an entire epoch and must set `persistent_workers=False`
    in PyTorch DataLoader, because it needs to create new worker processes based on new
    generated cache content.
    """
    self.data = data
    self._cache = []
    self._hash_keys = []
    self.cache_num = min(int(self.set_num), int(len(self.data) * self.set_rate), len(self.data))

    if self.cache_num <= 0:
        return

    if self.hash_as_key:
        # only compute cache for the unique items of the dataset
        mapping = {self.hash_func(data[i]): i for i in range(len(data))}
        self._hash_keys = list(mapping)[: self.cache_num]
        indices = list(mapping.values())[: self.cache_num]
    else:
        indices = list(range(self.cache_num))

    self._cache = self._fill_cache(indices)

def _fill_cache(self, indices=None) -> List:
    if self.cache_num <= 0:
        return []

    if indices is None:
        indices = list(range(self.cache_num))

    if self.progress and not has_tqdm:
        warnings.warn("tqdm is not installed, will not show the caching progress bar.")
    # note: _load_cache_item now receives an index into self.data
    # rather than the item itself
    with ThreadPool(self.num_workers) as p:
        if self.progress and has_tqdm:
            return list(
                tqdm(
                    p.imap(self._load_cache_item, indices),
                    total=len(indices),
                    desc="Loading dataset",
                )
            )
        return list(p.imap(self._load_cache_item, indices))
```
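The index-based proposal above can be exercised end to end with a small standalone sketch. All names here are hypothetical, and the doubling in `_load_cache_item` is just a stand-in for the real deterministic transform chain; the point is that the thread pool only ever receives plain `int` indices.

```python
from multiprocessing.pool import ThreadPool

class IndexCacheSketch:
    # hypothetical sketch of index-based cache filling: workers receive plain
    # int indices, so imap never slices the underlying data object directly
    def __init__(self, data, cache_num, num_workers=2):
        self.data = data
        self.cache_num = cache_num
        self.num_workers = num_workers

    def _load_cache_item(self, idx):
        # the loader takes an index into self.data rather than the item itself
        item = self.data[idx]
        return item * 2  # stand-in for the deterministic transform chain

    def fill(self, indices=None):
        if indices is None:
            indices = list(range(self.cache_num))
        with ThreadPool(self.num_workers) as p:
            # imap yields results in input order, so the cache list
            # lines up with the index list
            return list(p.imap(self._load_cache_item, indices))

cache = IndexCacheSketch([1, 2, 3, 4], cache_num=3).fill()
```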

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from dfa3bfa to 82f106f Compare October 31, 2022 03:26
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma Nic-Ma force-pushed the 5390-remove-dict-cache branch from e2e2294 to aa162ef Compare October 31, 2022 03:56
@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

Hi @myron ,

Thanks for your suggestion.
I have updated the PR according to your proposal, could you please help review it again?

Thanks in advance.

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Oct 31, 2022

/build

@wyli wyli merged commit 766f21f into Project-MONAI:dev Oct 31, 2022
KumoLiu pushed a commit that referenced this pull request Nov 2, 2022
yiheng-wang-nv pushed a commit to yiheng-wang-nv/MONAI that referenced this pull request Nov 2, 2022


Development

Successfully merging this pull request may close these issues.

Remove hashing by key from CacheDataset (hash_as_key)

4 participants