
Conversation

@jakirkham (Member)

Another approach to issue (#3656). I found it a bit easier to articulate this as a PR rather than as a comment. Hope that's OK.

Since `serialize_bytelist` already performs compression, this approach simply leverages that feature and stores the compressed data in an in-memory dict. It nests another `Buffer` inside `self.data` for transitioning uncompressed in-memory data to compressed in-memory data, and then to compressed on-disk data (this last step is the same as before).
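
For anyone skimming, here is roughly what the nesting looks like with the `zict` primitives the worker already uses. This is a simplified sketch rather than the diff itself: it uses the joined-bytes `serialize_bytes`/`deserialize_bytes` helpers instead of `serialize_bytelist`, and the `build_data` name and the 0.5/0.7 thresholds are made up for illustration.

```python
# Sketch of the three-layer idea: plain in-memory -> compressed in-memory
# -> compressed on disk.  Not the PR's exact code.
from dask.sizeof import sizeof
from zict import Buffer, File, Func

from distributed.protocol.serialize import deserialize_bytes, serialize_bytes


def build_data(memory_limit: int, local_directory: str) -> Buffer:
    # Compressed layer: values are serialized/compressed on the way in and
    # deserialized on the way out.  The nested Buffer keeps the compressed
    # bytes in a plain dict until they outgrow their limit, then spills them
    # to disk unchanged.
    compressed = Func(
        serialize_bytes,
        deserialize_bytes,
        Buffer(
            {},                            # compressed, in memory
            File(local_directory),         # compressed, on disk
            n=int(memory_limit * 0.7),     # illustrative threshold
            weight=lambda k, v: len(v),    # size of the compressed bytes
        ),
    )

    # Outer layer: ordinary in-memory objects, demoted to the compressed
    # layer once their (uncompressed) size crosses a lower threshold.
    return Buffer(
        {},                                # uncompressed, in memory
        compressed,
        n=int(memory_limit * 0.5),         # illustrative threshold
        weight=lambda k, v: sizeof(v),
    )
```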

If we decide we like this approach, something I could use some advice on is determining an appropriate transition point and exposing it to the user in a sensible way. Though we need not worry about that if another approach turns out to be more appropriate.

cc @prasunanand @TomAugspurger @martindurant @mrocklin @madsbk @quasiben

@jakirkham mentioned this pull request on Jul 19, 2020
@jakirkham force-pushed the store_compressed_data_worker branch from 73a736d to f3fa583 on July 20, 2020
@jakirkham (Member, Author)

Friendly nudge 😉 Would be great to get others' take on this 😄

@mrocklin (Member)

In principle the approach seems OK to me. I think that, as you say, there are some open questions to resolve:

  1. How do we decide when to move between different layers?
  2. When is this helpful? When is it harmful?

@jakirkham (Member, Author)

Thanks Matt! 😄

> How do we decide when to move between different layers?

My current thinking is that we add another weight/threshold for this and make it configurable (not yet done here); a rough sketch is below. Though this is just my naive thought. Would welcome thoughts from others here 🙂
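
To make that concrete, here is one way the transition point could hang off configuration. Purely illustrative: the `distributed.worker.memory.compress` key is hypothetical and not something this PR (or the existing config schema) defines.

```python
import dask

# Hypothetical key, shown only to illustrate the idea; not part of this PR
# or of the existing config schema.
compress_fraction = dask.config.get(
    "distributed.worker.memory.compress", default=0.6
)

memory_limit = 4_000_000_000  # example worker memory limit, in bytes
compress_target = int(memory_limit * compress_fraction)  # demote past this
```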

> When is this helpful? When is it harmful?

This is a good question. It would be helpful to identify some workloads where we expect this to matter. Happy to think about this some, but I'd also be interested to know if anyone else has workloads they'd like to try here (@fjetter? 😉).

One point worth raising is that this is the same compression step we have today. The only difference is that the compressed data stays around in memory before eventually being moved to disk. So the overall workflow itself hasn't changed; we have merely split one step into two.

@jakirkham force-pushed the store_compressed_data_worker branch 3 times, most recently from 3184c93 to 047323e on July 24, 2020
Adds a `Buffer` for transitioning in-memory data to in-memory compressed
data.
@jakirkham force-pushed the store_compressed_data_worker branch from 047323e to 121f5e9 on July 24, 2020
@jakirkham (Member, Author)

Have now pushed some logic to configure when compression occurs. This probably requires some more experimentation on realistic workloads to determine an appropriate default configuration, though it is overridable in any event.
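
For reference, overriding whatever default we land on would follow the usual dask config pattern, something like the snippet below (the key name is again hypothetical and stands in for whatever this PR ends up exposing).

```python
import dask

# Hypothetical key; substitute the setting this PR ultimately exposes.
with dask.config.set({"distributed.worker.memory.compress": 0.5}):
    ...  # start the worker / run the workload with the lower threshold
```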

Base automatically changed from master to main on March 8, 2021