
Conversation

@prasunanand
Contributor

Fixes: #3656

  • Add TypeCompressor class
  • Add __setitem__ and __getitem__
  • Still need to get rid of maybe_compress in distributed/protocol/core.py and distributed/protocol/serialize.py (currently trying to understand headers and comm)

cc: @mrocklin

@mrocklin
Member

Thanks for starting on this @prasunanand. I've added more context to the original issue; I hope that helps to clarify the objectives: #3656

@prasunanand
Contributor Author

Thanks @mrocklin :)

@prasunanand
Contributor Author

@mrocklin, could you please review this again?

compression_type = "zlib"
if compression_type == "auto":
compression_type = default_compression
self.storage[key] = self.compressions[compression_type]["compress"](value)
Member

I suspect that this won't work for values that don't satisfy the buffer protocol.

Compression functions generally consume bytes-like objects and return bytes-like objects. Your test probably passes because Numpy arrays happen to satisfy this protocol, but other objects, like custom user classes, probably wouldn't.

I recommend trying out your solution with a wider variety of workloads. I suspect that it will have difficulty on many of them.

You will probably need to look into how Dask serializes objects to understand how to convert arbitrary Python objects into bytes-like objects.
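To make that concrete, here is a minimal sketch of the serialize-then-compress pipeline being described (not this PR's code; it assumes serialize/deserialize in distributed.protocol and maybe_compress plus the compressions registry in distributed.protocol.compression, where they lived at the time):

from collections.abc import MutableMapping

from distributed.protocol import deserialize, serialize
from distributed.protocol.compression import compressions, maybe_compress


class TypeCompressor(MutableMapping):
    """Store arbitrary Python objects as serialized, optionally compressed frames."""

    def __init__(self):
        self.storage = {}

    def __setitem__(self, key, value):
        # serialize() turns any Python object into a header plus bytes-like frames,
        # so the compression step never sees objects lacking the buffer protocol
        header, frames = serialize(value)
        # maybe_compress() returns (compression_name_or_None, frame) for each frame
        self.storage[key] = (header, [maybe_compress(frame) for frame in frames])

    def __getitem__(self, key):
        header, stored = self.storage[key]
        frames = [
            frame if name is None else compressions[name]["decompress"](frame)
            for name, frame in stored
        ]
        return deserialize(header, frames)

    def __delitem__(self, key):
        del self.storage[key]

    def __iter__(self):
        return iter(self.storage)

    def __len__(self):
        return len(self.storage)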

@prasunanand
Contributor Author

prasunanand commented Apr 23, 2020

Thanks @mrocklin. I have a doubt: do serialization and compression mean the same thing for "large results or Python specific data, like NumPy arrays or Pandas"? If yes, should I call the serialize() function on Python-specific datasets, which looks up the different registered serializers?

if isinstance(key, (np.ndarray, pd.DataFrame)):
    compress_func = serialize
    decompress_func = serialize

Sorry for replying late.

@prasunanand
Contributor Author

Looking at the code in dumps, are both serialize() and compress() called on the data?

@TomAugspurger
Member

@prasunanand I would have a look at the data parameter in the Worker docstring:

data: MutableMapping, type, None
    The object to use for storage, builds a disk-backed LRU dict by default

I think the idea is to have Worker.data be the mutable mapping that implements automatic compression.

Compressing data prior to communication (in distributed/protocol) is also an interesting idea, but might be more involved.
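For reference, a rough sketch of that wiring, assuming the TypeCompressor from this PR and the import location used in the test later in this thread:

from distributed import Scheduler, Worker
from distributed.protocol import TypeCompressor  # import location assumed from this PR


async def main():
    async with Scheduler() as s:
        # Worker.data accepts any MutableMapping; passing the compressing mapping
        # here means every value the worker keeps in memory is stored compressed.
        async with Worker(s.address, data=TypeCompressor()) as worker:
            ...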

@prasunanand changed the title from "WIP: Autocompression" to "Autocompression" on Apr 24, 2020
@prasunanand marked this pull request as ready for review on April 24, 2020 11:22
@TomAugspurger left a comment

A few questions:

  1. How do we expose this to the user? Asking them to provide the TypeCompressor as a data= argument doesn't seem good. Perhaps a compress_data or compress_persisted_data boolean argument to Worker, with an associated config setting?
  2. Can this be combined with a zict.Buffer to enable in-memory compression & spill to disk at the same time? (A sketch of this is below.)
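As a hedged sketch of what question 2 could look like, this layers the compressing mapping under zict's spill machinery (using zict.Buffer, zict.Func, and zict.File, which the worker already relies on; the byte budget, directory name, and the TypeCompressor import location are assumptions, not the final API):

import zict
from dask.sizeof import sizeof
from distributed.protocol import TypeCompressor  # import location assumed from this PR
from distributed.protocol.serialize import deserialize_bytes, serialize_bytes

# Fast tier: in-memory, compressed (the mapping proposed in this PR).
fast = TypeCompressor()

# Slow tier: spill to disk, turning values into bytes on the way out.
slow = zict.Func(serialize_bytes, deserialize_bytes, zict.File("spill-directory"))

# Keep roughly 2 GB (by dask's sizeof estimate) in memory, spill the rest.
data = zict.Buffer(fast, slow, n=2e9, weight=lambda key, value: sizeof(value))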

import dask


class TypeCompressor(collections.abc.MutableMapping):
Member

This probably shouldn't be in distributed.protocol, since it isn't strictly related to the protocol.

Longer-term, this may be best to put in https://github.com/dask/zict/, but I'm happy to experiment with it here. This could maybe go in a new distributed/utils_data.py module.

Member

It's probably tricky to do that, as this depends on maybe_compress, which is in Distributed. That said, maybe there are ways to build this out of functionality in Zict, thus making this more compact (maybe not even needing this class).
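As an illustration of the Zict-only idea, a rough sketch that builds a compressing mapping out of zict.Func over serialized bytes (zlib is used directly here just for illustration, sidestepping maybe_compress; serialize_bytes/deserialize_bytes are the existing helpers in distributed.protocol.serialize):

import zlib

import zict
from distributed.protocol.serialize import deserialize_bytes, serialize_bytes

# A compressing MutableMapping assembled purely from zict primitives:
# values are serialized to bytes, compressed, and kept in an ordinary dict.
compressed_store = zict.Func(
    lambda value: zlib.compress(serialize_bytes(value)),    # dump
    lambda blob: deserialize_bytes(zlib.decompress(blob)),  # load
    dict(),
)

compressed_store["x"] = list(range(100_000))   # stored compressed
assert compressed_store["x"][:3] == [0, 1, 2]  # decompressed transparently on access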

@mrocklin
Member

Thanks @mrocklin. I have a doubt: do serialization and compression mean the same thing for "large results or Python specific data, like NumPy arrays or Pandas"? If yes, should I call the serialize() function on Python-specific datasets, which looks up the different registered serializers?

if isinstance(key, (np.ndarray, pd.DataFrame)):
    compress_func = serialize
    decompress_func = serialize

It should work on any Python object that is passed in. To do this well you will need to understand more about how Dask handles serialization and compression.

In general, I get the sense that this issue may be beyond your current understanding of Dask. I was probably wrong to mark this as a good first issue.

@prasunanand
Contributor Author

Apologies @mrocklin. I studied the zict codebase and worker.py, as I was unable to get my head around what this feature demanded. Now I have a better understanding, so I have tried once more and hope it solves the issue (or the solution is close).

If not, I will close this PR and work on other issues.

self.data = TypeCompressor()
for k, v in data.values():
    self.data[k] = data[k]
elif isinstance(data, TypeCompressor):
Member

I'm not sure what's going on here. Was it necessary to special-case TypeCompressor here? Ideally we only depend on the MutableMapping interface and don't have to worry too much about every possible type that someone might provide individually. That can make future maintenance an issue.

header, (compression, compressed) = self.storage[key]

frames = decompress({"compression": {compression}}, compressed)
return deserialize(header, frames)
Member

In principle this looks good @prasunanand. Thank you for sticking with this.

The next thing to do is probably to try it out in practice a bit and see how it performs. There are optimizations I can think of, like it might make sense to only compress values that we know are small using the sizeof function, but that would just be a guess. If you have time to try out a few different kinds of computations, maybe taken from examples.dask.org, that might be a good way to get more information here.

Member

@prasunanand, have you had any time to try the suggestions here?

Member

I thought this is what maybe_compress already does: it checks whether something is worth compressing and, if so, compresses it.
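For context, maybe_compress roughly behaves like this (a simplified sketch of the idea, not the actual implementation in distributed/protocol/compression.py):

import zlib


def maybe_compress_sketch(payload, min_size=10_000, sample_size=10_000):
    """Return (compression_name_or_None, payload), compressing only when it pays off."""
    if len(payload) < min_size:
        return None, payload  # too small for compression to be worth the overhead
    sample = bytes(memoryview(payload)[:sample_size])  # cheap probe of compressibility
    if len(zlib.compress(sample)) > 0.9 * len(sample):
        return None, payload  # data looks close to incompressible
    return "zlib", zlib.compress(bytes(payload))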

@jakirkham
Member

cc @madsbk (as this may be of interest 🙂)

@jakirkham
Member

Any thoughts or questions on the feedback so far, @prasunanand? 🙂

@prasunanand
Contributor Author

prasunanand commented Jul 14, 2020

Sorry, I have been busy at work. Last week I have been trying to measure the performance using different workloads. I found that NumPy arrays were not getting deserialized properly and corrected that. Regarding measuring performance for different workloads, I have been trying to use Dask diagnostics.

Please let me know if this is the right approach. In the following code, I am trying to compute over an ndarray of size (1000, 1000).

import asyncio

async def test_compression():
    import numpy as np
    import dask.array as da
    from distributed.protocol import TypeCompressor
    from distributed import Scheduler, Worker, Client, wait, performance_report

    x = np.ones(1000000)  # a large but compressible piece of data

    d = TypeCompressor()

    d["x"] = x  # put Python object into d
    out = d["x"]  # get the object back out
    d["z"] = 3

    assert str(out) == str(x)
    np.testing.assert_allclose(
        np.frombuffer(out, x.dtype), x
    )  # the two objects should match

    # assuming here that the underlying bytes are
    # stored in something like a `.storage` attribute, but this isn't required
    # we check that the amount of actual data stored is small
    assert sum(map(len, d.storage.values())) < x.nbytes

    async with Scheduler() as s:
        async with Worker(s.address, data=d) as worker:
            async with Client(s.address, asynchronous=True) as c:
                async with performance_report(filename="dask-report.html"):  # measure performance
                    x = da.ones((1000, 1000))
                    y = await x.persist()  # put data in memory
                    y = await (x + x.T).mean().persist()  # do some work
                    future = c.compute(y)
                    await wait(future)
                    assert sum(map(len, worker.data.storage.values())) < x.nbytes

asyncio.run(test_compression())

Attached is an HTML file that I got as a result. (To view it, change the extension from .txt to .html.)
dask-report.txt

@prasunanand
Contributor Author

prasunanand commented Jul 14, 2020

Apart from this: what other data types should I investigate (numpy.ndarray, string, pandas.Series)?

@jakirkham
Member

I tried another approach to this problem (hope that is ok 🙂) in PR #3968. We need not go with that; I'm just trying to see if we are able to get the behavior we want out of Zict alone. Would be curious to hear what people think of it.

Base automatically changed from master to main March 8, 2021 19:04
@prasunanand requested a review from fjetter as a code owner on January 23, 2024 10:57