Conversation

@jcrist (Member) commented Mar 29, 2016

Defines an abstract interface that's expected for objects to act like dask collections. This is useful for projects like xarray which want to use dask, but don't want to subclass from Base. Fixes #700. Ping @shoyer, @mrocklin for comment.

The interface is as follows:

class DaskIntf(object):
    """Minimal interface required for dask-like duck typing"""
    def _dask_optimize_(self, dsk, keys, **kwargs):
        """Return an optimized dask graph"""
        pass

    def _dask_finalize_(self, results):
        """Finalize the computed keys into an output object"""
        pass

    def _dask_default_get_(self, dsk, key, **kwargs):
        """The default scheduler `get` to use for this object"""
        pass

    def _dask_keys_(self):
        """The keys for the dask graph"""
        pass

    def _dask_graph_(self):
        """The dask graph"""
        pass

A few points:

  • All 5 of the above are expected to act as methods, and their presence is enough to indicate that something is a dask collection. However, they don't actually need to be methods, and in the case of _dask_finalize_ they should really be staticmethods (see the discussion in #700 (comment))
  • Internally in our dask collection code it's rather nice to be able to use the slightly shorter dask and _keys methods/attributes. I've left these in, but no general code depends on them. I'm half tempted to remove the _keys method, but the dask attribute should definitely stay.
  • The implementation could be changed to use ABCMeta, but that's a bit too much magic for my liking. Currently we just say a dask collection is something that subclasses Base, or has these 5 methods (a minimal sketch of such a check follows).
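
For concreteness, a minimal sketch of the check described in the last bullet; the helper name is_dask_collection is assumed here, not defined in this PR:

from dask.base import Base

def is_dask_collection(x):
    # A dask collection either subclasses Base or provides all five
    # interface methods.
    methods = ('_dask_optimize_', '_dask_finalize_', '_dask_default_get_',
               '_dask_keys_', '_dask_graph_')
    return isinstance(x, Base) or all(hasattr(x, m) for m in methods)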

TODO:

  • Cleanup implementation.
  • Go through codebase and find points where duck typing of collections is acceptable (at least swap out isinstance(x, Base) things)
  • Tests
  • Document interface

Example:

In [1]: %paste
from uuid import uuid4
from dask import do, value, compute
from dask.threaded import get

class Duck(object):
    def __init__(self, val):
        self.key = str(uuid4())
        self.val = val

    _dask_graph_ = lambda self: {self.key: self.val}
    _dask_keys_ = lambda self: self.key
    _dask_finalize_ = staticmethod(lambda x: x)
    _dask_optimize_ = staticmethod(lambda d, k, **kws: d)
    _dask_default_get_ = staticmethod(get)

d = Duck(1)
v = value(2)
o = sum([v, d, v, d, 2, 3])
## -- End pasted text --

In [2]: o.compute()
Out[2]: 11

@jcrist jcrist changed the title [WIP] Outline duck interface for dask objects [WIP] Expected interface for dask objects Mar 29, 2016
dask/base.py Outdated


-class Base(object):
+class DaskIntf(object):
@jcrist (Member Author):

The only reason I broke this off as its own object was to separate the expected interface from a base class with other common methods on it (Base). This seemed useful to me for documentation purposes, but does add yet-another-class. Happy to swap this out for just good docs.

Member:

I think this is fine, though I would call it DaskInterface instead of abbreviating Intf. It took me a while to figure that one out.

Defines an abstract interface that's expected for objects to act like
dask collections. This is useful for projects like xarray which want to
use dask, but don't want to subclass from Base.
@jcrist (Member Author) commented Apr 7, 2016

This is ready for review.

@mrocklin (Member) commented Apr 7, 2016

I hope to have time to review this later tomorrow


def _dask_default_get_(self, dsk, key, **kwargs):
    """The default scheduler `get` to use for this object"""
    raise NotImplementedError()
Member:

Should we annotate finalize and get here with staticmethod decorators as a hint to passersby?

Member:

Or alternatively, how do we communicate to others that finalize and get should not be instance methods?

@jcrist (Member Author):

Why should _dask_default_get_ not be an instance method? The only thing I can think of is that it can't rely on self, because only one get is used. I did put a warning in a couple of places in the docs about it, so hopefully people read about it there instead of mucking around in the code here?

Member:

I often muck in code. I'll search the source, find something like DaskInterface, and go ahead and implement what's there. Is there a reason not to add staticmethod? It seems like the more common choice.

@mrocklin (Member) commented Apr 8, 2016

Should we make _dask_optimize_ optional?

@mrocklin mrocklin changed the title [WIP] Expected interface for dask objects Expected interface for dask objects Apr 8, 2016
@mrocklin (Member) commented Apr 8, 2016

Generally this looks pretty good to me.

I think it'd be nice to co-develop this with xarray, just to make sure we hit all of the important cases.

@jcrist (Member Author) commented Apr 8, 2016

Making _dask_optimize_ optional might be nice; I'm unsure. I also thought about making _dask_default_get_ optional (and defaulting to dask.threaded.get if not present).

I have a proof of concept of this working with xarray. I could keep working on it, but it might be easier for an xarray dev to implement (or at least give me pointers on the code base).

@mrocklin (Member) commented Apr 8, 2016

@shoyer the above links might interest you

@shoyer (Member) commented Apr 8, 2016

@jcrist thanks for sharing that example. So, an xarray object may or may not contain dask arrays. Should we define these attributes if it does not? Presumably they would point to some sort of dummy data?

@mrocklin (Member) commented Apr 8, 2016

I think that cell 23 in the notebook is supposed to allay this concern. The interface functions err when self.data is not a dask-thing, so is_dask_collection(self) returns False.

@shoyer (Member) commented Apr 9, 2016

OK, nice. Yes, I see that the notebook does cover that.

One thing to watch out for is that hasattr is inconsistent between Python 2 and 3, so I'm not sure this example would work on Python 3: https://hynek.me/articles/hasattr/
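
A sketch of the inconsistency (assuming a property that raises, as in the notebook's Foo): Python 2's hasattr swallows any exception raised during the attribute lookup, while Python 3's only catches AttributeError:

class Foo(object):
    @property
    def _dask_graph_(self):
        # Raised while deciding whether this looks like a dask collection
        raise ValueError("no dask data here")

foo = Foo()
hasattr(foo, '_dask_graph_')
# Python 2: returns False (the ValueError is silently swallowed)
# Python 3: raises ValueError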

Relying on this behavior to signal that an object which may or may not be a duck type (like Foo from the notebook, or xarray types) is not one seems like asking for bugs. I would prefer something more explicit -- maybe returning None as a sentinel value? So we'd write something like this instead:

    @property
    def _dask_graph_(self):
        if isinstance(self.data, da.Array):
            return self.data._dask_graph_
        else:
            return None

and the test is_dask_collection would verify getattr(obj, '_dask_graph_', None) is not None instead of using hasattr.

"""Base class for dask collections"""

def _dask_graph_(self):
return self.dask
Member:

Should we keep around the .dask attribute or slowly deprecate it?

@mrocklin (Member):
Any general thoughts on this, @shoyer? @jcrist and I were chatting and were wondering how important this is to xarray.

@mrocklin (Member) commented Apr 3, 2017

This came up again in pydata/xarray#1344

In particular we would like to be able to call the following:

my_xarray = dask.persist(my_xarray)

Currently this calls the following function:

def redict_collection(c, dsk):
    cc = copy.copy(c)
    cc.dask = dsk
    return cc

How would this interface have to be extended to support this? What would XArray have to do to support this interface?
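
A sketch of one possible answer, anticipating the __dask_rebuild__ hook proposed later in this thread (redict_duck and _dask_rebuild_ are hypothetical names, not part of this PR):

import copy

def redict_duck(c, dsk):
    # Duck-typed analogue of redict_collection: ask the object itself
    # to rebuild from the new graph instead of mutating a `.dask`
    # attribute that duck types may not have.
    if hasattr(c, '_dask_rebuild_'):
        return c._dask_rebuild_(dsk)
    cc = copy.copy(c)
    cc.dask = dsk
    return cc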

@mrocklin (Member) commented Sep 1, 2017

and the test is_dask_collection would verify getattr(obj, '_dask_graph_', None) is not None instead of using hasattr.

This seems sensible to me.

Should we use double underscores like __dask_optimize__ rather than single underscores like _dask_optimize_? I'm not sure what Python expectations are here.

@mrocklin (Member) commented Sep 1, 2017

I find myself in need of this, so I'll probably start pushing on it a bit.

@jcrist have your thoughts on this topic changed at all since you wrote this originally?

@mrocklin (Member) commented Sep 1, 2017

This will need to deprecate the older interface a bit more slowly. Otherwise we introduce some friction for dask.distributed and any other low-level code that might be out there that depends on the old methods (such code exists, at least in private projects).

@jcrist (Member Author) commented Sep 1, 2017

I've been thinking about this a fair amount lately. I think if I was to redo it I would write it like:

class DaskIntf(object):
    """Minimal interface required for dask-like duck typing"""
    @staticmethod
    def __dask_precompute__(*args, **kwargs):
        """Given a set of objects `*args` that have the same `__dask_recompute__`
        method, return a dask graph and keys.

        By default this would merge the graphs and call `__dask_optimize__`, but
        could also be overridden to more strongly influence the graph construction"""
        pass

    @staticmethod
    def __dask_finalize__(results):
        """Finalize the computed keys into an output object"""
        pass

    @staticmethod
    def __dask_default_get__(dsk, keys, **kwargs):
        """The default scheduler `get` to use for this object"""
        pass

    def __dask_rebuild__(self, keys, results):
        """Given the keys and results, rebuild a new dask object.

        Used for `persist`"""
        pass

    def __dask_keys__(self):
        """The keys for the dask graph"""
        pass

    def __dask_graph__(self):
        """The dask graph. Return a mutable mapping or `None`. If `None`, not
        interpreted as a dask object."""
        pass

To check if something is a dask collection:

def is_dask_collection(x):
    return getattr(x, '__dask_graph__', lambda: None)() is not None

The compute/persist process then becomes (sketched in code after this list):

  • group arguments to dask.compute by __dask_precompute__
  • pass grouped arguments to corresponding __dask_precompute__ method. Get back a single graph, and a list of keys.
  • Merge all graphs together, concatenate all keys
  • Pass to the scheduler (same logic as currently for determining which scheduler)
  • If compute, pass results to __dask_finalize__
  • If persist, pass results and keys to __dask_rebuild__
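
A rough sketch of that flow, with the argument grouping and scheduler selection heavily simplified (this compute is illustrative glue, not the implementation; it assumes every argument is a dask collection and that __dask_precompute__ returns one keys-list per collection):

def compute(*args, **kwargs):
    # Group arguments by their __dask_precompute__ implementation.
    # For staticmethods this attribute is the same function object for
    # every instance of a class, so it works as a dict key.
    groups = {}
    for a in args:
        groups.setdefault(a.__dask_precompute__, []).append(a)

    # Each group builds its (optimized) subgraph; merge the graphs and
    # remember which keys belong to which collection.
    dsk, pairs = {}, []
    for precompute, collections in groups.items():
        graph, keys = precompute(*collections, **kwargs)
        dsk.update(graph)
        pairs.extend(zip(collections, keys))

    # Hand the merged graph to a scheduler (selection logic elided).
    get = args[0].__dask_default_get__
    results = get(dsk, [k for _, k in pairs], **kwargs)

    # For compute, finalize each collection's results; persist would
    # call __dask_rebuild__ with the keys and results instead.
    finalized = {id(c): c.__dask_finalize__(r)
                 for (c, _), r in zip(pairs, results)}
    return tuple(finalized[id(a)] for a in args)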

Certain methods might be made optional (I'm not settled on that). For example, __dask_precompute__ might just default to calling and returning __dask_graph__/__dask_keys__ if undefined. We might also default to the threaded scheduler if __dask_default_get__ is undefined.

Similarly, all the staticmethod methods might be defined together in a dictionary, similar to __array_interface__. Perk of this is it makes it harder for these to accidentally be instance methods, and keeps the object namespace smaller. Downside is dictionaries are mutable, inheritance is trickier, and sometimes it's nice to define the functions under the class instead of as top-level functions.
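
For comparison, a sketch of that dictionary variant (the name __dask_interface__ is invented here, loosely modeled on numpy's __array_interface__; nothing like it exists in this PR):

class Duck(object):
    def __init__(self, key, val):
        self.key = key
        self.val = val

    @property
    def __dask_interface__(self):
        # All interface functions live in one mapping, so none of them
        # can accidentally become bound instance methods.
        return {'graph': {self.key: self.val},
                'keys': [self.key],
                'optimize': lambda dsk, keys, **kwargs: dsk,
                'finalize': lambda results: results[0]}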


I find myself in need of this, so I'll probably start pushing on it a bit.

I'm curious what you need this for?

This will need to deprecate the older interface a bit more slowly.

I'm not sure about the needs of dask.distributed, but the existing interface was never public. Projects that are making use of this shouldn't expect the existing private methods to remain. I suspect that users making use of this are sufficiently cutting edge that breaking changes will be fine (this is true for at least the private projects I've worked on).

@mrocklin (Member) commented Sep 1, 2017

Similarly, all the staticmethod methods might be defined together in a dictionary, similar to __array_interface__. Perk of this is it makes it harder for these to accidentally be instance methods, and keeps the object namespace smaller. Downside is dictionaries are mutable, inheritance is trickier, and sometimes it's nice to define the functions under the class instead of as top-level functions.

I'm not overly concerned about accidental staticmethod cases. I suspect/hope that most implementations of the Dask interface will be somewhat expertly done. I don't expect this to be a common thing.

I'm not sure about the needs of dask.distributed, but the existing interface was never public. Projects that are making use of this shouldn't expect the existing private methods to remain. I suspect that users making use of this are sufficiently cutting edge that breaking changes will be fine (this is true for at least the private projects I've worked on).

It looks like dask.distributed uses _finalize once.

I'm curious what you need this for?

It has come up a couple of times recently

  1. I wanted to compute or persist an XArray dataset or dataarray. There is some support for this but it's not complete.
  2. I wanted to do async tests with XArray, but dask.distributed had no idea how to get out graphs or replace those graphs with futures.

@mrocklin (Member) commented Sep 1, 2017

More broadly, the reason for duck types with XArray is, I think, integration with dask.distributed.

@mrocklin (Member) commented Sep 2, 2017

def __dask_rebuild__(self, keys, results):

For rebuild I might suggest that it just takes a new task graph. The keys should be the same.

@mrocklin (Member) commented Sep 2, 2017

@jcrist is this something that you plan to have time for in the next week or so? Otherwise I might want to take it on if that would be ok with you (though I would be very happy if you happened to have time).

@mrocklin (Member) commented Sep 3, 2017

@shoyer just verifying that you think it's a decent idea to add methods like __dask_graph__, __dask_optimize__, and __dask_finalize__ directly to XArray DataSet and DataArray objects, yes? This would not force a strict dependency.

Question: is there any custom logic that you rely on in XArray right now? For example when you call compute does this have custom side effects like keeping around the computed arrays on the original object? Is there any logic in XArray that you would be sad to lose?

@shoyer (Member) commented Sep 3, 2017

Just verifying that you think it's a decent idea to add methods like __dask_graph__, __dask_optimize__, and __dask_finalize__ directly to XArray DataSet and DataArray objects, yes? This would not force a strict dependency.

Yes, as long as there's an easy way to express that an object doesn't contain any dask objects.

Question: is there any custom logic that you rely on in XArray right now?

No, I don't think so. We have a separate .load() method which does a compute plus caching, but .compute() and .persist() return new objects after calling the method on the contained dask arrays.
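
For readers unfamiliar with the xarray split, a simplified sketch of the distinction shoyer describes (not xarray's actual code):

class DataArrayLike(object):
    def __init__(self, data):
        self.data = data  # may be a dask array

    def compute(self):
        # Return a new object holding concrete data; self is unchanged.
        return DataArrayLike(self.data.compute())

    def load(self):
        # Compute *and* cache the result on this object, in place.
        self.data = self.data.compute()
        return self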

@jcrist (Member Author) commented Sep 4, 2017

@jcrist is this something that you plan to have time for in the next week or so?

Sure. I should be able to get to this in the next couple days (definitely by end of week).

@jcrist (Member Author) commented Oct 5, 2017

Apologies for the delay here. This is implemented in #2748, which supersedes this PR. Closing.

@jcrist jcrist closed this Oct 5, 2017
@jcrist jcrist deleted the ducks branch October 5, 2017 20:52
phofl added a commit to phofl/dask that referenced this pull request Dec 23, 2024