Conversation


@jcorbin jcorbin commented Mar 9, 2016

This allows the mapping function to receive arbitrary additional (possibly lazily computed) values alongside each partition.

dask/bag/core.py Outdated
dask.imperative.to_task_dasks and any resulting dasks merged into the
result Bag dask:
>>> b.map_partitions(lambda part, freqs: part / freqs, b.frequencies()) # doctest: +SKIP
Member

Note that this will have some negative consequences when it comes to memory usage. We'll end up keeping a fair amount more in RAM than otherwise.

Member

This particular example, part / freqs, probably isn't ideal. I'm unable to think of a situation where this would be valid. Perhaps use something like b.map_partitions(lambda part, total: [x / total for x in part], b.count())?

Should this be extended to map as well?


jcorbin commented Mar 9, 2016

Added a modified example; let's see if it passes doctest. Also added args (pre-unpacking) to the tokenize call.


mrocklin commented Mar 9, 2016

It's slightly off. This passes though:

        >>> b.map_partitions(
        ...     lambda part, total: [p / total for p in part],
        ...     b.sum()
        ... ).sum().compute()
        1.0

I got this by running the following:

mrocklin@workstation:~/workspace/dask$ py.test  --doctest-modules dask/bag/

I highly recommend trying to run tests locally. It greatly reduces iteration times.


mrocklin commented Mar 9, 2016

It would be good to have more tests for this sort of thing in test_bag.py: checking literal arguments (i.e. non-dask values), multiple arguments, etc.

@jcorbin jcorbin force-pushed the bag_map_parts_args branch from 785aa46 to 4c5822c on March 9, 2016 22:11

jcorbin commented Mar 9, 2016

Okay, added a new simple test case, and the doctest passes too. Doing the same for map probably makes sense; should I add that here or in a separate PR?

.map_partitions(sum)) == [20, 20, 20]
assert list(
b.map_partitions(lambda a, m: [x * m for x in a], value(2))
.map_partitions(sum)) == [20, 20, 20]
Member

If we're going to support *args then it would be also good to check multiple extra arguments.
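For example, a multi-argument check might look like the following (a hypothetical sketch mirroring the test above, assuming the same three-partition bag whose partitions each sum to 10; it mixes a dask value with a literal):

assert list(
    b.map_partitions(lambda a, m, n: [x * m * n for x in a], value(2), 3)
    .map_partitions(sum)) == [60, 60, 60]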


mrocklin commented Mar 9, 2016

I suggest putting map in the same PR. I suspect that map will be a more common entry point.

@jcorbin jcorbin force-pushed the bag_map_parts_args branch from 3cc50a0 to e90f12e on March 10, 2016 01:34

jcorbin commented Mar 10, 2016

Okay:

  • added a .map equivalent
  • since I chose to make fixed args come first with .map, I did the same thing for .map_partitions; I think this makes more sense anyhow, in addition to being easier for .map (see the sketch below)
  • map with multi-args is now mutually exclusive with map with fixed args; I think this is right, and I can't really see an easy way to combine the two concepts that wouldn't smell like "too much magic"
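A plain-Python sketch of the args-first convention described above (the helper name is invented for illustration; this is not the actual implementation):

def map_partitions_with_args(func, args, partitions):
    # Fixed arguments are bound *before* the data argument,
    # so each call looks like func(*args, partition).
    return [func(*args, part) for part in partitions]

assert map_partitions_with_args(
    lambda m, part: [x * m for x in part], (2,), [[1, 2], [3, 4]]
) == [[2, 4], [6, 8]]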

@jcorbin jcorbin changed the title Add support for extra args to Bag.map_partitions Add support for *args to Bag.map and Bag.map_partitions Mar 10, 2016
[0, 10, 20, 30, 40]
Any additional arguments get passed to the function _before_ the data
argument; argument values may be either concrete or a dask computation.
Member

This seems error-prone to me; I would have expected it to go the other way. I'm not suggesting that we change it, just bringing up the point that intuition may be split on this, which is troublesome.

I wonder if a full db.map function might be less ambiguous. Thoughts?

Author

Yeah, I also find after to be more intuitive; I think it would take defining a custom function like map_with_args(func, args, iterable) to use in place of map when we have any arguments, since it's not possible to partially apply from the end of the argument list. I'll update with this variant to make the example concrete and see if it works :-)
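For concreteness, a minimal sketch of that helper (the name comes from the comment above; data first, extra args after):

def map_with_args(func, args, iterable):
    # Partially applies from the *end* of the positional argument list,
    # which functools.partial cannot do (it binds from the front).
    return [func(x, *args) for x in iterable]

assert map_with_args(lambda x, m: x * m, (2,), [1, 2, 3]) == [2, 4, 6]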

@mrocklin

What happens now if someone does something like the following:

a = db.from_sequence([1, 2, 3])
b = db.from_sequence([10, 20, 30])
a.map(add, b)


jcorbin commented Mar 10, 2016

Good point about a.map(add, b); the more I reflect on this, the more I think that I need to take over the to_task_dasks routine that I only borrowed from imperative, to:

  • correctly handle, and not optimize, Item, which is the more natural choice within the bag module
  • at least warn, if not hard break, if the user tries to pass in a large argument (like an entire bag), since they may be expecting some sort of unwrapping/alignment, and that's not what they get here

@mrocklin

It might be worth writing down a test or an in-depth comment that precisely specifies behavior before jumping in and implementing things. This is a significant enough API enhancement that people might want to weigh in.

@jcorbin jcorbin force-pushed the bag_map_parts_args branch from e90f12e to 0972425 on March 10, 2016 21:57

jcorbin commented Mar 10, 2016

Okay:

  • went back to passing *args last to the function; it turned out to be easier than I had expected. I've kept it as separate git commits for now, but will squash them down once we confirm this is what we want
  • I generalized imperative.to_task_dasks and am now using an Item-aware variant local to the bag module; this seems the most conservative thing to do for now and doesn't force resolving Item vs Value yet
  • updated the tests with less contrived examples (as uncontrived as number tricks ever get ;-)

I can easily split out the prep work for base.normalize_to_dasks into a separate PR if you prefer. Also, please let me know if you'd rather bag.to_task_dasks be named differently; I wasn't certain what inspired its name in imperative and have just copied it.

@jcorbin jcorbin force-pushed the bag_map_parts_args branch from 0972425 to fe768c2 on March 10, 2016 22:24

jcrist commented Mar 10, 2016

Looking at this, I actually find the implemented behavior surprising. In the standard library, multiple arguments to map are all passed as iterables:

In [5]: x = [1, 2, 3, 4, 5]

In [6]: y = [6, 7, 8, 9, 10]

In [7]: map(f, x, y)
Out[7]: [7, 9, 11, 13, 15]

What you implemented is more like:

def your_map(f, itbl, *args):
    return map(lambda x: f(x, *args), itbl)

Perhaps this is the better API, but I personally expected it to work like the stdlib. What you have here is also fairly easy to implement with a simple inline lambda:

b.map(lambda x: f(x, *args))

If I were to implement this, I'd base it on zip. Something like (untested):

def map(self, f, *args):
    if args:
        return bag_zip(self, *args).map(f)
    <normal implementation here>
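Under that reading, the earlier a.map(add, b) example has an obvious meaning; a plain-list analogue of the elementwise pairing:

from operator import add

a = [1, 2, 3]
b = [10, 20, 30]
# Elementwise, exactly like the stdlib: pairs are combined positionally.
assert list(map(add, a, b)) == [11, 22, 33]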


jcorbin commented Mar 11, 2016

Fair point @jcrist; remember that my original case here was map_partitions, and I only added the map equivalent because @mrocklin mentioned it for consistency. Really this is more like the .apply family of methods in pandas (e.g. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html).

Your inline lambda trick only works if the arguments are static, and not eventually computed from the dataset (e.g. an Item from a bag reduction).
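To make that distinction concrete, here is a minimal sketch (assuming Python 3 and dask.bag): without argument support, a computed value forces an eager extra pass before the mapping can even be built.

import dask.bag as db

b = db.from_sequence([1, 2, 3, 4], npartitions=2)

# Static argument: an inline lambda simply closes over the constant.
assert b.map(lambda x: x + 10).compute() == [11, 12, 13, 14]

# Computed argument: b.sum() is a lazy Item, not a number, so it must be
# computed eagerly in a separate pass before it can be closed over.
total = b.sum().compute()
assert b.map(lambda x: x / total).compute() == [0.1, 0.2, 0.3, 0.4]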

My motivating use case is when you need to do a two-pass run over the data where the map(_partitions) phase needs the result of a reduction over the data, e.g. here: #1040 (comment)

@mrocklin

Two points:

  1. I think that map and map_partitions will, in the end, have almost identical APIs, so the discussion here is valid even for the map_partitions work
  2. The treatment of reduced results (Item/Value/whatever) is novel and is something for us to think about.

In other APIs we've supported singleton arguments in map using kwargs, assuming that all kwargs are singletons and all args are sequences/collections. Would this compromise work in your applications, @jcorbin?

@mrocklin

OK, it seems to me that there are a few active questions about the design of map.

  1. Should we support mapping a function across multiple bags? Should we raise a warning at runtime if they are differently sized?
db.map(add, bag1, bag2)
  2. If an argument is not a bag, should we treat it as a constant?
db.map(add, bag1, 10)
  3. If an argument is not a bag, but is a dask collection with a single key, should we also treat it as a constant?
db.map(add, bag1, bag2.sum())

Using constants as args within map violates expectations set by map functions elsewhere. However, it's also very convenient. In distributed I got around this by using constants as kwargs:

def add(x=None, y=None):
    return x + y

db.map(add, bag1, y=10)

Any thoughts anyone?


jcorbin commented Mar 11, 2016

@mrocklin here are my thoughts on each of your points:

Re 1. and the db.map(add, bag1, bag2) case:

  • this is already possible using db.bag_zip(bag1, bag2).map(...)
  • core Python has always been unclear here (map(f, zip(*its)) vs map(f, *its)), largely due to the eager-vs-lazy dichotomy that didn't really become clear until py3k
  • so in my view, this is not a feature that needs to be aped, especially in a library designed for lazy computation

Re 2. passing constant/static arguments

  • it was only by side-effect that my change ended up allowing this use case, since to_task_dasks preserves such values
  • this use case was already possible using partial or a lambda, depending on call-site needs (see the sketch below)
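For reference, the sketch mentioned above: binding a constant with functools.partial or a lambda needs no bag-side support at all.

import dask.bag as db
from functools import partial
from operator import mul

b = db.from_sequence([1, 2, 3], npartitions=3)

# partial binds the constant on the left: partial(mul, 2)(x) == mul(2, x).
assert b.map(partial(mul, 2)).compute() == [2, 4, 6]
assert b.map(lambda x: x * 2).compute() == [2, 4, 6]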

Re 3. passing a dask's eventually computed value into the mapping function

  • whether that be an imperative Value or a bag Item, the intent here was towards scalar values
  • the values here are expensive to compute and, in my case (#1040), most of the cost of computing them is a shared dependency with the mapping itself (see example below)

Writing my reply to case 2 above got me thinking:

  • what if I had a way of late-binding (partial-ing, lambda-wrapping, etc.) my map(_partitions) function to an eventual computation?
  • then the change to each of map and map_partitions would simply be supporting this new kind of function
  • which could apply to many other contexts (anywhere else we take functions, e.g. reductions)

I may have a run at trying to implement this idea to see how feasible it would be.
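A very rough, purely hypothetical sketch of what such a late-bound function wrapper might look like (every name here is invented for illustration and is not dask API):

class ComputedFunction(object):
    # Pairs a function with lazily computed extra arguments; the bag
    # machinery would resolve the extras and append them at call time.
    def __init__(self, func, *extras):
        self.func = func
        self.extras = extras  # e.g. Items/Values resolved before calling

# Intended usage (illustrative only):
#   b.map_partitions(ComputedFunction(
#       lambda part, total: [x / total for x in part], b.sum()))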

Example problem for case 3 above from #1040, for 10 shards:

[task graph image: complex_dask]

And its execution over ~400 shards:

[profile image: complex_dask_profile]


jcorbin commented Mar 12, 2016

Okay so per the comments on #1044, shall I:

  • shift this API to be **kwargs on .map and .map_partitions
  • only support Value and Item values in **kwargs
  • drop the to_task_dasks work, and just do something targeted only at Value/Item?

@mrocklin

I'm much more comfortable supporting non-sequences in keyword arguments. @jcrist, are you OK with this? I'd like to see Value/Item as well as non-dask values work in keywords.

Ideally the following would work:

import pytest
import dask.bag as db

b = db.from_sequence([1, 2, 3, 4], npartitions=2)
def add(x=0, y=0):
    return x + y

assert list(b.map(add, y=100)) == [101, 102, 103, 104]
assert list(b.map(add, y=b.sum())) == [11, 12, 13, 14]
with pytest.raises(TypeError):
    b.map(add, 10)
with pytest.raises(TypeError):
    b.map(add, b.sum())

I'm not making any claim about what we should do with b.map(add, b). I suggest that we just don't support it for now.

Are there other ideas for how the API should work? I think that suggesting ideas as a test suite is a good way to precisely nail down intended behavior.


jcorbin commented Mar 14, 2016

Okay, so barring any other feedback, I will pivot this PR to use **kwargs with scalar-valued (dask.bag.Item, dask.imperative.Value, or a plain Python literal) pass-through; intended use case:

b = dask.bag.from_sequence(range(100), npartitions=10)
assert b.map(lambda x, total=0: x / total, total=b.sum()).sum().compute() == 1.0
assert b.map_partitions(lambda X, total=0: [x / total for x in X], total=b.sum()).sum().compute() == 1.0

One final counter-idea: this usage pattern could be supported by "lifting" scalar values under bag_zip (see https://gist.github.com/jcorbin/3173695ec83b20386f36, which has an example in a comment).
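For flavor, a plain-Python analogue of the lifting idea (the gist does this lazily with dask graphs; here a scalar is simply broadcast so it zips against every element):

from itertools import repeat

def scalar_zip(seq, scalar):
    # Broadcast the scalar against the sequence, zip-style.
    return zip(seq, repeat(scalar))

total = sum([1, 2, 3, 4])  # stand-in for an eventual b.sum() result
assert [x / t for x, t in scalar_zip([1, 2, 3, 4], total)] == [0.1, 0.2, 0.3, 0.4]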

@mrocklin

The scalar_zip idea looks cool. It breaks the Python interface, though, and so would need some discussion. If you'd like to get this in soon then I recommend implementing things as your examples show. I think that this change won't surprise/offend anyone (though others should please speak up if otherwise).


jcorbin commented Mar 14, 2016

Turned the gist into #1049; the more I think about this, the more I like expanding zip's definition, since it enables more usage patterns (like the reduction cases that computed functions also supported).

That said, I'll still update this PR as discussed to directly support value-kwargs, and then we can decide if we want either/both.

I'm in no rush to get any of this landed, since I can easily continue using development-branch copies of dask to get things done with my workaday datasets ;-) I'd much rather arrive at a solution (or solutions) that makes the most sense to others.


jcorbin commented Mar 15, 2016

@mrocklin let me know if you would like me to salvage any of the cleanup work or the to_task_dasks refactor before abandoning this in favor of #1054.


jcrist commented Mar 17, 2016

Closing as superseded by #1054.

@jcrist jcrist closed this Mar 17, 2016