Skip to content

Conversation

@dsgreen2
Copy link
Contributor

@dsgreen2 dsgreen2 commented Mar 16, 2023

Adds a method to_dask_dataframe() to convert a dataarray to a dask dataframe.

I have added the function to_dask_dataframe() in dataarray.py . This implementation is as suggested in issue #7409 . The function first converts the data array to a temporary dataset and then calls Dataset.to_dask_dataframe() method.

Could you please review it . Thank you.

Examples
--------
da=xr.DataArray(np.random.rand(4,3,2),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is not correctly formatted, see other functions as reference.

Also, don't use random values in examples, simply use np.ones(...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.ones can hide errors when dealing with tricky shapes so something like np.arange(4*3*2).reshape(4,3,2) is a little better.

assert_array_equal(actual.index.names, list("ABC"))

def test_to_dask_dataframe(self) -> None:
arr_np = np.random.randn(3, 4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it doesn't really matter in most cases, we try to avoid random values in tests.

Maybe use np.arange(3*4).reshape(3,4).

if self.ndim == 0:
raise ValueError("Cannot convert a scalar to a dataframe")

tmp_dataset = Dataset({name: self})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally we use the to tmp dataset method here, but since we only use it to construct the data frame and don't roundtrip it doesn't actually matter?

Comment on lines 6686 to 6687
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.

Follow numpys docstring conventions. More errors above and below.

Comment on lines 6725 to 6732
if name is None:
name = self.name

if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name is None:
name = self.name
if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter"
)

Not needed when using self._to_dataset_whole.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be better to keep this error message ? When I removed it, The error shown was ' unable to convert unnamed DataArray to a Dataset without providing an explicit name ' . Keeping these lines can show the error message specific to dataarray to daskdataframe conversion.

Examples
--------
da=xr.DataArray(np.random.rand(4,3,2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.ones can hide errors when dealing with tricky shapes so something like np.arange(4*3*2).reshape(4,3,2) is a little better.

@dsgreen2
Copy link
Contributor Author

I have made the changes as suggested. Please review them .Thanks

Comment on lines 6732 to 6741
if name is None:
name = self.name

if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
)
ds = self._to_dataset_whole(name)
return ds.to_dask_dataframe(dim_order, set_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name is None:
name = self.name
if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
)
ds = self._to_dataset_whole(name)
return ds.to_dask_dataframe(dim_order, set_index)
name = self.name if self.name is not None else _THIS_ARRAY
ds = self._to_dataset_whole(name, shallow_copy=False)
return ds.to_dask_dataframe(dim_order, set_index)

I think we go with this. I don't think it should be necessary to name the dataarray which is more in line with how self._to_temp_dataset works and setting dataarray.name = "new_name" is easy enough.

Comment on lines 6694 to 6700
name : Hashable or None, optional
Name given to this array(required if unnamed).
It is a keyword-only argument. A keyword-only argument can only be passed
to the function using its name as a keyword argument , and not as a
positional argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name : Hashable or None, optional
Name given to this array(required if unnamed).
It is a keyword-only argument. A keyword-only argument can only be passed
to the function using its name as a keyword argument , and not as a
positional argument.

Comment on lines 6681 to 6682
*,
name: Hashable | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*,
name: Hashable | None = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes .Please review them.

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still an issue with the docstring example, probably some whitespace mismatch somewhere. It should just be copy/pasting the results from the ipython console.

By `Deepak Cherian <https://github.com/dcherian>`_.
- Improved performance in ``open_dataset`` for datasets with large object arrays (:issue:`7484`, :pull:`7494`).
By `Alex Goodman <https://github.com/agoodm>`_ and `Deepak Cherian <https://github.com/dcherian>`_.
- Added new method :py:meth:`DataArray.to_dask_dataframe`,convert a dataarray into a dask dataframe (:issue:`7409`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

vectors in contiguous order , so the last dimension in this list
will be contiguous in the resulting DataFrame. This has a major influence
on which operations are efficient on the resulting dask dataframe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Copy link
Contributor Author

@dsgreen2 dsgreen2 Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes .Please review them. Thanks

if self.name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"dask dataframe : use the ``name`` parameter ."
"dask dataframe : use the ``.rename`` method to assign a name."

@dcherian
Copy link
Contributor

dcherian commented Apr 28, 2023

Thanks for your patience here @dsgreen2 . This is a nice contribution. Welcome to Xarray!

@dcherian dcherian enabled auto-merge (squash) April 28, 2023 14:29
@dcherian dcherian merged commit 25d9a28 into pydata:main Apr 28, 2023
@welcome
Copy link

welcome bot commented Apr 28, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement DataArray.to_dask_dataframe()

4 participants