Description
What happened?
We are trying to convert a 17 GB Zarr dataset to Parquet with xarray by calling Dataset.to_dask_dataframe() and then ddf.to_parquet(). The call to to_dask_dataframe() crashes the notebook with "Kernel Restarting: The kernel for debug/minimal.ipynb appears to have died. It will restart automatically." The same crash occurs with a synthetic dataset of the same size, which we create in the example below.
What did you expect to happen?
We expected a Dask dataframe object to be created lazily, without crashing the notebook. We expected the operation to be lazy based on the source code and on the following line in the docs: "For datasets containing dask arrays where the data should be lazily loaded, see the Dataset.to_dask_dataframe() method."
Minimal Complete Verifiable Example
import dask.array as da
import xarray as xr
import numpy as np
chunks = 5000
dim1_sz = 100_000
dim2_sz = 100_000
# Does not crash when using the following constants:
# dim1_sz = 10_000
# dim2_sz = 10_000
ds = xr.Dataset({
    'x': xr.DataArray(
        data=da.random.random((dim1_sz, dim2_sz), chunks=chunks),
        dims=['dim1', 'dim2'],
        coords={'dim1': np.arange(0, dim1_sz), 'dim2': np.arange(0, dim2_sz)})})
df = ds.to_dask_dataframe()
df
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
This operation crashes when the array size is above some (presumably machine-specific) threshold and works below it. You may need to adjust the array size to reproduce this behavior.
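For anyone adjusting the array size, a rough size estimate may help relate the two configurations in the example to available memory. The sketch below is only back-of-the-envelope arithmetic and does not identify the cause of the crash. It assumes 8-byte values (float64 data, int64 coordinates) and a resulting dataframe with three columns (dim1, dim2, x). The helper name estimate_sizes is hypothetical.

```python
# Back-of-the-envelope size estimate for the synthetic dataset in the example.
# Assumptions (not stated in the report): 8-byte values (float64 data, int64
# coordinates) and a dataframe with three columns: dim1, dim2, x.

def estimate_sizes(dim1_sz, dim2_sz, chunk, itemsize=8, n_columns=3):
    """Return rough row, chunk, and byte counts for a 2-D array and its flattened dataframe."""
    rows = dim1_sz * dim2_sz                       # one dataframe row per array element
    chunks_dim1 = (dim1_sz + chunk - 1) // chunk   # ceiling division along dim1
    chunks_dim2 = (dim2_sz + chunk - 1) // chunk   # ceiling division along dim2
    return {
        "rows": rows,
        "n_chunks": chunks_dim1 * chunks_dim2,
        "chunk_bytes": chunk * chunk * itemsize,     # one full (chunk x chunk) block
        "array_bytes": rows * itemsize,              # the 'x' variable alone
        "frame_bytes": rows * itemsize * n_columns,  # all columns, if fully materialized
    }


if __name__ == "__main__":
    for size in (10_000, 100_000):
        est = estimate_sizes(size, size, chunk=5000)
        print(f"{size} x {size}:")
        for key, value in est.items():
            if key.endswith("bytes"):
                print(f"  {key}: {value / 1e9:.1f} GB")
            else:
                print(f"  {key}: {value}")
```

Under these assumptions, the 100_000 x 100_000 case has 100 times as many elements as the 10_000 x 10_000 case that does not crash, so any step that materializes even one full-length column would need far more memory in the larger case.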
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59)
[GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.196-108.356.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.1
pip: 22.1
conda: None
pytest: None
IPython: 8.3.0
sphinx: None