-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Remove "Size: " prefix in Dataset and DataArray repr + add option to show nbytes on individual variables #9078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Remove "Size: " prefix in Dataset and DataArray repr + add option to show nbytes on individual variables #9078
Conversation
d11874f to
278ff3d
Compare
|
Thanks a lot @etienneschalk . I support the removing of What's the case for defaulting to only have the size on lazy variables? I would have the default be for all but ofc open-minded... |
Maybe the option naming should be changed, eg having strings like In all cases, a default behaviour should be defined before this PR is merged (updating the repr tests is a laborious task, extremely error-prone). I also think having that a canonical single string representation would be more easily maintainable in the future compared to supporting too many options |
|
I think nbytes is least useful on coordinates, which are generally small. I'm more on the fence for data variables in an xarray.Dataset, though at least for me I think I find just the top level total size of the Dataset most relevant. For the repr, what about something like |
|
For raster data, coordinates can become very large and weigh dozens of kB rapidly, so I think they still deserve to be shown. Maybe what is really needed is to only show the Regarding the "size" word, it can introduce a confusion because DataArray.size gives the number of elements in an array as well as numpy's ndarray.size) whereas the memory footprint is given by the Regarding the
This would solve both the wording problem ("size" is confusing, "nbytes" is a bit cryptic) as well as the equal size, while also de-cluttering the repr. To me the Current vs ProposalCurrent Proposal Diff:
The threshold could be configurable. This would also removing the flaky logic I wrote to detect lazy arrays, by using a simpler criteria (an integer memory footprint and a threshold, rather than checking the laziness of arrays by devious means) Note: one could create coordinates |
Lazy variables (using Xarray's internal lazy arrays) just have I like moving nbytes into the repr header, like
I also like the idea to skip nbytes for the inline repr of small arrays, though I do worry that is gets a little harder to keep track of data vs size when some variables have it and other don't. Consistency might ultimately be more important. With regards to adding a new option |
+1 (FWIW I generally find that adding options, while it skips over disagreements, it doesn't resolve the issue + requires more maintenance for different modes, particularly when we have lots of options and their product scales factorially) Agree with moving into the header... I would still medium-strength vote for having sizes on each variable, but having it on the whole object accomplishes 80% of the goal. (and would weigh my vote less than @shoyer 's...) |
👍 for The only problem I see with
Yes it helps reasoning for sizes at the group level. I have already an example of a DataTree containing various rasters, with groups containing only float32 rasters, and other groups, only uint8 rasters. The dominant dtype of the group is the most relevant information to "weigh" the DataTree. In that context losing the per-variable nbytes display is ok ; but for inspecting a particular Dataset, being able to see all the sizes of the variables still is important in my opinion.
Yes indeed, it is confusing ; maybe the nbytes threshold criteria should rather be applied on the whole dataset / datarray level rather than on a per-variable basis. If a dataset exceeds 1kB, then all the variables have their size shown, otherwise, none.
👍 especially for a sole Dataset (outside of a DataTree context) It seems it is difficult to find a consensus on when to show or not to show the nbytes information. A default behaviour seems hard to reach in that context, and in the consensus, nobody is totally satisfied. Another solution would be to have an "all or nothing" approach to nbytes in the repr. Such a solution would ensure consistency. Do not show nbytes by default, and add a global option that enables showing the nbytes, for all variables, no matter the conditions (no matter if it is lazy / small). Users that want to inspect the memory footprint of an xarray data object are more likely to opt-in (changing the global option), and want all the available information. Examples: By default : No nbytes are shown. Clean minimalistic classical repr of an xarray data structure: <xarray.Dataset>
Dimensions: (foo: 1200, bar: 111)
Coordinates:
* foo (foo) int16 0 1 2 3 4 5 6 7 8 ... 1194 1195 1196 1197 1198 1199
* bar (bar) int16 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
Data variables:
*empty*User opt-in with a global option: All possible nbytes are shown, with no concession. It is an active choice of the user to see all the nbytes info. <xarray.Dataset [3kB]>
Dimensions: (foo: 1200, bar: 111)
Coordinates [2kB]:
* foo (foo) int16 [2kB] 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
* bar (bar) int16 [222B] 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
Data variables [0B]:
*empty*The use of a distinct symbol (here brackets, but it could be parenthesis too, or any other pair of symbols) emphasizes the size, making it easier to identify when reading a text repr. What do you think of the all or nothing approach ? |
|
I like the idea of putting the total memory size (data + coordinates + indexes(?)) into the top-level angle brackets, as well as adding size information to the subsections. I'm not sure about how to format this, though: parentheses are already used by the dimension sizes in the first line of the (somewhat unrelated to this, so I don't mind discussing in a separate issue: should we add a As for per-variable memory footprints, I would advise against behaving differently depending on the size of the variable. Instead, I'd prefer a single option that enables / disables the printing of size information for variables ( Also, I'd like to avoid using square brackets on the inline |
nbytesto repr? #8690whats-new.rst[ ] New functions/methods are listed inapi.rstSummary
Removed the "Size: " prefix before the
nbytesin theDataArrayandDatasetrepresentations,and added a
display_variables_nbytesoption to show or hide thenbytesof individual variablesin the
DataArrayandDatasetrepresentations. The option can take one of those values:True: to always show the nbytes for variablesFalse: to always hide the nbytes for variablesdefault: to only show the nbytes for lazy variables (e.g. dask arrays)Open Points
(1) I am unsure how to universally check for the laziness of an array. In the current implementation, I use the following definition: "if the
.chunksattribute of a variable is not None, it means it is lazy".Any better way of defining what is a "lazy variable" would be welcome !
(2) The 'default' value of the 'display_variables_nbytes' flag implements the behaviour described in the comment of @shoyer : #8690 (comment)
True always show nbytes, False always hide nbytes, 'default' depends on the laziness of the array.