Skip to content

Conversation

@etienneschalk
Copy link
Contributor

@etienneschalk etienneschalk commented Jun 8, 2024

  • Closes Add nbytes to repr? #8690
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

Summary

Removed the "Size: " prefix before the nbytes in the DataArray and Dataset representations,
and added a display_variables_nbytes option to show or hide the nbytes of individual variables
in the DataArray and Dataset representations. The option can take one of those values:

  • True : to always show the nbytes for variables
  • False : to always hide the nbytes for variables
  • default : to only show the nbytes for lazy variables (e.g. dask arrays)

Open Points


(1) I am unsure how to universally check for the laziness of an array. In the current implementation, I use the following definition: "if the .chunks attribute of a variable is not None, it means it is lazy".

Any better way of defining what is a "lazy variable" would be welcome !


(2) The 'default' value of the 'display_variables_nbytes' flag implements the behaviour described in the comment of @shoyer : #8690 (comment)

not to include it there in favor of showing more data values (unless that data values are hidden, e.g., for a lazy array, in which case we could show memory usage instead).

True always show nbytes, False always hide nbytes, 'default' depends on the laziness of the array.


@etienneschalk etienneschalk changed the title Etienneschalk/remove size prefix repr Remove "Size: " prefix in Dataset and DataArray repr + add option to show nbytes on individual variables Jun 8, 2024
@etienneschalk etienneschalk force-pushed the etienneschalk/remove-size-prefix-repr branch from d11874f to 278ff3d Compare June 8, 2024 17:38
@max-sixty
Copy link
Collaborator

Thanks a lot @etienneschalk .

I support the removing of Size: (we should let lots of folks weigh in though)

What's the case for defaulting to only have the size on lazy variables? I would have the default be for all but ofc open-minded...

@etienneschalk
Copy link
Contributor Author

Thanks a lot @etienneschalk .

I support the removing of Size: (we should let lots of folks weigh in though)

What's the case for defaulting to only have the size on lazy variables? I would have the default be for all but ofc open-minded...

Maybe the option naming should be changed, eg having strings like "none", "lazy", "all" instead of False, "default", True, with an actual default to be chosen in "none", "lazy", "all".

In all cases, a default behaviour should be defined before this PR is merged (updating the repr tests is a laborious task, extremely error-prone).

I also think having that a canonical single string representation would be more easily maintainable in the future compared to supporting too many options

@shoyer
Copy link
Member

shoyer commented Jun 10, 2024

I think nbytes is least useful on coordinates, which are generally small. I'm more on the fence for data variables in an xarray.Dataset, though at least for me I think I find just the top level total size of the Dataset most relevant.

For the repr, what about something like <xarray.DataArray (time: 10) size=80B> or <xarray.Dataset size=136B>? This looks a little cleaner to me than adding Size: on the same row.

@etienneschalk
Copy link
Contributor Author

For raster data, coordinates can become very large and weigh dozens of kB rapidly, so I think they still deserve to be shown. Maybe what is really needed is to only show the nbytes of variables (no matter if they are coordinate variables or data variables) when they have a large memory footprint.

Regarding the "size" word, it can introduce a confusion because DataArray.size gives the number of elements in an array as well as numpy's ndarray.size) whereas the memory footprint is given by the nbytes property for both xarray: DataArray.nbytes and numpy: ndarray.nbytes. Also, in your original comment you also mentioned nbytes in the repr: #8690 (comment)

Regarding the = sign, I am a bit worried about using it since the footprint is truncated not to occupy too much space in the repr, but could trick users into believing this is the exact size when it's not.

<xarray.DataArray (time: 10) size=80B> or <xarray.Dataset size=136B> => <xarray.DataArray (time: 10) 80B> or <xarray.Dataset 136B>`

This would solve both the wording problem ("size" is confusing, "nbytes" is a bit cryptic) as well as the equal size, while also de-cluttering the repr. To me the B prefix is a hint that it is a memory footprint (B kB MB GB). However, B alone is less significant, so, joining the first point of not showing the size of small arrays, I would suggest that for an array to have its nbytes displayed, it should weigh more than the symbolic threshold of 1kB. This criteria could also be applied to the global nbytes shown at DataArray or Dataset level.

Current vs Proposal

Current

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

Proposal

<xarray.Dataset 3kB>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
Data variables:
    *empty*

Diff:

  • Size : prefix removed ;
  • nbytes integrated inside the chevrons ;
  • the bar variables, not qualifying to the 1kB threshold, see its nbytes repr removed

The threshold could be configurable. This would also removing the flaky logic I wrote to detect lazy arrays, by using a simpler criteria (an integer memory footprint and a threshold, rather than checking the laziness of arrays by devious means)

Note: one could create coordinates ['1GB', '2GB', '3GB', '4GB, '5GB', ... '10000GB'], and the repr would become ambiguous!

@shoyer
Copy link
Member

shoyer commented Jun 20, 2024

What's the case for defaulting to only have the size on lazy variables? I would have the default be for all but ofc open-minded...

Lazy variables (using Xarray's internal lazy arrays) just have ... as the data repr instead of actual values, so they have free space to list the size of the data. They are also more likely to be large. (Dask arrays already have size in the short repr.)

I like moving nbytes into the repr header, like <xarray.Dataset 3kB> or <xarray.DataArray (x: 2, y: 2, time: 3) 96B>. Other options:

  1. Use parentheses <xarray.Dataset (3kB)> or <xarray.DataArray (x: 2, y: 2, time: 3; 96B)>. I think I like this somewhat better than no parentheses.
  2. Put size onto sub-sections of the repr, e.g., Coordinates (3kB): and Data variables (3kB):. This could be helpful for tracking down where large data is, especially when used for DataTree. It also feels like the least obtrusive place to put this information, as these lines are currently very empty.

I also like the idea to skip nbytes for the inline repr of small arrays, though I do worry that is gets a little harder to keep track of data vs size when some variables have it and other don't. Consistency might ultimately be more important.

With regards to adding a new option display_variables_nbytes this is fine, but I think most users rarely change defaults, so we still need to come to a concensus on the preferred default :).

@max-sixty
Copy link
Collaborator

With regards to adding a new option display_variables_nbytes this is fine, but I think most users rarely change defaults, so we still need to come to a concensus on the preferred default :).

+1

(FWIW I generally find that adding options, while it skips over disagreements, it doesn't resolve the issue + requires more maintenance for different modes, particularly when we have lots of options and their product scales factorially)


Agree with moving into the header...

I would still medium-strength vote for having sizes on each variable, but having it on the whole object accomplishes 80% of the goal. (and would weigh my vote less than @shoyer 's...)

@etienneschalk
Copy link
Contributor Author

Use parentheses <xarray.Dataset (3kB)> or <xarray.DataArray (x: 2, y: 2, time: 3; 96B)>. I think I like this somewhat better than no parentheses.

👍 for <xarray.Dataset (3kB)>

The only problem I see with <xarray.DataArray (x: 2, y: 2, time: 3; 96B)> is that is mixes the nbytes information with the dimensions information, I would prefer to keep the dimensions separate from the nbytes, eg <xarray.DataArray (x: 2, y: 2, time: 3) (96B)> ; if parenthesis are too redundant, a new symbol for size can be introduced like square brackets eg <xarray.DataArray (x: 2, y: 2, time: 3) [96B]> and <xarray.Dataset [3kB]>

Put size onto sub-sections of the repr, e.g., Coordinates (3kB): and Data variables (3kB):. This could be helpful for tracking down where large data is, especially when used for DataTree. It also feels like the least obtrusive place to put this information, as these lines are currently very empty.

Yes it helps reasoning for sizes at the group level. I have already an example of a DataTree containing various rasters, with groups containing only float32 rasters, and other groups, only uint8 rasters. The dominant dtype of the group is the most relevant information to "weigh" the DataTree. In that context losing the per-variable nbytes display is ok ; but for inspecting a particular Dataset, being able to see all the sizes of the variables still is important in my opinion.

I also like the idea to skip nbytes for the inline repr of small arrays, though I do worry that is gets a little harder to keep track of data vs size when some variables have it and other don't. Consistency might ultimately be more important.

Yes indeed, it is confusing ; maybe the nbytes threshold criteria should rather be applied on the whole dataset / datarray level rather than on a per-variable basis. If a dataset exceeds 1kB, then all the variables have their size shown, otherwise, none.

I would still medium-strength vote for having sizes on each variable,

👍 especially for a sole Dataset (outside of a DataTree context)


It seems it is difficult to find a consensus on when to show or not to show the nbytes information. A default behaviour seems hard to reach in that context, and in the consensus, nobody is totally satisfied. Another solution would be to have an "all or nothing" approach to nbytes in the repr. Such a solution would ensure consistency. Do not show nbytes by default, and add a global option that enables showing the nbytes, for all variables, no matter the conditions (no matter if it is lazy / small). Users that want to inspect the memory footprint of an xarray data object are more likely to opt-in (changing the global option), and want all the available information.

Examples:

By default : No nbytes are shown. Clean minimalistic classical repr of an xarray data structure:

<xarray.Dataset>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 0 1 2 3 4 5 6 7 8 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
Data variables:
    *empty*

User opt-in with a global option: All possible nbytes are shown, with no concession. It is an active choice of the user to see all the nbytes info.

<xarray.Dataset [3kB]>
Dimensions:  (foo: 1200, bar: 111)
Coordinates [2kB]:
  * foo      (foo) int16 [2kB] 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 [222B] 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
Data variables [0B]:
    *empty*

The use of a distinct symbol (here brackets, but it could be parenthesis too, or any other pair of symbols) emphasizes the size, making it easier to identify when reading a text repr.

What do you think of the all or nothing approach ?

@keewis
Copy link
Collaborator

keewis commented Jun 23, 2024

I like the idea of putting the total memory size (data + coordinates + indexes(?)) into the top-level angle brackets, as well as adding size information to the subsections.

I'm not sure about how to format this, though: parentheses are already used by the dimension sizes in the first line of the repr, and the sections also can have these if there's a limit to how many variables are displayed (display_max_rows).

(somewhat unrelated to this, so I don't mind discussing in a separate issue: should we add a Dimension order: (x, y, z) line as the second line of the DataArray repr? That might avoid the impression of <xarray.DataArray (x: 3, z: 1, y: 2)> describing the order of all data variables)

As for per-variable memory footprints, I would advise against behaving differently depending on the size of the variable. Instead, I'd prefer a single option that enables / disables the printing of size information for variables (display_variable_memory?). This is different from the last proposal in #9078 (comment), which would remove all memory size information, not just the one on the variables.

Also, I'd like to avoid using square brackets on the inline repr of variables (int64 [16B]): that's what pint-xarray is using to display the units, and would end up as int64 [16B] [m] 1 2. In that regard, I'm pretty happy with int64 16B [m] 1 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add nbytes to repr?

4 participants