Add draft of Array._repr_html_ #4794

mrocklin · 2019-05-11T16:08:33Z

This provides more information about Dask arrays as an HTML repr

Currently it includes a table of information (bytes, shape, dtype, ...)
A visual representation of the grid in 1d or 2d (working on 3d next)

A rendered result is available here: https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0df34e9d609f38a2bf3df9311909bb2c/raw/7ab5513cc1c6740ca8b2f4785ecd7d3a3869012c/dask-array-repr.ipynb

TODO

3d
Handle greater dimensions by displaying several grids. For example, 4d would be a 3d grid followed by a 1d grid
Improve styling and placement of grid (visual design help most welcome)
Figure out better heuristics on how to handle many chunks or high degrees of asymmetry

cc @shoyer @birdsarah @rabernat @jhamman

My hope is that with changes like these (I plan to do something similar for high level graphs) we can improve users' literacy and ability to reason about their computations.

shoyer · 2019-05-11T16:31:24Z

On my screen, the text in the SVG showing dimension sizes looks really bold, with overlapping letters:

shoyer · 2019-05-11T16:34:16Z

Maybe instead of "(3, 10) → 30" for # Chunks, this could be "3 x 10 = 30"?

mrocklin · 2019-05-11T16:39:27Z

Yes, on mine as well. Does anyone know enough about SVG to help here? I don't know very well how to control font styling.

shoyer · 2019-05-11T16:40:20Z

I think you might be able to do this in pure CSS, which would likely render more consistently. That's just a guess, though, I'm definitely not an expert here.

shoyer · 2019-05-11T16:57:28Z

To be honest, I'm not sure I love the big table and 2D representation of arrays. It feels much lower information density than the old repr. My preference would be to try to squeeze the HTML repr into roughly 4-5 lines of text, which could perhaps be achieved by splitting the table into more columns and using a smaller visual depiction.

I guess there's a tradeoff between representations optimized for new users and for people who understand the dask.array data model. If I were optimizing this for experienced users, I would consider something like the following, with a barplot showing dimension/chunk size for each axis on a log-scale:

shoyer · 2019-05-11T16:58:52Z

I do really like the idea of exploring HTML reprs, though!

birdsarah · 2019-05-11T17:12:02Z

On the advanced vs beginner thing, I agree. Or even the difference between when you're working and learning and when it's finally working and you want to have a compact and complete notebook. I'm thinking maybe something like sympy's init_printing function that we can recommend in docs and then it's easy to opt out of - or even to set by environment.

birdsarah · 2019-05-11T17:13:50Z

I also think there's value in building custom visualizations for 2, 3, and 4D data which are all very common use cases - especially for beginners. And then have a more generic representation for the higher dimensional data.

mrocklin · 2019-05-11T17:20:59Z

I find the grid structure to be helpful when playing around with dask arrays. Even as an expert I found myself relying on it almost immediately when I started playing around in the notebook. Previously I kept a mental model of chunks in my head, and now I found myself being lazy and leaving that to the UI.

For new users I would like for the repr to be interpretable without explanation, which, to me, rules out the histograms.

However, adding the constraint of information density should help, I think. I'm hoping that if we're clever we can come up with something that is both compact and easily interpretable without explanation. We don't need to visually show every chunk in the array. I think that there are diminishing returns after showing a few representative chunks per dimension. I think that some sort of gridded structure is intuitive, but that it breaks down with a large number of chunks.

mrocklin · 2019-05-11T17:21:59Z

> I do really like the idea of exploring HTML reprs, though!

Yes! I was inspired by Iris actually (cc @jacobtomlinson). More PyData projects should explore HTML reprs. Turns out that our displays are capable of more than just fixed-width formatted text :)

mrocklin · 2019-05-11T17:23:11Z

> I also think there's value in building custom visualizations for 2, 3, and 4D data which are all very common use cases - especially for beginners. And then have a more generic representation for the higher dimensional data.

I'm generally on board with this (assuming we go for something gridded like what is proposed here rather than a histogram approach, where this problem goes away).

Thoughts on what 3D or 4D should look like?

shoyer · 2019-05-11T18:16:15Z

I wonder if log or power-law scaling for sizes would work even with 2D arrays?

mrocklin · 2019-05-11T18:26:41Z

A small test to encode some expectations explicitly (code not yet written)

import dask.array as da

def grid_points(chunks):
    """ Compute locations of lines for drawing a grid """

def test_draw_grid_points():
    x = da.ones((100, 100), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert (x == y).all()  # respects symmetry

    x = da.ones((10, 50), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert len(x) == 2
    assert len(y) == 6   # 5 chunks have 6 boundary lines; happy to draw modest grids explicitly
    assert y[-1] > x[-1]  # larger axes should be larger

    x = da.ones((10, 10000), chunks=(10, 100))
    x, y = grid_points(x.chunks)
    assert len(x) == 2
    assert 5 <= len(y) < 30  # not too many lines
    assert y[-1] > x[-1]  # still a difference
    assert y[-1] < x[-1] * 20  # but not beyond perception
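
One way the expectations above could be met, as a rough sketch rather than the code that ended up in this PR: take the cumulative chunk boundaries of each axis, thin them out when there are too many, and scale each axis to a drawing length whose aspect ratio is capped. The `size`, `max_ratio`, and `max_lines` parameters and their defaults are assumptions made for illustration. Plain lists are returned instead of arrays, and empty axes or unknown chunk sizes are not handled.

```python
from itertools import accumulate


def grid_points(chunks, size=200, max_ratio=10, max_lines=20):
    """Sketch: positions of grid lines for each axis, in drawing units.

    `size`, `max_ratio` and `max_lines` are illustrative assumptions.
    """
    shape = [sum(c) for c in chunks]
    largest = max(shape)
    points = []
    for axis_chunks, extent in zip(chunks, shape):
        # Drawing length of this axis: proportional to its extent, but
        # never less than 1/max_ratio of the longest axis.
        length = size * max(extent / largest, 1 / max_ratio)
        # Chunk boundaries in array coordinates, including both borders.
        edges = [0] + list(accumulate(axis_chunks))
        if len(edges) > max_lines:
            # Too many lines to draw: keep an evenly spaced subset that
            # still includes the first and last border.
            step = (len(edges) - 1) / (max_lines - 1)
            edges = [edges[round(i * step)] for i in range(max_lines)]
        scale = length / extent
        points.append([e * scale for e in edges])
    return tuple(points)


# Chunks are given the way dask stores them: one tuple of sizes per axis.
x, y = grid_points(((10,), (100,) * 100))  # shape (10, 10000)
```

Capping the ratio is only one choice here; any monotone rescaling of the axis lengths would fit the same structure.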

mrocklin · 2019-05-11T18:29:09Z

Log or power-law scaling makes sense. We might also go more extreme and just cap the maximum aspect ratio at something like 10x. Differences between the axes might not matter much beyond that.
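
To make the options concrete, here is a small sketch comparing log scaling, power-law scaling, and an optional cap on the aspect ratio. The function name `scaled_sizes`, the `log(s + 1)` weighting, the exponent of 0.5, and the 200-unit drawing size are all assumptions for illustration, not what the PR implements. Axis extents are assumed to be positive.

```python
import math


def scaled_sizes(shape, size=200, method="log", power=0.5, max_ratio=None):
    """Sketch: map axis extents to drawing lengths on a compressed scale."""
    if method == "log":
        weights = [math.log(s + 1) for s in shape]
    elif method == "power":
        weights = [s ** power for s in shape]
    else:
        raise ValueError(f"unknown method: {method!r}")
    largest = max(weights)
    # Each axis as a fraction of the longest drawn axis.
    ratios = [w / largest for w in weights]
    if max_ratio is not None:
        # Optional hard cap: no axis is drawn shorter than 1/max_ratio
        # of the longest one.
        ratios = [max(r, 1 / max_ratio) for r in ratios]
    return tuple(size * r for r in ratios)


shape = (10, 10000)  # a 1000x difference in extent
log_sizes = scaled_sizes(shape, method="log")
power_sizes = scaled_sizes(shape, method="power")
capped_sizes = scaled_sizes(shape, method="power", max_ratio=10)
```

With either scaling the longer axis is still drawn longer, but the drawn ratio is far smaller than the true 1000x ratio.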

dhirschfeld · 2019-05-12T03:51:34Z

Maybe a repr_html(**kwargs) function with kwargs which can control the format of the output with _repr_html_ defaulting to a compact version?

e.g. repr_html(show_grid=False, **other_useful_kwargs)
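
A minimal sketch of that shape of API, using a hypothetical `ArraySummary` class rather than dask's actual `Array`. The `show_grid` keyword comes from the suggestion above; the table rows and the placeholder SVG are invented for illustration.

```python
from html import escape


class ArraySummary:
    """Hypothetical stand-in for an array-like object; not dask's Array."""

    def __init__(self, shape, chunks, dtype):
        self.shape = shape
        self.chunks = chunks
        self.dtype = dtype

    def repr_html(self, show_grid=True):
        """Build the HTML; keyword arguments select which parts appear."""
        rows = [
            ("Shape", self.shape),
            ("Chunk shape", self.chunks),
            ("dtype", self.dtype),
        ]
        table = "<table>" + "".join(
            f"<tr><th>{escape(str(k))}</th><td>{escape(str(v))}</td></tr>"
            for k, v in rows
        ) + "</table>"
        if not show_grid:
            return table
        # Placeholder drawing; a real chunk grid would be rendered here.
        grid = (
            '<svg width="40" height="40">'
            '<rect width="40" height="40" fill="none" stroke="black"/>'
            "</svg>"
        )
        return table + grid

    def _repr_html_(self):
        # Jupyter calls this hook with no arguments, so it picks the
        # compact form; users call repr_html(...) for anything else.
        return self.repr_html(show_grid=False)


a = ArraySummary((100, 100), (10, 10), "float64")
compact = a._repr_html_()
full = a.repr_html(show_grid=True)
```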

shoyer · 2019-05-12T04:47:15Z

One nice thing about log scaled barplots is they correspond nicely with how dimension sizes combine: if two different barplots each add up to the same total area, then the arrays have the same size, because log(x)+log(y)=log(x*y). So I imagine you could pretty quickly build up an intuition for what "good" chunk sizes look like at a glance.
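
The identity can be checked numerically. In this sketch, bars are assumed to have equal width, so total area is proportional to total bar length; the helper names and the base-10 log are assumptions for illustration.

```python
import math


def bar_lengths(shape):
    """Length of each axis's bar on a log10 scale."""
    return [math.log10(n) for n in shape]


def total_length(shape):
    """Sum of the bar lengths, which equals log10 of the element count."""
    return sum(bar_lengths(shape))


a = (1000, 10)   # 10,000 elements
b = (100, 100)   # 10,000 elements, differently shaped
c = (1000, 100)  # 100,000 elements

same_size = math.isclose(total_length(a), total_length(b))
larger = total_length(c) > total_length(a)
```

Two shapes with the same number of elements give the same total, and a larger array gives a larger one.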


mrocklin · 2019-05-13T01:39:11Z

Now with 3d and logarithmic scaling

https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0df34e9d609f38a2bf3df9311909bb2c/raw/15d41f26800710acc63daabf7e80bfda703b883d/dask-array-repr-2.ipynb

mrocklin · 2019-05-13T01:42:04Z

We could also make things more compact.

200px size

120px size

martindurant · 2019-05-13T01:44:52Z

I find that the array name is not very useful, since you can never read it. It could be removed, or be more of a "title", spanning across the table and graphic.

Does this extend to

zero dimensions
more than 3 dimensions
complex dtypes?

mrocklin · 2019-05-13T01:46:27Z

See conversation above and example notebook for dimensions other than 1, 2, 3 (the answer is no, it doesn't)

We just print the dtype, so I don't see any reason to suspect that it wouldn't work.

Another special case you haven't mentioned is unknown chunks.

mrocklin · 2019-05-13T01:58:56Z

Now with lighter weight dimension text

jacobtomlinson · 2019-05-13T08:49:45Z

All credit for the Iris repr goes to @DPeterK. These HTML reprs are awesome though, make life much easier and we should repr all the things! I did the worker logs in dask-kubernetes and it makes it so much easier to debug issues.

It would be interesting to think about nesting these reprs too, or at least making reusable components. I commonly use Iris and Xarray which both use dask array as the internal array. It would be nice for the HTML repr for both of those to pull out at least some of the dask array one!

mrocklin · 2019-05-13T14:07:08Z

I've added some modest testing and am now raising informative errors. I've also added in docstrings in some of the functions more likely to be seen by users or that are somewhat complex.

shoyer · 2019-05-13T14:36:45Z

Xarray has an HTML repr in progress, but it needs a bit more work to bring it across the finish line: pydata/xarray#1820


jakirkham · 2019-05-13T23:38:54Z

In case it is of interest, here is a similar feature in Zarr.

ref: https://nbviewer.jupyter.org/github/zarr-developers/zarr/blob/master/notebooks/repr_info.ipynb

cc @alimanfoo

mrocklin · 2019-05-14T00:37:15Z

OK, handling @martindurant's feedback here:

> I find that the array name is not very useful, since you can never read it. It could be removed, or be more of a "title", spanning across the table and graphic.

And @dhirschfeld's feedback here:

> The size in bytes of each chunk might be a useful metric?
> i.e. the user may want to ensure each chunk fits in the cache

We now have the following:

mrocklin · 2019-05-14T00:39:29Z

I'm inclined to merge soonish if there are no objections. This is the sort of thing that we can modify easily in the future. There aren't any strong contracts to break here.

martindurant · 2019-05-14T00:39:57Z

I like this last version better. I am uncertain whether there's anything to be gained from showing the array name, except that it is the thing that shows up most prominently in progress-bars and on the dashboard.

stsievert · 2019-05-14T00:47:10Z

Ditto, I’m also a fan of the two-column version. It provides a clean separation and shows more information than #4794 (comment) in fewer rows (4 vs 6 rows).

alimanfoo · 2019-05-14T09:57:21Z

Really awesome to see this! It will help hugely for gaining understanding and intuitions for how dask works. (Numpy really should do something similar :-)

Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions. E.g., if you have a 2D array then the first dimension is displayed as vertical length. But then if you have a 3D array then the first dimension is displayed as depth.

One thing I find in explaining arrays to others is that it helps to think of the first dimension visually in the same way and always "draw" it the same way. I.e., when I draw an array, I always draw the first dimension as vertical, the second dimension as horizontal, and then if there is a third dimension, that becomes depth. I don't really mind which way round, as long as it's consistent.

alimanfoo · 2019-05-14T09:59:43Z

Btw wondering if we could factor this out somehow. Dask, NumPy, Zarr, ... could all use the same HTML repr very effectively. Only difference would be the rows in the table on the left.

DPeterK · 2019-05-14T10:45:55Z

@mrocklin this is great! Here's a screenshot of trying this on the dask array at the core of a (small) 4D Iris cube (bigger, more interesting cube incoming):

[Edit]: when this goes in, I'd love to integrate it to get this repr when you click on the 'Shape' element of the cube pretty repr...

jacobtomlinson · 2019-05-14T12:25:42Z

Something kind of like this would be nice

DPeterK · 2019-05-14T12:52:40Z

Here's the promised, much larger dataset loaded as a cube:

mrocklin · 2019-05-14T12:57:45Z

From @alimanfoo

> Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions

Yes, I originally kept things consistent so that the third dimension was the depth dimension. I found that it conflicted with my intuition of what arrays looked like, possibly due to a history of imaging datasets. I wanted the fastest-moving dimensions (like (100, 1024, 1024)) to be facing me as a rectangle, and then I wanted the slowest-moving dimension going away as depth. I fully acknowledge though that this may not be ideal. It would be good to get others' opinions on how best to arrange the dimensions.

mrocklin · 2019-05-14T12:58:38Z

@DPeterK @jacobtomlinson note that you can get just the grid without the table by using the to_svg method. You can also control the size of the image there with the size= keyword.

mrocklin · 2019-05-14T18:21:51Z

I think that the only pending issue is @alimanfoo's from above about which dimension goes where. If people have thoughts that they're able to share, that would be great. Additional perspectives would be valuable here.

shoyer · 2019-05-14T18:34:33Z

I would lean towards including the array name in the HTML repr, under the model that the HTML repr should contain a strict super-set of the information in the text repr. Unless there's something I'm missing about different use-cases for these?

alimanfoo · 2019-05-14T18:48:44Z

Heh, I can see this is hard to generalise as different domains think about arrays in different ways. For our genomic data, array shape is typically (100 million, 10000, 2) so putting the first dimension as depth will look pretty weird :-) Maybe you could do a webgl version which the user can rotate in 3d to their liking :-)


mrocklin · 2019-05-14T19:01:06Z

> Maybe you could do a webgl version which the user can rotate in 3d to their liking :-)

Sounds like great future work :)

mrocklin · 2019-05-15T15:58:04Z

> I would lean towards including the array name in the HTML repr, under the model that the HTML repr should contain a strict super-set of the information in the text repr. Unless there's something I'm missing about different use-cases for these?

@martindurant gave the opposite recommendation in #4794 (comment) on the grounds that it's not particularly useful.

> under the model that the HTML repr should contain a strict super-set of the information in the text repr

I don't see a strong motivation for this personally. I'm optimizing the HTML repr mostly for user experience rather than comprehensive representation.

martindurant · 2019-05-15T16:07:38Z

> gave the opposite recommendation

I suggested that just seeing part of the name wasn't interesting; if you wanted to have a superset of the text repr, then you would want the whole name, which is too long for the previous location. I could imagine it as a heading for the HTML block.

stsievert · 2019-05-15T16:30:06Z

Would showing the text repr in the title text help both #4794 (comment) and #4794 (comment)? Something like

<table>
  <tr title="ones-353699120031">
    <th></th>
    <th>Array</th>
    <th>Chunks</th>
  </tr>
  ...
</table>

This looks something like

shoyer · 2019-05-15T16:47:50Z

I would avoid title text. You can't select it for copy/paste, and it's not something notebook users know to look for.

mrocklin · 2019-05-16T15:32:57Z

The array name seems contentious enough that I think I'd like to leave it for future work. I don't think it's necessary to hold up the rest of this PR while waiting for folks to come to an agreement there.

If there are no objections I'll plan to merge this tomorrow.

mrocklin · 2019-05-17T16:19:35Z

Thanks for the review all. I'm quite happy with how this turned out. It was much nicer for all of the added perspectives.

mrocklin added 6 commits May 13, 2019 09:40

reduce stroke width

9c4d82e

add skew

5fb6d9d

support 3d plots

c030569

Add log ratios and text

d24e99c

black

c9102e4

Use Array/Chunk table arrangement

50959de

mrocklin added 2 commits May 13, 2019 19:51

add extra line and comment for rectangle

4160237

Capitalize consistently

1286edb

mrocklin merged commit 081a911 into dask:master May 17, 2019

mrocklin deleted the array-repr-html branch May 17, 2019 16:19

jakirkham mentioned this pull request Jun 27, 2019

Dask.array HTML repr #1676

Closed

TomNicholas mentioned this pull request Jul 9, 2019

WIP: html repr pydata/xarray#1820

Closed

8 tasks

maxrjones mentioned this pull request Jul 31, 2024

WIP: Add array SVG image and table to _repr_html_ pydata/xarray#9301

Open

4 tasks
