@mrocklin
Member

@mrocklin mrocklin commented May 11, 2019

This provides more information about Dask arrays as an HTML repr.

Currently it includes:

  • A table of information (bytes, shape, dtype, ...)
  • A visual representation of the grid in 1d or 2d (working on 3d next)

A rendered result is available here: https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0df34e9d609f38a2bf3df9311909bb2c/raw/7ab5513cc1c6740ca8b2f4785ecd7d3a3869012c/dask-array-repr.ipynb

Screen Shot 2019-05-11 at 11 08 53 AM

TODO

  • 3d
  • handle greater dimensions by having a few grids displayed. So, for example, 4d would be a 3d grid followed by a 1d grid
  • Improve styling and placement of grid (visual design help most welcome)
  • Figure out better heuristics on how to handle many chunks or high degrees of asymmetry

cc @shoyer @birdsarah @rabernat @jhamman

My hope is that with changes like these (I plan to do something similar for high level graphs) we can improve users' literacy and ability to reason about their computations.

@shoyer
Member

shoyer commented May 11, 2019

On my screen, the text in the SVG showing dimension sizes looks really bold, with overlapping letters:
image

@shoyer
Member

shoyer commented May 11, 2019

Maybe instead of "(3, 10) → 30" for # Chunks, this could be "3 x 10 = 30"?
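
For illustration, a product-style label like the one suggested here could be built along these lines. This is only a sketch: `chunk_count_label` is a hypothetical helper, not part of Dask, and it assumes `chunks` is the tuple of per-axis tuples that a Dask array exposes as `.chunks`.

```python
import math


def chunk_count_label(chunks):
    """Format the number of chunks per axis as a product, e.g. 'a x b = c'."""
    # One entry per axis: how many chunks that axis is split into.
    counts = [len(axis_chunks) for axis_chunks in chunks]
    return " x ".join(str(n) for n in counts) + f" = {math.prod(counts)}"


# A 2D layout with 3 chunks along axis 0 and 10 chunks along axis 1.
chunks = ((4, 4, 4), (5,) * 10)
label = chunk_count_label(chunks)
```

The product form keeps the per-axis counts visible while still giving the total.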

@mrocklin
Member Author

Yes, on mine as well. Does anyone know enough about SVG to help here? I don't know very well how to control font styling.

@shoyer
Member

shoyer commented May 11, 2019

I think you might be able to do this in pure CSS, which would likely render more consistently. That's just a guess, though, I'm definitely not an expert here.

@shoyer
Member

shoyer commented May 11, 2019

To be honest, I'm not sure I love the big table and 2D representation of arrays. It feels much lower information density than the old repr. My preference would be to try to squeeze the HTML repr into roughly 4-5 lines of text, which could perhaps be achieved by splitting the table into more columns and using a smaller visual depiction.

I guess there's a tradeoff between representations optimized for new users and for people who understand the dask.array data model. If I were optimizing this for experienced users, I would consider something like the following, with a barplot showing dimension/chunk size for each axis on a log-scale:
image

@shoyer
Member

shoyer commented May 11, 2019

I do really like the idea of exploring HTML reprs, though!

@birdsarah
Contributor

On the advanced vs beginner thing, I agree. Or even the difference between when you're working and learning and when it's finally working and you want to have a compact and complete notebook. I'm thinking maybe something like sympy's init_printing function that we can recommend in docs and then it's easy to opt out of - or even to set by environment.

@birdsarah
Contributor

I also think there's value in building custom visualizations for 2, 3, and 4D data which are all very common use cases - especially for beginners. And then have a more generic representation for the higher dimensional data.

@mrocklin
Member Author

I find the grid structure to be helpful when playing around with dask arrays. Even as an expert I found myself relying on it almost immediately when I started playing around in the notebook. Previously I kept a mental model of chunks in my head, and now I found myself being lazy and leaving that to the UI.

For new users I would like for the repr to be interpretable without explanation, which, to me, rules out the histograms.

However, adding the constraint of information density should help, I think. I'm hoping that if we're clever we can come up with something that is both compact and easily interpretable without explanation. We don't need to visually show every chunk in the array. I think that there are diminishing returns after showing a few representative chunks per dimension. I think that some sort of gridded structure is intuitive, but I think that it breaks down with a large number of chunks.

@mrocklin
Member Author

> I do really like the idea of exploring HTML reprs, though!

Yes! I was inspired by Iris actually (cc @jacobtomlinson). More PyData projects should explore HTML reprs. Turns out that our displays are capable of more than just fixed-width formatted text :)

@mrocklin
Member Author

> I also think there's value in building custom visualizations for 2, 3, and 4D data which are all very common use cases - especially for beginners. And then have a more generic representation for the higher dimensional data.

I'm generally on board with this (assuming we go for something gridded like what is proposed here rather than a histogram approach, where this problem goes away).

Thoughts on what 3D or 4D should look like?

@shoyer
Member

shoyer commented May 11, 2019

I wonder if log or power-law scaling for sizes would work even with 2D arrays?

@mrocklin
Member Author

A small test to encode some expectations explicitly (code not yet written)

import dask.array as da


def grid_points(chunks):
    """Compute locations of lines for drawing a grid."""
    raise NotImplementedError  # code not yet written


def test_draw_grid_points():
    x = da.ones((100, 100), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert (x == y).all()  # respects symmetry

    x = da.ones((10, 50), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert len(x) == 2  # one chunk: a line at each edge
    assert len(y) == 6  # five chunks: six lines; happy to draw modest grids explicitly
    assert y[-1] > x[-1]  # larger axes should be larger

    x = da.ones((10, 10000), chunks=(10, 100))
    x, y = grid_points(x.chunks)
    assert len(x) == 2
    assert 5 <= len(y) < 30  # not too many lines
    assert y[-1] > x[-1]  # still a difference
    assert y[-1] < x[-1] * 20  # but not beyond perception
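
One way a function meeting expectations like these might look is sketched below. It is not the code that was eventually merged. It assumes pure Python, takes the tuple-of-tuples form of `.chunks` directly, returns plain lists of pixel positions rather than arrays, and introduces the hypothetical parameters `size`, `max_ratio`, and `max_lines`. It counts both outer edges of an axis as lines, so one chunk gives two lines and five chunks give six.

```python
import math


def grid_points(chunks, size=200.0, max_ratio=10.0, max_lines=20):
    """Return one list of line positions (in pixels) per axis.

    Assumes every axis has nonzero length and known chunk sizes.
    """
    lengths = [sum(axis_chunks) for axis_chunks in chunks]
    longest = max(lengths)
    result = []
    for axis_chunks, length in zip(chunks, lengths):
        # Clamp the drawn length so that no axis is drawn more than
        # max_ratio times shorter than the longest axis.
        drawn = size * max(length / longest, 1.0 / max_ratio)

        # Chunk boundaries in array coordinates, including both outer edges.
        bounds = [0]
        for c in axis_chunks:
            bounds.append(bounds[-1] + c)

        # Thin out the boundaries when there are too many to draw,
        # always keeping the first and last edge.
        if len(bounds) > max_lines:
            step = math.ceil((len(bounds) - 1) / (max_lines - 1))
            kept = bounds[::step]
            if kept[-1] != bounds[-1]:
                kept.append(bounds[-1])
            bounds = kept

        # Rescale from array coordinates to pixels.
        result.append([b * drawn / length for b in bounds])
    return result


# 10 x 10000 array with 10 x 100 chunks: 1 chunk by 100 chunks.
x, y = grid_points(((10,), (100,) * 100))
```

The hard clamp on the aspect ratio is just a placeholder for whatever scaling rule is chosen; the subsampling step is what keeps the number of drawn lines bounded.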

@mrocklin
Member Author

Log or power-law scaling makes sense. We might also go more extreme and just cap the maximum aspect ratio at something like 10x. Differences between the axes might not matter much beyond that.
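
To make the options concrete, here is a small sketch comparing how a hard cap, log scaling, and power-law scaling would each turn the true ratio between two axis lengths into a drawn ratio. The function names, the `1 + log` form, and the default exponent are illustrative assumptions, not anything taken from this PR.

```python
import math


def capped(ratio, cap=10.0):
    """Hard cap: the true ratio up to `cap`, constant beyond it."""
    return min(ratio, cap)


def log_damped(ratio):
    """Log scaling: 1 maps to 1, then grows by about 2.3 per factor of 10."""
    return 1.0 + math.log(ratio)


def power_law(ratio, exponent=0.5):
    """Power-law scaling: with exponent=0.5 a 100x axis is drawn 10x longer."""
    return ratio ** exponent


# Print the drawn ratio under each rule for a range of true ratios (>= 1).
for ratio in (1, 10, 100, 1000):
    print(ratio, capped(ratio), round(log_damped(ratio), 2), round(power_law(ratio), 2))
```

The cap preserves true proportions for modest asymmetry and then stops distinguishing; the log and power-law rules keep distinguishing larger ratios but distort small ones.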

@dhirschfeld

Maybe a repr_html(**kwargs) function with kwargs which can control the format of the output with _repr_html_ defaulting to a compact version?

e.g. repr_html(show_grid=False, **other_useful_kwargs)
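
A minimal sketch of that pattern follows, using a toy class rather than Dask's `Array`. The class name `ArraySummary`, the `show_table` keyword, and the HTML itself are hypothetical; only `_repr_html_` (the hook Jupyter calls for rich display) and the `show_grid` keyword come from the discussion.

```python
class ArraySummary:
    """Toy stand-in for an array-like object; not Dask's Array class."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

    def repr_html(self, show_table=True, show_grid=True):
        """Build an HTML fragment; keyword arguments toggle each part."""
        parts = []
        if show_table:
            parts.append(
                "<table>"
                f"<tr><th>Shape</th><td>{self.shape}</td></tr>"
                f"<tr><th>dtype</th><td>{self.dtype}</td></tr>"
                "</table>"
            )
        if show_grid:
            # Placeholder for a real chunk-grid drawing.
            parts.append('<svg width="120" height="120"></svg>')
        return "\n".join(parts)

    def _repr_html_(self):
        # Jupyter calls this hook with no arguments, so the compact
        # defaults are chosen here.
        return self.repr_html(show_grid=False)


summary = ArraySummary((100, 100), "float64")
compact = summary._repr_html_()
full = summary.repr_html(show_grid=True)
```

With this split, the notebook shows the compact form automatically, and a user who wants the grid calls `repr_html` explicitly and displays the result.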

@shoyer
Member

shoyer commented May 12, 2019 via email

@mrocklin
Member Author

We could also make things more compact.

200px size

Screen Shot 2019-05-12 at 8 38 46 PM

120px size

Screen Shot 2019-05-12 at 8 40 54 PM

@martindurant
Member

martindurant commented May 13, 2019

I find that the array name is not very useful, since you can never read it. It could be removed, or be more of a "title", spanning across the table and graphic.

Does this extend to

  • zero dimensions
  • more than 3 dimensions
  • complex dtypes?

@mrocklin
Member Author

See the conversation above and the example notebook for dimensions other than 1, 2, 3 (the answer is no, it doesn't).

We just print the dtype, so I don't see any reason to suspect that it wouldn't work.

Another special case you haven't mentioned is unknown chunks.

@mrocklin
Member Author

Now with lighter weight dimension text

Screen Shot 2019-05-12 at 8 58 31 PM

@jacobtomlinson
Member

All credit for the Iris repr goes to @DPeterK. These HTML reprs are awesome though, make life much easier and we should repr all the things! I did the worker logs in dask-kubernetes and it makes it so much easier to debug issues.

It would be interesting to think about nesting these reprs too, or at least making reusable components. I commonly use Iris and Xarray which both use dask array as the internal array. It would be nice for the HTML repr for both of those to pull out at least some of the dask array one!

@mrocklin
Member Author

I've added some modest testing and am now raising informative errors. I've also added docstrings to some of the functions that are more likely to be seen by users or that are somewhat complex.

@shoyer
Member

shoyer commented May 13, 2019 via email

@jakirkham
Member

@mrocklin
Member Author

OK, handling @martindurant 's feedback here:

> I find that the array name is not very useful, since you can never read it. It could be removed, or be more of a "title", spanning across the table and graphic.

And @dhirschfeld 's feedback here:

> The size in bytes of each chunk might be a useful metric?
> i.e. the user may want to ensure each chunk fits in the cache

We now have the following:

Screen Shot 2019-05-13 at 7 31 19 PM

@mrocklin
Member Author

I'm inclined to merge soonish if there are no objections. This is the sort of thing that we can modify easily in the future. There aren't any strong contracts to break here.

@martindurant
Member

I like this last version better. I am uncertain whether there's anything to be gained from showing the array name, except that it is the thing that shows up most prominently in progress-bars and on the dashboard.

@stsievert
Member

Ditto, I’m also a fan of the two-column version. It provides a clean separation and shows more information than #4794 (comment) in fewer rows (4 vs 6 rows).

@alimanfoo
Contributor

Really awesome to see this! It will help hugely for gaining understanding and intuitions for how dask works. (NumPy really should do something similar :-)

Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions. E.g., if you have a 2D array then the first dimension is displayed as vertical length. But then if you have a 3D array then the first dimension is displayed as depth.

One thing I find in explaining arrays to others is that it helps to think of the first dimension visually in the same way and always "draw" it the same way. I.e., when I draw an array, I always draw the first dimension as vertical, the second dimension as horizontal, and then if there is a third dimension, that becomes depth. I don't really mind which way round, as long as it's consistent.

@alimanfoo
Contributor

Btw wondering if we could factor this out somehow. Dask, NumPy, Zarr, ... could all use the same HTML repr very effectively. Only difference would be the rows in the table on the left.

@DPeterK

DPeterK commented May 14, 2019

@mrocklin this is great! Here's a screenshot of trying this on the dask array at the core of a (small) 4D Iris cube (bigger, more interesting cube incoming):

Screenshot 2019-05-14 at 11 42 33

[Edit]: when this goes in, I'd love to integrate it to get this repr when you click on the 'Shape' element of the cube pretty repr...

@jacobtomlinson
Member

Something kind of like this would be nice

image

@DPeterK

DPeterK commented May 14, 2019

Here's the promised, much larger dataset loaded as a cube:

Screenshot 2019-05-14 at 13 51 57

@mrocklin
Member Author

From @alimanfoo

> Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions

Yes, I originally kept things consistent so that the third dimension was the depth dimension. I found that it conflicted with my intuition of what arrays looked like, possibly due to a history of imaging datasets. I wanted the fastest-moving dimensions (like the two 1024s in (100, 1024, 1024)) to be facing me as a rectangle, and then I wanted the slowest-moving dimension going away as depth. I fully acknowledge though that this may not be ideal. It would be good to get others' opinions on how best to arrange the dimensions.

@mrocklin
Member Author

@DPeterK @jacobtomlinson note that you can get just the grid without the table by using the to_svg method. You can also control the size of the image there with the size= keyword.

@mrocklin
Member Author

I think that the only pending issue is @alimanfoo 's from above about which dimension goes where. If people have thoughts that they're able to share that would be great. Additional perspectives would be valuable here.

@shoyer
Member

shoyer commented May 14, 2019

I would lean towards including the array name in the HTML repr, under the model that the HTML repr should contain a strict super-set of the information in the text repr. Unless there's something I'm missing about different use-cases for these?

@alimanfoo
Contributor

alimanfoo commented May 14, 2019 via email

@mrocklin
Member Author

> Maybe you could do a webgl version which the user can rotate in 3d to their liking :-)

Sounds like great future work :)

@mrocklin
Member Author

> I would lean towards including the array name in the HTML repr, under the model that the HTML repr should contain a strict super-set of the information in the text repr. Unless there's something I'm missing about different use-cases for these?

@martindurant gave the opposite recommendation in #4794 (comment) on the grounds that it's not particularly useful.

> under the model that the HTML repr should contain a strict super-set of the information in the text repr

I don't see a strong motivation for this personally. I'm optimizing the HTML repr mostly for user experience rather than comprehensive representation.

@martindurant
Member

> gave the opposite recommendation

I suggested that just seeing part of the name wasn't interesting; if you wanted to have a superset of the text repr, then you would want the whole name, which is too long for the previous location. I could imagine it as a heading for the HTML block.

@stsievert
Member

Would showing the text repr in the title text help both #4794 (comment) and #4794 (comment)? Something like

<table>
  <tr title="ones-353699120031">
    <th></th>
    <th>Array</th>
    <th>Chunks</th>
  </tr>
  ...
</table>

This looks something like

Screen Shot 2019-05-15 at 11 25 58 AM

@shoyer
Member

shoyer commented May 15, 2019

I would avoid title text. You can't select it for copy/paste, and it's not something notebook users know to look for.

@mrocklin
Member Author

The array name seems contentious enough that I think I'd like to leave it for future work. I don't think it's necessary to hold up the rest of this PR while waiting for folks to come to an agreement there.

If there are no objections I'll plan to merge this tomorrow.

@mrocklin mrocklin merged commit 081a911 into dask:master May 17, 2019
@mrocklin
Member Author

Thanks for the review all. I'm quite happy with how this turned out. It was much nicer for all of the added perspectives.

@mrocklin mrocklin deleted the array-repr-html branch May 17, 2019 16:19
@jakirkham jakirkham mentioned this pull request Jun 27, 2019
@TomNicholas TomNicholas mentioned this pull request Jul 9, 2019
8 tasks