Add draft of Array._repr_html_ #4794
|
Maybe instead of "(3, 10) → 30" for # Chunks, this could be "3 x 10 = 30"? |
|
Yes, on mine as well. Does anyone know enough about SVG to help here? I don't know very well how to control font styling. |
|
I think you might be able to do this in pure CSS, which would likely render more consistently. That's just a guess, though, I'm definitely not an expert here. |
|
I do really like the idea of exploring HTML reprs, though! |
|
On the advanced vs beginner thing, I agree. Or even the difference between when you're working and learning and when it's finally working and you want to have a compact and complete notebook. I'm thinking maybe something like sympy's init_printing function that we can recommend in docs and then it's easy to opt out of - or even to set by environment. |
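A minimal sketch of what an opt-in/opt-out switch in the spirit of sympy's init_printing might look like. Everything here is hypothetical: `init_html_repr`, the `DEMO_ARRAY_HTML_REPR` environment variable, and the `DemoArray` stand-in are illustration names, not anything in Dask. It relies on the behaviour, as far as I know, that IPython falls back to another repr when `_repr_html_` returns None.

```python
import os

# Hypothetical environment variable; any value other than "0" enables HTML reprs.
_html_enabled = os.environ.get("DEMO_ARRAY_HTML_REPR", "1") != "0"


def init_html_repr(enabled=True):
    """Switch the rich repr on or off for the whole session (hypothetical helper)."""
    global _html_enabled
    _html_enabled = bool(enabled)


class DemoArray:
    """Stand-in object for illustration, not a real Dask array."""

    def __init__(self, shape):
        self.shape = shape

    def __repr__(self):
        return f"DemoArray<shape={self.shape}>"

    def _repr_html_(self):
        # Returning None signals that no HTML representation is available,
        # so the notebook falls back to the plain-text repr.
        if not _html_enabled:
            return None
        return f"<table><tr><th>Shape</th><td>{self.shape}</td></tr></table>"
```

A user who wants compact notebooks would call `init_html_repr(False)` once at the top, or set the environment variable before starting the kernel.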
|
I also think there's value in building custom visualizations for 2, 3, and 4D data which are all very common use cases - especially for beginners. And then have a more generic representation for the higher dimensional data. |
|
I find the grid structure to be helpful when playing around with dask arrays. Even as an expert I found myself relying on it almost immediately when I started playing around in the notebook. Previously I kept a mental model of chunks in my head, and now I found myself being lazy and leaving that to the UI. For new users I would like for the repr to be interpretable without explanation, which, to me, rules out the histograms. However, adding the constraint of information density should help, I think. I'm hoping that if we're clever we can come up with something that is both compact and easily interpretable without explanation. We don't need to visually show every chunk in the array. I think that there are diminishing returns after showing a few representative chunks per dimension. I think that some sort of gridded structure is intuitive, but I think that it breaks down with a large number of chunks. |
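One way to read "a few representative chunks per dimension" is to thin the chunk boundaries along each axis down to a fixed budget of lines. The sketch below does plain uniform thinning; `grid_lines` and `max_lines` are hypothetical names, and this is not necessarily what the PR ended up doing.

```python
from itertools import accumulate


def grid_lines(chunks, max_lines=10):
    """Return chunk boundaries along one axis, thinned to at most ``max_lines``.

    ``chunks`` is the tuple of chunk sizes for a single axis, e.g. (10, 10, 10).
    """
    if max_lines < 2:
        raise ValueError("need at least two lines to draw the outer edges")
    bounds = [0, *accumulate(chunks)]  # n chunks give n + 1 boundaries
    if len(bounds) <= max_lines:
        return bounds  # modest grids are drawn explicitly
    last = len(bounds) - 1
    # Pick evenly spaced boundary indices, always keeping the first and last.
    return [bounds[i * last // (max_lines - 1)] for i in range(max_lines)]
```

With five chunks every boundary is kept; with a hundred chunks only ten lines survive, including both outer edges.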
Yes! I was inspired by Iris actually (cc @jacobtomlinson). More PyData projects should explore HTML reprs. Turns out that our displays are capable of more than just fixed-width formatted text :) |
I'm generally on board with this (assuming we go for something gridded like what is proposed here rather than a histogram approach, where this problem goes away). Thoughts on what 3D or 4D should look like? |
|
I wonder if log or power-law scaling for sizes would work even with 2D arrays? |
|
A small test to encode some expectations explicitly (code not yet written):

import dask.array as da

def grid_points(chunks):
    """ Compute locations of lines for drawing a grid """

def test_draw_grid_points():
    x = da.ones((100, 100), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert (x == y).all()  # respects symmetry

    x = da.ones((10, 50), chunks=(10, 10))
    x, y = grid_points(x.chunks)
    assert len(x) == 2  # one chunk: two boundary lines
    assert len(y) == 6  # five chunks: six boundary lines; happy to draw modest grids explicitly
    assert y[-1] > x[-1]  # larger axes should be larger

    x = da.ones((10, 10000), chunks=(10, 100))
    x, y = grid_points(x.chunks)
    assert len(x) == 2
    assert 5 <= len(y) < 30  # not too many lines
    assert y[-1] > x[-1]  # still a difference
    assert y[-1] < x[-1] * 20  # but not beyond perception
|
Log or power-law scaling makes sense. We might also go more extreme and just cap the maximum aspect ratio at something like 10x. Differences between the axes might not matter much beyond that. |
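To make the two ideas concrete, here is a rough sketch that combines power-law compression with a hard cap on the aspect ratio. The name `display_sizes`, the exponent of 0.5, and the default drawing size of 200 are assumptions chosen for illustration; only the 10x cap comes from the comment above.

```python
def display_sizes(shape, size=200, max_ratio=10.0, power=0.5):
    """Map array axis lengths to drawn lengths (hypothetical helper).

    The longest axis is drawn at ``size``. Other axes are shrunk by a
    power law, and no axis is drawn shorter than ``size / max_ratio``.
    """
    largest = max(shape)
    # Power-law compression: a 100x difference in length becomes 10x on screen.
    ratios = [(n / largest) ** power for n in shape]
    # Cap the aspect ratio so tiny axes stay visible.
    ratios = [max(r, 1 / max_ratio) for r in ratios]
    return tuple(size * r for r in ratios)
```

For a (10, 10000) array the raw ratio is 1000x and the power law alone would still give about 32x, so the cap is what keeps the short axis at one tenth of the long one.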
|
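Maybe a repr_html(**kwargs) function with kwargs which can control the format of the output, with _repr_html_ defaulting to a compact version? e.g. repr_html(show_grid=False, **other_useful_kwargs) |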
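A sketch of how the repr_html(**kwargs) suggestion in this thread could be wired up, with `_repr_html_` delegating to a compact default. The class `ChunkedArrayInfo` and the placeholder grid markup are made up for illustration; only the `repr_html` name and the `show_grid` keyword come from the thread, and `_repr_html_` is the standard IPython hook.

```python
class ChunkedArrayInfo:
    """Hypothetical stand-in that only carries shape and chunk metadata."""

    def __init__(self, shape, chunkshape):
        self.shape = shape
        self.chunkshape = chunkshape

    def repr_html(self, show_grid=True):
        """Build the HTML repr; keyword arguments control optional parts."""
        rows = [
            f"<tr><th>Shape</th><td>{self.shape}</td></tr>",
            f"<tr><th>Chunks</th><td>{self.chunkshape}</td></tr>",
        ]
        html = "<table>" + "".join(rows) + "</table>"
        if show_grid:
            # Placeholder for the SVG grid; a real version would draw it here.
            html += '<div class="grid">[grid]</div>'
        return html

    def _repr_html_(self):
        # The notebook calls this with no arguments, so pick the compact form.
        return self.repr_html(show_grid=False)
```

Users who want the full picture call `x.repr_html(show_grid=True)` explicitly, while the automatic notebook display stays compact.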
|
One nice thing about log scaled barplots is they correspond nicely with how
dimension sizes combine: if two different barplots each add up to the same
total area, then the arrays have the same size, because
log(x)+log(y)=log(x*y). So I imagine you could pretty quickly build up an
intuition for what "good" chunk sizes look like at a glance.
On Sat, May 11, 2019 at 8:51 PM Dave Hirschfeld wrote:
Maybe a repr_html(**kwargs) function with kwargs which can control the
format of the output with _repr_html_ defaulting to a compact version?
e.g. repr_html(show_grid=False, **other_useful_kwargs)
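The log identity in the reply above is easy to check numerically. In this small sketch, `log_bar_lengths` is a hypothetical helper that gives the bar length for each axis of a chunk; chunks with the same number of elements end up with the same total bar length.

```python
import math


def log_bar_lengths(chunk_shape):
    """Bar length per axis on a log10 scale (hypothetical helper)."""
    return [math.log10(n) for n in chunk_shape]


square = log_bar_lengths((100, 100))    # 10,000 elements per chunk
skinny = log_bar_lengths((10, 1000))    # also 10,000 elements per chunk
bigger = log_bar_lengths((1000, 1000))  # 1,000,000 elements per chunk

# Equal totals mean equal chunk sizes, since log(x) + log(y) = log(x * y).
print(sum(square), sum(skinny), sum(bigger))
```

The first two totals agree (about 4) even though the chunks have very different shapes, while the third is larger (about 6).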
|
|
I find that the array name is not very useful, since you can never read it. It could be removed, or be more of a "title", spanning across the table and graphic. Does this extend to
|
|
See conversation above and example notebook for dimensions other than 1, 2, 3 (the answer is no, it doesn't). We just print the dtype, so I don't see any reason to suspect that it wouldn't work. Another special case you haven't mentioned is unknown chunks. |
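On the unknown-chunks case: Dask records unknown chunk sizes as NaN in the chunks tuple, so a repr can detect them before trying to count or draw anything. A minimal sketch, where `has_unknown_chunks` and `chunk_summary` are hypothetical helper names and the summary format is just one option:

```python
import math


def has_unknown_chunks(chunks):
    """True if any chunk size along any axis is NaN (i.e. unknown)."""
    return any(math.isnan(size) for axis in chunks for size in axis)


def chunk_summary(chunks):
    """Describe the number of chunks, e.g. '3 x 10 = 30' (hypothetical format)."""
    if has_unknown_chunks(chunks):
        return "unknown"  # no grid can be drawn either
    counts = [len(axis) for axis in chunks]
    total = math.prod(counts)
    if len(counts) == 1:
        return str(total)
    return " x ".join(str(c) for c in counts) + f" = {total}"
```

Note that the number of chunks is still known when sizes are NaN; this sketch simply reports "unknown" rather than guess at what is most useful to show.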
|
All credit for the Iris repr goes to @DPeterK. These HTML reprs are awesome though, make life much easier and we should repr all the things! I did the worker logs in dask-kubernetes and it makes it so much easier to debug issues. It would be interesting to think about nesting these reprs too, or at least making reusable components. I commonly use Iris and Xarray which both use dask array as the internal array. It would be nice for the HTML repr for both of those to pull out at least some of the dask array one! |
|
I've added some modest testing and am now raising informative errors. I've also added in docstrings in some of the functions more likely to be seen by users or that are somewhat complex. |
|
Xarray has an HTML repr in progress, but it needs a bit more work to bring
it across the finish line: pydata/xarray#1820
On Mon, May 13, 2019 at 7:07 AM Matthew Rocklin wrote:
I've added some modest testing and am now raising informative errors. I've
also added in docstrings in some of the functions more likely to be seen by
users or that are somewhat complex.
|
In case it is of interest, here is a similar feature in Zarr. cc @alimanfoo |
|
OK, handling @martindurant 's feedback here:
And @dhirschfeld 's feedback here:
We now have the following: |
|
I'm inclined to merge soonish if there are no objections. This is the sort of thing that we can modify easily in the future. There aren't any strong contracts to break here. |
|
I like this last version better. I am uncertain whether there's anything to be gained from showing the array name, except that it is the thing that shows up most prominently in progress-bars and on the dashboard. |
|
Ditto, I’m also a fan of the two-column version. It provides a clean separation and shows more information than #4794 (comment) in fewer rows (4 vs 6 rows). |
|
Really awesome to see this! It will help hugely for gaining understanding and intuitions for how dask works. (Numpy really should do something similar :-) Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions. E.g., if you have a 2D array then the first dimension is displayed as vertical length. But then if you have a 3D array then the first dimension is displayed as depth. One thing I find in explaining arrays to others is that it helps to think of the first dimension visually in the same way and always "draw" it the same way. I.e., when I draw an array, I always draw the first dimension as vertical, the second dimension as horizontal, and then if there is a third dimension, that becomes depth. I don't really mind which way round, as long as it's consistent. |
|
Btw wondering if we could factor this out somehow. Dask, NumPy, Zarr, ... could all use the same HTML repr very effectively. Only difference would be the rows in the table on the left. |
|
@mrocklin this is great! Here's a screenshot of trying this on the dask array at the core of a (small) 4D Iris cube (bigger, more interesting cube incoming): [Edit]: when this goes in, I'd love to integrate it to get this repr when you click on the 'Shape' element of the cube pretty repr... |
|
From @alimanfoo
Don't know if anyone else is feeling this, but one thing I'm finding a little confusing at the moment is that how the dimensions are displayed changes depending on the number of dimensions
Yes, I originally kept things consistent so that the third dimension was the depth dimension. I found that it conflicted with my intuition of what arrays looked like, possibly due to a history of imaging datasets. I wanted the fastest-moving dimensions (like (100, 1024, 1024)) to be facing me as a rectangle, and then I wanted the slowest moving dimension going away as depth. I fully acknowledge though that this may not be ideal. It would be good to get others' opinions on how best to arrange the dimensions. |
|
@DPeterK @jacobtomlinson note that you can get just the grid without the table by using the |
|
I think that the only pending issue is @alimanfoo 's from above about which dimension goes where. If people have thoughts that they're able to share that would be great. Additional perspectives would be valuable here. |
|
I would lean towards including the array name in the HTML repr, under the model that the HTML repr should contain a strict super-set of the information in the text repr. Unless there's something I'm missing about different use-cases for these? |
|
Heh, I can see this is hard to generalise as different domains think about
arrays in different ways. For our genomic data, array shape is typically
(100 million, 10000, 2) so putting the first dimension as depth will look
pretty weird :-)
Maybe you could do a webgl version which the user can rotate in 3d to their
liking :-)
On Tue, 14 May 2019, 13:57 Matthew Rocklin wrote:
From @alimanfoo <https://github.com/alimanfoo>
Don't know if anyone else is feeling this, but one thing I'm finding a
little confusing at the moment is that how the dimensions are displayed
changes depending on the number of dimensions
Yes, I originally kept things consistent so that the third dimension was
the depth dimension. I found that it conflicted with my intuition of what
arrays looked like, possibly due to a history of imaging datasets. I wanted
the fastest-moving dimensions (like (100, 1024, 1024)) to be facing
me as a rectangle, and then I wanted the slowest moving dimension going
away as depth. I fully acknowledge though that this may not be ideal. It
would be good to get others' opinions on how best to arrange the dimensions.
|
Sounds like great future work :) |
@martindurant gave the opposite recommendation in #4794 (comment) on the grounds that it's not particularly useful.
I don't see a strong motivation for this personally. I'm optimizing the HTML repr mostly for user experience rather than comprehensive representation. |
I suggested that just seeing part of the name wasn't interesting; if you wanted to have a superset of the text repr, then you would want the whole name, which is too long for the previous location. I could imagine it as a heading for the HTML block. |
|
Would showing the text repr in the title text help with both #4794 (comment) and #4794 (comment)? Something like
<table>
  <tr title="ones-353699120031">
    <th></th>
    <th>Array</th>
    <th>Chunks</th>
  </tr>
  ...
</table>
This looks something like |
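For reference, a sketch of generating that markup from Python with the array name placed in the header row's title attribute. `info_table` and its row format are hypothetical; the point is mostly that the name should be escaped before it goes into an attribute.

```python
from html import escape


def info_table(name, rows):
    """Build a small HTML table with ``name`` as hover text on the header row.

    ``rows`` is a list of (label, array_value, chunk_value) tuples.
    """
    # escape() also converts quotes, so the name is safe inside title="...".
    header = (
        f'<tr title="{escape(name)}">'
        "<th></th><th>Array</th><th>Chunks</th></tr>"
    )
    body = "".join(
        f"<tr><th>{escape(label)}</th>"
        f"<td>{escape(str(array_value))}</td>"
        f"<td>{escape(str(chunk_value))}</td></tr>"
        for label, array_value, chunk_value in rows
    )
    return f"<table>{header}{body}</table>"
```

Calling `info_table("ones-353699120031", [("Shape", (100, 100), (10, 10))])` yields a table whose header row shows the name on hover.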
|
I would avoid title text. You can't select it for copy/paste, and it's not something notebook users know to look for. |
|
The array name seems contentious enough that I think I'd like to leave it for future work. I don't think it's necessary to hold up the rest of this PR while waiting for folks to come to an agreement there. If there are no objections I'll plan to merge this tomorrow. |
|
Thanks for the review all. I'm quite happy with how this turned out. It was much nicer for all of the added perspectives. |
|
This provides more information about Dask arrays as an HTML repr
Currently it includes a table of information (bytes, shape, dtype, ...)
A visual representation of the grid in 1d or 2d (working on 3d next)
A rendered result is available here https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0df34e9d609f38a2bf3df9311909bb2c/raw/7ab5513cc1c6740ca8b2f4785ecd7d3a3869012c/dask-array-repr.ipynb
TODO
cc @shoyer @birdsarah @rabernat @jhamman
My hope is that with changes like these (I plan to do something similar for high level graphs) we can improve users' literacy and ability to reason about their computations.