CodSpeed Instrumentation Performance Report: Merging #462 will improve performance by 80.26%.
This was the idea, yes, although as time has progressed I don't think it's a good idea. In theory we could chunk down to the individual compressed tiles in the FITS file, but that would make the dask graph insanely large, with insanely small chunk sizes, so I don't think we are likely to do it.
@svank If we agreed to get rid of the sub-file loading chunksize (I think this would mean dropping chunksize handling completely), would you be willing to get this PR mergeable? I like the idea of being able to batch chunks; did you try out that functionality with the distributed scheduler? I wonder if it would make more difference there?
Yeah, if we drop the idea of individual files being split into chunks, I think this PR becomes easy to finish. I think the only remaining large task is figuring out how to expose the batch size to the user, if that's something we want exposed. I think it was with the distributed scheduler that I was finding that the batching didn't actually help at all, maybe because of the extra work of concatenating the multiple files into one larger array. I don't know if a benchmark that does more substantial processing on the data after loading would see enough benefit to offset that cost.
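For reference, the batching idea discussed above can be sketched roughly like this (a minimal illustration with stand-in loader functions and made-up names, not the actual dkist implementation): one Dask task loads several files and concatenates them into a single larger chunk.

```python
import numpy as np
import dask.array as da


def load_batch(indices):
    # Hypothetical batched loader: one task "loads" several files and
    # concatenates them into one larger chunk. Stand-in arrays are used
    # here in place of real FITS reads.
    return np.concatenate(
        [np.full((1, 10, 10), i, dtype=float) for i in indices]
    )


n_files, batch_size = 6, 3
name = "load-batched-demo"
# One task per batch of files, rather than one task per file.
dsk = {
    (name, b, 0, 0): (load_batch, list(range(b * batch_size, (b + 1) * batch_size)))
    for b in range(n_files // batch_size)
}
arr = da.Array(
    dsk,
    name,
    chunks=((batch_size,) * (n_files // batch_size), (10,), (10,)),
    dtype=float,
)
```

The trade-off described in the thread is visible in the sketch: the graph has fewer, larger chunks, but `np.concatenate` copies every batch's data up front, which can offset the scheduling savings.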
CodSpeed WallTime Performance Report: Merging #462 will improve performance by 39.17%.
Well, even with the improved, more stringent benchmarks this is a clear win, so we need to polish this up. The issue in its current state is how to handle the `chunksize` argument. I'm pro removing it, but I realised that technically it's possible for someone to have written an asdf file with `chunksize` in, even though I'm pretty sure the data centre never has. So we have two options: maintain compatibility with `chunksize` (which is currently implemented but not tested), or just remove it anyway and try not to actively break any asdf files with it set (just have it be a no-op).
The simpler task graph seems to run faster
This either needs rewriting or removing depending
I've kept `chunksize` around but made it throw a deprecation warning (which I don't expect people to ever see).
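A no-op deprecation along these lines could look roughly like this (hypothetical signature; the actual dkist API may differ):

```python
import warnings


def load_dataset(path, chunksize=None):
    # Hypothetical signature: accept chunksize for backwards compatibility
    # with old asdf files, but treat it as a no-op and warn.
    if chunksize is not None:
        warnings.warn(
            "'chunksize' is deprecated and has no effect; files are now "
            "loaded as whole-file chunks.",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... actual loading would happen here ...
    return path
```

This keeps old asdf files loadable while making the argument inert, matching the "just have it be a no-op" option discussed above.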
I'm unsure if this PR is worth it, but it offers some extra analysis on the optimization/chunk size front.
This PR speeds up computation with the distributed scheduler a bit by changing the code for building the Dask array to simplify the task graph (or so I believe). Instead of several iterations of `da.stack` calls, it directly generates Dask tasks for each chunk of the final array, following the example here. This shaves a bit of time off when using the distributed scheduler (36 to 32 seconds), but the "processes" scheduler is still faster and is almost unaffected by this change (from a consistent 20.5 to 19.7 seconds). I'm using the same test case as in #455.

This PR currently needs more thought on handling non-default chunksizes (the argument to `stack_loader_array`, which it looks like can be set in the ASDF file?). I guess I'm not sure what the intended use case is for that. In `main` it looks like this option can only sub-divide files into smaller chunks. Are there DKIST files out there that are too big to be a single chunk? Or is the idea to subdivide files when only a portion of each file has to be loaded? As this PR is now, each file gets fully loaded and then split up, which might be less efficient. But before this PR, I'm not sure if that was also the case, or if `hdu.data[slice]` loads just the sliced portion of the file. If the latter, changing this PR to make use of that would, I think, require manually creating tasks for each chunk of each file, which seems complicated.

In #455, there was discussion of rechunking to larger chunk sizes. At least for that test case, using `.rechunk((1, 100, -1, -1))` actually slows down the total computation (I'm seeing it go from 36 to 45 seconds), I assume from the overhead of gathering and combining arrays. This PR's design makes it easy to try larger chunk sizes more efficiently, as one "loading" task can load multiple files and concatenate them immediately, rather than adding a separate layer to the task graph. This is implemented in this PR, in the `Docstring` commit and before. A "batch size" can be set, and each Dask task loads that many files and concatenates them. (The batch size has to be adjusted in the code; I didn't know how to expose it as an option for `load_dataset`.) I found that with a batch size of 100 (producing chunks of a few hundred MB, Dask's recommended size), I was actually getting a very slightly longer runtime (going from 24 to 27 seconds), but it's less extreme than using `.rechunk()`. My guess is that the `np.concatenate` call copying all the data arrays offsets any efficiency gains from the larger chunks. (Maybe if I were doing a lot more with the data it would be worth it.) I didn't see a way to have Astropy FITS load a file directly into a view of a larger output array, unfortunately.

Here's the task stream (top is before, bottom is this PR):

*(task stream screenshot)*
And here's a portion of the task graph for each (left is before, right is this PR). This is from `dataset.data.visualize()`, and I made an asdf file with only a handful of FITS files so the images wouldn't be unrenderably large.

*(task graph screenshot)*

Right now my thinking is "just use the `processes` scheduler", where the gains from this PR are small enough that I'm not sure it's worth more work to handle the `chunksize` argument better.