Asdf read speed by SolarDrew · Pull Request #514 · DKISTDC/dkist

SolarDrew · 2025-01-30T14:50:18Z

Fixes #500

dkist/dataset/tiled_dataset.py

Cadair · 2025-02-04T09:46:30Z

I just did a quick experiment locally: if we convert the Table to a numpy structured (record) array before we save it (Table.as_array()), asdf automatically writes only one binary block with the full data and saves the slices in the tree as references into that block.

Full code

import asdf

import dkist
from dkist.data.sample import VBI_AJQWW

tds = dkist.load_dataset(VBI_AJQWW)

# Combined header table for all tiles in the mosaic
whole_table = tds.combined_headers

# Two row slices of the same table
small1 = whole_table[0:10]
small2 = whole_table[10:20]

# Save the Table and its slices as astropy Tables
new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")

<duplicates the data>

# Same tree, but saved as a structured array and slices of that array
whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1": whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")

<does not duplicate the data>
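
As a rough illustration of why the array version can share one block, here is a minimal numpy-only sketch (not taken from the PR). It assumes a made-up two-column header array in place of the real sample data, and only shows that basic slices of a structured array are views onto the parent's buffer rather than copies.

```python
import numpy as np

# Hypothetical stand-in for the header table: two string columns, 27 rows.
dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])
whole = np.zeros(27, dtype=dtype)
whole["INSTRUME"] = "VBI"

small1 = whole[0:10]
small2 = whole[10:20]

# Basic slices are views: they point into the parent's buffer, no copy is made.
shares1 = np.shares_memory(whole, small1)
shares2 = np.shares_memory(whole, small2)
print(shares1, shares2, small2.base is whole)
```

Because the slices carry no data of their own, a writer that tracks the underlying buffer only needs to store it once.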

For a small example:

# Keep only two columns so the resulting YAML stays readable
whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1": whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

yields this asdf:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice source: 0 for all three arrays, and offset: 1160 for small2.
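
The offset value can be checked by hand. The sketch below is not from the PR; it assumes the rows are stored contiguously with no padding, and rebuilds the two-field datatype from the YAML above to work out how many bytes precede row 10.

```python
import numpy as np

# Same two fields as in the YAML above: 3 and 26 UCS-4 characters per row.
row_dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

bytes_per_row = row_dtype.itemsize   # (3 + 26) characters * 4 bytes = 116
offset_small2 = 10 * bytes_per_row   # small2 starts at row 10 -> 1160
data_bytes = 27 * bytes_per_row      # all 27 rows -> 3132 bytes of array data

# Cross-check with numpy: byte distance between the start of a [10:20]
# slice and the start of its parent array.
whole = np.zeros(27, dtype=row_dtype)
measured = (whole[10:20].__array_interface__["data"][0]
            - whole.__array_interface__["data"][0])

print(bytes_per_row, offset_small2, data_bytes, measured)
```

The computed 1160 bytes matches the offset recorded for small2, which is consistent with small2 being a reference to rows 10 to 19 of the single shared block.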

I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However, I think this approach has many advantages:

- Obvious performance improvements for single tables (probably), but especially for what you are doing on this PR.
- We can still convert up to a table either in the converter or in Dataset itself, without copying the memory.
- This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.

dkist/io/asdf/resources/manifests/dkist-1.2.0.yaml

dkist/io/asdf/converters/tiled_dataset.py

codspeed-hq · 2025-04-08T10:34:01Z

CodSpeed Instrumentation Performance Report

Merging #514 will improve performances by ×3.9

Comparing SolarDrew:asdf-read-speed (4e4e431) with main (50d0686)

Summary

⚡ 3 improvements
✅ 11 untouched benchmarks

Benchmarks breakdown

| | Benchmark | `BASE` | `HEAD` | Change |
|---|---|---|---|---|
| ⚡ | `test_load_tiled_asdf` | 6.2 s | 1.6 s | ×3.9 |
| ⚡ | `test_tileddataset_repr[simple-masked]` | 2 ms | 1.8 ms | +11.45% |
| ⚡ | `test_tileddataset_repr[simple-nomask]` | 2 ms | 1.8 ms | +13.1% |

SolarDrew · 2025-06-19T10:39:31Z

If I remember rightly, the current notebooks failure is a known sunpy issue and not a problem with this PR, which means this is actually finally ready to go, pending final review I guess.

dkist/dataset/tiled_dataset.py

tox.ini

Cadair · 2025-07-02T13:07:19Z

I think this is now ready; we just need to coordinate releasing it with the data center team.

Cadair · 2025-07-02T15:05:36Z

@eigenbrot Could you check this out and generate a before and after DL-NIRSP ASDF file to see how we did on performance?

eigenbrot · 2025-07-07T20:18:03Z

I did some before and after tests with a DL dataset consisting of 2116 L1 frames in a 23 x 23 mosaic.

With dkist == 1.13.0:
Generate metadata file with dataset_from_fits: 476 s
Read resulting file with dkist.load_dataset: 149 s

With this PR:
Generate metadata file with dataset_from_fits: 206 s
Read resulting file with dkist.load_dataset: 18 s

So much faster!
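
For reference, the timings above can be turned into speed-up factors with a few lines of arithmetic. This is only a sketch over the four numbers reported in this comment; the dictionary keys are labels chosen here for illustration.

```python
# (seconds with dkist 1.13.0, seconds with this PR), as reported above
timings = {
    "dataset_from_fits": (476, 206),
    "load_dataset": (149, 18),
}

# Speed-up factor = time before / time after
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: x{factor:.1f}")
```

This gives roughly ×2.3 for generating the metadata file and ×8.3 for reading it back.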

eigenbrot

I'm approving just based on the fact that it works on real DL data.

dkist/io/asdf/tests/test_dataset.py

SolarDrew · 2025-07-10T09:51:04Z

The build_docs failure is fixed in #574. Not sure what's up with the oldestdeps one.

SolarDrew · 2025-07-14T08:31:57Z

Merging #514 will degrade performances by 10.98%

WWWWWHHHHHHHYYYY!?!?!?

Cadair · 2025-07-14T08:34:37Z

Does the test file use the new code from this PR?

SolarDrew · 2025-07-14T08:35:52Z

No, that'll do it

Cadair · 2025-07-14T09:12:16Z

SolarDrew added 8 commits January 28, 2025 10:32

Add mechanism for datasets to know if they're a tile

40af8bf

Stack headers and store canonically on TiledDataset

2d35c18

Don't save out headers on mosaic tile Datasets

ec5aaef

Minor test upgrade

afd6f70

Pass headers to TiledDataset in simple_tiled_dataset fixture

d2d29c0

Need to stack the headers

a4d1efe

Make TiledDataset converter read and write headers

28f19ab

Schema nonsense

3601236

Cadair reviewed Jan 30, 2025

dkist/dataset/tiled_dataset.py (outdated, resolved)

SolarDrew added 3 commits February 3, 2025 10:23

Merge branch 'main' of github.com:DKISTDC/dkist into asdf-read-speed

8bb16cc

Needed to point the manifest at the right schema

6343fa8

Replace changes to manifest with new file

2b2b125

SolarDrew added 2 commits February 4, 2025 11:21

More schema schenanigans

4e42f22

Save header table as rec array

374526d

Cadair reviewed Feb 4, 2025

dkist/io/asdf/resources/manifests/dkist-1.2.0.yaml (outdated, resolved)

dkist/io/asdf/converters/tiled_dataset.py (outdated, resolved)

SolarDrew added 3 commits February 5, 2025 16:23

Changelog

f3eef71

Save Dataset headers as rec arrays as well

f3e9095

Update some asdf tags

7c61d63

Cadair mentioned this pull request Apr 3, 2025

Add a how-to guide about manipulating a masked Table of headers #553

Open

SolarDrew added 5 commits April 7, 2025 11:58

Merge branch 'main' of github.com:DKISTDC/dkist into asdf-read-speed

93be016

Merge branch 'main' of github.com:DKISTDC/dkist into asdf-read-speed

2128324

Forgot to update entry_points in the merge

5c9323a

Fix header fetching in init

bb81acf

Spaces are important apparently

2f2242b

SolarDrew added 4 commits April 8, 2025 12:05

Update schema tag

7d6917d

Allow empty headers (not 100% sure we want to)

ac443a8

Fix header save/load

0c4903f

Update schema in converter

5b9d4bc

Hopefully fix notebooks build

a5282c9

Cadair requested changes Jul 1, 2025

dkist/dataset/tiled_dataset.py (outdated, resolved)

dkist/dataset/tiled_dataset.py (outdated, resolved)

tox.ini (outdated, resolved)

Cadair added 2 commits July 2, 2025 11:37

Merge remote-tracking branch 'upstream/main' into asdf-read-speed

e2bade6

Move offset/size unwrapping into Converter

605da2d

Cadair force-pushed the asdf-read-speed branch from 69a5b22 to a1be5db July 2, 2025 11:26

Cadair added 2 commits July 2, 2025 13:56

More comments and clean up

ddd654b

lint

0a65989

Cadair force-pushed the asdf-read-speed branch from a1be5db to 0a65989 July 2, 2025 12:57

Add a tiled_dataset 1.3.0 test file

b4c8a1f

Cadair approved these changes Jul 2, 2025

Cadair requested a review from eigenbrot July 2, 2025 15:04

eigenbrot approved these changes Jul 7, 2025

Cadair approved these changes Jul 8, 2025

dkist/io/asdf/tests/test_dataset.py (outdated, resolved)

Cadair added 3 commits July 8, 2025 11:22

Update dkist/io/asdf/tests/test_dataset.py

b268d16

Merge branch 'main' into asdf-read-speed

7ba6991

Regenerate test asdf for tiledataset 1.3.0

119305e

Cadair and others added 2 commits July 10, 2025 13:05

Merge branch 'main' into asdf-read-speed

1f2f751

Replace vbi test file with rezipped updated sample asdf

5516003

Try again

4e4e431

Cadair merged commit 6a372ec into DKISTDC:main Jul 14, 2025
29 checks passed

Cadair mentioned this pull request Jul 30, 2025

Accessing TiledDataset.combined_headers is slow #497

Closed
