Asdf read speed #514

Merged
Cadair merged 101 commits into DKISTDC:main from SolarDrew:asdf-read-speed
Jul 14, 2025

@SolarDrew
Contributor

Fixes #500

@Cadair
Member

Cadair commented Feb 4, 2025

I just did a quick experiment locally: if we convert the Table to a NumPy structured array before we save it (Table.as_array()), then asdf automatically writes only one binary block with the full data and saves the slices in the tree as references to that block.

Full code
import asdf
import dkist
from dkist.data.sample import VBI_AJQWW

tds = dkist.load_dataset(VBI_AJQWW)
whole_table = tds.combined_headers

# Case 1: save the Table and two slices of it.
small1 = whole_table[0:10]
small2 = whole_table[10:20]
new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")  # duplicates the data

# Case 2: convert to an array first, then save it and two slices of it.
whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1": whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")  # does not duplicate the data

For a small example:

whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1":whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

yields this asdf:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice source: 0 for all three arrays (they all point at the same binary block), and offset: 1160 for small2.
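The offset in the YAML can be reproduced with NumPy alone. This is a minimal sketch, assuming a zero-filled placeholder array with the same two-field layout as the example (the real header values are not needed for the arithmetic); the helper name `buffer_start` is made up for illustration.

```python
import numpy as np

# Same layout as the YAML above: two little-endian UCS-4 string fields.
dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

# Placeholder data: 27 rows, matching `shape: [27]` for `whole`.
whole = np.zeros(27, dtype=dtype)
small2 = whole[10:20]

# Each UCS-4 character takes 4 bytes, so one row is (3 + 26) * 4 bytes.
row_bytes = dtype.itemsize


def buffer_start(arr):
    """Address of the first byte of `arr` (hypothetical helper)."""
    return arr.__array_interface__["data"][0]


# Distance in bytes from the start of `whole` to the start of the slice.
offset = buffer_start(small2) - buffer_start(whole)

print(row_bytes, offset, np.shares_memory(whole, small2))
```

The slice is a view onto the parent buffer that starts ten rows in, which is the situation the shared `source: 0` and the `offset` entry describe.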


I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However I think this approach has many advantages:

  • Obvious performance improvements for single tables (probably), but especially for what you are doing on this PR.
  • We can still convert up to a table either in the converter or in Dataset itself, without copying the memory.
  • This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.

@codspeed-hq

codspeed-hq bot commented Apr 8, 2025

CodSpeed Instrumentation Performance Report

Merging #514 will improve performances by ×3.9

Comparing SolarDrew:asdf-read-speed (4e4e431) with main (50d0686)

Summary

⚡ 3 improvements
✅ 11 untouched benchmarks

Benchmarks breakdown

Benchmark                               BASE    HEAD    Change
test_load_tiled_asdf                    6.2 s   1.6 s   ×3.9
test_tileddataset_repr[simple-masked]   2 ms    1.8 ms  +11.45%
test_tileddataset_repr[simple-nomask]   2 ms    1.8 ms  +13.1%

@SolarDrew
Contributor Author

If I remember rightly, the current notebooks failure is a known sunpy issue and not a problem with this PR, which means this is actually finally ready to go, pending final review I guess.

@Cadair Cadair force-pushed the asdf-read-speed branch from 69a5b22 to a1be5db Compare July 2, 2025 11:26
@Cadair Cadair force-pushed the asdf-read-speed branch from a1be5db to 0a65989 Compare July 2, 2025 12:57
@Cadair
Member

Cadair commented Jul 2, 2025

I think this is now ready; we just need to coordinate releasing it with the data center team.

@Cadair Cadair requested a review from eigenbrot July 2, 2025 15:04
@Cadair
Member

Cadair commented Jul 2, 2025

@eigenbrot Could you check this out and generate a before and after DL-NIRSP ASDF file to see how we did on performance?

@eigenbrot
Contributor

I did some before and after tests with a DL dataset consisting of 2116 L1 frames in a 23 x 23 mosaic.

With dkist == 1.13.0:
Generate metadata file with dataset_from_fits: 476 s
Read resulting file with dkist.load_dataset: 149 s

With this PR:
Generate metadata file with dataset_from_fits: 206 s
Read resulting file with dkist.load_dataset: 18 s

So much faster!
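To put numbers on that, a small sketch that turns the timings quoted above into speedup factors. The dictionary keys are just labels chosen here; the values are the seconds reported in this comment.

```python
# (before, after) timings in seconds, copied from the comment above.
timings = {
    "dataset_from_fits": (476, 206),
    "dkist.load_dataset": (149, 18),
}

# Speedup factor = time with dkist 1.13.0 / time with this PR.
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

That works out to roughly 2.3x for generating the metadata file and roughly 8.3x for reading it back.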

@eigenbrot eigenbrot left a comment

I'm approving just based on the fact that it works on real DL data.

@SolarDrew
Contributor Author

The build_docs failure is fixed in #574. Not sure what's up with the oldestdeps one.

@SolarDrew
Contributor Author

"Merging #514 will degrade performances by 10.98%"

WWWWWHHHHHHHYYYY!?!?!?

@Cadair
Member

Cadair commented Jul 14, 2025

Does the test file use the new code from this PR?

@SolarDrew
Contributor Author

No, that'll do it

@Cadair Cadair merged commit 6a372ec into DKISTDC:main Jul 14, 2025
29 checks passed
@Cadair
Member

Cadair commented Jul 14, 2025

glory

Labels

Run downstream CI: runs the downstream CI workflow on a PR

Development

Successfully merging this pull request may close these issues.

Reading a DL-NIRSP ASDF is very slow

3 participants