Skip to content

Comments

feat: add writer_info field#154

Merged
henryiii merged 3 commits intomainfrom
henryiii/feat/library
May 22, 2025
Merged

feat: add writer_info field#154
henryiii merged 3 commits intomainfrom
henryiii/feat/library

Conversation

@henryiii
Copy link
Member

@henryiii henryiii commented Apr 16, 2025

This adds a place for library-specific metadata to be added. This allows the library and version to be recorded in the histogram. It is not required for reading.

We floated around several ideas for this name; I thought of "vendor" (it's was inspired by the vendor field in CMakePreset.json), and we also considered "header". But since we decided to make the histogram library a key, then "library" seems to be fitting. Open to suggestions, though. "writer_info" is another option.

@henryiii
Copy link
Member Author

@HDembinski, @jpivarski, please sign off if you like it (you can also suggest a name if you don't like "library").

@jpivarski
Copy link
Member

I'd be confused by "library" out of context. If the field is filled in with the name of the software library, that would clue me in and I'd understand it, but I can't be sure I'd get it if I was supposed to fill in the field and I couldn't find the instructions.

"vendor" is more clear except that it implies that someone's selling something, which is almost always not the case for histograms. "header" is too generic.

I looked at JPEG EXIFs, and they use the word "maker" a lot.

XMP (used in formats like PDF) would use "contributors" for this. It inherits from the Dublin Core, but that's more intended for things like books that would name a human author as its creator, rather than a software library. Same for EPUB, which uses "dc:creator".

In the end, how about "software_library" and "software_library_version"?

@henryiii
Copy link
Member Author

What about writer_info (I added that after the initial issue)?

Copy link
Member Author

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what "writer_info" would look like.

@jpivarski
Copy link
Member

"writer_info" sounds good. There's still potential for confusion (among all these formats) between the human who authors the product and the software that generates it, but that confusion shouldn't come up as much with histograms as it does with books.

Do you want to separate out a version field? Otherwise, a single free-text field would get filled inconsistently with versioned and unversioned data, with a space, hyphen, or something else separating the library name from the version number. (It's probably not much of a problem, I'm just asking.)

@henryiii
Copy link
Member Author

The current proposal looks like this:

{
  "writer_info": {
    "boost-histogram": {
      "version": "1.0.0",
      (1) 
    }
  }
  (2)
}

The version is at writer-info/<library-name>/version. (1) is where a library can add any other metadata that they would like. For example, boost-histogram could record which axes had growth applied. That metadata is never required when reading a histogram. (2) is where the rest of the histogram is.

@henryiii
Copy link
Member Author

henryiii commented Apr 18, 2025

It's also possible that if we put all the format writers here in uhi, then it could even look like this:

{
  "writer_info": {
    "boost-histogram": {
      "version": "1.6.0",
    },
    "uhi": {
      "version": "0.6.0",
    }
  }
}

The version of boost-histogram that produced the serialization struct would be recorded, but also the version of uhi that converted that struct in to HDF5 or zip or zarr or whatever could also be recorded.

@henryiii henryiii changed the title feat: add a library field feat: add a writer_info field Apr 18, 2025
@henryiii henryiii force-pushed the henryiii/feat/library branch from 42ad8c9 to 570158e Compare April 18, 2025 18:17
@henryiii henryiii changed the title feat: add a writer_info field feat: add writer_info field Apr 18, 2025
@henryiii henryiii force-pushed the henryiii/feat/library branch from 570158e to 97e124d Compare April 18, 2025 18:20
@HDembinski
Copy link
Member

I am good with writer_info and the suggested use cases.

I see some conflict between the idea of using a structured storage format for what I understand is supposed to be metadata for human consumption. Using a structured format like a dict suggests that the information is designed to be read and used by machines. For pure metadata that is only for human consumption, I would use a string.

@henryiii
Copy link
Member Author

This is naturally (one level) structured data1, and doesn't have to be only for human consumption. A utility like uproot-browser could show the library and version number if it's present. And the round trip example requires machine consumption. For example, boost-histogram could record {"writer_info": "axis_type": "Integer", "growth": True} on an axis, then when restoring an axis, it could restore the original Integer with growth axis instead of loading it as Regular or trying to guess that it might have been Integer based on heuristics. ROOT could store the original storage bit width. I think YODA histograms have some custom properties, though it's been a while since I was looking at that.

Even if improved round trip support isn't ever added when loading, it's better to allow the format to record this where we can access it if we want to later, rather than changing the format down the road. You could even just manually process the file and detect which axis were originally filled with growth on, for example. It's also generally more space efficient to not combine this into an arbitrary readable string. The important thing is that it's not required to read a histogram so libraries can load each other's histograms. It's basically exactly metadata but for library authors instead of users.

Footnotes

  1. I need to, in a later PR, describe precisely what metadata is allowed to be placed into the metadata dictionary, and whatever that is will be true here too. Probably strings, numbers, bools, maybe more, probably initially based on what Uproot places in this field when reading a ROOT file.

@henryiii henryiii force-pushed the henryiii/feat/library branch 3 times, most recently from 0ad211e to ba38769 Compare April 25, 2025 16:01
henryiii and others added 3 commits May 6, 2025 11:49
Signed-off-by: Henry Schreiner <[email protected]>
Update src/uhi/resources/histogram.schema.json

Apply suggestions from code review

Update tests/resources/reg.json
@henryiii henryiii force-pushed the henryiii/feat/library branch from ba38769 to e2d36d1 Compare May 6, 2025 15:49
@henryiii
Copy link
Member Author

henryiii commented May 6, 2025

This now shares #162, so strings, numbers, and bools are the only allowed entries.

@henryiii
Copy link
Member Author

Okay to go in? Would like at least one okay to proceed, and I want to make progress before the next IRIS-HEP Demo Days, where I'll talk about histogram serialization.

@jpivarski
Copy link
Member

Looks good to me!

@henryiii henryiii merged commit d57dedd into main May 22, 2025
10 checks passed
@henryiii henryiii deleted the henryiii/feat/library branch May 22, 2025 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants