feat: add writer_info field by henryiii · Pull Request #154 · scikit-hep/uhi

henryiii · 2025-04-16T21:19:29Z

This adds a place for library-specific metadata to be added. This allows the library and version to be recorded in the histogram. It is not required for reading.

We floated around several ideas for this name; I thought of "vendor" (it was inspired by the vendor field in CMakePresets.json), and we also considered "header". But since we decided to make the histogram library a key, "library" seems fitting. Open to suggestions, though. "writer_info" is another option.

henryiii · 2025-04-18T13:31:59Z

@HDembinski, @jpivarski, please sign off if you like it (you can also suggest a name if you don't like "library").

jpivarski · 2025-04-18T14:18:26Z

I'd be confused by "library" out of context. If the field is filled in with the name of the software library, that would clue me in and I'd understand it, but I can't be sure I'd get it if I was supposed to fill in the field and I couldn't find the instructions.

"vendor" is more clear except that it implies that someone's selling something, which is almost always not the case for histograms. "header" is too generic.

I looked at JPEG EXIFs, and they use the word "maker" a lot.

XMP (used in formats like PDF) would use "contributors" for this. It inherits from the Dublin Core, but that's more intended for things like books, which would name a human author as their creator rather than a software library. Same for EPUB, which uses "dc:creator".

In the end, how about "software_library" and "software_library_version"?

henryiii · 2025-04-18T14:40:45Z

What about writer_info (I added that after the initial issue)?

henryiii

This is what "writer_info" would look like.

docs/serialization.md

src/uhi/resources/histogram.schema.json

tests/resources/reg.json

jpivarski · 2025-04-18T15:23:51Z

"writer_info" sounds good. There's still potential for confusion (among all these formats) between the human who authors the product and the software that generates it, but that confusion shouldn't come up as much with histograms as it does with books.

Do you want to separate out a version field? Otherwise, a single free-text field would get filled inconsistently with versioned and unversioned data, with a space, hyphen, or something else separating the library name from the version number. (It's probably not much of a problem, I'm just asking.)

henryiii · 2025-04-18T15:39:10Z

The current proposal looks like this:

{
  "writer_info": {
    "boost-histogram": {
      "version": "1.0.0",
      (1) 
    }
  }
  (2)
}

The version is at writer_info/<library-name>/version. (1) is where a library can add any other metadata that it would like. For example, boost-histogram could record which axes had growth applied. That metadata is never required when reading a histogram. (2) is where the rest of the histogram is.
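As a rough illustration of the layout described above, here is a minimal Python sketch that builds such a dictionary by hand and reads the version back. The helper name `writer_version` and the placeholder `axes`/`storage` entries are assumptions made for this example; they are not taken from uhi.

```python
# Hypothetical serialized histogram. Only "writer_info" follows the layout
# proposed above; the other keys stand in for "the rest of the histogram".
hist = {
    "writer_info": {
        "boost-histogram": {
            "version": "1.0.0",
            # (1) extra, library-specific metadata would go here
        }
    },
    # (2) stand-in for the rest of the histogram
    "axes": [],
    "storage": {},
}


def writer_version(histogram, library):
    """Return the recorded version for `library`, or None if it is absent.

    writer_info is optional, so a reader has to tolerate it being missing.
    """
    return histogram.get("writer_info", {}).get(library, {}).get("version")


print(writer_version(hist, "boost-histogram"))          # 1.0.0
print(writer_version({"axes": []}, "boost-histogram"))  # None
```

Because every lookup falls back to an empty dict, a histogram written without `writer_info` reads the same way as one that has it, which matches the "never required when reading" rule.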

henryiii · 2025-04-18T15:49:39Z

It's also possible that if we put all the format writers here in uhi, then it could even look like this:

{
  "writer_info": {
    "boost-histogram": {
      "version": "1.6.0"
    },
    "uhi": {
      "version": "0.6.0"
    }
  }
}

The version of boost-histogram that produced the serialization struct would be recorded, and so could the version of uhi that converted that struct into HDF5, zip, zarr, or whatever.
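One way to picture two writers each leaving an entry is the sketch below. It assumes a hypothetical helper, `record_writer`, that each layer calls on the shared dictionary; nothing here is an existing uhi or boost-histogram function.

```python
def record_writer(histogram, library, version, **extra):
    """Add or update one library's entry without touching other writers.

    Hypothetical helper for illustration only.
    """
    info = histogram.setdefault("writer_info", {})
    entry = info.setdefault(library, {})
    entry["version"] = version
    entry.update(extra)  # optional library-specific metadata
    return histogram


hist = {}
# The histogram library records itself when producing the struct ...
record_writer(hist, "boost-histogram", "1.6.0")
# ... and the format writer adds its own entry later.
record_writer(hist, "uhi", "0.6.0")

print(sorted(hist["writer_info"]))  # ['boost-histogram', 'uhi']
```

Keying by library name is what lets the second call add an entry instead of overwriting the first.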

HDembinski · 2025-04-20T18:09:30Z

I am good with writer_info and the suggested use cases.

I see some conflict between using a structured storage format and what I understand is supposed to be metadata for human consumption. Using a structured format like a dict suggests that the information is designed to be read and used by machines. For pure metadata that is only for human consumption, I would use a string.

henryiii · 2025-04-21T03:56:56Z

This is naturally (one level) structured data¹, and doesn't have to be only for human consumption. A utility like uproot-browser could show the library and version number if it's present. And the round trip example requires machine consumption. For example, boost-histogram could record {"writer_info": {"boost-histogram": {"axis_type": "Integer", "growth": true}}} on an axis; then, when restoring that axis, it could restore the original Integer axis with growth instead of loading it as Regular or trying to guess from heuristics that it might have been Integer. ROOT could store the original storage bit width. I think YODA histograms have some custom properties, though it's been a while since I looked at that.

Even if improved round trip support is never added when loading, it's better to let the format record this where we can access it later if we want to, rather than changing the format down the road. You could even just manually process the file and detect which axes were originally filled with growth on, for example. It's also generally more space efficient not to combine this into an arbitrary readable string. The important thing is that it's not required to read a histogram, so libraries can load each other's histograms. It's basically exactly metadata, but for library authors instead of users.
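The round trip idea can be sketched in Python as follows. This assumes the hint is stored under the library's name inside an axis-level `writer_info`, as in the proposal earlier in the thread; the function name `restore_axis_kind` and the axis fields are stand-ins, not boost-histogram's actual loading code.

```python
def restore_axis_kind(axis, library="boost-histogram"):
    """Pick the axis kind to rebuild, using the writer's hint if present."""
    hint = axis.get("writer_info", {}).get(library, {})
    # Without a hint, fall back to what the required fields alone describe.
    kind = hint.get("axis_type", "Regular")
    growth = bool(hint.get("growth", False))
    return kind, growth


# Hypothetical serialized axes; the required fields are stand-ins.
hinted = {
    "type": "regular", "lower": 0, "upper": 10, "bins": 10,
    "writer_info": {"boost-histogram": {"axis_type": "Integer", "growth": True}},
}
plain = {"type": "regular", "lower": 0, "upper": 10, "bins": 10}

print(restore_axis_kind(hinted))  # ('Integer', True)
print(restore_axis_kind(plain))   # ('Regular', False)
```

A library that does not know the hint simply ignores it and loads the axis from the required fields, so histograms stay readable across libraries.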

¹ I need to, in a later PR, describe precisely what metadata is allowed to be placed into the metadata dictionary, and whatever that is will be true here too. Probably strings, numbers, bools, maybe more, probably initially based on what Uproot places in this field when reading a ROOT file.

henryiii · 2025-05-06T15:51:56Z

This now shares #162, so strings, numbers, and bools are the only allowed entries.
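A writer could check that restriction before saving with something like the sketch below. The function name `invalid_entries` is hypothetical, and the check is not the schema's own validation; it only mirrors the "strings, numbers, and bools" rule stated above.

```python
# bool is a subclass of int in Python; it is listed explicitly for readability.
ALLOWED_TYPES = (str, int, float, bool)


def invalid_entries(entry):
    """Return the keys in one writer_info entry whose values are not allowed."""
    return [key for key, value in entry.items()
            if not isinstance(value, ALLOWED_TYPES)]


ok = {"version": "1.6.0", "growth": True, "bit_width": 32}
bad = {"version": "1.6.0", "grown_axes": [0, 2], "note": None}

print(invalid_entries(ok))   # []
print(invalid_entries(bad))  # ['grown_axes', 'note']
```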

henryiii · 2025-05-22T05:10:00Z

Okay to go in? Would like at least one okay to proceed, and I want to make progress before the next IRIS-HEP Demo Days, where I'll talk about histogram serialization.

jpivarski · 2025-05-22T12:02:48Z

Looks good to me!

henryiii changed the title from "feat: add a library field" to "feat: add a writer_info field" Apr 18, 2025

henryiii force-pushed the henryiii/feat/library branch from 42ad8c9 to 570158e Compare April 18, 2025 18:17

henryiii changed the title from "feat: add a writer_info field" to "feat: add writer_info field" Apr 18, 2025

henryiii force-pushed the henryiii/feat/library branch from 570158e to 97e124d Compare April 18, 2025 18:20

henryiii mentioned this pull request Apr 24, 2025

fix: tighten metadata definition #163

Merged

henryiii force-pushed the henryiii/feat/library branch 3 times, most recently from 0ad211e to ba38769 Compare April 25, 2025 16:01

henryiii and others added 3 commits May 6, 2025 11:49

feat: add a library field

3d8f22c

Signed-off-by: Henry Schreiner <[email protected]>

refactor: rename to writer_info

6549edb

Update src/uhi/resources/histogram.schema.json Apply suggestions from code review Update tests/resources/reg.json

fix: writer_info allowed wherever metadata is allowed

e2d36d1

Signed-off-by: Henry Schreiner <[email protected]>

henryiii force-pushed the henryiii/feat/library branch from ba38769 to e2d36d1 Compare May 6, 2025 15:49

henryiii merged commit d57dedd into main May 22, 2025
10 checks passed

henryiii deleted the henryiii/feat/library branch May 22, 2025 20:07
