Dataset series #868

Closed
dr-shorthair opened this issue Mar 31, 2019 · 51 comments · Fixed by #1262
Labels
dataset-series dcat:Dataset dcat dct:spatial dct:temporal due for closing Issue that is going to be closed if there are no objection within 6 days qualified relations Issues somehow related to qualified relationships

@dr-shorthair
Contributor

A significant outstanding issue that keeps coming up as the sideline to other conversations [1][2] is the need to have a recommended pattern for cataloguing dataset series. Budget data, satellite imagery, ...

Usually most of the description (metadata) is the same, but the temporal coverage changes between members of a series.

[1] #789
[2] #806

@nicholascar
Contributor

Or the spatial coverage changes, as is the case with maps in a series

@davebrowning
Contributor

Fully agree that there would be great value in some story, though I'm a little unsure that there is a single pattern that can be recommended. Worth trying, anyway, now we seem to have a stable definition of the qualities of dcat:Distribution. In the spirit of starting a conversation:

The spatial series seems a good entry point here: in that case would each map be a dcat:Dataset with its own dcat:Distributions? [DCAT-rev definition of distributions means that you would have to do this, since the content of different maps is different.] If we just link these (with some variant of dct:relation or dcat:qualifiedRelation) then we have the connections but no holder for metadata common across the series. We could have a parent dataset (trying hard not to call it an atlas) which has the maps as constituent parts, which would hint at common metadata and use something like prov:specializationOf to identify the members of the series, though some kind of generic (cataloguable) container type would work too, I think. Or is all that too simplistic?
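
As a rough illustration of the parent-dataset idea above, a Turtle sketch might look like the following. Everything under the `ex:` namespace (the series, the sheets, the publisher, the file URLs) is hypothetical, and `prov:specializationOf` appears only because it is floated above as one possible linking property, not because it is an agreed pattern.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Parent dataset: holds the metadata shared by every map in the series.
ex:map-series a dcat:Dataset ;
    dct:title "Topographic map series"@en ;
    dct:publisher ex:mapping-agency ;
    dct:hasPart ex:sheet-1, ex:sheet-2 .

# Each map is its own dataset, because its content differs from the others.
ex:sheet-1 a dcat:Dataset ;
    dct:title "Sheet 1"@en ;
    prov:specializationOf ex:map-series ;
    dct:spatial ex:sheet-1-extent ;
    dcat:distribution ex:sheet-1-tiff .

ex:sheet-1-tiff a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/sheet-1.tif> .

ex:sheet-2 a dcat:Dataset ;
    dct:title "Sheet 2"@en ;
    prov:specializationOf ex:map-series ;
    dct:spatial ex:sheet-2-extent ;
    dcat:distribution ex:sheet-2-tiff .

ex:sheet-2-tiff a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/sheet-2.tif> .
```

Only the spatial extent and the distribution differ between the sheets here; the title and publisher of the series live once on the parent.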

For the temporal case, we might be able to do the same kind of pattern, but I suspect that there are naturally more things at play there, or possibly other use cases. Versioning - or more specifically change through the progress of time - seems to me to become part of the picture very quickly, whereas for spatial and other kinds of series it would be a bit more orthogonal. Perhaps the temporal kind of series is better looked at via some kind of service paradigm. That might all just be my narrow/limited world view, though. 😄

@makxdekkers
Contributor

makxdekkers commented Apr 2, 2019

As far as I am concerned, I don't think we're going to resolve this issue before finalising the CR. There are several dimensions to this, and I've been involved in long discussions that did not lead to a resolution after many months. Maybe we should move it to the list for version 3, and set up a sprint for it after publication of the CR?

@davebrowning
Contributor

+1 to Makx's timing suggestion - I was assuming so, actually. I think there is some value in having such things in the Future Priority backlog - it may solicit further input or use cases.

@dr-shorthair
Contributor Author

Yes, sorry, I did not make it clear at the top of the issue that I intended this for the backlog. Definitely too late to do anything reasonable for DCAT 2. I just wanted to have a discrete issue to track.

@dr-shorthair
Contributor Author

dr-shorthair commented Jun 7, 2019

The Interest Group on Data Discovery Paradigms of the Research Data Alliance has a task force on Data/metadata granularity, which is considering a taxonomy of data aggregation. @andrea-perego and @dr-shorthair are in contact and will likely bring more detailed requirements that can form the basis of developing DCAT patterns for this.

@der

der commented Aug 5, 2019

Sorry to be late with this comment, haven't been following this work.

As an outsider to this WG, can I reinforce the importance of this issue? This has been, and continues to be, a substantial pain point in our attempts to use dcat. Sad to hear that it won't be addressed for DCAT 2.

In our experience with public sector datasets it is relatively rare for a dataset to be a unitary thing which can be downloaded in its entirety. More typically the non-realtime datasets we see comprise a series of updates (annual, quarterly, monthly, etc., as determined by some release cycle). Where possible we provide data services and dumps for the whole series. However, both users and publishers want to explicitly see the series of updates as individual elements they can separately download, but regard the collection of those updates as a single dataset with common metadata, and want the data, and presentation of it, to reflect that.

Possible approaches to this include:

  1. Model each such dataset as a dcat:Catalog which then references each update as a separate dcat:Dataset with its own distribution but put all the common metadata on the dcat:Catalog. This could work but then it is hard for a generic client to tell the difference between this use of Catalog and the "normal" uses of dcat:Catalog as (possibly hierarchical) collections of heterogeneous datasets. It's also hard to then point to a Distribution for the whole dataset. It would be possible to support this pattern through a marker subclass of catalog (dcat:DatasetSeries or some such).

  2. Use dcat:Dataset for the series but allow a dataset to have multiple partial distributions, each with a separate temporal/spatial/other extent. This could work but the existing text and UML don't encourage per-Distribution extents and imply that a Distribution covers a whole dataset. Furthermore, if you have different formats available for each update then the relation between the different partial Distributions would be obscure.

  3. Introduce a separate notion of a partition or element of a dataset which can have its own extent information and its own Distribution(s). This is the route we've used up to now and works fine within our own systems but means that an external client expecting dcat can't see the individual elements in the series. Sadly this is usually the grain size a harvester actually wants to see.
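
To make the three approaches listed above easier to compare, here is a minimal Turtle sketch of each, side by side. All `ex:` resources are hypothetical; in particular `ex:Element`, `ex:element` and `ex:distribution` are invented stand-ins for the "separate notion of a partition or element" and are not DCAT terms.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Approach 1: the series is a dcat:Catalog, each update a dcat:Dataset.
ex:stats-catalog a dcat:Catalog ;
    dct:title "Quarterly statistics"@en ;          # common metadata sits here
    dcat:dataset ex:stats-2019-q1, ex:stats-2019-q2 .

ex:stats-2019-q1 a dcat:Dataset ;
    dcat:distribution ex:stats-2019-q1-csv .

# Approach 2: one dcat:Dataset with partial distributions,
# each carrying its own temporal extent.
ex:stats-dataset a dcat:Dataset ;
    dct:title "Quarterly statistics"@en ;
    dcat:distribution ex:part-2019-q1-csv, ex:part-2019-q2-csv .

ex:part-2019-q1-csv a dcat:Distribution ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-03-31"^^xsd:date ] ;
    dcat:downloadURL <http://example.org/files/stats-2019-q1.csv> .

# Approach 3: a separate (non-DCAT, hypothetical) element class
# with its own extent and its own distributions.
ex:stats-partitioned a dcat:Dataset ;
    dct:title "Quarterly statistics"@en ;
    ex:element ex:element-2019-q1 .

ex:element-2019-q1 a ex:Element ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-03-31"^^xsd:date ] ;
    ex:distribution ex:element-2019-q1-csv .
```

A client that only understands DCAT would see the individual updates in approaches 1 and 2, but in approach 3 it would see just `ex:stats-partitioned`, which is the drawback described in item 3.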

Even if you can't recommend a specific pattern for DCAT 2 would you be able to give some indication of the likely direction of travel (as a guide to those of us who need to work around the limitation in the meantime)?

@makxdekkers
Contributor

@der My opinion is that, if we want to support dataset series in DCAT -- and I agree it is a very common request -- the best approach might be to define a new class (subclass of dcat:Resource), e.g. dcat:DatasetSeries. The related datasets could then be linked using dct:hasPart and dct:isPartOf, or similar properties.
The advantage is that we would not be 'repurposing' existing classes, which carries the risk that 'legacy' DCAT implementations might not understand what is going on. A new class would allow us to describe explicitly the behaviour of the series and the datasets in it. For example, if the behaviour were to include a certain 'inheritance' of 'common metadata' from the series to the individual datasets, that could be made clear in the description of the new class.
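
A minimal Turtle sketch of this suggestion could look as follows. The class `dcat:DatasetSeries` is the name proposed in this comment and is treated here as an assumption rather than an existing DCAT 2 term; the `ex:` resources are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Assumed declaration of the proposed class (not part of DCAT 2).
dcat:DatasetSeries a rdfs:Class ;
    rdfs:subClassOf dcat:Resource .

# The series carries the metadata common to all members.
ex:budget a dcat:DatasetSeries ;
    dct:title "Annual budget"@en ;
    dct:publisher ex:treasury ;
    dct:hasPart ex:budget-2019, ex:budget-2020 .

# Members remain ordinary datasets and point back to the series.
ex:budget-2019 a dcat:Dataset ;
    dct:title "Budget 2019"@en ;
    dct:isPartOf ex:budget .

ex:budget-2020 a dcat:Dataset ;
    dct:title "Budget 2020"@en ;
    dct:isPartOf ex:budget .
```

Whether members would "inherit" `dct:publisher` from the series is exactly the kind of behaviour the class description would have to spell out; the triples alone do not imply it.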

@der

der commented Aug 5, 2019

Thanks @makxdekkers. That would work for me.

In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset. That way we could give a distribution (and extent) for the overall aggregate as well as distributions for the individual elements within the series. It would also give us a transition plan - publish resources now as dcat:Datasets (with the dct:hasPart and dct:isPartOf relationships to other dcat:Datasets for the elements) and then add declarations for rdf:type dcat:DatasetSeries when/if that becomes available with compatible semantics.
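
The transition plan described above could be sketched in Turtle roughly as below. The `ex:` resources are hypothetical, and `dcat:DatasetSeries` is assumed to be a class that may or may not become available.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Step 1 (publishable now): plain datasets linked by part relationships.
ex:river-levels a dcat:Dataset ;
    dct:title "River levels"@en ;
    dct:hasPart ex:river-levels-2018, ex:river-levels-2019 ;
    dcat:distribution ex:river-levels-all-csv .    # dump of the whole aggregate

ex:river-levels-2018 a dcat:Dataset ;
    dct:isPartOf ex:river-levels ;
    dcat:distribution ex:river-levels-2018-csv .   # one element of the series

ex:river-levels-2019 a dcat:Dataset ;
    dct:isPartOf ex:river-levels ;
    dcat:distribution ex:river-levels-2019-csv .

# Step 2 (only if a compatible class is adopted): add the second type.
ex:river-levels a dcat:DatasetSeries .
```

Step 2 adds a single triple and leaves the existing description untouched, which is what makes the dual typing attractive as a migration path.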

Do you think there's a chance of squeezing a non-normative indication of this as a possible future pattern into the doc? Or at least a comment that use of dct:hasPart/isPartOf on datasets is in principle legal? Not sure how close to CR you are so appreciate this might be too late.

I mention this because https://www.w3.org/TR/vocab-dcat-2/#Property:catalog_has_part implies that you have domain/range declarations for dct:hasPart which would militate against this pattern. I'm assuming this is just a confusing presentation and that what you actually have are owl:allValuesFrom restrictions, and so not a problem.
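
The distinction being drawn here can be shown with two alternative axioms. This is only an illustration of the two readings; it makes no claim about which axioms the DCAT vocabulary actually contains.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Reading 1 (global): every object of dct:hasPart, on any subject,
# would be inferred to be a dcat:Catalog. A dataset listed as a part
# of another dataset would then also be typed as a catalog.
dct:hasPart rdfs:range dcat:Catalog .

# Reading 2 (local): the constraint applies only when the subject is a
# dcat:Catalog. dct:hasPart between datasets is left unconstrained.
dcat:Catalog rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty dct:hasPart ;
    owl:allValuesFrom dcat:Catalog
] .
```

Under the second reading, using dct:hasPart/dct:isPartOf between datasets triggers no unwanted inference.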

@makxdekkers
Contributor

@der I was just expressing my personal opinion, and the group might not agree.
In any case, I think we need to discuss this in more detail before making some sort of statement for the future -- it would not be good to make such a statement now and then to backtrack on it...

@smrgeoinfo
Contributor

For future reference (v3?), I agree DatasetSeries should be a separate subclass of dcat:Resource. Noting that a series would have all the properties that are specific to a dataset, from a modeling perspective it might be treated as a subclass of Dataset, with the addition of a mandatory (2..N) 'hasPart' relationship, and properties indicating how the 'granules' in the collection are defined (time, space...).
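
One way to express the subclass-with-cardinality idea is an OWL restriction, sketched below. The class `dcat:DatasetSeries` is the proposed name, and `ex:granuleDimension` is a purely hypothetical property standing in for "how the granules are defined".

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Proposed class: a dataset that must have at least two parts.
dcat:DatasetSeries a owl:Class ;
    rdfs:subClassOf dcat:Dataset ,
        [ a owl:Restriction ;
          owl:onProperty dct:hasPart ;
          owl:minCardinality "2"^^xsd:nonNegativeInteger ] .

# Example instance; ex:granuleDimension is an invented property.
ex:budget a dcat:DatasetSeries ;
    ex:granuleDimension ex:time ;
    dct:hasPart ex:budget-2019, ex:budget-2020 .
```

Note that an OWL cardinality restriction supports inference rather than validation; a publisher wanting to reject series with fewer than two parts would need a separate constraint mechanism.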

@dr-shorthair
Contributor Author

Yes @smrgeoinfo that is my thinking as well.

Richer treatment of relations between resources (esp. datasets) is one of the features that has been added in DCAT2, so we have the platform already.

https://www.w3.org/TR/vocab-dcat-2/#qualified-relationship

@matthiaspalmer

matthiaspalmer commented Sep 18, 2019

A very simple solution is to point to multiple resources by repeating dcat:downloadURL from a single distribution. If needed, additional properties like dct:title, dct:temporal, dct:spatial can be provided on these resources.

We do it like this in EntryScape, allowing people to upload or point to multiple resources.
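
One reading of this suggestion, sketched in Turtle, is shown below. The `ex:` resources and file URLs are hypothetical, and this is not meant to describe how EntryScape actually serialises its data.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

ex:budget a dcat:Dataset ;
    dct:title "Annual budget"@en ;
    dcat:distribution ex:budget-files .

# A single distribution pointing at several downloadable files.
ex:budget-files a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/budget-2019.csv> ,
                     <http://example.org/files/budget-2020.csv> .

# Extra properties are attached to the file resources themselves,
# not to the distribution.
<http://example.org/files/budget-2019.csv>
    dct:title "Budget 2019"@en ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-12-31"^^xsd:date ] .

<http://example.org/files/budget-2020.csv>
    dct:title "Budget 2020"@en ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2020-01-01"^^xsd:date ;
        dcat:endDate   "2020-12-31"^^xsd:date ] .
```

Because each download URL is itself the subject of the title and temporal triples, the per-file metadata stays tied to the right file even though the distribution is shared.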

@makxdekkers
Contributor

@matthiaspalmer If you repeat dcat:downloadURL on a single dcat:Distribution, how do you relate a property like dct:temporal to a particular dcat:downloadURL?

@dr-shorthair
Contributor Author

dr-shorthair commented Jul 1, 2020

@der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'

I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset.
Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?

@andrea-perego
Contributor

@dr-shorthair said:

> @der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'
>
> I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset.
> Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?

GeoDCAT-AP uses the latter approach - see the section on resource types and related example:

## Resource type for series
[] a dcat:Dataset;
  dct:type <http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series> .

@makxdekkers
Contributor

> @makxdekkers is 'Time series' different to 'Evolution'?

That depends on how these things are defined. The way I think about it is something like this:

Time series: a group of datasets that are related along a time dimension, for example a dataset with the budget for 2019 and another dataset with the budget for 2020; so two datasets that contain the same type of data for a different time period

Evolution: a single dataset that is updated 'in situ' over time with additional or modified data, for example a dataset with year-to-date expenditure data; so a single dataset that changes over time

There are cases where you could model data either way; for example, in the case of YTD information, you could publish a snapshot every time it changes as a dataset with timestamp, or add additional data in the same dataset. It's up to the publisher to decide which one fits the needs of the users. I know a case where a YTD is updated in situ but then published as a snapshot every six months.
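
The two notions, plus the mixed case of periodic snapshots, could be sketched in Turtle as follows. All `ex:` resources and dates are hypothetical, and the use of `dct:source` to tie a snapshot to the evolving dataset is just one possible choice.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Time series: separate datasets, same kind of data, different periods.
ex:budget-2019 a dcat:Dataset ;
    dct:title "Budget 2019"@en ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-12-31"^^xsd:date ] .

ex:budget-2020 a dcat:Dataset ;
    dct:title "Budget 2020"@en ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2020-01-01"^^xsd:date ;
        dcat:endDate   "2020-12-31"^^xsd:date ] .

# Evolution: one dataset updated in situ; only its modification date moves.
ex:expenditure-ytd a dcat:Dataset ;
    dct:title "Year-to-date expenditure"@en ;
    dct:modified "2020-06-30"^^xsd:date .

# Mixed case: a snapshot of the evolving dataset published as its own dataset.
ex:expenditure-ytd-2020-h1 a dcat:Dataset ;
    dct:title "Year-to-date expenditure, snapshot June 2020"@en ;
    dct:issued "2020-06-30"^^xsd:date ;
    dct:source ex:expenditure-ytd .
```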

@agreiner
Contributor

agreiner commented Jul 2, 2020

Hm, typical usage in my own circles is that a time series is a dataset that has time as one variable within that one dataset. I would suggest avoiding using the term to talk about a series of datasets, to avoid confusion.

@kcoyle
Contributor

kcoyle commented Jul 5, 2020

Series v Evolution - just to give some support to this, library data recognizes series as:

"Serial: Bibliographic item issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. Includes periodicals; newspapers; annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc., of societies; and numbered monographic series, etc. "

Basically, issued serially over time; a succession of parts or entries or files.

"Integrating resource [kc: terrible name, but like Makx's "evolution"]: Bibliographic resource that is added to or changed by means of updates that do not remain discrete and are integrated into the whole. Examples include updating loose-leafs and updating Web sites. Integrating resources may be finite or continuing."

I think "serial" and "updated" / "integrated" are pretty common patterns. The difficulty is in giving them clear names and definitions. And of course there will be some materials that are a bit of both, and I have no idea how to handle those in a user-friendly way.

@andrea-perego
Contributor

Discussion on this topic is also going on in the framework of DCAT-AP.

The following posts provide a survey on how dataset series (and versions) are dealt with in DCAT-AP extensions:

SEMICeu/DCAT-AP#155 (comment)

SEMICeu/DCAT-AP#155 (comment)

@riccardoAlbertoni
Contributor

riccardoAlbertoni commented Oct 21, 2020

I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to complete the page with other alternative examples and to amend it if I have overlooked any of the key points of the discussion above. I hope that having common examples to reason upon might help to stabilize a solution in the next DCAT call.
see https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

@andrea-perego
Contributor

> I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to complete the page with other alternative examples and to amend it if I have overlooked any of the key points of the discussion above. I hope that having common examples to reason upon might help to stabilize a solution in the next DCAT call.
> see https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

Thanks, @riccardoAlbertoni .

I've added some examples, and made a few editorial changes.

@andrea-perego linked a pull request on Oct 28, 2020 that will close this issue
@agbeltran removed the "future-work" label (issue deferred to the next standardization round) on Oct 30, 2020
@riccardoAlbertoni
Contributor

This issue was automatically closed by the last PR merge.
We need this open, as we want to collect feedback on this issue with the next FPWD.
Am I right @andrea-perego?

@agbeltran
Member

Should this issue also be referenced in the Editors' note stating "The creation of a specific class for dataset series is under discussion.", or should we rather open a specific issue for that discussion?

@riccardoAlbertoni
Contributor

I would suggest creating a new GitHub Issue, in which we can reprise the discussion quoting the views already expressed in the existing GitHub issue.

@andrea-perego
Contributor

@riccardoAlbertoni said:

> This issue was automatically closed by the last PR merge.
> We need this open, as we want to collect feedback on this issue with the next FPWD.
> Am I right @andrea-perego?

I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.

@riccardoAlbertoni
Contributor

> @riccardoAlbertoni said:
>
> This issue was automatically closed by the last PR merge.
> We need this open, as we want to collect feedback on this issue with the next FPWD.
> Am I right @andrea-perego?
>
> I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.

In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

@andrea-perego
Contributor

@riccardoAlbertoni said:

> This issue was automatically closed by the last PR merge.
> We need this open, as we want to collect feedback on this issue with the next FPWD.
> Am I right @andrea-perego?
>
> I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.
>
> In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

I suggest we decide about this during our next call.

@agbeltran added the "due for closing" label (issue that is going to be closed if there are no objection within 6 days) on Nov 11, 2020
@riccardoAlbertoni
Contributor

A section about the Dataset series is included in the DCAT FPWD.
