Dataset series #868

dr-shorthair · 2019-03-31T22:26:16Z

A significant outstanding issue that keeps coming up as a sideline to other conversations [1][2] is the need for a recommended pattern for cataloguing dataset series. Budget data, satellite imagery, ...

Usually most of the description (metadata) is the same, but the temporal coverage changes between members of a series.

[1] #789
[2] #806

nicholascar · 2019-04-02T08:19:10Z

Or the spatial coverage changes, as is the case with maps in a series

davebrowning · 2019-04-02T14:18:36Z

Fully agree that there would be great value in some story, though I'm a little unsure that there is a single pattern that can be recommended. Worth trying, anyway, now we seem to have a stable definition of the qualities of dcat:Distribution. In the spirit of starting a conversation:

The spatial series seems a good entry point here: in that case would each map be a dcat:Dataset with its own dcat:Distributions? [DCAT-rev definition of distributions means that you would have to do this, since the content of different maps is different.] If we just link these (with some variant of dct:relation or dcat:qualifiedRelation) then we have the connections but no holder for metadata common across the series. We could have a parent dataset (trying hard not to call it an atlas) which has the maps as constituent parts, which would hint at common metadata and use something like prov:specializationOf to identify the members of the series, though some kind of generic (cataloguable) container type would work too, I think. Or is all that too simplistic?

For the temporal case, we might be able to do the same kind of pattern, but I suspect that there are naturally more things at play there, or possibly other use cases. Versioning - or more specifically change through the progress of time - seems to me to become part of the picture very quickly, whereas for spatial and other kinds of series it would be a bit more orthogonal. Perhaps the temporal kind of series is better looked at via some kind of service paradigm. That might all just be my narrow/limited world view, though. 😄

makxdekkers · 2019-04-02T14:37:53Z

As far as I am concerned, I don't think we're going to resolve this issue before finalising the CR. There are several dimensions to this, and I've been involved in long discussions that did not lead to a resolution after many months. Maybe we should move it to the list for version 3, and set up a sprint for it after publication of the CR?

davebrowning · 2019-04-02T14:40:46Z

+1 to Makx's timing suggestion - I was assuming so, actually. I think there is some value in having such things in the Future Priority backlog - it may solicit further input or use cases.

dr-shorthair · 2019-04-02T21:12:12Z

Yes, sorry, I did not make it clear at the top of the issue that I intended this for the backlog. Definitely too late to do anything reasonable for DCAT 2. I just wanted to have a discrete issue to track.

dr-shorthair · 2019-06-07T01:14:44Z

The Interest Group on Data Discovery Paradigms of the Research Data Alliance has a task force on Data/metadata granularity, which is considering a taxonomy of data aggregation. @andrea-perego and @dr-shorthair are in contact and will likely bring more detailed requirements that can form the basis of developing DCAT patterns for this.

der · 2019-08-05T14:03:46Z

Sorry to be late with this comment, haven't been following this work.

As an outsider to this WG, can I reinforce the importance of this issue? This has been, and continues to be, a substantial pain point in our attempts to use DCAT. Sad to hear that it won't be addressed for DCAT 2.

In our experience with public sector datasets it is relatively rare for a dataset to be a unitary thing which can be downloaded in its entirety. More typically the non-realtime datasets we see comprise a series of updates (annual, quarterly, monthly etc., as determined by some release cycle). Where possible we provide data services and dumps for the whole series. However, both users and publishers want to see the series of updates explicitly, as individual elements they can download separately, while still regarding the collection of those updates as a single dataset with common metadata. They want the data, and the presentation of it, to reflect that.

Possible approaches to this include:

1. Model each such dataset as a dcat:Catalog which then references each update as a separate dcat:Dataset with its own distribution, but put all the common metadata on the dcat:Catalog. This could work but then it is hard for a generic client to tell the difference between this use of Catalog and the "normal" uses of dcat:Catalog as (possibly hierarchical) collections of heterogeneous datasets. It's also hard to then point to a Distribution for the whole dataset. It would be possible to support this pattern through a marker subclass of catalog (dcat:DatasetSeries or some such).
2. Use dcat:Dataset for the series but allow a dataset to have multiple partial distributions, each with a separate temporal/spatial/other extent. This could work but the existing text and UML don't encourage per-Distribution extents and imply that a Distribution covers a whole dataset. Furthermore, if you have different formats available for each update then the relation between the different partial Distributions would be obscure.
3. Introduce a separate notion of a partition or element of a dataset which can have its own extent information and its own Distribution(s). This is the route we've used up to now and it works fine within our own systems, but it means that an external client expecting DCAT can't see the individual elements in the series. Sadly this is usually the grain size a harvester actually wants to see.

Even if you can't recommend a specific pattern for DCAT 2 would you be able to give some indication of the likely direction of travel (as a guide to those of us who need to work around the limitation in the meantime)?

makxdekkers · 2019-08-05T15:12:27Z

@der My opinion is that, if we want to support dataset series in DCAT -- and I agree it is a very common request -- the best approach might be to define a new class (subclass of dcat:Resource), e.g. dcat:DatasetSeries. The related datasets could then be linked using dct:hasPart and dct:isPartOf, or similar properties.
The advantage is that we would not be 'repurposing' existing classes, which carries the risk that 'legacy' DCAT implementations might not understand what is going on. A new class would allow us to explicitly describe the behaviour of the series and the datasets in it. For example, if the behaviour were to include a certain 'inheritance' of 'common metadata' from the series to the individual datasets, that could be made clear in the description of the new class.
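
A minimal sketch of this proposal in Python, using plain tuples as triples rather than an RDF library. The class name dcat:DatasetSeries is the one floated above and is only a proposal at this point in the thread; the ex: URIs, the COMMON property set and the effective_metadata helper are all hypothetical. The 'inheritance' rule shown (a member's own values win, missing ones are filled from the series) is just one possible reading, not agreed behaviour.

```python
# Triples are (subject, predicate, object) tuples; all ex: URIs are made up.
SERIES = "ex:budget"
TRIPLES = [
    (SERIES, "rdf:type", "dcat:DatasetSeries"),  # proposed class, hypothetical
    (SERIES, "dct:title", "National budget"),
    (SERIES, "dct:publisher", "ex:treasury"),
    (SERIES, "dct:hasPart", "ex:budget-2019"),
    (SERIES, "dct:hasPart", "ex:budget-2020"),
    ("ex:budget-2019", "rdf:type", "dcat:Dataset"),
    ("ex:budget-2019", "dct:isPartOf", SERIES),
    ("ex:budget-2019", "dct:temporal", "2019"),
    ("ex:budget-2020", "rdf:type", "dcat:Dataset"),
    ("ex:budget-2020", "dct:isPartOf", SERIES),
    ("ex:budget-2020", "dct:temporal", "2020"),
    ("ex:budget-2020", "dct:title", "National budget 2020"),
]

# Assumed set of properties a member may inherit from its series.
COMMON = {"dct:title", "dct:publisher"}


def properties_of(subject, triples):
    """Collect predicate -> list of objects for one subject."""
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)
    return props


def effective_metadata(member, triples):
    """Member metadata, filled in from the series it is dct:isPartOf."""
    own = properties_of(member, triples)
    merged = dict(own)
    for series in own.get("dct:isPartOf", []):
        for p, values in properties_of(series, triples).items():
            # Only inherit whitelisted properties the member does not set itself.
            if p in COMMON and p not in merged:
                merged[p] = values
    return merged
```

Here effective_metadata("ex:budget-2019", TRIPLES) picks up the series title and publisher, while ex:budget-2020 keeps its own title. A legacy client that ignores the new class still sees ordinary dcat:Dataset descriptions linked by dct:isPartOf.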

der · 2019-08-05T16:00:25Z

Thanks @makxdekkers. That would work for me.

In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset. That way we could give a distribution (and extent) for the overall aggregate as well as distributions for the individual elements within the series. It would also give us a transition plan - publish resources now as dcat:Datasets (with the dct:hasPart and dct:isPartOf relationships to other dcat:Datasets for the elements) and then add declarations for rdf:type dcat:DatasetSeries when/if that becomes available with compatible semantics.

Do you think there's a chance of squeezing a non-normative indication of this as a possible future pattern into the doc? Or at least a comment that use of dct:hasPart/isPartOf on datasets is in principle legal? Not sure how close to CR you are so appreciate this might be too late.

I mention this because https://www.w3.org/TR/vocab-dcat-2/#Property:catalog_has_part implies that you have domain/range declarations for dct:hasPart, which would militate against this pattern. I'm assuming this is just a confusing presentation and that what you actually have are owl:allValuesFrom restrictions, so it is not a problem.
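
To illustrate the concern, here is a rough Python sketch of the difference between a global rdfs:domain/rdfs:range declaration and a class-local owl:allValuesFrom restriction. The declarations passed in are hypothetical stand-ins, not a transcription of the DCAT 2 ontology, and the function names and ex: URIs are invented. Only the two entailment rules needed for the comparison are implemented.

```python
# A catalog and a dataset that both use dct:hasPart (URIs are made up).
DATA = [
    ("ex:cat", "rdf:type", "dcat:Catalog"),
    ("ex:cat", "dct:hasPart", "ex:sub-catalog"),
    ("ex:series", "rdf:type", "dcat:Dataset"),
    ("ex:series", "dct:hasPart", "ex:update-2019"),
]


def infer_with_domain_range(data, prop, domain, range_):
    """Global reading: every use of prop types its subject and its object."""
    inferred = set()
    for s, p, o in data:
        if p == prop:
            inferred.add((s, "rdf:type", domain))
            inferred.add((o, "rdf:type", range_))
    return inferred - set(data)


def infer_with_all_values_from(data, restricted_class, prop, filler):
    """Local reading: only objects reached from instances of restricted_class
    are typed; subjects of other classes are left alone."""
    instances = {s for s, p, o in data if p == "rdf:type" and o == restricted_class}
    inferred = set()
    for s, p, o in data:
        if p == prop and s in instances:
            inferred.add((o, "rdf:type", filler))
    return inferred - set(data)
```

Under the global reading, calling infer_with_domain_range(DATA, "dct:hasPart", "dcat:Catalog", "dcat:Resource") concludes that ex:series is a dcat:Catalog merely because it uses dct:hasPart. Under the restriction reading, only ex:sub-catalog gains a type and the dataset-with-parts pattern is unaffected.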

makxdekkers · 2019-08-05T16:17:58Z

@der I was just expressing my personal opinion, and the group might not agree.
In any case, I think we need to discuss this in more detail before making some sort of statement for the future -- it would not be good to make such a statement now and then to backtrack on it...

smrgeoinfo · 2019-08-05T18:02:45Z

For future reference (v3?), I agree DatasetSeries should be a separate subclass of dcat:Resource. Noting that a series would have all the properties that are specific to a dataset, from a modeling perspective it might be treated as a subclass of Dataset, with the addition of a mandatory (2..N) 'hasPart' relationship, and properties indicating how the 'granules' in the collection are defined (time, space...).

dr-shorthair · 2019-08-05T20:37:16Z

Yes @smrgeoinfo that is my thinking as well.

Richer treatment of relations between resources (esp. datasets) is one of the features that has been added in DCAT2, so we have the platform already.

https://www.w3.org/TR/vocab-dcat-2/#qualified-relationship

matthiaspalmer · 2019-09-18T16:47:48Z

A very simple solution is to point to multiple resources by repeating dcat:downloadURL on a single distribution. If needed, additional properties like dct:title, dct:temporal and dct:spatial can be provided on these resources.

We do it like this in EntryScape, allowing people to upload or point to multiple resources.

makxdekkers · 2019-09-18T17:05:45Z

@matthiaspalmer If you repeat dcat:downloadURL on a single dcat:Distribution, how do you relate a property like dct:temporal to a particular dcat:downloadURL?
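
The ambiguity behind this question can be sketched in Python with triples as plain tuples. All URIs are hypothetical. The second variant follows the suggestion above of stating dct:temporal about the downloaded resources themselves; whether EntryScape models it exactly this way is an assumption here.

```python
# Variant 1: both properties repeated on the distribution itself.
ON_DISTRIBUTION = [
    ("ex:dist", "dcat:downloadURL", "ex:file-2019.csv"),
    ("ex:dist", "dcat:downloadURL", "ex:file-2020.csv"),
    ("ex:dist", "dct:temporal", "2019"),
    ("ex:dist", "dct:temporal", "2020"),
]

# Variant 2: the extent is stated about each downloadable resource.
ON_RESOURCE = [
    ("ex:dist", "dcat:downloadURL", "ex:file-2019.csv"),
    ("ex:dist", "dcat:downloadURL", "ex:file-2020.csv"),
    ("ex:file-2019.csv", "dct:temporal", "2019"),
    ("ex:file-2020.csv", "dct:temporal", "2020"),
]


def objects(triples, subject, predicate):
    """All objects for a given subject and predicate, sorted for stable output."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def temporal_by_url(triples, dist):
    """Map each download URL to the dct:temporal values stated about that URL."""
    return {
        url: objects(triples, url, "dct:temporal")
        for url in objects(triples, dist, "dcat:downloadURL")
    }
```

With ON_DISTRIBUTION the mapping comes back empty for every URL: the two extents hang off ex:dist, and since a set of triples has no order there is nothing to pair "2019" with a particular file. With ON_RESOURCE each URL carries its own extent.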

dr-shorthair · 2020-07-01T21:56:25Z

@der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'

I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset.
Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?

andrea-perego · 2020-07-01T22:17:36Z

@dr-shorthair said:

> > @der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'
>
> I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset.
> Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?

GeoDCAT-AP uses the latter approach - see the section on resource types and related example:

## Resource type for series
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

[] a dcat:Dataset ;
  dct:type <http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series> .

makxdekkers · 2020-07-02T08:23:17Z

> @makxdekkers is 'Time series' different to 'Evolution'?

That depends on how these things are defined. The way I think about it is something like this:

Time series: a group of datasets that are related along a time dimension, for example a dataset with the budget for 2019 and another dataset with the budget for 2020; so two datasets that contain the same type of data for a different time period

Evolution: a single dataset that is updated 'in situ' over time with additional or modified data, for example a dataset with year-to-date expenditure data; so a single dataset that changes over time

There are cases where you could model data either way; for example, in the case of YTD information, you could publish a snapshot every time it changes as a dataset with timestamp, or add additional data in the same dataset. It's up to the publisher to decide which one fits the needs of the users. I know a case where a YTD is updated in situ but then published as a snapshot every six months.
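
The distinction, and the mixed YTD case, can be sketched in Python as follows. This is only an illustration under stated assumptions: the Dataset and EvolvingDataset classes, their fields and the identifiers are all hypothetical and are not DCAT terms.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a catalogued dataset."""
    identifier: str
    period: str                      # temporal coverage label
    rows: list = field(default_factory=list)


@dataclass
class EvolvingDataset(Dataset):
    """Evolution: a single dataset that is updated in situ."""

    def update(self, new_rows, period):
        """Add data in place and widen the temporal coverage."""
        self.rows.extend(new_rows)
        self.period = period

    def snapshot(self, label):
        """Copy the current state into a separate, fixed dataset."""
        return Dataset(f"{self.identifier}-{label}", self.period, list(self.rows))


# Time series: distinct datasets, same kind of data, different periods.
budget_series = [Dataset("budget-2019", "2019"), Dataset("budget-2020", "2020")]

# Evolution with a periodic snapshot, as in the YTD example.
ytd = EvolvingDataset("expenditure-ytd", "2020-01/2020-01")
ytd.update([100], "2020-01/2020-06")
first_half = ytd.snapshot("2020-H1")
ytd.update([250], "2020-01/2020-09")
```

After the second update, ytd has moved on while first_half still holds the mid-year state. A sequence of such snapshots is itself a time series in the first sense, which is why the same data can be modelled either way.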

agreiner · 2020-07-02T22:54:03Z

Hm, typical usage in my own circles is that a time series is a dataset that has time as one variable within that one dataset. I would suggest avoiding using the term to talk about a series of datasets, to avoid confusion.

kcoyle · 2020-07-05T17:32:08Z

Series v Evolution - just to give some support to this, library data recognizes series as:

"Serial: Bibliographic item issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. Includes periodicals; newspapers; annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc., of societies; and numbered monographic series, etc. "

Basically, issued serially over time; a succession of parts or entries or files.

"Integrating resource [kc: terrible name, but like Makx's "evolution"]: Bibliographic resource that is added to or changed by means of updates that do not remain discrete and are integrated into the whole. Examples include updating loose-leafs and updating Web sites. Integrating resources may be finite or continuing."

I think "serial" and "updated" / "integrated" are pretty common patterns. The difficulty is in giving them clear names and definitions. And of course there will be some materials that are a bit of both, and I have no idea how to handle those in a user-friendly way.

andrea-perego · 2020-10-19T11:20:22Z

Discussion on this topic is also going on in the framework of DCAT-AP.

The following posts provide a survey on how dataset series (and versions) are dealt with in DCAT-AP extensions:

SEMICeu/DCAT-AP#155 (comment)

riccardoAlbertoni · 2020-10-21T10:54:51Z

I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to add alternative examples to the page, or to amend it if I have overlooked any key points of the discussion above. I hope that having common examples to reason about will help us settle on a solution in the next DCAT call.
See https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

andrea-perego · 2020-10-23T18:57:24Z

> I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to add alternative examples to the page, or to amend it if I have overlooked any key points of the discussion above. I hope that having common examples to reason about will help us settle on a solution in the next DCAT call.
> See https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

Thanks, @riccardoAlbertoni .

I've added some examples, and made a few editorial changes.

riccardoAlbertoni · 2020-11-05T17:55:27Z

This issue was automatically closed by the last PR merge.
We need this open, as we want to collect feedback on this issue with the next FPWD.
Am I right @andrea-perego?

agbeltran · 2020-11-05T19:30:25Z

Should this issue also be referenced in the Editors' note stating "The creation of a specific class for dataset series is under discussion.", or should we rather open a specific issue for that discussion?

riccardoAlbertoni · 2020-11-05T20:11:52Z

I would suggest creating a new GitHub Issue, in which we can reprise the discussion quoting the views already expressed in the existing GitHub issue.

andrea-perego · 2020-11-05T21:13:33Z

@riccardoAlbertoni said:

> This issue was automatically closed by the last PR merge.
> We need this open, as we want to collect feedback on this issue with the next FPWD.
> Am I right @andrea-perego?

I'm actually more in favour of closing this one, and creating a new issue once feedback has been submitted.

riccardoAlbertoni · 2020-11-09T14:39:31Z

> @riccardoAlbertoni said:
>
> > This issue was automatically closed by the last PR merge.
> > We need this open, as we want to collect feedback on this issue with the next FPWD.
> > Am I right @andrea-perego?
>
> I'm actually more in favour of closing this one, and creating a new issue once feedback has been submitted.

In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

andrea-perego · 2020-11-09T15:33:44Z

@riccardoAlbertoni said:

> > > This issue was automatically closed by the last PR merge.
> > > We need this open, as we want to collect feedback on this issue with the next FPWD.
> > > Am I right @andrea-perego?
> >
> > I'm actually more in favour of closing this one, and creating a new issue once feedback has been submitted.
>
> In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

I suggest we decide about this during our next call.

riccardoAlbertoni · 2020-11-11T23:10:12Z

A section about the Dataset series is included in the DCAT FPWD.

dr-shorthair added dcat dcat:Dataset dct:temporal labels Mar 31, 2019

dr-shorthair assigned makxdekkers Mar 31, 2019

dr-shorthair assigned nicholascar, davebrowning and andrea-perego Apr 2, 2019

makxdekkers closed this as completed Apr 2, 2019

makxdekkers reopened this Apr 2, 2019

dr-shorthair added this to the DCAT Future Priority Work milestone Apr 2, 2019

dr-shorthair added the dct:spatial label Apr 2, 2019

w3c deleted a comment from makxdekkers Apr 2, 2019

jakubklimek mentioned this issue Apr 3, 2019

Datové série datagov-cz/nkod#3

Closed

oystein-asnes mentioned this issue May 3, 2019

[behov] Beskrive tidsserier og samlinger av datasett Informasjonsforvaltning/dcat-ap-no#21

Closed

pwin mentioned this issue Aug 7, 2019

Schemas for Tabular Data Challenge co-cddo/open-standards#40

Closed

4 tasks

andrea-perego mentioned this issue Sep 13, 2019

DCAT: Updates to ack section #1069

Closed

aidig mentioned this issue Aug 7, 2020

Need for a common approach to modeling dataset series in DCAT-AP SEMICeu/DCAT-AP#155

Closed

riccardoAlbertoni added the qualified relations Issues somehow related to qualified relationships label Sep 30, 2020

andrea-perego added the dataset-series label Sep 30, 2020

andrea-perego mentioned this issue Oct 16, 2020

Possible relevance of FRBR for versioning #1251

Closed

andrea-perego mentioned this issue Oct 23, 2020

Examples for dataset series #806

Closed

riccardoAlbertoni mentioned this issue Oct 27, 2020

Add draft section for dataset series #1262

Merged

andrea-perego linked a pull request Oct 28, 2020 that will close this issue

Add draft section for dataset series #1262

Merged

agbeltran removed the future-work issue deferred to the next standardization round label Oct 30, 2020

riccardoAlbertoni closed this as completed in #1262 Nov 4, 2020

riccardoAlbertoni reopened this Nov 5, 2020

agbeltran added the due for closing Issue that is going to be closed if there are no objection within 6 days label Nov 11, 2020

riccardoAlbertoni mentioned this issue Nov 11, 2020

A separate class for DatasetSeries (?) #1272

Closed

riccardoAlbertoni closed this as completed Nov 11, 2020

matthiaspalmer mentioned this issue Dec 6, 2021

Model Series of Data as Distributions of a single Dataset #1429

Closed

matthiaspalmer mentioned this issue Feb 2, 2023

DatasetSeries SEMICeu/DCAT-AP#240

Closed
