Distribution definition [RDIDF] #52

jpullmann · 2018-01-18T21:11:39Z

Distribution definition [RDIDF]

Revise definition of Distribution. Make clearer what a Distribution is and what it is not. Provide better guidance for data publishers.

Related use cases: Relationships between Distributions of a Dataset [ID34]

makxdekkers · 2018-01-19T10:08:42Z

I liked the statement in one of the meetings that all Distributions of a Dataset need to be "informationally equivalent".

dr-shorthair · 2018-03-15T23:50:03Z

https://www.w3.org/TR/dcat-ucr/#ID18 provides some history: a taxonomy of distribution types was contemplated, but did not make the cut for DCAT v1. Do we need to look at this again?

dr-shorthair · 2018-06-05T04:11:30Z

In F2F3 @jpullmann suggested that the definition of Distribution be clarified to emphasize that distributions are primarily concerned with representations of the dataset - in the REST sense (also see FRBR 'Manifestation').

This clarification is consistent with the introduction of DataDistributionService as a separate class. See #172

dr-shorthair · 2018-07-09T06:31:14Z

The definition of dcat:Distribution has been tweaked slightly so that the word 'representation' appears explicitly, and API and feed are removed (DataService should be used for those applications):

A specific representation of a dataset. A dataset might be available in several different forms, and these forms might comprise both different serializations or different schematic arrangements of the same data. Examples of distributions include a CSV file, a netCDF file, or a data-cube

(Compare to DCAT-2014 where it says "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed")

@jpullmann is this enough to mark this issue 'closed'?

makxdekkers · 2018-07-09T09:48:57Z

@dr-shorthair There is the issue of backward compatibility. The proposed new definition narrows down the semantics. If people have -- validly -- modelled an API or feed as a dcat:Distribution under the 2014 recommendation, their implementation would no longer conform to the revised recommendation. I do not know how serious this is, but the consequences have to be addressed, maybe as part of a set of warnings to be issued together with the next public review version?
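To make the compatibility concern concrete, here is a minimal sketch in Python of how a 2014-style description might be split under the narrowed definition. It is an illustration only, not a procedure proposed in this thread: the example.org URLs, the `split_services` helper and the publisher-supplied `api_urls` set are all hypothetical, and `dcat:DataService`, `dcat:endpointURL` and `dcat:servesDataset` are assumed from the later DCAT 2 vocabulary rather than from the draft under discussion.

```python
# Hypothetical 2014-style description: an API is listed as a Distribution.
LEGACY_DATASET = {
    "@id": "https://example.org/dataset/1",
    "@type": "dcat:Dataset",
    "dcat:distribution": [
        {"@type": "dcat:Distribution",
         "dcat:downloadURL": "https://example.org/data.csv"},
        {"@type": "dcat:Distribution",
         "dcat:accessURL": "https://example.org/api"},
    ],
}


def split_services(dataset, api_urls):
    """Separate file-like distributions from entries the publisher says are APIs.

    `api_urls` is supplied by the publisher (an assumption of this sketch);
    nothing in the legacy description itself marks an entry as a service.
    """
    kept, services = [], []
    for dist in dataset["dcat:distribution"]:
        url = dist.get("dcat:accessURL")
        if url in api_urls:
            services.append({
                "@type": "dcat:DataService",
                "dcat:endpointURL": url,
                "dcat:servesDataset": {"@id": dataset["@id"]},
            })
        else:
            kept.append(dist)
    # Copy the dataset description, replacing only the distribution list.
    revised = {**dataset, "dcat:distribution": kept}
    return revised, services


revised, services = split_services(LEGACY_DATASET, {"https://example.org/api"})
```

The point of the sketch is that the split needs information the 2014 description does not carry, which is why warnings in the recommendation text matter.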

dr-shorthair · 2018-07-10T02:55:47Z

Thanks @makxdekkers . The narrowed definition of dcat:Distribution is consistent with shifting part of its previous scope to dcat:DataDistributionService, but I agree that more documentation/warnings are required in the recommendation text around this.

I've proposed some warnings in #299 - see preview https://rawgit.com/w3c/dxwg/dcat-issue52-simon/dcat/index.html#class-distribution

agbeltran · 2018-07-12T06:32:42Z

In addition to the documentation/warnings about backward compatibility, I think we would need to add more guidance on the point made above by @makxdekkers about distributions being informationally equivalent. The current definition does say "...might comprise both different serializations or different schematic arrangements of the same data", but as discussed in #253 this is not necessarily how people have used distributions.

dr-shorthair · 2018-07-12T07:07:44Z

Yes. Strictly this is a consequence of the more refined description (which is what this topic is about). So if we are happy with the clarification of the definition, then perhaps a WARNING related to #253 is also merited.

There are interactions between #299 and #295 here. Which branch should additional warning text be added to?

dr-shorthair · 2018-09-17T07:07:50Z

Text in draft here https://w3c.github.io/dxwg/dcat/#Class:Distribution appears to resolve this issue?

davebrowning · 2018-09-18T10:54:52Z

Mostly covered by that text. I think we should emphasise that the text now says "arrangements of the same data". I have drafted some text, as NOTE, and included the change in the Changes Summary (Appendix C). See this branch.

I think that does resolve the issue, though requirements such as versioning and subsets (yet to be addressed) will likely - IMHO - influence the final text

davebrowning · 2018-09-19T09:00:24Z

Addressed in PR #357

agreiner · 2018-09-24T21:25:12Z

I don't disagree with the text here, but I think it worth pointing out that it is a bit paradoxical with respect to what some of us have been asserting with regard to profile negotiation.
"The definition text of dcat:Distribution has been revised to clarify that distributions are primarily representations of datasets. As such, all distributions of a given dataset should be informationally equivalent." Here, it is assumed that representations of a dataset are informationally equivalent, but profile negotiation would return datasets that are not informationally equivalent, because different profiles may include different subsets of the dataset. My preference is to keep distributions informationally equivalent and ask ourselves if there is a way to make it clear that profile negotiation does not deliver informationally equivalent responses.

rob-metalinkage · 2018-09-24T21:55:46Z

Narrowing the scope, as proposed, breaks backwards compatibility with existing DCAT implementations.

Services that support queries against a dataset are never "informationally equivalent".

If this restricted view is held, then any distribution that supports accessing a file remotely is by definition not a dcat:Distribution but a dcat:DistributionService. Just the basic web architecture of allowing an HTTP HEAD request is sufficient to break information equivalence, and content negotiation over language or MIME type also does. Different formats are not informationally equivalent - for example, a CSV file loses relationships between attributes compared to complex properties:

CSV
id,value1,units1,value2,units2
1,2.3,"m/s",6.7,"kg"

vs
JSON
{
  "id": 1,
  "value1": { "value": 2.3, "units": "m/s" },
  "value2": { "value": 6.7, "units": "kg" }
}

CSV holds less information because value1 and units1 need further out-of-band information to be related to each other.
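A minimal Python sketch of that point, using the CSV and JSON records above: the nested form can only be rebuilt from the CSV row once a column pairing is supplied from outside the file. The `UNITS_COLUMN_FOR` mapping and the `csv_row_to_nested` helper are hypothetical names introduced for illustration.

```python
import csv
import io
import json

CSV_TEXT = 'id,value1,units1,value2,units2\n1,2.3,"m/s",6.7,"kg"\n'
JSON_TEXT = (
    '{"id": 1, "value1": {"value": 2.3, "units": "m/s"},'
    ' "value2": {"value": 6.7, "units": "kg"}}'
)

# Out-of-band knowledge: which units column belongs to which value column.
# Nothing inside the CSV file states this pairing.
UNITS_COLUMN_FOR = {"value1": "units1", "value2": "units2"}


def csv_row_to_nested(row, units_column_for):
    """Rebuild the nested record from a flat CSV row, given the pairing."""
    nested = {"id": int(row["id"])}
    for value_col, units_col in units_column_for.items():
        nested[value_col] = {
            "value": float(row[value_col]),
            "units": row[units_col],
        }
    return nested


row = next(csv.DictReader(io.StringIO(CSV_TEXT)))
nested_from_csv = csv_row_to_nested(row, UNITS_COLUMN_FOR)
nested_from_json = json.loads(JSON_TEXT)
```

With the mapping, both routes yield the same record; without it, a consumer has only the column names to guess from.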

So - unless you can come up with a robust statement about testability of information equivalence, it strikes me as a slippery slope with no huge value.

OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes the equivalence of services with distributions logically consistent.

rob-metalinkage · 2018-09-24T22:36:22Z

Further to that - if we know what profiles each distribution and/or service supports, perhaps it's up to the profiles to be described in a way that makes informational equivalence visible - for example, maybe what's really required is an implementation resource to transform one profile into another.

Use Cases for reliance on information equivalence would seem to be missing - I think you would really need to find evidence for such a need.

agreiner · 2018-09-24T22:48:19Z

You are right that CSV can offer less information than JSON, and is particularly likely to do so if there is relational information to be shared, though I would argue that your CSV example shows the relationship between the two values by including them on the same line. Clearly, one can publish informationally equivalent data in both formats, and one can also make the mistake of dropping information when translating from one to the other. One might caution publishers to avoid selecting CSV that drops relationships in any guidance document. One might also caution them against dropping out entire rows from a CSV, but one would not then assume that CSV needs to be treated as a form that is inconsistent in informational content. A little googling shows me two definitions of informational equivalence: (1) that information is equivalent if all the information in one representation can be inferred from the other, and (2) that information is equivalent if the same tasks can be performed with both. I don't claim to be expert in information theory (an MIMS degree notwithstanding), but this doesn't seem an intractable problem. (ref: https://books.google.com/books?id=A8TPF_O385AC&pg=PA66&lpg=PA66&dq=%27informationally+equivalent%27&source=bl&ots=fmVHmOjTXb&sig=sCSaAP1nfL8r-TKebXCNnZUvFyU&hl=en&sa=X&ved=2ahUKEwic5bef1tTdAhXzITQIHU_9C-0Q6AEwAnoECAgQAQ#v=onepage&q='informationally%20equivalent'&f=false).
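As a rough illustration of definition (1), here is a minimal Python sketch that flattens records into facts and reports what one representation carries that the other does not. It assumes both representations have already been decoded into the same nested shape and that every record has an `id` key; the `facts`, `missing_facts` and `informationally_equivalent` helpers are hypothetical, and this is a narrow reading of the definition, not an agreed test.

```python
def facts(record, prefix=()):
    """Flatten a nested record into a set of (path, value) facts."""
    found = set()
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            found |= facts(value, path)
        else:
            found.add((path, value))
    return found


def missing_facts(source_records, target_records):
    """Facts present in the source that cannot be found in the target."""
    # Each path starts with the record id so facts stay tied to their record.
    source = set().union(*(facts(r, (r["id"],)) for r in source_records))
    target = set().union(*(facts(r, (r["id"],)) for r in target_records))
    return source - target


def informationally_equivalent(a, b):
    """True when neither side holds a fact the other lacks."""
    return not missing_facts(a, b) and not missing_facts(b, a)


full = [{"id": 1, "value1": {"value": 2.3, "units": "m/s"}}]
lossy = [{"id": 1, "value1": {"value": 2.3}}]  # units dropped in translation
dropped = missing_facts(full, lossy)
```

Here `dropped` holds the single units fact that the lossy translation omitted, which is the kind of mistake a guidance document could warn publishers about.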

agreiner · 2018-09-24T22:54:19Z

I can think of several use cases for equivalence of informational content. If two different users wish to avail themselves of data provided from an API, they may each have ingest tools already existing to handle data in different serializations. Neither would want to spend time reworking their tool to handle the other serialization. Another is the issue of reproducibility: comparing data from different analyses to determine whether one should expect them to reach similar conclusions.

rob-metalinkage · 2018-09-24T23:20:05Z

Do we have some conflicting perspectives, @makxdekkers? I think somewhere you argued that using DCAT 1.0 to catalogue the DCAT-AP and its distribution resources should be validly backwards compatible, but these resources are not informationally equivalent (if we agree either of the defs found by @agreiner are reasonable).

I think we would need to formalise the Use Case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases we need to handle where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions. Given the nuances of transformation that might exist in different contexts, it would be hard to define a specific model and enforce it for all past DCAT usage.

agreiner · 2018-09-24T23:55:38Z

Uh oh, thinking this through a bit more, I'm starting to wonder what the difference between a Distribution and a DistributionService would really be. Both deliver a series of TCP packets that become a file when assembled back on the client's system. Both involve downloading something from a URI. One can build a simple REST API by simply posting JSON files under URIs that show the relationships between them. A REST API does in fact deliver representations of datasets that are transported as files. Hm.

dr-shorthair · 2018-09-25T00:18:15Z

DataDistributionServices, like instances of OGC's Web Feature Service, do not appear to be the same as a Distribution, at least not to users. WFS accepts a query and responds with a file. These kinds of service have a long history, predating the REST theory.

I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them. However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters.
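A minimal Python sketch of the combinatorial point: even a handful of query parameters yields hundreds of potential distributions. The parameter names and values are invented for illustration and do not describe any real service.

```python
from itertools import product
from math import prod

# Hypothetical query parameters offered by a single service.
QUERY_PARAMETERS = {
    "format": ["csv", "json", "netcdf"],
    "year": list(range(2000, 2020)),
    "region": ["north", "south", "east", "west"],
    "variable": ["temperature", "pressure", "humidity"],
}


def count_potential_distributions(parameters):
    """Number of distinct parameter combinations the service can answer."""
    return prod(len(values) for values in parameters.values())


def enumerate_distributions(parameters):
    """Yield every combination as a dict; a catalog would need one entry each."""
    names = list(parameters)
    for combo in product(*parameters.values()):
        yield dict(zip(names, combo))


total = count_potential_distributions(QUERY_PARAMETERS)
first = next(enumerate_distributions(QUERY_PARAMETERS))
```

With these four parameters `total` is 3 x 20 x 4 x 3 = 720, so describing the service once is far more practical than listing each result as its own distribution.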

dr-shorthair · 2018-09-25T00:19:32Z

Meanwhile, if this conversation is to continue, then maybe in a new issue instead of trailing on in an already closed one?

agreiner · 2018-09-25T00:35:50Z

good idea. Feel free to make one.

dr-shorthair · 2018-09-25T03:32:09Z

Conversation moved to #411

jpullmann added dcat distribution requirement packaging labels Jan 18, 2018

dr-shorthair added the dcat:Distribution label Feb 1, 2018

davebrowning mentioned this issue Feb 7, 2018

Provide guidance in DCAT2 on how to extend Distribution #106

Closed

dr-shorthair mentioned this issue Feb 22, 2018

Distribution service [RDISV] #56

Closed

dr-shorthair mentioned this issue Mar 15, 2018

Add property to link from Distribution -> Dataset (inverse of dcat:distribution) #166

Closed

aisaac removed distribution labels May 29, 2018

dr-shorthair mentioned this issue Jun 12, 2018

Profile negotiation [RPFN] #74

Open

agbeltran mentioned this issue Jun 12, 2018

Best practice for a loosely-structured catalog #253

Closed

dr-shorthair mentioned this issue Jul 10, 2018

Note on reduced scope of dcat:Distribution #299

Merged

davebrowning added a commit that referenced this issue Sep 18, 2018

Reflect status of #52

a206cc8

davebrowning mentioned this issue Sep 18, 2018

Sundry editorial #357

Merged

davebrowning closed this as completed Sep 19, 2018

dr-shorthair mentioned this issue Sep 25, 2018

Distributions, services and implementation-resources #411

Closed

makxdekkers mentioned this issue Mar 2, 2019

Distribution composed of more than one file, but not packaged #482

Closed

makxdekkers mentioned this issue Jan 27, 2020

DCAT: Proposal for an updated definition for the concept “dataset” #1195

Open
