Distribution definition [RDIDF] #52

Closed
jpullmann opened this issue Jan 18, 2018 · 22 comments
Comments

@jpullmann

Distribution definition [RDIDF]

Revise definition of Distribution. Make clearer what a Distribution is and what it is not. Provide better guidance for data publishers.


Related use cases: Relationships between Distributions of a Dataset [ID34] 
@makxdekkers
Contributor

I liked the statement in one of the meetings that all Distributions of a Dataset need to be "informationally equivalent".

@dr-shorthair
Contributor

https://www.w3.org/TR/dcat-ucr/#ID18 provides some history: a taxonomy of distribution types was contemplated, but did not make the cut for DCAT v1. Do we need to look at this again?

@dr-shorthair
Contributor

In F2F3 @jpullmann suggested that the definition of Distribution be clarified to emphasize that distributions are primarily concerned with representations of the dataset - in the REST sense (also see FRBR 'Manifestation').

This clarification is consistent with the introduction of DataDistributionService as a separate class. See #172

@dr-shorthair
Contributor

dr-shorthair commented Jul 9, 2018

The definition of dcat:Distribution has been tweaked slightly so that the word 'representation' appears explicitly, and API and feed are removed (DataService should be used for those applications):

A specific representation of a dataset. A dataset might be available in several different forms, and these forms might comprise both different serializations or different schematic arrangements of the same data. Examples of distributions include a CSV file, a netCDF file, or a data-cube

(Compare to DCAT-2014 where it says "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed")
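As a rough illustration of the narrowed scope (a sketch only, not part of the proposed text), the records below model one dataset with two file distributions and describe an API separately as a service. Plain Python dictionaries stand in for RDF. All example.org URLs are placeholders, and the service terms (dcat:DataService, dcat:endpointURL, dcat:servesDataset) are assumed from the later DCAT 2 vocabulary rather than from this thread, which still uses the draft name DataDistributionService.

```python
# Hypothetical catalog records as plain dicts; every URL is a placeholder.
dataset = {
    "@id": "https://example.org/dataset/wind",
    "@type": "dcat:Dataset",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:mediaType": "text/csv",
            "dcat:downloadURL": "https://example.org/files/wind.csv",
        },
        {
            "@type": "dcat:Distribution",
            "dcat:mediaType": "application/x-netcdf",
            "dcat:downloadURL": "https://example.org/files/wind.nc",
        },
    ],
}

# Under the revised definition an API is described as a service,
# not as one more distribution of the dataset.
service = {
    "@id": "https://example.org/service/wind-api",
    "@type": "dcat:DataService",
    "dcat:endpointURL": "https://example.org/api/wind",
    "dcat:servesDataset": dataset["@id"],
}


def representation_types(ds):
    """Return the sorted media types of the dataset's distributions."""
    return sorted(d["dcat:mediaType"] for d in ds["dcat:distribution"])


print(representation_types(dataset))
```

Under the 2014 wording the API could have appeared as a third entry in the dcat:distribution list; here it is linked to the dataset from the service side instead.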

@jpullmann is this enough to mark this issue 'closed'?

@makxdekkers
Contributor

@dr-shorthair There is the issue of backward compatibility. The proposed new definition narrows down the semantics. If people have -- validly -- modelled an API or feed as a dcat:Distribution under the 2014 recommendation, their implementation would no longer conform to the revised recommendation. I do not know how serious this is, but the consequences have to be addressed, maybe as part of a set of warnings to be issued together with the next public review version?

@dr-shorthair
Contributor

dr-shorthair commented Jul 10, 2018

Thanks @makxdekkers . The narrowed definition of dcat:Distribution is consistent with shifting part of its previous scope to dcat:DataDistributionService, but I agree that more documentation/warnings are required in the recommendation text around this.

I've proposed some warnings in #299 - see preview https://rawgit.com/w3c/dxwg/dcat-issue52-simon/dcat/index.html#class-distribution

@agbeltran
Member

In addition to the documentation/warnings about backward compatibility, I think we would need to add more guidance on the point made above by @makxdekkers about distributions being informationally equivalent. The current definition does say "...might comprise both different serializations or different schematic arrangements of the same data", but as discussed in #253 this is not necessarily how people have used distributions.

@dr-shorthair
Contributor

dr-shorthair commented Jul 12, 2018

Yes. Strictly this is a consequence of the more refined description (which is what this topic is about). So if we are happy with the clarification of the definition, then perhaps a WARNING related to #253 is also merited.

There are interactions between #299 and #295 here. Which branch should additional warning text be added to?

@dr-shorthair
Contributor

@davebrowning
Contributor

davebrowning commented Sep 18, 2018

Mostly covered by that text. I think we should emphasise that the text now says "arrangements of the same data". I have drafted some text, as NOTE, and included the change in the Changes Summary (Appendix C). See this branch.

I think that does resolve the issue, though requirements such as versioning and subsets (yet to be addressed) will likely - IMHO - influence the final text.

@davebrowning
Contributor

Addressed in PR #357

@agreiner
Contributor

I don't disagree with the text here, but I think it worth pointing out that it is a bit paradoxical with respect to what some of us have been asserting with regard to profile negotiation.
"The definition text of dcat:Distribution has been revised to clarify that distributions are primarily representations of datasets. As such, all distributions of a given dataset should be informationally equivalent." Here, it is assumed that representations of a dataset are informationally equivalent, but profile negotiation would return datasets that are not informationally equivalent, because different profiles may include different subsets of the dataset. My preference is to keep distributions informationally equivalent and ask ourselves if there is a way to make it clear that profile negotiation does not deliver informationally equivalent responses.

@rob-metalinkage
Contributor

rob-metalinkage commented Sep 24, 2018

Narrowing the scope, as proposed, breaks backwards compatibility with existing DCAT implementations.

Services that support queries against a dataset are never "informationally equivalent".

If this restricted view is held, then any distribution that supports accessing a file remotely is by definition not a dcat:Distribution but a dcat:DistributionService. Just the basic web architecture of allowing an HTTP HEAD request is sufficient to break information equivalence, and content negotiation over language or MIME type does too. Different formats are not informationally equivalent - for example, a CSV file loses relationships between attributes compared to complex properties:

CSV
id,value1,units1,value2,units2
1,2.3,"m/s",6.7,"kg"

vs
JSON
{
  "id": 1,
  "value1": { "value": 2.3, "units": "m/s" },
  "value2": { "value": 6.7, "units": "kg" }
}

CSV holds less information because value1 and units1 need further out-of-band information to be related to each other.
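To make the out-of-band point concrete, here is a minimal Python sketch that rebuilds the nested structure from a cleaned-up form of the CSV above. The COLUMN_PAIRS mapping and the csv_to_nested helper are hypothetical names introduced for illustration; the mapping (and the choice of numeric types) is exactly the knowledge that the CSV file does not carry itself.

```python
import csv
import io
import json

CSV_TEXT = 'id,value1,units1,value2,units2\n1,2.3,"m/s",6.7,"kg"\n'
JSON_TEXT = (
    '{"id": 1, "value1": {"value": 2.3, "units": "m/s"},'
    ' "value2": {"value": 6.7, "units": "kg"}}'
)

# Out-of-band knowledge: which value column belongs to which units column.
# Nothing in the CSV header states this; it is an assumption supplied here.
COLUMN_PAIRS = {"value1": "units1", "value2": "units2"}


def csv_to_nested(text, pairs):
    """Rebuild nested records from flat CSV rows using the supplied pairing."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # The numeric types are also supplied from outside; CSV only has strings.
        record = {"id": int(row["id"])}
        for value_col, units_col in pairs.items():
            record[value_col] = {
                "value": float(row[value_col]),
                "units": row[units_col],
            }
        records.append(record)
    return records


nested = csv_to_nested(CSV_TEXT, COLUMN_PAIRS)
print(nested[0] == json.loads(JSON_TEXT))
```

With the mapping supplied, the CSV row converts to the same structure as the JSON; without it, a consumer can only guess the pairing from the column names.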

So - unless you can come up with a robust statement about testability of information equivalence, it strikes me as a slippery slope with no huge value.

OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes treating services as equivalent to distributions logically consistent.

@rob-metalinkage
Contributor

Further to that - if we know what profiles each distribution and/or service supports, perhaps it's up to the profiles to be described in a way that makes informational equivalence visible - for example, maybe what's really required is an implementation resource to transform one profile into another.

Use Cases for reliance on information equivalence would seem to be missing - I think you would really need to find evidence for such a need.

@agreiner
Contributor

You are right that CSV can offer less information than JSON, and is particularly likely to do so if there is relational information to be shared, though I would argue that your CSV example shows the relationship between the two values by including them on the same line. Clearly, one can publish informationally equivalent data in both formats, and one can also make the mistake of dropping information when translating from one to the other. One might caution publishers to avoid selecting CSV that drops relationships in any guidance document. One might also caution them against dropping out entire rows from a CSV, but one would not then assume that CSV needs to be treated as a form that is inconsistent in informational content. A little googling shows me two definitions of informational equivalence: (1) that information is equivalent if all the information in one representation can be inferred from the other, and (2) that information is equivalent if the same tasks can be performed with both. I don't claim to be expert in information theory (an MIMS degree notwithstanding), but this doesn't seem an intractable problem. (ref: https://books.google.com/books?id=A8TPF_O385AC&pg=PA66&lpg=PA66&dq=%27informationally+equivalent%27&source=bl&ots=fmVHmOjTXb&sig=sCSaAP1nfL8r-TKebXCNnZUvFyU&hl=en&sa=X&ved=2ahUKEwic5bef1tTdAhXzITQIHU_9C-0Q6AEwAnoECAgQAQ#v=onepage&q='informationally%20equivalent'&f=false).
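Definition (1) can be read as a round-trip test: two representations are equivalent if each can be derived from the other. The Python sketch below takes that reading, which is one interpretation and not an agreed test. The flatten and nest converters are hypothetical and rely on a naming convention (valueN paired with unitsN) that is itself an assumption.

```python
def flatten(record):
    """Nested record -> flat row: value1 {value, units} becomes value1/units1."""
    row = {"id": record["id"]}
    for key, field in record.items():
        if key == "id":
            continue
        suffix = key[len("value"):]
        row["value" + suffix] = field["value"]
        row["units" + suffix] = field["units"]
    return row


def nest(row):
    """Flat row -> nested record, pairing valueN with unitsN by name."""
    record = {"id": row["id"]}
    for key in row:
        if key.startswith("value"):
            suffix = key[len("value"):]
            record[key] = {"value": row[key], "units": row["units" + suffix]}
    return record


def informationally_equivalent(a, b, a_to_b, b_to_a):
    """Round-trip reading of definition (1): each side is derivable from the other."""
    return a_to_b(a) == b and b_to_a(b) == a


nested = {
    "id": 1,
    "value1": {"value": 2.3, "units": "m/s"},
    "value2": {"value": 6.7, "units": "kg"},
}
flat = {"id": 1, "value1": 2.3, "units1": "m/s", "value2": 6.7, "units2": "kg"}
lossy = {"id": 1, "value1": 2.3, "value2": 6.7}  # units dropped in translation

print(informationally_equivalent(nested, flat, flatten, nest))
print(informationally_equivalent(nested, lossy, flatten, nest))
```

The faithful flat form passes and the form that dropped the units fails, which matches the distinction between a format that can carry the information and a translation that loses it. The test is only as good as the converters, so it checks equivalence relative to an agreed mapping, not in the abstract.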

@agreiner
Contributor

I can think of several use cases for equivalence of informational content. If two different users wish to avail themselves of data provided from an API, they may each have ingest tools already existing to handle data in different serializations. Neither would want to spend time reworking their tool to handle the other serialization. Another is the issue of reproducibility, comparing data from different analyses to determine whether one should expect them to find similar conclusions.

@rob-metalinkage
Contributor

Do we have some conflicting perspectives, @makxdekkers? I think somewhere you argued that using DCAT 1.0 to catalogue the DCAT-AP and its distribution resources should be validly backwards compatible, but these resources are not informationally equivalent (if we agree that either of the definitions found by @agreiner is reasonable).

I think we would need to formalise the Use Case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases we need to handle where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions. Given the nuances of transformation that might exist in different contexts, it would be hard to define a specific model and enforce it for all past DCAT usage.

@agreiner
Contributor

Uh oh, thinking this through a bit more, I'm starting to wonder what the difference between a Distribution and a DistributionService would really be. Both deliver a series of TCP packets that become a file when assembled back on the client's system. Both involve downloading something from a URI. One can build a simple REST API by simply posting JSON files under URIs that show the relationships between them. A REST API does in fact deliver representations of datasets that are transported as files. Hm.

@dr-shorthair
Contributor

DataDistributionServices, like instances of OGC's Web Feature Service, do not appear to be the same as a Distribution, at least not to users. WFS accepts a query and responds with a file. These kinds of service have a long history, predating the REST theory.

I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them. However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters.
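A small Python sketch of the combinatorial point, under invented assumptions: the parameter names and values below are hypothetical and are not taken from WFS or any real service. Even this tiny parameter space yields hundreds of distinct responses that a catalog would otherwise have to list one by one.

```python
from itertools import product
from math import prod

# Hypothetical query parameters for an imagined service (illustration only).
PARAMETERS = {
    "feature_type": ["roads", "rivers", "buildings"],
    "format": ["gml", "geojson", "csv"],
    "crs": ["EPSG:4326", "EPSG:3857"],
    "year": list(range(2000, 2018)),
}


def count_potential_distributions(parameters):
    """Number of distinct parameter combinations the service could answer."""
    return prod(len(values) for values in parameters.values())


def enumerate_requests(parameters):
    """Yield every parameter combination as a dict (what a catalog would list)."""
    names = list(parameters)
    for combo in product(*parameters.values()):
        yield dict(zip(names, combo))


total = count_potential_distributions(PARAMETERS)
first = next(enumerate_requests(PARAMETERS))
print(total, first)
```

Adding a continuous parameter such as a bounding box would make the set effectively unbounded, which is why one 'service' record is the more practical description.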

@dr-shorthair
Contributor

Meanwhile, if this conversation is to continue, then maybe in a new issue instead of trailing on in an already closed one?

@agreiner
Contributor

Good idea. Feel free to make one.

@dr-shorthair
Contributor

Conversation moved to #411


8 participants