Distribution definition [RDIDF] #52
Comments
I liked the statement in one of the meetings that all Distributions of a Dataset need to be "informationally equivalent". |
https://www.w3.org/TR/dcat-ucr/#ID18 provides some history: a taxonomy of distribution types was contemplated, but did not make the cut for DCAT v1. Do we need to look at this again? |
In F2F3 @jpullmann suggested that the definition of This clarification is consistent with the introduction of |
The definition of dcat:Distribution has been tweaked slightly so that the word 'representation' appears explicitly, and API and feed is removed (DataService should be used for those applications) -
(Compare to DCAT-2014 where it says "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed") @jpullmann is this enough to mark this issue 'closed'? |
@dr-shorthair There is the issue of backward compatibility. The proposed new definition narrows down the semantics. If people have -- validly -- modelled an API or feed as a |
Thanks @makxdekkers . The narrowed definition of I've proposed some warnings in #299 - see preview https://rawgit.com/w3c/dxwg/dcat-issue52-simon/dcat/index.html#class-distribution |
In addition to the documentation/warnings about backward compatibility, I think we would need to add more guidance on the point made above by @makxdekkers about distributions being informationally equivalent. The current definition does say "...might comprise both different serializations or different schematic arrangements of the same data", but as discussed in #253 this is not necessarily how people have used distributions. |
Yes. Strictly this is a consequence of the more refined description (which is what this topic is about). So if we are happy with the clarification of the definition, then perhaps a WARNING related to #253 is also merited. There are interactions between #299 and #295 here. Which branch should additional warning text be added to? |
Text in draft here https://w3c.github.io/dxwg/dcat/#Class:Distribution appears to resolve this issue? |
Mostly covered by that text. I think we should emphasise that the text now says "arrangements of the same data". I have drafted some text, as NOTE, and included the change in the Changes Summary (Appendix C). See this branch. I think that does resolve the issue, though requirements such as versioning and subsets (yet to be addressed) will likely - IMHO - influence the final text |
Addressed in PR #357 |
I don't disagree with the text here, but I think it worth pointing out that it is a bit paradoxical with respect to what some of us have been asserting with regard to profile negotiation. |
Narrowing the scope, as proposed, breaks backwards compatibility with existing DCAT implementations. Services that support queries against a dataset are never "informationally equivalent". If this restricted view is held, then any distribution that supports accessing a file remotely is by definition not a dcat:Distribution but a dcat:DistributionService. The basic web architecture of allowing an HTTP HEAD request is enough to break information equivalence, and so is content negotiation over language or MIME type. Different formats are not informationally equivalent - for example, a CSV file loses relationships between attributes compared to a format with complex properties: the CSV holds less information because value1 and units1 need further out-of-band information to be related to each other. So - unless you can come up with a robust statement about the testability of information equivalence, it strikes me as a slippery slope with no huge value. OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes treating services as equivalent to distributions logically consistent. |
Further to that - if we know what profiles each distribution and/or service supports, perhaps it's up to the profiles to be described in a way that makes informational equivalence visible - for example, maybe what's really required is an implementation resource to transform one profile into another. Use Cases for reliance on information equivalence would seem to be missing - I think you would really need to find evidence for such a need. |
You are right that CSV can offer less information than JSON, and is particularly likely to do so if there is relational information to be shared, though I would argue that your CSV example shows the relationship between the two values by including them on the same line. Clearly, one can publish informationally equivalent data in both formats, and one can also make the mistake of dropping information when translating from one to the other. One might caution publishers to avoid selecting CSV that drops relationships in any guidance document. One might also caution them against dropping out entire rows from a CSV, but one would not then assume that CSV needs to be treated as a form that is inconsistent in informational content. A little googling shows me two definitions of informational equivalence: (1) that information is equivalent if all the information in one representation can be inferred from the other, and (2) that information is equivalent if the same tasks can be performed with both. I don't claim to be expert in information theory (an MIMS degree notwithstanding), but this doesn't seem an intractable problem. (ref: https://books.google.com/books?id=A8TPF_O385AC&pg=PA66&lpg=PA66&dq=%27informationally+equivalent%27&source=bl&ots=fmVHmOjTXb&sig=sCSaAP1nfL8r-TKebXCNnZUvFyU&hl=en&sa=X&ved=2ahUKEwic5bef1tTdAhXzITQIHU_9C-0Q6AEwAnoECAgQAQ#v=onepage&q='informationally%20equivalent'&f=false). |
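To make definition (1) concrete, here is a minimal sketch in Python, not a proposal for DCAT itself. It treats two serializations as informationally equivalent when both parse to the same set of records. The sample data, the field names `value1` and `units1` (borrowed from the CSV example above), and every function name are illustrative assumptions; a real test would also have to settle questions such as datatype handling and row order.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in two serializations.
CSV_TEXT = "value1,units1\n12.5,kg\n3.0,m\n"
JSON_TEXT = '[{"value1": "12.5", "units1": "kg"}, {"value1": "3.0", "units1": "m"}]'


def records_from_csv(text):
    """Parse CSV text into a list of {column: value} dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json(text):
    """Parse a JSON array of objects, coercing values to strings like CSV."""
    return [{key: str(value) for key, value in record.items()}
            for record in json.loads(text)]


def canonical(records):
    """Order-independent form: sorted records, each a sorted tuple of pairs."""
    return sorted(tuple(sorted(record.items())) for record in records)


def informationally_equivalent(records_a, records_b):
    """Assumed reading of definition (1): each side is recoverable from the other."""
    return canonical(records_a) == canonical(records_b)
```

Under these assumptions, `informationally_equivalent(records_from_csv(CSV_TEXT), records_from_json(JSON_TEXT))` holds, while a CSV that drops a row or the `units1` column would fail the check.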
I can think of several use cases for equivalence of informational content. If two different users wish to avail themselves of data provided from an API, they may each have ingest tools already existing to handle data in different serializations. Neither would want to spend time reworking their tool to handle the other serialization. Another is the issue of reproducibility, comparing data from different analyses to determine whether one should expect them to find similar conclusions. |
Do we have some conflicting perspectives @makxdekkers - I think somewhere you argued that using DCAT 1.0 to catalogue the DCAT-AP and its distribution resources should be validly backwards compatible, but these resources are not informationally equivalent (if we agree either of the defs found by @agreiner are reasonable). I think we would need to formalise the Use Case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases we need to handle where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions, particularly given the nuances of transformation that might exist in different contexts; it would be hard to define a specific model and enforce it for all past DCAT usage. |
Uh Oh, thinking this through a bit more, I'm starting to wonder what the difference between a Distribution and a DistributionService would really be. Both deliver a series of TCP packets that become a file when assembled back on the client's system. Both involve downloading something from a URI. One can build a simple REST API by simply posting json files under URIs that show the relationships between them. A REST API does in fact deliver representations of datasets that are transported as files. Hm. |
I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them. However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters. |
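A rough illustration of the "combinatorially large" point, as a Python sketch under made-up assumptions: the parameter names, their values, and the `https://example.org/api/data` base URL are all hypothetical and do not describe any real service.

```python
from itertools import product
from math import prod

# Hypothetical query parameters for an imaginary data service.
QUERY_PARAMETERS = {
    "format": ["csv", "json", "xml"],
    "year": [str(year) for year in range(2000, 2020)],
    "region": ["north", "south", "east", "west"],
    "variable": ["temperature", "rainfall", "wind"],
}


def count_potential_distributions(parameters):
    """Number of distinct parameter combinations the service could serve."""
    return prod(len(values) for values in parameters.values())


def enumerate_distributions(parameters, base_url="https://example.org/api/data"):
    """Yield one URL per combination, i.e. one would-be Distribution each."""
    names = list(parameters)
    for combination in product(*(parameters[name] for name in names)):
        query = "&".join(f"{name}={value}" for name, value in zip(names, combination))
        yield f"{base_url}?{query}"
```

With only these four small parameters, a catalog that enumerated every combination would need 3 × 20 × 4 × 3 = 720 Distribution entries, where a single service description covers them all.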
Meanwhile, if this conversation is to continue, then maybe in a new issue instead of trailing on in an already closed one? |
good idea. Feel free to make one. |
Conversation moved to #411 |
Distribution definition [RDIDF]
Revise definition of Distribution. Make clearer what a Distribution is and what it is not. Provide better guidance for data publishers.
Related use cases: Relationships between Distributions of a Dataset [ID34]