Distributions, services and implementation-resources #411
Content copied over from #52:
agreiner commented 6 hours ago @rob-metalinkage Services that support queries against a dataset are never "informationally equivalent". If this restricted view is held, then any distribution that supports accessing a file remotely is by definition not a dcat:Distribution but a dcat:DistributionService. Just the basic web architecture of allowing an HTTP HEAD request is sufficient to break information equivalence, and content negotiation over language or MIME type also does. Different formats are not informationally equivalent - for example a CSV file loses relationships between attributes compared to complex properties: CSV vs CSV holds less information because value1 and units1 need further out-of-band information to be related to each other. So - unless you can come up with a robust statement about testability of information equivalence, it strikes me as a slippery slope with no huge value. OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes services equivalent with distributions logically consistent @rob-metalinkage Use Cases for reliance on information equivalence would seem to be missing - I think really you would need to find evidence for such a need. @agreiner @agreiner @rob-metalinkage I think we would need to formalise the Use Case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases we need to handle where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions, particularly given the nuances of transformation that might exist in different contexts; it would be hard to define a specific model and enforce it for all past DCAT usage. @agreiner
@dr-shorthair I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them.
However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters.
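To put a rough number on "combinatorially large", here is a small Python sketch. The parameter names and values are hypothetical and not taken from any real service; it only counts how many separate distributions a catalog would need to list if every combination of query parameters were enumerated.

```python
from itertools import product
from math import prod

# Hypothetical query parameters for an imaginary data service.
# Names and values are illustrative assumptions only.
query_parameters = {
    "variable": ["temperature", "salinity", "pressure"],
    "year": list(range(2000, 2020)),
    "region": ["north", "south", "east", "west"],
    "format": ["csv", "netcdf", "json"],
}


def count_potential_distributions(parameters):
    """Number of distinct combinations, picking one value per parameter."""
    return prod(len(values) for values in parameters.values())


def enumerate_distributions(parameters):
    """Yield every combination as a dict: what a catalog would have to list."""
    names = list(parameters)
    for combination in product(*(parameters[name] for name in names)):
        yield dict(zip(names, combination))


total = count_potential_distributions(query_parameters)
print(total)  # 3 * 20 * 4 * 3 = 720
```

Even with four small parameters the listing runs to hundreds of entries, which is the pragmatic argument for describing the service once rather than each potential distribution.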
This is a conversation that's been lingering on the edges for a while, so it's really good that it's surfaced. If we ignore services for a moment and just consider the DCAT2014 style of use, having distributions that aren't equivalent immediately raises the question of how a consumer chooses between them. It's true that some formats make it easier to express certain characteristics than others, so there is a challenge for the publisher to be very careful here about what dataset (i.e. information) is really being published. In at least one of the uses we have internally (not referenceable at this point, unfortunately), we've been looking at using dataset subset relations as a way through this - I'll see if I can provide a coherent use case for this. Once you introduce services that go beyond "download the whole thing", we've looked at that as a dynamic subset - in effect a slice of some underlying distribution which varies according to the needs of the consumer. That allows the publisher to describe the whole dataset, which might be downloadable in one form or shareable via a cloud-based bucket in another (for example), as well as providing access services on the same information via some interface. If the consumer uses a selection interface then they just get a subset, allowing the consumer to trade off completeness against ease of use (ideally). The consumer can rely on all access paths giving access to the same information - the selection control is in the hands of the consumer. But to agree with @rob-metalinkage's point, we do need some documented use cases for this. Unfortunately our use of DCAT hasn't filtered through to our available/public services quite yet, but I'll see what I can do.
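The "dynamic subset" idea can be sketched in a few lines of Python. Everything here is a made-up illustration (the records, `download_all`, `select`, and `is_subset` are hypothetical names, not part of DCAT or any service API): one function stands in for the bulk download, another for a selection interface over the same underlying information.

```python
# Hypothetical records standing in for "the whole dataset".
DATASET = [
    {"station": "A", "year": 2017, "value": 1.5},
    {"station": "A", "year": 2018, "value": 1.7},
    {"station": "B", "year": 2017, "value": 2.1},
    {"station": "B", "year": 2018, "value": 2.4},
]


def download_all():
    """Bulk distribution: return a copy of every record."""
    return [dict(record) for record in DATASET]


def select(**criteria):
    """Access service: return only the records matching every criterion."""
    return [
        dict(record)
        for record in DATASET
        if all(record.get(key) == value for key, value in criteria.items())
    ]


def is_subset(subset, full):
    """True if every record in `subset` also appears in `full`."""
    return all(record in full for record in subset)


slice_2018 = select(year=2018)
print(len(slice_2018), is_subset(slice_2018, download_all()))
```

Whatever the consumer selects is, by construction, a slice of what the bulk download returns, which is the guarantee described above.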
I understand that it's not that easy to say something about the content of distributions. My initial question was about a real-world case where distributions did not contain the same data, with distributions under one dataset containing data for different individual years. I do understand that maybe 'informationally equivalent' could be taken to mean exactly the same so it's maybe too strong.
Could we try to use some examples?
The matter of 'information equivalence' of 'Distributions' has come up in a couple of conversations I'm having:
In DCAT I believe we encourage the view that the description of the I wonder if we need to take another look at this 'information equivalence' argument with these use-cases in mind. It might be accommodated by introducing an alternative predicate to relate a Distribution to its Dataset - e.g. alongside
maybe also have
The latter would also require some semantic information on the distribution to describe which aspect (e.g. spectral-band, time-slice, spatial-tile, dimension) of the full Dataset is included in the particular Distribution.
@dr-shorthair I really like the approach you're proposing. I can see it solving a lot of the problems I've seen -- including people attaching per-year data to a multi-annual dataset.
@dr-shorthair Could it also work for granularities? E.g. the map of a region in different scales?
Maybe dcat:derivedDistribution for distributions that serialize (for spatial data) different scales or map projections, upsampling or downsampling. I like the idea of an operational test for 'information equivalence' between representations A and B to be that one can transform A to B and the resulting B back to A, and get (with acceptable computational approximations for numeric data) the same A (all data elements are present and have the same values). One suggestion from a related conversation is that a distribution should include a property specifying the nature of the relationship of the offered serialization to a 'canonical representation', e.g. resampled, anonymized, reprojected.
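The round-trip test described above could look roughly like this in Python. This is only a sketch under stated assumptions: the function names, the sample records, and the metre/foot conversions are hypothetical, and "acceptable computational approximation" is taken here to mean a relative tolerance on numeric values.

```python
import math


def _same(x, y, rel_tol):
    """Compare two values structurally, allowing float rounding error."""
    if isinstance(x, float) or isinstance(y, float):
        return (
            isinstance(x, (int, float))
            and isinstance(y, (int, float))
            and math.isclose(x, y, rel_tol=rel_tol)
        )
    if isinstance(x, dict) and isinstance(y, dict):
        return x.keys() == y.keys() and all(_same(x[k], y[k], rel_tol) for k in x)
    if isinstance(x, (list, tuple)) and isinstance(y, (list, tuple)):
        return len(x) == len(y) and all(_same(i, j, rel_tol) for i, j in zip(x, y))
    return x == y


def roundtrip_equivalent(a, to_b, to_a, rel_tol=1e-9):
    """True if transforming A to B and back reproduces A within tolerance."""
    return _same(a, to_a(to_b(a)), rel_tol)


# Hypothetical sample data and transformations.
records = [{"id": 1, "height_m": 12.3456}, {"id": 2, "height_m": 7.89}]


def to_feet(rows):
    return [{"id": r["id"], "height_ft": r["height_m"] / 0.3048} for r in rows]


def to_rounded_feet(rows):
    # Lossy variant: rounding to whole feet discards information.
    return [{"id": r["id"], "height_ft": round(r["height_m"] / 0.3048)} for r in rows]


def from_feet(rows):
    return [{"id": r["id"], "height_m": r["height_ft"] * 0.3048} for r in rows]


lossless = roundtrip_equivalent(records, to_feet, from_feet)
lossy = roundtrip_equivalent(records, to_rounded_feet, from_feet)
print(lossless, lossy)
```

A unit conversion survives the round trip, while the rounded variant does not, so under this test only the first pair would count as informationally equivalent. The tolerance is a design choice the publisher would have to state.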
Is that the inverse of dcat:servesDataset, which links from a |
Yes I think it would be |
@smrgeoinfo the Note that we have had comment elsewhere (from Clemens on the comments email list here) pushing back on this, primarily on the grounds that this implies an enlargement of scope of (Link to email added by @davebrowning - issues #530 and #531 tracking further discussion/resolution)
Isn't a dcat catalog just one more variety of registry? The important thing is the information model for the descriptions of items in the catalog (i.e. the subtypes of resource). The interesting thing to me about dcat is not Catalog, but Dataset, and adding DataService is an excellent improvement. Maybe there is an argument that the vocabularies for describing different kinds of resources should be in different namespaces, but that's a stewardship question, and given the small footprint of the dataservice extension (5 predicates and 3 objects) a whole new vocab would just be a lot of extra work. I'd just like to see that the scope of DataDistributionService is not just APIs but also includes web applications for slicing and visualizing a dataset, like those ERDDAP and THREDDS offer.
@smrgeoinfo - aren't these DAP/2 and similar sophisticated APIs for slicing and dicing? I don't think that the web application, the front end bit, should be included, whereas a specialised API should be in scope.
It's just that those front end apps are built into the servers that implement the API, and I suspect (unfortunately I don't have hard data...) that a lot of users use those web apps to get the data they need, so it's really useful to be able to provide links to them in metadata; currently in the ISO world the distribution is most commonly used for that. Any ERDDAP, THREDDS, OPENDAP users out there want to comment?
We use PyDAP as a handy way of making custom slices of HDF5 files shareable. It gets used particularly heavily by climate researchers. I agree that the collections we make available with that are well described as DataServices. The difference in my mind is that they are meant for human consumption rather than programmatic consumption.
Reviewing the current draft, I see that this issue is linked to a comment in the 6.7 Class:Distribution section that says
Looking back over the discussion (and assuming that we accept DataService as a valid resource type for a dcat:Catalog), Simon's comment (above) starts a good direction I think. There are several relationships between the dataset as a work, and the various ways data providers provide access:
If approaches 2, 3, 4, and 5 are treated as dcat:DataDistributionServices, then the dct:conformsTo, dct:relation, and dct:type properties could be used to provide the necessary information to distinguish these various kinds of distribution/access. Note this issue is closely related to #145
@smrgeoinfo asked:
Not quite. The way it is modelled the Note also in Figure 1 that |
@makxdekkers said:
I think we must be very careful not to overcomplicate the use of DCAT. Making explicit the different levels of granularity, accuracy, and spatio-temporal coverage of distributions may be relevant in some specific use cases, but it is more common that metadata maintainers don't have this information. I would be more in favour of extending the existing approach, by continuing to use
@andrea-perego That is a sensible alternative.
@makxdekkers, we are actually missing a way to indicate a number of aspects of data quality (here intended in its general sense), and indeed some of them can be considered domain-specific, so they may be better in the scope of profiles of DCAT. One of them is the Coordinate Reference System (CRS), which is one of the key pieces of information for geospatial data. This is supported in GeoDCAT-AP, where this information is normally specified at the level of the dataset, but can also be associated with the distribution (in this case, the dataset is made available in different distributions, each using a different CRS). For more generic aspects (precision / accuracy / level of spatial/temporal resolution), what we have at the moment is documented in some of the examples of the DQV spec (see https://www.w3.org/TR/vocab-dqv/#ExpressDatasetAccuracyPrecision); others could possibly be taken from RDF Data Cube.
@smrgeoinfo a consequence of the discussion in #432, and some other editorial work, the definition of Please look at https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#dcat-scope (the dot-points above Figure 1, and also the paragraphs below the figure) and also https://rawgit.com/w3c/dxwg/dcat-issue411/dcat/#Class:Dataset and https://rawgit.com/w3c/dxwg/dcat-issue411/dcat/index.html#Class:Distribution. Does this address the concerns you raised in https://lists.w3.org/Archives/Public/public-dxwg-comments/2018Nov/0003.html ?
Maybe I missed the discussion about this, but the new definitions in the bullet points above the diagram in https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#dcat-scope now say "dcat:X represents a description of a X" while 2PWD had "dcat:X represents a X". I think the 'a description of' is not correct. The RDF statements about X are a description in which dcat:X represents X.
Yes the issue432 edits address the concerns from the list posting, thanks! @makxdekkers, I guess the question is are the items in a catalog datasets or descriptions of datasets? In the discussion below Figure 1 in the issue432 branch, the paragraph after 'A data service...' begins thus: for consistency, that should read either or Either perspective can be made to work, it should just be consistent.
I agree with @makxdekkers that in the first mention (above the diagram) I went too far - 'represents' and 'description' essentially say the same thing, so doubling up makes it all a bit too 'meta' ... @smrgeoinfo I'm inclined to go with the second formulation - the catalog is not a repository so it is not the things themselves that are found there, but descriptions of them. The various |
@dr-shorthair +1 on your suggestions
Same here, +1 to @dr-shorthair
So, where are we with regard to enabling a user to determine whether data they are sourcing online is the same data available elsewhere or not? Now that we seem to have an agreement that distributions are not necessarily informationally equivalent, how can a user determine whether one distribution is equivalent to another? This thread exposed at least three use cases relating to this question.
1. A user finds a catalog entry with multiple distributions. How do they decide which to download?
2. A user finds out about a colleague's use of data in one serialization and wants to obtain the same data in a different serialization. How do they know that they are getting equivalent data?
3. A scientist wants to reproduce work by another scientist but doesn't have access to the same data source. They find something that seems to be the same in a data catalog. How do they ensure that they are in fact using the same data?
@agreiner My take on that is that, once we decided that distributions are not necessarily informationally equivalent, there is no way that you can make sure that it is exactly the same data. The default position would be that it is not the exact same data. If you want to be sure to use the exact same data then you need to use the same file.
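For the "same file" position, a byte-level digest comparison is the obvious check. The sketch below uses Python's standard `hashlib`; the two CSV payloads are invented for illustration, and nothing here is something the thread or the DCAT draft prescribes.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a payload."""
    return hashlib.sha256(data).hexdigest()


def same_file(a: bytes, b: bytes) -> bool:
    """True only if the two payloads are byte-for-byte identical."""
    return sha256_of_bytes(a) == sha256_of_bytes(b)


# Hypothetical payloads: the same rows, serialized in a different order.
csv_original = b"id,value\n1,10\n2,20\n"
csv_copy = b"id,value\n1,10\n2,20\n"
csv_reordered = b"id,value\n2,20\n1,10\n"

print(same_file(csv_original, csv_copy))       # identical bytes
print(same_file(csv_original, csv_reordered))  # same rows, different bytes
```

The second comparison fails even though both payloads carry the same rows, which illustrates the point: a checksum can confirm "the same file" but says nothing about equivalence across serializations.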
@agreiner, I have the impression that the use cases you outlined are (at least partially) related to what was discussed in #433, so my guess is that they can be addressed accordingly - i.e., if a dataset is published for re-use and/or for reproducing an experiment, the contained distribution(s) are those needed to do the job, and which should be used for doing what is supposed to be indicated in the description of the distribution, and, in addition, by some properties (possibly providing machine-actionable information - e.g., the specification of the spatial/temporal resolution level, the reference system used, etc.). That said, the possible use cases are so heterogeneous that, IMO, this problem cannot be solved with metadata only. In many cases, the most effective and straightforward option is to get in touch with the data provider, and ask for support. This is of course not in scope of the work we are doing (although there's a BP in DWBP we could refer to), but it is one of the underestimated/neglected aspects of data publication (e.g., as far as I know, the FAIR principles do not mention this explicitly), and one of the main reasons why data are not re-used.
@agreiner, I think the 3 use cases you spell out are good tests of whether the WG is comfortable about the level of support we now have in the core vocabulary. We're trying to balance the variation we see across domains/participants in what are seen as distributions of the same dataset (specifically - a very wide variation) and the need in some domains for much more precision. The now rather long note on the vocabulary definition for I suspect any commitment that two distributions are the same data has to be clear who is making the commitment. In the first and perhaps second of your use cases it's likely that it would be whoever is the publisher of the dataset. In the third case, a suspicious user might want a decent provenance chain (on both the metadata and the data itself) so that they can assess whether they are happy using it. Although it's not explicit in the Provenance section of the document, the provisional alignment (turtle here) does have Does that go far enough?
Since distributions can now differ in pretty significant ways, and since we are suggesting that publishers can use them for informationally equivalent resources (or not), it seems like we should be giving them a way to express that. Maybe something along the lines of @makxdekkers's suggestion of an indication that one distribution is exactly the same data as another. Or it could be as simple as a boolean property that says "this is the complete dataset, not a subset."
Using dct:conformsTo could specify the profile - and hence information equivalence to that degree (or define a subproperty of it). Maybe we could use as a convention that a dataset is by default a trivial profile of itself, and S hasDistribution D dct:conformsTo S indicates the distribution is the full dataset? (There is still the issue of the information profile of the distribution and the distribution mechanism conforming to a profile of a service API, for example - what is the actual domain of dct:conformsTo? It's the "distribution", but that involves both access method and payload.)
It is always the prerogative of the publisher or catalog provider to decide how much detail to provide, and where to attach it.
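A quick sketch of how that convention might be checked, in Python with plain tuples standing in for RDF triples. All `ex:` identifiers and the helper names are hypothetical, and `dcat:distribution` is used where the comment writes "hasDistribution". This illustrates the suggested convention only; it is not an agreed DCAT pattern.

```python
# Triples as (subject, predicate, object) tuples of plain strings.
# All "ex:" identifiers are made up for illustration.
TRIPLES = {
    ("ex:dataset", "dcat:distribution", "ex:full-csv"),
    ("ex:full-csv", "dct:conformsTo", "ex:dataset"),
    ("ex:dataset", "dcat:distribution", "ex:year-2018-csv"),
    ("ex:year-2018-csv", "dct:conformsTo", "ex:yearly-slice-profile"),
}


def distributions_of(dataset, triples):
    """All distributions attached to a dataset, sorted for stable output."""
    return sorted(
        o for s, p, o in triples if s == dataset and p == "dcat:distribution"
    )


def is_full_distribution(dataset, distribution, triples):
    """Convention under discussion: D conformsTo S marks D as the full dataset."""
    return (
        (dataset, "dcat:distribution", distribution) in triples
        and (distribution, "dct:conformsTo", dataset) in triples
    )


full = [
    d
    for d in distributions_of("ex:dataset", TRIPLES)
    if is_full_distribution("ex:dataset", d, TRIPLES)
]
print(full)  # ['ex:full-csv']
```

Note the open question from the comment still applies: this treats dct:conformsTo as a statement about the payload, not the access mechanism.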
As agreed/minuted in https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08, the base issue here is to be closed as having been addressed in the various pull requests listed here. If there is a strong use case for support of complete/partial distribution semantics within the core vocabulary then either we can re-open this or (preferably) open a new issue. As it stands this appears to be an issue best addressed using profiles.
Continuing the conversation that has broken out on (closed) issue #52