How to express distributions provided as compressed files #259
Comments
I think you're asking a bit much here! If there are a bunch of different things all zipped together (say PDF files, CSV and images), then what are you going to do if you want to communicate the content? Provide a list of media types like this:
|
See also: #54 |
@akuckartz thanks for the reminder about #54, which I recall now that you mention it! Is this issue a duplicate? I think it is. Or, at least, the scope of #54 covers this issue. I move to close this issue in favour of dealing with compression issues in #54. |
@nicholascar I actually think there are 2 separate issues and their combination, but even in #54 they are a bit mixed up.
|
Thanks for the summary. My use case is the first one. Here are two popular examples I make use of:
It's also common to serve
Anyway, I doubt that providers will change their web server settings just to make the DCAT ontology happy. One example of a dataset with multiple distributions from DNB: https://data.dnb.de/opendata/GND.hdt.gz |
As I wrote in #54 (comment), the solution with +zip is just for the simple use case of a zipped-up distribution file. This solved an issue that was brought up in the work on the European DCAT-AP. |
@makxdekkers Let's look at examples of
Note that neither the File Types codelist mandatory in DCAT-AP nor the official IANA Media Types list is exhaustive, therefore we need to use both. The simplest case is an uncompressed CSV file (which is actually served with HTTP gzip compression when supported, transparently to DCAT). There is a CSV on the Web JSON descriptor of the CSV file in

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> .

Now let's add the explicit

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/GZIP> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .

I would therefore suggest (the actual new properties can be different, if appropriate ones are found):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;
    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .

Next, the packaging of multiple files. Let's assume that we have a TAR package with a set of homogeneous CSV files inside (e.g. data for individual years). Note that ZIP can be used here as well, as a packager rather than for compression:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # There is no IANA media type for TAR to use with dcat:mediaType
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .

The same points as with the gzip compression above apply here. In addition:
Therefore, I would suggest:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;
    # for TAR there is no media type, but e.g. for ZIP there is dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> .

Finally, the packaging and compression case. This means multiple CSV files with, for instance, TAR packaging and GZIP compression, or ZIP used for both packaging and compression. Here we need to specify 3 levels: CSV, TAR and GZIP. So I would suggest:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;
    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # for TAR there is no media type, but e.g. for ZIP there is dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .

This gives publishers the possibility to describe the distribution properly, and the original DCAT properties are still used for the most important format, which is the innermost one. Of course the |
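The layering suggested in this comment can be sketched as a small generator. This is only an illustration under assumptions: the property names dcat:packageFormat, dcat:compressionFormat and dcat:compressionMediaType are the ones proposed in this comment, not established vocabulary, and the helper name `distribution_turtle`, its parameters and the example.org subject IRI are made up for the example.

```python
from typing import Optional

FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"
IANA = "http://www.iana.org/assignments/media-types/"


def distribution_turtle(
    subject: str,
    download_url: str,
    content_format: str,
    content_media_type: str,
    package_format: Optional[str] = None,
    compression_format: Optional[str] = None,
    compression_media_type: Optional[str] = None,
) -> str:
    """Render one dcat:Distribution using the layering proposed in this thread.

    dct:format and dcat:mediaType always describe the innermost content;
    the package and compression layers are optional and independent.
    """
    pairs = [
        ("dcat:downloadURL", download_url),
        ("dct:format", FILE_TYPE + content_format),
        ("dcat:mediaType", IANA + content_media_type),
    ]
    # The three properties below are proposals from this thread (assumption).
    if package_format:
        pairs.append(("dcat:packageFormat", FILE_TYPE + package_format))
    if compression_format:
        pairs.append(("dcat:compressionFormat", FILE_TYPE + compression_format))
    if compression_media_type:
        pairs.append(("dcat:compressionMediaType", IANA + compression_media_type))

    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        f"<{subject}> a dcat:Distribution ;",
    ]
    for i, (prop, iri) in enumerate(pairs):
        # The last predicate-object pair closes the statement with a dot.
        end = " ." if i == len(pairs) - 1 else " ;"
        lines.append(f"    {prop} <{iri}>{end}")
    return "\n".join(lines)


# Packaged and compressed case: CSV files inside a TAR, compressed with GZIP.
print(distribution_turtle(
    subject="https://example.org/distribution/1",  # placeholder IRI
    download_url="https://mvcr1.opendata.cz/czechpoint/data.tar.gz",
    content_format="CSV",
    content_media_type="text/csv",
    package_format="TAR",
    compression_format="GZIP",
    compression_media_type="application/gzip",
))
```

Leaving out the optional arguments yields the plain uncompressed description, so existing consumers that only read dct:format and dcat:mediaType would still see the innermost format.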
Thanks @jakubklimek, introducing additional properties for compression and packaging is a good idea. Handling package formats with multiple files requires distinguishing more cases:
|
Thanks @jakubklimek for your thorough analysis. |
@nichtich You bring up a point that led the DCAT-AP development group not to go deeper into the issue of compressed and packaged files, namely that one could imagine many complex cases that would require a lot of specific properties. If this analysis could lead to a limited number of additional properties, and a clear guideline on how to use them, that would help a lot of people. |
@makxdekkers we don't have to cover all cases - @jakubklimek already summarized the most important ones. In short, a distribution file can have these independent properties:
Formats can be compression formats (e.g. gzip), package formats (e.g. tar), both (e.g. zip) or neither (e.g. csv). Furthermore, one and only one of these three cases may apply:
These three cases all refer to the internal format and they are disjoint, so I'd use the existing format properties. |
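The classification of formats above can be sketched as a lookup table. This is a rough illustration, not a proposal: the table only holds the four formats named above, and the suffix mapping, the `Layers` type and the function name `split_layers` are assumptions made for the example. Guessing layers from file name suffixes is a heuristic only.

```python
from typing import NamedTuple, Optional


class FormatRole(NamedTuple):
    compresses: bool  # the format applies compression
    packages: bool    # the format can bundle several files


# Only the examples named in the comment above.
ROLES = {
    "gzip": FormatRole(compresses=True, packages=False),
    "tar": FormatRole(compresses=False, packages=True),
    "zip": FormatRole(compresses=True, packages=True),
    "csv": FormatRole(compresses=False, packages=False),
}

# Hypothetical mapping from file name suffix to format name.
SUFFIXES = {"gz": "gzip", "tar": "tar", "zip": "zip", "csv": "csv"}


class Layers(NamedTuple):
    content: Optional[str]
    package: Optional[str]
    compression: Optional[str]


def split_layers(file_name: str) -> Layers:
    """Peel known suffixes off a file name, outermost first."""
    content = package = compression = None
    parts = file_name.lower().split(".")[1:]
    for suffix in reversed(parts):
        fmt = SUFFIXES.get(suffix)
        if fmt is None:
            break  # unknown suffix: stop guessing
        role = ROLES[fmt]
        if not role.compresses and not role.packages:
            content = fmt  # a plain data format is the innermost layer
            break
        # Compression only counts as a layer when it wraps everything else.
        if role.compresses and compression is None and package is None:
            compression = fmt
        if role.packages and package is None:
            package = fmt
    return Layers(content, package, compression)


print(split_layers("data.tar.gz"))
# Layers(content=None, package='tar', compression='gzip')
```

Note that for data.tar.gz the content format stays unknown, which is the point made in this thread: the format of the files inside a package cannot be read off the outer layers and has to be stated in the metadata.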
@makxdekkers I see your point regarding backward compatibility. The downside is that the actual data representation format (e.g. CSV, XML, JSON, RDF) will be attached using different properties for compressed/packaged and uncompressed/unpackaged distributions, like this:

Uncompressed:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> .

Compressed:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;
    dcat:containedFormat <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:containedMediaType <http://www.iana.org/assignments/media-types/text/csv> ;
    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # for TAR there is no media type, but e.g. for ZIP there is dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .

To be honest, I am not sure how publishers actually behaved when faced with this challenge in DCAT 2014, i.e. whether they specified
|
@nichtich I think this case is a typical example of a wrong Dataset design and should be handled by splitting such a Dataset according to the individual formats used in the archive, so that each part can be described properly by DCAT (e.g. with a reference to the format used) using the other cases. Or do you have an example where this would actually be appropriate? |
@jakubklimek scripsit:
And a third way is to configure the web server to look at both
the server would return https://example.org/datasets/ds1.jsonld.gz whereas a request for
would return https://example.org/datasets/ds1.xml.gz |
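The server-side selection described in this comment could look roughly like the sketch below. It rests on loud assumptions: the two request headers, whose names are missing from the text above, are taken to be Accept and Accept-Encoding; the variant table and the function name `select_variant` are invented for the example; q-values and wildcards are ignored for brevity. Only the two example.org URLs come from the comment.

```python
from typing import List, Optional

# Hypothetical variant table: (media type, content coding) -> URL.
VARIANTS = {
    ("application/ld+json", "gzip"): "https://example.org/datasets/ds1.jsonld.gz",
    ("application/xml", "gzip"): "https://example.org/datasets/ds1.xml.gz",
}


def _tokens(header: str) -> List[str]:
    """Split a comma-separated header into lower-cased values without parameters."""
    return [
        item.split(";")[0].strip().lower()
        for item in header.split(",")
        if item.strip()
    ]


def select_variant(accept: str, accept_encoding: str) -> Optional[str]:
    """Return the first variant that matches both headers, or None.

    Simplification (assumption): header order stands in for preference,
    q-values are dropped and wildcards such as */* are not handled.
    """
    for media_type in _tokens(accept):
        for coding in _tokens(accept_encoding):
            url = VARIANTS.get((media_type, coding))
            if url:
                return url
    return None


print(select_variant("application/ld+json", "gzip"))
print(select_variant("application/xml", "gzip"))
```

From the DCAT side, each URL in the table would still be a separate file with its own format and compression, which is what the follow-up discussion turns on.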
@larsgsvensson Yes. The question is how this would relate to DCAT. In this case I would imagine a dataset with 2 distributions, one for xml.gz, one for jsonld.gz, each described as shown above. This is necessary because of the other distribution description properties such as a schema, which would be different for XML and for JSON-LD. This leaves the question of how would |
See discussion on this issue in minutes of DCAT meeting https://www.w3.org/2018/06/28-dxwgdcat-minutes.html#x08 |
@dr-shorthair A note to your comment in the email summary of the issue:
I disagree.
Sure, too many layers are impractical, but I was proposing a quite simple solution to common (not all) situations, i.e. a compressed file, packaged homogeneous files, and their combination. This also covers a compressed file with a standardized directory structure such as a Data Package.
@arminhaller Regarding your point in the minutes:
These should be 3
@andrea-perego Regarding your point in the minutes:
We still need the proposed extension for the common situations.
The primary focus should be on machine readability. When something non-standard is used as a distribution, it should only be in cases where no standard DCAT approach is applicable, and this should be documented in the dataset's description and documentation. |
I like @jakubklimek's analysis. If the problem can be solved with some extra properties, I am all for it. Apart from allowing machine-processing, it is also very relevant to show to the human user what is inside the file as this would help someone to decide not to download a big ZIP file with something inside that the user can't process. |
See #746 |
Addressed in #746 and closing as agreed at https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08. |
The dcat:downloadURL of a dcat:Distribution can point to compressed files (.zip, .gz ...). What is the data format in this case? Stating that it is a ZIP file will not help much: the more interesting information is what's in the archive. For dcat:mediaType the +gzip MIME type suffix can be added, but what to put into dct:format: an identifier of the archive format or of its content (I'd prefer the latter)?
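To make the question concrete, here is a minimal sketch that separates a media type string into its base type and a trailing +suffix. It assumes the compression is written as a +gzip suffix as suggested above; the function name `split_suffix` and the example value text/csv+gzip are illustrative only, and whether such a suffix is appropriate for a given type is exactly what this issue asks.

```python
from typing import Optional, Tuple


def split_suffix(media_type: str) -> Tuple[str, Optional[str]]:
    """Split 'type/subtype+suffix' into ('type/subtype', 'suffix').

    Parameters after ';' are dropped and the result is lower-cased.
    Returns (essence, None) when there is no '+suffix' part.
    """
    essence = media_type.split(";")[0].strip().lower()
    base, plus, suffix = essence.rpartition("+")
    if not plus or "/" not in base:
        return essence, None
    return base, suffix


# The content type would go to the format of the data,
# the suffix would describe the compression.
print(split_suffix("text/csv+gzip"))
print(split_suffix("text/csv"))
```

A consumer could then map the base type to the content format and the suffix to the compression, but this only works for the single-file case and says nothing about packages with several files.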