Provenance information [RPIF] #76

jpullmann · 2018-01-18T21:12:48Z

Provenance information [RPIF]

Provide a way to link to structured information about the provenance of a dataset including:

the input data used to create a dataset, linked to the dataset.
the software used to produce the dataset, linked to the dataset.
an extensible model for different types of agent roles
funders

dct:creator, dct:publisher, etc. are special cases that require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special-case properties of dcat:Dataset.

Related use cases: Common requirements for scientific data [ID9], Modeling data lineage [ID12], Modeling agent roles [ID13], Modeling funding sources [ID31]

andrea-perego · 2018-01-20T00:13:47Z

About agent roles, I have a specific proposal: relax the domain of dcat:contactPoint to allow its use not only for datasets, but also for other resources (e.g., catalogues, catalogue records).

This issue popped up during the development of GeoDCAT-AP, since all the agent roles (dcat:contactPoint included) supported in ISO 19115 can be specified for any resource.

Besides this, the "contact point" is probably the most important role for data consumers, not only for datasets. For instance, for a dcat:Catalog it is possible to specify the dct:publisher, but if I need to ask questions and/or report issues about the catalogue I need to get in touch with the publisher's dedicated contact point.

I wonder whether this (and similar revisions to DCAT) requires the creation of a separate use case.

⚠️ As decided at the end of the DCAT subgroup telecon of 31 Jan 2018, a separate issue has been created (#95)

nicholascar · 2018-01-30T12:31:40Z

Re-represented as RDA Prov Patterns WG Use Case 41: http://patterns.promsns.org/usecase/41

nicholascar · 2018-01-30T12:39:03Z

Several patterns for "providing a way to link to structured information about the provenance of a dataset" are given both in PROV and in patterns by the RDA Prov Patterns WG, such as http://patterns.promsns.org/pattern/12. We should reuse these.

riccardoAlbertoni · 2018-02-02T14:54:03Z

I propose to untag "quality" from this issue, as it is more related to provenance than to quality. Clearly "provenance" might influence quality, but considering that we have the tag "provenance", I think we can remove "quality".

dr-shorthair · 2018-03-07T05:28:03Z

A placeholder section/sub-section or proposal for the DCAT document would be appreciated - to alert the community when we release the FPWD.

nicholascar · 2018-05-23T00:15:16Z

Regarding the first of the three items in the description of this requirement:
[Provide a way to link to] the input data used to create a dataset to the dataset:

Proposal

Assuming a close prov:Entity/dcat:Dataset relationship is established, use two of the three patterns described in the RDA's pattern "Associating metadata in documents with graph provenance" (the third pattern is not relevant):

Pattern 1: store provenance in a different document/service to the Dataset metadata and link with either prov:has_provenance or prov:has_query_service relations

This is appropriate when potentially detailed provenance information cannot be well catered for within the standard DCAT document. This will be the case in purpose-built systems that cater for DCAT but not all the possibilities of PROV, even for Dataset/Dataset (Entity/Entity) relations.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:has_provenance :Bundle_N <-- here the DCAT record for Dataset_X points to a document defined as a prov:Bundle, within which Dataset_X is referenced and any number of PROV provenance relationships are given, to other datasets in the same catalogue or in others.

Instead of a provenance document, a dataset could be linked to a provenance query service using prov:has_query_service.
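
For illustration, Pattern 1 could be written out in Turtle roughly as follows. This is only a sketch: the ex: namespace and all resource names, including ex:provenance_service, are placeholders. The prov:has_provenance and prov:has_query_service terms come from PROV-AQ.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .   # placeholder namespace

# In the DCAT document: only a pointer to the provenance is recorded.
ex:Dataset_X a dcat:Dataset , prov:Entity ;
    prov:has_provenance ex:Bundle_N .

# Alternative pointer, to a query service instead of a document:
# ex:Dataset_X prov:has_query_service ex:provenance_service .

# In the separate provenance document identified by ex:Bundle_N:
ex:Bundle_N a prov:Bundle .
ex:Dataset_X prov:wasDerivedFrom ex:Dataset_Y , ex:Dataset_Z .
```

The DCAT record stays small, and the bundle can carry as much PROV detail as needed without the catalogue having to understand it.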

Pattern 2: link datasets directly to others with PROV-O relations
This is appropriate when the system used to store DCAT information can store any PROV relationships.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .

or, in qualified form (see https://www.w3.org/TR/prov-o/#qualifiedDerivation):

:Dataset_X prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity :Dataset_Y ;

    ## More details about the activity underpinning the derivation
    prov:hadGeneration :a_detailed_generation ;
    ...
] , [
    a prov:Derivation ;
    prov:entity :Dataset_Z ;
    prov:hadGeneration :different_detailed_generation ;
    ...
] .

nicholascar · 2018-05-23T03:24:33Z

Regarding the second of the three items in the description of this requirement:
[Provide a way to link to] the software used to produce the dataset to the dataset:

Proposal

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

Example:

Dataset X was derived from Dataset Y and the derivation was made using Software Z.

As long as the specific instance of software that was used can be recorded (i.e. not the URI of the GitHub repo but that of the specific commit that was used), the above can simply be recorded as:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Software_Z .

where the derivation from Software Z is understood to be a derivation by instruction due to :Software_Z being a prov:Plan. If this requires more spelling out:

:Dataset_X
    prov:wasDerivedFrom :Dataset_Y ;
    prov:qualifiedDerivation [
        prov:entity :Software_Z ; # still subclassed from Entity as Plan!
        prov:hadRole :some_special_role_for_software ;
    ] .

nicholascar · 2018-05-23T03:29:19Z

Regarding the third of the three items in the description of this requirement:
[Recommend] an extensible model for different types of agent roles

Proposal

For the general case of role or other qualifications, see Qualified forms [RQF] #79 where a proposal for qualified forms is made with agent roles as an example.

larsgsvensson · 2018-05-23T09:04:58Z

@nicholascar wrote:

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

I don't know if that's possible: Usually software is considered a prov:Agent, more specifically a prov:SoftwareAgent: "A software agent is running software."

andrea-perego · 2018-05-25T22:11:29Z

@nicholascar, some time ago I added examples of provenance patterns to the wiki, and some of them relate to yours:

https://github.com/w3c/dxwg/wiki/Provenance-patterns

Would you mind checking them to see whether they should be revised/extended?

rob-metalinkage · 2018-05-25T22:48:31Z

This gets messier with things like SHACL and SPIN where the software is data.

Software is an entity, an instance of running software is an agent?

This fits with software being subject to processes such as automated testing.

kcoyle · 2018-05-26T15:10:20Z

@rob-metalinkage Can you expound on this a bit?

"This gets messier with things like SHACL and SPIN where the software is data"

Which software is data? I read SHACL as taking instance data as input, so I'm not sure which software you mean. But I may be thinking of something other than what you meant.

nicholascar · 2018-05-28T06:37:04Z

@rob-metalinkage @larsgsvensson we have a long-established precedent of modelling instances of software as prov:Plan (subClassOf prov:Entity) objects that guide a prov:Activity, which then produces things, with an executing agent, such as a server, modelled as a prov:Agent. This is shown in quick outline in the RDA Provenance Patterns WG's pattern 18: https://patterns.promsns.org/pattern/18.

I have run this pattern, in which the instance of software used is modelled as a prov:Plan and the execution system running it as a prov:Agent or a prov:SoftwareAgent, past several original members of PROV (Luc & Paolo) as well as many PROV practitioners over many years. It works fine, although it wasn't expressly catered for in the 2013 PROV publications.

The pattern is generalisable to include methods other than software, such as scientific methods.

I will re-document that pattern for the RDA WG in more detail shortly.

larsgsvensson · 2018-05-30T14:24:36Z

@nicholascar It seems that we need to define exactly what we mean by software... I'd say that we need to differentiate between the sequence of commands being executed (aka a programme), the execution of that programme (aka a process) and any input passed to that process (let's call it input).

If we look at the case of a SHACL engine validating a piece of RDF using a SHACL file, it seems to me that the execution of the validation is a prov:Activity that is executed by the SHACL engine (programme; prov:SoftwareAgent rdfs:subClassOf prov:Agent) and uses the SHACL file (input; prov:Plan rdfs:subClassOf prov:Entity). The process is tricky and doesn't always map to PROV-O, since the validation can be done by a server process (e.g. a servlet) that's already running and that handles several validations in parallel.

If that's what you mean, I fully agree. And we need a better word than "software".

nicholascar · 2018-06-02T11:26:20Z

@larsgsvensson the differentiation you describe is how I describe things so I agree with your general characterisations.

I do agree that defining a process can be tricky but if we stick to the "provenance that we want", not a "provenance that could be modelled" then we can usually do something sensible. In the example you give of a servlet validating something I would model it thus:

the process as a prov:Activity - starting and ending with the processing of the RDF of interest, regardless of any other jobs it may be doing (we don't care about those)
the servlet as a prov:SoftwareAgent, if that's important to know, or perhaps the server itself
- the choice of which Agent to model will come down to what facts are most important to know for a Use Case such as recording info for potential process recreation
the input of the RDF file being validated as a prov:Entity
the input of a SHACL file as a prov:Entity - not a prov:Plan
- here the SHACL file is not instructing the Activity. It's determining a validation assessment but the conducting of the Activity itself is, in fact, guided by the code that applies the validation to the data, the SHACL file to the input RDF.
the output of the validation task - pass, fail, error messages etc - a prov:Entity that prov:wasDerivedFrom the two inputs AND the prov:Plan that instructed that the SHACL input be applied to the RDF input

So this modelling will allow someone to see when (Activity) something (whichever Agent) did what (Plan) with what inputs (Entity x 2) and what output (Entity). Sure, you could model things differently but what's the Use Case?
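
As a rough Turtle sketch of the modelling listed above (all ex: names are made up for illustration, and the qualified association is just one way of attaching the Plan to the Activity):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .   # placeholder namespace

# The validation job itself, ignoring anything else the servlet is doing.
ex:validation_run a prov:Activity ;
    prov:used ex:input_rdf , ex:shapes_file ;
    prov:wasAssociatedWith ex:servlet ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent ex:servlet ;
        prov:hadPlan ex:validator_code    # the code that guided the Activity
    ] .

ex:servlet        a prov:SoftwareAgent .
ex:validator_code a prov:Plan .           # prov:Plan is a subclass of prov:Entity
ex:input_rdf      a prov:Entity .
ex:shapes_file    a prov:Entity .         # the SHACL file: an input, not a Plan

# The validation result, derived from both inputs and the Plan.
ex:validation_report a prov:Entity ;
    prov:wasGeneratedBy ex:validation_run ;
    prov:wasDerivedFrom ex:input_rdf , ex:shapes_file , ex:validator_code .
```

Swapping ex:servlet for the server itself would only change the Agent; the Activity, Plan and Entities stay the same.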

azaroth42 · 2018-06-21T21:20:58Z

Could someone clarify the relevance of profile_negotiation to this issue, or remove the tag?

dr-shorthair · 2018-07-25T05:25:33Z

Provenance information should probably be available at the level of representations (dcat:Distributions) as well as dcat:Datasets

(Does this need a new Issue?)

andrea-perego · 2019-02-15T23:48:13Z

I think we should consider here cases where "provenance" is expressed in a discursive way - e.g., when describing the dataset lineage (as mentioned in UC9).

This is quite a common practice for scientific data and in some domains, such as the geospatial one. In most cases, these lineage descriptions are such that they cannot be easily converted into a machine-actionable representation.

In DCAT-AP, this is done by using dct:provenance/dct:ProvenanceStatement/rdfs:label, and according to the report on DCAT-AP usage statistics from the European Data Portal, this information is included in more than 50% of the EDP records (391,616).
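
A minimal sketch of that property path in Turtle, assuming a placeholder ex: namespace and an invented lineage text:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .   # placeholder namespace

ex:Dataset_X a dcat:Dataset ;
    dct:provenance [
        a dct:ProvenanceStatement ;
        # Free-text lineage; this wording is illustrative only.
        rdfs:label "Derived from Dataset Y and Dataset Z by manual harmonisation."@en
    ] .
```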

It may be worth considering its inclusion in DCAT.

andrea-perego · 2020-10-28T23:59:52Z

As there has been no further discussion on this issue, I propose to close it.

andrea-perego · 2021-03-13T14:00:09Z

As there has been no further discussion on this issue, I propose to close it.

Noting no objections, I am closing this issue.

jpullmann added profile-negotiation dcat documentation provenance quality referencing requirement roles labels Jan 18, 2018

This was referenced Jan 21, 2018

Create PROV-alignment module #94

Merged

Version release date [RVSDT] #91

Closed

andrea-perego mentioned this issue Jan 31, 2018

Proposal for relaxing domain of dcat:contactPoint #95

Closed

dr-shorthair added dcat:Dataset dcat:contactPoint labels Feb 1, 2018

dr-shorthair removed the quality label Feb 4, 2018

This was referenced Feb 9, 2018

Publication source [RPS] #78

Closed

Use PROV-O to satisfy provenance requirements #128

Closed

dr-shorthair assigned nicholascar Mar 7, 2018

dr-shorthair added this to the Dataset provenance patterns milestone Mar 16, 2018

dr-shorthair mentioned this issue Mar 27, 2018

Food for thought: optionally support PROV standard and ontology ckan/ckanext-dcat#105

Open

aisaac removed documentation labels May 29, 2018

riccardoAlbertoni mentioned this issue Jun 6, 2018

Added a section to deal with quality and started some guidance for r… #245

Merged

nicholascar removed the profile-negotiation label Jun 21, 2018

dr-shorthair mentioned this issue Jul 11, 2018

Project context [RPCX] #71

Closed

dr-shorthair mentioned this issue Jul 25, 2018

Version definition [RVSDF] #90

Closed

dr-shorthair mentioned this issue Aug 2, 2018

prov:wasGeneratedBy in context of dcat:dataset #312

Merged

dr-shorthair removed this from the Dataset provenance patterns milestone Aug 21, 2018

andrea-perego mentioned this issue Nov 20, 2018

Funding source [RFS] #66

Closed

davebrowning added this to the DCAT Backlog milestone Mar 14, 2019

andrea-perego modified the milestones: DCAT Future Priority Work, DCAT3 2PWD Nov 11, 2020

andrea-perego closed this as completed Mar 13, 2021

This was referenced Sep 26, 2021

Distinguish dataset owner, curator, steward, and other functions #1407

Closed

Support description of legal authorities #1406

Closed
