Provenance information [RPIF] #76

Closed
jpullmann opened this issue Jan 18, 2018 · 20 comments

Comments

@jpullmann

Provenance information [RPIF]

Provide a way to link to structured information about the provenance of a dataset including:

  • the input data used to create a dataset to the dataset.
  • the software used to produce the dataset to the dataset.
  • an extensible model for different types of agent roles
  • funders

dct:creator, dct:publisher etc. are special cases that require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special-case properties of dcat:Dataset.


Related use cases:

  • Common requirements for scientific data [ID9]
  • Modeling data lineage [ID12]
  • Modeling agent roles [ID13]
  • Modeling funding sources [ID31]

@andrea-perego
Contributor

andrea-perego commented Jan 20, 2018

About agent roles, I have a specific proposal: relaxing the domain of dcat:contactPoint so that it can be used not only for datasets but also for other resources (e.g., catalogues, catalogue records).

This issue popped up during the development of GeoDCAT-AP, since all the agent roles (dcat:contactPoint included) supported in ISO 19115 can be specified for any resource.

Besides this, the "contact point" is probably the most important role for data consumers, not only for datasets. For instance, for a dcat:Catalog it is possible to specify the dct:publisher, but if I need to ask questions and/or report issues about the catalogue I need to get in touch with the publisher's dedicated contact point.
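A minimal Turtle sketch of what this relaxation would permit, assuming dcat:contactPoint may be used on a dcat:Catalog. The example.org names and the vCard details are placeholders, not taken from any real catalogue.

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix :      <http://example.org/> .   # hypothetical namespace

# A catalogue with both a publisher and a dedicated contact point.
# Using dcat:contactPoint here assumes its domain has been relaxed.
:Catalog_A a dcat:Catalog ;
    dct:publisher :Agency_A ;
    dcat:contactPoint [
        a vcard:Organization ;
        vcard:fn "Catalogue help desk" ;                  # placeholder name
        vcard:hasEmail <mailto:helpdesk@example.org>      # placeholder address
    ] .
```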

I wonder whether this (and similar revisions to DCAT) requires the creation of a separate use case.

⚠️ As decided at the end of the DCAT subgroup telecon of 31 Jan 2018, a separate issue has been created (#95)

This was referenced Jan 21, 2018
@nicholascar
Contributor

nicholascar commented Jan 30, 2018

Several patterns for "providing a way to link to structured information about the provenance of a dataset" are given both in PROV and in patterns by the RDA Prov Patterns WG, such as http://patterns.promsns.org/pattern/12. We should reuse these.

@riccardoAlbertoni
Contributor

riccardoAlbertoni commented Feb 2, 2018

I propose to untag "quality" from this issue, as this issue is more related to provenance than to quality. Clearly "provenance" might influence quality, but given that we have the tag "provenance", I think we can remove "quality".

@dr-shorthair
Contributor

A placeholder section/sub-section or proposal for the DCAT document would be appreciated - to alert the community when we release the FPWD.

@nicholascar
Contributor

nicholascar commented May 23, 2018

Regarding the first of the three items in the description of this requirement:
[Provide a way to link to] the input data used to create a dataset to the dataset:

Proposal

Assuming a close relationship between prov:Entity and dcat:Dataset has been established, use two of the three patterns described in the RDA's pattern "Associating metadata in documents with graph provenance" (the third pattern is not relevant):

Pattern 1: store provenance in a different document/service to the Dataset metadata and link with either prov:has_provenance or prov:has_query_service relations

This is appropriate when potentially detailed provenance information cannot be well catered for within the standard DCAT document. This will be the case in purpose-built systems that cater for DCAT but not all the possibilities of PROV, even for Dataset/Dataset (Entity/Entity) relations.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:has_provenance :Bundle_N .

Here the DCAT record for Dataset_X points to a document defined as a prov:Bundle, within which Dataset_X is referenced and any number of PROV provenance relationships are given, to other datasets in the same catalogue or in others.

Instead of a provenance document, a dataset could be linked to a provenance query service using prov:has_query_service.
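A sketch of both sides of Pattern 1 in Turtle, under the assumption that the provenance statements live in a separate document identified by :Bundle_N. The example.org namespace and the query service URL are placeholders; plain Turtle cannot scope statements to a bundle, so the split between the two documents is indicated only by comments.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix :     <http://example.org/> .   # hypothetical namespace

# --- In the DCAT document: only a pointer to the provenance record ---
:Dataset_X a dcat:Dataset ;
    prov:has_provenance :Bundle_N .

# Alternative pointer to a query service (placeholder URL):
# :Dataset_X prov:has_query_service <http://example.org/provenance-service> .

# --- In the separate provenance document identified by :Bundle_N ---
:Bundle_N a prov:Bundle .

:Dataset_X a prov:Entity ;
    prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .
```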

Pattern 2: link datasets directly to others with PROV-O relations
This is appropriate when the system used to store DCAT information can store any PROV relationships.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .

or, qualified forms (see https://www.w3.org/TR/prov-o/#qualifiedDerivation):

:Dataset_X prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity :Dataset_Y ;

    ## More details about the activity underpinning the derivation
    prov:hadGeneration :a_detailed_generation ;
    # ...
] , [
    a prov:Derivation ;
    prov:entity :Dataset_Z ;
    prov:hadGeneration :different_detailed_generation ;
    # ...
] .

@nicholascar
Contributor

Regarding the second of the three items in the description of this requirement:
[Provide a way to link to] the software used to produce the dataset to the dataset:

Proposal

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

Example:

Dataset X was derived from Dataset Y and the derivation was made using Software Z.

As long as the specific instance of software that was used can be recorded (i.e. not the URI of the GitHub repo but that of the specific commit that was used), the above can simply be recorded as:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Software_Z .

where the derivation from Software Z is understood to be a derivation by instruction due to :Software_Z being a prov:Plan. If this requires more spelling out:

:Dataset_X
    prov:wasDerivedFrom :Dataset_Y ;
    prov:qualifiedDerivation [
        prov:entity :Software_Z ; # still subclassed from Entity as Plan!
        prov:hadRole :some_special_role_for_software ;
    ] .

@nicholascar
Contributor

Regarding the third of the three items in the description of this requirement:
[Recommend] an extensible model for different types of agent roles

Proposal

For the general case of role or other qualifications, see Qualified forms [RQF] #79 where a proposal for qualified forms is made with agent roles as an example.

@larsgsvensson
Contributor

@nicholascar wrote:

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

I don't know if that's possible: Usually software is considered a prov:Agent, more specifically a prov:SoftwareAgent: "A software agent is running software."

@andrea-perego
Contributor

@nicholascar , some time ago I added examples of provenance patterns in the wiki, and some of them relate to yours:

https://github.com/w3c/dxwg/wiki/Provenance-patterns

Would you mind having a check, and see if you think they should be revised/extended?

@rob-metalinkage
Contributor

This gets messier with things like Shacl and Spin where the software is data.

Software is an entity, an instance of running software is an agent?

This fits with software being subject to processes such as automated testing.

@kcoyle
Contributor

kcoyle commented May 26, 2018

@rob-metalinkage Can you expound on this a bit?

"This gets messier with things like Shacl and Spin where the software is data"

Which software is data? I read Shacl as taking instance data as input, so I'm not sure which software you mean. But I may be thinking of something other than what you meant.

@nicholascar
Contributor

@rob-metalinkage @larsgsvensson we have long-standing precedent for modelling instances of software as prov:Plan (subClassOf prov:Entity) objects that guide a prov:Activity, which then produces things, and for modelling an executing agent, such as a server, as a prov:Agent. This is shown in quick outline in the RDA Provenance Patterns WG's pattern 18: https://patterns.promsns.org/pattern/18.

I have run this pattern, in which the instance of software used is modelled as a prov:Plan and the execution system running it as a prov:Agent or a prov:SoftwareAgent, past several original members of PROV (Luc & Paolo) as well as many PROV practitioners over many years, and it works fine, although it wasn't expressly catered for in the 2013 PROV publications.

The pattern is generalisable to include methods other than software, such as scientific methods.

I will re-document that pattern for the RDA WG in more detail shortly.
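A minimal Turtle sketch of the pattern described in this comment, assuming a single run that derives Dataset X from Dataset Y. The names :Run_1 and :Server_A and the example.org namespace are hypothetical; this is one reading of the pattern, not the RDA WG's own wording of it.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .   # hypothetical namespace

# The specific software instance (e.g. one commit) acts as the plan.
:Software_Z a prov:Plan .

# The executing system is the agent.
:Server_A a prov:SoftwareAgent .

# The run is the activity: it used the input and was guided by the plan.
:Run_1 a prov:Activity ;
    prov:used :Dataset_Y ;
    prov:wasAssociatedWith :Server_A ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :Server_A ;
        prov:hadPlan :Software_Z
    ] .

# The output is tied back to both the run and its input.
:Dataset_X a prov:Entity ;
    prov:wasGeneratedBy :Run_1 ;
    prov:wasDerivedFrom :Dataset_Y .
```

Replacing :Software_Z with a documented scientific method would leave the rest of the graph unchanged, which is the generalisation mentioned above.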

@larsgsvensson
Contributor

@nicholascar It seems that we need to define exactly what we mean by software... I'd say that we need to differentiate between the sequence of commands being executed (aka a programme), the execution of that programme (aka a process) and any input passed to that process (let's call it input).

If we look at the case of a SHACL engine validating a piece of RDF using a SHACL file, it seems to me that the execution of the validation is a prov:Activity that is executed by the SHACL engine (programme; prov:SoftwareAgent rdfs:subClassOf prov:Agent) and uses the SHACL file (input; prov:Plan rdfs:subClassOf prov:Entity). The process is tricky and doesn't always map to PROV-O, since the validation can be done by a server process (e.g. a servlet) that is already running and handles several validations in parallel.

If that's what you mean, I fully agree. And we need a better word than "software".

@nicholascar
Contributor

@larsgsvensson the differentiation you describe is how I describe things so I agree with your general characterisations.

I do agree that defining a process can be tricky but if we stick to the "provenance that we want", not a "provenance that could be modelled" then we can usually do something sensible. In the example you give of a servlet validating something I would model it thus:

  • the process as a prov:Activity - starting and ending with the processing of the RDF of interest, regardless of any other jobs it may be doing (we don't care about those)
  • the servlet as a prov:SoftwareAgent, if that's important to know, or perhaps the server itself
    • the choice of which Agent to model will come down to which facts are most important to know for a Use Case, such as recording info for potential process recreation
  • the input of the RDF file being validated as a prov:Entity
  • the input of a SHACL file as a prov:Entity - not a prov:Plan
    • here the SHACL file is not instructing the Activity. It's determining a validation assessment but the conducting of the Activity itself is, in fact, guided by the code that applies the validation to the data, the SHACL file to the input RDF.
  • the output of the validation task - pass, fail, error messages etc - a prov:Entity that prov:wasDerivedFrom the two inputs AND the prov:Plan that instructed that the SHACL input be applied to the RDF input

So this modelling will allow someone to see when (Activity) something (whichever Agent) did what (Plan) with what inputs (Entity x 2) and what output (Entity). Sure, you could model things differently but what's the Use Case?
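The modelling listed above could be written in Turtle roughly as follows. This is a sketch only: every name under example.org (:Validation_1, :Servlet_A, :Validator_Code and so on) is invented for illustration, and the timestamps are placeholder values.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/> .   # hypothetical namespace

# The process: one validation job, bounded by the handling of this input only.
:Validation_1 a prov:Activity ;
    prov:startedAtTime "2018-05-28T10:00:00Z"^^xsd:dateTime ;   # placeholder
    prov:endedAtTime   "2018-05-28T10:00:02Z"^^xsd:dateTime ;   # placeholder
    prov:used :Input_RDF , :Shapes_File ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :Servlet_A ;
        prov:hadPlan :Validator_Code    # the code that applies shapes to data
    ] .

:Servlet_A      a prov:SoftwareAgent .
:Input_RDF      a prov:Entity .
:Shapes_File    a prov:Entity .         # an input, deliberately not a prov:Plan
:Validator_Code a prov:Plan .

# The output: derived from both inputs and from the plan.
:Report_1 a prov:Entity ;
    prov:wasGeneratedBy :Validation_1 ;
    prov:wasDerivedFrom :Input_RDF , :Shapes_File , :Validator_Code .
```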

@azaroth42

Could someone clarify the relevance of profile_negotiation to this issue, or remove the tag?

@dr-shorthair
Contributor

dr-shorthair commented Jul 25, 2018

Provenance information should probably be available at the level of representations (dcat:Distributions) as well as dcat:Datasets
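One way this might look in Turtle, assuming the same PROV-O relations are simply attached to a dcat:Distribution. The :Export_1 activity and the other example.org names are hypothetical.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix :     <http://example.org/> .   # hypothetical namespace

:Dataset_X a dcat:Dataset ;
    prov:wasDerivedFrom :Dataset_Y ;         # dataset-level provenance
    dcat:distribution :Dataset_X_csv .

# Distribution-level provenance: how this particular representation was made.
:Dataset_X_csv a dcat:Distribution ;
    prov:wasGeneratedBy :Export_1 .

:Export_1 a prov:Activity .                  # e.g. a format conversion run
```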

(Does this need a new Issue?)

@andrea-perego
Contributor

I think we should consider here cases where "provenance" is expressed in a discursive way - e.g., when describing the dataset lineage (as mentioned in UC9).

This is quite a common practice for scientific data and in some domains, such as the geospatial one. In most cases, these lineage descriptions are such that they cannot be easily converted into a machine-actionable representation.

In DCAT-AP, this is done by using dct:provenance/dct:ProvenanceStatement/rdfs:label, and according to the report on DCAT-AP usage statistics from the European Data Portal, this information is included in more than 50% of the EDP records (391,616).
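A short Turtle sketch of the DCAT-AP practice just described. The lineage text and the example.org names are invented for illustration only.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/> .   # hypothetical namespace

# Free-text lineage attached to a dataset; the label is a made-up example.
:Dataset_X a dcat:Dataset ;
    dct:provenance [
        a dct:ProvenanceStatement ;
        rdfs:label "Digitised from paper survey sheets and manually corrected."@en
    ] .
```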

It may be worth considering its inclusion in DCAT.

@davebrowning davebrowning added this to the DCAT Backlog milestone Mar 14, 2019
@andrea-perego
Contributor

As there has been no further discussion on this issue, I propose to close it.

@andrea-perego
Contributor

As there has been no further discussion on this issue, I propose to close it.

Noting no objections, I am closing this issue.
