support non-owner edits and sensitive field edits #1058
Conversation
janvanrijn left a comment
I do not like the intention of this PR at all (nothing personal). By making a copy of the dataset when doing an edit, we change the semantic meaning of the edit function and we open a whole can of worms regarding predictability and maintainability.
I think that the question of what should happen when a user wants to edit a dataset that is not in their userspace is something that should be discussed in a broader setting.
Some options:
- reject the change (current situation)
- some discussion with the original owner (would be complicated)
- make a copy of the dataset (this PR, but I would heavily argue against it)
- Probably more options...
I agree that the semantic meaning of edit changes when we fork or clone the dataset in the edit function. Do you think it's better to create a fork_api which can fork the dataset under a different user name?
Clearly there is a difference between "experiment-critical" information and "non-critical" information. For the experiment-critical information, we should absolutely build a "fork" or "update" function, as we don't want this to falsify earlier results. I am not sure if forking is a good idea for the non-critical meta-data. One of the problems is that we will get many "very similar" datasets, and results that are meant to be on the same dataset will be scattered over different datasets, just because there was some typo in the dataset description. Before, we had some sort of wiki, where everyone could enter information and that tracked who added something and when. The wiki was introduced one day and retired another, both without any public discussion. Isn't there a similar system that we could use?
What about the wiki route? That would allow people to edit non-critical information on the current dataset.
As a clarification: some people really did not like the wiki, since anyone could mess up the description of a dataset (this happened to Pieter, for instance). In addition, the best open-source wiki engine we could find scaled poorly, was prone to crashing, and was not very secure. Another issue is that there were effectively two ways to change the description: via the wiki and via the API, which is not so nice and hard to maintain - better to have a single, controllable way of doing it. Another point is that we actually have more structured metadata (e.g. the names of the dataset creators and citation requests), but in the wiki this was all put into one unstructured document without a clear way to feed changes back into the metadata.

Hence, the idea was to not have a wiki anymore, but rather only use a proper API for changing the dataset. In addition, we would have a forum where people could make suggestions, but only the dataset owner could actually change the description. Even if the owner neglects this, the forum would show the discussion so people are aware of it. We discussed this at the Dagstuhl workshop, and also in one of the engineering calls, I believe? This is of course still open for discussion.

The purpose of forking is not to have a new version where a typo is fixed; it is meant for people who see actual big errors and want to create a better version that they maintain. Moreover, this is technically already possible: I can download a dataset, change the description, and reupload it as a new dataset in a few lines of code with the Python API. I agree that we must be wary of having many semi-identical versions of a dataset. Maybe you are right that we should only automatically fork when changing critical information. In that case, we should tell users to leave a comment in the forum?

We could also have a middle way, where the dataset owner indicates whether they want to allow community edits. They could allow this if they are happy to let the community fix typos and improve the description overall, or keep it closed if they feel strongly about managing that dataset. In that case, if the dataset is 'closed for edits', the data edit form could simply show a message that people should leave comments in the forum. We could also discuss this more on Slack?
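For reference, a minimal sketch of the "download, change the description, reupload" route mentioned above, using the openml-python package (the function names and signatures are as I recall them and may differ between versions, so treat this as illustrative rather than authoritative):

```python
import openml

# Download an existing dataset (id 31 is just an example).
dataset = openml.datasets.get_dataset(31)
df, _, _, _ = dataset.get_data(dataset_format="dataframe")

# Re-upload it as a *new* dataset with an edited description.
new_dataset = openml.datasets.create_dataset(
    name=dataset.name,
    description="Corrected description ...",
    creator=dataset.creator,
    contributor=None,
    collection_date=dataset.collection_date,
    language="English",
    licence=dataset.licence,
    attributes="auto",  # infer attribute types from the dataframe
    data=df,
    default_target_attribute=dataset.default_target_attribute,
    ignore_attribute=None,
    citation=dataset.citation,
)
new_dataset.publish()
```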
Altogether, I am not saying that we should go for the wiki route, but there are various systems that allow people to suggest changes, and other people to approve them.
Doesn't that mean that we didn't configure the wiki correctly? I would guess that a credibility system, where people build up credibility over the posts they make, would be able to alleviate this.
Can you elaborate on these three points?
That is something that we have total control over! We can disallow people from changing the dataset description. Btw, when the wiki was in place, there was no dataset edit option ;)
We should indeed provide a facility for that.
Those calls are quite often organized and cancelled last minute, and the discussions are not very structured. Moreover, Dagstuhl was almost a year ago. I do not think that we should use one of the many discussions at Dagstuhl to give direction to this discussion. I would actually propose to have, at some point, a discussion about how these decisions are made.
But how are you going to enforce this? In the past, we too often used the strategy "let's implement it and see if it works". I do not think that this is a viable way of keeping a system with as much data as OpenML reliable.
What is wrong with GitHub?
That's not supported by the wiki engine. Do you know one that does? It's more common in forums (e.g. Discourse has a user reputation feature).
We used Gollum, which is basically the same engine as GitHub uses, but it gets slow if you have a lot of pages, and we had tens of thousands of pages. It also crashed unpredictably, and at one point it was hacked and I had to remove a lot of ads that had been put into the descriptions.
I don't see how you can easily control who can edit and who not in a wiki. What's wrong with having a proper API for this?
I don't see what you mean? If we put the author names into a wiki page, they become markdown in a larger page. It's hard to detect those changes and then restore them in a structured way. Having both a structured and unstructured way to store data is very hard to maintain. A wiki is also yet another component in an already complicated system and requires more maintenance.
They are organized every third Tuesday of the month at 17:00. We discuss the things that are on the Engineering roadmap; you can put things there if you want to discuss them: https://github.com/orgs/openml/projects/2
No, we discuss it and then we implement it. We had a lot of discussion about this before we started implementing it. We did so with full faithfulness to the discussions we had before on this topic. I would say that having a data edit API makes the system a lot more reliable than a combination of APIs and wikis.
Nothing really, but I'm afraid that not everyone keeps an eye on this issue tracker and it's slower than a Slack discussion. We could at least post a message here to join the discussion? In short, do you still believe that we should have both a data edit API for the critical data properties, and a separate wiki service to edit the other properties? Why not have a proper API for all dataset metadata and proper controls over who can edit what? It's simpler and easier to maintain. I thought we already moved past this based on previous discussions...
Hi Sahi, I just did a call with Jan to facilitate the discussion, since a lot gets lost in writing. Here's my summary (Jan, please add if you think I misunderstood or forgot something):
High level:
What to do:
@janvanrijn I will edit the current PR such that a fork is created only when critical features are edited. Non-critical features will be directly overwritten by both owner and non-owner (until we support versioning and a community-edit flag). We can later add the community-edit flag and then block edits when the owner does not allow them. Dataset description versioning could be a new PR after we agree on the best way to do this (database/git).
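A rough sketch of the decision logic proposed above (illustrative Python pseudocode; the actual server side is PHP, the helper fork_and_apply is hypothetical, and the exact behavior was still under discussion at this point):

```python
# Critical (experiment-relevant) fields, as discussed in this thread.
CRITICAL_FIELDS = {"default_target_attribute", "row_id_attribute", "ignore_attribute"}

def handle_edit(dataset, edited_fields, user):
    """Sketch of the proposed edit handling: fork on critical edits, overwrite otherwise."""
    if CRITICAL_FIELDS & edited_fields.keys():
        # Critical fields: create a forked copy instead of mutating the dataset
        # that existing tasks and runs may depend on.
        return fork_and_apply(dataset, edited_fields, new_owner=user)  # hypothetical helper
    # Non-critical metadata: overwrite in place (owner or non-owner alike,
    # until versioning / a community-edit flag is supported).
    dataset.update(edited_fields)
    return dataset
```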
I think the dataset edit function should never make an automatic fork. That changes the semantic meaning of 'edit'. I would suggest:
A separate 'dataset fork' endpoint could be provided for other changes.
I understand that the semantic meaning of edit changes by creating a copy. But there is no way to edit critical fields without creating a copy, so the main purpose of such a fork endpoint would be to allow critical field edits. The fork would then be an update/edit, right? The fork API is not going to just fork/clone the dataset; it is fork+edit. The two APIs would then be a critical_data_edit and a non_critical_data_edit API (the names could be improved). The main reason I thought of doing this in a single API is that it is easier for the user, who will not have to use two endpoints for these two cases.
Exactly! More specifically, you could state all_data_edit and non_critical_data_edit. I would call these functions data_fork and data_edit, respectively.
The fork endpoint should be used with extreme care. I would personally hope that users only use it when a problem is identified with the previous version of the dataset.
Yes, I agree. I will make the changes and keep you posted. Thanks!
We actually wanted to create separate fork and edit calls initially. It is semantically cleaner. The reason to do 'fork+edit' is to avoid creating two new datasets (the fork itself and then the new version with the fixed critical features). Let's go with the suggested names. IIUC these would be the semantics and protocol:
To be honest, if we go this way, I would find it semantically cleaner if data_fork would only "fork" the dataset: copy it and keep a reference to the previous version id. However, then we need another way to change critical features. For instance, we could allow dataset owners to change critical features, but ONLY if there are no existing tasks created on the dataset. Would that not be cleaner?

If a user sees a critical issue with a dataset, they would first fork it, then change critical properties (and other properties at will), and then create new tasks. This also avoids users having to re-fork if they make a mistake: they can fix mistakes as long as they haven't created tasks yet. Just an idea... we can also go with the combined fork+edit call.
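A minimal sketch of this fork-then-edit workflow from the client side (the function names fork_dataset and edit_dataset are illustrative assumptions; they did not necessarily exist in openml-python at the time of this discussion):

```python
import openml

original_id = 31  # example dataset id

# 1. Fork: copy the dataset, keeping a reference to the original version.
forked_id = openml.datasets.fork_dataset(original_id)

# 2. Edit critical properties on the fork; this is allowed because
#    no tasks have been created on the fork yet.
openml.datasets.edit_dataset(
    forked_id,
    default_target_attribute="class",
    ignore_attribute=["row_id"],
)

# 3. Only then create new tasks on the corrected fork, so that existing
#    tasks and runs on the original dataset stay untouched.
```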
Yes, I think this approach sounds cleaner.
Sounds OK to me.
Yes. Same file id.
LGTM. I'll merge to develop. That way Sahi can continue with the Python API. Let's not forget to do a double-check when Jan is back from holidays.
Hey, a bit late to the discussion. Sorry! Sahi had mentioned the discussions to me but I somehow read past it.
I like this, but I'd like to propose a minor change. As noted, we should avoid using these dataset properties directly when running a task, as that makes this dataset meta-data critical to defining the task. So I would propose to use this dataset meta-data when creating the task, to define the task's default values. I'm not sure if that is very clear, so let me try to give an example:
End Results:
Any future tasks will correctly ignore the column.
Thanks Pieter. I agree with your observations. We can probably also do an automated check for row_id columns, and alert the uploader in case we find a potential unmarked row_id column?
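For illustration, a rough sketch of what such an automated check could look like (pandas-based; the heuristic and function names are assumptions made for this example, not part of the PR):

```python
import pandas as pd

def looks_like_row_id(column: pd.Series) -> bool:
    """Heuristic: integer-valued, all unique, forming a 0..n-1 or 1..n range."""
    if not pd.api.types.is_integer_dtype(column):
        return False
    if column.nunique() != len(column):
        return False
    values = column.sort_values().to_numpy()
    return (values == range(len(values))).all() or (values == range(1, len(values) + 1)).all()

def potential_row_id_columns(df: pd.DataFrame) -> list:
    """Column names the uploader could be alerted about as potential unmarked row_id columns."""
    return [name for name in df.columns if looks_like_row_id(df[name])]
```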
There is still a behavioral change, though: the task would store this column information at creation time. That is different from the situation right now, where tasks do not store this column information. If we were to edit the dataset afterwards, existing tasks would keep the values they were created with.
Yes, I think when/if we provide a tool for dataset upload which automatically marks column types (e.g. nominal vs. numeric), it makes sense to also identify columns which look like row ids.
$data_id = $this->input->post('data_id');
// If data id is not given
if( $data_id == false ) {
  $this->returnError( 1070, $this->version );
no spaces after ( or before ) (I know we are not consistent with this, but would be good to keep this in mind)
$update_fields = array();
foreach($xml->children('oml', true) as $input) {
  // iterate over all fields, does not check for legal fields, as it wont match the xsd.
    $name = $input->getName() . '';
too much indent?
compared to comment
return;

// If critical fields need to be edited
if (isset($update_fields['default_target_attribute']) || isset($update_fields['row_id_attribute']) || isset($update_fields['ignore_attribute'])) {
rather than checking these one by one, I would actually prefer to create an "edit XSD", so that we have this consistent in our set of XSD documents (that are easier to review than the code itself)
Two new scenarios are supported in the data_edit API as part of this PR:
- Editing by a non-owner of the dataset
- Editing sensitive meta-data by the dataset owner
The summary of how the different scenarios are handled is shown in the table here:
