
Conversation

@sahithyaravi
Member

@sahithyaravi sahithyaravi commented Aug 7, 2020

Two new scenarios are supported in the data_edit API as part of this PR.
Editing by a non-owner of the dataset

  • Clones the dataset with a different user ID => yields a new Data ID
  • Uses the same "arff" file, as we are just cloning the row in the DB
  • Edits the requested fields in the new Data ID
  • The old Data ID remains unchanged

Editing sensitive meta-data by the dataset owner

  • default_target_attribute, ignore_attribute, row_id_attribute
  • Clones the dataset => yields a new Data ID
  • Uses the same "arff" file, as we are just cloning the row in the DB
  • Edits the requested sensitive fields in the new Data ID
  • The old Data ID remains unchanged

The summary of how the different scenarios are handled is shown in a table in the original comment (screenshot not reproduced here); a small sketch of the clone-and-edit flow follows.
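For illustration, here is a minimal, self-contained sketch of the clone-and-edit behaviour described above. This is plain Python with sqlite3 standing in for the real PHP backend and its database; the table layout and column names are simplified assumptions, not the actual OpenML schema.

```python
import sqlite3

# Toy stand-in for the dataset table; the real OpenML schema is much richer.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dataset (
    did INTEGER PRIMARY KEY,        -- data ID
    uploader INTEGER,               -- user ID of the owner
    file_id INTEGER,                -- reference to the stored arff file
    default_target_attribute TEXT,
    description TEXT)""")
con.execute("INSERT INTO dataset VALUES (1, 100, 555, 'class', 'original description')")

def clone_and_edit(con, did, new_uploader, edits):
    """Clone the dataset row under a new data ID (same file_id), apply the
    requested edits to the clone only, and leave the original row untouched."""
    columns = ("did", "uploader", "file_id", "default_target_attribute", "description")
    row = dict(zip(columns, con.execute(
        "SELECT * FROM dataset WHERE did = ?", (did,)).fetchone()))
    row.pop("did")                  # a fresh data ID is assigned on insert
    row["uploader"] = new_uploader  # the clone belongs to the requesting user
    row.update(edits)               # apply the requested field edits
    cur = con.execute(
        "INSERT INTO dataset (uploader, file_id, default_target_attribute, description) "
        "VALUES (:uploader, :file_id, :default_target_attribute, :description)", row)
    return cur.lastrowid            # the new data ID

new_did = clone_and_edit(con, 1, new_uploader=200,
                         edits={"default_target_attribute": "survived"})
print(new_did)                                                         # e.g. 2
print(con.execute("SELECT * FROM dataset WHERE did = 1").fetchone())   # unchanged
```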

@sahithyaravi sahithyaravi self-assigned this Aug 7, 2020
@sahithyaravi sahithyaravi requested a review from janvanrijn August 7, 2020 11:13
@sahithyaravi sahithyaravi changed the title from "support non-owner edtis and sensitive field edits" to "support non-owner edits and sensitive field edits" Aug 7, 2020
Member

@janvanrijn janvanrijn left a comment


I do not like the intention of this PR at all (nothing personal). By making a copy of the dataset when doing an edit, we change the semantic meaning of the edit function and open a whole can of worms regarding predictability and maintainability.

I think that the question of what should happen when a user wants to edit a dataset that is not under their userspace should be discussed in a broader setting.

Some options:

  • reject the change (current situation)
  • some discussion with the original owner (would be complicated)
  • make a copy of the dataset (this PR, but I would heavily argue against it)
  • probably more options...

@sahithyaravi
Member Author

sahithyaravi commented Aug 10, 2020

I agree that the semantic meaning of edit changes when we fork or clone the dataset in the edit function.

Do you think it's better to create a fork_api which can fork the dataset with a different user name?
It would basically copy the row as such and change the user ID.
After forking, the user would then have to call edit_api on their own dataset?

@janvanrijn
Member

Clearly there is a difference between "experiment critical" information and "non-critical" information. For the experiment critical information, we should absolutely build a "fork" or "update" function, as we don't want this to falsify earlier results.

I am not sure if forking is a good idea for the non-critical meta-data. One of the problems is that we will get many "very similar" datasets, and results that are meant to be on the same datasets will be scattered over different datasets, just because there was some typo in the dataset description.

Before, we had some sort of wiki, where everyone could enter information and which tracked who added something and when. The wiki was introduced one day and retired on another, both without any public discussion. Isn't there a system similar to this that we could use?

@sahithyaravi
Member Author

sahithyaravi commented Aug 10, 2020

Clearly there is a difference between "experiment critical" information and "non-critical" information. For the experiment critical information, we should absolutely build a "fork" or "update" function, as we don't want this to falsify earlier results.
Yes I agree
I am not sure if forking is a good idea for the non-critical meta-data. One of the problems is that we will get many "very similar" datasets, and results that are meant to be on the same datasets will be scattered over different datasets, just because there was some typo in the dataset description.
In the current API, for owners, we do not fork when edits are requested for non-critical meta-data, so we may not create many unwanted copies; only changes to critical fields would create the fork. But for non-owners, we do create a fork for all changes, which may lead to too many copies. _But I am not sure how to allow the non-owner to edit a dataset without creating these copies. These copies do not create new files, they just create a new row in the DB. But it is still going to spam the database if used for minor edits frequently._

@janvanrijn
Member

What about the Wiki route? That would allow people to edit non-critical information to the current dataset.

@sahithyaravi
Member Author

What about the Wiki route? That would allow people to edit non-critical information to the current dataset.
Sorry, I am not sure I understand the Wiki route? Is this like a forum where owners can approve changes?

@joaquinvanschoren
Contributor

joaquinvanschoren commented Aug 10, 2020

As a clarification: some people really did not like the wiki, since anyone could mess up the description of the dataset (this happened to Pieter, for instance), plus the best open source wiki engine we could find scales poorly, it was prone to crashing, and not very secure. Another issue is that there were effectively 2 ways to change the description: via the wiki and via the API, and that's not so nice and hard to maintain - better to have a single, controllable way of doing it. Another point was that we actually have more structured metadata (e.g. the names of the dataset creators and citation requests), but in the wiki this is all put into one unstructured document without a clear way to feed changes back into the metadata.

Hence, the idea was to not have a wiki anymore, but rather only use a proper API for changing the dataset. In addition, we would have a forum where people could make suggestions, but only the dataset owner can actually change the description. Even if the owner neglects this, the forum would show the discussion so people are aware of it.

We discussed this at the Dagstuhl workshop, and also in one of the engineering calls, I believe? This is of course still open for discussion.

The purpose of forking is not to have a new version where a typo is fixed, it is meant for people seeing actual big errors and wanting to create a better version that they maintain. Moreover, this is technically already possible: I can download a dataset, change the description, and re-upload it as a new dataset in a few lines of code in the Python API.
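For concreteness, such a manual "download, tweak, re-upload" fork might look roughly like the sketch below with the openml-python package. The attribute and function names follow the openml-python documentation as I understand it; treat the exact signatures (in particular `create_dataset` and `attributes="auto"`) as assumptions and check the docs before relying on this.

```python
import openml

old = openml.datasets.get_dataset(31)                     # e.g. the "credit-g" dataset
df, _, _, _ = old.get_data(dataset_format="dataframe")    # full table, incl. target column

new = openml.datasets.create_dataset(
    name=old.name,
    description=old.description + "\n\n(Manually forked copy with a fixed description.)",
    creator=old.creator,
    contributor=old.contributor,
    collection_date=old.collection_date,
    language=old.language,
    licence=old.licence,
    attributes="auto",                  # infer attribute types from the DataFrame dtypes
    data=df,
    default_target_attribute=old.default_target_attribute,
    ignore_attribute=old.ignore_attribute,
    citation=old.citation,
)
new.publish()                           # uploaded as a brand-new dataset with a new data ID
```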

I agree that we must be wary of having many semi-identical versions of a dataset. Maybe you are right that we should only automatically fork when changing critical information. In that case, we should tell users to leave a comment in the forum?

We could also have a middle way, where the dataset owner indicates whether they want to allow community edits. They could allow this if they are happy to let the community fix typos and improve the description overall, or keep it closed if the uploader feels strongly about managing that dataset. In that case, if the dataset is 'closed for edits', the data edit form could simply show a message that people should leave comments in the forum.

We could also discuss this more on Slack?

@janvanrijn
Member

Altogether, I am not saying that we should go for the wiki route, but there are various systems that allow some people to suggest changes and other people to approve them.

As a clarification: some people really did not like the wiki, since anyone could mess up the description of the dataset (this happened to Pieter, for instance),

Doesn't that mean that we didn't configure the wiki correctly? I would guess that a credibility system where people obtain credibility over the posts they make would be able to alleviate this.

plus the best open source wiki engine we could find scales poorly, it was prone to crashing, and not very secure.

Can you elaborate on these three points?

Another issue is that there were effectively 2 ways to change the description: via the wiki and via the API, and that's not so nice and hard to maintain - better to have a single, controllable way of doing it.

That is something that we have total control over! We can disallow people to change the dataset description. Btw, when the wiki was in place, there was no dataset edit option ;)

Another point was that we actually have more structured metadata (e.g. the names of the dataset creators and citation requests), but in the wiki this is all put into one unstructured document without a clear way to feed changes back into the metadata.

We should indeed facilitate that.

We discussed this at the Dagstuhl workshop, and also in one of the engineering calls, I believe? This is of course still open for discussion.

Those calls are organized and cancelled quite often at the last minute, and the discussions are not very structured. Moreover, Dagstuhl was almost a year ago. I do not think that we should use one of the many discussions in Dagstuhl to give direction to this discussion. I would actually propose to have, at some point, a discussion about how these decisions are made.

The purpose of forking is not to have a new version where a typo is fixed, it is meant for people seeing actual big errors and wanting to create a better version that they maintain.

But how are you going to enforce this? In the past, we too often used the strategy "let's implement it and see if it works". I do not think that this is a viable way of keeping a system with as much data as OpenML reliable.

We could also discuss this more on Slack?

What is wrong with Github?

@joaquinvanschoren
Contributor

Doesn't that mean that we didn't configure the wiki correctly? I would guess that a credibility system where people obtain credibility over the posts they make would be able to alleviate this.

That's not supported by the wiki engine. Do you know one that does? It's more common in forums (e.g. Discourse has a user reputation feature).

plus the best open source wiki engine we could find scales poorly, it was prone to crashing, and not very secure.
Can you elaborate on these three points?

We used Gollum, which is basically the same engine as GitHub uses, but it gets slow if you have a lot of pages, and we had 10s of thousands of pages. It also crashes unpredictably and one time it was hacked and I had to remove a lot of ads that were put into the descriptions.

That is something that we have total control over! We can disallow people to change the dataset description. Btw, when the wiki was in place, there was no dataset edit option ;)

I don't see how you can easily control who can edit and who not in a wiki. What's wrong with having a proper API for this?

We should indeed facilitate that.

I don't see what you mean? If we put the author names into a wiki page, they become markdown in a larger page. It's hard to detect those changes and then restore them in a structured way. Having both a structured and unstructured way to store data is very hard to maintain. A wiki is also yet another component in an already complicated system and requires more maintenance.

Those calls are organized and cancelled quite often at the last minute, and the discussions are not very structured. Moreover, Dagstuhl was almost a year ago. I do not think that we should use one of the many discussions in Dagstuhl to give direction to this discussion. I would actually propose to have, at some point, a discussion about how these decisions are made.

They are organized on the third Tuesday of every month at 17:00. We discuss the things that are on the Engineering roadmap; you can put things there if you want to discuss them: https://github.com/orgs/openml/projects/2
In the last months we had to cancel a few because of the corona situation and because of overlap with the online hackathon.
All help to organize them better is very much appreciated.

But how are you going to enforce this? In the past, we too often used the strategy "let's implement it and see if it works". I do not think that this is a viable way of keeping a system with as much data as OpenML reliable.

No, we discuss it and then we implement it. We had a lot of discussion about this before we started implementing it. We did so with full faithfulness to the discussions we had before on this topic.
In any case, it's not about enforcement. If people want they can always download the dataset and upload a new version, or worse: upload it as a completely new dataset. It's about giving people a proper way to maintain dataset quality.

I would say that having a data edit API makes the system a lot more reliable than a combination of APIs and wikis.

We could also discuss this more on Slack?

What is wrong with Github?

Nothing really, but I'm afraid that not everyone keeps an eye on this issue tracker and it's slower than a Slack discussion. We could at least post a message to join the discussion here?

In short, do you still believe that we should have both a data edit API for the critical data properties, and a separate wiki service to edit the other properties? Why not have a proper API for all dataset metadata and proper controls over who can edit what? It's simpler and easier to maintain. I thought we already moved past this based on previous discussions...

...

@joaquinvanschoren
Contributor

Hi Sahi, I just did a call with Jan to facilitate the discussion, since a lot gets lost in writing. Here's my summary (Jan, please add if you think I misunderstood or forgot something):

High level:

  • The underlying problem is that we have dataset versioning, but not separate versioning for dataset descriptions. Changing the description should lead to a new description version, but not a new dataset version.
  • A deeper problem is that the 'critical' properties (e.g. target/ignore/row_id column) should actually be stored at task level, not dataset level. If we did that, datasets would not have critical fields; only tasks would have them. Changing a critical field would then create a new task, not a new dataset version. We can't easily change this, since it requires changes to all the client APIs as well. We could put it on the roadmap for a version 2 of the API.

What to do:

  • We can use the edit API as you implemented, but let's not automatically fork the dataset if a user changes a non-critical field, because this may indeed lead to many dataset versions with minor changes. Instead, it seems like a good idea to let the owner of the dataset indicate whether they allow community edits or not. If the owner does not allow it, the user would have to post a message on the forum with a suggestion to correct something.
  • Think about how we could have dataset description versioning, because we will overwrite the previous description in this implementation, right? Should we store older versions of the description in the database (i.e. implement our own versioning), or maybe we could post the dataset description (e.g. as a JSON string) to a git repository after it is edited, which has better versioning support but of course adds complexity.

@sahithyaravi
Member Author

sahithyaravi commented Aug 10, 2020

Hi Sahi, I just did a call with Jan to facilitate the discussion, since a lot gets lost in writing. Here's my summary (Jan, please add if you think I misunderstood or forgot something):

High level:

  • The underlying problem is that we have dataset versioning, but not separate versioning for dataset descriptions. Changing the description should lead to a new description version, but not a new dataset version.
  • A deeper problem is that the 'critical' properties (e.g. target/ignore/row_id column) should actually be stored at task level, not dataset level. If we did that, datasets would not have critical fields; only tasks would have them. Changing a critical field would then create a new task, not a new dataset version. We can't easily change this, since it requires changes to all the client APIs as well. We could put it on the roadmap for a version 2 of the API.

What to do:

  • We can use the edit API as you implemented, but let's not automatically fork the dataset if a user changes a non-critical field, because this may indeed lead to many dataset versions with minor changes. Instead, it seems like a good idea to let the owner of the dataset indicate whether they allow community edits or not. If the owner does not allow it, the user would have to post a message on the forum with a suggestion to correct something.
  • Think about how we could have dataset description versioning, because we will overwrite the previous description in this implementation, right? Should we store older versions of the description in the database (i.e. implement our own versioning), or maybe we could post the dataset description (e.g. as a JSON string) to a git repository after it is edited, which has better versioning support but of course adds complexity.

@janvanrijn I will edit the current PR such that a fork is created only when critical features are edited. Non-critical features will be directly overwritten by both owner and non-owner (until we support versioning and a community-edit flag).

We can later add the community-edit flag and then block edits when the owner does not allow them.

Dataset description versioning could be a new PR after we agree on the best way to do this (database/git).
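For the database route, a minimal sketch of what description versioning could look like is below: an append-only history table, so an edit adds a new row instead of overwriting the old text. The table and column names are made up for illustration; this is not the actual OpenML schema.

```python
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dataset_description_version (
    did INTEGER,            -- data ID the description belongs to
    version INTEGER,        -- 1, 2, 3, ... per dataset
    editor INTEGER,         -- user ID of whoever made the edit
    edited_at TEXT,
    description TEXT,
    PRIMARY KEY (did, version))""")

def edit_description(con, did, editor, new_text):
    """Store the new description as the next version; earlier versions stay readable."""
    (latest,) = con.execute(
        "SELECT COALESCE(MAX(version), 0) FROM dataset_description_version WHERE did = ?",
        (did,)).fetchone()
    con.execute("INSERT INTO dataset_description_version VALUES (?, ?, ?, ?, ?)",
                (did, latest + 1, editor, datetime.now(timezone.utc).isoformat(), new_text))

edit_description(con, did=1, editor=100, new_text="Initial description.")
edit_description(con, did=1, editor=200, new_text="Initial description, typo fixed.")
print(con.execute("SELECT version, editor, description "
                  "FROM dataset_description_version WHERE did = 1").fetchall())
```

The git variant mentioned above would instead serialize the description (e.g. as JSON) and commit it after every edit, which gives diffs and history for free but adds another moving part to the system.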

@janvanrijn
Member

I think the dataset edit function should never make an automatic fork. That changes the semantic meaning of 'edit'.

I would suggest:

  • accept all changes to the experimental non-critical fields
  • block all other changes.

A separate 'dataset fork' endpoint could be provided for other changes.
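In code, that suggestion amounts to a simple whitelist check in the edit endpoint, along the lines of the sketch below (illustrative Python, not the actual PHP backend; the field names are examples based on fields mentioned in this thread):

```python
# Fields whose edits do not affect existing experiments (per the discussion above).
NON_CRITICAL_FIELDS = {"description", "creator", "citation"}

def validate_edit(update_fields: dict) -> None:
    """Accept edits to non-critical fields only; reject everything else."""
    blocked = set(update_fields) - NON_CRITICAL_FIELDS
    if blocked:
        raise PermissionError(
            f"Fields {sorted(blocked)} cannot be edited; use the fork endpoint instead.")

validate_edit({"description": "fixed a typo"})          # accepted
validate_edit({"default_target_attribute": "class"})    # raises PermissionError
```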

@sahithyaravi
Member Author

sahithyaravi commented Aug 10, 2020

I think the dataset edit function should never make an automatic fork. That changes the semantic meaning of 'edit'.

I would suggest:

  • accept all changes to the experimental non-critical fields
  • block all other changes.

A separate 'dataset fork' endpoint could be provided for other changes.

I understand that the semantic meaning of edit changes by creating a copy. But there is no way to edit critical fields without creating the copy. So, the main purpose of such a fork endpoint would be to allow critical field edits. The fork would then be an update/edit, right? The fork API is not going to just fork/clone the dataset; it is fork+edit. The two APIs would then be a critical_data_edit and a non_critical_data_edit API (the names could be improved). The main reason I thought of doing this in a single API is that it is easier for the user, who will not have to use two endpoints for these two cases.

@janvanrijn
Member

So, the main purpose of such a fork endpoint would be to allow critical field edits. The fork would then be an update/edit, right? The fork API is not going to just fork/clone the dataset; it is fork+edit. The two APIs would then be a critical_data_edit and a non_critical_data_edit API (the names could be improved).

Exactly! More specifically, you could state all_data_edit and non_critical_data_edit. I would call these functions data_fork and data_edit, respectively.

The main reason I thought of doing this in a single API is that it is easier for the user, who will not have to use two endpoints for these two cases.

The fork endpoint should be used with extreme care. I would personally hope that users would only use them when a problem is identified with the previous version of the dataset.

@sahithyaravi
Member Author

So, the main purpose of such a fork endpoint would be to allow critical field edits. The fork would then be an update/edit, right? The fork API is not going to just fork/clone the dataset; it is fork+edit. The two APIs would then be a critical_data_edit and a non_critical_data_edit API (the names could be improved).

Exactly! More specifically, you could state all_data_edit and non_critical_data_edit. I would call these functions data_fork and data_edit, respectively.

The main reason I thought of doing this in a single API is that it is easier for the user, who will not have to use two endpoints for these two cases.

The fork endpoint should be used with extreme care. I would personally hope that users would only use them when a problem is identified with the previous version of the dataset.

Yes, I agree. I will make the changes and keep you posted. Thanks!

@joaquinvanschoren
Contributor

joaquinvanschoren commented Aug 10, 2020

We actually wanted to create separate fork and edit calls initially. It is semantically cleaner. The reason to do 'fork+edit' is to avoid creating two new datasets (the fork itself and then the new version with the fixed critical features).

Let's go with the suggested names. IIUC these would be the semantics and protocol:

  • data_edit: edits an existing dataset version. This is only allowed for non-critical properties. For now, everyone can do this; in the future we can add an option to disallow community edits.
  • data_fork: creates a new dataset version, with possibly a different owner. In this process, critical features can be changed (in fact, it will be the only way to edit critical features).

To be honest, if we go this way, I would find it semantically cleaner if data_fork would only "fork" the dataset: copy and keep a reference to the previous version id. However, then we need another way to change critical features. For instance, we could allow dataset owners to change critical features, but ONLY if there are no existing tasks created on the dataset.

Would that not be cleaner? If a user sees a critical issue with a dataset, he would first fork it, then change critical properties (and other properties at will), and then create new tasks. This also avoids that users have to re-fork if they make a mistake: they can fix mistakes as long as they haven't created tasks yet.

Just an idea... we can also go with the combined fork+edit call.

@sahithyaravi
Member Author

sahithyaravi commented Aug 11, 2020

We actually wanted to create separate fork and edit calls initially. It is semantically cleaner. The reason to do 'fork+edit' is to avoid creating two new datasets (the fork itself and then the new version with the fixed critical features).

Let's go with the suggested names. IIUC these would be the semantics and protocol:

  • data_edit: edits an existing dataset version. This is only allowed for non-critical properties. For now, everyone can do this; in the future we can add an option to disallow community edits.
  • data_fork: creates a new dataset version, with possibly a different owner. In this process, critical features can be changed (in fact, it will be the only way to edit critical features).

To be honest, if we go this way, I would find it semantically cleaner if data_fork would only "fork" the dataset: copy and keep a reference to the previous version id. However, then we need another way to change critical features. For instance, we could allow dataset owners to change critical features, but ONLY if there are no existing tasks created on the dataset.

Would that not be cleaner? If a user sees a critical issue with a dataset, he would first fork it, then change critical properties (and other properties at will), and then create new tasks. This also avoids that users have to re-fork if they make a mistake: they can fix mistakes as long as they haven't created tasks yet.

Just an idea... we can also go with the combined fork+edit call.

Yes, I think this approach sounds cleaner.

  • Edit API - Edit non-critical data fields for everyone. Edit critical fields only for the owner, provided there are no tasks. If the owner already created a task, we will just throw an error right?

  • Fork API - Clone the row as such with change in dataset ID and uploader ID

@janvanrijn
Member

Edit critical fields only for the owner, provided there are no tasks. If the owner already created a task, we will just throw an error right?

Sounds OK to me.

Fork API - Clone the row as such with change in dataset ID and uploader ID

Yes. Same File Id
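Putting the agreed behaviour together, here is a hedged sketch of the two endpoints in plain Python over an in-memory record. The real implementation lives in the PHP API; names such as `tasks_exist_for` and the record layout are hypothetical stand-ins, used only to make the rules explicit.

```python
CRITICAL_FIELDS = {"default_target_attribute", "row_id_attribute", "ignore_attribute"}

# Hypothetical in-memory stand-ins for the database and for a task-lookup helper.
datasets = {1: {"uploader": 100, "file_id": 555,
                "default_target_attribute": "class", "description": "original"}}
tasks_on_dataset = {1: [11]}                  # dataset 1 already has a task

def tasks_exist_for(did):
    return bool(tasks_on_dataset.get(did))

def data_edit(did, user_id, update_fields):
    """Edit an existing dataset version in place.
    Non-critical fields: allowed (until a community-edit flag is introduced).
    Critical fields: only for the owner, and only while no tasks exist yet."""
    ds = datasets[did]
    if CRITICAL_FIELDS & set(update_fields):
        if ds["uploader"] != user_id:
            raise PermissionError("Only the owner may edit critical fields.")
        if tasks_exist_for(did):
            raise PermissionError("Critical fields are frozen once tasks exist; fork instead.")
    ds.update(update_fields)                  # in-place edit, no new data ID

def data_fork(did, user_id):
    """Clone the dataset row under a new data ID and uploader; keep the same file ID."""
    new_did = max(datasets) + 1
    clone = dict(datasets[did])               # same file_id and metadata
    clone["uploader"] = user_id
    datasets[new_did] = clone
    return new_did

forked = data_fork(1, user_id=200)                           # a non-owner forks dataset 1
data_edit(forked, 200, {"default_target_attribute": "age"})  # then edits their own copy
```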

@joaquinvanschoren
Contributor

joaquinvanschoren commented Aug 18, 2020

LGTM. I'll merge to develop. That way Sahi can continue with the Python API. Let's not forget to do a double-check when Jan is back from holidays.

@PGijsbers
Contributor

PGijsbers commented Aug 27, 2020

Hey, a bit late to the discussion. Sorry! Sahi had mentioned the discussions to me but I somehow read past them.
I know much of the course has been decided, but I figured I'd leave a few remarks anyway, as much of the work is still to be done.

A deeper problem is that the 'critical' properties (e.g. target/ignore/row_id column) should actually be stored at task level, not dataset level.

I like this. But I'd like to propose a minor change to this. I think the row_id (and ignore?) column information should also live on the dataset level. My reasoning is that this is useful information about the dataset. It will make it easier for people to use the dataset without also inspecting what columns are marked in related tasks. It is also convenient that the owner of the dataset can immediately flag these columns even if they do not wish to create a task.

However, as noted we should avoid using these dataset properties directly when running a task as it creates the issue that this dataset meta-data becomes critical to defining a task. So I would propose to use this dataset meta-data when creating the task to define the default values of the ignore columns. To clarify, if the task is defined with its default values (based on dataset meta-data), this information is copied, not referenced. This way, we can still update the underlying dataset meta-data without affecting the task definition. I see this as a way to keep both the benefit of storing dataset related information on the dataset level, yet make it editable, while also taking the step to ensure the task completely defines the evaluation strategy in an experiment. Finally it would also allow us to mark "outdated" tasks which use columns which should be ignored according to the dataset owner.

I'm not sure if it's very clear; let me try to give an example:

  • dataset D is created, no columns are marked
  • task T is created on dataset D with defaults and target "class" (target column: "class", ignore columns: None)
  • dataset D is edited, and column id is marked as a row_id column
  • task T' is created on dataset D with defaults and target "class" (target column: "class", ignore columns: id)

End Results:

  • dataset D, id is a row_id column
  • task T on dataset D (target column:"class", ignore columns:None)
  • task T' on dataset D (target column:"class", ignore columns:id)

Any future tasks will correctly ignore the row_id column unless explicitly overwritten, yet the task T did not change from its original definition (row_id is not ignored, as this was not set on the creation of the task).
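To make the copy-versus-reference point concrete, below is a small sketch (illustrative Python dataclasses, not OpenML code) reproducing the D / T / T' example: the dataset-level defaults are copied into the task at creation time, so editing the dataset afterwards changes only future tasks.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Dataset:
    name: str
    row_id_attribute: Optional[str] = None        # editable default metadata
    ignore_attributes: List[str] = field(default_factory=list)

@dataclass
class Task:
    dataset_name: str
    target: str
    ignore_attributes: List[str]                  # copied at creation, never referenced later

def create_task(dataset: Dataset, target: str) -> Task:
    # Copy the dataset-level defaults into the task definition at creation time.
    defaults = list(dataset.ignore_attributes)
    if dataset.row_id_attribute is not None:
        defaults.append(dataset.row_id_attribute)
    return Task(dataset_name=dataset.name, target=target, ignore_attributes=defaults)

D = Dataset(name="D")                   # dataset D, no columns marked yet
T = create_task(D, target="class")      # task T: ignore columns = []
D.row_id_attribute = "id"               # dataset D is edited later
T2 = create_task(D, target="class")     # task T': ignore columns = ["id"]

print(T.ignore_attributes)              # []     -- the existing task did not change
print(T2.ignore_attributes)             # ['id'] -- new tasks pick up the new default
```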

@joaquinvanschoren
Contributor

Thanks Pieter. I agree with your observations.
Keeping default row_id and ignore columns, similar to default target columns makes a lot of sense to me. Indeed, the data uploader may just want to upload a dataset but not create tasks on it. In fact, that is likely quite common.
That means that we can keep the current situation, but update the documentation to explain that these are defaults?

We can probably also do an automated check for row_id columns, and alert the uploader in case we find a potential unmarked row_id column?

@PGijsbers
Contributor

PGijsbers commented Aug 27, 2020

That means that we can keep the current situation, but update the documentation to explain that these are defaults?

There is still a behavioral change: the row_id and ignore columns of a dataset are not used anywhere else except to provide some sensible default values on task creation (and as information to the user). Tasks do not reference the dataset meta-data; they define which columns to ignore themselves (by default, by copying over the dataset meta-data). This is very important to ensure the desired behavior where updating the dataset meta-data does not change the behavior of an existing task.

That is different from the situation right now, where tasks do not store this column information. If we were to edit the dataset ignore columns right now, then the tasks would effectively change (hypothetically - it is not possible right now to edit the ignore columns on an existing dataset).

We can probably also do an automated check for row_id columns, and alert the uploader in case we find a potential unmarked row_id column?

Yes, I think when/if we provide the tool for dataset upload which automatically marks column types (e.g. nominal vs. numeric), it makes sense to also identify columns which look like id columns.

$data_id = $this->input->post('data_id');
// If data id is not given
if( $data_id == false ) {
$this->returnError( 1070, $this->version );
Member


no spaces after ( or before ) (I know we are not consistent with this, but would be good to keep this in mind)

$update_fields = array();
foreach($xml->children('oml', true) as $input) {
// iterate over all fields, does not check for legal fields, as it wont match the xsd.
$name = $input->getName() . '';
Member


too much indent?

Member


compared to comment

return;

// If critical fields need to be edited
if (isset($update_fields['default_target_attribute']) || isset($update_fields['row_id_attribute']) || isset($update_fields['ignore_attribute'])) {
Member


rather than checking these one by one, I would actually prefer to create an "edit XSD", so that we have this consistent in our set of XSD documents (that are easier to review than the code itself)
