Skip to content

Conversation

@mpilman
Copy link
Contributor

@mpilman mpilman commented Apr 4, 2019

This PR will add a special client side buggify. The overall idea is that a user can enable this client-side buggify to test client libraries and layers.

If enabled, the client will start throwing errors with a configurable probability. This way an application or layer implementer can test that her exception handling works correctly.

throw invalid_option_value();
}
break;
case FDBNetworkOptions::CLIENTBUGGIFY_ENABLE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the CLIENTBUGGIFY_DISABLE option and simply default the CLIENTBUGGIFY_ENABLE to false? This is totally non-blocking though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - I don't know what makes more sense and I don't have strong opinions. My line of thinking was that you might want to be able to enable and disable client buggify multiple times during a single run. But that might be a rare enough use-case so that we might not want to have more options for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm personally ok with having both options, but if anybody disagrees we should have that discussion now rather than try to remove the option from our API later.

Anybody else want to weigh in? @alecgrieser @dongxinEric @alexmiller-apple etc.

@apkar
Copy link
Contributor

apkar commented Apr 5, 2019

@mpilman Thanks for working on it. I was thinking of doing something similar in Doc Layer transaction abstraction. But, this is much better and generic. I didn't do code review, but the concept looks good to me. Couple of comments

  • I see that you are trying to simulate errors in getRange() and commit(). Is there a reason we don't want to buggify clear() and get()?
  • Does it make sense to add artificial delays in addition to errors, so we can trigger transaction_timeout errors as well?

@mpilman
Copy link
Contributor Author

mpilman commented Apr 5, 2019

Thanks, I'll continue with this then. To answer your question:

  1. Yes there is a reason: I didn't finish yet ;) I just wanted to provide a draft PR to showcase the overall idea. I will add more failure cases.
  2. That's a good idea, I will add some of those.

Copy link
Contributor

@ajbeamon ajbeamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we could buggify (though it needn't be part of this PR necessarily) is cluster upgrades via the multi-version client. The idea here would be to cause the multi-version client to "switch" from the current client to the same one while triggering all the usual behavior of failing outstanding requests, etc.

@atn34
Copy link
Collaborator

atn34 commented Apr 5, 2019

  • I see that you are trying to simulate errors in getRange() and commit(). Is there a reason we don't want to buggify clear() and get()?

What kind of errors might clear throw? I think it's just adding a mutation to an in-memory buffer and adding a write conflict range.

@apkar
Copy link
Contributor

apkar commented Apr 5, 2019

  • I see that you are trying to simulate errors in getRange() and commit(). Is there a reason we don't want to buggify clear() and get()?

What kind of errors might clear throw? I think it's just adding a mutation to an in-memory buffer and adding a write conflict range.

Good point. clear is just like set it wouldn't throw any errors.

Copy link
Contributor

@amouehsan amouehsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. This looks great to me. Just a couple of questions for you.

@mpilman mpilman force-pushed the features/client-buggify branch from 775baf3 to 32585e5 Compare April 6, 2019 15:01
Copy link
Contributor

@alecgrieser alecgrieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of these were discussed in real life, but here are my comments for completeness.

@ajbeamon
Copy link
Contributor

Is there anything blocking this from being a regular, non-draft PR? If so, what is it?

@mpilman mpilman force-pushed the features/client-buggify branch from 32585e5 to 6ea7571 Compare June 16, 2019 16:10
@mpilman mpilman marked this pull request as ready for review June 16, 2019 16:10
@ajbeamon ajbeamon self-assigned this Jun 17, 2019
case FDBNetworkOptions::CLIENT_BUGGIFY_SECTION_ACTIVATED_PROBABILITY:
validateOptionValue(value, true);
clearBuggifySections(BuggifyType::Client);
P_BUGGIFIED_SECTION_ACTIVATED[int(BuggifyType::Client)] = 100.0/double(extractIntOption(value, 0, 100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be value/100.0 rather than 100.0/value? The same question applies to CLIENT_BUGGIFY_SECTION_FIRED_PROBABILITY as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, of course. This is an embarrassing one

description="" />
<Option name="client_buggify_section_activated_probability" code="82"
paramType="Int" paramDescription="probability expressed as a percentage between 0 and 100"
description="Set the probability of a CLIENT_BUGGIFY section being active for the current execution. Only applies to code paths first traversed AFTER this option is changed." />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? It looks like the buggified sections get cleared when this option is set. That leads me to another thought, which is that if it's the case that buggified sections can change during a run, we may want to note that somewhere in case someone tries to write a buggify statement that depends on the section being stable during a run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right, this is not true anymore (it used to be but I later introduced the whole clear thingy as I thought it will make it easier to understand).

For your second thought: the normal BUGGIFY which we usually use should be stable, this only applies to CLIENT_BUGGIFY. But nonetheless it would be good to keep this in mind. Where do you want me to document this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, and I'm not sure there's any perfect place to put it. My best suggestion would be to note this as a comment on the macros CLIENT_BUGGIFY* in NativeAPI.actor.h.

Copy link
Contributor

@ajbeamon ajbeamon Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the comment now, it looks good. Don't forget to also update the description of this option to reflect the new behavior of clearing activated sections.

@ajbeamon
Copy link
Contributor

  • I see that you are trying to simulate errors in getRange() and commit(). Is there a reason we don't want to buggify clear() and get()?
  • Does it make sense to add artificial delays in addition to errors, so we can trigger transaction_timeout errors as well?

I just noticed this previous comment, and you had mentioned subsequently that you would add some of these (though for clear we decided there was nothing to do). Was the plan for those changes to be included here or in a future PR that expands the feature?

@mpilman
Copy link
Contributor Author

mpilman commented Jun 17, 2019

Good point - I missed the get - this is why I shouldn't keep my PRs open for this long :( I pushed another change

@ajbeamon ajbeamon merged commit c3aa581 into apple:master Jun 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants