Skip to content

[sled agent] Add "rebooting" as enum variant, update sled agent state-handling#88

Merged
smklein merged 25 commits intomainfrom
reboot-states
May 25, 2021
Merged

[sled agent] Add "rebooting" as enum variant, update sled agent state-handling#88
smklein merged 25 commits intomainfrom
reboot-states

Conversation

@smklein
Copy link
Collaborator

@smklein smklein commented May 12, 2021

This patch consists of two major changes: (1) Updating the API for instance state, and (2) a refactor of the sled agent's state-handling logic.

Updating the API for the instance state:

BEFORE:

  • ApiInstanceState was separate from the concept of "rebooting" - reboot_in_progress was a separate bool, which means it could be toggled regardless of state. This usage of types presented a slightly imprecise view of the world: rebooting as a concept only makes sense within the "stopped" and "stopping" states.

AFTER:

  • reboot_in_progress removed from ApiInstanceRuntimeState, replaced by an enum variant rebooting within the Stopped and Stopping varaints.
  • reboot_wanted removed from ApiInstanceRuntimeStateRequested, replaced by ApiInstanceStateRequested::Reboot.

These changes make it more difficult for clients to make requests which are invalid (such as requesting a change to "failed", or requesting a change to "destroyed + reboot_wanted").

Note on the Database...

This aforementioned change has a trickle-effect throughout storage, translation, etc, as it modifies a structure stored within CockroachDB.

Changes necessary to update a struct stored in the DB...

  • [model.rs] Change the structure itself (add rebooting: bool as an enum variant). This required...
  • [dbinit.sql] Update the database schema.
  • [schema.rs] Update the database schema.
  • serialization:
    • [conversions.rs] Add SqlSerialize implementation for ApiInstanceState.
    • [conversions.rs] Update SqlSerialize implementation for ApiInstanceRuntimeState, which itself contains ApiInstanceState.
    • [datastore.rs] Update storage of ApiInstanceState within the SqlValueSet within project_delete_instance.
  • deserialization:
    • [model.rs] Remove try_from(&str) -> Result<ApiInstanceState> which relied on deserialization from JSON, as it is not intended to be stored as a JSON string.
    • [model.rs] Add try_from((&str, bool)) -> Result<ApiInstanceState>, which instead reconstructs the enum from a (variant, rebooting) tuple.
    • [model_db.rs] Add try_from(tokio_postgres::Row) -> Result<ApiInstanceState>, which itself invokes the (&str, bool) variant after extracting values from the SQL row.
    • [model_db.rs] Update try_from(tokio_postgres::Row) -> Result<ApiInstanceRuntimeState>, which itself relied on the deserialization of the instance state.

(anyway, here's footage of me making all these changes, still insisting that this is a small and reasonable update)

Refactoring the Sled Agent's state-handling Logic

This mostly impacted next_state_for_new_target. In effect, rather than using multiple matches of conditional logic to try to match all cases, this updates the format to an actual match statement, to make state transitions a bit more obvious.

@smklein smklein requested a review from davepacheco May 12, 2021 22:51
@smklein smklein changed the title [sled agent] Add "rebooting" indicator under a subset of states [sled agent] Add "rebooting" as enum variant, update sled agent state-handling May 12, 2021
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! When you started working on this, I didn't anticipate how soon you'd be diving into the half-baked abstractions in the database code you'd be diving into, or I'd have documented it better and maybe suggested talking through some of it. Sorry about that, and I hope that wasn't too painful!

) -> Result<Self, Self::Error> {
println!("ApiInstanceState::try_from: ({}, {})", variant, rebooting);
let r = match variant {
"creating" => ApiInstanceState::Creating,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making the the "rebooting" column in the database a nullable boolean? That seems to me a more precise analog to the Rust data type here, where there's no boolean value present in many of the variants.

This would have some small impacts in a bunch of places, but I think those changes would be clarifying, too. In this function, you'd be accepting an Option<bool>, and we could validate that it's None if the variant is not Stopping or Stopped, similar to what we do for the attached instance id in ApiDiskState. Another example: it might make more sense for is_rebooting() to return Option<bool> (because it seems weird to me to say that a Destroyed instance is not rebooting -- I'd expect kind of a "mu" answer). That one depends on the callers and what they tend to care about, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done; this involved the following:

  • removing the NOT NULL constraint
  • updating all sql_row_value invocations to explicitly return an Option<bool>
  • updating the try_from impl to act on an optional bool

This does mean that SqlSerialize is now doing some shenanigans to try to ensure this field is truly none if we're in any non-stopping states, which means injecting None into the SqlValueSet.

fn try_from(
(variant, rebooting): (&str, bool),
) -> Result<Self, Self::Error> {
println!("ApiInstanceState::try_from: ({}, {})", variant, rebooting);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a debug println left in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

-- instance_state omicron.public.InstanceState NOT NULL, // TODO see above
instance_state TEXT NOT NULL,
reboot_in_progress BOOL NOT NULL,
rebooting BOOL NOT NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this name change semantically meaningful? (This is fine -- they seem equivalent to me -- I'm just curious if by changing this you intended some distinction that I'm missing.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhhh no, this was just a weird by-product of my progression, which was...

  1. Add the "rebooting" variant to "instance_state" (which I thought could still be converted to TEXT via json)
  2. Remove the separate "reboot_in_progress" variable (which meant deleting it here)
  3. Realize that I should not be storing the instance object as JSON, so add this auxiliary data back here (as rebooting).

TBH, I actually think instance_state_rebooting ends up being clearer, since it makes it more obvious that it's associated with the instance_state object. Updating now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I left instance_state as-is, but I'd actually be willing to change it to instance_state_variant...)

let p1 = cond_sql.next_param(&stopped);
let failed = ApiInstanceState::Failed.to_string();
let p2 = cond_sql.next_param(&failed);
cond_sql.push_str(&format!("instance_state in ({}, {})", p1, p2));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels kind of weirder now but I'm not sure if we should do anything about it or what we would do about it.

What's weird is that at L394 we're constructing a value for Stopped {rebooting: false} but at L396 we're expecting it to match equal to rows that would really be represented as Stopped { rebooting: true }.

I guess a more precise translation from the Rust type would be something like this:

(instance_state, run_state) IN (("stopped", false), ("stopped", true), ("failed", NULL))

I'm not sure you can do that in SQL per se, but I think the meaning is clear. You might have to decompose that into a more complex boolean expression. (This also assumes we made "rebooting" a nullable column.)

The question here is: what level of abstraction are these functions operating at? The answer in the new code is that they're pretty close-to-the-SQL: they know the underlying representation of a type like ApiInstanceState, and the code is within its rights to say: "I know this Rust type maps to two SQL columns, but I only care about the instance_state part, so I'm going to construct a variant with a dummy "rebooting" value in order to get its instance_state part [and throw away the other part]".

The other answer would be to say: these functions aren't allowed to assume knowledge of the SQL representation and are instead expected to rely on the various traits to serialize values to/from SQL. This was my original thought here. But looking at this example, it seems like that would require quite a lot more work, and I'm not sure what the abstraction would look like.

Anyway, I don't think there's much actionable here. This just seemed like a bigger change in abstraction than it looks and worth thinking about for the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good point (I'll at least include a comment to make this more explicit).

(instance_state, run_state) IN (("stopped", false), ("stopped", true), ("failed", NULL))

I'm hesitant with this example - if the stored value was an integer, not a bool, I wouldn't want to exhaustively list all integers! The truth is just that I simply don't care about it here - I want an analog for "this enum matches this variant, but we can discard the content in this context".

/*
* To implement reboot, we issue a call to the sled agent to set a
* runtime state with "reboot_wanted". We cannot simply stop the
* runtime state with "rebooting = true". We cannot simply stop the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be updated to say that we're setting the runtime state to "Reboot" (not "with "rebooting = true"")?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

* we might leave it stopped.
*
* When an instance is rebooted, the "reboot_in_progress" remains set on
* When an instance is rebooted, the "rebooting" remains set on
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'the "rebooting" flag', maybe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

/**
* Tests SimObject in general when backed by a SimDisk in particular, for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that say "SimInstance in particular"? (I think this was there before your change but I just noticed it.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, bad comment migration from L419 in the old version. I'll just re-phrase this entirely.

Base automatically changed from ensure-api to main May 14, 2021 23:04
@smklein smklein merged commit c5f23b7 into main May 25, 2021
@smklein smklein deleted the reboot-states branch May 25, 2021 15:34
citrus-it added a commit to citrus-it/omicron that referenced this pull request Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants