[sled agent] Add "rebooting" as enum variant, update sled agent state-handling#88
[sled agent] Add "rebooting" as enum variant, update sled agent state-handling#88
Conversation
davepacheco
left a comment
There was a problem hiding this comment.
Thanks for doing this! When you started working on this, I didn't anticipate how soon you'd be diving into the half-baked abstractions in the database code you'd be diving into, or I'd have documented it better and maybe suggested talking through some of it. Sorry about that, and I hope that wasn't too painful!
| ) -> Result<Self, Self::Error> { | ||
| println!("ApiInstanceState::try_from: ({}, {})", variant, rebooting); | ||
| let r = match variant { | ||
| "creating" => ApiInstanceState::Creating, |
There was a problem hiding this comment.
What do you think about making the the "rebooting" column in the database a nullable boolean? That seems to me a more precise analog to the Rust data type here, where there's no boolean value present in many of the variants.
This would have some small impacts in a bunch of places, but I think those changes would be clarifying, too. In this function, you'd be accepting an Option<bool>, and we could validate that it's None if the variant is not Stopping or Stopped, similar to what we do for the attached instance id in ApiDiskState. Another example: it might make more sense for is_rebooting() to return Option<bool> (because it seems weird to me to say that a Destroyed instance is not rebooting -- I'd expect kind of a "mu" answer). That one depends on the callers and what they tend to care about, though.
There was a problem hiding this comment.
Done; this involved the following:
- removing the
NOT NULLconstraint - updating all
sql_row_valueinvocations to explicitly return anOption<bool> - updating the
try_fromimpl to act on an optional bool
This does mean that SqlSerialize is now doing some shenanigans to try to ensure this field is truly none if we're in any non-stopping states, which means injecting None into the SqlValueSet.
omicron-common/src/model.rs
Outdated
| fn try_from( | ||
| (variant, rebooting): (&str, bool), | ||
| ) -> Result<Self, Self::Error> { | ||
| println!("ApiInstanceState::try_from: ({}, {})", variant, rebooting); |
There was a problem hiding this comment.
This looks like a debug println left in.
omicron-common/src/sql/dbinit.sql
Outdated
| -- instance_state omicron.public.InstanceState NOT NULL, // TODO see above | ||
| instance_state TEXT NOT NULL, | ||
| reboot_in_progress BOOL NOT NULL, | ||
| rebooting BOOL NOT NULL, |
There was a problem hiding this comment.
Was this name change semantically meaningful? (This is fine -- they seem equivalent to me -- I'm just curious if by changing this you intended some distinction that I'm missing.)
There was a problem hiding this comment.
ahhhh no, this was just a weird by-product of my progression, which was...
- Add the "rebooting" variant to "instance_state" (which I thought could still be converted to
TEXTvia json) - Remove the separate "reboot_in_progress" variable (which meant deleting it here)
- Realize that I should not be storing the instance object as JSON, so add this auxiliary data back here (as
rebooting).
TBH, I actually think instance_state_rebooting ends up being clearer, since it makes it more obvious that it's associated with the instance_state object. Updating now.
There was a problem hiding this comment.
(I left instance_state as-is, but I'd actually be willing to change it to instance_state_variant...)
| let p1 = cond_sql.next_param(&stopped); | ||
| let failed = ApiInstanceState::Failed.to_string(); | ||
| let p2 = cond_sql.next_param(&failed); | ||
| cond_sql.push_str(&format!("instance_state in ({}, {})", p1, p2)); |
There was a problem hiding this comment.
This feels kind of weirder now but I'm not sure if we should do anything about it or what we would do about it.
What's weird is that at L394 we're constructing a value for Stopped {rebooting: false} but at L396 we're expecting it to match equal to rows that would really be represented as Stopped { rebooting: true }.
I guess a more precise translation from the Rust type would be something like this:
(instance_state, run_state) IN (("stopped", false), ("stopped", true), ("failed", NULL))
I'm not sure you can do that in SQL per se, but I think the meaning is clear. You might have to decompose that into a more complex boolean expression. (This also assumes we made "rebooting" a nullable column.)
The question here is: what level of abstraction are these functions operating at? The answer in the new code is that they're pretty close-to-the-SQL: they know the underlying representation of a type like ApiInstanceState, and the code is within its rights to say: "I know this Rust type maps to two SQL columns, but I only care about the instance_state part, so I'm going to construct a variant with a dummy "rebooting" value in order to get its instance_state part [and throw away the other part]".
The other answer would be to say: these functions aren't allowed to assume knowledge of the SQL representation and are instead expected to rely on the various traits to serialize values to/from SQL. This was my original thought here. But looking at this example, it seems like that would require quite a lot more work, and I'm not sure what the abstraction would look like.
Anyway, I don't think there's much actionable here. This just seemed like a bigger change in abstraction than it looks and worth thinking about for the future.
There was a problem hiding this comment.
Yeah, it's a good point (I'll at least include a comment to make this more explicit).
(instance_state, run_state) IN (("stopped", false), ("stopped", true), ("failed", NULL))
I'm hesitant with this example - if the stored value was an integer, not a bool, I wouldn't want to exhaustively list all integers! The truth is just that I simply don't care about it here - I want an analog for "this enum matches this variant, but we can discard the content in this context".
omicron-nexus/src/nexus.rs
Outdated
| /* | ||
| * To implement reboot, we issue a call to the sled agent to set a | ||
| * runtime state with "reboot_wanted". We cannot simply stop the | ||
| * runtime state with "rebooting = true". We cannot simply stop the |
There was a problem hiding this comment.
I think this should be updated to say that we're setting the runtime state to "Reboot" (not "with "rebooting = true"")?
omicron-nexus/src/nexus.rs
Outdated
| * we might leave it stopped. | ||
| * | ||
| * When an instance is rebooted, the "reboot_in_progress" remains set on | ||
| * When an instance is rebooted, the "rebooting" remains set on |
There was a problem hiding this comment.
'the "rebooting" flag', maybe?
| } | ||
|
|
||
| /** | ||
| * Tests SimObject in general when backed by a SimDisk in particular, for |
There was a problem hiding this comment.
Should that say "SimInstance in particular"? (I think this was there before your change but I just noticed it.)
There was a problem hiding this comment.
Ah, yeah, bad comment migration from L419 in the old version. I'll just re-phrase this entirely.
This patch consists of two major changes: (1) Updating the API for instance state, and (2) a refactor of the sled agent's state-handling logic.
Updating the API for the instance state:
BEFORE:
ApiInstanceStatewas separate from the concept of "rebooting" -reboot_in_progresswas a separate bool, which means it could be toggled regardless of state. This usage of types presented a slightly imprecise view of the world: rebooting as a concept only makes sense within the "stopped" and "stopping" states.AFTER:
reboot_in_progressremoved fromApiInstanceRuntimeState, replaced by an enum variantrebootingwithin theStoppedandStoppingvaraints.reboot_wantedremoved fromApiInstanceRuntimeStateRequested, replaced byApiInstanceStateRequested::Reboot.These changes make it more difficult for clients to make requests which are invalid (such as requesting a change to "failed", or requesting a change to "destroyed + reboot_wanted").
Note on the Database...
This aforementioned change has a trickle-effect throughout storage, translation, etc, as it modifies a structure stored within CockroachDB.
Changes necessary to update a struct stored in the DB...
model.rs] Change the structure itself (addrebooting: boolas an enum variant). This required...dbinit.sql] Update the database schema.schema.rs] Update the database schema.conversions.rs] AddSqlSerializeimplementation forApiInstanceState.conversions.rs] UpdateSqlSerializeimplementation forApiInstanceRuntimeState, which itself containsApiInstanceState.datastore.rs] Update storage ofApiInstanceStatewithin theSqlValueSetwithinproject_delete_instance.model.rs] Removetry_from(&str) -> Result<ApiInstanceState>which relied on deserialization from JSON, as it is not intended to be stored as a JSON string.model.rs] Addtry_from((&str, bool)) -> Result<ApiInstanceState>, which instead reconstructs the enum from a (variant, rebooting) tuple.model_db.rs] Addtry_from(tokio_postgres::Row) -> Result<ApiInstanceState>, which itself invokes the(&str, bool)variant after extracting values from the SQL row.model_db.rs] Updatetry_from(tokio_postgres::Row) -> Result<ApiInstanceRuntimeState>, which itself relied on the deserialization of the instance state.(anyway, here's footage of me making all these changes, still insisting that this is a small and reasonable update)
Refactoring the Sled Agent's state-handling Logic
This mostly impacted
next_state_for_new_target. In effect, rather than using multiple matches of conditional logic to try to match all cases, this updates the format to an actual match statement, to make state transitions a bit more obvious.