Skip to content

Europa: The Great Compromise of December 16, 2021. (spec).#1234

Merged
shykes merged 1 commit intodagger:mainfrom
shykes:europa-hidden-fields-plan-b
Dec 17, 2021
Merged

Europa: The Great Compromise of December 16, 2021. (spec).#1234
shykes merged 1 commit intodagger:mainfrom
shykes:europa-hidden-fields-plan-b

Conversation

@shykes
Copy link
Copy Markdown
Contributor

@shykes shykes commented Dec 16, 2021

Summary

This PR implements the Great Design Compromise of December 16, 2021, summarized below.

Note I am incorporating the tweak suggested by @talentedmrjones of using task: _name rather than task: _id.

// Core types (e.g. #FS)
#FS: {
  $dagger: fs: _id: string
}

// Tasks (e.g. #Exec)
#Exec: {
  $dagger: task: _name: "Exec"
}

Related PRs

Future issues

See #1241 . Note that we are kicking this can down the road.

Original proposal

Below is my original proposal, for historical record. Most of the comments below are in response to this.

This is our plan B, in case it is confirmed that we can’t rely on hidden fields for matching internal types in disjunctions, the way we hoped to.

Context from Discord:

Quick summary of next steps:

  1. [Joel] is going to check with CUE devs whether this is a bug or feature. If we’re lucky, it’s a bug, they fix it next week, and we can pretend this conversation never happened.

  2. We discussed plan B. There are some variations & bikeshedding opportunities, but basically it revolves around using a visible field, while making it as non-intrusive as possible to the developer, and minimizing risk of dev error. [Solomon] will draft a proposal to kickoff bikeshedding.

TLDR: we have a path to shipping Europa no matter what, without major changes to the codebase or APIs.

Rationale

In preparation for bikeshedding, here is my rationale for this design.

  • We are making “10x better DX” a key differentiator of Dagger. Therefore we should take every opportunity to make the DX better, even to the last detail.
  • In this case, the opportunity is to make our internal type system as easy to understand and consistent as possible, for 2 audiences: 1) developers who will evaluate their Dagger config in full detail (for debug, testing, learning etc), and 2) developers that will bother to read the CUE files of our engine API. Yes, those will be mostly advanced developers. But that doesn’t mean they don’t appreciate consistency and clarity. Experts aren’t masochists.
  • With that in mind: we are forced to transform some hidden fields into public fields, because we can’t rely on hidden fields for matching as we hoped.
  • One option is to change only some of the fields (so-called “core types”: FS, Secret, Service, Stream) since they are the only ones affected by the limitation of hidden fields. This would result in an inconsistent set of fields: some hidden, some visible.
  • This inconsistency would compound an existing inconsistency in the current type system. Specifically: there are 2 types of hidden fields: _type: “Foo” and _foo: {}.
    • The inconsistency between field naming schemes is a problem: all these fields are looked up in some way by our Go implementation; it’s not sufficient to say “we use those fields in two different ways internally, so the inconsistency makes sense to us. Read the go code base and it will make sense to you too”.
    • Additionally, the choice of _type itself is also a problem. It creates confusion with our term “core types” (which don’t have a _type field), and it’s generally too broad: does lack of _type indicate the lack of a type? This will cause confusion.
  • TLDR: rather than add an extra layer of inconsistency with this change, let’s use it as an opportunity to clean up inconsistency

Hence the proposed design rules:

  • If a CUE type is special to the Go engine implementation, it must always have a $dagger field. This field is used to differentiate between specific tasks, FS trees, sefrets, etc. It also holds additional information filled by the runtime, eg. IDs.
  • If a CUE type does not have a $dagger field, it is not special to the Go engine implementation
  • There are no hidden fields. This PR does not remove the hidden fields under inputs, because they have been created outside of the spec, but I recommend removing them too.

cc @samalba @aluzzardi @jlongtine @talentedmrjones hopefully this clarifies my thinking, it’s been a long day and we spent most of it talking about intricate abstract concepts. Hopefully async written discussion will help clarify and digest.

@shykes shykes added this to the v0.2.0 "Europa" milestone Dec 16, 2021
@aluzzardi
Copy link
Copy Markdown
Contributor

aluzzardi commented Dec 16, 2021

I do agree things as they are are a bit confusing, however I disagree on why it's confusing and how to fix it :)

I realize Discord isn't the best medium to share context, so let me try to here.

Those fields only matter to people writing Engine code / advanced users, and IMHO if explained/organized better, it makes a lot of sense.

There's two things:

  • Core Types: CUE has built-in types. string, bytes, int, ... If we could extend the AST, Dagger would add fs, secret, service. Unfortunately we can't, so we end up with #Secret, #FS, ...
  • Tasks The engine provides a collection of tasks (#ReadFile, #Exec)
    • Ideally, we wouldn't need to do anything special to handle those. However, due to limitations in the CUE API, we have to add a marker to figure out what Go code to invoke when we encounter them.
    • We copied the exact same model of cue/cmd, except we're using a hidden field rather than $id (cue/cmd predates hidden fields, I bet if those were able at the time, that's what the CUE team would have used).

I agree it's currently not in the best of shapes from a simplicity point of view, and I have a few suggestions to make it better:

  • _type is a lousy field name. _task would be a better match perhaps
  • Mixing data types and tasks in the same package is confusing. Perhaps we could differentiate the two. Either/Or:
    • Following Richard's suggestion (Europa: specialize engine packages #1221), we could specialize tasks into sub-packages (e.g. http.#Fetch) for tasks and keep the top-level package for types
    • We could move data types to engine/types.{FS,Secret,...}
    • There's no reason for types to be definitions: it could be types.Secret or whatever (because those two things are inherently different, what forces tasks to be definitions doesn't apply to types)

On consistency:

If a CUE type does not have a $dagger field, it is not special to the Go engine implementation

Following this logic -- shouldn't we add a marker to engine.#CacheDir? engine.#Mount? engine.#Ref? engine.#Plan? Those are special to the engine, but they are just plain data structures passed to tasks, they're not tasks themselves.

Where does it stop? Should we have an engine.#String since the engine is reading/writing those (not intended as a joke -- I'm thinking of *aws.String in aws-go).

On inputs:

This PR does not remove the hidden fields under inputs, because they have been created outside of the spec, but I recommend removing them too.

I agree the inlined _type is very weird.

However, I recommend doing something along these lines:

#Plan: {
    input: secrets: [name=string]: #_ClientReadSecret | _#ClientExecSecret
}

// Read secret from a file ON THE CLIENT MACHINE
_#ClientReadSecret: {
    _task: "ClientReadSecret"

    path: string

    // Reference to the secret contents
    contents: #Secret
}

_#ClientExecSecret: {
    _task: "ClientExecSecret"

    // Execute a command ON THE CLIENT MACHINE and read secret from standard output
    command: [string, ...string] | string
    // Execute command in an interactive terminal
    //  for example to prompt for a passphrase
    interactive: true | *false   

    // Reference to the secret contents
    contents: #Secret
}

engine.#ReadFile and engine._#ClientReadSecret behave pretty much the same as far as tasks go. Only reason _#ClientReadSecret is hidden is because we don't want anyone using that directly.

#Plan schema enforces where you're allowed to use those tasks.

To me this feels pretty consistent, I believe advanced users would have no trouble understanding what's going on.

@aluzzardi
Copy link
Copy Markdown
Contributor

BTW, as you can tell I'm pretty excited about this model because IMHO it makes contributing/developing the Engine dead simple.

Want to add support for output?

  • Define a _#ClientWriteDirectory task
  • Add output: directories: [name=string]: _#ClientWriteDirectory to the #Plan
  • Drop your Go implementation in plan/task/clientwritedirectory.go (and call Register("ClientWriteDirectory", yourHandler) in init())
  • Done

No other changes to the entire Go code base. No need to add a concept of outputs, that's entirely in CUE and the core task model is enough. Some day those could very well be external plugins.

Since we're using cue/flow, those outputs are optimized out of the box. No need to wait for all actions to complete before starting to write outputs -- as soon as the output's dependencies are completed, the export will start.

@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 16, 2021

  • Core Types: CUE has built-in types. string, bytes, int, ... If we could extend the AST, Dagger would add fs, secret, service. Unfortunately we can't, so we end up with #Secret, #FS, ...

  • Tasks The engine provides a collection of tasks (#ReadFile, #Exec)

    • Ideally, we wouldn't need to do anything special to handle those. However, due to limitations in the CUE API, we have to add a marker to figure out what Go code to invoke when we encounter them.
    • We copied the exact same model of cue/cmd, except we're using a hidden field rather than $id (cue/cmd predates hidden fields, I bet if those were able at the time, that's what the CUE team would have used).

This makes sense to me. Clarifying the terminology of “core types” vs “tasks” is helpful.

  • _type is a lousy field name. _task would be a better match perhaps

Yes I agree, if we ignore for a second the issues with hidden fields, adding _task alongside _fs, _secret, _service and _stream would be both clear and consistent.

Now, incorporating the hidden fields issue into the mix: if we were to replace _fs, _secret, _service, _stream with $fs, $secret, $service, $stream, I would much prefer to also replace _task with $task, even though it’s not needed to fix the issue. This would improve consistency and therefore improve DX in a small but (IMO) meaningful way.

On top of this, my thinking was: “let’s just group all these keys in a reserved namespace, with a predictable key $dagger, and train our developers to consider everything inside that key as dagger engine business. Seems like a better DX”. I still believe it would be a better DX, but I would settle for $task, $fs, $service, $service, $stream aka “the 5 dollars”.

So, my order of preferences from favorite to least favorite:

  1. “One dollar” ($dagger)
  2. “Five dollars” ($task, $fs, $secret, $service, $stream)
  3. “Four dollars and one underscore” (_task, $fs, $secret, $service, $stream)
  • Mixing data types and tasks in the same package is confusing. Perhaps we could differentiate the two. Either/Or:

I don’t actually find that confusing. There are very few core types, and not that many tasks. Keeping it all under a single package is fine, and a good tradeoff IMO.

I actually tried this for the container image operations: engine.#Pull used to be engine/oci.#Pull, etc. I reverted it back because the extra overhead is just not worth it.

  • We could move data types to engine/types.{FS,Secret,...}

Same reaction, not worth it. Also dagger.#FS is more clear than types.#FS in a config.

  • There's no reason for types to be definitions: it could be types.Secret or whatever (because those two things are inherently different, what forces tasks to be definitions doesn't apply to types)

This is a completely new concept to me, so I might need time to digest it. My initial reaction is: definitions nowhere could work, definitions everywhere could work, half and half scares me because it’s too subtle. But my opinion may evolve as I digest.

On consistency:

If a CUE type does not have a $dagger field, it is not special to the Go engine implementation

Following this logic -- shouldn't we add a marker to engine.#CacheDir? engine.#Mount? engine.#Ref? engine.#Plan? Those are special to the engine, but they are just plain data structures passed to tasks, they're not tasks themselves.

If the engine actually needs to accurately match these types (which is what I meant by “special”), then yes they should have a marker. But:

  • #Ref is not actually matched in the engine (it’s used in #Push and #Pull which are matched and therefore have a marker)
  • #CacheDir is not matched (it’s used in #Exec which is mathed and therefore has a marker)
  • #Mount is not matched (used in #Exec)
  • #Plan is not matched (it’s assumed to be unified with the top-level root. This should not be mandatory by the way but that’s another story)

Where does it stop? Should we have an engine.#String since the engine is reading/writing those (not intended as a joke -- I'm thinking of *aws.String in aws-go).

What I’m proposing is a very clear cut rule. Either the engine needs to reliably match a type (aka “it’s special”) or not. I don’t understand how this #String type would work, but the rule is very simple and unambiguous, so I’m confident we can apply it easily to any new types.

You don’t have to agree with my suggestion that grouping everything under top-level marker key is better DX, but let’s at least agree that it’s a reasonable and easy rule to apply. This “where does it stop” argument is nonsense.

On inputs:

This PR does not remove the hidden fields under inputs, because they have been created outside of the spec, but I recommend removing them too.

I agree the inlined _type is very weird.

However, I recommend doing something along these lines:

#Plan: {
    input: secrets: [name=string]: #_ClientReadSecret | _#ClientExecSecret
}

// Read secret from a file ON THE CLIENT MACHINE
_#ClientReadSecret: {
    _task: "ClientReadSecret"

    path: string

    // Reference to the secret contents
    contents: #Secret
}

_#ClientExecSecret: {
    _task: "ClientExecSecret"

    // Execute a command ON THE CLIENT MACHINE and read secret from standard output
    command: [string, ...string] | string
    // Execute command in an interactive terminal
    //  for example to prompt for a passphrase
    interactive: true | *false   

    // Reference to the secret contents
    contents: #Secret
}

engine.#ReadFile and engine._#ClientReadSecret behave pretty much the same as far as tasks go. Only reason _#ClientReadSecret is hidden is because we don't want anyone using that directly.

#Plan schema enforces where you're allowed to use those tasks.

To me this feels pretty consistent, I believe advanced users would have no trouble understanding what's going on.

This is not terrible but I believe we can do better (I mean DX wise - I don’t have strong opinions on the implementation beyond how it impacts DX).

Specifically I see 2 possible improvements to the DX:

  1. As you know for consistency, I strongly prefer all our markers to follow the same style: either hidden fields or dollar fields, but not a mix. In practice this means all dollars (either one big dollar or 5 little dollars :). BUT I am guessing that your proposal does not allow for making these privileged tasks to be dollars, because you rely on the scoped nature of hidden fields to prevent malicious actions from executing them at will. My answer to this is that, if you added filtering by location (to make sure privileged tasks can only be executed from the authorized path aka: not in actions), then you can change to dollars and we can achieve balance in the universe once again.

  2. At this point, if we can filter these privileged tasks by location, the logical next question is: why do we need the dollar fields in the first place? Can’t we just check the location at task lookup, and use the location itself as a marker? This makes the DX even better because A) it’s less code which is always good, and B) it downgrades dagger.#Plan from “mandatory, nothing will run unless you unify it at the root” to “it’s a regular convenience definition, you should probably use it for type checking, but there’s nothing magic about it”.

EDIT: adding your follow-up comment here, which I only saw now

BTW, as you can tell I'm pretty excited about this model because IMHO it makes contributing/developing the Engine dead simple.

Want to add support for output?

  • Define a _#ClientWriteDirectory task
  • Add output: directories: [name=string]: _#ClientWriteDirectory to the #Plan
  • Drop your Go implementation in plan/task/clientwritedirectory.go (and call Register("ClientWriteDirectory", yourHandler) in init())
  • Done

No other changes to the entire Go code base. No need to add a concept of outputs, that's entirely in CUE and the core task model is enough. Some day those could very well be external plugins.

Since we're using cue/flow, those outputs are optimized out of the box. No need to wait for all actions to complete before starting to write outputs -- as soon as the output's dependencies are completed, the export will start.

I love this, and would love it even more if privileged tasks could be registered against a path instead of a marker. That way I get my dream DX, and you get your dream implementation :)

I think the distinction between privileged and unprivileged tasks is important enough to justify a slightly different registration mecanism (but then it can all be run in the same unified cue-flow, just with slightly different matching logic in the respective cueflow handlers). Perhaps this is wishful thinking and absolutely impossible in the implementation,but one can hope :)

@aluzzardi
Copy link
Copy Markdown
Contributor

aluzzardi commented Dec 16, 2021

What I’m proposing is a very clear cut rule. Either the engine needs to reliably match a type (aka “it’s special”) or not. I don’t understand how this #String type would work, but the rule is very simple and unambiguous, so I’m confident we can apply it easily to any new types.

You don’t have to agree with my suggestion that grouping everything under top-level marker key is better DX, but let’s at least agree that it’s a reasonable and easy rule to apply. This “where does it stop” argument is nonsense.

Perhaps that's the misconception.

The engine doesn't care one bit. It doesn't need to reliably match core types. #FS: string would work just as much. The engine knows that #Exec.input is an #FS in the same way it knows #Exec.always is a bool: that's an expectation, derived from #Exec's schema.

The thing that cares about differentiating #FS from #Secret is CUE itself (e.g. mount: contents: #FS | #Service).

It's not nonsense -- we have the same problem with engine.#TempDir, which contains no other field than an optional size?, which is throwing off CUE disjunctions (mount: contents: #FS | #Service | #TempDir --> CUE is just as confused about #TempDir than it is about #FS).

EDIT

What I'm saying is, we don't need any kind of special markers for core types.

What we need is CUE to be happy about ALL of our types (whether they're low or high level) being distinguishable one another if we plan to use disjunctions on them or validate that we're passing the right thing (e.g. passing a #FS instead of a #Service).

Where does it stop? --> I legit meant this in a non sarcastic way. Like, do we end up throwing a $dagger field whenever CUE can't make a clear cut distinction between types (e.g. #TempDir)?

In the case of #TempDir, I'd suggest we make it distinguishable by changing size?: int to size: int | *-1 or something.

And the same goes for core types -- the runtime doesn't care about _fs or $dagger or whatnot. As long as you have #FS: { any field, really }, CUE is happy.

@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 16, 2021

The engine doesn't care one bit. It doesn't need to reliably match core types. #FS: string would work just as much. The engine knows that #Exec.input is an #FS in the same way it knows #Exec.always is a bool: that's an expectation, derived from #Exec's schema.
The thing that cares about differentiating #FS from #Secret is CUE itself (e.g. mount: contents: #FS | #Service).

OK I understand what you mean, but what about the ID field? The engine needs to fill and lookup that field. Don’t you count that as a “special” interaction that makes #FS fundamentally different from eg. #Mount? Granted, “fill and lookup and ID” is not the same special interaction as “match as a cueflow task”. But there’s still a clear line between “types that our Go implementation must know about” and “types that it doesn’t need to know about”.

It's not nonsense -- we have the same problem with engine.#TempDir, which contains no other field than an optional size?, which is throwing off CUE disjunctions (mount: contents: #FS | #Service | #TempDir --> CUE is just as confused about #TempDir than it is about #FS).

Let me provide additional context on this one, since I wrote that disjunction.

  • As a CUE developer I wrote this disjunction because it seemed like the best DX for my definition: one less key depth
  • Several of the values in the disjunctions happen to be core types, but I don’t think it’s the core type’s responsibility to make my disjunctions work. It’s more convenient to me if it does, but worst case, I can just add an extra key, or perhaps tweak #TempDir to make that field have a default value and not be optional.

Bottom line, I’m fine with a strict interpretation of my “markers are only for special interactions with the engine” rule. If that means a little more work for me getting my magic disjunctions to work, that’s OK.

@aluzzardi
Copy link
Copy Markdown
Contributor

I think we landed on a good compromise which incorporates:

// Core types (e.g. #FS)
#FS: {
  $dagger: fs: _id: string
}

// Tasks (e.g. #Exec)
#Exec: {
  $dagger: task: _id: "Exec"
}

@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 16, 2021

I think we landed on a good compromise which incorporates:

// Core types (e.g. #FS)
#FS: {
  $dagger: fs: _id: string
}

// Tasks (e.g. #Exec)
#Exec: {
  $dagger: task: _id: "Exec"
}

Yes this works for me. Note that in the future I might advocate for getting rid of hidden fields, depending on whether the possible issue described in #1241 materializes. But we can cross that bridge if/when we get to it.

@talentedmrjones
Copy link
Copy Markdown

Im not sold on $dagger: task: _id: "Exec" as using the _id like that is weird. I get that _task wouldnt work though so I propose: $dagger: task: _name: "Exec"

I might agree with @shykes re: hidden fields. It might get weird to see visible keys with invisible and un-referencable contents

@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 16, 2021

Im not sold on $dagger: task: _id: "Exec" as using the _id like that is weird. I get that _task wouldnt work though so I propose: $dagger: task: _name: "Exec"

I’m fine with that too.

I might agree with @shykes re: hidden fields. It might get weird to see visible keys with invisible and un-referencable contents

Just to re-iterate, I’m happy to kick this can down the road.

@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 16, 2021

Next step: I will update this PR so that the spec matches the corresponding implementation change in #1242 .

@shykes shykes changed the title Europa: remove hidden fields in engine, they don't work the way we want Europa: use $dagger field to mark CUE types for engine recognition Dec 16, 2021
@shykes shykes changed the title Europa: use $dagger field to mark CUE types for engine recognition Europa: Great Compromise of December 16, 2021. (spec). Dec 17, 2021
@shykes
Copy link
Copy Markdown
Contributor Author

shykes commented Dec 17, 2021

Merging in spite of failing (flaky) integration test, since this is a 100% spec change.

@shykes shykes merged commit eb86b07 into dagger:main Dec 17, 2021
@shykes shykes deleted the europa-hidden-fields-plan-b branch December 17, 2021 05:57
@shykes shykes changed the title Europa: Great Compromise of December 16, 2021. (spec). Europ: The Great Compromise of December 16, 2021. (spec). Dec 17, 2021
@shykes shykes changed the title Europ: The Great Compromise of December 16, 2021. (spec). Europa: The Great Compromise of December 16, 2021. (spec). Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants