Skip to content

Comments

Add base Version classes#36

Merged
JacobHayes merged 1 commit intogoldenfrom
version
Apr 25, 2021
Merged

Add base Version classes#36
JacobHayes merged 1 commit intogoldenfrom
version

Conversation

@JacobHayes
Copy link
Member

@JacobHayes JacobHayes commented Apr 21, 2021

Adds the base Version class with a few starter implementations representing different strategies for a Producer:

  • GitCommit: change version automatically based on the git commit
  • SemVer: a "major.minor.patch" tuple following (my skimming of) https://semver.org/
    • <1.0.0 hashes all components
    • >=1.0.0 hashes only the major component
  • String: arbitrary version string
  • Timestamp: Unix seconds
  • _Source: an experimental strategy that fingerprints the Producer's source code definition (:exploding_head:... cc @mikss)

Tagged a few others to get more ideas around these and other strategies.

@JacobHayes JacobHayes self-assigned this Apr 21, 2021
@mikss
Copy link
Contributor

mikss commented Apr 21, 2021

Taking a deep look, but before then...

I know this is experimental, but the _Source idea is both appealing and terrifying. Although, I suppose the risk is primarily false positives (e.g., "I think the Producer has changed, even though it has not, due to something silly like going from 3 spaces to 4 spaces") rather than false negatives (e.g., "I think the Producer has not changed, since the source has not changed, even though its functionality is actually different now for some reason").

Is it safer to go to a lower level of abstraction (e.g., abstract syntax tree or compiled bytecode)? Would those be more robust to, e.g., silly source code changes like variable renaming (genuinely asking, since it's been a long time since I've touched this kind of thing...).

@JacobHayes
Copy link
Member Author

the _Source idea is both appealing and terrifying

Very much so! I was just gonna add a comment in the PR description saying we could do it, but in writing how easy it'd be, I just decided to implement it haha. 😁

My guess is false negatives (or at least behavior preferences) probably increase "exponentially" with Producer complexity (custom base class, non-method helper functions, etc) and false positives are probably not a big deal for simple Producers (either they're cheaper to rerun or have little that could change), so they're probably best suited for small one-offs.

There might also be little value in hashing the version (since that's what this is), map (map semantic changes will get picked up via different input dependencies), and a few other definitions. We could also consider hashing just the .build method's source, trading off a loss of sensitivity to other helper methods (eg: sometimes the build method is just a few lines - not a lot of value hashing that alone). We might be able to recursively parse references, but that might be the other extreme of too sensitive (eg: direct and transitive libraries/utils/etc references).

Buuut yeah, using something like the AST (pruning comments, etc) might be reasonable and support more tunable behavior. With that approach, we might be able to:

  • Use AST, stripping all comments and other "junk" things
  • AST the build method
  • recurse into certain references, configurable by user (eg: don't recurse, only consider MRO methods, recurse all, etc)
  • combine those ASTs into a single Fingerprint

This would probably be a v2 thing, but could be implemented in any arti consumer if desired!

class Timestamp(Version):
def __init__(self, dt: Optional[datetime] = None):
if dt is None:
dt = datetime.utcnow()
Copy link
Contributor

@mikss mikss Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this default utcnow very confusing. Consider, e.g., the usage in the Producer class

version: Version = Timestamp()
, where the default timestamp will register at class declaration / module import time. But that time is kind of meaningless, right? E.g., suppose you subclass and define class NewProducer(Producer): and do not declare a version... then its version is the version of its parent, which is Timestamp, which defaults to utcnow, specifically the "UTC now" of whenever Producer was brought into local memory, the import time... right?

IDK about this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's sort of intended to be "meaningless" for cases where you just want to rerun something every time, perhaps because you're developing/iterating or as part of a "side effect" (see #11 for more there). Perhaps a Random one would be better for that case (though ordering by time might be convenient).

Note that a Producer's fingerprint (via static_fingerprint) also mixes in the Producer's class name (.key) so has a different fingerprint (akin to ==1.2.3 - version 1.2.3 of what?):

[ins] In [1]: from arti.producers.core import Producer
         ...: from arti.artifacts.core import Artifact
         ...:
         ...: class P1(Producer):
         ...:     def build(self, x: Artifact) -> Artifact:
         ...:         pass
         ...:
         ...: class P2(P1):
         ...:     pass
         ...:
         ...: p1, p2 = P1(x=Artifact()), P2(x=Artifact())
         ...: print([p1.version.fingerprint, p2.version.fingerprint])
         ...: print([p1.static_fingerprint, p2.static_fingerprint])
         ...:
[Fingerprint(key=int64(1619050761)), Fingerprint(key=int64(1619050761))]
[Fingerprint(key=int64(4585353627840274035)), Fingerprint(key=int64(-5625694804379569552))]

Either way, "set on import" semantics might be a bit tricky - do you think it'd be better to make these "set on access" or "set at (hypothetical) Graph.run"? On a completely different / crazy tangent, this might be interesting for long running processes (like a "live loader" / builder).

Regardless, any "dynamic" behavior could cause issues in distributed contexts (eg: dask) where different workers will start at different times (and hence disagree about where to look for output - though still shouldn't have collisions). Hence, we'll need to serialize / resolve all things statically (either import time, Graph def time, Graph run time, etc) and then pass json/pickle objects recording the original key (ie: resolve once on the "main" executor node, pass to workers).

Copy link
Contributor

@mikss mikss Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to do "Graph run time"? To me, that feels more intuitive, and is the actual resolution to your proposed potential uses of

(though ordering by time might be convenient).

and

for cases where you just want to rerun something every time

.


I ask because as currently written, it's not clear to me, since utcnow is the default for Timestamp, which is the default for Version, which is set at Graph def time, not at Graph run time.

Copy link
Member Author

@JacobHayes JacobHayes Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way with code changes. 😄 I'll remove the defaulting logic here and in the future when I build out Graph run logic, I'll pick "Graph run time" for:

serialize / resolve all things statically (either import time, Graph def time, Graph run time, etc) and then pass json/pickle objects recording the original key

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the default logic makes sense to me. Sweet.

@codecov
Copy link

codecov bot commented Apr 23, 2021

Codecov Report

Merging #36 (2be84af) into golden (5a57411) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 2be84af differs from pull request most recent head c2e610a. Consider uploading reports for the commit c2e610a to get more accurate results
Impacted file tree graph

@@            Coverage Diff            @@
##            golden       #36   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        15    +1     
  Lines          512       544   +32     
  Branches        68        66    -2     
=========================================
+ Hits           512       544   +32     
Impacted Files Coverage Δ
src/arti/graphs/core.py 100.00% <ø> (ø)
src/arti/fingerprints/core.py 100.00% <100.00%> (ø)
src/arti/internal/utils.py 100.00% <100.00%> (ø)
src/arti/producers/core.py 100.00% <100.00%> (ø)
src/arti/versions/core.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a57411...c2e610a. Read the comment docs.

@JacobHayes JacobHayes force-pushed the fingerprint branch 3 times, most recently from 7f45735 to 9196625 Compare April 25, 2021 01:18
Base automatically changed from fingerprint to golden April 25, 2021 01:22
@JacobHayes JacobHayes force-pushed the version branch 3 times, most recently from 2d44fb3 to 832641a Compare April 25, 2021 06:38
@JacobHayes JacobHayes merged commit d8bc228 into golden Apr 25, 2021
@JacobHayes JacobHayes deleted the version branch April 25, 2021 06:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants