Conversation
|
Taking a deep look, but before then... I know this is experimental, but the Is it safer to go to a lower level of abstraction (e.g., abstract syntax tree or compiled bytecode)? Would those be more robust to, e.g., silly source code changes like variable renaming (genuinely asking, since it's been a long time since I've touched this kind of thing...). |
Very much so! I was just gonna add a comment in the PR description saying we could do it, but in writing how easy it'd be, I just decided to implement it haha. 😁 My guess is false negatives (or at least behavior preferences) probably increase "exponentially" with Producer complexity (custom base class, non-method helper functions, etc) and false positives are probably not a big deal for simple Producers (either they're cheaper to rerun or have little that could change), so they're probably best suited for small one-offs. There might also be little value in hashing the Buuut yeah, using something like the AST (pruning comments, etc) might be reasonable and support more tunable behavior. With that approach, we might be able to:
This would probably be a v2 thing, but could be implemented in any |
src/arti/versions/core.py
Outdated
| class Timestamp(Version): | ||
| def __init__(self, dt: Optional[datetime] = None): | ||
| if dt is None: | ||
| dt = datetime.utcnow() |
There was a problem hiding this comment.
I find this default utcnow very confusing. Consider, e.g., the usage in the Producer class
artigraph/src/arti/producers/core.py
Line 25 in 7b81d97
class NewProducer(Producer): and do not declare a version... then its version is the version of its parent, which is Timestamp, which defaults to utcnow, specifically the "UTC now" of whenever Producer was brought into local memory, the import time... right?
IDK about this one.
There was a problem hiding this comment.
Yeah, it's sort of intended to be "meaningless" for cases where you just want to rerun something every time, perhaps because you're developing/iterating or as part of a "side effect" (see #11 for more there). Perhaps a Random one would be better for that case (though ordering by time might be convenient).
Note that a Producer's fingerprint (via static_fingerprint) also mixes in the Producer's class name (.key) so has a different fingerprint (akin to ==1.2.3 - version 1.2.3 of what?):
[ins] In [1]: from arti.producers.core import Producer
...: from arti.artifacts.core import Artifact
...:
...: class P1(Producer):
...: def build(self, x: Artifact) -> Artifact:
...: pass
...:
...: class P2(P1):
...: pass
...:
...: p1, p2 = P1(x=Artifact()), P2(x=Artifact())
...: print([p1.version.fingerprint, p2.version.fingerprint])
...: print([p1.static_fingerprint, p2.static_fingerprint])
...:
[Fingerprint(key=int64(1619050761)), Fingerprint(key=int64(1619050761))]
[Fingerprint(key=int64(4585353627840274035)), Fingerprint(key=int64(-5625694804379569552))]Either way, "set on import" semantics might be a bit tricky - do you think it'd be better to make these "set on access" or "set at (hypothetical) Graph.run"? On a completely different / crazy tangent, this might be interesting for long running processes (like a "live loader" / builder).
Regardless, any "dynamic" behavior could cause issues in distributed contexts (eg: dask) where different workers will start at different times (and hence disagree about where to look for output - though still shouldn't have collisions). Hence, we'll need to serialize / resolve all things statically (either import time, Graph def time, Graph run time, etc) and then pass json/pickle objects recording the original key (ie: resolve once on the "main" executor node, pass to workers).
There was a problem hiding this comment.
Is there a way to do "Graph run time"? To me, that feels more intuitive, and is the actual resolution to your proposed potential uses of
(though ordering by time might be convenient).
and
for cases where you just want to rerun something every time
.
I ask because as currently written, it's not clear to me, since utcnow is the default for Timestamp, which is the default for Version, which is set at Graph def time, not at Graph run time.
There was a problem hiding this comment.
There is a way with code changes. 😄 I'll remove the defaulting logic here and in the future when I build out Graph run logic, I'll pick "Graph run time" for:
serialize / resolve all things statically (either import time, Graph def time, Graph run time, etc) and then pass json/pickle objects recording the original key
There was a problem hiding this comment.
Removing the default logic makes sense to me. Sweet.
Codecov Report
@@ Coverage Diff @@
## golden #36 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 14 15 +1
Lines 512 544 +32
Branches 68 66 -2
=========================================
+ Hits 512 544 +32
Continue to review full report at Codecov.
|
7f45735 to
9196625
Compare
2d44fb3 to
832641a
Compare
Adds the base Version class with a few starter implementations representing different strategies for a Producer:
<1.0.0hashes all components>=1.0.0hashes only the major componentTagged a few others to get more ideas around these and other strategies.