Introduce python view, pickle format, local storage#62
Conversation
Codecov Report
@@ Coverage Diff @@
## golden #62 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 17 21 +4
Lines 619 675 +56
Branches 74 75 +1
=========================================
+ Hits 619 675 +56
Continue to review full report at Codecov.
|
0ea31f8 to
a2f07c7
Compare
src/arti/views/python.py
Outdated
|
|
||
|
|
||
| @write.register | ||
| def _write(view: Python, format: Pickle, storage: LocalFile) -> None: |
There was a problem hiding this comment.
Planting this seed (probably should wait to experiment), but iirc multimethod supports class hierarchy based dispatch. In other words, if we built a few mixins (or Protocols, granted multimethod support) for format/storage/view behavior (right now, they're all data), then we could add some high-level read/write handlers that delegate to cover a lot of ground, while still being able to write specific or optimized handlers as needed.
Eg: we could:
- add a
FileLikemixin/Protocolthat has an.openmethod - make
LocalFilesubclassFileLikeand implement.open - make
read/writehandlers that have(format: Pickle, storage: FileLike, view: Python)and usestorage.openinstead ofopen - mix and match with format/view interface mixins for more generalization OR dive into specific classes (like the current PR) for optimizations
src/arti/views/core.py
Outdated
|
|
||
|
|
||
| @multidispatch | ||
| def read(view: View, format: Format, storage: Storage) -> View: |
There was a problem hiding this comment.
What do you think about changing this to -> None and just relying on the view.data assignment? That might be a bit more clear that it works via mutation.
Alternatively, should we get rid of the mutation (can't remember where our prior discussion landed) and return the underlying data/-> Any? I think the challenge (is it one?) we realized was that then write needs an additional data: Any argument
There was a problem hiding this comment.
I don't think we like this idea:
Alternatively, should we get rid of the mutation (can't remember where our prior discussion landed) and return the underlying data/-> Any? I think the challenge (is it one?) we realized was that then write needs an additional data: Any argument
Specifically because then the issue of "where is my data" becomes muddled. One side effect is, as you point out, write then needs to find the data before writing. I think the way we have it as-is, a View is an actual object with the data, rather than just an instruction, which smooths over these issues.
There was a problem hiding this comment.
I'm still torn because that makes View act a bit different than the strictly metadata Format, Storage, etc (which can be reused across calls to build). For example, if we added a PandasDataFrameView(index=["date"]), than in each call to build (eg: for multiple partitions), we would have to recreate that object w/ the same metadata just to wrap different data.
--
Let's leave View.data is for now and we can reconsider (weighing the costs of an extra write argument) later on.
There was a problem hiding this comment.
we would have to recreate that object w/ the same metadata just to wrap different data.
🤷 Is an artifact a thing or a collection of things? Is a view a snapshot of a thing or a collection of snapshots of things? I think if Views support slicing (in this case, rows), different instances per partition feels natural.
43d6199 to
809878b
Compare
* Add NamedTemporaryFile alternative that can be re-opened on Windows * Use NamedTemporaryFile rather than mock_open (to fix breakpoint)
cb71a5c to
bddd507
Compare
JacobHayes
left a comment
There was a problem hiding this comment.
Looking great! Punting on View.data mutation until we build some of the higher up machinery.
--
Next up, we probably need to update Producer.build to accept Views rather than Artifacts directly... Thinking about that a bit, if we expand View to have similar matching as the TypeAdapters, then maybe the build signature can be Annotated[pd.DataFrame, MyArtifact] rather than requiring PandasDataFrame[MyArtifact]
More complex cases may still want to use PandasDataFrame directly to pick a specific View or set additional metadata, but the simple cases would stay simple. 🤔
This PR introduces simple example implementations of View (Python), Storage (LocalFile), and Format (Pickle).
The
Viewobject is instantiated with.readand persisted with.write, but these methods in fact call out to multiple dispatch functionsreadandwrite, that implement the expected behavior based on the types of the passed input arguments (namely, view, storage, format).In principle, the read/write will have to do quite a bit of heavy lifting, potentially using
type_systemattributes for conversion, but in this lucky example of Python/Local/Pickle, none of that is necessary.