Use functors in Diffing, instead of too many parameters#3
Use functors in Diffing, instead of too many parameters#3gasche wants to merge 6 commits intoOctachron:semdiff_type_declfrom
Conversation
54c3936 to
125520e
Compare
|
I agree that |
|
The gains from my perspective are the following:
The loss is the fact that some forms of parametrization between different "instances" cannot be easily expressed anymore: (There is room for potential naming discussions: maybe the types |
| diff depending on the placement choices for a prefix of the | ||
| input. This is done by returning (optional) extensions for the | ||
| left or right input array. *) | ||
| val update : change -> state -> state * left array option * right array option |
There was a problem hiding this comment.
The option type is splitting the "no extension" value into None | Some [||]. If the aim is to explicit the fact that extensions are not mandatory, I would rather kept the not variadic variant (and ideally keep the limitation that extension are only allowed on one side).
There was a problem hiding this comment.
I was going to make the same comment: the previous (slightly convoluted) type was there on purpose, we want to enforce that users are only extending on one side only, and that it doesn't change during execution.
There was a problem hiding this comment.
The change of type is indeed blurring the distinction between "I statically know that we will never extend along this direction" and "dynamically I don't want to extend this time". It results in a substantial simplification -- but we could always revert to the previous complex API.
(One thing I can do but haven't yet is get rid of the dynamic = [||] check in the diffing code, given that producers now choose None in this situation instead.)
Why is it important to prevent an update function from trying to update both directions? I don't understand the details of the algorithm but the code suggests that extending alternatively in either directions would in fact be fine.
There was a problem hiding this comment.
Termination is not guaranteed if the client can extends on both side (it could alternate a bit on each side in a loop).
We could just write it in the documentation, but this is a low-usage API, so we should use the inconvenient-but-safe API. With the update that returns 2 options, it's too easy to do a local change in Includemod_error (for instance), without realizing that an invariant is broken.
There was a problem hiding this comment.
Is termination guaranteed if the update function always add a new line? The column width remains fixed, but I would naively assume that you need to explore the infinitely-growing lines in case they contain a better path?
it's too easy to do a local change in Includemod_error (for instance), without realizing that an invariant is broken.
What is the invariant we are talking about? If it is "we only add in the same direction", why is it worth enforcing in the code?
There was a problem hiding this comment.
updateis only called in theKeep/Changecase, which corresponds to a diagonal move in the matrix.
I don't get it. Here is the code I see in trunk:
let compute_inner_cell ~weight ~test ~update tbl i j =
let compute_proposition i j diff =
let* diff = diff in
let+ localstate = Matrix.state tbl i j in
weight diff + Matrix.weight tbl i j, (diff, localstate)
in
let del =
let diff = let+ x = Matrix.line tbl (i-1) j in Delete x in
compute_proposition (i-1) j diff
in
let insert =
let diff = let+ x = Matrix.column tbl i (j-1) in Insert x in
compute_proposition i (j-1) diff
in
let diag =
let diff =
let* state = Matrix.state tbl (i-1) (j-1) in
let* line = Matrix.line tbl (i-1) (j-1) in
let* column = Matrix.column tbl (i-1) (j-1) in
match test state.state line column with
| Ok ok -> Some (Keep (line, column, ok))
| Error err -> Some (Change (line, column, err))
in
compute_proposition (i-1) (j-1) diff
in
let*! newweight, (diff, localstate) =
select_best_proposition [diag;del;insert]
in
let state = update diff localstate in
Matrix.set tbl i j ~weight:newweight ~state ~diff:(Some diff)It looks to me like update is called unconditionally, on a diff whose category (insert/delete/keep/change) depends on the user-provided weight function.
I think that the property you are referring to is the fact that, in the user code in includemod.ml, the update function in fact only provides a non-empty array in the Keep/Change case. This is what I call a dynamic property (in particular, there is no enforcement by the type system).
There was a problem hiding this comment.
Indeed the following code diverges in trunk:
let () =
let li =
variadic_diff
~weight:(function _ -> 0)
~test:(fun _ _ _ -> Error ())
~update:(With_left_extensions (fun _ () -> (), [|()|]))
() [|()|] [|()|]
in
print_endline (string_of_int (List.length li))There was a problem hiding this comment.
Do we have a deal this way?
Not at all.
Exposing variadic diffing as the only interface seems unnecessarily error prone.
(Of course the update type is an implementation detail that can go away).
Similarly, I would rather make the variadic interface as constrained as possible (with one side at a time, which removes a possible mistake at the consumer sode).
And adding those specialized functors seems straigthforward?
There was a problem hiding this comment.
And adding those specialized functors seems straigthforward?
Yes, of course. I can also reinstate the ugly update (I don't particularly care) if this is a blocking point.
There was a problem hiding this comment.
The interface could be strengthened to avoid this issue, but your example is in the territory of "clearly malicious codewhereas the current iteration was aimed at relieving theupdate` writer from the burden of remembering which side should be extended.
Drup
left a comment
There was a problem hiding this comment.
At the beginning, I was totally on board with this change ... until I realized Diffable needs to be a functor parametrized by the local environment. If that's the case, we gain approximately nothing, we just trade one kind of complexity for another. If on top of that you need to simplify the arguments (like update so that everyone follow the same mold, it's really not worth it all that much.
| diff depending on the placement choices for a prefix of the | ||
| input. This is done by returning (optional) extensions for the | ||
| left or right input array. *) | ||
| val update : change -> state -> state * left array option * right array option |
There was a problem hiding this comment.
I was going to make the same comment: the previous (slightly convoluted) type was there on purpose, we want to enforce that users are only extending on one side only, and that it doesn't change during execution.
|
@Drup What is the problem with having some Diff callsites use a Diffable parametrized over the local environment? I could get rid of this by defining them as a local module inside the corresponding |
|
It's not a "problem" per se. but you are trading parametric polymorphism with 4 or 5 arguments, with a functor+local functor declaration. The complexity difference just isn't that big. |
|
It's the same on the callsite, but the definition side is substantially simpler, I think? Both in the type declarations and in the implementation. (I realize now reading the functor implementation again that I forgot to propagate the |
125520e to
70c5baa
Compare
|
Done. On types: -val diff :
- weight:(('l, 'r, 'eq, 'diff) change -> int) ->
- test:('state -> 'l -> 'r -> ('eq, 'diff) result) ->
- update:(('l, 'r, 'eq, 'diff) change -> 'state -> 'state) ->
- 'state -> 'l array -> 'r array -> ('l, 'r, 'eq, 'diff) patch
+ val weight : change -> int
+ val test : state -> left -> right -> (eq, diff) result
+ val update : change -> state -> state * left array option * right array option
+ val diff : T.state -> T.left array -> T.right array -> T.patchOn terms: -let diff ~weight ~test ~update state line column =
- let update d fs = { fs with state = update d fs.state } in
- let fullstate = { line; column; state } in
- compute_matrix ~weight ~test ~update fullstate
- |> construct_patch
-
+ let diff state line column =
+ { state; line; column }
+ |> compute_matrix
+ |> construct_patch |
|
(I'm thinking of going ahead and functorizing |
|
Concerning the Diffable functor, this can be avoided by integrating the environment and location in the state, isn't it? |
|
I pushed a new commit that also functorizes |
|
There you go: the |
|
Also as a generic comment, the functorization of the |
|
I wanted to see |
|
I'm a bit frustrated by the discussion around this PR. I have the impression that I made a real effort working on this code to get the hang of it; not expecting a medal, but you could sound more enthusiastic about #2 at least! |
|
Thank you for taking the time of understanding the rootPR. #2 is mostly moving around complexity (maybe decreasing it a little?). |
|
In order to facilitate comparison with other approaches, I implemented the restricted interfaces with |
|
Re. termination: I think that the key argument is that termination is ensured when the update function that the "input sizes" of all produced intermediary states are globally bounded in both directions. One sufficient criterion for this is that (1) one size never extends, and (2) the other never extends on Insert/Delete, but both aspects could be relaxed, it's just that nobody cares to write the more general implementation and that we don't have use-cases for now. (The current codebase is not correct in presence of extension on both sides; I think that the matrix computation works but reconstruction of the best path does not. This is a sensible reason to hide the extending-on-both-sides capability from the user.) |
|
I have played with a version with twice the number of functors: https://github.com/Octachron/ocaml#semdiff_functor_types . Having separate type definition functors make the impact on the implementation less intrusive while making sure that each diffing function is constrained to its own change type. |
|
I think the result is reasonably nice. Would you care to submit it as a PR to your PR, superseding these ones? Some comments looking at semdiff_type_decl...semdiff_functor_types (I can't post comments inline), in reverse patch order:
|
|
Indeed, I completely ran out of naming fuel along the way.
This has the advantage of making the |
|
Superseded by #4 |
Effect syntax: use Ctype.new_local_type
…l#13294) The toplevel printer detects cycles by keeping a hashtable of values that it has already traversed. However, some OCaml runtime types (at least bigarrays) may be partially uninitialized, and hashing them at arbitrary program points may read uninitialized memory. In particular, the OCaml testsuite fails when running with a memory-sanitizer enabled, as bigarray printing results in reads to uninitialized memory: ``` ==133712==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x4e6d11 in caml_ba_hash /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45 #1 0x52474a in caml_hash /var/home/edwin/git/ocaml/runtime/hash.c:251:35 #2 0x599ebf in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1065:14 #3 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9 #4 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3 #5 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef) #6 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef) #7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37) Uninitialized value was created by a heap allocation #0 0x47d306 in malloc (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x47d306) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37) #1 0x4e7960 in caml_ba_alloc /var/home/edwin/git/ocaml/runtime/bigarray.c:246:12 #2 0x4e801f in caml_ba_create /var/home/edwin/git/ocaml/runtime/bigarray.c:673:10 #3 0x59b8fc in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14 #4 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9 #5 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3 #6 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef) #7 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef) #8 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37) SUMMARY: MemorySanitizer: use-of-uninitialized-value /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45 in caml_ba_hash ``` The only use of hashing in genprintval is to avoid cycles, that is, it is only useful for OCaml values that contain other OCaml values (including possibly themselves). Bigarrays cannot introduce cycles, and they are always printed as "<abstr>" anyway. The present commit proposes to be more conservative in which values are hashed by the cycle detector to avoid this issue: we skip hashing any value with tag above No_scan_tag -- which may not contain any OCaml values. Suggested-by: Gabriel Scherer <[email protected]> Signed-off-by: Edwin Török <[email protected]> Co-authored-by: Edwin Török <[email protected]>
This PR is on top of #2. I wrote it as an integrable proof-of-concept:
if @Octachron agrees that this makes the code nicer, ideally he would be willing to finish the job by also functorizing Diffing_with_keys on his own. I now implemented functorization for bothDiffingandDiffing_with_keys, completing the change I had in mind.