Use functors in Diffing, instead of too many parameters by gasche · Pull Request #3 · Octachron/ocaml

gasche · 2021-06-08T07:56:00Z

This PR is on top of #2. I wrote it as an integrable proof-of-concept: ~~if @Octachron agrees that this makes the code nicer, ideally he would be willing to finish the job by also functorizing Diffing_with_keys on his own~~. I now implemented functorization for both Diffing and Diffing_with_keys, completing the change I had in mind.

Octachron · 2021-06-08T08:12:12Z

I agree that Diffing.diff is essentially a functor with only one function in the body. This is really the edge case where I hesitate between functors and functions with too many arguments.

gasche · 2021-06-08T08:15:01Z

The gains from my perspective are the following:

fewer type parameters all around: I can actually read the types now
no need to carry ~test ~update all around

The loss is the fact that some forms of parametrization between different "instances" cannot be easily expressed anymore: map is not expressible for example. But we didn't need those anyway. Having to repeat the type declarations in the producer and the consumer also gets tiresome, so I kept a parametrized version of change outside the functor.

(There is room for potential naming discussions: maybe the types eq and diff should be eq_witness and diff_witness for example?)
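
To make the trade-off concrete, here is a minimal sketch of the two styles on a toy problem. Nothing below is the actual Diffing code: count_mismatches_fn, the one-function Diffable signature, Make and Int_string are hypothetical names chosen only for illustration.

```ocaml
(* Toy stand-in for the real diffing code: count the positions at which
   two arrays disagree. All names here are hypothetical. *)

(* Function style: every call carries ~test, and the type of the function
   has one type parameter per moving part. *)
let count_mismatches_fn ~(test : 'l -> 'r -> bool) (l : 'l array) (r : 'r array) =
  let n = min (Array.length l) (Array.length r) in
  let count = ref 0 in
  for i = 0 to n - 1 do
    if not (test l.(i) r.(i)) then incr count
  done;
  !count

(* Functor style: the types and the test are fixed once, in the argument. *)
module type Diffable = sig
  type left
  type right
  val test : left -> right -> bool
end

module Make (D : Diffable) = struct
  (* No type parameters and no ~test argument left in this signature. *)
  let count_mismatches (l : D.left array) (r : D.right array) =
    let n = min (Array.length l) (Array.length r) in
    let count = ref 0 in
    for i = 0 to n - 1 do
      if not (D.test l.(i) r.(i)) then incr count
    done;
    !count
end

(* One "instance": comparing ints against their string rendering. *)
module Int_string = Make (struct
  type left = int
  type right = string
  let test i s = String.equal (string_of_int i) s
end)
```

Both Int_string.count_mismatches [|1; 2|] [|"1"; "3"|] and the equivalent count_mismatches_fn call report one mismatch. The cost of the functor style is that two instances no longer share a polymorphic type, which is the "map is not expressible" limitation mentioned above.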

Octachron · 2021-06-08T08:29:47Z

utils/diffing.ml

+     diff depending on the placement choices for a prefix of the
+     input. This is done by returning (optional) extensions for the
+     left or right input array. *)
+  val update : change -> state -> state * left array option * right array option


The option type is splitting the "no extension" value into None | Some [||]. If the aim is to make explicit the fact that extensions are not mandatory, I would rather keep the non-variadic variant (and ideally keep the limitation that extensions are only allowed on one side).

I was going to make the same comment: the previous (slightly convoluted) type was there on purpose, we want to enforce that users are only extending on one side only, and that it doesn't change during execution.

The change of type is indeed blurring the distinction between "I statically know that we will never extend along this direction" and "dynamically I don't want to extend this time". It results in a substantial simplification -- but we could always revert to the previous complex API.
(One thing I can do but haven't yet is get rid of the dynamic = [||] check in the diffing code, given that producers now choose None in this situation instead.)

Why is it important to prevent an update function from trying to update both directions? I don't understand the details of the algorithm but the code suggests that extending alternately in either direction would in fact be fine.

Termination is not guaranteed if the client can extend on both sides (it could alternate a bit on each side in a loop).

We could just write it in the documentation, but this is a low-usage API, so we should use the inconvenient-but-safe API. With the update that returns 2 options, it's too easy to do a local change in Includemod_error (for instance), without realizing that an invariant is broken.
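
The difference between the two API shapes under discussion can be sketched as follows. This is a simplified illustration, not the code from either branch: the change type is left as a plain type parameter, and only the constructor name With_left_extensions appears in this thread; Without_extensions, With_right_extensions, update_options and to_options are assumed or hypothetical names.

```ocaml
(* Shape proposed in this PR: both extensions are optional. Nothing in the
   type prevents returning Some on both sides, and "no extension" can be
   written either as None or as Some [||]. *)
type ('change, 'state, 'left, 'right) update_options =
  'change -> 'state -> 'state * 'left array option * 'right array option

(* Shape of the earlier API, as a sketch: the side that may be extended is
   chosen once, by picking a constructor, and cannot change afterwards. *)
type ('change, 'state, 'left, 'right) update =
  | Without_extensions of ('change -> 'state -> 'state)
  | With_left_extensions of ('change -> 'state -> 'state * 'left array)
  | With_right_extensions of ('change -> 'state -> 'state * 'right array)

(* Every constrained update embeds into the permissive shape; the converse
   does not hold, which is exactly the information the option type loses. *)
let to_options = function
  | Without_extensions f ->
      fun change state -> (f change state, None, None)
  | With_left_extensions f ->
      fun change state ->
        let state, left = f change state in
        (state, Some left, None)
  | With_right_extensions f ->
      fun change state ->
        let state, right = f change state in
        (state, None, Some right)
```

With the variant, "we never extend on the right" is a static property of the value passed to the diffing function; with the pair of options it is only a dynamic property of what the function happens to return.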

Is termination guaranteed if the update function always adds a new line? The column width remains fixed, but I would naively assume that you need to explore the infinitely-growing lines in case they contain a better path?

it's too easy to do a local change in Includemod_error (for instance), without realizing that an invariant is broken.

What is the invariant we are talking about? If it is "we only add in the same direction", why is it worth enforcing in the code?

update is only called in the Keep/Change case, which corresponds to a diagonal move in the matrix.

I don't get it. Here is the code I see in trunk:

let compute_inner_cell ~weight ~test ~update tbl i j =
  let compute_proposition i j diff =
    let* diff = diff in
    let+ localstate = Matrix.state tbl i j in
    weight diff + Matrix.weight tbl i j, (diff, localstate)
  in
  let del =
    let diff =
      let+ x = Matrix.line tbl (i-1) j in
      Delete x
    in
    compute_proposition (i-1) j diff
  in
  let insert =
    let diff =
      let+ x = Matrix.column tbl i (j-1) in
      Insert x
    in
    compute_proposition i (j-1) diff
  in
  let diag =
    let diff =
      let* state = Matrix.state tbl (i-1) (j-1) in
      let* line = Matrix.line tbl (i-1) (j-1) in
      let* column = Matrix.column tbl (i-1) (j-1) in
      match test state.state line column with
      | Ok ok -> Some (Keep (line, column, ok))
      | Error err -> Some (Change (line, column, err))
    in
    compute_proposition (i-1) (j-1) diff
  in
  let*! newweight, (diff, localstate) =
    select_best_proposition [diag;del;insert]
  in
  let state = update diff localstate in
  Matrix.set tbl i j ~weight:newweight ~state ~diff:(Some diff)

It looks to me like update is called unconditionally, on a diff whose category (insert/delete/keep/change) depends on the user-provided weight function.

I think that the property you are referring to is the fact that, in the user code in includemod.ml, the update function in fact only provides a non-empty array in the Keep/Change case. This is what I call a dynamic property (in particular, there is no enforcement by the type system).

Indeed the following code diverges in trunk:

let () =
  let li =
    variadic_diff
      ~weight:(function _ -> 0)
      ~test:(fun _ _ _ -> Error ())
      ~update:(With_left_extensions (fun _ () -> (), [|()|]))
      () [|()|] [|()|]
  in
  print_endline (string_of_int (List.length li))

Do we have a deal this way?

Not at all.

Exposing variadic diffing as the only interface seems unnecessarily error prone.
(Of course the update type is an implementation detail that can go away).

Similarly, I would rather make the variadic interface as constrained as possible (with one side at a time, which removes a possible mistake at the consumer side).

And adding those specialized functors seems straightforward?

And adding those specialized functors seems straightforward?

Yes, of course. I can also reinstate the ugly update (I don't particularly care) if this is a blocking point.

The interface could be strengthened to avoid this issue, but your example is in the territory of "clearly malicious code", whereas the current iteration was aimed at relieving the update writer from the burden of remembering which side should be extended.

Drup

At the beginning, I was totally on board with this change ... until I realized Diffable needs to be a functor parametrized by the local environment. If that's the case, we gain approximately nothing, we just trade one kind of complexity for another. If on top of that you need to simplify the arguments (like update) so that everyone follows the same mold, it's really not worth it all that much.

Drup · 2021-06-08T08:31:38Z

utils/diffing.ml

+     diff depending on the placement choices for a prefix of the
+     input. This is done by returning (optional) extensions for the
+     left or right input array. *)
+  val update : change -> state -> state * left array option * right array option


I was going to make the same comment: the previous (slightly convoluted) type was there on purpose, we want to enforce that users are only extending on one side only, and that it doesn't change during execution.

gasche · 2021-06-08T10:14:28Z

@Drup What is the problem with having some Diff callsites use a Diffable parametrized over the local environment? I could get rid of this by defining them as a local module inside the corresponding diff function, but I thought it would be clearer this way. It is purely a client-side decision, not related to the functorized API on the Diffing side. (I could of course shoehorn (loc, env) as a non-changing part of the state, or add yet another type component to the functor input module for this callsite information, but that would be more complex.)
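
The two client-side options mentioned here (a Diffable functorized over the local environment versus a local module defined inside the diff function) can be sketched on a toy example. Everything below is hypothetical: the one-type Diffable signature, Make, Diffable_in, and the case_sensitive flag standing in for the real (loc, env) information. Array.for_all2 requires OCaml 4.11 or later.

```ocaml
(* Toy library side: a functor taking a Diffable module. *)
module type Diffable = sig
  type item
  val test : item -> item -> bool
end

module Make (D : Diffable) = struct
  let all_equal (l : D.item array) (r : D.item array) =
    Array.length l = Array.length r && Array.for_all2 D.test l r
end

(* Option 1: the Diffable argument is itself a functor over the environment. *)
module Diffable_in (Env : sig val case_sensitive : bool end) = struct
  type item = string
  let test a b =
    if Env.case_sensitive then String.equal a b
    else String.equal (String.lowercase_ascii a) (String.lowercase_ascii b)
end

let equal_functorized ~case_sensitive l r =
  let module Env = struct let case_sensitive = case_sensitive end in
  let module D = Make (Diffable_in (Env)) in
  D.all_equal l r

(* Option 2: a local module inside the function captures the environment
   directly as an ordinary variable, with one functor application less. *)
let equal_local ~case_sensitive l r =
  let module D = Make (struct
    type item = string
    let test a b =
      if case_sensitive then String.equal a b
      else String.equal (String.lowercase_ascii a) (String.lowercase_ascii b)
  end) in
  D.all_equal l r
```

Both functions have the same type and behavior; the choice only affects how the client code is organized, not the functorized API on the library side.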

Drup · 2021-06-08T10:21:48Z

It's not a "problem" per se, but you are trading parametric polymorphism with 4 or 5 arguments for a functor plus a local functor declaration. The complexity difference just isn't that big.

gasche · 2021-06-08T11:32:24Z

It's the same on the callsite, but the definition side is substantially simpler, I think? Both in the type declarations and in the implementation.

(I realize now, reading the functor implementation again, that I forgot to propagate the weight, test, update functor parameters to the internal functions, which are still parametrized when they don't need to be. Let me change that.)

gasche · 2021-06-08T11:42:34Z

Done.

On types:

-val diff :
-  weight:(('l, 'r, 'eq, 'diff) change -> int) ->
-  test:('state -> 'l -> 'r -> ('eq, 'diff) result) ->
-  update:(('l, 'r, 'eq, 'diff) change -> 'state -> 'state) ->
-  'state -> 'l array -> 'r array -> ('l, 'r, 'eq, 'diff) patch

+ val weight : change -> int
+ val test : state -> left -> right -> (eq, diff) result
+ val update : change -> state -> state * left array option * right array option
+ val diff : T.state -> T.left array -> T.right array -> T.patch

On terms:

-let diff ~weight ~test ~update state line column =
-  let update d fs = { fs with state = update d fs.state } in
-  let fullstate = { line; column; state } in
-  compute_matrix ~weight ~test ~update fullstate
-  |> construct_patch
- 
+ let diff state line column =
+  { state; line; column }
+  |> compute_matrix
+  |> construct_patch

gasche · 2021-06-08T14:20:29Z

(I'm thinking of going ahead and functorizing Diffing_with_keys as well.)

Octachron · 2021-06-08T14:28:39Z

Concerning the Diffable functor, this can be avoided by integrating the environment and location in the state, can't it?

gasche · 2021-06-08T15:30:44Z

I pushed a new commit that also functorizes Diffing_with_keys, resulting in a pleasant simplification. (The user does not have to deal with two diffing modules at once.) There is a bit of boilerplate in building the Diffable argument to the Diffing-with-keys functor, but I think this is less frightening for users/maintainers than unspeakable (or rather unwritable) parametric polymorphism.

gasche · 2021-06-08T15:39:20Z

There you go: the Diffable modules are now inside the Diffing functions instead of being functorized over their parameters, a 40% reduction in the number of functor applications introduced by this patchset.

Octachron · 2021-06-08T16:20:37Z

Also as a generic comment, the functorization of the Diffing module seems orthogonal to the parent PR? Once we converge on an interface, I would propose that I (and @Drup ?) review it as an independent PR. Then I will the functorization commit of the keyed version.

gasche · 2021-06-08T16:31:16Z

I wanted to see Diffing_with_keys functorized as I was reviewing the PR, and I thought that getting my hands dirty would be even better in terms of code-understanding-sharing than just asking someone else to consider extra work. (Sure was.) Functorizing Diffing was a natural first step before functorizing Diffing_with_keys. I don't have strong opinions in the order in which the PRs should be considered.

gasche · 2021-06-08T21:07:22Z

I'm a bit frustrated by the discussion around this PR. I have the impression that I made a real effort working on this code to get the hang of it; not expecting a medal, but you could sound more enthusiastic about #2 at least!

Octachron · 2021-06-08T21:33:54Z

Thank you for taking the time to understand the root PR.
The dead code fix in #2 is nice.
Nevertheless, I am not enthusiastic about either PR.

#2 is mostly moving around complexity (maybe decreasing it a little?).
#3 seems to be mostly increasing complexity by lifting back a functor that was lowered to the core language. If the core issue was the unreadable types, maybe something like 0ac0fef would work better?

gasche · 2021-06-09T13:55:53Z

In order to facilitate comparison with other approaches, I implemented the restricted interfaces with Diffing.Make, Diffing.MakeVariadicLeft and Diffing.MakeVariadicRight. There is a bit more (harmless) boilerplate in the library (four interfaces and four functors, three of them exposed and one hidden), but it does simplify the client code a bit.

gasche · 2021-06-09T13:59:37Z

Re. termination: I think that the key argument is that termination is ensured when the update function guarantees that the "input sizes" of all produced intermediary states are globally bounded in both directions. One sufficient criterion for this is that (1) one side never extends, and (2) the other never extends on Insert/Delete. Both aspects could be relaxed; it's just that nobody cares to write the more general implementation, and we don't have use-cases for now. (The current codebase is not correct in presence of extension on both sides; I think that the matrix computation works but reconstruction of the best path does not. This is a sensible reason to hide the extending-on-both-sides capability from the user.)
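
The role of the "globally bounded" condition can be illustrated on a deliberately simplified model. This is not the diffing algorithm: it is a one-dimensional walk over an array that an extend callback may lengthen after each cell, with a fuel counter added only so that the demonstration itself always stops. The names outcome, walk, extend, bounded and unbounded are all hypothetical.

```ocaml
type outcome =
  | Finished of int   (* reached the end after visiting this many cells *)
  | Out_of_fuel       (* gave up: the end kept moving away *)

(* Visit cells left to right; after each visit, [extend] may append cells. *)
let walk ~fuel ~(extend : int -> int array) (input : int array) : outcome =
  let rec go input i fuel =
    if i >= Array.length input then Finished i
    else if fuel = 0 then Out_of_fuel
    else go (Array.append input (extend input.(i))) (i + 1) (fuel - 1)
  in
  go input 0 fuel

(* Bounded growth: a cell n > 0 appends n - 1, a cell 0 appends nothing,
   so the total size is bounded and the walk reaches the end. *)
let bounded =
  walk ~fuel:1000 ~extend:(fun n -> if n > 0 then [| n - 1 |] else [||]) [| 3 |]

(* Unbounded growth: every visit appends one more cell, so the end is
   never reached and only the fuel limit stops the walk. *)
let unbounded = walk ~fuel:1000 ~extend:(fun _ -> [| 0 |]) [| 0 |]
```

Here bounded evaluates to Finished 4 and unbounded to Out_of_fuel. The real matrix computation is two-dimensional, but the same reasoning applies: if the sizes of the intermediate states are bounded, only finitely many cells can ever be visited.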

Octachron · 2021-06-09T17:34:29Z

I have played with a version with twice the number of functors: https://github.com/Octachron/ocaml#semdiff_functor_types . Having separate type definition functors makes the impact on the implementation less intrusive while making sure that each diffing function is constrained to its own change type.
Of course, this is also probably totally over-engineered.

gasche · 2021-06-09T20:03:12Z

I think the result is reasonably nice. Would you care to submit it as a PR to your PR, superseding these ones?

Some comments looking at semdiff_type_decl...semdiff_functor_types (I can't post comments inline), in reverse patch order:

In Diffing_with_keys, I find the name Extended_defs infelicitous; these definitions are arguably not an "extension" of the previous ones (we are never working with both at the same time), but rather a sort of "inner core" on which the features related to the Defs are built. So I would call them Inner_def.
In Diffing, sig val diff : ... end is duplicated three times. This also happens in my version (in the present PR), but I can't avoid it as the type of my diff depends on the functor argument. This is not the case in your version: you could have a Defined.Diff module type for this, and you could in fact also reuse it in Diffing_with_keys.
Going in a different direction: I find your diffing_with_keys.mli hard to read, because it exposes a lot of details on the setup structure of Diffing, you have to be familiar with this module to understand the Diffing_with_keys signature. (This was already somewhat the case with my version, but less so.) Most of this internal layout is not in fact needed to describe the type of the final diff function. I think you should try unfolding/inlining the functor layers here, see if you can get a definition that is conceptually slightly more redundant, but more self-contained and much easier to read in practice.
Why are you using a generalize function to witness the relation between the specialized change and the generic version, instead of just exposing type change = (left, right, ...) Generic.change in the signature?
Defined is not a very nice name (the fact that its argument is named Defs suggests that you ran out of naming inspiration at that point). If we want a bland name, what about just Make? Its result could be named Diff, and then the result of Simple would be just F (instead of Diff).

Octachron · 2021-06-09T20:16:13Z

Indeed, I completely ran out of naming fuel along the way.

Why are you using a generalize function to witness the relation between the specialized change and the generic version, instead of just exposing type change = (left, right, ...) Generic.change in the signature?

This has the advantage of making the update and test functions handle the correct change type by construction, since the specialized change type is the only one in scope.
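
As a rough sketch of the two designs being compared (not the code from the semdiff_functor_types branch): generic_change here has only two constructors, and With_equality, With_witness, delete and Int_str are hypothetical names.

```ocaml
(* A generic change type shared by all instances (reduced to two cases). *)
type ('l, 'r) generic_change =
  | Delete of 'l
  | Insert of 'r

(* Design 1: the signature exposes the type equality. Consumers can build
   and pattern-match on the generic constructors directly. *)
module type With_equality = sig
  type left
  type right
  type change = (left, right) generic_change
end

(* Design 2: the specialized change type is abstract, and a [generalize]
   function witnesses its relation to the generic version. Code written
   against this signature can only manipulate its own [change] type. *)
module type With_witness = sig
  type left
  type right
  type change
  val delete : left -> change
  val generalize : change -> (left, right) generic_change
end

module Int_str : With_witness with type left = int and type right = string =
struct
  type left = int
  type right = string
  type change = (left, right) generic_change
  let delete x = Delete x
  let generalize c = c
end
```

With Design 2, Int_str.generalize (Int_str.delete 3) yields Delete 3, but a value of some other instance's change type cannot be passed where Int_str.change is expected, even if both happen to be implemented by the same generic type.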

Octachron · 2021-06-22T12:49:08Z

Superseded by #4

gasche added 2 commits June 6, 2021 22:01

diffing_keys.ml: precise type annotations to ease comprehension

0a9ccb3

diffing_with_keys: heterogeneous presentation, because we can

3097e5d

gasche force-pushed the semdiff_type_decl-functors branch from 54c3936 to 125520e Compare June 8, 2021 07:57

gasche mentioned this pull request Jun 8, 2021

Diffing for mismatches in variant and record declarations ocaml/ocaml#10361

Merged

Octachron reviewed Jun 8, 2021

Drup reviewed Jun 8, 2021

utils/diffing.ml: use functors

70c5baa

gasche force-pushed the semdiff_type_decl-functors branch from 125520e to 70c5baa Compare June 8, 2021 11:37

utils/diffing_with_keys: use functors

7cb214e

review: move the Diffable modules within the diffing functions

08c6340

diffing: expore more restricted Make{,Variadic{Left,Right}} functors

fa021c2

Octachron mentioned this pull request Jun 14, 2021

Twice-functorized diffing (with improved documentation) #4

Merged

Octachron closed this Jun 22, 2021

Octachron pushed a commit that referenced this pull request Apr 12, 2024

Merge pull request #3 from Octachron/effect-syntax-newtype

4fdc825

Effect syntax: use Ctype.new_local_type

Octachron added a commit that referenced this pull request Sep 9, 2024

Unify mark and direction (#3)

57bd829

Octachron added a commit that referenced this pull request Sep 24, 2024

Unify mark and direction (#3)

9eadfb8
