Make type-checking of "open" constant-time by alainfrisch · Pull Request #834 · ocaml/ocaml

alainfrisch · 2016-10-03T16:53:16Z

This PR is built on top of #828 and implements the idea in http://caml.inria.fr/mantis/view.php?id=6826, namely that type-checking an "open" statement should take constant time, independently of the size of the opened module.

This is done by keeping a layered representation for the environment: a set of local bindings, and possibly a symbolic representation of an open (just its path and its computed set of components, which is memoized) + the bindings before that open. Each lookup traverses this linked list of partial environments. This might be a bit slower, but of course, each map is smaller. For instance, previously, a lookup in an environment obtained by opening two modules with 1000 components would take O(log(2000)), while now it takes O(2*log(1000)).
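To make the layered idea concrete, here is a minimal sketch. It is not the code in env.ml: the names `layer`, `add`, `open_module` and `find` are hypothetical, and components are simplified to a plain string map.

```ocaml
(* Illustrative sketch only; names are hypothetical, not those of env.ml. *)
module SMap = Map.Make (String)

type 'a layer =
  | Empty
  | Local of 'a SMap.t * 'a layer
      (* bindings added since the last open, then the older layers *)
  | Open of string * 'a SMap.t Lazy.t * 'a layer
      (* path of the opened module, its memoized components, and the
         environment as it was before the open *)

(* Adding a local binding only touches the topmost map. *)
let add name v = function
  | Local (m, rest) -> Local (SMap.add name v m, rest)
  | env -> Local (SMap.singleton name v, env)

(* Opening a module is O(1): nothing is copied into the environment. *)
let open_module path components env = Open (path, components, env)

(* A lookup walks the chain of layers, newest first. *)
let rec find name = function
  | Empty -> None
  | Local (m, rest) ->
      (match SMap.find_opt name m with
       | Some _ as r -> r
       | None -> find name rest)
  | Open (_, comps, rest) ->
      (match SMap.find_opt name (Lazy.force comps) with
       | Some _ as r -> r
       | None -> find name rest)

let () =
  let a = lazy (SMap.add "x1" 1 (SMap.singleton "x2" 2)) in
  let env = add "x" 42 Empty |> open_module "A" a |> add "y" 0 in
  assert (find "x1" env = Some 1);   (* found in the open layer *)
  assert (find "x" env = Some 42)    (* found below the open *)
```

In this sketch the cost of `open_module` does not depend on the size of `A`; the price is paid at lookup time, one map per layer.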

Moreover, the new approach allows a lighter representation, since e.g. support for warnings on open doesn't pollute the representation of local bindings. In addition, find_same lookups don't need to go through the parts representing components imported by an open statement, so they might actually be faster than in the current implementation.

All in all, I could not find a case where the new strategy slows down lookup enough to make the entire type-checking slower.

alainfrisch · 2016-10-03T21:30:55Z

Timings for ocamlc.opt, compiling:

module A = struct
  let x1 = 1
  ...
  let xn = n
end
let x = A.(x1)
...
let x = A.(xn)

Results:

n	trunk	branch
1000	0.8s	0.06s
2000	4.0s	0.09s
3000	10s	0.14s
4000	21s	0.2s
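For anyone wanting to reproduce this, the benchmark file can be generated rather than written by hand. The helper below is a hypothetical sketch (the name `generate` is mine, and it is not part of the PR); it only builds the source text shown above for a given n.

```ocaml
(* Hypothetical generator for the benchmark source above. *)
let generate n =
  let b = Buffer.create (64 * n) in
  Buffer.add_string b "module A = struct\n";
  for i = 1 to n do
    Printf.bprintf b "  let x%d = %d\n" i i
  done;
  Buffer.add_string b "end\n";
  (* One local open per definition of A. *)
  for i = 1 to n do
    Printf.bprintf b "let x = A.(x%d)\n" i
  done;
  Buffer.contents b
```

Writing `generate 1000` to a file and compiling that file with both compilers should give timings comparable in shape to the table, although absolute numbers will depend on the machine.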

alainfrisch · 2016-10-03T22:00:27Z

And a worst-case scenario, with many lookups that need to traverse a lot of "open" (here, 40):

module A = struct end
let x = 42
open A  (* 1 *)
....
open A  (* 40 *)
let y1 = x
...
let yn = x

Timings (seconds):

n	trunk	branch
10000	0.23	0.26
20000	0.49	0.53
40000	1.0	1.14

With only 10 opens:

n	trunk	branch
10000	0.23	0.23
20000	0.47	0.48
40000	0.98	1.03

alainfrisch · 2016-10-03T22:12:49Z

Code typical of let-open-in intensive style:

let r1 = Pervasives.(ref, succ)
....
let rn = Pervasives.(ref, succ)

n	trunk	branch
10000	2.8	0.6
20000	6.7	1.2

alainfrisch · 2016-10-04T12:01:53Z

An even worse scenario:

let x = 42
open List (* 1 *)
....
open List  (* n *)
let y1 = x
...
let y40000 = x

Here one needs to look for x in all the Tbl.t representing each open. Timings:

n	trunk	branch
10	1.00	1.03
40	1.00	1.21
80	1.03	1.40

I've tried to memoize the lookup results in the environment tables (so that further lookups on the same name don't need to traverse all the chain), and this reduces the overhead by about 1/2. Considering how small the overhead is for typical number of nested opens, I don't believe this is worth the extra complexity.
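For reference, the memoization idea can be sketched roughly as follows. This is not the code that was tried: the record `env`, its fields and the functions `layer` and `find` are hypothetical, and each layer is reduced to a string map with a mutable cache.

```ocaml
(* Hypothetical sketch of memoized lookups through a chain of layers. *)
module SMap = Map.Make (String)

type 'a env = {
  bindings : 'a SMap.t;                   (* bindings of this layer *)
  next : 'a env option;                   (* older layers, if any *)
  cache : (string, 'a option) Hashtbl.t;  (* results already computed *)
}

let layer bindings next = { bindings; next; cache = Hashtbl.create 16 }

let rec find name env =
  match Hashtbl.find_opt env.cache name with
  | Some r -> r  (* answered without walking the chain again *)
  | None ->
      let r =
        match SMap.find_opt name env.bindings with
        | Some _ as r -> r
        | None ->
            (match env.next with
             | None -> None
             | Some older -> find name older)
      in
      (* Safe to cache: [bindings] and [next] never change. *)
      Hashtbl.add env.cache name r;
      r
```

The cache makes repeated lookups of the same name cheap, at the cost of a mutable table in every layer, which is the kind of extra complexity mentioned above.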

alainfrisch · 2016-10-04T14:05:48Z

I don't understand the problem with the testsuite (tests/typing-poly/poly.ml). This is an "expect" test. The current source is:

module M
: sig val f : (<m : 'b. 'b * ('b * <m:'c. 'c * 'bar> as 'bar)>) -> unit end
= struct let f (x : <m : 'a. 'a * ('a * 'foo)> as 'foo) = () end;;
module M
: sig type t = <m : 'b. 'b * ('b * <m:'c. 'c * 'bar> as 'bar)> end
= struct type t = <m : 'a. 'a * ('a * 'foo)> as 'foo end;;
[%%expect {|
Line _, characters 2-64:
Error: Signature mismatch:
       Modules do not match:
         sig val f : (< m : 'a. 'a * ('a * 'b) > as 'b) -> unit end
       is not included in
         sig
           val f : < m : 'b. 'b * ('b * < m : 'c. 'c * 'a > as 'a) > -> unit
         end
       Values do not match:
         val f : (< m : 'a. 'a * ('a * 'b) > as 'b) -> unit
       is not included in
         val f : < m : 'b. 'b * ('b * < m : 'c. 'c * 'a > as 'a) > -> unit
|}];;

and the correct version is:

module M
: sig val f : (<m : 'b. 'b * ('b * <m:'c. 'c * 'bar> as 'bar)>) -> unit end
= struct let f (x : <m : 'a. 'a * ('a * 'foo)> as 'foo) = () end;;
module M
: sig type t = <m : 'b. 'b * ('b * <m:'c. 'c * 'bar> as 'bar)> end
= struct type t = <m : 'a. 'a * ('a * 'foo)> as 'foo end;;
[%%expect {|
Line _, characters 2-64:
Error: Signature mismatch:
       ...
       Values do not match:
         val f : (< m : 'a. 'a * ('a * 'b) > as 'b) -> unit
       is not included in
         val f : < m : 'b. 'b * ('b * < m : 'c. 'c * 'a > as 'a) > -> unit
|}, Principal{|
Line _, characters 2-64:
Error: Signature mismatch:
       Modules do not match:
         sig val f : (< m : 'a. 'a * ('a * 'b) > as 'b) -> unit end
       is not included in
         sig
           val f : < m : 'b. 'b * ('b * < m : 'c. 'c * 'a > as 'a) > -> unit
         end
       Values do not match:
         val f : (< m : 'a. 'a * ('a * 'b) > as 'b) -> unit
       is not included in
         val f : < m : 'b. 'b * ('b * < m : 'c. 'c * 'a > as 'a) > -> unit
|}];;

The trouble is that I cannot reproduce the new "bad" behavior (showing "..." instead of "Modules do not match") manually outside the expect tool, even in the toplevel. @diml : do you see how to investigate that?

alainfrisch · 2016-10-04T14:14:27Z

Argh, the choice to print "..." in Includemod.report_error is based on the size of the marshaling of the internal representation of the object to be printed (here, a Module_type of module_type * module_type constructor) -- and whether marshaling succeeds or not. I don't see immediately why this would be impacted by the current PR, but this seems rather fragile.
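A size check of the kind described can be sketched as below. This is an approximation written for illustration, not the code of Includemod; the name `too_big` and the `~limit` parameter are assumptions, and a limit of 0 is taken to mean "no limit".

```ocaml
(* Sketch: a value is "too big" if it cannot be marshaled into [limit]
   bytes, either because it is larger or because marshaling fails. *)
let too_big ~limit obj =
  limit > 0
  && begin
    let buf = Bytes.create limit in
    try
      ignore (Marshal.to_buffer buf 0 limit obj []);
      false
    with _ -> true  (* does not fit, or contains e.g. a closure *)
  end

let () =
  assert (not (too_big ~limit:500 42));
  assert (too_big ~limit:500 (List.init 1000 (fun i -> i)));
  assert (too_big ~limit:500 (fun x -> x + 1))
```

Because the result depends on the exact in-memory representation, two values that print identically can fall on different sides of the limit.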

alainfrisch · 2016-10-04T14:27:58Z

The size in the example is 501 bytes (with a limit at 500). This is probably related to a change of sharing (perhaps related to the fact that some Path.t, including simple Pident cases, are now recomputed on each access instead of being stored in the Env.t). I'm tempted to simply validate the test, since the current failure is more the sign of a weakness of the printer heuristics.
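The effect of sharing on marshaled size is easy to observe in isolation. The snippet below is only an illustration with made-up values (`marshaled_size` is a hypothetical helper); it does not reproduce the 501-byte case.

```ocaml
(* Illustration only: sharing changes the size of marshaled data. *)
let marshaled_size v = String.length (Marshal.to_string v [])

let () =
  let s = String.make 100 'a' in
  let copy = Bytes.to_string (Bytes.of_string s) in
  (* Both pairs hold the same contents, but in the first one the string
     is physically shared and therefore written only once. *)
  assert (marshaled_size (s, s) < marshaled_size (s, copy))
```

A path that is recomputed on each access behaves like `copy` here: structurally equal, but no longer shared.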

ghost · 2016-10-06T07:27:16Z

For the test you should be able to raise the limit by writing this in poly.ml:

Clflags.error_size := 1000;;

(I just discovered this variable)

alainfrisch · 2016-10-06T07:55:58Z

I was actually considering setting it to 0 (no limit), since any value can make the test quite fragile. This seems to work fine for the current tests.

alainfrisch · 2016-10-06T10:44:01Z

So I've set error_size to 0 directly in expect_test.ml (and no test had to be adjusted). The rationale is that any arbitrary limit is likely to create irrelevant failures in the testsuite at some point (or be useless if it is set too high), considering how fragile the criterion is (which might be ok for interactive use, not for non-regression testing).

alainfrisch · 2016-10-26T23:04:04Z

Does anyone want to review this?

@xavierleroy : you told me that you were concerned with degrading performances of lookups. How do the benchmarks above look to you?

let-def · 2016-10-31T10:29:38Z

I started reviewing. Overall it seems fine.
I made a few esthetic comments; I didn't find anything wrong while reading the implementation.

As for the performance profile, it seems to better reflect the actual use of open (a few of them in a given branch, sometimes used for a very short time). (Except maybe a code generator that outputs a huge number of opens of small modules?)

alainfrisch · 2016-10-31T10:46:20Z

Great, thanks @let-def .

I made a few esthetic comments

I cannot see the comments. If you started a review, you need to "submit" to make comments visible.

let-def · 2016-10-31T10:10:42Z

typing/env.ml

+  end
+
+
+module EnvTbl2 =


Maybe use more descriptive names than EnvTbl & EnvTbl2?

Do you have a suggestion?

Good question... The distinction is between components that have physical representation and those which are "floating"?

Something like LabelTable & RootedTable?
I don't know of a general term for distinguishing those two categories in the compiler, though it would be relevant.

One is for labels and constructors, for which several definitions can exist in the same module (and all need to be retrieved by name).

Ok, proposing IdTbl for components with a physical representation (i.e. identified by an id) and TycompTbl for "components of types", i.e. labels and constructors. I'll happily rename if better names are proposed (this is purely internal to env.ml, so trivial to rename).
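The difference between the two kinds of table can be sketched as follows. This is a simplified illustration, not the actual IdTbl or TycompTbl: the function names are hypothetical and both tables are reduced to string maps.

```ocaml
module SMap = Map.Make (String)

(* One binding per name: a newer definition shadows the older one. *)
let add_unique name v tbl = SMap.add name v tbl

(* Several definitions per name (labels, constructors): keep them all,
   most recent first. *)
let add_multi name v tbl =
  let previous = Option.value (SMap.find_opt name tbl) ~default:[] in
  SMap.add name (v :: previous) tbl

(* Retrieve every definition registered under [name]. *)
let find_all name tbl = Option.value (SMap.find_opt name tbl) ~default:[]

let () =
  (* Two record types in the same module, both with a label [x]. *)
  let labels = SMap.empty |> add_multi "x" "t1.x" |> add_multi "x" "t2.x" in
  assert (find_all "x" labels = ["t2.x"; "t1.x"])
```

Type-directed disambiguation needs the whole list, which is why labels and constructors cannot use the shadowing table.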

let-def · 2016-10-31T10:14:33Z

typing/env.ml

@@ -281,29 +492,35 @@ let is_in_signature env = env.flags land in_signature_flag <> 0
 let is_implicit_coercion env = env.flags land implicit_coercion_flag <> 0

 let diff_keys is_local tbl1 tbl2 =


diff_keys(2) could be lifted inside the EnvTbl(2) module.
Also, the only purpose of EnvTbl.local_keys is to implement diff_keys.

So all could be replaced by just an export of EnvTbl.diff_keys & EnvTbl2.diff_keys.

Done, thanks!

let-def · 2016-10-31T10:21:33Z

typing/env.ml

-        env
-  with Not_found ->
-    env
+  (* update summary?? *)


I think summary should be updated too. The updated environment appears in a Typedtree, so if inspected from a CMT the reconstructed environment would be wrong.
(I cannot think of an existing codepath in Merlin that can fail because of that, but summary affects cmt processing and its implementation of short-paths)

I've pushed a partial fix to that. It's not clear to me what to do when changing the value_description of an identifier imported from an open statement. In the old version, the id of these values was "hidden" (lookup can only be done by name); now there is no dummy id for these values so I don't see how to record the information in the summary (except by introducing a new Env_value_by_name of summary * string * value_description). But I'm not even sure that rewriting the type of such values is required... Do you know? @garrigue?

Ok, I've pushed a more complete fix: now the "copy_types" operation is exposed by Env and represented as such in the "summary".

After discussing with @garrigue, it seems the current situation is not ideal anyway, since the types of all values should be copied in non-principal mode, including those accessed with a module qualifier (which is not the case now, leading to different behavior depending on whether a value is accessed through an "open" or not). Cleaning that up would be nice but is quite independent of this PR, so I prefer to keep the current behavior untouched, and it seems cleaner to have the "copy" operation supported directly by Env.

let-def · 2016-10-31T11:01:09Z

@alainfrisch thanks, I never used that :)

alainfrisch · 2016-11-10T21:27:03Z

I really think this is worth merging, as it enables a style with many small local opens (which are arguably less risky than global ones) without degrading much the style with global opens; moreover, it makes the support for open-related warnings arguably simpler. @let-def is happy with the current implementation. Does someone else want to review?

gasche · 2016-11-10T21:50:08Z

(cc @garrigue who may be interested in reviewing and may or may not have been notified about this PR.)

damiendoligez · 2016-11-17T16:23:16Z

@alainfrisch you should remove the "work-in-progress" label if you think this is ready for merge.

alainfrisch · 2016-11-17T16:24:49Z

you should remove the "work-in-progress" label if you think this is ready for merge.

Yes, indeed. Thanks!

mshinwell · 2016-12-28T12:12:41Z

@garrigue Are you indeed interested in reviewing this?

garrigue · 2016-12-28T13:44:12Z

Sorry. I read the diff, and answered @alainfrisch personally, but did not write here.

I understand the goal, but I am not convinced by the implementation.
My main concern is that env.ml is already fairly complex, and this adds extra complexity.

alainfrisch · 2017-01-01T14:48:29Z

My main concern is that env.ml is already fairly complex, and this adds extra complexity.

I really believe that the implementation makes the treatment of open-related warnings much simpler to understand, and removes a weird notion of "signature substitution". It also opens the door to removing some of the current caching logic. Moreover, the "extra complexity" is tiny compared to other parts of the compiler (cf recent huge changes for flambda or spacetime); the new data structure is relatively simple and documented. This extra complexity seems well deserved to me considering the performance gains.

@garrigue Do you see a simpler way to achieve the same effect?

garrigue · 2017-01-06T06:37:32Z

Not simpler. I was just thinking about how to avoid the overhead with lots of open's.
If you say that open is free, people could start using hundreds of them...

One solution would be to amortize the cost by copying each definition to the current environment after accessing it. I.e. some kind of lazy copying.

alainfrisch · 2017-01-30T14:37:17Z

people could start using hundreds of them...

I'm a bit puzzled by the argument that the PR makes "open" faster, could thus encourage people to use it too much, and so we should make "open" even faster. This kind of argument could be used against most kinds of optimizations. Honestly, if we make open faster, I don't think that people will realistically start nesting hundreds of opens; this would be terrible for code readability anyway. What they could do is use a lot of local opens instead of a few global ones, which is, in the eyes of some at least, an improvement.

Not simpler.

In the previous note, you wrote "My main concern is that env.ml is already fairly complex, and this adds extra complexity." Are you now ok with the complexity of the suggested new implementation?

One solution would be to amortize the cost by copying each definition to the current environment after accessing it. I.e. some kind of lazy copying

I've tried it. A naive implementation slows down the common case (few nested opens) quite a bit, since it complicates the data structure and introduces some extra mutations. Moreover, the treatment of the open warnings becomes more complex.

More complex strategies could certainly be considered to get better worst-case complexity without degrading performance in most cases. Can we keep that for later, if the need arises?

garrigue · 2017-02-09T00:41:53Z

@alainfrisch My point is that we should be careful that using open does not slow down the subsequent typechecking. However, your tests seem to show the contrary, so maybe it's fine.
Can we discuss that at the developer meeting?

Also, my feeling is that at some point we will need a big clean up of env.ml, since we have been piling optimization over optimization for about 20 years...

alainfrisch added the work-in-progress label Oct 3, 2016

alainfrisch force-pushed the open_revisited branch from 30d4a4d to ef0d736 Compare October 3, 2016 21:10

alainfrisch mentioned this pull request Oct 5, 2016

Alternative implementation of type-checking for open #828

Closed

let-def reviewed Oct 31, 2016

alainfrisch force-pushed the open_revisited branch from eabc2e6 to afe0e41 Compare October 31, 2016 17:41

alainfrisch removed the work-in-progress label Nov 17, 2016

alainfrisch added 19 commits March 24, 2017 14:24

Restore proper error message when opening a functor.

1cad974

Fix ocamldoc.

0ab1339

Fix 'deprecated module' warning.

8f9fdff

Starting to switch to a layered representation of 'opens'.

0fd3e65

Do not mark opens as being used when they are only used to look for shadowed identifiers.

a9800f5

Optimize lookups by specializing Tbl.find to string keys (not strictly related to this PR).

30f502a

Comments.

cd2ea8b

Simplify: no need to keep the position of labels/constructors (for extension constructors, the actual position is extracted from the Ctr_extension tag).

3253c0d

Cleanup.

a441268

Factorize type expression.

1c0ad86

Avoid arbitrary limit on error message size, too fragile for automated tests.

01fce81

Bootstrap.

7da629a

Move diff_keys functions to EnvTbl/EnvTbl2.

f067ec1

Simplify.

a26a93e

Record update_value in summary.

acedb1e

Expose a more explicit 'copy types' operation in Env (with a representation in summary), instead of a more generic 'update_value'.

d8b413f

Renaming EnvTbl -> TycompTbl; EnvTbl2 -> IdTbl.

bf3c2d4

Changelog.

c012ea5

Non-regression test for ocaml#7372

f91f561

alainfrisch force-pushed the open_revisited branch from a6803b3 to f91f561 Compare March 24, 2017 13:50

alainfrisch merged commit e692b5e into ocaml:trunk Mar 24, 2017

damiendoligez mentioned this pull request Mar 31, 2017

Tentative fix for #7372 (GADTs and inline records) #824

Closed

shindere mentioned this pull request May 25, 2018

do not error when instantiating polymorphic fields in patterns #1748

Merged

trefis mentioned this pull request Jan 29, 2019

Env: remove prefix_idents cache #2229

Merged

This was referenced Mar 14, 2019

multiple "open" can become expensive in memory #5877

Closed

Improve compile time of opens, esp. for local opens #6826

Closed

Bug in type-checker with GADTs and inline records #7372

Closed

gasche mentioned this pull request Jun 22, 2020

usability issue: no error when opening an alias to a missing module #9695

Closed

ctk21 pushed a commit to ocaml-multicore/ocaml that referenced this pull request Jan 7, 2022

Merge pull request ocaml#834 from ctk21/20220107_address_review

c3e9ee3

More changes addressing review comments on ocaml-10831

stedolan pushed a commit to stedolan/ocaml that referenced this pull request Sep 21, 2022

CFG: Improve the way priority is computed for dataflow analysis (ocaml#834) * Add priority computation to the dataflow. * Don't use polymorphic [min] on ints

45bcf26
