
a custom parsing loop that stops consuming top-level input on error #6

Merged
gasche merged 2 commits into gasche:menhir from let-def:menhir
Aug 2, 2018
Conversation

@let-def
Collaborator

@let-def let-def commented Aug 2, 2018

This implements a custom parsing loop to improve the behavior of top-level recovery.

Quoting a comment from the code:

The parser detected an error.
At this point we don't want to consume input anymore.
In the top-level it would translate into waiting for the user to type
something just to raise an error at some earlier position rather
than just raising the error immediately.

This worked before with yacc because, AFAICT (@let-def):

  • yacc eagerly performs "default reductions" (when the next action
    is to reduce the same production no matter what token is read,
    yacc reduces immediately rather than waiting for that token
    to be read)
  • error productions in OCaml grammar are always in a position that
    allows default reduction ("error" symbol is the last producer,
    and the lookahead token will not be used to disambiguate between
    two possible error rules)

This solution is fragile because it relies on an optimization
(default reduction) that changes the semantics of the parser as
implemented in yacc (an optimization that changes semantics?
hmmmm).

Rather than relying on implementation details of the parser, when
an error is detected in this loop we stop looking at the input and
fill the parser with EOF tokens.
The skip_phrase logic will resynchronize the input stream by
looking for the next ';;'.
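Such a loop can be sketched against Menhir's incremental API. This is a non-compilable sketch: it assumes a Menhir-generated [Parser] module with [module I = Parser.MenhirInterpreter], and [read_token] and [eof] are hypothetical helpers (read the next token triple / build an EOF triple), not names from the actual patch.

```ocaml
(* Sketch of a driving loop that stops consuming input on error. *)
let rec loop read_token (checkpoint : 'a I.checkpoint) =
  match checkpoint with
  | I.InputNeeded _ ->
      (* the parser wants a token: read one and offer it *)
      loop read_token (I.offer checkpoint (read_token ()))
  | I.Shifting _ | I.AboutToReduce _ ->
      (* intermediate checkpoints: just let the parser continue *)
      loop read_token (I.resume checkpoint)
  | I.HandlingError _ ->
      (* an error was detected: stop consuming real input and feed
         only EOF from now on, so the parser can still reach the
         grammar's error rules (or Rejected) without blocking on
         the user typing more input *)
      loop (fun () -> eof ()) (I.resume checkpoint)
  | I.Accepted v -> v
  | I.Rejected -> raise Parser.Error
```

The key point is the [HandlingError] branch: instead of raising immediately, the token source is swapped for one that yields EOF forever, and the parser is resumed.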

@gasche
Owner

gasche commented Aug 2, 2018

Thanks! Could you run menhir-bench.bash on your machine to see whether there is any notable performance regression? (My guess is that there isn't; I checked your previous introduction of a token filter and it was lost in the (abundant) noise.)

@trefis
Collaborator

trefis commented Aug 2, 2018

Can you include the updated testsuite as well?

@gasche
Owner

gasche commented Aug 2, 2018

Naive question from someone not familiar with LR parsing.

After you see HandlingError for the first time, you keep reducing by feeding those dummy EOF tokens until you get Rejected, and then you raise an exception. Why not raise an exception right away at HandlingError?

I suppose that in some error-handling strategies the reductions that happen in between are useful to get into a "better summarized" state in which it is easier to explain the error; I remember reading about hints you can give Menhir to do this in its new fancy error-message support. But here we don't keep/preserve any information from the parsing state / checkpoint on error, so what difference would it make?

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Parse only

native (30 iterations)

yacc: 0m5.580s
menhir: 0m5.726s
m/y ratio: 1.0261

byte (10 iterations)

yacc: 0m4.723s
menhir: 0m5.455s
m/y ratio: 1.1549

Parse and Type

native (4 iterations)

yacc: 0m1.638s
menhir: 0m1.759s
m/y ratio: 1.0738

byte (1 iterations)

yacc: 0m1.293s
menhir: 0m1.357s
m/y ratio: 1.0494

Full compile

native (3 iterations)

yacc: 0m1.372s
menhir: 0m1.373s
m/y ratio: 1.0007

byte (1 iterations)

yacc: 0m1.446s
menhir: 0m1.474s
m/y ratio: 1.0193

@let-def
Collaborator Author

let-def commented Aug 2, 2018

This loop is slightly simpler than Menhir's own; if anything, I would expect a small speed-up.

@gasche
Owner

gasche commented Aug 2, 2018

Wow, the results are much better on your machine than on mine... I'm jealous of the pitch-perfect "there is absolutely no overhead" result that you get on the native full-compile benchmark. Would you mind maybe running the benchmark from the current menhir branch (we know the results will be essentially the same), to have a fair comparison?

@gasche
Owner

gasche commented Aug 2, 2018

@let-def, you are now officially our Benchmark Result Producer for the rest of the GPR#292 discussion.

@let-def
Collaborator Author

let-def commented Aug 2, 2018

https://www.youtube.com/watch?v=HANOtqR3nuY

@trefis
Collaborator

trefis commented Aug 2, 2018

I'm jealous of the pitch-perfect "there is absolutely no overhead" result that you get on the native full-compile benchmark.

That makes sense though, doesn't it?

@trefis
Collaborator

trefis commented Aug 2, 2018

PS: the testsuite changes brought me tears of joy.

@gasche
Owner

gasche commented Aug 2, 2018

(I am not sure whether my earlier question was lost in the noise of YouTube links and tears of joy.)

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Naive question from someone not familiar with LR parsing.
After you see HandlingError for the first time, you keep reducing by feeding those dummy EOF tokens until you get Rejected, and then you raise an exception. Why not raise an exception right away at HandlingError?
I suppose that in some error-handling strategies the reductions that happen in between are useful to get into a "better summarized" state in which it is easier to explain the error; I remember reading about hints you can give Menhir to do this in its new fancy error-message support. But here we don't keep/preserve any information from the parsing state / checkpoint on error, so what difference would it make?

If you keep feeding tokens, you get a chance to trigger one of the few error-handling rules from the grammar.
So the outcome can be reaching Rejected, but also getting an exception from a semantic action as a result of calling I.resume checkpoint.

If we move to an error-handling approach that lives outside of the grammar (Menhir's messages, for instance), the story will be different (when reaching HandlingError, we will call another tool to produce error messages), but with the current parser this is the best we can do, and that is also why we get a few more error messages (though I am not sure why yacc falls back to Parser.Error before reducing these error rules... my explanation might be wrong: if it finds no default reduction, it raises immediately).
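For instance, a hypothetical grammar fragment with an error rule (the rule names and the exception raised here are made up for illustration, not taken from the OCaml grammar):

```ocaml
(* hypothetical .mly fragment *)
phrase:
  | e = expr SEMISEMI
      { Phrase e }
  | error SEMISEMI
      { (* this semantic action only runs if the parser keeps going
           after the error; raising here produces a located syntax
           error instead of a bare Rejected *)
        raise Parsing.Parse_error }
```

Feeding EOF after HandlingError gives the parser the chance to reduce such a rule and run its semantic action before reaching Rejected.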

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Pre-patch bench result:

Parse only

native (30 iterations)

yacc: 0m5.135s
menhir: 0m5.452s
m/y ratio: 1.0617

byte (10 iterations)

yacc: 0m4.386s
menhir: 0m4.851s
m/y ratio: 1.1060

Parse and Type

native (4 iterations)

yacc: 0m1.329s
menhir: 0m1.371s
m/y ratio: 1.0316

byte (1 iterations)

yacc: 0m1.114s
menhir: 0m1.173s
m/y ratio: 1.0529

Full compile

native (3 iterations)

yacc: 0m1.203s
menhir: 0m1.186s
m/y ratio: 0.9858

byte (1 iterations)

yacc: 0m1.268s
menhir: 0m1.323s
m/y ratio: 1.0433

@let-def
Collaborator Author

let-def commented Aug 2, 2018

And after patch:

Parse only

native (30 iterations)

yacc: 0m5.176s
menhir: 0m5.156s
m/y ratio: 0.9961

byte (10 iterations)

yacc: 0m4.363s
menhir: 0m4.898s
m/y ratio: 1.1226

Parse and Type

native (4 iterations)

yacc: 0m1.300s
menhir: 0m1.350s
m/y ratio: 1.0384

byte (1 iterations)

yacc: 0m1.100s
menhir: 0m1.125s
m/y ratio: 1.0227

Full compile

native (3 iterations)

yacc: 0m1.182s
menhir: 0m1.174s
m/y ratio: 0.9932

byte (1 iterations)

yacc: 0m1.271s
menhir: 0m1.313s
m/y ratio: 1.0330

@gasche
Owner

gasche commented Aug 2, 2018

Would you mind maybe explaining this (the benefit of pushing forward so that interesting error-message exceptions get raised) in the comment? Also, making loop a local definition inside wrap_menhir might make the code more readable for newcomers.

@gasche
Owner

gasche commented Aug 2, 2018

I'll merge now because it clearly improves the branch/GPR.

@gasche gasche merged commit 9cdc163 into gasche:menhir Aug 2, 2018
gasche pushed a commit that referenced this pull request Jul 13, 2024
The toplevel printer detects cycles by keeping a hashtable of values
that it has already traversed.

However, some OCaml runtime types (at least bigarrays) may be
partially uninitialized, and hashing them at arbitrary program points
may read uninitialized memory. In particular, the OCaml testsuite
fails when running with a memory-sanitizer enabled, as bigarray
printing results in reads to uninitialized memory:

```
==133712==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4e6d11 in caml_ba_hash /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45
    #1 0x52474a in caml_hash /var/home/edwin/git/ocaml/runtime/hash.c:251:35
    #2 0x599ebf in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1065:14
    #3 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    #4 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    #5 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #6 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

  Uninitialized value was created by a heap allocation
    #0 0x47d306 in malloc (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x47d306) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)
    #1 0x4e7960 in caml_ba_alloc /var/home/edwin/git/ocaml/runtime/bigarray.c:246:12
    #2 0x4e801f in caml_ba_create /var/home/edwin/git/ocaml/runtime/bigarray.c:673:10
    #3 0x59b8fc in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14
    #4 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    #5 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    #6 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #7 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #8 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45 in caml_ba_hash
```

The only use of hashing in genprintval is to avoid cycles, that is, it
is only useful for OCaml values that contain other OCaml values
(including possibly themselves). Bigarrays cannot introduce cycles,
and they are always printed as "<abstr>" anyway.

The present commit proposes to be more conservative in which values
are hashed by the cycle detector to avoid this issue: we skip hashing
any value with tag above No_scan_tag -- which may not contain any
OCaml values.

Suggested-by: Gabriel Scherer <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
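The conservative check described in the commit message can be sketched with the Obj module. [may_contain_values] is a hypothetical name for illustration, not the identifier used in genprintval; the sketch only captures the idea "skip hashing any block whose tag is at or above No_scan_tag".

```ocaml
(* A minimal sketch, assuming the fix boils down to: only hash blocks
   that can contain other OCaml values (tag below No_scan_tag). *)
let may_contain_values v =
  let r = Obj.repr v in
  Obj.is_block r && Obj.tag r < Obj.no_scan_tag

let () =
  (* cons cells (tag 0) can contain values, so they are hashed *)
  assert (may_contain_values [1; 2; 3]);
  (* strings (string_tag) and bigarrays (custom_tag) sit above
     No_scan_tag: they cannot introduce cycles and are skipped *)
  assert (not (may_contain_values "hello"));
  assert (not (may_contain_values
                 (Bigarray.Array1.create Bigarray.float64
                    Bigarray.c_layout 4)))
```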
gasche added a commit that referenced this pull request Jul 30, 2024
…l#13294)

Suggested-by: Gabriel Scherer <[email protected]>

Signed-off-by: Edwin Török <[email protected]>
Co-authored-by: Edwin Török <[email protected]>