
a custom parsing loop that stops consuming top-level input on error #6

Merged
gasche merged 2 commits into gasche:menhir from let-def:menhir
Aug 2, 2018
Conversation

@let-def
Collaborator

@let-def let-def commented Aug 2, 2018

This implements a custom parsing loop to improve the behavior of top-level recovery.

Quoting a comment from the code:

The parser detected an error.
At this point we don't want to consume input anymore.
In the top-level it would translate into waiting for the user to type
something just to raise an error at some earlier position rather
than just raising the error immediately.

This worked before with yacc because, AFAICT (@let-def):

  • yacc eagerly performs "default reductions" (when the next action
    is to reduce the same production no matter what token is read,
    yacc reduces immediately rather than waiting for that token
    to be read)
  • error productions in OCaml grammar are always in a position that
    allows default reduction ("error" symbol is the last producer,
    and the lookahead token will not be used to disambiguate between
    two possible error rules)

This solution is fragile because it relies on an optimization
(default reduction) that changes the semantics of the parser as
implemented in yacc (an optimization that changes semantics?
hmmmm).

Rather than relying on implementation details of the parser, when
an error is detected in this loop we stop looking at the input and
fill the parser with EOF tokens.
The skip_phrase logic will resynchronize the input stream by
looking for the next ';;'.
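Such a loop can be sketched against Menhir's incremental API. This is a non-compilable sketch: it assumes a Menhir-generated [Parser] module with [module I = Parser.MenhirInterpreter], and [read_token] and [eof] are hypothetical helpers (read the next token triple / build an EOF triple), not names from the actual patch.

```ocaml
(* Sketch of a driving loop that stops consuming input on error. *)
let rec loop read_token (checkpoint : 'a I.checkpoint) =
  match checkpoint with
  | I.InputNeeded _ ->
      (* the parser wants a token: read one and offer it *)
      loop read_token (I.offer checkpoint (read_token ()))
  | I.Shifting _ | I.AboutToReduce _ ->
      (* intermediate checkpoints: just let the parser continue *)
      loop read_token (I.resume checkpoint)
  | I.HandlingError _ ->
      (* an error was detected: stop consuming real input and feed
         only EOF from now on, so the parser can still reach the
         grammar's error rules (or Rejected) without blocking on
         the user typing more input *)
      loop (fun () -> eof ()) (I.resume checkpoint)
  | I.Accepted v -> v
  | I.Rejected -> raise Parser.Error
```

The key point is the [HandlingError] branch: instead of raising immediately, the token source is swapped for one that yields EOF forever, and the parser is resumed.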

@gasche
Owner

gasche commented Aug 2, 2018

Thanks! Could you run menhir-bench.bash on your machine to see whether there is any notable performance regression? (My guess is that there isn't; I checked your previous introduction of a token filter and it was lost in the (abundant) noise.)

@trefis
Collaborator

trefis commented Aug 2, 2018

Can you include the updated testsuite as well?

@gasche
Owner

gasche commented Aug 2, 2018

Naive question from someone not familiar with LR parsing.

After you see HandlingError for the first time, you keep reducing by feeding those dummy EOF tokens until you get Rejected, and then you raise an exception. Why not raise an exception right away at HandlingError?

I suppose that in some error-handling strategies the reductions that happen in between are useful to get into a "better summarized" state in which it is easier to explain the error; I remember reading about hints you can give Menhir to do this in its new fancy error-message support. But here we don't keep/preserve any information from the parsing state / checkpoint on error, so what difference would it make?

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Parse only

native (30 iterations)

yacc: 0m5.580s
menhir: 0m5.726s
m/y ratio: 1.0261

byte (10 iterations)

yacc: 0m4.723s
menhir: 0m5.455s
m/y ratio: 1.1549

Parse and Type

native (4 iterations)

yacc: 0m1.638s
menhir: 0m1.759s
m/y ratio: 1.0738

byte (1 iterations)

yacc: 0m1.293s
menhir: 0m1.357s
m/y ratio: 1.0494

Full compile

native (3 iterations)

yacc: 0m1.372s
menhir: 0m1.373s
m/y ratio: 1.0007

byte (1 iterations)

yacc: 0m1.446s
menhir: 0m1.474s
m/y ratio: 1.0193

@let-def
Collaborator Author

let-def commented Aug 2, 2018

This loop is slightly simpler than Menhir's own; if anything, I would expect a small speed-up.

@gasche
Owner

gasche commented Aug 2, 2018

Wow, the results are much better on your machine than on mine... I'm jealous of the pitch-perfect "there is absolutely no overhead" result that you get on the native full-compile benchmark. Would you mind maybe running the benchmark from the current menhir branch (we know the results will be essentially the same), to have a fair comparison?

@gasche
Owner

gasche commented Aug 2, 2018

@let-def, you are now officially our Benchmark Result Producer for the rest of the GPR#292 discussion.

@let-def
Collaborator Author

let-def commented Aug 2, 2018

https://www.youtube.com/watch?v=HANOtqR3nuY

@trefis
Collaborator

trefis commented Aug 2, 2018

I'm jealous of the pitch-perfect "there is absolutely no overhead" result that you get on the native full-compile benchmark.

That makes sense though, doesn't it?

@trefis
Collaborator

trefis commented Aug 2, 2018

PS: the testsuite changes brought me tears of joy.

@gasche
Owner

gasche commented Aug 2, 2018

(I am not sure whether my earlier question was lost in the noise of YouTube links and tears of joy.)

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Naive question from someone not familiar with LR parsing.
After you see HandlingError for the first time, you keep reducing by feeding those dummy EOF tokens until you get Rejected, and then you raise an exception. Why not raise an exception right away at HandlingError?
I suppose that in some error-handling strategies the reductions that happen in between are useful to get into a "better summarized" state in which it is easier to explain the error; I remember reading about hints you can give Menhir to do this in its new fancy error-message support. But here we don't keep/preserve any information from the parsing state / checkpoint on error, so what difference would it make?

If you keep feeding tokens, you get a chance to trigger one of the few error-handling rules from the grammar.
So the outcome can be reaching Rejected, but also getting an exception from a semantic action as a result of calling I.resume checkpoint.

If we move to an error-handling approach that lives outside of the grammar (Menhir's messages, for instance), the story will be different (when reaching HandlingError, we will call another tool to produce error messages), but with the current parser this is the best we can do, and that is also why we get a few more error messages (though I am not sure why yacc falls back to Parser.Error before reducing these error rules... my explanation might be wrong: if it finds no default reduction, it raises immediately).
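For instance, a hypothetical grammar fragment with an error rule (the rule names and the exception raised here are made up for illustration, not taken from the OCaml grammar):

```ocaml
(* hypothetical .mly fragment *)
phrase:
  | e = expr SEMISEMI
      { Phrase e }
  | error SEMISEMI
      { (* this semantic action only runs if the parser keeps going
           after the error; raising here produces a located syntax
           error instead of a bare Rejected *)
        raise Parsing.Parse_error }
```

Feeding EOF after HandlingError gives the parser the chance to reduce such a rule and run its semantic action before reaching Rejected.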

@let-def
Collaborator Author

let-def commented Aug 2, 2018

Pre-patch bench result:

Parse only

native (30 iterations)

yacc: 0m5.135s
menhir: 0m5.452s
m/y ratio: 1.0617

byte (10 iterations)

yacc: 0m4.386s
menhir: 0m4.851s
m/y ratio: 1.1060

Parse and Type

native (4 iterations)

yacc: 0m1.329s
menhir: 0m1.371s
m/y ratio: 1.0316

byte (1 iterations)

yacc: 0m1.114s
menhir: 0m1.173s
m/y ratio: 1.0529

Full compile

native (3 iterations)

yacc: 0m1.203s
menhir: 0m1.186s
m/y ratio: 0.9858

byte (1 iterations)

yacc: 0m1.268s
menhir: 0m1.323s
m/y ratio: 1.0433

@let-def
Collaborator Author

let-def commented Aug 2, 2018

And after patch:

Parse only

native (30 iterations)

yacc: 0m5.176s
menhir: 0m5.156s
m/y ratio: 0.9961

byte (10 iterations)

yacc: 0m4.363s
menhir: 0m4.898s
m/y ratio: 1.1226

Parse and Type

native (4 iterations)

yacc: 0m1.300s
menhir: 0m1.350s
m/y ratio: 1.0384

byte (1 iterations)

yacc: 0m1.100s
menhir: 0m1.125s
m/y ratio: 1.0227

Full compile

native (3 iterations)

yacc: 0m1.182s
menhir: 0m1.174s
m/y ratio: 0.9932

byte (1 iterations)

yacc: 0m1.271s
menhir: 0m1.313s
m/y ratio: 1.0330

@gasche
Owner

gasche commented Aug 2, 2018

Would you mind maybe explaining this (the benefit of pushing forward so that interesting error-message exceptions get raised) in the comment? Also, making loop a local definition inside wrap_menhir might make the code more readable for newcomers.

@gasche
Owner

gasche commented Aug 2, 2018

I'll merge now because it clearly improves the branch/GPR.

@gasche gasche merged commit 9cdc163 into gasche:menhir Aug 2, 2018
gasche pushed a commit that referenced this pull request Jul 13, 2024
The toplevel printer detects cycles by keeping a hashtable of values
that it has already traversed.

However, some OCaml runtime types (at least bigarrays) may be
partially uninitialized, and hashing them at arbitrary program points
may read uninitialized memory. In particular, the OCaml testsuite
fails when running with a memory-sanitizer enabled, as bigarray
printing results in reads to uninitialized memory:

```
==133712==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4e6d11 in caml_ba_hash /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45
    #1 0x52474a in caml_hash /var/home/edwin/git/ocaml/runtime/hash.c:251:35
    #2 0x599ebf in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1065:14
    #3 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    #4 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    #5 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #6 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

  Uninitialized value was created by a heap allocation
    #0 0x47d306 in malloc (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x47d306) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)
    #1 0x4e7960 in caml_ba_alloc /var/home/edwin/git/ocaml/runtime/bigarray.c:246:12
    #2 0x4e801f in caml_ba_create /var/home/edwin/git/ocaml/runtime/bigarray.c:673:10
    #3 0x59b8fc in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14
    #4 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    #5 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    #6 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #7 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #8 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45 in caml_ba_hash
```

The only use of hashing in genprintval is to avoid cycles, that is, it
is only useful for OCaml values that contain other OCaml values
(including possibly themselves). Bigarrays cannot introduce cycles,
and they are always printed as "<abstr>" anyway.

The present commit proposes to be more conservative in which values
are hashed by the cycle detector to avoid this issue: we skip hashing
any value with tag above No_scan_tag -- which may not contain any
OCaml values.

Suggested-by: Gabriel Scherer <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
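The conservative check described in the commit message can be sketched with the Obj module. [may_contain_values] is a hypothetical name for illustration, not the identifier used in genprintval; the sketch only captures the idea "skip hashing any block whose tag is at or above No_scan_tag".

```ocaml
(* A minimal sketch, assuming the fix boils down to: only hash blocks
   that can contain other OCaml values (tag below No_scan_tag). *)
let may_contain_values v =
  let r = Obj.repr v in
  Obj.is_block r && Obj.tag r < Obj.no_scan_tag

let () =
  (* cons cells (tag 0) can contain values, so they are hashed *)
  assert (may_contain_values [1; 2; 3]);
  (* strings (string_tag) and bigarrays (custom_tag) sit above
     No_scan_tag: they cannot introduce cycles and are skipped *)
  assert (not (may_contain_values "hello"));
  assert (not (may_contain_values
                 (Bigarray.Array1.create Bigarray.float64
                    Bigarray.c_layout 4)))
```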
gasche added a commit that referenced this pull request Jul 30, 2024
…l#13294)

Suggested-by: Gabriel Scherer <[email protected]>

Signed-off-by: Edwin Török <[email protected]>
Co-authored-by: Edwin Török <[email protected]>