Skip to content

Fix evaluation order problem#966

Merged
mshinwell merged 39 commits intoocaml:trunkfrom
mshinwell:eval_order
Feb 15, 2017
Merged

Fix evaluation order problem#966
mshinwell merged 39 commits intoocaml:trunkfrom
mshinwell:eval_order

Conversation

@mshinwell
Copy link
Contributor

@mshinwell mshinwell commented Dec 13, 2016

(Was PR#7346)

Selectgen can currently reorder load instructions against aliasing stores when emitting the arguments to operations (such as function calls, allocation, etc). This causes inconsistency with the bytecode compiler's evaluation order. This patch should prevent it happening. I have also stopped the moving of Ccheckbound since that too has effects.

For the record, even though the evaluation order is officially unspecified, it is a matter of concern for Jane Street that it may vary between bytecode and native.

(By the way this is not a fix for PR#6136, which I am going to look at next.)

@mshinwell mshinwell added the bug label Dec 13, 2016
@mshinwell mshinwell added this to the 4.05.0 milestone Dec 13, 2016
@xavierleroy
Copy link
Contributor

I need to be reassured that this doesn't degrade performance, esp. for Cload.

@mshinwell
Copy link
Contributor Author

We can try to run Louis's benchmarking suite on this, although I'm not sure how much is running on trunk yet.

@chambart I thought Flambda would be immune to this but I think I might be wrong---it looks like [Un_anf] might be able to generate e.g. applications that contain side effects. (See the comment "When sequences of let-bindings...")

@xavierleroy
Copy link
Contributor

Make sure you test x86-32 bits with floating-point codes. Your change has potential to kill the way ocamlopt utilises the x87 FP stack.

@mshinwell
Copy link
Contributor Author

@xavierleroy What about if we annotated Cload as to whether it is reading from a mutable field? Apparently in multicore the code up to Cmmgen already knows this property, so we could use that to obtain this annotation. Then we could prevent only the loads which read from mutable fields from being re-ordered.

@xavierleroy
Copy link
Contributor

Mutability information is always good to have. But many FP loads that are performance critical are from mutable storage (arrays, mutable unboxed record fields), so there is still much potential for performance breakage, esp on x86-x87.

@mshinwell
Copy link
Contributor Author

The first version of this had dire performance even on normal x86-64. I've made a revised version which will be benchmarked shortly.

@mshinwell
Copy link
Contributor Author

mshinwell commented Dec 27, 2016

@chambart This issue appears to be more serious than I first thought: it looks like code that does not depend on unspecified evaluation order can be incorrectly re-ordered as a result of this bug when compiled with Flambda. (Thanks to Yaron Minsky for hard questioning that lead me to construct this test case.)

let i = ref 0

let f x y =
  Printf.printf "%d %d\n" x y;
  0
[@@inline never]

let foo _ = ()

let foobar baz =
  let incr_i _ =
    incr i;
    !i
  in
  let b = !i in
  let z = foo 42 in
  let a = (incr_i [@inlined never]) z in
  let x = f a b in
  x + 1

let () =
  ignore ((foobar 0) : int)

I think this is because we assume during Un_anf that the backend is always going to evaluate right-to-left. For example, the above code built with Flambda (at the default optimization level) produces the following Clambda for foobar:

(+
  (apply* camlBreak_eval_order__f_21 
    (apply* camlBreak_eval_order__incr_i_77  0a)
    (field 0 (field 0 "camlBreak_eval_order__Pmakeblock_124")))
  1)

(camlBreak_eval_order__Pmakeblock_124 is the reference called i.)

This bug does at least seem rather difficult to trigger because the "let-moveable" transformation in Un_anf is very conservative. Once we manage to fix this problem in a satisfactory manner we should consider applying this to the 4.04 branch also.

@mshinwell
Copy link
Contributor Author

mshinwell commented Dec 28, 2016

I'm still working on this problem, but as an aside, it looks like the list of side-effecting expressions in is_simple_expr may be missing some cases even ignoring Cload (Ccheckbound in particular is absent, and can raise an exception). (Actually never mind, I see this was noted above.)

@mshinwell mshinwell changed the title Fix one example of different evaluation order in bytecode and native Fix evaluation order problem Dec 28, 2016
let current_function_name = ref ""

(* Environment parameter for the function being compiled, if any. *)
let current_function_env_param = ref None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The role of this reference is to detect a common case of reading from an immutable block (the current closure). This seems a bit hackish. Couldn't one mark the load instruction instead? This seems cleaner and might (perhap later) cover more cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a plan to combine a patch from the multicore branch and an old patch of mine to propagate mutability information into Cmm code. However it's a fair amount of work to do that, so for now, I'd just like to catch the very common case of loads from the current closure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Actually I've incorporated the old patch of mine, but not yet the one which propagates mutability information from the middle-end. I'm going to let @lpw25 submit that separately.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could still be made less hackish by using the environment in cmmgen to do that: in Cmmgen.get_field, if ptr is the current environment variable, then generate a immutable cload.

@mshinwell
Copy link
Contributor Author

This revised version is ready for review for 4.05. There are two parts:

  1. Addition of mutability information to Cload. This is deliberately conservative at the moment.
  2. A simple effect/coeffect analysis in Selectgen to restrict the movement of expressions to avoid the original evaluation order problem. (Note that deferral of expressions is still guarded by the original is_simple_expr.)

A substantial amount of effort has been invested in this patch to try to ensure that any performance degradation is kept to a minimum. The evolution of the code has largely been guided by benchmarking.

The multicore devs have a patch to propagate mutability information from earlier in the compiler which can link up with the new annotation on Cload and remove the need for the special case tracking the environment parameter.

We also have a 4.03 version of this patch (being used internally at Jane Street); performance numbers on x86-64 Linux can be seen here:
http://bench.flambda.ocamlpro.com/compare?test=2017-02-12-2300%2F4.03.1%2B%2Bpr966%2Bbench&reference=2017-02-12-2300%2F4.03.1%2Bdev%2Bbench

This patch is being re-tested on i386 32-bit at present and numbers should be available shortly. I expect it to show a degradation of a couple of percent on floating-point intensive work.

@mshinwell
Copy link
Contributor Author

One of the benchmarks exhibits a small difference in allocation numbers but a large difference in compactions. It is believed that this is because of potential variable lifetime changes as a result of this patch. For example for some function alloc which allocates, the code:

foo(alloc(), load block[0])

was effectively translated as:

let alloc_result = alloc () in
foo(alloc_result, load block[0])

With this patch, we instead get:

let block_result = load block[0] in
let alloc_result = alloc () in
foo(alloc_result, block_result)

As such, if block is dead after the load, it might be collected during the call to alloc. Without the patch it would live for longer.

Copy link
Contributor

@chambart chambart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems correct and fixes the bug.

I'd prefer the environment variable hack to go away and be done in cmmgen.
The tests could be written in cmm to be more robust.

@@ -227,6 +305,65 @@ method is_simple_expr = function
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth expanding that _ pattern.

try env_find_static_exception nfail env
with Not_found ->
fatal_error ("Selection.emit_expr: unboun label "^
string_of_int nfail)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious unrelated indentation change. (even if it fixes the indentation)

in
(* Intermediate registers to handle cases where some
registers from src are present in dest *)
registers from src are present in dest *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated indentation too

let tmp_regs = Reg.createv_like src in
(* Ccatch registers are created with type Val. They must not
contain out of heap pointers *)
contain out of heap pointers *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated indentation too

([], EC.none)
exp_list
in
List.fold_left (fun results_and_env (exp, effects_after) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not behave as the original one in case of overloading. Does it ever happen ?

required semantics of [loc_external_arguments] (see proc.mli). *)
across multiple registers to line up correctly, by virtue of the
semantics of [split_int64_for_32bit_target] in cmmgen.ml, and the
required semantics of [loc_external_arguments] (see proc.mli). *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated indentation


method! is_simple_expr e =
match e with
| Cop(Cextcall (fn, _, _, _), args, _)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated whitespace

method! effects_of e =
match e with
| Cop(Cextcall(fn, _, _, _), args, _)
when List.mem fn inline_ops ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a comment on inline_ops definition to tell that if a new non-effect free is included this should be updated.
Same remark for arm64 and i386

let current_function_name = ref ""

(* Environment parameter for the function being compiled, if any. *)
let current_function_env_param = ref None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could still be made less hackish by using the environment in cmmgen to do that: in Cmmgen.get_field, if ptr is the current environment variable, then generate a immutable cload.

Cop(Cand, [get_header ptr dbg; Cconst_int 255], dbg)
else (* If byte loads are efficient *)
Cop(Cload Byte_unsigned,
Cop(Cload (Byte_unsigned, Mutable), (* Same comment as above *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not obvious what this comment refers to. Maybe just explicitly repeat that tags can be mutated by Obj.set_tag.

@lpw25
Copy link
Contributor

lpw25 commented Feb 14, 2017

I've reviewed this, it looks fine.

@damiendoligez
Copy link
Member

We have two positive reviews so this is good to go. @mshinwell please remove the unrelated whitespace changes, then go ahead and merge.

@mshinwell
Copy link
Contributor Author

I've addressed all of the code review comments save for the one about rewriting the testcases in Cmm, which I think is going to have to wait, and one I don't understand about overloading. I have asked @chambart about the latter. This should be ready for merging shortly.

@mshinwell
Copy link
Contributor Author

The overloading comment was I believe supposed to say overriding; there should be no issue, since none of the "emit_parts*" functions are exposed in the .mli.

@mshinwell
Copy link
Contributor Author

@lpw25 has read the post-code-review changes so I'm going to merge this now. I've asked @chambart to eyeball them in any case, but I fully expect that to be fine.

@mshinwell mshinwell merged commit ea5fa10 into ocaml:trunk Feb 15, 2017
camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017
Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick review and the code looks good to me.

Performance gains will be limited because Cmm generation is conservative w.r.t. immutability annotations. For example, reading the header of a block is considered mutable because of Obj.set_block, so computing the length of an array will not be factored across an assignment to a reference. But this is OK, as we can tighten those Cmm annotations later, and the code in this PR will automatically take advantage.

I'm currently running a CI precheck to make sure all architectures are OK.

stedolan pushed a commit to stedolan/ocaml that referenced this pull request Mar 21, 2023
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Add link to job offers at my company TrustInSoft.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants