Avoid overflow when checking for an oversized result String/Bytes.concat#815
Avoid overflow when checking for an oversized result String/Bytes.concat#815alainfrisch wants to merge 4 commits intoocaml:trunkfrom
Conversation
stdlib/bytes.ml
Outdated
| let iteri f a = | ||
| for i = 0 to length a - 1 do f i (unsafe_get a i) done | ||
|
|
||
| let () = assert (3 * Sys.max_string_length <= max_int) |
There was a problem hiding this comment.
The test should be Sys.max_string_length <= max_int / 3.
There was a problem hiding this comment.
And there should be a comment explaining its purpose.
There was a problem hiding this comment.
Do we really want to run this in every ocaml program ? Besides the assertion will fail in bucklescript since they plan to move to Sys.max_string_length = max_int
There was a problem hiding this comment.
@dbuenzli: it seems helpful for the test to fail in BuckleScript, since the concat code relies on an assumption that doesn't hold there.
There was a problem hiding this comment.
My implicit meaning was, why do you solve the problem in a way that will fail in some ocaml backends ?
There was a problem hiding this comment.
I think the idea is that the problem is not in this code, but in the choice of setting Sys.max_string_length to max_int, i.e. BucleScript should set max_string_length to max_int/3.
There was a problem hiding this comment.
Maybe. I still think that the decision to fix this by introducing an invariant between Sys.max_string_length and max_int that depends on the values these values take on the 32-bit platform is quite ugly and more brittle than it could be.
There was a problem hiding this comment.
(especially since this invariant is defined by a function of the standard library, not by properties of the runtime system itself)
There was a problem hiding this comment.
|
I propose this version, which does not impose any limit on Also, you should add @yallop's example (from #805) to the test suite and add a Changes entry. |
|
@damiendoligez: I think your version is fine. I'll just point out that it's still a bit longer (and slower, and IMO more complex) than my original, once the one step unrolling and the auxiliary exception are removed: let check_pos v = if v < 0 then invalid_arg "String.concat" else v
let rec sum_lengths acc seplen = function
| [] -> acc
| [x] -> length x + acc
| hd :: tl -> sum_lengths (check_pos (length hd + seplen + acc)) seplen tl
let rec concat sep l =
match l with
| [] -> ""
| hd :: tl ->
let sep_len = length sep in
let r = B.create (sum_lengths 0 sep_len l) in
unsafe_blit hd 0 r 0 (length hd);
let pos = ref(length hd) in
List.iter
(fun s ->
unsafe_blit sep 0 r !pos sep_len;
pos := !pos + sep_len;
unsafe_blit s 0 r !pos (length s);
pos := !pos + length s)
tl;
Bytes.unsafe_to_string r(mostly untested) To be fair, I do still have the assumption that [Edit: now tested.] |
833510c to
f26b91e
Compare
|
I've switch to @damiendoligez 's version.
JS backends have their own runtime, so this should be fine. Do you see other places in the stdlib which makes this assumption? |
Yes, but I don't think that this is the place for that discussion. If it's a problem then it's a problem for the js backends, and not for any of the proposed implementations of |
|
If the stdlib depends on a undocumented invariant of our runtime system which is potentially broken by alternative runtimes, it is certainly a problem on OCaml's side and once we are aware of this problem, we should not introduce more of it. I'd be fine documenting that |
Have you changed your mind about this since opening this issue? |
No, I said it was probably harmless, but if it is easy enough to fix without obvious drawbacks, it's worth doing it. |
|
Here's another implementation, which is
than any of the other implementations so far: let check_pos v = if v < 0 then invalid_arg "String.concat" else v
let rec sum_lengths acc seplen = function
| [] -> acc
| [x] -> length x + acc
| hd :: tl -> sum_lengths (check_pos (length hd + seplen + acc)) seplen tl
let rec unsafe_blits dst pos sep seplen = function
[] -> dst
| [x] ->
unsafe_blit x 0 dst pos (length x); dst
| hd :: tl ->
unsafe_blit hd 0 dst pos (length hd);
unsafe_blit sep 0 dst (pos + length hd) seplen;
unsafe_blits dst (pos + length hd + seplen) sep seplen tl
let concat sep l =
let sep_len = length sep in
unsafe_blits (B.create (sum_lengths 0 sep_len l)) 0 sep sep_len l
|
89724dc to
50bdf03
Compare
|
I've added the test and an entry in Changes. FWIW, I like @yallop's latest proposal, modulo that it'd be better to lift the assumption on max_string_length, considering that it's not more complex ( |
Thank you. That's the nicest thing anyone's said about my code in at least the last 62 comments.
I'm happy to do that. |
|
Here's a version that doesn't make assumptions about let ensure_ge x y = if x >= y then x else invalid_arg "String.concat"
let rec sum_lengths acc seplen = function
| [] -> acc
| hd :: [] -> length hd + acc
| hd :: tl -> sum_lengths (ensure_ge (length hd + seplen + acc) acc) seplen tl
let rec unsafe_blits dst pos sep seplen = function
[] -> dst
| hd :: [] ->
unsafe_blit hd 0 dst pos (length hd); dst
| hd :: tl ->
unsafe_blit hd 0 dst pos (length hd);
unsafe_blit sep 0 dst (pos + length hd) seplen;
unsafe_blits dst (pos + length hd + seplen) sep seplen tl
let concat sep l =
let seplen = length sep in
unsafe_blits (B.create (sum_lengths 0 seplen l)) 0 sep seplen l |> bts |
|
We can still get an invalid_arg with a string different from String.concat is the total length is greater than max_string_length (but without int overflow). I think this might be worth fixing by checking explicitly the resulting size. Also, an int overflow could still occur on the final |
|
I've pushed a version close to @yallop's latest proposal, adding an explicit check against max_string_length, protecting the final sum against int overflow, and adding a fast path for the case where the list is empty (avoiding allocation in that case). |
|
@alainfrisch I think the cases you mention are handled by |
|
What do you mean? Today, |
|
(If/when strings were/are really immutable, one could also optimize the case of a list with a single element in |
If |
|
I prefer the code as it was, without Checking for overflow multiple times (before and within |
|
Here's a version that incorporates @alainfrisch's two proposed improvements (optimization of the empty list case and more consistent error reporting): let ensure_ge x y = if x >= y then x else invalid_arg "String.concat"
let rec sum_lengths acc seplen = function
| [] -> acc
| hd :: [] -> length hd + acc
| hd :: tl -> sum_lengths (ensure_ge (length hd + seplen + acc) acc) seplen tl
let rec unsafe_blits dst pos sep seplen = function
[] -> dst
| hd :: [] ->
unsafe_blit hd 0 dst pos (length hd); dst
| hd :: tl ->
unsafe_blit hd 0 dst pos (length hd);
unsafe_blit sep 0 dst (pos + length hd) seplen;
unsafe_blits dst (pos + length hd + seplen) sep seplen tl
let concat sep = function
[] -> ""
| l -> let seplen = length sep in bts @@
unsafe_blits
(B.create ~caller:"String.concat" (sum_lengths 0 seplen l))
0 sep seplen l
|
|
I really like @yallop's latest version: it's fast and readable, quite perfect. But I would remove the
While I agree that it might be a good idea to add a Also, I want to cherry-pick this PR into 4.04 but I won't do it with the change to |
|
@damiendoligez: I'm fine with removing |
|
@yallop I assume opening a PR with your latest version - |
|
See #833. |
|
This is superseded by #833, right? Can we close it? |
Various tweaks
Follow up to #805.
String and Bytes now depend on Sys (for max_string_length). One could avoid that by copying the code which computes this constant. Actually, it would make sense to have String and Byte expose a
max_lengthvalue, but this is another story.