Skip to content

Avoid overflow when checking for an oversized result String/Bytes.concat#815

Closed
alainfrisch wants to merge 4 commits intoocaml:trunkfrom
alainfrisch:fix_overflow_string_concat
Closed

Avoid overflow when checking for an oversized result String/Bytes.concat#815
alainfrisch wants to merge 4 commits intoocaml:trunkfrom
alainfrisch:fix_overflow_string_concat

Conversation

@alainfrisch
Copy link
Contributor

Follow up to #805.

String and Bytes now depend on Sys (for max_string_length). One could avoid that by copying the code which computes this constant. Actually, it would make sense to have String and Byte expose a max_length value, but this is another story.

stdlib/bytes.ml Outdated
let iteri f a =
for i = 0 to length a - 1 do f i (unsafe_get a i) done

let () = assert (3 * Sys.max_string_length <= max_int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should be Sys.max_string_length <= max_int / 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there should be a comment explaining its purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to run this in every ocaml program ? Besides the assertion will fail in bucklescript since they plan to move to Sys.max_string_length = max_int

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbuenzli: it seems helpful for the test to fail in BuckleScript, since the concat code relies on an assumption that doesn't hold there.

Copy link
Contributor

@dbuenzli dbuenzli Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My implicit meaning was, why do you solve the problem in a way that will fail in some ocaml backends ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is that the problem is not in this code, but in the choice of setting Sys.max_string_length to max_int, i.e. BucleScript should set max_string_length to max_int/3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. I still think that the decision to fix this by introducing an invariant between Sys.max_string_length and max_int that depends on the values these values take on the 32-bit platform is quite ugly and more brittle than it could be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(especially since this invariant is defined by a function of the standard library, not by properties of the runtime system itself)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 100% with @dbuenzli here. @lefessan weren't you the one who once proposed a change of runtime representation that would allow strings up to max_int in length?

Also, this assertion (if we decide to keep it) should be located in Sys because it's part of the specification of max_string_length.

@damiendoligez
Copy link
Member

I propose this version, which does not impose any limit on Sys.max_string_length:

let concat sep l =
  match l with
  | [] -> ""
  | hd :: tl ->
      let len = ref (length hd) in
      let sep_len = length sep in
      List.iter
        (fun s ->
          let old_len = !len in
          len := old_len + sep_len + length s;
          if !len < old_len then invalid_arg "String.concat" (* overflow *)
        )
        tl;
      if !len > Sys.max_string_length then invalid_arg "String.concat";
      let r = B.create !len in
      unsafe_blit hd 0 r 0 (length hd);
      let pos = ref(length hd) in
      List.iter
        (fun s ->
          unsafe_blit sep 0 r !pos sep_len;
          pos := !pos + sep_len;
          unsafe_blit s 0 r !pos (length s);
          pos := !pos + length s)
        tl;
      Bytes.unsafe_to_string r

Also, you should add @yallop's example (from #805) to the test suite and add a Changes entry.

@yallop
Copy link
Member

yallop commented Sep 16, 2016

@damiendoligez: I think your version is fine. I'll just point out that it's still a bit longer (and slower, and IMO more complex) than my original, once the one step unrolling and the auxiliary exception are removed:

let check_pos v = if v < 0 then invalid_arg "String.concat" else v

let rec sum_lengths acc seplen = function
  | [] -> acc
  | [x] -> length x + acc
  | hd :: tl -> sum_lengths (check_pos (length hd + seplen + acc)) seplen tl

let rec concat sep l =
  match l with
  | [] -> ""
  | hd :: tl ->
    let sep_len = length sep in
    let r = B.create (sum_lengths 0 sep_len l) in
    unsafe_blit hd 0 r 0 (length hd);
    let pos = ref(length hd) in
    List.iter
      (fun s ->
        unsafe_blit sep 0 r !pos sep_len;
        pos := !pos + sep_len;
        unsafe_blit s 0 r !pos (length s);
        pos := !pos + length s)
      tl;
    Bytes.unsafe_to_string r

(mostly untested)

To be fair, I do still have the assumption that max_string_length is less than half max_int. But that assumption is thoroughly baked into the runtime and standard library so I'm not especially troubled by it.

[Edit: now tested.]

@alainfrisch alainfrisch force-pushed the fix_overflow_string_concat branch from 833510c to f26b91e Compare September 16, 2016 15:18
@alainfrisch
Copy link
Contributor Author

I've switch to @damiendoligez 's version.

that assumption is thoroughly baked into the runtime and standard library so I'm not especially troubled by it.

JS backends have their own runtime, so this should be fine. Do you see other places in the stdlib which makes this assumption?

@yallop
Copy link
Member

yallop commented Sep 16, 2016

Do you see other places in the stdlib which makes this assumption?

Yes, but I don't think that this is the place for that discussion. If it's a problem then it's a problem for the js backends, and not for any of the proposed implementations of concat.

@alainfrisch
Copy link
Contributor Author

If the stdlib depends on a undocumented invariant of our runtime system which is potentially broken by alternative runtimes, it is certainly a problem on OCaml's side and once we are aware of this problem, we should not introduce more of it. I'd be fine documenting that max_string_length < max_int / 2 if there is a strong reason to require it, but it seems just better and more friendly to maintainers of alternative runtimes to lift this assumption assuming it is not too difficult to do so.

@yallop
Copy link
Member

yallop commented Sep 16, 2016

If the stdlib depends on a undocumented invariant of our runtime system which is potentially broken by alternative runtimes, it is certainly a problem on OCaml's side and once we are aware of this problem, we should not introduce more of it.

Have you changed your mind about this since opening this issue?

@alainfrisch
Copy link
Contributor Author

Have you changed your mind about this since opening this issue?

No, I said it was probably harmless, but if it is easy enough to fix without obvious drawbacks, it's worth doing it.

@yallop
Copy link
Member

yallop commented Sep 16, 2016

Here's another implementation, which is

  • smaller (16 non-blank lines)
  • simpler (no references, only one conditional, besides patterns) and
  • faster (25% faster than the stdlib on @alainfrisch's benchmark)

than any of the other implementations so far:

let check_pos v = if v < 0 then invalid_arg "String.concat" else v

let rec sum_lengths acc seplen = function
  | [] -> acc
  | [x] -> length x + acc
  | hd :: tl -> sum_lengths (check_pos (length hd + seplen + acc)) seplen tl

let rec unsafe_blits dst pos sep seplen = function
    [] -> dst
  | [x] ->
    unsafe_blit x 0 dst pos (length x); dst
  | hd :: tl ->
    unsafe_blit hd 0 dst pos (length hd);
    unsafe_blit sep 0 dst (pos + length hd) seplen;
    unsafe_blits dst (pos + length hd + seplen) sep seplen tl

let concat sep l =
  let sep_len = length sep in
  unsafe_blits (B.create (sum_lengths 0 sep_len l)) 0 sep sep_len l

@alainfrisch alainfrisch force-pushed the fix_overflow_string_concat branch from 89724dc to 50bdf03 Compare September 16, 2016 15:58
@alainfrisch
Copy link
Contributor Author

I've added the test and an entry in Changes.

FWIW, I like @yallop's latest proposal, modulo that it'd be better to lift the assumption on max_string_length, considering that it's not more complex (let l = length hd + seplen +acc in if l < acc then invalid_arg ... else l).

@yallop
Copy link
Member

yallop commented Sep 16, 2016

FWIW, I like @yallop's latest proposal

Thank you. That's the nicest thing anyone's said about my code in at least the last 62 comments.

modulo that it'd be better to lift the assumption on max_string_length

I'm happy to do that.

@yallop
Copy link
Member

yallop commented Sep 19, 2016

Here's a version that doesn't make assumptions about max_string_length:

let ensure_ge x y = if x >= y then x else invalid_arg "String.concat"

let rec sum_lengths acc seplen = function
  | [] -> acc
  | hd :: [] -> length hd + acc
  | hd :: tl -> sum_lengths (ensure_ge (length hd + seplen + acc) acc) seplen tl

let rec unsafe_blits dst pos sep seplen = function
    [] -> dst
  | hd :: [] ->
    unsafe_blit hd 0 dst pos (length hd); dst
  | hd :: tl ->
    unsafe_blit hd 0 dst pos (length hd);
    unsafe_blit sep 0 dst (pos + length hd) seplen;
    unsafe_blits dst (pos + length hd + seplen) sep seplen tl

let concat sep l =
  let seplen = length sep in
  unsafe_blits (B.create (sum_lengths 0 seplen l)) 0 sep seplen l |> bts

@alainfrisch
Copy link
Contributor Author

We can still get an invalid_arg with a string different from String.concat is the total length is greater than max_string_length (but without int overflow). I think this might be worth fixing by checking explicitly the resulting size. Also, an int overflow could still occur on the final length hd + acc.

@alainfrisch
Copy link
Contributor Author

I've pushed a version close to @yallop's latest proposal, adding an explicit check against max_string_length, protecting the final sum against int overflow, and adding a fast path for the case where the list is empty (avoiding allocation in that case).

@dbuenzli
Copy link
Contributor

dbuenzli commented Sep 20, 2016

@alainfrisch I think the cases you mention are handled by B.create.

@alainfrisch
Copy link
Contributor Author

What do you mean? Today, Bytes.create 0 != Bytes.create 0. And clearly, it raises Invalid_argument "String.create", not Invalid_argument "Bytes.concat".

@alainfrisch
Copy link
Contributor Author

(If/when strings were/are really immutable, one could also optimize the case of a list with a single element in String.concat.)

@dbuenzli
Copy link
Contributor

What do you mean?

If sum_length overflows B.create will raise.

@yallop
Copy link
Member

yallop commented Sep 21, 2016

I prefer the code as it was, without assert false, physical equality, if/else branches, code duplication, etc.

Checking for overflow multiple times (before and within create) should be avoided. It would be better to have create take the name of the calling function as an additional parameter.

@yallop
Copy link
Member

yallop commented Sep 23, 2016

Here's a version that incorporates @alainfrisch's two proposed improvements (optimization of the empty list case and more consistent error reporting):

let ensure_ge x y = if x >= y then x else invalid_arg "String.concat"

let rec sum_lengths acc seplen = function
  | [] -> acc
  | hd :: [] -> length hd + acc
  | hd :: tl -> sum_lengths (ensure_ge (length hd + seplen + acc) acc) seplen tl

let rec unsafe_blits dst pos sep seplen = function
    [] -> dst
  | hd :: [] ->
    unsafe_blit hd 0 dst pos (length hd); dst
  | hd :: tl ->
    unsafe_blit hd 0 dst pos (length hd);
    unsafe_blit sep 0 dst (pos + length hd) seplen;
    unsafe_blits dst (pos + length hd + seplen) sep seplen tl

let concat sep = function
    [] -> ""
  | l -> let seplen = length sep in bts @@
          unsafe_blits 
            (B.create ~caller:"String.concat" (sum_lengths 0 seplen l))
            0 sep seplen l

Bytes.create will also need to be changed to accept an optional ?caller argument. Actually, it would be a useful improvement to the standard library as a whole if internal functions that raised exceptions were to take an optional argument with the name of the caller. There are several places where the name of an internal function currently appears in an exception argument, and it's an easy fix.

@damiendoligez
Copy link
Member

I really like @yallop's latest version: it's fast and readable, quite perfect.

But I would remove the ?caller argument:

  • Bytes.create is not just an internal function: it has a public interface, so changing it is a rather big deal.
  • I don't think having the wrong message in Invalid_argument is such a big problem: the users have tools (backtraces and debuggers) to figure out where it really comes from.

While I agree that it might be a good idea to add a ?caller argument to a bunch of stdlib functions, I think it's a topic for another PR (and long discussion about the relative speed of double-checking, passing an optional argument, and setting up an exception handler in various back-ends). In this one we should concentrate on fixing the bug at hand.

Also, I want to cherry-pick this PR into 4.04 but I won't do it with the change to Bytes.create.

@yallop
Copy link
Member

yallop commented Sep 27, 2016

@damiendoligez: I'm fine with removing ?caller (and will consider proposing some variant of it in a separate PR). Is there anything you need from me to make cherry-picking into 4.04 easier?

@hannesm
Copy link
Member

hannesm commented Sep 30, 2016

@yallop I assume opening a PR with your latest version - ?caller would be needed...

@yallop
Copy link
Member

yallop commented Oct 3, 2016

See #833.

@damiendoligez
Copy link
Member

This is superseded by #833, right? Can we close it?

kayceesrk added a commit to ocaml-multicore/ocaml that referenced this pull request Dec 23, 2021
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants