

@gasche
Member

@gasche gasche commented Jan 11, 2023

Current status of this PR

(last updated: September 27th 2023)

  • We reached consensus on having a boxed implementation as a first step.
    People agreed with this approach there, there, there and there.
    See in particular this benchmark observing a 25% worst-case overhead due to boxing in a real-world codebase of @c-cube where dynarrays are a performance-critical data structure.

  • We reached consensus on the right behavior for iterator invalidation.
    See in particular this part of the discussion with @gadmm that converged to the currently-implemented behavior: mutations to existing elements during iteration are allowed, but mutating the length by removing or adding elements is a programming error and we do a best-effort attempt at detecting and failing with Invalid_argument in that case.

  • The interface and documentation have been reviewed.
    See there for the thorough review by @dbuenzli.
    See also this comment for a history of the proposed API.

  • We got two reviews of the implementation from @clef-men and @damiendoligez (thanks!)

  • We got two maintainer approvals (@damiendoligez on June 8th 2023, @Octachron on September 27th 2023)


(what follows was written at the time this PR was submitted on January 11th, 2023)

Introduction

This PR is an attempt to un-stick #11563, which attempted the noble goal of adding a Dynarray module to the standard library -- dynamic arrays are like arrays, representing a sequence of elements indexed from 0 to length - 1, but the length of a dynamic array is itself mutable: it can be changed by adding or removing elements at the end of the array.

#11563, proposed by @c-cube, got stuck in a debate about how to implement the unsafe Obj tricks that are required to provide an "unboxed" implementation, where a dynamic array of elements of type 'a is represented by a "backing" array of elements of type 'a. (The other possibilities are to make the API more complex by asking the user to provide "filler" elements, or to accept leaking user-provided values. People are not too enthusiastic about those either.)

This PR builds on top of (a rebased version of) #11563, and provides a boxed implementation of dynamic arrays, with morally 'a option elements in the backing array. The implementation:

  • uses only safe OCaml features,
  • does not leak user-provided values,
  • has a simple API, and
  • (thanks to careful optimization) has good-enough performance.
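To make the boxed approach concrete, here is a minimal, self-contained sketch of such a representation (hypothetical names and code, not the PR's actual implementation): each slot of the backing array is morally an 'a option, so no user-provided "filler" element is needed and popped values are released for the GC rather than leaked.

```ocaml
(* Toy boxed dynarray: each slot is Empty or a boxed Elem cell. *)
type 'a slot = Empty | Elem of { mutable v : 'a }

type 'a t = {
  mutable length : int;        (* number of valid elements *)
  mutable arr : 'a slot array; (* backing array; capacity = Array.length arr *)
}

let create () = { length = 0; arr = [||] }

let ensure_capacity d n =
  if Array.length d.arr < n then begin
    (* grow geometrically, so repeated adds are amortized O(1) *)
    let cap = max n (max 8 (2 * Array.length d.arr)) in
    let arr = Array.make cap Empty in
    Array.blit d.arr 0 arr 0 d.length;
    d.arr <- arr
  end

let add_last d x =
  ensure_capacity d (d.length + 1);
  d.arr.(d.length) <- Elem { v = x };
  d.length <- d.length + 1

let get d i =
  if i < 0 || i >= d.length then invalid_arg "Dynarray.get";
  match d.arr.(i) with
  | Empty -> assert false   (* slots below [length] are always filled *)
  | Elem e -> e.v

let pop_last d =
  if d.length = 0 then invalid_arg "Dynarray.pop_last";
  let x = get d (d.length - 1) in
  d.arr.(d.length - 1) <- Empty; (* release the value: no leak *)
  d.length <- d.length - 1;
  x
```

The price of this design is the extra indirection (and allocation) of the Elem cells, which is exactly the boxing overhead discussed in the benchmarks below.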

I think that we could gather consensus on such a boxed implementation, instead of debating unsafe Obj tricks ad nauseam. If later people want to propose further performance improvements using unsafe Obj tricks (I have ideas that I might try myself), those could come as later PRs, and users would still benefit from the useful module in the stdlib in the meantime.

There is, alas, one issue worth discussing (there are also benchmarks, but those are boring). A drop of debate in this ocean of consensual directions. The PR does not have one implementation of boxed dynarrays but two, related to the question of iterator invalidation: what happens if a user modifies a dynarray while they are iterating on it? In short, my first implementation was "non-reentrant", it specifies that this is a programming error and may raise Invalid_argument in that case, while the second is "reentrant", it supports this gracefully at the cost of a modest performance overhead. The two versions have a different specification, so we have to decide which one we want.

Happy resizing!

Structure of the PR

This PR starts with a rebased version of #11563 on the current trunk, with slight cleanups of the history (but not much). @c-cube, please feel free to update your PR with those commits (I don't have push rights to your fork).

commits: gasche/ocaml@dynarray.base...dynarray.rebase

Then I performed several small changes to #11563 that I think improve the final results (see the commit messages for more details):

  • I removed the functions blit and rev that I think are not necessary for a first PR
  • I added a function reset (in addition to clear) for consistency with Buffer
  • I optimized add_last and removed unsafe_add_last from the public API; it added complexity without any remaining noticeable performance benefit
  • I tuned the growth strategy a bit1

commits: gasche/ocaml@dynarray.rebase...dynarray.unboxed

Then I wrote better documentation for dynarray.mli, and the non-reentrant implementation: dynarray.mli, dynarray.mli. For comparison, the previous .mli from #11563.

commit: gasche/ocaml@dynarray.unboxed...dynarray.non-reentrant

Finally, a reentrant version: dynarray.mli (very few changes from the non-reentrant version), dynarray.mli.

commit: gasche/ocaml@dynarray.non-reentrant...dynarray.reentrant

How to review

You decide, but I expect that the following questions are important:

  1. Beautiful code or eldritch horrors? Are we willing to go for a boxed implementation of dynamic arrays as a first step, even though the performance of an unsafe unboxed representation could in theory be better? In my microbenchmarks, my boxed representations are in fact equal to or better than the unboxed implementation of #11563 (because I optimized more carefully), except in buffer-style workflows where you start from an empty array and repeatedly add elements; those get a slowdown from 1.4x to 2x in microbenchmarks (certainly much, much less in real life).

  2. Usability or performance? Are we willing to go for a reentrant implementation with a simpler specification and more predictable behavior, or do we want to rule out iterator invalidation as a programming error in the interest of the modest performance gains of an optimized non-reentrant version? (Non-reentrant iterators are slightly faster than in #11563, and reentrant iterators are slightly slower.)

  3. Bugs? Is the final, reentrant implementation proposed in this PR correct? (Or, if you answer "performance" to question (2), you can review the non-reentrant implementation.)

  4. Reviewing the .mli: the proposed API and its documentation.

It probably makes sense to try to get a feeling of the general opinion on (1) and then (2) before going too deep into (3) and (4), but of course looking at the code and the specification is important to get a feeling for (1) and (2). (You will be so impressed by my documentation comments that you will support this PR.)

This is a self-contained, completely new module for the stdlib. If you want to look at the code, I would recommend just reading the .mli and the .ml in one go, instead of trying to look at the commit diffs.

Performance

I wrote microbenchmarks to compare my boxed implementations with the unboxed implementation of @c-cube, and other data structures when relevant.

In general, boxed representations have a small but noticeable performance disadvantage; I carefully optimized my implementations to compensate for the difference as much as possible.

The implementations are in the same ballpark (sometimes faster, sometimes slower) for get and set workloads, and also for stack-like workloads with a balanced number of add_last and pop_last operations.

Note that, in my tests, the fixed-size Array module is roughly 2x faster than the Dynarray implementations for 'get' and 'set', and the Stack module (which uses lists internally) is 1.6-2x faster than Dynarray for 'add_last' and 'pop_last' benchmarks2. My microbenchmarks magnify the performance difference; it is much smaller in real applications, which typically do some work between each dynarray operation. In this context, I would consider a performance difference of 1.2x-1.3x as a very small difference, a difference of 2x as a small difference, and a difference of 5x-10x as a real difference. (To put things in perspective, a microbenchmark comparison of 'set' on Array, Hashtbl and Map suggests that Hashtbl is 42x slower than Array on my machine, and Map is 78x slower than Array.)

One area where the unboxed implementation has an advantage is workflows that only add elements. (For example, repeatedly accumulating elements into an empty dynarray, similar to Buffer usage for strings.) In my tests, when the user provides size hints using ensure_capacity, the boxed implementations are roughly 1.6x slower, and when no hints are provided (more reallocations of the backing array) they can be up to 2x slower.

(The difference is even larger, 3x-4x, with append_array, which is special because the unboxed implementation can directly blit the elements from the input array to its backing array. For append_seq the boxing overhead is only 1.4x, for example.)

Finally, my 'reentrant' and 'non-reentrant' implementations generally perform the same, except for iterators (iter, fold_left, for_all...) where the non-reentrant implementation is faster, thanks to its choice of a worse behavior -- the non-reentrant implementation maximises speed at the cost of fragile behavior, the reentrant implementation maximises well-definedness at the cost of performance. The differences remain fairly small: I see 1.2-1.3x differences for most benchmarks, up to 1.6x in some cases. In general the non-reentrant version is slightly faster than the unboxed implementation of #11563, and the reentrant version is slightly slower.

(Note: microbenchmarks in this style are very fragile and susceptible to measurement noise and bias. I have of course more precise numbers than what I mentioned in this section, tables and tables of them, but some of them are probably meaningless and I don't know which ones, so I stuck to the big picture that is less likely to be completely wrong. I will submit my benchmark code somewhere, it is mostly the obvious stuff so feel free to write your own.)

Reentrancy

I only realized that reentrancy was such an important question while working on the present PR. It is absent from the discussion of #11563.

In the case of Dynarray.iter f a, for example, what should happen if an invocation of f in the middle of the iter computation decides to set, add or remove elements of a?

I see three choices:

  1. We could claim that this is a programming error and that the behavior is unspecified, and may or may not raise an Invalid_argument exception. This lets us optimize performance but it gives poor usability, especially if the implementation does not reliably fail in this case. (The implementation can try to catch more such cases and fail more often, but this comes at the cost of performance, which was the reason for ruling those out in the first place.)

  2. We could claim that this is allowed (will not blow up with an exception), but that the behavior is under-specified. We can give it a weak-memory-model-like specification, which is that the iterator will observe some subset of the updates that happen during its execution. (In particular, it may not observe some updates, yet observe a later update.)

  3. We could allow this and provide a precise, informative specification, at the cost of performance. The specification is that the iterator will access each element once, in increasing index order; any time it accesses an element at a given position, it observes the "current" value in this position, or stops if it observes that the position is at the end of the array.
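Specification (3) can be made concrete on a toy model (plain arrays and refs, hypothetical code, not the PR's implementation): the loop re-checks the current length before each access, so pops performed by the callback shrink the range and iteration simply stops at the new end.

```ocaml
(* Toy dynarray: a fixed backing array plus a mutable length. *)
let backing = Array.make 16 0
let len = ref 0
let add_last x = backing.(!len) <- x; incr len
let pop_last () = decr len; backing.(!len)
let get i = backing.(i)

(* Specification (3): re-read the length at each step, observe the
   current value at each index, stop upon reaching the current end. *)
let iter f =
  let i = ref 0 in
  while !i < !len do
    f (get !i);
    incr i
  done
```

For example, iterating over [1; 2; 3; 4] with a callback that pops the last element the first time it runs visits 1, 2, 3 and then stops, without any error.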

My intuition would be, for a first version of a module that will be improved later but in backwards-compatible ways, to:

  • not focus on optimization too much, we can always optimize later
  • give a weak specification that constrains the implementation less, we can always make it more precise (and allow more user code) later

The problem is that these two ideas are in contradiction here. If we decide to go for the weakest specification, that is choice (1) (programming error), there is no point in providing a slow but convenient implementation, as valid code can never benefit from it. On the contrary, if we decide to give a simple and permissive specification, we allow iterator-invalidating programs, and moving to (1) later would break compatibility.

My personal opinion is that correctness and convenience should have priority over performance (especially when the difference is modest, as it is here). I think that we should go for specification (3) and not have regrets. There is probably still room for further optimizations to decrease the performance gap with (1) and (2). This is why the 'reentrant' implementation is the last commit of this PR, the version I am proposing for inclusion.

(what follows are more advanced discussions around re-entrancy, feel free to skip them and stop reading here)

Degrees of non-reentrant scenarios

There are different sources of such "external" updates in the middle of a function, from the weirdest to the most natural:

  • Parallel updates from another domain. Clearly those should be considered a programming error, just as for other single-owner mutable structures in the stdlib. We should still preserve memory safety in this case, which reduces our optimization opportunities somewhat.3

  • Asynchronous updates coming from finalizers, memprof callbacks, intra-domain Threads and signal handlers. Usually stdlib modules decide to ignore these, so it would be okay to give up and also rule them out as programming errors. They are easier to try to support than parallel updates, as they can only occur on "safe points" (I think: allocations, function calls, and loop back-edges), but still tricky.

  • Synchronous updates that come from the natural control flow of the program, for example if we call a user-provided function from an iterator function (fold_left, iter, for_all etc.) and that function mutates the dynarray, or in-between the forcing of Seq.t nodes produced by to_seq.

In fact, my implementation is trying to do "the right thing" for both synchronous and asynchronous updates. (So for example: I think you could store a log of memprof events as a Dynarray that would grow on memprof callbacks, including those invoked from within the Dynarray implementation.) Supporting asynchronous updates is tricky and there may be bugs there, but those bugs would only affect people writing asynchronous-callback code anyway, who are few and far between.4

Re-entrant vs. domain-safe

At first my idea was that trying to support reentrant updates was silly: people will devise concurrent versions of this data structure, and people who want to write code with (sequential) re-entrancy issues can just use one of those concurrent versions. So we could just support the iterator case, and tell memprof people to look elsewhere.

However, there is a catch: if your sequential version is not safe with respect to asynchronous callbacks, then the obvious concurrency strategy of wrapping the sequential version under a mutex is not safe either -- an asynchronous callback could run while you hold the mutex, and try to take the mutex to perform an operation. With a non-recursive mutex you have a deadlock, with a recursive mutex your code silently does the wrong thing.
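This hazard can be simulated without actual threads (a hypothetical sketch, plain stdlib only): model a non-recursive mutex with a boolean flag, wrap a sequential push under it, and fire a "callback" while the lock is held. Where a real non-recursive mutex would block forever, the simulation fails loudly so the problem is observable.

```ocaml
(* Simulated non-recursive mutex: re-acquiring it while already held is
   exactly the deadlock described above; we fail loudly instead of
   blocking forever. *)
let locked = ref false
let lock () =
  if !locked then failwith "deadlock: mutex already held"
  else locked := true
let unlock () = locked := false

(* A sequential operation naively made "concurrent" with the lock. *)
let data = ref []
let push x = lock (); data := x :: !data; unlock ()

(* Simulate an async callback that calls [push] while some other
   operation is holding the lock: *)
let callback_under_lock () =
  lock ();
  let r = try push 42; "ok" with Failure m -> m in
  unlock ();
  r
```

With a recursive mutex the call would succeed instead, but then the callback observes (and mutates) the structure mid-operation, which is the "silently does the wrong thing" case mentioned above.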

So this is an extra argument in favor of writing a reentrant implementation: it can be turned into a concurrent implementation easily, which is not the case of the non-reentrant version. (And if you try to get a concurrent version by doing something more clever than just a lock around the code, then your code will be strictly more complex than the reentrant version.)

Footnotes

  1. The reason to tune the growth strategy is that reallocation is more costly in boxed implementations, so tuning it helped reduce the performance gap for add-heavy workflows. I included the change in the unboxed implementation as well to have comparable benchmark results.

  2. The fact that Stack is faster than Dynarray for purely stack-like workflows came as a surprise to me, I would have assumed that Dynarrays are equivalent or faster. I measured with stacks of maximal size 1024. At very large sizes (say 1_000_000), locality effects give a clear advantage to dynarrays and I have observed them to be 2x faster, but I think that 1024 is more representative of typical workflows, which are probably even smaller than that most of the time. Note of course that Dynarrays allow stack-like workflows but also constant-time arbitrary access within the sequence, while a list-based Stack could only provide linear-time arbitrary access; so some scenarios cannot use Stack anyway.

  3. We cannot enforce the invariant that the "logical end" of the array is before the "memory end" of its backing array, so we need slightly more out-of-bounds checks than with a purely-sequential implementation. I would not worry about the performance cost here, rather about the memory-safety issues of unsuspecting unsafe code.

  4. When you try to write a reentrant implementation that respects a strong specification even in the presence of asynchronous updates, the bugs are bugs against this strong specification -- you revert to the "weak memory model" view of your dynarray state. In contrast, when you try to take advantage of a non-reentrant specification, you add more Array.unsafe_* calls for optimization. Those are worse bugs: they break memory safety and would affect more people. But in my experience it is easier to reason about the correctness of those unsafe_* functions than about the reentrancy of an implementation. Generally my reentrant implementation is simpler in most places, but very tricky in just a couple of places.

@gasche
Member Author

gasche commented Jan 11, 2023

I have been too succinct in my PR description and a lot of information is missing. Apologies! Here is a bit more on reentrancy.

(Note that the questions of reentrancy are not specific to this PR and apply to #11563 just as well. They just did not come up in the review and discussion of #11563 yet.)

Current behavior of the two implementations on concrete reentrancy/conflict scenarios

Currently, if you call set, add or remove in the middle of an iterator:

  • In both versions, set on elements that were initially in range and have not been traversed yet will be observed1
  • In the current non-reentrant version, add operations will be ignored (not observed), and pop operations will result in an Invalid_argument exception. (In particular, if you add then set in the previous range, you will observe the set but not the add, which gives this "weak memory model" vibe.)
  • In the reentrant version, both add and pop operations will be observed

On the other hand, if you add/append or pop or iterate from an asynchronous callback in the middle of another add/append operation (I think that pop-style operations are atomic from the point of view of asynchronous code execution):

  • with the current non-reentrant implementation, add/add conflicts will typically work properly (both will be observed in some order), add/pop conflicts are basically unspecified and may result in invalid states that raise Invalid_argument later, and iterating from within an addition either works fine or reliably raises Invalid_argument
  • with the re-entrant implementation, add/pop and add/add and add/iterate will be linearized in some order; but note that operations that add several elements (append_*) are not atomic, they add their elements one by one, so in the case of an add/append conflict you may see the elements from both sides interleaved in the output.
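The contrast between "adds ignored" and "adds observed" can be checked on a toy model (plain array plus mutable length, hypothetical code, not the PR's): snapshotting the length up front ignores elements added by the callback, while re-reading it at each step observes them.

```ocaml
(* Toy dynarray scaffolding. *)
let backing = Array.make 16 0
let len = ref 0
let add_last x = backing.(!len) <- x; incr len
let get i = backing.(i)

(* Non-reentrant style: the length is read once, before the loop,
   so elements added during iteration are ignored. *)
let iter_snapshot f =
  for i = 0 to !len - 1 do f (get i) done

(* Reentrant style: the length is re-read at each step,
   so elements added during iteration are observed. *)
let iter_live f =
  let i = ref 0 in
  while !i < !len do f (get !i); incr i done
```

Starting from [1; 2] with a callback that adds a new element when it sees 1, iter_snapshot visits 1, 2 while iter_live visits 1, 2 and the new element.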

The reentrant version has these nice properties and also a very simple implementation of iterators (it is the code you would write naively without trying to optimize anything). I have ideas for how to optimize them, at the cost of some code complexity, that could possibly close the modest performance gap (typically 20-30%) between the two versions -- then basically there would be no point in using the non-reentrant implementation, but we would still need to decide which specification to adopt. But I need to first take a break from dynarray implementations and look at other PRs on my TODO-list.

Reentrant implementation vs. specification

Remember the "three choices" from the PR description, regarding iterator invalidation:

  1. We could claim that iterator invalidation is a programming error and that the behavior is unspecified, and may or may not raise an Invalid_argument exception.
  2. We could claim that iterator invalidation is allowed (will not blow up with an exception), but that the behavior is under-specified. We can give it a weak-memory-model-like specification, which is that the iterator will observe some subset of the updates that happen during its execution. (In particular, it may not observe some updates, yet observe a later update.)
  3. We could allow iterator invalidation and provide a precise, informative specification, at the cost of performance. The specification is that the iterator will access each element once, in increasing index order; any time it accesses an element at a given position, it observes the "current" value in this position, or stops if it observes that the position is at the end of the array.

Currently the reentrant implementation provides very nice, strong properties, so it provides choice (3), but the reentrant specification gives much weaker guarantees, it provides choice (2). The idea is to not tie our hands if we want to move to a more relaxed, more efficient (reentrant) implementation later. I think this is the right choice for the PR.

But this raises the question of whether people will actually read the documentation and understand that the behavior they observe is under-specified, and that the nice properties that they see in their tests may be weakened later -- in a specification-respecting way. I am pessimistic that this will be the case, and we may end up being stuck with (3) as a de-facto specification. Personally I think that this would be fine and we could in fact consider specifying (3) right away -- I would not, on the other hand, give strong guarantees on asynchronous callbacks, and I haven't audited my code as carefully for that less important use-case.

Footnotes

  1. This is, by coincidence, a stronger/simpler/better behavior than what non-reentrant unboxed implementations give you, because in them you "lose" the set operations that happen after the first reallocation of the dynarray. Of course the specification should not promise this behavior (it does not) so that we can move to an unboxed implementation later.

@dbuenzli
Contributor

Regarding 1. I think it's nice to try to avoid tricks as a first step, though I have to say I'm a little bit sad for the Uchar.t Dynarray.t case. To avoid surprises I think the docs should make it more clear that for now one should not expect "intable" elements to have the same access and memory characteristics as in an 'a array (but I'm jumping to 4 here).

@gasche
Member Author

gasche commented Jan 11, 2023

Yes, this could definitely be mentioned in the documentation: that there is some boxing going on.

Right now the documentation starts with as few low-level concepts as possible and leaves them to the end, where capacity-hinting functions are discussed. At the very end there is a remark about the absence of lifetime leaks. Dear master of the docs, do you agree that the same approach could be followed here, leaving the remark about representation for the very end?

@gasche
Member Author

gasche commented Jan 11, 2023

Unrelated: it occurred to me this morning that ensure_capacity arr n, which guarantees that the capacity is at least n (but in practice will be higher due to the exponential-resizing strategy), should probably be completed with a strict version set_capacity arr n (which fails if the current length is higher than that), which sets exactly this capacity.

It is important to have ensure (the lax behavior) so that people can implement repeated pushes with ensure_capacity arr (length arr + 1) if they want to, but it is important to have the strict behavior if you know that the final result will be between 9 and 10 gigabytes, and don't want the capacity to end up at 16 gigabytes instead. (In practice we use a 1.5 growth factor for large sizes, so it could be at most 13.5 gigabytes, but still, you get the idea.)
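The two contracts can be sketched on a toy model that tracks capacity only (hypothetical code; set_capacity is the suggested strict variant, not something in the PR):

```ocaml
(* Toy model: only the length/capacity bookkeeping, no elements. *)
type t = { mutable length : int; mutable capacity : int }

(* Lax: guarantee capacity >= n, growing geometrically (1.5x here),
   so the resulting capacity may overshoot n. *)
let ensure_capacity d n =
  if d.capacity < n then
    d.capacity <- max n (d.capacity + d.capacity / 2)

(* Strict (proposed): set the capacity to exactly n,
   failing if the current length does not fit. *)
let set_capacity d n =
  if d.length > n then
    invalid_arg "set_capacity: length exceeds requested capacity"
  else d.capacity <- n
```

With the strict variant, a user who knows the final size can pin the capacity exactly and avoid the overshoot of the exponential growth strategy.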

@dbuenzli
Contributor

do you agree that the same approach could be followed here, leaving the remark about representation for the very end?

I would still warn and link to that remark before. People tend to overlook documentation details (a human factor we need to design with, not complain about :-) and the current module preamble really makes you believe this is just an array, but resizable.

For example in the 'a t docstring have a sentence like:

{b Warning.} The {{!memory_layout}memory layout} of 
dynamic arrays differs from the one of {!Array}s.

or even replace something like

The {!Array} module provide arrays of fixed length. In contrast,
the length of a dynamic array can change over time, we can add
more elements or remove elements at the end of the array.

by

Dynamic arrays are variable length {!Array.t} (their 
{{!memory_layout}memory layout} is also slightly different). Elements
can be efficiently added or removed at the end of the array.

@c-cube
Contributor

c-cube commented Jan 11, 2023

I want to write more detailed comments later, but an important aspect of resource usage that is not mentioned as a drawback to this approach is memory consumption. Even if CPU time is not far behind an array-like implementation (which is impressive tbh; does it depend on the access patterns?), this design will consume up to 3× more memory in the case of a dynarray of primitive types.

@gasche
Member Author

gasche commented Jan 11, 2023

I see this as a reason to continue working on unboxed designs, but not a reason to reject a boxed implementation. Stack.t also consumes 3 words per element, Hashtbl.t is between 4 and 5 words, Set.t uses 5 words, etc.

@c-cube
Contributor

c-cube commented Jan 11, 2023

I mean, yes, that's a reason why I've been using dynarrays as stacks for years, and basically never use Stack.t 🙂 . We're in agreement.

@alainfrisch
Contributor

Just a word of caution: some users expect that stdlib datastructures are forward Marshal-compatible across versions of OCaml (see the hack in Hashtbl to allow demarshaling tables marshaled by older versions of OCaml). If we know that the representation is likely to change, we might want to explicitly warn users.

@gadmm
Contributor

gadmm commented Jan 12, 2023

(I have not reviewed the implementation at this point. I wrote this over the past two days so apologies if this does not take into account the discussion that happened in the meanwhile.)

Thanks for doing this, it is a good strategy to first write a plain OCaml version and look into the unboxing optimization later. This optimization polluted the discussion about the correct interface. It makes sense to me to first focus on specification and correctness, and care about optimisations later. (This does not mean that the unboxing optimization is optional: a compelling argument to have Dynarrays is that they have good cache locality and can be helpful to implement other structures with good cache locality, but this is only true when unboxed. But this can come later.)

  • How confident are you with your micro-benchmarks and claims that derive from them? (For instance, a lot of things can go "wrong" in micro-benchmarks, such as hiding the impact of cache locality.)
  • Should one implement shrinking when removing elements?
  • Should it implement sort?

The main point you seem to make of the discussion is whether one should try to make sense of iterator invalidation. Strangely you do not mention the discussion about concurrent mutation detection from the other PR. You even say that reentrancy is not mentioned. But one of its purposes is indeed to detect structure changes from the same thread as well, not just in a parallel setting (and when not in parallel, I think the detection is accurate). (And it enables some optimizations to iteration too, as mentioned previously, and perhaps this is what you talk about.)

I do not think there is doubt as to the fact that iterator invalidation is a source of hard-to-find bugs, with GCed languages making them harder to notice thanks to memory safety. (Given that this detection has been in Java for decades, surely there must be some literature one can turn to in order to make an informed decision.) Leaving aside the benefits of the optimizations it enables, I do not expect its costs to be significant (one well-predictable, non-atomic integer comparison at every iteration). Have you explored this path? You mention costs; have you tried to measure them?
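The Java-style detection being referred to can be sketched in a few lines (a hypothetical toy model with list storage, not the PR's code): every length-changing operation bumps a version counter; iterators snapshot it up front and compare at each step, failing fast on a mismatch.

```ocaml
(* Toy dynarray with a "modCount"-style version counter. *)
type 'a t = {
  mutable elems : 'a list;  (* toy storage, stands in for the backing array *)
  mutable version : int;    (* bumped on every length-changing operation *)
}

let create () = { elems = []; version = 0 }

let add_last d x =
  d.elems <- d.elems @ [x];
  d.version <- d.version + 1

(* Fail fast if the structure changed since iteration started:
   one plain integer comparison per step. *)
let iter f d =
  let v0 = d.version in
  List.iter
    (fun x ->
       if d.version <> v0 then
         invalid_arg "Dynarray.iter: dynarray was modified during iteration";
       f x)
    d.elems
```

An iteration whose callback calls add_last raises Invalid_argument at the next step instead of silently observing (or missing) the mutation.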

  • There are further arguments with parallelism: how do you intend to leverage the guarantees of the memory model (previously discussed at #11563 (comment))? We were sold easier debugging thanks to stronger guarantees, but your implementation is likely to fail silently, for instance corrupting the data structure in a way that is only noticed at a completely unrelated time and location, thereby throwing away the benefits derived from these guarantees.
  • The discussion about asynchronous callbacks sounds strange to me. Ideally there should be nothing to say about it. During the discussion about documenting the memory model, I argued that the model should specify what happens with concurrent mutations inside async callbacks, by treating them like parallel mutations. This addresses your questions in a general manner (edit: concretely, the writer of an asynchronous callback uses atomics instead of plain references to communicate with its thread). (In your example, the mutex problem sounds orthogonal to me.) But I do not think this was noticed (#11280 (comment)).

For the bigger picture probably:

My personal opinion is that correctness and convenience should have priority over performance (especially when the difference is modest, as it is here). I think that we should go for specification (3) and not have regrets.

You seem to associate correctness and convenience to trying to make sense of iterator invalidation, but this link escapes me so far.

@gasche
Member Author

gasche commented Jan 12, 2023

I found it helpful to wonder about how a user would implement iterators outside the abstraction boundary of a module.

Basically I see two simple ways:

let iter f a =
  for i = 0 to length a - 1 do
    f (get a i)
  done

let iter f a =
  let i = ref 0 in
  while !i < length a do
    f (get a !i);
    incr i
  done

The first "iter-for" version behaves, morally, similarly to my "non-reentrant" implementation. If length a increases during the run, it will not notice and will ignore the new elements, and if length a decreases you will get a failure when reaching the end.

The second "iter-while" version behaves exactly like my "reentrant" implementation; in fact it is my implementation. It will iterate until it observes at some point that it has reached the end of the array. (It could still fail due to concurrent updates between the !i < length a check and the get a !i access.)

My intuition is that, for a structure that advertises the mutability of length a, the iter-while implementation is obviously the better choice, and that we should take its behavior as inspiration to (implement) and specify the function.

(Maybe I should have described things this way instead of talking about "iterator invalidation", which can mean many things to many people.)

@bluddy
Contributor

bluddy commented Jan 12, 2023

@gadmm makes a very interesting point regarding Java and its years of experience with GC'd multithread-supporting data structures. A quick lookup brought me to this answer. I think it's possibly very wise of us to investigate their code, since we've never dealt with these issues before.

Also, their invalidation rules for single-thread iteration seem interesting, as can be seen here. I don't think our mutable data structures necessarily consider the different scenarios as well as they might.

@gasche
Copy link
Member Author

gasche commented Jan 12, 2023

But Dynarray is not intended to be a multi-thread-supporting data structure, it explicitly forbids unsynchronized concurrent access. I am interested, in general terms, in playing more with @gadmm's idea of (atomic) version counting to detect unsafe unsynchronized access earlier, but right now I do not want to work on it, and I do not consider it within the scope of the present PR. (I spent about 4 days of work on the current implementation and I don't want to spend extra days chasing even bigger challenges before we make a decision. I do of course expect extra work refining and iterating on the PR thanks to review feedback, but let's stay within the scope of the proposed module.)

I'm happy to discuss single-threaded invalidation. For unsynchronized concurrent usage, the only requirement is to preserve memory safety.

@c-cube
Copy link
Contributor

c-cube commented Jan 12, 2023

My impression, @gasche, is that the java-like concurrent modification detection is the way to go because it's the only semantic that will work as well when the Dynarray implementation becomes memory-efficient (i.e. a flat memory chunk). Single-thread invalidation is just the easy mode of concurrent invalidation. If we make different, incompatible promises about iterator invalidation (e.g. changing the array while iterating is defined, like your proposal in 3.) then there is no walking that back, and the fast Dynarray is no longer possible.

I see that Hashtbl.iter does not specify concurrent modifications at all. That might also be a good place to fail hard if concurrent modification is detected, imho.

Another nitpick I have:

My microbenchmarks magnify the performance difference, it is much smaller in real applications that typically do some work between each dynarray operation.

I think it's the opposite 🙂 . In real life, where items are not allocated next to one another, each Dynarray.get arr i incurs potentially an additional cache miss for the element box. For a dynarray of primitives, this turns a few cycles into potentially dozens or hundreds of cycles if the box is in L3 or main RAM1. I personally use a ton of int Vec.t (with a home grown dynarray) so that'd be an unacceptable overhead in my particular case. My point is just that each box has the potential of adding significant slowdowns with real-life memory access patterns.
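The extra indirection described above can be made concrete. Below is a hedged sketch of what a boxed representation might look like (type and field names are hypothetical, not the actual stdlib representation); the point is that get performs two dependent loads instead of one:

```ocaml
(* Hypothetical boxed representation: each element sits behind its own
   heap block, so reading it follows one extra pointer compared to a
   plain ['a array] — that extra dereference is the potential cache miss. *)
type 'a elem = { mutable v : 'a }

type 'a t = {
  mutable len : int;
  mutable data : 'a elem array;  (* backing array of boxes *)
}

let of_list l =
  let data = Array.of_list (List.map (fun x -> { v = x }) l) in
  { len = Array.length data; data }

let get a i =
  if i < 0 || i >= a.len then invalid_arg "get";
  a.data.(i).v  (* load the box pointer, then load the value: two loads *)

let set a i x =
  if i < 0 || i >= a.len then invalid_arg "set";
  a.data.(i).v <- x
```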

I don't want to detract from your work. It's very nice to have several distinct proposals. If I had to choose right now, I'd take the non-reentrant version, with the explicit warning in the .mli that concurrent iteration is unspecified and might even raise; this way we can change to a better implementation later.

Lastly: I like the custom growth selection based on the current capacity!

Footnotes

  1. http://ithare.com/wp-content/uploads/part101_infographics_v08.png

@gadmm
Copy link
Contributor

gadmm commented Jan 12, 2023

I see that Hashtbl.iter does not specify concurrent modifications at all. That might also be a good place to fail hard if concurrent modification is detected, imho.

This is a good example. Hashtbl.iter tries hard to make sense of iterator invalidation (there is no going back), and it ends up with an iter function that unexpectedly mutates the data structure (and thus cannot be used in parallel).

@gasche
Copy link
Member Author

gasche commented Jan 12, 2023

Thanks @c-cube for the interesting feedback.

In real life, where items are not allocated next to one another, each Dynarray.get arr i incurs potentially an additional cache miss for the element box.

I certainly agree that my benchmarks probably did not reflect locality effects accurately, but I am not sure that the situation you describe is a common case. I would rather expect that the element boxes fairly often end up close together in memory, and therefore benefit from each other's cache line lookups, as the element boxes allocated during the same minor collection will be promoted together to the major heap.

I personally use a ton of int Vec.t (with a home grown dynarray) so that'd be an unacceptable overhead in my particular case.

Would you actually try adding this elem-box indirection in your Vec.t implementation and measure the performance impact on your Vec-heavy program?

the java-like concurrent modification detection is the way to go because it's the only semantic that will work as well when the Dynarray implementation becomes memory-efficient

I mentioned three specifications for what happens if we mutate a Dynarray during its iteration:

  1. This is a programming error, and we may fail with Invalid_argument.
  2. Okay, but we don't give any guarantees on the subset of update iterations your iterator will observe.
  3. Whenever the iterator reads position i, it gets exactly the current element at i.

I think that I could implement all three semantics with unboxed dynarrays. So I don't follow/understand/agree when you say that only (1) works with unboxed designs.

Currently my non-reentrant version implements (1), and my reentrant version specifies (2) (leaving room for more efficient versions later) but implements (3). Neither of the implementations depend on the boxed representation.

The only relation between iteration behavior and the boxed representation is that my non-reentrant version naturally observes all set operations on elements within its initial range, even after reallocation events. But it misses additions, so it belongs to models (1) or (2) anyway, and in either model dropping this property (by moving to an unboxed representation) preserves the specification.

Finally: I suspect that one could actually implement model (3) just as efficiently as model (2) in practice, with only a well-predicted "check that nothing funny has been going on" extra test at each iteration of the iterator. But I am not proposing to document (3) in this PR, only to document (2), to stay on the safe side / leave us some wiggle room for later. (If people decided that they really want (3) I could be convinced to experiment a bit further in this direction.)

@gadmm
Copy link
Contributor

gadmm commented Jan 12, 2023

My intuition is that, for a structure that advertises the mutability of length a, the iter-while implementation is obviously the better choice, and that we should take its behavior as inspiration to implement and specify the function.

I agree with this intuition when you frame the problem like this, but this omits asking the question of how you would reason about such code. People who bring iterator invalidation into the discussion simply want to say "you don't".

@gasche
Copy link
Member Author

gasche commented Jan 12, 2023

This omits asking the question of how you would reason about such code. People who bring iterator invalidation into the discussion simply want to say "you don't".

My understanding of the Java position is (but you know better than I do about this): "if you need mutations during iteration, use a concurrent version instead". This does not solve the problem of how to reason about the code. What are the benefits? Are we expecting strong performance differences (I'm not seeing that for dynarray)? Or is the idea that mutating a collection during iteration is often a bug and it is better to ask the user to fix their code?

@gadmm
Copy link
Contributor

gadmm commented Jan 12, 2023

"if you need mutations during iteration, use a concurrent version instead". This does not solve the problem of how to reason about the code.

Indeed, and I do not make such a claim. (It is not clear to me whether or how Java is a model for its standard library in an SMP context, but at least OCaml should not do worse than Java.)

We would need concrete use-cases to think along those lines. For instance, persistent data structures could be more amenable to reasoning; if there are such use-cases, it might be a matter of using the right tool for each job.

Or is the idea that mutating a collection during iteration is often a bug and it is better to ask the user to fix their code?

Yes, this is indeed one of the benefits of failing fast (perhaps I was not clear about it).

@gadmm
Copy link
Contributor

gadmm commented Jan 18, 2023

I am interested, in general terms, in playing more with @gadmm's idea of (atomic) version counting to detect unsafe unsychronized access earlier, but right now I do not wish want to work on it, and I do not consider it within the scope of the present PR.

How do you see a possible evolution of Dynarray to this model, on top of your PR? Do you consider it reasonable to have Dynarray without such a check at first and possibly introduce it later, or would you welcome a helping hand in implementing this feature before release?

@nojb nojb added the stdlib label Jan 22, 2023
@gasche
Copy link
Member Author

gasche commented Jan 24, 2023

I thought a bit more about this. I proposed three specifications for mutation during iteration:

  1. mutations are a programming error and may result in an Invalid_argument exception.
  2. mutations are allowed but only some of them may be observed by ongoing iterations.
  3. mutations are allowed and ongoing iterations observe all of them.

My personal favorites are (3), which is simple and clear, and (1) if we make sure that we fail reliably, instead of failing only in corner cases and behaving in a permissive way otherwise. (Failing more often for predictability also has a small performance cost.)

After thinking more, I propose the following specification, which is a variant of (1):

  • modifications to existing elements (set) are observed by the iterator in a timely manner
  • modifications to the length are a programming error, and may result in Invalid_argument exceptions

Two remarks:

  • This specification guarantees that Dynarray iterators behave exactly like Array iterators if no size change occurs
  • I propose to systematically fail whenever the iterator detects a size change (instead of failing on pop and ignoring add for example). This still allows corner cases where a user function performs well-bracketed pushes and pops to the structure, preserving the number of elements on return. I think that this compromise is fine: good performance and easy to justify.
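A sketch of how an iterator under this specification could look. This is a simplified illustration against a hypothetical stub, not the actual stdlib code: the iterator remembers the initial length and fails as soon as it observes a change, while set on existing elements is observed normally:

```ocaml
(* Minimal dynarray stub (hypothetical; only what this example needs). *)
type 'a t = { mutable len : int; mutable data : 'a array }

let of_list l =
  let data = Array.of_list l in
  { len = Array.length data; data }

let length a = a.len

let set a i x =
  if i < 0 || i >= a.len then invalid_arg "set";
  a.data.(i) <- x

let remove_last a =
  if a.len > 0 then a.len <- a.len - 1

(* Sketch of the proposed specification (1'): [set] during iteration is
   fine and observed in a timely manner; any length change detected
   during the loop is reported as a programming error. *)
let iter f a =
  let n = length a in
  for i = 0 to n - 1 do
    if length a <> n then
      invalid_arg "Dynarray.iter: length changed during iteration";
    f a.data.(i)
  done
```

Note that, as discussed above, this check only sees the length at each step: well-bracketed push/pop sequences that restore the length before the next check would escape detection.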

@gadmm
Copy link
Contributor

gadmm commented Jan 24, 2023

I agree with this analysis and it summarises well the various points. I am not sure about the following, though:

This still allows corner cases where a user function performs well-bracketed pushes and pops to the structure, maintaining the initial number of elements on return. I think this compromise is fine: good performance and easy to justify.

Does the programmer have to ensure somehow that no reallocation occurs, or are reallocations permitted? In the first case, I find this example artificial: if the function controls well-bracketing and ensures that no reallocation is needed, could the function not be rewritten only in terms of get and set? And if structural changes are permitted during iteration, this will prevent the desired optimizations to iterators, forcing us to reload the array at every iteration.

Using a version count à la Java has a simpler specification (some operations change the structure, other do not). It also pays for itself by not having to reload the array and length during iteration (you replace two loads from the vector metadata by one load from the vector metadata, per iteration). (Here we must think in terms of cache lines accessed.)
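A hedged sketch of the version-count idea (à la Java's modCount; field and function names are hypothetical): every structural change bumps a version field, iterators snapshot it once at the start and compare at each step:

```ocaml
(* Hypothetical dynarray with version-based invalidation detection. *)
type 'a t = {
  mutable len : int;
  mutable data : 'a array;
  mutable version : int;  (* bumped by every structural change *)
}

let create () = { len = 0; data = [||]; version = 0 }

let add_last a x =
  if a.len = Array.length a.data then begin
    (* Grow the backing array; using [x] as filler is the "leaking
       user values" shortcut, acceptable for this sketch. *)
    let cap = max 8 (2 * Array.length a.data) in
    let data = Array.make cap x in
    Array.blit a.data 0 data 0 a.len;
    a.data <- data
  end;
  a.data.(a.len) <- x;
  a.len <- a.len + 1;
  a.version <- a.version + 1

let iter f a =
  let v = a.version in
  for i = 0 to a.len - 1 do
    if a.version <> v then
      invalid_arg "Dynarray.iter: structural change during iteration";
    f a.data.(i)
  done
```

Compared to re-reading both len and data at each step, the loop only reloads version from the metadata, which is the single-load-per-iteration point made above.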

An even cheaper method for the same specification would be to use a non-atomic reference-count of the number of concurrent readers. Then you do not need to access the vector metadata at all in the middle of iteration (an actual win), only at the start and end. However in this case, the overflow can break programs in theory, and it also has less helpful behaviour in case of data races. That's a lot to pay for a minor optimization.

@gasche
Copy link
Member Author

gasche commented Jan 24, 2023

My proposal is to:

  • specify that changes to the number of elements are a programming error, but
  • for now, only check the length at each iteration.

(My comment on well-bracketed add/pop was not that our specification would allow this, it would not, but that our implementation would not catch this programming error.)

This can be later refined into your proposal, which respects the same specification and fails more often, with a slightly more complex implementation but also benefits in the concurrent case.

@gadmm
Copy link
Contributor

gadmm commented Jan 26, 2023

Ah ok, it was a bit ambiguous. I agree that it is ok to start with a limited check and a clear specification; another PR can come later to bring version-based detection and to optimize iterators.

@gasche
Copy link
Member Author

gasche commented Jan 27, 2023

@c-cube

I think it's the opposite 🙂 . In real life, where items are not allocated next to one another, each Dynarray.get arr i incurs potentially an additional cache miss for the element box. For a dynarray of primitives, this turns a few cycles into potentially dozens or hundreds of cycles if the box is in L3 or main RAM1. I personally use a ton of int Vec.t (with a home grown dynarray) so that'd be an unacceptable overhead in my particular case. My point is just that each box has the potential of adding significant slowdowns with real-life memory access patterns.

I went ahead and ran some benchmarks on your code: I boxed the vectors in your SMT solver mc2, see c-cube/mc2@master...gasche:mc2:boxed-vector . Testing on pigeonhole8.cnf (which takes roughly 1s), boxing vectors adds a 25% overhead on the whole run on my machine. (Same result with hanoi4.cnf; only a 10-15% overhead with reluplex/test_wbig or unsat/iso_icl527.smt2.)

What do you think? My impressions:

  • these results do validate your intuition that we can see "significant slowdowns" on real-life programs
  • ... but the slowdowns remain small enough that I consider them acceptable for a first implementation

@nojb
Copy link
Contributor

nojb commented Jan 27, 2023

  • first implementation

I have been following the discussion from a distance, but just to make sure I understand: the hope/expectation/etc. is that a hypothetical "unboxed" implementation would maintain the exact same present interface?

@gasche
Copy link
Member Author

gasche commented Jan 27, 2023

Yes, the exact same interface. We discussed several choices of interface (the ones I call (1), (2), (3), and then the compromise solution above which could be called (1')), they can all be realized with an unboxed implementation.

@bluddy
Copy link
Contributor

bluddy commented Jan 29, 2023

The question in my mind is how long it would take the unboxed implementation to enter the stdlib following this, because really the whole point of dynamic arrays is performance. If we're talking about a version or two, then that's great and this is worth doing. If it could take years, then it's not really different from sorting out the issues with the unboxed version now - I don't think many people would want to use this data structure until it has excellent performance.

gasche and others added 23 commits October 21, 2023 07:23
Suggested-by: Daniel Bünzli <[email protected]>
Suggested-by: Damien Doligez <[email protected]>
Suggested-by: Guillaume Munch-Maccagnoni <[email protected]>
Suggested-by: Simon Cruanes <[email protected]>
@gasche
Copy link
Member Author

gasche commented Oct 21, 2023

CI is green, several reviews, two maintainer approvals, no one speaking against, merging now!

@gasche gasche merged commit d21d884 into ocaml:trunk Oct 21, 2023
@gasche
Copy link
Member Author

gasche commented Jan 5, 2024

#12885 proposes an unboxed representation of dynarrays, using type-unsafe dummies in the empty space.

@gasche
Copy link
Member Author

gasche commented May 3, 2024

The "unboxed" implementation has been merged thanks to @OlivierNicole's review work, and should be available in OCaml 5.3 (not 5.2).
