Skip to content

add iterators to most containers in stdlib#635

Closed
c-cube wants to merge 2 commits intoocaml:trunkfrom
c-cube:stdlib-iterators
Closed

add iterators to most containers in stdlib#635
c-cube wants to merge 2 commits intoocaml:trunkfrom
c-cube:stdlib-iterators

Conversation

@c-cube
Copy link
Contributor

@c-cube c-cube commented Jun 25, 2016

Following the new guidelines, here is the first of a series of proposals to improve and extend the stdlib.

I think iterators should be available in the stdlib for several of the reasons mentionned in the guidelines:

  • "they cannot easily be implemented externally": if I don't have access to the internals of Map, Stack or Hashtbl, I cannot implement proper iterators on them;
  • "they facilitate communication between independent external libraries": having iterators makes it easier to transfer data between various data structures by just providing two functions (from/to).

Testing

There are no tests right now, because it's not clear to me how to test those. I already have some tests using qtest, but that relies on 3rd party tools. I think tests using expect/result files are overkill, so we probably should have a small unit/random testing lib in the compiler? I will obviously upgrade this PR once the proper way of testing becomes clear.

The iterator type

I have been involved in discussions with Janestreet, @dbuenzli and others to try to agree on a common iterator type (batteries/extlib have Enum, JST has Core.Sequence, I use sequence, etc.). We did not reach an agreement, in parts because of performance considerations related to flambda. I still think the solution I propose, type 'a gen = unit -> 'a option, is the best tradeoff:

  • it is simple, and easy to implement;
  • it is structural, making compatibility between libraries trivial (just declare it; no need for a result/bytes compatibility package);
  • it looks a lot like iterators in many other languages, including python, java, C#, rust, etc. and therefore should be familiar to many programmers;
  • there already is an extensive library of combinators for it, demonstrating that it is indeed possible to implement many operations on it;
  • it has good performance, even with flambda (see benchmark results for 4.03 and 4.03+flambda below); the much faster iterators (sequence and fold) are less expressive, and cannot express zip, merge, map2 and the likes. In contrast, gen can be used for consuming an iterator step by step.

edit: the current proposal no longer uses gen, but instead uses the same type as Core.Sequence, which should make interoperability with Core possible. It has the advantages of being explicit, purely functional, and likely to be optimized by flambda in the future.

https://github.com/c-cube/iterators_bench/blob/master/res_4.03
https://github.com/c-cube/iterators_bench/blob/master/res_4.03_flambda

Other planned proposals:

(as a mere preview ;)

  • a module for options (map, flat_map, etc. really "fill obvious gaps", as in, they are useful in every single OCaml program).
  • ideally, I'd like a module IO in which channel-related functions could go, but they already belong in pervasives, so this ship probably has sailed. Nevertheless, I'd like to add some functions that promote a safe handling of resources, in the vein of val with_file_in : string -> (in_channel -> 'a) -> 'a (taking care of always closing the file); also, some idiom to iterate on the lines in a channel, read a whole channel into a string, and other basics that get reinvented hundreds of times in the community.

@gasche gasche added the stdlib label Jun 25, 2016
@gasche
Copy link
Member

gasche commented Jun 25, 2016

If someone reading this discussion wants to understand why 'a gen = unit -> 'a option is both more expressive and less efficient to enumerate from than 'a sequence = ('a -> unit) -> unit, this is discussed in my old blog post on Generators, iterators, control and continuations.

The short story is that with a gen, the consumer is in control, while with a sequence the producer is in control. It's easier and more efficient to turn your data-structure into a sequence, but then the consumer has to pay in convenience/complexity. It's easier and more efficient to consume someone else's gen, but then the producer had to pay in convenience/complexity. Delimited continuations (as proposed by the multicore-ocaml folks, or yield in languages that have that) reverse control, so they allow either side to recover the convenience (not necessarily the efficiency).

@Drup
Copy link
Contributor

Drup commented Jun 25, 2016

To complement @gasche's point, sequence does not strictly need further support from the stdlib. All datastructures have an iter function, which is equivalent to to_sequence.

I have been using those two style of iterators for a while now, I think they are very useful and covers all my usecases. Having the various gen functions directly in the stdlib (and hopefully a standardization in the OCaml ecosystem) would be extremely valuable.

b

let of_list l =
let b = create 32 in
Copy link
Contributor

@gmevel gmevel Jun 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about List.length?

@bobot
Copy link
Contributor

bobot commented Jun 25, 2016

I'm not sure it is possible to implement the iterator functions on ephemerons

I think it is possible and that it is better to keep the same signature for hashtable and weak hashtable. If it is not clear how to do it from the current code I could look at it.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 26, 2016

We did not reach an agreement, in parts because of performance considerations related to flambda.

I'd just like to point out that the reason why we did not reach an agreement on my part was absolutely not because of this.

The reason is that none of the proposed solutions had a compelling story about error reporting --- needed both at the consumer and producer end point --- and error recovery, i.e. be able to restart the enumeration once an error was hit at the producer point. Both features are very important for writing real world streaming codecs.

I think that the proposed interface is 1) Conceptually ugly as it favours imperative programming as can be witnessed by the implementations provided in the current PR 2) Very inefficient as it boxes all enumerated values and 3) As mentioned above, has no story for error reporting/recovery beyond reusing the exceptional mecanism that we all eventually learned should not be used for non exceptional error conditions.

We definitively need a good streaming processing framework for OCaml but I'm afraid this is not it and I think this is more likely to add noise in the system and trick people into using inadapted and inefficient abstractions.

People willing to use this interface can perfectly use the Gen library. I personally wouldn't bless this approach in the stdlib.

@gasche
Copy link
Member

gasche commented Jun 26, 2016

Regarding testing, I think the tests should go in the testsuite, which already has lib- tests for the standard library. It should be very easy to convert non-random qtest tests into assert calls, with an empty expected-result file.

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

@dbuenzli that was mostly your concern. As far as I remember the biggest issue discussed by most people was performance. I don't think error reporting is important in this case, since for IO people will probably use Lwt or Async (with Async.Pipe or something similar). Streaming data and streaming IO are really different things.

For IO you need to consider:

  • error reporting
  • resource handling (closing files, etc.)
  • concurrency issues (limit size of internal buffers, etc.)

None of those are a concern for purely algorithmic iterators. In addition, people can't "just use the gen library" because the standard types (Map.S.t, etc.) are opaque; some cooperation from the stdlib is needed.

@dbuenzli
Copy link
Contributor

Streaming data and streaming IO are really different things.

That's your point of view. I see the later as a generalization and find "purely algorithmic iterators" as you put it to be of limited value in practice. What we need is a stream processing framework that is able to deal in a uniform manner with for example sequences of values decoded from a file or pulled out of a list and added to a data structure or written to a database.

What you are proposing here is an inefficient toy of disputable taste whose limited use cases are already better and more efficiently served by other functions already present in the standard library, namely iter and fold functions.

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

iter and fold are not composable (except when using something like sequence, but well). For IO I still think a solution would need to be able to accomodate a monad such as Lwt, and it would be very inefficient for pure in-memory processing. Of course I'm open to alternative solutions.

Also, for the record, I am a bit tired of the way you criticize solutions that do not please you. "inefficient toy of disputable taste" is based on your taste, not on data (or, find something better and more efficient before saying such things. PRs on my benchmark are welcome).

@dbuenzli
Copy link
Contributor

"inefficient toy of disputable taste" is based on your taste, not on data

Taste is based on taste which is exactly what I write here, you need again to learn to read me. However boxing each value you transfer between data structures in an option has nothing to do with taste AFAIK. This won't be the case if you use things like simple generators --- basically iter --- which are already well supported by the stdlib.

or, find something better and more efficient before saying such things.

It is not because I have nothing better to propose that I should not be allowed to cast criticisms. Again I find what you propose to be of limited use in practice and hence the addition to be of limited value as it only bloats the library and doesn't solve the problems that need to be solved.

@Drup
Copy link
Contributor

Drup commented Jun 26, 2016

@dbuenzli You say that efficiency was not the reason, yet you write "inefficient" everywhere and you probably didn't even look at the benchmark, since it's pretty clear it's wrong (2 times slower than sequence is good, given the difference in expressivity, and it blows everything else).

As for the rest: Please do try to write zip or drop with only an iter function exposed in a datastructure, without using any inversion of control. Sure, sequence is faster and is enough for quite a good amount of stuff, but sometimes you need the extra control.

@stedolan
Copy link
Contributor

There are a number of other functions that seem to have snuck in with the generator proposal, and I think they should have separate discussion. For instance, I do not think that the proposed Array.range is a good addition. I expect a range function to satisfy:

append (range a b) (range b c) = range a c
length (range a b) = (b - a)

To this end, I think range a b should throw an exception if a > b, and perhaps support an optional possibly-negative ~step argument if decreasing ranges are desired. (For reference, I think Python's range function is more-or-less right, with the caveat that it returns the empty list in some cases where I think it should fail).

Similarly, to_list, of_list, add_list should be a separate proposal. Both range and *_list are good suggestions, but should be discussed independently!

Also: to_gen/to_gen_i is inconsistent with map/mapi and iter/iteri as used in the rest of the stdlib, and Set.to_gen_range is inconsistent with Array.range, disagreeing about whether a "range" includes its upper endpoint.

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

@gasche @Drup an interesting alternative would be an analog to BatSeq, which is even a bit more powerful than gen because it is functional. Something like:

type 'a t = unit -> 'a cell
and 'a cell =
  | Yield of 'a * 'a t
  | Error of exn
  | End

That could be mixed with an IO monad (I think...?), be used even with backtracking monads, and keeps decent performance (see ulist in the benchmark).

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

@stedolan you are right, I should have resisted the urge to add range :)
The right-bound-excluding version does, I think, satisfy the laws you propose (using abs in the second case), but maybe a step argument is better.

I will fix the PR to remove range anyway, and maybe add them back in a "enrich list/array" proposal.

@dbuenzli
Copy link
Contributor

You say that efficiency was not the reason, yet you write "inefficient" everywhere

Thanks for this pointless comment @Drup, so if I need to clarify this: efficiency was not my main reason.

I find it a little bit sad that no one is addressing my main point which is that these things are not terribly useful if they are not able to interact with IO and errors in a decent way. The Haskell people have been trying to solve that problem for a few years now in many different approaches and papers. I don't expect to solve it on a gh pull request discussion --- especially since the potential introduction of effects with multicore changes the whole game again.

I hope my points have been heard by the people who are in charge of merging, so I'll stop discussing this.

@stedolan
Copy link
Contributor

The right-bound-excluding version does, I think, satisfy the laws you propose

It does if you remove the feature that produces descending sequences when the end point is before the start point. Otherwise, you get this:

concat (range 4 2) (range 2 5) <> range 4 5

I'd prefer the left-hand-side to be an error.

If you have the time, I'd love to see a range PR (ideally, one also including List.range). Apart from this sort of bikeshedding, I think it would be an uncontroversial and useful addition (I've implemented List.range myself more times than I'd like to admit).

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

@stedolan indeed, will do (problably with a step argument). Thanks for the feedback!

To summarize why I don't think having one single iterator for all use-cases is a good idea:

  • in memory iterators, as used in rust or F#, are an improvement on fold/iter because you can manipulate an iterator as a value. For instance, you can map, flat_map, filter, etc. on top of an iterator before transferring it into another data structure. See how they are used with great success in rust (with exactly the same type as I propose). However, do be useful, they need to be very fast and lightweight.
  • for IO, you need to handle errors, to handle resources (once you stop using the iterator), to handle concurrency (concurrent push/pull to the iterator; in practice you need to wrap next into a monad). In haskell they seem to have several solutions, all of which have issues (and none of them tries to replace lazy lists, which are the Haskell equivalent of in-memory iterators!). In OCaml we have Async.Pipe and Lwt_stream for this kind of use-case. They would be orders of magnitudes slower than gen for computations not involving IOs.

That is why I suggest this change: F# and rust use it a lot (it works wonders in F# with |>, by the way, and lots of my code uses sequence in the same way). The part about multicore effects is totally unrelated (if anything, it would make gen even better by making python-style generators possible without writing the inversion of control yourself).

edit: rust seems to use the equivalent of ('a, _) result gen for streaming with blocking IOs. It should be pretty easy to write functions such as ('a, 'err) result gen -> ('a list, 'err) result or ('a -> 'b) -> ('a, 'err) result gen -> ('b, 'err) result gen for this particular use-case.

@dbuenzli
Copy link
Contributor

The part about multicore effects is totally unrelated

Having one-shot delimited continuations in the language and potentially typed effects for typed error handling is totally related.

@radekm
Copy link

radekm commented Jun 26, 2016

@c-cube Note that IEnumerator<'T> from F# implements IDisposable - i.e. it handles resources.

edit: For reference IEnumerator<T> from .NET Framework has 4 members:

// Gets the element in the collection at the current position of the enumerator.
T Current { get; }
// Performs application-defined tasks associated with freeing,
// releasing, or resetting unmanaged resources.
void Dispose();
// Advances the enumerator to the next element of the collection.
bool MoveNext();
// Sets the enumerator to its initial position,
// which is before the first element in the collection.
void Reset();

@stedolan
Copy link
Contributor

See how they are used with great success in rust (with exactly the same type as I propose).

For certain data structures, it is necessary to do some clean-up after iteration has finished. For instance, see flip_ongoing_traversal in hashtbl.ml, which is used to detect modification of a hashtable during iteration.

This is easily handled with the iter / fold-style interfaces, which just do some work after iteration completes, before returning. With a gen-style interface, there is no way to determine when the iteration is over, since the consumer is free to drop the iterator whenever. (The proposed implementation doesn't attempt to detect concurrent modification, and produces bad results if a stale iterator is used).

Rust focuses heavily on lifetime management, which makes it easy to run cleanup code when the iterator dies. Iterator lifetimes are checked at compile time. By contrast, since OCaml does not track lifetimes, an iterator-style API will introduce the same troubles as e.g. Java / C++ have with iterator invalidation.

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

@stedolan indeed, we have to work around OCaml's limitations (there is simply no clean way of managing resources in general). TIL about hashtbl.flip_ongoing_traversal, it's interesting. On the other hand, say, Array.iter doesn't seem to offer any protection against concurrent modification of the array.

I suppose this makes the case for sequence stronger (since it's just iter on steroids).

@gasche
Copy link
Member

gasche commented Jun 26, 2016

While there seems to be no agreement on blessing a specific iteration/streaming interface in the standard library, maybe a compromise could be to at least provide the consumer-controlled iteration functions in the modules where type abstraction does not allow to efficiently implement it outside the library. If this is not done with the intent of standardization, we can relax some of @c-cube constraints (for example use a pure interface instead of an effectful one, which tends to be incompatible with purely structural solutions), and allow external libraries to implement the iteration interface of their choice for stdlib modules as well.

@c-cube
Copy link
Contributor Author

c-cube commented Jun 26, 2016

I suppose I can implement something like

type 'a t
type 'a cursor
val cursor_next : 'a cursor -> ('a * 'a cursor) option
val start_cursor : 'a t -> 'a cursor

for Hashtbl, Set, Map, Queue, Stack and Buffer. Then one can implement its own favorite iterator on top.

I still think OCaml should have a standard iterator type (or several ones, why not). Standardization allows everyone in the community(ies?) to agree on common interfaces. Maybe in 10 years...

@alainfrisch
Copy link
Contributor

It seems a bit weird to me to push forward this imperative API, while there are good functional counterparts (such as UList in the bench). Is speed the main argument in favor of the imperative version?

I'm tempted to believe that when performance really matters, people would not use such high-level abstractions anyway, and would rather do some manual deforestation. Or perhaps they would use some high-level abstraction if flambda can map them to code which is as efficient as the hand-written version. Is it the case with the current proposal? If not, are there reasons to believe that if flambda is ever able to optimize the imperative version, it would not be able to do the same job with a functional version?

@gasche
Copy link
Member

gasche commented Jun 27, 2016

Is speed the main argument in favor of the imperative version?

I think that the main reason is actually to have a fully structural type. At the time of @c-cube design, the idea that a structural type would be easier to adopt by various libraries as it does not require a shared dependency.

Of course if we want to have something in the standard library, a nominal type definition would also be reasonable -- although see the pains in Batteries-land to connect to the stdlib's result datatype with a differently-named second constructor...

@jakemcarthur
Copy link

I was thinking that an always-successful sequence would have a return type like unit and a possibly-failing sequence would have a return type of the form (something, err) result.

@ghost
Copy link

ghost commented Sep 24, 2016

BTW, at least in native code Done () doesn't allocate. The compiler should have no problem compiling it as a shared constant. Funny fact, having all constructors be non-constant can make pattern matching slightly faster (you don't have to do the is_int check and can dispatch immediately on the tag)

@paurkedal
Copy link

I was thinking that an always-successful sequence would have a return type like unit and a possibly-failing sequence would have a return type of the form (something, err) result.

What would the something payload hold?

@jakemcarthur
Copy link

The something could be unit (in which case you could instead choose to use err option, if you want) or it could have some continuation value.

@c-cube
Copy link
Contributor Author

c-cube commented Jan 5, 2017

After ./configure; make world I have the following error that I don't understand:

make[4]: Entering directory '/home/simon/workspace/ocaml/tools'
../boot/ocamlrun ../boot/ocamlc -nostdlib -I ../boot -use-prims ../byterun/primitives -I .. make_opcodes.ml -o make_opcodes
File "make_opcodes.ml", line 1:
Error: Required module `Iter' is unavailable

I thought I put iter.cmo and iter.cmi everywhere list.cmo was…

Side note: I am really getting tired of this PR. I would really appreciate if a maintainer would tell me if this has chances to be merged some day.

@nojb
Copy link
Contributor

nojb commented Jan 5, 2017

I think you may need to bootstrap the compiler.

@c-cube
Copy link
Contributor Author

c-cube commented Jan 5, 2017

That's what I feared. The problem is that make world fails, and I don't know how to bootstrap in this situation.

edit: ok my bad, it seems it was because of a wrong order of .cmo in some makefiles‽

@Drup
Copy link
Contributor

Drup commented Jan 5, 2017

This is just an issue in the order of .cmo: list depends on iter, so iter.cmo must come first.

@shindere
Copy link
Contributor

shindere commented Jan 9, 2017 via email

| Yield of 'a * 's

type +_ t =
| Sequence : 's * ('s -> ('a,'s) step) -> 'a t

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an abstraction leak. Skip is really only intended as an optimization. I think it makes sense to be able to construct a t using step, but not to observe Skips by pattern matching the way this allows. For example, the current interface allows you to count the elements dropped by filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful to be able to write one's own combinators based on the type definition. At the very least, it should be within a (**/**) block so that the various stdlib extensions/replacements can interoperate with it.

type key
type 'a t
val create: int -> 'a t
val create : int -> 'a t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious whitespace additions (and the most common style is rather to drop such spaces).

stdlib/iter.ml Outdated
(* *)
(* Simon Cruanes *)
(* *)
(* Copyright 2001 Institut National de Recherche en Informatique et *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in 2017!

@alainfrisch
Copy link
Contributor

Was some kind of consensus on the type definition reached?

There were proposals to enrich the type (to account for final error conditions) or to simplify it (using the simpler "ulist" definition). I'd like to hear from people making these suggestions how they feel about the current state of the PR.

@c-cube
Copy link
Contributor Author

c-cube commented Jan 9, 2017

Ping @fpottier @jakemcarthur @paurkedal @Drup @gasche about the type definition.

For the record, I am relatively open to the idea of ulist (i.e. a list guarded by unit -> 'a cell closures), but I think the error handling version of the current is too complicated and cumbersome for the benefits it provides (additional type parameter, some things are not constants anymore, etc.).

I will rename the module to Seq once nothing else is blocking.

@yminsky
Copy link

yminsky commented Jan 10, 2017

I'm really not sure that this patch is the right approach in principle. In particular, the question of how to construct an appropriate sequence type is still in flux. As such, making this a concrete (rather than abstract) type takes away future freedom in optimizing the implementation. Given that Flambda is still early in its life, I think we don't yet fully know what is the ideal shape of this type, and it seems like a mistake to tie ourselves to a concrete representation now.

Is there a compelling reason not to make the type abstract? Base's sequence type doesn't expose the concrete representation, though it does expose a Step type to base the various Unfold operations on.

https://github.com/janestreet/base/blob/master/src/sequence.mli

@alainfrisch
Copy link
Contributor

Base's sequence type doesn't expose the concrete representation, though it does expose a Step type to base the various Unfold operations on.

So in Base, values of type Step.t are produced by client code, but never inspected (it seems so, according to the interface). Is there a reason to expose its definition instead of just constructor functions?

@fpottier
Copy link
Contributor

To answer @c-cube and @yminsky.

I think Yaron's concern about exposing a concrete type is fair: exposing a badly-chosen concrete type now could become a problem in the future.

That said, the type "ulist" (the concrete type of delayed lists) is somewhat special in that, as far as consumers are concerned, it is just as good as an abstract type: it exposes only the fact that one can query a sequence, producing either an end-of-sequence or one element and a remainder. So, I feel that it is OK to expose this type for use by consumers.

To a producer, we could still offer a choice between 1- constructing ulists directly and 2- going through more elaborate sequence construction APIs. One such API could involve another data type of sequences, with three constructors Nil/Cons/Skip, and a conversion function from this data type to ulist.

@c-cube
Copy link
Contributor Author

c-cube commented Jan 10, 2017

That does it, I'm closing this PR and opening one with ulist. Apparently JST might be moving away from this type, that no one else uses currently, so there goes interop. More, having an abstract type here would prevent 3rd party libs from exposing efficient combinators.

@fpottier we should discuss your proven-correct lib some day, if you will :-)

@c-cube c-cube closed this Jan 10, 2017
@alainfrisch
Copy link
Contributor

Thanks @c-cube for your endless efforts. Many people hope to see something integrated, but the actual design choices are not straightforward, as the discussion shows. I would suggest to start with a PR which adds only the core definitions (without extending all data structures) to try reaching a consensus on those definitions. Even "delayed lists" can be implemented in different ways (functions/lazy, using option or a custom definition, etc).

@fpottier
Copy link
Contributor

@c-cube

I'd be happy to discuss. At the moment I have a library of "cascade" (= delayed list) combinators, but it is not verified yet. The only function that I have verified is HashTable.cascade, which constructs a sequence of the key-value pairs in a hash table. I will talk about it at CPP in Paris next week.

camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017
EduardoRFS pushed a commit to esy-ocaml/ocaml that referenced this pull request Dec 17, 2021
stedolan pushed a commit to stedolan/ocaml that referenced this pull request May 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.