add iterators to most containers in stdlib#635
Conversation
|
If someone reading this discussion wants to understand why The short story is that with a |
|
To complement @gasche's point, I have been using those two style of iterators for a while now, I think they are very useful and covers all my usecases. Having the various |
| b | ||
|
|
||
| let of_list l = | ||
| let b = create 32 in |
I think it is possible and that it is better to keep the same signature for hashtable and weak hashtable. If it is not clear how to do it from the current code I could look at it. |
I'd just like to point out that the reason why we did not reach an agreement on my part was absolutely not because of this. The reason is that none of the proposed solutions had a compelling story about error reporting --- needed both at the consumer and producer end point --- and error recovery, i.e. be able to restart the enumeration once an error was hit at the producer point. Both features are very important for writing real world streaming codecs. I think that the proposed interface is 1) Conceptually ugly as it favours imperative programming as can be witnessed by the implementations provided in the current PR 2) Very inefficient as it boxes all enumerated values and 3) As mentioned above, has no story for error reporting/recovery beyond reusing the exceptional mecanism that we all eventually learned should not be used for non exceptional error conditions. We definitively need a good streaming processing framework for OCaml but I'm afraid this is not it and I think this is more likely to add noise in the system and trick people into using inadapted and inefficient abstractions. People willing to use this interface can perfectly use the |
|
Regarding testing, I think the tests should go in the |
|
@dbuenzli that was mostly your concern. As far as I remember the biggest issue discussed by most people was performance. I don't think error reporting is important in this case, since for IO people will probably use For IO you need to consider:
None of those are a concern for purely algorithmic iterators. In addition, people can't "just use the |
That's your point of view. I see the later as a generalization and find "purely algorithmic iterators" as you put it to be of limited value in practice. What we need is a stream processing framework that is able to deal in a uniform manner with for example sequences of values decoded from a file or pulled out of a list and added to a data structure or written to a database. What you are proposing here is an inefficient toy of disputable taste whose limited use cases are already better and more efficiently served by other functions already present in the standard library, namely |
|
Also, for the record, I am a bit tired of the way you criticize solutions that do not please you. "inefficient toy of disputable taste" is based on your taste, not on data (or, find something better and more efficient before saying such things. PRs on my benchmark are welcome). |
Taste is based on taste which is exactly what I write here, you need again to learn to read me. However boxing each value you transfer between data structures in an option has nothing to do with taste AFAIK. This won't be the case if you use things like simple generators --- basically
It is not because I have nothing better to propose that I should not be allowed to cast criticisms. Again I find what you propose to be of limited use in practice and hence the addition to be of limited value as it only bloats the library and doesn't solve the problems that need to be solved. |
|
@dbuenzli You say that efficiency was not the reason, yet you write "inefficient" everywhere and you probably didn't even look at the benchmark, since it's pretty clear it's wrong (2 times slower than sequence is good, given the difference in expressivity, and it blows everything else). As for the rest: Please do try to write |
|
There are a number of other functions that seem to have snuck in with the generator proposal, and I think they should have separate discussion. For instance, I do not think that the proposed To this end, I think Similarly, Also: |
|
@gasche @Drup an interesting alternative would be an analog to type 'a t = unit -> 'a cell
and 'a cell =
| Yield of 'a * 'a t
| Error of exn
| EndThat could be mixed with an IO monad (I think...?), be used even with backtracking monads, and keeps decent performance (see |
|
@stedolan you are right, I should have resisted the urge to add I will fix the PR to remove range anyway, and maybe add them back in a "enrich list/array" proposal. |
Thanks for this pointless comment @Drup, so if I need to clarify this: efficiency was not my main reason. I find it a little bit sad that no one is addressing my main point which is that these things are not terribly useful if they are not able to interact with IO and errors in a decent way. The Haskell people have been trying to solve that problem for a few years now in many different approaches and papers. I don't expect to solve it on a gh pull request discussion --- especially since the potential introduction of effects with multicore changes the whole game again. I hope my points have been heard by the people who are in charge of merging, so I'll stop discussing this. |
It does if you remove the feature that produces descending sequences when the end point is before the start point. Otherwise, you get this: I'd prefer the left-hand-side to be an error. If you have the time, I'd love to see a |
|
@stedolan indeed, will do (problably with a To summarize why I don't think having one single iterator for all use-cases is a good idea:
That is why I suggest this change: F# and rust use it a lot (it works wonders in F# with edit: rust seems to use the equivalent of |
Having one-shot delimited continuations in the language and potentially typed effects for typed error handling is totally related. |
|
@c-cube Note that edit: For reference // Gets the element in the collection at the current position of the enumerator.
T Current { get; }
// Performs application-defined tasks associated with freeing,
// releasing, or resetting unmanaged resources.
void Dispose();
// Advances the enumerator to the next element of the collection.
bool MoveNext();
// Sets the enumerator to its initial position,
// which is before the first element in the collection.
void Reset(); |
For certain data structures, it is necessary to do some clean-up after iteration has finished. For instance, see This is easily handled with the Rust focuses heavily on lifetime management, which makes it easy to run cleanup code when the iterator dies. Iterator lifetimes are checked at compile time. By contrast, since OCaml does not track lifetimes, an iterator-style API will introduce the same troubles as e.g. Java / C++ have with iterator invalidation. |
|
@stedolan indeed, we have to work around OCaml's limitations (there is simply no clean way of managing resources in general). TIL about I suppose this makes the case for |
|
While there seems to be no agreement on blessing a specific iteration/streaming interface in the standard library, maybe a compromise could be to at least provide the consumer-controlled iteration functions in the modules where type abstraction does not allow to efficiently implement it outside the library. If this is not done with the intent of standardization, we can relax some of @c-cube constraints (for example use a pure interface instead of an effectful one, which tends to be incompatible with purely structural solutions), and allow external libraries to implement the iteration interface of their choice for stdlib modules as well. |
|
I suppose I can implement something like type 'a t
type 'a cursor
val cursor_next : 'a cursor -> ('a * 'a cursor) option
val start_cursor : 'a t -> 'a cursorfor I still think OCaml should have a standard iterator type (or several ones, why not). Standardization allows everyone in the community(ies?) to agree on common interfaces. Maybe in 10 years... |
|
It seems a bit weird to me to push forward this imperative API, while there are good functional counterparts (such as UList in the bench). Is speed the main argument in favor of the imperative version? I'm tempted to believe that when performance really matters, people would not use such high-level abstractions anyway, and would rather do some manual deforestation. Or perhaps they would use some high-level abstraction if flambda can map them to code which is as efficient as the hand-written version. Is it the case with the current proposal? If not, are there reasons to believe that if flambda is ever able to optimize the imperative version, it would not be able to do the same job with a functional version? |
I think that the main reason is actually to have a fully structural type. At the time of @c-cube design, the idea that a structural type would be easier to adopt by various libraries as it does not require a shared dependency. Of course if we want to have something in the standard library, a nominal type definition would also be reasonable -- although see the pains in Batteries-land to connect to the stdlib's |
|
I was thinking that an always-successful sequence would have a return type like |
|
BTW, at least in native code |
What would the |
|
The |
203d414 to
45b43b8
Compare
|
After
I thought I put Side note: I am really getting tired of this PR. I would really appreciate if a maintainer would tell me if this has chances to be merged some day. |
|
I think you may need to bootstrap the compiler. |
|
That's what I feared. The problem is that edit: ok my bad, it seems it was because of a wrong order of .cmo in some makefiles‽ |
45b43b8 to
2d98f7a
Compare
|
This is just an issue in the order of .cmo: list depends on iter, so iter.cmo must come first. |
|
Simon Cruanes (2017/01/05 13:24 -0800):
That's what I feared. The problem is that `make world` fails, and I
don't know how to bootstrap in this situation.
See the bootstrapping howto which is given as comments in the Makefile.
Incidentally: shouldn't these comments be moved to the HACKING file?
|
| | Yield of 'a * 's | ||
|
|
||
| type +_ t = | ||
| | Sequence : 's * ('s -> ('a,'s) step) -> 'a t |
There was a problem hiding this comment.
This seems like an abstraction leak. Skip is really only intended as an optimization. I think it makes sense to be able to construct a t using step, but not to observe Skips by pattern matching the way this allows. For example, the current interface allows you to count the elements dropped by filter.
There was a problem hiding this comment.
I think it's useful to be able to write one's own combinators based on the type definition. At the very least, it should be within a (**/**) block so that the various stdlib extensions/replacements can interoperate with it.
stdlib/hashtbl.ml
Outdated
| type key | ||
| type 'a t | ||
| val create: int -> 'a t | ||
| val create : int -> 'a t |
There was a problem hiding this comment.
Spurious whitespace additions (and the most common style is rather to drop such spaces).
stdlib/iter.ml
Outdated
| (* *) | ||
| (* Simon Cruanes *) | ||
| (* *) | ||
| (* Copyright 2001 Institut National de Recherche en Informatique et *) |
|
Was some kind of consensus on the type definition reached? There were proposals to enrich the type (to account for final error conditions) or to simplify it (using the simpler "ulist" definition). I'd like to hear from people making these suggestions how they feel about the current state of the PR. |
|
Ping @fpottier @jakemcarthur @paurkedal @Drup @gasche about the type definition. For the record, I am relatively open to the idea of I will rename the module to |
|
I'm really not sure that this patch is the right approach in principle. In particular, the question of how to construct an appropriate sequence type is still in flux. As such, making this a concrete (rather than abstract) type takes away future freedom in optimizing the implementation. Given that Flambda is still early in its life, I think we don't yet fully know what is the ideal shape of this type, and it seems like a mistake to tie ourselves to a concrete representation now. Is there a compelling reason not to make the type abstract? Base's sequence type doesn't expose the concrete representation, though it does expose a Step type to base the various Unfold operations on. |
So in |
|
To answer @c-cube and @yminsky. I think Yaron's concern about exposing a concrete type is fair: exposing a badly-chosen concrete type now could become a problem in the future. That said, the type "ulist" (the concrete type of delayed lists) is somewhat special in that, as far as consumers are concerned, it is just as good as an abstract type: it exposes only the fact that one can query a sequence, producing either an end-of-sequence or one element and a remainder. So, I feel that it is OK to expose this type for use by consumers. To a producer, we could still offer a choice between 1- constructing ulists directly and 2- going through more elaborate sequence construction APIs. One such API could involve another data type of sequences, with three constructors Nil/Cons/Skip, and a conversion function from this data type to ulist. |
|
That does it, I'm closing this PR and opening one with @fpottier we should discuss your proven-correct lib some day, if you will :-) |
|
Thanks @c-cube for your endless efforts. Many people hope to see something integrated, but the actual design choices are not straightforward, as the discussion shows. I would suggest to start with a PR which adds only the core definitions (without extending all data structures) to try reaching a consensus on those definitions. Even "delayed lists" can be implemented in different ways (functions/lazy, using option or a custom definition, etc). |
|
I'd be happy to discuss. At the moment I have a library of "cascade" (= delayed list) combinators, but it is not verified yet. The only function that I have verified is |
Make lib-str domain safe
Following the new guidelines, here is the first of a series of proposals to improve and extend the stdlib.
I think iterators should be available in the stdlib for several of the reasons mentionned in the guidelines:
Map,StackorHashtbl, I cannot implement proper iterators on them;Testing
There are no tests right now, because it's not clear to me how to test those. I already have some tests using qtest, but that relies on 3rd party tools. I think tests using expect/result files are overkill, so we probably should have a small unit/random testing lib in the compiler? I will obviously upgrade this PR once the proper way of testing becomes clear.
The iterator type
I have been involved in discussions with Janestreet, @dbuenzli and others to try to agree on a common iterator type (batteries/extlib have
Enum, JST hasCore.Sequence, I usesequence, etc.). We did not reach an agreement, in parts because of performance considerations related to flambda. I still think the solution I propose,type 'a gen = unit -> 'a option, is the best tradeoff:result/bytescompatibility package);zip,merge,map2and the likes. In contrast,gencan be used for consuming an iterator step by step.edit: the current proposal no longer uses
gen, but instead uses the same type asCore.Sequence, which should make interoperability withCorepossible. It has the advantages of being explicit, purely functional, and likely to be optimized by flambda in the future.https://github.com/c-cube/iterators_bench/blob/master/res_4.03
https://github.com/c-cube/iterators_bench/blob/master/res_4.03_flambda
Other planned proposals:
(as a mere preview ;)
IOin which channel-related functions could go, but they already belong in pervasives, so this ship probably has sailed. Nevertheless, I'd like to add some functions that promote a safe handling of resources, in the vein ofval with_file_in : string -> (in_channel -> 'a) -> 'a(taking care of always closing the file); also, some idiom to iterate on the lines in a channel, read a whole channel into a string, and other basics that get reinvented hundreds of times in the community.