This is a weird/radical idea, but @oremanj's comment here (in response to @Badg) raised some red flags for me:
And the channel interface is nicer than the stream one for incremental processing – `async for chunk in channel:` rather than:

```python
while True:
    chunk = await stream.receive_some(ARBITRARILY_CHOSEN_POWER_OF_TWO)
    if not chunk:
        break
    ...
```
At the conceptual level, the output from a process is exactly a Stream (or ReceiveStream or whatever), and if people are trying to jump through hoops to avoid using our Stream ABC to represent the Stream concept, then that seems like a bad sign!
So, let's at least go through the thought experiment: what if we got rid of Stream and used Channel[bytes] instead?
Basic usability
Remembering which is a "stream" and which is a "channel" is super annoying. Merging them would eliminate this problem. Also annoying: constantly going through the pointless ritual of inventing a made-up buffer size (@oremanj's ARBITRARILY_CHOSEN_POWER_OF_TWO). And writing that while True loop over and over is also annoying. Making it a plain await channel.receive() or async for chunk in channel: would eliminate these annoyances.
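Concretely, here are the two rituals side by side (a sketch – `process()` stands in for whatever you do with each chunk):

```python
RECEIVE_SIZE = 4096  # the arbitrarily chosen power of two

def process(chunk: bytes) -> None:
    ...  # whatever you do with each chunk

async def consume_stream(stream):
    # Stream today: invent a buffer size, loop, detect EOF by hand.
    while True:
        chunk = await stream.receive_some(RECEIVE_SIZE)
        if not chunk:
            break
        process(chunk)

async def consume_channel(channel):
    # Channel[bytes]: iteration handles chunk sizes and EOF for us.
    async for chunk in channel:
        process(chunk)
```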
Conceptual level
For me, the major conceptual difference is that I've thought of `Channel` as inherently preserving object boundaries, i.e., whatever I pass to `send` is what comes out of `receive`. In this way of thinking, a `Stream` is equivalent to a `Channel[single_byte]`, but since handling single bytes individually would be inefficient, it uses batched-send and batched-receive operations. If we do decide to merge `Stream` and `Channel`, then we'd have to change this, and start saying that some `Channel`s don't preserve the 1-to-1 mapping between `send` and `receive`.
I'm not sure how I feel about this. It's certainly doable on a technical level. But conceptually – it feels weird to say that a websocket and a TCP socket are both Channel[bytes], given that one is framed and the other isn't – that's a fundamental difference in their usage. (Right now one is Channel[bytes] and the other is Stream.) It would mean Uint32Framing adaptor doesn't convert a Stream into a Channel[bytes], it converts a Channel[bytes] into another Channel[bytes]. And that a TCP socket and a UDP socket have the same type. Intuitively this feels weird. It seems like this is a distinction you want to expose, and emphasize, on the type level.
An interesting partial counter-example would be h11: an h11.Connection object is essentially a Channel[h11.Event]: you send a sequence of objects like h11.Request, h11.Data, h11.EndOfMessage, and then receive a sequence of similar objects. Sometimes, the objects on the sender and receiver sides match 1-to-1, like Request and EndOfMessage. But sometimes they don't, like Data, which might be arbitrarily rechunked! So if you want to treat h11.Connection as a Channel[h11.Event], it's sort of simultaneously a 1-to-1 Channel and also a re-chunking Channel.
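To make that concrete, here's a rough sketch of wrapping `h11.Connection` as a `Channel[h11.Event]` over a byte stream – `H11Channel` is made up for illustration, the chunk size is arbitrary, and error handling (plus `h11.PAUSED`) is elided:

```python
import h11

class H11Channel:
    """Hypothetical Channel[h11.Event] over a trio-style byte stream."""

    def __init__(self, stream, our_role):
        self._stream = stream
        self._conn = h11.Connection(our_role=our_role)

    async def send(self, event):
        data = self._conn.send(event)
        if data:
            await self._stream.send_all(data)

    async def receive(self):
        while True:
            event = self._conn.next_event()
            if event is h11.NEED_DATA:
                # Refill h11's internal buffer. Note that Data events may
                # come out re-chunked relative to how the peer sent them.
                self._conn.receive_data(await self._stream.receive_some(4096))
            else:
                return event
```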
One possibility is to distinguish them somehow at the type level, but make them more consistent, or identical, in terms of the operations they happen to implement.
In Liskov terms, a 1-to-1/framed `Channel[bytes]` IS-A rechunking/unframed `Channel[bytes]` (actually, they are not Liskov-compatible – see below!) – all it does is add stronger guarantees. So merging them would at least have that going for it. But I don't put a huge amount of weight on that – in practice they're used very differently.
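The substitutability claim, sketched: consumer code written against a rechunking `Channel[bytes]` makes no assumptions about chunk boundaries, so handing it a framed channel (which only adds a guarantee) changes nothing. `copy_all` here is a hypothetical helper:

```python
async def copy_all(src, dst):
    # Works whether chunks are framed messages or arbitrary re-chunked
    # pieces -- this consumer never relies on where the boundaries fall.
    async for chunk in src:
        await dst.send(chunk)
```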
Technical level
Currently we have:
Stream: send_all, wait_send_all_might_not_block, send_eof, receive_some
Channel: send, send_nowait, receive, receive_nowait, clone, iteration
Problematic bits:
- `wait_send_all_might_not_block`: we actually have an issue open about possibly getting rid of this: "Should send_all automatically do wait_send_all_might_not_block after sending the data?" #371. If we do that, it would remove this problem :-)
- `send_eof`: we could add this to a bidirectional `Channel`, it wouldn't be too weird
- `*_nowait`: we've been considering moving these to memory channels specifically, instead of the generic `Channel` interface
- `clone`: we've been considering dropping this (same link as for `*_nowait`)
So all those might get sorted out? And `send_all` and `send` are already basically the same. So that just leaves `async def receive()` versus `async def receive_some(max_nbytes)`. And `max_nbytes` is also the obstacle to having iteration. Basically, this is THE core distinction between the two APIs. So, what do we think about `max_nbytes`?
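To make the end state concrete, a merged bidirectional interface might look roughly like this – a sketch only, assuming #371 removes `wait_send_all_might_not_block` and the `*_nowait`/`clone` methods move off the generic ABC; the name `MergedChannel` is made up:

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

import trio

T = TypeVar("T")

class MergedChannel(ABC, Generic[T]):
    @abstractmethod
    async def send(self, value: T) -> None: ...

    @abstractmethod
    async def send_eof(self) -> None: ...

    @abstractmethod
    async def receive(self) -> T:
        """Return the next value; no caller-supplied max_nbytes."""

    def __aiter__(self) -> "MergedChannel[T]":
        return self

    async def __anext__(self) -> T:
        # Iteration comes for free once receive() takes no arguments.
        try:
            return await self.receive()
        except trio.EndOfChannel:
            raise StopAsyncIteration
```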
Specifying max_nbytes manually all the time is tiresome and annoying, as noted above.
Also, I note that Twisted/Asyncio/libuv always handle max_nbytes internally, and the user just deals with whatever they get.
Most `Stream` users basically want to read everything, and the only thing `max_nbytes` affects is efficiency, not correctness. In practice it's almost always set arbitrarily. I've never seen anyone even benchmark different values, except in extreme cases like trying to transmit multiple gigabytes/second through Python. For `SocketStream`, there's some penalty for setting it too big – Python has to first allocate a `max_nbytes`-sized buffer, then realloc it down to size. And of course if you set it too small then you pay some overhead from doing lots of small recvs instead of one big one. So you want some kind of "not too big, not too little" setting.
For other Stream implementations, this doesn't apply – for example, SSLStream.receive_some forces you to pass max_nbytes, and that controls how many bytes it reads out of its internal decrypted data buffer at any one time, but this has no effect at all on how much data it reads at a time from the underlying socket, when it needs to refill its buffer. That's controlled by the constructor argument SSLStream(max_refill_bytes=...).
There are also cases where there is a "natural" size to return from receive_some. For example:
- In most applications, `SSLStream` might as well return whatever data has already been decrypted and is sitting in memory, instead of spending instructions messing around with buffers.
- A `GunzipStream` might as well return whatever data it got from decompressing the last chunk it read (see the sketch after this list). This can avoid some non-trivial complications: urllib3.response.GzipDecoder is accidentally quadratic, which allows a malicious server to DoS urllib3 clients – urllib3/urllib3#1467.
- When reading from the Windows console, the underlying representation is natively unicode. This means that when we want to read bytes, we have to transcode into utf-8, which in turn means that it may be impossible to read less than 4 bytes at a time (at least without nasty buffering tricks).
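As a sketch of the "natural size" idea, here's a hypothetical `GunzipChannel` whose `receive()` returns whatever one underlying chunk happens to decompress to – the class, its `source` argument, and the refill size are all made up for illustration:

```python
import zlib

class GunzipChannel:
    """Hypothetical Channel[bytes] that decompresses a gzip byte stream.

    `source` is anything with an async receive_some(max_nbytes) method.
    """

    def __init__(self, source):
        self._source = source
        # wbits=16+MAX_WBITS tells zlib to expect gzip framing
        self._decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)

    async def receive(self) -> bytes:
        while True:
            compressed = await self._source.receive_some(16384)
            if not compressed:
                # Underlying EOF: return whatever is left in the
                # decompressor (b"" serves as the end marker here)
                return self._decomp.flush()
            data = self._decomp.decompress(compressed)
            if data:
                # The "natural" chunk: whatever this read decompressed to
                return data
```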
Given that most people don't tune it at all, I bet if we did a bit of benchmarking then we could pick a default `SocketStream` recv size that would work better than 99% of what people currently do. And I guess we'd make this an argument to the `SocketStream` / `SocketChannel` constructor, exactly like how `SSLStream` currently works, so people could override it if they want. This could complicate code where the stream is constructed implicitly, though, like `p = Process(..., stdout=PIPE)` – if you don't want `p.stdout` to use the default `max_nbytes` setting, then how do you specify something different? Some options:
- We could simply set the default and tell everyone to live with it.
- We could add some way to pass this through, like `Process(..., stdout=NewPipe(max_nbytes=...))`.
- We could provide some API to mutate it, like `process.stdout.max_nbytes = new_value`.
- We could tell people with this unusual requirement that they should create their own pipe with whatever settings they want (this functionality is somewhat needed anyway, see "Add support for talking to our stdin/stdout/stderr as streams" #174 and "support for windows named pipes" #824), then pass in one end by hand.
What about cases where correctness does depend on setting max_nbytes? It can never be the case that setting max_nbytes too small affects correctness, because Stream is already free to truncate max_nbytes to some smaller value if it wants to – no guarantees. But, we do make a guarantee that we won't return more than max_nbytes.
That... actually is important for correctness in some cases. For example, from this comment:
4. We should provide a `trio.input`, that's like `builtins.input` but async and routed through our stdio-handling machinery. Probably it just calls `receive_some(1)` a bunch of times until it sees a newline.
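A minimal sketch of that technique – `read_line` is hypothetical; the point is that `receive_some(1)` guarantees we never consume a byte past the newline:

```python
async def read_line(stream) -> bytes:
    # One byte at a time: slow, but never pulls data past the newline
    # out of a stdin that other code might also be reading.
    line = bytearray()
    while True:
        byte = await stream.receive_some(1)
        if not byte or byte == b"\n":
            return bytes(line)
        line += byte
```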
This is why we can't quite think of our current Channel[bytes] as being a sub-interface of Stream – in this one very specific case, Stream genuinely has slightly more functionality.
This is probably a rare occurrence in practice. Most protocols need an internal buffer anyway, so any over-reads just go into the buffer for next time. And sometimes you want to hand-off between protocols, e.g. SSLStream.unwrap, or something like switching from HTTP/1.1 to Websocket... but in those cases we generally don't try to avoid over-reading from the underlying stream. Instead, we just accept that some over-read may have happened, and give it to the user to deal with (example 1, example 2). And in many cases, it's actually impossible to avoid this in any efficient way – e.g. if you have a newline-delimited protocol, then you have no idea where the next line boundary will be, so the only way to avoid over-read is to read one-byte-at-a-time, which is way too inefficient. In theory we could avoid it for TLS (which is length-prefixed), or for other length-prefixed protocols (like Uint32FramedChannel), but it doesn't seem worth it in most cases.
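Here's what that "accept the over-read" pattern typically looks like, sketched – `LineReader` and its `detach` method are made up for illustration; the reader buffers freely, and whoever takes over the stream gets the leftover bytes:

```python
class LineReader:
    """Hypothetical buffered line reader that accepts over-reading."""

    def __init__(self, stream):
        self._stream = stream
        self._buf = bytearray()

    async def read_line(self) -> bytes:
        while b"\n" not in self._buf:
            data = await self._stream.receive_some(4096)
            if not data:
                raise EOFError("stream closed before newline")
            self._buf += data
        line, _, rest = bytes(self._buf).partition(b"\n")
        self._buf = bytearray(rest)
        return line

    def detach(self) -> bytes:
        # Hand any over-read bytes to whatever protocol takes over next.
        leftover, self._buf = bytes(self._buf), bytearray()
        return leftover
```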
The special thing about trio.input is that it's sharing the process's stdin with who-knows-what-else, so we can't coordinate our buffer usage with other users, and are reduced to this kind of stone-age receive_some(1) technique.
Some options here:
- Treat this as a special case for `trio.input`, and implement it using some specialized tools. E.g., make sure that `open_stdin` can safely be called repeatedly within a single program and returns different handles that don't interfere with each other, and then have `trio.input` do `async with trio.open_stdin(max_nbytes=1): ...`. Or provide some low-level `receive_some_from_stdin` function, or something.
- Have some `Channel[bytes]` implementations where `receive` takes an optional `max_nbytes` argument, as a matter of convention.
- Same as previous point, but also formalize this convention as a named sub-interface – though I'm having trouble thinking of a good name! This might help with our problem up above, about wanting some more informative way to describe the type of `Uint32Framing`? But of course proliferating names always has its own cost, especially if the names are awkward. Also, naming the interface creates an interesting challenge: how do you type `StapledChannel`? You want `StapledChannel.receive` to have the same signature as `StapledChannel.receive_channel.receive`, and at runtime this is easy – just use `*args, **kwargs` (see the sketch below). But if we name this sub-interface, then the proper static type for `StapledChannel` depends on the static type of its `ReceiveChannel`. I'm not sure whether giving `StapledChannel` the right static type matters or not.
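Here's a sketch of the runtime-easy part – a hypothetical `StapledChannel` that just forwards whatever arguments its receive half accepts; the static-typing question is exactly what the `*args, **kwargs` hides:

```python
class StapledChannel:
    """Hypothetical: glue a send half and a receive half together."""

    def __init__(self, send_channel, receive_channel):
        self.send_channel = send_channel
        self.receive_channel = receive_channel

    async def send(self, value):
        await self.send_channel.send(value)

    async def receive(self, *args, **kwargs):
        # At runtime this forwards an optional max_nbytes (or anything
        # else) just fine; the static type, though, now depends on the
        # static type of self.receive_channel.
        return await self.receive_channel.receive(*args, **kwargs)
```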