
aiohttp/http_parser.py: pure-python fix for big chunk decoding (#1899)#1900

Closed
Anvil wants to merge 4 commits intoaio-libs:masterfrom
Anvil:big-chunk-decoding

Conversation

@Anvil
Contributor

@Anvil Anvil commented May 17, 2017

What do these changes do?

It should be a pure-Python fix for #1899.
The .pyx implementation is not done; I'm not sure I'll be able to do that.

Are there changes in behavior for the user?

response.content.iter_any() will not yield partial chunks

Related issue number

#1899

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new entry to CHANGES.rst
    • Choose any open position to avoid merge conflicts with other PRs.
    • Add a link to the issue you are fixing (if any) using #issue_number format at the end of changelog message. Use Pull Request number if there are no issues for PR or PR covers the issue only partially.

@Anvil Anvil force-pushed the big-chunk-decoding branch from 5a1bdef to 48a741a on May 17, 2017 14:27
@Anvil
Contributor Author

Anvil commented May 17, 2017

I can rebase on 2.0 if required.

@fafhrd91
Member

could you add a test?

@Anvil
Contributor Author

Anvil commented May 18, 2017

I'm trying...

@Anvil Anvil force-pushed the big-chunk-decoding branch from 48a741a to 535003b on May 18, 2017 09:47
@codecov-io

codecov-io commented May 18, 2017

Codecov Report

Merging #1900 into master will decrease coverage by 0.01%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##           master    #1900      +/-   ##
==========================================
- Coverage   97.06%   97.05%   -0.02%     
==========================================
  Files          37       37              
  Lines        7560     7560              
  Branches     1314     1314              
==========================================
- Hits         7338     7337       -1     
- Misses        100      101       +1     
  Partials      122      122
Impacted Files Coverage Δ
aiohttp/http_parser.py 96.89% <71.42%> (-0.26%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04dfcf8...535003b. Read the comment docs.

@Anvil Anvil force-pushed the big-chunk-decoding branch from 535003b to fa81efa on May 18, 2017 10:13
@Anvil
Contributor Author

Anvil commented May 18, 2017

OK, this is the best I could come up with. I know it does the job; I don't know if it's minimal enough.

@Anvil
Contributor Author

Anvil commented May 18, 2017

The test failure is the C parser failing, not the Python one.

@fafhrd91
Member

I don't think this is a good design choice. This will load the whole chunk into memory before processing, so it would be very easy to flood server memory.

@asvetlov @kxepal comments?

@kxepal
Member

kxepal commented May 19, 2017

I agree with the concern about the memory footprint. And I guess this feature should be done for both parsers anyway, in order to preserve similar behaviour.

As for the problem itself, I think it is the server that should be fixed so it does not send overly big chunks. Otherwise, there could be some magic constant that the user explicitly changes in order to receive big chunks. But it is certainly not a good idea to support those by default, for everyone, without any way to disable it.

@asvetlov
Member

I don't think this is a bug at all.
Chunked encoding is just a transport protocol; semantically the body is a sequence of bytes without boundaries, just like a TCP socket is semantically a stream of bytes even though the content is transferred in packets.

From my perspective partial chunks are totally correct.
Thoughts?

@asvetlov
Member

Moreover, a proxy could change the transfer encoding; it could even compress or decompress the body.
End-user code should not rely on it.

@fafhrd91
Member

I agree, developers should not rely on transfer-encoding.

@asvetlov
Member

@Anvil do you have a strong reason for supporting this behavior?
I'm inclined to close the issue as 'not a bug'.

@Anvil
Contributor Author

Anvil commented May 21, 2017

(Ok for reference and reminder, chunk encoding is
<hex size>\r\n
<content of given size>\r\n
[repeat size + content scheme]
0\r\n
\r\n
EOT
)
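As an illustrative sketch of that wire format (not aiohttp's actual parser code), a minimal decoder could look like this:

```python
def decode_chunked(raw: bytes) -> list[bytes]:
    """Decode an HTTP chunked-encoded body into its chunks.

    Sketch only: each chunk is "<hex size>\\r\\n<payload>\\r\\n",
    and a zero-sized chunk ("0\\r\\n\\r\\n") terminates the body.
    Chunk extensions and trailers are not handled here.
    """
    chunks = []
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)   # chunk size line is hexadecimal
        if size == 0:                  # terminating "0\r\n\r\n"
            break
        start = eol + 2
        chunks.append(raw[start:start + size])
        pos = start + size + 2         # skip payload and its trailing \r\n
    return chunks
```

This is what the PR aims for at the stream level: the reader sees whole chunks, never a slice of one.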

I think you cannot "fix" servers by telling them "don't send big chunks". No offense, but the choice is yours to make.

Now, about the bug itself. Because it is a bug. This is not about big chunks, it is about partial chunks.

The thing is, the aiohttp client/parser detects that the content is chunk-encoded (through the HTTP headers) and starts interpreting the data, stripping the size information from the stream. When two chunks are received at once, aiohttp knows how many bytes to expect, nicely performs the split, and content.iter_any() yields the first chunk and then the second.

Seeing that, we assumed it was aiohttp's job to handle all cases and return us chunks, not raw data. That sounded like a very nice feature.

It turned out that when a chunk is only partially received, iter_any yields the partial content, and I think this defeats the purpose of the generator.
When iter_any yields garbage instead of a chunk, not only are we blind because we don't know the size of the chunk (aiohttp keeps that information to itself), but if that can happen once, how can I trust any value yielded by iter_any to be consistent?
Since we no longer have access to the size information, the only approach left is "make an educated guess, dude".

This is not exactly reliable behavior.

IMHO, this patch brings consistency back, at the cost of a potentially big memory footprint, yes. My opinion is that if someone does not want the whole chunk in memory, they should disable chunk parsing in aiohttp and implement their own way of handling chunks.

Also, if I should not rely on transfer-encoding, then, with all due respect, neither should you, nor the aiohttp client.

(If my opinion matters on the subject: yes, both parsers should implement this.)

Regards.

(Edit: fixed <> escaping for markdown)

@fafhrd91
Member

I wouldn't mind adding this feature if it were possible to limit it to one use case, but this would affect everyone. In your case it is probably OK to load everything into memory, but I am sure that for most users it would be a very big surprise. Discovering that anyone can kill your service without much effort would also not be a very pleasant experience with the framework.

@hubo1016
Contributor

What about adding a max-read-size parameter like the ones in the read() functions? We could set it to something like 1MB by default; if the chunk size exceeds the limit, a partial result is returned. And if the user sets it to None, it means they are willing to accept the risk. That's the same as open("xxx", "r").read().
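A minimal sketch of that proposal, with hypothetical names (`iter_bounded`, `max_read_size`) that are not part of aiohttp's API:

```python
def iter_bounded(parts, max_read_size=1 << 20):
    """Yield whole chunks while they fit in max_read_size (default ~1MB).

    `parts` is an iterable of (data, is_chunk_end) pairs, as a chunked
    parser might produce them.  A complete chunk is buffered and yielded
    intact; once buffering more would exceed the limit, the partial data
    is flushed instead, bounding memory use.  With max_read_size=None the
    caller accepts unbounded buffering, like open("xxx", "r").read().
    """
    buf = bytearray()
    for data, is_chunk_end in parts:
        if max_read_size is not None and len(buf) + len(data) > max_read_size:
            yield bytes(buf)      # flush a partial chunk: limit reached
            buf.clear()
        buf.extend(data)
        if is_chunk_end:
            yield bytes(buf)      # a complete chunk that fit the limit
            buf.clear()
```

Callers who keep the default get bounded memory with occasional partial yields; callers who pass None get the whole-chunk guarantee this PR implements.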

@hubo1016
Contributor

I would also like to point out that for usages that yield multiple JSON objects through chunk-encoded data, it is not necessary to depend on the chunk boundaries to split them. JSON objects (data surrounded by {}) can be split by a state machine, e.g. one counting the braces in a smart way; this is used in the JSON-RPC protocol.
Actually, many implementations just return JSON objects on separate lines, and using readline() is enough. We usually consider chunked encoding an implementation detail of the HTTP protocol, not an attribute of the content; if the HTTP traffic goes through a proxy, the proxy may break the chunk boundaries, because the RFC does not say no.
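A sketch of such a brace-counting splitter (illustrative only, not part of any library): braces inside strings, and escaped quotes, must be ignored so the split never depends on chunk boundaries.

```python
def split_json_objects(stream: bytes) -> list[bytes]:
    """Split a byte stream of concatenated top-level JSON objects.

    Tracks string/escape state so '{' and '}' inside string literals
    are not counted; an object ends when brace depth returns to zero.
    """
    objs, depth, start = [], 0, 0
    in_string = escape = False
    for i, byte in enumerate(stream):
        ch = chr(byte)
        if escape:
            escape = False             # char after backslash: skip it
        elif in_string:
            if ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:             # top-level object complete
                objs.append(stream[start:i + 1])
                start = i + 1
    return objs
```

Feeding it arbitrary slices of the stream and re-running on the accumulated buffer gives chunk-boundary-independent framing, which is the point of the comment above.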

@asvetlov
Member

I could live with chunk splitting by default, plus an opt-in flag for brave souls who have unlimited memory and don't care about attacks.
Personally I don't feel a strong motivation to provide a fix, but I will review/merge if somebody creates a PR (both the Python and Cython parts should be fixed at once).

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel there are important points made in this discussion,
please include those excerpts in that new issue.

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019


Development

Successfully merging this pull request may close these issues.

Big chunk decoding makes content.iter_any() iterate over partial content

6 participants