
aiohttp/http_parser.py: pure-python fix for big chunk decoding (#1899)#1900

Closed
Anvil wants to merge 4 commits intoaio-libs:masterfrom
Anvil:big-chunk-decoding

Conversation

@Anvil
Contributor

@Anvil Anvil commented May 17, 2017

What do these changes do?

It should be a pure-Python fix for #1899.
The .pyx implementation is not done; I'm not sure I'll be able to do that.

Are there changes in behavior for the user?

response.content.iter_any() will not yield partial chunks

Related issue number

#1899

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new entry to CHANGES.rst
    • Choose any open position to avoid merge conflicts with other PRs.
    • Add a link to the issue you are fixing (if any) using #issue_number format at the end of changelog message. Use Pull Request number if there are no issues for PR or PR covers the issue only partially.

@Anvil Anvil force-pushed the big-chunk-decoding branch from 5a1bdef to 48a741a on May 17, 2017 14:27
@Anvil
Contributor Author

Anvil commented May 17, 2017

I can rebase on 2.0 if required.

@fafhrd91
Member

could you add a test?

@Anvil
Contributor Author

Anvil commented May 18, 2017

I'm trying...

@Anvil Anvil force-pushed the big-chunk-decoding branch from 48a741a to 535003b on May 18, 2017 09:47
@codecov-io

codecov-io commented May 18, 2017

Codecov Report

Merging #1900 into master will decrease coverage by 0.01%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##           master    #1900      +/-   ##
==========================================
- Coverage   97.06%   97.05%   -0.02%     
==========================================
  Files          37       37              
  Lines        7560     7560              
  Branches     1314     1314              
==========================================
- Hits         7338     7337       -1     
- Misses        100      101       +1     
  Partials      122      122
Impacted Files Coverage Δ
aiohttp/http_parser.py 96.89% <71.42%> (-0.26%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04dfcf8...535003b. Read the comment docs.

@Anvil Anvil force-pushed the big-chunk-decoding branch from 535003b to fa81efa on May 18, 2017 10:13
@Anvil
Contributor Author

Anvil commented May 18, 2017

OK, this is the best I could come up with. I know it does the job; I don't know if it's minimal enough.

@Anvil
Contributor Author

Anvil commented May 18, 2017

The test failure is the C parser failing, not the Python one.

@fafhrd91
Member

I don't think this is a good design choice. This will load the whole chunk into memory before processing, so it would be very easy to flood server memory.

@asvetlov @kxepal comments?

@kxepal
Member

kxepal commented May 19, 2017

I agree with the concern about the memory footprint. And I guess this feature should be done for both parsers anyway, in order to preserve similar behaviour.

As for the problem itself, I think it is the server that should be fixed so it does not send overly big chunks. Otherwise, there could be some magic constant that the user explicitly changes in order to receive big chunks. But it is certainly not a good idea to support those by default, for everyone, without any way to disable it.

@asvetlov
Member

I don't think this is a bug at all.
Chunked encoding is just a transport protocol; semantically the body is a sequence of bytes without boundaries, just like a TCP socket is semantically a stream of bytes even though the content is transferred in packets.

From my perspective partial chunks are totally correct.
Thoughts?

@asvetlov
Member

Moreover, a proxy could change the transfer encoding; it could even compress or decompress the body.
End-user code should not rely on it.

@fafhrd91
Member

I agree, developers should not rely on transfer-encoding.

@asvetlov
Member

@Anvil do you have a strong reason for supporting this behavior?
I'm inclined to close the issue as 'not a bug'.

@Anvil
Contributor Author

Anvil commented May 21, 2017

(Ok for reference and reminder, chunk encoding is
<hex size>\r\n
<content of given size>\r\n
[repeat size + content scheme]
0\r\n
\r\n
EOT
)
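As an illustrative sketch of that wire format (not aiohttp's actual parser code), a minimal decoder could look like this:

```python
def decode_chunked(raw: bytes) -> list[bytes]:
    """Decode an HTTP chunked-encoded body into its chunks.

    Sketch only: each chunk is "<hex size>\\r\\n<payload>\\r\\n",
    and a zero-sized chunk ("0\\r\\n\\r\\n") terminates the body.
    Chunk extensions and trailers are not handled here.
    """
    chunks = []
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)   # chunk size line is hexadecimal
        if size == 0:                  # terminating "0\r\n\r\n"
            break
        start = eol + 2
        chunks.append(raw[start:start + size])
        pos = start + size + 2         # skip payload and its trailing \r\n
    return chunks
```

This is what the PR aims for at the stream level: the reader sees whole chunks, never a slice of one.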

I think you cannot "fix" servers by telling them "don't send big chunks". No offense, but the choice is yours to make.

Now, about the bug itself. Because it is a bug. This is not about big chunks, it is about partial chunks.

The thing is, the aiohttp client/parser detects that the content is chunk-encoded (through the HTTP headers) and starts interpreting the data, stripping the size information from the stream. When two chunks are received at once, aiohttp knows how many bytes to expect, nicely performs the split, and content.iter_any() yields the first chunk and then the second.

Seeing that, we assumed it was aiohttp's job to handle all cases and return us chunks, not raw data. That sounded like a very nice feature.

It turned out that when a chunk is only partially received, iter_any yields the partial content, and I think this defeats the purpose of the generator.
When iter_any yields garbage instead of a chunk, not only are we blind because we don't know the size of the chunk (aiohttp keeps that information to itself), but if that can happen once, how can I trust any value yielded by iter_any to be consistent?
Since we no longer have access to the size information, the only approach left is "make an educated guess, dude".

This is not exactly reliable behavior.

IMHO, this patch brings consistency back, at the cost of a potentially big memory footprint, yes. My opinion is that if someone does not want the whole chunk in memory, they should disable chunk parsing in aiohttp and implement their own way of handling chunks.

Also, if I should not rely on transfer-encoding, then, with all due respect, neither should you, nor the aiohttp client.

(If my opinion matters on the subject: yes, both parsers should implement this.)

Regards.

(Edit: fixed <> escaping for markdown)

@fafhrd91
Member

I wouldn't mind adding this feature if it were possible to limit it to one use case, but this would affect everyone. In your case it is probably OK to load everything into memory, but I am sure that for most users it would be a very big surprise. Discovering that anyone can kill your service without much effort would also not be a very pleasant experience with the framework.

@hubo1016
Contributor

What about adding a max-read-size parameter like the ones in the read() functions? We could set it to something like 1MB by default; if the chunk size exceeds the limit, a partial result is returned. And if the user sets it to None, it means they are willing to accept the risk. That's the same as open("xxx", "r").read().
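A minimal sketch of that proposal, with hypothetical names (`iter_bounded`, `max_read_size`) that are not part of aiohttp's API:

```python
def iter_bounded(parts, max_read_size=1 << 20):
    """Yield whole chunks while they fit in max_read_size (default ~1MB).

    `parts` is an iterable of (data, is_chunk_end) pairs, as a chunked
    parser might produce them.  A complete chunk is buffered and yielded
    intact; once buffering more would exceed the limit, the partial data
    is flushed instead, bounding memory use.  With max_read_size=None the
    caller accepts unbounded buffering, like open("xxx", "r").read().
    """
    buf = bytearray()
    for data, is_chunk_end in parts:
        if max_read_size is not None and len(buf) + len(data) > max_read_size:
            yield bytes(buf)      # flush a partial chunk: limit reached
            buf.clear()
        buf.extend(data)
        if is_chunk_end:
            yield bytes(buf)      # a complete chunk that fit the limit
            buf.clear()
```

Callers who keep the default get bounded memory with occasional partial yields; callers who pass None get the whole-chunk guarantee this PR implements.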

@hubo1016
Contributor

I would also like to point out that for usages that yield multiple JSON objects through chunk-encoded data, it is not necessary to depend on the chunk boundaries to split them. JSON objects (data surrounded by {}) can be split by a state machine, e.g. one counting the braces in a smart way; this is used in the JSON-RPC protocol.
Actually, many implementations just return JSON objects on separate lines, and using readline() is enough. We usually consider chunked encoding an implementation detail of the HTTP protocol, not an attribute of the content; if the HTTP traffic goes through a proxy, the proxy may break the chunk boundaries, because the RFC does not say no.
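A sketch of such a brace-counting splitter (illustrative only, not part of any library): braces inside strings, and escaped quotes, must be ignored so the split never depends on chunk boundaries.

```python
def split_json_objects(stream: bytes) -> list[bytes]:
    """Split a byte stream of concatenated top-level JSON objects.

    Tracks string/escape state so '{' and '}' inside string literals
    are not counted; an object ends when brace depth returns to zero.
    """
    objs, depth, start = [], 0, 0
    in_string = escape = False
    for i, byte in enumerate(stream):
        ch = chr(byte)
        if escape:
            escape = False             # char after backslash: skip it
        elif in_string:
            if ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:             # top-level object complete
                objs.append(stream[start:i + 1])
                start = i + 1
    return objs
```

Feeding it arbitrary slices of the stream and re-running on the accumulated buffer gives chunk-boundary-independent framing, which is the point of the comment above.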

@asvetlov
Member

I could live with chunk splitting by default, plus an opt-in flag for brave souls who have unlimited memory and don't care about attacks.
Personally I don't feel a strong motivation to provide a fix, but I will review/merge if somebody creates a PR (both the Python and Cython parts should be fixed at once).

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel there are important points made in this discussion,
please include those excerpts in that new issue.

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019


Development

Successfully merging this pull request may close these issues.

Big chunk decoding makes content.iter_any() iterate over partial content

6 participants