Don't quote boundary parameter for multipart unless necessary#2545
Don't quote boundary parameter for multipart unless necessary#2545asvetlov merged 10 commits intoaio-libs:masterfrom
Conversation
Also accept byte-strings, although only in a limited form due to an architectural restriction. Also properly quote boundaries when they contain `"` or `\` characters. Some servers don't understand quoted boundaries in multipart documents, which is why it should only be used when necessary.
c63ed75 to
1aaf422
Compare
Codecov Report
@@ Coverage Diff @@
## master #2545 +/- ##
==========================================
+ Coverage 97.78% 97.78% +<.01%
==========================================
Files 36 36
Lines 7225 7236 +11
Branches 1260 1262 +2
==========================================
+ Hits 7065 7076 +11
Misses 55 55
Partials 105 105
Continue to review full report at Codecov.
|
aiohttp/helpers.py
Outdated
| # ; any VCHAR, except delimiters | ||
| # VCHAR = %x21-7E | ||
| valid_tchar_regex = br"^[!#$%&'*+\-.^_`|~\w]+$" | ||
| if re.match(valid_tchar_regex, value): |
There was a problem hiding this comment.
Please use compiled regular exception
There was a problem hiding this comment.
Could you explain the rationale for this?
There was a problem hiding this comment.
Compiled regexs are faster, isn't it?
There was a problem hiding this comment.
Yes, but only if they are cached properly so they aren't recompiled every time. Python does some internal caching for regular expressions already, so unless we need a lot of regular expressions, the internal cache should cover most use cases.
In fact, if we don't cache the regex and compile it every time we need it, we actually degrade performance compared to having Python (or the re module) handle caching automatically.
There was a problem hiding this comment.
I propose to cache regexps at module level, like [here](https://github.com/aio-libs/aiohttp/blob/master/aiohttp/helpers.py#L452
re module has 512 slots for LRU cache of compiled regexps.
Thus the program may degrade performance if other code uses non-complied exceptions for solving own tasks. If the total regexps number exceeds 512 -- re will recompile them constantly.
There was a problem hiding this comment.
This function will only be run once for every multipart document sent, so I believe the performance impact will be minimal at bestworst, but we can do it anyway if that's your standard.
There was a problem hiding this comment.
Thanks.
- It is the aiohttp standard.
- The function runs once per multipart document but for high-load site it could clash with other handlers and their functionality.
aiohttp/helpers.py
Outdated
| return value | ||
|
|
||
| invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]" | ||
| if re.match(invalid_qdtext_char_regex, value): |
aiohttp/helpers.py
Outdated
| quoted_value_content = value.replace(b'\\', b'\\\\') # %x5C | ||
| quoted_value_content = value.replace(b'"', b'\\"') # %x22 | ||
|
|
||
| return b'"' + quoted_value_content + b'"' |
There was a problem hiding this comment.
The HTTP spec operates on bytes, which is why I designed this helper function to follow that. The only reason why we need to convert back to a string is because the underlying Payload API requires a string for header values. Or at least I think so. If it doesn't, I'll be glad to remove the conversion back to a unicode string.
I could of course special-case the function for just our situation here, but it should go into multipart.py then.
There was a problem hiding this comment.
Also, we need to use bytes in the actual multipart body, so we should try to convert exactly that onto a potentially quoted parameter value.
aiohttp/multipart.py
Outdated
| except (UnicodeEncodeError, UnicodeDecodeError): | ||
| raise ValueError('boundary should contain ASCII only chars') \ | ||
| from None | ||
| boundary_value = format_parameter_value(self._boundary).decode('ascii') |
There was a problem hiding this comment.
Because it looks strange. We take string boundary, turns it into bytes, and then turn it back to stirng to apply formatting. This wouldn't be a bottleneck for sure, but looks like useless work.
aiohttp/multipart.py
Outdated
| except UnicodeEncodeError: | ||
| raise ValueError('boundary should contains ASCII only chars') | ||
| ctype = 'multipart/{}; boundary="{}"'.format(subtype, boundary) | ||
| if isinstance(boundary, bytes): |
There was a problem hiding this comment.
What is the case for binary boundary input?
There was a problem hiding this comment.
It is true to how it is actually used and defined in the spec. In this particular situations, users don't get more value out of it because we have to convert the boundary back to a unicode string for the Payload header. Otherwise characters above 0x80 could be used inside the boundary.
Ideally, all headers would work on byte strings (only or additionally), but I didn't consider this to be in the scope of this PR.
There was a problem hiding this comment.
*bytes above 0x80 could be used. HTTP is an ascii-readable byte-based protocol.
aiohttp/multipart.py
Outdated
| if isinstance(boundary, bytes): | ||
| boundary.decode('ascii') | ||
| self._boundary = boundary | ||
| else: |
There was a problem hiding this comment.
If this if-else still exists, would be good to move default boundary generation here into another branch. No point to do isinstance check if boundary passed as None since we know that it will not be a bytes.
There was a problem hiding this comment.
It would complicate (or at least worsen code beauty) of the next part when we try to catch encoding errors.
aiohttp/multipart.py
Outdated
| self._boundary = boundary | ||
| else: | ||
| self._boundary = boundary.encode('ascii') | ||
| except (UnicodeEncodeError, UnicodeDecodeError): |
There was a problem hiding this comment.
I would recommend to minimize exception catch scope.
|
Regarding the header stuff, I think I can implement byte string headers and automatic conversion using Regarding the motivation of that suggested change: Also, what about my |
|
Aiohttp treats header value as an utf-8. I think a lot of servers does the same nowadays. |
|
Pseudonymin in |
|
Yes, historically aiohttp uses utf-8 for headers with a backdoor for getting bytes for communication with malformed client/server, maybe we need a hooks for sending custom bytes too. Anyway for me there is no reason for switching to byte-based headers, especially for boundary -- the value perfectly fits in ascii, isn't it? |
|
No, technically bytes 0x80 to 0xff are allowed as well. When quoted, that is. It's the obs-text (or char?) rule in the abnf I included in the issue and in the comment for the helper function. (I'm on mobile, otherwise I'd quote it myself.) Again, if you don't feel good about the weird conversion between bytes and unicode, I can remove it and it won't be backwards incompatible. It wouldn't allow any bytes higher than 0x7f, but the current implementation doesn't as well and it's perfectly backwards compatible. It just bothers me in a way that the specification isn't respected entirely. |
|
@FichteFoll don't get me wrong.
We can return to discussion if somebody will complain aiohttp in not supporting non-ascii boundaries. |
|
I agree with this sentiment. Will adjust the PR tomorrow. Do you want me to squash? Otherwise I'd just commit on top of this so we can keep the other commit in case this ever comes up. It might end up being useful, maybe. |
|
Just merge master and add required changes. |
aiohttp/helpers.py
Outdated
|
|
||
| invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]" | ||
| if re.match(invalid_qdtext_char_regex, value): | ||
| raise ValueError("parameter value contains invalid characters") |
There was a problem hiding this comment.
The line is not covered by tests
Also - fix an error when escaping backslashes in the quoted parameter, - cache compiled regular expressions (by aiohttp standards), and - cover all lines in tests.
There is little use in having this property public. If a use case arises, it could easily be made public in the future.
|
@kxepal please review again |
kxepal
left a comment
There was a problem hiding this comment.
LGFM, but there are two bits remains that concerns me.
aiohttp/multipart.py
Outdated
| if re.match(self._valid_tchar_regex, value): | ||
| return value.decode('ascii') # cannot fail | ||
|
|
||
| if re.match(self._invalid_qdtext_char_regex, value): |
There was a problem hiding this comment.
I think you mean re.search. Because match will not match anything if string starts with valid chars.
There was a problem hiding this comment.
Good call. Fixed that and another issue with the previous regex not considering that $ also matches before a newline.
tests/test_multipart.py
Outdated
| with self.assertRaises(ValueError): | ||
| aiohttp.multipart.MultipartWriter(boundary='тест') | ||
| with self.assertRaises(ValueError): | ||
| aiohttp.multipart.MultipartWriter(boundary='\n') |
|
@FichteFoll, Thanks! @asvetlov? |
|
|
||
| .. attribute:: boundary | ||
|
|
||
| The byte-string representation of the boundary. |
There was a problem hiding this comment.
The type bothers me.
We pass optional str value into constructor but the property is byteish.
Maybe the method should return self._boundary.decode('ascii')?
It is a breaking change but we could do it in aiohttp 3.0.
Amount of affected user code is very low IMHO.
There was a problem hiding this comment.
This behavior is unchanged. I just documented it.
There was a problem hiding this comment.
Yes, I see.
But maybe now is the time for behavior changing?
Current one is confusing a little.
There was a problem hiding this comment.
Well, imo everything should be saved as byte-strings anyway (as I explained before), so having the boundary as a byte-string doesn't seem far off. It's only used that way, too, so either we'd add an additional property that converts the boundary from a unicode to a byte string, or we encode it at every point we need it (not a fan).
I'd say we leave this for now. I would rather spend time to convert all header strings to byte-strings rather than unwrapping the boundary here.
There was a problem hiding this comment.
I fear it was bytes all the time. Let's change that for 3.0! This could be done with separate PR with breaking change notice.
There was a problem hiding this comment.
Ok, let's merge.
Technically single Pull Request can have several records in CHANGES/ but I'm ok with separate PR.
There was a problem hiding this comment.
@FichteFoll I have a feeling that all headers should be either strings or bytes.
All means really all, including request.headers etc.
This change in too intrusive, the ship has sailed many years ago.
Also I think 99% aiohttp users are happy with string headers but byte header values will confuse them.
There was a problem hiding this comment.
I agree that this should ideally be standardized on a single header type. For compatibility, all unicode strings could be converted to byte-strings using ASCII encoding, so that the outgoing API doesn't change much except that it allows byte-strings for all headers too.
However, this goes way beyond the scope of this PR and would be a large architectural consideration that I'm sure you're going to have a much better time deciding on. It might be something worth considering for the next major release, i.e. 3.0, as long as someone puts in the work.
|
@FichteFoll thanks! |
|
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs. |
What do these changes do?
Only quote multipart boundary when necessary.
Also accept byte-strings, although only in a limited form due to anarchitectural restriction.
Also properly quote boundaries when they contain
"or\characters.Are there changes in behavior for the user?
Boundaries provided by the user, which contain
"or\characters, are properly escaped and invalid characters such as\ncause an exception.Related issue number
Fixes #2544.
Checklist
CONTRIBUTORS.txtCHANGESfolder<issue_id>.<type>for example (588.bug)issue_idchange it to the pr id after creating the pr.feature: Signifying a new feature..bugfix: Signifying a bug fix..doc: Signifying a documentation improvement..removal: Signifying a deprecation or removal of public API..misc: A ticket has been closed, but it is not of interest to users.Is it okay for me to add myself to
CONTRIBUTORS.txtwith my pseudonym? I saw nobody else doing it, which is why I didn't for now.