gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477
gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477bitdancer merged 9 commits intopython:mainfrom
Conversation
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
|
@bitdancer, as the email expert, do you want to review this? |
|
Yes, I'll add it to my list. It will take me quite some time to get through the backlog, unfortunately. |
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
When parsing email messages from Unicode strings (but not bytes), get_local_part() recorded a NonASCIILocalPartDefect for non-ASCII characters. RFC 5322 permits such addresses. This change: - removes the parse-time detection for a non-ASCII local-part (and a related test) - adds tests for passing a non-ASCII addr_spec to email.headerregistry.Address.__init__() - marks the (undocumented) email.errors.NonASCIILocalPartDefect as unused and deprecated This affected parsing email messages from Unicode strings (but not from bytes), and also prevented
|
I have made the requested changes; please review again |
|
Thanks for making the requested changes! @bitdancer: please review the changes made to this pull request. |
|
This PR is stale because it has been open for 30 days with no activity. |
|
@github-actions that doesn't seem like a good way to encourage contributions from volunteers. |
|
It did remind me that I hadn't reviewed the PR, which I guess is the point? I've been ignoring a lot of these pings on issues I'm not ready to deal with yet, though, so you also have a point ;) Now, unfortunately (or fortunately, depending on how you look at it) in the time since you made the change I have discovered, while doing work on the parser, that there actually is a place where the non-ascii local part defect is generated and is appropriate during non-utf8 processing: it happens when the local part is an encoded word that decodes to something containing non-ascii. That's not RFC compliant, of course, but that's why it's a defect ;) I'm not quite sure where to go with this. On the one hand that's clearly a defect, on the other hand if we took a message like that and refolded it for sending via SMTPUTF8, it would be valid, just as it would be valid if it came in through an Address. So I guess I'm leaning toward still getting rid of the defect, and making sure there is a defect for an encoded word in the local part. The latter is, unfortunately, not currently the case. I'm in the midst of re-writing the parser, so there are two ways we could go here: go ahead and fix the code now to add the encoded word defect, or just ignore the issue until I've rewritten the parser where I will add that fix. The problem with the latter is that currently the non-ascii defect causes the mailbox to get marked as invalid. The problem with the former is that with the current code it would require scanning the local-part after it is created to see if to contains any encoded words. That's not a huge overhead, so maybe it is worth it. On the third hand, currently an address like A third approach would be to keep the current code but replace the non-ascii defect with an encoded word defect, which would preserve the current invalid-mailbox behavior (both right and wrong) except that the defect type would change. That might be the safest option. Thoughts? Opinions? |
|
Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :( Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. Allowing Address to create such addresses will make this worse, so should we add something to generator that will deprecate this behavior? If so we should probably open a new issue and not commit this PR until after the deprecation is in place. |
I think PR #122540 was meant to address that?
I'd argue code relying on that is already broken: it's generating undeliverable email. Raising an error instead seems like an improvement. (I arrived in these issues via Django, which had interpreted Python's support for rfc2047-encoded local-parts as spec compliance. We took some steps to correct that in recent releases.) I'll need a bit to think on your other questions, but would the more detailed errors floated in #122540 (comment) help any? |
|
Ah, OK, that makes more sense: I have just lost the context :( Sorry about the delay in dealing with these issues, I'll see if I can drive them through to completion. I'll have to re-load my brain with the full context first. |
|
OK, so #122540 still looks good and I'll get it merged before the beta ;). For this one, I think we should convert the non-ascii defect into a regular InvalidHeaderDefect talking about the invalid encoded word. That will be a little more backward and forward compatible; I'll clean it up more fully in my parser rewrite. |
|
And having written that I'm having second thoughts. The whole problem is the existence of a defect, and just changing the type would not change the existence of the defect. We'd have to go with the "look for an encoded word" pattern, which would only catch some cases. So I take it back, let's just go with what you have here. |
|
@bitdancer thanks for moving this forward. The context is all slowly coming back to me. Some thoughts… In a post-EAI world, an addr-spec with non-ASCII characters is valid. So merely representing such an addr-spec in an So the real issue may be that
The defect there seems to be invalid (or obsolete?) CFWS in an addr-spec, with a valid local-part Backing up a bit, I think the context for gh-81074 was trying to use def send_message(name, email):
message = email.message.EmailMessage()
message["To"] = email.headerregistry.Address(
display_name=name, addr_spec=email
)
...
# SMTP.send_messages() automatically requires SMTPUTF8
# and switches to a utf8=True policy if necessary.
smtp.send_message(message)The change in this PR would allow that use (either the current version of the PR or the original that kept NonAsciiLocalPartDefect). But it does seem like there are some other considerations around invalid headers, which might be better addressed in your parser rewrite. Footnotes
|
|
Oh, also while you're looking at parsers and encoded words, you might be interested in PR #130749. |
|
Yeah, my conclusion is the same: no defect on parsing, raise an error on serialization with utf8=False. Which is what the code in #122540 is going to do, not only for addr-spec but for all other no-encoded-words-allowed cases. Hopefully that won't break anyone's code ;) As far as decoding them in add-spec, that's mostly a consequence of the parser's "don't raise errors during parsing" contract. I mean, we could leave them un-decoded, but the natural code path in the parser was to decode them, so I just went with it ;). Not decoding them would actually require an extra check, and while that's trivial, I'll probably preserve the behavior in the rewrite for backward compatibility, for whatever that's worth. I'm open to argument on it, though. I think I'm handing #130749 correctly in the rewrite, but I'll double check. I'll add that issue to my list of things to deal with before finishing the rewrite...I'm thinking of fixing the bugs you and others (and I during the rewrite) have found first, so that the rewrite is changing less behavior when it lands. Which looks like it is going to be quite a while yet. |
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
This PR implements @bitdancer's suggested fix from #81074 (comment):
(The other bugs discussed in that comment are reported separately as #83938 and #122476.)
Fixes gh-81074