Skip to content

gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477

Merged
bitdancer merged 9 commits intopython:mainfrom
medmunds:fix-issue-81074
May 1, 2026
Merged

gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477
bitdancer merged 9 commits intopython:mainfrom
medmunds:fix-issue-81074

Conversation

@medmunds
Copy link
Copy Markdown
Contributor

@medmunds medmunds commented Jul 30, 2024

The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.

This PR implements @bitdancer's suggested fix from #81074 (comment):

… But with RFC6532 support, it should be valid to have a local part that has non-ascii in an Address, and the error, as I noted above, should be raised only at serialization time and when we don't have an original source string. So that raise should be modified to explicitly ignore the NonASCIILocalPartDefect.

(The other bugs discussed in that comment are reported separately as #83938 and #122476.)

Fixes gh-81074

The email.headerregistry.Address constructor raised an error if
addr_spec contained a non-ASCII character. (But it fully supports
non-ASCII in the separate username and domain args.) This change
removes the error for a non-ASCII addr_spec.
@medmunds medmunds requested a review from a team as a code owner July 30, 2024 19:16
@ghost
Copy link
Copy Markdown

ghost commented Jul 30, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jul 30, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

medmunds and others added 4 commits September 9, 2024 14:06
The email.headerregistry.Address constructor raised an error if
addr_spec contained a non-ASCII character. (But it fully supports
non-ASCII in the separate username and domain args.) This change
removes the error for a non-ASCII addr_spec.
@encukou
Copy link
Copy Markdown
Member

encukou commented Mar 19, 2025

@bitdancer, as the email expert, do you want to review this?

@bitdancer
Copy link
Copy Markdown
Member

Yes, I'll add it to my list. It will take me quite some time to get through the backlog, unfortunately.

Comment thread Lib/email/headerregistry.py Outdated
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 18, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

medmunds added 3 commits May 26, 2025 14:48
When parsing email messages from Unicode strings (but not bytes),
get_local_part() recorded a NonASCIILocalPartDefect for non-ASCII
characters. RFC 5322 permits such addresses.

This change:
- removes the parse-time detection for a non-ASCII local-part
  (and a related test)
- adds tests for passing a non-ASCII addr_spec to
  email.headerregistry.Address.__init__()
- marks the (undocumented) email.errors.NonASCIILocalPartDefect
  as unused and deprecated

 This affected parsing email messages from Unicode strings
(but not from bytes), and also prevented
@medmunds
Copy link
Copy Markdown
Contributor Author

I have made the requested changes; please review again

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 26, 2025

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app Bot requested a review from bitdancer May 26, 2025 23:19
@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 17, 2026
@medmunds
Copy link
Copy Markdown
Contributor Author

@github-actions that doesn't seem like a good way to encourage contributions from volunteers.

@bitdancer
Copy link
Copy Markdown
Member

It did remind me that I hadn't reviewed the PR, which I guess is the point? I've been ignoring a lot of these pings on issues I'm not ready to deal with yet, though, so you also have a point ;)

Now, unfortunately (or fortunately, depending on how you look at it) in the time since you made the change I have discovered, while doing work on the parser, that there actually is a place where the non-ascii local part defect is generated and is appropriate during non-utf8 processing: it happens when the local part is an encoded word that decodes to something containing non-ascii. That's not RFC compliant, of course, but that's why it's a defect ;)

I'm not quite sure where to go with this. On the one hand that's clearly a defect, on the other hand if we took a message like that and refolded it for sending via SMTPUTF8, it would be valid, just as it would be valid if it came in through an Address. So I guess I'm leaning toward still getting rid of the defect, and making sure there is a defect for an encoded word in the local part.

The latter is, unfortunately, not currently the case. I'm in the midst of re-writing the parser, so there are two ways we could go here: go ahead and fix the code now to add the encoded word defect, or just ignore the issue until I've rewritten the parser where I will add that fix. The problem with the latter is that currently the non-ascii defect causes the mailbox to get marked as invalid. The problem with the former is that with the current code it would require scanning the local-part after it is created to see if to contains any encoded words. That's not a huge overhead, so maybe it is worth it. On the third hand, currently an address like =?utf-8?q?foo?= @abc.com will not show as having any defects and will therefore be marked as a valid mailbox. Which is a bug ;) On the fourth hand, nothing is currently done by the rest of the code with mailboxes marked invalid, so it will only matter to people actually digging in to the parse tree, and there are probably very few of those. (There are other problems with the whole 'invalid mailbox' idea, but that's a whole separate set of issues I'm not ready to deal with.)

A third approach would be to keep the current code but replace the non-ascii defect with an encoded word defect, which would preserve the current invalid-mailbox behavior (both right and wrong) except that the defect type would change. That might be the safest option.

Thoughts? Opinions?

@bitdancer
Copy link
Copy Markdown
Member

Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :(

Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. Allowing Address to create such addresses will make this worse, so should we add something to generator that will deprecate this behavior? If so we should probably open a new issue and not commit this PR until after the deprecation is in place.

@medmunds
Copy link
Copy Markdown
Contributor Author

Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :(

I think PR #122540 was meant to address that?

Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. [...]

I'd argue code relying on that is already broken: it's generating undeliverable email. Raising an error instead seems like an improvement. (I arrived in these issues via Django, which had interpreted Python's support for rfc2047-encoded local-parts as spec compliance. We took some steps to correct that in recent releases.)

I'll need a bit to think on your other questions, but would the more detailed errors floated in #122540 (comment) help any?

@bitdancer
Copy link
Copy Markdown
Member

Ah, OK, that makes more sense: I have just lost the context :( Sorry about the delay in dealing with these issues, I'll see if I can drive them through to completion. I'll have to re-load my brain with the full context first.

@bitdancer
Copy link
Copy Markdown
Member

OK, so #122540 still looks good and I'll get it merged before the beta ;). For this one, I think we should convert the non-ascii defect into a regular InvalidHeaderDefect talking about the invalid encoded word. That will be a little more backward and forward compatible; I'll clean it up more fully in my parser rewrite.

@bitdancer
Copy link
Copy Markdown
Member

And having written that I'm having second thoughts. The whole problem is the existence of a defect, and just changing the type would not change the existence of the defect. We'd have to go with the "look for an encoded word" pattern, which would only catch some cases. So I take it back, let's just go with what you have here.

@medmunds
Copy link
Copy Markdown
Contributor Author

medmunds commented Apr 30, 2026

@bitdancer thanks for moving this forward. The context is all slowly coming back to me. Some thoughts…

In a post-EAI world, an addr-spec with non-ASCII characters is valid. So merely representing such an addr-spec in an Address object is not any sort of defect. Parsing one in a non-utf8 context would be a defect, and trying to flatten it with a non-utf8 policy should (and will) be an error.

So the real issue may be that Address(addr_spec=...) borrows parser.get_addr_spec(), and the parser assumes a non-utf8 policy when recording defects. I can't remember if you were thinking about giving the parser access to a policy?

[...] currently an address like =?utf-8?q?foo?= @abc.com will not show as having any defects and will therefore be marked as a valid mailbox. Which is a bug ;) [...]

The defect there seems to be invalid (or obsolete?) CFWS in an addr-spec, with a valid local-part =?utf-8?q?foo?= (since by definition rfc2047 doesn't apply in an addr-spec 😄). Or maybe something involving decoded display-name foo and an addr-spec with multiple defects (missing local-part, missing angle brackets). But not a NonAsciiLocalPartDefect.1

Backing up a bit, I think the context for gh-81074 was trying to use email.headerregistry.Address as the modern API replacement for legacy email.utils.formataddr. If you have a name and an email address from a user, you want something like this to encode properly and be safe from address header injections:

def send_message(name, email):
    message = email.message.EmailMessage()
    message["To"] = email.headerregistry.Address(
        display_name=name, addr_spec=email
    )
    ...
    # SMTP.send_messages() automatically requires SMTPUTF8
    # and switches to a utf8=True policy if necessary.
    smtp.send_message(message)

The change in this PR would allow that use (either the current version of the PR or the original that kept NonAsciiLocalPartDefect). But it does seem like there are some other considerations around invalid headers, which might be better addressed in your parser rewrite.

Footnotes

  1. Btw (and fwiw), when I looked a few years back, I wasn't able to find any MTAs or mail clients that generated or decoded rfc2047 in addr-specs. So I'm not sure there's practical value in trying to decode it there rather than taking the specs literally.

    rfc2047 does often get improperly wrapped in a quoted-string (just to be extra safe with specials, I guess?). And attachment filenames seem at least as likely to be incorrectly encoded with rfc2047 as correctly with rfc2231.

@medmunds
Copy link
Copy Markdown
Contributor Author

Oh, also while you're looking at parsers and encoded words, you might be interested in PR #130749.

@bitdancer
Copy link
Copy Markdown
Member

Yeah, my conclusion is the same: no defect on parsing, raise an error on serialization with utf8=False. Which is what the code in #122540 is going to do, not only for addr-spec but for all other no-encoded-words-allowed cases. Hopefully that won't break anyone's code ;)

As far as decoding them in add-spec, that's mostly a consequence of the parser's "don't raise errors during parsing" contract. I mean, we could leave them un-decoded, but the natural code path in the parser was to decode them, so I just went with it ;). Not decoding them would actually require an extra check, and while that's trivial, I'll probably preserve the behavior in the rewrite for backward compatibility, for whatever that's worth. I'm open to argument on it, though.

I think I'm handing #130749 correctly in the rewrite, but I'll double check. I'll add that issue to my list of things to deal with before finishing the rewrite...I'm thinking of fixing the bugs you and others (and I during the rewrite) have found first, so that the rewrite is changing less behavior when it lands. Which looks like it is going to be quite a while yet.

@bitdancer bitdancer merged commit b605573 into python:main May 1, 2026
44 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Stale PR or inactive for long period of time. topic-email

Projects

None yet

Development

Successfully merging this pull request may close these issues.

email.headerregistry.Address blocks Unicode local part addr_spec accepted elsewhere

5 participants