Internationalise The Fediverse
We live in the future now. It is OK to use Unicode everywhere.
It seems bizarre to me that modern Internet services sometimes "forget" that there's a world outside the Anglosphere. Some people have the temerity to speak foreign languages! And some of those languages have accents on their letters!! Even worse, some don't use English letters at all!!!
A decade ago, I was miffed that GitHub only supported some ASCII characters in its project names. There's no technical reason why your repo can't be called "ഹലോ വേൾഡ്".
Similarly, I'm frustrated that Mastodon (the largest ActivityPub service) doesn't allow Unicode usernames and has resisted efforts to change.
So I built a small ActivityPub server which publishes content from an Actor called @你好@i18n.viii.fi - it is only a demo account, but it works!
Some ActivityPub clients report that they are able to follow it and receive messages from it. Others - like Mastodon - simply can't see anything from it. Take a look at the replies on Mastodon to see which services work. You can also see some of its posts on the Fediverse.
What Does The Fox Spec Say?
The ActivityPub specification says:
Building an international base of users is important in a federated network. Internationalization
I can't find anything in the specifications which limits what languages a username can be written in. But there are a few clues scattered about.
The user's @name is defined by preferredUsername, which is:
A short username which may be used to refer to the actor, with no uniqueness guarantees. 4.1 Actor objects
There's nothing in there about what scripts it can contain. However, later on, the spec says:
Properties containing natural language values, such as name, preferredUsername, or summary, make use of natural language support defined in ActivityStreams. 4. Actors
So it is expected that a preferred username could be written in multiple scripts. Which implies that the default need not be limited to A-Z0-9.
The ActivityStreams specification talks about language mapping.
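To make that concrete, here is a minimal sketch of an actor document with a non-Latin preferredUsername and a language map for its display name. The `build_actor` helper, the `/users/` URL layout, and the nameMap values are all made up for illustration. Only the nameMap convention itself comes from ActivityStreams, and there is no guarantee that any particular server will accept the result.

```python
import json


def build_actor(username: str, domain: str) -> dict:
    """Return a minimal, hypothetical ActivityPub actor object."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Person",
        # Hypothetical URL layout; real servers choose their own paths.
        "id": f"https://{domain}/users/{username}",
        # The username is plain Unicode text, not restricted to A-Z0-9.
        "preferredUsername": username,
        # A language map: one display name per language tag.
        "nameMap": {"en": "Hello", "zh": "你好"},
    }


actor = build_actor("你好", "i18n.viii.fi")
# ensure_ascii=False keeps the characters readable instead of \uXXXX escapes.
print(json.dumps(actor, ensure_ascii=False, indent=2))
```

Serialising with ensure_ascii=False is only a readability choice; the escaped form is equally valid JSON.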
Finally, the ActivityPub specification has some examples of non-Latin text in names.
So, I think that it is acceptable for usernames to be written in a variety of non-Latin scripts.
But What About...?
There are usually a few objections to "Unicode Everywhere" zealots like me. I'd like to forestall any arguments.
What about homograph attacks?
Well, what about them? ASCII has plenty of similar looking characters. I doubt most people would notice when a capital i is replaced by a lower L - and vice-versa. Similarly the kerning issue of an r and n looking like an m is well known. Are mixed language homographs more dangerous? I don't think so.
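For software that does want to flag mixed-script names, a rough check is possible. The sketch below is an approximation of my own, not a standard algorithm: it guesses each letter's script from the first word of its Unicode character name, because Python's standard library does not expose the real Script property. The function names are hypothetical.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in `text` from Unicode character names."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and symbols
        # Names look like "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ")[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    """True when the letters in `text` appear to come from several scripts."""
    return len(scripts_used(text)) > 1


print(is_mixed_script("olly"))         # all Latin
print(is_mixed_script("oIIy"))         # still all Latin: capital i, not L
print(is_mixed_script("p\u0430ypal"))  # U+0430 is CYRILLIC SMALL LETTER A
```

Note that "oIIy" passes this check, which is the point made above: a script filter does nothing about the lookalikes that ASCII already has.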
What if people make names that can't be typed?
Well, what if they do? Maybe not being found by people who can't type your language is a feature, not a bug. But, anyway, clients can let users search for other people, or copy and paste their names.
What about weird "Zalgo" text?
It is up to a client to decide how they want to render text input. The "problems" of strange Unicode combinations are well known. This is not a hard computer-science problem.
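As one illustration of how a client might tame Zalgo text before displaying it, the sketch below caps the number of combining marks allowed after each base character. The function name and the default limit of 2 are arbitrary assumptions; some scripts legitimately stack more marks than that, so a real client would need to pick its limit with care.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most `max_marks` combining marks after each base character."""
    out = []
    run = 0  # combining marks seen since the last base character
    # Decompose first so precomposed letters such as U+00E9 count their accent.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        out.append(ch)
    # Recompose so ordinary accented text comes back in NFC form.
    return unicodedata.normalize("NFC", "".join(out))


zalgo = "e" + "\u0301" * 10  # an "e" with ten stacked acute accents
print(len(zalgo), len(limit_combining_marks(zalgo)))
```

Ordinary accented text passes through unchanged apart from being normalised to NFC.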
What about bi-directional text?
The spec makes clear this is allowed.
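Rendering is again the client's job. One common display-side technique, which comes from the Unicode bidirectional algorithm rather than from anything ActivityPub prescribes, is to wrap each handle in directional isolates so that a right-to-left name cannot reorder the text around it. The helper names below are hypothetical, and the set of stripped control characters is my own choice.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Explicit bidi controls that a handle should not smuggle into the page:
# embeddings and overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def strip_bidi_controls(handle: str) -> str:
    """Remove explicit bidi formatting characters from a handle."""
    return "".join(ch for ch in handle if ch not in BIDI_CONTROLS)


def isolate(handle: str) -> str:
    """Wrap a handle so its direction cannot affect surrounding text."""
    return f"{FSI}{strip_bidi_controls(handle)}{PDI}"


def mention_line(handle: str) -> str:
    """Build a notification line that embeds an isolated handle."""
    return f"{isolate(handle)} followed you"


print(mention_line("@שלום@example.com"))
```

Clients that render HTML can often get the same effect from markup instead, so treat this as a plain-text fallback.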
Do people even want a username in their own script?
I have no evidence for this. But I bet you'd get pretty frustrated if you had to switch keyboards just to type your own name, wouldn't you? In any case, why can't I have a username of @😉?
What's Next?
If you build ActivityPub software, give some thought to the billions of people who don't have names which easily fit into ASCII.
If your software can see @你好@i18n.viii.fi and its posts, please let me know.
From Hubzilla.
Yes. Yes we do. Great work and I hope it catches on! 🙂
But when it's social media, what's the worst that could happen if you follow olly instead of oIIy?
If there is a vital security issue, punycoding domain names while leaving Unicode account names alone seems like a reasonable compromise (that's why things like Mastodon domain verification exist). On the issue of emoji account names though, emoji is an absolute mess and I hate all of it. But you do you. 😤
hi, hi
just reporting in to say that current versions of GoToSocial can see the account, and probably could see its posts if not for the lack of backfill
Yeah, the number of times I ended up having a square in the middle of my surname made me really wary of putting my real name on official documents in the west. Instead I operate under a fake name, "Kielinski".
Thanks for your reply. Re your comment about a "self own". The purpose of hyperbole in written text is to convey the ridiculous nature of a statement by making it obviously extreme. For example, I used multiple exclamation marks and preceded it with a couple of other statements of a similar nature. In doing so, I hoped to lead my reader into understanding that I disagreed with the proposition - as set out by the rest of the post. I'm sorry if that wasn't clear.
It will open a whole area of phishing and other kinds of vulnerabilities.
Homographs are a big security problem. Also, an easily printable ID is needed in many protocols for development, debugging and bug reports. Unless you want to replace IDs with QR codes or similar...
As I mention in the post, ASCll aIready has a H0M0GRAPH problem. You also pre-suppose that all programmers are able to read A-Z as well as their own alphabet. But, even if that's not the case, the IDs can be URl encoded.
@-form shows nothing. Searching by URL gives: (this is GotoSocial main as of yesterday or so)
Personally, I’ve got mixed feelings on this one. I agree that the localpart should be able to contain Unicode codepoints. Some should be excluded. I don’t know the exact set offhand, but those allowed in URLs (after the server and /, i.e. in the path component) should probably be fine. The domain part, however, I’m rather firm on it not deviating from ASCII, i.e. to internationalise it the punycode representation (xn--something) must be used, not the Unicode representation. So, no complaint against @☻@example.com but I consider @foo@example.ею invalid because it needs to be spelt @[email protected] instead. (What clients make of this is up to them, as usual with IDNs… sigh)
More comments on Mastodon.