ratbox at jdc.parodius.com
Sat May 4 17:40:42 UTC 2013
On Sat, May 04, 2013 at 07:25:36PM +0300, Daniel Corbe wrote:
> Jilles Tjoelker <jilles at stack.nl> writes:
> > On Sat, May 04, 2013 at 05:43:28PM +0300, Daniel Corbe wrote:
> >> As part of an ongoing effort to create a stand-alone Jabber MUC which
> >> uses an IRC server as a back end, I've created a very simple patch to
> >> ircd-ratbox which enables unicode nick support.
> >> As it turns out, it was quite easy because the server is pretty much
> >> agnostic to encoding.
> >> The patch is available at
> >> http://www.corbe.net/static/ircd-ratbox-3.0.8-unicode.patch
> >> A production server is up at irc.corbe.net.
> >> The working repo can be tracked at
> >> git://apollo.corbe.net/ircd-ratbox.git
> > Allowing characters like '#', '$', '&', '*' and ':' breaks the protocol
> > (this list is not exhaustive). For example, PRIVMSG interprets things
> > starting with '#' or '&' as channels and things starting with '$' as
> > globals. The asterisk and question mark are wildcard characters, which
> > is "fun" for miscreants who will make it likely that everyone will be
> > banned when an attempt is made to ban them. Parameters starting with a
> > colon are special to the framing mechanism; clients and servers alike
> > will get confused when a nickname starts with a colon.
> The proto-breaking characters are disabled in the update version of the
The patch just says "Unicode". Unicode means a lot of different things;
are we talking about UTF-8, UTF-16, or UTF-32? I have to assume UTF-8
(from briefly examining the patch).
So here's a question for you: what's to stop someone from using a
nickname that contains characters that visually look nearly identical to
delimiters or non-permitted (protocol-violating) characters? While
these won't break parsers, they will cause mass confusion for end-users.
Examples include, but are not limited to:
- U+02F8 -- Raised colon (:)
- U+FE30 -- Vertical two-dot (looks like colon) (:)
- U+FE55 -- Small colon (:)
- U+FE5F -- Small hash symbol (#)
- U+FE60 -- Small ampersand (&)
- U+FE61 -- Small asterisk (*)
- U+FE69 -- Small dollar symbol ($)
- U+FE6B -- Small at symbol (@)
- U+FF03 -- (Japanese) Full-width hash symbol (#)
- U+FF04 -- (Japanese) Full-width dollar symbol ($)
- U+FF06 -- (Japanese) Full-width ampersand (&)
- U+FF0A -- (Japanese) Full-width asterisk (*)
- U+FF1A -- (Japanese) Full-width colon (:)
- U+FF20 -- (Japanese) Full-width at symbol (@)
And one category I do not care to look up in depth (because there are
tons of Unicode code points for these): all forms of spaces.
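To give a rough idea of how the space-like characters could be found
without enumerating them by hand, here is a minimal Python sketch that
relies on Unicode general categories. The function name and the choice
of blocked categories are my own assumptions for illustration, not
anything taken from the patch or from ircd-ratbox. Control characters
such as tab (category Cc) are deliberately left out here.

```python
import unicodedata

# Assumed policy: reject separators (Zs, Zl, Zp) and invisible format
# characters (Cf), which includes ZERO WIDTH SPACE.
BLOCKED_CATEGORIES = {"Zs", "Zl", "Zp", "Cf"}


def spacelike_chars(nick):
    """Return (code point, Unicode name) for each space-like character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in nick
        if unicodedata.category(ch) in BLOCKED_CATEGORIES
    ]


if __name__ == "__main__":
    # Plain nick, no-break space, ideographic space, zero width space.
    for sample in ("jdc", "j\u00a0dc", "j\u3000dc", "j\u200bdc"):
        print(repr(sample), spacelike_chars(sample))
```

This only flags characters; what the server should do with a nickname
that contains them is a separate policy decision.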
While I support UTF-8 given its ASCII backwards-compatibility, when it
comes to existing chat protocols one must be very careful.
I recommend you go through all Unicode blocks/pages (0x01 to 0xff) and
examine just how many similarities there are across the board. I sure
as hell wouldn't want to be the one to have to write a "UTF-8 parser"
just to filter out all of this.
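For what it's worth, a partial version of such a filter can be sketched
in Python using NFKC compatibility normalization, which folds the "small"
and "full-width" forms listed above back to their ASCII counterparts.
It is not a complete answer: U+02F8 and U+FE30 from the list have no
compatibility mapping to ':' and need explicit entries, and there will
be other lookalikes this misses. The names RESERVED, EXTRA_LOOKALIKES
and lookalikes() are hypothetical, chosen only for this example.

```python
import unicodedata

# Characters this thread identifies as special to the IRC protocol.
RESERVED = set("#$&*:@?")

# Lookalikes from the list above that NFKC does not fold to the ASCII
# character they resemble, so they are listed explicitly (assumption:
# this set would need to grow considerably in practice).
EXTRA_LOOKALIKES = {"\u02f8", "\ufe30"}


def lookalikes(nick):
    """Return code points in nick that imitate a reserved ASCII character."""
    found = []
    for ch in nick:
        if ord(ch) < 0x80:
            continue  # real ASCII is left to the server's existing checks
        folded = unicodedata.normalize("NFKC", ch)
        if ch in EXTRA_LOOKALIKES or RESERVED & set(folded):
            found.append("U+%04X" % ord(ch))
    return found


if __name__ == "__main__":
    # Full-width colon, small hash plus raised colon, and a harmless nick.
    for sample in ("\uff1ajdc", "a\ufe5fb\u02f8", "jdc"):
        print(repr(sample), lookalikes(sample))
```

Checking one character at a time keeps the report precise about which
code point is the problem, at the cost of ignoring lookalikes that only
arise from combining sequences.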
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |