[FoRK] Slow grep

Eugen Leitl eugen at leitl.org
Fri Mar 19 02:40:55 PST 2004

On Fri, Mar 19, 2004 at 09:20:15AM +0000, Aidan Kehoe wrote:

> No-one is saying the designers were to blame. And UTF-8 is a useful,

I frequently do blame them, but hindsight is always 20/20. However,
overengineering stuff as a rule is a good attitude to take, and I find it
missing all too often in designs fresh off the press.

> compatible answer to the question of how to usefully internationalise the

It's supposed to be compatible. It breaks a lot of things. It is also inherently
bloated, and will result in a security nightmare of unprecedented

> primary layer of things on the net. 
>  > The bad shit starts when today's jackasses try to "fix" these broken
>  > standards. Instead of transcribing these funny characters, they chose to
>  > extend the set, use alternate keyboard layouts, etc.
> What a fucked-up attitude to take. How much more uselessly difficult would

Not really. Most of Western Europe is easily pressable into a 7-bit charset;
even Slavic languages can be transcribed, though the latter is a bit of a
stretch. Umlauts and ess-zet in German are commonly transcribed (I never use
them in email traffic, and I frequently see problems with naive users who
just use them because they're there). 

> your life be if, to use URLs and the internet in general, you had to learn
> to transcribe from your native Cantonese to an [-A-Z0-9] representation of
> the sounds of proper nouns, to access information relevant to you? Oh, and

There's a bazillion dialects in India, but just about all of their speakers
know English. Just about everybody in the EU speaks English (allowing multiple
native languages was a Massively Bad Idea).

A phonetic alphabet of a few tens of characters is very easy to memorize
in addition, if you're an ideogram user.

> of course, the transcription the people setting up the servers used was from
> Mandarin, so it bears no relation whatsoever to the sound of the words as
> you say them. 

Exactly, so they should have used an existing legacy system. The English
> Or, the bind and sendmail people could suck it down, allow eight-bit-set

Why should they? Why should they introduce bloat and bugs by the metric

What is this nonsense with umlaut domains? Who the fuck needs them?

> hostnames--and comparatively little work is needed for it, too--and you'd be
> able to type the Han for what you mean, and it would Just Work. 

I see the point of using native languages in an office environment. I do not
see the point of altering core infrastructure without any incentive.

You could just add a compatibility layer, which uniquely transcribes Han into domain
names, like liao2zhen1fang5.org, or whatever.  That would have been a clean
>  > Do we really need to be able to use host names with umlauts, or spell
>  > them in Klingon, or Urdu? It would have a point, if it wasn't such a
>  > giant can of worms.
> We need them in Kanji, and Han Chinese, and if we solve the issue for that,
> we get Urdu for free, architecturally. (We don't get Klingon--Klingon made

Make a scheme to map 7-bit ASCII to UTF-8 and back. Let the application layer
deal with the issue (a single library should do).
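
A minimal sketch of what such an application-layer mapping could look like,
assuming Python and its built-in "idna" codec (which implements the IDNA 2003
ToASCII/ToUnicode operations on top of Punycode). The function names are
hypothetical and not from this thread; this is one possible scheme, not
necessarily the one meant above:

```python
# Sketch only: a reversible mapping between Unicode hostnames and
# plain 7-bit ASCII, done entirely in the application layer.
# Helper names are hypothetical; the "idna" codec is part of the
# Python standard library.

def to_ascii_hostname(name: str) -> str:
    """Map a (possibly non-ASCII) hostname to its 7-bit ASCII form."""
    return name.encode("idna").decode("ascii")


def from_ascii_hostname(name: str) -> str:
    """Map the 7-bit ASCII form back to the Unicode hostname."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    wire = to_ascii_hostname("münchen.de")
    print(wire)                       # what DNS, bind, sendmail would see
    print(from_ascii_hostname(wire))  # what the user would see
```

Under this approach the resolver and mail infrastructure only ever see
[-a-z0-9] labels; plain ASCII names pass through unchanged.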

> it into the vendor private use area for Linux, but it didn't make it into
> Unicode as a whole.)

I'm not at all sure Unicode is a good idea.

Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net