[FoRK] Slow grep
eugen at leitl.org
Fri Mar 19 02:40:55 PST 2004
On Fri, Mar 19, 2004 at 09:20:15AM +0000, Aidan Kehoe wrote:
> No-one is saying the designers were to blame. And UTF-8 is a useful,
I frequently do blame them, but hindsight is always 20/20. However,
overengineering as a rule is a good attitude to take, and I find it missing
all too often in designs fresh off the press.
> compatible answer to the question of how to usefully internationalise the
It's supposed to be compatible. It breaks a lot of things. It is also inherently
bloated, and will result in a security nightmare of unprecedented
> primary layer of things on the net.
> > The bad shit starts when today's jackasses try to "fix" these broken
> > standards. Instead of transcribing these funny characters, they chose to
> > extend the set, use alternate keyboard layouts, etc.
> What a fucked-up attitude to take. How much more uselessly difficult would
Not really. Most Western European languages are easily pressed into a 7-bit
charset; even Slavic languages can be transcribed, though the latter is a bit
of a stretch. Umlauts and ess-zet in German are commonly transcribed (I never
use them in email traffic, and I frequently see problems with naive users who
just use them because they're there).
> your life be if, to use URLs and the internet in general, you had to learn
> to transcribe from your native Cantonese to an [-A-Z0-9] representation of
> the sounds of proper nouns, to access information relevant to you? Oh, and
There are a bazillion dialects in India, but nearly all of their speakers also
speak English. Nearly everybody in the EU speaks English (allowing multiple
native languages was a Massively Bad Idea).
A phonetic alphabet of a few dozen characters is very easy to memorize in
addition, if you're an ideogram user.
> of course, the transcription the people setting up the servers used was from
> Mandarin, so it bears no relation whatsoever to the sound of the words as
> you say them.
Exactly, so they should have used an existing legacy system. The English
> Or, the bind and sendmail people could suck it down, allow eight-bit-set
Why should they? Why should they introduce bloat and bugs by the metric
What is this nonsense with umlaut domains? Who the fuck needs them?
> hostnames--and comparatively little work is needed for it, too--and you'd be
> able to type the Han for what you mean, and it would Just Work.
I see the point of using native languages in an office environment. I do not see
the point of altering core infrastructure without any incentive.
You could just add a compatibility layer, which uniquely transcribes Han into domain
names, like liao2zhen1fang5.org, or whatever. That would have been a clean
> > Do we really need to be able to use host names with umlauts, or spell
> > them in Klingon, or Urdu? It would have a point, if it wasn't such a
> > giant can of worms.
> We need them in Kanji, and Han Chinese, and if we solve the issue for that,
> we get Urdu for free, architecturally. (We don't get Klingon--Klingon made
Make a scheme to map 7-bit ASCII to UTF-8 and back. Let the application layer
deal with the issue (a single library should do).
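
For illustration, a minimal sketch of such an application-layer mapping,
assuming the "idna" codec shipped in Python's standard library (it implements
the IDNA/Punycode encoding). This is one existing reversible scheme, not
necessarily the one meant above; the helper names to_ascii and to_unicode and
the host names are made up for the example.

```python
# Sketch only: a reversible map between Unicode host names and 7-bit ASCII.
# Assumes Python's built-in "idna" codec; to_ascii/to_unicode are
# hypothetical helper names, not part of any resolver or mail software.

def to_ascii(hostname: str) -> str:
    """Encode each label of a host name into 7-bit ASCII."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname: str) -> str:
    """Decode an ASCII-encoded host name back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")

if __name__ == "__main__":
    for name in ["bücher.example", "中文.example", "leitl.org"]:
        wire = to_ascii(name)
        # Only 7-bit characters reach the name server.
        assert all(ord(c) < 128 for c in wire)
        # The mapping is unique in both directions.
        assert to_unicode(wire) == name
        print(name, "<->", wire)
```

Here the name servers and mail transport only ever see plain ASCII labels;
the conversion lives entirely in the application, and a name that is already
ASCII, such as leitl.org, passes through unchanged.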
> it into the vendor private use area for Linux, but it didn't make it into
> Unicode as a whole.)
I'm not at all sure Unicode is a good idea.
Eugen* Leitl
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE