[FoRK] Slow grep

Aidan Kehoe kehoea at parhasard.net
Fri Mar 19 03:36:07 PST 2004


 On the 19th day of month 3, Eugen Leitl wrote:

 > It's supposed to be compatible. It breaks a lot of things. It's also
 > inherently bloated, and will result in a security nightmare of
 > unprecedented proportions.

Explain to me how it is inherently bloated? You use English and the
characters in US-ASCII, you've the same overhead as before, but the
capability to have the Euro sign not break on you, the option of using
decent typography, the option of using the IPA, the option of using
mathematical symbols that don't look like arse. You write web pages in
Japanese, and the structure and HTML overhead is exactly the same as with
ASCII, while you're using three bytes for each kana or kanji. With the
markup counted in, probably less overhead than UTF-16. 
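
A quick way to check the byte counts for yourself. This is only a sketch:
the sample strings and the sizes() helper are made up for illustration, and
UTF-16 is measured without a byte-order mark.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16 (no BOM).
# sizes() and the sample strings are illustrative, not from any real page.

def sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

ascii_text = "Slow grep"        # pure US-ASCII
japanese = "日本語"              # three kanji
page = "<p>日本語</p>"           # the same kanji wrapped in ASCII markup
euro = "€"                      # U+20AC

# Pure ASCII text is byte-for-byte identical in UTF-8.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(sizes(ascii_text))  # (9, 18)
print(sizes(japanese))    # (9, 6)
print(sizes(page))        # (16, 20)
print(sizes(euro))        # (3, 2)
```

Bare kanji cost more in UTF-8 than in UTF-16, but once ASCII markup is
mixed in, the total tips the other way.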

Any technical change has implications for security. It's not impossible to
make a technical change and get security right, though. 

 > Not really. Most of Western Europe is easily pressable into a 7-bit charset,
 > even slavic languages can be transcribed, though the latter is a bit of a
 > stretch. Umlauts and ess-zet in German are commonly transcribed (I never use
 > them in email traffic, and I frequently see problems with naive users who
 > just use them because they're there). 

Western Europe isn't the issue. Western Europe is getting along okay with
ASCII--text looks like arse, but people struggle along. 

 > > your life be if, to use URLs and the internet in general, you had to learn
 > > to transcribe from your native Cantonese to an [-A-Z0-9] representation of
 > > the sounds of proper nouns, to access information relevant to you? Oh, and
 > 
 > There's a bazillion dialects in India, but about all of them speak English.

No, they don't all speak English. 

 > About everybody in EU speaks English 

No. Even go to the former East Germany, and that's not true. Never mind
Hungary and Finland (whee!, agglutinative languages, English is as easy to
get their heads around as Klingon for them).

 > (allowing multiple native languages was a Massively Bad Idea).

Sure. That's not something we can do anything about, though.

 > A phonetic alphabet of a few tens of characters is very easy to memorize
 > additionally, if you're an ideogram user.

Syoor. Uhv cohrs, if yoor yuzing it tu rite Inglish, wich yu seem tu bee
[switch to normal orthography, this is giving me a headache] so fixed on,
that's no real use, because common words' spellings pay sufficiently little
heed to phonemics that you have to approach them as ideographics, anyway. If
you were arguing about German, you might have had a point.

 > > of course, the transcription the people setting up the servers used was
 > > from Mandarin, so it bears no relation whatsoever to the sound of the
 > > words as you say them.
 > 
 > Exactly, so they should have used an existing legacy system. The English
 > language.

For expressing a concept or a proper noun that has no English equivalent?

 > > Or, the bind and sendmail people could suck it down, allow eight-bit-set
 > 
 > Why should they? Why should they introduce bloat and bugs by the metric
 > shitload?

Bloat? Nope, see above. What they have to do is not strip the highest bit
from hostnames, as they do now, and apply UTF-8 canonicalisation to names
that do have the highest bit set. Not quite a re-implementation of Oracle in
650k. 
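
To make the two steps concrete, here is a rough sketch. It assumes that
"canonicalisation" means Unicode NFC normalisation plus lowercasing, which
is my reading and not a quote from any spec, and the function names are
invented. It is not what BIND or Sendmail actually do.

```python
import unicodedata

def strip_high_bit(raw: bytes) -> bytes:
    """What a 7-bit-only path does to a name: clear the top bit of each byte."""
    return bytes(b & 0x7F for b in raw)

def canonicalise(raw: bytes) -> bytes:
    """Hypothetical canonical form for a hostname received as bytes.

    Assumption: pure-ASCII names are only lowercased; names with any
    high-bit byte are decoded as UTF-8, normalised to NFC and lowercased.
    """
    if all(b < 0x80 for b in raw):
        return raw.lower()
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    return unicodedata.normalize("NFC", text).lower().encode("utf-8")

precomposed = "café.example".encode("utf-8")       # é as U+00E9
decomposed = "cafe\u0301.example".encode("utf-8")  # e + combining acute

# Stripping the high bit turns the two UTF-8 bytes of é into "C)".
print(strip_high_bit(precomposed))  # b'cafC).example'

# After canonicalisation both spellings compare equal byte for byte.
print(canonicalise(precomposed) == canonicalise(decomposed))  # True
```

The point of the second function is that two byte sequences a user would
regard as the same name have to end up as the same lookup key.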

Bugs? Well, of course, it's BIND and Sendmail, the (broken) reference
implementations of the Two Most Broken Protocols In The World Ever. Doesn't
mean development of the protocols should be stopped--it means it should be
furthered, if at all possible. 

 > What is this nonsense with umlaut domains? Who the fuck needs them?

I'm not saying anyone does. 

 > > hostnames--and comparatively little work is needed for it, too--and
 > > you'd be able to type the Han for what you mean, and it would Just
 > > Work.
 > 
 > I see the point of using native languages in office environment. I do not
 > see the point of altering core infrastructure without any incentive.
 > 
 > You could just add a compatibility layer, which uniquely transcribes Han
 > into domain names, like liao2zhen1fang5.org, or whatever.  That would
 > have been a clean solution.

To _everything_? To Sendmail, to BIND, to MSIE, to Mozilla, to Eudora, to
/bin/mail? That's even more disruptive again. Never mind that it has much,
much less chance of getting industry support and of happening than adding
eighth-bit-set support + UTF-8 canonicalisation to BIND and Sendmail. 

 > > We need them in Kanji, and Han Chinese, and if we solve the issue for
 > > that, we get Urdu for free, architecturally. (We don't get
 > > Klingon--Klingon made
 > 
 > Make a scheme to map 7-bit ASCII to UTF-8 and back. Let application layer
 > deal with the issue (a single library should do).

I could copy and paste from http://cr.yp.to/djbdns/idn.html (search for
"Damage caused by IDNA") but I have better things to be doing. 
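
For anyone who hasn't seen what such an ASCII mapping looks like in
practice, here is a small sketch using the "idna" codec that ships with
Python's standard library (it implements the 2003 IDNA scheme). The
hostnames are invented examples.

```python
# Round-trip a non-ASCII hostname through the stdlib "idna" codec.
# The hostnames are made-up examples.

name = "bücher.example"

ace = name.encode("idna")   # Unicode -> 7-bit ASCII form, label by label
print(ace)                  # b'xn--bcher-kva.example'

back = ace.decode("idna")   # ASCII form -> Unicode
print(back)                 # bücher.example

# Plain ASCII names pass through unchanged.
print("fork.example".encode("idna"))  # b'fork.example'
```

Every application that displays or accepts hostnames has to perform this
conversion itself, since only the xn-- form ever travels over the wire.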

 > I'm not at all sure Unicode is a good idea.

Explain to me how you can see the point of using native languages in an
office environment and have _any fucking doubt_ that Unicode is a good idea?

-- 
I don't care if it rains or freezes/'Long as I got my Plastic Jesus
Riding on the dashboard of my car.

