[FoRK] Slow grep

Eugen Leitl eugen at leitl.org
Fri Mar 19 05:18:03 PST 2004


On Fri, Mar 19, 2004 at 11:36:07AM +0000, Aidan Kehoe wrote:

> Explain to me how it is inherently bloated? You use English and the

The methods to process Unicode. Not only more lines of code, but far more
complexity. Coding monkeys are stupid enough as it is (I see it demonstrated
vividly every single day; I'm no longer surprised at anything), and having to
deal with Unicode makes them jump through even more hoops and drop the
bananas they've been juggling. Man, monkeys *hate* that.

> characters in US-ASCII, you've the same overhead as before, but the
> capability to have the Euro sign not break on you, the option of using

I can say EUR, or USD, or JPY just fine. Most professionals use these handy
TLAs, not the funky symbols.

> decent typography, the option of using the IPA, the option of using
> mathematical symbols that don't look like arse. You write web pages in

Representation is orthogonal to content. TeX or PostScript do just fine
without Unicode.

> Japanese, and the structure and HTML overhead is exactly the same as with

I don't see any need to break *nix tools just to be able to use Kanji on the
command line. I've never had trouble writing Cyrillic in XEmacs; it seems to
handle Far East users just fine, too, whatever arcane entry methods they use.

> ASCII, while you're using two or three bytes for each non-Roman
> character. Probably less overhead than UTF-16. 
> 
> Any technical change has implications for security. It's not impossible to
> make a technical change and get security right, though. 

Not after the fact (you design with security in mind).
Not if the change implies ballooning complexity.
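
For what it's worth, the byte counts in the quoted paragraph are easy to
check. A minimal sketch in Python; the sample strings and the helper name
encoded_sizes are made up for illustration, and UTF-16 is measured without a
byte-order mark:

```python
# Illustrative only: compare encoded sizes of a few sample strings.
samples = {
    "ascii": "grep",
    "euro": "\u20ac",                     # EURO SIGN
    "cyrillic": "\u042f",                 # CYRILLIC CAPITAL LETTER YA
    "kanji": "\u6f22",                    # a single CJK ideograph
    "html": "<p>\u65e5\u672c\u8a9e</p>",  # ASCII markup around three kanji
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:8s} utf-8={utf8:2d} utf-16={utf16:2d}")
```

Pure ASCII stays at one byte per character in UTF-8 and doubles in UTF-16;
the kanji take three bytes each in UTF-8, but the ASCII markup around them
keeps the mixed string smaller than its UTF-16 form.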
 
> Western Europe isn't the issue. Western Europe is getting along okay with
> ASCII--text looks like arse, but people struggle along. 

My command line fonts look just fine. I could really do without proportional
fonts and ligatures on the command line, thank you.
 
> No, they don't all speak English. 

It is the main traffic language. The lingua franca. The whole point of a
lingua franca is to be the reference unit, so that you don't have to deal
with each explicit conversion case.
 
>  > About everybody in EU speaks English 
> 
> No. Even go to the former East Germany, and that's not true. Never mind

The whole point is that everyone knows that English is the de facto standard
everywhere, and that it's a basic skill on almost any resume. You can assume
that the differences due to schooling are being leveled out as we speak.

You can deny that English is the de facto traffic language (though EU
officials like to think that German and French are that, too), but the
reality looks different.

> Hungary and Finland (whee!, agglutinative languages, English is as easy to
> get their heads around as Klingon for them).

Finns do just fine in English (as does the rest of Scandinavia), Hungary less
so -- because they live in the Post-Soviet fallout cloud. You can assume they
will fix that in no time.

It's just a schooling issue. If there's a perceived need, people will bloody
learn it.
 
>  > (allowing multiple native languages was a Massively Bad Idea).
> 
> Sure. That's not something we can do anything about, though.

Yes. Do not pamper the users. Refuse to accept code commented in anything but
English.
 
> Syoor. Uhv cohrs, if yoor yuzing it tu rite Inglish, wich yu seem tu bee
> [switch to normal orthography, this is giving me a headache] so fixed on,
> that's no real use, because common words' spellings pay sufficiently little
> heed to phonemics that you have to approach them as ideographics, anyway. If
> you were arguing about German, you might have had a point.

The structure of the language is not a problem (it will make you weak at
telling R and L apart, so what), the idea is that learning some 25 glyphs is
trivial if you've learned some >>5 k of them. The phonetic mapping idear
ain't that hard to grok, either.
 
> For expressing a concept or a proper noun that has no English equivalent?

Any language is Turing-complete. Any meaning can be conveyed in any other
major current (private and dead languages are too handicapped) 
language, more or less verbosely.
 
> Bugs? Well, of course, it's Bind and Sendmail, the (broken) reference
> implementations of the Two Most Broken Protocols In The World Ever. Doesn't
> mean development of the protocols should be stopped--it means it should be
> furthered, if at all possible. 

Protocols are designed, not evolved. If a protocol needs evolution, that
implies a design failure. If you need new functionality, develop a new
protocol. Successful protocols manage to cope with domains they were not
designed for, and degrade more or less gracefully when they start breaking
down.
 
> To _everything_? To Sendmail, to Bind, to MSIE, to Mozilla, to Eudora, to

Precisely not to MTAs or MUAs. If you want funny symbols in the address bar,
use a plugin hiding the transcoded-into-legacy-representation from you. The
legacy code base doesn't need to be touched, and is guaranteed to work
because transcoding is fundamentally safe, and contained.
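
A rough sketch of what such a transcoding layer could look like at the
application level. This is an assumption for illustration, not a reference
to any particular plugin: Python's built-in "idna" codec (IDNA 2003, RFC
3490) stands in for the single library, and to_legacy/from_legacy are
hypothetical helper names.

```python
# Sketch only: the stdlib "idna" codec does the transcoding; the function
# names below are hypothetical and exist only for this example.

def to_legacy(hostname: str) -> str:
    """Transcode a Unicode hostname into its 7-bit ASCII form."""
    return hostname.encode("idna").decode("ascii")


def from_legacy(hostname: str) -> str:
    """Recover the Unicode form of an ASCII-encoded hostname for display."""
    return hostname.encode("ascii").decode("idna")


name = "b\u00fccher.example"      # u-umlaut in the first label
wire = to_legacy(name)            # what the legacy code base would see
print(wire)
print(from_legacy(wire) == name)  # round trip back to the display form
```

Names that are already plain ASCII pass through to_legacy unchanged, so
anything downstream only ever sees 7-bit labels.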

> /bin/mail? That's even more disruptive again. Never mind that it has much,
> much less chance of getting industry support and of happening than adding
> eighth-bit-set support + UTF-8 canonicalisation to BIND and Sendmail. 

You're living on some weird planet, dO0d (how many ways are there to spell
that in Unicode?).
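
The spelling question can be made concrete. A small sketch, assuming
Python's unicodedata module; same_after_nfkc and the sample strings are
invented for illustration. Compatibility normalization (NFKC) folds some
variants together, such as fullwidth Latin letters, but not look-alikes
from other scripts:

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """True if the two strings are equal after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


plain = "dude"
fullwidth = "\uff44\uff55\uff44\uff45"  # fullwidth Latin d, u, d, e
cyrillic_e = "dud\u0435"                # last letter: CYRILLIC SMALL LETTER IE

print(same_after_nfkc(plain, fullwidth))   # fullwidth forms fold to ASCII
print(same_after_nfkc(plain, cyrillic_e))  # the Cyrillic letter stays distinct
```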
 
>  > > We need them in Kanji, and Han Chinese, and if we solve the issue for
>  > > that, we get Urdu for free, architecturally. (We don't get
>  > > Klingon--Klingon made
>  > 
>  > Make a scheme to map 7-bit ASCII to UTF-8 and back. Let application layer
>  > deal with the issue (a single library should do).
> 
> I could copy and paste from http://cr.yp.to/djbdns/idn.html (search for
> "Damage caused by IDNA") but I have better things to be doing. 

I have better things to do than read Bernstein's screeds. His approaches are
frequently... unique, to put it politely. They do not play well with others.
 
>  > I'm not at all sure Unicode is a good idea.
> 
> Explain to me how can you see the point of using native languages in an
> office environment and have _any fucking doubt_ that Unicode is a good idea?

I think plutonium has its uses, if restricted to safe environments, and
procedures. I don't recommend selling nice shiny chunks of it at the local
WalMart at low low low prices.

I *could* handle plutonium. I'm pretty sure I can't handle Unicode. YMMV.

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net