[FoRK] Slow grep

Aidan Kehoe kehoea at parhasard.net
Fri Mar 19 07:10:16 PST 2004

 On the 19th day of month 3, Eugen Leitl wrote:

 > On Fri, Mar 19, 2004 at 11:36:07AM +0000, Aidan Kehoe wrote:
 > > Explain to me how it is inherently bloated? You use English and the
 > The methods to process Unicode. Not only more lines of code, far more
 > complexity.

Where? In DNS and SMTP, stop stripping the eighth bit, canonicalise the
UTF-8, map visually identical characters to the same thing. The
canonicalisation standard is there; the visually identical mapping will be.
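
To make the two transport-level steps concrete, here is a minimal sketch in Python. It assumes NFC as the canonical form (the message does not say which normal form is meant), and the helper names strip_eighth_bit and canonicalise are made up for illustration. The visually-identical mapping is left out because, as said above, it did not exist yet.

```python
import unicodedata


def strip_eighth_bit(data: bytes) -> bytes:
    """Mimic a 7-bit transport that clears the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


def canonicalise(text: str) -> str:
    """Canonicalise to NFC. NFC is an assumption; any one agreed form would do."""
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two spellings differ as strings but agree once canonicalised.
print(composed == decomposed)                              # False
print(canonicalise(composed) == canonicalise(decomposed))  # True

# What stripping the eighth bit does to the UTF-8 bytes on the wire.
wire = composed.encode("utf-8")
print(wire)                    # b'caf\xc3\xa9'
print(strip_eighth_bit(wire))  # b'cafC)'
```

The last line is the whole argument for an 8-bit-clean transport: the damage is silent and cannot be undone at the receiving end.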

In office suites and client software you have to do all the funky
complicated stuff like mapping i to LATIN CAPITAL LETTER I WITH DOT ABOVE if
you're using Turkish and converting to upper case, and deciding which of
the various ways of storing LATIN SMALL LETTER E WITH ACUTE you prefer. But
you had to do that anyway.
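
A rough sketch of both client-side chores, again in Python. The function upper_turkish is a hypothetical helper, not a library call: Python's own str.upper() ignores locale, so the Turkish pairs have to be handled by hand or by a library such as ICU.

```python
import unicodedata


def upper_turkish(text: str) -> str:
    """Upper-case with the Turkish dotted/dotless i rules (illustrative only)."""
    # Map the two Turkish-specific letters first, then use the default mapping.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


print("istanbul".upper())         # ISTANBUL (default, locale-blind mapping)
print(upper_turkish("istanbul"))  # İSTANBUL (dotted capital I, U+0130)
print(unicodedata.name("\u0130")) # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The two ways of storing LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"
combining = unicodedata.normalize("NFD", precomposed)
print(len(precomposed), len(combining))                       # 1 2
print(unicodedata.normalize("NFC", combining) == precomposed) # True
```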

 > > characters in US-ASCII, you've the same overhead as before, but the
 > > capability to have the Euro sign not break on you, the option of using
 > I can say EUR, or USD, or JPY just fine. Most professionals use these
 > handy TLAs, not the funky symbols.

Depends on the context. In some contexts, the symbol is just more

 > > decent typography, the option of using the IPA, the option of using
 > > mathematical symbols that don't look like arse. You write web pages in
 > Representation is orthogonal to content. TeX or PostScript do just fine
 > without Unicode.

Yeah, sure. 

 > > Japanese, and the structure and HTML overhead is exactly the same as with
 > I don't see any need to break *nix tools just to be able to use Kanji in
 > command line.

Eh? But they're fixed. The software had a bug, it got fixed. What Justin
said; they were biting the bullet. I don't blame them for it. 

 > I've never had trouble writing Cyrillic in XEmacs; it seems to handle Far
 > East users just fine, too, whatever arcane entry methods they use.

XEmacs and Unix are not the same world. XEmacs has had decent i18n support
since the early nineties; the same is very very untrue of the free Unix

 > > Any technical change has implications for security. It's not impossible
 > > to make a technical change and get security right, though.
 > Not after the fact (you design with security in mind). 

SMTP wasn't designed with security in mind--it was barely designed. Ditto
DNS. We _have_ to do it after the fact. 

 > Not if the change implies ballooning complexity.
 > > Western Europe isn't the issue. Western Europe is getting along okay with
 > > ASCII--text looks like arse, but people struggle along. 
 > My command line fonts look just fine. I could really do without proportional
 > fonts and ligatures in command line, thank you.

Umlauts as vowel + e and the _deutsche Anfuehrungszeichen_ as "" don't look
like arse to you? Then, sir, I find issue with your sense of aesthetics as a
user of German.

 > > No, they don't all speak English. 
 > It is the main traffic language. The lingua franca. The whole point of a
 > lingua franca is to be the reference unit. So that you don't have to deal
 > with each explicit conversion case.

That's not the point of a lingua franca at all. The point of a lingua franca
is that you're Franks and Normans and Saxons and Italians and Spanish people
fighting a multinational religious war and you need some stupid hacky means
of communicating with each other.

There's no need for the stupid hack. We can get it right. 

 > The whole point is that anyone knows that English is de facto standard
 > everywhere, and that it's a basic skill on almost any resume. You can assume
 > that the differences due to schooling are being nivellated as we speak.

[Okay, so _niveau_ made it into Russian after all, with _cauchemar_ and
_anecdote_. Go Napoléon.] 

 > You can deny that English is the de facto traffic language (though EU
 > officials like to think that German and French is that, too), but the
 > reality looks different.

Again, Western Europe isn't the issue. 

 > > Hungary and Finland (whee!, agglutinative languages, English is as easy to
 > > get their heads around as Klingon for them).
 > Finns do just fine in English (as does the rest of Scandinavia), Hungary
 > less so -- because they live in the Post-Soviet fallout cloud. You can
 > assume they will fix that in no time.

[Finnish-speaking] Finns do okay in English, though not as well as the
Germanic-speaking rest of Scandinavia. It's not easy for them, but nothing
is--bar Estonian--so they don't dwell on it. 

 > It's just a schooling issue. If there's a perceived need, people will
 > bloody learn it.

If there's a perceived need, the code monkeys will bloody learn it. And, in
the grand scheme of things, educating them in it will be more efficient than
educating four billion people in something stupid and hacky. 

 > >  > (allowing multiple native languages was a Massively Bad Idea).
 > > 
 > > Sure. That's not something we can do anything about, though.
 > Yes. Do not pamper the users. Refuse accepting code commented in anything
 > but English.

I don't follow you there. You're suggesting that refusing to accept code
commented in Bahasa Indonesian will result in the decline of Bahasa
Indonesian as a native tongue?

 > > Syoor. Uhv cohrs, if yoor yuzing it tu rite Inglish, wich yu seem tu
 > > bee [switch to normal orthography, this is giving me a headache] so
 > > fixed on, that's no real use, because common words' spellings pay
 > > sufficiently little heed to phonemics that you have to approach them as
 > > ideographics, anyway. If you were arguing about German, you might have
 > > had a point.
 > The structure of the language is not a problem (it will make you weak at
 > telling R and L apart, so what), the idea is that learning some 25 glyphs is
 > trivial if you've learned some >>5 k of them. The phonetic mapping idear
 > ain't that hard to grok, either.

Eh? I'm talking about the structure of English, which language you seemed to
want four billion people to learn, and which language is really, really
unintuitive to write. 

 > > For expressing a concept or a proper noun that has no English equivalent?
 > Any language is Turing-complete. Any meaning can be conveyed in any other
 > major current (private and dead languages are too handicapped) 
 > language, more or less verbosely.


 > > Bugs? Well, of course, it's Bind and Sendmail, the (broken) reference
 > > implementations of the Two Most Broken Protocols In The World
 > > Ever. Doesn't mean development of the protocols should be stopped--it
 > > means it should be furthered, if at all possible.
 > Protocols are being designed, not evolved. If a protocol needs evolution,
 > it implies it's a design failure. If you need a new functionality,
 > develop a new protocol. Successful protocols manage to cope with domains
 > they were not designed for, and degrade more or less gracefully when they
 > start breaking down.

Shite. Protocols evolve. MX records. Host: fields. NFS over TCP. 

 > Precisely not to MTAs or MUAs. If you want funny symbols in the address bar,
 > use a plugin hiding the transcoded-into-legacy-representation from you. The
 > legacy code base doesn't need to be touched, and is guaranteed to work
 > because transcoding is fundamentally safe, and contained. 

That's a change to an MUA, though, because you want to be able to use this
domain to send email, too. 
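
For illustration, the transcoding step under discussion could look like the sketch below. It uses Python's built-in "idna" codec as a stand-in for whatever the plugin would do; the helper names and the bücher.example address are invented for the example. Only the domain is transcoded, and the local part is left untouched, which is an assumption of this sketch.

```python
def to_legacy(domain: str) -> str:
    """Transcode a Unicode domain to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def from_legacy(domain: str) -> str:
    """Recover the Unicode form of an ASCII-compatible domain."""
    return domain.encode("ascii").decode("idna")


def address_to_legacy(address: str) -> str:
    """Transcode only the domain part of local@domain (illustrative helper)."""
    local, _, domain = address.rpartition("@")
    return f"{local}@{to_legacy(domain)}"


print(to_legacy("b\u00fccher.example"))               # xn--bcher-kva.example
print(from_legacy("xn--bcher-kva.example"))           # bücher.example
print(address_to_legacy("info@b\u00fccher.example"))  # info@xn--bcher-kva.example
```

Whichever program calls address_to_legacy before handing the message to SMTP is the one that changed, and for mail that program is the MUA.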

 > [...]  I have better things to do than reading Bernstein's screeds. His
 > approaches are frequently... unique, to put it politely. Do not play well
 > with others.

Dismissal of a technical proposal based on the social skills of its author?
What a useful, constructive approach. 

I don't care if it rains or freezes/'Long as I got my Plastic Jesus
Riding on the dashboard of my car.
