[FoRK] Slow grep

Eugen Leitl eugen at leitl.org
Fri Mar 19 10:48:00 PST 2004

On Fri, Mar 19, 2004 at 03:10:16PM +0000, Aidan Kehoe wrote:

> Where? In DNS and SMTP, stop stripping the eighth bit, canonicalise the

I long ago stopped expecting mail transmission to be 8-bit clean.
As I am writing this, I've got broken umlauts as a result of dos2unix conversion
on an index file, while (supposedly) trivially porting an intranet app from Windows to

I.e., 8-bit clean code is surprising. And there are no good surprises, just
bad ones. 
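A minimal Python sketch of what stripping the eighth bit does to an umlaut, assuming the text was encoded as ISO-8859-1 (Latin-1). The helper name strip_eighth_bit is made up for illustration; it is not taken from any real MTA.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the high bit of every byte, as a 7-bit-only transport would."""
    return bytes(b & 0x7F for b in data)


# Assumption: the sender used Latin-1, where u-umlaut is the single byte 0xFC.
original = "M\u00fcller".encode("latin-1")
mangled = strip_eighth_bit(original)

# 0xFC & 0x7F == 0x7C, which is the ASCII vertical bar.
print(original, "->", mangled)

# Pure ASCII input passes through untouched.
ascii_only = strip_eighth_bit(b"Mueller")
```

Plain ASCII survives, anything above 0x7F silently turns into some unrelated ASCII character.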

> UTF-8, map visually identical characters to the same thing. The

I have absolutely no idea: how many lines of code do you need to canonicalize
it? What does "map visually identical characters to the same thing" mean? How
difficult is that, in lines of code?
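As a rough answer to the lines-of-code question for canonicalisation alone: in a language that ships the Unicode tables, the call itself is one line. The sketch below uses Python's standard unicodedata module and assumes NFC is the canonical form wanted; the function name canonical is hypothetical. The size of the tables behind that one call is a separate matter.

```python
import unicodedata

# Two ways of storing LATIN SMALL LETTER E WITH ACUTE.
composed = "\u00e9"       # one precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def canonical(text: str) -> str:
    """Normalize to NFC (assumed here to be the canonical form of choice)."""
    return unicodedata.normalize("NFC", text)


# Raw comparison fails, comparison after normalization succeeds.
print(composed == decomposed)
print(canonical(composed) == canonical(decomposed))
```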

We already have way too many different ways of encoding superficially
similar glyphs (see phishing kersploits); we could do without Unicode here.
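To make the phishing point concrete, here is a hedged Python sketch. It shows that standard normalization does not merge look-alike letters from different scripts, so any "visually identical" mapping needs an extra table. The LOOKALIKES table and the skeleton function are hypothetical and hand-written for this example; they are not any official confusables data.

```python
import unicodedata

# Hypothetical, hand-written look-alike table (Cyrillic -> Latin).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def skeleton(text: str) -> str:
    """Normalize, then fold known look-alikes onto one representative."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(LOOKALIKES.get(ch, ch) for ch in normalized)


latin = "paypal"
spoof = "p\u0430yp\u0430l"  # both "a"s are Cyrillic

print(latin == spoof)                                   # different strings
print(unicodedata.normalize("NFKC", spoof) == latin)    # NFKC does not help
print(skeleton(spoof) == skeleton(latin))               # the extra table does
```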

> canonicalisation standard is there; the visually identical mapping will be.

I'm smelling something fishy here. But, I can't put my finger on it, because
I don't know what you're talking about.

> In office suites and client software you have to do all the funky
> complicated stuff like mapping i to LATIN CAPITAL LETTER I WITH DOT ABOVE if
> you're using Turkish and converting to upper case, and deciding which of
> the various ways of storing LATIN SMALL LETTER E WITH ACUTE you prefer. But
> you had to do that anyway.

No, thanks, I'd rather not.
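For a sense of what the Turkish case involves, a small Python sketch. Python's built-in str.upper() is locale-independent, so the Turkish rule has to be applied by hand; upper_tr is a hypothetical helper covering only the dotted/dotless i pair.

```python
def upper_tr(text: str) -> str:
    """Upper-case with the Turkish i rule (illustrative, i/dotless-i only)."""
    # i  -> LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)
    # dotless i (U+0131) -> plain I
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


default_upper = "istanbul".upper()     # locale-independent default
turkish_upper = upper_tr("istanbul")   # Turkish-aware variant

print(default_upper)
print(turkish_upper)
```

The two results differ in the first letter only, which is exactly the kind of difference that breaks case-insensitive comparisons.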
>  > I can say EUR, or USD, or JPY just fine. Most professionals use these
>  > handy TLAs, not the funky symbols.
> Depends on the context. In some contexts, the symbol is just more
> appropriate. 

I never use them, precisely because experience has taught me that these symbols
break routinely. Naive users wouldn't get these surprises if there were no umlauts on
their keyboards. All keyboards would have the same layouts, within limits.

The original design was short-sighted, but once it settled into rigor mortis we had
to live with it, for better or worse. The people who attempted to "improve"
upon the standard by circumventing its limitations did improve things at a
naive-user level, but degraded the functionality at a deeper level.

> Eh? But they're fixed. The software had a bug, it got fixed. What Justin
> said; they were biting the bullet. I don't blame them for it. 

No, they fixed one known problem. In a specific tool. A problem which wouldn't
have been there in the first place, had they not decided to change the context in which it

I can guarantee you there will be much wailing and gnashing of teeth. And
distribution forks. And holy jihads. See, we're halfway into one already.

> XEmacs and Unix are not the same world. XEmacs has had decent i18n support

It was some horrible hack, including those interconversion tools for those
four or five coding "standards" (why did they settle on just those few?) 
used for Cyrillic encoding. Guess what? It worked very well. I can't
touch-type cyrillic, so I chose a phonetic-type keymap. Within a few hours, I
was typing fluently (which saved my ass, because I was way behind on the
thesis writedown, because the advisor had some quite strange ideas on the
state of the art in computational chemistry).

> since the early nineties; the same is very very untrue of the free Unix
> world. 

The original Unix wasn't 8-bit clean. That's a design flaw; had it been 8-bit
clean from the start, no conversion would have been required. The conversion to
Unicode, if it indeed happens, fills me with a sense of impending dread.

> SMTP wasn't designed with security in mind--it was barely designed. Ditto
> DNS. We _have_ to do it after the fact. 

You. Cannot. Add. Security. To. A. System. As. An. Afterthought. If you can,
it was designed by a genius, with plug-in slots and with security in mind.

> Umlauts as vowel + e and the _deutsche Anfuehrungszeichen_ as "" don't look

I don't use "a "u and "s (as a shorthand for \"o) outside of TeX. 
I use ae oe ue and ss instead, which look perfectly acceptable to anyone but
a purist language nazi.
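The ae/oe/ue/ss convention is simple enough to write down. A minimal Python sketch, with the table GERMAN_ASCII and the function to_ascii_german being made-up names for this example:

```python
# Conventional ASCII spellings for German umlauts and sharp s.
GERMAN_ASCII = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",   # a-, o-, u-umlaut
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",   # capital umlauts
    "\u00df": "ss",                                   # sharp s
})


def to_ascii_german(text: str) -> str:
    """Replace umlauts and sharp s with their two-letter ASCII spellings."""
    return text.translate(GERMAN_ASCII)


print(to_ascii_german("Anf\u00fchrungszeichen"))
print(to_ascii_german("Stra\u00dfe"))
```

The mapping is one-way: "ue" in the output cannot be told apart from a genuine "ue" in the input.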

> like arse to you? Then, sir, I find issue with your sense of aesthetics as a
> user of German.

Had Hollerith been a linguist, maybe we wouldn't find ourselves in this sorry
mess of patching after the fact. No, thank you very much, I don't need umlaut
support in the command line and the tools.

> That's not the point of a lingua franca at all. The point of a lingua franca
> is that you're Franks and Normans and Saxons and Italians and Spanish people
> fighting a multinational religious war and you need some stupid hacky means
> of communicating with each other.

The lingua franca before French was Latin. Then, briefly, German. Now it's
English. Ditto currency standard. Ditto programming language standard. Ditto
OS standard.

> There's no need for the stupid hack. We can get it right. 

There's nothing hackish about normalizing things to a standard coding. Forth
tanked because everyone rolled their own, and it disintegrated into a
veritable Babylonian bedlam.

> Again, Western Europe isn't the issue. 

Yes, it is. *nix command line isn't broken. Let those who have the problems
fix them, or invent something else entirely. Don't patch *nix, it's broken by
design. Let people fix this by developing something new.

> [Finnish-speaking] Finns do okay in English, though not as well as the
> Germanic-speaking rest of Scandinavia. It's not easy for them, but nothing
> is--bar Estonian--so they don't dwell on it. 

I'm familiar with trivial linguistics. It illustrates, again, that background is not
a problem. Attitude is.

> If there's a perceived need, the code monkeys will bloody learn it. And, in

No code monkey will learn 20+ odd languages, just to be able to maintain an
application. I'm despairing. This can't be that hard to get, can it? 

> the grand scheme of things, educating them in it will be more efficient than
> educating four billion people in something stupid and hacky. 

Let eye/limb tracking and voice input handle the point-and-drool issues.
Please don't break a perfectly good system for no reason at all.

> I don't follow you there. You're suggesting that refusing to accept code
> commented in Bahasa Indonesian will result in the decline of Bahasa
> Indonesian as a native tongue?

No. It will enhance the command of English of technically proficient users.
Which would be a Good Thing indeed.

> Eh? I'm talking about the structure of English, which language you seemed to
> want four billion people to learn, and which language is really, really
> unintuitive to write. 

Let's see: I don't look at the keys, but the majority of keyboard users hunt
and peck. Sounds pretty intuitive to me.

And there's really nothing optional about learning a lingua franca, if you
want to communicate. China attempts a fork by sheer user base impetus and
deliberate isolationism; I'm not sure this will succeed.

If it does, I'm going to learn Han, or Mandarin. As will everyone else, see

> www.regionalofficeBeijingforthedisseminationofacademicworkanddiscussiononthemeritsofgloriousdepartedleaderMaoZeDongslittleredbook.edu.cn

Never mind the burning giraffe, but that would have been a hash into a 
distributed global filestore, and people would find it via a p2p search engine.

So, how long do you think DNS will still be around?

>  > they were not designed for, and degrade more or less gracefully when they
>  > start breaking down.
> Shite. Protocols evolve. MX records. Host: fields. NFS over TCP. 

A valid comparison would have been a change of TCP/IP. Notice that IPv6 is
similarly broken-by-design, despite a far more recent origin.

>  > Precisely not to MTAs or MUAs. If you want funny symbols in the address bar,
>  > use a plugin hiding the transcoded-into-legacy-representation from you. The
>  > legacy code base doesn't need to be touched, and is guaranteed to work
>  > because transcoding is fundamentally safe, and contained. 
> That's a change to an MUA, though, because you want to be able to use this
> domain to send email, too. 

The funny characters are completely invisible to the transport layer. Use a
plugin to render a perfectly vanilla kosher gefilte blutwurst email address.
Sheit, use an armored cryptohash of the UTF-8 string. Whatevah. 
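Both ideas can be sketched in a few lines of Python. The first pair of functions uses Python's built-in "idna" codec, which implements the 2003 IDNA rules (Punycode with an "xn--" prefix); that this is the kind of transcoding meant here is an assumption. The second, hashed_label, is a hypothetical take on the "armored cryptohash" variant, with SHA-256 and base32 chosen arbitrarily for the example.

```python
import base64
import hashlib


def to_legacy(domain: str) -> str:
    """Transcode a Unicode domain into its plain-ASCII wire form."""
    return domain.encode("idna").decode("ascii")


def from_legacy(domain: str) -> str:
    """Reverse step a display plugin would run: ASCII form back to Unicode."""
    return domain.encode("ascii").decode("idna")


def hashed_label(text: str) -> str:
    """Hypothetical alternative: ASCII-armored hash of the UTF-8 bytes."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


wire = to_legacy("b\u00fccher.example")
print(wire)                  # what the transport layer sees
print(from_legacy(wire))     # what the plugin shows the user
print(hashed_label("b\u00fccher"))
```

The Punycode form is reversible, so nothing below the plugin needs to change. The hash form is one-way and would need a lookup service to get back to the readable name.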

> Dismissal of a technical proposal based on the social skills of its author?
> What a useful, constructive approach. 

I'm not sure what's so hazy about reputation. Then, I fail to see how one would
miss the value of a single communication language, but, hey.

Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net