[FoRK] Slow grep

Aidan Kehoe kehoea at parhasard.net
Sat Mar 20 14:13:59 PST 2004

 On the 19th of March, Eugen Leitl wrote:

 > On Fri, Mar 19, 2004 at 03:10:16PM +0000, Aidan Kehoe wrote:
 > > Where? In DNS and SMTP, stop stripping the eighth bit, canonicalise the
 > I've long ago stopped expecting that mail transmission is 8-bit clean.

When did you ever have that expectation?

And, it has got _much_ better since I've been on the net, and I'm a spring

 > As I am writing this, I've got broken umlauts as a result of dos2unix conversion
 > on an index file, while (supposedly) trivially porting an intranet app from Windows to
 > Solaris.

Then your dos2unix is amateur, and stupidly broken. Fix it, or use "tr -d '\r'".
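
For what it's worth, here is roughly the same thing as that tr invocation,
sketched in Python. It assumes the file is handled as raw bytes and never
decoded; strip_cr is a name made up for this example. Since nothing is
decoded, bytes above 0x7F (umlauts, in whatever encoding) pass through
untouched.

```python
def strip_cr(data: bytes) -> bytes:
    # Delete every carriage return byte (0x0D) and leave all other bytes alone.
    return data.replace(b"\r", b"")


# Latin-1 text with DOS line endings; 0xFC is u-umlaut in Latin-1.
sample = "Grüße\r\nMüller\r\n".encode("latin-1")
cleaned = strip_cr(sample)
```

Working on bytes rather than decoded text is the design choice that matters
here: the conversion never needs to know which 8-bit encoding the file uses.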

 > I.e., 8-bit clean code is surprising. 

Not in my experience. 

 > And there are no good surprises, just bad ones. 

Again, not in my experience. Fire up Terminal in OS X; robust, useful UTF-8
handling by default. If you create a file with a non-ASCII character in the
name, its name is stored as UTF-8. If you, programmatically, create one with
an invalid UTF-8 sequence in it, it'll give an error. Slightly unexpected,
but The Right Thing. 
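
A rough way to see what "invalid UTF-8 sequence" means, as a Python sketch.
is_valid_utf8 is a hypothetical helper, and this says nothing about how OS X
itself performs the check; it only shows the kind of byte sequence that would
be rejected.

```python
def is_valid_utf8(name: bytes) -> bool:
    # A strict decode raises UnicodeDecodeError on any malformed sequence.
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "café.txt".encode("utf-8")
# Latin-1 e-acute: 0xE9 opens a multi-byte sequence, but "." cannot continue it.
bad = b"caf\xe9.txt"
```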

 > > UTF-8, map visually identical characters to the same thing. The
 > I have absolutely no idea: how many lines of code do you need to canonicalize
 > it? 

I don't know. 

 > What means "map visually identical characters to the same thing"? 

Cyrillic es to "c", Greek omicron to "o", Greek kappa to "k".
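
As a minimal sketch of the idea in Python, assuming a hand-written table that
covers only those three examples (LOOKALIKES and fold_lookalikes are names
invented here, not part of any standard):

```python
# Hand-picked table covering only the three examples above; a real mapping
# would need far more entries. Keys are code points, values are Latin letters.
LOOKALIKES = {
    0x0441: "c",  # CYRILLIC SMALL LETTER ES
    0x03BF: "o",  # GREEK SMALL LETTER OMICRON
    0x03BA: "k",  # GREEK SMALL LETTER KAPPA
}


def fold_lookalikes(label: str) -> str:
    # Replace each listed look-alike with its Latin twin; leave the rest alone.
    return label.translate(LOOKALIKES)


# Cyrillic es, two Greek omicrons, then a plain Latin "k".
spoofed = "\u0441\u03bf\u03bfk"
folded = fold_lookalikes(spoofed)
```

The spoofed label compares unequal to "cook" before folding and equal to it
afterwards, which is the property a DNS-level mapping would want.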

 > How difficult is that, in lines of code?

I don't know. I can find out, if you'd like--I didn't have anything else
planned this weekend. 

 > We already have way too many ways of different coding of superficially
 > similar ways (see phishing kersploits), we could go without Unicode here.

Then we work out a standard visual mapping so that characters that look the
same have only one value in our DNS system. 

 > > Canonicalisation standard is there; the visually identical mapping will be.
 > I'm smelling something fishy here. But, I can't put my finger on it, because
 > I don't know what you're talking about.

What, in particular, Mr. "Nivellating"? 

 > > In office suites and client software you have to do all the funky
 > > complicated stuff like mapping i to LATIN CAPITAL LETTER I WITH DOT ABOVE if
 > > you're using Turkish and converting to upper case, and deciding which of
 > > the various ways of storing LATIN SMALL LETTER E WITH ACUTE you prefer. But
 > > you had to do that anyway.
 > No, thanks, I'd rather not.

Someone has to. 
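
To give a feel for how much work the two cases quoted above involve, a short
Python sketch. It assumes NFC as the preferred stored form (that choice is
mine, for illustration), and same_text and turkish_upper are hypothetical
helpers, not library functions.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    # Compare after normalising both sides to NFC (an assumed preference).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def turkish_upper(s: str) -> str:
    # str.upper() is not locale-aware, so map i to U+0130 by hand first.
    # Dotless i (U+0131) already upper-cases to a plain "I".
    return s.replace("i", "\u0130").upper()
```

With these, same_text(composed, decomposed) holds even though the two strings
differ code point by code point, and turkish_upper("izmir") yields the dotted
capital where plain upper() would not.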

 > >  > I can say EUR, or USD, or JPY just fine. Most professionals use these
 > >  > handy TLAs, not the funky symbols.
 > > 
 > > Depends on the context. In some contexts, the symbol is just more
 > > appropriate. 
 > I never use them, precisely because experience has told me that these symbols
 > break routinely. Naive users wouldn't get these surprises if there were no umlauts on
 > their keyboards. All keyboards would have the same layouts, within limits.

If they are to use those same keyboards in local language office
applications, they need some way of typing local characters with them. 

 > The original design was short-sighted, but once it settled into rigor mortis we had
 > to live with it, for better or worse. The people who attempted to "improve"
 > upon the standard, by circumventing the limitations did improve things on a
 > naive user level, but degraded the functionality at a deeper level.

No, not in my experience, and I'm not a naive user. 

 > > Eh? But they're fixed. The software had a bug, it got fixed. What Justin
 > > said; they were biting the bullet. I don't blame them for it. 
 > No, they fixed one known problem. In a specific tool. A problem which wasn't there
 > in the first place, if they didn't decide to change the context in which it
 > existed.

Red Hat didn't change that context; Microsoft and every PC vendor in the
world did, by making the PC into something used outside Western Europe. 

 > I can guarantee you there will be much wailing and gnashing of teeth. And
 > distribution forks. And holy jihads. See, we're halfway into one already.

Whee :-) . 

 > > since the early nineties; the same is very very untrue of the free Unix
 > > world. 
 > The original Unix wasn't 8-bit clean. That's a design flaw, it wouldn't have
 > required the conversion to 8-bit cleanness. The conversion to Unicode, if it
 > indeed happens, fills me with a sense of impending dread.
 > > SMTP wasn't designed with security in mind--it was barely designed. Ditto
 > > DNS. We _have_ to do it after the fact. 
 > You. Cannot. Add. Security. To. A. System. As. An. Afterthought. If you can,
 > it was designed by a genius with plugin-slots with security in mind.

The. System. Does. Not. Have. Security. As. It. Is. Less security is not
really a problem, in that context. 

 > > Umlauts as vowel + e and the _deutsche Anfuehrungszeichen_ as "" don't
 > > look [...]
 > I don't use "a "u and "s (as a shorthand for \"o) outside of TeX. 
 > I use ae oe ue and ss instead, which look perfectly acceptable to anyone but
 > a purist language nazi.

They look like arse ... 

 > > That's not the point of a lingua franca at all. The point of a lingua franca
 > > is that you're Franks and Normans and Saxons and Italians and Spanish people
 > > fighting a multinational religious war and you need some stupid hacky means
 > > of communicating with each other.
 > The lingua franca before French was Latin. Then, briefly, German. Now it's
 > English. Ditto currency standard. Ditto programming language standard. Ditto
 > OS standard.

Ah, but you don't follow me. The original _lingua franca_ wasn't French at
all. It was a Romance pidgin used as a trade language around the
Mediterranean up until the end of the nineteenth century, and is named for
the Crusaders who brought it to prominence. 

English, and French before it--I don't admit that German was ever the
primary language of Western culture outside central Europe, shame that that
is--were and are, certainly, vehicles of communication of great
expressivity. That doesn't mean that they are or were appropriate to the
number of contexts, worldwide, that you were advocating, any more than Latin
was appropriate for bedroom conversations in the Middle Ages.

 > > There's no need for the stupid hack. We can get it right. 
 > There's nothing hackish about normalizing things to a standard coding. Forth
 > tanked because everyone rolled their own, and it disintegrated into a
 > veritable babylonic bedlam.

There is if the standard coding chosen is ambiguous, has no correspondence
to any other coding scheme out there, will break depending on the
configuration of the local machine from which you're accessing a host, and
could be replaced, with an equivalent amount of work, by one that doesn't
have these issues. 

 > > Again, Western Europe isn't the issue. 
 > Yes, it is. *nix command line isn't broken. Let those who have the problems
 > fix them, or invent something else entirely. Don't patch *nix, it's broken by
 > design. Let people fix this by developing something new.


The people who have the problems are fixing them, and you're complaining
about it, because it made your grep slower for a few months. 

 > > [Finnish-speaking] Finns do okay in English, though not as well as the
 > > Germanic-speaking rest of Scandinavia. It's not easy for them, but nothing
 > > is--bar Estonian--so they don't dwell on it. 
 > I'm familiar with trivial linguistics. It illustrates, again, that background is not
 > a problem. Attitude is.

It doesn't illustrate that at all. So, the Finns you've met spent thousands
of hours getting acquainted with English, and they don't complain about
it. If what they do for the majority of their working lives doesn't need
English--and that is true for many, many occupations--they've wasted those
hours. It's inefficient. 

 > > If there's a perceived need, the code monkeys will bloody learn it. And, in
 > No code monkey will learn 20+ odd languages, just to be able to maintain an
 > application. I'm despairing. This can't be that hard to get, can it? 

What need is there to learn 20+ odd languages to use test data? 

 > [...] Please don't break a perfectly good system for no reason at all.

a) The system isn't perfectly good. 
b) I'm not proposing to break it for no reason at all. 

 > > Eh? I'm talking about the structure of English, which language you seemed to
 > > want four billion people to learn, and which language is really, really
 > > unintuitive to write. 
 > Let's see: I don't look at the keys, but the majority of keyboard users hunts
 > and pecks. Sounds pretty intuitive to me.
 > And there's really nothing optional about learning a lingua franca, if you
 > want to communicate. China attempts a fork by sheer user base impetus and
 > deliberate isolationism; I'm not sure this will succeed.

Eh. It will. 

 > If it will, I'm going to learn Han, or Mandarin. As will everyone else,
 > see above.

For some small value of everyone. 

 > > www.regionalofficeBeijingforthedisseminationofacademicworkanddiscussiononthemeritsofgloriousdepartedleaderMaoZeDongslittleredbook.edu.cn
 > Nevermind the burning giraffe, but that would have been a hash into a 
 > distributed global filestore, and people would find it via a p2p search engine.

So no-one is going to remember URIs any more? 

 > So, how long will DNS be still around, you think?

Too long, in its current incarnation. In some form, indefinitely.

I don't care if it rains or freezes/'Long as I got my Plastic Jesus
Riding on the dashboard of my car.
